Tag: incident management
Why Your SLO Dashboard is Lying: Moving Beyond Vanity Metrics in Production
Discover how redefining service level objectives (SLOs) around business impact — not vanity uptime metrics — reduced incidents by 75% and saved $2.3M in lost revenue ...
AIOps for SRE — Using AI to Reduce On-Call Fatigue and Improve Reliability
Site reliability engineering (SRE) has grown from an emergent niche practice invented at Google into a foundation of contemporary enterprise performance worldwide. With the continued growth of microservices, multi-cloud infrastructure and continuous deployment pipelines adopted by ...
How AIOps is Revolutionizing DevOps Monitoring in the Cloud Era
As cloud-native systems grow more complex, traditional monitoring falls short. AIOps brings AI, automation, and predictive insights to DevOps—enabling real-time detection, diagnosis, and resolution across distributed environments for faster, smarter operations ...
The Breakneck Future of Codegen: Why AI SWE Must Be Matched with AI SRE
AI codegen is transforming software development — but as speed and complexity increase, so does fragility. AI for site reliability will need to keep pace to avoid system breakdown and engineer burnout. ...
Logz.io Leverages AI to Identify Anomalies in Real-Time
Logz.io added a real-time anomaly detection capability to its observability platform that simplifies correlation of the impact IT events have on business processes ...
New Relic Adds Ability to Correlate IT Events to Business Processes
New Relic Pathpoint enables DevOps teams to better understand the potential business impact of any change to an IT environment ...
SREs Say There’s Plenty of Room to Improve Incident Management
A global survey of site reliability engineers (SREs) found diagnosing issues is the most difficult aspect of incident management ...
Cloudflare Outage Outrage | Yet More FAA 5G Stupidity
In this week’s The Long View: Cloudflare suffers another huge outage while the FAA and FCC still disagree over 5G/NR near airports ...
5 Tips If You’re the First SRE Hire
Site reliability engineers (SREs) have a considerable set of tasks to juggle no matter where they work or how long their company has had an SRE practice. But if you’re the very ...
What SREs Can Learn From the Atlassian Outage of 2022
What happens when the tools and services you depend on to drive site reliability engineering turn out to be susceptible to reliability failures of their own? That’s the question teams at about ...
ServiceNow Launches Incident Management Platform Infused with Observability
ServiceNow added an incident management platform based on the Lightstep observability platform it acquired last year to its software-as-a-service (SaaS) portfolio. Ben Sigelman, general manager of Lightstep at ServiceNow, said Lightstep Incident ...
The Evolution of Incident Management
Have you ever thought about the history of incident management? If you’re an SRE, you might be so caught up in the day-to-day work of managing reliability and responding to incidents that ...

