The numbers are sobering, at best. A new global survey from New Relic pegs the median cost of a high-impact IT outage at $2 million per hour (that’s roughly $33,333 a minute) with a median annual hit of $76 million. That’s not just a bad week; it’s a major operational risk. And it’s landing just as enterprises layer in agentic and LLM-powered services that add speed and complexity to already distributed stacks.
The study canvassed 1,700 IT and engineering leaders and practitioners across 23 countries and 11 industries, offering a broad snapshot of how modern stacks behave under stress and how teams mitigate that stress.
The Advantages of Full-Stack Observability
One lever stands out: full-stack observability. New Relic defines this as visibility that spans five layers of the technology estate: infrastructure, applications and services, security monitoring, digital experience monitoring (DEM), and log management.
Among IT respondents with this end-to-end deployment, the median cost of significant outages falls by half, from $2 million to $1 million per hour. The operational signal aligns with the financial signal: only 23% of full-stack shops report weekly high-impact outages, versus 40% among those without full-stack coverage. Detection improves, too. Mean time to detection (MTTD) drops to 28 minutes, or seven minutes faster than peers lacking comprehensive visibility.
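To put those figures in per-minute terms, here is a rough back-of-envelope sketch in Python. It treats the survey's median hourly costs as if they were flat rates and assumes that detecting an issue seven minutes sooner shortens the outage by the same seven minutes. Neither assumption comes from the survey, and the function names are illustrative only.

```python
# Figures quoted from the survey summary; everything else is an assumption.
COST_PER_HOUR_WITHOUT = 2_000_000  # USD, median hourly cost without full-stack observability
COST_PER_HOUR_WITH = 1_000_000     # USD, median hourly cost with full-stack observability
MTTD_GAIN_MINUTES = 7              # full-stack shops detect issues seven minutes faster


def cost_per_minute(cost_per_hour: float) -> float:
    """Convert an hourly outage cost to a per-minute cost."""
    return cost_per_hour / 60


def detection_savings(cost_per_hour: float, minutes_saved: float) -> float:
    """Estimate the saving if an outage is shortened by `minutes_saved`.

    Assumption (not from the survey): faster detection shortens the
    outage by the same number of minutes, and cost accrues linearly.
    """
    return cost_per_minute(cost_per_hour) * minutes_saved


if __name__ == "__main__":
    for label, hourly in [
        ("without full-stack", COST_PER_HOUR_WITHOUT),
        ("with full-stack", COST_PER_HOUR_WITH),
    ]:
        per_minute = cost_per_minute(hourly)
        saved = detection_savings(hourly, MTTD_GAIN_MINUTES)
        print(f"{label}: ${per_minute:,.0f} per minute, "
              f"${saved:,.0f} for {MTTD_GAIN_MINUTES} minutes")
```

Under those assumptions, seven minutes of earlier detection is worth roughly $233,000 at the $2 million hourly rate and about $117,000 at the $1 million rate. Medians do not combine this neatly in practice, so treat the result as an order-of-magnitude illustration rather than a finding of the study.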
The causes behind those costs aren’t mysterious. The top three triggers: network failure, issues with third-party or cloud provider services, and internally deployed software changes. Yet many teams still learn about incidents the hard way. Forty-one percent of leaders say they hear about problems from customer complaints, manual checks, or ticket queues.
Meanwhile, engineers spend a remarkable 33% of their time on break-fix work rather than building features. That’s a sizable productivity tax on development progress.
Adoption of AI is reshaping the observability landscape. As LLM-powered applications and agentic systems proliferate, silent failures can ripple through pipelines, APIs, and downstream apps in ways traditional monitoring misses. On the upside, organizations are responding with AI to monitor AI: usage of AI monitoring climbed from 42% in 2024 to 54% in 2025. A mere 4% say they are not deploying, or planning to deploy, AI monitoring.
Asked which AI capabilities carry the most value for incident response, respondents ranked AI-assisted troubleshooting first, followed by automatic root-cause analysis, AI-assisted remediation, forecasting and predictive analytics, and AI-generated post-incident reviews.
The Business Case: High ROI
Observability appears to pay for itself. Since adopting it, 68% of organizations report measurable improvements in mean time to respond. Seventy-five percent cite positive ROI from observability investments. Nearly one in five (18%) report returns in the 3–10x range.
For executives, the top benefits are reduced unplanned downtime, greater operational efficiency, and lower security risk. Practitioners point to reduced alert fatigue, faster troubleshooting and root cause analysis, and smoother collaboration across teams.
Tooling strategy is also shifting. The number of observability tools per organization has fallen 27% since 2023, to a median of four. Some 52% plan to consolidate further onto a unified platform in the next 12–24 months, while 48% intend to increase investment in AIOps and machine learning capabilities. The motivation is practical: unify telemetry and standardize workflows so signals correlate quickly.
Unfortunately, none of this eliminates incidents; it just reframes them. With full-stack coverage, the study suggests, teams detect issues faster, see fewer weekly high-impact events, and cut the cost of the inevitable. In an era when customer experience is inseparable from software reliability, that combination matters.
The checklist is challenging to implement: instrument broadly across the five layers; bind logs, traces, metrics, and user experience into one view; explicitly monitor AI models and agents; and apply AI where it accelerates triage and remediation. Any company that accomplishes that level of infrastructure build-out certainly deserves fewer problems.



