A Transposit survey found the majority of respondents saw more frequent service incidents that affected their customers over the past 12 months.
What SREs Can Learn From the Atlassian Outage of 2022
What happens when the tools and services you depend on to drive site reliability engineering turn out to be susceptible to reliability failures of their own? That’s the question teams at about 400 businesses presumably asked themselves in the wake of a major outage in Atlassian Cloud. The incident offers a number of insights for […]
Leading Effective Incident Response Without Interminable Bridge Calls
There are easier ways to manage incident response without creating war rooms and packing IT staff onto bridge calls. Your phone vibrates at 11 p.m., and you know that can only mean another major incident with one of the business’ critical systems. You get geared up for the war room, dial into the bridge call […]
When IT Disaster Strikes, Part 1: Resolving Incidents
As a developer or operations team member, there is nothing quite like the dread you feel when you hear the familiar ringtone of your on-call page at 3 a.m. Being on call means that you may be contacted at any time to investigate and fix issues that arise for the system, but that doesn’t mean […]
An Outage War Room Primer
One aspect of the DevOps movement I’ve seen adopted at numerous companies is the idea that everyone supports their products by being on-call for any incidents that occur in the production environment. This responsibility often leads to participation in the outage war room. For those of you who may be new to this experience, I’m […]





