SRE Weekly Issue #24

My favorite read this week was this first article. It’s long, but it’s well worth a full read.

Articles

Got customers begging to throw money at you if only you’d let them run your SaaS in-house? John Vincent suggests you think twice before going down that road. This isn’t just a garden-variety opinion piece. Clearly John is drawing on extensive experience as he closely examines all of the many pitfalls in trying to convert a service into a reliable, sustainable, supportable on-premises product.

Firefall — Outage Post-Mortem for Wednesday February 20th, 2013

An old but excellent postmortem for an incident stemming from accidental termination of a MySQL cluster.

Thanks to logikal on hangops #incident_response for this one.

Why the Friday Grid Roll? – Second Life

Earlier this year, Linden Lab had to do an emergency grid roll on a Friday to patch the GHOST (glibc) vulnerability. April Linden (featured here previously) shares a bit on why it was necessary and how Linden handled GHOST.

Breakdown in Medication Reconciliation Leads to Inpatient Dose 16 Times Higher Than Home Dose – BWH Safety Matters

This article may be about a medication error, but this passage could have come straight from a service outage post-analysis:

For example, if the system makes it time consuming and difficult to complete safety steps, it is more likely that staff will skip these steps in an effort to meet productivity goals.

Misstep led to death of two firefighters

Having a standard incident response process is crucial. When we fail to follow it, incidents can escalate rapidly. In the case of this story from South Africa, the article alleges that the Incident Commander led a team into the fire, rather than staying outside to coordinate safety.

I believe that mistakes during incident response in my job don’t lead directly to deaths now, but how long before they do? And are my errors perhaps causing deaths indirectly even now? (Hat-tip to Courtney E. for that line of thinking.)

RCM for NA14 Disruptions of Service (Salesforce)

Salesforce published a root cause analysis for the outage last week.

Full disclosure: Salesforce (parent company of my employer, Heroku) is mentioned.

Stack Exchange Network Status — Partial Outage Postmortem – March 28th, 2016

Earlier this year, Stack Exchange suffered a short outage during a migration. The underlying issue seems to have been that the migration could not be truly tested, because the production environment (CDN and all) could not be replicated in development.

Outages

NBA 2K16
Westpac (AU bank)
iiNet (AU ISP)
WhatsApp
Iraq
- Iraq purportedly shut down its internet access (removed its BGP announcements) to prevent students from cheating on exams.
Virgin Mobile
- They offered users a data credit immediately.
Telstra
- Telstra had a long outage this week. They claim that the outage was caused by vandalism in Taree.
Datadog
- Thanks to acabrera on hangops #incident_response for this one.
Mailgun
Disney Ticketing
- Disney’s ticketing site suffered under an onslaught of traffic this week brought on by their free dining deal program. Reference: we had a heck of a time making our dining reservations.
