SRE Weekly Issue #24

My favorite read this week was this first article. It’s long, but it’s well worth a full read.

Articles

Got customers begging to throw money at you if only you’d let them run your SaaS in-house? John Vincent suggests you think twice before going down that road. This isn’t just a garden-variety opinion piece. John is clearly drawing on extensive experience as he closely examines the many pitfalls of trying to convert a service into a reliable, sustainable, supportable on-premises product.

An old but excellent postmortem for an incident stemming from accidental termination of a MySQL cluster.

Thanks to logikal on hangops #incident_response for this one.

Earlier this year, Linden Lab had to do an emergency grid roll on a Friday to patch the GHOST (glibc) vulnerability. April Linden (featured here previously) shares a bit on why it was necessary and how Linden handled GHOST.

This article may be about a medication error, but it could have come straight from a service outage post-analysis:

For example, if the system makes it time consuming and difficult to complete safety steps, it is more likely that staff will skip these steps in an effort to meet productivity goals.

Having a standard incident response process is crucial, and when we fail to follow it, incidents can escalate rapidly. In this story from South Africa, the article alleges that the Incident Commander led a team into the fire rather than staying outside to coordinate safety.

I believe that mistakes during incident response in my job don’t lead directly to deaths now, but how long until they do? And are my errors perhaps causing deaths indirectly even now? (Hat-tip to Courtney E. for that line of thinking.)

Salesforce published a root cause analysis for last week’s outage.

Full disclosure: Salesforce (parent company of my employer, Heroku) is mentioned.

Earlier this year, Stack Exchange suffered a short outage during a migration. The underlying issue seems to have been that they couldn’t truly test the migration, because the production environment (CDN and all) couldn’t be replicated in development.

Outages

  • NBA 2K16
  • Westpac (AU bank)
  • iiNet (AU ISP)
  • Whatsapp
  • Iraq
    • Iraq purportedly shut down its internet access (removed its BGP announcements) to prevent students from cheating on exams.

  • Virgin Mobile
    • They offered users a data credit immediately.

  • Telstra
    • Telstra had a long outage this week. They claim that the outage was caused by vandalism in Taree.

  • Datadog
    • Thanks to acabrera on hangops #incident_response for this one.

  • Mailgun
  • Disney Ticketing
Disney’s ticketing site suffered under an onslaught of traffic this week brought on by their free dining deal program. For reference: we had a heck of a time making our dining reservations.
