SRE Weekly Issue #69

SPONSOR MESSAGE

Incident management is essential to modern DevOps environments. Learn why in the eBook, “Making the Case for Real-time Incident Management” from your friends at VictorOps. http://try.victorops.com/realtime_incident_mgmt/SREweekly

Articles

In February of 2016, a metal hospital gurney was inadvertently wheeled* into an MRI room, resulting in a costly near-miss accident. Brigham and Women’s Hospital posted about the mishap on their Safety Matters blog and also released a Q&A with their chief quality officer about their dedication to an open and just culture.

If an employee at Brigham makes a mistake that anyone else could make, we will work on improving the system, rather than punishing the employee. We believe that in every circumstance involving “human error” there are systemic opportunities for mitigating reoccurrence.

* Yes, I used the passive voice on purpose. See what I did there?

Sometimes logs help us prevent outages or discover a cause. But raise your hand if you’ve seen logging cause an outage. Yeah, me too.

Traditionally, auditd, together with Linux’s system call auditing support, has been used as part of security monitoring. Slack developed go-audit so that they could use system call auditing as a general monitoring tool. I can think of plenty of outages during which I’d have loved to be able to query system call patterns!

Dropbox has some pretty complex needs around feature gating. Apparently existing tools couldn’t satisfy their use case, so they wrote and released a tool with sophisticated user-segmentation support.
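Just to give a flavor of what user segmentation means here (this is not Dropbox’s implementation, and every name below is made up): a gate can combine an explicit allowlist with a stable hash of the user ID, so a fixed percentage of users always lands on the same side of the gate.

# Toy feature gate, purely illustrative. Combines an explicit allowlist
# with a stable hash bucket so each user consistently falls on the same
# side of a percentage rollout.
import hashlib

def bucket(user_id: str, feature: str) -> float:
    """Map (user_id, feature) to a stable value in [0, 100)."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
    return (int(digest[:8], 16) % 10000) / 100.0

def is_enabled(user_id: str, feature: str, rollout_pct: float,
               allowlist: frozenset = frozenset()) -> bool:
    if user_id in allowlist:
        return True
    return bucket(user_id, feature) < rollout_pct

# Hypothetical example: 10% rollout of "new_sync_engine", plus an internal tester.
print(is_enabled("user-42", "new_sync_engine", 10.0, frozenset({"tester-1"})))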

Why depend on fallible QA testing to spot regressions in a web UI? Computers are so much better at that kind of thing. Niffy spots pixel-level differences between pages rendered with old and new code, so you can see exactly what regressed before putting it in front of your users.
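If you just want the flavor of pixel diffing without Niffy itself, here’s a rough sketch using Pillow. This is not Niffy’s actual code; old.png and new.png are assumed to be screenshots of the same page rendered under old and new code.

# Diff two screenshots pixel by pixel and flag any region that changed.
from PIL import Image, ImageChops

old = Image.open("old.png").convert("RGB")
new = Image.open("new.png").convert("RGB")

diff = ImageChops.difference(old, new)
bbox = diff.getbbox()  # bounding box of all changed pixels, or None if identical

if bbox is None:
    print("No visual regression detected.")
else:
    print(f"Pixels changed within region {bbox}; saving diff for review.")
    diff.save("diff.png")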

In this beautifully illustrated article, Stripe engineer Jacqueline Xu explains how Stripe safely rolled out a major database schema upgrade. Many code paths had to be refactored, and they took a methodical, incremental approach to avoid downtime. Thanks to this article, I now know about Scientist and can’t wait to use it.

Speaking of Stripe, they have another polished post on how and why to add load shedding to your API.
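The post is worth reading in full, but the core move is simple: cap the amount of work you accept and reject the overflow fast, instead of letting everything slow down together. Here’s a toy concurrency-based shedder, not Stripe’s implementation; the limit of 100 in-flight requests is an arbitrary number for illustration.

import threading

class LoadShedder:
    def __init__(self, max_in_flight: int = 100):
        self._slots = threading.Semaphore(max_in_flight)

    def handle(self, request_fn, *args, **kwargs):
        # Non-blocking acquire: if we're at capacity, shed the request.
        if not self._slots.acquire(blocking=False):
            return {"status": 503, "body": "overloaded, please retry"}
        try:
            return request_fn(*args, **kwargs)
        finally:
            self._slots.release()

shedder = LoadShedder(max_in_flight=100)
# result = shedder.handle(process_payment, payment_request)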

Scientist is such an awesome idea: run a new code path alongside the old one and check whether the two produce the same result. It only ever returns the old code path’s result, so you can prove to yourself that the new code path is safe before exposing users to it.
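Scientist itself is a Ruby gem, but the pattern is easy to sketch in any language. Something like this, with hypothetical names and plain logging standing in for Scientist’s real publishing hooks:

import logging

def experiment(name, control, candidate, *args, **kwargs):
    # Run the old (control) path and the new (candidate) path side by side.
    control_result = control(*args, **kwargs)
    try:
        candidate_result = candidate(*args, **kwargs)
        if candidate_result != control_result:
            logging.warning("%s: mismatch (old=%r, new=%r)",
                            name, control_result, candidate_result)
    except Exception:
        logging.exception("%s: candidate raised", name)
    return control_result  # callers only ever see the old behavior

# Example: compare a legacy lookup with a rewritten one on live traffic.
# value = experiment("user-lookup", legacy_lookup, new_lookup, user_id)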

I’m including this article at least in part due to its mention of the February S3 outage. AWS had difficulty reporting the outage on its status site because of a dependency on S3.

Conway’s Law is extremely important to us as SREs. As we can see in the example of Sprouter, a poor organizational structure can produce unreliable software. My fellow SRE, Courtney Eckhardt, loves to say, “My job is applying Conway’s Law in reverse.”

Outages

  • AT&T VoIP
    • I received an anonymous anecdote from an SRE Weekly reader (thanks!) that this affected at least one hospital, significantly hampering critical phone communication. What happened to the good old mostly-reliable traditional phone system? Irony: in the reader’s case, the announcement about the failure was sent out via email.
  • Three
    • This is the second case this year of a telecom outage resulting in SMS messages being delivered to the wrong people. Am I the only one who finds this an extremely surprising and concerning failure mode?
  • eBay
  • Red Hat