SRE Weekly Issue #69

Articles

The tale of the flying gurney: MRI machines can turn objects into projectiles

In February of 2016, a metal hospital gurney was inadvertently wheeled* into an MRI room, resulting in a costly near-miss accident. Brigham and Women’s Hospital posted about the mishap on their Safety Matters blog and also released a Q&A with their chief quality officer about their dedication to an open and just culture.

If an employee at Brigham makes a mistake that anyone else could make, we will work on improving the system, rather than punishing the employee. We believe that in every circumstance involving “human error” there are systemic opportunities for mitigating reoccurrence.

* Yes, I used the passive voice on purpose. See what I did there?

Lies My Parents Told Me (About Logs)

Sometimes logs help us prevent outages or discover a cause. But raise your hand if you’ve seen logging cause an outage. Yeah, me too.

Syscall Auditing at Scale

Traditionally, auditd, together with Linux’s system call auditing support, has been used as part of security monitoring. Slack developed go-audit so that they could use system call auditing as a general monitoring tool. I can think of plenty of outages during which I’d have loved to be able to query system call patterns!

Introducing Stormcrow

Dropbox has some pretty complex needs around feature gating. Apparently existing tools couldn’t satisfy their use case so they wrote and released a tool with sophisticated user segmentation support.

Niffy: Perceptual Diffing to Catch Invisible Bugs

Why depend on fallible QA testing to spot regressions in a web UI? Computers are so much better at that kind of thing. Niffy spots the pixel changes between old and new code so you can see exactly what regressed before putting it in front of your users.

Online migrations at scale

In this beautifully-illustrated article, Stripe engineer Jacqueline Xu explains how Stripe safely rolled out a major database schema upgrade. Many code paths had to be refactored, and they took a methodical, incremental approach to avoid downtime. Thanks to this article, I now know about Scientist and can’t wait to use it.

Scaling your API with rate limiters

Speaking of Stripe, they have another polished post on how and why to add load shedding to your API.

GitHub – github/scientist: A Ruby library for carefully refactoring critical paths.

Scientist is such an awesome idea. The idea is to try out a new code path and see if it produces the same result as the old code path. It only returns the new code path, so you know you can safely prove to yourself whether the new code path is safe before exposing users to it.

AWS status: The complete guide to monitoring status on the web’s largest cloud provider

I’m including this article at least in part due to its mention of the February S3 outage. AWS had difficulty reporting the outage on its status site because of a dependency on S3.

CASE STUDY: Etsy, Sprouter and Conway’s Law

Conway’s Law is extremely important to us as SREs. As we can see in the example of Sprouter, a poor organizational structure can produce unreliable software. My fellow SRE, Courtney Eckhardt, loves to say, “My job is applying Conway’s Law in reverse.”

Outages

AT&T VoIP
- I received an anonymous anecdote from an SRE Weekly reader (thanks!) that this affected at least one hospital, with the result that critical phone communication was significantly hampered. What happened to the good old mostly-reliable traditional phone system? Irony: in the reader’s case, an announcement about the failure was sent out via email.
Three
- This is the second case this year of a telecom outage resulting in SMSes being delivered to the wrong people. Am I the only one that finds this an extremely surprising and concerning failure mode?
eBay
Red Hat

SRE Weekly Issue #69

Articles

Outages

Subscribe

RSS

Mastodon

Search Issues

SPONSOR MESSAGE

Articles

Outages

Subscribe

RSS

Mastodon

Search Issues