SRE Weekly Issue #356

Thanks to all of you who took the time to share your ideas about choosing incidents to investigate! I got some great answers and I’m looking forward to pulling them together into an article.

I decided to give this GPT-3 thing a spin. It turns out that it absolutely can assemble a newsletter with links to the week’s top SRE stories, each with a short description. It even includes authors. The authors are even real people. The URLs, though… well, they look real, but they’re mysteriously all 404s, and the articles don’t actually exist. Guess you’re stuck with me for now!

A message from our sponsor, Rootly:

Manage incidents directly from Slack with Rootly 🚒.

Rootly automates manual tasks like creating an incident channel, Jira ticket and Zoom rooms, inviting responders, creating statuspage updates, postmortem timelines and more. Want to see why companies like Canva and Grammarly love us?

https://rootly.com/demo/

Articles

This article takes the idea of “internal customers” to its logical conclusion, by treating the platform in the same way as a startup company.

  Adam Buggia — Sym

This article uses nifty probability formulas to show that blaming an engineer for an incident may well result in diminished reliability and efficiency.

  Dan Slimmon

Here’s a report on the CircleCI security incident at the start of the year. There’s some good stuff in there about not blaming the specific engineer whose device was attacked.

  Rob Zuber — CircleCI

A hot take on how not to measure your incident response process.

  Fred Hebert — Honeycomb
  Full disclosure: Honeycomb is my employer.

eBay’s notification platform team built a fault-tolerant, resilient system by injecting faults at the application level.

  Wei Chen — eBay

This one succinctly sums up why I haven’t covered the NOTAM outage much yet.

If a small mistake was sufficient to take down a complex system, then our systems would be crashing all of the time.

  Lorin Hochstein

Don’t you love when merely running strace fixes the problem?

  Oren Eini

This air accident seems on its face to be a clear-cut story of negligence. There’s far more to it, and the author goes into detail on why blaming the captain can damage air safety industry-wide.

  Admiral Cloudberg

Updated: January 22, 2023 — 9:04 pm
A production of Tinker Tinker Tinker, LLC