
SRE Weekly Issue #353

A message from our sponsor, Rootly:

Manage incidents directly from Slack with Rootly 🚒.

Rootly automates manual tasks like creating an incident channel, Jira ticket and Zoom rooms, inviting responders, creating statuspage updates, postmortem timelines and more. Want to see why companies like Canva and Grammarly love us?:

https://rootly.com/demo/

Articles

This article contains:

two reasons why site reliability engineers may be part of IT teams for years to come, and two reasons why site reliability engineering may turn out just to be a fad.

  Christopher Tozzi — ITPro Today

This article proposes an interesting method for incident investigation: constantly try to disprove your hypotheses to avoid confirmation bias.

  Ivan Merill — Fiberplane

How I’ve managed to run this newsletter for almost 7 years without a single mention of the Therac-25 incidents is beyond me. Therac-25 is an important lesson for all of us as we design systems and analyze incidents.

  Adam Fabio — Hackaday

Even though this happened 14 years ago, the cause is still very relevant today. If two bit-flips in the same TCP packet cancel each other out, the packet will still pass the checksum.

  Poppy Linden — Linden Lab
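
To illustrate why offsetting bit-flips slip past the checksum, here is a minimal sketch of the 16-bit ones' complement sum that TCP uses. It is not taken from the linked article: the sample bytes and the helper names `internet_checksum` and `flip_bit` are made up for illustration, and the real TCP checksum also covers a pseudo-header, which is omitted here.

```python
def internet_checksum(data: bytes) -> int:
    """Return the 16-bit ones' complement checksum of data (RFC 1071 style)."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]  # add each big-endian 16-bit word
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return ~total & 0xFFFF


def flip_bit(data: bytes, byte_index: int, bit: int) -> bytes:
    """Return a copy of data with one bit inverted (hypothetical helper)."""
    corrupted = bytearray(data)
    corrupted[byte_index] ^= 1 << bit
    return bytes(corrupted)


# Hypothetical payload: two 16-bit words, 0x1234 and 0x5679.
original = bytes([0x12, 0x34, 0x56, 0x79])

# One flip changes the sum, so the checksum catches it.
single_flip = flip_bit(original, 1, 0)     # 0x1234 -> 0x1235

# A second flip at the same bit position of another word, in the
# opposite direction (1 -> 0), cancels the first one in the sum.
double_flip = flip_bit(single_flip, 3, 0)  # 0x5679 -> 0x5678

print(internet_checksum(single_flip) == internet_checksum(original))
print(internet_checksum(double_flip) == internet_checksum(original))
```

The pair of flips only goes undetected because they land on the same bit position in two different 16-bit words and point in opposite directions. Two arbitrary flips would usually still be caught.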

This article proposes two criteria: Actionability and Investigability.

  Dan Slimmon

This incident write-up chronicles an incident in which a poison pill message repeatedly crashed their Heroku app.

  Lawrence Jones — incident.io

Take this one with a grain of salt since there’s a fair bit of counterfactual reasoning in the description. Nevertheless there’s a lot to learn from this and Wikipedia’s article on the same accident.

  Admiral Cloudberg

SRE Weekly Issue #352

A message from our sponsor, Rootly:

Manage incidents directly from Slack with Rootly 🚒.

Rootly automates manual tasks like creating an incident channel, Jira ticket and Zoom rooms, inviting responders, creating statuspage updates, postmortem timelines and more. Want to see why companies like Canva and Grammarly love us?:

https://rootly.com/demo/

Articles

Incident duration and severity are not related, and we have the in-depth data to prove it.

It’s time for another VOID report! I’m glad this project is still going strong.

  Courtney Nash — Verica

I haven’t been paying attention to the recent attempts to legislate cloud provider reliability, and this article was a great catch-up. There’s a lot going on here.

  Jeff Martens — Metrist

I’m still trying to figure out how I feel about this one, but I’m definitely glad I read it.

  Fred Hebert

FireHydrant published this report with statistics from over 50,000 incidents experienced by their customers.

  FireHydrant

Want to get a solid understanding of how the Linux shells work, including file descriptors, process management, and sessions? This one goes really deep with lots of example programs.

  Viacheslav Biriukov

Check it out, Google search finally has a proper status page!

  Google

It’s one of those “awesome ___” repos on GitHub, this time for resources about writing SLOs.

  Steve Azzopardi (@steveazz)

If you’re going to classify incidents by “root cause”, try these on for size: “production pressure”, “goal conflicts”, and more in this article.

  Lorin Hochstein

Sure, the pilots were engaging in an activity that could be considered dubious. But what’s really worth digging into in this air accident is how surprise may have led them to forget their training on how to recover stable flight.

More on the same accident:

  Admiral Cloudberg

SRE Weekly Issue #351

Seven years ago, I was busy pulling together content for the first several issues of SRE Weekly. Since then, I estimate that I’ve consumed over 6000 articles in my quest to curate content each week, most of them via text-to-speech. You all make it worthwhile! Thank you so much for reading, and thanks to all of the great authors out there for writing awesome articles. Here’s to another great year!

A message from our sponsor, Rootly:

Manage incidents directly from Slack with Rootly 🚒.

Rootly automates manual tasks like creating an incident channel, Jira ticket and Zoom rooms, inviting responders, creating statuspage updates, postmortem timelines and more. Want to see why companies like Canva and Grammarly love us?:

https://rootly.com/demo/

Articles

In this interview, Tammy Butow goes into detail on what it’s like being on call and how she improved a team’s horrible on-call burden by a factor of 10.

  Elena Boroda — Fiberplane

Do you need just one or two SREs? Or should you build a sprawling SRE team, with a dozen or more SREs on hand to support your organization’s reliability needs?

  JJ Tang — Rootly
  This article is published by my sponsor, Rootly, but their sponsorship did not influence its inclusion in this issue.

An unsanctioned (but not unheard of) action, a race condition, and multiple known design issues all contributed to this air accident.

  Admiral Cloudberg

A first-hand account of one way to handle DR in this reddit post. Worth reading through to the end.

  u/disasterrecoverywhat — reddit

Rackspace’s Hosted Microsoft Exchange offering has been down for over a week, and they’re assisting (and paying for) customers to move to Microsoft 365.

  Roger Montti — Search Engine Journal

It’s a good idea to leave yourself a safety hatch to administer your system when everything’s gone to heck… otherwise you might have to break out the angle grinders.

  Oren Eini — Hibernating Rhinos

This intriguing debugging story also sheds some light on how Honeycomb’s custom-built columnar data store works.

  Paul Osman — Honeycomb
  Full disclosure: Honeycomb is my employer.

Tons of incredibly good advice in this infographic + article on debugging.

  Julia Evans

SRE Weekly Issue #350

A message from our sponsor, Rootly:

Manage incidents directly from Slack with Rootly 🚒.

Rootly automates manual tasks like creating an incident channel, Jira ticket and Zoom rooms, inviting responders, creating statuspage updates, postmortem timelines and more. Want to see why companies like Canva and Grammarly love us?:

https://rootly.com/demo/

Articles

Here’s what happens when you give an SRE access to an AI copy writer.

  quercy

This episode of the DisasterCast podcast discusses designing a car such that when it fails, it is likely that the human can react instinctively to make the accident less severe.

  Drew Rae

Here’s a detailed followup for a Buildkite incident last month.

  Buildkite

Does “Incident Commander” make sense, or would a better term be “Response Conductor”?

  Matt Davis

Can emoji during incident response improve shared understanding?

  Will Gallego — Jeli

This is cool: the Compressed Log Processor can search compressed logs without uncompressing them.

  Jack (Yu) Luo and Devesh Agrawal

If you enjoy performance engineering and peeling back abstraction layers to ask underlying subsystems to explain themselves, this article’s for you.

  Matt Smiley — GitLab

Balancing holiday cheer and on-call rotations for one is tricky, but take it from me — two pagers under one roof is madness!

  Paige Cruz — Chronosphere

A production of Tinker Tinker Tinker, LLC