SRE Weekly Issue #351

Seven years ago, I was busy pulling together content for the first several issues of SRE Weekly. Since then, I estimate that I’ve consumed over 6000 articles in my quest to curate content each week, most of them via text-to-speech. You all make it worthwhile! Thank you so much for reading, and thanks to all of the great authors out there for writing awesome articles. Here’s to another great year!

A message from our sponsor, Rootly:

Manage incidents directly from Slack with Rootly 🚒.

Rootly automates manual tasks like creating an incident channel, Jira ticket and Zoom rooms, inviting responders, creating statuspage updates, postmortem timelines and more. Want to see why companies like Canva and Grammarly love us?:

https://rootly.com/demo/

Articles

In this interview, Tammy Butow goes into detail on what it’s like being on call and how she improved a team’s horrible on-call burden by a factor of 10.

  Elena Boroda — Fiberplane

Do you need just one or two SREs? Or should you build a sprawling SRE team, with a dozen or more SREs on hand to support your organization’s reliability needs?

  JJ Tang — Rootly
  This article is published by my sponsor, Rootly, but their sponsorship did not influence its inclusion in this issue.

An unsanctioned (but not unheard of) action, a race condition, and multiple known design issues all contributed to this air accident.

  Admiral Cloudberg

A first-hand account of one way to handle DR in this reddit post. Worth reading through to the end.

  u/disasterrecoverywhat — reddit

Rackspace’s Hosted Microsoft Exchange offering has been down for over a week, and the company is helping customers move to Microsoft 365 (and paying for it).

  Roger Montti — Search Engine Journal

It’s a good idea to leave yourself a safety hatch to administer your system when everything’s gone to heck… otherwise you might have to break out the angle grinders.

  Oren Eini — Hibernating Rhinos

This intriguing debugging story also sheds some light on how Honeycomb’s custom-built columnar data store works.

  Paul Osman — Honeycomb
  Full disclosure: Honeycomb is my employer.

Tons of incredibly good advice in this infographic + article on debugging.

  Julia Evans

SRE Weekly Issue #350

Articles

Here’s what happens when you give an SRE access to an AI copywriter.

  quercy

This episode of the DisasterCast podcast discusses designing a car such that when it fails, it is likely that the human can react instinctively to make the accident less severe.

  Drew Rae

Here’s a detailed follow-up for a Buildkite incident last month.

  Buildkite

Does “Incident Commander” make sense, or would a better term be “Response Conductor”?

  Matt Davis

Can emoji during incident response improve shared understanding?

  Will Gallego — Jeli

This is cool: the Compressed Log Processor can search compressed logs without uncompressing them.

  Jack (Yu) Luo and Devesh Agrawal

If you enjoy performance engineering and peeling back abstraction layers to ask underlying subsystems to explain themselves, this article’s for you.

  Matt Smiley — GitLab

Balancing holiday cheer and on-call rotations for one is tricky, but take it from me — two pagers under one roof is madness!

  Paige Cruz — Chronosphere

SRE Weekly Issue #349

Articles

This is a new spin on dependency-based attacks that has interesting implications for reliability.

  Christoph Treude — The Conversation

In a counterpoint to the articles I linked to last week, this engineer expects that the Twitter infrastructure will keep trucking on for a while due to automation and redundancy.

  Rory Bathgate — ITPro

When folks use blaming language, bring up counterfactuals, or exhibit other suboptimal behaviors in a retrospective, what’s a good way to respond, and what doesn’t work as well?

  Fred Hebert

Cloudflare uses a novel approach to make the most out of a limited number of IPv4 addresses for outgoing traffic: “soft-unicast”.

  Marek Majkowski — Cloudflare

After attending my first retrospective at Honeycomb, I wrote this article about how they establish expectations and shared context at the start of the meeting.

  Lex Neva — Honeycomb

Pilot incapacitation, an argument, and a broader rift between cohorts of pilots were just some of the many contributing factors in this air accident. In response to this 1972 accident, the UK mandated cockpit voice recorders on all commercial flights.

  Admiral Cloudberg

SRE Weekly Issue #348

Articles

Here’s a good intro to creating SLOs, including a section on best practices.

  Cortex

When they started to get complaints from customers, they knew it was time to get serious about measuring and monitoring their reliability.

  arun — Reputation

As an SRE and sysadmin with 10+ years of industry experience, I wanted to write up a few scenarios that are real threats to the integrity of the bird site over the coming weeks.

What follows is a thread with tens of realistic failure scenarios, many of which apply not just to Twitter.

  @MosquitoCapital on Twitter

A few amusing anecdotes reveal deeper lessons in SRE.

  David Cassel — The New Stack

A resilient system like Twitter isn’t likely to go down instantly just because of a few changes. It’s much more likely to slowly degrade, per this article.

  Christopher Carbone — Daily Mail

It’s really interesting to see where this write-up differs from a video summary of the same accident by Mentour Pilot. Given the differences, I wonder if there are even more details that both left out?

  Admiral Cloudberg

This is a really great description of common ground breakdown, referencing Woods and Klein.

  Dan Slimmon

SRE Weekly Issue #347

Articles

Check it out, a conference from the Learning From Incidents people!

Echoing Bainbridge’s Ironies of Automation, this article discusses the potential dangers of over-automation, using an air accident as a case study. I hadn’t been aware of the term “Children of the Magenta” before.

   Katie Mingle — 99% Invisible

There’s more to it than just hacking together some Slack workflows.

   Ryan McDonald — FireHydrant

Honeycomb doesn’t do its SLOs “by the book”.

The way Honeycomb defines SLOs is radically different from what I expected. Instead of the definitions I wrote about at the beginning of this post, I saw:

  Reid Savage — Honeycomb
  Full disclosure: Honeycomb is my employer.

An anonymous Twitter engineer talks about what’s going on over there and how they think it might play out.

  Chris Stokel-Walker — MIT Technology Review

They rolled out automated rollbacks across a complex infrastructure, and in this article, they share the lessons they learned in the process.

  Will Sewell and Joseph Pallamidessi — Monzo

Okay. Here’s the Important Thing:

As you approach maximum throughput, average queue size – and therefore average wait time – approaches infinity.

  Dan Slimmon
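
To put rough numbers on that claim, here is a minimal sketch. It assumes an M/M/1 queue (random arrivals, exponentially distributed service times, a single server), which is a modelling assumption and not something the quoted passage specifies. The function name `mm1_stats` and the service rate of 100 requests per second are hypothetical and only there for illustration.

```python
# Assumption: an M/M/1 queue (Poisson arrivals, exponential service times,
# one server). The quoted passage does not name a model; this one is used
# only because its closed-form results are simple.

def mm1_stats(arrival_rate: float, service_rate: float) -> tuple[float, float]:
    """Return (mean queue length, mean wait in queue) for an M/M/1 queue."""
    if not 0 <= arrival_rate < service_rate:
        raise ValueError("need 0 <= arrival_rate < service_rate")
    utilization = arrival_rate / service_rate
    # Standard M/M/1 results: Lq = rho^2 / (1 - rho), Wq = rho / (mu - lambda).
    mean_queue_length = utilization ** 2 / (1 - utilization)
    mean_wait = utilization / (service_rate - arrival_rate)
    return mean_queue_length, mean_wait


if __name__ == "__main__":
    service_rate = 100.0  # hypothetical: requests per second the server can handle
    for utilization in (0.5, 0.8, 0.9, 0.99, 0.999):
        lq, wq = mm1_stats(utilization * service_rate, service_rate)
        print(f"utilization {utilization:>5}: queue {lq:8.1f}, wait {wq * 1000:8.1f} ms")
```

Under these assumptions, both figures are divided by the remaining headroom (1 minus utilization), so each step closer to full throughput costs more than the previous one, and the values grow without bound as utilization approaches 1.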

It was not clear to the pilots that the fuel estimation system was not designed to be used in the way they were using it.

  Admiral Cloudberg

As is usually the case with air accidents, the crash of Air Florida flight 90 did not have a single cause. In fact, the accident was the result of the confluence of two proximate factors, each of which was itself the culmination of a long chain of errors.

  Admiral Cloudberg

A production of Tinker Tinker Tinker, LLC