General

SRE Weekly Issue #345

SRE Weekly is now on Mastodon at @SREWeekly@social.linux.pizza! Follow to get notified of each new issue as it comes out.

This replaces the Twitter account @SREWeekly, which I am now retiring in favor of Mastodon. For those of you following @SREWeekly on Twitter, you’ll need to choose a different way to get notified of new issues. If Mastodon isn’t your jam, try RSS or a straight email subscription (by filling out the form at sreweekly.com).

A message from our sponsor, Rootly:

Manage incidents directly from Slack with Rootly 🚒.

Rootly automates manual tasks like creating an incident channel, Jira ticket and Zoom rooms, inviting responders, creating statuspage updates, postmortem timelines and more. Want to see why companies like Canva and Grammarly love us?:

https://rootly.com/demo/

Articles

Don’t beat yourself up! This is like another form of blamelessness.

  Robert Ross — FireHydrant + The New Stack

In this article, I will share with you how setting up passive guardrails in and around developer workflows can reduce the frequency and severity of incidents and outages.

  Ash Patel — SREPath

This conference talk summary outlines the three main lessons Jason Cox learned as director of SRE at Disney.

  Shaaron A Alvares — InfoQ

Here’s a look at how Meta has structured its Production Engineer role, their name for SREs.

  Jason Kalich — Meta

Bit-flips caused by cosmic rays seem incredibly rare, but they become more likely as we make circuits smaller and our infrastructures larger.

  Chris Baraniuk — BBC
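
To get a feel for the scale effect, here's a rough back-of-the-envelope sketch in Python. The failure rate is invented purely for illustration and isn't taken from the article.

    # All rates are made up: the point is only that a vanishingly rare
    # per-machine event becomes routine once the fleet is large enough.
    p_flip_per_machine_day = 1e-5   # hypothetical daily chance of an uncorrected bit flip
    fleet_sizes = [100, 10_000, 1_000_000]

    for n in fleet_sizes:
        expected_per_day = n * p_flip_per_machine_day
        print(f"{n:>9} machines -> ~{expected_per_day:g} expected flips per day")

At that made-up rate, a hundred machines would see roughly one flip every three years, while a million-machine fleet would see about ten per day.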

Cloudflare shares details about their 87-minute partial outage this past Tuesday.

  John Graham-Cumming — Cloudflare

In reaction to a major outage, these folks revamped their alerting and incident response systems. Here’s what they changed.

  Vivek Aggarwal — Razorpay

The author of this post sought to test a simple algorithm from a research paper that purported to reduce tail latency. Yay for independent verification!

  Marc Brooker

SRE Weekly Issue #344

A message from our sponsor, Rootly:

Manage incidents directly from Slack with Rootly 🚒.

Rootly automates manual tasks like creating an incident channel, Jira ticket and Zoom rooms, inviting responders, creating statuspage updates, postmortem timelines and more. Want to see why companies like Canva and Grammarly love us?:

https://rootly.com/demo/

Articles

In this story of SLOs gone bad, error budgets and code freezes provided a perverse incentive that caused a great deal of harm.

  dobbse.net

This article seeks to apply SRE principles to security in the form of a Threat Budget.

  Jason Bloomberg — Intellyx

After talking to hundreds of engineers about their processes, we’ve identified five of the most common challenges we see across companies looking to put more structure behind how they manage their incidents.

  Mike Lacsamana — FireHydrant

The Analysis section has a lot of important lessons. What really stands out in this incident review is that Honeycomb plainly states that they don’t yet know what went wrong, and explains why not.

  Fred Hebert — Honeycomb
  Full disclosure: Honeycomb is my employer.

several small staging clusters, each fit for its purpose, offer a more maintainable, cheaper alternative.

  Tyler Cipriani

I’m really enjoying the Admiral Cloudberg series of aircraft accident investigation reports. How did I not know about these before??

A lot has improved in aviation safety since this crash in 1967, but there’s still a lot we can learn in SRE even now. For example: the operator’s view into the system should make the result of their inputs clear.

  Admiral Cloudberg

An unannounced (maybe inadvertent?) breaking change in an Azure API caused an outage. Here’s the story of the investigation.

  Nikko Campbell — Metrist

Another Admiral Cloudberg air accident investigation, this time showing how easily critical details can slip through the cracks.

  Admiral Cloudberg

SRE Weekly Issue #343

Bit of a short one this week as I recover from my third bout of COVID. Fortunately, this is another relatively mild one (thank you, vaccine!). Good luck everyone, and get your boosters.

A message from our sponsor, Rootly:

Manage incidents directly from Slack with Rootly 🚒.

Rootly automates manual tasks like creating an incident channel, Jira ticket and Zoom rooms, inviting responders, creating statuspage updates, postmortem timelines and more. Want to see why companies like Canva and Grammarly love us?:

https://rootly.com/demo/

Articles

This article explores the advantages of powering SLOs with observability data.

  Pierre Tessier — Honeycomb
  Full disclosure: Honeycomb is my employer.

As the James Webb Space Telescope moves into normal operations, there are more great SRE lessons to be learned.

  Jennifer Riggins — The New Stack

Over five years of working as an SRE, the author of this article gathered a set of best-practice patterns for software development and operation, which they share with us.

  brandon willett

How Airbnb built a persistent, high-availability, low-latency key-value storage engine for accessing derived data from offline and streaming events.

  Chandramouli Rangarajan, Shouyan Guo, Yuxi Jin — Airbnb

By owning and reporting MTTR, teams have no choice but to be accountable for the reliability of the code they write. This dramatically changes the culture of engineering.

  Sidu Ponnappa — Last9

I learned about plan continuation bias while reading this air accident report, and I’m certain I’ve experienced this during incidents I’ve been involved in.

  Admiral Cloudberg

SRE Weekly Issue #342

A message from our sponsor, Rootly:

Manage incidents directly from Slack with Rootly 🚒.

Rootly automates manual tasks like creating an incident channel, Jira ticket and Zoom rooms, inviting responders, creating statuspage updates, postmortem timelines and more. Want to see why companies like Canva and Grammarly love us?:

https://rootly.com/demo/

Articles

As a television broadcaster, how do I ensure that my channels are playing out the right thing for my viewers?

This is SRE applied to TV broadcasting: they replaced human monitoring of screens with an automated system.

  Jeremy Blythe — evertz.io
  Full disclosure: Honeycomb, my employer, is mentioned.

An interview with an engineer about on-call practices, training folks for on-call, and chaos engineering.

  Elena Boroda — Fiberplane

SRE: totally defined. Time for a reorg, and with a catchy tune!

  Forrest Brazeal

Great advice for incident response, backed up by real-world anecdotes.

  Audrey Simonne — DZone

There’s a lot to learn from in this air accident. A chilling example: several quirks of the plane’s automation combined to effectively tell the pilot to continue pushing the plane to stall.

  Admiral Cloudberg

When sharding a database, if transactions can span shards, then it can be very difficult to reason about the system’s maximum throughput.

For example, splitting a single-node database in half could lead to worse performance than the original system.

  Marc Brooker
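
As a rough illustration of the point (a toy model with invented numbers, not taken from the post): once transactions can span shards, each cross-shard transaction consumes capacity on both shards plus coordination overhead, so splitting can reduce total throughput.

    # Toy throughput model. Each of the two shards has the same capacity as the
    # original single node; the cross-shard fraction and coordination cost are
    # invented for illustration.
    SHARD_CAPACITY = 100          # work units per second per shard
    CROSS_SHARD_FRACTION = 0.5    # transactions that touch both shards
    COORDINATION_COST = 1.0       # extra units per shard for cross-shard commit

    # Average work a single shard does per system-wide transaction:
    # single-shard transactions land on one shard half the time; cross-shard
    # transactions cost (1 + coordination) on *both* shards.
    work_per_txn_per_shard = (
        (1 - CROSS_SHARD_FRACTION) * 0.5 * 1
        + CROSS_SHARD_FRACTION * (1 + COORDINATION_COST)
    )

    single_node_throughput = SHARD_CAPACITY / 1                  # 100 txn/s
    two_shard_throughput = SHARD_CAPACITY / work_per_txn_per_shard

    print(f"single node: {single_node_throughput:.0f} txn/s")
    print(f"two shards:  {two_shard_throughput:.0f} txn/s")      # 80 txn/s

With these numbers, two machines, each as powerful as the original, deliver less throughput than the single node did, because half the transactions now pay the coordination tax on both shards.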

Through Ubuntu’s unattended-upgrades system, a systemd update was installed that broke systemd-resolved, which in turn broke GitHub Codespaces. The systemd bug report they link to is also well worth a read.

  Jakub Oleksy — GitHub

Why not?

we’re, unfortunately, too good at explaining away failures without making any changes to our priors.

  Lorin Hochstein

SRE Weekly Issue #341

A message from our sponsor, Rootly:

Manage incidents directly from Slack with Rootly 🚒.

Rootly automates manual tasks like creating an incident channel, Jira ticket and Zoom rooms, inviting responders, creating statuspage updates, postmortem timelines and more. Want to see why companies like Canva and Grammarly love us?:

https://rootly.com/demo/

Articles

My coworkers referred to a system “going metastable”, and when I asked what that was, they pointed me to this awesome paper.

Metastable failures occur in open systems with an uncontrolled source of load where a trigger causes the system to enter a bad state that persists even when the trigger is removed.

  Nathan Bronson, Aleksey Charapko, Abutalib Aghayev, and Timothy Zhu
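
Here’s a minimal discrete-time sketch of that dynamic, with numbers invented for illustration rather than taken from the paper: retries amplify the offered load, so the overload outlives the spike that started it.

    CAPACITY = 100      # requests the service can complete per tick
    BASE_LOAD = 80      # new requests arriving per tick, comfortably below capacity
    RETRY_FACTOR = 2    # each failed request comes back as this many retries
    MAX_PENDING = 400   # clients eventually give up, capping the retry backlog

    backlog = 0
    for tick in range(20):
        trigger = 150 if 5 <= tick < 8 else 0        # temporary load spike
        offered = BASE_LOAD + trigger + backlog
        failed = max(0, offered - CAPACITY)
        backlog = min(failed * RETRY_FACTOR, MAX_PENDING)
        print(f"tick={tick:2d} trigger={trigger:3d} offered={offered:4d} failed={failed:4d}")

After the spike ends, new requests arrive well below capacity, yet the retry backlog keeps the offered load pinned above the limit indefinitely: exactly the kind of bad state that persists once the trigger is removed.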

Honeycomb posted this incident report involving a service hitting the open file descriptors limit.

  Honeycomb
  Full disclosure: Honeycomb is my employer.

Lots of interesting answers to this one, especially when someone uttered the phrase:

engineers should not be on call

  u/infomaniac89 and others — reddit

A misbehaving internal Google service overloaded Cloud Filestore, exceeding its global request limit and effectively DoSing customers.

  Google

An in-depth look at how Adobe improved its on-call experience. They used a deliberate plan to change their team’s on-call habits for the better.

  Bianca Costache — Adobe

This one contains an interesting observation: they found that outages caused by cloud providers take longer to resolve.

  Jeff Martens — Metrist

Even if you don’t agree with all of their reasons, it’s definitely worth thinking about.

  Danny Martinez — incident.io

This one covers common reliability risks in APIs and techniques for mitigating them.

  Utsav Shah

The evolution beyond separate Dev and Ops teams continues. This article traces the path through DevOps and into platform-focused teams.

  Charity Majors — Honeycomb
  Full disclosure: Honeycomb is my employer.
