SRE Weekly Issue #344

A message from our sponsor, Rootly:

Manage incidents directly from Slack with Rootly 🚒.

Rootly automates manual tasks like creating an incident channel, Jira ticket, and Zoom room, inviting responders, creating statuspage updates and postmortem timelines, and more. Want to see why companies like Canva and Grammarly love us?


In this story of SLOs gone bad, error budgets and code freezes provided a perverse incentive that caused a great deal of harm.

This article seeks to apply SRE principles to security in the form of a Threat Budget.

  Jason Bloomberg — Intellyx

After talking to hundreds of engineers about their processes, we’ve identified five of the most common challenges we see across companies looking to put more structure behind how they manage their incidents.

  Mike Lacsamana — FireHydrant

The Analysis section has a lot of important lessons. What really stands out in this incident review is that Honeycomb plainly states that they don’t yet know what went wrong, and why not.

  Fred Hebert — Honeycomb
  Full disclosure: Honeycomb is my employer.

Several small staging clusters, each fit for its purpose, offer a more maintainable, cheaper alternative.

  Tyler Cipriana

I’m really enjoying the Admiral Cloudberg series of aircraft accident investigation reports. How did I not know about these before??

A lot has improved in aviation safety since this crash in 1967, but there’s still a lot we can learn in SRE even now. For example: the operator’s view into the system should make the result of their inputs clear.

  Admiral Cloudberg

An unannounced (maybe inadvertent?) breaking change in an Azure API caused an outage. Here’s the story of the investigation.

  Nikko Campbell — Metrist

Another Admiral Cloudberg air accident investigation, this time showing how easily critical details can slip through the cracks.

  Admiral Cloudberg

Updated: October 23, 2022 — 8:08 pm
A production of Tinker Tinker Tinker, LLC