SRE Weekly Issue #346

A message from our sponsor, Rootly:

Manage incidents directly from Slack with Rootly 🚒.

Rootly automates manual tasks like creating an incident channel, Jira ticket, and Zoom room, inviting responders, and creating statuspage updates, postmortem timelines, and more. Want to see why companies like Canva and Grammarly love us?


The theme of this article is: somebody knows. This is so often the case with lurking infrastructure issues, and it only becomes clear that somebody knew about the underlying risk once things blow up (if it ever becomes clear at all). How can we find out what someone already knows, soon enough to act?

  Elizabeth Ayer

In this air crash investigation report, somebody knew: the maintenance supervisor had written multiple memos about a risky maintenance practice to no avail, and the practice directly contributed to the crash.

  Admiral Cloudberg

And in this one, somebody knew too: a trained pilot in a nearby village called air traffic control to warn them that a plane looked likely to crash into a mountain and needed to pull up — shortly before it hit the mountain.

  Admiral Cloudberg

A lolsob-worthy comment on laying off SREs. And here’s a totally on-point reply with the “somebody knew” moment.

Partly, it’s about accepting that this is hard work. The other part is choosing where your energy input can yield the most learning.

Full disclosure: Fred is my teammate at work.

  Fred Hebert

Check it out, the folks started a podcast about incidents!

Here’s Google’s report for a BigQuery outage that occurred on October 13.


At Last9, we auto-delete Slack messages after 2 days on all personal Direct Messages. These retention policies force teams to improve documentation, kill tribal knowledge, and drive accountability for mistakes and errors.

  Nishant Modak — Last9

There are some interesting tidbits in the pile of incidents in this report.

  Jakub Oleksy — GitHub

Updated: November 6, 2022 — 8:41 pm
A production of Tinker Tinker Tinker, LLC