SRE Weekly Issue #314

A message from our sponsor, Rootly:

Manage incidents directly from Slack with Rootly 🚒. Automate manual admin tasks like creating the incident channel, Jira, and Zoom; paging and adding responders; building the postmortem timeline; setting up reminders; and more. Book a demo (+ get a snazzy Rootly lego set):
https://rootly.com/demo/

Articles

The first episode of this new podcast answers the question "What is SRE?" in three ways: what Google says SRE is, what the podcast host thinks it is, and how people seem to be practicing SRE.

  Stephen Townsend — Slight Reliability

This aircraft accident report puts heavy emphasis on the deeper contributing factors rather than a seemingly obvious single root cause.

  Mentour Pilot

Google posted an incident report for the March 8 incident involving Traffic Director.

  Google

This one includes some neat graphs showing load and theoretical success rates for various strategies such as no retries, N retries, token buckets, and circuit breakers.

  Marc Brooker

What if your alerting system goes down? These folks set up a dead man's switch to handle that situation.

  Miedwar Meshbesher — Nanit

Strategies for creating concise, efficient communication between teams during incidents and operational surprises.

[…] communications must be precise and descriptive to minimize confusion and accelerate a responder’s ability to assess and remedy the situation.

  Steve Stevens — Transposit

I really love these articles about hardware errors. They’re more common than we tend to realize.

  Harish Dattatraya Dixit — Facebook

Outages

Updated: March 20, 2022 — 9:18 pm
A production of Tinker Tinker Tinker, LLC