SRE Weekly Issue #314

A message from our sponsor, Rootly:

Manage incidents directly from Slack with Rootly 🚒. Automate manual admin tasks like creating incident channel, Jira and Zoom, paging and adding responders, postmortem timeline, setting up reminders, and more. Book a demo (+ get a snazzy Rootly lego set):
https://rootly.com/demo/

Articles

The first episode of this new podcast answers the question in three ways: what Google says SRE is, what the podcast host thinks it is, and how people seem to be practicing SRE.

  Stephen Townsend — Slight Reliability

This aircraft accident report puts heavy emphasis on the deeper contributing factors rather than a seemingly obvious single root cause.

  Mentour Pilot

Google posted an incident report for the March 8 incident involving Traffic Director.

  Google

This one includes some neat graphs showing load and theoretical success rates for various strategies, such as no retries, N retries, token buckets, and circuit breakers.

  Marc Brooker

What if your alerting system goes down? These folks set up a dead-switch to handle that situation.

  Miedwar Meshbesher — Nanit

Strategies for creating concise, efficient communication between teams during incidents and operational surprises.

[…] communications must be precise and descriptive to minimize confusion and accelerate a responder’s ability to assess and remedy the situation.

  Steve Stevens — Transposit

I really love these articles about hardware errors. They’re more common than we tend to realize.

  Harish Dattatraya Dixit — Facebook

Outages

Updated: March 20, 2022 — 9:18 pm