SRE Weekly Issue #332

A message from our sponsor, Rootly:

Manage incidents directly from Slack with Rootly 🚒. Automate manual admin tasks like creating the incident channel, Jira ticket, and Zoom call, paging and adding responders, building the postmortem timeline, setting up reminders, and more. Book a demo (+ get a snazzy Rootly LEGO set):
https://rootly.com/demo/

Articles

Their notification service had complex load characteristics that made scaling up a tricky proposition.

  Anand Prakash — Razorpay

Coalescing alerts and adding dependencies in Alertmanager were the key to reducing this team’s excessive pager load.

  steveazz — GitLab

Lorin Hochstein has started a series of blog posts on what we can learn about incident response from the Uvalde school shooting tragedy in the US. This article looks at how an organization’s perspective can affect their retrospective incident analysis.

  Lorin Hochstein

My claim here is that we should assume the officer is telling the truth and was acting reasonably if we want to understand how these types of failure modes can happen.

Every retrospective ever:

We must assume that a person can act reasonably and still come to the wrong conclusion in order to make progress.

  Lorin Hochstein

How do you synchronize state between multiple browsers and a backend, and ensure that everyone’s state will eventually converge? These folks explain how they did it, and a bug they found through testing.

  Jakub Mikians — Airspace Intelligence

MTTR is a mean, so it doesn’t tell you anything about the number of incidents, among other potential pitfalls.

  Dan Slimmon
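
To make the point about means concrete, here is a minimal sketch in Python. The two lists of recovery times are invented for illustration and are not taken from the article; `mttr` and `total_downtime` are hypothetical helper names.

```python
from statistics import mean

# Hypothetical time-to-recovery samples, in minutes (invented for illustration).
quiet_quarter = [30, 60]        # 2 incidents
busy_quarter = [30, 60] * 5     # 10 incidents with the same mix of durations


def mttr(durations):
    """Mean time to recovery: the arithmetic mean of the recovery times."""
    return mean(durations)


def total_downtime(durations):
    """Sum of all recovery times, which does grow with the incident count."""
    return sum(durations)


for name, quarter in [("quiet", quiet_quarter), ("busy", busy_quarter)]:
    print(name, "MTTR:", mttr(quarter),
          "incidents:", len(quarter),
          "total downtime:", total_downtime(quarter))
```

Both quarters report the same MTTR even though the busy one has five times as many incidents and five times the total downtime, so the mean alone cannot distinguish them.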

Last week, I included a GCP outage in europe-west2. This week, Google posted this report about what went wrong, and it’s got layers.

Bonus: another GCP outage report

  Google

Meta wants to do away with leap seconds, because they make it especially difficult to create reliable systems.

  Oleg Obleukhov and Ahmad Byagowi — Meta

If you’re anywhere near incident analysis in your organization, you need to read this list.

  Milly Leadley — incident.io

Outages

Updated: July 31, 2022 — 10:28 pm
A production of Tinker Tinker Tinker, LLC