SRE Weekly Issue #299

Articles

More More More! Why the Most Resilient Companies Want More Incidents

Lacking enough incidents to learn from, NASA “borrowed” incidents from outside of their organization and wrote case studies of their own!

John Egan — InfoQ

How to Work Asynchronously as a Remote-First SRE

In this interview, they hit hard on the importance of setting and adhering to clear work hours when working remotely as an SRE.

Ben Linders (interviewing James McNeil) — InfoQ

How much did that outage cost?

Here’s a clever way to put a price on what an outage cost the company.

Lorin Hochstein

SRE: The feedback loop of error budgets

This article introduces error budgets through an analogy to feedback loops in electrical engineering.

Sjuul Janssen — Cloud Legends
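
For readers new to the term, here is a minimal sketch of the arithmetic behind an error budget. It is not taken from the article: the function names, the 30-day window, and the idea of reporting the remaining budget as a fraction are all assumptions chosen for illustration.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over a window.

    `slo` is a fraction, e.g. 0.999 for "three nines". The 30-day default
    window is an assumption, not something the article prescribes.
    """
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    return (1.0 - slo) * window_days * 24 * 60


def budget_remaining(slo: float, downtime_minutes: float, window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative once overspent)."""
    budget = error_budget_minutes(slo, window_days)
    return (budget - downtime_minutes) / budget


if __name__ == "__main__":
    # A 99.9% SLO over 30 days allows roughly 43.2 minutes of downtime.
    print(round(error_budget_minutes(0.999), 1))
    # After a 20-minute outage, a bit over half of the budget is left.
    print(round(budget_remaining(0.999, 20.0), 2))
```

The feedback-loop idea is that the remaining fraction becomes an input to decisions: one common policy, assumed here rather than quoted, is to slow down risky releases as the value approaches zero.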

A Primer on Saturation SLO: What Is It and Do You Need to Consider It?

[…] saturation SLOs have always been a point of discussion in the SRE community. Today, we attempt to clarify that.

Last9

Using ChatOps to help Actions on-call engineers

Here’s how the GitHub Actions engineering team uses ChatOps. I love the examples!

Yaswanth Anantharaju — GitHub

GitHub Availability Report: November 2021

This contains some pretty interesting details on their major outage last month.

GitHub

Shardz

In the last few weeks, I’ve been working on an extendible general purpose shard coordinator, Shardz. In this article, I will explain the main concepts and the future work.

Lots of deep technical detail here.

Jaana Dogan

blog dds: 2021-11-27 — Rather than alchemy, methodical troubleshooting

They constructed a set of git commits, one for each environment variable, then used git bisect to figure out which variable was causing the failure. Neat trick!

Diomidis Spinellis
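
The same idea can be sketched without git at all. The snippet below is not the author's method (they used real commits and git bisect); it is a plain binary search over prefixes of the variable list, where each prefix plays the role of one commit. The names `first_bad_variable` and `passes` are hypothetical, and the sketch assumes that the empty environment works and that once the culprit variable is present the check keeps failing.

```python
from typing import Callable, Dict, Mapping, Optional, Sequence


def first_bad_variable(
    names: Sequence[str],
    env: Mapping[str, str],
    passes: Callable[[Dict[str, str]], bool],
) -> Optional[str]:
    """Return the variable whose addition first makes `passes` fail.

    `names` fixes the order in which variables are added, like a linear
    commit history. `passes` is a caller-supplied check (hypothetical here)
    that runs the program under the given environment subset.
    """
    if not names or passes({n: env[n] for n in names}):
        return None  # nothing to test, or the full environment works

    # Invariant: the prefix of length `lo` passes (assumed for lo == 0),
    # and the prefix of length `hi` fails.
    lo, hi = 0, len(names)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if passes({n: env[n] for n in names[:mid]}):
            lo = mid
        else:
            hi = mid
    return names[hi - 1]


if __name__ == "__main__":
    env = {"A": "1", "B": "2", "C": "3", "D": "4", "E": "5"}
    # Stand-in check: pretend the program breaks whenever D is set.
    culprit = first_bad_variable(list(env), env, lambda e: "D" not in e)
    print(culprit)
```

With n variables this needs about log2(n) runs of the check instead of n, which is the same saving git bisect gives over testing every commit.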

Outages

Facebook
