SRE Weekly Issue #299

A message from our sponsor, Rootly:

Manage incidents directly from Slack with Rootly 🚒. Automate manual admin tasks like creating an incident channel, Jira ticket, and Zoom meeting, paging the right team, building the postmortem timeline, setting up reminders, and more. Book a demo:
https://rootly.com/?utm_source=sreweekly

Articles

Lacking enough incidents to learn from, NASA “borrowed” incidents from outside of their organization and wrote case studies of their own!

  John Egan — InfoQ

This interview hits hard on the importance of setting and adhering to clear work hours when working remotely as an SRE.

  Ben Linders (interviewing James McNeil) — InfoQ

Here’s a clever way to put a price on how much an outage cost the company.

  Lorin Hochstein

This article introduces error budgets through an analogy to feedback loops in electrical engineering.

  Sjuul Janssen — Cloud Legends

[…] saturation SLOs have always been a point of discussion in the SRE community. Today, we attempt to clarify that.

  Last9

Here’s how the GitHub Actions engineering team uses ChatOps. I love the examples!

  Yaswanth Anantharaju — GitHub

GitHub shares some pretty interesting details on its major outage last month.

  GitHub

In the last few weeks, I’ve been working on an extendible general purpose shard coordinator, Shardz. In this article, I will explain the main concepts and the future work.

Lots of deep technical detail here.

  Jaana Dogan

They constructed a set of git commits, one for each environment variable, then used git bisect to figure out which variable was causing the failure. Neat trick!

  Diomidis Spinellis
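
To make the trick concrete, here is a minimal sketch of the same idea without git: treat "the first i variables are set" as commit i in a linear history, then binary-search for the first prefix that fails, which is the search git bisect performs. The function name, the sample variable names, and the `fails` callback are hypothetical, not taken from the article. It assumes the empty environment passes, the full one fails, and a single variable is responsible.

```python
from typing import Callable, Dict, List


def first_bad_variable(
    names: List[str],
    env: Dict[str, str],
    fails: Callable[[Dict[str, str]], bool],
) -> str:
    """Return the variable whose addition first makes `fails` return True.

    Prefix i (the first i names) plays the role of commit i; the loop
    halves the range between a known-good and a known-bad prefix.
    """
    good, bad = 0, len(names)  # prefix 0 passes, the full prefix fails
    while bad - good > 1:
        mid = (good + bad) // 2
        subset = {name: env[name] for name in names[:mid]}
        if fails(subset):
            bad = mid
        else:
            good = mid
    # The variable added at the first bad prefix is the culprit.
    return names[bad - 1]


if __name__ == "__main__":
    # Hypothetical environment; the check stands in for running the program.
    sample = {"HOME": "/root", "LANG": "C", "PATH": "/bin", "TERM": "dumb", "TZ": "UTC"}
    order = sorted(sample)
    print(first_bad_variable(order, sample, lambda e: "TERM" in e))
```

With n variables this needs about log2(n) runs of the failing program instead of n.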

Outages

Updated: December 5, 2021 — 9:33 pm
A production of Tinker Tinker Tinker, LLC