Lacking enough incidents of its own to learn from, NASA “borrowed” incidents from outside the organization and wrote its own case studies!
John Egan — InfoQ
In this interview, they hit hard on the importance of setting and adhering to clear work hours when working remotely as an SRE.
Ben Linders (interviewing James McNeil) — InfoQ
Here’s a clever way to put a price on how much an outage cost the company.
This article introduces error budgets through an analogy to feedback loops in electrical engineering.
Sjuul Janssen — Cloud Legends
[…] saturation SLOs have always been a point of discussion in the SRE community. Today, we attempt to clarify that.
Here’s how the GitHub Actions engineering team uses ChatOps. I love the examples!
Yaswanth Anantharaju — GitHub
This contains some pretty interesting details on their major outage last month.
In the last few weeks, I’ve been working on an extendible general purpose shard coordinator, Shardz. In this article, I will explain the main concepts and the future work.
Lots of deep technical detail here.
They constructed a set of git commits, one for each environment variable, then used git bisect to figure out which variable was causing the failure. Neat trick!
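For anyone curious how that search plays out, here is a minimal sketch of the same idea in plain Python rather than git. It is not the article's code: it binary-searches an ordered list of variable changes the way git bisect binary-searches a linear history. The function name `first_bad_variable`, the `works` callback, and the sample variables are all hypothetical, and the sketch assumes a single culprit that keeps things broken once it is applied.

```python
from typing import Callable, Dict, List, Tuple


def first_bad_variable(
    changes: List[Tuple[str, str]],
    works: Callable[[Dict[str, str]], bool],
) -> str:
    """Return the name of the first variable whose addition breaks `works`.

    Each entry in `changes` plays the role of one commit. Assumes the empty
    environment works, the full environment fails, and the failure persists
    once the culprit has been applied (the same assumption git bisect makes
    about a linear history).
    """
    if not changes:
        raise ValueError("need at least one variable change")

    # Invariant: the first `lo` changes are known good,
    # and the first `hi` changes are known bad.
    lo, hi = 0, len(changes)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if works(dict(changes[:mid])):
            lo = mid  # still good: the culprit comes later
        else:
            hi = mid  # already bad: the culprit is at or before mid
    return changes[hi - 1][0]


# Hypothetical example: pretend the service breaks whenever HTTP_PROXY is set.
changes = [
    ("LOG_LEVEL", "debug"),
    ("TZ", "UTC"),
    ("HTTP_PROXY", "http://proxy.invalid:3128"),
    ("LANG", "C"),
]
culprit = first_bad_variable(changes, lambda env: "HTTP_PROXY" not in env)
print(culprit)
```

In a real investigation the `works` callback would start the failing program with the given environment and report whether it succeeded. The payoff is the same as with git bisect: roughly log2(n) test runs instead of n.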