View on sreweekly.com
I’ll be on vacation starting next Sunday (yay!). That means the next two issues will be prepared in advance, so there won’t be an Outages section.
Should you go multi-cloud? What should you do during an incident involving a third-party dependency? What about after? Read this one for all that and more.
Lisa Karlin Curtis — incident.io
Full disclosure: Fastly, my employer, is mentioned.
An introduction to the concept of common ground breakdown, using the Uvalde shooting in the US as a case study.
The comments section is full of some pretty great advice, including questions you can ask while interviewing to suss out whether the on-call culture is going to be livable.
u/dicksoutfoeharambe (and others) — reddit
From the archives, this is an analysis of a report on the 2018 major outage at TSB Bank in the UK.
You can determine whether backoff will actually help your system, and this article does a great job of telling you how.
I’ve read (and written) plenty of IC training guides, but this is the first time I’ve come across the concept of a “Hands-Off Update”. I’m definitely going to use that!
This is a really great explanation of observability from an angle I haven’t seen before.
a metric dashboard only contributes to observability if its reader can interpret the curves they’re seeing within a theory of the system under study.
Thanks for all the well-wishes as I took a sick day last week. I’m feeling much better!
Is your status page status.yourcompany.com? If so, read this article, then get yourself a new domain.
Eduardo Messuti — Statuspal
The author used my favorite technique for getting up to speed on a company: analyzing a recent incident.
Vanessa Huerta Granda — Jeli
There are a number of lessons I learned guiding weeks-long backcountry leadership courses for teens that I carried with me into my roles in incident management. In this blog post, I’ll share three that stand out.
Ryan McDonald — FireHydrant
I really like these articles about interpreting SRE in a way that makes sense for your organization. SRE is still constantly evolving.
Steve Smith — Equal Experts
The author led an incident just 3 months into their tenure. Here’s what they learned.
Milly Leadley — incident.io
while SRE and DevOps type job explainers have been written ad nauseam, I found there’s relatively little online about Observability Teams and roles. I figured I’d share a bit about my experience on an O11y Team.
I found the contrast between this one and the previous article interesting. The previous one includes a quote from Brendan Gregg:
Let me try some observability first. (Means: Let me look at the system without changing it.)
Jessica Kerr — Honeycomb
In June, we experienced four incidents resulting in significant impact to multiple GitHub.com services. This report also sheds light into an incident that impacted several GitHub.com services in May.
Using the Webb telescope as an example, this article describes the progression of a system toward production operation using a metaphor of 3 days.
Robert Barron — IBM