View on sreweekly.com
Last week, I came across Lorin Hochstein and started to read through his blog. Lorin has a lot of awesome stuff to say, as you can see in this issue. Thanks, Lorin!
“in aviation safety, it’s like we’ve been trying to learn about marriage by only studying divorce.”
Kristy Kiernan — Forbes
Use the right tool for the job, not the coolest one.
In line with last week’s article on patience by Will Gallego, this one emphasizes the importance of continued learning about resilience engineering.
Here are some really thought-provoking tips on how (and why) to write an effective post-incident analysis.
To get better at avoiding or mitigating future incidents, you need to understand the conditions that enabled past incidents to occur. Counterfactual reasoning is actively harmful for this, because it circumvents inquiry into those conditions.
Some great observations and questions related to the Cloudflare outage in July.
Sometimes, things are off, and you just know an incident is brewing. What is this skill, and how can we learn it?
Silvia Botros — Learning From Incidents
It’s been four years since I started SRE Weekly. I’m having a ton of fun and learning a lot, and I can’t tell you all how happy it makes me that you read the newsletter.
A huge thank you to everyone who writes amazing SRE content every week. Without you folks, SRE Weekly would be nothing. Thanks also to everyone who sends links in — I definitely don’t catch every interesting article!
Here’s an intro to the Learning From Incidents community. I can’t wait to see what these folks write. They’re coming out of the gate fast, with a post every day for the first week.
In order to understand how things went wrong, we need to first understand how they went right.
I love the move toward using the term “operational surprise” rather than “incident”.
Fascinating detail about the space shuttle Columbia’s accident, and the confusing jargon at NASA that may have contributed.
Dwayne A. Day — The Space Review
Google released free material (slides, handbooks, worksheets) to help you run a workshop on effective SLOs.
Lots of really interesting detail about how LinkedIn routes traffic to datacenters and what happens when a datacenter goes down.
Nishant Singh — LinkedIn
Our field is learning a ton, and it can be tempting to short-circuit that learning. It takes time to really grok and integrate what we’re learning.
Now it may be easy to accept all of this and think “Yeah yeah, I got it. Let me at that ‘resilience’. I’m going to ‘add so much resilience’ to my system!”
I like the distinction between “unmanaged” and “untrained” incident response.
Jesus Climent — Google
This chronicle of learning about observability makes for an excellent reading list for those just diving in.