SRE Weekly Issue #198

Last week, I came across Lorin Hochstein and started to read through his blog.  Lorin has a lot of awesome stuff to say, as you can see in this issue.  Thanks, Lorin!

A message from our sponsor, VictorOps:

[You’re Invited] Learn how to modernize your approach to incident management and slash MTTA/MTTR in the latest webinar from VictorOps + Splunk, Thursday, December 19th:

https://go.victorops.com/sreweekly-modern-incident-management-webinar

Articles

“in aviation safety, it’s like we’ve been trying to learn about marriage by only studying divorce.”

Kristy Kiernan — Forbes

Use the right tool for the job, not the coolest one.

Mattias Geniar

In line with last week’s article on patience by Will Gallego, this one emphasizes the importance of continued learning about resilience engineering.

Lorin Hochstein

Here are some really thought-provoking tips on how (and why) to write an effective post-incident analysis.

Lorin Hochstein

To get better at avoiding or mitigating future incidents, you need to understand the conditions that enabled past incidents to occur. Counterfactual reasoning is actively harmful for this, because it circumvents inquiry into those conditions.

Lorin Hochstein

Some great observations and questions related to the Cloudflare outage in July.

Lorin Hochstein

Sometimes, things are off, and you just know an incident is brewing. What is this skill, and how can we learn it?

Silvia Botros — Learning From Incidents

Outages

Updated: December 15, 2019 — 9:39 pm
A production of Tinker Tinker Tinker, LLC