SRE Weekly Issue #200

A message from our sponsor, VictorOps:

Learn how to modernize your approach to incident management and slash MTTA/MTTR in the latest webinar from VictorOps + Splunk:

https://go.victorops.com/sreweekly-modernized-incident-management

Articles

The logical argument goes like this: if incidents in your system each had a single root cause, that implies a level of brittleness that would preclude your company running successfully at all.

Lorin Hochstein

Once a system reaches a certain level of reliability, most major incidents will involve:

  • A manual intervention that was intended to mitigate a minor incident, or
  • Unexpected behavior of a subsystem whose primary purpose was to improve reliability

Lorin Hochstein

Confirmation bias can lead us to reinforce an incorrect mental model through spurious correlations.

Thai Wood — Resilience Roundup (summary)
Dennis Bernard, David Greathead, and Gordon Baxter — International Journal of Human Computer Studies (original paper)

In this post, I’ll recap his talk, sharing the journey that led them to build GoAlert, the problems they’ve solved, and how they use GoAlert with Sensu Go to simplify monitoring and reduce alert fatigue.

Anna MacLachlan — Sensu (recap)
Adam Westman — Target (talk)

Verbose debug logging + feature flagging = a way to investigate unknown unknowns in your system.

Will Sargent

Outages

SRE Weekly Issue #199

A message from our sponsor, VictorOps:

Ever find yourself asking, “How do I write Ansible playbooks for new Terraform servers?” Well, this new walkthrough from Splunk + VictorOps has your answer.

https://go.victorops.com/sreweekly-ansible-playbooks-with-terraform

Articles

Domino model, Swiss Cheese model, stand aside. The Gamma Knife model is a nifty analogy for contributing factors.

Lorin Hochstein

Lots of great tips here for how to make things easier on yourself when you’re paged. Pave the way for your 3 am self to get things fixed and get back to sleep as soon as possible.

Katie McLaughlin (Sysadvent day 21)

Ooh, a new SRE podcast! PagerDuty started things up with 4 episodes right out of the gate.

Introducing “Page It To The Limit,” a new podcast by the Community team here at PagerDuty that discusses what it means to operate software in production.

Wow, I love the idea of this shadowing program. The author discusses incidents they saw and 5 things they learned while shadowing.

Tristan Read — GitLab

Outages

SRE Weekly Issue #198

Last week, I came across Lorin Hochstein and started to read through his blog. Lorin has a lot of awesome stuff to say, as you can see in this issue. Thanks, Lorin!

A message from our sponsor, VictorOps:

[You’re Invited] Learn how to modernize your approach to incident management and slash MTTA/MTTR in the latest webinar from VictorOps + Splunk, Thursday, December 19th:

https://go.victorops.com/sreweekly-modern-incident-management-webinar

Articles

“in aviation safety, it’s like we’ve been trying to learn about marriage by only studying divorce.”

Kristy Kiernan — Forbes

Use the right tool for the job, not the coolest one.

Mattias Geniar

In line with last week’s article on patience by Will Gallego, this one emphasizes the importance of continued learning about resilience engineering.

Lorin Hochstein

Here are some really thought-provoking tips on how (and why) to write an effective post-incident analysis.

Lorin Hochstein

To get better at avoiding or mitigating future incidents, you need to understand the conditions that enabled past incidents to occur. Counterfactual reasoning is actively harmful for this, because it circumvents inquiry into those conditions.

Lorin Hochstein

Some great observations and questions related to the Cloudflare outage in July.

Lorin Hochstein

Sometimes, things are off, and you just know an incident is brewing. What is this skill, and how can we learn it?

Silvia Botros — Learning From Incidents

Outages

SRE Weekly Issue #197

It’s been four years since I started SRE Weekly. I’m having a ton of fun and learning a lot, and I can’t tell you all how happy it makes me that you read the newsletter.

A huge thank you to everyone who writes amazing SRE content every week. Without you folks, SRE Weekly would be nothing. Thanks also to everyone who sends links in — I definitely don’t catch every interesting article!

A message from our sponsor, VictorOps:

From everyone at VictorOps, we wanted to wish you a happy holiday season and give thanks for this SRE community. So, we put together this fun post to highlight the highs and lows of being on-call during the holidays.

https://go.victorops.com/sreweekly-on-call-holidays

Articles

Here’s an intro to the Learning From Incidents community. I can’t wait to see what these folks write. They’re coming out of the gate fast, with a post every day for the first week.

Nora Jones

In order to understand how things went wrong, we need to first understand how they went right.

I love the move toward using the term “operational surprise” rather than “incident”.

Lorin Hochstein

Fascinating detail about the space shuttle Columbia’s accident, and the confusing jargon at NASA that may have contributed.

Dwayne A. Day — The Space Review

Google released free material (slides, handbooks, worksheets) to help you run a workshop on effective SLOs.

Lots of really interesting detail about how LinkedIn routes traffic to datacenters and what happens when a datacenter goes down.

Nishant Singh — LinkedIn

Our field is learning a ton, and it can be tempting to short-circuit that learning. It takes time to really grok and integrate what we’re learning.

Now it may be easy to accept all of this and think “Yeah yeah, I got it. Let me at that ‘resilience’. I’m going to ‘add so much resilience’ to my system!”.

Will Gallego

I like the distinction between “unmanaged” and “untrained” incident response.

Jesus Climent — Google

This chronicle of learning about observability makes for an excellent reading list for those just diving in.

Mads Hartmann

Outages

SRE Weekly Issue #196

A message from our sponsor, VictorOps:

From everyone at VictorOps, we wanted to wish you a happy holiday season and give thanks for this SRE community. So, we put together this fun post to highlight the highs and lows of being on-call during the holidays.

https://go.victorops.com/sreweekly-on-call-holidays

Articles

My favorite:

Don’t wait until the post-mortem; consider doing a “pre-mortem” brainstorm to identify potential issues.

John Agger — Fastly

Full disclosure: Fastly is my employer.

Let’s Encrypt deals with pretty heavy traffic. This post goes into what it takes for them to run a successful Certificate Transparency log.

Phil Porada — Let’s Encrypt

In this air traffic radio recording from Las Vegas (2018), the air traffic controller becomes impaired and starts issuing confusing and dangerously incorrect directives. The pilots work together to correct the situation and no accidents occur. This is a classic example of a resilient system.

I don’t normally link to posts that heavily cover product offerings, but this one has some real gems. I especially like the discussion toward the end of the importance of analyzing an incident shortly after it happens.

John Allspaw — Adaptive Capacity Labs

This is a striking analogue for an infrastructure with many unactionable alerts.

The commission has estimated that of the thousands of alarms going off throughout a hospital every day, an estimated 85 to 99 percent do not require clinical intervention.

Melissa Bailey — The Washington Post

A fascinating look at the early days of Etsy, in which a system is rewritten, the rewrite blows up, the rewrite is rewritten, and finally that is rewritten again. Ouch.

Dan McKinley (@mcfunley)

If your DR test involves carefully contrived circumstances that don’t match the real world, then it’s not a real test. Point your upper management at this article if you need to argue for true DR testing.

Ivan Pepelnjak

Outages

A production of Tinker Tinker Tinker, LLC