SRE Weekly Issue #179

A message from our sponsor, VictorOps:

A good SRE manager can make or break your site reliability engineering team. Learn all about the duties of an SRE manager and the best practices for building a highly effective SRE program:

http://try.victorops.com/sreweekly/duties-of-effective-sre-managers

Articles

This is an engrossing write-up of the Chernobyl incident from the perspective of complex systems and failure analysis.

Barry O’Reilly

Slack’s Disasterpiece Theater isn’t quite chaos engineering, but it’s arguably better in some ways. They carefully craft scenarios to test their system’s resiliency, verifying (or disproving!) their hypothesis that a given disruption will be handled by the system without an incident. They share three riveting stories of lessons learned from past exercises.

The process each Disasterpiece Theater exercise follows is designed to maximize learning while minimizing risk of a production incident.

Richard Crowley — Slack

The above is the title of this YouTube playlist curated by John Allspaw.

My favorite sentence:

If you think an incident is “too common” to get its own postmortem that’s a good indicator that there’s a deeper issue that we need to address, and an excellent opportunity to apply our postmortem process to it.

Fran Garcia — HostedGraphite

In this post, we’ll share the algorithms and infrastructure that we developed to build a real-time, scalable anomaly detection system for Pinterest’s key operational timeseries metrics. Read on to hear about our learnings, lessons, and plans for the future.

I sure do love a good debugging story.

Eve Harris — Ably

When an incident occurs, your company is faced with a choice: do you seek to learn as much as possible about how it happened, or do you seek to find out who messed up?

Phillip Dowland — Safety Differently

Outages

Updated: August 4, 2019 — 9:55 pm
A production of Tinker Tinker Tinker, LLC