SRE Weekly Issue #179

Articles

This is an engrossing write-up of the Chernobyl incident from the perspective of complex systems and failure analysis.

Barry O’Reilly

Disasterpiece Theater: Slack’s process for approachable Chaos Engineering

Slack’s Disasterpiece Theater isn’t quite chaos engineering, but it’s arguably better in some ways. They carefully craft scenarios to test their system’s resiliency, verifying (or disproving!) their hypothesis that a given disruption will be handled by the system without an incident. They share three riveting stories of lessons learned from past exercises.

The process each Disasterpiece Theater exercise follows is designed to maximize learning while minimizing risk of a production incident.

Richard Crowley — Slack

Resilience Engineering, Cognitive Systems Engineering, and Human Factors Concepts in Software Contexts

The above is the title of this YouTube playlist curated by John Allspaw.

“It’s dead, Jim”: How we write an incident postmortem

My favorite sentence:

If you think an incident is “too common” to get its own postmortem that’s a good indicator that there’s a deeper issue that we need to address, and an excellent opportunity to apply our postmortem process to it.

Fran Garcia — HostedGraphite

Building a real-time anomaly detection system for time series at Pinterest

In this post, we’ll share the algorithms and infrastructure that we developed to build a real-time, scalable anomaly detection system for Pinterest’s key operational timeseries metrics. Read on to hear about our learnings, lessons, and plans for the future.

Ably Debugging Tales Part 1 — An Elixir Erlang Mystery

I sure do love a good debugging story.

Eve Harris — Ably

Incident investigation: Learning vs Blaming

When an incident occurs, your company is faced with a choice: do you seek to learn as much as possible about how it happened, or do you seek to find out who messed up?

Phillip Dowland — Safety Differently

Outages

Stack Exchange
