SRE Weekly Issue #180

A message from our sponsor, VictorOps:

Endorsing a culture of blameless transparency around post-incident reviews can lead to continuous improvement and more resilient services. Check out an interesting technique that SRE teams are using to improve post-incident analysis and learn more from failure:


This reads like a mini list of war stories from a grizzled veteran reliability engineer… because that’s exactly what it is. Don’t forget to click the link at the bottom for the follow-up post!


The myths:

  1. Add Redundancy
  2. Simplify
  3. Avoid Risk
  4. Enforce Procedures
  5. Defend against Prior Root Causes
  6. Document Best Practices and Runbooks
  7. Remove the People Who Cause Accidents

If that doesn’t make you want to read this, I don’t know what will.

Casey Rosenthal — Verica

The graveyard where no one dared to tread was the Terraform code. Once they got CI/CD set up, deploys became much easier and less scary.

Liz Fong-Jones — Honeycomb

My favorite idea in this article is that the absence of “errors” is not the same thing as safety.

Thai Woods (summary)

Sidney Dekker (original paper)

High availability and resilience are key features of Kubernetes. But what do you do when your Kubernetes cluster starts to become unstable and it looks like your ship is starting to sink?

Tim Little — Kudos


Updated: August 12, 2019 — 8:47 am