SRE Weekly Issue #505

2013-09-17 Outage Postmortem

An incident write-up from the archives, and it’s a juicy one. An update to their code caused a crash only after some time had passed, so their automated testing didn’t catch it before they deployed it worldwide.

Xandr

Quick takes on the Triple Zero Outage at Optus – the Schott Review

This article covers an independent review of the Optus outage.

I personally find it astounding that somebody conducting an incident investigation would not delve deeper into how a decision that appears to be astounding would have made sense in the moment.

Lorin Hochstein

How Workers powers our internal maintenance scheduling pipeline

Cloudflare needed a tool to look for overlapping impact across their many maintenance events in order to avoid unintentionally impairing redundancy.

Kevin Deems and Michael Hoffmann — Cloudflare

Expiry times are dangerous, on “The dangers of SSL certificates”

Another great piece on expiration dates. I especially like the discussion of abrupt cliffs as a design choice.

Chris Siebenmann — University of Toronto

SRE Is Anti-Transactional: An API for interfacing with automaters

It’s not always easy to see how to automate a given bit of toil, especially when cross-team interactions are involved.

Thomas A. Limoncelli and Christian Pearce — ACM Queue

Resilience vs. Fault tolerance

How do resilience and fault tolerance relate? Are they synonyms, do they overlap, or does one contain the other?

Uwe Friedrichsen

Datadog, Thank You for Blocking Us: Why Vendor Lock-In No Longer Matters

After unexpectedly losing their observability vendor, these folks were able to migrate to a new solution within a couple of days.

Karan Abrol, Yating Zhou, Pratyush Verma, Aditya Bhandari, and Sameer Agarwal — Deductive.ai

You Can’t Debug a System by Blaming a Person

A great dive into what blameless incident analysis really means.

Blameless also doesn’t mean you stop talking about what people did.

Busra Koken
