SRE Weekly Issue #334

I’ll be on vacation starting next Sunday (yay!). That means the next two issues will be prepared in advance, so there won’t be an Outages section.

Should you go multi-cloud? What should you do during an incident involving a third-party dependency? What about after? Read this one for all that and more.

  Lisa Karlin Curtis
Full disclosure: Fastly, my employer, is mentioned.

An introduction to the concept of common ground breakdown, using the Uvalde shooting in the US as a case study.

  Lorin Hochstein

The comments section is full of some pretty great advice, including questions you can ask while interviewing to suss out whether the on-call culture is going to be livable.

  u/dicksoutfoeharambe (and others) - reddit

From the archives, this is an analysis of a report on the 2018 major outage at TSB Bank in the UK.

  Jon Stevens-Hall

You can determine whether backoff will actually help your system, and this article does a great job of telling you how.

  Marc Brooker

I’ve read (and written) plenty of IC training guides, but this is the first time I’ve come across the concept of a “Hands-Off Update”. I’m definitely going to use that!

  Dan Slimmon

This is a really great explanation of observability from an angle I haven’t seen before.

a metric dashboard only contributes to observability if its reader can interpret the curves they’re seeing within a theory of the system under study.

  Dan Slimmon

Outages


  • Twitter
  • Google Search
    • Did you catch the Google search outage? I’ve never seen one like it, which shows how rare they are. Google shared a tidbit of information about what went wrong, and it wasn’t the datacenter explosion folks speculated about.

  • Peloton
Updated: August 14, 2022 — 9:04 pm
