A quick note on last week’s issue: Google posted an updated version of their Google Chat incident summary with the “confidential” language removed. They also updated the content at the original link.
T-Mobile, one of the main mobile phone carriers in the US, had a major outage earlier this year. This report is essentially a retrospective performed by the US FCC (Federal Communications Commission). The report details the satisfyingly complex interplay of contributing factors in the incident.
US Federal Communications Commission
How can you be sure your failover plan will actually work? Hint: it’s almost certainly not going to work properly the first time you try it.
In this blog post, we’ll look at the business value of SRE through customer focus, observability, and efficiency.
Emily Arnott — Blameless
Netflix has some interesting ideas around sampling, performance, and storage for their tracing system.
Maulik Pandey — Netflix
Oh, I do love reading stories of systems failing in interesting ways. This first installment contains five of the 10.
Yoz Grahame — LaunchDarkly
Black Friday is coming. Here are some ideas on how to deal with the rush — and how to analyze how you dealt with it when it’s over.
Nelly Wilson — Google
Two of my favorite authors/speakers have conspired to create a book on one of my favorite topics. Take my money! Oh wait, they’re giving it away, too?!
Nora Jones and Casey Rosenthal