SRE Weekly Issue #241

A message from our sponsor, StackHawk:

Want a quick glimpse of how StackHawk works? Check out this 11-minute demo from SnykCon last week and learn about modern application security testing for DevOps teams.
http://sthwk.com/snykcon-demo

Articles

A quick note on last week’s issue: Google posted an updated version of their Google Chat incident summary with the “confidential” language removed. They also updated the content at the original link.

T-Mobile, one of the main mobile phone carriers in the US, had a major outage earlier this year. This report is essentially a retrospective performed by the US FCC (Federal Communications Commission). The report details the satisfyingly complex interplay of contributing factors in the incident.

US Federal Communications Commission

How can you be sure your failover plan will actually work? Hint: it’s almost certainly not going to work properly the first time you try it.

Adrian Cockcroft

In this blog post, we’ll look at the business value of SRE through customer focus, observability, and efficiency.

Emily Arnott — Blameless

Netflix has some interesting ideas around sampling, performance, and storage for their tracing system.

Maulik Pandey — Netflix

Oh, I do love reading stories of systems failing in interesting ways. This first installment contains five of the 10.

Yoz Grahame — LaunchDarkly

Black Friday is coming. Here are some ideas on how to deal with the rush — and how to analyze how you dealt with it when it’s over.

Nelly Wilson — Google

Two of my favorite authors/speakers have conspired to create a book on one of my favorite topics. Take my money! Oh wait, they’re giving it away, too?!

Nora Jones and Casey Rosenthal

Outages

Updated: October 25, 2020 — 8:33 pm
A production of Tinker Tinker Tinker, LLC