SRE Weekly Issue #148

A message from our sponsor, VictorOps:

In case you don’t know the actual numbers, the costs of downtime can be pretty extreme. The benefits of SRE not only extend to system reliability and deployment speed, but it also creates a lot of business value:

http://try.victorops.com/sreweekly/costs-of-downtime

Articles

Last year, PagerDuty shared their inident response documentation. Now they’ve posted their training materials as well!

PagerDuty

Dropbox’s write-heavy read-light usage pattern makes this architecture overview worth a read.

Diwaker Gupta — Dropbox

There are two reasons to love this paper. First off, we get some insights into the backend that powers WeChat; and secondly the authors share the design of the battle hardened overload control system DAGOR that has been in production at WeChat for five years.

Adrian Colyer — The Morning Paper (review and summary)

Zhou et al. (original paper)

A tale of a nearly business-ending security incident and outage. Transparency and solid incident management helped them survive the event and prosper.

Paul Biggar

The section titled “A surprising discovery” is really thought-provoking:

t turns out that a single (bot) user was sending us a lot of traffic to a particularly slow endpoint. So while this was impacting the p99 latency, it was in fact not impacting any other users.

Igor Wiedler — Travis CI

An (external) analysis of the GitHub outage, with a discussion of how Orchestrator reacts to a network partition.

Ayende Rahien

I’m working on a talk for kubecon in December! One of the points I want to get across is the amount of time/investment it takes to use new software in production without causing really serious incidents, and what that’s looked like for us in our use of Kubernetes.

Julia Evans

Outages

Updated: November 18, 2018 — 8:27 pm
A production of Tinker Tinker Tinker, LLC Frontier Theme