Last year, PagerDuty shared their incident response documentation. Now they’ve posted their training materials as well!
Dropbox’s write-heavy, read-light usage pattern makes this architecture overview worth a read.
Diwaker Gupta — Dropbox
There are two reasons to love this paper. First off, we get some insights into the backend that powers WeChat; and secondly the authors share the design of the battle hardened overload control system DAGOR that has been in production at WeChat for five years.
Adrian Colyer — The Morning Paper (review and summary)
Zhou et al. (original paper)
A tale of a nearly business-ending security incident and outage. Transparency and solid incident management helped them survive the event and prosper.
The section titled “A surprising discovery” is really thought-provoking:
It turns out that a single (bot) user was sending us a lot of traffic to a particularly slow endpoint. So while this was impacting the p99 latency, it was in fact not impacting any other users.
Igor Wiedler — Travis CI
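To see how a single client can move a global p99 without touching anyone else, here is a minimal sketch in Python. The request log, user names, and latency values are all made up for illustration and are not taken from the Travis CI post; the point is only that an aggregate percentile and a per-user breakdown can tell very different stories.

```python
import math
from collections import defaultdict


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical request log of (user, latency in ms).
# 50 ordinary users make 950 fast requests in total (40-49 ms each).
requests = [(f"user-{i % 50}", 40 + (i % 10)) for i in range(950)]
# One bot makes 50 requests to a slow endpoint (5000 ms each).
requests += [("bot", 5000)] * 50

# The global p99 is dominated by the bot's slow requests.
overall_p99 = percentile([ms for _, ms in requests], 99)

# Grouping by user shows who is actually experiencing the slowness.
per_user = defaultdict(list)
for user, ms in requests:
    per_user[user].append(ms)

per_user_p99 = {user: percentile(ms_list, 99) for user, ms_list in per_user.items()}
affected = [user for user, p99 in per_user_p99.items() if p99 > 1000]

print("overall p99 (ms):", overall_p99)
print("users with p99 above 1000 ms:", affected)
```

In this toy data the overall p99 looks alarming, yet only the bot ever sees a slow response. Breaking a latency metric down by user (or by endpoint) before alerting on it is one way to tell a broad regression apart from a single noisy client.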
An (external) analysis of the GitHub outage, with a discussion of how Orchestrator reacts to a network partition.
I’m working on a talk for kubecon in December! One of the points I want to get across is the amount of time/investment it takes to use new software in production without causing really serious incidents, and what that’s looked like for us in our use of Kubernetes.
- Google Cloud Platform (and possibly CloudFlare)
  - The big outage this week occurred when an ISP in Africa accidentally advertised one of Google’s IP blocks over BGP, effectively black-holing traffic originally destined for GCP. This article suggests that CloudFlare might also have been affected, and it includes a statement from the offending ISP’s CEO.
- Microsoft Outlook
- Basecamp 3
- Second Life
- Heroku followup report: Incident #1655 (October 30)