SRE Weekly Issue #148

Articles

Open-Sourcing Our Incident Response Training

Last year, PagerDuty shared their incident response documentation. Now they’ve posted their training materials as well!

PagerDuty

Validating performance and reliability of the new Dropbox search engine

Dropbox’s write-heavy read-light usage pattern makes this architecture overview worth a read.

Diwaker Gupta — Dropbox

Overload control for scaling WeChat microservices

There are two reasons to love this paper. First off, we get some insights into the backend that powers WeChat; and secondly the authors share the design of the battle hardened overload control system DAGOR that has been in production at WeChat for five years.

Adrian Colyer — The Morning Paper (review and summary)

Zhou et al. (original paper)

The Time Our Provider Screwed Us

A tale of a nearly business-ending security incident and outage. Transparency and solid incident management helped them survive the event and prosper.

Paul Biggar

How Honeycomb Has Changed the Way Travis CI Operates Their Business

The section titled “A surprising discovery” is really thought-provoking:

It turns out that a single (bot) user was sending us a lot of traffic to a particularly slow endpoint. So while this was impacting the p99 latency, it was in fact not impacting any other users.

Igor Wiedler — Travis CI
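
To make the point in that quote concrete, here is a minimal sketch of how one noisy client can drag an aggregate p99 upward while every other user's latency stays flat. The request data, the user names, and the helper functions are all hypothetical. This is not Honeycomb's API or Travis CI's actual analysis, just a nearest-rank percentile computed first over all requests and then per user.

```python
import math
from collections import defaultdict


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample that has at least
    pct percent of all samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def p99_by_user(requests):
    """Group (user, latency_ms) pairs by user and return each user's p99."""
    grouped = defaultdict(list)
    for user, latency_ms in requests:
        grouped[user].append(latency_ms)
    return {user: percentile(lat, 99) for user, lat in grouped.items()}


# Hypothetical traffic: 97 fast requests spread across ten ordinary users,
# plus 3 slow requests from a single bot hitting a slow endpoint.
requests = [(f"user-{i % 10}", 50) for i in range(97)] + [("bot", 2000)] * 3

overall_p99 = percentile([latency for _, latency in requests], 99)
per_user = p99_by_user(requests)

print("overall p99:", overall_p99)
for user, p99 in sorted(per_user.items()):
    print(user, p99)
```

In this made-up data set the aggregate p99 is dominated by the bot's requests, while the per-user breakdown shows every ordinary user at the fast latency. Grouping by a high-cardinality field such as user ID is what makes that difference visible.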

Analyzing the GitHub outage

An (external) analysis of the GitHub outage, with a discussion of how Orchestrator reacts to a network partition.

Ayende Rahien

Some notes on running new software in production

I’m working on a talk for kubecon in December! One of the points I want to get across is the amount of time/investment it takes to use new software in production without causing really serious incidents, and what that’s looked like for us in our use of Kubernetes.

Julia Evans

Outages

Google Cloud Platform (and possibly CloudFlare)
- The big outage this week occurred when an ISP in Africa accidentally advertised one of Google’s IP blocks over BGP, effectively black-holing traffic originally destined for GCP. This article suggests that CloudFlare might also have been affected, and it includes a statement from the offending ISP’s CEO.
Microsoft Outlook
Instagram
Basecamp 3
Second Life
Heroku followup report: Incident #1655 (October 30)
Facebook
