Articles
This and other enlightened reflections on incident reviews can be found in this article:
“Many organizations have driven post-incident reviews to become pallid, vapid, mechanical exercises whose value is limited to producing a defensible argument that management is occurring.”
Richard Cook — Adaptive Capacity Labs
“In this post, I’ll describe how we monitor our DNS systems and how we used an array of tools to investigate and fix an unexpected spike in DNS errors that we encountered recently.”
Jeff Jo — Stripe
“Multi-cloud” never really lived up to its hype, did it? This article argues that a multi-cloud strategy is only useful in specific, constrained situations.
Disco Posse
I love how they used idempotency to avoid downtime and missed or repeated transactions during the cutover.
Miguel Carranza — RevenueCat
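For readers who have not met the term, here is a minimal sketch of the general idea behind idempotency in a payment-like system. It is not taken from the RevenueCat article: the `PaymentProcessor` class, its `charge` method, and the in-memory dictionary are all hypothetical stand-ins for whatever durable store a real system would use.

```python
class PaymentProcessor:
    """Toy processor that applies each idempotency key at most once (hypothetical)."""

    def __init__(self):
        self._results = {}     # idempotency key -> result of the first attempt
        self.charges_made = 0  # how many charges were actually executed

    def charge(self, idempotency_key, amount_cents):
        # A retry with a key we have already seen returns the stored result
        # instead of charging the customer a second time.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.charges_made += 1
        result = {"charge_id": self.charges_made, "amount_cents": amount_cents}
        self._results[idempotency_key] = result
        return result


processor = PaymentProcessor()
first = processor.charge("order-123", 500)
retry = processor.charge("order-123", 500)  # e.g. replayed during a cutover
print(first == retry, processor.charges_made)
```

Because a replayed request is harmless, a client that is unsure whether a write landed on the old or the new system can simply send it again.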
This is either really clever or just unsporting.
Tonya Garcia — MarketWatch
This article discusses six kinds of SRE team (“kitchen sink”, infrastructure, tools, product/application, embedded, and consulting) and the pros and cons of each.
Gustavo Franco and Matt Brown — Google
If you see half the incidents this quarter compared to last, does it actually mean anything, statistically speaking? The math in this article applies equally well to SRE, and casts a shadow on the idea of tracking “metrics” like MTTR.
Marloes Nitert — Safety Differently
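As a rough illustration of the question (not necessarily the method the article uses), suppose incidents arrive independently at a steady rate, so quarterly counts are roughly Poisson, and both quarters are the same length. Under those assumptions, if nothing really changed, each incident is equally likely to fall in either quarter, and an exact binomial test tells us how surprising a lopsided split is. The function name and the example counts of 10 and 5 are made up for this sketch.

```python
from math import comb


def two_sided_p_value(count_a: int, count_b: int) -> float:
    """Exact conditional test that two Poisson counts share the same rate.

    Given the total, count_a follows Binomial(total, 0.5) if the rates are equal.
    """
    total = count_a + count_b
    low = min(count_a, count_b)
    # Probability of a split at least this lopsided in one direction.
    tail = sum(comb(total, k) for k in range(low + 1)) / 2 ** total
    # Double it for a two-sided test, capped at 1.
    return min(1.0, 2 * tail)


p = two_sided_p_value(10, 5)
print(f"p-value for 10 vs 5 incidents: {p:.3f}")  # prints 0.302
```

A p-value of about 0.30 means a drop from 10 incidents to 5 is well within what chance alone would produce under these assumptions, which is the kind of result that makes quarter-over-quarter incident counts hard to interpret.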
This field guide to debugging is the synthesis of a bunch of contributions by folks on Twitter, forged into an article by the inimitable Julia Evans. Maybe a zine is in the works?
Julia Evans
Outages
- How Verizon and a BGP Optimizer Knocked Large Parts of the Internet Offline Today
  - The big outage this week happened when a small ISP accidentally told the Internet that it was the best place to send all their packets. Tom Strickx — Cloudflare
- Statuspage.io
- Slack
- Hulu
  - Hulu suffered an outage during their live stream of an important US political debate.