I’m taking a week off to catch up on a few things. Keep an eye on your inbox for next week’s issue!
SRE Weekly Issue #177
Articles
The point of this thread is to bring attention to the notion that our reactions to surprising events are the fuel that effectively dictates what we learn from them.
John Allspaw — Adaptive Capacity Labs
This article is an attempt to classify the causes of major outages at the big three cloud providers (AWS, Azure, and GCP).
David Mytton
It was, wasn’t it? Here’s a nice summary of the recent spate of unrelated major incidents.
Zack Whittaker — TechCrunch
Calculating CIRT (Critical Incident Response Time) involves ignoring various types of incidents to try to get a number that is more representative of the performance of an operations team.
Julie Gunderson, Justin Kearns, and Ophir Ronen — PagerDuty
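Here's a rough sketch of the filter-then-average idea as I read it (not PagerDuty's exact formula); the field names and cutoffs below are made up purely for illustration.

```python
# Rough sketch of the CIRT idea: exclude incidents that say little about
# responder performance (auto-resolved blips, low-severity noise), then
# average what remains. Field names and cutoffs are illustrative only.
from statistics import mean

incidents = [
    {"severity": "SEV-1", "auto_resolved": False, "response_minutes": 12},
    {"severity": "SEV-1", "auto_resolved": True,  "response_minutes": 1},   # excluded: resolved itself
    {"severity": "SEV-3", "auto_resolved": False, "response_minutes": 45},  # excluded: not critical
    {"severity": "SEV-1", "auto_resolved": False, "response_minutes": 30},
]

critical_times = [
    i["response_minutes"]
    for i in incidents
    if i["severity"] == "SEV-1" and not i["auto_resolved"]
]

print(mean(critical_times))  # 21 minutes, arguably closer to how the team actually performed
```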
There is so much great detail in this followup article about Cloudflare’s global outage earlier this month. Thanks, folks!
John Graham-Cumming — Cloudflare
Outages
- Statuspage.io
- NS1
- PagerDuty
- Nordstrom
- Nordstrom’s site went down at the start of a major sale.
- Heroku
- Honeycomb
- Honeycomb had an 8-minute outage preceded by 4 minutes of degradation. Click through to find out how their CI pipeline surprised them and what they did about it.
- Australian Tax Office
- Stripe
- […] two different database bugs and a configuration change interacted in an unforeseen way, causing a cascading failure across several critical services.
  Click through for Stripe’s full analysis.
- Discord
SRE Weekly Issue #176
Articles
[…] spans are too low-level to meaningfully be able to unearth the most valuable insights from trace data.
Find out why current distributed tracing tools fall short and the author’s vision of the future of distributed tracing.
Cindy Sridharan
If I wanted to introduce the concept of blameless culture to execs, this article would be a great starting point.
Rui Su — Blameless
When we look closely at post-incident artifacts, we find that they can serve a number of different purposes for different audiences.
John Allspaw — Adaptive Capacity Labs
When you meant to type /127 but entered /12 instead
Oops?
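The title doesn't say what the /12 was applied to; assuming it was a network prefix, here's a quick way to see the scale of that one-character typo (documentation prefix, not the one from the article):

```python
# Hypothetical illustration of a /127 vs /12 typo using the 2001:db8::
# documentation prefix, not the actual network from the article.
import ipaddress

intended = ipaddress.ip_network("2001:db8::/127")
typo = ipaddress.ip_network("2001:db8::/12", strict=False)  # host bits set, so strict=False

print(intended.num_addresses)  # 2 addresses: a point-to-point link
print(typo.num_addresses)      # 2**116 addresses: a colossal slice of the IPv6 space
```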
The early failure injection testing mechanisms from Chaos Monkey and friends were like acts of random vandalism. Monocle is more of an intelligent probing, seeking out any weakness a service may have.
There’s a great example of Monocle discovering a mismatched timeout between client and server and targeting it for a test.
Adrian Colyer (summary)
Basiri et al., ICSE 2019 (original paper)
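As a toy example of the timeout-mismatch class of weakness (my own sketch, not Monocle's), consider a caller whose timeout is shorter than the budget the server works within:

```python
# Toy sketch of a client/server timeout mismatch (not from the paper): the
# client gives up after 1s while the server keeps working for 2.5s, so the
# caller sees a failure and the server's effort is wasted.
import concurrent.futures
import time

CLIENT_TIMEOUT_S = 1.0   # assumed caller-side timeout
SERVER_BUDGET_S = 2.5    # assumed server-side processing/retry budget

def slow_server_call() -> str:
    time.sleep(SERVER_BUDGET_S)  # stand-in for the remote call
    return "ok"

with concurrent.futures.ThreadPoolExecutor() as pool:
    future = pool.submit(slow_server_call)
    try:
        print(future.result(timeout=CLIENT_TIMEOUT_S))
    except concurrent.futures.TimeoutError:
        # Under load, work the client has already abandoned still consumes
        # server capacity -- exactly the kind of latent weakness worth probing.
        print("client timed out before the server finished")
```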
Take the axiom of “don’t hardcode values” to an extreme, and you end up right back where you started.
Mike Hadlow
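Here's a hedged sketch (mine, not the author's) of where that extreme ends up: configuration expressive enough to hold logic is just code living in a worse place.

```python
# Illustrative only: business logic smuggled into "configuration" is still
# hardcoded logic, just without tests, types, or code review.
import json

config = json.loads('{"discount_rule": "0.10 if order_total > 100 else 0.0"}')

order_total = 250
discount = eval(config["discount_rule"], {}, {"order_total": order_total})
print(discount)  # 0.1 -- the rule is code, it just lives in a config file now
```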
Outages
- Cloudflare
- Cloudflare suffered a massive outage, returning 502 responses for over 80% of traffic for over 20 minutes. Linked above is their analysis. A tweet thread involving their CEO is also illuminating.
- Google Maps
- iCloud
- Tweetdeck
- Azure
- Azure suffered an outage in San Jose, CA, USA on July 2.
SRE Weekly Issue #175
Articles
This and other enlightened reflections on incident reviews can be found in this article:
Many organizations have driven post-incident reviews to become pallid, vapid, mechanical exercises whose value is limited to producing a defensible argument that management is occurring.
Richard Cook — Adaptive Capacity Labs
In this post, I’ll describe how we monitor our DNS systems and how we used an array of tools to investigate and fix an unexpected spike in DNS errors that we encountered recently.
Jeff Jo — Stripe
“Multi-cloud” never really lived up to its hype, did it? This article argues that a multi-cloud strategy is only useful in specific, constrained situations.
Disco Posse
I love how they used idempotency to avoid downtime and missed or repeated transactions during the cutover.
Miguel Carranza — RevenueCat
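For flavor, here's a minimal sketch of the general idempotency-key idea (illustrative only, not RevenueCat's actual implementation):

```python
# Minimal idempotency sketch (not RevenueCat's code): replaying a transaction
# during a cutover applies it at most once.
processed_keys: set[str] = set()  # in practice a durable, shared store
balances: dict[str, int] = {}

def apply_transaction(key: str, account: str, amount_cents: int) -> None:
    if key in processed_keys:
        return                      # already applied; a replay is a safe no-op
    balances[account] = balances.get(account, 0) + amount_cents
    processed_keys.add(key)         # record only after the write succeeds

apply_transaction("txn-123", "acct-1", 500)
apply_transaction("txn-123", "acct-1", 500)  # replayed during the cutover
print(balances)  # {'acct-1': 500} -- no double charge, no missed write
```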
This is either really clever or just unsporting.
Tonya Garcia — MarketWatch
This article discusses six kinds of SRE team (“kitchen sink”, infrastructure, tools, product/application, embedded, and consulting) and the pros and cons of each.
Gustavo Franco and Matt Brown — Google
If you see half the incidents this quarter compared to last, does it actually mean anything, statistically speaking? The math in this article applies equally well to SRE, and casts a shadow on the idea of tracking “metrics” like MTTR.
Marloes Nitert — Safety Differently
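To make that concrete, here's a quick sketch with made-up numbers: treat quarterly incident counts as Poisson, and under "no real change" this quarter's share of the combined total follows Binomial(n, 1/2).

```python
# Is "half as many incidents this quarter" signal or noise? Counts are
# modeled as Poisson; under equal rates, the split of the combined total
# is Binomial(n, 0.5). The numbers below are made up.
from math import comb

def two_sided_p(last_quarter: int, this_quarter: int) -> float:
    n = last_quarter + this_quarter
    smaller = min(last_quarter, this_quarter)
    one_tail = sum(comb(n, i) for i in range(smaller + 1)) / 2**n
    return min(1.0, 2 * one_tail)

print(round(two_sided_p(12, 6), 2))  # ~0.24 -- a drop from 12 to 6 could easily be chance
```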
This field guide to debugging is the synthesis of a bunch of contributions by folks on Twitter, forged into an article by the inimitable Julia Evans. Maybe a zine is in the works?
Julia Evans
Outages
- How Verizon and a BGP Optimizer Knocked Large Parts of the Internet Offline Today
- The big outage this week happened when a small ISP accidentally told the Internet that it was the best place to send all their packets.
  Tom Strickx — Cloudflare
- Statuspage.io
- Slack
- Hulu
- Hulu suffered an outage during their live stream of an important US political debate.
SRE Weekly Issue #174
A special treat awaits you in the Outages section this week: five awesome incident followups!
Articles
This is a study of every high severity production incident at Microsoft Azure services over a span of six months, where the root cause of that incident was a software bug.
Adrian Colyer (summary)
Liu et al., HotOS’19 (original paper)
PagerDuty created this re-enactment of an incident response phone bridge. It’s obviously fairly heavily redacted and paraphrased, but it’s still quite educational. It includes interludes where terms such as Incident Commander are explained.
George Miranda — PagerDuty
Outages
- Google Calendar
- Netflix
- Hulu
- Joyent May 27 2014 outage followup
- In this 2014 outage followup, we learn that a Joyent engineer accidentally rebooted an entire datacenter:
The command to reboot the select set of new systems that needed to be updated was mis-typed, and instead specified all servers in the data center.
- Salesforce May 17 outage followup
- Click through to read about the massive Salesforce outage last month. A database edit script contained a bug that ran an UPDATE without its WHERE clause, granting elevated permissions to more users than intended. Salesforce shut down broad chunks of their service to prevent data leakage. (A small illustration of the missing-WHERE failure mode follows this Outages list.)
- Second Life mid-May outage followup
- Linden Lab posted about a network maintenance that went horribly wrong, resulting in a total outage.
Everything started out great. We got the first new core router in place and taking traffic without any impact at all to the grid. When we started working on the second core router, however, it all went wrong.
April Linden — Linden Lab
- Monzo May 30 outage followup
- Monzo posted this incredibly detailed followup for an outage from several weeks ago. Not only does it give us a lot of insight into their incident response process, but I also got to learn how UK bank transfers work. Thanks to an anonymous reader for this one.
Nicholas Robinson-Wall — Monzo
- Google Cloud Platform June 2 outage followup
- Along with the blog post I linked to last week, Google also posted this technical followup for their major June 2 outage. I’ve never seen one of their followups even close to this long or detailed, and that’s saying a lot.
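One more note on the Salesforce entry above: here's a tiny self-contained illustration (in-memory SQLite, not their schema or script) of why a missing WHERE clause is so devastating.

```python
# Illustration of the Salesforce failure mode using an in-memory SQLite
# database (not their schema): an UPDATE without its WHERE clause rewrites
# every row instead of the intended subset.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (name TEXT, role TEXT)")
db.executemany("INSERT INTO users VALUES (?, ?)",
               [("alice", "admin"), ("bob", "viewer"), ("carol", "viewer")])

# Intended: change only the matching row. The WHERE clause scopes the update.
db.execute("UPDATE users SET role = 'admin' WHERE name = 'alice'")

# The bug: the same statement without WHERE grants the role to everyone.
db.execute("UPDATE users SET role = 'admin'")

print(db.execute("SELECT name, role FROM users").fetchall())
# [('alice', 'admin'), ('bob', 'admin'), ('carol', 'admin')]
```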