
SRE Weekly Issue #108

Wow, I have a lot of great content to share with you this week!  Sometimes it seems like awesome articles come in waves… not sure what that’s about.

SPONSOR MESSAGE

ChatOps continues to gain momentum in all industries. See what Jason Hand had to say about the progression. http://try.victorops.com/SREWeekly/ChatOpsUpdate

Articles

This is the first in a series where New York Times CTO, Nick Rockwell, talks to leaders in the technology world about their work.

There’s so incredibly much awesome in this conversation, and I’ve already seen the internet alight with people quoting it. Charity says so many insightful things that I’m going to have to reread this a couple of times to absorb it all. It’s a must-read!

Xero SRE is back, this time with an article about their incident response process and an overview of their chatbot, Multivac. The bot assists with paging and information tracking and, crucially, guides incident responders through a checklist of actions such as determining severity.

Here’s a fun little distributed system debugging story from the founder of RavenDB.

This CNN article goes into a little more detail about what happened. To my eye, there’s not enough in those details to warrant firing, so there must be more than has been shared publicly.

LinkedIn’s growth from a single datacenter to multiple “hyperscale” locations was accompanied by a cultural shift. They transitioned from “‘Site-Up’ is priority #1” to “taking intelligent risks” as their overall reliability improved.

The program is nominally aimed toward “a variety of industries, including the aerospace, automotive, maritime, manufacturing, oil, chemical, power transmission, medical device, infrastructure planning and extreme event response sectors”, though I can’t help but wonder if it might be applicable to IT.

“Well I’d cut out the pizza and beer and instead pay for Splunk.”

This author pushes us to resist the urge to write something in-house and instead look for external services or software, when the tool is not key to delivering customer value.

Here’s a very well-articulated argument for using a third-party feature-flag service rather than writing your own. I’ve seen every pitfall they mention and more. This article is by Rollout.io, a feature-flag service, but they notably don’t mention their product even once, and they don’t need to. Nicely done, folks.

I think there’s another layer we get out of the postmortem process itself that hasn’t usually been part of the discussion: communicating about your service’s long-term stability.

We should look beyond merely preventing the same kind of incident in the future and improving our incident response process, says this article from PagerDuty.

How many times have you been paged for a server at 95% disk usage, only to find that it’s still months away from full? This article by SignalFX is about a feature on their platform, but its concepts are generally applicable to other tools.
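
The underlying idea is simple enough to sketch. Here's a purely hypothetical Python example (not SignalFX's actual feature) that alerts on projected time-to-full instead of a raw usage percentage:

    from datetime import datetime

    def hours_until_full(samples, capacity_bytes):
        # samples: list of (timestamp, used_bytes) pairs, oldest first
        (t0, u0), (t1, u1) = samples[0], samples[-1]
        elapsed_hours = (t1 - t0).total_seconds() / 3600
        growth_per_hour = (u1 - u0) / elapsed_hours
        if growth_per_hour <= 0:
            return float("inf")  # flat or shrinking usage: nothing to page about
        return (capacity_bytes - u1) / growth_per_hour

    # 95% full, but growing only ~1 GB/week on a 1 TB volume:
    samples = [
        (datetime(2018, 1, 1), 950_000_000_000),
        (datetime(2018, 1, 8), 951_000_000_000),
    ]
    print(hours_until_full(samples, 1_000_000_000_000) / 24)  # roughly 340 days

A proper fit over more samples would be less noisy, but even this naive version would have saved you that 3am page.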

A primer on testing failover in a MongoDB Atlas cluster.

Large numbers of SREs went scrambling last month when we realized that we may suddenly run out of resources on our NoSQL workloads. Here are some concrete numbers on how things actually turned out.

Outages

SRE Weekly Issue #107

SPONSOR MESSAGE

Reactive, tactical, integrated, or holistic—where does your incident management fall? Read about incident management maturity to find out. http://try.victorops.com/SREWeekly/IncidentManagementMaturity

Articles

Here, “escalation policy” refers to ongoing work by SRE to get a service back into its SLO, rather than an escalation policy definition in PagerDuty (for example). This article describes the tactics a hypothetical Google SRE team has at their disposal to deal with an ailing service. It’s especially striking to me how this policy comes across as almost punitive in nature.

In this post, we’ll provide a technical walk-through of how we used the Play Framework and the Akka Actor Model to build the massive infrastructure that keeps track of the online status of millions of members at any given moment. We’ll describe how it distributes thousands of changes per second in the online status of these members to millions of other connected members in real time. You will also learn how to apply these techniques to your own applications.
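
The post itself is all Play and Akka; purely to show the shape of the problem, here's a deliberately tiny (and entirely hypothetical) Python sketch of heartbeat-based presence tracking with fan-out to subscribers:

    import time

    HEARTBEAT_TIMEOUT = 30  # seconds without a heartbeat before a member counts as offline

    class PresenceTracker:
        def __init__(self):
            self.last_seen = {}    # member_id -> timestamp of last heartbeat
            self.subscribers = {}  # member_id -> callbacks interested in that member

        def heartbeat(self, member_id):
            came_online = not self.is_online(member_id)
            self.last_seen[member_id] = time.time()
            if came_online:
                self._notify(member_id, online=True)

        def is_online(self, member_id):
            ts = self.last_seen.get(member_id)
            return ts is not None and time.time() - ts < HEARTBEAT_TIMEOUT

        def subscribe(self, member_id, callback):
            self.subscribers.setdefault(member_id, set()).add(callback)

        def _notify(self, member_id, online):
            for callback in self.subscribers.get(member_id, set()):
                callback(member_id, online)

    # A real system would also sweep for expired heartbeats and push offline
    # transitions; the article covers how LinkedIn handles this at scale.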

This article from LaunchDarkly is about assuming failure and mitigating harm, through the lens of feature-flag-based deployment.

New Relic shares this list of the categories of tools that SREs use to standardize the systems they support.

As Liz [Fong-Jones] told Matthew Flaming, New Relic vice president of software engineering, “One SRE team is going to have a really difficult time supporting 50 different software engineering teams if they’re each doing their own separate thing, and they’re each using separate tooling.”

In the final article of this series, Tyler Treat lays out a design for a new distributed log based on NSQ.

While perhaps not strictly SRE-related, hiring is still critically important for SRE teams. I really love Honeycomb’s approach to hiring as laid out in this blog post.

Why indeed? This issue of The Morning Paper discusses a paper on the effectiveness of random testing in distributed systems. More specifically, it goes over the mathematics behind why randomized testing in Jepsen is actually useful, despite classical theories that it ought not be.
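
A much cruder back-of-the-envelope version of the intuition: if a single randomized run trips a given bug with probability p, the chance of never seeing it falls off geometrically with the number of runs.

    # Illustrative only; the paper's analysis is far more interesting than this.
    def prob_bug_missed(p, n):
        """Probability that n independent random runs all miss a bug
        that any single run hits with probability p."""
        return (1 - p) ** n

    # Even a bug hit on only 1% of runs is very likely to surface
    # after a few hundred randomized iterations.
    print(prob_bug_missed(0.01, 500))  # ~0.0066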

Outages

  • Pinterest
  • Google Cloud Storage
    • This one’s worth a read. Google’s original status posting stated 100% impact to cloud storage in its US region, but their follow-up post retroactively reduced that to 2.0% average and 3.6% peak.
  • Netflix
    • This one happened seemingly at the same time as the Google Cloud Storage outage, but that may be a spurious correlation. This is the first time that I learned that Netflix does have a status page of sorts: it’s an article in their help center entitled “Is Netflix Down?” and they update it live. Who knew?
  • Facebook/Instagram
  • National Health Service (UK)

SRE Weekly Issue #106

SPONSOR MESSAGE

See how AlienVault focuses their incident management on collaboration and shared responsibility while relying on the rules engine of the VictorOps Transmogrifier. http://try.victorops.com/SREWeekly/AlienVault

Articles

Chaos engineering is extremely useful, and Mathias Lafeldt has written plenty about its virtues. But as with everything, it’s important to be aware of its pitfalls and shortcomings too.

There’s been a lot of talk of firing (or worse) the person whose actions led to the false alarm in Hawaii. That’s why I’m especially glad to see this excellent analysis by Don Norman (author of The Design of Everyday Things, among other books). Bonus content: another article in the same vein with some more interesting tidbits.

Think twice before you disable swap, says Chris Down, an author of the upcoming cgroup v2 in the Linux kernel.

Catchpoint is running a survey of SREs and SRE-like folks, and I’d really appreciate it if you’d take a moment to fill it out. Not only will the resulting data be very interesting, but Catchpoint is donating $5 to charity for every survey completed. Let’s stuff that ballot box and get them to hit their cap of $3000!

The awesome continues this week with a discussion of the importance of simplicity in the design of a reliable system.

This article from Heidi Waterhouse at LaunchDarkly starts off with a really interesting take on the Y2K bug and continues on to discuss risk management in operations.

This short article has an extremely cogent point: design your system to be flexible enough to allow the user to do something seemingly incorrect, because they might need to while responding to an incident!

LinkedIn had a problem: their on-call system was so dysfunctional that they had to scramble to find coverage for an engineer who had been scheduled to be on call while on vacation. They explain how they identified the problem, came up with a solution, and implemented it, including automation and cultural fixes.

If the phrase “a DevOps World” makes you feel ill, don’t dismiss this article from ACM Queue out of hand. It’s got some great points about designing effective monitoring, and I like the introduction of the “Real Systems Monitoring” concept (akin to “Real User Monitoring” or RUM).

Outages

  • Heroku
    • Heroku had a 29-hour impairment to their application log routing platform.

SRE Weekly Issue #105


A quick note: Friday was my last day at Heroku/Salesforce, so don’t be surprised if you see my “full disclosure” notices change.

SPONSOR MESSAGE

See how CloudBees Jenkins Solutions & VictorOps work together to bridge the on-call gap for CI/CD in this webinar. Register today. http://join.cloudbees.com/l/272242/2018-01-09/739hy

Articles

PagerDuty put a call out on Twitter, asking what folks are doing to improve the on-call experience at their companies.

Here’s part three in the series. This one’s about sharding, horizontal scaling, and client versus server complexity.

Here’s how Azure’s new availability zones change the way highly available apps can be designed on Azure.

The Meltdown patch seems to be having a disproportionate impact on Redis performance. Here’s Grab’s story of how they figured out what was up and what they did to deal with it.

I don’t often do the Twitter thing, but this chain by Charity Majors is worth reading. Is that what they call it? A chain?

Google on the advantages of Cloud Spanner’s strong consistency and why to use it. I’m still looking out for an explanation of what the downside to Spanner is…

Just to be clear, this is about how critical it is that Facebook keep their machine learning applications running, rather than using machine learning to design disaster recovery solutions.

This article is about useful error messages, which are important both for the customer experience and for operations. I’m not sure what really qualifies as a “mainframe” these days, though….

LinkedIn is open-sourcing two tools that they use for troubleshooting during incidents. Fossor automates running data-gathering tools, and Ascii Etch displays graphs using ASCII art.

Outages

  • LastPass
  • Slack
  • Spotify
  • Bitbucket
    • Bitbucket has had severe performance problems due to a failure in their storage layer.
  • Kraken (cryptocurrency exchange)
    • This appears to have been a scheduled upgrade that blew up in complexity, preventing Kraken from coming back up for two days. From the article:

      Most astonishing of all, about 36 hours after the upgrade began, Kraken apparently sent their engineers home to take a nap!

      Not that astonishing! Tired engineers make mistakes, after all.

  • Missile threat alert for Hawaii a false alarm
    • There’s so much more to this story than we’ve been told, and I really wish I could be a fly on the wall during the retrospective.

SRE Weekly Issue #104

Well, that was a fun week.  I hope all of you have had a chance for a rest after any hectic patching you might have been involved in.

SPONSOR MESSAGE

Curious about the state of on-call, but don’t have a ton of time to do the research? VictorOps has gathered the most important stats in one place for you to skim. http://try.victorops.com/SREWeekly/OnCallStats

Articles

Local Rationale: the reasoning and context behind a decision that an operator made. Here’s Todd Conklin reminding us to find out what was really going on when the benefit of hindsight makes a decision seem irrational.

In part two of the series I linked to last week, Tyler Treat introduces data replication strategies, including replicating data to all replicas before returning versus just to a quorum.
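
In case you need a refresher on the quorum variant (this is textbook material, not something specific to Tyler’s design): with N replicas, a write acknowledged by W of them and a read that consults R of them are guaranteed to overlap on at least one replica whenever R + W > N.

    # Toy illustration of the quorum overlap rule.
    def quorum_overlaps(n, w, r):
        return r + w > n

    print(quorum_overlaps(n=3, w=2, r=2))  # True: every read shares a replica with every write
    print(quorum_overlaps(n=3, w=1, r=1))  # False: a read can miss the latest write entirely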

Here’s something I wasn’t aware of: hospitals have their own version of the ICS.

In this blogpost, we discuss why you should create a policy on how SREs and devs respond to SLO violations, and provide some ideas for the structure and components of that policy.
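
For a sense of the kind of number such a policy tends to revolve around, here’s a hypothetical error budget calculation over a 30-day window:

    # The unavailability an SLO permits over a window, in minutes.
    def error_budget_minutes(slo, window_days=30):
        return (1 - slo) * window_days * 24 * 60

    print(error_budget_minutes(0.999))   # ~43.2 minutes
    print(error_budget_minutes(0.9999))  # ~4.3 minutes

Whether burning through that budget triggers a feature freeze, extra SRE staffing, or just a conversation is exactly what the policy spells out.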

Now this is neat. This research team pings basically the entire internet all the time and can track outages across the globe. They can see things like Egypt shutting down Internet access for all of its citizens and the effects of hurricanes.

This is a summary of a couple of talks from Influx Days. I especially like the bit about Baron Schwartz’s talk on the pitfalls of anomaly detection.

Meltdown is especially scary because the fix has the potential to significantly impact performance.

Outages
