SRE Weekly Issue #177

Articles

The point of this thread is to bring attention to the notion that our reactions to surprising events are the fuel that effectively dictates what we learn from them.

John Allspaw — Adaptive Capacity Labs

What are the common causes of cloud outages?

This article is an attempt to classify the causes of major outages at the big three cloud providers (AWS, Azure, and GCP).

David Mytton

It was a really bad month for the internet

It was, wasn’t it? Here’s a nice summary of the recent spate of unrelated major incidents.

Zack Whittaker — TechCrunch

MTTR is dead, long live CIRT

Calculating CIRT (Critical Incident Response Time) involves excluding various types of incidents to arrive at a number that better represents the performance of an operations team.

Julie Gunderson, Justin Kearns, and Ophir Ronen — PagerDuty

Details of the Cloudflare outage on July 2, 2019

There is so much great detail in this follow-up article about Cloudflare’s global outage earlier this month. Thanks, folks!

John Graham-Cumming — Cloudflare

Outages

Statuspage.io
NS1
PagerDuty
Nordstrom
- Nordstrom’s site went down at the start of a major sale.
Twitter
Heroku
Honeycomb
- Honeycomb had an 8-minute outage preceded by 4 minutes of degradation. Click through to find out how their CI pipeline surprised them and what they did about it.
LinkedIn
Australian Tax Office
Reddit
Stripe
- […] two different database bugs and a configuration change interacted in an unforeseen way, causing a cascading failure across several critical services.
  
  Click through for Stripe’s full analysis.
Discord
