SRE Weekly Issue #177

A message from our sponsor, VictorOps:

[Free Webinar] VictorOps partnered with Catchpoint to put death to downtime with actionable monitoring and incident response practices. See how SRE teams are being more proactive toward service reliability:

http://try.victorops.com/sreweekly/death-to-downtime

Articles

The point of this thread is to bring attention to the notion that our reactions to surprising events are the fuel that effectively dictates what we learn from them.

John Allspaw — Adaptive Capacity Labs

This article is an attempt to classify the causes of major outages at the big three cloud providers (AWS, Azure, and GCP).

David Mytton

It was, wasn’t it? Here’s a nice summary of the recent spate of unrelated major incidents.

Zack Whittaker — TechCrunch

Calculating CIRT (Critical Incident Response Time) involves ignoring various types of incidents to try to get a number that is more representative of the performance of an operations team.

Julie Gunderson, Justin Kearns, and Ophir Ronen — PagerDuty

There is so much great detail in this follow-up article about Cloudflare’s global outage earlier this month. Thanks, folks!

John Graham-Cumming — Cloudflare

Outages

  • Statuspage.io
  • NS1
  • PagerDuty
  • Nordstrom
    • Nordstrom’s site went down at the start of a major sale.
  • Twitter
  • Heroku
  • Honeycomb
    • Honeycomb had an 8-minute outage preceded by 4 minutes of degradation. Click through to find out how their CI pipeline surprised them and what they did about it.
  • LinkedIn
  • Australian Tax Office
  • Reddit
  • Stripe
    • […] two different database bugs and a configuration change interacted in an unforeseen way, causing a cascading failure across several critical services.

      Click through for Stripe’s full analysis.

  • Discord
Updated: July 14, 2019 — 10:28 pm
A production of Tinker Tinker Tinker, LLC