SRE Weekly Issue #196

Articles

My favorite:

Don’t wait until the post-mortem; consider doing a “pre-mortem” brainstorm to identify potential issues.

John Agger — Fastly

Full disclosure: Fastly is my employer.

Let’s Encrypt deals with pretty heavy traffic. This post goes into what it takes for them to run a successful Certificate Transparency log.

Phil Porada — Let’s Encrypt

Impaired Air Traffic Controller

In this air traffic radio recording from Las Vegas (2018), the air traffic controller becomes impaired and starts issuing confusing and dangerously incorrect directives. The pilots work together to correct the situation and no accidents occur. This is a classic example of a resilient system.

The Requirements For Aftermath Projects

I don’t normally link to posts that heavily cover product offerings, but this one has some real gems. I especially like the discussion toward the end of the importance of analyzing an incident shortly after it happens.

John Allspaw — Adaptive Capacity Labs

Hospital alarms prove a noisy misery for patients: ‘I feel like I’m in jail.’

This is a striking analogue for an infrastructure with many unactionable alerts.

The commission has estimated that of the thousands of alarms going off throughout a hospital every day, an estimated 85 to 99 percent do not require clinical intervention.

Melissa Bailey — The Washington Post

Twitter: Dan McKinley on the history of Etsy

A fascinating look at the early days of Etsy, in which a system is rewritten, the rewrite blows up, the rewrite is rewritten, and finally that is rewritten again. Ouch.

Dan McKinley (@mcfunley)

Disaster Recovery Test Faking: Another Use Case for Stretched VLANs

If your DR test involves carefully contrived circumstances that don’t match the real world, then it’s not a real test. Point your upper management at this article if you need to argue for true DR testing.

Ivan Pepelnjak

Outages

GitHub
BNZ (bank)
Bank of Ireland
Rakuten
IndiGo (airline)
Tinder
Amino App
Twitter
Costco
Nordstrom Rack
Facebook and Instagram
- This one happened on the US’s Thanksgiving Day.
Tesla App
ABC News website
- An outage resulted in articles from 2011 being served to visitors.
Heroku
Squarespace
NatWest Bank
- Thanks to Dr. Richard Cook for this one.
