SRE Weekly Issue #196

A message from our sponsor, VictorOps:

From everyone at VictorOps, we wanted to wish you a happy holiday season and give thanks for this SRE community. So, we put together this fun post to highlight the highs and lows of being on-call during the holidays.


My favorite:

Don’t wait until the post-mortem; consider doing a “pre-mortem” brainstorm to identify potential issues.

John Agger — Fastly

Full disclosure: Fastly is my employer.

Let’s Encrypt deals with pretty heavy traffic. This post goes into what it takes for them to run a successful Certificate Transparency log.

Phil Porada — Let’s Encrypt

In this air traffic radio recording from Las Vegas (2018), the air traffic controller becomes impaired and starts issuing confusing and dangerously incorrect directives. The pilots work together to correct the situation and no accidents occur. This is a classic example of a resilient system.

I don’t normally link to posts that heavily cover product offerings, but this one has some real gems. I especially like the discussion toward the end of the importance of analyzing an incident shortly after it happens.

John Allspaw — Adaptive Capacity Labs

This is a striking analogue for an infrastructure with many unactionable alerts.

The commission has estimated that of the thousands of alarms going off throughout a hospital every day, an estimated 85 to 99 percent do not require clinical intervention.

Melissa Bailey — The Washington Post

A fascinating look at the early days of Etsy, in which a system is rewritten, the rewrite blows up, the rewrite is rewritten, and finally that rewrite is rewritten again. Ouch.

Dan McKinley (@mcfunley)

If your DR test involves carefully contrived circumstances that don’t match the real world, then it’s not a real test. Point your upper management at this article if you need to argue for true DR testing.

Ivan Pepelnjak


Updated: December 1, 2019 — 8:53 pm