SRE Weekly Issue #83


The definitive guide for DevOps Post-Incident Reviews (AKA – Postmortems). Learn why traditional methods don’t work – and why fast incident response isn’t enough. Download your free copy of the 90+ page eBook from O’Reilly Media and VictorOps.


Decision fatigue is the diminishment of certain mental faculties after making many decisions. It can cause incidents, and just as importantly, it can make incident response more difficult. After reading this article, I’m wondering if I should be asking incident responders to stop and drink a glass of orange juice before making a tough call during an incident.

Here’s an interesting debugging session that plumbs some of the more obscure depths of TCP.

What does DR look like if your system is serverless? How do you manage performance if you don’t control the thing that loads (and hopefully pre-caches) your code?

The new book on incident response from the folks at Blackrock3 has arrived! They draw on their years of fire incident response experience to teach us how to resolve outages. I had the privilege of attending one of Blackrock3’s 2-day training sessions last week and I highly recommend it.

I like the idea of focusing on reducing customer pain points, even if they’re not directly due to bugs. After all, reliability is all about the customer experience.

Netflix’s ChAP tests a target microservice by creating experimental and control clusters and routing a small portion of traffic to them.
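To make the experiment/control idea concrete, here is a minimal sketch of how a router might split traffic. It is not Netflix's implementation: the function name `assign_cluster`, the `sample_fraction` parameter, and the hash-bucketing approach are all assumptions chosen for illustration.

```python
import hashlib

BUCKETS = 10_000  # resolution of the traffic split (hypothetical choice)


def assign_cluster(request_id: str, sample_fraction: float = 0.01) -> str:
    """Pick a cluster for a request: 'experiment', 'control', or 'production'.

    sample_fraction is the share of traffic sent to EACH of the two small
    clusters, so 0.01 sends roughly 1% to experiment and 1% to control.
    """
    # Hash the request ID so the same request always lands in the same bucket.
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % BUCKETS
    threshold = round(sample_fraction * BUCKETS)

    if bucket < threshold:
        return "experiment"  # cluster where the failure is injected
    if bucket < 2 * threshold:
        return "control"     # identical cluster, no failure injected
    return "production"      # everything else is untouched
```

Comparing the experiment cluster against an equally sized control cluster, rather than against all of production, keeps the two samples comparable.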

Microservice-based architecture is great, right? The problem is that the fan-out of backend requests can create an amplification vector for a DDoS attack. A small, carefully constructed API call from an attacker can result in a massive number of requests to services in the backend, taking them down.
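A rough back-of-the-envelope calculation shows how quickly fan-out adds up. The sketch below assumes a simple tiered call tree in which every request at one tier makes the same number of calls to the next; the function name and the fan-out numbers are hypothetical, not taken from the linked article.

```python
def backend_requests(fanout_per_tier):
    """Total backend calls triggered by ONE request at the edge.

    fanout_per_tier[i] is how many calls each request at tier i makes
    to tier i + 1 (a simplifying assumption: uniform fan-out per tier).
    """
    total = 0
    in_flight = 1  # start with the single edge request
    for fanout in fanout_per_tier:
        in_flight *= fanout  # requests arriving at the next tier
        total += in_flight
    return total


# One API call that fans out 10x, then 5x, then 3x:
print(backend_requests([10, 5, 3]))  # 10 + 50 + 150 = 210
```

Under these assumptions, an attacker sending 1,000 such calls per second would generate 210,000 backend requests per second.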

The latest from Mathias Lafeldt is this article about post-hoc learning. He draws on Zwieback and Cook, reminding us that both success and failure are normal circumstances in complex systems.

It’s important to understand that every outcome, successful or not, is the result of a gamble.

Remember Awesome SRE? The same author, Pavlos Ratis, has pulled together a ton of links on Chaos Engineering. Thanks, Pavlos!

He’s also compiled this set of postmortem templates, drawn from various sources. He’s unstoppable!

What a great idea, and I wish I’d known about it earlier! Pingdom uses their aggregate monitoring data to create a live map of the internet. Might be useful for those big events like the Dyn DDoS or the S3 outage.

Last week, I reported on the disaster that was Niantic’s Pokémon Go live event. Verizon wants to assure us that it wasn’t a capacity issue on their part.


  • EC2 (us-east-1)
    • Between 6:47 AM and 7:10 AM PDT we experienced increased launch failures for EC2 Instances, degraded EBS volume performance and connectivity issues for some instances in a single Availability Zone in the US-EAST-1 Region. The issue has been resolved and the service is operating normally.

      This one seems to have affected several companies including Heroku and Rollbar.

  • Marketo
    • Marketo failed to renew their domain name registration, reportedly due to a failure in their automated tooling.
  • Instagram
  • Report on July 7, 2017 incident | Gandi News
    • Here’s one I missed from earlier this month.

      In all, 751 domains were affected by this incident, which involved an unauthorized modification of the name servers [NS] assigned to the affected domains that then forwarded traffic to a malicious site exploiting security flaws in several browsers.

      Thanks to an anonymous reader for this one.

  • Threat Stack Status – Config Audit Database Maintenance
    • Another one I missed. This one appears to be a maintenance that went wrong.

      Thanks to an anonymous reader for this one.

Updated: July 30, 2017 — 11:13 pm
A production of Tinker Tinker Tinker, LLC