SRE Weekly Issue #126


Alert fatigue will kill team morale. Take a look at some great ways to avoid alert fatigue and why it’s important for employee health and incident resolution speed:


Our friends in the GrabFood team now save up to 70% development time on creating a new service. We have also recorded improvements in stability and availability of our services.

Karen Kue and Michael Cartmell — Grab

Some tips on surviving peak traffic as we head into World Cup season. I like the discussion in #10 (load testing): accurately testing your CDN is all but impossible.

Hadar Weiss — Peer 5 (CDN)

This is a video recording of a talk by Charity Majors at Monkigras 2018. She has a lot of awesome stuff to say about making on-call enjoyable and owning your code, including this gem:

Babies, by the way, are engineered by evolution to be too cute for you to want to kill them. Your code is not.

Charity Majors — Honeycomb

A power disruption occurred at our service provider resulting in a number of instances going offline. Heroku databases running on these instances were impacted.

Presumably this was the us-east-1 power issue I reported on in Issue 124.

The first article in this new series is about the evolution of the Network Engineer into a Network Reliability Engineer. It’s part of the broader breakdown of silos with the goal of understanding holistic reliability.

Michael Kehoe

I hadn’t realized that GDPR has provisions related to site/service reliability.

Theresa Abbamondi — Netscout

To shamelessly steal a line from this recorded talk, it’s very rarely the right thing for your observability system’s scale to match that of the system it’s observing. To avoid that, you need to throw away some event data rather than storing and indexing everything. How do you do that while still achieving functioning observability?

Ben Hartshorne — Honeycomb
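
As a rough illustration of the general idea (not taken from the talk), here is a minimal sketch of weighted event sampling: keep about 1 in N events and record N on each kept event, so totals can still be estimated later. The function names, the `sample_rate` field, and the "keep every error, 1 in 10 successes" policy are all hypothetical choices made for this example.

```python
import random


def sample_events(events, rate_for, rng=None):
    """Keep roughly 1 in N events, where N = rate_for(event).

    Each kept event carries its sample rate so that counts can be
    re-weighted later. `rate_for` is a hypothetical policy callback.
    """
    rng = rng or random.Random()
    kept = []
    for event in events:
        rate = max(1, int(rate_for(event)))
        # random() is in [0, 1), so a rate of 1 always keeps the event.
        if rng.random() < 1.0 / rate:
            kept.append({**event, "sample_rate": rate})
    return kept


def estimated_count(kept_events):
    """Estimate the original event count from the sampled events."""
    return sum(e["sample_rate"] for e in kept_events)


def error_biased_rate(event):
    """Assumed policy: keep every error, about 1 in 10 of everything else."""
    return 1 if event["status"] >= 500 else 10


if __name__ == "__main__":
    traffic = [{"status": 500}] * 5 + [{"status": 200}] * 1000
    kept = sample_events(traffic, error_biased_rate)
    print(len(kept), "events stored; estimated total:", estimated_count(kept))
```

The point of storing the rate alongside each event is that rare, interesting events (errors here) survive intact while high-volume, low-value traffic is thinned out, and aggregate counts remain approximately recoverable.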

I’m looking forward to seeing where this article series goes. Database changes can be a huge reliability risk, and getting them right is critical.

Bob Walker — Octopus Deploy


  • Azure south-central US region
    • A load spike in a backend storage system caused impact across a range of Azure services, according to the RCA linked above.

      Actually, I’ve linked to their generic “status history” page, since that seems to be as specific as I can get. Readers from Microsoft, perhaps you could ask the folks that run the Azure status page to create dedicated permalinks for each incident, or at least for each RCA? Even an anchor link in the status history page would be super-awesome!

  • New Relic infrastructure alerting
  • Travis CI
  • WhatsApp
  • American Airlines
  • Instagram
  • Google Compute Engine
    • While instances were stopped (shut down), newly launched instances were allowed to take their IPs. The stopped instances then failed on startup due to the IP conflicts. The situation lasted for around 20 hours.
  • Optus Sport
    • World Cup fans had issues watching through Optus. World Cup streaming traffic is massive this time around.
  • Apple Maps
  • Netflix
  • .my TLD
Updated: June 17, 2018 — 4:13 pm