SRE Weekly Issue #127

It’s a jam-packed issue this week!  After a few light issues, suddenly everyone decided to publish awesome SRE-related content all at once.  Nice work, folks!


Creating on-call schedules for your SRE team(s) can be challenging. We’ve put together a short list of tips, tricks, and tools you can use to better organize your on-call rotations and help your SRE efforts:


Visa wrote a letter to the Chair of the Treasury Committee of the UK House of Commons, explaining their outage from a few weeks ago and answering the questions the committee posed. The good bits are in the first few pages, and the answers to the questions mostly reiterate them. The answer to the last question, about steps to prevent recurrence, has some additional detail.

[…] a component within a switch in our primary data centre suffered a very rare partial failure which prevented the backup switch from activating.


This is really nifty!

The website has two sections: Country Statistics and Traffic Shifts.

Such an awesome idea:

@eanakashima: Alerting on spikes in status page views: so wrong, or so right?

Emily Nakashima
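
For anyone tempted to try this, here is a minimal sketch of one way such an alert could work: compare the current view count against a multiple of the recent average. The function name `isSpike`, the threshold factor, and the sample numbers are all made up for illustration; the tweet does not describe an implementation.

```go
package main

import "fmt"

// isSpike reports whether current exceeds factor times the mean of history.
// It returns false when there is no history to compare against.
// (Hypothetical helper; the name and the thresholding rule are assumptions.)
func isSpike(history []float64, current, factor float64) bool {
	if len(history) == 0 {
		return false
	}
	sum := 0.0
	for _, v := range history {
		sum += v
	}
	mean := sum / float64(len(history))
	return current > factor*mean
}

func main() {
	// Made-up baseline of status page views per minute.
	viewsPerMinute := []float64{12, 9, 11, 10, 8}

	fmt.Println("14 views/min is a spike:", isSpike(viewsPerMinute, 14, 3))
	fmt.Println("90 views/min is a spike:", isSpike(viewsPerMinute, 90, 3))
}
```

A fixed multiple of the mean is the simplest possible rule. A real alert would probably need to account for daily traffic patterns and for the views you generate yourself while responding.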

How (and why) should an SRE team communicate with Dev and the rest of the organization? I especially enjoy the section on how communicating outwardly helps SRE.


o11ycon has posted a Call for Failures:

Send us a slide or two, including a graph or other visual artifact of observability that represents the worst day of your (professional) life. Or a graph that drives home some important, deeply unexpected, or just plain interesting point about your systems.


There’s a great description of their current setup, but what really makes this article awesome is the explanation of what was wrong with their old system and why they replaced it.

Shlomi Noach — GitHub

Highlights of this article:

  • description of the pros and cons of two techniques for automating database migrations
  • a surprising number of instances of the word “tentacle”

Hen Peretz — BlazeMeter

Rather than firing the driver that caused a rear-end collision, this company looked deeper and found an underlying flaw in their procedures.

The organization had unknowingly created a system that was risk-promoting, rather than risk-averse.

Larry Boxman and Paul LeSage — Journal of Emergency Medical Services


  • npm (Node.js package manager)
    • This status posting is minimal, but there’s a deeper story at play here. There’s this article:

      Twitter bought an anti-harassment startup and immediately shut it down

      And this tweet by Laurie Voss (npmjs COO):

      @seldo: A vendor notified us of their acquisition at 6am this morning and shut down their APIs 30 minutes later, creating a production outage for npm (package publishes and user registrations). The sheer unprofessionalism of this is blowing my mind.


  • Datadog
    • These delays may result in “no data” alert conditions for Metric Monitors, to avoid spurious alerts we’ve temporarily disabled these alert types.

    • In the midst of suffering a major outage to their DIRECTV NOW OTT service, AT&T announced the official launch of AT&T WatchTV […]

  • Algeria
    • Algeria switched off its internet on Wednesday in an attempt to prevent cheating on exams.

      Algeria’s blackout can be seen in Oracle’s Internet Intelligence project, which maps web access globally.

      Rory Smith — CNN

  • Atlassian Statuspage
    • We have identified the issue as errant traffic from a single customer and have taken action to mitigate the issue, which appears to only affect status pages. The Management Portal is working as normal.

  • New Relic
  • GCP Networking in us-east1
  • Azure North Europe region
    • An environment control system failure caused a huge rise in humidity, taking down some equipment. Huge shout-out to the Microsoft employee who reached out to me to let me know that they saw my call for help last week and forwarded it on to the folks responsible for the status page!

SRE Weekly Issue #126


Alert fatigue will kill team morale. Take a look at some great ways to avoid alert fatigue and why it’s important for employee health and incident resolution speed:


Our friends in the GrabFood team now save up to 70% development time on creating a new service. We have also recorded improvements in stability and availability of our services.

Karen Kue and Michael Cartmell — Grab

Some tips on surviving peak traffic as we head into World Cup season. I like the discussion in #10 (load testing): accurately testing your CDN is all but impossible.

Hadar Weiss — Peer 5 (CDN)

This is a video recording of a talk by Charity Majors at Monkigras 2018. She has a lot of awesome stuff to say about making on-call enjoyable and owning your code, including this gem:

Babies, by the way, are engineered by evolution to be too cute for you to want to kill them. Your code is not.

Charity Majors — Honeycomb

A power disruption occurred at our service provider resulting in a number of instances going offline. Heroku databases running on these instances were impacted.

Presumably this was the us-east-1 power issue I reported on in Issue 124.

The first article in this new series is about the evolution of the Network Engineer into a Network Reliability Engineer. It’s part of the broader breakdown of silos with the goal of understanding holistic reliability.

Michael Kehoe

I hadn’t realized that GDPR has provisions related to site/service reliability.

Theresa Abbamondi — Netscout

To shamelessly steal a line from this recorded talk, it’s very rarely the right thing for your observability system’s scale to match that of the system it’s observing. To avoid that, you need to throw away some event data rather than storing and indexing everything. How do you do that while still achieving functioning observability?

Ben Hartshorne — Honeycomb
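
One common way to throw away events without losing the ability to count them is to keep one event in every N and record N alongside it. The sketch below illustrates that general idea only; it is not necessarily the approach described in the talk, and `SampledEvent`, `sampleEveryNth`, and `estimateTotal` are hypothetical names.

```go
package main

import "fmt"

// SampledEvent is a hypothetical stored event. SampleRate records how many
// original events this one stands in for.
type SampledEvent struct {
	Name       string
	SampleRate int
}

// sampleEveryNth keeps the first event of every run of n events and tags
// each kept event with rate n. It assumes n is greater than zero.
func sampleEveryNth(names []string, n int) []SampledEvent {
	var kept []SampledEvent
	for i, name := range names {
		if i%n == 0 {
			kept = append(kept, SampledEvent{Name: name, SampleRate: n})
		}
	}
	return kept
}

// estimateTotal approximates the original event count by summing the
// sample rates of the events that were kept.
func estimateTotal(kept []SampledEvent) int {
	total := 0
	for _, e := range kept {
		total += e.SampleRate
	}
	return total
}

func main() {
	// Build 1000 identical placeholder events.
	names := make([]string, 1000)
	for i := range names {
		names[i] = "request"
	}

	kept := sampleEveryNth(names, 10)
	fmt.Println("stored events:", len(kept))
	fmt.Println("estimated original count:", estimateTotal(kept))
}
```

Keeping every Nth event is deterministic and easy to reason about. Production samplers more often choose randomly, and may vary the rate by event type so that rare events such as errors are kept more often.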

I’m looking forward to seeing where this article series goes. Database changes can be a huge reliability risk, and getting them right is critical.

Bob Walker — Octopus Deploy


  • Azure south-central US region
    • A load spike in a backend storage system caused impact across a range of Azure services, according to the RCA linked above.

      Actually, I’ve linked to their generic “status history” page, since that seems to be as specific as I can get. Readers from Microsoft, perhaps you could ask the folks that run the Azure status page to create dedicated permalinks for each incident, or at least for each RCA? Even an anchor link in the status history page would be super-awesome!

  • New Relic infrastructure alerting
  • Travis CI
  • WhatsApp
  • American Airlines
  • Instagram
  • Google Compute Engine
    • While instances were stopped (shut down), newly-launched instances were allowed to take their IPs. The stopped instances then failed on startup due to the IP conflicts. The situation lasted for around 20 hours.
  • Optus Sport
    • World Cup fans had issues watching through Optus. World Cup streaming traffic is massive this time around.
  • Apple Maps
  • Netflix
  • .my TLD

SRE Weekly Issue #125


Now is the time to start investing in DevOps. We sat down with Forrester’s Chris Condo to get an industry expert’s opinions on this exact topic:


Go’s HTTP client defaults to no timeout. Making HTTP requests with no timeout is rarely a good idea and has been at the heart of many incidents I’ve been involved in.

Nathan Smith

A few times now, I’ve made offhand comments about how Spanner promises a lot and I’d like to know what the catches are. Here they are! In all fairness, they’re pretty reasonable constraints to work with.

Niel Markwick and Robert Saxby — Google

I’d refer to this as more of a retrospective template, but in any case, it’s pretty nifty!

Michael Kehoe

This is a news report rather than a technical deep-dive. It’s got some pretty interesting (and amusing) stories from various MMOs.

Alex Wiltshire — PC Gamer

Here’s how Netflix does observability.

Kevin Lew and Sangeeta Narayanan — Netflix

Looks like I’ve missed a few incident followup posts from Heroku in the past couple months:

#1548: Increased errors in starting dynos
#1535: Post-incident Dyno Restarts
#1459: Scheduled API Maintenance on Monday March 26 at 23:00 UTC (4:00 PM PT)
#1413: Dyno Availability
#1414: Heroku Connect Sync Delays
#1395: Heroku Connect Availability
#1393: Heroku Connect unavailable
#1379: Dyno boot issues


SRE Weekly Issue #124

Today’s my birthday!  Bit of a short issue this week as a result, but lots of interesting outages.


Support your DevOps and SRE efforts by implementing on-call tools that make people happy. With the right on-call tools, you can continuously deliver while maintaining system resiliency. Read more to learn about identifying good on-call tools:


These terms are not interchangeable. Learn about the ins and outs of fault tolerance to highlight the differences between the two concepts.

Fernando Doglio

What caught my eye in this article: AIIMS, the Australasian InterService Incident Management System. It’s the equivalent of the Incident Management System (IMS) in the US.

Ian Jones


SRE Weekly Issue #123

I hope you all had a happy GDPR day!  SRE Weekly’s privacy policy has not changed.  Folks that subscribed by email would have seen a message that I only share your email address with MailChimp, and that’s the way it will stay.

You can unsubscribe at any time by following the link at the bottom of the email, but if you have any trouble at all with unsubscribing, please don’t hesitate to email me and I’ll take care of it for you.


Maintaining reliability through cloud migration can be difficult. Learn how implementing an incident management solution can make migration faster, reduce costs, and make SRE-life easier:


The system is highly configurable, allowing fine-grained A/B testing of failures at all levels of the microservice call tree.

Ephemeral port exhaustion can really ruin your day. Read this to learn how to deal with it, how to detect it before you have problems, and why you might run out of ports sooner than you expect.

Will Sewell — Pusher
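
To get a feel for why you might run out sooner than you expect, here is a rough back-of-the-envelope sketch. It assumes the common Linux default ephemeral range of 32768 to 60999 and a 60-second TIME_WAIT hold on each closed connection, and it assumes all connections go to a single destination address and port. These numbers are assumptions for illustration rather than figures from the article, and `maxConnRate` is a hypothetical helper.

```go
package main

import "fmt"

// maxConnRate estimates how many new outbound connections per second to a
// single destination can be sustained before the ephemeral port range runs
// out, assuming each closed connection holds its port for holdSeconds.
func maxConnRate(low, high, holdSeconds int) float64 {
	ports := high - low + 1 // the range is inclusive at both ends
	return float64(ports) / float64(holdSeconds)
}

func main() {
	// Assumed Linux defaults: ports 32768-60999, 60 seconds in TIME_WAIT.
	rate := maxConnRate(32768, 60999, 60)
	fmt.Printf("sustainable new connections per second: %.0f\n", rate)
}
```

Under those assumptions the ceiling is only a few hundred new connections per second to any one destination, which is why connection reuse matters so much for busy clients.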

This incident report from 2013 is a great read. It’s really two incidents in one, including an analysis of why a remediation task from the first wasn’t completed in time to prevent the second.

David Poblador i Garcia — Spotify

There are a few nice tidbits in this interview, including this one:

[…] the health of the system no longer matters.  We’ve entered an era where what matters is the health of each individual event, or each individual user’s experience […]

Daniel Bryant – InfoQ

This article introduces canary deployments and also discusses their potential downsides.

Erik [surname not given] —

Lots of great detail in this announcement, including an analysis of how (and why) they designed their load balancer to function entirely in userspace without a kernel bypass mechanism.

Nikita Shirokov and Ranjeeth Dasineni — Facebook

Metrics are great, right? Except sometimes they’re not, when the metric collection itself adds enough load to impair the system.

Jonathan Brown — Wallaroo


  • Google BigQuery
    • Click through for the full incident report.

      Configuration changes being rolled out on the evening of the incident were not applied in the intended order.

  • GCP Networking in us-east4
    • Here’s some detail on the BGP issue that took down us-east4 last week.
  • Google StackDriver
    • It’s a hat trick of GCP incident followup reports. Happy reading!
  • Slack
  • Bank of New Zealand
  • Twitter
  • National Australia Bank
    • This outage is particularly notable because the bank has stated their intention to compensate customers for their losses, such as estimated lost revenues from inability to complete sales transactions.