SRE Weekly Issue #127

It’s a jam-packed issue this week!  After a few light issues, suddenly everyone decided to publish awesome SRE-related content all at once.  Nice work, folks!


Creating on-call schedules for your SRE team(s) can be challenging. We’ve put together a short list of tips, tricks, and tools you can use to better organize your on-call rotations and help your SRE efforts:


Visa wrote a letter to the Chair of the Treasury Committee of the UK House of Commons, explaining their outage from a few weeks ago and answering the questions the Committee posed. The good bits are in the first few pages, and the answers to the questions mostly reiterate them. The answer to the last question, about steps to prevent recurrence, has some additional detail.

[…] a component within a switch in our primary data centre suffered a very rare partial failure which prevented the backup switch from activating.


This is really nifty!

The website has two sections: Country Statistics and Traffic Shifts.

Such an awesome idea:

@eanakashima: Alerting on spikes in status page views: so wrong, or so right?

Emily Nakashima
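
If you want to try the idea out, here's a minimal sketch of one way to flag a spike in status page views. It is not how Emily or anyone else actually does it: the function name `find_spikes`, the per-minute counts, and the `window`, `threshold`, and `min_views` defaults are all assumptions chosen for illustration. It compares each minute's view count against the mean and standard deviation of the preceding minutes.

```python
from collections import deque
from statistics import mean, pstdev


def find_spikes(views_per_minute, window=30, threshold=3.0, min_views=20):
    """Return the indices of minutes whose view count looks like a spike.

    Hypothetical helper: a minute is flagged when its count is at least
    `min_views` and more than `threshold` standard deviations above the
    mean of the previous `window` minutes.
    """
    history = deque(maxlen=window)  # trailing baseline of recent counts
    spikes = []
    for i, count in enumerate(views_per_minute):
        # Only judge a minute once a full window of history exists.
        if len(history) == window:
            baseline = mean(history)
            # Fall back to 1.0 so perfectly flat traffic doesn't make
            # every tiny bump look infinitely unusual.
            spread = pstdev(history) or 1.0
            if count >= min_views and count > baseline + threshold * spread:
                spikes.append(i)
        history.append(count)
    return spikes


if __name__ == "__main__":
    quiet = [5, 6, 4, 5, 7, 5, 6, 4, 5, 6] * 3  # 30 ordinary minutes
    print(find_spikes(quiet + [80] + [5] * 5))
```

The `min_views` floor is there so that a jump from 1 view to 6 on a sleepy night doesn't page anyone. Note that a spike is added to the baseline once it has been seen, so a sustained surge stops alerting after a while; whether that's what you want is a judgment call.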

How (and why) should an SRE team communicate with Dev and the rest of the organization? I especially enjoy the section on how communicating outwardly helps SRE.


o11ycon has posted a Call for Failures:

Send us a slide or two, including a graph or other visual artifact of observability that represents the worst day of your (professional) life. Or a graph that drives home some important, deeply unexpected, or just plain interesting point about your systems.


There’s a great description of their current setup, but what really makes this article awesome is the explanation of what was wrong with their old system and why they replaced it.

Shlomi Noach — GitHub

Highlights of this article:

  • description of the pros and cons of two techniques for automating database migrations
  • a surprising number of instances of the word “tentacle”

Hen Peretz — BlazeMeter

Rather than firing the driver that caused a rear-end collision, this company looked deeper and found an underlying flaw in their procedures.

The organization had unknowingly created a system that was risk-promoting, rather than risk-averse.

Larry Boxman and Paul LeSage — Journal of Emergency Medical Services


  • npm (Node.js package manager)
    • This status posting is minimal, but there’s a deeper story at play here. There’s this article:

      Twitter bought an anti-harassment startup and immediately shut it down

      And this tweet by Laurie Voss (npmjs COO):

      @seldo: A vendor notified us of their acquisition at 6am this morning and shut down their APIs 30 minutes later, creating a production outage for npm (package publishes and user registrations). The sheer unprofessionalism of this is blowing my mind.


  • Datadog
    • These delays may result in “no data” alert conditions for Metric Monitors, to avoid spurious alerts we’ve temporarily disabled these alert types.

  • AT&T DIRECTV NOW
    • In the midst of suffering a major outage to their DIRECTV NOW OTT service, AT&T announced the official launch of AT&T WatchTV […]

  • Algeria
    • Algeria switched off its internet on Wednesday in an attempt to prevent cheating on exams.

      Algeria’s blackout can be seen in Oracle’s Internet Intelligence project, which maps web access globally.

      Rory Smith — CNN

  • Atlassian Statuspage
    • We have identified the issue as errant traffic from a single customer and have taken action to mitigate the issue, which appears to only affect status pages. The Management Portal is working as normal.

  • New Relic
  • GCP Networking in us-east1
  • Azure North Europe region
    • An environment control system failure caused a huge rise in humidity, taking down some equipment. Huge shout-out to the Microsoft employee who reached out to me to let me know that they saw my call for help last week and forwarded it on to the folks responsible for the status page!
Updated: June 24, 2018 — 10:09 pm