SRE Weekly Issue #127

It’s a jam-packed issue this week! After a few light issues, suddenly everyone decided to publish awesome SRE-related content all at once. Nice work, folks!

Articles

Letter from Visa regarding service disruption, 15 June 2018

Visa wrote a letter to the Chair of the Treasury Committee of the UK House of Commons, explaining their outage from a few weeks ago and answering the questions the Committee posed. The good bits are in the first few pages, and the answers to the questions mostly reiterate them. The answer to the last question, about steps to prevent recurrence, has some additional detail.

[…] a component within a switch in our primary data centre suffered a very rare partial failure which prevented the backup switch from activating.

Visa

Introducing the Internet Intelligence Map

This is really nifty!

The website has two sections: Country Statistics and Traffic Shifts.

Emily Nakashima on Twitter

Such an awesome idea:

@eanakashima: Alerting on spikes in status page views: so wrong, or so right?

Emily Nakashima
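
For anyone curious what that might look like, here is a minimal sketch, assuming you already collect status page views per minute somewhere. The function name `is_spike`, the threshold of 3 standard deviations, and the sample numbers are all hypothetical, not anything from the tweet.

```python
from statistics import mean, stdev


def is_spike(history, current, threshold=3.0, min_samples=5):
    """Return True if `current` sits more than `threshold` standard
    deviations above the mean of `history` (a simple z-score check)."""
    # With too little history there is no meaningful baseline yet.
    if len(history) < min_samples:
        return False
    baseline = mean(history)
    spread = stdev(history)
    # A perfectly flat history has zero spread; treat any increase as a spike.
    if spread == 0:
        return current > baseline
    return (current - baseline) / spread > threshold


# Hypothetical per-minute view counts for a status page.
views_per_minute = [12, 15, 11, 14, 13, 12, 16]

print(is_spike(views_per_minute, 14))  # an ordinary minute
print(is_spike(views_per_minute, 90))  # many people suddenly checking the page
```

A real alert would probably also want to account for daily traffic patterns, so treat this as a starting point only.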

Communicating with twits: How to minimize friction between Dev and SRE

How (and why) should an SRE team communicate with Dev and the rest of the organization? I especially enjoy the section on how communicating outwardly helps SRE.

HostedGraphite

Observability Show & Tell: show us your failure!

o11ycon has posted a Call for Failures:

Send us a slide or two, including a graph or other visual artifact of observability that represents the worst day of your (professional) life. Or a graph that drives home some important, deeply unexpected, or just plain interesting point about your systems.

o11ycon

MySQL High Availability at GitHub

There’s a great description of their current setup, but what really makes this article awesome is the explanation of what was wrong with their old system and why they replaced it.

Shlomi Noach — GitHub

Automated Database Deployments Iteration Zero

Highlights of this article:

- a description of the pros and cons of two techniques for automating database migrations
- a surprising number of instances of the word “tentacle”

Hen Peretz — BlazeMeter

Just Culture & High Reliability: Steps to a More Reliable Organization

Rather than firing the driver who caused a rear-end collision, this company looked deeper and found an underlying flaw in their procedures.

The organization had unknowingly created a system that was risk-promoting, rather than risk-averse.

Larry Boxman and Paul LeSage — Journal of Emergency Medical Services

Outages

npm (Node.js package manager)
- This status posting is minimal, but there’s a deeper story at play here. There’s this article:
  Twitter bought an anti-harassment startup and immediately shut it down
  
  And this tweet by Laurie Voss (npmjs COO):
  
  @seldo: A vendor notified us of their acquisition at 6am this morning and shut down their APIs 30 minutes later, creating a production outage for npm (package publishes and user registrations). The sheer unprofessionalism of this is blowing my mind.
  
  Ouch.
Datadog
- These delays may result in “no data” alert conditions for Metric Monitors, to avoid spurious alerts we’ve temporarily disabled these alert types.
DIRECTV NOW
- In the midst of suffering a major outage to their DIRECTV NOW OTT service, AT&T announced the official launch of AT&T WatchTV […]
Algeria
- Algeria switched off its internet on Wednesday in an attempt to prevent cheating on exams.
  
  Algeria’s blackout can be seen in Oracle’s Internet Intelligence project, which maps web access globally.
  
  Rory Smith — CNN
Atlassian Statuspage (statuspage.io)
- We have identified the issue as errant traffic from a single customer and have taken action to mitigate the issue, which appears to only affect status pages. The Management Portal is working as normal.
New Relic
GCP Networking in us-east1
Azure North Europe region
- An environment control system failure caused a huge rise in humidity, taking down some equipment. Huge shout-out to the Microsoft employee who reached out to me to let me know that they saw my call for help last week and forwarded it on to the folks responsible for the status page!
