SRE Weekly Issue #77

I really love that some of you are taking vacations. Preventing burnout is really critical for improving reliability. That said, if you’d please exempt my address from your vacation auto-responder, that’d be super-cool ;)

Articles

Systemic brittleness, reactions to failure, and Conroy’s Law

Last week, I linked to a reddit story of an engineer who was unfairly fired for a mistake on their first day. Dr. Richard Cook picked this up and wrote up a great analysis of the underlying organizational issues.

Thanks to John Allspaw for this one.

Australian Tax Office’s post-incident report on the SAN outages

This was released the week before last, but it took me a while to digest it. The ATO did a very thorough post-analysis on their two outages and released this polished report. I like that they took full responsibility for the outage even though it was an issue with a fully-managed vendor SAN offering, and they clearly sought to learn as much as possible.

Applications of (pin)trace data

Pinterest tech lead Suman Karumuri explains how they use distributed tracing and the benefits it’s brought them.

With these new use cases, we see tracing infrastructure as the third pillar of monitoring our services in addition to metrics and log search systems.

An Imaginary Apology Letter From Your Airline CEO

Frustrated by the public statement from British Airways’s Willie Walsh regarding their major outage, TripWire founder Gene Kim took it upon himself to write an open letter of apology as if he were an airline CEO. It’s pretty great.

NGINX Plus High Availability on AWS

This article explores several options for HA with Nginx: put an ELB in front of it, Route 53 with health checks, or an elastic IP switched either by keepalived or a Lambda function.
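To make the elastic IP option a little more concrete, here is a minimal sketch of the decision a failover script or Lambda function might make. It is not taken from the linked article: the node names, the `pick_eip_owner` and `failover` functions, and the `reassign` hook are all hypothetical, and the hook stands in for whatever actually moves the address (for example, a wrapper around the EC2 AssociateAddress API).

```python
from typing import Callable, Dict, Optional


def pick_eip_owner(current: Optional[str], healthy: Dict[str, bool]) -> Optional[str]:
    """Return the node that should hold the elastic IP.

    Keeps the current owner while it is healthy (to avoid flapping),
    otherwise picks the first healthy node in sorted order, or None
    if no node is healthy.
    """
    if current is not None and healthy.get(current, False):
        return current
    for node in sorted(healthy):
        if healthy[node]:
            return node
    return None


def failover(
    current: Optional[str],
    healthy: Dict[str, bool],
    reassign: Callable[[str], None],
) -> Optional[str]:
    """Move the elastic IP only when the owner actually changes."""
    target = pick_eip_owner(current, healthy)
    if target is not None and target != current:
        # Hypothetical hook: in practice this would call the cloud API
        # that re-associates the address with the target instance.
        reassign(target)
    return target
```

Keeping the healthy incumbent in place is a deliberate choice in this sketch: it means a recovered node does not immediately steal the address back and cause a second interruption.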

On-Calliday: A Guide to Unsucking Your On-Call Experience

I’ve been following GitLab’s blog since their engineer accidentally deleted their database earlier this year, and I’m glad I did. This article touches on all sorts of topics near to my heart: preventing burnout, examining incident response metrics, enforcing vacations, incident command, and having developers go on-call for what they wrote.

The hidden cost of “Dark DR:” The economic argument for active/active operations

The costs associated with running a full-capacity redundant system in a secondary site can be numerous and subtle. Those costs can be especially hard to swallow when expected returns on infrastructure investments prove elusive.

A/B Testing and Beyond: Improving the Netflix Streaming Experience with Experimentation and Data Science

Netflix explains in depth the careful scientific experiments they perform in production in order to improve quality of experience (QoE).

Outages

Google Cloud Services
- 62-minute multiple-zone total internet outage in asia-northeast1. Postmortem linked, including a description of several contributing factors.
  
  We apologize for the impact this issue had on our customers, and especially to those customers with deployments across multiple zones in the asia-northeast1 region. We recognize we failed to deliver the regional reliability that multiple zones are meant to achieve.
Coinbase
YouTube
