General

SRE Weekly Issue #206

lex

February 9, 2020

General

Comments

View on sreweekly.com

Articles

Awakening the Sleeping Mind

All the plans in the world can’t prepare us for every incident, and yet we can excel during incidents anyway. How?

Will Gallego

Pilot Declares Emergency Because Of Extreme Hypoxia

These pilots’ minds were almost literally sleeping. The air traffic controller gave them a command they could execute in their sleep: Descend and Maintain.

In Flight Emergency – Pregnant Pilot Brake Failure!

Along the same lines an incident caused this pilot to nearly forget how to fly, and yet she safely landed the plane with some reassurance by the ATC.

Ten challenges for making automation a ‘team player’ in joint human-agent activity

Today’s choice looks at what it takes for machines to participate productively in collaborations with humans.

Adrian Colyer — The Morning Paper (summary)

Klein et al. — IEEE Computer Nov/Dec 2004 (original paper)

Rehabilitating “you can’t manage what you can’t measure”

A lot of things that are happening in your organization, your system, are largely invisible. And those things, that work, is keeping things up and running.

Lorin Hochstein

Heroku Incident 1951 followup

This followup covers incidents on January 22 and 24.

Stop Counting Production Incidents

Tracking the number of incidents is almost never going to be useful, and is likely to be detrimental.

Rick Branson

Low hanging BCP and DR scenarios – UnixDaemon: In search of (a) life

Some good scenarios to think about, including an idea for chaos engineering with humans.

Dean Wilson

Outages

Chef Supermarket
Twitter
Search in Windows 10
- Included searching of local content.
Buffer
EMIS
- EMIS is an IT system used by health care services across the UK.
Vimeo
Ticketmaster
NFL Game Pass
Microsoft Teams

SRE Weekly Issue #205

lex

February 2, 2020

General

Comments

View on sreweekly.com

Articles

The Myth of the Blameless Retrospective

This article hints at the fact that blame and sanction (punishment) are two different things.

Bonus content: Dr. Richard Cook on blameless vs sanctionless retrospectives

Bob Reselman

(A few) Ops Lessons We All Learn The Hard Way

here we have a few lessons in operations that we all (eventually) (have to) learn; often the hard way.

Jan Schaumann

What are Service Level Objectives (SLOs)? Lessons Learned

I especially like the emphasis on reducing pager fatigue through thoughtfully selected SLOs.

Emily Arnott — Blameless

Resilience Roundup – Four concepts for resilience and the implications for the future of resilience engineering

The four concepts, drawn from a paper by Dr. David Woods, are:

Rebound
Robustness
Graceful extensibility
Sustained adaptability

Thai Wood — Resilience Roundup

How an Alleged “Space Strike” Beautifully Demonstrates Work-As-Imagined Versus Work-As-Done

Understanding the difference between work-as-imagined and work-as-done is critical to the reliability of a complex system.

Jaime Woo and Emil Stolarsky — The Morning Mind-Meld

Tracking toil with SRE principles

There’s a useful survey in here if you’re trying to measure or track toil in your organization.

Eric Harvieux — Google

Site Wide Memory Leak: An On-Call Story

A nice little debugging story hinging on a bug in an upstream library.

Sanket Patel

Outages

Pinterest
Microsoft Office 365 Sharepoint Online
TD Bank
Google Drive, Docs, Sheets, and Slides
Facebook and Instagram
Gandi
- They posted a quite candid analysis, concluding that they’re not sure what went wrong.

SRE Weekly Issue #204

lex

January 27, 2020

General

Comments

View on sreweekly.com

Articles

The Resilience of Bone and Resilience Engineering

In this talk, Dr. Richard Cook presents bone as the archetype for resilient systems, and shows us what we can learn about resilience engineering from medicine.

Richard Cook, MD — Adaptive Capacity Labs

Developing in Production

Some interesting ideas on testing in production, involving developer instances that live right inside production and take a portion of production traffic.

Will Sargent

An old lesson about a fish

Keep in mind, though, that you aren’t really studying an incident at all: you’re studying your system through the lens of an incident.

Lorin Hochstein

@shelbyspees on fixing the code to match an alert (Twitter)

This thread has an interesting analogy between alerts and code comments.

Shelby Spees

STELLA: report from the SNAFU-catchers workshop on coping with complexity — The Morning Paper

I’m really loving this thing where Adrian Colyer is going through classic works on The Morning Paper. Here’s his take on the STELLA Report.

Adrian Colyer — The Morning Paper (summary)

Woods et al. (original report)

Outages

Fastly
- Full disclosure: Fastly is my employer.
Google Drive
f.root-servers.net
- There was an issue with one of the root DNS servers. More detail later in the thread.
LastPass
Apple App Store
Honeywell Home
Microsoft Teams
Reddit
- At least, I think they had an outage. Their status site was down when I tried to verify this one.
Tumblr
Indeed
[Official update] Twitter down : App not working & broken for many users
Udemy
WhatsApp
Facebook
Hosted Graphite
AWS ap-southeast-2

SRE Weekly Issue #203

lex

January 19, 2020

General

Comments

View on sreweekly.com

Articles

5 Example Postmortems & Best Practices you can Start Using Today

Spot-on advice for writing incident followups, citing examples of real write-ups that exhibit the techniques they recommend.

Hannah Culver — Blameless

On-Call Nightmares With Jay Gordon

“The beautiful thing about going on-call is you get to go off-call. If you aren’t on-call, I have news for you – you’re always on-call”

Jay Gordon — Page It to the Limit

Serving 100µs reads with 100% availability

This is a companion to last week’s article, Sharing SQLite databases across containers is surprisingly brilliant. This one explains the broader ctlstore system.

Rick Branson and Collin Van Dyck — Segment

Chaos Mesh – Your Chaos Engineering Solution for System Resiliency on Kubernetes

Chaos Mesh is a versatile Chaos Engineering platform that features all-around fault injection methods for complex systems on Kubernetes, covering faults in Pod, network, file system, and even the kernel.

Chengwen Yin — PingCAP

Getting into people’s heads: how and why to fake it

Fake it ’til you make it clear what motivated the decisions of incident responders.

Lorin Hochstein

Deemed SLIs to put SRE into practice

When running a platform, pay attention to the experience of specific customers, says Google. That may mean inferring their metrics from your own if they haven’t shared their SLIs with you.

Adrian Hilton — Google

Journey into Observability: Telemetry

This article takes a stand against the “Three Pillars of Observability”.

[…] focus on what kinds of questions you’re trying to answer and let that guide your choice of telemetry.

Mads Hartmann

Logging: Rules of thumb

My favorite recommendation is to make log messages “two-way greppable” — findable in logs and easy to tell exactly which part of the code it comes from.

Vladimir Garvardt — HelloFresh

Outages

Dyn Managed DNS
G Suite admin console
WhatsApp Gets its First Ever Outage in 2020, Only Text Service Working
South Africa and other African countries
- An important undersea cable was severed.
US Driver’s License system
- A downstream dependency of many US states’ motor vehicle departments had an outage.
UK National Lottery
Spotify
HootSuite
Instagram
Reddit
LinkedIn

SRE Weekly Issue #202

lex

January 12, 2020

General

Comments

View on sreweekly.com

Articles

Conveying confusion without confusing the reader

When writing about an incident, it’s important to skillfully show the reader how the participants’ understanding of the situation evolved.

Lorin Hochstein

The Morning Paper: Ironies of automation

This is a summary of Bainbridge’s seminal paper, and I really love where Adrian Colyer goes with it.

One example I found myself thinking about while reading through the paper does have a human precedence though: self-driving cars.

Adrian Colyer — The Morning Paper (summary)

Bainbridge — Automatica (original paper)

Sharing SQLite databases across containers is surprisingly brilliant

I have to admit, it is brilliant. Why add the risk (and latency) of a centralized configuration repository service when a local DB on each host will do?

Rick Branson — Segment

Managing Failure Modes in Microservice Architectures

This one covers a lot. My favorite parts:

Permissive failure — if Netflix’s subscriber information service is down they just show videos for free, favoring reliability over correctness.
Human attention span — if it takes 10 minutes to see if your changes broke production, you’re likely to wander off and work on something else.

Adrian Cockcroft

Understanding Observability

The author guides you through the moment they began to truly understand what observability is all about. Worth reading even if you’re already quite familiar with the concept.

Sanjeev Sharma

Intelligent DNS based load balancing at Dropbox

This article describes our work with NS1 to optimize our intelligent DNS-based global load balancing for corner cases that we uncovered while improving our point of presence (PoP) selection automation for our edge network.

How We Prevented App Performance Degradation From Sudden Ride Demand Spikes

Grab uses bulkheading to prevent localized demand spikes from affecting the service for customers elsewhere. The notable part is that they shed load they can’t satisfy anyway, due to a limited supply of available vehicles.

Corey Scott — Grab

Outages

Dyn
- Dyn had a delay in DNS resolution in London.
Google Cloud Platform (update on December 18 outage)
- On Wednesday, 18 December, 2019, a part of Google’s production network experienced a temporary reduction in capacity, due to multiple fiber cuts in optical links interconnecting Sofia, Bulgaria with other points-of-presence.
Travelex
Twitter
Airbnb
Thingiverse
Southwest Airlines website
Yahoo Mail
Disney+
QuickBooks
Trello
Reddit

SRE Weekly Issue #206

Articles

Outages

SRE Weekly Issue #205

Articles

Outages

SRE Weekly Issue #204

Articles

Outages

SRE Weekly Issue #203

Articles

Outages

SRE Weekly Issue #202

Articles

Outages

Subscribe

RSS

Mastodon

Search Issues

General

A message from our sponsor, VictorOps:

Articles

Outages

A message from our sponsor, VictorOps:

Articles

Outages

A message from our sponsor, VictorOps:

Articles

Outages

A message from our sponsor, VictorOps:

Articles

Outages

A message from our sponsor, VictorOps:

Articles

Outages

Subscribe

RSS

Mastodon

Search Issues