General

SRE Weekly Issue #202

lex

January 12, 2020

Articles

Conveying confusion without confusing the reader

When writing about an incident, it’s important to skillfully show the reader how the participants’ understanding of the situation evolved.

Lorin Hochstein

The Morning Paper: Ironies of automation

This is a summary of Bainbridge’s seminal paper, and I really love where Adrian Colyer goes with it.

One example I found myself thinking about while reading through the paper does have a human precedence though: self-driving cars.

Adrian Colyer — The Morning Paper (summary)

Bainbridge — Automatica (original paper)

Sharing SQLite databases across containers is surprisingly brilliant

I have to admit, it is brilliant. Why add the risk (and latency) of a centralized configuration repository service when a local DB on each host will do?

Rick Branson — Segment

Managing Failure Modes in Microservice Architectures

This one covers a lot. My favorite parts:

Permissive failure — if Netflix’s subscriber information service is down they just show videos for free, favoring reliability over correctness.
Human attention span — if it takes 10 minutes to see if your changes broke production, you’re likely to wander off and work on something else.

Adrian Cockcroft

Understanding Observability

The author guides you through the moment they began to truly understand what observability is all about. Worth reading even if you’re already quite familiar with the concept.

Sanjeev Sharma

Intelligent DNS based load balancing at Dropbox

This article describes our work with NS1 to optimize our intelligent DNS-based global load balancing for corner cases that we uncovered while improving our point of presence (PoP) selection automation for our edge network.

How We Prevented App Performance Degradation From Sudden Ride Demand Spikes

Grab uses bulkheading to prevent localized demand spikes from affecting the service for customers elsewhere. The notable part is that they shed load they can’t satisfy anyway, due to a limited supply of available vehicles.

Corey Scott — Grab

Outages

Dyn
- Dyn had a delay in DNS resolution in London.
Google Cloud Platform (update on December 18 outage)
- On Wednesday, 18 December, 2019, a part of Google’s production network experienced a temporary reduction in capacity, due to multiple fiber cuts in optical links interconnecting Sofia, Bulgaria with other points-of-presence.
Travelex
Twitter
Airbnb
Thingiverse
Southwest Airlines website
Yahoo Mail
Disney+
QuickBooks
Trello
Reddit

SRE Weekly Issue #201

lex

January 5, 2020

General

Comments

View on sreweekly.com

Articles

Studying an Incident

Looking from multiple perspectives is incredibly important to effectively learn from an incident. Equally true for asking what went right.

Subbu Allamaraju

The Risks of Autonomy: Doyles Catch

Failure to anticipate and design for
the new challenges that are certain to arise following periods of technology change leads
to automation surprises when advocates are surprised by negative unintended consequences that offset apparent benefits

Thanks to Greg Burek for this one.

David Woods — Ohio State University

Stdarg And The Case Of The Forgotten Registers

Start the year off with this refreshingly deep dive into how variable-argument functions in C work.

Jan Schaumann

Files are fraught with peril

Think you know how to write files safely, say with fsync() or something? Think again.

In conclusion, computers don’t work

Dan Luu

Outages

SRE Weekly Issue #200

lex

December 29, 2019

General

Comments

View on sreweekly.com

Articles

Why incidents can’t be monocausal

The logical argument goes like this: if incidents in your system each had a single root cause, that implies a level of brittleness that would preclude your company running successfully at all.

Lorin Hochstein

A conjecture on why reliable systems fail

Once a system reaches a certain level of reliability, most major incidents will involve:

A manual intervention that was intended to mitigate a minor incident, or

Unexpected behavior of a subsystem whose primary purpose was to improve reliability

Lorin Hochstein

When mental models go wrong. Co-occurrences in dynamic, critical systems

Confirmation bias can lead us to reinforce an incorrect mental model through spurious correlations.

Thai Wood — Resilience Roundup (summary)
Dennis Bernard, David Greathead, and Gordon Baxter — International Journal of Human Computer Studies (original paper)

Reducing alert fatigue with GoAlert, Target’s on-call scheduling and notification platform

In this post, I’ll recap his talk, sharing the journey that led them to build GoAlert, the problems they’ve solved, and how they use GoAlert with Sensu Go to simplify monitoring and reduce alert fatigue.

Anna MacLachlan — Sensu (recap)
Adam Westman — Target (talk)

Targeted Diagnostic Logging in Production

Verbose debug logging + feature flagging = a way to investigate unknown unknowns in your system.

Will Sargent

Outages

SRE Weekly Issue #199

lex

December 22, 2019

General

Comments

View on sreweekly.com

Articles

The Gamma Knife model of incidents

Domino model, Swiss Cheese model, stand aside. The Gamma Knife model is a nifty analogy for contributing factors.

Lorin Hochstein

Day 21 – Being kind to 3am you

Lots of great tips here for how to make things easier on yourself when you’re paged. Pave the way for your 3 am self to get things fixed and get back to sleep as soon as possible.

Katie McLaughlin (Sysadvent day 21)

Page It To The Limit Podcast

Ooh, a new SRE podcast! PagerDuty started things up with 4 episodes right out of the gate.

Introducing “Page It To The Limit,” a new podcast by the Community team here at PagerDuty that discusses what it means to operate software in production.

My week shadowing a GitLab Site Reliability Engineer

Wow, I love the idea of this shadowing program. The author discusses incidents they saw and 5 things they learned while shadowing.

Tristan Read — GitLab

Outages

ACH
- ACH is the Automated Clearinghouse, the system that underpins money transfer in the US.
Twitte
New Zealand stock exchange
UPS
Hulu
Whirlpool Recall Website
- Whirlpool issued a recall of 500,000 washing machines, and their recall information website went down.
Hosted Graphite

SRE Weekly Issue #198

lex

December 15, 2019

General

Comments

View on sreweekly.com

Last week, I came across Lorin Hochstein and started to read through his blog. Lorin has a lot of awesome stuff to say, as you can see in this issue. Thanks, Lorin!

Articles

In Focusing On What Pilots Do Wrong, We May Be Missing Valuable Lessons From What They Quietly Do Right

“in aviation safety, it’s like we’ve been trying to learn about marriage by only studying divorce.”

Kristy Kiernan — Forbes

SysAdvent Day 10 – It’s OK if you’re not running Kubernetes

Use the right tool for the job, not the coolest one.

Mattias Geniar

Climbing the mountain

In line with last week’s article on patience by Will Gallego, this one emphasizes the importance of continued learning about resilience engineering.

Lorin Hochstein

In service of the narrative

Here are some really thought-provoking tips on how (and why) to write an effective post-incident analysis.

Lorin Hochstein

The problem with counterfactuals

To get better at avoiding or mitigating future incidents, you need to understand the conditions that enabled past incidents to occur. Counterfactual reasoning is actively harmful for this, because it circumvents inquiry into those conditions.

Lorin Hochstein

Contributors, mitigators & risks: Cloudflare 2019-07-02 outage

Some great observations and questions related to the Cloudflare outage in July.

Lorin Hochstein

Teaching “the smell”

Sometimes, things are off, and you just know an incident is brewing. What is this skill, and how can we learn it?

Silvia Botros — Learning From Incidents

Outages

Google Drive
Slack System Status
- Some of our customers are seeing an ‘Error: 9DCE38C9695E’ message after attempting to send a message in Slack.
  
  The plot thickens.
  
  Oh, and Slack had another incident too.
Hosted Chef
Azure CDN and Azure Kubernetes Service
GitHub
TikTok
Facebook
Spotify
WhatsApp
Discord
Fastly
- Full disclosure: Fastly is my employer.

SRE Weekly Issue #202

Articles

Outages

SRE Weekly Issue #201

Articles

Outages

SRE Weekly Issue #200

Articles

Outages

SRE Weekly Issue #199

Articles

Outages

SRE Weekly Issue #198

Articles

Outages

Subscribe

RSS

Mastodon

Search Issues

General

A message from our sponsor, VictorOps:

Articles

Outages

A message from our sponsor, VictorOps:

Articles

Outages

A message from our sponsor, VictorOps:

Articles

Outages

A message from our sponsor, VictorOps:

Articles

Outages

A message from our sponsor, VictorOps:

Articles

Outages

Subscribe

RSS

Mastodon

Search Issues