General

SRE Weekly Issue #226

lex

July 5, 2020

Articles

A Doctor Confronts Medical Errors — And Flaws In The System That Create Mistakes

This is an article version of an interview with Dr. Danielle Ofri, author of a new book When We Do Harm, on NPR’s Fresh Air. I especially loved the part about near misses.

Bridget Bentz, Molly Seavy-Nesper, Deborah Franklin, Sam Briger, and Thea Chaloner — NPR

Heroku incident 2081 follow-up

Maintenance of the logging system had unintended downstream effects including log loss and failure of the system that manages dynos.

Heroku incident 2045 follow-up

In this incident, a TLS certificate was deployed without its intermediate, resulting in failures for some clients.

Software engineering responses to COVID-19 — My take on REA’s Webinar

I wrote this after attending the Resilience Engienering Association’s webinar with panelists Dr. Richard Cook, John Allspaw, and Nora Jones, moderated by Laura Maguire. Once the recording is posted, I highly recommend watching!

Lex Neva

How SLIs Help You Understand Users’ Needs

As SREs, we need to be laser focused on the user’s experience. Our SLIs should reflect that.

Emily Arnott — Blameless

Twitter’s Reliability Journey

This two-part series is an in-depth look at how Twitter adopted SRE, before SRE was even a thing.

Blameless

Outages

Elevated 500 errors on status pages and management portal
Gmail
Tinder
Australian Taxation Office
Discord
- This status page post is interesting and worth a read.
GitHub
Twitch

SRE Weekly Issue #225

lex

June 28, 2020

General

Comments

View on sreweekly.com

Articles

Catchpoint’s SRE Report 2020 – The Highlights

This suggests an upcoming shift in our field:

50 percent of SREs believe they will be working remotely post COVID-19, as compared to only 20 percent prior to the pandemic.

Kameerath Kareem — Catchpoint

BONUS CONTENT: An outside take on the survey results is here (Mike Vizard — DevOps.com).

Even Experts Need Experts

No one person can (or should) know everything. How do we allocate expertise and build connections in order to maximize resilience and adaptive capacity?

Will Gallego

Heroku Incident Folow-up: Incident #2038

A new feature was accidentally rolled out to too wide an audience, causing log message loss.

Heroku

The impact of slow NFS on data systems

[…] one slow block device can affect the performance of processes even when those processes don’t use the slow block device.

Kalyanasundaram Somasundaram — LinkedIn

SRE error budgets and maintenance windows

Should you count scheduled maintenance against your error budget? It depends.

Jesus Climent — Google

Cassandra counter columns: nice in theory, hazardous in practice

An investigation in response to three incidents led to this stark conclusion about Cassandra’s “counter columns” feature:

In fact, they don’t appear to have any properties that make them a useful primitive for building predictable distributed systems.

Paddy Byers — Ably

How to Be a Financially Conscious Site Reliability Engineer

This article explains why we should have cost data at our fingertips as we design cloud-based systems.

[…] a well-architected system is often a cost-efficient system.

CloudZero

A Shared Pilot-Autopilot Control Architecture for Resilient Flight

This is a new concept to me, and I really like it:

Capacity for maneuver (CfM) is a measure of how much adaptability or room to respond to a new challenge that a given part of the system has, whether a person or autonomous agent.

Amir B. Farjadian, Benjamin Thomsen, Anuradha M. Annaswamy, and David D. Woods (original paper)

Thai Wood — Resilience Roundup (summary)

Outages

PagerDuty
GitHub
- An update to our nameservers has been rolled back. We are monitoring recovery.
IBM Cloud
- I saw several mentions of this outage in the media but IBM’s status page doesn’t seem to list it.
Fastly
Reddit
- and this one

SRE Weekly Issue #224

lex

June 21, 2020

General

Comments

View on sreweekly.com

Happy Juneteenth (a couple days late)! Let’s all work to strengthen the SRE profession by working to improve inclusion and diversity.

Articles

How diversity, inclusion, and belonging looks in the tech industry

Diversity and inclusion make our companies stronger and more effective. This article has lots of links with evidence of why diversity matters and how to get your company on the road to improvement.

Sara Kassabian — GitLab

Is Your Team Culture Ready to Accelerate Innovation and Build Resiliency with Chaos Engineering?

Starting on the road to chaos engineering is about more than just figuring out what experiments to run. Spreading knowledge and gaining buy-in before you start is critical.

Deven Samant — Business 2 Community

What happens when you update your DNS?

DNS propagation and inconsistent resolver behavior has bitten me so many times in my career.

Julia Evans

Post-Incident Reviews With Jaime Woo & Emil Stolarsky

I don’t often have enough time to listen to podcasts, but when it’s these two, I had to. Jaime and Emil talk about post-incident reviews, geeking out about incidents, and their philosophy on publishing a zine.

Scott McAllister — Page It To the Limit Podcast (PagerDuty)

Heroku Incident #2042 Follow-up

As so often happens, their attempts to fix a problem caused other problems. Has that happened to you? I’d love to read your story about it!

Being Kind

This article opens with a great story about how to help someone feel better when they are a contributing factor in an outage.

Tanya Reilly

Outages

SRE Weekly Issue #223

lex

June 14, 2020

General

Comments

View on sreweekly.com

Articles

Prevent application and network instability by serving stale content

I’ve used this technique in the past with a single-page app and a highly-cacheable API, to ensure stability even when the backend goes down.

Patrick Hamann

Full disclosure: Fastly is my employer.

The Impending Doom of Expiring Root CAs and Legacy Clients

Here’s a deep dive into how your CA’s certificate can affect your application’s reliability — at least in the eyes of your customers.

Scott Helme

[Coinbase] Incident Post Mortem: June 1, 2020

Here’s Coinbase’s followup from their outage last week.

Michael de Hoog — Coinbase

Who’s afraid of serializability?

Kyle Kingsbury recently did an analysis of PostgreSQL 12.3 and found that under certain conditions it violated guarantees it makes about transactions, including violations of the serializability transaction isolation level.

I thought it would be fun to use one of his counterexamples to illustrate what serializable means.

Lorin Hochstein

Achieving FMEA goals faster with Chaos Engineering

Failure mode and effects analysis (FMEA) is a decades-old method for identifying all possible failures in a design, a manufacturing or assembly process, or a product or service.

If you’ve been tasked with applying FMEA in your SRE work, this article will get you started.

Matthew Helmke

Outages

IBM Softlayer
- This is the big one this week, with downstream effects on lots of sites and services hosted in Softlayer.
  There’s a bit of detail from IBM that seems to indicate that a BGP error by a third party flooded IBM with misrouted traffic.
Reddit
Squarespace
Snapchat
Microsoft Teams

SRE Weekly Issue #222

lex

June 7, 2020

General

Comments

View on sreweekly.com

Articles

Meaningful availability: How many nines do you actually need?

This article in a nutshell:

Nines don’t matter if users aren’t happy (h/t Charity Majors)
Chaos engineering

Kolton Andrus — Gremlin

Byzantine and non-Byzantine distributed systems

I hadn’t heard of this distinction before. If you haven’t either, click through to find out more.

Ayende Rahien — RavenDB

Using SRE to meet reliability challenges

In our experience, the three big sources of production stress are:

Toil

Bad monitoring

Immature incident handling procedures

Cheryl Kang — Google

Faulty Equipment, Lapsed Training, Repeated Warnings: How a Preventable Disaster Killed Six Marines

ProPublica picks apart the incident in exhaustive detail, showing how multiple problems interwoven in the organization contributed to this tragedy.

Robert Faturechi, Megan Rose and T. Christian Miller — ProPublica

SRE, CSE, and the safety boundary

There’s a great review of Rasmussen’s safety boundary model, which I wasn’t previously familiar with. A system moves between three boundaries:

the boundary to economic failure
the boundary of unacceptable work load
the boundary of functionally acceptable performance

Lorin Hochstein

The Tail at Scale Revisited

This one includes a really nifty graph showing how reliable your N backend microservices need to be in order to hit a given reliability target R.

Bill Duncan

Oncall and COVID-19 Survey Results

Here are the results of the survey I linked here a couple weeks ago. There are some interesting and surprising results, well worth a read.

Rich Burroughs — FireHydrant

The mystery of the expiring Sectigo web certificate

A commonly-used CA’s Root certificate expired, causing some havoc. Even though Sectigo did everything right, some software didn’t handle the transition to the new root well.

Paul Ducklin — Naked Security

Outages

PagerDuty
Coinbase
- Coinbase had an outage on June 1. Click for their post-incident analysis.
Robinhood
- Robinhood’s status page doesn’t show history, so I can’t verify this one.
iCloud
Ebay
- Ebay’s status page also doesn’t show history, so I can’t verify this one either.
Lloyds and Halifax (bank)
Adobe Cloud
Squarespace
- Their followup post discusses the large-scale DDoS that contributed to the outage.
HostedGraphite
Telegram

SRE Weekly Issue #226

Articles

Outages

SRE Weekly Issue #225

Articles

Outages

SRE Weekly Issue #224

Articles

Outages

SRE Weekly Issue #223

Articles

Outages

SRE Weekly Issue #222

Articles

Outages

Subscribe

RSS

Mastodon

Search Issues

General

A message from our sponsor, StackHawk:

Articles

Outages

A message from our sponsor, StackHawk:

Articles

Outages

A message from our sponsor, StackHawk:

Articles

Outages

A message from our sponsor, StackHawk:

Articles

Outages

A message from our sponsor, StackHawk:

Articles

Outages

Subscribe

RSS

Mastodon

Search Issues