General

SRE Weekly Issue #227

lex

July 12, 2020

Articles

A Terrible, Horrible, No-Good, Very Bad Day at Slack

This is the first of a pair of articles this week on a major Slack outage in May. This one explores the technical side, with a lot of juicy details on what happened and how.

Laura Nolan — Slack

All Hands on Deck

This is the companion article that describes Slack’s incident response process, using the same incident as a case study.

Ryan Katkov — Slack

Improving Incident Retrospectives at Indeed

The author saw room for improvement in the retrospective process at Indeed. The article explains the recommendations they made and why, including de-emphasizing generation remediation items in favor of learning.

Alex Elman

Google Cloud Networking Incident #20005 Follow-Up

The datacenter was purposefully switched to generator power during planned power maintenance, but unfortunately the fuel delivery system failed.

Towards More Effective Incident Postmortems

This is a good primer on the ins and outs of running a post-incident analysis.

Anusuya Kannabiran — Squadcast

Setting SLOs: observability using custom metrics

This article goes through an interesting technique for setting up SLO metrics and alerts in GCP using Terraform and OpenCensus.

Cindy Quach — Google

Introducing the GitHub Availability Report

GitHub is committing to publishing a report on their availability each month with detail on incidents. This intro includes the reports for May and June with a description of 4 incidents.

Keith Ballinger — GitHub

Blameless’ SRE Journey

This is neat: Blameless transitioned from “startup mode” toward an SRE methodology, becoming customer 0 of their own product in the process.

Blameless

Outages

Facebook SDK
- Like in May, a Facebook SDK release caused problems on iOS for Spotify, Pinterest, Tinder.
Uber Eats
Crunchyroll
TikTok
Spotify

SRE Weekly Issue #226

lex

July 5, 2020

General

Comments

View on sreweekly.com

Articles

A Doctor Confronts Medical Errors — And Flaws In The System That Create Mistakes

This is an article version of an interview with Dr. Danielle Ofri, author of a new book When We Do Harm, on NPR’s Fresh Air. I especially loved the part about near misses.

Bridget Bentz, Molly Seavy-Nesper, Deborah Franklin, Sam Briger, and Thea Chaloner — NPR

Heroku incident 2081 follow-up

Maintenance of the logging system had unintended downstream effects including log loss and failure of the system that manages dynos.

Heroku incident 2045 follow-up

In this incident, a TLS certificate was deployed without its intermediate, resulting in failures for some clients.

Software engineering responses to COVID-19 — My take on REA’s Webinar

I wrote this after attending the Resilience Engienering Association’s webinar with panelists Dr. Richard Cook, John Allspaw, and Nora Jones, moderated by Laura Maguire. Once the recording is posted, I highly recommend watching!

Lex Neva

How SLIs Help You Understand Users’ Needs

As SREs, we need to be laser focused on the user’s experience. Our SLIs should reflect that.

Emily Arnott — Blameless

Twitter’s Reliability Journey

This two-part series is an in-depth look at how Twitter adopted SRE, before SRE was even a thing.

Blameless

Outages

Elevated 500 errors on status pages and management portal
Gmail
Tinder
Australian Taxation Office
Discord
- This status page post is interesting and worth a read.
GitHub
Twitch

SRE Weekly Issue #225

lex

June 28, 2020

General

Comments

View on sreweekly.com

Articles

Catchpoint’s SRE Report 2020 – The Highlights

This suggests an upcoming shift in our field:

50 percent of SREs believe they will be working remotely post COVID-19, as compared to only 20 percent prior to the pandemic.

Kameerath Kareem — Catchpoint

BONUS CONTENT: An outside take on the survey results is here (Mike Vizard — DevOps.com).

Even Experts Need Experts

No one person can (or should) know everything. How do we allocate expertise and build connections in order to maximize resilience and adaptive capacity?

Will Gallego

Heroku Incident Folow-up: Incident #2038

A new feature was accidentally rolled out to too wide an audience, causing log message loss.

Heroku

The impact of slow NFS on data systems

[…] one slow block device can affect the performance of processes even when those processes don’t use the slow block device.

Kalyanasundaram Somasundaram — LinkedIn

SRE error budgets and maintenance windows

Should you count scheduled maintenance against your error budget? It depends.

Jesus Climent — Google

Cassandra counter columns: nice in theory, hazardous in practice

An investigation in response to three incidents led to this stark conclusion about Cassandra’s “counter columns” feature:

In fact, they don’t appear to have any properties that make them a useful primitive for building predictable distributed systems.

Paddy Byers — Ably

How to Be a Financially Conscious Site Reliability Engineer

This article explains why we should have cost data at our fingertips as we design cloud-based systems.

[…] a well-architected system is often a cost-efficient system.

CloudZero

A Shared Pilot-Autopilot Control Architecture for Resilient Flight

This is a new concept to me, and I really like it:

Capacity for maneuver (CfM) is a measure of how much adaptability or room to respond to a new challenge that a given part of the system has, whether a person or autonomous agent.

Amir B. Farjadian, Benjamin Thomsen, Anuradha M. Annaswamy, and David D. Woods (original paper)

Thai Wood — Resilience Roundup (summary)

Outages

PagerDuty
GitHub
- An update to our nameservers has been rolled back. We are monitoring recovery.
IBM Cloud
- I saw several mentions of this outage in the media but IBM’s status page doesn’t seem to list it.
Fastly
Reddit
- and this one

SRE Weekly Issue #224

lex

June 21, 2020

General

Comments

View on sreweekly.com

Happy Juneteenth (a couple days late)! Let’s all work to strengthen the SRE profession by working to improve inclusion and diversity.

Articles

How diversity, inclusion, and belonging looks in the tech industry

Diversity and inclusion make our companies stronger and more effective. This article has lots of links with evidence of why diversity matters and how to get your company on the road to improvement.

Sara Kassabian — GitLab

Is Your Team Culture Ready to Accelerate Innovation and Build Resiliency with Chaos Engineering?

Starting on the road to chaos engineering is about more than just figuring out what experiments to run. Spreading knowledge and gaining buy-in before you start is critical.

Deven Samant — Business 2 Community

What happens when you update your DNS?

DNS propagation and inconsistent resolver behavior has bitten me so many times in my career.

Julia Evans

Post-Incident Reviews With Jaime Woo & Emil Stolarsky

I don’t often have enough time to listen to podcasts, but when it’s these two, I had to. Jaime and Emil talk about post-incident reviews, geeking out about incidents, and their philosophy on publishing a zine.

Scott McAllister — Page It To the Limit Podcast (PagerDuty)

Heroku Incident #2042 Follow-up

As so often happens, their attempts to fix a problem caused other problems. Has that happened to you? I’d love to read your story about it!

Being Kind

This article opens with a great story about how to help someone feel better when they are a contributing factor in an outage.

Tanya Reilly

Outages

SRE Weekly Issue #223

lex

June 14, 2020

General

Comments

View on sreweekly.com

Articles

Prevent application and network instability by serving stale content

I’ve used this technique in the past with a single-page app and a highly-cacheable API, to ensure stability even when the backend goes down.

Patrick Hamann

Full disclosure: Fastly is my employer.

The Impending Doom of Expiring Root CAs and Legacy Clients

Here’s a deep dive into how your CA’s certificate can affect your application’s reliability — at least in the eyes of your customers.

Scott Helme

[Coinbase] Incident Post Mortem: June 1, 2020

Here’s Coinbase’s followup from their outage last week.

Michael de Hoog — Coinbase

Who’s afraid of serializability?

Kyle Kingsbury recently did an analysis of PostgreSQL 12.3 and found that under certain conditions it violated guarantees it makes about transactions, including violations of the serializability transaction isolation level.

I thought it would be fun to use one of his counterexamples to illustrate what serializable means.

Lorin Hochstein

Achieving FMEA goals faster with Chaos Engineering

Failure mode and effects analysis (FMEA) is a decades-old method for identifying all possible failures in a design, a manufacturing or assembly process, or a product or service.

If you’ve been tasked with applying FMEA in your SRE work, this article will get you started.

Matthew Helmke

Outages

IBM Softlayer
- This is the big one this week, with downstream effects on lots of sites and services hosted in Softlayer.
  There’s a bit of detail from IBM that seems to indicate that a BGP error by a third party flooded IBM with misrouted traffic.
Reddit
Squarespace
Snapchat
Microsoft Teams

SRE Weekly Issue #227

Articles

Outages

SRE Weekly Issue #226

Articles

Outages

SRE Weekly Issue #225

Articles

Outages

SRE Weekly Issue #224

Articles

Outages

SRE Weekly Issue #223

Articles

Outages

Subscribe

RSS

Mastodon

Search Issues

General

A message from our sponsor, StackHawk:

Articles

Outages

A message from our sponsor, StackHawk:

Articles

Outages

A message from our sponsor, StackHawk:

Articles

Outages

A message from our sponsor, StackHawk:

Articles

Outages

A message from our sponsor, StackHawk:

Articles

Outages

Subscribe

RSS

Mastodon

Search Issues