SRE WEEKLY – Page 35 – scalability, availability, incident response, automation

SRE Weekly Issue #344

lex

October 23, 2022

Articles

In this story of SLOs gone bad, error budgets and code freezes provided a perverse incentive that caused a great deal of harm.

dobbse.net

Taking a Page From Site Reliability Engineering

This article seeks to apply SRE principles to security in the form of a Threat Budget.

Jason Bloomberg — Intellyx

How an incident management tool helps you conquer response challenges

After talking to hundreds of engineers about their processes, we’ve identified five of the most common challenges we see across companies looking to put more structure behind how they manage their incidents.

Mike Lacsamana — FireHydrant

Incident Review: Shepherd Cache Delays

The Analysis section has a lot of important lessons. What really stands out in this incident review is that Honeycomb plainly lays out that they don’t yet know what went wrong, and why not.

Fred Hebert — Honeycomb
Full disclosure: Honeycomb is my employer.

Staging is a trap

several, small staging clusters—each fit for their purpose—offers a more maintainable, cheaper alternative.

Tyler Cipriana

The Case of the Missing Fuel: The story of the Stockport air disaster

I’m really enjoying the Admiral Cloudberg series of aircraft accident investigation reports. How did I not know about these before??

A lot has improved in aviation safety since this crash in 1967, but there’s still a lot we can learn in SRE even now. For example: the operator’s view into the system should make the result of their inputs clear.

Admiral Cloudberg

How We Found Azure’s Unannounced Breaking Change

An unannounced (maybe inadvertent?) breaking change in an Azure API caused an outage. Here’s the story of the investigation.

Nikko Campbell — Metrist

Value for Money: The crash of ValuJet flight 592

Another Admiral Cloudberg air accident investigation, this time showing how easily critical details can slip through the cracks.

Admiral Cloudberg

SRE Weekly Issue #343

lex

October 16, 2022

Bit of a short one this week as I recover from my third bout of COVID. Fortunately, this is another relatively mild one (thank you, vaccine!). Good luck everyone, and get your boosters.

Articles

Authors’ Cut—Actionable SLOs Based on What Matters Most

This article explores the advantages of powering SLOs with observability data.

Pierre Tessier — Honeycomb
Full disclosure: Honeycomb is my employer.

#JWST: Day 2 Operations of the Most Expensive SRE Project

As the James Webb Space Telescope moves into normal operations, there are more great SRE lessons to be learned.

Jennifer Riggins — The New Stack

How to Build Software like an SRE

During 5 years of experience as an SRE, the author of this article gathered a set of best practice patterns for software development and operation, which they share with us.

brandon willett

Mussel — Airbnb’s Key-Value Store for Derived Data

How Airbnb built a persistent, high availability and low latency key-value storage engine for accessing derived data from offline and streaming events.

Chandramouli Rangarajan, Shouyan Guo, Yuxi Jin — Airbnb

Why MTTR should be a ‘business’ metric

By owning and reporting MTTR, teams have no choice but to be accountable for the reliability of the code they write. This dramatically changes the culture of engineering.

Sidu Ponnappa — Last9

Alaskan Double-Cross: The crash of PenAir flight 3296

I learned about plan continuation bias while reading this air accident report, and I’m certain I’ve experienced this during incidents I’ve been involved in.

Admiral Cloudberg

SRE Weekly Issue #342

lex

October 9, 2022

Articles

Video Observability

As a television broadcaster, how do I ensure that my channels are playing out the right thing for my viewers?

This is SRE applied to TV broadcasting: they replaced human monitoring of screens with an automated system.

Jeremy Blythe — evertz.io
Full disclosure: Honeycomb, my employer, is mentioned.

On-call with Jérôme Petazzoni

An interview with an engineer about on-call practices, training folks for on-call, and chaos engineering.

Elena Boroda — Fiberplane

The Re-Org Rag (I’m My Own VP)

SRE: totally defined. Time for a reorg, and with a catchy tune!

Forrest Brazeal

Keep Calm and Respond: A Beginner’s Heuristic to Incident Response

Great advice for incident response, backed up by real-world anecdotes.

Audrey Simonne — DZone

The Long Way Down: The crash of Air France flight 447

There’s a lot to learn from this air accident. A chilling example: several quirks of the plane’s automation combined to effectively tell the pilot to continue pushing the plane to stall.

Admiral Cloudberg

Atomic Commitment: The Unscalability Protocol – Marc’s Blog

When sharding a database, if transactions can span shards, then it can be very difficult to reason about the system’s maximum throughput.

For example, splitting a single-node database in half could lead to worse performance than the original system.

Marc Brooker
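
To make the sharding claim above concrete, here is a rough back-of-the-envelope sketch. It is a toy cost model of my own, not the model from the article: the function name, the one-unit cost of a local transaction, and the `commit_overhead` parameter are all assumptions chosen for illustration.

```python
def sharded_throughput(node_capacity, shards, cross_fraction, commit_overhead):
    """Transactions/sec for a toy sharded database (hypothetical cost model).

    node_capacity   -- work units per second one node can perform
    shards          -- number of nodes after sharding
    cross_fraction  -- fraction of transactions that span two shards
    commit_overhead -- assumed extra work per shard for atomic commitment
    """
    local_work = 1.0                            # single-shard transaction
    cross_work = 2 * (1.0 + commit_overhead)    # runs on two shards, plus coordination
    work_per_txn = (1 - cross_fraction) * local_work + cross_fraction * cross_work
    return shards * node_capacity / work_per_txn


if __name__ == "__main__":
    single_node = 1000.0  # the unsharded baseline, in transactions/sec
    for p in (0.0, 0.25, 0.5, 0.75, 1.0):
        tps = sharded_throughput(single_node, shards=2, cross_fraction=p, commit_overhead=0.5)
        print(f"cross-shard fraction {p:.2f}: {tps:.0f} tps (single node: {single_node:.0f})")
```

Under these assumed numbers, two shards only break even with the single node once half the transactions cross shards, and fall below it beyond that.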

GitHub Availability Report: September 2022

Through Ubuntu’s unattended-upgrades system, a systemd update was installed that broke systemd-resolved, which in turn broke GitHub Codespaces. The systemd bug report they link to is also well worth a read.

Jakub Oleksy — GitHub

There is no “Three Mile Island” event coming for software

Why not?

we’re, unfortunately, too good at explaining away failures without making any changes to our priors.

Lorin Hochstein

SRE Weekly Issue #341

lex

October 2, 2022

Articles

https://sigops.org/s/conferences/hotos/2021/papers/hotos21-s11-bronson.pdf

My coworkers referred to a system “going metastable”, and when I asked what that was, they pointed me to this awesome paper.

Metastable failures occur in open systems with an uncontrolled source of load where a trigger causes the system to enter a bad state that persists even when the trigger is removed.

Nathan Bronson, Aleksey Charapko, Abutalib Aghayev, and Timothy Zhu
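
The definition above can be illustrated with a small simulation. This is a toy model, not taken from the paper: the retry behavior, the overload penalty formula, and every number below are assumptions picked so that the effect is easy to see.

```python
def simulate(capacity, base_load, spike_load, spike_ticks, total_ticks):
    """Return per-tick goodput for a toy server whose clients retry every failure.

    The trigger is a load spike lasting `spike_ticks` ticks; afterwards the
    fresh load returns to `base_load`, which is below `capacity`.
    """
    retries = 0.0
    goodput = []
    for tick in range(total_ticks):
        fresh = spike_load if tick < spike_ticks else base_load
        offered = fresh + retries
        if offered <= capacity:
            served = offered  # healthy: everything offered is served
        else:
            # Assumed overload penalty: useful work shrinks as overload grows
            # (for example, time wasted on requests that have already timed out).
            served = capacity * capacity / offered
        retries = offered - served  # every failed request comes back next tick
        goodput.append(served)
    return goodput


if __name__ == "__main__":
    small = simulate(capacity=100, base_load=80, spike_load=110, spike_ticks=1, total_ticks=50)
    large = simulate(capacity=100, base_load=80, spike_load=150, spike_ticks=3, total_ticks=50)
    print(f"small spike, final goodput: {small[-1]:.1f}")
    print(f"large spike, final goodput: {large[-1]:.1f}")
```

In this sketch the small spike drains away on its own, while the large one leaves retry traffic that keeps the server overloaded long after the fresh load has returned to normal.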

Honeycomb incident report: Querying Errors

Honeycomb posted this incident report involving a service hitting the open file descriptors limit.

Honeycomb
Full disclosure: Honeycomb is my employer.

[reddit r/sre] What does your oncall rotas look like?

Lots of interesting answers to this one, especially when someone uttered the phrase:

engineers should not be on call

u/infomaniac89 and others — reddit

Incident Report: Google Cloud Filestore Outage 2022-09-13

A misbehaving internal Google service overloaded Cloud Filestore, exceeding its global request limit and effectively DoSing customers.

Google

Creating a Thriving On-Call Engineering Workflow by Embracing Healthy Team Habits

An in-depth look at how Adobe improved its on-call experience. They used a deliberate plan to change their team’s on-call habits for the better.

Bianca Costache — Adobe

Here’s How Chicago Trading Company’s Luke Rotta Engineers Resilient Systems

This one contains an interesting observation: they found that outages caused by cloud providers take longer to solve.

Jeff Martens — Metrist

Why you should ditch your overly detailed incident response plan | incident.io

Even if you don’t agree with all of their reasons, it’s definitely worth thinking about.

Danny Martinez — incident.io

Thoughts on API Reliability

This one covers common reliability risks in APIs and techniques for mitigating them.

Utsav Shah

The Future of Ops Is Platform Engineering

The evolution beyond separate Dev and Ops teams continues. This article traces the path through DevOps and into platform-focused teams.

Charity Majors — Honeycomb
Full disclosure: Honeycomb is my employer.

SRE Weekly Issue #340

lex

September 25, 2022

Articles

SREcon Americas 2020: Exposing the Human Factor

This one’s from a couple of years ago and covers 3 main themes the author saw at SREcon Americas 2020. Fascinating topics include providing context for newbies, learning from incidents, and rethinking the incident command system.

Taylor Barnett — Transposit

Honeycomb preliminary incident report: Ingestion delays

On September 8, Honeycomb had a major outage in data ingestion, and they’ve posted this preliminary report, “pending an in-depth incident review in the upcoming weeks”.

BONUS CONTENT: Another outage report from a different outage the next day.

Honeycomb
Full disclosure: Honeycomb is my employer.

/r/sre Thread: A “real” day in the life of an SRE

This is neat! Someone posted a day in their life as an actual SRE, and a bunch of commenters followed suit.

Various commenters — Reddit

What’s Difficult About Problem Detection? Three Key Takeaways

Some big names in SRE got together to talk about how to know when your system is broken. Listen to the recording or read this excellent summary that goes in depth on grey failures and more.

Emily Arnott — Blameless

Scaling Robinhood Crypto Systems

To better scale our systems, our infrastructure and product teams got together and decided to make these optimizations: reduce database loads, conduct load tests and size the demand and prioritize critical flows.

…and sharding.

Robinhood

How an incident transformed Razorpay — Building our Command Center

A major incident went poorly, and that catalyzed investment in developing a new incident response system. They worked to transition from swarming to Incident Command.

Vikrant Saini — Razorpay

Consider these 9 microservices best practices to help you ditch your monolith — Cortex

I love this part:

[…] if you have to deploy your microservices in a certain order, they’re not really microservices.

Cortex

Heroku Incident 2451 Follow-up

This one had an interesting interplay of contributing factors.

Heroku
