General

SRE Weekly Issue #406

lex

January 7, 2024

This article describes how to clearly show the value you deliver to a tech company when your work focuses on non-functional requirements such as operability, performance, or reliability.

Amin Astaneh — Certo Modo

The courage to imagine other failures

Doggedly preventing a recurrence of an incident may not be the best way to protect our systems — and may in fact make things worse.

Lorin Hochstein

SLO Compliance Period

Should your SLO cover a rolling 30 days? 7 days? A calendar month?

Alex Ewerlöf
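To make the question above concrete, here is a minimal sketch of how the two kinds of compliance period pick their start time. The function names are hypothetical and the logic is an illustration, not taken from the linked article.

```python
from datetime import datetime, timedelta


def rolling_window(now: datetime, days: int = 30) -> tuple[datetime, datetime]:
    """Rolling period: always looks back a fixed number of days from `now`."""
    return now - timedelta(days=days), now


def calendar_month_window(now: datetime) -> tuple[datetime, datetime]:
    """Calendar period: resets at midnight on the first day of the current month."""
    start = now.replace(day=1, hour=0, minute=0, second=0, microsecond=0)
    return start, now


now = datetime(2024, 1, 7, 12, 0)
print(rolling_window(now))         # window starts 30 days before `now`
print(calendar_month_window(now))  # window starts on the first of the month
```

With a rolling window, an outage keeps counting against the SLO for the full window length. With a calendar month, the budget resets on the first of the month regardless of when the outage happened.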

How Meta built the infrastructure for Threads

Threads was built in five months and had over 100 million users in its first week.

Laine Campbell and Chunqiang (CQ) Tang — Meta

Setting the foundations for on-call that’s fair, balanced, and human-focused

This article is full of advice on setting up an on-call process that’s livable and less likely to burn folks out.

incident.io

Lesser of Two Evils: The crash of Ameristar Charters flight 9363

A pilot violated a major aviation principle, and it was the right move. It’s very interesting to me that pilots are trained on the principle but not on the exceptions, with the expectation that they will react well in exceptional circumstances.

Admiral Cloudberg

How to Do UUID as Primary Keys the Right Way

Integer IDs or UUIDs as your DB primary key? I can’t count the number of incidents I’ve been involved in where integer primary keys played a part.

Bertrand Florat

SRE Weekly Issue #405

lex

December 31, 2023

Lagom SLO

Using the Swedish word “Lagom” as a jumping-off point, this article explains the importance of choosing an SLO that is just right: not too lax and not too strict.

Alex Ewerlöf

Our Journey Migrating to AWS IMDSv2

A simple security change like ceasing to use IMDSv1 can involve profound risk and necessitate a major migration process.

Archie Gunasekara — Slack

Why People Should Be at the Heart of Operational Resilience

It can be all too easy to let a subset of your IT organization “handle” resiliency. If resilience is about an ability to adapt and respond to change, then it needs broad buy-in.

Richard Gall — The New Stack

Any change can break us, but we can’t treat every change the same

If any seemingly innocuous change can break our systems, what should we do?

Lorin Hochstein

Human Performance in the Spotlight: ‘Human Error’ and ‘Honest Mistakes’

What exactly is “human error”?

Steven Shorrock — Humanistic Systems

Zero downtime Postgres upgrades

We recently upgraded from Postgres 11.9 to 15.3 with zero downtime by using logical replication, a suite of support scripts, and tools in Elixir & Erlang’s BEAM virtual machine.

They share a ton of details about how they did it.

Brent Anderson — Knock

RHIP, doctors, and pagers

Why do doctors still use antiquated pagers? There’s a lot here that speaks to what it’s really like to operate in an on-call environment, and how to evaluate new tools.

Fred Hebert

Beyond Murphy’s Law

This article riffs on Murphy’s law, using anecdotes to explore the various ways things go wrong.

Bertrand Florat

SRE Weekly Issue #404

lex

December 24, 2023

Rule of 10x per 9

For every 9 you add to an SLO, you’re making the system 10x more reliable but also 10x more expensive.

Alex Ewerlöf
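The reliability half of that rule is plain arithmetic, which the sketch below works through. It assumes a 30-day window and an availability-style SLO; the function name is hypothetical, and the cost half of the rule is the article's claim, not something this code derives.

```python
def allowed_downtime_minutes(nines: int, window_days: int = 30) -> float:
    """Minutes of downtime permitted per window for an SLO with `nines` nines.

    Two nines means 99%, three nines means 99.9%, and so on.
    """
    error_budget = 10 ** -nines  # fraction of the window allowed to fail
    return window_days * 24 * 60 * error_budget


for n in range(1, 6):
    print(f"{n} nines: {allowed_downtime_minutes(n):.3f} minutes per 30 days")
```

Each extra 9 divides the error budget by ten: three nines leaves roughly 43 minutes per 30 days, and four nines leaves a little over 4 minutes.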

Patching around a C++ crash with a little bit of Lua

In this incident story, the feature flags were served by the main application server. When a new feature caused the server to crash, there was no way to flip the flag back off to stop the crashes.

rachelbythebay

Set Taxonomies to Neutral

The author of a classification system for human error reflects 20 years later on the harm that such systems can cause by using deficit-based language.

Dr. Steven Shorrock

Post Mortem on VOID Report: Cloudflare Control Plane and Analytics Outage

Here’s Fred Hebert’s analysis of Cloudflare’s write-up of their incident on November 2.

I’m hoping they’re going to do a more in-depth review.

Fred Hebert — VOID

Integrating manual with automatic instrumentation

In this post, we introduce a hybrid approach that seamlessly combines the precision of manual instrumentation with the comfort, efficiency, and performance of automatic instrumentation.

Ron Federman — Odigos

The Swedbank Outage shows that Change Controls don’t work

Change is not the problem. It’s unaddressed risk.

Bruce Johnston — High Scalability

Production Postmortem: The Spawn of Denial of Service

A shell script with a loop running a DB client can fill up your ephemeral ports in a hurry.

Oren Eini — RavenDB
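A rough back-of-the-envelope sketch shows why a tight loop can do this. The numbers are assumptions, not taken from the linked post: the port range is the usual Linux default (32768 to 60999), and each closed connection is assumed to hold its port in TIME_WAIT for about 60 seconds. The function name is hypothetical.

```python
def max_sustained_connection_rate(
    port_low: int = 32768,        # assumed Linux default lower bound
    port_high: int = 60999,       # assumed Linux default upper bound
    time_wait_seconds: int = 60,  # assumed time a closed connection holds its port
) -> float:
    """Approximate new connections per second to one destination before ports run out."""
    available_ports = port_high - port_low + 1
    return available_ports / time_wait_seconds


rate = max_sustained_connection_rate()
print(f"Ports run out above roughly {rate:.0f} new connections per second")
```

Under these assumptions, a script that opens and closes a few hundred connections per second to the same database endpoint is already close to the limit.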

Writing Code is the Same Thing as Writing Prose

When you get right down to it, it’s all human communication, even assembly code. It’s human factors all the way down.

Michael Hart

SRE Weekly Issue #403

lex

December 17, 2023

Service Level Indicators

A great overview of SLIs, covering event-based vs time-based SLIs, commonly used SLIs, and examples of things that don’t make good SLIs.

Alex Ewerlöf
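As an illustration of the event-based vs time-based distinction, the sketch below computes both from the same per-minute request counts. The definitions used here are common conventions assumed for the example (good events over total events, and good minutes over total minutes); the names, the 99% per-minute threshold, and the sample data are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Minute:
    good: int   # requests that succeeded during this minute
    total: int  # all valid requests during this minute


def event_based_sli(minutes: list[Minute]) -> float:
    """Fraction of all requests that were good, regardless of when they happened."""
    total = sum(m.total for m in minutes)
    good = sum(m.good for m in minutes)
    return good / total if total else 1.0


def time_based_sli(minutes: list[Minute], threshold: float = 0.99) -> float:
    """Fraction of minutes that were good.

    A minute is good if its own success ratio meets `threshold`.
    Minutes with no traffic are counted as good (an assumption).
    """
    if not minutes:
        return 1.0
    good_minutes = sum(
        1 for m in minutes if m.total == 0 or m.good / m.total >= threshold
    )
    return good_minutes / len(minutes)


sample = [Minute(1000, 1000), Minute(10, 20), Minute(500, 500), Minute(0, 0)]
print(f"event-based: {event_based_sli(sample):.4f}")
print(f"time-based:  {time_based_sli(sample):.4f}")
```

The same data yields different numbers: the bad minute had little traffic, so it barely moves the event-based SLI but costs a quarter of the time-based one.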

Your incident declaration form is (probably) too long: The power of concise reporting

When it’s time to declare an incident, I want to spend ten seconds or less getting things kicked off.

Matilda Hultgren — incident.io

On Error Budgets

This short article covers three important aspects of error budgets:

Understanding Your Error Budget

Make Informed Decisions

Proactively communicate

Code Reliant

4 SRE Golden Signals (What they are and why they matter)

SRE’s Golden Signals are four key metrics used to monitor the health of your service and underlying systems. We will explain what they are, and how they can help you improve service performance.

Blameless

Full disclosure: Honeycomb, my employer, is mentioned.

A deep dive into CPU requests and limits in Kubernetes

I hadn’t really appreciated some of the subtler details of CPU requests in k8s until I read this.

Ara Pulido — Datadog

Maybe Getting Rid of Your QA Team was Bad, Actually.

Reading this, I can see hints of the contributing factors in many incidents I’ve been involved in.

To these folks, it feels like giving a damn is a huge career liability in your organization. Because it is.

David Caudill

Upgrading GitHub.com to MySQL 8.0

They went to impressive lengths to make the upgrade process reversible.

Amusingly, this post was directly relevant to me 30 minutes ago when I discovered mojibake all over sreweekly.com due to upgrading MySQL from 5.7 to 8.0+ last week. Oops.

Jiaqi Liu, Daniel Rogart, and Xin Wu — GitHub

“Why Aren’t They Reporting Incidents?” Influences on Reporting Behaviour

In order to learn from incidents, we need to know that they happened. That means someone needs to report them, but a lot can get in the way of reporting incidents.

Dr. Steven Shorrock — Humanistic Systems

SRE Weekly Issue #402

lex

December 10, 2023

Introducing Service Level Calculator

Wow, this interactive tool for choosing SLOs is fun to play with! Dragging the sliders really gives you a feel for the math involved, and then you get a formula that you can actually use.

Alex Ewerlöf

Trial by Fire: Tales from the SRE Frontlines — Ep1: Challenge the certificates

A riveting story of a service that was the victim of its own success, a potential solution, and then further challenges to overcome.

Tanat Lokejaroenlarb — Adevinta

Paper: You Want My Password or a Dead Patient?

Here’s a classic example of “work as imagined” vs “work as done”, as health care workers struggle against difficult security constraints while trying to care for patients.

Fred Hebert — summary
Ross Koppel, Sean Smith, Jim Blythe, and Vijay Kothari — original paper

The Guide to SRE Principles

This article covers a lot of ground, touching on many components of a successful SRE program, and even includes a code example for SLO calculation.

Vishal Padghan — Squadcast

Christmas Come Early: An AWS EBS Performance Regression Update

More on the weird EBS performance regression I linked to last week. Still no full explanation of what changed, but at least they have a solution (gp3 volumes).

Dustin Brown — dolthub

How We’re Making Roblox’s Infrastructure More Efficient and Resilient

After a massive 73-hour outage, Roblox set out to redesign their infrastructure to make that kind of incident much less likely. They’ve charted a path through several intermediate architectures, with the ultimate goal of active-active datacenters.

Daniel Sturman, Max Ross, and Michael Wolf — Roblox

“Human error” means they don’t understand how the system worked

Now here’s one that really makes me think. I can’t really summarize it in a sentence, so just go read it.

Lorin Hochstein
