General

SRE Weekly Issue #402

lex

December 10, 2023

Wow, this interactive tool for choosing SLOs is fun to play with! Dragging the sliders really gives you a feel for the math involved, and then you get a formula that you can actually use.

Alex Ewerlöf

Trial by Fire: Tales from the SRE Frontlines — Ep1: Challenge the certificates

A riveting story of a service that was the victim of its own success, a potential solution, and then further challenges to overcome.

Tanat Lokejaroenlarb — Adevinta

Paper: You Want My Password or a Dead Patient?

Here’s a classic example of “work as imagined” vs “work as done”, as health care workers struggle against difficult security constraints while trying to care for patients.

Fred Hebert — summary
Ross Koppel, Sean Smith, Jim Blythe, and Vijay Kothari — original paper

The Guide to SRE Principles

This article covers a lot of ground, touching on many components of a successful SRE program, and even includes a code example for SLO calculation.

Vishal Padghan — Squadcast
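
For a rough feel of what an SLO calculation can look like, here is a minimal sketch (not the article's code). It assumes a simple request-based availability SLO; the function name and parameters are hypothetical.

```python
def error_budget_remaining(slo_target, total_requests, failed_requests):
    """Return the fraction of the error budget still unspent in a window.

    Assumes a request-based SLO, e.g. slo_target=0.999 means 99.9% of
    requests should succeed. The result goes negative once the budget
    is overspent.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests == 0:
        return 1.0  # no traffic, so nothing has been spent
    # Number of failures the SLO tolerates in this window.
    allowed_failures = (1.0 - slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures
```

With a 99.9% target and one million requests, about 1,000 failures are allowed, so 250 failures leave roughly three quarters of the budget.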

Christmas Come Early: An AWS EBS Performance Regression Update

More on the weird EBS performance regression I linked to last week. Still no full explanation of what changed, but at least they have a solution (gp3 volumes).

Dustin Brown — dolthub

How We’re Making Roblox’s Infrastructure More Efficient and Resilient

After a massive 73-hour outage, Roblox set out to redesign their infrastructure to make that kind of incident much less likely. They’ve charted a path through several intermediate architectures, with the ultimate goal of active-active datacenters.

Daniel Sturman, Max Ross, and Michael Wolf — Roblox

“Human error” means they don’t understand how the system worked

Now here’s one that really makes me think. I can’t really summarize it in a sentence, so just go read it.

Lorin Hochstein

SRE Weekly Issue #401

lex

December 3, 2023

General

A Few Words About Blameless Culture

Maybe you’re thinking of skipping over “yet another article about blamelessness”? Don’t. This one has some great examples and stories and is well worth a read.

Michael Hart

5 SRE Confessions

I’m definitely guilty of a couple of these.

Code Reliant

Introducing The Debrief: A new podcast series from incident.io

New podcast relevant to our interests!

In this series, you’ll hear insightful conversations with engineers, product managers, co-founders and more, all about the debatable topic of incident management.

Luis Gonzalez — incident.io

A Spooky Performance Regression in AWS EBS Volumes

A puzzling performance regression in EBS volumes, seemingly reproducible across instances. Anyone else seeing anything like this?

Dustin Brown — dolthub

Scaling SRE Teams

This article presents a framework for scaling SRE teams by defining SRE processes, automating, and iterating.

Stelios Manioudakis — DZone

Alerts Should Work for You, Not the Other Way Around

Some tips on what makes a good alert and how to design your alerts to be actually useful, rather than just noise.

Leon Adato — Kentik

Multi-tiered SLOs

Why would you want multiple different targets for the same SLO? Read this one to find out.

Alex Ewerlöf

You don’t need CRDTs for collaborative experiences

Conflict-free Replicated Data Types are powerful, but they have downsides, as this article explains, so it’d be great if we could avoid them when possible.

Zak Knill

SRE Weekly Issue #400

lex

November 26, 2023

General

A guide to Managing the First Fallacy of Distributed Computing

The network is not reliable. What are the implications and what can we do about it?

Anadi Misra

Incident severity levels for online platforms

Beyond a run-of-the-mill severity levels article, this one goes into a couple of common pitfalls.

Jonathan Word

Status Pages 101: How to Create a Status Page You and Your Customers Will Actually Want to Use

Some good tips in here, esp. the one about brevity.

Ashley Sawatsky — Rootly

Lessons learned from two decades of Site Reliability Engineering

Subtitle: Or, Eleven things we have learned as Site Reliability Engineers at Google

Adrienne Walcer, Kavita Guliani, Mikel Ward, Sunny Hsiao, and Vrai Stacey — Google

Don’t name your EKS Managed NodeGroups (unless you want to trigger an incident)

Good lessons to learn here that apply more broadly than just EKS.

Christian Alexánder Polanco Valdez — Adevinta

Three reasons a liberal arts degree helped me succeed in tech

This article is about project management, but a lot of the skills discussed apply to aspects of SRE at Staff+ levels.

Sannie Lee — Thoughtworks (via martinfowler.com)

How Does Generative AI Work with Devops and Incident Response?

Now this is more like it: there’s a healthy dose of skepticism woven through this article, including things genAI probably won’t be good for, and potential pitfalls.

Jesse Robbins — Heavybit

From Oops to Ops: SLOs Get Budget Rate Alerts

There are two different ways of alerting on SLOs, for two very different audiences, as explained in this article. Ostensibly this is a product feature announcement, but you don’t need to be using the product to get a lot out of this.

Fred Hebert — Honeycomb
Full disclosure: Honeycomb is my employer.

SRE Weekly Issue #399

lex

November 19, 2023

General

Paper: How in the World Did We Ever Get into That Mode?

This research paper summary goes into Mode Error and the dangers of adding more features to a system in the form of modes, especially if the system can change modes on its own.

Fred Hebert (summary)
Dr. Nadine B. Sarter (original paper)

Post Mortem on Cloudflare Control Plane and Analytics Outage

Cloudflare suffered a power outage in one of the datacenters housing their control and data planes. The outage itself is intriguing, and in its aftermath, Cloudflare learned that their system wasn’t as HA as they thought.

Lots of great lessons here, and if you want more, they posted another incident writeup recently.

Matthew Prince — Cloudflare

Architecture Patterns : Command Query Responsibility Segregation (CQRS)

Separating write from read workloads can increase complexity but also open the door to greater scalability, as this article explains.

Pier-Jean Malandrino

Load Shedding for High Traffic Systems

Covers four strategies for load shedding, with code examples:

Random Shedding
Priority-Based Shedding
Resource-Based Shedding
Node Isolation

Code Reliant
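
As a rough illustration of the first two strategies (not the article's code), here is a minimal sketch. It assumes load is reported as a utilization fraction between 0 and 1 and that priority 0 is the most important traffic; the function names, thresholds, and priority levels are all hypothetical.

```python
import random


def should_shed_random(load, threshold=0.8, rng=random.random):
    """Random shedding: above the threshold, drop a growing share of requests."""
    if load <= threshold:
        return False
    if load >= 1.0:
        return True  # fully saturated, shed everything
    # Drop probability rises linearly from 0 at the threshold to 1 at full load.
    drop_probability = (load - threshold) / (1.0 - threshold)
    return rng() < drop_probability


# Hypothetical policy: priority level -> load above which that level is shed.
# Priority 0 has no entry, so it is never shed.
SHED_ABOVE = {2: 0.80, 1: 0.90}


def should_shed_priority(load, priority):
    """Priority-based shedding: drop the least important traffic first."""
    limit = SHED_ABOVE.get(priority)
    return limit is not None and load > limit
```

Random shedding treats every request alike, while the priority variant protects critical traffic at the cost of needing a priority label on each request.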

Handling a Regional Outage: Comparing the Response From AWS, Azure and GCP

Lots of juicy details about the three outages, including a link to AWS’s write-up of their Lambda outage in June.

Gergely Orosz

Architecture Patterns : The Circuit-Breaker

The diagrams in this article are especially useful for understanding how the circuit-breaker pattern works.

Pier-Jean Malandrino
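
For readers who prefer code to diagrams, here is a minimal sketch of the pattern (not taken from the article). It assumes a single-threaded caller and a consecutive-failure threshold; the class name, parameters, and exception type are hypothetical.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock  # injectable so tests can control time
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    @property
    def state(self):
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.reset_timeout:
            return "half-open"  # timeout elapsed, allow a trial call
        return "open"

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            raise CircuitOpenError("circuit is open; failing fast")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many consecutive failures,
            # opens the circuit and restarts the timeout.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Any success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Wrapping a flaky dependency as breaker.call(fetch, url) means that once the threshold is hit, callers fail fast instead of piling up on a struggling service.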

How to be on-call

This one’s about how on-call can go bad, and how to structure your team’s on-call so that it is livable and sustainable.

Michael Hart

Working Effectively With Executives During an Incident

Execs cast a big shadow in an incident, so it’s important to have a plan for how to communicate with them, as this article explains.

Ashley Sawatsky — Rootly

SRE Weekly Issue #398

lex

November 12, 2023

General

Precise Communication Saves Lives

A cardiac surgeon draws lessons from the Tenerife commercial airline disaster and applies them to communication in the operating room.

Dr. Rob Poston

Why create a post mortem document?

Creating an incident write-up is an expensive investment. This article will tell you why it’s worthwhile.

Emily Ruppe — Jeli

Optimism vs Pessimism in Distributed Systems

The optimism and pessimism in this article are about the likelihood of contention and conflicts between actors in a distributed system, and it’s a fascinating way of looking at things.

Marc Brooker

A guide to running Incident Command

Here is a guide for how to be an effective Incident Commander and get things fixed as quickly as possible as part of an efficient Incident Management process.

Jonathan Word

Paper: Four Concepts for Resilience Engineering

The four concepts are Rebound, Robustness, Graceful Extensibility, and Sustained Adaptability, and this research paper summary explains each concept.

Fred Hebert (summary)
Dr. David Woods (original paper)

Revolutionizing Real-Time Streaming Processing: 4 Trillion Events Daily at LinkedIn

Apache Beam played a pivotal role in revolutionizing and scaling LinkedIn’s data infrastructure. Beam’s powerful streaming capabilities enable real-time processing for critical business use cases, at a scale of over 4 trillion events daily through more than 3,000 pipelines.

Bingfeng Xia and Xinyu Liu — LinkedIn

Automating dead code cleanup

Meta’s SCARF tool automatically scans for unused (dead) code and creates pull requests to remove it, on a daily basis.

Will Shackleton, Andy Pincombe, and Katriel Cohn-Gordon — Meta

Kubernetes And Kernel Panics

Netflix built a system that detects kernel panics in k8s nodes and annotates the resulting orphaned pods so that it’s clear what happened to them.

Kyle Anderson — Netflix

Webinar: Resilience Engineering in 2024: Challenges, Trends & Priorities

This upcoming webinar will cover a range of topics around resilience engineering and incident response, with two big names we’ve seen in many past issues: Chris Evans (incident.io) and Courtney Nash (Verica).
