SRE WEEKLY – scalability, availability, incident response, automation

SRE Weekly Issue #420

lex

April 15, 2024

The game Last Epoch launched in February and had a rocky start. This huge retrospective post tells the story of what happened and how the team fixed it.

EHG_Kain — Last Epoch

Autonomous hardware diagnostics and recovery at scale

Cloudflare’s Phoenix system can find and recover failed servers, reducing toil.

Jet Mariscal, Aakash Shah, and Yilin Xiong — Cloudflare

SLA vs SLO vs SLI: What’s the Difference?

More than just another glossary of SL*s, this one also has examples and best practices.

Sara Miteva — Checkly

What’s the biggest unsolved problem within Site Reliability Engineering?

Spurred by a question in the SREcon attendee survey, this one really gets you thinking: how does the current “generation” of SREs differ from those that came before?

Paige — PagerDuty

Finding the common ground with executives in incidents

This one’s about finding out what execs need in incidents and figuring out how to get everyone’s needs met.

Chris Evans — incident.io

Minimizing on-call burnout through alerts observability

This post explains how Cloudflare gathers information about their alerts and improves them to benefit reliability and on-call health.

Monika Singh — Cloudflare

Composite SLO

This one contains formulas for calculating compound SLOs when downstream dependencies are parallel or serial.

Alex Ewerlöf
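
As a rough illustration of the serial and parallel cases mentioned above, here is a minimal sketch using the textbook formulas. It assumes dependencies fail independently, and the function names and example numbers are hypothetical; the linked article's own formulas may differ in detail.

```python
from math import prod


def serial_availability(availabilities):
    """Every dependency must be up, so the availabilities multiply."""
    return prod(availabilities)


def parallel_availability(availabilities):
    """Any one redundant dependency is enough: 1 minus the chance all fail."""
    return 1 - prod(1 - a for a in availabilities)


# Hypothetical numbers, assuming independent failures.
serial = serial_availability([0.999, 0.999])
parallel = parallel_availability([0.99, 0.99])

print(f"serial:   {serial:.6f}")
print(f"parallel: {parallel:.6f}")
```

Under these assumptions, chaining dependencies in series lowers the composite figure below any single dependency, while redundancy in parallel raises it above either one.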

SRE Weekly Issue #419

lex

April 7, 2024

How Figma’s Databases Team Lived to Tell the Scale

Our nine month journey to horizontally shard Figma’s Postgres stack, and the key to unlocking (nearly) infinite scalability.

Retrofitting sharding is a huge undertaking.

Sammy Steele — Figma

Moving fast breaks things: the importance of a staging environment

Ride along as this company evolves from constantly shipping directly to production to a robust staging and internal canary deployment system.

Greg Foster — Graphite

Post mortem: we took 124 seconds from you, here’s 378 back

A lighthearted but still detail-filled take on a post-incident analysis for a short production outage.

Greg Foster — Graphite

Building Application Reliability on Top of Infrastructure Unreliability

This one has an interesting discussion of the nature of reliability and the impact of multiple services on overall reliability, including possible mathematical models to use.

Fitz — Temporal

#30 Clearing Delusions in Observability (with David Caudill)

This episode of the SREPath Podcast covers a variety of themes around observability and SLOs. There’s a great text-based summary if that’s your preference.

Ash Patel — SREPath

Linux Crisis Tools

This piece argues that you should install system debugging tools on your production systems now, because it’s going to be really hard to do it live when you need them.

Brendan Gregg

How much are their 9’s worth?

Following on from a previous article about the squiggliness of availability numbers, this article evaluates SLAs from 4 major companies to try to divine what they actually mean.

Ross Brodbeck

SLO formulas implementation in PromQL step by step

I want to present real-life examples of both availability and latency SLOs, as they are more nuanced than they may initially appear. Also, I find it worthwhile sharing a detailed guide as it showcases uncommon uses of PromQL and demonstrates the language’s versatility.

Michał Kaźmierczak

SRE Weekly Issue #418

lex

March 31, 2024

Redefining Observability

The observability waters have been muddy for a while, and this article does a great job of taking a step back and building a definition — and a roadmap.

Hazel Weakly

A Commentary on Defining Observability

Fred Hebert wrote this response/follow-on to Hazel’s article:

The main points I’ll try to bring here are on the topics of the difference between insights and questions, the difference between observability and data availability, reinforcing a socio-technical definition, the mess of complex systems and mapping them, and finally, a hot take on the use of models when reasoning about systems.

Fred Hebert

Service Level Agreement

What the service providers are willing to put on the table in terms of penalties is often much less than the money you lose when your service goes down.

Alex Ewerlöf

Assumptions About Debriefs That Belie Legal Risk

Fascinating legal questions come to the surface when lawyers consider the possibility for legal risk exposure from a surgical incident debriefing meeting.

Dr. Rob Poston

How to deal with alert fatigue head-on

If you approach on-call the right way, you can mitigate the impacts of alert fatigue or, better yet, avoid it altogether. Here, we’ll dive into the tactics teams can implement to address alert fatigue and its underlying causes.

incident.io

Different Ways to Aggregate Nines

How do you create an SLO that references multiple SLIs together, such as slow requests and errors?

Ross Brodbeck
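
One possible way to fold errors and slow requests into a single SLI is to count a request as "good" only when it both succeeded and finished under a latency threshold. The sketch below assumes that approach; the `Request` type, the threshold, and the sample data are hypothetical, and the linked article may weigh other options.

```python
from dataclasses import dataclass


@dataclass
class Request:
    ok: bool            # True if the request did not return an error
    latency_ms: float   # observed latency in milliseconds


def combined_sli(requests, latency_threshold_ms):
    """Fraction of requests that were both successful and fast enough."""
    if not requests:
        raise ValueError("need at least one request to compute an SLI")
    good = sum(
        1 for r in requests if r.ok and r.latency_ms <= latency_threshold_ms
    )
    return good / len(requests)


# Hypothetical sample: one slow request and one failed request out of four.
sample = [
    Request(ok=True, latency_ms=120),
    Request(ok=True, latency_ms=950),
    Request(ok=False, latency_ms=80),
    Request(ok=True, latency_ms=200),
]
sli = combined_sli(sample, latency_threshold_ms=500)
print(sli)  # 0.5
```

Counting per request avoids penalizing twice a request that is both slow and failed, which can happen when separate error and latency ratios are multiplied together.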

SREcon24 Americas Recap

More than just a list of talks, this piece pulls out major themes from SREcon24.

Will Gallego

How much are your 9’s worth?

Making your 9’s look great by cheating.

Of course, you don’t actually want to do that, but learning how can show us that availability numbers are nuanced.

Ross Brodbeck

SRE Weekly Issue #417

lex

March 24, 2024

Harnessing chaos in Cloudflare offices

Remember that cool lava lamp random number generator that Cloudflare uses? Now they have a couple of other sources of entropy, and they’re teaming up with other companies.

Cefan Daniel Rubin, Luke Valenta, and Thibault Meunier — Cloudflare

Live Streaming the Super Bowl: The Art of Scaling Across Multiple Cloud Regions

To support 123 million simultaneous streams (!), Paramount+ migrated to a multi-region architecture with a distributed, multi-write database.

Denis Magda — Yugabyte

DORA vs. DORA!

DevOps Research and Assessment or the Digital Operational Resilience Act, which is which? Turns out they both matter to SREs.

Lee Fredricks — PagerDuty

The 2038 Problem

2038 isn’t so far off now. Do you have a plan for 64-bit timestamps?

Code Reliant
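
To make the deadline concrete, here is a small sketch of what happens when Unix time is stored in a signed 32-bit integer. The helper names are hypothetical, and the wraparound is simulated with arithmetic rather than by using a real 32-bit `time_t`.

```python
from datetime import datetime, timedelta, timezone

INT32_MAX = 2**31 - 1
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def from_epoch_seconds(seconds):
    """Convert seconds since the Unix epoch to a UTC datetime."""
    return EPOCH + timedelta(seconds=seconds)


def wrap_int32(value):
    """Simulate storing value in a signed 32-bit two's-complement integer."""
    return (value + 2**31) % 2**32 - 2**31


last_ok = from_epoch_seconds(INT32_MAX)   # last representable second
wrapped = wrap_int32(INT32_MAX + 1)       # one second later overflows
after_wrap = from_epoch_seconds(wrapped)

print(last_ok.isoformat())     # 2038-01-19T03:14:07+00:00
print(after_wrap.isoformat())  # 1901-12-13T20:45:52+00:00
```

A 64-bit timestamp pushes the same overflow far beyond any practical horizon, which is why the question above is about a migration plan rather than a workaround.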

Onboarding roulette: deleting our employee accounts daily

To ensure they would dogfood the new account process regularly, these folks delete a random employee’s account in their product every day.

Greg Foster — Graphite

Properly Running Kubernetes Jobs with Sidecars in 2024 (K8s 1.28+)

Hey, check it out, sidecars are going to be fully supported in upcoming versions of Kubernetes!

Steven Aldinger — TeamSnap

Inside the gamedays: how we tested Signals for reliability

As part of releasing a new product, FireHydrant ran simulations to determine the right SLO — and uncover some room for optimization.

Danielle Leong — FireHydrant

This article is published by my sponsor, FireHydrant, but their sponsorship did not influence its inclusion in this issue.

Distributed Tracing Guide: Quick Overview

If you’re new to distributed tracing, this is a great overview. The part about automated instrumentation for span tracing is especially useful.

Chris Battarbee — Metoro

SRE Weekly Issue #416

lex

March 17, 2024

4 Instructive Postmortems on Data Downtime and Loss

What can we, in turn, learn from some of the most honest and blameless—and public—postmortems of the last few years?

They cover incidents from GitLab, Tarsnap, Roblox, and Cloudflare with great summaries and takeaways.

The Hacker News

Resilience and Incident Management with Vanessa Huerta Granda

My favorite part of this interview is when Vanessa describes parenting twin babies as constant incident response.

Shane Hastie — InfoQ

Beyond the beep and saving sleep: optimizing the On-Call experience

Here follow some lessons I’ve learned from the trenches in small start-ups and larger engineering teams, to improve your on-call shift experience and remediation time for production issues and make sure you’re spending on-call efforts on what has the most impact.

Alex Wauters

The case for Fault Injection testing in Production

Doing your chaos experiments in a non-production environment can feel safer, but what are you giving up?

Sam Rossoff — Gremlin

In Defense of Shell Scripts

Sometimes, shell is just the right tool for the job.

Amin Astaneh — Certo Modo

Tank Explosions at Midland Resource Recovery

Catherine from Mastodon summarized this incident report beautifully:

this is one of the most violently unhinged CSB reports i’ve ever read […]

while investigating an explosion at a facility, CSB staff tried to prevent another explosion of the same kind in the same facility, and being unable to convince the workers to not cause it, ended up hiding behind a shipping container

U.S. Chemical Safety and Hazard Investigation Board

Broken windows: why the ‘Single Pane of Glass’ is impossible

This one’s about why people tend to want a “SPoG” and what we should want instead. Bonus points for the Star Trek reference.

Nočnica Mellifera — Checkly

How we built our infrastructure fail-over checklist

Right in the middle of migrating from one datacenter to an HA pair of new datacenters, one of the new ones failed. They had to quickly do a partial rollback of the migration to ride out the outage.

Gauthier François — Doctolib

Announcing bpftop: Streamlining eBPF performance optimization

Today, we are thrilled to announce the release of bpftop, a command-line tool designed to streamline the performance optimization and monitoring of eBPF programs.

Jose Fernandez — Netflix
