SRE Weekly Issue #387

A message from our sponsor, Rootly:

When incidents impact your customers, failing to communicate with them effectively can erode trust even further and compound an already difficult situation. Learn the essentials of customer-facing incident communication in Rootly’s latest blog post:
https://rootly.com/blog/the-medium-is-the-message-how-to-master-the-most-essential-incident-communication-channels

Articles

In this post, we’ll explore 10 areas that are key to designing highly scalable architectures.

The 10 areas they cover in depth are:

  1. Horizontal vs. Vertical Scaling
  2. Load Balancing
  3. Database Scaling
  4. Asynchronous Processing
  5. Stateless Systems
  6. Caching
  7. Network Bandwidth Optimization
  8. Progressive Enhancement
  9. Graceful Degradation
  10. Code Scalability

  Code Reliant

Are you looking at the number of requests that were served successfully out of the total number of requests? Or the percentage of time the system was up and working properly?

  Alex Ewerlöf
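The two measures in that question can diverge sharply for the same system. Here's a minimal sketch of each (all numbers hypothetical):

```python
# Two common ways to compute "availability" for an SLO. A system can
# score very differently on the two: a 30-minute outage during peak
# traffic hurts the request-based number far more than the time-based one.

def request_availability(successful: int, total: int) -> float:
    """Request-based: fraction of requests served successfully."""
    return successful / total

def time_availability(uptime_seconds: float, window_seconds: float) -> float:
    """Time-based: fraction of the window the system was up."""
    return uptime_seconds / window_seconds

# Hypothetical month: 10M requests, a 30-minute peak-hour outage.
req = request_availability(successful=9_940_000, total=10_000_000)
uptime = time_availability(uptime_seconds=2_590_200, window_seconds=2_592_000)

print(f"request-based: {req:.4%}")   # 99.4000%
print(f"time-based:    {uptime:.4%}")  # 99.9306%
```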

This is my personal take on something that is considered standard that I just don’t understand. So here we go — the Apdex, what it is, and why I don’t use it!

  Boris Cherkasky
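For readers unfamiliar with it, the standard Apdex score the article critiques is (satisfied + tolerating / 2) / total, where "satisfied" means latency at or under a chosen threshold T and "tolerating" means between T and 4T. A minimal sketch:

```python
# Standard Apdex formula. T (the "satisfied" threshold) is the one
# magic number you must pick, which is a big part of the critique.

def apdex(latencies_ms, threshold_ms):
    satisfied = sum(1 for t in latencies_ms if t <= threshold_ms)
    tolerating = sum(1 for t in latencies_ms if threshold_ms < t <= 4 * threshold_ms)
    return (satisfied + tolerating / 2) / len(latencies_ms)

# With T = 500ms: 2 satisfied, 1 tolerating, 1 frustrated
# -> (2 + 0.5) / 4 = 0.625
print(apdex([120, 480, 900, 3000], threshold_ms=500))
```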

Here’s a great explanation of three common cognitive biases we should try to avoid while analyzing incidents.

  Randy Horwitz — Learning From Incidents

A horrifying tale of gitops gone wrong and backups that didn’t back up, leading to catastrophic data loss. This, this is what hugops is for. I’m so sorry, Lily!

  Lily Cohen

Here’s a follow-up analysis from Duo for an incident they had last week.

The first SRE hire at incident.io shares what they learned as they became familiar with the infrastructure and figured out what to do with it.

  Ben Wheatley — The New Stack

This is a story of building a new on-call rotation in a company that didn’t have one. They started out with a pretty awesome list of principles that we could all aspire to.

  Felix Lopez — The New Stack

Why should we test in production? This article gives a really spot-on argument and goes on to explain how to do it.

  Sven Hans Knecht

SRE Weekly Issue #386

This issue was delayed a day while I was enjoying a much-needed vacation with my family. While I’m on the subject, it’s hot take time: vacations are important for the reliability of our sociotechnical systems, so good SREs should take vacations regularly and encourage others to as well.


Articles

If “you build it, you run it” requires mandate, knowledge, and responsibility, what happens when one of those is missing?

  Alex Ewerlöf

Slack developed an all-encompassing metric for the user experience that goes beyond a simple SLO.

  Matthew McKeen and Ryan Katkov

This whitepaper delves deep into the ways a microservice architecture changes how transactions work. It presents a method of dealing with microservice transaction failures through application-specific compensation logic.

  Frank Leymann — WSO2
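The compensation idea can be sketched roughly like this (a generic saga-style illustration, not code from the whitepaper; all step names are made up):

```python
# Each step in a distributed transaction pairs a forward action with
# application-specific compensation logic that undoes it. If any step
# fails, compensations for already-completed steps run in reverse order.

def run_saga(steps):
    """steps: list of (action, compensation) callable pairs."""
    completed = []
    for action, compensate in steps:
        try:
            action()
        except Exception:
            for undo in reversed(completed):
                undo()
            return False
        completed.append(compensate)
    return True

# Hypothetical order-placement transaction where the final step fails.
log = []

def fail_shipping():
    raise RuntimeError("shipping service unavailable")

ok = run_saga([
    (lambda: log.append("reserve inventory"), lambda: log.append("release inventory")),
    (lambda: log.append("charge card"),       lambda: log.append("refund card")),
    (fail_shipping,                           lambda: None),
])
print(ok)   # → False
print(log)  # → ['reserve inventory', 'charge card', 'refund card', 'release inventory']
```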

Bambu is a brand of 3D printers that are primarily cloud-based. A problem in their cloud system resulted in printers running jobs unexpectedly, causing significant damage to some customers’ printers.

  Bambu Lab

An interesting confluence of fiber optic line failures resulted in loss of connectivity on what should have been a redundant link.

  Google

I know the title looks like click-bait, but this article delivers with 7 well thought-out critiques of SLOs.

  Code Reliant

This latest entry into the awesome-* arena is a curated list of runbooks and related resources for popular software.

  Runbear

You shift from asking “what was the abnormal work?” to “how did this incident happen even though everyone was doing normal work?”

This article immediately made me think of the latest Mentour Pilot accident investigation in which everyone acted nearly perfectly and yet still only narrowly avoided a mid-air collision.

  Lorin Hochstein

SRE Weekly Issue #385

Many apologies to Matt Cooper at GitHub, who is the actual author of the article Scaling Merge-ort Across GitHub from last week. Sorry for the mis-credit, Matt!


Articles

This article will really come in handy next time you need to explain SRE to your execs.

  Kit Merker — DevOps.com

By mapping the Westrum Model of organizational cultures to SRE, we can understand SRE culture adoption.

  Vladyslav Ukis and Ben Linders — InfoQ

Disney’s SRE teams have ensured that the magic keeps happening, even as experiences and their underlying technology become more and more complex.

  Ash Patel — SREPath

There’s so much to learn from this tragedy that I might read this one again. A mid-air collision these days should be effectively impossible due to TCAS. In this case, many factors conspired to bring about disaster.

  Admiral Cloudberg

Here they are, out in the open:

  • SLOs create a common understanding in the organization about reliability
  • SLOs require investment into improved observability
  • SLOs prompt decisions about risk management… and risk-taking

  Amin Astaneh — Certo Modo

The “five standard models” are actually more like a 5-stage workflow:

  • Triage
  • Examine
  • Diagnose
  • Test
  • Cure

  Saheed Oladosu

This blog post will share broadly-applicable techniques (beyond GraphQL) we used to perform this migration. The three strategies we will discuss today are AB Testing, Replay Testing, and Sticky Canaries.

  Jennifer Shin, Tejas Shikhare, Will Emmanuel — Netflix
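Of the three, replay testing is the simplest to sketch: send the same captured production requests to the legacy and migrated systems and diff the responses. The handlers below are stand-ins, not Netflix’s actual services:

```python
# Replay testing sketch: any request where the legacy and migrated
# implementations disagree is surfaced for investigation.

def replay_test(requests, legacy_handler, new_handler):
    mismatches = []
    for req in requests:
        old, new = legacy_handler(req), new_handler(req)
        if old != new:
            mismatches.append((req, old, new))
    return mismatches

# Hypothetical handlers: the migrated version mishandles odd inputs.
legacy = lambda n: n * 2
migrated = lambda n: n * 2 if n % 2 == 0 else n * 2 + 1

diffs = replay_test([1, 2, 3, 4], legacy, migrated)
print(diffs)  # → [(1, 2, 3), (3, 6, 7)]
```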

Building from a review of traditional rate limiting techniques, this article then explains adaptive rate limiting and its benefits.

  Sudhanshu Prajapati — FluxNinja

SRE Weekly Issue #384


Articles

They tested this new git merge strategy by using Scientist, a framework that runs both the old and new implementation and compares the results.

  Jesse Toth — GitHub
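The pattern Scientist implements can be sketched in a few lines (an illustrative Python re-implementation, not GitHub’s Ruby library): run both code paths, record whether they agree and how long each took, and always return the control’s result so users never see the candidate’s bugs.

```python
import time

def experiment(control, candidate, publish):
    """Wrap two implementations; publish comparison data on every call."""
    def run(*args, **kwargs):
        start = time.perf_counter()
        control_result = control(*args, **kwargs)
        control_ms = (time.perf_counter() - start) * 1000

        try:
            start = time.perf_counter()
            candidate_result = candidate(*args, **kwargs)
            candidate_ms = (time.perf_counter() - start) * 1000
            matched = candidate_result == control_result
        except Exception:
            matched, candidate_ms = False, None

        publish({"matched": matched,
                 "control_ms": control_ms, "candidate_ms": candidate_ms})
        return control_result  # callers always get the proven code path
    return run

# Toy usage: the new implementation disagrees on purpose.
results = []
guarded = experiment(lambda s: sorted(s),
                     lambda s: sorted(s, reverse=True),
                     results.append)
print(guarded([3, 1, 2]))        # → [1, 2, 3]  (control result, always)
print(results[0]["matched"])     # → False
```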

DNS is simple (kinda) but it can be really difficult to fully wrap your head around it. This article explains why, and in the process gives a blueprint for designing more understandable tools in general.

  Julia Evans

Fallback is different from Failover for a number of reasons. This article describes how they differ, how fallback works, and why you might choose it over failover.

  Alex Ewerlöf
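One way to see the difference in code (a toy sketch with made-up handlers): failover retries the same question against a standby replica for the full answer, while fallback settles for a degraded but useful answer when the primary fails.

```python
def failover(replicas):
    """Failover: try each replica in turn for the *same* full answer."""
    def call(request):
        last_exc = None
        for replica in replicas:
            try:
                return replica(request)
            except Exception as exc:
                last_exc = exc
        raise last_exc
    return call

def fallback(primary, degraded):
    """Fallback: if the primary fails, return a degraded answer instead."""
    def call(request):
        try:
            return primary(request)
        except Exception:
            return degraded(request)
    return call

# Hypothetical handlers for illustration.
def flaky(request):
    raise TimeoutError("primary down")

serve = fallback(flaky, lambda request: "stale cached answer")
print(serve("GET /recommendations"))     # → stale cached answer

replicas = failover([flaky, lambda request: "fresh answer from standby"])
print(replicas("GET /recommendations"))  # → fresh answer from standby
```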

Repository Purpose: Provide teams and individuals an idea on what to take into consideration and what to aspire for in the SRE field and work

Note: these checklists are opinionated.

  Arie Bregman

A thought-provoking article on trying to change people’s behavior in incidents through incentives (positive or negative) without also changing the context in which they act.

  Fred Hebert — Learning From Incidents

Cloudflare shares what they learned as they transitioned their KV service to a new architecture which resulted in multiple unexpected problems.

  Matt Silverlock, Charles Burnett, Rob Sutter, and Kris Evans — Cloudflare

In this article, learn about two interesting strategies for getting an organization to prioritize technical debt work: using a more specific name for the work, and referencing the work’s impact on an SLO — and the impact of not doing the work.

  Emily Nakashima — Honeycomb
  Full disclosure: Honeycomb is my employer.

SRE Weekly Issue #383

A message from our sponsor, Rootly:

Eliminate the anxiety around declaring an incident for nebulous problems by introducing a triage phase into your incident management process. Our latest blog post dives into why the triage phase is so important, and how you can automate yours with Rootly.

Read more on the Rootly blog:
https://rootly.com/blog/improve-visibility-and-capture-more-data-with-triage-incidents

Articles

This delightful talk explores what SRE can look like in practical terms by learning about the sociotechnical situation at a fictitious company. To do that, Amy Tobey plays a game she created, walking through a town and talking to NPCs.

  Amy Tobey — InfoQ

Honeycomb had a major outage last Tuesday, and they posted this interim outage report on their status page.

Note: Honeycomb is my employer, and I proofread this article.

  Honeycomb

The system resiliency pyramid provides a holistic framework for thinking about reliability across five key layers.

I like the way this system of layers breaks down the multiple different aspects of reliability.

  Code Reliant

This article explores system overload using a traffic congestion analogy. I especially like the note about failover as a cause of an overload condition.

  Tanveer Gill — FluxNinja

In this article, I’ll dive into this vital DORA metric, detail its benchmarks, and provide practical insights to help you drive more frequent successful changes.

  incident.io

This article explains four different rate limiting algorithms and includes code snippets in Java.

  Code Reliant
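The article’s snippets are in Java; as a taste of the same material, here’s a Python sketch of one classic algorithm it covers, the token bucket (timestamps are passed explicitly to keep the example deterministic):

```python
# Token bucket: tokens refill at a fixed rate up to a burst capacity,
# and each allowed request spends one token.

class TokenBucket:
    def __init__(self, rate_per_sec: float, capacity: float, now: float = 0.0):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = capacity  # start full: allows an initial burst
        self.last = now

    def allow(self, now: float) -> bool:
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# 1 request/sec sustained, bursts of 2: the third rapid request is shed.
bucket = TokenBucket(rate_per_sec=1, capacity=2)
print([bucket.allow(t) for t in (0.0, 0.1, 0.2, 1.5)])
# → [True, True, False, True]
```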

PostgreSQL vacuuming can be a total pain — and a serious threat to performance and reliability. This new database engine sounds pretty interesting.

  Oriole

Current IaC tools are like plain HTML, says this author, and we should have something like CSS to avoid repeating ourselves.

  Nathan Peck

PagerDuty looks back on a decade of weekly chaos experiments and shares advice on starting your own similar program.

  Cristina Dias — PagerDuty

A production of Tinker Tinker Tinker, LLC