SRE WEEKLY – Page 3 – scalability, availability, incident response, automation

SRE Weekly Issue #414

lex

March 3, 2024

2024 VOID Report

This year’s VOID Report is out, and it’s well worth a read. The subtitle is “Exploring the Unintended Consequences of Automation in Software” which is a really good way to get me to read something!

Courtney Nash — The VOID

How DoorDash Ensures Velocity and Reliability through Policy Automation

A Terraform change deleted a critical resource, and reviewers missed it because the plan was so big. Now they use Atlantis and Open Policy Agent to avoid accidental deletions of critical resources.

Lin Du — InfoQ
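
To make the idea of a deletion guard concrete, here is a minimal sketch in Python. It is not DoorDash's policy, and it is not written in Rego (Open Policy Agent's policy language); it only illustrates the kind of check such a policy might perform. It assumes the plan has been exported as JSON (for example with `terraform show -json`), where each entry in `resource_changes` carries an `address` and a list of `change.actions`. The protected addresses and the sample plan are hypothetical.

```python
import json

# Hypothetical list of resource addresses that must never be deleted.
PROTECTED_ADDRESSES = {"aws_route53_zone.primary", "aws_s3_bucket.billing_data"}


def find_protected_deletions(plan: dict, protected: set[str]) -> list[str]:
    """Return addresses of protected resources that the plan would delete."""
    flagged = []
    for change in plan.get("resource_changes", []):
        actions = change.get("change", {}).get("actions", [])
        # A replacement lists both "delete" and "create", so it is caught too.
        if "delete" in actions and change.get("address") in protected:
            flagged.append(change["address"])
    return flagged


# Tiny hand-written plan fragment, for illustration only.
sample_plan = json.loads("""
{
  "resource_changes": [
    {"address": "aws_s3_bucket.billing_data", "change": {"actions": ["delete"]}},
    {"address": "aws_instance.web", "change": {"actions": ["update"]}}
  ]
}
""")

violations = find_protected_deletions(sample_plan, PROTECTED_ADDRESSES)
print(violations)
```

A check like this could fail the review step whenever `violations` is non-empty, so a destructive change is surfaced even when the plan is too large to read line by line.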

What if everybody did everything right?

When analyzing an incident, what can we learn when we assume that everyone did everything as well as possible?

Lorin Hochstein

Google Cloud Incident Report: europe-west8-b partial outage

onsite technicians performing this planned network maintenance inadvertently unplugged several fibers that were adjacent to those in the work order, but still in use for production traffic

Google

What Does 99.999% Uptime Really Mean?

There’s a huge difference between four and five nines. The article also includes an especially interesting quote: Google doesn’t think five nines is attainable in a commercial service.

Diana Bocco — UptimeRobot
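
The gap between four and five nines is easiest to see as a yearly downtime budget. The short Python sketch below is not taken from the article; it just does the arithmetic, assuming a 365-day year and a hypothetical helper name.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # non-leap year, for simplicity


def allowed_downtime_seconds(availability_percent: float) -> float:
    """Return the yearly downtime budget implied by an availability target."""
    return SECONDS_PER_YEAR * (1 - availability_percent / 100)


for target in (99.99, 99.999):
    minutes = allowed_downtime_seconds(target) / 60
    print(f"{target}% -> {minutes:.2f} minutes of downtime per year")
```

Four nines allows roughly 52.56 minutes of downtime per year, while five nines allows only about 5.26 minutes, which leaves very little room for a human to respond to even a single incident.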

Life as a Site Reliability Engineer at IBM

Here’s an interview with three SREs about what it’s like to be an SRE at IBM.

IBM

The Cost Crisis in Observability Tooling

I’ve been hearing about Observability 2.0 but didn’t know what it was all about. This article explains what it is and how it can help with cost.

Charity Majors — Honeycomb
Full disclosure: Honeycomb is my employer.

Positive Affirmations for Site Reliability Engineers

A cute little video pep talk for SREs. The site is actually real, too!

Krazam

Happy Leap Day!

Like a mini Y2K, leap day came around again and left some technical glitches in its wake, as chronicled in this article.

Gergely Orosz — The Pragmatic Engineer

SRE Weekly Issue #413

lex

February 25, 2024

Sorry about the automation fail and resend! That definitely wasn’t issue #1.

The Domain of Failure

This article discusses building failure management directly into our systems, using Erlang as a case study.

Jamie Allen

Cinnamon: Using Century Old Tech to Build a Mean Load Shedder

Building on their experience with their previous load shedding library, Uber built a new one that requires no configuration.

Jakob Holdgaard Thomsen, Vladimir Gavrilenko, Jesper Lindstrom Nielsen, and Timothy Smyth — Uber

Conditional Love for AWS Metadata Enumeration

These folks found a way to get tag names and values from other people’s AWS resources. I know this is more security- than SRE-related, but the technique is just so cool!

Daniel Grzelak — Plerion

Justifying Resilience Work

How much does it cost to improve resilience? What’s the ROI? It’s fuzzy, but we still need to do it.

Will Gallego

SREday – London, UK, Sep 19-20, 2024

Check it out, it’s an entire SRE conference I was totally unaware of!

SREday

SLA vs. SLO vs. SLI: What’s the Difference?

It’s an SLI/SLO/SLA explainer, but with a twist: a pros and cons list for each of the three.

Laura Clayton — UptimeRobot

What were your worst on-call experiences?

A great reddit thread for some schadenfreude… and perhaps you’d like to share your own story?

u/New_Detective_1363 and others — reddit

End of support for repl.co & recent issues explained

What an interesting cause for an incident: the service your customers have pointed your product at decides to block your requests, effectively DoSing your systems.

Tomas Koprusak — UptimeRobot

The Role of CAP Theorem in Modern Day Distributed Systems

The CAP theorem is useful as a theory, but what does it actually mean in practice?

neda — ReadySet

SRE Weekly Issue #412

lex

February 18, 2024

The Single Pain of Glass

Can a single dashboard that covers your entire system really exist?

Jamie Allen

The importance of SEV-1 call leaders

This one makes the case for having a group of specially-trained incident commanders to handle SEV-1 (worst-case) outages, separate from your normal ICs.

Jonathan Word

Getting Buy-in from Management on Reliability Investments

This article lays out a strategy for gaining buy-in by making three specific, sequential arguments.

Emily Arnott — Blameless

SRE Archetypes

This article explores the varying ways that SRE is implemented through a set of four archetypes.

Alex Ewerlöf

connect() – why are you so slow?

It turns out that assigning ephemeral ports to connections in Linux is way more complicated than it might seem at first glance, and there’s room for optimization, as this article explains.

Frederick Lawler — Cloudflare

Simple Precision Time Protocol at Meta

While deploying Precision Time Protocol (PTP) at Meta, we’ve developed a simplified version of the protocol (Simple Precision Time Protocol – SPTP), that can offer the same level of clock synchronization as unicast PTPv2 more reliably and with fewer resources.

Oleg Obleukhov and Ahmad Byagowi — Meta

A Distributed Systems Reading List

Far more than just a list of links, this article gives an overview of each topic before pointing you in the right direction for more information.

Fred Hebert

Streamlining and Implementing Incident Management at Dyninno

Building on the groundwork laid out in our first article about the initial steps in Incident Management (IM) at Dyninno Group, this second installment will explore the practicalities of streamlining and implementing these strategies.

Vladimirs Romanovskis

SRE Weekly Issue #411

lex

February 11, 2024

Shared On-Call Is Where the SRE Magic Happens

Software engineers and SREs should share a single on-call rotation as part of a single team, as this is where empathy for each other is built.

Jamie Allen

Pinterest Goes HTTP/3: Boosts Performance

I was pretty fuzzy on what HTTP/3 was all about, but this article set me straight.

Roopa Kushtagi

Architecture Style: Modulith (vs. Microservices)

An overview of the modulith pattern including reasons to choose modulith over microservices.

Pier-Jean Malandrino

“Why Are We Having More Incidents?” Causal Loops in Reactions to Unwanted Events

This article explores feedback loops formed out of various ways of responding to incidents that in turn increase the likelihood of more incidents. It took me a couple of tries to get into this one, but it was well worth my effort.

Steven Shorrock

A practical approach to on-call compensation

Here, we’re going to outline some practical things you should consider when visiting on-call compensation and the incentives you create around it. We’ll also share how we approach this conversation here at incident.io.

incident.io

GitHub – mxssl/sre-interview-prep-guide: Site Reliability Engineer Interview Preparation Guide

This link-aggregation repo isn’t just about interviewing for SRE roles. It also links to resources on a ton of topics relevant to those starting out in SRE.

@mxssl on GitHub

Paper: How audits fail according to incident investigations

Cool trick: this paper uses counterfactual “should have” statements for good as a way of surfacing what incident investigators wish auditing was looking for. Click through for Fred Hebert’s synopsis of the paper.

Fred Hebert (summary); Ben Hutchinson, Sidney Dekker, and Andrew Rae (original authors) — Process Safety Progress

Dyninno’s Incident Management: an Introduction

This article (part one in a series) follows the author’s journey to learn and improve incident management at their company.

Vladimirs Romanovskis — Dyninno

SRE Weekly Issue #410

lex

February 4, 2024

Staying in the Zone: How DoorDash used a service mesh to manage data transfer, reducing hops and cloud spend

In this blog post, we describe the journey DoorDash took using a service mesh to realize data transfer cost savings without sacrificing service quality.

Hochuen Wong and Levon Stepanian — DoorDash

APAC Retrospective: Learnings from a Year of Tech Outages – Dismantling Knowledge Silos

When just a few “regulars” are called in to handle every incident, you’ve got a knowledge gap to fill in your organization.

David Ridge — PagerDuty

How the data center site selection process works at Dropbox

Dropbox expands into new datacenters often, so they have a streamlined and detailed process for choosing datacenter vendors.

Edward del Rio — Dropbox

Untangle Blockers that impede Site Reliability Engineering (SRE) adoption.

This is either nine things that could derail your SRE program, or a list of things to do with “not” in front of them — either way, it’s a good list.

Shyam Venkat

Beyond Debugging: Harnessing Preattentive Processes in Incident Response

We need enough alerting in our systems that we can detect lurking anomalies, but not so much that we get alert fatigue.

Dennis Henry

SRE and Product

A post about the importance of product in SRE, and how to make product and SRE first-class citizens in your Software Development Lifecycle.

Jamie Allen

Panic on the Schoolyard: The Merion midair collision (death of Senator John Heinz)

A relatively minor incident took a turn for the worse after the pilots tried a close fly-by to resolve it. I swear I’ve been in this kind of incident before, where I took risks significantly out of proportion to the problem I was trying to solve.

Kyra Dempsey (Admiral Cloudberg)

A message from our sponsor, FireHydrant:
