
I’ll be at KubeCon North America

lex

November 6, 2023

Hi folks, sorry for invading your inbox / RSS feed an extra time this week! I forgot to mention in yesterday’s issue that I’ll be at KubeCon this week. Hit me up for some SRE Weekly swag (patches, decals, and stickers).

SRE Weekly Issue #397

lex

November 5, 2023

Modern rollback strategies

The length and complexity of this article hint at the theme that runs throughout: there’s no easy, universal, perfect rollback strategy. Instead, they present a couple of rollback strategies you can choose from and implement.

Bob Walker — Octopus Deploy

Make Your Jobs More Robust With Automatic Safety Switches

This article delves into enhancing error management in batch processing programs through the strategic implementation of automatic safety switches and their critical role in safeguarding data integrity during technical errors.

Bertrand Florat — DZone

How DoorDash Standardized and Improved Microservices Caching

Part of their observability strategy, which they call “shadowing”, is especially nifty.

Lev Neiman and Jason Fan — DoorDash

GitHub Availability Report: September 2023

It’s interesting that the DB failed in a way that GitHub’s Orchestrator deployment was unable to detect.

Jakub Oleksy — GitHub

Beyond Staff Engineer

What exactly is a Senior Staff Engineer? While this article is not specifically about Senior Staff SREs, it’s directly applicable, especially as I’ve seen more Staff+ SRE job postings in the past couple years.

Alex Ewerlöf

A guide to post-mortem meetings and how we run them at incident.io

“Blameless” doesn’t mean no names allowed!

Remember—if discussing the actions of a specific person is being done for the sake of better learning; don’t shy away from it.

incident.io

SRE Interview Prep Plan (Week 2)

This series is shaping up to be a great study guide for new SREs.

Each day of this week brings you one step closer to not only acing your SRE interviews but also becoming the SRE who can leverage code & infrastructure to perfect systems reliability.

Code Reliant

Automating product deprecation

A fascinating and scary concept: a tool for automatically identifying and performing all the changes involved in deprecating an entire product.

Will Shackleton, Andy Pincombe, and Katriel Cohn-Gordon — Meta

SRE Weekly Issue #396

lex

October 29, 2023

Translating Failures into Service-Level Objectives

Using 3 high-profile incidents from the past year, this article explores how to define SLOs that might catch similar problems, with a special focus on keeping the SLI close to the user experience.

Adriana Villela and Ana Margarita Medina — The New Stack

The costs of microservices

Microservices can have some great benefits, but if you want to build with them, you’re going to have to solve a whole pile of new problems.

Roberto Vitillo

How distributed systems fail

To protect your application against failures, you first need to know what can go wrong. […] the most common failures you will encounter are caused by single points of failure, the network being unreliable, slow processes, and unexpected load.

Roberto Vitillo

Sofia’s Observability Odyssey: The Do’s and Don’ts for Effective Observability

I love how this article keeps things interesting by starting with a fictional (but realistic) story about the dangers of over-alerting before continuing on to give direct advice.

Adso

Retries, Backoff and Jitter

I especially enjoy the section on the potential pitfalls and challenges with retries and how you can avoid them.

Code Reliant

As an SRE, how often are you directly involved with application code / logic?

This reddit thread is a goldmine, including this gem:

I actively avoid getting involved with software subject matter expertise, because it robs the engineering team of self-reliance, which is itself a reliability issue.

u/bv8z and others — reddit

crates.io Postmortem: Broken Crate Downloads

There’s a pretty cool “Five Whys”-style analysis that goes past “dev pushed unreviewed code with incomplete tests to production” and to the sociotechnical challenges underlying that.

Tobias Bieniek — crates.io

SRE Weekly Issue #395

lex

October 22, 2023

What every developer should know about database consistency

This article gives an overview of database consistency models and introduces the PACELC Theorem.

Roberto Vitillo

What is a Memory leak? Causes | Detection | Tools | Golang

A primer on memory and resource leaks, including some lesser-known causes.

Code Reliant

Rescue Struggling Pods from Scratch

How can you troubleshoot a broken pod when it’s built FROM scratch and you can’t even run a shell in it?

Mike Terhar
Full disclosure: Honeycomb is my employer.

Five mindset shifts for effective reliability programs

This article explains why reliability isn’t a one-off project that you can bolt on and then move on from.

Gavin Cahill — Gremlin

BPFAgent: eBPF for Monitoring at DoorDash

DoorDash wanted consistent observability across their infrastructure that didn’t depend on instrumenting each application. To solve this, they developed BPFAgent, and this article explains how.

Patrick Rogers — DoorDash

What is Mean Time to Innocence?

Mean time to innocence is the average elapsed time between when a system problem is detected and any given team’s ability to say the team or part of its system is not the root cause of the problem.

This article, of course, is about not having a culture like that.

John Burke — TechTarget

Details of the Pulumi Outage on October 6, 2023

It was the DB — more specifically, it was a DB migration with unintended locking.

Casey Huang — Pulumi

Google Cloud Networking Incident Report (2023-10-05)

The incident stemmed from a control plane change that worked in some regions but caused OOMs in others.

Google

SRE Weekly Issue #394

lex

October 15, 2023

A warm welcome to my new sponsor, FireHydrant!

Creating Checklists for High Stakes Changes

This article gives an example checklist for a database version upgrade in RDS and explains why checklists can be so useful for changes like this.

Nick Janetakis

The balancing act of reliability and availability

The distinction in this article is between responding at all and responding correctly. Different techniques solve for availability vs reliability.

incident.io

What every developer should know about TCP

Latency and throughput are inextricably linked in TCP, and this article explains why with a primer on congestion windows and handshakes.

Roberto Vitillo
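
To make the latency/throughput link concrete, here is a rough sketch of the standard back-of-the-envelope bound: a sender can have at most one window of data in flight per round trip, so throughput cannot exceed window size divided by round-trip time. The function name and the sample numbers (a 64 KiB window, a 100 ms round trip) are assumptions chosen for illustration, not figures from the article.

```python
def max_throughput_bits_per_second(window_bytes, rtt_seconds):
    """Upper bound on throughput when at most one window is in flight per round trip."""
    return window_bytes * 8 / rtt_seconds


# Hypothetical connection: 64 KiB window, 100 ms round-trip time.
window_bytes = 64 * 1024
slow_link = max_throughput_bits_per_second(window_bytes, 0.100)
fast_link = max_throughput_bits_per_second(window_bytes, 0.010)

print(f"100 ms RTT: {slow_link / 1e6:.2f} Mbit/s")
print(f" 10 ms RTT: {fast_link / 1e6:.2f} Mbit/s")
```

With the window held constant, cutting the round-trip time by a factor of ten raises the ceiling by the same factor, regardless of how much bandwidth the link itself has.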

Why you should measure tail latencies

Tail latency has a huge impact on throughput and on the overall user experience. Measuring average latency just won’t cut it.

Roberto Vitillo
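
As a small illustration of why the average can mislead, the sketch below compares the mean with the median and the 99th percentile on made-up data. The sample latencies and the nearest-rank `percentile` helper are assumptions for this example, not taken from the article.

```python
import statistics


def percentile(samples, p):
    """Nearest-rank percentile for an integer p in 1..100."""
    ordered = sorted(samples)
    # Ceiling of p * n / 100, done in integer arithmetic to avoid float rounding.
    rank = (p * len(ordered) + 99) // 100
    return ordered[rank - 1]


# Hypothetical sample: 98 fast requests at 10 ms and 2 slow ones at 2000 ms.
latencies_ms = [10] * 98 + [2000] * 2

mean_ms = statistics.mean(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p99_ms = percentile(latencies_ms, 99)

print(f"mean={mean_ms} ms, p50={p50_ms} ms, p99={p99_ms} ms")
```

In this sample the mean lands near 50 ms, a latency that no request actually experienced, while the 99th percentile exposes the two-second outliers that the slowest users hit.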

A Brief, Incomplete and Mostly Wrong Devops Glossary

Is it really wrong though? Is it?

Adam Gordon Bell — Earthly

Part One: Exploring Aviation’s Human Factors ‘Dirty Dozen’

I’ve shared the FAA’s infographic of the Dirty Dozen here previously, but here’s a more in-depth look at the first six items.

Dr. Omar Memon — Simple Flying

More than five whys and “layer eight” problems

It’s often necessary to go through far more than five whys to understand what’s really going on in a sociotechnical system.

rachelbythebay

SRE Story with Michael Hausenblas

I found the bit about the AWS Incident/Communication Manager on-call role pretty interesting.

Prathamesh Sonpatki — SRE Stories
