General

SRE Weekly Issue #374

lex

May 28, 2023

Articles

More Memory, More Problems

A fascinating PostgreSQL debugging story that hinges on code comments, of all things.

Christopher White — Prefect

Redpanda’s official Jepsen report: What we fixed, and what we shouldn’t

If you’re a distributed systems nerd, this one’s a real treat. It’s a detailed breakdown of the results of a Jepsen test.

Denis Rystsov — Redpanda

Unbounded memory usage by TCP for receive buffers, and how we fixed it

An investigation into a kernel bug that caused excessive TCP memory usage in certain situations.

Mike Freemon — Cloudflare

Scaling Site Reliability Engineering (SRE) Teams the Right Way

Let’s unpack what scaling a team is all about, what are the indicators, what are steps you can take, and how you know if you’re done.

Biju Chacko — Squadcast

Running Post-Mortems

Here’s another guide on running incident retrospectives and building a repeatable retrospective process.

Amin Astaneh — Certo Modo

New playground: memory spy

Here’s a fun little tool that lets you inspect how data in a C program is represented in memory.

Julia Evans

How we learned to improve Kubernetes CronJobs at Scale

This two-part series explores some shortcomings in Kubernetes’s CronJob system and the ways that Lyft fixed and worked around them.

Kevin Yang — Lyft

Kubernetes CronJob Failed For 24 Days: a Retrospective

And here’s a case where someone ran into the Kubernetes CronJob bug described in the previous article.

Vallery Lancey

SRE Weekly Issue #373

lex

May 21, 2023

Articles

2023 03 08 Incident: Infrastructure connectivity issue affecting multiple regions

Datadog posted a report on their major outage in March, and it’s a doozy. An unattended updates system that they didn’t even want, need, or know about triggered across all hosts in multiple clouds nearly simultaneously, causing a regression.

Alexis Lê-Quôc — Datadog

Addressing GitHub’s recent availability issues

GitHub has had a string of apparently unrelated outages recently, and they’ve posted this description.

Mike Hanley — GitHub

StanzaSystems/awesome-load-management

Oh look, another awesome-* repo relevant to our interests!

A repo of links to articles, papers, conference talks, and tooling related to load management in software services: loadshedding, circuitbreaking, quota management and throttling. PRs welcome.

Laura Nolan and Niall Murphy — Stanza Systems

SRE Story with Matthew Iselin

This interview covers a lot of ground including looking beyond just “up or down” when considering reliability.

Prathamesh Sonpatki — SRE Stories

Debugging a FUSE deadlock in the Linux kernel

If you’re in the mood for a deep systems debugging story, you’re in for a treat. The author takes you along for the ride with a wealth of detailed code snippets.

Tycho Andersen — Netflix

Why `fsync()`: Losing Unsynced Data

Regardless of the replication mechanism you must fsync() your data to prevent global data loss in non-Byzantine protocols.

Denis Rystsov and Alexander Gallego — Redpanda
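For readers who have not used the call directly, here is a minimal sketch of what "fsync() your data" means at the level of a single file. It is not taken from the Redpanda article; the function name `durable_write` and the file name in the usage note are hypothetical, and real storage engines also have to think about directory entries, batching, and error handling.

```python
import os


def durable_write(path: str, data: bytes) -> None:
    """Write data to path and ask the OS to persist it before returning."""
    with open(path, "wb") as f:
        f.write(data)
        # flush() only moves bytes from Python's buffer into the OS page cache.
        f.flush()
        # fsync() asks the kernel to push those cached pages to the storage device.
        os.fsync(f.fileno())
```

Calling `durable_write("log.bin", b"record-1\n")` returns only after the operating system reports the data as written to the device. Without the `os.fsync` call, an acknowledged write may still exist only in memory when the machine loses power.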

Emotional Intelligence

Emotional intelligence is a critical skill for SREs, especially when we interact with other teams in fraught situations.

Amin Astaneh — Certo Modo

Fleet Management at Spotify (Part 3): Fleet-wide Refactoring

Wow! Spotify created a set of tools to perform automated refactoring of thousands of repositories at once. This includes the ability to run tests, automatically merge pull requests without human review, and roll refactorings out gradually.

Matt Brown — Spotify

Teach me how to Howie!

Jeli has published a one-page cheat-sheet for their highly-detailed Howie guide for running incident retrospectives.

Jeli

SRE Weekly Issue #372

lex

May 14, 2023

Articles

Read Every Single Error

At Pulumi we read every single error message that our API produces. This is the primary mechanism that led to a 17x YoY reduction in our error rate.

Evan Boyle — Pulumi

Uptime Guarantees — A Pragmatic Perspective

Rather than striving for a million nines, we should choose the right reliability target based on an evaluation of the effect of downtime on the business.

Itzy Sabo — HEY
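To make the "nines" in the item above concrete, here is a small arithmetic sketch of how much downtime a given availability target allows. The function name and the 30-day window are assumptions chosen for illustration; they do not come from the linked article.

```python
def allowed_downtime_minutes(availability_pct: float, period_days: float = 30.0) -> float:
    """Return the minutes of downtime permitted by an availability target."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability_pct / 100)


# Compare a few common targets over a 30-day window.
for target in (99.0, 99.9, 99.99):
    print(f"{target}% allows about {allowed_downtime_minutes(target):.2f} minutes")
```

Each extra nine cuts the budget by a factor of ten, which is why the right target depends on what downtime actually costs the business.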

Reckoning with the Harm We Do: In Search of Restorative Just Culture in Software and Web Operations

This is a presentation of a study of harm and trauma resulting from incident response work. I especially like the part about blamelessness in theory versus practice.

Jessica DeVita — InfoQ

Learning from incidents is not the goal

Perhaps a sensationalist title, but there’s a really good point here: learning from incidents is only practical if it actually improves the business.

Chris Evans — incident.io

Real-Time Presence Platform System Design

A highly-detailed proposal for a system to track which users are online at a huge scale.

Nk — System Design

Upscaling LinkedIn’s Profile Datastore While Reducing Costs

However, for any cache to be used for the purpose of upscaling, it must operate completely independent from the source of truth (SOT) and must not be allowed to fall back to the SOT on failures.

Estella Pham and Guanlin Lu – LinkedIn

The Madness in our Methods: The crash of Germanwings flight 9525 and our broken aeromedical system

If you design your system to make lying the only viable option, then people will lie. To me, this article is all about understanding that our systems involve real, squishy humans, and designing appropriately.

Admiral Cloudberg

SRE Weekly Issue #371

lex

May 7, 2023

Articles

Is there such a thing as a system that’s too reliable?

NASA chose to squeeze just a bit more science out of the Voyager spacecrafts’ aging power supplies by sacrificing a layer of redundancy. I love this so much, because it sounds just like the kinds of decisions we make during incidents.

Robert Barron — IBM

Observability maven ‘cranky’ about AIOps embraces GPT

I really debated about including this one, because I don’t often include articles about new products, and I think especially critically when the company in question is my employer.

With all that in mind, I’m including this one anyway because Charity Majors really put a fine point on exactly why I, too, am cranky about AIOps.

Beth Pariseau — TechTarget
Full disclosure: Honeycomb, my employer, is mentioned.

Assembly time is where you have the most control of an incident

The main reason that MTTR is a flawed metric is that the nature of each incident varies so wildly. Time to assemble, though, is much closer to being under our control.

Robert Ross — FireHydrant

How to improve incident triaging for better organization-wide incident response

The folks at incident.io recommend being expansive in what is considered an incident and then using a defined process to find the real incidents, determine impact and priority, and assign to the right team for resolution.

Luis Gonzalez — incident.io

GitHub Availability Report: April 2023

GitHub had some interesting incidents this time around, in several cases stemming from changes made with the intention of improving reliability.

Jakub Oleksy — GitHub

Migrating Critical Traffic At Scale with No Downtime — Part 1

Netflix records and replays live traffic in a testbed environment in order to validate a migration plan before they ever impact real customers.

Shyam Gala, Javier Fernandez-Ivern, Anup Rokkam Pratap, and Devang Shah — Netflix

Scaling up the Prime Video audio/video monitoring service and reducing costs by 90%

The move from a distributed microservices architecture to a monolith application helped achieve higher scale, resilience, and reduce costs.

I’ve seen this sentiment more frequently recently. Are we at the cusp of a general shift away from microservices?

Marcin Kolny — Amazon Prime Video

SRE Weekly Issue #370

lex

May 1, 2023

Articles

Improving Incident Recovery By using SLI Pyramid

[…] although “getting the system back up” should be our first priority, to do so safely, we first need to very carefully define what “up” means.

What functionality is critical? Should we sacrifice feature A to save feature B? It’s important to plan ahead.

Boris Cherkasky

Slack Said It Had 100% Uptime. Did It Really?

It turns out that it depends on how you define “uptime”. Does claiming “100%” actually benefit you?

Ellen Steinke — Metrist

The importance of right-sizing your retro

Skipping the retro shouldn’t be an option. Ditch the one-size-fits-all process to ensure that this important step is held at the end of every incident.

Jouhné Scott — FireHydrant

Site Reliability Engineering 101

Another good one to have in your back pocket for those “What would you say… you do here?” moments.

Ash Patel — SREPath

The True Cost of Building Your Own IMS

Build versus buy for incident management systems: what is the true cost of rolling your own?

Biju Chacko and Nir Sharma — Squadcast

Deploy AWS Resources Seamlessly With ChatGPT

A plugin to give ChatGPT the ability to run AWS API calls. I’m not sure how I feel about this.

Banjo Obayomi — DZone

Improved Alerting with Atlas Streaming Eval

They solved a cardinality explosion by switching from query-based alerting to stream data processing.

Ruchir Jha, Brian Harrington, and Yingwu Zhao — Netflix
