SRE Weekly Issue #375

A message from our sponsor, Rootly:

Curious how companies like Figma, Tripadvisor, and 100s of others leverage Rootly to manage incidents in Slack and unlock instant best practices? Check out this lightning demo:
https://www.loom.com/share/051c4be0425a436e888dc0c3690855ad

Articles

An in-depth analysis of the crash of a recent lunar lander. It’s really interesting that a feature designed specifically to improve robustness to failures instead made the system less reliable in unforeseen circumstances.

  Robert Barron — IBM

With each external cloud service you deploy, you introduce the amount of unreliability that product has into your own product’s reliability (even if it’s incredibly small).

  Jeff Martens — The New Stack
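To put rough numbers on that point, here is a minimal sketch. It assumes every external service is a hard dependency and that failures are independent, which real systems rarely satisfy exactly. The function names and availability figures are hypothetical, chosen only for illustration.

```python
from math import prod

MINUTES_PER_30_DAYS = 30 * 24 * 60


def composite_availability(own, dependencies):
    """Best-case availability when every dependency is required.

    Assumes failures are independent, so availabilities multiply.
    """
    return own * prod(dependencies)


def downtime_minutes(availability, period_minutes=MINUTES_PER_30_DAYS):
    """Expected unavailable minutes over the period (30 days by default)."""
    return (1 - availability) * period_minutes


if __name__ == "__main__":
    # Hypothetical figures: our own service plus three cloud dependencies.
    combined = composite_availability(0.999, [0.9999, 0.9995, 0.999])
    print(f"composite availability: {combined:.5f}")
    print(f"expected downtime per 30 days: {downtime_minutes(combined):.1f} min")
```

Each added dependency can only lower the product, which is why even a very reliable service still takes something away from your own reliability budget.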

Are you a software engineer or an IT professional interested in transitioning to an SRE role? You’ve come to the right place! This article provides guidance on the skills and behaviors needed to apply for an SRE position at medium-to-large-sized tech companies successfully.

  Amin Astaneh — Certo Modo

While it can seem pretty insignificant, properly distinguishing between an incident and a bug is worthwhile. Why? Because it will ultimately help dictate your response to it.

  Luis Gonzalez — incident.io

This is impressive: an engineer built an entire model of a ride-share system, complete with simulated riders and drivers, metrics, containerization, the works, all to gain a better understanding of how these kinds of systems work.

  Gergely Orosz — Pragmatic Engineer

This article answers the most important questions:
* How is using service levels any different than “regular” alarms?
* What’s in it for the company and the teams?
* Why bother? Don’t we already have enough work to do?

  Alex Ewerlöf

Here at eBay, we’ve crafted a brand new approach to automate platform evolution for all applications — one that provides a repeatable and reusable infrastructure to streamline evolution.

  Paul Zhang and Tao Jin

Interesting idea: feeding trace data into an LLM and asking it to build an end-to-end (E2E) test for the entire system. This article is a good description of what they’re doing but I’d be interested to hear more about the results.

  Nir Gazit — Honeycomb
  Full disclosure: Honeycomb is my employer.

What conclusions can we draw from the recent announcement that Amazon Prime Video is moving from serverless to a monolith?

The supposed difference between the two methods is not based on the technology itself, but the context in which you’re working.

  Ian Miell

SRE Weekly Issue #374

A message from our sponsor, Rootly:

Rootly is hiring for a Sr. Developer Relations Advocate to continue helping more world-class companies like Figma, NVIDIA, and Squarespace accelerate their incident management journey. Looking for previous on-call engineers with a passion for making the world a more reliable place. Learn more:

https://rootly.com/careers?gh_jid=4015888007

Articles

A fascinating PostgreSQL debugging story that hinges on code comments, of all things.

  Christopher White — Prefect

If you’re a distributed systems nerd, this one’s a real treat. It’s a detailed breakdown of the results of a Jepsen test.

  Denis Rystsov — Redpanda

An investigation into a kernel bug that caused excessive TCP memory usage in certain situations.

  Mike Freemon — Cloudflare

Let’s unpack what scaling a team is all about, what are the indicators, what are steps you can take, and how you know if you’re done.

  Biju Chacko — Squadcast

Here’s another guide on running incident retrospectives and building a repeatable retrospective process.

  Amin Astaneh — Certo Modo

Here’s a fun little tool that lets you inspect how data in a C program is represented in memory.

  Julia Evans

This two-part series explores some shortcomings in Kubernetes’s CronJob system and the ways that Lyft fixed and worked around them.

  Kevin Yang — Lyft

And here’s a case where someone ran into the Kubernetes CronJob bug described in the previous article.

  Vallery Lancey

SRE Weekly Issue #373

A message from our sponsor, Rootly:

Rootly is hiring for a Sr. Developer Relations Advocate to continue helping more world-class companies like Figma, NVIDIA, and Squarespace accelerate their incident management journey. Looking for previous on-call engineers with a passion for making the world a more reliable place. Learn more:

https://rootly.com/careers?gh_jid=4015888007

Articles

Datadog posted a report on their major outage in March, and it’s a doozy. An unattended updates system that they didn’t even want, need, or know about triggered across all hosts in multiple clouds nearly simultaneously, causing a regression.

  Alexis Lê-Quôc — Datadog

GitHub has had a string of apparently unrelated outages recently, and they’ve posted this description.

  Mike Hanley — GitHub

Oh look, another awesome-* repo relevant to our interests!

A repo of links to articles, papers, conference talks, and tooling related to load management in software services: loadshedding, circuitbreaking, quota management and throttling. PRs welcome.

  Laura Nolan and Niall Murphy — Stanza Systems

This interview covers a lot of ground including looking beyond just “up or down” when considering reliability.

  Prathamesh Sonpatki — SRE Stories

If you’re in the mood for a deep systems debugging story, you’re in for a treat. The author takes you along for the ride with a wealth of detailed code snippets.

  Tycho Andersen — Netflix

Regardless of the replication mechanism you must fsync() your data to prevent global data loss in non-Byzantine protocols.

  Denis Rystsov and Alexander Gallego — Redpanda
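As a small illustration of what "you must fsync() your data" means in practice, here is a minimal sketch of a durable append in Python. It is not taken from the article. `durable_append` is a hypothetical helper name, and the sketch covers a single local file only, not a replicated log.

```python
import os


def durable_append(path, record: bytes) -> None:
    """Append a record and return only after the OS reports it is on stable storage."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        # os.write may write fewer bytes than requested, so loop until done.
        view = memoryview(record)
        while view:
            written = os.write(fd, view)
            view = view[written:]
        # Without this call the data may still sit in the page cache
        # and be lost if the machine loses power.
        os.fsync(fd)
    finally:
        os.close(fd)


if __name__ == "__main__":
    durable_append("example.log", b"acknowledged write\n")
```

A node would acknowledge a write to its peers or clients only after `durable_append` returns. On POSIX systems a newly created file may also need an fsync of its parent directory before the directory entry itself is durable, which this sketch leaves out.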

Emotional intelligence is a critical skill for SREs, especially when we interact with other teams in fraught situations.

  Amin Astaneh — Certo Modo

Wow! Spotify created a set of tools to perform automated refactoring of thousands of repositories at once. This includes the ability to run tests, automatically merge pull requests without human review, and roll refactorings out gradually.

  Matt Brown — Spotify

Jeli has published a one-page cheat-sheet for their highly-detailed Howie guide for running incident retrospectives.

  Jeli

SRE Weekly Issue #372

A message from our sponsor, Rootly:

Rootly is hiring for a Sr. Developer Relations Advocate to continue helping more world-class companies like Figma, NVIDIA, and Squarespace accelerate their incident management journey. Looking for previous on-call engineers with a passion for making the world a more reliable place. Learn more:

https://rootly.com/careers?gh_jid=4015888007

Articles

At Pulumi we read every single error message that our API produces. This is the primary mechanism that led to a 17x YoY reduction in our error rate.

  Evan Boyle — Pulumi

Rather than striving for a million nines, we should choose the right reliability target based on an evaluation of the effect of downtime on the business.

  Itzy Sabo — HEY

This is a presentation of a study of harm and trauma resulting from incident response work. I especially like the part about blamelessness in theory versus practice.

  Jessica DeVita — InfoQ

Perhaps a sensationalist title, but there’s a really good point here: learning from incidents is only practical if it actually improves the business.

  Chris Evans — incident.io

A highly-detailed proposal for a system to track which users are online at a huge scale.

  Nk — System Design

However, for any cache to be used for the purpose of upscaling, it must operate completely independent from the source of truth (SOT) and must not be allowed to fall back to the SOT on failures.

  Estella Pham and Guanlin Lu – LinkedIn

If you design your system to make lying the only viable option, then people will lie. To me, this article is all about understanding that our systems involve real, squishy humans, and designing appropriately.

  Admiral Cloudberg

SRE Weekly Issue #371

A message from our sponsor, Rootly:

Rootly is hiring for a Sr. Developer Relations Advocate to continue helping more world-class companies like Figma, NVIDIA, and Squarespace accelerate their incident management journey. Looking for previous on-call engineers with a passion for making the world a more reliable place. Learn more:

https://rootly.com/careers?gh_jid=4015888007

Articles

NASA chose to squeeze just a bit more science out of the Voyager spacecrafts’ aging power supplies by sacrificing a layer of redundancy. I love this so much, because it sounds just like the kinds of decisions we make during incidents.

  Robert Barron — IBM

I really debated about including this one, because I don’t often include articles about new products, and I think especially critically when the company in question is my employer.

With all that in mind, I’m including this one anyway because Charity Majors really put a fine point on exactly why I, too, am cranky about AIOps.

  Beth Pariseau — TechTarget
  Full disclosure: Honeycomb, my employer, is mentioned.

The main reason that MTTR is a flawed metric is that the nature of each incident varies so wildly. Time to assemble, though, is much closer to being under our control.

  Robert Ross — FireHydrant

The folks at incident.io recommend being expansive in what is considered an incident and then using a defined process to find the real incidents, determine impact and priority, and assign to the right team for resolution.

  Luis Gonzalez — incident.io

GitHub had some interesting incidents this time around, in several cases stemming from changes made with the intention of improving reliability.

  Jakub Oleksy — GitHub

Netflix records and replays live traffic in a testbed environment in order to validate a migration plan before they ever impact real customers.

  Shyam Gala, Javier Fernandez-Ivern, Anup Rokkam Pratap, and Devang Shah — Netflix

The move from a distributed microservices architecture to a monolith application helped achieve higher scale, resilience, and reduce costs.

I’ve seen this sentiment more frequently recently. Are we at the cusp of a general shift away from microservices?

  Marcin Kolny — Amazon Prime Video

A production of Tinker Tinker Tinker, LLC