SRE Weekly Issue #421

Last week, I mistakenly attributed [an article](https://www.paigerduty.com/sre-biggest-problem/) to PagerDuty. Actually, it was by Paige Cruz, whose clever blog name I didn’t pay anywhere near close enough attention to! Thanks to the several readers who nudged me gently about my error.

A message from our sponsor, FireHydrant:

FireHydrant is now AI-powered for faster, smarter incidents! Power up your incidents with auto-generated real-time summaries, retrospectives, and status page updates.
https://firehydrant.com/blog/ai-for-incident-management-is-here/

If you’ve been in this business long enough, you’ve almost certainly run into an incident where one of the contributors was an implicit invariant that was violated by a new change.

Easily the majority of incidents I’ve been in.

  Lorin Hochstein

This article is about trying to solve for this problem:

a potentially significant number of customers or queries can be affected by an outage and this won’t trigger an SLO violation.

  Niall Murphy

A surgeon grapples with the difficulty of building a culture of retrospectives and introspection in his surgical team by running a fascinating retro on himself in this blog post.

  Robert Poston, MD

An argument for buying yourself time to slow down and make decisions carefully, as a way of ultimately speeding up incident resolution.

  Shayon Mukherjee

Disasters threatening a business’ ability to operate core functions don’t occur that often (phew!), but we do want to ensure we are prepared to keep our business running if they do. To practice disaster response skills, we run business continuity drills, and you can too with our 10-step plan!

  Janna Brummel — WeTransfer

How people think about reliability varies between companies. Which of the four different perspectives laid out in this article does your company fit into, if any?

  Ross Brodbeck

Honeycomb posted this followup on their April 9 outage, explaining what went wrong and how they’re responding.

  Honeycomb

  Full disclosure: Honeycomb is my employer.

The author of this article posed a question on r/sre:

What matters most for your success as an SRE?

They share a summary of the answers they got, with their commentary.

  Nočnica Mellifera — Checkly

SRE Weekly Issue #420

The game Last Epoch launched in February, and they had a rocky start. This huge retrospective post tells the story of what happened and how they fixed it.

  EHG_Kain — Last Epoch

Cloudflare’s Phoenix system can find and recover failed servers, reducing toil.

  Jet Mariscal, Aakash Shah, and Yilin Xiong — Cloudflare

More than just another glossary of SL*s, this one also has examples and best practices.

  Sara Miteva — Checkly

Spurred by a question in the SRECon attendee survey, this one really gets you thinking: how does the current “generation” of SREs differ from those that came before?

  Paige — PagerDuty

This one’s about finding out what execs need in incidents and figuring out how to get everyone’s needs met.

  Chris Evans — incident.io

This post explains how Cloudflare gathers information about their alerts and improves them to benefit reliability and on-call health.

  Monika Singh — Cloudflare

This one contains formulas for calculating compound SLOs when downstream dependencies are parallel or serial.

  Alex Ewerlöf
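
As a rough illustration of the serial and parallel cases, here is a minimal sketch using the textbook reliability model, which assumes dependencies fail independently. The article's own formulas may differ; the function names and example figures below are hypothetical.

```python
from math import prod


def serial_availability(availabilities):
    """Serial chain: every dependency must be up, so availabilities multiply.

    Assumes failures are independent (an assumption of this sketch).
    """
    return prod(availabilities)


def parallel_availability(availabilities):
    """Redundant (parallel) dependencies: the group is down only if all are down.

    Assumes failures are independent (an assumption of this sketch).
    """
    return 1 - prod(1 - a for a in availabilities)


# Two 99.9% services called one after the other.
serial = serial_availability([0.999, 0.999])

# Two 99% replicas where either one can serve the request.
parallel = parallel_availability([0.99, 0.99])

print(f"serial:   {serial:.6f}")
print(f"parallel: {parallel:.6f}")
```

Under this model a serial chain is always less available than its weakest link, while redundancy raises availability above that of any single replica. Correlated failures, which are common in practice, make the parallel figure optimistic.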

SRE Weekly Issue #419

Our nine month journey to horizontally shard Figma’s Postgres stack, and the key to unlocking (nearly) infinite scalability.

Retrofitting sharding is a huge undertaking.

  Sammy Steele — Figma

Ride along as this company evolves from constantly shipping directly to production to a robust staging and internal canary deployment system.

  Greg Foster — Graphite

A lighthearted but still detail-filled take on a post-incident analysis for a short production outage.

  Greg Foster — Graphite

This one has an interesting discussion of the nature of reliability and the impact of multiple services on overall reliability, including possible mathematical models to use.

  Fitz — Temporal

This episode of the SREPath Podcast covers a variety of themes around observability and SLOs. There’s a great text-based summary if that’s your preference.

  Ash Patel — SREPath

This piece argues that you should install system debugging tools on your production systems now, because it’s going to be really hard to do it live when you need them.

  Brendan Gregg

Following on from a previous article about the squiggliness of availability numbers, this article evaluates SLAs from 4 major companies to try to divine what they actually mean.

  Ross Brodbeck

I want to present real-life examples of both availability and latency SLOs, as they are more nuanced than they may initially appear. Also, I find it worthwhile sharing a detailed guide as it showcases uncommon uses of PromQL and demonstrates the language’s versatility.

  Michał Kaźmierczak

SRE Weekly Issue #418

The observability waters have been muddy for a while, and this article does a great job of taking a step back and building a definition and a roadmap.

  Hazel Weakly

Fred Hebert wrote this response/follow-on to Hazel’s article:

The main points I’ll try to bring here are on the topics of the difference between insights and questions, the difference between observability and data availability, reinforcing a socio-technical definition, the mess of complex systems and mapping them, and finally, a hot take on the use of models when reasoning about systems.

  Fred Hebert

What the service providers are willing to put on the table in terms of penalties is often much less than the money you lose when your service goes down.

  Alex Ewerlöf

Fascinating legal questions come to the surface when lawyers consider the potential legal risk exposure from a surgical incident debriefing meeting.

  Dr. Rob Poston

if you approach on-call the right way, you can mitigate the impacts of alert fatigue or, better yet, avoid it altogether. Here, we’ll dive into the tactics teams can implement to address alert fatigue and its underlying causes.

  incident.io

How do you create an SLO that references multiple SLIs together, such as slow requests and errors?

  Ross Brodbeck
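
One common way to combine SLIs, sketched here as an assumption rather than as the article's method, is to count a request as "good" only if it both succeeded and was fast enough. The `Request` type, the 300 ms threshold, and the sample data are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int        # HTTP status code
    latency_ms: float  # time taken to serve the request


def is_good(req, latency_threshold_ms=300.0):
    """A request is good only if it is not a server error AND is fast enough."""
    return req.status < 500 and req.latency_ms <= latency_threshold_ms


def combined_sli(requests, latency_threshold_ms=300.0):
    """Fraction of requests that satisfy both the error and latency criteria."""
    if not requests:
        return 1.0  # no traffic: treat as fully compliant (a choice of this sketch)
    good = sum(is_good(r, latency_threshold_ms) for r in requests)
    return good / len(requests)


requests = [
    Request(200, 120.0),  # good
    Request(200, 450.0),  # too slow
    Request(503, 80.0),   # server error
    Request(200, 95.0),   # good
]
sli = combined_sli(requests)
print(f"combined SLI: {sli:.2f}")
```

A single combined SLI keeps one error budget for "bad experiences" of either kind, at the cost of no longer showing whether slowness or errors consumed that budget.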

More than just a list of talks, this piece pulls out major themes from SRECon24.

  Will Gallego

Making your 9’s look great by cheating.

Of course, you don’t actually want to do that, but learning how can show us that availability numbers are nuanced.

  Ross Brodbeck

SRE Weekly Issue #417

A message from our sponsor, FireHydrant:

Join FireHydrant this Thursday for a conversation about on-call burnout and how to prevent it. Get a better understanding of what makes a fatigue-free on-call culture, including real-world examples from your incident management peers. No sales, just shop talk.
https://app.livestorm.co/firehydrant/better-incidents-spring-bonfire-secrets-to-fatigue-free-on-call-in-2024

Remember that cool lava lamp random number generator that Cloudflare uses? Now they have a couple of other sources of entropy, and they’re teaming up with other companies.

  Cefan Daniel Rubin, Luke Valenta, and Thibault Meunier — Cloudflare

To support 123 million simultaneous streams (!), Paramount+ migrated to a multi-region architecture with a distributed, multi-write database.

  Denis Magda — Yugabyte

DevOps Research and Assessment or the Digital Operational Resilience Act, which is which? Turns out they both matter to SREs.

  Lee Fredricks — PagerDuty

2038 isn’t so far off now. Do you have a plan for 64-bit timestamps?

  Code Reliant
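
For context on why 2038 matters, the sketch below shows where a signed 32-bit Unix timestamp runs out and what a wrapped value decodes to. The helper name `fits_int32` is hypothetical and not taken from the linked article.

```python
import struct
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
INT32_MAX = 2**31 - 1  # largest value a signed 32-bit timestamp can hold

# The last second representable in a signed 32-bit timestamp.
last_moment = EPOCH + timedelta(seconds=INT32_MAX)

# One second later, the bit pattern reinterpreted as signed 32-bit goes negative.
wrapped_seconds = struct.unpack("<i", struct.pack("<I", INT32_MAX + 1))[0]
wrapped_moment = EPOCH + timedelta(seconds=wrapped_seconds)


def fits_int32(timestamp):
    """Return True if the timestamp can be stored in a signed 32-bit field."""
    try:
        struct.pack("<i", timestamp)
        return True
    except struct.error:
        return False


print("last 32-bit moment:", last_moment.isoformat())
print("after wraparound:  ", wrapped_moment.isoformat())
print("fits one second later?", fits_int32(INT32_MAX + 1))
```

A check like `fits_int32` is one way to audit stored or serialized timestamps, such as database columns, binary file formats, and wire protocols, where 32-bit fields tend to persist long after the application code has moved to 64-bit time.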

To ensure they would dogfood the new account process regularly, these folks delete a random employee’s account in their product every day.

  Greg Foster — Graphite

Hey, check it out, sidecars are going to be fully supported in upcoming versions of Kubernetes!

  Steven Aldinger — TeamSnap

As part of releasing a new product, FireHydrant ran simulations to determine the right SLO — and uncover some room for optimization.

  Danielle Leong — FireHydrant

  This article is published by my sponsor, FireHydrant, but their sponsorship did not influence its inclusion in this issue.

If you’re new to distributed tracing, this is a great overview. The part about automated instrumentation for span tracing is especially useful.

  Chris Battarbee — Metoro

A production of Tinker Tinker Tinker, LLC