SRE Weekly Issue #433

A message from our sponsor, FireHydrant:

We’ve gone all out on our new integration with Microsoft Teams. If you’re a MS Teams user, FireHydrant now supports the most comprehensive integration for incident management. Run the entire IM process without ever leaving the chat.

https://firehydrant.com/blog/introducing-a-brand-new-microsoft-teams-integration/

This article covers five skills:

  1. Ability to Lead
  2. Taking Charge in Critical Situations
  3. Expressing Opinions in a Non-Conflicting Way
  4. Leading Initiatives for Continuous Improvement
  5. Building and Maintaining Relationships

  Prabesh

I was pretty dubious most of the way through this article — until I realized it was a story about why this solution didn’t work for them. Now it’s an interesting read about Python and exercising restraint in complexity.

  Jean-Mark Wright

Meta is training an LLM to suggest commits that may have caused a given incident, and its suggestions are right 42% of the time.

  Diana Hsu, Michael Neu, Mohamed Farrag, and Rahul Kindi — Meta

Percentiles, because when your math(s) teacher told you you’d use math all the time when you grew up, they were right! This article does a great job of explaining percentiles if you’re having trouble wrapping your mind around them.

  Alex Ewerlöf
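
For a concrete feel of why percentiles matter, here is a minimal sketch using the nearest-rank definition (one of several common definitions, and not necessarily the one the article uses). The `percentile` helper and the latency samples are made up for illustration.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of
    all samples at or below it. Hypothetical helper for illustration."""
    if not samples:
        raise ValueError("samples must be non-empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Made-up request latencies in milliseconds, with one slow outlier.
latencies_ms = [12, 15, 11, 14, 13, 16, 12, 950, 14, 13]

print(percentile(latencies_ms, 50))            # 13
print(percentile(latencies_ms, 99))            # 950
print(sum(latencies_ms) / len(latencies_ms))   # 107.0
```

The mean (107 ms) describes almost none of the requests here, while p50 and p99 together show that most requests are fast and a few are very slow.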

Netflix designed their load shedding system to efficiently drop the requests that don’t matter as much and prioritize what users really care about.

  Anirudh Mendiratta, Kevin Wang, Joey Lynch, Javier Fernandez-Ivern, and Benjamin Fedorka — Netflix
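
The general idea of prioritized load shedding can be sketched in a few lines. This is not Netflix's implementation: the priority names, the thresholds, and the `should_shed` function are all assumptions chosen for illustration.

```python
from enum import IntEnum


class Priority(IntEnum):
    """Hypothetical request classes, most important first."""
    CRITICAL = 0
    DEGRADED = 1
    BEST_EFFORT = 2
    BULK = 3


# Assumed utilization (0.0 to 1.0) at which each class starts being dropped.
# CRITICAL has no entry, so this sketch never sheds it.
SHED_AT = {
    Priority.BULK: 0.60,
    Priority.BEST_EFFORT: 0.75,
    Priority.DEGRADED: 0.90,
}


def should_shed(priority, utilization):
    """Return True if a request of this priority should be rejected."""
    threshold = SHED_AT.get(priority)
    return threshold is not None and utilization >= threshold


print(should_shed(Priority.BULK, 0.65))      # True
print(should_shed(Priority.CRITICAL, 0.99))  # False
```

As utilization climbs, the least important traffic is dropped first, which leaves headroom for the requests users would actually notice.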

This article illustrates cascading delays in microservices and describes three techniques for dealing with them: timeouts, retries, and circuit breakers.

  Jean-Mark Wright
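
Two of the three techniques, retries and circuit breakers, can be sketched as below. This is a simplified illustration, not the article's code: `CircuitBreaker`, `CircuitOpenError`, and `call_with_retries` are hypothetical names, and the wrapped function is assumed to enforce its own timeout (for example via an HTTP client's timeout setting).

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is not attempted."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after_s=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                raise CircuitOpenError("circuit open; failing fast")
            # Cool-down elapsed: half-open, so let one trial call through.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if (self.opened_at is not None
                    or self.failures >= self.failure_threshold):
                self.opened_at = self.clock()
            raise
        self.failures = 0
        self.opened_at = None
        return result


def call_with_retries(fn, attempts=3, base_delay_s=0.1, sleep=time.sleep):
    """Retry fn with exponential backoff; re-raise the last error."""
    for attempt in range(attempts):
        try:
            return fn()
        except CircuitOpenError:
            raise  # retrying an open circuit only adds load
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay_s * 2 ** attempt)
```

A caller might combine them as `call_with_retries(lambda: breaker.call(fetch_profile))`, where `fetch_profile` stands in for any downstream call. The breaker stops a struggling dependency from being hammered, and the backoff keeps the retries themselves from causing a new pile-up.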

Cloudflare’s public DNS resolver had an outage due to a (probably accidental?) BGP hijack. 1.1.1.1 is a common address used internally for testing routing, so it’s easy to understand how an accidental route leak happened.

  Bryton Herdes, Mingwei Zhang, and Tanner Ryan — Cloudflare

Here’s a new post about durability and write-ahead logs. Write-ahead logs are used almost everywhere. But to build an intuition for why, it is helpful to imagine what you would do without a WAL.

  Phil Eaton
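
To make the WAL idea concrete, here is a toy key-value store that logs every write before applying it. It is a sketch under simplifying assumptions, not code from the post: `TinyKV`, the JSON-lines record format, and the torn-write handling are all invented for illustration, and real systems add checksums, batching, and log compaction.

```python
import json
import os


class TinyKV:
    """Toy in-memory store whose writes survive restarts via a log file."""

    def __init__(self, wal_path):
        self.wal_path = wal_path
        self.data = {}
        self._replay()
        self._wal = open(wal_path, "ab")

    def _replay(self):
        """Rebuild state from the log, dropping a torn final record."""
        if not os.path.exists(self.wal_path):
            return
        good = 0  # byte offset just past the last complete record
        with open(self.wal_path, "r+b") as f:
            for raw in f:
                if not raw.endswith(b"\n"):
                    break  # partial write left behind by a crash
                try:
                    record = json.loads(raw)
                except ValueError:
                    break
                self.data[record["key"]] = record["value"]
                good += len(raw)
            f.truncate(good)

    def set(self, key, value):
        line = json.dumps({"key": key, "value": value}).encode("utf-8")
        self._wal.write(line + b"\n")
        self._wal.flush()
        os.fsync(self._wal.fileno())  # durable on disk before we apply it
        self.data[key] = value

    def close(self):
        self._wal.close()
```

Without the log, a crash between accepting a write and persisting the whole data set would lose it silently. With the log, recovery is just replaying the file from the start.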

Updated: July 14, 2024 — 10:10 pm
A production of Tinker Tinker Tinker, LLC