SRE Weekly Issue #433

A message from our sponsor, FireHydrant:

We’ve gone all out on our new integration with Microsoft Teams. If you’re a MS Teams user, FireHydrant now supports the most comprehensive integration for incident management. Run the entire IM process without ever leaving the chat.

https://firehydrant.com/blog/introducing-a-brand-new-microsoft-teams-integration/

This article covers five skills:

  1. Ability to Lead
  2. Taking Charge in Critical Situations
  3. Expressing Opinions in a Non-Conflicting Way
  4. Leading Initiatives for Continuous Improvement
  5. Building and Maintaining Relationships

  Prabesh

I was pretty dubious most of the way through this article — until I realized it was a story about why this solution didn’t work for them. Now it’s an interesting read about Python and exercising restraint in complexity.

  Jean-Mark Wright

Meta is training an LLM to suggest commits that may have caused a given incident, and its suggestions are right 42% of the time.

  Diana Hsu, Michael Neu, Mohamed Farrag, and Rahul Kindi — Meta

Percentiles, because when your math(s) teacher told you you’d use math all the time when you grew up, they were right! This article does a great job of explaining percentiles if you’re having trouble wrapping your mind around them.

  Alex Ewerlöf
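
As a quick illustration of the idea (not taken from the linked article), here is a minimal sketch of the nearest-rank percentile method. The function name `percentile` and the sample latencies are made up for this example, and other interpolation methods give slightly different answers.

```python
import math
import statistics

def percentile(values, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not values:
        raise ValueError("values must be non-empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank into the sorted list
    return ordered[rank - 1]

# Hypothetical request latencies in milliseconds, with two slow outliers.
latencies_ms = [12, 15, 11, 14, 13, 250, 16, 12, 13, 900]

p50 = percentile(latencies_ms, 50)
p90 = percentile(latencies_ms, 90)
p99 = percentile(latencies_ms, 99)
mean = statistics.mean(latencies_ms)

print(f"mean={mean} p50={p50} p90={p90} p99={p99}")
```

In this made-up sample the two outliers pull the mean far above the median, which is one reason latency SLOs tend to be stated in percentiles rather than averages.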

Netflix designed their load shedding system to efficiently drop the requests that don’t matter as much and prioritize what users really care about.

  Anirudh Mendiratta, Kevin Wang, Joey Lynch, Javier Fernandez-Ivern, and Benjamin Fedorka — Netflix

This article illustrates cascading delays in microservices and describes three techniques for dealing with them: timeouts, retries, and circuit breakers.

  Jean-Mark Wright
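
To make the three techniques concrete, here is a minimal sketch of retries with exponential backoff wrapped around a simple circuit breaker. It is not the article's code: the names `CircuitBreaker`, `CircuitOpenError`, and `call_with_retries`, and all thresholds, are assumptions for illustration. The wrapped function is assumed to enforce its own timeout and raise when it expires.

```python
import time

class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is rejected immediately."""

class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                raise CircuitOpenError("circuit open; failing fast")
            # Cool-down elapsed: let one trial call through (half-open).
        try:
            result = fn()  # fn is assumed to apply its own timeout
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result

def call_with_retries(breaker, fn, attempts=3, base_delay_s=0.1, sleep=time.sleep):
    """Retry fn through the breaker, doubling the delay after each failure."""
    for attempt in range(attempts):
        try:
            return breaker.call(fn)
        except CircuitOpenError:
            raise  # do not retry against an open circuit
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay_s * 2 ** attempt)
```

The retry loop deliberately gives up as soon as the breaker opens, so retries do not add load to a dependency that is already struggling.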

Cloudflare’s public DNS resolver had an outage due to a (probably accidental?) BGP hijack. 1.1.1.1 is a common address used internally for testing routing, so it’s easy to understand how an accidental route leak happened.

  Bryton Herdes, Mingwei Zhang, and Tanner Ryan — Cloudflare

Here’s a new post about durability and write-ahead logs. Write-ahead logs are used almost everywhere. But to build an intuition for why, it is helpful to imagine what you would do without a WAL.

  Phil Eaton

SRE Weekly Issue #432

In this debugging story, an engineer wielded SystemTap to figure out why a Kafka broker was doing a ridiculous amount of reads.

  Terra Field — Honeycomb

  Full disclosure: Honeycomb is my employer.

A concise breakdown of the math involved in getting that extra nine of reliability.

It all boils down to creating the SLOs and requirements to keep your users happy, but nothing more. Unnecessary reliability is a high cost.

  Thomas Stringer
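
For a rough sense of the arithmetic (a sketch of my own, not taken from the linked article), each additional nine cuts the allowed downtime by a factor of ten. The function name `allowed_downtime_minutes` and the 365-day year are assumptions for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a 365-day year

def allowed_downtime_minutes(nines, period_minutes=MINUTES_PER_YEAR):
    """Downtime budget for an availability target with the given number of
    nines, e.g. nines=3 means 99.9% availability."""
    if nines < 1:
        raise ValueError("nines must be at least 1")
    return period_minutes / 10 ** nines

for n in range(2, 6):
    target = 100 - 100 / 10 ** n
    budget = allowed_downtime_minutes(n)
    print(f"{n} nines (~{target:.3f}%): {budget:.2f} minutes of downtime per year")
```

Going from three nines to four shrinks the yearly budget from roughly 8.8 hours to under an hour, which is why that extra nine tends to be expensive.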

If you’re looking to advance in SRE, this article has some examples of the skills and experience you should aim for.

  Prabesh

Will Gallego shows us a way of thinking that helps turn “should haves” into deeper understanding of our sociotechnical systems.

  Will Gallego

Some words of wisdom I came across this week around startups choosing not to work on scalability too early.

  Vassil Popovski

Some commenters in this reddit thread are saying it’s easier to be called an SRE, but what does it mean? Some say SRE has gotten easier, and some say it’s gotten harder. What do you think?

  u/sreiously and others — reddit

The full report isn’t available yet (and may not ever be?) but this executive summary has a lot of juicy bits about the major 2022 Rogers internet and emergency service outage in Canada.

  Xona Partners, Inc.

The Rogers report executive summary includes some blamey and blame-adjacent language, and this analysis does a good job of calling it out and suggesting ways to recast it.

  Lorin Hochstein

The Rogers outage report executive summary indicates that truly out-of-band network management access may have made recovery easier. What exactly is involved in setting that up?

  Chris Siebenmann

SRE Weekly Issue #431

This is a really thorny one. As individual subprocesses started infinitely looping, their system shifted load to other datacenters, masking the problem. A coinciding failure in the load shifting system made things even more interesting.

  Lloyd Wallis, Julien Desgats, and Manish Arora — Cloudflare

A great discussion of where dashboards fall short and what we should look for instead.

  Adam Kinniburgh — SquaredUp

Read how we have significantly improved the ability of our monolith to correctly and fully process pushes from our users.

  Will Haltom — GitHub

Timing things to happen at specific intervals is yet another way that we collectively find out that dealing with time is a hard problem.

This article illustrates the subtle but important pitfalls in trying to create a system that does something on a strict interval.

  rachelbythebay

This article reads more like a case study. The author gave a prompt to three different LLMs and actually tested the Terraform config it produced.

  Mike Vanbuskirk — Terrateam

When your pub/sub system can have a million subscribers, even something as mundane as notifying about subscriber counts requires careful thought.

  Ashmeet Singh — Pusher

To me, this concept comes up over and over in SRE, and it’s a core part of SLOs.

  Juraj Masar — BetterStack

In this blog post, we’ll dive deep into the technical aspects of feature flags and feature management, exploring how they can be leveraged by SREs to enable progressive delivery, improve system resilience, and optimize the user experience.

  Hope Lynch — CloudBees

This week’s Mentour Pilot video covers an accident that involved an inaccurate flight simulator. I wasn’t familiar with the term “negative training” before, but now I’m going to be keeping an eye out for it in the systems I manage!

  Mentour Pilot

SRE Weekly Issue #430

Lots of great tips in the comments if you’re looking to tune your resume.

  u/goodolbluey and others — reddit

What can SREs do to increase their available focus time?

  Krishna Vinnakota — DZone

One set of DNS root nameservers (c.root-servers.net) recently fell behind by a couple of days on updates for the root zone. We kind of just expect the root servers to work, you know?

  Dan Goodin — Ars Technica

Stripe talks about the design of their DocDB system built on MongoDB that achieves 5 nines of reliability.

  Jimmy Morzaria and Suraj Narkhede — Stripe

A Severity Zero (worst-case) incident is an entirely different thing from your average incident. This article talks about what makes it different and gives tips for handling one.

  Chris Evans — incident.io

With SLA credits kicking in for some services after just seconds of downtime, Amazon relies on multiple layers of automation.

  Nicholas Yan — Graphite

Here’s a great summary of a podcast episode about Google’s incident response practices.

Google’s latest Search Off The Record podcast discussed examples of disruptive incidents that can affect crawling and indexing, along with the criteria for deciding whether or not to disclose the details of what happened.

  Roger Montti — Search Engine Journal

Here are some essential practices and traits that can make you an exemplary SRE.

Includes 19 tips with short explanations.

  Prabesh

How do layoffs impact resiliency and adaptive capacity? Are the folks making those decisions cognizant of the potential impact on reliability?

  Will Gallego

SRE Weekly Issue #429

Time to get down into the bits and bytes of how Honeycomb queries work with this look into a recent optimization in their data storage layer.

  Hazel Edmands — Honeycomb

  Full disclosure: Honeycomb is my employer.

Here’s how HelloFresh integrated SLOs into their internal platform’s new progressive rollout capability.

  Victor Hugo Brito Fernandes — HelloFresh

I like to consider running an incident review to be its own action item. Other follow-ups emerging from it are a plus, but the point is to learn from incidents, and the review gives room for that to happen.

  Fred Hebert

  Note: Fred is my coworker, and I’m mentioned in this article.

This article covers a wealth of topics around creating an on-call system.

Learn how to navigate vacations, parenthood and personal preferences to improve your reliability practice.

  Rootly

There has been major flooding in Brazil recently, and this article looks at it with an SRE lens. Note, the main article is in Portuguese with an English translation lower down the page.

  Dario Bestetti

This article shows you how to use Infrastructure as Code to implement AWS’s Well-Architected Framework, with Terraform examples.

  Lokesh Aggarwal

The challenges of auto scaling, from cold start impact and tech debt to cost realities, and the case for prioritising scaling as code and shared responsibility for optimal performance and cloud efficiency.

  Karl Stoney

For each post-incident action that you are proposing, we would appreciate it if you would fill out the following template.

Looking at the author, you know this one’s not going to just be what it says on the tin. It’s a thought-provoking exploration of the meaning and purpose of post-incident action items.

  Lorin Hochstein

A production of Tinker Tinker Tinker, LLC