SRE Weekly Issue #429

A message from our sponsor, FireHydrant:

We’ve gone all out on our new integration with Microsoft Teams. If you’re a MS Teams user, FireHydrant now supports the most comprehensive integration for incident management. Run the entire IM process without ever leaving the chat.

https://firehydrant.com/blog/introducing-a-brand-new-microsoft-teams-integration/

Time to get down into the bits and bytes of how Honeycomb queries work with this look into a recent optimization in their data storage layer.

  Hazel Edmands — Honeycomb

  Full disclosure: Honeycomb is my employer.

Here’s how HelloFresh integrated SLOs into their internal platform’s new progressive rollout capability.

  Victor Hugo Brito Fernandes — HelloFresh

I like to consider running an incident review to be its own action item. Other follow-ups emerging from it are a plus, but the point is to learn from incidents, and the review gives room for that to happen.

  Fred Hebert
  Note: Fred is my coworker, and I’m mentioned in this article.

This article covers a wealth of topics around creating an on-call system.

Learn how to navigate vacations, parenthood and personal preferences to improve your reliability practice.

  Rootly

There has been major flooding in Brazil recently, and this article looks at it with an SRE lens. Note, the main article is in Portuguese with an English translation lower down the page.

  Dario Bestetti

This article shows you how to use Infrastructure as Code to implement AWS’s Well-Architected Framework, with Terraform examples.

  Lokesh Aggarwal

The challenges of Auto Scaling, from cold start impact to tech debt and cost realities. Prioritising scaling as code and shared responsibility for optimal performance in cloud efficiency.

  Karl Stoney

For each post-incident action that you are proposing, we would appreciate it if you would fill out the following template.

Looking at the author, you know this one’s not going to just be what it says on the tin. It’s a thought-provoking exploration of the meaning and purpose of post-incident action items.

  Lorin Hochstein

SRE Weekly Issue #428

This article presents an incident theme that I’ve lived through many times but never had such a pithy name for.

  Geoff Townsend — Blameless

There are risks and downsides inherent in a distributed system, so it’s worth thinking about whether you really need one.

  Pipitz — Adevinta

And here’s a counterpoint to the previous article: deciding whether you need a distributed system isn’t just about scale.

  Marc Brooker

The effectiveness of memes in availability campaigns.

This short post is a pile of memes, and the video one is top notch.

  Ross Brodbeck

Paraphrasing part of this article: either you didn’t understand your system fully when you wrote the alert, or there really are sporadic failures.

  Chris Siebenmann

If you’ve ever created an action item from an incident along the lines of “don’t take unnecessary risks in the future”, you need to read this one.

The rest of you need to read it too.

  Lorin Hochstein

A how-to for building anomaly detection alerting in Prometheus with specific config examples.

  Karl Stoney

A panicked engineer asks reddit’s r/sre about an incident they caused: how could they have done better? Will they be fired? The comments are spot on, and this conversation is fresh enough that you could jump in too if you’re interested.

  u/console_fulcrum and others — reddit

Last Monday, Honeycomb had an outage related to a schema migration involving MySQL’s ENUM data type, and they posted this incident report.

Bonus content: I wasn’t aware of ENUMs at all, so I had to brush up with this article: 8 Reasons Why MySQL’s ENUM Data Type Is Evil.

  Honeycomb

  Full disclosure: Honeycomb is my employer.

An experienced SRE discusses the skills and experiences you might be quizzed about in an interview for an SRE role.

  Krishna Vinnakota — DZone

SRE Weekly Issue #427

Written by a GitHub employee, this article seeks to answer the titular question, with discussions of noise reduction concerns and incidents that affect only a subset of customers.

  Ross Brodbeck

Wow, this incident is a really great example of the idea that there is no one single root cause.

  Google

Understand the safeguard configuration of ArgoCD’s ApplicationSet through the experience of our SRE, who learned from an incident.

  Tanat Lokejaroenlarb — Adevinta

Sometimes it’s better to do something in multiple passes, even if it’s less efficient. This applies to individual programs and major deployments alike.

  Thomas A. Limoncelli — ACM Queue

Another thought-provoking take on the argument that there is no one root cause.

  Lorin Hochstein

I referenced this at work the other day, but the interesting bit is that the pod-eviction-timeout option has been removed in Kubernetes 1.27 and I’ve had difficulty finding out what it was replaced by.

  Bhargav Bhikkaji

How to use Llama 2 7B to generate summaries of your incidents, using Cloudflare Workers and Workers AI.

It’s a complete how-to using an open source LLM.

  Karl Stoney

Here’s a great incident writeup from last December that I came across this week.

By the way, if you see or write an incident followup post, I’d be grateful if you sent a link my way!

  Turso

SRE Weekly Issue #426

Got any burning questions to ask an experienced SRE? I’m gathering your questions in this Google Form, and I’d love to hear from you. I’m hoping to use your questions to help inspire authors looking to write more great SRE-related content.

A message from our sponsor, FireHydrant:

FireHydrant is now AI-powered for faster, smarter incidents! Power up your incidents with auto-generated real-time summaries, retrospectives, and status page updates.

https://firehydrant.com/blog/ai-for-incident-management-is-here/

If your overall request volume is low, single errors can have a big impact on your metrics — a phenomenon I’ve experienced at work recently.

  Ross Brodbeck
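
To make the low-volume effect concrete, here is a small sketch of the arithmetic. The function name and the request counts are illustrative assumptions, not figures from the linked article.

```python
def error_rate(errors: int, total: int) -> float:
    """Return the fraction of failed requests, or 0.0 when there is no traffic."""
    return errors / total if total else 0.0


# One failed request in a quiet window versus a busy one (made-up volumes).
low_volume = error_rate(errors=1, total=50)
high_volume = error_rate(errors=1, total=100_000)

print(f"low volume:  {low_volume:.3%}")
print(f"high volume: {high_volume:.3%}")
```

With these assumed numbers, a single error is 2% of the quiet window, enough to breach a 99.9% availability target, but only 0.001% of the busy one.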

This article outlines five facets of microservice architectures that can have implications for reliability.

  Andre Newman — Gremlin

If this title sounds familiar, I’ve linked to an article about the Children of the Magenta concept before. In this accident report, the pilots became confused about their location and course, and ultimately, their trust in the Flight Management System contributed to the disaster.

  Kyra Dempsey (Admiral Cloudberg)

A Center of Production Excellence can be a powerful means for an organization to initiate transformations which foster resilience as it matures and its environment changes.

  Nick Travaglini — Honeycomb

  Full disclosure: Honeycomb is my employer.

Last week, I shared a story about an outage at UniSuper that was caused by Google Cloud. This week, Google shared more details about what went wrong, and it’s well worth a read.

  Google

This incident is intriguing because exponential backoff made the problem harder to detect.

  Heroku

A discussion of what might get in the way of an organization implementing SLI/SLO/SLAs.

Note that the second half of the article (overcoming those obstacles) is behind a paywall. I don’t often recommend pay-only content, but it’s worth considering a subscription, because Alex is an excellent author whose work I’ve featured here many times.

  Alex Ewerlöf

if we look at a distribution of incidents by contributor (or cause, or component), we’re unlikely to see any one of these stand out as being the source of a large number of incidents.

  Lorin Hochstein

SRE Weekly Issue #425

Great practical advice for how to present reliability problems (and your proposed solutions) to e-staff.

  Ross Brodbeck

It’s when things aren’t always on fire that it can be very difficult to assess whether we need to allocate additional resources to reduce risk.

  Lorin Hochstein

The three kinds of roles covered in this article relate to Standards, Operations, and Leadership.

  Gavin Cahill — Gremlin

Nagle’s algorithm considered harmful? It’s important to be aware of it because it can trip you up.

  Marc Brooker
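
For readers who have not run into it: Nagle’s algorithm delays small TCP writes so they can be coalesced, and the usual way to opt out is the TCP_NODELAY socket option. Below is a minimal Python sketch using only the standard socket module. Whether disabling it is the right call depends on the workload, so treat this as an illustration rather than a recommendation.

```python
import socket

# Create a TCP socket; no connection is needed to set the option.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# Setting TCP_NODELAY to 1 disables Nagle's algorithm for this socket,
# so small writes are sent without waiting to be coalesced.
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

# Read the option back to confirm it took effect.
nodelay_enabled = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0
print("Nagle disabled:", nodelay_enabled)

sock.close()
```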

In issue #423, I linked to a story about Amazon charging for unauthenticated and failed requests to S3 buckets. Thankfully, they’re no longer charging for that.

  Amazon

A little low on details, but interesting nonetheless: Google Cloud did something weird and accidentally deleted a customer’s account out from under them.

  UniSuper

What is a “service” in the context of service levels (SLI/SLO)?

  Alex Ewerlöf

My favorite part of this one is the description of techniques for improving psychological safety at your company.

  Incident.io

A production of Tinker Tinker Tinker, LLC