General

SRE Weekly Issue #441

A message from our sponsor, FireHydrant:

FireHydrant has acquired Blameless! The addition of Blameless’ enterprise capabilities combined with FireHydrant’s platform creates the most comprehensive enterprise incident management solution in the market.

https://firehydrant.com/blog/press-release-firehydrant-acquires-blameless-to-further-solidify-enterprise/

This post aims to shed some light on why we migrated to Prometheus, as well as outline some of the technical challenges we faced during the process.

  Eddie Bracho — Mixpanel

Amazon posted this thorough summary of a multi-service outage at the end of July. The impact stems from a complex distributed system failure in Kinesis.

  Amazon

This team shows what they did to ferret out and eliminate occurrences of N+1 DB queries triggered by a single request in their Django app.

  Gonzalo Lopez — Mixpanel
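
For readers who haven’t hit this pattern before, here’s a minimal sketch of an N+1 query and the usual fix in Django; the Author/Book models are purely illustrative, not Mixpanel’s code.

    from django.db import models

    class Author(models.Model):
        name = models.CharField(max_length=100)

    class Book(models.Model):
        title = models.CharField(max_length=200)
        author = models.ForeignKey(Author, on_delete=models.CASCADE)

    # N+1: one query for all books, then one extra query per book for its author.
    def list_books_n_plus_one():
        for book in Book.objects.all():
            print(book.title, book.author.name)

    # Fix: select_related("author") fetches the related rows in a single JOINed query.
    def list_books_single_query():
        for book in Book.objects.select_related("author").all():
            print(book.title, book.author.name)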

The folks at incident.io share how they baked observability into the infrastructure for their new on-call tool.


Note for folks using screen readers: there’s a picture without alt-text that contains the following important text:

  1. Overview dashboard
  2. System dashboard
  3. Logs
  4. Tracing

It’s right after this sentence:

Those pieces fit together something like this:

  Martha Lambert — incident.io

An overview of DST (deterministic simulation testing), which was a new concept for me. It’s about running simulations to try to find faults in a distributed system.

  Phil Eaton
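
As a rough illustration of the idea (not Eaton’s code): a deterministic simulation drives all randomness (message ordering, delays, drops) from a single seed, so any failing run can be replayed exactly.

    import random

    def simulate(seed: int) -> None:
        """Drive a toy 'distributed' workload from one seeded RNG so the run
        is fully reproducible: the same seed always yields the same schedule."""
        rng = random.Random(seed)
        pending = [("incr", 1) for _ in range(10)]
        rng.shuffle(pending)          # simulated message reordering
        total = 0
        for op, amount in pending:
            if rng.random() < 0.1:    # simulated message loss
                continue
            total += amount
        # Invariant we *hope* holds; dropped messages will violate it.
        assert total == 10, f"invariant violated with seed={seed}, total={total}"

    if __name__ == "__main__":
        for seed in range(1000):
            try:
                simulate(seed)
            except AssertionError as err:
                print(err)            # re-running simulate(seed) reproduces the bug exactly
                break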

If you build software that people depend on but are not operationally responsible for it (particularly on-call), you should be. 🛑

I like the way this one draws from the author’s experience, plus the emphasis on feedback loops.

  Amin Astaneh

Retries help increase service availability. However, if not done right, they can have a devastating impact on the service and elongate recovery time.

   Rajesh Pandey
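
One common mitigation in this space is capping attempts and adding jittered exponential backoff; here’s a minimal Python sketch, with illustrative limits:

    import random
    import time

    def call_with_retries(operation, max_attempts=3, base_delay=0.1, max_delay=2.0):
        """Retry a flaky call with capped exponential backoff plus full jitter,
        so a fleet of clients doesn't hammer a struggling service in lockstep."""
        for attempt in range(1, max_attempts + 1):
            try:
                return operation()
            except Exception:
                if attempt == max_attempts:
                    raise                                    # give up; let the caller decide
                backoff = min(max_delay, base_delay * 2 ** (attempt - 1))
                time.sleep(random.uniform(0, backoff))       # full jitter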

Keepalive pings are critical in any system that uses TCP, since connections can hang at any point. I’ve been meaning to write this one for years!

  Lex Neva — Honeycomb

  Full disclosure: Honeycomb is my employer.
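
For a concrete example of one knob in this space, here’s how TCP keepalive can be enabled on a socket in Python; the timing values are illustrative, and the TCP_KEEP* options shown are Linux-specific. Application-level heartbeats are the other half of the story.

    import socket

    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)   # turn keepalive on
    # Linux-specific tuning: start probing after 60s idle, probe every 10s,
    # and drop the connection after 5 unanswered probes.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 60)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 10)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 5)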

SRE Weekly Issue #440

A message from our sponsor, FireHydrant:

Migrate off of PagerDuty, save money, and then have all of your configuration exported as Terraform modules? We did that. We know one of the hardest parts of leaving a legacy tool is the old configuration; that’s why we dedicated time to building the Signals migrator, making it easy to switch.

https://firehydrant.com/blog/speedrun-to-signals-automated-migrations-are-here/

As part of designing their new paging product, incident.io created a set of end-to-end tests to exercise the system and alert on failures. Click through for details on how they designed the tests and lessons learned.

  Rory Malcolm — incident.io
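
As a very loose sketch of the shape such a test can take (the endpoints and names here are hypothetical, not incident.io’s implementation): fire a synthetic alert, then poll until the resulting page is observed or a deadline passes.

    import time
    import requests   # assumed dependency; any HTTP client works

    def test_paging_pipeline(base_url: str, timeout_s: int = 120) -> None:
        """Hypothetical end-to-end probe: send a synthetic alert, then poll a
        test inbox until the corresponding page shows up or we time out."""
        marker = f"e2e-test-{int(time.time())}"
        requests.post(f"{base_url}/alerts", json={"title": marker}, timeout=10)
        deadline = time.time() + timeout_s
        while time.time() < deadline:
            pages = requests.get(f"{base_url}/test-inbox", timeout=10).json()
            if any(marker in p.get("title", "") for p in pages):
                return                      # the page made it end to end
            time.sleep(5)
        raise RuntimeError(f"page for {marker} never arrived; alert the pipeline owner")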

As Slack rolled out their new experience for large, multi-workspace customers, they had to re-work fundamental parts of their infrastructure, including database sharding.

  Ian Hoffman and Mike Demmer — Slack

A third-party vendor’s Support Engineer […] acknowledged that the root cause for both outages was a monitoring agent consuming all available resources.

  Heroku

Resilience engineering is about focusing on making your organization better able to handle the unexpected, rather than preventing repetition of the same incident. This article gives a thought-provoking overview of the difference.

  John Allspaw — InfoQ

Metrics are great for many other things, but they can’t compete with traces for investigating problems.

  Jean-Mark Wright

Through fictional storytelling, this article explains not just the benefits of retries, but how they can go wrong.

  Denis Isaev — Yandex

Hot take? Sure, but they back it up with a well-reasoned argument.

  Ethan McCue

A detailed look at the importance of backpressure and how to use it to reduce load effectively, as implemented in WarpStream.

  Richard Artoul — WarpStream
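
This isn’t WarpStream’s implementation, but the core idea can be sketched with a bounded queue: producers block when the consumer can’t keep up, instead of letting work pile up without limit.

    import queue
    import threading
    import time

    # Bounded queue: once it holds 100 items, producers must wait (backpressure)
    # rather than growing memory without bound.
    work = queue.Queue(maxsize=100)

    def producer():
        for i in range(1000):
            work.put(i)               # blocks when the queue is full
        work.put(None)                # sentinel: no more work

    def consumer():
        while (item := work.get()) is not None:
            time.sleep(0.001)         # simulate slow processing

    threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()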

SRE Weekly Issue #439

A message from our sponsor, FireHydrant:

Migrate off of PagerDuty, save money, and then have all of your configuration exported as Terraform modules? We did that. We know one of the hardest parts of leaving a legacy tool is the old configuration; that’s why we dedicated time to building the Signals migrator, making it easy to switch.

https://firehydrant.com/blog/speedrun-to-signals-automated-migrations-are-here/

Read on to learn why client-side network monitoring is so important and what you are missing if your only visibility into network performance is from a backend perspective.

  Fredric Newberg — The New Stack

An engineer with no Kubernetes experience migrates an app to Kubernetes — with a bit of help from StackOverflow and Copilot, of course.

  Jacob Brandt — Klaviyo

As data teams become increasingly critical, problems in their systems become incidents. Here’s an overview of how one data team has designed their incident response process.

  Navo Das — incident.io

Certificate pinning can be a useful practice, but it’s also fraught with pitfalls and outage risks, especially with the modern tendency toward shorter certificate lifetimes and multiple intermediates. What can we do instead?

  Dina Kozlov — Cloudflare

A super-thorough overview of SLAs with a helpful section on how to choose the level for an SLA.

  Diana Bocco — UptimeRobot

This debugging story focuses on a Linux TCP option I wasn’t familiar with: tcp_slow_start_after_idle.

  Amnon Cohen — Ably
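
If you’re curious where your own Linux hosts stand, the setting is an ordinary sysctl; a tiny (Linux-only) check:

    # Read the sysctl directly; "1" (the default) means the congestion window is
    # reset after an idle period, which is the behaviour this option controls.
    with open("/proc/sys/net/ipv4/tcp_slow_start_after_idle") as f:
        print("tcp_slow_start_after_idle =", f.read().strip())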

This is the story of a company that got an unexpectedly huge rush of interest in their platform—and traffic too. They made a number of changes to quickly scale to meet the demand.

  Jekaterina Petrova — Dyninno

This Honeycomb incident followup seems to be related to their post that I shared last week.

  Honeycomb

  Full disclosure: Honeycomb is my employer.

SRE Weekly Issue #438

Are there any blind or low-vision readers out there that would be willing to answer a few questions? I’m looking to learn more about your experience of reading a newsletter like this and the articles I link to. If you’re interested, please drop me an email at lex at sreweekly dot com. Thanks!

A message from our sponsor, FireHydrant:

Migrate off of PagerDuty, save money, and then have all of your configuration exported as Terraform modules? We did that. We know one of the hardest parts of leaving a legacy tool is the old configuration; that’s why we dedicated time to building the Signals migrator, making it easy to switch.

https://firehydrant.com/blog/speedrun-to-signals-automated-migrations-are-here/

This article shows how to use timed_rotating and multirotate_set to regularly rotate credentials using Terraform.

  Andy Leap — Mixpanel

After an incident involving a database schema change, this engineer created a linting system for schema changes to catch painful ones that would cause a full table rewrite.

  Fred Hebert — Honeycomb

  Full disclosure: Honeycomb is my employer.
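
Not Honeycomb’s linter, but as a sketch of the general approach: pattern-match migrations for statements known to rewrite or lock the whole table. The rule list below is illustrative and Postgres-flavoured.

    import re
    import sys

    # Illustrative patterns for operations that commonly rewrite or lock a table.
    RISKY = [
        (re.compile(r"ALTER\s+COLUMN\s+\w+\s+TYPE\b", re.I),
         "changing a column type usually rewrites the whole table"),
        (re.compile(r"ALTER\s+COLUMN\s+\w+\s+SET\s+NOT\s+NULL\b", re.I),
         "SET NOT NULL scans the whole table while holding a lock"),
    ]

    def lint_migration(sql: str) -> list[str]:
        """Return human-readable warnings for risky statements in a migration."""
        return [msg for pattern, msg in RISKY if pattern.search(sql)]

    if __name__ == "__main__":
        for path in sys.argv[1:]:
            for warning in lint_migration(open(path).read()):
                print(f"{path}: {warning}")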

Finding Heroku and alternative services lacking for various reasons, these folks built their own Heroku-like platform on top of Kubernetes and migrated their service to it.

  Matheus Lichtnow — WorkOS

It’s anything but simple to handle IPv4 and IPv6 in your service. This article covers the nitty-gritty details including dual-stack resolvers and Happy Eyeballs.

  Viacheslav Biriukov
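
As a taste of the details involved, here’s a simplified dual-stack connect in Python: resolve both address families and try each address in turn. A real Happy Eyeballs client races the attempts with staggered timers rather than trying them strictly sequentially.

    import socket

    def dual_stack_connect(host: str, port: int, timeout: float = 2.0) -> socket.socket:
        """Resolve both A and AAAA records and try each address until one connects."""
        last_err = None
        for family, type_, proto, _, addr in socket.getaddrinfo(
                host, port, socket.AF_UNSPEC, socket.SOCK_STREAM):
            try:
                sock = socket.socket(family, type_, proto)
                sock.settimeout(timeout)
                sock.connect(addr)
                return sock
            except OSError as err:
                last_err = err
        raise last_err or OSError(f"no addresses found for {host}")

    # conn = dual_stack_connect("example.com", 443)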

What’s great about an incident? It helps uncover latent flaws in your system, as happened to these folks during a Redis upgrade.

  Shayon Mukherjee

Tips on how to handle vendor incidents, from runbooks to incident management and post-incident review.

  Mandi Walls — PagerDuty

Cool trick:

[…] when an operational surprise happens, someone will remember “Oh yeah, I remember reading about something like this when incident XYZ happened”, and then they can go look up the incident writeup to incident XYZ and see the details that they need to help them respond.

  Lorin Hochstein

While the CAP theorem may be technically correct, the actual limitations it imposes on real-world systems have nuance.

The reality is that CAP is nearly irrelevant for almost all engineers building cloud-style distributed systems, and applications on the cloud.

  Marc Brooker

SRE Weekly Issue #437

This week’s issue is entirely focused on the CrowdStrike incident: more details on what happened, analysis, and learnings. I’ll be back next week with a selection of all the great stuff you folks have been writing while I’ve been off on vacation for the past two weeks; my RSS reader is packed with awesomeness!

A message from our sponsor, FireHydrant:

Migrate off of PagerDuty, save money, and then have all of your configuration exported as Terraform modules? We did that. We know one of the hardest parts of leaving a legacy tool is the old configuration; that’s why we dedicated time to building the Signals migrator, making it easy to switch.

https://firehydrant.com/blog/speedrun-to-signals-automated-migrations-are-here/

This week, CrowdStrike posted quite a bit more detail about what happened on July 19. The short of it seems to be an argument count mismatch, but as with any incident of this sort, there are multiple contributing factors.

The report also continues the conversation about the use of kernel mode in a product such as this, amounting to a public conversation with Microsoft that is intriguing to watch from the outside.

  CrowdStrike
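
Purely as an analogy (the 21-versus-20 field counts come from CrowdStrike’s report; the code is illustrative, not theirs): the class of bug is a reader defined for N input fields being handed N-1 by its caller, so the last access reads past what was actually supplied.

    # Hypothetical analogy only: a "template" that declares 21 fields,
    # a caller that supplies 20, and a reader that trusts the declaration.
    TEMPLATE_FIELD_COUNT = 21
    supplied_values = ["v"] * 20          # one short of the declared count

    def read_field(values, index):
        return values[index]              # no bounds/count validation

    # Accessing the 21st declared field (index 20) fails; in unchecked native
    # code the equivalent read goes out of bounds instead of raising an error.
    read_field(supplied_values, TEMPLATE_FIELD_COUNT - 1)   # raises IndexError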

This article has some interesting details about antitrust regulations(!) related to security vendors running code in kernel mode. There’s also an intriguing story of a very similar crash on Linux endpoints running CrowdStrike’s Falcon.

Note: this one is from a couple of weeks ago and some of its conjectures don’t quite line up with details that have been released in the interim.

  Gergely Orosz

While it mentions the CrowdStrike incident only in vague terms, this article discusses why slowly rolling out updates isn’t a universal solution and can bring its own problems.

  Chris Siebenmann

Some thoughts on staged rollouts and the CrowdStrike outage:

The notion we tried to get known far and wide was “nothing goes everywhere at once”.

Note that this post was published before CrowdStrike’s RCA, which subsequently confirmed that their channel file updates were not deployed through staged rollouts.

  rachelbythebay
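
A minimal sketch of the “nothing goes everywhere at once” idea: deterministically bucket hosts into rings and only ship a new version as far as the ring it has been promoted to. Ring names and fractions below are illustrative.

    import hashlib

    # Illustrative rings: a version starts in "canary" and is promoted outward
    # only after it has soaked without problems.
    RINGS = [("canary", 0.01), ("early", 0.10), ("broad", 0.50), ("everyone", 1.0)]

    def ring_fraction(ring: str) -> float:
        return dict(RINGS)[ring]

    def should_receive(host_id: str, promoted_to: str) -> bool:
        """Hash the host into [0, 1) so bucketing is stable across runs, then
        compare against the fraction covered by the currently promoted ring."""
        digest = hashlib.sha256(host_id.encode()).hexdigest()
        bucket = int(digest[:8], 16) / 0xFFFFFFFF
        return bucket < ring_fraction(promoted_to)

    # Example: while only "canary" is promoted, roughly 1% of hosts get the update.
    print(should_receive("host-1234", "canary"))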

[…] there may be risks in your system that haven’t manifested as minor outages.

Jumping off from the CrowdStrike incident, this one asks us to look for reliability problems in parts of our infrastructure that we’ve grown to trust.

  Lorin Hochstein

While CrowdStrike’s RCA has quite a bit of technical detail, this post reminds us that we need a lot more context to really understand how an incident came to be.

  Lorin Hochstein

In the future, computers will not crash due to bad software updates, even those updates that involve kernel code. In the future, these updates will push eBPF code.

I didn’t realize that Microsoft is working on eBPF for Windows.

  Brendan Gregg

This post isn’t about what Crowdstrike should have done. Instead, I use the resources to provide context and takeaways we can apply to our teams and organizations.

  Bob Walker — Octopus Deploy

A production of Tinker Tinker Tinker, LLC