
SRE Weekly Issue #440

A message from our sponsor, FireHydrant:

Migrate off of PagerDuty, save money, and then have all of your configuration exported as Terraform modules? We did that. We know one of the hardest parts of leaving a legacy tool is the old configuration; that’s why we dedicated time to building the Signals migrator, making it easy to switch.

https://firehydrant.com/blog/speedrun-to-signals-automated-migrations-are-here/

As part of designing their new paging product, incident.io created a set of end-to-end tests to exercise the system and alert on failures. Click through for details on how they designed the tests and lessons learned.

  Rory Malcolm — incident.io

As Slack rolled out their new experience for large, multi-workspace customers, they had to re-work fundamental parts of their infrastructure, including database sharding.

  Ian Hoffman and Mike Demmer — Slack

A third-party vendor’s Support Engineer […] acknowledged that the root cause for both outages was a monitoring agent consuming all available resources.

  Heroku

Resilience engineering is about focusing on making your organization better able to handle the unexpected, rather than preventing repetition of the same incident. This article gives a thought-provoking overview of the difference.

  John Allspaw — InfoQ

Metrics are great for many other things, but they can’t compete with traces for investigating problems.

  Jean-Mark Wright

Through fictional storytelling, this article explains not just the benefits of retries, but how they can go wrong.

  Denis Isaev — Yandex
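To make the failure mode concrete: retries without a cap or jitter can synchronize clients into a retry storm that amplifies the very overload that caused the failures. Here’s a minimal sketch of the standard mitigation (my own toy code, not from the article): capped exponential backoff with full jitter.

```python
import random
import time

def call_with_retries(fn, max_attempts=5, base_delay=0.1, max_delay=2.0):
    """Retry fn with capped exponential backoff and full jitter.

    The cap bounds worst-case added latency; the jitter spreads
    retrying clients out in time so they don't all hit the
    recovering service at once.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            time.sleep(random.uniform(0, delay))  # full jitter
```

A caller that fails twice and then succeeds will see three attempts and then get its result back, with the delays growing (but bounded) in between.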

Hot take? Sure, but they back it up with a well-reasoned argument.

  Ethan McCue

A detailed look at the importance of backpressure and how to use it to reduce load effectively, as implemented in WarpStream.

  Richard Artoul — WarpStream
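The core idea generalizes well beyond WarpStream. A bounded queue is the simplest backpressure primitive: when the consumer falls behind, producers block instead of buffering without limit. A generic sketch (not WarpStream’s implementation):

```python
import queue
import threading

# Bounded queue: put() blocks when full, propagating the consumer's
# slowness back to the producer instead of hiding it in memory.
work = queue.Queue(maxsize=8)

def producer(items):
    for item in items:
        work.put(item)   # blocks when the queue is full: backpressure
    work.put(None)       # sentinel: no more work

def consumer(results):
    while (item := work.get()) is not None:
        results.append(item * 2)  # stand-in for slow processing

results = []
t = threading.Thread(target=consumer, args=(results,))
t.start()
producer(range(100))
t.join()
```

With an unbounded queue, the same overload would show up much later as memory exhaustion; the bound turns it into immediate, visible slowdown at the source.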

SRE Weekly Issue #439

Read on to learn why client-side network monitoring is so important and what you are missing if your only visibility into network performance is from a backend perspective.

  Fredric Newberg — The New Stack

An engineer with no Kubernetes experience migrates an app to Kubernetes — with a bit of help from StackOverflow and Copilot, of course.

  Jacob Brandt — Klaviyo

As data teams become increasingly critical, problems in their systems become incidents. Here’s an overview of how one data team has designed their incident response process.

  Navo Das — incident.io

Certificate pinning can be a useful practice, but it’s also fraught with pitfalls and outage risks, especially with the modern tendency toward shorter certificate lifetimes and multiple intermediates. What can we do instead?

  Dina Kozlov — Cloudflare

A super-thorough overview of SLAs with a helpful section on how to choose the level for an SLA.

  Diana Bocco — UptimeRobot

This debugging story focuses on a Linux TCP option I wasn’t familiar with: tcp_slow_start_after_idle.

  Amnon Cohen — Ably

This is the story of a company that got an unexpectedly huge rush of interest in their platform—and traffic too. They made a number of changes to quickly scale to meet the demand.

  Jekaterina Petrova — Dyninno

This Honeycomb incident followup seems to be related to their post that I shared last week.

  Honeycomb

  Full disclosure: Honeycomb is my employer.

SRE Weekly Issue #438

Are there any blind or low-vision readers out there that would be willing to answer a few questions? I’m looking to learn more about your experience of reading a newsletter like this and the articles I link to. If you’re interested, please drop me an email at lex at sreweekly dot com. Thanks!

This article shows how to use timed_rotating and multirotate_set to regularly rotate credentials using Terraform.

  Andy Leap — Mixpanel

After an incident involving a database schema change, this engineer created a linting system for schema changes to catch painful ones that would cause a full table rewrite.

  Fred Hebert — Honeycomb

  Full disclosure: Honeycomb is my employer.
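The general shape of such a linter is simple: scan each migration statement for patterns known to be expensive. Here’s a toy sketch (mine, not Honeycomb’s linter) flagging PostgreSQL operations that commonly force a full table rewrite or scan; the exact list depends on server version and schema:

```python
import re

# Patterns that commonly force a full table rewrite (or full-table
# scan) in PostgreSQL. Version-dependent: e.g. since PostgreSQL 11,
# ADD COLUMN with a constant default no longer rewrites the table.
RISKY = [
    (re.compile(r"ALTER\s+COLUMN\s+\w+\s+(SET\s+DATA\s+)?TYPE", re.I),
     "changing a column type rewrites the table"),
    (re.compile(r"SET\s+NOT\s+NULL", re.I),
     "SET NOT NULL forces a full-table scan to validate"),
    (re.compile(r"ADD\s+COLUMN\s+.*DEFAULT\s+\w+\(", re.I),
     "a volatile DEFAULT on a new column rewrites the table"),
]

def lint_schema_change(sql: str) -> list[str]:
    """Return warnings for migration statements likely to be painful."""
    return [msg for pattern, msg in RISKY if pattern.search(sql)]
```

Wired into CI, this kind of check turns “we learned this in an incident” into “the pipeline refuses the migration”.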

Finding Heroku and alternative services lacking for various reasons, these folks built their own Heroku-like platform on top of Kubernetes and migrated their service to it.

  Matheus Lichtnow — WorkOS

It’s anything but simple to handle IPv4 and IPv6 in your service. This article covers the nitty-gritty details including dual-stack resolvers and Happy Eyeballs.

  Viacheslav Biriukov
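One of those nitty-gritty details, Happy Eyeballs (RFC 8305), includes a rule worth seeing in miniature: interleave the address families when ordering connection candidates, so one broken family can’t stall the whole attempt. A simplified sketch (mine; real implementations also stagger connection attempts by roughly 250 ms, omitted here):

```python
from itertools import chain, zip_longest

def interleave_families(addrs):
    """Order candidate addresses by alternating IPv6 and IPv4,
    roughly following RFC 8305's family-interleaving rule (with the
    common IPv6-first preference)."""
    v6 = [a for a in addrs if ":" in a]      # crude family check
    v4 = [a for a in addrs if ":" not in a]
    pairs = zip_longest(v6, v4)              # alternate the families
    return [a for a in chain.from_iterable(pairs) if a is not None]
```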

What’s great about an incident? It helps uncover latent flaws in your system, as happened to these folks during a Redis upgrade.

  Shayon Mukherjee

Tips on how to handle vendor incidents, from runbooks to incident management and post-incident review.

  Mandi Walls — PagerDuty

Cool trick:

[…] when an operational surprise happens, someone will remember “Oh yeah, I remember reading about something like this when incident XYZ happened”, and then they can go look up the incident writeup to incident XYZ and see the details that they need to help them respond.

  Lorin Hochstein

While the CAP theorem may be technically correct, the actual limitations it imposes on real-world systems have nuance.

The reality is that CAP is nearly irrelevant for almost all engineers building cloud-style distributed systems, and applications on the cloud.

  Marc Brooker

SRE Weekly Issue #437

This week’s issue is entirely focused on the CrowdStrike incident: more details on what happened, analysis, and learnings. I’ll be back next week with a selection of all the great stuff you folks have been writing while I’ve been off on vacation for the past two weeks. My RSS reader is packed with awesomeness!

This week, CrowdStrike posted quite a bit more detail about what happened on July 19. The short of it seems to be an argument count mismatch, but as with any incident of this sort, there are multiple contributing factors.

The report also continues the conversation about the use of kernel mode in a product such as this, amounting to a public conversation with Microsoft that is intriguing to watch from the outside.

  CrowdStrike
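To illustrate the failure class in the abstract (this is my own toy sketch, emphatically not CrowdStrike’s code): a consumer built expecting N fields per record will read past the end when the input actually carries fewer, and validating the count up front turns a crash into a handled error.

```python
def read_field(record: list[str], index: int, expected_len: int) -> str:
    """Toy illustration of an argument/field count mismatch.

    In C, indexing past the end of the record is an out-of-bounds
    read; in Python it raises IndexError. Checking the count before
    indexing converts the crash into a recoverable ValueError.
    """
    if len(record) != expected_len:
        raise ValueError(
            f"expected {expected_len} fields, got {len(record)}")
    return record[index]
```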

This article has some interesting details about antitrust regulations(!) related to security vendors running code in kernel mode. There’s also an intriguing story of a very similar crash on Linux endpoints running CrowdStrike’s Falcon.

Note: this one is from a couple of weeks ago and some of its conjectures don’t quite line up with details that have been released in the interim.

  Gergely Orosz

While it mentions the CrowdStrike incident only in vague terms, this article discusses why slowly rolling out updates isn’t a universal solution and can bring its own problems.

  Chris Siebenmann

Some thoughts on staged rollouts and the CrowdStrike outage:

The notion we tried to get known far and wide was “nothing goes everywhere at once”.

Note that this post was published before CrowdStrike’s RCA which subsequently confirmed that their channel file updates were not deployed through staged rollouts.

  rachelbythebay

[…] there may be risks in your system that haven’t manifested as minor outages.

Jumping off from the CrowdStrike incident, this one asks us to look for reliability problems in parts of our infrastructure that we’ve grown to trust.

  Lorin Hochstein

While CrowdStrike’s RCA has quite a bit of technical detail, this post reminds us that we need a lot more context to really understand how an incident came to be.

  Lorin Hochstein

In the future, computers will not crash due to bad software updates, even those updates that involve kernel code. In the future, these updates will push eBPF code.

I didn’t realize that Microsoft is working on eBPF for Windows.

  Brendan Gregg

This post isn’t about what Crowdstrike should have done. Instead, I use the resources to provide context and takeaways we can apply to our teams and organizations.

  Bob Walker — Octopus Deploy

SRE Weekly Issue #436

As we can see from the above, any reliability problem like this invalid memory access issue can lead to widespread availability issues when not combined with safe deployment practices.

This analysis from Microsoft starts off by examining crash dumps from the incident that were voluntarily submitted by Windows users. Then they explain why security vendors like CrowdStrike might choose to operate in kernel mode, the inherent risks, and alternative options they could use instead.

  Microsoft

This is CrowdStrike’s initial technical analysis posted shortly after the incident, which I shared here previously.  I’m linking to it again to highlight an apparent contradiction with the analysis from Microsoft as to whether the CrowdStrike component involved was a kernel driver:

Although Channel Files end with the SYS extension, they are not kernel drivers.

I’m guessing the technical resolution to this apparent contradiction is that the channel files are merely data files and not kernel drivers, whereas the thing that processes the channel files is in fact a kernel driver. To me this seems like a needless clarification that was highly likely to mislead readers into thinking that kernel drivers were not at play, which is exactly how I interpreted it at the time.

  CrowdStrike

Here’s a summary and opinion piece on Microsoft’s analysis article, including more on the trade-off of vendors running code in kernel mode.

  Thom Holwerda — OSNews

The challenge is, how do you formulate the right free-text representation of your system to get a useful answer out of an LLM?

  Amir Krayden — DevOps.com

Will artfully uses a refrigeration-based metaphor to discuss creating a blameless culture. Trust me, it works.

  Will Gallego

These folks wanted to allow log lines greater than 128 bytes in their observability product, but their data store made that tricky. They used bloom filters and other techniques to achieve acceptable performance.

  Nathan Ostgard and Javier Schoijet — Embrace
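For the unfamiliar, a Bloom filter gives probabilistic set membership with no false negatives and a tunable false-positive rate, which is why observability stores often keep one per block of data so queries can skip blocks that definitely don’t contain a term. A minimal generic sketch (not Embrace’s implementation):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: `might_contain` can return a false
    positive, but never a false negative for an added item."""

    def __init__(self, size_bits: int = 1024, num_hashes: int = 3):
        self.size = size_bits
        self.k = num_hashes
        self.bits = 0  # a Python int as an arbitrary-size bitset

    def _positions(self, item: str):
        # Derive k positions by salting a cryptographic hash.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item: str) -> bool:
        return all(self.bits & (1 << pos) for pos in self._positions(item))
```

The trade-off is exactly the one in the article: a little memory per block buys the ability to skip most blocks on a negative lookup.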

It turns out sending texts and making phone calls automatically is really hard, and many assumptions you might make turn out to be wrong.

  Leo Sjöberg — incident.io

Wow, I had no idea Systemd could limit a program’s ability to access certain IPs. This one’s worth a read to save you from hair-pulling if you ever run into this.

  rachelbythebay
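The directives in question are IPAddressDeny= and IPAddressAllow= from systemd.resource-control(5). A minimal drop-in that would produce exactly this kind of surprise might look like this (the file path is hypothetical):

```ini
# /etc/systemd/system/myapp.service.d/netfilter.conf (hypothetical path)
[Service]
# Deny all IP traffic, then allow only loopback and one subnet.
# A program under this unit gets EPERM on connect() to anything
# else -- easy to forget when debugging "mysterious" network failures.
IPAddressDeny=any
IPAddressAllow=localhost 10.0.0.0/8
```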
