SRE Weekly Issue #436

A message from our sponsor, FireHydrant:

Migrate off of PagerDuty, save money, and then have all of your configuration exported as Terraform modules? We did that. We know one of the hardest parts of leaving a legacy tool is the old configuration. That’s why we dedicated time to building the Signals migrator, which makes it easy to switch.

https://firehydrant.com/blog/speedrun-to-signals-automated-migrations-are-here/

As we can see from the above, any reliability problem like this invalid memory access issue can lead to widespread availability issues when not combined with safe deployment practices.

This analysis from Microsoft starts by examining crash dumps from the incident that Windows users voluntarily submitted. It then explains why security vendors like CrowdStrike might choose to operate in kernel mode, the inherent risks of doing so, and the alternatives available to them.

  Microsoft

This is CrowdStrike’s initial technical analysis posted shortly after the incident, which I shared here previously. I’m linking to it again to highlight an apparent contradiction with the analysis from Microsoft as to whether the CrowdStrike component involved was a kernel driver:

Although Channel Files end with the SYS extension, they are not kernel drivers.

I’m guessing the technical resolution to this apparent contradiction is that the channel files are merely data files and not kernel drivers, whereas the thing that processes the channel files is in fact a kernel driver. To me this seems like a needless clarification that was highly likely to mislead readers into thinking that kernel drivers were not at play, which is exactly how I interpreted it at the time.

  CrowdStrike

Here’s a summary and opinion piece on Microsoft’s analysis article, including more on the trade-off of vendors running code in kernel mode.

  Thom Holwerda — OSNews

The challenge is, how do you formulate the right free-text representation of your system to get a useful answer out of an LLM?

  Amir Krayden — DevOps.com

Will artfully uses a refrigeration-based metaphor to discuss creating a blameless culture. Trust me, it works.

  Will Gallego

These folks wanted to allow log lines greater than 128 bytes in their observability product, but their data store made that tricky. They used bloom filters and other techniques to achieve acceptable performance.

  Nathan Ostgard and Javier Schoijet — Embrace
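
For readers who haven't met bloom filters: the sketch below is a generic, minimal illustration of the idea, not the implementation from the linked article. The class name, the sizes, and the token-per-log-line usage are all assumptions made up for this example. The key property is that a lookup can return a false positive but never a false negative, so a "no" answer lets a store skip a block of data without reading it.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter (hypothetical example, not from the article)."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        # One bit per slot, packed into bytes.
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive num_hashes bit positions by salting a single hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # True means "possibly present"; False means "definitely absent".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


# Assumed usage: index the tokens of one log line, then test a search term.
log_filter = BloomFilter()
for token in "payment failed for user 1234".split():
    log_filter.add(token)

found = log_filter.might_contain("failed")
```

Here `found` is true because "failed" was added. A term that was never added will usually come back false, though an occasional false positive is expected, and it becomes more likely as the filter fills up.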

It turns out sending texts and making phone calls automatically is really hard, and many assumptions you might make turn out to be wrong.

  Leo Sjöberg — incident.io

Wow, I had no idea systemd could limit a program’s ability to access certain IPs. This one’s worth a read to save you from hair-pulling if you ever run into this.

  rachelbythebay

Updated: August 4, 2024 — 10:48 pm
A production of Tinker Tinker Tinker, LLC