SRE Weekly Issue #404

A message from our sponsor, FireHydrant:

Looking to cozy up with a good read this week? Check out “Your guide to better status pages.” It’s a mini masterclass on how to better communicate on your status pages. https://firehydrant.com/blog/your-guide-to-better-incident-status-pages/

For every 9 you add to SLO, you’re making the system 10x more reliable but also 10x more expensive.

  Alex Ewerlöf
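
To make the "every 9" arithmetic concrete, here is a minimal sketch of how the error budget shrinks by a factor of ten with each added nine. The 30-day window and the function name are assumptions chosen for illustration. The sketch covers only the reliability side of the quote; the cost side is the author's claim and is not modeled here.

```python
def allowed_downtime_minutes(nines: int, window_days: int = 30) -> float:
    """Error budget in minutes for an availability SLO with `nines` nines.

    Assumption: a rolling window of `window_days` days, with availability
    measured as the fraction of time the service is up.
    """
    unavailability = 10 ** -nines  # e.g. 3 nines -> 0.001 (99.9% SLO)
    return window_days * 24 * 60 * unavailability


# Print the budget for 99%, 99.9%, 99.99%, and 99.999%.
for n in range(2, 6):
    slo_percent = 100 * (1 - 10 ** -n)
    budget = allowed_downtime_minutes(n)
    print(f"{n} nines ({slo_percent:.3f}%): {budget:.3f} minutes per 30 days")
```

Each step up leaves one tenth of the previous downtime allowance, which is the sense in which the system has to be 10x more reliable.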

In this incident story, the feature flags were served by the main application server. When a new feature caused the server to crash, there was no way to flip the flag back off to stop the crashes.

  rachelbythebay

The author of a classification system for human error reflects 20 years later on the harm that such systems can cause by using deficit-based language.

  Dr. Steven Shorrock

Here’s Fred Hebert’s analysis of Cloudflare’s write-up of their incident on November 2.

I’m hoping they’re going to do a more in-depth review.

  Fred Hebert — VOID

In this post, we introduce a hybrid approach that seamlessly combines the precision of manual instrumentation with the comfort, efficiency, and performance of automatic instrumentation.

  Ron Federman — Odigos

Change is not the problem. It’s unaddressed risk.

  Bruce Johnston — High Scalability

A shell script with a loop running a DB client can fill up your ephemeral ports in a hurry.

  Oren Eini — RavenDB
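
A rough back-of-the-envelope sketch shows why a tight loop can run out of ports so quickly. It assumes the common Linux defaults (an ephemeral range of 32768 to 60999 and a 60-second TIME_WAIT), that every connection goes to the same server address and port, and that the client closes first and so holds the TIME_WAIT entry. These numbers are assumptions for illustration, not figures taken from the linked post.

```python
def max_sustained_connection_rate(
    port_low: int = 32768,        # assumed Linux default lower bound
    port_high: int = 60999,       # assumed Linux default upper bound
    time_wait_seconds: int = 60,  # assumed TIME_WAIT duration
) -> float:
    """Connections per second a client can open to one destination
    before every ephemeral port is stuck in TIME_WAIT."""
    usable_ports = port_high - port_low + 1
    return usable_ports / time_wait_seconds


rate = max_sustained_connection_rate()
print(f"about {rate:.0f} new connections per second before ports run out")
```

Under these assumptions, a loop that opens and closes more than a few hundred connections per second to the same database will exhaust the range in about a minute.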

When you get right down to it, it’s all human communication, even assembly code. It’s human factors all the way down.

  Michael Hart

Updated: December 24, 2023 — 10:30 pm
A production of Tinker Tinker Tinker, LLC