Articles
With 100 workstreams and over 500 engineers engaged, this was the biggest incident response I’ve read about in years.
We had to force ourselves to identify the facts on the ground instead of “what ought to be,” and overrule our instincts to look for data in the places we normally looked (since our own monitoring was impacted).
Laura de Vesine — Datadog
When you unify these three “pillars” into one cohesive approach, new ways of understanding the full state of your system emerge.
Danyel Fisher — The New Stack
Full disclosure: Honeycomb, my employer, is mentioned.
This report details the 10-hour incident response following the accidental deletion of live databases (rather than their snapshots, as intended).
Eric Mattingly — Azure
Neat trick: write your alerts in English and get GPT to convert them to real alert configurations.
Shahar and Tal — Keep (via Hacker News)
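If you want to play with the idea yourself, here’s a minimal sketch (not Keep’s actual implementation): ask an OpenAI chat model to turn a plain-English description into a Prometheus alerting rule. The model name and prompt are assumptions, and you’d still want a human to review anything it generates before deploying it.

```python
# Minimal sketch: English alert description -> Prometheus alerting-rule YAML.
# Assumes the `openai` Python package and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

def english_to_alert_rule(description: str) -> str:
    """Return Prometheus alerting-rule YAML generated from an English description."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # hypothetical choice; any chat model would do
        messages=[
            {
                "role": "system",
                "content": (
                    "Convert the user's alert description into a single "
                    "Prometheus alerting rule in YAML. Output only YAML."
                ),
            },
            {"role": "user", "content": description},
        ],
    )
    return response.choices[0].message.content

print(english_to_alert_rule(
    "Page me if p99 checkout latency stays above 2 seconds for 10 minutes"
))
```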
If your DNS resolver is responsible for handling queries for both internal and external domains, what happens when external DNS requests fail? Can internal ones still proceed?
Chris Siebenmann
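One way to answer that question for your own setup is to probe the resolver directly. The sketch below (using the third-party dnspython package; the resolver address and hostnames are made up) checks whether an internal name and an external name both resolve within a short timeout.

```python
# Probe a specific resolver to see which lookups still succeed when external
# resolution is broken. Resolver IP and hostnames here are hypothetical.
import dns.exception
import dns.resolver

resolver = dns.resolver.Resolver(configure=False)
resolver.nameservers = ["10.0.0.53"]   # hypothetical: your internal resolver
resolver.timeout = 2.0                 # per-query timeout in seconds
resolver.lifetime = 2.0                # total time allowed per lookup

for name in ("wiki.corp.example", "example.com"):  # internal, then external
    try:
        answer = resolver.resolve(name, "A")
        print(f"{name}: {[r.address for r in answer]}")
    except dns.exception.DNSException as exc:
        print(f"{name}: FAILED ({exc.__class__.__name__})")
```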
This article explains the potential pitfalls and downsides of observability tools, the ways vendors might try to get you to use them, and tips for avoiding the traps.
David Caudill
Too often, we dismiss the anomaly we just faced in an incident as a weird, one-off occurrence. And while that specific failure mode likely will be a one-off, we’ll be faced with new anomalies in the future.
Lorin Hochstein — Surfing Complexity