SRE Weekly Issue #376

Articles

2023-03-08 Incident: A Deep Dive into Our Incident Response

With 100 workstreams and over 500 engineers engaged, this was the biggest incident response I’ve read about in years.

We had to force ourselves to identify the facts on the ground instead of “what ought to be,” and overrule our instincts to look for data in the places we normally looked (since our own monitoring was impacted).

Laura de Vesine — Datadog

How the ‘3 Pillars of Observability’ Miss the Big Picture

When you unify these three “pillars” into one cohesive approach, a new ability to understand the full state of your system in several new ways also emerges.

Danyel Fisher — The New Stack
Full disclosure: Honeycomb, my employer, is mentioned.

Azure DevOps Outage in South Brazil

This report details the 10-hour incident response following the accidental deletion of live databases (rather than their snapshots, as intended).

Eric Mattingly — Azure

Show HN: Keep – Create production alerts from plain English

Neat trick: write your alerts in English and get GPT to convert them to real alert configurations.

Shahar and Tal — Keep (via Hacker News)

A potential issue with outstanding query limits in your DNS resolver

If your DNS resolver is responsible for handling queries for both internal and external domains, what happens when external DNS requests fail? Can internal ones still proceed?

Chris Siebenmann

Delusion Soup: How Observability Got Here, and What We Can Do About It

This article explains potential pitfalls and downsides to observability tools and the ways vendors might try to get you to use them, along with tips for how to avoid the traps.

David Caudill

Treating uncertainty as a first-class concern

Too often, we dismiss the anomaly we just faced in an incident as a weird, one-off occurrence. And while that specific failure mode likely will be a one-off, we’ll be faced with new anomalies in the future.

Lorin Hochstein — Surfing Complexity
