SRE Weekly Issue #360

Articles

Overworked and Underpaid: The crash of TransAsia Airways flight 222

Another case of “pilot error” vs “systemic problems”. It’s interesting to me how the organizational pressures the pilots were facing mirror many stories I’ve seen in tech firms, especially startups.

Admiral Cloudberg

Incident travel time

This article recommends improving MTTA (mean time to assemble) by modeling our dispatch systems on the emergency services for a large city.

Robert Ross

Our 2023 Site Reliability Engineering Wish List

Lots of great stuff to aspire to, with a big emphasis on observability.

Adriana Villela and Ana Margarita Medina — The New Stack
Full disclosure: Honeycomb, my employer, is mentioned.

Move past incident response to reliability

I really love the concept of “incident legalism” introduced in this article. I’ve definitely been there.

The high cost of low ambiguity

Anyone who has coordinated over Slack during an incident has felt the pain of the ambiguity of Slack messages.

But communicating with specificity has a cost.

Lorin Hochstein

Spotify Engineering Incident Report: Spotify Outage on January 14, 2023

I remember this one! I was trying to listen to music at the time. Turns out it was DNS (and a git repo).

Erik Lindblad — Spotify

Good category, bad category (or: tag, don’t bucket)

If you’re gonna group your incidents, use tags, not exclusive groups.

Lorin Hochstein
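
To make the tags-versus-buckets distinction concrete, here is a minimal Python sketch. It is not taken from the article: the incident IDs, tag names, and helper functions are all hypothetical. The idea is that each incident carries a set of tags, so a single incident can show up under several groupings at once instead of being forced into exactly one bucket.

```python
from collections import Counter

# Hypothetical incident records; the IDs and tag names are made up for illustration.
incidents = [
    {"id": "INC-1", "tags": {"dns", "deploy"}},
    {"id": "INC-2", "tags": {"dns"}},
    {"id": "INC-3", "tags": {"deploy", "config"}},
]


def count_by_tag(records):
    """Count incidents per tag; one incident may count toward several tags."""
    counts = Counter()
    for record in records:
        counts.update(record["tags"])
    return counts


def with_tag(records, tag):
    """Return the IDs of incidents carrying the given tag, in input order."""
    return [record["id"] for record in records if tag in record["tags"]]


tag_counts = count_by_tag(incidents)
dns_incidents = with_tag(incidents, "dns")
```

Because the groups overlap, the per-tag counts can add up to more than the number of incidents. With exclusive buckets, INC-1 would have had to be filed under either DNS or deploys, and the other contributing factor would drop out of the statistics.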
