SRE Weekly Issue #292

A message from our sponsor, Rootly:

Manage incidents directly from Slack with Rootly 🚒. Automate manual admin tasks like creating the incident channel, Jira ticket, and Zoom meeting, paging the right team, building the postmortem timeline, setting up reminders, and more. Book a demo:
https://rootly.io/?utm_source=sreweekly

Articles

The lessons:

  1. Acknowledge human error as a given and aim to compensate for it
  2. Conduct blameless post-mortems
  3. Avoid the “deadly embrace”
  4. Favor decentralized IT architectures

There have been quite a few of these “lessons learned” articles that I’ve passed over, but I feel like this one is worth reading.

Anurag Gupta — Shoreline.io

Niall Murphy

Could us-east-1 go away? What might you do about it? Let’s catastrophize!

I love catastrophizing!

Tim Bray

In evaluating the options, this article focuses on reliability: both that of the service itself and the features it offers for building reliable services on top of it.

Quentin Rousseau — Rootly

This article is published by my sponsor, Rootly, but their sponsorship did not influence its inclusion in this issue.

This one answers the questions: what are failure domains, and how can we structure them to improve reliability?

brandon willett

It’s a great list of questions, and it covers a lot of ground. SREs wear many hats.

Opsera

I’ve always been curious about how Prometheus and similar time-series DBs compress metric data. Now I know!

Alex Vondrak — Honeycomb

This one has some unconfirmed (but totally plausible!) deeper details about what might have gone wrong in the Facebook outage, sourced from rumors.

rachelbythebay

There’s a really intriguing discussion in here about why organizations might justify a choice of profit at the expense of safety, and how the deck is stacked.

Rob Poston

Outages

Updated: October 17, 2021 — 8:38 pm
A production of Tinker Tinker Tinker, LLC