SRE Weekly Issue #305

A message from our sponsor, Rootly:

Manage incidents directly from Slack with Rootly 🚒. Automate manual admin tasks like creating incident channel, Jira and Zoom, paging the right team, postmortem timeline, setting up reminders, and more. Book a demo (+ get a snazzy Rootly shirt):
https://rootly.com/demo/?utm_source=sreweekly

Articles

[…] when Kubernetes is involved, the number of alert sources can skyrocket quickly. This article will reflect on some common causes of alert fatigue and share tips to help reduce it.

  Nate Matherson — DZone

Meta has a special system to warn servers about power outages, giving them 45 seconds of battery power to finish things up and get ready to shut down.

  Raghunathan Modoor Jagannathan, Sulav Malla, and Parimala Kondety — Meta

This is an approachable explanation of the Paxos algorithm with examples, diagrams, and code.

  Martin Fowler

But what does reliability mean for people outside of engineering? And how does it translate into best practices for other teams?

  Emily Arnott — Blameless

“The Practice of Practice” is a concept from improvisational music. This article artfully applies the idea to the practice of incident response.

  Matt Davis — Blameless

I haven’t heard of this technique before: assigning alerts to on-call folks in round-robin order as they come in. I wonder if there’s a reason for that…

  Hannah Culver — PagerDuty

Raise your hand if you’ve been bitten by DNS before.

  Julia Evans

Outages

Updated: January 16, 2022 — 8:36 pm
A production of Tinker Tinker Tinker, LLC