SRE Weekly Issue #305

Articles

Avoiding Alert Fatigue: 8 Tips for Every K8s Team

[…] when Kubernetes is involved, the number of alert sources can skyrocket quickly. This article will reflect on some common causes of alert fatigue and share tips to help reduce it.

Nate Matherson — DZone

Power Loss Siren: Making Meta resilient to power loss events

Meta has a special system to warn servers about power outages, giving them 45 seconds of battery power to finish things up and get ready to shut down.

Raghunathan Modoor Jagannathan, Sulav Malla, and Parimala Kondety — Meta

Paxos

This is an approachable explanation of the Paxos algorithm with examples, diagrams, and code.

Martin Fowler

The Universal Language: Reliability for Non-Engineering Teams

But what does reliability mean for people outside of engineering? And how does it translate into best practices for other teams?

Emily Arnott — Blameless

SRE and the Practice of Practice

“The Practice of Practice” is a concept from improvisational music. This article artfully applies the idea to the practice of incident response.

Matt Davis — Blameless

Equitably distribute on-call responsibility and streamline incident response with Round Robin Scheduling

I haven’t heard of this technique before: assigning alerts to on-call folks in round-robin order as they come in. I wonder if there’s a reason for that…

Hannah Culver — PagerDuty

Some ways DNS can break

Raise your hand if you’ve been bitten by DNS before.

Julia Evans

