SRE Weekly Issue #430

A message from our sponsor, FireHydrant:

We’ve gone all out on our new integration with Microsoft Teams. If you’re an MS Teams user, FireHydrant now supports the most comprehensive integration for incident management. Run the entire IM process without ever leaving the chat.

Lots of great tips in the comments if you’re looking to tune your resume.

  u/goodolbluey and others — reddit

What can SREs do to increase their available focus time?

  Krishna Vinnakota — DZone

One set of DNS root nameservers recently fell behind by a couple of days on updates for the root zone. We kind of just expect the root servers to work, you know?

  Dan Goodin — Ars Technica

Stripe talks about the design of their DocDB system built on MongoDB that achieves 5 nines of reliability.

  Jimmy Morzaria and Suraj Narkhede — Stripe

A Severity Zero (worst-case) incident is an entirely different thing from your average incident. This article talks about what makes it different and gives tips for handling one.

  Chris Evans —

With SLA credits kicking in for some services after just seconds of downtime, Amazon relies on multiple layers of automation.

  Nicholas Yan — Graphite

Here’s a great summary of a podcast episode about Google’s incident response practices.

Google’s latest Search Off The Record podcast discussed examples of disruptive incidents that can affect crawling and indexing, along with the criteria for deciding whether or not to disclose the details of what happened.

  Roger Montti — Search Engine Journal

Here are some essential practices and traits that can make you an exemplary SRE.

Includes 19 tips with short explanations.


How do layoffs impact resiliency and adaptive capacity? Are the folks making those decisions cognizant of the potential impact on reliability?

  Will Gallego

Updated: June 23, 2024 — 9:47 pm
A production of Tinker Tinker Tinker, LLC