SRE Weekly Issue #384

A message from our sponsor, Rootly:

When incidents impact your customers, failing to communicate with them effectively can erode trust even further and compound an already difficult situation. Learn the essentials of customer-facing incident communication in Rootly’s latest blog post:


GitHub tested a new Git merge strategy using Scientist, a framework that runs both the old and new implementations and compares the results.

  Jesse Toth — GitHub
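
The run-both-and-compare idea described above can be sketched in a few lines. This is a minimal illustration of the general pattern in Python, not Scientist's actual API (Scientist itself is a Ruby library). The names run_experiment, Observation, old_merge, new_merge, and broken_merge are all hypothetical, and the toy "merge" functions stand in for whatever real code paths are being compared. The sketch assumes the caller should always receive the old implementation's result, and that any failure in the new code is recorded rather than raised.

```python
import time
from dataclasses import dataclass
from typing import Any, Callable, Optional, Tuple


@dataclass
class Observation:
    """What one run of the experiment recorded (hypothetical structure)."""
    name: str
    matched: bool
    control_result: Any
    candidate_result: Any
    candidate_error: Optional[Exception]
    control_seconds: float
    candidate_seconds: float


def run_experiment(
    name: str,
    control: Callable[..., Any],
    candidate: Callable[..., Any],
    *args: Any,
    **kwargs: Any,
) -> Tuple[Any, Observation]:
    """Run the old (control) and new (candidate) code on the same input.

    The control's result is what the caller gets back. The candidate's
    result is only compared and recorded, and its exceptions are caught
    so that a bug in the new code cannot break the caller.
    """
    start = time.perf_counter()
    control_result = control(*args, **kwargs)
    control_seconds = time.perf_counter() - start

    candidate_result = None
    candidate_error: Optional[Exception] = None
    start = time.perf_counter()
    try:
        candidate_result = candidate(*args, **kwargs)
    except Exception as exc:  # record the failure instead of raising it
        candidate_error = exc
    candidate_seconds = time.perf_counter() - start

    matched = candidate_error is None and candidate_result == control_result
    observation = Observation(
        name=name,
        matched=matched,
        control_result=control_result,
        candidate_result=candidate_result,
        candidate_error=candidate_error,
        control_seconds=control_seconds,
        candidate_seconds=candidate_seconds,
    )
    return control_result, observation


# Toy stand-ins for an old and a new implementation (assumptions).
def old_merge(a, b):
    return sorted(set(a) | set(b))


def new_merge(a, b):
    return sorted(set(a + b))


def broken_merge(a, b):
    raise RuntimeError("new code path failed")


result, obs = run_experiment("merge", old_merge, new_merge, [1, 2], [2, 3])
bad_result, bad_obs = run_experiment("merge", old_merge, broken_merge, [1, 2], [2, 3])
```

In a real rollout the Observation would be sent to a metrics or logging system so mismatches can be reviewed before the new implementation replaces the old one.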

DNS is simple (kinda), but it can be really difficult to fully wrap your head around. This article explains why, and in the process gives a blueprint for designing more understandable tools in general.

  Julia Evans

Fallback differs from failover in several ways. This article describes how they differ, how fallback works, and why you might choose it over failover.

  Alex Ewerlöf

Repository Purpose: Provide teams and individuals an idea of what to take into consideration and what to aspire to in the SRE field and work.

Note: these checklists are opinionated.

  Arie Bregman

A thought-provoking article on trying to change people’s behavior in incidents through incentives (positive or negative) without also changing the context in which they act.

  Fred Hebert — Learning From Incidents

Cloudflare shares what they learned as they transitioned their KV service to a new architecture, a move that resulted in multiple unexpected problems.

  Matt Silverlock, Charles Burnett, Rob Sutter, and Kris Evans — Cloudflare

In this article, learn about two interesting strategies for getting an organization to prioritize technical debt work: using a more specific name for the work, and referencing the work’s impact on an SLO — and the impact of not doing the work.

  Emily Nakashima — Honeycomb
  Full disclosure: Honeycomb is my employer.

Updated: August 6, 2023 — 9:01 pm
A production of Tinker Tinker Tinker, LLC