SRE Weekly Issue #402

A message from our sponsor, FireHydrant:

Signals is now available in beta. Sign up to experience the difference: Page teams, not services. Ingest inputs from any source. Bucket pricing based on usage. And one platform — ring to retro — finally.
https://firehydrant.com/blog/signals-beta-live/

Wow, this interactive tool for choosing SLOs is fun to play with! Dragging the sliders really gives you a feel for the math involved, and then you get a formula that you can actually use.

  Alex Ewerlöf

A riveting story of a service that was the victim of its own success, a potential solution, and then further challenges to overcome.

  Tanat Lokejaroenlarb — Adevinta

Here’s a classic example of “work as imagined” vs “work as done”, as health care workers struggle against difficult security constraints while trying to care for patients.

  Fred Hebert — summary
  Ross Koppel, Sean Smith, Jim Blythe, and Vijay Kothari — original paper

This article covers a lot of ground, touching on many components of a successful SRE program, and even includes a code example for SLO calculation.

  Vishal Padghan — Squadcast

More on the weird EBS performance regression I linked to last week. Still no full explanation of what changed, but at least they have a solution (gp3 volumes).

  Dustin Brown — dolthub

After a massive 73-hour outage, Roblox set out to redesign their infrastructure to make that kind of incident much less likely. They’ve charted a path through several intermediate architectures, with the ultimate goal of active-active datacenters.

  Daniel Sturman, Max Ross, and Michael Wolf — Roblox

Now here’s one that really makes me think. I can’t summarize it in a sentence, so just go read it.

  Lorin Hochstein

Updated: December 10, 2023 — 9:27 pm
A production of Tinker Tinker Tinker, LLC