SRE Weekly Issue #342

A message from our sponsor, Rootly:

Manage incidents directly from Slack with Rootly 🚒.

Rootly automates manual tasks like creating an incident channel, Jira ticket, and Zoom rooms, inviting responders, and creating statuspage updates, postmortem timelines, and more. Want to see why companies like Canva and Grammarly love us?


As a television broadcaster, how do I ensure that my channels are playing out the right thing for my viewers?

This is SRE applied to TV broadcasting: they replaced human monitoring of screens with an automated system.

  Jeremy Blythe
  Full disclosure: Honeycomb, my employer, is mentioned.

An interview with an engineer about on-call practices, training folks for on-call, and chaos engineering.

  Elena Boroda — Fiberplane

SRE: totally defined. Time for a reorg, and with a catchy tune!

  Forrest Brazeal

Great advice for incident response, backed up by real-world anecdotes.

  Audrey Simonne — DZone

There’s a lot to learn from this air accident. A chilling example: several quirks of the plane’s automation combined to effectively tell the pilot to keep pushing the plane toward a stall.

  Admiral Cloudberg

When sharding a database, if transactions can span shards, then it can be very difficult to reason about the system’s maximum throughput.

For example, splitting a single-node database in half could lead to worse performance than the original system.

  Marc Brooker
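
To make the claim above concrete, here is a rough back-of-the-envelope sketch. It is a toy model of my own, not the analysis from the linked article. It assumes exactly two shards, that single-shard transactions spread evenly across them, and that a cross-shard transaction costs each shard `cross_cost` units of work, where 1.0 is the cost of one transaction on the original single node. The function name and both parameters are hypothetical.

```python
def two_shard_throughput(cross_fraction: float, cross_cost: float,
                         node_capacity: float = 1.0) -> float:
    """Estimate the max throughput of a two-shard system (toy model).

    cross_fraction: share of transactions that touch both shards (0..1).
    cross_cost: work each shard does per cross-shard transaction,
        relative to one transaction on the original single node
        (assumed value; includes any coordination overhead).
    node_capacity: work units per second a single node can do.
    """
    # Work that lands on ONE shard per transaction admitted to the system:
    # local transactions hit this shard half the time at cost 1.0,
    # cross-shard transactions always hit it at cost cross_cost.
    per_shard_work = (1.0 - cross_fraction) / 2.0 + cross_fraction * cross_cost
    # The system saturates when either shard runs out of capacity.
    return node_capacity / per_shard_work


if __name__ == "__main__":
    # Compare against the single node's throughput of 1.0.
    for p in (0.0, 0.25, 0.5, 1.0):
        print(f"cross-shard fraction {p:.2f}: "
              f"{two_shard_throughput(p, cross_cost=1.5):.2f}x single node")
```

Under these assumed numbers, the split doubles throughput only when no transaction spans shards, breaks even when half of them do, and falls below the original single node beyond that. The real picture depends heavily on the workload and the commit protocol, which is the point of the article.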

Through Ubuntu’s unattended-upgrades system, a systemd update was installed that broke systemd-resolved, which in turn broke GitHub Codespaces. The systemd bug report they link to is also well worth a read.

  Jakub Oleksy — GitHub

Why not?

we’re, unfortunately, too good at explaining away failures without making any changes to our priors.

  Lorin Hochstein

Updated: October 9, 2022 — 9:06 pm
A production of Tinker Tinker Tinker, LLC