SRE Weekly Issue #342

As a television broadcaster, how do I ensure that my channels are playing out the right thing for my viewers?

This is SRE applied to TV broadcasting: they replaced human monitoring of screens with an automated system.

  Jeremy Blythe —
  Full disclosure: Honeycomb, my employer, is mentioned.

An interview with an engineer about on-call practices, training folks for on-call, and chaos engineering.

  Elena Boroda — Fiberplane

SRE: totally defined. Time for a reorg, and with a catchy tune!

  Forrest Brazeal

Great advice for incident response, backed up by real-world anecdotes.

  Audrey Simonne — DZone

There’s a lot to learn from this air accident. A chilling example: several quirks of the plane’s automation combined to effectively tell the pilot to keep pushing the plane toward a stall.

  Admiral Cloudberg

When sharding a database, if transactions can span shards, then it can be very difficult to reason about the system’s maximum throughput.

For example, splitting a single-node database in half could lead to worse performance than the original system.

  Marc Brooker
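
To make the sharding point concrete, here is a rough back-of-the-envelope sketch. It is a toy model of my own, not the one from the linked article, and every number in it is an assumption: each shard can do a fixed amount of work per second, a single-shard transaction costs one unit of work, and a cross-shard transaction touches two shards and pays a hypothetical coordination overhead on each.

```python
def max_throughput(capacity, shards, cross_fraction, overhead):
    """Transactions/sec a toy sharded system can sustain.

    capacity       -- work units/sec one node can do (assumed equal per shard)
    shards         -- number of shards, with load assumed evenly spread
    cross_fraction -- fraction of transactions that span shards (0.0 to 1.0)
    overhead       -- extra work per touched shard for coordination
                      (hypothetical; 0.5 means 50% more work)
    """
    if shards == 1:
        return capacity  # a single node has nothing to coordinate
    # Average work one transaction costs, summed over all shards it touches.
    single = (1.0 - cross_fraction) * 1.0
    cross = cross_fraction * 2 * (1.0 + overhead)  # assumed to touch 2 shards
    work_per_txn = single + cross
    # Total capacity across shards divided by the work per transaction.
    return shards * capacity / work_per_txn


if __name__ == "__main__":
    print("single node:", max_throughput(1000, 1, 0.0, 0.5))
    for p in (0.0, 0.25, 0.5, 1.0):
        print("2 shards, cross-shard fraction", p, "->",
              round(max_throughput(1000, 2, p, 0.5)))
```

Under these assumptions, two shards only beat the single node when fewer than half of the transactions span shards. If every transaction spans both shards, the split system tops out at about two thirds of the original throughput.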

Through Ubuntu’s unattended-upgrades system, a systemd update was installed that broke systemd-resolved, which in turn broke GitHub Codespaces. The systemd bug report they link to is also well worth a read.

  Jakub Oleksy — GitHub

Why not?

we’re, unfortunately, too good at explaining away failures without making any changes to our priors.

  Lorin Hochstein

