SRE Weekly Issue #342

Articles

Video Observability

As a television broadcaster, how do I ensure that my channels are playing out the right thing for my viewers?

This is SRE applied to TV broadcasting: they replaced human monitoring of screens with an automated system.

Jeremy Blythe — evertz.io
Full disclosure: Honeycomb, my employer, is mentioned.

On-call with Jérôme Petazzoni

An interview with an engineer about on-call practices, training folks for on-call, and chaos engineering.

Elena Boroda — Fiberplane

The Re-Org Rag (I’m My Own VP)

SRE: totally defined. Time for a reorg, and with a catchy tune!

Forrest Brazeal

Keep Calm and Respond: A Beginner’s Heuristic to Incident Response

Great advice for incident response, backed up by real-world anecdotes.

Audrey Simonne — DZone

The Long Way Down: The crash of Air France flight 447

There’s a lot to learn from this air accident. A chilling example: several quirks of the plane’s automation combined to effectively tell the pilot to keep pushing the plane into a stall.

Admiral Cloudberg

Atomic Commitment: The Unscalability Protocol – Marc’s Blog

When sharding a database, if transactions can span shards, then it can be very difficult to reason about the system’s maximum throughput.

For example, splitting a single-node database in half could lead to worse performance than the original system.
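To make that claim concrete, here is a minimal back-of-the-envelope sketch. It is a toy model of my own, not the one from the linked article: the function name, the per-shard capacity, and the cost figures are all assumptions chosen for illustration. It assumes load is spread evenly, a single-shard transaction costs one unit of work, and a cross-shard transaction costs `cross_cost` units on every shard it touches.

```python
def max_throughput(shards, node_capacity, cross_fraction, cross_cost):
    """Toy estimate of sustainable transactions/sec (hypothetical model).

    shards:         number of shards, each on its own node
    node_capacity:  work units per second one node can perform
    cross_fraction: share of transactions that touch every shard
    cross_cost:     work units a cross-shard transaction costs on EACH shard
                    (above 1.0 to account for commit-coordination overhead)
    """
    # Average work per transaction, summed over all shards it touches.
    local_work = (1.0 - cross_fraction) * 1.0
    cross_work = cross_fraction * cross_cost * shards
    total_capacity = shards * node_capacity
    return total_capacity / (local_work + cross_work)


# Assumed numbers: one node does 1000 units/sec; coordination overhead is 50%.
single = max_throughput(shards=1, node_capacity=1000, cross_fraction=0.0, cross_cost=1.0)
all_cross = max_throughput(shards=2, node_capacity=1000, cross_fraction=1.0, cross_cost=1.5)
mostly_local = max_throughput(shards=2, node_capacity=1000, cross_fraction=0.1, cross_cost=1.5)

print(single, all_cross, mostly_local)
```

Under these assumed numbers, two shards with every transaction spanning both sustain fewer transactions per second than the original single node, while the same two shards with mostly local transactions do much better. The result is very sensitive to the cross-shard fraction, which is what makes the maximum throughput hard to reason about.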

Marc Brooker

GitHub Availability Report: September 2022

Through Ubuntu’s unattended-upgrades system, a systemd update was installed that broke systemd-resolved, which in turn broke GitHub Codespaces. The systemd bug report they link to is also well worth a read.

Jakub Oleksy — GitHub

There is no “Three Mile Island” event coming for software

Why not?

we’re, unfortunately, too good at explaining away failures without making any changes to our priors.

Lorin Hochstein
