SRE Weekly Issue #371

Articles

Is there such a thing as a system that’s too reliable?

NASA chose to squeeze just a bit more science out of the Voyager spacecrafts’ aging power supplies by sacrificing a layer of redundancy. I love this so much, because it sounds just like the kinds of decisions we make during incidents.

Robert Barron — IBM

Observability maven ‘cranky’ about AIOps embraces GPT

I really debated including this one, because I don’t often include articles about new products, and I think especially critically when the company in question is my employer.

With all that in mind, I’m including this one anyway because Charity Majors really put a fine point on exactly why I, too, am cranky about AIOps.

Beth Pariseau — TechTarget
Full disclosure: Honeycomb, my employer, is mentioned.

Assembly time is where you have the most control of an incident

The main reason that MTTR is a flawed metric is that the nature of each incident varies so wildly. Time to assemble, though, is much closer to being under our control.

Robert Ross — FireHydrant

How to improve incident triaging for better organization-wide incident response

The folks at incident.io recommend being expansive in what is considered an incident and then using a defined process to find the real incidents, determine impact and priority, and assign to the right team for resolution.

Luis Gonzalez — incident.io

GitHub Availability Report: April 2023

GitHub had some interesting incidents this time around, in several cases stemming from changes made with the intention of improving reliability.

Jakub Oleksy — GitHub

Migrating Critical Traffic At Scale with No Downtime — Part 1

Netflix records and replays live traffic in a testbed environment in order to validate a migration plan before they ever impact real customers.

Shyam Gala, Javier Fernandez-Ivern, Anup Rokkam Pratap, and Devang Shah — Netflix

Scaling up the Prime Video audio/video monitoring service and reducing costs by 90%

The move from a distributed microservices architecture to a monolith application helped achieve higher scale, resilience, and reduce costs.

I’ve seen this sentiment more frequently recently. Are we on the cusp of a general shift away from microservices?

Marcin Kolny — Amazon Prime Video
