SRE Weekly Issue #371

A message from our sponsor, Rootly:

Rootly is hiring a Sr. Developer Relations Advocate to continue helping more world-class companies like Figma, NVIDIA, and Squarespace accelerate their incident management journey. Looking for previous on-call engineers with a passion for making the world a more reliable place. Learn more:

https://rootly.com/careers?gh_jid=4015888007

Articles

NASA chose to squeeze just a bit more science out of the Voyager spacecraft’s aging power supplies by sacrificing a layer of redundancy. I love this so much, because it sounds just like the kinds of decisions we make during incidents.

  Robert Barron — IBM

I really debated including this one, because I don’t often include articles about new products, and I think especially critically when the company in question is my employer.

With all that in mind, I’m including this one anyway because Charity Majors really put a fine point on exactly why I, too, am cranky about AIOps.

  Beth Pariseau — TechTarget
  Full disclosure: Honeycomb, my employer, is mentioned.

The main reason that MTTR is a flawed metric is that the nature of each incident varies so wildly. Time to assemble, though, is much closer to being under our control.

  Robert Ross — FireHydrant

The folks at incident.io recommend being expansive in what is considered an incident and then using a defined process to find the real incidents, determine impact and priority, and assign to the right team for resolution.

  Luis Gonzalez — incident.io

GitHub had some interesting incidents this time around, in several cases stemming from changes made with the intention of improving reliability.

  Jakub Oleksy — GitHub

Netflix records and replays live traffic in a testbed environment in order to validate a migration plan before it ever impacts real customers.

  Shyam Gala, Javier Fernandez-Ivern, Anup Rokkam Pratap, and Devang Shah — Netflix

The move from a distributed microservices architecture to a monolith application helped achieve higher scale and resilience while reducing costs.

I’ve seen this sentiment more frequently recently. Are we at the cusp of a general shift away from microservices?

  Marcin Kolny — Amazon Prime Video

Updated: May 7, 2023 — 10:40 pm
A production of Tinker Tinker Tinker, LLC