Articles
NASA chose to squeeze just a bit more science out of the Voyager spacecrafts’ aging power supplies by sacrificing a layer of redundancy. I love this so much, because it sounds just like the kinds of decisions we make during incidents.
Robert Barron — IBM
I really debated including this one, because I don’t often include articles about new products, and I think especially critically when the company in question is my employer.
With all that in mind, I’m including this one anyway because Charity Majors really put a fine point on exactly why I, too, am cranky about AIOps.
Beth Pariseau — TechTarget
Full disclosure: Honeycomb, my employer, is mentioned.
The main reason that MTTR is a flawed metric is that the nature of each incident varies so wildly. Time to assemble, though, is much closer to being under our control.
Robert Ross — FireHydrant
The folks at incident.io recommend being expansive in what is considered an incident and then using a defined process to find the real incidents, determine impact and priority, and assign to the right team for resolution.
Luis Gonzalez — incident.io
GitHub had some interesting incidents this time around, in several cases stemming from changes made with the intention of improving reliability.
Jakub Oleksy — GitHub
Netflix records and replays live traffic in a testbed environment in order to validate a migration plan before it ever impacts real customers.
Shyam Gala, Javier Fernandez-Ivern, Anup Rokkam Pratap, and Devang Shah — Netflix
The move from a distributed microservices architecture to a monolith application helped achieve higher scale and resilience while reducing costs.
I’ve seen this sentiment more frequently recently. Are we at the cusp of a general shift away from microservices?
Marcin Kolny — Amazon Prime Video