SRE Weekly Issue #398

A message from our sponsor, FireHydrant:

“Change is the essential process of all existence.” – Spock
It’s time for alerting to evolve. Get a first look at how incident management platform FireHydrant is architecting Signals, its native alerting tool, for resilience in the Signals Captain’s Log.
https://firehydrant.com/blog/captains-log-a-first-look-at-our-architecture-for-signals/

A cardiac surgeon draws lessons from the Tenerife commercial airline disaster and applies them to communication in the operating room.

  Dr. Rob Poston

Creating an incident write-up is an expensive investment. This article will tell you why it’s worthwhile.

  Emily Ruppe — Jeli

The optimism and pessimism in this article are about the likelihood of contention and conflicts between actors in a distributed system, and it’s a fascinating way of looking at things.

  Marc Brooker

Here is a guide to being an effective Incident Commander and getting things fixed as quickly as possible as part of an efficient incident management process.

  Jonathan Word

This research paper summary explains the four concepts of resilience: Rebound, Robustness, Graceful Extensibility, and Sustained Adaptability.

  Fred Hebert (summary)
  Dr. David Woods (original paper)

Apache Beam played a pivotal role in revolutionizing and scaling LinkedIn’s data infrastructure. Beam’s powerful streaming capabilities enable real-time processing for critical business use cases, at a scale of over 4 trillion events daily through more than 3,000 pipelines.

  Bingfeng Xia and Xinyu Liu — LinkedIn

Meta’s SCARF tool automatically scans for unused (dead) code and creates pull requests to remove it on a daily basis.

  Will Shackleton, Andy Pincombe, and Katriel Cohn-Gordon — Meta

Netflix built a system that detects kernel panics in k8s nodes and annotates the resulting orphaned pods so that it’s clear what happened to them.

  Kyle Anderson — Netflix

This upcoming webinar will cover a range of topics around resilience engineering and incident response, with two big names we’ve seen in many past issues: Chris Evans (incident.io) and Courtney Nash (Verica).

Updated: November 12, 2023 — 9:22 pm
A production of Tinker Tinker Tinker, LLC