
SRE Weekly Issue #399

A message from our sponsor, FireHydrant:

Severity levels help responders and stakeholders understand the incident impact and set expectations for the level of response. This can mean jumping into action faster. But first, you have to ensure severity is actually being set. Here’s one way.
https://firehydrant.com/blog/incident-severity-why-you-need-it-and-how-to-ensure-its-set/

This research paper summary goes into Mode Error and the dangers of adding more features to a system in the form of modes, especially if the system can change modes on its own.

  Fred Hebert (summary)
  Dr. Nadine B. Sarter (original paper)

Cloudflare suffered a power outage in one of the datacenters housing their control and data planes. The outage itself is intriguing, and in its aftermath, Cloudflare learned that their system wasn’t as HA as they thought.

Lots of great lessons here, and if you want more, they posted another incident writeup recently.

   Matthew Prince — Cloudflare

Separating write from read workloads can increase complexity but also open the door to greater scalability, as this article explains.

  Pier-Jean Malandrino
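To make the idea concrete, here’s a minimal sketch of my own (not from the article) of routing writes to a primary database and reads to a replica; the `Store` type, the DSN parameters, and the Postgres driver choice are all illustrative assumptions.

```go
package store

import (
	"context"
	"database/sql"

	_ "github.com/lib/pq" // driver choice is just an example
)

// Store routes writes to the primary and reads to a replica, so the two
// workloads can be scaled and tuned independently.
type Store struct {
	primary *sql.DB // handles INSERT/UPDATE/DELETE
	replica *sql.DB // handles SELECT traffic
}

func New(primaryDSN, replicaDSN string) (*Store, error) {
	primary, err := sql.Open("postgres", primaryDSN)
	if err != nil {
		return nil, err
	}
	replica, err := sql.Open("postgres", replicaDSN)
	if err != nil {
		return nil, err
	}
	return &Store{primary: primary, replica: replica}, nil
}

// SaveOrder is a write: it always goes to the primary.
func (s *Store) SaveOrder(ctx context.Context, id string, total int) error {
	_, err := s.primary.ExecContext(ctx,
		"INSERT INTO orders (id, total) VALUES ($1, $2)", id, total)
	return err
}

// OrderTotal is a read: it goes to the replica and may see slightly stale data.
func (s *Store) OrderTotal(ctx context.Context, id string) (int, error) {
	var total int
	err := s.replica.QueryRowContext(ctx,
		"SELECT total FROM orders WHERE id = $1", id).Scan(&total)
	return total, err
}
```

The extra complexity the article warns about shows up quickly: the replica lags the primary, so reads can be stale, and you now have two connection pools to operate and monitor.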

Covers four strategies for load shedding, with code examples:

  • Random Shedding
  • Priority-Based Shedding
  • Resource-Based Shedding
  • Node Isolation

  Code Reliant
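As a rough illustration of the second strategy in that list, here’s a sketch (not the article’s code) of priority-based shedding as HTTP middleware; the `X-Priority` header and the in-flight limits are made-up placeholders.

```go
package main

import (
	"net/http"
	"sync/atomic"
)

// PriorityShedder rejects low-priority requests once the number of requests
// in flight crosses a soft limit, keeping headroom for critical traffic, and
// rejects everything past a hard limit.
type PriorityShedder struct {
	inFlight  atomic.Int64
	softLimit int64
	hardLimit int64
	next      http.Handler
}

func (s *PriorityShedder) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	n := s.inFlight.Add(1)
	defer s.inFlight.Add(-1)

	lowPriority := r.Header.Get("X-Priority") != "critical" // header name is illustrative
	switch {
	case n > s.hardLimit,
		n > s.softLimit && lowPriority:
		w.Header().Set("Retry-After", "1")
		http.Error(w, "overloaded, try again later", http.StatusServiceUnavailable)
		return
	}
	s.next.ServeHTTP(w, r)
}

func main() {
	handler := &PriorityShedder{
		softLimit: 100,
		hardLimit: 200,
		next: http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			w.Write([]byte("ok\n"))
		}),
	}
	http.ListenAndServe(":8080", handler)
}
```

Returning 503 with a Retry-After header lets well-behaved clients back off instead of hammering an already overloaded service.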

Lots of juicy details about the three outages, including a link to AWS’s write-up of their Lambda outage in June.

  Gergely Orosz

The diagrams in this article are especially useful for understanding how the circuit-breaker pattern works.

  Pier-Jean Malandrino
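If you’d like something to poke at alongside the diagrams, here’s a minimal circuit-breaker sketch of my own covering the closed, open, and half-open states; the thresholds and cooldown are arbitrary.

```go
package breaker

import (
	"errors"
	"sync"
	"time"
)

var ErrOpen = errors.New("circuit breaker is open")

// Breaker trips to the open state after maxFailures consecutive failures,
// fails fast while open, and lets a single trial call through (half-open)
// once the cooldown has elapsed.
type Breaker struct {
	mu          sync.Mutex
	failures    int
	maxFailures int
	cooldown    time.Duration
	openedAt    time.Time
	open        bool
}

func New(maxFailures int, cooldown time.Duration) *Breaker {
	return &Breaker{maxFailures: maxFailures, cooldown: cooldown}
}

func (b *Breaker) Call(fn func() error) error {
	b.mu.Lock()
	if b.open {
		if time.Since(b.openedAt) < b.cooldown {
			b.mu.Unlock()
			return ErrOpen // still open: fail fast without calling the dependency
		}
		// Half-open: allow this call through as a probe.
	}
	b.mu.Unlock()

	err := fn()

	b.mu.Lock()
	defer b.mu.Unlock()
	if err != nil {
		b.failures++
		if b.failures >= b.maxFailures {
			// A failed probe re-opens the breaker and restarts the cooldown.
			b.open = true
			b.openedAt = time.Now()
		}
		return err
	}
	// Success closes the breaker and resets the failure count.
	b.failures = 0
	b.open = false
	return nil
}
```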

This one’s about how on-call can go bad, and how to structure your team’s on-call so that it’s livable and sustainable.

  Michael Hart

Execs cast a big shadow in an incident, so it’s important to have a plan for how to communicate with them, as this article explains.

  Ashley Sawatsky — Rootly

SRE Weekly Issue #398

A message from our sponsor, FireHydrant:

“Change is the essential process of all existence.” – Spock
It’s time for alerting to evolve. Get a first look at how incident management platform FireHydrant is architecting Signals, its native alerting tool, for resilience in the Signals Captain’s Log.
https://firehydrant.com/blog/captains-log-a-first-look-at-our-architecture-for-signals/

A cardiac surgeon draws lessons from the Tenerife aviation disaster and applies them to communication in the operating room.

  Dr. Rob Poston

Creating an incident write-up is an expensive investment. This article will tell you why it’s worthwhile.

  Emily Ruppe — Jeli

The optimism and pessimism in this article are about the likelihood of contention and conflicts between actors in a distributed system, and it’s a fascinating way of looking at things.

  Marc Brooker
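For a toy illustration of the optimistic end of that spectrum (my sketch, not the article’s), here’s a versioned read-modify-write loop: it costs nothing extra when there’s no contention, and it simply retries when a conflict is detected.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// record guards its value with a version number instead of holding a lock
// across the whole read-modify-write: the optimistic bet is that conflicts
// are rare, so we only pay for them when they actually happen.
type record struct {
	mu      sync.Mutex
	version int
	balance int
}

var errConflict = errors.New("version conflict, retry")

// read returns a snapshot of the value and the version it was read at.
func (r *record) read() (balance, version int) {
	r.mu.Lock()
	defer r.mu.Unlock()
	return r.balance, r.version
}

// writeIfUnchanged commits only if nobody else wrote since we read.
func (r *record) writeIfUnchanged(newBalance, expectedVersion int) error {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.version != expectedVersion {
		return errConflict
	}
	r.balance = newBalance
	r.version++
	return nil
}

func deposit(r *record, amount int) {
	for {
		balance, version := r.read()
		// ...arbitrary work happens here, with no lock held...
		if err := r.writeIfUnchanged(balance+amount, version); err == nil {
			return
		}
		// Conflict: another actor won the race; re-read and try again.
	}
}

func main() {
	r := &record{}
	var wg sync.WaitGroup
	for i := 0; i < 10; i++ {
		wg.Add(1)
		go func() { defer wg.Done(); deposit(r, 1) }()
	}
	wg.Wait()
	fmt.Println(r.balance) // 10
}
```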

Here is a guide for how to be an effective Incident Commander and get things fixed as quickly as possible as part of an efficient Incident Management process.

  Jonathan Word

The four concepts are Rebound, Robustness, Graceful Extensibility, and Sustained Adaptability, and this research paper summary explains each concept.

  Fred Hebert (summary)
  Dr. David Woods (original paper)

Apache Beam played a pivotal role in revolutionizing and scaling LinkedIn’s data infrastructure. Beam’s powerful streaming capabilities enable real-time processing for critical business use cases, at a scale of over 4 trillion events daily through more than 3,000 pipelines.

  Bingfeng Xia and Xinyu Liu — LinkedIn

Meta’s SCARF tool automatically scans for unused (dead) code and creates pull requests for its removal, on a daily basis.

  Will Shackleton, Andy Pincombe, and Katriel Cohn-Gordon — Meta

Netflix built a system that detects kernel panics in k8s nodes and annotates the resulting orphaned pods so that it’s clear what happened to them.

  Kyle Anderson — Netflix

This upcoming webinar will cover a range of topics around resilience engineering and incident response, with two big names we’ve seen in many past issues: Chris Evans (incident.io) and Courtney Nash (Verica).

SRE Weekly Issue #397

A message from our sponsor, FireHydrant:

Incident management platform FireHydrant is combining alerting and incident response in one ring-to-retro tool. Sign up for the early access waitlist and be the first to experience the power of alerting + incident response in one platform at last.
https://firehydrant.com/signals/

The length and complexity of this article hint at the theme that runs throughout: there’s no easy, universal, perfect rollback strategy. Instead, they present a couple of rollback strategies you can choose from and implement.

  Bob Walker — Octopus Deploy

This article looks at improving error management in batch processing programs by adding automatic safety switches, which play a critical role in safeguarding data integrity when technical errors occur.

  Bertrand Florat — DZone
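Here’s one way such a safety switch might look, as a rough sketch rather than the article’s implementation; the 5% error-rate threshold and the minimum sample size are arbitrary placeholders, and `loadItems`/`process` stand in for the real batch job.

```go
package main

import (
	"fmt"
	"log"
)

// safetySwitch aborts a batch run once too many items have failed, on the
// theory that a burst of errors signals a systemic problem (bad input file,
// schema change, dead dependency) rather than a few bad records, and that
// continuing would do more harm than good.
type safetySwitch struct {
	processed  int
	failed     int
	minSample  int     // don't trip before we've seen this many items
	maxErrRate float64 // trip when failed/processed exceeds this
}

func (s *safetySwitch) record(err error) {
	s.processed++
	if err != nil {
		s.failed++
	}
}

func (s *safetySwitch) tripped() bool {
	if s.processed < s.minSample {
		return false
	}
	return float64(s.failed)/float64(s.processed) > s.maxErrRate
}

func main() {
	sw := &safetySwitch{minSample: 100, maxErrRate: 0.05}
	for _, item := range loadItems() {
		err := process(item)
		if err != nil {
			log.Printf("item %v failed: %v", item, err)
		}
		sw.record(err)
		if sw.tripped() {
			log.Fatalf("aborting batch: %d/%d items failed", sw.failed, sw.processed)
		}
	}
	fmt.Println("batch complete")
}

// loadItems and process are placeholders for the real batch workload.
func loadItems() []int       { return make([]int, 1000) }
func process(item int) error { return nil }
```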

Part of their observability strategy, which they call “shadowing”, is especially nifty.

  Lev Neiman and Jason Fan — DoorDash

It’s interesting that the DB failed in a way that GitHub’s Orchestrator deployment was unable to detect.

  Jakub Oleksy — GitHub

What exactly is a Senior Staff Engineer? While this article is not specifically about Senior Staff SREs, it’s directly applicable, especially as I’ve seen more Staff+ SRE job postings in the past couple years.

  Alex Ewerlöf

“Blameless” doesn’t mean no names allowed!

Remember: if discussing the actions of a specific person is being done for the sake of better learning, don’t shy away from it.

  incident.io

This series is shaping up to be a great study guide for new SREs.

Each day of this week brings you one step closer to not only acing your SRE interviews but also becoming the SRE who can leverage code & infrastructure to perfect systems reliability.

  Code Reliant

A fascinating and scary concept: a tool for automatically identifying and performing all the changes involved in deprecating an entire product.

  Will Shackleton, Andy Pincombe, and Katriel Cohn-Gordon — Meta

SRE Weekly Issue #396

A message from our sponsor, FireHydrant:

DevOps keeps evolving but alerting tools are stuck in the past. Any modern alerting tool should be built on these four principles: cost-efficiency, service catalog empowerment, easier scheduling and substitutions, and clear distinctions between incidents and alerts.
https://firehydrant.com/blog/the-new-principles-of-incident-alerting-its-time-to-evolve/

Using 3 high-profile incidents from the past year, this article explores how to define SLOs that might catch similar problems, with a special focus on keeping the SLI close to the user experience.

   Adriana Villela and Ana Margarita Medina — The New Stack

Microservices can have some great benefits, but if you want to build with them, you’re going to have to solve a whole pile of new problems.

  Roberto Vitillo

To protect your application against failures, you first need to know what can go wrong. […] the most common failures you will encounter are caused by single points of failure, the network being unreliable, slow processes, and unexpected load.

  Roberto Vitillo

I love how this article keeps things interesting by starting with a fictional (but realistic) story about the dangers of over-alerting before continuing on to give direct advice.

  Adso

I especially enjoy the section on the potential pitfalls and challenges with retries and how you can avoid them.

  CodeReliant
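As a companion to that section, here’s a small retry sketch of my own (not the article’s code) with the usual pitfall-avoiders baked in: capped exponential backoff, full jitter, an attempt limit, and respect for the caller’s context.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// retry calls fn up to maxAttempts times with capped exponential backoff and
// full jitter. The cap bounds the worst-case wait, the jitter keeps clients
// from retrying in lockstep against a recovering dependency, and the attempt
// limit keeps a persistent failure from retrying forever.
func retry(ctx context.Context, maxAttempts int, fn func() error) error {
	const (
		baseDelay = 100 * time.Millisecond
		maxDelay  = 5 * time.Second
	)
	var err error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		if err = fn(); err == nil {
			return nil
		}
		backoff := baseDelay << attempt
		if backoff > maxDelay {
			backoff = maxDelay
		}
		sleep := time.Duration(rand.Int63n(int64(backoff))) // full jitter
		select {
		case <-time.After(sleep):
		case <-ctx.Done():
			return ctx.Err() // stop retrying once the caller has given up
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", maxAttempts, err)
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()

	attempts := 0
	err := retry(ctx, 5, func() error {
		attempts++
		if attempts < 3 {
			return errors.New("transient failure")
		}
		return nil
	})
	fmt.Println(err, "after", attempts, "attempts")
}
```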

This reddit thread is a goldmine, including this gem:

I actively avoid getting involved with software subject matter expertise, because it robs the engineering team of self-reliance, which is itself a reliability issue.

  u/bv8z and others — reddit

There’s a pretty cool “Five Whys”-style analysis that goes past “dev pushed unreviewed code with incomplete tests to production” and to the sociotechnical challenges underlying that.

  Tobias Bieniek — crates.io

A production of Tinker Tinker Tinker, LLC