SRE Weekly Issue #416

A message from our sponsor, FireHydrant:

We need tools that help us show our value, enhance understanding of our systems, and free time for us to expand our skills. In this article, FireHydrant lays out three questions to ask vendors as you evaluate DevOps tools. https://firehydrant.com/blog/3-questions-to-ask-of-any-devops-tool-in-2024/

What can we, in turn, learn from some of the most honest and blameless—and public—postmortems of the last few years?

They cover incidents from GitLab, Tarsnap, Roblox, and Cloudflare with great summaries and takeaways.

  The Hacker News

My favorite part of this interview is when Vanessa describes parenting twin babies as constant incident response.

  Shane Hastie — InfoQ

Here follow some lessons I’ve learned from the trenches in small start-ups and larger engineering teams, to improve your on-call shift experience and remediation time for production issues and make sure you’re spending on-call efforts on what has the most impact.

  Alex Wauters

Doing your chaos experiments in a non-production environment can feel safer, but what are you giving up?

  Sam Rossoff — Gremlin

Sometimes, shell is just the right tool for the job.

  Amin Astaneh — Certo Modo

Catherine from Mastodon summarized this incident report beautifully:

this is one of the most violently unhinged CSB reports i’ve ever read […]

while investigating an explosion at a facility, CSB staff tried to prevent another explosion of the same kind in the same facility, and being unable to convince the workers to not cause it, ended up hiding behind a shipping container

  U.S. Chemical Safety and Hazard Investigation Board

This one’s about why people tend to want a “SPoG” and what we should want instead. Bonus points for the Star Trek reference.

  Nočnica Mellifera — Checkly

Right in the middle of migrating from one datacenter to an HA pair of new datacenters, one of the new ones failed. They had to quickly do a partial rollback of the migration to ride out the outage.

  Gauthier François — Doctolib

Today, we are thrilled to announce the release of bpftop, a command-line tool designed to streamline the performance optimization and monitoring of eBPF programs.

  Jose Fernandez — Netflix

SRE Weekly Issue #415

A message from our sponsor, FireHydrant:

Join FireHydrant and talk shop with your DevOps peers on March 28! You’ll gain a better understanding of what makes a fatigue-free on-call culture and how to implement practices to improve yours at this free, virtual roundtable.
https://app.livestorm.co/firehydrant/better-incidents-spring-bonfire-secrets-to-fatigue-free-on-call-in-2024

[…] it must be said that the intent of these metrics was always to give an indicator of how well your team was delivering software, not a high-stakes metric that should be used, for example, to hire and fire team leads.

  Nočnica Mellifera — The New Stack

A primer on the problems with N+1 database queries and how this pattern can sneak into your code whether you realize it or not.

  neda — ReadySet
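
For anyone who hasn't run into the pattern yet, here is a minimal sketch of what an N+1 looks like next to a single joined query. It is not taken from the article: the `authors`/`posts` schema, the sample rows, and both function names are made up for illustration, and it uses Python's built-in `sqlite3` with an in-memory database.

```python
import sqlite3

# Hypothetical schema and rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE posts (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
    INSERT INTO authors VALUES (1, 'ana'), (2, 'bo'), (3, 'cy');
    INSERT INTO posts VALUES (1, 1, 'a1'), (2, 1, 'a2'), (3, 2, 'b1');
""")


def titles_n_plus_one(conn):
    """One query for the authors, then one more query per author (N+1)."""
    queries = 1
    authors = conn.execute("SELECT id, name FROM authors ORDER BY id").fetchall()
    result = {}
    for author_id, name in authors:
        rows = conn.execute(
            "SELECT title FROM posts WHERE author_id = ? ORDER BY id",
            (author_id,),
        ).fetchall()
        queries += 1
        result[name] = [title for (title,) in rows]
    return result, queries


def titles_single_join(conn):
    """The same data fetched with a single LEFT JOIN."""
    queries = 1
    rows = conn.execute("""
        SELECT a.name, p.title
        FROM authors a
        LEFT JOIN posts p ON p.author_id = a.id
        ORDER BY a.id, p.id
    """).fetchall()
    result = {}
    for name, title in rows:
        result.setdefault(name, [])
        if title is not None:  # authors without posts yield a NULL title
            result[name].append(title)
    return result, queries
```

Both functions return the same mapping, but the first issues one query per author on top of the initial one. That is the part that tends to hide behind an ORM's lazy loading.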

A great explainer on choosing the right SLIs, starting with the Golden Signals and branching out.

  Tyler Treat

My favorite part about this is the “latency budget” question — which team’s code gets to spend how much time doing its part to serve a request?

  Alex Ewerlöf

Changes in two programs outside the container made Ceph suddenly grind to a halt, as detailed in this troubleshooting story.

  Vladimir Guryanov — Palark

The word “one” is the key here, as the author argues for getting rid of “warning” alerts entirely in favor of using only “critical”.

  Gauthier François

They wrote a Slack bot to summarize open PagerDuty incidents every day.

  Matt Weingarten

The problems I’ll explore in this blog—from the SRE perspective—are about time pressures (when to ship the investigation) and the type of report people expect.

  Fred Hebert — Honeycomb

  Full disclosure: Honeycomb is my employer.

In order to reduce the noise, first they had to define noisy alerts and the KPIs they were looking to improve.

  Gauthier François — Doctolib

SRE Weekly Issue #414

A message from our sponsor, FireHydrant:

91% of engineering leaders say they want a better alerting tool. The other 9% couldn’t take the survey on their Blackberry. Meet Signals: a new standard in alerting and on call, now available.
https://firehydrant.com/blog/alerting-and-on-call-scheduling-for-how-you-actually-work/

This year’s VOID Report is out, and it’s well worth a read. The subtitle is “Exploring the Unintended Consequences of Automation in Software” which is a really good way to get me to read something!

  Courtney Nash — The VOID

A terraform change deleted a critical resource, and reviewers missed it because the plan was so big. Now they use Atlantis and Open Policy Agent to avoid accidental deletions of critical resources.

  Lin Du — InfoQ
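
To make the idea concrete, here is a rough sketch of a deletion guard. The article's team uses Atlantis and Open Policy Agent; this is not their policy, just the same check written in Python against the JSON that `terraform show -json` emits for a saved plan (a `resource_changes` list whose entries carry `change.actions`). The function name, the protected prefixes, and the sample plan are all hypothetical.

```python
def find_protected_deletions(plan: dict, protected_prefixes: tuple) -> list:
    """Return addresses of protected resources that the plan would delete.

    `plan` is the parsed output of `terraform show -json <planfile>`.
    A replacement counts too, since its actions also include "delete".
    """
    flagged = []
    for rc in plan.get("resource_changes", []):
        address = rc.get("address", "")
        actions = rc.get("change", {}).get("actions", [])
        if "delete" in actions and address.startswith(protected_prefixes):
            flagged.append(address)
    return flagged


# Hypothetical plan fragment, trimmed to the fields the check reads.
sample_plan = {
    "resource_changes": [
        {"address": "aws_s3_bucket.logs", "change": {"actions": ["update"]}},
        {"address": "aws_db_instance.main", "change": {"actions": ["delete"]}},
        {"address": "aws_instance.web", "change": {"actions": ["delete", "create"]}},
    ]
}

flagged = find_protected_deletions(sample_plan, ("aws_db_instance.",))
```

A check like this reports the one line that matters regardless of how large the plan is, which is exactly what a human reviewer struggles with.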

When analyzing an incident, what can we learn when we assume that everyone did everything as well as possible?

  Lorin Hochstein

onsite technicians performing this planned network maintenance inadvertently unplugged several fibers that were adjacent to those in the work order, but still in use for production traffic

  Google

There’s a huge difference between four and five nines. An especially interesting quote in this article says that Google doesn’t think five nines is attainable in a commercial service.

  Diana Bocco — UptimeRobot
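
The gap is easiest to see as arithmetic. This is a back-of-the-envelope sketch, assuming a 365-day year and counting every minute of the year against the target; the function name is my own.

```python
def allowed_downtime_minutes(nines: int, days: float = 365.0) -> float:
    """Minutes of downtime permitted over `days` at the given number of nines.

    Four nines means 99.99% availability, i.e. an unavailability of 10**-4.
    """
    unavailability = 10 ** -nines
    return days * 24 * 60 * unavailability


four_nines = allowed_downtime_minutes(4)  # about 52.6 minutes per year
five_nines = allowed_downtime_minutes(5)  # about 5.3 minutes per year
```

At five nines, a single incident that takes more than about five minutes to detect and mitigate uses up the whole year's allowance.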

Here’s an interview with three SREs about what it’s like to be an SRE at IBM.

  IBM

I’ve been hearing about Observability 2.0 but didn’t know what it was all about. This article explains what it is and how it can help with cost.

  Charity Majors — Honeycomb

  Full disclosure: Honeycomb is my employer.

A cute little video pep talk for SREs. The site is actually real, too!

  Krazam

Like a mini Y2K, leap day came around again and left some technical glitches in its wake, as chronicled in this article.

  Gergely Orosz — The Pragmatic Engineer

SRE Weekly Issue #413

Sorry about the automation fail and resend! That definitely wasn’t issue #1.

A message from our sponsor, FireHydrant:

Check out how global payments company Dock uses FireHydrant to streamline and consolidate their incident management stack and reduce what they call “mean time to combat.”
https://firehydrant.com/blog/the-revolution-in-critical-incident-response-at-dock-with-firehydrant/

This article discusses building failure management directly into our systems, using Erlang as a case study.

  Jamie Allen

Building on their experience with their previous load shedding library, Uber built a new one that requires no configuration.

  Jakob Holdgaard Thomsen, Vladimir Gavrilenko, Jesper Lindstrom Nielsen, and Timothy Smyth — Uber

These folks found a way to get tag names and values from other people’s AWS resources. I know this is more security- than SRE-related but the technique is just so cool!

  Daniel Grzelak — Plerion

How much does it cost to improve resilience? What’s the ROI? It’s fuzzy, but we still need to do it.

  Will Gallego

Check it out, it’s an entire SRE conference I was totally unaware of!

  SREday

It’s an SLI/SLO/SLA explainer, but with a twist: a pros and cons list for each of the three.

  Laura Clayton — UptimeRobot

A great reddit thread for some schadenfreude… and perhaps you’d like to share your own story?

  u/New_Detective_1363 and others — reddit

What an interesting cause for an incident: the service your customers have pointed your product at decides to block your requests, effectively DoSing your systems.

  Tomas Koprusak — UptimeRobot

The CAP theorem is useful as a theory, but what does it actually mean in practice?

  neda — ReadySet

SRE Weekly Issue #412

A message from our sponsor, FireHydrant:

FireHydrant’s new and improved MTTX analytics dashboard is here! See which services are most affected by incidents, where they take the longest to detect (or acknowledge, mitigate, resolve … you name it), and how metrics and statistics change over time.
https://firehydrant.com/blog/mttx-incident-analytics-to-drive-your-reliability-roadmap/

Can a single dashboard that covers your entire system really exist?

  Jamie Allen

This one makes the case for having a group of specially-trained incident commanders to handle SEV-1 (worst-case) outages, separate from your normal ICs.

  Jonathan Word

This article lays out a strategy for gaining buy-in by making three specific, sequential arguments.

  Emily Arnott — Blameless

This article explores the varying ways that SRE is implemented through a set of 4 archetypes.

  Alex Ewerlöf

It turns out that assigning ephemeral ports to connections in Linux is way more complicated than it might seem at first glance, and there’s room for optimization, as this article explains.

  Frederick Lawler — Cloudflare
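
If you have never watched the kernel hand out an ephemeral port, the simplest way is to bind to port 0 and ask which port you got. This sketch shows only that basic case, not the selection behavior or the optimization the article digs into; the function name is my own. On Linux the candidate range comes from the `net.ipv4.ip_local_port_range` sysctl.

```python
import socket


def bind_ephemeral(host: str = "127.0.0.1"):
    """Bind a TCP socket to port 0 so the kernel picks a free port.

    Returns the open socket and the port number the kernel assigned.
    The caller is responsible for closing the socket.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind((host, 0))
    port = sock.getsockname()[1]
    return sock, port


sock, port = bind_ephemeral()
print(f"kernel assigned port {port}")
sock.close()
```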

While deploying Precision Time Protocol (PTP) at Meta, we’ve developed a simplified version of the protocol (Simple Precision Time Protocol – SPTP), that can offer the same level of clock synchronization as unicast PTPv2 more reliably and with fewer resources.

  Oleg Obleukhov and Ahmad Byagowi — Meta

Far more than just a list of links, this article gives an overview of each topic before pointing you in the right direction for more information.

  Fred Hebert

Building on the groundwork laid out in our first article about the initial steps in Incident Management (IM) at Dyninno Group, this second installment will explore the practicalities of streamlining and implementing these strategies.

  Vladimirs Romanovskis

A production of Tinker Tinker Tinker, LLC