SRE Weekly Issue #405

A message from our sponsor, FireHydrant:

In this episode of FireHydrant’s Gimme 5 video series, Asaf Gaon, Director of Technical Support for automated grocery fulfillment solution Takeoff Technologies, talks about how to handle third-party downtime in a collaborative – and automated – way. https://firehydrant.com/blog/gimme-5-with-takeoff-technologies-asaf-gaon/

Using the Swedish word “Lagom” as a jumping-off point, this article explains the importance of choosing an SLO that is just right: not too lax and not too strict.

  Alex Ewerlöf

A simple security change like ceasing to use IMDSv1 can involve profound risk and necessitate a major migration process.

  Archie Gunasekara — Slack

It can be all too easy to let a subset of your IT organization “handle” resiliency. If resilience is about an ability to adapt and respond to change, then it needs broad buy-in.

  Richard Gall — The New Stack

If any seemingly innocuous change can break our systems, what should we do?

  Lorin Hochstein

What exactly is “human error”?

  Steven Shorrock — Humanistic Systems

We recently upgraded from Postgres 11.9 to 15.3 with zero downtime by using logical replication, a suite of support scripts, and tools in Elixir & Erlang’s BEAM virtual machine.

They share a ton of details about how they did it.

  Brent Anderson — Knock

Why do doctors still use antiquated pagers? There’s a lot here that speaks to what it’s really like to operate in an on-call environment, and how to evaluate new tools.

  Fred Hebert

This article riffs on Murphy’s law, exploring various aspects of how things go wrong using anecdotes.

  Bertrand Florat

SRE Weekly Issue #404

A message from our sponsor, FireHydrant:

Looking to cozy up with a good read this week? Check out “Your guide to better status pages.” It’s a mini masterclass on how to better communicate on your status pages. https://firehydrant.com/blog/your-guide-to-better-incident-status-pages/

For every 9 you add to an SLO, you’re making the system 10x more reliable but also 10x more expensive.

  Alex Ewerlöf
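
To make the reliability half of that claim concrete, here is a minimal sketch of the arithmetic. The 30-day window and the function name are my own assumptions for illustration, and the cost half of the claim is not modeled at all.

```python
def allowed_downtime_minutes(nines: int, window_days: int = 30) -> float:
    """Minutes of downtime permitted in the window by a target with `nines` nines.

    Two nines means 99%, three nines means 99.9%, and so on.
    The 30-day default window is an assumption, not something from the article.
    """
    unavailability = 10 ** -nines          # e.g. 3 nines -> 0.001
    window_minutes = window_days * 24 * 60
    return window_minutes * unavailability


for n in range(2, 6):
    target_percent = 100 * (1 - 10 ** -n)
    minutes = allowed_downtime_minutes(n)
    print(f"{target_percent:.3f}% allows {minutes:.2f} minutes per 30 days")
```

Each extra nine divides the allowed downtime by ten, which is where the "10x more reliable" figure comes from.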

In this incident story, the feature flags were served by the main application server. When a new feature caused the server to crash, there was no way to flip the flag back off to stop the crashes.

  rachelbythebay

The author of a classification system for human error reflects 20 years later on the harm that such systems can cause by using deficit-based language.

  Dr. Steven Shorrock

Here’s Fred Hebert’s analysis of Cloudflare’s write-up of their incident on November 2.

I’m hoping they’re going to do a more in-depth review.

  Fred Hebert — VOID

In this post, we introduce a hybrid approach that seamlessly combines the precision of manual instrumentation with the comfort, efficiency, and performance of automatic instrumentation.

  Ron Federman — Odigos

Change is not the problem. It’s unaddressed risk.

  Bruce Johnston — High Scalability

A shell script with a loop running a DB client can fill up your ephemeral ports in a hurry.

  Oren Eini — RavenDB
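
A back-of-the-envelope sketch of why this happens so quickly. It assumes common Linux defaults (ephemeral port range 32768 to 60999 and a 60-second TIME_WAIT), that the client side closes each connection, and that every connection goes to the same destination address and port. The environment in the article may differ, so treat the numbers as illustrative only.

```python
def max_sustained_connections_per_sec(
    port_low: int = 32768,      # assumed Linux default lower bound
    port_high: int = 60999,     # assumed Linux default upper bound
    time_wait_secs: int = 60,   # assumed TIME_WAIT duration
) -> float:
    """Rough ceiling on new connections per second to one destination.

    Each closed connection parks its local port in TIME_WAIT, so the
    port pool can only be recycled once per TIME_WAIT interval.
    """
    usable_ports = port_high - port_low + 1
    return usable_ports / time_wait_secs


rate = max_sustained_connections_per_sec()
print(f"Roughly {rate:.0f} new connections per second before ports run out")
```

Under those assumptions, a tight loop that opens and closes a client connection a few hundred times a second is already near the ceiling.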

When you get right down to it, it’s all human communication, even assembly code. It’s human factors all the way down.

  Michael Hart

SRE Weekly Issue #403

A message from our sponsor, FireHydrant:

Signals is now available in beta. Sign up to experience alerting for modern DevOps teams: Page teams, not services. Ingest inputs from any source. Bucket pricing based on usage. And one platform — ring to retro — finally. https://firehydrant.com/blog/signals-beta-live/

A great overview of SLIs, covering event-based vs time-based SLIs, commonly used SLIs, and examples of things that don’t make good SLIs.

  Alex Ewerlöf
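
The difference between event-based and time-based SLIs is easier to see with numbers. This is a minimal sketch with made-up data. The `Request` type, the one-minute slices, and the rule that a minute is "bad" if any request in it failed are all assumptions of mine, not definitions from the article.

```python
from dataclasses import dataclass


@dataclass
class Request:
    minute: int   # which minute of the window the request arrived in
    ok: bool      # whether it succeeded


def event_based_sli(requests: list[Request]) -> float:
    """Fraction of requests that succeeded."""
    good = sum(1 for r in requests if r.ok)
    return good / len(requests)


def time_based_sli(requests: list[Request], window_minutes: int) -> float:
    """Fraction of minutes in the window with no failed request."""
    bad_minutes = {r.minute for r in requests if not r.ok}
    return (window_minutes - len(bad_minutes)) / window_minutes


# Hypothetical 10-minute window: a busy healthy minute, then a quiet broken one.
requests = [Request(minute=0, ok=True)] * 96 + [Request(minute=1, ok=False)] * 4

print("event-based:", event_based_sli(requests))
print("time-based: ", time_based_sli(requests, window_minutes=10))
```

The same outage looks different through each lens: the event-based SLI weights it by traffic, while the time-based SLI weights it by duration.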

When it’s time to declare an incident, I want to spend ten seconds or less getting things kicked off.

  Matilda Hultgren — incident.io

This short article covers three important aspects of error budgets:

  1. Understanding Your Error Budget
  2. Make Informed Decisions
  3. Proactively Communicate

  Code Reliant
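
A minimal sketch of the first two points, assuming an event-based SLO. The function names and the decision thresholds are hypothetical and only there to show the shape of the calculation.

```python
def error_budget_remaining(slo_target: float, total_events: int, bad_events: int) -> float:
    """Fraction of the error budget still unspent (negative means overspent)."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    allowed_bad = (1 - slo_target) * total_events
    return 1 - bad_events / allowed_bad


def decide(remaining: float) -> str:
    """Map remaining budget to an action. The thresholds are illustrative only."""
    if remaining <= 0:
        return "pause risky changes"
    if remaining < 0.25:
        return "slow down"
    return "ship as usual"


# Hypothetical month: 99.9% target, one million requests, 400 failures.
remaining = error_budget_remaining(0.999, 1_000_000, 400)
print(f"{remaining:.0%} of the budget left: {decide(remaining)}")
```

The third point, communicating proactively, is the part no function can do for you.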

SRE’s Golden Signals are four key metrics used to monitor the health of your service and underlying systems. We will explain what they are, and how they can help you improve service performance.

  Blameless
  Full disclosure: Honeycomb, my employer, is mentioned.

I hadn’t really appreciated some of the subtler details of CPU requests in k8s until I read this.

  Ara Pulido — Datadog
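
One of those subtler details, sketched under assumptions: with cgroup v1, Kubernetes turns a CPU request into a `cpu.shares` weight (millicores × 1024 / 1000, with a floor of 2), and when the node is saturated, CPU time is divided in proportion to those weights. The helper names below are hypothetical. The model assumes every container is CPU-bound and has no limit set, and cgroup v2 uses a different weight conversion.

```python
def cpu_shares(request_millicores: int) -> int:
    """cgroup v1 cpu.shares for a CPU request (assumed conversion, floor of 2)."""
    return max(2, request_millicores * 1024 // 1000)


def contended_split(requests: dict[str, int], node_millicores: int) -> dict[str, float]:
    """CPU each container gets when all of them want as much as possible.

    Simplified model: time is split in proportion to cpu.shares, with no limits.
    """
    shares = {name: cpu_shares(m) for name, m in requests.items()}
    total = sum(shares.values())
    return {name: node_millicores * s / total for name, s in shares.items()}


# Hypothetical 4-core node with two busy containers.
split = contended_split({"small": 500, "large": 1500}, node_millicores=4000)
for name, millicores in split.items():
    print(f"{name}: about {millicores:.0f}m under full contention")
```

In this model a request works as a relative weight rather than a cap, so both containers end up with more than they asked for when the node has spare capacity.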

Reading this, I can see hints of the contributing factors in many incidents I’ve been involved in.

To these folks, it feels like giving a damn is a huge career liability in your organization. Because it is.

  David Caudill

They went to impressive lengths to make the upgrade process reversible.

Amusingly, this post was directly relevant to me 30 minutes ago when I discovered mojibake all over sreweekly.com due to upgrading MySQL from 5.7 to 8.0+ last week. Oops.

  Jiaqi Liu, Daniel Rogart, and Xin Wu — GitHub

In order to learn from incidents, we need to know that they happened. That means someone needs to report them, but a lot can get in the way of reporting incidents.

  Dr. Steven Shorrock — Humanistic Systems

SRE Weekly Issue #402

A message from our sponsor, FireHydrant:

Signals is now available in beta. Sign up to experience the difference: Page teams, not services. Ingest inputs from any source. Bucket pricing based on usage. And one platform — ring to retro — finally.
https://firehydrant.com/blog/signals-beta-live/

Wow, this interactive tool for choosing SLOs is fun to play with! Dragging the sliders really gives you a feel for the math involved, and then you get a formula that you can actually use.

  Alex Ewerlöf

A riveting story of a service that was the victim of its own success, a potential solution, and then further challenges to overcome.

  Tanat Lokejaroenlarb — Adevinta

Here’s a classic example of “work as imagined” vs “work as done”, as health care workers struggle against difficult security constraints while trying to care for patients.

  Fred Hebert — summary
  Ross Koppel, Sean Smith, Jim Blythe, and Vijay Kothari — original paper

This article covers a lot of ground, touching on a lot of components of a successful SRE program, and even includes a code example for SLO calculation.

  Vishal Padghan — Squadcast

More on the weird EBS performance regression I linked to last week. Still no full explanation of what changed, but at least they have a solution (gp3 volumes).

  Dustin Brown — dolthub

After a massive 73-hour outage, Roblox set out to redesign their infrastructure to make that kind of incident much less likely. They’ve charted a path through several intermediate architectures, with the ultimate goal of active-active datacenters.

  Daniel Sturman, Max Ross, and Michael Wolf — Roblox

Now here’s one that really makes me think. I can’t really summarize it in a sentence, so just go read it.

  Lorin Hochstein

SRE Weekly Issue #401

A message from our sponsor, FireHydrant:

Join FireHydrant Dec. 14 for a conversation about on-call culture and its effect on engineering organizations, featuring special guests from Outreach and Udemy. Gain a better understanding of what makes excellent on-call culture and how to implement practices to improve yours.
https://app.livestorm.co/firehydrant/better-incidents-winter-bonfire-inside-on-call?type=detailed

Maybe you’re thinking of skipping over “yet another article about blamelessness”? Don’t. This one has some great examples and stories and is well worth a read.

  Michael Hart

I’m definitely guilty of a couple of these.

  Code Reliant

New podcast relevant to our interests!

In this series, you’ll hear insightful conversations with engineers, product managers, co-founders and more, all about the debatable topic of incident management.

  Luis Gonzalez — incident.io

A puzzling performance regression in EBS volumes, seemingly reproducible across instances. Anyone else seeing anything like this?

  Dustin Brown — dolthub

This article presents a framework for scaling SRE teams by defining SRE processes, automating, and iterating.

  Stelios Manioudakis — DZone

Some tips on what makes a good alert and how to design your alerts to be actually useful, rather than just noise.

  Leon Adato — Kentik

Why would you want multiple different targets for the same SLO? Read this one to find out.

  Alex Ewerlöf

Conflict-free Replicated Data Types (CRDTs) are powerful, but they come with downsides, which this article explains, so it would be great to avoid them when possible.

  Zak Knill
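
For anyone who hasn't met a CRDT before, here is a minimal sketch of one of the simplest, a grow-only counter. It only shows what the term means. It is not taken from the article, and the class and method names are my own.

```python
class GCounter:
    """Grow-only counter: each replica only ever increments its own slot."""

    def __init__(self, replica_id: str) -> None:
        self.replica_id = replica_id
        self.counts: dict[str, int] = {}

    def increment(self, n: int = 1) -> None:
        if n < 0:
            raise ValueError("a grow-only counter cannot decrease")
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + n

    def value(self) -> int:
        return sum(self.counts.values())

    def merge(self, other: "GCounter") -> None:
        """Take the per-replica maximum, so merges can arrive in any order or repeat."""
        for rid, count in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), count)


# Two replicas accept writes independently, then exchange state.
a, b = GCounter("a"), GCounter("b")
a.increment(2)
b.increment(3)
a.merge(b)
b.merge(a)
a.merge(b)  # repeating a merge changes nothing
print(a.value(), b.value())
```

Both replicas converge on the same total without coordinating, which is the appeal. Even this tiny example has to carry per-replica state around to get there.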

A production of Tinker Tinker Tinker, LLC