SRE Weekly Issue #403

A message from our sponsor, FireHydrant:

Signals is now available in beta. Sign up to experience alerting for modern DevOps teams: Page teams, not services. Ingest inputs from any source. Bucket pricing based on usage. And one platform — ring to retro — finally. https://firehydrant.com/blog/signals-beta-live/

A great overview of SLIs, covering event-based vs time-based SLIs, commonly used SLIs, and examples of things that don’t make good SLIs.

  Alex Ewerlöf

When it’s time to declare an incident, I want to spend ten seconds or less getting things kicked off.

  Matilda Hultgren — incident.io

This short article covers three important aspects of error budgets:

  1. Understanding Your Error Budget
  2. Make Informed Decisions
  3. Proactively Communicate

  Code Reliant

SRE’s Golden Signals are four key metrics used to monitor the health of your service and underlying systems. We will explain what they are, and how they can help you improve service performance.

  Blameless
  Full disclosure: Honeycomb, my employer, is mentioned.

I hadn’t really appreciated some of the subtler details of CPU requests in k8s until I read this.

  Ara Pulido — Datadog

Reading this, I can see hints of the contributing factors in many incidents I’ve been involved in.

To these folks, it feels like giving a damn is a huge career liability in your organization. Because it is.

  David Caudill

They went to impressive lengths to make the upgrade process reversible.

Amusingly, this post was directly relevant to me 30 minutes ago when I discovered mojibake all over sreweekly.com due to upgrading MySQL from 5.7 to 8.0+ last week. Oops.

  Jiaqi Liu, Daniel Rogart, and Xin Wu — GitHub

In order to learn from incidents, we need to know that they happened. That means someone needs to report them, but a lot can get in the way of reporting incidents.

  Dr. Steven Shorrock — Humanistic Systems

Updated: December 17, 2023 — 10:39 pm
A production of Tinker Tinker Tinker, LLC