SRE Weekly Issue #275

A message from our sponsor, StackHawk:

Join ZAP Founder & Project Lead Simon Bennetts on June 30 for a live AMA where he will be answering questions on all things open source and AppSec. Register:
http://sthwk.com/Simon-AMA

Articles

Here’s a take on incident severity levels. I enjoy learning what criteria folks use for this, so please send similar articles my way (or maybe write your own?).

Nancy Chauhan — Rootly

Counterfactuals (“should haves”) stifle incident retrospectives by tempting us to stop digging deeper. This article points out that there are unending possible counterfactuals for any incident.

Michael Nygard

Read on to find out why counting incidents (or tracking “# days since an outage”) won’t help and will cause more problems than it’s worth. Also included: options for what to count instead.

incident.io

Sloth is a tool for generating SLOs as Prometheus metrics, claiming to support “any kind of service”.

Xabier Larrakoetxea

If you’re looking for a way to evaluate your SRE process, this might help.

Alex Bramley — Google

This article tries to put an actual number on the cost of adding more nines of reliability.

Jack Shirazi — Expedia

It’s time for Catchpoint’s yearly SRE report, downloadable in PDF form through this link. Note: you have to give them your email address.

Catchpoint

Outages

  • Akamai
    • This outage impacted banks and airlines, among other Akamai customers.
Updated: June 21, 2021 — 9:13 am
A production of Tinker Tinker Tinker, LLC