SRE Weekly Issue #275

Articles

Practical Guide to SRE: Incident Severity Levels

Here’s a take on incident severity levels. I enjoy learning what criteria folks use for this, so please send similar articles my way (or maybe write your own?).

Nancy Chauhan — Rootly

Counterfactuals are not Causality

Counterfactuals (“should haves”) stifle incident retrospectives by tempting us to stop digging deeper. This article points out that there are unending possible counterfactuals for any incident.

Michael Nygard

Don’t count your incidents, make your incidents count

Read on to find out why counting incidents (or “# days since an outage”) won’t help and will cause more problems than it’s worth. Also included: options for what to count instead.

incident.io

SLOs should be easy, say hi to Sloth

Sloth is a tool for generating SLOs as Prometheus metrics, claiming to support “any kind of service”.

Xabier Larrakoetxea

Evaluating where your team lies on the SRE spectrum

If you’re looking for a way to evaluate your SRE process, this might help.

Alex Bramley — Google

The Cost of 100% Reliability

This article tries to put an actual number on the cost of adding more nines of reliability.

Jack Shirazi — Expedia

2021 SRE Report

It’s time for Catchpoint’s yearly SRE report, downloadable as a PDF. Note: you have to give them your email address.

Catchpoint

Outages

Akamai
- This outage impacted banks and airlines, among other Akamai customers.
