SRE Weekly Issue #310

Articles

Incident Analysis 101: Who should investigate?

Here’s the next incredibly useful article in Jeli’s Incident Analysis 101 series. This one covers the skills and traits of a good incident analyst, along with what not to look for.

Laura Maguire — Jeli

A decade of major cache incidents at Twitter

This article has a remarkable level of detail on 13 incidents at Twitter that were related to cache. The authors open with an explanation of why they focused on cache-related incidents.

Dan Luu and Yao Yue

The three pillars of great incident response

[…] the same three pillars form the core of any good process, whether it’s for the largest e-commerce giant or a scrappy SaaS startup.

The three pillars are:

Clarity

Transparency

Calm

Lisa Karlin Curtis — incident.io

Designing your incident severity levels

This one recommends doing away with “P0” and “P5” and instead using plain words like “Low” and “High”.

Stephen Whitworth — incident.io

Why and How SREs Can Benefit from Feature Flags

Feature flags can be a useful way to resolve user impact during an incident.

Weihan Li — Rootly

This article is published by my sponsor, Rootly, but their sponsorship did not influence its inclusion in this issue.
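
As a rough illustration of the idea (not taken from the Rootly article), a feature flag can act as a kill switch: when a non-essential feature misbehaves during an incident, turning its flag off removes the user impact without a deploy. The FeatureFlags class, the "recommendations" flag name, and render_homepage are all hypothetical names used only for this sketch. A real system would read flags from a shared store rather than an in-process dictionary.

```python
class FeatureFlags:
    """A tiny in-memory flag store (hypothetical; real stores are shared and persistent)."""

    def __init__(self, defaults):
        self._flags = dict(defaults)

    def is_enabled(self, name):
        # Unknown flags are treated as off, which is the safe default here.
        return self._flags.get(name, False)

    def set(self, name, enabled):
        self._flags[name] = enabled


def render_homepage(flags):
    # The optional feature is guarded by its flag, so it can be switched off mid-incident.
    if flags.is_enabled("recommendations"):
        return "homepage with recommendations"
    return "homepage"


if __name__ == "__main__":
    flags = FeatureFlags({"recommendations": True})
    print(render_homepage(flags))
    # During an incident, an operator flips the flag instead of rolling back a release.
    flags.set("recommendations", False)
    print(render_homepage(flags))
```

The design choice worth noting is that the degraded path (plain homepage) already exists in the code, so disabling the flag is a configuration change rather than a code change.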

Who monitors the monitoring system? — Is my Prometheus alive at all

Implementing a dead man's switch for your alerting tool is really important so that you don't blissfully sleep through an outage.

Chris Loukas — HelloFresh
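
A dead man's switch of this kind can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the HelloFresh implementation: the monitoring system is assumed to send a periodic heartbeat to an independent watcher, and the watcher treats silence as the alarm condition. The DeadMansSwitch class, its method names, and the 60-second timeout are hypothetical.

```python
import time


class DeadMansSwitch:
    """Trips when no heartbeat has arrived within timeout_seconds (hypothetical sketch)."""

    def __init__(self, timeout_seconds, clock=time.monotonic):
        self.timeout_seconds = timeout_seconds
        self._clock = clock  # injectable so the logic can be tested without sleeping
        self._last_heartbeat = clock()

    def heartbeat(self):
        # Called each time the monitoring system proves it is still alive.
        self._last_heartbeat = self._clock()

    def is_tripped(self):
        # Silence for longer than the timeout means the monitor itself may be down.
        return self._clock() - self._last_heartbeat > self.timeout_seconds


if __name__ == "__main__":
    switch = DeadMansSwitch(timeout_seconds=60)
    switch.heartbeat()
    print("tripped:", switch.is_tripped())
```

The important property is that the watcher runs somewhere that does not share failure modes with the monitoring system it watches. Otherwise both can go down together and nobody gets paged.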

How We Define SRE Work

As SRE #1, the author of this article got to define the SRE role from the ground up.

Fred Hebert — Honeycomb

Lessons Learned in 10 Years of SRE: Part 1 – Starting SRE

In this article, I will share five lessons I learned about starting SRE teams (or engagements, or organizations).

This article is all about the shape of an SRE team, rather than technical details like SLOs and such.

Andrea Spadaccini — USENIX ;login:
