Be judicious when you generate remediation tasks from incidents, or you can end up investing in the wrong area.
Zendesk SRE has a set of 8 reliability principles that guide what they do.
Jason Smale — Zendesk
We’re going to talk about a few necessities that enable exceptional incident management.
- Service ownership
- Incident roles
- The incident declaration process
- Running incident drills
Robert Ross — FireHydrant
I don’t think you’re supposed to use Consul that way…
Read this article to follow along on an interesting design journey.
Thomas Ptacek — Fly.io
A single availability metric probably can’t tell you the whole story.
Stephen Townshend — Slight Reliability
We can learn from the process another engineer follows to debug a problem. But a ticket or problem description often records only the answer and strips out the process, which hampers learning.
Lorin Hochstein — The ReadME Project (GitHub)
We’re still not 100% there as a team, but I hope this article will serve as a reference for anyone who might create an SRE team in the future.
@tjun — Mercari
This article gives 6 different ways to organize the findings from your retrospective to share with different audiences.
Vanessa Huerta Granda — Jeli
There’s a great reliability story in the way that the Hubble telescope and the Apollo missions used gimbals — and in the way that the JWST doesn’t.
Robert Barron — IBM