Articles
Be judicious when you generate remediation tasks from incidents, or you can end up investing in the wrong area.
Marc Brooker
Zendesk SRE has a set of 8 reliability principles that guide what they do.
Jason Smale – Zendesk
We’re going to talk about a few necessities that enable exceptional incident management.
- Service ownership
- Incident roles
- The incident declaration process
- Running incident drills
Robert Ross – FireHydrant
I don’t think you’re supposed to use Consul that way…
Read this article to follow along on an interesting design journey.
Thomas Ptacek – Fly.io
One single metric for availability probably can’t tell you the whole story.
Stephen Townshend – Slight Reliability
We can learn from the process another engineer takes to debug a problem. But often, a ticket or problem description is stripped of the process and just has the answer, hampering learning.
Lorin Hochstein – The ReadME Project (GitHub)
We’re still not 100% there as a team, but I hope this article will serve as a reference for anyone who might create an SRE team in the future.
@tjun – Mercari
This article gives 6 different ways to organize the findings from your retrospective to share with different audiences.
Vanessa Huerta Granda – Jeli
There’s a great reliability story in the way that the Hubble telescope and the Apollo missions used gimbals, and in the way that the JWST doesn’t.
Robert Barron – IBM