Articles
I really need to learn bpftrace, and this article is a great place to start.
Brendan Gregg
If we expand our definition of “incident” beyond traditional engineering problems, we increase our opportunity for learning.
Stephen Whitworth — incident.io
This is an interview with a director at Catchpoint about their 2021 SRE Report. They discuss two results from the survey: folks report a 15% decrease in toil and slow adoption of AIOps.
Charlene O’Hanlon — devops.com
A recurring theme in this story is that folks only learned how the push notifications work during the incident.
Molly Struve — DEV
In this Reddit thread, a company hired some developers as SREs and then found that they didn’t want to do operations work. Folks weigh in on why and what to do.
u/red_flock and others — reddit
How exactly do you want to phrase (and measure) an SLO about latency percentiles? Beware the subtle details.
Piyush Verma — last9
I’m definitely going to think on the great incident response and followup wisdom in this interview. My favorite:
If I can change 1% to better that outcome, what is that 1%?
Christina Tan — Blameless
Full disclosure: Fastly, my employer, is mentioned.
Root cause: guessed wrong in the moment
Lorin Hochstein
Here’s a run-down of some IT mishaps from Olympic games past and present.
Quentin Rousseau — Rootly