SRE Weekly Issue #282

A message from our sponsor, StackHawk:

ICYMI ZAP Creator and Project Lead Simon Bennetts recently unveiled ZAP’s new automation framework. Watch the session and see how it works:
https://sthwk.com/Automation-Framework

Articles

I really need to learn bpftrace, and this article is a great place to start.

Brendan Gregg

If we expand our definition of “incident” beyond traditional engineering problems, we increase our opportunity for learning.

Stephen Whitworth — incident.io

This is an interview with a director at Catchpoint about their 2021 SRE Report. They discuss two results from the survey: folks report a 15% decrease in toil and slow adoption of AIOps.

Charlene O’Hanlon — devops.com

A recurring theme in this story is that it was only during the incident that folks learned how the push notifications work.

Molly Struve — DEV

In this reddit thread, a company hired some developers as SREs and then found that they didn’t want to do operations work. Folks weigh in on why and what to do.

u/red_flock and others — reddit

How exactly do you want to phrase (and measure) an SLO about latency percentiles? Beware the subtle details.

Piyush Verma — last9

I’m definitely going to think on the great incident response and followup wisdom in this interview. My favorite:

If I can change 1% to better that outcome, what is that 1%?

Christina Tan — Blameless

Full disclosure: Fastly, my employer, is mentioned.

Root cause: guessed wrong in the moment

Lorin Hochstein

Here’s a run-down of some IT mishaps from Olympic games past and present.

Quentin Rousseau — Rootly

Outages

Updated: August 8, 2021 — 9:21 pm
A production of Tinker Tinker Tinker, LLC