Panic takes time and energy away from swift incident response, leading to second-guessing, a higher likelihood of mistakes, and analysis paralysis. Here are three tips to minimize it.
Malcolm Preston — incident.io
A great explanation of why we need to wait for more details on the FAA NOTAM outage. My favorite part is the list of clues to whether an incident report might be useful: Time, Artifacts, Jargon, and Narrative.
Thai Wood — Resilience Roundup
Lots of juicy details about a large SRE organization and how they work.
Ash Patel — SREPath
A deploy accidentally wiped authentication tokens for some internal Cloudflare services, causing an outage for those services.
Kenny Johnson and Sam Rhea — Cloudflare
eBay thought about adopting “test in production” and eliminating staging, but they determined that their use case really does require a staging environment. They carefully selected and anonymized real production data to use as test cases in staging.
Senthil Padmanabhan — eBay
This article has a really great section explaining the pitfalls of full-system dashboards.
The first one is my favorite:
“Economic factors will force companies to look for more efficient ways of managing reliability”
I’m not sure if that will happen, but it’s an interesting theory.
This author shares what they learned while adapting to running incidents remotely after the pandemic hit.
Emily Ruppe — Jeli