SRE Weekly Issue #357

A message from our sponsor, Rootly:

Manage incidents directly from Slack with Rootly 🚒.

Rootly automates manual tasks like creating an incident channel, Jira ticket, and Zoom room, inviting responders, and creating statuspage updates, postmortem timelines, and more. Want to see why companies like Canva and Grammarly love us?


Panic takes time and energy away from swift incident response, leading to second-guessing, a higher likelihood of mistakes, and analysis paralysis. Here are three tips to minimize it.

  Malcolm Preston

A great explanation of why we need to wait for more details on the FAA NOTAM outage. My favorite part is the list of clues to whether an incident report might be useful: Time, Artifacts, Jargon, and Narrative.

  Thai Wood — Resilience Roundup

Lots of juicy details about a large SRE organization and how they work.

  Ash Patel — SREPath

A deploy accidentally wiped authentication tokens for some internal Cloudflare services, causing an outage for those services.

  Kenny Johnson and Sam Rhea — Cloudflare

eBay thought about adopting “test in production” and eliminating staging, but they determined that their use case really does require a staging environment. They carefully selected and anonymized real production data to use as test cases in staging.

  Senthil Padmanabhan — eBay

This article has a really great section explaining the pitfalls of full system dashboards.

  Boris Cherkasky

The first one is my favorite:

Economic factors will force companies to look for more efficient ways of managing reliability

I’m not sure if that will happen, but it’s an interesting theory.

  Emily Arnott

The author shares what they learned while adapting to running incidents remotely once the pandemic hit.

  Emily Ruppe — Jeli

Updated: January 29, 2023 — 9:07 pm
A production of Tinker Tinker Tinker, LLC