SRE Weekly Issue #369

A message from our sponsor, Rootly:

Manage incidents directly from Slack with Rootly 🚒.

Rootly automates manual tasks like creating an incident channel, Jira ticket and Zoom rooms, inviting responders, creating statuspage updates, postmortem timelines and more. Want to see why companies like Canva and Grammarly love us?

https://rootly.com/demo/

Articles

If we focus only on issues with individual components, there’s so much we miss about the nature of failure in our systems.

  Lorin Hochstein

Will Gallego says that we need to prioritize and incentivize learning from near misses, not just actual incidents.

We’ve made headway into expending energy towards learning from incidents. We’ll be even better off when we can regularly make learning from successes our regular work as well.

  Will Gallego

This air crash in 1977 taught us many important lessons, including surprising details about the behavior of jet engines in rain. The water ingestion testing apparatus shown in one of the photos is pretty impressive.

  Admiral Cloudberg

When your alerts cover systems owned by different teams, who should be on call?

  Nathan Lincoln — Honeycomb
  Full disclosure: Honeycomb is my employer.

Cloudflare does some pretty eye-opening things with the network stack and file descriptors, as described in this amusingly named article.

  Quang Luong and Chris Branch

While ostensibly about security incident response, this article has a lot of useful ideas for improving response to any kind of incident.

  Jess Chang — Vanta (for incident.io)

An argument for monoliths over microservices, but with an important caveat: be careful about compartmentalizing your failure domains.

  Lawrence Jones — incident.io

Here’s a great summary of the key themes from last month’s SRECon Americas.

  Paige Cruz — Chronosphere

Updated: April 23, 2023 — 8:54 pm
A production of Tinker Tinker Tinker, LLC