SRE Weekly Issue #318

A message from our sponsor, Rootly:

Manage incidents directly from Slack with Rootly 🚒. Automate manual admin tasks like creating an incident channel, Jira ticket, and Zoom meeting, paging and adding responders, building the postmortem timeline, setting up reminders, and more. Book a demo (+ get a snazzy Rootly lego set):
https://rootly.com/demo/

Articles

This talk summary explores the idea that “error” is a label applied to an event from the outside, rather than a simple fact. What can this tell us about our after-incident investigation process?

  Fred Hebert

Here’s a deep dive into a performance degradation in Cloudflare last December that was related to missing error handling in a shell script.

  Alex Forster — Cloudflare

Atlassian is having a tough time. It seems as if they deleted a few hundred customers’ data and have to pull it out of their backups one at a time.

Here’s another article about the outage (Steven J. Vaughan-Nichols — The New Stack).

  Gergely Orosz — Pragmatic Engineer

Cool trick: their client library can fall back to a backup domain if DNS for ably.io fails.

  Jo Stichbury — Ably

It still wasn’t quite DNS; it was an interesting situation with the Linux kernel’s martian packet detection algorithm.

  Laurent Bernaille and David Lentz — Datadog

Aside from the terrifying risk of nuclear war, this sounds very similar to the kind of complex system failures SREs deal with routinely.

  Zia Mian, M. V. Ramana — Scientific American

Both approaches have their pros and cons. The right strategy for your company or team depends, of course, on your needs and priorities.

  Quentin Rousseau — Rootly
This article is published by my sponsor, Rootly, but their sponsorship did not influence its inclusion in this issue.

Outages

Updated: April 18, 2022 — 9:03 am
A production of Tinker Tinker Tinker, LLC