SRE Weekly Issue #362

A message from our sponsor, Rootly:

Manage incidents directly from Slack with Rootly 🚒.

Rootly automates manual tasks like creating an incident channel, Jira ticket and Zoom rooms, inviting responders, creating statuspage updates, postmortem timelines and more. Want to see why companies like Canva and Grammarly love us?

https://rootly.com/demo/

Articles

You might wonder why I have given almost zero coverage to “AIOps” here, and why my coverage of “anomaly detection” has included heavy skepticism. The reason: I simply haven’t seen any proof that it works.

The FTC’s recent stance on AI sums up my position nicely. If you want your AIOps product covered here, don’t just tell me it works, prove to me that it works.

  Michael Atleson — Federal Trade Commission

How? With a safe and repeatable procedure for database migrations involving double-writing.

  Lisa Karlin Curtis — incident.io

Push to main on a new microservice repo and it deploys to production, spins up a Slack channel for alerts, invites the CODEOWNERS, creates an on-call rotation, and puts them in it. Wow!

  Kiselev Ivan — Better Programming

A routing issue caused widespread packet loss with worldwide impact across many services.

  Google

This month’s report had a couple of fascinating incidents, especially the one about source code archive hashes.

  Jakub Oleksy — GitHub

Folks from the New York Times used chaos engineering to prepare for the surge of traffic during the US’s presidential election. They share 5 guidelines for effective chaos engineering for big data systems.

  Shane Murray — Monte Carlo

Here’s that LFI Conf recap I wanted!

  Vanessa Huerta Granda — Jeli

Former Google folks published this guide to help recently laid-off Google SREs integrate with the way SRE is done in the rest of the tech world. There’s an interesting hint about Google’s on-call compensation that I’m going to have to look into.

  Murali Suriar and Niall Murphy

A normally conscientious airline captain made a decision he normally would not have, likely owing to severe sleep deprivation.

  Admiral Cloudberg

Updated: March 5, 2023 — 8:56 pm
A production of Tinker Tinker Tinker, LLC