SRE Weekly Issue #297

A message from our sponsor, Rootly:

Manage incidents directly from Slack with Rootly 🚒. Automate manual admin tasks like creating incident channel, Jira and Zoom, paging the right team, postmortem timeline, setting up reminders, and more. Book a demo:
https://rootly.com/?utm_source=sreweekly

Articles

It’s that time of year again, but maybe it’s time to rethink that code freeze.

Robert Ross — FireHydrant

This article really gets to the heart of why I love a good incident. I mean, obviously, I want to minimize incidents. I swear.

Lisa Karlin Curtis — incident.io

This article draws on incident reports from The VOID to show how root cause analysis can be problematic.

Courtney Nash — Verica

It’s interesting to read this article after reading the previous one. In the “my car won’t start” example, I found myself immediately wondering: why was the vehicle not maintained? What factors contributed to that?

Søren Pedersen — DZone

These are the “phases”, although they stress that aiming for Visionary doesn’t make sense for all organizations.

  • Absent
  • Reactive
  • Proactive
  • Strategic
  • Visionary

Google

Not the field I would have expected to look to for lessons, but it totally works!

Paul Marsicovetere — Formidable

This article introduces a 3-phased approach for safe database schema changes: Expand, Rollout, and Contract.

Alex Yates — Octopus Deploy
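
To make the three phases concrete, here is a minimal sketch using Python's built-in sqlite3 module. It reflects my reading of the expand/contract pattern and is not taken from the article: the `users` table, the `name` to `full_name` rename, and the function names are all hypothetical.

```python
import sqlite3


def expand(conn):
    # Expand: add the new column next to the old one and backfill it.
    # Old application code keeps working because nothing was removed.
    conn.execute("ALTER TABLE users ADD COLUMN full_name TEXT")
    conn.execute("UPDATE users SET full_name = name WHERE full_name IS NULL")


def rollout_insert(conn, full_name):
    # Rollout: new application code uses full_name, but still writes the
    # old column so instances running the old code are not broken.
    conn.execute(
        "INSERT INTO users (name, full_name) VALUES (?, ?)",
        (full_name, full_name),
    )


def contract(conn):
    # Contract: once nothing reads the old column, remove it. The table is
    # rebuilt here (an assumption made for portability across SQLite versions).
    conn.executescript(
        """
        CREATE TABLE users_new (id INTEGER PRIMARY KEY, full_name TEXT);
        INSERT INTO users_new (id, full_name) SELECT id, full_name FROM users;
        DROP TABLE users;
        ALTER TABLE users_new RENAME TO users;
        """
    )


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("INSERT INTO users (name) VALUES ('Ada')")
    expand(conn)
    rollout_insert(conn, "Grace")
    contract(conn)
    columns = [row[1] for row in conn.execute("PRAGMA table_info(users)")]
    names = [row[0] for row in conn.execute("SELECT full_name FROM users ORDER BY id")]
    conn.close()
    return columns, names
```

The point of splitting the change this way is that each step is safe to deploy on its own, and the destructive part waits until the old code is gone.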

Try to run a program, and you get “No such file or directory”, even though the program is right there. How can this happen?

Julia Evans
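
One way to reproduce this on Linux or macOS, which may or may not be the cause the article covers, is an executable script whose `#!` line points at an interpreter that does not exist. The sketch below is my own illustration; the path `/nonexistent/interpreter` and the function name are made up for the demo.

```python
import errno
import os
import stat
import subprocess
import tempfile


def run_script_with_missing_interpreter():
    """Run an existing, executable script whose interpreter is missing."""
    with tempfile.TemporaryDirectory() as tmp:
        script = os.path.join(tmp, "hello")
        with open(script, "w") as f:
            # The script file is real; the interpreter it names is not.
            f.write("#!/nonexistent/interpreter\necho hello\n")
        os.chmod(script, stat.S_IRWXU)

        exists = os.path.exists(script)
        try:
            subprocess.run([script], check=True)
        except OSError as err:
            # The kernel reports ENOENT for the missing interpreter, and the
            # error surfaces as if the script itself could not be found.
            return exists, err.errno == errno.ENOENT
        return exists, False
```

Here the file exists and the error is still "No such file or directory", because the missing piece is something the program needs in order to start rather than the program itself.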

Outages

  • Google Cloud Load Balancing
    • Google had a major outage that took down many sites and services. Notably, users of these sites were greeted with a Google 404 page with no branding related to the site they were attempting to access.
  • Grab
  • Tesla
    • Tesla owners were locked out of their cars or unable to start them during the outage.
Updated: November 21, 2021 — 8:58 pm
A production of Tinker Tinker Tinker, LLC