SRE Weekly Issue #307

A message from our sponsor, Rootly:

Manage incidents directly from Slack with Rootly 🚒. Automate manual admin tasks like creating incident channel, Jira and Zoom, paging the right team, postmortem timeline, setting up reminders, and more. Book a demo (+ get a snazzy Rootly shirt):
https://rootly.com/demo/?utm_source=sreweekly

Articles

This follow-up to Roblox's initial incident report offers a lot to learn from, especially if you run Consul at scale.

  Daniel Sturman and others — Roblox

This week, I came across the Byford Dolphin diving bell incident. This accident seems at face value to be “human error”, but there’s so much to it. Content warning: the accident was quite grisly.

  Wikipedia

Canary testing is more than just deploying your code to a small part of your fleet. You need a plan for how you’re going to spot problems.

  Jyoti Sahoo — OpsMx

My favorite part is how they look for changes in performance, rather than using a static threshold.

  Angus Croll — Netflix

It pays to think ahead about how you’ll answer questions from execs during an incident.

  Chris Fenning — DZone

On January 24, 2022, as a result of an internal Cloudflare product migration, 24 hostnames (including www.cloudflare.com) that were actively proxied through the Cloudflare global network were mistakenly redirected to the wrong origin.

  Jeremy Hartman — Cloudflare

An analysis of SRE job descriptions from 4 companies highlights what businesses actually expect SREs to do.

  JP Cheung — Rootly
This article is published by my sponsor, Rootly, but their sponsorship did not influence its inclusion in this issue.

Members of the search giant’s site reliability group say managers fostered a toxic environment. Google says a ‘safe, inclusive workplace’ is a top priority.

  Nico Grant — Bloomberg

Outages

Updated: January 30, 2022 — 9:01 pm
A production of Tinker Tinker Tinker, LLC