Articles
This followup to Roblox's initial incident report offers a lot to learn from, especially if you run Consul at scale.
Daniel Sturman and others — Roblox
This week, I came across the Byford Dolphin diving bell incident. At face value, this accident seems to be a case of “human error”, but there’s much more to it. Content warning: the accident was quite grisly.
Wikipedia
Canary testing is more than just deploying your code to a small part of your fleet. You need a plan for how you’re going to spot problems.
Jyoti Sahoo — OpsMx
My favorite part is how they look for changes in performance, rather than using a static threshold.
Angus Croll — Netflix
It pays to think ahead about how you’ll answer questions from execs during an incident.
Chris Fenning — DZone
On January 24, 2022, as a result of an internal Cloudflare product migration, 24 hostnames (including www.cloudflare.com) that were actively proxied through the Cloudflare global network were mistakenly redirected to the wrong origin.
Jeremy Hartman — Cloudflare
An analysis of SRE job descriptions from 4 companies highlights what businesses actually expect SREs to do.
JP Cheung — Rootly
This article is published by my sponsor, Rootly, but their sponsorship did not influence its inclusion in this issue.
Members of the search giant’s site reliability group say managers fostered a toxic environment. Google says a ‘safe, inclusive workplace’ is a top priority.
Nico Grant — Bloomberg