Lots of great tips in the comments if you’re looking to tune your resume.
u/goodolbluey and others — reddit
What can SREs do to increase their available focus time?
Krishna Vinnakota — DZone
One set of DNS root nameservers (c.root-servers.net) recently fell behind by a couple of days on updates for the root zone. We kind of just expect the root servers to work, you know?
Dan Goodin — Ars Technica
Stripe talks about the design of their DocDB system, built on MongoDB, that achieves five nines (99.999%) of reliability.
Jimmy Morzaria and Suraj Narkhede — Stripe
A Severity Zero (worst-case) incident is an entirely different thing from your average incident. This article talks about what makes it different and gives tips for handling one.
Chris Evans — incident.io
With SLA credits kicking in for some services after just seconds of downtime, Amazon relies on multiple layers of automation.
Nicholas Yan — Graphite
Here’s a great summary of a podcast episode about Google’s incident response practices.
Google’s latest Search Off The Record podcast discussed examples of disruptive incidents that can affect crawling and indexing, along with the criteria for deciding whether or not to disclose the details of what happened.
Roger Montti — Search Engine Journal
Here are some essential practices and traits that can make you an exemplary SRE.
Includes 19 tips with short explanations.
Prabesh
How do layoffs impact resiliency and adaptive capacity? Are the folks making those decisions cognizant of the potential impact on reliability?
Will Gallego