SRE Weekly Issue #430

r/sre: Senior SRE looking for a resume review, out of work for 7+ months now and still struggling to get interviews

Lots of great tips in the comments if you’re looking to tune your resume.

u/goodolbluey and others — reddit

Deep Work for Site Reliability Engineers

What can SREs do to increase their available focus time?

Krishna Vinnakota — DZone

A root-server at the Internet’s core lost touch with its peers. We still don’t know why.

One set of DNS root nameservers (c.root-servers.net) recently fell behind by a couple of days on updates for the root zone. We kind of just expect the root servers to work, you know?

Dan Goodin — Ars Technica

How Stripe’s document databases supported 99.999% uptime with zero-downtime data migrations

Stripe talks about the design of their DocDB system built on MongoDB that achieves 5 nines of reliability.

Jimmy Morzaria and Suraj Narkhede — Stripe

Mastering the Sev0

A Severity Zero (worst-case) incident is an entirely different thing from your average incident. This article talks about what makes it different and gives tips for handling one.

Chris Evans — incident.io

Down for less than four minutes a month: how AWS deploys code

With SLA credits kicking in for some services after just seconds of downtime, Amazon relies on multiple layers of automation.

Nicholas Yan — Graphite

Google On How It Manages Disclosure Of Search Incidents

Here’s a great summary of a podcast episode about Google’s incident response practices.

Google’s latest Search Off The Record podcast discussed examples of disruptive incidents that can affect crawling and indexing, along with the criteria for deciding whether to disclose the details of what happened.

Roger Montti — Search Engine Journal

Things That Makes a Good Site Reliability Engineer

Here are some essential practices and traits that can make you an exemplary SRE.

Includes 19 tips with short explanations.

Prabesh

Layoffs Reduce Safety

How do layoffs impact resiliency and adaptive capacity? Are the folks making those decisions cognizant of the potential impact on reliability?

Will Gallego
