SRE Weekly Issue #116


How can breaking something also fix it? Controlled chaos engineering can help your SRE team(s) better understand your systems and ultimately improve site reliability. See how VictorOps is incorporating “Game Days” to bolster their systems and their SRE culture:


The BBC suffered two simultaneous major outages that broke their online streaming product and forced their website into a limited-functioning mode. This post-incident follow-up explains what happened and how they dealt with it.

Richard Cooper — BBC

Bursting is a hidden reliability risk that has bitten me hard in the past. Click through for an explanation of the risk and how to mitigate it.

Michael Wittig — Cloudonaut

This post has the most concise definition I’ve seen yet for observability, along with a quiz that will tell you whether you’re Doing It Right™.

the power to ask new questions of your system, without having to ship new code or gather new data in order to ask those new questions

Charity Majors — Honeycomb

This debugging story is an entertaining read, and it’s also got some useful stuff to watch out for in your systems.

Tick tick tick. Time is hard.

Rachel Kroll

Solid knowledge of how DNS works is critical for SREs. This repo contains an introduction to DNS written to be far more approachable than the (many!) DNS RFCs. It’s a work in progress but there’s a lot of good content already.

Bert Hubert and others

Within this post, we’ll discuss growth planning, the challenges associated with being part of a remote team, and some of the unexpected advantages geographically distributed SRE teams can offer.

Akhil Ahuja — LinkedIn

Her thread starts here and continues being awesome:

Real talk, you should never have a paging alert on a system stats metric. Or a single host anything metric. (Or an aggregate host metric, or an aggregate divided by host count, or …)

Charity Majors


Updated: April 1, 2018 — 10:08 pm