The BBC suffered two simultaneous major outages that broke their online streaming product and forced their website into a limited-functionality mode. This post-incident follow-up explains what happened and how they dealt with it.
Richard Cooper — BBC
Bursting is a hidden reliability risk that has bitten me hard in the past. Click through for an explanation of the risk and how to mitigate it.
Michael Wittig — Cloudonaut
This post has the most concise definition I’ve seen yet for observability, along with a quiz that will tell you whether you’re Doing It Right™.
the power to ask new questions of your system, without having to ship new code or gather new data in order to ask those new questions
Charity Majors — Honeycomb
This debugging story is an entertaining read, and it’s also got some useful stuff to watch out for in your systems.
Tick tick tick. Time is hard.
Solid knowledge of how DNS works is critical for SREs. This repo contains an introduction to DNS written to be far more approachable than the (many!) DNS RFCs. It’s a work in progress but there’s a lot of good content already.
Bert Hubert and others
Within this post, we’ll discuss growth planning, the challenges associated with being part of a remote team, and some of the unexpected advantages geographically distributed SRE teams can offer.
Akhil Ahuja — LinkedIn
Her thread starts here and continues being awesome:
Real talk, you should never have a paging alert on a system stats metric. Or a single host anything metric. (Or an aggregate host metric, or an aggregate divided by host count, or …)