SRE Weekly Issue #116

Articles

BBC Online Outage on Saturday 19th July 2014

The BBC suffered two simultaneous major outages that broke their online streaming product and forced their website into a limited-functioning mode. This post-incident followup explains what happened and how they dealt with it.

Richard Cooper — BBC

Burst credits of t2 EC2 instances need monitoring

Bursting is a hidden reliability risk that has bitten me hard in the past. Click through for an explanation of the risk and how to mitigate it.

Michael Wittig — Cloudonaut

Observability: A Manifesto

This post has the most concise definition I’ve seen yet for observability, along with a quiz that will tell you whether you’re Doing It Right™.

“the power to ask new questions of your system, without having to ship new code or gather new data in order to ask those new questions”

Charity Majors — Honeycomb

Four interacting decisions break ssh access

This debugging story is an entertaining read, and it’s also got some useful stuff to watch out for in your systems.

Tick tick tick. Time is hard.

Rachel Kroll

GitHub – ahupowerdns/hello-dns: Hello and welcome to DNS!

Solid knowledge of how DNS works is critical for SREs. This repo contains an introduction to DNS written to be far more approachable than the (many!) DNS RFCs. It’s a work in progress but there’s a lot of good content already.

Bert Hubert and others

The Makeup of Successful Geographically-Distributed SRE Teams: Part 2 | LinkedIn Engineering

Within this post, we’ll discuss growth planning, the challenges associated with being part of a remote team, and some of the unexpected advantages geographically distributed SRE teams can offer.

Akhil Ahuja — LinkedIn

Twitter: mipsytipsy about alerting on metrics

Her thread starts here and continues being awesome:

Real talk, you should never have a paging alert on a system stats metric. Or a single host anything metric. (Or an aggregate host metric, or an aggregate divided by host count, or …)

Charity Majors

Outages

Telegram (messaging app)
Iomart (datacenter provider)
- Two separate network breaks cut off access to data centres run by cloud firm Iomart, affecting a wide range of customers
iTunes App Store
TD Ameritrade
