SRE Weekly Issue #27

Sorry I’m a tad late this week!

If you only have time to read one long article, I highly recommend this first one.

Articles

This fascinating series delves deeply into the cascade of failures leading up to the nearly fatal overdose of a pediatric patient hospitalized for a routine colonoscopy. It’s a five-article series, and it’s well worth every minute you’ll spend reading it. Human error, interface design, misplaced trust in automation, learning from aviation; it’s all here in abundance and depth.

In this second part of a two part series (featured here last week), Charity Majors delves into what operations means as we move toward a “serverless” infrastructure.

If you chose a provider, you do not get to just point your finger at them in the post mortem and say it’s their fault.  You chose them, it’s on you.

Interesting, though I have to say I’m a bit skeptical when I hear someone target six nines. Especially when they say this:

Redefining five nines is redefining them to go up to six nines,” said James Feger, CenturyLink’s vice president of network strategy and development […]

The Pre-Accident Podcast reminds us that incident response is just as important as incident prevention.

As automated remediation increases, the problems that actually hit our pagers become more complex and higher-level. This opinion piece from PagerDuty explores that trend and where it’s leading us.

A high-level overview of the difference between HA and DR and Netflix’s HA testing tool, Chaos Monkey.

Outages

Updated: June 13, 2016 — 1:53 pm
A production of Tinker Tinker Tinker, LLC Frontier Theme