SRE Weekly Issue #27

Sorry I’m a tad late this week!

If you only have time to read one long article, I highly recommend this first one.

Articles

How Technology Led a Hospital To Give a Patient 38 Times His Dosage

This fascinating series delves deeply into the cascade of failures leading up to the nearly fatal overdose of a pediatric patient hospitalized for a routine colonoscopy. It’s a five-article series, and it’s well worth every minute you’ll spend reading it. Human error, interface design, misplaced trust in automation, learning from aviation; it’s all here in abundance and depth.

WTF is operations? #serverless

In this second part of a two-part series (featured here last week), Charity Majors delves into what operations means as we move toward a “serverless” infrastructure.

If you chose a provider, you do not get to just point your finger at them in the post mortem and say it’s their fault. You chose them, it’s on you.

CenturyLink Targets ‘Six Nines’ Reliability | Light Reading

Interesting, though I have to say I’m a bit skeptical when I hear someone target six nines. Especially when they say this:

“Redefining five nines is redefining them to go up to six nines,” said James Feger, CenturyLink’s vice president of network strategy and development […]
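For a sense of why six nines invites skepticism, here is a rough sketch of the arithmetic behind a yearly downtime budget. It is only an illustration: the function name `allowed_downtime_seconds` is made up for this example, and it assumes a 365-day year and that availability is measured as simple uptime over the whole year, which may not match how any particular provider defines it.

```python
# Assumption: a non-leap year, with availability measured as uptime over the full year.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def allowed_downtime_seconds(nines: int) -> float:
    """Yearly downtime budget for `nines` nines of availability (5 -> 99.999%)."""
    unavailability = 10 ** -nines  # e.g. 5 nines -> 0.00001
    return SECONDS_PER_YEAR * unavailability


if __name__ == "__main__":
    for n in (5, 6):
        budget = allowed_downtime_seconds(n)
        print(f"{n} nines: {budget:.1f} seconds of downtime per year")
```

Under those assumptions, five nines allows a little over five minutes of downtime a year, while six nines allows roughly 32 seconds, which is less time than it takes most people to acknowledge a page.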

PAPod 73 – An Over-Emphasis on Prevention?

The Pre-Accident Podcast reminds us that incident response is just as important as incident prevention.

The Future of Incident Notification in the Modern Enterprise

As automated remediation increases, the problems that actually hit our pagers become more complex and higher-level. This opinion piece from PagerDuty explores that trend and where it’s leading us.

The major lesson IT can learn from Netflix’s high availability testing methodology

A high-level overview of the difference between HA and DR and Netflix’s HA testing tool, Chaos Monkey.

Outages

Fastly
- I noticed this one when it got in the way of my work.
Slack
The Pirate Bay
Telstra
eBay
SaskTel (Canada telecom)
34SP.com
Lending Club
Acunetix
- An opportunistic blackhat photoshopped a screenshot of the downed site, making it appear that they had breached its security.
EU referendum poll voter registration (UK)
Amazon EC2 and EBS (Sydney, AU)
- A major outage in one availability zone took down many sites and services in Australia. Amazon quickly released this detailed post-analysis. TL;DR: an extended utility voltage sag didn’t cause isolation breakers to trip, and the flywheel UPSes then quickly dumped all of their energy into the power grid.
