SRE Weekly Issue #150

Articles

5 Lessons Learned From Writing Over 300,000 Lines of Infrastructure Code

This article is a condensed version of a talk, but it stands firmly on its own. Their Production-Grade Infrastructure Checklist is well worth a read.

Yevgeniy Brikman — Gruntwork

OVMC, EORH Hope To Have Emergency Rooms Back Online

More and more, the reliability of our infrastructure is moving into the realm of life-critical.

Linda Comins — The Intelligencer

Thanks to Richard Cook for this one.

SREcon EMEA 2018 conference notes

Detailed notes on lots of talks from SREcon, with a great summary at the top discussing the major themes of the conference.

Max Timchenko

Developers On Call

Drawing from an @mipsytipsy Twitter thread from back in February, this article is a great analysis of why it’s right to put developers on call and how to make it humane. I especially like the part about paying extra for on-call, a practice I’ve been hearing mentioned more often recently.

John Barton

AWS Says It’s Never Seen a Whole Data Center Go Down

Really? Never? I could have sworn I remembered reading about power outages…

Yevgeniy Sverdlik — DataCenter Knowledge

Confusion over medicine names threatens lives

Lots of good stuff in this one about preventing mistakes and analyzing failures.

Rachel Bryan — Swansea University
