SRE Weekly Issue #150

A message from our sponsor, VictorOps:

The golden signals of SRE are essential metrics to monitor when developing reliable systems. But the golden signals are just the starting point. See how SRE teams are going past the golden signals to proactively build reliability into their services:

http://try.victorops.com/sreweekly/sre-golden-signals

Articles

This article is a condensed version of a talk, but it stands firmly on its own. Their Production-Grade Infrastructure Checklist is well worth a read.

Yevgeniy Brikman — Gruntwork

More and more, the reliability of our infrastructure is moving into the realm of life-critical.

Thanks to Richard Cook for this one.

Linda Comins — The Intelligencer

Detailed notes on lots of talks from SRECon, with a great sum-up at the top discussing the major themes of the conference.

Max Timchenko

Drawing from an @mipsytipsy Twitter thread from back in February, this article is a great analysis of why it’s right to put developers on call and how to make it humane. I especially like the part about paying extra for on-call, a practice I’ve been hearing more mentions of recently.

John Barton

Really? Never? I could have sworn I remembered reading about power outages…

Yevgeniy Sverdlik — DataCenter Knowledge

Lots of good stuff in this one about preventing mistakes and analyzing failures.

Rachel Bryan — Swansea University

Outages

Updated: December 2, 2018 — 8:41 pm
A production of Tinker Tinker Tinker, LLC