SRE Weekly Issue #185

A message from our sponsor, VictorOps:

Machine learning is already being used in many DevOps processes – driving highly efficient workflows across the entire software delivery lifecycle. See how machine learning is currently being used to improve incident management and response in production environments:

http://try.victorops.com/sreweekly/machine-learning-in-incident-management

Articles

This is a tough read, but really enlightening.

Thanks to Courtney Eckhardt for this one.

William Langewiesche — The Atlantic

Read this to find out why it’s so hard to nail down SLOs for cloud services.

Adrian Colyer — The Morning Paper (summary)

Mogul & Wilkes (original paper)

BGP: the horrifying, ugly monster lurking at the base of the Internet.

Stilgherrian — ZDNet

A different kind of monster.

Will Oremus — Slate

When you’re thinking about best practices for improving phases of the production incident cycle, there are three SRE principles that particularly matter for this task. Keep these in mind as you think about reliability.

Myk Taylor — Google

It’s important that we remember that there’s more to incident response than the technical aspect.

George Miranda — PagerDuty

Read this Second Officer’s account of a maritime near-miss and the five lessons they took from it. My favorite:

As professionals, we always have more than one goal.

Nippin Anand — Safety Differently

Outages

Updated: September 15, 2019 — 9:06 pm
A production of Tinker Tinker Tinker, LLC