SRE Weekly Issue #185

Articles

What Really Happened to Malaysia’s Missing Airplane

This is a tough read, but really enlightening.

Thanks to Courtney Eckhardt for this one.

William Langewiesche — The Atlantic

Nines are not enough: meaningful metrics for clouds

Read this to find out why it’s so hard to nail down SLOs for cloud services.

Adrian Colyer — The Morning Paper (summary)

Mogul & Wilkes (original paper)

‘Screaming car wreck’ of internet routing needs a fire brigade

BGP: the horrifying, ugly monster lurking at the base of the Internet.

Stilgherrian — ZDNet

The Global Internet Is Being Attacked by Sharks, Google Confirms

A different kind of monster.

Will Oremus — Slate

Shrinking the impact of production incidents using SRE principles

When you’re thinking about best practices for improving phases of the production incident cycle, there are three SRE principles that particularly matter for this task. Keep these in mind as you think about reliability.

Myk Taylor — Google

Optimizing Business Response When Technical Incidents Happen

It’s important to remember that there’s more to incident response than the technical aspect.

George Miranda — PagerDuty

What I learnt from failure

Learn from this Second Officer’s account of a maritime near-miss and the five lessons they learned. My favorite:

As professionals, we always have more than one goal.

Nippin Anand — Safety Differently
