SRE Weekly Issue #117

Articles

No, seriously. Root Cause is a Fallacy.

Brilliant, just brilliant. This isn’t just another “there isn’t just one root cause” article to skip over. The author takes time to explain the concept with cogent examples and useful metaphors. This one really caught my eye:

What’s the root cause of success?
[…] When building a successful project, there’s never just one thing that goes right for it to succeed.

Will Gallego

Incident Management – Food Fight Podcast

This episode of Food Fight is an hour-long interview with guests Rob Schnepp, Ron Vidal, and Chris Hawley, the 3 firefighters behind Blackrock 3 Partners. It’s a great intro to the Incident Management System, and well worth a listen.

Shout-out to Maple Player, an Android audio player with a really high-quality tempo increase feature. I was able to listen at 1.5x speed and still understand everything; otherwise, I wouldn’t have had time this week.

Nell Shamrell-Harrington and Nathen Harvey

Billing Incident Post-Mortem

Here’s one from the archives, an incident report from 2013. After a temporary network partition in a Redis cluster, the replicas all tried to resynchronize at once, overloading the master. One result was that some customers were charged repeatedly for the same thing.

Twilio

It’s about what broke, not who broke it

You have to design a system such that the natural thing to do yields a good result and doesn’t put anyone in harm’s way.

Rachel Kroll

Consistent Hashing: Algorithmic Tradeoffs

I thought consistent hashing was largely solved. I was wrong! There are some good solutions out there, but you have to evaluate their relative trade-offs and pick the right one for your use case.

Damian Gryski

Full disclosure: Damian Gryski is my coworker at Fastly.
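
For readers who haven’t met the idea before, here is a minimal sketch of one well-known approach, a hash ring with virtual nodes. It is not taken from the article and isn’t meant to represent the author’s code; the class name `HashRing`, the `vnodes` parameter, and the choice of SHA-256 are all my own assumptions for illustration.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    # Map a string to a 64-bit integer position on the ring.
    # SHA-256 is an arbitrary choice here; any well-mixed hash would do.
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class HashRing:
    """Hypothetical ring-hash sketch: each node owns many points on a ring."""

    def __init__(self, nodes, vnodes=100):
        # Place `vnodes` points per node on the ring, sorted by position.
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def lookup(self, key: str) -> str:
        # A key belongs to the first point clockwise from its hash,
        # wrapping around to the start of the ring if necessary.
        idx = bisect.bisect_right(self._points, _hash(key)) % len(self._points)
        return self._ring[idx][1]


if __name__ == "__main__":
    keys = [f"key-{i}" for i in range(1000)]
    before = HashRing(["a", "b", "c"])
    after = HashRing(["a", "b", "c", "d"])
    moved = [k for k in keys if before.lookup(k) != after.lookup(k)]
    print(f"{len(moved)} of {len(keys)} keys moved after adding node d")
```

The property worth noticing is that adding a node only moves keys onto the new node; everything else stays put. The costs are the memory for the ring and uneven load when `vnodes` is small, which is the sort of trade-off you have to weigh for your own use case.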

Computer science faces an ethics crisis. The Cambridge Analytica scandal proves it.

As you read this article, consider the ethical imperative of system reliability: in some cases, reliability can literally mean life or death, and that is only going to become more common in the coming years.

Yonatan Zunger

LogicMonitor Uses Terraform, Packer & Consul for Disaster Recovery Environments

Our service needs to be available 24/7, without question. In order to ensure this happens, the LogicMonitor TechOps team uses HashiCorp Packer, Terraform, and Consul to dynamically build infrastructure for disaster recovery (DR) in a reliable and sustainable way.

Randall Thomson — LogicMonitor

The Travis CI Blog: Incident Post-Mortem and Security Advisory: Data Exposure After travis-ci.com Outage

On Tuesday, 13 March 2018 at 12:04 UTC a database query was accidentally run against our production database which truncated all tables.

Oof. Sorry, Travis folks, but a sincere thanks for sharing your experience with us.

Konstantin Haase — Travis CI

Preliminary Analysis of the Site Reliability Engineer Survey

I like these “preliminary results” better than the kinds of aggregate statistics you normally get from a survey report. There are real quotes from free-form survey answers, including a couple of real gems. There’s a link to download the actual survey report if you’re into that, too.

Dawn Parzych — Catchpoint

Outages

Statuspage.io
Mindbody Online (fitness studio booking service vendor)
Sling TV
Tinder
- The outage seemingly stemmed from privacy fixes Facebook put in place, resulting in a broken OAuth flow.
Microsoft Office 365
Twitter
Multiple Indian Government Websites
Grab
YouTube
