SRE Weekly Issue #67

Articles

Risky Business Requires Active Operators

This article is about the risks of automation. While automation can reduce risk by making errors less likely, it also disengages human operators from what’s actually happening, meaning that they’re less likely to catch and correct problems.

Things I Learned Managing Site Reliability for Some of the World’s Busiest Gambling Sites

The author spent seven months sifting through, categorizing, and documenting over 1700 production incidents. The result was impressive: a massive improvement in the SRE team’s incident response process and documentation. It’s got me wondering if we can do something similar at $JOB.

Thanks to Steven Farlie for this one.

Fault Tolerance in a High Volume, Distributed System

A danger of a microservice architecture is that one failing service can affect those that depend on it, even indirectly. The Netflix API handles over 10000 requests per second, and it was carefully designed to avoid the case where a slow dependency breaks unrelated requests.

Without taking steps to ensure fault tolerance, 30 dependencies each with 99.99% uptime would result in 2+ hours downtime/month (99.99%^30 ≈ 99.7% uptime = 2+ hours in a month).
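
The arithmetic in that quote is easy to check. Here is a minimal Python sketch, assuming dependency failures are independent and a 30-day month. The function name and constant are mine for illustration, not from the Netflix post.

```python
def compound_availability(per_dependency: float, count: int) -> float:
    """Availability of a request that needs all `count` dependencies up.

    Assumes failures are independent, so the availabilities multiply.
    """
    return per_dependency ** count


HOURS_PER_MONTH = 30 * 24  # assumption: a 30-day month

availability = compound_availability(0.9999, 30)
downtime_hours = (1 - availability) * HOURS_PER_MONTH

print(f"{availability:.2%} uptime, {downtime_hours:.2f} hours down per month")
```

This works out to roughly 99.7% uptime and a little over two hours of downtime per month, matching the quoted figure. The key point is that the availabilities multiply (an exponent, not a product with 30).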

Human Factors at The Fringe: Nuclear Family

Nuclear Family is an interactive play in which the audience is presented with critical decisions as the characters move inexorably toward a nuclear plant disaster. The goal is to demonstrate local rationality, the principle that people make the best decision they can with the information they have at hand, even if in retrospect that decision led to an adverse outcome.

Owning Your Code is Better

Last year, PagerDuty moved toward giving developers operational responsibility for the systems they create. The really cool thing about their transition is that they have hard stats on reduction of incidents, decrease in MTTR, and increase in changes deployed to production.

Making On-Call as Painless as Possible – PagerDuty

This post is primarily a new feature announcement, but the intro section is just awesome. I love the idea of designing a system with empathy for your future self that will be on call for it.

Graceful degradation

A short but enlightening blog post on designing systems to degrade gracefully.

when weird stuff happens, make sure it doesn’t cause harm you didn’t expect or plan for.

Outages

Razer
- Notably, this outage reset the careful customizations that people had made to their peripherals.
  Thanks to Steven Farlie for this one.
Heroku
- Heroku had a 2-day long disruption that spanned 3 status site posts.
  Full disclosure: Heroku is my employer.
DigitalOcean
- DigitalOcean accidentally deleted their primary database, resulting in a ~5-hour outage.
  
  A process performing automated testing was misconfigured using production credentials.
