SRE Weekly Issue #66

Articles

Sockets in a Bind: Troubleshooting Port Exhaustion in Heroku’s Routing Layer

I hope you’ll enjoy reading this debug session as much as I enjoyed co-writing it. My former co-worker and I did some serious digging to get to the bottom of an unexpected EADDRINUSE that caused a production incident.

Full disclosure: Heroku is my employer.

Redundancy does not imply fault tolerance: analysis of distributed storage reactions to single errors and corruptions

Distributed filesystems provide high availability by duplicating data. In this research paper, the researchers created errfs, a FUSE plugin that passes through a backing filesystem but introduces single-bit errors. Result: almost all major distributed filesystems missed the error, resulting in corruption.

How we implement Disaster Recovery and High Availability with Postgres on Citus Cloud

The part I like most about this article is the emphasis on the difference between DR and HA.

Full disclosure: Heroku, my employer, is mentioned.

Breaking Things on Purpose

The S3 outage a month ago is a great reminder that chaos experiments are useful not just for taking down parts of our own infrastructure, but also for simulating the failure of external dependencies.

HumanOps: It’s time to make DevOps personal

There are several core HumanOps principles, but the most important one to remember is that human health impacts business health.

It’s about time that we recognised that engineers are humans who get stressed and need downtime and that there are strong business as well as social reasons why these needs should be met.

Now Available – Videos from SREcon17

Impressively quickly, USENIX has posted the videos from SREcon17 Americas! I’ve linked to a post by Woodland Hunter, whose review of SREcon I featured here two weeks ago, with links to the talks he reviewed and more.

April Foolishness

The first article is published by my sponsor, VictorOps, but their sponsorship did not influence its inclusion in this issue.

How Incident Management Boosts Employee Morale

PagerDuty theorizes that if developers don’t trust the incident response process, they’ll fear outages and thus be less productive. Proper incident management eases that fear so that they feel safer deploying code.

Reducing Alert Noise: Going from 1000 Alerts to 10 Alerts Overnight

This article could be titled, “Use these three wacky tricks to reduce your pages by 100x!” In all seriousness, the methods described are aggregation (group related alerts), routing (sort alerts by team), and classification (page-worthy alerts versus warnings).

This article is published by my sponsor, VictorOps, but their sponsorship did not influence its inclusion in this issue.
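
To make the three methods named above concrete, here is a minimal Python sketch of aggregation, routing, and classification applied to a batch of alerts. It is not taken from the article: the `Alert` fields, the `triage` function, and the two severity levels are all hypothetical names chosen for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    # All fields are illustrative assumptions, not a real alerting schema.
    service: str
    check: str
    team: str
    severity: str  # "critical" or "warning"


def triage(alerts):
    """Return (pages, warnings), each a dict mapping team -> list of summaries."""
    # Aggregation: collapse alerts for the same team/service/check into one group.
    groups = defaultdict(list)
    for alert in alerts:
        groups[(alert.team, alert.service, alert.check)].append(alert)

    pages = defaultdict(list)
    warnings = defaultdict(list)
    for (team, service, check), members in groups.items():
        summary = f"{service}/{check} x{len(members)}"
        # Classification: a group pages someone only if it contains a critical alert.
        is_page = any(m.severity == "critical" for m in members)
        # Routing: each summary goes only to the owning team's queue.
        (pages if is_page else warnings)[team].append(summary)
    return dict(pages), dict(warnings)
```

With this shape, fifty identical disk warnings from one service become a single non-paging line for the team that owns it.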

Slow down your internet with tc

A nice primer on using tc to induce latency, which is really important when testing the resiliency of systems to network instability. Thanks, Julia!
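
For a rough idea of what inducing latency with tc looks like, here is a small Python sketch that builds the `tc` command lines for adding and removing a netem delay. The helper names and the `eth0` interface are assumptions for illustration, and the linked primer may use different options. The sketch only builds argument lists; running them requires root on a Linux host.

```python
def netem_delay_command(interface, delay_ms, jitter_ms=0):
    """Build the argv for `tc qdisc add dev <interface> root netem delay ...`."""
    if delay_ms < 0 or jitter_ms < 0:
        raise ValueError("delay and jitter must be non-negative")
    argv = ["tc", "qdisc", "add", "dev", interface, "root",
            "netem", "delay", f"{delay_ms}ms"]
    if jitter_ms:
        # An optional second value after the delay is the jitter.
        argv.append(f"{jitter_ms}ms")
    return argv


def netem_clear_command(interface):
    """Build the argv that removes the root qdisc, undoing the delay."""
    return ["tc", "qdisc", "del", "dev", interface, "root"]


if __name__ == "__main__":
    # Print the commands rather than running them; pass the lists to
    # subprocess.run(..., check=True) as root to actually apply them.
    print(" ".join(netem_delay_command("eth0", 100, 20)))
    print(" ".join(netem_clear_command("eth0")))
```

Always pair the add with the matching delete, since the delay stays on the interface until the qdisc is removed.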

Risk Tolerance of Services

Here’s the second half of Stephen Thorne’s commentary on “Embracing Risk”, the third chapter in Google’s SRE book.

Scaling Incident Management

As your company grows in infrastructure size, number of employees, load, and other areas, how do you change your incident response to cope?

Outages

Azure status history
While following up on an outage from a couple of weeks ago, I came upon this archive of Azure incidents, several with detailed postmortems. It’s a goldmine of interesting RCAs, but I wish they’d give each its own page for easy linking.
