SRE Weekly Issue #21

This week’s themes seem to be human error and network debugging. If you’re like me, you rarely have time to sit down and listen to podcasts, but if you ever get in the mood, this first link is a must-listen. I really can’t do it justice with my summary, but I’m very glad I listened to it, and I think you’ll like it too.

Articles

A Discussion on Human Error | PreAccident Investigation Podcast

We can try to train our workers to avoid error. We can design our systems to make errors less likely. This podcast argues that we should go one step further and design our systems to be resilient in the face of inevitable error. Human error is normal and expected. Where are we one error away from a serious adverse event?

Steven Shorrock: "Life After Human Error" – Velocity Europe 2014

In this Velocity keynote, Steven Shorrock discusses human error from his point of view as an ergonomist and psychologist.

Tale of the Missing ACK

My old coworker (and network wizard) at Linden Lab wrote up this fascinating episode of network debugging. Sometimes you have to get really deep into the stack to track down reliability issues.

The Discovery of Apache Zookeeper’s Poison Packet

While we’re on the topic of debugging complicated networking failures, here’s PagerDuty’s analysis of a bug in Zookeeper. It turned out that triggering this bug involved the confluence of three other bugs that conspired to deliver a malformed packet to Zookeeper, which caused it to blow up. Yeesh.

Sysdig | How we found a bug in Amazon ELB

If you’re in the mood to read one more really deep and detailed network debugging session, this one’s for you. It goes through the process of gathering enough information to confidently implicate ELB as the source of abrupt connection closures.

The Flaw In All Things – blog dot lusis

John Vincent, featured here last week for his review of the new SRE book, writes this week about the burnout he’s suffering. I think it could best be described as operational risk burnout. I’m not sure what the solution is, but I’m really interested in the problem, and I hope that John considers writing more if he has any useful realizations. Good luck, John.

“I couldn’t see anything but the largest configuration because all I could see was places where there was a risk. There were corners I wasn’t willing to cut (not bad corners like risking availability but more like ‘use a smaller instance here’) because I could see and feel and taste the pain that would come from having to grow the environment under duress.”

An Inside Look at How The Ops Team Collaborates

How do you collaborate remotely during an incident? Some companies use conference bridges, but my former boss (and all-around incredible engineer and manager) Landon McDowell advocates for text-based chat. I started my career as part of the Ops team he describes, so I might be biased, but I totally agree: chat is far superior to phone bridges or VoIP.

Load balancing or balancing on the edge of a cliff?

This article starts out as a basic introduction to load balancing, but where it goes next is really interesting. The author discusses how load balancing can go wrong (think cascading failure, as each remaining backend receives more and more traffic) and how to combat the pitfalls. Finally, the author suggests two very intriguing concepts for smart load-balancing systems that really got me thinking.
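
To make the cascading-failure scenario concrete, here is a minimal sketch of my own (it is not taken from the article). It assumes traffic is split evenly across the surviving backends and that any backend pushed past its capacity simply falls over; the function name `cascade` and all of the numbers are hypothetical.

```python
def cascade(total_load, backends, capacity, initial_failures=1):
    """Simulate evenly balanced load after some backends fail.

    Returns (steps, survived), where steps is a list of
    (backends_alive, load_per_backend) tuples and survived says
    whether the remaining pool was able to absorb the traffic.
    """
    alive = backends - initial_failures
    steps = []
    while alive > 0:
        per_backend = total_load / alive
        steps.append((alive, per_backend))
        if per_backend <= capacity:
            return steps, True   # survivors can carry the load
        alive -= 1               # one more overloaded backend drops out
    return steps, False          # every backend has failed


if __name__ == "__main__":
    # Ten backends that can each handle 100 requests/s.
    for load in (900, 950):
        steps, survived = cascade(load, backends=10, capacity=100)
        print(f"load={load}: survived={survived}")
        for alive, per_backend in steps:
            print(f"  {alive} alive -> {per_backend:.1f} req/s each")
```

With 900 requests/s the pool survives losing one backend, because the other nine land exactly at capacity. At 950 requests/s the same single failure pushes every survivor over the limit, and under these simplified assumptions the whole pool goes down one backend at a time.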

Outages

PagerDuty
- It’s especially interesting when PagerDuty goes down, because it might impact the reliability of many companies.
SendGrid
me&you mobile (South Africa)
Bureau of Water and Light (Lansing, MI, USA)
- Ransomware.
HipChat
- Here’s another speedy and detailed postmortem from Atlassian. Nice work, folks.
Large Hadron Collider
- Root cause: weasel.
Neotel (South Africa ISP)
