This week’s themes seem to be human error and network debugging. If you’re like me, you rarely have time to sit down and listen to podcasts, but if you ever get in the mood, this first link is a must-listen. I really can’t do it justice with my summary, but I’m very glad I listened to it, and I think you’ll like it too.
Articles
I couldn’t see anything but the largest configuration because all I could see was places where there was a risk. There were corners I wasn’t willing to cut (not bad corners like risking availability but more like “use a smaller instance here”) because I could see and feel and taste the pain that would come from having to grow the environment under duress.
Outages
- PagerDuty
  - It’s especially interesting when PagerDuty goes down, because it might impact the reliability of many companies.
- SendGrid
- me&you mobile (South Africa)
- Bureau of Water and Light (Lansing, MI, USA)
  - Ransomware.
- HipChat
  - Here’s another speedy and detailed postmortem from Atlassian. Nice work, folks.
- Large Hadron Collider
  - Root cause: weasel.
- Neotel (South Africa ISP)