Here’s a blog post from Slack giving even more information about what went wrong on January 4. Bravo, Slack, there’s a lot in here for us to learn from.
Laura Nolan — Slack
This academic paper from Facebook explains how they release code without disrupting active connections for even a small number of users.
Usama Naseer, Luca Niccolini, Udip Pant, Alan Frindell, Ranjeeth Dasineni, and Theophilus A. Benson — Facebook
Another lesson we can learn from aviation: have one place where engineers can find out about important temporary infrastructure changes.
Coinbase posted this detailed analysis of their January 29th incident.
Interesting thesis: a company moving into the cloud is in a unique position to adopt SRE practices — and better situated than cloud-first companies.
Tina Huang (CTO, Transposit) — Forbes
We need to push past surface-level mitigation of an incident and really dig in and learn.
Darrell Pappa — Blameless
GitHub’s database failed in a manner that wasn’t detected by their automated failover system.
Keith Ballinger — GitHub
LinkedIn published their SRE training documentation in the form of a full curriculum covering a range of topics.
Akbar KM and Kalyanasundaram Somasundaram — LinkedIn
Your code may be designed to handle 64-bit integers, but what if a library (such as a JSON decoder) converts them to floating point numbers?
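To make the failure mode concrete, here is a minimal Python sketch. It is not taken from the linked article. Python's own `json` module keeps integers exact, so the example passes `parse_int=float` to imitate a decoder that stores every number as a 64-bit float. The payload, the `id` field, and the `is_float_safe` helper are illustrative assumptions.

```python
import json

# 2**53 + 1 is the smallest positive integer a 64-bit float cannot represent.
payload = '{"id": 9007199254740993}'

# Python's default decoder keeps JSON integers as arbitrary-precision ints.
exact = json.loads(payload)["id"]

# Assumption for illustration: imitate a decoder that stores every number
# as a double by parsing integers with float().
lossy = json.loads(payload, parse_int=float)["id"]

def is_float_safe(n: int) -> bool:
    """Return True if n survives a round trip through a 64-bit float."""
    return int(float(n)) == n

print(exact)       # the original integer, unchanged
print(int(lossy))  # a neighboring value: the last digit has changed
print(is_float_safe(2**53), is_float_safe(2**53 + 1))
```

One common mitigation is to transmit large identifiers as JSON strings, so that no decoder along the path can silently round them.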