SRE Weekly Issue #256

A message from our sponsor, StackHawk:

Register now for the first-ever ZAPCon taking place March 9th. The free event will focus on OWASP ZAP and application security best practices. You won't want to miss it!
http://sthwk.com/zapcon-sre-weekly

Articles

Here’s a blog post from Slack giving even more information about what went wrong on January 4. Bravo, Slack, there’s a lot in here for us to learn from.

Laura Nolan — Slack

This academic paper from Facebook explains how they release code without disrupting active connections for even a small number of users.

Usama Naseer, Luca Niccolini, Udip Pant, Alan Frindell, Ranjeeth Dasineni, and Theophilus A. Benson — Facebook

Another lesson we can learn from aviation: keep a single place where engineers can find out about important temporary infrastructure changes.

Bill Duncan

Coinbase posted this detailed analysis of their January 29th incident.

Coinbase

Interesting thesis: a company moving into the cloud is in a unique position to adopt SRE practices — and better situated than cloud-first companies.

Tina Huang (CTO, Transposit) — Forbes

We need to push past surface-level mitigation of an incident and really dig in and learn.

Darrell Pappa — Blameless

GitHub’s database failed in a manner that wasn’t detected by their automated failover system.

Keith Ballinger — GitHub

LinkedIn published their SRE training documentation in the form of a full curriculum covering a range of topics.

Akbar KM and Kalyanasundaram Somasundaram — LinkedIn

Your code may be designed to handle 64-bit integers, but what if a library (such as a JSON decoder) converts them to floating point numbers?

rachelbythebay
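
The integer-to-float hazard described in the item above can be reproduced in a few lines. This is a minimal sketch, not taken from the linked post: it uses Python's standard json module and passes parse_int=float to imitate a decoder that maps every JSON number to a 64-bit float, which is what JavaScript's JSON.parse does. The field name "id" and the chosen value are assumptions for illustration only.

```python
import json

# 2**53 + 1 is the smallest positive integer that a 64-bit float
# cannot represent exactly. The "id" field is a hypothetical example.
big_id = 2**53 + 1  # 9007199254740993
payload = json.dumps({"id": big_id})

# Python's json module keeps integers exact by default.
exact = json.loads(payload)["id"]

# Imitate a decoder that turns every number into a double by
# overriding how integer literals are parsed.
lossy = json.loads(payload, parse_int=float)["id"]

print(exact)       # 9007199254740993
print(int(lossy))  # 9007199254740992 (silently off by one)
```

A common workaround, where the API allows it, is to send 64-bit identifiers as JSON strings so that no decoder is tempted to treat them as numbers.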

Outages

Updated: February 7, 2021 — 8:57 pm
A production of Tinker Tinker Tinker, LLC