SRE Weekly Issue #256

Articles

Slack’s Outage on January 4th 2021

Here’s a blog post from Slack giving even more information about what went wrong on January 4. Bravo, Slack, there’s a lot in here for us to learn from.

Laura Nolan — Slack

Zero Downtime Release: Disruption-free Load Balancing of a Multi-Billion User Website

This academic paper from Facebook explains how they release code without disrupting active connections for even a small number of users.

Usama Naseer, Luca Niccolini, Udip Pant, Alan Frindell, Ranjeeth Dasineni, and Theophilus A. Benson — Facebook

NOTAM for SREs

Another lesson we can learn from aviation: have one place where engineers can find out about important temporary infrastructure changes.

Bill Duncan

Incident Post Mortem: January 29, 2021 [Coinbase]

Coinbase posted this detailed analysis of their January 29th incident.

Coinbase

Council Post: How Cloud Services Platform Teams Can Drive The Adoption Of Effective SRE Practices

Interesting thesis: a company moving into the cloud is in a unique position to adopt SRE practices — and better situated than cloud-first companies.

Tina Huang (CTO, Transposit) — Forbes

“I’m Just Doing my Job,” An SRE Myth

We need to push past surface-level mitigation of an incident and really dig in and learn.

Darrell Pappa — Blameless

GitHub Availability Report: January 2021

GitHub’s database failed in a manner that wasn’t detected by their automated failover system.

Keith Ballinger — GitHub

Open source update: School of SRE

LinkedIn published their SRE training documentation in the form of a full curriculum covering a range of topics.

Akbar KM and Kalyanasundaram Somasundaram — LinkedIn

Push some big numbers through your system and look for bugs

Your code may be designed to handle 64-bit integers, but what if a library (such as a JSON decoder) converts them to floating point numbers?

rachelbythebay
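
To make the big-numbers point concrete, here is a minimal sketch in Python. It is not taken from the linked post. Python's own json module keeps integers exact, so the sketch passes parse_int=float to imitate a decoder that turns every number into a 64-bit float. The helper names and the list of test values are assumptions chosen for illustration.

```python
import json


def roundtrip_through_double(value: int) -> int:
    """Encode an integer as JSON, then decode it the way a float-only decoder would."""
    encoded = json.dumps({"id": value})
    # parse_int=float imitates a decoder that stores every JSON number as a double.
    decoded = json.loads(encoded, parse_int=float)
    return int(decoded["id"])


def find_lossy(values):
    """Return the values that come back changed after the round trip."""
    return [v for v in values if roundtrip_through_double(v) != v]


# Hypothetical test values: one small, two around 2**53, and the largest signed 64-bit integer.
BIG_NUMBERS = [2**31, 2**53, 2**53 + 1, 2**63 - 1]

if __name__ == "__main__":
    for v in find_lossy(BIG_NUMBERS):
        print(f"{v} came back as {roundtrip_through_double(v)}")
```

A double has a 53-bit significand, so integers above 2**53 are not all representable. Pushing values of that size through every hop in a system is one way to find where such a conversion happens.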
