SRE Weekly Issue #361

I’m having some serious FOMO from having missed out on the Learning From Incidents conference. If you post or see any write-ups, please send them my way!

A message from our sponsor, Rootly:

Manage incidents directly from Slack with Rootly 🚒.

Rootly automates manual tasks like creating an incident channel, Jira ticket, and Zoom rooms, inviting responders, and creating statuspage updates, postmortem timelines, and more. Want to see why companies like Canva and Grammarly love us?


An in-depth explanation of health checking, including the importance of failing open to avoid a metastable cascading failure.

  Srinavas — eightnoteight
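
For readers unfamiliar with the term, here is a minimal sketch of one common reading of "failing open" in a load balancer. It is not taken from the linked article: the function name, the shape of the health map, and the 0.5 threshold are all assumptions made for illustration. The idea is that when too many backends report unhealthy at once, the health signal itself is suspect (or the few survivors would be crushed), so traffic goes to every backend instead.

```python
def routable_backends(health, min_healthy_fraction=0.5):
    """Return the backends that should receive traffic.

    `health` maps a backend name to its latest health-check result.
    `min_healthy_fraction` is a hypothetical tuning knob, not a standard value.
    """
    if not health:
        return []

    healthy = [name for name, ok in health.items() if ok]

    # Fail open: if too few backends look healthy, ignore the health checks
    # and route to all of them rather than funnel all load onto a handful.
    if len(healthy) / len(health) < min_healthy_fraction:
        return list(health)

    return healthy
```

With one backend out of three failing, only the two healthy ones are returned. With three out of four failing, all four are returned, which keeps load spread out while the health signal is investigated.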

SQS (Amazon’s Simple Queue Service) is hugely scalable, but you must design your system with its limitations and behaviors in mind.

   Satrajit Basu — DZone

What if your SSO provider is down? This article describes a scheme for falling back to HTTP Basic Authentication in an emergency.

  Chris Siebenmann

Etsy scaled their database by transitioning to a sharding strategy using Vitess. The journey was long and involved some tricky gotchas, as explained in this 3-part series.

  River Rainne and Amy Ciavolino — Etsy

An in-depth explanation of consistent hashing with a special focus on building a case for why other sharding mechanisms fall short.

  Nk — High Scalability
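
As a rough illustration of the technique (a sketch under stated assumptions, not the article's code), a consistent hash ring with virtual nodes might look like this. The class name `HashRing`, the choice of SHA-256, and the default of 100 virtual nodes per server are all illustrative choices.

```python
import bisect
import hashlib


def _point(key: str) -> int:
    # Stable 64-bit position on the ring. Python's built-in hash() is
    # randomized per process, so a cryptographic digest is used instead.
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class HashRing:
    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes
        self._ring = []  # sorted list of (position, node) pairs
        for node in nodes:
            self.add(node)

    def add(self, node):
        # Each node owns many points so that load spreads more evenly.
        for i in range(self.vnodes):
            bisect.insort(self._ring, (_point(f"{node}#{i}"), node))

    def remove(self, node):
        self._ring = [(p, n) for p, n in self._ring if n != node]

    def lookup(self, key):
        if not self._ring:
            raise LookupError("ring is empty")
        # First point clockwise from the key's position, wrapping around.
        idx = bisect.bisect_right(self._ring, (_point(key),))
        return self._ring[idx % len(self._ring)][1]
```

The property that matters for sharding is that adding a node only moves keys onto the new node, and every other key stays where it was. With a plain `hash(key) % n` scheme, changing `n` reshuffles most keys.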

LinkedIn chronicles their recent improvements to HODOR (Holistic Overload Detection and Overload Remediation), including new kinds of overload detectors.

  Abhishek Gilra, Nizar Mankulangara, Salil Kanitkar, and Vivek Deshpande — LinkedIn

An airline that gave monetary rewards for early arrivals and a steep cockpit authority gradient were just two of the factors that contributed to this crash.

  Admiral Cloudberg

Updated: February 26, 2023 — 9:51 pm
A production of Tinker Tinker Tinker, LLC