This article uses a simulation to demonstrate the power of shuffle sharding to limit the blast radius of overload conditions.
Eugene Retunsky — DZone
A comprehensive look at stress testing, how it differs from load testing, how to implement it, and how to analyze the results.
Retries and high availability are great, but for critical dependencies, we can go a step further and define an alternative in case a dependency is down.
Leart Gjoni — DoorDash
From the archives, here’s an incident report from a major outage at DoorDash in 2022.
Ryan Sokol — DoorDash
Amazon’s old internal “retrospective” process sounds painful and scary. Fortunately, the author took the good parts and learned some valuable lessons from the rest.
Lee Atchison — Container Journal
Instead of asking PMs to “speak SRE,” span the communication gap by using the common language of user stories to build business-cogent SLOs.
Kit Merker — DevOps.com
Amazon gives its own service offerings, like RDS, an advantage by making the (normally pricey) cross-availability-zone data transfer free.
Corey Quinn — Last Week In AWS
It’s easy to think of reasons to run a retrospective on an incident. What about the reverse? Which incidents should we skip over?
Lex Neva — The New Stack
Full disclosure: Honeycomb, my employer, is mentioned.