SRE Weekly Issue #368

A message from our sponsor, Rootly:

Manage incidents directly from Slack with Rootly 🚒.

Rootly automates manual tasks like creating an incident channel, Jira ticket, and Zoom room, inviting responders, and creating statuspage updates, postmortem timelines, and more. Want to see why companies like Canva and Grammarly love us?


This article uses a simulation to demonstrate the power of shuffle sharding to limit the blast radius of overload conditions.

  Eugene Retunsky — DZone
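
For readers new to the idea, here is a minimal sketch of shuffle sharding in Python. It is not taken from the linked article. The function names (`shard_for`, `fully_affected`), the worker and customer labels, and the hash-seeded selection are all assumptions made for illustration. Each customer is deterministically assigned a small subset of workers, so a customer that overloads its own shard only fully takes down the customers who happen to share that exact subset.

```python
import hashlib
import random


def shard_for(customer_id: str, workers: list[str], shard_size: int) -> frozenset[str]:
    """Deterministically pick `shard_size` workers for a customer (hypothetical helper)."""
    # Seed a private RNG from a stable hash so a customer always maps to the same shard.
    digest = hashlib.sha256(customer_id.encode()).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return frozenset(rng.sample(workers, shard_size))


def fully_affected(noisy: str, customers: list[str], workers: list[str], shard_size: int) -> list[str]:
    """Return the other customers whose whole shard lies inside the noisy customer's shard."""
    bad = shard_for(noisy, workers, shard_size)
    return [
        c for c in customers
        if c != noisy and shard_for(c, workers, shard_size) <= bad
    ]


if __name__ == "__main__":
    workers = [f"worker-{i}" for i in range(8)]
    customers = [f"customer-{i}" for i in range(100)]
    hit = fully_affected("customer-0", customers, workers, shard_size=2)
    print(f"{len(hit)} of {len(customers) - 1} other customers lose every worker")
```

With 8 workers and a shard size of 2 there are 28 possible shards, so only a small fraction of customers are expected to share the overloaded pair. The rest keep at least one healthy worker, provided clients retry across their shard.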

A comprehensive look at stress testing, how it differs from load testing, how to implement it, and how to analyze the results.


Retries and high availability are great, but for critical dependencies, we can go a step further and define an alternative in case a dependency is down.

  Leart Gjoni — DoorDash

From the archives, here’s an incident report from a major outage at DoorDash in 2022.

  Ryan Sokol — DoorDash

Amazon’s old internal “retrospective” process sounds painful and scary. Fortunately the author took the good parts and learned some valuable lessons from the rest.

  Lee Atchison — Container Journal

Instead of asking PMs to “speak SRE,” span the communication gap by using the common language of user stories to build business-cogent SLOs.

  Kit Merker —

Amazon gives its own service offerings like RDS an advantage by making the (normally pricey) cross-availability-zone data transfer free.

  Corey Quinn — Last Week In AWS

It’s easy to think of reasons to run a retrospective on an incident. What about the reverse? Which incidents should we skip over?

  Lex Neva — The New Stack
  Full disclosure: Honeycomb, my employer, is mentioned.

Updated: April 16, 2023 — 8:28 pm
A production of Tinker Tinker Tinker, LLC