SRE Weekly Issue #394

A warm welcome to my new sponsor, FireHydrant!

A message from our sponsor, FireHydrant:

The 2023 DORA report has two conclusions with big impacts on incident management: incremental steps matter, and good culture contributes to performance. Dig into both topics and explore ideas for how to start making incremental improvements of your own.

This article gives an example checklist for a database version upgrade in RDS and explains why checklists can be so useful for changes like this.

  Nick Janetakis

The distinction in this article is between responding at all and responding correctly. Different techniques solve for availability vs reliability.

Latency and throughput are inextricably linked in TCP, and this article explains why with a primer on congestion windows and handshakes.

  Roberto Vitillo
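
For a rough feel of why the two are linked, here is a minimal sketch (mine, not taken from the article). It assumes the simplified model in which a sender can have at most one window of data in flight per round trip, so throughput is capped at window size divided by round-trip time. The function name and the example numbers are made up for illustration.

```python
def max_throughput_bps(window_bytes: int, rtt_seconds: float) -> float:
    """Upper bound on throughput in bits per second under the simplified
    model: at most one full window can be delivered per round trip."""
    return window_bytes * 8 / rtt_seconds


# Hypothetical numbers: a 64 KiB window over a 100 ms path and a 10 ms path.
window = 64 * 1024
slow_path = max_throughput_bps(window, 0.100)
fast_path = max_throughput_bps(window, 0.010)

print(f"100 ms RTT: {slow_path / 1e6:.2f} Mbit/s")
print(f" 10 ms RTT: {fast_path / 1e6:.2f} Mbit/s")
```

With the same window, cutting the round-trip time by a factor of ten raises the ceiling by a factor of ten, regardless of how much bandwidth the link has.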

Tail latency has a huge impact on throughput and on the overall user experience. Measuring average latency just won’t cut it.

  Roberto Vitillo
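
To see how an average can hide the tail, here is a small sketch of my own with invented numbers (not from the article): 98 requests that take 10 ms and 2 that take a full second. The `percentile` helper is a hypothetical nearest-rank implementation written for this example.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Invented data: most requests are fast, a few are very slow.
latencies_ms = [10] * 98 + [1000] * 2

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p99_ms = percentile(latencies_ms, 99)

print(f"mean={mean_ms:.1f} ms  p50={p50_ms} ms  p99={p99_ms} ms")
```

The mean comes out just under 30 ms, which describes neither the typical request (10 ms) nor the slow ones (1000 ms). The 99th percentile surfaces the slow requests directly.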

Is it really wrong though? Is it?

  Adam Gordon Bell — Earthly

I’ve shared the FAA’s infographic of the Dirty Dozen here previously, but here’s a more in-depth look at the first six items.

  Dr. Omar Memon — Simple Flying

It’s often necessary to go through far more than five whys to understand what’s really going on in a sociotechnical system.


I found the bit about the AWS Incident/Communication Manager on-call role pretty interesting.

  Prathamesh Sonpatki — SRE Stories

Updated: October 15, 2023 — 10:56 am
A production of Tinker Tinker Tinker, LLC