A warm welcome to my new sponsor, FireHydrant!
This article gives an example checklist for a database version upgrade in RDS and explains why checklists can be so useful for changes like this.
Nick Janetakis
This article distinguishes between responding at all and responding correctly, and notes that different techniques solve for availability versus reliability.
incident.io
Latency and throughput are inextricably linked in TCP, and this article explains why with a primer on congestion windows and handshakes.
Roberto Vitillo
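As a rough back-of-the-envelope illustration (my own sketch, not taken from the linked article): a sender can have at most one window of data in flight per round trip, so throughput is bounded by window size divided by round-trip time. The function name and the example numbers below are made up for illustration.

```python
def max_throughput_bps(window_bytes: int, rtt_seconds: float) -> float:
    """Upper bound on throughput in bits/s: one window of data per round trip."""
    if rtt_seconds <= 0:
        raise ValueError("rtt_seconds must be positive")
    return window_bytes * 8 / rtt_seconds


# Same 64 KiB window, two different round-trip times (illustrative values).
window = 64 * 1024
fast = max_throughput_bps(window, 0.010)  # 10 ms RTT
slow = max_throughput_bps(window, 0.100)  # 100 ms RTT

print(f"{fast / 1e6:.1f} Mbit/s at 10 ms RTT")
print(f"{slow / 1e6:.1f} Mbit/s at 100 ms RTT")
```

With the window held constant, ten times the latency means one tenth of the achievable throughput, which is why the two can't be reasoned about separately.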
Tail latency has a huge impact on throughput and on the overall user experience. Measuring average latency just won’t cut it.
Roberto Vitillo
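To see why an average can hide the tail, here is a minimal sketch using invented latency samples (the data and the `percentile` helper are assumptions for illustration, not from the linked article). It compares the mean against the median and the 99th percentile using the nearest-rank method.

```python
from statistics import mean


def percentile(samples, p):
    """Nearest-rank percentile for an integer p in 1..100."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    # Integer ceiling of p * n / 100, avoiding floating-point rounding.
    rank = max(1, -(-p * len(ordered) // 100))
    return ordered[rank - 1]


# Hypothetical data: 98 requests at 20 ms and 2 requests at 2000 ms.
latencies_ms = [20] * 98 + [2000] * 2

avg = mean(latencies_ms)
p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)

print(f"mean={avg} ms, p50={p50} ms, p99={p99} ms")
```

In this made-up data set the mean is about 60 ms, which describes neither the typical request (20 ms) nor the slow ones (2000 ms); only the high percentile exposes what the unlucky users actually see.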
Is it really wrong though? Is it?
Adam Gordon Bell — Earthly
I’ve shared the FAA’s infographic of the Dirty Dozen here previously, but here’s a more in-depth look at the first six items.
Dr. Omar Memon — Simple Flying
It’s often necessary to go through far more than five whys to understand what’s really going on in a sociotechnical system.
rachelbythebay
I found the bit about the AWS Incident/Communication Manager on-call role pretty interesting.
Prathamesh Sonpatki — SRE Stories