This new series seems promising! I won’t link to every article in it here, but if you’re early in your SRE career, the intro-level articles published so far are definitely worth a read.
Today, I’m thrilled to announce an ambitious project that’s been in the works for some time: “52 Weeks of SRE” – a comprehensive, year-long deep dive into the world of Site Reliability Engineering.
J. Pereira
Adevinta shifted from Kubernetes’s cluster autoscaler to AWS’s Karpenter. The change brought huge advantages that they discuss in detail, along with a few challenges and pitfalls they needed to overcome.
Tanat Lokejaroenlarb — Adevinta
An adventure in adopting open source firmware for Baseboard Management Controllers, including fixing a few bugs themselves.
Nnamdi Ajah, Ryan Chow, and Giovanni Pereira Zantedeschi — Cloudflare
[…] an overview of methods like TCP FastOpen, TLSv1.3, 0-RTT, and HTTP/3 to reduce handshake delays and improve server response times in secure environments.
Maksim Kupriianov — DZone
This article includes general tips and a specific rubric you can follow to decide when to choose a larger or smaller RDS instance type.
Prabesh
It turns out that a lot of the lessons that Mike Massimino learned as an astronaut apply very well to incident management.
Eric Silberstein — Klaviyo
Solving IP exhaustion in EKS: Avoiding a network outage by implementing custom networking
Fabián Sellés Rosa — Adevinta
By leveraging proportional–integral–derivative (PID) controllers, Robinhood can now more quickly and effectively manage load imbalances.
This was my first introduction to PID controllers. Neat! (Robinhood here is Dropbox’s in-house load balancing service.)
Yi-Shu Tai — Dropbox
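For anyone else meeting PID controllers for the first time, here is a minimal sketch of the general idea. It is not Dropbox's implementation: the `PIDController` class, the `simulate` helper, the gains, and the toy model in which a node's utilization is simply 0.8 times its routing weight are all assumptions made up for illustration. The controller looks at the gap between a target utilization and the measured one, then nudges the weight using the current error (proportional), the accumulated error (integral), and the change in error (derivative).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PIDController:
    """Textbook PID controller (hypothetical helper, not from the article)."""
    kp: float  # proportional gain
    ki: float  # integral gain
    kd: float  # derivative gain
    integral: float = 0.0
    prev_error: Optional[float] = None

    def update(self, setpoint: float, measured: float, dt: float) -> float:
        error = setpoint - measured
        # Integral term: accumulated error over time.
        self.integral += error * dt
        # Derivative term: rate of change of the error (zero on the first call).
        if self.prev_error is None:
            derivative = 0.0
        else:
            derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def simulate(steps: int = 50) -> List[float]:
    """Drive a toy node toward 50% utilization by adjusting its routing weight."""
    pid = PIDController(kp=0.5, ki=0.1, kd=0.05)  # illustrative gains only
    target = 0.5
    weight = 1.0
    history = []
    for _ in range(steps):
        # Assumed toy model: utilization is proportional to the weight.
        utilization = 0.8 * weight
        history.append(utilization)
        # Treat the controller output as an adjustment to the weight.
        weight = max(weight + pid.update(target, utilization, dt=1.0), 0.0)
    return history


if __name__ == "__main__":
    for step, utilization in enumerate(simulate(20)):
        print(f"step {step:2d}: utilization = {utilization:.3f}")
```

In this toy run the utilization starts at 0.8 and settles near the 0.5 target after some damped overshoot. A real load balancer would need tuned gains, noisy measurements, and bounds on how far a weight can move in one step.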
Through an allegory about an imaginary knob that adjusts between risk-avoidance and speed, Lorin Hochstein shows us that these trade-offs are always being made, just implicitly.
Lorin Hochstein