SRE Weekly Issue #461

A message from our sponsor, incident.io:

Effective incident management demands coordination and collaboration to minimize disruptions. This guide by incident.io covers the full incident lifecycle—from preparation to improvement—emphasizing teamwork beyond engineering. By engineers, for engineers.

https://incident.io/guide

Written in 2020 after an AWS outage, this article analyzes dependence on third-party services and the responsibility to understand their reliability.

  Uwe Friedrichsen
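Here is a rough illustration of why understanding third-party reliability matters. It is not taken from the article. Suppose a service needs every one of its dependencies, and their failures are assumed independent. Then the service's availability can be no better than the product of theirs. For example, three dependencies at 99.9% each cap the whole chain at about 99.7%. `serial_availability` is a hypothetical helper.

```python
from math import prod
from typing import Iterable


def serial_availability(availabilities: Iterable[float]) -> float:
    """Upper bound on availability when every dependency is required.

    Assumes independent failures (a simplification: a shared cloud region
    outage breaks that assumption) and no retries or fallbacks.
    """
    return prod(availabilities)


combined = serial_availability([0.999, 0.999, 0.999])
print(f"{combined:.4%}")  # roughly 99.70%
```

Correlated failures, like a single provider outage taking down several "independent" dependencies at once, can make the real picture worse than this simple product suggests.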

When a cache expired, these folks found that their application stampeded the database with expensive queries, so they searched for a solution.

  Punit Sethi
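One common mitigation for this kind of cache stampede is request coalescing. When an entry expires, only one caller recomputes it, while concurrent callers for the same key wait and reuse the result. The Python sketch below illustrates that idea. It is not necessarily the solution these folks chose, and `CoalescingCache` and its `loader` callback are hypothetical names.

```python
import threading
import time
from typing import Any, Callable, Dict, Tuple


class CoalescingCache:
    """TTL cache where concurrent misses for the same key share one load."""

    def __init__(self, ttl_seconds: float) -> None:
        self._ttl = ttl_seconds
        self._data: Dict[str, Tuple[float, Any]] = {}  # key -> (expires_at, value)
        self._locks: Dict[str, threading.Lock] = {}    # one lock per key
        self._locks_guard = threading.Lock()           # protects _locks itself

    def _lock_for(self, key: str) -> threading.Lock:
        # setdefault under a guard ensures all callers get the same lock object.
        with self._locks_guard:
            return self._locks.setdefault(key, threading.Lock())

    def _fresh(self, key: str) -> Tuple[bool, Any]:
        entry = self._data.get(key)
        if entry is not None and entry[0] > time.monotonic():
            return True, entry[1]
        return False, None

    def get(self, key: str, loader: Callable[[], Any]) -> Any:
        hit, value = self._fresh(key)
        if hit:
            return value
        with self._lock_for(key):
            # Re-check: another thread may have refilled the entry while we waited.
            hit, value = self._fresh(key)
            if hit:
                return value
            value = loader()  # the expensive query runs once per expiry
            self._data[key] = (time.monotonic() + self._ttl, value)
            return value
```

Callers that arrive while the first load is in flight block on the per-key lock instead of issuing their own query. This trades a little latency for those callers in exchange for a single database hit per expiry.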

When a high-severity incident happens, its associated risks become salient: the incident looms large in our mind, and the fact that it just happened leads us to believe that the risk of a similar incident is very high.

  Lorin Hochstein

These folks landed on a hybrid approach using two vendors, allowing them to avoid sending their entire trace volume to an expensive observability vendor.

  Jakub Sokół — monday
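The summary doesn't say exactly how traces are split between the two vendors, so the following rests on an assumption. One way to run such a setup is to send every trace to a low-cost backend and deterministically forward a small slice, selected by hashing the trace ID, to the expensive one. Because the decision depends only on the trace ID, every service makes the same choice, and sampled traces arrive complete. The backend names and the 5% rate are hypothetical.

```python
import hashlib
from typing import List


def premium_sampled(trace_id: str, rate: float) -> bool:
    """Deterministically select roughly `rate` of all traces by trace ID."""
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform integer in [0, 2**64)
    return bucket < int(rate * 2**64)


def route(trace_id: str, premium_rate: float = 0.05) -> List[str]:
    # Hypothetical destinations: everything goes to the cheap store,
    # and only a small, consistent slice also goes to the premium vendor.
    destinations = ["low_cost_backend"]
    if premium_sampled(trace_id, premium_rate):
        destinations.append("premium_vendor")
    return destinations
```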

Under heavy load, requests are handled in LIFO (Last In, First Out) order to maximize the chance of successfully completing fresh requests.

  Teiva Harsanyi
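Here is a minimal sketch of the idea. It assumes an adaptive policy: FIFO under normal load, switching to LIFO once the queue grows past a threshold, and shedding the oldest request when the queue is full. `AdaptiveLifoQueue`, `capacity`, and `lifo_threshold` are hypothetical names, and the article's exact policy may differ.

```python
import collections
from typing import Deque, List, Optional


class AdaptiveLifoQueue:
    """FIFO when the backlog is small, LIFO when it signals overload."""

    def __init__(self, capacity: int, lifo_threshold: int) -> None:
        self._items: Deque[str] = collections.deque()
        self._capacity = capacity
        self._lifo_threshold = lifo_threshold
        self.shed: List[str] = []  # requests dropped to make room

    def push(self, request_id: str) -> None:
        if len(self._items) >= self._capacity:
            # Full: shed the oldest request, whose client has likely given up.
            self.shed.append(self._items.popleft())
        self._items.append(request_id)

    def pop(self) -> Optional[str]:
        if not self._items:
            return None
        if len(self._items) > self._lifo_threshold:
            return self._items.pop()  # overloaded: newest first (LIFO)
        return self._items.popleft()  # normal load: oldest first (FIFO)
```

Serving the newest requests first favors clients that are most likely still waiting. The cost is that older requests may starve until the backlog drains.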

More than just a simple feature comparison, this article also presents two use cases and analyzes which tool is best in each case.

  Josson Paul Kalapparambath — DZone

These folks explain why they use Go for everything: application code, infrastructure as code, tooling, and even as a wrapper around Helm charts for Kubernetes.

  Akhilesh Krishnan — Oodle AI

Updated: January 26, 2025 — 11:15 pm
A production of Tinker Tinker Tinker, LLC