SRE Weekly Issue #423

A message from our sponsor, FireHydrant:

FireHydrant is now AI-powered for faster, smarter incidents! Power up your incidents with auto-generated real-time summaries, retrospectives, and status page updates.

This one’s full of great advice about making sure alerts are actionable, including alerting on flows that actually matter to customers.

  Nočnica Mellifera — Checkly

Here is a collection of things I learned after getting back into Magic: The Gathering over the past 10 years or so. They apply to both the MTG scene and your friendly neighborhood incident response process.

  Ross Brodbeck

It was a classic application of technical debt: they chose to focus on customer-facing features and let k8s updates slide. Here’s how they caught back up safely.

  Jeff Wolski

This article presents an interesting hypothesis, and from it draws some nifty conclusions about reasoning about failure in systems.

We cannot know for sure whether or not software is going to be incident-free. It might well be, but we can’t ever prove it.

  Niall Murphy

For teams to solve incidents quickly and effectively, responders need to be able to trust each other and stakeholders have to trust the responders. This level of trust is hard to cultivate if your organization doesn’t have a significant amount of psychological safety.

  Mandi Walls — PagerDuty

More than just an interview, this article outlines a multi-year transformation from disorganized, haphazard incident investigation to a smooth and efficient incident response process.

  Eric Silberstein — Klaviyo

In this article, you will learn how to prevent broken connections when a Pod starts or shuts down. You will also learn how to shut down long-running tasks and connections gracefully.

   Daniele Polencic — Learnk8s

It turns out that an S3 bucket owner pays for failed requests to that bucket, even if they’re unauthenticated, so anyone can run up your AWS bill if they know your bucket’s name. Oops.

Oh, and they can get the bucket name from CT logs (thanks, Corey Quinn!)

  Maciej Pocwierz

Updated: May 5, 2024 — 9:02 pm
A production of Tinker Tinker Tinker, LLC