SRE Weekly Issue #423

How to Fight Alert Fatigue with Synthetic Monitoring

This one’s full of great advice about making sure alerts are actionable, including alerting on flows that actually matter to customers.

Nočnica Mellifera — Checkly

What playing Magic: the Gathering taught me about incidents.

Here is a collection of things I learned after getting back into Magic: the Gathering over the past 10 years or so. They are things that apply to both the MTG scene and your friendly neighborhood incident response process.

Ross Brodbeck

Upgrading Kubernetes: From 1.11 to 1.18 in a month

It was a classic application of technical debt: they chose to focus on customer-facing features and let k8s updates slide. Here’s how they caught back up safely.

Jeff Wolski

Rice’s Theorem and Software Failures

This article presents an interesting hypothesis, and from it draws some nifty conclusions about reasoning about failure in systems.

we cannot know for sure whether or not software is going to be incident-free. It might well be, but we can’t ever prove it.

Niall Murphy

The role of psychological safety in incident response

For teams to solve incidents quickly and effectively, responders need to be able to trust each other and stakeholders have to trust the responders. This level of trust is hard to cultivate if your organization doesn’t have a significant amount of psychological safety.

Mandi Walls — PagerDuty

Klaviyo Incident Management: Interview with Laura Stone

More than just an interview, this article outlines a multi-year transformation from disorganized, haphazard incident investigation to a smooth and efficient incident response process.

Eric Silberstein — Klaviyo

Graceful shutdown in Kubernetes

In this article, you will learn how to prevent broken connections when a Pod starts or shuts down. You will also learn how to shut down long-running tasks and connections gracefully.

Daniele Polencic — Learnk8s
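
As a rough illustration of the long-running-task half of that article: Kubernetes sends SIGTERM to a container before it eventually sends SIGKILL, so a worker can use that window to finish its current task and stop taking new ones. The sketch below is not from the article; the names `shutdown_requested`, `handle_sigterm`, `install_handler`, and `run_worker` are hypothetical, and it assumes a simple single-threaded worker loop.

```python
import signal
import threading

# Hypothetical flag: set once the process has been asked to stop.
shutdown_requested = threading.Event()


def handle_sigterm(signum, frame):
    """Signal handler: only record the request; the worker loop reacts to it."""
    shutdown_requested.set()


def install_handler():
    """Register the handler for SIGTERM (must be called from the main thread)."""
    signal.signal(signal.SIGTERM, handle_sigterm)


def run_worker(tasks, process):
    """Process tasks in order, finishing the current one but starting no new
    task after shutdown has been requested. Returns the completed results."""
    results = []
    for task in tasks:
        if shutdown_requested.is_set():
            break  # stop picking up new work; in-flight work already finished
        results.append(process(task))
    return results
```

A real service would call `install_handler()` once at startup and would also need to stop accepting new connections; the amount of time available before SIGKILL depends on the Pod's termination grace period.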

How an empty S3 bucket can make your AWS bill explode

It turns out that an S3 bucket owner pays for failed requests to that bucket, even if they’re unauthenticated, so anyone can run up your AWS bill if they know your bucket’s name. Oops.

Oh, and they can get the bucket name from Certificate Transparency (CT) logs (thanks, Corey Quinn!).

Maciej Pocwierz
