SRE Weekly Issue #415

[…] it must be said that the intent of these metrics was always to give an indicator of how well your team was delivering software, not a high-stakes metric that should be used, for example, to hire and fire team leads.

Nočnica Mellifera — The New Stack

Investigating and Optimizing Over-Querying

A primer on the problems with N+1 database queries and how this pattern can sneak into your code whether you realize it or not.

neda — ReadySet
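
For readers who have not met the N+1 pattern, here is a minimal sketch of it, not taken from the linked article. It uses Python's built-in sqlite3 module and a hypothetical authors/posts schema invented purely for illustration. The first function runs one query for the list of authors and then one more query per author. The second fetches the same data with a single JOIN.

```python
import sqlite3


def setup():
    # Hypothetical in-memory schema and data, assumed for illustration only.
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE posts (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
        INSERT INTO authors VALUES (1, 'ann'), (2, 'bob'), (3, 'cy');
        INSERT INTO posts VALUES (1, 1, 'a1'), (2, 1, 'a2'), (3, 2, 'b1');
        """
    )
    return conn


def titles_n_plus_one(conn):
    """One query for the authors, then one query per author (N+1 in total)."""
    queries = 0
    authors = conn.execute("SELECT id, name FROM authors ORDER BY id").fetchall()
    queries += 1
    result = {}
    for author_id, name in authors:
        rows = conn.execute(
            "SELECT title FROM posts WHERE author_id = ? ORDER BY id",
            (author_id,),
        ).fetchall()
        queries += 1
        result[name] = [title for (title,) in rows]
    return result, queries


def titles_single_join(conn):
    """The same data fetched in a single round trip with a LEFT JOIN."""
    rows = conn.execute(
        """
        SELECT a.name, p.title
        FROM authors a
        LEFT JOIN posts p ON p.author_id = a.id
        ORDER BY a.id, p.id
        """
    ).fetchall()
    result = {}
    for name, title in rows:
        result.setdefault(name, [])
        if title is not None:  # authors without posts yield a NULL title
            result[name].append(title)
    return result, 1
```

With three authors, the first version issues four queries and the second issues one, and the gap grows linearly with the number of parent rows. ORMs that lazy-load related rows inside a loop tend to produce the first shape without it being visible in the application code.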

Choosing Good SLIs

A great explainer on choosing the right SLIs, starting with the Golden Signals and branching out.

Tyler Treat

You should never be responsible for what you don’t control

My favorite part about this is the “latency budget” question — which team’s code gets to spend how much time doing its part to serve a request?

Alex Ewerlöf

An unexpected crash due to unrelated software changes

Changes in two programs outside the container made Ceph suddenly grind to a halt, as detailed in this troubleshooting story.

Vladimir Guryanov — Palark

How to set a good only one threshold for an alert?

The word “one” is the key here, as the author argues for getting rid of “warning” alerts entirely in favor of using only “critical”.

Gauthier François

Creating An Oncall Handoff Bot

They wrote a Slack bot to summarize open PagerDuty incidents every day.

Matt Weingarten

Negotiating Priorities Around Incident Investigations

The problems I’ll explore in this blog—from the SRE perspective—are about time pressures (when to ship the investigation) and the type of report people expect.

Fred Hebert — Honeycomb

Full disclosure: Honeycomb is my employer.

How we avoided alarm fatigue syndrome by managing/reducing the alerting noise.

In order to reduce the noise, first they had to define noisy alerts and the KPIs they were looking to improve.

Gauthier François — Doctolib
