SRE Weekly Issue #341

Articles

https://sigops.org/s/conferences/hotos/2021/papers/hotos21-s11-bronson.pdf

My coworkers referred to a system “going metastable”, and when I asked what that was, they pointed me to this awesome paper.

Metastable failures occur in open systems with an uncontrolled source of load where a trigger causes the system to enter a bad state that persists even when the trigger is removed.

Nathan Bronson, Aleksey Charapko, Abutalib Aghayev, and Timothy Zhu
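
To make the quoted definition concrete, here is a minimal toy sketch of a sustaining feedback loop. It is not taken from the paper. The model, the function name `simulate`, and every parameter value are assumptions chosen for illustration: clients retry failed requests, and an overloaded server is assumed to waste half of its capacity on requests that have already timed out.

```python
def simulate(ticks=30, base_load=80, capacity=100, overload_efficiency=0.5,
             spike_load=150, spike_start=5, spike_end=8):
    """Toy model (all numbers hypothetical): return (tick, offered, goodput) tuples."""
    backlog = 0  # failed requests that clients will retry on the next tick
    history = []
    for t in range(ticks):
        # The trigger: a temporary load spike between spike_start and spike_end.
        new = spike_load if spike_start <= t < spike_end else base_load
        offered = new + backlog
        if offered <= capacity:
            # Healthy state: everything offered is served.
            goodput = offered
        else:
            # Overloaded state: assume part of the capacity is wasted on
            # requests whose clients have already given up and retried.
            goodput = int(capacity * overload_efficiency)
        backlog = offered - goodput
        history.append((t, offered, goodput))
    return history
```

With these defaults, the base load sits comfortably below capacity until the spike. Once the spike ends, the retry backlog keeps the offered load above capacity, so goodput stays at the degraded level for the rest of the run even though the trigger is gone. Calling `simulate(spike_start=0, spike_end=0)` disables the spike, and the system stays healthy throughout.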

Honeycomb incident report: Querying Errors

Honeycomb posted this incident report involving a service hitting the open file descriptors limit.

Honeycomb
Full disclosure: Honeycomb is my employer.

[reddit r/sre] What does your oncall rotas look like?

Lots of interesting answers to this one, especially when someone uttered the phrase:

engineers should not be on call

u/infomaniac89 and others — reddit

Incident Report: Google Cloud Filestore Outage 2022-09-13

A misbehaving internal Google service overloaded Cloud Filestore, exceeding its global request limit and effectively DoSing customers.

Google

Creating a Thriving On-Call Engineering Workflow by Embracing Healthy Team Habits

An in-depth look at how Adobe improved its on-call experience. They used a deliberate plan to change their team’s on-call habits for the better.

Bianca Costache — Adobe

Here’s How Chicago Trading Company’s Luke Rotta Engineers Resilient Systems

This one contains an interesting observation: they found that outages caused by cloud providers take longer to solve.

Jeff Martens — Metrist

Why you should ditch your overly detailed incident response plan | incident.io

Even if you don’t agree with all of their reasons, it’s definitely worth thinking about.

Danny Martinez — incident.io

Thoughts on API Reliability

This one covers common reliability risks in APIs and techniques for mitigating them.

Utsav Shah

The Future of Ops Is Platform Engineering

The evolution beyond separate Dev and Ops teams continues. This article traces the path through DevOps and into platform-focused teams.

Charity Majors — Honeycomb
Full disclosure: Honeycomb is my employer.
