Articles
My coworkers referred to a system “going metastable”, and when I asked what that was, they pointed me to this awesome paper.
Metastable failures occur in open systems with an uncontrolled source of load where a trigger causes the system to enter a bad state that persists even when the trigger is removed.
  Nathan Bronson, Aleksey Charapko, Abutalib Aghayev, and Timothy Zhu
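To make that definition concrete, here is a toy simulation of my own. It is not taken from the paper, and every name and number in it is an assumption: clients retry every failed request on the next tick, and the service loses half its capacity whenever it is overloaded. A short load spike is the trigger, and the retry backlog keeps the system overloaded long after the spike ends.

```python
def simulate(ticks=30, trigger=range(5, 8), base_load=80, spike_load=200,
             capacity=100, overloaded_capacity=50):
    """Toy model of a metastable failure (all parameters are hypothetical).

    Returns a list of (tick, offered, served) tuples.
    """
    backlog = 0  # failed requests that clients will retry next tick
    history = []
    for t in range(ticks):
        # The trigger is a temporary load spike; otherwise load is steady.
        new = spike_load if t in trigger else base_load
        offered = new + backlog
        # Assumption: an overloaded service wastes work and serves less.
        limit = capacity if offered <= capacity else overloaded_capacity
        served = min(offered, limit)
        # Everything that was not served comes back as retries.
        backlog = offered - served
        history.append((t, offered, served))
    return history


if __name__ == "__main__":
    for t, offered, served in simulate():
        print(f"tick={t:2d} offered={offered:4d} served={served:3d}")
```

In this sketch the spike covers only ticks 5 through 7, yet the service never gets back to serving its base load of 80 per tick, because the retries alone keep offered load above capacity. Run it with `trigger=range(0)` and it stays healthy for the whole run.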
Honeycomb posted this incident report involving a service hitting the open file descriptors limit.
  Honeycomb
  Full disclosure: Honeycomb is my employer.
Lots of interesting answers to this one, especially when someone uttered the phrase:
engineers should not be on call
u/infomaniac89 and others — reddit
A misbehaving internal Google service overloaded Cloud Filestore, exceeding its global request limit and effectively DoSing customers.
An in-depth look at how Adobe improved its on-call experience. They used a deliberate plan to change their team’s on-call habits for the better.
Bianca Costache — Adobe
This one contains an interesting observation: they found that outages caused by cloud providers take longer to resolve.
Jeff Martens — Metrist
Even if you don’t agree with all of their reasons, it’s definitely worth thinking about.
Danny Martinez — incident.io
This one covers common reliability risks in APIs and techniques for mitigating them.
Utsav Shah
The evolution beyond separate Dev and Ops teams continues. This article traces the path through DevOps and into platform-focused teams.
  Charity Majors — Honeycomb
  Full disclosure: Honeycomb is my employer.