SRE Weekly Issue #195

Articles

Observability: Tabs vs. Spaces for Ops

An entertaining take on defining Observability.

Joshua Biggley

What makes a good runbook?

There are some really great tips in here, wrapped up in a handy mnemonic, the Five As:

actionable
accessible
accurate
authoritative
adaptable

Dan Moore — Transposit

Fastly improves delivery reliability with its fast path failover technology

“The Internet routes around damage”, right? Not always, and if it does, it’s often too slow. Fastly has a pretty interesting solution to that problem.

Lorenzo Saino and Raul Landa — Fastly

Full disclosure: Fastly is my employer.

Debugging network stalls on Kubernetes

The stalls were caused by a gnarly kernel performance issue. They had to use bcc and perf to dig into the kernel in order to figure out what was wrong.

Theo Julienne — GitHub

9 Reliability Talks at AWS re:Invent that Everyone Should Attend

Heading to Las Vegas for re:Invent? Here’s a handy guide of talks you might want to check out.

Rui Su — Blameless

Markers of Progress in Incident Analysis

How can you tell when folks are learning effectively from incident reviews? Hint: not by measuring MTTR and the like.

John Allspaw — Adaptive Capacity Labs

Outages

Honeycomb Incident Report: Running Dry on Memory Without Noticing
- A couple weeks ago, I covered a Honeycomb outage and linked to a tweet thread by one of their employees. Here’s their full analysis of the incident, including a mention of the Twitter thread.
  Liz Fong-Jones — Honeycomb
LetsEncrypt
Microsoft Azure
- Microsoft posted this followup analysis of an issue with Azure’s edge network.
Netflix
British Airways
Microsoft 365, OneDrive, and SharePoint
Yahoo Mail
Heroku Incident #1927 Followup
Squarespace
- Also this one and this one.
GitHub
reddit
