Articles
An entertaining take on defining Observability.
Joshua Biggley
There are some really great tips in here, wrapped up in a handy mnemonic, the Five As:
- actionable
- accessible
- accurate
- authoritative
- adaptable
Dan Moore — Transposit
“The Internet routes around damage”, right? Not always, and if it does, it’s often too slow. Fastly has a pretty interesting solution to that problem.
Lorenzo Saino and Raul Landa — Fastly
Full disclosure: Fastly is my employer.
The stalls were caused by a gnarly kernel performance issue. They had to use bcc and perf to dig into the kernel in order to figure out what was wrong.
Theo Julienne — GitHub
Heading to Las Vegas for re:Invent? Here’s a handy guide of talks you might want to check out.
Rui Su — Blameless
How can you tell when folks are learning effectively from incident reviews? Hint: not by measuring MTTR and the like.
John Allspaw — Adaptive Capacity Labs
Outages
- Honeycomb Incident Report: Running Dry on Memory Without Noticing
- A couple weeks ago, I covered a Honeycomb outage and linked to a tweet thread by one of their employees. Here’s their full analysis of the incident, including a mention of the Twitter thread.
Liz Fong-Jones — Honeycomb
- LetsEncrypt
- Microsoft Azure
- Microsoft posted this followup analysis of an issue with Azure’s edge network.
- Netflix
- British Airways
- Microsoft 365, OneDrive, and SharePoint
- Yahoo Mail
- Heroku Incident #1927 Followup
- Squarespace
- GitHub