SRE Weekly Issue #195

A message from our sponsor, VictorOps:

Understanding the incident lifecycle can guide DevOps and IT engineers into a future where on-call sucks less. See how you can breakdown the stages of the incident lifecycle and use automation, transparency and collaboration to improve each stage:

https://go.victorops.com/sreweekly-incident-lifecycle-guide

Articles

An entertaining take on defining Observability.

Joshua Biggley

There are some really great tips in here, wrapped up in a handy mnemonic, the Five As:

  • actionable
  • accessible
  • accurate
  • authoritative
  • adaptable

Dan Moore — Transposit

“The Internet routes around damage”, right? Not always, and if it does, it’s often too slow. Fastly has a pretty interesting solution to that problem.

Lorenzo Saino and Raul Landa — Fastly

Full disclosure: Fastly is my employer.

The stalls were caused by a gnarly kernel performance issue. They had to use bcc and perf to dig into the kernel in order to figure out what was wrong.

Theo Julienne — GitHub

Heading to Las Vegas for re:Invent? Here’s a handy guide of talks you might want to check out.

Rui Su — Blameless

How can you tell when folks are learning effectively from incident reviews? Hint: not by measuring MTTR and the like.

John Allspaw — Adaptive Capacity Labs

Outages

Updated: November 24, 2019 — 9:24 pm
A production of Tinker Tinker Tinker, LLC Frontier Theme