SRE Weekly Issue #212

A message from our sponsor, VictorOps:

With a surge of developers and IT practitioners working remotely, there’s also a surge of confusion and operational inefficiency. See how data and automation are improving the way SREs and IT operations engineers build, release, and maintain reliable services remotely:

https://go.victorops.com/sreweekly-data-and-automation-for-remote-teams

Articles

This very clearly written paper describes the Google G Suite team’s search for a meaningful availability metric: one that accurately reflected what their end users experienced, and that could be used by engineers to pinpoint issues and guide improvements.

Hauer et al. — NSDI’20 (original paper)

Adrian Colyer — The Morning Paper (summary)

Their top 5 are:

  • Use Meaningful Severity Levels
  • Create Detailed Runbooks
  • Load Balance Through Qualitative Metrics
  • Get Ahead of Incidents
  • Cultivate a Culture of On-Call Empathy

Emily Arnott — Blameless

Synchronizing clocks can be critical in an HA system, and Facebook went to great lengths to ensure clock accuracy.

Zoe Talamantes and Oleg Obleukhov — Facebook

You might end up just breaking things.

Dawn Parzych — LaunchDarkly

LinkedIn’s message search system takes advantage of the fact that relatively few users actually search their messages. It builds a user’s search index only the first time that user performs a search.

Suruchi Shah and Hari Shankar — LinkedIn
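
The lazy indexing idea in the LinkedIn item above can be illustrated with a small sketch. This is not LinkedIn's implementation; the class name `LazyMessageSearch`, its methods, and the in-memory word index are all hypothetical, chosen only to show the pattern of deferring index construction until a user's first search.

```python
from collections import defaultdict


class LazyMessageSearch:
    """Hypothetical per-user message search that builds an index on first search."""

    def __init__(self):
        self._messages = defaultdict(list)  # user -> list of message texts
        self._indexes = {}                  # user -> {word: set of message positions}

    def add_message(self, user, text):
        position = len(self._messages[user])
        self._messages[user].append(text)
        # Keep the index current only for users who have already searched.
        if user in self._indexes:
            self._index_one(self._indexes[user], position, text)

    def search(self, user, word):
        # First search by this user: pay the indexing cost now.
        if user not in self._indexes:
            self._indexes[user] = self._build_index(user)
        hits = self._indexes[user].get(word.lower(), set())
        return [self._messages[user][i] for i in sorted(hits)]

    def has_index(self, user):
        return user in self._indexes

    def _build_index(self, user):
        index = {}
        for position, text in enumerate(self._messages[user]):
            self._index_one(index, position, text)
        return index

    @staticmethod
    def _index_one(index, position, text):
        for word in text.lower().split():
            index.setdefault(word, set()).add(position)


store = LazyMessageSearch()
store.add_message("alice", "Lunch on Friday")
store.add_message("bob", "Deploy on Friday")
print(store.has_index("alice"))         # False
print(store.search("alice", "friday"))  # ['Lunch on Friday']
print(store.has_index("bob"))           # False
```

In this sketch, users who never search cost nothing beyond message storage, while a user's first search is slower because it triggers the index build.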

This follow-up post from Bungie covers two related incidents in February that caused loss of user data.

Bungie

An interview about how one company got its developers to join the on-call rotation. It covers how the developers were trained to build their confidence and what benefits they gained by joining.

Ben Linders — InfoQ

Outages

Updated: March 22, 2020 — 9:19 pm
A production of Tinker Tinker Tinker, LLC