This very clearly written paper describes the Google G Suite team’s search for a meaningful availability metric: one that accurately reflected what their end users experienced, and that could be used by engineers to pinpoint issues and guide improvements.
Hauer et al. — NSDI’20 (original paper)
Adrian Colyer — The Morning Paper (summary)
Their top 5 are:
- Use Meaningful Severity Levels
- Create Detailed Runbooks
- Load Balance Through Qualitative Metrics
- Get Ahead of Incidents
- Cultivate a Culture of On-Call Empathy
Emily Arnott — Blameless
Synchronizing clocks can be critical in an HA system, and Facebook went to great lengths to ensure clock accuracy.
Zoe Talamantes and Oleg Obleukhov — Facebook
You might end up just breaking things.
Dawn Parzych — LaunchDarkly
LinkedIn’s message search system takes advantage of the fact that relatively few users actually search their messages. It only builds a user’s search index the first time that user performs a search.
Suruchi Shah and Hari Shankar — LinkedIn
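The lazy-indexing idea above can be sketched in a few lines. This is a hypothetical toy (the class and method names are mine, not LinkedIn’s): writes stay cheap for users who never search, and the first search pays a one-time cost to build that user’s inverted index, which is then kept fresh on subsequent writes.

```python
from collections import defaultdict

class LazyMessageSearch:
    """Toy sketch of per-user lazy indexing (illustrative, not LinkedIn's code)."""

    def __init__(self):
        self.messages = defaultdict(list)   # user_id -> list of message texts
        self.indexes = {}                   # user_id -> inverted index (term -> msg ids)

    def add_message(self, user_id, text):
        # Writes are cheap: no index maintenance for users who never search.
        self.messages[user_id].append(text)
        if user_id in self.indexes:
            # Keep the index fresh only for users who have already searched.
            self._index_message(user_id, text, len(self.messages[user_id]) - 1)

    def _index_message(self, user_id, text, msg_id):
        index = self.indexes[user_id]
        for term in text.lower().split():
            index.setdefault(term, set()).add(msg_id)

    def search(self, user_id, term):
        # First search pays the one-time cost of building this user's index.
        if user_id not in self.indexes:
            self.indexes[user_id] = {}
            for msg_id, text in enumerate(self.messages[user_id]):
                self._index_message(user_id, text, msg_id)
        hits = self.indexes[user_id].get(term.lower(), set())
        return [self.messages[user_id][i] for i in sorted(hits)]
```

The trade-off is a slow first search for each user in exchange for skipping index work entirely for the (majority of) users who never search.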
This followup post from Bungie covers two related incidents in February that caused loss of user data.
An interview about how one company got its developers to join the on-call rotation. It covers how the developers were trained to build confidence, and what benefits came from joining.
Ben Linders — InfoQ
- The text of this incident originally mentioned Heroku, and it lines up with the Heroku outage below.
- They also had this unrelated outage.
- Microsoft Teams and Office 365
- Discord posted this gem of a followup analysis just a few days after their outage last week.
- Google Nest