SRE Weekly Issue #192

A message from our sponsor, VictorOps:

Keeping your local repository in sync with an open-source GitHub repo can cause headaches. But it can also lead to more flexible, resilient services. See how these techniques can help you maintain consistency between both environments:

http://try.victorops.com/sreweekly/keeping-github-and-local-repos-in-sync

Articles

This is a reply/follow-on/not-rebuttal to the article I linked to last week, Deploy on Fridays, or Don’t. I really love the vigorous discussion!

Charity Majors

And this is a reply to Charity’s earlier article, Friday Deploy Freezes Are Exactly Like Murdering Puppies. Keep it coming, folks!

Marko Bjelac

In this story from the archives, a well-meaning compiler optimizes away a NULL pointer check, yielding an exploitable kernel bug. I love complex systems (kinda).

Jonathan Corbet — LWN
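
The pattern behind that bug is easy to reproduce. Here’s a minimal sketch of the bug class the article describes, not the actual kernel code (the names device and dev_poll are made up): because the dereference happens before the check, the compiler may conclude the pointer is non-NULL and delete the check.

    #include <stdio.h>

    /* Hypothetical stand-in for the kernel structure in the story. */
    struct device {
        int status;
    };

    int dev_poll(struct device *dev)
    {
        int status = dev->status; /* dereference happens before the check... */
        if (!dev)                 /* ...so the compiler may assume dev != NULL */
            return -1;            /* and silently delete this branch */
        return status;
    }

    int main(void)
    {
        struct device d = { 42 };
        printf("%d\n", dev_poll(&d)); /* fine: prints 42 */
        /* dev_poll(NULL) is undefined behavior: with optimization on, the
           NULL check above can be gone, so instead of returning -1 this
           would read through address zero. */
        return 0;
    }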

A new report has been released about a major telecommunications outage last winter. This summary paints the picture of a classic complex systems failure.

Ronald Lewis

Making engineers responsible for their code and services in production offers multiple advantages—for the engineer as well as the code.

Julie Gunderson — PagerDuty

Outages

SRE Weekly Issue #191

A message from our sponsor, VictorOps:

Need a new SRE podcast? Then check out episode one of the new VictorOps podcast, Ship Happens. Engineering Manager Benton Rochester sits down with Bethany Abbott, TechOps Manager at NS1, to discuss on-call and the gender gap in tech.

http://try.victorops.com/sreweekly/ship-happens-episode-one

Articles

Check it out! A new zine dedicated to post-incident reviews. This first issue includes reprints of four real gems from the past month, plus one original article about disseminating lessons learned from incidents.

Emil Stolarsky and Jaime Woo

I swear, it’s like they heard me talking about anomaly detection last week. Anyone used this thing? I’d love to hear your experience. Better still, perhaps you’d like to write a blog post or article?

I know this isn’t Security Weekly, but this vulnerability has the potential to cause reliability issues, and it’s dreadfully simple to understand and exploit.

Hoai Viet Nguyen and Luigi Lo Iacono

In this incident followup from the archives, read the saga of a deploy gone horribly wrong. It took them hours and several experiments to figure out how to right the ship.

CCP Goliath — EVE Online

The best practices:

  1. Create a culture of experimentation
  2. Define what success looks like as a team
  3. Statistical significance (see the sketch after this list)
  4. Proper segmentation
  5. Recognize your biases
  6. Conduct a retro
  7. Consider experiments during the planning phase
  8. Empower others
  9. Avoid technical debt

Dawn Parzych — LaunchDarkly
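
On item 3, here’s a minimal sketch of what checking statistical significance can look like for an A/B experiment: a two-proportion z-test over made-up conversion numbers. The article doesn’t prescribe a particular test, so this is just one common choice (compile with -lm).

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double conv_a = 120, n_a = 2400; /* control: conversions, visitors */
        double conv_b = 150, n_b = 2350; /* variant: conversions, visitors */

        double p_a = conv_a / n_a;
        double p_b = conv_b / n_b;
        double p_pool = (conv_a + conv_b) / (n_a + n_b); /* pooled rate */
        double se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b));
        double z = (p_b - p_a) / se;

        /* |z| > 1.96 means significant at the p < 0.05 level. */
        printf("p_a=%.4f p_b=%.4f z=%.2f\n", p_a, p_b, z);
        return 0;
    }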

Mantis uses an interesting stream processing / subscriber model for observability tooling.

From the beginning, we’ve built Mantis with this exact guiding principle in mind: Let’s make sure we minimize the costs of observing and operating our systems without compromising on required and opportunistic insights.

Cody Rioux, Daniel Jacobson, Jeff Chao, Neeraj Joshi, Nick Mahilani, Piyush Goyal, Prashanth Ramdas, and Zhenzhong Xu — Netflix
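
The idea in that quote is easy to picture with a toy sketch. This isn’t Mantis’s actual API, just an illustration of the on-demand principle: the instrumented process pays the cost of producing events only while someone downstream is subscribed.

    #include <stdio.h>

    static int subscribers = 0; /* active downstream consumers */

    static void emit(const char *event)
    {
        if (subscribers == 0)
            return; /* nobody is listening: skip the cost entirely */
        /* A real system would serialize and publish the event here. */
        printf("event: %s\n", event);
    }

    int main(void)
    {
        emit("request.latency=12ms"); /* dropped: no subscribers yet */
        subscribers++;                /* an operator attaches a query */
        emit("request.latency=9ms");  /* now flows downstream */
        subscribers--;                /* the query ends */
        emit("request.latency=31ms"); /* dropped again */
        return 0;
    }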

choosing not to deploy on Fridays is very different than having the capability to deploy on Fridays. You should have the capability to deploy at any time.

We can’t ever be sure deploy will be safe, but we can be sure that folks have plans for their weekend.

David Mangot — Mangoteque

Outages

  • Amazon Route 53
    • Route 53 had significant DNS resolution impairment.

      Their status site still doesn’t allow deep linking or browsing the archive in any kind of manageable way, so here’s the full text of their followup post:

      On October 22, 2019, we detected and then mitigated a DDoS (Distributed Denial of Service) attack against Route 53. Due to the way that DNS queries are processed, this attack was first experienced by many other DNS server operators as the queries made their way through DNS resolvers on the internet to Route 53. The attack targeted specific DNS names and paths, notably those used to access the global names for S3 buckets. Because this attack was widely distributed, a small number of ISPs operating affected DNS resolvers implemented mitigation strategies of their own in an attempt to control the traffic. This is causing DNS lookups through these resolvers for a small number of AWS names to fail. We are doing our best to identify and contact these operators, as quickly as possible, and working with them to enhance their mitigations so that they do not cause impact to valid requests. If you are experiencing issues, please contact us so we can work with your operator to help resolve.

  • Heroku
    • I’m guessing this stemmed from the Route 53 incident.

      Our infrastructure provider is currently reporting intermittent DNS resolution errors. This may result in issues resolving domains to our services.

  • Twitter
  • Yahoo Mail
  • Hosted Graphite
  • Discord
  • Google Cloud Platform

SRE Weekly Issue #190

A message from our sponsor, VictorOps:

In the latest guide, Resilience First, you’ll learn about the origin of SRE, how it’s evolved over the last few years, and the future of its impact on building highly observable, resilient applications and infrastructure.

http://try.victorops.com/sreweekly/sre-golden-signals-guide

Articles

This company had a really challenging on-call situation to fix: a monolithic codebase, and a huge team with so many people in the on-call rotation that folks were out of practice by the time their turn came around.

Molly Struve

This article includes charts, observations, and conclusions from the author’s by-hand analysis and categorization of several hundred incidents.

Subbu Allamaraju

Charity Majors replied to a suggestion to write alerts for everything with her ideas for a better way.

Charity Majors (@mipsytipsy)

Where many databases use threading to handle concurrent clients, PostgreSQL forks one child process per client. This has ramifications that an operator must take into consideration.

Kristi Anderson — High Scalability
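
A minimal fork-per-connection server makes the model concrete. This sketch is only in the same spirit, not PostgreSQL’s actual backend startup (which does far more), and error handling is omitted for brevity:

    #include <netinet/in.h>
    #include <signal.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void)
    {
        int lfd = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in addr = { 0 };
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port = htons(5432); /* PostgreSQL's usual port */
        bind(lfd, (struct sockaddr *)&addr, sizeof addr);
        listen(lfd, 16);
        signal(SIGCHLD, SIG_IGN); /* let the kernel reap exited children */

        for (;;) {
            int cfd = accept(lfd, NULL, NULL);
            if (fork() == 0) {   /* one child process per connected client */
                close(lfd);
                char buf[256];
                ssize_t n;
                while ((n = read(cfd, buf, sizeof buf)) > 0)
                    write(cfd, buf, n); /* this child serves only this client */
                _exit(0);
            }
            close(cfd); /* parent returns immediately to accept() */
        }
    }

Every client costs a whole OS process, which is why connection counts matter so much to PostgreSQL operators and why connection poolers such as PgBouncer are so common.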

This article is about the attributes of a good anomaly detection system, but it doesn’t mention a specific one. I have yet to find an anomaly detection system that doesn’t produce so many false positives that it’s useless.

Hive mind: if you’re using an anomaly detection system that actually works and doesn’t drown you with false positives, I want to hear about it. Bonus points if you want to write an article about it!

Amit Levi
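
The base-rate arithmetic behind that complaint is worth spelling out. With made-up but plausible numbers, even a detector that’s wrong only 0.3% of the time buries you in alarms:

    #include <stdio.h>

    int main(void)
    {
        double metrics = 10000;        /* time series being watched */
        double checks_per_day = 1440;  /* one evaluation per minute */
        double false_pos_rate = 0.003; /* ~3-sigma threshold on normal data */

        double alarms = metrics * checks_per_day * false_pos_rate;
        printf("expected false alarms per day: %.0f\n", alarms); /* 43200 */
        return 0;
    }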

Outages

SRE Weekly Issue #189

A message from our sponsor, VictorOps:

Adopt an incremental approach to machine learning to empower DevOps and IT teams and make on-call incident management suck less. Check out the open webinar recording today.

http://try.victorops.com/sreweekly/machine-learning-in-devops-and-it

Articles

…no reason. Asking for a friend.

Daniel Kolitz — Gizmodo

Multi-cloud may not be your first choice — but it may not be your choice at all.

Krishnan Subramanian — StackSense

Should you deploy on a Friday?
If you’ve got the confidence in your build and deploy pipelines, go for it.
If you don’t, go build some confidence.

Mitch Pomery — DEV

This story was so good I read it twice. The little details under the hood of your automation tools can reach out and bite you.

Rachel by the Bay

D&D-themed game days!

Lukas van Driel — Q42

Some interesting details courtesy of leaked internal audio from Facebook.

Casey Newton — The Verge

How do they cheat? By making assumptions about where a read for a given datum is likely to come from.

Daniel Abadi

The incident was the result of mismatched library versions.

Outages

  • PG&E Website
    • PG&E is a power company in California, USA. They’re cutting power preemptively to reduce the risk of fires started by power lines blown around in high winds.
  • Instagram

SRE Weekly Issue #188

A message from our sponsor, VictorOps:

[Free Webinar] Last chance to register for this week’s live webinar – How to Succeed in Machine Learning Without Really Trying. See how IT and engineering leaders are implementing ML to build more robust systems and improve on-call incident response.

http://try.victorops.com/sreweekly/machine-learning-webinar

Articles

Two of the hardest problems of testing in production are curtailing blast radius and dealing with state. In this post, I aim to explore the topic of curtailing blast radius in more detail.

Cindy Sridharan

This team was getting paged constantly to fix failed Kafka nodes, and their outlook for the future was looking even worse. They responded by developing an auto-remediation system.

Andrey Falko — Lyft

As last week’s Boeing-related article explained, Boeing and Airbus have significantly different philosophies regarding the role of pilots vs. aircraft in aviation safety. This new NTSB report strikes at the heart of that dichotomy.

Alex Davies — Wired

This is an especially interesting read because the team in question was a network operations team, and the members largely had no software engineering experience. Part of the transformation involved essentially training them for a new career.

Tom Wright — Google

My favorite part is the explanation of why observability is critical in microservice architectures.

The system is no longer in one of two states but more like one of n-factorial states.

Tyler Treat
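
Taking the quote’s n-factorial at face value, a few lines show how fast that blows up:

    #include <stdio.h>

    int main(void)
    {
        double states = 1;
        for (int n = 1; n <= 12; n++) {
            states *= n; /* n! grows explosively compared to 2 states */
            printf("%2d services -> %.0f states\n", n, states);
        }
        return 0;
    }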

Given that Lambda et al. auto-scale, is caching still relevant? Find out by reading this article.

Yan Cui
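
One reason caching can stay relevant is container reuse: state initialized outside the handler survives across warm invocations of the same instance. A sketch of the idea (the names and the 60-second TTL here are made up, and a real Lambda function would live in its own runtime):

    #include <stdio.h>
    #include <time.h>

    static char cached[64];       /* lives outside the handler */
    static time_t fetched_at = 0;

    static const char *get_config(void)
    {
        time_t now = time(NULL);
        if (fetched_at == 0 || now - fetched_at > 60) { /* 60s TTL */
            /* Pretend this is a slow call to a parameter store. */
            snprintf(cached, sizeof cached, "config@%ld", (long)now);
            fetched_at = now;
        }
        return cached; /* warm invocations reuse the cached value */
    }

    int main(void)
    {
        /* Simulate three invocations landing on one warm instance. */
        for (int i = 0; i < 3; i++)
            printf("invocation %d sees %s\n", i, get_config());
        return 0;
    }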

Outages

  • GitHub
    • Repository forking operations were delayed.
  • Statuspage.io
  • Slack
    • Some customers are seeing an error code (“1AE32E16D91F”) when connecting to Slack.

      Now I really want to know what 1AE32E16D91F is…

  • Twitter