
SRE Weekly Issue #215

I missed last week to set up a new swing set for my kids (gotta give ’em something to do while they’re stuck at home). I’m still a bit behind on articles and I’ll catch up over the next couple weeks.

A message from our sponsor, VictorOps:

Our people and tools need to be connected now more than ever before. That’s why VictorOps is offering free, 90-day extended Enterprise trials for on-call incident response and alert management, up to 100 users, to anyone who needs it:

https://go.victorops.com/sreweekly-extended-trials-for-incident-response

Articles

The “messy” details of our human/computer systems are their hidden strength.

Lorin Hochstein

In this accident report, learn how two pilots lost situational awareness, with disastrous consequences.

Air Safety Institute

Without a structured strategy, and careful consideration of the full SLO lifecycle, SLOs risk partial implementation. This can result in low ROI and, in many cases, a complete failure.

Danny Mican — Squadcast

The cloud’s multiple availability zones and regions can be powerful, but it’s hard to get a multi-region architecture correct.

Serhat Can — OpsGenie

A useful little JavaScript tool: plug in an availability percentage (e.g. 99.99%), and get back the number of minutes you can be down in a day, month, quarter, or year.

Hexadecimal
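The arithmetic behind a calculator like this is simple enough to sketch. The snippet below is a rough Python illustration, not the linked tool's code: the function name `downtime_budget_minutes` and the period lengths (a 365-day year, with a month and a quarter taken as 1/12 and 1/4 of it) are assumptions made for the example.

```python
MINUTES_PER_DAY = 24 * 60

# Assumed period lengths in days: a 365-day year, with month and quarter
# taken as simple fractions of it. The linked tool may use other values.
PERIOD_DAYS = {
    "day": 1,
    "month": 365 / 12,
    "quarter": 365 / 4,
    "year": 365,
}


def downtime_budget_minutes(availability_percent: float) -> dict[str, float]:
    """Return the allowed downtime in minutes per period for a target such as 99.99."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100 percent")
    unavailable_fraction = 1 - availability_percent / 100
    return {
        period: unavailable_fraction * days * MINUTES_PER_DAY
        for period, days in PERIOD_DAYS.items()
    }


if __name__ == "__main__":
    for period, minutes in downtime_budget_minutes(99.99).items():
        print(f"{period}: {minutes:.2f} minutes")
```

Under these assumptions, "four nines" leaves well under an hour of downtime for the whole year, which is why each additional nine is so much harder to hold.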

Azure Pipelines had an incident of delayed builds at the end of March. Find out more in this post-incident analysis.

Chad Kimes – Microsoft

Google published another book in their SRE series. This short summary gives an overview of what’s inside along with an explanation of the motivation for another book. See also: Google’s announcement

Todd Hoff — High Scalability

The pendulum is swinging back, and folks, including early champion Uber, are starting to see the downsides of a plethora of microservices.

Todd Hoff — High Scalability

Outages

SRE Weekly Issue #214

A message from our sponsor, VictorOps:

SRE requires continuous improvement and learning. So, to help out, VictorOps lays out a bunch of educational resources, podcast episodes, videos and more in the new learning library. Check it out:

https://go.victorops.com/sreweekly-learning-library

Articles

A nifty little pitfall in which an ioniced process can block non-ioniced processes.

rachelbythebay

Google published this free set of courses on technical writing. As an SRE, I have the constant need to write effectively to justify and document my designs.

Every engineer is also a writer.

This collection of courses and learning resources aims to improve your technical documentation. Learn how to plan and author technical documents.

Google

The ACM has made their ACM Digital Library free to the public for the next 3 months. Many of their articles have been featured here previously.

Includes a great article by Jaime Woo, entitled Imagining Your Post-Incident Report As A Documentary.

Emil Stolarsky and Jaime Woo — The Post-Incident Review

Blameless recently had the privilege of hosting SRE leaders Liz Fong-Jones, Dave Rensin, and Alex Hidalgo to discuss how SREs can embrace resilience during the pandemic, and how the principles of SRE intersect with global trends.

I especially liked the discussion of pent-up demand that may cause problems when we eventually get to relax social distancing.

Amy Tobey (moderator), Alex Hidalgo, Liz Fong-Jones, Dave Rensin

This is a talk that John Allspaw gave for Spotify.

Learning is not the same as fixing

John Allspaw — Adaptive Capacity Labs

Outages

SRE Weekly Issue #213

A message from our sponsor, VictorOps:

Major incidents lead to more alerts, more downtime and unhappy customers. See how modern DevOps-minded teams are building virtual war rooms to quickly mobilize cross-functional engineering and IT teams around major incidents – improving incident remediation while reducing burnout:

https://go.victorops.com/sreweekly-war-rooms-for-major-incidents

Articles

This is important, and well worth a read. Where’s the SRE connection? The article explains that the U.S. Surgeon General’s comment that masks are “not effective” led to a stigma against those that wear them here. That kind of unintended sociological effect is uncovered commonly in incident post-analysis.

Sui Huang

PagerDuty ran the numbers and discovered an increase in incidents recently, especially in certain companies.

Rachel Obstler — PagerDuty

Here’s the scoop on all those GitHub incidents in February.

Keith Ballinger — GitHub

No, it won’t be possible to continue operating business-as-usual. For the foreseeable future, teams across the world will be dealing with cutbacks, infrastructure instability, and more. However, with SRE best practices, your team can embrace resilience and adapt through this difficult time.

Hannah Culver — Blameless

5 tips for incident management when you’re suddenly remote

I love the concept of “ephemeral information”, that is, discussions that happen out-of-band, making it much harder to analyze the incident after the fact.

Blake Thorne — Atlassian

Grey failure turned a seemingly reasonable auto-recovery mechanism into a DoS caused by a thundering herd.

Panagiotis Moustafellos, Uri Cohen, and Sylvain Wallez — Elastic

Outages

SRE Weekly Issue #212

A message from our sponsor, VictorOps:

With a surge of developers and IT practitioners working remotely, there’s also a surge of confusion and operational inefficiency. See how data and automation are improving the way SREs and IT operations engineers build, release and maintain reliable services remotely:

https://go.victorops.com/sreweekly-data-and-automation-for-remote-teams

Articles

This very clearly written paper describes the Google G Suite team’s search for a meaningful availability metric: one that accurately reflected what their end users experienced, and that could be used by engineers to pinpoint issues and guide improvements.

Hauer et al. — NSDI’20 (original paper)

Adrian Colyer — The Morning Paper (summary)

Their top 5 are:

  • Use Meaningful Severity Levels
  • Create Detailed Runbooks
  • Load Balance Through Qualitative Metrics
  • Get Ahead of Incidents
  • Cultivate a Culture of On-Call Empathy

Emily Arnott — Blameless

Synchronizing clocks can be critical in an HA system, and Facebook went to great lengths to ensure clock accuracy.

Zoe Talamantes and Oleg Obleukhov — Facebook

You might end up just breaking things.

Dawn Parzych — LaunchDarkly

LinkedIn’s message search system takes advantage of the fact that relatively few users actually search their messages. It only builds a search index the first time a user performs a search.

Suruchi Shah and Hari Shankar — LinkedIn
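The build-on-first-search idea can be illustrated with a small in-memory sketch. This is not LinkedIn's implementation: the class `LazyMessageSearch`, its whitespace tokenizer, and the `index_builds` counter are all hypothetical, and the sketch ignores messages that arrive after a user's index has been built.

```python
from collections import defaultdict


class LazyMessageSearch:
    """Toy message search that builds a per-user index only on first search."""

    def __init__(self, messages_by_user: dict[str, list[str]]):
        self._messages_by_user = messages_by_user
        self._indexes: dict[str, dict[str, set[int]]] = {}
        self.index_builds = 0  # counts how many indexes were actually built

    def _build_index(self, user_id: str) -> dict[str, set[int]]:
        # Map each lowercased word to the positions of messages containing it.
        index: dict[str, set[int]] = defaultdict(set)
        for position, text in enumerate(self._messages_by_user.get(user_id, [])):
            for word in text.lower().split():
                index[word].add(position)
        self.index_builds += 1
        return index

    def search(self, user_id: str, term: str) -> list[str]:
        # Users who never search never pay the indexing cost.
        if user_id not in self._indexes:
            self._indexes[user_id] = self._build_index(user_id)
        messages = self._messages_by_user.get(user_id, [])
        positions = self._indexes[user_id].get(term.lower(), set())
        return [messages[i] for i in sorted(positions)]


if __name__ == "__main__":
    store = LazyMessageSearch({
        "alice": ["Lunch tomorrow?", "Project update attached", "lunch moved to noon"],
        "bob": ["See you at the standup"],
    })
    print(store.search("alice", "lunch"))
    print(store.index_builds)  # only alice's index exists so far
```

The trade-off is that the first search for each user is slower, in exchange for skipping index storage and maintenance for everyone who never searches.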

This followup post from Bungie covers two related incidents in February that caused loss of user data.

Bungie

An interview about how one company got their developers to join the on-call rotation. It covers how the company trained developers to build their confidence, and what benefits came from having them join.

Ben Linders — InfoQ

Outages

SRE Weekly Issue #211

A message from our sponsor, VictorOps:

What’s the most important metric for SRE? Well, Splunk Cloud Platform SRE, Jonathan Schwietert, argues that it’s customer happiness. On Tuesday (03/17), join Splunk + VictorOps for a webinar about using SRE and observability to build customer-first applications and services:

https://go.victorops.com/sreweekly-sre-for-happier-customers

Articles

SRECon20 Asia/Pacific is rescheduled to September 7–9, 2020.

This article has a definite marketing slant. It’s nonetheless interesting to see how Slack is handling the situation.

Cal Henderson and Robby Kwok, Slack

I love this gem:

I’m not surprised companies that are far into their observability journey start advocating for testing in production – once you have the data and you can slice & dice it as you see fit, testing in production seems like a totally reasonable thing to do.

Mads Hartmann

With many companies suddenly shifting into figuring out how to become distributed organizations overnight, we can learn many lessons by looking at incident response patterns.

George Miranda — PagerDuty

Today’s post is a double header. I’ve chosen two papers from NSDI’20 that are both about correlation.

Paper #1 is a tool that helps identify when files A and B are often changed at the same time, and warns you if you forgot B.

Paper #2 is a tool for finding correlated failure risks that threaten reliability.

Mehta et al. — NSDI’20 (original paper #1)
Zhai et al. — NSDI’20 (original paper #2)
Adrian Colyer — The Morning Paper (summaries)
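To make the idea behind paper #1 concrete, here is a rough sketch of co-change detection over commit history. It is a simple association-rule heuristic, not the algorithm from the paper: the function names, the thresholds `min_support` and `min_confidence`, and the file names in the example are all made up for illustration.

```python
from collections import Counter
from itertools import combinations


def mine_cochange_rules(commits, min_support=2, min_confidence=0.8):
    """Learn rules "when src changes, dst usually changes too" from past commits.

    commits is a list of commits, each given as a list of changed file names.
    """
    file_counts = Counter()
    pair_counts = Counter()
    for files in commits:
        unique = set(files)
        file_counts.update(unique)
        for a, b in combinations(sorted(unique), 2):
            pair_counts[(a, b)] += 1

    rules = {}
    for (a, b), together in pair_counts.items():
        if together < min_support:
            continue  # too few joint changes to trust the pattern
        for src, dst in ((a, b), (b, a)):
            # Confidence: share of src's changes that also touched dst.
            if together / file_counts[src] >= min_confidence:
                rules.setdefault(src, set()).add(dst)
    return rules


def missing_companions(changed_files, rules):
    """Return, per changed file, the usual companions absent from this change."""
    changed = set(changed_files)
    warnings = {}
    for name in changed:
        missing = rules.get(name, set()) - changed
        if missing:
            warnings[name] = sorted(missing)
    return warnings


if __name__ == "__main__":
    history = [
        ["api.py", "api_docs.md"],
        ["api.py", "api_docs.md"],
        ["api.py", "api_docs.md", "util.py"],
        ["util.py"],
    ]
    rules = mine_cochange_rules(history)
    print(missing_companions(["api.py"], rules))
```

In this toy history, a change that touches only `api.py` gets flagged because `api_docs.md` has always changed alongside it.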

The components from the article are:

  • Ability to recognize how bad the situation really is, and prioritize it
  • Effective communication skills
  • Compassionate responses to mistakes and a learning mindset

Hannah Culver — Blameless

We’re pleased to announce Failover Conf, a conference focused on building resilient systems. The conference will be held online on April 21 and session submissions will be accepted through March 23.

CFP open through March 23.

Gremlin

There are some good tips in here, especially if you’re new to this.

Mandy Mak

Fastly’s APS tool (Auto Peer Slasher) detects when a link is nearing saturation and automatically reroutes traffic through a different interface.

Ryan Landry — Fastly

Full disclosure: Fastly is my employer.

Outages

A production of Tinker Tinker Tinker, LLC