SRE Weekly Issue #212

A message from our sponsor, VictorOps:

With a surge of developers and IT practitioners working remotely, there’s also a surge of confusion and operational inefficiency. See how data and automation are improving the way SREs and IT operations engineers build, release and maintain reliable services remotely:

https://go.victorops.com/sreweekly-data-and-automation-for-remote-teams

Articles

This very clearly written paper describes the Google G Suite team’s search for a meaningful availability metric: one that accurately reflected what their end users experienced, and that could be used by engineers to pinpoint issues and guide improvements.

Hauer et al. — NSDI’20 (original paper)

Adrian Colyer — The Morning Paper (summary)

Their top 5 are:

  • Use Meaningful Severity Levels
  • Create Detailed Runbooks
  • Load Balance Through Qualitative Metrics
  • Get Ahead of Incidents
  • Cultivate a Culture of On-Call Empathy

Emily Arnott — Blameless

Synchronizing clocks can be critical in an HA system, and Facebook went to great lengths to ensure clock accuracy.

Zoe Talamantes and Oleg Obleukhov — Facebook

You might end up just breaking things.

Dawn Parzych — LaunchDarkly

LinkedIn’s message search system takes advantage of the fact that relatively few users actually search their messages. It only builds a search index the first time a user performs a search.

Suruchi Shah and Hari Shankar — LinkedIn
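
The build-on-first-search idea can be sketched in a few lines. This is a minimal illustration, not LinkedIn's implementation: the class name `LazyMessageSearch`, the in-memory storage, and the whitespace tokenizer are all assumptions made for the example.

```python
from collections import defaultdict


class LazyMessageSearch:
    """Hypothetical sketch: build a user's inverted index only on their first search."""

    def __init__(self):
        self.messages = defaultdict(list)  # user -> list of message texts
        self.indexes = {}                  # user -> {word: set of message positions}

    @staticmethod
    def _index_one(index, position, text):
        # Naive tokenizer (assumption): lowercase and split on whitespace.
        for word in set(text.lower().split()):
            index.setdefault(word, set()).add(position)

    def add_message(self, user, text):
        self.messages[user].append(text)
        # Keep an existing index current, but never create one on the write path.
        if user in self.indexes:
            self._index_one(self.indexes[user], len(self.messages[user]) - 1, text)

    def search(self, user, word):
        # First search by this user: pay the indexing cost now.
        if user not in self.indexes:
            index = {}
            for position, text in enumerate(self.messages[user]):
                self._index_one(index, position, text)
            self.indexes[user] = index
        hits = self.indexes[user].get(word.lower(), set())
        return [self.messages[user][p] for p in sorted(hits)]
```

Users who never search never get an index, so the write path stays cheap for them. The trade-off is that the first search for a given user is slower than later ones.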

This followup post from Bungie covers two related incidents in February that caused loss of user data.

Bungie

An interview about how one company got their developers to join the on-call rotation. It covers how the company trained developers to build their confidence, and what benefits the developers gained by joining.

Ben Linders — InfoQ

Outages

SRE Weekly Issue #211

A message from our sponsor, VictorOps:

What’s the most important metric for SRE? Splunk Cloud Platform SRE Jonathan Schwietert argues that it’s customer happiness. On Tuesday (03/17), join Splunk + VictorOps for a webinar about using SRE and observability to build customer-first applications and services:

https://go.victorops.com/sreweekly-sre-for-happier-customers

Articles

SRECon20 Asia/Pacific is rescheduled to September 7–9, 2020.

This article has a definite marketing slant. It’s nonetheless interesting to see how Slack is handling the situation.

Cal Henderson and Robby Kwok, Slack

I love this gem:

I’m not surprised companies that are far into their observability journey start advocating for testing in production – once you have the data and you can slice & dice it as you see fit, testing in production seems like a totally reasonable thing to do.

Mads Hartmann

With many companies suddenly shifting into figuring out how to become distributed organizations overnight, we can learn many lessons by looking at incident response patterns.

George Miranda — PagerDuty

Today’s post is a double header. I’ve chosen two papers from NSDI’20 that are both about correlation.

Paper #1 is a tool that helps identify when files A and B are often changed at the same time, and warns you if you forgot B.

Paper #2 is a tool for finding correlated failure risks that threaten reliability.

Mehta et al. — NSDI’20 (original paper #1)
Zhai et al. — NSDI’20 (original paper #2)
Adrian Colyer — The Morning Paper (summaries)
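
To make the idea behind Paper #1 concrete, here is a rough sketch of co-change detection. It is a simple co-occurrence count over commit history, not the technique from the paper. The function names, the `min_support` and `min_confidence` thresholds, and the file names are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def learn_pairs(history, min_support=2, min_confidence=0.8):
    """Return {a: {b, ...}}: b changed in at least min_confidence of the commits that changed a."""
    file_count = Counter()
    pair_count = Counter()
    for commit in history:
        files = set(commit)
        file_count.update(files)
        for a, b in combinations(sorted(files), 2):
            pair_count[(a, b)] += 1

    rules = {}
    for (a, b), together in pair_count.items():
        if together < min_support:
            continue  # too few joint changes to trust
        # The rule is directional: "a implies b" can hold without "b implies a".
        for src, dst in ((a, b), (b, a)):
            if together / file_count[src] >= min_confidence:
                rules.setdefault(src, set()).add(dst)
    return rules


def missing_files(rules, change):
    """List files that usually accompany the changed files but are absent from this change."""
    changed = set(change)
    missing = set()
    for name in changed:
        missing |= rules.get(name, set()) - changed
    return sorted(missing)


history = [
    ["a.conf", "b.conf"],
    ["a.conf", "b.conf"],
    ["a.conf", "b.conf", "c.py"],
    ["c.py"],
    ["b.conf"],
]
rules = learn_pairs(history)
print(missing_files(rules, ["a.conf"]))  # ['b.conf']
```

In this toy history, `a.conf` never changes without `b.conf`, so changing only `a.conf` triggers a warning. `b.conf` sometimes changes alone, so changing only `b.conf` does not.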

The components from the article are:

  • Ability to recognize how bad the situation really is, and prioritize it
  • Effective communication skills
  • Compassionate responses to mistakes and a learning mindset

Hannah Culver — Blameless

We’re pleased to announce Failover Conf, a conference focused on building resilient systems. The conference will be held online on April 21 and session submissions will be accepted through March 23.

CFP open through March 23.

Gremlin

There are some good tips in here, especially if you’re new to this.

Mandy Mak

Fastly’s APS tool (Auto Peer Slasher) detects when a link is nearing saturation and automatically reroutes traffic through a different interface.

Ryan Landry — Fastly

Full disclosure: Fastly is my employer.
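
As a rough illustration of the detect-and-reroute loop described above, here is a minimal sketch. It is not Fastly's APS. The function name `plan_reroute`, the 90% trigger, the 80% target, and the link names are assumptions chosen for the example, and loads are plain integers in an arbitrary unit.

```python
def plan_reroute(links, threshold_pct=90, target_pct=80):
    """links: {name: (load, capacity)}. Returns a list of (src, dst, amount) moves.

    Hypothetical policy: when a link reaches threshold_pct utilization, move
    traffic off it until it sits at target_pct, filling the least-utilized
    links first and never pushing a destination above target_pct.
    """
    load = {name: used for name, (used, _) in links.items()}
    cap = {name: capacity for name, (_, capacity) in links.items()}
    moves = []
    for src in sorted(links):
        if load[src] * 100 < threshold_pct * cap[src]:
            continue  # not near saturation
        excess = load[src] - target_pct * cap[src] // 100
        # Candidate destinations, least utilized first.
        for dst in sorted(links, key=lambda name: load[name] / cap[name]):
            if excess <= 0:
                break
            if dst == src:
                continue
            headroom = target_pct * cap[dst] // 100 - load[dst]
            if headroom <= 0:
                continue
            shift = min(excess, headroom)
            load[src] -= shift
            load[dst] += shift
            excess -= shift
            moves.append((src, dst, shift))
    return moves


links = {"peer-a": (95, 100), "transit-b": (40, 100), "transit-c": (70, 100)}
print(plan_reroute(links))  # [('peer-a', 'transit-b', 15)]
```

A real system would also need damping so traffic does not flap between links, plus a way to move traffic back once the hot link cools down. Both are left out of this sketch.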

Outages

SRE Weekly Issue #210

A message from our sponsor, VictorOps:

See the tangible benefits of incident management reporting and analytics when it comes to faster incident detection, acknowledgment, response and resolution. Read on to learn about real KPIs and incident metrics that drive reliability:

https://go.victorops.com/sreweekly-incident-management-reporting

Articles

Netflix open sourced their incident management system.

Put simply, Dispatch is:

All of the ad-hoc things you’re doing to manage incidents today, done for you, and a bunch of other things you should’ve been doing, but have not had the time!

Kevin Glisson, Marc Vilanova, Forest Monsen — Netflix

I wasn’t aware of this little pitfall of memory cgroups.

rachelbythebay

Your failover DB instance is cute. Try 4x+ redundancy. That’s the kind of engineering required when designing systems to operate in space.

Glenn Fleishman — Increment

This post enumerates some of the risks introduced when a single person carries 100% of the on-call duties of a team, and shows why those risks are not simply eliminated by increasing the number of people in the rotation.

Daniel Condomitti — FireHydrant

This is a pretty nifty experiment showing the importance of letting folks use their judgement to handle unexpected situations rather than relying on adherence to procedures.

Thai Wood — Resilience Roundup (summary)

Makoto Takahashi, Daisuke Karikawa, Genta Sawasato and Yoshitaka Hoshii — Tohoku University (original paper)

FYI: SRECon Americas West has been rescheduled to June 2–4.

This week, we have another summary of the Physalia paper. I especially like the bit about poison pills.

Adrian Colyer — The Morning Paper (summary)

Brooker et al. — NSDI’20 (original paper)

In this case, “proof” means “formal proof”.

It’s not that software got so reliable without proof: it’s that systems that include software got so reliable without proof.

Lorin Hochstein

Outages

SRE Weekly Issue #209

A message from our sponsor, VictorOps:

Efficient management of SQL schema evolutions allows DevOps professionals to deploy code quickly and reliably with little to no impact. Learn how modern teams are building out zero impact SQL database deployment workflows here:

https://go.victorops.com/sreweekly-zero-impact-sql-database-deployments

Articles

Azure developed this tool to sniff out production problems caused by deploys and guess which deploy might have been the culprit. Its accuracy is impressive.

Adrian Colyer — The Morning Paper (summary)

Li et al. — NSDI’20 (original paper)

This one made me laugh out loud. Better check those system call return codes, people.

rachelbythebay

This caught my eye:

In addition, what is seen as the IC maintaining organizational discipline during a response can actually be undermining the sources of resilient practice that help incident responders cope with poorly matched coordination strategies and the cognitive demands of the incident.

Laura M.D. Maguire — ACM Queue Volume 17, Issue 6

A guide on salary expectations for various levels of SRE, especially useful if you’re changing jobs.

Gremlin

The flipside of microservices agility is the resiliency you can lose from service distribution. Here are some microservices resiliency patterns that can keep your services available and reliable.

Joydip Kanjilal

There have been several recent failures of consumer devices caused by cloud service outages, and this author argues for change.

Kevin C. Tofel — Stacey on IoT

This sounds familiar.

Durham Radio News

Essentially, you’re taking that risk of the Friday afternoon deployment, and spreading it thinly across many deployments throughout the week.

Ben New

Outages

SRE Weekly Issue #208

A message from our sponsor, VictorOps:

Learn about some more subtle, unknown use cases for using Splunk + VictorOps to drive a more analytical, proactive approach to incident response:

https://go.victorops.com/sreweekly-splunk-for-analytical-incident-response

Articles

There’s so much in this article:

  • how to recognize when your system may be susceptible to cascading failure
  • how to prevent it
  • how to deal with it when it happens (and how hard that can be)

Laura Nolan — Slack

It’s time for this year’s SRE Survey. Don’t forget that with each completed survey, Catchpoint donates $5 to charity.

This growing demand [for SREs] is not without growing pains as a skills gap problem has emerged due to the fact that SRE training requires a hands-on, interactive learning environment.

Peter Murray — Catchpoint

Both the summary and the original article are well worth reading. This stood out to me:

As much as we may think of incidents as taking place in all those technical parts of the system below the line, incidents actually take place above it

Thai Wood (summary)

Dr. Richard Cook (original article)

The EBS control plane data store resembles a “jellyfish” (actually a Physalia, a.k.a. Portuguese man-of-war).

Timothy Prickett Morgan — The Next Platform

Ideal: each team manages their microservice(s) in isolation.

Reality: microservices interact in unexpected ways and a broader system emerges that has remarkable similarities to running a monolith.

Ben Sigelman — LightStep

This one discusses how to handle SRE for a monolith, and some examples of what often goes wrong.

Eric Harvieux — Google

The author blocked an unexpected Sunday deploy of untested code, and it turned out to be a good thing they did.

rachelbythebay

Outages
