SRE Weekly Issue #250

A message from our sponsor, StackHawk:

Check out this video and side-by-side blog walkthrough about adding application security testing to your Spinnaker pipeline.
https://sthwk.com/spinnaker

Articles

Here’s how Algolia was affected by the Salt Stack RCE vulnerability earlier this year and how they dealt with it.

Julien Lemoine — Algolia

Includes background information on SRE and example interview questions.

Marlo Vernon — Splunk

DNS, TLS certificates, and Unicode, among other issues, make for some great (and cringe-worthy) stories.

Adam LaGreca, with stories from Charity Majors, Matthew Fornaciari, Liran Haimovitch, Daniel Spoonhower, Lee Liu, and Tina Huang

In this story of a failover gone wrong, they discovered that innodb_flush_log_at_trx_commit had been set incorrectly, which explained why they lost data when they weren’t expecting to.

Rajeev Rai — Razorpay

This is a nice little comic about the role of SRE. Engineer the bridge, don’t be the bridge.

Piyush Verma — Last9

Lots of great concepts about human/computer systems, including this gem:

log facts, not interpretations

Fred Hebert

In this troubleshooting story, an innocent-seeming dependency upgrade introduced a subtle but nasty bug.

Jordan Place — Transposit

Google released an update to their post-analysis for the December 14th outage involving Google OAuth.

Outages

SRE Weekly Issue #249

I’m having a hard time wrapping my head around the fact that this issue marks 5 years of SRE Weekly. A massive thank you to everyone who writes the content I feature here every week, and also to all of you who subscribe!

A message from our sponsor, StackHawk:

Did you catch the news? StackHawk now offers a free Developer Plan. Getting up and running with application security testing has never been easier. Give it a try.
https://sthwk.com/freeplan

Articles

Every service needs a couple of big hammers that are easy to swing.

Jennifer Mace — O’Reilly and Google

Answer: automation. Lots of automation. And automation of the automation.

Fred Lin, Harish Dattatraya Dixit, and Sriram Sankar — Facebook

Oh, how quaint! This article was written back when people traveled for the holidays.

Ashley Roof — Transposit

Surprise! Fortunately, there are some ways to fix this limitation.

Heidi Howard, Ittai Abraham — Decentralized Thoughts

A common question when a company is implementing incident management is: why do we need this process?

It turns out that the easiest way to answer this question is to look at the world of unsuccessful incident management.

Kintaba

Whether you’re new to Just Culture or an old hand, there’s a lot of great detail in this article.

Tory Thompson — Firehouse

Not sold yet on full service ownership for development teams? This interview may help.

Vivian Chan — PagerDuty

While ostensibly about Jeli.io, this article makes a great case for why incident analysis is important in general and what kind of data we should be trying to gather.

John Allspaw — Adaptive Capacity Labs

A new feature roll-out resulted in impaired service for some customers.

The adaptive universe: where adaptations to challenges feed back and cause more challenges, requiring more adaptations.

Lorin Hochstein

Our first GraphQL release was twice as slow as our old REST API. Here’s how we fixed it.

Another great example of making a duplicate request to a new API in the background to test it before deploying it.

Michael P. Geraci — OkCupid

Outages

SRE Weekly Issue #248

A message from our sponsor, StackHawk:

Join StackHawk and Snyk on Wednesday to learn about how to automate application security testing with GitHub Actions. Register for the webinar here:
https://sthwk.com/stackhawk-snyk

Articles

It’s really easy to get an “uptime” SLO wrong, and a lying SLO can give you a false sense of security.

Piyush Verma — Last9

I love this quote. I feel like this is the “root cause” of every incident:

As for the underlying cause of the incident (or the “root cause” if you insist on using such language), that has to be the fact that our assumptions as teams or individuals are ultimately formed by our past experiences.

Oliver Leaver-Smith — Sky Betting & Gaming

I really love the concept of requisite complexity. This article has me thinking about a big project I’m working on in a new light.

Fred Hebert

They expected to max out an integer primary key column sometime in 2021. Then the pandemic hit and their timetable suddenly accelerated along with their traffic.

Jeff Pollard — Strava
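
To see why a traffic surge moves this kind of deadline so much, here is a back-of-the-envelope sketch. It assumes a signed 32-bit column, and the current ID and insert rates are made-up numbers, not Strava’s.

```python
INT32_MAX = 2**31 - 1  # largest value of a signed 32-bit integer column


def days_until_exhaustion(current_max_id, inserts_per_day, limit=INT32_MAX):
    """Estimate how many days remain before the key space runs out."""
    remaining = limit - current_max_id
    return remaining / inserts_per_day


# Hypothetical figures: same table, before and after a tripling of traffic.
baseline = days_until_exhaustion(1_900_000_000, 500_000)
surge = days_until_exhaustion(1_900_000_000, 1_500_000)

print(f"baseline: about {baseline:.0f} days, after surge: about {surge:.0f} days")
```

With these invented numbers, tripling the insert rate cuts the runway from roughly 495 days to roughly 165.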

I shouldn’t enjoy reading these so much… got any of your own to share?

Dean Wilson

The idea of borrowing expertise makes me think of Bainbridge’s Ironies of Automation.

Mandi Walls — PagerDuty

Heroku’s report explains how their service was impacted as a result of the big Amazon Kinesis outage a couple weeks back.

Heroku

This primer focuses on ensuring that your SLOs actually match up with business objectives.

Irving Popovetsky — Honeycomb

Outages

SRE Weekly Issue #247

A message from our sponsor, StackHawk:

The ZAP open source project is the underlying security scanner for StackHawk. Check out this 21-minute introduction to ZAP from project founder and core contributor Simon Bennetts.
https://sthwk.com/zap-intro-video

Articles

This incident report from a September Datadog outage has an interesting tidbit about scaling external incident response in tandem with internal.

Alexis Lê-Quôc — Datadog

This is Google’s write-up for an interesting issue that involved repeated re-sending of invitations to edit a Google Drive document.

Google

I basically want to immediately absorb any article with this title, unless it’s just clickbait spam. This one definitely isn’t.

Ronak Nathani

Lots of juicy details in this one about the difficulty Slack has had in scaling their DB layer and how Vitess solved their problems.

Arka Ganguli, Guido Iaquinti, Maggie Zhou, and Rafael Chacón — Slack

Hitting file descriptor limits is such an annoying kind of outage. Some good tips here, clearly coming from hard-won experience.

Utsav Shah
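
If you want to know how close a process is to its limit, something like the sketch below may help. It is not taken from the article: `fd_usage` is a hypothetical helper, it only works on Unix-like systems, and the open-descriptor count is approximate because listing the directory briefly opens a descriptor of its own.

```python
import os
import resource


def fd_usage():
    """Report this process's file descriptor limits and approximate usage."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    open_fds = None
    # /proc/self/fd exists on Linux; /dev/fd covers macOS and the BSDs.
    for path in ("/proc/self/fd", "/dev/fd"):
        try:
            open_fds = len(os.listdir(path))
            break
        except OSError:
            continue
    return {"soft_limit": soft, "hard_limit": hard, "open_fds": open_fds}


if __name__ == "__main__":
    print(fd_usage())
```

Exporting these numbers as metrics and alerting well below the soft limit is one way to catch a leak before it becomes an outage.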

They used two providers synced with OctoDNS.

Ryan Timken and Kiran Naidoo — Cloudflare

This is all about understanding the whole system (people and technology) and building learning, rather than finding a superficial “root cause”.

Piyush Verma — Last9

Outages

SRE Weekly Issue #246

A message from our sponsor, StackHawk:

Looking to get started with application security testing in CI/CD? Here is a broad overview of steps you can take.
https://sthwk.com/how-to-app-sec-in-ci

Articles

DNS-based load balancing is a nice simple solution, but unfortunately it doesn’t work well in certain circumstances. Read to find out how Algolia evolved their load balancing system in response.

Paul Berthaux — Algolia

We use percentiles all the time, so it’s really important to actually understand what they say (and what they don’t).

Piyush Verma — Last9
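
One well-known pitfall, which may or may not be the one this article focuses on, is that percentiles can’t be averaged across hosts. Here is a small sketch with made-up latencies and a simple nearest-rank percentile (`percentile` is a hypothetical helper, not a library function):

```python
def percentile(values, p):
    """Nearest-rank percentile for an integer p between 1 and 100."""
    ordered = sorted(values)
    # Integer ceiling of p * n / 100, avoiding floating-point rounding.
    rank = max(1, -(-p * len(ordered) // 100))
    return ordered[rank - 1]


# Hypothetical latencies in milliseconds from two hosts.
host_a = [10] * 98 + [1000] * 2   # a quiet host with two very slow requests
host_b = [20] * 900               # a busy host with steady latency

p99_a = percentile(host_a, 99)
p99_b = percentile(host_b, 99)
average_of_p99s = (p99_a + p99_b) / 2
true_p99 = percentile(host_a + host_b, 99)

print(p99_a, p99_b, average_of_p99s, true_p99)
```

The quiet host’s p99 is 1000 ms and the busy host’s is 20 ms, so the average of the two is 510 ms, while the p99 of all requests taken together is 20 ms. The average weights a host that served 100 requests the same as one that served 900.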

Thanks to an anonymous reader for this one.

The author started out as an embedded systems developer and moved into SRE. Here’s what they learned.

Eric Uriostigue — effx

Some great tips here. It’s hard to sound sincere in a public incident report, especially if you post a lot of them.

Adam Fowler

In this blog, we discuss how we built Fare Storage, Grab’s single source of truth fare data store, and how we overcame the challenges to make it more reliable and scalable to support our expanding features.

Sourabh Suman — Grab

This article covers Netflix’s gnmi-gateway, their open source tool for collecting metrics from network devices in a highly available and fault-tolerant manner.

Colin McIntosh and Michael Costello — Netflix

This year, re:Invent is online only, so you still have a chance to attend if you’re interested.

Ana M Medina — Gremlin

Cloudflare’s API service was impaired early this month. This is their incident report that describes a grey failure in a switch and downstream impact to etcd and their database system.

Tom Lianza and Chris Snook — Cloudflare

Outages

A production of Tinker Tinker Tinker, LLC