SRE Weekly Issue #249

I’m having a hard time wrapping my head around the fact that this issue marks 5 years of SRE Weekly. A massive thank you to everyone who writes the content I feature here every week, and also to all of you who subscribe!

A message from our sponsor, StackHawk:

Did you catch the news? StackHawk now offers a free Developer Plan. Getting up and running with application security testing has never been easier. Give it a try.
https://sthwk.com/freeplan

Articles

Every service needs a couple of big hammers that are easy to swing.

Jennifer Mace — O’Reilly and Google

Answer: automation. Lots of automation. And automation of the automation.

Fred Lin, Harish Dattatraya Dixit, and Sriram Sankar — Facebook

Oh, how quaint! This article was written back when people traveled for the holidays.

Ashley Roof — Transposit

Surprise! Fortunately, there are some ways to fix this limitation.

Heidi Howard, Ittai Abraham — Decentralized Thoughts

A common question when a company is implementing incident management is: why do we need this process?

It turns out that the easiest way to answer this question is to look at the world of unsuccessful incident management.

Kintaba

Whether you’re new to Just Culture or an old hand, there’s a lot of great detail in this article.

Tory Thompson — Firehouse

Not sold yet on full service ownership for development teams? This interview may help.

Vivian Chan — PagerDuty

While ostensibly about Jeli.io, this article makes a great case for why incident analysis is important in general and what kind of data we should be trying to gather.

John Allspaw — Adaptive Capacity Labs

A new feature roll-out resulted in impaired service for some customers.

The adaptive universe: where adaptations to challenges feed back and cause more challenges, requiring more adaptations.

Lorin Hochstein

Our first GraphQL release was twice as slow as our old REST API. Here’s how we fixed it.

Another great example of making a duplicate request to a new API in the background to test it before deploying it.

Michael P. Geraci — OkCupid
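
For readers who have not seen the background duplicate-request technique before, here is a minimal sketch of the general idea. It is not taken from the OkCupid article: the `ShadowCaller` class and its `primary`, `candidate`, and `reports` names are hypothetical, and a real service would more likely sample a fraction of traffic and send results to its metrics system rather than keep them in a list.

```python
import threading
import time


class ShadowCaller:
    """Serve every request from `primary`; replay it against `candidate` in the background.

    All names here are illustrative assumptions, not any real library's API.
    """

    def __init__(self, primary, candidate):
        self.primary = primary      # the existing, trusted code path
        self.candidate = candidate  # the new code path under test
        self.reports = []           # one dict per completed shadow call
        self._lock = threading.Lock()
        self._threads = []

    def __call__(self, request):
        start = time.perf_counter()
        result = self.primary(request)
        primary_secs = time.perf_counter() - start

        # Fire the duplicate request without making the caller wait for it.
        thread = threading.Thread(
            target=self._shadow,
            args=(request, result, primary_secs),
            daemon=True,
        )
        thread.start()
        self._threads.append(thread)
        return result  # the caller only ever sees the primary result

    def _shadow(self, request, expected, primary_secs):
        start = time.perf_counter()
        try:
            actual = self.candidate(request)
            error = None
        except Exception as exc:  # a failing candidate must never affect users
            actual, error = None, repr(exc)
        report = {
            "request": request,
            "match": error is None and actual == expected,
            "error": error,
            "primary_secs": primary_secs,
            "shadow_secs": time.perf_counter() - start,
        }
        with self._lock:
            self.reports.append(report)

    def wait(self):
        """Block until all background shadow calls have finished (handy in tests)."""
        for thread in self._threads:
            thread.join()
```

Comparing `match`, `error`, and the two timings across real traffic gives an early read on the correctness and speed of the new path while users continue to receive responses from the old one.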

Outages

Updated: December 20, 2020 — 8:29 pm
A production of Tinker Tinker Tinker, LLC