SRE Weekly Issue #253

A message from our sponsor, StackHawk:

How do you know if your GraphQL API is secure? Watch StackHawk CSO Scott Gerlach walk through how to run application security tests for GraphQL-backed apps.
http://sthwk.com/graphql-webinar

Articles

TLS can be such a headache.

This was an interesting situation. There was a valid path to the USERTrust RSA Certification Authority, and there was also an expired path. The browser was able to find the valid chain, but curl was not.

Adam Surak — Algolia

A well-researched article on shifting emphasis from incident prevention to learning and resilience.

Incidents cannot be prevented, because incidents are the inevitable result of success.

Alex Elman

This one’s worth reading through twice to let it sink in. It puts me in mind of this article by Will Gallego, which is another thoughtful critique of error budgets.

Here are the claims I’m going to make:

  1. Large incidents are much more costly to organizations than small ones, so we should work to reduce the risk of large incidents.
  2. Error budgets don’t help reduce risk of large incidents.

Lorin Hochstein

This is a review of a few of the chapters of the book of the same title by Emil Stolarsky and Jaime Woo.

Have you read it too? I’d love to read your take on it!

Dean Wilson

This one’s worth reading the next time you need to do an incident retrospective. The traps are:

  1. Counterfactual reasoning
  2. Normative language
  3. Mechanistic reasoning

John Allspaw — Adaptive Capacity Labs

The skill in question is glue work, and I sure appreciate a good gluer when I see one.

Emily Arnott — Blameless

This one starts out by defining SRE, then goes into how to define your team and fill it with people.

Julie Gunderson — PagerDuty

Outages

SRE Weekly Issue #252

A message from our sponsor, StackHawk:

Interested in how you can automate application security testing with GitHub Actions? Check out this on-demand webinar from StackHawk and Snyk and see how simple it is to get started.
https://sthwk.com/stackhawk-snyk

Articles

Their on-call started out as four 24-hour shifts per person interspersed throughout the year. Find out how they transitioned to a new approach in a process that spanned the start of the pandemic.

Mary Moore-Simmons — GitHub

A new Meet version had a higher storage usage requirement, and a backend system filled up.

Google

This is a webinar on alert fatigue, coming up on January 14.

Sarah Wells — Financial Times

Jamie Dobson — Container Solutions

The chaos experiments you do for security purposes can often expose weak points in reliability as well.

Aaron Rinehart — Verica

Kelly Shortridge — Capsul8

Here are four nifty outside-the-box ideas to use the data you may already have.

Emily Arnott — Blameless

Their custom incident management tool, DropSEV, can detect incident-worthy availability drops and file an incident automatically, obviating the need for an engineer to decide on severity level on the fly.

Joey Beyda and Ross Delinger — Dropbox
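
To make the idea of automatic severity detection concrete, here is a minimal sketch of mapping an availability drop to a severity level. It is not DropSEV's actual logic: the function name, the thresholds, and the SEV labels are all assumptions made for illustration.

```python
from typing import Optional

# Hypothetical thresholds: (minimum availability drop in percentage points, severity).
# These numbers are illustrative only and are not taken from any real tool.
SEVERITY_THRESHOLDS = [
    (10.0, "SEV1"),
    (5.0, "SEV2"),
    (1.0, "SEV3"),
]


def classify_availability_drop(baseline_pct: float, current_pct: float) -> Optional[str]:
    """Return a severity label for the drop, or None if it is not incident-worthy."""
    drop = baseline_pct - current_pct
    # Thresholds are ordered from most to least severe, so the first match wins.
    for min_drop, severity in SEVERITY_THRESHOLDS:
        if drop >= min_drop:
            return severity
    return None
```

A caller could file an incident whenever the result is not None, so nobody has to pick a severity level in the middle of a response.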

This one has some additional detail on a November outage involving MySQL replication lag.

Keith Ballinger — GitHub

Outages

SRE Weekly Issue #251

Happy new year!

A message from our sponsor, StackHawk:

Still looking for a good New Year’s resolution? How about adding application security testing to your CI/CD pipeline with StackHawk? Get started with our free account.
https://sthwk.com/freeplan

Articles

Tips and tricks for writing effective runbook documentation when you aren’t a technical writer

I like the discussion of the “Curse of Knowledge” cognitive bias.

Taylor Barnett — Transposit

Here’s one engineer’s SLO journey.

My main focus is on how I educated myself about SLOs and how I applied this to my organization.

Ioannis Georgoulas

This blog is a redacted internal memo that aimed to familiarize its audience with SLOs, explain the value of an SLO culture, and describe how we would implement and roll them out.

Thomas Césaré-Herriau — Brex

Why would you do this? It’s all about Conway’s Law.

Ben Nadel

The folks at Adaptive Capacity Labs have seen a few patterns crop up over and over in their post-incident reviews. How many of these have you seen before?

John Allspaw — Adaptive Capacity Labs

Lots of complex contributing factors led to the main character being left behind in the movie Home Alone… so let’s treat it like a production incident!

Fred Hebert

This one includes a complex timeline showing the interplay of two pairs of bugs, where one in each pair masked the other.

Lorin Hochstein

Outages

SRE Weekly Issue #250

A message from our sponsor, StackHawk:

Check out this video and side-by-side blog walkthrough about adding application security testing to your Spinnaker Pipeline.
https://sthwk.com/spinnaker

Articles

Here’s how Algolia was affected by the SaltStack RCE vulnerability earlier this year and how they dealt with it.

Julien Lemoine — Algolia

Includes background information on SRE and example interview questions.

Marlo Vernon — Splunk

DNS, TLS certificates, and Unicode, among other issues, make for some great (and cringe-worthy) stories.

Adam LaGreca, with stories from Charity Majors, Matthew Fornaciari, Liran Haimovitch, Daniel Spoonhower, Lee Liu, and Tina Huang

In this story of a failover gone wrong, they discovered that they had had innodb_flush_log_at_trx_commit set incorrectly, explaining how they lost data when they weren’t expecting to.

Rajeev Rai — Razorpay
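
For readers unfamiliar with the setting, the sketch below summarizes how the three documented values of innodb_flush_log_at_trx_commit trade durability for speed. The function and enum names are hypothetical, and it does not reflect which value was actually in use in the story above.

```python
from enum import Enum


class Crash(Enum):
    MYSQLD_PROCESS = "mysqld process crash"
    HOST = "operating system crash or power loss"


def may_lose_committed_transactions(flush_setting: int, crash: Crash) -> bool:
    """Rough model of whether recently committed transactions can be lost."""
    if flush_setting == 1:
        # Redo log is written and flushed to disk at every commit.
        return False
    if flush_setting == 2:
        # Written at commit, flushed about once per second: the data survives
        # a mysqld crash (it is in the OS cache) but not a host crash.
        return crash is Crash.HOST
    if flush_setting == 0:
        # Written and flushed about once per second: any crash can lose
        # roughly the last second of commits.
        return True
    raise ValueError(f"unsupported value: {flush_setting}")
```

This is a simplified model. Even with a value of 1, durability still depends on the storage stack honoring flushes, and replication safety depends on other settings as well.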

This is a nice little comic about the role of SRE. Engineer the bridge, don’t be the bridge.

Piyush Verma — Last9

Lots of great concepts about human/computer systems, including this gem:

log facts, not interpretations

Fred Hebert

In this troubleshooting story, an innocent-seeming dependency upgrade introduced a subtle but nasty bug.

Jordan Place — Transposit

Google released an update to their post-analysis for the December 14th outage involving Google OAuth.

Outages

SRE Weekly Issue #249

I’m having a hard time wrapping my head around the fact that this issue marks 5 years of SRE Weekly. A massive thank you to everyone who writes the content I feature here every week, and also to all of you that subscribe!

A message from our sponsor, StackHawk:

Did you catch the news? StackHawk now offers a free Developer Plan. Getting up and running with application security testing has never been easier. Give it a try.
https://sthwk.com/freeplan

Articles

Every service needs a couple of big hammers that are easy to swing.

Jennifer Mace — O’Reilly and Google

Answer: automation. Lots of automation. And automation of the automation.

Fred Lin, Harish Dattatraya Dixit, and Sriram Sankar — Facebook

Oh, how quaint! This article was written back when people traveled for the holidays.

Ashley Roof — Transposit

Surprise! Fortunately, there are some ways to fix this limitation.

Heidi Howard, Ittai Abraham — Decentralized Thoughts

A common question when a company is implementing incident management is: why do we need this process?

It turns out that the easiest way to answer this question is to look at the world of unsuccessful incident management.

Kintaba

Whether you’re new to Just Culture or an old hand, there’s a lot of great detail in this article.

Tory Thompson — Firehouse

Not sold yet on full service ownership for development teams? This interview may help.

Vivian Chan — PagerDuty

While ostensibly about Jeli.io, this article makes a great case for why incident analysis is important in general and what kind of data we should be trying to gather.

John Allspaw — Adaptive Capacity Labs

A new feature roll-out resulted in impaired service for some customers.

The adaptive universe: where adaptations to challenges feed back and cause more challenges, requiring more adaptations.

Lorin Hochstein

Our first GraphQL release was twice as slow as our old REST API. Here’s how we fixed it.

Another great example of making a duplicate request to a new API in the background to test it before deploying it.

Michael P. Geraci — OkCupid
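
The background duplicate-request technique mentioned above can be sketched roughly as follows. This is not OkCupid's implementation: serve_with_shadow, old_api, new_api, and the mismatches list are hypothetical names, and a real system would probably sample traffic and emit metrics rather than append to a list.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Any, Callable, List, Tuple

# Small shared pool so shadow calls never block the user-facing response.
_executor = ThreadPoolExecutor(max_workers=4)


def serve_with_shadow(
    request: Any,
    old_api: Callable[[Any], Any],
    new_api: Callable[[Any], Any],
    mismatches: List[Tuple[Any, Any, Any]],
) -> Tuple[Any, Future]:
    """Answer from the old API, then replay the request against the new one."""
    response = old_api(request)

    def _shadow() -> None:
        try:
            candidate = new_api(request)
        except Exception as exc:
            # A failure in the new API is recorded, never shown to the user.
            mismatches.append((request, response, repr(exc)))
            return
        if candidate != response:
            mismatches.append((request, response, candidate))

    # The future is returned only so callers (or tests) can wait on it if needed.
    return response, _executor.submit(_shadow)
```

The user only ever sees the old API's answer, so the new API can be measured and compared under real traffic before it is switched on.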

Outages
