SRE Weekly Issue #250

A message from our sponsor, StackHawk:

Check out this video and side-by-side blog walkthrough on adding application security testing to your Spinnaker pipeline.


Here’s how Algolia was affected by the SaltStack RCE vulnerability earlier this year and how they dealt with it.

Julien Lemoine — Algolia

Includes background information on SRE and example interview questions.

Marlo Vernon — Splunk

DNS, TLS certificates, and Unicode, among other issues, make for some great (and cringe-worthy) stories.

Adam LaGreca, with stories from Charity Majors, Matthew Fornaciari, Liran Haimovitch, Daniel Spoonhower, Lee Liu, and Tina Huang

In this story of a failover gone wrong, the team discovered that innodb_flush_log_at_trx_commit had been set incorrectly, which explains why they lost data when they weren’t expecting to.

Rajeev Rai — Razorpay
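
For readers unfamiliar with the setting named above, here is a minimal Python sketch of what the three documented values of innodb_flush_log_at_trx_commit mean for durability. The value Razorpay had configured is not stated in this summary, and FLUSH_POLICIES, describe, and is_fully_durable are hypothetical helper names for illustration, not part of MySQL.

```python
# Documented behaviour of MySQL's innodb_flush_log_at_trx_commit values.
# The dictionary and helper names are illustrative assumptions.
FLUSH_POLICIES = {
    0: "redo log written and flushed about once per second; "
       "a mysqld crash can lose roughly the last second of commits",
    1: "redo log written and flushed to disk at every commit; "
       "the default, required for full ACID durability",
    2: "redo log written at every commit but flushed about once per second; "
       "an OS crash or power loss can lose roughly the last second of commits",
}


def describe(value: int) -> str:
    """Return the documented durability behaviour for a setting value."""
    if value not in FLUSH_POLICIES:
        raise ValueError(f"unsupported value: {value!r}")
    return FLUSH_POLICIES[value]


def is_fully_durable(value: int) -> bool:
    """True only for the value that flushes the redo log on every commit."""
    describe(value)  # reuse the validation above
    return value == 1


if __name__ == "__main__":
    for value in sorted(FLUSH_POLICIES):
        print(value, is_fully_durable(value), describe(value))
```

A check like this could run against the live value reported by SHOW VARIABLES before a planned failover, so that a relaxed setting is a known trade-off rather than a surprise.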

This is a nice little comic about the role of SRE. Engineer the bridge, don’t be the bridge.

Piyush Verma — Last9

Lots of great concepts about human/computer systems, including this gem:

log facts, not interpretations

Fred Hebert
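
One way to read "log facts, not interpretations" in code is sketched below in Python. This is an illustration under assumptions, not an example taken from the linked article: fact_record is a hypothetical helper, and the host name, port, and error number are made-up sample values.

```python
import json
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("example")


def fact_record(event: str, **observed) -> str:
    """Serialize an event name plus the values that were actually observed."""
    return json.dumps({"event": event, **observed}, sort_keys=True)


# An interpretation guesses at a cause the code cannot see:
#     logger.error("database is down")
# A fact records what was attempted and what came back, leaving the
# diagnosis to whoever reads the log with more context.
line = fact_record(
    "connect_failed",
    host="db-1.example.internal",  # sample value, not a real host
    port=5432,
    errno=111,
    attempt=3,
    timeout_s=2.0,
)
logger.error(line)
```

The factual record stays accurate even if the eventual root cause turns out to be a firewall rule or a DNS change rather than the database itself.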

In this troubleshooting story, an innocent-seeming dependency upgrade introduced a subtle but nasty bug.

Jordan Place — Transposit

Google released an update to their post-analysis for the December 14th outage involving Google OAuth.


Updated: December 27, 2020 — 8:08 pm
A production of Tinker Tinker Tinker, LLC