SRE Weekly Issue #250

A message from our sponsor, StackHawk:

Check out this video and side-by-side blog walkthrough about adding application security testing to your Spinnaker pipeline.
https://sthwk.com/spinnaker

Articles

Here’s how Algolia was affected by the SaltStack RCE vulnerability earlier this year and how they dealt with it.

Julien Lemoine — Algolia

Includes background information on SRE and example interview questions.

Marlo Vernon — Splunk

DNS, TLS certificates, and Unicode, among other issues, make for some great (and cringe-worthy) stories.

Adam LaGreca, with stories from Charity Majors, Matthew Fornaciari, Liran Haimovitch, Daniel Spoonhower, Lee Liu, and Tina Huang

In this story of a failover gone wrong, the team discovered that innodb_flush_log_at_trx_commit had been set incorrectly, which explained why they lost data when they weren’t expecting to.

Rajeev Rai — Razorpay

This is a nice little comic about the role of SRE. Engineer the bridge, don’t be the bridge.

Piyush Verma — Last9

Lots of great concepts about human/computer systems, including this gem:

log facts, not interpretations

Fred Hebert

In this troubleshooting story, an innocent-seeming dependency upgrade introduced a subtle but nasty bug.

Jordan Place — Transposit

Google released an update to their post-analysis for the December 14th outage involving Google OAuth.

Outages

Updated: December 27, 2020 — 8:08 pm
A production of Tinker Tinker Tinker, LLC