SRE WEEKLY – Page 51 – scalability, availability, incident response, automation

SRE Weekly Issue #254

lex

January 24, 2021

Articles

Coinbase Incident Post Mortem: January 6–7, 2021

This one’s juicy. At one point, the front-end was blocked up, so the back-end saw less traffic and scaled down. Then when the traffic came flooding back, the back-end was ill-prepared. We can all learn from this.

Coinbase

Soar: Simulation for Observability, reliAbility, and secuRity

Cloudflare has what amounts to a sophisticated staging environment for testing new code.

Yan Zhai — Cloudflare

Failing to make progress under excess request load

Sometimes rolling back doesn’t actually get you back to a good state, especially when there’s pent-up demand.

Rachel By the Bay

Google Cloud Issue Summary — Google Meet — 2021-01-08

Here’s Google’s follow-up on a Google Meet outage earlier this month.

Google

The Next Gen Database Servers Powering Let’s Encrypt

Those are some seriously big database servers.

Josh Aas and James Renken — Let’s Encrypt

Incident Management in 2021: from Basics to Best Practices

A great general overview of all aspects of incident response, including definitions and best practices.

Better Uptime

Using GPT-3 for plain language incident root cause from logs

Check out what happens when you unleash a generalized language model AI on some log messages related to an incident.

Larry Lancaster — Zebrium

Taming Operational Load with VMware CRE

The CRE team at VMware undertook a project to find and reduce toil. Note that “with VMware CRE” does not mean “with some product named VMware CRE™”.

Gustavo Franco — VMware

Slack RCA for outage on January 4, 2021

This is Slack’s RCA for their outage earlier this month. This is a great example of a complex incident with many contributing factors — certainly no single “root cause” here.

Slack

Outages

SRE Weekly Issue #253

lex

January 17, 2021

Articles

May 30 SSL incident

TLS can be such a headache.

This was an interesting situation. There was a valid path to the USERTrust RSA Certification Authority, and there was also an expired path. The browser was able to find the valid chain, but the curl was not able to find it.

Adam Surak — Algolia

Shifting Modes: Creating a Program to Support Sustained Resilience

A well-researched article on shifting emphasis from incident prevention to learning and resilience.

Incidents cannot be prevented, because incidents are the inevitable result of success.

Alex Elman

Error budgets and the legacy of Herbert Heinrich

This one’s worth reading through twice to let it sink in. It puts me in mind of this article by Will Gallego, which is another thoughtful critique of error budgets.

Here are the claims I’m going to make:

Large incidents are much more costly to organizations than small ones, so we should work to reduce the risk of large incidents.

Error budgets don’t help reduce risk of large incidents.

Lorin Hochstein

97 things every SRE should know – Part 01

This is a review of a few of the chapters of the book of the same title by Emil Stolarsky and Jaime Woo.

Have you read it too? I’d love to read your take on it!

Dean Wilson

Understanding Incidents: Three Analytical Traps

This one’s worth reading the next time you need to do an incident retrospective. The traps are:

Counterfactual reasoning

Normative language

Mechanistic reasoning

John Allspaw — Adaptive Capacity Labs

This Is the Most Underappreciated Skill for SREs

The skill in question is glue work, and I sure appreciate a good gluer when I see one.

Emily Arnott — Blameless

Building and Scaling Your SRE Team

This one starts out by defining SRE, then goes into how to define your team and fill it with people.

Julie Gunderson — PagerDuty

Outages

Fastly
- Fastly is my employer.
Slack
Tyro Payments
Signal
.ke TLD (Kenya)
Microsoft Teams, Office 365 and OneDrive
Instagram

SRE Weekly Issue #252

lex

January 10, 2021

Articles

Building On-Call Culture at GitHub

Their on-call started out as four 24-hour shifts per person interspersed throughout the year. Find out how they transitioned to a new approach in a process that spanned the start of the pandemic.

Mary Moore-Simmons — GitHub

Google Cloud Issue Summary — Google Meet — 2020-12-14

A new Meet version had a higher storage usage requirement, and a backend system filled up.

Google

WTF is Alert Fatigue

This is a webinar on alert fatigue, coming up on January 14.

Sarah Wells — Financial Times

Jamie Dobson — Container Solutions

Announcing the Security Chaos Engineering Report

The chaos experiments you do for security purposes can often expose weak points in reliability as well.

Aaron Rinehart — Verica

Kelly Shortridge — Capsul8

Little Known Ways to Better Use Your Error Budgets

Here are four nifty outside-the-box ideas to use the data you may already have.

Emily Arnott — Blameless

Lessons learned in incident management

Their custom incident management tool, DropSEV, can detect incident-worthy availability drops and file an incident automatically, obviating the need for an engineer to decide on severity level on the fly.

Joey Beyda and Ross Delinger — Dropbox

GitHub Availability Report: December 2020

This one has some additional detail on a November outage involving MySQL replication lag.

Keith Ballinger — GitHub

Outages

Slack
- My first couple hours of work this year were oddly quiet…
Heroku
Google Meet
- This is different from the one above.
FanDuel
Twitch
Coinbase
Archive of Our Own

SRE Weekly Issue #251

lex

January 3, 2021

Happy new year!

Articles

Writing Runbook Documentation When You’re An SRE

Tips and tricks for writing effective runbook documentation when you aren’t a technical writer

I like the discussion of the “Curse of Knowledge” cognitive bias.

Taylor Barnett — Transposit

SLO — From Nothing to… Production

Here’s one engineer’s SLO journey.

My main focus is on how I educated myself about SLOs and how I applied this to my organization.

Ioannis Georgoulas

How to sell SLOs to Engineering Directors

This blog is a redacted internal memo that aimed to familiarize SLOs with its audience, explain the value of an SLO culture, and describe how we would implement and roll them out.

Thomas Césaré-Herriau — Brex

Why I’ve Been Merging Microservices Back Into The Monolith At InVision

Why would you do this? It’s all about Conway’s Law.

Ben Nadel

Incident Phenomena: Shorthand Names, à la Danny Ocean

The folks at Adaptive Capacity Labs have seen a few patterns crop up over and over in their post-incident reviews. How many of these have you seen before?

John Allspaw — Adaptive Capacity Labs

Home Alone: a Post-Incident Review

Lots of complex contributing factors led to the main character being left behind in the movie Home Alone… so let’s treat it like a production incident!

Fred Hebert

Making sense of what happened is hard

This one includes a complex timeline showing the interplay of two pairs of bugs, where one in each pair masked the other.

Lorin Hochstein

Outages

Apple iCloud

SRE Weekly Issue #250

lex

December 27, 2020

Articles

Salt Incident: May 3rd 2020 Retrospective and Update

Here’s how Algolia was affected by the Salt Stack RCE vulnerability earlier this year and how they dealt with it.

Julien Lemoine — Algolia

How to Prepare for a Site Reliability Engineer Interview

Includes background information on SRE and example interview questions.

Marlo Vernon — Splunk

6 Scary Outage Stories from CTOs

DNS, TLS certificates, and Unicode, among other issues, make for some great (and cringe-worthy) stories.

Adam LaGreca, with stories from Charity Majors, Matthew Fornaciari, Liran Haimovitch, Daniel Spoonhower, Lee Liu, and Tina Huang

The Day of the RDS Multi-AZ Failover

In this story of a failover gone wrong, they discovered that they had had innodb_flush_log_at_trx_commit set incorrectly, explaining how they lost data when they weren’t expecting to.

Rajeev Rai — Razorpay

Much that we’ve gotten wrong about Site Reliability Engineering

This is a nice little comic about the role of SRE. Engineer the bridge, don’t be the bridge.

Piyush Verma — Last9

You Reap What You Code

Lots of great concepts about human/computer systems, including this gem:

log facts, not interpretations

Fred Hebert

The Mysterious Case of the Bad Gateway (502)

In this troubleshooting story, an innocent-seeming dependency upgrade introduced a subtle but nasty bug.

Jordan Place — Transposit

Google Cloud Platform

Google released an update to their post-analysis for the December 14th outage involving Google OAuth.
