SRE Weekly Issue #256

A message from our sponsor, StackHawk:

Register now for the first-ever ZAPCon taking place March 9th. The free event will focus on OWASP ZAP and application security best practices. You won't want to miss it!
http://sthwk.com/zapcon-sre-weekly

Articles

Here’s a blog post from Slack giving even more information about what went wrong on January 4. Bravo, Slack, there’s a lot in here for us to learn from.

Laura Nolan — Slack

This academic paper from Facebook explains how they release code without disrupting active connections, even for a small number of users.

Usama Naseer, Luca Niccolini, Udip Pant, Alan Frindell, Ranjeeth Dasineni, and Theophilus A. Benson — Facebook

Another lesson we can learn from aviation: have one place where engineers can find out about temporary infrastructure changes that are important.

Bill Duncan

Coinbase posted this detailed analysis of their January 29th incident.

Coinbase

Interesting thesis: a company moving into the cloud is in a unique position to adopt SRE practices — and better situated than cloud-first companies.

Tina Huang (CTO, Transposit) — Forbes

We need to push past surface-level mitigation of an incident and really dig in and learn.

Darrell Pappa — Blameless

GitHub’s database failed in a manner that wasn’t detected by their automated failover system.

Keith Ballinger — GitHub

LinkedIn published their SRE training documentation in the form of a full curriculum covering a range of topics.

Akbar KM and Kalyanasundaram Somasundaram — LinkedIn

Your code may be designed to handle 64-bit integers, but what if a library (such as a JSON decoder) converts them to floating point numbers?

rachelbythebay
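
Not from the article itself, but here's a minimal Python sketch of the failure mode: any decoder that stores numbers as IEEE-754 doubles can only represent integers exactly up to 2**53, so larger 64-bit values get silently rounded to a nearby representable number.

```python
# Illustration only: what happens when a 64-bit integer passes through a
# float64, as JSON decoders in some languages do for every number.

big_id = 2**63 - 25                 # e.g. a snowflake-style 64-bit ID
as_float = float(big_id)            # what a float64-based decoder stores
round_tripped = int(as_float)

print(big_id)                       # 9223372036854775783
print(round_tripped)                # 9223372036854775808 -- a different ID
print(big_id == round_tripped)      # False

# Doubles have a 53-bit mantissa, so exactness is only guaranteed up to 2**53.
print(float(2**53) == float(2**53 + 1))   # True -- two distinct values collide
```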

Outages

SRE Weekly Issue #255

A message from our sponsor, StackHawk:

With StackHawk’s new GitHub Action, you can integrate AppSec testing directly into your GitHub CI/CD pipeline. See how:
http://sthwk.com/appsec-github-action

Articles

It really should! Even Google is much more accurately described as a “service” than a “site”.

Chris Riley — Splunk

There are migrations, and then there’s the time between migrations.

Will Larson

2020 was the year mainstream folks realized how important reliability is. Will overall reliability improve in 2021?

Robert Ross — FireHydrant

I love this for the click-bait title and the content. An HAProxy feature designed for HA had surprising and unexpected behavior.

Andre Newman — GitLab

Twilio builds customer trust through a reliability culture, customer empathy, and accountability.

Andre Newman — Gremlin

This WTFinar covers the basics of SRE, focusing on service level indicators (SLIs) and service level objectives (SLOs) – the building blocks of error budgets.

Container Solutions
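
As a quick refresher (my own sketch, not taken from the webinar), the arithmetic connecting an SLO to its error budget fits in a few lines of Python:

```python
# A sketch of the standard SLO-to-error-budget arithmetic (illustrative values).

slo_target = 0.999                  # 99.9% availability SLO
window_minutes = 30 * 24 * 60       # 30-day rolling window

error_budget_minutes = window_minutes * (1 - slo_target)
print(f"Error budget: {error_budget_minutes:.1f} minutes per 30 days")   # 43.2

# The same idea with a request-based SLI:
total_requests = 10_000_000
good_requests = 9_995_500
sli = good_requests / total_requests                      # measured SLI
budget_remaining = (sli - slo_target) / (1 - slo_target)  # fraction of budget left
print(f"SLI: {sli:.4%}, error budget remaining: {budget_remaining:.0%}")  # 55%
```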

Outages

SRE Weekly Issue #254

A message from our sponsor, StackHawk:

Need to run a standalone Kotlin app as a fat jar in a Gradle project? Check out how we handled that!
http://sthwk.com/kotlin-with-gradle

Articles

This one’s juicy. At one point, the front-end was blocked up, so the back-end saw less traffic and scaled down. Then when the traffic came flooding back, the back-end was ill-prepared. We can all learn from this.

Coinbase
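
A hypothetical mitigation sketch (mine, not Coinbase's): if an autoscaler sizes a back-end tier purely from observed traffic, an upstream blockage makes that signal lie about real demand, so a floor on capacity keeps the tier ready for the flood when requests return. The function and numbers below are invented for illustration.

```python
import math

def desired_replicas(observed_rps: float,
                     rps_per_replica: float,
                     min_replicas: int,
                     max_replicas: int) -> int:
    """Scale on observed load, but never drop below a floor sized for expected demand."""
    wanted = math.ceil(observed_rps / rps_per_replica)
    return max(min_replicas, min(max_replicas, wanted))

# Front-end blocked up: observed traffic collapses, but the floor keeps
# enough capacity warm to absorb the rush when traffic comes flooding back.
print(desired_replicas(observed_rps=50, rps_per_replica=100,
                       min_replicas=20, max_replicas=200))   # -> 20, not 1
```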

Cloudflare has what amounts to a sophisticated staging environment for testing new code.

Yan Zhai — Cloudflare

Sometimes rolling back doesn’t actually get you back to a good state, especially when there’s pent-up demand.

Rachel By the Bay

Here’s Google’s follow-up on a Google Meet outage earlier this month.

Google

Those are some seriously big database servers.

Josh Aas and James Renken — Let’s Encrypt

A great general overview of all aspects of incident response, including definitions and best practices.

Better Uptime

Check out what happens when you unleash a generalized language model AI on some log messages related to an incident.

Larry Lancaster — Zebrium

The CRE team at VMware undertook a project to find and reduce toil. Note that “with VMware CRE” does not mean “with some product named VMware CRE™”.

Gustavo Franco — VMware

This is Slack’s RCA for their outage earlier this month. It’s a great example of a complex incident with many contributing factors — certainly no single “root cause” here.

Slack

Outages

SRE Weekly Issue #253

A message from our sponsor, StackHawk:

How do you know if your GraphQL API is secure? Watch StackHawk CSO Scott Gerlach walk through how to run application security tests for GraphQL-backed apps.
http://sthwk.com/graphql-webinar

Articles

TLS can be such a headache.

This was an interesting situation. There was a valid path to the USERTrust RSA Certification Authority, and there was also an expired path. The browser was able to find the valid chain, but curl was not.

Adam Surak — Algolia
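
Whether a given client finds the valid chain depends on its TLS library and version, so it can help to test from the same stack your services use. Here's a small Python sketch (the hostname is a placeholder) that tries a handshake with the platform's default trust store and reports what the verifier decided:

```python
import socket
import ssl

def check_tls(host: str, port: int = 443) -> None:
    """Attempt a TLS handshake using this interpreter's default trust store."""
    ctx = ssl.create_default_context()
    try:
        with socket.create_connection((host, port), timeout=5) as sock:
            with ctx.wrap_socket(sock, server_hostname=host) as tls:
                print(f"{host}: OK, negotiated {tls.version()}")
    except ssl.SSLCertVerificationError as err:
        # Older OpenSSL builds may fail here if they follow the expired
        # cross-signed path instead of the still-valid one.
        print(f"{host}: verification failed: {err.verify_message}")

check_tls("example.com")   # placeholder host
```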

A well-researched article on shifting emphasis from incident prevention to learning and resilience.

Incidents cannot be prevented, because incidents are the inevitable result of success.

Alex Elman

This one’s worth reading through twice to let it sink in. It puts me in mind of this article by Will Gallego, which is another thoughtful critique of error budgets.

Here are the claims I’m going to make:

  1. Large incidents are much more costly to organizations than small ones, so we should work to reduce the risk of large incidents.
  2. Error budgets don’t help reduce risk of large incidents.

Lorin Hochstein

This is a review of a few chapters of the book of the same title by Emil Stolarsky and Jaime Woo.

Have you read it too? I’d love to read your take on it!

Dean Wilson

This one’s worth reading the next time you need to do an incident retrospective. The traps are:

  1. Counterfactual reasoning
  2. Normative language
  3. Mechanistic reasoning

John Allspaw — Adaptive Capacity Labs

The skill in question is glue work, and I sure appreciate a good gluer when I see one.

Emily Arnott — Blameless

This one starts out by defining SRE, then goes into how to define your team and fill it with people.

Julie Gunderson — PagerDuty

Outages

SRE Weekly Issue #252

A message from our sponsor, StackHawk:

Interested in how you can automate application security testing with GitHub Actions? Check out this on-demand webinar from StackHawk and Snyk and see how simple it is to get started.
https://sthwk.com/stackhawk-snyk

Articles

Their on-call started out as four 24-hour shifts per person interspersed throughout the year. Find out how they transitioned to a new approach in a process that spanned the start of the pandemic.

Mary Moore-Simmons — GitHub

A new Meet version had a higher storage usage requirement, and a backend system filled up.

Google

This is a webinar on alert fatigue, coming up on January 14.

Sarah Wells — Financial Times

Jamie Dobson — Container Solutions

The chaos experiments you do for security purposes can often expose weak points in reliability as well.

Aaron Rinehart — Verica

Kelly Shortridge — Capsule8

Here are four nifty outside-the-box ideas to use the data you may already have.

Emily Arnott — Blameless

Their custom incident management tool, DropSEV, can detect incident-worthy availability drops and file an incident automatically, obviating the need for an engineer to decide on severity level on the fly.

Joey Beyda and Ross Delinger — Dropbox
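
Purely as a hypothetical sketch of the idea (DropSEV's real thresholds and interfaces aren't described here): map a measured availability drop onto a severity level and file automatically when it crosses a threshold.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SevThreshold:
    severity: str
    below: float   # file at this severity if availability drops below this value

# Invented thresholds, most severe first.
THRESHOLDS = [
    SevThreshold("SEV-1", 0.95),
    SevThreshold("SEV-2", 0.99),
    SevThreshold("SEV-3", 0.999),
]

def classify(availability: float) -> Optional[str]:
    """Map a measured availability level to an incident severity, if any."""
    for t in THRESHOLDS:
        if availability < t.below:
            return t.severity
    return None

def maybe_file_incident(service: str, availability: float) -> None:
    sev = classify(availability)
    if sev is not None:
        # A real system would call its incident-management API here.
        print(f"Filing {sev} for {service}: availability={availability:.3%}")

maybe_file_incident("metadata-api", 0.981)   # -> Filing SEV-2 ...
```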

This one has some additional detail on a November outage involving MySQL replication lag.

Keith Ballinger — GitHub

Outages

A production of Tinker Tinker Tinker, LLC