SRE Weekly Issue #228

SRE From Home is back! It’s happening this Thursday, and I’ll be on the Ask an SRE panel answering your questions. And don’t miss the talks by lots of great folks, some of whom have had articles featured here previously!

A message from our sponsor, StackHawk:

StackHawk is built on the open source ZAP application security scanner, the most widely used AppSec tool out there. Now the founder of ZAP has joined our team to bring AppSec to developers. Read all about it.
https://www.stackhawk.com/blog/zap-founder-decides-to-join-stackhawk?utm_source=SREWeekly

Articles

They don’t. They just don’t.

[…] as deployments grow beyond a certain size it’s almost impossible to execute them successfully.

Alex Yates — Octopus Deploy

Whoops, forgot to include this one last week.

On June 30, Google’s email delivery service was targeted in what we believe was an attempt to bypass spam classification. The result was delayed message processing and increased message queuing.

My favorite part is the focus on blame awareness:

But it’s not enough to just be blameless—it’s also important to be blame-aware. Being blame-aware means that we are aware of our biases and how they may impact our ability to view an incident impartially.

Isabella Pontecorvo — PagerDuty

Netflix has a team dedicated to the overall reliability of their service.

Practically speaking, this includes activities such as systemic risk identification, handling the lifecycle of an incident, and reliability consulting.

Hank Jacobs — Netflix

Another good reference if you’re looking to bootstrap SRE at your organization.

Rich Burroughs — FireHydrant

Bill Duncan’s back with an easy and very close approximation for the “Tail at Scale” formula. The question it answers is: how many nines do you need on all of your backend microservices for X nines on the frontend?

Bill Duncan
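To make the question concrete, here is a minimal sketch of the arithmetic. It assumes a frontend request succeeds only if all N backends succeed and that backend failures are independent. The first-order approximation shown (split the allowed failure fraction evenly across the backends) is a standard one and is not necessarily the approximation from the linked article; the function names are illustrative.

```python
import math


def required_backend_availability(frontend_target: float, n_backends: int) -> float:
    """Exact per-backend availability, assuming n independent backends in series.

    If each backend has availability a, the frontend sees a ** n,
    so a must be frontend_target ** (1 / n).
    """
    return frontend_target ** (1.0 / n_backends)


def approx_backend_availability(frontend_target: float, n_backends: int) -> float:
    """First-order approximation: divide the allowed failure fraction by n."""
    return 1.0 - (1.0 - frontend_target) / n_backends


def nines(availability: float) -> float:
    """Express an availability as a number of nines (0.999 is 3 nines)."""
    return -math.log10(1.0 - availability)


if __name__ == "__main__":
    target = 0.999  # three nines on the frontend
    for n in (2, 10, 100):
        exact = required_backend_availability(target, n)
        approx = approx_backend_availability(target, n)
        print(f"{n:>3} backends: exact {nines(exact):.3f} nines, "
              f"approx {nines(approx):.3f} nines")
```

Under these assumptions, three nines on a frontend that calls 10 backends needs roughly four nines on each backend. The approximation always errs slightly on the safe side, since it never asks for less availability than the exact formula.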

Tons of great links in here with enticing descriptions to make you want to read them. Includes books, tools, hiring, certification, and general SRE goodness.

Emily Arnott — Blameless

SRE is all about keeping the user experience working, and working with product-focused folks can really help. For more on this, check out my former coworker Jen Wohlner’s awesome SRECon19 talk on SRE & product management.

Samantha Coffman — HelloFresh

Outages

  • Cloudflare
    • Cloudflare had a 50% drop in traffic served by their network subsequent to a BGP issue. Linked is their analysis including snippets of router configurations. Lots of services suffered contemporaneous outages possibly stemming from Cloudflare’s, including Discord, Postmates, Hosted Graphite, and DownDetector.
      John Graham-Cumming — Cloudflare
  • Twitter
    • Twitter had a major security breach, and as part of their response, they temporarily cut off large parts of their service. Click for their post about what happened.
  • GitHub
  • WhatsApp
  • Hulu
  • Snapchat
  • Microsoft Outlook
    • Notably, the outage involved the Outlook application that people run on their computer, not the cloud version.
  • Fastly

SRE Weekly Issue #227

A message from our sponsor, StackHawk:

When a team introduces security bugs, they don’t know because nothing tells them. We test for everything else… why not security bugs?
https://www.stackhawk.com/blog/how-security-based-development-should-work?utm_source=SREWeekly

Articles

This is the first of a pair of articles this week on a major Slack outage in May. This one explores the technical side, with a lot of juicy details on what happened and how.

Laura Nolan — Slack

This is the companion article that describes Slack’s incident response process, using the same incident as a case study.

Ryan Katkov — Slack

The author saw room for improvement in the retrospective process at Indeed. The article explains the recommendations they made and why, including de-emphasizing the generation of remediation items in favor of learning.

Alex Elman

The datacenter was purposefully switched to generator power during planned power maintenance, but unfortunately the fuel delivery system failed.

This is a good primer on the ins and outs of running a post-incident analysis.

Anusuya Kannabiran — Squadcast

This article goes through an interesting technique for setting up SLO metrics and alerts in GCP using Terraform and OpenCensus.

Cindy Quach — Google

GitHub is committing to publishing a report on their availability each month with detail on incidents. This intro includes the reports for May and June with a description of 4 incidents.

Keith Ballinger — GitHub

This is neat: Blameless transitioned from “startup mode” toward an SRE methodology, becoming customer 0 of their own product in the process.

Blameless

Outages

SRE Weekly Issue #226

A message from our sponsor, StackHawk:

When a team introduces security bugs, they don’t know because nothing tells them. We test for everything else… why not security bugs?
https://www.stackhawk.com/blog/how-security-based-development-should-work?utm_source=SREWeekly

Articles

This is an article version of an interview with Dr. Danielle Ofri, author of a new book When We Do Harm, on NPR’s Fresh Air. I especially loved the part about near misses.

Bridget Bentz, Molly Seavy-Nesper, Deborah Franklin, Sam Briger, and Thea Chaloner — NPR

Maintenance of the logging system had unintended downstream effects including log loss and failure of the system that manages dynos.

In this incident, a TLS certificate was deployed without its intermediate, resulting in failures for some clients.

I wrote this after attending the Resilience Engineering Association’s webinar with panelists Dr. Richard Cook, John Allspaw, and Nora Jones, moderated by Laura Maguire. Once the recording is posted, I highly recommend watching!

Lex Neva

As SREs, we need to be laser focused on the user’s experience. Our SLIs should reflect that.

Emily Arnott — Blameless

This two-part series is an in-depth look at how Twitter adopted SRE, before SRE was even a thing.

Blameless

Outages

SRE Weekly Issue #225

A message from our sponsor, StackHawk:

Application security is shifting to a model where the engineers who write the code also take ownership of the security. Read our docs to learn more about how StackHawk makes that happen.
https://docs.stackhawk.com?utm_source=SREWeekly

Articles

This suggests an upcoming shift in our field:

50 percent of SREs believe they will be working remotely post COVID-19, as compared to only 20 percent prior to the pandemic.

Kameerath Kareem — Catchpoint

BONUS CONTENT: An outside take on the survey results is here (Mike Vizard — DevOps.com).

No one person can (or should) know everything. How do we allocate expertise and build connections in order to maximize resilience and adaptive capacity?

Will Gallego

A new feature was accidentally rolled out to too wide an audience, causing log message loss.

Heroku

[…] one slow block device can affect the performance of processes even when those processes don’t use the slow block device.

Kalyanasundaram Somasundaram — LinkedIn

Should you count scheduled maintenance against your error budget? It depends.

Jesus Climent — Google

An investigation in response to three incidents led to this stark conclusion about Cassandra’s “counter columns” feature:

In fact, they don’t appear to have any properties that make them a useful primitive for building predictable distributed systems.

Paddy Byers — Ably

This article explains why we should have cost data at our fingertips as we design cloud-based systems.

[…] a well-architected system is often a cost-efficient system.

CloudZero

This is a new concept to me, and I really like it:

Capacity for maneuver (CfM) is a measure of how much adaptability or room to respond to a new challenge that a given part of the system has, whether a person or autonomous agent.

Amir B. Farjadian, Benjamin Thomsen, Anuradha M. Annaswamy, and David D. Woods (original paper)

Thai Wood — Resilience Roundup (summary)

Outages

SRE Weekly Issue #224

Happy Juneteenth (a couple days late)! Let’s all strengthen the SRE profession by working to improve inclusion and diversity.

A message from our sponsor, StackHawk:

Do you use GraphQL? Learn how to add security testing to your GraphQL backed applications with this walkthrough.
https://www.stackhawk.com/blog/automated-graphql-security-testing?utm_source=SREWeekly

Articles

Diversity and inclusion make our companies stronger and more effective. This article has lots of links with evidence of why diversity matters and how to get your company on the road to improvement.

Sara Kassabian — GitLab

Starting on the road to chaos engineering is about more than just figuring out what experiments to run. Spreading knowledge and gaining buy-in before you start is critical.

Deven Samant — Business 2 Community

DNS propagation and inconsistent resolver behavior have bitten me so many times in my career.

Julia Evans

I don’t often have enough time to listen to podcasts, but for these two, I had to make time. Jaime and Emil talk about post-incident reviews, geeking out about incidents, and their philosophy on publishing a zine.

Scott McAllister — Page It To the Limit Podcast (PagerDuty)

As so often happens, their attempts to fix a problem caused other problems. Has that happened to you? I’d love to read your story about it!

This article opens with a great story about how to help someone feel better when they are a contributing factor in an outage.

Tanya Reilly

Outages

A production of Tinker Tinker Tinker, LLC