General

SRE Weekly Issue #225

A message from our sponsor, StackHawk:

Application security is shifting to a model where the engineers who write the code also take ownership of the security. Read our docs to learn more about how StackHawk makes that happen.
https://docs.stackhawk.com?utm_source=SREWeekly

Articles

This suggests an upcoming shift in our field:

50 percent of SREs believe they will be working remotely post COVID-19, as compared to only 20 percent prior to the pandemic.

Kameerath Kareem — Catchpoint

BONUS CONTENT: An outside take on the survey results is here (Mike Vizard — DevOps.com).

No one person can (or should) know everything. How do we allocate expertise and build connections in order to maximize resilience and adaptive capacity?

Will Gallego

A new feature was accidentally rolled out to too wide an audience, causing log message loss.

Heroku

[…] one slow block device can affect the performance of processes even when those processes don’t use the slow block device.

Kalyanasundaram Somasundaram — LinkedIn

Should you count scheduled maintenance against your error budget? It depends.

Jesus Climent — Google

An investigation in response to three incidents led to this stark conclusion about Cassandra’s “counter columns” feature:

In fact, they don’t appear to have any properties that make them a useful primitive for building predictable distributed systems.

Paddy Byers — Ably

This article explains why we should have cost data at our fingertips as we design cloud-based systems.

[…] a well-architected system is often a cost-efficient system.

CloudZero

This is a new concept to me, and I really like it:

Capacity for maneuver (CfM) is a measure of how much adaptability or room to respond to a new challenge that a given part of the system has, whether a person or autonomous agent.

Amir B. Farjadian, Benjamin Thomsen, Anuradha M. Annaswamy, and David D. Woods (original paper)

Thai Wood — Resilience Roundup (summary)

Outages

SRE Weekly Issue #224

 

Happy Juneteenth (a couple days late)! Let’s all work to strengthen the SRE profession by improving inclusion and diversity.

A message from our sponsor, StackHawk:

Do you use GraphQL? Learn how to add security testing to your GraphQL backed applications with this walkthrough.
https://www.stackhawk.com/blog/automated-graphql-security-testing?utm_source=SREWeekly

Articles

Diversity and inclusion make our companies stronger and more effective. This article has lots of links with evidence of why diversity matters and how to get your company on the road to improvement.

Sara Kassabian — GitLab

Starting on the road to chaos engineering is about more than just figuring out what experiments to run. Spreading knowledge and gaining buy-in before you start is critical.

Deven Samant — Business 2 Community

DNS propagation and inconsistent resolver behavior have bitten me so many times in my career.

Julia Evans

I don’t often have enough time to listen to podcasts, but when it’s these two, I have to. Jaime and Emil talk about post-incident reviews, geeking out about incidents, and their philosophy on publishing a zine.

Scott McAllister — Page It To the Limit Podcast (PagerDuty)

As so often happens, their attempts to fix a problem caused other problems. Has that happened to you? I’d love to read your story about it!

This article opens with a great story about how to help someone feel better when they are a contributing factor in an outage.

Tanya Reilly

Outages

SRE Weekly Issue #223

A message from our sponsor, StackHawk:

DevSecCon24 starts tonight at 10pm ET and runs for 24 hours. Tune in for great talks on building and deploying secure, resilient software. Grab free tickets at the link here, and visit StackHawk’s virtual booth to get a T-shirt.
https://www.eventbrite.com/e/devseccon24-virtual-conference-tickets-94550734793?discount=StackHawk20

Articles

I’ve used this technique in the past with a single-page app and a highly-cacheable API, to ensure stability even when the backend goes down.

Patrick Hamann

Full disclosure: Fastly is my employer.

Here’s a deep dive into how your CA’s certificate can affect your application’s reliability — at least in the eyes of your customers.

Scott Helme

Here’s Coinbase’s followup from their outage last week.

Michael de Hoog — Coinbase

Kyle Kingsbury recently did an analysis of PostgreSQL 12.3 and found that under certain conditions it violated guarantees it makes about transactions, including violations of the serializability transaction isolation level.

I thought it would be fun to use one of his counterexamples to illustrate what serializable means.

Lorin Hochstein

Failure mode and effects analysis (FMEA) is a decades-old method for identifying all possible failures in a design, a manufacturing or assembly process, or a product or service.

If you’ve been tasked with applying FMEA in your SRE work, this article will get you started.

Matthew Helmke

Outages

SRE Weekly Issue #222

A message from our sponsor, StackHawk:

The last thing we need is more noise from more tooling. With the new Findings Management feature, you can add AppSec tests to your CI pipeline without being inundated with alerts.
https://www.stackhawk.com/blog/appsec-findings-management?utm_source=SREWeekly

Articles

This article in a nutshell:

Kolton Andrus — Gremlin

I hadn’t heard of this distinction before. If you haven’t either, click through to find out more.

Ayende Rahien — RavenDB

In our experience, the three big sources of production stress are:

  • Toil
  • Bad monitoring
  • Immature incident handling procedures

Cheryl Kang — Google

ProPublica picks apart the incident in exhaustive detail, showing how multiple problems interwoven in the organization contributed to this tragedy.

Robert Faturechi, Megan Rose and T. Christian Miller — ProPublica

There’s a great review of Rasmussen’s safety boundary model, which I wasn’t previously familiar with. A system moves between three boundaries:

  • the boundary to economic failure
  • the boundary of unacceptable work load
  • the boundary of functionally acceptable performance

Lorin Hochstein

This one includes a really nifty graph showing how reliable your N backend microservices need to be in order to hit a given reliability target R.

Bill Duncan
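
For a rough feel of the numbers behind a graph like that, here is a minimal sketch. It assumes the simplest possible model, which may not match the article's: a request touches all N services in series, failures are independent, and every service has the same reliability, so each one needs at least R^(1/N). The function name `required_per_service` is made up for this illustration.

```python
def required_per_service(target: float, n: int) -> float:
    """Reliability each of n serial, independent services needs so that
    the product of all n reliabilities equals the overall target."""
    if not 0.0 < target <= 1.0:
        raise ValueError("target must be in (0, 1]")
    if n < 1:
        raise ValueError("n must be at least 1")
    # With equal per-service reliability r, overall reliability is r ** n,
    # so solving r ** n == target gives r == target ** (1 / n).
    return target ** (1.0 / n)


if __name__ == "__main__":
    # Assumed example target of 99.9% overall reliability.
    for n in (1, 5, 10, 50):
        r = required_per_service(0.999, n)
        print(f"N={n:>2}: each service needs {r:.6f}")
```

Under these assumptions, the per-service requirement climbs quickly as N grows, which is the intuition the graph conveys. Real systems with retries, fallbacks, or correlated failures will behave differently.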

Here are the results of the survey I linked here a couple weeks ago. There are some interesting and surprising results, well worth a read.

Rich Burroughs — FireHydrant

A commonly-used CA’s Root certificate expired, causing some havoc. Even though Sectigo did everything right, some software didn’t handle the transition to the new root well.

Paul Ducklin — Naked Security

Outages

SRE Weekly Issue #221

Don’t forget, Catchpoint’s SRE From Home event is happening this Friday. The speaker list has some names you’ll recognize from articles linked here in previous issues. See you there!

A message from our sponsor, StackHawk:

CI/CD has changed software engineering. Application security, however, has been left behind. Why doesn’t your CI pipeline have AppSec checks?
https://www.stackhawk.com/blog/ci-pipeline-security-bug-testing?utm_source=SREWeekly

Articles

Casey Rosenthal tips over a herd of sacred cows with this talk that opens with 6 myths about reliable systems.

Casey Rosenthal — Verica

This is framed as advice for talking about scale during a job interview, and it’s a pretty good read even if you’re not interviewing right now.

Denise Yu

John Allspaw says we should ask “how”, not “why”. Hollnagel and Woods say that we should find out why a joint cognitive system does what it does, rather than how. Who’s right?

Lorin Hochstein

Yay, another issue! This one revolves around learning from incidents from organizations in other fields (Bose and NASA).

Jaime Woo and Emil Stolarsky — Incident Labs

This is a followup analysis of a Google Hangouts outage from last month.

Google

Outages

A production of Tinker Tinker Tinker, LLC