
SRE Weekly Issue #267

A message from our sponsor, StackHawk:

Serverless doesn’t mean secure. Use modern security testing tools to assess serverless applications for vulnerabilities during development.
http://sthwk.com/serverless

Articles

Yet more proof that DNS behavior varies way more than is obvious at first glance. Who the heck thought longest common prefix matching was a good idea?

Charles Li — eBay

The application may log multiple lines during the lifecycle of a request. Stripe has found it invaluable to also log one final line with a full summary of the request.

Brandur Leach — Stripe
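
The technique described here is often called a canonical log line: one wide, structured entry emitted when the request finishes, alongside whatever per-step logging already happens. Here's a minimal sketch in Python of what that can look like; the field names and handler shape are my own assumptions, not Stripe's implementation.

    # A minimal sketch of a "canonical log line": in addition to any per-step
    # logging, emit one structured summary line when the request finishes.
    # Field names and handler shape are illustrative, not Stripe's code.
    import json
    import logging
    import time

    logging.basicConfig(level=logging.INFO, format="%(message)s")
    logger = logging.getLogger("canonical")

    def handle_request(method, path, user_id):
        start = time.monotonic()
        status = 200
        try:
            pass  # normal request handling, which may log many lines of its own
        except Exception:
            status = 500
            raise
        finally:
            # The single summary line: everything needed to understand the request.
            logger.info(json.dumps({
                "canonical_line": True,
                "method": method,
                "path": path,
                "user_id": user_id,
                "status": status,
                "duration_ms": round((time.monotonic() - start) * 1000, 2),
            }))

    handle_request("GET", "/v1/charges", user_id="usr_123")

Because every request yields exactly one such line, it's easy to aggregate, filter, and alert on after the fact.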

This is a followup with more detail on the G Suite outage I reported here last week. A database issue caused two separate outages.

Google

Really great advice about 3 common pitfalls in implementing SL*s.

Cortex

This research paper explores the marginal boundary, a set of conditions beyond which a system enters a different operating mode and an accident is much more likely. It discusses the concept of coupling between seemingly unrelated parts of the system and shows how economic incentives can push a system toward this boundary.

Dr. Richard Cook and Jens Rasmussen (Original paper)

Thai Wood — Resilience Roundup (summary)

This is an analysis of a recent BGP leak with a discussion about how the impact from such events can be mitigated through emerging best practices.

Alessandro Improta and Luca Sani — Catchpoint

How do you hand over ownership of a system, transferring enough knowledge that the new owners can maintain its availability and reliability successfully?

Aleksandra Gavrilovska — SoundCloud

Shopify works toward Black Friday / Cyber Monday all year long, through a combination of load testing, failure mode analysis, game days, and incident analysis.

Ryan McIlmoyl — Shopify

Outages

SRE Weekly Issue #266

A message from our sponsor, StackHawk:

Are you a ZAP user looking to automate your security testing? Make sure to tune in to ZAPCon After Hours on Tuesday at 8 am PT to see how you can use Jenkins and Zest scripts to automate ZAP.
http://sthwk.com/zapcon-ah

Articles

This one was brought to my attention by Dr. Richard Cook, who also pointed me to the AAIB incident report.

Dr. Cook went on to share these insights with me, which I’ve copied here with permission:

Note:

  • the subtle interactions allowed the manual correction to be lost during the interval between recognizing the software problem and having the corrected software functionally ‘catch’ the Ms/Miss title mixup;
  • the incident is attributed to “a simple flaw in the programming of the IT system” rather than failure of the workarounds that were put in place after the problem was recognized;
  • the report is careful to demonstrate that the flaws in the system made only a slight difference to the flight parameters;

  • the report does not describe any IT process changes whatsoever!

The report has the effect of making the incident appear to be an unfortunate series of occurrences rather than being emblematic of the way that these sorts of processes are vulnerable.

Last year’s SRE From Home event was awesome, and this year’s iteration looks to be just as great.

Catchpoint

This is fun! Try your hand at troubleshooting a connection issue in this gamified role-play scenario.

BONUS CONTENT: Read about the author’s motivations, design decisions, and plans here.

Julia Evans

Do we need to have some kind of Pillars Registry? Note, these are more like pillars of high availability than resilience engineering.

Hector Aguilar — Okta

I love this idea that we’re trying to get deep incident analysis done even though that may not be the actual goal of the organization.

As LFI analysts, we’re exploiting this desire for closure to justify spending time examining how work is really done inside of the system.

Lorin Hochstein

This is well worth a read if only for the on-call scenario at the start. Yup, been there. We miss you, Harry.

Harry Hull — Blameless

What’s the difference? Click through to learn about the distinction they’re drawing.

Amir Kazemi — effx

The New York Times’s Operations Engineering group developed an Operational Maturity Assessment and uses it to have collaborative conversations with teams about their systems.

The NYT Open Team — New York Times

Outages

SRE Weekly Issue #265

A message from our sponsor, StackHawk:

Join StackHawk and WhiteSource tomorrow morning to learn about automated security testing in the DevOps pipeline. With automated dynamic testing and software composition analysis, you can be sure you’re shipping secure APIs and applications. Grab your spot:
http://sthwk.com/stackhawk-whitesource

Articles

Here’s a great look into how LinkedIn’s embedded SREs work.

[…] the mission for Product SRE is to “engineer and drive product reliability by influencing architecture, providing tools, and enhancing observability.”

Zaina Afoulki and Lakshmi Namboori — LinkedIn

It’s all just other people’s caches.

Ruurtjan Pul

Recently there was a Reddit post asking for advice about moving from Site Reliability Engineering to Backend Eng. I started writing a response to it, the response got long, and so I turned it into a blog post.

Charles Cary — Shoreline

This is the first in a series about lessons SREs can learn from the space shuttle program. The author likens earlier spacecraft to microservices and the Shuttle to a monolith.

Robert Barron

This article is ostensibly about Emergency Medical Services (EMS), but as is so often the case, it’s directly applicable to SRE. The 5 characteristics are enlightening, and so is the fictitious anecdote about an EMT rattled from a previous incident.

EMS1

Simple solution meets reality. I like how we get to see what they did when things didn’t quite work out as they were hoping.

Robert Mosolgo — GitHub

They did the work to convert a database column to a 64-bit integer before it was too late. Unfortunately, one of their library dependencies didn’t use 64-bit integers.

Keith Ballinger — GitHub
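
For context on why this bites: a signed 32-bit integer tops out at 2,147,483,647, so any code path that still squeezes IDs into 32 bits breaks once values pass that point, even after the database column itself has been widened. A generic Python illustration of that ceiling (not GitHub's actual stack or the library in question):

    # Widening only the database column doesn't help if a dependency still
    # packs IDs into a signed 32-bit slot. Generic illustration, not GitHub's code.
    import struct

    INT32_MAX = 2**31 - 1        # 2,147,483,647

    print(struct.pack("<i", INT32_MAX))      # fits in a signed 32-bit field

    try:
        struct.pack("<i", INT32_MAX + 1)     # the moment a 32-bit dependency overflows
    except struct.error as err:
        print("32-bit overflow:", err)

    print(struct.pack("<q", INT32_MAX + 1))  # the 64-bit representation is fine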

In this post, I’ll walk you through one of our first ever Sidekiq incidents and how we improved our Sidekiq implementation as a result of this incident.

Nakul Pathak — Scribd

Outages

SRE Weekly Issue #264

A message from our sponsor, StackHawk:

StackHawk and FOSSA are getting together Thursday, April 8, to show you how to automate AppSec testing with GitHub actions. Register to learn how to test your open source and proprietary code for vulns in CI/CD.
https://hubs.ly/H0Ks1dy0

Articles

This well-researched article caught me by surprise. It’s shocking that Ably received advice from AWS to stay under 400,000 simultaneous connections, despite Amazon’s own documentation stating support for “millions of connections per second”.

Paddy Byers — Ably

This blog is about how a group of hard-working individuals, with unique skills and working methods, managed to create a successful SRE team.

There’s a lot of detail about what their SREs do and how they communicate, with 3 projects as case studies.

Sergio Galvan — Algolia

This is a followup from an incident at Deno earlier this year. Their CDN saw their heavy use of .ts files (TypeScript, a typed superset of JavaScript) and mistakenly assumed they were MPEG transport stream segments, i.e. video, which would have violated the CDN's ToS. Oops.

Luca Casonato — Deno

Wait, there are 9 now?

Marc Hornbeek — Container Journal

There’s a nice little discussion of why “human error” is not a good enough answer for why a deviation (from standard operating procedure) happened.

Susan J. Schniepp and Steven J. Lynn — Pharmaceutical Technology

They deployed an optimization that skipped sending some requests to the backend… and the backend metrics got worse. Why? Hint: aggregate metrics.

Dominik Sandjaja — Trivago
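
The effect is easy to reproduce with a bit of arithmetic: if the requests you stop sending are mostly the cheap ones, the traffic that still reaches the backend has a worse-looking average, even though the backend is doing strictly less work. A toy example with made-up numbers (not Trivago's data):

    # Toy demonstration of how skipping cheap requests makes aggregate backend
    # metrics look worse. Numbers are invented for illustration only.
    cheap = [10] * 900    # 900 requests at 10 ms each (e.g. trivially answered)
    heavy = [200] * 100   # 100 requests at 200 ms each

    def avg(latencies):
        return sum(latencies) / len(latencies)

    before = cheap + heavy   # all requests hit the backend: avg = 29 ms
    after = heavy            # optimization skips the cheap ones: avg = 200 ms

    print(f"before: {avg(before):.0f} ms over {len(before)} requests")
    print(f"after:  {avg(after):.0f} ms over {len(after)} requests")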

Outages

SRE Weekly Issue #263

A message from our sponsor, StackHawk:

You can utilize Swagger Docs in security testing to drive more thorough and accurate vulnerability scans of your APIs. Learn how:
http://sthwk.com/swagger-api-testing

Articles

They make a really clear case for why traditional metrics and monitoring couldn’t help them solve their problems.

Mads Hartmann

This article commemorates NASA flight director Glynn Lunney, who passed away recently, by highlighting the SRE lessons we can learn from him.

Robert Barron

I like that this focuses on human factors.

Kevin Casey

Dealing with both the increased expectations and challenges of reliability as you scale is difficult. You’ll need to maintain your development velocity and build customer trust through transparency.

Blameless

Uber’s customers are especially likely to be moving around and going in and out of tunnels, losing connectivity along the way. That means it’s difficult to tell when the client should fail over to a different server.

Sivabalan Narayanan, Rajesh Mahindra, and Christopher Francis — Uber

Here’s one I missed from last November. Some good stuff to learn from, especially if you run Vault on Kubernetes.

This outage was caused by a cascading failure stemming from our secrets management engine, which is a dependency of almost all of the production GoCardless services.

Ben Wheatley — GoCardless

Outages

A production of Tinker Tinker Tinker, LLC