
SRE Weekly Issue #265

A message from our sponsor, StackHawk:

Join StackHawk and WhiteSource tomorrow morning to learn about automated security testing in the DevOps pipeline. With automated dynamic testing and software composition analysis, you can be sure you’re shipping secure APIs and applications. Grab your spot:
http://sthwk.com/stackhawk-whitesource

Articles

Here’s a great look into how LinkedIn’s embedded SREs work.

[…] the mission for Product SRE is to “engineer and drive product reliability by influencing architecture, providing tools, and enhancing observability.”

Zaina Afoulki and Lakshmi Namboori — LinkedIn

It’s all just other people’s caches.

Ruurtjan Pul

Recently there was a Reddit post asking for advice about moving from Site Reliability Engineering to Backend Eng. I started writing a response to it, the response got long, and so I turned it into a blog post.

Charles Cary — Shoreline

This is the first in a series about lessons SREs can learn from the Space Shuttle program. The author likens earlier spacecraft to microservices and the Shuttle to a monolith.

Robert Barron

This article is ostensibly about Emergency Medical Services (EMS), but as is so often the case, it’s directly applicable to SRE. The 5 characteristics are enlightening, and so is the fictitious anecdote about an EMT rattled from a previous incident.

EMS1

Simple solution meets reality. I like how we get to see what they did when things didn’t quite work out as they were hoping.

Robert Mosolgo — GitHub

They did the work to convert a database column to a 64-bit integer before it was too late. Unfortunately, one of their library dependencies didn’t use 64-bit integers.

Keith Ballinger — GitHub
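
To make the 64-bit problem concrete, here is a minimal sketch of what happens when an ID outgrows a signed 32-bit integer. The helper names and the sample ID are hypothetical and are not taken from GitHub's write-up; the sketch only shows why a dependency that still assumes 32 bits breaks even after the database column has been widened.

```python
import struct

# Largest value a signed 32-bit integer can hold.
INT32_MAX = 2**31 - 1


def fits_in_int32(value: int) -> bool:
    """Return True if value can be packed as a signed 32-bit integer."""
    try:
        struct.pack("<i", value)
        return True
    except struct.error:
        return False


def fits_in_int64(value: int) -> bool:
    """Return True if value can be packed as a signed 64-bit integer."""
    try:
        struct.pack("<q", value)
        return True
    except struct.error:
        return False


# Hypothetical ID: the first one past the 32-bit limit.
next_id = INT32_MAX + 1

print(fits_in_int32(INT32_MAX))  # the last ID a 32-bit consumer can handle
print(fits_in_int32(next_id))    # a 32-bit consumer rejects this one
print(fits_in_int64(next_id))    # a 64-bit column stores it without trouble
```

The column and the code reading it have to agree on width, so an audit like this needs to cover every library on the path, not just the schema.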

In this post, I’ll walk you through one of our first ever Sidekiq incidents and how we improved our Sidekiq implementation as a result of this incident.

Nakul Pathak — Scribd

Outages

SRE Weekly Issue #264

A message from our sponsor, StackHawk:

StackHawk and FOSSA are getting together Thursday, April 8, to show you how to automate AppSec testing with GitHub Actions. Register to learn how to test your open source and proprietary code for vulns in CI/CD.
https://hubs.ly/H0Ks1dy0

Articles

This well-researched article caught me by surprise. It’s shocking that Ably received advice from AWS to stay under 400,000 simultaneous connections, despite Amazon’s own documentation stating support for “millions of connections per second”.

Paddy Byers — Ably

This blog is about how a group of hard-working individuals, with unique skills and working methods, managed to create a successful SRE team.

There’s a lot of detail about what their SREs do and how they communicate, with 3 projects as case studies.

Sergio Galvan — Algolia

This is a followup on an incident at Deno earlier this year. Their CDN saw their heavy use of .ts files (TypeScript, a JavaScript variant) and mistakenly assumed they were MPEG transport segments, a violation of the CDN’s ToS. Oops.

Luca Casonato — Deno

Wait, there are 9 now?

Marc Hornbeek — Container Journal

There’s a nice little discussion of why “human error” is not a good enough answer for why a deviation (from standard operating procedure) happened.

Susan J. Schniepp and Steven J. Lynn — Pharmaceutical Technology

They deployed an optimization that skipped sending some requests to the backend… and the backend metrics got worse. Why? Hint: aggregate metrics.

Dominik Sandjaja — Trivago
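
One way an aggregate metric can get worse while nothing actually got slower is sketched below. The latency numbers are invented for illustration and are not from Trivago's article; the assumption is simply that the skipped requests were the cheap ones.

```python
from statistics import mean

# Hypothetical backend latencies in milliseconds (assumed, not measured).
cheap_requests = [10] * 80        # requests the optimization now skips
expensive_requests = [200] * 20   # requests that still reach the backend

# Before the optimization, the backend serves both kinds of request.
avg_before = mean(cheap_requests + expensive_requests)

# Afterwards, only the expensive requests are left in the backend's metric.
avg_after = mean(expensive_requests)

print(f"average backend latency before: {avg_before} ms")
print(f"average backend latency after:  {avg_after} ms")
```

No individual request became slower here. The average rose only because the fast requests stopped being counted, which is why a per-request-type breakdown tells a different story than the aggregate.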

Outages

SRE Weekly Issue #263

A message from our sponsor, StackHawk:

You can utilize Swagger Docs in security testing to drive more thorough and accurate vulnerability scans of your APIs. Learn how:
http://sthwk.com/swagger-api-testing

Articles

They make a really clear case for why traditional metrics and monitoring couldn’t help them solve their problems.

Mads Hartmann

This article commemorates the death of NASA flight director Glynn Lunney by showing the SRE lessons we can learn from him.

Robert Barron

I like that this focuses on human factors.

Kevin Casey

Dealing with both the increased expectations and challenges of reliability as you scale is difficult. You’ll need to maintain your development velocity and build customer trust through transparency.

Blameless

Uber’s customers are especially likely to be moving around and going in and out of tunnels, losing connectivity along the way. That means it’s difficult to tell when the client should fail over to a different server.

Sivabalan Narayanan, Rajesh Mahindra, and Christopher Francis — Uber

Here’s one I missed from last November. Some good stuff to learn from, especially if you run Vault on Kubernetes.

This outage was caused by a cascading failure stemming from our secrets management engine, which is a dependency of almost all of the production GoCardless services.

Ben Wheatley — GoCardless

Outages

SRE Weekly Issue #262

A message from our sponsor, StackHawk:

Join the Secure Coding Summit to hear from industry-leading AppSec and DevSecOps practitioners, analysts, and visionaries as they share their best pro tips to level up your code security.
http://sthwk.com/secure-code-summit

Articles

Chaos Engineering isn’t adding chaos to your systems—it’s seeing the chaos that already exists in your systems.

Along with four prerequisites, this article also includes three myths about chaos engineering that might be making you feel hesitant about starting.

Courtney Nash — Verica

This one’s from May of last year. Almost a year on, it’s interesting to see which of these we’ve already implemented.

Ashley Roof — Transposit

An amusing parable illustrating why you shouldn’t try to be too reliable.

Andrew Ford — Indeed

In the Outages section of last week’s issue, you’ll find two unrelated events referenced in this article: one about Russian internet censorship gone awry and another about a major datacenter fire.

Eric Johansson — Verdict

Along with what’s in the title, this article also covers the difference between an RCA and a contributing factors analysis.

Emily Arnott — Blameless

Lots of detail on how LinkedIn is improving their traffic forecasts. Warning/enticement: math contained within.

Deepanshu Mehndiratta — LinkedIn

Everyone is testing in production; some organizations admit it and plan for it.

How to do it right, what can happen if it goes wrong, and how to limit the blast radius.

Heidi Waterhouse — LaunchDarkly

Remember when GitHub logged you out? Ah, I remember it like it was last week. I mean, the week before. Here’s GitHub’s troubleshooting story about what went wrong.

Dirkjan Bussink — GitHub

Outages

SRE Weekly Issue #261

A message from our sponsor, StackHawk:

Join Snyk and StackHawk on March 18 as they walk through how to use Software Composition Analysis (SCA) and Dynamic Application Security Testing (DAST) in CI/CD to ship more secure applications.
http://sthwk.com/snyk-stackhawk-webinar

Articles

I find it really refreshing that fighter pilots have a retrospective about every single mission, successful or not. There’s always something to learn.

Jessica Abelson — Transposit

Heroku applies the Incident Management System, designating an Incident Commander who keeps the incident on track and oversees communications, both external and internal.

Guillaume Winter — Heroku

This story is becoming common: Khan Academy had a sudden influx of traffic when pandemic lockdowns began. Their strategy involved the use of the cloud and a CDN.

Marta Kosarchyn — Khan Academy

Full disclosure: Fastly, my employer, is mentioned.

Here’s a great summary of how Squarespace does SRE.

Franklin Angulo — Squarespace

Leaders at Deliveroo, DigitalOcean, Fastly, and Headspace share how their organizations think about reliability and resiliency and their advice to engineering orgs embarking on reliability journeys.

The leaders each answer a series of questions about how their organization handles reliability, giving an interesting compare-and-contrast overview.

Increment

Full disclosure: Fastly is my employer.

Using a disaster plan created after a devastating hurricane, Freshworks survived and thrived during the pandemic, delivering a major new product by its pre-pandemic deadline.

Ipsita Agarwal — Increment

This one explains what a canary deployment is, how it can help you, and how canary deployments differ from blue/green deployments.

LaunchDarkly
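
The difference between the two strategies can be sketched as a routing decision. This is a minimal illustration under stated assumptions, not LaunchDarkly's implementation: the function names, the "old"/"new" labels, and the hash-based bucketing are all hypothetical.

```python
import hashlib


def bucket(user_id: str) -> int:
    """Map a user ID to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def canary_route(user_id: str, canary_percent: int) -> str:
    """Canary: send a small, stable slice of users to the new version."""
    return "new" if bucket(user_id) < canary_percent else "old"


def blue_green_route(active_environment: str) -> str:
    """Blue/green: every user goes to whichever environment is active."""
    return active_environment


users = [f"user-{n}" for n in range(1000)]
on_canary = sum(canary_route(u, 5) == "new" for u in users)
print(f"{on_canary} of {len(users)} users are on the canary")
print(f"blue/green sends everyone to: {blue_green_route('green')}")
```

Hashing the user ID keeps each user on the same version between requests, so raising `canary_percent` only ever moves users from old to new.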

This article explains the meaning of a growth mindset and shows how it applies to SRE.

Emily Arnott — Blameless

Outages

  • Fastly
    • Full disclosure: Fastly is my employer.
  • OVH Cloud
  • All domains containing “t.co” in Russia
    • It appears that Russia tried to impair access to Twitter’s URL-shortening domain t.co, but their pattern-matching was overzealous and affected any domain that contained “t.co” (think reddit.com, microsoft.com, and many others).
  • Dyn
    • Dyn had a DNS outage. I noted impact to Heroku, but I didn’t see any other related outage postings.
  • Chef
  • GitHub
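
The “t.co” item above comes down to substring matching versus domain matching. The sketch below illustrates that difference only; how the actual filter was implemented is not known, and both function names are hypothetical.

```python
def naive_match(hostname: str, blocked: str) -> bool:
    """Overzealous: matches any hostname that merely contains the string."""
    return blocked in hostname


def domain_match(hostname: str, blocked: str) -> bool:
    """Matches only the blocked domain itself or one of its subdomains."""
    hostname = hostname.lower().rstrip(".")
    blocked = blocked.lower().rstrip(".")
    return hostname == blocked or hostname.endswith("." + blocked)


for host in ["t.co", "www.t.co", "reddit.com", "microsoft.com"]:
    print(host, naive_match(host, "t.co"), domain_match(host, "t.co"))
```

The naive check flags all four hostnames, while the domain-aware check flags only t.co and www.t.co.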