SRE Weekly Issue #264

A message from our sponsor, StackHawk:

StackHawk and FOSSA are getting together Thursday, April 8, to show you how to automate AppSec testing with GitHub actions. Register to learn how to test your open source and proprietary code for vulns in CI/CD.
https://hubs.ly/H0Ks1dy0

Articles

This well-researched article caught me by surprise. It’s shocking that Ably received advice from AWS to stay under 400,000 simultaneous connections, despite Amazon’s own documentation stating support for “millions of connections per second”.

Paddy Byers — Ably

This blog is about how a group of hard-working individuals, with unique skills and working methods, managed to create a successful SRE team.

There’s a lot of detail about what their SREs do and how they communicate, with 3 projects as case studies.

Sergio Galvan — Algolia

This is an incident followup from an incident at Deno earlier this year. Their CDN saw their heavy use of .ts files (TypeScript, a JavaScript variant) and mistakenly assumed they were MPEG transport segments, a violation of the CDN’s ToS. Oops.

Luca Casonato — Deno

Wait, there are 9 now?

Marc Hornbeek — Container Journal

There’s a nice little discussion of why “human error” is not a good enough answer for why a deviation (from standard operating procedure) happened.

Susan J. Schniepp and Steven J. Lynn — Pharmaceutical Technolog

They deployed an optimization that skipped sending some requests to the backend… and the backend metrics got worse. Why? Hint: aggregate metrics.

Dominik Sandjaja — Trivago

Outages

Updated: April 4, 2021 — 10:35 pm
A production of Tinker Tinker Tinker, LLC Frontier Theme