General

SRE Weekly Issue #455

A message from our sponsor, FireHydrant:

FireHydrant Retrospectives are now more customizable and collaborative than ever with custom templates, AI-generated answers, collaborative editing… all exportable to Google Docs and Confluence. See how our retros can save you 2+ hours on every incident.

https://firehydrant.com/blog/welcome-to-your-new-retrospective-experience-more-customizable-collaborative/

This article has 6 methods to mitigate thundering herd problems, including pretty diagrams with each.

  Sid

Some thoughts on the “second victim” concept. As a note, I was one of the participants in the discussion on which this article is based.

  Fractal Flame

Written in response to a question about the big CrowdStrike outage earlier this year, this article asks: do we need to start using safer languages?

  Kode Vicious — ACM Queue

This one used a cool technique I haven’t seen yet: they hardcoded a cutoff time into the old and new systems, so they both automatically cut over simultaneously.

   Md Riyadh, Jia Long Loh, Muqi Li, and Pu Li — Grab

Here’s a great writeup of a problem with the UK flight system involving a latent bug. Among several cool takeaways, I really liked the way the official incident report didn’t try to pretend this weird bug could have been foreseen and prevented.

  Chris Evans — incident.io

This game day ended up way more serious than intended and exposed a serious Kubernetes configuration flaw, causing a real outage. Oops!

  Lawrence Jones

It’s all fun and games until someone accidentally uses too much DTAZ (data transfer between availability zones). Good monitoring story, too!

  Grzegorz Skołyszewski — Prezi

OpenAI posted this writeup of an incident earlier this week. They tried to deploy detailed monitoring for their Kubernetes cluster, but the monitoring system overloaded the Kubernetes API.

  OpenAI

And here’s Lorin Hochstein’s analysis of OpenAI’s incident writeup, including a recurring theme:

This is a great example of unexpected behavior of a subsystem whose primary purpose was to improve reliability.

  Lorin Hochstein

SRE Weekly Issue #454

Nine entire years ago, I threw together a few “issues” with my favorite SRE articles, installed WordPress, and added a subscription form, with no clue what I was doing. It’s only thanks to you folks, the thousands of subscribers and the many authors of great SRE content, that I’ve been able to keep this up for so long. Thank you, you make it fun! And as always, thanks also to my sponsors, former, current, and future, who’ve helped make this whole thing possible.

A message from our sponsor, FireHydrant:

Why migrate from PagerDuty? Empower team-level ownership, reduce costs, decouple alerts from incidents, automate incidents end-to-end…to name a few. Join the growing list of companies that have made the switch. (p.s. our Signals migrator makes it simple)

https://firehydrant.com/migrate/from-pagerduty/

When we try to optimize MTTR as if it’s a meaningful statistic, we run into trouble. This article does a great job of explaining why, drawing from concepts and techniques in manufacturing.

  Lorin Hochstein

This article introduces the concepts of “shared nothing” and “shared storage” in distributed systems and then explains why they chose shared storage for WarpStream.

  Richard Artoul — WarpStream

How much did that incident cost in lost revenue? This article says you should avoid including that number in your incident management process, because it’s a trap.

  Tom Webster — Rootly

Pushing a system to 100% CPU utilization can cause workloads to be slowed down. This article is about experimentally finding the sweet spot between utilizing CPUs as much as possible and avoiding performance issues.

  Andreas Strikos — GitHub

This article has a couple of strategies for handling concurrent updates to the same row in MySQL, with and without locking.

  Sönke Ruempler

They do it with a dead man’s switch, implemented using a backup alert provider.

  Lawrence Jones — incident.io

I came across part 6 first and I need to go back and read the rest, but I just had to share this now, because if the cool concept it contains: that efficiency and resiliency are at odds with each other.

  Uwe Friedrichsen

This is so cool! Their system automatically figures out which API calls are critical to each user journey and keeps the list updated.

  yakenji — Mercari

SRE Weekly Issue #453

A message from our sponsor, FireHydrant:

Why migrate from PagerDuty? Empower team-level ownership, reduce costs, decouple alerts from incidents, automate incidents end-to-end…to name a few. Join the growing list of companies that have made the switch. (p.s. our Signals migrator makes it simple)

https://firehydrant.com/migrate/from-pagerduty/

It’s a case of cascading failure, but with an interesting twist: their system was designed to handle floods but the safety mechanism was left unconfigured.

   Jamie Herre, Tom Walwyn, Christian ndres, Gabriele Viglianisi, Mik Kocikowski, and Rian van der Merwe — Cloudflare

Lorin takes apart the Cloudflare write-up with style, including a really insightful section on safety mechanisms in complex systems.

  Lorin Hochstein

Meta wanted to log details about the encrypted communications in their systems to help track key use, outdated algorithms, and the like. It’s a ton of telemetry, so they did smart sampling (which they call aggregation):

During the aggregation, a “count” is maintained for every unique event. When it comes time to flush, this count is exported along with the log to convey how often that particular event took place.

  Hussain Humadi, Sasha Frolov, Rafael Misoczki, Dong Wu — Meta

A primer on using Golang’s profiling tools including CPU profiling, memory profiling, goroutine leak analysis, and execution tracing.

  Gaurav Maheshwari — Oodle

A thought-provoking piece of automation, friction, and adaptive capacity. I especially enjoyed the section on decompensation.

  Fred Hebert

With various tools for different kinds of telemetry, these folks needed to up their game to be able to fully understand what happened in a customer request. They also needed a custom sampling strategy to make sure they didn’t miss anything important.

  Martin Fahy — Klaviyo

we’ll be looking at how Ably’s platform achieves scalability, and how, as a result, there’s no effective ceiling on the scale of applications that can be supported.

  Paddy Byers — Ably

Airbnb built a system for tracking and analyzing user actions to aid in personalization. Their system uses Flink and Kafka to handle over a million events per second.

  Kidai Kwon — Airbnb

SRE Weekly Issue #452

A message from our sponsor, FireHydrant:

Practice Makes Prepared: Why Every Minor System Hiccup Is Your Team’s Secret Training Ground.

https://firehydrant.com/blog/the-hidden-value-of-lower-severity-incidents/

The Lunch Exercise was my favorite part of the Blackrock3 training, and now Slack has adapted it for their ongoing training.

How Slack trains engineers in incident response by ordering lunch together.

  Scott Nelson Windels — Slack

Cloudflare runs programs written in their custom language Topaz in the hot path. They use formal verification in production(!) to ensure that the set of Topaz programs make sense.

  ames Larisch, Suleman Ahmad, and Marwan Fayed — Cloudflare

Distributed counting is a challenging problem in computer science. In this blog post, we’ll explore the diverse counting requirements at Netflix, the challenges of achieving accurate counts in near real-time, and the rationale behind our chosen approach, including the necessary trade-offs.

  Rajiv Shringi, Oleksii Tkachuk and Kartik Sathyanarayanan — Netflix

It’s hard, and this article explains why in excellent detail. It also includes a discussion of options to consider when designing a chat system.

  Ably

In anticipation of https://aws-news.com‘s busiest period of the year, I redesigned the API access patterns to support very effective caching. This resulted in significantly reduced backend load and a much faster frontend.

  Luc van Donkersgoed — AWS News

Recover means that not only is everything back online, but the system is performing well and satisfying any QoS or SLAs AND a preventative approach has been implemented.

  Will Searle — Causely

Here’s a list of recommended talks for SREs attending re:Invent, with short descriptions explaining why they’re interesting.

  Jamie Baker

In this post, I’ll share exactly how we link our code to the team that owns it, so errors and alerting are routed to the right place with minimal maintenance burden.

  Martha Lambert — incident.io

SRE Weekly Issue #451

A message from our sponsor, FireHydrant:

Practice Makes Prepared: Why Every Minor System Hiccup Is Your Team’s Secret Training Ground.

https://firehydrant.com/blog/the-hidden-value-of-lower-severity-incidents/

Most fascinating air incident report I’ve seen in awhile! The pilots deviated from the non-normal checklist, and it immediately made me think of runbooks. On the one hand, you want the runbook to be simple and easy to handle in an incident. On the other hand, it can be very useful to tell the operator why they should do something.

  Mentour Pilot

With their claimed 14.5% of all websites depending on Cloudflare’s DNS, they had to be super careful with this migration. Lots of good stuff in here including:

  • replacing direct DB access by multiple services with an API
  • keeping the old and new DB in sync
  • ensuring both forward and reverse migration were possible in case of rollback

  Alex Fattouche and Corey Horton Cloudflare

I didn’t get to experience the value of a good tracing tool until recently in my career, and I didn’t understand the hype. If you’re in the same boat, this article may help you understand the value of tracing.

  Sam Starling — incident.io

About a year ago, Honeycomb git rid of incident severity levels in favor of incident types, which are purposefully not sortable. Here’s how their experiment has gone so far.

  Fred Hebert — Honeycomb

  Full disclosure: Honeycomb is my employer.

Is Service Level Indicator (SLI) the same as Key Performance Indicator (KPI)?

There’s a really cool framing in there: KPIs are moonshots, so we aim high and rarely hit all of them, while with SLOs, we under-promise and over-deliver.

  Alex Ewerlöf

A fun dive into some unix/linux internals with nine different methods to run a program with timeouts and retries. If you have a soft spot in your heart for signals and system calls, this one’s for you.

  Philippe Gaultier

Cosmos DB is Azure’s answer to Amazon’s DynamoDB. This article gives a nice overview and compares it to various other data stores to help you decide whether it’s right for your use case.

  Adam Gordon Bell — Pulumi

An engineer at Mercari shares their plan for migrating to their new payment system in this five-part article series, all of which are published now. They created their design after reading 80(!) similar articles from folks at other companies.

  resotto — Mercari

A production of Tinker Tinker Tinker, LLC Frontier Theme