SRE Weekly Issue #459

A message from our sponsor, incident.io:

Effective incident management demands coordination and collaboration to minimize disruptions. This guide by incident.io covers the full incident lifecycle—from preparation to improvement—emphasizing teamwork beyond engineering. By engineers, for engineers.

https://incident.io/guide

In a microservices environment, testing user journeys that span across multiple bounded contexts requires collaboration and a clear delineation of responsibilities.

  Yan Cui

These folks migrated from Fastly to Cloudflare using Terraform. They wrote a Go program to translate their Fastly VCL configurations into an equivalent set of parameters for their Terraform module.

  hatappi1225 — Mercari

This 3-part series does a deep dive on how time and clocks work in distributed data stores. Part 2 is here and part 3 is here.

  Murat

TIL: “Unix time” (seconds since the epoch) does not include leap seconds.

  Kyle Kingsbury
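A tiny illustration of the point above (my own, not from the linked post): because Unix time treats every day as exactly 86,400 seconds, naive day arithmetic and the real timestamp agree, even though 27 leap seconds had been inserted by the start of 2017.

```python
from datetime import datetime, timezone

# Unix time counts every day as exactly 86,400 seconds, so the 27 leap seconds
# inserted between 1972 and the end of 2016 never appear in the count.
epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
new_year_2017 = datetime(2017, 1, 1, tzinfo=timezone.utc)

print((new_year_2017 - epoch).days * 86400)  # 1483228800 (naive day arithmetic)
print(int(new_year_2017.timestamp()))        # 1483228800 (same value; leap seconds ignored)
```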

This post argues that tech companies should avoid outages like Facebook’s in 2021 by using much more rigorous principles such as those used to build bridges. I’m not so sure about that, but it was an interesting read.

  Davi Ottenheimer

There’s a lot going on beneath the surface in a live video streaming service. Cloudflare walks us through it, including key design decisions like on-the-fly transcoding.

  Kyle Boutette and Jacob Curtis — Cloudflare

DSQL is Amazon’s new serverless PostgreSQL-compatible datastore.

Aurora DSQL is designed to remain available, durable, and strongly consistent even in the face of infrastructure failures and network partitions.

But what about the CAP Theorem? Click through to find out how.

  Marc Brooker

This new installment introduces the next level of resilience, which involves the ability to radically change your approach if the usual adaptation strategies fall short.

  Uwe Friedrichsen

SRE Weekly Issue #458

A message from our sponsor, incident.io:

Ever wonder how Netflix handles incidents at their scale? With incident.io, they’ve built a process that’s smooth, scalable, and keeps everyone on the same page. Tools like Catalog and Workflows make it intuitive for teams to tackle incidents consistently—no matter how big the challenge.

https://incident.io/customers/netflix

We can never see our systems directly, so we rely on “sensors” to understand the state of the system. What if the sensors are broken?

  Lorin Hochstein

Two super insightful observations about the nature of architectural work, well worth revisiting next time you’re making big design decisions.

So, “Two IMO relevant findings regarding architectural work” would probably be a more accurate title. But that would be a lot less catchy … ;)

  Uwe Friedrichsen

To prevent revalidation stampedes, Cloudflare uses randomness to decide whether to send requests to the origin. Click through to find out how it works.

  Thibault Meunier — Cloudflare
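Here’s a minimal sketch of the general idea (my own illustration — the names, cache shape, and 1% probability are assumptions, not Cloudflare’s actual design): when a cached object goes stale, each request flips a weighted coin to decide whether it’s the one that refreshes from the origin.

```python
import random
import time

TTL_SECONDS = 60
REVALIDATE_PROBABILITY = 0.01  # illustrative; not Cloudflare's actual parameter

cache = {}  # key -> (value, fetched_at)

def serve(key, fetch_from_origin):
    entry = cache.get(key)
    if entry is None:
        value = fetch_from_origin(key)
        cache[key] = (value, time.monotonic())
        return value

    value, fetched_at = entry
    is_stale = time.monotonic() - fetched_at > TTL_SECONDS
    if is_stale and random.random() < REVALIDATE_PROBABILITY:
        # Only ~1% of requests for a stale object go back to the origin; the
        # rest keep serving the cached copy until one of them refreshes it,
        # so the origin sees a trickle of revalidations instead of a stampede.
        value = fetch_from_origin(key)
        cache[key] = (value, time.monotonic())
    return value
```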

Some problems with autoscaling, along with potential solutions.

   John Akkarakaran Jose — DZone

This article provides a detailed overview of the Incremental Migration with the Dual Write strategy, including the necessary steps, considerations, and best practices.

   Deepti Marrivada, Bal Reddy Cherlapally, and Spurthi Jambula — DZone

trying to build the perfect system that anticipates every future need is often worse than creating a system designed to change quickly.

I’ve experienced this firsthand as well. Even an architecture that was supposed to be static needed to change as requirements evolved.

  Simen A. W. Olsen — Pulumi

Using more reliable clocks with definite precision allows for significant performance improvements in distributed systems, as described in this article.

  Murat

This opinion piece argues that Snapshot Isolation is the “sweet spot” isolation level that is best for most applications.

  Marc Brooker

SRE Weekly Issue #457

A message from our sponsor, FireHydrant:

This New Year, resolve to make incident management smarter, faster, and way less stressful with FireHydrant. Modern on-call, automated incident response, and AI tools that do the heavy lifting.

https://firehydrant.com/

In this post, we’ll explore the reasons that OOM kills can occur and provide tactics to combat and prevent them.

  Will Searle — Causely

The high-plateau of basic resilience is the third interim stop companies tend to reach on their journey towards resilience.

I especially enjoyed the bit about how trying to add robustness can paradoxically diminish overall reliability, reminiscent of Lorin Hochstein and others.

  Uwe Friedrichsen

What happens when you move your DB and network latency goes from 0.5ms to 10ms? Time to find out by experimenting (carefully).

  Lawrence Jones
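Some back-of-the-envelope arithmetic (mine, not from the article) on why a 20x latency jump bites hardest on chatty request paths that issue queries sequentially:

```python
# A hypothetical handler that issues 100 sequential queries (think N+1 pattern)
# pays the round-trip latency once per query, so the database portion of the
# request balloons from ~50 ms to a full second.
QUERIES_PER_REQUEST = 100

for rtt_ms in (0.5, 10.0):
    print(f"{rtt_ms:>4} ms per round trip -> {QUERIES_PER_REQUEST * rtt_ms:6.0f} ms waiting on the DB")
```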

I’ve only used Kubernetes under Amazon EKS, which handles running etcd, so this guide helped fill in some gaps in my knowledge. Of course, under EKS, you still need to pay attention to etcd.

  David M. Lentz — Datadog

Google folks share how they’ve applied System-Theoretic Accident Model and Processes (STAMP) to SRE at Google. This really stood out to me:

A design might implement its requirements flawlessly. But what if requirements necessary for the system to be safe were incorrect or, even worse, missing altogether? 

  Tim Falzone and Ben Treynor Sloss — USENIX ;login:

Search and rescue (SAR) operations and incident response have striking similarities. In this series, Claire dives into lessons SREs can learn from wildfire management ICSs.

I really love learning about ICS from the veterans who use it for actual emergencies!

  Claire Leverne — Rootly

Runbooks are programs for an imperfect execution engine of highly variable quality.

What happens when the runbook meets reality?

  Jos Visser

This is a really great one! Several factors combined to cause the outage, and they’re all laid out in juicy detail.

  Brendan Humphreys — Canva

Here’s Lorin Hochstein’s take on Canva’s outage report.

  Lorin Hochstein

SRE Weekly Issue #456

A message from our sponsor, FireHydrant:

On-call during the holidays? Spend more time taking in some R&R and less getting paged. Let alerts make their rounds fairly with our new Round Robin feature for Escalation Policies.

https://firehydrant.com/blog/introducing-round-robin-for-signals-escalation-policies/

Here’s another way to use math to show that tracking MTTR over time is going to help you draw incorrect conclusions about your incident trends.

  Lorin Hochstein
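If you want to see the effect for yourself, here’s a quick simulation of my own (not Lorin’s math): draw incident durations from the same heavy-tailed distribution every month and watch the monthly “MTTR” swing around even though nothing about the system has changed.

```python
import random
import statistics

random.seed(42)

# Same lognormal (heavy-tailed) duration distribution every month, 8 incidents
# per month -- yet the monthly mean bounces around, inviting false conclusions.
for month in range(1, 7):
    durations_min = [random.lognormvariate(3.5, 1.0) for _ in range(8)]
    print(f"month {month}: MTTR = {statistics.mean(durations_min):6.1f} minutes")
```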

Why build your own? Dropbox had a heterogeneous fleet with differently-sized backends, and no load-balancer available at the time could handle that.

  Richard Oliver Bray

There’s so much here, I need to read it again a few times — and you should too. Their model has three stages of increasing maturity, allowing you to adopt it at the right pace for your org.

  Stephen Whitworth — incident.io

After accidentally losing all of their Kibana dashboards, the folks at Slack implemented chaos engineering to detect similar problems early.

  Sean Madden — Slack

This article raises concerns about using LLMs in production operations that I haven’t seen expressed quite in this way before.

  Niall Murphy

Five years ago, Mercari adopted a checklist for production readiness, and they’ve seen reliability improve as a result. Now they’re sharing how adoption has gone, the impact it’s had on development teams, and what they’re doing about it.

  mshibuya — Mercari

They deleted an internal project that held API keys that were still in use.

  Google

A status page can be about so much more than just informing customers of downtime. It’s a marketing artifact, evidence for an SLA breach, a sales pitch, and more.

  Lawrence Jones

SRE Weekly Issue #455

A message from our sponsor, FireHydrant:

FireHydrant Retrospectives are now more customizable and collaborative than ever with custom templates, AI-generated answers, collaborative editing… all exportable to Google Docs and Confluence. See how our retros can save you 2+ hours on every incident.

https://firehydrant.com/blog/welcome-to-your-new-retrospective-experience-more-customizable-collaborative/

This article has 6 methods to mitigate thundering herd problems, including pretty diagrams with each.

  Sid
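One mitigation you’ll see in most lists like this (sketched below from memory — it may or may not be among the article’s six) is retrying with exponential backoff plus jitter, so failed clients don’t all come back at the same instant:

```python
import random
import time

def call_with_backoff(operation, max_attempts=5, base_delay=0.1, max_delay=5.0):
    """Retry `operation`, sleeping a random ("full jitter") amount that grows
    exponentially, so a herd of failing clients spreads out its retries."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            cap = min(max_delay, base_delay * 2 ** attempt)
            time.sleep(random.uniform(0, cap))
```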

Some thoughts on the “second victim” concept. As a note, I was one of the participants in the discussion on which this article is based.

  Fractal Flame

Written in response to a question about the big CrowdStrike outage earlier this year, this article asks: do we need to start using safer languages?

  Kode Vicious — ACM Queue

This one used a cool technique I haven’t seen yet: they hardcoded a cutoff time into the old and new systems, so they both automatically cut over simultaneously.

   Md Riyadh, Jia Long Loh, Muqi Li, and Pu Li — Grab
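The trick is simple enough to sketch (illustrative names and date — this isn’t Grab’s actual code): the same cutoff is baked into both the old and the new systems ahead of time, so they flip at the same instant without any coordination at switchover.

```python
from datetime import datetime, timezone

# Hypothetical cutoff compiled into both the old and the new deployment.
CUTOVER_AT = datetime(2025, 1, 1, 0, 0, tzinfo=timezone.utc)

def handle_request(request, legacy_path, new_path):
    # Before the cutoff every instance (old or new) takes the legacy path;
    # at the cutoff they all switch simultaneously, with no flag to flip.
    if datetime.now(timezone.utc) >= CUTOVER_AT:
        return new_path(request)
    return legacy_path(request)
```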

Here’s a great writeup of a problem with the UK flight system involving a latent bug. Among several cool takeaways, I really liked the way the official incident report didn’t try to pretend this weird bug could have been foreseen and prevented.

  Chris Evans — incident.io

This game day ended up way more serious than intended: it exposed a Kubernetes configuration flaw and caused a real outage. Oops!

  Lawrence Jones

It’s all fun and games until someone accidentally uses too much DTAZ (data transfer between availability zones). Good monitoring story, too!

  Grzegorz Skołyszewski — Prezi

OpenAI posted this writeup of an incident earlier this week. They tried to deploy detailed monitoring for their Kubernetes cluster, but the monitoring system overloaded the Kubernetes API.

  OpenAI

And here’s Lorin Hochstein’s analysis of OpenAI’s incident writeup, including a recurring theme:

This is a great example of unexpected behavior of a subsystem whose primary purpose was to improve reliability.

  Lorin Hochstein
