SRE Weekly Issue #379

A message from our sponsor, Rootly:

Curious how companies like Figma, Tripadvisor, and 100s of others leverage Rootly to manage incidents in Slack and unlock instant best practices? Check out this lightning demo:
https://www.loom.com/share/051c4be0425a436e888dc0c3690855ad

Articles

In case you weren’t familiar with the Saga pattern (I wasn’t), it’s basically a pseudo-transaction across multiple microservices. Here’s why it might not be a great idea.

  Sergiy Yevtushenko
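
For readers who haven't met the pattern, here is a minimal sketch of the general idea, not code from the linked article. Each step is paired with a compensating action, and when a step fails the compensations for the completed steps run in reverse order. The `run_saga` helper and the order-processing steps are hypothetical, and a real saga would also have to cope with compensations that fail.

```python
def run_saga(steps):
    """Run (action, compensation) pairs in order; undo completed steps on failure."""
    completed = []
    for action, compensate in steps:
        try:
            action()
        except Exception:
            # Roll back in reverse order. This sketch assumes compensations succeed.
            for undo in reversed(completed):
                undo()
            return False
        completed.append(compensate)
    return True


log = []


def fail_shipping():
    # Hypothetical third step that always fails, to trigger the rollback.
    raise RuntimeError("shipping service unavailable")


steps = [
    (lambda: log.append("reserve stock"), lambda: log.append("release stock")),
    (lambda: log.append("charge card"), lambda: log.append("refund card")),
    (fail_shipping, lambda: log.append("cancel shipment")),
]

ok = run_saga(steps)
print(ok, log)
```

Note that nothing here is atomic: between "charge card" and "refund card", other services can observe the intermediate state, which is one reason to call it a pseudo-transaction.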

During a rolling deploy, for a very brief period of time, different parts of the infrastructure had old or new code running, with unexpected results.

  Andrew Ayer

On its face, we have a simple requirement:

  • Generate sequential numbers
  • Ensure that there can be no gaps
  • Do that in a distributed manner

It’s never simple with distributed systems.

In classic Cloudflare style, here’s an ultra-deep dive into the kernel to find the source of trouble-making packet loss.

  Terin Stock — Cloudflare

Even with a “duplicate” incident, there’s always at least one thing that’s different: the fact that it’s happened before. That changes things. In practice, a lot more will be different too.

  Fred Hebert — Honeycomb
  Full disclosure: Honeycomb is my employer.

There are definitely pros and cons to being in the most popular (and most oft-maligned) AWS region.

  Jeff Martens — Metrist

Changes are frequent causes of incidents, but what exactly counts as a change? This article delves into that with examples.

  Boris Cherkasky

This crash is a great reminder that we have to look past “human error” to the systems around the humans that set them up for failure (or don’t set them up for success).

  Admiral Cloudberg

SRE Weekly Issue #378

Articles

This is the story of a fascinating incident in which a commercial airplane’s engine was ripped off during takeoff (also covered on Mentour Pilot). What really struck me is the way a huge team on the ground and in the air assembled around the incident and all played very important roles in getting the plane down safely.

  Mark D. Young — PoliticsWeb

Time for another Catchpoint SRE Survey! They donate $5 to the Red Cross for every completed survey, so let’s all work together and drive a huge donation!

  Catchpoint

The US Federal Trade Commission (FTC) put out a request for information about cloud providers, including reliability among other topics. Here’s Corey Quinn’s answer.

  Corey Quinn — The Duckbill Group

What can you do when running an incident feels like herding cats? This article has some tips.

  Robert Ross — FireHydrant

I have a confession. Despite having been hired multiple times in part due to my experience with monitoring platforms, I have come to hate monitoring.

This jaded tale also contains some good suggestions for dealing with monitoring pitfalls.

  Mathew Duggan

The cardinal rule of engineering:

your solution shouldn’t become your next problem.

  Kumar Amit — Mercari

Here’s the articlization of a talk Fred Hebert gave at QCon New York. The alternate title of the talk is:

This Is All Going To Hell Anyway
All We Can Do Is Influence How Long It’s Gonna Take

I had the pleasure of seeing a draft version of this talk at work, since (full disclosure) Fred is my coworker.

  Fred Hebert

This article makes the case that elastic scaling is both harder to implement and more important for use cases involving streaming updates to users in real-time.

  Mittul Madaan — Ably

An intro to pdsh, my favorite of the tools that run commands on many hosts via SSH.

  Amin Astaneh — Certo Modo

SRE Weekly Issue #377

Articles

AWS had a major Lambda outage in us-east-1, and it took out many customer systems and quite a few other AWS systems, including their support portal.

  The Stack

This person had a fascinating path to SRE, starting out their career as a generator repair technician and transitioning through devops to SRE.

  Brian Hellinger — Towards AWS

In part 1, they outlined how they replay real traffic to test a new system before deploying it. In this article, they build on that with three additional techniques: sticky canaries, A/B testing, and gradually shifting traffic to the new system in production.

  Shyam Gala, Javier Fernandez-Ivern, Anup Rokkam Pratap, and Devang Shah — Netflix

By comparing status page posting to their independent monitoring of services, Metrist is able to produce statistics about how long companies take to post to their status pages when they have an outage.

  Jeff Martens — Metrist

Improvising during an incident isn’t just a one-off occurrence, and we should plan for it.

  Lorin Hochstein — Surfing Complexity

A foreign key column had a smaller integer data type than the key that it referenced, and it failed when the referenced key went too high.

  Heroku
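
A rough illustration of this class of failure, assuming a signed 32-bit referencing column and a 64-bit referenced key. The actual column types and database behavior in the Heroku incident are not reproduced here, and `store_reference` is a hypothetical stand-in for the database write.

```python
INT32_MAX = 2**31 - 1  # upper bound of a signed 32-bit integer column
INT64_MAX = 2**63 - 1  # upper bound of a signed 64-bit integer column


def store_reference(parent_id, column_max=INT32_MAX):
    """Simulate writing parent_id into a referencing column of limited width."""
    if parent_id > column_max:
        raise OverflowError(f"{parent_id} does not fit in the referencing column")
    return parent_id


# The referenced key keeps growing; the referencing column is narrower.
store_reference(INT32_MAX)  # the last key value that still fits
try:
    store_reference(INT32_MAX + 1)  # the first key value that no longer fits
except OverflowError as exc:
    print("insert failed:", exc)
```

Everything works until the referenced key crosses the smaller type's limit, so a mismatch like this can sit unnoticed for a long time.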

Here, we’ll look at the key considerations you need to make when it comes to the architecture of your chat app, the structure and components of that architecture, and some of the technology options that can help support you in building a reliable chat experience.

  Ably

A departure from the normal air traffic control procedure allowed the pilots to lose situational awareness. A commonly-held myth about flotation equipment contributed to three deaths in a quite survivable accident.

  Admiral Cloudberg

They kept finding what they thought was the problem, and their fixes helped, but the problem kept coming back.

  Tanat Paul Lokejaroenlarb — Adevinta

SRE Weekly Issue #376

Articles

With 100 workstreams and over 500 engineers engaged, this was the biggest incident response I’ve read about in years.

We had to force ourselves to identify the facts on the ground instead of “what ought to be,” and overrule our instincts to look for data in the places we normally looked (since our own monitoring was impacted).

  Laura de Vesine — Datadog

When you unify these three “pillars” into one cohesive approach, a new ability to understand the full state of your system in several new ways also emerges.

  Danyel Fisher — The New Stack
  Full disclosure: Honeycomb, my employer, is mentioned.

This report details the 10-hour incident response following the accidental deletion of live databases (rather than their snapshots, as intended).

  Eric Mattingly — Azure

Neat trick: write your alerts in English and get GPT to convert them to real alert configurations.

  Shahar and Tal — Keep (via HackerNews)

If your DNS resolver is responsible for handling queries for both internal and external domains, what happens when external DNS requests fail? Can internal ones still proceed?

  Chris Siebenmann

This article explains potential pitfalls and downsides to observability tools and the ways vendors might try to get you to use them, along with tips for how to avoid the traps.

  David Caudill

Too often, we dismiss the anomaly we just faced in an incident as a weird, one-off occurrence. And while that specific failure mode likely will be a one-off, we’ll be faced with new anomalies in the future.

  Lorin Hochstein — Surfing Complexity

SRE Weekly Issue #375

Articles

An in-depth analysis of the crash of a recent lunar lander. It’s really interesting that a feature designed specifically to improve robustness to failures instead made the system less reliable in unforeseen circumstances.

  Robert Barron — IBM

With each external cloud service you deploy, you introduce the amount of unreliability that product has into your own product’s reliability (even if it’s incredibly small).

  Jeff Martens — The New Stack
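
To put rough numbers on that idea: if your product needs every dependency to be up at the same time, and failures are independent (a simplifying assumption), the availabilities multiply. The figures below are made-up examples, not measurements of any real service.

```python
from math import prod


def composite_availability(availabilities):
    """Availability of a product that needs every listed component up at once.

    Assumes failures are independent, which is a simplification.
    """
    return prod(availabilities)


# Hypothetical figures: your own service plus three external dependencies.
own = 0.9995
dependencies = [0.999, 0.9995, 0.9999]

total = composite_availability([own, *dependencies])
minutes_per_30_days = 30 * 24 * 60
expected_downtime = (1 - total) * minutes_per_30_days

print(f"composite availability: {total:.4%}")
print(f"expected downtime per 30 days: {expected_downtime:.0f} minutes")
```

The composite figure is always lower than the weakest single component, so even a very reliable dependency lowers the total a little.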

Are you a software engineer or an IT professional interested in transitioning to an SRE role? You’ve come to the right place! This article provides guidance on the skills and behaviors needed to apply for an SRE position at medium-to-large-sized tech companies successfully.

  Amin Astaneh — Certo Modo

While it can seem pretty insignificant, properly distinguishing between an incident and a bug is worthwhile. Why? Because it will ultimately help dictate your response to it.

  Luis Gonzalez — incident.io

This is impressive: an engineer built an entire model of a ride-share system, complete with simulated riders and drivers, metrics, containerization, the works, all to gain a better understanding of how these kinds of systems work.

  Gergely Orosz — Pragmatic Engineer

This article answers the most important questions:

  • How is using service levels any different than “regular” alarms?
  • What’s in it for the company and the teams?
  • Why bother? Don’t we already have enough work to do?

  Alex Ewerlöf

Here at eBay, we’ve crafted a brand new approach to automate platform evolution for all applications — one that provides a repeatable and reusable infrastructure to streamline evolution.

  Paul Zhang and Tao Jin

Interesting idea: feeding trace data into an LLM and asking it to build an end-to-end (E2E) test for the entire system. This article is a good description of what they’re doing but I’d be interested to hear more about the results.

  Nir Gazit — Honeycomb
  Full disclosure: Honeycomb is my employer.

What conclusions can we draw from the recent announcement that Amazon Prime Video is moving from serverless to a monolith?

The supposed difference between the two methods is not based on the technology itself, but the context in which you’re working.

  Ian Miell

A production of Tinker Tinker Tinker, LLC