SRE Weekly Issue #449

A message from our sponsor, FireHydrant:

Practice Makes Prepared: Why Every Minor System Hiccup Is Your Team’s Secret Training Ground.

https://firehydrant.com/blog/the-hidden-value-of-lower-severity-incidents/

This new series seems promising! I won’t link to every article in the series here, but if you’re an early SRE, the intro-level articles published so far in this series are definitely worth a read.

Today, I’m thrilled to announce an ambitious project that’s been in the works for some time: “52 Weeks of SRE” – a comprehensive, year-long deep dive into the world of Site Reliability Engineering.

  J. Pereira

Adevinta shifted from Kubernetes’s cluster autoscaler to AWS’s Karpenter. The change brought huge advantages that they discuss in detail, along with a few challenges and pitfalls they needed to overcome.

  Tanat Lokejaroenlarb — Adevinta

An adventure in adopting an open source firmware for Baseboard Management Controllers, including fixing a few bugs themselves.

  Nnamdi Ajah, Ryan Chow, and Giovanni Pereira Zantedeschi — Cloudflare

[…] an overview of methods like TCP FastOpen, TLSv1.3, 0-RTT, and HTTP/3 to reduce handshake delays and improve server response times in secure environments.

   Maksim Kupriianov — DZone

This article includes general tips and a specific rubric you can follow to decide when to choose a larger or smaller RDS instance type.

  Prabesh

It turns out that a lot of the lessons that Mike Massimino learned as an astronaut apply very well to incident management.

  Eric Silberstein — Klaviyo

Solving IP exhaustion in EKS: Avoiding a network outage by implementing custom networking

  Fabián Sellés Rosa — Adevinta

By leveraging proportional–integral–derivative (PID) controllers, Robinhood can now more quickly and effectively manage load imbalances.

This was my first introduction to PID controllers. Neat!

  Yi-Shu Tai — Dropbox
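
For readers who are also new to PID controllers, here is a minimal sketch of the general idea. It is not Dropbox's implementation: the gains, the toy "load" model, and every name below are assumptions chosen only for illustration. The controller combines the current error (proportional), the accumulated error (integral), and the rate of change of the error (derivative) into one correction.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PIDController:
    """Textbook discrete PID controller (illustrative gains, not tuned for any real system)."""
    kp: float  # proportional gain: reacts to the current error
    ki: float  # integral gain: reacts to accumulated past error
    kd: float  # derivative gain: reacts to how fast the error is changing
    integral: float = 0.0
    prev_error: Optional[float] = None

    def update(self, setpoint: float, measured: float, dt: float) -> float:
        error = setpoint - measured
        self.integral += error * dt
        # No derivative term on the first sample, since there is no previous error yet.
        derivative = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def simulate(steps: int = 50, start_load: float = 0.9, target_load: float = 0.5) -> float:
    """Toy model (an assumption): each correction is applied directly to the node's load."""
    pid = PIDController(kp=0.5, ki=0.1, kd=0.05)
    load = start_load
    for _ in range(steps):
        load += pid.update(target_load, load, dt=1.0)
    return load


if __name__ == "__main__":
    print(f"load after 50 steps: {simulate():.4f}")
```

In this toy setup the load settles close to the target after a few dozen steps. A real load balancer would presumably feed the output into something like routing weights rather than changing the load directly.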

Through an allegory about an imaginary knob to adjust between risk-avoidance and speed, Lorin Hochstein shows us that these trade-offs are being made, just implicitly.

  Lorin Hochstein

SRE Weekly Issue #448

They traded their monolith for microservices in a quest for scalability, but they got complexity along with it.

   Jennifer Riggins — The New Stack

Here’s a great summary of the difference between mutable and immutable infrastructure, including a detailed analysis of the pros and cons of each.

   Josephine Eskaline Joyce and Umar Ali — DZone

An introduction to incident severity and SEV1 incidents, along with how to respond to them, how to prevent them, and how to learn from them.

  Kate Bernacchi-Sass — incident.io

Long-running spans can be difficult to deal with, but fortunately Hazel Weakly is here with an explanation and some tips.

  Hazel Weakly — The New Stack

Here’s a debugging odyssey for a truly gnarly Jupyter Notebook problem that caused slowness in very specific (and seemingly unrelated) circumstances.

  Hechao Li and Marcelo Mayworm — Netflix

Beyond just “What went well?” in an incident writeup, Lorin urges examining our incidents to see what they can tell us about how work gets done and what adaptations people have made in our systems.

  Lorin Hochstein

A huge primer on wide events in observability: what they are, how to implement them, how to use them, and a ton of examples of the kinds of fields you might want to include in your events.

  Jeremy Morrell

  Full disclosure: Honeycomb, my employer, is mentioned.

The 2024 DORA Report is out, and the folks at Rootly have some thoughts on the interesting bits for SREs, including AI, platform engineering, and burnout.

  Jorge Lainfiesta — Rootly

SRE Weekly Issue #447

A message from our sponsor, FireHydrant:

If the entire team is on a Zoom bridge during an incident – how do you know what really happened and when? We added real-time Zoom/Google Meet transcripts to make sure your incident timeline has every detail.

https://firehydrant.com/ai/

There are quite a few pitfalls waiting for you if you try to implement SLOs for your mobile app. This article explains them and offers strategies for avoiding them.

   Virna Sekuj — The New Stack

Blamelessness in incident retrospectives can be a difficult concept to truly internalize. This article describes 3 common “failure modes”, that is, ways in which organizations struggle with blamelessness.

  Tom Elliott — The Friday Deploy

Cloudflare spends a lot of time thinking about cooling, and it’s fascinating. I didn’t realize that spinning a fan faster consumed so much more energy!

  Leslye Paniagua — Cloudflare

Explore the pitfalls associated with the excessive creation of microservices, insights on their causes, implications, and potential strategies for mitigation.

   Sumit Kumar — DZone

Netflix stores a truly obscene number of events, each of which has a timestamp and a set of key-value pairs. This article goes into a ton of detail on how they built their system.

  Rajiv Shringi, Vinay Chella, Kaidan Fullerton, Oleksii Tkachuk, and Joey Lynch — Netflix

A fun debugging story for a confusing crash bug, in which they found 6 other related bugs along the way.

  Brett Wines — Slack

My favorite one is about the principle “You Ain’t Gonna Need It”:

The flip side of YAGNI, however, is that at some point you might actually need it.

  Luc van Donkersgoed

When you create an index on multiple columns in Postgres, you’ll need to be sure that the order of the fields in the index allows it to be applied to your queries, as these folks learned.

  Jean-Mark Wright
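
The leading-column principle can be sketched with a small runnable example. Note the substitution: this uses SQLite (from Python's standard library) as a stand-in rather than Postgres, and the table, index, and column names are hypothetical. Postgres's planner has its own rules, so treat this only as an illustration of why column order in a multi-column index matters.

```python
import sqlite3


def query_plan(conn: sqlite3.Connection, sql: str, params: tuple = ()) -> str:
    """Return the human-readable detail column of SQLite's EXPLAIN QUERY PLAN."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[3] for row in rows)


def demo() -> tuple:
    conn = sqlite3.connect(":memory:")
    # Hypothetical schema, used only for this illustration.
    conn.execute(
        "CREATE TABLE orders ("
        "id INTEGER PRIMARY KEY, customer_id INTEGER, created_at TEXT, total REAL)"
    )
    # Composite index: customer_id is the leading column, created_at the trailing one.
    conn.execute(
        "CREATE INDEX idx_orders_customer_created ON orders (customer_id, created_at)"
    )
    # Filters on the leading column, so the index can be used for a search.
    leading = query_plan(conn, "SELECT * FROM orders WHERE customer_id = ?", (1,))
    # Filters only on the trailing column, so the index ordering does not help here.
    trailing = query_plan(
        conn, "SELECT * FROM orders WHERE created_at = ?", ("2024-01-01",)
    )
    conn.close()
    return leading, trailing


if __name__ == "__main__":
    for label, plan in zip(("leading column", "trailing column only"), demo()):
        print(f"{label}: {plan}")
```

The exact plan wording varies by SQLite version, but the first query's plan names the index while the second falls back to scanning the table.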

SRE Weekly Issue #446

This one is a direct response to an article by Lorin Hochstein from a couple weeks back. There’s a lot here to think about, and it’s really great to see the back-and-forth discussion.

  Chris Evans — incident.io

A tour through the design of S3 by its VP. I found the discussion of managing “heat” (I/O load) especially interesting.

  Andy Warfield — Amazon

This one introduced me to a new concept: vertical vs horizontal sharding. Vertical sharding splits data by whole tables, while horizontal sharding splits it by related rows across tables, as with users or groups of users.

   Suleiman Dibirov
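
The distinction can be sketched as two routing functions. This is a rough illustration, not taken from the article: the shard names, table names, and the hash-based placement are all assumptions.

```python
import hashlib

# Vertical sharding: each whole table lives on one database (hypothetical mapping).
TABLE_TO_SHARD = {
    "users": "db-accounts",
    "orders": "db-commerce",
    "invoices": "db-commerce",
}

# Horizontal sharding: all rows belonging to one user live together on one shard.
HORIZONTAL_SHARDS = ["db-0", "db-1", "db-2", "db-3"]


def vertical_shard(table: str) -> str:
    """Pick a database by table name, regardless of which rows are involved."""
    return TABLE_TO_SHARD[table]


def horizontal_shard(user_id: int) -> str:
    """Pick a database by user, regardless of which table is involved."""
    # A stable hash keeps a given user on the same shard across processes and restarts.
    digest = hashlib.sha256(str(user_id).encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(HORIZONTAL_SHARDS)
    return HORIZONTAL_SHARDS[index]


if __name__ == "__main__":
    print(vertical_shard("orders"))
    print(horizontal_shard(42))
```

With the vertical scheme, a query for one user's orders and invoices goes to whichever database holds those tables. With the horizontal scheme, it goes to whichever shard holds that user.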

Thanks to its simplicity, in this post we’ll implement a Delta Lake-inspired serverless ACID database in 500 lines of Go code with zero dependencies.

PutIfAbsent maps nicely to API features available in S3, Azure, and Google Cloud Storage, among others.

  Phil Eaton
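
To show why a put-if-absent primitive is enough to serialize commits, here is a small sketch. It is not the code from the post (which is written in Go): the in-memory store stands in for a real object store, and the class, function, and key names are hypothetical. Each writer tries to create the log entry for the next version, and only one of them can succeed.

```python
import threading
from typing import Dict


class InMemoryObjectStore:
    """Stand-in for an object store that supports conditional writes (an assumption)."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}
        self._lock = threading.Lock()

    def put_if_absent(self, key: str, data: bytes) -> bool:
        """Write data under key only if the key does not exist; report whether we wrote."""
        with self._lock:
            if key in self._objects:
                return False
            self._objects[key] = data
            return True


def try_commit(store: InMemoryObjectStore, version: int, payload: bytes) -> bool:
    """Attempt to claim a transaction-log slot; False means another writer got there first."""
    key = f"_log/{version:020d}"  # hypothetical key layout: zero-padded version numbers
    return store.put_if_absent(key, payload)


if __name__ == "__main__":
    store = InMemoryObjectStore()
    print(try_commit(store, 1, b"writer A"))  # first writer claims version 1
    print(try_commit(store, 1, b"writer B"))  # second writer loses and must retry
```

A writer that loses the race would typically re-read the log, check for conflicts, and retry against the next version number.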

If your API has been quietly delivering five nines, and you add an SLO with a target of three nines, you’re gonna have issues.

  Niall Murphy
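
The size of that gap is easy to underestimate, so here is the arithmetic as a short sketch. The 30-day window and the function name are assumptions made for illustration.

```python
def allowed_downtime_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of full downtime an availability target permits over the window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (1 - slo_percent / 100)


if __name__ == "__main__":
    three_nines = allowed_downtime_minutes(99.9)    # roughly 43.2 minutes per 30 days
    five_nines = allowed_downtime_minutes(99.999)   # roughly 0.43 minutes per 30 days
    print(f"three nines: {three_nines:.2f} min")
    print(f"five nines:  {five_nines:.2f} min")
    print(f"ratio: {three_nines / five_nines:.0f}x")
```

A three-nines target permits about 100 times more downtime than the five nines the API has actually been delivering.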

Those .io domains seemed super cool, but maybe not so much now. If your company depends on one, especially for a public API endpoint, it’s probably about time to get a fallback domain lined up.

  Vivek Naskar

Cloudflare built an automated workflow processor on Temporal to handle routine failures, reducing toil.

  Opeyemi Onikute — Cloudflare

It’s hard enough handling certificate expiry yearly, but this article introduced me to the fact that browser root programs are pushing for standardization on 3-month certificates.

  Krupa Patil — Security Boulevard

SRE Weekly Issue #445

A message from our sponsor, FireHydrant:

FireHydrant has acquired Blameless! The addition of Blameless’ enterprise capabilities combined with FireHydrant’s platform creates the most comprehensive enterprise incident management solution in the market.

https://firehydrant.com/blog/press-release-firehydrant-acquires-blameless-to-further-solidify-enterprise/

Providing incident resolution times to customers is an unneeded stress for responders with very little gain.

  Robert Ross — FireHydrant

I can’t tell you how many times I’ve found myself lost in thought, wondering how something like EBS works. While this isn’t an architecture overview, it does contain a bunch of juicy tidbits. I especially like the bit about the value of a “full stack engineer”.

  Marc Olson — All Things Distributed

This article explains how to use eBPF to gather observability data, including an example eBPF program and instructions on how to run it.

   Kranthi Kiran Erusu — DZone

Netflix uses multiple kinds of data stores. It was difficult for developers to manage the differences between data stores, so they wrote an abstraction layer.

Our goal was to build a versatile and efficient data storage solution that could handle a wide variety of use cases, ranging from the simplest hashmaps to more complex data structures, all while ensuring high availability, tunable consistency, and low latency.

  Vidhya Arvind, Rajasekhar Ummadisetty, Joey Lynch, and Vinay Chella — Netflix

This post looks at the challenges of predicting capacity in a global CDN, including dealing with uncertainties in customer growth, traffic routing, hardware failure, and more.

  Curt Robords — Cloudflare

GitHub tells us about the tools they use to improve reliability and performance, including Scientist and Flipper.

  Nick Hengeveld — GitHub

If you’re heavily action-item-oriented like I used to be, this is a great read to get you thinking down a different path.

My coworker wrote this awesome script to update our various @team-oncall aliases in Slack automatically, following our PagerDuty on-call schedule. This one thing has already saved us so much in the way of toil, frustration, and missed notifications!

  Fred Hebert — Honeycomb

  Full disclosure: Honeycomb is my employer.

A production of Tinker Tinker Tinker, LLC