Search Results for "outages"

SRE Weekly Issue #484

This is really neat! They’ve developed a new approach to search that uses 3-letter “trigrams” rather than tokenizing words, making it especially well-suited to code search. It converts regular expressions to trigram searches behind the scenes.

  Dmitry Gruzd — GitLab
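To make the trigram idea concrete, here is a minimal sketch of my own, not the implementation from the article. It indexes every 3-character substring of each document, intersects the posting sets for a query's trigrams, and then confirms each candidate with a real substring check. It only handles literal queries (the article's regex-to-trigram conversion is not attempted), and the function names and sample documents are hypothetical.

```python
from collections import defaultdict


def trigrams(text):
    """Return the set of all 3-character substrings of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


def build_index(docs):
    """Map each trigram to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for gram in trigrams(text):
            index[gram].add(doc_id)
    return index


def search(index, docs, query):
    """Return the sorted ids of documents containing query as a literal substring."""
    grams = trigrams(query)
    if not grams:
        # Queries shorter than 3 characters cannot use the index, so scan everything.
        candidates = set(docs)
    else:
        # A match must contain every trigram of the query.
        candidates = set.intersection(*(index.get(g, set()) for g in grams))
    # Sharing trigrams does not guarantee a match, so verify each candidate.
    return sorted(d for d in candidates if query in docs[d])


# Hypothetical sample "repository".
docs = {
    "a.py": "def retry(conn):",
    "b.py": "conn.close()",
    "c.py": "return try_again",
}
index = build_index(docs)
matches = search(index, docs, "retry")  # only a.py contains all of ret, etr, try
```

The final verification step matters: the index narrows the candidate set cheaply, and the exact check removes documents that happen to contain the right trigrams in the wrong places.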

This article about LLMs is by a regularly featured author here in the newsletter. It’s not, strictly speaking, directly SRE-related, but I really got a lot out of it, so I’m including it anyway.

  Lorin Hochstein

This one explains the difference between a soft and hard dependency, why it matters, and how to use this information to improve reliability. I like the section on soft dependencies evolving into hard dependencies when you’re not looking.

  Teiva Harsanyi — The Coder Cafe
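Here is one way the soft/hard distinction might look in code. This is my own illustration rather than anything from the article, and the page, the two dependencies, and every name in it are hypothetical. The product lookup is treated as a hard dependency (its failure fails the request), while recommendations are soft (the page degrades without them).

```python
def render_product_page(get_product, get_recommendations):
    """Build a page from one hard dependency and one soft dependency."""
    # Hard dependency: without the product there is nothing to show,
    # so any exception is allowed to propagate to the caller.
    product = get_product()

    # Soft dependency: the page is still useful without recommendations,
    # so a failure here degrades the response instead of failing it.
    try:
        recommendations = get_recommendations()
    except Exception:
        recommendations = []

    return {"product": product, "recommendations": recommendations}


def broken_recommendations():
    raise TimeoutError("recommendation service timed out")


page = render_product_page(lambda: "coffee grinder", broken_recommendations)
```

If someone later removes the `try`/`except`, or adds code that assumes the list is never empty, the soft dependency has quietly become a hard one.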

In this post, we’ll walk through how we’re splitting apart our shared database into independently owned instances. We’ll explain how we defined the right boundaries, minimized risk during migrations, and built the tooling to make the process safe and scalable.

  Fabiana Scala and Tali Gutman — Datadog

At some point, the external dependencies which our systems rely on become so tightly coupled, large, and fundamental that should those foundations inevitably fail, that blame can actually go down in response to an incident.

This thought-provoking article explores why we’re more tolerant of outages from large tech companies like Google Cloud or Salesforce, and what this means for how we think about reliability engineering and incident response.

  Will Gallego

This practical guide shows how to use AWS Fault Injection Service (FIS) to perform chaos engineering experiments on self-managed Cassandra clusters. It walks through setting up experiments to test node failure scenarios and validate that applications can properly handle database outages through connection pooling and retry mechanisms.

  Hans Nesbitt and Lwanga Phillip — AWS

Klaviyo shares how they built an automated recovery system to handle billing usage tracking failures. The system uses S3 for data storage and SQS for message queuing to ensure that missed usage events are automatically recovered, eliminating manual intervention and reducing customer confusion.

  Kaavya Antony — Klaviyo

Final stretch! We’ve handled people and processes, now let’s crack the code side and stitch everything together into a four-stage framework you can reuse.

  Konstantin Rohleder — HelloFresh

SRE Weekly Issue #469

A message from our sponsor, incident.io:

Speed isn’t everything. We studied 100K+ incidents to find out what actually makes for good incident management—from detection to follow-up. You can now view the recording of our latest live event to get even more info on the benchmarks, insights, and real-life examples from the report.

https://go.incident.io/events/going-beyond-mttx

I’ve shared this article before, but it’s so critical that it’s time to give it another read. MTTR is a statistically useless metric, and by using it, we will draw faulty conclusions and potentially take harmful actions. Courtney Nash does a really great job of laying out the science in an easy-to-understand way.

  Courtney Nash — Resilience in Software Foundation / The VOID

I like the analogy here: when we say people are components in our sociotechnical systems, system diagrams are like a form of cache.

  Clint Byrum

From Werner Vogels’s intro to this article:

Andy takes us through S3’s evolution from simple object store to sophisticated data platform, illustrating how customer feedback has shaped every aspect of the service. It’s a fascinating look at how we maintain simplicity even as systems scale to handle hundreds of trillions of objects.

  Andy Warfield — Amazon

Instead of a traditional Cost/Performance/Reliability trade-off, this article argues that serverless presents a trade-off among Cost, Performance, and Complexity.

  Luc van Donkersgoed

Google uses System Theoretic Process Analysis to identify problems in their systems. They found that the most effective way to spread adoption of STPA was to build their own training program.

  Garrett Holthaus — Google

So far, I’m liking this new post series from Nextdoor about their efforts to scale their datastore. Here’s the first installment, about the things they’ve tried up to now.

I’ll share the rest of the series as I work my way through them.

  Slava Markeyev — Nextdoor

Wow, I had no idea EBS volumes failed this often!

  Nick Van Wiggeren — PlanetScale

SRE Weekly Issue #459

A message from our sponsor, incident.io:

Effective incident management demands coordination and collaboration to minimize disruptions. This guide by incident.io covers the full incident lifecycle—from preparation to improvement—emphasizing teamwork beyond engineering. By engineers, for engineers.

https://incident.io/guide

In a microservices environment, testing user journeys that span across multiple bounded contexts requires collaboration and a clear delineation of responsibilities.

  Yan Cui

These folks migrated from Fastly to Cloudflare using Terraform. They wrote a Go program to translate from their Fastly VCL configurations to an equivalent set of parameters to their Terraform module.

  hatappi1225 — Mercari

This 3-part series does a deep dive on how time and clocks work in distributed data stores. Part 2 is here and part 3 is here.

  Murat

TIL: “Unix time” (seconds since the epoch) does not include leap seconds.

  Kyle Kingsbury
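A quick way to see this for yourself, as a small sketch of my own rather than anything from the article: a leap second (23:59:60 UTC) was inserted at the end of 2016-12-31, so two SI seconds passed between 23:59:59 and the following midnight, yet the Unix timestamps differ by only one.

```python
from datetime import datetime, timezone

# The last ordinary second of 2016 and the first second of 2017, both in UTC.
before = datetime(2016, 12, 31, 23, 59, 59, tzinfo=timezone.utc)
after = datetime(2017, 1, 1, 0, 0, 0, tzinfo=timezone.utc)

# Unix time treats every day as exactly 86400 seconds long, so the
# inserted leap second between these two instants is not counted.
elapsed_unix = after.timestamp() - before.timestamp()

# Midnight UTC always lands on a multiple of 86400 in Unix time.
midnight_remainder = int(after.timestamp()) % 86400
```

Here `elapsed_unix` comes out as 1.0 and `midnight_remainder` as 0, which is exactly what "does not include leap seconds" means in practice.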

This post argues that tech companies should avoid outages like Facebook’s in 2021 by using much more rigorous principles such as those used to build bridges. I’m not so sure about that, but it was an interesting read.

  Davi Ottenheimer

There’s a lot going on beneath the surface in a live video streaming service. Cloudflare walks us through it, including key design decisions like on-the-fly transcoding.

  Kyle Boutette and Jacob Curtis — Cloudflare

DSQL is Amazon’s new serverless PostgreSQL-compatible datastore.

Aurora DSQL is designed to remain available, durable, and strongly consistent even in the face of infrastructure failures and network partitions.

But what about the CAP Theorem? Click through to find out how.

  Marc Brooker

This new installment introduces the next level of resilience, which involves the ability to radically change your approach if the usual adaptation strategies fall short.

  Uwe Friedrichsen

SRE Weekly Issue #446

A message from our sponsor, FireHydrant:

If the entire team is on a Zoom bridge during an incident – how do you know what really happened and when? We added real-time Zoom/Google Meet transcripts to make sure your incident timeline has every detail.

https://firehydrant.com/ai/

This one is a direct response to an article by Lorin Hochstein from a couple weeks back. There’s a lot here to think about, and it’s really great to see the back-and-forth discussion.

  Chris Evans — incident.io

A tour through the design of S3 by its VP. I found the discussion of managing “heat” (I/O load) especially interesting.

  Andy Warfield — Amazon

This one introduced me to a new concept: vertical vs horizontal sharding. Vertical sharding splits data by whole tables, while horizontal sharding splits it by related rows across tables, as with users or groups of users.

  Suleiman Dibirov
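To pin down the difference, here is a tiny routing sketch of my own. The table names, shard names, and the modulo scheme are all hypothetical and not taken from the article.

```python
# Vertical sharding: each whole table lives on one database.
VERTICAL_SHARDS = {"users": "db-a", "orders": "db-b"}  # hypothetical layout

# Horizontal sharding: rows belonging to the same user, from every table,
# live together on one of several databases.
HORIZONTAL_SHARDS = ["db-0", "db-1", "db-2"]  # hypothetical shard names


def vertical_shard(table):
    """Pick the database by table name alone."""
    return VERTICAL_SHARDS[table]


def horizontal_shard(user_id):
    """Pick the database by the owning user, regardless of table."""
    return HORIZONTAL_SHARDS[user_id % len(HORIZONTAL_SHARDS)]
```

With this layout, every query against `orders` goes to `db-b` under vertical sharding, while under horizontal sharding user 7's rows in any table all go to `db-1`.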

Thanks to its simplicity, in this post we’ll implement a Delta Lake-inspired serverless ACID database in 500 lines of Go code with zero dependencies.

PutIfAbsent maps nicely to API features available in S3, Azure, and Google Cloud Storage, among others.

  Phil Eaton
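For a feel of what a put-if-absent primitive gives you, here is a local-filesystem analogue that I put together. It is not the article's Go code and it does not call any cloud storage API; it relies on Python's exclusive-creation file mode, and the function name and log file name are hypothetical.

```python
import os
import tempfile


def put_if_absent(directory, key, data):
    """Write data under key only if key does not exist yet; return True on success."""
    path = os.path.join(directory, key)
    try:
        # Mode "x" requests exclusive creation and fails if the file already exists.
        with open(path, "xb") as f:
            f.write(data)
        return True
    except FileExistsError:
        return False


with tempfile.TemporaryDirectory() as log_dir:
    # Two writers race to commit the same log entry; only the first wins.
    first = put_if_absent(log_dir, "00001.json", b'{"txn": "A"}')
    second = put_if_absent(log_dir, "00001.json", b'{"txn": "B"}')
    with open(os.path.join(log_dir, "00001.json"), "rb") as f:
        committed = f.read()
```

The losing writer learns that it lost (`second` is False) and can retry with the next log entry, which is the kind of building block a transaction log needs.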

If your API has been quietly delivering five nines, and you add an SLO with a target of three nines, you’re gonna have issues.

  Niall Murphy
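Some back-of-the-envelope arithmetic of my own shows the size of that gap. The 30-day window is an assumption I picked for illustration, and the function name is hypothetical.

```python
def allowed_downtime_minutes(target, window_days=30):
    """Minutes of full downtime permitted by an availability target over the window."""
    return (1 - target) * window_days * 24 * 60


three_nines = allowed_downtime_minutes(0.999)    # roughly 43.2 minutes per 30 days
five_nines = allowed_downtime_minutes(0.99999)   # roughly 0.43 minutes (about 26 seconds)
ratio = three_nines / five_nines                 # roughly 100x more permitted downtime
```

If your users have built their own systems around the five nines you happened to deliver, a three-nines SLO gives you permission to be down about a hundred times longer than they have ever seen.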

Those .io domains seemed super cool, but maybe not so much now. If your company depends on one, especially for a public API endpoint, it’s probably about time to get a fallback domain lined up.

  Vivek Naskar

Cloudflare built an automated workflow processor on Temporal to handle routine failures, reducing toil.

  Opeyemi Onikute — Cloudflare

It’s hard enough handling certificate expiry yearly, but this article introduced me to the fact that browser root programs are pushing for standardization on 3-month certificates.

  Krupa Patil — Security Boulevard

SRE Weekly Issue #442

A message from our sponsor, FireHydrant:

FireHydrant has acquired Blameless! The addition of Blameless’ enterprise capabilities combined with FireHydrant’s platform creates the most comprehensive enterprise incident management solution in the market.

https://firehydrant.com/blog/press-release-firehydrant-acquires-blameless-to-further-solidify-enterprise/

Here’s a hands-on evaluation of the SLO offerings of three big players in the space. The author includes screenshots of their tests and shares their opinions on each.

  Alex Ewerlöf

🔥🔥🔥  Can calling yourself an SRE be a liability?

  rachelbythebay

This article outlines some options for combining multiple SLIs together. I like the emphasis on ensuring that the result provides a useful overview without sacrificing too much.

  Ali Sattari

Lorin Hochstein proposes a rubric for judging whether a company truly is “safety first” in terms of preventing outages.

  Lorin Hochstein

In this blog, we’ll present four strategies for successfully managing reliability while adopting Kubernetes.

  Andre Newman — Gremlin

I haven’t seen a migration like this before. They managed a slow transition from an old system to a new one, keeping data in sync even though the two applications had entirely different database systems.

  Claudio Guidi and Giovanni Cuccu — DZone

[…] what if instead of spending 20 years developing various approaches to dealing with asynchronous IO (e.g. async/await), we had instead spent that time making OS threads more efficient, such that one wouldn’t need asynchronous IO in the first place?

  Yorick Peterse

I love a multi-level complex failure.

[…] during this disruption, a secondary issue caused automated failover to not work, rendering the entire metadata storage unavailable despite two other healthy zones being available.

  Google

A production of Tinker Tinker Tinker, LLC