Search Results for "outages"

SRE Weekly Issue #412

A message from our sponsor, FireHydrant:

FireHydrant’s new and improved MTTX analytics dashboard is here! See which services are most affected by incidents, where they take the longest to detect (or acknowledge, mitigate, resolve … you name it), and how metrics and statistics change over time.
https://firehydrant.com/blog/mttx-incident-analytics-to-drive-your-reliability-roadmap/

Can a single dashboard that covers your entire system really exist?

  Jamie Allen

This one makes the case for having a group of specially-trained incident commanders to handle SEV-1 (worst-case) outages, separate from your normal ICs.

  Jonathan Word

This article lays out a strategy for gaining buy-in by making three specific, sequential arguments.

  Emily Arnott — Blameless

This article explores the varying ways that SRE is implemented through a set of 4 archetypes.

  Alex Ewerlöf

It turns out that assigning ephemeral ports to connections in Linux is way more complicated than it might seem at first glance, and there’s room for optimization, as this article explains.

  Frederick Lawler — Cloudflare
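If you want to poke at the default behavior yourself before reading, here's a quick Python sketch (mine, not from the article): on Linux, an un-bound connect() gets its source port from the range in /proc/sys/net/ipv4/ip_local_port_range.

    import socket

    # Read the ephemeral port range the kernel draws from (Linux only).
    with open("/proc/sys/net/ipv4/ip_local_port_range") as f:
        low, high = map(int, f.read().split())

    # Connect to a loopback listener without bind()ing a source port first;
    # the kernel assigns the ephemeral source port at connect() time.
    listener = socket.socket()
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)

    client = socket.socket()
    client.connect(listener.getsockname())
    print(f"range {low}-{high}, kernel chose source port {client.getsockname()[1]}")

    client.close()
    listener.close()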

While deploying Precision Time Protocol (PTP) at Meta, we’ve developed a simplified version of the protocol (Simple Precision Time Protocol – SPTP) that can offer the same level of clock synchronization as unicast PTPv2 more reliably and with fewer resources.

  Oleg Obleukhov and Ahmad Byagowi — Meta

Far more than just a list of links, this article gives an overview of each topic before pointing you in the right direction for more information.

  Fred Hebert

Building on the groundwork laid out in our first article about the initial steps in Incident Management (IM) at Dyninno Group, this second installment will explore the practicalities of streamlining and implementing these strategies.

  Vladimirs Romanovskis

SRE Weekly Issue #410

A message from our sponsor, FireHydrant:

How many seats are you paying for in your legacy alerting tool that rarely get paged? With Signals’ bucket pricing, you only pay for what you use. Join the beta for a better tool at a better price.
https://firehydrant.com/blog/signals-beta-live/

In this blog post, we describe the journey DoorDash took using a service mesh to realize data transfer cost savings without sacrificing service quality.

  Hochuen Wong and Levon Stepanian — DoorDash

When just a few “regulars” are called in to handle every incident, you’ve got a knowledge gap to fill in your organization.

  David Ridge — PagerDuty

Dropbox expands into new datacenters often, so they have a streamlined and detailed process for choosing datacenter vendors.

  Edward del Rio — Dropbox

This is either nine things that could derail your SRE program, or a list of things to do with “not” in front of them — either way, it’s a good list.

  Shyam Venkat

We need enough alerting in our systems that we can detect lurking anomalies, but not so much that we get alert fatigue.

  Dennis Henry

A post about the importance of product in SRE, and how to make product and SRE first-class citizens in your Software Development Lifecycle.

  Jamie Allen

A relatively minor incident took a turn for the worse after the pilots attempted a close fly-by in an attempt to resolve it. I swear I’ve been in this kind of incident before, where I took risks significantly out of proportion to the problem I was trying to solve.

  Kyra Dempsey (Admiral Cloudberg)

SRE Weekly Issue #399

A message from our sponsor, FireHydrant:

Severity levels help responders and stakeholders understand the incident impact and set expectations for the level of response. This can mean jumping into action faster. But first, you have to ensure severity is actually being set. Here’s one way.
https://firehydrant.com/blog/incident-severity-why-you-need-it-and-how-to-ensure-its-set/

This research paper summary goes into Mode Error and the dangers of adding more features to a system in the form of modes, especially if the system can change modes on its own.

  Fred Hebert (summary)
  Dr. Nadine B. Sarter (original paper)

Cloudflare suffered a power outage in one of the datacenters housing their control and data planes. The outage itself is intriguing, and in its aftermath, Cloudflare learned that their system wasn’t as HA as they thought.

Lots of great lessons here, and if you want more, they posted another incident writeup recently.

  Matthew Prince — Cloudflare

Separating write from read workloads can increase complexity but also open the door to greater scalability, as this article explains.

  Pier-Jean Malandrino
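To make the shape of the idea concrete, here's a toy Python sketch (mine, not the article's; the shared in-memory SQLite database stands in for a real primary/replica pair): writes go to one connection, reads are spread across the others, and with real replication the reads may be stale.

    import random
    import sqlite3

    # Stand-in infrastructure: one shared in-memory SQLite database opened via
    # several connections, so the "replicas" see the "primary"'s writes. In a
    # real deployment these would be separate primary and replica servers.
    uri = "file:rw_split_demo?mode=memory&cache=shared"
    primary = sqlite3.connect(uri, uri=True)
    replicas = [sqlite3.connect(uri, uri=True) for _ in range(2)]

    def execute_write(statement, params=()):
        # All mutations go to the single write primary.
        with primary:
            return primary.execute(statement, params)

    def execute_read(statement, params=()):
        # Reads are spread across replicas; with real replication they may lag
        # the primary, which callers have to tolerate.
        return random.choice(replicas).execute(statement, params).fetchall()

    execute_write("CREATE TABLE IF NOT EXISTS orders (id INTEGER, total REAL)")
    execute_write("INSERT INTO orders VALUES (?, ?)", (1, 9.99))
    print(execute_read("SELECT * FROM orders"))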

Covers four strategies for load shedding, with code examples:

  • Random Shedding
  • Priority-Based Shedding
  • Resource-Based Shedding
  • Node Isolation

  Code Reliant
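As a taste of the second strategy on that list, here's a tiny Python sketch (mine, not from the article; the threshold and priority values are made up): once too many requests are in flight, anything below a given priority gets rejected.

    import threading

    class PriorityShedder:
        """Shed low-priority work when too many requests are in flight."""

        def __init__(self, max_in_flight=100, min_priority_when_busy=5):
            self._lock = threading.Lock()
            self._in_flight = 0
            self.max_in_flight = max_in_flight
            self.min_priority_when_busy = min_priority_when_busy

        def try_acquire(self, priority):
            """Return True if the request may proceed, False if it is shed."""
            with self._lock:
                overloaded = self._in_flight >= self.max_in_flight
                if overloaded and priority < self.min_priority_when_busy:
                    return False  # overloaded and low priority: shed it
                self._in_flight += 1
                return True

        def release(self):
            with self._lock:
                self._in_flight -= 1

    shedder = PriorityShedder(max_in_flight=2)
    assert shedder.try_acquire(priority=1)       # accepted: not yet overloaded
    assert shedder.try_acquire(priority=1)       # accepted: now at the limit
    assert not shedder.try_acquire(priority=1)   # shed: overloaded, low priority
    assert shedder.try_acquire(priority=9)       # accepted: high priority gets through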

Lots of juicy details about the three outages, including a link to AWS’s write-up of their Lambda outage in June.

  Gergely Orosz

The diagrams in this article are especially useful for understanding how the circuit-breaker pattern works.

  Pier-Jean Malandrino
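If you'd like the pattern in code as well as diagrams, here's a minimal Python sketch (mine, not from the article): the breaker trips open after repeated failures, rejects calls for a cooldown period, then lets a trial call through.

    import time

    class CircuitBreaker:
        """Minimal circuit breaker: closed -> open -> (after cooldown) half-open."""

        def __init__(self, failure_threshold=3, reset_timeout=30.0):
            self.failure_threshold = failure_threshold
            self.reset_timeout = reset_timeout
            self.failures = 0
            self.opened_at = None  # None means the circuit is closed

        def call(self, fn, *args, **kwargs):
            if self.opened_at is not None:
                if time.monotonic() - self.opened_at < self.reset_timeout:
                    raise RuntimeError("circuit open, request rejected")
                # Cooldown elapsed: half-open, let one trial call through.
            try:
                result = fn(*args, **kwargs)
            except Exception:
                self.failures += 1
                if self.failures >= self.failure_threshold:
                    self.opened_at = time.monotonic()  # trip (or re-trip) open
                raise
            else:
                self.failures = 0
                self.opened_at = None  # success closes the circuit
                return result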

This one’s about how on-call can go bad, and how to structure your team’s on-call so that it’s livable and sustainable.

  Michael Hart

Execs cast a big shadow in an incident, so it’s important to have a plan for how to communicate with them, as this article explains.

  Ashley Sawatsky — Rootly

SRE Weekly Issue #373

A message from our sponsor, Rootly:

Rootly is hiring for a Sr. Developer Relations Advocate to continue helping more world-class companies like Figma, NVIDIA, and Squarespace accelerate their incident management journey. They’re looking for previous on-call engineers with a passion for making the world a more reliable place. Learn more:

https://rootly.com/careers?gh_jid=4015888007

Articles

Datadog posted a report on their major outage in March, and it’s a doozy. An unattended updates system that they didn’t even want, need, or know about triggered across all hosts in multiple clouds nearly simultaneously, causing a regression.

  Alexis Lê-Quôc — Datadog

GitHub has had a string of apparently unrelated outages recently, and they’ve posted this description.

  Mike Hanley — GitHub

Oh look, another awesome-* repo relevant to our interests!

A repo of links to articles, papers, conference talks, and tooling related to load management in software services: load shedding, circuit breaking, quota management, and throttling. PRs welcome.

  Laura Nolan and Niall Murphy — Stanza Systems

This interview covers a lot of ground including looking beyond just “up or down” when considering reliability.

  Prathamesh Sonpatki — SRE Stories

If you’re in the mood for a deep systems debugging story, you’re in for a treat. The author takes you along for the ride with a wealth of detailed code snippets.

  Tycho Andersen — Netflix

Regardless of the replication mechanism, you must fsync() your data to prevent global data loss in non-Byzantine protocols.

  Denis Rystsov and Alexander Gallego — Redpanda
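A minimal Python illustration of the point (my sketch, not Redpanda's code): a write that has only reached the OS page cache can vanish on power loss, so you flush the userspace buffer and then fsync() before acknowledging.

    import os

    def durable_append(path, record: bytes):
        """Append a record and return only once it has reached stable storage."""
        with open(path, "ab") as f:
            f.write(record)
            f.flush()             # push Python's userspace buffer into the kernel
            os.fsync(f.fileno())  # force the kernel to flush the page cache to disk
        # Only now is it safe to acknowledge the write to a client or peer.

For a newly created file, the containing directory needs an fsync() of its own before the file's existence is durable.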

Emotional intelligence is a critical skill for SREs, especially when we interact with other teams in fraught situations.

  Amin Astaneh — Certo Modo

Wow! Spotify created a set of tools to perform automated refactoring of thousands of repositories at once. This includes the ability to run tests, automatically merge pull requests without human review, and roll refactorings out gradually.

  Matt Brown — Spotify

Jeli has published a one-page cheat sheet for their highly detailed Howie guide for running incident retrospectives.

  Jeli

SRE Weekly Issue #345

SRE Weekly is now on Mastodon at @SREWeekly@social.linux.pizza! Follow to get notified of each new issue as it comes out.

This replaces the Twitter account @SREWeekly, which I am now retiring in favor of Mastodon. For those of you following @SREWeekly on Twitter, you’ll need to choose a different way to get notified of new issues. If Mastodon isn’t your jam, try RSS or a straight email subscription (by filling out the form at sreweekly.com).

A message from our sponsor, Rootly:

Manage incidents directly from Slack with Rootly 🚒.

Rootly automates manual tasks like creating an incident channel, Jira ticket, and Zoom rooms, inviting responders, creating statuspage updates, postmortem timelines, and more. Want to see why companies like Canva and Grammarly love us?

https://rootly.com/demo/

Articles

Don’t beat yourself up! This is like another form of blamelessness.

  Robert Ross — FireHydrant + The New Stack

In this article, I will share with you how setting up passive guardrails in and around developer workflows can reduce the frequency and severity of incidents and outages.

  Ash Patel — SREPath

This conference talk summary outlines the three main lessons Jason Cox learned as director of SRE at Disney.

  Shaaron A Alvares — InfoQ

Here’s a look at how Meta has structured its Production Engineer role, their name for SREs.

  Jason Kalich — Meta

Bit-flips caused by cosmic rays seem incredibly rare, but they become more likely as we make circuits smaller and our infrastructures larger.

  Chris Baraniuk — BBC

Cloudflare shares details about their 87-minute partial outage this past Tuesday.

  John Graham-Cumming — Cloudflare

In reaction to a major outage, these folks revamped their alerting and incident response systems. Here’s what they changed.

  Vivek Aggarwal — Razorpay

The author of this post sought to test a simple algorithm from a research paper that purported to reduce tail latency. Yay for independent verification!

  Marc Brooker
