Search Results for “outages” – Page 2

SRE Weekly Issue #437

lex

August 12, 2024

This week’s issue is entirely focused on the CrowdStrike incident: more details on what happened, analysis, and learnings. I’ll be back next week with a selection of all of the great stuff you folks have been writing while I’ve been off on vacation for the past two weeks—my RSS reader is packed with awesomeness!

CrowdStrike External Technical Root Cause Analysis — Channel File 291

This week, CrowdStrike posted quite a bit more detail about what happened on July 19. The short of it seems to be an argument count mismatch, but as with any incident of this sort, there are multiple contributing factors.
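To make "argument count mismatch" concrete, here is a minimal, generic sketch of that class of bug and a defensive check against it. This is not CrowdStrike's code: the function name, the field counts, and the rule format are all hypothetical and chosen only for illustration.

```python
# Hypothetical illustration: a rule refers to input fields by position, and the
# caller supplies a list of values. If the rule expects more fields than the
# caller provides, an unchecked read goes past the end of the list.

def evaluate_rule(expected_field_count, inputs, criteria):
    """Return True if every (index, wanted_value) pair in criteria matches.

    Raises ValueError instead of reading out of bounds when the number of
    supplied inputs does not match what the rule definition expects.
    """
    if len(inputs) != expected_field_count:
        raise ValueError(
            f"rule expects {expected_field_count} inputs, got {len(inputs)}"
        )
    for index, wanted in criteria:
        if not 0 <= index < len(inputs):
            raise ValueError(f"criterion refers to missing input {index}")
        if inputs[index] != wanted:
            return False
    return True
```

In a memory-safe language the mismatch surfaces as an exception; in kernel-mode native code the same unchecked read can crash the whole machine, which is why validating counts before indexing matters so much there.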

The report also continues the conversation about the use of kernel mode in a product such as this, amounting to a public conversation with Microsoft that is intriguing to watch from the outside.

CrowdStrike

The biggest-ever global outage: lessons for software engineers

This article has some interesting details about antitrust regulations(!) related to security vendors running code in kernel mode. There’s also an intriguing story of a very similar crash on Linux endpoints running CrowdStrike’s Falcon.

Note: this one is from a couple of weeks ago and some of its conjectures don’t quite line up with details that have been released in the interim.

Gergely Orosz

Staged rollouts of things still have limitations

While it mentions the CrowdStrike incident only in vague terms, this article discusses why slowly rolling out updates isn’t a universal solution and can bring its own problems.

Chris Siebenmann

Feedback on feed stuff and those pesky blue screens

Some thoughts on staged rollouts and the CrowdStrike outage:

The notion we tried to get known far and wide was “nothing goes everywhere at once”.

Note that this post was published before CrowdStrike’s RCA, which subsequently confirmed that their channel file updates were not deployed through staged rollouts.

rachelbythebay

Expect it most when you expect it least

[…] there may be risks in your system that haven’t manifested as minor outages.

Jumping off from the CrowdStrike incident, this one asks us to look for reliability problems in parts of our infrastructure that we’ve grown to trust.

Lorin Hochstein

CrowdStrike: how did we get here?

While CrowdStrike’s RCA has quite a bit of technical detail, this post reminds us that we need a lot more context to really understand how an incident came to be.

Lorin Hochstein

No More Blue Fridays

In the future, computers will not crash due to bad software updates, even those updates that involve kernel code. In the future, these updates will push eBPF code.

I didn’t realize that Microsoft is working on eBPF for Windows.

Brendan Gregg

Lessons from Crowdstrike’s outage

This post isn’t about what Crowdstrike should have done. Instead, I use the resources to provide context and takeaways we can apply to our teams and organizations.

Bob Walker — Octopus Deploy

SRE Weekly Issue #434

lex

July 21, 2024

General

Comments

View on sreweekly.com

Technical Details: Falcon Update for Windows Hosts

The big news this week, of course, is the CrowdStrike-related series of outages in airports, banks, and many other businesses. Here’s their statement on the situation.

Rumor has it that Southwest Airlines survived because they run Windows 3.1. Well, that’s one way to do it.

CrowdStrike

Take the Annual SRE Survey

It’s time for Catchpoint’s annual SRE survey again! We get a lot of interesting information about SRE trends from this, so it’d be great if you could take a moment to fill it out.

Note, usually I try to avoid giving you “utm” stuff in links, but this link is specifically set up to track whether folks come from SRE Weekly, so I left it in this time.

Catchpoint

You don’t always need a queue

Queues have a cost, as this article explains.

Jean-Mark Wright

Deploy on Friday? How About Destroy on Friday! A Chaos Engineering Experiment – Part 1

I wrote this article about an exciting project I led recently: taking down an entire availability zone in production to test reliability. Part 2 is due out next week!

Lex Neva — Honeycomb

Full disclosure: Honeycomb is my employer.

How to prevent accidental load balancer deletions

Deletion protection: it can really save you!

Andre Newman — Gremlin

A Look Into Netflix System Architecture

A thorough overview of Netflix’s architecture, with focus on data stores, content processing, billing, and the CDN, among other topics.

Rahul Shivalkar — ClickIT

Degradation vs disruption

This article compares the terms “degradation”, “disruption”, and “service outage” through the lens of service levels.

Alex Ewerlöf

Enhancing cloud storage efficiency with s3-batch-object-store

Their workload involved writing many small objects but reading very few. By batching many writes into a single object in S3, they saved a ton of money, and now they’re open sourcing their solution.
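The batching idea can be sketched in a few lines. This is an in-memory illustration under assumed names, not the s3-batch-object-store API: `pack` and `read_one` are hypothetical helpers, and in a real deployment the blob would be uploaded as one S3 object and `read_one` would become a ranged read.

```python
# Hypothetical sketch: many small payloads are concatenated into one blob,
# and an index records where each payload lives so it can be read back alone.

def pack(objects):
    """Concatenate payloads; return (blob, index) with index[key] = (offset, length)."""
    index = {}
    chunks = []
    offset = 0
    for key, payload in objects.items():
        index[key] = (offset, len(payload))
        chunks.append(payload)
        offset += len(payload)
    return b"".join(chunks), index


def read_one(blob, index, key):
    """Fetch a single payload by slicing the byte range recorded in the index."""
    offset, length = index[key]
    return blob[offset:offset + length]
```

One write request now carries many objects, which is where the savings come from when writes vastly outnumber reads. The trade-off is that the index has to be stored somewhere and consulted on every read.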

Pablo Matias Gomez — Embrace

SRE Weekly Issue #422

lex

April 28, 2024

PIOSEE Decision Model and preparations for critical situations

The PIOSEE model is taught to pilots as a rubric for coming to a decision in a difficult aviation situation. As this article explains, we can also use it during IT incidents.

Francisco Melo Jr.

Solving Observability’s Cardinality Conundrum

What is high cardinality in monitoring systems? Here’s an excellent explanation that includes tips on how to manage cardinality.

Ash P — SREPath

Building a customer-focused Observability Maturity Model

As Xero transitioned to a standard of “you build it you run it”, suddenly more engineering teams were responsible for knowing about and implementing observability. They designed this maturity model to help teams understand what they were aiming for and how to get there.

Andrew Macdonald — Xero

The invisible seafaring industry that keeps the internet afloat

With around 200 undersea fiber cuts worldwide per year, a fleet of ships is at the ready to pull up the cables and repair them.

Josh Dzieza — The Verge

Major data center power failure (again): Cloudflare Code Orange tested

Last year, Cloudflare suffered a control plane outage when one of their datacenters lost power. They have since done significant work to improve their resilience to power outages, and it was put to the test when the same datacenter lost power again.

Matthew Prince, John Graham-Cumming, and Jeremy Hartman — Cloudflare

How the Platform team became effective in working remotely

Going from non-remote to remote was challenging, but here’s how our team changed as we began working from home.

Stefan Mikolajczyk — WeTransfer

The Platform Empathy Gap

Platform teams have a hugely important role to fill in the engineering organization. They are often the teams that enable other teams to move with more speed and safety. They can also quickly become disconnected from their customers.

Ross Brodbeck

Graceful Degradation and SLOs

When your system successfully serves a degraded response to the customer, how should you count that toward your SLO? Is it success? Failure? Something in between?
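One way to see what is at stake in that question is to compute the same availability figure under different answers. The sketch below is an assumption for illustration only, not a method taken from the article: `degraded_weight` is a hypothetical knob, where 1.0 counts a degraded response as a full success and 0.0 counts it as a failure.

```python
# Hypothetical sketch: an availability SLI in which degraded responses earn
# partial credit controlled by degraded_weight (between 0.0 and 1.0).

def availability(ok, degraded, failed, degraded_weight):
    """Return the fraction of requests counted as good."""
    if not 0.0 <= degraded_weight <= 1.0:
        raise ValueError("degraded_weight must be between 0.0 and 1.0")
    total = ok + degraded + failed
    if total == 0:
        raise ValueError("no requests to measure")
    return (ok + degraded_weight * degraded) / total
```

With 9,700 full successes, 200 degraded responses, and 100 failures, the result is 0.99 if degraded counts as success, 0.97 if it counts as failure, and 0.98 at a weight of 0.5. Against a 99% objective, that choice alone decides whether the target was met.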

Niall Murphy

SRE Weekly Issue #412

lex

February 18, 2024

The Single Pain of Glass

Can a single dashboard that covers your entire system really exist?

Jamie Allen

The importance of SEV-1 call leaders

This one makes the case for having a group of specially-trained incident commanders to handle SEV-1 (worst-case) outages, separate from your normal ICs.

Jonathan Word

Getting Buy-in from Management on Reliability Investments

This article lays out a strategy for gaining buy-in by making three specific, sequential arguments.

Emily Arnott — Blameless

SRE Archetypes

This article explores the varying ways that SRE is implemented through a set of 4 archetypes.

Alex Ewerlöf

connect() – why are you so slow?

It turns out that assigning ephemeral ports to connections in Linux is way more complicated than it might seem at first glance, and there’s room for optimization, as this article explains.

Frederick Lawler — Cloudflare

Simple Precision Time Protocol at Meta

While deploying Precision Time Protocol (PTP) at Meta, we’ve developed a simplified version of the protocol (Simple Precision Time Protocol – SPTP) that can offer the same level of clock synchronization as unicast PTPv2 more reliably and with fewer resources.

Oleg Obleukhov and Ahmad Byagowi — Meta

A Distributed Systems Reading List

Far more than just a list of links, this article gives an overview of each topic before pointing you in the right direction for more information.

Fred Hebert

Streamlining and Implementing Incident Management at Dyninno

Building on the groundwork laid out in our first article about the initial steps in Incident Management (IM) at Dyninno Group, this second installment will explore the practicalities of streamlining and implementing these strategies.

Vladimirs Romanovskis

SRE Weekly Issue #410

lex

February 4, 2024

Staying in the Zone: How DoorDash used a service mesh to manage data transfer, reducing hops and cloud spend

In this blog post, we describe the journey DoorDash took using a service mesh to realize data transfer cost savings without sacrificing service quality.

Hochuen Wong and Levon Stepanian — DoorDash

APAC Retrospective: Learnings from a Year of Tech Outages – Dismantling Knowledge Silos

When just a few “regulars” are called in to handle every incident, you’ve got a knowledge gap to fill in your organization.

David Ridge — PagerDuty

How the data center site selection process works at Dropbox

Dropbox expands into new datacenters often, so they have a streamlined and detailed process for choosing datacenter vendors.

Edward del Rio — Dropbox

Untangle Blockers that impede Site Reliability Engineering (SRE) adoption.

This is either nine things that could derail your SRE program, or a list of things to do with “not” in front of them — either way, it’s a good list.

Shyam Venkat

Beyond Debugging: Harnessing Preattentive Processes in Incident Response

We need enough alerting in our systems that we can detect lurking anomalies, but not so much that we get alert fatigue.

Dennis Henry

SRE and Product

A post about the importance of product in SRE, and how to make product and SRE first-class citizens in your Software Development Lifecycle.

Jamie Allen

Panic on the Schoolyard: The Merion midair collision (death of Senator John Heinz)

A relatively minor incident took a turn for the worse after the pilots attempted a close fly-by to resolve it. I swear I’ve been in this kind of incident before, where I took risks significantly out of proportion to the problem I was trying to solve.

Kyra Dempsey (Admiral Cloudberg)