
SRE Weekly Issue #437

This week’s issue is entirely focused on the CrowdStrike incident: more details on what happened, analysis, and learnings. I’ll be back next week with a selection of all of the great stuff you folks have been writing while I’ve been off on vacation for the past two weeks. My RSS reader is packed with awesomeness!

A message from our sponsor, FireHydrant:

Migrate off of PagerDuty, save money, and then have all of your configuration exported as Terraform modules? We did that. We know one of the hardest parts of leaving a legacy tool is the old configuration. That’s why we dedicated time to build the Signals migrator, making it easy to switch.

https://firehydrant.com/blog/speedrun-to-signals-automated-migrations-are-here/

This week, CrowdStrike posted quite a bit more detail about what happened on July 19. The short of it seems to be an argument count mismatch, but as with any incident of this sort, there are multiple contributing factors.

The report also continues the conversation about the use of kernel mode in a product such as this, amounting to a public conversation with Microsoft that is intriguing to watch from the outside.

  CrowdStrike

This article has some interesting details about antitrust regulations(!) related to security vendors running code in kernel mode. There’s also an intriguing story of a very similar crash on Linux endpoints running CrowdStrike’s Falcon.

Note: this one is from a couple of weeks ago and some of its conjectures don’t quite line up with details that have been released in the interim.

  Gergely Orosz

While it mentions the CrowdStrike incident only in vague terms, this article discusses why slowly rolling out updates isn’t a universal solution and can bring its own problems.

  Chris Siebenmann

Some thoughts on staged rollouts and the CrowdStrike outage:

The notion we tried to get known far and wide was “nothing goes everywhere at once”.

Note that this post was published before CrowdStrike’s RCA, which subsequently confirmed that their channel file updates were not deployed through staged rollouts.

  rachelbythebay
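
One common way to honor “nothing goes everywhere at once” is to gate each update on a stable hash of the host identifier, so that a given stage only reaches a fixed slice of the fleet. The following is a minimal sketch of that idea, not a description of how CrowdStrike or the post’s author runs rollouts; the function name `in_rollout` and the host IDs are hypothetical.

```python
import hashlib


def in_rollout(host_id: str, percent: int) -> bool:
    """Return True if this host falls inside the first `percent` of the fleet.

    The bucket is derived from a hash of the host ID, so it is stable across
    runs: a host enabled at 10% stays enabled at 50%.
    """
    digest = hashlib.sha256(host_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # bucket in the range 0..99
    return bucket < percent


# Hypothetical fleet, used only for illustration.
hosts = [f"host-{n}" for n in range(1000)]

for stage in (1, 10, 50, 100):
    enabled = sum(in_rollout(h, stage) for h in hosts)
    print(f"stage {stage:>3}%: {enabled} of {len(hosts)} hosts receive the update")
```

Because the buckets are fixed, widening the percentage only ever adds hosts, which leaves room to pause between stages and watch for problems before going further.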

[…] there may be risks in your system that haven’t manifested as minor outages.

Jumping off from the CrowdStrike incident, this one asks us to look for reliability problems in parts of our infrastructure that we’ve grown to trust.

  Lorin Hochstein

While CrowdStrike’s RCA has quite a bit of technical detail, this post reminds us that we need a lot more context to really understand how an incident came to be.

  Lorin Hochstein

In the future, computers will not crash due to bad software updates, even those updates that involve kernel code. In the future, these updates will push eBPF code.

I didn’t realize that Microsoft is working on eBPF for Windows.

  Brendan Gregg

This post isn’t about what CrowdStrike should have done. Instead, I use the resources to provide context and takeaways we can apply to our teams and organizations.

  Bob Walker — Octopus Deploy

SRE Weekly Issue #434

A message from our sponsor, FireHydrant:

We’ve gone all out on our new integration with Microsoft Teams. If you’re a MS Teams user, FireHydrant now supports the most comprehensive integration for incident management. Run the entire IM process without ever leaving the chat.

https://firehydrant.com/blog/introducing-a-brand-new-microsoft-teams-integration/

The big news this week, of course, is the CrowdStrike-related series of outages in airports, banks, and many other businesses. Here’s their statement on the situation.

Rumor has it that Southwest Airlines survived because they run Windows 3.1. Well, that’s one way to do it.

  CrowdStrike

It’s time for Catchpoint’s annual SRE survey again! We get a lot of interesting information about SRE trends from this, so it’d be great if you could take a moment to fill it out.

Note, usually I try to avoid giving you “utm” stuff in links, but this link is specifically set up to track whether folks come from SRE Weekly, so I left it in this time.

  Catchpoint

Queues have a cost, as this article explains.

  Jean-Mark Wright

I wrote this article about an exciting project I led recently: taking down an entire availability zone in production to test reliability. Part 2 is due out next week!

  Lex Neva — Honeycomb

  Full disclosure: Honeycomb is my employer.

Deletion protection: it can really save you!

  Andre Newman — Gremlin

A thorough overview of Netflix’s architecture, with focus on data stores, content processing, billing, and the CDN, among other topics.

  Rahul Shivalkar — ClickIT

This article compares the terms “degradation”, “disruption”, and “service outage” through the lens of service levels.

  Alex Ewerlöf

Their workload involved writing many small objects but reading very few. By batching many writes into a single object in S3, they saved a ton of money, and now they’re open sourcing their solution.

  Pablo Matias Gomez — Embrace
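
The batching idea can be sketched as follows: buffer small payloads in memory, upload them as one combined object, and keep an index of where each payload sits so the rare read can still find it. This is an illustrative sketch only and not Embrace’s open-sourced solution; the class `WriteBatcher`, the `put_object` callable, and the `batch-NNNNNN` naming are all assumptions, and a plain dict stands in for the object store.

```python
from typing import Callable, Dict, List, Tuple


class WriteBatcher:
    """Buffers small writes and flushes them as one combined object."""

    def __init__(self, put_object: Callable[[str, bytes], None], max_bytes: int = 1024):
        self._put = put_object          # hypothetical upload function
        self._max = max_bytes           # flush once the buffer reaches this size
        self._chunks: List[bytes] = []
        self._pending: List[Tuple[str, int, int]] = []  # (key, offset, length)
        self._size = 0
        self._seq = 0
        # key -> (batch object name, offset, length), used for the rare reads
        self.index: Dict[str, Tuple[str, int, int]] = {}

    def write(self, key: str, payload: bytes) -> None:
        self._pending.append((key, self._size, len(payload)))
        self._chunks.append(payload)
        self._size += len(payload)
        if self._size >= self._max:
            self.flush()

    def flush(self) -> None:
        if not self._chunks:
            return
        name = f"batch-{self._seq:06d}"
        self._put(name, b"".join(self._chunks))  # one upload instead of many
        for key, offset, length in self._pending:
            self.index[key] = (name, offset, length)
        self._chunks, self._pending, self._size = [], [], 0
        self._seq += 1


store: Dict[str, bytes] = {}             # stand-in for the object store
batcher = WriteBatcher(store.__setitem__, max_bytes=10)
batcher.write("a", b"hello")
batcher.write("b", b"world!")            # buffer reaches 11 bytes and flushes
name, offset, length = batcher.index["b"]
print(len(store), store[name][offset:offset + length])
```

The saving comes from paying for one write request per batch rather than one per payload; the trade-off is that unflushed data lives only in memory until the next flush.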

SRE Weekly Issue #422

A message from our sponsor, FireHydrant:

FireHydrant is now AI-powered for faster, smarter incidents! Power up your incidents with auto-generated real-time summaries, retrospectives, and status page updates.

https://firehydrant.com/blog/ai-for-incident-management-is-here/

The PIOSEE model is taught to pilots as a rubric for coming to a decision in a difficult aviation situation. As this article explains, we can also use it during IT incidents.

  Francisco Melo Jr.

What is high cardinality in monitoring systems? Here’s an excellent explanation that includes tips on how to manage cardinality.

  Ash P — SREPath
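
As a rough illustration of why cardinality matters: in label-based monitoring systems, each distinct combination of label values is typically stored as its own time series, so the worst-case series count for a metric is the product of the number of values each label can take. The sketch below is a generic example rather than anything taken from the article; the label names and function names are made up.

```python
from math import prod
from typing import Dict, Iterable, List


def worst_case_series(label_values: Dict[str, List[str]]) -> int:
    """Upper bound on time series: the product of distinct values per label."""
    return prod(len(values) for values in label_values.values())


def observed_series(samples: Iterable[Dict[str, str]]) -> int:
    """Number of distinct label combinations actually seen in the samples."""
    return len({tuple(sorted(sample.items())) for sample in samples})


# Hypothetical labels on a single request-count metric.
labels = {
    "method": ["GET", "POST"],
    "status": ["200", "404", "500"],
    "region": ["us", "eu"],
}
print(worst_case_series(labels))   # 12

# Adding one unbounded label multiplies the total.
labels["user_id"] = [str(n) for n in range(10_000)]
print(worst_case_series(labels))   # 120000
```

A single high-cardinality label such as a user ID dominates the total, which is why it is usually the first thing to drop or aggregate away.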

As Xero transitioned to a standard of “you build it you run it”, suddenly more engineering teams were responsible for knowing about and implementing observability. They designed this maturity model to help teams understand what they were aiming for and how to get there.

  Andrew Macdonald — Xero

With around 200 undersea fiber cuts worldwide per year, a fleet of ships is at the ready to pull up the cables and repair them.

  Josh Dzieza — The Verge

Last year, Cloudflare suffered a control plane outage when one of their datacenters lost power. They have since done significant work to improve their resilience to power outages, and it was put to the test when the same datacenter lost power again.

  Matthew Prince, John Graham-Cumming, and Jeremy Hartman — Cloudflare

Going from non-remote to remote was challenging, but here’s how our team changed as we began working from home.

  Stefan Mikolajczyk — WeTransfer

Platform teams have a hugely important role to fill in the engineering organization. They are often the teams that enable other teams to move with more speed and safety. They can also quickly become disconnected from their customers.

  Ross Brodbeck

When your system successfully serves a degraded response to the customer, how should you count that toward your SLO? Is it success? Failure? Something in between?

  Niall Murphy
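
To make the “something in between” option concrete, one possible convention is to give degraded responses partial credit in the SLI. This is a sketch of that single convention, not the article’s recommendation; the function name `sli`, the request counts, and the `degraded_weight` parameter are assumptions chosen for illustration.

```python
def sli(good: int, degraded: int, failed: int, degraded_weight: float) -> float:
    """Fraction of requests counted as successful.

    degraded_weight = 0.0 treats degraded responses as failures,
    1.0 treats them as full successes, and anything between gives partial credit.
    """
    if not 0.0 <= degraded_weight <= 1.0:
        raise ValueError("degraded_weight must be between 0 and 1")
    total = good + degraded + failed
    if total == 0:
        raise ValueError("no requests observed")
    return (good + degraded_weight * degraded) / total


# Hypothetical window: 90 full responses, 5 degraded, 5 failed.
for weight in (0.0, 0.5, 1.0):
    print(f"weight {weight}: SLI = {sli(90, 5, 5, weight):.3f}")
```

With these made-up numbers, the same window reads as 0.900, 0.925, or 0.950 depending on the weight, so the choice can decide whether an SLO is met.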

SRE Weekly Issue #412

A message from our sponsor, FireHydrant:

FireHydrant’s new and improved MTTX analytics dashboard is here! See which services are most affected by incidents, where they take the longest to detect (or acknowledge, mitigate, resolve … you name it); and how metrics and statistics change over time.
https://firehydrant.com/blog/mttx-incident-analytics-to-drive-your-reliability-roadmap/

Can a single dashboard that covers your entire system really exist?

  Jamie Allen

This one makes the case for having a group of specially-trained incident commanders to handle SEV-1 (worst-case) outages, separate from your normal ICs.

  Jonathan Word

This article lays out a strategy for gaining buy-in by making three specific, sequential arguments.

  Emily Arnott — Blameless

This article explores the varying ways that SRE is implemented through a set of 4 archetypes.

  Alex Ewerlöf

It turns out that assigning ephemeral ports to connections in Linux is way more complicated than it might seem at first glance, and there’s room for optimization, as this article explains.

  Frederick Lawler — Cloudflare

While deploying Precision Time Protocol (PTP) at Meta, we’ve developed a simplified version of the protocol (Simple Precision Time Protocol – SPTP), that can offer the same level of clock synchronization as unicast PTPv2 more reliably and with fewer resources.

  Oleg Obleukhov and Ahmad Byagowi — Meta

Far more than just a list of links, this article gives an overview of each topic before pointing you in the right direction for more information.

  Fred Hebert

Building on the groundwork laid out in our first article about the initial steps in Incident Management (IM) at Dyninno Group, this second installment will explore the practicalities of streamlining and implementing these strategies.

  Vladimirs Romanovskis

SRE Weekly Issue #410

A message from our sponsor, FireHydrant:

How many seats are you paying for in your legacy alerting tool that rarely get paged? With Signals’ bucket pricing, you only pay for what you use. Join the beta for a better tool at a better price.
https://firehydrant.com/blog/signals-beta-live/

In this blog post, we describe the journey DoorDash took using a service mesh to realize data transfer cost savings without sacrificing service quality.

  Hochuen Wong and Levon Stepanian — DoorDash

When just a few “regulars” are called in to handle every incident, you’ve got a knowledge gap to fill in your organization.

  David Ridge — PagerDuty

Dropbox expands into new datacenters often, so they have a streamlined and detailed process for choosing datacenter vendors.

  Edward del Rio — Dropbox

This is either nine things that could derail your SRE program, or a list of things to do with “not” in front of them — either way, it’s a good list.

  Shyam Venkat

We need enough alerting in our systems that we can detect lurking anomalies, but not so much that we get alert fatigue.

  Dennis Henry

A post about the importance of product in SRE, and how to make product and SRE first-class citizens in your Software Development Lifecycle.

  Jamie Allen

A relatively minor incident took a turn for the worse after the pilots attempted a close fly-by in an attempt to resolve it. I swear I’ve been in this kind of incident before, where I took risks significantly out of proportion to the problem I was trying to solve.

  Kyra Dempsey (Admiral Cloudberg)

A production of Tinker Tinker Tinker, LLC