Search Results for "outages"

SRE Weekly Issue #412

A message from our sponsor, FireHydrant:

FireHydrant’s new and improved MTTX analytics dashboard is here! See which services are most affected by incidents, where they take the longest to detect (or acknowledge, mitigate, resolve … you name it), and how metrics and statistics change over time.
https://firehydrant.com/blog/mttx-incident-analytics-to-drive-your-reliability-roadmap/

Can a single dashboard that covers your entire system really exist?

  Jamie Allen

This one makes the case for having a group of specially-trained incident commanders to handle SEV-1 (worst-case) outages, separate from your normal ICs.

  Jonathan Word

This article lays out a strategy for gaining buy-in by making three specific, sequential arguments.

  Emily Arnott — Blameless

This article explores the varying ways that SRE is implemented through a set of 4 archetypes.

  Alex Ewerlöf

It turns out that assigning ephemeral ports to connections in Linux is way more complicated than it might seem at first glance, and there’s room for optimization, as this article explains.

  Frederick Lawler — Cloudflare
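If you want to poke at the default behavior yourself before reading, here's a quick Python sketch (mine, not from the article): on Linux, an un-bound connect() gets its source port from the range in /proc/sys/net/ipv4/ip_local_port_range.

    import socket

    # Read the ephemeral port range the kernel draws from (Linux only).
    with open("/proc/sys/net/ipv4/ip_local_port_range") as f:
        low, high = map(int, f.read().split())

    # Connect to a loopback listener without bind()ing a source port first;
    # the kernel assigns the ephemeral source port at connect() time.
    listener = socket.socket()
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)

    client = socket.socket()
    client.connect(listener.getsockname())
    print(f"range {low}-{high}, kernel chose source port {client.getsockname()[1]}")

    client.close()
    listener.close()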

While deploying Precision Time Protocol (PTP) at Meta, we’ve developed a simplified version of the protocol (Simple Precision Time Protocol – SPTP) that can offer the same level of clock synchronization as unicast PTPv2 more reliably and with fewer resources.

  Oleg Obleukhov and Ahmad Byagowi — Meta

Far more than just a list of links, this article gives an overview of each topic before pointing you in the right direction for more information.

  Fred Hebert

Building on the groundwork laid out in our first article about the initial steps in Incident Management (IM) at Dyninno Group, this second installment will explore the practicalities of streamlining and implementing these strategies.

  Vladimirs Romanovskis

SRE Weekly Issue #410

A message from our sponsor, FireHydrant:

How many seats are you paying for in your legacy alerting tool that rarely get paged? With Signals’ bucket pricing, you only pay for what you use. Join the beta for a better tool at a better price.
https://firehydrant.com/blog/signals-beta-live/

In this blog post, we describe the journey DoorDash took using a service mesh to realize data transfer cost savings without sacrificing service quality.

  Hochuen Wong and Levon Stepanian — DoorDash

When just a few “regulars” are called in to handle every incident, you’ve got a knowledge gap to fill in your organization.

  David Ridge — PagerDuty

Dropbox expands into new datacenters often, so they have a streamlined and detailed process for choosing datacenter vendors.

  Edward del Rio — Dropbox

This is either nine things that could derail your SRE program, or a list of things to do with “not” in front of them — either way, it’s a good list.

  Shyam Venkat

We need enough alerting in our systems that we can detect lurking anomalies, but not so much that we get alert fatigue.

  Dennis Henry

A post about the importance of product in SRE, and how to make product and SRE first-class citizens in your Software Development Lifecycle.

  Jamie Allen

A relatively minor incident took a turn for the worse after the pilots attempted a close fly-by in an attempt to resolve it. I swear I’ve been in this kind of incident before, where I took risks significantly out of proportion to the problem I was trying to solve.

  Kyra Dempsey (Admiral Cloudberg)

SRE Weekly Issue #399

A message from our sponsor, FireHydrant:

Severity levels help responders and stakeholders understand the incident impact and set expectations for the level of response. This can mean jumping into action faster. But first, you have to ensure severity is actually being set. Here’s one way.
https://firehydrant.com/blog/incident-severity-why-you-need-it-and-how-to-ensure-its-set/

This research paper summary goes into Mode Error and the dangers of adding more features to a system in the form of modes, especially if the system can change modes on its own.

  Fred Hebert (summary)
  Dr. Nadine B. Sarter (original paper)

Cloudflare suffered a power outage in one of the datacenters housing their control and data planes. The outage itself is intriguing, and in its aftermath, Cloudflare learned that their system wasn’t as HA as they thought.

Lots of great lessons here, and if you want more, they posted another incident writeup recently.

  Matthew Prince — Cloudflare

Separating write from read workloads can increase complexity but also open the door to greater scalability, as this article explains.

  Pier-Jean Malandrino
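To make the shape of the idea concrete, here's a toy Python sketch (mine, not the article's; the shared in-memory SQLite database stands in for a real primary/replica pair): writes go to one connection, reads are spread across the others, and with real replication the reads may be stale.

    import random
    import sqlite3

    # Stand-in infrastructure: one shared in-memory SQLite database opened via
    # several connections, so the "replicas" see the "primary"'s writes. In a
    # real deployment these would be separate primary and replica servers.
    uri = "file:rw_split_demo?mode=memory&cache=shared"
    primary = sqlite3.connect(uri, uri=True)
    replicas = [sqlite3.connect(uri, uri=True) for _ in range(2)]

    def execute_write(statement, params=()):
        # All mutations go to the single write primary.
        with primary:
            return primary.execute(statement, params)

    def execute_read(statement, params=()):
        # Reads are spread across replicas; with real replication they may lag
        # the primary, which callers have to tolerate.
        return random.choice(replicas).execute(statement, params).fetchall()

    execute_write("CREATE TABLE IF NOT EXISTS orders (id INTEGER, total REAL)")
    execute_write("INSERT INTO orders VALUES (?, ?)", (1, 9.99))
    print(execute_read("SELECT * FROM orders"))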

Covers four strategies for load shedding, with code examples:

  • Random Shedding
  • Priority-Based Shedding
  • Resource-Based Shedding
  • Node Isolation

  Code Reliant
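As a taste of the second strategy on that list, here's a tiny Python sketch (mine, not from the article; the threshold and priority values are made up): once too many requests are in flight, anything below a given priority gets rejected.

    import threading

    class PriorityShedder:
        """Shed low-priority work when too many requests are in flight."""

        def __init__(self, max_in_flight=100, min_priority_when_busy=5):
            self._lock = threading.Lock()
            self._in_flight = 0
            self.max_in_flight = max_in_flight
            self.min_priority_when_busy = min_priority_when_busy

        def try_acquire(self, priority):
            """Return True if the request may proceed, False if it is shed."""
            with self._lock:
                overloaded = self._in_flight >= self.max_in_flight
                if overloaded and priority < self.min_priority_when_busy:
                    return False  # overloaded and low priority: shed it
                self._in_flight += 1
                return True

        def release(self):
            with self._lock:
                self._in_flight -= 1

    shedder = PriorityShedder(max_in_flight=2)
    assert shedder.try_acquire(priority=1)       # accepted: not yet overloaded
    assert shedder.try_acquire(priority=1)       # accepted: now at the limit
    assert not shedder.try_acquire(priority=1)   # shed: overloaded, low priority
    assert shedder.try_acquire(priority=9)       # accepted: high priority gets through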

Lots of juicy details about the three outages, including a link to AWS’s write-up of their Lambda outage in June.

  Gergely Orosz

The diagrams in this article are especially useful for understanding how the circuit-breaker pattern works.

  Pier-Jean Malandrino
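If you'd like the pattern in code as well as diagrams, here's a minimal Python sketch (mine, not from the article): the breaker trips open after repeated failures, rejects calls for a cooldown period, then lets a trial call through.

    import time

    class CircuitBreaker:
        """Minimal circuit breaker: closed -> open -> (after cooldown) half-open."""

        def __init__(self, failure_threshold=3, reset_timeout=30.0):
            self.failure_threshold = failure_threshold
            self.reset_timeout = reset_timeout
            self.failures = 0
            self.opened_at = None  # None means the circuit is closed

        def call(self, fn, *args, **kwargs):
            if self.opened_at is not None:
                if time.monotonic() - self.opened_at < self.reset_timeout:
                    raise RuntimeError("circuit open, request rejected")
                # Cooldown elapsed: half-open, let one trial call through.
            try:
                result = fn(*args, **kwargs)
            except Exception:
                self.failures += 1
                if self.failures >= self.failure_threshold:
                    self.opened_at = time.monotonic()  # trip (or re-trip) open
                raise
            else:
                self.failures = 0
                self.opened_at = None  # success closes the circuit
                return result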

This one’s about how on-call can go bad, and how to structure your team’s on-call so that it’s livable and sustainable.

  Michael Hart

Execs cast a big shadow in an incident, so it’s important to have a plan for how to communicate with them, as this article explains.

  Ashley Sawatsky — Rootly

SRE Weekly Issue #373

A message from our sponsor, Rootly:

Rootly is hiring for a Sr. Developer Relations Advocate to continue helping more world-class companies like Figma, NVIDIA, and Squarespace accelerate their incident management journey. They’re looking for previous on-call engineers with a passion for making the world a more reliable place. Learn more:

https://rootly.com/careers?gh_jid=4015888007

Articles

Datadog posted a report on their major outage in March, and it’s a doozy. An unattended updates system that they didn’t even want, need, or know about triggered across all hosts in multiple clouds nearly simultaneously, causing a regression.

  Alexis Lê-Quôc — Datadog

GitHub has had a string of apparently unrelated outages recently, and they’ve posted this description.

  Mike Hanley — GitHub

Oh look, another awesome-* repo relevant to our interests!

A repo of links to articles, papers, conference talks, and tooling related to load management in software services: load shedding, circuit breaking, quota management, and throttling. PRs welcome.

  Laura Nolan and Niall Murphy — Stanza Systems

This interview covers a lot of ground including looking beyond just “up or down” when considering reliability.

  Prathamesh Sonpatki — SRE Stories

If you’re in the mood for a deep systems debugging story, you’re in for a treat. The author takes you along for the ride with a wealth of detailed code snippets.

  Tycho Andersen — Netflix

Regardless of the replication mechanism, you must fsync() your data to prevent global data loss in non-Byzantine protocols.

  Denis Rystsov and Alexander Gallego — Redpanda
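A minimal Python illustration of the point (my sketch, not Redpanda's code): a write that has only reached the OS page cache can vanish on power loss, so you flush the userspace buffer and then fsync() before acknowledging.

    import os

    def durable_append(path, record: bytes):
        """Append a record and return only once it has reached stable storage."""
        with open(path, "ab") as f:
            f.write(record)
            f.flush()             # push Python's userspace buffer into the kernel
            os.fsync(f.fileno())  # force the kernel to flush the page cache to disk
        # Only now is it safe to acknowledge the write to a client or peer.

For a newly created file, the containing directory needs an fsync() of its own before the file's existence is durable.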

Emotional intelligence is a critical skill for SREs, especially when we interact with other teams in fraught situations.

  Amin Astaneh — Certo Modo

Wow! Spotify created a set of tools to perform automated refactoring of thousands of repositories at once. This includes the ability to run tests, automatically merge pull requests without human review, and roll refactorings out gradually.

  Matt Brown — Spotify

Jeli has published a one-page cheat sheet for their highly detailed Howie guide for running incident retrospectives.

  Jeli

SRE Weekly Issue #345

SRE Weekly is now on Mastodon at @SREWeekly@social.linux.pizza! Follow to get notified of each new issue as it comes out.

This replaces the Twitter account @SREWeekly, which I am now retiring in favor of Mastodon. For those of you following @SREWeekly on Twitter, you’ll need to choose a different way to get notified of new issues. If Mastodon isn’t your jam, try RSS or a straight email subscription (by filling out the form at sreweekly.com).

A message from our sponsor, Rootly:

Manage incidents directly from Slack with Rootly 🚒.

Rootly automates manual tasks like creating an incident channel, Jira ticket, and Zoom rooms, inviting responders, creating statuspage updates, postmortem timelines, and more. Want to see why companies like Canva and Grammarly love us?

https://rootly.com/demo/

Articles

Don’t beat yourself up! This is like another form of blamelessness.

  Robert Ross — FireHydrant + The New Stack

In this article, I will share with you how setting up passive guardrails in and around developer workflows can reduce the frequency and severity of incidents and outages.

  Ash Patel — SREPath

This conference talk summary outlines the three main lessons Jason Cox learned as director of SRE at Disney.

  Shaaron A Alvares — InfoQ

Here’s a look at how Meta has structured its Production Engineer role, their name for SREs.

  Jason Kalich — Meta

Bit-flips caused by cosmic rays seem incredibly rare, but they become more likely as we make circuits smaller and our infrastructures larger.

  Chris Baraniuk — BBC

Cloudflare shares details about their 87-minute partial outage this past Tuesday.

  John Graham-Cumming — Cloudflare

In reaction to a major outage, these folks revamped their alerting and incident response systems. Here’s what they changed.

  Vivek Aggarwal — Razorpay

The author of this post sought to test a simple algorithm from a research paper that purported to reduce tail latency. Yay for independent verification!

  Marc Brooker
