
SRE Weekly Issue #345

SRE Weekly is now on Mastodon at @SREWeekly@social.linux.pizza! Follow to get notified of each new issue as it comes out.

This replaces the Twitter account @SREWeekly, which I am now retiring in favor of Mastodon. For those of you following @SREWeekly on Twitter, you’ll need to choose a different way to get notified of new issues. If Mastodon isn’t your jam, try RSS or a straight email subscription (by filling out the form at sreweekly.com).

A message from our sponsor, Rootly:

Manage incidents directly from Slack with Rootly 🚒.

Rootly automates manual tasks like creating an incident channel, Jira ticket, and Zoom room, inviting responders, creating status page updates and postmortem timelines, and more. Want to see why companies like Canva and Grammarly love us?

https://rootly.com/demo/

Articles

Don’t beat yourself up! This is like another form of blamelessness.

  Robert Ross — FireHydrant + The New Stack

In this article, I will share with you how setting up passive guardrails in and around developer workflows can reduce the frequency and severity of incidents and outages.

  Ash Patel — SREPath

This conference talk summary outlines the three main lessons Jason Cox learned as director of SRE at Disney.

  Shaaron A Alvares — InfoQ

Here’s a look at how Meta has structured its Production Engineer role (its name for SREs).

  Jason Kalich — Meta

Bit-flips caused by cosmic rays seem incredibly rare, but they become more likely as we make circuits smaller and our infrastructures larger.

  Chris Baraniuk — BBC

Cloudflare shares details about their 87-minute partial outage this past Tuesday.

  John Graham-Cumming — Cloudflare

In reaction to a major outage, these folks revamped their alerting and incident response systems. Here’s what they changed.

  Vivek Aggarwal — Razorpay

The author of this post sought to test a simple algorithm from a research paper that purported to reduce tail latency. Yay for independent verification!

  Marc Brooker

SRE Weekly Issue #341

A message from our sponsor, Rootly:

Manage incidents directly from Slack with Rootly 🚒.

Rootly automates manual tasks like creating an incident channel, Jira ticket, and Zoom room, inviting responders, creating status page updates and postmortem timelines, and more. Want to see why companies like Canva and Grammarly love us?

https://rootly.com/demo/

Articles

My coworkers referred to a system “going metastable”, and when I asked what that was, they pointed me to this awesome paper.

Metastable failures occur in open systems with an uncontrolled source of load where a trigger causes the system to enter a bad state that persists even when the trigger is removed.

  Nathan Bronson, Aleksey Charapko, Abutalib Aghayev, and Timothy Zhu
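
That definition maps onto a very small feedback loop: failed work comes back as retries, so offered load can stay above capacity even after the original trigger is gone. The toy simulation below is my own illustration with made-up numbers, not anything from the paper, but it shows the persistence the authors describe.

# Toy discrete-time model of a metastable failure driven by retry
# amplification. All constants are invented for illustration.

CAPACITY = 100           # requests the service can complete per tick
NEW_LOAD = 80            # fresh requests per tick (comfortably below capacity)
SPIKE_LOAD = 200         # arrival rate during the temporary trigger
RETRIES_PER_FAILURE = 2  # each failed request is retried this many times

def simulate(ticks=30, spike_start=5, spike_end=8):
    backlog = 0.0
    for t in range(ticks):
        arrivals = SPIKE_LOAD if spike_start <= t < spike_end else NEW_LOAD
        offered = backlog + arrivals
        completed = min(offered, CAPACITY)
        failed = offered - completed
        # Failed work returns as retries, sustaining the overload even
        # after the spike (the trigger) has ended.
        backlog = failed * RETRIES_PER_FAILURE
        print(f"t={t:2d} arrivals={arrivals:3d} failed={failed:6.0f} backlog={backlog:7.0f}")

simulate()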

Honeycomb posted this incident report involving a service hitting the open file descriptors limit.

  Honeycomb
  Full disclosure: Honeycomb is my employer.
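
If you want to catch this class of problem before it bites, the standard library can compare a process's open descriptor count against its limit. This is a generic sketch, not anything from the Honeycomb report, and the /proc path is Linux-specific.

# Minimal sketch: watch a process's open file descriptor count against
# its soft limit so you can alert before exhausting it.
import os
import resource

def fd_usage():
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    open_fds = len(os.listdir("/proc/self/fd"))  # Linux-only
    return open_fds, soft, hard

if __name__ == "__main__":
    open_fds, soft, hard = fd_usage()
    print(f"open fds: {open_fds} / soft limit: {soft} (hard limit: {hard})")
    if open_fds > 0.8 * soft:
        print("warning: over 80% of the file descriptor limit")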

Lots of interesting answers to this one, especially when someone uttered the phrase:

engineers should not be on call

  u/infomaniac89 and others — reddit

A misbehaving internal Google service overloaded Cloud Filestore, exceeding its global request limit and effectively DoSing customers.

  Google
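
A common guard against one caller exhausting a shared limit is to give each client its own budget. The sketch below is purely illustrative (it is not how Filestore enforces quotas, and the rate numbers are invented): a per-client token bucket that rejects a noisy client before it can starve everyone else.

# Per-client token bucket sketch. RATE and BURST are assumed numbers.
import time
from collections import defaultdict

RATE = 50.0   # tokens added per second, per client
BURST = 100.0 # maximum bucket size, per client

_buckets = defaultdict(lambda: {"tokens": BURST, "last": time.monotonic()})

def allow(client_id: str) -> bool:
    b = _buckets[client_id]
    now = time.monotonic()
    b["tokens"] = min(BURST, b["tokens"] + (now - b["last"]) * RATE)
    b["last"] = now
    if b["tokens"] >= 1.0:
        b["tokens"] -= 1.0
        return True
    return False  # reject (or queue) rather than let one client DoS the rest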

An in-depth look at how Adobe improved its on-call experience. They used a deliberate plan to change their team’s on-call habits for the better.

  Bianca Costache — Adobe

This one contains an interesting observation: they found that outages caused by cloud providers take longer to resolve.

  Jeff Martens — Metrist

Even if you don’t agree with all of their reasons, it’s definitely worth thinking about.

  Danny Martinez — incident.io

This one covers common reliability risks in APIs and techniques for mitigating them.

  Utsav Shah

The evolution beyond separate Dev and Ops teams continues. This article traces the path through DevOps and into platform-focused teams.

  Charity Majors — Honeycomb
  Full disclosure: Honeycomb is my employer.

SRE Weekly Issue #337

Thanks for all the vacation well-wishes! It was really great and relaxing. Take vacations; they're important for reliability!

While I was out, I shipped the past two issues with content prepared in advance, and without the Outages section. This gave me a chance to really think hard about the value of the Outages section versus the time and effort I put into it.

I’ve decided to put the Outages section on hiatus for the time being. For notable outages, I’ll include them in the main section, on a case-by-case basis. Read on if you’re interested in what went into this decision.

The Outages section has always been of lower quality than the rest of the newsletter. I have no scientific process for choosing which Outages make the cut — mostly it’s just whatever shows up in my Google search alerts and seems “important”, minus a few arbitrary categories that don’t seem particularly interesting like telecoms and games. I do only a cursory review of the outage-related news articles I link to, and often they’re on poor-quality sites with a ton of intrusive ads. Gathering the list of Outages has begun taking more and more of my time, and I’d much rather spend that effort on curating quality content, so that’s what I’m going to do going forward.

A message from our sponsor, Rootly:

Manage incidents directly from Slack with Rootly 🚒.

Rootly automates manual tasks like creating an incident channel, Jira ticket, and Zoom room, inviting responders, creating status page updates and postmortem timelines, and more. Want to see why companies like Canva and Grammarly love us?

https://rootly.com/demo/

Every one of these 10 items is enough reason to read this article! This makes me want to go investigate some incidents right now.

  Fischer Jemison — Jeli

Slack shares with us in great detail why they use circuit breakers and how they rolled them out.

  Frank Chen — Slack
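
For context, a circuit breaker fails fast once a dependency has produced a run of errors, then probes it again after a cool-down. The sketch below is a generic minimal version with arbitrary thresholds, not Slack's implementation.

# Generic circuit-breaker sketch: stop calling a failing dependency for a
# cool-down period after repeated errors, then let one probe through.
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            # Cool-down elapsed: allow one probe request through.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        else:
            self.failures = 0
            self.opened_at = None
            return result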

My favorite part of this one is the section on expectations. We need to socialize this to help reduce the pressure on folks going on call for the first time.

  Prakya Vasudevan — Squadcast

Status pages are marketing material. Prove me wrong.

  Ellen Steinke — Metrist

incidents have unusually high information density compared with day-to-day work, and they enable you to piggy-back on the experience of others

  Lisa Karlin Curtis — incident.io

These folks realized that they had two different use cases for the same data: real-time transactions and batch processing. Rather than try to find one DB that could support both, they fork the data into two copies.

  Xi Chen and Siliang Cao — Grab

It’s all about gathering enough information that you can ask new questions when something goes wrong, rather than being stuck with only answers to the questions you thought to ask in advance.

  Charity Majors

They needed the speed of local ephemeral SSDs but the reliability of network-based persistent disks. The solution: a Linux MD option that mirrors across both but prefers reading from the local disks. Neat!

  Glen Oakley — Discord
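
To get a rough sense of what that looks like, mdadm can mark one member of a RAID1 mirror as write-mostly so reads prefer the other device. The snippet below only assembles and prints an illustrative command; the device names are placeholders, and you should consult mdadm(8) rather than treat this as Discord's actual setup.

# Illustrative only: build (but don't run) an mdadm command that mirrors
# a fast local SSD with a network-attached disk, marking the network disk
# "write-mostly" so reads prefer the local device.
import shlex

LOCAL_SSD = "/dev/nvme0n1"   # placeholder: fast ephemeral local disk
NETWORK_DISK = "/dev/sdb"    # placeholder: durable network-attached disk

cmd = [
    "mdadm", "--create", "/dev/md0",
    "--level=1", "--raid-devices=2",
    LOCAL_SSD,
    "--write-mostly", NETWORK_DISK,  # devices after this flag are write-mostly
]

print(" ".join(shlex.quote(c) for c in cmd))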

OS upgrades can be risky. LinkedIn developed a system to unify OS upgrade procedures and make them much less risky.

  Hengyang Hu, Dinesh Dhakal, and Kalyanasundaram Somasundaram — LinkedIn

SRE Weekly Issue #334

I’ll be on vacation starting next Sunday (yay!). That means the next two issues will be prepared in advance, so there won’t be an Outages section.

A message from our sponsor, Rootly:

Manage incidents directly from Slack with Rootly 🚒. Automate manual admin tasks like creating the incident channel, Jira ticket, and Zoom room, paging and adding responders, building the postmortem timeline, setting up reminders, and more. Book a demo (+ get a snazzy Rootly Lego set):
https://rootly.com/demo/

Articles

Should you go multi-cloud? What should you do during an incident involving a third-party dependency? What about after? Read this one for all that and more.

  Lisa Karlin Curtis — incident.io
  Full disclosure: Fastly, my employer, is mentioned.

An introduction to the concept of common ground breakdown, using the Uvalde shooting in the US as a case study.

  Lorin Hochstein

The comments section is full of some pretty great advice, including questions you can ask while interviewing to suss out whether the on-call culture is going to be livable.

  u/dicksoutfoeharambe (and others) — reddit

From the archives, this is an analysis of a report on the 2018 major outage at TSB Bank in the UK.

  Jon Stevens-Hall

You can determine whether backoff will actually help your system, and this article does a great job of telling you how.

  Marc Brooker
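
As background for the analysis, this is roughly what capped exponential backoff with full jitter looks like in practice. The constants are arbitrary, and whether this actually helps your system is exactly the question the article helps you answer.

# Background sketch: capped exponential backoff with full jitter.
import random
import time

def call_with_backoff(fn, max_attempts=5, base=0.1, cap=5.0):
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            # Sleep a random amount between 0 and the capped exponential bound.
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))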

I’ve read (and written) plenty of IC training guides, but this is the first time I’ve come across the concept of a “Hands-Off Update”. I’m definitely going to use that!

  Dan Slimmon

This is a really great explanation of observability from an angle I haven’t seen before.

a metric dashboard only contributes to observability if its reader can interpret the curves they’re seeing within a theory of the system under study.

  Dan Slimmon

Outages

  • Twitter
  • Google Search
    • Did you catch the Google search outage? I’ve never seen one like it — that’s how rare they are. Google shared a tidbit of information about what went wrong — and it wasn’t the datacenter explosion folks speculated about.

  • Peloton

SRE Weekly Issue #333

A message from our sponsor, Rootly:

Manage incidents directly from Slack with Rootly 🚒. Automate manual admin tasks like creating the incident channel, Jira ticket, and Zoom room, paging and adding responders, building the postmortem timeline, setting up reminders, and more. Book a demo (+ get a snazzy Rootly Lego set):
https://rootly.com/demo/

Articles

They asked four people and got four answers that run the gamut.

  Jeff Martens — Metrist

How Airbnb automates incident management amid a complex, rapidly evolving ensemble of microservices.

Includes an overview of their ChatOps system that would make for a great blueprint to build your own.

  Vlad Vassiliouk — Airbnb

Rigidly categorizing incidents can cause problems, according to this article.

From the customer’s viewpoint… well why would they care what kind of technical classification it is being forced into?

  Jon Stevens-Hall

Lots of great advice in this one, including the checklist below (sketched as code after the attribution):

  • If no human needs to be involved, it’s pure automation.
  • If it doesn’t need a response right now, it’s a report.
  • If the thing you’re observing isn’t a problem, it’s a dashboard.
  • If nothing actually needs to be done, you should delete it.

   Leon Adato — New Relic
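
That checklist is really a tiny decision procedure, so here it is sketched as code. The Alert fields and names are mine, purely for illustration, not from the article.

# The checklist above as a toy decision helper.
from dataclasses import dataclass

@dataclass
class Alert:
    needs_human: bool         # does a person have to act on it?
    needs_response_now: bool  # is it urgent, or just informational?
    is_a_problem: bool        # is the observed condition actually bad?
    action_exists: bool       # is there anything to actually do?

def classify(alert: Alert) -> str:
    if not alert.needs_human:
        return "pure automation"
    if not alert.needs_response_now:
        return "report"
    if not alert.is_a_problem:
        return "dashboard"
    if not alert.action_exists:
        return "delete it"
    return "keep it as an alert"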

Using the recent Atlassian outage as a case study, this article explains the importance of communication during an incident, then goes over best practices.

  Martha Lambert — incident.io

My favorite part about this is the advice to “lower the cost of being wrong”. Important in any case, but especially during incident response.

  Emily Arnott — Blameless

There are some interesting incidents in this issue: one involving DNS and another with an overload involving over-eager retries.

  Jakub Oleksy — GitHub

A great read both for interviewers and interviewees.

  Myra Nizami — Blameless

Their main advice is to avoid starting with a microservice architecture, and only transition to one after your monolith has matured and you have a good reason to do so.

  Tomas Fernandez and Dan Ackerson — semaphore

Outages

A production of Tinker Tinker Tinker, LLC