Search Results for “outages” – Page 3

SRE Weekly Issue #332

lex

July 31, 2022

General

Comments

View on sreweekly.com

Articles

How Razorpay’s Notification Service Handles Increasing Load

Their notification service had complex load characteristics that made scaling up a tricky proposition.

Anand Prakash — Razorpay

How we improved on-call life by reducing pager noise

Coalescing alerts and adding dependencies in AlertManager were the key to reducing this team’s excessive pager load.

steveazz — GitLab

What’s allowed to count as a cause: ALERRT edition

Lorin Hochstein has started a series of blog posts on what we can learn about incident response from the Uvalde school shooting tragedy in the US. This article looks at how an organization’s perspective can affect their retrospective incident analysis.

Lorin Hochstein

The fog of war in Uvalde

My claim here is that we should assume the officer is telling the truth and was acting reasonably if we want to understand how these types of failure modes can happen.

Every retrospective ever:

We must assume that a person can act reasonably and still come to the wrong conclusion in order to make progress.

Lorin Hochstein

User settings, Lamport clocks and lightweight formal methods

How do you synchronize state between multiple browsers and a backend, and ensure that everyone’s state will eventually converge? These folks explain how they did it, and a bug they found through testing.

Jakub Mikians — Airspace Intelligence

MTTR: lower isn’t always better

MTTR is a mean, so it doesn’t tell you anything about the number of incidents, among other potential pitfalls.

Dan Slimmon

Google Cloud Platform outage report: europe-west2 cooling failure

Last week, I included a GCP outage in europe-west2. This week, Google posted this report about what went wrong, and it’s got layers.

Bonus: another GCP outage report

Google

It’s time to leave the leap second in the past

Meta wants to do away with leap seconds, because they make it especially difficult to create reliable systems.

Oleg Obleukhov and Ahmad Byagowi — Meta

3 common pitfalls of post-mortems

If you’re anywhere near incident analysis in your organization, you need to read this list.

Milly Leadley — incident.io

Outages

SRE Weekly Issue #331

lex

July 24, 2022

General

Comments

View on sreweekly.com

Articles

DisasterCast – A podcast about scary things and how to stop them happening

I’ve been listening to this podcast this week and I love it! Each episode covers a disaster, safety theory, and other topics — with no ads. Their site is down right now, but the podcast is available on the usual platforms.

Drew Rae — DisasterCast

An 8 Step Guide to Go From a Clueless to a Production-aware Software Engineer

If we want to get folks to own their code in production, we need to teach them how to think like an SRE.

Boris Cherkasky

3 mistakes I’ve made at the beginning of an incident (and how not to make them)

Let’s look at three mistakes I’ve made during those stressful moments during the beginning of an incident — and discuss how you can avoid making them.

The mistakes are:

Mistake 1: We didn’t have a plan.
Mistake 2: We weren’t production ready.
Mistake 3: We fell down a cognitive tunnel.

Robert Ross — FireHydrant

When to kill the canary

At what point does your canary test indicate failure? Should the criteria be the same as your normal production alerting?

Øystein Blixhavn

On Counting Alerts

This is a followup to a previous article about on-call health. In this one, the author shares metrics about the number of alerts and discusses whether this number is useful.

Fred Hebert — Honeycomb

High Availability on Razorpay Payments Dashboard

Their dashboard crashed for 50% of user sessions, so they had a lot of work ahead of them. Find out how they got crash-free sessions to 99.9% and improved their time to respond to incidents.

Sandesh Damkondwar — Razorpay

@atoonk on Twitter summarizing the Rogers Communications outage

Rogers Communications, a major telecom in Canada, had a country-wide outage earlier this month. I don’t normally include telecom outages in the Outages section because they rarely share information that we can learn from.

This time, Rogers released a (redacted) report on their outage, and this Twitter thread summarizes the key points.

@atoonk on Twitter

Outages

Microsoft Teams and Office 365
Microsoft blames storage error for Teams outage
Google Cloud Storage
Google Cloud europe-west2 region
- Preliminary root cause has been identified as multiple concurrent failures to our redundant cooling systems within one of the buildings that hosts the europe-west2-a zone for the europe-west2 region.

SRE Weekly Issue #330

lex

July 17, 2022

General

Comments

View on sreweekly.com

Thanks for all the well-wishes as I took a sick day last week. I’m feeling much better!

Articles

DNS Incidents Like Cloudflare’s Could Turn your Status Page Useless; Here is How to Prevent It

Is your status page status.yourcompany.com? If so, read this article, then get yourself a new domain.

Eduardo Messuti — Statuspal

Coming back from maternity leave and learning from incidents.

The author used my favorite technique for getting up to speed on a company: analyzing a recent incident.

Vanessa Huerta Granda — Jeli

What leading expeditions taught me about incident management

There are a number of lessons I learned guiding weeks-long backcountry leadership courses for teens that I carried with me into my roles in incident management. In this blog post, I’ll share three that stand out.

Ryan McDonald — FireHydrant

What you should and (probably) shouldn’t try from SRE

I really like these articles about interpreting SRE in a way that makes sense for your organization. SRE is still constantly evolving.

Steve Smith — Equal Experts

What I learned from leading my first incident

The author led an incident just 3 months into their tenure. Here’s what they learned.

Milly Leadley — incident.io

Notes on an Observability Team

while SRE and DevOps type job explainers have been written ad nauseam, I found there’s relatively little online about Observability Teams and roles. I figured I’d share a bit about my experience on an O11y Team.

Eric Mustin

We Learn Systems by Changing Them

I found the contrast between this one and the previous article interesting. The previous one includes a quote of Brendan Gregg:

Let me try some observability first. (Means: Let me look at the system without changing it.)

Jessica Kerr — Honeycomb

GitHub Availability Report: June 2022

In June, we experienced four incidents resulting in significant impact to multiple GitHub.com services. This report also sheds light into an incident that impacted several GitHub.com services in May.

GitHub

Day 2 Operations with the James Webb Space Telescope is about to begin

Using the Webb telescope as an example, this article describes the progression of a system toward production operation using a metaphor of 3 days.

Robert Barron — IBM

Outages

LaunchDarkly
- This one was from last week when I missed an issue, but it was pretty big and interesting, so I decided to include it this week.
Twitter
Snapchat
Peloton
- I see no evidence that Lizzo was the cause, despite rumors.

SRE Weekly Issue #329

lex

July 3, 2022

General

Comments

View on sreweekly.com

Articles

“Keep calm and use the runbook” – Why runbooks are the key to handling any situation effectively

A primer on what makes a good runbook.

Runbooks are most effective when they are readily available, easily actionable, and up-to-date and accurate.

Cortex

Establishing a Scalable SRE Infrastructure Using Standardization and Short Feedback Loops

In this article, we describe the architecture and implementation of our SRE infrastructure, how it is used and how it was adopted.

Philipp Gündisch and Vladyslav Ukis — Siemens

Tech Debt, Incidents and On Call

After an explanation of tech debt, this article goes into a possible solution: having on-call folks fix lingering problems in between pages.

Dormain Drewitz — The New Stack

How to Standardize Service Ownership at Scale for Improved Incident Response

I’ve read plenty of articles about service ownership, but this one has something new: a discussion of how to divvy up a monolith into separate “services” for teams to own.

Hannah Culver — PagerDuty

Streamlining our incident responses

The folks at Sendinblue have chronicled their journey to better incident response, and there’s a lot here to learn from.

Tanguy Antoine — Sendinblue

Why More Incidents Are Better

Incidents will always happen, but thankfully they have plenty of upsides, as this article explains.

Andre King — Rootly
This article is published by my sponsor, Rootly, but their sponsorship did not influence its inclusion in this issue.

My Alerts Don’t Work Anymore, Now What?

You’re not getting paged. Is it because you’ve fixed all the things, or has your alerting atrophied?

Boris Cherkasky

Uncovering the mysteries of on-call

The folks at incident.io are here with the results of their survey of on-call practices. I like the focus on compensation for being on-call.

incident.io

Outages

Twitter
Netflix briefly went down for some users after the new ‘Stranger Things’ episodes debuted Friday morning, according to outage reports
GitHub
reddit
Zebrium
- I noticed this one while trying to read one of their articles. I was getting NXDOMAIN trying to resolve zebrium.com.

SRE Weekly Issue #328

lex

June 27, 2022

General

Comments

View on sreweekly.com

Articles

Cloudflare outage on June 21, 2022

Less than 12 hours after their outage, Cloudflare posted this detailed run-down of what happened.

Tom Strick and Jeremy Hartman — Cloudflare

Metastability and Distributed Systems

Metastable failures occur in open systems with an uncontrolled source of load where a trigger causes the system to enter a bad state that persists even when the trigger is removed.

Marc Brooker

The Good, The Bad, And Alerting on Derivatives

By “derivatives”, the author means rate-of-change, like Prometheus’s irate(). Derivatives have their place, but this article has good reasons to reconsider using them for alerts.

Boris Cherkasky

How to Adopt an SRE Practice (When You’re not Google)

In this article, I’ll dive into what it takes to get into site reliability engineering, how to adopt it within your own organization and some of the core principles and best practices you’ll need to keep in mind as you move forward in your SRE maturity journey.

Jemiah Sius — devops.com

Tech Interview Questions: Typing “google.com” in a browser

I have given and received this question in many SRE interviews, and it’s famously used by Google in their interviews. This article dissects the question and its merits and downsides for the benefit of both interviewers and interviewees.

Will Gallego

Outages

Cloudflare
- Cloudflare had a major outage, taking many sites and services with it.