General

SRE Weekly Issue #428

lex

June 9, 2024

The Reverse Red Herring

This article presents an incident theme that I’ve lived through many times but never had such a pithy name for.

Geoff Townsend — Blameless

Centralisation and distribution: When one node is enough

There are risks and downsides inherent in a distributed system, so it’s worth thinking about whether you really need one.

Pipitz — Adevinta

Not Just Scale

And here’s a counterpoint to the previous article: deciding whether you need a distributed system isn’t just about scale.

Marc Brooker

Use Memes

The effectiveness of memes in availability campaigns.

This short post is a pile of memes, and the video one is top notch.

Ross Brodbeck

Flaky alerts are telling you something

Paraphrasing part of this article: either you didn’t understand your system fully when you wrote the alert, or there really are sporadic failures.

Chris Siebenmann

You can’t judge risk in hindsight

If you’ve ever created an action item from an incident along the lines of “don’t take unnecessary risks in the future”, you need to read this one.

The rest of you need to read it too.

Lorin Hochstein

Anomaly Alerting in Prometheus

A how-to for building anomaly detection alerting in Prometheus with specific config examples.

Karl Stoney
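
As a rough illustration of the general idea behind anomaly alerting (not the PromQL from the linked article, which I'm not reproducing here), here is a minimal z-score sketch in Python. The sample values, the function names, and the threshold of 3 standard deviations are all assumptions chosen for the example.

```python
import statistics


def z_score(history: list[float], current: float) -> float:
    """How many standard deviations `current` sits from the mean of `history`."""
    mean = statistics.fmean(history)
    stdev = statistics.pstdev(history)
    if stdev == 0:
        # A perfectly flat history gives no scale to compare against.
        return 0.0
    return (current - mean) / stdev


def is_anomalous(history: list[float], current: float, threshold: float = 3.0) -> bool:
    """Flag a sample whose z-score magnitude exceeds the (assumed) threshold."""
    return abs(z_score(history, current)) > threshold


# Hypothetical requests-per-second samples, for illustration only.
recent = [100.0, 102.0, 98.0, 101.0, 99.0]
spike_flagged = is_anomalous(recent, 110.0)
normal_flagged = is_anomalous(recent, 101.0)
```

A fixed threshold like this is only a starting point; noisy or seasonal metrics usually need a longer baseline window.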

r/sre: I almost re-imaged servers that were LIVE – Caused Disruption!

A panicked engineer asks reddit’s r/sre about an incident they caused: how could they have done better? Will they be fired? The comments are spot on, and this conversation is fresh enough that you could jump in too if you’re interested.

u/console_fulcrum and others — reddit

[Honeycomb incident followup]: US Production site is down

Last Monday, Honeycomb had an outage related to a schema migration involving MySQL’s ENUM data type, and they posted this incident report.

Bonus content: I wasn’t aware of ENUMs at all, so I had to brush up with this article: 8 Reasons Why MySQL’s ENUM Data Type Is Evil.

Honeycomb

Full disclosure: Honeycomb is my employer.

Cracking the SRE Interview

An experienced SRE discusses the skills and experiences you might be quizzed about in an interview for an SRE role.

Krishna Vinnakota — DZone

SRE Weekly Issue #427

lex

June 2, 2024

Why didn’t you status?

Written by a GitHub employee, this article seeks to answer the titular question, with discussions of noise reduction concerns and incidents that affect only a subset of customers.

Ross Brodbeck

Google Cloud Incident Report: VPC Incident on May 16, 2024

Wow, this incident is a really great example of the idea that there is no one single root cause.

Google

Trial by Fire: Tales from the SRE Frontlines — Ep2: The Scary ApplicationSet

Understand the safeguard configuration of ArgoCD’s ApplicationSet through the experience of our SRE, who learned about it from an incident.

Tanat Lokejaroenlarb — Adevinta

Make Two Trips

Sometimes it’s better to do something in multiple passes, even if it’s less efficient. This applies to individual programs and major deployments alike.

Thomas A. Limoncelli — ACM Queue

The problem with a root cause is that it explains too much

Another thought-provoking take on the argument that there is no one root cause.

Lorin Hochstein

Kubernetes Tip: What Happens To Pods Running On Node That Become Unreachable?

I referenced this at work the other day, but the interesting bit is that the pod-eviction-timeout option has been removed in Kubernetes 1.27 and I’ve had difficulty finding out what it was replaced by.

Bhargav Bhikkaji

Incident Summaries using LLMs

How to use llama-2 7b to generate summaries of your incidents, using Cloudflare workers and Workers AI.

It’s a complete how-to using an open source LLM.

Karl Stoney

Incident 2023-12-04: Data leak and loss in some free tier databases

Here’s a great incident writeup from last December that I came across this week.

By the way, if you see or write an incident followup post, I’d be grateful if you sent a link my way!

Turso

SRE Weekly Issue #426

lex

May 26, 2024

Got any burning questions to ask an experienced SRE? I’m gathering your questions in this google form, and I’d love to hear from you. I’m hoping to use your questions to help inspire authors looking to write more great SRE-related content.

The Rule of 5 Errors

If your overall request volume is low, single errors can have a big impact on your metrics — a phenomenon I’ve experienced at work recently.

Ross Brodbeck
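
To make the low-volume point above concrete, here is a small arithmetic sketch. The traffic figures and the 99.9% target are made up for illustration and are not taken from the linked article.

```python
def error_rate(errors: int, requests: int) -> float:
    """Fraction of requests that failed; 0.0 when there was no traffic."""
    if requests == 0:
        return 0.0
    return errors / requests


def breaches_target(errors: int, requests: int, target: float = 0.999) -> bool:
    """True when the success ratio falls below the (assumed) availability target."""
    return (1 - error_rate(errors, requests)) < target


# Hypothetical windows: the same single error, very different traffic.
quiet_window = breaches_target(errors=1, requests=50)        # 98% success
busy_window = breaches_target(errors=1, requests=100_000)    # 99.999% success
```

One failed request is enough to breach the target in the quiet window, while the busy window barely registers it.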

How reliability differs between monolithic and microservice-based architectures

This article outlines five facets of microservice architectures that can have implications for reliability.

Andre Newman — Gremlin

Children of the Magenta: The crash of American Airlines flight 965

If this title sounds familiar, I’ve linked to an article about the Children of the Magenta concept before. In this accident report, the pilots became confused about their location and course, and ultimately, their trust in the Flight Management System contributed to the disaster.

Kyra Dempsey (Admiral Cloudberg)

Establishing and Enabling a Center of Production Excellence

A Center of Production Excellence can be a powerful means for an organization to initiate transformations which foster resilience as it matures and its environment changes.

Nick Travaglini — Honeycomb

Full disclosure: Honeycomb is my employer.

Details of Google Cloud GCVE incident

Last week, I shared a story about an outage at UniSuper that was caused by Google Cloud. This week, Google shared more details about what went wrong, and it’s well worth a read.

Google

Heroku Incident #2664 Followup

This incident is intriguing because exponential backoff made the problem harder to detect.

Heroku
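
For readers who haven't looked closely at exponential backoff, here is a generic sketch of how retry delays grow. It is not a description of Heroku's systems; the base delay, multiplier, and cap are assumed values.

```python
def backoff_delays(
    base_seconds: float = 1.0,
    factor: float = 2.0,
    max_seconds: float = 300.0,
    attempts: int = 10,
) -> list[float]:
    """Delay before each retry: grows geometrically, capped at max_seconds."""
    delays = []
    delay = base_seconds
    for _ in range(attempts):
        delays.append(min(delay, max_seconds))
        delay *= factor
    return delays


schedule = backoff_delays()
# Total time spent waiting across all retries, in seconds.
total_wait = sum(schedule)
```

With these assumed settings, ten retries are spread over more than thirteen minutes, so a persistently failing client produces only a trickle of errors. That is one plausible way backoff can mute the signal you would otherwise alert on.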

Service level pitfalls

A discussion of what might get in the way of an organization implementing SLI/SLO/SLAs.

Note that the second half of the article (overcoming those obstacles) is behind a paywall. I don’t often recommend pay-only content, but it’s worth considering a subscription, because Alex is an excellent author whose work I’ve featured here many times.

Alex Ewerlöf

The error term isn’t Pareto distributed

if we look at a distribution of incidents by contributor (or cause, or component), we’re unlikely to see any one of these stand out as being the source of a large number of incidents.

Lorin Hochstein

SRE Weekly Issue #425

lex

May 19, 2024

Presenting to Engineering Leadership

Great practical advice for how to present reliability problems (and your proposed solutions) to e-staff.

Ross Brodbeck

Green is the color of complacency

It’s when things aren’t always on fire that it can be very difficult to assess whether we need to allocate additional resources to reduce risk.

Lorin Hochstein

Three roles you need for reliability success

The three kinds of roles covered in this article relate to Standards, Operations, and Leadership.

Gavin Cahill — Gremlin

It’s always TCP_NODELAY. Every damn time.

Nagle’s algorithm considered harmful? It’s important to be aware of it because it can trip you up.

Marc Brooker
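
Nagle's algorithm batches small writes before sending them, and the TCP_NODELAY socket option turns that batching off. As a minimal sketch using Python's standard socket module (the helper name is my own):

```python
import socket


def make_nodelay_socket() -> socket.socket:
    """Create a TCP socket with Nagle's algorithm disabled."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # A value of 1 sets TCP_NODELAY, so small writes are sent without waiting
    # to be coalesced with later ones.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock


sock = make_nodelay_socket()
# Read the option back to confirm it took effect.
nodelay_enabled = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0
sock.close()
```

Many client libraries expose this as a connection setting, so it's worth checking what yours does by default before setting it by hand.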

Amazon S3 will no longer charge for several HTTP error codes

In issue #423, I linked to a story about Amazon charging for unauthenticated and failed requests to S3 buckets. Thankfully, they’re no longer charging for that.

Amazon

UniSuper services fully restored

A little low on details, but interesting nonetheless: Google Cloud did something weird and accidentally deleted a customer’s account out from under them.

UniSuper

Service

What is a “service” in the context of service levels (SLI/SLO)?

Alex Ewerlöf

The importance of psychological safety in incident management

My favorite part of this one is the description of techniques for improving psychological safety at your company.

Incident.io

SRE Weekly, a production of Tinker Tinker Tinker, LLC

SRE Weekly Issue #424

lex

May 12, 2024

My Availability Investment Playbook

Here’s an ultra-practical guide to pushing for reliability investments at your company, formatted as a runbook with a set of specific steps.

Ross Brodbeck

MemoryDB: Speed, Durability, and Composition.

A neat dive into how Amazon’s MemoryDB composes multiple systems to create a redundant Redis-compatible data store.

Marc Brooker

The real cost of a blameful culture

This article looks into the economic and psychological impact of a culture of blame.

Lee Atchison — Blameless

The perils of outcome-based analysis

It took me two read-throughs to fully get this one, and I’m really glad I did it.

If we only examine people’s actions in the wake of an incident, and not when things go well, then we fall into the trap of selecting on the dependent variable.

Lorin Hochstein

The Hat Man

To prevent dangerous deploy collisions, these folks wrote an open source tool to mediate who gets to deploy when.

Andrew Kannan — Klaviyo

The technical learning curve at a startup is gentler than you might think

if you’ve never worked at a startup before, you may be over-estimating how much you need to learn and how quickly.

When all you have is early adopters, you’re in a more forgiving environment, including for reliability.

Nicholas Yan — Graphite

The Promise and Peril of JSON logging

Structured logging is great, but there can be pitfalls and gotchas.

Oakley Hall
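
Here is a minimal sketch of structured logging with Python's standard logging and json modules. It also shows one common gotcha, which may or may not be among those the linked article covers: values that aren't JSON-serializable. The logger name, the `context` field, and the sample values are all hypothetical.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # "context" is a hypothetical field passed via logging's `extra` argument.
        context = getattr(record, "context", None)
        if isinstance(context, dict):
            payload["context"] = context
        # default=str stops a non-serializable value from raising inside the log call.
        return json.dumps(payload, default=str)


stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("checkout")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

# A set is not JSON-serializable; without default=str this line would fail to log.
logger.info("payment failed", extra={"context": {"order_id": 42, "tags": {"retry"}}})
line = stream.getvalue()
```

Falling back to `str` keeps the log line flowing, at the cost of turning the value into an opaque string that downstream tooling can't query as structured data.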

SLO

An intro to SLOs with useful formulas, from the creator of the SLO Calculator featured here a while back.

Alex Ewerlöf
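
As a taste of the kind of formula involved, here is a sketch of the standard error budget calculation. These are the commonly used definitions, not necessarily the exact formulas from the linked article, and the 30-day window and sample numbers are assumptions.

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability in the window for a time-based SLO."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (1 - slo_percent / 100)


def budget_remaining(slo_percent: float, total_events: int, bad_events: int) -> float:
    """Fraction of an event-based error budget still unspent (negative if overspent)."""
    allowed_bad = total_events * (1 - slo_percent / 100)
    if allowed_bad == 0:
        return 0.0
    return 1 - bad_events / allowed_bad


# A 99.9% SLO over 30 days allows roughly 43.2 minutes of downtime.
monthly_budget = error_budget_minutes(99.9)
# Hypothetical month: 1,000,000 requests, 250 of them bad.
remaining = budget_remaining(99.9, total_events=1_000_000, bad_events=250)
```

In that hypothetical month, about three quarters of the budget is still available.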
