SRE Weekly Issue #394

A warm welcome to my new sponsor, FireHydrant!

A message from our sponsor, FireHydrant:

The 2023 DORA report has two conclusions with big impacts on incident management: incremental steps matter, and good culture contributes to performance. Dig into both topics and explore ideas for how to start making incremental improvements of your own.
https://firehydrant.com/ebook/dora-2023-incident-management/

This article gives an example checklist for a database version upgrade in RDS and explains why checklists can be so useful for changes like this.

  Nick Janetakis

The distinction in this article is between responding at all and responding correctly. Different techniques solve for availability vs reliability.

  incident.io

Latency and throughput are inextricably linked in TCP, and this article explains why with a primer on congestion windows and handshakes.

  Roberto Vitillo
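
A quick back-of-the-envelope sketch of that relationship (the numbers here are mine, not the article's): a single connection's throughput is bounded by its congestion window divided by the round-trip time, which is why latency caps throughput no matter how much bandwidth the link has.

    # Rough illustration: a single TCP connection's throughput is capped by cwnd / RTT.
    def max_throughput_bytes_per_sec(cwnd_bytes: int, rtt_seconds: float) -> float:
        """Upper bound on throughput for one TCP connection."""
        return cwnd_bytes / rtt_seconds

    cwnd = 64 * 1024   # a 64 KiB congestion window
    rtt = 0.100        # a 100 ms round trip
    # ~640 KiB/s, regardless of how fast the underlying link is
    print(max_throughput_bytes_per_sec(cwnd, rtt) / 1024)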

Tail latency has a huge impact on throughput and on the overall user experience. Measuring average latency just won’t cut it.

  Roberto Vitillo
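
To make the point concrete with a made-up latency sample (not data from the article): the mean can look perfectly healthy while the tail is awful.

    import statistics

    # Hypothetical sample: 95% of requests are fast, 5% hit a slow path.
    latencies_ms = [10] * 950 + [1500] * 50

    mean = statistics.mean(latencies_ms)                            # 84.5 ms -- looks tolerable
    p99 = sorted(latencies_ms)[int(0.99 * len(latencies_ms)) - 1]   # nearest-rank p99 = 1500 ms

    print(f"mean={mean:.1f} ms  p99={p99} ms")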

Is it really wrong though? Is it?

  Adam Gordon Bell — Earthly

I’ve shared the FAA’s infographic of the Dirty Dozen here previously, but here’s a more in-depth look at the first six items.

  Dr. Omar Memon — Simple Flying

It’s often necessary to go through far more than five whys to understand what’s really going on in a sociotechnical system.

  rachelbythebay

I found the bit about the AWS Incident/Communication Manager on-call role pretty interesting.

  Prathamesh Sonpatki — SRE Stories

SRE Weekly Issue #393

A message from our sponsor, Rootly:

Rootly is proud to have been recognized by G2 as a High Performer and Enterprise Leader in Incident Management for the sixth consecutive quarter! In total, we received nine G2 awards in the Summer Report. As a thank-you to our community, we’re giving away some awesome Rootly swag. Read our CEO’s blog post and pick up some free swag here:
https://rootly.com/blog/celebrating-our-nine-new-g2-awards

This repo contains a path to learn SRE, in the form of a list of concepts to familiarize oneself with.

  Teiva Harsanyi

How can we justify the (sometimes significant) expense of instilling observability into our systems?

  Nočnica Mellifera — SigNoz

It was DNS. Cloudflare’s 1.1.1.1 recursive DNS service failed this week, stemming from failure to parse the new ZONEMD record type.

  Ólafur Guðmundsson — Cloudflare

Rather than just dry theory, this article helps you understand what the CAP theorem means in practice as you choose a data store.

Note: this link was 504ing at time of publishing, so here’s the archive.org copy.

  Bala Kalavala — Open Source For U

A “blameless” culture can get in the way if it means you’re not allowed to make any mention of who was at the pointy end of your system when things blew up.

  incident.io

In this post, we will share how we formalized the LinkedIn Business Continuity & Resilience Program, how this new program helped increase our customers’ confidence in our operations, and the lessons that we learned as we attained ISO 22301 certification.

  Chau Vu — LinkedIn

This is the start of a 6-article series, with each article covering one week of a plan to prepare for SRE interviews.

We’ll spend each week focusing on building up your expertise in the key areas SREs need to know, like automation, monitoring, incident response, etc.

  Code Reliant

Beyond the CAP theorem, what actually happens during a partition?

“If there is a partition (P), how does the system trade off availability and consistency (A and C); else (E), when the system is running normally in the absence of partitions, how does the system trade off latency and consistency (L and C)?” [Daniel J. Abadi]

  Lohith Chittineni

SRE Weekly Issue #392

A message from our sponsor, Rootly:

Rootly is proud to have been recognized by G2 as a High Performer and Enterprise Leader in Incident Management for the sixth consecutive quarter! In total, we received nine G2 awards in the Summer Report. As a thank-you to our community, we’re giving away some awesome Rootly swag. Read our CEO’s blog post and pick up some free swag here:
https://rootly.com/blog/celebrating-our-nine-new-g2-awards

In the midst of industry discussions about productivity and automation, it’s all too easy to overlook the importance of properly reckoning with complexity.

There’s a cool bit in there about redistributing complexity rather than simply getting rid of it, using microservices as an example.

  Ken Mugrage — Thoughtworks — MIT Technology Review

Interesting idea: if we go too far toward making incident investigations comfortable and routine, we can make learning less likely.

  Dane Hillard — Jeli

A problem with P99 is that 1% of your customers have a worse experience, and P99 doesn’t capture how much worse.

   Cynthia Dunlop — The New Stack
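
A small sketch of what the article is getting at, with hypothetical numbers: two services can report the exact same p99 while treating their slowest 1% of users very differently.

    import statistics

    def p99_and_tail_mean(latencies_ms):
        """Nearest-rank p99, plus the mean latency of the slowest 1% of requests."""
        s = sorted(latencies_ms)
        cutoff = int(0.99 * len(s))
        return s[cutoff - 1], statistics.mean(s[cutoff:])

    # Same p99, very different tails.
    service_a = [50] * 985 + [200] * 15
    service_b = [50] * 985 + [200] * 5 + [5000] * 10

    print(p99_and_tail_mean(service_a))  # (200, 200.0)  -- the slowest 1% is fine
    print(p99_and_tail_mean(service_b))  # (200, 5000.0) -- the slowest 1% suffers badly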

Lambda isn’t “NoOps”, it’s just a different flavor of ops.

  Ernesto Marquez — Concurrency Labs

Salesforce had a major outage earlier this month, and now they’ve posted this followup analysis.

  Salesforce

This sysadmin story is a lesson in understanding the full context before passing judgement.

  rachelbythebay

Things get interesting toward the end, where they warn that focusing too narrowly on learning from incidents can cause problems.

  Luis Gonzalez — incident.io

The fail fast pattern is highly relevant for building reliable distributed systems. Rapid error detection and failure propagation prevent localized issues from cascading across system components.

  Code Reliant
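
As a rough sketch of the pattern (the function and numbers are my own, not from the article): failing fast means surfacing a sick dependency in milliseconds instead of letting every caller hang behind it.

    import socket

    class DependencyUnavailable(Exception):
        """Raised immediately so callers can degrade gracefully instead of hanging."""

    def check_dependency(host: str, port: int, timeout_s: float = 0.5) -> bool:
        # A short connect timeout surfaces the problem in half a second rather than
        # letting requests queue up behind an unresponsive dependency.
        try:
            with socket.create_connection((host, port), timeout=timeout_s):
                return True
        except OSError as exc:
            raise DependencyUnavailable(f"{host}:{port} unreachable: {exc}") from exc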

SRE Weekly Issue #391

A message from our sponsor, Rootly:

Rootly is proud to have been recognized by G2 as a High Performer and Enterprise Leader in Incident Management for the sixth consecutive quarter! In total, we received nine G2 awards in the Summer Report. As a thank-you to our community, we’re giving away some awesome Rootly swag. Read our CEO’s blog post and pick up some free swag here:
https://rootly.com/blog/celebrating-our-nine-new-g2-awards

Articles

Operating complex systems is about creating accurate mental models, and abstractions are a key ingredient.

   Code Reliant

Why is it hard to get an organization to focus on LFI (learning from incidents) rather than RCA (root cause analysis)? Here’s a really great explanation.

  Lorin Hochstein

It’s about more than just money — like engineer morale, slowed innovation, and lost customers.

  Aaron Lober — Blameless

A great primer on the CAP theorem with a real-world example scenario.

  Lohith Chittineni

It’s really interesting to see how they handled distributed queuing and throttling across a highly distributed cache network without sacrificing speed.

  George Thomas — Cloudflare

[…] LLMs are black boxes that produce nondeterministic outputs and cannot be debugged or tested using traditional software engineering techniques. Hooking these black boxes up to production introduces reliability and predictability problems that can be terrifying.

  Charity Majors — Honeycomb
  Full disclosure: Honeycomb is my employer.

Dig into and understand how enough things work, and eventually you’ll look like a wizard.

  Rachel By the Bay

As a rule of thumb, always set timeouts when making network calls. And if you build libraries, always set reasonable default timeouts and make them configurable for your clients.

  Roberto Vitillo
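
Here’s a minimal sketch of that rule of thumb using only the Python standard library; the class name and default values are illustrative, not from the article. The default keeps careless callers bounded, and the per-call override keeps the library flexible.

    import urllib.request

    DEFAULT_TIMEOUT_S = 5.0  # a bounded default for callers who never think about timeouts

    class HttpClient:
        """Tiny wrapper that always applies a timeout but lets callers override it."""

        def __init__(self, timeout_s: float = DEFAULT_TIMEOUT_S):
            self.timeout_s = timeout_s

        def get(self, url: str, timeout_s: float | None = None) -> bytes:
            # A per-call override wins; otherwise fall back to the client-wide default.
            effective = timeout_s if timeout_s is not None else self.timeout_s
            with urllib.request.urlopen(url, timeout=effective) as resp:
                return resp.read()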

SRE Weekly Issue #390

Many apologies to my email subscribers, who have seen two accidental re-sends of old issues recently due to a weird glitch in my automation. I think I’ve gotten a handle on it, and I’ll run an internal retrospective of this incident, of course.

A message from our sponsor, Rootly:

When incidents impact your customers, failing to communicate with them effectively can erode trust even further and compound an already difficult situation. Learn the essentials of customer-facing incident communication in Rootly’s latest blog post:
https://rootly.com/blog/the-medium-is-the-message-how-to-master-the-most-essential-incident-communication-channels

Articles

Is it really SRE vs platform engineer? Or is there a way platforms can take site reliability to the next level?

  Jennifer Riggins — The New Stack

A surgeon delves into the key component that allows a group of skilled individuals to work effectively and safely together, using the term “heed” to describe this special interaction.

Sidenote: in a hilarious coincidence this article managed to spoil me on a movie I was in the middle of watching (Arrival) — but it also put me in a really cool mindset to watch the rest of the film.

  Dr. Rob Poston

More details on Square’s outage from a couple weeks ago (it was DNS).

  Square

Azure had an interesting outage in its Australia East region involving a power failure and the order in which cooling units were restored.

  Microsoft Azure

Asking this question is how you unlock the hidden essence of an incident. This talk compares two public incident reports to show what it looks like when you dig into this question and when you don’t.

  Jacob Scott — InfoQ

In this air accident, the pilots made a seemingly inexplicable mistake.

This sentence really stood out to me, especially after reading the “How Did It Make Sense at the Time?” article:

When we inexplicably grab the wrong utensil when cooking or accidentally start taking our dirty dishes to the bathroom instead of the kitchen, we should be thankful that we aren’t responsible for a plane full of people.

  Admiral Cloudberg

There’s an interesting failure mode in this one that might stand out for the Kafka admins among us:

The Kafka consumer ended up stuck in a loop, unable to stabilize fast enough before timing out and restarting the coordination process.

  Jakub Oleksy — GitHub

After explaining the difference between the ITIL terms “incident management” and “problem management”, this article goes into a discussion of recent trends and whether it still makes sense to draw a distinction between the two.

  Luis Gonzalez — incident.io

A production of Tinker Tinker Tinker, LLC