SRE WEEKLY – Page 29 – scalability, availability, incident response, automation

SRE Weekly Issue #388

lex

September 3, 2023

Articles

Operating effectively in high surprise mode

This article makes a cool analogy between designing systems to operate well under unexpected load and designing socio-technical systems that operate well when the people are surprised by what the system is doing.

Lorin Hochstein

10 service level agreement practices you should implement

If you need to create SLAs, this article has some solid advice on how to go about it — and what to avoid.

incident.io

Prometheus scrape failures can cause alerts to be ‘resolved’

If Prometheus can’t scrape your service, an alert can get resolved incorrectly — and that can happen exactly when your service is failing!

Chris Siebenmann

A Spectrum of Actions

A really nifty three-part exploration of action items in the aftermath of an incident. Rather than consider cost/benefit, this article series proposes that we think about the likelihood of an action item being completed.

J. Paul Reed

Is Northern Virginia Really the Least Reliable AWS Region And Why?

Yes, as it turns out — and these folks have the receipts (along with some theories as to why).

Colin Bartlett

Reader: Insight and Incidents

The “wow” moment in this article is under the heading, “What can we learn from creative desperation?”

Eric Dobbs — Learning From Incidents

How to create automated paging and on-call at your startup

Before explaining how they set up their on-call, these folks share why they avoided it in the early stages of their startup, and what made them finally take the plunge.

Dustin Brown — DoltHub

The Dark Side of SRE

For the good of the profession, the SRE community still needs to coalesce around more consistent job ladders, expectations, and competencies.

Code Reliant

Incident Review: What Comes Up Must First Go Down

Honeycomb had their worst incident ever at the end of July, and in their characteristic style, they’ve posted an incredibly detailed analysis of what happened — and that’s just the blog post. Then you can click through for a 17-page PDF with lots more detail.

Fred Hebert — Honeycomb
Full disclosure: Honeycomb is my employer.

SRE Weekly Issue #387

lex

August 27, 2023

Articles

Scaling Software Systems: 10 Key Factors

In this post, we’ll explore 10 areas that are key to designing highly scalable architectures.

The 10 areas they cover in depth are:

Horizontal vs. Vertical Scaling

Load Balancing

Database Scaling

Asynchronous Processing

Stateless Systems

Caching

Network Bandwidth Optimization

Progressive Enhancement

Graceful Degradation

Code Scalability

Code Reliant

Time based vs Event based SLIs

Are you looking at the number of requests that were served successfully out of the total number of requests? Or the percentage of time the system was up and working properly?

Alex Ewerlöf
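
To make the two questions above concrete, here is a minimal sketch that computes both kinds of SLI from the same per-minute request counts. The function names, the `Window` type, and the 0.99 per-minute health threshold are illustrative assumptions, not taken from the linked article.

```python
from typing import List, Tuple

# One entry per minute: (good_requests, total_requests). Hypothetical data shape.
Window = Tuple[int, int]


def event_based_sli(windows: List[Window]) -> float:
    """Fraction of all requests that were served successfully."""
    good = sum(g for g, _ in windows)
    total = sum(t for _, t in windows)
    return good / total if total else 1.0


def time_based_sli(windows: List[Window], threshold: float = 0.99) -> float:
    """Fraction of minutes in which the service counted as 'working properly'.

    A minute is healthy if it had no traffic or its success ratio met the
    threshold (the threshold is an assumption made for this sketch).
    """
    if not windows:
        return 1.0
    healthy = sum(1 for g, t in windows if t == 0 or g / t >= threshold)
    return healthy / len(windows)


# Eight quiet, healthy minutes followed by two busy minutes with 50% failures.
windows = [(100, 100)] * 8 + [(500, 1000)] * 2
print(round(event_based_sli(windows), 3))  # 0.643
print(time_based_sli(windows))             # 0.8
```

The same ten minutes score 0.8 by time but about 0.643 by events, because the failures landed in the busiest minutes.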

I Don’t Alert on Apdex. It Confuses Me

This is my personal take on something that is considered standard that I just don’t understand. So here we go — the Apdex, what it is, and why I don’t use it!

Boris Cherkasky
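
For readers who have not met Apdex before, the standard definition can be sketched in a few lines: requests at or under a target latency T count as satisfied, those over T and up to 4T count as tolerating (worth half), and the rest count as frustrated. The function name and sample latencies below are illustrative assumptions, not taken from the linked article.

```python
from typing import Sequence


def apdex(latencies_s: Sequence[float], target_s: float) -> float:
    """Apdex score: (satisfied + tolerating / 2) / total samples."""
    if not latencies_s:
        raise ValueError("Apdex is undefined for zero samples")
    satisfied = sum(1 for x in latencies_s if x <= target_s)
    tolerating = sum(1 for x in latencies_s if target_s < x <= 4 * target_s)
    return (satisfied + tolerating / 2) / len(latencies_s)


# Hypothetical latencies in seconds, scored against a 0.5 s target:
# two satisfied, two tolerating, one frustrated.
print(apdex([0.1, 0.2, 0.6, 1.0, 3.0], target_s=0.5))  # 0.6
```

Note that very different latency distributions can collapse to the same score, which is one reason a single Apdex number can be hard to interpret in an alert.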

Reader: How the video “Three analytical traps in accident investigation” Helps me be a Better Incident Analyst

Here’s a great explanation of three common cognitive biases we should try to avoid while analyzing incidents.

Randy Horwitz — Learning From Incidents

Lily Cohen (@lily) Re: firefish.lgbt, musician.social, and outdoors.lgbt

A horrifying tale of gitops gone wrong and backups that didn’t back up, leading to catastrophic data loss. This, this is what hugops is for. I’m so sorry, Lily!

Lily Cohen

Authentication slowness or failure to load Duo Prompt on DUO1

Here’s a followup analysis from Duo for an incident they had last week.

Practical Guidance for First-Time Site Reliability Engineers

The first SRE hire at incident.io shares what they learned as they became familiar with the infrastructure and figured out what to do with it.

Ben Wheatley — The New Stack

Keeping the Lights On: The On-Call Process that Works

This is a story of building a new on-call rotation in a company that didn’t have one. They started out with a pretty awesome list of principles that we could all aspire to.

Felix Lopez — The New Stack

Test In Production

Why should we test in production? This article gives a really spot-on argument and goes on to explain how to do it.

Sven Hans Knecht

SRE Weekly Issue #386

lex

August 21, 2023

This issue was delayed a day while I was enjoying a much-needed vacation with my family. While I’m on the subject, it’s hot take time: vacations are important for the reliability of our sociotechnical systems, so good SREs should take vacations regularly and encourage others to as well.

Articles

Broken Ownership

If “you build it, you run it” requires mandate, knowledge, and responsibility, what happens when one of those is missing?

Alex Ewerlöf

Service Delivery Index: A Driver for Reliability

Slack developed an all-encompassing metric for the user experience that goes beyond a simple SLO.

Matthew McKeen and Ryan Katkov

Transactions in a Microservice World

This whitepaper delves deep into the ways a microservice architecture changes how transactions work. It presents a method of dealing with microservice transaction failures through application-specific compensation logic.

Frank Leymann — WSO2

Initial Investigation in the Bambu Cloud Temporary Outage

Bambu is a brand of 3D printers that are primarily cloud-based. A problem in their cloud system resulted in printers running jobs unexpectedly, causing significant damage to some customers’ printers.

Bambu Lab

Google Cloud Hybrid Connectivity Incident Report

An interesting confluence of fiber optic line failures resulted in loss of connectivity on what should have been a redundant link.

Google

SLOs Are Overrated

I know the title looks like clickbait, but this article delivers with seven well-thought-out critiques of SLOs.

Code Reliant

GitHub – runbear-io/awesome-runbook

This latest entry into the awesome-* arena is a curated list of runbooks and related resources for popular software.

Runbear

Normal incidents

You shift from asking “what was the abnormal work?” to “how did this incident happen even though everyone was doing normal work?”

This article immediately made me think of the latest Mentour Pilot accident investigation in which everyone acted nearly perfectly and yet still only narrowly avoided a mid-air collision.

Lorin Hochstein

SRE Weekly Issue #385

lex

August 13, 2023

Many apologies to Matt Cooper at GitHub, who is the actual author of the article Scaling Merge-ort Across GitHub from last week. Sorry for the mis-credit, Matt!

Articles

Debunking Myths About Reliability

This article will really come in handy next time you need to explain SRE to your execs.

Kit Merker — DevOps.com

Assessing Organizational Culture to Drive SRE Adoption

By mapping the Westrum Model of organizational cultures to SRE, we can understand SRE culture adoption.

Vladyslav Ukis and Ben Linders — InfoQ

Inside Disney’s Site Reliability Engineering practice

Disney’s SRE teams have ensured that the magic keeps happening, even as experiences and their underlying technology become more and more complex.

Ash Patel — SREPath

Tears in the Rain: The 2002 Überlingen midair collision

There’s so much to learn from this tragedy that I might read this one again. A mid-air collision these days should be effectively impossible due to TCAS. In this case, many factors conspired to bring about disaster.

Admiral Cloudberg

Hidden Benefits of SLOs

Here they are, out in the open:

SLOs create a common understanding in the organization about reliability

SLOs require investment into improved observability

SLOs prompt decisions about risk management… and risk-taking

Amin Astaneh — Certo Modo

Five Standard Models to Work on Incidents Effectively

The “five standard models” are actually more like a 5-stage workflow:

Triage,

Examine,

Diagnose,

Test, and

Cure.

Saheed Oladosu

Migrating Netflix to GraphQL Safely

This blog post will share broadly-applicable techniques (beyond GraphQL) we used to perform this migration. The three strategies we will discuss today are AB Testing, Replay Testing, and Sticky Canaries.

Jennifer Shin, Tejas Shikhare, Will Emmanuel — Netflix

Why Adaptive Rate Limiting is a Game-Changer

Building from a review of traditional rate limiting techniques, this article then explains adaptive rate limiting and its benefits.

Sudhanshu Prajapati — FluxNinja

SRE Weekly Issue #384

lex

August 6, 2023

Articles

Scaling merge-ort across GitHub

They tested this new git merge strategy by using Scientist, a framework that runs both the old and new implementation and compares the results.

Jesse Toth — GitHub
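
The run-both-and-compare idea can be sketched in a few lines. This is a Python illustration of the general pattern only, not Scientist's actual API (Scientist is a Ruby library); `run_experiment`, `old_merge`, and `new_merge` are hypothetical names invented for this example.

```python
from typing import Any, Callable, List, Tuple

# (args, control_result, candidate_result_or_exception)
Mismatch = Tuple[tuple, Any, Any]


def run_experiment(
    control: Callable[..., Any],
    candidate: Callable[..., Any],
    args: tuple,
    mismatches: List[Mismatch],
) -> Any:
    """Run both code paths, record disagreements, always return the control result."""
    control_result = control(*args)
    try:
        candidate_result = candidate(*args)
    except Exception as exc:  # a candidate failure must never reach the caller
        mismatches.append((args, control_result, exc))
        return control_result
    if candidate_result != control_result:
        mismatches.append((args, control_result, candidate_result))
    return control_result


# Hypothetical stand-ins for an old and a new implementation.
def old_merge(a: list, b: list) -> list:
    return sorted(a + b)


def new_merge(a: list, b: list) -> list:
    return sorted(set(a + b))  # deliberately differs: drops duplicates


mismatches: List[Mismatch] = []
result = run_experiment(old_merge, new_merge, ([1, 2], [2, 3]), mismatches)
print(result)           # the control's answer is what callers see
print(len(mismatches))  # disagreements are kept for later review
```

Because callers only ever receive the control's result, the new implementation can be exercised on real traffic without its bugs or exceptions affecting users.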

Why is DNS still hard to learn?

DNS is simple (kinda) but it can be really difficult to fully wrap your head around it. This article explains why, and in the process gives a blueprint for designing more understandable tools in general.

Julia Evans

Fallback

Fallback is different from Failover for a number of reasons. This article describes how they differ, how fallback works, and why you might choose it over failover.

Alex Ewerlöf

GitHub – bregman-arie/sre-checklist: A checklist of anyone practicing Site Reliability Engineering

Repository Purpose: Provide teams and individuals an idea on what to take into consideration and what to aspire for in the SRE field and work

Note: these checklists are opinionated.

Arie Bregman

Reader: Carrots, sticks, and making things worse

A thought-provoking article on trying to change people’s behavior in incidents through incentives (positive or negative) without also changing the context in which they act.

Fred Hebert — Learning From Incidents

Hardening Workers KV

Cloudflare shares what they learned as they transitioned their KV service to a new architecture, which resulted in multiple unexpected problems.

Matt Silverlock, Charles Burnett, Rob Sutter, and Kris Evans — Cloudflare

Anything But Tech Debt

In this article, learn about two interesting strategies for getting an organization to prioritize technical debt work: using a more specific name for the work, and referencing the work’s impact on an SLO — and the impact of not doing the work.

Emily Nakashima — Honeycomb
Full disclosure: Honeycomb is my employer.
