SRE WEEKLY – Page 21 – scalability, availability, incident response, automation

SRE Weekly Issue #382

lex

July 23, 2023

Articles

Solving challenges caused by Out Of Memory (OOM) Killer in Linux

The Linux OOM killer can already be a bugbear, and things only get more complicated when you add containers to the mix.

Rafał Korepta — RedPanda

Align platform and product engineering teams over incidents

This post explores how to align platform and product engineering teams by implementing business value proxy metrics and using incidents to inform them.

The same metrics that we use to measure other initiatives against business priorities may be able to show us whether our incident response process is effective.

Gonzalo Maldonado — FireHydrant

DevOps vs SRE: Is it a party?

Here’s another take on DevOps vs SRE, using the metaphor of organizing a party.

Diogo Souza

Embrace AI Acceleration by Investing in Reliability

How do you balance taking advantage of the acceleration and innovation of AI while not compromising reliability and losing users?

Jim Gochee — The New Stack

“Human Error” is the Scapegoat for Systemic and Organizational Failures

My favorite part is the bit about the risks of automation and keeping humans in the loop.

Dr. Mica Endsley — Business News This Week

Revolutionizing Infrastructure Management: The Power of Feature Flags in IaC

It’s about reliability: IaC changes carry just as much risk to reliability as product code changes, if not more. How can we bring feature flags to IaC?

Josephine E. Justin, Srikanth Murali, and Norton Stanley S A — DZone

On-Call Stories: Flying Blind

Oh, the tangled web we weave when we send automated emails.

Amin Astaneh — Certo Modo

Lessons Learned Running Presto at Meta Scale

Here are four things we learned while scaling up Presto to Meta scale, and some advice if you’re interested in running your own queries at scale.

High Scalability

SRE Weekly Issue #381

lex

July 16, 2023

Articles

The Pyramid of Alerting

The Pyramid introduced in this article is three levels of monitoring: Operational, Data Validation, and Business Assumptions. These roughly correspond to questions like: is the system up? Is the right amount of data flowing through it? Is that data correct?

Karel Vanden Bussche — DEV

Incident Review for Site-wide Outage for GitLab.com – Stale Terraform Pipeline #15997 (#15999) · Issues · GitLab.com / GitLab Infrastructure Team / production

Extremely powerful tools can become extremely powerful footguns, for example Terraform.

Dave Smith — GitLab

latency: a primer

Sure, you know what latency is, but do you really know what a percentile is? A histogram? A heatmap?

igor
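
To make the percentile question from the latency primer above concrete, here is a minimal sketch using the nearest-rank method. It is not taken from the linked article; the `percentile` helper and the sample latencies are made up for illustration. It shows how a single slow request barely moves the median but dominates both the tail percentile and the mean.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must be non-empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))  # 1-based rank into the sorted list
    return ordered[rank - 1]


# Hypothetical request latencies in milliseconds, with one slow outlier.
latencies_ms = [12, 15, 11, 14, 13, 250, 12, 16, 13, 14]

p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
mean = sum(latencies_ms) / len(latencies_ms)

print(f"p50={p50} ms, p99={p99} ms, mean={mean} ms")
```

Other percentile definitions interpolate between neighbouring samples, so results can differ slightly from tool to tool, especially on small sample sets.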

CDN Observability

If you’re using a CDN, you need to keep an eye on it. Here’s a primer on what to watch for.

Or Hillel — DZone

Principles of Reliable Software Design

This article series covers 12 aspects important in the design of reliable systems. Some of the aspects, such as modularity, loose coupling, graceful degradation, and redundancy, are covered in depth.

Code Reliant

GitHub Availability Report: June 2023

A couple weeks back, GitHub was hard down, even including its status page at times. This report goes into that in detail, and the cause is pretty interesting.

Jakub Oleksy – GitHub

Failover

An in-depth look at different kinds of failover, including each kind’s methodology and purposes.

Alex Ewerlöf

Finding Fault: The crash of Korean Air Cargo flight 8509

This one is especially interesting for the controversial and baseless conclusions popularized in the media about a supposed cause rooted in Korean culture. It’s a good reminder that we need to be careful to ensure the validity of the lessons we learn from incidents.

Admiral Cloudberg

SRE Weekly Issue #380

lex

July 9, 2023

Articles

Amazon Prime Video’s Microservices Move Doesn’t Lead to a Monolith after All

Well, that cleared things up. (It didn’t, but the debate is interesting.)

Scott M. Fulton III — The New Stack

5 strategies to improve your incident communication

This article has five tips for great incident communication, along with a section on why this matters.

Luis Gonzalez — incident.io

SRE Engagement Models

Beyond just a list of ways SREs interface with other teams, this article also compares them and gives advantages and disadvantages of each.

Amin Astaneh — Certo Modo

Resilience requires helping each other out

Building every system to be strong enough to handle peak load can be very expensive. Can we instead build our systems to take excess load from each other cooperatively?

Lorin Hochstein — Surfing Complexity

Ensuring reliability: SLOs, on-call process, and postmortems

Another useful “how we do SRE” post, including an incident report template.

Pavel Pritchin — Dodo Engineering

Incident severity: why you need it and how to ensure it’s set

Here’s an interesting twist on the usual “incident severity 101” article: in a company where “anyone can declare an incident”, how do you make sure incident severity gets set consistently in every incident?

Mike Lacsamana — FireHydrant

Impedance Mismatch: SRE vs Dev Speed

How can we work to improve reliability when folks perceive our efforts to be counter to velocity?

Code Reliant

The Problem With Nonpunitive Safety Culture

In a blameless culture without consequences, what’s the incentive for learning to make the system more reliable? This is an incredibly thought-provoking article and I’m still not sure how I feel about it.

Robert Poston MD

SRE Weekly Issue #379

lex

July 2, 2023

Articles

The Saga Is Antipattern

In case, like me, you weren’t familiar with the Saga pattern: it’s basically a pseudo-transaction across multiple microservices. Here’s why it might not be a great idea.

Sergiy Yevtushenko
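
For readers new to the Saga pattern mentioned above, here is a minimal in-process sketch of the general idea, not code from the linked article. Each step pairs an action with a compensating action, and a failure triggers the compensations of the already-completed steps in reverse order. The `run_saga` helper and the step functions are hypothetical stand-ins for calls to separate services.

```python
def run_saga(steps):
    """Run (action, compensation) pairs in order.

    If an action raises, undo the completed steps in reverse order and
    return False. Return True when every action succeeds.
    """
    completed = []
    for action, compensate in steps:
        try:
            action()
        except Exception:
            for undo in reversed(completed):
                undo()
            return False
        completed.append(compensate)
    return True


# Hypothetical steps; a real saga would call remote services here.
log = []


def reserve_stock():
    log.append("reserve_stock")


def release_stock():
    log.append("release_stock")


def charge_card():
    raise RuntimeError("payment declined")


def refund_card():
    log.append("refund_card")


ok = run_saga([(reserve_stock, release_stock), (charge_card, refund_card)])
print(ok, log)
```

The sketch assumes compensations never fail and that nothing else observes the intermediate state. Real distributed systems cannot make either assumption, which is part of why the pattern gives only a pseudo-transaction.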

The Story Behind Last Week’s Let’s Encrypt Downtime

During a rolling deploy, for a very brief period of time, different parts of the infrastructure had old or new code running, with unexpected results.

Andrew Ayer

Generating sequential numbers in a distributed manner

On its face, we have a simple requirement:

Generate sequential numbers

Ensure that there can be no gaps

Do that in a distributed manner

It’s never simple with distributed systems.

Lost in transit: debugging dropped packets from negative header lengths

In classic Cloudflare style, here’s an ultra-deep dive into the kernel to find the source of trouble-making packet loss.

Terin Stock — Cloudflare

There Are No Repeat Incidents

Even with a “duplicate” incident, there’s always at least one thing that’s different: the fact that it’s happened before. That changes things. In practice, a lot more will be different too.

Fred Hebert — Honeycomb
Full disclosure: Honeycomb is my employer.

Why So Many Companies Run in AWS us-east-1

There are definitely pros and cons to being in the most popular (and most oft-maligned) AWS region.

Jeff Martens — Metrist

What Is That Change Which Is The Source Of All Instability?

Changes are frequent causes of incidents, but what exactly counts as a change? This article delves into that with examples.

Boris Cherkasky

Collision with the Terminal: The crash of RwandAir flight 205

This crash is a great reminder that we have to look past “human error” to the systems around the humans that set them up for failure (or don’t set them up for success).

Admiral Cloudberg

SRE Weekly Issue #378

lex

June 25, 2023

Articles

“One-Engined-Zulu”

This is the story of a fascinating incident in which a commercial airplane’s engine was ripped off during takeoff (also covered on Mentour Pilot). What really struck me is the way a huge team on the ground and in the air assembled around the incident and all played very important roles in getting the plane down safely.

Mark D. Young — PoliticsWeb

Catchpoint’s 2024 SRE Survey Is Here – We Need YOU!

Time for another Catchpoint SRE Survey! They donate $5 to the Red Cross for every completed survey, so let’s all work together and drive a huge donation!

Catchpoint

FTC Request, Answered: How Cloud Providers Do Business

The US Federal Trade Commission (FTC) put out a request for information about cloud providers, including reliability among other topics. Here’s Corey Quinn’s answer.

Corey Quinn — The Duckbill Group

The “people problem” of incident management

What can you do when running an incident feels like herding cats? This article has some tips.

Robert Ross — FireHydrant

Monitoring is a Pain

I have a confession. Despite having been hired multiple times in part due to my experience with monitoring platforms, I have come to hate monitoring.

This jaded tale also contains some good suggestions for dealing with monitoring pitfalls.

Mathew Duggan

Resilient Retry and Recovery Mechanism: Enhancing Fault Tolerance and System Reliability

The cardinal rule of engineering:

your solution shouldn’t become your next problem.

Kumar Amit — Mercari

Embrace Complexity; Tighten Your Feedback Loops

Here’s the articlization of a talk Fred Hebert gave at QCon New York. The alternate title of the talk is:

This Is All Going To Hell Anyway
All We Can Do Is Influence How Long It’s Gonna Take

I had the pleasure of seeing a draft version of this talk at work, since (full disclosure) Fred is my coworker.

Fred Hebert

Why elasticity is essential for delivering realtime updates at scale

This article makes the case that elastic scaling is both harder to implement and more important for use cases involving streaming updates to users in real-time.

Mittul Madaan — Ably

Parallel Distributed Shell

An intro to pdsh, my favorite of the tools that run commands on many hosts via SSH.

Amin Astaneh — Certo Modo
