SRE WEEKLY – Page 24 – scalability, availability, incident response, automation

SRE Weekly Issue #410

lex

February 4, 2024

Staying in the Zone: How DoorDash used a service mesh to manage data transfer, reducing hops and cloud spend

In this blog post, we describe the journey DoorDash took using a service mesh to realize data transfer cost savings without sacrificing service quality.

Hochuen Wong and Levon Stepanian — DoorDash

APAC Retrospective: Learnings from a Year of Tech Outages – Dismantling Knowledge Silos

When just a few “regulars” are called in to handle every incident, you’ve got a knowledge gap to fill in your organization.

David Ridge — PagerDuty

How the data center site selection process works at Dropbox

Dropbox expands into new datacenters often, so they have a streamlined and detailed process for choosing datacenter vendors.

Edward del Rio — Dropbox

Untangle Blockers that impede Site Reliability Engineering (SRE) adoption.

This is either nine things that could derail your SRE program, or a list of things to do with “not” in front of them — either way, it’s a good list.

Shyam Venkat

Beyond Debugging: Harnessing Preattentive Processes in Incident Response

We need enough alerting in our systems that we can detect lurking anomalies, but not so much that we get alert fatigue.

Dennis Henry

SRE and Product

A post about the importance of product in SRE, and how to make product and SRE first-class citizens in your Software Development Lifecycle.

Jamie Allen

Panic on the Schoolyard: The Merion midair collision (death of Senator John Heinz)

A relatively minor incident took a turn for the worse after the pilots attempted a close fly-by to resolve it. I swear I’ve been in this kind of incident before, where I took risks significantly out of proportion to the problem I was trying to solve.

Kyra Dempsey (Admiral Cloudberg)

SRE Weekly Issue #409

lex

January 28, 2024

Executing Cron Scripts Reliably At Scale

I’ve occasionally wondered what’s behind Slack’s /remind or “clear my away status after my vacation ends”. Now I know!

Claire Adams

Consistency

This article is an exploration of consistency and coordination in distributed systems, with lots of really interesting examples.

Lorin Hochstein

GitHub – seifrajhi/awesome-platform-engineering-tools: A curated list of Platform Engineering Tools

Lots of good stuff in here, including infrastructure, monitoring, and incident management tools.

saifeddine Rajhi

AWS re:Invent 2023 – an SREs experience

my first conference

Whew, way to dive into the deep end!

Mike [surname unknown] — SREZone

Enhancing Resiliency: Implementing the Circuit Breaker Pattern for Strong Serverless Architecture on AWS

This article explains why circuit breakers are especially useful in microservice architectures based on Lambda. It explains how to implement circuit breakers using Step Functions.

Satrajit Basu — DZone

5 SRE Predictions For 2024

Definitely some interesting (and spicy!) takes in this one.

Code Reliant

The Evolution of Enforcing our Professional Community Policies at Scale

When you’re at LinkedIn’s scale, building automated abuse mitigation means designing for high throughput. The answer: lots of caching.

Amit Mathapati — LinkedIn

SRE Tip: SREs Require Their Own Chain of Command

A short but thought-provoking article about where SREs belong in the management hierarchy, and why.

Jamie Allen

SRE Weekly Issue #408

lex

January 21, 2024

Tell me about a time…

This is either a set of SRE interview topics or the squares for the SRE bingo card.

Lorin Hochstein

Blame Awareness is Universal

Blame awareness only works if you work towards blame awareness with all incidents, not just the ones that affect you.

Will Gallego

Rebuilding Netflix Video Processing Pipeline with Microservices

a brief history of our pipeline and the platforms, why the rebuilding was necessary, what these new services look like, and how they are being used for Netflix businesses.

Liwei Guo, Anush Moorthy, Li-Heng Chen, Vinicius Carvalho, Aditya Mavlankar, Agata Opalach, Adithya Prakash, Kyle Swanson, Jessica Tweneboah, Subbu Venkatrav, Lishan Zhu — Netflix

Best practices to prevent alert fatigue

Here are five concrete tips to fix your alerts and reduce alert fatigue.

Candace Shamieh, Daljeet Sandu, and Nicolas Narbais — Datadog

SRE Governance

This article contains guidelines for many kinds of reviews and activities SRE can do to improve reliability, such as SLO reviews, dependency reviews, and more.

Jamie Allen

Alerts Are Fundamentally Messy

However, the reality of alerting in a socio-technical system must cater not only to the mess around the signal, but also to the longer term interpretation of alerts by people and automation acting on them. This post will expand on this messiness and why Honeycomb favors an iterative approach to setting our alerts.

Fred Hebert — Honeycomb
Full disclosure: Honeycomb is my employer.

#23 – The Danger of Unreliable Platforms (with Jade Rubick)

This far-ranging conversation covers many aspects of developing a reliable platform for engineering. There’s a text summary if audio’s not your thing.

Ash Patel — SREPath

Slack’s Migration to a Cellular Architecture

Spurred by a single-AZ outage that took down their service, Slack set out to break their system into isolated segments so that an AZ can be drained of traffic quickly and without impacting customers.

Cooper Bethea — Slack

SRE Weekly Issue #407

lex

January 14, 2024

On chains and complex systems

If you really want to understand how complex systems fail, you need to think in terms of webs rather than chains.

Lorin Hochstein

Practitioners Share How They Remove the Fear of On-Call

We asked members of the PagerDuty Community what they do to remove the fear of being on-call and also asked them to share a piece of advice for those starting out on the on-call rotation and here are some of their insightful tips!

Xenda Amici

How to conduct a postmortem review meeting

There’s some interesting advice in here that I haven’t heard before, like rerunning the incident review meeting if you don’t get enough out of it the first time. Have any of you ever done this?

Jonathan Word

The SRE Report 2024

Catchpoint’s annual SRE report is out, and you can download the PDF without even having to fill out a form.

Catchpoint

The Incident Lifecycle: How a Culture of Resilience Can Help You Accomplish Your Goals

The cool thing about this article is the discussions of anti-patterns to avoid, sprinkled throughout.

Vanessa Huerta Granda — InfoQ

A Deep Dive Into Azure Load Balancing Options

I cover GCP and AWS here a lot, so now it’s Azure’s turn, with this detailed guide on load balancing.

Shivaprasad Sankesha Narayana — DZone

An overview of Cloudflare’s logging pipeline

Read this one to learn how Cloudflare implemented a reliable logging pipeline that handles 1 million log lines per second.

Colin Douch — Cloudflare

SRE Weekly Issue #406

lex

January 7, 2024

How to Show Your Value In DevOps/SRE

This article describes how to clearly show the value you deliver to a tech company as someone who focuses on non-functional requirements such as operability, performance, or reliability.

Amin Astaneh — Certo Modo

The courage to imagine other failures

Doggedly preventing a recurrence of an incident may not be the best way to protect our systems — and may in fact make things worse.

Lorin Hochstein

SLO Compliance Period

Should your SLO cover a rolling 30 days? 7 days? A calendar month?

Alex Ewerlöf

How Meta built the infrastructure for Threads

Threads was built in five months and had over 100 million users in its first week.

Laine Campbell and Chunqiang (CQ) Tang — Meta

Setting the foundations for on-call that’s fair, balanced, and human-focused

This article is full of advice on setting up an on-call process that’s livable and less likely to burn folks out.

incident.io

Lesser of Two Evils: The crash of Ameristar Charters flight 9363

A pilot violated a major aviation principle, and it was the right move. It’s very interesting to me that pilots are trained on the principle but not on the exceptions, with the expectation that they will react well in exceptional circumstances.

Admiral Cloudberg

How to Do UUID as Primary Keys the Right Way

Integer IDs or UUIDs as your DB primary key? I can’t count the number of incidents I’ve been involved in where integer primary keys played a part.

Bertrand Florat

Subscribe

RSS

Mastodon

Search Issues

A message from our sponsor, FireHydrant: