General

SRE Weekly Issue #340

lex

September 25, 2022

Articles

SREcon Americas 2020: Exposing the Human Factor

This one’s from a couple of years ago and covers three main themes the author saw at SREcon Americas 2020. Fascinating topics include providing context for newbies, learning from incidents, and rethinking the incident command system.

Taylor Barnett — Transposit

Honeycomb preliminary incident report: Ingestion delays

On September 8, Honeycomb had a major outage in data ingestion, and they’ve posted this preliminary report, “pending an in-depth incident review in the upcoming weeks”.

BONUS CONTENT: Another outage report from a different outage the next day.

Honeycomb
Full disclosure: Honeycomb is my employer.

/r/sre Thread: A “real” day in the life of an SRE

This is neat! Someone posted a day in their life as an actual SRE, and a bunch of commenters followed suit.

Various commenters — Reddit

What’s Difficult About Problem Detection? Three Key Takeaways

Some big names in SRE got together to talk about how to know when your system is broken. Listen to the recording or read this excellent summary that goes in depth on grey failures and more.

Emily Arnott — Blameless

Scaling Robinhood Crypto Systems

To better scale our systems, our infrastructure and product teams got together and decided to make these optimizations: reduce database loads, conduct load tests and size the demand and prioritize critical flows.

…and sharding.

Robinhood

How an incident transformed Razorpay — Building our Command Center

A major incident went poorly, and that catalyzed investment in developing a new incident response system. They worked to transition from swarming to Incident Command.

Vikrant Saini — Razorpay

Consider these 9 microservices best practices to help you ditch your monolith — Cortex

I love this part:

[…] if you have to deploy your microservices in a certain order, they’re not really microservices.

Cortex

Heroku Incident 2451 Follow-up

This one had an interesting interplay of contributing factors.

Heroku

SRE Weekly Issue #339

lex

September 18, 2022

It’s with great sadness that I note the passing of a giant in our field, Dr. Richard Cook. His memory will live on through his huge body of work and the countless ways he’s impacted our thinking and practice as SREs.

Articles

The Career, Accomplishments, and Impact of Richard I. Cook: A Life in Many Acts

Here’s a wonderful tribute to the many ways Dr. Cook has advanced our field and others.

John Allspaw — Adaptive Capacity Labs

How Complex Systems Fail

This seems like a fitting time to feature Dr. Cook’s seminal treatise here again.

Dr. Richard Cook

A new channel per incident – helpful or harmful?

A good argument could be made either way, but what really caught my eye was this (emphasis mine):

Responding to incidents should distract as few people as reasonably possible. Organisations should be shooting for minimum viable participation, whilst still responding effectively, to allow them to retain focus.

Chris Evans — incident.io

The Curious Connection Between Cloud Repatriation and SRE Ops

Noticing a correlation between the adoption of SRE and cloud repatriation (moving apps out of the cloud), the author of this article asks: is there causation?

Lori MacVittie — DevOps.com

The Hows and Whys of Effective Production-Readiness Reviews

I like the line this article draws between incident retrospectives and developing a PRR process, and also the emphasis on psychological safety.

Incidents reveal what your organization is good at and what needs improvement in your PRR processes.

Nora Jones — Jeli

fluxninja/aperture: Flow control and reliability management for modern cloud applications

Aperture is a new open source tool that helps you prevent cascading failures using load-shedding and rate limiting.

BONUS CONTENT: Here’s their article explaining how it works.

FluxNinja
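
For readers new to the two techniques named above, here is a minimal sketch of rate limiting with a token bucket, plus load-shedding on top of it. This is a generic illustration, not Aperture’s implementation or API; `TokenBucket`, `handle`, and the injectable `clock` parameter are hypothetical names chosen for the example.

```python
import time


class TokenBucket:
    """Token-bucket limiter: admits a request only if a token is available."""

    def __init__(self, rate_per_sec, burst, clock=time.monotonic):
        self.rate = float(rate_per_sec)  # tokens added per second
        self.capacity = float(burst)     # maximum tokens the bucket can hold
        self.tokens = float(burst)       # start full so an initial burst is allowed
        self.clock = clock               # injectable so the limiter is easy to test
        self.last = clock()

    def allow(self):
        now = self.clock()
        # Refill in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def handle(bucket, request, backend):
    """Shed load early: reject instead of passing overload to the backend."""
    if not bucket.allow():
        return 429, "rejected: over capacity"
    return 200, backend(request)
```

Rejecting at the edge keeps an overloaded backend from dragging down its callers, which is the cascading-failure scenario the blurb mentions.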

SRE Weekly Issue #338

lex

September 11, 2022

Articles

Intro to Themes and Takeaways

This one advocates for looking beyond “root cause” when analyzing an incident, and instead finding Themes and Takeaways.

If it can be solved with a pull request it’s not a takeaway.

Vanessa Huerta Granda — Jeli

Incident Review: Working as Designed, But Still Failing

In this juicy incident, the Incident Commander’s intimate knowledge of a similar failure mode steered incident response away from the true cause.

Fred Hebert — Honeycomb

Running More Low-Severity Incidents Is Improving Our Culture

[…] the more we normalize lower-impact incidents, the more confidence and experience we build for Sev1 situations.

Dan Condomitti — The New Stack

We’re making our on-call calculator free

Want to compensate folks extra for on-call work? This tool connects to PagerDuty to do all the heavy lifting for you.

Lawrence Jones — incident.io

What’s the weirdest outage reason you dealt with throughout your career?

This Reddit post in r/sre has some really great stories in the comments.

Various users — Reddit

Why you need an incident timeline

Along with the “why”, this article also goes into the “how”.

Martha Lambert — incident.io

How to send raw network packets in Python with tun/tap

Early in my career, I had to write a raw IP packet generator to reproduce a DoS attack so that I could mitigate it. It’s fun!

Julia Evans

GitHub Availability Report: August 2022

In an incident in July, a cloud provider change broke provisioning for new Codespaces VMs, taking down the service.

Jakub Oleksy — GitHub

Avoid the Dirty Dozen

Put Safety First and Minimize the 12 Common Causes of Mistakes in the Aviation Workplace

FAA (US Federal Aviation Administration)

SRE Weekly Issue #337

lex

September 4, 2022

Thanks for all the vacation well-wishes! It was really great and relaxing. Take vacations: they’re important for reliability!

While I was out, I shipped the past two issues with content prepared in advance, and without the Outages section. This gave me a chance to really think hard about the value of the Outages section versus the time and effort I put into it.

I’ve decided to put the Outages section on hiatus for the time being. For notable outages, I’ll include them in the main section, on a case-by-case basis. Read on if you’re interested in what went into this decision.

The Outages section has always been of lower quality than the rest of the newsletter. I have no scientific process for choosing which Outages make the cut — mostly it’s just whatever shows up in my Google search alerts and seems “important”, minus a few arbitrary categories that don’t seem particularly interesting like telecoms and games. I do only a cursory review of the outage-related news articles I link to, and often they’re on poor-quality sites with a ton of intrusive ads. Gathering the list of Outages has begun taking more and more of my time, and I’d much rather spend that effort on curating quality content, so that’s what I’m going to do going forward.

10 Things I Learned From My First Incident Review

Every one of these 10 items is enough reason to read this article! This makes me want to go investigate some incidents right now.

Fischer Jemison — Jeli

Slowing Down to Speed Up – Circuit Breakers for Slack’s CI/CD

Slack shares with us in great detail why they use circuit breakers and how they rolled them out.

Frank Chen — Slack
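
As background for the article, here is a minimal sketch of the general circuit breaker pattern. It is not Slack’s implementation; the `CircuitBreaker` class, its thresholds, and the injectable `clock` are assumptions made for illustration.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is skipped."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before tripping
        self.reset_after = reset_after    # seconds to wait before a trial call
        self.clock = clock
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; skipping call")
            # Cool-down elapsed: half-open, so let one trial call through.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()  # trip (or re-trip) the breaker
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

While the breaker is open, callers fail fast instead of piling more work onto a struggling dependency, and a single trial call after the cool-down decides whether to close it again.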

Tips to Make Your On-Call Process Less Stressful

My favorite part of this one is the section on expectations. We need to socialize this to help reduce the pressure on folks going on call for the first time.

Prakya Vasudevan — Squadcast

Why Status Pages Are Lying to You and What To Do About It

Status pages are marketing material. Prove me wrong.

Ellen Steinke — Metrist

Using incidents to level up your teams

incidents have unusually high information density compared with day-to-day work, and they enable you to piggy-back on the experience of others

Lisa Karlin Curtis — incident.io

How we store and process millions of orders daily

These folks realized that they had two different use cases for the same data: real-time transactions and batch processing. Rather than try to find one DB that could support both, they forked the data into two copies.

Xi Chen and Siliang Cao — Grab

Live Your Best Life With Structured Events

It’s all about gathering enough information that you can ask new questions when something goes wrong, rather than being stuck with only answers to the questions you thought to ask in advance.

Charity Majors
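
A rough sketch of what emitting a structured event can look like: one wide record per unit of work, carrying enough fields to slice by later. The function name, field names, and the `emit` callback are hypothetical and are not taken from the article.

```python
import json
import time


def handle_request(user_id, endpoint, work, emit=print):
    """Run `work()` and emit exactly one structured event describing it."""
    event = {"endpoint": endpoint, "user_id": user_id}
    start = time.monotonic()
    try:
        result = work()
        event["status"] = "ok"
        return result
    except Exception as exc:
        event["status"] = "error"
        event["error_type"] = type(exc).__name__
        raise
    finally:
        # Emitted on both success and failure, so every request leaves a record.
        event["duration_ms"] = round((time.monotonic() - start) * 1000, 3)
        emit(json.dumps(event, sort_keys=True))
```

Because each event keeps its fields together, a question nobody planned for (say, error rate per endpoint for one user) can still be answered after the fact.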

How Discord Supercharges Network Disks for Extreme Low Latency

They needed the speed of local ephemeral SSDs but the reliability of network-based persistent disks. The solution: a Linux MD option that mirrors the data but prefers to read from the local disks. Neat!

Glen Oakley — Discord

Operating system upgrades at LinkedIn’s scale

OS upgrades can be risky. LinkedIn developed a system to unify OS upgrade procedures and make them much less risky.

Hengyang Hu, Dinesh Dhakal, and Kalyanasundaram Somasundaram — LinkedIn

SRE Weekly Issue #336

lex

August 28, 2022

Articles

What it’s like to work as an embedded microservices SRE

In this article, I will introduce several improvements being made by the Microservices SRE Team, embedded with other teams.

MizumotoShota — Mercari

What should be on a SLI dashboard

What really stood out to me in this article is the Service Info section. A dashboard will quickly atrophy and lose its meaning without an explanation of what it’s for.

Ali Sattari

SRE: From Theory to Practice: What’s Difficult About Incident Command?

When things go wrong, who is in charge? And what does it feel like to do that role?

This is a summary of a forum discussion about incident command, in case you don’t have time to listen to the whole thing.

Emily Arnott — Blameless

Complex Adaptive Systems and ITSM

Complex systems are weird, and a traditional deterministic view such as in older ITIL iterations doesn’t capture the situation. We need to evolve our practices.

Jon Stevens-Hall

Latency- and Throughput-Optimized Clusters Under Load

How can you design and interpret metrics for systems optimized for latency or throughput?

Dan Slimmon

The Latency/Throughput Tradeoff: Why Fast Services Are Slow And Vice Versa

You can optimize for latency or throughput in a given system, but not both, since the two are directly at odds.

Dan Slimmon
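
One simple way to see the tension is a toy batching model, sketched below. It is an illustration under assumed numbers, not the model used in the article: each batch pays a fixed overhead plus a per-item cost, and `batch_stats`, `overhead_s`, and `per_item_s` are hypothetical names and values.

```python
def batch_stats(batch_size, overhead_s=0.010, per_item_s=0.001):
    """Return (latency_s, throughput_items_per_s) for one batch in a toy model."""
    # Every item in the batch waits for the whole batch to finish.
    latency = overhead_s + batch_size * per_item_s
    # Larger batches amortize the fixed overhead over more items.
    throughput = batch_size / latency
    return latency, throughput


for n in (1, 10, 100):
    latency, throughput = batch_stats(n)
    print(f"batch={n:>3}  latency={latency * 1000:.1f} ms  throughput={throughput:.1f} items/s")
```

In this model, growing the batch raises throughput and latency together, so tuning for one costs you the other.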
