SRE WEEKLY – Page 25 – scalability, availability, incident response, automation

SRE Weekly Issue #364

lex

March 19, 2023

Articles

Infrastructure as Code is Not the Answer!

Heresy! This article provides a counterpoint to many of the benefits of IaC. While IaC may still be the right answer, it’s not a slam dunk.

Luke Shaughnessy

The Foundational Pillars of Site Reliability Engineering

Short but sweet, this article outlines three focus areas that the author argues should be a part of any SRE role.

Kyle Robertson

Failure Mitigation for Microservices: An Intro to Aperture

Way beyond just an intro to aperture, this article also covers microservice architecture failure modes, techniques used to avoid failures, and the weaknesses in those techniques.

Cong Ma and Matt Ranney — Doordash

How to put the plus in ‘staff+’ engineer

I’m including this here not just for the staff+ SREs out there. Many of these skills are important for SREs to develop much earlier than the Staff level, since our role can be so collaborative.

Ryn Daniels — GitHub

Managing Risk as an SRE

I love that fully half of this article is about mentoring developing SREs in identifying and managing risk.

Ross Brodbeck

How We Define SRE Work, as a Team

Learn how the Honeycomb SRE team has structured its work, including a fully copy of the team charter.

Fred Hebert — Honeycomb
Full disclosure: Honeycomb is my employer and I am a member of the SRE team described in this article.

How to Engineer your Technical Debt Response

An intriguing approach: define technical debt as a risk, and manage it in much the same way that we handle reliability-related risks, with a “threat budget”.

Jason Bloomberg — Intellyx

Making peace with the imperfect nature of mental models

Instead, because our time and attention is limited, we have to get good at identifying cues to indicate that our models have gotten stale or are incorrect.

Lorin Hochstein

FIFO vs. LIFO: Which Queueing Strategy Is Better for Availability and Latency?

Using a simulation, this article comes to the conclusion that a hybrid between FIFO and LIFO is better than picking just one.

Eugene Retunsky — DZone

SRE Weekly Issue #363

lex

March 12, 2023

General

Comments

View on sreweekly.com

Articles

Oncall Compensation for Software Engineers

A super in-depth look at on-call compensation strategies. Includes a sampling of companies and how much they pay (if anything).

Gergely Orosz — The Pragmatic Engineer

Husky: Exactly Once Ingestion and Multi Tenancy at Scale

Husky uses a nifty sharding strategy where a customer’s shard allocation changes over time automatically based on load.

Daniel Intskirveli — Datadog

Your codebase is a Jenga tower

This analogy goes far enough to even include rules. Anyone up for a round?

Robert Ross

SRE Evangelist

[…] in order to be truly great at being an SRE you will constantly need to understand how to work with people in the organization, how to set expectations and how to move the needle on people’s understanding of reliability.

Ross Brodbeck

How Discord Stores Trillions of Messages

MongoDB -> Cassandra -> ScyllaDB. Storing a ton of stuff is hard.

Bo Ingram — Discord

Keeping humans in the loop

When designing complex technical systems, you should ask yourself, “how does the human operator fit into the picture”.

Cursed Quail

Conference recap: Learning From Incidents (LFI) 2023

It sounds like it was a great conference!

Paige Cruz — Chronosphere

Caring for Complex Systems: We Can Do This

[…] complex systems don’t yield to analysis. We have to add another skill: sense-making.

Jessica Kerr — Honeycomb
Full disclosure: Honeycomb is my employer.

SRE Weekly Issue #362

lex

March 5, 2023

General

Comments

View on sreweekly.com

Articles

Keep your AI claims in check

You might wonder why I have given almost zero coverage to “AIOps” here, and why my coverage of “anomaly detection” has included heavy skepticism. The reason: I simply haven’t seen any proof that it works.

The FTC’s recent stance on AI sums up my position nicely. If you want your AIOps product covered here, don’t just tell me it works, prove to me that it works.

Michael Atleson — Federal Trade Commission

Taking the fear out of migrations

How? With a safe and repeatable procedure for database migrations involving double-writing.

Lisa Karlin Curtis — incident.io

Scaling Microservices Alerting With Zero Ops.

Push to main on a new microservice repo and it deploys to production, spins up a slack channel for alerts, invites the CODEOWNERS, creates an on-call rotation, and puts them in it. Wow!

Kiselev Ivan — Better Programming

Incident Report: Google Cloud Platform incident on February 27

A routing issue caused widespread packet loss with worldwide impact across many services.

Google

GitHub Availability Report: February 2023

This month’s report had a couple of fascinating incidents, especially the one about source code archive hashes.

Jakub Oleksy — GitHub

Chaos Data Engineering Manifesto

Folks from the New York Times used chaos engineering to prepare for the surge of traffic during the US’s presidential election. They share 5 guidelines for effective chaos engineering for big data systems.

Shane Murray — Monte Carlo

Recapping LFI Conf 2023

Here’s that LFI Conf recap I wanted!

Vanessa Huerta Granda — Jeli

SRE in the Real World

Former Google folks published this guide to help recently laid-off Google SREs integrate with the way SRE is done in the rest of the tech world. There’s an interesting hint about Google’s on-call compensation that I’m going to have to look into.

Murali Suriar and Niall Murphy

At the Edge of Endurance: The crash of American International Airways flight 808

A normally conscientious airline captain made a decision he normally would not have, likely owing to severe sleep deprivation.

Admiral Cloudberg

SRE Weekly Issue #361

lex

February 26, 2023

General

Comments

View on sreweekly.com

I’m having some serious FOMO from having missed out on the Learning From Incidents conference. If you post or see any write-ups, please send them my way!

Articles

Health Checking

An in-depth explanation of health checking, including the importance of failing open to avoid a metastable cascading failure.

Srinavas — eightnoteight

Exploring the Architecture of Amazon SQS

SQS (Amazon’s Simple Queue Service) is hugely scalable, but you must design your system with its limitations and behaviors in mind.

Satrajit Basu — DZone

Chris’s Wiki :: blog/web/SingleSignOnVsAvailability

What if your SSO provider is down? This article describes a scheme for falling back to HTTP Basic Authentication in an emergency.

Chris Siebenmann

Scaling Etsy Payments with Vitess: Part 1 – The Data Model

Etsy scaled their database by transitioning to a sharding strategy using Vitess. The journey was long and involved some tricky gotchas, as explained in this 3-part series.

River Rainne and Amy Ciavolino — Etsy

Consistent hashing algorithm

An in-depth explanation of consistent hashing with a special focus on building a case for why other sharding mechanisms fall short.

Nk — High Scalability

Hodor: Overload scenarios and the evolution of their detection and handling

LinkedIn chronicles their recent improvements to HODOR (the Holistic Overload Detection and Overload Remediation) including new kinds of overload detectors.

Abhishek Gilra, Nizar Mankulangara, Salil Kanitkar, and Vivek Deshpande — LinkedIn

The Captain’s Gambit: The crash of Allegheny Airlines flight 485

An airline that gave monetary rewards for early arrivals and a steep cockpit authority gradient were just two of the factors that contributed to this crash.

Admiral Cloudberg

SRE Weekly Issue #360

lex

February 19, 2023

General

Comments

View on sreweekly.com

Articles

Overworked and Underpaid: The crash of TransAsia Airways flight 222

Another case of “pilot error” vs “systemic problems”. It’s interesting to me how the organizational pressures the pilots were facing mirror many stories I’ve seen in tech firms, especially startups.

Admiral Cloudberg

Incident travel time

This article recommends improving MTTA (mean time to assemble) by modeling our dispatch systems on the emergency services for a large city.

Robert Ross

Our 2023 Site Reliability Engineering Wish List

Lots of great stuff to aspire to, with a big emphasis on observability.

Adriana Villela and Ana Margarita Medina — The New Stack
Full disclosure: Honeycomb, my employer, is mentioned.

Move past incident response to reliability

I really love the concept of “incident legalism” introduced in this article. I’ve definitely been there.

The high cost of low ambiguity

Anyone who has coordinated over Slack during the incident has felt the pain of the ambiguity of Slack messages.

But communicating with specificity has a cost.

Lorin Hochstein

Spotify Engineering Incident Report: Spotify Outage on January 14, 2023 Infrastructure

I remember this one! I was trying to listen to music at the time. Turns out it was DNS (and a git repo).

Erik Lindblad — Spotify

Good category, bad category (or: tag, don’t bucket)

If you’re gonna group your incidents, use tags, not exclusive groups.

Lorin Hochstein

SRE Weekly Issue #364

Articles

SRE Weekly Issue #363

Articles

SRE Weekly Issue #362

Articles

SRE Weekly Issue #361

Articles

SRE Weekly Issue #360

Articles

Subscribe

RSS

Mastodon

Search Issues

A message from our sponsor, Rootly:

Articles

A message from our sponsor, Rootly:

Articles

A message from our sponsor, Rootly:

Articles

A message from our sponsor, Rootly:

Articles

A message from our sponsor, Rootly:

Articles

Subscribe

RSS

Mastodon

Search Issues