General

SRE Weekly Issue #393

lex

October 8, 2023

GitHub – teivah/sre-roadmap: An Opinionated Roadmap to Become an SRE (Concepts > Tools)

This repo contains a path to learn SRE, in the form of a list of concepts to familiarize oneself with.

Teiva Harsanyi

Is a $1 million Splunk bill worth it?

How can we justify the (sometimes significant) expense of instilling observability into our systems?

Nočnica Mellifera — SigNoz

1.1.1.1 lookup failures on October 4th, 2023

It was DNS. Cloudflare’s 1.1.1.1 recursive DNS service failed this week, stemming from failure to parse the new ZONEMD record type.

Ólafur Guðmundsson — Cloudflare

CAP Theorem: Use It to Choose an Open Source Database

Rather than just dry theory, this article helps you understand what the CAP theorem means in practice as you choose a data store.

Note: this link was 504ing at time of publishing, so here’s the archive.org copy.

Bala Kalavala — Open Source For U

Whose fault was it anyway? On blameless post-mortems

A “blameless” culture can get in the way if it means you’re not allowed to make any mention of who was at the pointy-end of your system when things blew up.

incident.io

Building Resilience in the Face of Disruption: LinkedIn’s Journey to ISO 22301 Certification

In this post, we will share how we formalized the LinkedIn Business Continuity & Resilience Program, how this new program helped increase our customers’ confidence in our operations, and the lessons that we learned as we attained ISO 22301 certification.

Chau Vu — LinkedIn

SRE Interview Prep Plan | Week 1

This is the start of a 6-article series, with each going through one week along a path to prepare for SRE interviews.

We’ll spend each week focusing on building up your expertise in the key areas SREs need to know, like automation, monitoring, incident response, etc.

Code Reliant

PACELC Theorem Explained: Distributed Systems Series

Beyond the CAP theorem, what actually happens during a partition?

“if there is a partition (P), how does the system trade off availability and consistency (A and C); else (E), when the system is running normally in the absence of partitions, how does the system trade off latency and consistency (L and C)” [Daniel J. Abadi]

Lohith Chittineni

SRE Weekly Issue #392

lex

October 1, 2023

Why embracing complexity is the real challenge in software today

In the midst of industry discussions about productivity and automation, it’s all too easy to overlook the importance of properly reckoning with complexity.

There’s a cool bit in there about redistributing complexity rather than simply getting rid of it, using microservices as an example.

Ken Mugrage — Thoughtworks — MIT Technology Review

Creating The Conditions to Learn From Incidents

Interesting idea: if we go too far toward making incident investigations comfortable and routine, we can make learning less likely.

Dane Hillard — Jeli

If P99 Latency Is BS, What’s the Alternative?

A problem with P99 is that 1% of your customers have a worse experience, and P99 doesn’t capture how much worse.

Cynthia Dunlop — The New Stack
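To make the P99 point concrete, here is a minimal sketch in Python. The two latency sets are invented for illustration and do not come from the article; `percentile` and `worst_tail_mean` are hypothetical helper names. Both services report the same P99, yet the slowest 1% of requests differ by a factor of 25.

```python
def percentile(samples, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    ordered = sorted(samples)
    rank = (pct * len(ordered) + 99) // 100  # integer ceiling of pct% of n
    return ordered[max(rank, 1) - 1]


def worst_tail_mean(samples, pct):
    """Mean of the samples that fall above the nearest-rank percentile."""
    ordered = sorted(samples)
    rank = (pct * len(ordered) + 99) // 100
    tail = ordered[rank:]
    return sum(tail) / len(tail)


# Made-up latencies in milliseconds, 1000 requests per service.
service_a = [50] * 980 + [200] * 20
service_b = [50] * 980 + [200] * 10 + [5000] * 10

for name, samples in (("A", service_a), ("B", service_b)):
    print(name, "P99:", percentile(samples, 99),
          "mean of slowest 1%:", worst_tail_mean(samples, 99))
```

Reporting a tail summary (or the maximum) next to P99 is one way to show how bad the worst requests are; the article may recommend something different.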

How to Operate Reliable AWS Lambda Applications in Production

Lambda isn’t “NoOps”, it’s just a different flavor of ops.

Ernesto Marquez — Concurrency Labs

[Salesforce] Service Disruption on Multiple Clouds on September 20, 2023

Salesforce had a major outage earlier this month, and now they’ve posted this followup analysis.

Salesforce

Editing stuff in prod

This sysadmin story is a lesson in understanding the full context before passing judgement.

rachelbythebay

Better learning from incidents: A guide to post-mortem documents

Things get interesting toward the end, where they warn that focusing too narrowly on learning from incidents can cause problems.

Luis Gonzalez — incident.io

The Fail Fast Principle

The fail fast pattern is highly relevant for building reliable distributed systems. Rapid error detection and failure propagation prevents localized issues from cascading across system components.

Code Reliant
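One common form of failing fast is validating configuration at startup, so a bad setting stops the process immediately instead of surfacing later as a confusing downstream error. The sketch below assumes a hypothetical service with two made-up settings, `upstream_url` and `timeout_seconds`; it is an illustration of the idea, not code from the article.

```python
REQUIRED_KEYS = ("upstream_url", "timeout_seconds")  # hypothetical settings


class ConfigError(ValueError):
    """Raised at startup when the configuration cannot be used."""


def load_config(raw: dict) -> dict:
    """Validate settings up front and raise immediately on the first problem."""
    missing = [key for key in REQUIRED_KEYS if key not in raw]
    if missing:
        raise ConfigError(f"missing required settings: {', '.join(missing)}")

    timeout = raw["timeout_seconds"]
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(timeout, bool) or not isinstance(timeout, (int, float)) or timeout <= 0:
        raise ConfigError(f"timeout_seconds must be a positive number, got {timeout!r}")

    return {"upstream_url": raw["upstream_url"], "timeout_seconds": float(timeout)}
```

Calling `load_config` once during startup means an invalid deployment never begins serving traffic, which keeps the failure local and easy to diagnose.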

SRE Weekly Issue #391

lex

September 24, 2023

Articles

Abstraction as a Reliability Tool

Operating complex systems is about creating accurate mental models, and abstractions are a key ingredient.

Code Reliant

Why LFI is a tough sell

Why is it hard to get an organization to focus on LFI (learning from incidents) rather than RCA (root cause analysis)? Here’s a really great explanation.

Lorin Hochstein

The Iceberg of Engineering Incident Costs

It’s about more than just money — like engineer morale, slowed innovation, and lost customers.

Aaron Lober — Blameless

CAP Theorem Explained: Distributed Systems Series

A great primer on the CAP theorem with a real-world example scenario.

Lohith Chittineni

How Waiting Room makes queueing decisions on Cloudflare’s highly distributed network

It’s really interesting to see how they handled distributed queuing and throttling across a highly distributed cache network without sacrificing speed.

George Thomas — Cloudflare

LLMs Demand Observability-Driven Development

[…] LLMs are black boxes that produce nondeterministic outputs and cannot be debugged or tested using traditional software engineering techniques. Hooking these black boxes up to production introduces reliability and predictability problems that can be terrifying.

Charity Majors — Honeycomb
Full disclosure: Honeycomb is my employer.

Feedback: I try to answer “how to become a systems engineer”

Dig into and understand how enough things work, and eventually you’ll look like a wizard.

Rachel By the Bay

Don’t trust default timeouts

As a rule of thumb, always set timeouts when making network calls. And if you build libraries, always set reasonable default timeouts and make them configurable for your clients.

Roberto Vitillo
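As a small illustration of the rule of thumb above, the following Python sketch uses only the standard library's `socket` module. The function name `fetch_banner`, the 2-second default, and the local "silent server" demo are assumptions made for this example. The helper always passes an explicit timeout and lets callers override it, which is the pattern the quote recommends for library authors.

```python
import socket

DEFAULT_TIMEOUT_SECONDS = 2.0  # hypothetical default; tune per dependency


def fetch_banner(host: str, port: int, timeout: float = DEFAULT_TIMEOUT_SECONDS) -> bytes:
    """Connect and read up to 1 KiB, giving up after `timeout` seconds."""
    # The timeout covers the connect and stays set on the socket for recv.
    with socket.create_connection((host, port), timeout=timeout) as conn:
        return conn.recv(1024)


def demo_silent_server() -> str:
    """Start a local listener that never replies and show the client timing out."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
    listener.listen(1)
    port = listener.getsockname()[1]
    try:
        fetch_banner("127.0.0.1", port, timeout=0.2)
        return "got data"
    except socket.timeout:
        return "timed out"
    finally:
        listener.close()
```

Without the `timeout` argument, the `recv` call in this demo would block indefinitely, because the listener accepts the connection at the OS level but never sends anything.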

SRE Weekly Issue #390

lex

September 17, 2023

Many apologies to my email subscribers, who have seen two accidental re-sends of old issues recently due to a weird glitch in my automation. I think I’ve gotten a handle on it, and I’ll run an internal retrospective of this incident, of course.

Articles

SRE vs Platform Engineer: Can’t We All Just Get Along?

Is it really SRE vs platform engineer? Or is there a way platforms can take site reliability to the next level?

Jennifer Riggins — The New Stack

Our Prerequisites are Never Enough for High Risk

A surgeon delves into the key component that allows a group of skilled individuals to work effectively and safely together, using the term “heed” to describe this special interaction.

Sidenote: in a hilarious coincidence this article managed to spoil me on a movie I was in the middle of watching (Arrival) — but it also put me in a really cool mindset to watch the rest of the film.

Dr. Rob Poston

Degraded Performance: Square Services

More details on Square’s outage from a couple weeks ago (it was DNS).

Square

Azure status history

Azure had an interesting outage in its Australia East region involving a power failure and the order in which cooling units were restored.

Microsoft Azure

How Did It Make Sense at the Time? Understanding Incidents As They Occurred, Not as They Are Remembered

Asking this question is how you unlock the hidden essence of an incident. This talk compares two public incident reports to show what it looks like when you dig into this question and when you don’t.

Jacob Scott — InfoQ

The Fallible Mind: The crash of Comair flight 5191

In this air accident, the pilots made a seemingly inexplicable mistake.

This sentence really stood out to me, especially after reading the “How Did It Make Sense at the Time?” article:

When we inexplicably grab the wrong utensil when cooking or accidentally start taking our dirty dishes to the bathroom instead of the kitchen, we should be thankful that we aren’t responsible for a plane full of people.

Admiral Cloudberg

GitHub Availability Report: August 2023

There’s an interesting failure mode in this one that might stand out for the Kafka admins among us:

The Kafka consumer ended up stuck in a loop, unable to stabilize fast enough before timing out and restarting the coordination process.

Jakub Oleksy — GitHub

The connection between incident management and problem management

After explaining the difference between the ITIL terms “incident management” and “problem management”, this article goes into a discussion of recent trends and whether it still makes sense to draw a distinction between the two.

Luis Gonzalez — incident.io

SRE Weekly Issue #389

lex

September 10, 2023

Articles

Building a Successful SRE Team

Here are four of the lessons I learned that should help you build a successful SRE organization.

Focus on Developer Training

Focus on the Right Abstractions

Focus on Self Service

Automate Yourself out of a job

Sven Hans Knecht

Exploring distributed vs centralized incident command models

In this blog post, we’ll talk about two incident management structure models — distributed and centralized, including the pros and cons of each, and examples of what each structure looks like in our community.

Robert Ross — FireHydrant

Understanding the Rasmussen model for failures

The Rasmussen model conceptualizes the limits of a system along 3 boundaries: Cost, System Performance, and Human Capacity.

Nishant Modak — Last9

Accelerator Report: Leak repaired, cooling in progress

Wow, this is a really interesting incident. It has all the hallmarks of a nightmare sev1: time pressure, unknown problem, inventing new procedures on the spot, multiple different teams/specialties having to work together, etc.

Jorg Wenninger — CERN

Scheduling Oncall considering Sabbath and other frequent recurring conflicts

What do you do when many engineers all need to take the same day off each week for religious reasons?

TimeWeSp

Concerning the production order system malfunction

Toyota recently halted production in their factories due to a problem in their order system, about which they shared some interesting details.

Toyota

Being The First SRE

Here’s a guidebook on how to handle being the first SRE at a company.

Sven Hans Knecht
