SRE Weekly Issue #391

Articles

Operating complex systems is about creating accurate mental models, and abstractions are a key ingredient.

Code Reliant

Why is it hard to get an organization to focus on LFI (learning from incidents) rather than RCA (root cause analysis)? Here’s a really great explanation.

Lorin Hochstein

The Iceberg of Engineering Incident Costs

The cost of an incident goes beyond money: it also includes engineer morale, slowed innovation, and lost customers.

Aaron Lober — Blameless

CAP Theorem Explained: Distributed Systems Series

A great primer on the CAP theorem with a real-world example scenario.

Lohith Chittineni

How Waiting Room makes queueing decisions on Cloudflare’s highly distributed network

It’s really interesting to see how they handled distributed queuing and throttling across a highly distributed cache network without sacrificing speed.

George Thomas — Cloudflare

LLMs Demand Observability-Driven Development

[…] LLMs are black boxes that produce nondeterministic outputs and cannot be debugged or tested using traditional software engineering techniques. Hooking these black boxes up to production introduces reliability and predictability problems that can be terrifying.

Charity Majors — Honeycomb
Full disclosure: Honeycomb is my employer.

Feedback: I try to answer “how to become a systems engineer”

Dig into and understand how enough things work, and eventually you’ll look like a wizard.

Rachel By the Bay

Don’t trust default timeouts

As a rule of thumb, always set timeouts when making network calls. And if you build libraries, always set reasonable default timeouts and make them configurable for your clients.

Roberto Vitillo
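
To make the rule of thumb above concrete, here is a minimal sketch (mine, not taken from the linked article) of a small client wrapper that always passes an explicit timeout and lets callers override it. The class name `HttpClient` and the 5-second default are hypothetical choices for illustration; only `urllib.request.urlopen` and its `timeout` parameter come from the Python standard library.

```python
import urllib.request

# Hypothetical default for this sketch; pick a value that fits your service.
DEFAULT_TIMEOUT_SECONDS = 5.0


class HttpClient:
    """Tiny illustrative wrapper that never makes a call without a timeout."""

    def __init__(self, timeout=DEFAULT_TIMEOUT_SECONDS):
        # Reject None and non-positive values so a caller cannot
        # accidentally fall back to "wait forever".
        if timeout is None or timeout <= 0:
            raise ValueError("timeout must be a positive number of seconds")
        self.timeout = timeout

    def get(self, url):
        # The timeout applies to blocking socket operations such as
        # connecting and reading; it is not a deadline for the whole request.
        with urllib.request.urlopen(url, timeout=self.timeout) as response:
            return response.read()
```

A caller who is happy with the default writes `HttpClient()`, while a latency-sensitive caller can write `HttpClient(timeout=0.5)`. Note that the timeout here bounds each blocking socket operation, so a caller who needs an overall deadline would have to enforce that separately.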
