SRE Weekly Issue #258

Articles

When acting as a retrospective facilitator, there’s a huge potential to color the discussion with our words and actions.

You’re there to position other folks to learn, not wear the badge.

Will Gallego

GitHub Repo: upgundecha/howtheysre

upgundecha/howtheysre: A curated collection of publicly available resources on how technology and tech-savvy organizations around the world practice Site Reliability Engineering (SRE)

A huge thanks to the curator for the many awesome links in this repo! Some have been featured here in previous issues, and some are new to me. As I go through those, I’ll share my favorites here and tell you why I think you should read them.

Unmesh Gundecha

Engineering dependability and fault tolerance in a distributed system

In this article, we discuss the concepts of dependability and fault tolerance in detail and explain how the Ably platform is designed with fault-tolerant approaches to uphold its dependability guarantees.

Paddy Byers — Ably

Phishing complaints cause Notion outage

More details on the Notion outage mentioned here last week. Complaints about phishing by a Notion user led Notion’s registrar to pull the company’s domain name out of DNS.

Peter Judge — Datacenter Dynamics

What Is True Resilience? (Hint: It’s Not About Managing Risk)

Google has three guiding principles for improving resiliency:

Create maximum observability of the overall system

Design for effectiveness, not perfection

Learn and iterate as you go

Will Grannis — Google

4 Things you Need to Know about Writing Better Production Readiness Checklists

This is an awesome guide to writing a production readiness checklist, and to why you’d want one.

Emily Arnott — Blameless

Fix Fast for finding and fixing regressions

Facebook found that the later a regression is discovered, the longer it takes to deploy a fix. Using a combination of heuristics and machine learning, they’re detecting regressions earlier and bringing them to the attention of folks who can fix them.

Jian Zhang and Brian Keller — Facebook

Outages

Google Voice
Kia
- Kia had an outage in the internet-enabled features of some of their cars.
Disney+
Microsoft Teams
