When acting as a retrospective facilitator, there’s a huge potential to color the discussion with our words and actions.
You’re there to position other folks to learn, not wear the badge.
upgundecha/howtheysre: A curated collection of publicly available resources on how technology and tech-savvy organizations around the world practice Site Reliability Engineering (SRE)
A huge thanks to the curator for the many awesome links in this repo! Some have been featured here in previous issues, and some are new to me. As I go through those, I’ll share my favorites here and tell you why I think you should read them.
In this article, we discuss the concepts of dependability and fault tolerance in detail and explain how the Ably platform is designed with fault-tolerant approaches to uphold its dependability guarantees.
Paddy Byers — Ably
More details on the Notion outage mentioned here last week. Complaints about phishing by a Notion user led Notion's registrar to pull the company's domain name out of DNS.
Peter Judge — Datacenter Dynamics
Google has three guiding principles for improving resiliency:
- Create maximum observability of the overall system
- Design for effectiveness, not perfection
- Learn and iterate as you go
Will Grannis — Google
This is an awesome guide to writing a production-ready checklist — and why you’d want one.
Emily Arnott — Blameless
Facebook found that the later a regression is discovered, the longer it takes to deploy a fix. With a combination of heuristics and machine learning, they’re detecting regressions earlier and bringing them to the attention of folks who can fix them.
Jian Zhang and Brian Keller — Facebook