SRE Weekly Issue #258

A message from our sponsor, StackHawk:

On February 25 at 10 am PT we are going to show you how easy it is to add application security testing to a #GitLab pipeline. Save your spot for our live session.


When acting as a retrospective facilitator, there’s a huge potential to color the discussion with our words and actions.

You’re there to position other folks to learn, not wear the badge.

Will Gallego

upgundecha/howtheysre: A curated collection of publicly available resources on how technology and tech-savvy organizations around the world practice Site Reliability Engineering (SRE)

A huge thanks to the curator for the many awesome links in this repo! Some have been featured here in previous issues, and some are new to me. As I go through those, I’ll share my favorites here and tell you why I think you should read them.

Unmesh Gundecha

In this article, we discuss the concepts of dependability and fault tolerance in detail and explain how the Ably platform is designed with fault-tolerant approaches to uphold its dependability guarantees.

Paddy Byers — Ably

More details on the Notion outage mentioned here last week. Complaints about phishing content posted by a Notion user led Notion's registrar to pull the company's domain name out of DNS.

Peter Judge — Datacenter Dynamics

Google has three guiding principles for improving resiliency:

  • Create maximum observability of the overall system
  • Design for effectiveness, not perfection
  • Learn and iterate as you go

Will Grannis — Google

This is an awesome guide to writing a production-ready checklist — and why you’d want one.

Emily Arnott — Blameless

Facebook found that the later a regression is discovered, the longer it takes to deploy a fix. With a combination of heuristics and machine learning, they're detecting regressions earlier and bringing them to the attention of folks who can fix them.

Jian Zhang and Brian Keller — Facebook


Updated: February 21, 2021 — 8:37 pm
A production of Tinker Tinker Tinker, LLC