SRE Weekly Issue #259

A message from our sponsor, StackHawk:

Mark your calendars! The first conference for OWASP ZAP users is taking place March 9. Get your free ticket to connect with other ZAP users and learn about the project’s roadmap


This quarter’s Increment issue is about Reliability, and I haven’t had this much fun since their first issue about on-call. I’ll include a few of the articles here and more in later issues as I have a chance to review them.


Accepting that imperfect things still work is fundamental to preventing failures from becoming catastrophes.

Understanding that no system is without errors is critical to building resilient systems.

Heidi Waterhouse

The very first sentence sets the tone, and I love it:

Resilience is a process: something you must actively perform, not something you check off a list once.

Ryn Daniels

Most of all, having an incident commander only works if everyone believes in the role. Someone stepping in to address a crisis and saying “I’m Batman” doesn’t help unless people have bought into the idea of Batman.

The next time I’m incident commander, I am totally going to jump in and say, “I’m Batman!”.

This article is a great primer on what an IC is and how to adopt incident command at your organization.

Tanya Reilly

After reading this blog post, you will have an understanding of the retry pattern used in microservices architecture, why it should be used, a few considerations while using the retry pattern, and how to use it in Python.

I love the W. C. Fields quote.

Anand Prashant

It’s that time again! Be sure to fill out the survey, not only so they can gather useful data, but also because Catchpoint will donate $5 to charity.

DevOps Institute, Catchpoint, and VMWare Tanzu

When considering the value of a QA test, SLIs can provide very valuable context.

SRE and QA can work hand in hand.

Emily Arnott — Blameless

This kind of thing keeps me up at night. Silent data corruption can destroy your reliability just as quickly as a backhoe on a non-redundant link.

Harish Dattatraya Dixit — Facebook

Etsy experienced years of growth practically overnight in 2020 as quarantines set in. Here’s how they handled it.

Mike Adler — Etsy


SRE Weekly Issue #258

A message from our sponsor, StackHawk:

On February 25 at 10 am PT we are going to show you how easy it is to add application security testing to a #GitLab pipeline. Save your spot for our live session


When acting as a retrospective facilitator, there’s a huge potential to color the discussion with our words and actions.

You’re there to position other folks to learn, not wear the badge.

Will Gallego

upgundecha/howtheysre: A curated collection of publicly available resources on how technology and tech-savvy organizations around the world practice Site Reliability Engineering (SRE)

A huge thanks to the curator for the many awesome links in this repo! Some have been featured here in previous issues, and some are new to me. As I go through those, I’ll share my favorites here and tell you why I think you should read them.

Unmesh Gundecha

In this article, we discuss the concepts of dependability and fault tolerance in detail and explain how the Ably platform is designed with fault tolerant approaches to uphold its dependability guarantees.

Paddy Byers — Ably

More details on the Notion outage mentioned here last week. Complaints of phishing by a Notion user resulted in their registrar pulling their domain name out of DNS.

Peter Judge — Datacenter Dynamics

Google has three guiding principles for improving resiliency:

  • Create maximum observability of the overall system
  • Design for effectiveness, not perfection
  • Learn and iterate as you go

Will Grannis — Google

This is an awesome guide to writing a production-ready checklist — and why you’d want one.

Emily Arnott — Blameless

Facebook found that as a regression is discovered later, it will take much longer to deploy a fix. With a combination of heuristics and machine learning, they’re detecting regressions earlier and bringing them to the attention of folks that can fix them.

Jian Zhang and Brian Keller — Facebook


SRE Weekly Issue #257

A message from our sponsor, StackHawk:

Keeping your APIs secure requires thoughtful design and testing. Learn how to protect your REST, SOAP and GraphQL APIs from security vulnerabilities with StackHawk


This one really got me thinking. Make sure you document why an alert exists, not just what it checks for.

Chris Siebenmann

If you start with a monolith and adopt a microservice architecture, your incident response process will need to change as well.

Mya Pitzeruse — effx

Another one that needs a disclaimer: there’s no single “root cause” for an incident, and this article is not about that. This is about using statistical software to aid humans in debugging by looking at the activities performed by different users before they encounter a given bug.

Vijay Murali, Edward Yao, Umang Mathur, Satish Chandra — Facebook

A new SRE at Honeycomb shares insight on the job and SRE attitudes in general.

Fred Hebert — Honeycomb

This post considers the January 4th Slack outage as a set of cases of saturation.

Lorin Hochstein


SRE Weekly Issue #256

A message from our sponsor, StackHawk:

Register now for the first-ever ZAPCon taking place March 9th. The free event will focus on OWASP ZAP and application security best practices. You wont want to miss it!


Here’s a blog post from Slack giving even more information about what went wrong on January 4. Bravo, Slack, there’s a lot in here for us to learn from.

Laura Nolan — Slack

This academic paper from Facebook explains how they release code without disrupting active connections, even for a small number of users.

Usama Naseer, Luca Niccolini, Udip Pant, Alan Frindell, Ranjeeth Dasineni, and Theophilus A. Benson — Facebook

Another lesson we can learn from aviation: have one place where engineers can find out about temporary infrastructure changes that are important.

Bill Duncan

Coinbase posted this detailed analysis of their January 29th incident.


Interesting thesis: a company moving into the cloud is in a unique position to adopt SRE practices — and better situated than cloud-first companies.

Tina Huang (CTO, Transposit) — Forbes

We need to push past surface-level mitigation of an incident and really dig in and learn.

Darrell Pappa — Blameless

GitHub’s database failed in a manner that wasn’t detected by their automated failover system.

Keith Ballinger — GitHub

LinkedIn published their SRE training documentation in the form of a full curriculum covering a range of topics.

Akbar KM and Kalyanasundaram Somasundaram — LinkedIn

Your code may be designed to handle 64-bit integers, but what if a library (such as a JSON decoder) converts them to floating point numbers?



SRE Weekly Issue #255

A message from our sponsor, StackHawk:

With StackHawk’s new GitHub Action, you can integrate AppSec testing directly into your GitHub CI/CD pipeline. See how:


It really should! Even Google is much more accurately described as a “service” than a “site”.

Chris Riley — Splunk

There are migrations, and then there’s the time between migrations.

Will Larson

2020 was the year mainstream folks realized how important reliability is. Will overall reliability improve in 2021?

Robert Ross — FireHydrant

I love this for the click-bait title and the content. An HAProxy feature designed for HA had a surprising an unexpected behavior.

Andre Newman — GitLab

Twilio builds customer trust through a reliability culture, customer empathy, and accountability.

Andre Newman — Gremlin

This WTFinar tackles the beginning of understanding SRE. It focuses on service level indicators (SLIs) and service level objectives (SLOs) – components of error budgets.

Container Solutions


SRE WEEKLY © 2015 Frontier Theme