General

SRE Weekly Issue #257

lex

February 14, 2021


Articles

Sometimes alerts have inobvious reasons for existing

This one really got me thinking. Make sure you document why an alert exists, not just what it checks for.

Chris Siebenmann

Incident response from monolith to microservices

If you start with a monolith and adopt a microservice architecture, your incident response process will need to change as well.

Mya Pitzeruse — effx

Minesweeper automates root cause analysis as a first-line defense against bugs

Another one that needs a disclaimer: there’s no single “root cause” for an incident, and this article is not about that. This is about using statistical software to aid humans in debugging by looking at the activities performed by different users before they encounter a given bug.

Vijay Murali, Edward Yao, Umang Mathur, Satish Chandra — Facebook

On Not Being a Cog in the Machine

A new SRE at Honeycomb shares insight on the job and SRE attitudes in general.

Fred Hebert — Honeycomb

Slack’s Jan 2021 outage: a tale of saturation

This post considers the January 4th Slack outage as a set of cases of saturation.

Lorin Hochstein

Outages

SRE Weekly Issue #256

lex

February 7, 2021


Articles

Slack’s Outage on January 4th 2021

Here’s a blog post from Slack giving even more information about what went wrong on January 4. Bravo, Slack, there’s a lot in here for us to learn from.

Laura Nolan — Slack

Zero Downtime Release: Disruption-free Load Balancing of a Multi-Billion User Website

This academic paper from Facebook explains how they release code without disrupting active connections, even for a small number of users.

Usama Naseer, Luca Niccolini, Udip Pant, Alan Frindell, Ranjeeth Dasineni, and Theophilus A. Benson — Facebook

NOTAM for SREs

Another lesson we can learn from aviation: have one place where engineers can find out about temporary infrastructure changes that are important.

Bill Duncan

Incident Post Mortem: January 29, 2021 [Coinbase]

Coinbase posted this detailed analysis of their January 29th incident.

Coinbase

Council Post: How Cloud Services Platform Teams Can Drive The Adoption Of Effective SRE Practices

Interesting thesis: a company moving into the cloud is in a unique position to adopt SRE practices — and better situated than cloud-first companies.

Tina Huang (CTO, Transposit) — Forbes

“I’m Just Doing my Job,” An SRE Myth

We need to push past surface-level mitigation of an incident and really dig in and learn.

Darrell Pappa — Blameless

GitHub Availability Report: January 2021

GitHub’s database failed in a manner that wasn’t detected by their automated failover system.

Keith Ballinger — GitHub

Open source update: School of SRE

LinkedIn published their SRE training documentation in the form of a full curriculum covering a range of topics.

Akbar KM and Kalyanasundaram Somasundaram — LinkedIn

Push some big numbers through your system and look for bugs

Your code may be designed to handle 64-bit integers, but what if a library (such as a JSON decoder) converts them to floating point numbers?

rachelbythebay
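
To make the failure mode concrete, here is a minimal Python sketch of what happens when a decoder silently turns a large integer into a double. It is an illustration under stated assumptions, not code from the linked post: `parse_int=float` is used only to imitate a JSON library that decodes every number as floating point, and the `id` field is a hypothetical example.

```python
import json

# 2**53 + 1 is the smallest positive integer a 64-bit float cannot represent
# exactly, yet it fits comfortably in a signed 64-bit integer.
big_id = 2**53 + 1  # 9007199254740993

payload = json.dumps({"id": big_id})

# Python's json module keeps integers exact by default.
exact = json.loads(payload)["id"]

# Assumption for illustration: parse_int=float stands in for a decoder
# that converts every JSON number to a double.
lossy = json.loads(payload, parse_int=float)["id"]

print(exact == big_id)       # True
print(int(lossy) == big_id)  # False
print(int(lossy))            # 9007199254740992
```

Pushing values like this through every hop of a pipeline is one way to find where an ID or counter quietly loses precision.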

Outages

SRE Weekly Issue #255

lex

January 31, 2021


Articles

Why It Should Be Service, Not Site Reliability

It really should! Even Google is much more accurately described as a “service” than a “site”.

Chris Riley — Splunk

Migrations: the sole scalable fix to tech debt.

There are migrations, and then there’s the time between migrations.

Will Larson

2021 is the Year of Reliability

2020 was the year mainstream folks realized how important reliability is. Will overall reliability improve in 2021?

Robert Ross — FireHydrant

This SRE attempted to roll out an HAProxy config change. You won’t believe what happened next…

I love this for the click-bait title and the content. An HAProxy feature designed for HA had a surprising and unexpected behavior.

Andre Newman — GitLab

Tyler Wells on building a culture of reliability at Twilio

Twilio builds customer trust through a reliability culture, customer empathy, and accountability.

Andre Newman — Gremlin

WTF is SRE WTFinar

This WTFinar tackles the beginning of understanding SRE. It focuses on service level indicators (SLIs) and service level objectives (SLOs) – components of error budgets.

Container Solutions

Outages

Robinhood
Disney+
Public Access to Court Electronic Records (PACER)
- There’s speculation that PACER struggled under the onslaught of people wanting to read the lawsuits against Robinhood related to GameStop.
Coinbase
Kraken
reddit
- Reddit had several small-to-medium outages.

SRE Weekly Issue #254

lex

January 24, 2021


Articles

Coinbase Incident Post Mortem: January 6–7, 2021

This one’s juicy. At one point, the front-end was blocked up, so the back-end saw less traffic and scaled down. Then when the traffic came flooding back, the back-end was ill-prepared. We can all learn from this.

Coinbase

Soar: Simulation for Observability, reliAbility, and secuRity

Cloudflare has what amounts to a sophisticated staging environment for testing new code.

Yan Zhai — Cloudflare

Failing to make progress under excess request load

Sometimes rolling back doesn’t actually get you back to a good state, especially when there’s pent-up demand.

Rachel By the Bay

Google Cloud Issue Summary — Google Meet — 2021-01-08

Here’s Google’s follow-up on a Google Meet outage earlier this month.

Google

The Next Gen Database Servers Powering Let’s Encrypt

Those are some seriously big database servers.

Josh Aas and James Renken — Let’s Encrypt

Incident Management in 2021: from Basics to Best Practices

A great general overview of all aspects of incident response, including definitions and best practices.

Better Uptime

Using GPT-3 for plain language incident root cause from logs

Check out what happens when you unleash a generalized language model AI on some log messages related to an incident.

Larry Lancaster — Zebrium

Taming Operational Load with VMware CRE

The CRE team at VMware undertook a project to find and reduce toil. Note that “with VMware CRE” does not mean “with some product named VMware CRE™”.

Gustavo Franco — VMware

Slack RCA for outage on January 4, 2021

This is Slack’s RCA for their outage earlier this month. This is a great example of a complex incident with many contributing factors — certainly no single “root cause” here.

Slack

Outages

SRE Weekly Issue #253

lex

January 17, 2021


Articles

May 30 SSL incident

TLS can be such a headache.

This was an interesting situation. There was a valid path to the USERTrust RSA Certification Authority, and there was also an expired path. The browser was able to find the valid chain, but curl was not able to find it.

Adam Surak — Algolia

Shifting Modes: Creating a Program to Support Sustained Resilience

A well-researched article on shifting emphasis from incident prevention to learning and resilience.

Incidents cannot be prevented, because incidents are the inevitable result of success.

Alex Elman

Error budgets and the legacy of Herbert Heinrich

This one’s worth reading through twice to let it sink in. It puts me in mind of this article by Will Gallego, which is another thoughtful critique of error budgets.

Here are the claims I’m going to make:

Large incidents are much more costly to organizations than small ones, so we should work to reduce the risk of large incidents.

Error budgets don’t help reduce risk of large incidents.

Lorin Hochstein

97 things every SRE should know – Part 01

This is a review of a few of the chapters of the book of the same title by Emil Stolarsky and Jaime Woo.

Have you read it too? I’d love to read your take on it!

Dean Wilson

Understanding Incidents: Three Analytical Traps

This one’s worth reading the next time you need to do an incident retrospective. The traps are:

Counterfactual reasoning

Normative language

Mechanistic reasoning

John Allspaw — Adaptive Capacity Labs

This Is the Most Underappreciated Skill for SREs

The skill in question is glue work, and I sure appreciate a good gluer when I see one.

Emily Arnott — Blameless

Building and Scaling Your SRE Team

This one starts out by defining SRE, then goes into how to define your team and fill it with people.

Julie Gunderson — PagerDuty

Outages

Fastly
- Fastly is my employer.
Slack
Tyro Payments
Signal
.ke TLD (Kenya)
Microsoft Teams, Office 365 and OneDrive
Instagram
