General

SRE Weekly Issue #286

lex

September 5, 2021

Articles

This is a review of Marianne Bellotti’s Kill It With Fire, a book about modernizing legacy systems. It focuses heavily on operational concepts and “the system around the system”, with a strong SRE influence.

Laura Nolan — ;login:

Why every software engineering interview should include ops questions

Originally drafted in 2016, this blog post is even more relevant now. Beyond just the “why”, it has several ideas for interview questions to get you started.

Charity Majors

The power of framing a problem

Tell a good story, and you can make things happen.

As SREs, we often know what needs to be done, but convincing others is a hard-won skill.

Lorin Hochstein

Easyjet A320 tells United Boeing 787 to GO AROUND!

In this video report of a commercial aviation incident, there’s a neat discussion of resiliency toward the end. There were several other layers of protection that (probably) would have caught and prevented this incident if the A320 captain hadn’t intervened. And even though no accident occurred, there was still a “near miss” investigation.

Mentor Pilot

The Role of SREs in Observability

Although conversation about observability often ignores SREs, SREs have a central role to play in observability success.

Quentin Rousseau — Rootly

Cascading retries and the sulky applications

In a microservice architecture, having retries several levels deep can be a recipe for nastiness.

Oren Eini — RavenDB
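
As a rough illustration of why nested retries get nasty (my own sketch, not taken from the linked post), assume every layer in a call chain makes up to `retries` extra attempts and the bottom service keeps failing. The function name and its parameters are hypothetical.

```python
def worst_case_attempts(depth: int, retries: int) -> int:
    """Worst-case number of calls that reach the bottom service.

    Assumes `depth` layers of callers sit above the bottom service,
    each layer makes up to `retries` extra attempts, and every attempt fails.
    """
    if depth < 0 or retries < 0:
        raise ValueError("depth and retries must be non-negative")
    # Each layer multiplies the traffic below it by (retries + 1).
    return (retries + 1) ** depth


if __name__ == "__main__":
    for depth in range(1, 5):
        print(depth, worst_case_attempts(depth, retries=3))
```

Under these assumptions, three retries at each of three layers turn one user request into 64 calls against the failing service, which is why retry budgets are often set at a single layer.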

GitHub Availability Report: August 2021

This report has some detail on two major incidents experienced by GitHub last month.

Scott Sanders — GitHub

Outages

SRE Weekly Issue #285

lex

August 29, 2021

General

Articles

Computers are the easy part

What’s so great about this incident write-up is the way it shows how entrenched mental models hampered the incident response. There’s so much to learn here.

Ray Ashman — Mailchimp

Rethinking Best Practices

The parallels between this and the Mailchimp article are striking.

Will Gallego

How to Improve Upon Google’s Four Golden Signals of Monitoring

This includes a review of the four golden signals and presents three areas to go further.

JJ Tang — Rootly

Root cause of failure, root cause of success

This one thoughtfully discusses why “root cause” is a flawed concept, approaching the idea from multiple directions.

Lorin Hochstein

IBM PREVAIL Conference: October 19–21, 2021

Check it out, a new SRE conference! This one’s virtual and the CFP is open until October 1.

Robert Barron — IBM

Notes on the Perfidy of Dashboards

To be clear, this article is about static dashboards that just contain pre-set graphs of specific metrics.

every dashboard is an answer to some long-forgotten question

Charity Majors

What makes public posts about incidents different from analysis write-ups

Public incident posts give us useful insight into how companies analyze their incidents, but it’s important to remember that they’re almost never the same as internal incident write-ups.

John Allspaw — Adaptive Capacity Labs

Heroku Incident #2300 Follow-Up

In this incident from July 7, front-line routing hosts exceeded their file descriptor limits, causing requests to be delayed and dropped.

Heroku

TLDs — Putting the ‘.fun’ in the top of the DNS

.io, assigned to the British Indian Ocean Territory, is almost exclusively used by annoying startups for content completely unrelated to the islands.

Remember, it’s all fun and games until the random country you’ve attached your business to has an outage in their TLD DNS infrastructure.

Jan Schaumann

Why Observability Requires a Distributed Column Store

If you’re curious about just what a columnar data store is, as I was, this article is a good introduction.

Alex Vondrak — Honeycomb

Outages

SRE Weekly Issue #284

lex

August 22, 2021

General

Like last week, I prepared this week’s issue in advance, so no Outages section. Have a great week!

Articles

Alerting on SLOs like Pros

SoundCloud is very clear on the fact that they are not at Google scale. It’s interesting to see how they apply SRE principles at their scale.

Björn “Beorn” Rabenstein — SoundCloud

Distributed Troubleshooting

Here’s why Target set up their ELK stack, and how they used it to troubleshoot a problem in Elasticsearch itself.

Dan Getzke — Target

Error Budgets and their Dependencies

A key point in this article is that calculating your error budget as just “100% – SLO” goes about things backward.

Adam Hammond — Squadcast
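
To make the dependency angle concrete, here is a minimal sketch. It is my own illustration, not the article’s method, and it assumes hard, serial dependencies with independent failures. The function names are hypothetical.

```python
from math import prod
from typing import Sequence


def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over the window."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be in (0, 1]")
    return (1.0 - slo) * window_days * 24 * 60


def serial_availability(own: float, dependencies: Sequence[float]) -> float:
    """Best-case availability when the service needs every dependency up.

    Assumes independent failures and that each dependency is a hard one.
    """
    return own * prod(dependencies)


if __name__ == "__main__":
    print(error_budget_minutes(0.999))                   # naive "100% - SLO" budget
    print(serial_availability(0.9995, [0.9995, 0.9995]))  # what the stack can deliver
```

In this example a 99.9% SLO implies roughly 43 minutes of budget per 30 days, yet a service at 99.95% with two 99.95% hard dependencies tops out below 99.9%, so the budget is spent before the service itself does anything wrong.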

Capacity Planning at Scale

They periodically scale up their systems just to test and be sure they’ll be ready for big events like Black Friday / Cyber Monday.

Kathryn Tang — Shopify

How to drive ownership in microservices

In this post, we’ll focus on service ownership. Why is service ownership important? How should teams self-organize to achieve it? Where’s the best place to start?

Cortex

One, Two, Skip a Few…

This fun troubleshooting story hinges on the internal details of how PostgreSQL’s sequences work.

Pete Hamilton — incident.io

SRE Weekly Issue #283

lex

August 15, 2021

General

I’m on vacation enjoying the sunny beaches in Maine with my family, so I prepared this week’s issue in advance. No outages section, save for one big one I noticed due to direct personal experience. See you all next week!

Articles

Moving Quicksilver into production

We needed a way to deploy our new service seamlessly, and to roll back that deploy should something go wrong. Ultimately many, many things did go wrong, and every bit of failure tolerance put into the system proved to be worth its weight in gold because none of this was visible to customers.

Geoffrey Plouviez — Cloudflare

The Secret of Communicating Incident Retrospectives

I especially like the idea of tailoring retrospective documents to disparate audiences — you may have more than you realize.

Emily Arnott — Blameless

Demystifying Site Outages

An analysis of two incidents from the venerable John Allspaw. These are from 2012, back when he was at Etsy, and yet there’s still a ton we can learn now by reading them.

John Allspaw — Etsy

Why We Swear by the RCA

An account of how Gojek responds to production issues, and why the RCA is a critical part of the process.

Sooraj Rajmohan — Gojek

The Incident Review: 4 Times When Typos Brought Down Critical Systems

Type carefully… or rather, design resilient systems.

JJ Tang — Rootly

The SRE as a Diplomat

Requiring development teams to fully own their services can lead to siloing and redundancy. Heroku works to ameliorate that by embedding SREs in development teams.

Johnny Boursiquot — Salesforce (presented at QCon)

Making Sense out of Incident Metrics

I’ve shared some articles here suggesting doing away with incident metrics like MTTR entirely. This author says that they are useful, but the numbers must be properly contextualized.

Vanessa Huerta Granda — Learning From Incidents

Why more incidents is no bad thing

Everything could be fine, or we could be failing to report or missing problems altogether — we’re flying blind.

Chris Evans — incident.io

Outages

GitHub

SRE Weekly Issue #282

lex

August 8, 2021

General

Articles

A thorough introduction to bpftrace

I really need to learn bpftrace, and this article is a great place to start.

Brendan Gregg

Incidents are for everyone

If we expand our definition of “incident” beyond traditional engineering problems, we increase our opportunity for learning.

Stephen Whitworth — incident.io

Where Do SREs Go From Here?

This is an interview with a director at Catchpoint about their 2021 SRE Report. They discuss two results from the survey: folks report a 15% decrease in toil and slow adoption of AIOps.

Charlene O’Hanlon — devops.com

Incident Retro: Failing Comment Creation + Erroneous Push Notifications

A recurring theme in this story is that the incident was when folks learned how the push notifications work.

Molly Struve — DEV

r/sre – Dev focused SREs do not want to take on operational tasks

In this Reddit thread, a company hired some developers as SREs and then found that they didn’t want to do operations work. Folks weigh in on why and what to do.

u/red_flock and others — reddit

Latency based SLO

How exactly do you want to phrase (and measure) an SLO about latency percentiles? Beware the subtle details.

Piyush Verma — last9
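
One of those subtle details can be shown with made-up numbers. This is a sketch of my own, not code from the linked post: averaging per-minute p99 values is not the same as taking the p99 of all requests, and both differ from counting the share of requests under a threshold. The latencies and the 300 ms threshold are hypothetical.

```python
import math
from statistics import mean


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of the data at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical latencies in ms: a quiet minute and a busy, partly slow minute.
quiet_minute = [100] * 10
busy_minute = [100] * 900 + [2000] * 100
all_requests = quiet_minute + busy_minute

# Phrasing 1: average the per-minute p99 values.
mean_of_p99s = mean([percentile(quiet_minute, 99), percentile(busy_minute, 99)])

# Phrasing 2: p99 over every request in the window.
overall_p99 = percentile(all_requests, 99)

# Phrasing 3: share of requests served within 300 ms.
fraction_fast = sum(1 for v in all_requests if v <= 300) / len(all_requests)

if __name__ == "__main__":
    print(mean_of_p99s, overall_p99, round(fraction_fast, 3))
```

With this data the averaged p99 is 1050 ms while the true p99 is 2000 ms, because the quiet minute counts as much as the busy one even though it carries about 1% of the traffic. The request-based view reports that about 90% of requests met the 300 ms threshold.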

Resilience in Action E9: Vulnerability, Compassion, and Post-Incident Reviews in the Emergency Room with Dr. Al’ai Alvarez

I’m definitely going to think on the great incident response and followup wisdom in this interview. My favorite:

If I can change 1% to better that outcome, what is that 1%?

Christina Tan — Blameless

Full disclosure: Fastly, my employer, is mentioned.

Burned by ‘let it burn’

Root cause: guessed wrong in the moment

Lorin Hochstein

Incident Management Goes to the Olympics

Here’s a run-down of some IT mishaps from Olympic games past and present.

Quentin Rousseau — Rootly
