General

SRE Weekly Issue #285

A message from our sponsor, StackHawk:

Check out the latest from StackHawk’s Chief Security Officer, Scott Gerlach, on why security should be part of building software, and how StackHawk helps teams catch vulns before prod.
https://sthwk.com/cloudnative

Articles

What’s so great about this incident write-up is the way that entrenched mental models hampered the incident response. There’s so much to learn here.

Ray Ashman — Mailchimp

The parallels between this and the Mailchimp article are striking.

Will Gallego

This includes a review of the four golden signals and presents three areas to go further.

JJ Tang — Rootly
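
As a refresher, the four golden signals are latency, traffic, errors, and saturation. Here's a minimal sketch of what instrumenting them might look like in Python with the prometheus_client library; the metric names and the handler are my own illustrative assumptions, not taken from the article.

```python
# Illustrative sketch: exposing the four golden signals with prometheus_client.
# Metric names and the handler are assumptions made for this example.
import time
from prometheus_client import Counter, Gauge, Histogram, start_http_server

REQUEST_LATENCY = Histogram("http_request_duration_seconds", "Request latency")   # latency
REQUESTS_TOTAL = Counter("http_requests_total", "Requests served")                # traffic
REQUEST_ERRORS = Counter("http_request_errors_total", "Requests that failed")     # errors
QUEUE_DEPTH = Gauge("worker_queue_depth", "Jobs waiting to be processed")         # saturation

def handle_request(do_work):
    REQUESTS_TOTAL.inc()
    start = time.monotonic()
    try:
        return do_work()
    except Exception:
        REQUEST_ERRORS.inc()
        raise
    finally:
        REQUEST_LATENCY.observe(time.monotonic() - start)

if __name__ == "__main__":
    start_http_server(8000)              # metrics scrapeable at :8000/metrics
    QUEUE_DEPTH.set(3)                   # e.g. sampled from your queue backend
    handle_request(lambda: time.sleep(0.05))
```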

This one thoughtfully discusses why “root cause” is a flawed concept, approaching the idea from multiple directions.

Lorin Hochstein

Check it out, a new SRE conference! This one’s virtual and the CFP is open until October 1.

Robert Barron — IBM

To be clear, this article is about static dashboards that just contain pre-set graphs of specific metrics.

every dashboard is an answer to some long-forgotten question

Charity Majors

Public incident posts give us useful insight into how companies analyze their incidents, but it’s important to remember that they’re almost never the same as internal incident write-ups.

John Allspaw — Adaptive Capacity Labs

In this incident from July 7, front-line routing hosts exceeded their file descriptor limits, causing requests to be delayed and dropped.

Heroku
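
For a feel of what hitting a file descriptor limit looks like from inside a process, here's a rough Linux-only sketch in Python; the threshold and reporting are my own illustrative choices, not Heroku's tooling.

```python
# Rough sketch: compare this process's open file descriptors to its limit.
# Linux-specific (relies on /proc); the 80% threshold is an arbitrary example.
import os
import resource

soft_limit, hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)
open_fds = len(os.listdir("/proc/self/fd"))  # descriptors currently open by this process

usage = open_fds / soft_limit
print(f"{open_fds}/{soft_limit} file descriptors in use ({usage:.1%})")
if usage > 0.8:
    print("WARNING: approaching the limit; new sockets/files will soon fail with EMFILE")
```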

.io, assigned to the British Indian Ocean Territory, is almost exclusively used by annoying startups for content completely unrelated to the islands.

Remember, it’s all fun and games until the random country you’ve attached your business to has an outage in their TLD DNS infrastructure.

Jan Schaumann

If you’re curious about just what a columnar data store is, like I was, this article is a good introduction.

Alex Vondrak — Honeycomb
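
If you'd like a taste of the core idea before reading: a columnar store keeps each field's values together rather than keeping each record's fields together. This toy Python sketch is my own illustration of the layout difference, not Honeycomb's implementation.

```python
# Toy illustration of row-oriented vs column-oriented layout.
# Not Honeycomb's implementation; just the general shape of the idea.
rows = [
    {"service": "api", "status": 200, "duration_ms": 12},
    {"service": "api", "status": 500, "duration_ms": 340},
    {"service": "web", "status": 200, "duration_ms": 28},
]

# Columnar layout: one array per field, values stored contiguously.
columns = {key: [row[key] for row in rows] for key in rows[0]}

# An aggregation over a single field only scans that one array,
# which is the core reason column stores handle this kind of query well.
error_rate = sum(s >= 500 for s in columns["status"]) / len(columns["status"])
print(f"error rate: {error_rate:.0%}")
```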

Outages

SRE Weekly Issue #284

Like last week, I prepared this week’s issue in advance, so no Outages section.  Have a great week!

A message from our sponsor, StackHawk:

Trying to automate application and API security testing? See how StackHawk and Burp Suite Enterprise stack up:
https://sthwk.com/burp-enterprise

Articles

SoundCloud is very clear that they are not at Google scale. It’s interesting to see how they apply SRE principles at their scale.

Björn “Beorn” Rabenstein — SoundCloud

Here’s why Target set up their ELK stack, and how they used it to troubleshoot a problem in Elasticsearch itself.

Dan Getzke — Target

A key point in this article is that calculating your error budget as just “100% – SLO” goes about things backward.

Adam Hammond — Squadcast
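
As a refresher on the basic arithmetic (my own worked example with arbitrary numbers, not anything from the article): the error budget is the complement of the SLO target, which you can then translate into allowed bad minutes over the SLO window.

```python
# Worked example of the basic error-budget arithmetic.
# The SLO target and window are arbitrary illustrative values.
slo_target = 0.999      # 99.9% availability objective
window_days = 30

error_budget = 1.0 - slo_target                   # fraction of the window allowed to be "bad"
budget_minutes = error_budget * window_days * 24 * 60

print(f"Error budget: {error_budget:.3%} of the window")
print(f"That's about {budget_minutes:.1f} minutes of unavailability per {window_days} days")
# 0.1% of 30 days is roughly 43.2 minutes.
```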

They periodically scale up their systems just to test and be sure they’ll be ready for big events like Black Friday / Cyber Monday.

Kathryn Tang — Shopify

In this post, we’ll focus on service ownership. Why is service ownership important? How should teams self-organize to achieve it? Where’s the best place to start?

Cortex

This fun troubleshooting story hinges on the internal details of how PostgreSQL’s sequences work.

Pete Hamilton — incident.io

SRE Weekly Issue #283

I’m on vacation enjoying the sunny beaches in Maine with my family, so I prepared this week’s issue in advance.  No outages section, save for one big one I noticed due to direct personal experience.  See you all next week!

A message from our sponsor, StackHawk:

StackHawk is now integrated with GitHub Code Scanning! Engineers can run automated dynamic application and API security testing when they check in code, with results available directly in GitHub.
https://sthwk.com/GitHub-CodeScanning

Articles

We needed a way to deploy our new service seamlessly, and to roll back that deploy should something go wrong. Ultimately many, many things did go wrong, and every bit of failure tolerance put into the system proved to be worth its weight in gold because none of this was visible to customers.

Geoffrey Plouviez — Cloudflare

I especially like the idea of tailoring retrospective documents to disparate audiences — you may have more than you realize.

Emily Arnott — Blameless

An analysis of two incidents from the venerable John Allspaw.  These are from 2012 back when he was at Etsy, and yet there’s still a ton we can learn now by reading them.

John Allspaw — Etsy

An account of how Gojek responds to production issues, and why the RCA is a critical part of the process.

Sooraj Rajmohan — Gojek

Type carefully… or rather, design resilient systems.

JJ Tang — Rootly

Requiring development teams to fully own their services can lead to siloing and redundancy. Heroku works to ameliorate that by embedding SREs in development teams.

Johnny Boursiquot — Salesforce (presented at QCon)

I’ve shared some articles here suggesting doing away with incident metrics like MTTR entirely. This author says that they are useful, but the numbers must be properly contextualized.

Vanessa Huerta Granda — Learning From Incidents
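
One concrete reason the raw numbers need context (a toy illustration of my own, not from the article): incident durations tend to be heavily skewed, so a single long incident can drag a mean-based MTTR far away from the typical experience.

```python
# Toy illustration of how skewed durations distort a mean-based MTTR.
# The durations are invented for this example.
from statistics import mean, median

resolution_minutes = [12, 18, 9, 25, 14, 16, 480]  # one long outlier incident

print(f"MTTR (mean): {mean(resolution_minutes):.0f} minutes")
print(f"Median:      {median(resolution_minutes):.0f} minutes")
# The mean (~82 min) looks alarming even though six of the seven incidents
# resolved in under half an hour, which is exactly why the number needs context.
```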

Everything could be fine, or we could be failing to report or missing problems altogether — we’re flying blind.

Chris Evans — incident.io

Outages

SRE Weekly Issue #282

A message from our sponsor, StackHawk:

ICYMI: ZAP Creator and Project Lead Simon Bennetts recently unveiled ZAP’s new automation framework. Watch the session and see how it works:
https://sthwk.com/Automation-Framework

Articles

I really need to learn bpftrace, and this article is a great place to start.

Brendan Gregg

If we expand our definition of “incident” beyond traditional engineering problems, we increase our opportunity for learning.

Stephen Whitworth — incident.io

This is an interview with a director at Catchpoint about their 2021 SRE Report. They discuss two results from the survey: folks report a 15% decrease in toil and slow adoption of AIOps.

Charlene O’Hanlon — devops.com

A recurring theme in this story is that folks only learned how the push notifications worked during the incident itself.

Molly Struve — DEV

In this reddit thread, a company hired some developers as SREs and then found that they didn’t want to do operations work. Folks weigh in on why and what to do.

u/red_flock and others — reddit

How exactly do you want to phrase (and measure) an SLO about latency percentiles? Beware the subtle details.

Piyush Verma — last9
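
One example of the kind of subtlety involved (my own sketch with invented numbers, not the article's example): "p99 latency is under 300 ms" and "99% of requests complete within 300 ms" sound interchangeable but are computed differently, and small sample windows make the difference worse.

```python
# Two ways to phrase the "same" latency SLO over one window of data.
# Latencies and the 300 ms threshold are invented for this sketch.
latencies_ms = sorted([120, 135, 150, 180, 210, 250, 290, 305, 310, 900])

# Phrasing 1: "the 99th percentile is below 300 ms"
# (nearest-rank percentile; real systems usually interpolate or use histograms)
idx = max(0, round(0.99 * len(latencies_ms)) - 1)
p99 = latencies_ms[idx]

# Phrasing 2: "at least 99% of requests complete within 300 ms"
fraction_fast = sum(l <= 300 for l in latencies_ms) / len(latencies_ms)

print(f"p99 = {p99} ms -> SLO {'met' if p99 <= 300 else 'missed'}")
print(f"{fraction_fast:.0%} under 300 ms -> SLO {'met' if fraction_fast >= 0.99 else 'missed'}")
# With only 10 samples, the nearest-rank p99 is effectively the max:
# one slow request decides the whole window.
```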

I’m definitely going to think on the great incident response and follow-up wisdom in this interview. My favorite:

If I can change 1% to better that outcome, what is that 1%?

Christina Tan — Blameless

Full disclosure: Fastly, my employer, is mentioned.

Root cause: guessed wrong in the moment

Lorin Hochstein

Here’s a run-down of some IT mishaps from Olympic games past and present.

Quentin Rousseau — Rootly

Outages

SRE Weekly Issue #281

A message from our sponsor, StackHawk:

Traditional application security testing methods fail for single page applications. Check out why single page apps are different and how you can run security tests on your SPAs.
https://sthwk.com/SPA

Articles

The incident: a Formula 1 car hit the side barrier just over 20 minutes before the race was about to start. The team sprang into action with an incredibly calm, orderly, and speedy incident response to replace the damaged parts faster than they ever had before.

This article is a great analysis, and there’s also an excellent 8-minute video that I highly recommend. Listen to the way the sporting director and everyone else communicates so calmly. It’s a rare treat to get video footage of a production incident like this.

Chris Evans — incident.io

The underlying components become the cattle, and the services become the new Pet that you tend to with your utmost care.

Piyush Verma — Last9

AWS posted these example/template incident response playbooks for customers to use in their incident response process.

AWS

A list with descriptions of all DNS record types, even the obscure ones. Tag yourself, I’m HIP.

Jan Schaumann

This one includes a useful set of questions to prompt you as you develop your incident response and classification process.

Hollie Whitehead — xMatters

The author of this article shows us how they communicate actively, perform incident retrospectives, and even discuss “near misses” and normal work in order to better learn how their system works — all skills that apply directly to SRE.

Jason Koppe — Learning From Incidents

Although the fundamental concepts of site reliability engineering are the same in any environment, SREs must adapt practices to different technologies, like microservices.

JJ Tang — Rootly

This one uses Akamai’s incident report from their July 22 major outage as a jumping-off point to discuss openness in incident reports. The text of Akamai’s incident report is included in full.

Geoff Huston — CircleID

Drawing from the “normalization of deviance” concept introduced in Diane Vaughan’s study of the Challenger disaster, this article explores the idea of studying your organization’s culture to catch problems early, rather than waiting to respond after they happen.

Stephen Scott

This episode of the StaffEng Podcast is an interview with Lorin Hochstein, whose writings I’ve featured here numerous times. My favorite part of this episode is when they talk about doing incident analysis for near misses. One of the hosts points out that it’s much easier for folks to talk about what happened, because there was no incident so they’re not worried about being blamed.

David Noël-Romas and Alex Kessinger — StaffEng Podcast

Outages

A production of Tinker Tinker Tinker, LLC