General

SRE Weekly Issue #217

A message from our sponsor, VictorOps:

Our people and tools need to be connected now more than ever before. That’s why VictorOps is offering free, 90-day extended Enterprise trials for on-call incident response and alert management, up to 100 users, to anyone who needs it:

https://go.victorops.com/sreweekly-extended-trials-for-incident-response

Articles

Reliability is something you do, not something you buy.

When discussing SRE, I love to pose the question, “What does it mean to engineer reliability?” That’s what this article is all about.

Russ Miles — ChaosIQ

Blameless recently had the privilege of hosting SRE leaders Craig Sebenik, David Blank-Edelman, and Kurt Andersen to discuss how SREs can approach work as done vs work as imagined, how to define SRE and DevOps and the complementary nature of the two, the ethics of purchasing packaged versions of open source software, and more.

Amy Tobey, with guests Craig Sebenik, David Blank-Edelman, and Kurt Andersen — Blameless

Whenever an agent is under pressure to simultaneously act quickly and carefully, they are faced with a double-bind. If they proceed quickly and something goes wrong, they will be faulted for not being careful enough. If they proceed carefully and something goes wrong, they will be faulted for not moving quickly enough.

Lorin Hochstein

It’s time for another issue already! This one contains a really great essay by Jaime Woo entitled “What Does Fairness Mean for On-call Rotations?”, about how not all on-call shifts are equal.

Jaime Woo and Emil Stolarsky — Incident Labs

If your frontend has a hard dependency on multiple microservices, their failure rates are compounded. This article fills in the math behind the paper The Tail at Scale and shows that your backends’ SLOs may have to be significantly tighter than the frontend’s.

Bill Duncan
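
To make the compounding concrete, here is a minimal sketch of the arithmetic. It assumes backend failures are independent and that every backend is a hard dependency; the function names are hypothetical and are not taken from the article or the paper.

```python
def compound_availability(backend_availabilities):
    """Availability of a frontend that needs every backend call to succeed.

    Assumption: backend failures are independent, which real systems
    only approximate.
    """
    result = 1.0
    for availability in backend_availabilities:
        result *= availability
    return result


def required_backend_availability(frontend_target, n_backends):
    """Per-backend availability needed for n equal, independent hard
    dependencies to meet the frontend target (the n-th root of the target)."""
    if n_backends < 1:
        raise ValueError("n_backends must be at least 1")
    return frontend_target ** (1.0 / n_backends)


if __name__ == "__main__":
    # Ten backends at 99.9% each leave the frontend at roughly 99.0%.
    print(compound_availability([0.999] * 10))
    # To hold the frontend at 99.9%, each of the ten needs roughly 99.99%.
    print(required_backend_availability(0.999, 10))
```

Under these assumptions, ten "three nines" backends give a frontend of only about two nines, which is why the backend SLOs end up tighter than the frontend's.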

This post-incident analysis details a case of a hard dependency that needn’t be hard, taking down the Heroku API, along with a fall-back that didn’t work as intended.

I love Julia Evans’s ability to teach me something new that I didn’t realize I didn’t know.

Julia Evans

Outages

SRE Weekly Issue #216

Articles

Awesome resource! In each section, they explain what to include, why to include it, and an example from their playbook.

Blake Thorne — Atlassian

I didn’t make it to Failover Conf, and it sounds like I missed a great time, so I’m especially grateful for this writeup.

Rich Burroughs — FireHydrant

And this one!

Hannah Culver — Blameless

I’m a little late with this one, sorry folks! Survey ends tomorrow, April 27.

This is an anonymous survey to look at the impact that COVID-19 has had on on-call teams in tech.

FireHydrant

Most post-incident review documents are written to be filed, not written to be read.

This slide deck is awesome and well worth the read.

John Allspaw — Adaptive Capacity Labs

A deep dive into the math behind anomaly detection.

Nikita Butakov — Ericsson

This article brings together thoughts on on-call work during the pandemic from folks at different companies.

Rich Burroughs — FireHydrant

A frontend engineer shares their key takeaways from their time shadowing.

Laura Montemayor — GitLab

Outages

SRE Weekly Issue #215

I missed last week to set up a new swing set for my kids (gotta give ’em something to do while they’re stuck at home). I’m still a bit behind on articles and I’ll catch up over the next couple weeks.

Articles

The “messy” details of our human/computer systems are their hidden strength.

Lorin Hochstein

In this accident report, learn how two pilots lost situational awareness, with disastrous consequences.

Air Safety Institute

Without a structured strategy, and careful consideration of the full SLO lifecycle, SLOs risk partial implementation. This can result in low ROI and, in many cases, a complete failure.

Danny Mican — Squadcast

The cloud’s multiple availability zones and regions can be powerful, but it’s hard to get a multi-region architecture correct.

Serhat Can — OpsGenie

A useful little JavaScript tool: plug in an availability percentage (e.g. 99.99%), and get back the number of minutes you can be down in a day, month, quarter, or year.

Hexadecimal
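
The calculation behind a tool like this can be sketched in a few lines. This is an illustration, not the linked tool's code: the function name is hypothetical, and the period lengths (30-day month, 90-day quarter, 365-day year) are assumptions.

```python
# Assumed period lengths in minutes; the linked tool may define them differently.
PERIOD_MINUTES = {
    "day": 24 * 60,
    "month": 30 * 24 * 60,
    "quarter": 90 * 24 * 60,
    "year": 365 * 24 * 60,
}


def downtime_budget_minutes(availability_percent):
    """Return the allowed downtime in minutes per period for a given
    availability percentage (for example 99.99)."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    unavailable_fraction = 1 - availability_percent / 100
    return {
        period: minutes * unavailable_fraction
        for period, minutes in PERIOD_MINUTES.items()
    }


if __name__ == "__main__":
    for period, minutes in downtime_budget_minutes(99.99).items():
        print(f"{period}: {minutes:.2f} minutes")
```

With these assumed period lengths, 99.99% works out to roughly 0.14 minutes of downtime per day and about 52.6 minutes per year.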

Azure Pipelines had an incident of delayed builds at the end of March. Find out more in this post-incident analysis.

Chad Kimes – Microsoft

Google published another book in their SRE series. This short summary gives an overview of what’s inside along with an explanation of the motivation for another book. See also: Google’s announcement

Todd Hoff — High Scalability

The pendulum is swinging back, and folks are starting to see the downsides of a plethora of microservices, including early champion Uber.

Todd Hoff — High Scalability

Outages

SRE Weekly Issue #214

A message from our sponsor, VictorOps:

SRE requires continuous improvement and learning. So, to help out, VictorOps lays out a bunch of educational resources, podcast episodes, videos and more in the new learning library. Check it out:

https://go.victorops.com/sreweekly-learning-library

Articles

A nifty little pitfall in which an ioniced process can block non-ioniced processes.

rachelbythebay

Google published this free set of courses on technical writing. As an SRE, I have the constant need to write effectively to justify and document my designs.

Every engineer is also a writer.

This collection of courses and learning resources aims to improve your technical documentation. Learn how to plan and author technical documents.

Google

The ACM has made their ACM Digital Library free to the public for the next 3 months. Many of their articles have been featured here previously.

Includes a great article by Jaime Woo, entitled Imagining Your Post-Incident Report As A Documentary.

Emil Stolarsky and Jaime Woo — The Post-Incident Review

Blameless recently had the privilege of hosting SRE leaders Liz Fong-Jones, Dave Rensin, and Alex Hidalgo to discuss how SREs can embrace resilience during the pandemic, and how the principles of SRE intersect with global trends.

I especially liked the discussion of pent-up demand that may cause problems when we eventually get to relax social distancing.

Amy Tobey (moderator), Alex Hidalgo, Liz Fong-Jones, Dave Rensin

This is a talk that John Allspaw gave for Spotify.

Learning is not the same as fixing

John Allspaw — Adaptive Capacity Labs

Outages

SRE Weekly Issue #213

A message from our sponsor, VictorOps:

Major incidents lead to more alerts, more downtime and unhappy customers. See how modern DevOps-minded teams are building virtual war rooms to quickly mobilize cross-functional engineering and IT teams around major incidents – improving incident remediation while reducing burnout:

https://go.victorops.com/sreweekly-war-rooms-for-major-incidents

Articles

This is important, and well worth a read. Where’s the SRE connection? The article explains that the U.S. Surgeon General’s comment that masks are “not effective” led to a stigma against those that wear them here. That kind of unintended sociological effect is uncovered commonly in incident post-analysis.

Sui Huang

PagerDuty ran the numbers and discovered an increase in incidents recently, especially in certain companies.

Rachel Obstler — PagerDuty

Here’s the scoop on all those GitHub incidents in February.

Keith Ballinger — GitHub

No, it won’t be possible to continue operating business-as-usual. For the unforeseeable future, teams across the world will be dealing with cutbacks, infrastructure instability, and more. However, with SRE best practices, your team can embrace resilience and adapt through this difficult time.

Hannah Culver — Blameless

5 tips for incident management when you’re suddenly remote

I love the concept of “ephemeral information”, that is, discussions that happen out-of-band, making it much harder to analyze the incident after the fact.

Blake Thorne — Atlassian

Grey failure turned a seemingly reasonable auto-recovery mechanism into a DoS caused by a thundering herd.

Panagiotis Moustafellos, Uri Cohen, and Sylvain Wallez — Elastic

Outages

A production of Tinker Tinker Tinker, LLC