SRE Weekly Issue #219

Articles

Check out this new 100-page ebook on incident response from Atlassian, great for folks setting up a brand new on-call structure or improving their existing one. It even has a section on compensating teams for being on-call.

Serhat Can — Atlassian

Laura Maguire discusses compelling data from her PhD dissertation indicating that the Incident Command System actually makes incident response less efficient, along with lots of other interesting findings.

Laura Maguire

A summary of a great talk by Amy Tobey at Failover Conf, amusingly framed as a “retrospective”.

Hannah Culver — Blameless

In this case, the “cloud” refers to actual clouds, the ones in the sky. It’s a comparison between concepts in aviation and SRE, fields that have significant overlaps.

Bill Duncan

My favorite:

The fact that you need to make changes to maintain availability, will itself threaten your availability.

Lee Atchison — diginomica

A bug in a new release of the Facebook SDK caused some iOS apps to crash.

Brian Barrett — WIRED

[…] I’m no longer going to try to get engineers to stop saying root cause while they are engaged in an incident. If the experts at anomaly response find it a useful phrase while they are doing their work, we should recognize this as a part of their expertise.

Lorin Hochstein

Outages

  • Slack
    • Slack’s server infrastructure scales up every day to handle volume in North America by increasing the size of the server pool available to handle requests. Some of these servers did not successfully register with our load balancing infrastructure during this process of scaling up, and this ultimately led to a decline in the health of the server pool over time.

  • YouTube
  • Coinbase
  • Google Play Store
  • Microsoft Outlook
  • reddit
  • Zoom

SRE Weekly Issue #218

Articles

An airplane pilot’s take on runbooks, by way of comparison to aviation checklists.

Bill Duncan

This article demonstrates that we don’t need to be afraid of spinning up a new thread per connection, and Linux is very good at what it does. This seems to have been a surprisingly controversial point of view, judging by the follow-up article.

Rachel by the bay

It’s not as easy as you think… even if you think it’s not easy.

Oren Eini — RavenDB

Atlassian shows us what’s changed in operations, based on their State of Incident Management survey.

A little over half of survey respondents – 51 percent – reported that their incident response time has been slower since beginning to work remotely

Patrick Hill — Atlassian

A key idea here is that rather than simply focusing on identifying fixes for the parts involved in the event, focusing on developing a richer understanding of the event yields a much greater ROI on the effort, and that will include more effective “fixes” and more.

John Allspaw

The part about pandemic-induced decision fatigue was revelatory for me.

Hannah Culver — Blameless

Gremlin talks about Failover Conf, and I love that it pretty much reads like a retrospective.

Kimbre Lancaster — Gremlin

Outages

SRE Weekly Issue #217

A message from our sponsor, VictorOps:

Our people and tools need to be connected now more than ever before. That’s why VictorOps is offering free, 90-day extended Enterprise trials for on-call incident response and alert management, up to 100 users, to anyone who needs it:

https://go.victorops.com/sreweekly-extended-trials-for-incident-response

Articles

Reliability is something you do, not something you buy.

When discussing SRE, I love to pose the question, “What does it mean to engineer reliability?”. That’s what this article is all about.

Russ Miles — ChaosIQ

Blameless recently had the privilege of hosting SRE leaders Craig Sebenik, David Blank-Edelman, and Kurt Andersen to discuss how SREs can approach work as done vs work as imagined, how to define SRE and DevOps and the complementary nature of the two, the ethics of purchasing packaged versions of open source software, and more.

Amy Tobey, with guests Craig Sebenik, David Blank-Edelman, and Kurt Andersen — Blameless

Whenever an agent is under pressure to simultaneously act quickly and carefully, they are faced with a double-bind. If they proceed quickly and something goes wrong, they will be faulted for not being careful enough. If they proceed carefully and something goes wrong, they will be faulted for not moving quickly enough.

Lorin Hochstein

It’s time for another issue already! This one contains a really great essay by Jamie Woo entitled “What Does Fairness Mean for On-call Rotations?”, about how not all on-call shifts are equal.

Jamie Woo and Emil Stolarsky — Incident Labs

If your frontend has a hard dependency on multiple microservices, their failure rates are compounded. This article fills in the math behind the paper The Tail at Scale and shows that your backends’ SLOs may have to be significantly tighter than the frontend’s.

Bill Duncan
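
To make the compounding concrete, here is a minimal sketch of the arithmetic. It is not taken from the article: it assumes backend failures are independent and that the frontend needs every backend call to succeed, and the function names are made up for illustration.

```python
from math import prod


def composite_availability(backend_availabilities):
    """Availability of a frontend that needs every backend to succeed.

    Assumption: backend failures are independent, so the individual
    success probabilities simply multiply.
    """
    return prod(backend_availabilities)


def required_backend_availability(frontend_target, n_backends):
    """Per-backend availability needed for n identical, independent
    backends to jointly meet the frontend target (the n-th root)."""
    return frontend_target ** (1.0 / n_backends)


if __name__ == "__main__":
    # Ten backends at 99.9% each leave the frontend at roughly 99.0%.
    print(composite_availability([0.999] * 10))
    # To keep the frontend at 99.9%, each backend needs roughly 99.99%.
    print(required_backend_availability(0.999, 10))
```

Under these assumptions, each additional hard dependency pushes the per-backend target closer to 100%, which is why the backends' SLOs end up tighter than the frontend's.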

This post-incident analysis details a case of a hard dependency that needn’t be hard, taking down the Heroku API, along with a fall-back that didn’t work as intended.

I love Julia Evans’s ability to teach me something new that I didn’t realize I didn’t know.

Julia Evans

Outages

SRE Weekly Issue #216

Articles

Awesome resource! In each section, they explain what to include, why to include it, and an example from their playbook.

Blake Thorne — Atlassian

I didn’t make it to Failover Conf, and it sounds like I missed a great time, so I’m especially grateful for this writeup.

Rich Burroughs — FireHydrant

And this one!

Hannah Culver — Blameless

I’m a little late with this one, sorry folks! Survey ends tomorrow, April 27.

This is an anonymous survey to look at the impact that COVID-19 has had on on-call teams in tech.

FireHydrant

Most post-incident review documents are written to be filed, not written to be read.

This slide deck is awesome and well worth the read.

John Allspaw — Adaptive Capacity Labs

A deep dive into the math behind anomaly detection.

Nikita Butakov — Ericsson

This article brings together thoughts on on-call work during the pandemic from folks at different companies.

Rich Burroughs — FireHydrant

A frontend engineer shares their key takeaways from their time shadowing.

Laura Montemayor — GitLab

Outages

SRE Weekly Issue #215

I missed last week to set up a new swing set for my kids (gotta give ’em something to do while they’re stuck at home). I’m still a bit behind on articles and I’ll catch up over the next couple weeks.

Articles

The “messy” details of our human/computer systems are their hidden strength.

Lorin Hochstein

In this accident report, learn how two pilots lost situational awareness, with disastrous consequences.

Air Safety Institute

Without a structured strategy, and careful consideration of the full SLO lifecycle, SLOs risk partial implementation. This can result in low ROI and, in many cases, a complete failure.

Danny Mican — Squadcast

The cloud’s multiple availability zones and regions can be powerful, but it’s hard to get a multi-region architecture correct.

Serhat Can — OpsGenie

A useful little JavaScript tool: plug in an availability percentage (e.g. 99.99%), and get back the number of minutes you can be down in a day, month, quarter, or year.

Hexadecimal
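
The calculation such a tool performs can be sketched in a few lines. This is not the tool's actual code: the function name is hypothetical, and the period lengths (30-day month, 91-day quarter, 365-day year) are assumptions that the real tool may define differently.

```python
MINUTES_PER_DAY = 24 * 60

# Period lengths in days. The month, quarter, and year values are
# assumed averages, not figures taken from the linked tool.
PERIOD_DAYS = {"day": 1, "month": 30, "quarter": 91, "year": 365}


def allowed_downtime_minutes(availability_percent):
    """Return the downtime budget in minutes for each period."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    unavailable_fraction = 1 - availability_percent / 100
    return {
        period: days * MINUTES_PER_DAY * unavailable_fraction
        for period, days in PERIOD_DAYS.items()
    }


if __name__ == "__main__":
    for period, minutes in allowed_downtime_minutes(99.99).items():
        print(f"{period}: {minutes:.2f} minutes")
```

With these period lengths, 99.99% availability allows about 52.56 minutes of downtime per year.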

Azure Pipelines had an incident of delayed builds at the end of March. Find out more in this post-incident analysis.

Chad Kimes – Microsoft

Google published another book in their SRE series. This short summary gives an overview of what’s inside along with an explanation of the motivation for another book. See also: Google’s announcement

Todd Hoff — High Scalability

The pendulum is swinging back, and folks are starting to see the downsides of a plethora of microservices, including early champion Uber.

Todd Hoff — High Scalability

Outages

A production of Tinker Tinker Tinker, LLC