SRE Weekly Issue #220

A message from our sponsor, StackHawk:

Hi, SRE Weekly. We’re your new newsletter sponsor, StackHawk. We believe that application security is an important part of reliability engineering, and we’re building tooling to support that. We’d love for you to check us out.
https://www.stackhawk.com?utm_source=SREWeekly

Articles

Catchpoint is holding a mini-conference on the ways that SRE has changed as we shift to all-remote work, and I’m super-excited to be on the Q&A panel! Hope to see you there.

Catchpoint

A seasoned pro discusses some pitfalls of cloud-based architecture based on hard-won experience.

Rachel by the bay

Monzo is back with updates on how their on-call has changed since their original article in 2018.

Shubheksha Jalan — Monzo

Along with this rockin’ article about why it’s important to make on-call bearable, Incident Labs also has a survey on your on-call experience. Click through for the link.

Incident Labs

This really crystallizes a lot of my concerns with anomaly detection.

Danyel Fisher — The New Stack / Honeycomb

If you ask someone why they did something, they’re likely to invent a logical-sounding reason without meaning to.

Lorin Hochstein

Outages

SRE Weekly Issue #219

Articles

Check out this new 100-page ebook on incident response from Atlassian, great for folks setting up a brand new on-call structure or improving their existing one. It even has a section on compensating teams for being on-call.

Serhat Can — Atlassian

Laura Maguire discusses compelling data from her PhD dissertation suggesting that the Incident Command System actually makes incident response less efficient, along with lots of other interesting findings.

Laura Maguire

A summary of a great talk by Amy Tobey at Failover Conf, amusingly framed as a “retrospective”.

Hannah Culver — Blameless

In this case, the “cloud” refers to actual clouds, the ones in the sky. It’s a comparison between concepts in aviation and SRE, fields that have significant overlaps.

Bill Duncan

My favorite:

The fact that you need to make changes to maintain availability, will itself threaten your availability.

Lee Atchison — diginomica

A bug in a new release of the Facebook SDK caused some iOS apps to crash.

Brian Barrett — WIRED

[…] I’m no longer going to try to get engineers to stop saying root cause while they are engaged in an incident. If the experts at anomaly response find it a useful phrase while they are doing their work, we should recognize this as a part of their expertise.

Lorin Hochstein

Outages

  • Slack
    • Slack’s server infrastructure scales up every day to handle volume in North America by increasing the size of the server pool available to handle requests. Some of these servers did not successfully register with our load balancing infrastructure during this process of scaling up, and this ultimately led to a decline in the health of the server pool over time.

  • YouTube
  • Coinbase
  • Google Play Store
  • Microsoft Outlook
  • reddit
  • Zoom

SRE Weekly Issue #218

Articles

An airplane pilot’s take on runbooks, by way of comparison to aviation checklists.

Bill Duncan

This article demonstrates that we don’t need to be afraid of spinning up a new thread per connection, and that Linux is very good at what it does. This seems to have been a surprisingly controversial point of view, judging by the follow-up article.

Rachel by the bay
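
If the pattern is unfamiliar, here’s a minimal thread-per-connection sketch in Python — purely my own illustration of the approach the article defends, not code from the article: accept a socket, hand it to a fresh thread, repeat.

    # Minimal thread-per-connection echo server (illustrative sketch only).
    import socket
    import threading

    def handle(conn, addr):
        # Each connection gets its own thread; the kernel schedules them cheaply.
        with conn:
            while True:
                data = conn.recv(4096)
                if not data:
                    break
                conn.sendall(data)

    def serve(host="127.0.0.1", port=9000):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
            srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
            srv.bind((host, port))
            srv.listen()
            while True:
                conn, addr = srv.accept()
                threading.Thread(target=handle, args=(conn, addr), daemon=True).start()

    if __name__ == "__main__":
        serve()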

It’s not as easy as you think… even if you think it’s not easy.

Oren Eini — RavenDB

Atlassian shows us what’s changed in operations, based on their State of Incident Management survey.

A little over half of survey respondents – 51 percent – reported that their incident response time has been slower since beginning to work remotely

Patrick Hill — Atlassian

A key idea here is that rather than simply identifying fixes for the parts involved in the event, focusing on developing a richer understanding of the event yields a much greater ROI on the effort, including more effective “fixes” and more.

John Allspaw

The part about pandemic-induced decision fatigue was revelatory for me.

Hannah Culver — Blameless

Gremlin talks about Failover Conf, and I love that it pretty much reads like a retrospective.

Kimbre Lancaster — Gremlin

Outages

SRE Weekly Issue #217

A message from our sponsor, VictorOps:

Our people and tools need to be connected now more than ever before. That’s why VictorOps is offering free, 90-day extended Enterprise trials for on-call incident response and alert management, up to 100 users, to anyone who needs it:

https://go.victorops.com/sreweekly-extended-trials-for-incident-response

Articles

Reliability is something you do, not something you buy.

When discussing SRE, I love to pose the question, “What does it mean to engineer reliability?” That’s what this article is all about.

Russ Miles — ChaosIQ

Blameless recently had the privilege of hosting SRE leaders Craig Sebenik, David Blank-Edelman, and Kurt Andersen to discuss how SREs can approach work-as-done vs. work-as-imagined, how to define SRE and DevOps and the complementary nature of the two, the ethics of purchasing packaged versions of open source software, and more.

Amy Tobey, with guests Craig Sebenik, David Blank-Edelman, and Kurt Andersen — Blameless

Whenever an agent is under pressure to simultaneously act quickly and carefully, they are faced with a double-bind. If they proceed quickly and something goes wrong, they will be faulted for not being careful enough. If they proceed carefully and something goes wrong, they will be faulted for not moving quickly enough.

Lorin Hochstein

It’s time for another issue already! This one contains a really great essay by Jamie Woo entitled “What Does Fairness Mean for On-call Rotations?”, about how not all on-call shifts are equal.

Jamie Woo and Emil Stolarsky — Incident Labs

If your frontend has a hard dependency on multiple microservices, their failure rates are compounded. This article fills in the math behind the paper The Tail at Scale and shows that your backends’ SLOs may have to be significantly tighter than the frontend’s.

Bill Duncan
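
To see why the backends’ SLOs end up tighter than the frontend’s, here’s a quick back-of-the-envelope calculation in Python; the numbers are my own illustration, not figures from the article.

    # With n hard backend dependencies each at availability a, the frontend
    # can do no better than a**n (illustrative numbers only).
    n = 3
    backend_availability = 0.999              # each backend at "three nines"
    print(backend_availability ** n)          # ~0.997: worse than any single backend

    # To meet a 99.9% frontend SLO, each of the 3 backends must hit roughly:
    frontend_slo = 0.999
    print(frontend_slo ** (1 / n))            # ~0.99967: tighter than the frontend's own SLO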

This post-incident analysis details a case of a hard dependency that needn’t be hard, taking down the Heroku API, along with a fallback that didn’t work as intended.

I love Julia Evans’s ability to teach me something new that I didn’t realize I didn’t know.

Julia Evans

Outages

SRE Weekly Issue #216

A message from our sponsor, VictorOps:

Our people and tools need to be connected now more than ever before. That’s why VictorOps is offering free, 90-day extended Enterprise trials for on-call incident response and alert management, up to 100 users, to anyone who needs it:

https://go.victorops.com/sreweekly-extended-trials-for-incident-response

Articles

Awesome resource! In each section, they explain what to include, why to include it, and an example from their playbook.

Blake Thorne — Atlassian

I didn’t make it to Failover Conf, and it sounds like I missed a great time, so I’m especially grateful for this writeup.

Rich Burroughs — FireHydrant

And this one!

Hannah Culver — Blameless

I’m a little late with this one, sorry folks! Survey ends tomorrow, April 27.

This is an anonymous survey to look at the impact that COVID-19 has had on on-call teams in tech.

FireHydrant

Most post-incident review documents are written to be filed, not written to be read.

This slide deck is awesome and well worth the read.

John Allspaw — Adaptive Capacity Labs

A deep dive into the math behind anomaly detection.

Nikita Butakov — Ericsson

This article brings together thoughts on on-call work during the pandemic from folks at different companies.

Rich Burroughs — FireHydrant

A frontend engineer shares their key takeaways from their time shadowing.

Laura Montemayor — GitLab

Outages
