General

SRE Weekly Issue #222

A message from our sponsor, StackHawk:

The last thing we need is more noise from more tooling. With the new Findings Management feature, you can add AppSec tests to your CI pipeline without being inundated with alerts.
https://www.stackhawk.com/blog/appsec-findings-management?utm_source=SREWeekly

Articles

This article in a nutshell:

Kolton Andrus — Gremlin

I hadn’t heard of this distinction before. If you haven’t either, click through to find out more.

Ayende Rahien — RavenDB

In our experience, the three big sources of production stress are:

  • Toil
  • Bad monitoring
  • Immature incident handling procedures

Cheryl Kang — Google

ProPublica picks apart the incident in exhaustive detail, showing how multiple problems interwoven in the organization contributed to this tragedy.

Robert Faturechi, Megan Rose and T. Christian Miller — ProPublica

There’s a great review of Rasmussen’s safety boundary model, which I wasn’t previously familiar with. A system moves between three boundaries:

  • the boundary to economic failure
  • the boundary of unacceptable work load
  • the boundary of functionally acceptable performance

Lorin Hochstein

This one includes a really nifty graph showing how reliable your N backend microservices need to be in order to hit a given reliability target R.

Bill Duncan
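A rough sketch of the arithmetic behind that kind of graph, under assumptions that are not taken from the linked article: every request must pass through all N backends, and the backends fail independently. In that simplified model the overall reliability is the product of the individual ones, so each service needs at least the N-th root of R. The function name below is hypothetical.

```python
def required_per_service_reliability(target: float, n_services: int) -> float:
    """Return the reliability each backend needs to reach the overall target.

    Assumes (for illustration only) that all n_services are on the request
    path and fail independently, so overall reliability = r ** n_services.
    """
    if not 0.0 < target <= 1.0:
        raise ValueError("target must be in (0, 1]")
    if n_services < 1:
        raise ValueError("n_services must be at least 1")
    # Solve r ** n_services == target for r.
    return target ** (1.0 / n_services)


if __name__ == "__main__":
    # Show how the per-service requirement tightens as N grows.
    for n in (1, 5, 10, 50):
        r = required_per_service_reliability(0.999, n)
        print(f"N={n:>2}: each service needs roughly {r:.6f}")
```

The takeaway in this model is that the per-service requirement climbs toward 1 as N grows, which is why long dependency chains make a fixed target harder to hit. Real systems with retries, fallbacks, or correlated failures will not follow this formula exactly.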

Here are the results of the survey I linked here a couple weeks ago. There are some interesting and surprising results, well worth a read.

Rich Burroughs — FireHydrant

A commonly-used CA’s Root certificate expired, causing some havoc. Even though Sectigo did everything right, some software didn’t handle the transition to the new root well.

Paul Ducklin — Naked Security

Outages

SRE Weekly Issue #221

Don’t forget, Catchpoint’s SRE From Home event is happening this Friday. The speaker list has some names you’ll recognize from articles linked here in previous issues. See you there!

A message from our sponsor, StackHawk:

CI/CD has changed software engineering. Application security, however, has been left behind. Why doesn’t your CI pipeline have AppSec checks?
https://www.stackhawk.com/blog/ci-pipeline-security-bug-testing?utm_source=SREWeekly

Articles

Casey Rosenthal tips over a herd of sacred cows with this talk that opens with 6 myths about reliable systems.

Casey Rosenthal — Verica

This is framed as advice on talking about scale during a job interview, and it’s a pretty good read even if you’re not interviewing right now.

Denise Yu

John Allspaw says we should ask “how”, not “why”. Hollnagel and Woods say we should find out why a joint cognitive system does what it does, rather than how. Who’s right?

Lorin Hochstein

Yay, another issue! This one revolves around learning from incidents from organizations in other fields (Bose and NASA).

Jaime Woo and Emil Stolarsky — Incident Labs

This is a followup analysis of a Google Hangouts outage from last month.

Google

Outages

SRE Weekly Issue #220

A message from our sponsor, StackHawk:

Hi, SRE Weekly. We’re your new newsletter sponsor, StackHawk. We believe that application security is an important part of reliability engineering, and we’re building tooling to support that. We’d love for you to check us out.
https://www.stackhawk.com?utm_source=SREWeekly

Articles

Catchpoint is holding a mini-conference on the ways that SRE has changed as we shift to all-remote work, and I’m super-excited to be on the Q&A panel! Hope to see you there.

Catchpoint

A seasoned pro discusses some pitfalls of cloud-based architecture based on hard-won experience.

Rachel by the bay

Monzo is back with updates on how their on-call has changed since their original article in 2018.

Shubheksha Jalan — Monzo

Along with this rockin’ article about why it’s important to make on-call bearable, Incident Labs also has a survey on your on-call experience. Click through for the link.

Incident Labs

This really crystallizes a lot of my concerns with anomaly detection.

Danyel Fisher — The New Stack / Honeycomb

If you ask someone why they did something, they’re likely to invent a logical-sounding reason without meaning to.

Lorin Hochstein

Outages

SRE Weekly Issue #219

Articles

Check out this new 100-page ebook on incident response from Atlassian, great for folks setting up a brand new on-call structure or improving their existing one. It even has a section on compensating teams for being on-call.

Serhat Can — Atlassian

Laura Maguire discusses the compelling data from her PhD dissertation that the Incident Command System actually makes incident response less efficient, along with lots of other interesting findings.

Laura Maguire

A summary of a great talk by Amy Tobey at Failover Conf, amusingly framed as a “retrospective”.

Hannah Culver — Blameless

In this case, the “cloud” refers to actual clouds, the ones in the sky. It’s a comparison between concepts in aviation and SRE, fields that have significant overlaps.

Bill Duncan

My favorite:

The fact that you need to make changes to maintain availability, will itself threaten your availability.

Lee Atchison — diginomica

A bug in a new release of the Facebook SDK caused some iOS apps to crash.

Brian Barrett — WIRED

[…] I’m no longer going to try to get engineers to stop saying root cause while they are engaged in an incident. If the experts at anomaly response find it a useful phrase while they are doing their work, we should recognize this as a part of their expertise.

Lorin Hochstein

Outages

  • Slack
    • Slack’s server infrastructure scales up every day to handle volume in North America by increasing the size of the server pool available to handle requests. Some of these servers did not successfully register with our load balancing infrastructure during this process of scaling up, and this ultimately led to a decline in the health of the server pool over time.

  • YouTube
  • Coinbase
  • Google Play Store
  • Microsoft Outlook
  • reddit
  • Zoom

SRE Weekly Issue #218

Articles

An airplane pilot’s take on runbooks, by way of comparison to aviation checklists.

Bill Duncan

This article demonstrates that we don’t need to be afraid of spinning up a new thread per connection, and Linux is very good at what it does. This seems to have been a surprisingly controversial point of view, judging by the follow-up article.

Rachel by the bay

It’s not as easy as you think… even if you think it’s not easy.

Oren Eini — RavenDB

Atlassian shows us what’s changed in operations, based on their State of Incident Management survey.

A little over half of survey respondents – 51 percent – reported that their incident response time has been slower since beginning to work remotely

Patrick Hill — Atlassian

A key idea here is that by focusing not simply on identifying fixes for the parts involved in the event, but on developing a richer understanding of the event, the effort will yield a much greater ROI, and that will include more effective “fixes” and more.

John Allspaw

The part about pandemic-induced decision fatigue was revelatory for me.

Hannah Culver — Blameless

Gremlin talks about Failover Conf, and I love that it pretty much reads like a retrospective.

Kimbre Lancaster — Gremlin

Outages

A production of Tinker Tinker Tinker, LLC