SRE Weekly Issue #218

Articles

An airplane pilot’s take on runbooks, by way of comparison to aviation checklists.

Bill Duncan

This article demonstrates that we don’t need to be afraid of spinning up a new thread per connection, and Linux is very good at what it does. This seems to have been a surprisingly controversial point of view, judging by the follow-up article.

Rachel by the bay

It’s not as easy as you think… even if you think it’s not easy.

Oren Eini — RavenDB

Atlassian shows us what’s changed in operations, based on their State of Incident Management survey.

A little over half of survey respondents – 51 percent – reported that their incident response time has been slower since beginning to work remotely.

Patrick Hill — Atlassian

A key idea here is that rather than simply focusing on identifying fixes for the parts involved in the event, focusing on developing a richer understanding of the event yields a much greater return on the effort, and that return will include more effective “fixes” and more.

John Allspaw

The part about pandemic-induced decision fatigue was revelatory for me.

Hannah Culver — Blameless

Gremlin talks about Failover Conf, and I love that it pretty much reads like a retrospective.

Kimbre Lancaster — Gremlin

Outages

Updated: May 10, 2020 — 8:17 pm
A production of Tinker Tinker Tinker, LLC