SRE Weekly Issue #253

A message from our sponsor, StackHawk:

How do you know if your GraphQL API is secure? Watch StackHawk CSO Scott Gerlach walk through how to run application security tests for GraphQL-backed apps.
http://sthwk.com/graphql-webinar

Articles

TLS can be such a headache.

This was an interesting situation. There was a valid path to the USERTrust RSA Certification Authority, and there was also an expired path. The browser was able to find the valid chain, but the curl was not able to find it.

Adam Surak — Algolia

A well-researched article on shifting emphasis from incident prevention to learning and resilience.

Incidents cannot be prevented, because incidents are the inevitable result of success.

Alex Elman

This one’s worth reading through twice to let it sink in. It puts me in mind of this article by Will Gallego, which is another thoughtful critique of error budgets.

Here are the claims I’m going to make:

  1. Large incidents are much more costly to organizations than small ones, so we should work to reduce the risk of large incidents.
  2. Error budgets don’t help reduce risk of large incidents.

Lorin Hochstein

This is a review of a few of the chapters of the book of the same title by Emil Stolarsky and Jaime Woo.

Have you read it too? I’d love to read your take on it!

Dean Wilson

This one’s worth reading the next time you need to do an incident retrospective. The traps are:

  1. Counterfactual reasoning
  2. Normative language
  3. Mechanistic reasoning

John Allspaw — Adaptive Capacity Labs

The skill in question is glue work, and I sure appreciate a good gluer when I see one.

Emily Arnott — Blameless

This one starts out by defining SRE, then goes into how to define your team and fill it with people.

Julie Gunderson — PagerDuty

Outages

Updated: January 17, 2021 — 8:36 pm
A production of Tinker Tinker Tinker, LLC