SRE Weekly Issue #288

A message from our sponsor, StackHawk:

Want to see what’s new with automated security tooling? Tune in on September 30 to see how StackHawk and Semgrep are making it possible to embed security testing in CI/CD.
https://sthwk.com/whats-new-webinar

Articles

Faced with a difficult hiring market for SREs, they embarked on a well-designed, carefully thought out program to hire and train entry-level folks as SREs — and it worked!

Thomas Betts — InfoQ

No matter how good your tooling is, how experienced you are, or how much you’ve prepared, incidents can still be hard.

Five people share what they find hardest during incident response.

Chris Evans — incident.io

This one has a lot of ideas about how to guide developers toward full ownership of their services in production.

Ambassador

In this post, I will cover the following modes of system resilience:

  • Adaptive Response
  • Superior Monitoring
  • Coordinated Resilience
  • Heterogeneous Systems
  • Dynamic Repositioning
  • Requisite Availability

Ash P — Cruform

Root cause of success: unpatched security vulnerability

That moment when a security vulnerability allows you to break into your own infrastructure, averting disaster during an incident.

Lorin Hochstein, with incident story by Eric Dobbs

A migration didn’t go as planned, and customer traffic lost its way.

Heroku

I’m a big believer in human-in-the-loop automation. My favorite part of this article was this:

A further problem is that full automation — which aims to take the human out of the picture — requires a complete, nuanced understanding of a system and all potential outcomes, paradoxically resulting in heightened system complexity.

Tina Huang — Transposit

Outages

Updated: June 1, 2022 — 9:45 pm
A production of Tinker Tinker Tinker, LLC