SRE Weekly Issue #208

A message from our sponsor, VictorOps:

Learn about some subtler, lesser-known use cases for Splunk + VictorOps that drive a more analytical, proactive approach to incident response:

https://go.victorops.com/sreweekly-splunk-for-analytical-incident-response

Articles

There’s so much in this article:

  • how to recognize when your system may be susceptible to cascading failure
  • how to prevent it
  • how to deal with it when it happens (and how hard that can be)

Laura Nolan — Slack

It’s time for this year’s SRE Survey. Don’t forget that with each completed survey, Catchpoint donates $5 to charity.

This growing demand [for SREs] is not without growing pains as a skills gap problem has emerged due to the fact that SRE training requires a hands-on, interactive learning environment.

Peter Murray — Catchpoint

Both the summary and the original article are well worth reading. This stood out to me:

As much as we may think of incidents as taking place in all those technical parts of the system below the line, incidents actually take place above it

Thai Wood (summary)

Dr. Richard Cook (original article)

The EBS control plane data store resembles a “jellyfish” (actually a Physalia, a.k.a. Portuguese man-of-war).

Timothy Prickett Morgan — The Next Platform

Ideal: each team manages their microservice(s) in isolation.

Reality: microservices interact in unexpected ways and a broader system emerges that has remarkable similarities to running a monolith.

Ben Sigelman — LightStep

This one discusses how to handle SRE for a monolith and gives some examples of what often goes wrong.

Eric Harvieux — Google

The author blocked an unexpected Sunday deploy of untested code, and it turned out to be a good thing they did.

rachelbythebay

Outages

Updated: February 23, 2020 — 8:55 pm
A production of Tinker Tinker Tinker, LLC