SRE Weekly Issue #263

A message from our sponsor, StackHawk:

You can utilize Swagger Docs in security testing to drive more thorough and accurate vulnerability scans of your APIs. Learn how:
http://sthwk.com/swagger-api-testing

Articles

They make a really clear case for why traditional metrics and monitoring couldn’t help them solve their problems.

Mads Hartmann

This article commemorates the death of NASA flight director Glynn Lunney by showing the SRE lessons we can learn from him.

Robert Barron

I like that this focuses on human factors.

Kevin Casey

Dealing with both the increased expectations and challenges of reliability as you scale is difficult. You’ll need to maintain your development velocity and build customer trust through transparency.

Blameless

Uber’s customers are especially likely to be moving around and going in and out of tunnels, losing connectivity along the way. That means it’s difficult to tell when the client should fail over to a different server.

Sivabalan Narayanan, Rajesh Mahindra, and Christopher Francis — Uber

Here’s one I missed from last November. Some good stuff to learn from, especially if you run Vault on Kubernetes.

This outage was caused by a cascading failure stemming from our secrets management engine, which is a dependency of almost all of the production GoCardless services.

Ben Wheatley — GoCardless

Outages

Updated: March 28, 2021 — 9:19 pm
A production of Tinker Tinker Tinker, LLC