SRE Weekly Issue #263

Articles

[Increment: Reliability] Tracing a path to observability

They make a really clear case for why traditional metrics and monitoring couldn’t help them solve their problems.

Mads Hartmann

This article commemorates the death of NASA flight director Glynn Lunney by showing the SRE lessons we can learn from him.

Robert Barron

7 top Site Reliability Engineer (SRE) job interview questions

I like that this focuses on human factors.

Kevin Casey

How to Scale for Reliability and Trust

Dealing with both the increased expectations and challenges of reliability as you scale is difficult. You’ll need to maintain your development velocity and build customer trust through transparency.

Blameless

Engineering Failover Handling in Uber’s Mobile Networking Infrastructure

Uber’s customers are especially likely to be moving around and going in and out of tunnels, losing connectivity along the way. That means it’s difficult to tell when the client should fail over to a different server.

Sivabalan Narayanan, Rajesh Mahindra, and Christopher Francis — Uber

Incident review: Service outage on 25 October 2020

Here’s one I missed from last November. Some good stuff to learn from, especially if you run Vault on Kubernetes.

This outage was caused by a cascading failure stemming from our secrets management engine, which is a dependency of almost all of the production GoCardless services.

Ben Wheatley — GoCardless

Outages

Gmail and a ton of other Android apps
- This one’s kind of weird. Google presented it as a Gmail outage, but it’s actually a problem with the Android System WebView component. Tons of apps were crashing.
MangaDex
Canvas
Instagram
