SRE Weekly Issue #137

Articles

Auth0 Architecture: Running In Multiple Cloud Providers And Regions

Read about their transition from multi-cloud to all AWS and how they scaled to 10x the login throughput.

Dirceu Tiegs — Auth0

Franken-algorithms: the deadly consequences of unpredictable code

This article on the emergent behavior of algorithms is well worth thinking about as an SRE. Even without machine learning, our infrastructures have complex emergent behaviors, as you can read in any incident retrospective.

Andrew Smith — The Guardian

Netflix, LinkedIn and Gremlin Engineers Talk Chaos Engineering – The New Stack

This interesting pitfall of chaos engineering stood out to me:

[…] if you hand a team 50 vulnerabilities, they’re probably not going to fix any of them. You know what I mean? So you have to figure out a way to prioritize those …

Andrea Echstenkamper with Nora Jones (Netflix), Ted Strzalkowski (LinkedIn), and Pat Higgins (Gremlin)

We want machines to be people and people to be machines. What is wrong with us?

Well worth a quick listen (2 minutes 30 seconds).

Todd Conklin — Pre-Accident Podcast

Now available: The open source guide to DevOps monitoring tools

In this series, we’ll dig into different types of observability tools. For each type, we’ll cover what they’re used for, what specific tools are available, some use cases, and any unique characteristics that may come up during your search for a new tool.

Linked above is an introduction to the article series. The first in the series is also out, focusing on time-series metric systems.

Dan Barker

Outages

Slack
GitHub
Duo
- Duo posted this follow-up analysis of two major outages in the past two weeks.
Tesla car network
Heroku Incident #1620
- Also #1622.
Microsoft Office 365
OCBC (bank)
Scotiabank
