Articles
Read about Auth0's transition from multi-cloud to all AWS and how they scaled to 10x the login throughput.
Dirceu Tiegs — Auth0
This article on the emergent behavior of algorithms is well worth thinking about as an SRE. Even without machine learning, our infrastructures have complex emergent behaviors, as you can read in any incident retrospective.
Andrew Smith — The Guardian
This interesting pitfall of chaos engineering stood out to me:
[…] if you hand a team 50 vulnerabilities, they’re probably not going to fix any of them. You know what I mean? So you have to figure out a way to prioritize those …
Andrea Echstenkamper with Nora Jones (Netflix), Ted Strzalkowski (LinkedIn), and Pat Higgins (Gremlin)
Well worth a quick listen (2 minutes 30 seconds).
Todd Conklin — Pre-Accident Podcast
In this series, we’ll dig into different types of observability tools. For each type, we’ll cover what they’re used for, what specific tools are available, some use cases, and any unique characteristics that may come up during your search for a new tool.
Linked above is an introduction to the article series. The first in the series is also out, focusing on time-series metric systems.
Dan Barker
Outages
- Slack
- GitHub
- Duo
  - Duo posted this follow-up analysis of two major outages in the past two weeks.
- Tesla car network
- Heroku Incident #1620
  - Also #1622.
- Microsoft Office 365
- OCBC (bank)
- Scotiabank