Articles
Sometimes I come across a simple but mind-blowingly awesome new idea. This is one of those times.
During periods of high load and errors, Netflix’s edge load balancer sends feedback to the apps running on users’ devices, adjusting their retry and backoff strategy to keep the service running as smoothly as possible while avoiding a thundering herd. Brilliant.
Manuel Correa, Arthur Gonigberg, and Daniel West — Netflix
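To make the idea concrete, here is a minimal sketch of a client whose retry policy can be tightened by server feedback. It is not Netflix's implementation: the `RetryPolicy` class, the feedback keys (`max_retries`, `min_backoff`), and the default values are all hypothetical, and full-jitter exponential backoff is just one common way to spread retries out.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class RetryPolicy:
    """Client-side retry settings (hypothetical defaults)."""
    max_retries: int = 3
    base_delay: float = 0.5   # seconds
    max_delay: float = 30.0   # seconds


def apply_server_feedback(policy, feedback):
    """Return a policy tightened by server feedback.

    `feedback` is an assumed dict the server might send during overload.
    The server can only make the client less aggressive: fewer retries,
    longer minimum backoff.
    """
    return RetryPolicy(
        max_retries=min(policy.max_retries,
                        feedback.get("max_retries", policy.max_retries)),
        base_delay=max(policy.base_delay,
                       feedback.get("min_backoff", policy.base_delay)),
        max_delay=policy.max_delay,
    )


def backoff_delay(policy, attempt, rng=random.random):
    """Full-jitter exponential backoff for a 0-based retry attempt.

    Picks a delay uniformly between 0 and the exponential ceiling so that
    many clients retrying at once do not synchronize into a thundering herd.
    """
    ceiling = min(policy.max_delay, policy.base_delay * (2 ** attempt))
    return rng() * ceiling


# Example: under load, the server asks clients to retry once at most
# and to wait longer between attempts.
normal = RetryPolicy()
degraded = apply_server_feedback(normal, {"max_retries": 1, "min_backoff": 4.0})
```

The key design choice in this sketch is that feedback is one-directional: a server under stress can dial clients down, but a client never becomes more aggressive than its built-in defaults.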
I helped to invent new approaches to correlate telemetry signals (exemplars, correlation between tracing and logging, profiler labels) that helped our engineers to navigate latency problems faster.
Facebook has two very different users for live streaming: “normal” users and broadcasters streaming sporting events and the like.
Hemal Khatri, Alex Lambert, Jordi Cenzano and Rodrigo Broilo — Facebook
This article covers the outcomes of research performed in 2019 on how engineers at Google debug production issues, including the types of tools, high-level strategies, and low-level tasks that engineers use in varying combinations to debug effectively.
Charisma Chan and Beth Cooper — Google
The three patterns discussed in this paper are:
- decompensation
- working at cross purposes
- getting stuck in outdated behaviors
David Woods and Matthieu Branlat
Outages
- Gmail
- Microsoft 365
- Apple iCloud
- Netflix
- GitHub
  - Apparently GitHub also had an expired TLS certificate later in the week.
- Tabcorp