Articles
This article is highly technical without being overwhelmingly detailed.
It is very important that a database user is aware of the isolation level guaranteed by the database system, and what concurrency bugs may emerge as a result.
Daniel Abadi
The traps are:
- You don’t have enough cross-team usage or buy-in.
- Your difficult and dense process is slowing down incident response.
- Postmortems are underutilized and don’t encompass in-depth learnings.
- You wait for incidents to happen.
- You stop at incident management without SLOs.
Lyon Wong — Blameless
Need to argue the benefits of implementing distributed tracing in your organization? This article will help you get started.
dm03514
The question is: what is the proper role of alerting in the modern era of distributed systems? Have alerting best practices changed with the shift from monitoring and known-unknowns to observability and unknown-unknowns?
Charity Majors
Round-robin load balancing often isn’t good enough; requests need to be routed intelligently to nodes that aren’t overloaded. How do you get information about backend health to distributed load balancer nodes efficiently? Answer: add a response header.
Haowei Yuan, Yi-Shu Tai, and Dmitry Kopytkov — Dropbox
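To make the response-header idea concrete, here is a minimal sketch of a balancer that learns backend load from responses it has already proxied. It is not taken from the Dropbox article: the header name `X-Backend-Load`, the `HeaderAwareBalancer` class, and the "pick the less loaded of two random backends" selection rule are all assumptions chosen for illustration.

```python
import random
from typing import Dict, Iterable, Mapping, Optional

# Hypothetical header name; a real deployment would pick its own.
LOAD_HEADER = "X-Backend-Load"


class HeaderAwareBalancer:
    """Tracks per-backend load as reported in response headers."""

    def __init__(self, backends: Iterable[str], rng: Optional[random.Random] = None):
        # Until a backend reports otherwise, assume it is idle.
        self.load: Dict[str, float] = {b: 0.0 for b in backends}
        self.rng = rng or random.Random()

    def pick(self) -> str:
        # Sample up to two distinct backends and take the less loaded one.
        # This avoids every balancer stampeding toward a single "best" node.
        candidates = self.rng.sample(sorted(self.load), k=min(2, len(self.load)))
        return min(candidates, key=lambda b: self.load[b])

    def observe(self, backend: str, headers: Mapping[str, str]) -> None:
        # Update our view of a backend from the response it just returned.
        raw = headers.get(LOAD_HEADER)
        if raw is None:
            return
        try:
            self.load[backend] = float(raw)
        except ValueError:
            pass  # Ignore malformed values and keep the previous estimate.
```

Because the load figure rides along on ordinary responses, each balancer node gets fresh data in proportion to the traffic it sends, with no separate health-reporting channel.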
By adding in-memory caching with a mere 3-second TTL, these folks achieved a ~75% cache hit rate, allowing them to withstand request spikes without an outage.
Mina Gyimah — Pusher
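As a rough illustration of why even a very short TTL helps, the sketch below caches each value for three seconds, so a burst of identical requests reaches the backing store at most once per window. The `TTLCache` class and its `get_or_load` method are hypothetical names, not Pusher's code, and the sketch is not thread-safe.

```python
import time
from typing import Any, Callable, Dict, Hashable, Tuple


class TTLCache:
    """Tiny in-memory cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds: float = 3.0,
                 clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # Injectable so expiry can be tested without sleeping.
        self._store: Dict[Hashable, Tuple[float, Any]] = {}
        self.hits = 0
        self.misses = 0

    def get_or_load(self, key: Hashable, loader: Callable[[Hashable], Any]) -> Any:
        now = self.clock()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            # Entry is still fresh: serve it without touching the backend.
            self.hits += 1
            return entry[1]
        # Missing or expired: load once and remember it for the next ttl seconds.
        self.misses += 1
        value = loader(key)
        self._store[key] = (now + self.ttl, value)
        return value
```

The `hits` and `misses` counters make it easy to measure the hit rate under real traffic before settling on a TTL.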
Outages
- Tokbox
- Thanks to Aos Dabbagh for this one.
- Chef (system administration tool)
- Many of us saw our Chef runs fail after a former Chef employee removed their own code. Chef posted a follow-up explaining its position on the matter.
- Fastly
- Net4 (hosting provider)
- Salesforce
- Google Search
- Heroku
- Squarespace
- Also this one.