Articles
We built Edgar to ease this burden, by empowering our users to troubleshoot distributed systems efficiently with the help of a summarized presentation of request tracing, logs, analysis, and metadata.
Kevin Lew, Maulik Pandey, Narayanan Arunachalam, Dustin Haffner, Andrei Ushakov, Seth Katz, Greg Burrell, Ram Vaithilingam, Mike Smith and Elizabeth Carretto — Netflix
The PDF covers 5 main areas:
- Availability
- Performance
- Monitoring
- Incident Response
- Preparation
You can download the PDF without creating an account or filling out a form.
Splunk/VictorOps
This one’s especially interesting for the section about what MTTx metrics aren’t good for, and the following section on how to improve them.
Emily Arnott — Blameless
If you’re interested in deploying Kafka in a multi-region configuration, eBay has put quite a bit of thought into this and has a lot to share.
Engin Yoeyen — eBay
Straight from someone who was there from the start. The “what chaos engineering is not” section is especially enlightening.
Casey Rosenthal — Verica
The last paragraph regarding “unknown unknowns” is noteworthy.
Heroku
There are some great questions in here on blamelessness and full service ownership.
James Thigpen — Gremlin
Outages
- Google Cloud Platform us-west2 region
  - They posted a detailed follow-up at the above link.
- TikTok
- Network Solutions and Register.com
- Singapore Exchange (SGX)
- Parler