Articles
DNS-based load balancing is a nice, simple solution, but unfortunately it doesn’t work well in certain circumstances. Read on to find out how Algolia evolved their load balancing system in response.
Paul Berthaux — Algolia
We use percentiles all the time, so it’s really important to actually understand what they say (and what they don’t).
Piyush Verma — Last9
Thanks to an anonymous reader for this one.
The author started out as an embedded systems developer and moved into SRE. Here’s what they learned.
Eric Uriostigue — effx
Some great tips here. It’s hard to sound sincere in a public incident report, especially if you post a lot of them.
Adam Fowler
In this blog, we discuss how we built Fare Storage, Grab’s single source of truth fare data store, and how we overcame the challenges to make it more reliable and scalable to support our expanding features.
Sourabh Suman — Grab
This article covers Netflix’s gnmi-gateway, their open source tool for collecting metrics from network devices in a highly available and fault-tolerant manner.
Colin McIntosh and Michael Costello — Netflix
This year, re:Invent is online only, so you still have a chance to attend if you’re interested.
Ana M Medina — Gremlin
Cloudflare’s API service was impaired early this month. This is their incident report that describes a grey failure in a switch and downstream impact to etcd and their database system.
Tom Lianza and Chris Snook — Cloudflare
Outages
- Slack
- Giphy
- Spotify
- Currys PC World
- DoorDash
- Amazon Prime Video
- AWS
  - This link points to Amazon’s detailed report on the outage.