Chaos Engineering and Jepsen-style testing are still in their infancy. As this ACM Queue article explains, deciding which kinds of failure to test is still a manual process that involves building a mental model of the system. Can we automate it?

GitLab shares the story of how they implemented connection pooling and load balancing with read-only replicas in PostgreSQL.

When you have 600,000(!!) tables in one MySQL database, traditional migration tools like mysqldump or AWS’s Database Migration Service start to show cracks. The folks at PressBooks used a different tool instead: mydumper.

AWS Lambda spans multiple availability zones in each region. This author wonders whether it would be more reliable to have separate installations of Lambda running in each availability zone, to protect against failure in Lambda itself.

High-cardinality fields are where all the interesting data exist, says Charity Majors of Honeycomb. But that’s exactly where most monitoring systems break down, leaving you to throw together hacks to work around their limitations.
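
To make the cardinality problem concrete, here is a minimal sketch of how the number of distinct time series grows once a high-cardinality field is added as a tag. It assumes a hypothetical metrics store that keeps one series per unique combination of tag values; the tag names and the `series_count` helper are made up for illustration and do not come from the article.

```python
def series_count(tag_values):
    """Upper bound on distinct series: one per combination of tag values."""
    count = 1
    for values in tag_values.values():
        count *= len(values)
    return count

# Hypothetical low-cardinality tags: a handful of values each.
low = {"region": ["us", "eu"], "status": ["2xx", "4xx", "5xx"]}

# Same tags plus one high-cardinality field (assumed 100,000 users).
high = dict(low, user_id=range(100_000))

low_series = series_count(low)
high_series = series_count(high)

print(low_series)   # 6
print(high_series)  # 600000
```

This is a worst-case bound, since not every combination occurs in practice, but it shows why a store that indexes each tag combination separately tends to struggle with a field like a user ID.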

Google shares some best practices for building Service Level Objectives.

Hosted Graphite brings candidates in to work with them for a day and pays them for their time.

Grueling is right: their entire team came to the office over the weekend to work on the outage. Lesson learned:

When something goes horribly wrong, don’t bring everybody in. More ideas are good to a point, but if you don’t solve it in the window of a normal human’s ability to stay awake, the value they are giving you goes down exponentially as they get tired.

Google’s Project Aristotle discovered that the number one predictor of successful teams is psychological safety. The anecdotes in this piece show how psychological safety is also critical in analyzing incidents.


