These folks decided that Google Cloud wasn’t for them, and they built and migrated to their own datacenter in 9 months. This article goves over the physical buildout.
Charith Amarasinghe — Railway
I remember when this incident happened in 2017. It was a huge one, and GitLab was very open with information about what happened. Here’s a look back at what happened.
Byte-Sized Design
When your distributed system deals in nanosecond precision, an extra second is a big deal.
Oleg Obleukhov and Patrick Cullen — Meta
Learn how AWS uses formal verification and other techniques.
Alongside industry-standard testing methods (such as unit and integration testing), AWS has adopted model checking, fuzzing, property-based testing, fault-injection testing, deterministic simulation, event-based simulation, and runtime validation of execution traces.
Marc Brooker and Ankush Desai — ACM Queue
Normally, we rely on the thoughts, decisions, and actions of individuals to create resilizence in our sociotechnical systems, but in some time-critical situations, it can be best for one expert to call the shots.
Robert Poston, MD
You do not have to choose between gold-plating dressed as craftsmanship or perfectionism and corner-cutting framed as pragmatism or realism. You can have the quality of the former at the speed and focus of the latter. I call this the Best Simple System for Now.
Dan North & Associates
This is the first I’ve heard of I-PASS, and I like it!
u/devoopseng — r/sre
This article is a roundup of schools of thought on how systems fail, with a pretty excellent list of links to related articles at the end.
Evan Smith