Articles
But does that mean you don’t need to think about reliability issues associated with large-scale distributed systems? The answer is, not completely. While there are many things that GCP and Cloud Functions handle behind the scenes, you still need to keep a couple of best practices in mind while building a reliable serverless solution.
Slawomir Walkowski — Google
The Emotet malware gang is probably managing their server infrastructure better than most companies are running their internal or external IT systems.
Catalin Cimpanu — Zero Day
Designing a distributed data store is about juggling competing priorities. This author discusses the latency penalty you pay for synchronous replication, and why you might want it anyway.
Daniel Abadi
Learn how Etsy designed tooling and a repeatable process to forecast resource usage.
Daniel Schauenberg — Etsy
Check out how Grab implemented chaos engineering.
Roman Atachiants, Tharaka Wijebandara, Abeesh Thomas — Grab
Neat idea: use machine learning to select which automated tests to run for a given code change. The goal is a high likelihood of finding bugs while running fewer tests than traditional methods.
Mateusz Machalica, Alex Samylkin, Meredith Porth, and Satish Chandra — Facebook
In this blog post, we are going to discuss how the Auth0 Site Reliability team, led by Hernán Meydac Jean, used a progressive approach to build a mature service architecture characterized by high availability and reliability.
The system in question is a home-grown feature flags implementation.
Dan Arias — Auth0
Outages
The usual glut of Black Friday outages. I hope you all had an uneventful Friday.
- J. Crew
- Lowe’s
- Netatmo (smart thermostats)
- John Lewis
- AWS in Seoul, South Korea
  - The outage took down multiple AWS customers, including banks and a cryptocurrency exchange.
- Walmart
- Makro
- LastPass
- Microsoft Azure
  - Linked is a detailed follow-up post describing three distinct “root” causes.