This episode of Greater Than Code features John Allspaw, and it’s pretty much as awesome as I expected. Some highlights:
- rather than asking how an incident happened, ask what prevented it from being worse
- ask “how” rather than “why” an incident happened
- humans plus technology are together a cognitive system
- how can you make automation a team player?
Janelle Klein, John Sawers, Rein Henrichs, and Jessica Kerr, with John Allspaw
What does cold start look like on various FaaS platforms? This article has hard numbers obtained through empirical testing.
Colm MacCárthaigh explains how shuffle sharding improves reliability by acting like some kind of magic lever made of math.
Colm MacCárthaigh — AWS (thanks to Thread Reader for the thread rollup)
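To get a feel for the math behind that lever, here is a minimal sketch of the general shuffle sharding idea, not the implementation described in the thread. The names `shuffle_shard`, `WORKERS`, and `SHARD_SIZE` are hypothetical, and the numbers (8 workers, 2 per customer) are assumptions chosen for illustration. Each customer gets a small, pseudo-random subset of workers, so a customer that poisons its own shard fully overlaps with only a small fraction of the other customers.

```python
import hashlib
import random
from math import comb

# Hypothetical fleet and shard size, chosen only for illustration.
WORKERS = [f"worker-{i}" for i in range(8)]
SHARD_SIZE = 2


def shuffle_shard(customer_id: str, workers: list, shard_size: int) -> frozenset:
    """Deterministically pick `shard_size` workers for one customer."""
    # Seed a private RNG from a hash of the customer ID so the same
    # customer always maps to the same shard.
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return frozenset(rng.sample(workers, shard_size))


# Number of distinct shards: "n choose k" combinations of workers.
num_shards = comb(len(WORKERS), SHARD_SIZE)

# If shards are assigned uniformly at random, this is the chance that
# another customer lands on exactly the same set of workers.
full_overlap_probability = 1 / num_shards

if __name__ == "__main__":
    print(sorted(shuffle_shard("customer-a", WORKERS, SHARD_SIZE)))
    print(num_shards, full_overlap_probability)
```

With plain sharding into 4 fixed pairs, one bad customer would take down a quarter of the customers. With the shuffled assignment above there are 28 possible pairs, so only about 1 in 28 customers would share both workers with the bad one, assuming clients retry against the other worker in their shard.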
Who cares if your CDN has an eleventeen terabaud backbone uplink? What really matters is how well they can serve your traffic.
Matt Levine — CacheFly
An engineer pushes a small change and OkCupid goes up in flames.
A new, entry-level employee takes down a big site, due to a bug not in his software but in a dependency.
Dale Markowitz — LOGIC Magazine (Issue #5)
What happens when you mix Observability and Serverless? Corey Quinn of Last Week in AWS lets you in on the hottest new operations practice.
How will climate change and rising sea levels impact the reliability of our networks?
Carol Barford — iAfrikan
I watched this Nova (PBS) episode this week, and I highly recommend it. It explores why trains crash and what governments are doing to improve safety. The link above requires membership, but you can also watch it on Netflix.
- Google Cloud Storage
- Linked is Google’s apology and followup analysis. Other Google Cloud Platform services dependent on GCS were also impacted.
- Azure status
- August 31: “Our hosting provider will be restarting a significant number of servers during this time window.”
- August 31: “Our provider has taken down more than expected capacity.”
- Amazon.com search
- Azure (South Central US) and Office 365
- Lightning strikes took out the cooling systems, causing an emergency shutdown.