SRE Weekly Issue #138


A dedication to SRE will improve the lives of your customers and team. For our August Roundup, we’ve compiled a list of top SRE articles to help you keep up with the latest news, tips, and topics in SRE.


This episode of Greater Than Code features John Allspaw, and it’s pretty much as awesome as I expected. Some highlights:

  • rather than asking how an incident happened, ask what prevented it from being worse
  • ask “how” rather than “why” an incident happened
  • humans plus technology are together a cognitive system
  • how can you make automation a team player?

Janelle Klein, John Sawers, Rein Henrichs, and Jessica Kerr, with John Allspaw

What does cold start look like on various FaaS platforms? This article has hard numbers obtained through empirical testing.

Mikhail Shilkov

Colm MacCárthaigh explains how shuffle sharding improves reliability by acting like some kind of magic lever made of math.

Colm MacCárthaigh — AWS (thanks to Thread Reader for the thread rollup)

Who cares if your CDN has an eleventeen terabaud backbone uplink? What really matters is how they can serve your traffic.

Matt Levine — CacheFly

An engineer pushes a small change and OkCupid goes up in flames.

A new, entry-level employee takes down a big site, not because of a bug in his own software, but because of one in a dependency.

Dale Markowitz — LOGIC Magazine (Issue #5)

What happens when you mix Observability and Serverless? Corey Quinn of Last Week in AWS lets you in on the hottest new operations practice.

Corey Quinn

How will climate change and rising sea levels impact the reliability of our networks?

Carol Barford — iAfrikan

I watched this Nova (PBS) episode this week, and I highly recommend it. It explores why trains crash and what governments are doing to improve safety. The link above requires membership, but you can also watch it on Netflix.



Updated: September 9, 2018 — 8:48 pm
A production of Tinker Tinker Tinker, LLC