The Lunch Exercise was my favorite part of the Blackrock3 training, and now Slack has adapted it for their ongoing training.
How Slack trains engineers in incident response by ordering lunch together.
Scott Nelson Windels — Slack
Cloudflare runs programs written in their custom language Topaz in the hot path. They use formal verification in production(!) to ensure that the set of Topaz programs makes sense.
James Larisch, Suleman Ahmad, and Marwan Fayed — Cloudflare
Distributed counting is a challenging problem in computer science. In this blog post, we’ll explore the diverse counting requirements at Netflix, the challenges of achieving accurate counts in near real-time, and the rationale behind our chosen approach, including the necessary trade-offs.
Rajiv Shringi, Oleksii Tkachuk and Kartik Sathyanarayanan — Netflix
It’s hard, and this article explains why in excellent detail. It also includes a discussion of options to consider when designing a chat system.
Ably
In anticipation of https://aws-news.com’s busiest period of the year, I redesigned the API access patterns to support very effective caching. This resulted in significantly reduced backend load and a much faster frontend.
Luc van Donkersgoed — AWS News
Recover means that not only is everything back online, but the system is performing well and satisfying any QoS or SLAs AND a preventative approach has been implemented.
Will Searle — Causely
Here’s a list of recommended talks for SREs attending re:Invent, with short descriptions explaining why they’re interesting.
Jamie Baker
In this post, I’ll share exactly how we link our code to the team that owns it, so errors and alerting are routed to the right place with minimal maintenance burden.
Martha Lambert — incident.io