AWS had a major Lambda outage in us-east-1, and it took out many customer systems and quite a few other AWS systems, including their support portal.

This person had a fascinating path to SRE, starting their career as a generator repair technician and transitioning through DevOps to SRE.

  Brian Hellinger — Towards AWS

In part 1, they outlined how they replay real traffic to test a new system before deploying it. In this article, they build on that with three additional techniques: sticky canaries, A/B testing, and gradually shifting traffic to the new system in production.

  Shyam Gala, Javier Fernandez-Ivern, Anup Rokkam Pratap, and Devang Shah — Netflix
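
To make the idea of sticky assignment combined with a gradual traffic shift concrete, here is a minimal sketch. It is not taken from the Netflix article: the hashing scheme, the `bucket` and `route` helpers, and the `salt` value are all assumptions chosen for illustration.

```python
import hashlib


def bucket(user_id: str, salt: str = "canary-v1") -> int:
    """Map a user ID to a stable bucket in [0, 100). The salt is hypothetical."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % 100


def route(user_id: str, canary_percent: int) -> str:
    """Send a user to the new system if their bucket is under the rollout percentage."""
    return "new" if bucket(user_id) < canary_percent else "old"


if __name__ == "__main__":
    for percent in (0, 10, 50, 100):
        on_new = sum(route(f"user-{i}", percent) == "new" for i in range(1000))
        print(f"{percent}% rollout: {on_new} of 1000 users on the new system")
```

Because the bucket depends only on the user ID and the salt, a given user always lands on the same side for a given percentage, and raising the percentage only ever moves users from the old system to the new one.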

By comparing status page posting to their independent monitoring of services, Metrist is able to produce statistics about how long companies take to post to their status pages when they have an outage.

  Jeff Martens — Metrist

Improvising during an incident isn’t just a one-off occurrence, and we should plan for it.

  Lorin Hochstein — Surfing Complexity

A foreign key column had a smaller integer data type than the key that it referenced, and it failed once the referenced key's values exceeded what the smaller type could hold.
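
As a rough illustration of that failure mode, the sketch below checks whether a key value fits in a signed integer of a given width. The 32-bit and 64-bit widths are assumptions for the example, not details from the linked article, and `signed_max` and `can_store` are hypothetical helpers.

```python
def signed_max(bits: int) -> int:
    """Largest value a signed integer of the given width can hold."""
    return 2 ** (bits - 1) - 1


def can_store(value: int, bits: int) -> bool:
    """True if the value fits in a signed integer of the given width."""
    return -(2 ** (bits - 1)) <= value <= signed_max(bits)


if __name__ == "__main__":
    # Assumed widths: the referenced key is 64-bit, the foreign key column 32-bit.
    next_id = signed_max(32) + 1  # first key value past the 32-bit range
    print("referenced key can hold it:", can_store(next_id, 64))
    print("foreign key column can hold it:", can_store(next_id, 32))
```

Everything works until the referenced key passes the smaller type's maximum, which is why a mismatch like this can sit unnoticed for a long time.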


Here, we’ll look at the key considerations you need to weigh when designing the architecture of your chat app, the structure and components of that architecture, and some of the technology options that can help you build a reliable chat experience.


A departure from the normal air traffic control procedure allowed the pilots to lose situational awareness. A commonly held myth about flotation equipment contributed to three deaths in a quite survivable accident.

  Admiral Cloudberg

They kept finding what they thought was the problem, and their fixes helped, but the problem kept coming back.

  Tanat Paul Lokejaroenlarb — Adevinta

