SRE Weekly Issue #140

SPONSOR MESSAGE

Are you exploring serverless architecture on AWS? Check out this post to get step-by-step instructions for setting up and maintaining DynamoDB to keep it from waking you up with unactionable alerts:

http://try.victorops.com/sreweekly/dynamodb-and-aws

Articles

My sincerest apologies to Dale Markowitz, the author of this article who I mispronouned in last week’s issue. I’m kicking myself, because I totally didn’t need to use a pronoun at all.

Dale Markowitz — LOGIC Magazine

Linus Torvalds made waves this week with an email apologizing for his unprofessional behavior and committing to improving.

Linus Torvalds

A pretty detailed article on how LaunchDarkly designed their system for reliability. The streaming vs. polling section is especially interesting.

Adam Zimman — LaunchDarkly

Full disclosure: Fastly, my employer, is mentioned.

Lots of details about how they achieve their reliability goals. I’d love to see a followup with more detail on why writing a solution in-house made sense versus adopting something like Kafka.

Mark Marchukov — Facebook

The staging environment plays an important part. If staging isn’t working for your organization, make sure you aren’t making these common mistakes.

Harshit Paul — DZone

The challenges in question involve testing a microservice’s interactions with other microservices. Read about their system for distributing and running mock servers for each microservice.

Mayank Gupta, K.Vineet Nair, Shivkumar Krishnan, Thuy Nguyen, and Vishal Prakash — Grab

My partner suggested I look into the Deepwater Horizon incident, and I’m glad I did. My two key takeaways were normalization of deviance and this gem:

Researchers who study disasters tell us that a long period without an accident can be a big risk factor in itself: Workers learn to expect safe operation as the norm and can’t even conceive of a devastating failure.

James B. Meigs — Slate

Outages

Updated: September 23, 2018 — 9:05 pm
SRE WEEKLY © 2015 Frontier Theme