SRE Weekly Issue #140

Articles

My sincerest apologies to Dale Markowitz, the author of this article, whom I mispronouned in last week’s issue. I’m kicking myself, because I didn’t need to use a pronoun at all.

Dale Markowitz — LOGIC Magazine

Linux 4.19-rc4 released, an apology, and a maintainership note

Linus Torvalds made waves this week with an email apologizing for his unprofessional behavior and committing to improving.

Linus Torvalds

Designing for Failure to Avoid Disaster

A pretty detailed article on how LaunchDarkly designed their system for reliability. The streaming vs. polling section is especially interesting.

Adam Zimman — LaunchDarkly

Full disclosure: Fastly, my employer, is mentioned.

LogDevice: a distributed data store for logs – Facebook Code

Lots of details about how they achieve their reliability goals. I’d love to see a follow-up with more detail on why writing a solution in-house made sense versus adopting something like Kafka.

Mark Marchukov — Facebook

13 Reasons a Staging Environment Is Failing in Your Organization – DZone DevOps

A staging environment plays an important part in catching problems before they reach production. If staging isn’t working for your organization, make sure you aren’t making these common mistakes.

Harshit Paul — DZone

Mockers – overcoming testing challenges at Grab

The challenges in question involve testing a microservice’s interactions with other microservices. Read about their system for distributing and running mock servers for each microservice.

Mayank Gupta, K.Vineet Nair, Shivkumar Krishnan, Thuy Nguyen, and Vishal Prakash — Grab

BP is to blame for Deepwater Horizon, but its mistake was actually years of small mistakes.

My partner suggested I look into the Deepwater Horizon incident, and I’m glad I did. My two key takeaways were normalization of deviance and this gem:

Researchers who study disasters tell us that a long period without an accident can be a big risk factor in itself: Workers learn to expect safe operation as the norm and can’t even conceive of a devastating failure.

James B. Meigs — Slate
