SRE Weekly Issue #106

Articles

The Limitations of Chaos Engineering – Production Ready

Chaos engineering is extremely useful, and Mathias Lafeldt has written plenty about its virtues. But as with everything, it’s important to be aware of its pitfalls and shortcomings too.

What Went Wrong In Hawaii, Human Error? Nope, Bad Design.

There’s been a lot of talk of firing (or worse) the person whose actions led to the false alarm in Hawaii. That’s why I’m especially glad to see this excellent analysis by Don Norman (author of The Design of Everyday Things, among others). Bonus content: another article in the same vein with some more interesting tidbits.

In defence of swap: common misconceptions

Think twice before you disable swap, says Chris Down, an author of the upcoming cgroup v2 in the Linux kernel.

SRE Survey 2018

Catchpoint is running a survey of SREs and SRE-like folks, and I’d really appreciate it if you’d take a moment to fill it out. Not only will the resulting data be very interesting, but Catchpoint is donating $5 to charity for every survey completed. Let’s stuff that ballot box and get them to hit their cap of $3000!

Building a Distributed Log from Scratch, Part 4: Trade-Offs and Lessons Learned

The awesome continues this week with a discussion of the importance of simplicity in the design of a reliable system.

What Makes a Failure a Disaster?

This article from Heidi Waterhouse at LaunchDarkly starts off with a really interesting take on the Y2K bug and continues on to discuss risk management in operations.

When letting the user put the system into an invalid state is a desirable property

This short article has an extremely cogent point: design your system to be flexible enough to allow the user to do something seemingly incorrect, because they might need to while responding to an incident!

Project STAR*: Streamlining Our On-Call Process

LinkedIn had a problem: their on-call system was so dysfunctional that they had to scramble to find coverage for an engineer who had been scheduled to be on call while on vacation. They explain how they identified the problem, came up with a solution, and implemented it, including automation and cultural fixes.

Monitoring in a DevOps World

If the phrase “a DevOps World” makes you feel ill, don’t dismiss this article from ACM Queue out of hand. It’s got some great points about designing effective monitoring, and I like the introduction of the “Real Systems Monitoring” concept (akin to “Real User Monitoring” or RUM).

Outages

Heroku
- Heroku had a 29-hour impairment to their application log routing platform.
