Articles
ELBs (Amazon’s Elastic Load Balancers) depend on clients properly respecting DNS round-robin record sets. This article follows a debugging session in excellent detail as the authors try to answer the question: why are our clients preferring (and overloading) just one ELB IP?
Sarah Schieffer Riehl shares her take on ServerlessConf Austin 2017. She’s got a healthy dose of skepticism that I like, concluding that “serverful and serverless architectures don’t do the same things.” I like this bit:
For processes that require polling or any kind of server wakefulness, converting to a serverless architecture can be an exercise in “serverless for serverless’ sake”.
Wow, this dovetails so well into Todd Conklin’s “Safety Moment” from last week, on imagining all the possible things that could go wrong. I’d love to hear more thoughts along these lines: is it possible to design a reliable system without envisioning the majority of things that could go wrong?
PagerDuty outlines an incident lifecycle management policy based on ITIL.
Dropbox created Cape for “asynchronous processing of billions of events a day, powering many Dropbox features”. Example: you upload a text file, and a Cape job indexes it immediately for full-text searching. I’d love to hear more on why existing solutions didn’t fit the bill, although they do cover their requirements in depth.
When I signed on for my first SRE position, I had no idea how huge a part vendor relations would play in ensuring reliability.
Initially, LinkedIn’s SRE team hired engineers based only on technical skill. As they’ve grown, they’ve discovered the importance of collaboration skills as well.
StatusPage.io explains the reasons for having a solid incident communication policy and guides you through setting one up.
As the title suggests, this ACM Queue article goes into some depth on the kinds of calculations one might make when designing a reliable system. Specifically, they focus on service dependencies and introduce Google’s “rule of the extra 9”: a dependency should have one more nine of reliability than the thing that critically depends on it.
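To get a feel for why the extra nine matters, here is a rough sketch of the arithmetic. It assumes a simplified model that is mine, not necessarily the article’s: every critical dependency fails independently, and the service is down whenever any one of them is down, so availabilities multiply. The five-dependency count and the function names are hypothetical, chosen only for illustration.

```python
import math

MINUTES_PER_YEAR = 365 * 24 * 60


def composite_availability(own, dependencies):
    """Availability of a service given its own availability and those of
    its critical dependencies.

    Assumption (simplified model): failures are independent and any single
    dependency outage takes the service down, so the values multiply.
    """
    return own * math.prod(dependencies)


def downtime_minutes_per_year(availability):
    """Expected unavailable minutes per (non-leap) year."""
    return (1.0 - availability) * MINUTES_PER_YEAR


if __name__ == "__main__":
    target = 0.9999  # the service aims for four nines

    # Hypothetical: five critical dependencies, and the service itself
    # never fails on its own (own=1.0), to isolate the dependency effect.
    same_nines = composite_availability(1.0, [0.9999] * 5)
    extra_nine = composite_availability(1.0, [0.99999] * 5)

    for label, value in [("same nines", same_nines), ("extra nine", extra_nine)]:
        verdict = "meets" if value >= target else "misses"
        print(f"{label}: {value:.6f} "
              f"({downtime_minutes_per_year(value):.1f} min/year), "
              f"{verdict} the {target} target")
```

Under these assumptions, dependencies that are only as reliable as the service itself eat the whole error budget several times over, while dependencies with one more nine leave some room for the service’s own failures.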
At the next conference, when somebody tries to sell you a circuit breaker talk, tell them that this is only the starter and ask for the main course.
Outages
- Starbucks
  - A server outage halted sales at many stores, and some gave out free drinks to mollify customers. Coincidentally, I also was unable to order at Wendy’s the other night due to a “server update”, and they offered me a free Frosty.
- Let’s Encrypt
  - Certificate issuance was impaired for about 17 hours. They also had an OCSP outage around the same time, but as far as I can tell, this wouldn’t actually cause any impact to end-users of Let’s Encrypt certificates.
- AT&T Gets Light FCC Wrist Slap For Largest 911 Outage Ever
  - The FCC released a report on AT&T’s 911 outage last March. The cause was apparently a faulty whitelist update.