SRE Weekly Issue #73

Articles

Troubleshooting Unusual AWS ELB 5XX Error

ELBs (Amazon’s Elastic Load Balancers) depend on clients properly respecting DNS round-robin record sets. This article follows a debugging session in excellent detail as they try to answer the question: why are our clients preferring (and overloading) just one ELB IP?
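As a rough illustration of what "respecting" a round-robin record set means on the client side, here is a minimal Python sketch. It is not taken from the linked article; the helper names (`unique_addresses`, `resolve_all`, `pick_address`) are hypothetical, and it assumes the client simply picks uniformly at random among all returned addresses rather than always using the first one.

```python
import random
import socket


def unique_addresses(infos):
    """Collect distinct IPs from getaddrinfo-style result tuples, keeping order."""
    seen = []
    for _family, _type, _proto, _canonname, sockaddr in infos:
        ip = sockaddr[0]
        if ip not in seen:
            seen.append(ip)
    return seen


def resolve_all(host, port=443):
    """Return every distinct IP the resolver reports for host (needs network)."""
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    return unique_addresses(infos)


def pick_address(addresses, rng=random):
    """Choose one address at random so load spreads across the whole record set."""
    if not addresses:
        raise ValueError("no addresses to choose from")
    return rng.choice(addresses)
```

A client that always connects to the first address in the list, or caches a single address indefinitely, would concentrate traffic on one IP, which is the kind of symptom the article investigates.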

Serverless: A replacement for servers?

Sarah Schieffer Riehl shares her take on ServerlessConf Austin 2017. She’s got a healthy dose of skepticism that I like, concluding that “serverful and serverless architectures don’t do the same things.” I like this bit:

For processes that require polling or any kind of server wakefulness, converting to a serverless architecture can be an exercise in “serverless for serverless’ sake”.

Premortems: The Art of Negative Visualization

Wow, this dovetails so well with Todd Conklin’s “Safety Moment” from last week, on imagining all the possible things that could go wrong. I’d love to hear more thoughts along these lines: is it possible to design a reliable system without envisioning the majority of things that could go wrong?

Keep Critical Apps and Infrastructure Up and Running

PagerDuty outlines an incident lifecycle management policy based on ITIL.

Introducing Cape

Dropbox created Cape for “asynchronous processing of billions of events a day, powering many Dropbox features”. Example: you upload a text file, and a Cape job indexes it immediately for full-text searching. I’d love to hear more on why existing solutions didn’t fit the bill, although they do cover their requirements in depth.

What a SaaS outage taught me about the meaning of partnerships

When I signed on for my first SRE position, I had no idea how huge a part vendor relations would play in ensuring reliability.

Building the SRE Culture at LinkedIn

Initially, LinkedIn’s SRE team hired engineers based only on technical skill. As they’ve grown, they’ve discovered the importance of collaboration skills as well.

Incident communication best practices

StatusPage.io explains the reasons for having a solid incident communication policy and guides you through setting one up.

The Calculus of Service Availability

As the title suggests, this ACM Queue article goes into some depth on the kinds of calculations one might make when designing a reliable system. Specifically, they focus on service dependencies and introduce Google’s “rule of the extra 9”: a dependency should have one more nine of reliability than the thing that critically depends on it.
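To get a feel for why the extra nine matters, here is a small Python sketch of the arithmetic. It assumes that critical dependencies fail independently and that the service is down whenever any one of them is down; both are simplifying assumptions of mine, and the function names are hypothetical rather than anything from the article.

```python
def nines_to_availability(nines):
    """Convert a count of nines to an availability fraction, e.g. 3 -> 0.999."""
    return 1.0 - 10.0 ** (-nines)


def combined_availability(own, dependencies):
    """Availability of a service that needs itself and every dependency up.

    Assumes independent failures, so the availabilities simply multiply.
    """
    result = own
    for dep in dependencies:
        result *= dep
    return result


# A three-nines service with three critical dependencies.
own = nines_to_availability(3)

# Dependencies at the same level as the service.
same_level = combined_availability(own, [nines_to_availability(3)] * 3)

# Dependencies with one extra nine each.
extra_nine = combined_availability(own, [nines_to_availability(4)] * 3)

print(f"same-level dependencies: {same_level:.5f}")
print(f"extra-nine dependencies: {extra_nine:.5f}")
```

Under these assumptions, dependencies that are only as reliable as the service itself pull the combined figure well below three nines, while dependencies with an extra nine leave it much closer to the service's own availability.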

It Takes More Than a Circuit Breaker to Create a Resilient Application

At the next conference, when somebody tries to sell you a circuit breaker talk, tell them that this is only the starter and ask for the main course.

Outages

Starbucks
- A server outage halted sales at many stores, and some gave out free drinks to mollify customers. Coincidentally, I also was unable to order at Wendy’s the other night due to a “server update”, and they offered me a free Frosty.
Let’s Encrypt
- Certificate issuance was impaired for about 17 hours. They also had an OCSP outage around the same time, but as far as I can tell, this wouldn’t actually cause any impact to end-users of Let’s Encrypt certificates.
Twitter
WhatsApp
AT&T Gets Light FCC Wrist Slap For Largest 911 Outage Ever
- The FCC released a report on AT&T’s 911 outage last March. The cause was apparently a faulty whitelist update.
Instagram
