I’m still working out all of the kinks for SRE Weekly, so the issue for this “week” is hot on the heels of the last one as I clear out my backlog of articles. Coming soonish: decent CSS.
Articles
Managing the burden of on-call is critical to any organization’s incident response. Tired incident responders make mistakes, miss pages, and don’t perform as effectively. In SRE, we can’t afford to ignore this stuff. Thanks to VictorOps for doing the legwork on this!
A talk at QCon from LinkedIn about how they spread out to multiple datacenters.
A review of how to design a disaster recovery solution, and where virtualization fits into the picture.
Not strictly related to reliability (unless you’re providing ELK as a service, of course), but I’ve found ELK to be very valuable in detecting and investigating incidents. Scaling ELK well can be an art, and in this article, Etsy describes how they set theirs up.
This series of articles is actually the first place I’ve seen DRaaS (disaster recovery as a service) mentioned. I’m not convinced that it makes sense to hire an outside firm to handle your DR, but it’s an interesting concept.
Outages
A weekend outage for Rockstar.
A large hospital network in the US went down, making health records unavailable.
Snapchat suffered an extended outage.
Anonymous is suspected to be involved.
A case-sensitivity bug took down Snapchat, among other users of Google Cloud.
Google’s postmortem analyses are excellent, in my opinion. Their explanations are thorough enough that we can learn a lot from the issues they encounter.