SRE Weekly Issue #105


A quick note: Friday was my last day at Heroku/Salesforce, so don’t be surprised if you see my “full disclosure” notices change.

SPONSOR MESSAGE

See how CloudBees Jenkins Solutions & VictorOps work together to bridge the on-call gap for CI/CD in this webinar. Register today. http://join.cloudbees.com/l/272242/2018-01-09/739hy

Articles

PagerDuty put a call out on Twitter, asking what folks are doing to improve the on-call experience at their companies.

Here’s part three in the series. This one’s about sharding, horizontal scaling, and client versus server complexity.

Here’s how Azure’s new availability zones change the way highly available apps can be designed on Azure.

The Meltdown patch seems to be having a disproportionate impact on Redis performance. Here’s Grab’s story of how they figured out what was up and what they did to deal with it.

I don’t often do the Twitter thing, but this chain by Charity Majors is worth reading. Is that what they call it? A chain?

Google on the advantages of Cloud Spanner’s strong consistency and why to use it. I’m still looking out for an explanation of what the downside to Spanner is…

Just to be clear, this is about how critical it is that Facebook keep their machine learning applications running, rather than using machine learning to design disaster recovery solutions.

This article is about useful error messages, which are important both for the customer experience and for operations. I’m not sure what really qualifies as a “mainframe” these days, though…

LinkedIn is open-sourcing two tools that they use for troubleshooting during incidents. Fossor automates running data-gathering checks, and Ascii Etch displays graphs using ASCII art.

Outages

  • LastPass
  • Slack
  • Spotify
  • Bitbucket
    • Bitbucket has had severe performance problems due to a failure in their storage layer.
  • Kraken (cryptocurrency exchange)
    • This appears to have been a scheduled upgrade that blew up in complexity, preventing Kraken from coming back up for two days. From the article:

      Most astonishing of all, about 36 hours after the upgrade began, Kraken apparently sent their engineers home to take a nap!

      Not that astonishing! Tired engineers make mistakes, after all.

  • Missile threat alert for Hawaii a false alarm
    • There’s so much more to this story than we’ve been told, and I really wish I could be a fly on the wall during the retrospective.
Updated: January 14, 2018 — 8:28 pm
A production of Tinker Tinker Tinker, LLC