SRE Weekly Issue #105

A quick note: Friday was my last day at Heroku/Salesforce, so don’t be surprised if you see my “full disclosure” notices change.

Articles

Page It Forward!

PagerDuty put a call out on Twitter, asking what folks are doing to improve the on-call experience at their companies.

Building a Distributed Log from Scratch, Part 3: Scaling Message Delivery

Here’s part three in the series. This one’s about sharding, horizontal scaling, and client versus server complexity.

What are Azure Availability Zones and why should you use them

Here’s how Azure’s new availability zones change the way highly available apps can be designed on Azure.

Dealing with the Meltdown patch at Grab

The Meltdown patch seems to be having a disproportionate impact on Redis performance. Here’s Grab’s story of how they figured out what was up and what they did to deal with it.

Twitter: mipsytipsy on monitoring vs observability

I don’t often do the Twitter thing, but this chain by Charity Majors is worth reading. Is that what they call it? A chain?

Google Cloud Platform Blog: Why you should pick strong consistency, whenever possible

Google on the advantages of Cloud Spanner’s strong consistency and why to use it. I’m still looking out for an explanation of what the downside to Spanner is…

Machine Learning Drives Changing Disaster Recovery At Facebook

Just to be clear, this is about how critical it is that Facebook keep their machine learning applications running, rather than using machine learning to design disaster recovery solutions.

Planning Better for Failure: How Mainframe Error Messages Impact CX

This article is about useful error messages, which are important both for the customer experience and for operations. I’m not sure what really qualifies as a “mainframe” these days, though…

Automating Your Oncall: Open Sourcing Fossor and Ascii Etch

LinkedIn is open-sourcing two tools that they use for troubleshooting during incidents. Fossor automates data gathering, and Ascii Etch displays graphs using ASCII art.

Outages

LastPass
Slack
Spotify
Bitbucket
- Bitbucket has had severe performance problems due to a failure in their storage layer.
Kraken (cryptocurrency exchange)
- This appears to have been a scheduled upgrade that blew up in complexity, preventing Kraken from coming back up for two days. From the article:
  
  Most astonishing of all, about 36 hours after the upgrade began, Kraken apparently sent their engineers home to take a nap!
  
  Not that astonishing! Tired engineers make mistakes, after all.
Missile threat alert for Hawaii a false alarm
- There’s so much more to this story than we’ve been told, and I really wish I could be a fly on the wall during the retrospective.
