Articles
This infographic shows how Ably’s client library and backend infrastructure are designed to work around many common failure modes. My favorite: they have redundant TLS certificates from distinct issuers.
Matthew O’Riordan — Ably
This article argues that spending a little time to fix staging can make production significantly more stable.
Michael Nygard
This is a story of a flawed development process on top of a flawed infrastructure, without the necessary data to drive decision-making. It’s also a story of waking up to these problems and charting a way out.
[…]
As it turns out, pure reasoning cannot solve the kind of problems you see in the production environment of a complex application. These problems are almost always more difficult, since they have survived all of the testing you could throw at them.
John Casey
A story of a somewhat rare failure case (a datacenter heat buildup event) and how to monitor for such a thing without contributing to metrics overload.
Pavel Trukhanov — okmeter
On Twitter this week, @srhtcn noted that “Many incidents happen during or right after release” and asked for advice on ways to fix this.
Great advice, useful for managers and individual contributors.
Charity Majors
Outages
- Apple CloudKit
There appears to be some prolonged issues with Apple’s CloudKit service today, which Apple offers to developers as a way to store user data and sync across devices. Several developers have reported to us that they have seen data for their apps temporarily wiped in the last 24 hours as the CloudKit service experiences some form of outage.
- Heroku
- Commonwealth Bank (AU)
- Coles (Supermarket chain)
- Sydney, AU train system
- reddit
- And another one.