I’m posting this week’s issue from the airport on my way to the west coast for business and SRECon16. I’m hoping to see some of you there! I’ll have a limited amount of incredibly exclusive hand-made SRE Weekly patches to give out — just ask.
Articles
If you can’t safely test your changes in isolation away from prod, you don’t have infrastructure as code.
Thanks to Courtney for this one.
This deployment was an important remediation from the previous outages and seemed like the right thing to do.
Lesson learned: No matter how much you want to remediate a problem for your users, consider the risk, complexity, and timing, and then reconsider again.
Online gaming customers are twice as likely to abandon a game when they experience a network delay of 50 additional milliseconds.
Data centers use complex hardware that presents unforeseen problems that calendar-based maintenance checks simply cannot anticipate.
Outages
- MedStar Health
- HipChat
  - Yet another HipChat outage for beleaguered Atlassian, this one after the above-linked postmortem.
- Spotify
- Wisconsin (US state) voting system
  - More trouble in the US primary election process.
- Sydney, AU Digital radio stations
  - The issue was traced back to Telstra, which had a major equipment failure inside the North Sydney exchange.
- Telstra
  - Telstra’s free data day coincided with many complaints of slowdowns from Telstra users. Traffic surpassed the previous free data day’s data transfer (1.8 PB) by 4pm AEST.
- Sprint