A packed issue this week, with a really exciting discovery/announcement up top. Thanks to all of the awesome folks on the hangops Slack, and especially #incident_response, for tips, feedback, and general awesomeness.
Articles
They’ve also started the Postmortem Report Reviews project, in which contributors submit “book reports” on incident postmortems (both past and current). PRs with new reports are welcome, and I hope you all will consider writing at least one. I know I will!
This is exactly the kind of development I was hoping to see in SRE and I couldn’t be happier. I look forward to supporting the OIB project however I can, and I’ll be watching them closely as they get organized. Good luck and great work, folks!
Thanks to Charity Majors for pointing OIB out to me.
Say… wouldn’t it be neat to start a Common Reliability Risks Database or something?
Fail fast and roll forward simply aren’t sustainable in many of today’s most core business applications such as banking, retail, media, manufacturing or any other industry vertical.
Thanks to Devops Weekly for this one.
Outages
- Datadog
- Tinder
Predictable hilarity ensued on Twitter.
- HipChat Status
Atlassian’s HipChat has had a rocky week with several outages. They posted an initial description of the problems and a promise of a detailed postmortem soon.
Thanks to dbsmasher on hangops #incident_response for the tip on this one.
- Data Centre Outage Causes Drama For Theatre Ticket Seller
A switch failure takes out a ticket sales site. It’s interesting how many companies try to become ops companies. I hope we see that kind of practice diminish in favor of increased adoption of PaaS/IaaS.
- Telstra
Another major outage for Telstra, and they’re offering another free data day. Perhaps this time they’ll top 2 petabytes. This article describes the troubles people saw during the last free data day, including slow speeds and signal drops.
- Squarespace
Water main break in their datacenter.
Thanks to stonith on hangops #incident_response.