Articles
I love the analogy: you can’t work around a slow drain with a bigger sink.
Stephen Thorne, a Google SRE, annotates the first chapter of the Google SRE book with his personal opinions and interpretations.
The author of this short article starts with the blooper during the Oscars and beautifully segues into a description of techniques organizations can use to halt the propagation of errors.
This webinar looks really interesting, and I’m going to try to catch it. It covers why incident responders need context, how much to give them, and how to deliver it.
This article is published by my sponsor, VictorOps, but their sponsorship did not influence its inclusion in this issue.
Outages
- AT&T 911 service
- AT&T customers were unable to make emergency calls across the US. The Federal Communications Commission (FCC) is investigating.
- Bitly (link-shortening service)
- The cause here is interesting: Comcast’s automated system decided Bitly was a phishing site.
- Post-mortem: Outages on 1/19/17 and 1/23/17 – Skyliner
- I really like their methodical hunt for the offending memory leak.
- HSBC (bank)
- Google accidentally resets OnHub and Google Wifi routers with server error
- The routers occasionally ping Google servers for authorization, and on February 23rd the server was sending back an error message. Through some esoteric fallback mechanism in the routers, this caused them to reset to factory settings. So, a problem on Google’s servers can reset your router. Oops.
- Incident 1059 | Heroku Status
- Heroku posted a followup regarding their outage on February 28th stemming from the Amazon S3 outage.
Full disclosure: Heroku is my employer and I was involved in writing this followup.