Look, a new newsletter about monitoring! I’m really excited to see what they have to offer.
And another new newsletter! Like Monitoring Weekly, this one will, I’m betting, have a lot of articles of interest to SREs.
VictorOps held a webinar last Thursday to present and discuss the concept of context in incident management. Just paging a responder isn’t enough: we need to get them up to speed on the incident as soon as possible. Ideally, the page itself would include snapshots of relevant graphs, links to playbooks, etc. But if we’re not careful and add too much information, the responder is overloaded by a “sandstorm” of irrelevant data. The webinar also touched on “time to learn” and post-incident learning, with the same caution about information overload when presenting context in pages.
This webinar was created by my sponsor, VictorOps, but their sponsorship did not influence its inclusion in this issue.
Here’s the next in Stephen Thorne’s series of commentaries on chapters of the SRE book. I like that Google makes an effort not to be too reliable for fear of setting expectations too high, and they’re also realistic in their availability goals: no end-user will notice a 20-second outage.
Writing an API, a system, a server, or really anything people might make use of? Don’t make the default timeout infinite.
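To make the point concrete, here’s a minimal sketch (my own illustration, not from the linked article) of a hypothetical client helper whose timeout parameter defaults to a finite value, so callers fail fast unless they explicitly opt out:

```python
import socket

# Hypothetical helper for illustration. In Python's socket API, a timeout
# of None means "block forever" -- exactly the default we want to avoid.
def fetch(host, port, timeout=5.0):
    """Open a TCP connection with a finite default timeout."""
    # create_connection raises socket.timeout after `timeout` seconds
    # instead of hanging indefinitely when the remote end is unresponsive.
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(b"ping\n")
        return conn.recv(1024)
```

Callers who genuinely need to wait forever can still pass `timeout=None`, but now it’s a deliberate choice rather than a silent default.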
PagerDuty really has been churning out excellent articles in the past couple of weeks. [Spoiler Alert] The five things are: internal communication, monitoring, a public status site, a support ticket system, and a defined incident response procedure.
Keep on rockin’ it, PagerDuty. This time they identify common problems that hinder incident response and give suggestions on how to fix them.
The author reviews their experience at SRECon17 Americas, including interesting bits on Julia Evans, training, recruiting, and diversity.
I love that the ideas we’re talking about regarding human error apply even to commercial cannabis growing.
Sadly, little is known about the nature of these errors, mainly because our quest for the truth ends where it should begin, once we know it was a human error or is “someone’s fault.”
The newer and shinier the technology, the bigger the production risk.
In other words, software that has been around for a decade is well understood and has fewer unknowns.
- King’s College London storage system outage and data loss
- King’s College London’s HP storage system suffered a routine failure that, due to a firmware bug, resulted in the loss of the entire array. Linked is an incredibly detailed PDF including multiple contributing factors and many remediations. Example: primary backups were to another folder on the same storage system, and secondary tape backups were purposefully incomplete.
- This one’s interesting to me because it seems to have been self-inflicted due to a flash sale.
- Apple Store
- Another (possibly) self-inflicted outage due to a sale.
- Microsoft Azure
- Discord Status – Connectivity Issues
- Finally, my search alert for “thundering herd” paid off! I hadn’t heard of Discord before now, but they sure do write a great postmortem. Did you know that the thundering herd is a sports team?
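For readers who haven’t run into the term: the standard defense against a thundering herd of clients all reconnecting at once is jittered exponential backoff. A minimal sketch (the base, cap, and “full jitter” strategy are illustrative choices, not taken from Discord’s postmortem):

```python
import random

def backoff_delay(attempt, base=1.0, cap=60.0):
    """Return a randomized retry delay ("full jitter") for the given attempt."""
    # Exponential growth, capped so delays don't grow without bound.
    upper = min(cap, base * (2 ** attempt))
    # Full jitter: pick uniformly in [0, upper] so clients spread out
    # over time instead of retrying in lockstep and re-stampeding the server.
    return random.uniform(0.0, upper)
```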