SRE Weekly Issue #65

Articles

Look, a new newsletter about monitoring! I’m really excited to see what they have to offer.

And another new newsletter! Like Monitoring Weekly, I’m betting this one will have a lot of articles of interest to SREs.

Sandstorm or Significant: The evolving role of context in incident management

VictorOps held a webinar last Thursday to present and discuss the concept of context in incident management. Just paging in a responder isn’t enough: we need to get them up to speed on the incident as soon as possible. Ideally, the page itself would include snapshots of relevant graphs, links to playbooks, etc. But if we’re not careful and add too much information, the responder is overloaded by a “sandstorm” of irrelevant data. The webinar also touches on “time to learn”, that is, post-incident learning.

This webinar was created by my sponsor, VictorOps, but their sponsorship did not influence its inclusion in this issue.

Embracing Risk

Here’s the next in Stephen Thorne’s series of commentary on chapters of the SRE book. I like that Google makes an effort not to be too reliable for fear of setting expectations too high, and they’re also realistic in their availability goals: no end-user will notice a 20-second outage.

Infinity Is a Bad Timeout

Writing an API, system, server or really anything people might make use of? Don’t make the default timeout be infinite.
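To make the point concrete, here is a minimal Python sketch (not taken from the linked article) of an API whose default timeout is finite and which refuses an infinite one. The names `fetch_result` and `DEFAULT_TIMEOUT_SECONDS`, and the 5-second value, are assumptions chosen purely for illustration.

```python
import math
import queue

# Hypothetical default; pick a value that suits your own service.
DEFAULT_TIMEOUT_SECONDS = 5.0


def fetch_result(results: queue.Queue, timeout: float = DEFAULT_TIMEOUT_SECONDS):
    """Return the next item from `results`, waiting at most `timeout` seconds."""
    # Reject None, infinity, NaN, and non-positive values so that callers
    # cannot ask for an unbounded wait by accident.
    if timeout is None or not math.isfinite(timeout) or timeout <= 0:
        raise ValueError("timeout must be a positive, finite number of seconds")
    try:
        return results.get(timeout=timeout)
    except queue.Empty:
        # Surface a clear error instead of blocking forever.
        raise TimeoutError(f"no result within {timeout} seconds") from None
```

A caller who really needs a longer wait can still pass a larger number, but has to choose it on purpose; nobody ends up with an unbounded wait just by leaving the argument out.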

5 Incident Management Tools You Need During a Firefight

PagerDuty really has been churning out excellent articles in the past couple of weeks. [Spoiler Alert] The five things are: internal communication, monitoring, a public status site, a support ticket system, and a defined incident response procedure.

Avoiding Incident Response Bottlenecks

Keep on rockin’ it, PagerDuty. This time they identify common problems that hinder incident response and give suggestions on how to fix them.

SREcon17: Brave new world of site reliability engineering

The author reviews their experience at SREcon17 Americas, including interesting bits on Julia Evans, training, recruiting, and diversity.

Human Error? No Problem

I love that the ideas we’re talking about regarding human error apply even to commercial cannabis growing.

Sadly, little is known about the nature of these errors, mainly because our quest for the truth ends where it should begin, once we know it was a human error or is “someone’s fault.”

Sometimes Boring is Better — Production Ready

The newer and shinier the technology, the bigger the production risk.

In other words, software that has been around for a decade is well understood and has fewer unknowns.

Outages

King’s College London storage system outage and data loss
- King’s College London’s HP storage system suffered a routine failure that, due to a firmware bug, resulted in loss of the entire array. Linked is an incredibly detailed PDF covering multiple contributing factors and many remediations. Example: primary backups were to another folder on the same storage system, and secondary tape backups were purposefully incomplete.
Ryanair
- This one’s interesting to me because it seems to have been self-inflicted due to a flash sale.
Apple Store
- Another (possibly) self-inflicted outage due to a sale.
Microsoft Azure
Discord Status – Connectivity Issues
- Finally, my search alert for “thundering herd” paid off! I hadn’t heard of Discord before now, but they sure do write a great postmortem. Did you know that the thundering herd is a sports team?
