The theme of this article is “somebody knows.” That is so often the case with lurking infrastructure issues, and it only becomes clear that somebody knew about the underlying risk once things blow up (if it ever becomes clear at all). How can we find out what someone already knows, soon enough to act?
In this air crash investigation report, somebody knew: the maintenance supervisor had written multiple memos about a risky maintenance practice to no avail, and the practice directly contributed to the crash.
And in this one, somebody knew too: a trained pilot in a nearby village called air traffic control to warn them that a plane looked likely to crash into a mountain and needed to pull up — shortly before it hit the mountain.
A lolsob-worthy comment on laying off SREs. And here’s a totally on-point reply with the “somebody knew” moment.
Partly, it’s about accepting that this is hard work. The other part is choosing where your energy input can yield the most learning.
Full disclosure: Fred is my teammate at work.
Check it out, the incident.io folks started a podcast about incidents!
Here’s Google’s report for a BigQuery outage that occurred on October 13.
At Last9, we auto-delete Slack messages after 2 days on all personal Direct Messages. These retention policies force teams to improve documentation, kill tribal knowledge and drive accountability for mistakes and errors.
Nishant Modak — Last9
There are some interesting tidbits in the pile of incidents in this report.
Jakub Oleksy — GitHub