This one really got me thinking. Make sure you document why an alert exists, not just what it checks for.
If you start with a monolith and adopt a microservice architecture, your incident response process will need to change as well.
Mya Pitzeruse — effx
Another one that needs a disclaimer: there’s no single “root cause” for an incident, and this article is not about that. This is about using statistical software to aid humans in debugging by looking at the activities performed by different users before they encounter a given bug.
Vijay Murali, Edward Yao, Umang Mathur, Satish Chandra — Facebook
A new SRE at Honeycomb shares insight on the job and SRE attitudes in general.
Fred Hebert — Honeycomb
This post considers the January 4th Slack outage as a set of cases of saturation.