[…] although “getting the system back up” should be our first priority, to do so safely, we first need to very carefully define what “up” means.
What functionality is critical? Should we sacrifice feature A to save feature B? It’s important to plan ahead.
Whether you can have 100% uptime depends on how you define “uptime”. Does claiming “100%” actually benefit you?
Ellen Steinke — Metrist
Skipping the retro shouldn’t be an option. Ditch the one-size-fits-all process so that this important step happens at the end of every incident.
Jouhné Scott — FireHydrant
Another good one to have in your back pocket for those “What would you say… you do here?” moments.
Ash Patel — SREPath
Build versus buy for incident management systems: what is the true cost of rolling your own?
Biju Chacko and Nir Sharma — Squadcast
A plugin to give ChatGPT the ability to run AWS API calls. I’m not sure how I feel about this.
Banjo Obayomi — DZone
Netflix solved a cardinality explosion by switching from query-based alerting to stream data processing.
Ruchir Jha, Brian Harrington, and Yingwu Zhao — Netflix