Increment issue #2 is out! Want to hear what it was like for these three big companies to move to the cloud? Read on.
This article covers a lot of ground, from general strategy to specific methods for estimating capacity needs. I love this:
Perhaps surprisingly for engineers who work in mission-critical business applications, occasional spikes of 90%+ of our users being entirely unable to use the sole application of our company was an entirely acceptable engineering tradeoff versus sizing our capacity against our peak loads.
I love the insight this article gives me into the huge networks of big CDNs.
Key point: don’t count your chickens before they’ve recovered.
The MTTR clock should be stopped only when there is verification that all systems are once again operating as expected and end users are no longer negatively affected.
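The point above can be made concrete with a little arithmetic: stop each incident's clock at *verified* recovery, then average. A minimal sketch (timestamps are hypothetical):

```python
from datetime import datetime

# Each tuple: (time incident detected, time recovery was *verified*),
# not the time the first graph started looking better.
incidents = [
    (datetime(2017, 7, 1, 9, 0), datetime(2017, 7, 1, 9, 45)),
    (datetime(2017, 7, 8, 14, 0), datetime(2017, 7, 8, 14, 15)),
]

# Duration of each incident in minutes.
durations = [(end - start).total_seconds() / 60 for start, end in incidents]

# Mean time to recovery across incidents: (45 + 15) / 2 = 30.0 minutes.
mttr_minutes = sum(durations) / len(incidents)
```

Stopping the clock early (before verification) would systematically understate every duration in the list, and with it the MTTR.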
Scalyr explains how to move beyond specific playbooks to create a general incident response plan.
Here’s a nice little how-to:
A recent challenge for one of the teams I'm currently involved with was to find a way in AWS CloudWatch:
- To alert if the metric breaches a specified threshold.
- To alert if a particular metric has not been sent to CloudWatch within a specified interval.
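Both requirements above can be covered by a single CloudWatch alarm if missing data is treated as breaching. Here's a sketch of the alarm parameters (the namespace, metric name, and SNS topic ARN are hypothetical placeholders):

```python
def build_alarm_params(metric_name, namespace, threshold, period_seconds, sns_topic_arn):
    """Build the parameters for CloudWatch's put_metric_alarm call.

    ComparisonOperator + Threshold covers case 1 (metric breaches a threshold).
    TreatMissingData='breaching' covers case 2: if no datapoint arrives within
    an evaluation period, CloudWatch treats the gap as a breach and alarms.
    """
    return {
        "AlarmName": f"{namespace}-{metric_name}-high-or-missing",
        "Namespace": namespace,
        "MetricName": metric_name,
        "Statistic": "Sum",
        "Period": period_seconds,          # how often a datapoint is expected
        "EvaluationPeriods": 1,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "breaching",   # missing data => alarm fires
        "AlarmActions": [sns_topic_arn],
    }

# With boto3 this would be applied as:
#   boto3.client("cloudwatch").put_metric_alarm(**build_alarm_params(...))
params = build_alarm_params(
    "Errors", "MyApp", 5, 300,
    "arn:aws:sns:us-east-1:123456789012:alerts",
)
```

The alternative (two separate alarms, one of them on a metric-math expression) also works, but `TreatMissingData` is the simplest way to get "alert me when the metric stops arriving" for free.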
And another short how-to, this one on deploying Prometheus with HA.
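The usual baseline for Prometheus HA is to run two identical, independent replicas and let Alertmanager deduplicate their alerts. A minimal config sketch (target addresses and the `replica` label name are placeholders, not from the linked article):

```yaml
# prometheus.yml — identical on both replicas except external_labels.replica
global:
  scrape_interval: 15s
  external_labels:
    replica: replica-1                    # replica-2 on the second server
scrape_configs:
  - job_name: "node"
    static_configs:
      - targets: ["node-exporter:9100"]   # placeholder target
alerting:
  alert_relabel_configs:
    - action: labeldrop                   # strip the replica label so both
      regex: replica                      # copies of an alert look identical
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]  # Alertmanager dedupes the pair
```

Dropping the `replica` label before alerts reach Alertmanager is what makes deduplication work; without it, each replica's alerts carry a distinct label set and fire twice.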
Self-care is critical in tech, not only for us as individuals, but for the health and reliability of the entire organization. Overstretched engineers make mistakes. This article introduces a new resource: selfcare.tech, which is a curated, open-source repository of self-care resources.
- Today’s Outage · GitHub
- Old but good: this post-incident report from GitHub in 2010 recounts an outage caused by inadvertently running an automated test script against a production database.
- Pokémon Go Chicago event issues ticket refunds after widespread outage
- 20,000 people in one place trying to play Pokémon Go was apparently enough to overload several mobile phone networks.