This is a big moment for the SRE field. Etsy has distilled the internal training materials they use to teach employees how to facilitate retrospectives (“debriefings” in Etsy parlance). They’ve released a guide and posted this introduction that really stands firmly on its own. I love the real-world story they share.
And here’s the guide itself. This is essential reading for any SRE interested in understanding incidents in their organization.
Slicer is a general-purpose sharding service. I normally think of sharding as something that happens within a (typically data) service, not as a general-purpose infrastructure service. What exactly is Slicer then?
Click through to find out. It’ll be interesting to see what open source projects this paper inspires.
The second in a series, this article delves into the pitfalls of aggregating metrics. Aggregation means you have to choose between bloating your time-series datastore and leaving out crucial stats that you may need during an investigation.
I thought this was going to be primarily an argument for reducing burnout to improve reliability. That’s in there, but the bulk of this article is a bunch of tips and techniques for improving your monitoring and alerting to reduce the likelihood that you’ll be pulled away from your vacation.
The title says it all. Losing the only person with the knowledge of how to keep your infrastructure running is a huge reliability risk. In this article, Heidi Waterhouse (who I coincidentally just met at LISA16!) makes it brilliantly clear why you need good documentation and how to get there.
Here’s another overview of implementing a secondary DNS provider. I like that they cover the difficulties that can arise when you use a provider’s proprietary non-RFC DNS extensions such as weighted round-robin record sets.
- EC2 (eu-west-1), Heroku
- EC2’s Dublin region had an outage in the DNS resolver provided to instances via DHCP. Heroku was affected as well. Full disclosure: Heroku is my employer.
- DirecTV Now
- ChangeIP (DNS provider)
- ChangeIP tweeted that they suffered a major MySQL failure.
- ATO (Australian Tax Office)
- The ATO lost a petabyte of data from their HPE 3PAR StoreServe SAN.
- Battlefield 1