SRE Weekly Issue #75


Upcoming webinar: Top 10 Practices of Highly Successful DevOps Incident Management Teams. Learn more and register:


I’m super-excited to share that I’ll be speaking at Velocity NYC this October! My talk is about what exactly you can do to get out from under a failure of your single DNS provider, if you were so unfortunate as to have only one. It turns out that this question is much harder to answer than I ever imagined.

And while we’re on the subject of DNS, GitHub shared the design they used for their new resilient DNS infrastructure.

I really love when folks take the time to write up their experience in this kind of migration.

Don’t gloss over this one! I don’t want to spoil the punchline of this short but awesome article, but I will say that I always enjoy seeing data that makes me question my previous assumptions.

Production Ready is back! One way we can try to make our systems resilient to human errors is to build checklists. If it works for medicine, it can work for us.

Katie Ballinger, SRE at CircleCI, was part of the SRECon17 Americas panel, “Training New SREs”. I’m grateful to her for this recap for those of us who didn’t make it to the conference.

Microservices are pretty popular right now, and lots of folks have great stuff to say about them. But much like many of the tips in Google’s SRE book, we shouldn’t just blindly implement them. If your company isn’t Netflix or Uber, microservices may cause more harm than good, says Adam Drake.

Not only is this a good idea if you want Ops to be able to actually run your code without pulling their hair out, it just generally means more reliable code. This article goes into not only the “how”, but the “why” too.


Updated: June 4, 2017 — 9:09 pm
A production of Tinker Tinker Tinker, LLC