SRE Weekly Issue #36

Last week’s DevOps & SRE AMA was super fun! Thanks to the panelists for participating. Recordings should be posted soon.


Frustrated by the lack of tools available to automate incident response? Learn how ChatOps can help manage your operations through group chat in the latest book from O’Reilly. Get your copy here:


This is the second half of Server Density’s series on the lessons they learned as they transitioned to a multi-datacenter architecture. There are lots of interesting tidbits in here, such as an explanation of how they handle failover to the secondary DC and what they do if that goes wrong.

Full disclosure: Heroku, my employer, is mentioned.

Here’s the second half of Mathias Lafeldt’s series that seeks to apply Richard Cook’s How Complex Systems Fail to web systems. The article is great, but the really awesome part is the thoughtful responses by Cook himself to both parts one and two, linked at the end of this article.

Here’s a postmortem for last week’s outage that involved a migration gone awry.

Thanks to Jonathan Rudenberg for this one.

A patent holding firm alleges that the USPTO overstepped its authority when it declared that a system outage (reported in issue #4) would be treated as a national holiday for purposes of deadlines, and that this led to the plaintiff being sued.

Burnout is a crucially important consideration in a field with on-call work. VictorOps has a few tips for alleviating burnout gleaned from this year’s Monitorama.

Edith Harbaugh says that staging servers present a reliability risk that outweighs their benefit. This article is an update to her original article, which I also recommend reading.

GitHub uses HAProxy to balance across its read-only MySQL replicas, which is a common method. Their technique for excluding lagging nodes while avoiding entirely emptying the pool if all nodes are lagging is pretty neat.

Thanks to Devops Weekly for this one.
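
For readers who want a feel for the replica-pool idea in the GitHub item above, here is a minimal sketch of the selection logic in Python. It is not GitHub's implementation and it is not HAProxy configuration; the names Replica, eligible_replicas, and max_lag_seconds are hypothetical, and the one-second threshold is an arbitrary assumption. It only illustrates the rule "drop lagging nodes, unless that would leave nothing to serve reads."

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Replica:
    """Hypothetical record for one read-only replica."""
    name: str
    lag_seconds: float  # replication lag as last measured


def eligible_replicas(replicas, max_lag_seconds=1.0):
    """Return replicas within the lag threshold.

    If every replica exceeds the threshold, return the whole pool
    instead, so that reads still have somewhere to go.
    """
    fresh = [r for r in replicas if r.lag_seconds <= max_lag_seconds]
    # Fallback: stale reads are assumed to be preferable to an empty pool.
    return fresh if fresh else list(replicas)


if __name__ == "__main__":
    pool = [Replica("replica-a", 0.2), Replica("replica-b", 7.5)]
    print([r.name for r in eligible_replicas(pool)])
```

The trade-off in this sketch is that when the whole pool is behind, clients get stale data rather than errors; whether that is acceptable depends on the application.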

A highly detailed deep-dive on Serverless — what it means, benefits, and drawbacks. I especially enjoyed the #NoOps section:

[Ops] also means at least monitoring, deployment, security, networking and often also means some amount of production debugging and system scaling. These problems all still exist with Serverless apps and you’re still going to need a strategy to deal with them. In some ways Ops is harder in a Serverless world because a lot of this is so new.


Full disclosure: Heroku, my employer, is mentioned.


Updated: August 21, 2016 — 9:56 pm
A production of Tinker Tinker Tinker, LLC