SRE Weekly Issue #23

Articles

Here’s the talk on Heroku’s SRE model that fellow SRE Courtney Eckhardt and I gave at SRECon16 in April. Heroku uses a “Total Ownership” model for service operations, meaning that individual development teams are responsible for running and maintaining the services that they deploy. This in turn allows SRE to broaden our scope of responsibility to cover a wide range of factors that might impact reliability.

Full disclosure: Heroku, my employer, is mentioned.

RushCard is a prepaid debit card system, and last year they had an outage that lasted for two weeks. As part of a settlement, RushCard will pay affected customers $100 – $500 for their troubles.

Many RushCard customers are low-income minority Americans who don’t have traditional bank accounts. Without access to their money stored on their RushCards, some customers told The Associated Press at the time they could not buy food for their children, pay bills, or pay for gas to get to their jobs.

This article in Brigham and Women’s Hospital’s Safety Matters series highlights the importance of encouraging reporting of safety incidents and a blameless culture. Two excellent case studies involving medication errors are examined.

In early 2015, a fire occurred in the Channel Tunnel. Click through for a summary of the recently-released post-incident analysis. It includes the multiple complicating factors that made this into a major incident plus lots of remediations — my favorite kind of report.

SignalFx shares their in-depth experience with Kafka in this article. This reminds me of moving around Elasticsearch indices:

Although Kafka currently can do quota-based rate limiting for producing and consuming, that’s not applicable to partition movement. Kafka doesn’t have a concept of rate limiting during partition movement. If we try to migrate many partitions, each with a lot of data, it can easily saturate our network. So trying to go as fast as possible can cause migrations to take a very long time and increase the risk of message loss.
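
To make the tradeoff in that quote concrete, here is a minimal sketch of one way to pace partition moves when the system itself offers no throttle: group the moves into sequential batches that each stay under a byte budget, and only start the next batch once the previous one finishes. This is not SignalFx's method and it calls no Kafka API; `PartitionMove`, `plan_batches`, and the budget value are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class PartitionMove:
    """Hypothetical record describing one partition to relocate."""
    topic: str
    partition: int
    size_bytes: int


def plan_batches(moves: Iterable[PartitionMove],
                 max_bytes_per_batch: int) -> List[List[PartitionMove]]:
    """Group moves into sequential batches within a byte budget.

    Moves are taken largest first. A move that is bigger than the
    budget on its own still gets scheduled, alone in its batch.
    """
    batches: List[List[PartitionMove]] = []
    current: List[PartitionMove] = []
    current_bytes = 0
    for move in sorted(moves, key=lambda m: m.size_bytes, reverse=True):
        # Close the current batch if adding this move would exceed the budget.
        if current and current_bytes + move.size_bytes > max_bytes_per_batch:
            batches.append(current)
            current, current_bytes = [], 0
        current.append(move)
        current_bytes += move.size_bytes
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    # Illustrative sizes only; real partition sizes would come from your own metrics.
    pending = [
        PartitionMove("events", 0, 60),
        PartitionMove("events", 1, 50),
        PartitionMove("events", 2, 30),
        PartitionMove("events", 3, 20),
    ]
    for i, batch in enumerate(plan_batches(pending, max_bytes_per_batch=100)):
        names = [f"{m.topic}-{m.partition}" for m in batch]
        print(f"batch {i}: {names}")
```

Each batch would be handed to whatever reassignment tooling you use and allowed to complete before the next begins. The budget is a knob to tune against the network headroom you actually have, which trades total migration time for a lower risk of saturation.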

Plagued by pages requiring tedious maintenance of a Golang process, this developer sought to make the service self-healing.

For the Java crowd, Oracle published this simple guide on writing and deploying highly available Java EE apps using Docker. Sort of. Their example uses a single Nginx container for load balancing.

Outages

Updated: May 15, 2016 — 9:55 pm
A production of Tinker Tinker Tinker, LLC