
SRE Weekly Issue #26

Articles

Here’s Charity Majors being awesome as always. There’s a reason this article is first this week. In part one of a two-part article, Charity recaps her recent talk at serverlessconf, in which she argues that you can never get away from operations, no matter how “serverless” you go.

[…] no matter how pretty the abstractions are, you’re still dealing with dusty old concepts like “persistent state” and “queries” and “unavailability” and so forth […]

I’m still laughing about #NoDevs. Thought-leadering through trolling FTW.

This is an older article (2011), but it’s still well worth reading. Facebook began automating remediation of standard hardware failures, and then they reinvested the time saved into improving the automation.

Today, the FBAR service is run by two full time engineers, but according to the most recent metrics, it’s doing the work of 200 full time system administrators.
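To make the idea concrete, here’s a minimal Python sketch of the detect/remediate/escalate loop this kind of automation boils down to. It is not anything FBAR actually uses; the helper functions are hypothetical placeholders for your monitoring, repair, and paging tooling.

```python
import logging
import time

log = logging.getLogger("auto-remediation")

# Hypothetical helpers: real implementations would talk to your hardware
# monitoring system, repair workflows, and ticketing/paging tools.
def detect_failed_hosts():
    """Return hostnames with a recognized, standard failure class."""
    return []

def run_remediation(host):
    """Apply the standard fix (reboot, re-image, file a repair ticket)."""
    return True  # True if the standard fix succeeded

def escalate_to_human(host):
    """Hand the odd, non-standard cases to a person."""
    log.warning("escalating %s to a human", host)

def remediation_loop(poll_interval=60):
    """Continuously apply the standard fix; escalate anything unusual."""
    while True:
        for host in detect_failed_hosts():
            if not run_remediation(host):
                escalate_to_human(host)
        time.sleep(poll_interval)
```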

A system that doesn’t auto-scale to meet demand can be unreliable in the face of demand spikes. But auto-scaling adds complexity to a system, and increasing complexity can also decrease reliability. This article outlines a method for reasoning about auto-scaling decisions based on multiple metrics. Bonus TIL: Erlang threads busy-wait for work.
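As a rough illustration of what “scaling on multiple metrics” can look like, here’s a hedged Python sketch. The specific metrics and thresholds are my own assumptions, not taken from the article; the point is the asymmetry between scaling up eagerly and scaling down cautiously.

```python
from dataclasses import dataclass

@dataclass
class Metrics:
    cpu_utilization: float   # 0.0 - 1.0, averaged across instances
    queue_depth: int         # pending jobs waiting for a worker
    p95_latency_ms: float    # request latency, 95th percentile

def desired_instances(current: int, m: Metrics,
                      min_instances: int = 2, max_instances: int = 50) -> int:
    """Scale up if *any* signal says we're behind; scale down only when
    *all* signals agree we have slack. Being under-provisioned usually
    costs more than being over-provisioned, hence the asymmetry."""
    scale_up = (m.cpu_utilization > 0.75
                or m.queue_depth > 100
                or m.p95_latency_ms > 500)
    scale_down = (m.cpu_utilization < 0.30
                  and m.queue_depth == 0
                  and m.p95_latency_ms < 200)
    if scale_up:
        target = current + max(1, current // 4)   # grow ~25% at a time
    elif scale_down:
        target = current - 1                      # shrink slowly
    else:
        target = current
    return max(min_instances, min(max_instances, target))
```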

A run-down of basic techniques for avoiding and dealing with human error. I like this article for a couple of choice quotes, such as: “human error scales up” — as your infrastructure grows bigger, the scope of potential damage from a single error also grows bigger.

The latest in Mathias Lafeldt’s Production Ready series is this article about complexity.

The more complex a system, the more difficult it is to build a mental model of the system, and the harder it becomes to operate and debug it.

Outages

SRE Weekly Issue #25

Articles

This blows my mind. Chef held a live, public retrospective meeting for a recent production incident. I love this idea and can only hope that more companies follow suit. The transparency is great, but even better is that they shared their retrospective process itself: a well-defined format for retrospectives, including a statement of blamelessness at the beginning. Kudos to Chef for this, and thanks to Nell Shamrell-Harrington for posting the link on Hangops.

The actual incident was fairly interesting too. The crux of it comes down to this quote that we’ve probably all uttered ourselves at one point or another:

The further distant staging is from production, the more likely we are to introduce a bug.

PagerDuty has this explanation of alert fatigue and some tips on preventing it. One thing they missed in their list of impacts of alert fatigue: employee attrition, which directly impacts reliability.

For the network-heads out there, here’s an article on how to set up Anycast routing.

As we become more dependent on our mobile phones, the FCC is gathering information on provider outages. I, for one, wouldn’t be able to call 911 (emergency services) if AT&T had an outage, because I don’t have a land line.

I love this article if only for its title. It’s short, but its thesis bears considering: all the procedure documentation in the world won’t help you if you can’t find it during an incident, or it can’t practically be followed.

The only procedure that is worth a damn is one that has been successfully followed in the heat of battle.

So when legacy vendors suggest that the Salesforce outage calls cloud into question, they tend to ignore the fact that their own systems suffer regular outages. They just rely on the fact that few people know about them.

Full disclosure: Salesforce (parent company of my employer, Heroku) is mentioned.

An introduction to the application of formal mathematical verification to network configurations. A good overview, but I wish it went into more practical detail.

[…] a software application designer might want to know that her code will never crash or that it will never execute certain functions without an authorized login. These are simple, practical questions – but answering them is computationally challenging because of the enormous number of possible ways code may be executed, […]
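For a feel of how this works, here’s a toy example (my own, not from the article) using the z3 solver: encode a tiny “configuration” as constraints, then ask the solver to search for a counterexample to the property you care about. Real network verification tools do the same thing with forwarding tables and ACLs, just at a much larger scale.

```python
# pip install z3-solver
from z3 import Int, Solver, And, Not, sat

dst_port = Int("dst_port")

# Toy "configuration": the firewall is supposed to drop telnet (port 23)
# and forward everything else on ports 1-65535.
forwarded = And(dst_port >= 1, dst_port <= 65535, Not(dst_port == 23))

# Property to verify: no telnet packet is ever forwarded.
# The solver checks it by searching for a counterexample.
s = Solver()
s.add(And(dst_port == 23, forwarded))

if s.check() == sat:
    print("property violated, e.g. packet:", s.model())
else:
    print("no counterexample exists; the property holds")
```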

Earlier this year, I featured a story about Pinboard.in and IFTTT. IFTTT released this official apology and explanation of the problems Pinboard.in’s author outlined, and they (unofficially) promised to retain support through the end of 2016. Pinboard.in is an integral part of how I produce SRE Weekly every week, so I’m glad to see that this turned out for the best.

This article is more on the theoretical side than practical, and it’s a really interesting read. It’s the second in a series, but I recommend reading both at once (or skipping the first).

A fault-tolerant system is one in which the unanticipated actions of a subcomponent do not bubble out as unanticipated behavior from the system as a whole.

Outages

  • Twitter
  • NS1
    • NS1’s CEO posted this incredibly detailed and excellent postmortem on the sophisticated DDoS attacks they suffered.

  • Pirate Bay
  • WhatsApp
  • Virginia (US state) government network
  • Walmart MoneyCard
  • Telstra
    • Telstra has had a hell of a time this year. This week social media and news were on fire with this days-long Telstra outage. This time, they’re offering customers a $25 credit instead of a free data day. Click through for Telstra’s explanation of what went wrong.

  • GitLab
    • Linked is their post-incident analysis.

  • Kimbia (May 3)
    • A couple weeks ago, Kimbia, a company that helps non-profits raise funds, suffered a massive failure. This occurred during Give Local America, a huge fundraising day for thousands of non-profits in the US, with the result that many organizations had a hard time accepting donations.

SRE Weekly Issue #24

My favorite read this week was this first article. It’s long, but it’s well worth a full read.

Articles

Got customers begging to throw money at you if only you’d let them run your SaaS in-house? John Vincent suggests you think twice before going down that road. This isn’t just a garden-variety opinion piece. Clearly John is drawing on extensive experience as he closely examines all of the many pitfalls in trying to convert a service into a reliable, sustainable, supportable on-premises product.

An old but excellent postmortem for an incident stemming from accidental termination of a MySQL cluster.

Thanks to logikal on hangops #incident_response for this one.

Earlier this year, Linden Lab had to do an emergency grid roll on a Friday to patch the GHOST (glibc) vulnerability. April Linden (featured here previously) shares a bit on why it was necessary and how Linden handled GHOST.

This article may be about a medication error, but this could have come straight from a service outage post-analysis:

For example, if the system makes it time consuming and difficult to complete safety steps, it is more likely that staff will skip these steps in an effort to meet productivity goals.

Having a standard incident response process is crucial. When we fail to follow it, incidents can escalate rapidly. In the case of this story from South Africa, the article alleges that the Incident Commander led a team into the fire, rather than staying outside to coordinate safety.

I believe that mistakes during incident response in my job don’t lead directly to deaths now, but how soon before they do? And are my errors perhaps causing deaths indirectly even now? (Hat-tip to Courtney E. for that line of thinking.)

Salesforce published a root cause analysis for the outage last week.

Full disclosure: Salesforce (parent company of my employer, Heroku) is mentioned.

Earlier this year, Stack Exchange suffered a short outage during a migration. The underlying issue seems to have been that they couldn’t truly test the migration, because the production environment (CDN and all) couldn’t be replicated in development.

Outages

  • NBA 2K16
  • Westpac (AU bank)
  • iiNet (AU ISP)
  • WhatsApp
  • Iraq
    • Iraq purportedly shut down its internet access (removed its BGP announcements) to prevent students from cheating on exams.

  • Virgin Mobile
    • They offered users a data credit immediately.

  • Telstra
    • Telstra had a long outage this week. They claim that the outage was caused by vandalism in Taree.

  • Datadog
    • Thanks to acabrera on hangops #incident_response for this one.

  • Mailgun
  • Disney Ticketing
    • Disney’s ticketing site suffered under an onslaught of traffic this week brought on by their free dining deal program. For reference: we had a heck of a time making our dining reservations.

SRE Weekly Issue #23

Articles

Here’s the talk on Heroku’s SRE model that fellow SRE Courtney Eckhardt and I gave at SRECon16 in April. Heroku uses a “Total Ownership” model for service operations, meaning that individual development teams are responsible for running and maintaining the services that they deploy. This in turn allows SRE to broaden our scope of responsibility to cover a wide range of factors that might impact reliability.

Full disclosure: Heroku, my employer, is mentioned.

RushCard is a prepaid debit card system, and last year they had an outage that lasted for two weeks. As part of a settlement, RushCard will pay affected customers $100 – $500 for their troubles.

Many RushCard customers are low-income minority Americans who don’t have traditional bank accounts. Without access to their money stored on their RushCards, some customers told The Associated Press at the time they could not buy food for their children, pay bills, or pay for gas to get to their jobs.

This article in Brigham and Women’s Hospital’s Safety Matters series highlights the importance of encouraging reporting of safety incidents and a blameless culture. Two excellent case studies involving medication errors are examined.

In early 2015, a fire occurred in the Channel Tunnel. Click through for a summary of the recently-released post-incident analysis. It includes the multiple complicating factors that made this into a major incident plus lots of remediations — my favorite kind of report.

SignalFx shares their in-depth experience with Kafka in this article. This reminds me of moving around ElasticSearch indices:

Although Kafka currently can do quota-based rate limiting for producing and consuming, that’s not applicable to partition movement. Kafka doesn’t have a concept of rate limiting during partition movement. If we try to migrate many partitions, each with a lot of data, it can easily saturate our network. So trying to go as fast as possible can cause migrations to take a very long time and increase the risk of message loss.
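Without built-in rate limiting for partition movement, the workaround boils down to moving partitions in small batches and waiting for each batch to settle. Here’s a minimal Python sketch of that idea; the helper functions are hypothetical stand-ins for Kafka’s reassignment tooling and your own cluster metrics, not a real Kafka API.

```python
import time

def submit_reassignment(partitions):
    """Placeholder: generate and submit a reassignment plan for this batch."""

def reassignment_in_progress():
    """Placeholder: poll the cluster's reassignment status."""
    return False

def migrate_in_batches(all_partitions, batch_size=5, poll_interval=30):
    """Move a few partitions at a time instead of all at once, so the
    migration's replication traffic can't saturate the network."""
    for i in range(0, len(all_partitions), batch_size):
        batch = all_partitions[i:i + batch_size]
        submit_reassignment(batch)
        while reassignment_in_progress():
            time.sleep(poll_interval)
```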

Plagued by pages requiring tedious maintenance of a Golang process, this developer sought to make the service self-healing.

For the Java crowd, Oracle published this simple guide on writing and deploying highly available Java EE apps using Docker. Sort of. Their example uses a single Nginx container for load balancing.

Outages

SRE Weekly Issue #22

Articles

Landon McDowell, my (incredibly awesome) former boss at Linden Lab, wrote this article in 2014 detailing a spate of bad luck and outages they’d suffered. Causes included hardware failures, DDoS, and an integer DB column hitting its maximum value.

I worked on testing the new class of database hardware mentioned in the previous article. In order to be sure the new hardware could handle our specific query pattern, I captured and replayed production queries in real time using an open source tool written years earlier at Linden Lab called Apiary. This simple but powerful concept (capture and replay) was first introduced to me by one of Apiary’s co-authors, Charity Majors. I’ve since hacked a ton on Apiary and used it at two subsequent jobs.
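If you’ve never built one of these, the core of the replay side is surprisingly small. Here’s a minimal Python sketch of the concept (my own illustration, not Apiary’s actual code): the trick is preserving the relative timing of the captured queries so the candidate hardware sees a realistic load pattern. The helpers are hypothetical placeholders for the capture source and the test database connection.

```python
import time

def captured_queries():
    """Yield (offset_seconds, sql) pairs in the order they were captured."""
    yield from []

def execute_on_test_db(sql):
    """Placeholder: run the query against the candidate database."""

def replay(speed=1.0):
    """Replay captured queries, preserving their original relative timing."""
    start = time.monotonic()
    for offset, sql in captured_queries():
        # Wait until this query is "due" relative to the start of the replay.
        due = start + offset / speed
        delay = due - time.monotonic()
        if delay > 0:
            time.sleep(delay)
        execute_on_test_db(sql)
```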

A group calling themselves the Armada Collective has been making DDoS extortion threats to many companies recently. Cloudflare called them out as entirely toothless, with no actual attacks, but apparently some companies have paid anyway.

An excellent deep dive into a performance issue (which really equals a reliability issue), including some good lessons learned.

This is specifically referring to disaster scenarios such as hurricanes, but the general idea of a “resiliency cooperative” intrigues me.

A review of the Fire and Emergency Services response found flaws in the actions and procedures of the incident commander, who was the active fire chief at the time. The NTSB said the commander had no training on the incident management system that would have prepared him to better command the response.

Mathias Lafeldt goes deeper into chaos engineering in this latest installment of his series. He also introduces his Dockerized version of Netflix’s Chaos Monkey and shows how to automate chaos experiments to gain further confidence in your infrastructure’s reliability.
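Independent of Chaos Monkey itself, here’s a minimal sketch of what an automated chaos experiment can look like: verify the service is healthy, kill a random instance, and verify it recovers. The container label and health-check URL are assumptions for illustration, not anything from the article.

```python
import random
import subprocess
import time
import urllib.request

def kill_random_instance(label="app=myservice"):
    """Pick one running container for the service and kill it."""
    out = subprocess.run(
        ["docker", "ps", "-q", "--filter", f"label={label}"],
        capture_output=True, text=True, check=True).stdout.split()
    victim = random.choice(out)
    subprocess.run(["docker", "kill", victim], check=True)
    return victim

def service_is_healthy(url="http://localhost:8080/health", timeout=5):
    try:
        return urllib.request.urlopen(url, timeout=timeout).status == 200
    except OSError:
        return False

def run_experiment():
    assert service_is_healthy(), "refusing to run chaos on an unhealthy service"
    victim = kill_random_instance()
    time.sleep(10)  # give the orchestrator a chance to recover
    assert service_is_healthy(), f"service did not survive losing {victim}"
```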

A great overview of the difficulties inherent in anomaly detection and alerting. Note that this article is written by OpsClarity and the end reads a bit like an ad for their service.
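To see why it’s hard, it helps to look at the simplest possible detector: a rolling mean plus a standard-deviation threshold. Here’s a minimal Python sketch of that naive baseline (my own, not OpsClarity’s approach); seasonality, trends, and shifting baselines are exactly where something this simple falls apart.

```python
import math
from collections import deque

class RollingAnomalyDetector:
    """Flag a point as anomalous if it is more than `threshold` standard
    deviations away from the rolling mean of the recent window."""

    def __init__(self, window=60, threshold=3.0):
        self.window = deque(maxlen=window)
        self.threshold = threshold

    def is_anomaly(self, value):
        anomalous = False
        if len(self.window) >= 10:  # need some history before judging
            mean = sum(self.window) / len(self.window)
            var = sum((x - mean) ** 2 for x in self.window) / len(self.window)
            std = math.sqrt(var)
            anomalous = std > 0 and abs(value - mean) > self.threshold * std
        self.window.append(value)
        return anomalous
```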

I’m not sure exactly what it is they’re offering now that they weren’t before, but this seems important. I think.

Outages

A production of Tinker Tinker Tinker, LLC