SRE Weekly Issue #89

SPONSOR MESSAGE

Acknowledge and resolve IT & DevOps alerts directly from Slack with the new native integration with VictorOps. Learn all about it here:
http://try.victorops.com/slack/SREWeekly

Articles

Cachet looks like a pretty good contender against incumbents like StatusPage.

Hosted Graphite used PySyncObj to create a fault-tolerant threshold alerting feature.
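For context, PySyncObj replicates plain Python objects across a cluster using the Raft consensus algorithm. Here's a rough sketch of the pattern (purely illustrative; the class, method names, and node addresses are mine, not Hosted Graphite's actual code):

    from pysyncobj import SyncObj, replicated

    # Hypothetical replicated store of per-metric alert thresholds.
    class ThresholdStore(SyncObj):
        def __init__(self, self_addr, partner_addrs):
            super(ThresholdStore, self).__init__(self_addr, partner_addrs)
            self._thresholds = {}

        @replicated
        def set_threshold(self, metric, value):
            # Goes through the Raft log, so every node applies it in the same order.
            self._thresholds[metric] = value

        def get_threshold(self, metric):
            # Reads are served from this node's local copy of the state.
            return self._thresholds.get(metric)

    # Placeholder addresses for a three-node cluster; a real service would
    # wait for the cluster to elect a leader before writing.
    store = ThresholdStore('alert1:4321', ['alert2:4321', 'alert3:4321'])
    store.set_threshold('cpu.load', 0.9)

The payoff is that replicated writes survive the loss of a minority of nodes, which is presumably the fault tolerance the alerting feature needs.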

Talk about a high-pressure incident! When a teleconferencing provider’s wires got crossed, hilarity (and embarrassment) ensued.

This article is published by my sponsor, VictorOps, but their sponsorship did not influence its inclusion in this issue.

This story is from a PagerDuty engineer. What’d you learn while shadowing on-call? I’d love to hear your story!

Here’s how SYNQ set their status page up. They’re the folks who committed to publishing all of their incident follow-ups publicly a month or two back. Transparency FTW!

I’ll save you the math: that’s ~17k req/sec. I really like that this article takes us through their learning process and their first failed attempts.
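To unpack that a little (this is just unit arithmetic on the ~17k figure, not numbers quoted from the article):

    per_second = 17_000
    per_minute = per_second * 60         # ~1 million requests per minute
    per_day = per_second * 60 * 60 * 24  # ~1.5 billion requests per day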

Quid wrote up this explanation of how they set up their game day and what they learned. I really like the structure they used, and I may draw heavily on it for my own game days.

“Observability” as a term is making the rounds like “DevOps” did (and still does…). Here’s Baron Schwartz’s take on it.

Outages

  • Google Services
    • As two astute readers pointed out (thanks!), the Gmail outage I included in the last issue was from 2009(!). Oops. However, Google has been experiencing a series of outages and degradations this month, so I’m just going to pretend I knew that rather than admit that I forgot to check the date on the article.
  • Amazon S3
    • S3 had an outage in us-east-1 on September 14th. This one showed up as yellow on their status site, with the text below. Companies that depend on S3 probably saw impact as well, but I couldn’t find any status posts other than Heroku’s.

      11:58 AM PDT We are investigating increased error rates for Amazon S3 requests in the US-EAST-1 Region.
      12:20 PM PDT We can confirm that some customers are receiving throttling errors accessing S3. We are currently investigating the root cause.
      12:38 PM PDT We continue to work towards resolving the increased throttling errors for Amazon S3 requests in the US-EAST-1 Region. We have identified the subsystem responsible for the errors, identified root cause and are now working to resolve the issue.
      12:49 PM PDT We are now seeing recovery in the throttle error rates accessing Amazon S3. We have identified the root cause and have taken actions to prevent recurrence.
      1:05 PM PDT Between 11:40 AM and 12:56 PM PDT we experienced throttling errors accessing Amazon S3 in the US-EAST-1 Region. The issue is resolved and the service is operating normally.

      Full disclosure: Heroku is my employer.

  • IBM
    • IBM had a mishap while transferring control of some of its domains to a different registrar. Some of its services, including its Global Load Balancer, went down.