SRE Weekly Issue #87

SPONSOR MESSAGE

Reach on-call teams and incident responders more efficiently with a new way to deploy Live Call Routing using Twilio Functions and VictorOps. Check it out:
http://try.victorops.com/LiveCallRouting/SREWeekly

Articles

John Allspaw describes the Architecture Review Working Group at Etsy. I like the idea of an open discussion with peers before creating a novel system that will add significant operational burden.

Here’s part two of Jason Hand’s series of posts with key takeaways from his new eBook, “Post-Incident Reviews”. In the next three chapters, he shows why a traditional RCA process misses the mark.

[…] problems stem — not from one primary cause — but from the complex interplay of our systems and the teams tasked with managing them.

Honeycomb.io eschews plain monitoring in favor of “observability”, which they define as the ability to “ask any arbitrary question” about a system.

But here’s the thing: in distributed systems, or in any mature, complex application of scale built by good engineers… the majority of your questions trend towards the unknown-unknown.
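
To make “ask any arbitrary question” concrete, here’s a minimal sketch in plain Python (not Honeycomb’s actual API): emit one wide, structured event per request, then answer after-the-fact questions by filtering on any combination of fields, including ones no dashboard was built for. The field names and values below are invented for illustration.

    import json
    import time
    import uuid

    EVENTS = []  # stand-in for a real event store

    def emit_event(**fields):
        """Record one wide, structured event per unit of work."""
        event = {"timestamp": time.time(), "request_id": str(uuid.uuid4())}
        event.update(fields)
        EVENTS.append(event)

    def query(predicate):
        """Ask an arbitrary, after-the-fact question by filtering on any field."""
        return [e for e in EVENTS if predicate(e)]

    # Instrument with everything that might matter, not just predefined metrics.
    emit_event(endpoint="/checkout", status=502, duration_ms=1840,
               build_id="abc123", customer_tier="free", az="eu-west-1a")
    emit_event(endpoint="/checkout", status=200, duration_ms=95,
               build_id="abc123", customer_tier="paid", az="eu-west-1b")

    # An unknown-unknown question: slow errors hitting free-tier customers.
    slow_free_errors = query(lambda e: e["status"] >= 500
                             and e["duration_ms"] > 1000
                             and e["customer_tier"] == "free")
    print(json.dumps(slow_free_errors, indent=2))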

Here’s another primer on microservices. It has a nice “caveats” section, which is exactly where operations and reliability come into the picture.

Facebook shared a lot of detail about how they evolved from 3 daily pushes to quasi-continuous releases. They’ve got a well-defined canary system, reminding me of Charity’s article on testing in production last week.
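
The post covers Facebook’s own tooling; as a rough, hypothetical sketch of the canary idea (the sample sizes, error probabilities, and threshold below are invented, not Facebook’s), a release gate might compare the canary’s error rate against the current production baseline before widening the rollout.

    import random

    def observed_error_rate(sample_size, true_error_prob):
        """Simulate the error rate measured on one deployment tier (stand-in for real metrics)."""
        errors = sum(random.random() < true_error_prob for _ in range(sample_size))
        return errors / sample_size

    def canary_passes(baseline_rate, canary_rate, max_regression=0.005):
        """Let the rollout proceed only if the canary isn't meaningfully worse than baseline."""
        return (canary_rate - baseline_rate) <= max_regression

    # Current production build across the fleet vs. the new build on a small canary slice.
    baseline = observed_error_rate(sample_size=100_000, true_error_prob=0.002)
    canary = observed_error_rate(sample_size=5_000, true_error_prob=0.003)

    if canary_passes(baseline, canary):
        print(f"canary ok (baseline={baseline:.4f}, canary={canary:.4f}); widen the rollout")
    else:
        print(f"canary regressed (baseline={baseline:.4f}, canary={canary:.4f}); roll back")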

AppDynamics presents their list in shiny PDF form. You’ll have to hand over your contact info (spam-bucket email address recommended) to download it.

PagerDuty is hosting a “breakathon”: small teams will compete to resolve a series of infrastructure issues. Sounds like a bunch of fun!

Outages

  • Japan
    • Google accidentally announced some BGP prefixes it shouldn’t have, taking Japan offline for a couple of hours. Linked above is a really neat in-depth analysis from BGPmon, for all you BGP geeks out there. (I’ve included a rough sketch of this kind of leak check after the outage list below.)

      Since Google essentially leaked a full table towards Verizon, we get to peek into what Google’s peering relationships look like and how their peers traffic engineer towards Google.

  • Heroku
  • AWS
    • EC2’s Ireland region suffered an outage in VPC peering on August 23. Their status site doesn’t allow for deep links, so here’s an excerpt:

      11:32 AM PDT We are investigating network connectivity issues for some instances in the EU-WEST-1 Region.

      11:55 AM PDT We have identified root cause of the network connectivity issues in the EU-WEST-1 Region. Connectivity between peered VPCs is affected by this issue. Connectivity between instances within a VPC or between instances and the Internet or AWS services is not affected. We continue to work towards full recovery.

      12:51 PM PDT Between 10:32 AM and 12:44 PM PDT we experienced connectivity issues when using VPC peering in the EU-WEST-1 Region. Connectivity between instances in the same VPC and from instances to the Internet or AWS services was not affected. The issue has been resolved and the service is operating normally.

  • Google Cloud
    • Google Cloud suffered a massive 30-hour worldwide outage in some cloud load balancers. In their impressive style, they posted frequent updates during the incident and issued a followup analysis just 2 days after resolution.

      In order to prevent the issue, Google engineers are working to enhance automated canary testing that simulates live-migration events, detection of load balancing packets loss, and enforce more restrictions on new configuration changes deployment for internal representation changes.

  • WhatsApp
  • Twitch (video streaming service)
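
Circling back to the BGP leak above, as promised: here’s a minimal, hypothetical leak check in Python, assuming you already have a feed of observed announcements and a registry of which prefixes each origin AS is expected to announce. The AS numbers and prefixes below are illustrative only, not the actual leaked routes.

    import ipaddress

    # Illustrative registry: prefixes each origin AS is expected to announce (not real data).
    EXPECTED = {
        15169: [ipaddress.ip_network("8.8.8.0/24")],
        64500: [ipaddress.ip_network("203.0.113.0/24")],
    }

    def is_leak(origin_as, prefix):
        """Flag an announcement whose prefix isn't covered by the origin AS's expected set."""
        prefix = ipaddress.ip_network(prefix)
        return not any(prefix.subnet_of(allowed) for allowed in EXPECTED.get(origin_as, []))

    # Hypothetical announcements, e.g. pulled from a route collector feed.
    observed = [
        (15169, "8.8.8.0/24"),      # expected
        (15169, "203.0.113.0/25"),  # a more-specific of someone else's space: flag it
    ]

    for origin_as, prefix in observed:
        if is_leak(origin_as, prefix):
            print(f"possible leak: AS{origin_as} announced {prefix}")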
Updated: September 3, 2017 — 9:39 pm
A production of Tinker Tinker Tinker, LLC