SRE Weekly Issue #87

Articles

Multiple Perspectives On Technical Problems and Solutions

John Allspaw describes the Architecture Review Working Group at Etsy. I like the idea of an open discussion with peers before creating a novel system that will add significant operational burden.

Post-Incident Reviews Part Two: The Demise of Root Cause Analysis

Here’s part two of Jason Hand’s series of posts with key takeaways from his new eBook, “Post-Incident Reviews”. In the next three chapters, he shows why a traditional RCA process misses the mark.

[…] problems stem — not from one primary cause — but from the complex interplay of our systems and the teams tasked with managing them.

Observability: What’s In a Name?

Honeycomb.io eschews plain monitoring in favor of “observability”, which they define as the ability to “ask any arbitrary question” about a system.

But here’s the thing: in distributed systems, or in any mature, complex application of scale built by good engineers… the majority of your questions trend towards the unknown-unknown.

Serverless computing: It’s all about functional stateless microservices

Here’s another primer on microservices. It has a nice “caveats” section, which is exactly where operations and reliability come into the picture.

Rapid release at massive scale

Facebook shared a lot of detail about how they evolved from 3 daily pushes to quasi-continuous releases. They’ve got a well-defined canary system, reminding me of Charity’s article on testing in production last week.

10 Essential Skills of a Site Reliability Engineer

AppDynamics presents their list in shiny PDF form. You’ll have to fill in your ~~spam-bucket address~~ contact info to download it.

Everything You Need to Know About Our Breakathon

PagerDuty is hosting a “breakathon”: small teams will compete to resolve a series of infrastructure issues. Sounds like a bunch of fun!

Outages

Japan
- Google accidentally announced some BGP prefixes it shouldn’t have, taking Japan offline for a couple of hours. Linked above is a really neat in-depth analysis from BGPmon, for all you BGP geeks out there.
  
  Since Google essentially leaked a full table towards Verizon, we get to peek into what Google’s peering relationships look like and how their peers traffic engineer towards Google.
Heroku
AWS
- EC2’s Ireland region suffered an outage in VPC peering on August 23. Their status site doesn’t allow for deep links, so here’s an excerpt:
  
  11:32 AM PDT We are investigating network connectivity issues for some instances in the EU-WEST-1 Region.
  
  11:55 AM PDT We have identified root cause of the network connectivity issues in the EU-WEST-1 Region. Connectivity between peered VPCs is affected by this issue. Connectivity between instances within a VPC or between instances and the Internet or AWS services is not affected. We continue to work towards full recovery.
  
  12:51 PM PDT Between 10:32 AM and 12:44 PM PDT we experienced connectivity issues when using VPC peering in the EU-WEST-1 Region. Connectivity between instances in the same VPC and from instances to the Internet or AWS services was not affected. The issue has been resolved and the service is operating normally.
Google Cloud
- Google Cloud suffered a massive 30-hour worldwide outage in some cloud load-balancers. In their impressive style, they posted frequent updates during the incident and issued a follow-up analysis just 2 days after resolution.
  
  In order to prevent the issue, Google engineers are working to enhance automated canary testing that simulates live-migration events, detection of load balancing packets loss, and enforce more restrictions on new configuration changes deployment for internal representation changes.
WhatsApp
Twitch (video streaming service)
