General

SRE Weekly Issue #96

lex

November 5, 2017

Articles

The Phone Book Is On Fire: Lessons From the Dyn DNS DDoS — Velocity NYC 2017

Here’s the recording of my Velocity 2017 talk, posted on YouTube with permission from O’Reilly (thanks!). Want to learn about some gnarly DNS details?

Log20: Fully automated optimal placement of log printing statements under specified overhead threshold

I fell in love with this after reading just the title, and it only got better from there. Why add debug statements haphazardly when an algorithm can automatically figure out where they’ll be most effective? I especially love the analysis of commit histories to build stats on when debug statements were added to various open source projects.

Operating a Kubernetes network

Julia Evans is back with another article about Kubernetes. Along with explaining how it all fits together, she describes a few things that can go wrong and how to fix them.

How can we apply the principles of chaos engineering to AWS Lambda?

In this introductory post of a four-part series, we learn why chaos testing a Lambda-based infrastructure is especially challenging.

Google Vizier: A service for black-box optimization

I love the idea of a service that automatically optimizes things even without knowing anything about their internals. Mmm, cookies.

Lyft’s Envoy dashboards – mattklein123 – Medium

What we are releasing is unfortunately not going to be readily consumable. It is also not an OSS project that will be maintained in any way. The goal is to provide a snapshot of what Lyft does internally (what is on each dashboard, what stats do we look at, etc.). Our hope is having that as a reference will be useful in developing new dashboards for your organization.

Microsoft has built a secret network emulator it says can prevent most cloud outages

It’s not a secret since they published a paper about it. This is an intriguing idea, but I’m wondering whether it’s really more effective than staging environments tend to be in practice.

The Rise of Site Reliability Engineers

A history of the SRE profession and a description of how New Relic does SRE.

Full disclosure: Heroku, my employer, is mentioned.

Outages

Collision with buffer stops at King’s Cross station, London, 15 August 2017
- This is the Rail Accident Investigation Branch’s report on a minor accident involving a driver who suffered a “microsleep” due to fatigue.
LearnVest
Slack

SRE Weekly Issue #95

lex

October 29, 2017

Articles

Abstracting the Geniuses Away from Failure Testing

Chaos Engineering and Jepsen-style testing are still in their infancy. As this ACM Queue article explains, figuring out what kind of failure to test is still a manual process that involves building a mental model of the system. Can we automate it?

Scaling the GitLab database

GitLab shares the story of how they implemented connection pooling and load balancing with read-only replicas in PostgreSQL.

Moving Half a Million Database Tables to AWS Aurora (Part 1)

When you have 600,000(!!) tables in one MySQL Database, traditional migration tools like mysqldump or AWS’s Database Migration Service show cracks. The folks at PressBooks used a different tool instead: mydumper.

Serverless availability zones are the missing level of resiliency for AWS

AWS Lambda spans multiple availability zones in each region. This author wonders whether it would be more reliable to have separate installations of Lambda running in each availability zone, to protect against failure in Lambda itself.

Metrics: not the observability droids you’re looking for

High-cardinality fields are where all the interesting data exist, says Charity Majors of Honeycomb. But that’s exactly where most monitoring systems break down, leaving you to throw together hacks to work around their limitations.

Google Cloud Platform Blog: Building good SLOs – CRE life lessons

Google shares some best practices for building Service Level Objectives.

Collaboration > evaluation: Why we pay SRE candidates to interview all-day

Hosted Graphite brings candidates in to work with them for a day and pays them for their time.

On-Call Horror Story Number Three: This Wins the Most Grueling Award

Grueling is right: their entire team came to the office over the weekend to work on the outage. Lesson learned:

When something goes horribly wrong, don’t bring everybody in. More ideas are good to a point, but if you don’t solve it in the window of a normal human’s ability to stay awake, the value they are giving you goes down exponentially as they get tired.

This article is published by my sponsor, VictorOps, but their sponsorship did not influence its inclusion in this issue.

Engineering a culture of psychological safety – Inside Intercom

Google’s Project Aristotle discovered that the number one predictor of successful teams is psychological safety. The anecdotes in this piece show how psychological safety is also critical in analyzing incidents.

Outages

Power outage, coupled with Murphy’s law, leads to raw sewage spill
- The power failed, and then the backup generator failed. Sound familiar? I’m glad datacenters don’t flood with sewage when this happens…
eBay
Texas State Fair
- Their ticket system went down, so they had to admit fair-goers for free.

SRE Weekly Issue #94

lex

October 22, 2017

Articles

High Reliability Health Care: Getting There from Here

This article by the Joint Commission opened my eyes to just how far medicine in the US is from being a High Reliability Organization (HRO). It’s long, but I’m really glad I read it.

HROs recognize that the earliest indicators of threats to organizational performance typically appear in small changes in the organization’s operations.

[…] in several instances, particularly those involving the rapid identification and management of errors and unsafe conditions, it appears that today’s hospitals often exhibit the very opposite of high reliability.

Center stage: Best practices for staging environments

Increment issue #3 is out this week, and Alice Goldfuss gives us this juicy article on staging environments. I love the section on potential pitfalls with staging environments.

For all their advantages, if staging environments are built incorrectly or used for the wrong reasons, they can sometimes make products less stable and reliable.

Microservice Usage at Honeycomb

A Honeycomb engineer gives us a deep-dive into Honeycomb’s infrastructure and shows how they use their product itself (in a separate, isolated installation) to debug problems in their production service. Microservices are key to allowing them to diagnose and fix problems.

Tail-Tolerance by Google

This is a nice summary of a paper by Google employees entitled, “The Tail at Scale”. 99th percentile behavior can really bite you if you’re composing microservices. The paper has some suggestions for how to deal with this.
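
To get a feel for why 99th percentile behavior bites when services are composed, here is a back-of-the-envelope sketch. It assumes each backend call is independently slow 1% of the time, which is an idealization chosen for illustration; the function name `prob_any_slow` is made up for this example.

```python
def prob_any_slow(fanout: int, slow_fraction: float = 0.01) -> float:
    """Probability that at least one of `fanout` independent calls is slow.

    Assumes calls are independent and each is slow with probability
    `slow_fraction` (0.01 corresponds to the 99th percentile).
    """
    return 1.0 - (1.0 - slow_fraction) ** fanout


if __name__ == "__main__":
    for n in (1, 10, 100):
        print(f"fan-out {n:>3}: {prob_any_slow(n):.1%} of requests hit a slow call")
```

Under these assumptions, a request that fans out to 100 backends hits at least one slow call about 63% of the time, even though each backend is slow only 1% of the time.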

Focus on Analysis: The End of Root Cause

This post by VictorOps recommends moving away from Root Cause Analysis (RCA) toward a Cynefin-based method.

This article is published by my sponsor, VictorOps, but their sponsorship did not influence its inclusion in this issue.

Open-sourcing RacerD: Fast static race detection at scale

I love the idea of detecting race conditions through static analysis. It sounds hard, but the key is that RacerD seeks only to avoid false positives, not false negatives.

RacerD has been running in production for 10 months on our Android codebase and has caught over 1000 multi-threading issues which have been fixed by Facebook developers before the code reaches production.

ButterCMS Architecture: a Mission-Critical API Serving Millions of Requests per Month

Our business requires us to deliver near-100% uptime for our API, but after multiple outages that nearly crippled our business, we became obsessed with eliminating single points of failure. In this post, I’ll discuss how we use Fastly’s edge cloud platform and other strategies to make sure we keep our customers’ websites up and running.

Full disclosure: Heroku, my employer, is mentioned.

Outages

Honeycomb
- Honeycomb had a partial outage on the 17th due to a Kafka bug, and they posted an analysis the next day (nice!). They chronicle their discovery of a Kafka split-brain scenario through snapshots of the investigation they did using their dogfood instance of Honeycomb.
Visual Studio Team Services
- Linked is an absolutely top-notch post-incident analysis by Microsoft. The bug involved is fascinating and their description had me on the edge of my seat (yes, I’m an incident nerd).
Heroku
- Heroku posted a followup for an outage in their API. Faulty rate-limiting logic prevented the service from surviving a flood of requests. Earlier in the week, they posted a followup for incident #1297 (link).
  Full disclosure: Heroku is my employer.

SRE Weekly Issue #93

lex

October 15, 2017

Articles

Reasons Kubernetes is cool

Julia Evans tells us why she likes Kubernetes, and along the way explains how its resilient architecture works.

distsys-class/README.markdown at master · aphyr/distsys-class · GitHub

From the Jepsen folks, this outline is detailed enough to read by itself:

This outline accompanies a 12-16 hour overview class on distributed systems fundamentals. The course aims to introduce software engineers to the practical basics of distributed systems, through lecture and discussion. Participants will gain an intuitive understanding of key distributed systems terms, an overview of the algorithmic landscape, and explore production concerns.

When Optimising For Robustness Fails

In this article Steve Smith explains why a production environment is always in a state of near-failure, why optimising for robustness results in a brittle incident response process, and why Dual Value Streams are a common countermeasure to failure.

What will programming look like in the future?

This article seems like a direct reply to last week’s “The Coming Software Apocalypse”. I gave that one a good review, so I feel compelled to include this refutation, but I was left really wishing for more detail on the arguments put forward. Perhaps there’s more to come?

Better requirements and better tools have already been tried and found wanting. Requirements are a trap. They don’t work. Requirements are no less complex and undiscoverable than code.

Monitoring in the time of Cloud Native

This is an article version of Cindy Sridharan’s Velocity 2017 talk. She covers a lot, including major monitoring methods, existing OSS tools, the pitfalls of each, and how to achieve observability in a cloud-based infrastructure.

Mitigating replication lag and reducing read load with freno

GitHub ensures low MySQL replication lag by rate-limiting expensive batch-processing queries based on replica lag. Before freno, this logic resided in each client, with multiple implementations in different languages. Freno (which is open source) centralizes the replica lag polling and query rate-limiting decisions into a queryable service.
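
The general idea of lag-based throttling can be sketched in a few lines. This is a minimal illustration and not freno's API: `run_throttled`, `get_replica_lag`, `run_chunk`, and the thresholds are all hypothetical names and values chosen for this example.

```python
import time
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")


def run_throttled(
    chunks: Iterable[T],
    run_chunk: Callable[[T], None],
    get_replica_lag: Callable[[], float],
    max_lag_seconds: float = 1.0,
    poll_interval: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run a batch job chunk by chunk, pausing while replicas lag.

    `get_replica_lag` stands in for a query to a central throttling
    service (hypothetical here). Returns how many times the job waited.
    """
    waits = 0
    for chunk in chunks:
        # Hold off on the next chunk until lag drops to the threshold.
        while get_replica_lag() > max_lag_seconds:
            waits += 1
            sleep(poll_interval)
        run_chunk(chunk)
    return waits


# Simulated lag readings (seconds) in place of a real replica check.
simulated_lag = iter([0.2, 3.0, 2.0, 0.5, 0.1])
processed: list[int] = []
wait_count = run_throttled(
    [1, 2, 3],
    processed.append,
    lambda: next(simulated_lag),
    sleep=lambda seconds: None,  # skip real sleeping in the demo
)
```

Keeping the lag check behind a single callable mirrors the point of the article: clients ask one place whether it is safe to write, rather than each reimplementing the polling logic.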

Open Sourcing Iris and Oncall

Earlier this year, LinkedIn open sourced their alerting system duo. Together, these tools provide functionality similar to vendor solutions like PagerDuty and VictorOps.

NGINX Rate Limiting

Here’s a great guide to rate-limiting in NGINX including config snippets.

Developer Experience Lessons Operating a Serverless-like Platform At Netflix

Netflix has an in-house serverless environment on which they run “nano-services”. It has nifty features including automatic pre-warming, gradual roll-out scheduling, and canary deployments.

Transit and Peering: How your requests reach GitHub

GitHub details their Internet-facing network topology and explains how they use traffic engineering to ensure their connectivity is fast and reliable.

The pitfalls of A/B testing in social networks

What if two people try to interact, but only one of them is flagged into a new feature? OKCupid tells us why A/B testing is much harder than it seems, and then they explain how they developed useful test cohorts.

Focus on Remediation: Leverage Runbooks to Reduce MTTR

A primer on runbooks, including a nice template you can use as a starting point in writing yours.

This article is published by my sponsor, VictorOps, but their sponsorship did not influence its inclusion in this issue.

Outages

SRE Weekly Issue #92

lex

October 8, 2017

Shout-out to all the folks I met at Velocity! It was an exhilarating week filled with awesome personal conversations and some really incredible talks.

Then I came back to Earth to discover that everyone chose this week to write awesome SRE-related articles. I’m still working my way through them, but get ready for a great issue.

Articles

The Stella Report

This is the blockbuster PDF dropped by the SNAFUcatchers during their keynote on day two of Velocity. Even just the 15-minute summary by Richard Cook and David Woods had me on the edge of my seat. In this report, they summarize the lessons gleaned from presentations of “SNAFUs” by several companies during winter storm Stella.

SNAFUs are anomalous situations that would have turned into outages were it not for the actions taken by incident responders. Woods et al. introduced a couple of concepts that are new to me: “dark debt” and “blameless versus sanctionless”. I love these ideas and can’t wait to read more.

IT incident response ditches root cause analysis process

Chaos engineering unearths IT deployments’ dark debt

These two articles provide a pretty good round-up of the ideas shared at Velocity this past week.

The Coming Software Apocalypse

This one starts with a 6-hour 911 (emergency services) outage in 2014 and the Toyota unintended acceleration incidents, and then vaults off into really awesome territory. Research is being done into new paradigms of software development that leave the programming to computers, focusing instead on describing behavior using a declarative language. The goal: provably correct systems. Long read, but well worth it.

The Value of Optimizing for Resilience

Drawing from Woods, Allspaw, Snowden, and others, this article explains how and why to improve the resilience of a system. There’s a great hypothetical example of graceful degradation that really clarified it for me.

Nines don’t matter T-Shirt

In a recent talk, Charity Majors made waves by saying, “Nines don’t matter when users aren’t happy.” Look, you can have that in t-shirt and mug format!

Beta Testing in Production Like a Pro

A summary of how six big-name companies test new functionality by gradually rolling it out in production.
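
One common way to roll a feature out gradually is to hash each user into a stable bucket and enable the feature for buckets below a percentage threshold. The sketch below shows that general technique only; it is not taken from the article or from any of the companies it covers, and `in_rollout` and its parameters are hypothetical.

```python
import hashlib


def in_rollout(user_id: str, feature: str, percent: float) -> bool:
    """Return True if `user_id` falls inside the first `percent` of users.

    Hashing the feature name together with the user ID gives each feature
    its own stable, roughly uniform assignment of users to 10,000 buckets.
    """
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < percent * 100


# Raising the percentage only ever adds users; nobody is dropped.
users = [f"user-{i}" for i in range(1000)]
at_10 = {u for u in users if in_rollout(u, "new-checkout", 10)}
at_50 = {u for u in users if in_rollout(u, "new-checkout", 50)}
```

Because the bucket depends only on the feature name and user ID, a user who sees the feature at 10% keeps seeing it as the rollout grows.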

How New, Resilient Networks Change Data Center Design

This article jumps off from Azure’s announcement of availability zones to discuss a growing trend in datacenters. We’re moving away from highly reliable “tier 4” datacenters and pushing more of the responsibility for reliability to software and networks.

Ever wanted to know how Xero does incident management?

Of course I do, and I don’t even know who Xero is! They use chat, chatops, and Incident Command, like a lot of other shops. I find it interesting that incident response starts off with someone filling out a form.

Outages

PagerDuty
- PagerDuty posted a lengthy followup report on their outage on September 19-21. TL;DR: Cassandra. It was the worst kind of incident, in which they had to spin up an entirely new cluster and develop, test, and enact a novel cut-over procedure. Ouch.
Heroku
- Heroku suffered a few significant outages. The one linked above includes a followup that describes a memory leak in their request routing layer. These two don’t yet have followups: #1298, #1301
  Full disclosure: Heroku is my employer.
Azure
- On September 29, Azure suffered a 7-hour outage in Northern Europe. They’ve released a preliminary followup that describes an accidental release of fire suppression agent and the resulting carnage. Microsoft promises more detail by October 13.
  Unfortunately, I can’t deep-link to this followup, so just scroll down to 9/29.
New Relic
Blackboard (education web platform)
