General

SRE Weekly Issue #110

lex

February 18, 2018

General

Comments

View on sreweekly.com

Articles

How production engineers support global events on Facebook

Facebook goes in-depth on their preparations for New Year’s Day 2018 in their live streaming infrastructure. They used forecasting based on last year and various kinds of load testing to develop the right kind of scaling strategy to meet demand.

On-call doesn’t have to suck

Cindy Sridharan went and blew up the internet with an excellent and controversial tweet about on-call. She took to Medium to address all of the discussion that followed, and the result is a pretty excellent article about on-call and work/life balance.

Production Test Run: Overburdened and under provisioned

A discussion about how RavenDB handles resource exhaustion, and just how resource exhaustion can be defined and detected.

Development at Honeycomb: Crossing the Observability Bridge to Production

Honeycomb on using observability tooling to precisely analyze how a change actually affects your users. Did the new feature/bugfix have the effect you expected?

Low latency, large working set, and GHC’s garbage collector: pick two of three

Pusher is obsessed with low latency, and for good reason. When they saw high long-tail latency, they discovered that Haskell’s garbage collector is optimized for throughput, rather than latency.

Resilience Engineering at LinkedIn with Project Waterbear | LinkedIn Engineering

Facebook’s Project Waterbear seeks to improve resiliency across many of their services through a combination of chaos engineering, cultural changes, and improvements to Rest.li, their common REST framework.

As SREs, we measure, analyze, and provide best practices to help improve the resilience of each application for the application owners and engineering teams.

Observability: the new wave or buzzword?

The tradeoff for more resilient, soft-failing software systems is more complex debugging when things go wrong. As these problems are now more likely to reside deep in application code — which wasn’t the case not along ago — observability tooling is playing catchup.

Everything You Need to Know About DynamoDB Global Tables

OpsGenie analyzes AWS’s new DynamoDB Global Tables, a cross-region multi-master NoSQL datastore. They share the upsides and the pitfalls and include a discussion of how to transition to a global table.

Why, as a Netflix infrastructure manager, am I on call?

A Netflix manager shares his reasons for still being on-call even though he’s a manager, and they’re pretty great. A lot of it has to do with keeping in tune with what it’s like being a developer on his team, especially with regard to on-call burden and operability.

Outages

Visual Studio Team Services (Microsoft)
- Microsoft posted an incredibly detailed analysis of an incident that occurred on February 7th. The interesting bit is that they still don’t know what went wrong, and they included a lot of detail on how they’ve tried to track it down so far. Lots to learn from here.
TD Bank
Snapchat
Vocus Communications (data center provider)
Twitter

SRE Weekly Issue #109

lex

February 11, 2018

General

Comments

View on sreweekly.com

Articles

Per-IP rate limiting with iptables

Pusher had a problem: their service was being bombarded by connections from rogue clients, and they needed to enforce limits. This article is highly polished, with beautiful diagrams and well-constructed explanations.

This is the story of how we quelled the biggest threat to our service uptime for several years.

Structured Logging and Your Team

Structured logging can bring a lot of uniformity to your infrastructure, as lovingly explained in this article. Snyk explains how that uniformity allows for a standardized troubleshooting methodology that helps them get to the bottom of most problems in minutes.

Instead of focusing on the individual intricacies of each part of our system, we train on the common tools to be used for almost every kind of problem.

Your Feature Flag Management Needs to Include Retirement

Feature flags are awesome! But there’s a downside: adding lots of conditional handling to your code can significantly increase code complexity, which can in turn decrease maintainability and increase risk.

Charity Majors on Twitter

Following up on her appearance in the New York Times last week, Charity Majors posted this excellent Twitter thread about the importance of vendor relationship management and generating business value, as any kind of engineer. I’d argue especially as an SRE.

Google Cloud Platform Blog: Applying the Escalation Policy

Here’s the latest in Google’s CRE Life Lessons series. Previously, they explained how to build an Escalation Policy, and in this article, they analyze how it would be applied to several fictitious scenarios.

Dynamometer: Scale Testing HDFS on Minimal Hardware with Maximum Fidelity

LinkedIn needed a way to test their HDFS cluster against real-world traffic patterns. The existing solutions didn’t meet their needs (for reasons they explain toward the end), so they created Dynamometer.

Humanize Your Digital Operations

PagerDuty released a report this week entitled, “The State of IT Work-Life Balance”, which contains the results of their recent survey. This article is an overview, along with some related tidbits about alert fatigue.

Schrodinger’s Outage

Through an anecdote, Baron Schwartz cautions against the use of counter-factuals (“you should have…”) in analyzing the decisions leading up to an outage.

8 Things to Monitor During a Software Deployment

What it says on the tin. This article would make for a great checklist for deploys.

Outages

Singpass (Singapore ID system)
Uber
Fortnite
- Fortnite hit a new peak of 3.4 million concurrent players last Sunday… and that didn’t come without issues!
  
  They suffered 6 different outages over two days, and they posted this highly-detailed incident analysis just 5 days later. Normally I tend not to include outages for MMO games because they have so many and rarely post in-depth analyses, but this one is worth a read.
Binance (cryptocurrency exchange)
Google App Engine
US stock brokerages
- The US stock market had a rough week, and so did several brokerage websites as they dealt with the high trading volume.
Super Bowl Advertisers
- Several companies that purchased expensive commercial slots during the SuperBowl (an american sportsball thing, for you folks outside the US) were unable to handle the web traffic they brought in.
Super Bowl
- NBC had a 45-second blackout in their broadcast of the Superb Owl.

SRE Weekly Issue #108

lex

February 4, 2018

General

Comments

View on sreweekly.com

Wow, I have a lot of great content to share with you this week! Sometimes it seems like awesome articles come in waves… not sure what that’s about.

Articles

Talking Technology: Nick Rockwell + Charity Majors

This is the first in a series where New York Times CTO, Nick Rockwell, talks to leaders in the technology world about their work.

There’s so incredibly much awesome in this conversation, and I’ve already seen the internet alight with people quoting it. Charity says so many insightful things that I’m going to have to reread this a couple of times to absorb it all. It’s a must-read!

SRE@Xero: Managing Incidents Part II

Xero SRE is back, this time with an article about their incident response process and an overview of their chatbot, Multivac. The bot assists with paging and information tracking and, crucially, guides incident responders through a checklist of actions such as determining severity.

Production Test Run The self flagellating server

Here’s a fun little distributed system debugging story from the founder of RavenDB.

Hawaii false missile alert ‘button pusher’ fired

This CNN article goes into a little more detail about what happened. To my eye, there’s not enough in those details to warrant firing, so there must be more than has been shared publicly.

Lessons Learned from LinkedIn’s Data Center Journey

LinkedIn’s growth from a single datacenter to multiple “hyperscale” locations was accompanied by a cultural shift. They transitioned from “‘Site-Up’ is priority #1” to “taking intelligent risks” as their overall reliability improved.

Vanderbilt School of Engineering offers new master of risk, reliability, and resilience engineering

The program is nominally aimed toward “a variety of industries, including the aerospace, automotive, maritime, manufacturing, oil, chemical, power transmission, medical device, infrastructure planning and extreme event response sectors”, though I can’t help but wonder if it might be applicable to IT.

Stop Wasting Your Beer Money

“Well I’d cut out the pizza and beer and instead pay for Splunk.”

This author pushes us to resist the urge to write something in-house and instead look for external services or software, when the tool is not key to delivering customer value.

Feature Flags as a Service: The Only Way You Want Feature Flags

Here’s a very well-articulated argument for using a third-party feature-flag service rather than writing your own. I’ve seen every pitfall they mention and more. This article is by Rollout.io, a feature-flag service, but they notably don’t mention their product even once, and they don’t need to. Nicely done, folks.

Using Postmortems to Understand Service Reliability

I think there’s another layer we get out of the postmortem process itself that hasn’t usually been part of the discussion: communicating about your service’s long-term stability.

We should look beyond merely preventing the same kind of incident in the future and improving our incident response process, says this article from PagerDuty.

Predicting Resource Exhaustion with Double Exponential Smoothing

How many times have you been paged for a server at 95% disk usage, only to find that it’s still months away from full? This article by SignalFX is about a feature on their platform, but its concepts are generally applicable to other tools.

Planning for Chaos with MongoDB Atlas: Using the “Test Failover” Button | MongoDB

A primer on testing failover in a MongoDB Atlas cluster.

Meltdown Performance Impact on MongoDB: AWS, Azure

Large numbers of SREs went scrambling last month when we realized that we may suddenly run out of resources on our NoSQL workloads. Here are some concrete numbers on how things actually turned out.

Outages

PolitiFact
- PolitiFact was down for a bit during President Trump’s yearly State of the Union address.
Skype
- It seems that folks with two-factor authentication were unable to log in for multiple days.
The Travis CI Blog: Major build outage: a postmortem report
- Linked is a highly detailed summary of their troubles with an overloaded RabbitMQ cluster.
Netflix

SRE Weekly Issue #107

lex

January 28, 2018

General

Comments

View on sreweekly.com

Articles

Google Cloud Platform Blog: An example escalation policy — CRE life lessons

Here, “escalation policy” refers to ongoing work by SRE to get a service back into its SLO, rather than an escalation policy definition in PagerDuty (for example). This article describes the tactics a hypothetical Google SRE team has at their disposal to deal with an ailing service. It’s especially striking to me how this policy comes across as almost punitive in nature.

Now You See Me, Now You Don’t: LinkedIn’s Real-Time Presence Platform

In this post, we’ll provide a technical walk-through of how we used the Play Framework and the Akka Actor Model to build the massive infrastructure that keeps track of the online status of millions of members at any given moment. We’ll describe how it distributes thousands of changes per second in the online status of these members to millions of other connected members in real time. You will also learn how to apply these techniques to your own applications.

If You’re Going to Fail, Fail Safely

This article from LaunchDarkly is about assuming failure and mitigating harm, through the lens of feature-flag-based deployment.

What Tools Do Site Reliability Engineers Use?

New Relic shares this list of the categories of tools that SREs use to standardize the systems they support.

As Liz [Fong-Jones] told Matthew Flaming, New Relic vice president of software engineering, “One SRE team is going to have a really difficult time supporting 50 different software engineering teams if they’re each doing their own separate thing, and they’re each using separate tooling.”

Building a Distributed Log from Scratch, Part 5: Sketching a New System

In the final article of this series, Tyler Treat lays out a design for a new distributed log based on NSQ.

Observations on the Enterprise of Hiring

While perhaps not strictly SRE-related, hiring is still critically important for SRE teams. I really love Honeycomb’s approach to hiring as laid out in this blog post.

Why is random testing effective for partition tolerance bugs?

Why indeed? This issue of The Morning Paper discusses a paper on the effectiveness of random testing in distributed systems. More specifically, it goes over the mathematics behind why randomized testing in Jepsen is actually useful, despite classical theories that it ought not be.

Outages

Pinterest
Google Cloud Storage
- This one’s worth a read. Google’s original status posting stated 100% impact to cloud storage in its US region, but their followup post retroactively reduced that to 2.0% average and 3.6% peak.
Netflix
- This one happened seemingly at the same time as the Google Cloud Storage outage, but that may be a spurious correlation. This is the first time that I learned that Netflix does have a status page of sorts: it’s an article in their help center entitled “Is Netflix Down?” and they update it live. Who knew?
Facebook/Instagram
National Health Service (UK)

SRE Weekly Issue #106

lex

January 21, 2018

General

Comments

View on sreweekly.com

Articles

The Limitations of Chaos Engineering – Production Ready

Chaos engineering is extremely useful, and Mathias Lafeldt has written plenty about its virtues. But as with everything, it’s important to be aware of its pitfalls and shortcomings too.

What Went Wrong In Hawaii, Human Error? Nope, Bad Design.

There’s been a lot of talk of firing (or worse) the person whose actions led to the false alarm in Hawaii. That’s why I’m especially glad to see this excellent analysis by Don Norman (The Design of Everyday Things and others). Bonus content: another article along the same vein with some more interesting tidbits.

In defence of swap: common misconceptions

Think twice before you disable swap, says Chris Down, an author of the upcoming cgroup v2 in the Linux kernel.

SRE Survey 2018

Catchpoint is running a survey of SREs and SRE-like folks, and I’d really appreciate it if you’d take a moment to fill it out. Not only will the resulting data be very interesting, but Catchpoint is donating $5 to charity for every survey completed. Let’s stuff that ballot box and get them to hit their cap of $3000!

Building a Distributed Log from Scratch, Part 4: Trade-Offs and Lessons Learned

The awesome continues this week with a discussion of the importance of simplicity in the design of a reliable system.

What Makes a Failure a Disaster?

This article from Heidi Waterhouse at Launch Darkly starts off with a really interesting take on the Y2K bug and continues on to discuss risk management in operations.

When letting the user put the system into an invalid state is a desirable property

This short article has an extremely cogent point: design your system to be flexible enough to allow the user to do something seemingly incorrect, because they might need to while responding to an incident!

Project STAR*: Streamlining Our On-Call Process

LinkedIn had a problem: their on-call system was so dysfunctional that they had to scramble to find coverage for an engineer that had been scheduled to be on call when they were on vacation. They explain how they identified the problem, came up with a solution, and implemented it, including automation and cultural fixes.

Monitoring in a DevOps World

If the phrase “a DevOps World” makes you feel ill, don’t dismiss this article from ACM Queue out of hand. It’s got some great points about designing effective monitoring, and I like the introduction of the “Real Systems Monitoring” concept (akin to “Real User Monitoring” or RUM).

Outages

Heroku
- Heroku had a 29-hour impairment to their application log routing platform.

← Older Posts

Newer Posts →

General

SRE Weekly Issue #110

Articles

Outages

SRE Weekly Issue #109

Articles

Outages

SRE Weekly Issue #108

Articles

Outages

SRE Weekly Issue #107

Articles

Outages

SRE Weekly Issue #106

Articles

Outages

Subscribe

RSS

Mastodon

Search Issues

General

SPONSOR MESSAGE

Articles

Outages

SPONSOR MESSAGE

Articles

Outages

SPONSOR MESSAGE

Articles

Outages

SPONSOR MESSAGE

Articles

Outages

SPONSOR MESSAGE

Articles

Outages

Subscribe

RSS

Mastodon

Search Issues