SRE Weekly Issue #108

Wow, I have a lot of great content to share with you this week! Sometimes it seems like awesome articles come in waves… not sure what that’s about.

Articles

Talking Technology: Nick Rockwell + Charity Majors

This is the first in a series where New York Times CTO, Nick Rockwell, talks to leaders in the technology world about their work.

There’s so incredibly much awesome in this conversation, and I’ve already seen the internet alight with people quoting it. Charity says so many insightful things that I’m going to have to reread this a couple of times to absorb it all. It’s a must-read!

SRE@Xero: Managing Incidents Part II

Xero SRE is back, this time with an article about their incident response process and an overview of their chatbot, Multivac. The bot assists with paging and information tracking and, crucially, guides incident responders through a checklist of actions such as determining severity.

Production Test Run: The self flagellating server

Here’s a fun little distributed system debugging story from the founder of RavenDB.

Hawaii false missile alert ‘button pusher’ fired

This CNN article goes into a little more detail about what happened. To my eye, there’s not enough in those details to warrant firing, so there must be more than has been shared publicly.

Lessons Learned from LinkedIn’s Data Center Journey

LinkedIn’s growth from a single datacenter to multiple “hyperscale” locations was accompanied by a cultural shift. They transitioned from “‘Site-Up’ is priority #1” to “taking intelligent risks” as their overall reliability improved.

Vanderbilt School of Engineering offers new master of risk, reliability, and resilience engineering

The program is nominally aimed toward “a variety of industries, including the aerospace, automotive, maritime, manufacturing, oil, chemical, power transmission, medical device, infrastructure planning and extreme event response sectors”, though I can’t help but wonder if it might be applicable to IT.

Stop Wasting Your Beer Money

“Well I’d cut out the pizza and beer and instead pay for Splunk.”

This author pushes us to resist the urge to write something in-house and instead look for external services or software, when the tool is not key to delivering customer value.

Feature Flags as a Service: The Only Way You Want Feature Flags

Here’s a very well-articulated argument for using a third-party feature-flag service rather than writing your own. I’ve seen every pitfall they mention and more. This article is by Rollout.io, a feature-flag service, but they notably don’t mention their product even once, and they don’t need to. Nicely done, folks.

Using Postmortems to Understand Service Reliability

I think there’s another layer we get out of the postmortem process itself that hasn’t usually been part of the discussion: communicating about your service’s long-term stability.

We should look beyond merely preventing the same kind of incident in the future and improving our incident response process, says this article from PagerDuty.

Predicting Resource Exhaustion with Double Exponential Smoothing

How many times have you been paged for a server at 95% disk usage, only to find that it’s still months away from full? This article by SignalFx is about a feature on their platform, but its concepts are generally applicable to other tools.
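
For a rough feel of the idea, here is a minimal sketch of double exponential smoothing (Holt's linear method) used to estimate how long until a disk fills. It is not the implementation from the linked article: the function names, the alpha and beta defaults, and the sample data are all assumptions made for illustration.

```python
def double_exponential_smoothing(series, alpha, beta):
    """Smooth `series` with Holt's linear method and return (level, trend).

    alpha weights the newest sample when updating the level;
    beta weights the newest level change when updating the trend.
    """
    if len(series) < 2:
        raise ValueError("need at least two samples")
    level = series[0]
    trend = series[1] - series[0]  # simple initial trend estimate
    for value in series[1:]:
        previous_level = level
        level = alpha * value + (1 - alpha) * (level + trend)
        trend = beta * (level - previous_level) + (1 - beta) * trend
    return level, trend


def steps_until_threshold(series, threshold, alpha=0.5, beta=0.5):
    """Estimate how many sampling intervals remain before `threshold`.

    Returns 0.0 if the smoothed level is already at or past the threshold,
    and None if the smoothed trend is flat or falling.
    """
    level, trend = double_exponential_smoothing(series, alpha, beta)
    if level >= threshold:
        return 0.0
    if trend <= 0:
        return None
    return (threshold - level) / trend


if __name__ == "__main__":
    # Hypothetical daily disk-usage percentages.
    usage = [90.0, 90.2, 90.1, 90.4, 90.5, 90.7]
    remaining = steps_until_threshold(usage, threshold=100.0)
    print("estimated days until full:", remaining)
```

Alerting on the estimated time remaining, rather than on a fixed percentage, is what keeps a slowly growing disk at 95% from paging anyone.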

Planning for Chaos with MongoDB Atlas: Using the “Test Failover” Button | MongoDB

A primer on testing failover in a MongoDB Atlas cluster.

Meltdown Performance Impact on MongoDB: AWS, Azure

Large numbers of SREs went scrambling last month when we realized that we might suddenly run out of resources on our NoSQL workloads. Here are some concrete numbers on how things actually turned out.

Outages

PolitiFact
- PolitiFact was down for a bit during President Trump’s yearly State of the Union address.
Skype
- It seems that folks with two-factor authentication were unable to log in for multiple days.
The Travis CI Blog: Major build outage: a postmortem report
- Linked is a highly detailed summary of their troubles with an overloaded RabbitMQ cluster.
Netflix
