
SRE Weekly Issue #51

SPONSOR MESSAGE

The “2016/17 State of On-Call” report from VictorOps is now available to download. Learn what 800+ respondents have to say about life on-call, and steps they’re taking to make it better. Get your free copy here: https://victorops.com/state-of-on-call

Articles

This is a big moment for the SRE field. Etsy has distilled the internal training materials they use to teach employees how to facilitate retrospectives (“debriefings” in Etsy parlance). They’ve released a guide and posted this introduction, which stands firmly on its own. I love the real-world story they share.

And here’s the guide itself. This is essential reading for any SRE interested in understanding incidents in their organization.

Slicer is a general-purpose sharding service. I normally think of sharding as something that happens within a (typically data) service, not as a general-purpose infrastructure service. What exactly is Slicer, then?

Click through to find out. It’ll be interesting to see what open source projects this paper inspires.

The second in a series, this article delves into the pitfalls of aggregating metrics. Aggregation forces you to choose between bloating your time-series datastore and leaving out crucial stats that you may need during an investigation.
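
To put rough numbers on that tradeoff (mine, not the article’s), here’s a quick sketch with made-up cardinalities. Pre-aggregating one series per tag combination multiplies quickly, and the obvious fix of dropping a tag is exactly how a crucial stat goes missing:

    # Back-of-the-envelope sketch with invented cardinalities.
    endpoints, hosts, status_codes = 50, 200, 8
    percentiles_kept = 3  # say, p50 / p95 / p99

    with_host_tag = endpoints * hosts * status_codes * percentiles_kept
    without_host_tag = endpoints * status_codes * percentiles_kept

    print(with_host_tag)     # 240000 series for a single latency metric
    print(without_host_tag)  # 1200 series, but now a single bad host is invisible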

I thought this was going to be primarily an argument for reducing burnout to improve reliability. That’s in there, but the bulk of this article is a bunch of tips and techniques for improving your monitoring and alerting to reduce the likelihood that you’ll be pulled away from your vacation.

The title says it all. Losing the only person with the knowledge of how to keep your infrastructure running is a huge reliability risk. In this article, Heidi Waterhouse (who I coincidentally just met at LISA16!) makes it brilliantly clear why you need good documentation and how to get there.

Here’s another overview of implementing a secondary DNS provider. I like that they cover the difficulties that can arise when you use a provider’s proprietary non-RFC DNS extensions such as weighted round-robin record sets.

Outages

SRE Weekly Issue #50

I’m back! The death plague was pretty terrible. A–, would not buy from again. I’m still catching up on articles from the past couple of weeks, so if I missed something important, please send a link my way!

I’m going to start paring down the Outages section a bit. In the past year, I’ve learned that telecom providers have outages all the time, and people complain loudly about them. They also generally don’t share useful postmortems that we can learn from. If I see a big one, I may still report it here, but for the rest, I’m going to omit them.

SPONSOR MESSAGE

The “2016/17 State of On-Call” report from VictorOps is now available to download. Learn what 800+ respondents have to say about life on-call, and steps they’re taking to make it better. Get your free copy here: https://victorops.com/state-of-on-call

Articles

Gabe Abinante has been featured here previously for his contributions to the Operations Incident Board: Postmortem Report Reviews project. To kick off this year’s sysadvent, here’s his excellent argument for having a defined postmortem process.

Having a change management process is useful, even if it’s just a deploy/rollback plan. I knew all that, but this article had a really great idea that I hadn’t thought of before (but should have): your rollback plan should have a set of steps to verify that the rollback was successful.
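
Here’s a minimal sketch of that idea (mine, not the article’s; the helper functions and version strings are hypothetical): the rollback path gets its own verification step instead of being assumed to work.

    import sys

    def deploy_version(version):
        # Placeholder for your real deploy mechanism.
        print(f"deploying {version}")

    def health_check():
        # Placeholder: in reality, check error rates, latency, and key user flows.
        return True

    def release(new_version, previous_version):
        deploy_version(new_version)
        if health_check():
            return
        deploy_version(previous_version)
        # Verify the rollback too, rather than assuming it worked.
        if not health_check():
            sys.exit("rollback did not restore a healthy state; escalate")

    release("v2.4.1", "v2.4.0")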

Let’s be honest: being on-call is kind of an ego boost. It makes me feel important. But not getting paged is way better than getting paged, and we should remember that. #oncallselfie

It’s that time of year again! In a long-standing (1-year-long) tradition here at SRE Weekly, I present you this year’s State of On-Call report from my kind sponsor, VictorOps.

Storing 99th and 95th percentile latency in your time-series DB is great, but what if you need a different percentile? Or what if you need to see why that slowest 1% of requests is taking forever? Perhaps they’re all to the same resource?
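
Here’s a toy sketch (my own, with hypothetical events) of what keeping raw events buys you: any percentile you like, computed at query time, plus the ability to ask what the slowest requests have in common.

    import collections

    events = [
        {"resource": "/search",   "ms": 42},
        {"resource": "/checkout", "ms": 1870},
        {"resource": "/search",   "ms": 51},
        {"resource": "/checkout", "ms": 2210},
        {"resource": "/home",     "ms": 12},
        # in practice, millions of these
    ]

    def percentile(values, p):
        ordered = sorted(values)
        return ordered[max(0, int(round(p / 100 * len(ordered))) - 1)]

    latencies = [e["ms"] for e in events]
    print("p90:", percentile(latencies, 90))  # any percentile, chosen after the fact

    cutoff = percentile(latencies, 99)
    slow = [e for e in events if e["ms"] >= cutoff]
    print(collections.Counter(e["resource"] for e in slow))  # all one resource?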

Orchestrator is a tool for managing a (possibly complex) tree of replicating MySQL servers. This includes master failure detection and automatic failover, akin to MHA4Mysql and other tools. GitHub has adopted Orchestrator and shares some details on how they use it.

A few notable brands suffered impaired availability on and around Black Friday this year. Hats off to AppDynamics for giving us some hard numbers.

Looks like I missed this “Zero Outage Framework” announcement the first time around. I love the idea of information-sharing and it’ll be interesting to see what they come up with. We can all benefit from this, especially if the giants like Microsoft join up.

All IT managers would do well to heed this advice. Remember, burnout very often directly and indirectly impacts reliability.

“If you’re a manager and you are replying to email in the evening, you are setting the expectation to your team – whether you like it or not – that this is normal and expected behaviour”

Signifai has this nice write-up about setting up redundant DNS providers. My favorite bit is how they polled major domains to see who had added a redundant provider since October 21, and they even shared the source for their polling tool!
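
If you’d like to run a similar check yourself, here’s a rough sketch (mine, not their tool) that shells out to dig and counts distinct nameserver providers for a domain. The domains are just examples, and the provider grouping is a crude last-two-labels heuristic.

    import subprocess

    def ns_providers(domain):
        out = subprocess.run(
            ["dig", "+short", "NS", domain],
            capture_output=True, text=True, check=True,
        ).stdout.split()
        # Crude heuristic: group nameservers by their last two labels.
        return {".".join(ns.rstrip(".").split(".")[-2:]) for ns in out}

    for domain in ["github.com", "etsy.com"]:
        providers = ns_providers(domain)
        redundant = "yes" if len(providers) > 1 else "no"
        print(f"{domain}: {sorted(providers)} redundant: {redundant}")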

I’ve featured a lot of articles lately about reducing the amount of code you write. But does that mean that it’s always better to contract with a SaaS provider? This week’s Production Ready delves into the tradeoffs.

Outages

SRE Weekly Issue #49

My vacation visiting the Mouse was awesome! I had a lot of time to work on a little project of mine. And now I’m back just in time for Black Friday and Cyber-whatever. Good luck out there, everyone!

SPONSOR MESSAGE

2016/17 State of On-Call Webinar (with DevOps.com): Register to learn what 800+ survey respondents have to say about life on-call. http://try.victorops.com/2016_17_stateofoncall

Articles

Another issue of BWH’s Safety Matters, this time about a prescribing accident. The system seems to have been almost designed to cause this kind of error, so it’s good to hear that a fix was already about to be deployed.

This is a great article on identifying the true root cause(s) of an incident, as opposed to stopping with just a “direct cause”. I only wish it were titled, Use These Five Weird Tricks to Run Your RCA!

Etsy describes how they do change management during the holidays:

[…] at Etsy, we still push code during the holidays, just more carefully, it’s not a true code freeze, but a cold, melty mixture of water and ice. Hence, Slush.

This issue of Production Ready is about battling code rot through incrementally refactoring an area of a codebase while you’re doing development work that touches it.

Shutterstock shares some tips they’ve learned from writing postmortems. My favorite part is about recording a timeline of events in an incident. I’ve found that reading an entire chat transcript for an incident can be tedious, so it can be useful to tag items of interest using a chat-bot command or a search keyword so that you can easily find them later.
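
As a tiny sketch of the keyword approach (not Shutterstock’s tooling; the tag is arbitrary): agree on a marker like “#timeline”, drop it into chat as things happen, and pull those lines out of the transcript afterward.

    import re
    import sys

    TAG = re.compile(r"#timeline\b", re.IGNORECASE)

    def extract_timeline(transcript_path):
        with open(transcript_path) as f:
            return [line.rstrip() for line in f if TAG.search(line)]

    if __name__ == "__main__":
        for entry in extract_timeline(sys.argv[1]):
            print(entry)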

The “Outage!” AMA happened while I was on vacation, and I still haven’t had a chance to listen to it. Here’s a link in case you’d like to.

My favorite:

If something breaks in production, how do you know about it?

Weaver is a tool to help you identify problems in your microservice consumers by doing “bad” things like responding slowly to a fraction of requests.
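
To get a feel for the technique without Weaver itself, here’s a toy sketch (my own) of an upstream that answers a configurable fraction of requests slowly; point a consumer at it and see what breaks.

    import random
    import time
    from http.server import BaseHTTPRequestHandler, HTTPServer

    SLOW_FRACTION = 0.1  # delay 10% of requests (arbitrary knob)
    SLOW_SECONDS = 5

    class FlakyHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if random.random() < SLOW_FRACTION:
                time.sleep(SLOW_SECONDS)  # the "bad" behavior under test
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"ok\n")

    if __name__ == "__main__":
        HTTPServer(("127.0.0.1", 8080), FlakyHandler).serve_forever()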

Barclays reduced load on their mainframe by adding MongoDB as a caching layer to handle read requests. What the heck does “mainframe” mean in this decade, anyway?
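
The pattern itself is simple enough to sketch (this is the general read-through idea, not Barclays’ implementation):

    cache = {}  # stands in for MongoDB or any other cache store

    def read_from_mainframe(key):
        # Placeholder for the expensive call to the system of record.
        return f"record-for-{key}"

    def read(key):
        if key not in cache:
            cache[key] = read_from_mainframe(key)  # populate on a miss
        return cache[key]

    print(read("account:123"))  # miss: hits the mainframe
    print(read("account:123"))  # hit: served from the cache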

We’d do well to remember during this holiday season that several seconds of latency in web requests is tantamount to an outage.

Tyler Treat gives us an eloquently presented argument for avoiding writing code as much as possible, for the sake of stability.

Outages

So far, no big-name Black Friday outages. We’ll see what Cyber Monday has in store.

  • Everest (datacenter)
    • Everest suffered a cringe-worthy network outage subsequent to a power failure. Power went off and came back a couple of times, prompting their stacked Juniper routers to conclude they had failed to boot and drop into failure recovery mode. Unfortunately, the secondary OS partitions on the two devices contained different JunOS versions, so they wouldn’t stack properly.

      I’d really like to read the RFO on the power outage itself, but I can’t find it. If anyone has a link, could you please send it my way?

  • Argos

SRE Weekly Issue #48

This is the first issue of SRE Weekly going out to over 1000 email subscribers! Thanks to all of you for continuing to make my little side project so rewarding and fulfilling. I can’t believe I’m almost at a year.

Speaking of which, there won’t be an issue next week while my family and I are vacationing at Disney World. See you in two weeks!

SPONSOR MESSAGE

Downtime sucks. Learn how leading minds in tech respond to outages on the Nov. 16th “Ask Me Anything” from Catchpoint & O’Reilly Media: http://try.victorops.com/AMA

Articles

A detailed description of Disaster Recovery as a Service (DRaaS), including a discussion of its cost versus building your own DR site. This is the part I always wonder about:

However, for larger enterprises with complex infrastructures and larger data volumes spread across disparate systems, DRaaS has often been too complicated and expensive to implement.

This one’s so short I can almost quote the whole thing here. I love its succinctness:

Regardless of what we discover, we understand and truly believe that everyone did the best job they could, given what they knew at the time, their skills and abilities, the resources available, and the situation at hand.

Just over four years ago, Amazon had a major outage in Elastic Block Store (EBS). Did you see impact? I sure did. Here’s Netflix’s account of how they survived the outage mostly unscathed.

I’m glad to see more people writing that Serverless != #NoOps. This article is well-argued even though it turns into an OnPage ad 3 paragraphs from the end.

What else can we expect from Greater Than Code + Charity Majors? This podcast is 50 minutes of awesome, and there’s a transcription, too! Listen/read for awesome phrases like “stamping out chaos”, find out why Charity says, “I personally hate [the term ‘SRE’] (but I hate a lot of things)”, and hear Conway’s law applied to microservices, #NoOps debunking, and a poignant ending about misogyny and equality.

Microsoft released its Route 53 competitor in late September. They say:

Azure DNS has the scale and redundancy built-in to ensure high availability for your domains. As a global service, Azure DNS is resilient to multiple Azure region failures and network partitioning for both its control plane and DNS serving plane.

This issue of BWH Safety Matters details an incident in which a communication issue between teams that don’t normally work together resulted in a patient injury. This is exactly the kind of pitfall that becomes more prevalent with the move toward microservices, as siloed teams sometimes come into contact only during an incident.

A detailed postmortem from an outage last month. Lots of takeaways, including one that kept coming up: test your emergency tooling before you need to use it.

Outages

  • Canada’s immigration site
    • I’m sure this is indicative of something.
  • Office 365
  • Twitter
    • Twitter stopped announcing their AS in BGP worldwide, resulting in a 30-minute outage on Monday.
  • Google BigQuery
    • Google writes really great postmortems! Here’s one for a 4-hour outage in BigQuery on November 8, posted on November 11. Fast turnaround and an excellent analysis. Thanks, Google — we appreciate your hard work and transparency!
  • Pingdom
    • Normally I wouldn’t include such a minor outage, but I love the phrase “unintended human error” that they used. Much better than the intended kind.
  • WikiLeaks
  • eBay

SRE Weekly Issue #47

SPONSOR MESSAGE

Downtime sucks. Learn how leading minds in tech respond to outages on the Nov. 16th “Ask Me Anything” from Catchpoint & O’Reilly Media: http://try.victorops.com/AMA

Articles

Next year, SRECon is expanding to three events: Americas, EMEA, and Asia. The Americas event is also moving from Santa Clara to San Francisco, which I, for one, am especially grateful for. The CFP for SRECon17 Americas just opened up, and proposals are due November 30th, so get cracking! I can’t wait to see what all of you have to share!

I have a somewhat dim view of automated anomaly detection in metrics based on my experience with a few tools, but if Datadog’s algorithms live up to their description, they might really have something worthwhile.

When a responder gets an anomaly alert, he or she needs to know exactly why the alert triggered. The monitor status page for anomaly alerts shows what the metric in question looked like over the alert’s evaluation window, overlaid with the algorithm’s predicted range for that metric.
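
The underlying idea can be sketched in a few lines (a crude stand-in, not Datadog’s algorithm): predict a band from recent history and flag points that land outside it.

    import random
    import statistics

    def anomalies(series, window=30, k=3.0):
        flagged = []
        for i in range(window, len(series)):
            history = series[i - window:i]
            mean = statistics.mean(history)
            stdev = statistics.pstdev(history) or 1e-9
            lower, upper = mean - k * stdev, mean + k * stdev
            if not lower <= series[i] <= upper:
                flagged.append((i, round(series[i], 1), round(lower, 1), round(upper, 1)))
        return flagged

    random.seed(0)
    metric = [100 + random.gauss(0, 2) for _ in range(60)] + [140]  # spike at the end
    print(anomalies(metric))  # the spike should land well outside the predicted band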

This issue of Production Ready chronicles Mathias Lafeldt’s effort to create a staging environment. I like the emphasis on using an entirely separate AWS account for staging. This is increasingly becoming a best practice.

What’s causing all that API request latency? Here’s an interesting debug run using Honeycomb. Negative HTTP status codes? Sure, that’s totally a thing, right?

I love this idea: Susan Fowler notes that large, complex systems are constantly changing, and this makes reproducing bugs difficult or impossible. Her suggestion is to log enough that you can logically reconstruct the state of the system at the time the bug occurred. This is the same kind of thing the Honeycomb folks are saying: throw a lot of information into your logs, just in case you might need it to debug something unforeseen.
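
Here’s a small sketch of what that can look like (the field names and values are invented): one structured event per unit of work, carrying the context you’d need to reconstruct the moment later.

    import json
    import time
    import uuid

    def log_event(**fields):
        fields.setdefault("ts", time.time())
        print(json.dumps(fields, sort_keys=True))

    def handle_checkout(user_id, cart):
        log_event(
            event="checkout_attempt",
            request_id=str(uuid.uuid4()),
            user_id=user_id,
            cart_size=len(cart),
            feature_flags={"new_payment_flow": True},  # state you'd otherwise lose
            build="hypothetical-build-1234",           # which code was running
        )
        # actual work happens here

    handle_checkout(user_id=42, cart=["sku-1", "sku-2"])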

Outages
