SRE Weekly Issue #50


I’m back! The death plague was pretty terrible. A–, would not buy from again. I’m still catching up on articles from the past couple of weeks, so if I missed something important, please send a link my way!

I’m going to start paring down the Outages section a bit. In the past year, I’ve learned that telecom providers have outages all the time, and people complain loudly about them. They also generally don’t share useful postmortems that we can learn from. If I see a big one, I may still report it here, but for the rest, I’m going to omit them.

Articles

sysadvent: Day 1 – Why You Need a Postmortem Process

Gabe Abinante has been featured here previously for his contributions to the Operations Incident Board: Postmortem Report Reviews project. To kick off this year’s sysadvent, here’s his excellent argument for having a defined postmortem process.

sysadvent: Day 4 – Change Management: Keep it Simple, Stupid

Having a change management process is useful, even if it’s just a deploy/rollback plan. I knew all that, but this article had a really great idea that I hadn’t thought of before (but should have): your rollback plan should have a set of steps to verify that the rollback was successful.
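To make that idea concrete, here is a minimal sketch of a rollback plan that carries its own verification steps. Everything in it is hypothetical and not taken from the article: the `roll_back` helper, the in-memory `service` dict standing in for a real deployment, and the two checks.

```python
def roll_back(rollback_steps, verification_checks):
    """Run every rollback step, then return the names of the checks that failed."""
    for step in rollback_steps:
        step()
    return [name for name, check in verification_checks if not check()]


# Hypothetical stand-in for a real deployment: a bad 2.0 release is live.
service = {"version": "2.0", "healthy": False}


def redeploy_previous_version():
    # Assumption: rolling back means restoring version 1.9.
    service["version"] = "1.9"
    service["healthy"] = True


# Each check is a (description, callable returning True on success) pair.
checks = [
    ("previous version is running", lambda: service["version"] == "1.9"),
    ("health check passes", lambda: service["healthy"]),
]

failed_checks = roll_back([redeploy_previous_version], checks)
print("rollback verified" if not failed_checks else f"failed checks: {failed_checks}")
```

Writing the checks down next to the rollback steps means nobody has to decide, mid-incident, what "the rollback worked" looks like.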

sysadvent: Day 6 – No More On-Call Martyrs

Let’s be honest: being on-call is kind of an ego boost. It makes me feel important. But not getting paged is way better than getting paged, and we should remember that. #oncallselfie

The State of On-Call 2016-2017 — Kicking off Results Season

It’s that time of year again! In a long-standing (1-year-long) tradition here at SRE Weekly, I present you this year’s State of On-Call report from my kind sponsor, VictorOps.

The Problem with Preaggregated Metrics: Part 1, the “Pre”

Storing 99th and 95th percentile latency in your time-series DB is great, but what if you need a different percentile? Or what if you need to see why that slowest 1% of requests is taking forever? Perhaps they’re all to the same resource?
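As a rough illustration of what raw data buys you, here is a small sketch that computes any percentile after the fact and then asks which resources the slowest requests hit. The event data, the resource names, and the nearest-rank `percentile` helper are all assumptions made for the example, not anything from the article.

```python
import math
from collections import Counter


def percentile(values, p):
    """Nearest-rank percentile of a non-empty list, for 0 < p <= 100."""
    if not values:
        raise ValueError("need at least one value")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def slow_resources(events, threshold_ms):
    """Count requests per resource whose latency is at or above the threshold."""
    return Counter(resource for resource, ms in events if ms >= threshold_ms)


# Hypothetical raw request events: (resource, latency in milliseconds).
events = [("/home", 20)] * 90 + [("/search", 40)] * 8 + [("/report", 900)] * 2

latencies = [ms for _, ms in events]
p90 = percentile(latencies, 90)  # a percentile nobody thought to store up front
p99 = percentile(latencies, 99)

print(p90, p99, slow_resources(events, p99))
```

With only stored p95 and p99 values, neither the p90 nor the per-resource breakdown could be recovered; with the raw events, both are a one-line query.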

Orchestrator at GitHub

Orchestrator is a tool for managing a (possibly complex) tree of replicating MySQL servers. This includes master failure detection and automatic failover, akin to MHA4Mysql and other tools. GitHub has adopted Orchestrator and shares some details on how they use it.

Black Friday and Cyber Monday Performance Report 2016

A few notable brands suffered impaired availability on and around Black Friday this year. Hats off to AppDynamics for giving us some hard numbers.

Microsoft refuses to join the Zero Outage brigade, Google and AWS keep mum

Looks like I missed this “Zero Outage Framework” announcement the first time around. I love the idea of information-sharing and it’ll be interesting to see what they come up with. We can all benefit from this, especially if the giants like Microsoft join up.

HumanOps: Etsy on how unclear workplace expectations contribute to staff burnout

All IT managers would do well to heed this advice. Remember, burnout very often directly and indirectly impacts reliability.

“If you’re a manager and you are replying to email in the evening, you are setting the expectation to your team – whether you like it or not – that this is normal and expected behaviour”

Multiple DNS providers, the Perfect Gift this holiday Season

Signifai has this nice write-up about setting up redundant DNS providers. My favorite bit is how they polled major domains to see who had added a redundant provider since the Dyn outage on October 21, and they even shared the source for their polling tool!
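This is not Signifai's tool, just a minimal sketch of the kind of check such a poll might run once it has a domain's NS records in hand. The nameserver hostnames are made up, and treating the last two labels of a hostname as "the provider" is a naive assumption that breaks for providers that spread their nameservers across several domains.

```python
def provider_of(ns_host):
    """Naively identify a DNS provider by the last two labels of an NS hostname."""
    labels = ns_host.rstrip(".").lower().split(".")
    return ".".join(labels[-2:])


def has_redundant_dns(ns_hosts):
    """True if the nameservers appear to belong to at least two providers."""
    return len({provider_of(host) for host in ns_hosts}) >= 2


# Hypothetical NS record sets for two domains.
single_provider = ["ns1.provider-a.example.", "ns2.provider-a.example."]
two_providers = ["ns1.provider-a.example.", "ns1.provider-b.example."]

print(has_redundant_dns(single_provider), has_redundant_dns(two_providers))
```

A real poller would fetch the NS records with a DNS library and rerun the check over time to see who changed their setup.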

The Burden of Running Systems

I’ve featured a lot of articles lately about reducing the amount of code you write. But does that mean that it’s always better to contract with a SaaS provider? This week’s Production Ready delves into the tradeoffs.

Outages

WoW, Battle.net
Grubhub, Seamless
DirecTV Now
- AT&T’s new online version of DirecTV saw issues in its second week of operation.
Zapier
