General

SRE Weekly Issue #159

A huge thanks to my awesome former coworker Greg Burek whose helpful link contributions make up fully half of this issue.  Thanks, Greg!

A message from our sponsor, VictorOps:

Are you an SRE working with Microsoft Azure? Learn more about the key services offered in Azure and how SRE teams can leverage these tools and applications to build and deploy reliable services at a consistent pace:

http://try.victorops.com/sreweekly/microsoft-azure

Articles

This paper discusses the ways in which automation of industrial processes may expand rather than eliminate problems with the human operator.

My favorite bit of irony: presenting data to the user in the manner most readily understood results in a lower likelihood of remembering the data, so perhaps the most easily grasped display is not actually the best!

Lisanne Bainbridge

Like malice and incompetence, laziness should be far off our radar when we investigate an incident. I hope that reading this article opens minds about the true scope of blamelessness.

Devon Price

Whether or not you agree with this particular attempt at defining what a Systems Engineer (or SRE or anything related) is, it’s worth thinking about and discussing. Our field is evolving quickly, and titles are a moving target.

Matt Ouille

Driven by a desire to update their 737 without causing airlines to have to retrain pilots, Boeing seemingly kept pilots in the dark about what may have been an important little detail of how the new 737 Max operates, with a tragic result.

James Glanz, Julie Creswell, Thomas Kaplan and Zach Wichter — New York Times

An experienced SRE will develop an innate skepticism of new technologies, even if they don’t realize it. This article provides an excellent list of questions to help articulate that skepticism when evaluating a potential design.

Kellan Elliott-McCrea

Auto-scaling isn’t all roses. Like any tool, you have to understand how it works in order to avoid the pitfalls. Read this article to learn what these folks learned the hard way.

Tyson Mote — Segment

Transitioning to a blameless culture can be difficult, especially as folks might blame each other for forgetting to be blameless!

Rachael Byrne — PagerDuty

Many of the old arguments for not instrumenting code (mostly about performance) no longer apply, and a host of new arguments push toward structured events.

Charity Majors

Outages

SRE Weekly Issue #158

A message from our sponsor, VictorOps:

The golden signals of SRE and monitoring help identify a great starting point for teams looking to proactively build reliability into highly integrated applications and services.

http://try.victorops.com/sreweekly/sre-golden-signals

Articles

This air traffic accident analysis is chilling to listen to, and also incredibly educational. As you listen through the conversation, it becomes more and more clear that the pilot is suffering from information overload. An Incident Commander would be wise to remember the lessons learned here.

After listening to the above recording, I got hooked and kept listening to more and more case studies. Here’s another enlightening one: Real Pilot Story: From Miscue to Rescue

US Air Safety Institute

PagerDuty is quickly approaching Etsy’s level of awesome incident-related articles and guides.

Rachael Byrne — PagerDuty

Retiring features and products can often be harder to do safely than deploying them in the first place.

Rachana Kumar – Etsy

Do your SLIs measure what really matters to your customers? This article discusses how to find out and what to do if they don’t.

Adrian Hilton and Yaniv Aknin — Google

Outages

SRE Weekly Issue #157

A message from our sponsor, VictorOps:

See how VictorOps built their SRE efforts from scratch and structured SRE operations across a smaller team. Developing a culture of collaboration and accountability takes time and effort – but it makes all the difference:

http://try.victorops.com/sreweekly/building-a-culture-of-sre

Articles

Best article about post-incident investigations that I’ve seen in a while. My favorite part is the recommendation not to use a template for the retrospective, as it will artificially narrow the scope of the investigation.

Ryan Frantz

These folks have set up a survey to gather information on whether and how folks are compensated for on-call in IT. This topic has been gaining traction over the past couple of years, and I can’t wait to see the results of the survey. Please take a moment to fill it out.

Chris Evans and Spike Lindsey

I’ll be speaking at SRECon19 Americas this March with my former coworker, Courtney Eckhardt. The talk lineup looks incredible and I’m really excited to go!

If you’re going to be there, drop me an email (I’m terrible at Twitter) and let me know. I’ll have lots of swag available, made with 100% open source software (Ink/Stitch and inkscape-silhouette).

Especially useful for folks new to on-call.

If you only take one thing away from this post, it’s that you need to put your own well-being first, and once you do that other aspects of on-call will become easier.

Dave Fennell — Hosted Graphite

I have to admit I wasn’t clear on two-phase commit before I read this. Now I know what it’s all about — and its drawbacks.

Daniel Abadi

This guide from Google describes the qualities and practices of SRE teams of various levels from beginner to advanced.

Gustavo Franco — Google

A good intro if you’re new around here.

Sylvia Fronczak — Scalyr

Outages

SRE Weekly Issue #156

A message from our sponsor, VictorOps:

DevOps and SRE go hand-in-hand. See how building a DevOps culture of transparency and collaboration can inherently lead to proactive SRE efforts – and ultimately, more reliable systems:

http://try.victorops.com/sreweekly/devops-leads-to-inherent-sre

Articles

Lots of companies seem to be redesigning their status pages lately. I love learning what was wrong with the old one and what they’ve changed to try to fix it.

Benjamin Stein — Twilio

A cringe-worthy story of a system failure (thankfully not production!) along with some ideas on preventing such failures.

Dan Woods

Just like last year, Catchpoint will donate $5 to charity if you take their survey!

This year we are back with a focus on outages and incidents. What impact do incidents have on the organization and the people responding to the incidents? How does this change across industry and organization?

Catchpoint

You can do a lot better than “the server is unhappy.” Be on the lookout for language like that. It’s usually a good learning opportunity or at the very least a good time to fill some gaps in instrumentation.

Arya Asemanfar — LightStep

Outages

SRE Weekly Issue #155

A message from our sponsor, VictorOps:

Machine learning and AI are becoming integrated into numerous services and applications across industries. See how SRE and DevOps teams can leverage MLOps to help shorten the incident lifecycle and maintain highly reliable systems:

http://try.victorops.com/sreweekly/mlops-incident-lifecycle

Articles

A developer’s perspective on why being on call is important and how to structure it fairly (hint: compensation).

Henrik Warne

The Conclusion section sums it up nicely:

In this post, we talked about various delivery guarantee semantics such as at-least-once, at-most-once, and exactly-once. We also talked about why exactly-once is important, the issues in the way of achieving exactly-once, and how Kafka supports it out-of-the-box with a simple configuration and minimal coding.

Rahul Agarwal — DZone
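
To make the delivery semantics named in that conclusion a little more concrete, here is a minimal Python sketch. It does not use Kafka or any Kafka API, and it is not taken from the linked article; `IdempotentConsumer`, `deliver_at_least_once`, and the message IDs are hypothetical names made up for illustration. It simulates at-least-once delivery (a sender that resends until it sees an acknowledgement, which can produce duplicates) and pairs it with consumer-side de-duplication, which is one common way to get an exactly-once effect.

```python
import random


class IdempotentConsumer:
    """Applies each message at most once by remembering processed IDs."""

    def __init__(self):
        self.seen_ids = set()
        self.total = 0

    def handle(self, msg_id, amount):
        # A redelivered message is accepted but not applied a second time.
        if msg_id in self.seen_ids:
            return
        self.seen_ids.add(msg_id)
        self.total += amount


def deliver_at_least_once(consumer, msg_id, amount, rng, ack_loss_rate=0.5):
    """Resend until an ack arrives; lost acks cause duplicate deliveries.

    Returns the number of times the message was delivered.
    """
    deliveries = 0
    while True:
        consumer.handle(msg_id, amount)
        deliveries += 1
        # Simulate the ack being lost on its way back to the sender.
        if rng.random() >= ack_loss_rate:
            return deliveries


def run_demo(seed=0):
    rng = random.Random(seed)
    consumer = IdempotentConsumer()
    messages = [(f"msg-{i}", 10) for i in range(5)]
    deliveries = sum(
        deliver_at_least_once(consumer, msg_id, amount, rng)
        for msg_id, amount in messages
    )
    return deliveries, consumer.total


if __name__ == "__main__":
    deliveries, total = run_demo()
    print(f"deliveries={deliveries} total={total}")
```

However many duplicate deliveries the simulated ack loss causes, the consumer's total reflects each message exactly once. Dropping the retry loop would give at-most-once instead: no duplicates, but a lost message is never resent.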

This is a riveting discussion about retrospective analysis of incidents, hosted by Microsoft. Throughout the discussion, there’s an emphasis on learning from incidents as opposed to simply coming up with action items.

Note: one of the panelists is my fellow employee at Fastly.

Jessica DeVita — Microsoft, with Duck Lawn (Pushpay), Tom Griffin (Pushpay), Sue Allspaw Pomeroy (Fastly), John Allspaw (Adaptive Capacity Labs) and Dr. Richard Cook (Adaptive Capacity Labs)

If you’re looking for a blueprint of how to structure your SRE organization’s meetings, this is a great resource.

Dave Mangot

This post is the second part of the series on Designing Resilient Systems. In Part 1, we looked at use cases for implementing circuit breakers. In this second part, we will do a deep dive on retries and its use cases, followed by a technical comparison of both approaches.

This article is really thorough and includes a section on combining retries with circuit breakers.

Corey Scott — Grab
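
For readers who have not seen the two patterns side by side, here is a rough Python sketch of retries wrapped around a circuit breaker. It is not Grab's implementation and is not taken from the article; `CircuitBreaker`, `CircuitOpenError`, `call_with_retries`, and all the thresholds are hypothetical names and values chosen for illustration.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is not attempted."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, func):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; failing fast")
            # Cool-down elapsed: allow one trial call (half-open state).
            # One more failure will reopen the circuit immediately.
            self.opened_at = None
            self.failures = self.failure_threshold - 1
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        return result


def call_with_retries(breaker, func, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Retry with exponential backoff, but stop at once if the breaker is open."""
    for attempt in range(attempts):
        try:
            return breaker.call(func)
        except CircuitOpenError:
            raise  # retrying against an open circuit only adds load
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))
```

The design choice worth noticing is the ordering: the retry loop sits outside the breaker, so every attempt counts toward the breaker's failure tally, and an open circuit ends the retry loop early instead of being retried.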

The problem is that most advice how to “get design right” only applies to design inside a process boundary. Most of those advices do not work well if applied to distributed systems.

What I have learnt over time is that we basically need to re-learn how to design systems, i.e., how to spread the functionality in a distributed environment.

Uwe Friedrichsen — InfoQ

This really stood out to me:

In practice, we have fixed whole classes of reliability problems by forcing engineers to define deadlines in their service definitions.

Ruslan Nigmatullin and Alexey Ivanov — Dropbox
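
As a rough illustration of what a deadline buys you, here is a small Python sketch. It is not Dropbox's framework and does not use any real RPC library; `Deadline`, `DeadlineExceeded`, `lookup_user`, and `render_profile` are hypothetical names invented for this example. The idea shown is that one absolute deadline is passed down the call chain, so each hop only gets the time the original caller has left.

```python
import time


class DeadlineExceeded(Exception):
    """Raised when the caller's time budget has run out."""


class Deadline:
    """An absolute point in time shared by every hop of a request."""

    def __init__(self, timeout_seconds, clock=time.monotonic):
        self.clock = clock
        self.expires_at = clock() + timeout_seconds

    def remaining(self):
        return self.expires_at - self.clock()

    def check(self):
        if self.remaining() <= 0:
            raise DeadlineExceeded("deadline exceeded; abandoning work")


def lookup_user(deadline):
    deadline.check()  # refuse to start work nobody is waiting for
    return {"id": 1, "name": "example"}


def render_profile(deadline):
    # The same deadline is passed down rather than starting a fresh timeout,
    # so downstream calls only get the time the caller has left.
    user = lookup_user(deadline)
    deadline.check()
    return f"profile:{user['name']}"


if __name__ == "__main__":
    print(render_profile(Deadline(timeout_seconds=0.5)))
```

Without a shared deadline, each layer tends to pick its own timeout, and a slow dependency can keep doing work long after the original caller has given up.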

Outages

A production of Tinker Tinker Tinker, LLC