SRE Weekly Issue #30

Articles

How did I not know about HumanOps before now?? Their site is great, as is their manifesto. A large part of what I do at $JOB is to study and improve the human aspects of operations.

The wellbeing of human operators impacts the reliability of systems.

Slides from Charity Majors’s talk at HumanOps. Some choice tidbits in there, and I can’t wait until they post the audio.

Here’s a description of how Server Density handles their on-call duties. They use a hybrid approach with some alerts going to devs and some handled by a dedicated ops team. This idea is really intriguing to me:

After an out-of-hours alert the responder gets the following 24 hours off from on-call. This helps with the social/health implications of being woken up multiple nights in a row.
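
To make the rule concrete, here is a minimal sketch of how a rota tool might work out when a responder is eligible for on-call again. This is not Server Density's implementation. The working hours, the weekend handling, and the names `is_out_of_hours` and `back_on_call_at` are all assumptions made for illustration; the article only states the 24-hour rest period.

```python
from datetime import datetime, time, timedelta

# Assumed working hours; the article does not define "out-of-hours".
WORK_START = time(9, 0)
WORK_END = time(18, 0)

# The 24-hour rest period is the only value taken from the article.
REST_PERIOD = timedelta(hours=24)


def is_out_of_hours(alert_time: datetime) -> bool:
    """Return True if the alert fired on a weekend or outside the assumed working hours."""
    if alert_time.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        return True
    return not (WORK_START <= alert_time.time() < WORK_END)


def back_on_call_at(alert_time: datetime) -> datetime:
    """Return the earliest time the responder can be scheduled for on-call again."""
    if is_out_of_hours(alert_time):
        return alert_time + REST_PERIOD
    # An in-hours alert earns no rest period under this rule.
    return alert_time
```

Under these assumptions, a responder paged at 3 am on a Wednesday is off the rota until 3 am on Thursday, while one paged mid-afternoon stays on.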

This article is written by Netflix’s integration testing team, which is obviously not their SRE team. Nevertheless, integration testing at Netflix is important to ensure that new features start out working reliably and stay working after they’re out.

The pitfall discussed in this article is a lack of packet-level visibility that hampers operators’ ability to quickly diagnose network issues. The article starts by outlining the issue, then discusses methods of mitigating it, including Tap as a Service.

This article makes the case for out-of-band management (OOBM) tools in responding to network issues. It’s a good review, especially for those who have experience primarily or solely with cloud infrastructure.

Now there’s an inflammatory article title — it reeks of the NoOps debate. I would argue that a microservice architecture makes an RCA just as necessary if not more so.

Former SlideShare engineer Sylvain Kalache shares this war story about DevOps gone awry. I’d say there’s a third takeaway not listed in the article: DevOps need not mean full access to the entire infrastructure for everyone.

Outages

Updated: July 10, 2016 — 10:19 pm
A production of Tinker Tinker Tinker, LLC