SRE Weekly Issue #30

Articles

How did I not know about HumanOps before now?? Their site is great, as is their manifesto. A large part of what I do at $JOB is to study and improve the human aspects of operations.

The wellbeing of human operators impacts the reliability of systems.

Unfucking Your Oncall Culture (slides)

Slides from Charity Majors’s talk at HumanOps. Some choice tidbits in there, and I can’t wait until they post the audio.

DevOps On-Call: How we Handle our Schedules

Here’s a description of how Server Density handles their on-call duties. They use a hybrid approach with some alerts going to devs and some handled by a dedicated ops team. This idea is really intriguing to me:

After an out-of-hours alert the responder gets the following 24 hours off from on-call. This helps with the social/health implications of being woken up multiple nights in a row.

Product Integration Testing at the Speed of Netflix

This article is written by Netflix’s integration testing team, which is obviously not their SRE team. Nevertheless, integration testing at Netflix is important to ensure that new features start out working reliably and stay working after they’re out.

De-risking NFV: Pitfalls to Avoid and How to Get Past Them

The pitfall discussed in this article is a lack of packet-level visibility that hampers operators’ ability to quickly diagnose network issues. The article starts by outlining the issue, then discusses methods of mitigating it, including Tap As a Service.

Response to network outages needs to get smarter

This article makes the case for out-of-band management (OOBM) tools in responding to network issues. It’s a good review, especially for those who have experience primarily or solely with cloud infrastructure.

Distributed Apps, Microservices, and the Shift Away from Root Cause Analysis

Now there’s an inflammatory article title — it reeks of the NoOps debate. I would argue that a microservice architecture makes an RCA just as necessary if not more so.

How DevOps Failed 60K Users

Former Slideshare engineer Sylvain Kalache shares this war story about DevOps gone awry. I’d say there’s a third takeaway not listed in the article: DevOps need not mean full access to the entire infrastructure for everyone.

Outages

Bankwest
Seacom
Verizon
Spark (New Zealand telecom)
npm
- NPM had an 8+ hour outage that left build system providers such as Travis, CircleCI, and Heroku, as well as individual users, scrambling. Their postmortem indicates that a major end-of-day deploy the day before was to blame. Conversation in a related GitHub issue suggests a monitoring gap and a lack of overnight on-call coverage for complex outages such as this one.
  
  Full disclosure: Heroku, my employer, is mentioned.
WhatsApp
- This article alleges that the government of Zimbabwe cut access to WhatsApp to disrupt anti-government protests.
Etsy
Pokemon Go
- Seems like approximately the entire internet is talking about this.
Tinder
Claro (Dominican Republic telecom)
ReachNow (car sharing service)
- BMW is offering customers a $10 credit.
Orange Poland (telecom)
Pingdom
- Pingdom had a series of outages in its API and UI this week. As a result, they are planning to create a status site after previously relying on Twitter to notify customers of outages.
