SRE Weekly Issue #76

This week, I had the awesome opportunity to attend a short-form training session on the Incident Management System (the broader system that includes Incident Command) given by Blackrock 3 Partners. Shout-out to Rob, Ron, and Chris – it was awesome meeting you guys, and I really enjoyed our conversations!

Articles

Uber Fires 20 Amid Investigation Into Workplace Culture

In case you missed it, Uber kicked off this and another investigation in response to a blog post by Susan Fowler, an SRE whose writing I’ve featured here a number of times. I’m pleased at this first step by Uber and I’m looking forward to what comes next. It might be a leave of absence for Uber’s CEO, although no decision has been made yet.

Jepsen: On the perils of network partitions

Here’s the 2013 article that started it all. If you’re unfamiliar with Jepsen, it’s an article series on testing various distributed data systems for partition tolerance, along with a companion tool set for inducing failures.

Internet Routing and Traffic Engineering

For those not completely “cloud native” (ugh) by this point, here’s a nifty primer on some of the BGP tricks you’ll need to know if you manage your own IP transit links.

A Key Expired In Redis, You Won’t Believe What Happened Next

Redis has a pretty big gotcha regarding deletion of expired keys, as these engineers discovered. In fact, my experience with Redis was full of operational gotchas like this.

Reddit – cscareerquestions – Accidentally destroyed production database on first day of a job, and was told to leave, […] how screwed am i?

This poor anonymous Reddit poster had a very bad day. The community rallied around them to explain that no, they are not to blame. One of the top commenters is Yorick Peterse, the engineer who inadvertently deleted GitLab.com’s main database earlier this year. Click through to see blamelessness in action.

Top Skills for an Incident Commander

PagerDuty is deeply invested in the Incident Management System, and most especially Incident Command. This article is a great overview, and if you want more, don’t forget that they also released their incident response documentation a while back, including their Incident Commander training material.

Four nines and beyond: A guide to high availability infrastructure

The main theme in this article by StatusPage.io is the direct relationship between increasing complexity and difficulty in attaining high reliability. I like the mention of microservices as a trade-off and not a panacea.
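
To put rough numbers on that relationship, here is a minimal sketch of the arithmetic. It is not taken from the StatusPage.io article; the function names are made up for illustration, and it assumes components that fail independently and that all sit in the request path.

```python
# Back-of-the-envelope availability arithmetic (illustrative only).
# Assumption: components fail independently and every one of them
# must be up for a request to succeed.

MINUTES_PER_YEAR = 365 * 24 * 60  # ignores leap years


def downtime_minutes_per_year(availability: float) -> float:
    """Yearly downtime budget implied by an availability target."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def serial_availability(per_component: float, count: int) -> float:
    """Availability of a chain of `count` components in series."""
    return per_component ** count


if __name__ == "__main__":
    # "Four nines" leaves a bit under an hour of downtime per year.
    print(round(downtime_minutes_per_year(0.9999), 2))

    # Ten four-nines components in series land close to three nines.
    print(round(serial_availability(0.9999, 10), 5))
```

Under those assumptions, every extra dependency in the request path eats into the budget, which is one way to read the microservices trade-off.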

“Serverless and the death of devops”. Can you not?

Automation doesn’t replace ops, it augments it. Abstraction doesn’t replace ops, it hides it. Function as a service doesn’t remove complexity, it increases it exponentially.

Outages

Amazon product pages went down today in a rare outage
- The linked story was for an outage on June 7th. There was at least one additional similar outage on June 9th (source: personal experience).
Verelox
- Dutch hosting provider Verelox is having a really rough time:
  
  First of all, we want to offer our apologies for any inconvenience. Unfortunately, an ex administrator has deleted all customer data and wiped most servers.
  
  Ouch. Good luck, folks.