SRE Weekly Issue #31

Huge thanks to SRE Weekly’s new sponsor, VictorOps!

Articles

Opzzz is a new app that graphs sleep data (from a Fitbit) against pager alerts (from PagerDuty or Server Density). I love this idea!

By correlating sleep data with on call incidents, we can then illustrate the human cost of on-call work.
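As a rough illustration of the idea (not Opzzz's actual implementation), the correlation can be sketched as counting the alerts that land inside a sleep window. The function name and the sample timestamps below are invented for this example; real data would come from the Fitbit and PagerDuty or Server Density exports.

```python
from datetime import datetime

def alerts_during_sleep(sleep_intervals, alert_times):
    """Return the alerts whose timestamp falls inside any sleep interval."""
    return [
        t for t in alert_times
        if any(start <= t < end for start, end in sleep_intervals)
    ]

# Hypothetical sample data: two nights of sleep and three pager alerts.
sleep_intervals = [
    (datetime(2016, 7, 18, 23, 0), datetime(2016, 7, 19, 7, 0)),
    (datetime(2016, 7, 19, 23, 30), datetime(2016, 7, 20, 6, 30)),
]
alert_times = [
    datetime(2016, 7, 19, 2, 15),   # during the first night
    datetime(2016, 7, 19, 14, 0),   # mid-afternoon, awake
    datetime(2016, 7, 20, 3, 40),   # during the second night
]

disruptive = alerts_during_sleep(sleep_intervals, alert_times)
print(f"{len(disruptive)} of {len(alert_times)} alerts interrupted sleep")
```

Counting interruptions is only one possible measure; total sleep lost per alert would need finer-grained sleep-stage data.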

Opsweekly: Measuring on-call experience with alert classification

Speaking of measuring sleep data against pages, Etsy is doing that too with their open source on-call analysis tool Opsweekly. Engineers also classify their events based on whether they were actionable.

We’ve been doing this for a year and we are seeing an increasingly improving signal to noise ratio.
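One simple way to put a number on that signal-to-noise ratio is the fraction of alerts classified as actionable. This is a minimal sketch under assumed labels; the "actionable" and "no_action" strings are hypothetical and are not Opsweekly's real category names.

```python
from collections import Counter

def actionable_ratio(classifications):
    """Fraction of alerts labelled "actionable"; 0.0 when there are no alerts."""
    counts = Counter(classifications)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return counts["actionable"] / total

# Hypothetical week of classified alerts.
week = ["actionable", "no_action", "actionable", "no_action"]
print(actionable_ratio(week))
```

Tracking this value week over week would show whether alert cleanup is paying off.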

Dealing with Anxiety in Operations

Slides from a talk on a really important topic. There are some great resource links included.

How to Work an On Call Job and Keep Your Sanity

I’m a firm believer in work/life balance, especially as it pertains to on-call. I have a reputation for rigidly defending my personal time and that of my co-workers. I strongly feel that this is the best thing I can do for my company because exhaustion and burnout are huge reliability risks. Read this article if you’re trying to figure out how to improve your on-call experience and aren’t sure how to start.

Making Facebook self-healing: Automating proactive rack maintenance

FBAR, Facebook’s Auto-Remediation system, was mentioned here last month. This week, they posted an update explaining AMH, their system for safely handling maintenance of blocks of servers.

[Pingdom] Post-mortem for recent incidents

Pingdom released this set of short postmortems for last week’s series of outages.

From idea to reality: containers in production at GoCardless

A really detailed article about how one company got Docker into production safely and reliably. I especially love the parts about nginx cutover (when deploying new container versions) and supervising running containers. With the common refrain that Docker isn’t ready for production, it’s nice to see how GoCardless did it — but it’s also interesting to see how much tooling they felt compelled to write in-house.

The true meaning of availability

What good is an arbitrary number of nines from a cloud service provider if their transit links go down? Or if vast swathes of end-users can’t reach your site due to a major internet disruption? ServiceNow’s vice president argues that cloud providers must pay attention to “real availability” and partner with their customers to deal with external threats to availability.

Bitfinex Outages Raise Questions of Reliability and Regulation

Last month, Bitfinex (a bitcoin exchange) experienced multiple outages, and the subsequent bitcoin sell-off caused the price of bitcoin to drop 7.5%. Bitcoin’s lack of regulation is a blessing, but is it also a curse?

Learning from Failure at Etsy

How can I even intro a gem like this? John Allspaw’s essay on blameless and just culture at Etsy is a classic, and it’s a great read even if you’re well-versed in the topic. I especially liked the concept of the “Second Story”.

Outages

Fasthosts
Comcast Phone
- Comcast is a cable ISP in the US that also offers VoIP landline phone service.
Pokémon Go
- Multiple outages this week.
NOAA (US weather service)
- NOAA suffered a system outage and also published a bogus flood alert.
AT&T (telecom)
Charter (ISP)
PlayStation Network
Lloyds
SGX (Singapore exchange)
Iraq
- Iraq’s government shut down internet access in response to protests.
Vodacom
EA Games
