SRE Weekly Issue #82

Articles

Case studies in cloud migration: Netflix, Pinterest, and Symantec – Increment issue 2: Cloud

Increment issue #2 is out! Want to hear what it was like for these three big companies to move to the cloud? Read on.

An engineer’s guide to cloud capacity planning – Increment issue 2: Cloud

This article covers a lot of ground, from general strategy to specific methods for estimating capacity needs. I love this:

Perhaps surprisingly for engineers who work in mission-critical business applications, occasional spikes of 90%+ of our users being entirely unable to use the sole application of our company was an entirely acceptable engineering tradeoff versus sizing our capacity against our peak loads.

The strange geography of content delivery networks

I love the insight this article gives me into the huge networks of big CDNs.

Reducing MTTR

Key point: don’t count your chickens before they’ve recovered.

The MTTR time should be stopped when there is verification that all systems are once again operating as expected and end users are no longer negatively affected
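To make the quoted rule concrete, here is a minimal sketch of an MTTR calculation that stops the clock at verified recovery rather than at the moment a fix is deployed. The field names (`started_at`, `fix_deployed_at`, `verified_at`) and the two sample incidents are invented for illustration and do not come from the linked article.

```python
from datetime import datetime, timedelta


def mean_time_to_recovery(incidents):
    """Average incident duration, measured from start to *verified* recovery.

    Each incident is a dict with hypothetical keys 'started_at' and
    'verified_at' (datetime objects). 'fix_deployed_at' is ignored on
    purpose: deploying a fix does not stop the clock.
    """
    durations = [inc["verified_at"] - inc["started_at"] for inc in incidents]
    return sum(durations, timedelta()) / len(durations)


# Invented sample data, for illustration only.
incidents = [
    {
        "started_at": datetime(2017, 7, 1, 10, 0),
        "fix_deployed_at": datetime(2017, 7, 1, 10, 20),
        "verified_at": datetime(2017, 7, 1, 10, 45),
    },
    {
        "started_at": datetime(2017, 7, 8, 14, 0),
        "fix_deployed_at": datetime(2017, 7, 8, 14, 10),
        "verified_at": datetime(2017, 7, 8, 14, 15),
    },
]

mttr = mean_time_to_recovery(incidents)
print(mttr)
```

Measuring to `fix_deployed_at` instead would give a smaller average for the same incidents, which is exactly the premature chicken-counting the article warns about.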

In DevOps Incident Response, Plans Are Worthless, But Planning Is Everything

Scalyr explains how to move beyond specific playbooks to create a general incident response plan.

Dead man’s switch with AWS CloudWatch: Freshness-Alerting for Backups and Co

Here’s a nice little how-to:

A recent challenge for one of the teams I am currently involved with was to find a way in AWS CloudWatch:

To alert if the metric breaches a specified threshold.

To alert if a particular metric has not been sent to CloudWatch within a specified interval.
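One way both requirements could be met with a single alarm is CloudWatch's `TreatMissingData` setting, which lets missing data points count as a breach. The sketch below only builds the parameter dictionary for the `PutMetricAlarm` API. The namespace, metric name, threshold, and comparison direction are assumptions made for illustration, and this is not necessarily the approach the linked how-to takes.

```python
def freshness_alarm_params(namespace, metric_name, threshold, period_seconds):
    """Build PutMetricAlarm parameters for a combined threshold/freshness alarm.

    The alarm fires when the metric exceeds `threshold` OR when no data
    point arrives within one period (missing data is treated as breaching).
    """
    return {
        "AlarmName": f"{metric_name}-breach-or-missing",  # hypothetical naming scheme
        "Namespace": namespace,
        "MetricName": metric_name,
        "Statistic": "Maximum",
        "Period": period_seconds,
        "EvaluationPeriods": 1,
        "Threshold": threshold,
        # Assumption: "breaches" means "rises above"; flip if lower is worse.
        "ComparisonOperator": "GreaterThanThreshold",
        # The dead man's switch: silence counts as a breach.
        "TreatMissingData": "breaching",
    }


# Hypothetical metric: age of the most recent backup, reported hourly.
params = freshness_alarm_params("Backups", "BackupAgeHours", 24, 3600)
print(params["AlarmName"], params["TreatMissingData"])
```

With boto3 installed and credentials configured, a dictionary like this could be passed as `boto3.client("cloudwatch").put_metric_alarm(**params)`. A real alarm would also need `AlarmActions` so that a notification is actually sent.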

High Availability Prometheus Alerting and Notification

And another short how-to, this one on deploying Prometheus with HA.

I won’t tell you to stop working, but I can try to help you not burn out

Self-care is critical in tech, not only for us as individuals, but for the health and reliability of the entire organization. Overstretched engineers make mistakes. This article introduces a new resource: selfcare.tech, which is a curated, open-source repository of self-care resources.

Outages

Today’s Outage · GitHub
- Old but good: this post-incident report from GitHub in 2010 recounts an outage caused by inadvertently running an automated test script against a production DB.
Pokémon Go Chicago event issues ticket refunds after widespread outage
- 20,000 people in one place trying to play Pokémon Go was apparently enough to overload several mobile phone networks.
YouTube
