SRE Weekly Issue #126

Articles

Introducing Grab-Kit: Distributed Service Design at Grab

Our friends in the GrabFood team now save up to 70% development time on creating a new service. We have also recorded improvements in stability and availability of our services.

Karen Kue and Michael Cartmell — Grab

10 ways to avoid CDN issues at peak

Some tips on surviving peak traffic as we head into World Cup season. I like the discussion in #10 (load testing): accurately testing your CDN is all but impossible.

Hadar Weiss — Peer 5 (CDN)

A story of being on call

This is a video recording of a talk by Charity Majors at Monkigras 2018. She has a lot of awesome stuff to say about making on-call enjoyable and owning your code, including this gem:

Babies, by the way, are engineered by evolution to be too cute for you to want to kill them. Your code is not.

Charity Majors — Honeycomb

Heroku Incident #1561 Followup (Platform-wide Outages)

A power disruption occurred at our service provider resulting in a number of instances going offline. Heroku databases running on these instances were impacted.

Presumably this was the us-east-1 power issue I reported on in Issue 124.

Future of Reliability Engineering (Part 1)

The first article in this new series is about the evolution of the Network Engineer into a Network Reliability Engineer. It’s part of the broader breakdown of silos with the goal of understanding holistic reliability.

Michael Kehoe

Protecting network availability for GDPR compliance

I hadn’t realized that GDPR has provisions related to site/service reliability.

Theresa Abbamondi — Netscout

Sample Your Traffic: But Keep The Good Stuff

To shamelessly steal a line from this recorded talk, it’s very rarely the right thing for your observability system’s scale to match that of the system it’s observing. To avoid that, you need to throw away some event data rather than storing and indexing everything. How do you do that while still achieving functioning observability?

Ben Hartshorne — Honeycomb
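
To make the idea concrete, here is a rough sketch of my own, not code from the talk. It assumes each event is a dict with a hypothetical `status` field, keeps every server error, and keeps roughly 1 in `base_rate` of everything else. Each kept event records the rate it was sampled at, so totals can still be estimated afterward. The function names and the `sample_rate` field are illustrative assumptions.

```python
import random


def sample_events(events, base_rate=10, rng=None):
    """Keep every error event; keep roughly 1 in `base_rate` of the rest.

    `events` is an iterable of dicts with a numeric "status" key
    (a hypothetical event shape chosen for this sketch).
    """
    rng = rng or random.Random()
    kept = []
    for event in events:
        # Errors are the "good stuff": never drop them.
        rate = 1 if event["status"] >= 500 else base_rate
        if rate == 1 or rng.randrange(rate) == 0:
            # Record the rate so each kept event can stand in for `rate` originals.
            kept.append({**event, "sample_rate": rate})
    return kept


def estimated_total(kept):
    """Estimate how many events were seen before sampling."""
    return sum(event["sample_rate"] for event in kept)
```

Passing a seeded `random.Random` as `rng` makes a run repeatable, which helps when testing. A fixed rate for non-errors is the simplest possible choice; a real system would likely vary the rate by traffic volume or event type.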

Automated Database Deployments Series Kick Off

I’m looking forward to seeing where this article series goes. Database changes can be a huge reliability risk, and getting them right is critical.

Bob Walker — Octopus Deploy

Outages

Azure south-central US region
- A load spike in a backend storage system caused impact across a range of Azure services, according to the RCA linked above.
  Actually, I’ve linked to their generic “status history” page, since that seems to be as specific as I can get. Readers from Microsoft, perhaps you could ask the folks that run the Azure status page to create dedicated permalinks for each incident, or at least for each RCA? Even an anchor link in the status history page would be super-awesome!
New Relic infrastructure alerting
Travis CI
WhatsApp
American Airlines
Instagram
Google Compute Engine
- While instances were stopped (shut down), newly-launched instances were allowed to take their IPs. The stopped instances then failed on startup due to the IP conflicts. The situation lasted for around 20 hours.
Optus Sport
- World Cup fans had issues watching through Optus. World Cup streaming traffic is massive this time around.
Apple Maps
Netflix
.my TLD
