SRE Weekly Issue #143

Articles

There’s some great statistics theory in here. The challenge is: how can you have accurate, useful A/B tests without having to wait a long time to get a big enough sample size? Can you bail out early if you know the test has already failed? Can you refine the new feature mid-test?

Callie McRee and Kelly Shen — Etsy

SRE: The Biggest Lie Since Kanban | the agile admin

Don’t just rename your Ops team to “SRE” and expect anything different, says this author.

Ernest Mueller — The Agile Admin

We can do better than percentile latencies | theburningmonk.com

Great idea:

So what if we monitor the percentage of requests that are over the threshold instead? To alert us when our SLAs are violated, we can trigger alarms when that percentage is greater than 1% over some predefined time window.

Yan Cui
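
To make the idea concrete, here is a minimal Python sketch of that kind of alarm. It is not the author's implementation: the class name `ThresholdBreachMonitor`, the 500 ms threshold, and the 5-minute window are all assumptions made up for illustration. Only the 1% figure comes from the quote above.

```python
from collections import deque


class ThresholdBreachMonitor:
    """Tracks the share of requests slower than a latency threshold
    within a sliding time window (hypothetical helper, for illustration)."""

    def __init__(self, latency_threshold_ms=500.0, max_breach_ratio=0.01,
                 window_seconds=300.0):
        # All three defaults are assumed values, apart from the 1% ratio.
        self.latency_threshold_ms = latency_threshold_ms
        self.max_breach_ratio = max_breach_ratio
        self.window_seconds = window_seconds
        self._samples = deque()  # (timestamp, breached) pairs, oldest first
        self._breaches = 0       # number of breaching samples in the window

    def record(self, timestamp, latency_ms):
        """Add one request; timestamps must be non-decreasing."""
        breached = latency_ms > self.latency_threshold_ms
        self._samples.append((timestamp, breached))
        if breached:
            self._breaches += 1
        self._evict(timestamp)

    def _evict(self, now):
        """Drop samples that have fallen out of the time window."""
        cutoff = now - self.window_seconds
        while self._samples and self._samples[0][0] <= cutoff:
            _, breached = self._samples.popleft()
            if breached:
                self._breaches -= 1

    def breach_ratio(self):
        """Fraction of requests in the window that exceeded the threshold."""
        if not self._samples:
            return 0.0
        return self._breaches / len(self._samples)

    def should_alarm(self):
        """True when more than max_breach_ratio of requests were too slow."""
        return self.breach_ratio() > self.max_breach_ratio


if __name__ == "__main__":
    monitor = ThresholdBreachMonitor()
    # 198 fast requests and 2 slow ones, all inside the same window.
    for i in range(200):
        latency = 900.0 if i in (50, 150) else 120.0
        monitor.record(timestamp=float(i), latency_ms=latency)
    print(monitor.breach_ratio(), monitor.should_alarm())
```

Unlike a percentile, the ratio answers the question directly: what share of requests broke the threshold, regardless of how slow the slowest ones were.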

Dropbox traffic infrastructure: Edge network

There’s a ton of detail here, and it’s a great read. Lots of juicy tidbits about PoP selection, load balancing, and performance monitoring.

Oleg Guba and Alexey Ivanov — Dropbox

Full disclosure: Fastly, my employer, is mentioned.

Preliminary Report Pipeline: Over-pressure of a Columbia Gas of Massachusetts Low-pressure Natural Gas Distribution System

Even as a preliminary report there’s a lot to digest here about what caused the series of gas explosions last month in Massachusetts (US). I feel like I’ve been involved in incidents with similar contributing factors.

US National Transportation Safety Board (NTSB)

What I learned by bringing down LinkedIn.com – VentureBeat

This isn’t just a recap of a bad day, although the outage description is worth reading by itself. Readers also gain insight into the evolution of this engineer’s career and mindset, from entry-level to Senior SRE.

Katie Shannon — LinkedIn

https://about.gitlab.com/2018/10/11/gitlab-com-stability-post-gcp-migration/

GitLab, in their trademark radically open style, goes into detail on the reasons behind the recent increase in the reliability of their service.

Andrew Newdigate — GitLab

Getting to 99.999% Availability with Twilio’s Tyler Wells

Five nines are key when you consider that Twilio’s service uptime can literally mean life and death. Click through to find out why.

Charlie Taylor — Blameless

Outages

Travis CI
Google Compute Engine us-central1-c
- I can’t really summarize this incident report well, but I highly recommend reading it.
Azure
- Duplicated here since I can’t deep-link:
  
  Summary of impact: Between 01:22 and 05:50 UTC on 13 Oct 2018, a subset of customers using Storage in East US may have experienced intermittent difficulties connecting to resources hosted in this region. Other services leveraging Storage in the region may have also experienced impact related to this incident.
Instagram
Heroku
- This one’s notable for the duration: about 10 days of diminished routing performance due to a bad instance.
Microsoft Outlook
