Articles
There’s some great statistics theory in here. The challenge is: how can you have accurate, useful A/B tests without having to wait a long time to get a big enough sample size? Can you bail out early if you know the test has already failed? Can you refine the new feature mid-test?
Callie McRee and Kelly Shen — Etsy
Don’t just rename your Ops team to “SRE” and expect anything different, says this author.
Ernest Mueller — The Agile Admin
Great idea:
So what if we monitor the percentage of requests that are over the threshold instead? To alert us when our SLAs are violated, we can trigger alarms when that percentage is greater than 1% over some predefined time window.
Yan Cui
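For illustration, here's a minimal sketch of that idea (the 500 ms threshold, the 5-minute window, and the function name are my own assumptions, not from the article): compute the fraction of requests slower than the SLA threshold over a time window and alarm when it exceeds 1%.

```python
from datetime import datetime, timedelta

SLA_THRESHOLD_MS = 500         # hypothetical SLA: requests should finish within 500 ms
VIOLATION_BUDGET = 0.01        # alarm when more than 1% of requests breach the SLA
WINDOW = timedelta(minutes=5)  # hypothetical evaluation window

def should_alarm(requests, now=None):
    """requests: iterable of (timestamp, latency_ms) tuples.

    Returns True when the fraction of requests over the SLA threshold
    within the evaluation window exceeds the violation budget.
    """
    now = now or datetime.utcnow()
    in_window = [latency for ts, latency in requests if now - ts <= WINDOW]
    if not in_window:
        return False
    over = sum(1 for latency in in_window if latency > SLA_THRESHOLD_MS)
    return over / len(in_window) > VIOLATION_BUDGET
```

In practice this calculation would more likely live in the monitoring layer (e.g. as metric math over a percentage metric) than in application code, but the logic is the same.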
There’s a ton of detail here, and it’s a great read. Lots of juicy tidbits about PoP selection, load balancing, and performance monitoring.
Oleg Guba and Alexey Ivanov — Dropbox
Full disclosure: Fastly, my employer, is mentioned.
Even as a preliminary report, there’s a lot to digest here about what caused the series of gas explosions last month in Massachusetts (US). I feel like I’ve been involved in incidents with similar contributing factors.
US National Transportation Safety Board (NTSB)
This isn’t just a recap of a bad day, although the outage description is worth reading by itself. Readers also gain insight into the evolution of this engineer’s career and mindset, from entry-level to Senior SRE.
Katie Shannon — LinkedIn
GitLab, in their trademark radically open style, goes into detail on the reasons behind the recent increase in the reliability of their service.
Andrew Newdigate — GitLab
Five nines is key when you consider that Twilio’s service uptime can literally be a matter of life and death. Click through to find out why.
Charlie Taylor — Blameless
Outages
- Travis CI
- Google Compute Engine us-central1-c
- I can’t really summarize this incident report well, but I highly recommend reading it.
- Azure
- Duplicated here since I can’t deep-link:
Summary of impact: Between 01:22 and 05:50 UTC on 13 Oct 2018, a subset of customers using Storage in East US may have experienced intermittent difficulties connecting to resources hosted in this region. Other services leveraging Storage in the region may have also experienced impact related to this incident.
- Heroku
- This one’s notable for the duration: about 10 days of diminished routing performance due to a bad instance.
- Microsoft Outlook