SRE Weekly Issue #208

Articles

There’s so much in this article:

- how to recognize when your system may be susceptible to cascading failure
- how to prevent it
- how to deal with it when it happens (and how hard that can be)

Laura Nolan — Slack

Catchpoint’s SRE Survey 2020 Is Here

It’s time for this year’s SRE Survey. Don’t forget that with each completed survey, Catchpoint donates $5 to charity.

This growing demand [for SREs] is not without growing pains as a skills gap problem has emerged due to the fact that SRE training requires a hands-on, interactive learning environment.

Peter Murray — Catchpoint

Resilience Roundup – Above the Line, Below the Line

Both the summary and the original article are well worth reading. This stood out to me:

As much as we may think of incidents as taking place in all those technical parts of the system below the line, incidents actually take place above it

Thai Wood (summary)

Dr. Richard Cook (original article)

The Jellyfish-Inspired Database Under AWS Block Storage

The EBS control plane data store resembles a “jellyfish” (actually a Physalia, a.k.a. Portuguese man-of-war).

Timothy Prickett Morgan — The Next Platform

The Problem with Microservices: ‘Deep Systems’

Ideal: each team manages their microservice(s) in isolation.

Reality: microservices interact in unexpected ways and a broader system emerges that has remarkable similarities to running a monolith.

Ben Sigelman — LightStep

SRE for single-tiered software applications

This one discusses how to handle SRE for a monolith, with some examples of what often goes wrong.

Eric Harvieux — Google

Trying to sneak in a sketchy .so over the weekend

The author blocked an unexpected Sunday deploy of untested code, and it turned out to be a good thing they did.

rachelbythebay

Outages

GitHub
NPM
- Linked is an interesting explanation from Cloudflare, posted as a comment on a GitHub issue.
New Relic
PagerDuty
Fidelity
- Fidelity customers saw a $0 balance for their 401(k) [US retirement] accounts.
Microsoft Office 365 & Outlook down – Users getting service unavailable error
Heathrow Airport (London, UK)
Zillow
Indeed
Kobo
Heroku
Squarespace
- Also this one.
reddit
