SRE Weekly Issue #160

Articles

This is a long one, but trust me, it’s worth the read. My favorite part is where the author gets into mental models, hearkening back to the Stella Report.

Fred Hebert

Multi-CDN support in Mux for improved performance and reliability

When CDN outages occur, it becomes immediately clear who is using multiple CDNs and who is not.

A multi-CDN approach can be tricky to pull off, but as these folks explain, it can be critical for reliability and performance.

Scott Kidder — Mux

Full disclosure: Fastly, my employer, is mentioned.

Towards an understanding of technical debt

This article explains five different phenomena that people mean when they say “technical debt”, and advocates understanding the full context rather than just assuming the folks that came before were fools.

/thanks Greg Burek

Kellan Elliott-McCrea

How We Prepared New York Times Engineering for the Midterm Elections

The work we did to get our teams aligned and our systems in good shape meant that we were able to scale, even with some services getting 40 times the normal traffic.

Kriton Dolias and Vinessa Wan — The New York Times

@mipsytipsy on Twitter: what to alert on

How does one resolve the emerging consensus for alerting exclusively on user-visible outages, with the undeniable need to learn about and react to things *before* users notice? Like a high cache eviction rate?

There’s a real gem in here, definitely worth a read.

Charity Majors (and Liz Fong-Jones in reply)

Notes from On-call Adjacency – Honeycomb

Being on-call will always involve getting woken up occasionally. But when that does happen, it should be for something that matters, and that the on-call person can make progress toward fixing.

Rachel Perkins — Honeycomb

How we used delayed replication for disaster recovery with PostgreSQL

Delayed replication can be used as a first resort to recover from accidental data loss and lends itself perfectly to situations where the loss-inducing event is noticed within the configured delay.

Andreas Brandl — GitLab
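
To make the "noticed within the configured delay" condition concrete, here is a minimal sketch (mine, not from the GitLab post). It assumes a replica that applies changes a fixed interval behind the primary, as PostgreSQL's `recovery_min_apply_delay` setting does. The function names and the 8-hour delay are hypothetical, chosen only for illustration.

```python
from datetime import datetime, timedelta, timezone


def replica_still_has_data(event_time, noticed_time, apply_delay):
    """Return True if a replica lagging by apply_delay has not yet
    replayed the loss-inducing event at the moment it was noticed."""
    return noticed_time - event_time < apply_delay


def time_left_to_pause(event_time, now, apply_delay):
    """Return how long remains before the delayed replica replays the
    event (zero if the window has already closed)."""
    remaining = apply_delay - (now - event_time)
    return max(remaining, timedelta(0))


# Hypothetical scenario: an accidental delete at 12:00 UTC, noticed at 15:00 UTC,
# with an assumed 8-hour apply delay on the replica.
delay = timedelta(hours=8)
event = datetime(2019, 2, 17, 12, 0, tzinfo=timezone.utc)
noticed = datetime(2019, 2, 17, 15, 0, tzinfo=timezone.utc)

recoverable = replica_still_has_data(event, noticed, delay)
remaining = time_left_to_pause(event, noticed, delay)
```

In this scenario the event is caught inside the window, so there is still time to pause replay on the replica and pull the lost data from it. Once the window closes, recovery has to fall back to backups.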

Outages

Azure Kubernetes Service (US East)
- There’s a pretty interesting incident description in their history page.
VFEmail
- Via Twitter:
  
  At this time, the attacker has formatted all the disks on every server. Every VM is lost. Every file server is lost, every backup server is lost. NL was 100% hosted with a vastly smaller dataset. NL backups by the provider were intact, and service should be up there.
  
  My sympathies, folks.
Slack
- Emails into Slack were failing due to an expired TLS certificate.
Squarespace
- Linked is their followup post explaining more about the incident.
JPMorgan Chase
Instagram
Strava and Garmin Connect
Microsoft Windows Update
Snapchat
Sydney, AU Train Network
Lloyds Bank
