SRE Weekly Issue #291

Articles

Understanding How Facebook Disappeared from the Internet

Facebook’s outage caused significantly increased load on DNS resolvers, among other effects. Cloudflare also published this follow-up article with more findings.

Celso Martinho and Sabina Zejnilovic — Cloudflare

The New Norm

Shell (the oil company) reduced accidents by 84% by teaching roughnecks to cry. Listen to this podcast (or check it out in article form) to find out how. Can we apply this to SRE?

Alix Spiegel and Hanna Rosin — NPR’s Invisibilia

Google’s State of DevOps 2021 Report: What SREs Need to Know

Don’t have time to read Google’s entire report? Here are the highlights.

Quentin Rousseau — Rootly

More details about the October 4 outage

I really like how open Facebook engineering has been about what went wrong on Monday. This article is an update on their initial post.

Santosh Janardhan — Facebook

Tools to explore BGP

Want to learn about BGP? Ride along as Julia Evans learns. I especially like how she whipped out strace to figure out how traceroute was determining ASNs.

Julia Evans

Announcing the VOID

The Verica Open Incident Database is an exciting new project that seeks to create a catalog of public incident postings. Click through to check out the VOID and read the inaugural paper with initial findings. I’m really excited to see what this project brings!

Courtney Nash — Verica

‘date -d’ vs. ‘date -s’, and ‘show foo’ vs. ‘clear foo’

Printing versus setting a date — they’re only separated by a typo. Perhaps something similar happened with Facebook’s outage.

rachelbythebay

SRE Doesn’t Scale

Adopting a microservice architecture can strain your SRE. This article highlights an oft-missed section of the SRE book about scaling SRE.

Tyler Treat

Outages

Facebook, Instagram, WhatsApp, and Oculus
- Well, that sure was a big one. Facebook and related services were totally down for 6+ hours, including their DNS servers. They also had another, smaller outage later in the week.
Slack
- Not a real outage, but Slack reported that users were having a hard time connecting to Slack, because resolvers were overloaded by DNS lookups for facebook.com.
NordVPN
PayPal
