SRE Weekly Issue #36


Last week’s DevOps & SRE AMA was super fun! Thanks to the panelists for participating. Recordings should be posted soon.

Articles

Multi data center redundancy – sysadmin considerations

This is the second half of Server Density’s series on the lessons they learned as they transitioned to a multi-datacenter architecture. There are lots of interesting tidbits in here, such as an explanation of how they handle failover to the secondary DC and what they do if that goes wrong.

Full disclosure: Heroku, my employer, is mentioned.

How Complex Web Systems Fail

Here’s the second half of Mathias Lafeldt’s series that seeks to apply Richard Cook’s How Complex Systems Fail to web systems. The article is great, but the really awesome part is the thoughtful responses by Cook himself to both parts one and two, linked at the end of this article.

Why Reddit was down on Aug 11

Here’s a postmortem for last week’s outage that involved a migration gone awry.

Thanks to Jonathan Rudenberg for this one.

US Patent Office sued after it declared a power outage a ‘national holiday’

A patent holding firm alleges that the USPTO overstepped its authority when it declared that a system outage (reported in issue #4) would be treated as a national holiday for the purposes of deadlines, and that this led to the plaintiff being sued.

Know Anyone With This High-Burnout Job?

Burnout is a crucially important consideration in a field with on-call work. VictorOps has a few tips for alleviating burnout gleaned from this year’s Monitorama.

Staging Servers Are Dead!

Edith Harbaugh says that staging servers present a reliability risk that outweighs their benefit. This article is an update to her original article, which I also recommend reading.

Context aware MySQL pools via HAProxy

GitHub uses HAProxy to balance across its read-only MySQL replicas, which is a common method. Their technique for excluding lagging nodes while avoiding emptying the pool entirely when all nodes are lagging is pretty neat.
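
To make the idea concrete, here is a minimal sketch of the selection rule in Python. This is not GitHub's HAProxy configuration; it only illustrates the logic of "prefer replicas that are caught up, but fall back to lagging ones rather than serve from an empty pool." The `Replica` type, the `choose_pool` function, and the 5-second threshold are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Replica:
    """Hypothetical view of a read replica as seen by a health check."""
    name: str
    reachable: bool
    lag_seconds: float


def choose_pool(replicas: List[Replica], max_lag_seconds: float = 5.0) -> List[Replica]:
    """Return the replicas that should receive read traffic.

    The threshold is an assumed value, not one taken from the article.
    """
    # Unreachable replicas are never eligible.
    alive = [r for r in replicas if r.reachable]
    # Replicas that are caught up closely enough to serve fresh reads.
    fresh = [r for r in alive if r.lag_seconds <= max_lag_seconds]
    # Prefer fresh replicas. If every live replica is lagging, serve from
    # the lagging ones instead of emptying the pool.
    return fresh if fresh else alive


if __name__ == "__main__":
    replicas = [
        Replica("db1", reachable=True, lag_seconds=0.4),
        Replica("db2", reachable=True, lag_seconds=42.0),
        Replica("db3", reachable=False, lag_seconds=0.0),
    ]
    print([r.name for r in choose_pool(replicas)])
```

The trade-off in this sketch is that stale reads are treated as better than no reads at all, which suits read-mostly workloads but would be the wrong choice where freshness is a hard requirement.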

Thanks to Devops Weekly for this one.

Serverless Architectures

A highly detailed deep dive on Serverless: what it means, its benefits, and its drawbacks. I especially enjoyed the #NoOps section:

[Ops] also means at least monitoring, deployment, security, networking and often also means some amount of production debugging and system scaling. These problems all still exist with Serverless apps and you’re still going to need a strategy to deal with them. In some ways Ops is harder in a Serverless world because a lot of this is so new.

#ServerlessIsMadeOfServers

Full disclosure: Heroku, my employer, is mentioned.

Outages

Slack
- A relatively minor issue, but it impacted me, so I logged it here while awaiting resolution.
MTN (mobile telecom)
Google Cloud Status Dashboard
- Postmortem included, with an interesting cause:
  
  During mitigation of a lower impact performance issue, Google engineers made a configuration change to pipeline orchestration. An error in this configuration caused validation within the orchestration component to reject all requests.
Tesla Vehicles
Xbox Live
Sky (ISP)
Facebook
Apple’s App Store
Twitter
Cisco Jasper
Optus
AT&T
NSA
