SRE Weekly Issue #149

Articles

Cloud Functions pro tips: Using retries to build reliable serverless systems | Google Cloud Blog

But does that mean you don’t need to think about reliability issues associated with large-scale distributed systems? The answer is, not completely. While there are many things that GCP and Cloud Functions handle behind the scenes, you still need to keep a couple of best practices in mind while building a reliable serverless solution.

Slawomir Walkowski — Google

Emotet malware runs on a dual infrastructure to avoid downtime and takedowns

The Emotet malware gang is probably managing their server infrastructure better than most companies are running their internal or external IT systems.

Catalin Cimpanu — Zero Day

DBMS Musings: Replication and the latency-consistency tradeoff

Designing a distributed data store is about juggling competing priorities. This author discusses the latency penalty you pay for synchronous replication, and why you might want it anyway.

Daniel Abadi

Capacity planning for Etsy’s web and API clusters

Learn how Etsy designed tooling and a repeatable process to forecast resource usage.

Daniel Schauenberg — Etsy

Orchestrating Chaos using Grab’s Experimentation Platform

Check out how Grab implemented chaos engineering.

Roman Atachiants, Tharaka Wijebandara, Abeesh Thomas — Grab

Predictive test selection to ensure reliable code changes

Neat idea: use machine learning to select which automated tests to run for a given code change. The goal is a high likelihood of finding bugs while running fewer tests than traditional methods.

Mateusz Machalica, Alex Samylkin, Meredith Porth, and Satish Chandra — Facebook

Progressive Service Architecture At Auth0

In this blog post, we are going to discuss how the Auth0 Site Reliability team, led by Hernán Meydac Jean, used a progressive approach to build a mature service architecture characterized by high availability and reliability.

The system in question is a home-grown feature flags implementation.

Dan Arias — Auth0

Outages

The usual glut of Black Friday outages. I hope you all had an uneventful Friday.

J. Crew
Lowe’s
Netatmo (smart thermostats)
John Lewis
AWS in Seoul, South Korea
- The outage took down multiple AWS customers including banks and a cryptocurrency exchange.
Walmart
Makro
Facebook
LastPass
Microsoft Azure
- A detailed follow-up post describes three distinct “root” causes.
