SRE Weekly Issue #155

Articles

A developer’s perspective on why being on call is important and how to structure it fairly (hint: compensation).

Henrik Warne

Interpreting Kafka’s Exactly-Once Semantics

The Conclusion section sums it up nicely:

In this post, we talked about various delivery guarantee semantics such as at-least-once, at-most-once, and exactly-once. We also talked about why exactly-once is important, the issues in the way of achieving exactly-once, and how Kafka supports it out-of-the-box with a simple configuration and minimal coding.

Rahul Agarwal — DZone

DevOps Discussions: Postmortem Chat – Part 1 (YouTube)

This is a riveting discussion about retrospective analysis of incidents, hosted by Microsoft. Throughout the discussion, there’s an emphasis on learning from incidents as opposed to simply coming up with action items.

Note: one of the panelists is my fellow employee at Fastly.

Jessica DeVita — Microsoft, with Duck Lawn (Pushpay), Tom Griffin (Pushpay), Sue Allspaw Pomeroy (Fastly), John Allspaw (Adaptive Capacity Labs) and Dr. Richard Cook (Adaptive Capacity Labs)

An Agile SRE Meeting Plan

If you’re looking for a blueprint of how to structure your SRE organization’s meetings, this is a great resource.

Dave Mangot

Designing resilient systems: Circuit Breakers or Retries? (Part 2)

This post is the second part of the series on Designing Resilient Systems. In Part 1, we looked at use cases for implementing circuit breakers. In this second part, we will do a deep dive on retries and its use cases, followed by a technical comparison of both approaches.

This article is really thorough and includes a section on combining retries with circuit breakers.

Corey Scott — Grab
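
For readers who want a feel for how retries and a circuit breaker fit together, here is a minimal sketch in Python. It is not taken from the Grab article. The names `CircuitBreaker`, `CircuitOpenError`, and `call_with_retries`, along with the thresholds and delays, are assumptions chosen for illustration. The idea is that retries with exponential backoff absorb brief failures, while the breaker stops all calls for a cool-down period once failures pile up.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is rejected without being attempted."""


class CircuitBreaker:
    """Hypothetical minimal breaker: opens after N consecutive failures."""

    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after  # cool-down in seconds (assumed value)
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; failing fast")
            # Cool-down elapsed: let one trial call through (half-open).
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result


def call_with_retries(breaker, fn, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Retry fn through the breaker with exponential backoff."""
    for attempt in range(attempts):
        try:
            return breaker.call(fn)
        except CircuitOpenError:
            raise  # Never retry against an open circuit.
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

Wrapping the breaker inside the retry loop, rather than the other way around, means every attempt counts toward the failure threshold, so a struggling dependency is not hammered by retries once the circuit opens.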

Towards Successful Resilient Software Design

The problem is that most advice how to “get design right” only applies to design inside a process boundary. Most of those advices do not work well if applied to distributed systems.

What I have learnt over time is that we basically need to re-learn how to design systems, i.e., how to spread the functionality in a distributed environment.

Uwe Friedrichsen — InfoQ

Courier: Dropbox migration to gRPC

This really stood out to me:

In practice, we have fixed whole classes of reliability problems by forcing engineers to define deadlines in their service definitions.

Ruslan Nigmatullin and Alexey Ivanov — Dropbox
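
To make the deadline idea concrete, here is a small sketch in Python. It does not use gRPC or Courier, and the names `Deadline`, `DeadlineExceeded`, and `handle_request` are hypothetical. It assumes a request carries one absolute deadline and that each downstream step receives only the time budget that remains, so a slow step cannot cause later work to run past the caller's limit.

```python
import time


class DeadlineExceeded(Exception):
    """Raised when the request's time budget is used up."""


class Deadline:
    """Hypothetical absolute deadline shared by all steps of one request."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.clock = clock
        self.expires_at = clock() + timeout_s

    def remaining(self):
        # Seconds left in the budget, never negative.
        return max(0.0, self.expires_at - self.clock())

    def check(self):
        if self.remaining() <= 0.0:
            raise DeadlineExceeded("deadline exceeded")


def handle_request(deadline, steps):
    """Run each step in order, passing it the remaining budget as its timeout."""
    results = []
    for step in steps:
        deadline.check()  # Stop before starting work that cannot finish in time.
        results.append(step(deadline.remaining()))
    return results
```

Each step here is any callable that accepts a timeout in seconds. In a real service it would pass that value on as the timeout of its own outbound call.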

Outages

Fastly
- Fastly had the above issue in its MDW PoP and also a repeat.
  Full disclosure: Fastly is my employer.
Zoom
Slack
- Android notifications were busted.
Gov Availability
- This site shows a live-updated availability percentage for the US Government. As of now, the “This Year” percentage is stuck at infinite zeroes (due to our current government shutdown). On a less tongue-in-cheek note, lots of US Government sites have expired TLS certificates because employees aren’t there to renew them.
Duo Security
Azure Storage (UK south region)
GitHub
Reddit
YouTube
Tinder
Google Cloud Platform (various API functions)
- Google engineers began rolling out a new feature designed to improve the fault-tolerance of the metadata store.
  
  Ironically, that rollout took down the metadata store.
