SRE Weekly Issue #90

View on sreweekly.com

A couple of DNS-related links this week. I’ll be giving a talk at Velocity NYC on all of the fascinating things I learned about DNS in the wake of the Dyn DDoS and the .io TLD outage last fall. If you’re there, hit me up for some SRE Weekly swag!

Articles

On the design of distributed programming models

We’re all becoming distributed systems engineers, and this stuff sure isn’t easy.

Isn’t distributed programming just concurrent programming where some of the threads happen to execute on different machines? Tempting, but no.

Continuous self-testing at Hosted Graphite — why we send external canaries, every second

Every-second canarying is a pretty awesome concept. Not only that, but they even post the results on their status page. Impressive!

Mea culpa: Lessons learned from firefighting

So many lessons! My favorite is to make sure you test the “sad path”, as opposed to just the “happy path”. If a customer screws up their input and then continues on correctly from there on, does everything still work?

SRECon EMEA 17 Notes

Extensive notes taken during 19 talks at SRECon 17 EMEA. I’m blown away by the level of detail. Thanks, Aaron!

CPU Performance Checklist

A cheat sheet and tool list for diagnosing CPU-related issues. There’s also one on network troubleshooting by the same author. Note: LinkedIn login required to view.

Antifragility 101 – Production Ready

Antifragility is an interesting concept that I was previously unaware of. I’m not really sure how to apply it practically in an infrastructure design, but I’m going to keep my eye out for antifragile patterns.

Eight reasons why you should conduct a DNS audit

It’s easy to overlook your DNS, but a failure can take your otherwise perfectly running infrastructure down — at least from the perspective of your customers.

An Often Overlooked Tool In Workplace Safety Prevention: The Near-Miss Investigation

Do you run a retrospective on near misses? The screws they tightened in this story could just as easily be databases quietly running at max capacity.

A piece of one of the venting systems fell and almost hit an employee which almost certainly would have caused a serious injury and possibly death. The business determined that (essentially) a screw came loose causing the part to fall. It then checked the remaining venting systems and learned that other screws had starting becoming loose as well and was able to resolve the issue before anyone got hurt.

Introducing Azure Availability Zones for resiliency and high availability

Oh look, Azure has AZs now.

What’s in a transport layer?

The transport layer in question is gRPC, and this article discusses using it to connect a microservice-based infrastructure. If you’ve been looking for an intro to gRPC, check this out.

Beyond Lights-Out: Future Data Centers Will Be Human-Free

How do you prevent human error? Remove the humans. Yeah, I’m not sure I believe it either, but this was still an interesting read just to learn about the current state of lights-out datacenters.

Documenting your architecture: Wireshark, PlantUML and a REPL to glue them all.

This is a really neat idea: generate an interaction diagram automatically using a packet capture and a UML tool.

Thanks to Devops Weekly for this one.

Outages

.io
- The .io TLD went down again, in exactly the same way as last fall.
PagerDuty
- PagerDuty suffered a major outage lasting over 12 hours this past thursday. Customers scrambled to come up with other alerting methods.
  Some really excellent discussion around this incident happened on the hangops slack in the #incident_response channel. I and others requested more details on the actual paging latency and PagerDuty delivered them on their status site. Way to go, folks!
StatusPage.io
- I noticed this minor incident after getting a 500 reloading PagerDuty’s status page.
The Travis CI Blog: Sept 6 – 11 macOS outage postmortem
- This week, Travis posted this followup describing the SAN performance issues that impacted their system.
Outlook and Hotmail

SRE Weekly Issue #90

Articles

Outages

Subscribe

RSS

Mastodon

Search Issues

SPONSOR MESSAGE

Articles

Outages

Subscribe

RSS

Mastodon

Search Issues