Suppose they did have the cellular architecture today, but every other fact was ...

HumanOstrich · 2025-11-19T02:05:20 1763517920

What variant of cellular architecture are you referring to? Can you give me a link or a few? I'm fascinated by it, and I've led a team to break up a monolithic solution running on AWS into a cellular architecture. The results were good, but not magic. The process of learning from failures did not stop, but it did change (for the better).

No matter what architecture, processes, software, frameworks, and systems you use, or how exhaustively you plan and test for every failure mode, you cannot 100% predict every scenario and claim "cellular architecture fixes this". This includes making 100% of all failures "contained". Not realistic.

otterley · 2025-11-19T02:10:55 1763518255

If your AWS service is properly regionalized, that’s the minimum amount of cellular architecture required. Did your service ever fail in multiple regions simultaneously?

Cellular architecture within a region is the next level and is more difficult, but is achievable if you adhere to the same principles that prohibit inter-regional coupling:

https://docs.aws.amazon.com/wellarchitected/latest/reducing-...

HumanOstrich · 2025-11-19T02:21:56 1763518916

You didn't really put any thought into what I said. Thanks for the links.

otterley · 2025-11-19T02:40:17 1763520017

It wasn't worth thinking about. I'm not going to defend myself against arguments and absolute claims I didn't make. The key word here is mitigation, not perfection.

hedora · 2025-11-19T03:09:11 1763521751

> If your AWS service is properly regionalized, that’s the minimum amount of cellular architecture required

Amazon has had multi-region outages due to pushing bad configs, so it’s extremely difficult to believe that whatever you are proposing solves that exact problem by relying on multiple regions.

Come to think of it, Cloudflare’s outage today is another good counterexample.

otterley · 2025-11-19T03:53:21 1763524401

It has been a very, very long time since AWS had a simultaneous failure across multiple regions. Even customers impacted by the loss of Route 53 control plane functionality in last month’s us-east-1 incident were able to gracefully fail over to a backup region if they had configured failover records in advance, had Application Recovery Controller set up, or fronted their APIs or websites with Global Accelerator.

Customers survive incidents on a daily basis by failing over across regions (even in the absence of an AWS regional failure, their workloads can fail due to a bad deployment or other cause). The reason you don’t hear about it is that it works.

tptacek · 2025-11-19T01:34:13 1763516053

Pretty sure he's making my point (or, rather, me his) there. (I'm never going to turn down an opportunity to nerd out about Cookism).