Excellent article. When the memory model in the Linux kernel was being improved to better support additional architectures and techniques like Read-Copy-Update, we had many deep discussions about hardware details like this. Jumping into this level of complexity is pretty mind-boggling when you first encounter it.
In the programming I do, this kind of low-level detail hardly ever matters. When I do multi-threading, I use it only in the simplest, most stereotypical ways possible, and I use locking very diligently; so far I have run into threading-related problems very, very rarely.
But it is very interesting, nevertheless, to get an idea of what is going on under the hood. And given that the subject matter is pretty dry and complex, this article is very well written so someone like me can roughly understand what is going on.
This low level of detail is extremely important to anyone working with atomic variables, which can come up in high-performance or optimization work, especially since ARM gives you finer-grained knobs for working with atomics than you have on x86 (better performance if you can exploit them, which you usually can). That being said, most people do overestimate the need for atomics in practice.
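To make that concrete, here's a minimal sketch of those knobs in C++ std::atomic terms (my own illustration, not from the article): a release store paired with an acquire load is enough to publish data between threads, without paying for a full barrier.

    #include <atomic>
    #include <cassert>
    #include <thread>

    // Release/acquire message passing. The memory_order arguments are
    // the fine-grained knobs: relaxed operations compile to plain
    // loads/stores, and the release/acquire pair below is enough to
    // make the payload visible -- no full barrier (e.g. DMB on ARM)
    // required. On ARMv8 they map to the cheaper STLR/LDAR instructions.
    std::atomic<int> data{0};
    std::atomic<bool> ready{false};

    void producer() {
        data.store(42, std::memory_order_relaxed);    // payload
        ready.store(true, std::memory_order_release); // publish
    }

    void consumer() {
        while (!ready.load(std::memory_order_acquire)) // pairs with the release
            ;
        assert(data.load(std::memory_order_relaxed) == 42); // always passes
    }

    int main() {
        std::thread t1(producer), t2(consumer);
        t1.join();
        t2.join();
    }

Defaulting everything to seq_cst is always correct too; the orderings only matter when you've measured that the stronger default costs you something.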
There are situations where it is essential to understand things like memory models. Personally, I have not run into a problem that called for atomics. I got by using "conventional" stuff like mutexes, thread-safe queues (or channels in Go), and the occasional barrier (or WaitGroup, in Go).
(I know, goroutines are not the same as threads, but one can easily run into the same problems. So it's best in my experience to treat them like threads when it comes to shared mutable state.)
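For anyone curious what that "conventional" stuff looks like without a runtime handing it to you, here's a minimal C++ sketch of a mutex-guarded queue (my own illustration; a Go channel gives you roughly this for free):

    #include <condition_variable>
    #include <mutex>
    #include <queue>
    #include <thread>

    // A mutex-guarded queue with a condition variable. All the
    // memory-ordering subtleties are hidden inside the mutex,
    // which is exactly the point of using "conventional" primitives.
    template <typename T>
    class ThreadSafeQueue {
        std::queue<T> items_;
        std::mutex mu_;
        std::condition_variable cv_;
    public:
        void push(T value) {
            {
                std::lock_guard<std::mutex> lock(mu_);
                items_.push(std::move(value));
            }
            cv_.notify_one();  // wake one waiting consumer
        }
        T pop() {  // blocks until an item is available
            std::unique_lock<std::mutex> lock(mu_);
            cv_.wait(lock, [this] { return !items_.empty(); });
            T value = std::move(items_.front());
            items_.pop();
            return value;
        }
    };

    int main() {
        ThreadSafeQueue<int> q;
        std::thread producer([&] { for (int i = 0; i < 3; ++i) q.push(i); });
        std::thread consumer([&] { for (int i = 0; i < 3; ++i) q.pop(); });
        producer.join();
        consumer.join();
    }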
Sure. I expect it comes up infrequently in Go. It comes up a bit more in C++ or Rust, because the tooling allows it and people sometimes over-engineer solutions (still on the rarer side).
That being said, at the end of the day all the synchronization primitives you use are themselves implemented with atomics. They’re the most fundamental concept everything else is built on.
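As a sketch of what "built on atomics" means (a toy illustration, not how any real libc does it): a lock ultimately bottoms out in an atomic read-modify-write. Real mutexes add fairness and OS-assisted sleeping (futexes on Linux), but the core is the same.

    #include <atomic>
    #include <thread>

    // A toy spinlock: the entire lock is one atomic flag plus the
    // acquire/release orderings that make the critical section work.
    class SpinLock {
        std::atomic<bool> locked_{false};
    public:
        void lock() {
            // exchange returns the previous value; we own the lock once
            // we flip it from false to true. Acquire ordering makes the
            // previous holder's writes visible to us.
            while (locked_.exchange(true, std::memory_order_acquire)) {
                // spin; a real implementation would back off or sleep
            }
        }
        void unlock() {
            // Release ordering publishes our writes to the next holder.
            locked_.store(false, std::memory_order_release);
        }
    };

    SpinLock lock;
    long counter = 0;

    int main() {
        auto work = [] {
            for (int i = 0; i < 100000; ++i) {
                lock.lock();
                ++counter;  // protected by the spinlock
                lock.unlock();
            }
        };
        std::thread t1(work), t2(work);
        t1.join();
        t2.join();
        // counter is exactly 200000 on every run
    }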
This is such a fantastic article. When I was doing my undergraduate thesis on lock-free and wait-free data structures, it was so difficult to find resources that explained these things in a straightforward way (or indeed in any way at all). I look forward to the future instalments!
No, it's a model; it just looks that way to code running on the cores (e.g. because a core is working off a core-local cache that's not immediately synced to all the other caches).
It does not. The "partitioning" is only a mental model.
This logical partitioning is caused by the physical hardware load and store "queues" [1] (and possibly the reorder buffer), which are distinct from the normal coherent caches and, most importantly, are not coherent themselves; that is what causes the illusion of dynamic memory partitioning.
Note that the caches themselves do not lead to visible partitioning. On most architectures, even those with a very relaxed memory model, normal caches are always coherent.
While the queues are indeed some sort of memory, I believe (though I'm not a hardware designer) that they are implemented differently from static RAM (like caches) and dynamic RAM. Possibly they are just a bunch of microarchitectural hardware registers.
[1] The term "queue" is a bit of a misnomer, because on relaxed memory-model architectures they are not FIFO.
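The classic way to see those store queues in action is the "store buffering" litmus test; here's a minimal C++ sketch (my own illustration). Each thread's store can sit in its own store buffer while its subsequent load reads the other variable, so both threads can read 0 -- even on x86, where relaxed atomics compile to plain loads and stores. The coherent caches never show stale data; the store buffer is what creates the illusion.

    #include <atomic>
    #include <cstdio>
    #include <thread>

    std::atomic<int> x{0}, y{0};
    int r1, r2;

    int main() {
        for (int i = 0; i < 1000000; ++i) {
            x.store(0, std::memory_order_relaxed);
            y.store(0, std::memory_order_relaxed);
            std::thread t1([] {
                x.store(1, std::memory_order_relaxed);
                r1 = y.load(std::memory_order_relaxed);
            });
            std::thread t2([] {
                y.store(1, std::memory_order_relaxed);
                r2 = x.load(std::memory_order_relaxed);
            });
            t1.join();
            t2.join();
            // Forbidden under seq_cst, allowed under relaxed (and by the
            // hardware). Spawning threads per iteration makes the window
            // tiny; dedicated litmus harnesses hit it far more reliably.
            if (r1 == 0 && r2 == 0)
                std::printf("both read 0 on iteration %d\n", i);
        }
    }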
Does ARM do cache snooping? That used to be how they kept coherency, but I don't recall what the more modern techniques are for maintaining coherency across CPUs in a NUMA architecture (IIRC snooping doesn't scale well, but maybe they still do a variant).