Excellent article. When the memory model in the Linux kernel was being improved to better support additional architectures and techniques like Read-Copy-Update, we had many deep discussions about hardware details like this. Jumping into this level of complexity is pretty mind-boggling when you first encounter it.
In the programming I do, this kind of low-level detail hardly ever matters. When I do multi-threading, I use it only in the simplest, most stereotypical ways possible, and I use locking very diligently; so far I have run into threading-related problems very, very rarely.
But it is very interesting, nevertheless, to get an idea of what is going on under the hood. And given that the subject matter is pretty dry and complex, this article is very well written so someone like me can roughly understand what is going on.
This low level of detail is extremely important to anyone working with atomic variables, which can come up in high-performance or optimization work, especially since ARM gives you finer-grained knobs for working with atomics than you have on x86 (better performance if you can exploit them, which you usually can). That being said, most people do overestimate the need for atomics in practice.
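To make that concrete, here's a minimal sketch of those knobs in C++ std::atomic terms (my own illustration, not from the article): a release store paired with an acquire load is enough to publish data between threads, without paying for a full barrier.

    #include <atomic>
    #include <cassert>
    #include <thread>

    // Release/acquire message passing. The memory_order arguments are
    // the fine-grained knobs: relaxed operations compile to plain
    // loads/stores, and the release/acquire pair below is enough to
    // make the payload visible -- no full barrier (e.g. DMB on ARM)
    // required. On ARMv8 they map to the cheaper STLR/LDAR instructions.
    std::atomic<int> data{0};
    std::atomic<bool> ready{false};

    void producer() {
        data.store(42, std::memory_order_relaxed);    // payload
        ready.store(true, std::memory_order_release); // publish
    }

    void consumer() {
        while (!ready.load(std::memory_order_acquire)) // pairs with the release
            ;
        assert(data.load(std::memory_order_relaxed) == 42); // always passes
    }

    int main() {
        std::thread t1(producer), t2(consumer);
        t1.join();
        t2.join();
    }

Defaulting everything to seq_cst is always correct too; the orderings only matter when you've measured that the stronger default costs you something.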
There are situations where it is essential to understand things like memory models. Personally, I have not run into a problem that called for atomics. I got by using "conventional" stuff like mutexes, thread-safe queues (or channels in Go), and the occasional barrier (or WaitGroup, in Go).
(I know, goroutines are not the same as threads, but one can easily run into the same problems. So it's best in my experience to treat them like threads when it comes to shared mutable state.)
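For anyone curious what that "conventional" stuff looks like without a runtime handing it to you, here's a minimal C++ sketch of a mutex-guarded queue (my own illustration; a Go channel gives you roughly this for free):

    #include <condition_variable>
    #include <mutex>
    #include <queue>
    #include <thread>

    // A mutex-guarded queue with a condition variable. All the
    // memory-ordering subtleties are hidden inside the mutex,
    // which is exactly the point of using "conventional" primitives.
    template <typename T>
    class ThreadSafeQueue {
        std::queue<T> items_;
        std::mutex mu_;
        std::condition_variable cv_;
    public:
        void push(T value) {
            {
                std::lock_guard<std::mutex> lock(mu_);
                items_.push(std::move(value));
            }
            cv_.notify_one();  // wake one waiting consumer
        }
        T pop() {  // blocks until an item is available
            std::unique_lock<std::mutex> lock(mu_);
            cv_.wait(lock, [this] { return !items_.empty(); });
            T value = std::move(items_.front());
            items_.pop();
            return value;
        }
    };

    int main() {
        ThreadSafeQueue<int> q;
        std::thread producer([&] { for (int i = 0; i < 3; ++i) q.push(i); });
        std::thread consumer([&] { for (int i = 0; i < 3; ++i) q.pop(); });
        producer.join();
        consumer.join();
    }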
Sure. I expect it comes up infrequently in Go. It comes up a bit more in C++ or Rust, because the tooling allows it and people sometimes over-engineer solutions (still on the rarer side).
That being said, at the end of the day all the synchronization primitives you use are themselves implemented with atomics. They’re the most fundamental concept everything else is built on.
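As a sketch of what "built on atomics" means (a toy illustration, not how any real libc does it): a lock ultimately bottoms out in an atomic read-modify-write. Real mutexes add fairness and OS-assisted sleeping (futexes on Linux), but the core is the same.

    #include <atomic>
    #include <thread>

    // A toy spinlock: the entire lock is one atomic flag plus the
    // acquire/release orderings that make the critical section work.
    class SpinLock {
        std::atomic<bool> locked_{false};
    public:
        void lock() {
            // exchange returns the previous value; we own the lock once
            // we flip it from false to true. Acquire ordering makes the
            // previous holder's writes visible to us.
            while (locked_.exchange(true, std::memory_order_acquire)) {
                // spin; a real implementation would back off or sleep
            }
        }
        void unlock() {
            // Release ordering publishes our writes to the next holder.
            locked_.store(false, std::memory_order_release);
        }
    };

    SpinLock lock;
    long counter = 0;

    int main() {
        auto work = [] {
            for (int i = 0; i < 100000; ++i) {
                lock.lock();
                ++counter;  // protected by the spinlock
                lock.unlock();
            }
        };
        std::thread t1(work), t2(work);
        t1.join();
        t2.join();
        // counter is exactly 200000 on every run
    }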
This is such a fantastic article. When I was doing my undergraduate thesis on lock-free and wait-free data structures, it was so difficult to find resources that explained these things in a straightforward way (or indeed in any way at all). I look forward to the future instalments!
No, it's a model; it just looks that way to code running on the cores (e.g. because a core is working off a core-local cache that's not immediately synced to all the other caches).
It does not. The "partitioning" is only a mental model.
This logical partitioning is caused by the physical hardware load and store "queues" [1] (and possibly the reorder buffer), which are distinct from the normal coherent caches and, most importantly, are not coherent themselves; that is what causes the illusion of dynamic memory partitioning.
Note that the caches themselves do not lead to visible partitioning. On most architectures, even those with a very relaxed memory model, normal caches are always coherent.
While the queues are indeed some sort of memory, I believe (though I'm not a hardware designer) that they are implemented differently from static RAM (like caches) and dynamic RAM. Possibly they are just a bunch of microarchitectural hardware registers.
[1] The term "queue" is a bit of a misnomer, because on relaxed memory-model architectures they are not FIFO.
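The classic way to see those store queues in action is the "store buffering" litmus test; here's a minimal C++ sketch (my own illustration). Each thread's store can sit in its own store buffer while its subsequent load reads the other variable, so both threads can read 0 -- even on x86, where relaxed atomics compile to plain loads and stores. The coherent caches never show stale data; the store buffer is what creates the illusion.

    #include <atomic>
    #include <cstdio>
    #include <thread>

    std::atomic<int> x{0}, y{0};
    int r1, r2;

    int main() {
        for (int i = 0; i < 1000000; ++i) {
            x.store(0, std::memory_order_relaxed);
            y.store(0, std::memory_order_relaxed);
            std::thread t1([] {
                x.store(1, std::memory_order_relaxed);
                r1 = y.load(std::memory_order_relaxed);
            });
            std::thread t2([] {
                y.store(1, std::memory_order_relaxed);
                r2 = x.load(std::memory_order_relaxed);
            });
            t1.join();
            t2.join();
            // Forbidden under seq_cst, allowed under relaxed (and by the
            // hardware). Spawning threads per iteration makes the window
            // tiny; dedicated litmus harnesses hit it far more reliably.
            if (r1 == 0 && r2 == 0)
                std::printf("both read 0 on iteration %d\n", i);
        }
    }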
Does ARM do cache snooping? That used to be how they kept coherency, but I don't recall what the more modern techniques are for maintaining coherency across CPUs in a NUMA architecture (IIRC snooping doesn't scale well, but maybe they still do a variant).