So what is a memory model? And how to cook it?

A memory model in general is a very broad term, and it determines a lot of things. For example, such an evident thing like pointer size (8, 16, 32, 64 bits) is determined by a memory model. Another good example is what is (was) called... well, a memory model, if you remember those days, - TINY, SMALL, COMPACT, LARGE, FLAT. There are also other things like segmentation, paging and cache associativity.

What I am going to discuss here is memory models in the context of multi-threading/concurrency. There are 3 fundamental properties: Atomicity, Ordering and Visibility; and 2 levels: Compiler and Hardware. In order to implement and reason about any synchronization algorithm you need clear understanding of them.

Atomicity

Atomicity is the most obvious and understood property. I hope you understand what it means in general - indivisibility of an operation, that is, an operation is either does not happen at all or fully completed. No intermediate states and no partial effects can be observed by other threads.

There are 2 classes of operations in the context of atomicity: (1) read-modify-write (RMW) operations and (2) loads and stores. Some people associate atomicity only with RMW operations like compare-and-swap and fetch-and-add; however, atomic loads and stores are no less important.

So, regarding atomic RMW operations. A memory model along with instruction set determine what RMW operations are available and which of them are atomic. Modern processors usually provide at least compare-and-swap (CAS) operation (or equivalent load-linked/store-conditional, LL/SC), and potentially some other atomic RMW operations. Another crucial aspect is whether atomic RMW operations can operate on word/pointer-sized memory locations or also on double-word-sized memory locations (for example, on 128-bit memory locations on 64-bit architecture). On modern IA-32 and Intel 64 architectures double-word atomic operations are available, however they were not available on some early 64-bit processors (to prevent confusion, I call double-word CAS - DWCAS). There is also mostly mythical double-CAS (DCAS, or CAS2) which can operate on 2 disjoint work-sized memory locations - not available on most modern commodity hardware (however, you may encounter it in academic papers on lock-free algorithms).

Regarding loads and stores. Memory model along with instruction set specifies whether plain loads and stores are atomic or not. Typical guarantee for all modern commodity hardware is that aligned word-sized loads and stores are atomic. For example, on x86 architecture (IA-32 and Intel 64) 1-, 2-, 4-, 8- and 16-byte aligned loads and stores are all atomic (that is, plain MOV instruction, MOVQ and MOVDQA are atomic).

For details you must consult your processor or programming language documentation.

For example, regarding IA-32 and Intel 64 architectures consult Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 3A: System Programming Guide, Part 1 (Chapter 8. Multiple-Processor Management).

Regarding IA-64 (Itanium) consult Intel® Itanium® Architecture Software Developer’s Manual Volume 2: System Architecture (MP Coherence and Synchronization).

Regarding SPARC consult The SPARC Architecture Manual (Chapter 8. Memory Models).

Move on to Visibility