# CS 636: Memory Consistency Models

#### Swarnendu Biswas

Department of Computer Science and Engineering, Indian Institute of Technology Kanpur

Sem 2024-25-II



"To write correct and efficient shared memory programs, programmers need a precise notion of how memory behaves with respect to read and write operations from multiple processors"

S. Adve and K. Gharachorloo. Shared Memory Consistency Models: A Tutorials. Journal of Computer, vol. 29, no. 12, pp. 66–76, Dec. 1996.

#### Busy-Wait Paradigm

```
1 Object X = null;
2 boolean done = false;
```

| Thread 1                          | Thread 2                                 |
|-----------------------------------|------------------------------------------|
| X = new Object();<br>done = true; | <pre>while (!done) {} X.compute();</pre> |

#### **Possible Errors**



| Thread 1                                      | Thread 2                                 |
|-----------------------------------------------|------------------------------------------|
| <pre>done = true;<br/>X = new Object();</pre> | <pre>while (!done) {} X.compute();</pre> |

#### Accesses are to **different** addresses

- Store-store
   Non-FIFO write buffer (first store misses in the cache while the second hits or the second store can coalesce with an earlier sore)
  - Load-load Cache hits, dynamic scheduling, execute out of order
- Load-store Cache hits, out-of-order core
- Store-load FIFO write buffer with bypassing, out-of-order core

Accesses are to **different** addresses

- Store-store Non-FIFO write buffer (first store misses in the cache while the second hits or the second store can coalesce with an earlier sore)
  - Correct in a single-threaded context
    - Non-trivial in a multithreaded context
- Store-load FIFO write buffer with bypassing, out-of-order core

Load-load

Load-store

#### What values can a load return?

#### Return the "last" write

- Uniprocessor: program order defines the "last" write
- Multiprocessor: operations from different cores/threads are not related by program order

# Memory Consistency Model

Set of rules that govern how systems process memory operation requests from multiple processors

- Determines the order in which memory operations appear to execute
- Specifies allowed behaviors of multithreaded programs executing with shared memory
  - ▶ Both at the hardware-level and at the programming-language-level
  - ► There can be multiple correct behaviors

#### Importance of memory consistency models

- + Determines what optimizations are correct
- + Contract between the programmer and the hardware
- + Influences ease of programming and program performance
- + Impacts program portability

#### Visibility

When are the effects of one thread (e.g., updating a memory location) visible to another?

#### Ordering

When can operations of any given thread appear out of order to another thread?

# Sequential Consistency

A multiprocessor system is sequentially consistent if the result of any execution is the same as if the memory operations of all processors were executed in some **sequential** order, and the operations of each individual processor appear in **program order** 

# • Memory operations execute in program order, and respect data and control dependences

- ► Read from memory returns the value from the last write in program order
- ► Compiler optimizations preserve these semantics

#### Multiprocessor

• All operations execute in order, and the operations of each individual core appear in program order

#### Interleavings with SC

```
1 data = null;
2 flag = false;
```

| Core 1                                           | Core 2                                                           |
|--------------------------------------------------|------------------------------------------------------------------|
| 1 S1: data = new Object();<br>2 S2: flag = true; | L1: r1 = flag;<br>B1: if (r1 != true) goto L1;<br>L2: r2 = data; |

Should r2 always be set to the new Object() stored?

# Interleavings with SC



Every load gets its value from the last store before it (in global memory order) to the same address

Suppose we have two addresses a and b (a == b or a != b). L(a) is a load from a and S(a) is a store to a.

## Challenges in Implementing SC

#### Is preserving program order on a per-location basis sufficient?

- Hardware implementations of SC need to satisfy the following requirements
  - Program order
     Previous memory operation completes before proceeding with the next memory operation in program order
    - Writes to the same location should be serialized, i.e., writes to the same location should be visible in the same order to all processors

Write atomicity

Dekker's Algorithm: Need for Program Order

| Core 1             | Core 2             |
|--------------------|--------------------|
| 1 S1: ST flag1, 1  | 1 S2: ST flag2, 1  |
| 2 L1: LD r1, flag2 | 2 L2: LD r2, flag1 |

# Can both r1 and r2 be set to zero?





# Importance of Maintaining Write-Read Order

- Assume a bus-based system with **no caches**
- Includes a write buffer with bypassing capabilities

| Core 1             | Core 2             |
|--------------------|--------------------|
| 1 S1: ST flag1, 1  | 1 S2: ST flag2, 1  |
| 2 L1: LD r1, flag2 | 2 L2: LD r2, flag1 |

# Importance of Maintaining Write-Read Order

- Assume a bus-based system with **no caches**
- Includes a write buffer with bypassing capabilities



## SC in Architecture with Caches

- Replication of data requires a cache coherence protocol
- A coherence protocol propagates a new value to all other cached copies
  - Several definitions of cache coherence protocols exist
  - ► A memory model places bounds on **when** the value can be propagated to a given processor
- Propagating new values to multiple other caches is non-atomic

## Providing Write Atomicity with Caches

- Consider a system with caches, and assume that all variables are cached by all the cores
- SC can be violated with a network with no ordering guarantees

| Core 1 | Core 2               | Core 3                 |
|--------|----------------------|------------------------|
| A = 1  | if (A == 1)<br>B = 1 | if (B == 1)<br>tmp = A |

## Providing Write Atomicity with Caches

- Consider a system with caches, and assume that all variables are cached by all the cores
- SC can be violated with a network with no ordering guarantees



# Serialization of Writes

| Core 1                           | Core 2                                                                              | Core 3 | Core 4                                                  |
|----------------------------------|-------------------------------------------------------------------------------------|--------|---------------------------------------------------------|
| 1 <b>A</b> = 1<br>2 <b>B</b> = 1 | $\begin{array}{cccc} 1 & \mathbf{A} &=& 2 \\ 2 & \mathbf{C} &=& 1 \\ 3 \end{array}$ |        | 1) {} 1<br>1) {} 2<br>2 while (C != 1) {}<br>3 tmp2 = A |

Writes to A in Cores 1 and 2 should not reach Cores 3 and 4 out of order even if the network is out of order or does not provide guarantees—it would violate SC

# Serialization of Writes

| Core 1                                                                                             | Core 2 | Core 3                | Core 4                                                  |
|----------------------------------------------------------------------------------------------------|--------|-----------------------|---------------------------------------------------------|
| $\begin{array}{cccc} 1 & \mathbf{A} &= & 1 \\ 2 & \mathbf{B} &= & 1 \\ 3 & & & & & \\ \end{array}$ |        | 2 while (C != 1) {} : | <pre>while (B != 1) {} while (C != 1) {} tmp2 = A</pre> |

• Cache coherence must serialize writes to the same memory location

#### Writes to the same memory location must be seen in the same order by all

Writes

netwoi

if the

#### End-to-end SC

- Simple memory model that can be implemented both in hardware and in languages
- Performance can take a hit
  - Naïve hardware
  - Maintaining program order can be expensive for writes

D. Marino et al. A Case for an SC-Preserving Compiler. PLDI'11.

# SC-Preserving Optimizations

| Redundant load  | Original<br>t = X; u = X; | $\Rightarrow$ | Optimized<br>t = X; u = t; |
|-----------------|---------------------------|---------------|----------------------------|
| Forwarded load  | Original<br>X = t; u = X; | ⇒             | Optimized<br>X = t; u = t; |
| Dead store      | Original<br>X = t; X = u; | ⇒             | Optimized<br>X = u;        |
| Redundant store | Original<br>t = X; X = t; | $\Rightarrow$ | Optimized<br>t = X;        |

## Optimizations Forbidden in SC

Loop invariant code motion, common sub-expression elimination, ...

| Original                                   | Optimized     |                                          |
|--------------------------------------------|---------------|------------------------------------------|
| L1: t = X*2;<br>L2: u = Y;<br>L3: v = X*2; | $\Rightarrow$ | L1: t = X*2;<br>L2: u = Y;<br>M3: v = t; |

CSE reorders the memory accesses to Y and the second read from X (relaxes  $L \rightarrow L$  constraint, performs an eager load)

#### Optimizations Forbidden in SC

$$\begin{array}{rcl} X &=& O;\\ Y &=& O; \end{array}$$

| Original                                   | Optimized                                                | Concurrent Thread            |  |
|--------------------------------------------|----------------------------------------------------------|------------------------------|--|
| L1: t = X*2;<br>L2: u = Y;<br>L3: v = X*2; | <pre> → L1: t = X*2; L2: u = Y; M3: v = t; </pre>        | C1: $X = 1;$<br>C2: $Y = 1;$ |  |
|                                            | u == 1 && v == 0 is not<br>possible in the original code |                              |  |

## Problematic Optimizations with SC



Eager load optimizations involve  $S \rightarrow L$  and  $L \rightarrow L$  reordering. These optimizations perform a load earlier than would have been performed without the optimizations.

# Problematic Optimizations with SC

| Dead store      | Original                               | Optimized |                                   |
|-----------------|----------------------------------------|-----------|-----------------------------------|
|                 | L1: X = 1;<br>L2: P = Q;<br>L3: X = 2; | ⇒         | L1: ;<br>L2: P = Q;<br>L3: X = 2; |
| Redundant store | Original                               |           | Optimized                         |
|                 | L1: t = X;<br>L2: P = Q;<br>L3: X = t; | ⇒         | L1: t = X;<br>L2: P = Q;<br>L3: ; |

#### Implementing SC with Compiler Support

Implement a compiler pass (e.g., in LLVM) to deal with non-SC preserving optimizations

| L1: | t | = | X*2; |
|-----|---|---|------|
| L2: | u | = | Υ;   |
| L3: | v | = | X*2; |

 $\Rightarrow$ 

| L1: | t = X* | *2       |       |     |
|-----|--------|----------|-------|-----|
| L2: | u = Y  |          |       |     |
| L3: | v = t  |          |       |     |
| C3: | if (X  | modified | since | L1) |
| L3: | v =    | X*2      |       |     |

D. Marino et al. A Case for an SC-Preserving Compiler. PLDI'11.

#### SC Semantics

#### SC is not a strong memory model

Does not guarantee data race freedom

| Thread 1                    | Thread 2                    |
|-----------------------------|-----------------------------|
| a++;                        | a++;                        |
|                             |                             |
| Thread 3                    | Thread 4                    |
| <pre>buffer[index++];</pre> | <pre>buffer[index++];</pre> |

# Cache Coherence

## Block Diagram of a SMP



#### Data Coherence

#### Private caches create data coherence problem

- Copies of a variable can be present in multiple caches
- Private copies of shared data must be **coherent**, i.e., all copies must have the same value (okay if the requirement holds eventually)

#### Consider the following sequence of operations on a single core system

Final value of x will be 30



## Coherence Challenge with Multicores



(i) Multicore system setup

(ii) Each core reads x

## Coherence Challenge with Multicores



(iii) Each core updates x in its private cache

(iv) Cores write back  $\mathbf{x},$  a store is lost depending on the order of write backs

## Can Write-through Caches Avoid the Coherence Problem?

Assume 3 cores with write-through caches

- (i) Core C0 reads x from memory, caches it, and gets the value 10
- (ii) Core C1 reads x from memory, caches it, and gets the value 10
- (iii) C1 writes x=20, and updates its cached and memory values
- (iv) C0 reads x from its cache and gets the value 10
- (v) C2 reads x from memory, caches it, and gets the value 20
- (vi) C2 writes x=30, and updates its cached and memory value

## Sources of Errors in the Previous Examples

### Write-back cache

- Stores are not visible to memory immediately
- Order of write backs are important
- Lesson learned: do not allow more than one copy of a cache line in dirty state

### Write-through cache

- The value in memory may be correct if the writes are correctly ordered
- Our example system allowed a store to proceed when there is already a cached copy
- Lesson learned: must invalidate all cached copies before allowing a store to proceed

### A memory system is coherent if the following hold:

- (i) A read from a location X by a core C that follows a write by C to X always returns the value written by C provided there are no writes of X by another processor between the two accesses by C.
- (ii) A read from a location X by a core C that follows a write to X by another core returns the written value if the read and write are sufficiently separated in time and no other writes to X occur between the two accesses.
- (iii) Writes to the same location are serialized. That is, two writes to the same location by any two cores are seen in the same order by all processors.

### **Correctness Requirement**

### For sequential programs, there is only one correct output

A read from a memory location must return the "latest" value written to it

### For parallel programs, there can be multiple correct outputs

- Defining "latest" precisely is crucial
- Assume that the latest value of a location is the latest value "committed" by any thread/process

## **Cache Coherence Protocol**

Multicore processors implement a cache coherence protocol to keep private caches in sync

A "cache coherence protocol" is a set of actions that ensure that a load to address A returns the "last committed" value to A

- Essentially, makes one core's write visible to other cores by propagating the write to other caches
- Aims to make the presence of private caches functionally invisible
- Coherence protocols can operate on granularities from 1–64 bytes, usually operate on whole cache blocks (e.g., 64 bytes)

### 1. Enforces the Single-Writer-Multiple-Reader (SWMR) invariant

For any given memory location, at any given moment in time, there is either a single core that may write it (including read) or some number of cores that may read it

### **2.** Data values must be propagated correctly (data invariant)

The value of a memory location at the start of a read-only time period is the same as the value of the location at the end of its last read-write time period

|   | read-only   | ÷ | read-write | : | read-write | ÷ | read-only   |     | _      |
|---|-------------|---|------------|---|------------|---|-------------|-----|--------|
| • |             | • |            | • |            | • |             | • • | time 🕨 |
|   |             | • |            |   |            | • |             |     |        |
|   | Cores 2 & 3 | • | Core 2     |   | Core 1     | • | Cores 0 & 1 |     |        |
|   | Coles Z & S |   | Core z     |   | COLET      |   | Coles 0 & I |     |        |

### **Definition 2**

A coherent system must appear to execute all threads' loads and stores to a **single** memory location in a total order that respects the program order of each thread

### **Definition 3**

A coherent system satisfies two invariants:

write propagation every store is eventually made visible to all cores, and

write serialization writes to the same memory location are serialized (i.e., observed in the same order by all cores)

### Memory Consistency

- Defines shared memory behavior
- Related to all shared-memory locations
- Policy on when new value is propagated to other cores
- Memory consistency implementations can use cache coherence as a "black box"

### Cache Coherence

- Does not define shared memory behavior
- Specific to a single shared-memory location
- Propagates a new value to other cached copies
- Invalidation-based or update-based

# Hardware Memory Models

## **Characterizing Hardware Memory Models**

### Relax program order

- For example, Store  $\rightarrow$  Load and Store  $\rightarrow$  Store
- Applicable to pairs of operations with different addresses

#### Relax write atomicity

- Read other core's write early
- Applicable to only cache-based systems

### Relax both program order and write atomicity

Read own write early

### Can both r1 and r2 be set to zero?

$$\begin{array}{rcl} x & = & 0; \\ y & = & 0; \\ \end{array}$$

| Core 1                                           | Core 2                                           |  |  |  |
|--------------------------------------------------|--------------------------------------------------|--|--|--|
| <pre>1 S1: x = new Object(); 2 L1: r1 = y;</pre> | <pre>1 S2: y = new Object(); 2 L2: r2 = x;</pre> |  |  |  |

## **Total Store Order**

- Allows reordering stores to loads
  - A read is not allowed to return the value of another processor's write until it is made visible to all other processors (as in SC)
- Requires write atomicity, can read own write early, not other's writes
- Conjecture: widely-used x86 memory model is equivalent to TSO

## **TSO** Formalism

Suppose we have two addresses a and b (a == b or a != b)

Constraints 1. If L(a)  $<_p L(b) \Rightarrow L(a) <_m L(b)$ 2. If L(a)  $<_p S(b) \Rightarrow L(a) <_m S(b)$ 3. If S(a)  $<_p S(b) \Rightarrow S(a) <_m S(b)$ 4. If S(a)  $<_p L(b) \Rightarrow S(a) <_m L(b)$  /\* Enables FIFO write buffer \*/

Every load gets its value from the last store before it to the same address

## Support for FENCE Operations in TSO

If L(a)  $\leq_p$  FENCE  $\Rightarrow$  L(a)  $\leq_m$  FENCE

If S(a)  $<_p$  FENCE  $\Rightarrow$  S(a)  $<_m$  FENCE

If FENCE  $<_p$  FENCE  $\Rightarrow$  FENCE  $<_m$  FENCE

If FENCE  $<_p L(a) \Rightarrow$  FENCE  $<_m L(a)$ 

If FENCE  $<_p S(a) \Rightarrow$  FENCE  $<_m S(a)$ 

If S(a)  $\leq_p$  FENCE  $\Rightarrow$  S(a)  $\leq_m$  FENCE

If FENCE  $<_p L(a) \Rightarrow$  FENCE  $<_m L(a)$ 

## Possible Outcomes with TSO

1 
$$x = 0;$$
  
2  $y = 0;$ 

| Core 1                     | Core 2             |
|----------------------------|--------------------|
| 1 S1: x = NEW;             | 1 S2: y = NEW;     |
| <sup>2</sup> L1: $r1 = x;$ | $_{2}$ L3: r3 = y; |
| $_{3}$ L2: r2 = y;         | $_{3}$ L4: r4 = x; |

Assume r2 and r4 are zero. Can r1 or r3 also be set to zero?

## Possible Outcomes with TSO



### **Outcome**: r2 ==0, r4 == 0, r1 == NEW, and r3 == NEW

Swarnendu Biswas (IIT Kanpur)

## RMW in TSO

- Load of a RMW cannot be performed until earlier stores are performed (i.e., exited the write buffer). Why?
- Load requires read–write coherence permissions, not just read permissions
- To guarantee atomicity, the cache controller may not relinquish coherence permission to the block between the load and the store

## Relationship among Memory Models

- A memory model Y is strictly more relaxed (weaker) than a memory model X if all X executions are also Y executions, but not vice versa
- If Y is more relaxed than X, then all X implementations are also Y implementations
- Two memory models may be incomparable if both allow executions precluded by the other



## Which is correct?

## Processor Consistency (PC)

### PC is similar to TSO, but does not guarantee write atomicity

Writes may become visible to different processors in different order



## Partial Store Order (PSO)

### PSO allows reordering of store to loads and stores to stores

- Writes to different locations from the same processor can be pipelined or overlapped and are allowed to reach memory or other cached copies out of program order
- Can read own write early, not other's writes
- Write-write reordering is present in many architectures, including Alpha, IA64, and POWER

## **Opportunities to Reorder Memory Operations**

```
1 data1 = data2 = null;
2 flag = false;
```

| Core 1                                                                           | Core 2                                                                        |  |  |
|----------------------------------------------------------------------------------|-------------------------------------------------------------------------------|--|--|
| <pre>S1: data1 = new Object(); S2: data2 = new Object(); S3: flag = true; </pre> | L1: r1 = flag;<br>B1: if (!r1) goto L1;<br>L2: r2 = data1;<br>L3: r3 = data2; |  |  |

What order ensures r2 and r3 always see initialized objects?

## Reorder Operations Within a Synchronization Block



### What order ensures correct handoff from critical section 1 to 2?

Swarnendu Biswas (IIT Kanpur)

CS 636: Memory Consistency Models

## **Optimization Opportunities**

- (i) Non-FIFO coalescing write buffer
- (ii) Support non-blocking reads
  - ► Hide latency of reads
  - Use lockup-free caches and speculative execution
- (iii) Simpler support for speculation
  - Need not compare addresses of loads to coherence requests
  - ► For SC, need support to check whether the speculation is correct

## Relaxed Consistency Rules

Loads and stores are unordered excepting TSO rules are followed for ordering two accesses to the same address

• Every load gets its value from the last store before it to the same address

Constraints 1. If  $L(a) <_p L'(a) \Rightarrow L(a) <_m L'(a)$ 2. If  $L(a) <_p S(a) \Rightarrow L(a) <_m S(a)$ 3. If  $S(a) <_p S'(a) \Rightarrow S(a) <_m S'(a)$ 

## Relaxed Consistency Rules

Loads and stores are unordered excepting TSO rules are followed for ordering two accesses to the same address

• Every load gets its value from the last store before it to the same address

Constraints1. If 
$$L(a) <_p L'(a) \Rightarrow L(a) <_m L'(a)$$
  
2. If  $L(a) <_p S(a) \Rightarrow L(a) <_m S(a)$   
3. If  $S(a) <_p S'(a) \Rightarrow S(a) <_m S'(a)$ If  $L(a) <_p$  FENCE  $\Rightarrow L(a) <_m$  FENCEIf FENCE  $<_p L(a) \Rightarrow$  FENCE  $<_m L(a)$ If  $S(a) <_p$  FENCE  $\Rightarrow$  S(a)  $<_m$  FENCEIf FENCE  $<_p S(a) \Rightarrow$  FENCE  $<_m S(a)$ If FENCE  $<_p$  FENCE  $\Rightarrow$  FENCE  $<_m$  FENCEIf FENCE  $<_p S(a) \Rightarrow$  FENCE  $<_m S(a)$ 

## Using Fences under Relaxed Consistency

```
1 data1 = null;
2 data2 = null;
3 flag = false;
```

### Core 1

| 1                         | <pre>S1: data1 = new Object();</pre> | 1                       |  |  |  |
|---------------------------|--------------------------------------|-------------------------|--|--|--|
| 2                         | S2: data2 = new Object();            | 2                       |  |  |  |
| 3                         | F1: FENCE                            | 3                       |  |  |  |
| 4                         | S3: flag = true;                     | 4 L1: $r1 = flag;$      |  |  |  |
| 5                         |                                      | 5 B1: if (!r1) goto L1; |  |  |  |
| 6                         |                                      | 6 F2: FENCE             |  |  |  |
| 7                         |                                      | $^{7}$ L2: r2 = data1;  |  |  |  |
| 8                         |                                      | $^{8}$ L3: r3 = data2;  |  |  |  |
| Are both fences required? |                                      |                         |  |  |  |

Core 2

### Conservative Use of Fences under Relaxed Consistency

| Core 1                                                                              |                                               | -                                               | Core 2                                                                                  |  |  |
|-------------------------------------------------------------------------------------|-----------------------------------------------|-------------------------------------------------|-----------------------------------------------------------------------------------------|--|--|
| <pre>F11: FENCE A11: acquire( F12: FENCE F12: FENCE // Loads L<sub>1i</sub> a</pre> | arbitrarily and with stores $\mathbf{S}_{1j}$ | 1<br>2<br>3<br>4<br>5<br>6<br>7<br>8<br>9<br>10 | F21: FENCE<br>A21: acquire(lock);<br>F22: FENCE<br>// Loads L <sub>2i</sub> arbitrarily |  |  |
| 11                                                                                  |                                               |                                                 | // interleaved with stores $\mathrm{S}_{2j}$ F23: FENCE                                 |  |  |
| 12<br>13                                                                            |                                               |                                                 | R23: FENCE<br>R22: release(lock);                                                       |  |  |
| 14                                                                                  |                                               |                                                 | F24: FENCE                                                                              |  |  |

## Examples of Relaxed Consistency Memory Models

### Weak ordering

- Distinguishes between data and synchronization operations
- A synchronization operation is not issued until all previous operations are complete
- No operations are issued until the previous synchronization operation completes

### Release consistency

- Relaxes WO further, distinguishes between acquire and release synchronization operations
- All previous acquire operations must be performed before an ordinary load or store access is allowed to perform
- Previous accesses have to complete before a release is performed
- RC<sub>sc</sub> maintains SC between synchronization operations
- Acquire  $\rightarrow$  all, all  $\rightarrow$  release, and sync  $\rightarrow$  sync

## Correct Implementation under Relaxed Consistency

| Core 1                                                   | Core 2                                     |  |  |
|----------------------------------------------------------|--------------------------------------------|--|--|
| 1 F11: FENCE                                             | 1                                          |  |  |
| <pre>2 A11: acquire(lock);</pre>                         | 2                                          |  |  |
| 3 F12: FENCE                                             | 3                                          |  |  |
| 4 // Loads L $_{1i}$ arbitrarily                         | 4                                          |  |  |
| $_5$ // interleaved with stores ${f S}_{1j}$             | 5                                          |  |  |
| 6 F13: FENCE                                             | 6                                          |  |  |
| <pre>7 R12: release(lock);</pre>                         | 7 F21: FENCE                               |  |  |
| 8 F14: FENCE                                             | <pre>8 A21: acquire(lock);</pre>           |  |  |
| 9                                                        | 9 F22: FENCE                               |  |  |
| 10                                                       | 10 // Loads $L_{2i}$ arbitrarily           |  |  |
| 11 Which fences are needed to ensure correct             | 11 // interleaved with stores ${f S}_{2j}$ |  |  |
| <sup>12</sup> ordering and visibility between C1 and C2? | 12 F23: FENCE                              |  |  |
| 13                                                       | <pre>13 R22: release(lock);</pre>          |  |  |
| 14                                                       | 14 F24: FENCE                              |  |  |

## Relaxed Consistency Memory Models

### Why should we use them?

Performance

Why should we not use them?

Complexity

## Hardware Memory Models: One Slide Summary

| Model            | $W \rightarrow R$ | $W \rightarrow W$ | $R \to RW$   | Read Own     | Read Other's |
|------------------|-------------------|-------------------|--------------|--------------|--------------|
| model            | VV / IX           |                   |              | Write Early  | Write Early  |
| SC               |                   |                   |              | $\checkmark$ |              |
| TSO              | $\checkmark$      |                   |              | $\checkmark$ |              |
| PC               | $\checkmark$      |                   |              | $\checkmark$ | $\checkmark$ |
| PSO              | $\checkmark$      | $\checkmark$      |              | $\checkmark$ |              |
| WO               | $\checkmark$      | $\checkmark$      | $\checkmark$ | $\checkmark$ |              |
| $RC_{SC}$        | $\checkmark$      | $\checkmark$      | $\checkmark$ | $\checkmark$ |              |
| RC <sub>PC</sub> | $\checkmark$      | $\checkmark$      | $\checkmark$ | $\checkmark$ | $\checkmark$ |

## Desirable Properties of a Memory Model

### Desirable properties: Programmability, Performance, and Portability

Hard to satisfy all three properties

### **Evaluating SC**

- + Intuitive when we think of uniprocessor executions
- + Serializability of instructions
- No atomicity of regions
- Inhibits many compiler transformations
- Almost all recent architectures violate SC

# Programming Language Memory Models

## Language Memory Models

- Data-Race-Free-0 (DRF0) model is conceptually similar to Weak Ordering (WO)
- Assumes no data races
  - ► DRF0 ensures SC for data-race-free programs
  - ► No guarantees for racy programs
- Allows many optimizations in the compiler and hardware
- Language memory models were developed much later than hardware models
  - ► Recent standardizations are largely driven by languages
- Most language models are based on DRF0

Why do we need one? Is the hardware memory model not enough?

# C++ Memory Model and Catch-Fire Semantics

- Adaptation of the DRF0 memory model
  - Provides SC for data-race-free programs
  - ► C/C++ simply ignores data races
- No safety guarantees in the language

```
1 X* x = null;
2 bool done = false;
```

| Thread 1                         | Thread 2                              |
|----------------------------------|---------------------------------------|
| 1 X = new X();<br>2 done = true; | <pre>if (done)    X-&gt;func();</pre> |

# C++ Memory Model and Catch-Fire Semantics

- Adaptation of the DRF0 memory model
  - Provides SC for data-race-free programs
  - C/C++ simply ignores data races
- No safety guarantees in the language



### Memory Operations in C++

### Synchronization Lock, unlock, atomic load, atomic store, atomic RMW Data Load, store

Compiler reordering is allowed for memory operations M1 and M2 if

- M1 is a data operation and M2 is a read synchronization operation
- M1 is write synchronization and M2 is data
- M1 and M2 are both data with no synchronization between them
- M1 is data and M2 is the write of a lock operation
- M1 is unlock and M2 is either a read or write of a lock

### Writing Correct Concurrent C++ Code Using Locks

```
std::mutex mtx;
bool ready = false;
```

| Thread 1                                                          | Thread 2                                                       |  |
|-------------------------------------------------------------------|----------------------------------------------------------------|--|
| <pre>mtx.lock(); prepareData(); ready = true; mtx.unlock();</pre> | <pre>mtx.lock(); if (ready) consumeData(); mtx.unlock();</pre> |  |

# Using Atomics Available from C++11

- "Data race free" by definition (e.g., std::atomic<int>)
  - ► A store synchronizes with operations that load the stored value—similar to volatile in Java
- C++ volatile is different!
  - Does not establish inter-thread synchronization
  - Can be part of a data race



# **Ensuring Visibility**

- Writer thread releases a lock
  - ► Flushes all writes from the thread's working memory
- Reader thread acquires a lock
  - ► Forces a (re)load of the values of the affected variables
- std::atomic in C++ and volatile in Java
  - ► Values written are made visible immediately before any further memory operations
  - ► Readers reload the value upon each access
- Thread join
  - ▶ Parent thread is guaranteed to see the effects made by the child thread

# Memory Order of Atomics

Specifies how regular, non-atomic memory accesses are to be ordered around an atomic operation

• Default is sequential consistency

### atomic.h

| 1 | <pre>enum memory_order {</pre>    |
|---|-----------------------------------|
| 2 | <pre>memory_order_relaxed ,</pre> |
| 3 | <pre>memory_order_consume ,</pre> |
| 4 | <pre>memory_order_acquire ,</pre> |
| 5 | <pre>memory_order_release ,</pre> |
| 6 | <pre>memory_order_acq_rel ,</pre> |
| 7 | memory_order_seq_cst              |
| 8 | };                                |

#### Producer

- Producer thread creates data
- Producer thread stores to an atomic

#### Consumer

- Consumer threads read from the atomic
- When the expected value is seen, data from the producer thread is visible to the consumers

The different memory model modes indicate the strength of data sharing between threads

Memory model synchronization modes

Swarnendu Biswas (IIT Kanpur)

memory\_order\_seq\_cst

$$\begin{array}{rcl} x & = & 0; \\ y & = & 0; \end{array}$$



```
memory_order_seq_cst
```



memory\_order\_relaxed: no happens-before edge

#### Thread 1

```
v.store(20, memory_order_relaxed);
```

```
2 x.store(10, memory_order_relaxed);
```

#### Thread 2

2

3

2

```
if (x.load(memory_order_relaxed) == 10)
    assert(y.load(memory_order_relaxed) == 20);
    y.store(30, memory_order_relaxed);
```

```
Can these asserts
fail?
```

```
if (y.load(memory_order_relaxed) == 30)
```

```
assert(x.load(memory_order_relaxed) == 10);
```

memory\_order\_relaxed: no happens-before edge

#### Thread 1

```
1 x.store(10, memory_order_relaxed);
```

```
x.store(20, memory_order_relaxed);
```



memory\_order\_relaxed: no happens-before edge

#### Thread 1

```
x.store(10, memory_order_relaxed);
```

```
x.store(20, memory_order_relaxed);
```

```
y = x.load(memory_order_relaxed);
z = x.load(memory_order_relaxed);
assert(y <= z);</pre>
```

- In the absence of HB edges, a thread should not rely on the exact ordering of instructions in another thread
- Once a value of a variable from Thread 1 is observed in Thread 2, Thread 2 cannot see an earlier value

memory\_order\_acquire and memory\_order\_release: introduces HB edges only between dependent variables





80/96

memory\_order\_consume: removes HB ordering on non-dependent variables

| Thread 1                                                      |                                          |                            |
|---------------------------------------------------------------|------------------------------------------|----------------------------|
| <pre>n = 1;<br/>m = 1;<br/>p.store(&amp;n, memory_orde</pre>  | pr_release);                             |                            |
| Thread 2<br>t = p.load(memory_order<br>assert(*t == 1 && m == |                                          |                            |
|                                                               | - / ,                                    |                            |
| Thread 3<br>t = p.load(memory_order                           |                                          | Can these asserts<br>fail? |
| assert (*t == 1 && m ==<br>Swarnendu Biswas (IIT Kanpur)      | 1);<br>CS 636: Memory Consistency Models | Sem 2024-25-II             |

81/96

# Happens-Before Memory Model (HBMM)

Read operation a = rd(t, x, v) may return the value written by any write operation b = wr(t, x, v) provided

- (i) b does not happen after a, i.e.,  $b \prec_{HB} a$  or b||a,
- (ii) There is no intervening write c to x where b  $\prec_{HB}$  c  $\prec_{HB}$  a

$$x = y = 0;$$
Thread 1
Thread 2
$$y = 1;$$

$$r1 = x;$$

$$x = 1;$$

$$r2 = y;$$
assert (r1 != 0 || r2 != 0);
(Can this assert fail?)

# Happens-Before Memory Model (HBMM)

Read operation a = rd(t, x, v) may return the value written by any write operation b = wr(t, x, v) provided

- (i) b does not happen after a, i.e.,  $b \prec_{HB} a$  or b||a,
- (ii) There is no intervening write c to x where b  $\prec_{HB}$  c  $\prec_{HB}$  a

$$x = y = 0;$$
Thread 1
Thread 2
$$r1 = x;$$

$$y = 1;$$

$$r2 = y;$$

$$x = 1;$$
assert (r1 == 0 || r2 == 0);
Can this assert fail?

#### HBMM

$$\begin{array}{rcl} x & = & 0; \\ y & = & 0; \end{array}$$

#### Thread 1



#### HBMM



# DRF0 vs HBMM



data-race-free programs

Swarnendu Biswas (IIT Kanpur)

CS 636: Memory Consistency Models

#### DRF0

- DRF0 allows arbitrary behavior for racy executions
- DRF0 is not strictly stronger than HBMM

#### HBMM

- HBMM does not guarantee SC for DRF programs
- HBMM is not strictly stronger than DRF0

### HBMM

HBMM has the potential to generate out-of-thin-air (OOTA) values

| Thread 1 | Thread 2 |
|----------|----------|
| x = y;   | 1 y = x; |

Problematic for garbage-collected languages since the "out-of-thin-air" value could be an invalid pointer

Introduces potential security loopholes

## Java Memory Model (JMM)

- First high-level language to incorporate a relaxed memory model
- JMM provides SC for data-race-free executions (like DRF0)
- Java provides memory- and type-safety, so JMM has to define some semantics for programs with data races
  - ► JMM prohibits out-of-thin-air values

### Outcomes Possible with JMM



| Thread | 1 |
|--------|---|
|--------|---|

1 
$$y = 1;$$
  
2  $r1 = x;$   
1  $x = 1;$   
2  $r2 = y;$ 

### Outcomes Possible with JMM



#### Thread 1

| 1 | r1 = x; | 1 | r2 = y; |
|---|---------|---|---------|
| 2 | y = 1;  | 2 | x = 1;  |

# Outcomes Possible with JMM

| Racy initialization            |             |                        |                      |
|--------------------------------|-------------|------------------------|----------------------|
|                                | obj = null; |                        |                      |
| Thread 1                       |             | Thread 2               |                      |
| <pre>obj = new Circle( 2</pre> | );          | if (obj !=<br>obj.draw | <pre>null) ();</pre> |

JVMs may not exhibit all behaviors permissible under the JMM

Swarnendu Biswas (IIT Kanpur)

CS 636: Memory Consistency Models

### Outcomes Not Possible with JMM



- HBMM permits an execution in which each load reads say 42
- DRF0 allows any arbitrary behavior
- JMM disallows reading 42, is strictly stronger than DRF0 and HBMM

### JVMs do not comply with the JMM!



З

Δ

5

6

7

# JVMs do not comply with the JMM!



#### Specifying semantics for racy programs is hard

Simple optimizations may introduce unintended consequences

#### SC for DRF is now the preferred baseline

- Make sure your program is free of data races
- Compiler and architecture setup will guarantee SC execution

### References



- V. Nagarajan et al. A Primer on Memory Consistency and Cache Coherence. Chapters 1–5, 2<sup>nd</sup> edition, Springer Cham.
- S. Adve and K. Gharachorloo. Shared Memory Consistency Models: A Tutorial. Journal of Computer, vol. 29, no. 12, pp. 66–76, Dec. 1996.
- D. Marino et al. A Case for an SC-Preserving Compiler. PLDI 2011.
- C. Flanagan and S. Freund. Adversarial Memory for Detecting Destructive Races. PLDI 2010.
- M. Cao et al. Prescient Memory: Exposing Weak Memory Model Behavior by Looking into the Future. ISMM 2016.