## CS614: Linux Kernel Programming

#### **Virtual Memory**

Debadatta Mishra, CSE, IIT Kanpur

#### User API for memory management



- Generally, user programs
   use library routines to
   allocate/deallocate
   memory
- OS provides some address
   space manipulation system
   calls (today's agenda)

#### Virtual memory management



- start and end never
   overlaps between two
   vm areas
   can merge/extend vmas
  - if permissions match
  - linux maintains both rb\_tree and a sorted list (see mm/filemap.c)

The OS implements VM system calls like mmap(), mprotect() by manipulating the VMAs

#### Address translation: Paging

- The idea of paging
  - Partition the address space into fixed sized blocks (call it pages)
  - Physical memory partitioned in a similar way (call it page frames)
  - OS creates a mapping between *page* to *page frame*, H/W uses the mapping to translate VA to PA
- With increased address space size, single level page table entry is not feasible, because
  - Increasing page size increases internal fragmentation
  - Small pages may not be suitable to hold all mapping entries

#### 4-level page tables: 48-bit VA (Intel x86\_64)



- Four-levels of page table, entry size = 64 bits

# Paging example (structure of an example PTE)



- PFN occupies a significant portion of PTE entry (8 bits in this example)
  - Present bit,  $1 \Rightarrow$  entry is valid
    - Write bit,  $1 \Rightarrow$  Write allowed

| Ū |
|---|
|---|

Ρ

W

Privilege bit,  $o \Rightarrow$  only kernel mode access is allowed



- Accessed bit,  $1 \Rightarrow$  Address accessed (set by H/W during walk)
- Dirty bit,  $1 \Rightarrow$  Address written (set by H/W during walk)



Execute bit,  $1 \Rightarrow$  Instruction fetch allowed for this page



#### 4-level page tables: example translation



- physical memory
- Page table entry: 12 bits LSB is used for access flags

#### Paging: translation efficiency

sum = 0; for(ctr=0; ctr<10; ++ctr) sum += ctr;

0x20100: mov \$0, %rax; Ox20102: mov %rax, (%rbp); // sum=0 Ox20104: mov \$0, %rcx; // ctr=0 OX2O106: cmp \$10, %rcx; // ctr < 10 0x20109: jge 0x2011f; // jump if >= ox2010f: add %rcx, %rax; Ox2O111: mov %rax, (%rbp); // sum += ctr0x20113: inc %rcx // ++ctr //loop jmp 0x20106 OX20115: 0x2011f: .....

- Considering four-level page table, how many memory accesses are required (for translation) during the execution of the above code?

#### Paging: translation efficiency

0x20100: mov \$0, %rax;

0x20102: mov %rax. (%rbp): // sum=0

- Instruction execution: Loop = 10 \* 6, Others = 2 + 3
  - Memory accesses during translation = 65 \* 4 = 260
- Data/stack access: Initialization = 1, Loop = 10
  - Memory accesses during translation = 11 \* 4 = 44
- A lot of memory accesses (> 300) for address translation
- How many distinct pages are translated?
- Considering four-level page table, how many memory accesses are required (for translation) during the execution of the above code?

### Paging with TLB: translation efficiency

Translate(V){

| ILD    |       |  |  |  |
|--------|-------|--|--|--|
| Page   | PTE   |  |  |  |
| 0x20   | 0x750 |  |  |  |
| 0x7FFF | 0x890 |  |  |  |

TID

PageAddress P = V >> 12; TLBEntry entry = lookup(P); if (entry.valid) return entry.pte; entry = PageTableWalk(V); MakeEntry(entry); return entry.pte;

- TLB is a hardware cache which stores *Page* to *PFN* mapping
- After first miss for instruction fetch address, all others result in a TLB hit
- Similarly, considering the stack virtual address range as 0x7FFF000 0x8000000, one entry in TLB avoids page table walk after first miss

#### Paging: translation efficiency

0x20100: mov \$0, %rax;

- Instruction execution: Loop = 10 \* 6, Others = 2 + 3
  - Memory accesses during translation = 65 \* 4 = 260
- Data/stack access: Initialization = 1, Loop = 10
  - Memory accesses during translation = 11 \* 4 = 44
- A lot of memory accesses (> 300) for address translation
- How many distinct pages are translated?
- One code page (0x20) and one stack page (0x7FFF). Caching these translations, will save a lot of memory accesses.

required (for translation) during the execution of the above code?

#### Address translation (TLB + PTW)



- TLB in the path of address translation
- Separate TLBs for instruction and data, multi-level TLBs
- In X86, OS can not make entries into the TLB directly, it can flush entries

#### Address translation (TLB + PTW)



- How TLB is shared across multiple processes?
- Why page fault is necessary?
- How OS handles the page fault?



into the TLB directly, it can flush entries

| Process (A) | Process (B) |
|-------------|-------------|
|-------------|-------------|

- Assume that, process A is currently executing. What happens when process B is scheduled?

| Page  | PTE      |
|-------|----------|
| 0x100 | 0x200007 |
| 0x101 | 0x205007 |
|       |          |
|       |          |

- A) Do nothing
- B) Flush the whole TLB
- C) Some other solution

| Process (A) | Process (B) |
|-------------|-------------|
|-------------|-------------|

| Page  | PTE      |
|-------|----------|
| 0x100 | 0x200007 |
| 0x101 | 0x205007 |
|       |          |
|       |          |

- Assume that, process A is currently executing. What happens when process B is scheduled?
  - A) Do nothing
  - B) Flush the whole TLB
  - C) Some other solution
- Process B may be using the same addresses used by
   A. Result: Wrong translation

| Process (A) | Process (B) |
|-------------|-------------|
|-------------|-------------|

| Page             | PTE                 |
|------------------|---------------------|
| <del>0x100</del> | <del>0x200007</del> |
| <del>0x101</del> | <del>0x205007</del> |
|                  |                     |
|                  |                     |

- Assume that, process A is currently executing. What happens when process B is scheduled?
  - A) Do nothing
  - B) Flush the whole TLB
  - C) Some other solution
- Correctness ensured. Performance is an issue (with frequent context switching)



- Assume that, process A is currently executing. What happens when process B is scheduled?

| ASID | Page  | PTE      |
|------|-------|----------|
| А    | 0x100 | ox200007 |
| А    | 0x101 | ox205007 |
| В    | 0x100 | 0x301007 |
| В    | 0x101 | 0x302007 |

- A) Do nothing
  - B) Flush the whole TLB
  - C) Some other solution
- Address space identified (ASID) along with each TLB entry to identify the process

#### Address translation (TLB + PTW)



- How TLB is shared across multiple processes?
- Full TLB flush during context switch, using ASID
- Why page fault is necessary?
- How OS handles the page fault?



#### Address translation (TLB + PTW)



- How TLB is shared across multiple processes?
- Full TLB flush during context switch, using ASID
- Why page fault is necessary?
- Page fault is required to support memory over-commitment through lazy allocation and swapping
- How OS handles the page fault?



#### Page fault handling in X86: Hardware

```
If( !pte.valid ||
  (access == write && !pte.write)
  (cpl != 0 && pte.priv == 0)){
      CR_2 = Address;
      errorCode = pte.valid
                    access << 1
                    | cpl << 2;
       Raise pageFault;
} // Simplified
```

#### Page fault handling in X86: Hardware

#### Error code

|                                 | Other and unused I R U W P                                       |
|---------------------------------|------------------------------------------------------------------|
| If( !pte.valid                  |                                                                  |
| (access == write && !pte.write) | <b>P</b> Present bit, $1 \Rightarrow$ fault is due to protection |
| (cpl != 0 && pte.priv == 0)){   | Write bit, $1 \Rightarrow$ Access is write                       |
| CR2 = Address;                  | W Write bit, $1 \Rightarrow$ Access is write                     |
| errorCode = pte.valid           | U Privilege bit, $1 \Rightarrow$ Access is from user mode        |
| access << 1                     | $\square R Reserved bit, 1 \Rightarrow Reserved bit violation$   |
| cpl << 2;                       | <b>R</b> Reserved bit, $1 \Rightarrow$ Reserved bit violation    |
| Raise pageFault;                | I Fetch bit, $1 \Rightarrow$ Access is Instruction Fetch         |
| } // Simplified                 |                                                                  |

- Error code is pushed into the kernel stack by the hardware

```
Page fault handling in X86: OS fault handler
HandlePageFault( u64 address, u64 error_code)
{
   If (AddressExists(current \rightarrow mm_state, address) &&
      AccessPermitted(current \rightarrow mm_state, error_code) {
          PFN = allocate_pfn();
          install_pte(address, PFN);
          return;
   RaiseSignal(SIGSEGV);
```

#### Address translation (TLB + PTW)

- **VA**(*i* How TLB is shared across multiple processes?
  - Full TLB flush during context switch, using ASID
  - Why page fault is necessary?
  - Page fault is required to support memory over-commitment through lazy <sup>0</sup> allocation and swapping
  - How OS handles the page fault?
  - The hardware invokes the page fault handler by placing the error code and virtual address. The OS handles the page fault either fixing it or raising a SEGFAULT.

#### Swapping (swap-out)

DRAM

Swap (Hard disk)

Number of free PFNs are very few in the system. I can not break my promise made to the applications. Let me swap-out some memory. But which one to swap-out?





DRAM



**Page Replacement Policy** 

My page replacement policy will help me deciding the victims (V). Can I just swap-out? What if the swapped-out pages are accessed? I should be prepared for that too!





Swap (Hard disk)

AllocatePFN()



#### Swapping (swap-out)

SwapAddress(V)

V

DRAM



PTE mapping the victim PFN (before swap)

| PFN(V) | 0 | 1 | 1 | 1 | 1 | 1 |
|--------|---|---|---|---|---|---|
|--------|---|---|---|---|---|---|

PTE mapping the victim PFN (after swap)

Content of the PFN is now in the swap device. In future, any translation using the PTE will result in a page fault. The page fault handler would copy it back from the swap device.



Swap (Hard disk)

0

1

#### Page fault with swap-in

```
HandlePageFault( u64 address, u64 error_code)
```

```
If ( AddressExists(current → mm_state, address) &&
AccessPermitted(current → mm_state, error_code) {
    PFN = allocate_pfn();
    If ( is_swapped_pte(address) ) // Check if the PTE is swapped out
    swapin(getPTE(address), PFN); // Copy the swap block to PFN
    install_pte(address, PFN); // and update the PTE
    return;
```

```
RaiseSignal(SIGSEGV);
```

{

#### Efficient translation: Huge page support



Disadvantages?

#### Efficient translation: Huge page support



- Disadvantages? Inefficient management

#### Mixed page size support



walk\_pmd(pmd, vaddr) {
 if(pmd.H)
 paddr = pmd.nextL() + (vaddr & pmask);
 else
 pte = pmd.nextL() + pte\_offset(vaddr)
} // Simplified H/W logic

#### Mixed page size support



walk\_pmd(pmd, vaddr) {
 if(pmd.H)
 paddr = pmd.nextL() + (vaddr & pmask);
 else
 pte = pmd.nextL() + pte\_offset(vaddr)

#### } // Simplified H/W logic

- The OS may use the hardware support to implement any policy
- Transparent hugepage (THP) in Linux trie to create huge page mapping in w/o explicit user space assistance
- Policy knobs through sysfs

- Why not treat kernel as an isolated MM context?

- Why not treat kernel as an isolated MM context?
  - Require MM context loading/unloading on user-kernel context switch
  - In kernel context, user data is accessed (a lot!) why?
  - Even worse, user data of many processes accessed
  - In X86, a small part of the kernel can not be isolated as HW does not perform MM context switch
- Requirement: efficient memory isolation between user and kernel

- Why not treat kernel as an isolated MM context?
  - Require MM context loading/unloading on user-kernel context switch
  - In kernel context, user data is accessed (a lot!) why?
  - Even worse, user data of many processes can be accessed
  - In X86, a small part of the kernel can not be isolated as HW does not perform MM context switch
- Requirement: efficient memory isolation between user and kernel
  - Let kernel use the same MM context of the user process
  - No context switch, no problems of accessing user data

- Why not treat kernel as an isolated MM context?
  - Require MM context loading/unloading on user-kernel context switch
  - In kernel context, user data is accessed (a lot!) why?
  - Even worse, user data of many processes can be accessed
  - In X86, a small part of the kernel can not be isolated as HW does not perform MM context switch
- Requirement: efficient memory isolation between user and kernel
  - Let kernel use the same MM context of the user process
  - No context switch, no problems of accessing user data
- How kernel VM change propagated across processes? Isolation issues?

# Issue of Kernel VM propagation



Kernel virtual address mapping should be present in both process page tables.
Ex: If kernel allocates memory while serving syscall from process-1, process-2 in kernel

mode should see it!

 Solution should consider that,
 "processes and memory are dynamically created and destroyed"

### Linux strives on family values!

- A child process page table inherits the kernel mappings of the parent
- By implication, the inheritance tree is rooted at the first process
- Mapping changes  $\rightarrow$  update mapping in every process?
  - Does not look good!

### Linux strives on family values!

- A child process page table inherits the kernel mappings of the parent
- By implication, the inheritance tree is rooted at the first process
- Mapping changes  $\rightarrow$  update mapping in every process?
  - Does not look good!

Solution: Every process owns its own **pgd** entries but inherits the kernel **pgd** entries from the parent :-)

### Solution overview



- One (or more) entries in PGD-level (level-4) reserved for kernel mapping
- How many?
- Depends on VA-range covered by one entry and the kernel VA size

### Solution overview



# Virtual memory layout (x86\_64)



- User virtual addresses use the LSB 47 bits
- Kernel virtual address does not start from 0x800000000, but from 0xffff80000000000
- Why? Because X86 hardware enforces if 47th bit is one, 48-63 must be set to one

### Process address space (user + kernel)



- Virtual address space is split into two parts, user VA and kernel VA
- Kernel mappings are isolated from user through S/U bit of page table entry
- Advantages: isolation + efficiency
- What is the need for direct map?

### Process address space (user + kernel)



- Virtual address space is split into two parts, user VA and kernel VA
- Kernel mappings are isolated from user through S/U bit of page table entry
- Advantages: isolation + efficiency
- What is the need for direct map? Helps in mapping physical address to an already mapped kernel vaddr

#### Issue with shared address space

char array[256 \* 4096]; //\_alligned(4k); char secret = \*(char \*) 0xffff88800000000; array[secret << 12] = 0;

- This program will result in an exception  $\rightarrow$  Segmentation fault
- Everything seems to be under control. What is the problem then?

#### Information leakage through out-of-order execution



- By the time the instruction in line#3 is committed (and a fault is raised), instructions in line#4 and #5 are completed out-of-order

#### Side-effect: access footprint

- 1. char array[256 \* 4096]; //\_\_alligned(4k);
- 2. char secret = \*(char \*) 0xffff8880000000;
- 3. array[secret << 12] = 0;

Array (before the program execution): block 0 == {0 - 4095} etc.

Array (after out-of-order execution of #3) {assume secret = k}



## OOO vulnerability + Flush-Reload

- unsigned time[256];
- 2. char array[256 \* 4096];
- 3. flush\_array(array);
- 4. char secret = \*(char \*) 0xffff88800000000;
- 5. array[secret << 12] = 0;
- 6. for(i=0; i<256; ++i)

7.



- 8. secret = find\_index\_with\_min\_time( time);
- Result: indirectly read the value of secret
- Meltdown is easy.... Some subtle points still remain
- What is the fix?







