# CS 335: Code Generation 

## Swarnendu Biswas

Semester 2022-2023-II
CSE, IIT Kanpur

## An Overview of Compilation



## Code Generation



## Code Generation

i. Generated output code must be correct
ii. Generated code must be of "good" quality

- Should make efficient use of resources on the target machine
- Notion of good can be vary
iii. Code generation should be efficient
- Generating optimal code is undecidable, compilers make use of welldesigned heuristics


## Code Generation

- Input
- Intermediate representation (IR) generated by the front end
- Linear IRs like 3AC or stack machine representations
- Graphical IRs also work
- Symbol table information


## - Assumptions

- Code generation does not bother with any error checking
- Code generation assumes that types in the IR can be operated on by target machine instructions
- For example, bits, integers, and floats


## Code Generation

## - Output

- Absolute machine code
- Generated addresses are fixed and works when loaded at fixed locations in memory
- Efficient to execute, now primarily used in embedded systems
- Relocatable machine code
- Code can be broken down into separate sections and loaded anywhere in memory that meets size requirements
- Allows for separate compilation, but requires a separate linking and loading phase
- Assembly language
- Simplifies code generation, but requires assembling the generated code


## Steps in Code Generation

- Compiler backend performs three steps to translate IR to executable code
- Instruction selection - Choose appropriate target machine instructions while generating code
- Register allocation - Decide what values to keep in which registers
- Instruction scheduling - Decide in what order to schedule the execution of instructions
- Manage memory during execution


## Instruction Selection

- Complexity arises because each IR instruction can be translated in several ways, combinatorial problem

| $a=a+1$ | 1. LD R0, a ADD R0, R0, \#1 ST a, R0 |
| :---: | :---: |
|  | 2. INC a |

- Target ISA influences instruction selection
- Scalar RISC machine - simple mapping from IR to assembly
- CISC machine - may need to fuse multiple IR operations for effectively using CISC instructions
- Stack machine - need to translate implicit names and destructive instructions to assembly


## Instruction Selection

- Possible idea

$$
\begin{array}{ll} 
& \text { LD R0, } y \\
\text { ADD R0, R0, } z+z \\
& \text { ST } x, R 0
\end{array}
$$

- Devise a target code skeleton for every 3AC IR instruction
- Replace every 3AC instruction with the skeleton
- Need a cost model and heuristics for selection
- Other factors are level of abstraction of the IR, speed of instructions, energy consumption, and space overhead


## Register Allocation

- Instructions operating on register operands are more efficient
- Register allocation - Choose which variables will reside in registers
- Register assignment - Choose which registers to assign to each variable
- Architectures may impose restrictions on usage of registers
- Finding an optimal assignment of registers to variables is NPcomplete
- Architectures such as IBM 370 may require register pairs to be used for some instructions
- x is in the even register, y is in the odd register
- Product occupies the entire even/odd register pair
- 64-bit dividend occupies the even/odd register pair
- Even register holds the remainder, odd register the quotient


## Instruction Scheduling

- Order of evaluating the instructions also affect the efficiency of the target code
- Selecting the best order across inputs is a NP-complete problem


## Example Target Machine

- Efficient code generation requires good understanding of the target ISA
- Assumptions
- Three-address machine, byte-addressable with four-byte words
- n general-purpose registers
- OP dst, $\mathrm{src}_{1}, \mathrm{src}_{2}$; LD dst addr; ST dst, src; BR L; Bcondr L;


## Addressing Modes

- Specifies how to interpret the operands of an instruction

| Mode | Form | Address | Example |
| :---: | :---: | :---: | :---: |
| absolute | M | M | LD R0, M |
| register | R | R | ADD R0, R1, R2 |
| indexed | $c(R)$ | $\mathrm{c}+\mathrm{contents}(\mathrm{R})$ | LD R1, 4(R0) |
| indirect register | *R | contents(R) | LD R1, *R0 |
| indirect indexed | * C (R) | contents( C + contents( R$)$ ) | LD R1, *100(R0) |
| literal | \#C | C | LD R1, \#1 |

## Few Examples

| $x=y-z$ | $\begin{aligned} & \text { LD } R 1, y \\ & \text { LD } R 2, z \\ & \text { SUB } R 1, R 1, R 2 \\ & \text { ST } x, R 1 \end{aligned}$ | $\begin{aligned} & / / R 1=y \\ & / / R 2=z \\ & / / R 1=R 1-R 2 \\ & / / x=R 1 \end{aligned}$ |
| :---: | :---: | :---: |


| if $x<y$ goto $L$ | LD $R 1, x$ | $/ / \mathrm{R} 1=\mathrm{x}$ |
| :--- | :--- | :--- |
|  | LD $R 2, y$ | $/ / \mathrm{R} 2=\mathrm{y}$ |
|  | SUB $R 1, R 1, R 2$ | $/ / \mathrm{R} 1=\mathrm{R} 1-\mathrm{R} 2$ |
|  | BLTZ $R 1, M$ | $/ /$ if R1 < O JMP M |


$b=a[i]$| $\mathrm{LD} R 1, i$ | $/ / \mathrm{R} 1=\mathrm{i}$ |
| :--- | :--- |
| $\mathrm{MUL} R 1, R 1,8$ | $/ / \mathrm{R} 1=\mathrm{R} 1 * 8$ |
| $\mathrm{LD} R 2, a(R 1)$ | $/ / \mathrm{R} 2=\mathrm{c}(\mathrm{a}+\mathrm{c}(\mathrm{R} 1))$ |
|  | $\mathrm{ST} b, R 2$ |

$$
a[j]=c=\begin{array}{lll}
\text { LD } R 1, c & / / \mathrm{R} 1=\mathrm{c} \\
& \mathrm{LDR2,j} & / / \mathrm{R} 2=\mathrm{j} \\
\mathrm{MULR2,R2,8} & / / \mathrm{R} 2=\mathrm{R} 2 * 8 \\
\mathrm{ST} a(R 2), R 1 & / / \mathrm{c}(\mathrm{a}+\mathrm{c}(\mathrm{R} 2))=\mathrm{R} 1
\end{array}
$$

|  | LD $R 1, p$ | $/ / \mathrm{R} 1=\mathrm{p}$ |
| :--- | :--- | :--- |
| $x=* p$ | $L D R 2,0(R 1)$ | $/ / \mathrm{R} 2=\mathrm{c}(0+\mathrm{c}(\mathrm{R} 1)$ |
|  | $\mathrm{ST} x, R 2$ | $/ / \mathrm{x}=\mathrm{R} 2$ |


| LD $R 1, p$ $/ / \mathrm{R} 1=\mathrm{p}$ <br> y $L D R 2, \mathrm{y}$ <br> $\mathrm{ST} 0(R 1), R 2$ $/ / \mathrm{R} 2=\mathrm{y}$ <br>  $/ \mathrm{c}(0+\mathrm{c}(\mathrm{R} 1)=\mathrm{R} 2$ |
| :--- | :--- | :--- |

## Runtime Storage Management

- Let us consider the following 3AC: call callee, return, halt, action
- Assume that the first location in the activation record (given by staticArea) of the callee stores the return address of the caller


## Static Allocation

ST callee.staticArea, \#here $+20 \quad$ Store return address in the first slot in the callee's activation record, assume 2 opcodes and 3 constants are each of 4 bytes

BR callee.codeArea Transfer control to callee
$\xrightarrow[\text { address }]{\text { return }}$ ...

```
BR *callee.staticArea
```

Return transfer to caller

## Determine Addresses in Target Code

- Need to generate code to manage activation records at runtime

| $3 A C$ |
| :--- |
| $/ /$ code for func c $_{\text {action }}^{1}$ |
| call $p$ |
| action |
| halt |$|$| // code for func p |
| :--- |
| action |
| return |


|  | Activation record for c <br> (64 Bytes) |
| :---: | :---: |
| $0:$ | return address |
| $4:$ | arr |
| $56:$ | i |
| $60:$ | j |


|  | Activation record for $p$ <br> (88 Bytes) |
| :---: | :---: |
| $0:$ | return address |
| $4:$ | buf |
|  |  |
| $84:$ | $n$ |

## Target Code for Static Allocation

| text area |  |  |
| :---: | :---: | :---: |
|  |  | // code for c |
| 100: | ACTION ${ }_{1}$ | // assume takes 20 bytes |
| 120: | ST 364, \#140 | // save return address 140 |
| 132: | BR 200 | // call p |
| 140: | ACTION |  |
| 160: | HALT | // Terminate, return to OS |
|  |  | // code for p |
| 200: | $\mathrm{ACTION}_{3}$ |  |
| 220: | BR *364 | // return to address saved in location 364 |


|  |  | // 300-363 hold activation <br> record for c |
| :--- | :--- | :--- |
| 300: |  | // return address |
| 304: |  | // local data for c |
|  |  | // 364-451 hold activation <br> record for $p$ |
| 364: | 140 |  |
| $368:$ |  | // return address |

## Stack Allocation

| Code for first procedure |  |
| :--- | :--- |
| LD SP, \#stackStart <br> code <br> HALT | // initialize the stack |


| Code for procedure call |  |
| :--- | :--- |
| ADD SP, SP, \#caller. recordSize | // increment stack pointer |
| ST $*$ SP, \#here +16 | // save return address in |
|  | // callee's frame |
| BR callee.codeArea | // jump to caller |


|  |  |  |  | Code for return sequence in the callee |
| :--- | :--- | :---: | :---: | :---: |
| $\mathrm{BR} * \mathrm{O}(\mathrm{SP})$ | $/ /$ return to caller |  |  |  |

Code for return sequence in the caller
SUB SP, SP, \#caller.recordSize // decrement stack pointer

| $3 A C$ |
| :--- |
| // code for s |
| action |
| call q |
| action |
| halt |$|$| // code for p |
| :--- |
| action |
| return |

## Target Code for Stack Allocation

|  |  | // code for s |
| :---: | :---: | :---: |
| 100: | LD SP, \#600 | // initialize the stack |
| 108: | $\mathrm{ACTION}_{1}$ | // code for action ${ }_{1}$ |
| 128: | ADD SP, SP, \#ssize | // call sequence begins |
| 136: | ST 0(SP), \#152 | // push return address |
| 144: | BR 300 | // call q |
| 152: | SUB SP, SP, \#ssize | // restore SP |
| 160: | $\mathrm{ACTION}_{2}$ |  |
| 180: | HALT |  |
|  |  | // code for $p$ |
| 200: | $\mathrm{ACTION}_{3}$ |  |
| 220: | BR * O(SP) | // return to caller |


|  |  | // code for q |
| :---: | :---: | :---: |
| 300: | $\mathrm{ACTION}_{4}$ | // conditional jump to 456 |
| 320: | ADD SP, SP, \#qsize |  |
| 328: | ST 0(SP), \#344 | // push return address |
| 336: | BR 200 | // call p |
| 344: | SUB SP, SP, \#qsize |  |
| 352: | $\mathrm{ACTION}_{5}$ |  |
| 372: | ADD SP, SP, \#qsize |  |
| 380: | ST 0(SP), \#396 | // push return address |
| 388: | BR 300 | // call q |
| 396: | SUB SP, SP, \#qsize |  |
| 404: | $\mathrm{ACTION}_{6}$ |  |
| 424: | ADD SP, SP, \#qsize |  |
| 432: | ST 0(SP), \#448 | // push return address |
| 440: | BR 300 | // call q |
| 448: | SUB SP, SP, \#qsize |  |
| 456: | BR * 0 (SP) | // return |
| 600: |  | // stack starts here |

## Basic Blocks and Flow Graphs

## Basic Block

- Maximal sequence of consecutive statements in which flow of control enters at the beginning and leaves at the end
- Entry is to the start of the BB, and exit is from the end of the BB
- Only the start/leader instruction can be the target of a JUMP instruction
- No jumps into the middle of the block
- No branch instructions other than the end

$$
\begin{aligned}
& \mathrm{t}_{1}=\mathrm{a} * a \\
& \mathrm{t}_{2}=\mathrm{a} * \mathrm{~b} \\
& \mathrm{t}_{3}=2 * \mathrm{t}_{2} \\
& \mathrm{t}_{4}=\mathrm{t}_{1}+\mathrm{t}_{3} \\
& \mathrm{t}_{5}=\mathrm{b} * \mathrm{~b} \\
& \mathrm{t}_{6}=\mathrm{t}_{4}+\mathrm{t}_{5}
\end{aligned}
$$

## Identifying Basic Blocks (BBs)

## - Input

- A sequence of 3AC
- Output
- List of BBs with each 3AC in exactly one BB


## - Algorithm

- Identify the leaders which are the first statements in a BB

1. The first statement is a leader
2. Any statement that is the target of a conditional or unconditional goto is a leader
3. Any statement that immediately follows a conditional or unconditional goto is a leader

- For each leader, its BB consists of the leader and all instructions up to but not including the next leader or the end of the program


## Identifying BBs

for i from 1 to 10 do
for $j$ from 1 to 10 do

$$
a[i, j]=0.0
$$

for i from 1 to 10 do

$$
a[i, i]=1.0
$$

Statements (1), (2), (3), (10), (12), and (13) are leaders

There are six BBs: (1), (2), (3)-(9), (10)(11), (12), (13)-(17)
(1) $i=1$
(2) $j=1$
(3) $\boldsymbol{t}_{1}=\mathbf{1 0} \times$
(4) $t_{2}=t_{1}+j$
(5) $t_{3}=8 \times t_{2}$
(6) $t_{4}=t_{3}-88$
(7) $a\left[t_{4}\right]=0.0$
(8) $j=j+1$
(9) if $j \leq 10$ goto (3)
(10) $i=i+1$
(11) if $i \leq 10$ goto (2)
(12) $i=1 \times$ follows a
(13) $t_{5}=i-1 \quad$ conditional
(14) $t_{6}=88 \times t_{5}$
(15) $a\left[t_{6}\right]=1.0$
(16) $i=i+1$
(17) if $i \leq 10$ goto (13)

## Next Use and Liveness

- Knowing when the value of a variable will be used next is important for generating good code
- Remove variables from registers if not used
- Consider the 3AC instruction $I: x=y+z$; we say $I$ defines $x$ and uses $y$ and $z$
- Suppose a statement $I$ defines $x$. If a statement $J$ uses $x$ as an operand, and control can flow from $I$ to $J$ along a path where $x$ is not redefined, then $J$ uses the value of $x$ defined at $I$
- A name in a BB is live at a given point if its value is used after that point
- We say $x$ is live at statement $I$

```
\
Z = Z * 5
t7 = Z + 1
    Y = Z - t7
    X = Z + Y
```

```
X is live at (5), X's next
use at (5) is (15)
(5) \(X=\)

\section*{Example of Next Use and Liveness}
\begin{tabular}{|c|c|c|c|c|c|c|}
\hline Intermediate Code & \multicolumn{3}{|c|}{ Live/Dead } & \multicolumn{3}{c|}{ Next use } \\
\hline & & X & y & Z & x & y \\
\hline (1) \(\quad \mathrm{X}=\mathrm{y}+\mathrm{z}\) & L & D & D & \((2)\) & - & - \\
\hline (2) \(\quad \mathrm{Z}=\mathrm{x} * 5\) & D & & L & - & & \((3)\) \\
\hline\((3) \quad \mathrm{y}=\mathrm{z}-7\) & & L & L & & \((5)\) & \((5)\) \\
\hline\((4) \quad \mathrm{X}=\mathrm{z}+\mathrm{y}\) & D & D & D & - & - & - \\
\hline
\end{tabular}

\section*{Determining Next Use and Liveness Information}
- Input
- A BB (say B) of 3AC
- Assume symbol table shows all non-temporary variables in \(B\) as live on exit and all temporaries are dead on exit
- Output
- Liveness and next use information for each statement \(I: x=y\) op \(z\) in \(B\)
- Algorithm
i. \(\quad\) Scan forward over \(B\) to find the last statement.
- For each variable \(x\) used in \(B\), create fields \(x\).live and \(x\).next_use in the symbol table. Initialize \(x\).live \(=\) FALSE and \(x\).next_use \(=\) NONE.
- Each tuple \(I: x=y\) op \(z\) stores next use and liveness information. Initialize tuple.
ii. Scan backward over \(B\). For each statement \(I: x=y\) op \(z\) in \(B\), do
- Copy the next use and liveness information for \(x, y\), and \(z\) from the symbol table to tuple \(I\)
- Update \(x, y\), and \(z\) 's symbol table entries.
- Set \(x\). live \(=\) FALSE, \(x\).next_use \(=\) NONE
- Set \(y\).live \(=z\). live \(=\) TRUE and \(y\).next_use \(=z\).next_use \(=I\)

\section*{Example Computation of Next Use and Liveness Information}


\section*{Example Computation of Next Use and Liveness Information}


\section*{Control Flow Graph (CFG)}
- Graphical representation of control flow during execution
- Each node represents a statement or a BB
- An entry and an exit node are often added to a CFG for a function
- An edge represents possible transfer of control between nodes
- Used for compiler optimizations and static analysis (e.g., instruction scheduling and global register allocation)

straight-line code

predicate

loop iteration

\section*{Example of BBs and a CFG}
```

int main() {
int marks = 63, grade = 0;
if (marks >= 80)
grade = 10;
else if (marks >= 60)
grade = 8;
else if (marks >= 40)
grade = 6;
else
grade = 4;
printf("Grade %d", grade);
return 0;
}

```


\section*{Example CFG Generated with LLVM}
```

int main() {
int marks = 63, grade = 0;
if (marks >= 80)
grade = 10;
else if (marks >= 60)
grade = 8;
else if (marks >= 40)
grade = 6;
else
grade = 4;
printf("Grade %d", grade);
return 0;
}

```


\section*{Control Flow Graph (CFG)}
\begin{tabular}{|c|c|}
\hline 1. & prod \(=0\) \\
\hline 2. & \(i=1\) \\
\hline 3. & \(\mathrm{t}_{1}=4^{*} \mathrm{i}\) \\
\hline 4. & \(\mathrm{t}_{2}=\mathrm{a}[\mathrm{t} 1]\) \\
\hline 5. & \(\mathrm{t}_{3}=4^{*} \mathrm{i}\) \\
\hline 6. & \(\mathrm{t}_{4}=\mathrm{b}\left[\mathrm{t}_{3}\right]\) \\
\hline 7. & \(\mathrm{t}_{5}=\mathrm{t}_{2} * \mathrm{t}_{4}\) \\
\hline 8. & \(\mathrm{t}_{6}=\mathrm{prod}+\mathrm{t}_{5}\) \\
\hline 9. & prod \(=\mathrm{t}_{6}\) \\
\hline 10. & \(\mathrm{t}_{7}=\mathrm{i}+1\) \\
\hline 11. & \(i=t_{7}\) \\
\hline 12. & if \(\mathrm{i}<=20\) goto (3) \\
\hline
\end{tabular}


\section*{Loops in a CFG}
- A set of CFG nodes \(L\) form a loop if that \(L\) contains a node \(e\) called loop entry such that
- \(e\) is not the Entry node,
- No node in \(L\) besides \(e\) has a predecessor outside \(L\)
- Every node in \(L\) has a nonempty path to \(e\) that is completely within L

\(B_{1}\)

\section*{Loops in a CFG}
- There is a unique entry
- Only way to reach a node in \(L\)
from outside the loop is through \(e\)
- Only way to reach a node in \(L\)
from outside the loop is through \(e\)
- All nodes in the group are strongly connected
- There is a path from any node in the loop to any other loop
- Path is wholly-contained within the loop
\[
\begin{aligned}
& \text { prod }=0 \\
& i=1
\end{aligned}
\]
\(B_{1}\)


\section*{Example CFG}
1) \(i=1 / /\) Leader
2) \(j=1\)
3) \(\mathrm{t} 1=10\) * i
4) \(\mathrm{t} 2=\mathrm{t} 1+\mathrm{j}\)
5) \(\mathrm{t} 3=8\) * t2
6) \(\mathrm{t} 4=\mathrm{t} 3-88\)
7) \(\mathrm{a}[\mathrm{t} 4]=0.0\)
8) \(j=j+1\)
9) if \(j<=10\) goto (3)
10) \(i=i+1\)
11) if \(i<=10\) goto (2)
12) \(i=1\)
13) \(\mathrm{t} 5=\mathrm{i}-1\)
14) \(\mathrm{t} 6=88 * t 5\)
15) \(a[t 6]=1.0\)
16) \(i=i+1\)
17) if \(i<=10\) goto (13)


\title{
Optimizing BBs
}

Local optimizations

\section*{Optimization of BBs}
- Code optimizations can lead to substantial improvement in running time and/or energy consumption
- Global optimization analyzes control and data flow among BBs
- E.g., performs control flow, data flow, and data dependence analysis
- Local or intra-BB optimizations can also provide significant improvements
- DAG is a useful data structure for implementing transformations on BBs
- Allows detecting common sub-expressions

\section*{Intra-Block Transformations}
- Expressions are values of names that are live on exit from a \(B B\)
- Two BBs are equivalent if they compute the same set of expressions
- Local transformations on BBs to improve code quality
- Structure-preserving and algebraic transformations
- Should not change the set of expressions computed by a block

\section*{Structure-Preserving Transformations}
i. Common subexpression elimination
- Instructions compute a value that has been computed
ii. Dead code elimination
```

a = b + c a = b + c
b = a - d b = a - d
c = b + c c = b + c
d = a - d d = b

```
- Remove Instructions that define variables that are never used
iii. Renaming temporary variables
- Can always transform a BB into an equivalent block where each statement that defines a temporary uses a new name
- Such a BB is called a normal-form block
iv. Reordering of dependence-free statements
- Normal-form blocks permits statement interchanges without affecting the value of the block
\[
\begin{aligned}
& \mathrm{t}_{1}=\mathrm{b}+\mathrm{c} \\
& \mathrm{t}_{2}=\mathrm{x}+\mathrm{y}
\end{aligned}
\]
- May improve latency of accesses and register usage

\section*{Algebraic Transformations}
- Apply algebraic laws to simplify computation
\begin{tabular}{|c|c|}
\hline \multicolumn{2}{|c|}{ Strength Reduction } \\
\hline Expensive & Cheaper \\
\hline\(x^{2}\) & \(x \times x\) \\
\hline \(2 \times x\) & \(x+x\) \\
\hline\(x \div 2\) & \(x \gg 1\) \\
\hline
\end{tabular}
\[
\begin{aligned}
& x+0=0+x=x \\
& x \times 1=1 \times x=x \\
& x-0=x \\
& x \div 1=1
\end{aligned}
\]
- Constant folding - evaluate constants during compilation
- E.g., \(i=2 * 3.14 * 300 * 300\);
- Relational operators can generate common sub-expressions (e.g., \(x>y\) and \(x-y\) )

\section*{DAG Representation of BBs}

Many optimizations are easier to perform on a DAG representation of BBs
(1) \(t_{1}=4 * i\)
(2) \(t_{2}=a\left[t_{1}\right]\)
(3) \(t_{3}=4 * i\)
(4) \(t_{4}=b\left[t_{3}\right]\)
(5) \(t_{5}=t 2 * t_{4}\)
(6) \(\mathrm{t}_{6}=\operatorname{prod}+\mathrm{t}_{5}\)
(7) \(\operatorname{prod}=t_{6}\)
(8) \(\mathrm{t}_{7}=\mathrm{i}+1\)
(9) \(\mathrm{i}=\mathrm{t}_{7}\)
(10) if i <= 20 goto (1)


\section*{Representing BBs with DAGs}
- Rules on the DAG structure
- Leave nodes are labeled with variable names or constants
- Initial values for each variable is represented by a node
- A node \(N\) is associated with each statement \(s\) in a BB
- Children of \(N\) correspond to statements that last define the operands used in \(s\)
- Inner nodes are labeled by an operator symbol
- Node \(N\) is labeled by the operator applied at \(s\)
- Nodes optionally have a sequence of identifiers for labels
- Output nodes are those variables that are live on exit
- Each BB node in a CFG can be represented with a DAG

\section*{Constructing a DAG}
- Input
- A basic block (BB)
- Output
- A DAG for the BB with the following information
- a label for each node (id for leaf nodes and operator symbols for interior nodes)
- a list of identifiers (not constants) for each node

\section*{- Assumptions}
- Three kinds of 3AC: (i) \(x=y\) op \(z\), (ii) \(x=\) op \(y\), and (iii) \(x=y\)
- Relational operators like "if \(i \leq 20\) goto (1)" are treated like case (i)

\section*{Constructing a DAG}

\section*{- Steps}
- For each statement in the BB,
1. If \(\operatorname{node}(y)\) is undefined, create a leaf labeled \(y\) and set \(\operatorname{node}(y)\) to the new node
2. For case (i), check if there is a node in the DAG labeled op with left child node ( \(y\) ) and right child node ( \(z\) ). If not, then create a node (denoted by \(n\) ).
3. For case (ii), check if there is a node labeled op with node \((y)\) as the only child. If not, then create a node (denoted by \(n\) ).
4. Delete \(x\) from the list of identifiers for node \((x)\). Append \(x\) to the list of identifiers for the node and set \(\operatorname{node}(x)\) to \(n\).

\section*{DAG Representation of BBs}
(1) \(t_{1}=4 * i\)
(2) \(t_{2}=a\left[t_{1}\right]\)
(3) \(t_{3}=4 * i\)
(4) \(t_{4}=b\left[t_{3}\right]\)
(5) \(t_{5}=t 2 * t_{4}\)
(6) \(\mathrm{t}_{6}=\operatorname{prod}+\mathrm{t}_{5}\)
(7) \(\operatorname{prod}=t_{6}\)
(8) \(\mathrm{t}_{7}=\mathrm{i}+1\)
(9) \(\mathrm{i}=\mathrm{t}_{7}\)
(10) if \(i<=20\) goto (1)

1

(3)


2


\section*{Local Common Subexpressions}


\section*{Dead Code Elimination}
- Delete a root node from the DAG if it has no live variables
- Repeat till there are no such nodes
- Assume only \(a\) and \(b\) are live on exit
\[
\begin{aligned}
& \mathrm{a}=\mathrm{b}+\mathrm{c} \\
& \mathrm{~b}=\mathrm{b}-\mathrm{d} \\
& \mathrm{c}=\mathrm{c}+\mathrm{d} \\
& \mathrm{e}=\mathrm{b}+\mathrm{c}
\end{aligned}
\]


\section*{Representing Array References}


\section*{Consider Other Sources of Possible Aliasing}
\[
\begin{aligned}
& b=a+12 \\
& x=b[i] \\
& b[j]=y
\end{aligned}
\]

\(x=* p / /\) Use of every possible variable
* \(\mathrm{q}=\mathrm{y} / /\) Possible assignment to every variable
- =* must include all nodes for optimization analysis
- \(\quad\) = kills all other nodes
- Possible to use pointer analysis to be more precise
- Assume that procedures use variables attached to a node and kills that node

\title{
Code Generation Algorithm
}

Single Basic Blocks

\section*{Code Generation Strategy}
- Goal: Generate target code for a sequence of 3AC within a BB
- Assumptions
- Every 3AC operator has an equivalent operator in the target language
- Computed values can reside in registers and only needs to be saved when
1. The register is required for another computation, or
2. Just before a procedure call, jump, or a labelled statement
- Implies every register must be saved before the end of a BB
- Steps: For each 3AC,
- Identify variables that need to be loaded into registers
- Load the variables into registers
- Generate code for the instruction
- Generate a store if the result needs to be saved into memory

\section*{Challenges in Code Generation}

\section*{Different Possibilities}
\begin{tabular}{l|l|l}
\hline & ADD \(R j, R i\) & \begin{tabular}{l}
\(b\) is in \(R i, c\) is in \(R j, b\) is no longer \\
live on exit
\end{tabular} \\
\hline\(a=b+c\) & ADD \(c, R i\) & \(b\) is in \(R i, b\) is no longer live on exit \\
\begin{tabular}{ll} 
MOV \(c, R j\) \\
ADD \(R j, R i\)
\end{tabular} & \(b\) is in \(R i, b\) is no longer live on exit
\end{tabular}

Usually there will be numerous cases to consider
- An efficient choice depends on a number of factors (e.g., frequency of use of \(b\) and c later)
- Properties of the operator (e.g., commutativity) can add to the complexity

\section*{A Simple Code Generator}
- Treat each IR quadruple as a "macro"
- Replace the macro with pre-existing code templates
\[
\begin{array}{ll}
a=b+c & \text { LD R1, b } \\
& \text { LD R2, c } \\
& \text { ADD R1, R2 } \\
& \text { ST A, R1 }
\end{array} \quad \begin{aligned}
& \text { LD R1, b } \\
& \\
& \\
& \\
& \\
& \\
& \\
& \text { ST AD R1, R1 },
\end{aligned}
\]
- Simple to implement but makes inefficient use of registers
- Goal: Track values in registers and reuse them

\section*{Register and Address Descriptors}

\section*{Register descriptor}
- Keeps track of what name is stored in each register, consulted whenever a new register is needed
- Each register holds the value of zero or more names at any time during execution

\section*{Address descriptor}
- Keeps track of the location(s) where the current value of a name can be found at runtime
- Location can be a register, a stack location, a memory address, or some combination of these (data can get copied)
- Information can be stored in the symbol table

\section*{Code Generation Algorithm}
- For each 3AC instruction \(I\) of the form \(x=y\) op \(z\),
- Invoke function \(\operatorname{getreg}(I)\) to select registers \(R_{x}, R_{y}\), and \(R_{z}\)
- If \(y\) is not in \(R_{y}\) according to the address descriptor, then generate instruction LD \(R_{y}, y^{\prime}\)
- \(y^{\prime}\) is one of the memory locations for \(y\)
- Perform the same steps for \(z\)
- Generate the instruction OP \(R_{x}, R_{y}, R_{z}\)
- For a 3AC copy instruction \(x=y\),
- If \(y\) is not in \(R_{y}\) according to the address descriptor, then generate instruction LD \(R_{y}, y^{\prime}\)
- Adjust the register descriptor for \(R_{y}\) to include \(x\)

\section*{Managing Register and Address Descriptors}
- For an instruction LD \(R, x\),
- Change the register descriptor for \(R\) so it holds only \(x\)
- Change the address descriptor for \(x\) by adding register \(R\) as an additional location
- For instruction ST \(x, R\), change the address descriptor for \(x\) to include its own memory location

\section*{Managing Register and Address Descriptors}
- For an instruction such as ADD \(R_{x}, R_{y}, R_{z}\), implementing a 3AC \(x=\) \(y+z\),
- Change the register descriptor for \(R_{x}\) so that it holds only \(x\)
- Change the address descriptor for \(x\) so that its only location is \(R_{x}\)
- The memory location for \(x\) is no longer in the address descriptor for \(x\)
- Remove \(R_{x}\) from the address descriptor of any variable other than \(x\)
- For a copy instruction \(x=y\), remember to
- Process the load from \(y\) into a register (if needed)
- Add \(x\) to the register descriptor for \(R_{y}\)
- Change the address descriptor for \(x\) so that its only location is \(R_{y}\)

\section*{Usage of Registers}
- Leave the computed result in a register for as long as possible
- Store the result only at the end of a BB or when the register is needed for another computation
- A variable is live at a point if it is used (possibly in later BBs) later, requires global dataflow analysis
- On exit from a BB, store only live variables which are not already in their memory locations (use address descriptors to identify)
- If liveness information is not available, then assume that all variables are live at all times

\section*{Defining Function getreg()}
- Input
- 3AC I: \(x=y\) op \(z\)
- Output
- Returns registers to hold the value of \(x, y\), and \(z\)
- We assume that there is no global register allocation

\section*{\(\operatorname{getreg}()\) : Choosing \(R_{y}\) for \(y\)}
i. If \(y\) is in a register, then return the register containing \(y\) as \(R_{y}\)
ii. If \(y\) is not in a register, but there is an empty register available, then pick one such register as \(R_{y}\)
iii. If \(y\) is not in a register and there are no empty registers, then
- Let \(R\) be a candidate register and suppose \(v\) is one of the variables stored in \(R\)
- Heuristic for candidate selection can be based on farthest references or fewest next use
- If the address descriptor for \(v\) says that \(v\) is somewhere else beside \(R\), then choose \(R\)
- If \(v\) is \(x\), and \(x\) is not an operand of \(I\) (i.e., \(x \neq z\) ), then choose \(R\)
- If \(v\) is not used later, then choose \(R\)
- Else, generate ST \(v, R\) (called a register spill)
- \(R\) may hold several variables, so we need to repeat the previous steps for each variable
- Compute the number of store instructions generated for \(R\) (i.e., score) for each variable
- Pick the register with the lowest score
iv. Selecting \(R_{z}\) for \(z\) is similar

\section*{\(\operatorname{getreg}():\) Choosing \(R_{x}\) for \(x\)}
- Consider selection of a register \(R_{x}\) for \(x\). In addition to the previous checks, try the following.
- A register that holds only \(x\) is always an acceptable choice for \(R_{x}\)
- If \(y\) is not used after instruction \(I\), and \(R_{y}\) holds only \(y\) after being loaded, then \(R_{y}\) can also be used for \(R_{x}\)
- Perform similar checks with \(R_{z}\) if required
- If \(I\) is a copy instruction, then always choose \(R_{y}\)

\section*{Code Generation Example}


\section*{Code Generation Example}
\begin{tabular}{|c|c|c|c|c|c|c|c|c|c|c|c|}
\hline 3AC & Generated Code & \multicolumn{4}{|c|}{Register Descriptor} & \multicolumn{6}{|c|}{Address Descriptor} \\
\hline & & R1 & R2 & R3 & a & b & c & d & t & u & v \\
\hline & & \(u\) & \(a, d\) & \(v\) & R2 & \(b\) & c & \(d, R 2\) & & R1 & R3 \\
\hline \(d=v+u\) & ADD \(R 1, R 3, R 1\) & & & & & & & & & & \\
\hline & & \(d\) & \(a\) & \(v\) & \(R 2\) & \(b\) & c & R1 & & & R3 \\
\hline exit & \begin{tabular}{l}
ST \(a, R 2\) \\
ST \(d, R 1\)
\end{tabular} & & & & & & & & & & \\
\hline & & \(d\) & \(a\) & \(v\) & \(a, R 2\) & \(b\) & c & \(d, R 1\) & & & R3 \\
\hline
\end{tabular}

\section*{Code Sequences for Indexed and Pointer Assignments}
\begin{tabular}{|c|c|c|c|}
\hline 3AC & \(\boldsymbol{i}\) in register \(\boldsymbol{R} \boldsymbol{i}\) & \(\boldsymbol{i}\) in memory Mi & \(\boldsymbol{i}\) in Stack \\
\hline \(a=b[i]\) & MOV \(b(R i), R\) & MOV Mi,R \(\operatorname{MOV} b(R), R\) & \(\operatorname{MOV} \operatorname{Si}(A), R\) \(\operatorname{MOV} b(R), R\) \\
\hline \(a[i]=b\) & MOV \(b, a(R i)\) & \begin{tabular}{l}
MOV Mi,R \\
MOV \(b, a(R)\)
\end{tabular} & \[
\begin{aligned}
& \operatorname{MOV} \operatorname{Si}(A), R \\
& \operatorname{MOV} b, a(R)
\end{aligned}
\] \\
\hline
\end{tabular}
\begin{tabular}{|c|c|c|c|}
\hline 3AC & \(\boldsymbol{p}\) in register \(\boldsymbol{R} \boldsymbol{p}\) & \(\boldsymbol{p}\) in memory Mp & \(\boldsymbol{p}\) in Stack \\
\hline \(a=* p\) & \(\mathrm{MOV} * R p, a\) & \[
\begin{aligned}
& \operatorname{MOV} M p, R \\
& \mathrm{MOV} * R, R
\end{aligned}
\] & \[
\begin{aligned}
& \operatorname{MOV} \operatorname{Sp}(A), R \\
& \operatorname{MOV} * R, R
\end{aligned}
\] \\
\hline \(* p=b\) & MOV \(a, * R p\) & \[
\begin{aligned}
& \text { MOV } M p, R \\
& \text { MOV } a, * R
\end{aligned}
\] & \[
\begin{aligned}
& \operatorname{MOV} a, R \\
& \operatorname{MOV} R, * \operatorname{Sp}(A)
\end{aligned}
\] \\
\hline
\end{tabular}

Instruction Selection by Tree Rewriting

\section*{Tree Representation}
- Consider the statement
\[
a[i]=b+1
\]
- Assume \(b\) is in memory location \(M_{b}\)
- Array of chars \(a\) is a local and is stored on the stack
- SP points to the beginning of the current activation record
- Addresses of locals \(a\) and \(i\) are given as constant offsets \(C_{a}\) and \(C_{i}\) from the SP


\section*{Tree Rewriting}
- Target code can be generated by applying a sequence of treerewriting rules to reduce the input tree to a single node
- Each rewrite rule is of the form
\[
\text { replacement } \leftarrow \text { template }\{\text { action }\}
\]
where replacement is a single node, template is a tree, and action is a code fragment like in a SDT
- A set of tree rewriting rules is called a tree-translation scheme


\section*{Tree Rewriting Rules}
(1) \(R_{i} \leftarrow C_{a} \quad\left\{\operatorname{LD} R_{i}, \# a\right\}\)

(2) \(R_{i} \leftarrow M_{x}\)
\(\left\{\operatorname{LD} R_{i}, x\right\}\)


\section*{Tree Rewriting Rules}


\section*{Code Generation by Tiling an Input Tree}
- High-level steps in a tree-translation scheme
- Given an input tree, the templates in the tree-rewriting rules are applied to tile its subtrees
- If a template matches, replace the matching subtree with the replacement node of the rule
- Execute the action associated with the rule
- If the action contains a sequence of instructions, the instructions are emitted
- Repeat the above steps until the tree is reduced to a single node, or until no more templates match
- Output of the tree-translation scheme is the instruction sequence generated as the input tree is reduced to a single node

\section*{Example of Code Generation with Tree Rewriting}


\section*{Example of Code Generation with Tree Rewriting}


\section*{Example of Code Generation with Tree Rewriting}


\section*{Example of Code Generation with Tree Rewriting}


\section*{Example of Code Generation with Tree Rewriting}


M

\section*{Example of Code Generation with Tree Rewriting}

```

LD Ro,\#a
ADD R
ADD R}\mp@subsup{R}{0}{},\mp@subsup{R}{0}{},i(SP
LD R R , b
INC R1
ST *R R, R1

```

\section*{Considerations during Tree Reduction}
i. Performance of tree matching impacts the efficiency of code generation at compile time
ii. Multiple templates may match during code generation
iii. Different match sequences of templates will lead to different code being generated which can impact efficiency
iv. If no template matches, then the code-generation process blocks
- Assume each operator in the intermediate code can be implemented by one or more target-machine instructions
- Assume there are sufficient registers to compute each tree node by itself

\section*{Pattern Matching with LR Parsing}
- Idea
- Convert the input tree to a string using prefix form for comparison
- Use a parsing mechanism for pattern matching
- Come up with a syntax-directed translation (SDT) as an alternate for tree rewriting rules


\section*{SDT for Tree Rewriting}
- Terminal m represents a memory location
- Terminal sp represents register SP
- Terminal c represents a constant
- Design a code generator for a different architecture by rewriting the grammar
- Resolve conflicts using estimates of instruction costs, favoring larger reductions, and favoring shifts over reductions
\begin{tabular}{l|l|}
\hline \multicolumn{1}{r|}{ Production } & Semantic Action \\
\hline\(R_{i} \rightarrow \mathbf{c}_{a}\) & \(\left\{\mathrm{LD} R_{i}, \# a\right\}\) \\
\hline\(R_{i} \rightarrow M_{x}\) & \(\left\{\mathrm{LD} R_{i}, x\right\}\) \\
\hline\(M \rightarrow=M_{x} R_{i}\) & \(\left\{\mathrm{ST} x, R_{i}\right\}\) \\
\hline\(M \rightarrow=\) ind \(R_{i} R_{j}\) & \(\left\{\mathrm{ST} * R_{i}, R_{j}\right\}\) \\
\hline\(R_{i} \rightarrow\) ind \(+\mathbf{c}_{a} R_{j}\) & \(\left\{\mathrm{LD} R_{i}, a\left(R_{j}\right)\right\}\) \\
\hline \begin{tabular}{l}
\(R_{i} \rightarrow+R_{i}\) ind + \\
\(\mathbf{c}_{a} R_{j}\)
\end{tabular} & \(\left\{\mathrm{ADD} R_{i}, R_{i}, a\left(R_{j}\right)\right\}\) \\
\(R_{i} \rightarrow+R_{i} R_{j}\) & \(\left\{\mathrm{ADD} R_{i}, R_{i}, R_{j}\right\}\) \\
\hline\(R_{i} \rightarrow+R_{i} \mathbf{c}_{1}\) & \(\left\{\operatorname{INC} R_{i}\right\}\) \\
\hline\(R \rightarrow \mathbf{s p}\) & \\
\hline\(M \rightarrow \mathbf{m}\) & \\
\hline
\end{tabular}

\title{
Dynamic Programming Based Optimal Code Generation
}

\section*{Code Generation from DAGs}
- Optimal code generation for DAGs is NP-complete
- So, DAGs are divided into trees and processed
- An alternative is to replicate shared trees in DAGs but it increases the code size


\section*{Expression Trees}
- A syntax tree for an expression
\[
\begin{aligned}
& (\mathrm{a}-\mathrm{b})+\mathrm{e} *(\mathrm{c} / \mathrm{d}) \\
& \mathrm{t} 1=\mathrm{a}-\mathrm{b} \\
& \mathrm{t} 2=\mathrm{c} / \mathrm{d} \\
& \mathrm{t} 3=\mathrm{e} * \mathrm{t} 2 \\
& \mathrm{t} 4=\mathrm{t} 1+\mathrm{t} 3
\end{aligned}
\]


\section*{Dynamic Programming Based Optimal Code Generation}
- Generates optimal code from an expression tree for a BB
- Optimality is in terms of number of instructions generated
- Machine model - register machines with complex instruction sets
- Assume there are \(r\) interchangeable registers \(R_{0}, \ldots, R_{r-1}\)
- Instructions are of form: \(R_{i}=E\)
- If \(E\) involves registers, then \(R_{i}\) must be one of them (i.e., 2-address instructions)
- Variants: \(R_{i}=M_{j}, R_{i}=R_{i}\) op \(R_{j}, R_{i}=R_{i}\) op \(M_{j}\)
- Other variants: \(R_{i}=R_{j}, M_{i}=R_{j}\)

\section*{Contiguous Evaluation}
- The optimality criterion requires contiguous evaluation of an expression tree
- No higher costs and no more registers
- A program \(P\) evaluates a tree \(T\) contiguously if
- it first evaluates those subtrees of \(T\) that need to be computed into memory,
- it then evaluates \(T_{1}, T 2\), and then root, in order, or \(T_{2}, T_{1}\), and then root, in order

Assume \(E\) is \(E_{1}+E_{2}\)

syntax tree for \(E\)
- Evaluating part of \(T_{1}\) leaving the result in a register, evaluating \(T_{2}\), and then evaluating rest of \(T_{1}\) is not contiguous evaluation

\section*{Dynamic Programming Algorithm}
- Assumption: target has \(r\) registers
1. Compute bottom-up for each node \(n\) of the expression tree \(T\) an array \(C\) of costs
- \(C[i]\) is the minimum cost of computing the subtree \(S\) rooted at \(n\) into a register, assuming \(i\) registers are available for the computation, for \(1<i<r\)
- The cost of computing a node \(n\) includes the count of loads and stores necessary to evaluate \(S\) in the given number of registers plus the cost of computing the operator at the root of \(S\)
2. Traverse \(T\), using the cost vectors to determine which subtrees of \(T\) must be computed into memory
3. Traverse each tree using the cost vectors and associated instructions to generate the final target code
- Code for the subtrees computed into memory locations is generated first, then code for other subtrees, and then code for the root

\section*{Example}
- Consider a machine having two registers \(R_{0}\) and \(R_{1}\)
- Assume the available instructions are
\begin{tabular}{l} 
LD \(R_{i}, M_{j}\) \\
\hline op \(R_{i}, R_{i}, R_{j}\) \\
\hline op \(R_{i}, R_{i}, M_{j}\) \\
\hline LD \(R_{i}, R_{j}\) \\
\hline ST \(M_{i}, R_{j}\) \\
\hline
\end{tabular}

- Furthermore, assume all instructions are of unit cost
- Can be extended to cases where instructions have varying costs

\section*{Expression Tree with Cost Vectors}
\begin{tabular}{ll}
\(C_{a}[0]=0\) & Cost of computing \(a\) in memory \\
\(C_{a}[1]=1\) & Cost of computing \(a\) in a register \\
\(C_{a}[2]=1\) & \begin{tabular}{l} 
Cost of computing \(a\) in a register, \\
with two registers available
\end{tabular}
\end{tabular}
\[
\begin{aligned}
& \operatorname{LD} R_{0}, a \\
& \operatorname{ADD} R_{0}, R_{0}, b
\end{aligned}
\]
- \(C_{-}[1]=C_{a}[1]+C_{b}[0]+1=1+0+1=2\)
\(-\quad C_{-}[2]=\min \left(\begin{array}{l}C_{a}[2]+C_{b}[1]+1, \\ C_{a}[2]+C_{b}[0]+1, \\ C_{a}[1]+C_{b}[2]+1, \\ C_{a}[1]+C_{b}[1]+1, \\ C_{a}[1]+C_{b}[0]+1\end{array}\right)\)
\[
=\min (3,2,3,3,2)=2
\]
- \(C_{-}[0]=C_{-}[2]+1=3\)


\section*{Expression Tree with Cost Vectors}
- \(C_{*}[1]=C_{c}[1]+C_{/}[0]+1=1+3+1=5\)
\(-C_{*}[2]=\min \left(\begin{array}{l}C_{c}[2]+C_{/}[1]+1, \\ C_{c}[2]+C_{/}[0]+1, \\ C_{c}[1]+C_{/}[2]+1, \\ C_{c}[1]+C_{/}[1]+1, \\ C_{c}[1]+C_{/}[0]+1\end{array}\right)\)
\[
=\min (4,5,4,4,5)=4
\]
- \(C_{*}[0]=C_{*}[2]+1=5\)
- \(C_{+}[1]=C_{-}[1]+C_{*}[0]+1=2+5+1=8\)
\(\begin{aligned}-C_{+}[2] & =\min \left(\begin{array}{l}C_{-}[2]+C_{*}[1]+1, \\ C_{-}[2]+C_{*}[0]+1, \\ C_{-}[1]+C_{*}[2]+1, \\ C_{-}[1]+C_{*}[1]+1, \\ C_{-}[1]+C_{*}[0]+1\end{array}\right) \\ & =\min (8,8,7,8,8)=7\end{aligned}\)
\[
=\min (8,8,7,8,8)=7
\]

- \(C_{+}[0]=C_{+}[2]+1=8\)

\section*{Tree Traversal to Generate Code}
- Min cost at node + is 7, which implies right subtree (RST) is computed with 2 registers in \(R_{0}\) and left subtree (LST) is computed with 1 register into \(R_{1}\)
- For node *, compute RST with one register in \(R_{1}\) and LST in \(R_{0}\)
- For node \(c\), emit LD \(R_{0}, c\)
- For node /, compute RST in memory and compute LST in \(R_{1}\)
- For node \(d\), emit LD \(R_{1}, d\)
- For node -, compute RST in memory and compute LST in \(R_{1}\)
- For node \(a\), emit LD \(R_{1}, a\)
\[
\Rightarrow \operatorname{ADD} R_{1}, R_{1}, R_{0}
\]

\title{
Instruction Selection via Peephole Optimization
}

\section*{Peephole Optimization}
- Insight: Find local optimizations by examining short sequences of adjacent operations
- The sliding window, or the peephole, moves over code
- Code in a peephole need not be contiguous
- Goal is to identify code patterns that can be improved
- Rewrite code patterns with improved sequence
\(1 \mathrm{LD} a, R_{0}\) ST \(R_{0}, a\)
\(\longrightarrow \mathrm{LD} a, R_{0}\)
\(2 \mathrm{ST} a, R_{0}\) LD \(R_{3}, a\)

ST \(a, R_{0}\) \(\operatorname{MOV} R_{3}, R_{0}\)
\(3 \mathrm{ADD} R_{7}, R_{0}, 0\) \(\operatorname{MUL} R_{10}, R_{4}, R_{7}\)

\section*{Examples of Peephole Optimizations}
- Eliminate redundant instructions
- Eliminate unreachable code

- Eliminate jump over jumps
- Algebraic simplification
- Strength reduction
- Use of machine idioms


\section*{Peephole Optimization based Code Generation}
- A naïve optimization strategy can use exhaustive search to match the patterns and rewrite code
- Can work if number of patterns and the window size are small
- Does not work for modern complex ISAs
- Workflow in a modern peephole optimizer
\(\xrightarrow{\mathrm{IR}} \underset{\mathrm{IR} \rightarrow \text { LLIR }}{\text { Expander }} \xrightarrow[\text { LLIR }]{\text { SLIR } \rightarrow \text { LLIR }} \xrightarrow{\text { Simplifier }} \xrightarrow[\text { LIIR } \rightarrow \text { ASM }]{\text { Matcher }} \xrightarrow{\text { ASM }}\)
- In an optimizer, the input and output languages are the same
- With a different output language (e.g., ASM), the optimizer can be used for code generation

\section*{Peephole Optimization based Code Generation}
- Expander rewrites the IR to represent all the direct effects of an operation
- If OP \(R_{0}, R_{1}, R_{2}\) sets a condition code, then the LLIR should include an explicit operation to set the condition code
- Simplifier performs limited local optimization on the LLIR in the window
- Matcher compares the simplified LLIR against the pattern library

\section*{Example}

AST computes \(a=b-2 \times c\)
- \(a\) is stored at offset 4 in the local AR
- b stored as a call-by-reference parameter whose pointer is stored at offset - 16 from the ARP
- \(c\) is at offset 12 from the label @ \(G\)
\begin{tabular}{|c|c|c|c|}
\hline Op & Arg \(_{1}\) & Arg \(_{2}\) & Result \\
\hline\(\times\) & 2 & \(c\) & \(t_{1}\) \\
\hline- & \(b\) & \(t_{1}\) & \(a\) \\
\hline
\end{tabular}


\section*{Example}

\[
\begin{aligned}
& r_{10} \leftarrow 2 \\
& r_{11} \leftarrow @ G \\
& r_{12} \leftarrow 12 \\
& r_{13} \leftarrow r_{11}+r_{12} \\
& r_{14} \leftarrow M\left(r_{13}\right) \\
& r_{15} \leftarrow r_{10} \times r_{14} \\
& r_{16} \leftarrow-16 \\
& r_{17} \leftarrow r_{A R P}+r_{16} \\
& r_{18} \leftarrow M\left(r_{17}\right) \\
& r_{19} \leftarrow M\left(r_{18}\right) \\
& r_{20} \leftarrow r_{19}-r_{15} \\
& r_{21} \leftarrow 4 \\
& r_{22} \leftarrow r_{A R P}+r_{21} \\
& M\left(r_{22}\right) \leftarrow r_{20}
\end{aligned}
\]

\section*{Example}
\begin{tabular}{|c|c|c|c|}
\hline Op & Arg \(_{1}\) & Arg \(_{2}\) & Result \\
\hline\(\times\) & 2 & \(c\) & \(t_{1}\) \\
\hline- & \(b\) & \(t_{1}\) & \(a\) \\
\hline
\end{tabular}

\[
r_{10} \leftarrow 2
\]
\[
r_{11} \leftarrow @ G
\]
\[
r_{12} \leftarrow 12
\]

Expand
\[
r_{14} \leftarrow M\left(r_{13}\right)
\]
\[
r_{10} \leftarrow 2
\]
\[
r_{11} \leftarrow @ G
\]
Simplify
\[
r_{14} \leftarrow M\left(r_{11}+12\right)
\]
\[
r_{15} \leftarrow r_{10} \times r_{14}
\]
\[
r_{18} \leftarrow M\left(r_{A R P}-16\right)
\]
\[
r_{15} \leftarrow r_{10} \times r_{14}
\]
\[
r_{19} \leftarrow M\left(r_{18}\right)
\]
\[
r_{16} \leftarrow-16
\]
\[
r_{20} \leftarrow r_{19}-r_{15}
\]
\[
r_{17} \leftarrow r_{A R P}+r_{16}
\]
\[
r_{18} \leftarrow M\left(r_{17}\right)
\]
\[
r_{19} \leftarrow M\left(r_{18}\right)
\]
\[
r_{20} \leftarrow r_{19}-r_{15}
\]
\[
M\left(r_{A R P}+4\right) \leftarrow r_{20}
\]
instructions and
\[
r_{21} \leftarrow 4
\]
\[
r_{22} \leftarrow r_{A R P}+r_{21}
\]
\[
M\left(r_{22}\right) \leftarrow r_{20} \quad
\]

\section*{Sequences Produced by the Simplifier}
\[
\begin{aligned}
& r_{10} \leftarrow 2 \\
& r_{11} \leftarrow @ G \\
& r_{12} \leftarrow 12 \\
& r_{13} \leftarrow r_{11}+r_{12} \\
& r_{14} \leftarrow M\left(r_{13}\right) \\
& r_{15} \leftarrow r_{10} \times r_{14} \\
& r_{16} \leftarrow-16 \\
& r_{17} \leftarrow r_{A R P}+r_{16} \\
& r_{18} \leftarrow M\left(r_{17}\right) \\
& r_{19} \leftarrow M\left(r_{18}\right) \\
& r_{20} \leftarrow r_{19}-r_{15} \\
& r_{21} \leftarrow 4 \\
& r_{22} \leftarrow r_{A R P}+r_{21} \\
& M\left(r_{22}\right) \leftarrow r_{20}
\end{aligned}
\]

\section*{Sequence 1}
\[
\begin{array}{ll}
r_{10} \leftarrow 2 & r_{11} \leftarrow @ G \\
r_{11} \leftarrow @ G & r_{14} \leftarrow M\left(r_{11}+12\right) \\
r_{12} \leftarrow 12 & r_{15} \leftarrow r_{10} \times r_{14}
\end{array}
\]

\section*{Sequence 2}
\[
\begin{aligned}
& r_{11} \leftarrow @ G \\
& r_{12} \leftarrow 12 \\
& r_{13} \leftarrow r_{11}+r_{12}
\end{aligned}
\]

\section*{Sequence 3}
\[
\begin{aligned}
& r_{11} \leftarrow @ G \\
& r_{13} \leftarrow r_{11}+12 \\
& r_{14} \leftarrow M\left(r_{13}\right)
\end{aligned}
\]

\section*{Sequence 4}

\section*{Sequence 5}
\[
\begin{aligned}
& r_{14} \leftarrow M\left(r_{11}+12\right) \\
& r_{15} \leftarrow r_{10} \times r_{14} \\
& r_{16} \leftarrow-16
\end{aligned}
\]

\section*{Sequence 6}
\[
\begin{aligned}
& r_{15} \leftarrow r_{10} \times r_{14} \\
& r_{16} \leftarrow-16 \\
& r_{17} \leftarrow r_{A R P}+r_{16}
\end{aligned}
\]

Sequence 7
\[
\begin{aligned}
& r_{15} \leftarrow r_{10} \times r_{14} \\
& r_{17} \leftarrow r_{A R P}-16 \\
& r_{18} \leftarrow M\left(r_{17}\right)
\end{aligned}
\]

\section*{Sequence 8}
\[
\begin{aligned}
& r_{15} \leftarrow r_{10} \times r_{14} \\
& r_{18} \leftarrow M\left(r_{A R P}-16\right) \\
& r_{19} \leftarrow M\left(r_{18}\right)
\end{aligned}
\]

Sequence 9
\[
\begin{aligned}
& r_{18} \leftarrow M\left(r_{A R P}-16\right) \\
& r_{19} \leftarrow M\left(r_{18}\right) \\
& r_{20} \leftarrow r_{19}-r_{15}
\end{aligned}
\]

\section*{Sequences Produced by the Simplifier}
```

r10}\leftarrow
r11}\leftarrow@
r12}\leftarrow1
r13}\leftarrow\mp@subsup{r}{11}{}+\mp@subsup{r}{12}{
r,
r 15}\leftarrow\mp@subsup{r}{10}{}\times\mp@subsup{r}{14}{
r 16}\leftarrow-1
r 17 \leftarrowr raRP + r r 16
r 18}\leftarrowM(\mp@subsup{r}{17}{}
r 19}\leftarrowM(\mp@subsup{r}{18}{}
r20}\leftarrow\mp@subsup{r}{19}{}-\mp@subsup{r}{15}{
r21}\leftarrow
r22}\leftarrow\mp@subsup{r}{ARP}{}+\mp@subsup{r}{21}{
M(r r2)})\leftarrow\mp@subsup{r}{20}{

```

Sequence 10
\[
\begin{aligned}
& r_{19} \leftarrow M\left(r_{18}\right) \\
& r_{20} \leftarrow r_{19}-r_{15} \\
& r_{21} \leftarrow 4
\end{aligned}
\]

\section*{Sequence 11}
\[
\begin{aligned}
& r_{20} \leftarrow r_{19}-r_{15} \\
& r_{21} \leftarrow 4 \\
& r_{22} \leftarrow r_{A R P}+r_{21}
\end{aligned}
\]

\section*{Sequence 12}
\(r_{20} \leftarrow r_{19}-r_{15}\)
\(r_{22} \leftarrow r_{A R P}+4\)
\(M\left(r_{22}\right) \leftarrow r_{20}\)

\section*{Sequence 13}
\[
\begin{aligned}
& r_{20} \leftarrow r_{19}-r_{15} \\
& M\left(r_{A R P}+4\right) \leftarrow r_{20}
\end{aligned}
\]

\section*{Example}
\begin{tabular}{|c|c|c|c|}
\hline Op & Arg \(_{\mathbf{1}}\) & Arg \(_{\mathbf{2}}\) & Result \\
\hline\(\times\) & 2 & \(c\) & \(t_{1}\) \\
\hline- & \(b\) & \(t_{1}\) & \(a\) \\
\hline
\end{tabular}

Expand
\[
\begin{aligned}
& r_{10} \leftarrow 2 \\
& r_{11} \leftarrow @ G \\
& r_{12} \leftarrow 12 \\
& r_{13} \leftarrow r_{11}+r_{12} \\
& r_{14} \leftarrow M\left(r_{13}\right) \\
& r_{15} \leftarrow r_{10} \times r_{14} \\
& r_{16} \leftarrow-16 \\
& r_{17} \leftarrow r_{A R P}+r_{16} \\
& r_{18} \leftarrow M\left(r_{17}\right) \\
& r_{19} \leftarrow M\left(r_{18}\right) \\
& r_{20} \leftarrow r_{19}-r_{15} \\
& r_{21} \leftarrow 4 \\
& r_{22} \leftarrow r_{A R P}+r_{21} \\
& M\left(r_{22}\right) \leftarrow r_{20}
\end{aligned}
\]
\[
r_{10} \leftarrow 2
\]
\[
r_{11} \leftarrow @ G
\]

Simplify
\(r_{15} \leftarrow r_{10} \times r_{14}\)
\(r_{14} \leftarrow M\left(r_{11}+12\right)\)
\(r_{18} \leftarrow M\left(r_{A R P}-16\right)\)
\(r_{19} \leftarrow M\left(r_{18}\right)\)
\(r_{20} \leftarrow r_{19}-r_{15}\)
\(M\left(r_{A R P}+4\right) \leftarrow r_{20}\)

\section*{\begin{tabular}{l}
\(\stackrel{\Im}{4}\) \\
\(\stackrel{y}{0}\) \\
\hline
\end{tabular}}

LD \(\quad r_{10}, 2\)
LD \(r_{11}, @ G\)
LD \(r_{14}, 12\left(r_{11}\right)\)
MUL \(r_{15}, r_{10}, r_{14}\)
LD \(r_{18},-16\left(r_{A R P}\right)\)
LD \(r_{19}, r_{18}\)
SUB \(r_{20}, r_{19}, r_{15}\)
ST \(4\left(r_{A R P}\right), r_{20}\)

\section*{Example}
\begin{tabular}{|c|c|c|c|}
\hline Op & Arg \(_{1}\) & Arg \(_{2}\) & Result \\
\hline\(\times\) & 2 & \(c\) & \(t_{1}\) \\
\hline- & \(b\) & \(t_{1}\) & \(a\) \\
\hline
\end{tabular}
\[
\begin{aligned}
& r_{10} \leftarrow 2 \\
& r_{11} \leftarrow @ G \\
& r_{12} \leftarrow 12 \\
& r_{13} \leftarrow r_{11}+r_{12} \\
& r_{14} \leftarrow M\left(r_{13}\right) \\
& r_{15} \leftarrow r_{10} \times r_{14} \\
& r_{16} \leftarrow-16 \\
& r_{17} \leftarrow r_{A R P}+r_{16} \\
& r_{18} \leftarrow M\left(r_{17}\right) \\
& r_{19} \leftarrow M\left(r_{18}\right) \\
& r_{20} \leftarrow r_{19}-r_{15} \\
& r_{21} \leftarrow 4 \\
& r_{22} \leftarrow r_{A R P}+r_{21} \\
& M\left(r_{22}\right) \leftarrow r_{20}
\end{aligned}
\]

Expand
\(\xrightarrow{\square}\)
- Correctly identifying dead values, presence of control flow, and window size limit the effectiveness of peephole optimizations
- Can use logical instead based on data flow instead of physical windows
Simplify
\[
\begin{aligned}
& r_{10} \leftarrow 2 \\
& r_{11} \leftarrow @ G \\
& r_{14} \leftarrow M\left(r_{11}+12\right) \\
& r_{15} \leftarrow r_{10} \times r_{14} \\
& r_{18} \leftarrow M\left(r_{A R P}-16\right) \\
& r_{19} \leftarrow M\left(r_{18}\right) \\
& r_{20} \leftarrow r_{19}-r_{15} \\
& M\left(r_{A R P}+4\right) \leftarrow r_{20}
\end{aligned}
\]
\[
\begin{aligned}
& \text { LD } \quad r_{10}, 2 \\
& \text { LD } \quad r_{11}, @ G \\
& \begin{array}{ll}
\text { LD } & r_{14}, 12\left(r_{11}\right) \\
\text { MUL } & r_{15}, \boldsymbol{r}_{10}, r_{14} \\
\text { LD } & r_{18},-16\left(r_{A R P}\right)
\end{array} \\
& \text { LD } \quad r_{19}, r_{18} \\
& \text { SUB } r_{20}, r_{19}, r_{15} \\
& \text { ST } \quad 4\left(r_{A R P}\right), r_{20}
\end{aligned}
\]

\section*{Current State in Code Generation}
- Modern peephole systems automatically generates a matcher from a description of a target machine's instruction set
- Eases the work in retargeting the backend
i. Provide a new appropriate machine description to the pattern generator to produce a new instruction selector
ii. Change the LLIR sequences to match the new ISA
iii. Modify the instruction scheduler and register allocator to reflect the characteristics of the new ISA
- GCC uses a low-level IR Register-Transfer Language (RTL) for optimization and for code generation
- The backend uses a peephole scheme to convert RTL into assembly code

\section*{References}
- A. Aho et al. Compilers: Principles, Techniques, and Tools, \(1^{\text {st }}\) edition, Chapter 9.1-9.8, 9.10, 9.11.
- A. Aho et al. Compilers: Principles, Techniques, and Tools, \(2^{\text {nd }}\) edition, Chapter 8.1-8.6, 8.9, 8.10.
- K. Cooper and L. Torczon. Engineering a Compiler, \(2^{\text {nd }}\) edition, Chapter 11.```

