## Pipelining and hazards

#### CASS 2018 Lavanya Ramapantulu

# Performance Issues

- Longest delay determines clock period
  - Critical path: load instruction
  - Instruction memory  $\rightarrow$  register file  $\rightarrow$  ALU  $\rightarrow$  data memory  $\rightarrow$  register file
- Not feasible to vary period for different instructions
- Violates design principle

   Making the common case fast
- We will improve performance by pipelining

# **Pipelining Analogy**

• Pipelined laundry: overlapping execution

- Parallelism improves performance



# **RISC-V** Pipeline

- Five stages, one step per stage
  - 1. IF: Instruction fetch from memory
  - 2. ID: Instruction decode & register read
  - 3. EX: Execute operation or calculate address
  - 4. MEM: Access memory operand
  - 5. WB: Write result back to register

## **Pipeline Performance**

- Assume time for stages is
  - 100ps for register read or write
  - 200ps for other stages
- Compare pipelined datapath with single-cycle datapath

| Instr    | Instr fetch | Register<br>read | ALU op | Memory<br>access | Register<br>write | Total time |  |
|----------|-------------|------------------|--------|------------------|-------------------|------------|--|
| ld       | 200ps       | 100 ps           | 200ps  | 200ps            | 100 ps            | 800ps      |  |
| sd       | 200ps       | 100 ps           | 200ps  | 200ps            |                   | 700ps      |  |
| R-format | 200ps       | 100 ps           | 200ps  |                  | 100 ps            | 600ps      |  |
| beq      | 200ps       | 100 ps           | 200ps  |                  |                   | 500ps      |  |

## **Pipeline Performance**



# Pipeline Speedup

- If all stages are balanced
  - i.e., all take the same time
  - Time between instructions<sub>pipelined</sub>
     Time between instructions<sub>nonpipelined</sub>
     Number of stages
- If not balanced, speedup is less
- Speedup due to increased throughput
  - Latency (time for each instruction) does not decrease

# Pipelining and ISA Design

- RISC-V ISA designed for pipelining
  - All instructions are 32-bits
    - Easier to fetch and decode in one cycle
    - c.f. x86: 1- to 17-byte instructions
  - Few and regular instruction formats
    - Can decode and read registers in one step
  - Load/store addressing
    - Can calculate address in 3<sup>rd</sup> stage, access memory in 4<sup>th</sup> stage

## Hazards

- Situations that prevent starting the next instruction in the next cycle
- Structure hazards

A required resource is busy

• Data hazard

 Need to wait for previous instruction to complete its data read/write

- Control hazard
  - Deciding on control action depends on previous instruction

#### Structure Hazards

- Conflict for use of a resource
- In RISC-V pipeline with a single memory
  - Load/store requires data access
  - Instruction fetch would have to *stall* for that cycle
    - Would cause a pipeline "bubble"
- Hence, pipelined datapaths require separate instruction/data memories
  - Or separate instruction/data caches

#### Data Hazards

 An instruction depends on completion of data access by a previous instruction



# Forwarding (aka Bypassing)

- Use result when it is computed
  - Don't wait for it to be stored in a register
  - Requires extra connections in the datapath



## Load-Use Data Hazard

- Can't always avoid stalls by forwarding
  - If value not computed when needed
  - Can't forward backward in time!



## Code Scheduling to Avoid Stalls

- Reorder code to avoid use of load result in the next instruction
- C code for a = b + e; c = b + f;



## **Control Hazards**

- Branch determines flow of control
  - Fetching next instruction depends on branch outcome
  - Pipeline can't always fetch correct instruction
    - Still working on ID stage of branch
- In RISC-V pipeline
  - Need to compare registers and compute target early in the pipeline
  - Add hardware to do it in ID stage

## Stall on Branch

 Wait until branch outcome determined before fetching next instruction



# **Branch Prediction**

 Longer pipelines can't readily determine branch outcome early

- Stall penalty becomes unacceptable

• Predict outcome of branch

Only stall if prediction is wrong

- In RISC-V pipeline
  - Can predict branches not taken
  - Fetch instruction after branch, with no delay

#### More-Realistic Branch Prediction

- Static branch prediction
  - Based on typical branch behavior
  - Example: loop and if-statement branches
    - Predict backward branches taken
    - Predict forward branches not taken
- Dynamic branch prediction
  - Hardware measures actual branch behavior
    - e.g., record recent history of each branch
  - Assume future behavior will continue the trend
    - When wrong, stall while re-fetching, and update history

# **Pipeline Summary**

#### **The BIG Picture**

- Pipelining improves performance by increasing instruction throughput
  - Executes multiple instructions in parallel
  - Each instruction has the same latency
- Subject to hazards
  - Structure, data, control
- Instruction set design affects complexity of pipeline implementation



## **Pipeline registers**

• Need registers between stages

To hold information produced in previous cycle



# **Pipeline Operation**

- Cycle-by-cycle flow of instructions through the pipelined datapath
  - "Single-clock-cycle" pipeline diagram
    - Shows pipeline usage in a single cycle
    - Highlight resources used
  - c.f. "multi-clock-cycle" diagram
    - Graph of operation over time
- We'll look at "single-clock-cycle" diagrams for load & store

#### IF for Load, Store, ...





#### ID for Load, Store, ...

| ld                 |  |
|--------------------|--|
| Instruction decode |  |



#### EX for Load





#### MEM for Load



## WB for Load



ld

## Corrected Datapath for Load



#### EX for Store





#### **MEM for Store**



#### WB for Store



sd

# Multi-Cycle Pipeline Diagram

#### • Form showing resource usage



# Multi-Cycle Pipeline Diagram

#### • Traditional form

|        |                                               | Time (in             | clock cycle          | es) ——               |                       |                      |                    |                |                |            |
|--------|-----------------------------------------------|----------------------|----------------------|----------------------|-----------------------|----------------------|--------------------|----------------|----------------|------------|
|        |                                               | CC 1                 | CC 2                 | CC 3                 | CC 4                  | CC 5                 | CC 6               | CC 7           | CC 8           | CC 9       |
| e<br>o | rogram<br>xecution<br>rder<br>n instructions) |                      |                      |                      |                       |                      |                    |                |                |            |
|        | ld x10, 40(x1)                                | Instruction<br>fetch | Instruction decode   | Execution            | Data<br>access        | Write-back           |                    |                |                |            |
|        | sub x11, x2, x3                               |                      | Instruction<br>fetch | Instruction decode   | Execution             | Data<br>access       | Write-back         |                |                |            |
|        | add x12, x3, x4                               |                      |                      | Instruction<br>fetch | Instruction<br>decode | Execution            | Data<br>access     | Write-back     |                |            |
|        | ld x13, 48(x1)                                |                      |                      |                      | Instruction<br>fetch  | Instruction decode   | Execution          | Data<br>access | Write-back     |            |
|        | add x14, x5, x6                               |                      |                      |                      |                       | Instruction<br>fetch | Instruction decode | Execution      | Data<br>access | Write-back |
| 1      | 1                                             |                      |                      |                      |                       |                      |                    |                |                |            |

# Single-Cycle Pipeline Diagram

• State of pipeline in a given cycle



# Virtual-Lab Single cycle MIPS

• Try the below link

 <u>http://cse11-</u> <u>iiith.vlabs.ac.in/SingleCycle/v19.swf</u>

## References

 Computer Organization and Design RISC-V Edition, 1st Edition, The Hardware Software Interface by David Patterson John Hennessy -Chapter 4