

# **COMPUTER ARCHITECTURES**

**Out-of-order Execution** 

Gábor Horváth, ghorvath@hit.bme.hu

Budapest, 4/10/25





## **EXAMPLE (REMINDER)**

www.hit.bme.hu

#### C code:

| for | (i=0; i <n; i++)<="" th=""></n;> |
|-----|----------------------------------|
|     | Z[i]=A*X[i];                     |
|     |                                  |



#### Instruction scheduling: (latency of multiplication: 5, integer and memory operations: 1)

| Instructions: | 1. | 2. | 3. | 4.         | 5. | 6.         | 7.         | 8.         | 9.         | 10. | 11. | 12. | 13. | 14. |
|---------------|----|----|----|------------|----|------------|------------|------------|------------|-----|-----|-----|-----|-----|
| D2 ← MEM[R1]  | IF | ID | EX | MEM        | WB |            |            |            |            |     |     |     |     |     |
| D3 ← D2 * D0  |    | IF | ID | D*         | M0 | M1         | M2         | М3         | M4         | MEM | WB  |     |     |     |
| MEM[R2] ← D3  |    |    | IF | <b>S</b> * | ID | <b>D</b> * | <b>D</b> * | <b>D</b> * | <b>D</b> * | EX  | MEM | WB  |     |     |
| R1 ← R1 + 8   |    |    |    |            | IF | <b>S</b> * | <b>S</b> * | <b>S</b> * | <b>S</b> * | ID  | EX  | MEM | WB  |     |
| R2 ← R2 + 8   |    |    |    |            |    |            |            |            |            | IF  | ID  | EX  | MEM | WB  |





Let us determine the optimal order of instructions
 Original:
 Optimized:

| Instructions: | 1. | 2. | 3. | 4.         | 5. | 6. | 7.         | 8.         | 9.         | 10. | 11. | 12. | 13. | 14. |
|---------------|----|----|----|------------|----|----|------------|------------|------------|-----|-----|-----|-----|-----|
| D2 ← MEM[R1]  | IF | ID | EX | MEM        | WB |    |            |            |            |     |     |     |     |     |
| D3 ~ D2 * D0  |    | IF | ID | <b>D</b> * | M0 | M1 | M2         | M3         | M4         | MEM | WB  |     |     |     |
| R1 ← R1 + 8   |    |    | IF | <b>S</b> * | ID | EX | MEM        | WB         |            |     |     |     |     |     |
| MEM[R2] ← D3  |    |    |    |            | IF | ID | <b>D</b> * | <b>D</b> * | <b>D</b> * | EX  | MEM | WB  |     |     |
| R2 ← R2 + 8   |    |    |    |            |    | IF | <b>S</b> * | <b>S</b> * | <b>S</b> * | ID  | EX  | MEM | WB  |     |





- The program become faster due to the optimization
- Problems of this approach:
  - Programmers/compiler have to recognize these possibilities
  - The optimal order depends on the structure of the pipeline
- Ideal solution:
  - The CPU should do it, on-the-fly
  - The CPU should re-order the instructions such that
    - the execution time is lower
    - the semantics of the program remains the same

### $\rightarrow$ Out-of-order execution

- Implementation:
  - Not as complex as it seems
  - In use for a long time
    - First: Scoreboard 1964
    - Advanced: **Tomasulo** 1967
      - Used even in recent CPUs (e.g. Intel Core i7)



### **Out-of-order execution**





- First out-of-order CPU
  - 1964: CDC 6600
- Designed by Seymour Cray
- Features:
  - 16 functional units
  - 10 MHz clock rate
  - >400.000 transistors
  - 5 tons
  - Wired control
- The fastest computer of the word for the next 5 years







Freon-based cooling in each case:

• Dual vector-graphics display:







### **CRAY VS. IBM**

 Thomas Watson (IBM): "Last week, Control Data ... announced the 6600 system. I understand that in the laboratory developing the system there are only 34 people including the janitor. Of these, 14 are engineers and 4 are programmers... Contrasting this modest effort with our vast development activities, I fail to understand why we have lost our industry leadership position by letting someone else offer the world's most powerful computer."





### **CRAY VS. IBM**

## • Seymour Cray (CD)

"It seems like Mr. Watson has answered his own question."





# THE INGREDIENTS OF OUT-OF-ORDER EXECUTION

- www.hit.bme.hu 🛓
  - The CPU has to fetch several instructions to be able to re-order them
    - They must be stored somewhere:

 $\rightarrow$  instruction window (instruction pool, reservation station)

- An idea to reduce the number of dependencies between the instructions
   → register renaming
- A trick to show in-order execution to the outside world
  - $\rightarrow$  re-order buffer



## **THE INGREDIENTS OF OUT-OF-ORDER** EXECUTION

- www.hit.bme.hu
  - The CPU has to fetch several instructions to be able to re-order them
    - They must be stored somewhere:

 $\rightarrow$  instruction window (instruction pool, reservation station)

- A procedure that determines the execution order of the instructions  $\rightarrow$  dynamic scheduler
- An idea to reduce the number of dependencies between the instructions  $\rightarrow$  register renaming
- A trick to show in-order execution to the outside world
  - $\rightarrow$  re-order buffer



- Fetching instructions (IF)  $\rightarrow$  in-order
- Executing instructions  $(EX) \rightarrow out-of-order$ 
  - $\rightarrow$  the instruction window is located in the ID phase





## **INSTRUCTION WINDOW / RESERVATION STATIONS**

- Cuts the pipeline into two parts:
  - · Front-end: instructions enter in an in-order way
  - Back-end: instructions enter in an out-of-order way





## **INSTRUCTION WINDOW / RESERVATION STATIONS**

- ID is now separated to (non-standard terminology!):
  - ID1: Dispatch (DS) decodes instructions and puts them into the instruction window
  - ID2: Issue (IS) collects operands and assigns instructions to functional units





## **THE INGREDIENTS OF OUT-OF-ORDER** EXECUTION

- www.hit.bme.hu
  - The CPU has to fetch several instructions to be able to re-order them
    - They must be stored somewhere:

 $\rightarrow$  instruction window (instruction pool, reservation station)

- A procedure that determines the execution order of the instructions  $\rightarrow$  dynamic scheduler
- An idea to reduce the number of dependencies between the instructions  $\rightarrow$  register renaming
- A trick to show in-order execution to the outside world
  - $\rightarrow$  re-order buffer



#### **THE INGREDIENTS OF OUT-OF-ORDER** EXECUTION SYSTEMS

- www.hit.bme.hu
  - The CPU has to fetch several instructions to be able to re-order them
    - They must be stored somewhere:
      - $\rightarrow$  instruction window (instruction pool, reservation station)
  - A procedure that determines the execution order of the instructions  $\rightarrow$  dynamic scheduler
  - An idea to reduce the number of dependencies between the instructions  $\rightarrow$  register renaming
  - A trick to show in-order execution to the outside world
    - $\rightarrow$  re-order buffer



## DYNAMIC INSTRUCTION SCHEDULING

• The execution order of instructions is determined using

### Data-flow approach

- A precedence graph is created
  - Nodes: instructions
  - Arcs: which other instructions need to be awaited before executing the instruction  $\rightarrow$  dependencies
- An instruction is ready to be executed *if all its dependencies are resolved*. The dependencies can be:
  - RAW: an input parameter is not yet ready
  - WAW: it would write to a register, which is still to be written by an earlier, not yet finished instruction
  - WAR: it would write to a register, which is still to be read by an earlier, not yet finished instruction
- If an instruction is ready to be executed, and there is an appropriate execution unit available, we execute it immediately



## DYNAMIC INSTRUCTION SCHEDULING

- What happens, if more than one instructions are ready for an execution unit? (Contention)
  - Choose the one that was ready earlier
  - Choose the one that prevents the highest number of instruction from being ready for execution
  - etc.
- For coloring the nodes of the precedence graph, we use the following colors to express the status of the instructions
  - Blue: instruction is loaded into the reservation station
  - **Red**: the execution of the instruction is in progress
  - Green: the execution of the instruction has been finished







| i1:                       | D2 ←                  |                        | 1 F F            | 21]                |
|---------------------------|-----------------------|------------------------|------------------|--------------------|
| i2:                       | D3 ₊                  | - D2                   | *                | DØ                 |
| i3:                       | MEM [                 | [R2]                   | ←                | D3                 |
| i4:                       | R1 .                  | - R1                   | +                | 8                  |
| i5:                       | R2 ←                  | - R2                   | +                | 8                  |
|                           |                       |                        |                  |                    |
| i6:                       | D2 ←                  | - MEN                  | 1 [ F            | R1]                |
| i7·                       | 20                    | 50                     | *                |                    |
| L/.                       | <b>D</b> 5 ←          | - UZ                   | ~                | שט                 |
| 17.<br>18:                | MEM[                  | R2]                    | ~<br>←           | D0<br>D3           |
| 17:<br>18:<br>19:         | D3 ←<br>MEM[<br>R1 ←  | - D2<br>[R2]<br>- R1   | ^<br>←<br>+      | D0<br>D3<br>8      |
| i7:<br>i8:<br>i9:<br>i10: | MEM [<br>R1 ←<br>R2 ← | R2]<br>R2]<br>R1<br>R1 | ^<br>←<br>+<br>+ | D0<br>D3<br>8<br>8 |

1-----







| <br>i1:<br>•i2: | D2 ← MEM[R1]<br>D3 ← D2 * D0 |
|-----------------|------------------------------|
| i3:             | MEM[R2] ← D3                 |
| i4:             | R1 ~ R1 + 8                  |
| i5:             | R2 ← R2 + 8                  |
| 8<br>8<br>8     |                              |
| i6:             | $D2 \leftarrow MEM[R1]$      |
| i7:             | $D3 \leftarrow D2 * D0$      |
| i8:             | $MEM[R2] \leftarrow D3$      |
| i9:             | R1 ~ R1 + 8                  |
| i10:            | R2 ~ R2 + 8                  |
|                 |                              |





-----



| i1:<br>i2:<br>i3:<br>i4:<br>i5:  | $\begin{array}{rrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrr$ |
|----------------------------------|------------------------------------------------------|
| i6:<br>i7:<br>i8:<br>i9:<br>i10: | $\begin{array}{rrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrr$ |

DEPARTMENT OF NETWORKED SYSTEMS AND SERVICES





DEPARTMENT OF NETWORKED SYSTEMS AND SERVICES





DEPARTMENT OF NETWORKED SYSTEMS AND SERVICES





DEPARTMENT OF NETWORKED SYSTEMS AND SERVICES www.hit.bme.hu



**D0** 

**D0** 



DEPARTMENT OF NETWORKED SYSTEMS AND SERVICES www.hit.bme.hu



D0

**D0** 







| i2 h. II hap i4      | i1:<br>i2:<br>i3:<br>i4:<br>i5:  | $D2 \leftarrow MEM[R1]$<br>$D3 \leftarrow D2 * D0$<br>$MEM[R2] \leftarrow D3$<br>$R1 \leftarrow R1 + 8$<br>$R2 \leftarrow R2 + 8$                                                                                        |
|----------------------|----------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| i3 WAR RAN 16 WAR 19 | i6:<br>i7:<br>i8:<br>i9:<br>i10: | $\mathbf{D2} \leftarrow \mathbf{MEM[R1]}$ $\mathbf{D3} \leftarrow \mathbf{D2} * \mathbf{D0}$ $\mathbf{MEM[R2]} \leftarrow \mathbf{D3}$ $\mathbf{R1} \leftarrow \mathbf{R1} + 8$ $\mathbf{R2} \leftarrow \mathbf{R2} + 8$ |
| is RAW is            | 110.                             |                                                                                                                                                                                                                          |



- The CPU has to fetch several instructions to be able to re-order them
  - They must be stored somewhere:

 $\rightarrow$  instruction window (instruction pool, reservation station)

- A procedure that determines the execution order of the instructions

   *dynamic scheduler*
- An idea to reduce the number of dependencies between the instructions
   → register renaming
- A trick to show in-order execution to the outside world
  - $\rightarrow$  re-order buffer



- The CPU has to fetch several instructions to be able to re-order them
  - They must be stored somewhere:

 $\rightarrow$  instruction window (instruction pool, reservation station)

• A procedure that determines the execution order of the instructions

 $\rightarrow$  dynamic scheduler

- An idea to reduce the number of dependencies between the instructions
   → register renaming
- A trick to show in-order execution to the outside world
  - $\rightarrow$  re-order buffer



- Goal: to make the precedence graph sparse
- The types of data dependencies (hazards):
  - RAW:
- D3 ← D2 \* D0 MEM[R2] ← D3
- $\rightarrow$  Real dependency
- WAR:
- D3 ~ D2 \* D0
- • •
- $D2 \leftarrow MEM[R1]$
- WAW:

**D3** ~ **D2** \* **D0** 

- **D3** ← MEM[R1]
- Surprise: the WAR and the WAW dependencies can be eliminated → Anti-dependencies







hit.bme.hu

- Why would a programmer write such a code?
  - WAR:
- D3 ~ D2 \* D0
- $D2 \leftarrow MEM[R1]$
- WAW:
- **D3** ~ **D2** \* **D0**
- **D3** ~ MEM[R1]
- The registers are re-used because we have too few of them!
- Let us put a large number of registers to the CPU → Physical registers
- Let us hide them. The programmer still works with  $\rightarrow$  Logical/architectural registers (described in the ISA)
- The CPU re-writes the program on the fly to make use of the physical registers
  - → Register renaming



#### 

- Entry i.: the ID of the physical register assigned to logical register i
- Procedure:
  - The operands are replaced by physical registers
  - The result is put to a brand new (unused) physical register

. .

• The register alias table is updated

Register alias table:

| Original:                      | After renaming:          | Logical reg. | Physical |
|--------------------------------|--------------------------|--------------|----------|
| i1: <b>D2</b> ← <b>MEM[R1]</b> | i1: <b>U25</b> ← MEM[T3] |              | reg.     |
|                                |                          |              | T21      |
|                                |                          | R1           | Т3       |
|                                |                          | R2           | T46      |
|                                |                          | R3           | Т8       |
|                                |                          | D0           | U9       |
|                                |                          | D1           | U24      |
|                                |                          | D2           | U17      |
|                                |                          | D3           | U4       |



#### 

- Entry i.: the ID of the physical register assigned to logical register i
- Procedure:
  - The operands are replaced by physical registers
  - The result is put to a brand new (unused) physical register
  - The register alias table is updated

Register alias table:

| Original:                      | After renaming:                 | Logical reg. | Physical |
|--------------------------------|---------------------------------|--------------|----------|
| i1: <b>D2</b> ← <b>MEM[R1]</b> | i1: <b>U25</b> ← <b>MEM[T3]</b> |              | reg.     |
|                                |                                 | R0           | T21      |
|                                |                                 | R1           | Т3       |
|                                |                                 | R2           | T46      |
|                                |                                 | R3           | Т8       |
|                                |                                 | D0           | U9       |
|                                |                                 | D1           | U24      |
|                                |                                 | D2           | U25      |
|                                |                                 | D3           | U4       |



#### 

i1: D2 ← MEM[R1]

i2: **D3** ← **D2** \* **D0** 

- Entry i.: the ID of the physical register assigned to logical register i
- Procedure:

Original:

- The operands are replaced by physical registers
- The result is put to a brand new (unused) physical register ٠

¦i1: U25 ←

i2: **U26** ←

The register alias table is updated ٠

Register alias table:

| After renaming:                         | l ogical reg | Physical |
|-----------------------------------------|--------------|----------|
| i1: U25 ← MEM[T3]                       | Logical rog. | reg.     |
| i2: <b>U26</b> ← <b>U25</b> * <b>U9</b> | R0           | T21      |
|                                         | R1           | Т3       |
|                                         | R2           | T46      |
|                                         | R3           | Т8       |
|                                         | D0           | U9       |
|                                         | D1           | U24      |
|                                         | D2           | U25      |
|                                         | D3           | U4       |



#### 

i1: D2 ← MEM[R1]

i2: **D3** ← **D2** \* **D0** 

Entry i.: the ID of the physical register assigned to logical register i

After renaming:

Procedure:

Original:

- The operands are replaced by physical registers
- The result is put to a brand new (unused) physical register ٠
- The register alias table is updated ٠

Register alias table:

| After renaming:                  | Logical reg. | Physical |
|----------------------------------|--------------|----------|
| i1: U25 ← MEM[T3]                |              | reg.     |
| i2: <b>U26</b> ← <b>U25 * U9</b> | R0           | T21      |
|                                  | R1           | Т3       |
|                                  | R2           | T46      |
|                                  | R3           | Т8       |
|                                  | D0           | U9       |
|                                  | D1           | U24      |
|                                  | D2           | U25      |
| []                               | D3           | U26      |



#### 

i1: D2 ← MEM[R1];

i2: D3 ← D2 \* D0

MEM[R2] ← D3

Entry i.: the ID of the physical register assigned to logical register i

After renaming:

Procedure:

Original:

i3:

- The operands are replaced by physical registers
- The result is put to a brand new (unused) physical register ٠

The register alias table is updated

Register alias table:

| After renaming:                         | Logical reg. | Physical    |
|-----------------------------------------|--------------|-------------|
| i1: U25 ← MEM[T3]<br>i2: U26 ← U25 * U9 | R0           | reg.<br>T21 |
| i3: <b>MEM[T46]</b> ← <b>U26</b>        | R1           | Т3          |
|                                         | R2           | T46         |
|                                         | R3           | Т8          |
|                                         | D0           | U9          |
|                                         | D1           | U24         |
|                                         | D2           | U25         |
|                                         | D3           | U26         |


### 

i1: D2 ← MEM[R1]

D3 ← D2 \* D0

 $MEM[R2] \leftarrow D3$ 

 $R1 \leftarrow R1 + 8$ 

- Entry i.: the ID of the physical register assigned to logical register i
- Procedure:

Original:

i2:

i3:

i4:

- The operands are replaced by physical registers
- The result is put to a brand new (unused) physical register ٠

: i1:

i2:

i3:

¦i4:

The register alias table is updated ٠

Register alias table:

| After renaming:                               |              |               |
|-----------------------------------------------|--------------|---------------|
| i1 105 MEMIT31                                | Logical reg. | Physical reg. |
| i2: U26 ← U25 * U9                            | R0           | T21           |
| i3: MEM[T46] ← U26<br>i4: <b>T47 ← T3 + 8</b> | R1           | Т3            |
|                                               | R2           | T46           |
|                                               | R3           | Т8            |
|                                               | D0           | U9            |
|                                               | D1           | U24           |
|                                               | D2           | U25           |
|                                               | D3           | U26           |



### 

i1:  $D2 \leftarrow MEM[R1]$ 

D3 ~ D2 \* D0

MEM[R2] ← D3 R1 ← R1 + 8

- Entry i.: the ID of the physical register assigned to logical register i
- Procedure:

Original:

i2: i3:

i4:

- The operands are replaced by physical registers
- The result is put to a brand new (unused) physical register
- The register alias table is updated

Register alias table:

| Δft | er renamina.                |              |                 |
|-----|-----------------------------|--------------|-----------------|
| i1  | • 1125 / MEMIT31            | Logical reg. | Physica<br>reg. |
| i2  | : $U26 \leftarrow U25 * U9$ | R0           | T21             |
| i3  | : MEM[T46] ← U26            | <b>R1</b>    | T47             |
| 14  | . 14 <i>7</i> ← 15 + ŏ      | R2           | T46             |
|     |                             | R3           | Т8              |
|     |                             | D0           | U9              |
|     |                             | D1           | U24             |
|     |                             | D2           | U25             |
|     |                             | D3           | U26             |



### 

Entry i.: the ID of the physical register assigned to logical register i

After renaming:

Procedure:

Original:

i1:

i2:

i3:

- The operands are replaced by physical registers
- The result is put to a brand new (unused) physical register ٠

The register alias table is updated

#### Register alias table:

|                                         | Logical reg. | Physical    |
|-----------------------------------------|--------------|-------------|
| i1: U25 ← MEM[T3]<br>i2: U26 ← U25 * U9 | R0           | reg.<br>T21 |
| i3: MEM[T46] ← U26                      | R1           | T47         |
| i5: T48 $\leftarrow$ T46 + 8            | R2           | T46         |
|                                         | R3           | Т8          |
|                                         | D0           | U9          |
|                                         | D1           | U24         |
|                                         | D2           | U25         |
|                                         | D3           | U26         |

#### © Department of Networked Systems and Services

 $D2 \leftarrow MEM[R1]$ 

 $D3 \leftarrow D2 * D0$ 

 $MEM[R2] \leftarrow D3$ 

i4: R1 ← R1 + 8 ↓ i5: **R2** ← **R2 + 8** 



### 

 $D2 \leftarrow MEM[R1]$ 

 $D3 \leftarrow D2 * D0$ 

 $MEM[R2] \leftarrow D3$ 

R1 ← R1 + 8 ¦

i5: **R2** ← **R2** + 8

- Entry i.: the ID of the physical register assigned to logical register i
- Procedure:

Original:

i1:

i2:

i3:

i4:

- The operands are replaced by physical registers
- The result is put to a brand new (unused) physical register ٠

The register alias table is updated

Register alias table:

| After renaming:                                               | Logical reg. | Physical    |
|---------------------------------------------------------------|--------------|-------------|
| i1: $U25 \leftarrow MEM[T3]$<br>i2: $U26 \leftarrow U25 * U9$ | R0           | reg.<br>T21 |
| i4: T47 $\leftarrow$ T3 + 8                                   | R1           | T47         |
| i5: <b>T48 ← T46 + 8</b>                                      | R2           | <b>T48</b>  |
|                                                               | R3           | Т8          |
|                                                               | D0           | U9          |
|                                                               | D1           | U24         |
|                                                               | D2           | U25         |
|                                                               | D3           | U26         |



### 

- Entry i.: the ID of the physical register assigned to logical register i
- Procedure:
  - The operands are replaced by physical registers
  - The result is put to a brand new (unused) physical register
  - The register alias table is updated

#### Register alias table:

| Logical reg. | Physical reg. |
|--------------|---------------|
| R0           | T21           |
| R1           | T47           |
| R2           | T48           |
| R3           | Т8            |
| D0           | U9            |
| D1           | U24           |
| D2           | U25           |
| D3           | U26           |

| Original:                                                                                                     | After renaming:                                                                                                       |
|---------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------|
| <pre>i1: D2 ← MEM[R1]<br/>i2: D3 ← D2 * D0<br/>i3: MEM[R2] ← D3<br/>i4: R1 ← R1 + 8<br/>i5: R2 ← R2 + 8</pre> | <pre>i1: U25 ← MEM[T3]<br/>i2: U26 ← U25 * U9<br/>i3: MEM[T46] ← U26<br/>i4: T47 ← T3 + 8<br/>i5: T48 ← T46 + 8</pre> |
| i6: <b>D2</b> ← <b>MEM[R1]</b>                                                                                | i6: <b>U27</b> ← <b>MEM[T47]</b>                                                                                      |
|                                                                                                               |                                                                                                                       |



### 

i1: D2 ← MEM[R1]

i5: R2 ← R2 + 8

D3 ← D2 \* D0

 $MEM[R2] \leftarrow D3$ 

R1 ← R1 + 8 ¦

D2 ← MEM[R1]

Entry i.: the ID of the physical register assigned to logical register i

After renaming:

i1: U25 ← MEM[T3]

i4: T47 ← T3 + 8

i5: T48 ← T46 + 8

U26 ← U25 \* U9

MEM[T46] ← U26

**U27** ← MEM[T47]

• Procedure:

Original:

i2:

i3:

i4:

i6:

- The operands are replaced by physical registers
- The result is put to a brand new (unused) physical register

i2:

i3:

i6:

• The register alias table is updated

#### Register alias table:

| Logical reg. | Physical reg. |
|--------------|---------------|
| R0           | T21           |
| R1           | T47           |
| R2           | T48           |
| R3           | Т8            |
| D0           | U9            |
| D1           | U24           |
| D2           | U27           |
| D3           | U26           |

#### 42



### 

- Entry i.: the ID of the physical register assigned to logical register i
- Procedure:
  - The operands are replaced by physical registers
  - The result is put to a brand new (unused) physical register
  - The register alias table is updated

| Register | alias | table: |
|----------|-------|--------|
|----------|-------|--------|

| Logical reg. | Physical reg. |
|--------------|---------------|
| R0           | T21           |
| R1           | T47           |
| R2           | T48           |
| R3           | Т8            |
| D0           | U9            |
| D1           | U24           |
| D2           | U27           |
| D3           | U26           |

| $\sim$ |    |          |   |              |  |
|--------|----|----------|---|--------------|--|
| 7 1    | rı | $\sim$   | n | $\mathbf{n}$ |  |
| J      |    | ( ]      |   |              |  |
| $\sim$ |    | м.       |   |              |  |
|        |    | <b>U</b> |   |              |  |

After renaming:

| <pre>i1: D2 ← MEM[R1]</pre> | i1: U25 ← MEM[T3]         |
|-----------------------------|---------------------------|
| i2: D3 ← D2 * D0            | i2: U26 ← U25 * U9        |
| i3: MEM[R2] ← D3            | i3: MEM[T46] ← U26        |
| i4: R1 ← R1 + 8             | i4: T47 ← T3 + 8          |
| i5: R2 ← R2 + 8             | i5: T48 ← T46 + 8         |
| i6: D2 ← MEM[R1]            | i6: U27 ← MEM[T47]        |
| i7: <b>D3 ← D2 * D0</b>     | i7: <b>U28 ← U27 * U9</b> |



### 

- Entry i.: the ID of the physical register assigned to logical register i
- Procedure:
  - The operands are replaced by physical registers
  - The result is put to a brand new (unused) physical register
  - The register alias table is updated

| Register | alias | table: |
|----------|-------|--------|
|----------|-------|--------|

| Logical reg. | Physical reg. |
|--------------|---------------|
| R0           | T21           |
| R1           | T47           |
| R2           | T48           |
| R3           | Т8            |
| D0           | U9            |
| D1           | U24           |
| D2           | U27           |
| D3           | U28           |

# Original:

After renaming:

| i1:<br>i2:<br>i3:<br>i4:<br>i5: | $\begin{array}{rrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrr$ | <pre>i1: U25 ← MEM[T3]<br/>i2: U26 ← U25 * U9<br/>i3: MEM[T46] ← U26<br/>i4: T47 ← T3 + 8<br/>i5: T48 ← T46 + 8</pre> |
|---------------------------------|------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------|
| i6:                             | D2 ← MEM[R1]                                         | i6: U27 ← MEM[T47]                                                                                                    |
| i7:                             | D3 ← D2 * D0                                         | i7: <b>U28 ← U27 * U9</b>                                                                                             |



### 

i1: D2 ← MEM[R1]

i2: D3 ← D2 \* D0

i7: D3 ← D2 \* D0

MEM[R2] ← D3

MEM[R2] ← D3:

Entry i.: the ID of the physical register assigned to logical register i

After renaming:

i1: U25 ← MEM[T3]

i2: U26 ← U25 \* U9

MEM[T46] ← U26

U28 ← U27 \* U9

MEM[T48] ← U28

• Procedure:

Original:

i3:

i4:

i8:

• The operands are replaced by physical registers

i3:

i7:

:i8:

 $R1 \leftarrow R1 + 8$  | i4: T47  $\leftarrow$  T3 + 8

i5:  $R2 \leftarrow R2 + 8$  i5: T48  $\leftarrow$  T46 + 8

i6: D2  $\leftarrow$  MEM[R1] i6: U27  $\leftarrow$  MEM[T47]

- The result is put to a brand new (unused) physical register
- The register alias table is updated

| Register | alias | table: |
|----------|-------|--------|
|----------|-------|--------|

| Logical reg. | Physical reg. |
|--------------|---------------|
| R0           | T21           |
| R1           | T47           |
| R2           | T48           |
| R3           | Т8            |
| D0           | U9            |
| D1           | U24           |
| D2           | U27           |
| D3           | U28           |



### 

- Entry i.: the ID of the physical register assigned to logical register i
- Procedure:
  - The operands are replaced by physical registers
  - The result is put to a brand new (unused) physical register
  - The register alias table is updated

| Register | alias | table: |
|----------|-------|--------|
|----------|-------|--------|

| Logical reg. | Physical reg. |
|--------------|---------------|
| R0           | T21           |
| R1           | T47           |
| R2           | T48           |
| R3           | Т8            |
| D0           | U9            |
| D1           | U24           |
| D2           | U27           |
| D3           | U28           |

| Origiı                          | nal:                                                               | After                           | renaming:                                                                  |
|---------------------------------|--------------------------------------------------------------------|---------------------------------|----------------------------------------------------------------------------|
| i1:<br>i2:<br>i3:<br>i4:<br>i5: | $\begin{array}{rrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrr$               | i1:<br>i2:<br>i3:<br>i4:<br>i5: | $\begin{array}{rrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrr$                       |
| i6:<br>i7:<br>i8:<br>i9:        | D2 ← MEM[R1]<br>D3 ← D2 * D0<br>MEM[R2] ← D3<br><b>R1 ← R1 + 8</b> | i6:<br>i7:<br>i8:<br>i9:        | U27 ← MEM[T47]<br>U28 ← U27 * U9<br>MEM[T48] ← U28<br><b>T49 ← T47 + 8</b> |



### 

- Entry i.: the ID of the physical register assigned to logical register i
- Procedure:
  - The operands are replaced by physical registers
  - The result is put to a brand new (unused) physical register
  - The register alias table is updated

| Register | r alias | table: |
|----------|---------|--------|
|----------|---------|--------|

| Logical reg. | Physical reg. |
|--------------|---------------|
| R0           | T21           |
| R1           | <b>T49</b>    |
| R2           | T48           |
| R3           | Т8            |
| D0           | U9            |
| D1           | U24           |
| D2           | U27           |
| D3           | U28           |

| Origir | nal:                                                                                                                                                                                 | After | renaming:                 |
|--------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------|---------------------------|
| i1:    | $\begin{array}{rcrr} D2 &\leftarrow & MEM[R1] \\ D3 &\leftarrow & D2 & * & D0 \\ MEM[R2] &\leftarrow & D3 \\ R1 &\leftarrow & R1 & + & 8 \\ R2 &\leftarrow & R2 & + & 8 \end{array}$ | i1:   | $U25 \leftarrow MEM[T3]$  |
| i2:    |                                                                                                                                                                                      | i2:   | $U26 \leftarrow U25 * U9$ |
| i3:    |                                                                                                                                                                                      | i3:   | $MEM[T46] \leftarrow U26$ |
| i4:    |                                                                                                                                                                                      | i4:   | $T47 \leftarrow T3 + 8$   |
| i5:    |                                                                                                                                                                                      | i5:   | $T48 \leftarrow T46 + 8$  |
| i6:    | D2 ← MEM[R1]                                                                                                                                                                         | i6:   | U27 ← MEM[T47]            |
| i7:    | D3 ← D2 * D0                                                                                                                                                                         | i7:   | U28 ← U27 * U9            |
| i8:    | MEM[R2] ← D3                                                                                                                                                                         | i8:   | MEM[T48] ← U28            |
| i9:    | <b>R1 ← R1 + 8</b>                                                                                                                                                                   | i9:   | <b>T49 ← T47 + 8</b>      |



### 

- Entry i.: the ID of the physical register assigned to logical register i
- Procedure:
  - The operands are replaced by physical registers
  - The result is put to a brand new (unused) physical register
  - The register alias table is updated

| Register | r alias | table: |
|----------|---------|--------|
|----------|---------|--------|

| Logical reg. | Physical reg. |
|--------------|---------------|
| R0           | T21           |
| R1           | T49           |
| R2           | T48           |
| R3           | Т8            |
| D0           | U9            |
| D1           | U24           |
| D2           | U27           |
| D3           | U28           |

| Origir                           | nal:                                                 | After renaming:                                                                                                          |  |  |
|----------------------------------|------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------|--|--|
| i1:<br>i2:<br>i3:<br>i4:<br>i5:  | $\begin{array}{rrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrr$ | <pre>i1: U25 ← MEM[T3]<br/>i2: U26 ← U25 * U9<br/>i3: MEM[T46] ← U26<br/>i4: T47 ← T3 + 8<br/>i5: T48 ← T46 + 8</pre>    |  |  |
| i6:<br>i7:<br>i8:<br>i9:<br>i10: | $\begin{array}{rrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrr$ | <pre>i6: U27 ← MEM[T47]<br/>i7: U28 ← U27 * U9<br/>i8: MEM[T48] ← U28<br/>i9: T49 ← T47 + 8<br/>i10: T50 ← T48 + 8</pre> |  |  |



### 

- Entry i.: the ID of the physical register assigned to logical register i
- Procedure:
  - The operands are replaced by physical registers
  - The result is put to a brand new (unused) physical register
  - The register alias table is updated

| Register | r alias | table: |
|----------|---------|--------|
|----------|---------|--------|

| Logical reg. | Physical reg. |
|--------------|---------------|
| R0           | T21           |
| R1           | T49           |
| R2           | <b>T50</b>    |
| R3           | Т8            |
| D0           | U9            |
| D1           | U24           |
| D2           | U27           |
| D3           | U28           |

| Original:                      | After renaming:               |
|--------------------------------|-------------------------------|
| <pre>i1: D2 ← MEM[R1]</pre>    | <pre>i1: U25 ← MEM[T3]</pre>  |
| i2: D3 ← D2 * D0               | i2: U26 ← U25 * U9            |
| i3: MEM[R2] ← D3               | i3: MEM[T46] ← U26            |
| i4: R1 ← R1 + 8                | i4: T47 ← T3 + 8              |
| i5: R2 ← R2 + 8                | i5: T48 ← T46 + 8             |
| i6: D2 ← MEM[R1]               | <pre>i6: U27 ← MEM[T47]</pre> |
| i7: D3 ← D2 * D0               | i7: U28 ← U27 * U9            |
| i8: MEM[R2] ← D3               | i8: MEM[T48] ← U28            |
| i9: R1 ← R1 + 8                | i9: T49 ← T47 + 8             |
| i10: <b>R2</b> ← <b>R2 + 8</b> | i10: T50 ← T48 + 8            |



- Result:
  - All WAW and WAR dependencies disappeared
  - ...since we always store the result in a "clean", unused register
- The precedence graph:

Before register renaming:



After register renaming:









| <br>i1:<br>i2:<br>i3:<br>i4:<br>i5: | $\begin{array}{rrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrr$                                             |
|-------------------------------------|--------------------------------------------------------------------------------------------------|
| i6:<br>i7:<br>i8:<br>i9:<br>i10:    | $\begin{array}{rrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrr$                                             |
|                                     | → i1: D2 ← MEM[R1]<br>i2: D3 ← D2 * D0<br>i3: MEM[R2] ← D3<br>i4: R1 ← R1 + 8<br>i5: R2 ← R2 + 8 |
|                                     | i6: D2 ← MEM[R1]<br>i7: D3 ← D2 * D0<br>i8: MEM[R2] ← D3<br>i9: R1 ← R1 + 8<br>i10: R2 ← R2 + 8  |

(i1







www.hit.bme.hu \_

DEPARTMENT OF NETWORKED SYSTEMS

AND SERVICES



























- Conclusion:
  - Effective!
  - The larger the instruction window is, the more efficient the register renaming is.



- The CPU has to fetch several instructions to be able to re-order them
  - They must be stored somewhere:

 $\rightarrow$  instruction window (instruction pool, reservation station)

• A procedure that determines the execution order of the instructions

 $\rightarrow$  dynamic scheduler

- An idea to reduce the number of dependencies between the instructions
  → register renaming
- A trick to show in-order execution to the outside world
  - $\rightarrow$  re-order buffer



- The CPU has to fetch several instructions to be able to re-order them
  - They must be stored somewhere:

→ instruction window (instruction pool, reservation station)

A procedure that determines the execution order of the instructions

 $\rightarrow$  dynamic scheduler

- An idea to reduce the number of dependencies between the instructions
  → register renaming
- A trick to show in-order execution to the outside world

 $\rightarrow$  re-order buffer





- Unwanted side effect of out-of-order execution:
  - Registers/memory does not change in the order as given by the program
  - The memory is shared
    - Other processors are using it,
    - I/O devices are using it,
    - etc.
  - Not everybody is prepared for this behavior!
- Solution: re-order buffer (ROB)
  - An entry is allocated to each instruction when entering the CPU
  - After executing the instruction, the ROB stores
    - The result of the instruction
    - The destination of the result (register/memory location)
  - The results are written to the destination only when all preceding instructions are ready
    - Complete: execution has been finished
    - Retire: the result is written to its destination



















|  | Instruction:   | Ready?       | Result:         | Where?   |
|--|----------------|--------------|-----------------|----------|
|  | U25 ← MEM[T3]  | $\checkmark$ | <value></value> | U25      |
|  | U26 ~ U25 * U9 |              |                 | U26      |
|  | MEM[T46] ~ U26 |              |                 | MEM[T46] |





|  | Instruction:              | Ready?       | Result:         | Where?   |
|--|---------------------------|--------------|-----------------|----------|
|  | U25 ← MEM[T3]             | $\checkmark$ | <value></value> | U25      |
|  | U26 <sub>~</sub> U25 * U9 |              |                 | U26      |
|  | MEM[T46] ← U26            |              |                 | MEM[T46] |
|  | T47 ← T3 + 8              |              |                 | T47      |
|  |                           |              |                 |          |





|   | Instruction:   | Ready?       | Result:         | Where?   |
|---|----------------|--------------|-----------------|----------|
|   | U25 ← MEM[T3]  | $\checkmark$ | <value></value> | U25      |
|   | U26 ← U25 * U9 |              |                 | U26      |
|   | MEM[T46] ← U26 |              |                 | MEM[T46] |
|   | T47 ← T3 + 8   |              |                 | T47      |
| _ | T48 ← T46 + 8  |              |                 | T48      |





|   | Instruction:                           | Ready?       | Result:         | Where?   |
|---|----------------------------------------|--------------|-----------------|----------|
|   | U25 ← MEM[T3]                          | $\checkmark$ | <value></value> | U25      |
|   | U26 ← U25 * U9                         |              |                 | U26      |
|   | MEM[T46] ~ U26                         |              |                 | MEM[T46] |
|   | T47 ~ T3 + 8                           | $\checkmark$ | <value></value> | T47      |
|   | <b>T48</b> ~ <b>T46</b> + 8            |              |                 | T48      |
|   | <b>U27</b> ← <b>MEM</b> [ <b>T</b> 47] |              |                 | U27      |
| - |                                        |              |                 |          |





|   | Instruction:                 | Ready?       | Result:         | Where?   |
|---|------------------------------|--------------|-----------------|----------|
|   | U25 ← MEM[T3]                | $\checkmark$ | <value></value> | U25      |
|   | U26 ← U25 * U9               |              |                 | U26      |
|   | MEM[T46] ~ U26               |              |                 | MEM[T46] |
|   | T47 ← T3 + 8                 | $\checkmark$ | <value></value> | T47      |
|   | T48 ← T46 + 8                | $\checkmark$ | <value></value> | T48      |
|   | <b>U27</b> ← <b>MEM[T47]</b> |              |                 | U27      |
| _ | U28 ~ U27 * U9               |              |                 | U28      |





|             | Instruction:                | Ready?       | Result:         | Where?   |
|-------------|-----------------------------|--------------|-----------------|----------|
|             | U25 ← MEM[T3]               | $\checkmark$ | <value></value> | U25      |
|             | U26 ← U25 * U9              | $\checkmark$ | <value></value> | U26      |
|             | MEM[T46] ← U26              |              |                 | MEM[T46] |
|             | T47 ~ T3 + 8                | $\checkmark$ | <value></value> | T47      |
|             | <b>T48</b> ← <b>T46</b> + 8 | $\checkmark$ | <value></value> | T48      |
|             | <b>U27</b> ← MEM[T47]       | $\checkmark$ | <value></value> | U27      |
|             | U28 ~ U27 * U9              |              |                 | U28      |
| <b>&gt;</b> | MEM[T48] ← U28              |              |                 | MEM[T48] |
|             |                             |              |                 |          |





| Instruction:                | Ready?       | Result:         | Where?   |
|-----------------------------|--------------|-----------------|----------|
| U25 ← MEM[T3]               | $\checkmark$ | <value></value> | U25      |
| U26 ~ U25 * U9              | $\checkmark$ | <value></value> | U26      |
| MEM[T46] ← U26              | $\checkmark$ | <value></value> | MEM[T46] |
| T47 ~ T3 + 8                | $\checkmark$ | <value></value> | T47      |
| <b>T48</b> ~ <b>T46</b> + 8 | $\checkmark$ | <value></value> | T48      |
| <b>U27</b> ← MEM[T47]       | $\checkmark$ | <value></value> | U27      |
| U28 ~ U27 * U9              |              |                 | U28      |
| MEM[T48] ← U28              |              |                 | MEM[T48] |
| <b>T49 ← T47 + 8</b>        |              |                 | T49      |



### OUT-OF-ORDER INSTRUCTION EXECUTION

- Covered topics:
  - Principle: dependency analysis
  - Can be more efficient if register renaming is used
  - We can imitate in-order execution with a re-order buffer
- Implementations:
  - Scoreboard (1964): no register renaming, no ROB
  - Tomasulo (IBM, 1967): has register renaming, but no ROB
  - Since Intel P6 (including Core i7): both register renaming and ROB
  - In mobile devices: since ARM Cortex A9
