Instruction Level Parallelism

Instruction Level Parallelism (ILP) refers to the ability of a processor to execute multiple instructions simultaneously during a single clock cycle. This parallelism is inherent in the sequence of instructions in a program. ILP aims to utilize this potential to improve the performance of processors by overlapping the execution of instructions. The primary goal of ILP is to reduce the number of clock cycles per instruction (CPI), thereby increasing the overall throughput of the system.

ILP is crucial in modern computing because it addresses the growing demand for faster and more efficient processors. As clock speeds have hit practical limits due to power consumption and heat dissipation issues, exploiting parallelism within instruction streams has become a key strategy for enhancing processor performance.

Challenges in Exploiting ILP

Exploiting ILP is not straightforward due to several challenges:

  1. Data Hazards: These occur when instructions depend on the results of previous instructions. There are three types of data hazards, illustrated in the sketch after this list:

    • Read After Write (RAW): A subsequent instruction tries to read a source before a previous instruction has written to it.

    • Write After Read (WAR): A subsequent instruction tries to write a destination before a previous instruction has read it.

    • Write After Write (WAW): A subsequent instruction writes to a destination before a previous instruction has written to it, leaving the destination with the wrong final value.

  2. Control Hazards: These arise from the need to make decisions based on the results of executed instructions, such as branches and jumps, which can disrupt the instruction flow.

  3. Resource Hazards: These occur when multiple instructions compete for the same resources, such as functional units or memory.

  4. Complexity in Scheduling: Efficiently scheduling instructions to maximize parallelism without violating dependencies and hazards is a complex task.
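
These hazards are easiest to see at the source level. A minimal sketch, with ordinary C variables standing in for registers (all names are illustrative):

void hazard_demo(void) {
    int b = 1, c = 2, e = 3, f = 4, g = 5, h = 6, k = 7;
    int a, d;
    a = b + c;   /* (1) writes a                                       */
    d = a + e;   /* (2) RAW on a: must read a after (1) writes it      */
    e = f + g;   /* (3) WAR on e: must not write e before (2) reads it */
    a = h + k;   /* (4) WAW on a: must not complete before (1)'s write */
    (void)a; (void)d; (void)e;
}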

Overview of Pipelining and Its Limitations

Pipelining is a fundamental technique used to increase the instruction throughput of a processor. It divides the instruction execution process into several stages, with each stage handling a different part of the instruction cycle (e.g., fetch, decode, execute, memory access, and write-back). By overlapping these stages, a new instruction can begin execution before the previous one has completed, much like an assembly line in a factory.

In an ideal scenario, pipelining can make the average CPI approach 1, meaning one instruction is completed per clock cycle. A single-issue pipeline can never push CPI below 1, since at most one instruction completes per cycle, and in practice several effects keep CPI above even that ideal:

  1. Pipeline Stalls: Hazards and dependencies can cause pipeline stalls, where certain stages must wait for previous stages to complete, reducing efficiency.

  2. Branch Delays: Control hazards from branches can cause delays, as the pipeline must wait for the branch decision to be resolved.

  3. Resource Conflicts: Limited resources can lead to conflicts and delays if multiple instructions require the same resources simultaneously.

While pipelining improves performance significantly, it cannot by itself push CPI below 1. This is where ILP comes into play. Out-of-order execution and register renaming keep more instructions in flight, and multiple-issue architectures let processors complete several instructions per clock cycle, pushing CPI below 1. For instance, a superscalar processor can issue multiple instructions in a single clock cycle, and a Very Long Instruction Word (VLIW) processor executes multiple operations encoded in a single long instruction.
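
A quick back-of-envelope check of the CPI limit above, as a sketch (the 5-stage depth and instruction count are assumptions for illustration):

#include <stdio.h>

/* Ideal cycle count of a k-stage, single-issue pipeline over n
   instructions: k cycles to fill, then one completion per cycle. */
double ideal_cpi(int k, long n) {
    return (double)(k + n - 1) / (double)n;
}

int main(void) {
    printf("%f\n", ideal_cpi(5, 1000));  /* ~1.004: approaches 1, never below */
    return 0;
}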

Basic Compiler Techniques for Exposing ILP

To fully exploit Instruction Level Parallelism (ILP), compilers play a crucial role in transforming and optimizing code. This section delves into two fundamental techniques: Loop Unrolling and Scheduling (both Static and Dynamic). These techniques help expose ILP, allowing the processor to execute multiple instructions simultaneously.

Loop Unrolling

Loop unrolling is a compiler optimization technique that increases ILP by reducing the overhead of loop control instructions and increasing the number of instructions available for parallel execution. It involves replicating the loop body multiple times within a single iteration and adjusting the loop control code accordingly.

Example: Original Loop

for (int i = 0; i < 8; i++) {
    a[i] = b[i] + c[i];
}

Unrolled Loop (factor of 4)

for (int i = 0; i < 8; i += 4) {
    a[i] = b[i] + c[i];
    a[i+1] = b[i+1] + c[i+1];
    a[i+2] = b[i+2] + c[i+2];
    a[i+3] = b[i+3] + c[i+3];
}

In the unrolled loop, four iterations of the original loop are combined into one, reducing the number of loop control instructions (like incrementing i and checking the loop condition) and increasing the number of arithmetic operations available for parallel execution.
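
One practical detail: when the trip count is not a multiple of the unroll factor, the compiler must emit a cleanup loop for the leftover iterations. A sketch of what that looks like (the function name and signature are illustrative, not part of the original example):

void add_arrays(int *a, const int *b, const int *c, int n) {
    int i;
    /* unrolled main loop: four iterations per pass */
    for (i = 0; i + 3 < n; i += 4) {
        a[i]   = b[i]   + c[i];
        a[i+1] = b[i+1] + c[i+1];
        a[i+2] = b[i+2] + c[i+2];
        a[i+3] = b[i+3] + c[i+3];
    }
    /* cleanup loop: the remaining 0-3 iterations */
    for (; i < n; i++) {
        a[i] = b[i] + c[i];
    }
}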

Scheduling Techniques

Scheduling techniques are crucial for arranging instructions in a way that maximizes parallel execution while avoiding hazards. These techniques can be categorized into Static Scheduling and Dynamic Scheduling.

Static Scheduling

Static Scheduling is performed by the compiler at compile-time. The compiler rearranges the instructions to minimize stalls and maximize parallelism, assuming a fixed hardware configuration and predictable execution paths.

Example: Original Sequence

LD  R1, 0(R2)
LD  R3, 4(R2)
ADD R4, R1, R3
SD  R4, 8(R2)

Statically Scheduled Sequence

LD  R1, 0(R2)
LD  R3, 4(R2)
NOP ;Inserting a NOP to avoid RAW hazard
ADD R4, R1, R3
SD  R4, 8(R2)

In this example, a NOP (No Operation) instruction is inserted after the second LD to avoid the load-use RAW (Read After Write) hazard between that LD and the ADD, ensuring the data dependency is respected. A NOP simply wastes the slot, however; when an independent instruction is available, the compiler prefers to move it into the slot instead, as the sketch below shows.
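
A source-level sketch of that idea, with C statements standing in for instructions (names and signatures are illustrative):

int unscheduled(const int *b, const int *c, int i, int k, int *t2) {
    int t1 = b[i];
    int s  = t1 + k;   /* uses t1 right after the load: load-use (RAW) stall */
    *t2    = c[i];
    return s;
}

int scheduled(const int *b, const int *c, int i, int k, int *t2) {
    int t1 = b[i];
    *t2    = c[i];     /* independent load moved into the delay slot */
    int s  = t1 + k;   /* t1 is ready by the time it is needed */
    return s;
}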

Dynamic Scheduling

Dynamic Scheduling is performed at runtime by the processor hardware. It allows the processor to handle unpredictable execution paths and dynamic dependencies. Additional hardware, such as reservation stations and reorder buffers, lets the processor schedule instructions dynamically and execute them out of order while still committing results in correct program order.

Overcoming Data Hazards with Dynamic Scheduling

Data hazards occur when instructions that exhibit data dependencies are executed in parallel or out of order. These hazards can disrupt the normal flow of instructions in a pipeline and must be managed to ensure correct program execution. There are three primary types of data hazards:

  1. RAW (Read After Write): Occurs when an instruction needs to read a location that a previous instruction is writing to.

  2. WAR (Write After Read): Occurs when an instruction needs to write to a location that a previous instruction is reading from.

  3. WAW (Write After Write): Occurs when two instructions write to the same location and the writes must complete in program order for the final value to be correct.

Dynamic Scheduling Techniques

Dynamic scheduling is a technique used by the hardware to manage instruction execution order and resolve data hazards dynamically at runtime. One of the key methods used in dynamic scheduling is Scoreboarding.

Scoreboarding

Scoreboarding is a method that allows instructions to execute out of order while ensuring that they do not violate data dependencies. It keeps track of the status of instructions and functional units to manage hazards dynamically.

Basic Concepts

  • Instruction Status: Keeps track of which stage each instruction is in.

  • Functional Unit Status: Keeps track of the status of each functional unit (e.g., whether it is busy, which instruction it is executing).

  • Register Result Status: Indicates which instruction will write to each register, helping to detect WAR and WAW hazards.

Components of a Scoreboard

  • Instruction Status Table: Tracks the stages of all instructions.

  • Functional Unit Status Table: Tracks which units are busy, the operations they are performing, and their operands (see the C sketch after this list).

  • Register Status Table: Tracks which instructions will write to each register.
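
These three tables can be rendered as plain data structures. A minimal C sketch, following Hennessy and Patterson's field names (Fi, Fj, Fk for registers; Qj, Qk for producing units; Rj, Rk for readiness); the sizes and types are illustrative:

#include <stdbool.h>

/* Instruction Status Table: which stage each instruction has reached */
enum stage { STAGE_ISSUE, STAGE_READ_OPERANDS, STAGE_EXECUTE, STAGE_WRITE_RESULT };
struct instr_status { enum stage stage; };

/* Functional Unit Status Table: one entry per functional unit */
struct fu_status {
    bool busy;        /* is the unit in use?                          */
    int  op;          /* operation being performed                    */
    int  Fi, Fj, Fk;  /* destination and source register numbers     */
    int  Qj, Qk;      /* units producing Fj / Fk (-1 if value ready)  */
    bool Rj, Rk;      /* are Fj / Fk ready (and not yet read)?        */
};

/* Register Result Status: which unit will write each register (-1 = none) */
int register_result[32];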

Scoreboarding Example on MIPS32

Consider the following sequence of MIPS32 instructions:

1. LD F6, 34(R2)
2. LD F2, 45(R3)
3. MUL F0, F2, F4
4. SUB F8, F6, F2
5. DIV F10, F0, F6
6. ADD F6, F8, F2

We will use tables to demonstrate how scoreboarding manages these instructions, showing the initial tables and updating them step by step while explaining the changes made at each step. (For this walkthrough, assume the integer unit that handles loads is pipelined, so the two LD instructions can overlap; with a single non-pipelined integer unit, the second LD would stall at issue until the first completed.)
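
Before walking through the steps, it may help to see the checks the scoreboard applies at each stage. A hedged sketch in C, reusing the structures from the previous section (simplified: real hardware evaluates these conditions in parallel):

bool can_issue(const struct fu_status *fu, int dest) {
    /* structural hazard: the unit must be free;
       WAW hazard: no active instruction may already have dest as its Fi */
    return !fu->busy && register_result[dest] == -1;
}

bool can_read_operands(const struct fu_status *fu) {
    /* RAW hazard: wait until both source operands have been produced */
    return fu->Rj && fu->Rk;
}

bool can_write_result(const struct fu_status units[], int n, int self) {
    /* WAR hazard: stall the write while any other unit still needs to
       read the old value of this instruction's destination Fi */
    for (int u = 0; u < n; u++) {
        if (u == self) continue;
        if ((units[u].Fj == units[self].Fi && units[u].Rj) ||
            (units[u].Fk == units[self].Fi && units[u].Rk))
            return false;
    }
    return true;
}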

Initial Tables

Step-by-Step Execution

Step 1: Issue LD F6, 34(R2)

Changes:

  • The LD F6, 34(R2) instruction is issued.

  • The Integer Unit is marked busy with the LD operation.

  • Register F6 will be written by this instruction.

Updated Tables:

Step 2: Issue LD F2, 45(R3)

Changes:

  • The LD F2, 45(R3) instruction is issued.

  • The Integer Unit is marked busy with the LD operation.

  • Register F2 will be written by this instruction.

Updated Tables:

Step 3: Execute LD F6, 34(R2) and Read Operands for LD F2, 45(R3)

Changes:

  • The LD F6, 34(R2) instruction reads its operands and executes.

  • The LD F2, 45(R3) instruction reads its operands.

  • The Integer Unit continues to be busy with the LD operations.

  • The memory access for LD F6, 34(R2) completes; F6 itself is updated at the write-result stage.

Updated Tables:

Step 4: Write Result for LD F6, 34(R2) and Execute LD F2, 45(R3)

Changes:

  • The LD F6, 34(R2) instruction writes its result to F6.

  • The LD F2, 45(R3) instruction executes.

  • The Integer Unit continues to be busy with the LD operation.

Updated Tables:

Step 5: Write Result for LD F2, 45(R3)

Changes:

  • The LD F2, 45(R3) instruction writes its result to F2.

  • The Integer Unit is now free.

Updated Tables:

Step 6: Issue MUL F0, F2, F4

Changes:

  • The MUL F0, F2, F4 instruction is issued.

  • The Multiply Unit is marked busy with the MUL operation.

  • Register F0 will be written by this instruction.

Updated Tables:

Step 7: Read Operands for MUL F0, F2, F4

Changes:

  • The MUL F0, F2, F4 instruction reads its operands.

  • The Multiply Unit continues to be busy with the MUL operation.

Updated Tables:

Step 8: Execute MUL F0, F2, F4

Changes:

  • The MUL F0, F2, F4 instruction executes.

  • The Multiply Unit continues to be busy with the MUL operation.

Updated Tables:

Step 9: Issue SUB F8, F6, F2

Changes:

  • The SUB F8, F6, F2 instruction is issued.

  • The Add Unit is marked busy with the SUB operation.

  • Register F8 will be written by this instruction.

Updated Tables:

Step 10: Read Operands for SUB F8, F6, F2

Changes:

  • The SUB F8, F6, F2 instruction reads its operands.

  • The Add Unit continues to be busy with the SUB operation.

Updated Tables:

Step 11: Execute SUB F8, F6, F2

Changes:

  • The SUB F8, F6, F2 instruction executes.

  • The Add Unit continues to be busy with the SUB operation.

Updated Tables:

Step 12: Issue DIV F10, F0, F6

Changes:

  • The DIV F10, F0, F6 instruction is issued.

  • The Divide Unit is marked busy with the DIV operation.

  • Register F10 will be written by this instruction.

Updated Tables:

Step 13: Write Result for SUB F8, F6, F2

Changes:

  • The SUB F8, F6, F2 instruction writes its result to F8.

  • The Add Unit is now free.

  • The DIV F10, F0, F6 instruction remains stalled at the read-operands stage: its source F0 has not yet been written by MUL (RAW hazard).

Updated Tables:

Step 14: Issue ADD F6, F8, F2

Changes:

  • With the Add Unit free, the ADD F6, F8, F2 instruction is issued.

  • The Add Unit is marked busy with the ADD operation.

  • Register F6 will be written by this instruction. (Issue is legal because no other active instruction is waiting to write F6, so there is no WAW hazard.)

Updated Tables:

Step 15: Read Operands for ADD F6, F8, F2

Changes:

  • The ADD F6, F8, F2 instruction reads its operands; both F8 and F2 are available.

  • The Add Unit continues to be busy with the ADD operation.

Updated Tables:

Step 16: Execute ADD F6, F8, F2

Changes:

  • The ADD F6, F8, F2 instruction executes.

  • ADD cannot proceed to the write-result stage yet: DIV has not read the old value of F6, so writing F6 now would violate the WAR hazard between DIV and ADD.

Updated Tables:

Step 17: Write Result for MUL F0, F2, F4

Changes:

  • The MUL F0, F2, F4 instruction writes its result to F0.

  • The Multiply Unit is now free.

  • F0 is now available, releasing DIV from its RAW stall.

Updated Tables:

Step 18: Read Operands for DIV F10, F0, F6

Changes:

  • The DIV F10, F0, F6 instruction reads its operands F0 and F6.

  • Because DIV has now read the old value of F6, the WAR hazard holding back ADD's write is cleared.

Updated Tables:

Step 19: Write Result for ADD F6, F8, F2

Changes:

  • The ADD F6, F8, F2 instruction writes its result to F6.

  • The Add Unit is now free.

Updated Tables:

Step 20: Execute DIV F10, F0, F6

Changes:

  • The DIV F10, F0, F6 instruction executes.

  • The Divide Unit continues to be busy with the DIV operation.

Updated Tables:

Step 21: Write Result for DIV F10, F0, F6

Changes:

  • The DIV F10, F0, F6 instruction writes its result to F10.

  • The Divide Unit is now free, and the sequence is complete.

Updated Tables:

Problems with Scoreboarding and the Need for Tomasulo's Algorithm

While scoreboarding is a powerful technique for dynamic scheduling, it has some limitations:

  1. Limited Hazard Handling: Scoreboarding handles RAW hazards effectively, but it can resolve WAR and WAW hazards only by stalling, as the example above showed.

  2. Resource Stalls: Scoreboarding may lead to resource stalls if all functional units are busy, causing delays in issuing new instructions.

  3. Complex Control Logic: The control logic for scoreboarding can become quite complex, especially with an increasing number of instructions and functional units.

  4. No Register Renaming: Scoreboarding does not support register renaming, which can lead to false dependencies and limit ILP.

To overcome these limitations, Tomasulo's Algorithm was developed. It provides more sophisticated techniques to dynamically schedule instructions, resolve hazards, and optimize ILP.
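
The false dependencies that renaming removes are easy to see at the source level. A sketch with C variables standing in for an architectural register (r1) and renamed physical registers (p1, p2); all names are illustrative:

void before_renaming(int a, int b, int d, int e, int *c, int *out) {
    int r1;
    r1   = a + b;
    *c   = r1 * 2;   /* reads r1                                          */
    r1   = d + e;    /* WAW with the first write, WAR with the read above */
    *out = r1;
}

void after_renaming(int a, int b, int d, int e, int *c, int *out) {
    int p1 = a + b;  /* fresh name for the first value                    */
    int p2 = d + e;  /* fresh name: now independent of the lines above    */
    *c   = p1 * 2;   /* only the true (RAW) dependence remains            */
    *out = p2;
}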

Tomasulo's Algorithm

Basic Concepts

Tomasulo's Algorithm is a hardware-based method for dynamic instruction scheduling that allows out-of-order execution and resolves data hazards using register renaming. It uses reservation stations for each functional unit and a Common Data Bus (CDB) to broadcast results.

Components of Tomasulo's Algorithm

  1. Reservation Stations: Temporary storage buffers for instructions waiting to be executed. They hold the instruction, the operands, and the status of the operands.

  2. Common Data Bus (CDB): A bus used to broadcast the results of executed instructions to all reservation stations and registers.

  3. Reorder Buffer (ROB): A buffer, added in later speculative extensions of the algorithm, that ensures instructions commit their results in the correct program order, maintaining precise exceptions and program state (see the sketch after this list).
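
A minimal C sketch of these structures and of operand capture at issue, following the usual textbook formulation (the Vj/Vk/Qj/Qk field names are conventional; everything else is illustrative):

#include <stdbool.h>

/* Reservation station: holds either operand values (Vj, Vk) or the
   tags of the stations that will produce them (Qj, Qk; 0 = ready). */
struct rs {
    bool busy;
    int  op;
    long Vj, Vk;    /* operand values, once available          */
    int  Qj, Qk;    /* producing-station tags, 0 if value held */
    int  rob_slot;  /* ROB entry that will commit the result   */
};

int  reg_tag[32];   /* register status: pending producer tag, 0 = current */
long regs[32];

/* Operand capture at issue: take the value if ready, else the tag. */
void capture_first_operand(struct rs *r, int src) {
    if (reg_tag[src] == 0) { r->Vj = regs[src]; r->Qj = 0; }
    else                   { r->Qj = reg_tag[src]; }
}

/* CDB broadcast: every waiting station snoops for matching tags. */
void cdb_broadcast(struct rs stations[], int n, int tag, long value) {
    for (int i = 0; i < n; i++) {
        if (stations[i].Qj == tag) { stations[i].Vj = value; stations[i].Qj = 0; }
        if (stations[i].Qk == tag) { stations[i].Vk = value; stations[i].Qk = 0; }
    }
}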

Tomasulo's Algorithm Example on MIPS32

Consider the following sequence of MIPS32 instructions:

1. LD F6, 34(R2)
2. LD F2, 45(R3)
3. MUL F0, F2, F4
4. SUB F8, F6, F2
5. DIV F10, F0, F6
6. ADD F6, F8, F2

Initial Tables

Step-by-Step Execution

(The walkthrough below serializes events for readability: each instruction appears to issue only after the previous one finishes. In real hardware, issue proceeds in program order at up to one instruction per cycle, and an instruction whose operands are missing waits in its reservation station, snooping the CDB for the tags it needs, rather than blocking issue.)

Step 1: Issue LD F6, 34(R2)

Changes:

  • The LD F6, 34(R2) instruction is issued to the Load1 reservation station.

  • An entry is created in the ROB for the instruction.

  • Register F6 is marked as being updated by the ROB entry.

Updated Tables:

Step 2: Issue LD F2, 45(R3)

Changes:

  • The LD F2, 45(R3) instruction is issued to the Load2 reservation station.

  • An entry is created in the ROB for the instruction.

  • Register F2 is marked as being updated by the ROB entry.

Updated Tables:

Step 3: Execute LD F6, 34(R2) and LD F2, 45(R3)

Changes:

  • The LD F6, 34(R2) and LD F2, 45(R3) instructions read their operands and execute.

  • The results are broadcast on the CDB.

  • ROB entries for these instructions are updated with the results.

Updated Tables:

Step 4: Write Result for LD F6, 34(R2) and LD F2, 45(R3)

Changes:

  • The results of LD F6, 34(R2) and LD F2, 45(R3) are written back to the registers.

  • ROB entries for these instructions are updated to indicate completion.

Updated Tables:

Step 5: Issue MUL F0, F2, F4

Changes:

  • The MUL F0, F2, F4 instruction is issued to the Mult1 reservation station.

  • An entry is created in the ROB for the instruction.

  • Register F0 is marked as being updated by the ROB entry.

Updated Tables:

Step 6: Execute MUL F0, F2, F4

Changes:

  • The MUL F0, F2, F4 instruction executes.

  • The result is broadcast on the CDB.

  • The ROB entry for this instruction is updated with the result.

Updated Tables:

Step 7: Write Result for MUL F0, F2, F4

Changes:

  • The result of MUL F0, F2, F4 is written back to F0.

  • The ROB entry for this instruction is updated to indicate completion.

Updated Tables:

Step 8: Issue SUB F8, F6, F2

Changes:

  • The SUB F8, F6, F2 instruction is issued to the Add1 reservation station.

  • An entry is created in the ROB for the instruction.

  • Register F8 is marked as being updated by the ROB entry.

Updated Tables:

Step 9: Execute SUB F8, F6, F2

Changes:

  • The SUB F8, F6, F2 instruction executes.

  • The result is broadcast on the CDB.

  • The ROB entry for this instruction is updated with the result.

Updated Tables:

Step 10: Write Result for SUB F8, F6, F2

Changes:

  • The result of SUB F8, F6, F2 is written back to F8.

  • The ROB entry for this instruction is updated to indicate completion.

Updated Tables:

Step 11: Issue DIV F10, F0, F6

Changes:

  • The DIV F10, F0, F6 instruction is issued to the Div1 reservation station.

  • An entry is created in the ROB for the instruction.

  • Register F10 is marked as being updated by the ROB entry.

Updated Tables:

Step 12: Execute DIV F10, F0, F6

Changes:

  • The DIV F10, F0, F6 instruction executes.

  • The result is broadcast on the CDB.

  • The ROB entry for this instruction is updated with the result.

Updated Tables:

Step 13: Write Result for DIV F10, F0, F6

Changes:

  • The result of DIV F10, F0, F6 is written back to F10.

  • The ROB entry for this instruction is updated to indicate completion.

Updated Tables:

Step 14: Issue ADD F6, F8, F2

Changes:

  • The ADD F6, F8, F2 instruction is issued to the Add2 reservation station.

  • An entry is created in the ROB for the instruction.

  • Register F6 is marked as being updated by the ROB entry.

Updated Tables:

Step 15: Execute ADD F6, F8, F2

Changes:

  • The ADD F6, F8, F2 instruction executes.

  • The result is broadcast on the CDB.

  • The ROB entry for this instruction is updated with the result.

Updated Tables:

Step 16: Write Result for ADD F6, F8, F2

Changes:

  • The result of ADD F6, F8, F2 is written back to F6.

  • The ROB entry for this instruction is updated to indicate completion.

Updated Tables:

Introduction to Multiple Issue Processors

Multiple issue processors are designed to improve performance by issuing multiple instructions per clock cycle, significantly increasing the instruction-level parallelism (ILP) that can be exploited. These processors can fetch, decode, and execute more than one instruction simultaneously, which helps in achieving higher throughput.

Superscalar Architecture

Superscalar architecture is a type of multiple issue processor architecture where multiple instructions are issued and executed in parallel. The key idea is to have multiple execution units within the processor so that multiple instructions can be processed at the same time.

Basic Concepts
  1. Instruction Issue Rate: The number of instructions a processor can issue per clock cycle. For example, a 4-issue superscalar processor can issue four instructions per cycle.

  2. Instruction Latency: The number of cycles it takes to complete an instruction from issue to write-back.

  3. Throughput: The number of instructions that can be completed per unit of time.

  4. Parallel Execution: Multiple instructions are executed simultaneously using different execution units.

Superscalar Pipeline Structure

A superscalar pipeline is an extension of the classic pipeline structure, incorporating multiple pipelines to allow for parallel execution of instructions. The stages in a typical superscalar pipeline include:

  1. Fetch Stage: Multiple instructions are fetched from the instruction cache.

  2. Decode Stage: Instructions are decoded, and dependencies are checked.

  3. Issue Stage: Instructions are issued to available execution units. Dependencies are resolved using techniques such as register renaming.

  4. Execute Stage: Instructions are executed in parallel across different execution units.

  5. Write-back Stage: The results of instructions are written back to the register file.

Example of a Superscalar Processor

Consider a 2-issue superscalar processor that can issue two instructions per cycle. Let's illustrate this with a sequence of instructions:

1. ADD R1, R2, R3
2. MUL R4, R5, R6
3. SUB R7, R8, R9
4. DIV R10, R11, R12

In a scalar processor (single issue), these instructions would be executed sequentially. In a 2-issue superscalar processor, the execution can be as follows:

  1. Cycle 1: Fetch and decode ADD and MUL.

  2. Cycle 2: Issue ADD to the integer ALU and MUL to the multiplier; meanwhile, fetch and decode SUB and DIV.

  3. Cycle 3: Issue SUB to the integer ALU (free again, assuming ADD occupied it for one cycle) and DIV to the divider.

In this example, the superscalar processor can complete the sequence in fewer cycles compared to a scalar processor, thus increasing the throughput.
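
The cycle arithmetic generalizes. A back-of-envelope sketch under idealized assumptions (fully independent instructions, enough execution units, pipeline depth d assumed):

#include <stdio.h>

/* Ideal cycles for n independent instructions on a w-issue pipeline of
   depth d: ceil(n / w) issue groups plus (d - 1) cycles of pipeline fill. */
long ideal_cycles(long n, int w, int d) {
    return (n + w - 1) / w + (d - 1);
}

int main(void) {
    printf("scalar:  %ld cycles\n", ideal_cycles(4, 1, 5));  /* 8 */
    printf("2-issue: %ld cycles\n", ideal_cycles(4, 2, 5));  /* 6 */
    return 0;
}

The gain assumes the four instructions are independent; dependences or contention for a shared functional unit would add stall cycles on top of this ideal count.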

Superpipelined Architecture

Superpipelined architecture is an enhancement of the traditional pipelining technique, where the pipeline stages are broken down into smaller, more finely divided stages. This allows the clock cycle time to be reduced, increasing the number of instructions that can be processed in a given period. The primary aim is to improve the instruction throughput without necessarily increasing the instruction issue rate.

Key Characteristics of Superpipelined Architectures:

  1. Increased Pipeline Stages: More pipeline stages than traditional pipelined architectures.

  2. Shorter Cycle Time: Smaller stages allow for a higher clock frequency.

  3. Higher Throughput: More instructions can be processed per unit of time due to increased clock frequency.

Superpipelining vs. Superscalar

  • Superpipelining:

    • Focuses on increasing the depth of the pipeline.

    • Higher clock frequency due to shorter stages.

    • May still issue one instruction per cycle but completes more instructions over time due to higher clock rates.

  • Superscalar:

    • Focuses on issuing multiple instructions per clock cycle.

    • Requires multiple execution units to handle parallel instruction execution.

    • More complex control logic to manage dependencies and parallelism.

Example of a Superpipelined Processor

Consider a superpipelined processor with a 10-stage pipeline, compared to a traditional 5-stage pipeline. Each superpipelined stage does roughly half the work of a traditional stage, so the clock period can be halved, effectively doubling the clock frequency. (A worked sketch of the resulting timing follows the stage list below.)

Pipeline Stages:

  1. Instruction Fetch 1 (IF1)

  2. Instruction Fetch 2 (IF2)

  3. Instruction Decode 1 (ID1)

  4. Instruction Decode 2 (ID2)

  5. Execute 1 (EX1)

  6. Execute 2 (EX2)

  7. Memory Access 1 (MEM1)

  8. Memory Access 2 (MEM2)

  9. Write-back 1 (WB1)

  10. Write-back 2 (WB2)
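
Putting rough numbers on this example, under idealized assumptions (no stalls, perfect stage splitting; the 1 GHz base clock is assumed for illustration):

#include <stdio.h>

/* Wall-clock time for n instructions: (depth + n - 1) cycles x period. */
double run_time_ns(long n, int depth, double period_ns) {
    return (depth + n - 1) * period_ns;
}

int main(void) {
    long n = 1000;
    printf("5-stage at 1 GHz:  %.1f ns\n", run_time_ns(n, 5, 1.0));   /* 1004.0 */
    printf("10-stage at 2 GHz: %.1f ns\n", run_time_ns(n, 10, 0.5));  /*  504.5 */
    return 0;
}

Throughput nearly doubles, but not quite: the deeper pipeline takes longer to fill, and (not modeled here) each branch misprediction or load-use stall now costs twice as many cycles.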

Very Long Instruction Word (VLIW) Processor Architecture

VLIW processors are designed to exploit ILP by issuing long instructions that contain multiple operations to be executed in parallel. Each instruction specifies several independent operations that the hardware executes simultaneously.

Key Characteristics of VLIW Architectures:

  1. Fixed Instruction Format: Each instruction word contains multiple operations.

  2. Static Scheduling: The compiler is responsible for determining the parallelism and scheduling operations within an instruction.

  3. Simpler Hardware: Less complex control logic compared to superscalar processors because the compiler handles dependency resolution.

VLIW vs. Superscalar

  • VLIW:

    • Relies on the compiler for parallelism detection and scheduling.

    • Fixed number of operations per instruction word.

    • Simpler hardware with fewer dynamic decisions.

  • Superscalar:

    • Relies on the hardware to dynamically schedule and issue multiple instructions per cycle.

    • Variable number of instructions issued per cycle.

    • More complex hardware with dynamic scheduling and dependency resolution.

Example of a VLIW Processor

Consider a VLIW processor that can execute 4 operations per instruction word. An instruction word might include an integer addition, a floating-point multiplication, a memory load, and a branch instruction.

VLIW Instruction Word Example:

[ ADD R1, R2, R3 | MUL F1, F2, F3 | LD R4, 0(R5) | BEQ R6, R7, LABEL ]

Execution Timeline:

Cycle | Operation 1 (ADD) | Operation 2 (MUL) | Operation 3 (LD) | Operation 4 (BEQ)
------|-------------------|-------------------|------------------|------------------
1     | Fetch             | Fetch             | Fetch            | Fetch
2     | Decode            | Decode            | Decode           | Decode
3     | Execute           | Execute           | Execute          | Execute
4     | Write-back        | Write-back        | Write-back       | —

(BEQ resolves in the execute stage and writes no register result, so its slot is empty in cycle 4.)

Summary

Superpipelined Architecture:

  • Increases the number of pipeline stages, allowing higher clock frequencies and greater instruction throughput.

  • Focuses on deeper pipelines without necessarily issuing multiple instructions per cycle.

VLIW Architecture:

  • Issues long instruction words containing multiple operations to be executed in parallel.

  • Relies on the compiler to schedule and resolve dependencies, simplifying hardware design.

Both architectures aim to exploit ILP to improve performance but approach it differently: superpipelined architectures through deeper pipelines and higher clock rates, and VLIW architectures through compiler-managed parallelism within fixed-format instruction words.