# HPC Midsem Prep

Consider a 5-Stage Pipeline with cycle times of 60, 70, 90, 100, and 80 ns & interface registers having a delay of 10 ns. Calculate the speedup with respect to a non-pipelined system.

Explain the R, I, and J type MIPS instruction formats with one example from each type.

Given instructions

`I1`

,`I2`

,`I3`

, and`I4`

in a loop in the program order, calculate the number of cycles needed to execute the following loop in a 5-Stage Pipeline:`For (i = 1 to 2) { I1; I2; I3; I4; }`

Consider a non-pipelined processor that takes 4 cycles for ALU operations and 5 cycles for branches and memory operations. Assuming branch instructions account for 15% of all instructions and memory operations account for 25%, what is the average CPI of a non-pipelined CPU?

What is a structural hazard? Explain with a suitable example.

What is a pipeline? Briefly explain the 5-Stage Pipeline for a MIPS processor.

What is a control hazard? How do we deal with control hazards?

In the following code, if the branch instruction is taken in the above sequence of instructions, find out the independent instruction that can be placed into the delayed slot for all possible cases to improve system performance.

`LOAD R2, 10(R5) STORE (04)R8, R2 MUL R3, R2, R4 SUB R2, R3, R1 BNEQZ R2, L1 LOAD R7, (14)R6 END L1: LOAD R7, 21(R2) SUB R4, R7, R5 MUL R7, R4, R5`

Consider the following MIPS instructions in a 5-Stage pipeline processor with IF, ID, EXE, MEM, WB stages of 1 clock cycle each:

`LOAD R2, (02)R3 DIV R7, R1, R2 MUL R1, R7, R1 STORE R1, (04)R8 ADD R1, R7, R2`

Draw a time and space diagram for the above sequence of instructions to determine the total number of clock cycles required to complete their execution with and without operand forwarding.

Consider the following MIPS instructions:

`ADD R1, R2, R3 SUB R3, R1, R2 ADD R4, R1, R3 MUL R1, R2, R3 SUB R3, R5, R6`

Find out all possible dependencies that exist in the above-given instructions and justify them.

- Consider a 5-Stage Pipeline, and we wish to execute
`I1, I2, I3, ..., I15`

instructions in program order. Determine the number of clock cycles required to complete these instructions in the following cases:

Case 1: Find out the number of clock cycles required to complete

`I1, I2, I3, ..., I15`

instructions in a 5-Stage Pipeline.Case 2: Find out the number of clock cycles required to complete

`I1, I2, I3, ..., I15`

instructions in a 5-Stage Pipeline, where`I4`

is an unconditional Branch instruction, and`I12`

is the target instruction.

Short Notes

Flynn’s Classification

Data Hazard and its Types

Seven Dimensions of an ISA

Consider the following MIPS assembly code:

`ST R1, 45(R2) DADD R10, R1, R5 DSUB R8, R1, R6 AND R2, R5, R1 DMUL R6, R4, R8`

Identify each dependency by type. Calculate the number of stalls required for complete execution of the above code segment smoothly.

Operand forwarding cannot remove stalls entirely. Justify.

Explain different positions where the delayed slot can be placed to overcome the branch hazard.

Discuss the number of stalls required for the complete execution of the following instruction code by using operand forwarding approach:

`LD R1, 0(R2) DSUB R4, R1, R5 AND R6, R1, R7 OR R8, R1, R9`

A 5-stage pipelined processor has IF, ID, OF, EX, WR stages. The IF, ID, OF, and WR stages take 1 clock cycle each for any instruction. The EX stage takes 1 clock cycle for ADD and SUB instructions, 3 clock cycles for the MUL instruction, and 6 clock cycles for the DIV instruction respectively. Operand forwarding is used in the pipeline. What is the number of clock cycles needed to execute the following sequence of instructions:

`MUL R2, R0, R1 DIV R5, R3, R4 ADD R2, R5, R2 SUB R5, R2, R6`

What is Amdahl’s Law? Assume that 30% of instructions are data transfer instructions, 40% are ALU instructions, and the rest are control instructions. Each of data transfer, ALU, and control instructions takes respectively 6 clock cycles, 4 clock cycles, and 7 clock cycles. Find the CPI of the machine. If using the latest hardware, there is a 3x enhancement in ALU instructions, then find the overall speedup of the machine.

Derive the overall speedup gained by Amdahl’s Law. Suppose a program runs in 100 seconds on a computer with multiply operations responsible for 80 seconds of this time. How much do I have to improve the speed of multiplication if I want my program to run five times faster?

Find out the total number of clock cycles required to execute the following instructions without and with operand forwarding:

`LD R1, 0(R2) DADDIU R1, R1, #1 SD R1, 0(R2) DADDIU R2, R2, #4 DSUB R4, R3, R2`

A five-stage pipeline processor has IF, ID, EXE, MEM, WB stages. The IF, ID, MEM, WB stages take 1 clock cycle each for any instruction. The EXE stage takes 1 clock cycle for LOAD, ADD & SUB instructions, 2 clock cycles for MUL, and DIV instructions respectively. Consider the following instructions:

`LOAD R3, 9(R2) DIV R1, R3, R4 ADD R5, R1, R6 SUB R7, R1, R8 MUL R9, R1, R10`

For the above sequence of instructions, find out the total number of clock cycles required to complete the execution, without operand forwarding.

**Consider a 4-stage pipeline processor.**The number of cycles needed by the 3 instructions I 1:3 in stages S 1:4 is as below:What is the number of cycles needed to execute the following loop?

`For (I = 1 to 2) {I1; I2; I3; I4}`

An MIPS pipeline contains 5 stages with 60, 70, 90, 100, and 80 nsec cycle times & the Interface registers have a delay of 10 ns. Calculate the speedup.

Calculate the execution time of the following set of instructions assuming a 5-stage instruction pipeline (without any resolving techniques):

`ADD R3, R4, R5 SUB R7, R3, R9 MUL R8, R9, R10 ASH R4, R8, R12`

Consider the following instruction mix in a five-stage pipeline, where 25% of the instruction mix are load instructions and in 50% of its cases, the next instruction uses load value, 11% of the instruction mix are store instructions, 17% are conditional branches, and 4% are unconditional branch instructions. For the above instruction mix, the required penalties are as follows: a Penalty of 2 cycles on use of load value immediately after a load, Jumps are resolved in the ID stage with a 1 cycle branch penalty, 80% branch prediction accuracy, and 2 cycles delay on misprediction. Calculate the overall CPI for the above set of instructions.

Consider a five-stage pipeline, where IF, ID, MEM, WB takes one clock cycle, and execution latency for Load and Add instructions is 2 clock cycles, while Mul and Div take 3 clock cycles. Consider the following sequence of instructions:

`LD R1, M[100] ADD R1, R1, R1 MUL R2, R1, R2 DIV R4, R2, R6`

Identify types of dependency and hazards in the above code segment. Compare the number of cycles required to complete the execution of the above code segment with & without adopting hazard resolution techniques.

Without any scheduling, the loop on MIPS will execute as follows:

`Clock Cycle Issued 1 Loop: LD F2, 0(R1) 2 Stall 3 Stall 4 Stall 5 LD F4, 0(R2) 6 Stall 7 Add. D F6, F2, F4 8 Stall 9 Stall 10 SD F6, 0(R1) 11 DADDU R1, R1, #-8 12 Stall 13 BNE R1, R3, Loop`

Apply the concept of instruction scheduling & loop unrolling to reduce the number of cycles per operation. R1 and R2 are initially the addresses of the element in the arrays x[] and y[] respectively. Register R3 is precomputed so that 8(R2) is the address of the last element to operate on.

Discuss different techniques for scheduling the branch delayed slot with examples.

Explain different branch prediction techniques with examples.

Consider a task where 75% of the task can be enhanced. If the task is executed on a 2-core machine

`M`

, then the speedup of the enhanced part is 2 times. Based on the above specification, answer the following questions:

a. Find the number of additional cores required to achieve twice the overall speedup achieved by machine`M`

.

b. Find out the additional number of cores required to achieve a 4-times overall speedup achieved by machine`M`

.