HPC Midsem Prep

Consider a 5-stage pipeline with stage delays of 60, 70, 90, 100, and 80 ns and interface registers with a delay of 10 ns. Calculate the speedup with respect to a non-pipelined system.
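One way to check the arithmetic for this question is the short sketch below. It assumes the usual textbook model: the non-pipelined system takes the sum of the stage delays per instruction with no register overhead, the pipelined clock is set by the slowest stage plus the register delay, and the instruction stream is long enough to ignore fill time. The function name `pipeline_speedup` is only illustrative.

```python
def pipeline_speedup(stage_delays_ns, register_delay_ns):
    """Steady-state speedup of a pipelined design over a non-pipelined one."""
    # Non-pipelined: one instruction passes through every stage back to back.
    non_pipelined_time = sum(stage_delays_ns)
    # Pipelined: the clock period is the slowest stage plus the latch delay.
    cycle_time = max(stage_delays_ns) + register_delay_ns
    return non_pipelined_time / cycle_time


speedup = pipeline_speedup([60, 70, 90, 100, 80], 10)
print(f"{speedup:.2f}")  # 400 / 110, about 3.64
```

If your course counts fill cycles for a finite number of instructions, the result will be slightly lower than this steady-state figure.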
Explain the R, I, and J type MIPS instruction formats with one example from each type.
Given instructions I1, I2, I3, and I4 in a loop in the program order, calculate the number of cycles needed to execute the following loop in a 5-Stage Pipeline:
```
 For (i = 1 to 2)
 {
    I1;
    I2;
    I3;
    I4;
 }
```
Consider a non-pipelined processor that takes 4 cycles for ALU operations and 5 cycles for branches and memory operations. Assuming branch instructions account for 15% of all instructions and memory operations account for 25%, what is the average CPI of a non-pipelined CPU?
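The average CPI here is a weighted sum over the instruction mix. The sketch below assumes that every instruction that is neither a branch nor a memory operation is an ALU operation (60% of the mix); `average_cpi` is an illustrative helper name.

```python
def average_cpi(mix):
    """mix is a list of (fraction_of_instructions, cycles) pairs."""
    return sum(fraction * cycles for fraction, cycles in mix)


branch_fraction = 0.15
memory_fraction = 0.25
# Assumption: the remaining instructions are all ALU operations.
alu_fraction = 1.0 - branch_fraction - memory_fraction

cpi = average_cpi([
    (alu_fraction, 4),     # ALU operations take 4 cycles
    (branch_fraction, 5),  # branches take 5 cycles
    (memory_fraction, 5),  # memory operations take 5 cycles
])
print(round(cpi, 2))  # 0.60*4 + 0.15*5 + 0.25*5 = 4.4
```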
What is a structural hazard? Explain with a suitable example.
What is a pipeline? Briefly explain the 5-Stage Pipeline for a MIPS processor.
What is a control hazard? How do we deal with control hazards?
In the following code, assuming the branch instruction is taken, find the independent instruction that can be placed in the branch delay slot, for all possible cases, to improve system performance.
```
 LOAD   R2, 10(R5)
 STORE  (04)R8, R2
 MUL    R3, R2, R4
 SUB    R2, R3, R1
 BNEQZ  R2, L1
 LOAD   R7, (14)R6
 END
 L1:
 LOAD   R7, 21(R2)
 SUB    R4, R7, R5
 MUL    R7, R4, R5
```
Consider the following MIPS instructions in a 5-Stage pipeline processor with IF, ID, EXE, MEM, WB stages of 1 clock cycle each:

```
    LOAD   R2, (02)R3
    DIV    R7, R1, R2
    MUL    R1, R7, R1
    STORE  R1, (04)R8
    ADD    R1, R7, R2
```
Draw a time and space diagram for the above sequence of instructions to determine the total number of clock cycles required to complete their execution with and without operand forwarding.
Consider the following MIPS instructions:

    ADD    R1, R2, R3
    SUB    R3, R1, R2
    ADD    R4, R1, R3
    MUL    R1, R2, R3
    SUB    R3, R5, R6

Find out all possible dependencies that exist in the above-given instructions and justify them.
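A quick way to cross-check a hand-written answer is to enumerate every ordered pair of instructions and compare register names. The sketch below is a simplification: it lists every pairwise RAW, WAR, and WAW overlap and does not filter out pairs where an intervening instruction rewrites the same register. The `PROGRAM` encoding and `find_dependencies` are illustrative names.

```python
from itertools import combinations

# (label, destination register, source registers) for the five instructions.
PROGRAM = [
    ("I1", "R1", {"R2", "R3"}),  # ADD R1, R2, R3
    ("I2", "R3", {"R1", "R2"}),  # SUB R3, R1, R2
    ("I3", "R4", {"R1", "R3"}),  # ADD R4, R1, R3
    ("I4", "R1", {"R2", "R3"}),  # MUL R1, R2, R3
    ("I5", "R3", {"R5", "R6"}),  # SUB R3, R5, R6
]


def find_dependencies(program):
    """Return (kind, earlier, later, register) for every pairwise dependency."""
    found = []
    # combinations() keeps program order, so `a` always precedes `b`.
    for (a, dst_a, src_a), (b, dst_b, src_b) in combinations(program, 2):
        if dst_a in src_b:   # later instruction reads what the earlier one writes
            found.append(("RAW", a, b, dst_a))
        if dst_b in src_a:   # later instruction writes what the earlier one reads
            found.append(("WAR", a, b, dst_b))
        if dst_a == dst_b:   # both instructions write the same register
            found.append(("WAW", a, b, dst_a))
    return found


for kind, earlier, later, reg in find_dependencies(PROGRAM):
    print(f"{kind}: {earlier} -> {later} on {reg}")
```

Under this pairwise definition the listing contains 4 RAW, 6 WAR, and 2 WAW entries. Whether each one causes a hazard depends on the pipeline, which is the part to justify in the written answer.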

Consider a 5-Stage Pipeline, and we wish to execute I1, I2, I3, ..., I15 instructions in program order. Determine the number of clock cycles required to complete these instructions in the following cases:

Case 1: Find out the number of clock cycles required to complete I1, I2, I3, ..., I15 instructions in a 5-Stage Pipeline.
Case 2: Find out the number of clock cycles required to complete I1, I2, I3, ..., I15 instructions in a 5-Stage Pipeline, where I4 is an unconditional Branch instruction, and I12 is the target instruction.
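Both cases reduce to the formula cycles = k + n - 1 + stalls for a k-stage pipeline with one cycle per stage. The sketch below assumes no data or structural hazards. In Case 2 only I1 to I4 and I12 to I15 complete (8 instructions), and the branch penalty is left as a parameter because the question does not say in which stage the branch target becomes known.

```python
def pipeline_cycles(stages, instructions, stall_cycles=0):
    """Cycles to complete `instructions` in an ideal pipeline plus extra stalls."""
    return stages + instructions - 1 + stall_cycles


# Case 1: all 15 instructions flow through without stalls.
case1 = pipeline_cycles(5, 15)


def case2_cycles(branch_penalty):
    """Case 2: I4 jumps to I12, so only I1..I4 and I12..I15 complete."""
    executed = 4 + 4
    return pipeline_cycles(5, executed, branch_penalty)


print(case1)            # 5 + 15 - 1 = 19
print(case2_cycles(3))  # 12 + 3 = 15 if the target is known 3 cycles late
```

The penalty of 3 in the usage line is only an example value; substitute whatever your course assumes for the stage that resolves the branch.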

Short Notes
1. Flynn’s Classification
2. Data Hazard and its Types
3. Seven Dimensions of an ISA
Consider the following MIPS assembly code:
```
ST     R1, 45(R2)
DADD   R10, R1, R5
DSUB   R8, R1, R6
AND    R2, R5, R1
DMUL   R6, R4, R8
```
Identify each dependency by type. Calculate the number of stalls required to execute the above code segment.
Operand forwarding cannot remove stalls entirely. Justify.
Explain different positions where the delayed slot can be placed to overcome the branch hazard.
Discuss the number of stalls required to execute the following code when operand forwarding is used:

LD     R1, 0(R2)
DSUB   R4, R1, R5
AND    R6, R1, R7
OR     R8, R1, R9

A 5-stage pipelined processor has IF, ID, OF, EX, and WR stages. The IF, ID, OF, and WR stages take 1 clock cycle each for any instruction. The EX stage takes 1 clock cycle for ADD and SUB instructions, 3 clock cycles for the MUL instruction, and 6 clock cycles for the DIV instruction. Operand forwarding is used in the pipeline. What is the number of clock cycles needed to execute the following sequence of instructions?

MUL    R2, R0, R1
DIV    R5, R3, R4
ADD    R2, R5, R2
SUB    R5, R2, R6

What is Amdahl’s Law? Assume that 30% of instructions are data transfer instructions, 40% are ALU instructions, and the rest are control instructions. Data transfer, ALU, and control instructions take 6, 4, and 7 clock cycles respectively. Find the CPI of the machine. If the latest hardware makes ALU instructions 3x faster, find the overall speedup of the machine.
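The numbers in this question can be checked two ways: by recomputing the CPI with faster ALU instructions, and by applying Amdahl's Law to the fraction of execution time spent in ALU instructions. The sketch below assumes that a "3x enhancement" means each ALU instruction takes one third of its original cycles and that the clock rate and instruction count are unchanged. The helper names are illustrative.

```python
def weighted_cpi(mix):
    """mix is a list of (fraction_of_instructions, cycles) pairs."""
    return sum(fraction * cycles for fraction, cycles in mix)


def amdahl_speedup(enhanced_time_fraction, enhancement):
    """Overall speedup when a fraction of execution time is sped up."""
    return 1.0 / ((1.0 - enhanced_time_fraction) + enhanced_time_fraction / enhancement)


control_fraction = 1.0 - 0.30 - 0.40  # the rest of the mix

old_cpi = weighted_cpi([(0.30, 6), (0.40, 4), (control_fraction, 7)])
new_cpi = weighted_cpi([(0.30, 6), (0.40, 4 / 3), (control_fraction, 7)])
speedup_from_cpi = old_cpi / new_cpi

# Amdahl's Law works on time fractions, not instruction-count fractions.
alu_time_fraction = (0.40 * 4) / old_cpi
speedup_from_amdahl = amdahl_speedup(alu_time_fraction, 3)

print(round(old_cpi, 2))           # 1.8 + 1.6 + 2.1 = 5.5
print(round(speedup_from_cpi, 2))  # about 1.24
```

Both routes give the same result, which is a useful sanity check: the fraction fed to Amdahl's Law must be the share of execution time (1.6 / 5.5), not the 40% share of the instruction count.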
Derive the overall speedup gained by Amdahl’s Law. Suppose a program runs in 100 seconds on a computer with multiply operations responsible for 80 seconds of this time. How much do I have to improve the speed of multiplication if I want my program to run five times faster?
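For the numeric part, the time after improvement is (affected time / n) + unaffected time, so the required factor n follows from the time budget left for the affected part. The sketch below uses a hypothetical helper, `required_speedup`, which returns `None` when the target cannot be reached.

```python
def required_speedup(total_time, affected_time, target_overall_speedup):
    """Factor by which the affected part must speed up, or None if impossible."""
    target_time = total_time / target_overall_speedup
    unaffected_time = total_time - affected_time
    # Time budget that remains for the improved part.
    budget = target_time - unaffected_time
    if budget <= 0:
        return None  # the unaffected part alone already uses up the target time
    return affected_time / budget


print(required_speedup(100, 80, 5))  # None: 20 s of non-multiply work remains
print(required_speedup(100, 80, 4))  # 16.0: 80 / (25 - 20)
```

The five-times target needs a 20-second run, but the 20 seconds of non-multiply work already fills that budget, so no finite improvement to multiplication is enough.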
Find out the total number of clock cycles required to execute the following instructions without and with operand forwarding:

LD      R1, 0(R2)
DADDIU  R1, R1, #1
SD      R1, 0(R2)
DADDIU  R2, R2, #4
DSUB    R4, R3, R2

A five-stage pipeline processor has IF, ID, EXE, MEM, and WB stages. The IF, ID, MEM, and WB stages take 1 clock cycle each for any instruction. The EXE stage takes 1 clock cycle for LOAD, ADD, and SUB instructions and 2 clock cycles for MUL and DIV instructions. Consider the following instructions:
```
LOAD   R3, 9(R2)
DIV    R1, R3, R4
ADD    R5, R1, R6
SUB    R7, R1, R8
MUL    R9, R1, R10
```
For the above sequence of instructions, find out the total number of clock cycles required to complete the execution, without operand forwarding.
Consider a 4-stage pipeline processor. The number of cycles needed by the 3 instructions I1 to I3 in stages S1 to S4 is as below:

What is the number of cycles needed to execute the following loop?
```
For (I = 1 to 2)
{I1; I2; I3; I4}
```
A MIPS pipeline contains 5 stages with delays of 60, 70, 90, 100, and 80 ns, and the interface registers have a delay of 10 ns. Calculate the speedup.
Calculate the execution time of the following set of instructions assuming a 5-stage instruction pipeline (without any hazard-resolution techniques):
```
ADD  R3, R4, R5
SUB  R7, R3, R9
MUL  R8, R9, R10
ASH  R4, R8, R12
```
Consider the following instruction mix in a five-stage pipeline: 25% of the instructions are loads, and in 50% of those cases the next instruction uses the loaded value; 11% are stores; 17% are conditional branches; and 4% are unconditional branches. The penalties are as follows: a penalty of 2 cycles when the loaded value is used immediately after a load; jumps are resolved in the ID stage with a 1-cycle branch penalty; branch prediction is 80% accurate, with a 2-cycle delay on misprediction. Calculate the overall CPI for the above instruction mix.
Consider a five-stage pipeline, where IF, ID, MEM, and WB take one clock cycle each, and execution latency for Load and Add instructions is 2 clock cycles, while Mul and Div take 3 clock cycles. Consider the following sequence of instructions:
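The overall CPI for this mix is the base CPI plus the expected stall cycles per instruction from each penalty source. The sketch below assumes an ideal base CPI of 1, no stalls from stores, and no penalty for correctly predicted conditional branches; `pipeline_cpi` is an illustrative name.

```python
def pipeline_cpi(base_cpi, stall_sources):
    """stall_sources is a list of (frequency_per_instruction, penalty_cycles)."""
    return base_cpi + sum(freq * penalty for freq, penalty in stall_sources)


stall_sources = [
    (0.25 * 0.50, 2),        # loads whose value is used by the next instruction
    (0.04, 1),               # unconditional jumps resolved in ID
    (0.17 * (1 - 0.80), 2),  # mispredicted conditional branches
]

cpi = pipeline_cpi(1.0, stall_sources)
print(round(cpi, 3))  # 1 + 0.25 + 0.04 + 0.068 = 1.358
```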
```
LD    R1, M[100]
ADD   R1, R1, R1
MUL   R2, R1, R2
DIV   R4, R2, R6
```
Identify types of dependency and hazards in the above code segment. Compare the number of cycles required to complete the execution of the above code segment with & without adopting hazard resolution techniques.
Without any scheduling, the loop on MIPS will execute as follows:
```
Clock Cycle Issued
1   Loop: LD F2, 0(R1)
2   Stall
3   Stall
4   Stall
5   LD F4, 0(R2)
6   Stall
7   ADD.D F6, F2, F4
8   Stall
9   Stall
10  SD F6, 0(R1)
11  DADDU R1, R1, #-8
12  Stall
13  BNE R1, R3, Loop
```
Apply the concept of instruction scheduling & loop unrolling to reduce the number of cycles per operation. R1 and R2 are initially the addresses of the element in the arrays x[] and y[] respectively. Register R3 is precomputed so that 8(R2) is the address of the last element to operate on.
Discuss different techniques for scheduling the branch delayed slot with examples.
Explain different branch prediction techniques with examples.
Consider a task where 75% of the task can be enhanced. If the task is executed on a 2-core machine M, then the speedup of the enhanced part is 2 times. Based on the above specification, answer the following questions:
a. Find the number of additional cores required to achieve twice the overall speedup achieved by machine M.
b. Find out the additional number of cores required to achieve a 4-times overall speedup achieved by machine M.
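Both parts can be explored with Amdahl's Law. The sketch below assumes the enhanced 75% speeds up linearly with the number of cores (2 cores give 2x, n cores give n times) and reads the targets as multiples of machine M's overall speedup. The function names are illustrative.

```python
def overall_speedup(enhanced_fraction, enhancement):
    """Amdahl's Law for a single enhanced fraction of the task."""
    return 1.0 / ((1.0 - enhanced_fraction) + enhanced_fraction / enhancement)


def cores_for_speedup(enhanced_fraction, target_speedup):
    """Cores needed for a target overall speedup, or None if unreachable."""
    # Solve 1 / ((1 - f) + f / n) = target for n.
    remaining = 1.0 / target_speedup - (1.0 - enhanced_fraction)
    if remaining <= 0:
        return None  # the serial 25% alone caps the speedup below the target
    return enhanced_fraction / remaining


base = overall_speedup(0.75, 2)              # machine M: 1 / 0.625 = 1.6
cores_a = cores_for_speedup(0.75, 2 * base)  # target 3.2
cores_b = cores_for_speedup(0.75, 4 * base)  # target 6.4

print(round(base, 2))     # 1.6
print(round(cores_a))     # 12 cores, so 10 more than machine M
print(cores_b)            # None: the limit is 1 / 0.25 = 4
```

Under these assumptions part (b) has no solution. Even if the question means an absolute overall speedup of 4, that value is only the limit as the core count grows without bound.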
