Given:
CPU Specifications:
Number of cores: 4
Pipeline stages per core: 5
Frequency: 3 GHz
Program Characteristics:
Total instructions: 1 billion
Parallel portion: 20%
Sequential portion: 80%
Instruction Mix:
Arithmetic instructions: 50%
20% of these incur a 1-cycle stall
10% of these incur a 2-cycle stall
Branch instructions: 30%
- 40% of these introduce a 2-cycle flush
Memory instructions: 20%
30% are loads that incur a 4-cycle stall
70% are stores that incur a 3-cycle stall
Performance Improvement Options and Costs:
Increase Parallelism in the Program:
- Each 1% increase in parallelism costs $2000 in software development.
Increase the Number of Cores:
Each additional core costs $10,000 in hardware design.
Note: The number of cores must be a whole number.
Decrease the Number of Branch Instructions (and overall instruction count):
- Each 1% decrease in branch instructions costs $4000 in compiler and software development.
Questions:
Calculate the effective CPI of the processor.
Calculate the execution time of the program.
Determine the cheapest option to achieve a 20% increase in performance.
- Provide detailed calculations and justifications for your answer.
Solution
To calculate the effective CPI (Cycles Per Instruction) for the given CPU and program characteristics, we need to consider the base CPI of the CPU and the stalls introduced by the instruction mix. Here's the step-by-step calculation:
Step 1: Base CPI
For a simple 5-stage pipeline with no stalls, the base CPI is 1.
Step 2: Calculate the contribution of each type of instruction to the CPI
Arithmetic Instructions
50% of total instructions
20% of these incur a 1-cycle stall: \(0.2 \times 50\% = 10\%\) of total instructions with 1-cycle stall
10% of these incur a 2-cycle stall: \(0.1 \times 50\% = 5\%\) of total instructions with 2-cycle stall
Branch Instructions
30% of total instructions
40% of these introduce a 2-cycle flush: \(0.4 \times 30\% = 12\%\) of total instructions with 2-cycle flush
Memory Instructions
20% of total instructions
30% are loads that incur a 4-cycle stall: \(0.3 \times 20\% = 6\%\) of total instructions with 4-cycle stall
70% are stores that incur a 3-cycle stall: \(0.7 \times 20\% = 14\%\) of total instructions with 3-cycle stall
Step 3: Calculate the effective CPI
The effective CPI can be calculated by summing the contributions from each instruction type:
$$\text{Effective CPI} = \text{Base CPI} + \text{Arithmetic contribution} + \text{Branch contribution} + \text{Memory contribution}$$
$$\text{Arithmetic contribution} = 10\% \times 1 + 5\% \times 2 = 0.1 + 0.1 = 0.2$$
$$\text{Branch contribution} = 12\% \times 2 = 0.24$$
$$\text{Memory contribution} = 6\% \times 4 + 14\% \times 3 = 0.24 + 0.42 = 0.66$$
$$\text{Effective CPI} = 1 + 0.2 + 0.24 + 0.66 = 2.1$$
So, the effective CPI for the given program and CPU specifications is 2.1.
To calculate the execution time considering parallelism, we need to use Amdahl's Law. Amdahl's Law gives the speedup of a program using multiple processors.
Amdahl's Law Formula
$$S = \frac{1}{(1 - P) + \frac{P}{N}}$$
where:
\( S \) is the theoretical speedup of the execution of the whole task.
\( P \) is the portion of the program that can be parallelized.
\( N \) is the number of processors (cores).
Given:
Parallel portion \( P = 20\% = 0.2 \)
Sequential portion \( 1 - P = 80\% = 0.8 \)
Number of cores \( N = 4 \)
Step 1: Calculate the speedup factor using Amdahl's Law
$$S = \frac{1}{(1 - 0.2) + \frac{0.2}{4}}$$
$$S = \frac{1}{0.8 + 0.05}$$
$$S = \frac{1}{0.85}$$
$$S \approx 1.176$$
Step 2: Calculate the execution time considering the speedup factor
$$\text{Execution Time} = \frac{\text{Total Instructions} \times \text{Effective CPI}}{\text{Frequency} \times S}$$
Given:
Total Instructions: \( 1 \text{ billion} = 1 \times 10^9 \)
Effective CPI: \( 2.1 \)
Frequency: \( 3 \text{ GHz} = 3 \times 10^9 \text{ cycles per second} \)
Speedup factor \( S \approx 1.176 \)
$$\text{Execution Time} = \frac{1 \times 10^9 \times 2.1}{3 \times 10^9 \times 1.176}$$
$$\text{Execution Time} = \frac{2.1 \times 10^9}{3.528 \times 10^9}$$
$$\text{Execution Time} = \frac{2.1}{3.528}$$
$$\text{Execution Time} \approx 0.595 \text{ seconds}$$
So, considering the parallel speedup factor, the execution time for the program is approximately 0.595 seconds.
Let's analyze how decrease of branches can bring about an increase of 20% in performance.
To determine the percentage decrease in branch instructions required to achieve a 20% increase in performance, we assume a total of 100 instructions, with 50 arithmetic, 30 branch, and 20 memory instructions. We'll find the percentage decrease in branch instructions needed to achieve a 20% performance increase.
A 20% increase in performance means the new performance is 1.2 times the original performance. Thus, the new effective CPI needs to be:
$$\text{New Effective CPI} = \frac{\text{Old Effective CPI}}{1.2}$$
$$\text{New Effective CPI} = \frac{2.1}{1.2} \approx 1.75$$
Arithmetic Stall Contribution:
20% of 50 incur a 1-cycle stall: \(0.2 \times 50 = 10\) instructions with a 1-cycle stall.
10% of 50 incur a 2-cycle stall: \(0.1 \times 50 = 5\) instructions with a 2-cycle stall.
Arithmetic stall contribution: \((10 \times 1) + (5 \times 2) = 10 + 10 = 20\) cycles.
Branch Flush Contribution:
- 40% of \(30 - x\) introduce a 2-cycle flush:
$$\text{Branch flush contribution} = 0.4 \times (30 - x) \times 2 = 0.8 \times (30 - x)$$
Memory Stall Contribution:
30% of 20 are loads that incur a 4-cycle stall: \(0.3 \times 20 = 6\) instructions with a 4-cycle stall.
70% of 20 are stores that incur a 3-cycle stall: \(0.7 \times 20 = 14\) instructions with a 3-cycle stall.
Memory stall contribution: \((6 \times 4) + (14 \times 3) = 24 + 42 = 66\) cycles.
The total cycles for \(100 - x\) instructions is given by:
$$\text{Total Cycles} = (100 - x) + \text{Arithmetic stall contribution} + \text{Branch flush contribution} + \text{Memory stall contribution}$$
$$\text{Total Cycles} = (100 - x) + 20 + 0.8 \times (30 - x) + 66$$
$$\text{Total Cycles} = (100 - x) + 20 + 24 - 0.8x + 66$$
$$\text{Total Cycles} = 210 - 1.8x$$
The effective CPI is given by:
$$\text{Effective CPI} = \frac{\text{Total Cycles}}{\text{Total Instructions}}$$
$$\text{Effective CPI} = \frac{210 - 1.8x}{100 - x}$$
Set the Effective CPI to the Target CPI:
We set the effective CPI to the target CPI (1.75):
$$1.75 = \frac{210 - 1.8x}{100 - x}$$
Solve for\(x\):
$$x = 700$$
Since we have 30 branches, we cannot reduce that by 700. So, we cannot increase performance by 20% just by decreasing branches.
To determine how many cores need to be increased to achieve a 20% increase in performance, we use Amdahl's Law.
Given:
Initial performance with 4 cores
Desired performance increase: 20%
Initial CPI: 2.1
Parallel portion: 20%
Sequential portion: 80%
Initial number of cores: 4
A 20% increase in performance means the new performance is 1.2 times the original performance. Thus, the new effective CPI needs to be:
$$\text{New Effective CPI} = \frac{\text{Old Effective CPI}}{1.2}$$
$$\text{New Effective CPI} = \frac{2.1}{1.2} \approx 1.75$$
Amdahl's Law states:
$$\text{Performance Increase} = \frac{1}{(1 - P) + \frac{P}{N}}$$
where:
\(\text{Performance Increase}\) is the theoretical speedup
\(P\) is the parallel portion of the program
\(N\) is the number of cores
We know that the new performance should be 1.2 times the initial performance. Therefore, we need to solve for \(N\) such that:
$$\text{Performance Increase} = 1.2$$
We need to solve for \(N\):
$$1.2 = \frac{1}{(1 - 0.2) + \frac{0.2}{N}}$$
$$1.2 = \frac{1}{0.8 + \frac{0.2}{N}}$$
Multiply both sides by \(0.8 + \frac{0.2}{N}\):
$$1.2 (0.8 + \frac{0.2}{N}) = 1$$
$$0.96 + \frac{0.24}{N} = 1$$
Subtract 0.96 from both sides:
$$\frac{0.24}{N} = 0.04$$
Solve for \(N\):
$$N = \frac{0.24}{0.04}$$
$$N = 6$$
To achieve a 20% increase in performance, the number of cores needs to be increased to 6. Hence, we need to add 2 cores at $10,000 per core, so the total cost is $20,000.
To determine the cost needed to achieve a 20% performance increase by changing the parallel portion \( P \) , we'll use Amdahl's Law.
Given:
Initial performance increase: 20%
Initial parallel portion \( P = 0.2 \)
Desired performance increase: 20%
Each 1% increase in parallelism costs $2000
A 20% increase in performance means the new performance is 1.2 times the original performance.
Amdahl's Law states:
$$\text{Performance Increase} = \frac{1}{(1 - P) + \frac{P}{N}}$$
The initial number of cores remains the same. We know the new performance increase should be 1.2 times the initial performance and \( N = 4 \) .
So, we need to solve for the new \( P \) (let's call it \( P_{\text{new}} \) ) such that:
$$1.2 = \frac{1}{(1 - P_{\text{new}}) + \frac{P_{\text{new}}}{4}}$$
Rewriting the equation:
$$1.2 = \frac{1}{(1 - P_{\text{new}}) + \frac{P_{\text{new}}}{4}}$$
Multiply both sides by \( (1 - P_{\text{new}}) + \frac{P_{\text{new}}}{4} \) :
$$1.2 \left( (1 - P_{\text{new}}) + \frac{P_{\text{new}}}{4} \right) = 1$$
$$0.2 = 0.9P_{\text{new}}$$
Solve for \( P_{\text{new}} \) :
Pnew=0.20.9 Pnew≈0.2222
Each 1% increase in parallelism costs $2000. The percentage increase in \( P \) is:
$$\Delta P \times 100 = 0.0222 \times 100 = 2.22\%$$
The cost for this increase is:
$$\text{Cost} = 2.22 \times 2000 = 4440$$
To achieve a 20% increase in performance by increasing the parallel portion \( P \) , the cost needed is $4440. This is the cheapest option.