CPU Performance Preparedness Review

Given:

  1. CPU Specifications:

    • Number of cores: 4

    • Pipeline stages per core: 5

    • Frequency: 3 GHz

  2. Program Characteristics:

    • Total instructions: 1 billion

    • Parallel portion: 20%

    • Sequential portion: 80%

  3. Instruction Mix:

    • Arithmetic instructions: 50%

      • 20% of these incur a 1-cycle stall

      • 10% of these incur a 2-cycle stall

    • Branch instructions: 30%

      • 40% of these introduce a 2-cycle flush
    • Memory instructions: 20%

      • 30% are loads that incur a 4-cycle stall

      • 70% are stores that incur a 3-cycle stall

Performance Improvement Options and Costs:

  1. Increase Parallelism in the Program:

    • Each 1% increase in parallelism costs $2000 in software development.
  2. Increase the Number of Cores:

    • Each additional core costs $10,000 in hardware design.

    • Note: The number of cores must be a whole number.

  3. Decrease the Number of Branch Instructions (and overall instruction count):

    • Each 1% decrease in branch instructions costs $4000 in compiler and software development.

Questions:

  1. Calculate the effective CPI of the processor.

  2. Calculate the execution time of the program.

  3. Determine the cheapest option to achieve a 20% increase in performance.

    • Provide detailed calculations and justifications for your answer.

Solution

To calculate the effective CPI (Cycles Per Instruction) for the given CPU and program characteristics, we need to consider the base CPI of the CPU and the stalls introduced by the instruction mix. Here's the step-by-step calculation:

Step 1: Base CPI

For a simple 5-stage pipeline with no stalls, the base CPI is 1.

Step 2: Calculate the contribution of each type of instruction to the CPI

Arithmetic Instructions

  • 50% of total instructions

  • 20% of these incur a 1-cycle stall: \(0.2 \times 50\% = 10\%\) of total instructions with 1-cycle stall

  • 10% of these incur a 2-cycle stall: \(0.1 \times 50\% = 5\%\) of total instructions with 2-cycle stall

Branch Instructions

  • 30% of total instructions

  • 40% of these introduce a 2-cycle flush: \(0.4 \times 30\% = 12\%\) of total instructions with 2-cycle flush

Memory Instructions

  • 20% of total instructions

  • 30% are loads that incur a 4-cycle stall: \(0.3 \times 20\% = 6\%\) of total instructions with 4-cycle stall

  • 70% are stores that incur a 3-cycle stall: \(0.7 \times 20\% = 14\%\) of total instructions with 3-cycle stall

Step 3: Calculate the effective CPI

The effective CPI can be calculated by summing the contributions from each instruction type:

$$\text{Effective CPI} = \text{Base CPI} + \text{Arithmetic contribution} + \text{Branch contribution} + \text{Memory contribution}$$

$$\text{Arithmetic contribution} = 10\% \times 1 + 5\% \times 2 = 0.1 + 0.1 = 0.2$$

$$\text{Branch contribution} = 12\% \times 2 = 0.24$$

$$\text{Memory contribution} = 6\% \times 4 + 14\% \times 3 = 0.24 + 0.42 = 0.66$$

$$\text{Effective CPI} = 1 + 0.2 + 0.24 + 0.66 = 2.1$$

So, the effective CPI for the given program and CPU specifications is 2.1.

To calculate the execution time considering parallelism, we need to use Amdahl's Law. Amdahl's Law gives the speedup of a program using multiple processors.

Amdahl's Law Formula

$$S = \frac{1}{(1 - P) + \frac{P}{N}}$$

where:

  • \( S \) is the theoretical speedup of the execution of the whole task.

  • \( P \) is the portion of the program that can be parallelized.

  • \( N \) is the number of processors (cores).

Given:

  • Parallel portion \( P = 20\% = 0.2 \)

  • Sequential portion \( 1 - P = 80\% = 0.8 \)

  • Number of cores \( N = 4 \)

Step 1: Calculate the speedup factor using Amdahl's Law

$$S = \frac{1}{(1 - 0.2) + \frac{0.2}{4}}$$

$$S = \frac{1}{0.8 + 0.05}$$

$$S = \frac{1}{0.85}$$

$$S \approx 1.176$$

Step 2: Calculate the execution time considering the speedup factor

$$\text{Execution Time} = \frac{\text{Total Instructions} \times \text{Effective CPI}}{\text{Frequency} \times S}$$

Given:

  • Total Instructions: \( 1 \text{ billion} = 1 \times 10^9 \)

  • Effective CPI: \( 2.1 \)

  • Frequency: \( 3 \text{ GHz} = 3 \times 10^9 \text{ cycles per second} \)

  • Speedup factor \( S \approx 1.176 \)

$$\text{Execution Time} = \frac{1 \times 10^9 \times 2.1}{3 \times 10^9 \times 1.176}$$

$$\text{Execution Time} = \frac{2.1 \times 10^9}{3.528 \times 10^9}$$

$$\text{Execution Time} = \frac{2.1}{3.528}$$

$$\text{Execution Time} \approx 0.595 \text{ seconds}$$

So, considering the parallel speedup factor, the execution time for the program is approximately 0.595 seconds.

Let's analyze how decrease of branches can bring about an increase of 20% in performance.

To determine the percentage decrease in branch instructions required to achieve a 20% increase in performance, we assume a total of 100 instructions, with 50 arithmetic, 30 branch, and 20 memory instructions. We'll find the percentage decrease in branch instructions needed to achieve a 20% performance increase.

A 20% increase in performance means the new performance is 1.2 times the original performance. Thus, the new effective CPI needs to be:

$$\text{New Effective CPI} = \frac{\text{Old Effective CPI}}{1.2}$$

$$\text{New Effective CPI} = \frac{2.1}{1.2} \approx 1.75$$

Arithmetic Stall Contribution:

  • 20% of 50 incur a 1-cycle stall: \(0.2 \times 50 = 10\) instructions with a 1-cycle stall.

  • 10% of 50 incur a 2-cycle stall: \(0.1 \times 50 = 5\) instructions with a 2-cycle stall.

  • Arithmetic stall contribution: \((10 \times 1) + (5 \times 2) = 10 + 10 = 20\) cycles.

Branch Flush Contribution:

  • 40% of \(30 - x\) introduce a 2-cycle flush:

$$\text{Branch flush contribution} = 0.4 \times (30 - x) \times 2 = 0.8 \times (30 - x)$$

Memory Stall Contribution:

  • 30% of 20 are loads that incur a 4-cycle stall: \(0.3 \times 20 = 6\) instructions with a 4-cycle stall.

  • 70% of 20 are stores that incur a 3-cycle stall: \(0.7 \times 20 = 14\) instructions with a 3-cycle stall.

  • Memory stall contribution: \((6 \times 4) + (14 \times 3) = 24 + 42 = 66\) cycles.

The total cycles for \(100 - x\) instructions is given by:

$$\text{Total Cycles} = (100 - x) + \text{Arithmetic stall contribution} + \text{Branch flush contribution} + \text{Memory stall contribution}$$

$$\text{Total Cycles} = (100 - x) + 20 + 0.8 \times (30 - x) + 66$$

$$\text{Total Cycles} = (100 - x) + 20 + 24 - 0.8x + 66$$

$$\text{Total Cycles} = 210 - 1.8x$$

The effective CPI is given by:

$$\text{Effective CPI} = \frac{\text{Total Cycles}}{\text{Total Instructions}}$$

$$\text{Effective CPI} = \frac{210 - 1.8x}{100 - x}$$

Set the Effective CPI to the Target CPI:

We set the effective CPI to the target CPI (1.75):

$$1.75 = \frac{210 - 1.8x}{100 - x}$$

Solve for\(x\):

$$x = 700$$

Since we have 30 branches, we cannot reduce that by 700. So, we cannot increase performance by 20% just by decreasing branches.

To determine how many cores need to be increased to achieve a 20% increase in performance, we use Amdahl's Law.

Given:

  • Initial performance with 4 cores

  • Desired performance increase: 20%

  • Initial CPI: 2.1

  • Parallel portion: 20%

  • Sequential portion: 80%

  • Initial number of cores: 4

A 20% increase in performance means the new performance is 1.2 times the original performance. Thus, the new effective CPI needs to be:

$$\text{New Effective CPI} = \frac{\text{Old Effective CPI}}{1.2}$$

$$\text{New Effective CPI} = \frac{2.1}{1.2} \approx 1.75$$

Amdahl's Law states:

$$\text{Performance Increase} = \frac{1}{(1 - P) + \frac{P}{N}}$$

where:

  • \(\text{Performance Increase}\) is the theoretical speedup

  • \(P\) is the parallel portion of the program

  • \(N\) is the number of cores

We know that the new performance should be 1.2 times the initial performance. Therefore, we need to solve for \(N\) such that:

$$\text{Performance Increase} = 1.2$$

We need to solve for \(N\):

$$1.2 = \frac{1}{(1 - 0.2) + \frac{0.2}{N}}$$

$$1.2 = \frac{1}{0.8 + \frac{0.2}{N}}$$

Multiply both sides by \(0.8 + \frac{0.2}{N}\):

$$1.2 (0.8 + \frac{0.2}{N}) = 1$$

$$0.96 + \frac{0.24}{N} = 1$$

Subtract 0.96 from both sides:

$$\frac{0.24}{N} = 0.04$$

Solve for \(N\):

$$N = \frac{0.24}{0.04}$$

$$N = 6$$

To achieve a 20% increase in performance, the number of cores needs to be increased to 6. Hence, we need to add 2 cores at $10,000 per core, so the total cost is $20,000.

To determine the cost needed to achieve a 20% performance increase by changing the parallel portion \( P \) , we'll use Amdahl's Law.

Given:

  • Initial performance increase: 20%

  • Initial parallel portion \( P = 0.2 \)

  • Desired performance increase: 20%

  • Each 1% increase in parallelism costs $2000

A 20% increase in performance means the new performance is 1.2 times the original performance.

Amdahl's Law states:

$$\text{Performance Increase} = \frac{1}{(1 - P) + \frac{P}{N}}$$

The initial number of cores remains the same. We know the new performance increase should be 1.2 times the initial performance and \( N = 4 \) .

So, we need to solve for the new \( P \) (let's call it \( P_{\text{new}} \) ) such that:

$$1.2 = \frac{1}{(1 - P_{\text{new}}) + \frac{P_{\text{new}}}{4}}$$

Rewriting the equation:

$$1.2 = \frac{1}{(1 - P_{\text{new}}) + \frac{P_{\text{new}}}{4}}$$

Multiply both sides by \( (1 - P_{\text{new}}) + \frac{P_{\text{new}}}{4} \) :

$$1.2 \left( (1 - P_{\text{new}}) + \frac{P_{\text{new}}}{4} \right) = 1$$

$$0.2 = 0.9P_{\text{new}}$$

Solve for \( P_{\text{new}} \) :

Pnew=0.20.9 Pnew≈0.2222

Each 1% increase in parallelism costs $2000. The percentage increase in \( P \) is:

$$\Delta P \times 100 = 0.0222 \times 100 = 2.22\%$$

The cost for this increase is:

$$\text{Cost} = 2.22 \times 2000 = 4440$$

To achieve a 20% increase in performance by increasing the parallel portion \( P \) , the cost needed is $4440. This is the cheapest option.