
Computer Arithmetic: From Fast Adders to Floating-Point


Arithmetic is the beating heart of every processor. Whether your machine is rendering 3-D graphics, training a neural network, or simply incrementing a loop counter, the ALU (Arithmetic Logic Unit) is doing the heavy lifting. In this deep-dive, I want to walk you through every major arithmetic building block: how signed numbers are added and subtracted, how carries can be computed in parallel instead of rippling, how multiplication and division are performed in hardware, and finally how the IEEE 754 floating-point standard lets us represent and operate on real numbers.


1. Addition and Subtraction of Signed Numbers

1.1 The Half Adder and Full Adder

Let's start at the very bottom. The most primitive arithmetic circuit adds two single bits. A half adder produces a sum bit and a carry bit:

s = x XOR y
c = x AND y

Simple enough, right? But here's the thing -- in a multi-bit addition, every bit position (except the least significant) must also accept an incoming carry from the position to its right. That is the job of a full adder (FA).

1.2 Full Adder Truth Table

A full adder takes three inputs -- x_i, y_i, and the carry-in c_i -- and produces a sum s_i and carry-out c_{i+1}.

From the truth table we derive:

s_i     = x_i XOR y_i XOR c_i
c_{i+1} = x_i * y_i + x_i * c_i + y_i * c_i

The carry equation can also be written as:

c_{i+1} = x_i * y_i  +  (x_i XOR y_i) * c_i

This second form is important because it separates the carry into two pieces:

  • A generate term G_i = x_i * y_i -- a carry is produced regardless of the incoming carry.

  • A propagate term P_i = x_i XOR y_i -- an incoming carry is passed through to the next position.

Keep these in mind. They will become critically important when we build fast adders later.
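The full-adder equations above are small enough to check exhaustively. Here is a minimal Python sketch (function names are illustrative) that evaluates both forms of the carry expression and confirms they agree on every input:

```python
# A 1-bit full adder, written two ways to show that the two carry
# expressions from the text are equivalent.

def full_adder(x, y, c):
    """Return (sum, carry_out) for one bit position."""
    s = x ^ y ^ c
    carry_majority = (x & y) | (x & c) | (y & c)   # three-term form
    carry_gp = (x & y) | ((x ^ y) & c)             # generate/propagate form
    assert carry_majority == carry_gp              # the two forms agree
    return s, carry_gp

# Exhaustive check over all 8 input combinations.
for x in (0, 1):
    for y in (0, 1):
        for c in (0, 1):
            s, cout = full_adder(x, y, c)
            assert 2 * cout + s == x + y + c       # binary-weighted sum is right
```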

1.3 The Ripple-Carry Adder

To add two n-bit numbers X and Y, the most straightforward approach is to cascade n full adders. The carry-out of stage i becomes the carry-in of stage i+1. This is called a ripple-carry adder, because the carry "ripples" from the least significant bit all the way to the most significant bit.

For an n-bit ripple-carry adder, the worst-case delay is 2n gate delays (each full adder contributes 2 gate delays for the carry chain). As we will see shortly, this becomes the primary bottleneck in high-speed processors.

1.4 Addition and Subtraction of 2's Complement Numbers

Here's where things get elegant. In the 2's complement system, subtraction is performed by adding the complement. To compute X - Y:

  1. Take the 1's complement of Y (invert every bit).

  2. Add 1 (this converts the 1's complement to the 2's complement).

  3. Add the result to X.

The hardware achieves this beautifully using XOR gates controlled by a single Add/Sub signal:

When Add/Sub = 0 (addition): each XOR gate passes y_i unchanged, and c_0 = 0.

When Add/Sub = 1 (subtraction): each XOR gate inverts y_i (producing the 1's complement), and c_0 = 1 (adding 1 to convert to 2's complement).

Think about how clean this is -- the same n-bit adder handles both addition and subtraction with the flip of a single control line. No separate subtraction circuit needed.
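The XOR-gated add/subtract unit can be sketched in a few lines. This is an illustrative 4-bit model (the names `add_sub`, `MASK` are mine, not from the hardware description):

```python
# Shared add/subtract unit: XOR each y bit with the Add/Sub control line
# and feed the same control line into c_0. Illustrative 4-bit sketch.

N = 4
MASK = (1 << N) - 1

def add_sub(x, y, sub):
    """2's-complement add (sub=0) or subtract (sub=1) on N-bit operands."""
    y_xored = (y ^ (MASK if sub else 0)) & MASK  # XOR gates invert y when subtracting
    c0 = sub                                     # c_0 = 1 completes the 2's complement
    return (x + y_xored + c0) & MASK             # the same adder serves both operations

assert add_sub(0b0101, 0b0011, 0) == 0b1000      # 5 + 3 = 8
assert add_sub(0b0101, 0b0011, 1) == 0b0010      # 5 - 3 = 2
```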

1.5 Overflow Detection

When two numbers of the same sign are added and the result has the opposite sign, an overflow has occurred. The result is too large (or too negative) to fit in n bits.

The overflow condition is detected by:

Overflow = c_n XOR c_{n-1}

That is, overflow occurs when the carry into the sign-bit position differs from the carry out of the sign-bit position. Let me show you why with concrete examples:

Examples (4-bit 2's complement, range -8 to +7):

  0111  (+7)            1000  (-8)
+ 0001  (+1)          + 1111  (-1)
------                ------
  1000  (-8) ??        10111 --> 0111 (+7) ??

  c_4=0, c_3=1          c_4=1, c_3=0
  0 XOR 1 = 1           1 XOR 0 = 1
  OVERFLOW!              OVERFLOW!

In both cases the carries disagree, and the result is obviously wrong. The XOR of those two carries gives us a single-bit overflow flag.
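The two examples above can be reproduced by computing the two most-significant carries explicitly. A minimal 4-bit sketch (helper name is illustrative):

```python
# Overflow flag for 4-bit 2's-complement addition: c_4 XOR c_3.

N = 4

def add_with_overflow(x, y):
    low = (x & 0b0111) + (y & 0b0111)  # add the three low bits
    c3 = (low >> (N - 1)) & 1          # carry INTO the sign position
    total = x + y
    c4 = (total >> N) & 1              # carry OUT of the sign position
    return total & 0b1111, c3 ^ c4

assert add_with_overflow(0b0111, 0b0001) == (0b1000, 1)  # +7 + +1 overflows
assert add_with_overflow(0b1000, 0b1111) == (0b0111, 1)  # -8 + -1 overflows
assert add_with_overflow(0b0011, 0b0001) == (0b0100, 0)  # +3 + +1 is fine
```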

Key Takeaway: A single n-bit adder, augmented with XOR gates on the Y input and a control signal fed to c_0, serves as a complete add/subtract unit for 2's complement numbers. Overflow is detected by XOR-ing the two most-significant carries.


2. Design of Fast Adders

2.1 The Problem with Ripple Carry

The ripple-carry adder is simple but slow. For n bits, the carry must propagate through all n stages. The worst-case delay is:

Ripple-carry delay = 2n gate delays

For a 32-bit adder, that is 64 gate delays -- far too slow for a modern processor running at GHz clock speeds. We need a way to compute all the carries in parallel. Let's figure out how.

2.2 Carry-Lookahead Principle

Here's where it gets interesting. The key insight is to express carries as functions of the inputs alone, without waiting for the previous carry to arrive. Remember the generate and propagate signals I mentioned earlier? Let's put them to work. We define two signals for each bit position i:

Generate:   G_i = x_i AND y_i
Propagate:  P_i = x_i XOR y_i

G_i = 1 means position i generates a carry regardless of any incoming carry. P_i = 1 means position i propagates an incoming carry to the next position.

The carry recurrence becomes:

c_{i+1} = G_i + P_i * c_i

Now here's the magic. Let's expand this for positions 0 through 3 (a 4-bit block), assuming the carry into the block is c_0:

c_1 = G_0 + P_0 * c_0

c_2 = G_1 + P_1 * c_1
    = G_1 + P_1 * G_0 + P_1 * P_0 * c_0

c_3 = G_2 + P_2 * c_2
    = G_2 + P_2 * G_1 + P_2 * P_1 * G_0 + P_2 * P_1 * P_0 * c_0

c_4 = G_3 + P_3 * c_3
    = G_3 + P_3 * G_2 + P_3 * P_2 * G_1 + P_3 * P_2 * P_1 * G_0
      + P_3 * P_2 * P_1 * P_0 * c_0

Look at what we have done. Every carry is now expressed purely in terms of G_i, P_i, and c_0 -- no dependency on any previous carry! Since G_i and P_i are computed from x_i and y_i in one gate delay (AND and XOR respectively), and the expanded carry expressions require only two more levels of logic (AND then OR), all four carries are available in just 3 gate delays.

The sum bits require one additional XOR:

s_i = P_i XOR c_i

So all four sum bits are available in 4 gate delays -- regardless of the adder width (within the 4-bit block). Compare that to the 8 gate delays a 4-bit ripple-carry adder would need!
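The lookahead recurrence can be evaluated directly. In this sketch the Python loop simply unrolls c_{i+1} = G_i + P_i·c_i; in hardware those four expressions are flattened into the two-level AND-OR logic shown above, so no carry chain exists. Names are illustrative:

```python
# 4-bit carry-lookahead adder model: carries from G, P, and c0 only.

def cla4(x_bits, y_bits, c0):
    """x_bits/y_bits are lists [bit0..bit3]; returns (sum_bits, c4)."""
    G = [x & y for x, y in zip(x_bits, y_bits)]   # generate
    P = [x ^ y for x, y in zip(x_bits, y_bits)]   # propagate
    c = [c0]
    for i in range(4):
        c.append(G[i] | (P[i] & c[i]))            # c_{i+1} = G_i + P_i * c_i
    s = [P[i] ^ c[i] for i in range(4)]           # s_i = P_i XOR c_i
    return s, c[4]

s, c4 = cla4([1, 1, 1, 0], [1, 0, 0, 0], 0)       # 0111 + 0001
assert s == [0, 0, 0, 1] and c4 == 0              # sum = 1000, no carry out
```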

2.3 The 4-Bit Carry-Lookahead Adder

Let me lay out the architecture. A 4-bit CLA has three layers: a row of AND and XOR gates producing G_i and P_i from the inputs (1 gate delay), a two-level AND-OR network that evaluates the expanded expressions for c_1 through c_4 (2 more gate delays), and a final row of XOR gates forming the sums s_i = P_i XOR c_i (1 more gate delay).

2.4 Building a 16-Bit CLA Adder

Now let's scale up. We cascade four 4-bit CLA blocks. Each block also produces block-level generate and propagate signals:

Block 0 (bits 0-3):
  P'_0 = P_3 * P_2 * P_1 * P_0
  G'_0 = G_3 + P_3*G_2 + P_3*P_2*G_1 + P_3*P_2*P_1*G_0

Block 1 (bits 4-7):
  P'_1 = P_7 * P_6 * P_5 * P_4
  G'_1 = G_7 + P_7*G_6 + P_7*P_6*G_5 + P_7*P_6*P_5*G_4

Block 2 (bits 8-11):
  P'_2 = P_11 * P_10 * P_9 * P_8
  G'_2 = (same pattern for bits 8-11)

Block 3 (bits 12-15):
  P'_3 = P_15 * P_14 * P_13 * P_12
  G'_3 = (same pattern for bits 12-15)

A second-level CLA circuit takes G'_k and P'_k (k = 0,1,2,3) along with c_0 and produces the inter-block carries c_4, c_8, c_12, and c_16 using the exact same lookahead equations:

c_4  = G'_0 + P'_0 * c_0
c_8  = G'_1 + P'_1 * G'_0 + P'_1 * P'_0 * c_0
c_12 = G'_2 + P'_2 * G'_1 + P'_2 * P'_1 * G'_0 + P'_2 * P'_1 * P'_0 * c_0
c_16 = G'_3 + P'_3 * G'_2 + P'_3 * P'_2 * G'_1 + P'_3 * P'_2 * P'_1 * G'_0
       + P'_3 * P'_2 * P'_1 * P'_0 * c_0

Here is the full 16-bit structure: four 4-bit CLA blocks operate in parallel, and the second-level lookahead circuit receives all G'_k, P'_k, and c_0, then feeds the inter-block carries c_4, c_8, and c_12 back to blocks 1, 2, and 3 (and produces c_16).

Delay analysis for the 16-bit CLA:

  • G_i, P_i available after 1 gate delay
  • Block signals G'_k, P'_k after 3 gate delays
  • Inter-block carries c_4, c_8, c_12, c_16 after 5 gate delays
  • Carries internal to each block after 7 gate delays
  • All sum bits after 8 gate delays

The 16-bit CLA adder produces all outputs in 8 gate delays, compared to 32 gate delays for a 16-bit ripple-carry adder. That is a 4x speedup.

2.5 Building a 64-Bit CLA Adder

The same principle extends naturally to a third level. We group four 16-bit CLA blocks. Each produces third-level G'' and P'' signals. A third-level CLA circuit generates the carries c_16, c_32, c_48, and c_64.

64-bit CLA: 3 levels of lookahead

Level 1: Bit-level      G_i, P_i       --> 4-bit block carries
Level 2: Block-level     G'_k, P'_k    --> 16-bit section carries
Level 3: Section-level   G''_j, P''_j  --> 64-bit carries (c_16, c_32, c_48, c_64)

Delay: 12 gate delays for s_63
        7 gate delays for c_64

Compare this to a 64-bit ripple-carry adder: 128 gate delays. The three-level CLA is roughly an order of magnitude faster. The pattern is clear -- each additional level of lookahead doubles the width we can handle while adding only a constant delay.

Key Takeaway: Carry-Lookahead Adders eliminate the serial carry chain by expressing each carry as a sum-of-products function of generate and propagate signals. A 4-bit CLA takes 4 gate delays; a 16-bit CLA takes 8; a 64-bit CLA takes 12. The principle can be extended to any width by adding more levels of lookahead.


3. Multiplication of Positive Numbers

3.1 Manual Binary Multiplication

Let's explore multiplication. Binary multiplication follows the same longhand procedure as decimal multiplication, but it is simpler because each multiplier bit is either 0 or 1. No multiplication table needed -- you either copy the multiplicand or you write zero.

Example: 13 x 11 = 143

          1 1 0 1       (13)  Multiplicand M
        x 1 0 1 1       (11)  Multiplier Q
        ---------
          1 1 0 1       (M x q_0, q_0=1)
        1 1 0 1 .       (M x q_1, q_1=1, shifted left 1)
      0 0 0 0 . .       (M x q_2, q_2=0, shifted left 2)
    1 1 0 1 . . .       (M x q_3, q_3=1, shifted left 3)
    -----------------
    1 0 0 0 1 1 1 1     (143)  Product P

The product of two n-bit numbers can be up to 2n bits long. Each partial product row is just the multiplicand AND-ed with a single multiplier bit, shifted to the appropriate position. The final product is the sum of all partial products.

3.2 Array Multiplier

The manual algorithm maps directly to a combinational array multiplier. For an n x n multiplier, the main component in each cell is a full adder. The AND gate in each cell determines whether a multiplicand bit m_j is added to the incoming partial product, based on the value of multiplier bit q_i.

Array multiplier for 4-bit operands (M x Q = P):

  Partial product 0:     0      m3*q0   m2*q0   m1*q0   m0*q0
  Partial product 1:            m3*q1   m2*q1   m1*q1   m0*q1
  Partial product 2:     m3*q2  m2*q2   m1*q2   m0*q2
  Partial product 3:  m3*q3     m2*q3   m1*q3   m0*q3

  ──────────────────────────────────────────────────────────
  Product:             p7  p6   p5      p4      p3  p2  p1  p0

Each cell contains:

  • An AND gate that computes m_j * q_i

  • A full adder that adds this bit product to the incoming partial product bit and the carry from the right

The worst-case delay path runs diagonally from the upper right corner to the lower left (the staircase pattern), giving a total delay of approximately 6(n-1) - 1 gate delays for an n x n array.

3.3 Sequential Multiplication Hardware

For large operands (32-bit or 64-bit), a full array multiplier uses too many gates. Instead, multiplication is performed sequentially using a single n-bit adder and shift registers. Think of it this way: instead of building all the partial product rows at once, we process them one at a time.

Algorithm (n iterations):

  1. Initialize: A = 0, C = 0, Q = multiplier, M = multiplicand

  2. For i = 0 to n-1:

    • If q_0 = 1: Add M to A, store carry in C (i.e., C,A = A + M)

    • If q_0 = 0: No addition

    • Shift right C, A, Q (C goes into MSB of A, LSB of A goes into MSB of Q, LSB of Q is discarded)

  3. After n iterations, the 2n-bit product is in A (high half) and Q (low half)

Notice the clever trick here: as Q is shifted right, its bits are consumed one by one (q_0 is examined then shifted out), and the high-order bits of the product fill in from the left. The multiplier is gradually replaced by the product!
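The C/A/Q register algorithm above can be modeled directly. This is an illustrative software sketch of the register behavior, not HDL; register names mirror the text:

```python
# Shift-and-add multiplier: n add/shift cycles with one n-bit adder.

def sequential_multiply(m, q, n=4):
    a, c = 0, 0
    mask = (1 << n) - 1
    for _ in range(n):
        if q & 1:                       # examine q_0
            total = a + m
            c = (total >> n) & 1        # carry out of the n-bit adder
            a = total & mask
        # shift C,A,Q right: C -> MSB of A, LSB of A -> MSB of Q
        q = (q >> 1) | ((a & 1) << (n - 1))
        a = (a >> 1) | (c << (n - 1))
        c = 0
    return (a << n) | q                 # product: A is the high half, Q the low

assert sequential_multiply(13, 11) == 143
```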

3.4 Worked Example: 13 x 11 = 143

Using 4-bit registers (M = 1101, Q = 1011):

  Initial:                C=0  A=0000  Q=1011
  Cycle 1: q_0=1, add     C=0  A=1101  Q=1011
           shift right    C=0  A=0110  Q=1101
  Cycle 2: q_0=1, add     C=1  A=0011  Q=1101
           shift right    C=0  A=1001  Q=1110
  Cycle 3: q_0=0, no add
           shift right    C=0  A=0100  Q=1111
  Cycle 4: q_0=1, add     C=1  A=0001  Q=1111
           shift right    C=0  A=1000  Q=1111

  Product = A,Q = 1000 1111 = 143. Correct!

Key Takeaway: Sequential multiplication uses a single n-bit adder and performs n add-and-shift cycles. Each cycle examines the LSB of Q, conditionally adds M, then shifts the combined C-A-Q register right by one position. After n cycles, the 2n-bit product occupies registers A and Q.


4. Signed-Operand Multiplication

4.1 The Problem with Signed Numbers

The shift-and-add algorithm above works only for positive (unsigned) numbers. When one or both operands are negative (in 2's complement), we need a different approach.

Case 1: Positive multiplier, negative multiplicand. When adding a negative multiplicand to a partial product, we must sign-extend the multiplicand to the full 2n-bit width of the product to get the correct result.

Case 2: Negative multiplier. A straightforward solution is to complement both operands (making the multiplier positive) and proceed as before. But there is a much more elegant technique that handles all cases uniformly.

4.2 Booth's Algorithm

The Booth algorithm is one of those ideas that feels like magic the first time you see it. It generates a 2n-bit product and treats both positive and negative 2's-complement n-bit operands uniformly -- no special cases needed.

Core idea: Instead of scanning multiplier bits one at a time and always adding, Booth's algorithm scans pairs of consecutive bits (q_i and q_{i-1}) and decides whether to add, subtract, or do nothing.

The algorithm is based on a beautiful observation: a block of consecutive 1s in the multiplier can be replaced by a subtract at the beginning and an add at the end. For example:

0011110  can be rewritten as  0100000 - 0000010
  (30)                          (32)      (2)

This works because 30 = 32 - 2. Instead of four additions (one for each '1' bit), we perform just one subtraction and one addition. For long runs of 1s, the savings can be dramatic.

4.3 Booth Recoding Table

We scan the multiplier from right to left, examining bit pairs (q_i, q_{i-1}), starting with q_{-1} = 0:

  q_i   q_{i-1}  |  Version of multiplicand selected
  ─────────────────────────────────────────────────
   0       0     |   0 x M   (no operation)
   0       1     |  +1 x M   (add M)
   1       0     |  -1 x M   (subtract M)
   1       1     |   0 x M   (no operation)

Interpretation -- think of the transitions seen while scanning right to left:

  • (q_i, q_{i-1}) = (0, 0): middle of a block of 0s. Do nothing.

  • (q_i, q_{i-1}) = (0, 1): end of a block of 1s. Add M.

  • (q_i, q_{i-1}) = (1, 0): beginning of a block of 1s. Subtract M.

  • (q_i, q_{i-1}) = (1, 1): middle of a block of 1s. Do nothing.

4.4 Booth's Algorithm Hardware

The hardware is similar to the basic sequential multiplier, with these additions:

  • An extra flip-flop q_{-1} (initialized to 0) to hold the previous bit of Q

  • The adder must be able to add or subtract M (2's complement subtraction)

  • The shift is an arithmetic right shift (the sign bit is replicated, not filled with zero)

Algorithm steps (repeat n times):

  1. Examine q_0 and q_{-1}:

    • (0, 0) or (1, 1): No arithmetic operation

    • (0, 1): A = A + M

    • (1, 0): A = A - M

  2. Arithmetic shift right A, Q, q_{-1} (sign bit of A is preserved and replicated)
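The steps above can be sketched as a small register model. A, Q, and q_{-1} are plain integers here, and the function name is illustrative; the arithmetic right shift replicates A's sign bit exactly as the hardware does:

```python
# Booth's algorithm on n-bit 2's-complement operands.

def booth_multiply(m, q, n=4):
    mask = (1 << n) - 1
    m &= mask
    a, q, q_m1 = 0, q & mask, 0
    for _ in range(n):
        pair = (q & 1, q_m1)
        if pair == (0, 1):
            a = (a + m) & mask               # end of a block of 1s: add M
        elif pair == (1, 0):
            a = (a - m) & mask               # start of a block of 1s: subtract M
        q_m1 = q & 1
        sign = (a >> (n - 1)) & 1            # arithmetic shift: replicate sign
        q = (q >> 1) | ((a & 1) << (n - 1))
        a = (a >> 1) | (sign << (n - 1))
    product = (a << n) | q                   # 2n-bit 2's-complement result
    return product - (1 << 2 * n) if product >> (2 * n - 1) else product

assert booth_multiply(5, -4) == -20          # matches the worked example in 4.5
assert booth_multiply(-3, -3) == 9
```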

4.5 Worked Example: Booth's Algorithm

Example: (+5) x (-4) = -20 using Booth's Algorithm (4-bit operands)

M = 0101 (+5), Q = 1100 (-4), with q_{-1} = 0:

  Initial:                      A=0000  Q=1100  q_{-1}=0
  Cycle 1: (q_0,q_{-1})=(0,0)   no operation
           arithmetic shift     A=0000  Q=0110  q_{-1}=0
  Cycle 2: (0,0)                no operation
           arithmetic shift     A=0000  Q=0011  q_{-1}=0
  Cycle 3: (1,0)                A = A - M:  A=1011
           arithmetic shift     A=1101  Q=1001  q_{-1}=1
  Cycle 4: (1,1)                no operation
           arithmetic shift     A=1110  Q=1100  q_{-1}=1

  Product = A,Q = 1110 1100

Verification: 11101100 in 8-bit 2's complement: = -128 + 64 + 32 + 0 + 8 + 4 + 0 + 0 = -128 + 108 = -20.

The important thing is that Booth's algorithm correctly handles all four sign combinations:

  • Positive x Positive

  • Positive x Negative

  • Negative x Positive

  • Negative x Negative

In all cases, the algorithm produces the correct 2n-bit 2's complement product without any special-case handling.

Key Takeaway: Booth's algorithm handles signed multiplication by scanning pairs of multiplier bits (q_i, q_{i-1}) and performing add, subtract, or no-op accordingly. It skips over blocks of consecutive 1s and 0s, potentially reducing the number of additions. It works correctly for both positive and negative operands in 2's complement.


5. Fast Multiplication

5.1 Bit-Pair Recoding (Modified Booth Algorithm)

Booth's algorithm has a weakness: in the worst case (alternating 1s and 0s like 010101...), every bit position requires an operation, giving no speedup over the basic algorithm.

Bit-pair recoding (also called Modified Booth or Booth-2) fixes this by grouping the Booth-recoded bits in pairs, guaranteeing that the number of summands is halved to at most n/2 for n-bit operands.

The bit-pair recoding table considers three consecutive bits (the pair at positions i+1 and i, plus the bit at i-1) to determine a single multiplicand selection:

  q_{i+1}  q_i  q_{i-1}  |  Multiplicand selected
  ───────────────────────────────────────────────
     0      0      0     |   0 x M
     0      0      1     |  +1 x M
     0      1      0     |  +1 x M
     0      1      1     |  +2 x M
     1      0      0     |  -2 x M
     1      0      1     |  -1 x M
     1      1      0     |  -1 x M
     1      1      1     |   0 x M

The possible multiplicand selections are: 0, +M, -M, +2M, -2M

Since 2M is just M shifted left by one position, all five cases are easily implementable in hardware -- no actual multiplication needed!

5.2 Bit-Pair Recoding Example

Let's apply this to the multiplication (+13) x (-6):

M = 0 1 1 0 1    (+13)
Q = 1 1 0 1 0    (-6)

Step 1: Apply Booth recoding to Q (with implied q_{-1} = 0):

  Q bits:      1     1     0     1     0    [0]
  Booth:       0    -1    +1    -1      0

Step 2: Use the 3-bit grouping from the original multiplier:

  Group 1: q_1, q_0, q_{-1} = 1, 0, 0  -->  -2 x M  at position 0
  Group 2: q_3, q_2, q_1    = 1, 0, 1  -->  -1 x M  at position 2
  Remaining: (sign, q_4, q_3) = (1, 1, 1)  -->   0 x M  at position 4

Bit-pair recoded multiplier selects:  0,  -1,  -2
                          (at positions 4,   2,   0)

Step 3: Form the summands:

  At position 0:  -2 x M = -2 x 13 = -26
  Sign-extended:  1 1 1 1 1 0 0 1 1 0

  At position 2:  -1 x M = -1 x 13 = -13, shifted left 2
  Sign-extended:  1 1 1 1 0 0 1 1 0 0    (that is, -52)

  At position 4:   0 x M = 0
                   0 0 0 0 0 0 0 0 0 0

Step 4: Add the summands:

    1 1 1 1 1 0 0 1 1 0    (-26)
  + 1 1 1 1 0 0 1 1 0 0    (-52)
  + 0 0 0 0 0 0 0 0 0 0    (  0)
  ──────────────────────
    1 1 1 0 1 1 0 0 1 0    (-78)

Only n/2 = 2 non-zero summands instead of up to n with basic Booth!

Verification: -26 + (-52) = -78 = (+13) x (-6). Correct!

5.3 Carry-Save Addition (CSA)

Even with bit-pair recoding reducing the number of summands to n/2, we still need to add them together. In a straightforward approach, adding k summands with ripple-carry adders would take O(k * n) time. Carry-Save Addition dramatically reduces this.

Idea: Instead of letting carries ripple within each addition, we "save" the carries and pass them to the next level. A Carry-Save Adder takes three input vectors and produces two output vectors (a sum vector S and a carry vector C) in the time of a single full-adder delay.

Think of it this way: a CSA is really just a row of independent full adders -- no carry chain at all! Each full adder works independently, producing a sum and a carry that gets passed to the next level rather than to the next position.
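Because each full adder works independently, a CSA is one line of bitwise logic. This sketch (illustrative names) reduces three addends to two whose ordinary sum is unchanged, then repeats until two remain:

```python
# Carry-save adder: three vectors in, sum vector and shifted carry vector out.

def carry_save_add(x, y, z):
    s = x ^ y ^ z                              # bitwise sums, no propagation
    c = ((x & y) | (x & z) | (y & z)) << 1     # carries move one position left
    return s, c

def csa_tree(addends):
    """Reduce k addends to 2 with CSA steps, then one ordinary addition."""
    while len(addends) > 2:
        s, c = carry_save_add(addends[0], addends[1], addends[2])
        addends = [s, c] + addends[3:]
    return sum(addends)                        # final add: a fast CLA in hardware

s, c = carry_save_add(13, 11, 7)
assert s + c == 13 + 11 + 7                    # the pair preserves the full sum
assert csa_tree([1, 2, 3, 4, 5, 6]) == 21
```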

5.4 CSA Tree for Multiple Summands

To add k summands using CSA:

  1. Group summands in threes

  2. Apply CSA to each group --> produces 2/3 as many vectors

  3. Repeat until only 2 vectors remain

  4. Add the final 2 vectors with a fast (CLA) adder

Delay analysis for 6 summands:

  • 3 levels of CSA: 3 x 2 = 6 gate delays

  • Final CLA addition: ~8 gate delays (for 12-bit result)

  • Total: ~14 gate delays

Compare to the array multiplier for 6 x 6: 6(6-1)-1 = 29 gate delays. The CSA approach roughly halves the delay.

General formula: For k summands, the number of CSA levels is approximately 1.7 * log_2(k) - 1.7. With bit-pair recoding reducing summands to n/2, this becomes 1.7 * log_2(n) - 3.4 CSA levels.

For a 32 x 32 multiplication: 16 summands (after bit-pair recoding), approximately 8 CSA levels, then one CLA addition. Total delay of about 29 gate delays, compared to 185 gate delays for a 32 x 32 array multiplier. That is over 6x faster.

Key Takeaway: Bit-pair recoding halves the number of summands to n/2 by combining consecutive Booth-recoded bits. Carry-Save Addition then adds these summands in a tree structure, reducing k summands to 2 vectors in approximately 1.7*log2(k) levels. The final two vectors are added by a CLA adder. Together, these techniques make multiplication dramatically faster than the naive sequential or array approaches.


6. Integer Division

Division is the inverse of multiplication. Given a dividend X and a divisor M, we seek a quotient Q and a remainder R such that X = Q * M + R. Let's see how the hardware does this.

6.1 Longhand Division Example

Just like multiplication, let's start with the manual process:

  Decimal:                    Binary:

         21                          10101
    13 ) 274                  1101 ) 100010010
         26                          1101
         --                          ─────
         14                          10000
         13                          1101
         --                          ─────
          1                           1110
                                      1101
                                      ─────
                                         1

  274 / 13 = 21 remainder 1      100010010 / 1101 = 10101 remainder 1

In binary division, each quotient bit is either 0 or 1. We try to subtract the divisor from the current remainder. If the result is non-negative, the quotient bit is 1; otherwise, it is 0. Simpler than decimal, where you have to guess the right digit from 0-9.

6.2 Restoring Division

Restoring division is the straightforward hardware implementation of longhand division.

Restoring Division Algorithm:

Load the n-bit positive divisor into M and the n-bit positive dividend into Q. Set A = 0. Do the following n times:

  1. Shift A and Q left one binary position.

  2. Subtract M from A and place the result back in A.

  3. If the sign of A is 1 (negative result):

    • Set q_0 = 0

    • Restore: Add M back to A (undo the subtraction)

  4. Otherwise (A >= 0):

    • Set q_0 = 1

After n cycles, Q contains the quotient and A contains the remainder.
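The algorithm maps directly to a short register model (illustrative sketch; operands are small unsigned ints):

```python
# Restoring division: one shift and one trial subtraction per cycle.

def restoring_divide(dividend, divisor, n=4):
    a, q, m = 0, dividend, divisor
    for _ in range(n):
        a = (a << 1) | ((q >> (n - 1)) & 1)  # 1. shift A,Q left one position
        q = (q << 1) & ((1 << n) - 1)
        a -= m                               # 2. trial subtraction
        if a < 0:
            a += m                           # 3. negative: restore, q_0 = 0
        else:
            q |= 1                           # 4. non-negative: q_0 = 1
    return q, a                              # quotient, remainder

assert restoring_divide(8, 3) == (2, 2)      # 8 = 2 x 3 + 2
```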

6.3 Restoring Division Worked Example

Let's divide 8 by 3: Dividend Q = 1000 (8), Divisor M = 00011 (3), A = 00000.

  Cycle 1: shift        A=00001  Q=0000
           A = A - M    A=11110  (negative)      --> q_0=0, restore A=00001
  Cycle 2: shift        A=00010  Q=0000
           A = A - M    A=11111  (negative)      --> q_0=0, restore A=00010
  Cycle 3: shift        A=00100  Q=0000
           A = A - M    A=00001  (non-negative)  --> q_0=1          Q=0001
  Cycle 4: shift        A=00010  Q=0010
           A = A - M    A=11111  (negative)      --> q_0=0, restore A=00010

  Quotient Q = 0010 (2), Remainder A = 00010 (2).   8 = 2 x 3 + 2. Correct!

6.4 Non-Restoring Division

The restoring step wastes time because we subtract M and then (if the result is negative) immediately add M back. Non-restoring division avoids this by modifying the next iteration instead.

Key insight: If A is negative after a subtraction, instead of restoring and then shifting and subtracting in the next cycle (which is equivalent to computing 2(A+M) - M = 2A + M), we can simply shift and add in the next cycle. The math works out to the same thing, but we save an entire add operation per cycle.

Non-Restoring Division Algorithm:

Step 1: Do the following n times:

  1. If the sign of A is 0 (positive): shift A,Q left and subtract M from A. If the sign of A is 1 (negative): shift A,Q left and add M to A.

  2. If the sign of A is now 0: set q_0 = 1. Otherwise: set q_0 = 0.

Step 2: If the sign of A is 1 (negative remainder), add M to A to get the correct positive remainder.
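A sketch of the same algorithm (illustrative names; Python's signed ints stand in for the A register's sign bit):

```python
# Non-restoring division: exactly one add OR subtract per cycle,
# plus at most one final correction.

def nonrestoring_divide(dividend, divisor, n=4):
    a, q, m = 0, dividend, divisor
    for _ in range(n):
        positive = a >= 0                    # test A's sign before the shift
        a = (a << 1) | ((q >> (n - 1)) & 1)  # shift A,Q left
        q = (q << 1) & ((1 << n) - 1)
        a = a - m if positive else a + m     # subtract when A >= 0, else add
        if a >= 0:
            q |= 1                           # q_0 = 1 when the new A is non-negative
    if a < 0:
        a += m                               # final correction of the remainder
    return q, a

assert nonrestoring_divide(8, 3) == (2, 2)
```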

6.5 Non-Restoring Division Worked Example

Same example: Divide 8 by 3. M = 00011, Q = 1000, A = 00000.

  Cycle 1: A >= 0: shift, subtract   A=00001-00011=11110   q_0=0   Q=0000
  Cycle 2: A < 0:  shift, add        A=11100+00011=11111   q_0=0   Q=0000
  Cycle 3: A < 0:  shift, add        A=11110+00011=00001   q_0=1   Q=0001
  Cycle 4: A >= 0: shift, subtract   A=00010-00011=11111   q_0=0   Q=0010

  Final correction: A is negative, so A = 11111 + 00011 = 00010.
  Quotient Q = 0010 (2), Remainder A = 00010 (2). Correct!

Advantage of non-restoring division: Exactly one addition or subtraction per cycle (never both). The restoring algorithm requires up to 2 operations per cycle (subtract, then restore). Non-restoring division is therefore faster, at the cost of the single final correction step.

Key Takeaway: Restoring division subtracts the divisor and restores (adds back) if the result is negative. Non-restoring division skips the restore step; if the remainder is negative, the next cycle adds instead of subtracting. Non-restoring division performs exactly one add/subtract per cycle and requires at most one final correction.


7. Floating-Point Numbers and Operations

7.1 Why Floating-Point?

Let's shift gears and talk about real numbers. Fixed-point integers (with the binary point at the right end) can represent values in the range 0 to approximately 2^n for unsigned n-bit numbers. Even with 32 bits, the range is only about 0 to 4.3 billion. This is insufficient for scientific computing, which routinely encounters numbers like Avogadro's number (6.0247 x 10^23) or Planck's constant (6.6254 x 10^-27).

Floating-point representation solves this by allowing the binary point to "float." A number is represented by:

  • A sign bit

  • An exponent (the scale factor)

  • A mantissa (the significant digits, also called the significand)

Value = (-1)^sign  x  mantissa  x  2^exponent

Think of it like scientific notation in binary. Just as we write 6.022 x 10^23 in decimal, we write something like 1.0110 x 2^5 in binary.

7.2 IEEE 754 Standard

The IEEE 754 standard (adopted in 1985, revised in 2008 and 2019) defines two primary formats that are used universally today:

Single Precision (32 bits):

  | S (1 bit) | E (8 bits) | M (23 bits) |
    sign        biased exponent   fraction (mantissa)

Double Precision (64 bits):

  | S (1 bit) | E (11 bits) | M (52 bits) |

7.3 Bias and Normalized Form

The exponent is stored in biased (also called "excess") notation. For single precision, the bias is 127. For double precision, the bias is 1023.

Stored exponent E_stored = E_actual + bias

Single: E_stored = E_actual + 127
Double: E_stored = E_actual + 1023

Why biased? Because it makes comparing floating-point numbers easier -- you can almost compare them as unsigned integers, which simplifies hardware.

The normalized form requires the mantissa to have the form 1.fraction. Since the leading bit is always 1 for a normalized number, it is not stored explicitly -- it is the hidden bit (also called the implied bit). This gives one extra bit of precision for free!

Value = (-1)^S  x  1.M  x  2^(E - bias)

where M is the 23-bit (or 52-bit) stored mantissa
and E is the 8-bit (or 11-bit) stored exponent

Example: Let's represent -12.625 in IEEE 754 single precision.

Step 1: Convert to binary
  12    = 1100 (binary)
  0.625 = 0.101 (binary):  0.625 x 2 = 1.25  --> 1
                            0.25  x 2 = 0.5   --> 0
                            0.5   x 2 = 1.0   --> 1
  So 12.625 = 1100.101

Step 2: Normalize
  1100.101 = 1.100101 x 2^3

Step 3: Determine fields
  Sign:     1 (negative)
  Exponent: 3 + 127 = 130 = 10000010
  Mantissa: 10010100000000000000000  (23 bits, dropping the leading 1)

Step 4: Assemble
  IEEE 754: 1  10000010  10010100000000000000000
          = C14A 0000 (hex)
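We can cross-check the hand assembly against Python's `struct` module, which packs values in IEEE 754 single precision:

```python
# Verify the encoding of -12.625 bit field by bit field.
import struct

bits = struct.unpack(">I", struct.pack(">f", -12.625))[0]
assert bits == 0xC14A0000                              # the assembled word
assert bits >> 31 == 1                                 # sign bit
assert (bits >> 23) & 0xFF == 130                      # biased exponent 3 + 127
assert bits & 0x7FFFFF == 0b10010100000000000000000    # 23-bit stored mantissa
```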

7.4 Ranges and Special Values

Single precision ranges:

Smallest normalized positive:  1.0 x 2^(-126)             ~ 1.18 x 10^(-38)
Largest normalized positive:   (2 - 2^(-23)) x 2^(127)    ~ 3.40 x 10^(38)
Precision:                     ~7 decimal digits

Double precision ranges:

Smallest normalized positive:  1.0 x 2^(-1022)            ~ 2.23 x 10^(-308)
Largest normalized positive:   (2 - 2^(-52)) x 2^(1023)   ~ 1.80 x 10^(308)
Precision:                     ~16 decimal digits

IEEE 754 reserves certain bit patterns for special values:

  Exponent field   Mantissa field  |  Meaning
  ──────────────────────────────────────────────────────────────
  All 0s           0               |  +/- 0 (zero)
  All 0s           nonzero         |  Denormalized (subnormal)
  All 1s           0               |  +/- Infinity
  All 1s           nonzero         |  NaN (Not a Number)

Let me explain each special case:

Denormalized numbers (also called subnormal numbers) fill the gap between zero and the smallest normalized number. They have the form 0.M x 2^(-126) (single) or 0.M x 2^(-1022) (double), where the implicit leading bit is 0 instead of 1. This provides gradual underflow rather than an abrupt jump to zero. Without denormals, the gap between zero and the smallest representable number would be much larger than the gaps between other neighboring representable numbers.

NaN (Not a Number) is the result of undefined operations like 0/0 or sqrt(-1). There are two kinds: signaling NaN (causes an exception when used) and quiet NaN (propagates silently through calculations).

Infinity results from overflow or division by zero. Arithmetic with infinity follows mathematical rules: infinity + finite = infinity, finite / infinity = 0, infinity / infinity = NaN, and so on.

7.5 Floating-Point Addition and Subtraction

Adding two floating-point numbers is considerably more complex than integer addition. You cannot just add the bit patterns. The algorithm involves four steps:

To compute A + B where A = M_A x 2^(E_A) and B = M_B x 2^(E_B):

Step 1: Compare exponents
  Compute d = E_A - E_B

Step 2: Align mantissas
  Shift the mantissa of the smaller number RIGHT by |d| positions
  (This aligns the binary points so we are adding like quantities)

Step 3: Add (or subtract) the aligned mantissas
  M_result = M_A +/- M_B (shifted)
  E_result = max(E_A, E_B)

Step 4: Normalize the result
  If the result mantissa is not in the form 1.xxxxx:
    - If >= 2: shift mantissa RIGHT, increment exponent
    - If < 1:  shift mantissa LEFT,  decrement exponent
  Also: check for exponent overflow/underflow
  Round the mantissa to fit in the available bits
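The four steps can be modeled on scaled-integer significands with unbiased exponents. This is a deliberately minimal sketch: signs, rounding, and special values are ignored, and all names are illustrative:

```python
# Align, add, normalize -- the core of floating-point addition.

PREC = 23  # fraction bits, as in single precision

def fp_add(ea, ma, eb, mb):
    """ma, mb are significands scaled by 2**PREC (so 1.5 -> 3 << 22)."""
    if ea < eb:                              # step 1: compare exponents;
        ea, ma, eb, mb = eb, mb, ea, ma      #         make A the larger one
    mb >>= ea - eb                           # step 2: align the smaller mantissa
    m, e = ma + mb, ea                       # step 3: add, keep larger exponent
    while m >= 2 << PREC:                    # step 4: mantissa >= 2 -> shift right
        m >>= 1
        e += 1
    while m and m < 1 << PREC:               #         mantissa < 1 -> shift left
        m <<= 1
        e -= 1
    return e, m

# 1.5 (= 1.1 x 2^0) + 0.875 (= 1.11 x 2^-1) = 1.0011 x 2^1 = 2.375
assert fp_add(0, 3 << (PREC - 1), -1, 7 << (PREC - 2)) == (1, 0b10011 << (PREC - 4))
```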

Detailed Example: Add 1.5 + 0.875 in IEEE 754

A = 1.5   = 1.1   x 2^0     -->  S=0, E=127, M=10000...
B = 0.875 = 1.11  x 2^(-1)  -->  S=0, E=126, M=11000...

Step 1: Compare exponents
  d = 127 - 126 = 1
  B has the smaller exponent, so we shift B's mantissa

Step 2: Align mantissas (shift B right by 1)
  B mantissa: 1.110...  --> shifted right 1 --> 0.111...

Step 3: Add mantissas
      1.100 0000...     (A)
    + 0.111 0000...     (B, shifted)
    ────────────────
     10.011 0000...

Step 4: Normalize
  Result = 10.011 x 2^0 = 1.0011 x 2^1
  New exponent = 0 + 1 = 1, stored as 128

Final: S=0, E=128=10000000, M=00110000...0
Value = 1.0011 x 2^1 = 10.011 = 2 + 0.25 + 0.125 = 2.375

Verify: 1.5 + 0.875 = 2.375   Correct!

7.6 Floating-Point Multiplication

Here's where it gets interesting -- floating-point multiplication is actually simpler than addition! No alignment step is needed.

To compute A x B:

Step 1: Multiply the mantissas
  M_result = M_A x M_B    (using integer multiplication techniques)

Step 2: Add the exponents
  E_result = E_A + E_B - bias
  (We subtract the bias once because both exponents include it)

Step 3: Determine the sign
  S_result = S_A XOR S_B

Step 4: Normalize
  If the mantissa product >= 2: shift right, increment exponent
  Round to fit available bits
  Check for exponent overflow/underflow
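The same scaled-integer model works for multiplication: multiply the significands, add the exponents, renormalize. Here the exponents are already unbiased, so no bias subtraction appears; with stored (biased) exponents you would subtract the bias once, as in Step 2. Sign handling and rounding are omitted; names are illustrative:

```python
# Multiply significands, add exponents, renormalize.

PREC = 23  # fraction bits, as in single precision

def fp_mul(ea, ma, eb, mb):
    m = (ma * mb) >> PREC          # step 1: significand product, rescaled
    e = ea + eb                    # step 2: add unbiased exponents
    while m >= 2 << PREC:          # step 4: renormalize if the product >= 2
        m >>= 1
        e += 1
    return e, m

# 1.5 (= 1.1 x 2^0) x 2.5 (= 1.01 x 2^1) = 1.111 x 2^1 = 3.75
assert fp_mul(0, 3 << (PREC - 1), 1, 5 << (PREC - 2)) == (1, 15 << (PREC - 3))
```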

Example: Multiply 1.5 x 2.5

A = 1.5   = 1.1   x 2^0     -->  E_A = 127
B = 2.5   = 1.01  x 2^1     -->  E_B = 128

Step 1: Multiply mantissas
  1.1 x 1.01:
        1.1 0
      x 1.0 1
      -------
        1 1 0       (1.10 x 1)
      0 0 0 .       (1.10 x 0, shifted)
    1 1 0 . .       (1.10 x 1, shifted)
    ---------
    1.1 1 1 0       Result: 1.111

Step 2: Add exponents
  E = 127 + 128 - 127 = 128   (subtract bias once)

Step 3: Sign = 0 XOR 0 = 0 (positive)

Step 4: 1.111 is already normalized (between 1 and 2)

Result: 1.111 x 2^1 = 11.11 = 3.75

Verify: 1.5 x 2.5 = 3.75   Correct!

7.7 Guard Bits and Rounding

When shifting the mantissa during alignment or normalization, bits may be shifted off the right end and lost. To maintain accuracy, the hardware uses extra bits beyond the 23 (or 52) mantissa bits during computation.

The IEEE standard requires that the result of any arithmetic operation be the same as if it were computed with infinite precision and then rounded. To achieve this, implementations typically use:

  • Guard bit (G): One extra bit immediately to the right of the mantissa

  • Round bit (R): One more bit to the right of the guard bit

  • Sticky bit (S): The OR of all bits shifted beyond the round bit position

The sticky bit is especially clever: it tells you whether ANY bits were lost during shifting, without needing to track all of them individually.

7.8 Rounding Methods

IEEE 754 defines four rounding modes:

1. Round to Nearest Even (default): The most commonly used mode. Round to the nearest representable value. In case of a tie (exactly halfway), round to the value whose least significant bit is 0 (i.e., even).

Examples (rounding to integer for simplicity):
  2.5 --> 2   (ties to even)
  3.5 --> 4   (ties to even)
  2.3 --> 2
  2.7 --> 3

This eliminates the systematic upward bias of the "round half up" rule that we learned in grade school.

2. Round toward Zero (Truncation / Chopping): Simply discard the extra bits. Always rounds toward zero.

  +2.7 --> +2
  -2.7 --> -2

3. Round toward +Infinity (Ceiling): Always round up (toward positive infinity).

  +2.1 --> +3
  -2.7 --> -2

4. Round toward -Infinity (Floor): Always round down (toward negative infinity).

  +2.7 --> +2
  -2.1 --> -3
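Python's standard library happens to expose all four behaviors for integer-rounding examples like the ones above: built-in round() implements ties-to-even, while math.trunc, math.ceil, and math.floor give the three directed modes:

```python
import math

for x in (2.5, 3.5, 2.3, 2.7, -2.7, -2.1):
    print(f"{x:5}: nearest-even={round(x):3}  toward-zero={math.trunc(x):3}  "
          f"toward+inf={math.ceil(x):3}  toward-inf={math.floor(x):3}")
```

Note that 2.5 and 3.5 are exactly representable in binary, so the tie cases here behave exactly as in the examples; decimal literals that are not exact (e.g. 2.675) can round surprisingly.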

7.9 Von Neumann Rounding (An Older Method)

An older rounding technique, sometimes called von Neumann rounding or jamming, works as follows: if any bits are shifted off (i.e., the truncated portion is nonzero), force the LSB of the retained result to 1. This does not bias the result but produces a slightly larger maximum error than round-to-nearest.

It is simple to implement (just OR the sticky bit into the LSB), but IEEE 754's round-to-nearest-even is more accurate and has become the universal default.
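On an integer mantissa, jamming really is just truncation plus an OR; a minimal sketch (von_neumann_round is a name introduced here):

```python
def von_neumann_round(mantissa, shift):
    """Von Neumann rounding (jamming): truncate, then force the LSB to 1
    if any nonzero bits were shifted out."""
    sticky = int((mantissa & ((1 << shift) - 1)) != 0)
    return (mantissa >> shift) | sticky

print(von_neumann_round(0b10110100, 3))   # 23 = 0b10111 (bits lost -> LSB jammed to 1)
print(von_neumann_round(0b10110000, 3))   # 22 = 0b10110 (nothing lost -> unchanged)
```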

7.10 Implementing Rounding with G, R, S

Using the guard, round, and sticky bits, the round-to-nearest-even decision is:

  G = 0                 -->  truncate (the discarded part is below halfway)
  G = 1 and (R or S)    -->  round up (the discarded part is above halfway)
  G = 1, R = 0, S = 0   -->  exact tie: round up only if the retained LSB is 1,
                             so the result's LSB ends up 0 (even)

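A minimal sketch of the round-to-nearest-even decision, with the retained mantissa as an integer and G, R, S as single bits (round_nearest_even is a name introduced for this sketch):

```python
def round_nearest_even(kept, g, r, s):
    """Round-to-nearest-even on an integer mantissa `kept`, given the
    guard (g), round (r), and sticky (s) bits shifted out below it."""
    if g == 0:
        return kept                # below halfway: truncate
    if r or s:
        return kept + 1            # above halfway: round up
    return kept + (kept & 1)       # exact tie: round to the even LSB

print(round_nearest_even(0b101, 1, 0, 1))   # 6: above halfway, rounds up
print(round_nearest_even(0b101, 1, 0, 0))   # 6: tie, odd LSB rounds up to even
print(round_nearest_even(0b100, 1, 0, 0))   # 4: tie, even LSB stays put
```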
Key Takeaway: IEEE 754 floating-point uses a sign, biased exponent, and normalized mantissa with an implicit leading 1. Addition requires exponent alignment and renormalization. Multiplication adds exponents and multiplies mantissas. Guard, round, and sticky bits ensure that intermediate results can be correctly rounded using one of four IEEE rounding modes. Round-to-nearest-even is the default and eliminates statistical bias.


8. Summary

Let's recap the complete landscape of computer arithmetic we have covered:

  • Addition/subtraction: full adders cascade into ripple-carry adders;
    generate (G) and propagate (P) signals let carry-lookahead adders compute
    all carries in parallel.

  • Multiplication: Booth recoding and bit-pair recoding cut the number of
    partial products, and carry-save adder trees sum them in logarithmic depth.

  • Division: restoring division undoes a failed subtraction immediately;
    non-restoring division folds the correction into the next step.

  • Floating point: IEEE 754 packs a sign, biased exponent, and mantissa with
    a hidden leading 1; guard, round, and sticky bits make correct rounding
    possible.


9. Key Formulas Reference

Adder Formulas

Full Adder:
  s_i     = x_i XOR y_i XOR c_i
  c_{i+1} = x_i * y_i + (x_i XOR y_i) * c_i

Overflow Detection:
  Overflow = c_n XOR c_{n-1}
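The overflow rule can be exercised with a bit-level ripple-carry simulation (a sketch; add_with_overflow operates on raw n-bit two's-complement patterns and is a name introduced here):

```python
def add_with_overflow(x, y, n=8):
    """Ripple-carry add two n-bit two's-complement bit patterns.
    Overflow = XOR of the carries into and out of the sign bit."""
    carries = [0] * (n + 1)   # carries[i] = carry into bit position i
    result = 0
    for i in range(n):
        xi = (x >> i) & 1
        yi = (y >> i) & 1
        ci = carries[i]
        result |= (xi ^ yi ^ ci) << i
        carries[i + 1] = (xi & yi) | ((xi ^ yi) & ci)
    overflow = carries[n] ^ carries[n - 1]
    return result, overflow

print(add_with_overflow(0x7F, 0x01))   # (128, 1): 127 + 1 overflows 8-bit signed
print(add_with_overflow(0xFF, 0x01))   # (0, 0):  -1 + 1 = 0, no overflow
```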

Carry-Lookahead:
  G_i     = x_i AND y_i                 (Generate)
  P_i     = x_i XOR y_i                 (Propagate)
  c_{i+1} = G_i + P_i * c_i             (Carry recurrence)

Expanded carries (4-bit block):
  c_1 = G_0 + P_0*c_0
  c_2 = G_1 + P_1*G_0 + P_1*P_0*c_0
  c_3 = G_2 + P_2*G_1 + P_2*P_1*G_0 + P_2*P_1*P_0*c_0
  c_4 = G_3 + P_3*G_2 + P_3*P_2*G_1 + P_3*P_2*P_1*G_0
        + P_3*P_2*P_1*P_0*c_0

Block-level (second level):
  P'_0 = P_3 * P_2 * P_1 * P_0
  G'_0 = G_3 + P_3*G_2 + P_3*P_2*G_1 + P_3*P_2*P_1*G_0
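These equations translate directly into a 4-bit lookahead sketch, with all four carries computed from the G_i, P_i, and c_0 alone (cla4 is a hypothetical helper, not a hardware description):

```python
def cla4(x, y, c0=0):
    """4-bit carry-lookahead addition: every carry is a two-level function
    of the generate/propagate signals and c0, with no rippling."""
    g = [(x >> i) & (y >> i) & 1 for i in range(4)]    # G_i = x_i AND y_i
    p = [((x >> i) ^ (y >> i)) & 1 for i in range(4)]  # P_i = x_i XOR y_i
    c = [c0, 0, 0, 0, 0]
    c[1] = g[0] | (p[0] & c[0])
    c[2] = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c[0])
    c[3] = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0]) | (p[2] & p[1] & p[0] & c[0])
    c[4] = (g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
            | (p[3] & p[2] & p[1] & g[0]) | (p[3] & p[2] & p[1] & p[0] & c[0]))
    s = sum((p[i] ^ c[i]) << i for i in range(4))      # s_i = P_i XOR c_i
    return s, c[4]

print(cla4(0b0011, 0b0101))   # (8, 0): 3 + 5 = 8 with no carry out
```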

Booth's Algorithm

Examine multiplier bit pairs (b_i, b_{i-1}), with b_{-1} = 0:

  b_i  b_{i-1}    Recoded digit
  0    0           0    (middle of a run of 0s)
  0    1          +1    (end of a run of 1s)
  1    0          -1    (beginning of a run of 1s)
  1    1           0    (middle of a run of 1s)

Bit-Pair Recoding (Modified Booth)

Examine overlapping bit triples (b_{i+1}, b_i, b_{i-1}) at even positions i;
each triple selects one multiple of M, halving the partial-product count:

  b_{i+1} b_i b_{i-1}    Multiple
  0  0  0                 0
  0  0  1                +1 x M
  0  1  0                +1 x M
  0  1  1                +2 x M
  1  0  0                -2 x M
  1  0  1                -1 x M
  1  1  0                -1 x M
  1  1  1                 0

Division Algorithms

Restoring Division (repeat n times):
  1. Shift A,Q left
  2. A <-- A - M
  3. If A < 0: set q_0 = 0, A <-- A + M   (restore)
     Else:     set q_0 = 1

Non-Restoring Division (repeat n times):
  1. If A >= 0: shift A,Q left, A <-- A - M
     If A < 0:  shift A,Q left, A <-- A + M
  2. If A >= 0: set q_0 = 1
     If A < 0:  set q_0 = 0
  Final: If A < 0, add M to A   (correct remainder)
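The restoring loop maps almost line-for-line onto a sketch over unsigned integers (restoring_divide is a hypothetical helper; A and Q are modeled as separate Python integers rather than one shift register):

```python
def restoring_divide(dividend, divisor, n):
    """Restoring division of an n-bit non-negative dividend.
    A holds the partial remainder, Q the developing quotient."""
    a, q, m = 0, dividend, divisor
    for _ in range(n):
        # Shift A,Q left as one combined register
        a = (a << 1) | ((q >> (n - 1)) & 1)
        q = (q << 1) & ((1 << n) - 1)
        a -= m
        if a < 0:
            a += m          # restore
        else:
            q |= 1          # set q_0 = 1
    return q, a             # (quotient, remainder)

print(restoring_divide(13, 3, 4))   # (4, 1): 13 / 3 = 4 remainder 1
```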

IEEE 754 Floating-Point

Single Precision (32 bits):
  Value = (-1)^S  x  1.M  x  2^(E - 127)
  Sign: 1 bit, Exponent: 8 bits (bias 127), Mantissa: 23 bits
  Range: ~1.18 x 10^(-38) to ~3.40 x 10^38

Double Precision (64 bits):
  Value = (-1)^S  x  1.M  x  2^(E - 1023)
  Sign: 1 bit, Exponent: 11 bits (bias 1023), Mantissa: 52 bits
  Range: ~2.23 x 10^(-308) to ~1.80 x 10^308

Special Values:
  E=0,   M=0     -->  +/- 0
  E=0,   M!=0    -->  Denormalized: (-1)^S x 0.M x 2^(-126 or -1022)
  E=max, M=0     -->  +/- Infinity
  E=max, M!=0    -->  NaN
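The special-value encodings can be checked by pulling the raw fields out of a float32 (decode32 is a name introduced for this sketch):

```python
import struct, math

def decode32(x):
    """Classify a single-precision value from its raw E and M fields."""
    bits = struct.unpack('>I', struct.pack('>f', x))[0]
    e = (bits >> 23) & 0xFF
    m = bits & 0x7FFFFF
    if e == 0:
        return "zero" if m == 0 else "denormal"
    if e == 255:
        return "infinity" if m == 0 else "NaN"
    return "normal"

print(decode32(1.5))        # normal
print(decode32(0.0))        # zero
print(decode32(math.inf))   # infinity
print(decode32(math.nan))   # NaN
print(decode32(1e-45))      # denormal (rounds to the smallest subnormal)
```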

FP Addition Steps:
  1. Compare exponents
  2. Shift smaller mantissa right by |E_A - E_B|
  3. Add/subtract mantissas
  4. Normalize and round

FP Multiplication Steps:
  1. Multiply mantissas (integer multiplication)
  2. Add exponents, subtract bias
  3. Determine sign (XOR)
  4. Normalize and round

Carry-Save Adders

CSA levels needed to reduce k summands to two: ~1.7 * log_2(k) - 1.7

10. Concluding Remarks

Computer arithmetic is one of those topics where the hardware-software boundary becomes razor-thin. Every design choice -- whether to use a CLA or a ripple-carry adder, whether to apply Booth recoding or basic shift-and-add, whether to implement restoring or non-restoring division -- has measurable consequences for chip area, power consumption, and clock speed.

The principles we have covered are not historical curiosities. They are alive in every ALU shipping today:

  • Carry-lookahead and its descendants (carry-select, carry-skip, conditional-sum adders) are used in all high-speed adders.

  • Booth recoding and carry-save addition trees (often arranged as Wallace trees) are standard in multiplier designs.

  • Non-restoring division forms the basis of the SRT division algorithm used in modern processors. (The famous Pentium FDIV bug of 1994 was caused by missing entries in an SRT division lookup table -- a spectacular reminder that getting these circuits right matters enormously.)

  • IEEE 754 floating-point is the universal standard, implemented in hardware by every general-purpose processor and most GPUs.

Understanding these circuits at the gate level gives you the foundation to reason about performance, precision, and the occasional spectacular hardware bug. The next time you see a floating-point rounding error or wonder why your compiler chose a particular instruction, you will know exactly what is happening inside the silicon.