Know your floats better!

In C, the primary floating point types are float, double, and long double. The range and number of bits for these types can vary depending on the system and compiler, but they generally adhere to the IEEE 754 standard for floating-point arithmetic. Here's a breakdown:

  1. float

    • Typically 32 bits (4 bytes).

    • IEEE 754 standard for single-precision floating-point numbers.

    • Range: approximately ±1.4 x 10^-45 (smallest subnormal) to ±3.4 x 10^38.

    • Example representation: 1 bit (sign) + 8 bits (exponent) + 23 bits (fraction).

  2. double

    • Typically 64 bits (8 bytes).

    • IEEE 754 standard for double-precision floating-point numbers.

    • Range: approximately ±4.9 x 10^-324 (smallest subnormal) to ±1.8 x 10^308.

    • Example representation: 1 bit (sign) + 11 bits (exponent) + 52 bits (fraction).

  3. long double

    • Size varies by platform: commonly the x86 80-bit extended format (stored padded to 96 or 128 bits), sometimes a full 128-bit quadruple-precision type, and on some compilers simply the same as double.

    • The standard and range can vary more than float and double.

    • May follow IEEE 754 or have a completely different format.
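Rather than memorizing these figures, you can query the actual sizes and limits of your platform from <float.h>. Here is a minimal sketch (the function name print_float_limits is invented for this example):

```c
#include <float.h>
#include <stdio.h>

/* Report what the current compiler/platform actually provides;
   the exact sizes and limits are implementation-defined, not fixed by C. */
void print_float_limits(void) {
    printf("float:       %zu bytes, %e .. %e\n",
           sizeof(float), (double)FLT_MIN, (double)FLT_MAX);
    printf("double:      %zu bytes, %e .. %e\n",
           sizeof(double), DBL_MIN, DBL_MAX);
    printf("long double: %zu bytes, %Le .. %Le\n",
           sizeof(long double), LDBL_MIN, LDBL_MAX);
}
```

Note that FLT_MIN and friends report the smallest *normalized* positive value, not the smaller subnormals. On a typical x86-64 Linux build this prints 4, 8, and 16 bytes, with long double being the 80-bit extended format padded to 16 bytes.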

IEEE 754 Floating Point Representation

In the IEEE 754 standard, a floating point number is represented by three parts:

  1. Sign bit: Determines if the number is positive (0) or negative (1).

  2. Exponent: Encodes the range of the number. It's stored in a biased format.

  3. Fraction (or mantissa): Represents the precision of the number.
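Putting the three parts together, a normalized single-precision value equals (-1)^sign x 1.fraction x 2^(exponent - 127). As a minimal sketch of that formula, here is a decoder (the name decode_float is invented for this example; zeros, subnormals, infinities, and NaNs are deliberately not handled):

```c
#include <math.h>
#include <stdint.h>

/* Rebuild a float's value from the three IEEE 754 fields.
   Normalized numbers only: no zeros, subnormals, infinities, or NaNs. */
float decode_float(uint32_t bits) {
    int sign = (bits >> 31) & 0x1;           /* 1 sign bit        */
    int exponent = (bits >> 23) & 0xFF;      /* 8 exponent bits   */
    uint32_t fraction = bits & 0x7FFFFF;     /* 23 fraction bits  */

    /* value = (-1)^sign * (1 + fraction/2^23) * 2^(exponent - 127) */
    float mantissa = 1.0f + (float)fraction / (float)(1u << 23);
    float value = ldexpf(mantissa, exponent - 127);
    return sign ? -value : value;
}
```

For instance, decode_float(0x3FC00000) yields 1.5f: sign 0, biased exponent 127 (so 2^0), and a fraction field worth 0.5.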

Example in C

Let's consider a float example in C:

#include <stdio.h>
#include <stdint.h>

// Lets the same 32 bits be viewed as either a float or an integer.
// (Reading a different union member than was last written is
// well-defined in C, unlike in C++.)
typedef union {
    float f;
    uint32_t bits;
} FloatUnion;

// Print the low 'bits' bits of 'number', most significant first,
// inserting a space after the sign bit (bit 31) and after the
// exponent field (bits 30..23) so the three fields stand out.
void printBinary(uint32_t number, int bits) {
    for (int i = bits - 1; i >= 0; i--) {
        printf("%d", (number >> i) & 1);
        if (i == 31 || i == 23) printf(" ");
    }
    printf("\n");
}

int main() {
    FloatUnion fu;
    fu.f = -13.625; // Example float number

    printf("Floating point value: %f\n", fu.f);
    printf("Binary representation: ");
    printBinary(fu.bits, 32);

    return 0;
}

In this code:

  • We use a union to interpret the bits of a float as a uint32_t.

  • The printBinary function prints the binary representation of the number, separating the sign, exponent, and fraction parts for clarity.

  • -13.625 is 1101.101 in binary, i.e. 1.101101 x 2^3 normalized, so its sign bit is 1, its biased exponent is 3 + 127 = 130 (10000010), and its fraction field is 1011010 followed by sixteen zeros.

When you run this code, it displays the 32-bit pattern of -13.625 with the three fields separated: 1 10000010 10110100000000000000000.

As a further encoding example, consider the decimal number 123.456. When encoded as a float according to the IEEE 754 standard, its binary representation breaks down into three parts:

  1. Sign bit: 0 (indicating a positive number)

  2. Exponent: 10000101

  3. Mantissa (Fraction): 11101101110100101111001

The binary representation is thus displayed as:

┌───┬──────────┬─────────────────────────┐
│ 0 │ 10000101 │ 11101101110100101111001 │
└───┴──────────┴─────────────────────────┘

Here, each box represents a different part of the IEEE 754 floating-point format:

  • The first box contains the sign bit.

  • The second box shows the exponent.

  • The third box is the mantissa (or fraction part) of the number.
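A small helper can pull those three boxes out of any float. This is a sketch; split_float and FloatFields are names invented for this example:

```c
#include <stdint.h>
#include <string.h>

/* The three IEEE 754 fields of a single-precision float. */
typedef struct { uint32_t sign, exponent, fraction; } FloatFields;

FloatFields split_float(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);      /* copy the raw 32 bits */
    FloatFields out = {
        .sign     = bits >> 31,          /* 1 bit   */
        .exponent = (bits >> 23) & 0xFF, /* 8 bits  */
        .fraction = bits & 0x7FFFFF,     /* 23 bits */
    };
    return out;
}
```

For 123.456f this yields sign 0, exponent 133 (binary 10000101), and fraction 0x76E979 (binary 11101101110100101111001), matching the three boxes above.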

The exponent 10000101 in the IEEE 754 representation of a floating-point number is not the actual exponent but a "biased" exponent. This biasing is a technique used in the IEEE 754 standard to handle both positive and negative exponents in a way that simplifies the design of floating-point hardware.

In the case of 32-bit float (single precision), the bias used is 127. Here's how the biased exponent is calculated:

  1. Find the Actual Exponent: First, determine the actual exponent of the number in binary form. For the decimal number 123.456, it's necessary to convert it to binary and normalize it so that exactly one 1 appears before the binary point.

  2. Normalize the Number: To convert 123.456 to binary, we can focus on the integer part and fractional part separately:

    • The integer part 123 in binary is 1111011.

    • The fractional part .456 is converted to binary by repeatedly multiplying by 2 and taking the integer part of each result, giving .0111010010... (we can stop after a few digits for this example).

    • Combining these, the binary representation would start as 1111011.011... (and so on).

  3. Adjust to Scientific Notation: Adjust this binary number to scientific notation (binary form), which is approximately 1.111011011... x 2^6. Here, 6 is the actual exponent, since we moved the binary point 6 places to the left.

  4. Apply the Bias: Add the bias (127 for 32-bit floats) to the actual exponent. So, the biased exponent is 6 + 127 = 133.

  5. Convert to Binary: Finally, convert 133 to binary, which is 10000101.
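The steps above can be checked in code. This sketch (the name biased_exponent is mine) uses the standard frexpf to find the actual exponent, then applies the bias:

```c
#include <math.h>

/* Follow the steps above for a positive, normal float value: find
   the actual exponent, then add the single-precision bias of 127. */
int biased_exponent(float x) {
    int e;
    frexpf(x, &e);        /* x = m * 2^e with 0.5 <= m < 1 */
    int actual = e - 1;   /* renormalize to 1.xxx * 2^actual */
    return actual + 127;  /* step 4: apply the bias */
}
```

Calling biased_exponent(123.456f) returns 133, i.e. 10000101 in binary, as derived above.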

Therefore, in the IEEE 754 representation of the number 123.456, the exponent part 10000101 represents the biased exponent, and it is this biased value that is stored in the binary representation of the floating point number.
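The multiply-by-2 procedure from step 2 above can also be sketched directly (fraction_bits is a hypothetical helper written for this example, not a standard function):

```c
/* Emit the first n binary digits of a fraction 0 <= frac < 1 into
   'out' by repeatedly doubling it and taking the integer part, as
   in step 2 above. 'out' must have room for n + 1 characters. */
void fraction_bits(double frac, int n, char *out) {
    for (int i = 0; i < n; i++) {
        frac *= 2.0;
        int bit = (int)frac;        /* integer part = next binary digit */
        out[i] = (char)('0' + bit);
        frac -= bit;                /* keep only the fractional remainder */
    }
    out[n] = '\0';
}
```

For example, fraction_bits(0.456, 10, buf) fills buf with "0111010010", the first ten binary digits of .456 used in the normalization above.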

Understanding the exact binary representation requires familiarity with the IEEE 754 format, including how the biased exponent and the fraction are derived from the actual number. This example shows the underlying binary form; converting between a decimal value and this form (and back) takes some practice with binary and floating-point arithmetic.