Abstract


  • Based on the IEEE 754 Standard
  • Sign: 0 indicates a positive number, 1 indicates a negative number
  • Exponent: Comes with a bias of 127; the most negative exponent (-127) is represented by all exponent bits set to 0. To obtain a positive exponent, set the 8th (most significant) bit to 1, resulting in a stored value of 128, i.e. an actual exponent of +1
  • Mantissa: This represents the fractional part of the number after normalisation. The binary digits following the binary point are included in the mantissa. (The sketch after this list extracts each field from a float.)
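
To make the layout concrete, here is a minimal C sketch that pulls the three fields out of a 32-bit float by copying its bits into an unsigned integer; the value 12.375f is just an arbitrary example.

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    int main(void) {
        float f = 12.375f;              /* arbitrary example value */
        uint32_t bits;
        memcpy(&bits, &f, sizeof bits); /* reinterpret the 32 bits safely */

        uint32_t sign     = bits >> 31;            /* 1 bit  */
        uint32_t exponent = (bits >> 23) & 0xFF;   /* 8 bits, biased by 127 */
        uint32_t mantissa = bits & 0x7FFFFF;       /* 23 bits, fractional part */

        printf("bits     = 0x%08X\n", bits);
        printf("sign     = %u\n", sign);
        printf("exponent = %u (actual %d)\n", exponent, (int)exponent - 127);
        printf("mantissa = 0x%06X\n", mantissa);
        return 0;
    }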

Why make it so complicated?

A more intuitive way to represent numbers with decimal points is to use fixed-point, where we allocate a certain number of bits for the whole number part and a certain number of bits for the fractional part. However, floating-point encoding lets the same number of bits represent both very large and very small numbers while keeping good relative precision, thanks to the use of an exponent (see the sketch below).
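
For a rough comparison, the sketch below contrasts a hypothetical 16.16 fixed-point format (16 integer bits, 16 fractional bits, chosen purely for illustration) with a 32-bit float: the fixed-point format tops out near 32768 with a constant step, while the float reaches about 3.4 × 10^38 and also resolves values near 10^-38.

    #include <stdio.h>
    #include <float.h>

    int main(void) {
        /* 16.16 fixed point: value = raw / 65536, so the largest signed value is ~32768 */
        double fixed_max  = 32767.0 + 65535.0 / 65536.0;
        double fixed_step = 1.0 / 65536.0;          /* resolution is the same everywhere */

        printf("16.16 fixed-point max  : %g (step %g)\n", fixed_max, fixed_step);
        printf("32-bit float max       : %g\n", (double)FLT_MAX);
        printf("32-bit float min normal: %g\n", (double)FLT_MIN);
        return 0;
    }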

How Mantissa Precision Varies with Exponent

The mantissa in IEEE 754 floating-point representation determines the precision of a number. Let’s analyse how precision changes across different ranges:

  • Range 1 to 2 (2^0 to 2^1): All 23 bits of the mantissa contribute to the precision after the binary point.
  • Range 2 to 4 (2^1 to 2^2): One bit is effectively used to represent the whole number (2) before the binary point, leaving 22 bits for precision after it.
  • Generalisation: For every increase in the exponent by 1 (doubling the range), the precision after the binary point decreases by a factor of 2, because one more bit is effectively allocated to the whole number part. The spacing between neighbouring floats is shown in the sketch after this list.
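
One way to see this is to print the gap to the next representable float at a few magnitudes using nextafterf from <math.h> (a minimal sketch; the sample values are arbitrary). The gap doubles every time the value crosses a power of two.

    #include <stdio.h>
    #include <math.h>

    int main(void) {
        float samples[] = {1.0f, 2.0f, 4.0f, 1024.0f};
        for (int i = 0; i < 4; i++) {
            float x   = samples[i];
            float gap = nextafterf(x, INFINITY) - x;  /* distance to the next float up */
            printf("at %8.1f the spacing between floats is %g\n", x, gap);
        }
        return 0;
    }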

Tip

To store large whole numbers, use the long data type. Floating-point types like double can only represent whole numbers exactly up to 2^53, so larger values may lose precision.
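
A small sketch of that precision loss; the value 9007199254740993 (2^53 + 1) is just a convenient example, and long long is used for portability. It fits a 64-bit integer exactly, but a double only carries 53 significand bits, so the conversion rounds it.

    #include <stdio.h>

    int main(void) {
        long long n = 9007199254740993LL;   /* 2^53 + 1, exact as an integer */
        double    d = (double)n;            /* double has only 53 significand bits */

        printf("long long: %lld\n", n);
        printf("double   : %.0f\n", d);     /* prints 9007199254740992: the +1 is lost */
        return 0;
    }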

For improved readability, convert binary representations to hexadecimal.

Here is an Online Converter to help visualise the floating-point encoding better.

Why 0.1 + 0.2 = 0.30000000000000004?

For numbers whose binary representation requires infinite precision, like the decimal number 0.1 (0.000110011001100… in binary), the binary equivalent is a repeating fraction, analogous to representing 1/3 as 0.333… in decimal form.

With limited precision (e.g., 32 bits for float or 64 bits for double), we inevitably lose some accuracy. This is why 0.1 + 0.2 in binary floating-point arithmetic does not strictly equal 0.3.
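
A minimal demonstration (double precision here, which is where the famous 0.30000000000000004 comes from):

    #include <stdio.h>

    int main(void) {
        double sum = 0.1 + 0.2;

        printf("%.17g\n", sum);                                   /* 0.30000000000000004 */
        printf("equal to 0.3? %s\n", sum == 0.3 ? "yes" : "no");  /* no */
        return 0;
    }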

Approximation destroys associativity

Given 3 floating-point numbers a, b and c, (a + b) + c is not always the same as a + (b + c).

Associativity is not guaranteed with floating-point numbers because each operation may produce an approximated value.
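
A concrete sketch; the constants 1e20, -1e20 and 1.0 are just a convenient choice where the two groupings visibly disagree.

    #include <stdio.h>

    int main(void) {
        double a = 1e20, b = -1e20, c = 1.0;

        double left  = (a + b) + c;   /* 0 + 1 = 1 */
        double right = a + (b + c);   /* 1.0 is lost next to -1e20, so this is 0 */

        printf("(a + b) + c = %g\n", left);
        printf("a + (b + c) = %g\n", right);
        return 0;
    }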

Converting from Decimal to Float (IEEE 754 Single Precision)

  1. Convert Decimal to Binary
  2. Convert the binary form to normalised form
  3. Calculate the Exponent Field by adding the bias (127) to the exponent & convert the sum to binary (8 bits)
  4. Determine the Sign Bit: 0 for positive, 1 for negative
  5. Assemble the Float: the sign bit, then the 8 exponent bits, then the 23 mantissa bits
  6. Convert to Hexadecimal by grouping the 32 bits into groups of 4, and convert each group to its hexadecimal equivalent (a worked example follows these steps)
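
A worked example, using 5.75 as an arbitrary value, walking through the six steps:

  1. 5.75 in binary is 101.11
  2. Normalised form: 1.0111 × 2^2
  3. Exponent Field: 2 + 127 = 129 = 10000001
  4. Sign Bit: 0 (positive)
  5. Assembled float: 0 10000001 01110000000000000000000
  6. Grouped into 4-bit nibbles: 0100 0000 1011 1000 0000 0000 0000 0000 = 0x40B80000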

Normalised Number


  • In the context of Floating-Point Encoding (浮点数编码), a normalised number is one where the leading digit (the digit to the left of the binary point) is always 1. This leading 1 is not explicitly stored but is implicit, which effectively gives the mantissa one more bit and hence higher accuracy

Important

The range of real numbers between 0 and the Smallest Positive Normalised Number is not covered by normalised numbers, but by subnormal numbers.

The leading 1 is implicit in normalised numbers when the exponent field is not zero. When the exponent field is zero, we have a subnormal number, and the implicit leading 1 is no longer assumed; the leading digit is treated as 0 instead.
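
The sketch below makes the implicit bit explicit: it decodes a float by hand, using a leading 1 when the exponent field is non-zero and a leading 0 with the exponent fixed at -126 when it is zero. The sample values are arbitrary; 1e-40f is included because it is subnormal in single precision.

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>
    #include <math.h>

    /* Decode a finite float from its bit fields by hand. */
    static double decode(uint32_t bits) {
        int      sign     = (bits >> 31) & 1;
        int      exponent = (bits >> 23) & 0xFF;
        uint32_t mantissa = bits & 0x7FFFFF;

        double fraction = mantissa / 8388608.0;   /* mantissa / 2^23 */
        double value;
        if (exponent == 0)                        /* subnormal: implicit 0, exponent -126 */
            value = (0.0 + fraction) * ldexp(1.0, -126);
        else                                      /* normalised: implicit 1, exponent e - 127 */
            value = (1.0 + fraction) * ldexp(1.0, exponent - 127);
        return sign ? -value : value;
    }

    int main(void) {
        float samples[] = {6.5f, 0.1f, 1e-40f};   /* 1e-40f is subnormal in single precision */
        for (int i = 0; i < 3; i++) {
            uint32_t bits;
            memcpy(&bits, &samples[i], sizeof bits);
            printf("%-12g decoded by hand: %g\n", (double)samples[i], decode(bits));
        }
        return 0;
    }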

Smallest Positive Normalised Number

  • 1.0 × 2^-126 ≈ 1.18 × 10^-38 (exponent field 00000001, mantissa bits all 0)

Biggest Positive Normalised Number

  • (2 - 2^-23) × 2^127 ≈ 3.40 × 10^38 (exponent field 11111110, mantissa bits all 1)
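
These two limits are exposed as FLT_MIN and FLT_MAX in <float.h>; a small sketch printing them together with their bit patterns (0x00800000 and 0x7F7FFFFF):

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>
    #include <float.h>

    static uint32_t bits_of(float f) {
        uint32_t b;
        memcpy(&b, &f, sizeof b);
        return b;
    }

    int main(void) {
        printf("smallest normalised: %e (0x%08X)\n", FLT_MIN, bits_of(FLT_MIN));
        printf("biggest  normalised: %e (0x%08X)\n", FLT_MAX, bits_of(FLT_MAX));
        return 0;
    }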

Subnormal Number


  • Also known as Denormalised Number
  • Fill up the gap between 0 and the Smallest Positive Normalised Number
  • Without them, we would get a 0 whenever the difference between 2 numbers is smaller than the smallest normalised number (see the sketch after this list)
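
A sketch of the gap-filling behaviour; the two sample values are arbitrary small normalised numbers whose difference is below FLT_MIN, yet with subnormals enabled the result is still non-zero and classified as FP_SUBNORMAL.

    #include <stdio.h>
    #include <math.h>
    #include <float.h>

    int main(void) {
        float x = 2.0e-38f;          /* both values are normalised... */
        float y = 1.5e-38f;          /* ...but their difference is below FLT_MIN */
        float d = x - y;

        printf("x - y = %e\n", d);
        printf("below FLT_MIN (%e)? %s\n", FLT_MIN, d < FLT_MIN ? "yes" : "no");
        printf("subnormal? %s\n", fpclassify(d) == FP_SUBNORMAL ? "yes" : "no");
        printf("non-zero?  %s\n", d != 0.0f ? "yes" : "no");
        return 0;
    }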

Caution

In non-debug builds, subnormal numbers may be turned off (flushed to zero) for performance reasons, and this may lead to unexpected errors.

Smallest Positive Subnormal Number

  • 2^-149 (= 2^-23 × 2^-126) ≈ 1.4 × 10^-45, encoded with only the least significant mantissa bit set
  • The exponent is fixed at -126 when denormalised (even though the exponent field is 0), and the implicit leading digit is 0 instead of 1

Biggest Positive Subnormal Number

  • (1 - 2^-23) × 2^-126 ≈ 1.18 × 10^-38, just below the Smallest Positive Normalised Number (mantissa bits all 1)
  • The value is 0 when the Mantissa bits are all 0
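
A small sketch, assuming a C11 <float.h> that provides FLT_TRUE_MIN: the smallest subnormal has bit pattern 0x00000001 and the biggest has 0x007FFFFF, one step below FLT_MIN.

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>
    #include <float.h>

    static float float_of(uint32_t b) {
        float f;
        memcpy(&f, &b, sizeof f);
        return f;
    }

    int main(void) {
        printf("smallest subnormal: %e (0x00000001)\n", float_of(0x00000001u));
        printf("biggest  subnormal: %e (0x007FFFFF)\n", float_of(0x007FFFFFu));
    #ifdef FLT_TRUE_MIN
        printf("FLT_TRUE_MIN      : %e\n", FLT_TRUE_MIN);
    #endif
        return 0;
    }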

3 Special Cases


0

  • Both Exponent & Mantissa are 0 (the sign bit distinguishes +0 and -0)

Infinity

  • Exponent is 255 (all 1s) and Mantissa is 0 (the sign bit gives positive or negative infinity)

NaN

  • Exponent is 255 & Mantissa isn’t 0
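
All three special encodings can be produced straight from their bit patterns (a sketch; 0x7FC00000 is the common quiet-NaN pattern, but any non-zero mantissa with exponent 255 is a NaN).

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>
    #include <math.h>

    static float float_of(uint32_t b) {
        float f;
        memcpy(&f, &b, sizeof f);
        return f;
    }

    int main(void) {
        float zero = float_of(0x00000000u);   /* exponent 0,   mantissa 0    */
        float inf  = float_of(0x7F800000u);   /* exponent 255, mantissa 0    */
        float qnan = float_of(0x7FC00000u);   /* exponent 255, mantissa != 0 */

        printf("0x00000000 -> %g\n", zero);
        printf("0x7F800000 -> %g (isinf: %d)\n", inf, isinf(inf) != 0);
        printf("0x7FC00000 -> %g (isnan: %d)\n", qnan, isnan(qnan) != 0);
        return 0;
    }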
