Floating-Point Encoding (浮点数编码)

Abstract

Based on the IEEE 754 Standard
sign 0 for positive, 1 for negative
exponent by default -127 with all bits set to 0. If we want positive, we need to set the 8th bit to 1 which is 128
mantissa takes the binary behind the decimal place after normalisation (the yellow circle part)
Reliable precision is 7 decimal digits

Approximation of Real Number

mantissa gives the precision
From 1 to 2 (2^0-2^1), there are 23 bits of mantissa used for precision after decision point
For 2 to 4 (2^1-2^2), there are 22 bits of mantissa used for precision after decision point, one of the bit is used to present the whole number before decimal point
With every range of 2, the precision after the decimal point is reduced by 2
Thus, the precision of the number after decimal point is getting worse as the number getting bigger

Normalised Number

The range of real numbers between 0 and smallest Normalised Number isn’t covered, covered by Subnormal Number (Denormalized Number)
The 1 is implicit when exponent isn’t 0. When exponent is 0, we get Subnormal Number (Denormalized Number)

Smallest positive Normalised Number

One bit for exponent to differentiate from Subnormal Number (Denormalized Number)

Biggest positive Normalised Number

All 1s for exponent means Infinity

Subnormal Number (Denormalized Number)

Fill up the gap between 0 and the smallest Normalised Number
Without, we will get a 0 if the difference between 2 numbers is smaller than the smallest Normalised Number

In non-debug mode, Subnormal Number maybe turned off for performance reasons, and this may lead to unexpected errors

Smallest positive Subnormal Number (Denormalized Number)

The exponent bias is fixed at -126 when denormalised, and 0 is implicit instead of 1

Biggest Subnormal Number (Denormalized Number)

0 when Mantissa bits are all 0

3 Special Cases

0

Both Exponent & Mantissa is 0

Infinity

Exponent is 255, but Mantissa is 0

NaN

Exponent is 255 & Mantissa isn’t 0

Tips

When it comes to store a large whole number, use long to represent, because floating options like double may have precision loss issues
Usually Binary to Hex for better readability
Online Converter to visualise better

Side Notes

Floating-point rounding error

Binary representation that requires infinite precision
Decimal number like 0.1 in binary representation is like 1/3 in decimal presentation. With limited precision (32bits), we will lose some precision. That is why 0.1+0.2 in binary isn’t strictly 0.3

References

Wait, so comparisons in floating point only just KINDA work? What DOES work?

Why Is This Happening?! Floating-point rounding error

CS Notes

Recent Updates

GPU

Graph Data Structure

Tree

Explorer

Floating-Point Encoding (浮点数编码)

Abstract

Approximation of Real Number

Normalised Number

Smallest positive Normalised Number

Biggest positive Normalised Number

Subnormal Number (Denormalized Number)

Smallest positive Subnormal Number (Denormalized Number)

Biggest Subnormal Number (Denormalized Number)

3 Special Cases

0

Infinity

NaN

Tips

Side Notes

Floating-point rounding error

References

Table of Contents

Backlinks

Graph View