Floating-point numbers are often described as “real numbers,” but in reality they can represent only a finite subset of the rational numbers.
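A small sketch of what this means in practice: a decimal fraction such as 0.1 has no finite binary expansion, so it is rounded to the nearest representable value, and sums of such values can miss the expected result. The helper name `sum_is_exact` is made up for this illustration.

```rust
/// Returns true if `a + b`, computed in f64, equals `expected` exactly.
/// (Hypothetical helper, used only to illustrate rounding.)
fn sum_is_exact(a: f64, b: f64, expected: f64) -> bool {
    a + b == expected
}
```

With this helper, `sum_is_exact(0.5, 0.25, 0.75)` holds because all three values are sums of powers of two, while `sum_is_exact(0.1, 0.2, 0.3)` does not, because none of the three decimal values is stored exactly.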

Theoretical Representation

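We can think of such a number as scientific notation with base 2:

$$x = (-1)^{s} \times m \times 2^{e}$$

where $s$ is the sign, $m$ is the mantissa, and $e$ is an integer exponent.

However, this representation is not unique: the same number can be written in several ways. For example, $1.1_2 \times 2^{1} = 11.0_2 \times 2^{0} = 0.11_2 \times 2^{2}$. This is why we normalize the mantissa so that the binary point always comes right after the first nonzero digit:

$$x = (-1)^{s} \times (1.b_1 b_2 b_3 \ldots)_2 \times 2^{e}$$

After normalization, the leading bit becomes redundant because it is always $1$, so it does not need to be stored explicitly.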
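The normalization step can be sketched in code. This is a minimal illustration, not how hardware does it: the hypothetical function `normalize` takes a positive finite value and repeatedly halves or doubles it until the mantissa lies in the range [1, 2), counting the steps as the exponent.

```rust
/// Splits a positive finite `x` into `(m, e)` with `1.0 <= m < 2.0`
/// and `x == m * 2^e`. Hypothetical helper for illustration only.
fn normalize(x: f64) -> (f64, i32) {
    assert!(x.is_finite() && x > 0.0, "sketch handles positive finite inputs only");
    let mut m = x;
    let mut e = 0;
    // Too large: shift the binary point left, one position at a time.
    while m >= 2.0 {
        m /= 2.0;
        e += 1;
    }
    // Too small: shift the binary point right, one position at a time.
    while m < 1.0 {
        m *= 2.0;
        e -= 1;
    }
    (m, e)
}
```

For instance, `normalize(6.0)` gives `(1.5, 2)`, matching $110_2 = 1.1_2 \times 2^{2}$. The integer part of the mantissa is always 1, which is the bit that can be left out of the stored format.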

IEEE Float

In memory, an IEEE floating-point number consists of one sign bit, followed by the exponent bits and the trailing mantissa bits.

  • f32: 8 exponent bits, 23 mantissa bits
  • f64: 11 exponent bits, 52 mantissa bits

We can use bit operations to isolate the sign bit, the exponent bits, or the mantissa bits.
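As a sketch of those bit operations for `f32`, the function below reinterprets the value as a `u32` with the standard `f32::to_bits` and then shifts and masks out each field. The function name `f32_fields` is made up for this example. It returns the raw stored fields, so the exponent is still in its biased form (IEEE single precision stores the exponent plus 127).

```rust
/// Returns the raw (sign, biased exponent, mantissa) fields of an f32.
/// Hypothetical helper for illustration only.
fn f32_fields(x: f32) -> (u32, u32, u32) {
    let bits = x.to_bits();
    let sign = bits >> 31; // top bit
    let exponent = (bits >> 23) & 0xFF; // next 8 bits, still biased
    let mantissa = bits & 0x007F_FFFF; // low 23 bits, implicit leading 1 not included
    (sign, exponent, mantissa)
}
```

For example, `f32_fields(1.0)` yields `(0, 127, 0)`: a zero sign bit, a biased exponent of 127 (a true exponent of 0), and an all-zero mantissa because the leading 1 is implicit. The same approach works for `f64` with `f64::to_bits`, a `u64`, and the 11-bit and 52-bit field widths.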

[Figure: ieee-format.png]
