In memory, an IEEE floating point number consists of one sign bit, exponent bits, and trailing mantissa bits:

  • Half-precision f16: 5 exponent bits, 10 mantissa bits
  • Single-precision f32: 8 exponent bits, 23 mantissa bits
  • Double-precision f64: 11 exponent bits, 52 mantissa bits

(Figure: ieee-format.png — bit layout of the sign, exponent, and mantissa fields.)
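To make the layout concrete, here is a minimal sketch in Rust (not part of the original note; the value -6.25 is just an arbitrary example) that pulls the three fields of an f32 apart with shifts and masks:

```rust
fn main() {
    let x: f32 = -6.25; // -1.5625 * 2^2
    let bits = x.to_bits(); // raw IEEE 754 bit pattern as a u32

    let sign = bits >> 31;              // 1 sign bit
    let exponent = (bits >> 23) & 0xFF; // 8 exponent bits (still biased)
    let mantissa = bits & 0x7F_FFFF;    // 23 trailing mantissa bits

    // Prints: sign = 1, exponent = 129, mantissa = 0b10010000000000000000000
    println!("sign = {sign}, exponent = {exponent}, mantissa = {mantissa:#025b}");
}
```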


Bias in Exponents

The exponent is stored with a bias. The offset can be computed as $2^{w-1} - 1$, where $w$ is the number of exponent bits.

For example, f32 has 8 exponent bits, so it has an offset of $2^{8-1} - 1 = 127$. The stored exponent values are then interpreted as follows:

| Stored Exponent | Scaling | Meaning |
| --- | --- | --- |
| 0 | $2^{-126}$ | Zero (if mantissa = 0), subnormal [1] (otherwise) |
| 1 | $2^{1-127} = 2^{-126}$ | Smallest normal exponent |
| 127 | $2^{127-127} = 2^{0}$ | Exponent of zero (scale of 1) |
| 254 | $2^{254-127} = 2^{127}$ | Largest normal exponent |
| 255 | N/A | Infinity (if mantissa = 0), NaN (otherwise) |
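The same classification can be checked programmatically. The following Rust sketch (illustrative only; the sample values are arbitrary) maps the stored exponent field of a few f32 values onto the rows of the table above:

```rust
fn main() {
    // Classify f32 values by their stored exponent field, mirroring the table above.
    for x in [0.0f32, 1.0e-40, 1.0, f32::MAX, f32::INFINITY, f32::NAN] {
        let bits = x.to_bits();
        let stored_exp = (bits >> 23) & 0xFF;
        let mantissa = bits & 0x7F_FFFF;
        let kind = match (stored_exp, mantissa) {
            (0, 0) => "zero",
            (0, _) => "subnormal",
            (255, 0) => "infinity",
            (255, _) => "NaN",
            _ => "normal",
        };
        println!("{x:e}: stored exponent = {stored_exp}, {kind}");
    }
}
```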

Similarly, f16 has a bias of $2^{5-1} - 1 = 15$ and f64 has a bias of $2^{11-1} - 1 = 1023$.
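As a quick check (again just an illustrative Rust sketch), the bias of each format follows directly from its exponent width, and subtracting it from the stored exponent recovers the true power of two:

```rust
fn main() {
    // Bias = 2^(w-1) - 1, where w is the number of exponent bits.
    for (name, w) in [("f16", 5u32), ("f32", 8), ("f64", 11)] {
        let bias = (1u32 << (w - 1)) - 1;
        println!("{name}: {w} exponent bits, bias = {bias}"); // 15, 127, 1023
    }

    // Recover the unbiased exponent of a normal f64: stored exponent minus 1023.
    let x: f64 = 6.25; // 1.5625 * 2^2
    let stored_exp = ((x.to_bits() >> 52) & 0x7FF) as i64;
    println!("true exponent of {x} = {}", stored_exp - 1023); // prints 2
}
```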


Footnotes

  1. c++ - What is a subnormal floating point number? - Stack Overflow