Floating-point numbers are often described as “real numbers,” but in reality the finite values of a floating-point format form a finite subset of the rational numbers.
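
As a small illustration of that claim, the sketch below uses Python's standard `fractions` module to recover the exact rational value stored for the literal `0.1`. It assumes Python's `float` is an IEEE 754 f64, which holds on mainstream platforms. The helper names `as_exact_fraction` and `is_power_of_two` are made up for this example.

```python
from fractions import Fraction


def as_exact_fraction(x: float) -> Fraction:
    """Return the exact rational value stored in the float x (x must be finite)."""
    return Fraction(x)


def is_power_of_two(n: int) -> bool:
    """True if n is a positive power of two."""
    return n > 0 and (n & (n - 1)) == 0


tenth = as_exact_fraction(0.1)
print(tenth)                               # numerator / denominator of the stored value
print(is_power_of_two(tenth.denominator))  # True: the denominator is a power of two
print(tenth == Fraction(1, 10))            # False: 1/10 itself is not representable
```

Every finite binary float is an integer divided by a power of two, so a rational such as 1/10 can only be approximated.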

A floating-point number has one sign bit, a field of exponent bits, and a field of mantissa (significand) bits. Values are usually represented in normalized scientific notation with base 2.
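
The sign/exponent/mantissa layout can be sketched by decoding an f32 bit pattern by hand. This is a minimal example and not a full decoder: it assumes the IEEE 754 binary32 layout (1 sign bit, 8 exponent bits with bias 127, 23 mantissa bits) and only reconstructs normal numbers, not zeros, subnormals, infinities, or NaN. The function names `decode_f32` and `normal_value` are hypothetical.

```python
import struct


def decode_f32(x: float):
    """Split the binary32 encoding of x into (sign, exponent, mantissa) bit fields."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))  # reinterpret the 4 bytes as an unsigned int
    sign = bits >> 31                # top bit
    exponent = (bits >> 23) & 0xFF   # next 8 bits, stored with a bias of 127
    mantissa = bits & 0x7FFFFF       # low 23 bits, the fraction after the implicit leading 1
    return sign, exponent, mantissa


def normal_value(sign: int, exponent: int, mantissa: int) -> float:
    """Rebuild the value of a normal number: (-1)^sign * 1.mantissa * 2^(exponent - 127)."""
    return (-1) ** sign * (1 + mantissa / 2**23) * 2.0 ** (exponent - 127)


fields = decode_f32(-6.5)      # -6.5 = -1.625 * 2^2
print(fields)                  # (1, 129, 5242880)
print(normal_value(*fields))   # -6.5
```

The leading 1 of the normalized mantissa is not stored, which is why 23 stored bits give 24 bits of precision.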

Formats

  • IEEE 754 Floats
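    • f16: 5 exponent bits, 10 mantissa bits
    • f32: 8 exponent bits, 23 mantissa bits
    • f64: 11 exponent bits, 52 mantissa bits
  • Bfloat16: 16 bits, 8 exponent bits, 7 mantissa bits
  • NVIDIA’s TensorFloat (TF32): 19 bits, 8 exponent bits, 10 mantissa bits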
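
To compare the formats listed above, the following sketch derives the total width, largest finite value, and machine epsilon from each format's exponent and mantissa widths. The widths in `FORMATS` are the commonly published ones and are assumptions of this example (f16: 5/10, f32: 8/23, f64: 11/52, Bfloat16: 8/7, TensorFloat: 8/10). The formulas assume IEEE-style encoding with an implicit leading 1 and the all-ones exponent reserved for infinities and NaN. The helper names are hypothetical.

```python
# (exponent bits, mantissa bits); every format also has one sign bit.
FORMATS = {
    "f16": (5, 10),
    "f32": (8, 23),
    "f64": (11, 52),
    "bfloat16": (8, 7),
    "TensorFloat": (8, 10),
}


def total_bits(e: int, m: int) -> int:
    """Sign bit plus exponent and mantissa fields."""
    return 1 + e + m


def max_finite(e: int, m: int) -> float:
    """Largest finite value: all-ones mantissa at the largest non-reserved exponent."""
    bias = 2 ** (e - 1) - 1
    return (2 - 2.0 ** -m) * 2.0 ** bias


def epsilon(e: int, m: int) -> float:
    """Gap between 1.0 and the next representable value."""
    return 2.0 ** -m


for name, (e, m) in FORMATS.items():
    print(f"{name:12s} bits={total_bits(e, m):2d} "
          f"max={max_finite(e, m):.4e} eps={epsilon(e, m):.4e}")
```

Under these assumptions, Bfloat16 and TensorFloat keep the 8-bit exponent of f32, so they cover roughly the same range as f32 while carrying far fewer mantissa bits.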

Subsections

References