Floating-point numbers are often described as “real numbers,” but each format actually represents only a finite subset of the rational numbers (plus a few special values such as infinities and NaN).
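To make the "rational numbers" point concrete, here is a minimal sketch in Python, assuming the interpreter's `float` is an IEEE 754 64-bit value (true on virtually all platforms). It recovers the exact fraction stored for the literal `0.1`; the variable names are illustrative only.

```python
from fractions import Fraction

# A Python float is assumed here to be an IEEE 754 64-bit value.
x = 0.1

# as_integer_ratio() returns the exact numerator and denominator of the stored value.
num, den = x.as_integer_ratio()
exact = Fraction(num, den)

# The denominator of a finite binary float is always a power of two,
# so a value such as 1/10 cannot be stored exactly.
den_is_power_of_two = den & (den - 1) == 0
matches_one_tenth = exact == Fraction(1, 10)

print(f"0.1 is stored as {num}/{den}")
print("denominator is a power of two:", den_is_power_of_two)
print("equals 1/10 exactly:", matches_one_tenth)
```

Every finite float is a rational number of this form (an integer over a power of two), while most rationals, including 1/10, are not floats.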
A floating-point format has one sign bit, a fixed number of exponent bits, and a fixed number of mantissa (fraction) bits.
Formats
- IEEE 754 floats: f16, f32, f64
- Bfloat16: 16 bits
- NVIDIA’s TensorFloat: 19 bits
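The formats listed above differ only in how many bits they give to the exponent and the mantissa. The sketch below tabulates them and decodes a raw bit pattern into its exact rational value. The field widths are the commonly documented ones for each format, filled in here as an assumption rather than taken from the text, and `FloatFormat` and `decode` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from fractions import Fraction


@dataclass(frozen=True)
class FloatFormat:
    name: str
    exp_bits: int  # width of the exponent field
    man_bits: int  # width of the mantissa (fraction) field

    @property
    def total_bits(self) -> int:
        # One sign bit plus the two fields.
        return 1 + self.exp_bits + self.man_bits

    @property
    def bias(self) -> int:
        # IEEE-style exponent bias: 2^(exp_bits - 1) - 1.
        return (1 << (self.exp_bits - 1)) - 1

    @property
    def epsilon(self) -> Fraction:
        # Gap between 1.0 and the next larger representable value.
        return Fraction(1, 1 << self.man_bits)

    @property
    def max_finite(self) -> Fraction:
        # The all-ones exponent is reserved for infinity/NaN,
        # so the largest usable unbiased exponent equals the bias.
        return (2 - self.epsilon) * Fraction(2) ** self.bias


# Field widths as commonly documented (assumed, not taken from the text above).
FORMATS = {
    "f16": FloatFormat("f16", exp_bits=5, man_bits=10),
    "f32": FloatFormat("f32", exp_bits=8, man_bits=23),
    "f64": FloatFormat("f64", exp_bits=11, man_bits=52),
    "bfloat16": FloatFormat("bfloat16", exp_bits=8, man_bits=7),
    "tensorfloat": FloatFormat("tensorfloat", exp_bits=8, man_bits=10),
}


def decode(bits: int, fmt: FloatFormat) -> Fraction:
    """Decode a finite bit pattern into its exact rational value."""
    exp_mask = (1 << fmt.exp_bits) - 1
    sign = (bits >> (fmt.exp_bits + fmt.man_bits)) & 1
    exp = (bits >> fmt.man_bits) & exp_mask
    man = bits & ((1 << fmt.man_bits) - 1)

    if exp == exp_mask:
        raise ValueError("bit pattern encodes infinity or NaN, not a rational")

    fraction = Fraction(man, 1 << fmt.man_bits)
    if exp == 0:
        # Subnormal: no implicit leading 1, fixed minimum exponent.
        magnitude = fraction * Fraction(2) ** (1 - fmt.bias)
    else:
        # Normal: implicit leading 1 in front of the fraction.
        magnitude = (1 + fraction) * Fraction(2) ** (exp - fmt.bias)
    return -magnitude if sign else magnitude


for fmt in FORMATS.values():
    print(f"{fmt.name:12s} total={fmt.total_bits:2d} "
          f"exponent={fmt.exp_bits:2d} mantissa={fmt.man_bits:2d} "
          f"epsilon={float(fmt.epsilon):.3e}")
```

Under these assumed widths, bfloat16 and TensorFloat keep the 8-bit exponent of f32 (and so roughly its range) while giving up mantissa bits, whereas f16 trades range for precision. `decode` returns a `Fraction`, so results are exact rather than rounded through a Python `float`.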