How are floating point numbers typically represented as bits in a computer? Like this!
If you only had 8 bits, a non-standard quarter-precision representation could be devised using a three-bit exponent. The eight bits allow only 256 combinations, so at most 256 values could be represented. Experiment with this representation:
[Interactive widget: eight toggleable bits laid out as 1 sign bit, 3 exponent bits and 4 significand bits. The exponent selector runs from 000 to 111, with the bias marked at 011 and ∞ at 111. The value display has bit places worth 64/64, 32/64, 16/64, 8/64, 4/64, 2/64 and 1/64.]
If you only had 4 bits, a non-standard eighth-precision representation could be devised using a two-bit exponent. There would be only 16 bit patterns, one for each four-bit combination (bits are shown as sign, exponent, significand):
Sign bit clear | Value | Sign bit set | Value | Meaning |
---|---|---|---|---|
0 00 0 | +0 | 1 00 0 | -0 | Positive or negative zero: all bits of the exponent are cleared and all bits of the significand are cleared. |
0 00 1 | +0.5 | 1 00 1 | -0.5 | Denormal: all bits of the exponent are cleared. The scale is the same as for the next higher exponent, but without the implied leading 1. |
0 01 0 | +1 | 1 01 0 | -1 | Unit: all bits of the exponent are set except the msb, which is cleared. This exponent value is called the bias. If the significand bits are cleared, this is the value for positive or negative one. |
0 01 1 | +1.5 | 1 01 1 | -1.5 | A one in the most significant bit of the significand adds a fraction of half the value of the implied one. Each bit to the right is worth half of the previous one. |
0 10 0 | +2 | 1 10 0 | -2 | Increasing the exponent from 0x01 to 0x02 shifts the bit position to double the magnitude from one to two. |
0 10 1 | +3 | 1 10 1 | -3 | Increasing the exponent from 0x01 to 0x02 shifts the bit position to double the magnitude from 1.5 to three. |
0 11 0 | +∞ | 1 11 0 | -∞ | Positive or negative infinity: all bits of the exponent are set and all bits of the significand are cleared. |
0 11 1 | NaN | 1 11 1 | NaN | Not a Number: all bits of the exponent are set and at least one bit (typically the msb) of the significand is set. This is generally used to show the result of indeterminate operations like infinity times zero. |
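The categories in the table above depend only on whether the exponent and significand fields are all zeros, all ones, or something in between. As a rough sketch (the names `classify` and `census` are made up for this illustration, and the layout is assumed to be sign, then exponent, then significand), the rules can be written out and used to count how many bit patterns fall into each category:

```python
from collections import Counter


def classify(bits: int, exp_bits: int, sig_bits: int) -> str:
    """Name the category of a bit pattern laid out as sign | exponent | significand."""
    exp_mask = (1 << exp_bits) - 1
    exp = (bits >> sig_bits) & exp_mask      # exponent field
    sig = bits & ((1 << sig_bits) - 1)       # significand field
    if exp == 0:                             # all exponent bits cleared
        return "zero" if sig == 0 else "denormal"
    if exp == exp_mask:                      # all exponent bits set
        return "infinity" if sig == 0 else "nan"
    return "normal"


def census(exp_bits: int, sig_bits: int) -> Counter:
    """Count the categories over every pattern of 1 + exp_bits + sig_bits bits."""
    total_bits = 1 + exp_bits + sig_bits
    return Counter(classify(b, exp_bits, sig_bits) for b in range(1 << total_bits))


print(dict(census(2, 1)))  # the four-bit format
print(dict(census(3, 4)))  # the eight-bit format
```

The sign bit is ignored by `classify`, so every category comes in pairs, one with the sign bit clear and one with it set.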
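The formula for numbers (when not all exponent bits are set) is
$$(-1)^{\text{sign}} \times 2^{\,\text{exp} - \text{bias} + \lnot\text{exp}} \times \bigl((\lnot\lnot\text{exp}) + \text{significand}\bigr)$$
where ¬x means 1 if x is zero, otherwise 0 (so ¬¬x is 1 if x is nonzero, otherwise 0), and significand is the binary value of the significand bits shifted to be a fraction where the most significant bit is in the 1/2 place. For example, the four-bit pattern 0001 has exp = 0 and significand = 1/2, giving 2^(0-1+1) × (0 + 1/2) = 0.5, while 0010 has exp = 1 and significand = 0, giving 2^(1-1+0) × (1 + 0) = 1. When all exponent bits are set, the value is infinite if the significand is zero; otherwise it is not a number.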
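The formula can be checked against the four-bit table with a short sketch. It reads ¬exp as 1 when the exponent field is zero and 0 otherwise, which is the reading that reproduces the table values. The function name `decode` and its parameters are made up for this illustration, and the bias is assumed to be the exponent pattern with every bit set except the msb:

```python
def decode(bits: int, exp_bits: int, sig_bits: int) -> float:
    """Decode a sign | exponent | significand bit pattern using the formula above."""
    bias = (1 << (exp_bits - 1)) - 1          # all exponent bits set except the msb
    exp_mask = (1 << exp_bits) - 1
    sign = (bits >> (exp_bits + sig_bits)) & 1
    exp = (bits >> sig_bits) & exp_mask
    # Significand bits as a fraction: the most significant bit is worth 1/2.
    fraction = (bits & ((1 << sig_bits) - 1)) / (1 << sig_bits)

    if exp == exp_mask:                       # all exponent bits set
        magnitude = float("inf") if fraction == 0 else float("nan")
    else:
        not_exp = 1 if exp == 0 else 0        # 1 only for zero and denormals
        implied_one = 1 - not_exp             # the "not not exp" term
        magnitude = 2.0 ** (exp - bias + not_exp) * (implied_one + fraction)
    return -magnitude if sign else magnitude


# Print all sixteen four-bit patterns next to their decoded values.
for pattern in range(16):
    print(f"{pattern:04b} = {decode(pattern, 2, 1)}")
```

Calling `decode(pattern, 3, 4)` applies the same rules to the eight-bit format, where the smallest denormal, `0b00000001`, comes out as 1/64.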
If you have 16 or more bits, you can make use of a standard format. The IEEE 754 floating-point standard, which is used for most floating-point operations today, is defined in the same manner as the four-bit and eight-bit formats above, but with more bits.
Standard | Precision | bits | exponent | significand | bias |
---|---|---|---|---|---|
Non-standard | Eighth | 4 | 2 bits ⠉ | 1 bit ⢀ | 1 = 0x1 |
Non-standard | Quarter | 8 | 3 bits ⠈⠉ | 4 bits ⣀⣀ | 3 = 0x3 |
IEEE 754 | Half | 16 | 5 bits ⠋⠉ | 10 bits ⣤⣶ | 15 = 0xf |
IEEE 754 Basic | Single (typically float) | 32 | 8 bits ⠸⠋⠉ | 23 bits ⣠⣶⣶⣿ | 127 = 0x7f |
IEEE 754 Basic | Double (typically double) | 64 | 11 bits ⠿⠋⠉ | 52 bits ⣀⣴⣶⣾ ⣿⣿⣿⣿ | 1023 = 0x3ff |
x86 internal (with explicit 1) | Extended (typically long double) | 80 | 15 bits ⢸⠿⠋⠉ | 64 bits ⣿⣿⣿⣿ ⣿⣿⣿⣿ | 16383 = 0x3fff |
IEEE 754 Basic | Quadruple (possibly long double) | 128 | 15 bits ⢸⠿⠋⠉ | 112 bits ⢀⣠⣶⣶ ⣿⣿⣿⣿ ⣿⣿⣿⣿ ⣿⣿⣿⣿ | 16383 = 0x3fff |
IEEE 754 | Octuple | 256 | 19 bits ⣿⠿⠋⠉ | 236 bits ⣀⣴⣶ ⣾⣿⣿⣿ ⣿⣿⣿⣿ ⣿⣿⣿⣿ ⣿⣿⣿⣿ ⣿⣿⣿⣿ ⣿⣿⣿⣿ ⣿⣿⣿⣿ | 262143 = 0x3ffff |
IEEE 754 Interchange | Sedecuple | 512 | 23 bits ⢸⣿⠿⠋⠉ | 488 bits ⢀⣠⣶ ⣶⣿⣿⣿ ⣿⣿⣿⣿ ⣿⣿⣿⣿ ⣿⣿⣿⣿ ⣿⣿⣿⣿ ⣿⣿⣿⣿ ⣿⣿⣿⣿ ⣿⣿⣿⣿ ⣿⣿⣿⣿ ⣿⣿⣿⣿ ⣿⣿⣿⣿ ⣿⣿⣿⣿ ⣿⣿⣿⣿ ⣿⣿⣿⣿ ⣿⣿⣿⣿ | 4194303 = 0x3fffff |
IEEE 754 Interchange | Duotrigtuple | 1024 | 27 bits ⣿⣿⠿⠋⠉ | 996 bits ⣀⣴ ⣶⣾⣿⣿ ⣿⣿⣿⣿ ⣿⣿⣿⣿ ⣿⣿⣿⣿ ⣿⣿⣿⣿ ⣿⣿⣿⣿ ⣿⣿⣿⣿ ⣿⣿⣿⣿ ⣿⣿⣿⣿ ⣿⣿⣿⣿ ⣿⣿⣿⣿ ⣿⣿⣿⣿ ⣿⣿⣿⣿ ⣿⣿⣿⣿ ⣿⣿⣿⣿ ⣿⣿⣿⣿ ⣿⣿⣿⣿ ⣿⣿⣿⣿ ⣿⣿⣿⣿ ⣿⣿⣿⣿ ⣿⣿⣿⣿ ⣿⣿⣿⣿ ⣿⣿⣿⣿ ⣿⣿⣿⣿ ⣿⣿⣿⣿ ⣿⣿⣿⣿ ⣿⣿⣿⣿ ⣿⣿⣿⣿ ⣿⣿⣿⣿ ⣿⣿⣿⣿ ⣿⣿⣿⣿ | 67108863 = 0x3ffffff |
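The half, single and double rows of the table can be inspected on a real machine. The sketch below uses Python's `struct` module, whose `e`, `f` and `d` format codes pack a value as IEEE 754 half, single and double precision. The `FORMATS` table and the `fields` helper are hypothetical names introduced for this illustration, with the field widths copied from the table above:

```python
import struct

# name: (struct format code, total bits, exponent bits, significand bits)
FORMATS = {
    "half": ("e", 16, 5, 10),
    "single": ("f", 32, 8, 23),
    "double": ("d", 64, 11, 52),
}


def fields(value: float, name: str) -> tuple[int, int, int]:
    """Return the (sign, exponent, significand) bit fields of value in the named format."""
    code, total_bits, exp_bits, sig_bits = FORMATS[name]
    # Pack big-endian so the sign bit ends up as the most significant bit.
    raw = int.from_bytes(struct.pack(">" + code, value), "big")
    sign = raw >> (total_bits - 1)
    exp = (raw >> sig_bits) & ((1 << exp_bits) - 1)
    sig = raw & ((1 << sig_bits) - 1)
    return sign, exp, sig


# For the value one, the exponent field should equal the bias from the table.
for name in FORMATS:
    print(name, fields(1.0, name))
```

For 1.0 the exponent field comes out as 15, 127 and 1023, matching the bias column, and the significand field is zero because the leading 1 is implied.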
If you had only two bits... What would you give up to limit the representation to store numbers in two bits? The sign? The exponent? The significand? What would all the possible bit combinations mean?
Would you use an implied or explicit 1?
00 = ?
01 = ?
10 = ?
11 = ?