Dr. Shaneyfelt's Floating Point Construction Gizmo

How are floating point numbers typically represented as bits in a computer? Like this!


If you only had 8 bits, a non-standard quarter-precision representation could be devised using a three-bit exponent. There are only 256 eight-bit combinations, so at most 256 values could be represented. Experiment with this representation:

Check or uncheck the sign bit,
select the position of the exponent (three-bit value),
and check the desired significand bits to create a floating point number.

(Interactive gizmo: a sign bit checkbox, a three-bit exponent selector running from 000 to 111, and significand checkboxes. With all eight bits cleared, 0 0 0 0 0 0 0 0, it shows + 0 . 0 0 0 0 0 0.)
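
The three choices made in the gizmo (sign, exponent, significand) can be sketched as packing three fields into one byte. This is only an illustration: the function names `pack_quarter` and `show` are hypothetical, and the layout assumed here is sign in the most significant bit, then the three exponent bits, then four significand bits.

```python
def pack_quarter(sign: int, exponent: int, significand: int) -> int:
    """Pack the fields into one 8-bit pattern laid out as s eee ffff."""
    if sign not in (0, 1):
        raise ValueError("sign must be 0 or 1")
    if not 0 <= exponent <= 0b111:
        raise ValueError("exponent must fit in three bits")
    if not 0 <= significand <= 0b1111:
        raise ValueError("significand must fit in four bits")
    return (sign << 7) | (exponent << 4) | significand


def show(pattern: int) -> str:
    """Render an 8-bit pattern with the three fields separated by spaces."""
    bits = format(pattern, "08b")
    return f"{bits[0]} {bits[1:4]} {bits[4:]}"


print(show(pack_quarter(0, 0b011, 0b0000)))  # 0 011 0000
print(show(pack_quarter(1, 0b100, 0b1000)))  # 1 100 1000
```

Every combination of the three fields gives a distinct byte, which is where the count of 256 patterns comes from.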

If you only had 4 bits, a non-standard eighth-precision representation could be devised using a two-bit exponent. There are only 16 four-bit combinations, so every possible value can be listed:

0000 = +0    1000 = -0    Positive or negative zero: all bits of the exponent are cleared, and all bits of the significand are cleared.
0001 = +0.5  1001 = -0.5  Denormal: all bits of the exponent are cleared. The value is scaled like the next higher exponent but without the implied leading 1.
0010 = +1    1010 = -1    Unit: all bits of the exponent are set except the msb, which is cleared. This exponent value is called the bias. If the significand bits are cleared, this is the value for positive or negative one.
0011 = +1.5  1011 = -1.5  A one in the most significant bit of the significand adds a fraction of half the value of the implied one. Each bit to the right is worth half of the previous one.
0100 = +2    1100 = -2    Increasing the exponent from 0x01 to 0x02 shifts the bit position to double the magnitude from one to two.
0101 = +3    1101 = -3    Increasing the exponent from 0x01 to 0x02 shifts the bit position to double the magnitude from 1.5 to three.
0110 = +∞    1110 = -∞    Positive or negative infinity: all bits of the exponent are set, and all bits of the significand are cleared.
0111 = NaN   1111 = NaN   Not a Number: all bits of the exponent are set, and at least one bit (typically the msb) of the significand is set. This is generally used to show the result of indeterminate operations such as infinity times zero.
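
The categories in this table depend only on whether the exponent and significand fields are all clear, all set, or something in between. A minimal sketch of that rule follows; the function name `classify` and its default field widths (two exponent bits, one significand bit) are assumptions chosen to match the four-bit format.

```python
def classify(pattern: int, exp_bits: int = 2, sig_bits: int = 1) -> str:
    """Name the category of a bit pattern from its exponent and significand fields."""
    exp_mask = (1 << exp_bits) - 1
    exponent = (pattern >> sig_bits) & exp_mask
    significand = pattern & ((1 << sig_bits) - 1)
    if exponent == 0:
        # All exponent bits cleared: zero or denormal.
        return "zero" if significand == 0 else "denormal"
    if exponent == exp_mask:
        # All exponent bits set: infinity or Not a Number.
        return "infinity" if significand == 0 else "NaN"
    return "normal"


for pattern in range(16):
    print(format(pattern, "04b"), classify(pattern))
```

The sign bit is ignored on purpose, since each category appears once for each sign.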


The formula for numbers (when not all exponent bits are set) is

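$$(-1)^{\text{sign}} \times 2^{\,\text{exp} - \text{bias} + \lnot\text{exp}} \times \bigl((\lnot\lnot\text{exp}) + \text{significand}\bigr)$$

where $\lnot x$ means 1 if x is zero, otherwise 0 (so $\lnot\lnot x$ is 1 if x is nonzero, otherwise 0), and significand is the binary value of the significand bits shifted to be a fraction where the most significant bit is in the 1/2 place. For example, the four-bit pattern 0001 has exp = 0 and bias = 1, giving $2^{0-1+1} \times (0 + 0.5) = 0.5$. When all exponent bits are set, the value is infinite if the significand is zero; otherwise it is not a number.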
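
As a rough check of the formula, here is a small sketch that decodes a bit pattern for any choice of field widths. It takes ¬exp to be 1 only when the exponent field is zero, which is the reading that reproduces the four-bit table, and it takes the bias to be the exponent pattern with all bits set except the msb. The name `decode` and its parameters are hypothetical.

```python
import math


def decode(pattern: int, exp_bits: int, sig_bits: int) -> float:
    """Decode a sign/exponent/significand bit pattern into a Python float."""
    sign = (pattern >> (exp_bits + sig_bits)) & 1
    exp = (pattern >> sig_bits) & ((1 << exp_bits) - 1)
    # Shift the significand so its most significant bit is in the 1/2 place.
    frac = (pattern & ((1 << sig_bits) - 1)) / (1 << sig_bits)
    bias = (1 << (exp_bits - 1)) - 1

    if exp == (1 << exp_bits) - 1:  # all exponent bits set
        if frac:
            return math.nan
        return -math.inf if sign else math.inf

    not_exp = 1 if exp == 0 else 0  # 1 only for zero and denormals
    implied = 1 - not_exp           # the implied leading 1
    return (-1) ** sign * 2.0 ** (exp - bias + not_exp) * (implied + frac)


# Four-bit format: 2 exponent bits, 1 significand bit.
print([decode(p, 2, 1) for p in range(6)])  # [0.0, 0.5, 1.0, 1.5, 2.0, 3.0]
# Eight-bit format: 3 exponent bits, 4 significand bits.
print(decode(0b00110000, 3, 4))  # 1.0
```

The same function handles wider formats by passing larger field widths, for example `decode(p, 5, 10)` for a 16-bit pattern.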


If you have 16 or more bits, you can make use of a standard format. The IEEE 754 floating point standard, which is used for most floating point operations today, is defined in the same manner as the four-bit and eight-bit formats above, but with more bits.
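
To see the same three fields in a real IEEE 754 number, the sketch below pulls them out of a 64-bit double using Python's standard `struct` module. The helper name `double_fields` is hypothetical; the field widths (1, 11, and 52 bits) are those of the double format.

```python
import struct


def double_fields(x: float) -> tuple[int, int, int]:
    """Return (sign, exponent, significand) fields of x as a 64-bit double."""
    # Reinterpret the eight bytes of the double as one unsigned integer.
    (bits,) = struct.unpack("<Q", struct.pack("<d", x))
    sign = bits >> 63
    exponent = (bits >> 52) & 0x7FF
    significand = bits & ((1 << 52) - 1)
    return sign, exponent, significand


print(double_fields(1.0))   # (0, 1023, 0)
print(double_fields(-2.0))  # (1, 1024, 0)
print(double_fields(1.5))   # (0, 1023, 2251799813685248)
```

The value one has an exponent field equal to the bias (1023), just as 0010 does in the four-bit format, and 1.5 sets only the most significant significand bit.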

Standard                        Precision                          Bits  Exponent  Significand  Bias
Non-standard                    Eighth                                4   2 bits      1 bit     1 = 0x1
Non-standard                    Quarter                               8   3 bits      4 bits    3 = 0x3
IEEE 754                        Half                                 16   5 bits     10 bits    15 = 0xf
IEEE 754 Basic                  Single (typically float)             32   8 bits     23 bits    127 = 0x7f
IEEE 754 Basic                  Double (typically double)            64  11 bits     52 bits    1023 = 0x3ff
x86 internal (with explicit 1)  Extended (typically long double)     80  15 bits     64 bits    16383 = 0x3fff
IEEE 754 Basic                  Quadruple (possibly long double)    128  15 bits    112 bits    16383 = 0x3fff
IEEE 754                        Octuple                             256  19 bits    236 bits    262143 = 0x3ffff
IEEE 754 Interchange            Sedecuple                           512  23 bits    488 bits    4194303 = 0x3fffff
IEEE 754 Interchange            Duotrigintuple                     1024  27 bits    996 bits    67108863 = 0x3ffffff

The basic IEEE 754 floating point types are typically implemented in C/C++ as float, double, and long double.
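
The bias column follows directly from the exponent width: it is the exponent pattern with every bit set except the msb. A short sketch, using a hypothetical helper named `bias`, that recomputes the bias for several exponent widths from the table:

```python
def bias(exp_bits: int) -> int:
    """Exponent bias: all exponent bits set except the most significant one."""
    return (1 << (exp_bits - 1)) - 1


# Exponent widths for some of the formats listed in the table.
exponent_widths = {
    "eighth": 2,
    "quarter": 3,
    "half": 5,
    "single": 8,
    "double": 11,
    "quadruple": 15,
    "octuple": 19,
}

for name, width in exponent_widths.items():
    print(f"{name:10} {width:2} bits  bias {bias(width)} = {hex(bias(width))}")
```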

If you had only two bits... What would you give up to limit the representation to store numbers in two bits? The sign? The exponent? The significand? What would all the possible bit combinations mean? Would you use an implied or explicit 1?
00 = ?
01 = ?
10 = ?
11 = ?