Math 441

Floating Point Numbers

Numbers in Computers

  • binary
  • \(N\) bits
  • Integers:
    • Arithmetic modulo \(2^N\)

    • Unsigned integers: \(0\) to \(2^N - 1\)

    • Signed integers: \(-2^{N-1}\) to \(2^{N-1} - 1\).

    • Exercise:

      Write down all three-bit integers, first in binary, then each as an unsigned integer, then each as a signed integer.

  • Floating point numbers
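
To make the two integer ranges above concrete, here is a minimal Python sketch that lists every \(N\)-bit pattern with its unsigned and signed value. It assumes the signed interpretation is two's complement, which is consistent with the range \(-2^{N-1}\) to \(2^{N-1} - 1\) but is not stated explicitly in these notes; the function name `integer_table` is made up for illustration.

```python
def integer_table(n_bits=3):
    """Return (bit string, unsigned value, signed value) for every n-bit pattern."""
    rows = []
    for u in range(2 ** n_bits):
        bits = format(u, f"0{n_bits}b")
        # Assumption: two's complement. Patterns with the top bit set
        # represent u - 2^N; all other patterns represent u itself.
        s = u - 2 ** n_bits if u >= 2 ** (n_bits - 1) else u
        rows.append((bits, u, s))
    return rows


for bits, u, s in integer_table(3):
    print(bits, u, s)
```

Both columns agree modulo \(2^N\), which is why the same hardware addition works for either interpretation.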

Floating Point Numbers

  • \(k\): number of bits for exponent
  • \(t\): number of bits for mantissa
  • \(1 + k + t = N\)

\[x = {\color{#44ee44}(-1)^s} \cdot {\color{red}2^e} \cdot {\color{blue}(1.f)_2} = {\color{#44ee44}(-1)^s} \cdot {\color{red}2^{(c)_2 - b}} \cdot {\color{blue}(1.f)_2}\]


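Here \(s\) is the sign bit, \(c\) is the string of \(k\) exponent bits, and \(f\) is the string of \(t\) mantissa bits. The exponent is \(e = (c)_2 - b\), where \(b\) is called the bias. Typically \(b = 2^{k-1} - 1\).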
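
The formula can be evaluated mechanically. The sketch below is one possible way to do it in Python: it splits a bit string into sign, exponent, and mantissa fields and applies \(x = (-1)^s \cdot 2^{(c)_2 - b} \cdot (1.f)_2\) with the typical bias \(b = 2^{k-1} - 1\). The function name `decode_normal` is hypothetical, and the special exponent patterns (all zeros, all ones) are deliberately not handled here.

```python
def decode_normal(bits, k, t):
    """Decode a string of 1 + k + t bits as a normal floating point number."""
    if len(bits) != 1 + k + t:
        raise ValueError("expected 1 + k + t bits")
    s = int(bits[0])              # sign bit
    c = int(bits[1:1 + k], 2)     # exponent field read as an unsigned integer
    f = int(bits[1 + k:], 2)      # mantissa field read as an unsigned integer
    b = 2 ** (k - 1) - 1          # typical bias
    mantissa = 1 + f / 2 ** t     # (1.f)_2
    return (-1) ** s * 2.0 ** (c - b) * mantissa


# 8-bit example with k = 3, t = 4:
print(decode_normal("00110000", 3, 4))   # s = 0, c = 011, f = 0000
print(decode_normal("11001000", 3, 4))   # s = 1, c = 100, f = 1000
```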

8-bit Example:

  • \(k = 3\): number of bits for exponent
  • \(t = 4\): number of bits for mantissa
  • Bias: \(b = 2^{3 - 1} - 1 = 3\)

Largest 8-bit Number:

  • \(k = 3\): number of bits for exponent
  • \(t = 4\): number of bits for mantissa
  • Bias: \(b = 2^{3 - 1} - 1 = 3\)
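
As a worked check, assuming the all-ones exponent is reserved (see Some Questions below), the largest usable exponent field is \(c = 110\) and the largest mantissa field is \(f = 1111\), so the bit pattern is 0 110 1111:

\[e = (110)_2 - b = 6 - 3 = 3\]

\[x = 2^{3} \cdot (1.1111)_2 = 8 \cdot \frac{31}{16} = 15.5\]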

“Smallest” Positive 8-bit Number:

  • \(k = 3\): number of bits for exponent
  • \(t = 4\): number of bits for mantissa
  • Bias: \(b = 2^{3 - 1} - 1 = 3\)
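
As a worked check, assuming the all-zeros exponent is reserved (see Some Questions below), the smallest usable exponent field is \(c = 001\) and the smallest mantissa field is \(f = 0000\), so the bit pattern is 0 001 0000:

\[e = (001)_2 - b = 1 - 3 = -2\]

\[x = 2^{-2} \cdot (1.0000)_2 = \frac{1}{4} = 0.25\]

This is the smallest positive normal number; the quotation marks are presumably there because subnormal numbers, discussed later, are smaller still.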

Exercise:

Consider 6-bit floating point numbers with 3-bit exponent and 2-bit mantissa:

  • How many positive floating point numbers are there?
  • What is the smallest one?
  • What is the largest one?
  • Find all of them, convert them to decimal representation, and plot them on a number line. Note how they are distributed!
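
One way to check your answers is to enumerate the numbers by brute force. The sketch below assumes that the all-zeros and all-ones exponent patterns are excluded (as in the 8-bit example) and uses the typical bias \(b = 2^{k-1} - 1\); the function name `positive_normals` is made up for illustration.

```python
def positive_normals(k, t):
    """List all positive normal numbers with k exponent bits and t mantissa bits."""
    b = 2 ** (k - 1) - 1
    values = []
    # Assumption: exponent fields 0 (all zeros) and 2^k - 1 (all ones) are reserved.
    for c in range(1, 2 ** k - 1):
        for f in range(2 ** t):
            values.append(2.0 ** (c - b) * (1 + f / 2 ** t))
    return values


vals = positive_normals(k=3, t=2)
print(len(vals), min(vals), max(vals))
print(vals)
```

Printing the gaps between consecutive values shows that the spacing doubles each time the exponent increases by one, which is the pattern the number line plot is meant to reveal.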

Some Questions

  • How do you express 0?
  • Why didn't we use all 0's and all 1's as exponents?
    • These are reserved for special cases.

\(\pm 0\), Subnormal Numbers

\(0\):

\(-0\):
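
For reference, assuming the 8-bit example follows the IEEE 754 convention (an assumption, since the bit patterns are not given in these notes), an all-zeros exponent field switches the hidden leading bit from 1 to 0 and fixes the exponent at \(1 - b\):

\[x = (-1)^s \cdot 2^{1 - b} \cdot (0.f)_2 \quad \text{when } c = 00\ldots0\]

Under that convention, \(0\) is the pattern 0 000 0000 and \(-0\) is 1 000 0000. The smallest positive subnormal number has \(f = 0001\):

\[x = 2^{1 - 3} \cdot (0.0001)_2 = 2^{-2} \cdot 2^{-4} = 2^{-6} = 0.015625\]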

\(\pm\infty\), NaN

\(\pm\infty\):

NaN:
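
The special cases can be collected into a single decoder. The sketch below assumes the IEEE 754 conventions: an all-ones exponent with zero mantissa means \(\pm\infty\), an all-ones exponent with nonzero mantissa means NaN, and an all-zeros exponent means \(\pm 0\) or a subnormal number. The function name `decode` is hypothetical.

```python
import math


def decode(bits, k, t):
    """Decode 1 + k + t bits, including zero, subnormals, infinities, and NaN."""
    if len(bits) != 1 + k + t:
        raise ValueError("expected 1 + k + t bits")
    sign = -1.0 if bits[0] == "1" else 1.0
    c = int(bits[1:1 + k], 2)     # exponent field
    f = int(bits[1 + k:], 2)      # mantissa field
    b = 2 ** (k - 1) - 1          # typical bias
    if c == 2 ** k - 1:           # all-ones exponent: infinity or NaN
        return sign * math.inf if f == 0 else math.nan
    if c == 0:                    # all-zeros exponent: +-0 or subnormal, (0.f)_2
        return sign * 2.0 ** (1 - b) * (f / 2 ** t)
    return sign * 2.0 ** (c - b) * (1 + f / 2 ** t)   # normal number, (1.f)_2


for pattern in ["00000000", "10000000", "00000001", "01110000", "11110000", "01110001"]:
    print(pattern, decode(pattern, 3, 4))
```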

IEEE 64-bit Floats

[Figure: 64-bit layout, sign | exponent | mantissa = 0 | 11111111110 | 11...1 (all mantissa bits set), with the fields labeled \(s\), \(c\), \(f\)]
  • \(k = 11\)
  • \(t = 52\)
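
To inspect these fields on a real machine, the following sketch uses Python's standard `struct` module to reinterpret a float as a 64-bit integer and slice out the sign, exponent, and mantissa bits. It assumes Python's `float` is an IEEE 64-bit float, which holds on common platforms; the function name `double_fields` is made up for illustration.

```python
import struct
import sys


def double_fields(x):
    """Return the (sign, exponent, mantissa) bit strings of a 64-bit float."""
    # Pack as a big-endian IEEE 754 double, then reread the 8 bytes as an integer.
    (n,) = struct.unpack(">Q", struct.pack(">d", x))
    bits = format(n, "064b")
    return bits[0], bits[1:12], bits[12:]   # 1 + 11 + 52 bits


for x in [1.0, -2.0, sys.float_info.max]:
    s, c, f = double_fields(x)
    print(x, s, c, f)
```

For `1.0` the exponent field equals the bias \(2^{10} - 1 = 1023\), and for `sys.float_info.max` the exponent field is the largest one that is not all ones.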