Math 441

Approximation Errors

Some Questions

How do you express 0?
Why didn’t we use all 0’s and all 1’s as exponents?
- These are reserved for special cases.

\(\pm 0\), Subnormal Numbers

\(0\): sign bit 0, exponent all 0’s, mantissa all 0’s.

\(-0\): sign bit 1, exponent all 0’s, mantissa all 0’s.

\(\pm\infty\), NaN

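\(\pm\infty\): exponent all 1’s, mantissa all 0’s; the sign bit selects \(+\infty\) or \(-\infty\).

NaN: exponent all 1’s, mantissa not all 0’s.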

IEEE 64-bit Floats

\(k = 11\)
\(t = 52\)
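As a rough illustration of the reserved exponent patterns, the following Python sketch splits a 64-bit float into its sign, exponent, and mantissa fields. The helper name `fields` and the constants `K` and `T` are made up for this example, and it assumes that Python's `float` is an IEEE 64-bit float, which holds on common platforms.

```python
import struct

K, T = 11, 52  # exponent and mantissa widths of a 64-bit float


def fields(x):
    """Return (sign bit, exponent bit string, mantissa as an integer)."""
    # Reinterpret the 8 bytes of the float as one unsigned 64-bit integer.
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> (K + T)
    exponent = (bits >> T) & ((1 << K) - 1)
    mantissa = bits & ((1 << T) - 1)
    return sign, format(exponent, f"0{K}b"), mantissa


# Special values use the all-0's and all-1's exponents.
for value in (0.0, -0.0, 5e-324, float("inf"), float("-inf"), float("nan")):
    print(repr(value), fields(value))
```

Here `5e-324` is the smallest positive subnormal number: its exponent field is all 0's, but its mantissa is nonzero.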

The Gaps

How big is the gap between the largest 64-bit float and the second-largest 64-bit float?
How big is the gap between 1.0 and the smallest 64-bit float that is larger than 1?
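One way to explore these two questions numerically is to step to the neighbouring float by adding or subtracting 1 from the underlying bit pattern. This is only a sketch: `next_up` and `next_down` are hypothetical helpers written for this example, they are valid only for positive finite inputs, and the code assumes Python's `float` is an IEEE 64-bit float.

```python
import struct
import sys


def _to_bits(x):
    return struct.unpack(">Q", struct.pack(">d", x))[0]


def _from_bits(bits):
    return struct.unpack(">d", struct.pack(">Q", bits))[0]


def next_up(x):
    """Smallest float larger than x (assumes x is positive and finite)."""
    return _from_bits(_to_bits(x) + 1)


def next_down(x):
    """Largest float smaller than x (assumes x is positive and finite)."""
    return _from_bits(_to_bits(x) - 1)


largest = sys.float_info.max
gap_top = largest - next_down(largest)  # gap between the two largest floats
gap_one = next_up(1.0) - 1.0            # gap just above 1.0

print(gap_top == 2.0 ** 971)   # 2^(1023 - 52)
print(gap_one == 2.0 ** -52)   # 2^(0 - 52)
```

In both cases the gap is \(2^{e - t}\) for the exponent \(e\) of the numbers involved, so the gaps grow with the magnitude of the numbers.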

Absolute and Relative Errors

Suppose \(x^*\) is an approximation of a real number \(x\).

\(\left\lvert x - x^*\right\rvert\) is the absolute approximation error.
\(\displaystyle \frac{\left\lvert x - x^*\right\rvert}{\left\lvert x\right\rvert}\) is the relative approximation error.

We say that \(x^*\) approximates \(x\) to \(s\) significant digits if \(s\) is the largest non-negative integer such that

\[\frac{\left\lvert x - x^*\right\rvert}{\left\lvert x\right\rvert} \le 5\times 10^{-s}\]
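The definition can be tried out with a small Python sketch. The function name `significant_digits` is invented for this example; it searches for the largest \(s\) directly instead of using a logarithm, and since the inputs are themselves floats, results that sit extremely close to a boundary \(5\times 10^{-s}\) should be treated with caution.

```python
import math


def significant_digits(x, x_star):
    """Largest non-negative integer s with |x - x*| / |x| <= 5 * 10**(-s)."""
    rel = abs(x - x_star) / abs(x)
    if rel == 0:
        return math.inf  # exact match: the bound holds for every s
    if rel > 5:
        return None      # no non-negative s satisfies the bound
    s = 0
    while rel <= 5 * 10.0 ** -(s + 1):
        s += 1
    return s


print(significant_digits(math.pi, 3.14))    # relative error about 5.07e-4
print(significant_digits(math.pi, 22 / 7))  # relative error about 4.02e-4
```

With this definition, \(3.14\) approximates \(\pi\) to 3 significant digits, while \(22/7\) approximates it to 4.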

Floating Point Approximation

Given a real number \(x\) and a floating point system with a \(k\)-digit exponent and a \(t\)-digit mantissa:

Denote by \(\operatorname{fl}(x)\) the floating point approximation of \(x\).
Example with \(t = 4\)
- Given \(x = 1.01010110100101\times 2^e\)
- \(\operatorname{fl}(x) = 1.0101\times 2^e\)
- \(x - \operatorname{fl}(x) = 0.00000110100101 \times 2^e\)
- \(x - \operatorname{fl}(x) = 0.0110100101 \times 2^{e - 4}\)
- What if \(x = 1.0101{\color{red}1}110100101\times 2^e\)? (chopping vs. rounding)
\[\left\lvert x -\operatorname{fl}(x)\right\rvert \le \begin{cases} 2^{e - t} & \text{ if chopping}\\ 2^{e - t - 1} & \text{ if rounding}\end{cases}\]
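The \(t = 4\) example, with \(e = 0\), can be checked with exact rational arithmetic. The sketch below is an illustration only: `from_binary`, `exponent`, and `fl` are names made up here, the exponent range is treated as unlimited, and rounding is implemented as round-half-up, which is simpler than the round-half-to-even rule used by IEEE hardware.

```python
import math
from fractions import Fraction


def from_binary(s):
    """Parse a binary string such as '1.0101' into an exact Fraction."""
    whole, frac = s.split(".")
    return Fraction(int(whole + frac, 2), 2 ** len(frac))


def exponent(x):
    """Return e such that 2**e <= |x| < 2**(e + 1), for nonzero x."""
    e, ax = 0, abs(x)
    while ax >= 2:
        ax, e = ax / 2, e + 1
    while ax < 1:
        ax, e = ax * 2, e - 1
    return e


def fl(x, t, mode):
    """Keep t binary digits after the leading 1, by chopping or rounding."""
    e = exponent(x)
    scaled = abs(x) * Fraction(2) ** (t - e)  # kept digits are now the integer part
    n = math.floor(scaled) if mode == "chop" else math.floor(scaled + Fraction(1, 2))
    result = n * Fraction(2) ** (e - t)
    return result if x >= 0 else -result


t = 4
x1 = from_binary("1.01010110100101")
x2 = from_binary("1.01011110100101")  # the digit after the cut is 1
for x in (x1, x2):
    for mode in ("chop", "round"):
        approx = fl(x, t, mode)
        print(mode, approx, float(abs(x - approx)))
```

For `x2`, chopping gives \(1.0101_2 = 21/16\) while rounding gives \(1.0110_2 = 22/16\), and the errors stay within \(2^{-4}\) and \(2^{-5}\) respectively.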

Relative Error

We know that \(\left\lvert x\right\rvert = 1.\text{␣␣␣␣␣␣␣␣␣␣␣␣␣}\cdots \times 2^e \ge 2^e\), so dividing the absolute error bound by \(\left\lvert x\right\rvert\) gives:

\[\frac{\left\lvert x -\operatorname{fl}(x)\right\rvert}{\left\lvert x\right\rvert} \le \begin{cases} 2^{-t} & \text{ if chopping}\\ 2^{-t - 1} & \text{ if rounding}\end{cases}\]
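The rounding bound can be spot-checked for 64-bit floats, where \(t = 52\) and the bound is \(2^{-53}\). This sketch assumes that Python's `float` is an IEEE 64-bit float with round-to-nearest and that converting a `Fraction` to `float` is correctly rounded, as it is in CPython; `relative_error` is a helper name chosen for this example.

```python
from fractions import Fraction

T = 52
unit_roundoff = Fraction(1, 2 ** (T + 1))  # 2^(-t-1), the rounding bound


def relative_error(x):
    """Exact relative error of fl(x) for an exact rational x."""
    approx = Fraction(float(x))  # float(x) plays the role of fl(x)
    return abs(x - approx) / abs(x)


samples = (Fraction(1, 10), Fraction(1, 3), Fraction(22, 7))
for x in samples:
    err = relative_error(x)
    print(x, float(err), err <= unit_roundoff)
```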

Error Propagation

Floating point numbers are not closed under arithmetic operations!

For example, if \(a\) and \(b\) are floating point numbers, \(a + b\) is not necessarily a floating point number!
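A concrete instance, sketched in Python under the assumption that `float` is an IEEE 64-bit float: both \(1\) and \(2^{-53}\) are floats, but their exact sum \(1 + 2^{-53}\) would need 53 mantissa digits, so the computed sum is rounded.

```python
from fractions import Fraction

a = 1.0
b = 2.0 ** -53  # a power of two, so it is exactly representable

exact_sum = Fraction(a) + Fraction(b)  # exact rational arithmetic
computed_sum = a + b                   # floating point addition

# The exact sum is a float only if converting it to float loses nothing.
is_representable = Fraction(float(exact_sum)) == exact_sum

print(computed_sum == 1.0)  # the sum was rounded back to 1.0
print(is_representable)
```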

“Catastrophic” Cancellation

What happens when we subtract two numbers that are very close to each other?

Example

Using 6-digit decimal floats with rounding, compute \(x - y\) for \(x = 1.123456\) and \(y = 1.123447\).
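This example can be reproduced with Python's `decimal` module, as a sketch that interprets "6-digit" as 6 significant decimal digits and uses round-half-to-even (an assumption; the choice of tie rule does not affect these particular numbers).

```python
from decimal import Context, Decimal, ROUND_HALF_EVEN
from fractions import Fraction

ctx = Context(prec=6, rounding=ROUND_HALF_EVEN)  # 6 significant digits

x = Decimal("1.123456")
y = Decimal("1.123447")

fl_x = ctx.create_decimal(x)       # x rounded to 6 digits
fl_y = ctx.create_decimal(y)       # y rounded to 6 digits
computed = ctx.subtract(fl_x, fl_y)

exact = x - y  # the default context has enough precision to be exact here

rel_error = abs(Fraction(exact) - Fraction(computed)) / abs(Fraction(exact))
print(fl_x, fl_y, computed, exact, rel_error)
```

The inputs are each accurate to a relative error of about \(10^{-6}\), yet the computed difference \(0.00001\) differs from the exact difference \(0.000009\) by a relative error of \(1/9\), roughly 11%.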