Shampine, Allen, Pruess - Fundamentals of Numerical Computing, page 3
CHAPTER 1 ERRORS AND FLOATING POINT ARITHMETIC

The effect is worse in a relative sense because the desired quantities E_n decrease as n increases. For this example there is a way to get a stable algorithm. If we could find an approximation Ê_N to E_N for some N, we could evaluate the recursion in reverse order,

    E_{n-1} = (1 - E_n)/n,    n = N, N-1, ..., 2,

to approximate E_{N-1}, E_{N-2}, ..., E_1. Studying the stability of this recursion as before, if Ê_N = E_N + ε, then

    |Ê_n - E_n| = |ε| / (N(N-1)···(n+1)),    n = N-1, N-2, ..., 1.

The recursion is so cheap and the error damps out so quickly that we can start with a poor approximation Ê_N for some large N and get accurate answers inexpensively for the E_n that really interest us. Notice that recurring in this direction, the E_n increase, making the relative errors damp out even faster. The inequality

    0 < E_n < ∫_0^1 x^n dx = 1/(n+1)

shows how to easily get an approximation to E_n with an error that we can bound. For example, if we take N = 20, the crude approximation Ê_20 = 0 has an absolute error less than 1/21 in magnitude. The magnitude of the absolute error in Ê_19 is then less than 1/(20 × 21) ≈ 0.0024, ..., and that in Ê_15 is less than 4 × 10^-8. The approximations to E_14, ..., E_1 will be even more accurate.

A stable recurrence like the second algorithm is the standard way to evaluate certain mathematical functions. It can be especially convenient for a series expansion in the functions. For example, evaluation of an expansion in Bessel functions of the first kind,

    f(x) = Σ_n a_n J_n(x),

requires the evaluation of J_n(x) for many n. Using recurrence on the index n, this is accomplished very inexpensively.

Any real number y ≠ 0 can be written in scientific notation as

    y = ±.d_1 d_2 ··· d_s d_{s+1} ··· × 10^e.    (1.1)

Here there are an infinite number of digits d_i.
Each d_i takes on one of the values 0, 1, ..., 9 and we assume the number y is normalized so that d_1 > 0. The portion .d_1 d_2 ... is called the fraction or mantissa or significand; it has the meaning

    d_1 × 10^-1 + d_2 × 10^-2 + ··· + d_s × 10^-s + ··· .

There is an ambiguity in this representation; for example, we must agree that

    0.24000000 ···

is the same as

    0.23999999 ··· .

The quantity e in (1.1) is called the exponent; it is a signed integer. Nearly all numerical computations on a digital computer are done in floating point arithmetic.
This is a number system that uses a finite number of digits to approximate the real number system used for exact computation. A system with s digits and base 10 has all of its numbers of the form

    y = ±.d_1 d_2 ··· d_s × 10^e.    (1.2)

Again, for nonzero numbers each d_i is one of the digits 0, 1, ..., 9 and d_1 > 0 for a normalized number. The exponent e also has only a finite number of digits; we assume the range

    m ≤ e ≤ M.

The number zero is special; it is written as

    0.0 ··· 0 × 10^m.

Example 1.4. If s = 1, m = -1, and M = 1, then the set of floating point numbers is

    +0.1 × 10^-1, +0.2 × 10^-1, ..., +0.9 × 10^-1,
    +0.1 × 10^0,  +0.2 × 10^0,  ..., +0.9 × 10^0,
    +0.1 × 10^1,  +0.2 × 10^1,  ..., +0.9 × 10^1,

together with the negative of each of these numbers and 0.0 × 10^-1 for zero.
There are only 55 numbers in this floating point number system. In floating point arithmetic the numbers are not equally spaced. This is illustrated in Figure 1.1, which is discussed after we consider number bases other than decimal.

Because there are only finitely many floating point numbers to represent the real number system, each floating point number must represent many real numbers. When the exponent e in (1.1) is bigger than M, it is not possible to represent y at all. If in the course of some computations a result arises that would need an exponent e > M, the computation is said to have overflowed. Typical operating systems will terminate the run on overflow.
The situation is less clear when e < m, because such a y might reasonably be approximated by zero. If such a number arises during a computation, the computation is said to have underflowed. In scientific computation it is usually appropriate to set the result to zero and continue. Some operating systems will terminate the run on underflow and others will set the result to zero and continue. Those that continue may report the number of underflows at the end of the run. If the response of the operating system is not to your liking, it is usually possible to change the response by means of a system routine.

Overflows and underflows are not unusual in scientific computation.
For example, exp(y) will overflow for y > 0 that are only moderately large, and exp(-y) will underflow. Our concern should be to prevent going out of range unnecessarily.

FORTRAN and C provide for integer arithmetic in addition to floating point arithmetic. Provided that the range of integers allowed is not exceeded, integer arithmetic is exact. It is necessary to beware of overflow because the typical operating system does not report an integer overflow; the computation continues with a number that is not related to the correct value in an obvious way.

Both FORTRAN and C provide for two precisions, that is, two arithmetics with different numbers of digits s, called single and double precision. The languages deal with mixing the various modes of arithmetic in a sensible way, but the unwary can get into trouble.
This is more likely in FORTRAN than C because by default, constants in C are double precision numbers. In FORTRAN the type of a constant is taken from the way it is written. Thus, an expression like (3/4)*5. in FORTRAN and in C means that the integer 3 is to be divided by the integer 4 and the result converted to a floating point number for multiplication by the floating point number 5. Here the integer division 3/4 results in 0, which might not be what was intended. It is surprising how often users ruin the accuracy of a calculation by providing an inaccurate value for a basic constant like π. Some constants of this kind may be predefined to full accuracy in a compiler or a library, but it should be possible to use intrinsic functions to compute accurately constants like π = acos(-1.0).

Evaluation of an asymptotic expansion for the special function Ei(x), called the exponential integral, involves computing terms of the form n!/x^n.
To contrast computations in integer and floating point arithmetic, we computed terms of this form for a range of n and x = 25 using both integer and double precision functions for the factorial. Working in C on a PC using IEEE arithmetic, it was found that the results agreed through n = 7, but for larger n the results computed with integer arithmetic were useless; the result for n = 8 was negative! The integer overflows that are responsible for these erroneous results are truly dangerous because there was no indication from the system that the answers might not be reliable.

Example 1.5. In Chapter 4 we study the use of bisection to find a number z such that f(z) = 0, that is, we compute a root of f(x).
Fundamental to this procedure is the question, Do f(a) and f(b) have opposite signs? If they do, a continuous function f(x) has a root z between a and b. Many books on programming provide illustrative programs that test for f(a)f(b) < 0. However, when f(a) and f(b) are sufficiently small, the product underflows and its sign cannot be determined. This is likely to happen because we are interested in a and b that tend to z, causing f(a) and f(b) to tend to zero.
It is easy enough to code the test so as to avoid the difficulty; it is just necessary to realize that the floating point number system does not behave quite like the real number system in this test.

As we shall see in Chapter 4, finding roots of functions is a context in which underflow is quite common. This is easy to understand because the aim is to find a z that makes f(z) as small as possible.

Example 1.6. Determinants. In Chapter 2 we discuss the solution of a system of linear equations. As a by-product of the algorithm and code presented there, the determinant of a system of n equations can be computed as the product of a set of numbers returned:

    det = y_1 y_2 ··· y_n.

Unfortunately, this expression is prone to unnecessary under- and overflows.
If, for example, M = 100 and y_1 = 10^50, y_2 = 10^60, y_3 = 10^-30, all the numbers are in range and so is the determinant 10^80. However, if we form (y_1 × y_2) × y_3, the partial product y_1 × y_2 overflows. Note that y_1 × (y_2 × y_3) can be formed. This illustrates the fact that floating point numbers do not always satisfy the associative law of multiplication that is true of real numbers.

The more fundamental issue is that because det(cA) = c^n det(A), the determinant is extremely sensitive to the scale of the matrix A when the number of equations n is large. A software remedy used in LINPACK [4] in effect extends the range of exponents available. Another possibility is to use logarithms and exponentials:

    ln|det| = Σ_i ln|y_i|,    |det| = exp(ln|det|).

If this leads to an overflow, it is because the answer cannot be represented in the floating point number system.

Example 1.7.
Magnitude. When computing the magnitude of a complex number z = x + iy,

    |z| = sqrt(x^2 + y^2),

there is a difficulty when either x or y is large. Suppose that |x| ≥ |y|. If |x| is sufficiently large, x^2 will overflow and we are not able to compute |z| even when it is a valid floating point number. If the computation is reformulated as

    |z| = |x| sqrt(1 + (y/x)^2),

the difficulty is avoided. Notice that underflow could occur when |y| << |x|. This is harmless and setting the ratio y/x to zero results in a computed |z| that has a small relative error.

The evaluation of the Euclidean norm of a vector v = (v_1, v_2, . .
. , v_n),

    ||v||_2 = ( Σ_{i=1}^n v_i^2 )^{1/2},

involves exactly the same kind of computations. Some writers of mathematical software have preferred to work with the maximum norm

    ||v||_∞ = max_i |v_i|

because it avoids the unnecessary overflows and underflows that are possible with a straightforward evaluation of the Euclidean norm.

If a real number y has an exponent in the allowed range, there are two standard ways to approximate it by a floating point number fl(y). If all digits after the first s in (1.1) are dropped, the result is known as a chopped or truncated representation. A floating point number that is usually closer to y can be found by adding 5 × 10^-(s+1) to the fraction in (1.1) and then chopping.
This is called rounding.

Example 1.8. If m = -99, M = 99, s = 5, and π = 3.1415926..., then in chopped arithmetic

    fl(π) = 0.31415 × 10^1

while

    fl(π) = 0.31416 × 10^1

in rounded arithmetic.

If the representation (1.1) of y is chopped to s digits, the relative error of fl(y) has magnitude

    |y - fl(y)| / |y| ≤ 10^(1-s).

In decimal arithmetic with s digits the unit roundoff u is defined to be 10^(1-s) when chopping is done. In a similar way it is found that

    |y - fl(y)| / |y| ≤ (1/2) × 10^(1-s)

when rounding is done. In this case u is defined to be (1/2) × 10^(1-s).