Ignoring any error made in evaluating the formula for s, show that an absolute error δc in c induces an absolute error in s of approximately

$$\delta_s \approx \left| \frac{c}{s} \right| \delta_c.$$

For the range 0 ≤ θ ≤ π/2, are there θ for which this way of computing sin(θ) has an accuracy comparable to the accuracy of cos(θ)? Are there θ for which it is much less accurate? Repeat for relative errors.

1.2 EXAMPLES OF FLOATING POINT CALCULATIONS

The floating point number system has properties that are similar to those of the real number system, but they are not identical. We have already seen some differences due to the finite range of exponents.
It might be thought that because one arithmetic operation can be carried out with small relative error, the same would be true of several operations. Unfortunately this is not true. We shall see that multiplication and division are more satisfactory in this respect than addition and subtraction.

For floating point numbers x, y, and z,

$$(x \otimes y) \otimes z = \big[xy(1 + \delta_1)\big] \otimes z = xyz(1 + \delta_1)(1 + \delta_2), \qquad |\delta_1|, |\delta_2| \le u.$$

The product (1 + δ1)(1 + δ2) = 1 + ε, where ε is "small," and can, of course, be explicitly bounded in terms of u. It is more illuminating to note that

$$(1 + \delta_1)(1 + \delta_2) = 1 + \delta_1 + \delta_2 + \delta_1\delta_2 \approx 1 + \delta_1 + \delta_2,$$

so that ε ≈ δ1 + δ2, and an approximate bound for ε is 2u.
Before generalizing this, we observe that it may well be the case that

$$(x \otimes y) \otimes z \ne x \otimes (y \otimes z)$$

even when the exponent range is not exceeded. However,

$$x \otimes (y \otimes z) = xyz(1 + \delta_3)(1 + \delta_4),$$

so that

$$(x \otimes y) \otimes z = \big[x \otimes (y \otimes z)\big](1 + \eta),$$

where η is "small." Thus, the associative law for multiplication is approximately true.
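These observations are easy to check experimentally. Here is a minimal Python sketch, not from the book, assuming IEEE double precision (so u = 2^-53); exact rational arithmetic serves as the reference:

    # Floating point multiplication is only approximately associative, and the
    # relative error of a running product grows at most additively, about one
    # unit of roundoff u per multiplication.
    from fractions import Fraction

    x, y, z = 0.1, 0.7, 49.0
    print((x * y) * z == x * (y * z))   # may print False

    factors = [1.0 + k * 1e-6 for k in range(1000)]
    p = 1.0                              # running product P_i = P_{i-1} (x) x_i
    exact = Fraction(1)
    for f in factors:
        p *= f
        exact *= Fraction(f)             # Fraction(float) is exact
    rel_err = abs(Fraction(p) - exact) / exact
    print(float(rel_err), 1000 * 2.0**-53)   # observed error vs. the n*u bound

The observed error is typically far below the bound n·u; the bound only says how fast the error *could* grow.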
In general, if we wish to multiply x1, x2, ..., xn, we might do this by the algorithm

$$P_1 = x_1, \qquad P_i = P_{i-1} \otimes x_i, \quad i = 2, 3, \ldots, n.$$

Treating these operations in real arithmetic we find that

$$P_i = x_1 x_2 \cdots x_i (1 + \delta_2)(1 + \delta_3) \cdots (1 + \delta_i),$$

where each |δj| ≤ u. The relative error of each Pi can be bounded in terms of u without difficulty, but more insight is obtained if we approximate

$$P_i \approx x_1 x_2 \cdots x_i (1 + \delta_2 + \delta_3 + \cdots + \delta_i),$$

which comes from neglecting products of the δj. Then

$$|\delta_2 + \delta_3 + \cdots + \delta_i| \le (i - 1)u.$$

This says that a bound on the approximate relative errors grows additively. Each multiplication could increase the relative error by no more than one unit of roundoff. Division can be analyzed in the same way, and the same conclusion is true concerning the possible growth of the relative error.

Example 1.10. The gamma function, defined as

$$\Gamma(x) = \int_0^\infty t^{x-1} e^{-t}\, dt,$$

generalizes the factorial function for integers to real numbers x (and complex x as well).
This follows from the fundamental recursion

$$\Gamma(x + 1) = x\,\Gamma(x) \qquad (1.5)$$

and the fact that Γ(1) = 1. A standard way of approximating Γ(x) for x ≥ 2 uses the fundamental recursion to reduce the task to approximating Γ(y) for 2 < y < 3. This is done by letting N be an integer such that N < x < N + 1, letting y = x − N + 2, and then noting that repeated applications of (1.5) yield

$$\Gamma(x) = (x - 1)(x - 2) \cdots (x - N + 2)\,\Gamma(y).$$

The function Γ(y) can be approximated well by the ratio R(y) of two polynomials for 2 < y < 3.
Hence, we approximate

$$\Gamma(x) \approx (x - 1)(x - 2) \cdots (x - N + 2)\,R(y).$$

If x is not too large, little accuracy is lost when these multiplications are performed in floating point arithmetic. However, it is not possible to evaluate Γ(x) for large x by this approach because its value grows very quickly as a function of x. This can be seen from the Stirling formula (see Case Study 5)

$$\Gamma(x) \approx \sqrt{2\pi}\, x^{x - 1/2} e^{-x}.$$

This example makes another point: the virtue of floating point arithmetic is that it automatically deals with numbers of greatly different size. Unfortunately, many of the special functions of mathematical physics grow or decay extremely fast. It is by no means unusual that the exponent range is exceeded.
When this happens it is necessary to reformulate the problem to make it better scaled. For example, it is often better to work with the special function ln Γ(x) than with Γ(x) because it is better scaled. ■
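Here is a hedged Python sketch of the reduction in Example 1.10, not the book's code; since the book's rational approximation R(y) is not reproduced here, math.gamma stands in for it on the reduced argument:

    import math

    def gamma_by_recursion(x):
        """Gamma(x) for x >= 2 via Gamma(x) = (x-1)(x-2)...(x-N+2)*Gamma(y)."""
        y = x - math.floor(x) + 2.0      # reduced argument, 2 <= y < 3
        p = 1.0
        z = x - 1.0
        while z >= 2.0:                  # accumulate the factors (x-1)...(x-N+2)
            p *= z
            z -= 1.0
        return p * math.gamma(y)         # math.gamma plays the role of R(y)

    print(gamma_by_recursion(7.5), math.gamma(7.5))
    # For large x the product overflows even though ln Gamma(x) is modest,
    # so the better-scaled quantity is computed directly:
    print(math.lgamma(500.0))            # fine; gamma_by_recursion(500.0) -> inf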
Addition and subtraction are much less satisfactory in floating point arithmetic than are multiplication and division. It is necessary to be alert for several situations that will be illustrated. When numbers of greatly different magnitudes are added (or subtracted), some information is lost. Suppose, for example, that we want to add δ = 0.123456 × 10^-4 to 0.100000 × 10^1 in six-digit chopped arithmetic. First, the exponents are adjusted to be the same and then the numbers are added:

      0.100000      × 10^1
    + 0.0000123456  × 10^1
    = 0.1000123456  × 10^1

The result is chopped to 0.100012 × 10^1.
Notice that some of the digits did not participate in the addition. Indeed, if |y| < |x|u, then x ⊕ y = x and the "small" number y plays no role at all. The loss of information does not mean the answer is inaccurate; it is accurate to one unit of roundoff. The problem is that the lost information may be needed for later calculations.
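The same threshold is easy to see in IEEE double precision; a minimal illustration, not from the book, with u = 2^-53:

    x = 1.0
    u = 2.0 ** -53             # unit roundoff for IEEE double precision
    y = 0.25 * u               # well below the threshold |x|*u
    print(x + y == x)          # True: y plays no role at all
    print((x + y) - x)         # 0.0: the information in y is lost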
Example 1.11. Difference quotients. Earlier we made use of the fact that for small δ,

$$F(x + \delta) \approx F(x) + F'(x)\,\delta.$$

In many applications this is used to approximate F'(x):

$$F'(x) \approx \frac{F(x + \delta) - F(x)}{\delta}.$$

To get an accurate approximation, δ must be "small" compared to x. It had better not be too small for the precision, or else we would have x ⊕ δ = x and compute a value of zero for F'(x). If δ is large enough to affect the sum but still "small," some of its digits will not affect the sum in the sense that x ⊕ δ ≠ x + δ. In the difference quotient we want to divide by the actual difference of the arguments, not δ itself. A better way to proceed is to define

$$\Delta = (x \oplus \delta) \ominus x$$

and approximate

$$F'(x) \approx \frac{F(x \oplus \delta) - F(x)}{\Delta}.$$

The two approximations are mathematically equivalent, but computationally different. For example, suppose that F(x) = x and we approximate F'(x) for x = 1 using δ = 0.123456 × 10^-4 in six-digit chopped arithmetic.
We have just worked out 1 ⊕ δ = 0.100012 × 10^1; similarly, Δ = (1 ⊕ δ) ⊖ 1 = 0.120000 × 10^-4, showing the digits of δ that actually affect the sum. The first formula gives

$$\frac{F(1 \oplus \delta) - F(1)}{\delta} = \frac{0.120000 \times 10^{-4}}{0.123456 \times 10^{-4}} \approx 0.972006.$$

The second gives

$$\frac{F(1 \oplus \delta) - F(1)}{\Delta} = \frac{0.120000 \times 10^{-4}}{0.120000 \times 10^{-4}} = 1.00000.$$

Obviously the second form provides a better approximation to F'(1) = 1. Quality codes for the numerical approximation of the Jacobian matrices needed for optimization, root solving, and the solution of stiff differential equations make use of this simple device. ■
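The device carries over directly to double precision. A sketch, with an illustrative function F and step δ that are not from the book:

    def fprime_naive(F, x, delta):
        return (F(x + delta) - F(x)) / delta

    def fprime_actual_step(F, x, delta):
        Delta = (x + delta) - x          # the difference actually realized
        return (F(x + delta) - F(x)) / Delta

    F = lambda t: t * t                  # F'(x) = 2x
    x, delta = 1.0, 1e-9
    print(fprime_naive(F, x, delta))         # polluted by unrealized digits of delta
    print(fprime_actual_step(F, x, delta))   # noticeably closer to 2.0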
Example 1.12. Limiting precision. In many of the codes in this book we attempt to recognize when we cannot achieve a desired accuracy with the precision available. The kind of test we make will be illustrated in terms of approximating a definite integral

$$\int_a^b f(x)\, dx.$$

This might be done by splitting the integration interval [a, b] into pieces [α, β] and adding up approximations on all the pieces. Suppose that

$$\int_\alpha^\beta f(x)\, dx \approx (\beta - \alpha)\, f\!\left(\alpha + \frac{\beta - \alpha}{2}\right).$$

The accuracy of this formula improves as the length β − α of the piece is reduced, so that, mathematically, any accuracy can be obtained by making this width sufficiently small. However, if |β − α| < 2u|α|, the floating point numbers α and α + (β − α)/2 are the same. The details of the test are not important for this chapter; the point is that when the interval is small enough, we cannot ignore the fact that there are only a finite number of digits in floating point arithmetic. If α and β cannot be distinguished in the precision available, the computational results will not behave like mathematical results from the real number system. In this case the user of the software must be warned that the requested accuracy is not feasible. ■
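A minimal sketch of such a feasibility test in Python; the function name and reporting are illustrative, not the book's:

    import sys

    u = sys.float_info.epsilon / 2       # unit roundoff for IEEE doubles

    def piece_too_small(alpha, beta):
        """True when alpha and alpha + (beta - alpha)/2 coincide in floating
        point, so subdividing further cannot change the computed answer."""
        return abs(beta - alpha) < 2.0 * u * abs(alpha)

    print(piece_too_small(1.0, 1.0 + 1e-12))   # False: still room to subdivide
    print(piece_too_small(1.0, 1.0 + 1e-17))   # True: accuracy request infeasible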
Example 1.13. Summing a divergent series. The sum S of a series

$$S = \sum_{n=1}^{\infty} a_n$$

is the limit of partial sums

$$S_n = \sum_{k=1}^{n} a_k.$$

There is an obvious algorithm for evaluating S:

$$S_1 = a_1, \qquad S_n = S_{n-1} \oplus a_n, \quad n = 2, 3, \ldots,$$

continuing until the partial sums stop changing. A classic example of a divergent series is the harmonic series

$$\sum_{n=1}^{\infty} \frac{1}{n}.$$

If the above algorithm is applied to the harmonic series, the computed S_n increase and the a_n decrease until S_n ⊕ a_{n+1} = S_n, and the partial sums stop changing. The surprising thing is how small S_n is when this happens; try it and see. In floating point arithmetic this divergent series has a finite sum.
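One way to "try it and see" in Python, not from the book; single precision (via NumPy) is used so the stall happens after only a few million terms, whereas in double precision it would take on the order of 10^14 terms:

    import numpy as np

    s = np.float32(0.0)
    n = 0
    while True:
        n += 1
        new_s = s + np.float32(1.0) / np.float32(n)   # S_n = S_{n-1} (+) a_n
        if new_s == s:                                # partial sums stopped changing
            break
        s = new_s
    print(n, s)    # n is a few million, yet s is only about 15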
The observation that when the terms become small enough, the partial sums stop changing is true of convergent as well as divergent series. Whether the value so obtained is an accurate approximation to S depends on how fast the series converges. It really is necessary to do some mathematical analysis to get reliable results. Later in this chapter we consider how to sum the terms a little more accurately. ■

An acute difficulty with addition and subtraction occurs when some information, lost due to adding numbers of greatly different size, is needed later because of a subtraction. Before going into this, we need to discuss a rather tricky point.

Example 1.14. Cancellation (loss of significance). Subtracting a number y from a number x that agrees with it in one or more leading digits leads to a cancellation of these digits.
For example, if x = 0.123654 × 10^-5 and y = 0.123456 × 10^-5, then

      0.123654 × 10^-5
    − 0.123456 × 10^-5
    = 0.000198 × 10^-5 = 0.198000 × 10^-8.

The interesting point is that when cancellation takes place, the subtraction is done exactly, so that x ⊖ y = x − y. The difficulty is what is called a loss of significance. When cancellation takes place, the result x − y is smaller in size than x and y, so errors already present in x and y are relatively larger in x − y.
Suppose that x is an approximation to X and y is an approximation to Y. They might be measured values or the results of some computations. The difference x − y is an approximation to X − Y with the magnitude of its relative error satisfying

$$\left| \frac{(X - Y) - (x - y)}{X - Y} \right| \le \frac{|X - x| + |Y - y|}{|X - Y|}.$$

If x is so close to y that there is cancellation, the relative error can be large because the denominator X − Y is small compared to X or Y. For example, if X = 0.123654700··· × 10^-5, then x = 0.123654 × 10^-5 agrees with X to a unit roundoff in six-digit arithmetic. With Y = y, the value we seek is X − Y = 0.198700··· × 10^-8. Even though the subtraction x − y = 0.198000 × 10^-8 is done exactly, x − y and X − Y differ in the fourth digit. In this example, x and y have at least six significant digits, but their difference has only three significant digits. ■

It is worth remarking that we made use of cancellation in Example 1.11 when we computed Δ = (x ⊕ δ) ⊖ x. Because δ is small compared to x, there is cancellation and the subtraction is done exactly. In this way we obtain in Δ the digits of δ that actually affected the sum.
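The loss of significance is easy to reproduce in double precision; an illustrative sketch with arbitrary values:

    # The subtraction of nearby numbers is exact, but errors already present
    # in the operands become relatively much larger in the difference.
    X, Y = 0.1236547, 0.1234567            # "true" values (illustrative)
    x = X * (1 + 1e-9)                      # approximations carrying tiny
    y = Y * (1 - 1e-9)                      #   relative errors (~1e-9)
    rel_err = abs((x - y) - (X - Y)) / abs(X - Y)
    print(rel_err)    # about 1e-6: the relative error grew by a factor ~1000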
Example 1.15. Roots of a quadratic. Suppose we wish to compute the roots of

$$x^2 + bx + c = 0.$$

The familiar quadratic formula gives the roots x1 and x2 as

$$x_1 = -\frac{b}{2} - \sqrt{\left(\frac{b}{2}\right)^2 - c}, \qquad x_2 = -\frac{b}{2} + \sqrt{\left(\frac{b}{2}\right)^2 - c},$$

assuming b > 0. If c is small compared to b, the square root can be rewritten and approximated using the binomial series to obtain

$$\sqrt{\left(\frac{b}{2}\right)^2 - c} = \frac{b}{2}\sqrt{1 - \frac{4c}{b^2}} \approx \frac{b}{2}\left(1 - \frac{2c}{b^2}\right) = \frac{b}{2} - \frac{c}{b}.$$

This shows that the true roots satisfy

$$x_1 \approx -b, \qquad x_2 \approx -\frac{c}{b}.$$

In finite precision arithmetic some of the digits of c have no effect on the sum (b/2)² − c. The extreme case is

$$\left(\frac{b}{2}\right)^2 \ominus c = \left(\frac{b}{2}\right)^2 \quad \text{even though } c \ne 0.$$

It is important to appreciate that this quantity is computed accurately in a relative sense. However, some information is lost and we shall see that in some circumstances we need it later in the computation.
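A standard remedy, sketched below in Python, anticipates this discussion and is not quoted from the book: compute the root that is free of cancellation from the quadratic formula, then recover the other from the product of the roots, x1·x2 = c:

    import math

    def quadratic_roots(b, c):
        """Roots of x**2 + b*x + c = 0 with b > 0 and real roots."""
        r = math.sqrt((b / 2.0) ** 2 - c)
        x1 = -b / 2.0 - r        # the two terms reinforce: no cancellation
        x2 = c / x1              # small root recovered from the product
        return x1, x2

    b, c = 1.0e8, 1.0
    print(quadratic_roots(b, c))                      # about (-1e8, -1e-8)
    print(-b / 2.0 + math.sqrt((b / 2.0) ** 2 - c))   # naive small root: cancellation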