Shampine, Allen, Pruess - Fundamentals of Numerical Computing
In either case, u is a bound on the relative error of representing a nonzero real number as a floating point number. Because fl(y) is a real number, for theoretical purposes we can work with it like any other real number. In particular, it is often convenient to define a real number δ such that

    fl(y) = y(1 + δ).

In general, all we know about δ is the bound |δ| ≤ u.

Example 1.9. Impossible accuracy. Modern codes for the computation of a root of an equation, a definite integral, the solution of a differential equation, and so on, try to obtain a result with an accuracy specified by the user. Clearly it is not possible to compute an answer more accurate than the floating point representation of the true solution. This means that the user cannot be allowed to ask for a relative error smaller than the unit roundoff u.
It might seem odd that this would ever happen, but it does. One reason is that the user does not know the value of u and just asks for too much accuracy. A more common reason is that the user specifies an absolute error r. This means that any number y* will be acceptable as an approximation to y if

    |y - y*| ≤ r.

Such a request corresponds to asking for a relative error of

    |y - y*| / |y| ≤ r / |y|.

When |r/y| < u, that is, r < u|y|, this is an impossible request.
If the true solution is unexpectedly large, an absolute error tolerance that seems modest may be impossible in practice. Codes that permit users to specify an absolute error tolerance need to be able to monitor the size of the solution and warn the user when the task posed is impossible.

There is a further complication to the floating point number system: most computers do not work with decimal numbers. The common bases are β = 2, binary arithmetic, and β = 16, hexadecimal arithmetic, rather than β = 10, decimal arithmetic. In general, a real number y is written in base β as

    y = ±.d1 d2 ··· ds ds+1 ··· × β^e,    (1.3)

where each digit is one of 0, 1, ..., β - 1 and the number is normalized so that d1 > 0 (as long as y ≠ 0). This means that

    y = ±(d1 × β^-1 + d2 × β^-2 + ··· + ds × β^-s + ···) × β^e.

All the earlier discussion is easily modified for the other bases. In particular, we have in base β with s digits the unit roundoff

    u = (1/2) β^(1-s) for rounding, u = β^(1-s) for chopping.    (1.4)

Likewise,

    fl(y) = y(1 + δ), where |δ| ≤ u.

For most purposes, the fact that computations are not carried out in decimal is inconsequential.
It should be kept in mind that small rounding errors are made as numbers input are converted from decimal to the base of the machine being used, and likewise on output.

Table 1.1 illustrates the variety of machine arithmetics used in the past. Today the IEEE standard [1] described in the last two rows is almost universal. In the table the notation 1.2(-7) means 1.2 × 10^-7.

As was noted earlier, both FORTRAN and C specify that there will be two precisions available.
The floating point system built into the computer is its single precision arithmetic. Double precision may be provided by either software or hardware. Hardware double precision is not greatly slower than single precision, but software double precision arithmetic is considerably slower.

The IEEE standard uses a normalization different from (1.2). For y ≠ 0 the leading nonzero digit is immediately to the left of the decimal point. Since this digit must be 1, there is no need to store it. The number 0 is distinguished by having its e = m - 1.

Table 1.1 Examples of Computer Arithmetics.

    machine        β    s     m       M       approximate u
    VAX            2    24    -128    127     6.0(-08)
    VAX            2    56    -128    127     1.4(-17)
    CRAY-1         2    48    -16384  16383   3.6(-15)
    IBM 3081       16   6     -64     63      9.5(-07)
    IBM 3081       16   14    -64     63      2.2(-16)
    IEEE Single    2    24    -125    128     6.0(-08)
    IEEE Double    2    53    -1021   1024    1.1(-16)

It used to be some trouble to find out the unit roundoff, exponent range, and the like, but the situation has improved greatly.
In standard C, constants related to floating point arithmetic are available in <float.h>. For example, DBL_EPSILON is the unit roundoff in double precision. Similarly, in Fortran 90 the constants are available from intrinsic functions. Because this is not true of FORTRAN 77, several approaches were taken to provide them: some compilers provide the constants as extensions of the language; there are subroutines D1MACH and I1MACH for the machine constants that are widely available because they are public domain.
Major libraries like IMSL and NAG include subroutines that are similar to D1MACH and I1MACH.

In Example 1.4 earlier in this section we mentioned that the numbers in the floating point number system are not equally spaced. As an illustration, see Figure 1.1, where all 19 floating point numbers are displayed for the system with β = 4, s = 1, m = -1, and M = 1.

Arithmetic in the floating point number system is designed to approximate that in the real number system.
We use ⊕, ⊖, ⊗, ⊘ to indicate the floating point approximations to the arithmetic operations +, -, ×, /. If y and z are floating point numbers of s digits, the product y × z has 2s digits. For example, 0.999 × 0.999 = 0.998001. About the best we could hope for is that the arithmetic hardware produce the result fl(y × z), so that

    y ⊗ z = fl(y × z) = (y × z)(1 + δ)

for some real number δ with |δ| ≤ u. It is practical to do this for all the basic arithmetic operations. We assume an idealized arithmetic that for the basic arithmetic operations produces the floating point representation of the exact result, provided that the results lie in the range of the floating point system.
Hence,

    fl(y op z) = (y op z)(1 + δ),

where op = +, -, ×, or / and δ is a real number with |δ| ≤ u. This is a reasonable assumption, although hardware considerations may lead to arithmetic for which the bound on δ is a small multiple of u.

Figure 1.1 Distribution of floating point numbers for β = 4, s = 1, m = -1, M = 1.

To carry out computations in this model arithmetic by hand, for each operation +, -, ×, /, perform the operation in exact arithmetic, normalize the result, and round (chop) it to the allotted number of digits. Put differently, for each operation, calculate the result and convert it to the machine representation before going on to the next operation.

Because of increasingly sophisticated architectures, the unit roundoff as defined in (1.4) is simplistic. For example, many computers do intermediate computations with more than s digits.
They have at least one “guard digit,” perhaps several, and as a consequence results can be rather more accurate than expected. (When arithmetic operations are carried out with more than s digits, apparently harmless actions like printing out intermediate results can cause the final result of a computation to change! This happens when the extra digits are shed as numbers are moved from arithmetic units to storage or output devices.) It is interesting to compute (1 + δ) - 1 for decreasing δ to see how small δ can be made and still get a nonzero result.
A number of codes for mathematical computations that are in wide use avoid defining the unit roundoff by coding a test for u|x| < h as

    if ((x + h) ≠ x) then . . . .

On today’s computers this is not likely to work properly for two reasons, one being the presence of guard digits just discussed. The other is that modern compilers defeat the test when they “optimize” the coding by converting it to

    if (h ≠ 0) then . . . ,

which is always passed.

EXERCISES

1.1 Solve

    0.461 x1 + 0.311 x2 = 0.150
    0.209 x1 + 0.141 x2 = 0.068

using three-digit chopped decimal arithmetic. The exact answer is x1 = 1, x2 = -1; how does yours compare?

1.2 The following algorithm (due to Cleve Moler) estimates the unit roundoff u by a computable quantity U:

    A := 4./3.
    B := A - 1.
    C := B + B + B
    U := |C - 1.|

(a) What does the above algorithm yield for U in six-digit decimal rounded arithmetic?

(b) What does it yield for U in six-digit decimal chopped arithmetic?

(c) What are the exact values from (1.4) for u in the arithmetics of (a) and (b)?

(d) Use this algorithm on the machine(s) and calculator(s) you are likely to use.
What do you get?

1.3 Consider the following algorithm for generating noise in a quantity x:

    A := 10^n * x
    B := A + x
    y := B - A

(a) Calculate y when x = 0.123456 and n = 3 using six-digit decimal chopped arithmetic. What is the error x - y?

(b) Repeat (a) for n = 5.

1.4 Show that the evaluation of F(x) = cos x is well conditioned near x = 0; that is, for |x| small, show that the magnitude of the relative error |[F(x) - F(0)]/F(0)| is bounded by a quantity that is not large.

1.5 If F(x) = (x - 1)^2, what is the exact formula for [F(x + εx) - F(x)]/F(x)? What does this say about the conditioning of the evaluation of F(x) near x = 1?

1.6 Let Sn := ∫ ··· sin x dx and show that two integrations by parts results in the recursion. Further argue that S0 = 2 and that Sn-1 > Sn > 0 for every n.

(a) Compute S15 with this recursion (make sure that you use an accurate value for π).

(b) To analyze what happened in (a), consider the same recursion with the starting value S̃0 = 2(1 - u), that is, the same computation with the starting value perturbed by one digit in the last place.
Find a recursion for S̃n. From this recursion, derive a formula for S̃n - Sn in terms of S̃0 - S0. Use this formula to explain what happened in (a).

(c) Examine the “backwards” recursion, starting with ··· = 0. What is ···? Why?

1.7 For brevity let us write s = sin(θ), c = cos(θ) for some value of θ. Once c is computed, we can compute s inexpensively from s = √(1 - c^2). (Either sign of the square root may be needed in general, but let us consider here only the positive root.) Suppose the cosine routine produces c + δc instead of c.