Conte, de Boor - Elementary Numerical Analysis: An Algorithmic Approach (page 3)
Hence (.625)10 = (.101)2.

This example was rigged to give a terminating binary fraction. Unhappily, not every terminating decimal fraction gives rise to a terminating binary fraction. This is due to the fact that the binary fraction for (.1)10 is not terminating. We have

    2 · (.1) = 0.2    hence b1 = 0
    2 · (.2) = 0.4    hence b2 = 0
    2 · (.4) = 0.8    hence b3 = 0
    2 · (.8) = 1.6    hence b4 = 1
    2 · (.6) = 1.2    hence b5 = 1

and now we are back to a fractional part of 0.2, so that the digits cycle. It follows that

    (.1)10 = (.000110011001100 . . .)2

The procedure just outlined is formalized in the following algorithm.

Algorithm 1.2 Given x between 0 and 1 and an integer β greater than 1.
    x0 = x
    Generate recursively b1, b2, b3, . . . by
        bk = integer part of β · xk-1
        xk = fractional part of β · xk-1        k = 1, 2, 3, . . .
Then x = (.b1b2b3 . . .)β.

We have stated this algorithm for a general base β rather than for the specific binary base β = 2 for two reasons. If this conversion to binary is carried out with pencil and paper, it is usually faster to convert first to octal, i.e., use β = 8, and then to convert from octal to binary. Also, the algorithm can be used to convert a binary (or octal) fraction to decimal, by choosing β = 10 and using binary (or octal) arithmetic.

To give an example, if x = (.101)2, then, with β = 10 = (1010)2 and binary arithmetic, we get from Algorithm 1.2

    (1010)2 · (.101)2 = (110.01)2    hence b1 = (110)2 = 6
    (1010)2 · (.01)2  = (10.1)2     hence b2 = (10)2 = 2
    (1010)2 · (.1)2   = (101.)2     hence b3 = (101)2 = 5

Hence subsequent bk's are zero.
This shows that (.101)2 = (.625)10, confirming our earlier calculation. Note that if x is a terminating binary fraction with n digits, then it is also a terminating decimal fraction with n digits, since x = m · 2^-n = (m · 5^n) · 10^-n for some integer m.

EXERCISES

1.2-1 Convert the following binary fractions to decimal fractions:
    (.1100011)2    (.11111111)2
1.2-2 Find the first 5 digits of .1 written as an octal fraction, then compute from it the first 15 digits of .1 as a binary fraction.
1.2-3 Convert the following octal fractions to decimal:
    (.614)8    (.776)8
Compare with your answer in Exercise 1.2-1.
1.2-4 Find a binary number which approximates … to within 10^-3.
1.2-5 If we want to convert a decimal integer N to binary using Algorithm 1.1, we have to use binary arithmetic. Show how to carry out this conversion using Algorithm 1.2 and decimal arithmetic. (Hint: Divide N by the appropriate power of 2, convert the result to binary, then shift the "binary point" appropriately.)
1.2-6 If we want to convert a terminating binary fraction x to a decimal fraction using Algorithm 1.2, we have to use binary arithmetic. Show how to carry out this conversion using Algorithm 1.1 and decimal arithmetic.

1.3 FLOATING-POINT ARITHMETIC

Scientific calculations are usually carried out in floating-point arithmetic. An n-digit floating-point number in base β has the form

    fl(x) = ± (.d1d2 . . . dn)β β^e        (1.5)

where (.d1d2 . . . dn)β is a β-fraction called the mantissa, and e is an integer called the exponent.
Such a floating-point number is said to be normalized in case d1 ≠ 0, or else d1 = d2 = · · · = dn = 0. For most computers, β = 2, although on some, β = 16, and in hand calculations and on most desk and pocket calculators, β = 10.

The precision or length n of floating-point numbers on any particular computer is usually determined by the word length of the computer and may therefore vary widely (see Fig. 1.1). Computing systems which accept FORTRAN programs are expected to provide floating-point numbers of two different lengths, one roughly double the other. The shorter one, called single precision, is ordinarily used unless the other, called double precision, is specifically asked for. Calculation in double precision usually doubles the storage requirements and more than doubles running time as compared with single precision.

Figure 1.1 Floating-point characteristics.

The exponent e is limited to a range

    m ≤ e ≤ M        (1.6)

for certain integers m and M.
Usually, m = -M, but the limits may vary widely; see Fig. 1.1.

There are two commonly used ways of translating a given real number x into an n-digit floating-point number fl(x), rounding and chopping. In rounding, fl(x) is chosen as the normalized floating-point number nearest x; some special rule, such as symmetric rounding (rounding to an even digit), is used in case of a tie.
In chopping, fl(x) is chosen as the nearest normalized floating-point number between x and 0. If, for example, two-decimal-digit floating-point numbers are used, then

    fl(2/3) = (.67)10^0 in rounding    and    fl(2/3) = (.66)10^0 in chopping

On some computers, this definition of fl(x) is modified in case 0 < |x| < β^(m-1) (underflow), where m and M are the bounds on the exponents; either fl(x) is not defined in this case, causing a stop, or else fl(x) is represented by a special number which is not subject to the usual rules of arithmetic when combined with ordinary floating-point numbers.

The difference between x and fl(x) is called the round-off error. The round-off error depends on the size of x and is therefore best measured relative to x.
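An n-decimal-digit fl(x) with both rounding and chopping can be sketched as follows (a hypothetical helper, not from the text; because Python works internally in binary floats, borderline cases may be perturbed by a unit in the last place):

```python
import math

def fl(x, n=2, mode="round"):
    """n-decimal-digit normalized floating-point value of x (a sketch;
    Python's binary floats can perturb borderline cases)."""
    if x == 0.0:
        return 0.0
    # exponent e with 10**(e-1) <= |x| < 10**e, so the mantissa lies in [.1, 1)
    e = math.floor(math.log10(abs(x))) + 1
    scaled = (x / 10.0 ** e) * 10 ** n        # mantissa digits as an integer part
    if mode == "round":
        digits = math.floor(scaled + 0.5) if x > 0 else math.ceil(scaled - 0.5)
    else:                                     # "chop": discard digits toward zero
        digits = math.trunc(scaled)
    return digits * 10.0 ** (e - n)

print(fl(2 / 3, 2, "round"))   # 0.67
print(fl(2 / 3, 2, "chop"))    # 0.66
```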
For if we write

    fl(x) = x(1 + δ)        (1.7)

where δ is some number depending on x, then it is possible to bound δ independently of x, at least as long as x causes no overflow or underflow. For such an x, it is not difficult to show that

    |δ| ≤ (1/2) β^(1-n)    in rounding        (1.8)
while
    |δ| ≤ β^(1-n)          in chopping        (1.9)

See Exercise 1.3-3. The maximum possible value for |δ| is often called the unit roundoff and is denoted by u.

When an arithmetic operation is applied to two floating-point numbers, the result usually fails to be a floating-point number of the same length.
If, for example, we deal with two-decimal-digit numbers and x = (.10)10^1, y = (.41)10^0, then

    x + y = (.141)10^1

which has a three-digit mantissa. Hence, if * denotes one of the arithmetic operations (addition, subtraction, multiplication, or division) and ⊛ denotes the floating-point operation of the same name provided by the computer, then, however the computer may arrive at the result x ⊛ y for two given floating-point numbers x and y, we can be sure that usually x ⊛ y ≠ x * y. Although the floating-point operation ⊛ corresponding to * may vary in some details from machine to machine, ⊛ is usually constructed so that

    x ⊛ y = fl(x * y)        (1.10)

In words, the floating-point sum (difference, product, or quotient) of two floating-point numbers usually equals the floating-point number which represents the exact sum (difference, product, or quotient) of the two numbers. Hence (unless overflow or underflow occurs) we have

    x ⊛ y = (x * y)(1 + δ)    for some |δ| ≤ u        (1.11a)

where u is the unit roundoff.
In certain situations, it is more convenient to use the equivalent formula

    x ⊛ y = (x * y)/(1 + δ)    for some |δ| ≤ u        (1.11b)

Equation (1.11) expresses the basic idea of backward error analysis (see J. H. Wilkinson [24]†). Explicitly, Eq. (1.11) allows one to interpret a floating-point result as the result of the corresponding ordinary arithmetic, but performed on slightly perturbed data. In this way, the analysis of the effect of floating-point arithmetic can be carried out in terms of ordinary arithmetic.

For example, the value of the function f(x) = x^(2^n) at a point x0 can be calculated by n squarings, i.e., by carrying out the sequence of steps

    y0 = x0        yi = (yi-1)^2        i = 1, . . . , n

with yn = f(x0). In floating-point arithmetic, we compute instead, according to Eq. (1.11a), the sequence of numbers

    ŷ0 = x0        ŷi = (ŷi-1)^2 (1 + δi)        i = 1, . . . , n

with |δi| ≤ u for all i.

†Numbers in brackets refer to items in the references at the end of the book.
The computed answer is, therefore,

    ŷn = x0^(2^n) (1 + δ1)^(2^(n-1)) (1 + δ2)^(2^(n-2)) · · · (1 + δn)

To simplify this expression, we observe that, if |δi| ≤ u for i = 1, . . . , m, then

    (1 + δ1)(1 + δ2) · · · (1 + δm) = (1 + δ)^m    for some |δ| ≤ u

(see Exercise 1.3-6). Also, 2^(n-1) + 2^(n-2) + · · · + 2 + 1 = 2^n - 1. Consequently,

    ŷn = x0^(2^n) (1 + δ)^(2^n - 1) = (x0(1 + δ'))^(2^n)    for some |δ|, |δ'| ≤ u

In words, the computed value ŷn is the exact value of f(x) at the perturbed argument x0(1 + δ'). We can now gauge the effect which the use of floating-point arithmetic has had on the accuracy of the computed value for f(x0) by studying how the value of the (exactly computed) function f(x) changes when the argument x is perturbed, as is done in the next section.
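This backward error bound can be checked numerically: repeating the squarings once in double precision and once in exact rational arithmetic shows that the relative error of the computed value stays within roughly (2^n - 1)u, as the analysis predicts. The values x0 = 1.001 and n = 10 below are merely illustrative:

```python
from fractions import Fraction

u = 2.0 ** -53     # unit roundoff of IEEE double precision (rounding)

def pow2n_float(x0, n):
    """Compute f(x0) = x0**(2**n) by n squarings in double precision."""
    y = x0
    for _ in range(n):
        y *= y                 # y_i = y_{i-1}**2 * (1 + delta_i), |delta_i| <= u
    return y

def pow2n_exact(x0, n):
    """The same n squarings in exact rational arithmetic (no roundoff)."""
    y = Fraction(x0)           # exact value of the double x0
    for _ in range(n):
        y *= y
    return y

x0, n = 1.001, 10
ratio = pow2n_float(x0, n) / float(pow2n_exact(x0, n))
# the analysis predicts |ratio - 1| <= about (2**n - 1) * u
print(abs(ratio - 1.0) <= 1024 * u)    # True
```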
Further, we note that this error is, in our example, comparable to the error due to the fact that we had to convert the initial datum x0 to a floating-point number to begin with.

As a second example, of particular interest in Chap. 4, consider calculation of the number s from the equation

    a1x1 + a2x2 + · · · + arxr + ar+1 s = b        (1.12)

by the formula

    s = (b - a1x1 - a2x2 - · · · - arxr)/ar+1

If we obtain s through the steps

    s0 = b        sk = sk-1 - akxk,  k = 1, . . . , r        s = sr/ar+1

then the corresponding numbers computed in floating-point arithmetic satisfy

    ŝ0 = b        ŝk(1 + δ) = ŝk-1 - akxk(1 + δ),  k = 1, . . . , r        ŝ(1 + δ) = ŝr/ar+1

with |δ| ≤ u throughout. Here, we have used Eqs. (1.11a) and (1.11b), and have not bothered to distinguish the various δ's by subscripts.
Consequently,

    b = a1x1(1 + δ) + a2x2(1 + δ)^2 + · · · + arxr(1 + δ)^r + ar+1(1 + δ)^(r+1) ŝ

with the various δ's again not distinguished. Since every factor (1 + δ)^k with k ≤ r + 1 can be written as (1 + δ')^(r+1) for some |δ'| ≤ u, this shows that the computed value ŝ for s satisfies the perturbed equation

    a1x1(1 + δ1)^(r+1) + a2x2(1 + δ2)^(r+1) + · · · + arxr(1 + δr)^(r+1) + ar+1(1 + δr+1)^(r+1) ŝ = b        (1.13)

with |δi| ≤ u for all i. Note that we can reduce all exponents by 1 in case ar+1 = 1, that is, in case the last division need not be carried out.

EXERCISES

1.3-1 The following numbers are given in a decimal computer with a four-digit normalized mantissa:
    …
Perform the following operations, and indicate the error in the result, assuming symmetric rounding:
    …
1.3-2 Let fl(x) be given by chopping. Show that |fl(x)| ≤ |x| and that x and fl(x) have the same sign (unless overflow or underflow occurs).
1.3-3 Let fl(x) be given by chopping and let δ be such that fl(x) = x(1 + δ). (If x = 0, take δ = 0.) Show that then δ is bounded as in (1.9).
1.3-4 Give examples to show that most of the laws of arithmetic fail to hold for floating-point arithmetic. (Hint: Try laws involving three operands.)
1.3-5 Write a FORTRAN FUNCTION FL(X) which returns the value of the n-decimal-digit floating-point number derived from X by rounding. Take n to be 4 and check your calculations in Exercise 1.3-1. [Use ALOG10(ABS(X)) to determine e such that 10^(e-1) ≤ |X| < 10^e.]
1.3-6 Let δ1, . . . , δm be numbers with |δi| ≤ u for all i. Show that there exists a δ with |δ| ≤ u such that (1 + δ1)(1 + δ2) · · · (1 + δm) = (1 + δ)^m. Show also that (1 + δ1)^k1 (1 + δ2)^k2 · · · (1 + δm)^km = (1 + δ)^(k1 + k2 + · · · + km) for some |δ| ≤ u, provided the integers k1, . . . , km all have the same sign.
1.3-7 Carry out a backward error analysis for the calculation of the scalar product a1b1 + a2b2 + · · · + anbn. Redo the analysis under the assumption that double-precision accumulation is used. This means that the double-precision result of each multiplication is retained and added to the sum in double precision, with the resulting sum rounded only at the end to single precision.

1.4 LOSS OF SIGNIFICANCE AND ERROR PROPAGATION; CONDITION AND INSTABILITY

If the number x* is an approximation to the exact answer x, then we call the difference x - x* the error in x*; thus

    Exact = approximation + error        (1.14)

The relative error in x*, as an approximation to x, is defined to be the number (x - x*)/x.
Note that this number is close to the number (x - x*)/x* if it is at all small. [Precisely, if (x - x*)/x = ε, then (x - x*)/x* = ε/(1 - ε).]

Every floating-point operation in a computational process may give rise to an error which, once generated, may then be amplified or reduced in subsequent operations.

One of the most common (and often avoidable) ways of increasing the importance of an error is commonly called loss of significant digits. If x* is an approximation to x, then we say that x* approximates x to r significant β-digits provided the absolute error |x - x*| is at most one-half unit in the rth significant β-digit of x. This can be expressed in a formula as

    |x - x*| ≤ (1/2) β^(s-r+1)        (1.15)

with s the largest integer such that β^s ≤ |x|. For instance, x* = 3 agrees with x = π = 3.14159 . . . to one significant (decimal) digit, while x* = 22/7 = 3.14285 . . . is correct to three significant digits (as an approximation to π). Suppose now that we are to calculate the number

    z = x - y

and that we have approximations x* and y* for x and y, respectively, available, each of which is good to r digits.
Then

    z* = x* - y*

is an approximation for z, which is also good to r digits unless x* and y* agree in one or more digits. In this latter case, there will be cancellation of digits during the subtraction, and consequently z* will be accurate to fewer than r digits.

Consider, for example,

    x* = (.76545421)10^1        y* = (.76544200)10^1

and assume each to be an approximation to x and y, respectively, correct to seven significant digits. Then, in eight-digit floating-point arithmetic,

    z* = x* - y* = (.12210000)10^-3

is the exact difference between x* and y*. But as an approximation to z = x - y, z* is good only to three digits, since the fourth significant digit of z* is derived from the eighth digits of x* and y*, both possibly in error. Hence, while the error in z* (as an approximation to z = x - y) is at most the sum of the errors in x* and y*, the relative error in z* is possibly 10,000 times the relative error in x* or y*.
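The amplification of the relative error under cancellation can be checked directly; the two values and the half-unit-in-the-seventh-digit error bound below are illustrative:

```python
# x* and y*, each assumed correct to seven significant digits
x_star, y_star = 0.76545421e1, 0.76544200e1
err = 0.5e-6          # half a unit in the 7th significant digit of x and y

z_star = x_star - y_star          # the exact difference, about 0.1221e-3
rel_in = err / abs(x_star)        # relative error in the data
rel_out = 2 * err / abs(z_star)   # worst-case relative error in z*
print(z_star)                     # about 1.221e-04
print(rel_out / rel_in)           # amplification well over 10,000
```

The amplification factor is roughly |x|/|z|: the closer the operands, the worse the relative accuracy of the difference.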