If available, these quantities can be helpful in designing software that deals gracefully with exceptional situations rather than abruptly aborting the program. In MATLAB (see Section 1.4.2), for example, if Inf and NaN arise, they are propagated sensibly through a computation (e.g., 1 + Inf = Inf). It is still desirable, however, to avoid such exceptional situations entirely, if possible. In addition to alerting the user to arithmetic exceptions, these special values can also be useful as flags that cannot be confused with any legitimate numeric value. For example, NaN might be used to indicate a portion of an array that has not yet been defined.
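The text's example uses MATLAB, but any language that exposes the IEEE special values behaves similarly. The following Python sketch (the names inf, nan, and the four-entry table are purely illustrative, not from the text) shows the propagation of Inf and NaN and the use of NaN as a flag:

```python
import math

inf = math.inf
nan = math.nan

# Special values propagate sensibly through arithmetic rather than aborting.
print(1.0 + inf)        # inf
print(inf - inf)        # nan (an indeterminate form)
print(1.0 + nan)        # nan propagates through subsequent operations

# NaN as a flag for "not yet defined" entries: it never compares equal to a
# legitimate value (or even to itself), so it cannot be mistaken for data.
table = [nan] * 4       # array whose entries have not yet been computed
table[0] = 3.14
print([math.isnan(v) for v in table])   # [False, True, True, True]
```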
1.3.8 Floating-Point Arithmetic

In adding or subtracting two floating-point numbers, their exponents must match before their mantissas can be added or subtracted. If they do not match initially, then the mantissa of one of the numbers must be shifted until the exponents do match. In performing such a shift, some of the trailing digits of the smaller (in magnitude) number will be shifted off the end of the mantissa field, and thus the correct result of the arithmetic operation cannot be represented exactly in the floating-point system. Indeed, if the difference in magnitude is too great, then the entire mantissa of the smaller number may be shifted completely beyond the field width, so that the result is simply the larger of the operands. Another way of saying this is that if the true sum of two t-digit numbers contains more than t digits, then the excess digits will be lost when the result is rounded to t digits, and in the worst case the operand of smaller magnitude may be lost completely.
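The same loss occurs in a binary system. The sketch below (illustrative only, in IEEE double precision rather than the text's decimal examples) shows a small operand being shifted completely out of the sum once the exponents differ by a full mantissa width:

```python
# In IEEE double precision the mantissa holds 53 bits, so when the operands'
# exponents differ by 53 or more, the smaller operand is shifted entirely
# out of the mantissa field and the rounded sum is just the larger operand.
big = 2.0 ** 53
print(big + 1.0 == big)        # True: the 1.0 is lost completely
print((big + 1.0) - big)       # 0.0, although the true answer is 1.0
print(big + 2.0 == big)        # False: 2.0 still survives at this magnitude
```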
Multiplication of two floating-point numbers does not require that their exponents match—the exponents are simply summed and the mantissas multiplied. However, the product of two t-digit mantissas will in general contain up to 2t digits, and thus once again the correct result cannot be represented exactly in the floating-point system and must be rounded.

Example 1.7 Floating-Point Arithmetic. Consider a floating-point system with β = 10 and t = 6. If x = 1.92403 × 10² and y = 6.35782 × 10⁻¹, then floating-point addition gives the result x + y = 1.93039 × 10², assuming rounding to nearest. Note that the last two digits of y have no effect on the result. With an even smaller exponent, y could have had no effect at all on the result. Similarly, floating-point multiplication gives the result x ∗ y = 1.22326 × 10², which discards half of the digits of the true product.
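Python's decimal module can emulate such a toy system, since its context precision counts significant decimal digits and round to nearest is its default rounding rule. A minimal sketch (assuming only the standard-library decimal module) reproducing the example's results:

```python
from decimal import Decimal, getcontext, ROUND_HALF_EVEN

# Emulate the example's beta = 10, t = 6 system with rounding to nearest.
getcontext().prec = 6
getcontext().rounding = ROUND_HALF_EVEN

x = Decimal("1.92403E+2")
y = Decimal("6.35782E-1")

print(x + y)   # 193.039 -- the last two digits of y are shifted off and lost
print(x * y)   # 122.326 -- the 12-digit true product is rounded back to 6 digits
```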
Division of two floating-point numbers may also give a result that cannot be represented exactly. For example, 1 and 10 are both exactly representable as binary floating-point numbers, but their quotient, 1/10, has a nonterminating binary expansion and thus is not a binary floating-point number.
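A quick Python check (illustrative; the standard fractions module recovers the exact rational value of the stored double) confirms that the machine value of 1/10 is only an approximation:

```python
from fractions import Fraction

# The double nearest to 1/10 is not 1/10: its exact rational value exposes
# the rounded binary expansion actually stored in the machine.
x = 1.0 / 10.0
print(Fraction(x))                      # 3602879701896397/36028797018963968
print(Fraction(x) == Fraction(1, 10))   # False
print(0.1 + 0.2 == 0.3)                 # False: operands and sum are all rounded
```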
In each of the cases just cited, the result of a floating-point arithmetic operation may differ from the result that would be given by the corresponding real arithmetic operation on the same operands because there is insufficient precision to represent the correct real result. The real result may also be unrepresentable because its exponent is beyond the range available in the floating-point system (overflow or underflow). Overflow is usually a more serious problem than underflow in the sense that there is no good approximation in a floating-point system to arbitrarily large numbers, whereas zero is often a reasonable approximation for arbitrarily small numbers. For this reason, on many computer systems the occurrence of an overflow aborts the program with a fatal error, but an underflow may be silently set to zero without disrupting execution.

Example 1.8 Summing a Series. As an illustration of these issues, the infinite series
$$\sum_{n=1}^{\infty} \frac{1}{n}$$
has a finite sum in floating-point arithmetic even though the real series is divergent. At first blush, one might think that this result occurs because 1/n will eventually underflow, or the partial sum will eventually overflow, as indeed they must. But before either of these occurs, the partial sum ceases to change once 1/n becomes negligible relative to the partial sum, i.e., when $1/n < \epsilon_{\mathrm{mach}} \sum_{k=1}^{n-1} (1/k)$, and thus the sum is finite (see Computer Problem 1.8).
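A sketch of the experiment behind this example and Computer Problem 1.8, here in IEEE single precision (assuming NumPy for its float32 type) so that the stagnation occurs after only a few million terms rather than astronomically many:

```python
import numpy as np

# Sum the harmonic series in single precision until adding the next term
# no longer changes the partial sum.
s = np.float32(0.0)
n = 0
while True:
    n += 1
    t = s + np.float32(1.0) / np.float32(n)
    if t == s:        # 1/n has become negligible relative to the partial sum
        break
    s = t

print(n, s)           # the partial sum stagnates at a modest finite value
```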
As we have noted, a real arithmetic operation on two floating-point numbers does not necessarily result in another floating-point number. If a number that is not exactly representable as a floating-point number is entered into the computer or is produced by a subsequent arithmetic operation, then it must be rounded (using one of the rounding rules given earlier) to obtain a floating-point number. Because floating-point numbers are not equally spaced, the absolute error made in such an approximation is not uniform, but the relative error is bounded by the unit roundoff ε_mach.

Ideally, x flop y = fl(x op y) (i.e., floating-point arithmetic operations produce correctly rounded results); and many computers, such as those meeting the IEEE floating-point standard, achieve this ideal as long as x op y is within the range of the floating-point system. Nevertheless, some familiar laws of real arithmetic are not necessarily valid in a floating-point system. In particular, floating-point addition and multiplication are commutative but not associative. For example, if ε is a positive floating-point number slightly smaller than the unit roundoff ε_mach, then (1 + ε) + ε = 1, but 1 + (ε + ε) > 1.
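A Python sketch of this example in IEEE double precision; the particular value 1.0e-16 is an arbitrary choice just below the double-precision unit roundoff of about 1.1 × 10⁻¹⁶:

```python
# eps is slightly smaller than the unit roundoff, so adding it to 1 is
# rounded away, but adding eps to eps first gives a sum large enough to survive.
eps = 1.0e-16

print((1.0 + eps) + eps == 1.0)   # True: each addition to 1 rounds back to 1
print(1.0 + (eps + eps) > 1.0)    # True: the two tiny terms survive when added first
```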
The failure of floating-point arithmetic to satisfy the normal laws of real arithmetic is one reason that forward error analysis can be difficult. One advantage of backward error analysis is that it permits the use of real arithmetic in the analysis.

1.3.9 Cancellation

Rounding is not the only necessary evil in finite-precision arithmetic.
Subtraction between two t-digit numbers having the same sign and similar magnitudes yields a result with fewer than t significant digits, and hence it is always exactly representable (provided the two numbers involved do not differ in magnitude by more than a factor of two). The reason is that the leading digits of the two numbers cancel (i.e., their difference is zero). For example, again taking β = 10 and t = 6, if x = 1.92403 × 10² and z = 1.92275 × 10², then we obtain the result x − z = 1.28000 × 10⁻¹, which, with only three significant digits, is exactly representable.
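Continuing the decimal emulation used for Example 1.7 (again only a sketch, assuming the standard decimal module), the subtraction indeed commits no rounding error at all:

```python
from decimal import Decimal, getcontext

getcontext().prec = 6            # same beta = 10, t = 6 toy system as before
x = Decimal("1.92403E+2")
z = Decimal("1.92275E+2")

diff = x - z
print(diff)                      # 0.128 -- exactly 1.28000 x 10^-1, no rounding needed
print(diff == Decimal("0.128"))  # True: the difference is exactly representable
```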
Despite the exactness of the result, however, such cancellation nevertheless often implies a serious loss of information. The problem is that the operands are often uncertain, owing to rounding or other previous errors, in which case the relative uncertainty in the difference may be large. In effect, if two nearly equal numbers are accurate only to within rounding error, then taking their difference leaves only rounding error as a result.

As a simple example, if ε is a positive number slightly smaller than the unit roundoff ε_mach, then (1 + ε) − (1 − ε) = 1 − 1 = 0 in floating-point arithmetic, which is correct for the actual operands of the final subtraction, but the true result of the overall computation, 2ε, has been completely lost.
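A Python sketch of this example in IEEE double precision; the value 5.0e-17 is chosen (somewhat smaller than the unit roundoff) so that both 1 + ε and 1 − ε round to exactly 1:

```python
eps = 5.0e-17      # small enough that both 1 + eps and 1 - eps round to 1.0

a = 1.0 + eps      # rounds to 1.0
b = 1.0 - eps      # also rounds to 1.0
print(a - b)       # 0.0 -- correct for the rounded operands, but the true
                   # result of the overall computation, 2*eps = 1e-16, is lost
```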
The subtraction itself is not at fault: it merely signals the loss of information that had already occurred.

Of course, the loss of information is not always complete, but the fact remains that the digits lost to cancellation are the most significant, leading digits, whereas the digits lost in rounding are the least significant, trailing digits.
Because of this effect, computing a small quantity as a difference of large quantities is generally a bad idea, for rounding error is likely to dominate the result. For example, summing an alternating series, such as
$$e^x = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \cdots$$
for x < 0, may give disastrous results because of catastrophic cancellation (see Computer Problem 1.9).

Example 1.9 Cancellation.