Higham - Accuracy and Stability of Numerical Algorithms
MATLAB uses IEEE standard double precision arithmetic on those machines that support it in hardware. In Fortran 90 the intrinsic function EPSILON returns the machine epsilon corresponding to the KIND of its REAL argument.

[Figure 2.1. Relative distance from x to the next larger machine number (β = 2, t = 24), displaying wobbling precision.]

2.2. Model of Arithmetic

To carry out rounding error analysis of an algorithm we need to make some assumptions about the accuracy of the basic arithmetic operations. The most common assumptions are embodied in the following model, in which x, y ∈ F:

STANDARD MODEL

    fl(x op y) = (x op y)(1 + δ),   |δ| ≤ u,   op = +, -, *, /.        (2.4)

It is normal to assume that (2.4) holds also for the square root operation. Note that now we are using fl(·) with an argument that is an arithmetic expression to denote the computed value of that expression.
The model says that the computed value of x op y is "as good as" the rounded exact answer, in the sense that the relative error bound is the same in both cases. However, the model does not require that δ = 0 when x op y ∈ F (a condition which obviously does hold for the rounded exact answer), so the model does not capture all the features we might require of floating point arithmetic. This model is valid for most computers, and, in particular, holds for IEEE standard arithmetic.
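The standard model can be checked numerically in IEEE double precision (u = 2^-53). The following Python sketch (Python floats are IEEE doubles on virtually all platforms) computes δ exactly using rational arithmetic, since Fraction(x) recovers the exact value of a double:

```python
from fractions import Fraction

u = Fraction(1, 2**53)  # unit roundoff for IEEE double precision

x, y = 0.1, 0.3  # machine numbers: the doubles nearest 0.1 and 0.3
xf, yf = Fraction(x), Fraction(y)  # exact rational values of those doubles

# For each op, delta = fl(x op y)/(x op y) - 1 should satisfy |delta| <= u.
for computed, exact in [(x + y, xf + yf), (x - y, xf - yf),
                        (x * y, xf * yf), (x / y, xf / yf)]:
    delta = Fraction(computed) / exact - 1
    assert abs(delta) <= u
```

The choice x = 0.1, y = 0.3 is arbitrary; any machine numbers would do, and the assertions hold because IEEE arithmetic delivers the correctly rounded exact result.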
Cases in which the model is not valid are described in §2.4. The following modification of (2.4) can also be used (cf. Theorem 2.3):

    fl(x op y) = (x op y)/(1 + δ),   |δ| ≤ u,   op = +, -, *, /.        (2.5)

Note: Throughout this book, the standard model (2.4) is used unless otherwise stated. Most results proved using the standard model remain true with the weaker model (2.6) described below, possibly subject to slight increases in the constants. We identify problems for which the choice of model significantly affects the results that can be proved.

2.3. IEEE Arithmetic

IEEE standard 754, published in 1985 [597, 1985], defines a binary floating point arithmetic system. It is the result of several years' work by a working group of a subcommittee of the IEEE Computer Society Computer Standards Committee.

Among the design principles of the standard were that it should encourage experts to develop robust, efficient, and portable numerical programs, enable the handling of arithmetic exceptions, and provide for the development of transcendental functions and very high precision arithmetic.

The standard specifies floating point number formats, the results of the basic floating point operations and comparisons, rounding modes, floating point exceptions and their handling, and conversion between different arithmetic formats.
Square root is included as a basic operation. The standard says nothing about exponentiation or transcendental functions such as exp and cos.

Two main floating point formats are defined:

    Type     Size      Mantissa     Exponent   Unit roundoff            Range
    Single   32 bits   23+1 bits    8 bits     2^-24 ≈ 5.96 × 10^-8     10^±38
    Double   64 bits   52+1 bits    11 bits    2^-53 ≈ 1.11 × 10^-16    10^±308

In both formats one bit is reserved as a sign bit. Since the floating point numbers are normalized, the most significant bit is always 1 and is not stored (except for the denormalized numbers described below). This hidden bit accounts for the "+1" in the table.

Table 2.2. IEEE arithmetic exceptions and default results.

    Exception type       Example                         Default result
    Invalid operation    0/0, 0 × ∞                      NaN (Not a Number)
    Overflow                                             ±∞
    Divide by zero       Finite nonzero/0                ±∞
    Underflow                                            Subnormal numbers
    Inexact              Whenever fl(x op y) ≠ x op y    Correctly rounded result

The standard specifies that all arithmetic operations are to be performed as if they were first calculated to infinite precision and then rounded according to one of four modes. The default rounding mode is to round to the nearest representable number, with rounding to even (zero least significant bit) in the case of a tie. With this default mode, the model (2.4) is obviously satisfied. Note that computing with a single guard bit (see §2.4) will not always give the same answer as computing the exact result and then rounding.
But the use of a second guard bit and a third sticky bit (the logical OR of all succeeding bits) enables the rounded exact result to be computed. Rounding to plus or minus infinity is also supported by the standard; this facilitates the implementation of interval arithmetic. The fourth supported mode is rounding to zero (truncation, or chopping).

IEEE arithmetic is a closed system: every arithmetic operation produces a result, whether it is mathematically expected or not, and exceptional operations raise a signal. The default results are shown in Table 2.2. The default response to an exception is to set a flag and continue, but it is also possible to take a trap (pass control to a trap handler).

A NaN is a special bit pattern that cannot be generated in the course of unexceptional operations because it has a reserved exponent field.
Since the mantissa is arbitrary, subject to being nonzero, a NaN can have something about its provenance encoded in it, and this information can be used for retrospective diagnostics. A NaN is generated by operations such as 0/0 and 0 × ∞. One creative use of the NaN is to denote uninitialized or missing data. Arithmetic operations involving a NaN return a NaN as the answer.
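These rules are easy to observe in a language that exposes IEEE doubles; a small Python illustration (note that for 0.0/0.0 Python itself raises ZeroDivisionError rather than returning the IEEE default, so an invalid operation on infinities is used instead):

```python
import math

nan = math.inf - math.inf   # invalid operation: ∞ - ∞ produces a NaN
assert math.isnan(nan)

# A NaN propagates through subsequent arithmetic.
assert math.isnan(nan + 1.0)
assert math.isnan(0.0 * math.inf)   # invalid operation: 0 × ∞
```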
A NaN compares as unordered and unequal with everything including itself (a NaN can be tested with the predicate x ≠ x or with the IEEE recommended function isnan, if provided).

The IEEE standard provides distinct representations for +0 and -0, but comparisons are defined so that +0 = -0. Signed zeros provide an elegant way to handle branch cuts in complex arithmetic; for details, see Kahan [632, 1987].

The infinity symbol is represented by a zero mantissa and the same exponent field as a NaN; the sign bit distinguishes between +∞ and -∞. The infinity symbol obeys the usual mathematical conventions regarding infinity, such as ∞ + ∞ = ∞ and (finite)/∞ = 0.

The standard allows subnormal numbers to be represented, instead of flushing them to zero as in many systems, and this feature permits gradual underflow (sometimes called graceful underflow).
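Gradual underflow is visible in IEEE double precision, where nonzero subnormal numbers fill the gap between zero and the smallest normalized number 2^-1022; a Python sketch:

```python
import sys

tiny = sys.float_info.min        # smallest normalized double, 2**-1022
sub = tiny / 2**10               # nonzero: underflow is gradual, not abrupt
assert 0.0 < sub < tiny
assert sub == 2.0**-1032         # exactly representable as a subnormal

smallest = 2.0**-1074            # the smallest positive subnormal
assert smallest / 2 == 0.0       # below this, results round to zero
```

On a flush-to-zero system, tiny / 2**10 would instead yield 0.0, which is why gradual underflow simplifies reasoning about small differences.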
Gradual underflow makes it easier to write reliable numerical software; see Demmel [280, 1984].

The standard may be implemented in hardware or software. The first hardware implementation was the Intel 8087 floating point coprocessor, which was produced in 1981 and implements an early draft of the standard (the 8087 very nearly conforms to the present standard). This chip, together with its bigger and more recent brothers the Intel 80287, 80387, 80486 and the Pentium, is used in IBM PC compatibles (the 80486DX and Pentium are general-purpose chips that incorporate a floating point coprocessor). Other manufacturers that produce processors implementing IEEE arithmetic include DEC (Alpha), Hewlett Packard (Precision Architecture), IBM (RS/6000), Inmos (T800, T900), Motorola (680x0), and Sun (SPARCstation).

The IEEE standard defines minimum requirements for two extended number formats: single extended and double extended.
The double extended format has at least 79 bits, with at least 63 bits in the mantissa and at least 15 in the exponent; it therefore surpasses the double format in both precision and range, having unit roundoff u < 5.42 × 10^-20 and range at least 10^±4932. The purpose of the extended precision formats is not to provide for higher precision computation per se, but to enable double precision results to be computed more reliably (by avoiding intermediate overflow and underflow) and more accurately (by reducing the effect of cancellation) than would otherwise be possible. In particular, extended precision makes it easier to write accurate routines to evaluate the elementary functions, as explained by Hough [584, 1981].

A double extended format of 80 bits is supported by the Intel and Motorola chips mentioned above (which are used in many PC and Macintosh computers); these chips, in fact, normally do all their floating point arithmetic in 80-bit arithmetic (even for arguments of the single or double format).
However, double extended is not supported by Sun SPARCstations or machines that use the PowerPC or DEC Alpha chips. Furthermore, the extended format (when available) is supported little by compilers and packages such as Mathematica and Maple. Kahan [636, 1994] notes that "What you do not use, you are destined to lose", and encourages the development of benchmarks to measure accuracy and related attributes. He also explains that

    For now the 10-byte Extended format is a tolerable compromise between the value of extra-precise arithmetic and the price of implementing it to run fast; very soon two more bytes of precision will become tolerable, and ultimately a 16-byte format.
. . Thatkind of gradual evolution towards wider precision was already inview when IEEE Standard 754 for Floating-Point Arithmetic wasframed.A possible side effect of the use of an extended format is the phenomenonof double rounding, whereby a result computed “as if to infinite precision” (asspecified by the standard) is rounded first to the extended format and then tothe destination format. Double rounding (which is allowed by the standard)can give a different result from that obtained by rounding directly to thedestination format, and so can lead to subtle differences between the resultsobtained with different implementations of IEEE arithmetic (see Problems 2.9and 3.11).An IEEE Standard 854, which generalizes the binary standard 754, waspublished in 1987 [598, 198 7].
It is a standard for floating point arithmetic that is independent of word length and base (although in fact only bases 2 and 10 are provided in the standard, since the drafting committee "could find no valid technical reason for allowing other radices, and found several reasons for not allowing them" [223, 1988]). Base 10 IEEE 854 arithmetic is implemented in the HP-71B calculator.

2.4. Aberrant Arithmetics

Unfortunately, not all computer floating point arithmetics adhere to the model (2.4). The most common reason for noncompliance with the model is that the arithmetic lacks a guard digit in subtraction. The role of a guard digit is
The role of a guard digit iseasily explained with a simple example.Consider a floating point arithmetic system with base β = 2 and t = 3digits in the mantissa. Subtracting from 1.0 the next smaller floating numberwe have, in binary notation,21 × 0.10020 × 0.1112 1 × 0.10021 × 0.01112 1 × 0.0001 = 2 -2 × 0.100Notice that to do the subtraction we had to line up the binary points, therebyunnormalizing the second number and using, temporarily, a fourth mantissadigit, known as a guard digit. Some machines do not have a guard digit.Without a guard digit in our example we would compute as follows, assumingthe extra digits are simply discarded:21 × 0.10020 × 0.1112 1 × 0.1002 1 × 0.011 (last digit dropped)21 × 0.001 = 2-l × 0.100492.4 A BERRANT A RITHMETICSThe computed answer is too big by a factor 2 and so has relative error l! Formachines without a guard digit it is not true thatf l( x ± y) = (x ± y)(1 + δ),| δ |< u ,but it is true thatOur model of floating point arithmetic becomesNo GUARD DIGIT MODELf l(x ± y) = x(1 + a) ± y(l + β),fl(xopy) = (xopy) ( l +δ ) ,|a | , | β| < u,|δ| < u,op = *,/,(2.6a)(2.6b)where we have stated a weaker condition on a and β that is generally easierto work with.Notable examples of machines that lack guard digits are several modelsof Cray computers (Cray 1, 2, X-MP, Y-MP, and C90).