Higham - Accuracy and Stability of Numerical Algorithms (523152), страница 13
Текст из файла (страница 13)
. . (1980)When discussing the floating-point capabilities of a new machine,we always ask the manufacturer two questions:Does the machine use IEEE arithmetic?Does it support graceful underflow and provideuser control of rounding mode and exception flags?Frequently the designer believes his machine is using IEEE arithmeticwhen it is using only the IEEE formats without the other required features.-W. J. CODY, Floating-Point Standards-Theory and Practice (1988)Arithmetic on Cray computers is interesting because it is driven by amotivation for the highest possible floating-point performance . . .Addition on Cray computers does not have a guard digit,and multiplication is even less accurate than addition .
. .At least Cray computers serve to keep numerical analysts on their toes!-DAVID GOLDBERG 9, Computer Arithmetic (1990)It is rather conventional to obtain a “realistic” estimateof the possible overall error due to k roundoffs,when k is fairly large,by replacing k byin an expression for (or an estimate of)the maximum resultant error.-F. B. HILDEBRAND, introduction to Numerical Analysis (1974)9In Hennessy and Patterson [515,1990 ,App. A].39NextF LOATING P OINT A RITHMETIC402.1.
Floating Point Number SystemA floating point number system FIR is a subset of the real numbers whoseelements have the formy = ±m × β e - t .(2.1)The system F is characterized by four integer parameters:• the base β (sometimes called the radix),• the precision t, and• the exponent range emin < e < emax.The mantissa m is an integer satisfying 0 < m < β t - 1. To ensure a uniquerepresent at ion for each y F it is assumed that m > β t -1 if y0, so thatthe system is normalized. The range of the nonzero floating point numbersin F is given by β e min-1 < |y| < β e max(1 - β - t ).
Values of the parametersfor some machines of interest are given in Table 2.1 (the unit roundoff u isdefined on page 42).Note that an alternative (and more common) way of expressing y is(2.2)where each digit d i satisfies 0 < d i < β - 1, and d 10 for normalizednumbers. We prefer the more concise representation (2.1), which we usuallyfind easier to work with. This “nonpositional” representation has pedagogicaladvantages.
being entirely integer based and therefore simpler to grasp. Inthe representation (2.2). d 1 is called the most significant digit and dt the leastsignificant digit.It is important to realize that the floating point numbers are not equallyspaced. If β = 2, t = 3, emin = -1, and emax = 3 then the nonnegativefloating point numbers are0, 0.25, 0.3125, 0.3750, 0.4375, 0.5, 0.625, 0.750, 0.875,1.0, 1.25, 1.50, 1.75. 2.0, 2.5, 3.0.
3.5, 4.0, 5.0, 6.0, 7.0.They can be represented pictorially as follows:412.1 F LOATING P OINT N UMBER S YSTEMTable 2.1. Floating point arithmetic parameters.Machine and arithmeticCray- 1 singleCray- 1 doubleDEC VAX G format, doubleDEC VAX D format, doubleHP 28 and 48G calculatorsIBM 3090 singleIBM 3090 doubleIBM 3090 extendedIEEE singleIEEE doubleIEEE extended (typical)β222210161616222t489653561261428245364e maxemin8191-81928191-81921023-1023127-127499-49963-6463-6463-64128-1251024-1021-16381 1638441115512615×××××××××××u10-1510-2910-1610-1710-1210 -710 - 1 610-3310 -810-1610-20Notice that the spacing of the floating point numbers jumps by a factor 2at each power of 2.
The spacing can be characterized in terms of machineepsilon, which is thefrom 1.0 to the next larger floating pointnumber. Clearly,= β 1 -t , and this is the spacing of the floating pointnumbers between 1.0 and β; the spacing of the numbers between 1.0 andF is estimated by the1/β is β -t =/β. The spacing at an arbitrary xfollowing lemma.Lemma 2.1. The spacing between a normalized floating point number x andan adjacent normalized floating point number is at least β - 1 | x| and at most|x| (unless x or the neighbour is zero).Proof. See Problem 2.2.The system F can be extended by including subnormal numbers (alsoknown as denormalized numbers), which, in the notation of (2.1), are thenumbersy = ±m × β e m i n- t ,0 < m < β t- 1 ,which have the minimum exponent and are not normalized (equivalently, in(2.2) e = emin and the most significant digit d 1 = 0).
The subnormal numbershave fewer digits of precision than the normalized numbers. The smallestpositive normalized floating point number is = β e min-1 while the smallestpositive subnormal number is µ = β e min-t =The subnormal numbersfill the gap betweenand 0 and are equally spaced, with spacing the sameas that of the numbers of F between andnamely= β e min-t. Forexample, in the system illustrated above with β = 2, t = 3, emin = -1, andemax = 3, we have = 2-2 and µ = 2-4, the subnormal numbers are42F LOATING P OINT A RITHMETIC0.0625, 0.125, 0.1875,and the complete number system can be represented asLet GIR denote all real numbers of the form (2.1) with no restrictionon the exponent e. If xIR then fl(x) denotes an element of G nearest to x,and the transformation xf l(x) is called rounding.
There are several waysto break ties when x is equidistant from two floating point numbers, includingtaking fl(x) to be the number of larger magnitude (round away from zero)or the one with an even last digit dt (round to even): the latter rule enjoysimpeccable statistics [144, 1973]. For more on tie-breaking strategies see theNotes and References.Although we have defined fl as a mapping onto G, we are only interestedin the cases where it produces a result in F.
We say that fl(x) overflows if|fl(x)| > max{|y| : yF} and underflows if 0 < |fl(x)| < min{|y| : 0yF} .The following result shows that every real number x lying in the range ofF can be approximated by an element of F with a relative error no largerthan u = ½β 1 -t . The quantity u is called the unit roundoff. It is the mostuseful quantity associated with F and is ubiquitous in the world of roundingerror analysis.Theorem 2.2. If xIR lies in the range of F thenfl(x) = x(1 + δ),|δ | < u.(2.3)Proof. We can assume that x > 0.
Writing the real number x in the formx = µ × βe- tβt -1 < µ < β t - 1,we see that x lies between the adjacent floating point numbers y 1 = |µ | β e - tand y2 = [µ ]β e - t. Thus fl(x) = y1 or y2 and we haveHence2.1 F L O A T I N G P O I N T N U M B E R S Y S T E M43The last inequality is strict unless µ = β t -1, in which case x = fl(x), hencethe inequality of the theorem is strict.Theorem 2.2 says that fl(x) is equal to x multiplied by a factor very closeto 1. The representation 1 + δ for the factor is the standard choice, but it isnot the only possibility. For example, we could write the factor as ea , with abound on |a| a little less than u (cf. the rp notation in §3.4).The following modified version of this theorem can also be useful.Theorem 2.3. If xIR lies in the range of F thenProof.
See Problem 2.4.The widely used IEEE standard arithmetic (described in §2.3) has β = 2and supports two precisions. Single precision has t = 24, emin = -125,emax = 128, and u = 2-245.96 × 10-8. Double precision has t = 53,emin = -1021, emax = 1024, and u = 2-53 = 1.11 × 10-16. IEEE arithmeticuses round to even.It is easy to see thatHence, while the relative error in representing x is bounded by ½β 1 -t (asit must be, by Theorem 2.2), the relative error varies with x by as muchas a factor β. This phenomenon is called wobbling precision, and is one ofthe reasons why small bases (in particular, β = 2) are favoured. The effectof wobbling precision is clearly displayed in Figure 2.1, which plots machinenumbers x versus the relative distance from x to the next larger machinenumber, for 1 < x < 16 in IEEE single precision arithmetic.
In this plot, therelative distances range from about 2-23 = 1.19 × 10-7 just to the right of apower of 2 to about 2-24 = 5.96 × 10-8 just to the left of a power of 2 (seeLemma 2.1).The notion of ulp, or “unit in last place”, is sometimes used when describing the accuracy of a floating point result. One ulp of the normalized floatingpoint number y = ±β e × .d 1 d 2 . . . dt is ulp(y) = β e × .00 . .
. 01 = β e-t. If xis any real number we can say that y and x agree to | y - x|/ulp(y) ulps iny. This measure of accuracy “wobbles” when y changes from a power of β tothe next smaller floating point number, since ulp(y ) decreases by a factor β.In MATLAB the permanent variable eps represents the machine epsilon(not the unit roundoff as is sometimes thought).