Higham - Accuracy and Stability of Numerical Algorithms (523152), страница 18
Текст из файла (страница 18)
For a clever use of an FMA operation to achieve increasedaccuracy in a computation, see Problem 2.25.During the design of the IBM 7030, Sweeney [982, 1965] collected statisticson the floating point additions carried out by selected application programson an IBM 704. He found that 11% of all instructions traced were floatingpoint additions. Details were recorded of the shifting needed to align floatingpoint numbers prior to addition, and the results were used in the design ofthe shifter on the IBM 7030.The word bit, meaning binary digit, first appeared in print in a 1948paper of Claude E.
Shannon, but the term was apparently coined by JohnW. Tukey [1022, 198 4]. The word byte, meaning a group of (usually eight)bits, did not appear in print until 1959 [156, 1981].The earliest reference we know for Theorem 2.5 is Sterbenz [938, 1974,Thm. 4.3.1]. Theorem 2.4 is due to Ferguson [370, 1995], who proves a moregeneral version of the theorem that allows for trailing zero digits in x and y.A variation in which the condition is 0 < y < x < y + β e .
where e = min{j:βj > y}, is stated by Ziv [1132, 1991] and can be proved in a similar way.For more on the choice of base, see Cody [227, 1973] and Kuki and Cody[677, 1973]. Buchholz’s paper Fingers or Fists? [155, 1959] on binary versus2.9 N OTESANDR EFERENCES61decimal representation of numbers on a computer deserves mention for itsclever title, though the content is only of historical interest.The model (2.4) ignores the possibility of underflow and overflow. To takeunderflow into account the model must be modified tofl(x op y) = (x op y)(1 + δ) + η,op = +,-,*,/.(2.8)As before, |δ| < u. If underflow is gradual, as in IEEE arithmetic, then |η| <which is half the spacing between the subnormal numberse min-1= βis the smallest positive normalized floating point number); ifunderflows are flushed to zero then |η | <Only one of δ and η is nonzero:δ if no underflow occurs, otherwise η.
With gradual underflow the absoluteerror of an underflowed result is no greater than the smallest (bound for the)absolute error that arises from an operation fl(x op y) in which the argumentsand result are normalized. For more details, and a thorough discussion of howerror analysis of standard algorithms is affected by using the model (2.8), seethe perceptive paper by Demmel [280, 198 4]. Another relevant reference isNeumaier [789, 1985].Algorithms for evaluating elementary functions in IEEE arithmetic aredeveloped by Tang [987, 1989], [989, 1990], [991, 1992], Gal and Bachelis [412,1991], and Ziv [1132, 1991]. Tang [990, 1991] gives a very readable descriptionof table lookup algorithms for evaluating elementary functions, which are usedin a number of current chips.Algorithms for evaluating complex elementary functions that exploit exception handling and assume the availability of algorithms for the real elementary functions are presented by Hull, Fairgrieve, and Tang [592, 1994].
Fordetails of how elementary functions are evaluated on many of today’s pocketcalculators see Schelin [898, 1983].An important problem not considered in this chapter is the conversion ofnumbers between decimal and binary representations. These conversions areneeded whenever numbers are read into a computer or printed out. They tendto be taken for granted, but if not done carefully they can lead to puzzlingbehaviour, such as a number read in as 0.1 being printed out as 0.099. .
.9.To be precise, the problems of interest are (a) convert a number representedin decimal notation to the best binary floating point representation of a givenprecision, and (b) given a binary floating point number, print a correctlyrounded decimal representation, either to a given number of significant digitsor to the smallest number of significant digits that allows the number to bere-read without loss of accuracy. Algorithms for solving these problems aregiven by Clinger [220, 1990] and Steele and White [936, 1990]; Gay [431, 1990]gives some improvements to the algorithms and C code implementing them.Precise requirements for binary-decimal conversion are specified in the IEEEarithmetic standard. A program for testing the correctness of binary-decimalconversion routines is described by Paxson [823, 1991].
Early references onF LOATING P OINT A RITHMETIC62base conversion are Goldberg [458, 1967] and Matula [739, 1968], [740, 1970].It is interesting to note that, in Fortran or C, where the output format fora “print” statement can be precisely specified, most compilers will, for an(in)appropriate choice of format, print a decimal string that contains manymore significant digits than are determined by the floating point number whosevalue is being represented.Other authors who have analysed various aspects of floating (and fixed)point arithmetic include Diamond [305, 1978], Urabe [1037, 1968], and Feldstein, Goodman, and co-authors [471, 1975], [368, 198 2], [472, 198 5], [369,19 86].
For a survey of computer arithmetic up until 1976 that includes anumber of references not given here, see Garner [420, 1976].ProblemsThe exercise had warmed my blood, andI was beginning to enjoy myself amazingly.-JOHN BUCHAN, The Thirty-Nine Steps (1915)2.1. How many normalized numbers and how many subnormal numbers arethere in the system F defined in (2.1) with emin < e < emax? What are thefigures for IEEE single and double precision (base 2)?2.2.
Prove Lemma 2.1.2.3. In IEEE arithmetic how many double precision numbers are there between any two adjacent nonzero single precision numbers?2.4. Prove Theorem 2.3.2.5. Show thatand deduce that 0.1 has the base 2 representation 0.0001100 (repeating last 4bits). Let= fl(0.1) be the rounded version of 0.1 obtained in binary IEEEsingle precision arithmetic (u = 2-24). Show that2.6. What is the largest integer p such that all integers in the interval [-p,p]are exactly representable in IEEE double precision arithmetic? What is thecorresponding p for IEEE single precision arithmetic?2.7. Which of the following statements is true in IEEE arithmetic, assumingthat a and b are normalized floating point numbers and that no exceptionoccurs in the stated operations?1.
fl(a op b) = fl(b op a),2 . f l(b - a) = -fl(a - b).op = +,*.P ROBLEMS633. fl(a + a) = fl(2* a ) .4 . f l(0.5*a) = fl(a/2).5. fl((a + b) + c) = fl(a + (b + c)).6. a < fl( (a + b)/2) < b, given that a < b.2.8. Show that the inequalities a < fl((a + b)/2) < b, where a and b arefloating point numbers with a < b, can be violated in base 10 arithmetic.Show that a < fl(a+(b-a)/2) < b in base β arithmetic, for any β, assumingthe use of a guard digit.2.9. What is the result of the computationin IEEE double precision arithmetic, with and without double rounding from an extended formatwith a 64-bit mantissa?2.10. A theorem of Kahan [457, 1991, Thm.
7] says that if β = 2 and thearithmetic rounds as specified in the IEEE standard, then for integers m andn with |m| < 2t-1 and n = 2i + 2j (some i, j), fl((m/n)*n) = m. Thus,for example, fl((1/3) * 3) = 1 (even though fl(1/3)1/3). The sequence ofallowable n begins 1,2,3,4,5,6,8,9,10,12,16,17,18,20, so Kahan’s theoremcovers many common cases. Test the theorem on your computer.2.11. Investigate the leading significant digit distribution for numbers obtained as follows.1. kn , n = 0:1000 for k = 2 and 3.2. n!, n = 1:1000.3. The eigenvalues of a random symmetric matrix.4.
Physical constants from published tables.5. From the front page of the London Times or the New York Times.(Note that in writing a program for the first case you can form the powers of2 or 3 in order, following each multiplication by a division by 10, as necessary,to keep the result in the range [1,10]. Similarly for the second case.)2.12. (Edelman [343, 1994]) Let x be a floating point number in IEEE doubleprecision arithmetic satisfying 1 < x < 2. Show that fl( x *(1/x )) is either 1or 1 where= 2-52 (the machine epsilon).2.13.
(Edelman [343, 1994]) Consider IEEE double precision arithmetic. Findthe smallest positive integer j such that fl( x *(1/x ))1, where x = 1 +with= 2-52 (the machine epsilon).2.14. Kahan has stated that “an (over-)estimate of u can be obtained foralmost any machine by computing |3 × (4/3 - 1) - 1| using rounded floatingpoint for every operation”. Test this estimate against u on any machinesavailable to you.F LOATING P OINT A RITHMETIC642.15.