Higham, Accuracy and Stability of Numerical Algorithms, page 97
The spacing of the double precision numbers on [1, 2] is 2^−52, so the answer is 2^29 − 1 ≈ 5.4 × 10^8.

2.4. Inspecting the proof of Theorem 2.2 we see that |y_i| ≥ β^(e−1), i = 1, 2, and so we also have |fl(x) − x|/|fl(x)| < u, that is, fl(x) = x/(1 + δ), |δ| < u. Note that this is not the same δ as in Theorem 2.2, although it has the same bound, and unlike in Theorem 2.2 there can be equality in this bound for δ.

2.5. The first part is trivial. Since, in binary notation,

x = 0.1100 1100 1100 1100 1100 1100 | 1100 .
. . . × 2^−3,

we have, rounding to 24 bits,

fl(x) = 0.1100 1100 1100 1100 1100 1101 × 2^−3.

Thus fl(x) − x = 2^−27/5, and so fl(x) = x(1 + 2^−26): the relative error is 2^−26 = u/4.

2.6. Since the double precision mantissa contains 53 bits, p = 2^53 = 9007199254740992 ≈ 9.01 × 10^15. For single precision, p = 2^24 = 16777216 ≈ 1.68 × 10^7.

2.7.
1. True, since in IEEE arithmetic fl(a op b) is defined to be the rounded value of a op b, round(a op b), which is the same as round(b op a).
2. True for round to nearest (the default) and round to zero, but false for round to ±∞.
3. True, because fl(a + a) := round(a + a) = round(2 * a) =: fl(2 * a).
4. True: similar to 3.
5. False, in general.
6. True for binary arithmetic.
Since the division by 2 is exact, the inequality is equivalent to 2a ≤ fl(a + b) ≤ 2b. But a + b ≤ 2b, so, by the monotonicity of rounding, fl(a + b) = round(a + b) ≤ round(2b) = 2b. The lower bound is verified similarly.

2.8. Examples in 3-digit decimal arithmetic are fl((5.01 + 5.02)/2) = fl(10.0/2) = 5.0 and fl((5.02 + 5.04)/2) = fl(10.1/2) = 5.05.

The left-hand inequality is immediate, since fl((b − a)/2) ≥ 0.
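These 3-digit decimal examples can be reproduced with Python's decimal module, which rounds every operation to the context precision (here 3 significant digits):

```python
from decimal import Decimal, getcontext

# Emulate 3-digit decimal arithmetic with round to nearest.
getcontext().prec = 3

def mid(a, b):
    # fl((a + b)/2): the sum rounds to 3 digits, then the division rounds.
    return (Decimal(a) + Decimal(b)) / 2

print(mid("5.01", "5.02"))  # 5.0, which is below a = 5.01
print(mid("5.02", "5.04"))  # 5.05, which is above b = 5.04
```

Both failures of a ≤ fl((a + b)/2) ≤ b occur because the 3-digit sum 10.0 (respectively 10.1) has already lost the information that distinguishes the endpoints.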
The right-hand inequality can possibly be violated only if a and b are close (otherwise (a + b)/2 is safely less than b, so the inequality will hold). But then, by Theorem 2.5, fl(b − a) is obtained exactly, and the result will have several zero low-order digits in the mantissa because of cancellation. Consequently, if the base is even the division is done exactly, and the right-hand inequality follows by the monotonicity of rounding. Alternatively, and even more simply, for any base we can argue that fl((b − a)/2) ≤ b − a, so that a + fl((b − a)/2) ≤ b and, again, the result follows by the monotonicity of rounding.

2.9. The number in question is, in binary,

x = 0.11…1 0 11…1,

with 53 ones before the zero and ones after it extending beyond bit 64. Rounded directly to 53 bits this yields 1 − 2^−53. But rounded first to 64 bits it yields

0.11…1 1 (53 ones followed by a one) = 1 − 2^−53 + 2^−54,

and when this number is rounded to 53 bits using the round to even rule it yields 1.0.

2.12.
Without loss of generality we can take 1 ≤ x < 2, so that 1/x ∈ (1/2, 1]. The spacing between the floating point numbers in the interval (1/2, 1] is 2^−t = u (cf. Lemma 2.1), so |1/x − fl(1/x)| ≤ u/2, which implies that |1 − x fl(1/x)| ≤ xu/2 < u. Thus x fl(1/x) ∈ (1 − u, 1 + u). Since the spacing of the floating point numbers just to the right of 1 is 2u, x fl(1/x) must round to either 1 − u or 1.

2.13.
If there is no double rounding the answer is 257,736,490. For a proof that combines mathematical analysis with computer searching, see Edelman [343, 1994].

2.15. The IEEE standard does not define the results of exponentiation. The choice 0^0 = 1 can be justified in several ways. For example, if p(x) = Σ_{i=0}^n a_i x^i then p(0) = a_0 = a_0 × 0^0, and the binomial expansion (x + y)^n = Σ_{k=0}^n (n choose k) x^k y^(n−k) yields 1 = 0^0 for x = 0, y = 1.
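This convention is what common language implementations adopt; in Python, for instance, both integer exponentiation and the C library pow underlying math.pow return 1:

```python
import math

# Both integer exponentiation and the C library pow() adopt 0**0 = 1.
print(0 ** 0)              # 1
print(math.pow(0.0, 0.0))  # 1.0
```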
For more detailed discussions see Goldberg [457, 1991, p. 32] and Knuth [669, 1992, pp. 406–408].

2.17. For IEEE arithmetic the answer is no. Since fl(x op y) is defined to be the rounded value of x op y, b² − ac ≥ 0 implies fl(b²) − fl(ac) ≥ 0 (rounding is a monotonic operation). The final computed answer is

fl(fl(b²) − fl(ac)) = (fl(b²) − fl(ac))(1 + δ), |δ| < u,

which is therefore nonnegative.

2.18. No. A counterexample in 2-digit base-10 arithmetic is fl(2.0 − 0.91) = fl(1.09) = 1.1.

2.19.
The function x ↦ fl(x²) maps the set of positive floating point numbers onto a set of floating point numbers with about half as many elements. Hence there exist two distinct floating point numbers x having the same value of fl(x²), and so the condition fl(√fl(x²)) = |x| cannot always be satisfied in floating point arithmetic. The requirement fl(√fl(x²)) = |x| is reasonable for base 2, however, and is satisfied in IEEE arithmetic, as we now show.

Without loss of generality, we can assume that 1 ≤ x < 2, since scaling x by a power of 2 does not alter fl(√fl(x²)). By definition, fl(x²) is the nearest floating point number to x², so fl(x²) = x²(1 + δ), |δ| ≤ u, and

√fl(x²) = x(1 + δ)^(1/2) = x(1 + δ/2 − δ²/8 + ⋯) =: x(1 + θ).

Now the spacing of the floating point numbers between 1 and 2 is 2^(1−t) = 2u, and |θ| < u if u ≤ 1/4 (say), so |x| is the nearest floating point number to √fl(x²) and fl(√fl(x²)) = |x|.

In base-10 floating point arithmetic, the condition fl(√fl(x²)) = |x| can be violated. For example, working to 5 significant decimal digits, if x = 3.1625 then fl(x²) = fl(10.0014 0625) = 10.001, and fl(√fl(x²)) = fl(3.1624 3577 .
. .) = 3.1624 < x.2.20. On a Cray Y-MP the answer is yes, but in base-2 IEEE arithmetic the answeris no. It suffices to demonstrate that= sign(x)which is shownby the proof of Problem 2.19.2.21. The test “x > y“ returns false if x or y is a NaN, so the code computesmax(NaN, 5) = 5 and max(5, NaN) = NaN, which violates the obvious requirementthat max(x, y) = max(y, x). Since the test x x identifies a NaN, the followingcode implements a reasonable definition of max(x, y):% max(x, y)if x x thenmax = yelse if y y thenmax = xelse if y > x thenS OLUTIONS536TOP ROBLEMSmax = yelsemax = xendendendA further refinement is to ensure that max(-0, + 0) = +0, which is not satisfied bythe code above since –0 and +0 compare as equal; this requires bit-level programming.2.22.
We give an informal proof; the details are obtained by using the model fl(x op y) = (x op y)(1 + δ), but they obscure the argument.

Since a, b, and c are nonnegative, a + (b + c) is computed accurately. Since c ≤ b ≤ a, c + (a − b) and a + (b − c) are the sums of two positive numbers and so are computed accurately. Since a, b, and c are the lengths of sides of a triangle, a ≤ b + c; hence, using c ≤ b ≤ a,

b ≤ a ≤ b + c ≤ 2b,

which implies that a − b is computed exactly, by Theorem 2.5.
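The formula being analyzed is Kahan's A = ¼√((a + (b + c))(c − (a − b))(c + (a − b))(a + (b − c))), whose factors are exactly the quantities discussed here. A minimal Python sketch (the function name is our own, and the parenthesization must be respected):

```python
import math

def kahan_area(a, b, c):
    """Triangle area from side lengths a >= b >= c, using the
    parenthesization that makes each subtraction benign."""
    if not (a >= b >= c >= 0):
        raise ValueError("require a >= b >= c >= 0")
    return 0.25 * math.sqrt(
        (a + (b + c)) * (c - (a - b)) * (c + (a - b)) * (a + (b - c))
    )

# A well-conditioned check: the 3-4-5 right triangle has area 6,
# and every operation here happens to be exact in binary arithmetic.
print(kahan_area(5.0, 4.0, 3.0))  # 6.0
```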
Hence c − (a − b) is the difference of two exactly represented numbers and so is computed accurately. Thus fl(A) = ¼√x̂ to within a couple of rounding errors, where x̂ is an accurate approximation to the desired argument x of the square root. It follows that fl(A) is accurate.

2.23. For a machine with a guard digit, y = x, by Theorem 2.5 (assuming 2x does not overflow).
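A quick Python check, assuming the computation in question is y = fl((x + x) − x) (the problem statement is not reproduced here); IEEE arithmetic behaves as though a guard digit is present, so y recovers x exactly:

```python
# Assumed form of the computation: y = fl((x + x) - x).
# By Theorem 2.5 (Sterbenz), the subtraction (x + x) - x is exact in
# IEEE arithmetic, so y == x whenever 2x does not overflow.
for x in [1.0, 1.0 + 2.0**-52, 0.1, 3.0 / 7.0, 1e300]:
    y = (x + x) - x
    assert y == x
print("y == x for every test value")
```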
If the machine lacks a guard digit then the subtraction produces x if the last bit of x is zero; otherwise it produces an adjacent floating point number with a zero last bit. In either case the result has a zero last bit. Gu, Demmel, and Dhillon [484, 1994] apply this bit zeroing technique to numbers d1, d2, …, dn arising in a divide and conquer bidiagonal SVD algorithm, their motivation being that the differences di − dj can then be computed accurately even on machines without a guard digit.

2.24. The function f(x) = 3x − 1 has the single root x = 1/3. We can have fl(f(x)) = 0 only for x ≈ 1/3.
For x ≈ 1/3 the evaluation involves four subtractions (see the expression in the problem statement). The first, second, and fourth subtractions are done exactly, by Theorem 2.5. The result of the first subtraction has a zero least-significant bit and the result of the second has two zero least-significant bits; consequently the third subtraction suffers no loss of trailing bits and is done exactly. Therefore f(x) is computed exactly for x = fl(x) near 1/3.
But fl(x) can never equal 1/3 on a binary machine, so fl(f(x)) ≠ 0 for all x.

2.25. With ŵ = fl(bc) = bc(1 + δ1) and ê = fl(ŵ − bc) = bcδ1(1 + δ2), the computed result is x̂ = [(ad − ŵ)(1 + δ3) + ê](1 + δ4), where |δi| ≤ u, i = 1:4. Hence

x̂(1 + δ4)^(−1) = (ad − bc − bcδ1)(1 + δ3) + bcδ1(1 + δ2) = x + xδ3 − bcδ1(δ3 − δ2),

so that |x̂ − x| ≤ (2u + u²)|x| + 2u²(1 + u)|bc|, which implies high relative accuracy unless u|bc| ≫ |x|. For comparison, the bound for standard evaluation of fl(ad − bc) is |x − x̂| ≤ γ2(|ad| + |bc|).

2.26.
Newton's method for f(x) = 1/x − a is

x_{k+1} = x_k(2 − x_k a).

The quadratic convergence can be seen from 1 − x_{k+1}a = (1 − x_k a)².

2.27. We would like (2.8) to be satisfied, that is, we want ŷ = fl(x/y) to satisfy

ŷ = (x/y)(1 + δ), |δ| ≤ u. (A.1)

This implies that |ŷy − x| ≤ u|x|. In place of r = ŷy − x we have to use r̂ = fl(ŷy − x), where we can assume the subtraction is exact, since the case of interest is where ŷy ≈ x (see Theorem 2.5).
Thus r̂ differs from r only by the rounding error in forming the product ŷy, and using (A.1) we obtain a bound of order u|x| for |r̂|. Therefore the convergence test compares |r̂| with a modest multiple of u|x|. Since u|x| can underflow to zero it cannot safely be precomputed, and we should instead compute the test in a form that avoids the underflow.

3.1. The proof is by induction. Inductive step: for ρn = +1,

(1 + θ_{n−1})(1 + δn) = 1 + θn, θn = θ_{n−1} + δn(1 + θ_{n−1}),

so |θn| ≤ γ_{n−1} + u(1 + γ_{n−1}) ≤ γn. For ρn = −1 we find, similarly, that

(1 + θ_{n−1})/(1 + δn) = 1 + θn, θn = (θ_{n−1} − δn)/(1 + δn),

so |θn| ≤ (γ_{n−1} + u)/(1 − u) ≤ γn.

3.2. This result can be proved by a modification of the proof of Lemma 3.1. But it follows immediately from the penultimate line of the proof of Lemma 3.4.

3.3. The computed iterates q̂k satisfy q̂k = fl(ak + bk/q̂_{k+1}). Defining q̂k = qk + ek and expanding to first order in the rounding errors, we obtain a recurrence bounding |ek| in terms of |e_{k+1}|. The running error bound µ can therefore be computed along with the continued fraction as follows:

    q_{n+1} = a_{n+1}
    f_{n+1} = 0
    for k = n: −1: 0
        r_k = b_k/q_{k+1}
        q_k = a_k + r_k
        f_k = |q_k| + |r_k| + |b_k| f_{k+1}/((|q_{k+1}| − u f_{k+1})|q_{k+1}|)
    end
    µ = u f_0

The error bound is valid provided that |q_{k+1}| − u f_{k+1} > 0 for all k.
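The recurrence above translates directly into Python (a sketch; u is taken to be the unit roundoff 2^−53 of IEEE double precision, and the arrays follow the indexing in the text):

```python
def continued_fraction_bound(a, b, u=2.0**-53):
    """Evaluate q0 = a0 + b0/(a1 + b1/(... + bn/a_{n+1})) together with
    the running error bound mu = u*f0.

    a has n+2 entries a[0..n+1]; b has n+1 entries b[0..n].
    """
    n = len(b) - 1
    q = a[n + 1]      # q_{k+1}, initially q_{n+1}
    f = 0.0           # f_{k+1}, initially f_{n+1}
    for k in range(n, -1, -1):
        r = b[k] / q                   # r_k = b_k / q_{k+1}
        q_new = a[k] + r               # q_k = a_k + r_k
        d = (abs(q) - u * f) * abs(q)  # valid only if |q_{k+1}| - u*f_{k+1} > 0
        if d <= 0:
            raise ArithmeticError("running error bound not valid")
        f = abs(q_new) + abs(r) + abs(b[k]) * f / d
        q = q_new
    return q, u * f

# Example: 1 + 1/(2 + 1/2) = 1.4; the bound mu is a small multiple of u.
val, mu = continued_fraction_bound([1.0, 2.0, 2.0], [1.0, 1.0])
```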
Otherwise a more sophisticated approach is necessary (for example, to handle the case where q_{k+1} = 0, q_k = ∞, and q_{k−1} is finite).

3.4. We prove just the division result and the last result. Writing (1 + θj)/(1 + θk) =: 1 + α, we have α = (θj − θk)/(1 + θk), and bounding this quotient gives the stated result; however, it is easy to verify that, in fact, a sharper bound holds if j ≤ k. The last result is verified directly from the definitions.

3.5.
We have fl(AB) = AB + ∆C = (A + ∆C B^(−1))B =: (A + ∆A)B, where ∆A = ∆C B^(−1) and |∆C| ≤ γ_n|A||B| by (3.12). Similarly, fl(AB) = A(B + ∆B) with ∆B = A^(−1)∆C.

3.6. The perturbation bounds imply inequalities relating the residual C − (A + ∆A)(B + ∆B) to the sizes of ∆A and ∆B; solving these inequalities for the perturbation size gives the required lower bound.

The definition is flawed in general, as can be seen from rank considerations. For example, if A and B are vectors and rank(C) > 1, then no perturbations ∆A and ∆B exist to satisfy C = (A + ∆A)(B + ∆B)! Thus we have to use a mixed forward/backward stability definition in which we perturb C (by a small amount) as well as A and B.

3.7.