Heath, Scientific Computing
They have been almost universally adopted for personal computers and workstations, as well as for many mainframes and supercomputers. The IEEE standard was carefully crafted to eliminate the many anomalies and ambiguities in earlier vendor-specific floating-point implementations and has greatly facilitated the development of portable and reliable numerical software. It also allows for sensible and consistent handling of exceptional situations, such as division by zero.

1.3.2 Normalization

A floating-point system is said to be normalized if the leading digit d_0 is always nonzero unless the number represented is zero.
Thus, in a normalized floating-point system, the mantissa m of a given nonzero floating-point number always satisfies

    1 ≤ m < β.

(An alternative convention is that d_0 is always zero, in which case a floating-point number is said to be normalized if d_1 ≠ 0, and β^(−1) ≤ m < 1 instead.) Floating-point systems are usually normalized because

• The representation of each number is then unique.
• No digits are wasted on leading zeros, thereby maximizing precision.
• In a binary (β = 2) system, the leading bit is always 1 and thus need not be stored, thereby gaining one extra bit of precision for a given field width.

1.3.3 Properties of Floating-Point Systems

A floating-point number system is finite and discrete. The number of normalized floating-point numbers is

    2(β − 1)β^(t−1)(U − L + 1) + 1

because there are two choices of sign, β − 1 choices for the leading digit of the mantissa, β choices for each of the remaining t − 1 digits of the mantissa, and U − L + 1 possible values for the exponent.
The 1 is added because the number could be zero.

There is a smallest positive normalized floating-point number,

    Underflow level = UFL = β^L,

which has a 1 as the leading digit and 0 for the remaining digits of the mantissa, and the smallest possible value for the exponent. There is a largest floating-point number,

    Overflow level = OFL = β^(U+1)(1 − β^(−t)),

which has β − 1 as the value for each digit of the mantissa and the largest possible value for the exponent. Any number larger than OFL cannot be represented in the given floating-point system, nor can any positive number smaller than UFL.

Floating-point numbers are not uniformly distributed throughout their range, but are equally spaced only between successive powers of β. Not all real numbers are exactly representable in a floating-point system.
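These formulas are easy to check numerically. The sketch below (a hypothetical helper, not from the text) evaluates the count of floating-point numbers, UFL, and OFL for arbitrary system parameters β, t, L, U:

```python
def fp_system_properties(beta, t, L, U):
    """Count of floating-point numbers (including zero), UFL, and OFL
    for a normalized system with base beta, precision t, exponents L..U."""
    count = 2 * (beta - 1) * beta ** (t - 1) * (U - L + 1) + 1
    ufl = float(beta) ** L                                    # smallest positive normalized number
    ofl = float(beta) ** (U + 1) * (1 - float(beta) ** (-t))  # largest number
    return count, ufl, ofl

# Toy system of Example 1.5: beta = 2, t = 3, L = -1, U = 1
print(fp_system_properties(2, 3, -1, 1))  # (25, 0.5, 3.5)
```

The result agrees with Example 1.5: 25 numbers in total, with UFL = 0.5 and OFL = 3.5.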
Real numbers that are exactly representable in a given floating-point system are sometimes called machine numbers.

Example 1.5 Floating-Point System. An example floating-point system is illustrated in Fig. 1.2, where the tick marks indicate all of the 25 floating-point numbers in a system having β = 2, t = 3, L = −1, and U = 1.
For this system, the largest number is OFL = (1.11)_2 × 2^1 = (3.5)_10, and the smallest positive normalized number is UFL = (1.00)_2 × 2^(−1) = (0.5)_10. This is a very tiny, toy system for illustrative purposes only, but it is in fact characteristic of floating-point systems in general: at a sufficiently high level of magnification, every normalized floating-point system looks essentially like this one, grainy and unequally spaced.

Figure 1.2: Example of a floating-point number system. [Number line from −4 to 4; tick marks indicate the 25 representable numbers.]

1.3.4 Rounding

If a given real number x is not exactly representable as a floating-point number, then it must be approximated by some "nearby" floating-point number.
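The machine numbers of the toy system can also be enumerated by brute force, confirming the count formula and the UFL and OFL values of Example 1.5 (a small illustrative script, not part of the text):

```python
# Enumerate all machine numbers of the toy system: beta=2, t=3, L=-1, U=1.
beta, t, L, U = 2, 3, -1, 1
machine_numbers = {0.0}
for e in range(L, U + 1):
    for d1 in range(beta):
        for d2 in range(beta):
            m = 1 + d1 / beta + d2 / beta**2   # normalized mantissa: 1 <= m < beta
            x = m * float(beta) ** e
            machine_numbers.update({x, -x})

print(len(machine_numbers))                       # 25
print(min(v for v in machine_numbers if v > 0))   # 0.5 (UFL)
print(max(machine_numbers))                       # 3.5 (OFL)
```

Printing the sorted positive values also shows the uneven spacing: the gap between neighbors doubles at each power of 2.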
We denote the floating-point approximation of a given real number x by fl(x). The process of choosing a nearby floating-point number fl(x) to approximate a given real number x is called rounding, and the error introduced by such an approximation is called rounding error, or roundoff error. Two of the most commonly used rounding rules are

• Chop: The base-β expansion of x is truncated after the (t − 1)st digit. Since fl(x) is the next floating-point number toward zero from x, this rule is also sometimes called round toward zero.
• Round to nearest: fl(x) is the nearest floating-point number to x; in case of a tie, we use the floating-point number whose last stored digit is even.
Because of the latter property, this rule is also sometimes called round to even. Rounding to nearest is the most accurate, but it is somewhat more expensive to implement correctly. Some systems in the past have used rounding rules that are cheaper to implement, such as chopping, but rounding to nearest is the default rounding rule in IEEE standard systems.

Example 1.6 Rounding Rules. Rounding the following decimal numbers to two digits using each of the rounding rules gives the following results:

    Number   Chop   Round to nearest
    1.649    1.6    1.6
    1.650    1.6    1.6
    1.651    1.6    1.7
    1.699    1.6    1.7
    1.749    1.7    1.7
    1.750    1.7    1.8
    1.751    1.7    1.8
    1.799    1.7    1.8
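The two rules can be mimicked in decimal arithmetic with Python's decimal module, whose ROUND_DOWN and ROUND_HALF_EVEN modes correspond to chopping and round to even (a sketch reproducing Example 1.6, not part of the text):

```python
from decimal import Decimal, ROUND_DOWN, ROUND_HALF_EVEN

def chop(x):
    """Truncate toward zero after one fractional digit."""
    return Decimal(x).quantize(Decimal("0.1"), rounding=ROUND_DOWN)

def round_to_nearest(x):
    """Round to nearest, with ties going to the even last digit."""
    return Decimal(x).quantize(Decimal("0.1"), rounding=ROUND_HALF_EVEN)

for s in ["1.649", "1.650", "1.651", "1.699", "1.749", "1.750", "1.751", "1.799"]:
    print(s, chop(s), round_to_nearest(s))
# Note the tie cases: 1.650 -> 1.6 (6 is even) but 1.750 -> 1.8 (8 is even).
```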
A potential source of additional error that is often overlooked is in the decimal-to-binary and binary-to-decimal conversions that usually take place upon input and output of floating-point numbers. Such conversions are not covered by the IEEE standard, which governs only internal arithmetic operations. Correctly rounded input and output can be obtained at reasonable cost, but not all computer systems do so. Efficient, portable routines for correctly rounded binary-to-decimal and decimal-to-binary conversions (dtoa and strtod, respectively) are available from netlib (see Section 1.4.1).

1.3.5 Machine Precision

The accuracy of a floating-point system can be characterized by a quantity variously known as the unit roundoff, machine precision, or machine epsilon.
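Conversion effects are easy to observe: the decimal literal 0.1 has no finite binary expansion, so the double actually stored differs slightly from one tenth (a standard illustration, assuming IEEE double precision):

```python
from decimal import Decimal

# The double nearest to 0.1 is slightly larger than 1/10, so
# accumulated conversion error is visible even in simple sums.
print(0.1 + 0.2 == 0.3)    # False
print(Decimal(0.1))        # the exact binary value stored for 0.1
```

The second line prints the exact decimal expansion of the stored double, which begins 0.1000000000000000055..., making the representation error visible.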
Its value, which we denote by ε_mach, depends on the particular rounding rule used. With rounding by chopping,

    ε_mach = β^(1−t),

whereas with rounding to nearest,

    ε_mach = (1/2) β^(1−t).

The unit roundoff is important because it determines the maximum possible relative error in representing a nonzero real number x in a floating-point system:

    |fl(x) − x| / |x| ≤ ε_mach.

An alternative characterization of the unit roundoff that you may sometimes see is that it is the smallest number ε such that

    fl(1 + ε) > 1,

but this is not quite equivalent to the previous definition if the round-to-even rule is used. Another definition sometimes used is that ε_mach is the distance from 1 to the next larger floating-point number, but this may differ from either of the other definitions.
Although they can differ in detail, all three definitions of ε_mach have the same basic intent as measures of the granularity of a floating-point system.

For the toy illustrative system in Example 1.5, ε_mach = 0.25 with rounding by chopping, and ε_mach = 0.125 with rounding to nearest. For IEEE binary floating-point systems, ε_mach = 2^(−24) ≈ 10^(−7) in single precision and ε_mach = 2^(−53) ≈ 10^(−16) in double precision. We thus say that the IEEE single- and double-precision floating-point systems have about 7 and 16 decimal digits of precision, respectively.

Though both are "small," the unit roundoff should not be confused with the underflow level.
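The distinction between the definitions can be probed directly in IEEE double precision: under round to even, fl(1 + 2^(−53)) = 1 exactly, so the smallest ε with fl(1 + ε) > 1 is strictly larger than the unit roundoff 2^(−53) (a sketch assuming IEEE double arithmetic):

```python
unit_roundoff = 2.0 ** -53    # eps_mach with rounding to nearest (t = 53)
gap_above_one = 2.0 ** -52    # distance from 1.0 to the next larger double

# 1 + 2**-53 lies exactly halfway between 1 and the next double;
# the tie rounds to the even candidate, which is 1 itself.
assert 1.0 + unit_roundoff == 1.0
assert 1.0 + gap_above_one > 1.0
print(unit_roundoff, gap_above_one)
```

This is why the "smallest ε such that fl(1 + ε) > 1" definition and the "gap above 1" definition both differ from the unit roundoff under round to even.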
The unit roundoff ε_mach is determined by the number of digits in the mantissa field of a floating-point system, whereas the underflow level UFL is determined by the number of digits in the exponent field. In all practical floating-point systems,

    0 < UFL < ε_mach < OFL.

1.3.6 Subnormals and Gradual Underflow

In the toy floating-point system illustrated in Fig. 1.2, there is a noticeable gap around zero. This gap, which is present to some degree in any floating-point system, is due to normalization: the smallest possible mantissa is 1.00..., and the smallest possible exponent is L, so there are no floating-point numbers between zero and β^L. If we relax our insistence on normalization and allow leading digits to be zero (but only when the exponent is at its minimum value), then the gap around zero can be "filled in" by additional floating-point numbers. For our toy illustrative system, this relaxation gains six additional floating-point numbers, the smallest positive one of which is (0.01)_2 × 2^(−1) = (0.125)_10, as shown in Fig. 1.3.

Figure 1.3: Example of a floating-point system with subnormals. [Number line from −4 to 4; extra tick marks fill the gap around zero.]

The extra numbers added to the system in this way are referred to as subnormal or denormalized floating-point numbers.
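Gradual underflow can be observed in IEEE double precision, where UFL = 2^(−1022) ≈ 2.2 × 10^(−308) and subnormals extend down to 2^(−1074) (a sketch assuming IEEE doubles; sys.float_info.min is the double-precision UFL):

```python
import sys

ufl = sys.float_info.min    # smallest positive normalized double, 2**-1022
print(ufl)                  # ~2.2250738585072014e-308

# With gradual underflow, halving UFL yields a subnormal, not zero:
assert ufl / 2 > 0.0
assert 5e-324 > 0.0         # smallest positive subnormal, 2**-1074
assert 5e-324 / 2 == 0.0    # below the subnormal range: underflow to zero
```

Without subnormals, ufl / 2 would flush abruptly to zero instead of landing on a subnormal value.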
Although they usefully extend the range of magnitudes representable, subnormal numbers have inherently lower precision than normalized numbers because they have fewer significant digits in their fractional parts. In particular, extending the range in this manner does not make the unit roundoff ε_mach any smaller. Such an augmented floating-point system is sometimes said to exhibit gradual underflow, since it extends the lower range of magnitudes representable rather than underflowing to zero as soon as the minimum exponent value would otherwise be exceeded. The IEEE standard provides for such subnormal numbers and gradual underflow. Gradual underflow is implemented through a special reserved value of the exponent field, because the leading binary digit is not stored and hence cannot be used to indicate a denormalized number.

1.3.7 Exceptional Values

The IEEE floating-point standard provides two additional special values that indicate exceptional situations:

• Inf, which stands for "infinity," results from dividing a finite number by zero, such as 1/0.
• NaN, which stands for "not a number," results from undefined or indeterminate operations such as 0/0, 0 ∗ Inf, or Inf/Inf.

Inf and NaN are implemented in IEEE arithmetic through special reserved values of the exponent field. Whether Inf and NaN are supported at the user level in a given computing environment depends on the language, compiler, and run-time system.
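Language-level support does indeed vary, as the text notes. In Python, for example, 1.0 / 0.0 raises an exception rather than returning Inf, but the values themselves are available and behave as the standard prescribes (a brief sketch):

```python
import math, sys

inf = float("inf")
nan = float("nan")          # NaN also arises from operations such as inf - inf

assert inf > sys.float_info.max    # Inf exceeds every finite double
assert math.isnan(inf - inf)       # Inf - Inf is indeterminate, yielding NaN
assert nan != nan                  # NaN compares unequal even to itself
# Note: 1.0 / 0.0 raises ZeroDivisionError at the Python level,
# even though the underlying IEEE hardware would produce Inf.
```

The self-inequality of NaN is the standard idiom for detecting it when a function like math.isnan is unavailable.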