Morgan - Numerical Methods (523161), страница 21
Текст из файла (страница 21)
(Note that this differs from the normalization described earlierin the fixed-point routines, which involved coercing the number to a value greaterthan or equal to one-half and less than one. This results in a representation thatconsists of: a sign, a number in fixed-point notation between 1.0 and 2.0, and anexponent representing the number of shifts required. Mathematically, this can beexpressed as3for -125for -1,021128exponentexponent128 in single precision and1,024 in double precision.FLOATING-POINTARITHMETICSince the exponent is really the signed (int) log2 of the number you’re representing, only numbers greater than 2.0 or less than 1.0 have an exponent other than zeroand require any shifts at all.
Very small numbers (less than one) must be normalizedusing left shifts, while large numbers with or without fractional extensions requireright shifts. As the number is shifted to the right, the exponent is incremented; as thenumber is shifted to the left, the exponent is decremented.The IEEE 754 standard dictates that the exponent take a certain form for someerrors and for zero. For a Not a Number (NAN) and infinity, the exponent is all ones;for zero, it is all zeros. For this to be the case, the exponent is biased—127 bits forsingle precision, 1023 for double precision.Figure 4-l shows the components of a floating-point number: a single bitrepresenting the sign of the number (signed magnitude), an exponent (8 bits forsingle precision and 11 bits for double precision, and a mantissa (23 bits for singleprecision, 52 bits for double).Figure 4-1. Single and double-precision floating-point numbers.129NUMERICAL METHODSLet’s look at an example of a single-precision float.
The decimal number 14.92has the following binary fixed-point representation (the decimal point is shown forclarity):1110.11101011100001010010We need three right shifts to normalize it:1.11011101011100001010010 x 23We add 127D to the exponent to make it 130D (127 + 3):10000010BBecause this is a positive number, the sign bit is 0 (the bit in parentheses is the hiddenbit):0+10000010+(1)1101110l0lll0000l0l00l0Bor416eb852HThe expression of the fractional part of the number depends on the precisionused.
This example used 24 bits to conform to the number of bits in the singleprecision format. If the number had been converted from a 16-bit fixed-point word,the single-precision version would be 416eb000H. Note the loss of significance.Retrieving the fixed-point number from the float is simply a matter of extractingthe exponent, subtracting the bias, restoring the implied leading bit, and performingthe required number of shifts. The bandwidth of the single-precision float if fairlyhigh-approximately 3.8 db—so having a data type to handle this range wouldrequire more than 256 bits.
Therefore we need some restrictions on the size of thefixed-point value to which we can legally convert. (For more information, refer toChapter 5 .)130FLOATING-POINT ARITHMETICExtended PrecisionIf all floating-point computations were carried out in the precision dictated bythe format, calculations such as those required by a square-root routine, a polynomialevaluation, or an infinite series could quickly lose accuracy.
In some cases, theresults would be rendered meaningless. Therefore, IEEE 754 also specifies anextended format for use in intermediate calculations.3 This format increases both thenumber of significant bits and the exponent size. Single-precision extended isincreased from 24 significant bits to at least 32, with the exponent increased to at least11 bits. The number of significant bits for double precision can be greater than 79 andthe exponent equal to or greater than 15 bits.This extended precision is invisible to users, who benefit from added accuracyin their results. Those results are still in the standard single or double-precisionformat, necessitating a set of core routines (using extended precision) that aregenerally unavailable to the normal user.
Another set of routines is needed to convertstandard format into extended format and back into standard format, with the resultsrounded at the end of a calculation.The routines described later in this chapter take two forms. Some were written,for debugging purposes, to be called by a higher-level language (C); they expectsingle-precision input and return single-precision output.
They simply convert to andfrom single precision to extended format, passing and receiving arguments from thecore routines. These external routines have names that begin with fp_, such asfp_mul. The core routines operate only with extended precision and have namesbeginning with fl, such as flmul; these routines cannot be called from C.4The extended-precision level in these routines uses a quadword for simpleparameter passing and offers at least a 32-bit significand. This simplifies thetranslation from extended to standard format, but it affords less immunity to loss ofsignificance at the extremes of the single precision floating range than would agreater number of exponent bits. If your application requires a greater range, makethe exponent larger—15-bits is recommended-in the higher level routines beforepassing the values to the core routines.
This can actually simplify exponent handlingduring intermediate calculations.Use the core routines for as much of your work as you can; use the external131NUMERICAL METHODSroutines when the standard format is needed by a higher-level language or forcommunications with another device. An example of a routine, cylinder, that usesthese core routines to compute the volume of a cylinder appears in the moduleFPMATH.ASM, and many more appear in TRANS.ASM.The External RoutinesThis group includes the basic arithmetic procedures—fp_mul, fp_div, fp_add,and fp_sub. Written as an interface to C, they pass arguments and pointers on thestack and write the return values to static variables.In fp_add, two single-precision floating-point numbers and a pointer to the resultarrive on the stack.
Local variables of the correct precision are created for each of thefloats, and memory is reserved for the extended result of the core routines. Aquadword is reserved for each of the extended variables, including the return; thesingle-precision float is written starting at the second word, leaving the leastsignificant word for any extended bits that result from intermediate calculations.After the variables are cleared, the doubleword floats are written to them and thecore routine, fladd, is called. Upon return from fladd, the routine extracts the singleprecision float (part of the extended internal float) from the result variable, roundsit, and writes it out with the pointer passed from the calling routine.fp_add: Algorithm1.Allocate and clear storage for three quadwords, one for each operand andone for the extended-precision result.2.Align the 32-bit operands within the extendedvariables so that the leastsignificant byte is at the boundary of the second word.3.Invoke the core addition routine with both operands and a pointer to thequadword result.4.Invoke the rounding routine with the result of the previous operationand a pointer to that storage.5.Pull the 32-bit float out of the extended variable, write it to the staticvariable, and return.132FLOATING-POINT ARITHMETICfp-add: Listing; *****fp_addproclocalpushfcldxorleamovrep stoswleamovrep stoswleamovrep stoswlealeamovrep movswlealearepmovmovswinvokeinvokeleamovmovrep movswpopfretfp_add endpuses bx cx dx si di,fp0:dword, fpl:dword, rptr:wordflp0:qword, flpl:qword, result:qwordax,axdi,word ptr resultcx,4di,word ptr flp0cx,4di,word ptr flpl;clear variables for;the core routinecx,4si,word ptr fp0di,word ptr flp0[2]cx,2si,word ptr fpldi,word ptr flp1[2];align the floats;within the extended;variablescx,2fladd, flp0, flpl, addr resultround, result, addr resultsi, word ptr result[2];do the add;round the result;make it a standard floatdi,rptrcx,2133NUMERICAL METHODSThis interface is consistent throughout the external routines.
The prototypes forthese basic routines and fp_comp are:fp_add proto c fp0:dword,fp_sub proto c fp0:dword,fp_mu1 proto c fp0:dword,fp_div proto c fp0:dword,fp_camp proto c fp:dword,fpl:dword,fpl:dword,fpl:dword,fpl:dword,fpl:dwordrptr:wordrptr:wordrptr:wordrptr:wordFp_comp compares two floating-point values and returns a flag in AX specifying whether fp0 is greater than fpl (1), equal to fpl (0), or less than fpl (-1). Thecomparison assumes the number is rounded and does not include extended-precisionbits. (FPMATH.ASM contains the other support routines.)The Core RoutinesBecause these routines must prepare the operands and pass their arguments to theappropriate fixed-point functions, they’re a bit more complex and require moreexplanation.
They disassemble the floating-point number, extract the exponent, andalign the radix points to allow a fixed-point operation to take place. They then takethe results of these calculations and reconstruct the float.The basic arithmetic routines in this group include:flp0:qword, flpl:qword, rptr:wordfladdproto-flp0 is addend0; flpl is addend1flsubprotoflp0:qword, flpl:qword, rptr:word-flp0 is the minuend; flpl is the subtrahendflmulflp0:qword, flpl:qword, rptr:wordproto-flp0 is the multiplicand; flpl is the multiplierfldiv134protoflp0:qword, flpl:qword, rptr:word-flp0 is the dividend; flpl is the divisorFLOATING-POINT ARITHMETICFor pedagogical and portability reasons, these routines are consistent in terms ofhow they prepare the data passed to them.Briefly, each floating-point routine must do the following:1.Set up any variables required for the arguments that are passed and for theresults of the current computations.2.
Check for initial errors and unusual conditions.Division:divisor == zero: return divide by zero errordivisor == infinite: return zerodividend == zero: return infinity errordividend == infinite: return infinitedividend == divisor: return oneMultiplication:either operand == zero: return zeroeither operand == infinite: return infiniteSubtraction:minuend == zero: do two’s complement of subtrahendsubtrahend == zero: return minuend unchangedoperands cannot align: return largest with appropriate signAddition:either addend == zero: return the other addend unchangedoperands cannot align: return largest3. Get the signs of the operands.