Arndt - Algorithms for Programmers
The (linear) convolution of the two sequences a_k, b_k, k = 0 ... N−1 is defined as the sequence c where

    c_k := sum_{i,j=0; i+j=k} a_i b_j,    k = 0 ... 2N−2    (13.6)

A number written in radix¹ r as

    a_P a_{P−1} ... a_2 a_1 a_0 . a_{−1} a_{−2} ... a_{−p+1} a_{−p}    (13.7)

denotes a quantity of

    sum_{i=−p}^{P} a_i · r^i = a_P · r^P + a_{P−1} · r^{P−1} + ... + a_{−p} · r^{−p}    (13.8)

¹) For decimal numbers the radix is 10.

CHAPTER 13. ARITHMETICAL ALGORITHMS
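For illustration, the defining sum 13.6 translates directly into code; this is a quadratic-time sketch in Python (the function name is mine), not one of the book's routines:

```python
# Direct translation of the defining sum (13.6): the coefficient c_k of the
# linear convolution collects all products a_i * b_j with i + j = k.
# An O(N^2) sketch for illustration; the fast method below uses the FFT.
def convolution(a, b):
    n = len(a)
    c = [0] * (2 * n - 1)
    for i in range(n):
        for j in range(n):
            c[i + j] += a[i] * b[j]
    return c

print(convolution([1, 2], [3, 4]))  # [3, 10, 8]: (1 + 2x)(3 + 4x) = 3 + 10x + 8x^2
```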
That means, the digits can be considered as coefficients of a polynomial in r. For example, with decimal numbers one has r = 10 and 123.4 = 1·10^2 + 2·10^1 + 3·10^0 + 4·10^{−1}. The product of two numbers is almost the polynomial product

    sum_{k=0}^{2N−2} c_k r^k := ( sum_{i=0}^{N−1} a_i r^i ) · ( sum_{j=0}^{N−1} b_j r^j )    (13.9)

The c_k are found by comparing coefficients. One easily checks that the c_k must satisfy the convolution equation 13.6.

As the c_k can be greater than 'nine' (that is, r − 1), the result has to be 'fixed' using carry operations: go from right to left, replace c_k by c_k % r and add (c_k − c_k % r)/r to its left neighbor.

An example: usually one would multiply the numbers 82 and 34 as follows:

          8 2
      ×   3 4
      -------
      2 4 6
        3 2 8
      -------
      2 7 8 8

We just said that the carries can be delayed to the end of the computation:

           8    2
      ×    3    4
      -----------
      24    6
           32    8
      -----------
      24   38    8
... which is really polynomial multiplication (which in turn is a convolution of the coefficients):

    (8x + 2)  ×  (3x + 4)
    ---------------------
    24x^2 +  6x
            32x  +  8
    ---------------------
    24x^2 + 38x  +  8

Convolution can be done efficiently using the Fast Fourier Transform (FFT): convolution is a simple (elementwise array) multiplication in Fourier space. The FFT itself takes N · log N operations. Instead of the direct convolution (~ N^2) one proceeds like this:

• compute the FFTs of multiplicand and multiplicator
• multiply the transformed sequences elementwise
• compute the inverse transform of the product

To understand why this actually works, note that (1) the multiplication of two polynomials can be achieved by the (more complicated) scheme:

• evaluate both polynomials at sufficiently many² points
• pointwise multiply the values found
• find the polynomial corresponding to those (product-)values

²) At least one more point than the degree of the product polynomial c: deg c = deg a + deg b
and (2) that the FFT is an algorithm for the parallel evaluation of a given polynomial at many points, namely the roots of unity; and (3) that the inverse FFT is an algorithm to find (the coefficients of) a polynomial whose values are given at the roots of unity.

You might be surprised if you have always thought of the FFT as an algorithm for the 'decomposition into frequencies'. There is no problem with either of these notions.

Relaunching our example, we use the fourth roots of unity ±1 and ±i:

          a = (8x + 2)    b = (3x + 4)    c = a · b
    +1    +10             +7              +70
    +i    +8i + 2         +3i + 4         +38i − 16
    −1    −6              +1              −6
    −i    −8i + 2         −3i + 4         −38i − 16
                                          c = 24x^2 + 38x + 8

This table has to be read like this: first the given polynomials a and b are evaluated at the points given in the left column, thereby filling the columns below a and b. Then the values are multiplied to fill the column below c, giving the values of c at the points.
Finally, the actual polynomial c is found from those values, resulting in the lower right entry. You may find it instructive to verify that a 4-point FFT really evaluates a and b by transforming the sequences 0, 0, 8, 2 and 0, 0, 3, 4 by hand. The backward transform of 70, 38i − 16, −6, −38i − 16 should produce the final result given for c.

The operation count is dominated by that of the FFTs (the elementwise multiplication is of course ~ N), so the whole fast convolution algorithm takes ~ N · log N operations. The following carry operation is also ~ N and can therefore be neglected when counting operations.

Multiplying our million-digit numbers will now take only 10^6 · log2(10^6) ≈ 10^6 · 20 operations, taking approximately 2 seconds on a 10 Mips machine.

Strictly speaking, N · log N is not really the truth: it has to be N · log N · log log N.
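The three steps of the fast method can be retraced in code. Here is a sketch in Python with a minimal radix-2 FFT (the helper names are mine, and the coefficients are stored lowest digit first, unlike the table above); it multiplies 82 by 34 and then fixes the carries:

```python
import cmath

# A minimal radix-2 FFT (length must be a power of two): transform both
# digit sequences, multiply elementwise, transform back, fix the carries.
# A sketch for illustration, not the book's FFT.
def fft(v, sign=-1.0):
    n = len(v)
    if n == 1:
        return list(v)
    even = fft(v[0::2], sign)
    odd = fft(v[1::2], sign)
    t = [cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k] for k in range(n // 2)]
    return [even[k] + t[k] for k in range(n // 2)] + \
           [even[k] - t[k] for k in range(n // 2)]

a = [2, 8, 0, 0]                 # 82, least significant digit first,
b = [4, 3, 0, 0]                 # zero-padded to 4 points
fc = [x * y for x, y in zip(fft(a), fft(b))]
c = [v / 4 for v in fft(fc, sign=1.0)]      # inverse: opposite sign, scale 1/N
digits = [int(round(v.real)) for v in c]    # [8, 38, 24, 0] before the carries

# Carry pass, right to left: keep c_k % r, pass (c_k - c_k % r)/r leftwards.
r, carry, out = 10, 0, []
for d in digits:
    carry, low = divmod(d + carry, r)
    out.append(low)
print(out[::-1])                 # [2, 7, 8, 8], i.e. 82 * 34 = 2788
```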
This is because the sums in the convolutions have to be represented as exact integers. The biggest term C that can possibly occur is approximately N R^2 for a number with N digits (see next section). Therefore, working with some fixed radix R one has to do FFTs with log N bits of precision, leading to an operation count of N · log N · log N.
The slightly better N · log N · log log N is obtained by recursive use of FFT multiplies. For realistic applications (where the sums in the convolution all fit into the machine's floating-point type) it is safe to think of FFT multiplication as being proportional to N · log N.

For a survey of multiplication methods, some mathematical background and further references, see [37]. For how far the idea 'polynomials for numbers' can be carried and where it fails, see [38].

13.2.3 Radix/precision considerations with FFT multiplication

This section describes the dependencies between the radix of the number and the achievable precision when using FFT multiplication.
In what follows it is assumed that the 'super-digits', called LIMBs, occupy a 16-bit word in memory. Thereby the radix of the numbers can be in the range 2 ... 65536 (= 2^16). Further restrictions are due to the fact that the components of the convolution must be representable as integer numbers with the data type used for the FFTs (here: doubles): the cumulative sums c_k have to be represented precisely enough to distinguish every (integer) quantity from the next bigger (or smaller) value.
The highest possible value for a c_k will appear in the middle of the product, when both multiplicand and multiplier consist of 'nines' (that is, R − 1) only. It must not jump to c_m ± 1 due to numerical errors. For radix R and a precision of N LIMBs, let the maximal possible value be C; then

    C = N (R − 1)^2    (13.10)

The number of bits needed to represent C exactly is the integer greater than or equal to

    log2(N (R − 1)^2) = log2 N + 2 log2(R − 1)    (13.11)

Due to numerical errors there must be a few more bits for safety. If computations are made using doubles, one typically has a mantissa of 53 bits³; then we need to have

    M ≥ log2 N + 2 log2(R − 1) + S    (13.12)

where M := mantissa-bits and S := safety-bits. Using log2(R − 1) < log2(R):

    Nmax(R) = 2^(M − S − 2 log2(R))    (13.13)

Suppose we have M = 53 mantissa-bits and require S = 3 safety-bits.
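Bound 13.13 is easy to evaluate in code; a small sketch (the function and variable names are mine, not the book's):

```python
# Evaluating bound (13.13): with M mantissa bits and S safety bits the
# largest usable number of LIMBs for radix R is 2^(M - S - 2*log2(R)),
# which equals 2^(M - S) / R^2 for any radix R.
def nmax(r, m=53, s=3):
    # integer division keeps the bound conservative for non-power-of-two radices
    return 2 ** (m - s) // (r * r)

print(nmax(2 ** 16))  # 262144 = 256 k LIMBs
print(nmax(10 ** 4))  # 11258999, about 11 million LIMBs for radix 10,000
```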
With base 2 numbers one could use radix R = 2^16 for precisions up to a length of Nmax = 2^(53−3−2·16) = 256 k LIMBs. This corresponds to 4096 kilo bits, or 1024 kilo hex digits. For greater lengths smaller radices have to be used, according to the following table (extra horizontal line at the 16-bit limit for LIMBs):

    Radix R          max # LIMBs    max # hex digits    max # bits
    2^10 = 1024      1048576 k      2621440 k           10240 M
    2^11 = 2048      262144 k       720896 k            2816 M
    2^12 = 4096      65536 k        196608 k            768 M
    2^13 = 8192      16384 k        53248 k             208 M
    2^14 = 16384     4096 k         14336 k             56 M
    2^15 = 32768     1024 k         3840 k              15 M
    2^16 = 65536     256 k          1024 k              4 M
    ------------------------------------------------------------
    2^17 = 128 k     64 k           272 k               1062 k
    2^18 = 256 k     16 k           72 k                281 k
    2^19 = 512 k     4 k            19 k                74 k
    2^20 = 1 M       1 k            5 k                 19 k
    2^21 = 2 M       256            1300                5120

For decimal numbers:

    Radix R    max # LIMBs    max # digits    max # bits
    10^2       110 G          220 G           730 G
    10^3       1100 M         3300 M          11 G
    10^4       11 M           44 M            146 M
    10^5       110 k          550 k           1826 k
    10^6       1 k            6597            22 k
    10^7       11             77              255

Summarizing:

• For decimal digits and precisions up to 11 million LIMBs use radix 10,000 (corresponding to about 44 million decimal digits); for even greater precisions choose radix 1,000.

• For hexadecimal digits and precisions up to 256,000 LIMBs use radix 65,536 (corresponding to more than 1 million hexadecimal digits); for even greater precisions choose radix 4,096.

13.3 Division, square root and cube root

13.3.1 Division

The ordinary division algorithm is useless for numbers of extreme precision. Instead one replaces the division a/b by the multiplication of a with the inverse of b.
The inverse of b, 1/b, is computed by finding a starting approximation x_0 ≈ 1/b and then iterating

    x_{k+1} = x_k + x_k (1 − b x_k)    (13.14)

until the desired precision is reached. The convergence is quadratic (2nd order), which means that the number of correct digits is doubled with each step: if x_k = (1/b)(1 + e), then

    x_{k+1} = (1/b)(1 + e) + (1/b)(1 + e)(1 − b (1/b)(1 + e))    (13.15)
            = (1/b)(1 − e^2)    (13.16)

Moreover, each step needs only computations with twice the number of digits that were correct at its beginning.

³) Of which only the 52 least significant bits are physically present; the most significant bit is implied to be always set.
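Iteration 13.14 can be sketched with Python's integers standing in for fixed-point numbers (a toy illustration of the quadratic convergence, not the book's implementation):

```python
# The iteration (13.14) in fixed-point arithmetic: Python integers scaled
# by 10**PREC stand in for multi-LIMB numbers.
PREC = 32
ONE = 10 ** PREC

def step(b, x):
    # x_{k+1} = x_k + x_k * (1 - b*x_k), with products rescaled by ONE
    y = ONE - b * x // ONE
    return x + x * y // ONE

b = 31415926 * ONE // 10 ** 7    # b  = 3.1415926
x = 31 * ONE // 100              # x0 = 0.31, two correct digits
for _ in range(5):               # 2 -> 4 -> 8 -> 16 -> 32 correct digits
    x = step(b, x)
print(x)                         # 31830989... = 0.31830989... * 10**32
```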
Still better: the multiplication x_k (...) needs only to be done with half precision, as it computes the 'correcting' digits (which alter only the less significant half of the digits). Thus, at each step we have 1.5 multiplications of the 'current' precision. The total work⁴ amounts to

    1.5 · sum_{n=0}^{N} 1/2^n

which is less than 3 full-precision multiplications. Together with the final multiplication, a division costs as much as 4 multiplications. Another nice feature of the algorithm is that it is self-correcting. The following numerical example shows the first two steps of the computation⁵ of an inverse, starting from a two-digit initial approximation:

    b       := 3.1415926                                                (13.17)
    x_0     := 0.31          initial 2-digit approximation for 1/b      (13.18)
    b · x_0  = 3.141 · 0.3100 = 0.9737                                  (13.19)
    y_0     := 1.000 − b · x_0 = 0.02629                                (13.20)
    x_0 · y_0 = 0.3100 · 0.02629 = 0.0081(49)                           (13.21)
    x_1     := x_0 + x_0 · y_0 = 0.3100 + 0.0081 = 0.3181               (13.22)
    b · x_1  = 3.1415926 · 0.31810000 = 0.9993406                       (13.23)
    y_1     := 1.0000000 − b · x_1 = 0.0006594                          (13.24)
    x_1 · y_1 = 0.31810000 · 0.0006594 = 0.0002097(5500)                (13.25)
    x_2     := x_1 + x_1 · y_1 = 0.31810000 + 0.0002097 = 0.31830975    (13.26)

13.3.2 Square root extraction

Computing square roots is quite similar to division: first compute √d.