Arndt - Algorithms for Programmers
The operation count of the CRT implementation as given above is significantly better than that of a straightforward implementation (cf. the extended Euclidean algorithm).

CHAPTER 4. NUMBER THEORETIC TRANSFORMS (NTTS)

4.6 A modular multiplication technique

When implementing a mod class on a 32-bit machine the following trick can be useful: it allows easy multiplication of two integers a, b modulo m even if the product a · b does not fit into a machine integer (which is assumed to have some maximal value z − 1, where z = 2^k).

Let ⟨x⟩_y denote x modulo y, and ⌊x⌋ denote the integer part of x. For 0 ≤ a, b < m:

    a · b = ⌊a·b/m⌋ · m + ⟨a·b⟩_m                                  (4.24)

Rearranging and taking both sides modulo z > m:

    ⟨a·b − ⌊a·b/m⌋ · m⟩_z = ⟨⟨a·b⟩_m⟩_z                            (4.25)

where the rhs. equals ⟨a·b⟩_m because m < z. Hence

    ⟨a·b⟩_m = ⟨⟨a·b⟩_z − ⟨⌊a·b/m⌋ · m⟩_z⟩_z                        (4.26)

The expression on the rhs. can be translated into a few lines of C code. The code given here assumes that one has 64-bit integer types int64 (signed) and uint64 (unsigned) and a floating point type with a 64-bit mantissa, float64 (typically long double).

uint64 mul_mod(uint64 a, uint64 b, uint64 m)
{
    uint64 y = (uint64)((float64)a*(float64)b/m+(float64)1/2);  // floor(a*b/m)
    y = y * m;            // m*floor(a*b/m) mod z
    uint64 x = a * b;     // a*b mod z
    uint64 r = x - y;     // a*b mod z - m*floor(a*b/m) mod z
    if ( (int64)r < 0 )   // normalization needed ?
    {
        r = r + m;
        y = y - 1;        // (a*b)/m quotient, omit line if not needed
    }
    return r;             // (a*b)%m remainder
}

It uses the fact that integer multiplication computes the least significant bits of the result ⟨a·b⟩_z, whereas float multiplication computes the most significant bits of the result.
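The same routine can be sketched in portable modern C++ using the standard fixed-width types; the typedefs below stand in for the book's int64/uint64/float64, and the code assumes long double carries a 64-bit mantissa (true for x86 extended precision, not for every platform):

```cpp
#include <cstdint>

typedef uint64_t    uint64;
typedef int64_t     int64;
typedef long double float64;  // assumed to have a 64-bit mantissa

// Sketch of the mul_mod technique: the float path yields floor(a*b/m),
// the integer path yields a*b mod 2^64; their difference is a*b mod m.
uint64 mul_mod(uint64 a, uint64 b, uint64 m)
{
    uint64 y = (uint64)((float64)a * (float64)b / m + (float64)1/2);  // floor(a*b/m)
    y = y * m;                 // m*floor(a*b/m), taken mod z = 2^64
    uint64 x = a * b;          // a*b mod z
    uint64 r = x - y;          // candidate remainder, may be off by m
    if ( (int64)r < 0 )        // normalization needed ?
        r = r + m;
    return r;                  // (a*b) mod m
}
```

As the text states, this is valid for 0 ≤ a, b < m < 2^63.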
The above routine works if 0 ≤ a, b < m < 2^63 = z/2. The normalization is not necessary if m < 2^62 = z/4.

When working with a fixed modulus the division by m may be replaced by a multiplication with the inverse modulus, which only needs to be computed once. Precompute

    float64 i = (float64)1/m;

and replace the line

    uint64 y = (uint64)((float64)a*(float64)b/m+(float64)1/2);

by

    uint64 y = (uint64)((float64)a*(float64)b*i+(float64)1/2);

so that any division inside the routine is avoided. But beware: the routine then cannot be used for m ≥ 2^62; it very rarely fails for moduli of more than 62 bits. This is due to the additional error when inverting and multiplying as compared to dividing alone.

This trick is ascribed to Peter Montgomery.

TBD: Montgomery multiplication
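The fixed-modulus variant just described can be packaged so that the inverse is computed once per modulus; the class name and interface below are assumptions for illustration, not the book's code, and the same m < 2^62 restriction applies:

```cpp
#include <cstdint>

typedef uint64_t    uint64;
typedef int64_t     int64;
typedef long double float64;  // assumed 64-bit mantissa

// Sketch: multiplication modulo a fixed m using a precomputed 1/m,
// so no float division occurs in the hot path.  Valid for m < 2^62.
struct mod_mul
{
    uint64  m;
    float64 i;  // 1/m, computed once

    explicit mod_mul(uint64 mm) : m(mm), i((float64)1 / (float64)mm) {}

    uint64 operator()(uint64 a, uint64 b) const
    {
        uint64 y = (uint64)((float64)a * (float64)b * i + (float64)1/2);  // ~floor(a*b/m)
        uint64 r = a * b - y * m;   // both products taken mod 2^64
        if ( (int64)r < 0 ) r += m; // normalization
        return r;
    }
};
```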
4.7 Number theoretic Hartley transform *

Let r be an element of order n, i.e. r^n = 1 but there is no k < n so that r^k = 1; we would like to identify r with exp(2 i π/n). Then one can set

    cos(2π/n)   ≡ (r^2 + 1) / (2r)                                 (4.27)
    i sin(2π/n) ≡ (r^2 − 1) / (2r)                                 (4.28)

For this choice of sin and cos the relations exp() = cos() + i sin() and sin()^2 + cos()^2 = 1 should hold. The first check is trivial: (x^2+1)/(2x) + (x^2−1)/(2x) = x.
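Definitions (4.27) and (4.28) can also be tried out numerically in a small modular ring; the sketch below is mine, assuming p = 17 and r = 2, which has order n = 8 modulo 17. Since cos + (i sin) = (r^2+1)/(2r) + (r^2−1)/(2r) = r, this check does not even need an explicit i:

```cpp
#include <cstdint>

// Brute-force modular inverse -- fine for a tiny demonstration prime.
uint64_t inv_mod(uint64_t a, uint64_t p)
{
    for (uint64_t x = 1; x < p; ++x)
        if ( (a * x) % p == 1 ) return x;
    return 0;  // not reached for a != 0 mod a prime p
}

// Compute cos + (i sin) = (r^2+1)/(2r) + (r^2-1)/(2r) in Z/pZ;
// by (4.27), (4.28) this should come out equal to r itself.
uint64_t cos_plus_isin(uint64_t r, uint64_t p)
{
    uint64_t d  = inv_mod((2 * r) % p, p);          // 1/(2r) mod p
    uint64_t c  = ((r * r + 1) % p) * d % p;        // cos
    uint64_t is = ((r * r - 1) % p) * d % p;        // i*sin
    return (c + is) % p;
}
```

For r = 2, p = 17 one gets cos = 14 and i sin = 5, and indeed 14 + 5 ≡ 2 (mod 17).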
The second is also easy if we allow ourselves to write i for some element that is the square root of −1:

    ((x^2+1)/(2x))^2 + ((x^2−1)/(2xi))^2 = ((x^2+1)^2 − (x^2−1)^2) / (4x^2) = 1

Ok, but what is i in the modular ring? Simply r^(n/4); then we have i^2 = −1 and i^4 = 1 as we are used to. This is only true in cyclic rings.

Chapter 5

The Walsh transform and its relatives

How to make a Walsh transform out of your FFT: 'Replace exp(something) by 1, done.'

TBD: complex Walsh transform

5.1 The Walsh transform: Walsh-Kronecker basis

Removing all exp(something) from the radix-2, decimation in time Fourier transform we get

void slow_walsh_wak_dit2(double *f, ulong ldn)
// (this routine has a problem)
{
    ulong n = (1<<(ulong)ldn);
    for (ulong ldm=1; ldm<=ldn; ++ldm)
    {
        const ulong m = (1<<ldm);
        const ulong mh = (m>>1);
        for (ulong j=0; j<mh; ++j)
        {
            for (ulong r=0; r<n; r+=m)
            {
                const ulong t1 = r+j;
                const ulong t2 = t1+mh;
                double v = f[t2];
                double u = f[t1];
                f[t1] = u+v;
                f[t2] = u-v;
            }
        }
    }
}

The transform involves proportional n log2(n) additions (and subtractions) and no multiplication at all. The transform is its own inverse, so there is nothing like the is-parameter of the FFT procedures here. As the slow in the name suggests, the implementation has a problem as given.
The memory access is highly non-local. Let us make a slight improvement. Here we just took the code 1.4 and threw away all trigonometric computations (and multiplications). But the swapping of the inner loops, which we did for the FFT in order to save trigonometric computations, is now of no advantage anymore. So we try this piece of code:

template <typename Type>
void walsh_wak_dit2(Type *f, ulong ldn)

[The table of the 32 length-32 basis function patterns printed at this point is not reproduced.]
Figure 5.1: Basis functions for the Walsh transform (Walsh-Kronecker basis).
Asterisks denote the value +1, blank entries denote −1.

// transform wrt. to walsh-kronecker basis (wak-functions)
// decimation in time (DIT) algorithm
{
    ulong n = (1UL<<ldn);
    for (ulong ldm=1; ldm<=ldn; ++ldm)
    {
        const ulong m = (1UL<<ldm);
        const ulong mh = (m>>1);
        for (ulong r=0; r<n; r+=m)
        {
            ulong t1 = r;
            ulong t2 = r+mh;
            for (ulong j=0; j<mh; ++j, ++t1, ++t2)
            {
                Type u = f[t1];
                Type v = f[t2];
                f[t1] = u + v;
                f[t2] = u - v;
            }
        }
    }
}

[FXT: walsh wak dit2 in walsh/walshwak.h]

What performance impact can this innocent change in the code have? For large n it gave a speedup by a factor of more than three when run on a computer with a main memory clock of 66 Megahertz and a 5.5 times higher CPU clock of 366 Megahertz.
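As a small usage sketch (the driver below is mine, not from FXT): applying the transform twice multiplies the data by n, which is a handy correctness check for any implementation of the unnormalized Walsh transform.

```cpp
typedef unsigned long ulong;

// The DIT Walsh transform as given in the text.
template <typename Type>
void walsh_wak_dit2(Type *f, ulong ldn)
{
    const ulong n = (1UL << ldn);
    for (ulong ldm = 1; ldm <= ldn; ++ldm)
    {
        const ulong m = (1UL << ldm);
        const ulong mh = (m >> 1);
        for (ulong r = 0; r < n; r += m)
        {
            ulong t1 = r, t2 = r + mh;
            for (ulong j = 0; j < mh; ++j, ++t1, ++t2)
            {
                Type u = f[t1], v = f[t2];
                f[t1] = u + v;
                f[t2] = u - v;
            }
        }
    }
}

// Transform twice and verify that f turns into n times the original data;
// returns 0 on success, -1 on mismatch.
long walsh_twice_check()
{
    const ulong ldn = 3, n = 1UL << ldn;  // n = 8
    long f[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    long g[8];
    for (ulong i = 0; i < n; ++i) g[i] = f[i];
    walsh_wak_dit2(f, ldn);
    walsh_wak_dit2(f, ldn);
    for (ulong i = 0; i < n; ++i)
        if ( f[i] != (long)n * g[i] ) return -1;
    return 0;
}
```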
The equivalent code for the decimation in frequency algorithm is

template <typename Type>
void walsh_wak_dif2(Type *f, ulong ldn)
{
    const ulong n = (1UL<<ldn);
    for (ulong ldm=ldn; ldm>=1; --ldm)
    {
        const ulong m = (1UL<<ldm);
        const ulong mh = (m>>1);
        for (ulong r=0; r<n; r+=m)
        {
            ulong t1 = r;
            ulong t2 = r+mh;
            for (ulong j=0; j<mh; ++j, ++t1, ++t2)
            {
                Type u = f[t1];
                Type v = f[t2];
                f[t1] = u + v;
                f[t2] = u - v;
            }
        }
    }
}

[FXT: walsh wak dif2 in walsh/walshwak.h]

The Walsh transform of integer input is integral, cf.
section 6.6.

A function that computes the k-th base function of the transform is

template <typename Type>
void walsh_wak_basefunc(Type *f, ulong n, ulong k)
{
    for (ulong i=0; i<n; ++i)
    {
        ulong x = i & k;
        x = parity(x);
        f[i] = ( 0==x ? +1 : -1 );
    }
}

[FXT: walsh wak basefunc in walsh/walshbasefunc.h]

5.2 The Kronecker product

The length-2 Walsh transform can be seen to be equivalent to the multiplication of a 2-component vector by the matrix

    W2 = [ +1 +1 ]                                                 (5.1)
         [ +1 -1 ]

The length-4 Walsh transform corresponds to

    W4 = [ +1 +1 +1 +1 ]                                           (5.2)
         [ +1 -1 +1 -1 ]
         [ +1 +1 -1 -1 ]
         [ +1 -1 -1 +1 ]

One might be tempted to write

    W4 = [ +W2 +W2 ]                                               (5.3)
         [ +W2 -W2 ]

This idea can indeed be turned into a well-defined notation which turns out to be quite powerful when dealing with orthogonal transforms and their fast algorithms.
Let A be an m × n-matrix

    A = [ a_{0,0}      a_{0,1}      ...  a_{0,n-1}   ]             (5.4)
        [ a_{1,0}      a_{1,1}      ...  a_{1,n-1}   ]
        [ ...                                        ]
        [ a_{m-1,0}    a_{m-1,1}    ...  a_{m-1,n-1} ]

then the Kronecker product (or tensor product) with a matrix B is

    A ⊗ B := [ a_{0,0} B     a_{0,1} B     ...  a_{0,n-1} B   ]    (5.5)
             [ a_{1,0} B     a_{1,1} B     ...  a_{1,n-1} B   ]
             [ ...                                            ]
             [ a_{m-1,0} B   a_{m-1,1} B   ...  a_{m-1,n-1} B ]

There is no restriction on the dimensions of B.
If B is an r × s-matrix then the dimension of the resulting Kronecker product is mr × ns, and c_{k+ir, l+js} = a_{i,j} b_{k,l}.

For a scalar factor α the relations

    (αA) ⊗ B = α(A ⊗ B)                                            (5.6)
    A ⊗ (αB) = α(A ⊗ B)                                            (5.7)

should be immediate.

The Kronecker product is not commutative, that is, A ⊗ B ≠ B ⊗ A in general.
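The definition maps directly to code; the sketch below uses flat row-major arrays, and the function name and interface are assumptions for illustration. Applied to W2 with itself it reproduces W4 from (5.2):

```cpp
#include <vector>

// Kronecker product of an m x n matrix A and an r x s matrix B, both
// stored row-major; the result is (m*r) x (n*s) with
// c[i*r + k][j*s + l] = a[i][j] * b[k][l].
std::vector<long> kron(const std::vector<long>& a, int m, int n,
                       const std::vector<long>& b, int r, int s)
{
    std::vector<long> c((long)m * r * n * s);
    const int cols = n * s;  // number of columns of the result
    for (int i = 0; i < m; ++i)
        for (int j = 0; j < n; ++j)
            for (int k = 0; k < r; ++k)
                for (int l = 0; l < s; ++l)
                    c[(long)(i*r + k) * cols + (j*s + l)] = a[i*n + j] * b[k*s + l];
    return c;
}
```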
The following relations are the same as for the ordinary matrix product. Bilinearity (the matrices left and right of a plus sign must be of the same dimensions):

    (A + B) ⊗ C = A ⊗ C + B ⊗ C                                    (5.8)
    A ⊗ (B + C) = A ⊗ B + A ⊗ C                                    (5.9)

Associativity:

    A ⊗ (B ⊗ C) = (A ⊗ B) ⊗ C                                      (5.10)

The matrix product (indicated by a dot) of Kronecker products can be rewritten as

    (A ⊗ B) · (C ⊗ D) = (A · C) ⊗ (B · D)                          (5.11)
    (L1 ⊗ R1) · (L2 ⊗ R2) · ... · (Ln ⊗ Rn)
        = (L1 · L2 · ... · Ln) ⊗ (R1 · R2 · ... · Rn)              (5.12)
    (A · B) ⊗ (C · D) = (A ⊗ C) · (B ⊗ D)                          (5.13)
    (L1 · R1) ⊗ (L2 · R2) ⊗ ... ⊗ (Ln · Rn)
        = (L1 ⊗ L2 ⊗ ... ⊗ Ln) · (R1 ⊗ R2 ⊗ ... ⊗ Rn)              (5.14)

Here the matrices left and right of a dot must be compatible for ordinary matrix multiplication. One has

    (A ⊗ B)^T    = A^T ⊗ B^T                                       (5.15)
    (A ⊗ B)^{-1} = A^{-1} ⊗ B^{-1}                                 (5.16)

Back to the Walsh transform: we have W1 = [1] and for n = 2^k, n > 1:

    Wn = [ +W_{n/2}  +W_{n/2} ] = W2 ⊗ W_{n/2}                     (5.17)
         [ +W_{n/2}  -W_{n/2} ]

In order to see that this relation is the statement of a fast algorithm, split the (to be transformed) vector x into halves

    x = [ x0 ]                                                     (5.18)
        [ x1 ]

and write

    Wn x = [ W_{n/2} x0 + W_{n/2} x1 ] = [ W_{n/2} (x0 + x1) ]     (5.19)
           [ W_{n/2} x0 - W_{n/2} x1 ]   [ W_{n/2} (x0 - x1) ]

That is, a length-n transform can be computed by two length-n/2 transforms of the sum and difference of the first and second half of x.

Now when A = B in relation 5.16 we have (A ⊗ A)^{-1} = A^{-1} ⊗ A^{-1}, (A ⊗ A ⊗ A)^{-1} = A^{-1} ⊗ A^{-1} ⊗ A^{-1}, and so on.
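The splitting of relation (5.19) is itself a recursive algorithm. The sketch below is my code, not the book's (it allocates temporaries rather than working in place like the FXT routines), but it computes the same Walsh-Kronecker transform:

```cpp
#include <vector>
#include <cstddef>

// Recursive Walsh transform following
//   W_n x = [ W_{n/2}(x0 + x1) ; W_{n/2}(x0 - x1) ].
// The length of x must be a power of 2.
void walsh_recursive(std::vector<long>& x)
{
    const std::size_t n = x.size();
    if ( n <= 1 ) return;
    const std::size_t nh = n / 2;
    std::vector<long> s(nh), d(nh);
    for (std::size_t i = 0; i < nh; ++i)
    {
        s[i] = x[i] + x[nh + i];  // x0 + x1
        d[i] = x[i] - x[nh + i];  // x0 - x1
    }
    walsh_recursive(s);           // length-n/2 transform of the sum
    walsh_recursive(d);           // length-n/2 transform of the difference
    for (std::size_t i = 0; i < nh; ++i)
    {
        x[i]      = s[i];
        x[nh + i] = d[i];
    }
}
```

For x = (1, 2, 3, 4) this yields W4 x = (10, −2, −4, 0), in agreement with the matrix W4 of (5.2).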
Using a notation equivalent to the sum (or product) sign this is

    ( ⊗_{k=1}^{n} A )^{-1} = ⊗_{k=1}^{n} A^{-1}                    (5.20)

Using the notation for the Walsh transform

    Wn = ⊗_{k=1}^{log2(n)} W2                                      (5.21)

(where the empty product equals [1] = W1) and

    Wn^{-1} = ⊗_{k=1}^{log2(n)} W2^{-1}                            (5.22)

The latter relation isn't that exciting, as W2^{-1} = W2 (up to normalization); however, it obviously also holds when the inverse transform is different from the forward transform. Thereby, given a fast algorithm for some transform in the form of a Kronecker product, the fast algorithm for the backward transform is immediate.

The direct sum of two matrices is defined as

    A ⊕ B := [ A 0 ]                                               (5.23)
             [ 0 B ]

We have

    ⊕_{k=1}^{n} A := I_n ⊗ A                                       (5.24)

where I_n is the n × n identity matrix.

5.3 Computing the Walsh transform faster

All operations necessary for the Walsh transform are cheap: loads, stores, additions and subtractions. The memory access pattern is a major concern with direct mapped cache, as we have verified comparing the first two implementations in this chapter.