Arndt - Algorithms for Programmers (523138), страница 7
Текст из файла (страница 7)
THE FOURIER TRANSFORM// f[i2] = ti - f1i; // im low// f[i4] = ti + f1i; // im hi// =^=sumdiff(ti, f1i, f[i4], f[i2]);}}sumdiff(f[0], f[1]);if ( nh>=2 ) { f[nh] *= 2.0; f[nh+1] *= 2.0; }fht_fft((Complex *)f, ldn-1, -1);if ( is<0 ) reverse_nh(f, n);[FXT: wrap real complex fft in realfft/realfftwrap.cc][FXT: wrap complex real fft in realfft/realfftwrap.cc]1.9.2Real valued split radix Fourier transformsReal to complex SRFTCode 1.11 (split radix R2CFT) Pseudo code for the split radix R2CFT algorithmprocedure r2cft_splitradix_dit(x[],ldn){n := 2**ldnix := 1;id := 4;do{i0 := ix-1while i0<n{i1 := i0 + 1{x[i0], x[i1]} := {x[i0]+x[i1], x[i0]-x[i1]}i0 := i0 + id}ix := 2*id-1id := 4 * id}while ix<nn2 := 2nn := n/4while nn!=0{ix := 0n2 := 2*n2id := 2*n2n4 := n2/4n8 := n2/8do // ix loop{i0 := ixwhile i0<n{i1 := i0i2 := i1 + n4i3 := i2 + n4i4 := i3 + n4{t1, x[i4]} := {x[i4]+x[i3], x[i4]-x[i3]}{x[i1], x[i3]} := {x[i1]+t1, x[i1]-t1}if n4!=1{i1 := i1 + n8i2 := i2 + n8i3 := i3 + n8i4 := i4 + n8t1 := (x[i3]+x[i4]) * sqrt(1/2)t2 := (x[i3]-x[i4]) * sqrt(1/2){x[i4], x[i3]} := {x[i2]-t1, -x[i2]-t1}{x[i1], x[i2]} := {x[i1]+t2, x[i1]-t2}}31CHAPTER 1.
THE FOURIER TRANSFORMi0 := i0 + id}ix := 2*id - n2id := 2*id}}}while ix<ne := 2.0*PI/n2a := efor j:=2 to n8{cc1 := cos(a)ss1 := sin(a)cc3 := cos(3*a) // == 4*cc1*(cc1*cc1-0.75)ss3 := sin(3*a) // == 4*ss1*(0.75-ss1*ss1)a := j*eix := 0id := 2*n2do // ix-loop{i0 := ixwhile i0<n{i1 := i0 + j - 1i2 := i1 + n4i3 := i2 + n4i4 := i3 + n4i5 := i0 + n4 - j + 1i6 := i5 + n4i7 := i6 + n4i8 := i7 + n4// complex mult: (t2,t1) := (x[i7],x[i3]) * (cc1,ss1)t1 := x[i3]*cc1 + x[i7]*ss1t2 := x[i7]*cc1 - x[i3]*ss1// complex mult: (t4,t3) := (x[i8],x[i4]) * (cc3,ss3)t3 := x[i4]*cc3 + x[i8]*ss3t4 := x[i8]*cc3 - x[i4]*ss3t5 := t1 + t3t6 := t2 + t4t3 := t1 - t3t4 := t2 - t4{t2, x[i3]} := {t6+x[i6], t6-x[i6]}x[i8] := t2{t2,x[i7]} := {x[i2]-t3, -x[i2]-t3}x[i4] := t2{t1, x[i6]} := {x[i1]+t5, x[i1]-t5}x[i1] := t1{t1, x[i5]} := {x[i5]+t4, x[i5]-t4}x[i2] := t1i0 := i0 + id}ix := 2*id - n2id := 2*id}while ix<n}nn := nn/2[FXT: split radix real complex fft in realfft/realfftsplitradix.cc]Complex to real SRFTCode 1.12 (split radix C2RFT) Pseudo code for the split radix C2RFT algorithmprocedure c2rft_splitradix_dif(x[],ldn)32CHAPTER 1.
THE FOURIER TRANSFORM{n := 2**ldnn2 := n/2nn := n/4while nn!=0{ix := 0id := n2n2 := n2/2n4 := n2/4n8 := n2/8do // ix loop{i0 := ixwhile i0<n{i1 := i0i2 := i1 + n4i3 := i2 + n4i4 := i3 + n4{x[i1], t1} := {x[i1]+x[i3], x[i1]-x[i3]}x[i2] := 2*x[i2]x[i4] := 2*x[i4]{x[i3], x[i4]} := {t1+x[i4], t1-x[i4]}if n4!=1{i1 := i1 + n8i2 := i2 + n8i3 := i3 + n8i4 := i4 + n8{x[i1], t1} := {x[i2]+x[i1], x[i2]-x[i1]}{t2, x[i2]} := {x[i4]+x[i3], x[i4]-x[i3]}x[i3] := -sqrt(2)*(t2+t1)x[i4] := sqrt(2)*(t1-t2)}i0 := i0 + id}ix := 2*id - n2id := 2*id}while ix<ne := 2.0*PI/n2a := efor j:=2 to n8{cc1 := cos(a)ss1 := sin(a)cc3 := cos(3*a) // == 4*cc1*(cc1*cc1-0.75)ss3 := sin(3*a) // == 4*ss1*(0.75-ss1*ss1)a := j*eix := 0id := 2*n2do // ix-loop{i0 := ixwhile i0<n{i1 := i0 + j - 1i2 := i1 + n4i3 := i2 + n4i4 := i3 + n4i5 := i0 + n4 - j + 1i6 := i5 + n4i7 := i6 + n4i8 := i7 + n4{x[i1], t1} := {x[i1]+x[i6], x[i1]-x[i6]}{x[i5], t2} := {x[i5]+x[i2], x[i5]-x[i2]}{t3, x[i6]} := {x[i8]+x[i3], x[i8]-x[i3]}{t4, x[i2]} := {x[i4]+x[i7], x[i4]-x[i7]}{t1, t5} := {t1+t4, t1-t4}33CHAPTER 1.
THE FOURIER TRANSFORM34{t2, t4} := {t2+t3, t2-t3}// complex mult: (x[i7],x[i3]) := (t5,t4)x[i3] := t5*cc1 + t4*ss1x[i7] := -t4*cc1 + t5*ss1// complex mult: (x[i4],x[i8]) := (t1,t2)x[i4] := t1*cc3 - t2*ss3x[i8] := t2*cc3 + t1*ss3i0 := i0 + id* (ss1,cc1)* (cc3,ss3)}ix := 2*id - n2id := 2*id}while ix<n}nn := nn/2}}ix := 1;id := 4;do{i0 := ix-1while i0<n{i1 := i0 + 1{x[i0], x[i1]} := {x[i0]+x[i1], x[i0]-x[i1]}i0 := i0 + id}ix := 2*id-1id := 4 * id}while ix<n[FXT: split radix complex real fft in realfft/realfftsplitradix.cc]See [29].1.10Multidimensional FTs1.10.1DefinitionLet ax,y (x = 0, 1, 2, .
. . , C − 1 and y = 0, 1, 2, . . . , R − 1) be a 2-dimensional array of data7 . Its 2dimensional Fourier transform ck,h is defined by:cck,h=:=F [a]1√n(1.94)C−1X R−1Xax,y z x k+y hwherez = e± 2 π i/n ,n = RC(1.95)x=0 x=0Its inverse isaax= F −1 [c]=1√nC−1X R−1X(1.96)ck,h z −(x k+y h)(1.97)k=0 h=0For a m-dimensional array a~x (~x = (x1 , x2 , x3 , . . . , xm ), xi ∈ 0, 1, 2, . . . , Si ) the m-dimensional Fourier7 Imaginea R × C matrix of R rows (of length C) and C columns (of length R).CHAPTER 1.
THE FOURIER TRANSFORM35transform c~k (~k = (k1 , k2 , k3 , . . . , km ), ki ∈ 0, 1, 2, . . . , Si ) is defined asc~k:=S1 −1 SXSX2 −1m −11 X~√...a~x z ~x.kn x =0 x =0x =0=~S1 X~√a~x z ~x.kn12wherez = e± 2 π i/n ,n = S1 S2 . . . Sm(1.98)m~ = (S1 − 1, S2 − 1, . . . , Sm − 1)Twhere S(1.99)~x=~0The inverse transform is again the one with the minus in the exponent of z.1.10.2The row column algorithmThe equation of the definition of the two dimensional FT (1.94) can be recast asck,h:=C−1R−11 X xk X√zax,y z y hn x=0x=0(1.100)which shows that the 2-dimensional FT can be accomplished by using 1-dimensional FTs to transformfirst the rows and then the columns8 .
This leads us directly to the row column algorithm:Code 1.13 (row column FFT) Compute the two dimensional FT of a[][] using the row columnmethodprocedure rowcol_ft(a[][], R, C){complex a[R][C] // R (length-C) rows, C (length-R) columnsfor r:=0 to R-1 // FFT rows{fft(a[r][], C, is)}complex t[R]// scratch array for columnsfor c:=0 to C-1 // FFT columns{copy a[0,1,...,R-1][c] to t[] // get columnfft(t[], R, is)copy t[] to a[0,1,...,R-1][c] // write back column}}Here it is assumed that the rows lie in contiguous memory (as in the C language).
[FXT: twodim fft inndimfft/twodimfft.cc]Transposing the array before the column pass in order to avoid the copying of the columns to extrascratch space will do good for the performance in most cases. The transposing back at the end of theroutine can be avoided if a back-transform will follow9 , the back-transform must then be called with Rand C swapped.The generalization to higher dimensions is straight forward.
[FXT: ndim fft in ndimfft/ndimfft.cc]1.11The matrix Fourier algorithm (MFA)The matrix Fourier algorithm10 (MFA) works for (composite) data lengths n = R C. Consider the inputarray as a R × C-matrix (R rows, C columns).8 orthe rows first, then the columns, the result is the sametypical for convolution etc.10 A variant of the MFA is called ‘four step FFT’ in [50].9 asCHAPTER 1. THE FOURIER TRANSFORM36Idea 1.9 (matrix Fourier algorithm) The matrix Fourier algorithm (MFA) for the FFT:1. Apply a (length R) FFT on each column.2. Multiply each matrix element (index r, c) by exp(±2 π i r c/n) (sign is that of the transform).3. Apply a (length C) FFT on each row.4.
Transpose the matrix.Note the elegance!It is trivial to rewrite the MFA as theIdea 1.10 (transposed matrix Fourier algorithm) The(TMFA) for the FFT:transposedmatrixFourieralgorithm1. Transpose the matrix.2. Apply a (length C) FFT on each column (transposed row).3. Multiply each matrix element (index r, c) by exp(±2 π i r c/n).4.
Apply a (length R) FFT on each row (transposed column).TBD: MFA = radix-sqrt(n) DIF/DIT FFTFFT algorithms are usually very memory nonlocal, i.e. the data is accessed in strides with large skips (asopposed to e.g. in unit strides). In radix 2 (or 2n ) algorithms one even has skips of powers of 2, which isparticularly bad on computer systems that use direct mapped cache memory: One piece of cache memoryis responsible for caching addresses that lie apart by some power of 2. TBD: move cache discussion toappendix With an ‘usual’ FFT algorithm one gets 100% cache misses and therefore a memory performancethat corresponds to the access time of the main memory, which is very long compared to the clock ofmodern CPUs.
The matrix Fourier algorithm has a much better memory locality (cf. [50]), because thework is done in the short FFTs over the rows and columns.For the reason given above the computation of the column FFTs should not be done in-place. One caninsert additional transpositions in the algorithm to have the columns lie in contiguous memory when theyare worked upon. The easy way is to use an additional scratch space for the column FFTs, then only thecopying from and to the scratch space will be slow. If one interleaves the copying back with the exp()multiplications (to let the CPU do some work during the wait for the memory access) the performanceshould be ok.
Moreover, one can insert small offsets (a few unused memory words) at the end of each rowin order to avoid the cache miss problem almost completely. Then one should also program a procedurethat does a ‘mass production’ variant of the column FFTs, i.e. for doing computation for all rows at once.√It is usually a good idea to use factors of the data length n that are close to n.
Of course one canapply the same algorithm for the row (or column) FFTs again: It can be a good idea to split n into 3factors (as close to n1/3 as possible) if a length-n1/3 FFT fits completely into the second level cache (oreven the first level cache) of the computer used. Especially for systems where CPU clock is much higherthan memory clock the performance may increase drastically, a performance factor of two (even whencompared to else very good optimized FFTs) can be observed.1.12Automatic generation of FFT codesFFT generators are programs that output FFT routines, usually for fixed (short) lengths.
In fact thethoughts here a not at all restricted to FFT codes, but FFTs and several unroll-able routines like matrixmultiplications and convolutions are prime candidates for automated generation. Writing such a programCHAPTER 1. THE FOURIER TRANSFORM37is easy: Take an existing FFT and change all computations into print statements that emit the necessarycode. The process, however, is less than delightful and error-prone.It would be much better to have another program that takes the existing FFT code as input and emitthe code for the generator. Let us call this a meta-generator. Implementing such a meta-generator ofcourse is highly nontrivial.