pcxx_ug (1158314), страница 9
Текст из файла (страница 9)
We will build our rst version out of matrices which are dened as distributed arraysof a simple class to represent a single matrix element.class Double{public:double value;Double(){ value = 0; }Double(double x){ value = x;}Double &operator *(Double &x){ return Double(value*x.value); }Double &operator +=(Double &x){ value += x.value; return *this;}};The matrix collection, shown below, builds a matrix sized according to the dimension inthe alignment object.
The idea behind the matrix-matrix algorithm in computingP = M * Nis to have each element of the matrix P grab the right row of M and the right column of N andform the dot product to obtain its value. Consequently, the operation is performed using\method of element" parallelismCollection Matrix: DistributedArray{// a constructor is needed.Matrix(Distribution *T, Align *A): DistributedArray(T, A){}MethodOfElement:void matMul(Matrix<ElementType> &M, Matrix<ElementType> &N){int myrow = this->index1; // first index from DistributedArrayint mycol = this->index2; // second index from DistributedArrayint maxcol = M.dim2size;// should be your "n" parameterif(maxcol != N.dim1size) err("col dim of M must = row dim of N");value = 0;for (int j; j < maxCol; j++)*this += M(myrow, j) * N(j, mycol); // Get_Element implicit}39The complexity of this code should be compared with the message passing matMul() in theSPMD programming (section 3.2.4).
Note that these two programs implement the samebasic algorithm, the version above is much simpler. In addition, the message passing versionwas tied to the way the data was distributed to memory. In this version the distributionand alignment of the data are completely independent of the the code. In the main programbelow we have used a distribution and alignment scheme that places the rows of the secondmatrix along the columns of the rst. Consequently, no communication takes place if thematrices are square.
However, we emphasize that the code will run no matter how the datais aligned.Processor_Main(){Processor P;Distribution T(m, p, &P, BLOCK, BLOCK);Align A(m, n, "[ALIGN( domain[i][j], T[i][j])]" );Align B(n, p, "[ALIGN( domain[i][j], T[j][i])]" );Matrix<Double> M(&T, &A, m, n);Matrix<Double> N(&T, &B, n, p);Matrix<Double> P(&T, &A, m, p);P.matMul( M, N);}One problem with this program is that it would run like a dog. There are too manyfunction calls for each multiply-add. Here is how to make it y. Let's replace Double as theelement type with#define N 100class BlockDouble{double data[N][N];BlockDouble & operator *(BlockDouble &x){call a linpack BLAS-3 dense matrix-matrix multiplytuned for the architecture}BlockDouble & operator +=(BlockDouble &x){call the BLAS-3 dense matrix accumulate.}};Now by replacing \Double" by \BlockDouble" in the Processor_Main() and you have aserious blocked matrix-matrix multiply that will do ( 3) ops for ( 2 ) data moves.This can be very fast.O N40O N5.2 Example: Batcher's Bitonic Merge Sort.
Ver 1.0+Another classic example is the Batcher Bitonic Merge Sorting algorithm. A bitonic sequenceis a string of values that can be seen as the concatenation of an increasing sequence followedby a decreasing sequence. The Botonic merge operates by dividing a bitonic sequence oflength k into two sequences of length k/2. The item at position i is compared to the item atposition i+k/2 and the smaller is kept at position i and the large is kept at position i+k/2.The results is two bitonic sequences, each of length k/2, with the property that every elementof the left hand sequence is less than every element of the right hand sequence. (See Knuthfor a proof.) We then recursively merge the two smaller sequences and we will eventuallysee the entire sequence sorted.
But to make this work you need a bitonic sequence to startwith. The way you make this happen is to start with sorted sequences of lenght r withone increasing and the other decreasing. When concatenated, they form a bitonic sequenceof length 2r. Doing this with r=1, r=2, r=4 and so on you can build up larger bitonicsequences.The code below works with collections of element where each element is a sub-sequenceof smaller objects.
The subsequences will be called Pvector objects, which as shown below,contain a vector of simple objects with an integer key.class E{public:int key;};class Pvector{public:E part[BLKSIZE];int sortdir;void Qsort(int dir);};The Pvector class contains a member function Qsort which is used to sort the local vectorof object in the direction indicated by an integer ag. There is also a eld sordir which isused to record the last direction the list was sorted. Qsort is implemented with the standardC library quicksort.
(See the TestSuite directory for the complete source code.)The collection of Pvector objects is based on the DistributedArray collection library assnown below.#include "kernel.h"#include "distarray.h"Collection SortedList: DistributedArray{public:SortedList(Distribution *D, Align *A);41void merge(int distance);void sort();MethodOfElement:E tmp[BLKSIZE];virtual E part[BLKSIZE];int savedir;virtual void Qsort(int dir);int cond(int a, int b, int v1, int v2);void localSort(int flg);void grabFrom(int dist);void localMerge(int dist, int flg);void localCheck();// needed for dist. array. not used here.double Dot(ElementType &arg1){ return 0.0; }ElementType &operator +=(ElementType &rhs){ return *this; }};Within the TEClass thread functions we have the two main components of the sortingalgorithm sort() and merge() which accomplish the bitonic merge and bitonic sort.
Sortis the toplevel function. Its main job is to set up a sequence of bitonic merges until theentire list is merged into one sorted list. This function begins by calling an element functionlocalSort() which calls Qsort on each element. Local sort is designed so that if it is calledwith its parameter equal to 1, then it does the initial sort so that even indexed element aresorted in increasing order and odd numbers elements are sorted in decreasing order.
Thismeans that we have established bitonic sequences consisting of pairs of elements consistingof one even element and the following odd numbered element.void SortedList::localSort(int flg){if(flg)Qsort(Ident % 2, 0, BLKSIZE-1);elseQsort(!savedir, 0, BLKSIZE-1);}The sort function is then very simple.void SortedList::sort(){int i, p;this->localSort(1);for(i = 1; i < dim1size; i = 2*i)Barrier();};merge(i);The merge(i) function is as described above. It is designed to merge bitonic sequences oflength 2*i performing the exchange operation of distance i and then recursively merging theresulting subsequences.42void SortedList::merge(int dist){int flg;flg = 1;while(dist > 0){this->grabFrom(dist);this->localMerge(dist, flg);dist = dist/2;flg = 0;};this->localSort(0);}There are two parallel element functions invocations called in merge().
The rst of thisis a function grabFrom(dist) which, for each element, extracts a copy of the element adistance dist away into a local buer tmp. Note that this copy of the data to a second arraydisqualies this implementation from being a classical \in place" sort. We will return to thisissue later. This is the function that does all the inter-element communication. Note thatdepending on which stage one is in the algorithm the sign of the oset alternates. That is,in the rst stage even elements access the element to the right and odd elements access theelement to the left.
In the second stage, elements 0 and 1 access elements 2 and 3 respectivelyand visa versa.void SortedList::grabFrom(int dist){int v,i, offset;E *T;void * addr1, *addr2;v = (Ident/dist) % 2;if(v ) offset = -dist;elseoffset = dist;T = &((*ThisCollection)(Ident+offset)->part[0]);for(i = 0; i < BLKSIZE; i++) tmp[i] = T[i];}The localMerge() algorithm selects the values from the tmp array to keep in the local array.The most complex part of the Bitonic merge algorithm is keeping track of where one shouldgenerate an ascending sequence and where one must generate a descending sequence. Duringthe initial ascending merge operations, the order must reverse at each step.
While, for eachrecursive \sub-merge" the sordirection must be the same as on the previous step. The localeld savedir records the previous direction of the merge and the parameter flg in the localmerge routine below is used to distinguish the initial merge for a give value of dist.void SortedList::localMerge(int dist, int flg){int v, v2, i;if(flg) savedir = v2 = (Ident/(2*dist)) % 2;43else v2 = savedir;v = (Ident/dist) % 2;for(i = 0; i < BLKSIZE; i++){if(v == v2) && (part[i].key > tmp[i].key)) part[i] = tmp[i];if(v != v2) && (part[i].key <= tmp[i].key)) part[i] = tmp[i];}};The main program for this example is given below.void Processor_Main(){Processors P;Distribution T(SIZE,&P,BLOCK);Align A(SIZE,"[ALIGN(V[i],T[i])]");SortedList<Pvector> X(&T,&A);int i, n, error;double temps;n = SIZE*BLKSIZE;X.sort();}This program will sort 106 items in about 2.5 seconds on a 256 node CM-5 or Paragon.
Thestandard unix C library qsort on a sparc 10 will complete the task in 25 seconds. Runningthe direct qsort on a single node of the CM-5 requires 104 seconds. Conseqently we see aspeed-up of only about 40. This is cleary not the best that can be done. Part of the problemis the complexity of the Batcher Bitonic Sort ( ( 2( )) vs. ( ( ).O nlognO nlog n5.3 C++ and HPF Style Fortran. Ver 2.0One of the principal roles played by C++ in programming massively parallel computers is inthe design of distributed data structures that are either too complex or simply inconvenientto program in Fortran. However, the mainstay of scientic and engineering programmingis Fortran. The reasons for this are more than cultural. The restrictions in the Fortrantype-system prohibit the use of pointers beyond the simple alias mechanisms allowed byFortran90.