pC++SciPro (1158313), страница 4
Текст из файла (страница 4)
Similarly, one can useM [i : m : k; s : t : r] in any expression where one can use M .The core distributed array functions include: ElemenType *operator()(int i,...) returns a pointer to either the (ith:::)element if it is local to the executing processor, or a pointer to a buer containinga copy of the (ith :::) element if it is non-local. DistributedArray<ElementType> &saxpy(a, x, y) does pointwise scalar element a times array x plus array y assigned to this array.
Other standard BLAS-1Vector operators are supported. void Shift(int dim, int distance) parallel shifts an array by the given distance in the given direction (with end around wrap). void Broadcast() broadcasts the value in location (0; 0; ::) to the entire array. void BroadcastDim(int I) broadcasts the values along dimension I where I =1; 2; ::. void Reduce() does a parallel reduction of the array leaving the result in location (0; 0; ::).
void ReduceDim(int I) does a reduction along dimension I where I = 1; 2; ::and leaves the result in the orthogonal hyperplane containing (0; 0; :::).All operations that require element "arithmetic" assume that the base elementclass has operators for basic arithmetic and some cast operators that will allow a realnumber to be case in to the element type.As mentioned previously, if the operators like +, =, -, *, /, etc. are dened forthe array element type then the collection level operations are also well dened.5.1.1The Matrix Collection ClassesMany applications use arrays are used to represent matrices, tensors and vectors.
Tosimplify programming, we have dened two types of distributed matrix and vectorcollections.The DistributedMatrix and DistributedVector classes are derived from the DistributedArray collection class. Their functionality provides access to well tested14and tuned parallel linear algebra operations. More specically, these collections provide BLAS-3 level operators for distributed arrays that represent matrices and vectors.For example: for DistributedMatrix we have DistributedMatrix &operator *(Distributed Matrix &) gives the matrixproduct rather than the pointwise array product. DistributedVector &operator *(Distributed Vector &) is the matrix vector product.
DistributedVector &transProd(Distributed Vector &) is the matrix vector product using the transpose of this matrix.A larger project to build a distributed LAPACK project is underway. This workwill be based on adapting the SCALAPACK library being developed as part of aconsortium involving Berkeley, Illinois, Oak Ridge and Tennessee. In addition we willwork with the Illinois Splib project.5.1.2Blocked Matrices and VectorsA Blocked Distributed Matrix DistBlkMatrix is much like a DistBlkVector.
The keyidea is the observation that many matrix computations on parallel machines partitiona large matrix into an array of submatricies, where each submatrix is located on oneprocessor. Many matrix algebra operations for large matrices can be decomposedinto operations on smaller matrices, where the elements of the smaller matrix aresubmatrices of the original. For example, given a class Matrix with all the standardproperties of algebraic matrices overloaded, we can create a q by q distributed matrixof matrices, each of size p by p. The operationProcessor P(q*q);Template T(q, q, &P, Block, Block);Align A(q, q, "[ALIGN( domain[i][j], T[i][j])]" );DistributedMatrix< Matrix > M(&T, &A, p,p);DistributedMatrix< Matrix > N(&T, &A, p,p);M = M*N;is mathematically equivalent to standard matrix multiplication on objects declared as:Processor P(q*q);Template T(p*q, p*q, &P, Block, Block);Align A(p*q, p*q, "[ALIGN( domain[i][j], T[i][j])]" );DistributedMatrix< float > m(&T, &A);DistributedMatrix< float > n(&T, &A);m = m*n;In the rst case there is one element per processor and it is a p by p matrix.
In thesecond case there is a p by p submatrix assigned to each processor. The advantage ofthe rst scheme is that the basic matrix element operation can come from a library welltuned for the individual processor while in the second case it is up to the compiler toautomatically block the resulting code in an ecient manner. (A research project is inprogress to make the transformation from the second form to the rst automatically.)15The only disadvantage of the Matrix-of-Matrices solution is that access to an individual element can be more dicult.
To fetch the (i; j ) element of the matrix m oneuses m(i; j ). To fetch the same value from M requiresM(i/p, j/p)(i%p, j%p);To simplify this task of translating the index and sub-array notation from the mathematical level to the matrix-of-matrix references, we have introduced the collectionDistBlkMatrix. The matrix corresponding to M would be written asProcessor P(q*q);Template T(q, q, &P, Block, Block);Align A(q, q, "[ALIGN( domain[i][j], T[i][j])]" );n = p*q; // the size of the mathematical matrixDistBlkMatrix< Matrix > M(&T, &A, n,n);In this case a reference to the value of an individual element is given by M.get(i,j) orM.put(i,j, value).
Columns can be accessed with the function M.col(int i) whichreturns a DistBlkVector which is dened in terms ofDistributedVector<Vector>for some uniprocessor vector class. Similarly for rows and diagonals. Multiplicationof DistBlkMatrix<Matrix> by DistBlkVector<Vector> works as expected so long asthe Matrix*Vector element operations are well dened.Another reason for using a matrix-of-matrices structure is for distributing sparsematrices or block structured matrices. For example, if one has a class SparseMatrixthat works with the class Vector, then the collectionDistBlkMatrix<SparseMatrix> S(....);DistBlkVector<Vector> X(..), Y(...);Y = S*X;can be a handy way to construct a collection that has the global structure of a sparsematrix but still allows simple algebraic operators to be applied at the source level. Forexample, this is used in the pC++ implementation of the NAS sparse CG benchmark[10].Along these lines, other useful collections for sparse structured matrices would beDistBandedMatrix< TridiagonalMatrix >.These will eventually be added to our library.6 Parallel Tridiagonal Systems and A Fast Poisson SolverIn this section we describe the construction of a simple parallel algorithm for solvingtridiagonal systems of equations and then show how it can be applied to solving thePoisson equation.166.1 Tridiagonal SolversA standard parallel algorithm for solving diagonally dominant, tridiagonal systems oflinear equations is cyclic reduction.
This method uses Gaussian elimination withoutpivoting. The algorithm orders the elimination steps with the odd elements rst,followed by the elements that are not multiples of four, followed by the elements thatare not multiples of eight, etc. At each stage all the eliminations can be done in parallelprovided that the updates are done correctly. (See [4] for details.)One way to program this in pC++ is to build a special subcollection of the DistributedVector to represent the tridiagonal system of equations. For simplicity weshall assume the matrix is symmetric and all diagonal elements are identical and allodiagonal elements are equal.
(These assumptions are sucient for the PDE example that follows). Though the tridiagonal matrix can be described by two numbers,the diagonal, and the odiagonal, the elimination process will destroy this property.Consequently, we will need two vectors of coecients, one for the diagonal, a, and onefor the odiagonals, b. Also to simplify the boundary conditions in our program, wewill assume n = 2k for some k and the vector length is n + 1 and the problem size isn 1. Our solution vector will be in the range of indices from 1 to n 1 and it will beinitialized with the right hand side values.In the collection structure below, we have placed the coecient values directlyin the element, and we have dened one parallel function cyclicReduction, which takespointers to the two matrix coecients.
The method of element functions we need area way to set the local coecient values at the start of the algorithm, a function to dothe forward elimination phase of the elimination and a method to do the \backsolve"phase.Collection DistributedTridiagonal: public DistributedVector{public:....void cyclicReduction(ElementType *diagonal,ElementType *offidagonal);MethodOfElement:ElementType *a;ElementType *b;virtual ElementType(ElementType *);void setCoeff(ElementType *diagonal,ElementType *offdiagonal){a = new ElementType(diagonal);b = new ElementType(offdiagonal);}void backSolve(int s);void forwardElimination(int s);}Note that the setCoeff() function requires a special constructor that takes a pointerto an ElementType object and allocates a new copy of the objected referenced.