This is because the matrix-vector product operator was written as a parallel algorithm to begin with. Note, however, that both versions of this program contain a potential inefficiency. The multiply algorithm specifies that the vector x is to be expanded to an n by n distributed array by duplication from the first row of the template. Clearly, only one copy is needed in each processor. An alternative formulation that makes this optimization explicit would make the template array the same size as the processor array and specify that the elements of the matrix are from a class float_DenseMatrix and are of size n/q by n/q. In other words, the matrix is now a matrix of matrices. Similarly, we make each vector a vector of vectors of length n/q. Because the size of the matrix is now identical to the processor array, each processor will only get one copy of the vector segment it needs.
The code for this version of the matrix-vector computation now takes the form shown below.

   TemplateClass T( [q,q], [q,q], Ident);
   DistributedBlkMatrix<float_DenseMatrix> M( T, n, [q,q], [i,j]->[i,j])(n/q, n/q);
   DistributedBlkVector<float_Vector>      x( T, n, [q],   [i]->[i,0])(n/q);
   DistributedBlkVector<float_Vector>      y( T, n, [q],   [i]->[0,i])(n/q);
   y = M*x;

The only difference between a DistributedBlkMatrix and a DistributedMatrix is in the way the global indexing function works. (The blocked matrix is only of size q by q, but it should be indexed as a matrix of size n by n.) The matrix times vector code is identical (inherited from the DistributedMatrix class). There are two advantages of the blocked scheme shown above.
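To make the indexing distinction concrete, the sketch below shows one plausible way to translate a global element index into a block coordinate plus an intra-block offset. This is our own illustration of the idea, not the pC++ implementation; the struct and function names are hypothetical.

   /* Hypothetical sketch: mapping a global index (i, j) of an n by n
      matrix onto a q by q array of blocks, each of edge length
      b = n/q (assuming q divides n evenly). */
   struct BlockIndex {
       int bi, bj;   /* which block in the q by q block array   */
       int oi, oj;   /* offset of the element inside that block */
   };

   BlockIndex global_to_block(int i, int j, int n, int q) {
       int b = n / q;                      /* block edge length   */
       BlockIndex x;
       x.bi = i / b;  x.bj = j / b;        /* block coordinates   */
       x.oi = i % b;  x.oj = j % b;        /* intra-block offsets */
       return x;
   }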
First, the elementwise operations in the matrix multiply step,

   tmp = (*this)*tmp;   /* elementwise parallelism */

are now products of an object of class float_DenseMatrix times an object of class float_Vector. Each of these is a scalar operation and can be coded with level-3 BLAS routines finely tuned for peak performance on each processor of the system. The performance improvement can be dramatic, as shown in [10].

The second advantage of this scheme is that we now have a way to work with distributed sparse matrices at very little cost. Assume you have a sequential class float_SparseMatrix for which sparse matrix operations, such as matrix-vector products, have been defined. By changing one line of the program,

   DistributedBlkMatrix<float_SparseMatrix> M( T, n, [q,q], [i,j]->[i,j])(n/q, n/q);

our dense matrix program can now function with sparse matrices.
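To illustrate the kind of sequential class this substitution assumes, the sketch below shows a minimal float_SparseMatrix in compressed sparse row form with a matrix-vector product. The storage layout and member names here are our own assumptions; all the distributed program requires is that the operations it invokes, such as operator*, are defined.

   #include <vector>

   class float_SparseMatrix {
     public:
       int nrows;
       std::vector<int>   row_ptr;   /* size nrows+1: start of each row */
       std::vector<int>   col_idx;   /* column index of each nonzero    */
       std::vector<float> val;       /* the nonzero values themselves   */

       /* Sequential sparse matrix-vector product y = A*x. */
       std::vector<float> operator*(const std::vector<float>& x) const {
           std::vector<float> y(nrows, 0.0f);
           for (int i = 0; i < nrows; i++)
               for (int k = row_ptr[i]; k < row_ptr[i+1]; k++)
                   y[i] += val[k] * x[col_idx[k]];
           return y;
       }
   };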
The resulting distributed data structure is illustrated in Figure 2. Of course, many algorithms appropriate for dense systems may not be optimal for sparse matrices, but there are many cases where the best algorithms are the same. For example, the NAS SparseCG benchmark uses a classical conjugate gradient algorithm, and results from an implementation based on the idea above are illustrated in [11].

Figure 2. A Distributed Matrix of SparseMatrices.

5. The Real Challenge: Dynamic Structures

The collection hierarchy described in section 3 includes a number of data types other than matrices.
For large scale simulations, meshes or grids are the critical data structure. The most challenging problems involve grids that are locally refined and dynamic in nature. For example, multigrid algorithms and various adaptive mesh refinement techniques are becoming increasingly important in problems where the simulation must consider multiple length scales. As illustrated in Figure 3, these grids are linked together in a tree structure.
Algorithms that operate on these structures typically work on only one level at a time, and the structure of the tree is dynamic. In the following paragraphs we sketch a possible implementation in terms of the collection hierarchy. Unlike the discussion in the preceding sections, the ideas here are all currently unimplemented. At a later date we will have final implementation details and experimental results.

Figure 3. Adaptive Mesh Refinement Collection. (The figure shows a tree of grids, AMR_Grid<Cell> G, with levels G.level(0), G.level(1), and G.level(2).)

If we begin by assuming that each mesh cell is described by a C++ class Cell, the problem is reduced to defining the global topology. The structure is composed of a tree of meshes of cells, so one possibility is to define an AMR_Grid as

   typedef DistributedTree<DistributedGrid<Cell>> AMR_Grid;

however this may not be perfectly suited to algorithms that sequentially go through the levels of the tree and treat each level as a set of grids that may be processed concurrently. An alternative formulation is

   typedef OrderedSet<DistributedGrid<Cell>> Level;
   class AMR_Grid {
       Level L[MaxLevel];
       ...
   };

Both of these schemes use a type of nested concurrent structure to represent the tree of meshes.
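To make the control structure of the second formulation concrete, the fragment below models it in plain sequential C++. Grid, relax, MaxLevel, and the use of std::vector in place of the concurrent OrderedSet are all stand-ins of our own; in pC++ the inner loop over a level would execute concurrently.

   #include <vector>

   struct Cell { float u; };
   struct Grid { std::vector<Cell> cells; };

   const int MaxLevel = 4;

   struct AMR_Grid {
       std::vector<Grid> L[MaxLevel];  /* L[k]: the set of grids at level k */
   };

   void relax(Grid& g) { /* smooth the solution on one grid */ }

   /* Work on one level at a time; the grids within a single level are
      independent, which is the concurrency referred to in the text. */
   void sweep(AMR_Grid& G) {
       for (int k = 0; k < MaxLevel; k++)
           for (Grid& g : G.L[k])     /* concurrent in pC++, serial here */
               relax(g);
   }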
Other than the fact that this is not yet implemented in our compiler, there is nothing inherently wrong with the idea. The primary difficulty with these schemes is that of defining what processor alignment and distribution mean in this case. This is an open research problem.

A slightly easier problem to work on is the adaptive Barnes-Hut or multipole algorithms used in N-body computations. While closely related to the AMR problem above (see Edelsohn [1] for an interesting analysis), it also represents a different use of an adaptive tree. In the Barnes-Hut algorithm, space is initially partitioned in a manner similar to that in Figure 4, and a tree that represents a quadtree recursive partitioning of the domain is constructed. The majority of the work in the algorithm involves traversals of this tree. A parallel implementation requires two data structures.
One is a vector of particles,

   DistributedVector<particle> particles;

and the other is a dynamic tree of nodes which represent either region summary information or references to particles.

   DistributedTree<Node> BH_Tree;

The most important part of this computation is the design of the distributed tree collection. It is essential that the tree be constructed in such a way that particles that are mapped to a given processor are associated with leaves in the distributed tree that are resident on that processor. This will allow us to exploit spatial locality and optimize the tree traversal routines.
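One common way to obtain this kind of locality (our illustration; the text does not specify the construction) is to order particles by a Morton key, the bit-interleaved quadtree address of the cell containing each particle. Particles with nearby keys fall in nearby leaves of the quadtree, so assigning contiguous key ranges to processors keeps a processor's particles and its tree leaves together.

   #include <cstdint>

   /* Hypothetical sketch: a 2-D Morton (quadtree) key formed by
      interleaving the bits of the discretized x and y coordinates. */
   uint32_t morton_key(uint16_t ix, uint16_t iy) {
       uint32_t key = 0;
       for (int b = 0; b < 16; b++) {
           key |= (uint32_t)((ix >> b) & 1) << (2 * b);      /* x bit */
           key |= (uint32_t)((iy >> b) & 1) << (2 * b + 1);  /* y bit */
       }
       return key;
   }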
Figure 4. A Typical Barnes-Hut Data Structure for N-Body Problems. (The figure shows a recursive quadtree partition of a region containing labeled particles.)

6. Visualization Issues

One of the principal tools required by programmers when tuning applications for high performance on massively parallel systems is a mechanism for understanding how performance relates to algorithm and data structure design. The current state of the art is to monitor message traffic between processors and use graphical techniques to debug an algorithm and its performance. We feel that, as data structures become more complex, it becomes increasingly difficult to use a processor-message based model.
In particular, in applications where multiple threads of computation are bound to a single processor, one wants to see the inter-thread communications. This situation can occur when a data distribution function maps multiple objects to a single processor and the compiler cannot merge these into a single thread. Furthermore, in the case of very complex structures like the AMR computations described in the last section, the user is more interested in how the communications relate to a graphical display of the data structures.

There are a variety of ways of dealing with this problem, but the object oriented features of C++ provide a unique set of opportunities.
Indeed, graphical display software is one area where object oriented techniques are now standard practice. Our proposal is to instrument the basic collection hierarchy with built-in measurement and visualization software. A user that builds with, or subclasses from, the hierarchy will inherit the built-in visualization routines. In this way, standard graphics routines can be implemented for the graphs, trees, and grids that are best suited to these applications. If the user does not like the standard displays, it is easy to override the default routines. While this idea is as yet untested, we feel the potential for this approach is very great.
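One plausible shape for such an inherited hook, sketched below under our own naming (none of this is part of the pC++ interface), is a virtual display method on a common collection base class that subclasses inherit by default and override when the standard view is not appropriate.

   #include <cstdio>

   class Collection {
     public:
       /* Default display routine, inherited by every collection type. */
       virtual void display() const {
           std::printf("generic collection view\n");
       }
       virtual ~Collection() {}
   };

   class GridCollection : public Collection {
     public:
       /* A grid-specific override of the standard display. */
       void display() const {
           std::printf("grid view: cells colored by owning processor\n");
       }
   };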
7. Conclusion

In this paper we have sketched the extension of HPF style data distributions and data parallel programming to a type of object parallel programming. By adding a special concurrent aggregate class to C++, we have shown how linear algebra computations are easily described. In addition, we have illustrated how libraries of collection types can be created and how they can be used in a variety of complex applications.

A preliminary version of the pC++ compiler is now running on several shared address space machines. These include the BBN GP1000, the BBN TC2000, and the Alliant FX/8 and FX/2800. Many of the ideas described in this paper represent new directions for pC++ that will be implemented in the next version of the compiler. Our initial targets will be the TMC CM-5 and the Intel Paragon.
We expect to release a public domain version of the compiler in the Fall of 1993. This compiler will actually be a preprocessor that generates C++ or C code that executes in a SPMD model. At the same time we will release an initial version of our collection hierarchy library.

8. REFERENCES

1. D. Edelsohn, Hierarchial Tree-Structures as Adaptive Meshes, Syracuse Center for Computational Science, Technical Report CRPC-TR91186/SCCS-193, Nov. 1991.
2. D. Balsara, M.