Communication occurs only in a function called Ap, which applies the finite difference operator, and in the dot product function, dotprod. In Ap the communication is all based on nearest neighbors, and in dotprod the communication is all based on tree reductions.
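As a rough illustration of that structure, the sequential C++ sketch below shows a conjugate gradient loop in which Ap and dotprod are the only routines that would involve communication in the parallel pC++ version (nearest-neighbor exchange and a tree reduction, respectively). Only those two names come from the text; the one-dimensional operator, problem size, and convergence test are assumptions made for the example, not taken from the benchmark code.

#include <cstdio>
#include <cmath>
#include <vector>

using Vec = std::vector<double>;

// Apply a 1-D finite difference operator (2u_i - u_{i-1} - u_{i+1}).
// In the pC++ version this is the only nearest-neighbor communication point.
Vec Ap(const Vec& u) {
    const int n = (int)u.size();
    Vec v(n);
    for (int i = 0; i < n; ++i) {
        double left  = (i > 0)     ? u[i - 1] : 0.0;
        double right = (i < n - 1) ? u[i + 1] : 0.0;
        v[i] = 2.0 * u[i] - left - right;
    }
    return v;
}

// Dot product; in the pC++ version this is the only tree-reduction point.
double dotprod(const Vec& a, const Vec& b) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

int main() {
    const int n = 64;
    Vec x(n, 0.0), b(n, 1.0);          // solve A x = b, starting from x = 0
    Vec r = b, p = r;                  // initial residual and search direction
    double rr = dotprod(r, r);
    for (int k = 0; k < n && std::sqrt(rr) > 1e-10; ++k) {
        Vec q = Ap(p);                 // communication: nearest neighbor
        double alpha = rr / dotprod(p, q);   // communication: reduction
        for (int i = 0; i < n; ++i) { x[i] += alpha * p[i]; r[i] -= alpha * q[i]; }
        double rr_new = dotprod(r, r);       // communication: reduction
        double beta = rr_new / rr;
        rr = rr_new;
        for (int i = 0; i < n; ++i) p[i] = r[i] + beta * p[i];
    }
    std::printf("final residual norm: %g\n", std::sqrt(rr));
    return 0;
}

Everything between the Ap and dotprod calls is purely local element arithmetic, which is what gives the grid CG example its low communication to computation ratio.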
4.2 Poisson: Fast Poisson Solver

Another common method for solving PDEs is to use a fast Poisson solver. This method involves applying a Fast Fourier Transform based sine transform to each column of the array of data representing the right hand side of the equation. This is followed by solving the tridiagonal systems of equations associated with each row. When this is complete, another sine transform of each column yields the solution.

In this case it is best to view the grid as a distributed vector of vectors, where the vector class has a special member function for computing the sine transform, sineTransform(). The distributed vector collection must have a special function for the parallel solution of tridiagonal systems of equations. In our case we use a standard cyclic reduction scheme. This is accomplished by building a collection class DistributedTridiagonal, which is a subclass of DistributedVector. This class has a public function cyclicReduction() which takes two parameters corresponding to the diagonal and off-diagonal elements of the matrix that are stored in each element. The cyclic reduction function is organized as two phases of log(n) parallel operations: the first phase is the forward elimination, the second is the back solve step. Communication happens only within the element functions forwardElimination and backSolve.

The final Poisson solver uses two collections, F and U, which represent the initial data and the final solution. Both are of type distributed tridiagonal of vector, which is easily mapped to a one dimensional template.
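The organization described above can be sketched as a class skeleton. This is ordinary C++, not pC++ collection syntax, and only the names DistributedVector, DistributedTridiagonal, cyclicReduction, sineTransform, forwardElimination, and backSolve come from the text; the element type, parameter types, and loop structure are assumptions, and the element method bodies are omitted. The intent is only to show where the two log(n) phases sit and where communication would occur.

#include <cstddef>
#include <vector>

// Element class: one row of the grid, with its own sine transform.
class Vector {
public:
    std::vector<double> data;
    void sineTransform() { /* FFT-based sine transform of this row (omitted) */ }
    // Element-level steps of cyclic reduction; in the pC++ version these are
    // the only places where neighboring elements communicate.
    void forwardElimination(double /*diag*/, double /*offdiag*/) { /* omitted */ }
    void backSolve(double /*diag*/, double /*offdiag*/) { /* omitted */ }
};

// Stand-in for the pC++ DistributedVector collection.
class DistributedVector {
public:
    std::vector<Vector> elements;   // one element per (distributed) row
};

// Subclass adding the parallel tridiagonal solver described in the text.
class DistributedTridiagonal : public DistributedVector {
public:
    // diag / offdiag are the diagonal and off-diagonal entries stored with
    // each element.  Two phases of log(n) steps; in the real algorithm each
    // step touches only a stride-dependent subset of the elements, but the
    // bodies are empty here and every element is visited just to show the shape.
    void cyclicReduction(double diag, double offdiag) {
        const std::size_t n = elements.size();
        for (std::size_t stride = 1; stride < n; stride *= 2)   // phase 1: forward elimination
            for (auto& e : elements) e.forwardElimination(diag, offdiag);
        for (std::size_t stride = n; stride > 1; stride /= 2)   // phase 2: back solve
            for (auto& e : elements) e.backSolve(diag, offdiag);
    }
};

int main() {
    // Mirrors the solver's use of the F and U collections: transform,
    // solve the tridiagonal systems, transform back.
    DistributedTridiagonal F;
    F.elements.resize(8, Vector{std::vector<double>(8, 1.0)});
    for (auto& row : F.elements) row.sineTransform();
    F.cyclicReduction(2.0, -1.0);
    for (auto& row : F.elements) row.sineTransform();
    DistributedTridiagonal U = F;   // final solution collection
    (void)U;
    return 0;
}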
4.3 Embar: NAS Embarrassingly Parallel Benchmark

The NAS benchmark suite was designed to test the suitability of massively parallel systems for the needs of NASA Ames Research. Of the nine codes in the suite, four have been translated to pC++ and we report on two of them here. (A more complete report on this suite will be published when all have been translated and tested.)

The easiest of these is the "Embarrassingly Parallel" program. This code generates 2^24 complex pairs of uniform (0, 1) random numbers and gathers a small number of statistics. In the benchmark, the random numbers are created by two special portable functions called vranlc() and randlc(), and a main function compute_pairs(int k) is used to compute the numbers based on a different value of the seed parameter k.

Our approach is to divide the work into a set of computational engines, with one engine per processor. This "set" of engines is a collection of elements of type Set<GaussianEngine>. Each GaussianEngine object computes 1/NODES of the total, where NODES is the number of processors.
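A sequential C++ stand-in for this decomposition is sketched below. The names Set<GaussianEngine> (approximated here by a std::vector of engines), compute_pairs, and NODES follow the text; the random-number generator (std::mt19937_64 in place of the benchmark's vranlc()/randlc()), the per-engine statistic, and the processor count are illustrative assumptions.

#include <cstdio>
#include <random>
#include <vector>

// Sequential stand-in for one element of the Set<GaussianEngine> collection.
// The statistics gathered here are simplified to a single count, purely for
// illustration; the benchmark gathers its own small set of statistics.
struct GaussianEngine {
    long count = 0;                    // pairs falling inside the unit circle
    void compute_pairs(int k, long pairs_per_engine) {
        std::mt19937_64 gen(k);        // seed parameter k
        std::uniform_real_distribution<double> u(0.0, 1.0);
        for (long i = 0; i < pairs_per_engine; ++i) {
            double x = u(gen), y = u(gen);
            if (x * x + y * y <= 1.0) ++count;
        }
    }
};

int main() {
    const long TOTAL = 1L << 24;       // 2^24 pairs, as in the benchmark
    const int  NODES = 8;              // one engine per processor in pC++
    std::vector<GaussianEngine> engines(NODES);
    for (int k = 0; k < NODES; ++k)    // each engine does 1/NODES of the work
        engines[k].compute_pairs(k, TOTAL / NODES);
    long total = 0;
    for (const auto& e : engines) total += e.count;
    std::printf("accepted pairs: %ld of %ld\n", total, TOTAL);
    return 0;
}

Because each engine works on an independent slice of the random stream, the only global operation is the final accumulation of the statistics, which is why the benchmark scales so well.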
4.4 Sparse: NAS Sparse CG Benchmark

A far more interesting benchmark in the NAS suite is the random sparse conjugate gradient computation. This benchmark requires the repeated solution of A X = F, where A is a random sparse matrix. There are two ways to approach the parallelization of this task. The best, and most difficult, is to view the matrix as the connectivity graph for the elements in the vector. By a preprocessing step, one can partition the vector into segments of nearly equal size that minimize the communication between subsets. The resulting algorithm can have the same low communication to computation ratio as our simple grid CG example above.

The second approach is not as efficient in terms of locality, but it is a more direct implementation of the benchmark. In this version we represent the matrix as a distributed random sparse matrix. More specifically, we represent the matrix as a Distributed Matrix of Sparse Matrices. The DistributedMatrix collection class in the pC++ library allows any class that has the algebraic structure of a ring to be the element of the matrix. Furthermore, a p by p matrix whose elements are k by k matrices has all the same mathematical properties as a kp by kp matrix. Indeed, this is the key property exploited by the blocking algorithms used in most BLAS level 3 libraries. Consequently, we can select the element class to be a sparse matrix. This distributed matrix of sparse matrices can then be used like an ordinary matrix.
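The algebraic point, that a p by p matrix whose elements are k by k (sparse) matrices behaves like a kp by kp matrix, can be illustrated with a small, self-contained C++ sketch of a blocked matrix-vector product. The SparseBlock and BlockMatrix types, the coordinate storage, and the example data are all assumptions made for illustration; they are not the pC++ DistributedMatrix collection or the benchmark's sparse matrix class.

#include <cstdio>
#include <vector>

// A k-by-k sparse block stored in coordinate (COO) form.
struct SparseBlock {
    std::vector<int> row, col;
    std::vector<double> val;
    // y += A_block * x : the kind of ring operation the element type must supply.
    void multiplyAdd(const std::vector<double>& x, std::vector<double>& y) const {
        for (std::size_t i = 0; i < val.size(); ++i)
            y[row[i]] += val[i] * x[col[i]];
    }
};

// Sequential stand-in for a p-by-p block matrix whose elements are sparse blocks.
struct BlockMatrix {
    int p, k;
    std::vector<SparseBlock> blocks;   // p*p blocks, row-major
    // Behaves like an ordinary (p*k)-by-(p*k) matrix-vector product:
    // accumulate block(i,j) * x_block(j) into y_block(i).
    std::vector<double> apply(const std::vector<double>& x) const {
        std::vector<double> y(p * k, 0.0);
        for (int i = 0; i < p; ++i)
            for (int j = 0; j < p; ++j) {
                const SparseBlock& B = blocks[i * p + j];
                std::vector<double> xj(x.begin() + j * k, x.begin() + (j + 1) * k);
                std::vector<double> yi(k, 0.0);
                B.multiplyAdd(xj, yi);
                for (int r = 0; r < k; ++r) y[i * k + r] += yi[r];
            }
        return y;
    }
};

int main() {
    // A 2x2 block matrix of 2x2 sparse blocks: the 4x4 identity, stored blockwise.
    BlockMatrix A{2, 2, std::vector<SparseBlock>(4)};
    A.blocks[0] = SparseBlock{{0, 1}, {0, 1}, {1.0, 1.0}};   // diagonal block (0,0)
    A.blocks[3] = SparseBlock{{0, 1}, {0, 1}, {1.0, 1.0}};   // diagonal block (1,1)
    std::vector<double> x = {1, 2, 3, 4};
    std::vector<double> y = A.apply(x);
    std::printf("%g %g %g %g\n", y[0], y[1], y[2], y[3]);    // prints 1 2 3 4
    return 0;
}

The outer loops never look inside a block; they only rely on the blocks supporting multiply-and-add, which is exactly the ring property that lets a sparse matrix serve as the element class.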
5 Performance Scalability

A key feature of the pC++ programming system is its ability to transparently and efficiently support scaling of problem size and machine size.
Clearly, the dominant performance concern is pC++'s ability to achieve scalable performance. As the problem size increases, pC++ should support increased processing efficiency. In response to increasing numbers of processors, pC++ should demonstrate good speedup. Scalability results for shared and distributed memory versions of the pC++ system are given below. (The ports to the IBM SP-1 and to workstation clusters using PVM were completed only recently; performance results for these ports were not yet available for inclusion in this article.) We also provide comparisons to Fortran results where appropriate.

5.1 Shared Memory

The shared memory ports of the pC++ runtime system isolate performance issues concerned with the allocation of collections in the shared memory hierarchy and with architectural support for barrier synchronization. Clearly, the choice of collection distribution is important, but, as is typically the case in programming shared memory machines, the ability to achieve good memory locality is critical.

The scaled performance of the shared memory ports of the pC++ system reflects the effectiveness of the collection distribution schemes as well as the interactions of the underlying architecture with respect to collection memory allocation, "remote" collection element access, and barrier synchronization.
Figures 3, 4, and 5 show the speedup of the benchmark programs on the Sequent, BBN, and KSR machines, respectively. Naturally, the speedup of Embar was excellent on all machines. For the Sequent using 23 processors, the speedup of 15.94 reflects the mismatch between the number of processors and the problem size; the staircase behavior is even more pronounced in the BBN results. The slightly lower speedup for Embar on 64 nodes of the KSR-1 is due to the activation of an additional level of the cache memory hierarchy: a portion of the memory references must cross between cluster rings, encountering latencies 3.5 times higher than those of references made within a 32-processor cluster.
This effect is clearly evident for the Grid benchmark on the KSR, whereas the other machines show steady speedup improvement; the 40 to 60 processor cases on the BBN TC2000 are again due to load imbalances caused by the distribution not being well matched to those processor counts.

The Poisson benchmark performs well on all shared memory machines for all processor counts. This demonstrates pC++'s ability to efficiently assign elements of a collection, such as the distributed vector collection, to processors and to use subclassing to implement high-performance functions on the data, such as cyclic reduction.

The speedup performance of the Sparse benchmark reflects the importance of locality, most evident in the KSR results.
The uniform memory system of the Symmetry hides many of the poor-locality effects, resulting in a reasonable speedup profile. The NUMA memory system of the BBN TC2000 is more susceptible to locality of reference because of the cost of remote references. We knew that the Sparse implementation was not the most efficient in terms of locality, but this resulted in particularly poor speedup performance on the KSR-1; when crossing cluster boundaries, the drop in speedup is quite severe.
Although the pC++ port to the KSR-1 was done most recently and should still be regarded as a prototype, the analysis of the performance interactions between collection design, distribution choice, and the hierarchical, cache-based KSR-1 memory system is clearly important for optimization.

5.2 Distributed Memory

In contrast to the shared memory ports of pC++, the principal performance factors for the distributed memory versions of the runtime system are message communication latencies and barrier synchronization.
Collection design and distribution choice are the major influences on the performance of an application, but the runtime system implementation of message communication and barrier synchronization can also play an important role. In fact, these factors affect performance quite differently on the TMC CM-5 and the Intel Paragon. Communication latency is very low on the CM-5, and the machine has a fast global synchronization mechanism.
On the Paragon, communication performance is poor, requiring parallelization schemes that minimize communication if good performance is to be achieved.

Figures 1 and 2 show the execution time speedup results for the benchmark suite on the CM-5 and Paragon machines, respectively. The Embar benchmark shows excellent speedup behavior on both machines. For the CM-5, the execution time was within 10 percent of the published hand-optimized Fortran results for this machine. In the case of the Paragon, a 32-node Paragon achieved a fraction of 0.71 of the Cray uniprocessor Fortran version; speedup was 19.6 with respect to this code.

The Grid benchmark demonstrated near linear speedup on the CM-5 even with 64 by 64 sub-blocks, reflecting the low communication overhead relative to sub-block computation time.
However, the high communication overheads on the Paragon required a different block size choice, 128 instead of 64, before acceptable speedup performance could be achieved on this benchmark.

Execution time for Poisson is the sum of the time for the FFT transforms and the cyclic reduction. Because the transforms require no communication, their performance scales very well on both the CM-5 and the Paragon. In contrast, the cyclic reduction has a communication complexity that is nearly equal to its computational complexity.
Although the communication latency is very low on the CM-5, no speedup was observed in this part of the computation even for Poisson grid sizes of 2,048; because of the larger number of processors used, the communication to computation ratio was high. With a smaller number of processors, the Paragon was able to find some speedup in the cyclic reduction part.

Finally, running the Sparse benchmark, the CM-5 achieved a reasonable speedup and, with 256 processors, matched the performance of the untuned Cray Y/MP Fortran code. In the case of the Paragon, the intense communication required in the sparse matrix-vector multiply, coupled with high communication latency, actually resulted in a slowdown as processors were added. We cannot expect improvements in these numbers until Intel finishes its "performance release" of the system. Currently, Indiana University is experimenting with Sandia's high performance SUNMOS as the compute node operating system of its Intel Paragon.
SUNMOS increases message passing bandwidth and reduces latency. For compute-bound benchmarks such as Embar, preliminary results show no significant improvement. However, for communication-intensive benchmarks such as Sparse, initial results show a factor of nearly 7 improvement over nodes running OSF/1.

6 Detailed Performance Analysis

The pC++ performance environment allows us to investigate interesting performance behavior in more detail. In particular, a pC++ program execution involves several execution components: initialization, processor object instantiation, thread creation, collection creation, runtime system operation, and the application's main algorithm.