The memory is physically distributed on the nodes and organized as a hardware-coherent distributed cache [4]. The machine can scale to 1088 nodes, in clusters of 32. Nodes in a cluster are interconnected with a pipelined slotted ring. Clusters are connected by a higher-level ring. Each node has a superscalar 64-bit custom processor, a 0.5 Mbyte local sub-cache, and 32 Mbytes of local cache memory.

For the pC++ runtime system implementation, we used the POSIX thread package with a KSR-supplied extension for barrier synchronization.
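As an illustration of this approach, the following is a minimal sketch of a thread-per-processor runtime phase separated by a barrier. Since the KSR-supplied barrier extension is not described here, the standard pthread_barrier_t interface is used as a stand-in, and the names (worker, NUM_THREADS, phase_barrier) are illustrative rather than taken from the pC++ sources.

    // Minimal sketch of a thread-per-processor runtime phase separated by a
    // barrier. The KSR-supplied barrier extension is not shown; the standard
    // pthread_barrier_t interface is used as a stand-in, and the names
    // (worker, NUM_THREADS, phase_barrier) are illustrative only.
    #include <pthread.h>
    #include <cstdio>

    const int NUM_THREADS = 4;              // one worker thread per processor
    static pthread_barrier_t phase_barrier; // separates parallel phases

    void* worker(void* arg) {
        long id = (long)arg;
        // ... compute on the collection elements owned by this thread ...
        std::printf("thread %ld: local phase done\n", id);
        // every thread must arrive here before the next phase may begin
        pthread_barrier_wait(&phase_barrier);
        // ... next parallel phase ...
        return nullptr;
    }

    int main() {
        pthread_t tid[NUM_THREADS];
        pthread_barrier_init(&phase_barrier, nullptr, NUM_THREADS);
        for (long i = 0; i < NUM_THREADS; i++)
            pthread_create(&tid[i], nullptr, worker, (void*)i);
        for (int i = 0; i < NUM_THREADS; i++)
            pthread_join(tid[i], nullptr);
        pthread_barrier_destroy(&phase_barrier);
        return 0;
    }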
The collection allocation strategy is exactly the same as for the Sequent except that no special shared memory allocation is required; data is automatically shared between threads. However, the hierarchical memory system in the KSR is more complex than in the Sequent machine. Latencies for accessing data in the local sub-cache and the local cache memory are 2 and 18 cycles, respectively. Latencies between node caches are significantly larger: 150 cycles in the same ring and 500 cycles across rings. Although our current implementation simply calls the standard memory allocation routine, we suspect that more sophisticated memory allocation and management strategies will be important in optimizing the KSR performance.
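One form such a strategy might take, shown purely as a sketch, is padding and aligning each thread's allocations to the machine's coherence unit so that data owned by different threads never shares a subpage. The 128-byte subpage size is an assumed figure, and alloc_padded is a hypothetical helper, not part of the pC++ runtime.

    // Sketch only: pad and align allocations to the coherence unit so that
    // data owned by different threads never shares a subpage (avoiding false
    // sharing). The 128-byte subpage size is an assumed figure for the KSR-1,
    // and alloc_padded is a hypothetical helper, not part of the pC++ runtime.
    #include <cstdlib>
    #include <cstddef>

    const std::size_t SUBPAGE = 128;  // assumed coherence-unit size in bytes

    void* alloc_padded(std::size_t bytes) {
        // round the request up to a whole number of subpages
        std::size_t padded = ((bytes + SUBPAGE - 1) / SUBPAGE) * SUBPAGE;
        void* p = nullptr;
        // posix_memalign places the block on a subpage boundary
        if (posix_memalign(&p, SUBPAGE, padded) != 0)
            return nullptr;
        return p;
    }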
4 Performance Measurements

To exercise different parallel collection data structures and to evaluate the pC++ runtime system implementation, four benchmark programs covering a range of problem areas were used. These benchmarks are briefly described below. The results for the benchmarks on each port of pC++ follow.

BM1: Block Grid CG. This computation consists of solving the Poisson equation on a 2-dimensional grid using finite difference operators and a simple conjugate gradient method without preconditioning. It represents one type of PDE algorithm.

BM2: A Fast Poisson Solver. This benchmark uses FFTs and cyclic reductions along the rows and columns of a two-dimensional array to solve PDE problems. It is typical of a class of CFD applications.

BM3: The NAS Embarrassingly Parallel Benchmark.
Four NAS benchmark codes have been translated to pC++; we report on two. The BM3 program generates 2^24 complex pairs of uniform (0, 1) random numbers and gathers a small number of statistics.

BM4: The NAS Sparse CG Benchmark. A far more interesting benchmark in the NAS suite is the random sparse conjugate gradient computation. This benchmark requires the repeated solution of Ax = f, where A is a random sparse matrix. (A minimal sketch of the CG kernel shared by BM1 and BM4 is given below.)
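Since BM1 and BM4 are both built around the same unpreconditioned conjugate gradient iteration, a minimal serial sketch of that kernel is shown here. The dense matrix-vector product stands in for the finite-difference (BM1) or random sparse (BM4) operator, and the names, tolerance, and iteration limit are illustrative, not taken from the benchmark sources.

    // Minimal serial sketch of unpreconditioned conjugate gradient for a
    // symmetric positive definite system A x = b. A dense matrix-vector
    // product stands in for the finite-difference (BM1) or random sparse
    // (BM4) operator; names, tolerance, and iteration limit are illustrative.
    #include <cmath>
    #include <cstddef>
    #include <vector>

    using Vec = std::vector<double>;
    using Mat = std::vector<Vec>;

    static double dot(const Vec& a, const Vec& b) {
        double s = 0.0;
        for (std::size_t i = 0; i < a.size(); i++) s += a[i] * b[i];
        return s;
    }

    static Vec matvec(const Mat& A, const Vec& x) {
        Vec y(A.size(), 0.0);
        for (std::size_t i = 0; i < A.size(); i++)
            for (std::size_t j = 0; j < x.size(); j++)
                y[i] += A[i][j] * x[j];
        return y;
    }

    // Returns the approximate solution of A x = b starting from x = 0.
    Vec conjugate_gradient(const Mat& A, const Vec& b,
                           int max_iter = 1000, double tol = 1e-10) {
        Vec x(b.size(), 0.0);   // initial guess x0 = 0
        Vec r = b;              // residual r = b - A x0
        Vec p = r;              // initial search direction
        double rr = dot(r, r);
        for (int k = 0; k < max_iter && std::sqrt(rr) > tol; k++) {
            Vec Ap = matvec(A, p);
            double alpha = rr / dot(p, Ap);       // step length along p
            for (std::size_t i = 0; i < x.size(); i++) {
                x[i] += alpha * p[i];
                r[i] -= alpha * Ap[i];
            }
            double rr_new = dot(r, r);
            double beta = rr_new / rr;            // conjugate direction update
            for (std::size_t i = 0; i < p.size(); i++)
                p[i] = r[i] + beta * p[i];
            rr = rr_new;
        }
        return x;
    }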
4.1 Distributed Memory Systems

The principal runtime system factors for performance on non-shared, distributed memory ports of pC++ are message communication latencies and barrier synchronization. These factors influence performance quite differently on the TMC CM-5 and Intel Paragon. For the CM-5, experiments for 64, 128, and 256 processors were performed. Because of the large size of this machine relative to the others in the paper, we ran several of the benchmarks on larger problem sizes. For the BM1 code running on a 16 by 16 grid with 64 by 64 sub-blocks, near-linear speedup was observed, indicating good data distribution and low communication overhead relative to sub-block computation time. Execution time for BM2 is the sum of the time for FFT transforms and cyclic reduction. Because the transforms require no communication, performance scales perfectly here. In contrast, the cyclic reduction requires a communication complexity that is nearly equal to the computational complexity.
Although the communication latency is very low for the CM-5, no speedup was observed in this section even for Poisson grid sizes of 2,048. For the benchmark as a whole, a 25 percent speedup was observed from 64 to 256 processors. As expected, the BM3 performance showed near-linear speedup. More importantly, the execution time was within 10 percent of the published manually optimized Fortran results for this machine. For the BM4 benchmark, we used the full problem size for the CM-5. While the megaflop rate is low, it matches the performance of the untuned Cray Y-MP Fortran code.

Results for the Paragon show a disturbing lack of performance in the messaging system, attributed primarily to the pre-release nature of this software.
Experiments were performed for 4, 16, and 32 processors. The BM1 benchmark required a different block size choice, 128 instead of 64, before acceptable speedup performance could be achieved, indicative of the effects of increased communication overhead. At first glance, the speedup improvement from BM2 contradicts what was observed for the CM-5. However, using a smaller number of processors, as in the Paragon case, has the effect of altering the communication/computation ratio. Collection elements mapped to the same processor can share data without communication, while if the collection is spread out over a large number of processors, almost all references from one element to another involve network traffic. Speedup behavior similar to the Paragon's was observed on the CM-5 for equivalent numbers of processors.
For the BM3 benchmark, a 32 node Paragon achieved a fraction of 0.71 of the Cray uniprocessor Fortran version; speedup was 19.6. However, the most significant results are for the BM4 benchmark. Here, the time increased as processors were added. This is because of the intense communication required in the sparse matrix-vector multiply. We cannot expect improvements in these numbers until Intel finishes their "performance release" of the system.

4.2 Shared Memory Systems

The shared memory ports of pC++ uncover performance issues in the language and runtime system implementation that differ from those of the distributed memory ports.
Here, the ability to achieve good memory locality is the key to good performance. Clearly, the choice of collection distribution is important, but the memory allocation schemes in the runtime system will play a big part. To better isolate the performance of runtime system components and to determine the relative influence of the different phases of benchmark execution in which the runtime system was involved, we used a prototype pC++ tracing facility for the shared memory performance measurements. In addition to producing the same performance results reported above for the distributed memory systems, a more detailed execution time and speedup profile was obtained from the trace measurements. Although space limitations prevent detailed discussion of these results, they will be forthcoming in a technical report.
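For context, the following is a minimal sketch of the kind of per-thread event tracing such a facility might perform; the record layout and the names (TraceEvent, log_event, dump_trace) are assumptions for illustration, not the actual pC++ tracing interface.

    // Minimal sketch of per-thread event tracing: each thread appends
    // timestamped records for runtime-system events (e.g., barrier entry and
    // exit, remote element access), and the buffers are merged after the run.
    // TraceEvent, log_event, and dump_trace are assumed names for
    // illustration, not the actual pC++ tracing interface.
    #include <chrono>
    #include <cstdio>
    #include <vector>

    struct TraceEvent {
        int    thread_id;  // worker thread that produced the event
        int    event_id;   // which runtime event occurred
        double time_sec;   // wall-clock timestamp, seconds since start
    };

    // one buffer per thread keeps the logging path synchronization-free
    static std::vector<std::vector<TraceEvent>> trace_log;

    static double now() {
        using Clock = std::chrono::steady_clock;
        static const auto t0 = Clock::now();
        return std::chrono::duration<double>(Clock::now() - t0).count();
    }

    void init_trace(int num_threads) { trace_log.resize(num_threads); }

    void log_event(int thread_id, int event_id) {
        trace_log[thread_id].push_back({thread_id, event_id, now()});
    }

    void dump_trace() {
        for (const auto& per_thread : trace_log)
            for (const auto& e : per_thread)
                std::printf("%d %d %.6f\n", e.thread_id, e.event_id, e.time_sec);
    }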
In general, we were pleased with the speedup results on the Sequent Symmetry, given that it is a bus-based multiprocessor. For all benchmarks, speedup results for 16 processors were good: BM1 (14.84), BM2 (14.15), BM3 (15.94), and BM4 (12.33). Beyond 16 processors, contention on the bus and in the memory system stalls speedup improvement. Although the Sequent implementation serves as an excellent pC++ testbed, the machine architecture and processor speed limit large scalability studies.
The Symmetry pC++ runtime system implementation is, however, representative of ports to shared memory parallel machines with equivalent numbers of processors, e.g., the shared memory Cray Y-MP or C90 machines. Using the four-processor Sequent speedup results (3.7 to 3.99) as an indication, one might expect similar speedup performance on these systems. (Note: we are currently porting pC++ to a Cray Y-MP and C90.)

The performance results for the BBN TC2000 reflect interesting architectural properties of the machine. Like the Sequent, benchmark speedups for 16 processors were encouraging: BM1 (14.72), BM2 (14.99), BM3 (15.92), and BM4 (11.59).
BM1 speedup falls off to 23.89 and 32.36 at 32 and 64 processors, respectively, but these results are for a small 8 by 8 grid of subgrids, reflecting the small-problem-size behavior encountered on the CM-5. BM2 speedup continues at a fairly even clip, indicating better amortization of the remote collection access costs that produced high communication overhead in the distributed memory versions.
BM3 speedup was almost linear, achieving 31.48 for 32 processors and 58.14 for 64 processors. Unlike the Sequent, the BM4 speedup beyond 16 processors did not show any significant architectural limitations on performance.

The pC++ port to the KSR-1 was done most recently and should still be regarded as a prototype. Nevertheless, the performance results demonstrate the important architectural parameters of the machine. Up to 32 processors (1 cluster), speedup numbers steadily increase. BM1 to BM3 speedup results are very close to the TC2000 numbers; BM3 speedup for 64 processors was slightly less (52.71). However, BM4's speedup at 32 processors (9.13) is significantly less than the TC2000's result (17.29), highlighting the performance interactions of the choice of collection distribution and the hierarchical, cache-based KSR-1 memory system.
Beyond 32 processors, two or more processor clusters are involved in the benchmark computations; we performed experiments up to 64 processors (2 clusters). As a result, a portion of the remote collection references must cross cluster rings; these references encounter latencies roughly 3.5 times higher than those for references made within a cluster.
All benchmark speedup results reflect this overhead, falling below their 32-processor values.

5 Conclusion

Our experience implementing a runtime system for pC++ on five different parallel machines indicates that it is possible to achieve language portability and performance scalability goals simultaneously using a well-defined language/runtime system interface.