One reason for this is that the problem size is still relatively small. Consequently, the large number of inner products in the CG algorithm will cause it to have a sublinear speed-up if there is not enough computation to mask the log(p) cost of the reductions. But we can still put these results in perspective. In the case of the Alliant FX/2800, we have access to an automatic parallelizing and vectorizing compiler for Fortran. In the table below, we have included the results of running the optimized Fortran program on the benchmark test data size of 1400 by 1400 with 78148 non-zeros. The Fortran program was compiled with full vector concurrent optimization.

Machine                  P=2      P=5      P=10     P=14
FX/2800 pC++ time        112.8    47.9     27.0     22.33
FX-Fortran time          78.38    73.45    73.63    72.98
ratio Fort/pC++          0.69     1.5      2.7      3.2

As can be seen, the Alliant Fortran compiler is not able to extract any significant parallelism from this computation.
On the other hand, the pC++ program, which has no vectorization within the code that runs on each i860, runs slower for small numbers of processors but over three times faster than the Fortran code on 14 processors. We must also add one more caveat: while the pC++ performance may look good, the absolute performance was less than 10% of the original benchmark when run on a Cray Y-MP.

More recently we have completed a partial port of our compiler to the Thinking Machines CM-5, and we have tested this computation on a machine with 128 processors.
In this case we have executed the benchmark at its full size of 14000 by 14000 with 1853104 non-zeros. The speed of the computation is 18.9 Mflops, which is equal to 0.27 times the speed of a 1-processor Cray Y-MP on the Fortran version of this benchmark. We emphasize that the machine is an early version without the floating point vector hardware, running a very early version of the communications library. By a "partial port" of the compiler we mean that the main vector and matrix routines had to be modified by hand to use the message passing primitives provided by TMC.
On the other hand, the CGM application code did not need any modification. (On the other machines discussed here, none of the basic library code for vector and matrix operations had to be modified in going from one machine to another.) We also note that 80% of the time consumed by CGM on the CM-5 is spent in the data communication routines in the matrix times vector computation. We hope to report better performance numbers as the Thinking Machines communications software improves.