Fortran subroutines in the original Fortran code now become member functions of the Segment class, although subroutines involving inter-element communication and I/O need to be modified. The Segment class is declared as follows, with many less important variables and member functions omitted:

class Segment {
public:
  FArrayDouble x, y, z, vx, vy, vz, ax, ay, az, mass,
               plm, clm, dlm, elm, flm, dplm;
  double sinsum[lmax+1][lmax+1][nmax+1],
         cossum[lmax+1][lmax+1][nmax+1];
  Segment();
  void compute_polynomial();
  void compute_acceleration();
  void update_position();
  void update_velocity();
};

The new data type FArrayDouble is a predefined class to contain Fortran double precision arrays. On the CM-5, FArrayDouble represents parallel arrays, that is, arrays allocated in the vector processor accelerators' memories.
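For illustration, one such member function might simply forward to the original Fortran subroutine, as in the following hypothetical sketch (the trailing-underscore Fortran symbol, the data() accessor on FArrayDouble, and the nbodies count are assumptions, not part of the paper):

// Hypothetical glue code binding a Fortran subroutine as a member
// function of Segment.
extern "C" void updatepos_(double *x, double *y, double *z,
                           double *vx, double *vy, double *vz, int *n);

void Segment::update_position() {
  int n = nbodies;  // element-local particle count (assumed member)
  updatepos_(x.data(), y.data(), z.data(),
             vx.data(), vy.data(), vz.data(), &n);
}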
In the Segment declaration above, x, y, z, vx, vy, vz, ax, ay, az, and mass are one-dimensional arrays that contain the positions, the velocities, the accelerations, and the masses of particles belonging to a Segment object, and plm, clm, dlm, elm, flm, and dplm contain the expansion coefficients and the values of the polynomial. The arrays sinsum and cossum contain the local sums and eventually the global sums of expansion coefficients. They are serial arrays on the CM-5 and are allocated in the Sparc chip's memory. Among the class member functions (Fortran subroutines), compute_polynomial computes the polynomials and local sums, compute_acceleration computes the acceleration for each particle, and update_position and update_velocity update the positions and velocities of particles.

The collection that distributes the elements, allocates memory, and manages inter-element communication is declared below.
Again, many less important member functions are omitted for brevity:

Collection SelfConsistField : public Fortran {
public:
  SelfConsistField(Distribution *D, Align *A);
  void InParticles();
  void InParameters();
  void OutParticles(int nsnap);
MethodOfElement:
  virtual void compute_polynomial();
  virtual void compute_acceleration();
  virtual void update_position();
  virtual void update_velocity();
  void read_segment();
  void write_segment();
};

Functions declared here are pC++ functions. Their main functionality is to handle the I/O.
InParticles, InParameters, and OutParticles read the input files and write to the output files, and read_segment and write_segment are called by InParticles and OutParticles to perform parallel I/O on both the CM-5 and the Paragon. Functions that are already defined in the element class Segment are declared as virtual functions in this collection declaration.
The inherited Fortran collection is a parent collection that handles inter-element communication. Fortran itself is derived from the SuperKernel collection.

The main program is listed below:

void Processor_Main() {
  Processors P;
  Distribution D(elem_count, &P, BLOCK);
  Align A(elem_count, "[ALIGN(X[i],D[i])]");
  SelfConsistField<Segment> X(&D, &A);
  // read initial model
  X.InParameters();
  X.InParticles();
  X.compute_polynomial();
  X.ReduceDoubleAdd(offset, variable_count);
  X.compute_acceleration();
  // main loop
  for (n = 1; n <= nsteps; n++) {
    X.update_position();
    X.compute_polynomial();
    X.ReduceDoubleAdd(offset, variable_count);
    X.compute_acceleration();
    X.update_velocity();
    X.OutParticles(n);
  }
}

where ReduceDoubleAdd is a reduction function inherited from SuperKernel, offset is measured from the beginning of the class Segment to the beginning of the field sinsum, and variable_count is the total number of array elements in sinsum and cossum. A leapfrog integration scheme is used to advance the particles.
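For illustration, the two reduction arguments might be obtained as in the following sketch (the paper does not show this; using offsetof on a class such as Segment is a simplification for exposition):

#include <cstddef>

// Hypothetical computation of the arguments to ReduceDoubleAdd:
// offset locates sinsum within Segment, and variable_count counts
// the elements of sinsum plus cossum.
size_t offset = offsetof(Segment, sinsum);
int variable_count = 2 * (lmax + 1) * (lmax + 1) * (nmax + 1);

The leapfrog cycle corresponds to the order of calls in the main loop; below is a minimal one-particle, one-dimensional sketch, with accel() standing in for the whole compute_polynomial / ReduceDoubleAdd / compute_acceleration phase (names are illustrative):

double accel(double x);  // hypothetical acceleration evaluation

void evolve(double &x, double &v, double dt, int nsteps) {
  for (int n = 1; n <= nsteps; n++) {
    x += dt * v;          // update_position: drift positions a full step
    double a = accel(x);  // recompute acceleration at the new position
    v += dt * a;          // update_velocity: kick the (staggered) velocity
  }
}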
CPU time in seconds spent in main loop without I/O
No. of Particles   pC++/CMF on CM-5   CMF on CM-5   pC++/F77 on Paragon
512                5.4 (76.6)         6.6 (40.9)    5.8
5120               8.6 (75.1)         9.9 (40.9)    25.0
51200              45.8 (89.1)        50.3 (57.4)   332.5

Total CPU time in seconds with parallel I/O
51200              49.1 (89.1)        58.2 (57.4)   396.3

Table 1: Execution time in seconds for 100 time steps on 32-node partitions with the expansions truncated at nmax = 6 and lmax = 4. Numbers in parentheses are the amount of memory used in MBytes. The total time includes reading the input model (2.87 MBytes), initializing the system, evolving the system for 100 evolution steps, and writing two output files (2.87 MBytes and 2.46 MBytes).

3.2 Timing Results

Our experiments with the pC++/CMF SCF code were conducted on the 32-node CM-5 at the University of Maryland and the 88-node Paragon at Indiana University.
For comparison, we also ran the CMF SCF code on the CM-5. The results of these experiments are listed in Table 1.

As can be seen in Table 1, the pC++/CMF code is, in fact, slightly faster than the CMF SCF code in all cases. This can be explained by the fact that the pC++/CMF code is able to use a faster global vector reduction function. The timing results show that our Fortran interface implementation is very efficient. Given the results shown in Table 1 and experiments conducted on NCSA's CM-5 using larger partition sizes, we conclude that the pC++/CMF code can achieve at least the 14.4 Gflop speed of the CMF code.

4 THE PARTICLE MESH CODE

Another N-body code in the dossier of the GC3 group is the Particle Mesh (PM) code (see Ferrel and Bertschinger 1993 [9]).
The particle-mesh method used in the PM code computes long-range gravitational forces in a galaxy or galaxy cluster system by solving for the gravitational potential on a mesh. The three-dimensional space is discretized by a three-dimensional grid. An average density for each grid point is then computed using the Nearest Grid Point scheme, in which the density value at a grid point is the sum of the masses of all particles nearest to that grid point.
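A minimal sketch of Nearest Grid Point assignment, assuming unit grid spacing, particle coordinates in [0, ngrid), and a periodic ngrid^3 mesh (names are illustrative, not the PM code's):

#include <cmath>

// NGP assignment: each particle's whole mass is added to the density
// value at its nearest grid point.
void ngp_assign(int npart, const double *x, const double *y, const double *z,
                const double *mass, int ngrid, double *density /* ngrid^3 */) {
  for (int p = 0; p < npart; p++) {
    int i = (int)std::floor(x[p] + 0.5) % ngrid;  // nearest grid index,
    int j = (int)std::floor(y[p] + 0.5) % ngrid;  // wrapped periodically
    int k = (int)std::floor(z[p] + 0.5) % ngrid;
    density[(i * ngrid + j) * ngrid + k] += mass[p];
  }
}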
Once the density values at the grid points are known, Fourier transforms are performed to compute the potential values at the grid points. Finally, the potential values at the grid points are interpolated back to the particles, and the particle positions and velocities are updated. Two data structures that are natural to the particle-mesh method are a one-dimensional particle list and a three-dimensional mesh.
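In Fourier space the Poisson equation becomes algebraic: each density mode is scaled by -4*pi*G / k^2 to give the corresponding potential mode, so the solve reduces to a forward FFT, a per-mode scaling, and an inverse FFT. A sketch of the scaling step, assuming the density has already been transformed (the FFT calls and the wavenumber table kvec depend on the library in use and are omitted):

#include <complex>

// Scale the transformed density into the transformed potential:
// phi_hat(k) = -4*pi*G * rho_hat(k) / |k|^2; the k = 0 (mean density)
// mode is zeroed.
void scale_modes(std::complex<double> *rho_hat, int ngrid,
                 const double *kvec, double G) {
  const double pi = 3.14159265358979323846;
  for (int i = 0; i < ngrid; i++)
    for (int j = 0; j < ngrid; j++)
      for (int k = 0; k < ngrid; k++) {
        double k2 = kvec[i]*kvec[i] + kvec[j]*kvec[j] + kvec[k]*kvec[k];
        std::complex<double> &c = rho_hat[(i * ngrid + j) * ngrid + k];
        c = (k2 > 0.0) ? std::complex<double>(-4.0 * pi * G / k2) * c
                       : std::complex<double>(0.0, 0.0);
      }
}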
4.1 The Particle List Collection

The particles in the simulation are first sorted according to their affinity to mesh points; particles closest to the same mesh point are neighbors in the sorted list. The sorted list is then divided into segments, and each segment forms an element of a particle list collection.

There are two approaches that we can follow when dividing the sorted list. Pertaining to the two approaches are issues of data locality and load balancing. In the first approach, the sorted list is evenly divided so that the segments have the same length. In the second, particles belonging to the same mesh points are grouped into the same segment, and segments will have different lengths.

In the first approach, load balancing is ensured because each processor has the same number of particles. However, this approach may cause particles belonging to the same mesh point to be distributed among different elements, thus requiring more inter-element communication and remote updates. The second approach allows a greater exploitation of data locality, but there is a potential load balancing problem. As the system evolves, particles (stars or galaxies) tend to group together into clumps. Consequently, some mesh points may have 1000 times more particles than other mesh points, and the segments that contain these mesh points will be much longer. Since we usually distribute the collection elements, in this case segments, evenly across the processors of a parallel machine, the processors that hold those long segments will do more work.
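The first approach amounts to a plain block partition of the sorted list; a minimal sketch of the segment boundaries it produces (a hypothetical helper, not part of the PM code):

// Even division of nparticles sorted particles into nsegments
// segments; segment s covers indices [start, start + len). The first
// (nparticles % nsegments) segments receive one extra particle.
void segment_bounds(int nparticles, int nsegments, int s,
                    int &start, int &len) {
  int base = nparticles / nsegments;
  int extra = nparticles % nsegments;
  len = base + (s < extra ? 1 : 0);
  start = s * base + (s < extra ? s : extra);
}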
We follow the first approach as an initial test. The Segment class is defined as

class Segment {
public:
  int particle_count; // Number of particles in the segment
  double x[particle_count], y[particle_count], z[particle_count],
         vx[particle_count], vy[particle_count], vz[particle_count],
         mass[particle_count];
  ...
};

where x, y, z, vx, vy, vz, and mass represent the positions, the velocities, and the masses of the particles. The ParticleList collection is defined as

Collection ParticleList : SuperKernel {
public:
  ParticleList(Distribution *D, Align *A);
  void sortParticles();
MethodOfElement:
  void pushParticle(Mesh<MeshElement> &G);
  void updateGridMass(Mesh<MeshElement> &G);
  ...
};

The function sortParticles() is a routine that sorts the particles in lexicographic order according to their positions.
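One way to realize this ordering, consistent with the earlier description of sorting by affinity to mesh points, is to compare particles by the linearized index of their nearest grid point; a hypothetical sketch of such a sort key (unit grid spacing assumed, not the paper's implementation):

#include <cmath>

// Hypothetical sort key: particles closest to the same mesh point get
// the same key and therefore end up adjacent in the sorted list.
int sort_key(double x, double y, double z, int ngrid) {
  int i = (int)std::floor(x + 0.5) % ngrid;
  int j = (int)std::floor(y + 0.5) % ngrid;
  int k = (int)std::floor(z + 0.5) % ngrid;
  return (i * ngrid + j) * ngrid + k;
}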