To appear in: Proceedings of the Supercomputing '93 Conference, Portland, Oregon, November 15-19, 1993.

Implementing a Parallel C++ Runtime System for Scalable Parallel Systems (1)

F. Bodin, Irisa, University of Rennes, Rennes, France (Francois.Bodin@cs.irisa.fr)
P. Beckman, D. Gannon, S. Yang, Dept. of Comp. Sci., Indiana University, Bloomington, Indiana 47405 ({beckman,gannon,yang}@cs.indiana.edu)
S. Kesavan, A. Malony, B. Mohr, Dept. of Comp. and Info. Sci., University of Oregon, Eugene, Oregon 97403 ({kesavan,malony,mohr}@cs.uoregon.edu)

Abstract

pC++ is a language extension to C++ designed to allow programmers to compose "concurrent aggregate" collection classes which can be aligned and distributed over the memory hierarchy of a parallel machine in a manner modeled on the High Performance Fortran Forum (HPFF) directives for Fortran 90.
pC++ allows the user to write portable and efficient code which will run on a wide range of scalable parallel computer systems. The first version of the compiler is a preprocessor which generates Single Program Multiple Data (SPMD) C++ code. Currently, it runs on the Thinking Machines CM-5, the Intel Paragon, the BBN TC2000, the Kendall Square Research KSR-1, and the Sequent Symmetry. In this paper we describe the implementation of the runtime system, which provides the concurrency and communication primitives between objects in a distributed collection. To illustrate the behavior of the runtime system we include a description and performance results on four benchmark programs.

1 Introduction

pC++ permits programmers to build distributed data structures with parallel execution semantics. For "distributed memory" machines, with a non-shared address space, the runtime system implements a shared name space for the objects in a distributed collection.
This shared name space is supported by the underlying message passing system of the target machine. In the case of "shared memory" architectures, the runtime system uses the global addressing mechanism to support the name space. A thread system on the target machine is used to support the parallel tasks.

(1) This research is supported by ARPA under Rome Labs contract AF 30602-92-C-0135, the National Science Foundation Office of Advanced Scientific Computing under grant ASC-9111616, and Esprit under the BRA APPARC grant.

After a short introduction to pC++ we give a detailed description of each runtime system.
To illustrate the behavior of the runtime system we include performance results for four benchmark programs.

2 A Brief Introduction to pC++

The basic concept behind pC++ is the notion of a distributed collection, which is a type of concurrent aggregate "container class" [7, 9]. More specifically, a collection is a structured set of objects distributed across the processing elements of the computer.
In a manner designed to be completely consistent with HPF Fortran, the programmer must define a distribution of the objects in a collection over the processors and memory hierarchy of the target machine. As HPF becomes more available, future versions of the pC++ compiler will allow object level linking between distributed collections and HPF distributed arrays. A collection can be an Array, a Grid, a Tree, or any other partitionable data structure.
Collections have the following components:

- A collection class describing the basic topology of the set of elements.
- A size or shape for each instance of the collection class; e.g., array dimension or tree height.
- A base type for collection elements. This can be any C++ type or class. For example, one can define an Array of Floats, a Grid of FiniteElements, a Matrix of Complex, or a Tree of Xs, where X is the class of each node in the tree.
- A Distribution object. The distribution describes an abstract coordinate system that will be distributed over the available "processor objects" of the target by the runtime system. (In HPF [8], the term template is used to refer to the coordinate system. We will avoid this so that there will be no confusion with the C++ keyword template.)
- A function object called the Alignment. This function maps collection elements to the abstract coordinate system of the Distribution object.

The pC++ language has a library of standard collection classes that may be used (or subclassed) by the programmer [10, 11, 12, 13]. This includes collection classes such as DistributedArray, DistributedMatrix, DistributedVector, and DistributedGrid.
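The relationship among these components can be pictured with a small, self-contained C++ sketch. Everything below (the names Map, Dist, Collection, owner) is our illustration, not the pC++ library: a collection pairs a base type with a shape and a distribution, and each processor object holds only the elements that the distribution maps to it.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch of a collection's components (not pC++ library code).
enum class Map { Block, Cyclic };

struct Dist {
    std::size_t size;   // a 1-D coordinate system, for brevity
    std::size_t procs;  // number of processor objects
    Map map;
    // Which processor owns abstract coordinate i?
    std::size_t owner(std::size_t i) const {
        if (map == Map::Cyclic) return i % procs;
        std::size_t block = (size + procs - 1) / procs;  // ceiling division
        return i / block;
    }
};

template <class T>
struct Collection {
    Dist dist;
    std::vector<std::vector<T>> local;  // local[p] = elements owned by proc p
    explicit Collection(Dist d) : dist(d), local(d.procs) {
        for (std::size_t i = 0; i < d.size; ++i)
            local[dist.owner(i)].push_back(T{});
    }
};
```

The base type is the template parameter T, the shape is `size`, and `owner` plays the role of the distribution; a real alignment would sit between element indices and the coordinate system that `owner` partitions.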
To illustrate the points above, consider the problem of creating a distributed 5 by 5 matrix of floating point numbers. We begin by building a Distribution. A distribution is defined by its number of dimensions, the size in each dimension, and how the elements are mapped to the processors. Current distribution mappings include BLOCK, CYCLIC and WHOLE, but more general forms will be added later. For our example, let us assume that the distribution coordinate system is distributed over the processors' memories by mapping WHOLE rows of the distribution index space to individual processors using a CYCLIC pattern, where the ith row is mapped to processor memory i mod P on a P processor machine.

pC++ uses a special implementation dependent library class called Processors. In the current implementation, it represents the set of all processors available to the program at run time.
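The (CYCLIC, WHOLE) row rule just described is easy to state in plain C++. The helper names below (rowOwner, localRows) are ours, not pC++ library calls; they only compute which processor memory holds each whole row under the i mod P pattern.

```cpp
#include <cstddef>
#include <vector>

// Row i of the distribution index space lands on processor i mod P
// (CYCLIC over rows; WHOLE keeps each row intact on one processor).
std::size_t rowOwner(std::size_t i, std::size_t P) {
    return i % P;
}

// The rows held locally by processor p, for an n-row distribution over P
// processors: p, p+P, p+2P, ...
std::vector<std::size_t> localRows(std::size_t p, std::size_t n, std::size_t P) {
    std::vector<std::size_t> rows;
    for (std::size_t i = p; i < n; i += P)
        rows.push_back(i);
    return rows;
}
```

For the 7-row distribution built next, on a hypothetical 4-processor machine, processor 1 would hold rows 1 and 5.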
To build a distribution of some size, say 7 by 7, with this mapping, one would write

```cpp
Processors P;
Distribution myDist(7,7,&P,CYCLIC,WHOLE);
```

Next, we create an alignment object called myAlign that defines a domain and function for mapping the matrix to the distribution. The matrix A can be defined using the library collection class DistributedMatrix with a base type of Float.

```cpp
Align myAlign(5,5,"[ALIGN(domain[i][j],myMap[i][j])]");
DistributedMatrix<Float> A(myDist,myAlign);
```

The collection constructor uses the alignment object, myAlign, to define the size and dimension of the collection. The mapping function is described by a text string corresponding to the HPF alignment directives. It defines a mapping from a domain structure to a distribution structure using dummy index variables. The intent of this two stage mapping, as it was originally conceived for HPF, is to allow the distribution coordinates to be a frame of reference so that different arrays could be aligned with each other in a manner that promotes memory locality.

2.1 Processors, Threads, and Parallelism

The processor objects used to build distributions for collections represent a set of threads.
Given the declaration

```cpp
Processors P;
```

one thread of execution is created on each processor of the system that the user controls. These new processor object (PO) threads exist independent of the main program control thread. (In the future, pC++ will allow processor sets of different sizes and dimensions.) Each new PO thread may read but not modify the "global" variables; i.e., program static data or data allocated on the heap by the main control thread. Each PO thread has a private heap and stack.

Collections are built on top of a more primitive extension of C++ called a Thread Environment Class, or TEClass, which is the mechanism used by pC++ to ask the processor object threads to do something in parallel. A TEClass is declared the same as any other class, with the following exceptions:

- There must be a special constructor with a Processors object argument. Upon invocation of this constructor, one copy of the member field object is allocated to each PO thread described by the argument. The lifetime of these objects is determined by their lifetime in the control thread.
- A TEClass object may not be allocated by a PO thread.
- The () operator is used to refer to a single thread environment object by the control thread.
- A call to a TEClass member function by the main program control thread represents a transfer of control to a parallel action on each of the threads associated with the object. (Consequently, member functions of the TEClass can read but cannot modify global variables.) The main control thread is suspended until all the processor threads complete the execution of the function. If the member function returns a value to the main control thread, it must return the same value from each PO thread or the result is undefined.
- If a TEClass member function is invoked by one of the processor object threads, it is a sequential action by that thread. (Hence, there is no way to generate nested parallelism with this mechanism.)

These issues are best illustrated by an example.

```cpp
int x;             // c++ global
float y[1000];     // c++ global

TEClass MyThreads {
   int id;                       // private thread data
 public:
   float d[200];                 // public thread data
   void f(){ id++; }             // parallel functions
   int getX(int j){ return x; }
};

main() {
   Processors P;    // the set of processors
   MyThreads T(P);  // implicit constructor:
                    // one thread object/proc.
   // a serial loop
   for(int i = 0; i < P.numProcs(); i++)
      T(i).id = i;  // main control thread can
                    // modify i-th thread env.
   T.f();  // parallel execution on each thread;
           // an implicit barrier after parallel call
}
```

[Figure 1: the thread and memory model, showing the C++ globals x and y shared by all threads, the per-thread copies T(0)..T(3) of the member field id, and the barrier following the parallel call T.f().]

In this example, the processor set P is used as the parameter to the thread environment constructor. One copy of the object with member field id is allocated to each PO thread defined by P.
The lifetime of T is defined by the main control thread in which it was created. (However, in the current implementation the storage is not automatically reclaimed.) Figure 1 illustrates the thread and memory model that the language provides.

The main control thread can access and modify the public member fields of the TEClass object. To accomplish this, one uses the () operator, which is implicitly overloaded. The reference T(i).id refers to the id field in the ith TEClass object.
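The calling semantics above can be mimicked with ordinary C++11 threads. This is only a sketch of the semantics, not the pC++ implementation: ThreadEnv and parallel_f are invented names. T(i) selects the i-th per-thread copy from the control thread, and the "parallel" call runs the member function on every copy while the control thread waits, which supplies the implicit barrier.

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Sketch (not pC++ itself) of TEClass semantics using std::thread.
struct MyThreads {
    int id = 0;         // private per-thread data
    void f() { id++; }  // the parallel action
};

struct ThreadEnv {
    std::vector<MyThreads> copies;  // one copy per PO thread
    explicit ThreadEnv(std::size_t nprocs) : copies(nprocs) {}

    // T(i): the control thread selects the i-th thread environment object.
    MyThreads& operator()(std::size_t i) { return copies[i]; }

    // T.f(): run f() on every copy in its own thread, then wait for all.
    void parallel_f() {
        std::vector<std::thread> pool;
        for (auto& c : copies)
            pool.emplace_back([&c] { c.f(); });
        for (auto& t : pool)
            t.join();  // control thread suspended: the implicit barrier
    }
};
```

The join loop plays the role of the implicit barrier: control does not resume until every PO thread has finished f(), matching the suspension rule described above.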