Note that the value of the expression T.id within the main control thread may not be well defined, because each thread may have a different value for id. However, the assignment T.id = 1 is valid and denotes an update to all members named id. An individual PO thread cannot modify the local fields of another PO thread, but it can access them by means of the () operator. The only other way for PO threads to communicate is by means of native system message passing, but this is not encouraged until a C++ binding for the standard message passing interface is defined.

The call T.f() indicates a branch to a parallel operation on each PO thread. After the parallel execution of the method, there is a barrier synchronization before returning to the main control thread.

[Figure 1: TEClass Objects, Collections and Processor Threads. The figure shows a DistributedArray<E> A(&D,&A) whose element blocks A(0:3), A(4:7), A(8:11) and A(12:15) are distributed over four processor object threads.]
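As a rough illustration of these rules, consider the following sketch; the class name MyThreads is hypothetical and the declaration is abbreviated rather than the full TEClass syntax.

   TEClass MyThreads {   // one instance exists in each processor object (PO) thread
    public:
      int id;            // each PO thread holds its own copy of id
      void f();          // the body of f is executed in parallel, once per PO thread
   };

   // In the main control thread, with T an instance of MyThreads:
   //   T.id = 1;      // valid: updates the member id in every PO thread
   //   int k = T.id;  // not well defined: the threads may hold different values of id
   //   T.f();         // parallel invocation of f, followed by a barrier synchronization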
In the case of invoking a member function such as T.getX(), which has a non-void return value, it is essential that the function return an identical value on each PO thread for a given main thread invocation.

Note that the TEClass mechanisms provide a simple and direct way to "wrap" message-passing, SPMD-style C++ routines inside a standard sequential main program. In this way, many of the special libraries of parallel C++ code already designed can be easily integrated into this model [16, 17].

2.2 Collections and Templates

The C++ language provides the templates mechanism to build parameterized classes.
Collections are very similar to templates defined over TEClasses. In fact, it is almost sufficient to describe a collection class as

   template <class ElementType>
   TEClass MyCollection : Kernel<ElementType> {
      MyCollection(Distribution &D, Align &A) : Kernel(D,A) { ... }
      ...
   };

where Kernel is a super-template class that defines the concurrent aggregate structure of the collection. Indeed, as will be shown in the following sections of this paper, it is the structure of Kernel and its basic constructors and member functions that is at the core of the runtime system for pC++.

While the construction above is nearly sufficient to define collections, it does not give us everything we want.
In particular, collections define a generalization of data parallelism as follows. Let C be a collection type and E an element type. Assume E has the form

   class E {
      int a;
      void f();
      E & operator +(E&);
   };

and let x and y be declared to be of type C<E>. Because + is defined on the elements, one can write x+y; this means the "pointwise" addition of the elements, and the result is a new collection of type C<E>. In a similar manner, the expression x.a + y.a is a new collection of type C<int>. In addition, the expression x.f() means the parallel application of the element member function f to each element of the collection. These operations, together with collection-specific reductions, form the main parallel control in pC++.

To accomplish this we provide a special form of the C++ template construct with a distinguished class parameter, called ElementType, which defines the type of the element in the collection.
The exact syntax is shown below:

   Collection<class ElementType> TEClass CollName : ParentCollection {
    public:
      CollName(Distribution &D, Align &A);
    private:
    protected:
      // TEClass member functions and fields.
      // Methods declared here are executed in
      // parallel by the associated PO thread.
    MethodOfElement:
      // Fields declared here are added to each
      // element, Methods to the element class.
   };

Data fields defined in the public, private and protected areas are components of the underlying TEClass object. The size and shape of the collection, and the way in which the elements of the collection are distributed to the processor object threads, are defined by the Distribution and Alignment objects supplied to the constructor for the collection.
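For example, a collection of the kind shown in Figure 1 might be created and used roughly as follows; the Processors type and the constructor arguments for Distribution and Align are assumptions for illustration, not taken from the text above.

   Processors P;                        // the available set of processor objects (assumed name)
   Distribution D(16, &P, BLOCK);       // distribute 16 element positions block-wise (assumed arguments)
   Align A(16, "[ALIGN(X[i], D[i])]");  // align element i with distribution position i (assumed syntax)

   DistributedArray<E> x(&D, &A);       // as in Figure 1: four elements per PO thread
   DistributedArray<E> y(&D, &A);

   x.f();                               // parallel application of E::f to every element of x
   // x + y would denote the pointwise addition of the elements of x and y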
The set of elements that are mapped to a single PO thread object is called the "local collection" for that thread. The data structure that contains and manages these elements is part of the Kernel class, which is a required ancestor of any collection.

2.3 Communication Between Elements in a Collection

One of the key ideas behind collection classes is to enable the programmer to build a distributed data structure where data movement operators can be closely linked to the semantics of the data type. For example, elements of a DistributedGrid need to access their neighbors.
A Tree node will often reference its children or parent. Each application has a topology of communication and global operators that must be supported efficiently by the collection structure.

If c is a collection of type C<E>, then any processor thread may access the ith element of c by the Kernel functions

   c->Get_Element(i);
   c->Get_ElementPart(i, offset, size);

The first function accesses the ith element, places a copy in a local system buffer, and returns a pointer of type (ElementType *). The second function accesses part of an element at a given offset and size and makes a copy in the buffer.
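A minimal usage sketch, assuming E is the element type of the collection c; the variable names, and the return type assumed for Get_ElementPart, are illustrative only.

   E    *elem = (E *) c->Get_Element(i);              // copy of element i, held in a local system buffer
   void *part = c->Get_ElementPart(i, offset, size);  // copy of the requested part of element i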
In other collections, such as distributed arrays, matrices and grids, the operator (...) has been overloaded. For example, if c is a two-dimensional collection, then expressions like

   x = c(i,j) + c(i+1,j)

work by calling the Get_Element function if a reference is not to a local element.

3 The pC++ Runtime System

Get_Element and Get_ElementPart are the only communication primitives in pC++ that allow processor threads to access the shared name space and remote elements. Notice that this differs significantly from the "message passing" style that is common in SPMD programming. In the latter model, all synchronization is accomplished by matching a send operation with a corresponding receive.
In pC++, any processor object thread may read any element of any collection, but only the owner object thread may modify the element; this is equivalent to the "owner computes" semantics found in HPF. All synchronization is in terms of barrier operations that terminate each collection operation.

The runtime system for pC++ must manage three distinct tasks.

1. The allocation of collection classes. This involves the interpretation of the alignment and distribution directives to build the local collection for each processor object.

More specifically, each processor object must have a mechanism whereby any element of the collection can be identified. In a shared address space environment, this may be a table of pointers or a function that computes the address of an element.
In the non-shared address space model, this may be the name of a processor that either has the object or knows where to find it. Depending upon the execution model of the target, this task may also involve the initialization of threads associated with each processor object.

2. The management of element accesses. In particular, access management requires an effective, efficient implementation of the Get_Element and Get_ElementPart functions.

This activity can be seen as a compiler-assisted "local caching" of a remote element to the local processor thread's address space.
In a shared address space environment, alternative memory management schemes may be used to improve memory locality. If there is no shared address space, these functions require a "one-sided" communication protocol: if processor X needs an element from processor Y, processor X must wake up an agent that has access to the address space of Y, which can fetch the data and return it to X.

3. Termination of parallel collection operations. All parallel collection operations are barrier synchronized before returning to the main thread.

Note that only the processor objects involved in the computation of the collection operation must be synchronized; not every processor in the system need be involved.
However, as we shall see, this may be required for some implementations.

Some restrictions are imposed by the current pC++ compiler that are important to note for the runtime system implementation. The current version of the pC++ compiler generates SPMD code in which the set of processor objects for each collection is the same as the set of processors in the user's execution partition. There is one execution thread per processor and, on one processor, all local collection objects share this thread. In true SPMD style, the main thread is duplicated and is run with the single local thread on each processor.

This model of execution is consistent with all current HPF implementations, but imposes some limitations on the programmer.
For example, it is not possible to have two different collection operations going on in different subsets of processors concurrently. Furthermore, this limits the system from having nested concurrency by building collections of collections. Even with these limitations, this is still a powerful programming model.
These limitations will be removed in future versions of the compiler. In the paragraphs that follow, we will look at the shared address and distributed address space implementations of the current pC++ execution model.

3.1 Distributed Memory Systems

Distributed memory parallel systems consist of a set of processing nodes interconnected by a high-speed network. Each node consists of a processor and local memory. In the case of a non-shared, distributed memory system, each processor only has access to its local memory, and a message system is used to move data across the network between processors.

One common approach to building a shared memory system on top of a non-shared, distributed memory computer is called shared virtual memory (SVM). In SVM, the message passing system is used to move pages from one processor to another in the same way a standard VM system moves pages from memory to disk. Though these systems are still experimental [14, 15], they show great promise for providing a support environment for shared memory parallel programming models.
Because pC++ is based on a shared name space of collection elements, we use SVM techniques to build the runtime system for the Intel Paragon and the Thinking Machines CM-5.

More specifically, our model is based on ideas from Koan [15]. The basic idea is that each collection element has a manager and an owner. The owner of the element is the processor object that contains the element in its local collection. As with pages in an SVM, we assume that an element may be moved from one local collection to another at run time, for load balancing reasons, or, in the case of dynamic collections, may be created and destroyed at run time.
In other words, we assume that the distribution map may be dynamic. Although this dynamic feature of the system is not being used by the current compiler (it will be supported in future releases), it is a design requirement for the runtime system implementation. The purpose of the element manager is to keep track of which processor object owns the element.
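The following is a minimal sketch of how such a manager/owner protocol might resolve a Get_Element request on a distributed memory machine; all helper names (owner_of, manager_of, ask_manager, request_copy, element_buffer) are hypothetical and are not part of the pC++ interface described above.

   // Sketch only: how element i of a collection might be resolved at run time.
   ElementType *Get_Element(int i) {
      if (owner_of(i) == my_thread_id)            // the owner keeps element i in its local collection
         return &local_collection[i];
      int owner = ask_manager(manager_of(i), i);  // the manager records which PO thread owns i now
      // One-sided request: an agent with access to the owner's address space
      // copies the element into a local system buffer.
      request_copy(owner, i, element_buffer);
      return (ElementType *) element_buffer;
   }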