shpcc94, page 2
It may be possible to view a returned C++ structure as a Fortran90 instance of a user defined type, but returning a C pointer would be a big problem.

pC++ calling HPF:

a. A call to an HPF parallel subroutine from a pC++ sequential main thread, i.e., a pC++ main program.

b. A call to a Fortran "node program", i.e., a sequential or vector subroutine that is called from a parallel section of a pC++ program.

Of these we will focus on designing an interface for the case of pC++ calling Fortran. The data structure manipulation and encapsulation properties of C++ and the fast computing speed of Fortran make it natural for Fortran subroutines to be called as library routines and for C++ to act as the main controller.

Let us first consider case a. A pC++ element class which contains three arrays, two of the special type FArrayDouble and one a conventional C++ array, is given below.

    class MyElement {
    public:
        FArrayDouble x, y;
        double z[100];
        MyElement();
    };

The special type FArrayDouble is used to contain double precision Fortran90 arrays and their descriptor structures.
To define the constructor for the class we assume that x is a one dimensional array of size 100 and y is a two dimensional array of size 200 × 50.

    MyElement::MyElement() : x(100), y(200,50) {
        ...
    }

To use this class in a pC++ program that calls an HPF subroutine, we need to define the Processors object and to specify the size and shape of the collection. In the following example, a one dimensional collection of size 64 with elements of type MyElement is constructed and an HPF subroutine, FFUN, is called.

    extern "HPF" void FFUN(FArrayDouble&, FArrayDouble&, double*);

    main() {
        Processors P(64);
        Distribution D(64, &P, BLOCK);
        Align A(64, "[ALIGN(T[i],D[i])]");
        HPFCollection<MyElement> C(&D, &A);
        FFUN(C.x, C.y, C.z);
        ...
    }

The important point to consider is what the passed arguments look like in terms of HPF distributed arrays. In this case, they can be defined by the HPF directives.

    !HPF$ PROCESSORS P(64)
          double precision x(100*64),
         &                 y(200*64,50), z(100*64)
    !HPF$ DISTRIBUTE x(BLOCK), y(BLOCK,*) ONTO P
    !HPF$ DISTRIBUTE z(BLOCK) ONTO P

It is relatively easy to see how the blocked decomposition corresponds to our simple collection of array blocks.
When we have a two dimensional array of virtual processors, we would have a pC++ program as below:

    main() {
        Processors P(64,32);
        Distribution D(64, 32, &P, BLOCK, BLOCK);
        Align A(64, 32, "[ALIGN(T[i][j],D[i][j])]");
        HPFCollection<MyElement> C(&D, &A);
        FFUN(C.x, C.y, C.z);
        ...
    }

In this case, the HPF function would see these structures as defined by the following sequence of directives.

    !HPF$ PROCESSORS P(64,32)
          double precision x(100*64,32),
         &                 y(200*64,50*32), z(100*64,32)
    !HPF$ DISTRIBUTE x(BLOCK,BLOCK) ONTO P
    !HPF$ DISTRIBUTE y(BLOCK,BLOCK) ONTO P
    !HPF$ DISTRIBUTE z(BLOCK,BLOCK) ONTO P

The reader will note that we have not described how to create arrays that are distributed with anything other than a BLOCK distribution, and that array sizes are restricted to be multiples of the processor array size. There are ways around these restrictions, but they will not be discussed here.

4 Working with Connection Machine node Fortran on the CM-5

In this section we consider case b, in which a Fortran node program is called from a parallel section of pC++.
One of the problems in experimenting with HPF compilers is that only a few are now beginning to emerge. At the time of this writing, we do not have access to any of these systems, so we have conducted our experiments with Thinking Machines Corp.'s CM Fortran (CMF) [3] on the CM-5. CMF is a reasonable approximation to the spirit of HPF.

There are three major issues that we want to address: Fortran array memory allocation, Fortran array element access, and Fortran array interprocessor communication. As will be seen later in this section, the first two are addressed through new C++ classes, and the third through a new pC++ collection.

A CM-5 processing node consists of a Sparc microprocessor and four vector processor accelerators (vector units).
To make use of these vector units, arrays have to be allocated in the memories of the vector units. The Connection Machine Run Time System (CMRTS) [4] provides functions for allocating space in the vector unit memories. The allocation is not trivial, however, since one also needs to be concerned with how array elements will be partitioned among the four vector units and with remaining compatible with CMF. We use the CMRTS function CMRT_intern_detailed_geometry to create a geometry and CMRT_allocate_heap_array to allocate memory.
Once the arrays are allocated, the array descriptors returned by CMRT_allocate_heap_array are passed to Fortran subroutines as arguments. Computations are done in the Fortran subroutines. These Fortran arrays can also be directly manipulated by pC++ control programs through overloaded C++ operators.

We design a special C++ array class for each Fortran data type. The following class defines an array class for Fortran double precision arrays:

    class FArrayDouble {
        // private part of class
    public:
        double *d;
        FArrayDouble() { d = 0; };
        FArrayDouble(int i);
        FArrayDouble(int i, int j);
        FArrayDouble(int i, int j, int k);
        FArrayDouble(int i, char layout1);
        FArrayDouble(int i, int j, char layout1, char layout2);
        FArrayDouble(int i, int j, int k,
                     char layout1, char layout2, char layout3);
        ~FArrayDouble();
        double& operator()(int l);
        double& operator()(int l, int n);
        double& operator()(int l, int n, int m);
    };

The private part of the class is used to store the memory addresses of an array on the four vector units.
This information is used for accessing individual array elements in pC++ and, in general, should not be of concern to the user. The class constructors that require arguments allocate memory on the vector units when invoked. The arguments i, j, and k specify the dimension sizes of the array, and layout1, layout2, and layout3 correspond to the CMF compiler directive for array layouts.
A layout can be either "SERIAL" or "NEWS". A "NEWS" dimension means that the dimension will be distributed across the four vector units. A "SERIAL" dimension means the dimension will be packed into one vector unit's memory. An array with only "SERIAL" dimensions signals that it should be allocated in the Sparc chip's memory and is thus not allowed in the current implementation; such an array can be allocated just like a normal C++ array, which is always allocated in the Sparc chip's memory. The constructors that do not take layout information use the default "NEWS" layout.
The overloaded operators are used for accessing array elements.

The method we use to access array elements on the vector units from the Sparc chip differs from that of CMF. In the CMF assembly code, array element access involves computing "send addresses." In our approach, we use the "subgrid dimension," "off-chip position," "subgrid axis increment," and "subgrid size" provided by CMRTS to access each individual array element. It turns out that our method is as fast as CMF when all axes are declared "NEWS" and can be a few times faster than CMF when some axes are declared "SERIAL." For example, the speed for accessing a 64 × 64 double precision array is 5.68 × 10^5 bytes/second in pC++ and 3.81 × 10^5 bytes/second in CMF when the first dimension is declared "NEWS" and the second "SERIAL."

Besides memory allocation, we also provide methods for inter-element communication for the Fortran arrays.
We design a special pC++ collection for this purpose:

    Collection Fortran : public SuperKernel {
    public:
        Fortran(Distribution *D, Align *A);
        void GetFArray(int index, int offset,
                       FArrayDouble &buffer);
        void GetFArray(int index1, int index2, int offset,
                       FArrayDouble &buffer);
    };

The Fortran collection is derived from a special root collection called SuperKernel. The SuperKernel collection uses the distribution and alignment objects to initialize a collection and allocate the appropriate element objects in each processor's memory.
The kernel also provides a global name space for the collection elements. The communication function GetFArray fetches an FArray array from a collection element denoted by the given index(es).

A Fortran subroutine is declared as a method of an element class. An invocation of a Fortran subroutine declared this way results in separate invocations of the Fortran subroutine on each processor. The Fortran subroutine is thus similar to a message-passing CMF "node program", though explicit message passing is strongly discouraged in the Fortran subroutine. Communication between collection elements is accomplished in the pC++ control program. Since Fortran compilers usually add underscores to subroutine names, the pC++ compiler will need to generate a "wrapper function" for each Fortran subroutine.

By way of illustration, the following shows how a Fortran subroutine is called from pC++ utilizing the Fortran interface.

    extern "C" {
        // Fortran subroutine is declared external
        void integrate_seg_(double*, double*, int&, double&);
    }

    class Segment {
    public:
        double seg_sum;
        int length;
        FArrayDouble x, y;
        Segment();
        void integrate_seg();
    };

    Segment::Segment() : x(length), y(length) {
        // FArrays are allocated in constructor
    }

    void Segment::integrate_seg() {
        // wrapper invokes Fortran subroutine
        integrate_seg_(x.d, y.d, length, seg_sum);
    }

    void main() {
        Processors P;
        Distribution D(64, &P, CYCLIC);
        Align A(64, "[ALIGN(T[i],D[i])]");
        Fortran<Segment> F(&D, &A);
        F.integrate_seg();
    }

The Fortran subroutine is given below:

          subroutine integrate_seg(x, y, length, seg_sum)
          double precision seg_sum
          integer length
          double precision, array(length) :: x, y
    cmf$  layout x(:news), y(:news)
          seg_sum = SUM(x*y)
          return
          end

In this example, the Fortran subroutine computes the dot product between two one-dimensional arrays x and y.
The x and y arrays are passed to the Fortran subroutine as "explicit-shape" arrays; the interface does not, however, prevent us from declaring them as "assumed-shape" arrays.

One shortcoming of the current Fortran interface implementation is that the pC++ control program cannot access Fortran COMMON blocks. This requires that all Fortran global variables be declared in the pC++ control program and passed to Fortran subroutines as arguments. Nevertheless, the interface does allow communication between Fortran subroutines through COMMON blocks. It is important to note, however, that the pC++ programming model allows more than one pC++ collection element to be allocated on one processor, meaning that Fortran subroutines called by different collection elements will access the same COMMON block.
Data in the COMMON block are shared by the elements on that processor, and race conditions may arise if the programmer is not careful. The safest way to avoid this problem is to allocate only one collection element per processor.

5 Working with Intel node Fortran on the Paragon

We have also ported the Fortran interface to the Intel Paragon. Unfortunately, a Fortran90-style compiler is not available on the Paragon, so we can only experiment with the Intel Fortran77 compiler. Since the Fortran interface is designed to be portable, most of what we have discussed for the CM-5 still applies to the Paragon. The pC++ program in the previous section can be ported directly to the Paragon without any change, except that the layout specifications are ignored on the Paragon.