fdvmLDe (1158336), страница 7
Текст из файла (страница 7)
Shadow edges for multidimensional distributed arrays can be specified for each dimension. A special case is when it is required to renew "a corner" of shadow edges. In such a case additional CORNER parameter is needed.
Example 6.2. Specification of SHADOW-references with corner elements
REAL A(100,100), B(100,100)
CDVM$ ALIGN B(I,J) WITH A(I,J)
CDVM$ DISTRIBUTE A (BLOCK,BLOCK)
. . .
CDVM$ PARALLEL (I,J) ON A (I,J), SHADOW_RENEW (B(CORNER))
DO 10 I = 2, 99
DO 10 J = 2, 99
A(I,J) = (B(I,J+1) + B(I+1,J) + B(I+1,J+1) ) / 3
10 CONTINUE
The width of shadow edges of the array B is equal to 1 for both dimensions by default. As "corner" reference B(I+1,J+1) exists, the CORNER parameter is specified.
| shadow edges | |||||||||
| sent values | |||||||||
| internal area | |||||||||
| corner elements |
Fig. 6.2. Scheme of array local section with shadow edges.
6.2.3Computing values in shadow edges. SHADOW_COMPUTE clause
In sections 6.2.1 and 6.2.2 the ways to renew values in shadow edges by data exchange between processors were described. At that new values of an array are calculated in one loop, but renewing is performed either before execution or during execution of other loop.
Under some conditions the values in shadow edges can be renewed without data exchange between processors. New values of shadow edges can be calculated in the same loop, where new values of the arrays are calculated. This way of shadow edge renewing is described by SHADOW_COMPUTE specification, which is clause of PARALLEL directive.
Let consider a parallel loop
CDVM$ PARALLEL (I1,I2,…,In) ON A(f1,f2,…,fk)
where A is identifier of the array, the loop iterations are distributed according to.
Let
{LH} - set of references to distributed arrays in left parts of the assignment statements of
the loop body.
{RH} - set of references to distributed arrays in right parts (expressions) of the assignment
statements of the loop body.
The following conditions must be satisfied to apply SHADOW_COMPUTE specification:
-
the width of shadow edges of distributed dimensions of arrays from {LH} and {RH} must be less than the width of shadow edges of corresponding dimensions of array A;
-
shadow edges of arrays from {RH} must be filled with the values, corresponding to values of the arrays.
During the loop execution the values of shadow edges of arrays from {LH} are renewed. The width of renewed part of every shadow edge is equal to the width of corresponding shadow edge of array A.
Example 6.3. SHADOW_COMPUTE specification.
CDVM$ DISTRIBUTE (BLOCK) :: A, B, C, D
CDVM$ SHADOW A(1:2)
CDVM$ SHADOW B(2:2)
CDVM$ PARALLEL ( I ) ON C( I ), SHADOW_COMPUTE,
CDVM$* SHADOW_RENEW( A, B )
DO 10 I = 1, N
C(I) = A(I) + B(I)
D(I) = A(I) - B(I)
10 CONTINUE
As by default the width of shadow edges for C and D array is 1, the condition 1) is satisfied. Performing of SHADOW_RENEW specification satisfies the condition 2).
6.2.4ACROSS specification of dependent references of SHADOW type for single loop
Consider the following loop
DO 10 I = 2, N-1
DO 10 J = 2, N-1
A(I,J) = (A(I,J-1) + A(I,J+1) + A(I-1,J) + A(I+1,J)) / 4
10 CONTINUE
Data dependence exists between loop index i1 and i2 ( i1<i2 ), if both these iterations refer to the same array element by write-read or read-write scheme.
If iteration i1 writes a value and iteration i2 reads this value, then flow dependence, or simply dependence exists between the iterations.
If iteration i1 reads the "old" value and iteration i2 writes the "new" value, then anti-dependence i1 i2 exists between the iterations.
In both cases iteration i2 can be executed only after iteration i1.
The value i2 - i1 is called a range or length of dependence. If for any iteration i dependent iteration i + d (d is constant) exists, then the dependence is called regular one, or dependence with constant length.
The loop with regular computations, with regular dependencies on distributed arrays, can be distributed with PARALLEL directive, using ACROSS clause.
across-clause | is ACROSS ( dependent-array-list ) |
dependent-array | is dist-array-name ( dependence-list ) [(section-spec-list)] |
dependence | is flow-dep-length : anti-dep-length |
flow-dep-length | is int-constant |
anti-dep-length | is int-constant |
section-spec | is SECTION (section-subscript-list) |
All the distributed arrays with regular data dependence are specified in ACROSS clause. The length of flow dependence (flow-dep-length) and the length of anti-dependence (anti-dep-length) are specified for each dimension of the array. There is no data dependence, if length is equal to zero.
Constraint:
-
In each array reference, data dependence may appear only in one distributed dimension. For example, A(I-1,J), A(I,J+1) references are allowed, but A(I‑1,J+1), A(I+1,J-1) references are forbidden.
If only the values of the sections of array are renewed in the loop but not whole array then these sections should be specified by SECTION clause.
Example 6.4. Specification of the loop with regular data dependence.
CDVM$ PARALLEL (I,J) ON A(I,J), ACROSS (A(1:1,1:1))
DO 10 I = 2, N-1
DO 10 J = 2, N-1
A(I,J) = (A(I,J-1) + A(I,J+1) + A(I-1,J) + A(I+1,J))/4
10 CONTINUE
Flow and anti- dependencies of length 1 exist for all dimensions of the array A.
ACROSS specification is implemented via shadow edges. The length of anti-dependence defines width of high edge renewing, and length of flow-dependence defines width of low edge renewing. High edges are renewed prior the loop execution (as for SHADOW_RENEW directive). Low edges are renewed during loop execution as remote data calculation proceeds. Actually, ACROSS-references are subset of SHADOW-references, that have data dependence.
Efficiency of parallel execution of ACROSS loop
An array dimension with data dependence will be called recurrent dimension.
Efficiency degree of ACROSS loop parallel execution depends on the number of distributed recurrent dimensions.
One-dimensional array. For one-dimensional distributed array with recurrent dimension only sequential execution is possible (see fig. 6.3).
Multi-dimensional array. For multi-dimensional array the following combinations of recurrent distributed dimensions can be selected according to degree of efficiency decreasing.
-
At least one non-recurrent dimension exists. The array and the loop are distributed along this dimension only. The loop is executed as usual parallel loop without ACROSS specification.
Example 6.5. Non-recurrent dimension parallelization.
CDVM$ DISTRIBUTE A( BLOCK, * )
CDVM$ PARALLEL ( I ) ON A( I, * )
DO 30 I = 1,N1
DO 30 J = 2,N2-1
30 A(I,J) = A(I,J-1) + A(I,J+1)
Note, that this way may be not most efficient, if N1 is much less then N2 and the number of processors (insufficient parallelism).
-
Only one recurrent dimension is distributed. Other dimensions are localized on every processor. Run-time system performs pipeline parallelization (see fig. 6.4). The size of pipeline step is defined on every computer automatically, depending on the loop execution time and the time of data passing when renewing shadow edges.
Example 6.6. Pipeline parallelization.
CDVM$ DISTRIBUTE A( BLOCK, * )
CDVM$ PARALLEL ( I, J ) ON A( I, J ), ACROSS( A( 1:1, 1:1 ))
DO 40 I = 2,N1-1
DO 40 J = 2,N2-1
40 A(I,J) = A(I,J-1) + A(I,J+1) + A(I-1,J) + A(I+1,J)
Constraint of pipeline parallelization. For implementation of pipeline parallelization the following additional conditions must be satisfied:
-
PARALLEL directive specifies two loop headers as minimum. One loop specifies distributed recurrent dimension, other loop specifies a local dimension of the array.
-
If several arrays are specifying in ACROSS loop, these arrays should be aligned along recurrent distributed dimension and along the local dimension, indexed by the parallel loop.
-
m>1 recurrent dimensions exist. The virtual processor arrangement contains m dimensions too. Run-time system automatically organizes parallel execution on hyper-planes of processor arrangement. Every hyper-plane has m-1 dimensions.
Example 6.7. Hyper-plane parallelization.
CDVM$ DISTRIBUTE A( BLOCK, BLOCK )
CDVM$ PARALLEL ( I, J ) ON A( I, J ) , ACROSS ( A( 1:1, 1:1 ))
DO 50 I = 2,N1-1
DO 50 J = 2,N2-1
50 A(I,J) = A(I,J-1) + A(I,J+1) + A(I-1,J) + A(I+1,J)
Two-dimensional arrangement of virtual processors is represented on fig. 6.5. The computations on processors, belonging to one hyper-plane (diagonal line), will be executed in parallel.
i
p0 | t1 |
| |
p1 | t2 |
| |
p2 | t3 |
.
.
.
Fig. 6.3. Sequential execution
j
i
p0 | t1 | t2 | t3 | |
| ||||
| ||||
p1 | t2 | t3 | ||
| ||||
p2 | t3 | |||
.
.