Second, the execution time of a parallel DVM-program may differ substantially from that of the sequential program. The possible causes are the following:
- Access to distributed data in a parallel program differs from data access in a sequential one. The additional overhead may increase the execution time of the parallel program by 10-30 percent. However, data access in the parallel program can also be optimized, which may make it run faster than the sequential program.
- Translation of DVM-programs into standard Fortran 77 or C programs may change which optimizations the standard compilers apply. As a result, the parallel program may run slower or faster. The characteristics of modern optimizing compilers affect execution performance very strongly (by 100-200 percent).
- Some overhead of the parallel execution support can slow the program down considerably (for example, memory allocation and deallocation in the sequential program may turn into very complicated operations that create and delete distributed arrays in the parallel program).
Therefore it is desirable to execute the program as a sequential one on one processor (if this is impossible on the parallel computer, it may be possible on a workstation).
If the parallel and sequential execution times differ considerably, the programmer can use the following DVM-system facilities.
A DVM-program can be compiled in a special mode in which it differs little from the sequential program (although the effect of these differences on execution time still has to be checked) but contains tools for collecting time characteristics in different intervals. The user can then obtain the sequential execution characteristics and compare them with the corresponding characteristics of parallel execution on one processor.
The user should take the above facts into account when analyzing the lost time and its components.
First, the three lost time components for the zero interval (the whole program) should be estimated. Most likely, the main part of the lost time falls into one of the first two components (insufficient parallelism or communications).
If the main losses are due to insufficient parallelism, the user should find out whether it appears in the parallel or in the sequential parts. In the parallel parts, a wrong definition of the processor matrix or a wrong distribution of data or calculations may contribute to the lost time. If insufficient parallelism is found in the sequential parts, the cause may be a sequential loop that performs a large volume of calculations. Removing such causes may take a lot of effort.
If the main losses are due to communications, the user should pay attention to dissynchronization losses. If the synchronization loss is close to the communication loss, the imbalance characteristic should be examined, because imbalance of parallel loop calculations is the main cause of dissynchronization and of large communication losses. If the imbalance is much smaller than the synchronization loss, the user should look at the time variation of collective operations. If dissynchronization is not a consequence of time variation in the completion of collective operations, it may be caused by imbalance in several parallel loops that compensate each other within the interval under consideration. It therefore makes sense to examine the imbalance characteristics in lower-level intervals.
A second probable cause of large dissynchronization losses is processor dissynchronization during input/output, which can occur even if the input/output operations start simultaneously. This happens because the main work (the operating system input/output function calls) is executed on the input/output processor, while the other processors wait for data from the I/O processor or for notification that the collective operation has completed. This cause is easy to identify from the corresponding communication component: losses due to input/output communications.
A large number of reduction operations, or of operations that load data from other processors (renewing shadow edges or remote access), may be the main cause of communication losses. In this case the user can reorganize the program to combine reduction operations or shadow edge renewals into group operations.
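The order of checks described above for insufficient parallelism and communication losses can be sketched as a small decision helper. This is only an illustration: the function name, the argument names, and the thresholds `close` and `small` (which stand in for "close to" and "much less than") are assumptions and are not part of the DVM-system.

```python
def triage(insufficient_par, communication, synchronization,
           imbalance, time_variation, close=0.8, small=0.2):
    """Return hints in the order of checks suggested by the text.

    All arguments are lost-time values for one interval, in the same unit.
    `close` and `small` are assumed thresholds, not DVM parameters.
    """
    hints = []
    if insufficient_par >= communication:
        hints.append("insufficient parallelism: check whether it is in "
                     "parallel or sequential parts")
        return hints
    hints.append("communications dominate: look at synchronization losses")
    if synchronization >= close * communication:
        if imbalance >= small * synchronization:
            # Imbalance is comparable to the synchronization loss.
            hints.append("check imbalance of parallel loops")
        elif time_variation >= small * synchronization:
            # Imbalance is much smaller; collective operations may vary.
            hints.append("check time variation of collective operations")
        else:
            # Neither explains the loss at this level; imbalance of
            # several loops may be compensating within the interval.
            hints.append("check imbalance in lower-level intervals")
    return hints
```

Input/output losses and the number of reduction or shadow renewal operations are deliberately left out of the sketch; they are read directly from the corresponding communication components.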
There is another approach to analyzing the characteristics: first the efficiency coefficients and lost time are analyzed in first-level intervals, then in second-level intervals, and so on. As a result a critical interval is found, and the user can concentrate on analyzing its characteristics. It must be kept in mind that dissynchronization losses and idle losses in an interval may be caused not only by imbalance and time variation in that interval but also by imbalance and time variation in preceding intervals.
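A minimal sketch of this level-by-level search, assuming the interval characteristics have already been read into a tree. The `Interval` class, its fields, and the rule "follow the child with the largest lost time" are illustrative assumptions, not the performance analyzer's own data model or algorithm.

```python
from dataclasses import dataclass, field


@dataclass
class Interval:
    name: str                 # hypothetical label, e.g. source line of the interval
    lost_time: float          # lost time reported for this interval
    children: list = field(default_factory=list)  # intervals of the next level


def critical_path(root):
    """Descend one nesting level at a time, following the largest lost time."""
    path = [root]
    node = root
    while node.children:
        node = max(node.children, key=lambda child: child.lost_time)
        path.append(node)
    return path


# Hypothetical numbers, for illustration only.
program = Interval("whole program", 10.0, [
    Interval("init loop", 1.0),
    Interval("iteration loop", 8.5, [
        Interval("reduction loop", 2.0),
        Interval("update loop", 6.0),
    ]),
])
print([interval.name for interval in critical_path(program)])
```

The interval found this way is only a starting point: as noted above, its losses may originate in intervals executed before it.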
While debugging program performance, the user does not need to perform the full volume of calculations required when the program is used for real tasks. For example, the user can limit the number of regularly repeated outer iterations to one or two. The efficiency coefficient may then drop considerably because of losses in intervals executed before the first iteration or after the last one. However, the user can treat the execution of the outer iteration as a separate interval and debug its performance, as if it were the whole program, using the methods above.
4 Start of execution with statistics.
To collect statistics on DVM-program performance, the parameter Is_DVM_STAT should be set to 1 when the DVM-program is started on a multiprocessor computer or on a network of workstations.
After the program completes, a file named 'sts' is created. The length of the file is the product of the statistics buffer size and the number of processors used to execute the program.
By changing parameters, the user can set the length of the statistics buffer (StatBufLength), in which the execution characteristics of each interval are saved, and the maximal interval nesting level (MaxIntervalLevel). Reducing the interval nesting level reduces the number of intervals for which statistics are collected, and thus the volume of statistics.
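The file length rule stated above is simple arithmetic; a short sketch makes it concrete. The function name and the sample values are hypothetical, and the buffer length is assumed to be in bytes because the diagnostic messages below ask to increase the buffer size by a number of bytes.

```python
def sts_file_length(stat_buf_length: int, n_processors: int) -> int:
    """Expected length of the 'sts' file: buffer size times processor count."""
    if stat_buf_length <= 0 or n_processors <= 0:
        raise ValueError("both arguments must be positive")
    return stat_buf_length * n_processors


# Hypothetical values: a 1,000,000-byte buffer and 4 processors.
print(sts_file_length(1_000_000, 4))  # prints 4000000
```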
If the parameter IsTimeVariation is equal to 1, the statistics buffer is also used to save the start and completion times of all collective operations. These times are used by the performance visualizer to calculate potential dissynchronization and time variation losses, and also to estimate the potential reduction of communication time from overlapping interprocessor exchanges with computations. If the buffer does not have enough space for the information about all executed collective operations, a warning message is output. The user should then keep in mind that the performance visualizer does not have complete information when calculating these characteristics.
If errors arise while information is being collected, the file may still be created, and an error message is output to a file or to the screen. These messages begin with the word Statistics.
List of messages:
Statistics: not enough memory for interval, data were not wrote to the file,
Statistics: number of ends of interval > number of begins of interval, data were not wrote to the file,
Statistics: end of interval nline = <N>, name = <name>, no end nline = <N> name =<name>, data were not wrote to the file,
Statistics: StatBufLength=<length>, increase buffer's size by <N> bytes, data were not wrote to the file,
Statistics: StatBufLength=<length>, not enough memory for times of collective operations, increase buffer's size by <N> bytes, only part of times of collective operations and all intervals were wrote to the file.
5 Start of performance analyzer.
To obtain the time characteristics for intervals, the user should execute the following command:
dvm pa <file name1> <file name2> [[[<ch1> <ch2> <ch3>] <level>] <numbers>]
<file name1> – name of the file with statistics (sts by default),
<file name2> – name of the output file,
<ch1> – y/n, output of general characteristics,
<ch2> – y/n, output of comparative characteristics,
<ch3> – y/n, output of characteristics for processors,
<level> – nesting level number,
<numbers> – list of processor numbers for which characteristics should be output.
For more information about the command parameters, the user can execute
dvm pa -h
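When the analyzer is launched from a script, the argument list can be assembled programmatically. The sketch below is an assumption-laden illustration: the helper name is hypothetical, the format of <numbers> is not specified in the text and is passed through unchanged as caller-supplied tokens, and only the two unambiguous forms of the command are supported (no optional parameters, or all of them).

```python
def build_pa_command(stats_file, out_file, flags=None, level=None, numbers=None):
    """Build argv for: dvm pa <file name1> <file name2> [<ch1> <ch2> <ch3> <level> <numbers>].

    flags   -- three 'y'/'n' values for <ch1>, <ch2>, <ch3>
    level   -- nesting level number
    numbers -- list of already formatted tokens for <numbers> (format assumed)
    """
    argv = ["dvm", "pa", stats_file, out_file]
    optional = (flags, level, numbers)
    if all(value is None for value in optional):
        return argv
    if any(value is None for value in optional):
        # Partial forms are not built here; this sketch does not guess
        # which optional parameters may be omitted independently.
        raise ValueError("give all of flags, level and numbers, or none")
    if len(flags) != 3 or any(flag not in ("y", "n") for flag in flags):
        raise ValueError("flags must be three 'y'/'n' values")
    argv.extend(flags)
    argv.append(str(level))
    argv.extend(numbers)
    return argv
```

The resulting list can be passed to `subprocess.run(argv, check=True)` on a machine where the `dvm` command is available.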
6 Representation of characteristics.
All characteristics are written to a text file whose name is specified by the user on the performance analyzer command line. For each interval the following information is saved:
- name of the source file with the DVM-program and line number of the first statement of the interval (SOURCE, LINE);
- interval type: whole program, parallel loop (PAR), sequential loop (SEQ), or a sequence of statements marked by the user (USER);
- interval level number (LEVEL);
- the number of entries into (and exits from) the interval (EXE_COUNT);
- value of the expression defined when describing the interval (EXPR);
- main execution characteristics and their components (Main characteristics);
- minimal, maximal and average values of the program execution characteristics across the processors (Comparative characteristics);
- program execution characteristics on every processor (Execution characteristics on processors).
When characteristics are output, their components appear on the same line (to the right, in brackets) or on the next line (to the right of the symbols "*" or "-").
Components of some characteristics connected with the execution of collective operations are output as a table whose rows correspond to the types of collective operation and whose columns are the characteristics. One column (Nop) contains the number of operations of each type, a characteristic that does not depend on the number of processors used to execute the program.
Information about the minimal, maximal and average characteristics is presented in a table in the same way.
The user can reduce the volume of output by specifying the types of characteristics needed. It is also possible to restrict the number of intervals by specifying the maximal interval level. The user can also define the list of processor numbers for which execution characteristics are saved. Some characteristics are not saved if their value is equal to zero.
Below is an example of the output characteristics for a Jacobi Fortran-DVM program executed on 4 SGI O2 workstations. The size (L) of arrays A and B is 1200, and the number of iterations is 4. The results (array B) are not written to a file.
The characteristics (Main characteristics and Comparative characteristics) are shown only for the zero interval.
      PROGRAM JACOB
      PARAMETER (L=1200, ITMAX=4)
      REAL A(L,L), EPS, MAXEPS, B(L,L)
CHPF$ PROCESSORS P(4,4)
CHPF$ DISTRIBUTE (BLOCK, BLOCK) ONTO P :: A, B
C     arrays A and B with block distribution
      PRINT *, '********** TEST_JACOBI **********'
      MAXEPS = 0.5E-7
CDVM$ PARALLEL (J,I) ON A(I, J)
C     nest of two parallel loops, iteration (i,j) will be executed on
C     processor, which is owner of element A(i,j)
      DO 1 J = 1, L
        DO 1 I = 1, L
          A(I, J) = 0.
          IF (I.EQ.1 .OR. J.EQ.1 .OR. I.EQ.L .OR. J.EQ.L) THEN
            B(I, J) = 0.
          ELSE
            B(I, J) = ( 1. + I + J )
          ENDIF
 1    CONTINUE
      DO 2 IT = 1, ITMAX
        EPS = 0.
CDVM$ PARALLEL (J, I) ON A(I, J), REDUCTION ( MAX( EPS ))
C     variable EPS is used for calculation of maximum value
        DO 21 J = 2, L-1
          DO 21 I = 2, L-1
            EPS = MAX ( EPS, ABS( B( I, J) - A( I, J)))
            A(I, J) = B(I, J)
 21     CONTINUE
CDVM$ PARALLEL (J, I) ON B(I, J), SHADOW_RENEW (A)
C     Copying shadow elements of array A from
C     neighboring processors before loop execution
        DO 22 J = 2, L-1
          DO 22 I = 2, L-1
            B(I, J) = (A( I-1, J ) + A( I, J-1 ) + A( I+1, J)+
     *                 A( I, J+1 )) / 4
 22     CONTINUE
        PRINT *, 'IT = ', IT, ' EPS = ', EPS
        IF ( EPS .LT. MAXEPS ) GO TO 3
 2    CONTINUE
 3    CONTINUE
C     OPEN (3, FILE='JACOBI.DAT', FORM='FORMATTED')















