In FORTRAN-DVM:
CDVM$ INTERVAL[integer expression]
. . .
CDVM$ ENDINTERVAL
For example, if a loop body is marked as an interval and the loop counter is given as the integer expression, each loop iteration is represented as a separate interval. In the same way, characteristics of even and odd loop iterations, or of procedure executions with given parameters, can be obtained.
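The effect of the interval expression can be illustrated with a small Python sketch. It is not DVM code: the function name `group_by_expression` and the sample timings are hypothetical, and the sketch only assumes that executions of an interval with the same expression value are accumulated together.

```python
from collections import defaultdict

def group_by_expression(samples, expr):
    """Accumulate (entry count, total time) per value of the interval expression."""
    stats = defaultdict(lambda: [0, 0.0])
    for iteration, seconds in samples:
        key = expr(iteration)          # value of the integer expression
        stats[key][0] += 1             # one more entry into this interval
        stats[key][1] += seconds       # time spent in this interval
    return {key: tuple(value) for key, value in stats.items()}

# Hypothetical per-iteration times: (iteration number, seconds).
samples = [(1, 0.5), (2, 0.25), (3, 0.5), (4, 0.25)]

# Expression = loop counter: every iteration becomes its own interval.
per_iteration = group_by_expression(samples, lambda i: i)

# Expression = counter modulo 2: odd (1) and even (0) iterations are separated.
even_odd = group_by_expression(samples, lambda i: i % 2)
```

With the loop counter as the expression the sketch yields four intervals; with the counter modulo 2 it yields two, one for odd and one for even iterations.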
3.2 Recommendations on characteristics analysis
When developing a parallel program, the user as a rule has one of two goals: to solve a problem in acceptable time, or to create an efficient program for solving a class of problems on different parallel computers.
In the first case, if the execution time is acceptable, the other characteristics may be of no interest to the user. In the second case, the main characteristic for the user is the coefficient of parallelism efficiency. If the execution time or the coefficient of parallelism efficiency does not satisfy the user, the lost time and its components should be analyzed.
Before proceeding to the recommendations on analysis, let us make some notes.
First, the calculation of lost time (and of the coefficient of parallelism efficiency) is based not on the real execution time on one processor but on a predicted time. This predicted time may differ from the real one.
The real time may be greater than the predicted one because the same calculations can execute more slowly on one processor than on several processors. The explanation is that when the volume of data used in the calculations changes, the speed of data access through cache memory changes too. Since the performance of a modern processor depends on how effectively cache memory is used, the real time can noticeably exceed the predicted time.
The real time may be less than the predicted one because not all overhead of parallel program execution is taken into account in the predicted time. Such losses (for example, time spent searching system tables) may occur when some frequently used functions are executed, and their execution time cannot be measured without introducing unacceptable distortions into program execution. These extra losses may be smaller when the program is executed on one processor.
Because of the influence of cache usage efficiency and system overhead, the user will get different values of productive time on different configurations of the parallel computer. So it is desirable to execute the program on one processor (when that is possible, since the program may need much more memory than one processor has) to understand the differences between the real and predicted times.
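The sensitivity of the efficiency coefficient to the predicted time can be shown with a short Python sketch. The definitions used here are assumptions made for the illustration and are not stated in this section: total processor time is taken as execution time multiplied by the number of processors, lost time as total processor time minus productive time, and the efficiency coefficient as the ratio of productive time to total processor time. All numbers are hypothetical.

```python
def efficiency(productive_time, execution_time, processors):
    """Return (efficiency coefficient, lost time) under the assumed definitions."""
    total_processor_time = execution_time * processors
    lost_time = total_processor_time - productive_time
    return productive_time / total_processor_time, lost_time

# Predicted one-processor (productive) time of 80 s, 25 s on 4 processors.
predicted_coeff, predicted_lost = efficiency(80.0, 25.0, 4)

# If a real one-processor run took 90 s (for example, because of cache effects),
# the same parallel run looks more efficient.
real_coeff, real_lost = efficiency(90.0, 25.0, 4)
```

The same parallel run is rated differently depending on which one-processor time is used, which is why a real run on one processor is a useful check.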
Second, the execution time of a parallel DVM program may differ substantially from the execution time of the sequential program. The possible reasons are the following:
- Access to distributed data in a parallel program differs from data access in a sequential one. Additional overhead may increase the parallel program execution time by 10-30 percent. However, data access optimization can be performed in the parallel program, which may speed it up compared to the sequential one.
- Translation of DVM programs into standard Fortran 77 or C programs may lead to differences in the optimizations performed by standard compilers. As a result the parallel program may execute slower or faster. The characteristics of a modern optimizing compiler strongly influence execution performance (by 100-200 percent).
- Some overhead of parallel program execution support can slow down the program considerably (for example, memory allocation and deallocation operations of the sequential program may turn into very complicated constructions for creating and deleting distributed arrays in the parallel program).
Therefore it is desirable to execute the program as a sequential one on one processor (if this is impossible on the parallel computer, it may be possible on a workstation).
If the parallel and sequential execution times differ considerably, the programmer can use the following facility of the DVM system.
A DVM program can be compiled in a special mode in which it differs little from the sequential program (although the influence of these differences on execution time must still be checked) but contains tools for collecting time characteristics in different intervals. The user can then obtain the sequential execution characteristics and compare them with the corresponding characteristics of parallel execution on one processor.
The user should take the above facts into account when analyzing the lost time and its components.
First, the three lost time components for the zero interval (the whole program) should be estimated. The main part of the lost time is probably one of the first two components (insufficient parallelism or communications).
If the main losses are due to insufficient parallelism, the user should find out whether it arises in the parallel or the sequential parts. In the parallel parts, a wrong definition of the processor matrix or a wrong distribution of data or calculations may contribute to the lost time. If insufficient parallelism is found in the sequential parts, the cause may be a sequential loop that performs a large volume of calculations. Removing such causes may take a lot of effort.
If the main losses are due to communications, the user should pay attention to the dissynchronization losses. If these losses are substantial, the imbalance characteristic should be considered, since it is precisely the imbalance of parallel loop calculations that is the main cause of dissynchronization and large communication losses. If the imbalance value is much less than the synchronization value, the user should pay attention to the time variation of collective operations. If dissynchronization is not a consequence of the time variation of completion of collective operations, it may be caused by imbalance in some parallel loops that is mutually compensated within the program execution interval under consideration. So it makes sense to consider the imbalance characteristics in intervals of lower levels.
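The mutual compensation of imbalance mentioned above can be sketched in Python. The formula is an assumption chosen for the illustration (the sum over processors of the difference between the slowest processor's time and the processor's own time); the DVM system may calculate imbalance differently, and the loop times are hypothetical.

```python
def imbalance(times):
    """Assumed measure: sum over processors of (slowest time - own time)."""
    slowest = max(times)
    return sum(slowest - t for t in times)

# Hypothetical calculation times of two parallel loops on two processors.
loop_a = [4.0, 2.0]   # processor 0 is overloaded
loop_b = [2.0, 4.0]   # processor 1 is overloaded

# Times for the enclosing interval that contains both loops.
enclosing = [a + b for a, b in zip(loop_a, loop_b)]

imbalance_a = imbalance(loop_a)
imbalance_b = imbalance(loop_b)
imbalance_enclosing = imbalance(enclosing)
```

Each loop is imbalanced, yet the enclosing interval shows zero imbalance, because the overloads fall on different processors and cancel out in the sums. The imbalance only becomes visible in the lower-level intervals.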
The second probable cause of large dissynchronization losses is processor dissynchronization during input/output, which can occur even if the input/output operations start simultaneously. This happens because the main work (operating system input/output function calls) is executed on the input/output processor while the other processors wait for data from the I/O processor or for information about completion of the collective operation. This cause is easy to reveal by looking at the corresponding communication component: losses due to input/output communications.
A delay in the start of an asynchronous collective operation may cause large communication losses. In this case the user should contact the person responsible for maintaining the communication library used by the DVM system.
A large number of reduction operations or of operations loading data from other processors (shadow edge renewal or remote access) may be the main cause of communication losses. In this case the user can reorganize the program to combine reduction operations or shadow edge renewal operations into group operations.
There is another approach to characteristics analysis: first the efficiency coefficients and lost time are analyzed in the first-level intervals, then in the second-level intervals, and so on. As a result a critical interval is found, and the user can concentrate on analyzing its characteristics. It should be taken into account that the dissynchronization losses and idle losses of an interval may be caused not only by imbalance and time variation in that interval but also by imbalance and time variation in preceding intervals.
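The level-by-level search for a critical interval can be sketched as a walk down a tree of intervals. This is only an illustration: the `Interval` class, the rule of following the child with the largest lost time, and all numbers are hypothetical and are not part of the DVM tools.

```python
from dataclasses import dataclass, field

@dataclass
class Interval:
    name: str
    lost_time: float
    children: list = field(default_factory=list)

def critical_path(root):
    """Follow, level by level, the nested interval with the largest lost time."""
    path = [root.name]
    node = root
    while node.children:
        node = max(node.children, key=lambda child: child.lost_time)
        path.append(node.name)
    return path

# Hypothetical interval tree: the zero interval is the whole program.
program = Interval("program", 10.0, [
    Interval("loop1", 2.0),
    Interval("loop2", 7.0, [
        Interval("loop2.inner1", 5.0),
        Interval("loop2.inner2", 1.5),
    ]),
])

path = critical_path(program)
```

In a real analysis the choice at each level would also weigh the efficiency coefficient and the caveat about losses inherited from preceding intervals.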
While debugging program performance the user does not need to perform the full volume of calculations that the program will perform on real tasks. For example, the user can limit the number of regularly repeated outer iterations to one or two. The efficiency coefficient may then be considerably reduced, since it depends on the losses in the intervals executed before the first iteration or after the last one. However, the user can define the execution of the outer iteration as a separate interval and then debug its performance as the performance of the whole program, using the methods described above.
4 Start of execution with statistics
To collect statistics on DVM program performance, the parameter Is_DVM_STAT should be set to 1 when the DVM program is started on a multiprocessor computer or on a workstation network.
After the program completes, a file named ‘sts’ is created. The length of the file is the product of the statistics buffer size and the number of processors used for program execution.
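As a small worked example of this rule, the following Python sketch computes the expected file length. The function name and the numeric values are hypothetical, and the sketch assumes the buffer size is expressed in bytes.

```python
def sts_file_length(stat_buf_length, processors):
    """Expected length of the statistics file: buffer size times processor count."""
    return stat_buf_length * processors

# Hypothetical values: a 100000-byte statistics buffer and 4 processors.
expected_length = sts_file_length(100000, 4)
```

Under these assumptions, a run on 4 processors would produce a file four times the size of one statistics buffer.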
By changing parameters it is possible to change the length of the statistics buffer (StatBufLength), where the execution characteristics of each interval are saved, and the maximal interval nesting level (MaxIntervalLevel). By reducing the interval nesting level the user can reduce the number of intervals for which statistics are collected, and so reduce the volume of statistics.
If the parameter IsTimeVariation is equal to 1, the statistics buffer is also used to save the start and completion times of all collective operations. The performance visualizer uses these times to calculate potential dissynchronization and time variation losses, and also to find the potential reduction of communications from overlapping interprocessor exchanges with computations. If there is not enough buffer space to save information about all executed collective operations, a warning message is output. The user should take into account that in this case the performance visualizer cannot use complete information when calculating the above characteristics.
If errors arise while collecting information, the file may be created anyway, and an error message is output to a file or to the screen. These messages begin with the word Statistics.
List of messages:
- Statistics: not enough memory for interval, data were not wrote to the file,
- Statistics: number of ends of interval > number of begins of interval, data were not wrote to the file,
- Statistics: end of interval nline = <N>, name = <name>, no end nline = <N> name =<name>, data were not wrote to the file,
- Statistics: StatBufLength=<length>, increase buffer's size by <N> bytes, data were not wrote to the file,
- Statistics: StatBufLength=<length>, not enough memory for times of collective operations, increase buffer's size by <N> bytes, only part of times of collective operations and all intervals were wrote to the file.
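The buffer-related messages above carry enough information to choose a larger buffer. The Python sketch below extracts the two numbers from such a message and adds them. It assumes that `<length>` and `<N>` appear as plain decimal integers and that StatBufLength is measured in bytes; the function name and the sample text are hypothetical.

```python
import re

# Matches the two buffer messages from the list above (assumed numeric format).
PATTERN = re.compile(r"StatBufLength=(\d+).*increase buffer's size by (\d+) bytes")

def suggested_buffer_length(line):
    """Return current length plus the requested increase, or None if not applicable."""
    if not line.startswith("Statistics"):
        return None                      # not a statistics message
    match = PATTERN.search(line)
    if match is None:
        return None                      # a statistics message without buffer sizes
    return int(match.group(1)) + int(match.group(2))

sample = ("Statistics: StatBufLength=1000, increase buffer's size by 500 bytes, "
          "data were not wrote to the file")
new_length = suggested_buffer_length(sample)
```

The result could be used as the next value of StatBufLength before the program is run again.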
5 Start of performance analyzer
To get time characteristics for intervals, the user should execute the following command:
dvm pa <file name1> <file name2> [[[<ch1> <ch2> <ch3>] <level>] <numbers>]
| <file name1> | – name of the file with statistics (sts by default), |
For more information about the command parameters, the user can execute
dvm pa -h
6 Representation of characteristics
All characteristics are written into a text file whose name is given by the user on the performance analyzer command line. For each interval the following information is saved:
- name of the source file with the DVM program and the line number of the first statement of the interval (SOURCE, LINE);
- interval type: whole program, parallel loop (PAR), sequential loop (SEQ) or user-marked sequence of statements (USER);
- interval level number (LEVEL);
- the number of entries into (and exits from) the interval (EXE_COUNT);
- value of the expression specified when the interval was described (EXPR);
- main execution characteristics and their components (Main characteristics);
- minimal, maximal and average program execution characteristics on every processor (Comparative characteristics);
- program execution characteristics on every processor (Execution characteristics on processors).
When characteristics are output, their components appear either on the same line (to the right, in brackets) or on the next line (to the right of the symbols "*" or "-").
Components of some characteristics connected with collective operation execution are output as columns of a table whose rows correspond to the types of collective operation and whose columns are characteristics. One column (Nop) contains the number of operations of each type, a characteristic that does not depend on the number of processors used for program execution.
Information about minimal, maximal and average characteristics is saved in a table in the same way.
The user can reduce the volume of output by specifying the required types of characteristics. It is also possible to restrict the number of intervals by specifying the maximal interval level. The user can also define the list of processor numbers for which execution characteristics will be saved. Some characteristics are not saved if their value is zero.
Below is an example of the output of execution characteristics for a Jacobi Fortran-DVM program run on 4 SGI O2 workstations. The size (L) of arrays A and B is 1200, and the number of iterations is 4. The results (array B) are not written to a file.
The characteristics (Main characteristics and Comparative characteristics) are shown only for the zero interval.
PROGRAM JACOB















