Debugging DVM-program performance
User guide
April 25, 1999
Contents
1 Introduction
2 Characteristics of program execution
2.1 Main characteristics of program execution
2.2 Components of the main characteristics
2.3 Program execution characteristics on each processor
3 Methodology of performance debugging
3.1 Representation of program as a hierarchy of intervals
3.2 Recommendations on characteristics analysis
4 Start of execution with statistics
5 Start of performance analyzer
6 Representation of characteristics
1 Introduction
The performance of a parallel program on a multiprocessor computer with distributed memory is determined by the following major factors:
- program parallelism: the share of parallel calculations in the total volume of calculations;
- balance of processor load during parallel calculations;
- time of interprocessor communications;
- degree of overlapping of interprocessor communications and calculations.
Methods and tools for debugging parallel program performance depend substantially on the system used to develop the parallel program.
An essential advantage of the DVM-system is that at any moment it is known whether a sequential or a parallel part of the program is being executed on any processor. In addition, all synchronization operations of the program are known. It is therefore possible to quantify the influence of the four factors above on program execution performance.
Special tools were developed for analyzing and debugging the performance of DVM-program execution. They work as follows. While the program runs on a multiprocessor computer (or a homogeneous computer network), the support system stores timing information in processor memory and writes the data to a file when the program completes. The file is then processed on a workstation using a special performance visualizer.
The performance visualizer allows the user to view the time characteristics of the program execution at various levels of detail.
2 Characteristics of program execution
2.1 Main characteristics of program execution
Because the sequential and parallel parts of the program can be distinguished during its execution on a multiprocessor computer, the productive time, that is, the time the program would require on a serial computer, can be predicted. This makes it possible to calculate the main characteristic of parallel execution, the efficiency coefficient: the ratio of the productive time to the total processor time. The total processor time is the product of the execution time on the multiprocessor computer and the number of processors, where the execution time is the maximum of the program execution times over all processors used. The lost time is the total processor time minus the productive time. If the programmer is not satisfied with the value of the efficiency coefficient, he should analyze the components of the lost time and their origin.
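The definitions in this paragraph can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the analyzer's code; the function name `efficiency_summary`, its arguments, and the sample numbers are assumptions introduced here.

```python
def efficiency_summary(productive_time, per_processor_times):
    """Compute the main characteristics from the definitions in the text.

    productive_time: predicted execution time on a serial computer.
    per_processor_times: program execution time measured on each processor.
    Names and inputs are illustrative assumptions, not part of the DVM tools.
    """
    if not per_processor_times:
        raise ValueError("at least one processor is required")
    # Execution time is the maximum over all processors used.
    execution_time = max(per_processor_times)
    if execution_time <= 0:
        raise ValueError("execution time must be positive")
    # Total processor time = execution time * number of processors.
    total_processor_time = execution_time * len(per_processor_times)
    return {
        "execution_time": execution_time,
        "total_processor_time": total_processor_time,
        "lost_time": total_processor_time - productive_time,
        "efficiency": productive_time / total_processor_time,
    }


# Hypothetical run: 4 processors, 30 time units of productive work.
summary = efficiency_summary(30.0, [10.0, 9.0, 8.0, 10.0])
print(summary)
```

In this made-up run the total processor time is 40, so a quarter of it is lost time, which is then split into the components listed next.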
The lost time has the following components:
- Losses due to insufficient parallelism, which cause the same calculations to be replicated on several processors (insufficient parallelism). These losses arise in two cases: first, sequential parts of the program are executed on all processors; second, some parallel loop iterations are replicated on all or several processors as prescribed by the user.
- Losses due to interprocessor communications (communication).
- Losses due to idle time of the processors on which program execution completed earlier than on the others (idle).
The time of interprocessor communications includes the time of data transfer from one processor to another. It also includes the time lost when a message receive operation on one processor starts earlier than the corresponding message send operation on another (dissynchronization losses).
Since the DVM user does not deal with low-level operations such as message passing, this information should be presented in a form convenient to him.
During DVM-program execution, interprocessor message exchanges are generated by the following collective operations:
- reduction start and waiting;
- start and waiting of shadow edges renewing;
- loading remote access buffers;
- dynamic data redistribution;
- input/output operations (they are executed on one dedicated processor, which receives data from and sends data to the others).
If one of the operations listed above does not start simultaneously on all processors, dissynchronization losses may occur. To estimate them, the dissynchronization losses of each collective operation are calculated as the time all processors would spend on synchronization if every collective operation began with a synchronization of the processors. The overhead of the synchronization message exchange itself is not taken into account.
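One possible reading of this estimate is sketched below in Python: for each collective operation, every processor is assumed to wait until the latest processor reaches the operation, and the waiting times are summed. The function names, the input layout, and the sample start times are assumptions made for illustration; the guide does not give the exact formula used by the tools.

```python
def dissynchronization_loss(start_times):
    """Estimated loss for ONE collective operation.

    start_times: moment at which each processor starts the operation.
    Assumption: a synchronization precedes the operation, so every
    processor waits for the latest one; message overhead is ignored.
    """
    latest = max(start_times)
    return sum(latest - t for t in start_times)


def total_synchronization(operations):
    """Sum the estimated losses over all collective operations."""
    return sum(dissynchronization_loss(times) for times in operations)


# Two hypothetical collective operations on three processors.
operations = [
    [1.0, 3.0, 2.0],  # processors 0 and 2 wait 2.0 and 1.0
    [5.0, 5.0, 4.0],  # processor 2 waits 1.0
]
sync_estimate = total_synchronization(operations)
print(sync_estimate)
```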
A special characteristic, synchronization, is used to estimate the total potential losses caused by the non-simultaneous start of collective operations on different processors.
The user should pay attention to the main origin of dissynchronization losses: processor load imbalance. Load imbalance is caused by non-uniform distribution of parallel loop calculations among processors.
If processor synchronization (interprocessor communication) were performed on every entry to and exit from a parallel loop, processor load imbalance would inevitably lead to dissynchronization losses. However, since such synchronization is not performed for all loops, imbalances on different program segments can compensate each other, and the real losses can be insignificant or even absent. To estimate possible imbalance losses, the user is given a generalized characteristic: imbalance. To minimize the overhead of calculating this characteristic, it is assumed that processor synchronization is performed only once, on exit from the program. So the total load of each processor is calculated first, and then the imbalance losses due to dissynchronization are predicted. In a real program, however, processors are synchronized not only on exit but more often, so the real losses will be higher. The real dissynchronization losses will exceed the imbalance value still further when the processor load varies strongly from one execution of a parallel loop to another execution of the same loop.
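The difference between the imbalance characteristic and the real losses can be illustrated with a small Python sketch. It assumes that imbalance is the total time processors would wait for the most heavily loaded one at a single synchronization on program exit; this formula, the function names, and the sample loads are assumptions for illustration rather than the tools' documented algorithm.

```python
def imbalance(total_loads):
    """Predicted imbalance with ONE synchronization at program exit.

    total_loads: total computational load of each processor.
    Assumption: each processor waits for the most loaded one.
    """
    heaviest = max(total_loads)
    return sum(heaviest - load for load in total_loads)


def losses_if_synchronized_after_each_segment(segment_loads):
    """Losses if processors synchronized after every program segment."""
    return sum(imbalance(loads) for loads in segment_loads)


# Hypothetical program: two segments on two processors, with
# opposite imbalance in each segment.
segments = [
    [4.0, 2.0],  # segment 1: processor 0 is busier
    [2.0, 4.0],  # segment 2: processor 1 is busier
]
totals = [sum(per_processor) for per_processor in zip(*segments)]

predicted = imbalance(totals)
with_frequent_sync = losses_if_synchronized_after_each_segment(segments)
print(totals, predicted, with_frequent_sync)
```

Here the per-segment imbalances cancel in the totals, so the predicted imbalance is zero, while synchronizing after each segment would cost 4.0 time units. Real programs fall somewhere between the two cases.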
Dissynchronization can occur not only because of imbalance but also because a collective operation completes at different moments on different processors, owing to the way it is implemented on a particular parallel computer. To evaluate this potential dissynchronization, the user is provided with a special characteristic: time variation of collective operation completion. Like the imbalance time, this characteristic is an integral one. It lets the user estimate possible dissynchronization losses quite accurately when the differences in execution time of collective operations are not random but are determined by network topology or processor specialization (input/output processor, processor for reduction operations, etc.).
An important characteristic, the time of overlapping, shows how much communication time could potentially be reduced by overlapping interprocessor exchanges with computations.
The main characteristics of effectiveness are integral characteristics that allow the user to estimate the degree of parallelization and the potential for increasing it. However, for complex programs the integral characteristics may not be sufficient. In this case more detailed information on the execution of the whole program and its parts can be provided to the user.
2.2 Components of the main characteristics
Some of the main characteristics above consist of several components, whose values can be given to the user.
Productive time consists of productive processor time (system overheads are included) and input/output operation time (not counting message exchange).
Insufficient parallelism losses consist of two components, which make it possible to distinguish losses in the user program from the corresponding system overheads.
To refine the communication time, it is decomposed into the following components:
- time of reduction operation start;
- time of reduction operation waiting;
- time of start of renewing shadow edges;
- time of completion of renewing shadow edges;
- time of loading remote access buffers;
- time of dynamic data redistribution;
- time of message exchange during input/output operations.
Real and potential dissynchronization losses and the losses due to variation in time of collective operation completion are decomposed in the same manner.
The time of overlap of exchanges and computations consists of two components: the time of reduction overlap and the time of shadow edges renewing overlap.
2.3 Program execution characteristics on each processor
The calculation of the main integral characteristics and their components is based on program execution characteristics on each processor. These characteristics can be useful for a more detailed analysis of parallel program execution effectiveness. In addition to the per-processor values of each characteristic, its average, maximum, and minimum values and the corresponding processors are given.
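As a rough illustration of this per-processor summary, the Python sketch below takes one characteristic measured on every processor and reports the average, the minimum, and the maximum, together with the processors on which the extremes occur. The function name, the zero-based processor numbering, and the sample values are assumptions introduced here.

```python
def summarize_characteristic(values):
    """Summarize one characteristic measured on each processor.

    values: list indexed by processor number (zero-based by assumption).
    Returns the average plus (value, processor) pairs for min and max.
    On ties, the lowest-numbered processor is reported.
    """
    if not values:
        raise ValueError("at least one processor is required")
    processors = range(len(values))
    lowest = min(processors, key=lambda p: values[p])
    highest = max(processors, key=lambda p: values[p])
    return {
        "average": sum(values) / len(values),
        "min": (values[lowest], lowest),
        "max": (values[highest], highest),
    }


# Hypothetical communication time on four processors.
stats = summarize_characteristic([2.0, 5.0, 3.0, 2.0])
print(stats)
```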
3 Methodology of performance debugging
To analyze the effectiveness of a complex parallel program, characteristics of the whole program execution are not sufficient; detailed characteristics of selected program parts are also needed. The execution of a DVM-program can be represented as a hierarchy of intervals. The tools for doing so and recommendations on analyzing the characteristics are described below.
3.1 Representation of program as a hierarchy of intervals
The execution of the whole program is considered an interval of the highest level (zero level). This interval can include several intervals of the next (first) level. Such intervals can be parallel loops, sequential loops, or any sequence of operators marked by the user whose execution starts from the first operator and completes with the last operator. The intervals of the first level can in turn include intervals of the second level, and so on.
All the characteristics above are computed not only for the whole program but also for each of its intervals. Multiple executions of an interval can be regarded as an unrolled sequence of the interval's operators, executed on the same processors as in the real execution of the parallel program. In practice, the characteristics of an interval executed several times are accumulated after each execution. The intervals nested in a higher-level interval are identified by the source file name and the line number where the interval begins, and optionally by a user-defined integer number.
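The interval hierarchy and the accumulation of characteristics over repeated executions can be modelled roughly as follows. This Python sketch is not the data structure used by the DVM tools; the class `Interval`, its fields, and the file name `prog.c` are hypothetical and serve only to show how nested intervals might be keyed by file name, line number, and an optional user-defined number.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Interval:
    """One node of a hypothetical interval hierarchy."""
    source_file: str
    line: int
    user_number: Optional[int] = None
    level: int = 0            # 0 = the whole program
    executions: int = 0
    time: float = 0.0         # accumulated over all executions
    children: dict = field(default_factory=dict)

    def child(self, source_file, line, user_number=None):
        """Return the nested interval with this identity, creating it once."""
        key = (source_file, line, user_number)
        if key not in self.children:
            self.children[key] = Interval(
                source_file, line, user_number, self.level + 1
            )
        return self.children[key]

    def record(self, elapsed):
        """Add the characteristics of one more execution."""
        self.executions += 1
        self.time += elapsed


program = Interval("prog.c", 1)
# The same loop executed twice: its characteristics are added up.
program.child("prog.c", 20).record(1.5)
program.child("prog.c", 20).record(2.5)
# Same line, different user-defined numbers: two separate intervals.
program.child("prog.c", 30, 0).record(1.0)
program.child("prog.c", 30, 1).record(3.0)

loop = program.child("prog.c", 20)
print(loop.level, loop.executions, loop.time, len(program.children))
```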
The user controls the splitting of the program into intervals during compilation. The following options are available:
- -e1: the intervals are all parallel loops and the sequential loops that enclose them;
- -e2: the intervals are all parallel loops and marked sequences of operators;
- -e3: the combination of the first two options (e1 and e2);
- -e4: the intervals are all parallel and sequential loops and marked sequences of operators.
To mark a sequence of operators as an interval, special C-DVM and FORTRAN-DVM instructions are used.
In C-DVM the interval is defined as follows:
DVM(INTERVAL [integer expression]) <operator>
In FORTRAN-DVM:
CDVM$ INTERVAL[integer expression]
. . .
CDVM$ ENDINTERVAL
For example, if a loop body is marked as an interval and the loop counter is given as the integer expression, each loop iteration is represented as a separate interval. In the same manner, the characteristics of even and odd loop iterations, or of procedure executions with given parameters, can be obtained.
3.2 Recommendations on characteristics analysis
When developing a parallel program, the user as a rule has one of two goals: to solve a problem in acceptable time, or to create an efficient program for solving a class of problems on different parallel computers.
In the first case, if the execution time is acceptable, the other characteristics may be of no interest to the user. In the second case, the main characteristic for the user is the coefficient of parallelism efficiency. If the execution time or the coefficient of parallelism efficiency does not satisfy the user, the lost time and its components should be analyzed.
Before proceeding to the recommendations on analysis, let us make some notes.
First, the calculation of the lost time (and of the coefficient of parallelism efficiency) is based not on the real execution time on one processor but on a predicted time. This predicted time may differ from the real one.
The real time may be greater than the predicted one because the same calculations can run slower on one processor than on several. The reason is that when the volume of data used in the calculations changes, the speed of data access through cache memory changes too. Since the performance of modern processors depends on how effectively cache memory is used, the real time can noticeably exceed the predicted time.
The real time may be less than the predicted one because not all overheads of parallel program execution are accounted for when the predicted time is calculated. Such overheads (for example, searches in system tables) may occur in frequently executed functions whose execution time cannot be measured without unacceptably distorting the program execution. These extra overheads may be smaller when the program is executed on one processor.
Because of the influence of cache usage efficiency and system overheads, the user will get different values of productive time on different configurations of the parallel computer. It is therefore desirable to execute the program on one processor (when possible, since it may need much more memory than one processor has) to understand the differences between the real and predicted times.














