This code captures data that can be used to calculate the number of times this function is called and the percentage of the total execution time spent in that routine. The necessary computations can be carried out in two ways: 1) the profile analysis can be directly performed at runtime (direct profiling), or 2) all entry and exit points of a function can be traced and calculations done off-line (trace-based profiling).

For pC++, we are interested in capturing performance profiling data associated with three general classes of functions: 1) thread-level functions, 2) collection class methods, and 3) runtime system routines. The data we want to capture includes activation counts, execution time, and, in the case of collections, referencing information.

3.1.1 General Approach

We perform all program transformations necessary for instrumentation at the language level, thus ensuring profiling portability. However, since profiling means inserting special code at all entry and exit points of a function, language-level profiling introduces the tricky problem of correctly instrumenting these points.
In particular, we have to ensure that the exit profiling code is executed as late as possible before the function is exited. In general, a function can return an expression that can be arbitrarily complex, possibly taking a long time to execute. Correct profiling instrumentation would extract the expression from the return statement, compute its value, execute the profiling exit code, and finally return the expression result.

Luckily, we can let the C++ compiler do the dirty work. The trick is very simple: we declare a special Profiler class which only has a constructor and a destructor and no other methods.
A variable of that class is then declared and instantiated in the first line of each function which has to be profiled, as shown below for function bar.

    class Profiler {
        const char* name;   // function identification, reused in the destructor
    public:
        Profiler(const char* n) { name = n; code_enter(n); }
        ~Profiler() { code_exit(name); }
    };

    void bar() {
        Profiler tr("bar");   // Profiler variable
        // body of bar
    }

The variable is created and initialized each time the control flow reaches its definition (via the constructor) and destroyed on exit from its block (via the destructor).
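The ordering guarantee that the Profiler class above relies on can be checked with a small sketch. The text names code_enter and code_exit but does not show their bodies, so the versions here are hypothetical stand-ins that append to an in-memory log. slow_expr is likewise an invented helper standing for an arbitrarily complex return expression. The return expression is evaluated before the local Profiler object is destroyed, so the exit hook fires after the result has been computed, and it also fires on an early return.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical in-memory event log; the real code_enter/code_exit
// bodies are not given in the text.
static std::vector<std::string> event_log;

void code_enter(const char* n) { event_log.push_back(std::string("enter ") + n); }
void code_exit(const char* n)  { event_log.push_back(std::string("exit ") + n); }

class Profiler {
    const char* name;   // function identification, reused in the destructor
public:
    explicit Profiler(const char* n) : name(n) {
        assert(n != nullptr);
        code_enter(n);
    }
    ~Profiler() { code_exit(name); }
};

// Illustrative stand-in for an expensive return expression.
int slow_expr(int x) {
    event_log.push_back("eval return expression");
    return x * x;
}

int bar(int x) {
    Profiler tr("bar");
    if (x < 0) return -1;    // early exit: the destructor still runs
    return slow_expr(x);     // evaluated first, then tr is destroyed
}
```

Calling bar(3) appends "enter bar", "eval return expression", "exit bar" to event_log, in that order. Calling bar(-1) appends only "enter bar" and "exit bar".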
The C++ compiler is clever enough to rearrange the code and to insert calls to the destructor no matter how the scope is exited. Note also that we use a private member to store a function identification which we can use in the destructor.

3.1.2 Profiling Implementation

The approach described above has two basic advantages. First, instrumenting at the source code level makes it very portable. Second, different implementations of the profiler can be easily created by providing different code for the constructor and destructor. This makes it very flexible. Currently, we have implemented two versions of the profiler:

Direct Profiling: The function profile is directly computed during program execution. We maintain a set of performance data values (#numcalls, usec, cumusec) for each profiled function. In addition, we store the current and parent function identifications and the timestamp of function entry in the Profiler object.
These three values are set by the constructor, which also increments the counter #numcalls. The destructor uses the entry timestamp to compute the duration of the function call and adds this value to the corresponding usec and cumusec fields, but also subtracts it from the usec field of its parent function. In this way, we can compute the time spent in a function itself not counting its children. At the exit of the main function, all profile data gathered for all functions is written to a file by the destructor.

Trace-based Profiling: Here the constructor and destructor functions simply call an event logging function from the pC++ software event tracing library (see next subsection).
All events inserted are assigned to the event class EC_PROFILER. By using event classes, the event recording can be activated/deactivated at runtime. The computation of the profile statistics is then done off-line.

Other profiling alternatives could be implemented in the same way. For example, profiling code could be activated/deactivated for each function separately, allowing dynamic profiling control. Another possibility is to let users supply function-specific profile code (specified by source code annotations or special class members with predefined names) that allows customized runtime performance analysis.

3.1.3 The pC++ Instrumentor

We use the Sage++ class library and restructuring toolkit to manipulate pC++ programs and insert the necessary profiler instrumentation code at the beginning of each function. The Instrumentor consists of three phases: 1) read the parsed internal representation of the program, 2) manipulate the program representation by adding profiling code according to an instrumentation command file, and 3) write the new program back to disk.
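Returning to the direct profiling scheme of Section 3.1.2, the bookkeeping done by the constructor and destructor can be sketched as follows. This is a single-threaded illustration under stated assumptions, not the pC++ implementation: the field names numcalls, usec, and cumusec come from the text, while profile_table, the settable clock fake_clock_usec, the static current pointer, and the functions child and parent_fn are all invented here so that the arithmetic is deterministic. Writing the table to a file at program exit is omitted, and recursive calls (which would double-count cumusec in this simple form) are not handled.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Per-function counters, named after the fields in the text.
struct ProfileData {
    long numcalls = 0;
    std::int64_t usec = 0;      // time in the function itself (exclusive)
    std::int64_t cumusec = 0;   // time including children (inclusive)
};

static std::map<std::string, ProfileData> profile_table;

// Hypothetical settable clock in microseconds (assumption, for determinism).
static std::int64_t fake_clock_usec = 0;
std::int64_t now_usec() { return fake_clock_usec; }

class Profiler {
    static Profiler* current;   // innermost active profiled call
    std::string name;           // current function identification
    Profiler* parent;           // enclosing profiled call, if any
    std::int64_t entry;         // timestamp of function entry
public:
    explicit Profiler(const char* n)
        : name(n), parent(current), entry(now_usec()) {
        profile_table[name].numcalls++;
        current = this;
    }
    ~Profiler() {
        const std::int64_t d = now_usec() - entry;
        ProfileData& self = profile_table[name];
        self.usec += d;
        self.cumusec += d;
        // Remove this call's duration from the parent's exclusive time.
        if (parent != nullptr) profile_table[parent->name].usec -= d;
        current = parent;
    }
};
Profiler* Profiler::current = nullptr;

// Illustrative call tree: the clock advances by hand instead of doing work.
void child() {
    Profiler p("child");
    fake_clock_usec += 30;
}

void parent_fn() {
    Profiler p("parent");
    fake_clock_usec += 10;
    child();
    child();
    fake_clock_usec += 5;
}
```

After one call to parent_fn(), child has numcalls 2 and usec 60, while parent has cumusec 75 and usec 15: the 60 microseconds spent in the two child calls are subtracted as each child exits, and the full 75 are added when parent itself exits.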
Sage++ provides all the necessary support for this type of program restructuring.

By default, every function in the pC++ input files is profiled. However, the user can specify the set of functions to instrument with the help of an instrumentation command file. The file contains a sequence of instrumentation commands for including/excluding functions from the instrumentation process based on the file or class in which they are declared, or simply by their name. Filenames, classes, and functions can be specified as regular expressions.

3.1.4 The pC++ Runtime System

There are also instrumented versions of the pC++ class libraries and runtime system, both for direct and trace-based profiling.
In addition to the instrumentation of user-level functions, they provide profiling of runtime system functions and collection access.

3.1.5 pprof and vpprof

Pprof is the parallel profile tool. It prints pC++ profile datafiles generated by programs compiled for direct profiling. The output of pprof is similar to the UNIX prof tool.
In addition, it prints a function profile for each program thread and some data access statistics, showing the local and remote accesses to each collection per thread. Also, it prints a function profile summary (mean, minimum, maximum) and collection data access summary for the whole parallel execution. The function profile table has the following fields:

%time       The percentage of the total running time of the main program used by this function.
msec        The number of milliseconds used by this function alone.
total msec  A running sum of the time used by this function and all its children (functions which are called within the current function).
#calls      The number of times this function was invoked.
usec/call   The average number of microseconds spent in this function per call.
name        The name of the function.

Vpprof is a graphical viewer for pC++ profile datafiles.
After compiling an application for profiling and running it, vpprof lets you browse through the function and collection profile data. It is a graphical frontend to pprof implemented using Tcl/Tk [15, 16]. The main window shows a summary of the function and the collection access profile data in the form of bar graphs. A mouse click on a bar graph object provides more detailed information.

3.2 Event Tracing of pC++ Programs

In addition to profiling, we have implemented an extensive system for tracing pC++ program events. Currently, tracing pC++ programs is restricted to shared-memory computers (e.g., Sequent Symmetry, BBN Butterfly, and Kendall Square KSR-1) and the uniprocessor UNIX version. The implementation of the event tracing package to distributed memory machines is under way (footnote 2).
(Footnote 2: Note, the difference between the shared- and distributed-memory implementations is only in the low-level trace data collection library and timestamp generation; all trace instrumentation is the same.)

Trace instrumentation support is similar to profiling. On top of the pC++ tracing system, we are implementing an integrated performance analysis and visualization environment. The performance results reported in this paper use utility tools to analyze the event traces that are based on externally available event trace analysis tools:

Merging: Traced pC++ programs produce an event log for each node. The trace files will have names of the form <MachineId>.<NodeId>.trc.
The single node traces must be merged into one global event trace, with all event records sorted by increasing timestamps. This is done with the tool se_merge. If the target machine does not have a hardware global clock, se_merge will establish a global time reference for the event traces by correcting timestamps.

Trace Conversion: The utility tool se_convert converts traces to the SDDF format used with the Pablo performance analysis environment [18, 21] or to ALOG format used in the Upshot event display tool [4]. It also can produce a simple user-readable ASCII dump of the binary trace.

Trace Analysis and Visualization: The trace files can be processed with the SIMPLE event trace analysis environment or other tools based on the TDL/POET event trace interface [13, 14].
These tools use the Trace Description Language (TDL) output of the Instrumentor to access the trace files. In addition, we have implemented an Upshot-like event and state display tool (offShoot) based on Tcl/Tk [15, 16]. Like Upshot, it is based on the ALOG event trace format.

3.3 Programming Environment Tools

In addition to the performance tools, we started to implement some programming environment utilities. Currently, function, class, and static callgraph browsers are implemented. Future versions of pC++ will include data visualization and debugging tools.

4 Benchmark Test Suite

To evaluate the pC++ language and runtime system implementations, we have established a suite of benchmark programs that illustrate a wide range of execution behaviors, requiring different degrees of computation and communication. In this section, we describe the benchmark programs and point out features that make them particularly interesting for pC++ performance evaluation.
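As an aside on the merging step of Section 3.2, combining per-node traces into one global trace sorted by increasing timestamp is a k-way merge. The sketch below illustrates that idea only and is not the merge tool's actual implementation: the Event record layout and the function name merge_traces are assumptions made for this example, each per-node trace is assumed to be sorted already, and the timestamp correction used on machines without a hardware global clock is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <queue>
#include <tuple>
#include <vector>

// Hypothetical event record; the real binary trace format is not
// described in the text.
struct Event {
    std::uint64_t timestamp;
    int node;
    int event_id;
};

// Merge per-node traces (each sorted by timestamp) into one global trace.
std::vector<Event> merge_traces(const std::vector<std::vector<Event>>& per_node) {
    // Heap entry: (timestamp, trace index, position within that trace).
    using Entry = std::tuple<std::uint64_t, std::size_t, std::size_t>;
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;

    // Seed the heap with the first event of every non-empty trace.
    for (std::size_t t = 0; t < per_node.size(); ++t) {
        if (!per_node[t].empty()) {
            heap.emplace(per_node[t][0].timestamp, t, std::size_t{0});
        }
    }

    std::vector<Event> merged;
    while (!heap.empty()) {
        const Entry e = heap.top();
        heap.pop();
        const std::size_t t = std::get<1>(e);
        const std::size_t i = std::get<2>(e);
        merged.push_back(per_node[t][i]);
        if (i + 1 < per_node[t].size()) {
            // Precondition: each input trace is already in timestamp order.
            assert(per_node[t][i].timestamp <= per_node[t][i + 1].timestamp);
            heap.emplace(per_node[t][i + 1].timestamp, t, i + 1);
        }
    }
    return merged;
}
```

Events with equal timestamps come out ordered by trace index, which keeps the result deterministic. With k traces and n events in total the merge takes O(n log k) time.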
The benchmarks were chosen to evaluate different features of the pC++ language and runtime system; two are related to CFD applications and two come from the NAS suite.

4.1 Grid: Block Grid CG

The first benchmark illustrates a "typical toy problem", grid computation. The computation consists of solving the Poisson equation on a two-dimensional grid using finite difference operators and a simple conjugate gradient method without preconditioning. Though this problem is very simple, it does illustrate important properties of the runtime system associated with an important class of computations.

In the program, we have used an early prototype of our DistributedGrid collection class. In addition, we have also used a common blocking technique that is often used to increase the computational granularity. The method used here is to make the grid size P by P and set the grid elements to be subgrids of size M by M; M = N/P.

The heart of the algorithm is a Conjugate Gradient iteration without any preconditioning.