1994. Compiler Transformations for High-Performance Computing, page 17
The compiler focuses on computations that update array elements based on their neighbors. It finds the pattern of neighbors that are needed to compute a given element, called the stencil. The cross, a common stencil, represents a computation that updates each array element using the values of its neighbors to the north, east, west, and south. Stencils are aggregated into larger patterns, called multistencils, using optimizations analogous to loop unrolling and strip mining. The multistencils are mapped to the hardware so as to minimize communication.

SIMD compilers developed at Compass [Knobe and Natarajan 1993; Knobe et al. 1988; 1990] construct a preference graph based on the computations being performed. The graph represents alignment requests that would yield the least communication overhead. The compiler satisfies every request when possible, using a greedy algorithm to find a solution when preferences are in conflict.

ACM Computing Surveys, Vol. 26, No. 4, December 1994. 402 · David F. Bacon et al.

Wholey [1992] begins by computing a detailed cost function for the computation. The function is the input to a hill-climbing algorithm that searches the space of possible allocations of data to processors. It begins with two processors, then four, and so forth; each time the number of processors is doubled, the algorithm uses the most efficient allocation from the previous stage as a starting point.

In addition to general-purpose mapping algorithms, the compiler can use specific transformations like loop flattening [von Hanxleden and Kennedy 1992] to avoid idle time due to nonuniform computations.

7.6 VLIW Transformations

The Very Long Instruction Word (VLIW) architecture is another strategy for executing instructions in parallel. A VLIW processor [Colwell et al.
1988; Fisher et al. 1984; Floating Point Systems 1979; Rau et al. 1989] exposes its multiple functional units explicitly; each machine instruction is very large (on the order of 512 bits) and controls many independent functional units. Typically the instruction is divided into 32-bit pieces, each of which is routed to one of the functional units.

With traditional processors, there is a clear distinction between low-level scheduling done via instruction selection and high-level scheduling between processors. This distinction is blurred in VLIW architectures. They rely on the compiler to expose the parallel operations explicitly and schedule them at compile time. Thus the task of generating object code incorporates high-level resource allocation and scheduling decisions that occur only between processors on a more conventional parallel architecture.

Trace scheduling [Ellis 1986; Fisher 1981; Fisher et al.
1984] is an analysis strategy that was developed for VLIW architectures. Because VLIW machines require more parallelism than is typically available within a basic block, compilers for these machines must look through branches to find additional instructions to execute. The multiple paths of control require the introduction of speculative execution: computing some results that may turn out to be unused (ideally at no additional cost, by functional units that would otherwise have gone unused). The speculation can be handled dynamically by the hardware [Sohi and Vajapeyam 1990], but that complicates the design significantly. The trace-scheduling alternative is to have the compiler do it by identifying paths (or traces) through the control-flow graph that might be taken. The compiler guesses which way the branches are likely to go and hoists code above them accordingly.

Superblock scheduling [Hwu et al.
1993] is an extension of trace scheduling that relies on program profile information to choose the traces.

8. TRANSFORMATION FRAMEWORKS

Given the many transformations that compiler writers have available, they face a daunting task in determining which ones to apply and in what order. There is no single best order of application; one transformation can permit or prevent a second from being applied, or it can change the effectiveness of subsequent changes. Current compilers generally use a combination of heuristics and a partially fixed order of applying transformations.

There are two basic approaches to this problem: unifying the transformations in a single mechanism, and applying search techniques to the transformation space. In fact there is often a degree of overlap between these two approaches.

8.1 Unified Transformation

A promising strategy is to encode both the characteristics of the code being transformed and the effect of each transformation; then the compiler can quickly search the space of possible sets of transformations to find an efficient solution. One framework that is being actively investigated is based on unimodular matrix theory [Banerjee 1991; Wolf and Lam 1991].
It is applicable to any loop nest whose dependences can be described with a distance vector; a subset of the loops that require a direction vector can also be handled. The transformations that a unimodular matrix can describe are interchange, reversal, and skew.

The basic principle is to encode each transformation of the loop in a matrix and apply it to the dependence vectors of the loop. The effect on the dependence pattern of applying the transformation can be determined by multiplying the matrix and the vector.
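To make the multiply-and-check step concrete, here is a small Python sketch (our illustration, not code from the survey; the helper names are ours): a transformation is legal when every transformed distance vector remains lexicographically positive, i.e., its first nonzero entry is positive.

```python
# Sketch (not from the survey): checking legality of unimodular loop
# transformations by multiplying the matrix into each distance vector.

def lex_positive(v):
    """True if the first nonzero entry of v is positive."""
    for x in v:
        if x != 0:
            return x > 0
    return False

def transform(T, v):
    """Multiply transformation matrix T by distance vector v."""
    return [sum(T[i][j] * v[j] for j in range(len(v))) for i in range(len(T))]

def legal(T, dist_vectors):
    """T is legal if every transformed distance vector stays
    lexicographically positive."""
    return all(lex_positive(transform(T, d)) for d in dist_vectors)

D = [(1, 0)]             # dependence of iteration i on i-1 in the outer loop
R = [[-1, 0], [0, 1]]    # reversal of the outer loop
I = [[0, 1], [1, 0]]     # interchange of the two loops

print(legal(R, D))       # False: reversal is not legal
print(legal(I, D))       # True: interchange is legal
```

The sketch mirrors the framework's structure: each candidate transformation is a matrix, and legality is a purely mechanical check on the transformed dependence vectors.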
The form of the product vector reveals whether the transformation is valid. To model the effect of applying a sequence of transformations, the corresponding matrices are simply multiplied.

Figure 60 shows a loop nest that can be transformed with unimodular matrices:

    do i = 2, 10
      do j = 1, 10
        a[i, j] = a[i-1, j] + a[i, j]
      end do
    end do

    Figure 60. Unimodular transformation example.

The distance vector that describes the loop is D = (1, 0), representing the dependence of iteration i on i-1 in the outer loop.

Because of the dependence, it is not legal to reverse the outer loop of the nest. The reversal transformation is

    R = [ -1  0 ]
        [  0  1 ]

The product is

    R D = P1 = ( -1 )
               (  0 )

This demonstrates that the transformation is not legal, because P1 is not lexicographically positive.

We can also test whether the two loops can be interchanged; the interchange transformation is

    I = [ 0  1 ]
        [ 1  0 ]

Applying I to D yields

    I D = P2 = ( 0 )
               ( 1 )

In this case, the resulting vector is lexicographically positive, showing that the transformation is legal.

Any loop nest whose dependences are all representable by a distance vector can be transformed via skewing into a canonical form called a fully permutable loop nest. In this form, any two loops in the nest can be interchanged without changing the loop semantics. Once in this canonical form, the compiler can decompose the loop into the granularity that matches the target architecture [Wolf and Lam 1991].

Sarkar and Thekkath [1992] describe a framework for transforming perfect loop nests that includes unimodular transformations, tiling, coalescing, and parallel loop execution. The transformations are encoded in an ordered sequence. Rules are provided for mapping the dependence vectors and loop bounds, and the transformation to the loop body is described by a template.

Pugh [1991] describes a more ambitious (and time-consuming) technique that can transform imperfectly nested loops and can do most of the transformations possible through a combination of statement reordering, interchange, fusion, skewing, reversal, distribution, and parallelization. It views the transformation problem as that of finding the best schedule for a set of operations in a loop nest.
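The rule that a sequence of transformations is modeled by the product of their matrices can likewise be sketched in Python (again our illustration, with hypothetical helper names, continuing the Figure 60 nest): reversing the outer loop is illegal on its own, but interchanging first and then reversing is legal, and the combined effect is captured by a single matrix product.

```python
# Sketch (ours, not from the survey): a sequence of unimodular
# transformations is modeled by multiplying their matrices.

def matmul(A, B):
    """Product of two square matrices (applying B first, then A)."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def matvec(T, v):
    return [sum(T[i][j] * v[j] for j in range(len(v))) for i in range(len(T))]

def lex_positive(v):
    for x in v:
        if x != 0:
            return x > 0
    return False

D = [1, 0]               # distance vector of the Figure 60 loop nest
I = [[0, 1], [1, 0]]     # interchange
R = [[-1, 0], [0, 1]]    # reversal of the outer loop

combined = matmul(R, I)  # interchange first, then reverse the outer loop
print(lex_positive(matvec(R, D)))         # False: reversal alone is illegal
print(lex_positive(matvec(combined, D)))  # True: the sequence is legal
```

This is why the framework can search sequences of transformations cheaply: testing a whole sequence costs one matrix product plus one vector check per dependence.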
A method is given for generating and testing candidate schedules.

8.2 Searching the Transformation Space

Wang [1991] and Wang and Gannon [1989] describe a parallelization system that uses heuristic search techniques from artificial intelligence to find a program transformation sequence. The target machine is represented by a set of features that describe the type, size, and speed of the processors, memory, and interconnect. The heuristics are organized hierarchically. The main functions are: description of parallelism in the program and in the machine; matching of program parallelism to machine parallelism; and control of restructuring.

9. COMPILER EVALUATION

Researchers are still trying to find a good way to evaluate the effectiveness of compilers. There is no generally agreed upon way to determine the best possible performance of a particular program on a particular machine, so it is difficult to determine how well a compiler is doing. Since some applications are better structured than others for a given architecture or a given compiler, measurements for a particular application or group of applications will not necessarily predict how well another application will fare.

Nevertheless, a wide variety of measurement studies do exist that seek to evaluate how applications behave, how well they are being compiled, and how well they could be compiled. We have divided these studies into several groups.

9.1 Benchmarks

Benchmarks have received by far the most attention since they measure delivered performance, yielding results that are used to market machines. They were originally developed to measure machine speed, not compiler effectiveness. The Livermore Loops [McMahon 1986] is one of the early benchmark suites; it sought to compare the performance of supercomputers.
The suite consists of a set of small loops based on the most time-consuming inner loops of scientific codes.

The SPEC benchmark suite [Dixit 1992] includes both scientific and general-purpose applications intended to be representative of an engineering/scientific workload. The SPEC benchmarks are widely used as indicators of machine performance, but are essentially uniprocessor benchmarks.

As architectures have become more complex, it has become obvious that the benchmarks measure the combined effectiveness of the compiler and the target machine. Thus two compilers that target the same machine can be compared by using the SPEC ratings of the generated code. For parallel machines, the contribution of the compiler is even more important, since the difference between naive and optimized code can be many orders of magnitude. Parallel benchmarks include SPLASH (Stanford Parallel Applications for Shared Memory) [Singh et al.
1992] and the NASA Numerical Aerodynamic Simulation (NAS) benchmarks [Bailey et al. 1991]. The Perfect Club [Berry et al. 1989] is a benchmark suite of computationally intensive programs that is intended to help evaluate serial and parallel machines.

9.2 Code Characteristics

A number of studies have focused on the applications themselves, examining their source code for programmer idioms or profiling the behavior of the compiled executable.

Knuth [1971] carried out an early and influential study of Fortran programs. He studied 440 programs comprising 250,000 lines (punched cards). The most important effect of this study was to dramatize the fact that the majority of the execution time of a program is usually spent in a very small proportion of the code. Other interesting statistics are that 95% of all the do loops incremented their index variable by 1, and 40% of all do loops contained only one statement.

Shen et al.
[1990] examined the subscript expressions that appeared in a set of Fortran mathematical libraries and scientific applications. They applied various dependence tests to these expressions. The results demonstrate how the tests compare to one another, showing that one of the biggest problems was unknown variables. These variables were caused by procedure calls and by the use of an array element as an index value into another array. Coupled subscripts also caused problems for tests that examine a single array dimension at a time (coupled subscripts are discussed in Section 5.4).

A study by Petersen and Padua [1993] evaluates approximate dependence tests against the omega test [Pugh 1992], finding that the latter does not expose substantially more parallelism in the Perfect Club benchmarks.

9.3 Compiler Effectiveness

As we mentioned above, researchers find it difficult to evaluate how well a compiler is doing.
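As background for the dependence tests discussed above, here is a minimal sketch (ours, not from the survey) of one of the simplest approximate tests, the GCD test: for references a[x*i + k1] and a[y*i + k2], the equation x*i1 + k1 = y*i2 + k2 has an integer solution only if gcd(x, y) divides k2 - k1.

```python
# Sketch (not from the survey): the GCD dependence test. It is
# conservative: it ignores loop bounds, so a True answer means only
# that a dependence MAY exist, while False proves independence.

from math import gcd

def gcd_test(x, k1, y, k2):
    """May a[x*i + k1] and a[y*i + k2] refer to the same element?"""
    return (k2 - k1) % gcd(x, y) == 0

# a[2*i] vs. a[2*i + 1]: even vs. odd indices never overlap.
print(gcd_test(2, 0, 2, 1))   # False: provably independent
# a[i] vs. a[i - 1]: a loop-carried dependence may exist.
print(gcd_test(1, 0, 1, -1))  # True: possible dependence
```

Exact tests such as the omega test answer "maybe" far less often, which is precisely what the Petersen and Padua study quantifies on the Perfect Club benchmarks.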