1994. Compiler Transformations for High-Performance Computing, page 3
…that redundant loads are 0. Qc is an analogous quantity that measures reuse of a cache line in memory. It is the ratio of the number of words that are read out of the cache from that particular line to the number of times the cache fetches that line from memory.

3.2 Model Architectures

In the Appendix we present a series of model architectures that we will use throughout this survey to demonstrate the effect of various compiler transformations.
The architectures include a superscalar CPU (S-DLX), a vector CPU (V-DLX), a shared-memory multiprocessor (sMX), and a distributed-memory multiprocessor (dMX). We assume that the reader is familiar with basic principles of modern computer architecture, including RISC design, pipelining, caching, and instruction-level parallelism. Our generic architectures are based on DLX, an idealized RISC architecture introduced by Hennessy and Patterson [1990].

4. COMPILER ORGANIZATION

Figure 2 shows the design of a hypothetical compiler for a superscalar or vector machine. It includes most of the general-purpose transformations covered in this survey. Because compilers for parallel machines are still a very active area of research, we have excluded these machines from the design. The purpose of this compiler design is to give the reader (1) an idea of how the various types of transformations fit together and (2) some of the ordering issues that arise.
This organization is by no means definitive: different architectures dictate different designs, and opinions differ on how best to order the transformations. Optimization takes place in three distinct phases, corresponding to three different representations of the program: high-level intermediate language, low-level intermediate language, and object code.

ACM Computing Surveys, Vol. 26, No. 4, December 1994

First, optimizations are applied to a high-level intermediate language (HIL) that is semantically very close to the original source program. Because all the semantic information of the source program is available, higher-level transformations are easier to apply. For instance, array references are clearly distinguishable, instead of being a sequence of low-level address calculations. Next, the program is translated to low-level intermediate form (LIL), essentially an abstract machine language, and optimized at the LIL level.
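The HIL/LIL distinction can be sketched with a toy lowering pass. The tuple IR and the `lower` function below are our own invention for illustration (the survey does not define a concrete IR); the sketch assumes a Fortran-style column-major array with a known number of rows.

```python
# Toy IRs: at the HIL level an array reference is a single node; at the
# LIL level it becomes explicit address arithmetic.
def lower(ref, n_rows):
    """Lower ('array_ref', name, i, j) to column-major address arithmetic:
    base(name) + (j - 1) * n_rows + (i - 1)."""
    tag, name, i, j = ref
    assert tag == "array_ref"
    return ("load",
            ("add", ("base", name),
                    ("add", ("mul", ("sub", j, 1), n_rows),
                            ("sub", i, 1))))

hil = ("array_ref", "a", 5, 3)          # a[5, 3], clearly an array reference
lil = lower(hil, n_rows=10)
# After lowering, the subscripts are no longer visible as such; only the
# address computation remains:
assert lil[1][2][1] == ("mul", ("sub", 3, 1), 10)
```

Keeping the HIL node intact is what lets high-level transformations recognize array references at all; once lowered, the same information must be recovered from the address arithmetic.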
Address computations provide a major source of potential optimization at this level. For instance, two references to a[5, 3] and a[7, 3] both require the computation of the address of column 3 of array a. Finally, the program is translated to object code, and machine-specific optimizations are performed. Some optimizations, like code co-location, rely on profile information obtained by running the program. These optimizations are often implemented as binary-to-binary translations.

The high-level optimization phase begins by doing all of the large-scale restructuring that will significantly change the organization of the program. This includes various procedure-restructuring optimizations, as well as scalarization, which converts data-parallel constructs into loops.
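The shared column-address computation in the a[5, 3] / a[7, 3] example can be made concrete. This is a minimal sketch assuming column-major (Fortran-style) layout; `offset` is an invented helper, not part of any compiler.

```python
def offset(i, j, n_rows):
    # Column-major (Fortran-style) offset of element (i, j): every
    # reference in column j shares the subterm (j - 1) * n_rows.
    return (j - 1) * n_rows + (i - 1)

n = 10
# Naive LIL code recomputes the column base for each reference:
o1 = offset(5, 3, n)        # a[5, 3]
o2 = offset(7, 3, n)        # a[7, 3]

# After common-subexpression elimination, (3 - 1) * n is computed once
# and both addresses become cheap additions off the shared base:
col3 = (3 - 1) * n
assert (o1, o2) == (col3 + 4, col3 + 6)
```

The saving grows with the number of references into the same column, which is why address arithmetic is such a rich source of LIL-level redundancy.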
Doing this restructuring at the beginning allows the rest of the compiler to work on individual procedures in relative isolation. Next, high-level data-flow optimization, partial evaluation, and redundancy elimination are performed. These optimizations simplify the program as much as possible, and remove extraneous code that could reduce the effectiveness of subsequent analysis.

The main focus in compilers for high-performance architectures is loop optimization. A sequence of transformations is performed to convert the loops into a form that is more amenable to optimization: where possible, perfect loop nests are created; subscript expressions are rewritten in terms of the induction variables; and so on. Then the loop iterations are reordered to maximize parallelism and locality. A postprocessing phase organizes the resulting loops to minimize loop overhead.

[Figure 2. Organization of a hypothetical optimizing compiler. The phases shown are: high-level dataflow optimization, partial evaluation, and redundancy elimination; loop preparation, dependence vectors, and loop reordering (interchange, skewing, reversal, tiling, coalescing, strip mining, unrolling, cycle shrinking); low-level IL generation with low-level dataflow and redundancy optimization; code generation with micro-optimization; and instruction-cache optimization (code co-location, displacement minimization, cache conflict avoidance), producing the final optimized executable.]

After the loop optimization, the program is converted to low-level intermediate language (LIL). Many of the data-flow and redundancy elimination optimizations that were applied to the HIL are reapplied to the LIL in order to eliminate inefficiencies in the component expressions generated from the high-level constructs. Additionally, induction variable optimizations are performed, which are often important for high performance on uniprocessor machines. A final LIL optimization pass applies procedure call optimizations.

Code generation converts LIL into assembly language. This phase of the compiler is responsible for instruction selection, instruction scheduling, and register allocation. The compiler applies low-level optimizations during this phase to improve further the performance of the code. Finally, object code is generated by the assembler, and the object files are linked into an executable. After profiling, profile-based cache optimization can be applied to the executable program itself.

5. DEPENDENCE ANALYSIS

Among the various forms of analysis used by optimizing compilers, the one we rely on most heavily in this survey is dependence analysis [Banerjee 1988b; Wolfe 1989b]. This section introduces dependence analysis, its terminology, and the underlying theory.

A dependence is a relationship between two computations that places constraints on their execution order. Dependence analysis identifies these constraints, which are then used to determine whether a particular transformation can be applied without changing the semantics of the computation.

5.1 Types of Dependence

There are two kinds of dependence: control dependence and data dependence. There is a control dependence between statement 1 and statement 2, written S1 → S2, when statement S1 determines whether S2 will be executed. For example:

1   if (a = 3) then
2       b = 10
    end if

Two statements have a data dependence if they cannot be executed simultaneously due to conflicting uses of the same variable. There are three types of data dependence: flow dependence (also called true dependence), antidependence, and output dependence. S4 has a flow dependence on S3 (denoted by S3 → S4) when S3 must be executed first because it writes a value that is read by S4.
For example:

3   a = c * 10
4   d = 2 * a + c

S6 has an antidependence on S5 (denoted by S5 →ᵃ S6) when S6 writes a variable that is read by S5:

5   e = f * 4 + g
6   g = 2 * h

An antidependence does not constrain execution as tightly as a flow dependence. As before, the code will execute correctly if S6 is delayed until after S5 completes. An alternative solution is to use two memory locations g5 and g6 to hold the values read in S5 and written in S6, respectively. If the write by S6 completes first, the old value will still be available in g5.

An output dependence holds when both statements write the same variable:

7   a = b * c
8   a = d + e

We denote this condition by writing S7 →ᵒ S8. Again, as with an antidependence, storage replication can allow the statements to execute concurrently. In this short example there is no intervening use of a and no control transfer between the two assignments, so the computation in S7 is redundant and can actually be eliminated.

The fourth possible relationship, called an input dependence, holds when two accesses to the same memory location are both reads. Although an input dependence imposes no ordering constraints, the compiler can make note of it for the purposes of optimizing data placement on multiprocessors. We denote an unspecified type of dependence by S1 ⇒ S2.
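The four dependence classes follow mechanically from the read and write sets of two statements. The sketch below is illustrative only (the function and its names are ours): it ignores control flow and array subscripts, which real dependence analysis must handle.

```python
def classify(writes1, reads1, writes2, reads2):
    """Classify the dependences from S1 to S2, given each statement's
    sets of written and read variables."""
    deps = set()
    if writes1 & reads2:    # S1 writes a value that S2 reads
        deps.add("flow")
    if reads1 & writes2:    # S2 overwrites a value that S1 reads
        deps.add("anti")
    if writes1 & writes2:   # both write the same variable
        deps.add("output")
    if reads1 & reads2:     # both read the same variable
        deps.add("input")
    return deps

# S3: a = c * 10       S4: d = 2 * a + c
assert classify({"a"}, {"c"}, {"d"}, {"a", "c"}) == {"flow", "input"}
# S5: e = f * 4 + g    S6: g = 2 * h
assert classify({"e"}, {"f", "g"}, {"g"}, {"h"}) == {"anti"}
# S7: a = b * c        S8: a = d + e
assert classify({"a"}, {"b", "c"}, {"a"}, {"d", "e"}) == {"output"}
```

Note that a single pair of statements can carry several dependence types at once, as the first example shows: S4 both reads the a written by S3 (flow) and shares the read of c (input).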