1994. Compiler Transformations for High-Performance Computing, page 18
They have come up with four approaches to the problem:

● examine the compiler output by hand to evaluate its ability to transform code;
● measure the performance of one compiler against another;
● compare the performance of executables compiled with full optimization against little or no optimization; and
● compare an application compiled for parallel execution against the sequential version running on one processor.

Nobayashi and Eoyang [1989] compare the performance of supercomputer compilers from Cray, Fujitsu, Alliant, Ardent, and NEC. The compilers were applied to various loops from the Livermore and Argonne test suites that required restructuring before they could be computed with vector operations.

Carr [1993] applies a set of loop transformations (unroll-and-jam, loop interchange, tiling, and scalar replacement) to various scientific applications to find a strategy that optimizes cache utilization. Wolf [1992] uses inner loops from the Perfect Club and from one of the SPEC benchmarks to compare the performance of hand-optimized to compiler-optimized code.
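The last two approaches boil down to computing a speedup ratio from measured run times; on a parallel machine the speedup is often normalized by the processor count to give an efficiency. A minimal sketch of the arithmetic (the helper names are ours, not from the text):

```python
def speedup(t_baseline: float, t_optimized: float) -> float:
    """Ratio of baseline run time to optimized (or parallel) run time."""
    return t_baseline / t_optimized

def efficiency(speedup_value: float, processors: int) -> float:
    """Fraction of ideal linear speedup actually achieved."""
    return speedup_value / processors

# A speedup of 2.36 on a 32-processor machine is only about 7% efficiency:
print(round(efficiency(2.36, 32), 3))  # -> 0.074
```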
The focus is on improving locality to reduce traffic between memory and the cache.

Relatively few studies have been performed to test the effectiveness of real compilers in parallelizing real programs. However, the results of those that have are not encouraging. One study of four Perfect benchmark programs compiled on the Alliant FX/8 produced speedups between 0.9 (that is, a slowdown) and 2.36 out of a potential 32; when the applications were tuned by hand, the speedups ranged from 5.1 to 13.2 [Eigenmann et al. 1991]. In another study of 12 Perfect benchmarks compiled with the KAP compiler for a simulated machine with unlimited parallelism, 7 applications had speedups of 1.4 or less; two applications had speedups of 2-4; and the rest were sped up 10.3, 66, and 77. All but three of these applications could have been sped up by a factor of 10 or more [Padua and Petersen 1992].

Lee et al. [1985] study the ability of the Parafrase compiler to parallelize a mixture of small programs written in Fortran. Before compilation, while loops were converted to do loops, and code for handling error conditions was removed. With 32 processors available, 4 out of 15 applications achieved 30% efficiency, and 2 achieved 10% efficiency; the other 9 out of 15 achieved less than 10% efficiency. Out of 89 loops, 59 were parallelized, most of them loops that initialized array data structures. Some coding idioms that are amenable to improved analysis were identified.

Some preliminary results have also been reported for the Fortran D compiler [Hiranandani et al. 1993] on the Intel iPSC/860 and Thinking Machines CM-5 architectures.

9.4 Instruction-Level Parallelism

In order to evaluate the potential gain from instruction-level parallelism, researchers have engaged in a number of studies to measure an upper bound on how much parallelism is available when an unlimited number of functional units is assumed. Some of these studies are discussed and evaluated in detail by Rau and Fisher [1993].

ACM Computing Surveys, Vol. 26, No. 4, December 1994 ● David F. Bacon et al.

Early studies [Tjaden and Flynn 1970] were pessimistic in their findings, measuring a maximum level of parallelism on the order of two or three, a result that was called the Flynn bottleneck. The main reason for the low numbers was that these studies did not look beyond basic blocks. Parallelism can be exploited across basic block boundaries, however, on machines that use speculative execution. Instead of waiting until the outcome of a conditional branch is known, these architectures begin executing the instructions at either or both potential branch targets; when the conditional is evaluated, any computations that are rendered invalid or useless must be discarded. When the basic block restriction is relaxed, there is much more parallelism available.

Riseman and Foster [1972] assumed replicated hardware to support speculative execution. They found that there was significant additional parallelism available, but that exploiting it would require a great deal of hardware. For the programs studied, on the order of 2^16 arithmetic units would be necessary to expose 10-way parallelism.

A subsequent study by Nicolau and Fisher [1984] sought to measure the parallelism that could be exploited by a VLIW architecture using aggressive compiler analysis. The strategy was to compile a set of scientific programs into an intermediate form, simulate execution, and retroactively apply a scheduling algorithm that used branch and address information revealed by the simulation. The study revealed significant amounts of parallelism: a factor of tens or hundreds.

Wall [1991] took trace data from a number of applications and measured the amount of parallelism under a variety of models.
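The "unlimited functional units" bound that these studies measure is the dataflow limit: the total number of operations divided by the length of the longest chain of true dependences, since only data dependences constrain the schedule. A small illustrative sketch (the operation list and dependence edges are invented for the example):

```python
def dataflow_ilp(ops, deps):
    """Upper bound on parallelism with unlimited functional units.
    ops: list of operation ids; deps: dict mapping an op to the ops it reads.
    Returns (critical_path_length, available_parallelism)."""
    depth = {}
    def level(op):
        # An op can run one step after the deepest op it depends on.
        if op not in depth:
            depth[op] = 1 + max((level(d) for d in deps.get(op, [])), default=0)
        return depth[op]
    critical = max(level(op) for op in ops)
    return critical, len(ops) / critical

# Four independent multiplies feeding a chain of three adds:
ops = ["m0", "m1", "m2", "m3", "a0", "a1", "a2"]
deps = {"a0": ["m0", "m1"], "a1": ["a0", "m2"], "a2": ["a1", "m3"]}
critical, ilp = dataflow_ilp(ops, deps)  # critical path 4, ILP = 7/4 = 1.75
```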
His study considered the effect of architectural features such as varying numbers of registers, branch prediction, and jump prediction. It also considered the effects of alias analysis in the compiler. The results varied widely, yielding as much as 60-way parallelism under the most optimistic assumptions. The average parallelism on an ambitious combination of architecture and compiler support was 7.

Butler et al. [1991] used an optimizing compiler on a group of programs from the SPEC89 suite. The resulting code was executed on a variety of simulated machines. The range includes unrealistically omniscient architectures that combine data-flow strategies for scheduling with perfect branch prediction. They also considered restricted models with limited hardware. Under the most optimistic assumptions, they were able to execute 17 instructions per cycle; more practical models yielded between 2.0 and 5.8 instructions per cycle. Without looking past branches, few applications could execute more than 2 instructions per cycle.

Lam and Wilson [1992] argue that one major reason why studies of instruction-level parallelism differ markedly in their results is branch prediction. The studies that perform speculative execution based on prediction yield much less parallelism than the ones that have an oracle with perfect knowledge about branches. The authors speculate that executing branches locally will not yield large degrees of parallelism; they conduct experiments on a number of applications and find similar results to other studies (between 4.2 and 9.2). They argue that compiler optimizations that consider the global flow of control are essential.

Finally, Larus [1993] takes a different approach than the others, focusing strictly on the parallelism available in loops.
He examines six codes (three integer programs from SPEC89 and three scientific applications), collects a trace annotated with semantic information, and runs it through a simulator that assumes a uniform shared-memory architecture with infinite processors. Two of the scientific codes, the ones that were primarily array manipulators, showed a potential speedup in the range of 65-250. The potential speedup of the other codes was 1.7-3, suggesting that the techniques developed for parallelizing scientific codes will not work well in parallelizing other types of programs.

CONCLUSION

We have described a large number of transformations that can be used to improve program performance on sequential and parallel machines. Numerous studies have demonstrated that these transformations, when properly applied, are sufficient to yield high performance.

However, current optimizing compilers lack an organizing principle that allows them to choose how and when the transformations should be applied. Finding a framework that unifies many transformations is an active area of research. Such a framework could simplify the problem of searching the space of possible transformations, making it easier and therefore cheaper to build high-quality optimizing compilers. A key part of this problem is the development of a cost function that allows the compiler to evaluate the effectiveness of a particular set of transformations.

Despite the absence of a strategy for unifying transformations, compilers have proven to be quite successful in targeting sequential and superscalar architectures. On parallel architectures, however, most high-performance applications currently rely on the programmer rather than the compiler to manage parallelism. Compilers face a large space of possible transformations to apply, and parallel machines exact a very high cost of failure when optimization algorithms do not discover a good reorganization strategy for the application.

Because efforts to parallelize traditional languages automatically have met with little success, the focus of research has shifted to compiling languages in which the programmer shares the responsibility for exposing parallelism and choosing a data layout.

Another developing area is the optimization of nonscientific applications. Like most researchers working on high-performance architectures, compiler writers have focused on code that relies on loop-based manipulation of arrays. Other kinds of programs, particularly those written in an object-oriented programming style, rely heavily on pointers. Optimization of pointer manipulation and of object-oriented languages is emerging as another focus of compiler research.

APPENDIX: MACHINE MODELS

A.1 Superscalar DLX

A superscalar processor has multiple functional units and can issue more than one instruction per clock cycle.
Current examples of superscalar machines are the DEC Alpha [Sites 1992], HP PA-RISC [Hewlett-Packard 1992], IBM RS/6000 [Oehler and Blasgen 1991], and Intel Pentium [Alpert and Avnon 1993].

S-DLX is a simplified superscalar RISC architecture. It has four independent functional units for integer, load/store, branch, and floating-point operations. In every cycle, the next two instructions are considered for execution. If they are for different functional units, and there are no dependences between the instructions, they are both initiated. Otherwise, the second instruction is deferred until the next cycle. S-DLX does not reorder the instructions. Most operations complete in a single cycle.
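The issue rule just described can be sketched as a tiny in-order model. The Instr encoding and the pairing check below are our illustrative assumptions, not part of S-DLX's definition, and they ignore multicycle latencies:

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Instr:
    unit: str                    # "int", "loadstore", "branch", or "fp"
    dst: Optional[str] = None    # register written, if any
    srcs: Tuple[str, ...] = ()   # registers read

def cycles_to_issue(code):
    """Cycles needed to issue `code`: two instructions issue together only if
    they use different functional units and the second does not depend on the
    first; otherwise the second waits for the next cycle."""
    cycles, i = 0, 0
    while i < len(code):
        cycles += 1
        if i + 1 < len(code):
            a, b = code[i], code[i + 1]
            # No RAW (b reads a's result) and no WAW (same destination).
            independent = a.dst is None or (a.dst not in b.srcs and a.dst != b.dst)
            if a.unit != b.unit and independent:
                i += 2
                continue
        i += 1
    return cycles

code = [
    Instr("int", dst="R1", srcs=("R2", "R3")),
    Instr("loadstore", dst="F0", srcs=("R4",)),  # pairs with the add above
    Instr("fp", dst="F2", srcs=("F0", "F1")),    # reads F0: waits a cycle
]
print(cycles_to_issue(code))  # -> 2
```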
When an operation takes more than one cycle, subsequent instructions that use results from multicycle instructions are stalled until the result is available. Because there is no instruction reordering, when an instruction is stalled no instructions are issued to any of the functional units.

S-DLX has a 32-bit word, 32 general-purpose registers (GPRs, denoted by Rn), and 32 floating-point registers (FPRs, denoted by Fn). The value of R0 is always 0. The FPRs can be used as double-precision (64-bit) register pairs. For the sake of simplicity we have not included the double-precision floating-point instructions.

Memory is byte addressable with a 32-bit virtual address. All memory references are made by load and store instructions between memory and the registers. Data is cached in a 64-kilobyte 4-way set-associative write-back cache composed of 1024 64-byte cache lines.
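Those parameters pin down the cache geometry: 65,536 bytes / 64-byte lines = 1024 lines, and 1024 lines / 4 ways = 256 sets, so a 32-bit address splits into a 6-bit byte offset, an 8-bit set index, and an 18-bit tag. A sketch of the arithmetic (the decompose helper and its field names follow the usual offset/index/tag convention, not the text):

```python
CACHE_BYTES = 64 * 1024
LINE_BYTES = 64
WAYS = 4

LINES = CACHE_BYTES // LINE_BYTES          # 1024 cache lines
SETS = LINES // WAYS                       # 256 sets
OFFSET_BITS = LINE_BYTES.bit_length() - 1  # 6 bits of byte offset
INDEX_BITS = SETS.bit_length() - 1         # 8 bits of set index

def decompose(addr):
    """Split a 32-bit address into (tag, set index, byte offset)."""
    offset = addr & (LINE_BYTES - 1)
    index = (addr >> OFFSET_BITS) & (SETS - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

tag, index, offset = decompose(0x00401F7C)  # -> (256, 125, 60)
```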