1994. Compiler Transformations for High-Performance Computing, page 9
Bacon et al.

Provided that bitwise identical results are not required, the partial sums can be computed in parallel. Maximum parallelism is achieved by computing the reduction with a tree: pairs of elements are summed, then pairs of these results are summed, and so on. The number of serial steps is reduced from O(n) to O(log n). Operations such as and, or, min, and max are truly associative, and their reduction can be parallelized under all circumstances.

    do i = 1, n
      s = s + a[i]
    end do

(a) original loop

    real TS[64]
    TS[1:64] = 0.0
    do TI = 1, n, 64
      TS[1:64] = TS[1:64] + a[TI:TI+63]
    end do
    do TI = 1, 64
      s = s + TS[TI]
    end do

(b) transformed loop

Figure 29. Reduction recognition.

6.4.2 Loop Idiom Recognition

Parallel architectures often provide specialized hardware that the compiler can take advantage of. Frequently, for example, SIMD machines support reduction directly in the processor interconnection network. Some parallel machines, such as the Connection Machine CM-2 [Thinking Machines 1989], include hardware not only for reduction but for parallel prefix operations, allowing a loop with a body of the form a[i] = a[i-1] + a[i] to be parallelized. The parallelization of a more general class of linear recurrences is described by Chen and Kuck [1975], by Kuck [1977; 1978], and by Wolfe [1989b]. Blelloch [1989] describes compilation strategies for recognizing and exploiting parallel prefix operations.

The compiler can recognize and convert other idioms that are specially supported by hardware or software. For instance, the CMAX Fortran preprocessor converts loops implementing vector and matrix operations into calls to assembly-coded BLAS (Basic Linear Algebra Subroutines) [Sabot and Wholey 1993]. The VAX Fortran compiler converts string copy loops into block transfer instructions [Harris and Hobbs 1994].

6.4.3 Array Statement Scalarization

When a loop is expressed in array notation, the compiler can either convert it into vector operations or scalarize it into one or more serial loops that preserve the semantics and the programmer's intent, as described in Section 2.1.

ACM Computing Surveys, Vol. 26, No. 4, December 1994
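The tree reduction described in Section 6.4.1 can be sketched in C (our own code, not the paper's): each sweep sums pairs of partial results in place, and since the pair sums within one sweep are mutually independent, a parallel machine could perform each sweep in one step, giving O(log n) serial steps.

```c
#include <assert.h>

/* Tree reduction sketch: each sweep halves the number of partial
   sums.  The pair sums inside one sweep carry no dependence on one
   another, so they could execute in parallel; only the sweeps
   themselves are serial, O(log n) of them.  Destroys the contents
   of a[]. */
double tree_sum(double *a, int n) {
    for (int len = n; len > 1; len = (len + 1) / 2) {
        int half = (len + 1) / 2;
        for (int i = 0; i < len / 2; i++)   /* independent pair sums */
            a[i] = a[i] + a[i + half];
    }
    return a[0];
}
```

As in the text, this reordering is only valid when bitwise-identical floating-point results are not required; for truly associative operations such as and, or, min, and max it is always safe.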
However, the conversion is not completely straightforward, because the loops must be performed as if every value on the right-hand side and every subexpression were computed before any assignments are performed [Wolfe 1982b]. Figure 30 shows an example: the array notation in (a) specifies that every element of a is to be incremented by the value of the previous element. The "obvious" (but incorrect) serial conversion is shown in (b). The reason (b) is not correct is that in the original code every element of a is incremented by the previous value of the previous array element, whereas in the serial loop each a[i-1] has already been updated when it is read. The general solution is to introduce a temporary array T and to write the computed values back to the original array a in a separate loop, as shown in Figure 30(c). The temporary can then be eliminated if the two loops can legally be fused, namely, when there is no fusion-preventing dependence S2 δ S1 between the assignment to the temporary (S1) and the assignment back to the original array (S2).
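The hazard can be seen concretely in C (a sketch with our own function names, using 0-based indices in place of the Fortran-style ranges):

```c
#include <assert.h>

#define N 8

/* Incorrect scalarization of a[2:n-1] = a[2:n-1] + a[1:n-2]:
   each iteration reads a[i-1] AFTER it has been updated. */
void scalarize_wrong(double a[N]) {
    for (int i = 1; i < N; i++)
        a[i] = a[i] + a[i - 1];
}

/* Reference semantics: every right-hand-side value is taken from the
   original array, as array notation requires, via a temporary. */
void scalarize_temp(double a[N]) {
    double t[N];
    for (int i = 1; i < N; i++)
        t[i] = a[i] + a[i - 1];
    for (int i = 1; i < N; i++)
        a[i] = t[i];
}

/* Reversing the loop reads each a[i-1] before it is overwritten,
   so no temporary is needed. */
void scalarize_reversed(double a[N]) {
    for (int i = N - 1; i >= 1; i--)
        a[i] = a[i] + a[i - 1];
}
```

Starting from an array of all ones, the naive loop computes a running sum instead, while the temporary-array and reversed-loop versions agree with the array semantics.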
    a[2:n-1] = a[2:n-1] + a[1:n-2]

(a) initial array language expression

    do i = 2, n-1
      a[i] = a[i] + a[i-1]
    end do

(b) incorrect scalarization

    do i = 2, n-1
      T[i] = a[i] + a[i-1]
    end do
    do i = 2, n-1
      a[i] = T[i]
    end do

(c) correct scalarization using a temporary array T

    do i = n-1, 2, -1
      a[i] = a[i] + a[i-1]
    end do

(d) scalarization: reversing both loops eliminates the need for a temporary and allows fusion

    a[2:n-1] = a[2:n-1] + a[3:n]

(e) array expression requiring a temporary

Figure 30. Array statement scalarization.

In this case there is an antidependence, but it can be removed by reversing the loops, enabling fusion and eliminating the temporary, as shown in Figure 30(d). However, the array language statement in Figure 30(e) requires a temporary, since an antidependence exists regardless of the direction of the loop.

6.5 Memory Access Transformations

High-performance applications are as frequently memory limited as they are compute limited. In fact, for the last fifteen years CPU speeds have doubled every three to five years, while DRAM speeds have doubled about once every decade (DRAMs, or dynamic random access memory chips, are the low-cost, low-power memory chips used for main memory in most computers). As a result, optimization of the use of the memory system has become steadily more important. Factors affecting memory performance include:

- Reuse, denoted by Q and Qc, the ratio of uses of an item to the number of times it is loaded (described in Section 3);
- Parallelism. Vector machines often divide memory into banks, allowing vector registers to be loaded in a parallel or pipelined fashion. Superscalar machines often support double- or quadword load and store instructions;
- Working Set Size. If all the memory elements accessed inside of a loop do not fit in the data cache, then items that will be accessed in later iterations may be flushed, decreasing Qc. If more variables are simultaneously live than there are available registers (that is, the register pressure is high), then loads and stores will have to spill values into memory, decreasing Q. If more pages are accessed in a loop than there are entries in the TLB, then the TLB may thrash.

Since the registers are the top of the memory hierarchy, efficient register usage is absolutely crucial to high performance.
Until the late seventies, register allocation was considered the single most important problem in compiler optimization for which there was no adequate solution. The introduction of techniques based on graph coloring [Chaitin 1982; Chaitin et al. 1981; Chow and Hennessy 1990] yielded very efficient global (within a procedure) register allocation. Most compilers now make use of some variant of graph coloring in their register allocator.

In general, memory optimizations are inhibited by low-level coding that relies on particular storage layouts. Fortran EQUIVALENCE statements are the most obvious example. C and Fortran 77 both specify the storage layout for each type of declaration. Programmers relying on a particular storage layout may defeat a wide range of important optimizations.
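The graph-coloring idea can be illustrated with a minimal greedy sketch (our own simplification, not the cited algorithms): variables are nodes, simultaneous liveness is an edge, and each node receives the lowest "register" not already held by a colored neighbor.

```c
#include <assert.h>

#define MAXV 8

/* Greedy coloring of an interference graph.  interferes[u][v] is
   nonzero when variables u and v are simultaneously live and so may
   not share a register; on return, color[v] is the register assigned
   to v.  Returns the number of registers used. */
int color_graph(int n, int interferes[MAXV][MAXV], int color[MAXV]) {
    int max_color = 0;
    for (int v = 0; v < n; v++) {
        int used[MAXV] = {0};
        for (int u = 0; u < v; u++)   /* registers taken by colored neighbors */
            if (interferes[v][u])
                used[color[u]] = 1;
        int c = 0;
        while (used[c]) c++;          /* lowest free register */
        color[v] = c;
        if (c + 1 > max_color) max_color = c + 1;
    }
    return max_color;
}
```

Chaitin-style allocators additionally decide which variables to spill to memory when the graph cannot be colored with the available registers, and coalesce copies; those refinements are omitted here.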
Optimizations covered in other sections that also can improve memory system performance are loop interchange (6.2.1), loop tiling (6.2.6), loop unrolling (6.2.8), loop fusion (6.3.1), and various optimizations that eliminate register saves at procedure calls (6.8).

    real a[8,512]
    do i = 1, 512
      a[1,i] = a[1,i] + c
    end do

(a) original code

    real a[9,512]
    do i = 1, 512
      a[1,i] = a[1,i] + c
    end do

(b) padding a eliminates memory bank conflicts on V-DLX

Figure 31. Array padding.

6.5.1 Array Padding

Padding is a transformation whereby unused data locations are inserted between the columns of an array or between arrays.
Padding is used to ameliorate a number of memory system conflicts, in particular:

- bank conflicts on vector machines with banked memory [Burnett and Coffman Jr. 1970];
- cache set or TLB set conflicts, and cache miss jamming [Bacon et al. 1994];
- false sharing of cache lines on shared-memory multiprocessors.

Bank conflicts on vector machines like the Cray and our hypothetical V-DLX can be caused when the program indexes through an array dimension that is not laid out contiguously in memory, leading to a nonunit stride whose value we will define to be s. If s is an exact multiple of the number of banks (B), the bandwidth of the memory system will be reduced by a factor of B (8, on V-DLX) because all memory accesses are to the same bank. The number of memory banks is usually a power of two.

In general, if an array will be accessed with stride s, the array should be padded by the smallest p such that gcd(s + p, B) = 1.
This will ensure that B successive accesses with stride s + p will all address different banks. An example is shown in Figure 31(a): the loop accesses memory with stride 8, so all memory references will be to the first bank. After padding, successive iterations access memory with stride 9, so they go to successive banks, as shown in Figure 31(b).

Cached memory systems, especially those that are set-associative, are less sensitive to low power-of-two strides. However, large power-of-two strides will cause extremely poor performance due to cache set and TLB set conflicts. The problem is that addresses are mapped to sets simply by using a range of bits from the middle of the address. For instance, on S-DLX, the low 6 bits of an address are the byte offset within the line, and the next 8 bits are the cache set.
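The padding rule gcd(s + p, B) = 1 can be checked numerically (a sketch with our own helper names; B = 8 banks as on the hypothetical V-DLX):

```c
#include <assert.h>

/* Greatest common divisor, used to test the padding condition. */
static int gcd(int a, int b) {
    while (b != 0) { int t = a % b; a = b; b = t; }
    return a;
}

/* Smallest pad p such that gcd(s + p, B) = 1, so that B successive
   accesses with stride s + p fall into B different banks. */
int min_pad(int s, int banks) {
    int p = 0;
    while (gcd(s + p, banks) != 1) p++;
    return p;
}

/* Number of distinct banks touched by n successive accesses with the
   given stride, assuming words are interleaved across the banks
   (bank = address mod banks; at most 64 banks in this sketch). */
int banks_touched(int stride, int banks, int n) {
    int seen[64] = {0}, count = 0;
    for (int i = 0; i < n; i++) {
        int b = (i * stride) % banks;
        if (!seen[b]) { seen[b] = 1; count++; }
    }
    return count;
}
```

For Figure 31, s = 8 and B = 8 give p = 1: padding the leading dimension from 8 to 9 makes 8 successive accesses hit all 8 banks instead of one.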
Figure 12 shows the effect of stride on the performance of a cache-based superscalar machine.

A number of researchers have noted that other strides yield reduced cache performance. Bailey [1992] has quantified this by noting that because many words fit into a cache line, if the number of sets times the number of words per set is almost exactly divisible by the stride (say, by n ± e), then for a limited number of references r, every nth memory reference will map to the same set. If ⌈r/n⌉ is greater than the associativity of the cache, then the later references will evict the lines loaded by the earlier references, precluding reuse.

Set and bank conflicts can be caused by a bad stride over a single array, or by a loop that accesses multiple arrays that all align to the same set or bank.
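The S-DLX set mapping described above can be sketched as bit extraction (our own helpers; the 64-byte lines and 256 sets follow from the 6 offset bits and 8 set bits in the text). A power-of-two stride equal to the line size times the set count maps every access to the same set:

```c
#include <assert.h>
#include <stdint.h>

/* Cache set index on the hypothetical S-DLX: the low 6 bits of an
   address are the byte offset within a 64-byte line, and the next
   8 bits select one of 256 sets. */
unsigned cache_set(uint32_t addr) {
    return (addr >> 6) & 0xFF;
}

/* Number of distinct sets touched by n accesses with the given
   byte stride, starting at address 0. */
int sets_touched(uint32_t stride, int n) {
    int seen[256] = {0}, count = 0;
    for (int i = 0; i < n; i++) {
        unsigned s = cache_set((uint32_t)i * stride);
        if (!seen[s]) { seen[s] = 1; count++; }
    }
    return count;
}
```

With a stride of 64 x 256 = 16384 bytes, every reference lands in set 0, so any more references than the cache's associativity evict each other; a 64-byte stride cycles through all 256 sets.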