1994. Compiler Transformations for High-Performance Computing, page 8
This combination of transformations is sometimes referred to as unroll-and-jam [Callahan et al. 1988]. Loop quantization [Nicolau 1988] is another approach to unrolling that avoids replicated inner loops. Rather than creating multiple copies and then subsequently eliminating them, quantization adds additional statements to the innermost loop directly. The iteration ranges are changed, but the structure of the loop nest remains the same.
However, unlike straightforward unrolling, quantization changes the order of execution of the loop and is not always a legal transformation.

6.3.2 Software Pipelining

Another technique to improve instruction parallelism is software pipelining [Lam 1988]. In hardware pipelining, instruction execution is broken into stages, such as Fetch, Execute, and Write-back. The first instruction is fetched in the first clock cycle. In the next cycle, the second instruction is fetched while the first is executed, and so on. Once the pipeline has been filled, the machine will complete 1 instruction per cycle. In software pipelining, the operations of a single loop iteration are broken into s stages, and a single iteration performs stage 1 from iteration i, stage 2 from iteration i - 1, etc.
Startup code must be generated before the loop to initialize the pipeline for the first s - 1 iterations, and cleanup code must be generated after the loop to drain the pipeline for the last s - 1 iterations. Figure 23(d) shows how software pipelining improves the performance of a simple loop. The depth of the pipeline is s = 3. The software-pipelined loop produces one new result every iteration, or 4 cycles per result.

ACM Computing Surveys, Vol. 26, No. 4, December 1994 (370, David F. Bacon et al.)

[Figure 23. Increasing instruction parallelism in loops (loop prologues and epilogues omitted): (a) the loop "do i = 1, n; a[i] = a[i] + c; end do" as initially compiled for S-DLX; (b) after instruction scheduling; (c) after unrolling 3 times and rescheduling; (d) after software pipelining and rescheduling, without unrolling; (e) after unrolling 2 times and then software pipelining.]

Finally, Figure 23(e) shows the result of combining software pipelining (s = 3) with unrolling (u = 2). The loop takes 5 cycles per iteration, or 2 1/2 cycles per result. Unrolling alone achieves 2 2/3 cycles per result. If software pipelining is combined with unrolling by u = 3, the resulting loop would take 6 cycles per iteration or two cycles per result, which is optimal because only one memory operation can be initiated per cycle.

[Figure 24. Loop unrolling vs. software pipelining: execution timelines for (a) loop unrolling and (b) software pipelining.]

Figure 24 illustrates the difference between unrolling and software pipelining: unrolling reduces overhead, while pipelining reduces the startup cost of each iteration. Perfect pipelining combines unrolling by loop quantization with software pipelining [Aiken and Nicolau 1988a; 1988b].
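At the source level, the s = 3 staging that Figure 23(d) achieves in assembly can be sketched in C. This is an illustrative reconstruction, not the compiler's actual output; the function and variable names are invented for the example:

```c
#include <assert.h>

/* Software-pipelined version of "for (i = 0; i < n; i++) a[i] += c"
 * with s = 3 stages (load / add / store), each stage working on a
 * different iteration.  The prologue fills the pipeline and the
 * epilogue drains it, mirroring the startup and cleanup code that
 * the text says the compiler must generate. */
void add_c_pipelined(double *a, int n, double c)
{
    if (n < 3) {                      /* too short to pipeline */
        for (int i = 0; i < n; i++) a[i] += c;
        return;
    }
    double loaded = a[0];             /* prologue: stage 1 of iter 0 */
    double summed = loaded + c;       /*           stage 2 of iter 0 */
    loaded = a[1];                    /*           stage 1 of iter 1 */

    for (int i = 2; i < n; i++) {     /* steady state: 1 result/iter */
        a[i - 2] = summed;            /* stage 3 of iteration i-2 */
        summed   = loaded + c;        /* stage 2 of iteration i-1 */
        loaded   = a[i];              /* stage 1 of iteration i   */
    }
    a[n - 2] = summed;                /* epilogue: drain the last */
    a[n - 1] = loaded + c;            /* two in-flight iterations */
}
```

Each steady-state iteration contains one load, one add, and one store from three consecutive iterations, so the dependent operations of a single logical iteration never wait on each other.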
A special case of software pipelining is predictive commoning, which is applied to memory operations. If an array element written in iteration i - 1 is read in iteration i, then the first element is loaded outside of the loop, and each iteration contains one load and one store. The RS/6000 XL C/Fortran compiler [O'Brien et al. 1990] performs this optimization.

If there are no loop-carried dependences, the length of the pipeline is the length of the dependence chain. If there are multiple, independent dependence chains, they can be scheduled together subject to resource availability, or loop distribution can be applied to put each chain into its own loop.
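The predictive-commoning pattern can be sketched in C on a hypothetical recurrence (all names are illustrative): the reload of the element stored by the previous iteration is replaced by carrying the value in a scalar, which the compiler can keep in a register.

```c
#include <assert.h>

/* Original: each iteration reloads a[i-1], which the previous
 * iteration just stored. */
void naive(double *a, int n, double c)
{
    for (int i = 1; i < n; i++)
        a[i] = a[i - 1] + c;
}

/* After predictive commoning: the first element is loaded once
 * outside the loop; each iteration then performs one register-only
 * add and one store, carrying the iteration-(i-1) value in a scalar. */
void commoned(double *a, int n, double c)
{
    if (n < 2) return;
    double carry = a[0];              /* load hoisted out of the loop */
    for (int i = 1; i < n; i++) {
        carry = carry + c;            /* reuse value from iter i-1 */
        a[i] = carry;                 /* one store per iteration */
    }
}
```

Both versions compute the same array contents; the transformed loop simply avoids the redundant load on every iteration.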
The scheduling constraints in the presence of loop-carried dependences and conditionals are more complex; the details are discussed in Aiken and Nicolau [1988a] and Lam [1988].

[Figure 25. Loop coalescing vs. collapsing.
(a) original loop:
do all i = 1, n
  do all j = 1, m
    a[i,j] = a[i,j] + c
  end do all
end do all
(b) coalesced loop:
do all T = 1, n*m
  i = ((T-1) / m) + 1
  j = MOD(T-1, m) + 1
  a[i,j] = a[i,j] + c
end do all
(c) collapsed loop:
real TA[n*m]
equivalence (TA, a)
do all T = 1, n*m
  TA[T] = TA[T] + c
end do all]

6.3.3 Loop Coalescing

Coalescing combines a loop nest into a single loop, with the original indices computed from the resulting single induction variable [Polychronopoulos 1987b; 1988]. Coalescing can improve the scheduling of the loop on a parallel machine and may also reduce loop overhead. In Figure 25(a), for example, if n and m are slightly larger than the number of processors P, then neither of the loops schedules as well as the outer parallel loop, since executing the last n - P iterations will take the same time as the first P.
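The index recovery that coalescing introduces can be sketched in C (0-based indices rather than the Fortran-style 1-based ones; the names and the 3x5 shape are illustrative):

```c
#include <assert.h>

#define N 3
#define M 5

/* Coalesced form of the doubly nested loop
 *   for i in [0,N) for j in [0,M): a[i][j] += c
 * A single induction variable t enumerates all N*M iterations,
 * and the original indices i, j are recovered by division and
 * remainder, as in Figure 25(b). */
void add_c_coalesced(double a[N][M], double c)
{
    for (int t = 0; t < N * M; t++) {
        int i = t / M;                /* original outer index */
        int j = t % M;                /* original inner index */
        a[i][j] += c;
    }
}
```

The single loop of N*M iterations can then be block-scheduled across P processors without the load imbalance of scheduling the short outer loop alone.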
Coalescing the two loops ensures that P iterations can be executed every time except during the last (nm mod P) iterations, as shown in Figure 25(b). Coalescing itself is always legal since it does not change the iteration order of the loop. The iterations of the coalesced loop may be parallelized if all the original loops were parallelizable. A criterion for parallelizability in terms of dependence vectors is discussed in Section 6.2. The complex subscript calculations introduced by coalescing can often be simplified to reduce the overhead of the coalesced loop [Polychronopoulos 1987b].

6.3.4 Loop Collapsing

Collapsing is a simpler, more efficient, but less general version of coalescing in which the number of dimensions of the array is actually reduced. Collapsing eliminates the overhead of multiple nested loops and multidimensional array indexing. Collapsing is used not only to increase the number of parallelizable loop iterations, but also to increase vector lengths and to eliminate the overhead of a nested loop (also called the Carry optimization [Allen and Cocke 1971; IBM 1991]). The collapsed version of the loop discussed in the previous section is shown in Figure 25(c). Collapsing is best suited to loop nests that iterate over memory with a constant stride. When more complex indexing is involved, coalescing may be a better approach.

6.3.5 Loop Peeling

When a loop is peeled, a small number of iterations are removed from the beginning or end of the loop and executed separately. If only one iteration is peeled, a common case, the code for that iteration can be enclosed within a conditional. For a larger number of iterations, a separate loop can be introduced. Peeling has two uses: for removing dependences created by the first or last few loop iterations, thereby enabling parallelization; and for matching the iteration control of adjacent loops to enable fusion. The loop in Figure 26(a) is not parallelizable because of a flow dependence between iteration i = 2 and iterations i = 3 .
. . n. Peeling off the first iteration allows the rest of the loop to be parallelized and fused with the following loop, as shown in Figure 26(b).

[Figure 26. Loop peeling.
(a) original loops:
do i = 2, n
  b[i] = b[i] + b[2]
end do
do all i = 3, n
  a[i] = a[i] + c
end do all
(b) after peeling one iteration from the first loop and fusing the resulting loops:
if (2 <= n) then
  b[2] = b[2] + b[2]
end if
do all i = 3, n
  b[i] = b[i] + b[2]
  a[i] = a[i] + c
end do all]

Since peeling simply breaks a loop into sections without changing the iteration order, it can be applied to any loop.

6.3.6 Loop Normalization

Normalization converts all loops so that the induction variable is initially 1 (or 0) and is incremented by 1 on each iteration [Allen and Kennedy 1987]. This transformation can expose opportunities for fusion and simplify inter-loop dependence analysis, as shown in Figure 27.
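A C sketch of normalization enabling fusion, analogous to Figure 27 (0-based indices; the function name is illustrative):

```c
#include <assert.h>

/* Before normalization the two loops run over different ranges:
 *   for (i = 0; i < n; i++)      a[i] = a[i] + c;
 *   for (i = 1; i < n + 1; i++)  b[i] = a[i-1] * b[i];
 * Rewriting the second loop to also start at 0 gives both loops
 * identical iteration control, so they can be fused. */
void fused_after_normalization(double *a, double *b, int n, double c)
{
    for (int i = 0; i < n; i++) {
        a[i] = a[i] + c;              /* body of the first loop       */
        b[i + 1] = a[i] * b[i + 1];   /* normalized second-loop body  */
    }
}
```

Fusion is legal here because the normalized second statement reads only the a[i] that the first statement has just written in the same iteration, matching the original execution order.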
It can also help to reveal which loops are candidates for peeling followed by fusion. The most important use of normalization is to permit the compiler to apply subscript analysis tests, many of which require normalized iteration ranges.

[Figure 27. Loop normalization.
(a) original loops:
do i = 1, n
  a[i] = a[i] + c
end do
do i = 2, n+1
  b[i] = a[i-1] * b[i]
end do
(b) after normalization, the two loops can be fused:
do i = 1, n
  a[i] = a[i] + c
end do
do i = 1, n
  b[i+1] = a[i] * b[i+1]
end do]

6.3.7 Loop Spreading

Spreading takes two serial loops and moves some of the computation from the second to the first so that the bodies of both loops can be executed in parallel [Girkar and Polychronopoulos 1988].

An example is shown in Figure 28: the two loops in the original program (a) cannot be fused because they have different bounds, and there would be a dependence S2 δ S1 in the fused loop due to the write to a in S1 and the read of a in S2. By executing the statement S2 three iterations later within the new loop, it is possible to execute the two statements in parallel, which we have indicated textually with the COBEGIN/COEND compound statement in Figure 28(b).

[Figure 28. Loop spreading.
(a) original loops:
do i = 1, n/2
  a[i+1] = a[i+1] + a[i]
end do
do i = 1, n-3
  b[i+1] = b[i+1] + b[i] + a[i+3]
end do
(b) after spreading:
do i = 1, n/2
  COBEGIN
    a[i+1] = a[i+1] + a[i]
    if (i > 3) then
      b[i-2] = b[i-2] + b[i-3] + a[i]
    end if
  COEND
end do
do i = n/2-3, n-3
  b[i+1] = b[i+1] + b[i] + a[i+3]
end do]

The number of iterations by which the body of the second loop must be delayed is the maximum dependence distance between any statement in the second loop and any statement in the first loop, plus 1. Adding one ensures that there are no dependences within an iteration. For this reason, there must not be any scalar dependences between the two loop bodies.

Unless the loop bodies are large, spreading is primarily beneficial for exposing instruction-level parallelism. Depending on the amount of instruction parallelism achieved, the introduction of a conditional may negate the gain due to spreading. The conditional can be removed by peeling the first few (in this case 3) iterations from the first loop, at the cost of additional loop overhead.

6.4 Loop Replacement Transformations

This section describes loop transformations that operate on whole loops and completely alter their structure.

6.4.1 Reduction Recognition

A reduction is an operation that computes a scalar value from an array. Common reductions include computing either the sum or the maximum value of the elements in an array. In Figure 29(a), the sum of the elements of a are accumulated in the scalar s. The dependence vector for the loop is (1), or (<). While a loop with direction vector (<) must normally be executed serially, reductions can be parallelized if the operation performed is associative. Commutativity provides additional opportunities for reordering. In Figure 29(b), the reduction has been vectorized by using vector adds (the inner do all loop) to compute TS; the final result is computed from TS using a scalar loop. For semicommutative and semiassociative operators like floating-point multiplication, the validity of the transformation depends on the language se-
In Figure 29(a), thesum of the elementsof a are accumulatedin the scalar S. The dependencevector forthe loop is (l), or ( <). While a loop withdirectionvector(<)mustnormallybeexecutedserially,reductionscan be parallelizedif the operationperformedisassociative.Commutativityprovidesadditionalopportunitiesfor reordering.In Figure 29(b), the reductionhas beenvectorizedby using vector adds (the inner do all loop) to computeTS; the finalresult is computedfrom TS using a scalarloop. For semicommutativeand semiassociativeoperatorslikefloating-pointmultiplication,the validityof the transformationdependson the languagese-ACMComputingSurveys,Vol. 26, No, 4, December1994374“Daviddoi=l,F.