1994. Compiler Transformations for High-Perforamce Computing, страница 6
Описание файла
PDF-файл из архива "1994. Compiler Transformations for High-Perforamce Computing", который расположен в категории "". Всё это находится в предмете "конструирование компиляторов" из 7 семестр, которые можно найти в файловом архиве МГУ им. Ломоносова. Не смотря на прямую связь этого архива с МГУ им. Ломоносова, его также можно найти и в других разделах. .
Просмотр PDF-файла онлайн
Текст 6 страницы из PDF
Note that,as withloop-invariantcode motion,if there is any chance thatthe conditionevaluationwillcause anexception,it must be guardedby a testthat the loop will be executed.In a loop nest where the inner loop hasunknownbounds,if code is generatedstraightforwardlythere will be a test before the body of the inner loop to determine whetherit shouldbe executedatall. The test for the innerloop will berepeatedevery time the outer loop is executed.
When the compileruses an intermediaterepresentationof the programthat exposes the test explicitly,unswitching can be appliedto move the test outside of the outer loop. The RS/6000XLC/Fortrancompileruses unswitchingforthis purpose [0’Brienet al. 1990].6.2moregeneraltermreferringto anytransfo~mationthatmoves a- comput~tion to an earlierpoint in the program[Aho et al. 1986].
Whileloop-invariantcode motionis one particularlycommonformof hoisting,thereare others:anexpressioncan be evaluatedearliertoreduce registerpressureor avoid arithmetic unit latency,or a load instructionmightbe movedupwardto reducetheeffect of memoryaccess latency.TransformationsLoopRerxderingIn this section we describetransformations thatchangethe relativeorder ofexecutionof the iterationsof a loop nestor nests. These transformationsare primarilyused to expose parallelismandimprovememorylocality.Some compilersonly apply reorderingtransformationsto perfect loop nests (seeSection 6.2.7). To increase the opportunities for optimization,such a compilercansometimesapply loop distributionto extract perfect loop nests from an imperfectnesting.The compilerdetermineswhetheraloop can be executed in parallelby examiningthe loop-carrieddependence.Theobvious case is when all the dependencedistancesfor the loop are O (direction=), meaningthatthereare no dependencecarriedacross iterationsby theloop.
Figuren(a) is an example;the distance vector for the loop is (O, 1), so theouter loop is parallelizable.ACM ComputingSurveys,Vol. 26, No. 4, December1994362Daviddoi=l,j]= a[i,+ cdo ia[*]1024* stride,= a[i]stride+ cend dodoi=i,loop is parallelizable(a) loop withndoj=l,I etride\ CacheMissesna[i,j]= a[i-l,j]+ a[i-l,j+l]dovaryingITLBstrideIMissesRelative\Speed (%)16421002128483do(b)innerloopDependenceis parallelizable.conditionsforparallelizingloops.More generally,the pth loop in a loopnest is parallelizableif for every distancevector V=(vl,. .
..vP. v~), v~),up =ov3q<p:vq>o.In Figure1 l(b),thedistancevectorsare {(1, 0), (1, – l)}, so the innerloop isparallelizable.Bothreferenceson theright-handside of the expressionreadelementsof a from row i — 1, which waswrittenduringthe previousiterationofthe outer loop. Thereforethe elementsofrow i may be calculatedand writteninany order.Loop interchangeexchangesthe positionof two loops in a perfect loop nest, generally movingone of the outer loops to theinnermostposition[Allenand Kennedy1984; 1987; Wolfe 1989b]. Interchangeisone of the most powerfultransformationsand can improveperformancein manyways.Loop interchangemay be performedto:eenablevectorizationan inner, dependentindependentloop;*improvevectorizationby movingtheindependentloop with the largest rangeinto the innermostposition;ebyimproveparallelperformanceing an in-depende-ntloop outwa~dComputingSurveys,by interchangingloop with an outer,Volmov-in a26, No 4, December1994tI116 I64 I1024 I1024 I32 I128 I2319256 I102451212512 I102410248(b) effect of strideFigure 12.
Predictedeffect of stride on performance of an IBM RS/6000for the above loop.Array elements are double precision (8 bytes): missrates are per 1024 iterations.Beyond stride 16,TLB misses dominate [IBM 1992],nest to increase the granularityofeach iterationand reduce the numberof barriersynchronizations;loop. reduce●6.2.1 Loop InterchangeACM= 1,a[i](a) outerFigure Il.j-1]dodoendprecisiondoublena[i,endendet al.ndoj=2,endF. Baconstride,ideallyto stride1; andincreasethe numberof loop-invariantexpressionsin the inner loop.Care must be taken that these benefitsdo not canceleach otherout, For instance,an interchangethatimprovesregisterreuse may change a stride-1access patternto a stride-naccess patternwith much lower overall performancedueto increasedcache misses.Figure12demonstratesthe dramaticeffect of different strides on an IBM RS /6000.In Figure13(a), the inner l’oop accessesarray a with stride n (recall that we areassumingthe Fortranconventionof column-major storageorder).Byinterchangingthe loops, we convert the innerloopto stride-1access,as showninFigure13(b).For a large array in whichless thanone columnfits in the cache, this opti-Compilerdo i= l,ndo jdoi=2,= l,ntotalTransformationsdo j[i]= total[il+ a[i,jlnj]= a[i-l,j+i]end doend doend do(a) original(a)loop nestJJdo j= l,ndo i = i,ntotal363= 1, n-1a[i,end do●[i]123123= total[il+ a[i,jlend doenddo(b) interchangedFigure 13.loop nestLoop interchange,(b)(c)Figure 14.
Original loop (a); original traversalder (b); traversal order after interchange (c).mizationreducesthe numberof cachemisses on a from n 2 to n 2 X elementsizeilinesize,or n 2/ 16 with4-byte elementsand the 64-byte lines of S-DLX. However,the originalloopallowstotal [i] to beeliminating theplacedin a register,load\ store operationsin the innerloop(see scalar replacementin Section 6.5.4).So the optimizedversionincreasesthenumberof load/storeoperationsfor totalfrom 2n to 2n 2. If a fits in the cache, theoriginalloop is better.On a vectorarchitecture,the transformedloopenablesvectorizationbyeliminatingthe dependenceon total [i] inthe inner loop.Interchangingloops is legal when thealtereddependenceare legal and whenthe loop boundscan be switched.If twoloops p and q in a perfect loop nest of dloops are interchanged,each dependencevectorV=(ul,.
. ..up. vq, ,,u~). .,u~)intheoriginalloopnestbecomesV’ =(vi,...,u~,UP,,,v~)..,v~)in the transformedloop nest. If V’ is lexicographicallypositive,thenthedependencerelationshipsof the originalloop are satisfied.A loop nest of just two loops can beinterchangedunless it has a dependencevector of the form ( < , >). Figure14(a)shows the loop nest with the dependence(1, – 1), givingrise to the loop-carrieddependenceshown in Figure14(b). Theorderin whichthe iterationsare exe-or-cutedis shownby thedottedline.The traversalorder after interchangeisshownin Figure14(c): some iterationsare executedbefore iterationsthat theydepend on, so the interchangeis illegal.Switchingthe loop bounds is straightforwardwhen the iterationspace is rectangular,as in the loop nest in Figure13.In this case the bounds of the inner loopare independentof the indices of the containingloop, and the two can simplybeexchanged.When the iterationspace isnot rectangular,computingthe bounds ismore complex.Triangularspaces are often used by programmers,and trapezoidalspacesare introducedby loopskewing(discussedin the next section).Furthertechniquesarenecessarytomanageimperfectlynestedloops.
Someof the variationsare discussedin detailby Wolfe[ 1989b]and Wolfand Lam[1991].6.2.2 Loop SkewingLoop skewingis an enablingtransformation that is primarilyuseful in combinationwithloopinterchange[Lamport1974;Muraoka1971;Wolfe1989b].Skewingwas inventedto handlewavefront computations,so called because theupdatesto the arraypropagatelike awave across the iterationspace.ACMComputingSurveys,Vol. 26, No.
4, December1994364“doI=David2,F. Baconet al.dex variablesto compensate,skewingdoes not change the meaningof the program and is always legal.The originalloop nest in Figure15(a)can be interchanged,but neitherloop canbe parallelized,because there is a dependence on both the inner loop (O, 1) and onthe outer loop (1, O).The result when f = 1 is shown in Figure15(c–d).The transformedcode isequivalentto the original,but the effecton the iterationspace is to alignthediagonalwave frontsof the originalloopnest so that for a given valueof j, alliterationsin i can be executed in parallel.To expose this parallelism,the skewedloop nest must also be interchanged.After skew and interchange,the loop nesthas distancevectors{(1, O), (1, 1)}. Thefirst dependenceallows the inner loop tobe parallelizedbecause the corresponding dependencedistanceis O.
The seconddependenceallowsthe innerloop to beparallelizedbecause it is a dependenceon previousiterationsof the outer loop.Skewingcan expose parallelismfor anest of two loops witha set of distancevectors {Vk } only ifn-1do J = 2, m-1a[l, Jl = (a[~-t,Jl+ a[i,j-11+a[l+l,jl+ a[l,J+ll)/4end doend do(a) ongmalcode(b) originaldo1 = 2,do Idependencespace{(1, O), (O, 1)}(c) skewed spacen-1= i+2,a[l,~-11i+m-1= (a[l-i,a[l+l,j-d+ a[i,j-1-ii+j-11+ a[l,J+l-ll)/4end doend do(d) skewed codedoI= 4,{(1, 1), (O, 1)}n+n-2do 1 = max(2,a[i,dependencej-dJ-m+l),min(n-l,= (a[i-l,a[i+l,~-2)j-d+ a[l,j-1-1]+j-ii+ a[l,]+l-d)/4end doenddo(e) skewedand interchangedcodedependence{(1,0),(1,1)}Figure 15.Loop skewing.In Figure15(a) we showa typicalwavefrontcomputation,Each elementiscomputedby averagingits four nearestneighbors,Whileneitherof the loops isparallelizablein their originalform, eacharraydiagonal(“ wavefront”)couldbecomputedin parallel.The iterationspaceand dependenceare shownm Figure15(b), with the dotted lines indicatingthewave fronts.Skewingis performedby addingtheouterloop indexmultipliedby a skewfactor,f, to the bounds of the inner iteration variable,and then subtractingthesame quantityfrom every use of the inner iterationvariableinside the loop.
Because it alters the loop bounds but thenalters the uses of the correspondingin-ACMComputmgSurveys,Vol26, No4, December1994distanceWhen skewingby f, the originalvector (vi, vz) becomes (rJl, fvl + Pz). Forany dependencewhere Uz < 0, the goal isf suchthatful + V2 > 1. Theto findcorrect skew factor f is computedby taking the maximumof f, = [(1 – u12)/u,llover all the dependence[Kennedyet al.1993].Interchangingof skewed loops is complicatedby the fact thattheirboundsdepend on the iterationvariablesof theenclosingloops.Fortwoloopswithil = 11, U1 and iz = lZ, Uz whereboundslZ and Uz are expressionsindependentofil, the skewed inner loop has boundsizinterchange,= fi ~ + 12, fi ~ + U2.