1994. Compiler Transformations for High-Perforamce Computing (798436), страница 7

Файл №798436 1994. Compiler Transformations for High-Perforamce Computing (1994. Compiler Transformations for High-Perforamce Computing) 7 страница1994. Compiler Transformations for High-Perforamce Computing (798436) страница 72019-09-182019-09-18СтудИзба

1994. Compiler Transformations for High-Perforamce Computing

Просмтор этого файла доступен только зарегистрированным пользователям. Но у нас супер быстрая регистрация: достаточно только электронной почты!

Регистрация/авторизация

Текст из файла (страница 7)

Afterthe bounds aredoi2=fZ1+12,do ilfu1+u2= max(ll,min(ul,[(i, – u2)/fD,[(iz – 12)/f 1)CompilerAn alternativemethodfor handlingwavefrontcomputationsissupernodepartitioning[Irigoinand Triolet1988].Transformationsdoi=l,365●ndo j=1,a[i,nj]= a[i-1,j+i]+ Iend do6.2.3 Loop Reversalend do(a) originalReversalchanges the directionin whichthe loop traversesits iterationrange[Wedel 19751. It is often used in conjunction with other iterationspace reorderingtransformationsbecause it changesthedependencevectors [Wolfe 1989b].As an optimizationin its own right,reversalcan reduceloop overheadbyeliminatingthe need for a compareinstructionon architectureswithouta compound compare-and-branch(such as theAlpha [Sites 1992]), The loop is reversedso that the iterationvariableruns downto zero, allowingthe loop to end with abranch-if-not-equal-to-zeroinstruction(BNEZ, on S-DLX).Reversalcan also eliminatethe needfor temporaryarraysin implementingFortran90 array statements(see Section6.4.3).If loop p in a nest of d loops is reversed, then for each dependencevectorV, the entryUP is negated.The reversalis legalif each resultingvectorV’ islexicographicallypositive,that is, when=Oor3q<p:v~>0.‘PFor instance,the innerloop of a loopnestwithdirectionvectors{(< , =),( <, >)} can be reversed,because the resultingdependenceareallstilllexico~aphicallypositive.Figure16 shows how reversalcan enable loop interchange:the originalloopnest (a) has the distancevector (1, – 1)thatpreventsinterchangebecausetheresultingdistancevector( – 1, 1) is notlexicographicallypositive;the reversedloop nest (b) can legally be interchanged.6.2.4 Strip MiningStrip miningis a method of adjustingthegranularityof an operation,especiallyaparallelizableoperation[Abu-Sufahet al.1981; Allen 1983; Loveman1977].An example is shown in Figure17.

Thestrip-minedcomputationis expressedinloop nest:distanceInterchangedoi=l,vector (1, - 1).is not possible.ndoj=n,l,-1a[i,j]= a[i-1,j+i]+ Iend doend do(b) innerloop reversed:LoopsFigure 16.do i=l,directionvector (1, 1).may be interchanged.Loop reversal.na[i]= a[i]+ cend do(a) originafloopTN = (n/64)*64do TI=I,a[TITN,64:TI+63]end dodo i=TN+l,= a[TI:TI+63]+ cna[i]= a[i]end do+ c(b) after strip miningR9 = addressofLVVI,R9ADDSVVI,Vl,F8,R9Sv(c) vectora[TI]VI; VI<-a[TI; vi<-VI; a[TIassemblya[TIFigure 17.:TI+63]:TI+63]+ c (F8=c)<-code for the updateVIof:TI+63]Strip mining.array notation,and is equivalentto a doall loop. Cleanupcode is needed if theiterationlength is not evealy divisiblebythe strip length.One of the most commonuses of stripminingis to choose the numberof independentcomputationsin the innermostACM Computing Surveys, Vol.

26, No. 4, December 1994366eDavidF. Baconet al.loop of a nest. On a vector machine,forexample,the serial loop can then be converted into a series of vector operations,each vectorcomprisinga single“strip”[CrayResearch1988]).Stripminingisalso used for SIMDcompilation[Weiss1991], for combiningsend operationsina loop on distributed-memorymultiprocessors [Hiranandaniet al. 1992], and forlimitingthe size of compiler-generatedtemporaryarrays [Abu-Sufah1979; Wolfe1989b].Stripminingoftenrequiresothertransformationsto be performedfirst.Loop distribution(see Section6.2.7) canexpose simple loops from withinan original loop nest that is too complex to stripmine. Loop interchangecan be used tomove a parallelizableloop into the innermost positionof a loop nest or to maximize the length of the strip.The examplesgiven above demonstratehow strip miningcan create a larger unitof work out of smallerones. The transformationcan also be used in the reversedirection,as shown in Section 7.4.6.2.5 Cycle ShrinkingCycle shrinkingis essentiallya specializationof strip mining.When a loop hasdependencethat preventit from beingexecuted in parallel(that is, convertedtoa do all), the compilermay still be ableto expose some parallelismif the dependence distanceis greaterthanone.

Inthis case cycle shrinkingwill convertaserial loop into an outer serial loop andan inner parallelloop [Polychronopoulos1987a]. Cycle shrinkingis primarilyuseful for exposingfine-grainedparallelism.For instance,in Figure18(a), a[i + k] iswrittenin iterationi and read in iteration i + k; the dependencedistanceis k.Consequentlythe first k iterationscan beperformedin parallelprovidedthat noneof the subsequentiterationsis allowed tobegin untilthe first k are complete.Thesame is then done for the next k iterations, as shown in Figure18(b). The iterationspace dependenceare showninFigure18(c): each group of k iterationsisdependentonly on the previousgroup.ACMComputmgSurveys,Vol26, No. 4, December1994doi=i,i2na[i+k]= b[i]b[i+k]= a[i]enda) becauseof theof thedoTI=l,do2iendto a, S1 SS2;becauseS2~S1.to b,= TI , TI+k-1= b[i]b[i+k]endwritewriten,kalla[i+k](b)+ c[i]do= a[i]do+ c[i]alldok iterationsbecausecan be performedthatis theminimumin paralleldependencedistance.~~o12(c) iteration436sspacewhenn = 6 and18.Cycleshrinking.Figurek = 2.The resultis potentiallya speedup bya factor of k, but k is likelyto be small(usually2 or 3); so this optimizationisnormallylimitedto exposingparallelismthat can be exploitedat the instructionlevel (for instanceby loop unrolling),Notethat k must be constantwithinthe loopand must at least be known to be positiveat compile time.6.2.6 Loop TilingTilingisizationof stripthemultidimensionalmining.Tilinggeneral(alsocalledblocking)is primarilyused to improvecache reuse ( QC) by dividingan iterationspace into tiles and transformingthe loopnest to iterateover them [Abu-Sufahetal.

1981; Gannonet al, 1988; Lam et al,1991; Wolfe 1989a]. However,it can alsobe used to improveprocessor,register,TLB, or page locality.The need for tilingis illustratedby theloop in Figure19(a) that assignsa thetransposeof b. With the j loop innermost,access to b is stride-1,while access to a isCompilerdo i=l,ndo j=l,na[i, j] = b[j,end doend doTransformationsdo i=l,a[i]i]*367n= a[i]x[i+ll+C= x[i]*7+ x[i+i]+ a[i]end do(a) originalloop(a) original loopdo alldo TI=I,n,do TJ=l,64n,do i=TI,64j]n)do i=i,min(TJ+63ufi)= b[j,x[i+l]end doi]end doendi=l,n= a[i]+cend do allmin(TI+63,do j=TJ,a[i,a[i]n= x[i]*7+ x[i+l]+ a[i](b) after loop distributiondoFigure 20.end doLoopdistribution.end do(b) tiledFigure 19,loopLooptiling.stride-n.Interchangingdoes not help,since it makesaccess to b stride-n.Byiteratingover subrectanglesof the iteration space, as shown in Figure19(b), theloop uses every cache line fully.The inner two loops of a matrixmultiply have this structure;tilingis criticalfor achievinghigh performancein densematrixmultiplication.A pair of adjacentloops can be tiled ifthey can legallybe interchanged.Aftertiling,the outer pair of loops can be interchangedto improvelocalityacrosstiles,and the innerloops can be exchanged to exploitinner-loopparallelismor registerlocality.6.2.7 Loop DistributionDistribution(also calledloop fissionorloop splitting)breaksa single loop intomany.Each of the new loops has thesame iterationspace as the original,butcontainsa subset of the statementsof theoriginalloop [Kuck1977; Kucket al.1981: Muraoka19711.Distributionis used to●create●createdence;perfectloop nests;subloopswithfewerdepen-0improveinstructiontion TLB localitybodies;0reduce memorying over fewereincreaseregistercache and instrucdue to shorterlooprequirementsarrays; andregisterpressure.reusebyby iteratdecreasingFigure20 is an examplein which distributionremovesde~endencesand allows part of a loop to ‘be run in parallel,Distributionmay be appliedto anyloop, but all statementsbelongingto adependencecycle (called a n-block [Kuck1977]) must be placed in the same loop,and if SI = Sz in the originalloop, thenthe loop containingSI must precede theloop that containsS’z.

If the loop containscontrolflow, applyingif-conversion(seeSection 5.2) can expose greateropportunitiesfor distribution.An alternativeisto usea controldependencegraph[Kennedyand McKinley1990],A specializedversionof this transformationis distributionby name,originallycalledhorizontaldistributionofname partition[Abu-Sufahet al. 1981].Ratherthan performingfull dependenceanalysison the loop, the statementsare~artitionedinto sets that referencemu~uallyexclusivevariables.These statements are guaranteedto be independent.When the arrays in questionare large,distributionby name can increasecacheACM Computing Surveys, Vol.

26, No. 4, December 19943680DavidF. Baconet al.locality.Note that the above loop cannotbe distributedusingfissionby name,since both statementsreferencea.I6.2.8 Loop Fusions20The inversetransformationof distributionis fusion(alsocalledjamming)[Ershov1966].ItcanimproveperformancebyreducingincreasingFigure 21.be fusedinstructionparallelism;theloadbalanceof paralleldence Sz = S1 in the fused loop.

The reason this would be incorrectis that beforefusing,all instancesof S1 execute beforeany Sz. Afterfusing,correspondinginstances are executedtogether.If any instance of S1 has a dependenceon (i.e.,must be executedafter) any subsequentinstanceof Sz, the fusionaltersexecution order illegally,as shown in Figure21.ComputingS1 andS2 = S’l in the fusedS2 cannotloop.6.3 Loop RestructuringIn Figure20, distributionenables theparallelizationof part of the loop.

However, fusing the two loops improvesregister and cache localitysince a[i] need onlybe loaded once. Fusionalso increasesinstructionparallelismby increasingtheratio of floating-pointoperationsto integer operationsin the loop and reducesloop overheadby a factorof two. Withlarge n, the distributedloop should runfaster on a vector machine while the fusedloop shouldbe betteron a superscalarmachine.For two loops to be fused, they musthave the same loop bounds;whenthebounds are not identical,it is sometimespossible to make them identicalby peeling (describedin Section 6.3.5) or by introducingconditionalexpressionsinto thebody of the loop.

Two loops with the samebounds may be fused if there do not existstatementsS1 in the first loop and Sz inthe seco~~ ~uch that they have a depen-ACMTwo lo~ps containingwhen0loop overhead;improvingregister,vector[Wolfe1989b], data cache, TLB, or page [AbuSufah 1979] locality;andimprovingloops.o’0Surveys,Vol26, No 4, December1994This sectiondescribesloop transformationsthatchangethe structureof theloop, but leavethe computationsperformedby an iterationof the loop bodyand their relativeorder unchanged.6.3.1 Loop UnrollingUnrollingreplicatesthe body of a loopsome numberof timescalledthe unrollingfactor (u) and iteratesby step uinsteadof step 1.

The benefitsof unrollinghave been studiedon several different architectures[Dongarraand Hind1979]; it is a fundamentaltechniqueforgeneratingthelonginstructionsequencesrequiredby VLIWmachines[Ellis19861.Unrollingmance by= reducingcanloopimprovetheperfor-overhead;“ increasinginstruction* improvingregister,parallelism;datacache,andor TLBlocality,In Figure 22, we show all three of theseimprovementsin an example.Loop overhead is cut in half because two iterationsare performedbefore the test and branchat the end of the loop. Instructionparallelismis increasedbecausethe secondassignmentcan be performedwhiletheresultsof the first are being stored andthe loop variablesare being updated.If array elementsare assigned to regisscalarters (eitherdirectlyor by usingas describedinSectionreplacement,6.5.4), registerlocalitywill improvebecause a[i] and a[i + 1] are used twice inCompilerdo i=2,n-1a[ij= a[i]+ a[i-1]* a[i+llend do(a) originaldo i=2,n-2,a[i]= a[i]a[i+l]loop2+ a[i-1]= a[i+l]* a[i+i]+ a[i]* a[i+2]end doif(mod(n-2,2)a[n-1]= i)= a[n-1]then+ a[n-2]* a[nlend if(b) loop unrolledFigure 22.twiceLoop unrollingthe loop body, reducingthe numberofloads per iterationfrom 3 to 2.If the targetmachinehas doubleormultiwordloads, unrollingoften allowsseveral loads to be combinedinto one.The if statementat the end of Figure22(b) is the loop epiloguethat must begeneratedwhen it is not knownat compile time whetherthe numberof iterations of the loop will be an exact multipleof the unrollingfactoru.

If u > 2, theloop epilogueis itself a loop.Unrollinghas the advantagethatitcan be appliedto any loop, and can bedone profitablyat both the high and thelow levels. Some compilersalso performloop rerollingprior to unrollingbecauseprogramsoften containloops that wereunrolledby hand for a differenttargetarchitecture.Figure 23(a–c) shows the effects of unrollingin more detail. The source code inFigure23(a) is translatedto the assembly code in Figure23(b), which takes 6cycles per resulton S-DLX.Afterunrolling3 times, the code requires8 cyclesper iterationor 22/s cycles per result,asshown in Figure23(c), The originalloopstalls for one cycle waitingfor the load,and for two cycles waitingfor the ADDFto complete.In the unrolledloop some ofthese cycles are filled.Transformations9369Mostcompilersfor high-performancemachineswill unrollat least the innermost loop of a nesting.Outerloop unrollingis not as universalbecauseityieldsreplicatedinstancesof the innerloops.To avoidthe additionalcontroloverhead,the compilercan often fuse thecopies back together,yieldingthe sameloop structurethat appearedin the original code.

Характеристики

Тип файла

PDF-файл

Размер

6,02 Mb

Материал

1994. Compiler Transformations for High-Perforamce Computing

Тип материала

Статья

Предмет

Конструирование компиляторов

Высшее учебное заведение

МГУ им. Ломоносова

Список файлов статьи

1994.-compiler-transformations-for-high-perforamce-computing.rar

1994. Compiler Transformations for High-Perforamce Computing.pdf

Прочти меня!!!.txt

Поделитесь ссылкой:

Ставлю 10/10
Все нравится, очень удобный сайт, помогает в учебе. Кроме этого, можно заработать самому, выставляя готовые учебные материалы на продажу здесь. Рейтинги и отзывы на преподавателей очень помогают сориентироваться в начале нового семестра. Спасибо за такую функцию. Ставлю максимальную оценку.

Лучшая платформа для успешной сдачи сессии
Познакомился со СтудИзбой благодаря своему другу, очень нравится интерфейс, количество доступных файлов, цена, в общем, все прекрасно. Даже сам продаю какие-то свои работы.

Студизба ван лав ❤
Очень офигенный сайт для студентов. Много полезных учебных материалов. Пользуюсь студизбой с октября 2021 года. Серьёзных нареканий нет. Хотелось бы, что бы ввели подписочную модель и сделали материалы дешевле 300 рублей в рамках подписки бесплатными.

Отличный сайт
Лично меня всё устраивает - и покупка, и продажа; и цены, и возможность предпросмотра куска файла, и обилие бесплатных файлов (в подборках по авторам, читай, ВУЗам и факультетам). Есть определённые баги, но всё решаемо, да и администраторы реагируют в течение суток.

Маленький отзыв о большом помощнике!
Студизба спасает в те моменты, когда сроки горят, а работ накопилось достаточно. Довольно удобный сайт с простой навигацией и огромным количеством материалов.

Студ. Изба как крупнейший сборник работ для студентов
Тут дофига бывает всего полезного. Печально, что бывают предметы по которым даже одного бесплатного решения нет, но это скорее вопрос к студентам. В остальном всё здорово.

Спасательный островок
Если уже не успеваешь разобраться или застрял на каком-то задание поможет тебе быстро и недорого решить твою проблему.

Всё и так отлично
Всё очень удобно. Особенно круто, что есть система бонусов и можно выводить остатки денег. Очень много качественных бесплатных файлов.

Отзыв о системе "Студизба"
Отличная платформа для распространения работ, востребованных студентами. Хорошо налаженная и качественная работа сайта, огромная база заданий и аудитория.

Отличный помощник
Отличный сайт с кучей полезных файлов, позволяющий найти много методичек / учебников / отзывов о вузах и преподователях.

Отлично помогает студентам в любой момент для решения трудных и незамедлительных задач
Хотелось бы больше конкретной информации о преподавателях. А так в принципе хороший сайт, всегда им пользуюсь и ни разу не было желания прекратить. Хороший сайт для помощи студентам, удобный и приятный интерфейс. Из недостатков можно выделить только отсутствия небольшого количества файлов.

Спасибо за шикарный сайт
Великолепный сайт на котором студент за не большие деньги может найти помощь с дз, проектами курсовыми, лабораторными, а также узнать отзывы на преподавателей и бесплатно скачать пособия.