1994. Compiler Transformations for High-Perforamce Computing (798436), страница 14

Файл №798436 1994. Compiler Transformations for High-Perforamce Computing (1994. Compiler Transformations for High-Perforamce Computing) 14 страница1994. Compiler Transformations for High-Perforamce Computing (798436) страница 142019-09-182019-09-18СтудИзба

1994. Compiler Transformations for High-Perforamce Computing

Просмтор этого файла доступен только зарегистрированным пользователям. Но у нас супер быстрая регистрация: достаточно только электронной почты!

Регистрация/авторизация

Текст из файла (страница 14)

1986], requirethat all tail-recursiveproceduresbe identifiedand that they beexecutedwithoutconsumingstack spaceproportionalto the recursiondepth.3918else= . FALSE.if(a[i]= x)=TRUE,inarraythenelseinanayend if= inarray(a,x,1+1,n)return(a) A tail-recursivelogicalrealfunctionx,i,(ix,1,n)n> n)inarrayelseinarray(a,a[nlIntegerifprocedurethen= . FALSE.if(a[i]inarray= x)then= . TRUE.else1 = i+lgotoend ifireturn(b) Afterrecursiveintegerrealx,(irecursionfunctioneliminationsurmrray(a,x,i,n)i+l,n)a[n]integeriftaili,n= n)sumsrraythen= a [i]elsesunrarrayend= a[l]+sunmrray(a,x,ifreturn(c) A procedureFigure 50.Tailthatis not tail-recursiverecursionelimination.possibleto cache the resultsof recentinvocations.When the procedureis calledagainwiththesamearguments,thecached resultis used insteadof recomputingit [Abelsonand Sussman1985;Michie1968].Figure51 shows a simpleexampleofmemorization.If f is often called with thesame argumentsand f also takes a nontrivialamountof time to run, then memoizationwillsubstantiallyincreaseACM ComputingSurveys, Vol 26, No.

4, December 1994392David●y =F. Baconf(i)(a) originalfunctionlogicalf. UNCACHED[n]realf .CACHE[n]doi=l,callnf_ UNCACHED[i]endet al.=true.do...if(f_ UNCACHED[i])f. CACHE [i]f-UNCACHED[i]endythen= f(i)=false.if= f-CACHE[i](b)codeaugmentedFigure 51.for memorizationFunctionmemoizationperformance.Ifnot,it will degradeperformanceand consume memory.This exampleassumes that f’s parameter is confinedto the range1 . . . n, andthatn is not extremelylarge.A moresophisticatedmemorizationscheme wouldhash the argumentsand use a cache sizethat makes a sensibletrade-offbetweenreuseandmemoryconsumption.Forfunctionsthatreturndynamicallyallocated objects, storage managementmustbe consideredas well.For c calls and a hit rate of r-, the timeto execute the c calls isT=whereC(~th+tk is the timeoizedresultfrom(1–r)tm)to retrievethecachethe(amem-hit),andtn is the time to call f plus the overheadof discoveringthat it is not in the cache(a miss).

With a high hit rate and a largeth and tn, memorizdifferencebetweenationcansignificantlyimproveperfor-mance.describetransformationsthatare specific to parallelarchitectures.Highlyefficientparallelapplicationshave been developedby manuallyapplying the transformationsdiscussed here tosequentialcode. Duplicatingthese successes by automaticallytransformingthecode withina compileris an active areaof research.Whilea great deal of workhas been done, automaticparallelizationof full-scaleapplicationswithexistingcompilersdoes not consistentlyyield significantspeedups.Experimentsthatmeasurethe effectivenessof parallelizingcompilersare presentedin Section 9.3.Becausetheparallelizationof sequentialcode is provingto be difficult,languageslike HPF Fortran[HighPerformanceFortranForum1993] provideconstructsthat allow the programmertodeclare data decompositionstrategiesandto expose parallelism.The compileristhen responsiblefor generatingcommunicationand synchronizationcode, usingvariousoptimizations(discussedbelow)to reduce the resultingoverhead.As ofyet, however,there are few real applicationsthathave been writtenin theselanguages,and there is littleexperimental data.In this section we make extensiveuseof our modelshared-memoryand distributed-memorymultiprocessorssMXand dMX,whichare describedfullyinthe Appendix.Bothmachinesare constructedusingS-DLXprocessors.Theprogrammingenvironmentinitializestheglobal variablePnum to be the numberofprocessorsin the machineand Pid to bethe7.1FOR PARALLELMACHINESAlltheareapplicableputerACMtransformationstoa wideorganizations.ComputmgdiscussedSurveys,InVolvarietythis26, Noso farof com-section4, Decemberwe1994ofwithDataOne of thekeyinthethelocalprocessorthata compilerO).Layoutmustis7.

TRANSFORMATIONSnumber(startingmakedecisionstargetingdecompositiona multiprocessorofdataacrosstheprocessors.Theneedto distributeandmanageitisobviousindataa dis-tributed-memorymodel,programmingwherecommunicationperformanceof codememorymodelisisexplicit.Theundera sharedalsodependentonCompileravoidingunnecessarycommunication:the fact that communicationis performedimplicitlydoes not eliminateits cost.Transformationsdoallj=l,393●ndoalli=l,na[i, j] = a[i,end do allj]+ cend do all7.1.1(a)Regular Array DecompositionThe most common strategyfor decomposing an array on a parallelmachineis tomap each dimensionto a set of processors in a regularpattern.The four commonly used patternsare discussed in thefollowingsections.originalloop1234587823401235s78SerialDecompositionSerialdecompositionis the degeneratecase, where an entiredimensionis allocated to a single processor.Figure52(a)shows a simple loop nest that adds c toeach elementof a two-dimensionalarray,the decompositionof the array onto processors (b), and the loop transformedintocode for sMX (c).

Each column is mappedserially,meaningthatall the data inthat columnwill be placed on a singleprocessor. As a result,the loop that iterates over the columns(the inner loop) isunchanged.“!IIII(b)decornpos,t;on( block, sertd)of a on fourprocessorscallFORK(Pnum)do j= (n/Pnuro)*Pid+i,doi=l,rnin( (n/Pnum)*(Pid+i), n)na[i,j]= a[i,j]+ cend doend docall(c)JOIN()correspondingthe( block,loop,usingserial)blockschedulingofsize n/Pnurn*234567812A block decompositiondividesthe elementsintoone groupof adj scentelements per processor.Using the previousexample,the rows of array a are dividedbetween the four processors,as shown inFigure 52(b). Because the rows have beendividedacross the processors,the outerloop in Figure52(c) has been modifiedsothat each processor iteratesonly over thelocal portionof each row.Figure52(d) shows the decompositionif block schedulingis appliedacross boththe horizontaland verticaldimensions.A majordifferencebetweencompilation for sharedand distributed-memorymachinesis thatshared-memorymachines providea globalname space, sothat any particulararray elementcan benamed by any processorusing the samesubscriptsas in the sequentialcode.

Onthe other hand, on a distributed-memorymachine,arrays must usuallybe dividedo1233Block Decomposition148618E13(d) ( block, block)decompositionof a on fourprocessorsFigurearray52.on theSerialand block decompositionsehared-memorymultiprocessorof ansMX.acrossthe processors,each of which hasits own privateaddress space.As a result,naminga nonlocalarrayelementrequiresa mappingfunctionfrom the original,global name space to aprocessor numberand local array indiceswithinthatprocessor.Theexamplesshown here are all for sharedmemory,and thereforesidestepthesecomplexities, whichare discussedin Section7.3.However,the same decompositionprinciples are used for distributedmemory.ACMComputingSurveys,Vol26, No 4, December1994394David“F. Baconet al.doj=l,Indoi=l,f2345678jtotal1end23end= total+ a[i,jldodola(a) loop5678(Cyclic, serial)(a)decompositionprocessorsof a on fouricallFORK (Pnunddo j= Pid+l,n,doi=l,a[i,endendjl= a[i,jl+ cdodocall(b)Pnumn(b)JOIN()corresponding(CYCIZC,ser!a~)schedulingOfelementsFigure 54.of a readTriangularby theiterationloopspace.loopFigure53.Cyclicdecompositionon sMXThe advantageof block decompositionis that adj scent elementsare usuallyonthe same processor.Because a loop thatcomputesa valuefor a givenelementoften uses the values of neighboringelements, the blocks have good localityandreduce the amountof communication.CyclicDecompositionA cyclicdecompositionassignssuccessive elementsto successive processors.Acyclic allocationof columnsto processorsfor the sequentialcode in Figure52(a) isshown in Figure 53(a).

Figure 53(b) showsthe transformedcode for sMX.Cyclic decompositionhas the oppositeeffectof blocking:it hafi poor localityfor neighbor-basedcommunication,butspreads load more evenly. Cyclic decompositionworks well for loops like the onein Figure 54, as long as the array is largeenoughso thateach processorhandlesmany columns. A common algorithmthatresults in this type of load balance is LUfactorizationof matrices.A block decompositionwouldwork poorlybecause theiterationsmost costly to executewouldACMComputmgSurveys,Vol.

26, No. 4, December1994be clusteredtogetherand computedby asingle processor. With a cyclic decomposition, the expensiveiterationsare spreadacross the processors.Biock-CyclicDecompositionA block-cyclicdecompositioncombinesthe two strategies;the elementsare divided into many adjacentgroups, usuallyan integermultipleof the numberof processors available.Each group of elementsis assignedto a processorcyclically.Figure 55(a) shows an array whose columnshave been allocatedto two processorsina block-cyclicfashion,and Figure55(b)gives the code to implementthe mappingon sMX. Block-cyclicis a compromisebetween the localityof block decompositionofcyclicandtheloadbalancingdecomposition.IrregularComputationsSome loops, like the one shown in Figure56, have highlyirregularexecutionbehavior,When the varianceof executiontimes is high, none of the static regulardecompositionstrategiesmaybe sufficientto yieldgoodperformance.Avarietyof dynamicalgorithmsmakeCompiler1236S678<23i’,0101678EIll(a) (block-cyclic,serial)decompositionof a ontwo processorscallFORK(Pnum)do k = 1,do jn,doi=l,endk + 2*Pid+ ina[i,endend2*Pnum= k + 2*Pid,j]= a[i,j]+ cdododocallJOIN()(b) correspondingschedulingFigure 55.Block-cyclicdoi=l,ifof loop,(block-cyclic,usingserial)blocksize 2decompositionon sMX.n(maska[i][i]= 1)then= expensive.functionoend ifenddo(a) exampleloopaaJ&1234567(b)iterationversusFigure 56.Irregular8Scostforthe aboveexecutionloopDecomposition395behavior.schedulingdecisionsat run time basedon the historicalbehaviorof the computation [Lucco 1992; PolychronopoulosandKuck 1987; Tang and Yew 1990].Automatic“across a set of processors;in a block decomposition,for example, the array mightbe dividedinto contiguous10 x 10 subarrays.The alignmentof the arrayestablishesthe specificelementsthat arein each of those subarrays.Languageslike HPF Fortranrely onthe proWammerto declarehow arraysare to be alignedand what style of decompositionto use; the languageincludesannotationslikeBLOCKandCYCLIC.A varietyof researchprojectshave investigatedwhetherthe strategycan bedeterminedautomaticallywithoutprogrammerassistance.Programmerscanfind it difficultto choose a good set ofdeclarations,particularlysince the optimal decompositioncan change during thecourse of programexecution.The generalstrategyfor automatingdecompositionand alignmentis to express the behaviorof the programin arepresentationthat either capturescommunicationexplicitlyor allowsit to becomputed.The compilerappliesoperations that reflect differentdecompositionand alignmentchoices;the goal is tomaximizeparallelismandminimizecommunication.The approachesincludetheAlignment-DistributionGraph[Chatterjeeet al.

1993a;1993b],theComponentAffinityGraph [Li and Chen1990;1991],constraints[Gupta1992],andaffinetransformations[Andersonand Lam 1993].In additionto presentingan algorithmto find the optimalmappingof arrays toprocessorsfor one- and two-dimensionalarrays,Mace [1985;1987]shows thatfindingoptimaldecompositionsis anNP-completeproblem.7.1.37.1.2Transformationsand AlignmentThe decompositionof an arraydetermines how the elementsare distributedScalar PrivatizationThe goal behindprivatizationis to increase the amountof parallelismand toavoid unnecessarycommunication.Whena scalar is used withina loop solely as ascratchvariable,each processorcan begivena privatecopy so the use of thescalar need not involveany communication.

Характеристики

Тип файла

PDF-файл

Размер

6,02 Mb

Материал

1994. Compiler Transformations for High-Perforamce Computing

Тип материала

Статья

Предмет

Конструирование компиляторов

Высшее учебное заведение

МГУ им. Ломоносова

Список файлов статьи

1994.-compiler-transformations-for-high-perforamce-computing.rar

1994. Compiler Transformations for High-Perforamce Computing.pdf

Прочти меня!!!.txt

Поделитесь ссылкой:

Ставлю 10/10
Все нравится, очень удобный сайт, помогает в учебе. Кроме этого, можно заработать самому, выставляя готовые учебные материалы на продажу здесь. Рейтинги и отзывы на преподавателей очень помогают сориентироваться в начале нового семестра. Спасибо за такую функцию. Ставлю максимальную оценку.

Лучшая платформа для успешной сдачи сессии
Познакомился со СтудИзбой благодаря своему другу, очень нравится интерфейс, количество доступных файлов, цена, в общем, все прекрасно. Даже сам продаю какие-то свои работы.

Студизба ван лав ❤
Очень офигенный сайт для студентов. Много полезных учебных материалов. Пользуюсь студизбой с октября 2021 года. Серьёзных нареканий нет. Хотелось бы, что бы ввели подписочную модель и сделали материалы дешевле 300 рублей в рамках подписки бесплатными.

Отличный сайт
Лично меня всё устраивает - и покупка, и продажа; и цены, и возможность предпросмотра куска файла, и обилие бесплатных файлов (в подборках по авторам, читай, ВУЗам и факультетам). Есть определённые баги, но всё решаемо, да и администраторы реагируют в течение суток.

Маленький отзыв о большом помощнике!
Студизба спасает в те моменты, когда сроки горят, а работ накопилось достаточно. Довольно удобный сайт с простой навигацией и огромным количеством материалов.

Студ. Изба как крупнейший сборник работ для студентов
Тут дофига бывает всего полезного. Печально, что бывают предметы по которым даже одного бесплатного решения нет, но это скорее вопрос к студентам. В остальном всё здорово.

Спасательный островок
Если уже не успеваешь разобраться или застрял на каком-то задание поможет тебе быстро и недорого решить твою проблему.

Всё и так отлично
Всё очень удобно. Особенно круто, что есть система бонусов и можно выводить остатки денег. Очень много качественных бесплатных файлов.

Отзыв о системе "Студизба"
Отличная платформа для распространения работ, востребованных студентами. Хорошо налаженная и качественная работа сайта, огромная база заданий и аудитория.

Отличный помощник
Отличный сайт с кучей полезных файлов, позволяющий найти много методичек / учебников / отзывов о вузах и преподователях.

Отлично помогает студентам в любой момент для решения трудных и незамедлительных задач
Хотелось бы больше конкретной информации о преподавателях. А так в принципе хороший сайт, всегда им пользуюсь и ни разу не было желания прекратить. Хороший сайт для помощи студентам, удобный и приятный интерфейс. Из недостатков можно выделить только отсутствия небольшого количества файлов.

Спасибо за шикарный сайт
Великолепный сайт на котором студент за не большие деньги может найти помощь с дз, проектами курсовыми, лабораторными, а также узнать отзывы на преподавателей и бесплатно скачать пособия.