1994. Compiler Transformations for High-Performance Computing, page 18
They have come up with four approaches to the problem:

● examine the compiler output by hand to evaluate its ability to transform code;
● measure the performance of one compiler against another;
● compare the performance of executables compiled with full optimization against little or no optimization; and
● compare an application compiled for parallel execution against the sequential version running on one processor.

Nobayashi and Eoyang [1989] compare the performance of supercomputer compilers from Cray, Fujitsu, Alliant, Ardent, and NEC. The compilers were applied to various loops from the Livermore and Argonne test suites that required restructuring before they could be computed with vector operations.

Carr [1993] applies a set of loop transformations (unroll-and-jam, loop interchange, tiling, and scalar replacement) to various scientific applications to find a strategy that optimizes cache utilization. Wolf [1992] uses inner loops from the Perfect Club and from one of the SPEC benchmarks to compare the performance of hand-optimized to compiler-optimized code.
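The last two approaches boil down to computing a speedup ratio from measured run times; on a parallel machine the speedup is often normalized by the processor count to give an efficiency. A minimal sketch of the arithmetic (the helper names are ours, not from the text):

```python
def speedup(t_baseline: float, t_optimized: float) -> float:
    """Ratio of baseline run time to optimized (or parallel) run time."""
    return t_baseline / t_optimized

def efficiency(speedup_value: float, processors: int) -> float:
    """Fraction of ideal linear speedup actually achieved."""
    return speedup_value / processors

# A speedup of 2.36 on a 32-processor machine is only about 7% efficiency:
print(round(efficiency(2.36, 32), 3))  # -> 0.074
```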
The focus is on improving locality to reduce traffic between memory and the cache.

Relatively few studies have been performed to test the effectiveness of real compilers in parallelizing real programs. However, the results of those that have are not encouraging. One study of four Perfect benchmark programs compiled on the Alliant FX/8 produced speedups between 0.9 (that is, a slowdown) and 2.36 out of a potential 32; when the applications were tuned by hand, the speedups ranged from 5.1 to 13.2 [Eigenmann et al. 1991]. In another study of 12 Perfect benchmarks compiled with the KAP compiler for a simulated machine with unlimited parallelism, 7 applications had speedups of 1.4 or less; two applications had speedups of 2-4; and the rest were sped up 10.3, 66, and 77. All but three of these applications could have been sped up by a factor of 10 or more [Padua and Petersen 1992].

Lee et al. [1985] study the ability of the Parafrase compiler to parallelize a mixture of small programs written in Fortran. Before compilation, while loops were converted to do loops, and code for handling error conditions was removed. With 32 processors available, 4 out of 15 applications achieved 30% efficiency, and 2 achieved 10% efficiency; the other 9 out of 15 achieved less than 10% efficiency. Out of 89 loops, 59 were parallelized, most of them loops that initialized array data structures. Some coding idioms that are amenable to improved analysis were identified.

Some preliminary results have also been reported for the Fortran D compiler [Hiranandani et al. 1993] on the Intel iPSC/860 and Thinking Machines CM-5 architectures.

9.4 Instruction-Level Parallelism

In order to evaluate the potential gain from instruction-level parallelism, researchers have engaged in a number of studies to measure an upper bound on how much parallelism is available when an unlimited number of functional units is assumed. Some of these studies are discussed and evaluated in detail by Rau and Fisher [1993].

ACM Computing Surveys, Vol. 26, No. 4, December 1994 ● David F. Bacon et al.

Early studies [Tjaden and Flynn 1970] were pessimistic in their findings, measuring a maximum level of parallelism on the order of two or three, a result that was called the Flynn bottleneck. The main reason for the low numbers was that these studies did not look beyond basic blocks. Parallelism can be exploited across basic block boundaries, however, on machines that use speculative execution. Instead of waiting until the outcome of a conditional branch is known, these architectures begin executing the instructions at either or both potential branch targets; when the conditional is evaluated, any computations that are rendered invalid or useless must be discarded. When the basic block restriction is relaxed, there is much more parallelism available.

Riseman and Foster [1972] assumed replicated hardware to support speculative execution. They found that there was significant additional parallelism available, but that exploiting it would require a great deal of hardware. For the programs studied, on the order of 2^16 arithmetic units would be necessary to expose 10-way parallelism.

A subsequent study by Nicolau and Fisher [1984] sought to measure the parallelism that could be exploited by a VLIW architecture using aggressive compiler analysis. The strategy was to compile a set of scientific programs into an intermediate form, simulate execution, and retroactively apply a scheduling algorithm that used branch and address information revealed by the simulation. The study revealed significant amounts of parallelism: a factor of tens or hundreds.

Wall [1991] took trace data from a number of applications and measured the amount of parallelism under a variety of models.
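The "unlimited functional units" bound that these studies measure is the dataflow limit: the total number of operations divided by the length of the longest chain of true dependences, since only data dependences constrain the schedule. A small illustrative sketch (the operation list and dependence edges are invented for the example):

```python
def dataflow_ilp(ops, deps):
    """Upper bound on parallelism with unlimited functional units.
    ops: list of operation ids; deps: dict mapping an op to the ops it reads.
    Returns (critical_path_length, available_parallelism)."""
    depth = {}
    def level(op):
        # An op can run one step after the deepest op it depends on.
        if op not in depth:
            depth[op] = 1 + max((level(d) for d in deps.get(op, [])), default=0)
        return depth[op]
    critical = max(level(op) for op in ops)
    return critical, len(ops) / critical

# Four independent multiplies feeding a chain of three adds:
ops = ["m0", "m1", "m2", "m3", "a0", "a1", "a2"]
deps = {"a0": ["m0", "m1"], "a1": ["a0", "m2"], "a2": ["a1", "m3"]}
critical, ilp = dataflow_ilp(ops, deps)  # critical path 4, ILP = 7/4 = 1.75
```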
His study considered the effect of architectural features such as varying numbers of registers, branch prediction, and jump prediction. It also considered the effects of alias analysis in the compiler. The results varied widely, yielding as much as 60-way parallelism under the most optimistic assumptions. The average parallelism on an ambitious combination of architecture and compiler support was 7.

Butler et al. [1991] used an optimizing compiler on a group of programs from the SPEC89 suite. The resulting code was executed on a variety of simulated machines. The range includes unrealistically omniscient architectures that combine data-flow strategies for scheduling with perfect branch prediction. They also considered restricted models with limited hardware. Under the most optimistic assumptions, they were able to execute 17 instructions per cycle; more practical models yielded between 2.0 and 5.8 instructions per cycle. Without looking past branches, few applications could execute more than 2 instructions per cycle.

Lam and Wilson [1992] argue that one major reason why studies of instruction-level parallelism differ markedly in their results is branch prediction. The studies that perform speculative execution based on prediction yield much less parallelism than the ones that have an oracle with perfect knowledge about branches. The authors speculate that executing branches locally will not yield large degrees of parallelism; they conduct experiments on a number of applications and find similar results to other studies (between 4.2 and 9.2). They argue that compiler optimizations that consider the global flow of control are essential.

Finally, Larus [1993] takes a different approach than the others, focusing strictly on the parallelism available in loops.
He examines six codes (three integer programs from SPEC89 and three scientific applications), collects a trace annotated with semantic information, and runs it through a simulator that assumes a uniform shared-memory architecture with infinite processors. Two of the scientific codes, the ones that were primarily array manipulators, showed a potential speedup in the range of 65-250. The potential speedup of the other codes was 1.7-3, suggesting that the techniques developed for parallelizing scientific codes will not work well in parallelizing other types of programs.

CONCLUSION

We have described a large number of transformations that can be used to improve program performance on sequential and parallel machines. Numerous studies have demonstrated that these transformations, when properly applied, are sufficient to yield high performance.

However, current optimizing compilers lack an organizing principle that allows them to choose how and when the transformations should be applied. Finding a framework that unifies many transformations is an active area of research. Such a framework could simplify the problem of searching the space of possible transformations, making it easier and therefore cheaper to build high-quality optimizing compilers. A key part of this problem is the development of a cost function that allows the compiler to evaluate the effectiveness of a particular set of transformations.

Despite the absence of a strategy for unifying transformations, compilers have proven to be quite successful in targeting sequential and superscalar architectures. On parallel architectures, however, most high-performance applications currently rely on the programmer rather than the compiler to manage parallelism. Compilers face a large space of possible transformations to apply, and parallel machines exact a very high cost of failure when optimization algorithms do not discover a good reorganization strategy for the application.

Because efforts to parallelize traditional languages automatically have met with little success, the focus of research has shifted to compiling languages in which the programmer shares the responsibility for exposing parallelism and choosing a data layout.

Another developing area is the optimization of nonscientific applications. Like most researchers working on high-performance architectures, compiler writers have focused on code that relies on loop-based manipulation of arrays. Other kinds of programs, particularly those written in an object-oriented programming style, rely heavily on pointers. Optimization of pointer manipulation and of object-oriented languages is emerging as another focus of compiler research.

APPENDIX: MACHINE MODELS

A.1 Superscalar DLX

A superscalar processor has multiple functional units and can issue more than one instruction per clock cycle.
Current examples of superscalar machines are the DEC Alpha [Sites 1992], HP PA-RISC [Hewlett-Packard 1992], IBM RS/6000 [Oehler and Blasgen 1991], and Intel Pentium [Alpert and Avnon 1993].

S-DLX is a simplified superscalar RISC architecture. It has four independent functional units for integer, load/store, branch, and floating-point operations. In every cycle, the next two instructions are considered for execution. If they are for different functional units, and there are no dependences between the instructions, they are both initiated. Otherwise, the second instruction is deferred until the next cycle. S-DLX does not reorder the instructions. Most operations complete in a single cycle.
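The issue rule just described can be sketched as a tiny in-order model. The Instr encoding and the pairing check below are our illustrative assumptions, not part of S-DLX's definition, and they ignore multicycle latencies:

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Instr:
    unit: str                    # "int", "loadstore", "branch", or "fp"
    dst: Optional[str] = None    # register written, if any
    srcs: Tuple[str, ...] = ()   # registers read

def cycles_to_issue(code):
    """Cycles needed to issue `code`: two instructions issue together only if
    they use different functional units and the second does not depend on the
    first; otherwise the second waits for the next cycle."""
    cycles, i = 0, 0
    while i < len(code):
        cycles += 1
        if i + 1 < len(code):
            a, b = code[i], code[i + 1]
            # No RAW (b reads a's result) and no WAW (same destination).
            independent = a.dst is None or (a.dst not in b.srcs and a.dst != b.dst)
            if a.unit != b.unit and independent:
                i += 2
                continue
        i += 1
    return cycles

code = [
    Instr("int", dst="R1", srcs=("R2", "R3")),
    Instr("loadstore", dst="F0", srcs=("R4",)),  # pairs with the add above
    Instr("fp", dst="F2", srcs=("F0", "F1")),    # reads F0: waits a cycle
]
print(cycles_to_issue(code))  # -> 2
```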
When an operation takes more than one cycle, subsequent instructions that use results from multicycle instructions are stalled until the result is available. Because there is no instruction reordering, when an instruction is stalled no instructions are issued to any of the functional units.

S-DLX has a 32-bit word, 32 general-purpose registers (GPRs, denoted by Rn), and 32 floating-point registers (FPRs, denoted by Fn). The value of R0 is always 0. The FPRs can be used as double-precision (64-bit) register pairs. For the sake of simplicity we have not included the double-precision floating-point instructions.

Memory is byte addressable with a 32-bit virtual address. All memory references are made by load and store instructions between memory and the registers. Data is cached in a 64-kilobyte 4-way set-associative write-back cache composed of 1024 64-byte cache lines.
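Those parameters pin down the cache geometry: 65,536 bytes / 64-byte lines = 1024 lines, and 1024 lines / 4 ways = 256 sets, so a 32-bit address splits into a 6-bit byte offset, an 8-bit set index, and an 18-bit tag. A sketch of the arithmetic (the decompose helper and its field names follow the usual offset/index/tag convention, not the text):

```python
CACHE_BYTES = 64 * 1024
LINE_BYTES = 64
WAYS = 4

LINES = CACHE_BYTES // LINE_BYTES          # 1024 cache lines
SETS = LINES // WAYS                       # 256 sets
OFFSET_BITS = LINE_BYTES.bit_length() - 1  # 6 bits of byte offset
INDEX_BITS = SETS.bit_length() - 1         # 8 bits of set index

def decompose(addr):
    """Split a 32-bit address into (tag, set index, byte offset)."""
    offset = addr & (LINE_BYTES - 1)
    index = (addr >> OFFSET_BITS) & (SETS - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

tag, index, offset = decompose(0x00401F7C)  # -> (256, 125, 60)
```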