Cooper_Engineering_a_Compiler(Second Edition) (1157546), страница 100

Файл №1157546 Cooper_Engineering_a_Compiler(Second Edition) (Rice) 100 страницаCooper_Engineering_a_Compiler(Second Edition) (1157546) страница 1002019-09-182019-09-18СтудИзба

Rice

Просмтор этого файла доступен только зарегистрированным пользователям. Но у нас супер быстрая регистрация: достаточно только электронной почты!

Регистрация/авторизация

Текст из файла (страница 100)

The other three loop nests handle two, four and eight columns of m.This guarantees that the final loop nest, shown in Figure 8.1, can process thecolumns 16 at a time.Ideally, the compiler would automatically transform the original loop nestinto this more efficient version, or into whatever form is most appropriatefor a given target machine. However, few compilers include all of the optimizations needed to accomplish that goal. In the case of dmxpy, the authorperformed the optimizations by hand to produce good performance across awide range of target machines and compilers.From the compiler’s perspective, mapping the loop nest shown in Figure 8.1onto the target machine presents some hard challenges.

The loop nest contains 33 distinct array-address expressions, 16 for m, 16 for x, and onefor y that it uses twice. Unless the compiler can simplify those addresscalculations, the loop will be awash in integer arithmetic.Consider the references to x. They do not change during execution of theinner loop, which varies i. The optimizer can move the address calculationsand the loads for x out of the inner loop.

If it can keep the x values in registers, it can eliminate a large part of the overhead from the inner loop. For areference such as x(j-12), the address calculation is just @x + (j − 12) × w.To further simplify matters, the compiler can refactor all 16 references tox into the form @x + jw − ck , where jw is j · w and ck is k · w for each0 ≤ k ≤ 15. In this form, each load uses the same base address, @x + jw,with a different constant offset, ck .To map this efficiently onto the target machine requires knowledge of theavailable addressing modes.

If the target has the equivalent of iloc’s loadAIoperation (a register base address plus a small constant offset), then all theaccesses to x can be written to use a single induction variable. Its initial valueis @x + jmin · w. Each iteration of the j loop increments it by w.The 16 values of m used in the inner loop change on each iteration.

Thus,the inner loop must compute addresses and load 16 elements of m oneach iteration. Careful refactoring of the address expressions, combinedwith strength reduction, can reduce the overhead of accessing m. The value412 CHAPTER 8 Introduction to Optimization@m + j · high1 (m) · w can be computed in the j loop. (Notice that high1 (m) isthe only concrete dimension declared in dmxpy’s header.) The inner loop canproduce a base address by adding it to (i − 1) · w. Then, the 16 loads can usedistinct constants, ck · high1 (m), where ck is k · w for each 0 ≤ k ≤ 15.To achieve this code shape, the compiler must refactor the address expressions, perform strength reduction, recognize loop-invariant calculations andmove them out of inner loops, and choose the appropriate addressing modefor the loads.

Even with these improvements, the inner loop must perform 16loads, 16 floating-point multiplies, and 16 floating-point adds, plus one store.The resulting block will present a challenge to the instruction scheduler.If the compiler fails in some part of this transformation sequence, the resulting code might be substantially worse than the original. For example, if itcannot refactor the address expressions around a common base address for xand one for m, the code might maintain 33 distinct induction variables—onefor each distinct address expression for x, m, and y. If the resulting demandfor registers forces the register allocator to spill, it will insert additional loadsand stores into the loop (which is already likely to be memory bound).

Incases such as this one, the quality of code produced by the compiler dependson an orchestrated series of transformations that all must work; when onefails to achieve its purpose, the overall sequence may produce lower qualitycode than the user expects.8.2.2 Considerations for OptimizationIn the previous example, the programmer applied the transformations in thebelief that they would make the program run faster. The programmer hadto believe that they would preserve the meaning of the program. (After all,if transformations need not preserve meaning, why not replace the entireprocedure with a single nop?)Two issues, safety and profitability, lie at the heart of every optimization.The compiler must have a mechanism to prove that each application ofthe transformation is safe—that is, it preserves the program’s meaning.

Thecompiler must have a reason to believe that applying the transformation isprofitable—that is, it improves the program’s performance. If either of theseis not true—that is, applying the transformation will change the program’smeaning or will make its performance worse—the compiler should not applythe transformation.SafetyHow did the programmer know that this transformation was safe? That is,why did the programmer believe that the transformed code would producethe same results as the original code? Close examination of the loop nest8.2 Background 413DEFINING SAFETYCorrectness is the single most important criterion that a compiler mustmeet—the code that the compiler produces must have the same meaningas the input program. Each time the optimizer applies a transformation,that action must preserve the correctness of the translation.Typically, meaning is defined as the observable behavior of the program.For a batch program, this is the memory state after it halts, along withany output it generates.

If the program terminates, the values of all visiblevariables immediately before it halts should be the same under any translation scheme. For an interactive program, behavior is more complex anddifficult to capture.Plotkin formalized this notion as observational equivalence.For two expressions, M and N, we say that M and N are observationallyequivalent if and only if, in any context C where both M and N are closed (thatis, have no free variables), evaluating C[M] and C[N] either produces identicalresults or neither terminates [286].Thus, two expressions are observationally equivalent if their impacts onthe visible, external environment are identical.In practice, compilers use a simpler and looser notion of equivalence thanPlotkin’s, namely, that if, in their actual program context, two differentexpressions e and e0 produce identical results, then the compiler can substitute e0 for e.

This standard deals only with contexts that actually arisein the program; tailoring code to context is the essence of optimization.It does not mention what happens when a computation goes awry, ordiverges.In practice, compilers take care not to introduce divergence—the originalcode would work correctly, but the optimized code tries to divide by zero,or loops indefinitely.

The opposite case, where the original code woulddiverge, but the optimized code does not, is rarely mentioned.shows that the only interaction between successive iterations occurs throughthe elements of y.nnA value computed as y(i) is not reused until the next iteration of theouter loop. The iterations of the inner loop are independent of eachother, because each iteration defines precisely one value and no otheriteration references that value. Thus, the iterations can execute in anyorder. (For example, if we run the inner loop from n1 to 1 it producesthe same results.)The interaction through y is limited in its effect.

The ith element of yaccumulates the sum of all the ith iterations of the inner loop. Thispattern of accumulation is safely reproduced in the unrolled loop.414 CHAPTER 8 Introduction to OptimizationA large part of the analysis done in optimization goes toward proving thesafety of transformations.ProfitabilityWhy did the programmer think that loop unrolling would improveperformance? That is, why is the transformation profitable? Severaldifferent effects of unrolling might speed up the code.nnnMemory boundA loop where loads and stores take more cyclesthan does computation is considered memorybound.To determine if a loop is memory bound requiresdetailed knowledge about both the loop and thetarget machine.The total number of loop iterations is reduced by a factor of 16.

Thisreduces the overhead operations due to loop control: adds, compares,jumps, and branches. If the loop executes frequently, these savingsbecome significant.This effect might suggest unrolling by an even larger factor. Finiteresource limits probably dictated the choice of 16. For example, theinner loop uses the same 16 values of x for all the iterations of theinner loop. Many processors have only 32 registers that can hold afloating-point number. Unrolling by 32, the next power of two, wouldcreate enough of these “loop-invariant” values that they could not fit inthe register set. Spilling them to memory would add loads and storesto the inner loop and undo the benefits of unrolling.The array-address computations contain duplicated work.

Consider theuse of y(i). The original code computed y(i)’s address once permultiplication of x and m; the transformed code computes it once per116 multiplications. The unrolled code does 16as much work to addressy(i). The 16 references to m, and to a lesser extent x, should includecommon portions that the loop can compute once and reuse, as well.The transformed loop performs more work per memory operation,where “work” excludes the overhead of implementing the array andloop abstractions.

The original loop performed two arithmeticoperations for three memory operations, while the unrolled loopperforms 32 arithmetic operations for 18 memory operations, assumingthat all the x values stay in registers. Thus, the unrolled loop is lesslikely to be memory bound. It has enough independent arithmetic tooverlap the loads and hide some of their latencies.Unrolling can help with other machine-dependent effects. It increases theamount of code in the inner loop, which may provide the instruction scheduler with more opportunities to hide latencies. If the end-of-loop branchhas a long latency, the longer loop body may let the compiler fill more ofthat branch’s delay slots.

On some processors, unused delay slots must befilled with nops, in which case loop unrolling can decrease the number ofnops fetched, reduce memory traffic and, perhaps, reduce the energy used toexecute the program.8.2 Background 415RiskIf transformations intended to improve performance make it harder for thecompiler to generate good code for the program, those potential problemsshould be considered as profitability issues. The hand transformationsperformed on dmxpy create new challenges for a compiler, including thefollowing:nnDemand for registers The original loop needs only a handful ofregisters to hold its active values. Only x(j), some part of the addresscalculations for x, y, and m, and the loop index variables need registersacross loop iterations, while y(i) and m(i,j) need registers briefly. Incontrast, the transformed loop has 16 elements of x to keep in registersacross the loop, along with the 16 values of m and y(i) that needregisters briefly.Form of address calculation The original loop deals with threeaddresses, one each for y, x, and m.

Because the transformed loopreferences many more distinct locations in each iteration, the compilermust shape the address calculations carefully to avoid repeatedcalculations and excessive demand for registers. In the worst case, thecode might use independent calculations for all 16 elements of x, all 16elements of m, and the one element of y.If the compiler shapes the address calculations appropriately, it can usea single pointer for m and another for x, each with 16 constant-valuedoffsets. It can rewrite the loop to use that pointer in the end-of-loop test,obviating the need for another register and eliminating another update.Planning and optimization make the difference.Other problems of a machine-specific nature arise as well.

Характеристики

Тип файла

PDF-файл

Размер

8,27 Mb

Материал

Rice

Тип материала

Другое

Предмет

Конструирование компиляторов

Высшее учебное заведение

МГУ им. Ломоносова

Список файлов учебной работы

rice.rar

Rice

COMP 512 Advanced Compiler Construction_files

2007Shield.jpg

Другое

08Taxonomy.pptx

978-0-262-31709-2-ch014.pdf

peephole.pptx

scheduling.pptx

Купер и Торчсон - перевод

Рисунки

Рис1-1.jpg

12 Instruction Scheduling(Engl).doc

12 Instruction Scheduling(Rusl).doc

12a Instruction Scheduling (Terminology)E.doc

12a Instruction Scheduling (Terminology)R.doc

12b Instruction Scheduling (Side-bars)E.doc

12b Instruction Scheduling (Side-bars)R.doc

Cooper_Engineering_a_Compiler(Second Edition).pdf

COMP 512 Advanced Compiler Construction.htm

Rice University Houston COMP 412- Topics in Compiler Construction.doc

Rice University Houston COMP 512 Advanced Compiler Construction.doc

Прочти меня!!!.txt

Поделитесь ссылкой:

Ставлю 10/10
Все нравится, очень удобный сайт, помогает в учебе. Кроме этого, можно заработать самому, выставляя готовые учебные материалы на продажу здесь. Рейтинги и отзывы на преподавателей очень помогают сориентироваться в начале нового семестра. Спасибо за такую функцию. Ставлю максимальную оценку.

Лучшая платформа для успешной сдачи сессии
Познакомился со СтудИзбой благодаря своему другу, очень нравится интерфейс, количество доступных файлов, цена, в общем, все прекрасно. Даже сам продаю какие-то свои работы.

Студизба ван лав ❤
Очень офигенный сайт для студентов. Много полезных учебных материалов. Пользуюсь студизбой с октября 2021 года. Серьёзных нареканий нет. Хотелось бы, что бы ввели подписочную модель и сделали материалы дешевле 300 рублей в рамках подписки бесплатными.

Отличный сайт
Лично меня всё устраивает - и покупка, и продажа; и цены, и возможность предпросмотра куска файла, и обилие бесплатных файлов (в подборках по авторам, читай, ВУЗам и факультетам). Есть определённые баги, но всё решаемо, да и администраторы реагируют в течение суток.

Маленький отзыв о большом помощнике!
Студизба спасает в те моменты, когда сроки горят, а работ накопилось достаточно. Довольно удобный сайт с простой навигацией и огромным количеством материалов.

Студ. Изба как крупнейший сборник работ для студентов
Тут дофига бывает всего полезного. Печально, что бывают предметы по которым даже одного бесплатного решения нет, но это скорее вопрос к студентам. В остальном всё здорово.

Спасательный островок
Если уже не успеваешь разобраться или застрял на каком-то задание поможет тебе быстро и недорого решить твою проблему.

Всё и так отлично
Всё очень удобно. Особенно круто, что есть система бонусов и можно выводить остатки денег. Очень много качественных бесплатных файлов.

Отзыв о системе "Студизба"
Отличная платформа для распространения работ, востребованных студентами. Хорошо налаженная и качественная работа сайта, огромная база заданий и аудитория.

Отличный помощник
Отличный сайт с кучей полезных файлов, позволяющий найти много методичек / учебников / отзывов о вузах и преподователях.

Отлично помогает студентам в любой момент для решения трудных и незамедлительных задач
Хотелось бы больше конкретной информации о преподавателях. А так в принципе хороший сайт, всегда им пользуюсь и ни разу не было желания прекратить. Хороший сайт для помощи студентам, удобный и приятный интерфейс. Из недостатков можно выделить только отсутствия небольшого количества файлов.

Спасибо за шикарный сайт
Великолепный сайт на котором студент за не большие деньги может найти помощь с дз, проектами курсовыми, лабораторными, а также узнать отзывы на преподавателей и бесплатно скачать пособия.