Effective Partial Redundancy Elimination (page 3)
In this case, none of the exposed redundancies are particularly surprising, since we created them during forward propagation. However, it is important to note that the code now conforms with the naming requirements stated in Section 2.2. Expressions are named uniquely by r0, r1, r2, r3, r6, r7, r8, and r9. The remaining names, r4, r5, and r10, are defined exclusively by copies and serve as variable names.

Finishing the Example Applying partial redundancy elimination to the code in Figure 8 produces the code in Figure 9.
The invariant expressions r6 and r7 have been hoisted from the loop and the redundant computations of r3, r6, and r7 have been removed. Finally, the coalescing phase of a Chaitin-style global register allocator will remove unnecessary copy instructions [6]. In this example, coalescing is able to remove all the copies (as shown in Figure 10), though this will not always be possible.

Taken together, the sequence of transformations reduced the length of the loop by 1 operation without increasing the length of any path through the routine. However, it is worth noting that the final code is not optimal.
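The effect of coalescing can be illustrated with a small sketch. This is not the paper's Chaitin-style allocator: it simply deletes `dst ← src` copies from straight-line code and renames later uses, which is the net effect coalescing achieves when the two live ranges do not interfere. The tuple encoding of instructions is our own invention.

```python
# Minimal sketch of copy coalescing on straight-line code.
# Instructions are tuples: ("copy", dst, src) or ("op", dst, opcode, src1, src2).
# Assumes each name is defined once (the naming discipline of Section 2.2),
# so renaming a copy's later uses is always safe here.

def coalesce(code):
    renamed = {}                       # copy target -> its source
    def canon(name):
        while name in renamed:         # follow chains of copies
            name = renamed[name]
        return name
    out = []
    for inst in code:
        if inst[0] == "copy":
            _, dst, src = inst
            renamed[dst] = canon(src)  # drop the copy, remember the alias
        else:
            _, dst, op, a, b = inst
            out.append(("op", dst, op, canon(a), canon(b)))
    return out

code = [
    ("op",   "r3", "+", "r0", "r1"),
    ("copy", "r4", "r3"),              # unnecessary copy
    ("op",   "r5", "+", "r4", "r2"),
]
print(coalesce(code))                  # the copy is gone; r5 now reads r3 directly
```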
If the expressions r6 and r7 had been arranged differently, we would have been able to take advantage of the fact that r0 + r1 had already been computed. As noted in Section 2.2, finding the optimal solution would require examination of a combinatorial number of cases. We use a fast heuristic that produces good, though not optimal, results.

    enter(r0, r1)
    r4 ← r0 + r1
    if r4 > 100 branch
    r5 ← 0
    r6 ← r0 + 1
    r7 ← r6 + r1
    r5 ← r7 + r5
    r4 ← r4 + 1
    if r4 ≤ 100 branch
    r10 ← r7 + r5
    r10 ← 0
    return r10

Figure 10: After Coalescing (instruction sequence only; control-flow arrows omitted)

4 Experimental Study

To test the effectiveness of our techniques, we have implemented versions of global reassociation, global value numbering, and partial redundancy elimination in the context of an experimental FORTRAN compiler.
The compiler is structured as a front end that consumes FORTRAN and produces ILOC (our intermediate language), an optimizer that consumes and produces ILOC, and a back end that consumes ILOC and produces C. The generated C code is instrumented to accumulate dynamic counts of ILOC operations. Thus, we are able to compile individual FORTRAN routines, perhaps selected from a large program, and test the effectiveness of different optimizations on the routine in the context of its complete program.

The optimizer is structured as a sequence of passes, where each pass is a Unix filter that consumes and produces ILOC. Each pass performs a single optimization, including all the required control-flow and data-flow analyses. While this approach is not suitable for production compilers, its flexibility makes it ideal for experimentation.

Our implementation of PRE uses a variation described by Drechsler and Stadel [14].
Their formulation supports edge placement for enhanced optimization and simplifies the data-flow equations that must be solved, avoiding the bidirectional equations typical of some other approaches [13]. Our implementation of global value numbering uses the simplest variation described by Alpern, Wegman, and Zadeck, possibly missing some opportunities discovered by their more powerful approaches [2, Sections 3 and 4].

4.1 Results

We ran several versions of the optimizer on a suite of test routines. Each version adds new passes to the previous one. Our test suite consists of 50 routines, drawn from the Spec benchmark suite and from Forsythe, Malcolm, and Moler's book on numerical methods [16]. The results are given in Table 1. We report results for four different levels of optimization:

baseline This column provides the dynamic operation count, including branches, for each routine when optimized using a sequence of global constant propagation [26], global peephole optimization, global dead code elimination [11, Section 7.1], coalescing, and a final pass to eliminate empty basic blocks.¹

partial The left column gives the operation counts for routines optimized with PRE, followed by the sequence of optimizations used to establish the baseline. The right column gives the percentage improvement over the baseline.

reassociation The left column provides the operation counts for routines optimized using global reassociation (without distribution of multiplication over addition) and global value numbering before PRE and the other optimizations. The right column gives the percentage improvements over partial.

distribution The left column gives the operation counts for routines optimized using global reassociation (including distribution of multiplication over addition) and global value numbering before PRE and the other optimizations. The right column gives the percentage improvements over reassociation.

The total column gives the percentage improvements achieved over the baseline by the entire set of optimizations, while the new column gives the improvement over partial contributed by the combination of reassociation and distribution with global value numbering. Empty entries indicate no improvement, whereas entries of 0% and −0% indicate very small improvements and degradations.

¹ The sizes of the test cases for matrix300 and tomcatv have been reduced to ease testing.

Limitations of the Optimizer Our optimizer is not complete. In particular, we are currently missing passes for strength reduction and hash-based value numbering. Nevertheless, we believe our results are still valid indications of the importance of reassociation. Indeed, it may be that our results understate the eventual benefits – strength reduction should reduce non-essential overhead and hash-based value numbering should also benefit from reassociation.

4.2 Code Degradation

The results in Table 1 reveal several cases where our "improvements" slowed down the code.
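As an aside on reading the table, the right-hand percentage columns are simply relative reductions in dynamic operation count. The sketch below is our own illustration: the counts are invented and the nearest-percent rounding is an assumption (the exact rounding rule is not stated; a −0% entry corresponds to a degradation too small to round to 1%).

```python
# Sketch: deriving a percentage-improvement column from dynamic
# operation counts.  The counts here are invented for illustration.

def improvement(before, after):
    # positive = fewer operations after the optimization
    return 100.0 * (before - after) / before

baseline_count, partial_count = 1000, 800
print(round(improvement(baseline_count, partial_count)))   # 20
```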
Since we are using heuristic approaches to difficult problems, we should not be surprised by occasional losses, annoying as they are. Examination of the code revealed three sources of difficulty; each is discussed in the sections below.

Reassociation Sometimes reassociation can disguise common subexpressions. Recall our example from Figures 2 through 10. The final arrangement of the code,

    r4 ← r0 + r1

and

    r6 ← r0 + 1
    r7 ← r6 + r1

hid the fact that r0 + r1 was being recomputed. We found that this sort of problem occurred quite often in the routines of our test suite. Fortunately, the effect is usually dominated by the improved motion of loop invariants.

[Table 1: Experimental Results – dynamic operation counts and percentage improvements for each of the 50 routines at the baseline, partial, reassociation, and distribution levels, with the new and total summary columns.]

Distribution Similarly, distribution of multiplication over addition can cause problems in some cases.
Consider the following pair of expressions arising from a pair of array accesses, one to a single-precision array and the other to a double-precision array:

    4 × (ri − 1)
    8 × (ri − 1)

Distribution of the multiplies would yield:

    4 × ri − 4 × 1
    8 × ri − 8 × 1

and eventually, via constant folding:

    4 × ri − 4
    8 × ri − 8

Unfortunately, this version is slightly worse than the original code since the original allowed commoning of the subexpression ri − 1.
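Counting operations makes the loss concrete. The sketch below is illustrative only (the statement strings and the crude operator counting are ours, not the paper's); it assumes the original form shares the subexpression ri − 1, as the text describes.

```python
# Sketch: operation counts for the two address computations.
# The original form shares ri - 1; the distributed, folded form shares nothing.

original    = ["t = ri - 1", "a4 = 4 * t", "a8 = 8 * t"]
distributed = ["a4 = 4 * ri - 4", "a8 = 8 * ri - 8"]

def ops(stmts):
    # count the arithmetic operators in each statement
    return sum(s.count("+") + s.count("-") + s.count("*") for s in stmts)

print(ops(original), ops(distributed))   # 3 4
```

So after commoning, the original form needs three operations where the distributed form needs four.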
Despite disappointments of this sort, it is clear from the results in Table 1 that distribution is quite important. We believe that some of the problems of distribution can be avoided by employing a slightly more sophisticated approach, though this is a topic for further study.

Forward Propagation Earlier, we mentioned that forward propagation eliminates partially-dead expressions. However, forward propagation can also result in code degradation if expressions are moved into loops where they will be invariant but PRE will be unable to hoist them.
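This failure mode can be sketched with a toy count of evaluations (our own illustration, not the authors' compiler; `cond` stands for an arbitrary guard inside the loop). Before propagation, the invariant expression is computed once outside the loop; after propagation, if PRE cannot hoist it back out, it is evaluated at every use inside the loop.

```python
# Sketch: forward propagation moving an invariant expression into a loop.
# Before:  t = a + b              After:   for i ...:
#          for i ...:                          if cond(i): use(a + b)
#              if cond(i): use(t)
# If PRE cannot hoist a + b, the "after" form evaluates it on every
# iteration where cond holds, instead of exactly once.

def evals_before(n):
    return 1                      # t = a + b computed once, outside the loop

def evals_after(n, cond):
    count = 0
    for i in range(n):
        if cond(i):
            count += 1            # a + b recomputed at each use in the loop
    return count

print(evals_before(100), evals_after(100, lambda i: True))   # 1 100
```

Note the trade-off: when the guard never holds, the propagated form wins (zero evaluations), which is exactly why forward propagation targets partially-dead expressions; the degradation appears only when the guard holds often and PRE cannot hoist.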