K. Cooper, L. Torczon - Engineering a Compiler (2011 - 2nd edition) (798440), page 101
As we shall see, the tools of optimization also play a large role in the compiler’s back end. For these reasons, it is important to introduce optimization and explore some of the issues that it raises before discussing the techniques used in a compiler’s back end.

8.2.1 Examples

To provide a focus for this discussion, we will begin by examining two examples in depth. The first, a simple two-dimensional array-address calculation, shows the role that knowledge and context play in the kind of code that the compiler can produce. The second, a loop nest from the routine dmxpy in the widely used LINPACK numerical library, provides insight into the transformation process itself and into the challenges that transformed code can present to the compiler.

Improving an Array-Address Calculation

Consider the IR that a compiler’s front end might generate for an array reference, such as m(i,j) in FORTRAN.
Without specific knowledge about m, i, and j, or the surrounding context, the compiler must generate the full expression for addressing a two-dimensional array stored in column-major order. In Chapter 7, we saw the calculation for row-major order; FORTRAN’s column-major order is similar:

$$@m + (j - \mathit{low}_2(m)) \times (\mathit{high}_1(m) - \mathit{low}_1(m) + 1) \times w + (i - \mathit{low}_1(m)) \times w$$

where $@m$ is the runtime address of the first element of m, $\mathit{low}_i(m)$ and $\mathit{high}_i(m)$ are the lower and upper bounds, respectively, of m’s ith dimension, and $w$ is the size of an element of m. The compiler’s ability to reduce the cost of that computation depends directly on its analysis of the code and the surrounding context.

If m is a local array with lower bounds of one in each dimension and known upper bounds, then the compiler can simplify the calculation to

$$@m + (j - 1) \times hw + (i - 1) \times w$$

where $hw$ is $\mathit{high}_1(m) \times w$.
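As a rough illustration of the two address expressions above, the following Python sketch models addresses as plain integers. The function names and the sample values (a 10 × 20 array of 8-byte elements placed at address 1000) are assumptions made for this example; they do not come from the text.

```python
def address_full(base, i, j, low1, high1, low2, w):
    """Column-major address of m(i,j) when nothing is known about the bounds."""
    return base + (j - low2) * (high1 - low1 + 1) * w + (i - low1) * w


def address_simplified(base, i, j, hw, w):
    """Same address when both lower bounds are 1; hw stands for high1 * w."""
    return base + (j - 1) * hw + (i - 1) * w


# Hypothetical array: 10 rows, 20 columns, 8-byte elements, first element at 1000.
base, high1, high2, w = 1000, 10, 20, 8
hw = high1 * w  # a constant once high1 and w are known

# With lower bounds of one, the two expressions agree for every element.
for j in range(1, high2 + 1):
    for i in range(1, high1 + 1):
        full = address_full(base, i, j, 1, high1, 1, w)
        assert full == address_simplified(base, i, j, hw, w)
```

The simplified form needs two multiplies and two subtractions by constants, whereas the full form must also fetch the bounds and compute the column length at run time.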
If the reference occurs inside a loop where j runs from 1 to k, the compiler might use operator strength reduction to replace the term $(j - 1) \times hw$ with a sequence $j'_1, j'_2, j'_3, \ldots, j'_k$, where $j'_1 = (1 - 1) \times hw = 0$ and $j'_i = j'_{i-1} + hw$. If i is also the induction variable of a loop running from 1 to l, then strength reduction can replace $(i - 1) \times w$ with the sequence $i'_1, i'_2, i'_3, \ldots, i'_l$, where $i'_1 = 0$ and $i'_j = i'_{j-1} + w$. After these changes, the address calculation is just

$$@m + j' + i'$$

The j loop must increment $j'$ by $hw$ and the i loop must increment $i'$ by $w$. If the j loop is the outer loop, then the computation of $@m + j'$ can be moved out of the inner loop. At this point, the address computation in the inner loop contains an add and the increment for $i'$, while the outer loop contains an add and the increment for $j'$.
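The strength-reduced form described above can be sketched in Python as follows. This is a minimal model, not compiler output: the function names are hypothetical, addresses are plain integers, and `j_prime` and `i_prime` play the roles of j' and i'.

```python
def addresses_direct(base, k, l, hw, w):
    """Evaluate base + (j-1)*hw + (i-1)*w for each (i, j): two multiplies per address."""
    return [base + (j - 1) * hw + (i - 1) * w
            for j in range(1, k + 1)
            for i in range(1, l + 1)]


def addresses_reduced(base, k, l, hw, w):
    """Produce the same addresses using only additions inside the loops."""
    out = []
    j_prime = 0                      # j'_1 = (1 - 1) * hw = 0
    for _j in range(1, k + 1):
        col_base = base + j_prime    # @m + j', computed once per outer iteration
        i_prime = 0                  # i'_1 = 0
        for _i in range(1, l + 1):
            out.append(col_base + i_prime)  # the single add left in the inner loop
            i_prime += w             # increment for i'
        j_prime += hw                # increment for j'
    return out
```

Both functions visit the elements in the same order, so comparing their results for a given `k`, `l`, `hw`, and `w` checks that the rewritten loops preserve every address.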
Knowing the context around the reference to m(i,j) allows the compiler to significantly reduce the cost of array addressing.

If m is an actual parameter to the procedure, then the compiler may not know these facts at compile time. In fact, the upper and lower bounds for m might change on each call to the procedure. In such cases, the compiler may be unable to simplify the address calculation as shown.

Improving a Loop Nest in LINPACK

As a more dramatic example of context, consider the loop nest shown in Figure 8.1.

Strength reduction: a transformation that rewrites a series of operations, for example $i \cdot c, (i+1) \cdot c, \ldots, (i+k) \cdot c$, with an equivalent series $i'_1, i'_2, \ldots, i'_k$, where $i'_1 = i \cdot c$ and $i'_j = i'_{j-1} + c$. See Section 10.7.2.

      subroutine dmxpy (n1, y, n2, ldm, x, m)
      double precision y(*), x(*), m(ldm,*)
      ...
      jmin = j+16
      do 60 j = jmin, n2, 16
         do 50 i = 1, n1
            y(i) = ((((((((((((((( (y(i))
     $           + x(j-15)*m(i,j-15)) + x(j-14)*m(i,j-14))
     $           + x(j-13)*m(i,j-13)) + x(j-12)*m(i,j-12))
     $           + x(j-11)*m(i,j-11)) + x(j-10)*m(i,j-10))
     $           + x(j- 9)*m(i,j- 9)) + x(j- 8)*m(i,j- 8))
     $           + x(j- 7)*m(i,j- 7)) + x(j- 6)*m(i,j- 6))
     $           + x(j- 5)*m(i,j- 5)) + x(j- 4)*m(i,j- 4))
     $           + x(j- 3)*m(i,j- 3)) + x(j- 2)*m(i,j- 2))
     $           + x(j- 1)*m(i,j- 1)) + x(j) *m(i,j)
   50    continue
   60 continue
      ...
      end

FIGURE 8.1 Excerpt from dmxpy in LINPACK.

It is the central loop nest of the FORTRAN version of the routine dmxpy from the LINPACK numerical library. The code wraps two loops around a single long assignment. The loop nest forms the core of a routine to compute y + x × m, for vectors x and y and matrix m.
We will consider the code from two different perspectives: first, the transformations that the author hand-applied to improve performance, and second, the challenges that the compiler faces in translating this loop nest to run efficiently on a specific processor.

Before the author hand-transformed the code, the loop nest performed the following simpler version of the same computation:

      do 60 j = 1, n2
         do 50 i = 1, n1
            y(i) = y(i) + x(j) * m(i,j)
   50    continue
   60 continue

Loop unrolling: This replicates the loop body for distinct iterations and adjusts the index calculations to match.

To improve performance, the author unrolled the outer loop, the j loop, 16 times. That rewrite created 16 copies of the assignment statement with distinct values for j, ranging from j through j-15.
It also changed the increment on the outer loop from 1 to 16. Next, the author merged the 16 assignments into a single statement, eliminating 15 occurrences of y(i) = y(i) + · · · ; that eliminates 15 additions and most of the loads and stores of y(i). Unrolling the loop eliminates some scalar operations. It often improves cache locality, as well.

To handle the cases where the array bounds are not integral multiples of 16, the full procedure has four versions of the loop nest that precede the one shown in Figure 8.1. These “setup loops” process up to 15 columns of m, leaving j set to a value for which n2 - j is an integral multiple of 16. The first loop handles a single column of m, corresponding to an odd n2.
The other three loop nests handle two, four, and eight columns of m. This guarantees that the final loop nest, shown in Figure 8.1, can process the columns 16 at a time.

Ideally, the compiler would automatically transform the original loop nest into this more efficient version, or into whatever form is most appropriate for a given target machine.
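The unrolling scheme with its setup loops can be sketched in Python as follows. This is an illustrative model under stated assumptions, not the LINPACK source: the function names are hypothetical, indices are 0-based, `m[j][i]` holds column j and row i, and the short loop inside `add_columns` stands in for the terms that the hand-written code spells out in one long assignment.

```python
def mxpy_simple(y, x, m):
    """Compute y + x*m one column per outer iteration, as in the original loop nest."""
    y = list(y)
    for j in range(len(x)):
        for i in range(len(y)):
            y[i] = y[i] + x[j] * m[j][i]
    return y


def add_columns(y, x, m, first, count):
    """Fold `count` consecutive columns, starting at `first`, into y in one pass."""
    for i in range(len(y)):
        acc = y[i]                          # one load of y(i) per pass
        for j in range(first, first + count):
            acc = acc + x[j] * m[j][i]
        y[i] = acc                          # one store of y(i) per pass


def mxpy_unrolled(y, x, m, factor=16):
    """Same result, processing `factor` columns per pass after the setup passes."""
    assert factor >= 1 and factor & (factor - 1) == 0, "factor must be a power of two"
    y = list(y)
    n2 = len(x)
    j = 0
    width = 1
    while width < factor:                   # setup passes of 1, 2, 4, ... columns
        if n2 & width:
            add_columns(y, x, m, j, width)
            j += width
        width *= 2
    while j < n2:                           # n2 - j is now a multiple of factor
        add_columns(y, x, m, j, factor)
        j += factor
    return y
```

Each y(i) accumulates its columns in the same order in both versions, so the two functions return identical results, including for floating-point inputs.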
However, few compilers include all of the optimizations needed to accomplish that goal. In the case of dmxpy, the author performed the optimizations by hand to produce good performance across a wide range of target machines and compilers.

From the compiler’s perspective, mapping the loop nest shown in Figure 8.1 onto the target machine presents some hard challenges.
The loop nest contains 33 distinct array-address expressions: 16 for m, 16 for x, and one for y that it uses twice. Unless the compiler can simplify those address calculations, the loop will be awash in integer arithmetic.

Consider the references to x. They do not change during execution of the inner loop, which varies i. The optimizer can move the address calculations and the loads for x out of the inner loop. If it can keep the x values in registers, it can eliminate a large part of the overhead from the inner loop.
For a reference such as x(j-12), the address calculation is just $@x + (j - 12) \times w$. To further simplify matters, the compiler can refactor all 16 references to x into the form $@x + jw - c_k$, where $jw$ is $j \cdot w$ and $c_k$ is $k \cdot w$ for each $0 \le k \le 15$. In this form, each load uses the same base address, $@x + jw$, with a different constant offset, $c_k$.

To map this efficiently onto the target machine requires knowledge of the available addressing modes. If the target has the equivalent of ILOC’s loadAI operation (a register base address plus a small constant offset), then all the accesses to x can be written to use a single induction variable. Its initial value is $@x + \mathit{jmin} \cdot w$. Because the j loop advances j by 16, each iteration of the j loop increments it by $16 \cdot w$.

The 16 values of m used in the inner loop change on each iteration.
Thus, the inner loop must compute addresses and load 16 elements of m on each iteration. Careful refactoring of the address expressions, combined with strength reduction, can reduce the overhead of accessing m. The value $@m + j \cdot \mathit{high}_1(m) \cdot w$ can be computed in the j loop. (Notice that $\mathit{high}_1(m)$ is the only concrete dimension declared in dmxpy’s header.) The inner loop can produce a base address by adding it to $(i - 1) \cdot w$. Then, the 16 loads can use distinct constants, $c_k \cdot \mathit{high}_1(m)$, where $c_k$ is $k \cdot w$ for each $0 \le k \le 15$.

To achieve this code shape, the compiler must refactor the address expressions, perform strength reduction, recognize loop-invariant calculations and move them out of inner loops, and choose the appropriate addressing mode for the loads.
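The refactoring of the x and m address expressions around a common base can be sketched in Python as follows. The sketch makes several assumptions: addresses are plain integers, the function and parameter names are hypothetical, and `at_x` and `at_m` are the bases used in the formulas above, so that x(j) is taken to live at `at_x + j*w` and m(i,j) at `at_m + j*high1*w + (i-1)*w`.

```python
def direct_addresses(at_x, at_m, i, j, high1, w, unroll=16):
    """Addresses of x(j-k) and m(i,j-k), each computed from scratch."""
    hw = high1 * w
    xs = [at_x + (j - k) * w for k in range(unroll)]
    ms = [at_m + (j - k) * hw + (i - 1) * w for k in range(unroll)]
    return xs, ms


def refactored_addresses(at_x, at_m, i, j, high1, w, unroll=16):
    """The same addresses as one base per array plus a fixed offset per load."""
    c = [k * w for k in range(unroll)]       # c_k = k * w
    x_base = at_x + j * w                    # belongs in the j loop
    m_col = at_m + j * high1 * w             # belongs in the j loop
    m_base = m_col + (i - 1) * w             # one add in the i loop
    xs = [x_base - ck for ck in c]           # base register plus offset -c_k
    ms = [m_base - ck * high1 for ck in c]   # base register plus offset -c_k * high1
    return xs, ms


# Hypothetical values: 8-byte elements, leading dimension 10.
assert direct_addresses(4000, 8000, 3, 20, 10, 8) == \
    refactored_addresses(4000, 8000, 3, 20, 10, 8)
```

In the refactored form, only `m_base` changes inside the inner loop; every other quantity is either invariant in that loop or a fixed offset, which is the shape that a base-plus-offset addressing mode can exploit.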
Even with these improvements, the inner loop must perform 16 loads, 16 floating-point multiplies, and 16 floating-point adds, plus one store. The resulting block will present a challenge to the instruction scheduler.

If the compiler fails in some part of this transformation sequence, the resulting code might be substantially worse than the original. For example, if it cannot refactor the address expressions around a common base address for x and one for m, the code might maintain 33 distinct induction variables, one for each distinct address expression for x, m, and y.
If the resulting demand for registers forces the register allocator to spill, it will insert additional loads and stores into the loop (which is already likely to be memory bound). In cases such as this one, the quality of code produced by the compiler depends on an orchestrated series of transformations that all must work; when one fails to achieve its purpose, the overall sequence may produce lower-quality code than the user expects.

8.2.2 Considerations for Optimization

In the previous example, the programmer applied the transformations in the belief that they would make the program run faster. The programmer had to believe that they would preserve the meaning of the program.