Cooper, Engineering a Compiler (Second Edition) (1157546), page 99
Chapter 9 presents an overview of static analysis. It describes some of the analysis problems that an optimizing compiler must solve and presents practical techniques that have been used to solve them. Chapter 10 examines so-called scalar optimizations, those intended for a uniprocessor, in a more systematic way.

8.2 BACKGROUND

Until the early 1980s, many compiler writers considered optimization as a feature that should be added to the compiler only after its other parts were working well.
This led to a distinction between debugging compilers and optimizing compilers. A debugging compiler emphasized quick compilation at the expense of code quality. These compilers did not significantly rearrange the code, so a strong correspondence remained between the source code and the executable code. This simplified the task of mapping a runtime error to a specific line of source code; hence the term debugging compiler. In contrast, an optimizing compiler focuses on improving the running time of the executable code at the expense of compile time.
Spending more time in compilation often produces better code. Because the optimizer often moves operations around, the mapping from source code to executable code is less transparent, and debugging is, accordingly, harder.

As RISC processors have moved into the marketplace (and as RISC implementation techniques were applied to CISC architectures), more of the burden for runtime performance has fallen on compilers.
To increase performance, processor architects have turned to features that require more support from the compiler. These include delay slots following branches, nonblocking memory operations, increased use of pipelines, and increased numbers of functional units. These features make processors more performance sensitive to both high-level issues of program layout and structure and to low-level details of scheduling and resource allocation.
As the gap between processor speed and application performance has grown, the demand for optimization has grown to the point where users expect every compiler to perform optimization.

The routine inclusion of an optimizer, in turn, changes the environment in which both the front end and the back end operate. Optimization further insulates the front end from performance concerns. To an extent, this simplifies the task of IR generation in the front end. At the same time, optimization changes the code that the back end processes. Modern optimizers assume that the back end will handle resource allocation; thus, they typically target an idealized machine that has an unlimited supply of registers, memory, and functional units.
This, in turn, places more pressure on the techniques used in the compiler’s back end.

If compilers are to shoulder their share of responsibility for runtime performance, they must include optimizers. As we shall see, the tools of optimization also play a large role in the compiler’s back end. For these reasons, it is important to introduce optimization and explore some of the issues that it raises before discussing the techniques used in a compiler’s back end.

8.2.1 Examples

To provide a focus for this discussion, we will begin by examining two examples in depth.
The first, a simple two-dimensional array-address calculation, shows the role that knowledge and context play in the kind of code that the compiler can produce. The second, a loop nest from the routine dmxpy in the widely used LINPACK numerical library, provides insight into the transformation process itself and into the challenges that transformed code can present to the compiler.

Improving an Array-Address Calculation

Consider the IR that a compiler’s front end might generate for an array reference, such as m(i,j) in FORTRAN.
Without specific knowledge about m, i, and j, or the surrounding context, the compiler must generate the full expression for addressing a two-dimensional array stored in column-major order. In Chapter 7, we saw the calculation for row-major order; FORTRAN’s column-major order is similar:

$$@m + (j - \mathit{low}_2(m)) \times (\mathit{high}_1(m) - \mathit{low}_1(m) + 1) \times w + (i - \mathit{low}_1(m)) \times w$$

where @m is the runtime address of the first element of m, $\mathit{low}_i(m)$ and $\mathit{high}_i(m)$ are the lower and upper bounds, respectively, of m’s ith dimension, and w is the size of an element of m.
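As a minimal sketch, the full column-major calculation can be written out directly. The function name, the parameter names, and the sample array (3 rows, 8-byte elements, base address 1000) are assumptions made for illustration; they do not come from the text.

```python
def column_major_address(base, i, j, low1, high1, low2, w):
    """Byte address of m(i,j) for a two-dimensional column-major array.

    base        : runtime address of the first element (@m in the text)
    low1, high1 : lower and upper bounds of the first dimension
    low2        : lower bound of the second dimension
    w           : size of one element, in bytes
    """
    column_length = high1 - low1 + 1  # number of elements in one column
    return base + (j - low2) * column_length * w + (i - low1) * w


# Hypothetical array: first dimension 1..3, second dimension starting at 1,
# 8-byte elements, first element at address 1000.
addr = column_major_address(base=1000, i=2, j=3, low1=1, high1=3, low2=1, w=8)
print(addr)  # 1000 + 2*3*8 + 1*8 = 1056
```

Written this way, each reference costs three subtractions, three multiplies, and several adds, which is the cost the compiler would like to reduce.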
The compiler’s ability to reduce the cost of that computation depends directly on its analysis of the code and the surrounding context.

If m is a local array with lower bounds of one in each dimension and known upper bounds, then the compiler can simplify the calculation to

$$@m + (j - 1) \times hw + (i - 1) \times w$$

where $hw$ is $\mathit{high}_1(m) \times w$. If the reference occurs inside a loop where j runs from 1 to k, the compiler might use operator strength reduction to replace the term $(j - 1) \times hw$ with a sequence $j'_1, j'_2, j'_3, \ldots, j'_k$, where $j'_1 = (1 - 1) \times hw = 0$ and $j'_i = j'_{i-1} + hw$. If i is also the induction variable of a loop running from 1 to l, then strength reduction can replace $(i - 1) \times w$ with the sequence $i'_1, i'_2, i'_3, \ldots, i'_l$, where $i'_1 = 0$ and $i'_j = i'_{j-1} + w$. After these changes, the address calculation is just

$$@m + j' + i'$$

The j loop must increment $j'$ by $hw$ and the i loop must increment $i'$ by $w$. If the j loop is the outer loop, then the computation of $@m + j'$ can be moved out of the inner loop. At this point, the address computation in the inner loop contains an add and the increment for $i'$, while the outer loop contains an add and the increment for $j'$.
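The effect of these two rewrites can be checked with a small sketch that generates the same address stream both ways. The function names and the list-based interface are assumptions chosen for illustration; a compiler would perform the rewrite on its IR rather than on source code like this.

```python
def addresses_with_multiplies(base, high1, w, k, l):
    """Addresses of m(i,j) for j = 1..k and i = 1..l, simplified formula."""
    hw = high1 * w
    result = []
    for j in range(1, k + 1):
        for i in range(1, l + 1):
            result.append(base + (j - 1) * hw + (i - 1) * w)
    return result


def addresses_strength_reduced(base, high1, w, k, l):
    """Same addresses, with each multiply replaced by a running sum."""
    hw = high1 * w
    result = []
    j_prime = 0                        # j'_1 = 0
    for _ in range(k):                 # outer (j) loop
        column_base = base + j_prime   # @m + j', computed outside the inner loop
        i_prime = 0                    # i'_1 = 0
        for _ in range(l):             # inner (i) loop
            result.append(column_base + i_prime)  # one add per reference
            i_prime += w               # i' advances by w
        j_prime += hw                  # j' advances by hw
    return result


# Hypothetical array with 5 rows of 8-byte elements at address 1000.
same = (addresses_with_multiplies(1000, 5, 8, 4, 5)
        == addresses_strength_reduced(1000, 5, 8, 4, 5))
print(same)  # True
```

The second version performs no multiplies inside either loop: the inner loop holds one add and one increment, and the outer loop holds one add and one increment, matching the count given above.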
Knowing the context around the reference to m(i,j) allows the compiler to significantly reduce the cost of array addressing.

If m is an actual parameter to the procedure, then the compiler may not know these facts at compile time. In fact, the upper and lower bounds for m might change on each call to the procedure. In such cases, the compiler may be unable to simplify the address calculation as shown.

Improving a Loop Nest in LINPACK

As a more dramatic example of context, consider the loop nest shown in Figure 8.1. It is the central loop nest of the FORTRAN version of the routine dmxpy from the LINPACK numerical library. The code wraps two loops around a single long assignment.
The loop nest forms the core of a routine to compute y + x × m, for vectors x and y and matrix m. We will consider the code from two different perspectives: first, the transformations that the author hand-applied to improve performance, and second, the challenges that the compiler faces in translating this loop nest to run efficiently on a specific processor.

Strength reduction: a transformation that rewrites a series of operations, for example $i \cdot c, (i+1) \cdot c, \ldots, (i+k) \cdot c$, with an equivalent series $i'_1, i'_2, \ldots, i'_k$, where $i'_1 = i \cdot c$ and $i'_j = i'_{j-1} + c$. See Section 10.7.2.

      subroutine dmxpy (n1, y, n2, ldm, x, m)
      double precision y(*), x(*), m(ldm,*)
      ...
      jmin = j+16
      do 60 j = jmin, n2, 16
         do 50 i = 1, n1
            y(i) = ((((((((((((((( (y(i))
     $           + x(j-15)*m(i,j-15)) + x(j-14)*m(i,j-14))
     $           + x(j-13)*m(i,j-13)) + x(j-12)*m(i,j-12))
     $           + x(j-11)*m(i,j-11)) + x(j-10)*m(i,j-10))
     $           + x(j- 9)*m(i,j- 9)) + x(j- 8)*m(i,j- 8))
     $           + x(j- 7)*m(i,j- 7)) + x(j- 6)*m(i,j- 6))
     $           + x(j- 5)*m(i,j- 5)) + x(j- 4)*m(i,j- 4))
     $           + x(j- 3)*m(i,j- 3)) + x(j- 2)*m(i,j- 2))
     $           + x(j- 1)*m(i,j- 1)) + x(j) *m(i,j)
   50    continue
   60 continue
      ...
      end

FIGURE 8.1 Excerpt from dmxpy in LINPACK.

Before the author hand-transformed the code, the loop nest performed the following simpler version of the same computation:

      do 60 j = 1, n2
         do 50 i = 1, n1
            y(i) = y(i) + x(j) * m(i,j)
   50    continue
   60 continue

Loop unrolling: This replicates the loop body for distinct iterations and adjusts the index calculations to match.

To improve performance, the author unrolled the outer loop, the j loop, 16 times.
That rewrite created 16 copies of the assignment statement with distinct values for j, ranging from j through j-15. It also changed the increment on the outer loop from 1 to 16. Next, the author merged the 16 assignments into a single statement, eliminating 15 occurrences of y(i) = y(i) + · · · ; that eliminates 15 additions and most of the loads and stores of y(i). Unrolling the loop eliminates some scalar operations. It often improves cache locality, as well.

To handle the cases where the array bounds are not integral multiples of 16, the full procedure has four versions of the loop nest that precede the one shown in Figure 8.1. These “setup loops” process up to 15 columns of m, leaving j set to a value for which n2 - j is an integral multiple of 16. The first loop handles a single column of m, corresponding to an odd n2.
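The shape of this transformation can be sketched on a smaller scale. The version below is an illustration under stated assumptions, not the LINPACK code: it unrolls by 4 rather than 16 to stay short, it uses zero-based Python lists with m indexed as m[i][j], and it replaces dmxpy's several setup loops with a single leftover loop that handles one column at a time. The function names are hypothetical.

```python
def dmxpy_simple(n1, y, n2, x, m):
    """y(i) = y(i) + x(j) * m(i,j): the loop nest before hand transformation."""
    for j in range(n2):
        for i in range(n1):
            y[i] = y[i] + x[j] * m[i][j]
    return y


def dmxpy_unrolled4(n1, y, n2, x, m):
    """Outer (j) loop unrolled by 4, with the four updates merged into one."""
    # Setup loop (simplified): process the first n2 % 4 columns one at a
    # time so that the number of remaining columns is a multiple of 4.
    leftover = n2 % 4
    for j in range(leftover):
        for i in range(n1):
            y[i] = y[i] + x[j] * m[i][j]
    # Main loop: j advances by 4, and each pass over i covers columns
    # j-3 through j, so y[i] is read and written once per four columns.
    for j in range(leftover + 3, n2, 4):
        for i in range(n1):
            y[i] = ((((y[i]
                       + x[j - 3] * m[i][j - 3]) + x[j - 2] * m[i][j - 2])
                     + x[j - 1] * m[i][j - 1]) + x[j] * m[i][j])
    return y
```

Calling both functions on the same inputs, for example a 3 × 7 matrix and a zeroed y, should give identical results, because the merged statement adds the terms in the same order as the original loop nest. The merged assignment mirrors the structure of the statement in Figure 8.1, with 3 nested additions in place of 15.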