Then emit the instruction matched at node n. Emission(n) does not recur on the children of node n, but on the leaves of the tile that matched at n. For example, after the dynamic-programming algorithm finds the optimum cost of the simple tree above, the emission phase emits the instructions for the matched tiles, but no instruction is emitted for any tile rooted at the + node, because this was not a leaf of the tile matched at the root.

TREE GRAMMARS

For machines with complex instruction sets and several classes of registers and addressing modes, there is a useful generalization of the dynamic-programming algorithm.
Suppose we make a brain-damaged version of Jouette with two classes of registers: a registers for addressing, and d registers for "data." The instruction set of the Schizo-Jouette machine (loosely based on the Motorola 68000) is shown in Figure 9.4.

Figure 9.4: The Schizo-Jouette architecture.

The root and leaves of each tile must be marked with a or d to indicate which kind of register is implied.
Now, the dynamic-programming algorithm must keep track, for each node, of the min-cost match as an a register, and also the min-cost match as a d register.

At this point it is useful to use a context-free grammar to describe the tiles; the grammar will have nonterminals s (for statements), a (for expressions calculated into an a register), and d (for expressions calculated into a d register). Section 3.1 describes the use of context-free grammars for source-language syntax; here we use them for quite a different purpose. The grammar rules for the LOAD, MOVEA, and MOVED instructions might look like this:

    d → MEM(+(a, CONST))
    d → MEM(+(CONST, a))
    d → MEM(CONST)
    d → MEM(a)
    d → a
    a → d

Such a grammar is highly ambiguous: There are many different parses of the same tree (since there are many different instruction sequences implementing the same expression).
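To make this concrete, such rules could be encoded as a data table, one entry per rule, each carrying the cost of its instruction. This is only a sketch: the Rule class, the string encoding of patterns, and the uniform unit costs are assumptions made for illustration, not the book's interface.

    enum Nonterm { S, A, D }

    class Rule {
        final Nonterm lhs;     // nonterminal on the left-hand side
        final String pattern;  // right-hand-side tree pattern
        final int cost;        // cost of the instruction this rule emits
        Rule(Nonterm lhs, String pattern, int cost) {
            this.lhs = lhs; this.pattern = pattern; this.cost = cost;
        }

        static final Rule[] SCHIZO = {
            new Rule(Nonterm.D, "MEM(+(a,CONST))", 1),   // LOAD
            new Rule(Nonterm.D, "MEM(+(CONST,a))", 1),   // LOAD
            new Rule(Nonterm.D, "MEM(CONST)",      1),   // LOAD
            new Rule(Nonterm.D, "MEM(a)",          1),   // LOAD
            new Rule(Nonterm.D, "a",               1),   // MOVED
            new Rule(Nonterm.A, "d",               1),   // MOVEA
        };
    }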
Because of this ambiguity, the parsing techniques described in Chapter 3 are not very useful in this application. However, a generalization of the dynamic-programming algorithm works quite well: the minimum-cost match at each node is computed for each nonterminal of the grammar.

Though the dynamic-programming algorithm is conceptually simple, it becomes messy to write down directly in a general-purpose programming language such as Java. Thus, several tools have been developed.
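The generalized bottom-up pass might look like the following sketch, which keeps one min-cost entry per nonterminal at every node. Node, matches(), leavesOf(), and the Leaf record are hypothetical helpers, and chain rules such as d → a would in practice have to be re-applied at a node until its costs stabilize.

    import java.util.*;

    class MatchTable {
        // Per node: min cost of computing it into each nonterminal,
        // and the rule that achieves that cost.
        static class Costs {
            int[]  best     = new int[Nonterm.values().length];
            Rule[] bestRule = new Rule[Nonterm.values().length];
            Costs() { Arrays.fill(best, Integer.MAX_VALUE / 2); }  // "infinity"
        }

        Map<Node, Costs> table = new HashMap<>();

        void assignCosts(Node n) {
            for (Node child : n.children()) assignCosts(child);   // bottom-up
            Costs c = new Costs();
            for (Rule r : Rule.SCHIZO) {
                if (!matches(r, n)) continue;                     // hypothetical matcher
                int cost = r.cost;
                for (Leaf leaf : leavesOf(r, n))                  // tile leaves, each demanding a nonterminal
                    cost += table.get(leaf.node).best[leaf.nt.ordinal()];
                int i = r.lhs.ordinal();
                if (cost < c.best[i]) { c.best[i] = cost; c.bestRule[i] = r; }
            }
            table.put(n, c);
        }
    }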
Such code-generator generators process grammars that specify machine instruction sets; for each rule of the grammar, a cost and an action are specified. The costs are used to find the optimum tiling, and then the actions of the matched rules are used in the emission phase.

Like Yacc and Lex, the output of a code-generator generator is usually a program in C or Java that operates a table-driven matching engine with the action fragments (written in C or Java) inserted at the appropriate points. Such tools are quite convenient.
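In the generated Java, a single grammar rule might boil down to a table entry like the sketch below: the cost drives the tiling phase, and the action runs during emission. The Emitter interface, the constOf accessor, and the `d0/`a0 placeholder syntax are invented for illustration.

    interface Emitter { void emit(String assem); }

    class LoadRule {
        // d → MEM(+(a, CONST))   cost 1   action: emit a LOAD
        static final Rule RULE = new Rule(Nonterm.D, "MEM(+(a,CONST))", 1);

        static void action(Node n, Emitter out) {
            // constOf(n): hypothetical accessor for the CONST leaf's value
            out.emit("LOAD `d0 <- M[`a0 + " + constOf(n) + "]");
        }
    }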
Grammars can specify addressing modes of tree-like CISC instructions quite well. A typical grammar for the VAX has 112 rules and 20 nonterminal symbols; and one for the Motorola 68020 has 141 rules and 35 nonterminal symbols. However, instructions that produce more than one result - such as autoincrement instructions on the VAX - are difficult to express using tree patterns.

Code-generator generators are probably overkill for RISC machines.
The tiles are quite small, there aren't very many of them, and there is little need for a grammar with many nonterminal symbols.

FAST MATCHING

Maximal munch and the dynamic-programming algorithm must examine, for each node, all the tiles that match at that node. A tile matches if each nonleaf node of the tile is labeled with the same operator (MEM, CONST, etc.) as the corresponding node of the tree.

The naive algorithm for matching would be to examine each tile in turn, checking each node of the tile against the corresponding part of the tree. However, there are better approaches. To match a tile at node n of the tree, the label at n can be used in a case statement:

    match(n) {
        switch (label(n)) {
        case MEM: ...
        case BINOP: ...
        case CONST: ...
        }
    }

Once the clause for one label (such as MEM) is selected, only those patterns rooted in that label remain in consideration.
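For instance, the MEM clause might be refined like this (a sketch; label, child, the PLUS label for the + node, and the matchMem* helpers are hypothetical):

    void match(Node n) {
        switch (label(n)) {
        case MEM:
            switch (label(child(n, 0))) {          // discriminate on the child's label
            case PLUS:  matchMemPlus(n);  break;   // MEM(+(a,CONST)), MEM(+(CONST,a)), ...
            case CONST: matchMemConst(n); break;   // MEM(CONST)
            default:    matchMemExp(n);   break;   // MEM(a)
            }
            break;
        case BINOP: /* ... */ break;
        case CONST: /* ... */ break;
        }
    }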
As the sketch shows, another case statement can use the label of the child of n to begin distinguishing among those patterns. The organization and optimization of decision trees for pattern matching is beyond the scope of this book. However, for better performance the naive sequence of clauses in function munchExp should be rewritten as a sequence of comparisons that never looks twice at the same tree node.

EFFICIENCY OF TILING ALGORITHMS

How expensive are maximal munch and dynamic programming?

Let us suppose that there are T different tiles, and that the average matching tile contains K nonleaf (labeled) nodes. Let K′ be the largest number of nodes that ever need to be examined to see which tiles match at a given subtree; this is approximately the same as the size of the largest tile.
And suppose that, on the average, T′ different patterns (tiles) match at each tree node. For a typical RISC machine we might expect T = 50, K = 2, K′ = 4, T′ = 5.

Suppose there are N nodes in the input tree. Then maximal munch will have to consider matches at only N/K nodes because, once a "munch" is made at the root, no pattern-matching needs to take place at the nonleaf nodes of the tile.

To find all the tiles that match at one node, at most K′ tree nodes must be examined; but (with a sophisticated decision tree) each of these nodes will be examined only once. Then each of the successful matches must be compared to see if its cost is minimal. Thus, the matching at each node costs K′ + T′, for a total cost proportional to (K′ + T′)N/K.

The dynamic-programming algorithm must find all the matches at every node, so its cost is proportional to (K′ + T′)N.
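Plugging in the typical RISC figures above: maximal munch does about (4 + 5)N/2 = 4.5N units of matching work, while dynamic programming does about (4 + 5)N = 9N - roughly twice as much per node of the input tree.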
However, the constant of proportionality is higher than that of maximal munch, since dynamic programming requires two tree-walks instead of one.

Since K, K′, and T′ are constant, the running time of all of these algorithms is linear. In practice, measurements show that these instruction-selection algorithms run very quickly compared to the other work performed by a real compiler - even lexical analysis is likely to take more time than instruction selection.

9.2 CISC MACHINES

A typical modern RISC machine has

1. 32 registers,
2. only one class of integer/pointer registers,
3. arithmetic operations only between registers,
4. "three-address" instructions of the form r1 ← r2 ⊕ r3,
5. load and store instructions with only the M[reg+const] addressing mode,
6. every instruction exactly 32 bits long,
7. one result or effect per instruction.

Many machines designed between 1970 and 1985 are complex instruction set computers (CISC).
Such computers have more complicated addressing modes that encode instructions in fewer bits, which was important when computer memories were smaller and more expensive. Typical features found on CISC machines include

1. few registers (16, or 8, or 6);
2. registers divided into different classes, with some operations available only on certain registers;
3. arithmetic operations can access registers or memory through "addressing modes";
4. "two-address" instructions of the form r1 ← r1 ⊕ r2;
5. several different addressing modes;
6. variable-length instructions, formed from variable-length opcodes plus variable-length addressing modes;
7. instructions with side effects such as "autoincrement" addressing modes.

Most computer architectures designed since 1990 are RISC machines, but most general-purpose computers installed since 1990 are CISC machines: the Intel 80386 and its descendants (486, Pentium).

The Pentium, in 32-bit mode, has six general-purpose registers, a stack pointer, and a frame pointer. Most instructions can operate on all six registers, but the multiply and divide instructions operate only on the eax register. In contrast to the "three-address" instructions found on RISC machines, Pentium arithmetic instructions are generally "two-address", meaning that the destination register must be the same as the first source register.
Most instructions can have either two register operands (r1 ← r1 ⊕ r2), or one register and one memory operand, for example M[r1 + c] ← M[r1 + c] ⊕ r2 or r1 ← r1 ⊕ M[r2 + c], but not M[r1 + c1] ← M[r1 + c1] ⊕ M[r2 + c2].

We will cut through these Gordian knots as follows:

1. Few registers: We continue to generate TEMP nodes freely, and assume that the register allocator will do a good job.

2. Classes of registers: The multiply instruction on the Pentium requires that its left operand (and therefore destination) must be the eax register. The high-order bits of the result (useless to a MiniJava program) are put into register edx.
The solution is to move the operands and result explicitly; to implement t1 ← t2 × t3:

    mov eax, t2        eax ← t2
    mul t3             eax ← eax × t3;  edx ← garbage
    mov t1, eax        t1 ← eax

This looks very clumsy; but one job that the register allocator performs is to eliminate as many move instructions as possible. If the allocator can assign t1 or t2 (or both) to register eax, then it can delete one or both of the move instructions.

3. Two-address instructions: We solve this problem in the same way as we solve the previous one: by adding extra move instructions.
To implement t1 ← t2 + t3 we produce

    mov t1, t2         t1 ← t2
    add t1, t3         t1 ← t1 + t3

Then we hope that the register allocator will be able to allocate t1 and t2 to the same register, so that the move instruction will be deleted.

4. Arithmetic operations can address memory: The instruction selection phase turns every TEMP node into a "register" reference. Many of these "registers" will actually turn out to be memory locations. The spill phase of the register allocator must be made to handle this case efficiently; see Chapter 11.

The alternative to using memory-mode operands is simply to fetch all the operands into registers before operating and store them back to memory afterwards.
For example, these two sequences compute the same thing:

    mov eax, [ebp - 8]            add [ebp - 8], ecx
    add eax, ecx
    mov [ebp - 8], eax

The sequence on the right is more concise (and takes less machine-code space), but the two sequences are equally fast. The load, register-register add, and store take 1 cycle each, and the memory-register add takes 3 cycles. On a highly pipelined machine such as the Pentium Pro, simple cycle counts are not the whole story, but the result will be the same: The processor has to perform the load, add, and store, no matter how the instructions specify them.

The sequence on the left has one significant disadvantage: It trashes the value in register eax.
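One way to avoid trashing a specific register is to expand the memory-mode operation using a fresh temporary, leaving the choice of real register to the allocator. The sketch below follows the spirit of the book's Assem and Temp interfaces, but the exact constructor shapes, the ecx temporary, and the frame-pointer offset are assumptions made for illustration.

    // Fetch/operate/store using a fresh temporary instead of eax.
    Temp t = new Temp();                          // fresh; the allocator picks the register
    emit(new Assem.OPER("mov `d0, [`s0 - 8]",     // t ← M[ebp - 8]
            new TempList(t, null),
            new TempList(frame.FP(), null)));
    emit(new Assem.OPER("add `d0, `s1",           // t ← t + ecx (two-address: d0 = s0 = t)
            new TempList(t, null),
            new TempList(t, new TempList(ecx, null))));
    emit(new Assem.OPER("mov [`s0 - 8], `s1",     // M[ebp - 8] ← t; no destination temp
            null,
            new TempList(frame.FP(), new TempList(t, null))));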