K. Cooper, L. Torczon - Engineering a Compiler (2011, 2nd edition), page 24
A buffer and pointer scheme amortizes the fixed costs of the read over many single-character fetches. Making the buffer larger reduces the number of times that the scanner incurs this cost and reduces the per-character overhead.

Using a buffer and pointer also leads to a simple and efficient implementation of the RollBack operation that occurs at the end of both the generated scanners. To roll the input back, the scanner can simply decrement the input pointer.
This scheme works as long as the scanner does not decrement the pointer beyond the start of the buffer. At that point, however, the scanner needs access to the prior contents of the buffer.

Double buffering: a scheme that uses two input buffers in a modulo fashion to provide bounded roll back is often called double buffering.

In practice, the compiler writer can bound the roll-back distance that a scanner will need. With bounded roll back, the scanner can simply use two adjacent buffers and increment the pointer in a modulo fashion. [Diagram not preserved in this extract: two adjacent buffers, Buffer 0 spanning positions 0 through n-1 and Buffer 1 spanning positions n through 2n-1, with the input pointer moving across them.] To read a character, the scanner increments the pointer, modulo 2n, and returns the character at that location. To roll back a character, the program decrements the input pointer, modulo 2n.
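The wraparound arithmetic can be sketched as follows (a minimal illustration; the buffer size n is whatever the compiler writer chooses, and 4096 is only a plausible value):

```python
# One buffer holds N characters; the pair together holds 2*N.
N = 4096

def advance(ptr):
    """NextChar's pointer update: increment modulo 2n."""
    return (ptr + 1) % (2 * N)

def retreat(ptr):
    """RollBack's pointer update: decrement modulo 2n."""
    return (ptr - 1) % (2 * N)
```

Incrementing past the last slot of Buffer 1 wraps the pointer back to the start of Buffer 0, and decrementing past position 0 wraps to the end of Buffer 1.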
It must also manage the contents of the buffer, reading additional characters from the input stream as needed. Both NextChar and RollBack have simple, efficient implementations, as shown in Figure 2.17. Each execution of NextChar loads a character, increments the Input pointer, and tests whether or not to fill the buffer.
Every n characters, it fills the buffer. The code is small enough to be included inline, perhaps generated from a macro. This scheme amortizes the cost of filling the buffer over n characters. By choosing a reasonable size for n, such as 2048, 4096, or more, the compiler writer can keep the i/o overhead low.

RollBack is even less expensive.
It performs a test to ensure that the buffer contents are valid and then decrements the input pointer. Again, the implementation is sufficiently simple to be expanded inline. (If we used this implementation of NextChar and RollBack in the generated scanners, RollBack would need to truncate the final character away from lexeme.)

2.5 Implementing Scanners 71

Initialization:
    Input ← 0;
    Fence ← 0;
    fill Buffer[0 : n];

Implementing NextChar:
    Char ← Buffer[Input];
    Input ← (Input + 1) mod 2n;
    if (Input mod n = 0) then begin;
        fill Buffer[Input : Input + n - 1];
        Fence ← (Input + n) mod 2n;
    end;
    return Char;

Implementing RollBack:
    if (Input = Fence)
        then signal roll back error;
    Input ← (Input - 1) mod 2n;

FIGURE 2.17 Implementing NextChar and RollBack.

As a natural consequence of using finite buffers, RollBack has a limited history in the input stream. To keep it from decrementing the pointer beyond the start of that context, NextChar and RollBack cooperate.
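The buffer-pair scheme of Figure 2.17 can be sketched as a small executable class. This is a minimal sketch, not the book's code: it assumes an in-memory string as the input stream (a real scanner would read from a file), and the class and method names are illustrative.

```python
class DoubleBuffer:
    """Sketch of a double-buffered input with bounded roll back."""

    def __init__(self, stream, n=4096):
        self.stream = stream          # stand-in for the input file
        self.pos = 0                  # characters consumed from stream
        self.n = n                    # size of one buffer
        self.buffer = [''] * (2 * n)  # two adjacent buffers
        self.input = 0                # the Input pointer
        self.fence = 0                # start of the valid context
        self._fill(0)                 # initialization: fill Buffer[0 : n]

    def _fill(self, start):
        """Copy the next n input characters into the buffer at start."""
        chunk = self.stream[self.pos:self.pos + self.n]
        self.pos += self.n
        for i, ch in enumerate(chunk):
            self.buffer[start + i] = ch

    def next_char(self):
        """NextChar: return a character, filling a buffer every n reads."""
        ch = self.buffer[self.input]
        self.input = (self.input + 1) % (2 * self.n)
        if self.input % self.n == 0:       # crossed a buffer boundary
            self._fill(self.input)         # fill the buffer just entered
            self.fence = (self.input + self.n) % (2 * self.n)
        return ch

    def roll_back(self):
        """RollBack: decrement Input unless it has reached the Fence."""
        if self.input == self.fence:
            raise RuntimeError("roll back error")
        self.input = (self.input - 1) % (2 * self.n)
```

With n = 4, reading five characters of "abcdefghij" triggers one refill, after which a roll back followed by another read re-delivers the fifth character.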
The pointer Fence always indicates the start of the valid context. NextChar sets Fence each time it fills a buffer. RollBack checks Fence each time it tries to decrement the Input pointer.

After a long series of NextChar operations, say, more than n of them, RollBack can always back up at least n characters. However, a sequence of calls to NextChar and RollBack that work forward and backward in the buffer can create a situation where the distance between Input and Fence is less than n. Larger values of n decrease the likelihood of this situation arising. Expected backup distances should be a consideration in selecting the buffer size, n.

Generating Lexemes

The code shown for the table-driven and direct-coded scanners accumulated the input characters into a string lexeme.
If the appropriate output for each syntactic category is a textual copy of the lexeme, then those schemes are efficient. In some common cases, however, the parser, which consumes the scanner's output, needs the information in another form.

For example, in many circumstances, the natural representation for a register number is an integer, rather than a character string consisting of an 'r' and a sequence of digits. If the scanner builds a character representation, then somewhere in the interface, that string must be converted to an integer.
A typical way to accomplish that conversion uses a library routine, such as atoi in the standard C library, or a string-based i/o routine, such as sscanf. A more efficient way to solve this problem would be to accumulate the integer's value one digit at a time.

72 CHAPTER 2 Scanners

In the continuing example, the scanner could initialize a variable, RegNum, to zero in its initial state. Each time that it recognized a digit, it could multiply RegNum by 10 and add the new digit. When it reached an accepting state, RegNum would contain the needed value.
To modify the scanner in Figure 2.16, we can delete all statements that refer to lexeme, add RegNum ← 0; to s_init, and replace the occurrences of goto s2 in states s1 and s2 with:

    begin;
        RegNum ← RegNum × 10 + (char - '0');
        goto s2;
    end;

where both char and '0' are treated as their ordinal values in the ascii collating sequence. Accumulating the value this way likely has lower overhead than building the string and converting it in the accepting state.

For other words, the lexeme is implicit and, therefore, redundant.
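The digit-by-digit accumulation can be sketched as a small self-contained routine (scan_register is a hypothetical helper, not part of the book's generated scanner; it assumes the word has already been recognized as an 'r' followed by digits):

```python
def scan_register(word):
    """Accumulate a register number's value one digit at a time,
    instead of saving the lexeme and calling a conversion routine
    such as atoi afterward."""
    assert word[0] == 'r'
    reg_num = 0                       # RegNum <- 0 in the initial state
    for ch in word[1:]:
        assert '0' <= ch <= '9'
        # RegNum <- RegNum * 10 + (char - '0'), on ordinal values
        reg_num = reg_num * 10 + (ord(ch) - ord('0'))
    return reg_num
```

Each digit costs one multiply and one add, and no lexeme string is ever built.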
With singleton words, such as a punctuation mark or an operator, the syntactic category is equivalent to the lexeme. Similarly, many scanners recognize comments and white space and discard them. Again, the set of states that recognize the comment need not accumulate the lexeme. While the individual savings are small, the aggregate effect is to create a faster, more compact scanner.

This issue arises because many scanner generators let the compiler writer specify actions to be performed in an accepting state, but do not allow actions on each transition.
The resulting scanners must accumulate a character copy of the lexeme for each word, whether or not that copy is needed. If compile time matters (and it should), then attention to such minor algorithmic details leads to a faster compiler.

2.5.4 Handling Keywords

We have consistently assumed that keywords in the input language should be recognized by including explicit res for them in the description that generates the dfa and the recognizer.
Many authors have proposed an alternative strategy: having the dfa classify them as identifiers and testing each identifier to determine whether or not it is a keyword.

This strategy made sense in the context of a hand-implemented scanner. The additional complexity added by checking explicitly for keywords causes a significant expansion in the number of dfa states.
This added implementation burden matters in a hand-coded program. With a reasonable hash table (see Appendix B.4), the expected cost of each lookup should be constant. In fact, this scheme has been used as a classic application for perfect hashing. In perfect hashing, the implementor ensures, for a fixed set of keys, that the hash function generates a compact set of integers with no collisions. This lowers the cost of lookup on each keyword. If the table implementation takes into account the perfect hash function, a single probe serves to distinguish keywords from identifiers.
If it retries on a miss, however, the behavior can be much worse for nonkeywords than for keywords.

If the compiler writer uses a scanner generator to construct the recognizer, then the added complexity of recognizing keywords in the dfa is handled by the tools. The extra states that this adds consume memory, but not compile time. Using the dfa mechanism to recognize keywords avoids a table lookup on each identifier.
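For contrast, the lookup-based alternative is easy to sketch (the keyword set here is hypothetical, standing in for whatever the language definition specifies):

```python
# Hypothetical keyword set for some small language.
KEYWORDS = frozenset({"if", "then", "else", "while", "do"})

def classify(lexeme):
    """Lookup-based strategy: the DFA recognizes every keyword as an
    identifier; one hash-table probe then reclassifies it."""
    return "keyword" if lexeme in KEYWORDS else "identifier"
```

The cost is one hash probe per identifier, on every identifier in the program, which is exactly the cost that folding keywords into the dfa avoids.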
The dfa-based approach also avoids the overhead of implementing a keyword table and its support functions. In most cases, folding keyword recognition into the dfa makes more sense than using a separate lookup table.

SECTION REVIEW

Automatic construction of a working scanner from a minimal DFA is straightforward. The scanner generator can adopt a table-driven approach, wherein it uses a generic skeleton scanner and language-specific tables, or it can generate a direct-coded scanner that threads together a code fragment for each DFA state. In general, the direct-coded approach produces a faster scanner because it has lower overhead per character.

Despite the fact that all DFA-based scanners have small constant costs per character, many compiler writers choose to hand code a scanner. This approach lends itself to careful implementation of the interfaces between the scanner and the I/O system and between the scanner and the parser.

Review Questions

1.
Given the DFA shown to the left, complete the following:
a. Sketch the character classifier that you would use in a table-driven implementation of this DFA.
b. Build the transition table, based on the transition diagram and your character classifier.
c. Write an equivalent direct-coded scanner.

[Transition diagram not preserved in this extract: a DFA with states s0 through s9 and edges labeled a, b, and c; per question 2 below, it recognizes the three words abc, bca, and cab.]

2. An alternative implementation might use a recognizer for (a|b|c)(a|b|c)(a|b|c), followed by a lookup in a table that contains the three words abc, bca, and cab.
a.
Sketch the DFA for this language.
b. Show the direct-coded scanner, including the call needed to perform keyword lookup.
c. Contrast the cost of this approach with those in question 1 above.

3. What impact would the addition of transition-by-transition actions have on the DFA-minimization process? (Assume that we have a linguistic mechanism of attaching code fragments to the edges in the transition graph.)

2.6 ADVANCED TOPICS

2.6.1 DFA to Regular Expression

The final step in the cycle of constructions, shown in Figure 2.3, is to construct an re from a dfa. The combination of Thompson's construction and the subset construction provides a constructive proof that dfas are at least as powerful as res. This section presents Kleene's construction, which builds an re to describe the set of strings accepted by an arbitrary dfa. This algorithm establishes that res are at least as powerful as dfas.
Together, they show that res and dfas are equivalent.

Consider the transition diagram of a dfa as a graph with labelled edges. The problem of deriving an re that describes the language accepted by the dfa corresponds to a path problem over the dfa's transition diagram. The set of strings in L(dfa) consists of the set of edge labels for every path from d0 to di, ∀ di ∈ DA. For any dfa with a cyclic transition graph, the set of such paths is infinite. Fortunately, res have the Kleene closure operator to handle this case and summarize the complete set of subpaths created by a cycle.

Figure 2.18 shows one algorithm to compute this path expression.
It assumes that the dfa has states numbered from 0 to |D| − 1, with d0 as the start state. It generates an expression that represents the labels along all paths between two nodes, for each pair of nodes in the transition diagram. As a final step, it combines the expressions for each path that leaves d0 and reaches some accepting state, di ∈ DA. In this way, it systematically constructs the path expressions for all paths.

The algorithm computes a set of expressions, denoted R^k_ij, for all the relevant values of i, j, and k.
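In the usual formulation of Kleene's construction (indexing conventions may differ slightly from Figure 2.18), R^k_ij describes all paths from d_i to d_j whose intermediate states are all numbered below k, and the expressions satisfy:

```latex
% Base case: direct edges only, no intermediate states.
R^{0}_{ij} = \{\, a \mid \delta(d_i, a) = d_j \,\},
  \text{ plus } \epsilon \text{ when } i = j.
% Inductive step: a path either avoids state d_{k-1} entirely, or
% visits it one or more times (the cycle through it gets a closure).
R^{k}_{ij} = R^{k-1}_{i,k-1}\,\bigl(R^{k-1}_{k-1,k-1}\bigr)^{*}\,R^{k-1}_{k-1,j}
  \;\mid\; R^{k-1}_{ij}.
% Final answer: alternate the expressions from d_0 to each accepting state.
L(\textsc{dfa}) = \bigcup_{d_j \in D_A} R^{|D|}_{0j}.
```

The closure in the middle term is what summarizes the infinite set of subpaths created by cycles through the intermediate state.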