K. Cooper, L. Torczon - Engineering a Compiler (2011 - 2nd edition) (798440), страница 21
Текст из файла (страница 21)
The resulting dfa is minimal; the proofis beyond our scope.ExamplesConsider a dfa that recognizes the language fee | fie, shown in Figure 2.11a.By inspection, we can see that states s3 and s5 serve the same purpose. Both56 CHAPTER 2 Scannersefs0s2es3s1is4es5(a) DFA for “fee | fie”ExaminesStepCurrentPartitionSetCharAction01234{ {s3 , s5 }, {s0 , s1, s2 , s4 } }{ {s3 , s5 }, {s0 , s1, s2 , s4 } }{ {s3 , s5 }, {s0 , s1, s2 , s4 } }{ {s3 , s5 }, {s0 , s1 }, {s2 , s4 } }{ {s3 , s5 }, {s0 }, {s1 }, {s2 , s4 } }—{s3 , s5 }{s0 , s1, s2 , s4 }{s0 , s1 }all—all—nonesplit {s2 , s4 }split {s1 }noneefall(b) Critical Steps in Minimizing the DFAs0fs1i,es2es3(c) The Minimal DFA (States Renumbered)n FIGURE 2.11 Applying the DFA Minimization Algorithm.are accepting states entered only by a transition on the letter e. Neither hasa transition that leaves the state.
We would expect the dfa minimizationalgorithm to discover this fact and replace them with a single state.Figure 2.11b shows the significant steps that occur in minimizing thisdfa. The initial partition, shown as step 0, separates accepting states fromnonaccepting states. Assuming that the while loop in the algorithm iteratesover the sets of P in order, and over the characters in 6 = {e, f, i} in order,then it first examines the set {s3 , s5 }.
Since neither state has an exiting transition, the state does not split on any character. In the second step, it examines{s0 , s1 , s2 , s4 }; on the character e, it splits {s2 , s4 } out of the set. In the thirdstep, it examines {s0 , s1 } and splits it around the character f. At that point,the partition is { {s3 , s5 }, {s0 }, {s1 }, {s2 , s4 } }. The algorithm makes one finalpass over the sets in the partition, splits none of them, and terminates.To construct the new dfa, we must build a state to represent each set inthe final partition, add the appropriate transitions from the original dfa, anddesignate initial and accepting state(s). Figure 2.11c shows the result for thisexample.2.4 From Regular Expression to Scanner 57d0ad2bcd1cbd0bd3cp1(a) Original DFAad2bcd1cp2bbd3c(b) Initial Partitionn FIGURE 2.12 DFA for a(b|c ∗ ) .As a second example, consider the dfa for a (b | c)∗ produced by Thompson’s construction and the subset construction, shown in Figure 2.12a.The first step of the minimization algorithm constructs an initial partition{ {d0 }, {d1 , d2 , d3 } }, as shown on the right.
Since p1 has only one state, itcannot be split. When the algorithm examines p2 , it finds no transitions on afrom any state in p2 . For both b and c, each state has a transition back into p2 .Thus, no symbol in 6 splits p2 , and the final partition is { {d0 }, {d1 , d2 , d3 } }.The resulting minimal dfa is shown in Figure 2.12b. Recall that this isthe dfa that we suggested a human would derive. After minimization, theautomatic techniques produce the same result.This algorithm is another example of a fixed-point computation.
P is finite;at most, it can contain |D| elements. The while loop splits sets in P, butnever combines them. Thus, |P| grows monotonically. The loop halts whensome iteration splits no sets in P. The worst-case behavior occurs wheneach state in the dfa has different behavior; in that case, the while loop haltswhen P has a distinct set for each di ∈ D. This occurs when the algorithm isapplied to a minimal dfa.2.4.5 Using a DFA as a RecognizerThus far, we have developed the mechanisms to construct a dfa implementation from a single re. To be useful, a compiler’s scanner must recognizeall the syntactic categories that appear in the grammar for the source language. What we need, then, is a recognizer that can handle all the res for thelanguage’s microsyntax.
Given the res for the various syntactic categories,r1 , r2 , r3 , . . . , rk , we can construct a single re for the entire collection byforming (r1 | r2 | r3 | . . . | rk ).If we run this re through the entire process, building an nfa, constructinga dfa to simulate the nfa, minimizing it, and turning that minimal dfa intoexecutable code, the resulting scanner recognizes the next word that matchesone of the ri ’s. That is, when the compiler invokes it on some input, thes0 ab,cs158 CHAPTER 2 Scannersscanner will examine characters one at a time and accept the string if it is inan accepting state when it exhausts the input. The scanner should return boththe text of the string and its syntactic category, or part of speech.
Since mostreal programs contain more than one word, we need to transform either thelanguage or the recognizer.At the language level, we can insist that each word end with some easily recognizable delimiter, like a blank or a tab. This idea is deceptivelyattractive. Taken literally, it requires delimiters surrounding all operators, as+, -, (, ), and the comma.At the recognizer level, we can change the implementation of the dfa and itsnotion of acceptance.
To find the longest word that matches one of the res,the dfa should run until it reaches the point where the current state, s, has nooutgoing transition on the next character. At that point, the implementationmust decide which re it has matched. Two cases arise; the first is simple.
Ifs is an accepting state, then the dfa has found a word in the language andshould report the word and its syntactic category.If s is not an accepting state, matters are more complex. Two cases occur. Ifthe dfa passed through one or more accepting states on its way to s, the recognizer should back up to the most recent such state. This strategy matchesthe longest valid prefix in the input string. If it never reached an acceptingstate, then no prefix of the input string is a valid word and the recognizershould report an error.
The scanners in Section 2.5.1 implement both thesenotions.As a final complication, an accepting state in the dfa may represent severalaccepting states in the original nfa. For example, if the lexical specification includes res for keywords as well as an re for identifiers, then akeyword such as new might match two res. The recognizer must decidewhich syntactic category to return: identifier or the singleton category forthe keyword new.Most scanner-generator tools allow the compiler writer to specify a priorityamong patterns. When the recognizer matches multiple patterns, it returnsthe syntactic category of the highest-priority pattern. This mechanismresolves the problem in a simple way.
The lex scanner generator, distributedwith many Unix systems, assigns priorities based on position in the list ofres. The first re has highest priority, while the last re has lowest priority.As a practical matter, the compiler writer must also specify res for partsof the input stream that do not form words in the program text. In mostprogramming languages, blank space is ignored, but every program containsit.
To handle blank space, the compiler writer typically includes an re thatmatches blanks, tabs, and end-of-line characters; the action on accepting2.5 Implementing Scanners 59blank space is to invoke the scanner, recursively, and return its result. Ifcomments are discarded, they are handled in a similar fashion.SECTION REVIEWGiven a regular expression, we can derive a minimal DFA to recognizethe language specified by the RE using the following steps: (1) applyThompson’s construction to build an NFA for the RE; (2) use the subsetconstruction to derive a DFA that simulates the behavior of the RE; and(3) use Hopcroft’s algorithm to identify equivalent states in the DFA andconstruct a minimal DFA. This trio of constructions produces an efficientrecognizer for any language that can be specified with an RE.Both the subset construction and the DFA minimization algorithm arefixed-point computations.
They are characterized by repeated application of a monotone function to some set; the properties of the domainplay an important role in reasoning about the termination and complexity of these algorithms. We will see more fixed-point computations inlater chapters.Review Questions1. Consider the RE who | what | where. Use Thompson’s construction tobuild an NFA from the RE. Use the subset construction to build a DFAfrom the NFA. Minimize the DFA.2. Minimize the following DFA:ts1hs2es3rs4s6es7rs8es9es5s0h2.5 IMPLEMENTING SCANNERSScanner construction is a problem where the theory of formal languages hasproduced tools that can automate implementation. For most languages, thecompiler writer can produce an acceptably fast scanner directly from a setof regular expressions.
The compiler writer creates an re for each syntacticcategory and gives the res as input to a scanner generator. The generatorconstructs an nfa for each re, joins them with -transitions, creates a corresponding dfa, and minimizes the dfa. At that point, the scanner generatormust convert the dfa into executable code.60 CHAPTER 2 ScannersLexicalScannerPatterns GeneratorTablesFAInterpretern FIGURE 2.13 Generating a Table-Driven Scanner.This section discusses three implementation strategies for converting a dfainto executable code: a table-driven scanner, a direct-coded scanner, and ahand-coded scanner.
All of these scanners operate in the same manner, bysimulating the dfa. They repeatedly read the next character in the input andsimulate the dfa transition caused by that character. This process stops whenthe dfa recognizes a word. As described in the previous section, that occurswhen the current state, s, has no outbound transition on the current inputcharacter.If s is an accepting state, the scanner recognizes the word and returns a lexeme and its syntactic category to the calling procedure.
If s is a nonacceptingstate, the scanner must determine whether or not it passed through an accepting state on the way to s. If the scanner did encounter an accepting state, itshould roll back its internal state and its input stream to that point and reportsuccess. If it did not, it should report the failure.These three implementation strategies, table driven, direct coded, and handcoded, differ in the details of their runtime costs. However, they all havethe same asymptotic complexity—constant cost per character, plus the costof roll back. The differences in the efficiency of well-implemented scannerschange the constant costs per character but not the asymptotic complexity ofscanning.The next three subsections discuss implementation differences betweentable-driven, direct-coded, and hand-coded scanners.