Cooper, Engineering a Compiler (Second Edition), page 25
The final step computes the set of paths that start with d0 and end in some accepting state, dj ∈ dA, as the alternation of the path expressions.

2.6.2 Another Approach to DFA Minimization: Brzozowski's Algorithm

If we apply the subset construction to an NFA that has multiple paths from the start state for some prefix, the construction will group the states involved in those duplicate prefix paths together and will create a single path for that prefix in the DFA. The subset construction always produces DFAs that have no duplicate prefix paths.
Brzozowski used this observation to devise an alternative DFA minimization algorithm that directly constructs the minimal DFA from an NFA.

(Margin note: Traditional statements of this algorithm assume that node names range from 1 to n, rather than from 0 to n−1. Thus, they place the direct paths in R^0_ij.)

[Figure 2.19, Minimizing a DFA with Brzozowski's Algorithm: (a) NFA for abc | bc | ad; (b) Reverse the NFA in (a); (c) Subset the NFA in (b); (d) Reverse the DFA in (c); (e) Subset the NFA in (d) to produce the minimal DFA.]

For an NFA n, let reverse(n) be the NFA obtained by reversing the direction of all the transitions, making the initial state into a final state, adding a new initial state, and connecting it to all of the states that were final states in n. Further, let reachable(n) be a function that returns the set of states and transitions in n that are reachable from its initial state.
Finally, let subset(n) be the DFA produced by applying the subset construction to n.

Now, given an NFA n, the minimal equivalent DFA is just

    reachable(subset(reverse(reachable(subset(reverse(n)))))).

The inner application of subset and reverse eliminates duplicate suffixes in the original NFA. Next, reachable discards any states and transitions that are no longer interesting. Finally, the outer application of the triple, reachable, subset, and reverse, eliminates any duplicate prefixes in the NFA. (Applying reverse to a DFA can produce an NFA.)

The example in Figure 2.19 shows the steps of the algorithm on a simple NFA for the RE abc | bc | ad.
The NFA in Figure 2.19a is similar to the one that Thompson's construction would produce; we have removed the ε-transitions that "glue" together the NFAs for individual letters. Figure 2.19b shows the result of applying reverse to that NFA.
Figure 2.19c depicts the DFA that subset constructs from the reverse of the NFA. At this point, the algorithm applies reachable to remove any unreachable states; our example NFA has none. Next, the algorithm applies reverse to the DFA, which produces the NFA in Figure 2.19d.
Applying subset to that NFA produces the DFA in Figure 2.19e. Since it has no unreachable states, it is the minimal DFA for abc | bc | ad.

This technique looks expensive, because it applies subset twice and we know that subset can construct an exponentially large set of states. Studies of the running times of various FA minimization techniques suggest, however, that this algorithm performs reasonably well, perhaps because of specific properties of the NFA produced by the first application of reachable(subset(reverse(n))).
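The pipeline can be sketched directly. The following is a minimal sketch, assuming an NFA represented as a tuple of (states, start states, accepting states, transition map); the representation and helper layout are our own, not the book's. In this version of subset, only states reachable from the start set are ever constructed, so the reachable step of the formula is implicit.

```python
from collections import deque

def reverse(nfa):
    """Reverse every transition; old accepting states become the start states."""
    states, starts, accepts, trans = nfa
    rtrans = {}
    for (s, a), targets in trans.items():
        for t in targets:
            rtrans.setdefault((t, a), set()).add(s)
    return (states, set(accepts), set(starts), rtrans)

def subset(nfa):
    """Subset construction; DFA states are frozensets of NFA states.
    Only states reachable from the start set are built, so the
    reachable() step in the text is implicit here."""
    _, starts, accepts, trans = nfa
    start = frozenset(starts)
    symbols = {a for (_, a) in trans}
    dstates, dtrans, work = {start}, {}, deque([start])
    while work:
        d = work.popleft()
        for a in symbols:
            target = frozenset().union(
                *(trans.get((s, a), set()) for s in d))
            if not target:
                continue
            dtrans[(d, a)] = {target}      # singleton target set: it is a DFA
            if target not in dstates:
                dstates.add(target)
                work.append(target)
    daccepts = {d for d in dstates if d & set(accepts)}
    return (dstates, {start}, daccepts, dtrans)

def brzozowski(nfa):
    """reachable(subset(reverse(reachable(subset(reverse(n))))))."""
    return subset(reverse(subset(reverse(nfa))))

def accepts(dfa, word):
    """Run a DFA produced by subset() on a word."""
    _, starts, accepting, trans = dfa
    (cur,) = starts
    for a in word:
        step = trans.get((cur, a))
        if not step:
            return False
        (cur,) = step
    return cur in accepting

# The NFA of Figure 2.19a for abc | bc | ad (state numbering is arbitrary):
nfa = (set(range(8)), {0}, {3, 5, 7},
       {(0, 'a'): {1, 6}, (1, 'b'): {2}, (2, 'c'): {3},
        (0, 'b'): {4}, (4, 'c'): {5}, (6, 'd'): {7}})
dfa = brzozowski(nfa)   # a four-state DFA, matching Figure 2.19e
```

Running the double reverse/subset on this NFA yields a four-state DFA that accepts exactly abc, bc, and ad.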
From a software-engineering perspective, it may be that implementing reverse and reachable is easier than debugging the partitioning algorithm.

2.6.3 Closure-Free Regular Expressions

One subclass of regular languages that has practical application beyond scanning is the set of languages described by closure-free regular expressions. Such REs have the form w1 | w2 | w3 | … | wn, where the individual words, wi, are just concatenations of characters in the alphabet, Σ. These REs have the property that they produce DFAs with acyclic transition graphs.

These simple regular languages are of interest for two reasons. First, many pattern-recognition problems can be described with a closure-free RE. Examples include words in a dictionary, URLs that should be filtered, and keys to a hash table. Second, the DFA for a closure-free RE can be built in a particularly efficient way.

To build the DFA for a closure-free RE, begin with a start state s0. To add a word to the existing DFA, the algorithm follows the path for the new word until it either exhausts the pattern or finds a transition to the error state, se.
In the former case, it designates the final state for the new word as an accepting state. In the latter, it adds a path for the new word's remaining suffix. The resulting DFA can be encoded in tabular form or in direct-coded form (see Section 2.5.2). Either way, the recognizer uses constant time per character in the input stream.

In this algorithm, the cost of adding a new word to an existing DFA is proportional to the length of the new word. The algorithm also works incrementally; an application can easily add new words to a DFA that is in use.
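The incremental construction can be sketched as follows. This is a minimal sketch assuming a table-driven encoding with one transition dictionary per state; a missing table entry plays the role of the error state se. The class and method names are ours, not the book's.

```python
class AcyclicDFA:
    """Table-driven acyclic DFA built incrementally from a word list."""

    def __init__(self):
        self.delta = [{}]        # delta[s][c] = next state; state 0 is s0
        self.accepting = set()

    def add_word(self, word):
        """Add one word; cost is proportional to len(word)."""
        s, i = 0, 0
        # Follow the existing path until the pattern is exhausted or the
        # next transition would go to the (implicit) error state.
        while i < len(word) and word[i] in self.delta[s]:
            s = self.delta[s][word[i]]
            i += 1
        # Splice in a fresh path for the word's remaining suffix.
        for c in word[i:]:
            self.delta.append({})
            self.delta[s][c] = len(self.delta) - 1
            s = len(self.delta) - 1
        # Designate the word's final state as accepting.
        self.accepting.add(s)

    def recognize(self, word):
        """Constant work per input character: one table lookup."""
        s = 0
        for c in word:
            if c not in self.delta[s]:
                return False     # transition to the error state
            s = self.delta[s][c]
        return s in self.accepting

keywords = AcyclicDFA()
for w in ["new", "not", "null"]:
    keywords.add_word(w)         # new words can be added at any time
```

Because add_word only ever appends states, new keys can be added to a recognizer that is already in service.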
This property makes the acyclic DFA an interesting alternative for implementing a perfect hash function.

[Margin figure: the acyclic DFA for deed, feed, and seed; from s0, edges labeled d, f, and s lead to three parallel four-state paths (s1–s4, s5–s8, s9–s12), each spelling the suffix eed.]

For a small set of keys, this technique produces an efficient recognizer. As the number of states grows (in a direct-coded recognizer) or as key length grows (in a table-driven recognizer), the implementation may slow down due to cache-size constraints.
At some point, the impact of cache misses will make an efficient implementation of a more traditional hash function more attractive than incremental construction of the acyclic DFA.

The DFAs produced in this way are not guaranteed to be minimal. Consider the acyclic DFA that it would produce for the REs deed, feed, and seed, shown to the left. It has three distinct paths that each recognize the suffix eed. Clearly, those paths can be combined to reduce the number of states and transitions in the DFA. Minimization will combine states (s1, s5, s9), states (s2, s6, s10), states (s3, s7, s11), and states (s4, s8, s12) to produce a five-state DFA.

The algorithm builds DFAs that are minimal with regard to prefixes of words in the language.
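The state counts in the deed/feed/seed example can be checked mechanically. The sketch below rebuilds the 13-state acyclic DFA as a simple trie and then counts equivalence classes of states bottom-up; two states fall into the same class exactly when they accept the same set of suffixes. The helper names and representation are ours, not the book's.

```python
def build_trie(words):
    """The incremental construction: one transition dict per state."""
    delta, accepting = [{}], set()
    for word in words:
        s = 0
        for c in word:
            if c not in delta[s]:
                delta.append({})
                delta[s][c] = len(delta) - 1
            s = delta[s][c]
        accepting.add(s)
    return delta, accepting

def minimal_state_count(delta, accepting):
    """Number of states left after merging states with equal suffix languages."""
    classes = {}                 # signature -> class id

    def class_of(s):
        # A state's future behavior is captured by whether it accepts and
        # by the class of each successor; equal signatures can merge.
        sig = (s in accepting,
               tuple(sorted((c, class_of(t)) for c, t in delta[s].items())))
        return classes.setdefault(sig, len(classes))

    class_of(0)
    return len(classes)

delta, accepting = build_trie(["deed", "feed", "seed"])
# 13 states before merging; merging equivalent suffix states leaves 5.
```

Note that merging the per-suffix triples also makes the three states entered on the first character equivalent, since each of them leads into the same merged path for eed; that is why the count drops to five rather than seven.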
Any duplication takes the form of multiple paths for the same suffix.

2.7 CHAPTER SUMMARY AND PERSPECTIVE

The widespread use of regular expressions for searching and scanning is one of the success stories of modern computer science. These ideas were developed as an early part of the theory of formal languages and automata. They are routinely applied in tools ranging from text editors to web-filtering engines to compilers as a means of concisely specifying groups of strings that happen to be regular languages. Whenever a finite collection of words must be recognized, DFA-based recognizers deserve serious consideration.

The theory of regular expressions and finite automata has developed techniques that allow the recognition of regular languages in time proportional to the length of the input stream.
Techniques for automatic derivation of DFAs from REs and for DFA minimization have allowed the construction of robust tools that generate DFA-based recognizers. Both generated and hand-crafted scanners are used in well-respected modern compilers. In either case, a careful implementation should run in time proportional to the length of the input stream, with a small overhead per character.

CHAPTER NOTES

Originally, the separation of lexical analysis, or scanning, from syntax analysis, or parsing, was justified with an efficiency argument. Since the cost of scanning grows linearly with the number of characters, and the constant costs are low, pushing lexical analysis from the parser into a separate scanner lowered the cost of compiling.
The advent of efficient parsing techniques weakened this argument, but the practice of building scanners persists because it provides a clean separation of concerns between lexical structure and syntactic structure.

Because scanner construction plays a small role in building an actual compiler, we have tried to keep this chapter brief. Thus, the chapter omits many theorems on regular languages and finite automata that the ambitious reader might enjoy.
The many good texts on this subject can provide a much deeper treatment of finite automata and regular expressions, and their many useful properties [194, 232, 315].

Kleene [224] established the equivalence of REs and FAs. Both the Kleene closure and the DFA-to-RE algorithm bear his name. McNaughton and Yamada showed one construction that relates REs to NFAs [262]. The construction shown in this chapter is patterned after Thompson's work [333], which was motivated by the implementation of a textual search command for an early text editor. Johnson describes the first application of this technology to automate scanner construction [207].