K. Cooper, L. Torczon - Engineering a Compiler (2011 - 2nd edition) (798440), страница 22
Текст из файла (страница 22)
The strategies differin how they model the dfa’s transition structure and how they simulateits operation. Those differences, in turn, produce different runtime costs.The final subsection examines two different strategies for handling reservedkeywords.2.5.1 Table-Driven ScannersThe table-driven approach uses a skeleton scanner for control and a setof generated tables that encode language-specific knowledge. As shown inFigure 2.13, the compiler writer provides a set of lexical patterns, specified2.5 Implementing Scanners 61r0, 1, 2, . .
., 9EOFOtherRegisterDigitOtherOtherNextWord()state ← s0 ;lexeme ← ‘‘ ’’;clear stack;The Classifier Table, CharCatpush(bad);while (state 6= se ) doNextChar(char);lexeme ← lexeme + char;if state ∈ SAthen clear stack;push(state);cat ← CharCat[char];state ← δ[state,cat];end;while(state ∈/ SA andstate 6= bad) dostate ← pop();truncate lexeme;RegisterDigitOthers1seseseses2s2seseseseses0s1s2seThe Transition Table, δs0s1s2seinvalid invalid register invalidThe Token Type Table, TypeRollBack();end;if state ∈ SAthen return Type[state];else return invalid;s0rs10…90…9s2The Underlying DFAn FIGURE 2.14 A Table-Driven Scanner for Register Names.as regular expressions. The scanner generator then produces tables that drivethe skeleton scanner.Figure 2.14 shows a table-driven scanner for the re r [0. .
. 9]+ , which wasour first attempt at an re for iloc register names. The left side of thefigure shows the skeleton scanner, while the right side shows the tables forr [0. . . 9]+ and the underlying dfa. Notice the similarity between the codehere and the recognizer shown in Figure 2.2 on page 32.The skeleton scanner divides into four sections: initializations, a scanningloop that models the dfa’s behavior, a roll back loop in case the dfa overshoots the end of the token, and a final section that interprets and reports theresults. The scanning loop repeats the two basic actions of a scanner: reada character and simulate the dfa’s action.
It halts when the dfa enters the62 CHAPTER 2 Scannerserror state, se . Two tables, CharCat and δ, encode all knowledge about thedfa. The roll back loop uses a stack of states to revert the scanner to its mostrecent accepting state.The skeleton scanner uses the variable state to hold the current state ofthe simulated dfa. It updates state using a two-step, table-lookup process.First, it classifies char into one of a small set of categories using the CharCat table. The scanner for r [0.
. . 9]+ has three categories: Register, Digit, orOther. Next, it uses the current state and the character category as indicesinto the transition table, δ.For small examples, such as r[0 . . . 9]+ , theclassifier table is larger than the completetransition table. In a realistically sized example,that relationship should be reversed.This two-step translation, character to category, then state and category tonew state, lets the scanner use a compressed transition table.
The tradeoffbetween direct access into a larger table and indirect access into the compressed table is straightforward.A complete table would eliminate the mapping through CharCat, but would increase the memory footprint of the table.The uncompressed transition table grows as the product of the number ofstates in the dfa and the number of characters in 6; it can grow to the pointwhere it will not stay in cache.With a small, compact character set, such as ascii, CharCat can be represented as a simple table lookup. The relevant portions of CharCat shouldstay in the cache.
In that case, table compression adds one cache referenceper input character. As the character set grows (e.g. Unicode), more compleximplementations of CharCat may be needed. The precise tradeoff betweenthe per-character costs of both compressed and uncompressed tables willdepend on properties of both the language and the computer that runs thescanner.To provide a character-by-character interface to the input stream, the skeleton scanner uses a macro, NextChar, which sets its sole parameter to containthe next character in the input stream. A corresponding macro, RollBack,moves the input stream back by one character. (Section 2.5.3 looks atNextChar and RollBack.)If the scanner reads too far, state will not contain an accepting state atthe end of the first while loop.
In that case, the second while loop uses thestate trace from the stack to roll the state, lexeme, and input stream backto the most recent accepting state. In most languages, the scanner’s overshoot will be limited. Pathological behavior, however, can cause the scannerto examine individual characters many times, significantly increasing theoverall cost of scanning. In most programming languages, the amount ofroll back is small relative to the word lengths. In languages where significant amounts of roll back can occur, a more sophisticated approach to thisproblem is warranted.2.5 Implementing Scanners 63Avoiding Excess Roll BackSome regular expressions can produce quadratic calls to roll back in thescanner shown in Figure 2.14.
The problem arises from our desire to havethe scanner return the longest word that is a prefix of the input stream.Consider the re ab | (ab)∗ c. The corresponding dfa, shown in the margin,recognizes either ab or any number of occurrences of ab followed by afinal c. On the input string ababababc, a scanner built from the dfa will readall the characters and return the entire string as a single word. If, however, theinput is abababab, it must scan all of the characters before it can determinethat the longest prefix is ab. On the next invocation, it will scan abababto return ab. The third call will scan abab to return ab, and the final callwill simply return ab without any roll back.
In the worst, case, it can spendquadratic time reading the input stream.Figure 2.15 shows a modification to the scanner in Figure 2.14 that avoidsthis problem. It differs from the earlier scanner in three important ways.First, it has a global counter, InputPos, to record position in the inputstream. Second, it has a bit-array, Failed, to record dead-end transitionsas the scanner finds them. Failed has a row for each state and a column foreach position in the input stream. Third, it has an initialization routine thatNextWord()state ← s0 ;lexeme ← ‘‘ ’’;clear stack;push(hbad, badi);while (state 6= se ) doNextChar(char);InputPos ← InputPos + 1;lexeme ← lexeme + char;hstate,InputPosi ← pop();truncate lexeme;RollBack();end;if state ∈ SAthen return TokenType[state];else return bad;then break;push(hstate,InputPosi);cat ← CharCat[char];state ← δ[state,cat];end;n FIGURE 2.15 The Maximal Munch Scanner.s0a?InitializeScanner()InputPos = 0;for each state s in the DFA dofor i = 0 to |input stream| doFailed[s,i] ← false;end;end;s1b?as3 s2ab c6??cs4s5while(state ∈/ SA and state 6= bad ) doFailed[state,InputPos] ← true;if Failed[state,InputPos]if state ∈ SAthen clear stack;?64 CHAPTER 2 Scannersmust be called before NextWord() is invoked.
That routine sets InputPosto zero and sets Failed uniformly to false.This scanner, called the maximal munch scanner, avoids the pathologicalbehavior by marking dead-end transitions as they are popped from the stack.Thus, over time, it records specific hstate,input positioni pairs that cannotlead to an accepting state. Inside the scanning loop, the first while loop, thecode tests each hstate,input positioni pair and breaks out of the scanning loopwhenever a failed transition is attempted.Optimizations can drastically reduce the space requirements of this scheme.(See, for example, Exercise 16 on page 82.) Most programming languageshave simple enough microsyntax that this kind of quadratic roll back cannotoccur.
If, however, you are building a scanner for a language that can exhibitthis behavior, the scanner can avoid it for a small additional overhead percharacter.Generating the Transition and Classifier TablesGiven a dfa, the scanner generator can generate the tables in a straightforward fashion. The initial table has one column for every character in theinput alphabet and one row for each state in the dfa.
For each state, in order,the generator examines the outbound transitions and fills the row with theappropriate states. The generator can collapse identical columns into a singleinstance; as it does so, it can construct the character classifier. (Two characters belong in the same class if and only if they have identical columnsin δ.) If the dfa has been minimized, no two rows can be identical, so rowcompression is not an issue.Changing LanguagesTo model another dfa, the compiler writer can simply supply new tables.Earlier in the chapter, we worked with a second, more constrained specification for iloc register names, given by the re: r( [0.
. . 2] ([0. . . 9] | ) |[4. . . 9] | (3 (0 | 1 | )) ). That re gave rise to the following dfa:s20…2s0rs14…93s50…90,1s3s6s4Because it has more states and transitions than the re for r [0. . . 9]+ , weshould expect a larger transition table.2.5 Implementing Scanners 65s0s1s2s3s4s5s6ser0,1234 ... 9Others1seseseseseseseses2s3seses6seseses2s3seseseseseses5s3seseseseseses4s3seseseseseseseseseseseseseAs a final example, the minimal dfa for the re a (b|c)∗ has the followingtable:b,cs0as1s0s1Minimal DFAab,cOthers1seses1seseTransition TableThe character classifier has three classes: a, b or c, and all other characters.2.5.2 Direct-Coded ScannersTo improve the performance of a table-driven scanner, we must reduce thecost of one or both of its basic actions: read a character and compute thenext dfa transition.
Direct-coded scanners reduce the cost of computingdfa transitions by replacing the explicit representation of the dfa’s stateand transition graph with an implicit one. The implicit representation simplifies the two-step, table-lookup computation. It eliminates the memoryreferences entailed in that computation and allows other specializations. Theresulting scanner has the same functionality as the table-driven scanner, butwith a lower overhead per character. A direct-coded scanner is no harder togenerate than the equivalent table-driven scanner.The table-driven scanner spends most of its time inside the central whileloop; thus, the heart of a direct-coded scanner is an alternate implementation of that while loop.