A.W. Appel, J. Palsberg - Modern Compiler Implementation in Java (Second Edition)
The best solution to this problem is to have side-effect-free semantic actions that build abstract syntax trees, as described in Chapter 4. Unfortunately, neither JavaCC nor SableCC supports the error-symbol error-recovery method, nor the kind of global error repair described below.

GLOBAL ERROR REPAIR

Global error repair finds the smallest set of insertions and deletions that would turn the source string into a syntactically correct string, even if the insertions and deletions are not at a point where an LL or LR parser would first report an error.

Burke-Fisher error repair We will describe a limited but useful form of global error repair, which tries every possible single-token insertion, deletion, or replacement at every point that occurs no earlier than K tokens before the point where the parser reported the error.
Thus, with K = 15, if the parsing engine gets stuck at the 100th token of the input, then it will try every possible repair between the 85th and 100th tokens.

The correction that allows the parser to parse furthest past the original reported error is taken as the best error repair. Thus, if a single-token substitution of var for type at the 98th token allows the parsing engine to proceed past the 104th token without getting stuck, this repair is a successful one.
Generally, if a repair carries the parser R = 4 tokens beyond where it originally got stuck, this is "good enough."

The advantage of this technique is that the LL(k) or LR(k) (or LALR, etc.) grammar is not modified at all (no error productions), nor are the parsing tables modified. Only the parsing engine, which interprets the parsing tables, is modified.

The parsing engine must be able to back up K tokens and reparse.
To do this, it needs to remember what the parse stack looked like K tokens ago. Therefore, the algorithm maintains two parse stacks: the current stack and the old stack. A queue of K tokens is kept; as each new token is shifted, it is pushed on the current stack and also put onto the tail of the queue; simultaneously, the head of the queue is removed and shifted onto the old stack. With each shift onto the old or current stack, the appropriate reduce actions are also performed. Figure 3.39 illustrates the two stacks and queue.

Figure 3.39: Burke-Fisher parsing, with an error-repair queue. Figure 3.18 shows the complete parse of this string according to Table 3.19.
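The bookkeeping can be sketched in a few lines of Java. This is only an illustration of the queue-and-two-stacks discipline described above, not the interface of any real parser generator; Token, ParseStack, and Tables are hypothetical stand-ins for the tokens, LR stacks, and parsing tables of an actual engine.

import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical pieces of an LR engine; only the bookkeeping below matters.
interface Token {}
interface ParseStack {}
interface Tables {
    // Shift t onto the given stack, performing any reduces it enables.
    void shiftAndReduce(ParseStack stack, Token t);
}

class BurkeFisherEngine {
    static final int K = 15;                        // error-repair window size

    final ParseStack currentStack, oldStack;
    final Deque<Token> queue = new ArrayDeque<>();  // holds at most K tokens
    final Tables tables;

    BurkeFisherEngine(ParseStack current, ParseStack old, Tables tables) {
        this.currentStack = current;
        this.oldStack = old;
        this.tables = tables;
    }

    // Each new token is shifted onto the current stack and onto the tail of
    // the queue; once the queue holds K tokens, its head is removed and
    // shifted onto the old stack, which therefore always lags K tokens behind.
    void advance(Token t) {
        tables.shiftAndReduce(currentStack, t);
        queue.addLast(t);
        if (queue.size() > K) {
            Token oldest = queue.removeFirst();
            tables.shiftAndReduce(oldStack, oldest);
        }
    }
}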
Now suppose a syntax error is detected at the current token. For each possible insertion, deletion, or substitution of a token at any position of the queue, the Burke-Fisher error repairer makes that change within (a copy of) the queue, then attempts to reparse from the old stack. The success of a modification is measured by how many tokens past the current token can be parsed; generally, if three or four new tokens can be parsed, this is considered a completely successful repair.

In a language with N kinds of tokens, there are K + K·N + K·N possible repairs (K deletions, K·N insertions, and K·N substitutions) within the K-token window. Trying this many repairs is not very costly, especially considering that it happens only when a syntax error is discovered, not during ordinary parsing.
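Continuing the hypothetical engine sketched above, the repair search itself might look roughly like this; tryParse is an assumed helper that reparses a modified token window (followed by the upcoming input) from a copy of the old stack and reports how many tokens past the original error point were accepted.

import java.util.ArrayList;
import java.util.List;

// Sketch of the repair search; Token is the placeholder type from the
// previous fragment, and tryParse is an assumption, not a real API.
class RepairSearch {
    static final int ENOUGH = 4;   // R: parsing this far past the error counts as success

    // Reparse from a copy of the old stack; return the parse distance achieved.
    int tryParse(List<Token> window) { /* omitted in this sketch */ return 0; }

    // Try every single-token deletion, insertion, and substitution inside the
    // K-token window; keep whichever modified window parses furthest.
    List<Token> bestRepair(List<Token> window, List<Token> tokenKinds) {
        List<Token> best = window;
        int bestDistance = tryParse(window);
        for (int i = 0; i < window.size(); i++) {
            List<List<Token>> candidates = new ArrayList<>();
            List<Token> deletion = new ArrayList<>(window);      // delete token i
            deletion.remove(i);
            candidates.add(deletion);
            for (Token t : tokenKinds) {
                List<Token> insertion = new ArrayList<>(window); // insert t before token i
                insertion.add(i, t);
                candidates.add(insertion);
                List<Token> substitution = new ArrayList<>(window); // replace token i with t
                substitution.set(i, t);
                candidates.add(substitution);
            }
            for (List<Token> candidate : candidates) {
                int distance = tryParse(candidate);
                if (distance > bestDistance) { bestDistance = distance; best = candidate; }
                if (distance >= ENOUGH) return candidate;        // good enough: stop early
            }
        }
        return best;
    }
}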
Semantic actions Shift and reduce actions are tried repeatedly and discarded during the search for the best error repair. Parser generators usually perform programmer-specified semantic actions along with each reduce action, but the programmer does not expect that these actions will be performed repeatedly and discarded - they may have side effects. Therefore, a Burke-Fisher parser does not execute any of the semantic actions as reductions are performed on the current stack, but waits until the same reductions are performed (permanently) on the old stack.

This means that the lexical analyzer may be up to K + R tokens ahead of the point to which semantic actions have been performed.
If semantic actions affect lexical analysis - as they do in C, compiling the typedef feature - this can be a problem with the Burke-Fisher approach. For languages with a pure context-free grammar approach to syntax, the delay of semantic actions poses no problem.

Semantic values for insertions In repairing an error by insertion, the parser needs to provide a semantic value for each token it inserts, so that semantic actions can be performed as if the token had come from the lexical analyzer. For punctuation tokens no value is necessary, but when tokens such as numbers or identifiers must be inserted, where can the value come from? The ML-Yacc parser generator, which uses Burke-Fisher error correction, has a %value directive, allowing the programmer to specify what value should be used when inserting each kind of token:

%value ID ("bogus")
%value INT (1)
%value STRING ("")
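In a hand-rolled Java engine the same idea could be a small table of placeholder values consulted whenever the repairer fabricates a token. This is only a sketch of the idea behind %value, with made-up token-kind names; it is not ML-Yacc's or any generator's actual API.

import java.util.Map;

// Hypothetical defaults playing the role of the %value directive.
class DefaultValues {
    static final Map<String, Object> DEFAULTS = Map.of(
        "ID",     "bogus",   // inserted identifiers get a dummy name
        "INT",    1,         // inserted integer literals get the value 1
        "STRING", ""         // inserted string literals are empty
    );

    // Punctuation and keyword tokens carry no value, so null is fine for them.
    static Object valueFor(String tokenKind) {
        return DEFAULTS.get(tokenKind);
    }
}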
Programmer-specified substitutions Some common kinds of errors cannot be repaired by the insertion or deletion of a single token, and sometimes a particular single-token insertion or substitution is very commonly required and should be tried first. Therefore, in an ML-Yacc grammar specification the programmer can use the %change directive to suggest error corrections to be tried first, before the default "delete or insert each possible token" repairs.

%change EQ -> ASSIGN | ASSIGN -> EQ
      | SEMICOLON ELSE -> ELSE | -> IN INT END

Here the programmer is suggesting that users often write "; else" where they mean "else", and so on.
These particular error corrections are often useful in parsing the ML programming language.

The insertion of in 0 end is a particularly important kind of correction, known as a scope closer. Programs commonly have extra left parentheses or right parentheses, or extra left or right brackets, and so on. In ML, another kind of nesting construct is let … in … end. If the programmer forgets to close a scope that was opened by a left parenthesis, then the automatic single-token insertion heuristic can close this scope where necessary. But to close a let scope requires the insertion of three tokens, which will not be done automatically unless the compiler-writer has suggested "change nothing to in 0 end" as illustrated in the %change command above.
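The same idea carries over to a hand-written engine as a table of multi-token rewrites tried before the default single-token repairs. The encoding below is an assumption that simply mirrors the %change example; the token names are illustrative and not part of any real tool.

import java.util.List;

// Hypothetical encoding of %change-style suggestions: each entry rewrites a
// (possibly empty) token sequence in the repair window into a replacement
// sequence, and these rewrites are tried before the single-token repairs.
record Suggestion(List<String> from, List<String> to) {}

class Suggestions {
    static final List<Suggestion> TRIED_FIRST = List.of(
        new Suggestion(List.of("EQ"), List.of("ASSIGN")),              // EQ -> ASSIGN
        new Suggestion(List.of("ASSIGN"), List.of("EQ")),              // ASSIGN -> EQ
        new Suggestion(List.of("SEMICOLON", "ELSE"), List.of("ELSE")), // "; else" -> "else"
        new Suggestion(List.of(), List.of("IN", "INT", "END"))         // scope closer: insert "in 0 end"
    );
}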
PROGRAM PARSING

Use JavaCC or SableCC to implement a parser for the MiniJava language. Do it by extending the specification from the corresponding exercise in the previous chapter. Appendix A describes the syntax of MiniJava.

FURTHER READING

Conway [1963] describes a predictive (recursive-descent) parser, with a notion of FIRST sets and left-factoring.
LL(k) parsing theory was formalized by Lewis and Stearns [1968]. LR(k) parsing was developed by Knuth [1965]; the SLR and LALR techniques by DeRemer [1971]; LALR(1) parsing was popularized by the development and distribution of Yacc [Johnson 1975] (which was not the first parser generator, or "compiler-compiler", as can be seen from the title of the cited paper).

Figure 3.29 summarizes many theorems on subset relations between grammar classes. Heilbrunner [1981] shows proofs of several of these theorems, including LL(k) ⊂ LR(k) and LL(1) ⊄ LALR(1) (see Exercise 3.14).
Backhouse [1979] is a good introduction to theoretical aspects of LL and LR parsing.

Aho et al. [1975] showed how deterministic LL or LR parsing engines can handle ambiguous grammars, with ambiguities resolved by precedence directives (as described in Section 3.4).

Burke and Fisher [1987] invented the error-repair tactic that keeps a K-token queue and two parse stacks.

EXERCISES

3.1 Translate each of these regular expressions into a context-free grammar.
  a. ((xy*x) | (yx*y))?
  b. ((0|1)+"."(0|1)*) | ((0|1)*"."(0|1)+)

3.2 Write a grammar for English sentences using the words
  time, arrow, banana, flies, like, a, an, the, fruit
and the semicolon.
Be sure to include all the senses (noun, verb, etc.) of each word. Then show that this grammar is ambiguous by exhibiting more than one parse tree for "time flies like an arrow; fruit flies like a banana."

3.3 Write an unambiguous grammar for each of the following languages. Hint: One way of verifying that a grammar is unambiguous is to run it through Yacc and get no conflicts.
  a. Palindromes over the alphabet {a, b} (strings that are the same backward and forward).
  b. Strings that match the regular expression a*b* and have more a's than b's.
  c. Balanced parentheses and square brackets. Example: ([[](()[()][])])
  *d. Balanced parentheses and brackets, where a closing bracket also closes any outstanding open parentheses (up to the previous open bracket).
     Example: [([](()[(][])]. Hint: First, make the language of balanced parentheses and brackets, where extra open parentheses are allowed; then make sure this nonterminal must appear within brackets.
  e. All subsets and permutations (without repetition) of the keywords public final static synchronized transient. (Then comment on how best to handle this situation in a real compiler.)
  f. Statement blocks in Pascal or ML where the semicolons separate the statements:
     ( statement ; ( statement ; statement ) ; statement )
  g. Statement blocks in C where the semicolons terminate the statements:
     { expression; { expression; expression; } expression; }

3.4 Write a grammar that accepts the same language as Grammar 3.1, but that is suitable for LL(1) parsing.
That is, eliminate the ambiguity, eliminate the left recursion, and (if necessary) left-factor.

3.5 Find nullable, FIRST, and FOLLOW sets for this grammar; then construct the LL(1) parsing table.
  0. S′ → S $
  1. S →
  2. S → X S
  3. B → \ begin { WORD }
  4. E → \ end { WORD }
  5. X → B S E
  6. X → { S }
  7. X → WORD
  8. X → begin
  9. X → end
  10. X → \ WORD

3.6 Calculate nullable, FIRST, and FOLLOW for this grammar:
  S → u B D z
  B → B v
  B → w
  D → E F
  E → y
  E →
  F → x
  F →