A.W. Appel, J. Palsberg - Modern Compiler Implementation in Java (Second Edition), страница 10
Описание файла
PDF-файл из архива "A.W. Appel, J. Palsberg - Modern Compiler Implementation in Java (Second Edition)", который расположен в категории "". Всё это находится в предмете "конструирование компиляторов" из 7 семестр, которые можно найти в файловом архиве МГУ им. Ломоносова. Не смотря на прямую связь этого архива с МГУ им. Ломоносова, его также можно найти и в других разделах. .
Просмотр PDF-файла онлайн
Текст 10 страницы из PDF
Thedefinition of DIGIT is preceeded by # to indicate that it can be used only in the definition ofother tokens.The last part of Program 2.9 begins with void Start. Itisa production which, in this case,allows the generated lexer to recognize any of the four defined tokens in any order. The nextchapter will explain productions in detail.SABLECCThe tokens described in Figure 2.2 are specified in SableCC as shown in Program 2.10. ASableCC specification file has six sections (all optional):1.
Package declaration: specifies the root package for all classes generated by SableCC.2. Helper declarations: a list of abbreviations.3. State declarations: support the state feature of, for example, GNU FLEX; when thelexer is in some state, only the tokens associated with that state are recognized.
Statescan be used for many purposes, including the detection of a beginning-of-line state,with the purpose of recognizing tokens only if they appear at the beginning of a line.For the compiler described in this book, states are not needed.4. Token declarations: each one is used to specify that the matched string should betransformed into a token that should be communicated to the parser.5. Ignored tokens: each one is used to specify that the matched string should be thrownaway.6. Productions: are used to define the grammar from which the parser is generated.PROGRAM 2.10: SableCC specification of the tokens from Figure 2.2.Helpersdigit = ['0'..'9'];Tokensif = 'if';id = ['a'..'z'](['a'..'z'] | (digit))*;number = digit+;real = ((digit)+ '.' (digit)*) |((digit)* '.' (digit)+);whitespace = (' ' | '\t' | '\n')+;comments = ('--' ['a'..'z']* '\n');Ignored Tokenswhitespace,comments;PROGRAM LEXICAL ANALYSISWrite the lexical-analysis part of a JavaCC or SableCC specification for MiniJava.
AppendixA describes the syntax of MiniJava. The directory$MINIJAVA/chap2/javacccontains a test-scaffolding file Main.java that calls the lexer generated by javacc. It alsocontains a README file that explains how to invoke javacc. Similar files for sablecc can befound in $MINIJAVA/chap2/sablecc.37FURTHER READINGLex was the first lexical-analyzer generator based on regular expressions [Lesk 1975]; it isstill widely used.Computing ∊-closure can be done more efficiently by keeping a queue or stack of stateswhose edges have not yet been checked for ∊-transitions [Aho et al. 1986]. Regularexpressions can be converted directly to DFAs without going through NFAs [McNaughtonand Yamada 1960; Aho et al.
1986].DFA transition tables can be very large and sparse. If represented as a simple twodimensional matrix (states × symbols), they take far too much memory. In practice, tables arecompressed; this reduces the amount of memory required, but increases the time required tolook up the next state [Aho et al. 1986].Lexical analyzers, whether automatically generated or handwritten, must manage their inputefficiently. Of course, input is buffered, so that a large batch of characters is obtained at once;then the lexer can process one character at a time in the buffer. The lexer must check, for eachcharacter, whether the end of the buffer is reached. By putting a sentinel - a character thatcannot be part of any token - at the end of the buffer, it is possible for the lexer to check forend-of-buffer only once per token, instead of once per character [Aho et al.
1986]. Gray[1988] uses a scheme that requires only one check per line, rather than one per token, butcannot cope with tokens that contain end-of-line characters. Bumbulis and Cowan [1993]check only once around each cycle in the DFA; this reduces the number of checks (from onceper character) when there are long paths in the DFA.Automatically generated lexical analyzers are often criticized for being slow.
In principle, theoperation of a finite automaton is very simple and should be efficient, but interpreting fromtransition tables adds overhead. Gray [1988] shows that DFAs translated directly intoexecutable code (implementing states as case statements) can run as fast as hand-coded lexers.The Flex "fast lexical-analyzer generator" [Paxson 1995] is significantly faster than Lex.EXERCISES•••2.1 Write regular expressions for each of the following.a. Strings over the alphabet {a, b, c} where the first a precedes the first b.b.
Strings over the alphabet {a, b, c} with an even number of a's.c. Binary numbers that are multiples of four.d. Binary numbers that are greater than 101001.e. Strings over the alphabet {a, b, c} that don't contain the contiguous substringbaa.f.
The language of nonnegative integer constants in C, where numbers beginningwith 0 are octal constants and other numbers are decimal constants.g. Binary numbers n such that there exists an integer solution of an+bn = cn.2.2 For each of the following, explain why you're not surprised that there is no regularexpression defining it.a. Strings of a's and b's where there are more a's than b's.b. Strings of a's and b's that are palindromes (the same forward as backward).c.
Syntactically correct Java programs.2.3 Explain in informal English what each of these finite-state automata recognizes.38•••2.4 Convert these regular expressions to nondeterministic finite automata.a. (if|then|else)b. a((b|a*c)x)*jx*a2.5 Convert these NFAs to deterministic finite automata.2.6 Find two equivalent states in the following automaton, and merge them to producea smaller automaton that recognizes the same language. Repeat until there are nolonger equivalent states.Actually, the general algorithm for minimizing finite automata works in reverse. First,find all pairs of inequivalent ststes. States X, Y are inequivalent if X is final and Y isnot or (by iteration) ifand(Y naar Y’) and X′, Y′ are inequivalent.
Afterthis iteration ceases to find new pairs of inequivalent states, then X; Y are equivalent ifthey are not inequivalent. See Hopcroft and Ullman [1979], Theorem 3.10.••*2.7 Any DFA that accepts at least one string can be converted to a regular expression.Convert the DFA of Exercise 2.3c to a regular expression. Hint: First, pretend state 1is the start state. Then write a regular expression for excursions to state 2 and back,and a similar one for excursions to state 0 and back. Or look in Hopcroft and Ullman[1979], Theorem 2.4, for the algorithm.*2.8 Suppose this DFA were used by Lex to find tokens in an input file.39•a. How many characters past the end of a token might Lex have to examinebefore matching the token?b.
Given your answer k to part (a), show an input file containing at least twotokens such that the first call to Lex will examine k characters past the end ofthe first token before returning the first token. If the answer to part (a) is zero,then show an input file containing at least two tokens, and indicate theendpoint of each token.2.9 An interpreted DFA-based lexical analyzer uses two tables,edges indexed by state and input symbol, yielding a state number, and final indexedby state, returning 0 or an action-number.Starting with this lexical specification,(aba)+(a(b*)a)(a|b)(action 1);(action 2);(action 3);generate the edges and final tables for a lexical analyzer.Then show each step of the lexer on the string abaabbaba. Be sure to show the valuesof the important internal variables of the recognizer.
There will be repeated calls to thelexer to get successive tokens.•**2.10 Lex has a lookahead operator / so that the regular expression abc/defmatches abc only when followed by def (but def is not part of the matched string, andwill be part of the next token(s)). Aho et al. [1986] describe, and Lex [Lesk 1975]uses, an incorrect algorithm for implementing lookahead (it fails on (a|ab)/ba withinput aba, matching ab where it should match a). Flex [Paxson 1995] uses a bettermechanism that works correctly for (a|ab)/ba but fails (with a warning message) onzx*/xy*.Design a better lookahead mechanism.40Chapter 3: Parsingsyn-tax: the way in which words are put together to form phrases, clauses, or sentences.Webster's DictionaryOVERVIEWThe abbreviation mechanism discussed in the previous chapter, whereby a symbol stands forsome regular expression, is convenient enough that it is tempting to use it in interesting ways:These regular expressions define sums of the form 28+301+9.But now considerThis is meant to define expressions of the form:(109+23)61(1+(250+3))in which all the parentheses are balanced.
But it is impossible for a finite automaton torecognize balanced parentheses (because a machine with N states cannot remember aparenthesis-nesting depth greater than N), so clearly sum and expr cannot be regularexpressions.So how does a lexical analyzer implement regular-expression abbreviations such as digits?The answer is that the right-hand-side ([0-9]+) is simply substituted for digits wherever itappears in regular expressions, before translation to a finite automaton.This is not possible for the sum-and-expr language; we can first substitute sum into expr,yieldingbut now an attempt to substitute expr into itself leads toand the right-hand side now has just as many occurrences of expr as it did before - in fact, ithas more!Thus, the notion of abbreviation does not add expressive power to the language of regularexpressions - there are no additional languages that can be defined - unless the abbreviationsare recursive (or mutually recursive, as are sum and expr).41The additional expressive power gained by recursion is just what we need for parsing.