In all of these cases, the alphabet is the ASCII character set. When we speak of languages in this way, we will not assign any meaning to the strings; we will just be attempting to classify each string as in the language or not.

To specify some of these (possibly infinite) languages with finite descriptions, we will use the notation of regular expressions. Each regular expression stands for a set of strings.
Symbol: For each symbol a in the alphabet of the language, the regular expression a denotes the language containing just the string a.

Alternation: Given two regular expressions M and N, the alternation operator, written as a vertical bar, makes a new regular expression M | N. A string is in the language of M | N if it is in the language of M or in the language of N. Thus, the language of a | b contains the two strings a and b.

Concatenation: Given two regular expressions M and N, the concatenation operator · makes a new regular expression M · N. A string is in the language of M · N if it is the concatenation of any two strings α and β such that α is in the language of M and β is in the language of N. Thus, the regular expression (a | b) · a defines the language containing the two strings aa and ba.

Epsilon: The regular expression ∊ represents a language whose only string is the empty string. Thus, (a · b) | ∊ represents the language {"", "ab"}.

Repetition: Given a regular expression M, its Kleene closure is M*. A string is in M* if it is the concatenation of zero or more strings, all of which are in M. Thus, ((a | b) · a)* represents the infinite set {"", "aa", "ba", "aaaa", "baaa", "aaba", "baba", "aaaaaa", …}.

Using symbols, alternation, concatenation, epsilon, and Kleene closure we can specify the set of ASCII characters corresponding to the lexical tokens of a programming language. First, consider some examples:

  (0 | 1)* · 0       Binary numbers that are multiples of two.
  b*(abb*)*(a|∊)     Strings of a's and b's with no consecutive a's.
  (a|b)*aa(a|b)*     Strings of a's and b's containing consecutive a's.

In writing regular expressions, we will sometimes omit the concatenation symbol or the epsilon, and we will assume that Kleene closure "binds tighter" than concatenation, and concatenation binds tighter than alternation; so that ab | c means (a · b) | c, and (a |) means (a | ∊).

Let us introduce some more abbreviations: [abcd] means (a | b | c | d), [b-g] means [bcdefg], [b-gM-Qkr] means [bcdefgMNOPQkr], M? means (M | ∊), and M+ means (M · M*). These extensions are convenient, but none extend the descriptive power of regular expressions: Any set of strings that can be described with these abbreviations could also be described by just the basic set of operators.
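As a small illustration (not from the book), the sketch below encodes the five basic operators as a tiny Java data type and shows how the abbreviations can be rewritten in terms of them alone; the type and method names (RegExp, range, optional, plus) are made up for this example.

// Illustrative only: the abbreviations reduce to symbols, alternation,
// concatenation, epsilon, and Kleene closure.
interface RegExp {}
record Sym(char c)             implements RegExp {}   // a single symbol
record Eps()                   implements RegExp {}   // the empty string
record Alt(RegExp m, RegExp n) implements RegExp {}   // M | N
record Cat(RegExp m, RegExp n) implements RegExp {}   // M . N
record Star(RegExp m)          implements RegExp {}   // M*

class Abbrev {
    // [b-g]  =>  b | c | d | e | f | g
    static RegExp range(char lo, char hi) {
        RegExp r = new Sym(lo);
        for (char c = (char) (lo + 1); c <= hi; c++) r = new Alt(r, new Sym(c));
        return r;
    }
    // M?  =>  M | epsilon
    static RegExp optional(RegExp m) { return new Alt(m, new Eps()); }
    // M+  =>  M . M*
    static RegExp plus(RegExp m) { return new Cat(m, new Star(m)); }
}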
All the operators are summarized in Figure 2.1.

  a            An ordinary character stands for itself.
  ∊            The empty string.
  ""           Another way to write the empty string.
  M | N        Alternation, choosing from M or N.
  M · N        Concatenation, an M followed by an N.
  MN           Another way to write concatenation.
  M*           Repetition (zero or more times).
  M+           Repetition, one or more times.
  M?           Optional, zero or one occurrence of M.
  [a-zA-Z]     Character set alternation.
  .            A period stands for any single character except newline.
  "a.+*"       Quotation, a string in quotes stands for itself literally.

Figure 2.1: Regular expression notation.

Using this language, we can specify the lexical tokens of a programming language (Figure 2.2).

  if                                       IF
  [a-z][a-z0-9]*                           ID
  [0-9]+                                   NUM
  ([0-9]+"."[0-9]*)|([0-9]*"."[0-9]+)      REAL
  ("--"[a-z]*"\n")|(" "|"\n"|"\t")+        no token, just white space
  .                                        error

Figure 2.2: Regular expressions for some tokens.

The fifth line of the description recognizes comments or white space but does not report back to the parser.
Instead, the white space is discarded and the lexer resumed. The comments for this lexer begin with two dashes, contain only alphabetic characters, and end with newline.

Finally, a lexical specification should be complete, always matching some initial substring of the input; we can always achieve this by having a rule that matches any single character (and in this case, prints an "illegal character" error message and continues).

These rules are a bit ambiguous. For example, does if8 match as a single identifier or as the two tokens if and 8? Does the string if 89 begin with an identifier or a reserved word? There are two important disambiguation rules used by Lex, JavaCC, SableCC, and other similar lexical-analyzer generators:

Longest match: The longest initial substring of the input that can match any regular expression is taken as the next token.

Rule priority: For a particular longest initial substring, the first regular expression that can match determines its token-type. This means that the order of writing down the regular expression rules has significance.

Thus, if8 matches as an identifier by the longest-match rule, and if matches as a reserved word by rule-priority.
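The following is a rough sketch, not the actual mechanism of Lex, JavaCC, or SableCC: it applies the two rules directly with java.util.regex, trying a subset of the rules of Figure 2.2 in priority order, keeping the longest initial match, and letting the earliest rule win a tie. The class name and rule table are illustrative.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MaximalMunch {
    static final String[][] RULES = {            // {token type, pattern}, in priority order
        {"IF",  "if"},
        {"ID",  "[a-z][a-z0-9]*"},
        {"NUM", "[0-9]+"},
    };

    static String nextToken(String input) {
        String bestType = null;
        int bestLen = 0;
        for (String[] rule : RULES) {
            Matcher m = Pattern.compile(rule[1]).matcher(input);
            if (m.lookingAt() && m.end() > bestLen) {   // strictly longer wins;
                bestLen = m.end();                      // a tie keeps the earlier rule
                bestType = rule[0];
            }
        }
        if (bestType == null) return "error";
        return bestType + " \"" + input.substring(0, bestLen) + "\"";
    }

    public static void main(String[] args) {
        System.out.println(nextToken("if8 x"));   // ID "if8"  (longest match)
        System.out.println(nextToken("if 89"));   // IF "if"   (rule priority)
    }
}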
2.3 FINITE AUTOMATA

Regular expressions are convenient for specifying lexical tokens, but we need a formalism that can be implemented as a computer program. For this we can use finite automata (N.B. the singular of automata is automaton). A finite automaton has a finite set of states; edges lead from one state to another, and each edge is labeled with a symbol. One state is the start state, and certain of the states are distinguished as final states.

Figure 2.3 shows some finite automata. We number the states just for convenience in discussion.
The start state is numbered 1 in each case. An edge labeled with several characters is shorthand for many parallel edges; so in the ID machine there are really 26 edges each leading from state 1 to 2, each labeled by a different letter.

Figure 2.3: Finite automata for lexical tokens. The states are indicated by circles; final states are indicated by double circles. The start state has an arrow coming in from nowhere. An edge labeled with several characters is shorthand for many parallel edges.

In a deterministic finite automaton (DFA), no two edges leaving from the same state are labeled with the same symbol.
A DFA accepts or rejects a string as follows. Starting in the start state, for each character in the input string the automaton follows exactly one edge to get to the next state. The edge must be labeled with the input character. After making n transitions for an n-character string, if the automaton is in a final state, then it accepts the string. If it is not in a final state, or if at some point there was no appropriately labeled edge to follow, it rejects. The language recognized by an automaton is the set of strings that it accepts.
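As a concrete sketch (illustrative names, not the book's code), the following Java program runs the ID automaton of Figure 2.3 in exactly this way, using state 0 as a "dead" state to stand for the absence of an edge.

public class DfaRun {
    // step(state, c) follows the single edge labeled c, or goes to the dead state 0
    static int step(int state, char c) {
        boolean letter = c >= 'a' && c <= 'z';
        boolean digit  = c >= '0' && c <= '9';
        switch (state) {
            case 1:  return letter ? 2 : 0;             // start state: needs a letter
            case 2:  return (letter || digit) ? 2 : 0;  // letters and digits loop back
            default: return 0;                          // dead state loops to itself
        }
    }
    static final boolean[] isFinal = {false, false, true};  // only state 2 is final

    static boolean accepts(String s) {
        int state = 1;                                  // start state
        for (int i = 0; i < s.length(); i++) {
            state = step(state, s.charAt(i));
            if (state == 0) return false;               // no appropriately labeled edge
        }
        return isFinal[state];                          // in a final state after n transitions?
    }

    public static void main(String[] args) {
        System.out.println(accepts("if8"));   // true:  a letter, then letters and digits
        System.out.println(accepts("8if"));   // false: must begin with a letter
    }
}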
For example, it is clear that any string in the language recognized by automaton ID must begin with a letter. Any single letter leads to state 2, which is final; so a single-letter string is accepted. From state 2, any letter or digit leads back to state 2, so a letter followed by any number of letters and digits is also accepted. In fact, the machines shown in Figure 2.3 accept the same languages as the regular expressions of Figure 2.2.

These are six separate automata; how can they be combined into a single machine that can serve as a lexical analyzer? We will study formal ways of doing this in the next section, but here we will just do it ad hoc: Figure 2.4 shows such a machine.
Each final state must be labeled with the token-type that it accepts. State 2 in this machine has aspects of state 2 of the IF machine and state 2 of the ID machine; since the latter is final, the combined state must be final. State 3 is like state 3 of the IF machine and state 2 of the ID machine; because these are both final, we use rule priority to disambiguate: we label state 3 with IF because we want this token to be recognized as a reserved word, not an identifier.

Figure 2.4: Combined finite automaton.

We can encode this machine as a transition matrix: a two-dimensional array (a vector of vectors), subscripted by state number and input character.
There will be a "dead" state (state0) that loops to itself on all characters; we use this to encode the absence of an edge.int edges[][] = { /* ...012...-...e f g h i j... *//* state 0 */{0,0,...0,0,0...0...0,0,0,0,0,0...},/* state 1 */{0,0,...7,7,7...9...4,4,4,4,2,4...},/* state 2 */{0,0,...4,4,4...0...4,3,4,4,4,4...},/* state 3 */{0,0,...4,4,4...0...4,4,4,4,4,4...},/* state 4 */{0,0,...4,4,4...0...4,4,4,4,4,4...},/* state 5 */{0,0,...6,6,6...0...0,0,0,0,0,0...},/* state 6 */{0,0,...6,6,6...0...0,0,0,0,0,0...},/* state 7 */{0,0,...7,7,7...0...0,0,0,0,0,0...},/* state 8 */{0,0,...8,8,8...0...0,0,0,0,0,0...},et cetera}There must also be a "finality" array, mapping state numbers to actions - final state 2 maps toaction ID, and so on.RECOGNIZING THE LONGEST MATCHIt is easy to see how to use this table to recognize whether to accept or reject a string, but thejob of a lexical analyzer is to find the longest match, the longest initial substring of the input29that is a valid token.
While interpreting transitions, the lexer must keep track of the longest match seen so far, and the position of that match.

Keeping track of the longest match just means remembering the last time the automaton was in a final state with two variables, Last-Final (the state number of the most recent final state encountered) and Input-Position-at-Last-Final. Every time a final state is entered, the lexer updates these variables; when a dead state (a nonfinal state with no output transitions) is reached, the variables tell what token was matched, and where it ended.

Figure 2.5 shows the operation of a lexical analyzer that recognizes longest matches; note that the current input position may be far beyond the most recent position at which the recognizer was in a final state.

Figure 2.5: The automaton of Figure 2.4 recognizes several tokens.
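A hedged sketch of this longest-match loop is given below. It assumes an edges matrix like the one shown above (state 0 being the dead state), a finality array mapping each state to a token-type code (0 meaning "not a final state"), and a tokenNames table; all of these names, and the method itself, are illustrative rather than the book's code.

class LongestMatch {
    static String nextToken(int[][] edges, int[] finality, String[] tokenNames,
                            String input, int start) {
        int state = 1;                            // start state
        int lastFinal = 0;                        // most recent final state seen
        int inputPositionAtLastFinal = start;     // where that match ended
        int pos = start;
        while (pos < input.length()) {
            state = edges[state][input.charAt(pos)];
            if (state == 0) break;                // dead state: no longer match is possible
            pos++;
            if (finality[state] != 0) {           // entered a final state: remember it
                lastFinal = state;
                inputPositionAtLastFinal = pos;
            }
        }
        if (lastFinal == 0)                       // nothing matched; in practice the "."
            return "error at position " + start;  // rule of Figure 2.2 would catch this
        return tokenNames[finality[lastFinal]]
               + " ending at position " + inputPositionAtLastFinal;
    }
}

The surrounding lexer would report the matched token and then resume scanning at inputPositionAtLastFinal, discarding white space and comments as described earlier.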