K. Cooper, L. Torczon - Engineering a Compiler (2011 - 2nd edition) (798440), страница 17
Текст из файла (страница 17)
The re 0 | [1. . . 9] [0. . . 9]∗ is moreconcise. In practice, many implementations admit a larger class ofstrings as integers, accepting the language [0. . . 9]+ .3. Unsigned real numbers are more complex than integers. One possible remight be (0 | [1. . . 9] [0. . . 9]∗ ) ( | . [0. .
. 9]∗ ) The first part is just the refor an integer. The rest generates either the empty string or a decimalpoint followed by zero or more digits.Programming languages often extend real numbers to scientific notation,as in (0 | [1. . . 9] [0. . . 9]∗ ) ( | .
[0. . . 9]∗ ) E ( | + | −)(0 | [1. . . 9] [0. . . 9]∗ ).This re describes a real number, followed by an E, followed by aninteger to specify the exponent.4. Quoted character strings have their own complexity. In most languages,any character can appear inside a string. While we can write an re forstrings using only the basic operators, it is our first example where acomplement operator simplifies the re.
Using complement, a characterstring in c or Java can be described as “ (ˆ”)∗ ”.c and c++ do not allow a string to span multiple lines in the sourcecode—that is, if the scanner reaches the end of a line while inside astring, it terminates the string and issues an error message. If werepresent newline with the escape sequence \n, in the c style, then there “ ( ˆ(” | \n) )∗ ” will recognize a correctly formed string and will takean error transition on a string that includes a newline.5. Comments appear in a number of forms. c++ and Java offer theprogrammer two ways of writing a comment.
The delimiter // indicatesa comment that runs to the end of the current input line. The re for thisstyle of comment is straightforward: // (ˆ\n)∗ \n, where \n represents thenewline character.Multiline comments in c, c++, and Java begin with the delimiter /* andend with */. If we could disallow * in a comment, the re would beComplement operatorThe notation ∧ c specifies the set {6 − c},the complement of c with respect to 6.Complement has higher precedence than∗ , |, or + .Escape sequenceTwo or more characters that the scannertranslates into another character.
Escapesequences are used for characters that lack aglyph, such as newline or tab, and for ones thatoccur in the syntax, such as an open or closequote.38 CHAPTER 2 Scannerssimple: /* (ˆ*)∗ */. With *, the re is more complex: /* ( ˆ* | *+ ˆ/ )∗ */.An fa to implement this re follows.s0/s1 *^**s2 *s3/s4^(*|/)The correspondence between the re and this fa is not as obvious as itwas in the examples earlier in the chapter. Section 2.4 presentsconstructions that automate the construction of an fa from an re.The complexity of the re and fa for multiline comments arises from theuse of multi-character delimiters.
The transition from s2 to s3 encodesthe fact that the recognizer has seen a * so that it can handle either theappearance of a / or the lack thereof in the correct manner. In contrast,Pascal uses single-character comment delimiters: { and }, so a Pascalcomment is just { ˆ}∗ }.Trying to be specific with an re can also lead to complex expressions. Consider, for example, that the register specifier in a typical assembly languageconsists of the letter r followed immediately by a small integer. In iloc,which admits an unlimited set of register names, the re might be r[0. . .
9]+ ,with the following fa:s0rs10…90…9s2This recognizer accepts r29, and rejects s29. It also accepts r99999, eventhough no currently available computer has 100,000 registers.On a real computer, however, the set of register names is severely limited—say, to 32, 64, 128, or 256 registers. One way for a scanner to check validityof a register name is to convert the digits into a number and test whetheror not it falls into the range of valid register numbers. The alternative is toadopt a more precise re specification, such as:r ( [0. .
. 2] ([0. . . 9] | ) | [4. . . 9] | (3 (0 | 1 | )) )This re specifies a much smaller language, limited to register numbers0 to 31 with an optional leading 0 on single-digit register names. It accepts2.3 Regular Expressions 39r0, r00, r01, and r31, but rejects r001, r32, and r99999. The correspondingfa looks like:s20…2s0rs14…93s50…90,1s3s6s4Which fa is better? They both make a single transition on each input character. Thus, they have the same cost, even though the second fa checks a morecomplex specification. The more complex fa has more states and transitions,so its representation requires more space.
However, their operating costs arethe same.This point is critical: the cost of operating an fa is proportional to the lengthof the input, not to the length or complexity of the re that generates the fa.More complex res may produce fas with more states that, in turn, need morespace. The cost of generating an fa from an re may also rise with increasedcomplexity in the re. But, the cost of fa operation remains one transitionper input character.Can we improve our description of the register specifier? The previous re isboth complex and counterintuitive.
A simpler alternative might be:r0 | r00 | r1 | r01 | r2 | r02 | r3 | r03 | r4 | r04 | r5 | r05 | r6 | r06 | r7 | r07 |r8 | r08 | r9 | r09 | r10 | r11 | r12 | r13 | r14 | r15 | r16 | r17 | r18 | r19 | r20 |r21 | r22 | r23 | r24 | r25 | r26 | r27 | r28 | r29 | r30 | r31This re is conceptually simpler, but much longer than the previous version.The resulting fa still requires one transition per input symbol. Thus, if wecan control the growth in the number of states, we might prefer this version of the re because it is clear and obvious. However, when our processorsuddenly has 256 or 384 registers, enumeration may become tedious, too.2.3.3 Closure Properties of REsRegular expressions and the languages that they generate have been the subject of extensive study. They have many interesting and useful properties.Some of these properties play a critical role in the constructions that buildrecognizers from res.Regular languagesAny language that can be specified by a regularexpression is called a regular language.40 CHAPTER 2 ScannersPROGRAMMING LANGUAGES VERSUS NATURAL LANGUAGESLexical analysis highlights one of the subtle ways in which programminglanguages differ from natural languages, such as English or Chinese.
Innatural languages, the relationship between a word’s representation—itsspelling or its pictogram—and its meaning is not obvious. In English, are isa verb while art is a noun, even though they differ only in the final character.Furthermore, not all combinations of characters are legitimate words. Forexample, arz differs minimally from are and art, but does not occur as aword in normal English usage.A scanner for English could use FA-based techniques to recognize potentialwords, since all English words are drawn from a restricted alphabet.
Afterthat, however, it must look up the prospective word in a dictionary todetermine if it is, in fact, a word. If the word has a unique part of speech,dictionary lookup will also resolve that issue. However, many English wordscan be classified with several parts of speech. Examples include buoy andstress; both can be either a noun or a verb. For these words, the part ofspeech depends on the surrounding context. In some cases, understandingthe grammatical context suffices to classify the word. In other cases, itrequires an understanding of meaning, for both the word and its context.In contrast, the words in a programming language are almost alwaysspecified lexically. Thus, any string in [1.
. . 9][0. . . 9]∗ is a positive integer.The RE [a. . . z]([a. . . z]|[0. . . 9])∗ defines a subset of the Algol identifiers;arz, are and art are all identifiers, with no lookup needed to establish thefact. To be sure, some identifiers may be reserved as keywords. However,these exceptions can be specified lexically, as well. No context is required.This property results from a deliberate decision in programming language design. The choice to make spelling imply a unique part of speechsimplifies scanning, simplifies parsing, and, apparently, gives up little inthe expressiveness of the language. Some languages have allowed wordswith dual parts of speech—for example, PL/I has no reserved keywords.The fact that more recent languages abandoned the idea suggests thatthe complications outweighed the extra linguistic flexibility.Regular expressions are closed under many operations—that is, if we applythe operation to an re or a collection of res, the result is an re.
Obviousexamples are concatenation, union, and closure. The concatenation of twores x and y is just xy. Their union is x | y. The Kleene closure of x is just x∗ .From the definition of an re, all of these expressions are also res.These closure properties play a critical role in the use of res to build scanners. Assume that we have an re for each syntactic category in the sourcelanguage, a0 , a1 , a2 , . . . , an .
Then, to construct an re for all the valid wordsin the language, we can join them with alternation as a0 | a1 | a2 | . . . | an .Since res are closed under union, the result is an re. Anything that we can2.3 Regular Expressions 41do to an re for a single syntactic category will be equally applicable to there for all the valid words in the language.Closure under union implies that any finite language is a regular language.We can construct an re for any finite collection of words by listing themin a large alternation.
Because the set of res is closed under union, thatalternation is an re and the corresponding language is regular.Closure under concatenation allows us to build complex res from simpler ones by concatenating them. This property seems both obvious andunimportant. However, it lets us piece together res in systematic ways.