K. Cooper, L. Torczon - Engineering a Compiler (2011 - 2nd edition) (798440), страница 16
Текст из файла (страница 16)
A six-character identifier consisting of an alphabetic character followed by zero to five alphanumeric characters2. A string of one or more pairs, where each pair consists of an openparenthesis followed by a close parenthesis3. A Pascal comment, which consists of an open brace, {, followed byzero or more characters drawn from an alphabet, 6, followed by aclose brace, }s0a…z,A…Zs1a…z,A…Z,0…934 CHAPTER 2 Scanners2.3 REGULAR EXPRESSIONSThe set of words accepted by a finite automaton, F , forms a language,denoted L(F ).
The transition diagram of the fa specifies, in precise detail,that language. It is not, however, a specification that humans find intuitive.For any fa, we can also describe its language using a notation called a regular expression (re). The language described by an re is called a regularlanguage.Regular expressions are equivalent to the fas described in the previoussection.
(We will prove this with a construction in Section 2.4.) Simplerecognizers have simple re specifications.nnnThe language consisting of the single word new can be described by anre written as new. Writing two characters next to each other implies thatthey are expected to appear in that order.The language consisting of the two words new or while can be writtenas new or while. To avoid possible misinterpretation of or, we writethis using the symbol | to mean or. Thus, we write the re asnew | while.The language consisting of new or not can be written as new | not.Other res are possible, such as n(ew | ot).
Both res specify the samepair of words. The re n(ew | ot) suggests the structure of the fa that wedrew earlier for these two words.w ss2 3e3n s- s0 1QoQst ss4 5To make this discussion concrete, consider some examples that occur in mostprogramming languages. Punctuation marks, such as colons, semicolons,commas, and various brackets, can be represented by their character representations. Their res have the same “spelling” as the punctuation marksthemselves. Thus, the following res might occur in the lexical specificationfor a programming language:: ; ? => ( ) { } [ ]Similarly, keywords have simple res.if while this integer instanceofTo model more complex constructs, such as integers or identifiers, we needa notation that can capture the essence of the cyclic edge in an fa.2.3 Regular Expressions 35The fa for an unsigned integer, shown at the left, has three states: an initialstate s0 , an accepting state s1 for the unique integer zero, and another accepting state s2 for all other integers.
The key to this fa’s power is the transitionfrom s2 back to itself that occurs on each additional digit. State s2 folds thespecification back on itself, creating a rule to derive a new unsigned integerfrom an existing one: add another digit to the right end of the existing number. Another way of stating this rule is: an unsigned integer is either a zero,or a nonzero digit followed by zero or more digits. To capture the essenceof this fa, we need a notation for this notion of “zero or more occurrences”of an re. For the re x, we write this as x∗ , with the meaning “zero or moreoccurrences of x.” We call the * operator Kleene closure, or closure for short.Using the closure operator, we can write an re for this fa:0…91…9s0s20s10 | (1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9) (0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9)∗ .2.3.1 Formalizing the NotationTo work with regular expressions in a rigorous way, we must define themmore formally.
An re describes a set of strings over the characters containedin some alphabet, 6, augmented with a character that represents the emptystring. We call the set of strings a language. For a given re, r, we denotethe language that it specifies as L(r). An re is built up from three basicoperations:1. Alternation The alternation, or union, of two sets of strings, R and S,denoted R | S, is {x | x ∈ R or x ∈ S}.2. Concatenation The concatenation oftwo sets R and S, denoted RS,contains all strings formed by prepending an element of R onto onefrom S, or {x y | x ∈ R and y ∈ S}.S∞ i3.
Closure The Kleene closure of a set R, denoted R ∗ , is i=0 R . This isjust the union of the concatenations of R with itself, zero or more times.For convenience, we sometimes use a notation for finite closure. The notation R i denotes from one to i occurrences of R. A finite closure can bealways be replaced with an enumeration of the possibilities; for example,R 3 is just (R | R R | R R R).
The positive closure, denoted R + , is just R R ∗and consists of one or more occurrences of R. Since all these closures canbe rewritten with the three basic operations, we ignore them in the discussionthat follows.Using the three basic operations, alternation, concatenation, and Kleeneclosure, we can define the set of res over an alphabet 6 as follows:1. If a ∈ 6, then a is also an re denoting the set containing only a.2. If r and s are res, denoting sets L(r) and L(s), respectively, thenFinite closureFor any integer i, the RE Ri designates one to ioccurrences of R.Positive closureThe RE R+ denotes one or more occurrences of R,Sioften written as ∞i=1 R .36 CHAPTER 2 ScannersREGULAR EXPRESSIONS IN VIRTUAL LIFERegular expressions are used in many applications to specify patterns incharacter strings.
Some of the early work on translating REs into code wasdone to provide a flexible way of specifying strings in the "find" commandof a text editor. From that early genesis, the notation has crept into manydifferent applications.Unix and other operating systems use the asterisk as a wildcard to matchsubstrings against file names. Here, ∗ is a shorthand for the RE 6 ∗ , specifying zero or more characters drawn from the entire alphabet of legalcharacters.
(Since few keyboards have a 6 key, the shorthand has stayedwith us.) Many systems use ? as a wildcard that matches a single character.The grep family of tools, and their kin in non-Unix systems, implementregular expression pattern matching. (In fact, grep is an acronym for globalregular-expression pattern match and print.)Regular expressions have found widespread use because they are easilywritten and easily understood. They are one of the techniques of choicewhen a program must recognize a fixed vocabulary. They work well forlanguages that fit within their limited rules.
They are easily translated intoan executable form, and the resulting recognizer is fast.r | s is an re denoting the union, or alternation, of L(r) and L(s),rs is an re denoting the concatenation of L(r) and L(s), respectively, andr∗ is an re denoting the Kleene closure of L(r ).3. is an re denoting the set containing only the empty string.To eliminate any ambiguity, parentheses have highest precedence, followedby closure, concatenation, and alternation, in that order.As a convenient shorthand, we will specify ranges of characters with thefirst and the last element connected by an ellipsis, “. . . ”. To make thisabbreviation stand out, we surround it with a pair of square brackets.
Thus,[0. . . 9] represents the set of decimal digits. It can always be rewritten as(0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9).2.3.2 ExamplesThe goal of this chapter is to show how we can use formal techniques toautomate the construction of high-quality scanners and how we can encodethe microsyntax of programming languages into that formalism. Before proceeding further, some examples from real programming languages are inorder.2.3 Regular Expressions 371.
The simplified rule given earlier for identifiers in Algol-like languages,an alphabetic character followed by zero or more alphanumericcharacters, is just ([A. . . Z] | [a. . . z]) ([A. . . Z] | [a. . . z] | [0. . . 9])∗ . Mostlanguages also allow a few special characters, such as the underscore ( ),the percent sign (%), or the ampersand (&), in identifiers.If the language limits the maximum length of an identifier, we can usethe appropriate finite closure. Thus, identifiers limited to six charactersmight be specified as ([A. . . Z] | [a.
. . z]) ([A. . . Z] | [a. . . z] | [0. . . 9])5. Ifwe had to write out the full expansion of the finite closure, the re wouldbe much longer.2. An unsigned integer can be described as either zero or a nonzero digitfollowed by zero or more digits.