Cooper_Engineering_a_Compiler(Second Edition) (Rice), страница 16

PDF-файл Cooper_Engineering_a_Compiler(Second Edition) (Rice), страница 16, который располагается в категории "разное" в предмете "конструирование компиляторов" изседьмого семестра. Cooper_Engineering_a_Compiler(Second Edition) (Rice), страница 16 - СтудИзба 2019-09-18 СтудИзба
Rice 1872

Описание файла

Файл "Cooper_Engineering_a_Compiler(Second Edition)" внутри архива находится в следующих папках: Rice, Купер и Торчсон - перевод. PDF-файл из архива "Rice", который расположен в категории "разное". Всё это находится в предмете "конструирование компиляторов" из седьмого семестра, которые можно найти в файловом архиве МГУ им. Ломоносова. Не смотря на прямую связь этого архива с МГУ им. Ломоносова, его также можно найти и в других разделах. .

Просмотр PDF-файла онлайн

Текст 16 страницы из PDF

An re describes a set of strings over the characters containedin some alphabet, 6, augmented with a character that represents the emptystring. We call the set of strings a language. For a given re, r, we denotethe language that it specifies as L(r). An re is built up from three basicoperations:1. Alternation The alternation, or union, of two sets of strings, R and S,denoted R | S, is {x | x ∈ R or x ∈ S}.2.

Concatenation The concatenation oftwo sets R and S, denoted RS,contains all strings formed by prepending an element of R onto onefrom S, or {x y | x ∈ R and y ∈ S}.S∞ i3. Closure The Kleene closure of a set R, denoted R ∗ , is i=0 R . This isjust the union of the concatenations of R with itself, zero or more times.For convenience, we sometimes use a notation for finite closure. The notation R i denotes from one to i occurrences of R.

A finite closure can bealways be replaced with an enumeration of the possibilities; for example,R 3 is just (R | R R | R R R). The positive closure, denoted R + , is just R R ∗and consists of one or more occurrences of R. Since all these closures canbe rewritten with the three basic operations, we ignore them in the discussionthat follows.Using the three basic operations, alternation, concatenation, and Kleeneclosure, we can define the set of res over an alphabet 6 as follows:1. If a ∈ 6, then a is also an re denoting the set containing only a.2. If r and s are res, denoting sets L(r) and L(s), respectively, thenFinite closureFor any integer i, the RE Ri designates one to ioccurrences of R.Positive closureThe RE R+ denotes one or more occurrences of R,Sioften written as ∞i=1 R .36 CHAPTER 2 ScannersREGULAR EXPRESSIONS IN VIRTUAL LIFERegular expressions are used in many applications to specify patterns incharacter strings.

Some of the early work on translating REs into code wasdone to provide a flexible way of specifying strings in the "find" commandof a text editor. From that early genesis, the notation has crept into manydifferent applications.Unix and other operating systems use the asterisk as a wildcard to matchsubstrings against file names. Here, ∗ is a shorthand for the RE 6 ∗ , specifying zero or more characters drawn from the entire alphabet of legalcharacters.

(Since few keyboards have a 6 key, the shorthand has stayedwith us.) Many systems use ? as a wildcard that matches a single character.The grep family of tools, and their kin in non-Unix systems, implementregular expression pattern matching. (In fact, grep is an acronym for globalregular-expression pattern match and print.)Regular expressions have found widespread use because they are easilywritten and easily understood.

They are one of the techniques of choicewhen a program must recognize a fixed vocabulary. They work well forlanguages that fit within their limited rules. They are easily translated intoan executable form, and the resulting recognizer is fast.r | s is an re denoting the union, or alternation, of L(r) and L(s),rs is an re denoting the concatenation of L(r) and L(s), respectively, andr∗ is an re denoting the Kleene closure of L(r ).3. is an re denoting the set containing only the empty string.To eliminate any ambiguity, parentheses have highest precedence, followedby closure, concatenation, and alternation, in that order.As a convenient shorthand, we will specify ranges of characters with thefirst and the last element connected by an ellipsis, “. .

. ”. To make thisabbreviation stand out, we surround it with a pair of square brackets. Thus,[0. . . 9] represents the set of decimal digits. It can always be rewritten as(0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9).2.3.2 ExamplesThe goal of this chapter is to show how we can use formal techniques toautomate the construction of high-quality scanners and how we can encodethe microsyntax of programming languages into that formalism. Before proceeding further, some examples from real programming languages are inorder.2.3 Regular Expressions 371.

The simplified rule given earlier for identifiers in Algol-like languages,an alphabetic character followed by zero or more alphanumericcharacters, is just ([A. . . Z] | [a. . . z]) ([A. . . Z] | [a. . . z] | [0. . . 9])∗ . Mostlanguages also allow a few special characters, such as the underscore ( ),the percent sign (%), or the ampersand (&), in identifiers.If the language limits the maximum length of an identifier, we can usethe appropriate finite closure.

Thus, identifiers limited to six charactersmight be specified as ([A. . . Z] | [a. . . z]) ([A. . . Z] | [a. . . z] | [0. . . 9])5. Ifwe had to write out the full expansion of the finite closure, the re wouldbe much longer.2. An unsigned integer can be described as either zero or a nonzero digitfollowed by zero or more digits. The re 0 | [1. . . 9] [0. . . 9]∗ is moreconcise.

In practice, many implementations admit a larger class ofstrings as integers, accepting the language [0. . . 9]+ .3. Unsigned real numbers are more complex than integers. One possible remight be (0 | [1. . . 9] [0. . . 9]∗ ) ( | . [0. . . 9]∗ ) The first part is just the refor an integer. The rest generates either the empty string or a decimalpoint followed by zero or more digits.Programming languages often extend real numbers to scientific notation,as in (0 | [1.

. . 9] [0. . . 9]∗ ) ( | . [0. . . 9]∗ ) E ( | + | −)(0 | [1. . . 9] [0. . . 9]∗ ).This re describes a real number, followed by an E, followed by aninteger to specify the exponent.4. Quoted character strings have their own complexity. In most languages,any character can appear inside a string. While we can write an re forstrings using only the basic operators, it is our first example where acomplement operator simplifies the re. Using complement, a characterstring in c or Java can be described as “ (ˆ”)∗ ”.c and c++ do not allow a string to span multiple lines in the sourcecode—that is, if the scanner reaches the end of a line while inside astring, it terminates the string and issues an error message.

If werepresent newline with the escape sequence \n, in the c style, then there “ ( ˆ(” | \n) )∗ ” will recognize a correctly formed string and will takean error transition on a string that includes a newline.5. Comments appear in a number of forms. c++ and Java offer theprogrammer two ways of writing a comment. The delimiter // indicatesa comment that runs to the end of the current input line.

The re for thisstyle of comment is straightforward: // (ˆ\n)∗ \n, where \n represents thenewline character.Multiline comments in c, c++, and Java begin with the delimiter /* andend with */. If we could disallow * in a comment, the re would beComplement operatorThe notation ∧ c specifies the set {6 − c},the complement of c with respect to 6.Complement has higher precedence than∗ , |, or + .Escape sequenceTwo or more characters that the scannertranslates into another character.

Escapesequences are used for characters that lack aglyph, such as newline or tab, and for ones thatoccur in the syntax, such as an open or closequote.38 CHAPTER 2 Scannerssimple: /* (ˆ*)∗ */. With *, the re is more complex: /* ( ˆ* | *+ ˆ/ )∗ */.An fa to implement this re follows.s0/s1 *^**s2 *s3/s4^(*|/)The correspondence between the re and this fa is not as obvious as itwas in the examples earlier in the chapter.

Section 2.4 presentsconstructions that automate the construction of an fa from an re.The complexity of the re and fa for multiline comments arises from theuse of multi-character delimiters. The transition from s2 to s3 encodesthe fact that the recognizer has seen a * so that it can handle either theappearance of a / or the lack thereof in the correct manner. In contrast,Pascal uses single-character comment delimiters: { and }, so a Pascalcomment is just { ˆ}∗ }.Trying to be specific with an re can also lead to complex expressions. Consider, for example, that the register specifier in a typical assembly languageconsists of the letter r followed immediately by a small integer.

In iloc,which admits an unlimited set of register names, the re might be r[0. . . 9]+ ,with the following fa:s0rs10…90…9s2This recognizer accepts r29, and rejects s29. It also accepts r99999, eventhough no currently available computer has 100,000 registers.On a real computer, however, the set of register names is severely limited—say, to 32, 64, 128, or 256 registers. One way for a scanner to check validityof a register name is to convert the digits into a number and test whetheror not it falls into the range of valid register numbers. The alternative is toadopt a more precise re specification, such as:r ( [0. .

. 2] ([0. . . 9] | ) | [4. . . 9] | (3 (0 | 1 | )) )This re specifies a much smaller language, limited to register numbers0 to 31 with an optional leading 0 on single-digit register names. It accepts2.3 Regular Expressions 39r0, r00, r01, and r31, but rejects r001, r32, and r99999. The correspondingfa looks like:s20…2s0rs14…93s50…90,1s3s6s4Which fa is better? They both make a single transition on each input character. Thus, they have the same cost, even though the second fa checks a morecomplex specification. The more complex fa has more states and transitions,so its representation requires more space. However, their operating costs arethe same.This point is critical: the cost of operating an fa is proportional to the lengthof the input, not to the length or complexity of the re that generates the fa.More complex res may produce fas with more states that, in turn, need morespace.

The cost of generating an fa from an re may also rise with increasedcomplexity in the re. But, the cost of fa operation remains one transitionper input character.Can we improve our description of the register specifier? The previous re isboth complex and counterintuitive. A simpler alternative might be:r0 | r00 | r1 | r01 | r2 | r02 | r3 | r03 | r4 | r04 | r5 | r05 | r6 | r06 | r7 | r07 |r8 | r08 | r9 | r09 | r10 | r11 | r12 | r13 | r14 | r15 | r16 | r17 | r18 | r19 | r20 |r21 | r22 | r23 | r24 | r25 | r26 | r27 | r28 | r29 | r30 | r31This re is conceptually simpler, but much longer than the previous version.The resulting fa still requires one transition per input symbol.

Свежие статьи
Популярно сейчас