Главная » Все файлы » Просмотр файлов из архивов » PDF-файлы » Cooper_Engineering_a_Compiler(Second Edition)

Cooper_Engineering_a_Compiler(Second Edition) (Rice), страница 14

Rice 1869

Описание файла

Файл "Cooper_Engineering_a_Compiler(Second Edition)" внутри архива находится в следующих папках: Rice, Купер и Торчсон - перевод. PDF-файл из архива "Rice", который расположен в категории "разное". Всё это находится в предмете "конструирование компиляторов" из седьмого семестра, которые можно найти в файловом архиве МГУ им. Ломоносова. Не смотря на прямую связь этого архива с МГУ им. Ломоносова, его также можно найти и в других разделах. .

Просмотр PDF-файла онлайн

Текст 14 страницы из PDF

Toaccomplish this aggregation and classification, the scanner applies a set ofrules that describe the lexical structure of the input programming language,sometimes called its microsyntax. The microsyntax of a programming language specifies how to group characters into words and, conversely, how toseparate words that run together. (In the context of scanning, we considerpunctuation marks and other symbols as words.)Western languages, such as English, have simple microsyntax. Adjacentalphabetic letters are grouped together, left to right, to form a word.A blank space terminates a word, as do most nonalphabetic symbols. (Theword-building algorithm can treat a hyphen in the midst of a word asif it were an alphabetic character.) Once a group of characters has beenaggregated together to form a potential word, the word-building algorithmcan determine its validity with a dictionary lookup.2.2 Recognizing Words 27Most programming languages have equally simple microsyntax.

Charactersare aggregated into words. In most languages, blanks and punctuation marksterminate a word. For example, Algol and its descendants define an identifieras a single alphabetic character followed by zero or more alphanumeric characters. The identifier ends with the first nonalphanumeric character.

Thus,fee and f1e are valid identifiers, but 12fum is not. Notice that the set ofvalid words is specified by rules rather than by enumeration in a dictionary.In a typical programming language, some words, called keywords or reserved words, match the rule for an identifier but have special meanings. Bothwhile and static are keywords in both C and Java. Keywords (and punctuation marks) form their own syntactic categories. Even though staticmatches the rule for an identifier, the scanner in a C or Java compiler wouldundoubtedly classify it into a category that has only one element, the keyword static. To recognize keywords, the scanner can either use dictionarylookup or encode the keywords directly into its microsyntax rules.The simple lexical structure of programming languages lends itself to efficient scanners.

The compiler writer starts from a specification of the language’s microsyntax. She either encodes the microsyntax into a notationaccepted by a scanner generator, which then constructs an executable scanner, or she uses that specification to build a hand-crafted scanner. Bothgenerated and hand-crafted scanners can be implemented to require justO(1) time per character, so they run in time proportional to the number ofcharacters in the input stream.2.2 RECOGNIZING WORDSThe simplest explanation of an algorithm to recognize words is often acharacter-by-character formulation. The structure of the code can providesome insight into the underlying problem.

Consider the problem of recognizing the keyword new. Assuming the presence of a routine NextChar thatreturns the next character, the code might look like the fragment shown inFigure 2.1. The code tests for n followed by e followed by w. At each step,failure to match the appropriate character causes the code to reject the stringand “try something else.” If the sole purpose of the program was to recognize the word new, then it should print an error message or return failure.Because scanners rarely recognize only one word, we will leave this “errorpath” deliberately vague at this point.The code fragment performs one test per character.

We can represent thecode fragment using the simple transition diagram shown to the right of thecode. The transition diagram represents a recognizer. Each circle representsan abstract state in the computation. Each state is labelled for convenience.Keyworda word that is reserved for a particular syntacticpurpose and, thus, cannot be used as an identifier28 CHAPTER 2 Scannersc ← NextChar();if (c = ‘n’)?then begin;s0n?c ← NextChar();if (c = ‘e’)then begin;s1e?c ← NextChar();if (c = ‘w’)then report success;s2w?else try something else;end;selse try something else;3end;else try something else;n FIGURE 2.1 Code Fragment to Recognize "new".siThe initial state, or start state, is s0 .

We will always label the start state ass0 . State s3 is an accepting state; the recognizer reaches s3 only when theinput is new. Accepting states are drawn with double circles, as shown inthe margin. The arrows represent transitions from state to state based on theinput character. If the recognizer starts in s0 and reads the characters n, e,and w, the transitions take us to s3 .

What happens on any other input, suchas n, o, and t? The n takes the recognizer to s1 . The o does not match theedge leaving s1 , so the input word is not new. In the code, cases that do notmatch new try something else. In the recognizer, we can think of this actionas a transition to an error state. When we draw the transition diagram of arecognizer, we usually omit transitions to the error state.

Each state has atransition to the error state on each unspecified input.Using this same approach to build a recognizer for while would produce thefollowing transition diagram:s0ws1hs2is3ls4es5If it starts in s0 and reaches s5 , it has identified the word while. Thecorresponding code fragment would involve five nested if-then-elseconstructs.To recognize multiple words, we can create multiple edges that leave a givenstate.

(In the code, we would begin to elaborate the do something else paths.)2.2 Recognizing Words 29One recognizer for both new and not might bew ss2 3e3n- s0 - s1QoQst ss4 5The recognizer uses a common test for n that takes it from s0 to s1 ,nedenoted s0 → s1 . If the next character is e, it takes the transition s1 → s2 .oIf, instead, the next character is o, it makes the move s1 → s4 . Finally, a wwtin s2 , causes the transition s2 → s3 , while a t in s4 produces s4 → s5 . States3 indicates that the input was new while s5 indicates that it was not. Therecognizer takes one transition per input character.We can combine the recognizer for new or not with the one for while bymerging their initial states and relabeling all the states.w ss2 3e 3n s- s0 1QoQsJwt ss4 5JJJ^Jh si sl se ss6 78910State s0 has transitions for n and w. The recognizer has three accepting states,s3 , s5 , and s10 .

If any state encounters an input character that does not matchone of its transitions, the recognizer moves to an error state.2.2.1 A Formalism for RecognizersTransition diagrams serve as abstractions of the code that would be requiredto implement them. They can also be viewed as formal mathematical objects, called finite automata, that specify recognizers. Formally, a finiteautomaton (fa) is a five-tuple (S, 6, δ, s0 , S A ), wherennS is the finite set of states in the recognizer, along with an error state se .6 is the finite alphabet used by the recognizer.

Typically, 6 is the unionof the edge labels in the transition diagram.Finite automatona formalism for recognizers that has a finite set ofstates, an alphabet, a transition function, a startstate, and one or more accepting states30 CHAPTER 2 Scannersnnnδ(s, c) is the recognizer’s transition function. It maps each state s ∈ Sand each character c ∈ 6 into some next state. In state si with inputccharacter c, the fa takes the transition si → δ(si , c).s0 ∈ S is the designated start state.S A is the set of accepting states, S A ⊆ S.

Each state in S A appears as adouble circle in the transition diagram.As an example, we can cast the fa for new or not or while in the formalismas follows:S = {s0 , s1 , s2 , s3 , s4 , s5 , s6 , s7 , s8 , s9 , s10 , se }6 = {e, h, i, l, n, o, t, w}( nwes0 → s1 , s0 → s6 , s1 → s2 ,δ=this4 → s5 , s6 → s7 , s7 → s8 ,ows1 → s4 ,s2 → s3 ,s8 → s9 ,s9 → s10l)es0 = s0S A = {s3 , s5 , s10 }For all other combinations of state si and input character c, we defineδ(si , c) = se , where se is the designated error state.

This quintuple is equivalent to the transition diagram; given one, we can easily re-create the other.The transition diagram is a picture of the corresponding fa.An fa accepts a string x if and only if, starting in s0 , the sequence of characters in the string takes the fa through a series of transitions that leavesit in an accepting state when the entire string has been consumed. Thiscorresponds to our intuition for the transition diagram. For the string new,neour example recognizer runs through the transitions s0 → s1 , s1 → s2 , andws2 → s3 .

Свежие статьи
Популярно сейчас