K. Cooper, L. Torczon - Engineering a Compiler (2011 - 2nd edition) (798440), страница 14
Текст из файла (страница 14)
At its end, compilers confrontedchallenges that ranged from multiple functional units to long memory latencies to parallel code generation. The structure and organization of 1980s risccompilers proved flexible enough for these new challenges, so researchersbuilt new passes to insert into the optimizers and code generators of theircompilers.While Java systems use a mix of compilation and interpretation [63, 279],Java is not the first language to employ such a mix. Lisp systems have longincluded both native-code compilers and virtual-machine implementationExercises 23schemes [266, 324].
The Smalltalk-80 system used a bytecode distributionand a virtual machine [233]; several implementations added just-in-timecompilers [126].nEXERCISES1. Consider a simple Web browser that takes as input a textual string inhtml format and displays the specified graphics on the screen. Is thedisplay process one of compilation or interpretation?2.
In designing a compiler, you will face many tradeoffs. What are thefive qualities that you, as a user, consider most important in a compilerthat you purchase? Does that list change when you are the compilerwriter? What does your list tell you about a compiler that you wouldimplement?3. Compilers are used in many different circumstances.
What differencesmight you expect in compilers designed for the following applications?a. A just-in-time compiler used to translate user interface codedownloaded over a networkb. A compiler that targets the embedded processor used in a cellulartelephonec. A compiler used in an introductory programming course at a highschoold. A compiler used to build wind-tunnel simulations that run on amassively parallel processor (where all processors are identical)e. A compiler that targets numerically intensive programs to a largenumber of diverse machinesThis page intentionally left blankChapter2ScannersnCHAPTER OVERVIEWThe scanner’s task is to transform a stream of characters into a stream ofwords in the input language.
Each word must be classified into a syntacticcategory, or “part of speech.” The scanner is the only pass in the compilerto touch every character in the input program. Compiler writers place a premium on speed in scanning, in part because the scanner’s input is larger,in some measure, than that of any other pass, and, in part, because highlyefficient techniques are easy to understand and to implement.This chapter introduces regular expressions, a notation used to describethe valid words in a programming language.
It develops the formal mechanisms to generate scanners from regular expressions, either manually orautomatically.Keywords: Scanner, Finite Automaton, Regular Expression, Fixed Point2.1 INTRODUCTIONScanning is the first stage of a three-part process that the compiler usesto understand the input program. The scanner, or lexical analyzer, reads astream of characters and produces a stream of words. It aggregates characters to form words and applies a set of rules to determine whether or not eachword is legal in the source language. If the word is valid, the scanner assignsit a syntactic category, or part of speech.The scanner is the only pass in the compiler that manipulates every character of the input program.
Because scanners perform a relatively simple task,grouping characters together to form words and punctuation in the sourcelanguage, they lend themselves to fast implementations. Automatic toolsfor scanner generation are common. These tools process a mathematicalEngineering a Compiler. DOI: 10.1016/B978-0-12-088478-0.00002-5c 2012, Elsevier Inc. All rights reserved.Copyright 2526 CHAPTER 2 Scannersdescription of the language’s lexical syntax and produce a fast recognizer.Alternatively, many compilers use hand-crafted scanners; because the taskis simple, such scanners can be fast and robust.Conceptual RoadmapRecognizera program that identifies specific words in astream of charactersThis chapter describes the mathematical tools and programming techniquesthat are commonly used to construct scanners—both generated scannersand hand-crafted scanners.
The chapter begins, in Section 2.2, by introducing a model for recognizers, programs that identify words in a stream ofcharacters. Section 2.3 describes regular expressions, a formal notation forspecifying syntax. In Section 2.4, we show a set of constructions to convert aregular expression into a recognizer. Finally, in Section 2.5 we present threedifferent ways to implement a scanner: a table-driven scanner, a direct-codedscanner, and a hand-coded approach.Both generated and hand-crafted scanners rely on the same underlying techniques. While most textbooks and courses advocate the use of generatedscanners, most commercial compilers and open-source compilers use handcrafted scanners.
A hand-crafted scanner can be faster than a generatedscanner because the implementation can optimize away a portion of the overhead that cannot be avoided in a generated scanner. Because scanners aresimple and they change infrequently, many compiler writers deem that theperformance gain from a hand-crafted scanner outweighs the convenienceof automated scanner generation. We will explore both alternatives.OverviewSyntactic categorya classification of words according to theirgrammatical usageMicrosyntaxthe lexical structure of a languageA compiler’s scanner reads an input stream that consists of charactersand produces an output stream that contains words, each labelled with itssyntactic category—equivalent to a word’s part of speech in English. Toaccomplish this aggregation and classification, the scanner applies a set ofrules that describe the lexical structure of the input programming language,sometimes called its microsyntax.
The microsyntax of a programming language specifies how to group characters into words and, conversely, how toseparate words that run together. (In the context of scanning, we considerpunctuation marks and other symbols as words.)Western languages, such as English, have simple microsyntax.
Adjacentalphabetic letters are grouped together, left to right, to form a word.A blank space terminates a word, as do most nonalphabetic symbols. (Theword-building algorithm can treat a hyphen in the midst of a word asif it were an alphabetic character.) Once a group of characters has beenaggregated together to form a potential word, the word-building algorithmcan determine its validity with a dictionary lookup.2.2 Recognizing Words 27Most programming languages have equally simple microsyntax. Charactersare aggregated into words.
In most languages, blanks and punctuation marksterminate a word. For example, Algol and its descendants define an identifieras a single alphabetic character followed by zero or more alphanumeric characters. The identifier ends with the first nonalphanumeric character. Thus,fee and f1e are valid identifiers, but 12fum is not. Notice that the set ofvalid words is specified by rules rather than by enumeration in a dictionary.In a typical programming language, some words, called keywords or reserved words, match the rule for an identifier but have special meanings. Bothwhile and static are keywords in both C and Java. Keywords (and punctuation marks) form their own syntactic categories.
Even though staticmatches the rule for an identifier, the scanner in a C or Java compiler wouldundoubtedly classify it into a category that has only one element, the keyword static. To recognize keywords, the scanner can either use dictionarylookup or encode the keywords directly into its microsyntax rules.The simple lexical structure of programming languages lends itself to efficient scanners.
The compiler writer starts from a specification of the language’s microsyntax. She either encodes the microsyntax into a notationaccepted by a scanner generator, which then constructs an executable scanner, or she uses that specification to build a hand-crafted scanner.
Bothgenerated and hand-crafted scanners can be implemented to require justO(1) time per character, so they run in time proportional to the number ofcharacters in the input stream.2.2 RECOGNIZING WORDSThe simplest explanation of an algorithm to recognize words is often acharacter-by-character formulation. The structure of the code can providesome insight into the underlying problem. Consider the problem of recognizing the keyword new.
Assuming the presence of a routine NextChar thatreturns the next character, the code might look like the fragment shown inFigure 2.1. The code tests for n followed by e followed by w. At each step,failure to match the appropriate character causes the code to reject the stringand “try something else.” If the sole purpose of the program was to recognize the word new, then it should print an error message or return failure.Because scanners rarely recognize only one word, we will leave this “errorpath” deliberately vague at this point.The code fragment performs one test per character. We can represent thecode fragment using the simple transition diagram shown to the right of thecode.
The transition diagram represents a recognizer. Each circle representsan abstract state in the computation. Each state is labelled for convenience.Keyworda word that is reserved for a particular syntacticpurpose and, thus, cannot be used as an identifier28 CHAPTER 2 Scannersc ← NextChar();if (c = ‘n’)?then begin;s0n?c ← NextChar();if (c = ‘e’)then begin;s1e?c ← NextChar();if (c = ‘w’)then report success;s2w?else try something else;end;selse try something else;3end;else try something else;n FIGURE 2.1 Code Fragment to Recognize "new".siThe initial state, or start state, is s0 . We will always label the start state ass0 . State s3 is an accepting state; the recognizer reaches s3 only when theinput is new.
Accepting states are drawn with double circles, as shown inthe margin. The arrows represent transitions from state to state based on theinput character. If the recognizer starts in s0 and reads the characters n, e,and w, the transitions take us to s3 . What happens on any other input, suchas n, o, and t? The n takes the recognizer to s1 . The o does not match theedge leaving s1 , so the input word is not new. In the code, cases that do notmatch new try something else. In the recognizer, we can think of this actionas a transition to an error state. When we draw the transition diagram of arecognizer, we usually omit transitions to the error state.
Each state has atransition to the error state on each unspecified input.Using this same approach to build a recognizer for while would produce thefollowing transition diagram:s0ws1hs2is3ls4es5If it starts in s0 and reaches s5 , it has identified the word while. Thecorresponding code fragment would involve five nested if-then-elseconstructs.To recognize multiple words, we can create multiple edges that leave a givenstate.
(In the code, we would begin to elaborate the do something else paths.)2.2 Recognizing Words 29One recognizer for both new and not might bew ss2 3e3n- s0 - s1QoQst ss4 5The recognizer uses a common test for n that takes it from s0 to s1 ,nedenoted s0 → s1 . If the next character is e, it takes the transition s1 → s2 .oIf, instead, the next character is o, it makes the move s1 → s4 . Finally, a wwtin s2 , causes the transition s2 → s3 , while a t in s4 produces s4 → s5 .