Cooper, Engineering a Compiler (Second Edition), page 27
Section 3.4 examines bottom-up parsing as exemplified by LR(1) parsers. Section 3.4.2 presents the detailed algorithm for generating canonical LR(1) parsers. The final section explores several practical issues that arise in parser construction.

Overview

Parsing: given a stream s of words and a grammar G, find a derivation in G that produces s

A compiler’s parser has the primary responsibility for recognizing syntax, that is, for determining if the program being compiled is a valid sentence in the syntactic model of the programming language. That model is expressed as a formal grammar G; if some string of words s is in the language defined by G, we say that G derives s.
For a stream of words s and a grammar G, the parser tries to build a constructive proof that s can be derived in G, a process called parsing.

Parsing algorithms fall into two general categories. Top-down parsers try to match the input stream against the productions of the grammar by predicting the next word (at each point). For a limited class of grammars, such prediction can be both accurate and efficient.
Bottom-up parsers work from low-level detail (the actual sequence of words) and accumulate context until the derivation is apparent. Again, there exists a restricted class of grammars for which we can generate efficient bottom-up parsers. In practice, these restricted sets of grammars are large enough to encompass most features of interest in programming languages.

3.2 EXPRESSING SYNTAX

The task of the parser is to determine whether or not some stream of words fits into the syntax of the parser’s intended source language. Implicit in this description is the notion that we can describe syntax and check it; in practice, we need a notation to describe the syntax of languages that people might use to program computers.
In Chapter 2, we worked with one such notation, regular expressions. They provide a concise notation for describing syntax and an efficient mechanism for testing the membership of a string in the language described by an RE. Unfortunately, REs lack the power to describe the full syntax of most programming languages.

For most programming languages, syntax is expressed in the form of a context-free grammar.
This section introduces and defines CFGs and explores their use in syntax-checking. It shows how we can begin to encode meaning into syntax and structure. Finally, it introduces the ideas that underlie the efficient parsing techniques described in the following sections.

3.2.1 Why Not Regular Expressions?

To motivate the use of CFGs, consider the problem of recognizing algebraic expressions over variables and the operators +, -, ×, and ÷. We can define “variable” as any string that matches the RE [a...z] ([a...z] | [0...9])∗, a simplified, lowercase version of an Algol identifier. Now, we can define an expression as follows:

[a...z] ([a...z] | [0...9])∗ ( (+ | - | × | ÷) [a...z] ([a...z] | [0...9])∗ )∗

This RE matches “a + b × c” and “fee ÷ fie × foe”. Nothing about the RE suggests a notion of operator precedence; in “a + b × c,” which operator executes first, the + or the ×? The standard rule from algebra suggests × and ÷ have precedence over + and -.
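The RE above can be tried out directly. The following is a minimal sketch in Python, assuming the standard `re` module; the names `VAR`, `EXPR`, and `is_expression` are hypothetical, and the optional whitespace between words is an addition for convenience (the RE in the text describes words that a scanner has already separated).

```python
import re

# Variable: a lowercase letter followed by lowercase letters or digits.
VAR = r"[a-z][a-z0-9]*"

# Expression: a variable followed by zero or more (operator, variable) pairs.
# The \s* pieces are an assumption added so that spaced input can be tested.
EXPR = re.compile(rf"{VAR}(?:\s*[+\-×÷]\s*{VAR})*")


def is_expression(text: str) -> bool:
    """Return True when the whole string matches the expression RE."""
    return EXPR.fullmatch(text) is not None


if __name__ == "__main__":
    for sample in ["a + b × c", "fee ÷ fie × foe", "a +", "( a + b ) × c"]:
        print(repr(sample), is_expression(sample))
```

The recognizer only answers yes or no. It treats “a + b × c” as a flat sequence of words, so nothing in the match records whether + or × should be applied first.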
To enforce other evaluation orders, normal algebraic notation includes parentheses.

Adding parentheses to the RE in the places where they need to appear is somewhat tricky. An expression can start with a ‘(’, so we need the option for an initial ‘(’. Similarly, we need the option for a final ‘)’.

( ‘(’ | ε ) [a...z] ([a...z] | [0...9])∗ ( (+ | - | × | ÷) [a...z] ([a...z] | [0...9])∗ )∗ ( ‘)’ | ε )

We will quote ‘(’ and ‘)’ so that they are visually distinct from the ( and ) used for grouping in REs.

This RE can produce an expression enclosed in parentheses, but not one with internal parentheses to denote precedence. The internal instances of ‘(’ all occur before a variable; similarly, the internal instances of ‘)’ all occur after a variable. This observation suggests the following RE:

( ‘(’ | ε ) [a...z] ([a...z] | [0...9])∗ ( (+ | - | × | ÷) [a...z] ([a...z] | [0...9])∗ ( ‘)’ | ε ) )∗

Notice that we simply moved the final ‘)’ inside the closure.

This RE matches both “a + b × c” and “( a + b ) × c.” It will match any correctly parenthesized expression over variables and the four operators in the RE. Unfortunately, it also matches many syntactically incorrect expressions, such as “a + ( b × c” and “a + b ) × c ).” In fact, we cannot write an RE that matches exactly the expressions with balanced parentheses.
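To make the problem concrete, here is a minimal Python sketch. It uses a variant of the RE that allows an optional ( before every variable and an optional ) after every variable; that variant is an assumption that goes slightly beyond the RE as printed, where the optional ( appears only at the start. The names `PAREN_EXPR`, `re_accepts`, and `balanced` are hypothetical.

```python
import re

VAR = r"[a-z][a-z0-9]*"
OPEN = r"(?:\(\s*)?"    # optional literal '(' before a variable
CLOSE = r"(?:\s*\))?"   # optional literal ')' after a variable
OP = r"\s*[+\-×÷]\s*"   # one of the four operators, with optional spaces

# Assumed variant: every variable may carry its own optional parentheses.
PAREN_EXPR = re.compile(rf"{OPEN}{VAR}{CLOSE}(?:{OP}{OPEN}{VAR}{CLOSE})*")


def re_accepts(text: str) -> bool:
    """Return True when the whole string matches the parenthesized RE."""
    return PAREN_EXPR.fullmatch(text) is not None


def balanced(text: str) -> bool:
    """Check parenthesis balance with an unbounded counter."""
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:      # a ')' with no matching '('
                return False
    return depth == 0


if __name__ == "__main__":
    for sample in ["( a + b ) × c", "a + ( b × c", "a + b ) × c )"]:
        print(repr(sample), re_accepts(sample), balanced(sample))
```

The RE accepts all three samples, while `balanced` rejects the last two. The difference is the counter `depth`, which can grow without bound; a recognizer with a fixed, finite set of states has no equivalent.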
(Paired constructs, such as begin and end or then and else, play an important role in most programming languages.) This fact is a fundamental limitation of REs; the corresponding recognizers cannot count because they have only a finite set of states. The language (^m )^n where m = n, that is, m left parentheses followed by the same number of right parentheses, is not regular. In principle, DFAs cannot count. While they work well for microsyntax, they are not suitable to describe some important programming language features.

3.2.2 Context-Free Grammars

Context-free grammar: For a language L, its CFG defines the sets of strings of symbols that are valid sentences in L.

Sentence: a string of symbols that can be derived from the rules of a grammar

To describe programming language syntax, we need a more powerful notation than regular expressions that still leads to efficient recognizers. The traditional solution is to use a context-free grammar (CFG).
Fortunately, large subclasses of the CFGs have the property that they lead to efficient recognizers.

A context-free grammar, G, is a set of rules that describe how to form sentences. The collection of sentences that can be derived from G is called the language defined by G, denoted L(G). The set of languages defined by context-free grammars is called the set of context-free languages. An example may help. Consider the following grammar, which we call SN:

SheepNoise → baa SheepNoise
           | baa

Production: Each rule in a CFG is called a production.

Nonterminal symbol: a syntactic variable used in a grammar’s productions

Terminal symbol: a word that can occur in a sentence. A word consists of a lexeme and its syntactic category. Words are represented in a grammar by their syntactic category.

The first rule, or production, reads “SheepNoise can derive the word baa followed by more SheepNoise.” Here SheepNoise is a syntactic variable representing the set of strings that can be derived from the grammar. We call such a syntactic variable a nonterminal symbol. Each word in the language defined by the grammar is a terminal symbol. The second rule reads “SheepNoise can also derive the string baa.”

To understand the relationship between the SN grammar and L(SN), we need to specify how to apply rules in SN to derive sentences in L(SN).
To begin, we must identify the goal symbol or start symbol of SN. The goal symbol represents the set of all strings in L(SN). As such, it cannot be one of the words in the language. Instead, it must be one of the nonterminal symbols introduced to add structure and abstraction to the language. Since SN has only one nonterminal, SheepNoise must be the goal symbol.

BACKUS-NAUR FORM

The traditional notation used by computer scientists to represent a context-free grammar is called Backus-Naur form, or BNF. BNF denoted nonterminal symbols by wrapping them in angle brackets, like <SheepNoise>. Terminal symbols were underlined. The symbol ::= means "derives," and the symbol | means "also derives." In BNF, the sheep noise grammar becomes:

<SheepNoise> ::= baa <SheepNoise>
               | baa

This is completely equivalent to our grammar SN.

BNF has its origins in the late 1950s and early 1960s [273]. The syntactic conventions of angle brackets, underlining, ::=, and | arose from the limited typographic options available to people writing language descriptions. (For example, see David Gries’ book Compiler Construction for Digital Computers, which was printed entirely on a standard lineprinter [171].)

Throughout this book, we use a typographically updated form of BNF. Nonterminals are written in italics. Terminals are written in the typewriter font. We use the symbol → for "derives."

To derive a sentence, we start with a prototype string that contains just the goal symbol, SheepNoise. We pick a nonterminal symbol, α, in the prototype string, choose a grammar rule, α → β, and rewrite α with β. We repeat this rewriting process until the prototype string contains no more nonterminals, at which point it consists entirely of words, or terminal symbols, and is a sentence in the language.

At each point in this derivation process, the string is a collection of terminal or nonterminal symbols. Such a string is called a sentential form if it occurs in some step of a valid derivation.
Any sentential form can be derived from the start symbol in zero or more steps. Similarly, from any sentential form we can derive a valid sentence in zero or more steps. Thus, if we begin with SheepNoise and apply successive rewrites using the two rules, at each step in the process the string is a sentential form.
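The rewriting process can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than a parser: the dictionary encoding of SN and the names `SN`, `GOAL`, `derive`, and `is_sentence` are hypothetical, and the sketch always rewrites the leftmost nonterminal, whereas the text allows any nonterminal to be picked (SN never has more than one in a sentential form, so the choice does not matter here).

```python
# Grammar SN: each nonterminal maps to its list of right-hand sides.
# Rule 0: SheepNoise -> baa SheepNoise
# Rule 1: SheepNoise -> baa
SN = {
    "SheepNoise": [["baa", "SheepNoise"], ["baa"]],
}
GOAL = "SheepNoise"


def derive(grammar, goal, choices):
    """Apply one rewrite per entry in `choices` and return every
    sentential form produced, starting with the goal symbol itself."""
    form = [goal]
    forms = [list(form)]
    for rule_index in choices:
        # Find the leftmost nonterminal (assumes one is still present).
        pos = next(i for i, sym in enumerate(form) if sym in grammar)
        rhs = grammar[form[pos]][rule_index]
        form = form[:pos] + rhs + form[pos + 1:]
        forms.append(list(form))
    return forms


def is_sentence(grammar, form):
    """A sentential form is a sentence when it holds only terminals."""
    return all(sym not in grammar for sym in form)


if __name__ == "__main__":
    for form in derive(SN, GOAL, [0, 0, 1]):
        print(" ".join(form), "| sentence:", is_sentence(SN, form))
```

Choosing rule 0 twice and then rule 1 produces the sentential forms SheepNoise, baa SheepNoise, baa baa SheepNoise, and baa baa baa; only the last contains no nonterminal.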
When we have reached the point where the string contains only terminal symbols, the string is a sentence in L(SN).

Derivation: a sequence of rewriting steps that begins with the grammar’s start symbol and ends with a sentence in the language

Sentential form: a string of symbols that occurs as one step in a valid derivation

CONTEXT-FREE GRAMMARS

Formally, a context-free grammar G is a quadruple (T, NT, S, P) where:

T is the set of terminal symbols, or words, in the language L(G).