K. Cooper, L. Torczon - Engineering a Compiler (2011 - 2nd edition) (798440), страница 28
Текст из файла (страница 28)
We pick a nonterminal symbol, α, in the prototypestring, choose a grammar rule, α → β, and rewrite α with β. We repeat thisrewriting process until the prototype string contains no more nonterminals,at which point it consists entirely of words, or terminal symbols, and is asentence in the language.At each point in this derivation process, the string is a collection of terminalor nonterminal symbols. Such a string is called a sentential form if it occursin some step of a valid derivation. Any sentential form can be derived fromthe start symbol in zero or more steps. Similarly, from any sentential formwe can derive a valid sentence in zero or more steps.
Thus, if we begin withSheepNoise and apply successive rewrites using the two rules, at each step inthe process the string is a sentential form. When we have reached the pointwhere the string contains only terminal symbols, the string is a sentencein L(SN).Derivationa sequence of rewriting steps that begins withthe grammar’s start symbol and ends with asentence in the languageSentential forma string of symbols that occurs as one step in avalid derivation88 CHAPTER 3 ParsersCONTEXT-FREE GRAMMARSFormally, a context-free grammar G is a quadruple (T, NT, S, P) where:Tis the set of terminal symbols, or words, in the language L(G). Terminal symbols correspond to syntactic categories returned by thescanner.NT is the set of nonterminal symbols that appear in the productionsof G. Nonterminals are syntactic variables introduced to provideabstraction and structure in the productions.S is a nonterminal designated as the goal symbol or start symbol ofthe grammar.
S represents the set of sentences in L(G).P is the set of productions or rewrite rules in G. Each rule in P has theform NT → (T ∪ NT)+ ; that is, it replaces a single nonterminal witha string of one or more grammar symbols.The sets T and NT can be derived directly from the set of productions, P.The start symbol may be unambiguous, as in the SheepNoise grammar, orit may not be obvious, as in the following grammar:Paren → ( Bracket )| ()Bracket → [ Paren ]| []In this case, the choice of start symbol determines the shape of the outerbrackets.
Using Paren as S ensures that every sentence has an outermostpair of parentheses, while using Bracket forces an outermost pair of squarebrackets. To allow either, we would need to introduce a new symbol Startand the productions Start→Paren | Bracket.Some tools that manipulate grammars require that S not appear on theright-hand side of any production, which makes S easy to discover.To derive a sentence in SN, we start with the string that consists of one symbol, SheepNoise.
We can rewrite SheepNoise with either rule 1 or rule 2. Ifwe rewrite SheepNoise with rule 2, the string becomes baa and has no furtheropportunities for rewriting. The rewrite shows that baa is a valid sentencein L(SN). The other choice, rewriting the initial string with rule 1, leads toa string with two symbols: baa SheepNoise. This string has one remainingnonterminal; rewriting it with rule 2 leads to the string baa baa, which is asentence in L(SN). We can represent these derivations in tabular form:RuleSentential FormRuleSheepNoise2baaRewrite with Rule 212Sentential FormSheepNoisebaa SheepNoisebaa baaRewrite with Rules 1 Then 23.2 Expressing Syntax 89As a notational convenience, we will use →+ to mean “derives in one ormore steps.” Thus, SheepNoise →+ baa and SheepNoise →+ baa baa.Rule 1 lengthens the string while rule 2 eliminates the nonterminal SheepNoise.
(The string can never contain more than one instance of SheepNoise.)All valid strings in SN are derived by zero or more applications of rule 1,followed by rule 2. Applying rule 1 k times followed by rule 2 generates astring with k + 1 baas.3.2.3 More Complex ExamplesThe SheepNoise grammar is too simple to exhibit the power and complexityof cfgs.
Instead, let’s revisit the example that showed the shortcomings ofres: the language of expressions with parentheses.1 Expr →2|3|4 Op →5|6|7|( Expr )Expr Op namename+×÷Beginning with the start symbol, Expr, we can generate two kinds of subterms: parenthesized subterms, with rule 1, or plain subterms, with rule 2.To generate the sentence “(a + b) × c”, we can use the following rewritesequence (2,6,1,2,4,3), shown on the left. Remember that the grammardeals with syntactic categories, such as name rather than lexemes such asa, b, or c.ExprRule261243Sentential FormExprExpr Op nameExpr × name( Expr ) × name( Expr Op name ) × name( Expr + name ) × name( name + name ) × nameRightmost Derivation of ( a + b ) × cExpr(Expr-ExprOp<name,a>+Op)-<name,c>×<name,b>Corresponding Parse TreeThe tree on the right, called a parse tree, represents the derivation as agraph.Parse tree or syntax treea graph that represents a derivation90 CHAPTER 3 ParsersThis simple cfg for expressions cannot generate a sentence with unbalancedor improperly nested parentheses.
Only rule 1 can generate an open parenthesis; it also generates the matching close parenthesis. Thus, it cannotgenerate strings such as “a + ( b × c” or “a + b ) × c),” and a parser built fromthe grammar will not accept the such strings. (The best re in Section 3.2.1matched both of these strings.) Clearly, cfgs provide us with the ability tospecify constructs that res do not.The derivation of (a + b) × c rewrote, at each step, the rightmost remainingnonterminal symbol. This systematic behavior was a choice; other choicesare possible.
One obvious alternative is to rewrite the leftmost nonterminalat each step. Using leftmost choices would produce a different derivation sequence for the same sentence. The leftmost derivation of (a + b) × cwould be:Rightmost derivationa derivation that rewrites, at each step, therightmost nonterminalLeftmost derivationa derivation that rewrites, at each step, theleftmost nonterminalExprRule212346Sentential FormExprExpr Op name( Expr ) Op name( Expr Op name ) Op name( name Op name ) Op name( name + name ) Op name( name + name ) × nameLeftmost Derivation of ( a + b ) x cExpr(Expr-ExprOp<name,a>+Op)-<name,c>×<name,b>Corresponding Parse TreeThe leftmost and rightmost derivations use the same set of rules; they applythose rules in a different order.
Because a parse tree represents the rulesapplied, but not the order of their application, the parse trees for the twoderivations are identical.AmbiguityA grammar G is ambiguous if some sentence inL(G) has more than one rightmost (or leftmost)derivation.From the compiler’s perspective, it is important that each sentence in thelanguage defined by a cfg has a unique rightmost (or leftmost) derivation.If multiple rightmost (or leftmost) derivations exist for some sentence, then,at some point in the derivation, multiple distinct rewrites of the rightmost(or leftmost) nonterminal lead to the same sentence. A grammar in whichmultiple rightmost (or leftmost) derivations exist for a sentence is called anambiguous grammar.
An ambiguous grammar can produce multiple derivations and multiple parse trees. Since later stages of translation will associatemeaning with the detailed shape of the parse tree, multiple parse trees implymultiple possible meanings for a single program—a bad property for a programming language to have. If the compiler cannot be sure of the meaningof a sentence, it cannot translate it into a definitive code sequence.3.2 Expressing Syntax 91The classic example of an ambiguous construct in the grammar for a programming language is the if-then-else construct of many Algol-likelanguages. The straightforward grammar for if-then-else might be1 Statement →2|3|4|if Expr then Statement else Statementif Expr then StatementAssignment.
. . other statements . . .This fragment shows that the else is optional. Unfortunately, the codefragmentif Expr1 then if Expr2 then Assignment1 else Assignment2has two distinct rightmost derivations. The difference between them issimple. The first derivation has Assignment2 controlled by the innerif, so Assignment2 executes when Expr1 is true and Expr2 is false:StatementifExpr1StatementthenExpr2 thenifStatementelseAssignment1StatementAssignment2The second derivation associates the else clause with the first if, so thatAssignment2 executes when Expr1 is false, independent of the value ofExpr2 :StatementifExpr1thenifStatementelseStatementExpr2 thenStatementAssignment2Assignment1Clearly, these two derivations produce different behaviors in the compiledcode.92 CHAPTER 3 ParsersTo remove this ambiguity, the grammar must be modified to encode arule that determines which if controls an else.
To fix the if-then-elsegrammar, we can rewrite it as1 Statement →2|3|if Expr then Statement4 WithElse5if Expr then WithElse else WithElse→|if Expr then WithElse else StatementAssignmentAssignmentThe solution restricts the set of statements that can occur in the then partof an if-then-else construct. It accepts the same set of sentences as theoriginal grammar, but ensures that each else has an unambiguous match toa specific if.
It encodes into the grammar a simple rule—bind each elseto the innermost unclosed if. It has only one rightmost derivation for theexample.RuleSentential Form1235Statementif Expr then Statementif Expr then if Expr then WithElse else Statementif Expr then if Expr then WithElse else Assignmentif Expr then if Expr then Assignment else AssignmentThe rewritten grammar eliminates the ambiguity.The if-then-else ambiguity arises from a shortcoming in the originalgrammar. The solution resolves the ambiguity in a way by imposing arule that is easy for the programmer to remember. (To avoid the ambiguityentirely, some language designers have restructured the if-then-else construct by introducing elseif and endif.) In Section 3.5.3, we will look atother kinds of ambiguity and systematic ways of handling them.3.2.4 Encoding Meaning into StructureThe if-then-else ambiguity points out the relationship between meaning and grammatical structure.
However, ambiguity is not the only situationwhere meaning and grammatical structure interact. Consider the parse treethat would be built from a rightmost derivation of the simple expressiona + b x c.3.2 Expressing Syntax 93Rule26243ExprSentential FormExprExprExprExprExprOpExprOp namex nameExprOp<name,a>+Op name x name+ name x namename + name x nameDerivation of a + b x c<name,b><name,c>×Corresponding Parse TreeOne natural way to evaluate the expression is with a simple postorder treewalk.
It would first compute a + b and then multiply that result by c toproduce the result (a + b) x c. This evaluation order contradicts the classicrules of algebraic precedence, which would evaluate it as a + (b x c). Sincethe ultimate goal of parsing the expression is to produce code that will implement it, the expression grammar should have the property that it builds a treewhose “natural” treewalk evaluation produces the correct result.The real problem lies in the structure of the grammar.