Cooper_Engineering_a_Compiler(Second Edition) (1157546), страница 65
Текст из файла (страница 65)
It must discover and storemany distinct kinds of information. It encounters a wide variety of names—variables, defined constants, procedures, functions, labels, structures, andfiles. As discussed in the previous section, the compiler also generates manynames. For a variable, it needs a data type, its storage class, the name andlexical level of its declaring procedure, and a base address and offset inmemory. For an array, the compiler also needs the number of dimensionsand the upper and lower bounds for each dimension.
For records or structures, it needs a list of the fields, along with the relevant information foreach field. For functions and procedures, it needs the number of parametersand their types, as well as the types of any returned values; a more sophisticated translation might record information about what variables a procedurecan reference or modify.The compiler must either record this information in the ir or re-derive it ondemand. For the sake of efficiency, most compilers record facts rather thanrecompute them. These facts can be recorded directly in the ir.
For example, a compiler that builds an ast might record information about variablesas annotations (or attributes) of the node representing each variable’s declaration. The advantage of this approach is that it uses a single representationfor the code being compiled. It provides a uniform access method and asingle implementation. The disadvantage of this approach is that the singleaccess method may be inefficient—navigating the ast to find the appropriatedeclaration has its own costs. To eliminate this inefficiency, the compilercan thread the ir so that each reference has a link back to the correspondingdeclaration. This adds space to the ir and overhead to the ir builder.The alternative, as we saw in Chapter 4, is to create a central repository forthese facts and provide efficient access to it. This central repository, calledWhen the compiler writes the IR to disk, it may becheaper to recompute facts than to write themand then read them.254 CHAPTER 5 Intermediate Representationsa symbol table, becomes an integral part of the compiler’s ir.
The symbol table localizes information derived from potentially distant parts of thesource code. It makes such information easily and efficiently available, and itsimplifies the design and implementation of any code that must refer to information about variables derived earlier in compilation. It avoids the expenseof searching the ir to find the portion that represents a variable’s declaration;using a symbol table often eliminates the need to represent the declarationsdirectly in the ir.
(An exception occurs in source-to-source translation. Thecompiler may build a symbol table for efficiency and preserve the declaration syntax in the ir so that it can produce an output program that closelyresembles the input program.) It eliminates the overhead of making eachreference contain a pointer to the declaration. It replaces both of these witha computed mapping from the textual name to the stored information. Thus,in some sense, the symbol table is simply an efficiency trick.At many places in this text, we refer to “the symbol table.” As we shall see inSection 5.5.4, the compiler may include several distinct, specialized symboltables.
A careful implementation might use the same access methods for allthese tables.bSymbol-table implementation requires attention to detail. Because nearlyevery aspect of translation refers to the symbol table, efficiency of accessis critical. Because the compiler cannot predict, before translation, the number of names that it will encounter, expanding the symbol table must beboth graceful and efficient. This section provides a high-level treatment ofthe issues that arise in designing a symbol table.
It presents the compilerspecific aspects of symbol-table design and use. For deeper implementationdetails and design alternatives, see Section B.4 in Appendix B.a5.5.1 Hash Tables012h(d)3456789cA compiler accesses its symbol table frequently. Thus, efficiency is a keyissue in the design of a symbol table. Because hash tables provide constanttime expected-case lookups, they are the method of choice for implementingsymbol tables. Hash tables are conceptually elegant. They use a hash function, h, to map names to small integers, and use the small integer to indexthe table.
With a hashed symbol table, the compiler stores all the informationthat it derives about the name n in the table in slot h(n). The figure in themargin shows a simple ten-slot hash table. It is a vector of records, eachrecord holding the compiler-generated description of a single name. Thenames a, b, and c have already been inserted. The name d is being inserted,at h(d) = 2.5.5 Symbol Tables 255The primary reason to use hash tables is to provide a constant-time expectedcase lookup keyed by a textual name.
To achieve this, h must be inexpensiveto compute. Given an appropriate function h, accessing the record for nrequires computing h(n) and indexing into the table at h(n). If h maps twoor more symbols to the same small integer, a “collision” occurs. (In themarginal figure, this would occur if h(d) = 3.) The implementation musthandle this situation gracefully, preserving both the information and thelookup time. In this section, we assume that h is a perfect hash function, thatis, it never produces a collision. Furthermore, we assume that the compilerknows, in advance, how large to make the table.
Appendix B.4 describeshash-table implementation in more detail, including hash functions, collisionhandling, and schemes for expanding a hash table.Hash tables can be used as an efficient representation for sparse graphs.Given two nodes, x and y, an entry for the key xy indicates that an edge (x, y)exists. (This scheme requires a hash function that generates a good distribution from a pair of small integers; both the multiplicative and universal hashfunctions described in Appendix B.4.1 work well.) A well-implementedhash table can provide fast insertion and a fast test for the presence of aspecific edge.
Additional information is required to answer questions suchas “What nodes are adjacent to x?”5.5.2 Building a Symbol TableThe symbol table defines two interface routines for the rest of the compiler.1. LookUp(name) returns the record stored in the table at h(name) if oneexists. Otherwise, it returns a value indicating that name was notfound.2. Insert(name,record) stores the information in record in the table ath(name). It may expand the table to accommodate the record for name.The compiler can use separate functions for LookUp and Insert, or they canbe combined by passing LookUp a flag that specifies whether or not to insertthe name. This ensures, for example, that a LookUp of an undeclared variablewill fail—a property useful for detecting a violation of the declare-beforeuse rule in syntax-directed translation schemes or for supporting nestedlexical scopes.This simple interface fits directly into the ad hoc syntax-directed translation schemes described in Chapter 4.
In processing declaration syntax, thecompiler builds up a set of attributes for each variable. When the parser recognizes a production that declares some variable, it can enter the name and256 CHAPTER 5 Intermediate RepresentationsAN ALTERNATIVE TO HASHINGHashing is the method most widely used to organize a compiler’s symboltable. Multiset discrimination is an interesting alternative that eliminatesany possibility of worst-case behavior. The critical insight behind multisetdiscrimination is that the index can be constructed offline in the scanner.To use multiset discrimination, the compiler writer must take a differentapproach to scanning.
Instead of processing the input incrementally, thecompiler scans the entire program to find the complete set of identifiers. Asit discovers each identifier, it creates a tuple hname,positioni, where nameis the text of the identifier and position is its ordinal position in the list ofclassified words, or tokens. It enters all the tuples into a large set.The next step sorts the set lexicographically. In effect, this creates a setof subsets, one per identifier.
Each of these subsets holds the tuples forall the occurrences of its identifier. Since each tuple refers to a specifictoken, through its position value, the compiler can use the sorted set tomodify the token stream. The compiler makes a linear scan over the set,processing each subset. It allocates a symbol-table index for the entiresubset, then rewrites the tokens to include that index. This augments theidentifier tokens with their symbol-table indices. If the compiler needs atextual lookup function, the resulting table is ordered alphabetically for abinary search.The price for using this technique is an extra pass over the token stream,along with the cost of the lexicographic sort.
The advantages, from a complexity perspective, are that it avoids any possibility of hashing’s worst-casebehavior and that it makes the initial size of the symbol table obvious, evenbefore parsing. This technique can be used to replace a hash table in almostany application in which an offline solution will work.attributes into the symbol table using Insert. If a variable name can appearin only one declaration, the parser can call LookUp first to detect a repeateduse of the name.
When the parser encounters a variable name outside thedeclaration syntax, it uses LookUp to obtain the appropriate information fromthe symbol table. LookUp fails on any undeclared name. The compiler writer,of course, may need to add functions to initialize the table, to store it to andretrieve it from external media, and to finalize it. For a language with a singlename space, this interface suffices.5.5.3 Handling Nested ScopesFew programming languages provide a single unified name space. Mostlanguages allow a program to declare names at multiple levels.
Each of these5.5 Symbol Tables 257levels has a scope, or a region in the program’s text where the name can beused. Each of these levels has a lifetime, or a period at runtime where thevalue is preserved.If the source language allows scopes to be nested one inside another, then thefront end needs a mechanism to translate a reference, such as x, to the properscope and lifetime. The primary mechanism that compilers use to performthis translation is a scoped symbol table.For the purposes of this discussion, assume that a program can create anarbitrary number of scopes nested one within another.
We will defer anin-depth discussion of lexical scoping until Section 6.3.1; however, mostprogrammers have enough experience with the concept for this discussion.Figure 5.10 shows a c program that creates five distinct scopes. We willlabel the scopes with numbers that indicate the nesting relationships amongthem. The level 0 scope is the outermost scope, while the level 3 scope is theinnermost one.The table on the right side of the figure shows the names declared in eachscope. The declaration of b at level 2a hides the level 1 declaration fromany code inside the block that creates level 2a.