How does the professor's modification affect the running time for successful searches, unsuccessful searches, insertions, and deletions?

Exercises 11.2-4

Suggest how storage for elements can be allocated and deallocated within the hash table itself by linking all unused slots into a free list. Assume that one slot can store a flag and either one element plus a pointer or two pointers. All dictionary and free-list operations should run in O(1) expected time. Does the free list need to be doubly linked, or does a singly linked free list suffice?

Exercises 11.2-5

Show that if |U| > nm, there is a subset of U of size n consisting of keys that all hash to the same slot, so that the worst-case searching time for hashing with chaining is Θ(n).

11.3 Hash functions

In this section, we discuss some issues regarding the design of good hash functions and then present three schemes for their creation.
Two of the schemes, hashing by division and hashing by multiplication, are heuristic in nature, whereas the third scheme, universal hashing, uses randomization to provide provably good performance.

What makes a good hash function?

A good hash function satisfies (approximately) the assumption of simple uniform hashing: each key is equally likely to hash to any of the m slots, independently of where any other key has hashed to.
Unfortunately, it is typically not possible to check this condition, since one rarely knows the probability distribution according to which the keys are drawn, and the keys may not be drawn independently.

Occasionally we do know the distribution. For example, if the keys are known to be random real numbers k independently and uniformly distributed in the range 0 ≤ k < 1, the hash function

h(k) = ⌊km⌋

satisfies the condition of simple uniform hashing.
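For keys of this kind, the hash function is a single line of code. A minimal sketch in C, assuming the keys arrive as doubles in [0, 1) and the table size m is passed in (names are illustrative):

    #include <stddef.h>

    /* Hash a key k known to be uniformly distributed in [0, 1).
     * The cast truncates, which equals ⌊k*m⌋ since k*m >= 0, so each
     * of the m slots is hit with probability 1/m. */
    size_t hash_uniform_real(double k, size_t m)
    {
        return (size_t)(k * m);
    }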
In practice, heuristic techniques can often be used to create a hash function that performs well. Qualitative information about the distribution of keys may be useful in this design process. For example, consider a compiler's symbol table, in which the keys are character strings representing identifiers in a program. Closely related symbols, such as pt and pts, often occur in the same program. A good hash function would minimize the chance that such variants hash to the same slot.

A good approach is to derive the hash value in a way that is expected to be independent of any patterns that might exist in the data.
For example, the "division method" (discussed in Section11.3.1) computes the hash value as the remainder when the key is divided by a specifiedprime number. This method frequently gives good results, assuming that the prime number ischosen to be unrelated to any patterns in the distribution of keys.Finally, we note that some applications of hash functions might require stronger propertiesthan are provided by simple uniform hashing. For example, we might want keys that are"close" in some sense to yield hash values that are far apart. (This property is especiallydesirable when we are using linear probing, defined in Section 11.4.) Universal hashing,described in Section 11.3.3, often provides the desired properties.Interpreting keys as natural numbersMost hash functions assume that the universe of keys is the set N = {0, 1, 2, ...} of naturalnumbers.
Thus, if the keys are not natural numbers, a way is found to interpret them as natural numbers. For example, a character string can be interpreted as an integer expressed in suitable radix notation. Thus, the identifier pt might be interpreted as the pair of decimal integers (112, 116), since p = 112 and t = 116 in the ASCII character set; then, expressed as a radix-128 integer, pt becomes (112 · 128) + 116 = 14452. It is usually straightforward in an application to devise some such method for interpreting each key as a (possibly large) natural number. In what follows, we assume that the keys are natural numbers.
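A sketch of this interpretation in C (the function name is illustrative). Note that the value overflows a 64-bit word for strings longer than nine characters, which is why practical code reduces modulo m as it goes, as in the division-method sketch below:

    #include <stdint.h>

    /* Interpret an ASCII string as a radix-128 natural number.
     * For "pt": 112*128 + 116 = 14452, matching the text. */
    uint64_t string_as_number(const char *s)
    {
        uint64_t x = 0;
        while (*s != '\0')
            x = x * 128 + (uint64_t)(unsigned char)*s++;
        return x;
    }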
11.3.1 The division method

In the division method for creating hash functions, we map a key k into one of m slots by taking the remainder of k divided by m. That is, the hash function is

h(k) = k mod m.

For example, if the hash table has size m = 12 and the key is k = 100, then h(k) = 4. Since it requires only a single division operation, hashing by division is quite fast.

When using the division method, we usually avoid certain values of m. For example, m should not be a power of 2, since if m = 2^p, then h(k) is just the p lowest-order bits of k. Unless it is known that all low-order p-bit patterns are equally likely, it is better to make the hash function depend on all the bits of the key. As Exercise 11.3-3 asks you to show, choosing m = 2^p - 1 when k is a character string interpreted in radix 2^p may be a poor choice, because permuting the characters of k does not change its hash value.

A prime not too close to an exact power of 2 is often a good choice for m.
For example, suppose we wish to allocate a hash table, with collisions resolved by chaining, to hold roughly n = 2000 character strings, where a character has 8 bits. We don't mind examining an average of 3 elements in an unsuccessful search, so we allocate a hash table of size m = 701. The number 701 is chosen because it is a prime near 2000/3 but not near any power of 2. Treating each key k as an integer, our hash function would be

h(k) = k mod 701.
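A minimal sketch of this function in C, combining the division method with the radix-128 interpretation above. Reducing modulo 701 after each character keeps the running value within one machine word; the text does not spell out this step, but it computes the same result:

    #include <stdint.h>

    #define M 701   /* prime near 2000/3, not near a power of 2 */

    /* Division-method hash h(k) = k mod M, where the key is a
     * character string treated as a radix-128 integer. */
    uint64_t hash_division(const char *s)
    {
        uint64_t h = 0;
        while (*s != '\0')
            h = (h * 128 + (uint64_t)(unsigned char)*s++) % M;
        return h;
    }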
11.3.2 The multiplication method

The multiplication method for creating hash functions operates in two steps. First, we multiply the key k by a constant A in the range 0 < A < 1 and extract the fractional part of kA. Then, we multiply this value by m and take the floor of the result. In short, the hash function is

h(k) = ⌊m(kA mod 1)⌋,

where "kA mod 1" means the fractional part of kA, that is, kA - ⌊kA⌋.

An advantage of the multiplication method is that the value of m is not critical. We typically choose it to be a power of 2 (m = 2^p for some integer p), since we can then easily implement the function on most computers as follows. Suppose that the word size of the machine is w bits and that k fits into a single word.
We restrict A to be a fraction of the form s/2^w, where s is an integer in the range 0 < s < 2^w. Referring to Figure 11.4, we first multiply k by the w-bit integer s = A · 2^w. The result is a 2w-bit value r1 · 2^w + r0, where r1 is the high-order word of the product and r0 is the low-order word of the product. The desired p-bit hash value consists of the p most significant bits of r0.

Figure 11.4: The multiplication method of hashing. The w-bit representation of the key k is multiplied by the w-bit value s = A · 2^w.
The p highest-order bits of the lower w-bit half of the product form the desired hash value h(k).

Although this method works with any value of the constant A, it works better with some values than with others. The optimal choice depends on the characteristics of the data being hashed. Knuth [185] suggests that

A ≈ (√5 - 1)/2 = 0.6180339887...     (11.2)

is likely to work reasonably well.

As an example, suppose we have k = 123456, p = 14, m = 2^14 = 16384, and w = 32.
Adapting Knuth's suggestion, we choose A to be the fraction of the form s/2^32 that is closest to (√5 - 1)/2, so that A = 2654435769/2^32. Then k · s = 327706022297664 = (76300 · 2^32) + 17612864, and so r1 = 76300 and r0 = 17612864. The 14 most significant bits of r0 yield the value h(k) = 67.
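A sketch of this computation in C for w = 32 and p = 14 (function and constant names are illustrative; shifting r0 right by w - p bits extracts its p most significant bits, exactly as in Figure 11.4):

    #include <stdint.h>
    #include <stdio.h>

    #define P 14                  /* m = 2^14 = 16384 slots */
    #define S 2654435769u         /* s = A * 2^32, A ≈ (√5 - 1)/2 */

    /* Multiplication-method hash for word size w = 32: form the
     * 64-bit product k*s = r1*2^32 + r0, keep the low-order word r0,
     * and return its P most significant bits. */
    uint32_t hash_multiplication(uint32_t k)
    {
        uint64_t product = (uint64_t)k * S;
        uint32_t r0 = (uint32_t)product;   /* low-order word */
        return r0 >> (32 - P);             /* p high bits of r0 */
    }

    int main(void)
    {
        printf("%u\n", hash_multiplication(123456));   /* prints 67 */
        return 0;
    }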
11.3.3 Universal hashing

If a malicious adversary chooses the keys to be hashed by some fixed hash function, then he can choose n keys that all hash to the same slot, yielding an average retrieval time of Θ(n). Any fixed hash function is vulnerable to such terrible worst-case behavior; the only effective way to improve the situation is to choose the hash function randomly in a way that is independent of the keys that are actually going to be stored. This approach, called universal hashing, can yield provably good performance on average, no matter what keys are chosen by the adversary.

The main idea behind universal hashing is to select the hash function at random from a carefully designed class of functions at the beginning of execution. As in the case of quicksort, randomization guarantees that no single input will always evoke worst-case behavior. Because of the randomization, the algorithm can behave differently on each execution, even for the same input, guaranteeing good average-case performance for any input. Returning to the example of a compiler's symbol table, we find that the programmer's choice of identifiers cannot now cause consistently poor hashing performance.
Poor performance occurs only when the compiler chooses a random hash function that causes the set of identifiers to hash poorly, but the probability of this situation occurring is small and is the same for any set of identifiers of the same size.

Let ℋ be a finite collection of hash functions that map a given universe U of keys into the range {0, 1, ..., m - 1}. Such a collection is said to be universal if for each pair of distinct keys k, l ∈ U, the number of hash functions h ∈ ℋ for which h(k) = h(l) is at most |ℋ|/m. In other words, with a hash function randomly chosen from ℋ, the chance of a collision between distinct keys k and l is no more than the chance 1/m of a collision if h(k) and h(l) were randomly and independently chosen from the set {0, 1, ..., m - 1}.
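This excerpt does not define a concrete universal family; one standard choice, developed later in this section of CLRS, is h_{a,b}(k) = ((ak + b) mod p) mod m, where p is a prime larger than every key and a, b are drawn at random when the table is created. A sketch under those assumptions, with illustrative names and rand() standing in for a proper source of randomness:

    #include <stdint.h>
    #include <stdlib.h>

    #define PRIME 2147483647ull   /* 2^31 - 1; assumes all keys < PRIME */

    typedef struct { uint64_t a, b, m; } UnivHash;

    /* Draw one member of the family h_{a,b}(k) = ((a*k + b) mod p) mod m
     * at random; a must be nonzero. rand() is a placeholder for a
     * stronger random source. */
    UnivHash univ_hash_new(uint64_t m)
    {
        UnivHash h;
        h.a = 1 + (uint64_t)rand() % (PRIME - 1);
        h.b = (uint64_t)rand() % PRIME;
        h.m = m;
        return h;
    }

    uint64_t univ_hash(const UnivHash *h, uint64_t k)
    {
        return ((h->a * k + h->b) % PRIME) % h->m;   /* a, k < 2^31, so no overflow */
    }

The function is drawn once, when the table is created; every subsequent SEARCH, INSERT, and DELETE on that table uses the same h.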
The following theorem shows that a universal class of hash functions gives good average-case behavior. Recall that n_i denotes the length of list T[i].

Theorem 11.3

Suppose that a hash function h is chosen from a universal collection of hash functions and is used to hash n keys into a table T of size m, using chaining to resolve collisions. If key k is not in the table, then the expected length E[n_{h(k)}] of the list that key k hashes to is at most α.
If key k is in the table, then the expected length E[n_{h(k)}] of the list containing key k is at most 1 + α.

Proof We note that the expectations here are over the choice of the hash function, and do not depend on any assumptions about the distribution of the keys. For each pair k and l of distinct keys, define the indicator random variable X_{kl} = I{h(k) = h(l)}. Since, by definition, a single pair of keys collides with probability at most 1/m, we have Pr{h(k) = h(l)} ≤ 1/m, and so Lemma 5.1 implies that E[X_{kl}] ≤ 1/m.

Next we define, for each key k, the random variable Y_k that equals the number of keys other than k that hash to the same slot as k, so that

Y_k = Σ_{l ∈ T, l ≠ k} X_{kl}.

Thus we have

E[Y_k] = Σ_{l ∈ T, l ≠ k} E[X_{kl}] ≤ Σ_{l ∈ T, l ≠ k} 1/m.

The remainder of the proof depends on whether key k is in table T.

• If k ∉ T, then n_{h(k)} = Y_k and |{l : l ∈ T and l ≠ k}| = n.
Thus E[n_{h(k)}] = E[Y_k] ≤ n/m = α.

• If k ∈ T, then because key k appears in list T[h(k)] and the count Y_k does not include key k, we have n_{h(k)} = Y_k + 1 and |{l : l ∈ T and l ≠ k}| = n - 1. Thus E[n_{h(k)}] = E[Y_k] + 1 ≤ (n - 1)/m + 1 = 1 + α - 1/m < 1 + α.

The following corollary says universal hashing provides the desired payoff: it is now impossible for an adversary to pick a sequence of operations that forces the worst-case running time. By cleverly randomizing the choice of hash function at run time, we guarantee that every sequence of operations can be handled with good expected running time.

Corollary 11.4

Using universal hashing and collision resolution by chaining in a table with m slots, it takes expected time Θ(n) to handle any sequence of n INSERT, SEARCH, and DELETE operations containing O(m) INSERT operations.

Proof Since the number of insertions is O(m), we have n = O(m) and so α = O(1). The INSERT and DELETE operations take constant time and, by Theorem 11.3, the expected time for each SEARCH operation is O(1). By linearity of expectation, the expected time for the entire sequence of operations is O(n); since each operation takes Ω(1) time, the Θ(n) bound follows.