c20-4 (779626), страница 3
Текст из файла (страница 3)
The longer the message, the more precisionrequired of R. This is best illustrated by an example, so let us return to the fictitiouslanguage, Vowellish, of the previous section. Recall that Vowellish has a 5 characteralphabet (A, E, I, O, U), with occurrence probabilities 0.12, 0.42, 0.09, 0.30, and0.07, respectively. Figure 20.5.1 shows how a message beginning “IOU” is encoded:The interval [0, 1) is divided into segments corresponding to the 5 alphabeticalcharacters; the length of a segment is the corresponding pi .
We see that the firstmessage character, “I”, narrows the range of R to 0.37 ≤ R < 0.46. This interval isnow subdivided into five subintervals, again with lengths proportional to the pi ’s. Thesecond message character, “O”, narrows the range of R to 0.3763 ≤ R < 0.4033.The “U” character further narrows the range to 0.37630 ≤ R < 0.37819.
Any valueof R in this range can be sent as encoding “IOU”. In particular, the binary fraction.011000001 is in this range, so “IOU” can be sent in 9 bits. (Huffman coding took10 bits for this example, see §20.4.)Of course there is the problem of knowing when to stop decoding. Thefraction .011000001 represents not simply “IOU,” but “IOU. . . ,” where the ellipsesrepresent an infinite string of successor characters. To resolve this ambiguity,arithmetic coding generally assumes the existence of a special Nch + 1th character,EOM (end of message), which occurs only once at the end of the input.
SinceEOM has a low probability of occurrence, it gets allocated only a very tiny pieceof the number line.In the above example, we gave R as a binary fraction. We could just as wellhave output it in any other radix, e.g., base 94 or base 36, whatever is convenientfor the anticipated storage or communication channel.You might wonder how one deals with the seemingly incredible precisionrequired of R for a long message. The answer is that R is never actually representedall at once. At any give stage we have upper and lower bounds for R representedas a finite number of digits in the output radix.
As digits of the upper and lowerbounds become identical, we can left-shift them away and bring in new digits at thelow-significance end. The routines below have a parameter NWK for the number ofworking digits to keep around. This must be large enough to make the chance ofan accidental degeneracy vanishingly small. (The routines signal if a degeneracyever occurs.) Since the process of discarding old digits and bringing in new ones isperformed identically on encoding and decoding, everything stays synchronized..















