Tom White - Hadoop: The Definitive Guide, 4th edition - 2015 (811394), page 27
When we start using characters that are encoded with more than a single byte, the differences between Text and String become clear. Consider the Unicode characters shown in Table 5-8.[2]

Table 5-8. Unicode characters

Unicode code point  Name                           UTF-8 code units  Java representation
U+0041              LATIN CAPITAL LETTER A         41                \u0041
U+00DF              LATIN SMALL LETTER SHARP S     c3 9f             \u00DF
U+6771              N/A (a unified Han ideograph)  e6 9d b1          \u6771
U+10400             DESERET CAPITAL LETTER LONG I  f0 90 90 80       \uD801\uDC00

All but the last character in the table, U+10400, can be expressed using a single Java char. U+10400 is a supplementary character and is represented by two Java chars, known as a surrogate pair.
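To make Table 5-8 concrete, the following sketch uses only the JDK (no Hadoop classes) to report how many Java chars and how many UTF-8 bytes each of the four code points occupies. The class name CodePointSizes and its helper methods are illustrative names, not something from the book or from Hadoop.

```java
import java.nio.charset.StandardCharsets;

public class CodePointSizes {

  // Number of Java char code units needed for one code point (1 or 2).
  static int charCount(int codePoint) {
    return Character.charCount(codePoint);
  }

  // Number of bytes in the UTF-8 encoding of one code point.
  static int utf8Length(int codePoint) {
    String s = new String(Character.toChars(codePoint));
    return s.getBytes(StandardCharsets.UTF_8).length;
  }

  public static void main(String[] args) {
    int[] codePoints = {0x0041, 0x00DF, 0x6771, 0x10400};
    for (int cp : codePoints) {
      System.out.printf("U+%04X chars=%d utf8Bytes=%d%n",
          cp, charCount(cp), utf8Length(cp));
    }
  }
}
```

Only U+10400 needs two chars, while the UTF-8 lengths grow from one to four bytes across the four characters, matching the code units listed in the table.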
The tests in Example 5-5 show the differences between String and Text when processing a string of the four characters from Table 5-8.

Example 5-5. Tests showing the differences between the String and Text classes

    import static org.hamcrest.CoreMatchers.is;
    import static org.junit.Assert.assertThat;

    import java.io.UnsupportedEncodingException;
    import org.apache.hadoop.io.Text;
    import org.junit.Test;

    public class StringTextComparisonTest {

      @Test
      public void string() throws UnsupportedEncodingException {
        String s = "\u0041\u00DF\u6771\uD801\uDC00";
        // Length and indexes are measured in char code units
        assertThat(s.length(), is(5));
        assertThat(s.getBytes("UTF-8").length, is(10));

        assertThat(s.indexOf("\u0041"), is(0));
        assertThat(s.indexOf("\u00DF"), is(1));
        assertThat(s.indexOf("\u6771"), is(2));
        assertThat(s.indexOf("\uD801\uDC00"), is(3));

        assertThat(s.charAt(0), is('\u0041'));
        assertThat(s.charAt(1), is('\u00DF'));
        assertThat(s.charAt(2), is('\u6771'));
        assertThat(s.charAt(3), is('\uD801'));
        assertThat(s.charAt(4), is('\uDC00'));

        assertThat(s.codePointAt(0), is(0x0041));
        assertThat(s.codePointAt(1), is(0x00DF));
        assertThat(s.codePointAt(2), is(0x6771));
        assertThat(s.codePointAt(3), is(0x10400));
      }

      @Test
      public void text() {
        Text t = new Text("\u0041\u00DF\u6771\uD801\uDC00");
        // Length and indexes are measured in UTF-8 bytes
        assertThat(t.getLength(), is(10));

        assertThat(t.find("\u0041"), is(0));
        assertThat(t.find("\u00DF"), is(1));
        assertThat(t.find("\u6771"), is(3));
        assertThat(t.find("\uD801\uDC00"), is(6));

        assertThat(t.charAt(0), is(0x0041));
        assertThat(t.charAt(1), is(0x00DF));
        assertThat(t.charAt(3), is(0x6771));
        assertThat(t.charAt(6), is(0x10400));
      }
    }

[2] This example is based on one from Norbert Lindenberg and Masayoshi Okutsu’s “Supplementary Characters in the Java Platform,” May 2004.

The test confirms that the length of a String is the number of char code units it contains (five, made up of one from each of the first three characters in the string and a surrogate pair from the last), whereas the length of a Text object is the number of bytes in its UTF-8 encoding (10 = 1+2+3+4).
Similarly, the indexOf() method in String returns an index in char code units, and find() for Text returns a byte offset.

The charAt() method in String returns the char code unit for the given index, which in the case of a surrogate pair will not represent a whole Unicode character. The codePointAt() method, indexed by char code unit, is needed to retrieve a single Unicode character represented as an int. In fact, the charAt() method in Text is more like the codePointAt() method than its namesake in String. The only difference is that it is indexed by byte offset.

Iteration. Iterating over the Unicode characters in Text is complicated by the use of byte offsets for indexing, since you can’t just increment the index.
The idiom for iteration is a little obscure (see Example 5-6): turn the Text object into a java.nio.ByteBuffer, then repeatedly call the bytesToCodePoint() static method on Text with the buffer. This method extracts the next code point as an int and updates the position in the buffer. The end of the string is detected when bytesToCodePoint() returns -1.

Example 5-6. Iterating over the characters in a Text object

    import java.nio.ByteBuffer;
    import org.apache.hadoop.io.Text;

    public class TextIterator {

      public static void main(String[] args) {
        Text t = new Text("\u0041\u00DF\u6771\uD801\uDC00");

        // Wrap only the valid bytes, not the whole backing array
        ByteBuffer buf = ByteBuffer.wrap(t.getBytes(), 0, t.getLength());
        int cp;
        while (buf.hasRemaining() && (cp = Text.bytesToCodePoint(buf)) != -1) {
          System.out.println(Integer.toHexString(cp));
        }
      }
    }

Running the program prints the code points for the four characters in the string:

    % hadoop TextIterator
    41
    df
    6771
    10400

Mutability.
Another difference from String is that Text is mutable (like all Writable implementations in Hadoop, except NullWritable, which is a singleton). You can reuse a Text instance by calling one of the set() methods on it. For example:

    Text t = new Text("hadoop");
    t.set("pig");
    assertThat(t.getLength(), is(3));
    assertThat(t.getBytes().length, is(3));

In some situations, the byte array returned by the getBytes() method may be longer than the length returned by getLength():

    Text t = new Text("hadoop");
    t.set(new Text("pig"));
    assertThat(t.getLength(), is(3));
    assertThat("Byte length not shortened", t.getBytes().length, is(6));

This shows why it is imperative that you always call getLength() when calling getBytes(), so you know how much of the byte array is valid data.

Resorting to String.
Text doesn’t have as rich an API for manipulating strings as java.lang.String, so in many cases, you need to convert the Text object to a String. This is done in the usual way, using the toString() method:

    assertThat(new Text("hadoop").toString(), is("hadoop"));

BytesWritable

BytesWritable is a wrapper for an array of binary data. Its serialized format is a 4-byte integer field that specifies the number of bytes to follow, followed by the bytes themselves. For example, the byte array of length 2 with values 3 and 5 is serialized as a 4-byte integer (00000002) followed by the two bytes from the array (03 and 05):

    BytesWritable b = new BytesWritable(new byte[] { 3, 5 });
    byte[] bytes = serialize(b);
    assertThat(StringUtils.byteToHexString(bytes), is("000000020305"));

BytesWritable is mutable, and its value may be changed by calling its set() method. As with Text, the size of the byte array returned from the getBytes() method for BytesWritable (the capacity) may not reflect the actual size of the data stored in the BytesWritable.
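The length-prefixed format and the gap between capacity and valid data described above can be imitated in plain Java. The sketch below is not Hadoop's implementation: LengthPrefixedBytes and all of its methods are hypothetical names, chosen only to mirror the behavior the text describes using JDK streams.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;

public class LengthPrefixedBytes {
  private byte[] buffer;  // backing array; may be larger than the data
  private int length;     // number of valid bytes at the start of buffer

  public LengthPrefixedBytes(byte[] data) {
    this.buffer = data.clone();
    this.length = data.length;
  }

  public int getLength() { return length; }

  public byte[] getBytes() { return buffer; }

  // Resize the backing array, keeping as many valid bytes as still fit.
  public void setCapacity(int capacity) {
    buffer = Arrays.copyOf(buffer, capacity);
    length = Math.min(length, capacity);
  }

  // Write a 4-byte big-endian length, then only the valid bytes.
  public byte[] serialize() {
    try {
      ByteArrayOutputStream out = new ByteArrayOutputStream();
      DataOutputStream dataOut = new DataOutputStream(out);
      dataOut.writeInt(length);
      dataOut.write(buffer, 0, length);
      dataOut.flush();
      return out.toByteArray();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  // Lowercase hex rendering, two digits per byte.
  static String toHex(byte[] bytes) {
    StringBuilder sb = new StringBuilder();
    for (byte b : bytes) {
      sb.append(String.format("%02x", b));
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    LengthPrefixedBytes b = new LengthPrefixedBytes(new byte[] {3, 5});
    b.setCapacity(11);
    System.out.println(toHex(b.serialize()));
    System.out.println(b.getLength() + " valid of " + b.getBytes().length);
  }
}
```

In this sketch, enlarging the backing array leaves the serialized form unchanged, because only the first getLength() bytes are ever written.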
You can determine the size of the BytesWritable by calling getLength(). To demonstrate:

    b.setCapacity(11);
    assertThat(b.getLength(), is(2));
    assertThat(b.getBytes().length, is(11));

NullWritable

NullWritable is a special type of Writable, as it has a zero-length serialization. No bytes are written to or read from the stream. It is used as a placeholder; for example, in MapReduce, a key or a value can be declared as a NullWritable when you don’t need to use that position, effectively storing a constant empty value. NullWritable can also be useful as a key in a SequenceFile when you want to store a list of values, as opposed to key-value pairs.
It is an immutable singleton, and the instance can be retrieved by calling NullWritable.get().

ObjectWritable and GenericWritable

ObjectWritable is a general-purpose wrapper for the following: Java primitives, String, enum, Writable, null, or arrays of any of these types. It is used in Hadoop RPC to marshal and unmarshal method arguments and return types.

ObjectWritable is useful when a field can be of more than one type.
For example, if the values in a SequenceFile have multiple types, you can declare the value type as an ObjectWritable and wrap each type in an ObjectWritable. Being a general-purpose mechanism, it wastes a fair amount of space because it writes the classname of the wrapped type every time it is serialized.
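The space cost of repeating a class name can be estimated with plain JDK streams. The sketch below does not reproduce ObjectWritable's actual wire format; as a simplifying assumption it writes the class name with DataOutputStream.writeUTF() ahead of an int value, and compares the result with the bare value. ClassNameOverhead and its methods are hypothetical names.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class ClassNameOverhead {

  // Size in bytes of an int written on its own.
  static int bareSize(int value) {
    try {
      ByteArrayOutputStream out = new ByteArrayOutputStream();
      DataOutputStream dataOut = new DataOutputStream(out);
      dataOut.writeInt(value);
      dataOut.flush();
      return out.size();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  // Size in bytes when the class name is written before the value.
  // Assumption: writeUTF() stands in for whatever encoding a real format uses.
  static int taggedSize(String className, int value) {
    try {
      ByteArrayOutputStream out = new ByteArrayOutputStream();
      DataOutputStream dataOut = new DataOutputStream(out);
      dataOut.writeUTF(className);  // 2-byte length prefix + encoded name
      dataOut.writeInt(value);
      dataOut.flush();
      return out.size();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    String name = Integer.class.getName();  // "java.lang.Integer"
    System.out.println("bare:   " + bareSize(163) + " bytes");
    System.out.println("tagged: " + taggedSize(name, 163) + " bytes");
  }
}
```

Under these assumptions the tagged record takes 23 bytes against 4 for the bare int, and the overhead is paid again for every record written.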
In cases where the number of types is small and known ahead of time, this can be improved by having a static array of types and using the index into the array as the serialized reference to the type. This is the approach that GenericWritable takes, and you have to subclass it to specify which types to support.

Writable collections

The org.apache.hadoop.io package includes six Writable collection types: ArrayWritable, ArrayPrimitiveWritable, TwoDArrayWritable, MapWritable, SortedMapWritable, and EnumSetWritable.

ArrayWritable and TwoDArrayWritable are Writable implementations for arrays and two-dimensional arrays (array of arrays) of Writable instances. All the elements of an ArrayWritable or a TwoDArrayWritable must be instances of the same class, which is specified at construction as follows:

    ArrayWritable writable = new ArrayWritable(Text.class);

In contexts where the Writable is defined by type, such as in SequenceFile keys or values or as input to MapReduce in general, you need to subclass ArrayWritable (or TwoDArrayWritable, as appropriate) to set the type statically.
For example:

    public class TextArrayWritable extends ArrayWritable {
      public TextArrayWritable() {
        super(Text.class);
      }
    }

ArrayWritable and TwoDArrayWritable both have get() and set() methods, as well as a toArray() method, which creates a shallow copy of the array (or 2D array).

ArrayPrimitiveWritable is a wrapper for arrays of Java primitives.
The component type is detected when you call set(), so there is no need to subclass to set the type.

MapWritable is an implementation of java.util.Map<Writable, Writable>, and SortedMapWritable is an implementation of java.util.SortedMap<WritableComparable, Writable>. The type of each key and value field is a part of the serialization format for that field. The type is stored as a single byte that acts as an index into an array of types. The array is populated with the standard types in the org.apache.hadoop.io package, but custom Writable types are accommodated, too, by writing a header that encodes the type array for nonstandard types. As they are implemented, MapWritable and SortedMapWritable use positive byte values for custom types, so a maximum of 127 distinct nonstandard Writable classes can be used in any particular MapWritable or SortedMapWritable instance.
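The idea of identifying each type by a single positive byte, and the limit of 127 that follows from it, can be sketched in a few lines of plain Java. This is a simplified illustration rather than the Hadoop code: TypeIdRegistry and its methods are hypothetical, and the sketch ignores the predefined standard types and the serialized header.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TypeIdRegistry {
  // Positive byte values run from 1 to 127.
  private static final int MAX_CUSTOM_TYPES = Byte.MAX_VALUE;

  private final Map<Class<?>, Byte> idsByClass = new HashMap<>();
  private final List<Class<?>> classesById = new ArrayList<>();

  // Return the ID for a class, assigning the next free one if needed.
  public byte register(Class<?> type) {
    Byte existing = idsByClass.get(type);
    if (existing != null) {
      return existing;
    }
    if (classesById.size() >= MAX_CUSTOM_TYPES) {
      throw new IllegalStateException("No positive byte IDs left");
    }
    classesById.add(type);
    byte id = (byte) classesById.size();  // first class gets 1
    idsByClass.put(type, id);
    return id;
  }

  // Map an ID read from a stream back to its class.
  public Class<?> lookup(byte id) {
    return classesById.get(id - 1);
  }

  public static void main(String[] args) {
    TypeIdRegistry registry = new TypeIdRegistry();
    byte first = registry.register(StringBuilder.class);
    byte second = registry.register(java.math.BigInteger.class);
    System.out.println(first + " -> " + registry.lookup(first).getName());
    System.out.println(second + " -> " + registry.lookup(second).getName());
  }
}
```

Each serialized field then needs only one byte to name its type, at the cost of a hard ceiling on how many distinct classes one registry can hold.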
Here’s a demonstration of using a MapWritable with different types for keys and values:

    MapWritable src = new MapWritable();
    src.put(new IntWritable(1), new Text("cat"));
    src.put(new VIntWritable(2), new LongWritable(163));

    MapWritable dest = new MapWritable();
    WritableUtils.cloneInto(dest, src);
    assertThat((Text) dest.get(new IntWritable(1)), is(new Text("cat")));
    assertThat((LongWritable) dest.get(new VIntWritable(2)),
        is(new LongWritable(163)));

Conspicuous by their absence are Writable collection implementations for sets and lists. A general set can be emulated by using a MapWritable (or a SortedMapWritable for a sorted set) with NullWritable values. There is also EnumSetWritable for sets of enum types.
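The map-as-set emulation mentioned above has a direct analogy in plain Java, sketched here without any Hadoop classes. A TreeMap plays the part of SortedMapWritable and a shared placeholder object plays the part of NullWritable; MapBackedSet and its methods are hypothetical names used only for illustration.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

public class MapBackedSet {
  // Single shared placeholder value; the keys carry all the information.
  private static final Object PRESENT = new Object();

  // A sorted map gives a sorted set; a HashMap would give an unsorted one.
  private final Map<String, Object> map = new TreeMap<>();

  // Returns true if the element was not already present.
  public boolean add(String element) {
    return map.put(element, PRESENT) == null;
  }

  public boolean contains(String element) {
    return map.containsKey(element);
  }

  // The set's elements are simply the map's keys.
  public Set<String> elements() {
    return map.keySet();
  }

  public static void main(String[] args) {
    MapBackedSet set = new MapBackedSet();
    set.add("pig");
    set.add("hadoop");
    set.add("pig");  // duplicate, ignored
    System.out.println(set.elements());  // prints [hadoop, pig]
  }
}
```

The placeholder costs nothing to store per entry here because every key maps to the same object, which mirrors the zero-length serialization the text attributes to NullWritable.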