This is not to be confused with the hsync() method defined by the Syncable interface for synchronizing buffers to the underlying device (see "Coherency Model" on page 74).

Sync points come into their own when using sequence files as input to MapReduce, since they permit the files to be split and different portions to be processed independently by separate map tasks (see "SequenceFileInputFormat" on page 236).
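To see sync points in action, here is a minimal sketch (assuming the numbers.seq file with IntWritable keys and Text values used in the surrounding examples) that positions a SequenceFile.Reader at an arbitrary byte offset and calls sync() to advance it to the next record boundary before reading:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileSyncDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path path = new Path("numbers.seq"); // file from the earlier example
    SequenceFile.Reader reader =
        new SequenceFile.Reader(conf, SequenceFile.Reader.file(path));
    try {
      reader.sync(360); // advance to the first sync point after byte 360
      IntWritable key = new IntWritable();
      Text value = new Text();
      while (reader.next(key, value)) { // reading resumes on a record boundary
        System.out.printf("[%s]\t%s\t%s\n", reader.getPosition(), key, value);
      }
    } finally {
      reader.close();
    }
  }
}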
Displaying a SequenceFile with the command-line interface

The hadoop fs command has a -text option to display sequence files in textual form. It looks at a file's magic number so that it can attempt to detect the type of the file and appropriately convert it to text. It can recognize gzipped files, sequence files, and Avro datafiles; otherwise, it assumes the input is plain text.

For sequence files, this command is really useful only if the keys and values have meaningful string representations (as defined by the toString() method). Also, if you have your own key or value classes, you will need to make sure they are on Hadoop's classpath.

Running it on the sequence file we created in the previous section gives the following output:

% hadoop fs -text numbers.seq | head
100     One, two, buckle my shoe
99      Three, four, shut the door
98      Five, six, pick up sticks
97      Seven, eight, lay them straight
96      Nine, ten, a big fat hen
95      One, two, buckle my shoe
94      Three, four, shut the door
93      Five, six, pick up sticks
92      Seven, eight, lay them straight
91      Nine, ten, a big fat hen

Sorting and merging SequenceFiles

The most powerful way of sorting (and merging) one or more sequence files is to use MapReduce. MapReduce is inherently parallel and will let you specify the number of reducers to use, which determines the number of output partitions.
For example, by specifying one reducer, you get a single output file. We can use the sort example that comes with Hadoop by specifying that the input and output are sequence files and by setting the key and value types:

% hadoop jar \
    $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
    sort -r 1 \
    -inFormat org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat \
    -outFormat org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat \
    -outKey org.apache.hadoop.io.IntWritable \
    -outValue org.apache.hadoop.io.Text \
    numbers.seq sorted
% hadoop fs -text sorted/part-r-00000 | head
1       Nine, ten, a big fat hen
2       Seven, eight, lay them straight
3       Five, six, pick up sticks
4       Three, four, shut the door
5       One, two, buckle my shoe
6       Nine, ten, a big fat hen
7       Seven, eight, lay them straight
8       Five, six, pick up sticks
9       Three, four, shut the door
10      One, two, buckle my shoe

Sorting is covered in more detail in "Sorting" on page 255.

An alternative to using MapReduce for sort/merge is the SequenceFile.Sorter class, which has a number of sort() and merge() methods.
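For instance, a single file can be sorted in one process with a Sorter; the following is a rough sketch (the numbers.seq input and the sorted-by-sorter output path are assumed for illustration):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SorterDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // A Sorter is configured with the key and value types of the input file
    SequenceFile.Sorter sorter =
        new SequenceFile.Sorter(fs, IntWritable.class, Text.class, conf);
    // Sorts one sequence file into a new sequence file, in a single process
    sorter.sort(new Path("numbers.seq"), new Path("sorted-by-sorter"));
  }
}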
These functions predate MapReduce and are lower-level functions than MapReduce (for example, to get parallelism, you need to partition your data manually), so in general MapReduce is the preferred approach to sort and merge sequence files.

The SequenceFile format

A sequence file consists of a header followed by one or more records (see Figure 5-2). The first three bytes of a sequence file are the bytes SEQ, which act as a magic number; these are followed by a single byte representing the version number. The header contains other fields, including the names of the key and value classes, compression details, user-defined metadata, and the sync marker.[5] Recall that the sync marker is used to allow a reader to synchronize to a record boundary from any position in the file. Each file has a randomly generated sync marker, whose value is stored in the header.
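The header fields can be inspected programmatically. As a rough sketch (again assuming the numbers.seq file from earlier), a SequenceFile.Reader exposes them through getter methods:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;

public class HeaderDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    SequenceFile.Reader reader = new SequenceFile.Reader(conf,
        SequenceFile.Reader.file(new Path("numbers.seq")));
    try {
      // These values are all read from the file's header
      System.out.println("key class:        " + reader.getKeyClassName());
      System.out.println("value class:      " + reader.getValueClassName());
      System.out.println("compressed:       " + reader.isCompressed());
      System.out.println("block compressed: " + reader.isBlockCompressed());
      System.out.println("metadata:         " + reader.getMetadata());
    } finally {
      reader.close();
    }
  }
}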
Sync markers appear between records in the sequence file. They are designed to incur less than a 1% storage overhead, so they don't necessarily appear between every pair of records (such is the case for short records).

[5] Full details of the format of these fields may be found in SequenceFile's documentation and source code.

Figure 5-2. The internal structure of a sequence file with no compression and with record compression

The internal format of the records depends on whether compression is enabled, and if it is, whether it is record compression or block compression.

If no compression is enabled (the default), each record is made up of the record length (in bytes), the key length, the key, and then the value. The length fields are written as 4-byte integers adhering to the contract of the writeInt() method of java.io.DataOutput. Keys and values are serialized using the Serialization defined for the class being written to the sequence file.

The format for record compression is almost identical to that for no compression, except the value bytes are compressed using the codec defined in the header. Note that keys are not compressed.
Block compression (Figure 5-3) compresses multiple records at once; it is therefore more compact than, and should generally be preferred over, record compression, because it has the opportunity to take advantage of similarities between records. Records are added to a block until it reaches a minimum size in bytes, defined by the io.seqfile.compress.blocksize property; the default is one million bytes. A sync marker is written before the start of every block. The format of a block is a field indicating the number of records in the block, followed by four compressed fields: the key lengths, the keys, the value lengths, and the values.
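Block compression is chosen when the writer is created. As a hedged sketch (the blocks.seq path, the key and value types, and the use of DefaultCodec are illustrative choices, not requirements), a block-compressed writer can be created like this:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.DefaultCodec;

public class BlockCompressedWriterDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Optionally tune the minimum block size (in bytes) before creating the writer
    conf.setInt("io.seqfile.compress.blocksize", 1000000);
    SequenceFile.Writer writer = SequenceFile.createWriter(conf,
        SequenceFile.Writer.file(new Path("blocks.seq")),
        SequenceFile.Writer.keyClass(IntWritable.class),
        SequenceFile.Writer.valueClass(Text.class),
        SequenceFile.Writer.compression(
            SequenceFile.CompressionType.BLOCK, new DefaultCodec()));
    try {
      writer.append(new IntWritable(1), new Text("One, two, buckle my shoe"));
    } finally {
      writer.close();
    }
  }
}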
Figure 5-3. The internal structure of a sequence file with block compression

MapFile

A MapFile is a sorted SequenceFile with an index to permit lookups by key. The index is itself a SequenceFile that contains a fraction of the keys in the map (every 128th key, by default).
The idea is that the index can be loaded into memory to provide fast lookups from the main data file, which is another SequenceFile containing all the map entries in sorted key order.

MapFile offers a very similar interface to SequenceFile for reading and writing. The main thing to be aware of is that when writing using MapFile.Writer, map entries must be added in order, otherwise an IOException will be thrown.
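As a minimal sketch (the demo.map path and the IntWritable/Text types are assumed for illustration), writing a MapFile in key order and looking an entry up by key looks like this:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class MapFileDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    String uri = "demo.map"; // a MapFile is a directory holding "data" and "index"

    MapFile.Writer writer =
        new MapFile.Writer(conf, fs, uri, IntWritable.class, Text.class);
    try {
      // Keys must be appended in sorted order, or an IOException is thrown
      for (int i = 1; i <= 1024; i++) {
        writer.append(new IntWritable(i), new Text("entry " + i));
      }
    } finally {
      writer.close();
    }

    MapFile.Reader reader = new MapFile.Reader(fs, uri, conf);
    try {
      Text value = new Text();
      reader.get(new IntWritable(496), value); // binary search of the in-memory
      System.out.println(value);               // index, then a scan of the data file
    } finally {
      reader.close();
    }
  }
}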
MapFile variants

Hadoop comes with a few variants on the general key-value MapFile interface:

• SetFile is a specialization of MapFile for storing a set of Writable keys. The keys must be added in sorted order.

• ArrayFile is a MapFile where the key is an integer representing the index of the element in the array and the value is a Writable value.

• BloomMapFile is a MapFile that offers a fast version of the get() method, especially for sparsely populated files. The implementation uses a dynamic Bloom filter for testing whether a given key is in the map. The test is very fast because it is in memory, and it has a nonzero probability of false positives. Only if the test passes (the key is present) is the regular get() method called.
Other File Formats and Column-Oriented Formats

While sequence files and map files are the oldest binary file formats in Hadoop, they are not the only ones, and in fact there are better alternatives that should be considered for new projects.

Avro datafiles (covered in "Avro Datafiles" on page 352) are like sequence files in that they are designed for large-scale data processing (they are compact and splittable), but they are portable across different programming languages. Objects stored in Avro datafiles are described by a schema, rather than in the Java code of the implementation of a Writable object (as is the case for sequence files, making them very Java-centric). Avro datafiles are widely supported across components in the Hadoop ecosystem, so they are a good default choice for a binary format.
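For a flavor of the schema-driven approach, here is a rough sketch of writing an Avro datafile with the Avro Java library (the pairs.avro filename and the two-field record schema are invented for illustration):

import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroDataFileDemo {
  public static void main(String[] args) throws Exception {
    // The record structure lives in the schema, not in a Java class
    Schema schema = new Schema.Parser().parse(
        "{\"type\": \"record\", \"name\": \"Pair\", \"fields\": ["
        + "{\"name\": \"key\", \"type\": \"int\"},"
        + "{\"name\": \"value\", \"type\": \"string\"}]}");

    GenericRecord record = new GenericData.Record(schema);
    record.put("key", 100);
    record.put("value", "One, two, buckle my shoe");

    DataFileWriter<GenericRecord> writer =
        new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema));
    writer.create(schema, new File("pairs.avro")); // the schema is embedded in the file
    writer.append(record);
    writer.close();
  }
}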
Sequence files, map files, and Avro datafiles are all row-oriented file formats, which means that the values for each row are stored contiguously in the file. In a column-oriented format, the rows in a file (or, equivalently, a table in Hive) are broken up into row splits, then each split is stored in column-oriented fashion: the values for each row in the first column are stored first, followed by the values for each row in the second column, and so on. This is shown diagrammatically in Figure 5-4.

A column-oriented layout permits columns that are not accessed in a query to be skipped. Consider a query of the table in Figure 5-4 that processes only column 2. With row-oriented storage, like a sequence file, the whole row (stored in a sequence file record) is loaded into memory, even though only the second column is actually read. Lazy deserialization saves some processing cycles by deserializing only the column fields that are accessed, but it can't avoid the cost of reading each row's bytes from disk.

With column-oriented storage, only the column 2 parts of the file (highlighted in the figure) need to be read into memory.
In general, column-oriented formats work well when queries access only a small number of columns in the table. Conversely, row-oriented formats are appropriate when a large number of columns of a single row are needed for processing at the same time.

Figure 5-4. Row-oriented versus column-oriented storage

Column-oriented formats need more memory for reading and writing, since they have to buffer a row split in memory, rather than just a single row. Also, it's not usually possible to control when writes occur (via flush or sync operations), so column-oriented formats are not suited to streaming writes, as the current file cannot be recovered if the writer process fails. On the other hand, row-oriented formats like sequence files and Avro datafiles can be read up to the last sync point after a writer failure.