This feature is useful if you have a corrupt file that you want to inspect so you can decide what to do with it. For example, you might want to see whether it can be salvaged before you delete it.

You can find a file's checksum with hadoop fs -checksum. This is useful to check whether two files in HDFS have the same contents, something that distcp does, for example (see "Parallel Copying with distcp" on page 76).

LocalFileSystem

The Hadoop LocalFileSystem performs client-side checksumming. This means that when you write a file called filename, the filesystem client transparently creates a hidden file, .filename.crc, in the same directory containing the checksums for each chunk of the file. The chunk size is controlled by the file.bytes-per-checksum property, which defaults to 512 bytes.
The chunk size is stored as metadata in the .crc file, so the file can be read back correctly even if the setting for the chunk size has changed. Checksums are verified when the file is read, and if an error is detected, LocalFileSystem throws a ChecksumException.

Checksums are fairly cheap to compute (in Java, they are implemented in native code), typically adding a few percent overhead to the time to read or write a file. For most applications, this is an acceptable price to pay for data integrity. It is, however, possible to disable checksums, which is typically done when the underlying filesystem supports checksums natively.
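The checksumming behavior just described can be observed with a minimal sketch (not from the book; the path /tmp/demo.txt and the class name are chosen only for illustration). It writes a file through LocalFileSystem, looks for the hidden .crc file, and reads the data back, which is where a ChecksumException would surface if the file had been corrupted:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.LocalFileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class LocalChecksumDemo {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        LocalFileSystem fs = FileSystem.getLocal(conf);

        // Write a file; the client transparently writes /tmp/.demo.txt.crc as well
        Path file = new Path("/tmp/demo.txt");
        FSDataOutputStream out = fs.create(file);
        out.write("Some data\n".getBytes("UTF-8"));
        out.close();

        // The hidden checksum file follows the .filename.crc naming convention
        Path crc = new Path("/tmp", ".demo.txt.crc");
        System.out.println(crc + " exists? " + fs.exists(crc));

        // Reading back verifies each chunk's checksum; corruption of demo.txt
        // would cause a ChecksumException to be thrown here
        IOUtils.copyBytes(fs.open(file), System.out, 4096, true);
      }
    }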
Disabling checksums is accomplished by using RawLocalFileSystem in place of LocalFileSystem. To do this globally in an application, it suffices to remap the implementation for file URIs by setting the property fs.file.impl to the value org.apache.hadoop.fs.RawLocalFileSystem. Alternatively, you can directly create a RawLocalFileSystem instance, which may be useful if you want to disable checksum verification for only some reads, for example:

    Configuration conf = ...
    FileSystem fs = new RawLocalFileSystem();
    fs.initialize(null, conf);
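The global remapping mentioned above can be sketched as follows (an illustrative example, not from the book; the file:/// URI is used only to ask for the filesystem that handles file URIs):

    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;

    public class DisableLocalChecksums {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Remap the implementation used for file: URIs to the raw
        // (non-checksummed) local filesystem
        conf.set("fs.file.impl", "org.apache.hadoop.fs.RawLocalFileSystem");

        FileSystem fs = FileSystem.get(URI.create("file:///"), conf);
        System.out.println(fs.getClass().getName()); // RawLocalFileSystem
      }
    }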
ChecksumFileSystem

LocalFileSystem uses ChecksumFileSystem to do its work, and this class makes it easy to add checksumming to other (nonchecksummed) filesystems, as ChecksumFileSystem is just a wrapper around FileSystem. The general idiom is as follows:

    FileSystem rawFs = ...
    FileSystem checksummedFs = new ChecksumFileSystem(rawFs);

The underlying filesystem is called the raw filesystem, and may be retrieved using the getRawFileSystem() method on ChecksumFileSystem. ChecksumFileSystem has a few more useful methods for working with checksums, such as getChecksumFile() for getting the path of a checksum file for any file.
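As a small illustration (a sketch, not from the book; it relies on the fact that LocalFileSystem is itself a ChecksumFileSystem subclass, and the path /tmp/demo.txt is illustrative), the two methods just mentioned can be exercised like this:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.LocalFileSystem;
    import org.apache.hadoop.fs.Path;

    public class ChecksumAccessors {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        LocalFileSystem checksummedFs = FileSystem.getLocal(conf);

        // The underlying filesystem that performs no checksumming
        FileSystem rawFs = checksummedFs.getRawFileSystem();

        // The path of the checksum file that accompanies a given file
        Path crc = checksummedFs.getChecksumFile(new Path("/tmp/demo.txt"));

        System.out.println("raw filesystem: " + rawFs.getClass().getSimpleName());
        System.out.println("checksum file:  " + crc);
      }
    }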
Check the documentation for the others.

If an error is detected by ChecksumFileSystem when reading a file, it will call its reportChecksumFailure() method. The default implementation does nothing, but LocalFileSystem moves the offending file and its checksum to a side directory on the same device called bad_files. Administrators should periodically check for these bad files and take action on them.

Compression

File compression brings two major benefits: it reduces the space needed to store files, and it speeds up data transfer across the network or to or from disk.
When dealing with large volumes of data, both of these savings can be significant, so it pays to carefully consider how to use compression in Hadoop.

There are many different compression formats, tools, and algorithms, each with different characteristics. Table 5-1 lists some of the more common ones that can be used with Hadoop.

Table 5-1. A summary of compression formats

    Compression format   Tool    Algorithm   Filename extension   Splittable?
    DEFLATE [a]          N/A     DEFLATE     .deflate             No
    gzip                 gzip    DEFLATE     .gz                  No
    bzip2                bzip2   bzip2       .bz2                 Yes
    LZO                  lzop    LZO         .lzo                 No [b]
    LZ4                  N/A     LZ4         .lz4                 No
    Snappy               N/A     Snappy      .snappy              No

[a] DEFLATE is a compression algorithm whose standard implementation is zlib. There is no commonly available command-line tool for producing files in DEFLATE format, as gzip is normally used. (Note that the gzip file format is DEFLATE with extra headers and a footer.) The .deflate filename extension is a Hadoop convention.
[b] However, LZO files are splittable if they have been indexed in a preprocessing step.
See "Compression and Input Splits" on page 105.

All compression algorithms exhibit a space/time trade-off: faster compression and decompression speeds usually come at the expense of smaller space savings. The tools listed in Table 5-1 typically give some control over this trade-off at compression time by offering nine different options: -1 means optimize for speed, and -9 means optimize for space. For example, the following command creates a compressed file file.gz using the fastest compression method:

    % gzip -1 file

The different tools have very different compression characteristics.
gzip is a general-purpose compressor and sits in the middle of the space/time trade-off. bzip2 compresses more effectively than gzip, but is slower. bzip2's decompression speed is faster than its compression speed, but it is still slower than the other formats. LZO, LZ4, and Snappy, on the other hand, all optimize for speed and are around an order of magnitude faster than gzip, but compress less effectively. Snappy and LZ4 are also significantly faster than LZO for decompression.[1]

[1] For a comprehensive set of compression benchmarks, jvm-compressor-benchmark is a good reference for JVM-compatible libraries (including some native libraries).

The "Splittable" column in Table 5-1 indicates whether the compression format supports splitting (that is, whether you can seek to any point in the stream and start reading from some point further on). Splittable compression formats are especially suitable for MapReduce; see "Compression and Input Splits" on page 105 for further discussion.

Codecs

A codec is the implementation of a compression-decompression algorithm.
In Hadoop, a codec is represented by an implementation of the CompressionCodec interface. So, for example, GzipCodec encapsulates the compression and decompression algorithm for gzip. Table 5-2 lists the codecs that are available for Hadoop.

Table 5-2. Hadoop compression codecs

    Compression format   Hadoop CompressionCodec
    DEFLATE              org.apache.hadoop.io.compress.DefaultCodec
    gzip                 org.apache.hadoop.io.compress.GzipCodec
    bzip2                org.apache.hadoop.io.compress.BZip2Codec
    LZO                  com.hadoop.compression.lzo.LzopCodec
    LZ4                  org.apache.hadoop.io.compress.Lz4Codec
    Snappy               org.apache.hadoop.io.compress.SnappyCodec

The LZO libraries are GPL licensed and may not be included in Apache distributions, so for this reason the Hadoop codecs must be downloaded separately from Google (or GitHub, which includes bug fixes and more tools).
The LzopCodec, which is compatible with the lzop tool, is essentially the LZO format with extra headers, and is the one you normally want. There is also an LzoCodec for the pure LZO format, which uses the .lzo_deflate filename extension (by analogy with DEFLATE, which is gzip without the headers).

Compressing and decompressing streams with CompressionCodec

CompressionCodec has two methods that allow you to easily compress or decompress data. To compress data being written to an output stream, use the createOutputStream(OutputStream out) method to create a CompressionOutputStream to which you write your uncompressed data to have it written in compressed form to the underlying stream.
Conversely, to decompress data being read from an input stream, call createInputStream(InputStream in) to obtain a CompressionInputStream, which allows you to read uncompressed data from the underlying stream.

CompressionOutputStream and CompressionInputStream are similar to java.util.zip.DeflaterOutputStream and java.util.zip.DeflaterInputStream, except that both of the former provide the ability to reset their underlying compressor or decompressor. This is important for applications that compress sections of the data stream as separate blocks, such as in a SequenceFile, described in "SequenceFile" on page 127.

Example 5-1 illustrates how to use the API to compress data read from standard input and write it to standard output.
Example 5-1. A program to compress data read from standard input and write it to standard output

    public class StreamCompressor {

      public static void main(String[] args) throws Exception {
        String codecClassname = args[0];
        Class<?> codecClass = Class.forName(codecClassname);
        Configuration conf = new Configuration();
        CompressionCodec codec = (CompressionCodec)
          ReflectionUtils.newInstance(codecClass, conf);

        CompressionOutputStream out = codec.createOutputStream(System.out);
        IOUtils.copyBytes(System.in, out, 4096, false);
        out.finish();
      }
    }

The application expects the fully qualified name of the CompressionCodec implementation as the first command-line argument.
We use ReflectionUtils to construct a new instance of the codec, then obtain a compression wrapper around System.out. Then we call the utility method copyBytes() on IOUtils to copy the input to the output, which is compressed by the CompressionOutputStream. Finally, we call finish() on CompressionOutputStream, which tells the compressor to finish writing to the compressed stream, but doesn't close the stream.
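The same pair of methods can also be exercised entirely in memory. Here is a sketch (not from the book; the class name and the choice of GzipCodec are illustrative) that compresses a string to a byte array and decompresses it again:

    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.CompressionInputStream;
    import org.apache.hadoop.io.compress.CompressionOutputStream;
    import org.apache.hadoop.io.compress.GzipCodec;
    import org.apache.hadoop.util.ReflectionUtils;

    public class InMemoryRoundTrip {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        CompressionCodec codec = ReflectionUtils.newInstance(GzipCodec.class, conf);

        // Compress a string into a byte array
        ByteArrayOutputStream compressed = new ByteArrayOutputStream();
        CompressionOutputStream out = codec.createOutputStream(compressed);
        out.write("Text".getBytes("UTF-8"));
        out.finish();  // flush the compressor...
        out.close();   // ...and close the stream once we are done with it

        // Decompress it again and print the result ("Text")
        CompressionInputStream in = codec.createInputStream(
            new ByteArrayInputStream(compressed.toByteArray()));
        IOUtils.copyBytes(in, System.out, 4096, true);
      }
    }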
We can try it out with the following command line, which compresses the string "Text" using the StreamCompressor program with the GzipCodec, then decompresses it from standard input using gunzip:

    % echo "Text" | hadoop StreamCompressor org.apache.hadoop.io.compress.GzipCodec \
      | gunzip -
    Text

Inferring CompressionCodecs using CompressionCodecFactory

If you are reading a compressed file, normally you can infer which codec to use by looking at its filename extension. A file ending in .gz can be read with GzipCodec, and so on. The extensions for each compression format are listed in Table 5-1.

CompressionCodecFactory provides a way of mapping a filename extension to a CompressionCodec using its getCodec() method, which takes a Path object for the file in question.
Example 5-2 shows an application that uses this feature to decompress files.

Example 5-2. A program to decompress a compressed file using a codec inferred from the file's extension

    public class FileDecompressor {

      public static void main(String[] args) throws Exception {
        String uri = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);

        Path inputPath = new Path(uri);
        CompressionCodecFactory factory = new CompressionCodecFactory(conf);
        CompressionCodec codec = factory.getCodec(inputPath);
        if (codec == null) {
          System.err.println("No codec found for " + uri);
          System.exit(1);
        }

        String outputUri =
          CompressionCodecFactory.removeSuffix(uri, codec.getDefaultExtension());

        InputStream in = null;
        OutputStream out = null;
        try {
          in = codec.createInputStream(fs.open(inputPath));
          out = fs.create(new Path(outputUri));
          IOUtils.copyBytes(in, out, conf);
        } finally {
          IOUtils.closeStream(in);
          IOUtils.closeStream(out);
        }
      }
    }

Once the codec has been found, it is used to strip off the file suffix to form the output filename (via the removeSuffix() static method of CompressionCodecFactory).
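To make the suffix stripping concrete, here is a tiny sketch (not from the book; the class name is illustrative) of the same removeSuffix() call:

    import org.apache.hadoop.io.compress.CompressionCodecFactory;

    public class RemoveSuffixDemo {
      public static void main(String[] args) {
        // GzipCodec's default extension is ".gz", so "file.gz" maps to "file"
        System.out.println(CompressionCodecFactory.removeSuffix("file.gz", ".gz"));
      }
    }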
In this way, a file named file.gz is decompressed to file by invoking the program as follows:

    % hadoop FileDecompressor file.gz

CompressionCodecFactory loads all the codecs in Table 5-2, except LZO, as well as any listed in the io.compression.codecs configuration property (Table 5-3). By default, the property is empty; you would need to alter it only if you have a custom codec that you wish to register (such as the externally hosted LZO codecs).
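To make the registration step concrete, here is a sketch (not from the book) that sets io.compression.codecs and then asks CompressionCodecFactory to map filename extensions to codecs. A built-in codec class stands in for a custom one here, simply so the example runs against a stock Hadoop installation; in practice you would list the fully qualified class names of your externally supplied codecs:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.CompressionCodecFactory;

    public class CodecRegistration {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        // A comma-separated list of codec class names; replace with your own
        // codec (for example, the LZO codecs) in a real deployment
        conf.set("io.compression.codecs",
            "org.apache.hadoop.io.compress.BZip2Codec");

        CompressionCodecFactory factory = new CompressionCodecFactory(conf);
        for (String name : new String[] { "file.gz", "file.bz2", "file.txt" }) {
          CompressionCodec codec = factory.getCodec(new Path(name));
          System.out.println(name + " -> "
              + (codec == null ? "none" : codec.getClass().getSimpleName()));
        }
      }
    }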