It is a tab character by default. Consider the following input file, where → represents a (horizontal) tab character:

line1→On the top of the Crumpetty Tree
line2→The Quangle Wangle sat,
line3→But his face you could not see,
line4→On account of his Beaver Hat.

Like in the TextInputFormat case, the input is in a single split comprising four records, although this time the keys are the Text sequences before the tab in each line:

(line1, On the top of the Crumpetty Tree)
(line2, The Quangle Wangle sat,)
(line3, But his face you could not see,)
(line4, On account of his Beaver Hat.)
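A minimal driver sketch (not from the book's examples) that selects this input format and overrides the separator might look like the following; the ':' separator and the input path are placeholders:

Configuration conf = new Configuration();
// Override the default tab separator so that keys and values split at the first ':'
conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", ":");

Job job = Job.getInstance(conf, "key-value text example");
job.setInputFormatClass(KeyValueTextInputFormat.class);
FileInputFormat.addInputPath(job, new Path("input/sample.txt")); // hypothetical path
// The map input types must then be Text keys and Text values:
//   Mapper<Text, Text, K, V>
// Remaining mapper, reducer, and output setup is omitted.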
NLineInputFormat

With TextInputFormat and KeyValueTextInputFormat, each mapper receives a variable number of lines of input. The number depends on the size of the split and the length of the lines. If you want your mappers to receive a fixed number of lines of input, then NLineInputFormat is the InputFormat to use. Like with TextInputFormat, the keys are the byte offsets within the file and the values are the lines themselves.

N refers to the number of lines of input that each mapper receives. With N set to 1 (the default), each mapper receives exactly one line of input. The mapreduce.input.lineinputformat.linespermap property controls the value of N.
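For example, a driver fragment that asks for two lines per mapper could be sketched like this; the static helper sets the property shown above, and the input path is a placeholder:

job.setInputFormatClass(NLineInputFormat.class);
// Equivalent to setting mapreduce.input.lineinputformat.linespermap to 2
NLineInputFormat.setNumLinesPerSplit(job, 2);
FileInputFormat.addInputPath(job, new Path("input/poem.txt")); // hypothetical path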
By way of example, consider these four lines again:

On the top of the Crumpetty Tree
The Quangle Wangle sat,
But his face you could not see,
On account of his Beaver Hat.

If, for example, N is 2, then each split contains two lines. One mapper will receive the first two key-value pairs:

(0, On the top of the Crumpetty Tree)
(33, The Quangle Wangle sat,)

And another mapper will receive the second two key-value pairs:

(57, But his face you could not see,)
(89, On account of his Beaver Hat.)

The keys and values are the same as those that TextInputFormat produces. The difference is in the way the splits are constructed.

Usually, having a map task for a small number of lines of input is inefficient (due to the overhead in task setup), but there are applications that take a small amount of input data and run an extensive (i.e., CPU-intensive) computation for it, then emit their output.
Simulations are a good example. By creating an input file that specifies input parameters, one per line, you can perform a parameter sweep: run a set of simulations in parallel to find how a model varies as the parameter changes.

If you have long-running simulations, you may fall afoul of task timeouts. When a task doesn’t report progress for more than 10 minutes, the application master assumes it has failed and aborts the process (see “Task Failure” on page 193). The best way to guard against this is to report progress periodically, by writing a status message or incrementing a counter, for example. See “What Constitutes Progress in MapReduce?” on page 191.
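A hedged sketch of what periodic progress reporting can look like inside a mapper follows; the SimulationMapper class, the loop bounds, and the output are stand-ins for a real simulation:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SimulationMapper extends Mapper<LongWritable, Text, Text, Text> {
  enum Checkpoints { REACHED } // custom counter; incrementing it counts as progress

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // value holds one line of simulation parameters
    for (int step = 0; step < 1000000; step++) {
      // ... one step of the long-running computation ...
      if (step % 10000 == 0) {
        context.setStatus("step " + step);                    // status message
        context.getCounter(Checkpoints.REACHED).increment(1); // counter increment
      }
    }
    context.write(value, new Text("result")); // placeholder output
  }
}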
Another example is using Hadoop to bootstrap data loading from multiple data sources, such as databases. You create a “seed” input file that lists the data sources, one per line. Then each mapper is allocated a single data source, and it loads the data from that source into HDFS. The job doesn’t need the reduce phase, so the number of reducers should be set to zero (by calling setNumReduceTasks() on Job). Furthermore, MapReduce jobs can be run to process the data loaded into HDFS. See Appendix C for an example.
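A driver for this pattern might be sketched as follows; LoadSourceMapper is a hypothetical mapper that connects to the source named on its input line and copies its data into HDFS, and NullOutputFormat is used because the mapper does its own writing:

Job job = Job.getInstance(new Configuration(), "bootstrap data load");
job.setJarByClass(LoadSourceMapper.class);
job.setInputFormatClass(NLineInputFormat.class);            // default N = 1: one data source per mapper
FileInputFormat.addInputPath(job, new Path("sources.txt")); // hypothetical seed file
job.setMapperClass(LoadSourceMapper.class);
job.setOutputFormatClass(NullOutputFormat.class);           // no conventional job output
job.setNumReduceTasks(0);                                   // map-only: no reduce phase
System.exit(job.waitForCompletion(true) ? 0 : 1);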
XML

Most XML parsers operate on whole XML documents, so if a large XML document is made up of multiple input splits, it is a challenge to parse these individually. Of course, you can process the entire XML document in one mapper (if it is not too large) using the technique in “Processing a whole file as a record” on page 228.

Large XML documents that are composed of a series of “records” (XML document fragments) can be broken into these records using simple string or regular-expression matching to find the start and end tags of records.
This alleviates the problem when the document is split by the framework because the next start tag of a record is easy to find by simply scanning from the start of the split, just like TextInputFormat finds newline boundaries.

Hadoop comes with a class for this purpose called StreamXmlRecordReader (which is in the org.apache.hadoop.streaming.mapreduce package, although it can be used outside of Streaming). You can use it by setting your input format to StreamInputFormat and setting the stream.recordreader.class property to org.apache.hadoop.streaming.mapreduce.StreamXmlRecordReader. The reader is configured by setting job configuration properties to tell it the patterns for the start and end tags (see the class documentation for details).4

4. See Mahout’s XmlInputFormat for an improved XML input format.
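As an illustrative fragment (not from the book), the configuration might be set up as follows; the stream.recordreader.begin and stream.recordreader.end property names are an assumption taken from the StreamXmlRecordReader documentation, and the <page> tags are just an example pattern for a Wikipedia-style dump:

Configuration conf = new Configuration();
conf.set("stream.recordreader.class",
    "org.apache.hadoop.streaming.mapreduce.StreamXmlRecordReader");
// Patterns marking the start and end of each record (see the class documentation)
conf.set("stream.recordreader.begin", "<page>");
conf.set("stream.recordreader.end", "</page>");

Job job = Job.getInstance(conf, "xml record processing");
job.setInputFormatClass(org.apache.hadoop.streaming.mapreduce.StreamInputFormat.class);
// Each record handed to the mapper is then one <page>...</page> fragment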
To take an example, Wikipedia provides dumps of its content in XML form, which are appropriate for processing in parallel with MapReduce using this approach. The data is contained in one large XML wrapper document, which contains a series of elements, such as page elements that contain a page’s content and associated metadata. Using StreamXmlRecordReader, the page elements can be interpreted as records for processing by a mapper.

Binary Input

Hadoop MapReduce is not restricted to processing textual data. It has support for binary formats, too.

SequenceFileInputFormat

Hadoop’s sequence file format stores sequences of binary key-value pairs.
Sequence files are well suited as a format for MapReduce data because they are splittable (they have sync points so that readers can synchronize with record boundaries from an arbitrary point in the file, such as the start of a split), they support compression as a part of the format, and they can store arbitrary types using a variety of serialization frameworks. (These topics are covered in “SequenceFile” on page 127.)

To use data from sequence files as the input to MapReduce, you can use SequenceFileInputFormat.
The keys and values are determined by the sequence file, and you need to make sure that your map input types correspond. For example, if your sequence file has IntWritable keys and Text values, like the one created in Chapter 5, then the map signature would be Mapper<IntWritable, Text, K, V>, where K and V are the types of the map’s output keys and values.
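For concreteness, a sketch of a mapper with that input signature, plus the driver line that selects the input format, might look as follows; the SequenceRecordMapper class, the output types, and the swap of key and value are illustrative only:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SequenceRecordMapper
    extends Mapper<IntWritable, Text, Text, IntWritable> {
  @Override
  protected void map(IntWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // The keys and values arrive already deserialized from the sequence file
    context.write(value, key); // e.g., invert the pairing for a downstream reduce
  }
}

// In the driver:
// job.setInputFormatClass(SequenceFileInputFormat.class);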
Although its name doesn’t give it away, SequenceFileInputFormat can read map files as well as sequence files. If it finds a directory where it was expecting a sequence file, SequenceFileInputFormat assumes that it is reading a map file and uses its datafile. This is why there is no MapFileInputFormat class.

SequenceFileAsTextInputFormat

SequenceFileAsTextInputFormat is a variant of SequenceFileInputFormat that converts the sequence file’s keys and values to Text objects. The conversion is performed by calling toString() on the keys and values. This format makes sequence files suitable input for Streaming.

SequenceFileAsBinaryInputFormat

SequenceFileAsBinaryInputFormat is a variant of SequenceFileInputFormat that retrieves the sequence file’s keys and values as opaque binary objects. They are encapsulated as BytesWritable objects, and the application is free to interpret the underlying byte array as it pleases. In combination with a process that creates sequence files with SequenceFile.Writer’s appendRaw() method or SequenceFileAsBinaryOutputFormat, this provides a way to use any binary data types with MapReduce (packaged as a sequence file), although plugging into Hadoop’s serialization mechanism is normally a cleaner alternative (see “Serialization Frameworks” on page 126).
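As a hedged illustration of what this looks like from the mapper’s point of view, both keys and values arrive as BytesWritable; the RawRecordMapper class and the way the bytes are summarized are made up for the example:

import java.io.IOException;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class RawRecordMapper
    extends Mapper<BytesWritable, BytesWritable, Text, NullWritable> {
  @Override
  protected void map(BytesWritable key, BytesWritable value, Context context)
      throws IOException, InterruptedException {
    // The framework hands over raw bytes; interpreting them is up to the application
    byte[] raw = value.copyBytes(); // copyBytes() trims the buffer to the record length
    // ... decode raw with whatever serialization produced the file ...
    context.write(new Text("record of " + raw.length + " bytes"), NullWritable.get());
  }
}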
FixedLengthInputFormat

FixedLengthInputFormat is for reading fixed-width binary records from a file, when the records are not separated by delimiters. The record size must be set via fixedlengthinputformat.record.length.
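For instance (the 100-byte record length is just an example; setRecordLength() is the static helper that sets the property above):

Configuration conf = new Configuration();
FixedLengthInputFormat.setRecordLength(conf, 100); // sets fixedlengthinputformat.record.length

Job job = Job.getInstance(conf, "fixed-length records");
job.setInputFormatClass(FixedLengthInputFormat.class);
// Keys are byte offsets (LongWritable); each value is a BytesWritable of exactly 100 bytes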
Multiple Inputs

Although the input to a MapReduce job may consist of multiple input files (constructed by a combination of file globs, filters, and plain paths), all of the input is interpreted by a single InputFormat and a single Mapper. What often happens, however, is that the data format evolves over time, so you have to write your mapper to cope with all of your legacy formats. Or you may have data sources that provide the same type of data but in different formats. This arises in the case of performing joins of different datasets; see “Reduce-Side Joins” on page 270. For instance, one might be tab-separated plain text, and the other a binary sequence file.
Even if they are in the same format, they may have different representations, and therefore need to be parsed differently.

These cases are handled elegantly by using the MultipleInputs class, which allows you to specify which InputFormat and Mapper to use on a per-path basis. For example, if we had weather data from the UK Met Office5 that we wanted to combine with the NCDC data for our maximum temperature analysis, we might set up the input as follows:

MultipleInputs.addInputPath(job, ncdcInputPath,
    TextInputFormat.class, MaxTemperatureMapper.class);
MultipleInputs.addInputPath(job, metOfficeInputPath,
    TextInputFormat.class, MetOfficeMaxTemperatureMapper.class);

This code replaces the usual calls to FileInputFormat.addInputPath() and job.setMapperClass().