Tom White - Hadoop: The Definitive Guide, 4th Edition, 2015 (excerpt)
This is very useful because you can put defaults into configuration files and then override them with the -D option as needed. A common example of this is setting the number of reducers for a MapReduce job via -D mapreduce.job.reduces=n. This will override the number of reducers set on the cluster or set in any client-side configuration files.

The other options that GenericOptionsParser and ToolRunner support are listed in Table 6-1. You can find more on Hadoop's configuration API in "The Configuration API" on page 141.

Do not confuse setting Hadoop properties using the -D property=value option to GenericOptionsParser (and ToolRunner) with setting JVM system properties using the -Dproperty=value option to the java command. The syntax for JVM system properties does not allow any whitespace between the D and the property name, whereas GenericOptionsParser does allow whitespace.

JVM system properties are retrieved from the java.lang.System class, but Hadoop properties are accessible only from a Configuration object.
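The separation between the two property namespaces can be sketched in plain Java. This is a minimal illustration only: a HashMap stands in for org.apache.hadoop.conf.Configuration (which loads its key/value pairs from its own resource files, never from System properties), and the class name PropertySpaces is invented here.

```java
import java.util.HashMap;
import java.util.Map;

public class PropertySpaces {

    // Stand-in for a Hadoop Configuration: a separate map of key/value
    // pairs, populated from configuration resources, not from the JVM.
    static Map<String, String> conf = new HashMap<>();

    public static void main(String[] args) {
        // A JVM system property lives in java.lang.System...
        System.setProperty("color", "yellow");

        // ...while a Hadoop-style property lives only in the Configuration.
        conf.put("mapreduce.job.reduces", "2");

        // The two namespaces do not see each other:
        System.out.println(System.getProperty("color")); // prints "yellow"
        System.out.println(conf.get("color"));           // prints "null"
    }
}
```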
So, the following command will print nothing, even though the color system property has been set (via HADOOP_OPTS), because the System class is not used by ConfigurationPrinter:

% HADOOP_OPTS='-Dcolor=yellow' \
  hadoop ConfigurationPrinter | grep color

If you want to be able to set configuration through system properties, you need to mirror the system properties of interest in the configuration file. See "Variable Expansion" on page 143 for further discussion.

Table 6-1. GenericOptionsParser and ToolRunner options

-D property=value
    Sets the given Hadoop configuration property to the given value. Overrides any default or site properties in the configuration and any properties set via the -conf option.

-conf filename ...
    Adds the given files to the list of resources in the configuration. This is a convenient way to set site properties or to set a number of properties at once.

-fs uri
    Sets the default filesystem to the given URI. Shortcut for -D fs.defaultFS=uri.

-jt host:port
    Sets the YARN resource manager to the given host and port. (In Hadoop 1, it sets the jobtracker address, hence the option name.) Shortcut for -D yarn.resourcemanager.address=host:port.

-files file1,file2,...
    Copies the specified files from the local filesystem (or any filesystem if a scheme is specified) to the shared filesystem used by MapReduce (usually HDFS) and makes them available to MapReduce programs in the task's working directory.
    (See "Distributed Cache" on page 274 for more on the distributed cache mechanism for copying files to machines in the cluster.)

-archives archive1,archive2,...
    Copies the specified archives from the local filesystem (or any filesystem if a scheme is specified) to the shared filesystem used by MapReduce (usually HDFS), unarchives them, and makes them available to MapReduce programs in the task's working directory.

-libjars jar1,jar2,...
    Copies the specified JAR files from the local filesystem (or any filesystem if a scheme is specified) to the shared filesystem used by MapReduce (usually HDFS) and adds them to the MapReduce task's classpath. This option is a useful way of shipping JAR files that a job is dependent on.

Writing a Unit Test with MRUnit

The map and reduce functions in MapReduce are easy to test in isolation, which is a consequence of their functional style. MRUnit is a testing library that makes it easy to pass known inputs to a mapper or a reducer and check that the outputs are as expected. MRUnit is used in conjunction with a standard test execution framework, such as JUnit, so you can run the tests for MapReduce jobs in your normal development environment. For example, all of the tests described here can be run from within an IDE by following the instructions in "Setting Up the Development Environment" on page 144.

Mapper

The test for the mapper is shown in Example 6-5.

Example 6-5. Unit test for MaxTemperatureMapper

import java.io.IOException;

import org.apache.hadoop.io.*;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.*;

public class MaxTemperatureMapperTest {

  @Test
  public void processesValidRecord() throws IOException, InterruptedException {
    Text value = new Text("0043011990999991950051518004+68750+023550FM-12+0382" +
                                  // Year ^^^^
        "99999V0203201N00261220001CN9999999N9-00111+99999999999");
                              // Temperature ^^^^^
    new MapDriver<LongWritable, Text, Text, IntWritable>()
      .withMapper(new MaxTemperatureMapper())
      .withInput(new LongWritable(0), value)
      .withOutput(new Text("1950"), new IntWritable(-11))
      .runTest();
  }
}

The idea of the test is very simple: pass a weather record as input to the mapper, and check that the output is the year and temperature reading. Since we are testing the mapper, we use MRUnit's MapDriver, which we configure with the mapper under test (MaxTemperatureMapper), the input key and value, and the expected output key (a Text object representing the year, 1950) and expected output value (an IntWritable representing the temperature, −1.1°C), before finally calling the runTest() method to execute the test.
If the expected output values are not emitted by the mapper, MRUnit will fail the test. Notice that the input key could be set to any value because our mapper ignores it.

Proceeding in a test-driven fashion, we create a Mapper implementation that passes the test (see Example 6-6). Because we will be evolving the classes in this chapter, each is put in a different package indicating its version for ease of exposition. For example, v1.MaxTemperatureMapper is version 1 of MaxTemperatureMapper. In reality, of course, you would evolve classes without repackaging them.

Example 6-6. First version of a Mapper that passes MaxTemperatureMapperTest

public class MaxTemperatureMapper
  extends Mapper<LongWritable, Text, Text, IntWritable> {

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {

    String line = value.toString();
    String year = line.substring(15, 19);
    int airTemperature = Integer.parseInt(line.substring(87, 92));
    context.write(new Text(year), new IntWritable(airTemperature));
  }
}

This is a very simple implementation that pulls the year and temperature fields from the line and writes them to the Context. Let's add a test for missing values, which in the raw data are represented by a temperature of +9999:

  @Test
  public void ignoresMissingTemperatureRecord() throws IOException,
      InterruptedException {
    Text value = new Text("0043011990999991950051518004+68750+023550FM-12+0382" +
                                  // Year ^^^^
        "99999V0203201N00261220001CN9999999N9+99991+99999999999");
                              // Temperature ^^^^^
    new MapDriver<LongWritable, Text, Text, IntWritable>()
      .withMapper(new MaxTemperatureMapper())
      .withInput(new LongWritable(0), value)
      .runTest();
  }

A MapDriver can be used to check for zero, one, or more output records, according to the number of times that withOutput() is called.
In our application, since records with missing temperatures should be filtered out, this test asserts that no output is produced for this particular input value.

The new test fails since +9999 is not treated as a special case. Rather than putting more logic into the mapper, it makes sense to factor out a parser class to encapsulate the parsing logic; see Example 6-7.

Example 6-7. A class for parsing weather records in NCDC format

public class NcdcRecordParser {

  private static final int MISSING_TEMPERATURE = 9999;

  private String year;
  private int airTemperature;
  private String quality;

  public void parse(String record) {
    year = record.substring(15, 19);
    String airTemperatureString;
    // Remove leading plus sign as parseInt doesn't like them (pre-Java 7)
    if (record.charAt(87) == '+') {
      airTemperatureString = record.substring(88, 92);
    } else {
      airTemperatureString = record.substring(87, 92);
    }
    airTemperature = Integer.parseInt(airTemperatureString);
    quality = record.substring(92, 93);
  }

  public void parse(Text record) {
    parse(record.toString());
  }

  public boolean isValidTemperature() {
    return airTemperature != MISSING_TEMPERATURE && quality.matches("[01459]");
  }

  public String getYear() {
    return year;
  }

  public int getAirTemperature() {
    return airTemperature;
  }
}

The resulting mapper (version 2) is much simpler (see Example 6-8).
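The parser's fixed-width logic can also be exercised on its own, without Hadoop or MRUnit on the classpath. The following is a minimal, self-contained sketch (the class name NcdcParseSketch is invented here, and the Text overload and getters are dropped) run against the two sample records used in the tests above:

```java
public class NcdcParseSketch {

    private static final int MISSING_TEMPERATURE = 9999;

    String year;
    int airTemperature;
    String quality;

    // Same fixed-width parsing logic as NcdcRecordParser, minus the Hadoop types
    void parse(String record) {
        year = record.substring(15, 19);
        String t = record.charAt(87) == '+'
                ? record.substring(88, 92)  // drop leading '+' for parseInt (pre-Java 7)
                : record.substring(87, 92);
        airTemperature = Integer.parseInt(t);
        quality = record.substring(92, 93);
    }

    boolean isValidTemperature() {
        return airTemperature != MISSING_TEMPERATURE && quality.matches("[01459]");
    }

    public static void main(String[] args) {
        NcdcParseSketch p = new NcdcParseSketch();

        // The valid record from processesValidRecord()
        p.parse("0043011990999991950051518004+68750+023550FM-12+0382"
              + "99999V0203201N00261220001CN9999999N9-00111+99999999999");
        System.out.println(p.year + " " + p.airTemperature + " " + p.isValidTemperature());
        // prints "1950 -11 true"

        // The missing-temperature record from ignoresMissingTemperatureRecord()
        p.parse("0043011990999991950051518004+68750+023550FM-12+0382"
              + "99999V0203201N00261220001CN9999999N9+99991+99999999999");
        System.out.println(p.airTemperature + " " + p.isValidTemperature());
        // prints "9999 false"
    }
}
```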