Each task attempt page has links to the logfiles and counters. If we follow one of the links to the logfiles for the successful task attempt, we can find the suspect input record that we logged (the line is wrapped and truncated to fit on the page):

Temperature over 100 degrees for input:
0335999999433181957042302005+37950+139117SAO +0004RJSN V02011359003150070356999999433201957010100005+35317+139650SAO +000899999V02002359002650076249N0040005...

This record seems to be in a different format from the others.
For one thing, there are spaces in the line, which are not described in the specification.

When the job has finished, we can look at the value of the counter we defined to see how many records over 100°C there are in the whole dataset. Counters are accessible via the web UI or the command line:

% mapred job -counter job_1410450250506_0006 \
  'v3.MaxTemperatureMapper$Temperature' OVER_100
3

The -counter option takes the job ID, counter group name (which is the fully qualified classname here), and counter name (the enum name).
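Counters for a completed job can also be read programmatically rather than from the shell. The following is a minimal sketch using the MapReduce Cluster API; the CounterReader class name is ours, and the job ID is hardcoded purely for illustration. It prints the same value that the mapred job -counter invocation above reports:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Cluster;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.JobID;

// Sketch: read the OVER_100 counter of a finished job from Java.
public class CounterReader {
  public static void main(String[] args) throws Exception {
    Cluster cluster = new Cluster(new Configuration());
    // The job ID is the one printed when the job was submitted.
    Job job = cluster.getJob(JobID.forName("job_1410450250506_0006"));
    if (job == null) {
      System.err.println("No job with that ID found");
      System.exit(1);
    }
    long over100 = job.getCounters()
        .findCounter("v3.MaxTemperatureMapper$Temperature", "OVER_100")
        .getValue();
    System.out.println("Records over 100 degrees: " + over100);
  }
}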
There are only three malformed records in the entire dataset of over a billion records. Throwing out bad records is standard for many big data problems, although we need to be careful in this case because we are looking for an extreme value—the maximum temperature rather than an aggregate measure. Still, throwing away three records is probably not going to change the result.

Handling malformed data

Capturing input data that causes a problem is valuable, as we can use it in a test to check that the mapper does the right thing. In this MRUnit test, we check that the counter is updated for the malformed input:

@Test
public void parsesMalformedTemperature() throws IOException,
    InterruptedException {
  Text value = new Text("0335999999433181957042302005+37950+139117SAO +0004" +
                                          // Year ^^^^
      "RJSN V02011359003150070356999999433201957010100005+353");
                                // Temperature ^^^^^
  Counters counters = new Counters();
  new MapDriver<LongWritable, Text, Text, IntWritable>()
    .withMapper(new MaxTemperatureMapper())
    .withInput(new LongWritable(0), value)
    .withCounters(counters)
    .runTest();
  Counter c = counters.findCounter(MaxTemperatureMapper.Temperature.MALFORMED);
  assertThat(c.getValue(), is(1L));
}
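For symmetry, a companion test can assert the mapper's normal output for a record in the expected format, along the lines of the valid-record test shown earlier in the chapter. The record and expected values below are illustrative (temperatures are stored in tenths of a degree, so -11 means -1.1°C):

@Test
public void processesValidRecord() throws IOException, InterruptedException {
  Text value = new Text("0043011990999991950051518004+68750+023550FM-12+0382" +
                                          // Year ^^^^
      "99999V0203201N00261220001CN9999999N9-00111+99999999999");
                                      // Temperature ^^^^^
  new MapDriver<LongWritable, Text, Text, IntWritable>()
    .withMapper(new MaxTemperatureMapper())
    .withInput(new LongWritable(0), value)
    .withOutput(new Text("1950"), new IntWritable(-11))
    .runTest();
}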
The record that was causing the problem is of a different format than the other lines we've seen. Example 6-12 shows a modified program (version 4) using a parser that ignores each line with a temperature field that does not have a leading sign (plus or minus). We've also introduced a counter to measure the number of records that we are ignoring for this reason.

Example 6-12. Mapper for the maximum temperature example

public class MaxTemperatureMapper
  extends Mapper<LongWritable, Text, Text, IntWritable> {

  enum Temperature {
    MALFORMED
  }

  private NcdcRecordParser parser = new NcdcRecordParser();

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {

    parser.parse(value);
    if (parser.isValidTemperature()) {
      int airTemperature = parser.getAirTemperature();
      context.write(new Text(parser.getYear()), new IntWritable(airTemperature));
    } else if (parser.isMalformedTemperature()) {
      System.err.println("Ignoring possibly corrupt input: " + value);
      context.getCounter(Temperature.MALFORMED).increment(1);
    }
  }
}
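The NcdcRecordParser class itself is not reproduced in this section. As a rough guide to what the methods used above do, the following sketch parses the year and temperature fields at their fixed NCDC offsets and flags a missing leading sign as malformed; the real parser in the book's example code is more thorough (it also validates the quality code, for instance), so treat this as illustrative:

import org.apache.hadoop.io.Text;

// Illustrative sketch of an NCDC record parser; the book's NcdcRecordParser
// differs in detail (e.g., it also checks the quality code).
public class NcdcRecordParser {

  private static final int MISSING_TEMPERATURE = 9999;

  private String year;
  private int airTemperature;
  private boolean airTemperatureMalformed;

  public void parse(String record) {
    year = record.substring(15, 19);
    airTemperatureMalformed = false;
    char sign = record.charAt(87);
    if (sign == '+') {        // parseInt doesn't accept a leading plus sign
      airTemperature = Integer.parseInt(record.substring(88, 92));
    } else if (sign == '-') {
      airTemperature = Integer.parseInt(record.substring(87, 92));
    } else {
      airTemperatureMalformed = true;   // no leading sign: treat as malformed
    }
  }

  public void parse(Text record) {
    parse(record.toString());
  }

  public boolean isValidTemperature() {
    return !airTemperatureMalformed && airTemperature != MISSING_TEMPERATURE;
  }

  public boolean isMalformedTemperature() {
    return airTemperatureMalformed;
  }

  public String getYear() {
    return year;
  }

  public int getAirTemperature() {
    return airTemperature;
  }
}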
Hadoop Logs

Hadoop produces logs in various places, and for various audiences. These are summarized in Table 6-2.

Table 6-2. Types of Hadoop logs

System daemon logs (primary audience: administrators)
    Each Hadoop daemon produces a logfile (using log4j) and another file that combines standard out and error. Written in the directory defined by the HADOOP_LOG_DIR environment variable. See "System logfiles" on page 295 and "Logging" on page 330.

HDFS audit logs (primary audience: administrators)
    A log of all HDFS requests, turned off by default. Written to the namenode's log, although this is configurable. See "Audit Logging" on page 324.

MapReduce job history logs (primary audience: users)
    A log of the events (such as task completion) that occur in the course of running a job. Saved centrally in HDFS. See "Job History" on page 166.

MapReduce task logs (primary audience: users)
    Each task child process produces a logfile using log4j (called syslog), a file for data sent to standard out (stdout), and a file for standard error (stderr). Written in the userlogs subdirectory of the directory defined by the YARN_LOG_DIR environment variable. Covered in this section.

YARN has a service for log aggregation that takes the task logs for completed applications and moves them to HDFS, where they are stored in a container file for archival purposes. If this service is enabled (by setting yarn.log-aggregation-enable to true on the cluster), then task logs can be viewed by clicking on the logs link in the task attempt web UI, or by using the mapred job -logs command.

By default, log aggregation is not enabled. In this case, task logs can be retrieved by visiting the node manager's web UI at http://node-manager-host:8042/logs/userlogs.

It is straightforward to write to these logfiles.
Anything written to standard output or standard error is directed to the relevant logfile. (Of course, in Streaming, standard output is used for the map or reduce output, so it will not show up in the standard output log.)

In Java, you can write to the task's syslog file if you wish by using the Apache Commons Logging API (or indeed any logging API that can write to log4j). This is shown in Example 6-13.

Example 6-13. An identity mapper that writes to standard output and also uses the Apache Commons Logging API

import java.io.IOException;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.mapreduce.Mapper;

public class LoggingIdentityMapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>
  extends Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {

  private static final Log LOG = LogFactory.getLog(LoggingIdentityMapper.class);

  @Override
  @SuppressWarnings("unchecked")
  public void map(KEYIN key, VALUEIN value, Context context)
      throws IOException, InterruptedException {
    // Log to stdout file
    System.out.println("Map key: " + key);

    // Log to syslog file
    LOG.info("Map key: " + key);
    if (LOG.isDebugEnabled()) {
      LOG.debug("Map value: " + value);
    }
    context.write((KEYOUT) key, (VALUEOUT) value);
  }
}
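The command shown below runs this mapper with a driver class called LoggingDriver from the book's example JAR. Its source is not reproduced in this section; a minimal driver along the following lines would do the same job, but treat the listing as a sketch rather than the book's exact code:

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Sketch of a driver for LoggingIdentityMapper; the LoggingDriver in the
// book's example code may differ in detail.
public class LoggingDriver extends Configured implements Tool {

  @Override
  public int run(String[] args) throws Exception {
    if (args.length != 2) {
      System.err.printf("Usage: %s [generic options] <input> <output>\n",
          getClass().getSimpleName());
      return -1;
    }
    Job job = Job.getInstance(getConf(), "Logging job");
    job.setJarByClass(getClass());
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.setMapperClass(LoggingIdentityMapper.class);
    // Map-only job; the default output types (LongWritable, Text) match
    // the types produced by the default TextInputFormat.
    job.setNumReduceTasks(0);
    return job.waitForCompletion(true) ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(new LoggingDriver(), args));
  }
}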
The default log level is INFO, so DEBUG-level messages do not appear in the syslog task logfile. However, sometimes you want to see these messages. To enable this, set mapreduce.map.log.level or mapreduce.reduce.log.level, as appropriate. For example, in this case, we could set it for the mapper to see the map values in the log as follows:

% hadoop jar hadoop-examples.jar LoggingDriver -conf conf/hadoop-cluster.xml \
  -D mapreduce.map.log.level=DEBUG input/ncdc/sample.txt logging-out

There are some controls for managing the retention and size of task logs. By default, logs are deleted after a minimum of three hours (you can set this using the yarn.nodemanager.log.retain-seconds property, although this is ignored if log aggregation is enabled).
You can also set a cap on the maximum size of each logfile using the mapreduce.task.userlog.limit.kb property, which is 0 by default, meaning there is no cap.

Sometimes you may need to debug a problem that you suspect is occurring in the JVM running a Hadoop command, rather than on the cluster. You can send DEBUG-level logs to the console by using an invocation like this:

% HADOOP_ROOT_LOGGER=DEBUG,console hadoop fs -text /foo/bar

Remote Debugging

When a task fails and there is not enough information logged to diagnose the error, you may want to resort to running a debugger for that task.
This is hard to arrange when running the job on a cluster, as you don't know which node is going to process which part of the input, so you can't set up your debugger ahead of the failure. However, there are a few other options available:

Reproduce the failure locally
    Often the failing task fails consistently on a particular input. You can try to reproduce the problem locally by downloading the file that the task is failing on and running the job locally, possibly using a debugger such as Java's VisualVM.

Use JVM debugging options
    A common cause of failure is a Java out of memory error in the task JVM. You can set mapred.child.java.opts to include -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/path/to/dumps. This setting produces a heap dump that can be examined afterward with tools such as jhat or the Eclipse Memory Analyzer. Note that the JVM options should be added to the existing memory settings specified by mapred.child.java.opts.
    These are explained in more detail in "Memory settings in YARN and MapReduce" on page 301.

Use task profiling
    Java profilers give a lot of insight into the JVM, and Hadoop provides a mechanism to profile a subset of the tasks in a job. See "Profiling Tasks" on page 175.

In some cases, it's useful to keep the intermediate files for a failed task attempt for later inspection, particularly if supplementary dump or profile files are created in the task's working directory. You can set mapreduce.task.files.preserve.failedtasks to true to keep a failed task's files.

You can keep the intermediate files for successful tasks, too, which may be handy if you want to examine a task that isn't failing. In this case, set the property mapreduce.task.files.preserve.filepattern to a regular expression that matches the IDs of the tasks whose files you want to keep.
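The file-preservation settings just described, like the mapred.child.java.opts setting from the list above, are ordinary job configuration properties, so they can be set programmatically as well as with -D on the command line. A minimal sketch follows; the -Xmx value and the task-ID pattern are placeholders, not values from this chapter:

import org.apache.hadoop.conf.Configuration;

// Sketch: debugging-related job properties set on a Configuration before
// job submission. The heap size and task-ID pattern are placeholders.
public class DebugSettings {
  public static Configuration debugConf() {
    Configuration conf = new Configuration();
    // Heap dump on OOM, appended to the task JVM's existing memory settings:
    conf.set("mapred.child.java.opts",
        "-Xmx200m -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/path/to/dumps");
    // Keep intermediate files for failed task attempts:
    conf.setBoolean("mapreduce.task.files.preserve.failedtasks", true);
    // Or keep files for particular tasks, failing or not, by task ID pattern:
    conf.set("mapreduce.task.files.preserve.filepattern", ".*_m_000027_.*");
    return conf;
  }
}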
Another useful property for debugging is yarn.nodemanager.delete.debug-delay-sec, which is the number of seconds to wait to delete localized task attempt files, such as the script used to launch the task container JVM. If this is set on the cluster to a reasonably large value (e.g., 600 for 10 minutes), then you have enough time to look at the files before they are deleted.

To examine task attempt files, log into the node that the task failed on and look for the directory for that task attempt. It will be under one of the local MapReduce directories, as set by the mapreduce.cluster.local.dir property (covered in more detail in "Important Hadoop Daemon Properties" on page 296). If this property is a comma-separated list of directories (to spread load across the physical disks on a machine), you may need to look in all of the directories before you find the directory for that particular task attempt. The task attempt directory is in the following location:

mapreduce.cluster.local.dir/usercache/user/appcache/application-ID/output/task-attempt-ID

Tuning a Job

After a job is working, the question many developers ask is, "Can I make it run faster?" There are a few Hadoop-specific "usual suspects" that are worth checking to see whether they are responsible for a performance problem.