Note also that the job history is persistent, so you can find jobs there from previous runs of the resource manager, too.

Job History

Job history refers to the events and configuration for a completed MapReduce job. It is retained regardless of whether the job was successful, in an attempt to provide useful information for the user running a job.

Job history files are stored in HDFS by the MapReduce application master, in a directory set by the mapreduce.jobhistory.done-dir property. Job history files are kept for one week before being deleted by the system.
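If you want to poke at these files directly, a small client program can do so; the sketch below is purely illustrative (it is not part of the book's example code, and the class name ListJobHistory is made up). It reads mapreduce.jobhistory.done-dir from the client configuration and lists what sits under that directory in HDFS. JobConf is used because it pulls in mapred-site.xml, where the property is normally set:

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

// Illustrative only: report where completed job history files are kept,
// then list the entries directly under that directory in HDFS.
public class ListJobHistory {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf();
    String doneDir = conf.get("mapreduce.jobhistory.done-dir");
    System.out.println("Job history done dir: " + doneDir);
    FileSystem fs = FileSystem.get(conf);
    for (FileStatus status : fs.listStatus(new Path(doneDir))) {
      System.out.println(status.getPath());
    }
  }
}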
The history log includes job, task, and attempt events, all of which are stored in a file in JSON format. The history for a particular job may be viewed through the web UI for the job history server (which is linked to from the resource manager page) or via the command line using mapred job -history (which you point at the job history file).

The MapReduce job page

Clicking on the link for the “Tracking UI” takes us to the application master’s web UI (or to the history page if the application has completed).
In the case of MapReduce, this takes us to the job page, illustrated in Figure 6-2.

Figure 6-2. Screenshot of the job page

While the job is running, you can monitor its progress on this page. The table at the bottom shows the map progress and the reduce progress. “Total” shows the total number of map and reduce tasks for this job (a row for each). The other columns then show the state of these tasks: “Pending” (waiting to run), “Running,” or “Complete” (successfully run).

The lower part of the table shows the total number of failed and killed task attempts for the map or reduce tasks. Task attempts may be marked as killed if they are speculative execution duplicates, if the node they are running on dies, or if they are killed by a user. See “Task Failure” on page 193 for background on task failure.

There are also a number of useful links in the navigation.
For example, the “Configuration” link is to the consolidated configuration file for the job, containing all the properties and their values that were in effect during the job run. If you are unsure of what a particular property was set to, you can click through to inspect the file.
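If you would rather check a value from inside the job itself, a tiny mapper along these lines prints what a property resolved to at run time. This is a hypothetical illustration, not something the book uses; the class name and the property chosen (mapreduce.map.memory.mb) are just examples:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical identity-style mapper whose only extra job is to report,
// once per task, what a configuration property resolved to at run time.
public class ConfigEchoMapper
    extends Mapper<LongWritable, Text, Text, NullWritable> {

  @Override
  protected void setup(Context context) {
    String value = context.getConfiguration().get("mapreduce.map.memory.mb");
    System.err.println("mapreduce.map.memory.mb resolved to: " + value);
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Pass each input line through unchanged.
    context.write(value, NullWritable.get());
  }
}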
Retrieving the Results

Once the job is finished, there are various ways to retrieve the results. Each reducer produces one output file, so there are 30 part files named part-r-00000 to part-r-00029 in the max-temp directory.

As their names suggest, a good way to think of these “part” files is as parts of the max-temp “file.” If the output is large (which it isn’t in this case), it is important to have multiple parts so that more than one reducer can work in parallel. Usually, if a file is in this partitioned form, it can still be used easily enough—as the input to another MapReduce job, for example. In some cases, you can exploit the structure of multiple partitions to do a map-side join, for example (see “Map-Side Joins” on page 269).

This job produces a very small amount of output, so it is convenient to copy it from HDFS to our development machine. The -getmerge option to the hadoop fs command is useful here, as it gets all the files in the directory specified in the source pattern and merges them into a single file on the local filesystem:

% hadoop fs -getmerge max-temp max-temp-local
% sort max-temp-local | tail
1991	607
1992	605
1993	567
1994	568
1995	567
1996	561
1997	565
1998	568
1999	568
2000	558

We sorted the output, as the reduce output partitions are unordered (owing to the hash partition function).
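The partitions come out unordered because the default partitioner assigns a key to a reducer based on the key’s hash, not its sort order. The sketch below simply writes out the logic of Hadoop’s built-in HashPartitioner for illustration (the class name HashPartitionerSketch is made up):

import org.apache.hadoop.mapreduce.Partitioner;

// Illustration of the default partitioning rule: a key goes to the reducer
// given by its hash code modulo the number of reduce tasks, so keys (years,
// in our case) are scattered across the 30 output files in no useful order.
public class HashPartitionerSketch<K, V> extends Partitioner<K, V> {
  @Override
  public int getPartition(K key, V value, int numReduceTasks) {
    // Mask off the sign bit so the partition number is never negative.
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}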
Doing a bit of postprocessing of data from MapReduce is very common, as is feeding it into analysis tools such as R, a spreadsheet, or even a relational database.

Another way of retrieving the output if it is small is to use the -cat option to print the output files to the console:

% hadoop fs -cat max-temp/*

On closer inspection, we see that some of the results don’t look plausible.
For instance, the maximum temperature for 1951 (not shown here) is 590°C! How do we find out what’s causing this? Is it corrupt input data or a bug in the program?

Debugging a Job

The time-honored way of debugging programs is via print statements, and this is certainly possible in Hadoop.
However, there are complications to consider: with programs running on tens, hundreds, or thousands of nodes, how do we find and examine the output of the debug statements, which may be scattered across these nodes? For this particular case, where we are looking for (what we think is) an unusual case, we can use a debug statement to log to standard error, in conjunction with updating the task’s status message to prompt us to look in the error log. The web UI makes this easy, as we will see.

We also create a custom counter to count the total number of records with implausible temperatures in the whole dataset. This gives us valuable information about how to deal with the condition.
If it turns out to be a common occurrence, we might need to learn more about the condition and how to extract the temperature in these cases, rather than simply dropping the records. In fact, when trying to debug a job, you should always ask yourself if you can use a counter to get the information you need to find out what’s happening. Even if you need to use logging or a status message, it may be useful to use a counter to gauge the extent of the problem. (There is more on counters in “Counters” on page 247.)
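As an aside (this snippet is not from the book), once a job has completed the driver can read a counter back through the Job object. Something along the lines of the sketch below would report how widespread the problem is, assuming a counter defined by an enum field such as the Temperature.OVER_100 field in the version 3 mapper shown shortly; the class and method names here are hypothetical:

import org.apache.hadoop.mapreduce.Counter;
import org.apache.hadoop.mapreduce.Job;

// Sketch: after job.waitForCompletion() returns, pull the OVER_100 counter
// out of the job's counters and print how many suspect records were seen.
public class CounterReport {
  public static void printSuspectRecordCount(Job job) throws Exception {
    Counter counter = job.getCounters()
        .findCounter(MaxTemperatureMapper.Temperature.OVER_100);
    System.out.printf("Records with implausible temperatures: %d%n",
        counter.getValue());
  }
}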
If the amount of log data you produce in the course of debugging is large, you have a couple of options. One is to write the information to the map’s output, rather than to standard error, for analysis and aggregation by the reduce task. This approach usually necessitates structural changes to your program, so start with the other technique first. The alternative is to write a program (in MapReduce, of course) to analyze the logs produced by your job.

We add our debugging to the mapper (version 3), as opposed to the reducer, as we want to find out what the source data causing the anomalous output looks like:

public class MaxTemperatureMapper
  extends Mapper<LongWritable, Text, Text, IntWritable> {

  enum Temperature {
    OVER_100
  }

  private NcdcRecordParser parser = new NcdcRecordParser();

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    parser.parse(value);
    if (parser.isValidTemperature()) {
      int airTemperature = parser.getAirTemperature();
      if (airTemperature > 1000) {
        System.err.println("Temperature over 100 degrees for input: " + value);
        context.setStatus("Detected possibly corrupt record: see logs.");
        context.getCounter(Temperature.OVER_100).increment(1);
      }
      context.write(new Text(parser.getYear()), new IntWritable(airTemperature));
    }
  }
}
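The NcdcRecordParser helper used above belongs to the book’s accompanying example code rather than to this excerpt. As a rough, simplified stand-in (an assumption based on the fixed-width NCDC record format used earlier in the book, not the book’s actual listing), it might look like this:

import org.apache.hadoop.io.Text;

// Simplified stand-in for the book's NcdcRecordParser, assuming the
// fixed-width NCDC line format: the year at offsets 15-19, a signed air
// temperature (in tenths of a degree Celsius) at offsets 87-92, and a
// quality code at offset 92.
public class NcdcRecordParser {

  private static final int MISSING_TEMPERATURE = 9999;

  private String year;
  private int airTemperature;
  private String quality;

  public void parse(String record) {
    year = record.substring(15, 19);
    String airTemperatureString;
    // Skip a leading plus sign (older versions of Integer.parseInt reject it).
    if (record.charAt(87) == '+') {
      airTemperatureString = record.substring(88, 92);
    } else {
      airTemperatureString = record.substring(87, 92);
    }
    airTemperature = Integer.parseInt(airTemperatureString);
    quality = record.substring(92, 93);
  }

  public void parse(Text record) {
    parse(record.toString());
  }

  public boolean isValidTemperature() {
    return airTemperature != MISSING_TEMPERATURE && quality.matches("[01459]");
  }

  public String getYear() {
    return year;
  }

  public int getAirTemperature() {
    return airTemperature;
  }
}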
If the temperature is over 100°C (represented by 1000, because temperatures are in tenths of a degree), we print a line to standard error with the suspect line, as well as updating the map’s status message using the setStatus() method on Context, directing us to look in the log. We also increment a counter, which in Java is represented by a field of an enum type. In this program, we have defined a single field, OVER_100, as a way to count the number of records with a temperature of over 100°C.

With this modification, we recompile the code, re-create the JAR file, then rerun the job and, while it’s running, go to the tasks page.

The tasks and task attempts pages

The job page has a number of links for viewing the tasks in a job in more detail. For example, clicking on the “Map” link brings us to a page that lists information for all of the map tasks. The screenshot in Figure 6-3 shows this page for the job run with our debugging statements in the “Status” column for the task.

Figure 6-3. Screenshot of the tasks page

Clicking on the task link takes us to the task attempts page, which shows each task attempt for the task.