You should run through the checklist in Table 6-3 before you start trying to profile or optimize at the task level.

Table 6-3. Tuning checklist

Area: Number of mappers
Best practice: How long are your mappers running for? If they are only running for a few seconds on average, you should see whether there's a way to have fewer mappers and make them all run longer (a minute or so, as a rule of thumb). The extent to which this is possible depends on the input format you are using.
Further information: "Small files and CombineFileInputFormat" on page 226

Area: Number of reducers
Best practice: Check that you are using more than a single reducer. Reduce tasks should run for five minutes or so and produce at least a block's worth of data, as a rule of thumb.
Further information: "Choosing the Number of Reducers" on page 217

Area: Combiners
Best practice: Check whether your job can take advantage of a combiner to reduce the amount of data passing through the shuffle.
Further information: "Combiner Functions" on page 34

Area: Intermediate compression
Best practice: Job execution time can almost always benefit from enabling map output compression.
Further information: "Compressing map output" on page 108

Area: Custom serialization
Best practice: If you are using your own custom Writable objects or custom comparators, make sure you have implemented RawComparator.
Further information: "Implementing a RawComparator for speed" on page 123

Area: Shuffle tweaks
Best practice: The MapReduce shuffle exposes around a dozen tuning parameters for memory management, which may help you wring out the last bit of performance.
Further information: "Configuration Tuning" on page 201

Profiling Tasks

Like debugging, profiling a job running on a distributed system such as MapReduce presents some challenges.
Hadoop allows you to profile a fraction of the tasks in a job and, as each task completes, pulls down the profile information to your machine for later analysis with standard profiling tools.

Of course, it's possible, and somewhat easier, to profile a job running in the local job runner. And provided you can run with enough input data to exercise the map and reduce tasks, this can be a valuable way of improving the performance of your mappers and reducers. There are a couple of caveats, however.
The local job runner is a very different environment from a cluster, and the data flow patterns are very different. Optimizing the CPU performance of your code may be pointless if your MapReduce job is I/O-bound (as many jobs are). To be sure that any tuning is effective, you should compare the new execution time with the old one running on a real cluster. Even this is easier said than done, since job execution times can vary due to resource contention with other jobs and the decisions the scheduler makes regarding task placement.
To get a good idea of job execution time under these circumstances, perform a series of runs (with and without the change) and check whether any improvement is statistically significant.

It's unfortunately true that some problems (such as excessive memory use) can be reproduced only on the cluster, and in these cases the ability to profile in situ is indispensable.

The HPROF profiler

There are a number of configuration properties to control profiling, which are also exposed via convenience methods on JobConf. Enabling profiling is as simple as setting the property mapreduce.task.profile to true:

% hadoop jar hadoop-examples.jar v4.MaxTemperatureDriver \
    -conf conf/hadoop-cluster.xml \
    -D mapreduce.task.profile=true \
    input/ncdc/all max-temp

This runs the job as normal, but adds an -agentlib parameter to the Java command used to launch the task containers on the node managers. You can control the precise parameter that is added by setting the mapreduce.task.profile.params property. The default uses HPROF, a profiling tool that comes with the JDK that, although basic, can give valuable information about a program's CPU and heap usage.

It doesn't usually make sense to profile all tasks in the job, so by default only those with IDs 0, 1, and 2 are profiled (for both maps and reduces). You can change this by setting mapreduce.task.profile.maps and mapreduce.task.profile.reduces to specify the range of task IDs to profile.
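If you would rather keep these settings in your driver code than pass them on the command line, the same properties can be set on the job's configuration (here using the property names directly rather than the JobConf convenience methods). The following is a minimal sketch under that assumption; the ProfiledMaxTemperatureDriver class name is made up for illustration, and the rest of the job setup is elided.

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class ProfiledMaxTemperatureDriver extends Configured implements Tool {

  @Override
  public int run(String[] args) throws Exception {
    Job job = Job.getInstance(getConf(), "Max temperature (profiled)");
    job.setJarByClass(ProfiledMaxTemperatureDriver.class);

    // Equivalent to passing -D mapreduce.task.profile=true on the command line
    job.getConfiguration().setBoolean("mapreduce.task.profile", true);
    // Restrict profiling to the first map and reduce task only
    // (the default range is tasks 0-2 for both maps and reduces)
    job.getConfiguration().set("mapreduce.task.profile.maps", "0");
    job.getConfiguration().set("mapreduce.task.profile.reduces", "0");

    // ... set the mapper, reducer, and input/output paths as usual ...

    return job.waitForCompletion(true) ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(new ProfiledMaxTemperatureDriver(), args));
  }
}

For one-off profiling runs the command-line form shown earlier is usually more convenient, since it does not require recompiling the driver.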
The profile output for each task is saved with the task logs in the userlogs subdirectory of the node manager's local log directory (alongside the syslog, stdout, and stderr files), and can be retrieved in the way described in "Hadoop Logs" on page 172, according to whether log aggregation is enabled or not.

MapReduce Workflows

So far in this chapter, you have seen the mechanics of writing a program using MapReduce.
We haven't yet considered how to turn a data processing problem into the MapReduce model.

The data processing you have seen so far in this book is to solve a fairly simple problem: finding the maximum recorded temperature for given years. When the processing gets more complex, this complexity is generally manifested by having more MapReduce jobs, rather than having more complex map and reduce functions. In other words, as a rule of thumb, think about adding more jobs, rather than adding complexity to jobs.

For more complex problems, it is worth considering a higher-level language than MapReduce, such as Pig, Hive, Cascading, Crunch, or Spark.
One immediate benefit is that it frees you from having to do the translation into MapReduce jobs, allowing you to concentrate on the analysis you are performing.

Finally, the book Data-Intensive Text Processing with MapReduce by Jimmy Lin and Chris Dyer (Morgan & Claypool Publishers, 2010) is a great resource for learning more about MapReduce algorithm design and is highly recommended.

Decomposing a Problem into MapReduce Jobs

Let's look at an example of a more complex problem that we want to translate into a MapReduce workflow.

Imagine that we want to find the mean maximum recorded temperature for every day of the year and every weather station. In concrete terms, to calculate the mean maximum daily temperature recorded by station 029070-99999, say, on January 1, we take the mean of the maximum daily temperatures for this station for January 1, 1901; January 1, 1902; and so on, up to January 1, 2000.

How can we compute this using MapReduce? The computation decomposes most naturally into two stages:
1. Compute the maximum daily temperature for every station-date pair.

   The MapReduce program in this case is a variant of the maximum temperature program, except that the keys in this case are a composite station-date pair, rather than just the year.

2. Compute the mean of the maximum daily temperatures for every station-day-month key.

   The mapper takes the output from the previous job (station-date, maximum temperature) records and projects it into (station-day-month, maximum temperature) records by dropping the year component. The reduce function then takes the mean of the maximum temperatures for each station-day-month key.

The output from the first stage looks like this for the station we are interested in (the mean_max_daily_temp.sh script in the examples provides an implementation in Hadoop Streaming):

029070-99999  19010101  0
029070-99999  19020101  -94
...

The first two fields form the key, and the final column is the maximum temperature from all the readings for the given station and date.
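To make the second stage more concrete, here is a rough Java sketch of what its mapper and reducer might look like. This is not the book's implementation (the examples provide mean_max_daily_temp.sh for Hadoop Streaming); the class names, the assumption that the stage-one output fields are tab-separated (as TextOutputFormat would produce), and the choice to report the mean as an integer in tenths of a degree are all made up for illustration.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Projects (station-date, maximum temperature) records into
// (station-day-month, maximum temperature) records by dropping the year.
public class StationDayMonthMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Input line: station <tab> yyyymmdd <tab> max temperature
    String[] fields = value.toString().split("\t");
    String station = fields[0];
    String dayMonth = fields[1].substring(4); // keep mmdd, drop yyyy
    int maxTemp = Integer.parseInt(fields[2].trim());
    context.write(new Text(station + "\t" + dayMonth), new IntWritable(maxTemp));
  }
}

// Takes the mean of the maximum temperatures for each station-day-month key.
// (Shown in the same listing for brevity; it would normally live in its own file.)
class MeanMaxTemperatureReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    long sum = 0;
    int count = 0;
    for (IntWritable value : values) {
      sum += value.get();
      count++;
    }
    context.write(key, new IntWritable((int) (sum / count)));
  }
}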
The second stage averages these daily maxima over years to yield:

029070-99999  0101  -68

which is interpreted as saying the mean maximum daily temperature on January 1 for station 029070-99999 over the century is −6.8°C.

It's possible to do this computation in one MapReduce stage, but it takes more work on the part of the programmer.2

2. It's an interesting exercise to do this.

The arguments for having more (but simpler) MapReduce stages are that doing so leads to more composable and more maintainable mappers and reducers. Some of the case studies referred to in Part V cover real-world problems that were solved using MapReduce, and in each case, the data processing task is implemented using two or more MapReduce jobs. The details in that chapter are invaluable for getting a better idea of how to decompose a processing problem into a MapReduce workflow.

It's possible to make map and reduce functions even more composable than we have done.
A mapper commonly performs input format parsing, projection (selecting the relevant fields), and filtering (removing records that are not of interest). In the mappers you have seen so far, we have implemented all of these functions in a single mapper. However, there is a case for splitting these into distinct mappers and chaining them into a single mapper using the ChainMapper library class that comes with Hadoop. Combined with a ChainReducer, you can run a chain of mappers, followed by a reducer and another chain of mappers, in a single MapReduce job.

JobControl

When there is more than one job in a MapReduce workflow, the question arises: how do you manage the jobs so they are executed in order? There are several approaches, and the main consideration is whether you have a linear chain of jobs or a more complex directed acyclic graph (DAG) of jobs.
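For the simple case of a linear chain, one workable approach is to run the jobs one after another from a single driver, starting a job only once the one it depends on has completed successfully. The sketch below assumes two hypothetical helper methods, createFirstJob() and createSecondJob(), that would configure the two stages of the workflow described earlier; it is meant only to illustrate the sequencing, not to be a complete driver.

import org.apache.hadoop.mapreduce.Job;

// A minimal sketch of running a linear chain of two jobs in order.
public class TwoStageDriver {

  public static void main(String[] args) throws Exception {
    Job first = createFirstJob(args);
    if (!first.waitForCompletion(true)) {
      System.exit(1); // don't start the second stage if the first failed
    }
    Job second = createSecondJob(args);
    System.exit(second.waitForCompletion(true) ? 0 : 1);
  }

  // Hypothetical helpers: each would set the mapper, reducer, and
  // input/output paths for its stage.
  private static Job createFirstJob(String[] args) throws Exception {
    Job job = Job.getInstance();
    // ... configure stage 1: maximum daily temperature per station-date pair ...
    return job;
  }

  private static Job createSecondJob(String[] args) throws Exception {
    Job job = Job.getInstance();
    // ... configure stage 2: mean of the daily maxima per station-day-month key ...
    return job;
  }
}

Something like this is fine for a strictly linear chain; a DAG of jobs, where each job should start as soon as its own predecessors finish, calls for a mechanism that tracks dependencies, which is what the JobControl class after which this section is named is designed for.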