Tom White - Hadoop The Definitive Guide_ 4 edition - 2015 (811394), page 44

File No. 811394: Tom White - Hadoop The Definitive Guide_ 4 edition - 2015.pdf, page 44. 2020-08-25, StudIzba


This property is also used in the reduce. It’s fairly common to increase this to 100.

mapreduce.map.combine.minspills | int | 3 | The minimum number of spill files needed for the combiner to run (if a combiner is specified).
mapreduce.map.output.compress | boolean | false | Whether to compress map outputs.
mapreduce.map.output.compress.codec | Class name | org.apache.hadoop.io.compress.DefaultCodec | The compression codec to use for map outputs.
mapreduce.shuffle.max.threads | int | 0 | The number of worker threads per node manager for serving the map outputs to reducers. This is a cluster-wide setting and cannot be set by individual jobs. 0 means use the Netty default of twice the number of available processors.

Table 7-2. Reduce-side tuning properties

Property name | Type | Default value | Description
mapreduce.reduce.shuffle.parallelcopies | int | 5 | The number of threads used to copy map outputs to the reducer.
mapreduce.reduce.shuffle.maxfetchfailures | int | 10 | The number of times a reducer tries to fetch a map output before reporting the error.
mapreduce.task.io.sort.factor | int | 10 | The maximum number of streams to merge at once when sorting files. This property is also used in the map.
mapreduce.reduce.shuffle.input.buffer.percent | float | 0.70 | The proportion of total heap size to be allocated to the map outputs buffer during the copy phase of the shuffle.
mapreduce.reduce.shuffle.merge.percent | float | 0.66 | The threshold usage proportion for the map outputs buffer (defined by mapred.job.shuffle.input.buffer.percent) for starting the process of merging the outputs and spilling to disk.
mapreduce.reduce.merge.inmem.threshold | int | 1000 | The threshold number of map outputs for starting the process of merging the outputs and spilling to disk. A value of 0 or less means there is no threshold, and the spill behavior is governed solely by mapreduce.reduce.shuffle.merge.percent.
mapreduce.reduce.input.buffer.percent | float | 0.0 | The proportion of total heap size to be used for retaining map outputs in memory during the reduce. For the reduce phase to begin, the size of map outputs in memory must be no more than this size. By default, all map outputs are merged to disk before the reduce begins, to give the reducers as much memory as possible. However, if your reducers require less memory, this value may be increased to minimize the number of trips to disk.

Task Execution

We saw how the MapReduce system executes tasks in the context of the overall job at the beginning of this chapter, in “Anatomy of a MapReduce Job Run” on page 185. In this section, we’ll look at some more controls that MapReduce users have over task execution.

The Task Execution Environment

Hadoop provides information to a map or reduce task about the environment in which it is running. For example, a map task can discover the name of the file it is processing (see “File information in the mapper” on page 227), and a map or reduce task can find out the attempt number of the task.

The properties in Table 7-3 can be accessed from the job’s configuration, obtained in the old MapReduce API by providing an implementation of the configure() method for Mapper or Reducer, where the configuration is passed in as an argument. In the new API, these properties can be accessed from the context object passed to all methods of the Mapper or Reducer.

Table 7-3. Task environment properties

Property name | Type | Description | Example
mapreduce.job.id | String | The job ID (see “Job, Task, and Task Attempt IDs” on page 164 for a description of the format) | job_200811201130_0004
mapreduce.task.id | String | The task ID | task_200811201130_0004_m_000003
mapreduce.task.attempt.id | String | The task attempt ID | attempt_200811201130_0004_m_000003_0
mapreduce.task.partition | int | The index of the task within the job | 3
mapreduce.task.ismap | boolean | Whether this task is a map task | true

Streaming environment variables

Hadoop sets job configuration parameters as environment variables for Streaming programs.
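As a rough illustration for a Python Streaming script, the sketch below reads a task attempt ID from the environment and splits it into the related values shown in Table 7-3. It is only a sketch under stated assumptions: the variable name mapreduce_task_attempt_id assumes that the dots in the property name are exported as underscores, the ID layout is inferred from the example values in the table, and parse_attempt_id, its field names, and the fallback value are hypothetical.

```python
import os
import re

# Assumed layout, inferred from the Table 7-3 examples:
# attempt_<timestamp>_<job number>_<m or r>_<task number>_<attempt number>
ATTEMPT_ID = re.compile(r"^attempt_(\d+)_(\d+)_([mr])_(\d+)_(\d+)$")


def parse_attempt_id(attempt_id):
    """Split a task attempt ID into job ID, task ID, and numeric parts."""
    match = ATTEMPT_ID.match(attempt_id)
    if match is None:
        raise ValueError("unrecognized task attempt ID: %r" % attempt_id)
    stamp, job, kind, task, attempt = match.groups()
    return {
        "job_id": "job_%s_%s" % (stamp, job),
        "task_id": "task_%s_%s_%s_%s" % (stamp, job, kind, task),
        "task_number": int(task),
        "is_map": kind == "m",
        "attempt": int(attempt),
    }


# Hypothetical fallback so the sketch also runs outside a Streaming task.
raw_id = os.environ.get("mapreduce_task_attempt_id",
                        "attempt_200811201130_0004_m_000003_0")
info = parse_attempt_id(raw_id)
```

A script could then consult info["attempt"], for example to log whether it is running as a retried or duplicate attempt.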

In the variable names, Hadoop replaces nonalphanumeric characters with underscores to make sure they are valid names. The following Python expression illustrates how you can retrieve the value of the mapreduce.job.id property from within a Python Streaming script:

    os.environ["mapreduce_job_id"]

You can also set environment variables for the Streaming processes launched by MapReduce by supplying the -cmdenv option to the Streaming launcher program (once for each variable you wish to set). For example, the following sets the MAGIC_PARAMETER environment variable:

    -cmdenv MAGIC_PARAMETER=abracadabra

Speculative Execution

The MapReduce model is to break jobs into tasks and run the tasks in parallel to make the overall job execution time smaller than it would be if the tasks ran sequentially. This makes the job execution time sensitive to slow-running tasks, as it takes only one slow task to make the whole job take significantly longer than it would have done otherwise. When a job consists of hundreds or thousands of tasks, the possibility of a few straggling tasks is very real.

Tasks may be slow for various reasons, including hardware degradation or software misconfiguration, but the causes may be hard to detect because the tasks still complete successfully, albeit after a longer time than expected.

Hadoop doesn’t try to diagnose and fix slow-running tasks; instead, it tries to detect when a task is running slower than expected and launches another equivalent task as a backup. This is termed speculative execution of tasks.

It’s important to understand that speculative execution does not work by launching two duplicate tasks at about the same time so they can race each other. This would be wasteful of cluster resources.

Rather, the scheduler tracks the progress of all tasks of the same type (map and reduce) in a job, and only launches speculative duplicates for the small proportion that are running significantly slower than the average. When a task completes successfully, any duplicate tasks that are running are killed since they are no longer needed. So, if the original task completes before the speculative task, the speculative task is killed; on the other hand, if the speculative task finishes first, the original is killed.

Speculative execution is an optimization, and not a feature to make jobs run more reliably. If there are bugs that sometimes cause a task to hang or slow down, relying on speculative execution to avoid these problems is unwise and won’t work reliably, since the same bugs are likely to affect the speculative task. You should fix the bug so that the task doesn’t hang or slow down.

Speculative execution is turned on by default.
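The selection rule described above, where only tasks lagging well behind their peers are duplicated, can be pictured with a toy model. This is not Hadoop's actual policy, which works from task runtime estimates; the function name pick_speculation_candidates, the progress dictionary, and the slack threshold of 0.2 are all assumptions made for illustration.

```python
def pick_speculation_candidates(progress, slack=0.2):
    """Return IDs of unfinished tasks whose progress trails the mean by more than slack.

    progress maps a task ID to a completion fraction between 0.0 and 1.0
    for tasks of one type (all maps or all reduces) in a single job.
    """
    if not progress:
        return []
    mean = sum(progress.values()) / len(progress)
    return sorted(
        task_id
        for task_id, done in progress.items()
        if done < 1.0 and mean - done > slack
    )


# Three map tasks are nearly or fully done; one is far behind.
map_progress = {
    "m_000000": 0.9,
    "m_000001": 0.95,
    "m_000002": 0.3,
    "m_000003": 1.0,
}
candidates = pick_speculation_candidates(map_progress)  # ["m_000002"]
```

With evenly progressing tasks the function returns an empty list, which mirrors the point that duplicates are launched only for a small proportion of tasks rather than for every task.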

Speculative execution can be enabled or disabled independently for map tasks and reduce tasks, on a cluster-wide basis, or on a per-job basis. The relevant properties are shown in Table 7-4.

Table 7-4. Speculative execution properties

Property name | Type | Default value | Description
mapreduce.map.speculative | boolean | true | Whether extra instances of map tasks may be launched if a task is making slow progress
mapreduce.reduce.speculative | boolean | true | Whether extra instances of reduce tasks may be launched if a task is making slow progress
yarn.app.mapreduce.am.job.speculator.class | Class | org.apache.hadoop.mapreduce.v2.app.speculate.DefaultSpeculator | The Speculator class implementing the speculative execution policy (MapReduce 2 only)
yarn.app.mapreduce.am.job.task.estimator.class | Class | org.apache.hadoop.mapreduce.v2.app.speculate.LegacyTaskRuntimeEstimator | An implementation of TaskRuntimeEstimator used by Speculator instances that provides estimates for task runtimes (MapReduce 2 only)

Why would you ever want to turn speculative execution off? The goal of speculative execution is to reduce job execution time, but this comes at the cost of cluster efficiency. On a busy cluster, speculative execution can reduce overall throughput, since redundant tasks are being executed in an attempt to bring down the execution time for a single job. For this reason, some cluster administrators prefer to turn it off on the cluster and have users explicitly turn it on for individual jobs.
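For the arrangement just described, with speculation off cluster-wide and switched on per job, one possible approach is to pass the two boolean properties from Table 7-4 as -D generic options at submission time. The helper below is a hypothetical sketch that only builds the argument list; it assumes the job's driver parses generic options (for example, by running through ToolRunner) and that the administrator has not marked the properties as final.

```python
def speculation_flags(map_enabled, reduce_enabled):
    """Build -D options overriding the speculative execution booleans for one job."""
    settings = {
        "mapreduce.map.speculative": map_enabled,
        "mapreduce.reduce.speculative": reduce_enabled,
    }
    flags = []
    for name, enabled in settings.items():
        # Hadoop boolean properties are written as lowercase true/false.
        flags += ["-D", "%s=%s" % (name, "true" if enabled else "false")]
    return flags


# Enable speculation for map tasks only, leaving reduce tasks without it.
args = speculation_flags(map_enabled=True, reduce_enabled=False)
```

The resulting list would be spliced into the job's command line after the main class name and before the job's own arguments.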

Turning speculative execution off by default was especially relevant for older versions of Hadoop, when speculative execution could be overly aggressive in scheduling speculative tasks.

There is a good case for turning off speculative execution for reduce tasks, since any duplicate reduce tasks have to fetch the same map outputs as the original task, and this can significantly increase network traffic on the cluster.

Another reason for turning off speculative execution is for nonidempotent tasks. However, in many cases it is possible to write tasks to be idempotent and use an OutputCommitter to promote the output to its final location when the task succeeds. This technique is explained in more detail in the next section.

Output Committers

Hadoop MapReduce uses a commit protocol to ensure that jobs and tasks either succeed or fail cleanly.

The behavior is implemented by the OutputCommitter in use for the job, which is set in the old MapReduce API by calling setOutputCommitter() on JobConf or by setting mapred.output.committer.class in the configuration. In the new MapReduce API, the OutputCommitter is determined by the OutputFormat, via its getOutputCommitter() method. The default is FileOutputCommitter, which is appropriate for file-based MapReduce. You can customize an existing OutputCommitter or even write a new implementation if you need to do special setup or cleanup for jobs or tasks.

The OutputCommitter API is as follows (in both the old and new MapReduce APIs):

    public abstract class OutputCommitter {
      public abstract void setupJob(JobContext jobContext) throws IOException;
      public void commitJob(JobContext jobContext) throws IOException { }
      public void abortJob(JobContext jobContext, JobStatus.State state)
          throws IOException { }
      public abstract void setupTask(TaskAttemptContext taskContext)
          throws IOException;
      public abstract boolean needsTaskCommit(TaskAttemptContext taskContext)
          throws IOException;
      public abstract void commitTask(TaskAttemptContext taskContext)
          throws IOException;
      public abstract void abortTask(TaskAttemptContext taskContext)
          throws IOException;
    }

The setupJob() method is called before the job is run, and is typically used to perform initialization.

For FileOutputCommitter, the method creates the final output directory, ${mapreduce.output.fileoutputformat.outputdir}, and a temporary working space for task output, _temporary, as a subdirectory underneath it.

If the job succeeds, the commitJob() method is called, which in the default file-based implementation deletes the temporary working space and creates a hidden empty marker file in the output directory called _SUCCESS to indicate to filesystem clients that the job completed successfully. If the job did not succeed, abortJob() is called with a state object indicating whether the job failed or was killed (by a user, for example).
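The job-level steps described above can be imitated on a local filesystem to make the directory layout concrete. This is a simplified sketch, not FileOutputCommitter itself: it uses a local temporary directory as a stand-in for the job output directory, the function names setup_job and commit_job are hypothetical, and task-level commits and the abort path are not modeled.

```python
import os
import shutil
import tempfile


def setup_job(output_dir):
    """Create the output directory with a _temporary working space beneath it."""
    os.makedirs(os.path.join(output_dir, "_temporary"))


def commit_job(output_dir):
    """Delete the working space and leave an empty _SUCCESS marker file."""
    shutil.rmtree(os.path.join(output_dir, "_temporary"))
    with open(os.path.join(output_dir, "_SUCCESS"), "w"):
        pass  # the marker carries no content


# Stand-in for the job's configured output directory.
output_dir = os.path.join(tempfile.mkdtemp(), "out")

setup_job(output_dir)
after_setup = sorted(os.listdir(output_dir))    # ["_temporary"]

commit_job(output_dir)
after_commit = sorted(os.listdir(output_dir))   # ["_SUCCESS"]
marker_size = os.path.getsize(os.path.join(output_dir, "_SUCCESS"))  # 0
```

A client that checks for the _SUCCESS file before reading the directory can tell a finished job apart from one that is still running or that stopped partway.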
