Tom White - Hadoop: The Definitive Guide, 4th edition - 2015 (811394), page 10


And is in the process of committing
14/09/16 09:48:40 INFO mapred.LocalJobRunner: 1 / 1 copied.
14/09/16 09:48:40 INFO mapred.Task: Task attempt_local26392882_0001_r_000000_0 is allowed to commit now
14/09/16 09:48:40 INFO output.FileOutputCommitter: Saved output of task 'attempt...local26392882_0001_r_000000_0' to file:/Users/tom/book-workspace/hadoop-book/output/_temporary/0/task_local26392882_0001_r_000000
14/09/16 09:48:40 INFO mapred.LocalJobRunner: reduce > reduce
14/09/16 09:48:40 INFO mapred.Task: Task 'attempt_local26392882_0001_r_000000_0' done.
14/09/16 09:48:40 INFO mapred.LocalJobRunner: Finishing task: attempt_local26392882_0001_r_000000_0
14/09/16 09:48:40 INFO mapred.LocalJobRunner: reduce task executor complete.
14/09/16 09:48:41 INFO mapreduce.Job: Job job_local26392882_0001 running in uber mode : false
14/09/16 09:48:41 INFO mapreduce.Job: map 100% reduce 100%
14/09/16 09:48:41 INFO mapreduce.Job: Job job_local26392882_0001 completed successfully
14/09/16 09:48:41 INFO mapreduce.Job: Counters: 30
    File System Counters
        FILE: Number of bytes read=377168
        FILE: Number of bytes written=828464
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
    Map-Reduce Framework
        Map input records=5
        Map output records=5
        Map output bytes=45
        Map output materialized bytes=61
        Input split bytes=129
        Combine input records=0
        Combine output records=0
        Reduce input groups=2
        Reduce shuffle bytes=61
        Reduce input records=5
        Reduce output records=2
        Spilled Records=10
        Shuffled Maps =1
        Failed Shuffles=0
        Merged Map outputs=1
        GC time elapsed (ms)=39
        Total committed heap usage (bytes)=226754560
    File Input Format Counters
        Bytes Read=529
    File Output Format Counters
        Bytes Written=29

When the hadoop command is invoked with a classname as the first argument, it launches a Java virtual machine (JVM) to run the class.

The hadoop command adds the Hadoop libraries (and their dependencies) to the classpath and picks up the Hadoop configuration, too. To add the application classes to the classpath, we’ve defined an environment variable called HADOOP_CLASSPATH, which the hadoop script picks up.

When running in local (standalone) mode, the programs in this book all assume that you have set the HADOOP_CLASSPATH in this way. The commands should be run from the directory that the example code is installed in.

The output from running the job provides some useful information. For example, we can see that the job was given an ID of job_local26392882_0001, and it ran one map task and one reduce task (with the following IDs: attempt_local26392882_0001_m_000000_0 and attempt_local26392882_0001_r_000000_0). Knowing the job and task IDs can be very useful when debugging MapReduce jobs.

The last section of the output, titled “Counters,” shows the statistics that Hadoop generates for each job it runs.

These are very useful for checking whether the amount of data processed is what you expected. For example, we can follow the number of records that went through the system: five map input records produced five map output records (since the mapper emitted one output record for each valid input record), then five reduce input records in two groups (one for each unique key) produced two reduce output records.

The output was written to the output directory, which contains one output file per reducer.
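The record counts described above can be reproduced with a small in-memory simulation. This is only a sketch in plain Python, not Hadoop code: the five (year, temperature) readings are hypothetical values chosen to be consistent with the counters in the job output, and the function names and the `counters` dictionary are illustrative assumptions.

```python
from collections import defaultdict

# Hypothetical (year, temperature in tenths of a degree Celsius) readings,
# assumed for illustration; they are not taken from the job's input file.
records = [(1950, 0), (1950, 22), (1950, -11), (1949, 111), (1949, 78)]

# Tallies named after the corresponding Hadoop counters.
counters = defaultdict(int)

def map_phase(inputs):
    """Emit one (year, temperature) pair for each input record."""
    for year, temp in inputs:
        counters["Map input records"] += 1
        counters["Map output records"] += 1
        yield year, temp

def shuffle(pairs):
    """Group the map output values by key and return groups in key order."""
    groups = defaultdict(list)
    for key, value in pairs:
        counters["Reduce input records"] += 1
        groups[key].append(value)
    counters["Reduce input groups"] = len(groups)
    return sorted(groups.items())

def reduce_phase(grouped):
    """Emit the maximum temperature for each year."""
    for year, temps in grouped:
        counters["Reduce output records"] += 1
        yield year, max(temps)

result = list(reduce_phase(shuffle(map_phase(records))))

for year, temp in result:
    print(f"{year}\t{temp}")
for name, value in counters.items():
    print(f"{name}={value}")
```

Comparing the printed tallies with the counters in the real job output is the same sanity check the text recommends: if the mapper had dropped records or the keys had been parsed incorrectly, the record and group counts would differ.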

The job had a single reducer, so we find a single file, named part-r-00000:

% cat output/part-r-00000
1949 111
1950 22

This result is the same as when we went through it by hand earlier. We interpret this as saying that the maximum temperature recorded in 1949 was 11.1°C, and in 1950 it was 2.2°C.

Scaling Out

You’ve seen how MapReduce works for small inputs; now it’s time to take a bird’s-eye view of the system and look at the data flow for large inputs. For simplicity, the examples so far have used files on the local filesystem.

However, to scale out, we need to store the data in a distributed filesystem (typically HDFS, which you’ll learn about in the next chapter). This allows Hadoop to move the MapReduce computation to each machine hosting a part of the data, using Hadoop’s resource management system, called YARN (see Chapter 4). Let’s see how this works.

Data Flow

First, some terminology. A MapReduce job is a unit of work that the client wants to be performed: it consists of the input data, the MapReduce program, and configuration information. Hadoop runs the job by dividing it into tasks, of which there are two types: map tasks and reduce tasks. The tasks are scheduled using YARN and run on nodes in the cluster. If a task fails, it will be automatically rescheduled to run on a different node.

Hadoop divides the input to a MapReduce job into fixed-size pieces called input splits, or just splits.

Hadoop creates one map task for each split, which runs the user-defined map function for each record in the split.

Having many splits means the time taken to process each split is small compared to the time to process the whole input. So if we are processing the splits in parallel, the processing is better load balanced when the splits are small, since a faster machine will be able to process proportionally more splits over the course of the job than a slower machine. Even if the machines are identical, failed processes or other jobs running concurrently make load balancing desirable, and the quality of the load balancing increases as the splits become more fine grained.

On the other hand, if splits are too small, the overhead of managing the splits and map task creation begins to dominate the total job execution time.
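The trade-off between load balancing and per-task overhead can be explored with a toy model. The sketch below is not how Hadoop schedules tasks; it assumes a simple greedy policy (each split goes to whichever machine becomes free first), a fixed per-task startup overhead, and made-up machine speeds. All names and numbers are hypothetical.

```python
import heapq
import math

def makespan(input_mb, split_mb, speeds_mb_per_s, overhead_s):
    """Return (number of splits, job finish time in seconds) for a toy cluster.

    Each split is handed to the machine that becomes free first, and every
    task pays a fixed startup overhead before processing its split.
    """
    n_splits = math.ceil(input_mb / split_mb)
    # Min-heap of (time at which the machine becomes free, machine index).
    free_at = [(0.0, i) for i in range(len(speeds_mb_per_s))]
    heapq.heapify(free_at)
    remaining = input_mb
    finish = 0.0
    for _ in range(n_splits):
        size = min(split_mb, remaining)  # the last split may be smaller
        remaining -= size
        t, i = heapq.heappop(free_at)
        t += overhead_s + size / speeds_mb_per_s[i]
        finish = max(finish, t)
        heapq.heappush(free_at, (t, i))
    return n_splits, finish

# Assumed setup: 1 GB of input, one fast and one slow machine,
# and 0.1 s of overhead per task.
results = {
    split_mb: makespan(1024, split_mb, [100, 50], 0.1)
    for split_mb in (512, 64, 1)
}
for split_mb, (n_splits, seconds) in results.items():
    print(f"split={split_mb} MB  tasks={n_splits}  finish={seconds:.2f} s")
```

Under these assumptions, the 64 MB splits finish sooner than the 512 MB splits because the faster machine picks up more of the work, while the 1 MB splits finish much later because the per-task overhead outweighs the processing time.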

For most jobs, a good split size tends to be the size of an HDFS block, which is 128 MB by default, although this can be changed for the cluster (for all newly created files) or specified when each file is created.

Hadoop does its best to run the map task on a node where the input data resides in HDFS, because it doesn’t use valuable cluster bandwidth. This is called the data locality optimization. Sometimes, however, all the nodes hosting the HDFS block replicas for a map task’s input split are running other map tasks, so the job scheduler will look for a free map slot on a node in the same rack as one of the blocks. Very occasionally even this is not possible, so an off-rack node is used, which results in an inter-rack network transfer.
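The preference order just described (data-local, then rack-local, then off-rack) can be sketched as a small selection function. This is an illustration of the idea only, not the YARN scheduler's actual logic; the function name, the node and rack names, and the `rack_of` mapping are all hypothetical.

```python
def choose_node(replica_nodes, free_nodes, rack_of):
    """Pick a node for a map task and report the locality level achieved.

    replica_nodes: set of nodes holding a replica of the split's block
    free_nodes:    list of nodes that currently have a free map slot
    rack_of:       dict mapping each node name to its rack name
    """
    # 1. Data-local: a free node that already stores the block.
    for node in free_nodes:
        if node in replica_nodes:
            return node, "data-local"
    # 2. Rack-local: a free node in the same rack as one of the replicas.
    replica_racks = {rack_of[node] for node in replica_nodes}
    for node in free_nodes:
        if rack_of[node] in replica_racks:
            return node, "rack-local"
    # 3. Off-rack: any free node; the block crosses racks over the network.
    if free_nodes:
        return free_nodes[0], "off-rack"
    return None, None  # no free slot anywhere; the task has to wait

# Hypothetical five-node, three-rack cluster.
rack_of = {"n1": "r1", "n2": "r1", "n3": "r2", "n4": "r2", "n5": "r3"}
replicas = {"n1", "n3"}

print(choose_node(replicas, ["n2", "n3"], rack_of))
print(choose_node(replicas, ["n2", "n5"], rack_of))
print(choose_node(replicas, ["n5"], rack_of))
```

The three calls exercise the three cases in turn: a free node that holds a replica, a free node that only shares a rack with a replica, and a free node in a rack with no replica at all.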

The three possibilities are illustrated in Figure 2-2.

It should now be clear why the optimal split size is the same as the block size: it is the largest size of input that can be guaranteed to be stored on a single node. If the split spanned two blocks, it would be unlikely that any HDFS node stored both blocks, so some of the split would have to be transferred across the network to the node running the map task, which is clearly less efficient than running the whole map task using local data.

Map tasks write their output to the local disk, not to HDFS. Why is this? Map output is intermediate output: it’s processed by reduce tasks to produce the final output, and once the job is complete, the map output can be thrown away. So, storing it in HDFS with replication would be overkill.

If the node running the map task fails before the map output has been consumed by the reduce task, then Hadoop will automatically rerun the map task on another node to re-create the map output.

Figure 2-2. Data-local (a), rack-local (b), and off-rack (c) map tasks

Reduce tasks don’t have the advantage of data locality; the input to a single reduce task is normally the output from all mappers. In the present example, we have a single reduce task that is fed by all of the map tasks.

Therefore, the sorted map outputs have to be transferred across the network to the node where the reduce task is running, where they are merged and then passed to the user-defined reduce function. The output of the reduce is normally stored in HDFS for reliability.
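The merge step on the reduce side can be illustrated in a few lines of plain Python. This is a sketch of the idea, not Hadoop's implementation: it assumes two hypothetical map tasks whose outputs are already sorted by key, and the data values and function names are made up for the example.

```python
import heapq
from itertools import groupby
from operator import itemgetter

# Hypothetical outputs of two map tasks, each already sorted by key (year).
map_outputs = [
    [(1949, 111), (1950, 0)],
    [(1949, 78), (1950, -11), (1950, 22)],
]

def reduce_max(year, temps):
    """User-defined reduce function: maximum temperature for one year."""
    return year, max(temps)

def run_reduce(sorted_runs, reduce_fn):
    """Merge sorted runs, then call reduce_fn once per distinct key."""
    # heapq.merge streams the runs in key order without re-sorting them.
    merged = heapq.merge(*sorted_runs, key=itemgetter(0))
    # groupby relies on equal keys being adjacent, which the merge guarantees.
    for key, group in groupby(merged, key=itemgetter(0)):
        yield reduce_fn(key, (value for _, value in group))

reduce_output = list(run_reduce(map_outputs, reduce_max))
print(reduce_output)
```

Because each run is already sorted, the merge only needs to look at the head of each run at any moment, which is why sorting on the map side makes the reduce-side merge cheap.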
