Note also that the job history is persistent, so you can find jobs there from previous runs of the resource manager, too.

Job History

Job history refers to the events and configuration for a completed MapReduce job. It is retained regardless of whether the job was successful, in an attempt to provide useful information for the user running a job.

Job history files are stored in HDFS by the MapReduce application master, in a directory set by the mapreduce.jobhistory.done-dir property. Job history files are kept for one week before being deleted by the system.
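If you want to poke at these files directly, a small client program can do so; the sketch below is purely illustrative (it is not part of the book's example code, and the class name ListJobHistory is made up). It reads mapreduce.jobhistory.done-dir from the client configuration and lists what sits under that directory in HDFS. JobConf is used because it pulls in mapred-site.xml, where the property is normally set:

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

// Illustrative only: report where completed job history files are kept,
// then list the entries directly under that directory in HDFS.
public class ListJobHistory {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf();
    String doneDir = conf.get("mapreduce.jobhistory.done-dir");
    System.out.println("Job history done dir: " + doneDir);
    FileSystem fs = FileSystem.get(conf);
    for (FileStatus status : fs.listStatus(new Path(doneDir))) {
      System.out.println(status.getPath());
    }
  }
}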
The history log includes job, task, and attempt events, all of which are stored in a file in JSON format. The history for a particular job may be viewed through the web UI for the job history server (which is linked to from the resource manager page) or via the command line using mapred job -history (which you point at the job history file).

The MapReduce job page

Clicking on the link for the “Tracking UI” takes us to the application master’s web UI (or to the history page if the application has completed).
In the case of MapReduce, this takes us to the job page, illustrated in Figure 6-2.

Figure 6-2. Screenshot of the job page

While the job is running, you can monitor its progress on this page. The table at the bottom shows the map progress and the reduce progress. “Total” shows the total number of map and reduce tasks for this job (a row for each). The other columns then show the state of these tasks: “Pending” (waiting to run), “Running,” or “Complete” (successfully run).

The lower part of the table shows the total number of failed and killed task attempts for the map or reduce tasks. Task attempts may be marked as killed if they are speculative execution duplicates, if the node they are running on dies, or if they are killed by a user. See “Task Failure” on page 193 for background on task failure.

There are also a number of useful links in the navigation.
For example, the “Configuration” link is to the consolidated configuration file for the job, containing all the properties and their values that were in effect during the job run. If you are unsure of what a particular property was set to, you can click through to inspect the file.
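If you would rather check a value from inside the job itself, a tiny mapper along these lines prints what a property resolved to at run time. This is a hypothetical illustration, not something the book uses; the class name and the property chosen (mapreduce.map.memory.mb) are just examples:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical identity-style mapper whose only extra job is to report,
// once per task, what a configuration property resolved to at run time.
public class ConfigEchoMapper
    extends Mapper<LongWritable, Text, Text, NullWritable> {

  @Override
  protected void setup(Context context) {
    String value = context.getConfiguration().get("mapreduce.map.memory.mb");
    System.err.println("mapreduce.map.memory.mb resolved to: " + value);
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Pass each input line through unchanged.
    context.write(value, NullWritable.get());
  }
}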
Retrieving the Results

Once the job is finished, there are various ways to retrieve the results. Each reducer produces one output file, so there are 30 part files named part-r-00000 to part-r-00029 in the max-temp directory.

As their names suggest, a good way to think of these “part” files is as parts of the max-temp “file.” If the output is large (which it isn’t in this case), it is important to have multiple parts so that more than one reducer can work in parallel. Usually, if a file is in this partitioned form, it can still be used easily enough—as the input to another MapReduce job, for example. In some cases, you can exploit the structure of multiple partitions to do a map-side join, for example (see “Map-Side Joins” on page 269).

This job produces a very small amount of output, so it is convenient to copy it from HDFS to our development machine. The -getmerge option to the hadoop fs command is useful here, as it gets all the files in the directory specified in the source pattern and merges them into a single file on the local filesystem:

% hadoop fs -getmerge max-temp max-temp-local
% sort max-temp-local | tail
1991	607
1992	605
1993	567
1994	568
1995	567
1996	561
1997	565
1998	568
1999	568
2000	558

We sorted the output, as the reduce output partitions are unordered (owing to the hash partition function).
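The partitions come out unordered because the default partitioner assigns a key to a reducer based on the key’s hash, not its sort order. The sketch below simply writes out the logic of Hadoop’s built-in HashPartitioner for illustration (the class name HashPartitionerSketch is made up):

import org.apache.hadoop.mapreduce.Partitioner;

// Illustration of the default partitioning rule: a key goes to the reducer
// given by its hash code modulo the number of reduce tasks, so keys (years,
// in our case) are scattered across the 30 output files in no useful order.
public class HashPartitionerSketch<K, V> extends Partitioner<K, V> {
  @Override
  public int getPartition(K key, V value, int numReduceTasks) {
    // Mask off the sign bit so the partition number is never negative.
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}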
Doing a bit of postprocessing of data from MapReduce is very common, as is feeding it into analysis tools such as R, a spreadsheet, or even a relational database.

Another way of retrieving the output if it is small is to use the -cat option to print the output files to the console:

% hadoop fs -cat max-temp/*

On closer inspection, we see that some of the results don’t look plausible.
For instance, the maximum temperature for 1951 (not shown here) is 590°C! How do we find out what’s causing this? Is it corrupt input data or a bug in the program?

Debugging a Job

The time-honored way of debugging programs is via print statements, and this is certainly possible in Hadoop.
However, there are complications to consider: with programs running on tens, hundreds, or thousands of nodes, how do we find and examine the output of the debug statements, which may be scattered across these nodes? For this particular case, where we are looking for (what we think is) an unusual case, we can use a debug statement to log to standard error, in conjunction with updating the task’s status message to prompt us to look in the error log. The web UI makes this easy, as we will see.

We also create a custom counter to count the total number of records with implausible temperatures in the whole dataset. This gives us valuable information about how to deal with the condition.
If it turns out to be a common occurrence, we might need to learn more about the condition and how to extract the temperature in these cases, rather than simply dropping the records. In fact, when trying to debug a job, you should always ask yourself if you can use a counter to get the information you need to find out what’s happening. Even if you need to use logging or a status message, it may be useful to use a counter to gauge the extent of the problem. (There is more on counters in “Counters” on page 247.)
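As an aside (this snippet is not from the book), once a job has completed the driver can read a counter back through the Job object. Something along the lines of the sketch below would report how widespread the problem is, assuming a counter defined by an enum field such as the Temperature.OVER_100 field in the version 3 mapper shown shortly; the class and method names here are hypothetical:

import org.apache.hadoop.mapreduce.Counter;
import org.apache.hadoop.mapreduce.Job;

// Sketch: after job.waitForCompletion() returns, pull the OVER_100 counter
// out of the job's counters and print how many suspect records were seen.
public class CounterReport {
  public static void printSuspectRecordCount(Job job) throws Exception {
    Counter counter = job.getCounters()
        .findCounter(MaxTemperatureMapper.Temperature.OVER_100);
    System.out.printf("Records with implausible temperatures: %d%n",
        counter.getValue());
  }
}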
If the amount of log data you produce in the course of debugging is large, you have a couple of options. One is to write the information to the map’s output, rather than to standard error, for analysis and aggregation by the reduce task. This approach usually necessitates structural changes to your program, so start with the other technique first. The alternative is to write a program (in MapReduce, of course) to analyze the logs produced by your job.

We add our debugging to the mapper (version 3), as opposed to the reducer, as we want to find out what the source data causing the anomalous output looks like:

public class MaxTemperatureMapper
  extends Mapper<LongWritable, Text, Text, IntWritable> {

  enum Temperature {
    OVER_100
  }

  private NcdcRecordParser parser = new NcdcRecordParser();

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    parser.parse(value);
    if (parser.isValidTemperature()) {
      int airTemperature = parser.getAirTemperature();
      if (airTemperature > 1000) {
        System.err.println("Temperature over 100 degrees for input: " + value);
        context.setStatus("Detected possibly corrupt record: see logs.");
        context.getCounter(Temperature.OVER_100).increment(1);
      }
      context.write(new Text(parser.getYear()), new IntWritable(airTemperature));
    }
  }
}
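The NcdcRecordParser helper used above belongs to the book’s accompanying example code rather than to this excerpt. As a rough, simplified stand-in (an assumption based on the fixed-width NCDC record format used earlier in the book, not the book’s actual listing), it might look like this:

import org.apache.hadoop.io.Text;

// Simplified stand-in for the book's NcdcRecordParser, assuming the
// fixed-width NCDC line format: the year at offsets 15-19, a signed air
// temperature (in tenths of a degree Celsius) at offsets 87-92, and a
// quality code at offset 92.
public class NcdcRecordParser {

  private static final int MISSING_TEMPERATURE = 9999;

  private String year;
  private int airTemperature;
  private String quality;

  public void parse(String record) {
    year = record.substring(15, 19);
    String airTemperatureString;
    // Skip a leading plus sign (older versions of Integer.parseInt reject it).
    if (record.charAt(87) == '+') {
      airTemperatureString = record.substring(88, 92);
    } else {
      airTemperatureString = record.substring(87, 92);
    }
    airTemperature = Integer.parseInt(airTemperatureString);
    quality = record.substring(92, 93);
  }

  public void parse(Text record) {
    parse(record.toString());
  }

  public boolean isValidTemperature() {
    return airTemperature != MISSING_TEMPERATURE && quality.matches("[01459]");
  }

  public String getYear() {
    return year;
  }

  public int getAirTemperature() {
    return airTemperature;
  }
}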
If the temperature is over 100°C (represented by 1000, because temperatures are in tenths of a degree), we print a line to standard error with the suspect line, as well as updating the map’s status message using the setStatus() method on Context, directing us to look in the log. We also increment a counter, which in Java is represented by a field of an enum type. In this program, we have defined a single field, OVER_100, as a way to count the number of records with a temperature of over 100°C.

With this modification, we recompile the code, re-create the JAR file, then rerun the job and, while it’s running, go to the tasks page.

The tasks and task attempts pages

The job page has a number of links for viewing the tasks in a job in more detail. For example, clicking on the “Map” link brings us to a page that lists information for all of the map tasks. The screenshot in Figure 6-3 shows this page for the job run with our debugging statements in the “Status” column for the task.

Figure 6-3. Screenshot of the tasks page

Clicking on the task link takes us to the task attempts page, which shows each task attempt for the task.