Tom White - Hadoop: The Definitive Guide, 4th edition - 2015
These include the operators (LOAD, ILLUSTRATE), commands (cat, ls), expressions (matches, FLATTEN), and functions (DIFF, MAX), all of which are covered in the following sections.

Pig Latin has mixed rules on case sensitivity. Operators and commands are not case sensitive (to make interactive use more forgiving); however, aliases and function names are case sensitive.

Statements

As a Pig Latin program is executed, each statement is parsed in turn.
If there are syntax errors or other (semantic) problems, such as undefined aliases, the interpreter will halt and display an error message. The interpreter builds a logical plan for every relational operation, which forms the core of a Pig Latin program. The logical plan for the statement is added to the logical plan for the program so far, and then the interpreter moves on to the next statement.

It's important to note that no data processing takes place while the logical plan of the program is being constructed. For example, consider again the Pig Latin program from the first example:

-- max_temp.pig: Finds the maximum temperature by year
records = LOAD 'input/ncdc/micro-tab/sample.txt'
    AS (year:chararray, temperature:int, quality:int);
filtered_records = FILTER records BY temperature != 9999 AND
    quality IN (0, 1, 4, 5, 9);
grouped_records = GROUP filtered_records BY year;
max_temp = FOREACH grouped_records GENERATE group,
    MAX(filtered_records.temperature);
DUMP max_temp;

When the Pig Latin interpreter sees the first line containing the LOAD statement, it confirms that it is syntactically and semantically correct and adds it to the logical plan, but it does not load the data from the file (or even check whether the file exists).
Indeed, where would it load it? Into memory? Even if it did fit into memory, what would it do with the data? Perhaps not all the input data is needed (because later statements filter it, for example), so it would be pointless to load it. The point is that it makes no sense to start any processing until the whole flow is defined. Similarly, Pig validates the GROUP and FOREACH...GENERATE statements, and adds them to the logical plan without executing them.
The trigger for Pig to start execution is the DUMP statement. At that point, the logical plan is compiled into a physical plan and executed.

Multiquery Execution

Because DUMP is a diagnostic tool, it will always trigger execution. However, the STORE command is different. In interactive mode, STORE acts like DUMP and will always trigger execution (this includes the run command), but in batch mode it will not (this includes the exec command). The reason for this is efficiency. In batch mode, Pig will parse the whole script to see whether there are any optimizations that could be made to limit the amount of data to be written to or read from disk. Consider the following simple example:

A = LOAD 'input/pig/multiquery/A';
B = FILTER A BY $1 == 'banana';
C = FILTER A BY $1 != 'banana';
STORE B INTO 'output/b';
STORE C INTO 'output/c';

Relations B and C are both derived from A, so to save reading A twice, Pig can run this script as a single MapReduce job by reading A once and writing two output files from the job, one for each of B and C.
This feature is called multiquery execution.

In previous versions of Pig that did not have multiquery execution, each STORE statement in a script run in batch mode triggered execution, resulting in a job for each STORE statement. It is possible to restore the old behavior by disabling multiquery execution with the -M or -no_multiquery option to pig.

The physical plan that Pig prepares is a series of MapReduce jobs, which in local mode Pig runs in the local JVM, and in MapReduce mode Pig runs on a Hadoop cluster.

You can see the logical and physical plans created by Pig using the EXPLAIN command on a relation (EXPLAIN max_temp;, for example). EXPLAIN will also show the MapReduce plan, which shows how the physical operators are grouped into MapReduce jobs. This is a good way to find out how many MapReduce jobs Pig will run for your query.

The relational operators that can be a part of a logical plan in Pig are summarized in Table 16-1.
We go through the operators in more detail in "Data Processing Operators" on page 456.

Table 16-1. Pig Latin relational operators

Category                 Operator            Description
Loading and storing      LOAD                Loads data from the filesystem or other storage into a relation
                         STORE               Saves a relation to the filesystem or other storage
                         DUMP (\d)           Prints a relation to the console
Filtering                FILTER              Removes unwanted rows from a relation
                         DISTINCT            Removes duplicate rows from a relation
                         FOREACH...GENERATE  Adds or removes fields to or from a relation
                         MAPREDUCE           Runs a MapReduce job using a relation as input
                         STREAM              Transforms a relation using an external program
                         SAMPLE              Selects a random sample of a relation
                         ASSERT              Ensures a condition is true for all rows in a relation; otherwise, fails
Grouping and joining     JOIN                Joins two or more relations
                         COGROUP             Groups the data in two or more relations
                         GROUP               Groups the data in a single relation
                         CROSS               Creates the cross product of two or more relations
                         CUBE                Creates aggregations for all combinations of specified columns in a relation
Sorting                  ORDER               Sorts a relation by one or more fields
                         RANK                Assigns a rank to each tuple in a relation, optionally sorting by fields first
                         LIMIT               Limits the size of a relation to a maximum number of tuples
Combining and splitting  UNION               Combines two or more relations into one
                         SPLIT               Splits a relation into two or more relations

There are other types of statements that are not added to the logical plan.
For example, the diagnostic operators (DESCRIBE, EXPLAIN, and ILLUSTRATE) are provided to allow the user to interact with the logical plan for debugging purposes (see Table 16-2). DUMP is a sort of diagnostic operator, too, since it is used only to allow interactive debugging of small result sets or in combination with LIMIT to retrieve a few rows from a larger relation.
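A sketch of how these diagnostic operators are typically combined in a Grunt session (the relation names follow the max_temp.pig example from earlier; the exact output depends on your data):

```pig
-- Print the schema of a relation
DESCRIBE records;

-- Show the logical, physical, and MapReduce plans for a relation
EXPLAIN max_temp;

-- Retrieve just a few rows from a larger relation before dumping it
first_rows = LIMIT max_temp 5;
DUMP first_rows;
```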
The STORE statement should be used when the size of the output is more than a few lines, as it writes to a file rather than to the console.

Table 16-2. Pig Latin diagnostic operators

Operator (Shortcut)  Description
DESCRIBE (\de)       Prints a relation's schema
EXPLAIN (\e)         Prints the logical and physical plans
ILLUSTRATE (\i)      Shows a sample execution of the logical plan, using a generated subset of the input

Pig Latin also provides three statements (REGISTER, DEFINE, and IMPORT) that make it possible to incorporate macros and user-defined functions into Pig scripts (see Table 16-3).

Table 16-3. Pig Latin macro and UDF statements

Statement  Description
REGISTER   Registers a JAR file with the Pig runtime
DEFINE     Creates an alias for a macro, UDF, streaming script, or command specification
IMPORT     Imports macros defined in a separate file into a script

Because they do not process relations, commands are not added to the logical plan; instead, they are executed immediately. Pig provides commands to interact with Hadoop filesystems (which are very handy for moving data around before or after processing with Pig) and MapReduce, as well as a few utility commands (described in Table 16-4).

Table 16-4. Pig Latin commands

Category           Command        Description
Hadoop filesystem  cat            Prints the contents of one or more files
                   cd             Changes the current directory
                   copyFromLocal  Copies a local file or directory to a Hadoop filesystem
                   copyToLocal    Copies a file or directory on a Hadoop filesystem to the local filesystem
                   cp             Copies a file or directory to another directory
                   fs             Accesses Hadoop's filesystem shell
                   ls             Lists files
                   mkdir          Creates a new directory
                   mv             Moves a file or directory to another directory
                   pwd            Prints the path of the current working directory
                   rm             Deletes a file or directory
                   rmf            Forcibly deletes a file or directory (does not fail if the file or directory does not exist)
Hadoop MapReduce   kill           Kills a MapReduce job
Utility            clear          Clears the screen in Grunt
                   exec           Runs a script in a new Grunt shell in batch mode
                   help           Shows the available commands and options
                   history        Prints the query statements run in the current Grunt session
                   quit (\q)      Exits the interpreter
                   run            Runs a script within the existing Grunt shell
                   set            Sets Pig options and MapReduce job properties
                   sh             Runs a shell command from within Grunt

The filesystem commands can operate on files or directories in any Hadoop filesystem, and they are very similar to the hadoop fs commands (which is not surprising, as both are simple wrappers around the Hadoop FileSystem interface).
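To give a feel for these commands, here is a brief sketch of a Grunt session that moves a file around before processing it (the paths and filenames here are illustrative, not from the book's datasets):

```pig
grunt> pwd
grunt> mkdir scratch
grunt> copyFromLocal sample.txt scratch/sample.txt
grunt> cat scratch/sample.txt
-- fs gives access to the full Hadoop filesystem shell
grunt> fs -ls scratch
-- rmf, unlike rm, succeeds even if the target does not exist
grunt> rmf scratch
```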