Tom White - Hadoop: The Definitive Guide, 4th edition - 2015
These include the operators (LOAD, ILLUSTRATE), commands (cat, ls), expressions (matches, FLATTEN), and functions (DIFF, MAX), all of which are covered in the following sections.

Pig Latin has mixed rules on case sensitivity. Operators and commands are not case sensitive (to make interactive use more forgiving); however, aliases and function names are case sensitive.

Statements

As a Pig Latin program is executed, each statement is parsed in turn.
If there are syntax errors or other (semantic) problems, such as undefined aliases, the interpreter will halt and display an error message. The interpreter builds a logical plan for every relational operation, which forms the core of a Pig Latin program. The logical plan for the statement is added to the logical plan for the program so far, and then the interpreter moves on to the next statement.

It's important to note that no data processing takes place while the logical plan of the program is being constructed. For example, consider again the Pig Latin program from the first example:

-- max_temp.pig: Finds the maximum temperature by year
records = LOAD 'input/ncdc/micro-tab/sample.txt'
    AS (year:chararray, temperature:int, quality:int);
filtered_records = FILTER records BY temperature != 9999 AND
    quality IN (0, 1, 4, 5, 9);
grouped_records = GROUP filtered_records BY year;
max_temp = FOREACH grouped_records GENERATE group,
    MAX(filtered_records.temperature);
DUMP max_temp;

When the Pig Latin interpreter sees the first line containing the LOAD statement, it confirms that it is syntactically and semantically correct and adds it to the logical plan, but it does not load the data from the file (or even check whether the file exists).
Indeed, where would it load it? Into memory? Even if it did fit into memory, what would it do with the data? Perhaps not all the input data is needed (because later statements filter it, for example), so it would be pointless to load it. The point is that it makes no sense to start any processing until the whole flow is defined. Similarly, Pig validates the GROUP and FOREACH...GENERATE statements, and adds them to the logical plan without executing them.
The trigger for Pig to start execution is the DUMP statement. At that point, the logical plan is compiled into a physical plan and executed.

Multiquery Execution

Because DUMP is a diagnostic tool, it will always trigger execution. However, the STORE command is different. In interactive mode, STORE acts like DUMP and will always trigger execution (this includes the run command), but in batch mode it will not (this includes the exec command). The reason for this is efficiency. In batch mode, Pig will parse the whole script to see whether there are any optimizations that could be made to limit the amount of data to be written to or read from disk. Consider the following simple example:

A = LOAD 'input/pig/multiquery/A';
B = FILTER A BY $1 == 'banana';
C = FILTER A BY $1 != 'banana';
STORE B INTO 'output/b';
STORE C INTO 'output/c';

Relations B and C are both derived from A, so to save reading A twice, Pig can run this script as a single MapReduce job by reading A once and writing two output files from the job, one for each of B and C.
This feature is called multiquery execution.

In previous versions of Pig that did not have multiquery execution, each STORE statement in a script run in batch mode triggered execution, resulting in a job for each STORE statement. It is possible to restore the old behavior by disabling multiquery execution with the -M or -no_multiquery option to pig.

The physical plan that Pig prepares is a series of MapReduce jobs, which in local mode Pig runs in the local JVM, and in MapReduce mode Pig runs on a Hadoop cluster.

You can see the logical and physical plans created by Pig using the EXPLAIN command on a relation (EXPLAIN max_temp;, for example). EXPLAIN will also show the MapReduce plan, which shows how the physical operators are grouped into MapReduce jobs. This is a good way to find out how many MapReduce jobs Pig will run for your query.

The relational operators that can be a part of a logical plan in Pig are summarized in Table 16-1.
We go through the operators in more detail in "Data Processing Operators" on page 456.

Table 16-1. Pig Latin relational operators

Category                 Operator            Description
Loading and storing      LOAD                Loads data from the filesystem or other storage into a relation
                         STORE               Saves a relation to the filesystem or other storage
                         DUMP (\d)           Prints a relation to the console
Filtering                FILTER              Removes unwanted rows from a relation
                         DISTINCT            Removes duplicate rows from a relation
                         FOREACH...GENERATE  Adds or removes fields to or from a relation
                         MAPREDUCE           Runs a MapReduce job using a relation as input
                         STREAM              Transforms a relation using an external program
                         SAMPLE              Selects a random sample of a relation
                         ASSERT              Ensures a condition is true for all rows in a relation; otherwise, fails
Grouping and joining     JOIN                Joins two or more relations
                         COGROUP             Groups the data in two or more relations
                         GROUP               Groups the data in a single relation
                         CROSS               Creates the cross product of two or more relations
                         CUBE                Creates aggregations for all combinations of specified columns in a relation
Sorting                  ORDER               Sorts a relation by one or more fields
                         RANK                Assigns a rank to each tuple in a relation, optionally sorting by fields first
                         LIMIT               Limits the size of a relation to a maximum number of tuples
Combining and splitting  UNION               Combines two or more relations into one
                         SPLIT               Splits a relation into two or more relations

There are other types of statements that are not added to the logical plan.
For example, the diagnostic operators (DESCRIBE, EXPLAIN, and ILLUSTRATE) are provided to allow the user to interact with the logical plan for debugging purposes (see Table 16-2). DUMP is a sort of diagnostic operator, too, since it is used only to allow interactive debugging of small result sets or in combination with LIMIT to retrieve a few rows from a larger relation.
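A sketch of how these diagnostic operators are typically combined in a Grunt session (the relation names follow the max_temp.pig example from earlier; the exact output depends on your data):

```pig
-- Print the schema of a relation
DESCRIBE records;

-- Show the logical, physical, and MapReduce plans for a relation
EXPLAIN max_temp;

-- Retrieve just a few rows from a larger relation before dumping it
first_rows = LIMIT max_temp 5;
DUMP first_rows;
```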
The STORE statement should be used when the size of the output is more than a few lines, as it writes to a file rather than to the console.

Table 16-2. Pig Latin diagnostic operators

Operator (Shortcut)  Description
DESCRIBE (\de)       Prints a relation's schema
EXPLAIN (\e)         Prints the logical and physical plans
ILLUSTRATE (\i)      Shows a sample execution of the logical plan, using a generated subset of the input

Pig Latin also provides three statements (REGISTER, DEFINE, and IMPORT) that make it possible to incorporate macros and user-defined functions into Pig scripts (see Table 16-3).

Table 16-3. Pig Latin macro and UDF statements

Statement  Description
REGISTER   Registers a JAR file with the Pig runtime
DEFINE     Creates an alias for a macro, UDF, streaming script, or command specification
IMPORT     Imports macros defined in a separate file into a script

Because they do not process relations, commands are not added to the logical plan; instead, they are executed immediately. Pig provides commands to interact with Hadoop filesystems (which are very handy for moving data around before or after processing with Pig) and MapReduce, as well as a few utility commands (described in Table 16-4).

Table 16-4. Pig Latin commands

Category           Command        Description
Hadoop filesystem  cat            Prints the contents of one or more files
                   cd             Changes the current directory
                   copyFromLocal  Copies a local file or directory to a Hadoop filesystem
                   copyToLocal    Copies a file or directory on a Hadoop filesystem to the local filesystem
                   cp             Copies a file or directory to another directory
                   fs             Accesses Hadoop's filesystem shell
                   ls             Lists files
                   mkdir          Creates a new directory
                   mv             Moves a file or directory to another directory
                   pwd            Prints the path of the current working directory
                   rm             Deletes a file or directory
                   rmf            Forcibly deletes a file or directory (does not fail if the file or directory does not exist)
Hadoop MapReduce   kill           Kills a MapReduce job
Utility            clear          Clears the screen in Grunt
                   exec           Runs a script in a new Grunt shell in batch mode
                   help           Shows the available commands and options
                   history        Prints the query statements run in the current Grunt session
                   quit (\q)      Exits the interpreter
                   run            Runs a script within the existing Grunt shell
                   set            Sets Pig options and MapReduce job properties
                   sh             Runs a shell command from within Grunt

The filesystem commands can operate on files or directories in any Hadoop filesystem, and they are very similar to the hadoop fs commands (which is not surprising, as both are simple wrappers around the Hadoop FileSystem interface).
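To give a feel for these commands, here is a brief sketch of a Grunt session that moves a file around before processing it (the paths and filenames here are illustrative, not from the book's datasets):

```pig
grunt> pwd
grunt> mkdir scratch
grunt> copyFromLocal sample.txt scratch/sample.txt
grunt> cat scratch/sample.txt
-- fs gives access to the full Hadoop filesystem shell
grunt> fs -ls scratch
-- rmf, unlike rm, succeeds even if the target does not exist
grunt> rmf scratch
```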