Sorting Data

Relations are unordered in Pig. Consider a relation A:

grunt> DUMP A;
(2,3)
(1,2)
(2,4)

There is no guarantee which order the rows will be processed in.
In particular, when retrieving the contents of A using DUMP or STORE, the rows may be written in any order. If you want to impose an order on the output, you can use the ORDER operator to sort a relation by one or more fields. The default sort order compares fields of the same type using the natural ordering; different types are given an arbitrary, but deterministic, ordering (a tuple is always "less than" a bag, for example).

The following example sorts A by the first field in ascending order and by the second field in descending order:

grunt> B = ORDER A BY $0, $1 DESC;
grunt> DUMP B;
(1,2)
(2,4)
(2,3)

Any further processing on a sorted relation is not guaranteed to retain its order.
For example:

grunt> C = FOREACH B GENERATE *;

Even though relation C has the same contents as relation B, its tuples may be emitted in any order by a DUMP or a STORE. It is for this reason that it is usual to perform the ORDER operation just before retrieving the output.

The LIMIT statement is useful for limiting the number of results as a quick-and-dirty way to get a sample of a relation. (Random sampling using the SAMPLE operator, or prototyping with the ILLUSTRATE command, should be preferred for generating more representative samples of the data.) It can be used immediately after an ORDER statement to retrieve the first n tuples. Usually, LIMIT will select any n tuples from a relation, but when used immediately after an ORDER statement, the order is retained (an exception to the rule that processing a relation does not retain its order):

grunt> D = LIMIT B 2;
grunt> DUMP D;
(1,2)
(2,4)

If the limit is greater than the number of tuples in the relation, all tuples are returned (so LIMIT has no effect).

Using LIMIT can improve the performance of a query because Pig tries to apply the limit as early as possible in the processing pipeline, to minimize the amount of data that needs to be processed.
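As a brief sketch of those alternatives (the relation names and the 0.1 fraction are illustrative, not part of the original example), SAMPLE draws a random subset of a relation, and ILLUSTRATE walks a small generated dataset through a query plan:

grunt> S = SAMPLE A 0.1;  -- keep each tuple of A with probability ~0.1
grunt> ILLUSTRATE B;      -- show example tuples flowing through B's plan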
For this reason, you should always use LIMIT if you are not interested in the entire output.

Combining and Splitting Data

Sometimes you have several relations that you would like to combine into one. For this, the UNION statement is used. For example:

grunt> DUMP A;
(2,3)
(1,2)
(2,4)
grunt> DUMP B;
(z,x,8)
(w,y,1)
grunt> C = UNION A, B;
grunt> DUMP C;
(2,3)
(z,x,8)
(1,2)
(w,y,1)
(2,4)

C is the union of relations A and B, and because relations are unordered, the order of the tuples in C is undefined. Also, it's possible to form the union of two relations with different schemas or with different numbers of fields, as we have done here.
Pig attempts to merge the schemas from the relations that UNION is operating on. In this case, they are incompatible, so C has no schema:

grunt> DESCRIBE A;
A: {f0: int,f1: int}
grunt> DESCRIBE B;
B: {f0: chararray,f1: chararray,f2: int}
grunt> DESCRIBE C;
Schema for C unknown.

If the output relation has no schema, your script needs to be able to handle tuples that vary in the number of fields and/or types.

The SPLIT operator is the opposite of UNION: it partitions a relation into two or more relations.
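As a minimal sketch of the syntax (the records relation and its nullable temperature field are assumptions for illustration, echoing the weather dataset used elsewhere in this chapter):

grunt> SPLIT records INTO good_records IF temperature IS NOT NULL,
>>     bad_records IF temperature IS NULL;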
See "Validation and nulls" on page 442 for a fuller example of how to use it.

Pig in Practice

There are some practical techniques that are worth knowing about when you are developing and running Pig programs. This section covers some of them.

Parallelism

When running in MapReduce mode, it's important that the degree of parallelism matches the size of the dataset. By default, Pig sets the number of reducers by looking at the size of the input and using one reducer per 1 GB of input, up to a maximum of 999 reducers.
You can override these parameters by setting pig.exec.reducers.bytes.per.reducer (the default is 1,000,000,000 bytes) and pig.exec.reducers.max (the default is 999).

To explicitly set the number of reducers you want for each job, you can use a PARALLEL clause for operators that run in the reduce phase. These include all the grouping and joining operators (GROUP, COGROUP, JOIN, CROSS), as well as DISTINCT and ORDER. The following line sets the number of reducers to 30 for the GROUP:

grouped_records = GROUP records BY year PARALLEL 30;

Alternatively, you can set the default_parallel option, and it will take effect for all subsequent jobs:

grunt> set default_parallel 30

See "Choosing the Number of Reducers" on page 217 for further discussion.

The number of map tasks is set by the size of the input (with one map per HDFS block) and is not affected by the PARALLEL clause.
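As a sketch, the two estimation properties mentioned at the start of this section can be changed from the Grunt shell in the same way as default_parallel (the values here are purely illustrative, and this assumes your Pig version's set command accepts arbitrary properties):

grunt> set pig.exec.reducers.bytes.per.reducer 500000000
grunt> set pig.exec.reducers.max 500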
Anonymous Relations

You usually apply a diagnostic operator like DUMP or DESCRIBE to the most recently defined relation. Since this is so common, Pig has a shortcut to refer to the previous relation: @. Similarly, it can be tiresome to have to come up with a name for each relation when using the interpreter. Pig allows you to use the special syntax => to create a relation with no alias, which can only be referred to with @.
For example:

grunt> => LOAD 'input/ncdc/micro-tab/sample.txt';
grunt> DUMP @;
(1950,0,1)
(1950,22,1)
(1950,-11,1)
(1949,111,1)
(1949,78,1)

Parameter Substitution

If you have a Pig script that you run on a regular basis, it's quite common to want to be able to run the same script with different parameters. For example, a script that runs daily may use the date to determine which input files it runs over. Pig supports parameter substitution, where parameters in the script are substituted with values supplied at runtime. Parameters are denoted by identifiers prefixed with a $ character; for example, $input and $output are used in the following script to specify the input and output paths:

-- max_temp_param.pig
records = LOAD '$input' AS (year:chararray, temperature:int, quality:int);
filtered_records = FILTER records BY temperature != 9999 AND
  quality IN (0, 1, 4, 5, 9);
grouped_records = GROUP filtered_records BY year;
max_temp = FOREACH grouped_records GENERATE group,
  MAX(filtered_records.temperature);
STORE max_temp INTO '$output';

Parameters can be specified when launching Pig using the -param option, once for each parameter:

% pig -param input=/user/tom/input/ncdc/micro-tab/sample.txt \
>   -param output=/tmp/out \
>   ch16-pig/src/main/pig/max_temp_param.pig

You can also put parameters in a file and pass them to Pig using the -param_file option. For example, we can achieve the same result as the previous command by placing the parameter definitions in a file:

# Input file
input=/user/tom/input/ncdc/micro-tab/sample.txt
# Output file
output=/tmp/out

The pig invocation then becomes:

% pig -param_file ch16-pig/src/main/pig/max_temp_param.param \
>   ch16-pig/src/main/pig/max_temp_param.pig
You can specify multiple parameter files by using -param_file repeatedly. You can also use a combination of -param and -param_file options; if any parameter is defined both in a parameter file and on the command line, the last value on the command line takes precedence.

Dynamic parameters

For parameters that are supplied using the -param option, it is easy to make the value dynamic by running a command or script. Many Unix shells support command substitution for a command enclosed in backticks, and we can use this to make the output directory date-based:

% pig -param input=/user/tom/input/ncdc/micro-tab/sample.txt \
>   -param output=/tmp/`date "+%Y-%m-%d"`/out \
>   ch16-pig/src/main/pig/max_temp_param.pig

Pig also supports backticks in parameter files by executing the enclosed command in a shell and using the shell output as the substituted value.
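For example, a parameter file version of the date-based output directory might look like this (a sketch reusing the same date command shown on the command line above):

# Input file
input=/user/tom/input/ncdc/micro-tab/sample.txt
# Output file, under a directory named for the current date
output=/tmp/`date "+%Y-%m-%d"`/out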
If the command or script exits with a nonzero exit status, then an error message is reported and execution halts. Backtick support in parameter files is a useful feature; it means that parameters can be defined in the same way in a file or on the command line.

Parameter substitution processing

Parameter substitution occurs as a preprocessing step before the script is run. You can see the substitutions that the preprocessor made by executing Pig with the -dryrun option. In dry run mode, Pig performs parameter substitution (and macro expansion) and generates a copy of the original script with substituted values, but does not execute the script.
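For example, a dry run of the earlier parameterized script might look like this (a sketch using the -dryrun option just described; it is assumed here that the preprocessed copy is written alongside the original with a .substituted extension):

% pig -dryrun -param input=/user/tom/input/ncdc/micro-tab/sample.txt \
>   -param output=/tmp/out \
>   ch16-pig/src/main/pig/max_temp_param.pig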
You can inspect the generated script and check that the substitutions look sane (because they may be dynamically generated, for example) before running it in normal mode.

Further Reading

This chapter provided a basic introduction to using Pig. For a more detailed guide, see Programming Pig by Alan Gates (O'Reilly, 2011).

CHAPTER 17

Hive

In "Information Platforms and the Rise of the Data Scientist,"1 Jeff Hammerbacher describes Information Platforms as "the locus of their organization's efforts to ingest, process, and generate information," and how they "serve to accelerate the process of learning from empirical data."

One of the biggest ingredients in the Information Platform built by Jeff's team at Facebook was Apache Hive, a framework for data warehousing on top of Hadoop. Hive grew from a need to manage and learn from the huge volumes of data that Facebook was producing every day from its burgeoning social network.