An integer may be more appropriate if we need to manipulate the year arithmetically (to turn it into a timestamp, for example), whereas the chararray representation might be more appropriate when it's being used as a simple identifier. Pig's flexibility in the degree to which schemas are declared contrasts with schemas in traditional SQL databases, which are declared before the data is loaded into the system. Pig is designed for analyzing plain input files with no associated type information, so it is quite natural to choose types for fields later than you would with an RDBMS.

It's possible to omit type declarations completely, too:

grunt> records = LOAD 'input/ncdc/micro-tab/sample.txt'
>>     AS (year, temperature, quality);
grunt> DESCRIBE records;
records: {year: bytearray,temperature: bytearray,quality: bytearray}

In this case, we have specified only the names of the fields in the schema: year, temperature, and quality.
The types default to bytearray, the most general type, representing a binary string.

You don't need to specify types for every field; you can leave some to default to bytearray, as we have done for year in this declaration:

grunt> records = LOAD 'input/ncdc/micro-tab/sample.txt'
>>     AS (year, temperature:int, quality:int);
grunt> DESCRIBE records;
records: {year: bytearray,temperature: int,quality: int}

However, if you specify a schema in this way, you do need to specify every field. Also, there's no way to specify the type of a field without specifying the name. On the other hand, the schema is entirely optional and can be omitted by not specifying an AS clause:

grunt> records = LOAD 'input/ncdc/micro-tab/sample.txt';
grunt> DESCRIBE records;
Schema for records unknown.

Fields in a relation with no schema can be referenced using only positional notation: $0 refers to the first field in a relation, $1 to the second, and so on.
Their types default to bytearray:

grunt> projected_records = FOREACH records GENERATE $0, $1, $2;
grunt> DUMP projected_records;
(1950,0,1)
(1950,22,1)
(1950,-11,1)
(1949,111,1)
(1949,78,1)
grunt> DESCRIBE projected_records;
projected_records: {bytearray,bytearray,bytearray}

Although it can be convenient not to assign types to fields (particularly in the first stages of writing a query), doing so can improve the clarity and efficiency of Pig Latin programs and is generally recommended.

Using Hive tables with HCatalog

Declaring a schema as a part of the query is flexible but doesn't lend itself to schema reuse. A set of Pig queries over the same input data will often have the same schema repeated in each query. If the query processes a large number of fields, this repetition can become hard to maintain.

HCatalog (which is a component of Hive) solves this problem by providing access to Hive's metastore, so that Pig queries can reference schemas by name, rather than specifying them in full each time.
For example, after running through "An Example" on page 474 to load data into a Hive table called records, Pig can access the table's schema and data as follows:

% pig -useHCatalog
grunt> records = LOAD 'records' USING org.apache.hcatalog.pig.HCatLoader();
grunt> DESCRIBE records;
records: {year: chararray,temperature: int,quality: int}
grunt> DUMP records;
(1950,0,1)
(1950,22,1)
(1950,-11,1)
(1949,111,1)
(1949,78,1)

Validation and nulls

A SQL database will enforce the constraints in a table's schema at load time; for example, trying to load a string into a column that is declared to be a numeric type will fail.
In Pig, if the value cannot be cast to the type declared in the schema, it will substitute a null value. Let's see how this works when we have the following input for the weather data, which has an "e" character in place of an integer:

1950	0	1
1950	22	1
1950	e	1
1949	111	1
1949	78	1

Pig handles the corrupt line by producing a null for the offending value, which is displayed as the absence of a value when dumped to screen (and also when saved using STORE):

grunt> records = LOAD 'input/ncdc/micro-tab/sample_corrupt.txt'
>>     AS (year:chararray, temperature:int, quality:int);
grunt> DUMP records;
(1950,0,1)
(1950,22,1)
(1950,,1)
(1949,111,1)
(1949,78,1)

Pig produces a warning for the invalid field (not shown here) but does not halt its processing.
For large datasets, it is very common to have corrupt, invalid, or merely unexpected data, and it is generally infeasible to incrementally fix every unparsable record. Instead, we can pull out all of the invalid records in one go so we can take action on them, perhaps by fixing our program (because they indicate that we have made a mistake) or by filtering them out (because the data is genuinely unusable):

grunt> corrupt_records = FILTER records BY temperature is null;
grunt> DUMP corrupt_records;
(1950,,1)

Note the use of the is null operator, which is analogous to SQL. In practice, we would include more information from the original record, such as an identifier and the value that could not be parsed, to help our analysis of the bad data.

We can find the number of corrupt records using the following idiom for counting the number of rows in a relation:

grunt> grouped = GROUP corrupt_records ALL;
grunt> all_grouped = FOREACH grouped GENERATE group, COUNT(corrupt_records);
grunt> DUMP all_grouped;
(all,1)

("GROUP" on page 464 explains grouping and the ALL operation in more detail.)

Another useful technique is to use the SPLIT operator to partition the data into "good" and "bad" relations, which can then be analyzed separately:

grunt> SPLIT records INTO good_records IF temperature is not null,
>>     bad_records OTHERWISE;
grunt> DUMP good_records;
(1950,0,1)
(1950,22,1)
(1949,111,1)
(1949,78,1)
grunt> DUMP bad_records;
(1950,,1)

Going back to the case in which temperature's type was left undeclared, the corrupt data cannot be detected easily, since it doesn't surface as a null:

grunt> records = LOAD 'input/ncdc/micro-tab/sample_corrupt.txt'
>>     AS (year:chararray, temperature, quality:int);
grunt> DUMP records;
(1950,0,1)
(1950,22,1)
(1950,e,1)
(1949,111,1)
(1949,78,1)
grunt> filtered_records = FILTER records BY temperature != 9999 AND
>>     quality IN (0, 1, 4, 5, 9);
grunt> grouped_records = GROUP filtered_records BY year;
grunt> max_temp = FOREACH grouped_records GENERATE group,
>>     MAX(filtered_records.temperature);
grunt> DUMP max_temp;
(1949,111.0)
(1950,22.0)

What happens in this case is that the temperature field is interpreted as a bytearray, so the corrupt field is not detected when the input is loaded.
When passed to the MAX function, the temperature field is cast to a double, since MAX works only with numeric types. The corrupt field cannot be represented as a double, so it becomes a null, which MAX silently ignores. The best approach is generally to declare types for your data on loading and look for missing or corrupt values in the relations themselves before you do your main processing.

Sometimes corrupt data shows up as smaller tuples because fields are simply missing. You can filter these out by using the SIZE function as follows:

grunt> A = LOAD 'input/pig/corrupt/missing_fields';
grunt> DUMP A;
(2,Tie)
(4,Coat)
(3)
(1,Scarf)
grunt> B = FILTER A BY SIZE(TOTUPLE(*)) > 1;
grunt> DUMP B;
(2,Tie)
(4,Coat)
(1,Scarf)

Schema merging

In Pig, you don't declare the schema for every new relation in the data flow.
In most cases, Pig can figure out the resulting schema for the output of a relational operation by considering the schema of the input relation.

How are schemas propagated to new relations? Some relational operators don't change the schema, so the relation produced by the LIMIT operator (which restricts a relation to a maximum number of tuples), for example, has the same schema as the relation it operates on. For other operators, the situation is more complicated. UNION, for example, combines two or more relations into one and tries to merge the input relations' schemas. If the schemas are incompatible, due to different types or number of fields, then the schema of the result of the UNION is unknown.

You can find out the schema for any relation in the data flow using the DESCRIBE operator.
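As a quick illustration (a minimal sketch with hypothetical filenames and relation names, not part of the book's running example; the DESCRIBE output shown is indicative), a UNION over relations whose schemas disagree in the number of fields yields a relation with no schema, and a FOREACH...GENERATE with casts and AS clauses can then impose one, as described next:

grunt> a = LOAD 'input/pig/union/a.txt' AS (x:int, y:chararray);
grunt> b = LOAD 'input/pig/union/b.txt' AS (x:chararray, y:chararray, z:int);
grunt> c = UNION a, b;          -- schemas differ in width and types
grunt> DESCRIBE c;
Schema for c unknown.
grunt> d = FOREACH c GENERATE (chararray)$0 AS x, (chararray)$1 AS y;
grunt> DESCRIBE d;
d: {x: chararray,y: chararray}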
If you want to redefine the schema for a relation, you can use the FOREACH...GENERATE operator with AS clauses to define the schema for some or all of the fields of the input relation.

See "User-Defined Functions" on page 448 for a further discussion of schemas.

Functions

Functions in Pig come in four types:

Eval function
A function that takes one or more expressions and returns another expression. An example of a built-in eval function is MAX, which returns the maximum value of the entries in a bag. Some eval functions are aggregate functions, which means they operate on a bag of data to produce a scalar value; MAX is an example of an aggregate function. Furthermore, many aggregate functions are algebraic, which means that the result of the function may be calculated incrementally. In MapReduce terms, algebraic functions make use of the combiner and are much more efficient to calculate (see "Combiner Functions" on page 34).
MAX is an algebraic function, whereas a function to calculate the median of a collection of values is an example of a function that is not algebraic.

Filter function
A special type of eval function that returns a logical Boolean result. As the name suggests, filter functions are used in the FILTER operator to remove unwanted rows. They can also be used in other relational operators that take Boolean conditions, and in general, in expressions using Boolean or conditional expressions. An example of a built-in filter function is IsEmpty, which tests whether a bag or a map contains any items.

Load function
A function that specifies how to load data into a relation from external storage.

Store function
A function that specifies how to save the contents of a relation to external storage.

Often, load and store functions are implemented by the same type.
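For example, the built-in PigStorage class is both the default load function and the default store function, parameterized by a field delimiter. Here is a minimal sketch (the output path is hypothetical) that reads the tab-delimited sample file from earlier and writes a colon-delimited copy:

grunt> records = LOAD 'input/ncdc/micro-tab/sample.txt'
>>     USING PigStorage('\t')   -- PigStorage acting as a load function
>>     AS (year:chararray, temperature:int, quality:int);
grunt> STORE records INTO 'output/colon-delimited'
>>     USING PigStorage(':');   -- the same class acting as a store function

Omitting USING PigStorage('\t') from the LOAD would behave identically here, since tab is PigStorage's default delimiter.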