Each field has a repetition (required, optional, or repeated), a type, and a name. Here is a simple Parquet schema for a weather record:

message WeatherRecord {
  required int32 year;
  required int32 temperature;
  required binary stationId (UTF8);
}

Notice that there is no primitive string type. Instead, Parquet defines logical types that specify how primitive types should be interpreted, so there is a separation between the serialized representation (the primitive type) and the semantics that are specific to the application (the logical type).
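One way to construct this schema programmatically is to parse its string form with Parquet's MessageTypeParser. This is a minimal sketch: the class name WeatherSchema is invented, and the imports assume a recent Parquet release (older releases use the parquet.schema package instead of org.apache.parquet.schema):

import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

public class WeatherSchema {
  public static void main(String[] args) {
    // Parse the schema from its textual representation
    MessageType schema = MessageTypeParser.parseMessageType(
        "message WeatherRecord {\n" +
        "  required int32 year;\n" +
        "  required int32 temperature;\n" +
        "  required binary stationId (UTF8);\n" +
        "}");
    System.out.println(schema);
  }
}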
Strings are represented as binary primitives with a UTF8 annotation. Some of the logical types defined by Parquet are listed in Table 13-2, along with a representative example schema of each. Among those not listed in the table are signed integers, unsigned integers, more date/time types, and JSON and BSON document types. See the Parquet specification for details.

Table 13-2. Parquet logical types

UTF8
  A UTF-8 character string. Annotates binary.
  Schema example:
    message m {
      required binary a (UTF8);
    }

ENUM
  A set of named values. Annotates binary.
  Schema example:
    message m {
      required binary a (ENUM);
    }

DECIMAL(precision,scale)
  An arbitrary-precision signed decimal number. Annotates int32, int64, binary, or fixed_len_byte_array.
  Schema example:
    message m {
      required int32 a (DECIMAL(5,2));
    }

DATE
  A date with no time value. Annotates int32. Represented by the number of days since the Unix epoch (January 1, 1970).
  Schema example:
    message m {
      required int32 a (DATE);
    }

LIST
  An ordered collection of values. Annotates group.
  Schema example:
    message m {
      required group a (LIST) {
        repeated group list {
          required int32 element;
        }
      }
    }

MAP
  An unordered collection of key-value pairs. Annotates group.
  Schema example:
    message m {
      required group a (MAP) {
        repeated group key_value {
          required binary key (UTF8);
          optional int32 value;
        }
      }
    }

Complex types in Parquet are created using the group type, which adds a layer of nesting.[2] A group with no annotation is simply a nested record.

Lists and maps are built from groups with a particular two-level group structure, as shown in Table 13-2. A list is represented as a LIST group with a nested repeating group (called list) that contains an element field.
In this example, a list of 32-bit integers has a required int32 element field. For maps, the outer group a (annotated MAP) contains an inner repeating group key_value that contains the key and value fields. In this example, the values have been marked optional so that it is possible to have null values in the map.

2. This is based on the model used in Protocol Buffers, where groups are used to define complex types like lists and maps.

Nested Encoding

In a column-oriented store, a column's values are stored together.
For a flat table where there is no nesting and no repetition, such as the weather record schema, this is simple enough, since each column has the same number of values, making it straightforward to determine which row each value belongs to.

In the general case where there is nesting or repetition, such as the map schema, it is more challenging, since the structure of the nesting needs to be encoded too. Some columnar formats avoid the problem by flattening the structure so that only the top-level columns are stored in column-major fashion (this is the approach that Hive's RCFile takes, for example).
A map with nested columns would be stored in such a way that the keys and values are interleaved, so it would not be possible to read only the keys, say, without also reading the values into memory.

Parquet uses the encoding from Dremel, where every primitive type field in the schema is stored in a separate column, and for each value written, the structure is encoded by means of two integers: the definition level and the repetition level. The details are intricate,[3] but you can think of storing definition and repetition levels like this as a generalization of using a bit field to encode nulls for a flat record, where the non-null values are written one after another.

The upshot of this encoding is that any column (even nested ones) can be read independently of the others. In the case of a Parquet map, for example, the keys can be read without accessing any of the values, which can result in significant performance improvements, especially if the values are large (such as nested records with many fields).

3. Julien Le Dem's exposition is excellent.
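To get a feel for the levels, here is a small hand-worked illustration (the schema and records are invented for this sketch, and a bare repeated primitive is used to keep the levels small). Consider this schema and three records:

message m {
  repeated int32 a;
}

record 1: a = [1, 2]
record 2: a = []
record 3: a = [3]

The column for a then stores each value together with a repetition level (0 means the value starts a new record; 1 means it continues the current list) and a definition level (1 means a value is present; 0 means the list is empty, so no value is written):

value   repetition level   definition level
1       0                  1
2       1                  1
-       0                  0
3       0                  1

From the levels alone, a reader can rebuild the record boundaries and the empty list without consulting any other column.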
Parquet File Format

A Parquet file consists of a header followed by one or more blocks, terminated by a footer. The header contains only a 4-byte magic number, PAR1, that identifies the file as being in Parquet format, and all the file metadata is stored in the footer.
The footer's metadata includes the format version, the schema, any extra key-value pairs, and metadata for every block in the file. The final two fields in the footer are a 4-byte field encoding the length of the footer metadata, and the magic number again (PAR1).

The consequence of storing the metadata in the footer is that reading a Parquet file requires an initial seek to the end of the file (minus 8 bytes) to read the footer metadata length, then a second seek backward by that length to read the footer metadata.
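The following is a minimal sketch of those two seeks, assuming a local file read with java.io.RandomAccessFile (the class name and error handling are illustrative; per the Parquet specification, the footer length is a 4-byte little-endian integer immediately before the trailing magic number, and a real reader would go on to parse the Thrift-encoded footer metadata):

import java.io.IOException;
import java.io.RandomAccessFile;

public class ParquetFooterInfo {
  public static void main(String[] args) throws IOException {
    try (RandomAccessFile file = new RandomAccessFile(args[0], "r")) {
      long fileLen = file.length();
      // First seek: the last 8 bytes are the 4-byte footer length
      // (little-endian) followed by the magic number "PAR1".
      file.seek(fileLen - 8);
      byte[] tail = new byte[8];
      file.readFully(tail);
      if (tail[4] != 'P' || tail[5] != 'A' || tail[6] != 'R' || tail[7] != '1') {
        throw new IOException("Not a Parquet file: bad magic number");
      }
      int footerLen = (tail[0] & 0xFF)
          | (tail[1] & 0xFF) << 8
          | (tail[2] & 0xFF) << 16
          | (tail[3] & 0xFF) << 24;
      // Second seek: back past the footer metadata itself, which a real
      // reader would now read and parse.
      long footerStart = fileLen - 8 - footerLen;
      file.seek(footerStart);
      System.out.printf("footer metadata: %d bytes at offset %d%n",
          footerLen, footerStart);
    }
  }
}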
Unlike sequence files and Avro datafiles, where the metadata is stored in the header and sync markers are used to separate blocks, Parquet files don't need sync markers, since the block boundaries are stored in the footer metadata. (This is possible because the metadata is written after all the blocks have been written, so the writer can retain the block boundary positions in memory until the file is closed.) Therefore, Parquet files are splittable, since the blocks can be located after reading the footer and can then be processed in parallel (by MapReduce, for example).

Each block in a Parquet file stores a row group, which is made up of column chunks containing the column data for those rows.
The data for each column chunk is written in pages; this is illustrated in Figure 13-1.

[Figure 13-1. The internal structure of a Parquet file]

Each page contains values from the same column, making a page a very good candidate for compression since the values are likely to be similar. The first level of compression is achieved through how the values are encoded. The simplest encoding is plain encoding, where values are written in full (e.g., an int32 is written using a 4-byte little-endian representation), but this doesn't afford any compression in itself.

Parquet also uses more compact encodings, including delta encoding (the difference between values is stored), run-length encoding (sequences of identical values are encoded as a single value and the count), and dictionary encoding (a dictionary of values is built and itself encoded, then values are encoded as integers representing the indexes in the dictionary). In most cases, it also applies techniques such as bit packing to save space by storing several small values in a single byte.
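To illustrate what dictionary encoding buys, here is a toy sketch (not Parquet's actual implementation; the class and data are invented). A column of repetitive strings collapses into a small dictionary plus an array of integer indexes, which in turn bit-pack and run-length encode well:

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DictionaryEncodingSketch {
  public static void main(String[] args) {
    String[] column = {"uk", "uk", "us", "uk", "fr", "us", "uk"};

    // Build the dictionary: each distinct value gets the next index.
    Map<String, Integer> dictionary = new LinkedHashMap<>();
    List<Integer> indexes = new ArrayList<>();
    for (String value : column) {
      indexes.add(
          dictionary.computeIfAbsent(value, v -> dictionary.size()));
    }

    // The dictionary is stored once; the column becomes small integers
    // (here 0, 0, 1, 0, 2, 1, 0), which bit-pack into 2 bits each.
    System.out.println("dictionary: " + dictionary.keySet());
    System.out.println("indexes:    " + indexes);
  }
}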
When writing files, Parquet will choose an appropriate encoding automatically, based on the column type. For example, Boolean values will be written using a combination of run-length encoding and bit packing. Most types are encoded using dictionary encoding by default; however, a plain encoding will be used as a fallback if the dictionary becomes too large.
The threshold size at which this happens is referred to as the dictionary page size and is the same as the page size by default (so the dictionary has to fit into one page if it is to be used). Note that the encoding that is actually used is stored in the file metadata to ensure that readers use the correct encoding.

In addition to the encoding, a second level of compression can be applied using a standard compression algorithm on the encoded page bytes.
By default, no compression is applied, but Snappy, gzip, and LZO compressors are all supported.

For nested data, each page will also store the definition and repetition levels for all the values in the page. Since levels are small integers (the maximum is determined by the amount of nesting specified in the schema), they can be very efficiently encoded using a bit-packed run-length encoding.

Parquet Configuration

Parquet file properties are set at write time.
The properties listed in Table 13-3 are appropriate if you are creating Parquet files from MapReduce (using the formats discussed in "Parquet MapReduce" on page 377), Crunch, Pig, or Hive.

Table 13-3. ParquetOutputFormat properties

parquet.block.size (int, default 134217728, i.e., 128 MB)
  The size in bytes of a block (row group).

parquet.page.size (int, default 1048576, i.e., 1 MB)
  The size in bytes of a page.

parquet.dictionary.page.size (int, default 1048576, i.e., 1 MB)
  The maximum allowed size in bytes of a dictionary before falling back to plain encoding for a page.

parquet.enable.dictionary (boolean, default true)
  Whether to use dictionary encoding.

parquet.compression (String, default UNCOMPRESSED)
  The type of compression to use for Parquet files: UNCOMPRESSED, SNAPPY, GZIP, or LZO. Used instead of mapreduce.output.fileoutputformat.compress.
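Here is a minimal sketch of setting these properties on a MapReduce job, using only the property names from Table 13-3 (the driver class and job name are illustrative, not from the book):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ParquetWriteConfig {
  public static Job configure() throws IOException {
    Configuration conf = new Configuration();
    // Tune the write-time properties listed in Table 13-3.
    conf.setInt("parquet.block.size", 128 * 1024 * 1024); // row group size
    conf.setInt("parquet.page.size", 1024 * 1024);
    conf.setInt("parquet.dictionary.page.size", 1024 * 1024);
    conf.setBoolean("parquet.enable.dictionary", true);
    conf.set("parquet.compression", "SNAPPY");
    return Job.getInstance(conf, "parquet-write"); // illustrative job name
  }
}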
Setting the block size is a trade-off between scanning efficiency and memory usage. Larger blocks are more efficient to scan through since they contain more rows, which improves sequential I/O (as there's less overhead in setting up each column chunk). However, each block is buffered in memory for both reading and writing, which limits how large blocks can be. The default block size is 128 MB.

The Parquet file block size should be no larger than the HDFS block size for the file, so that each Parquet block can be read from a single HDFS block (and therefore from a single datanode).
It is common to set them to be the same, and indeed both default to 128 MB.

A page is the smallest unit of storage in a Parquet file, so retrieving an arbitrary row (with a single column, for the sake of illustration) requires that the page containing the row be decompressed and decoded. Thus, for single-row lookups, it is more efficient to have smaller pages, so there are fewer values to read through before reaching the target value. However, smaller pages incur a higher storage and processing overhead, due to the extra metadata (offsets, dictionaries) resulting from more pages.
The default page size is 1 MB.

Writing and Reading Parquet Files

Most of the time Parquet files are processed using higher-level tools like Pig, Hive, or Impala, but sometimes low-level sequential access may be required, which we cover in this section.

Parquet has a pluggable in-memory data model to facilitate integration of the Parquet file format with a wide range of tools and components.