Tom White, Hadoop: The Definitive Guide, 4th edition (2015), page 102
Indeed, it doesn’t even check whether the external location exists at the time it is defined. This is a useful feature because it means you can create the data lazily after creating the table.

When you drop an external table, Hive will leave the data untouched and only delete the metadata.

So how do you choose which type of table to use? In most cases, there is not much difference between the two (except of course for the difference in DROP semantics), so it is just a matter of preference.
As a rule of thumb, if you are doing all your processing with Hive, then use managed tables, but if you wish to use Hive and other tools on the same dataset, then use external tables. A common pattern is to use an external table to access an initial dataset stored in HDFS (created by another process), then use a Hive transform to move the data into a managed Hive table.
This works the other way around, too; an external table (not necessarily on HDFS) can be used to export data from Hive for other applications to use.6 Another reason for using external tables is when you wish to associate multiple schemas with the same dataset.

Partitions and Buckets

Hive organizes tables into partitions, a way of dividing a table into coarse-grained parts based on the value of a partition column, such as a date. Using partitions can make it faster to do queries on slices of the data.

Tables or partitions may be subdivided further into buckets to give extra structure to the data that may be used for more efficient queries.
For example, bucketing by user ID means we can quickly evaluate a user-based query by running it on a randomized sample of the total set of users.

6. You can also use INSERT OVERWRITE DIRECTORY to export data to a Hadoop filesystem.

Partitions

To take an example where partitions are commonly used, imagine logfiles where each record includes a timestamp. If we partition by date, then records for the same date will be stored in the same partition. The advantage to this scheme is that queries that are restricted to a particular date or set of dates can run much more efficiently, because they only need to scan the files in the partitions that the query pertains to.
Notice that partitioning doesn’t preclude more wide-ranging queries: it is still feasible to query the entire dataset across many partitions.

A table may be partitioned in multiple dimensions. For example, in addition to partitioning logs by date, we might also subpartition each date partition by country to permit efficient queries by location.

Partitions are defined at table creation time using the PARTITIONED BY clause,7 which takes a list of column definitions. For the hypothetical logfiles example, we might define a table with records comprising a timestamp and the log line itself:

    CREATE TABLE logs (ts BIGINT, line STRING)
    PARTITIONED BY (dt STRING, country STRING);

When we load data into a partitioned table, the partition values are specified explicitly:

    LOAD DATA LOCAL INPATH 'input/hive/partitions/file1'
    INTO TABLE logs
    PARTITION (dt='2001-01-01', country='GB');

At the filesystem level, partitions are simply nested subdirectories of the table directory. After loading a few more files into the logs table, the directory structure might look like this:

    /user/hive/warehouse/logs
    ├── dt=2001-01-01/
    │   ├── country=GB/
    │   │   ├── file1
    │   │   └── file2
    │   └── country=US/
    │       └── file3
    └── dt=2001-01-02/
        ├── country=GB/
        │   └── file4
        └── country=US/
            ├── file5
            └── file6

The logs table has two date partitions (2001-01-01 and 2001-01-02, corresponding to subdirectories called dt=2001-01-01 and dt=2001-01-02) and two country subpartitions (GB and US, corresponding to nested subdirectories called country=GB and country=US). The datafiles reside in the leaf directories.

7. However, partitions may be added to or removed from a table after creation using an ALTER TABLE statement.

We can ask Hive for the partitions in a table using SHOW PARTITIONS:

    hive> SHOW PARTITIONS logs;
    dt=2001-01-01/country=GB
    dt=2001-01-01/country=US
    dt=2001-01-02/country=GB
    dt=2001-01-02/country=US

One thing to bear in mind is that the column definitions in the PARTITIONED BY clause are full-fledged table columns, called partition columns; however, the datafiles do not contain values for these columns, since they are derived from the directory names.

You can use partition columns in SELECT statements in the usual way.
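The mapping from partition values to nested subdirectories can be sketched in a few lines. This is an illustrative model, not Hive code; the `partition_path` helper and the warehouse path are assumptions for the sketch.

```python
import posixpath

def partition_path(table_dir, partition_spec):
    """Build the leaf directory for a partition, one level per column.

    partition_spec is an ordered list of (column, value) pairs, in the
    same order as the columns of the PARTITIONED BY clause.
    """
    parts = ["%s=%s" % (col, val) for col, val in partition_spec]
    return posixpath.join(table_dir, *parts)

path = partition_path("/user/hive/warehouse/logs",
                      [("dt", "2001-01-01"), ("country", "GB")])
print(path)  # /user/hive/warehouse/logs/dt=2001-01-01/country=GB
```

Note that the column order matters: dt comes before country because that is the order in which the partition columns were declared, which is why dt directories enclose country directories on disk.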
Hive performs input pruning to scan only the relevant partitions. For example, the query:

    SELECT ts, dt, line
    FROM logs
    WHERE country='GB';

will only scan file1, file2, and file4. Notice, too, that the query returns the values of the dt partition column, which Hive reads from the directory names, since they are not in the datafiles.

Buckets

There are two reasons why you might want to organize your tables (or partitions) into buckets. The first is to enable more efficient queries.
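Input pruning amounts to filtering the set of partition directories before any datafile is opened. The following toy model (an assumption for illustration, not Hive's actual implementation) reproduces the example above: with a predicate on the country partition column, only the files under matching partitions are scanned.

```python
# Partition metadata for the logs table: (dt, country) -> datafiles,
# mirroring the directory layout shown earlier.
partitions = {
    ("2001-01-01", "GB"): ["file1", "file2"],
    ("2001-01-01", "US"): ["file3"],
    ("2001-01-02", "GB"): ["file4"],
    ("2001-01-02", "US"): ["file5", "file6"],
}

def files_to_scan(partitions, country):
    # Keep only the partitions whose directory-encoded country value
    # satisfies the predicate; all other files are never read.
    return sorted(f for (dt, c), files in partitions.items()
                  if c == country
                  for f in files)

print(files_to_scan(partitions, "GB"))  # ['file1', 'file2', 'file4']
```

The pruning decision uses only the partition metadata (directory names), which is why predicates on partition columns are so cheap compared with predicates on ordinary columns.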
Bucketing imposes extra structure on the table, which Hive can take advantage of when performing certain queries. In particular, a join of two tables that are bucketed on the same columns, which include the join columns, can be efficiently implemented as a map-side join.

The second reason to bucket a table is to make sampling more efficient.
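The reason a bucketed join can run map-side can be sketched as follows. This is an illustrative model, not Hive internals: rows with the same join key always hash to the same bucket index, so bucket i of one table only ever needs to be paired with bucket i of the other.

```python
NUM_BUCKETS = 4

def bucket_of(key, n=NUM_BUCKETS):
    # Bucket assignment by hashing the value and reducing modulo the
    # number of buckets (for ints, Python's hash(key) is the key itself).
    return hash(key) % n

def bucketize(rows, key_index):
    """Distribute rows into NUM_BUCKETS lists by their join key."""
    buckets = [[] for _ in range(NUM_BUCKETS)]
    for row in rows:
        buckets[bucket_of(row[key_index])].append(row)
    return buckets

users  = bucketize([(0, "u0"), (1, "u1"), (5, "u5")], 0)
orders = bucketize([(1, "order-a"), (5, "order-b")], 0)

# Map-side join: pair bucket i of one table only with bucket i of the
# other; no cross-bucket comparisons are ever needed.
joined = [(u, o)
          for i in range(NUM_BUCKETS)
          for u in users[i]
          for o in orders[i]
          if u[0] == o[0]]
print(joined)  # [((1, 'u1'), (1, 'order-a')), ((5, 'u5'), (5, 'order-b'))]
```

Because each bucket pair can be joined independently, the work parallelizes cleanly across map tasks without a shuffle of the full tables.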
When working with large datasets, it is very convenient to try out queries on a fraction of your dataset while you are in the process of developing or refining them. We will see how to do efficient sampling at the end of this section.

First, let’s see how to tell Hive that a table should be bucketed. We use the CLUSTERED BY clause to specify the columns to bucket on and the number of buckets:

    CREATE TABLE bucketed_users (id INT, name STRING)
    CLUSTERED BY (id) INTO 4 BUCKETS;

Here we are using the user ID to determine the bucket (which Hive does by hashing the value and reducing modulo the number of buckets), so any particular bucket will effectively have a random set of users in it.
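The sampling benefit follows directly from that hashing scheme. As a minimal sketch (an assumption for illustration, not Hive's TABLESAMPLE implementation): because users are spread over the four buckets by hash(id) % 4, reading any single bucket yields roughly a quarter of the users, effectively at random with respect to every other attribute.

```python
NUM_BUCKETS = 4

def bucket_of(user_id):
    # Same assignment rule as the table definition: hash modulo 4.
    return hash(user_id) % NUM_BUCKETS

# A toy population of 100 user IDs; scanning only bucket 0 gives an
# approximately 1-in-4 sample without touching the other buckets.
user_ids = range(100)
sample = [u for u in user_ids if bucket_of(u) == 0]
print(len(sample))  # 25
```

A query that samples one bucket therefore reads about a quarter of the data, instead of scanning the whole table and discarding three quarters of the rows.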