Tom White, Hadoop: The Definitive Guide, 4th edition (2015), page 102
Indeed, it doesn’t even check whether the external location exists at the time it is defined. This is a useful feature because it means you can create the data lazily after creating the table.

When you drop an external table, Hive will leave the data untouched and only delete the metadata.

So how do you choose which type of table to use? In most cases, there is not much difference between the two (except of course for the difference in DROP semantics), so it is just a matter of preference.
As a rule of thumb, if you are doing all your processing with Hive, then use managed tables, but if you wish to use Hive and other tools on the same dataset, then use external tables. A common pattern is to use an external table to access an initial dataset stored in HDFS (created by another process), then use a Hive transform to move the data into a managed Hive table.
This works the other way around, too; an external table (not necessarily on HDFS) can be used to export data from Hive for other applications to use.6 Another reason for using external tables is when you wish to associate multiple schemas with the same dataset.

Partitions and Buckets

Hive organizes tables into partitions, a way of dividing a table into coarse-grained parts based on the value of a partition column, such as a date. Using partitions can make it faster to do queries on slices of the data.

Tables or partitions may be subdivided further into buckets to give extra structure to the data that may be used for more efficient queries.
For example, bucketing by user ID means we can quickly evaluate a user-based query by running it on a randomized sample of the total set of users.

6. You can also use INSERT OVERWRITE DIRECTORY to export data to a Hadoop filesystem.

Partitions

To take an example where partitions are commonly used, imagine logfiles where each record includes a timestamp. If we partition by date, then records for the same date will be stored in the same partition. The advantage to this scheme is that queries that are restricted to a particular date or set of dates can run much more efficiently, because they only need to scan the files in the partitions that the query pertains to.
Notice that partitioning doesn’t preclude more wide-ranging queries: it is still feasible to query the entire dataset across many partitions.

A table may be partitioned in multiple dimensions. For example, in addition to partitioning logs by date, we might also subpartition each date partition by country to permit efficient queries by location.

Partitions are defined at table creation time using the PARTITIONED BY clause,7 which takes a list of column definitions. For the hypothetical logfiles example, we might define a table with records comprising a timestamp and the log line itself:

    CREATE TABLE logs (ts BIGINT, line STRING)
    PARTITIONED BY (dt STRING, country STRING);

When we load data into a partitioned table, the partition values are specified explicitly:

    LOAD DATA LOCAL INPATH 'input/hive/partitions/file1'
    INTO TABLE logs
    PARTITION (dt='2001-01-01', country='GB');

At the filesystem level, partitions are simply nested subdirectories of the table directory. After loading a few more files into the logs table, the directory structure might look like this:

    /user/hive/warehouse/logs
    ├── dt=2001-01-01/
    │   ├── country=GB/
    │   │   ├── file1
    │   │   └── file2
    │   └── country=US/
    │       └── file3
    └── dt=2001-01-02/
        ├── country=GB/
        │   └── file4
        └── country=US/
            ├── file5
            └── file6

The logs table has two date partitions (2001-01-01 and 2001-01-02, corresponding to subdirectories called dt=2001-01-01 and dt=2001-01-02) and two country subpartitions (GB and US, corresponding to nested subdirectories called country=GB and country=US). The datafiles reside in the leaf directories.

7. However, partitions may be added to or removed from a table after creation using an ALTER TABLE statement.

We can ask Hive for the partitions in a table using SHOW PARTITIONS:

    hive> SHOW PARTITIONS logs;
    dt=2001-01-01/country=GB
    dt=2001-01-01/country=US
    dt=2001-01-02/country=GB
    dt=2001-01-02/country=US

One thing to bear in mind is that the column definitions in the PARTITIONED BY clause are full-fledged table columns, called partition columns; however, the datafiles do not contain values for these columns, since they are derived from the directory names.

You can use partition columns in SELECT statements in the usual way.
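The mapping from partition values to nested subdirectories can be sketched in a few lines. This is an illustrative model, not Hive code; the `partition_path` helper and the warehouse path are assumptions for the sketch.

```python
import posixpath

def partition_path(table_dir, partition_spec):
    """Build the leaf directory for a partition, one level per column.

    partition_spec is an ordered list of (column, value) pairs, in the
    same order as the columns of the PARTITIONED BY clause.
    """
    parts = ["%s=%s" % (col, val) for col, val in partition_spec]
    return posixpath.join(table_dir, *parts)

path = partition_path("/user/hive/warehouse/logs",
                      [("dt", "2001-01-01"), ("country", "GB")])
print(path)  # /user/hive/warehouse/logs/dt=2001-01-01/country=GB
```

Note that the column order matters: dt comes before country because that is the order in which the partition columns were declared, which is why dt directories enclose country directories on disk.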
Hive performs input pruning to scan only the relevant partitions. For example, the query:

    SELECT ts, dt, line
    FROM logs
    WHERE country='GB';

will only scan file1, file2, and file4. Notice, too, that the query returns the values of the dt partition column, which Hive reads from the directory names, since they are not in the datafiles.

Buckets

There are two reasons why you might want to organize your tables (or partitions) into buckets. The first is to enable more efficient queries.
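Input pruning amounts to filtering the set of partition directories before any datafile is opened. The following toy model (an assumption for illustration, not Hive's actual implementation) reproduces the example above: with a predicate on the country partition column, only the files under matching partitions are scanned.

```python
# Partition metadata for the logs table: (dt, country) -> datafiles,
# mirroring the directory layout shown earlier.
partitions = {
    ("2001-01-01", "GB"): ["file1", "file2"],
    ("2001-01-01", "US"): ["file3"],
    ("2001-01-02", "GB"): ["file4"],
    ("2001-01-02", "US"): ["file5", "file6"],
}

def files_to_scan(partitions, country):
    # Keep only the partitions whose directory-encoded country value
    # satisfies the predicate; all other files are never read.
    return sorted(f for (dt, c), files in partitions.items()
                  if c == country
                  for f in files)

print(files_to_scan(partitions, "GB"))  # ['file1', 'file2', 'file4']
```

The pruning decision uses only the partition metadata (directory names), which is why predicates on partition columns are so cheap compared with predicates on ordinary columns.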
Bucketing imposes extra structure on the table, which Hive can take advantage of when performing certain queries. In particular, a join of two tables that are bucketed on the same columns, which include the join columns, can be efficiently implemented as a map-side join.

The second reason to bucket a table is to make sampling more efficient.
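The reason a bucketed join can run map-side can be sketched as follows. This is an illustrative model, not Hive internals: rows with the same join key always hash to the same bucket index, so bucket i of one table only ever needs to be paired with bucket i of the other.

```python
NUM_BUCKETS = 4

def bucket_of(key, n=NUM_BUCKETS):
    # Bucket assignment by hashing the value and reducing modulo the
    # number of buckets (for ints, Python's hash(key) is the key itself).
    return hash(key) % n

def bucketize(rows, key_index):
    """Distribute rows into NUM_BUCKETS lists by their join key."""
    buckets = [[] for _ in range(NUM_BUCKETS)]
    for row in rows:
        buckets[bucket_of(row[key_index])].append(row)
    return buckets

users  = bucketize([(0, "u0"), (1, "u1"), (5, "u5")], 0)
orders = bucketize([(1, "order-a"), (5, "order-b")], 0)

# Map-side join: pair bucket i of one table only with bucket i of the
# other; no cross-bucket comparisons are ever needed.
joined = [(u, o)
          for i in range(NUM_BUCKETS)
          for u in users[i]
          for o in orders[i]
          if u[0] == o[0]]
print(joined)  # [((1, 'u1'), (1, 'order-a')), ((5, 'u5'), (5, 'order-b'))]
```

Because each bucket pair can be joined independently, the work parallelizes cleanly across map tasks without a shuffle of the full tables.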
When working with large datasets, it is very convenient to try out queries on a fraction of your dataset while you are in the process of developing or refining them. We will see how to do efficient sampling at the end of this section.

First, let’s see how to tell Hive that a table should be bucketed. We use the CLUSTERED BY clause to specify the columns to bucket on and the number of buckets:

    CREATE TABLE bucketed_users (id INT, name STRING)
    CLUSTERED BY (id) INTO 4 BUCKETS;

Here we are using the user ID to determine the bucket (which Hive does by hashing the value and reducing modulo the number of buckets), so any particular bucket will effectively have a random set of users in it.
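The sampling benefit follows directly from that hashing scheme. As a minimal sketch (an assumption for illustration, not Hive's TABLESAMPLE implementation): because users are spread over the four buckets by hash(id) % 4, reading any single bucket yields roughly a quarter of the users, effectively at random with respect to every other attribute.

```python
NUM_BUCKETS = 4

def bucket_of(user_id):
    # Same assignment rule as the table definition: hash modulo 4.
    return hash(user_id) % NUM_BUCKETS

# A toy population of 100 user IDs; scanning only bucket 0 gives an
# approximately 1-in-4 sample without touching the other buckets.
user_ids = range(100)
sample = [u for u in user_ids if bucket_of(u) == 0]
print(len(sample))  # 25
```

A query that samples one bucket therefore reads about a quarter of the data, instead of scanning the whole table and discarding three quarters of the rows.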