Tom White - Hadoop The Definitive Guide_ 4 edition - 2015 (811394), page 99


The following commands will create the directories and set their permissions appropriately:

    % hadoop fs -mkdir /tmp
    % hadoop fs -chmod a+w /tmp
    % hadoop fs -mkdir -p /user/hive/warehouse
    % hadoop fs -chmod a+w /user/hive/warehouse

If all users are in the same group, then permissions g+w are sufficient on the warehouse directory.

You can change settings from within a session, too, using the SET command. This is useful for changing Hive settings for a particular query. For example, the following command ensures buckets are populated according to the table definition (see “Buckets” on page 493):

    hive> SET hive.enforce.bucketing=true;

To see the current value of any property, use SET with just the property name:

    hive> SET hive.enforce.bucketing;
    hive.enforce.bucketing=true

By itself, SET will list all the properties (and their values) set by Hive.

Note that the list will not include Hadoop defaults, unless they have been explicitly overridden in one of the ways covered in this section. Use SET -v to list all the properties in the system, including Hadoop defaults.

Chapter 17: Hive

There is a precedence hierarchy to setting properties. In the following list, lower numbers take precedence over higher numbers:

1. The Hive SET command
2. The command-line -hiveconf option
3. hive-site.xml and the Hadoop site files (core-site.xml, hdfs-site.xml, mapred-site.xml, and yarn-site.xml)
4. The Hive defaults and the Hadoop default files (core-default.xml, hdfs-default.xml, mapred-default.xml, and yarn-default.xml)

Setting configuration properties for Hadoop is covered in more detail in “Which Properties Can I Set?” on page 150.

Execution engines

Hive was originally written to use MapReduce as its execution engine, and that is still the default.
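Returning to the precedence hierarchy for property settings listed above: the lookup order can be modeled as a chain of dictionaries searched from the highest-precedence layer to the lowest. The following is only a minimal sketch of that idea, not how Hive resolves properties internally. The layer names, the `lookup` helper, and the sample values (including the `/data/warehouse` path) are assumptions made for illustration.

```python
from collections import ChainMap

# Hypothetical layers, one per entry in the numbered list (1 = highest precedence).
# All values below are made up for illustration.
set_command = {"hive.execution.engine": "tez"}                 # 1. SET command
hiveconf_option = {"hive.execution.engine": "mr",              # 2. -hiveconf option
                   "hive.root.logger": "DEBUG,console"}
site_files = {"hive.metastore.warehouse.dir": "/data/warehouse"}  # 3. site files
defaults = {"hive.execution.engine": "mr",                     # 4. default files
            "hive.metastore.warehouse.dir": "/user/hive/warehouse"}

# ChainMap searches its maps left to right, so earlier layers win.
effective = ChainMap(set_command, hiveconf_option, site_files, defaults)


def lookup(name):
    """Return the value from the highest-precedence layer that defines name."""
    return effective.get(name)


print(lookup("hive.execution.engine"))
print(lookup("hive.metastore.warehouse.dir"))
```

In this model, a property set in several layers takes its value from the lowest-numbered layer, and a property that no layer defines yields `None`.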

It is now also possible to run Hive using Apache Tez as its execution engine, and work is underway to support Spark (see Chapter 19), too. Both Tez and Spark are general directed acyclic graph (DAG) engines that offer more flexibility and higher performance than MapReduce. For example, unlike MapReduce, where intermediate job output is materialized to HDFS, Tez and Spark can avoid replication overhead by writing the intermediate output to local disk, or even store it in memory (at the request of the Hive planner).

The execution engine is controlled by the hive.execution.engine property, which defaults to mr (for MapReduce). It’s easy to switch the execution engine on a per-query basis, so you can see the effect of a different engine on a particular query.

Set Hive to use Tez as follows:

    hive> SET hive.execution.engine=tez;

Note that Tez needs to be installed on the Hadoop cluster first; see the Hive documentation for up-to-date details on how to do this.

Logging

You can find Hive’s error log on the local filesystem at ${java.io.tmpdir}/${user.name}/hive.log. It can be very useful when trying to diagnose configuration problems or other types of error. Hadoop’s MapReduce task logs are also a useful resource for troubleshooting; see “Hadoop Logs” on page 172 for where to find them.

On many systems, ${java.io.tmpdir} is /tmp, but if it’s not, or if you want to set the logging directory to be another location, then use the following:

    % hive -hiveconf hive.log.dir='/tmp/${user.name}'

The logging configuration is in conf/hive-log4j.properties, and you can edit this file to change log levels and other logging-related settings.

However, often it’s more convenient to set logging configuration for the session. For example, the following handy invocation will send debug messages to the console:

    % hive -hiveconf hive.root.logger=DEBUG,console

Hive Services

The Hive shell is only one of several services that you can run using the hive command. You can specify the service to run using the --service option. Type hive --service help to get a list of available service names; some of the most useful ones are described in the following list:

cli
    The command-line interface to Hive (the shell). This is the default service.

hiveserver2
    Runs Hive as a server exposing a Thrift service, enabling access from a range of clients written in different languages.

    HiveServer 2 improves on the original HiveServer by supporting authentication and multiuser concurrency. Applications using the Thrift, JDBC, and ODBC connectors need to run a Hive server to communicate with Hive. Set the hive.server2.thrift.port configuration property to specify the port the server will listen on (defaults to 10000).

beeline
    A command-line interface to Hive that works in embedded mode (like the regular CLI), or by connecting to a HiveServer 2 process using JDBC.

hwi
    The Hive Web Interface. A simple web interface that can be used as an alternative to the CLI without having to install any client software. See also Hue for a more fully featured Hadoop web interface that includes applications for running Hive queries and browsing the Hive metastore.

jar
    The Hive equivalent of hadoop jar, a convenient way to run Java applications that includes both Hadoop and Hive classes on the classpath.

metastore
    By default, the metastore is run in the same process as the Hive service.

    Using this service, it is possible to run the metastore as a standalone (remote) process. Set the METASTORE_PORT environment variable (or use the -p command-line option) to specify the port the server will listen on (defaults to 9083).

Hive clients

If you run Hive as a server (hive --service hiveserver2), there are a number of different mechanisms for connecting to it from applications (the relationship between Hive clients and Hive services is illustrated in Figure 17-1):

Thrift Client
    The Hive server is exposed as a Thrift service, so it’s possible to interact with it using any programming language that supports Thrift.

    There are third-party projects providing clients for Python and Ruby; for more details, see the Hive wiki.

JDBC driver
    Hive provides a Type 4 (pure Java) JDBC driver, defined in the class org.apache.hive.jdbc.HiveDriver. When configured with a JDBC URI of the form jdbc:hive2://host:port/dbname, a Java application will connect to a Hive server running in a separate process at the given host and port.

    (The driver makes calls to an interface implemented by the Hive Thrift Client using the Java Thrift bindings.)

    You may alternatively choose to connect to Hive via JDBC in embedded mode using the URI jdbc:hive2://. In this mode, Hive runs in the same JVM as the application invoking it; there is no need to launch it as a standalone server, since it does not use the Thrift service or the Hive Thrift Client.

    The Beeline CLI uses the JDBC driver to communicate with Hive.

ODBC driver
    An ODBC driver allows applications that support the ODBC protocol (such as business intelligence software) to connect to Hive. The Apache Hive distribution does not ship with an ODBC driver, but several vendors make one freely available. (Like the JDBC driver, ODBC drivers use Thrift to communicate with the Hive server.)

Figure 17-1. Hive architecture

The Metastore

The metastore is the central repository of Hive metadata. The metastore is divided into two pieces: a service and the backing store for the data. By default, the metastore service runs in the same JVM as the Hive service and contains an embedded Derby database instance backed by the local disk. This is called the embedded metastore configuration (see Figure 17-2).

Using an embedded metastore is a simple way to get started with Hive; however, only one embedded Derby database can access the database files on disk at any one time, which means you can have only one Hive session open at a time that accesses the same metastore. Trying to start a second session produces an error when it attempts to open a connection to the metastore.

The solution to supporting multiple sessions (and therefore multiple users) is to use a standalone database.

This configuration is referred to as a local metastore, since the metastore service still runs in the same process as the Hive service but connects to a database running in a separate process, either on the same machine or on a remote machine. Any JDBC-compliant database may be used by setting the javax.jdo.option.* configuration properties listed in Table 17-1.[3]

[3] The properties have the javax.jdo prefix because the metastore implementation uses the Java Data Objects (JDO) API for persisting Java objects. Specifically, it uses the DataNucleus implementation of JDO.

Figure 17-2. Metastore configurations

MySQL is a popular choice for the standalone metastore. In this case, the javax.jdo.option.ConnectionURL property is set to jdbc:mysql://host/dbname?createDatabaseIfNotExist=true, and javax.jdo.option.ConnectionDriverName is set to com.mysql.jdbc.Driver.
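As a rough illustration of the MySQL settings just described, the sketch below renders a set of metastore properties in the `<property><name>…</name><value>…</value></property>` layout used by Hadoop-style site files such as hive-site.xml. The `render_hive_site` helper is hypothetical, and the host `dbhost`, database name `metastore`, username `hiveuser`, and password `change-me` are placeholders rather than values from the text.

```python
import xml.etree.ElementTree as ET


def render_hive_site(properties):
    """Render a dict of property names and values as a <configuration> document."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


# Placeholder connection details; substitute your own host, database, and credentials.
mysql_metastore = {
    "javax.jdo.option.ConnectionURL":
        "jdbc:mysql://dbhost/metastore?createDatabaseIfNotExist=true",
    "javax.jdo.option.ConnectionDriverName": "com.mysql.jdbc.Driver",
    "javax.jdo.option.ConnectionUserName": "hiveuser",
    "javax.jdo.option.ConnectionPassword": "change-me",
}

xml_text = render_hive_site(mysql_metastore)
print(xml_text)
```

The generated text is meant to be reviewed and merged into an existing hive-site.xml by hand; the sketch does not write any file.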

(The username and password should be set too, of course.) The JDBC driver JAR file for MySQL (Connector/J) must be on Hive’s classpath, which is simply achieved by placing it in Hive’s lib directory.

Going a step further, there’s another metastore configuration called a remote metastore, where one or more metastore servers run in separate processes to the Hive service. This brings better manageability and security because the database tier can be completely firewalled off, and the clients no longer need the database credentials.

A Hive service is configured to use a remote metastore by setting hive.metastore.uris to the metastore server URI(s), separated by commas if there is more than one. Metastore server URIs are of the form thrift://host:port, where the port corresponds to the one set by METASTORE_PORT when starting the metastore server (see “Hive Services” on page 478).

Table 17-1. Important metastore configuration properties

Property name | Type | Default value | Description
hive.metastore.warehouse.dir | URI | /user/hive/warehouse | The directory relative to fs.defaultFS where managed tables are stored.
hive.metastore.uris | Comma-separated URIs | Not set | If not set (the default), use an in-process metastore; otherwise, connect to one or more remote metastores, specified by a list of URIs. Clients connect in a round-robin fashion when there are multiple remote servers.
javax.jdo.option.ConnectionURL | URI | jdbc:derby:;databaseName=metastore_db;create=true | The JDBC URL of the metastore database.
javax.jdo.option.ConnectionDriverName | String | org.apache.derby.jdbc.EmbeddedDriver | The JDBC driver classname.
javax.jdo.option.ConnectionUserName | String | APP | The JDBC username.
javax.jdo.option.ConnectionPassword | String | mine | The JDBC password.

Comparison with Traditional Databases

Although Hive resembles a traditional database in many ways (such as supporting a SQL interface), its original HDFS and MapReduce underpinnings mean that there are a number of architectural differences that have directly influenced the features that Hive supports.

Over time, however, these limitations have been (and continue to be) removed, with the result that Hive looks and feels more like a traditional database with every year that passes.

Schema on Read Versus Schema on Write

In a traditional database, a table’s schema is enforced at data load time. If the data being loaded doesn’t conform to the schema, then it is rejected. This design is sometimes called schema on write because the data is checked against the schema when it is written into the database.

Hive, on the other hand, doesn’t verify the data when it is loaded, but rather when a query is issued. This is called schema on read.

There are trade-offs between the two approaches. Schema on read makes for a very fast initial load, since the data does not have to be read, parsed, and serialized to disk in the database’s internal format.

The load operation is just a file copy or move. It is more flexible, too: consider having two schemas for the same underlying data, depending on the analysis being performed. (This is possible in Hive using external tables; see “Managed Tables and External Tables” on page 490.)

Schema on write makes query time performance faster because the database can index columns and perform compression on the data. The trade-off, however, is that it takes longer to load data into the database. Furthermore, there are many scenarios where the schema is not known at load time, so there are no indexes to apply, because the queries have not been formulated yet. These scenarios are where Hive shines.

Updates, Transactions, and Indexes

Updates, transactions, and indexes are mainstays of traditional databases. Yet, until recently, these features have not been considered a part of Hive’s feature set.
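To make the schema-on-read versus schema-on-write distinction discussed earlier in this section more concrete, here is a toy model in Python. It is a sketch under stated assumptions, not Hive code: the two-column `SCHEMA`, the tab-separated input, and all function names are invented for illustration, and the choice to surface an unparseable field as `None` at query time is an assumption of the sketch.

```python
# Hypothetical two-column schema: (column name, converter).
SCHEMA = [("id", int), ("name", str)]


def load_schema_on_write(lines):
    """Parse and validate every row at load time; one bad row rejects the load."""
    table = []
    for line in lines:
        fields = line.split("\t")
        if len(fields) != len(SCHEMA):
            raise ValueError("wrong number of fields: %r" % line)
        # int("oops") raises ValueError, so nonconforming data never gets in.
        table.append({col: conv(raw) for (col, conv), raw in zip(SCHEMA, fields)})
    return table


def load_schema_on_read(lines):
    """Loading only keeps the raw lines, comparable to a file copy or move."""
    return list(lines)


def query_schema_on_read(raw_lines):
    """Apply the schema at query time; unparseable fields become None here."""
    rows = []
    for line in raw_lines:
        fields = line.split("\t")
        row = {}
        for i, (col, conv) in enumerate(SCHEMA):
            try:
                row[col] = conv(fields[i])
            except (IndexError, ValueError):
                row[col] = None
        rows.append(row)
    return rows


data = ["1\talice", "oops\tbob"]
raw = load_schema_on_read(data)      # succeeds without inspecting the data
rows = query_schema_on_read(raw)     # the schema is applied only now
print(rows)
```

Calling `load_schema_on_write(data)` on the same input raises `ValueError` on the second row, which mirrors the trade-off in the text: the checking work moves from load time to query time.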
