After trying a few different systems, the team chose Hadoop for storage and processing, since it was cost effective and met the scalability requirements.

Hive was created to make it possible for analysts with strong SQL skills (but meager Java programming skills) to run queries on the huge volumes of data that Facebook stored in HDFS. Today, Hive is a successful Apache project used by many organizations as a general-purpose, scalable data processing platform.

Of course, SQL isn't ideal for every big data problem—it's not a good fit for building complex machine-learning algorithms, for example—but it's great for many analyses, and it has the huge advantage of being very well known in the industry. What's more, SQL is the lingua franca in business intelligence tools (ODBC is a common bridge, for example), so Hive is well placed to integrate with these products.

This chapter is an introduction to using Hive. It assumes that you have working knowledge of SQL and general database architecture; as we go through Hive's features, we'll often compare them to the equivalent in a traditional RDBMS.
1. Toby Segaran and Jeff Hammerbacher, Beautiful Data: The Stories Behind Elegant Data Solutions (O'Reilly, 2009).

Installing Hive

In normal use, Hive runs on your workstation and converts your SQL query into a series of jobs for execution on a Hadoop cluster. Hive organizes data into tables, which provide a means for attaching structure to data stored in HDFS.
Metadata—such as table schemas—is stored in a database called the metastore.

When starting out with Hive, it is convenient to run the metastore on your local machine. In this configuration, which is the default, the Hive table definitions that you create will be local to your machine, so you can't share them with other users. We'll see how to configure a shared remote metastore, which is the norm in production environments, in "The Metastore" on page 480.

Installation of Hive is straightforward. As a prerequisite, you need to have the same version of Hadoop installed locally that your cluster is running.2 Of course, you may choose to run Hadoop locally, either in standalone or pseudodistributed mode, while getting started with Hive.
These options are all covered in Appendix A.

Which Versions of Hadoop Does Hive Work With?

Any given release of Hive is designed to work with multiple versions of Hadoop. Generally, Hive works with the latest stable release of Hadoop, as well as supporting a number of older versions, listed in the release notes. You don't need to do anything special to tell Hive which version of Hadoop you are using, beyond making sure that the hadoop executable is on the path or setting the HADOOP_HOME environment variable.

Download a release, and unpack the tarball in a suitable place on your workstation:

% tar xzf apache-hive-x.y.z-bin.tar.gz

It's handy to put Hive on your path to make it easy to launch:

% export HIVE_HOME=~/sw/apache-hive-x.y.z-bin
% export PATH=$PATH:$HIVE_HOME/bin

Now type hive to launch the Hive shell:

% hive
hive>
2. It is assumed that you have network connectivity from your workstation to the Hadoop cluster. You can test this before running Hive by installing Hadoop locally and performing some HDFS operations with the hadoop fs command.

The Hive Shell

The shell is the primary way that we will interact with Hive, by issuing commands in HiveQL. HiveQL is Hive's query language, a dialect of SQL. It is heavily influenced by MySQL, so if you are familiar with MySQL, you should feel at home using Hive.

When starting Hive for the first time, we can check that it is working by listing its tables—there should be none.
The command must be terminated with a semicolon to tell Hive to execute it:

hive> SHOW TABLES;
OK
Time taken: 0.473 seconds

Like SQL, HiveQL is generally case insensitive (except for string comparisons), so show tables; works equally well here. The Tab key will autocomplete Hive keywords and functions.

For a fresh install, the command takes a few seconds to run as it lazily creates the metastore database on your machine. (The database stores its files in a directory called metastore_db, which is relative to the location from which you ran the hive command.)

You can also run the Hive shell in noninteractive mode. The -f option runs the commands in the specified file, which is script.q in this example:

% hive -f script.q
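A script file is simply a series of HiveQL statements, each terminated by a semicolon. As a hypothetical sketch, script.q might contain something like the following (assuming the dummy table created below exists):

-- script.q: an illustrative script, not from the book's example code
SHOW TABLES;
SELECT * FROM dummy;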
For short scripts, you can use the -e option to specify the commands inline, in which case the final semicolon is not required:

% hive -e 'SELECT * FROM dummy'
OK
X
Time taken: 1.22 seconds, Fetched: 1 row(s)

It's useful to have a small table of data to test queries against, such as trying out functions in SELECT expressions using literal data (see "Operators and Functions" on page 488). Here's one way of populating a single-row table:

% echo 'X' > /tmp/dummy.txt
% hive -e "CREATE TABLE dummy (value STRING); \
LOAD DATA LOCAL INPATH '/tmp/dummy.txt' \
OVERWRITE INTO TABLE dummy"

In both interactive and noninteractive mode, Hive will print information to standard error—such as the time taken to run a query—during the course of operation. You can suppress these messages using the -S option at launch time, which has the effect of showing only the output result for queries:

% hive -S -e 'SELECT * FROM dummy'
X

Other useful Hive shell features include the ability to run commands on the host operating system by using a ! prefix to the command and the ability to access Hadoop filesystems using the dfs command.
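For example, the following illustrative session (not from the book's examples) runs a host command and then lists Hive's warehouse directory, assumed here to be at its default location; note that both forms are terminated with a semicolon, like any other shell command:

hive> ! echo testing;
testing
hive> dfs -ls /user/hive/warehouse;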
An Example

Let's see how to use Hive to run a query on the weather dataset we explored in earlier chapters. The first step is to load the data into Hive's managed storage. Here we'll have Hive use the local filesystem for storage; later we'll see how to store tables in HDFS.

Just like an RDBMS, Hive organizes its data into tables. We create a table to hold the weather data using the CREATE TABLE statement:

CREATE TABLE records (year STRING, temperature INT, quality INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';

The first line declares a records table with three columns: year, temperature, and quality. The type of each column must be specified, too.
Here the year is a string, while the other two columns are integers.

So far, the SQL is familiar. The ROW FORMAT clause, however, is particular to HiveQL. This declaration is saying that each row in the data file is tab-delimited text. Hive expects there to be three fields in each row, corresponding to the table columns, with fields separated by tabs and rows by newlines.
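To make the expected layout concrete, here is an illustrative fragment of a file in this format (the fields are separated by literal tab characters); the values are chosen to be consistent with the query results shown later in this section, though the actual sample file is described in earlier chapters:

1950	0	1
1950	22	1
1949	111	1
1949	78	1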
Next, we can populate Hive with the data. This is just a small sample, for exploratory purposes:

LOAD DATA LOCAL INPATH 'input/ncdc/micro-tab/sample.txt'
OVERWRITE INTO TABLE records;

Running this command tells Hive to put the specified local file in its warehouse directory. This is a simple filesystem operation. There is no attempt, for example, to parse the file and store it in an internal database format, because Hive does not mandate any particular file format. Files are stored verbatim; they are not modified by Hive.
In this example, we are storing Hive tables on the local filesystem (fs.defaultFS is set to its default value of file:///). Tables are stored as directories under Hive's warehouse directory, which is controlled by the hive.metastore.warehouse.dir property and defaults to /user/hive/warehouse.

Thus, the files for the records table are found in the /user/hive/warehouse/records directory on the local filesystem:

% ls /user/hive/warehouse/records/
sample.txt

In this case, there is only one file, sample.txt, but in general there can be more, and Hive will read all of them when querying the table.

The OVERWRITE keyword in the LOAD DATA statement tells Hive to delete any existing files in the directory for the table. If it is omitted, the new files are simply added to the table's directory (unless they have the same names, in which case they replace the old files).

Now that the data is in Hive, we can run a query against it:

hive> SELECT year, MAX(temperature)
    > FROM records
    > WHERE temperature != 9999 AND quality IN (0, 1, 4, 5, 9)
    > GROUP BY year;
1949	111
1950	22

This SQL query is unremarkable.
It is a SELECT statement with a GROUP BY clause for grouping rows into years, which uses the MAX aggregate function to find the maximum temperature for each year group. The remarkable thing is that Hive transforms this query into a job, which it executes on our behalf, then prints the results to the console. There are some nuances, such as the SQL constructs that Hive supports and the format of the data that we can query—and we explore some of these in this chapter—but it is the ability to execute SQL queries against our raw data that gives Hive its power.

Running Hive

In this section, we look at some more practical aspects of running Hive, including how to set up Hive to run against a Hadoop cluster and a shared metastore. In doing so, we'll see Hive's architecture in some detail.

Configuring Hive

Hive is configured using an XML configuration file like Hadoop's.
The file is called hive-site.xml and is located in Hive's conf directory. This file is where you can set properties that you want to set every time you run Hive. The same directory contains hive-default.xml, which documents the properties that Hive exposes and their default values.

You can override the configuration directory that Hive looks for in hive-site.xml by passing the --config option to the hive command:

% hive --config /Users/tom/dev/hive-conf

Note that this option specifies the containing directory, not hive-site.xml itself.
It can be useful when you have multiple site files—for different clusters, say—that you switch between on a regular basis. Alternatively, you can set the HIVE_CONF_DIR environment variable to the configuration directory for the same effect.

The hive-site.xml file is a natural place to put the cluster connection details: you can specify the filesystem and resource manager using the usual Hadoop properties, fs.defaultFS and yarn.resourcemanager.address (see Appendix A for more details on configuring Hadoop). If not set, they default to the local filesystem and the local (in-process) job runner—just like they do in Hadoop—which is very handy when trying out Hive on small trial datasets. Metastore configuration settings (covered in "The Metastore" on page 480) are commonly found in hive-site.xml, too.
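As a minimal sketch, a hive-site.xml for a pseudodistributed setup might look like the following; the values mirror the per-session -hiveconf settings shown below, and the hostnames and port are assumptions to adapt to your own cluster:

<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost/</value>
  </property>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>localhost:8032</value>
  </property>
</configuration>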
Hive also permits you to set properties on a per-session basis, by passing the -hiveconf option to the hive command. For example, the following command sets the cluster (in this case, to a pseudodistributed cluster) for the duration of the session:

% hive -hiveconf fs.defaultFS=hdfs://localhost \
-hiveconf mapreduce.framework.name=yarn \
-hiveconf yarn.resourcemanager.address=localhost:8032

If you plan to have more than one Hive user sharing a Hadoop cluster, you need to make the directories that Hive uses writable by all users.
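A minimal sketch of doing so, assuming the warehouse directory is at its default location (/user/hive/warehouse) and that Hive's scratch space is the shared /tmp directory on the cluster filesystem:

% hadoop fs -mkdir /tmp
% hadoop fs -chmod a+w /tmp
% hadoop fs -mkdir -p /user/hive/warehouse
% hadoop fs -chmod a+w /user/hive/warehouse

The chmod a+w setting makes each directory world-writable, so any user's queries can create files there.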