This is because Hive was built to operate over HDFS data using MapReduce, where full-table scans are the norm and a table update is achieved by transforming the data into a new table. For a data warehousing application that runs over large portions of the dataset, this works well.

Hive has long supported adding new rows in bulk to an existing table by using INSERT INTO to add new data files to a table. From release 0.14.0, finer-grained changes are possible, so you can call INSERT INTO TABLE...VALUES to insert small batches of values computed in SQL.
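For example, a small batch of rows can be appended with a statement along the following lines (a minimal sketch; the pageviews table and its columns are invented for illustration):

    -- Append a handful of literal rows to an existing table
    -- (hypothetical table and columns, shown for illustration only)
    INSERT INTO TABLE pageviews
    VALUES ('index.html', '2014-09-23', 3), ('about.html', '2014-09-23', 1);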
In addition, it is possible to UPDATE and DELETE rows in a table.

HDFS does not provide in-place file updates, so changes resulting from inserts, updates, and deletes are stored in small delta files. Delta files are periodically merged into the base table files by MapReduce jobs that are run in the background by the metastore. These features only work in the context of transactions (introduced in Hive 0.13.0), so the table they are being used on needs to have transactions enabled on it. Queries reading the table are guaranteed to see a consistent snapshot of the table.
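As a rough sketch of what this looks like in practice, a table intended for updates and deletes is declared with transactions enabled; in the initial releases of Hive's ACID support such a table is also expected to be bucketed and stored as ORC, and the transaction-related configuration properties must be set. The table and values here are invented for illustration:

    -- Hypothetical transaction-enabled table; early ACID releases expect
    -- bucketing and ORC storage for tables that accept updates and deletes
    CREATE TABLE user_events (id INT, action STRING)
    CLUSTERED BY (id) INTO 4 BUCKETS
    STORED AS ORC
    TBLPROPERTIES ('transactional' = 'true');

    -- Fine-grained changes; the deltas are merged into the base files later
    UPDATE user_events SET action = 'login' WHERE id = 3;
    DELETE FROM user_events WHERE id = 7;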
Hive also has support for table- and partition-level locking. Locks prevent, for example, one process from dropping a table while another is reading from it. Locks are managed transparently using ZooKeeper, so the user doesn't have to acquire or release them, although it is possible to get information about which locks are being held via the SHOW LOCKS statement. By default, locks are not enabled.

Hive indexes can speed up queries in certain cases. A query such as SELECT * from t WHERE x = a, for example, can take advantage of an index on column x, since only a small portion of the table's files need to be scanned. There are currently two index types: compact and bitmap. (The index implementation was designed to be pluggable, so it's expected that a variety of implementations will emerge for different use cases.)
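As a sketch of the statements involved, using the table t and column x from the query above (the index names are invented, and WITH DEFERRED REBUILD means the index is only populated when explicitly rebuilt):

    -- Inspect locks currently held (once locking is enabled)
    SHOW LOCKS;

    -- A compact index on column x of table t
    CREATE INDEX t_x_idx ON TABLE t (x)
    AS 'COMPACT' WITH DEFERRED REBUILD;
    ALTER INDEX t_x_idx ON t REBUILD;

    -- A bitmap index is declared the same way with the 'BITMAP' handler
    CREATE INDEX t_x_bitmap_idx ON TABLE t (x)
    AS 'BITMAP' WITH DEFERRED REBUILD;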
Compact indexes store the HDFS block numbers of each value, rather than each file offset, so they don't take up much disk space but are still effective for the case where values are clustered together in nearby rows. Bitmap indexes use compressed bitsets to efficiently store the rows that a particular value appears in, and they are usually appropriate for low-cardinality columns (such as gender or country).

SQL-on-Hadoop Alternatives

In the years since Hive was created, many other SQL-on-Hadoop engines have emerged to address some of Hive's limitations. Cloudera Impala, an open source interactive SQL engine, was one of the first, giving an order of magnitude performance boost compared to Hive running on MapReduce. Impala uses a dedicated daemon that runs on each datanode in the cluster.
When a client runs a query it contacts an arbitrary node running an Impala daemon, which acts as a coordinator node for the query. The coordinator sends work to other Impala daemons in the cluster and combines their results into the full result set for the query. Impala uses the Hive metastore and supports Hive formats and most HiveQL constructs (plus SQL-92), so in practice it is straightforward to migrate between the two systems, or to run both on the same cluster.

Hive has not stood still, though, and since Impala was launched, the "Stinger" initiative by Hortonworks has improved the performance of Hive through support for Tez as an execution engine, and the addition of a vectorized query engine, among other improvements.
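For example, switching a Hive session over to Tez and enabling vectorized execution comes down to setting a couple of properties, assuming Tez is installed on the cluster:

    -- Run this session's queries on Tez instead of MapReduce
    SET hive.execution.engine=tez;

    -- Process rows in batches rather than one at a time
    SET hive.vectorized.execution.enabled=true;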
Other prominent open source Hive alternatives include Presto from Facebook, Apache Drill, and Spark SQL. Presto and Drill have similar architectures to Impala, although Drill targets SQL:2011 rather than HiveQL. Spark SQL uses Spark as its underlying engine, and lets you embed SQL queries in Spark programs.

Spark SQL is different to using the Spark execution engine from within Hive ("Hive on Spark," see "Execution engines" on page 477). Hive on Spark provides all the features of Hive since it is a part of the Hive project. Spark SQL, on the other hand, is a new SQL engine that offers some level of Hive compatibility.

Apache Phoenix takes a different approach entirely: it provides SQL on HBase. SQL access is through a JDBC driver that turns queries into HBase scans and takes advantage of HBase coprocessors to perform server-side aggregation.
Metadata is stored in HBase, too.

HiveQL

Hive's SQL dialect, called HiveQL, is a mixture of SQL-92, MySQL, and Oracle's SQL dialect. The level of SQL-92 support has improved over time, and will likely continue to get better. HiveQL also provides features from later SQL standards, such as window functions (also known as analytic functions) from SQL:2003. Some of Hive's nonstandard extensions to SQL were inspired by MapReduce, such as multitable inserts (see "Multitable insert" on page 501) and the TRANSFORM, MAP, and REDUCE clauses (see "MapReduce Scripts" on page 503).

This chapter does not provide a complete reference to HiveQL; for that, see the Hive documentation. Instead, we focus on commonly used features and pay particular attention to features that diverge from either SQL-92 or popular databases such as MySQL. Table 17-2 provides a high-level comparison of SQL and HiveQL.

Table 17-2. A high-level comparison of SQL and HiveQL

Updates
  SQL: UPDATE, INSERT, DELETE
  HiveQL: UPDATE, INSERT, DELETE
  References: "Inserts" on page 500; "Updates, Transactions, and Indexes" on page 483

Transactions
  SQL: Supported
  HiveQL: Limited support
  References: "Updates, Transactions, and Indexes" on page 483

Indexes
  SQL: Supported
  HiveQL: Supported
  References: "Updates, Transactions, and Indexes" on page 483

Data types
  SQL: Integral, floating-point, fixed-point, text and binary strings, temporal
  HiveQL: Boolean, integral, floating-point, fixed-point, text and binary strings, temporal, array, map, struct
  References: "Data Types" on page 486

Functions
  SQL: Hundreds of built-in functions
  HiveQL: Hundreds of built-in functions
  References: "Operators and Functions" on page 488

Multitable inserts
  SQL: Not supported
  HiveQL: Supported
  References: "Multitable insert" on page 501

CREATE TABLE...AS SELECT
  SQL: Not valid SQL-92, but found in some databases
  HiveQL: Supported
  References: "CREATE TABLE...AS SELECT" on page 501

SELECT
  SQL: SQL-92
  HiveQL: SQL-92. SORT BY for partial ordering, LIMIT to limit number of rows returned
  References: "Querying Data" on page 503

Joins
  SQL: SQL-92, or variants (join tables in the FROM clause, join condition in the WHERE clause)
  HiveQL: Inner joins, outer joins, semi joins, map joins, cross joins
  References: "Joins" on page 505

Subqueries
  SQL: In any clause (correlated or noncorrelated)
  HiveQL: In the FROM, WHERE, or HAVING clauses (uncorrelated subqueries not supported)
  References: "Subqueries" on page 508

Views
  SQL: Updatable (materialized or nonmaterialized)
  HiveQL: Read-only (materialized views not supported)
  References: "Views" on page 509

Extension points
  SQL: User-defined functions, stored procedures
  HiveQL: User-defined functions, MapReduce scripts
  References: "User-Defined Functions" on page 510; "MapReduce Scripts" on page 503
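As an illustration of the multitable insert extension mentioned above, the following sketch reads a source table once and populates two tables from it (the table and column names are invented for illustration; see "Multitable insert" on page 501 for the full treatment):

    -- Scan the (hypothetical) records table once, writing into two tables
    FROM records
    INSERT OVERWRITE TABLE stations_by_year
      SELECT year, COUNT(DISTINCT station)
      GROUP BY year
    INSERT OVERWRITE TABLE records_by_year
      SELECT year, COUNT(1)
      GROUP BY year;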
Data Types

Hive supports both primitive and complex data types.
Primitives include numeric, Boolean, string, and timestamp types. The complex data types include arrays, maps, and structs. Hive's data types are listed in Table 17-3. Note that the literals shown are those used from within HiveQL; they are not the serialized forms used in the table's storage format (see "Storage Formats" on page 496).

Table 17-3. Hive data types

Primitive types:
  BOOLEAN    True/false value. Literal example: TRUE
  TINYINT    1-byte (8-bit) signed integer, from –128 to 127. Literal example: 1Y
  SMALLINT   2-byte (16-bit) signed integer, from –32,768 to 32,767. Literal example: 1S
  INT        4-byte (32-bit) signed integer, from –2,147,483,648 to 2,147,483,647. Literal example: 1
  BIGINT     8-byte (64-bit) signed integer, from –9,223,372,036,854,775,808 to 9,223,372,036,854,775,807. Literal example: 1L
  FLOAT      4-byte (32-bit) single-precision floating-point number. Literal example: 1.0
  DOUBLE     8-byte (64-bit) double-precision floating-point number. Literal example: 1.0
  DECIMAL    Arbitrary-precision signed decimal number. Literal example: 1.0
  STRING     Unbounded variable-length character string. Literal examples: 'a', "a"
  VARCHAR    Variable-length character string. Literal examples: 'a', "a"
  CHAR       Fixed-length character string. Literal examples: 'a', "a"
  BINARY     Byte array. Literal: not supported
  TIMESTAMP  Timestamp with nanosecond precision. Literal examples: 1325502245000, '2012-01-02 03:04:05.123456789'
  DATE       Date. Literal example: '2012-01-02'

Complex types:
  ARRAY      An ordered collection of fields.