(Map tasks reading from a database in parallel are running in separate processes. Thus, they cannot share a single database transaction.) The best way to do this is to ensure that any processes that update existing rows of a table are disabled during the import.

Incremental Imports

It's common to run imports on a periodic basis so that the data in HDFS is kept synchronized with the data stored in the database. To do this, there needs to be some way of identifying the new data. Sqoop will import rows that have a column value (for the column specified with --check-column) that is greater than some specified value (set via --last-value).

The value specified as --last-value can be a row ID that is strictly increasing, such as an AUTO_INCREMENT primary key in MySQL. This is suitable for the case where new rows are added to the database table, but existing rows are not updated.
This mode is called append mode, and is activated via --incremental append. Another option is time-based incremental imports (specified by --incremental lastmodified), which is appropriate when existing rows may be updated, and there is a column (the check column) that records the last modified time of the update.

At the end of an incremental import, Sqoop will print out the value to be specified as --last-value on the next import. This is useful when running incremental imports manually, but for running periodic imports it is better to use Sqoop's saved job facility, which automatically stores the last value and uses it on the next job run.
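As a concrete sketch (the last value shown is illustrative, and widget_append is just a hypothetical job name), an append-mode import of new widgets rows, and the equivalent saved job, might look something like this:

% sqoop import --connect jdbc:mysql://localhost/hadoopguide \
> --table widgets --check-column id --incremental append --last-value 3

% sqoop job --create widget_append -- import \
> --connect jdbc:mysql://localhost/hadoopguide \
> --table widgets --check-column id --incremental append --last-value 3
% sqoop job --exec widget_append

The first command imports only rows whose id is greater than 3; the saved job records the highest id it has seen, so each --exec run picks up where the previous run left off.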
Type sqoop job --help for usage instructions for saved jobs.

Direct-Mode Imports

Sqoop's architecture allows it to choose from multiple available strategies for performing an import. Most databases will use the DataDrivenDBInputFormat-based approach described earlier. Some databases, however, offer specific tools designed to extract data quickly.
For example, MySQL’s mysqldump application can read from a table with greaterthroughput than a JDBC channel. The use of these external tools is referred to as directmode in Sqoop’s documentation. Direct mode must be specifically enabled by the user(via the --direct argument), as it is not as general purpose as the JDBC approach. (Forexample, MySQL’s direct mode cannot handle large objects, such as CLOB or BLOBImports: A Deeper Look|411columns, and that’s why Sqoop needs to use a JDBC-specific API to load these columnsinto HDFS.)For databases that provide such tools, Sqoop can use these to great effect.
A direct-mode import from MySQL is usually much more efficient (in terms of map tasks and time required) than a comparable JDBC-based import. Sqoop will still launch multiple map tasks in parallel. These tasks will then spawn instances of the mysqldump program and read its output. Sqoop can also perform direct-mode imports from PostgreSQL, Oracle, and Netezza. Even when direct mode is used to access the contents of a database, the metadata is still queried through JDBC.

Working with Imported Data

Once data has been imported to HDFS, it is ready for processing by custom MapReduce programs. Text-based imports can easily be used in scripts run with Hadoop Streaming or in MapReduce jobs run with the default TextInputFormat.

To use individual fields of an imported record, though, the field delimiters (and any escape/enclosing characters) must be parsed and the field values extracted and converted to the appropriate data types.
For example, the ID of the "sprocket" widget is represented as the string "1" in the text file, but should be parsed into an Integer or int variable in Java. The generated table class provided by Sqoop can automate this process, allowing you to focus on the actual MapReduce job to run. Each autogenerated class has several overloaded methods named parse() that operate on the data represented as Text, CharSequence, char[], or other common types.

The MapReduce application called MaxWidgetId (available in the example code) will find the widget with the highest ID. The class can be compiled into a JAR file along with Widget.java using the Maven POM that comes with the example code. The JAR file is called sqoop-examples.jar, and is executed like so:

% HADOOP_CLASSPATH=$SQOOP_HOME/sqoop-version.jar hadoop jar \
> sqoop-examples.jar MaxWidgetId -libjars $SQOOP_HOME/sqoop-version.jar

This command line ensures that Sqoop is on the classpath locally (via $HADOOP_CLASSPATH) when running the MaxWidgetId.run() method, as well as when map tasks are running on the cluster (via the -libjars argument).

When run, the maxwidget path in HDFS will contain a file named part-r-00000 with the following expected result:

3,gadget,99.99,1983-08-13,13,Our flagship product

It is worth noting that in this example MapReduce program, a Widget object was emitted from the mapper to the reducer; the autogenerated Widget class implements the Writable interface provided by Hadoop, which allows the object to be sent via Hadoop's serialization mechanism, as well as written to and read from SequenceFiles.

The MaxWidgetId example is built on the new MapReduce API.
MapReduce applications that rely on Sqoop-generated code can be built on the new or old APIs, though some advanced features (such as working with large objects) are more convenient to use in the new API.

Avro-based imports can be processed using the APIs described in "Avro MapReduce" on page 359. With the Generic Avro mapping, the MapReduce program does not need to use schema-specific generated code (although this is an option too, by using Avro's Specific compiler; Sqoop does not do the code generation in this case).
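As a reminder of where such a datafile comes from (a sketch, assuming the same source table as in the earlier examples), Sqoop writes Avro datafiles rather than text when asked to at import time:

% sqoop import --connect jdbc:mysql://localhost/hadoopguide \
> --table widgets -m 1 --as-avrodatafile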
The example code includes a program called MaxWidgetIdGenericAvro, which finds the widget with the highest ID and writes out the result in an Avro datafile.

Imported Data and Hive

As we'll see in Chapter 17, for many types of analysis, using a system such as Hive to handle relational operations can dramatically ease the development of the analytic pipeline. Especially for data originally from a relational data source, using Hive makes a lot of sense. Hive and Sqoop together form a powerful toolchain for performing analysis.

Suppose we had another log of data in our system, coming from a web-based widget purchasing system.
This might return logfiles containing a widget ID, a quantity, a shipping address, and an order date. Here is a snippet from an example log of this type:

1,15,120 Any St.,Los Angeles,CA,90210,2010-08-01
3,4,120 Any St.,Los Angeles,CA,90210,2010-08-01
2,5,400 Some Pl.,Cupertino,CA,95014,2010-07-30
2,7,88 Mile Rd.,Manhattan,NY,10005,2010-07-18

By using Hadoop to analyze this purchase log, we can gain insight into our sales operation. By combining this data with the data extracted from our relational data source (the widgets table), we can do better. In this example session, we will compute which zip code is responsible for the most sales dollars, so we can better focus our sales team's operations. Doing this requires data from both the sales log and the widgets table.

The data shown in the previous snippet should be in a local file named sales.log for this to work.

First, let's load the sales data into Hive:

hive> CREATE TABLE sales(widget_id INT, qty INT,
    > street STRING, city STRING, state STRING,
    > zip INT, sale_date STRING)
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
OK
Time taken: 5.248 seconds
hive> LOAD DATA LOCAL INPATH "ch15-sqoop/sales.log" INTO TABLE sales;
...
Loading data to table default.sales
Table default.sales stats: [numFiles=1, numRows=0, totalSize=189, rawDataSize=0]
OK
Time taken: 0.6 seconds

Sqoop can generate a Hive table based on a table from an existing relational data source. We've already imported the widgets data to HDFS, so we can generate the Hive table definition and then load in the HDFS-resident data:

% sqoop create-hive-table --connect jdbc:mysql://localhost/hadoopguide \
> --table widgets --fields-terminated-by ','
...
14/10/29 11:54:52 INFO hive.HiveImport: OK
14/10/29 11:54:52 INFO hive.HiveImport: Time taken: 1.098 seconds
14/10/29 11:54:52 INFO hive.HiveImport: Hive import complete.
% hive
hive> LOAD DATA INPATH "widgets" INTO TABLE widgets;
Loading data to table widgets
OK
Time taken: 3.265 seconds

When creating a Hive table definition with a specific already imported dataset in mind, we need to specify the delimiters used in that dataset.
Otherwise, Sqoop will allow Hive to use its default delimiters (which are different from Sqoop's default delimiters).

Hive's type system is less rich than that of most SQL systems. Many SQL types do not have direct analogues in Hive. When Sqoop generates a Hive table definition for an import, it uses the best Hive type available to hold a column's values. This may result in a decrease in precision. When this occurs, Sqoop will provide you with a warning message such as this one:

14/10/29 11:54:43 WARN hive.TableDefWriter:
Column design_date had to be cast to a less precise type in Hive

This three-step process of importing data to HDFS, creating the Hive table, and then loading the HDFS-resident data into Hive can be shortened to one step if you know that you want to import straight from a database directly into Hive.
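For example (a minimal sketch, assuming the same hadoopguide database and widgets table as before), passing --hive-import to sqoop import performs the import, the Hive table creation, and the load in a single command:

% sqoop import --connect jdbc:mysql://localhost/hadoopguide \
> --table widgets -m 1 --hive-import

With --hive-import, Sqoop derives the Hive table definition from the source table's metadata and moves the imported files into Hive's warehouse, so no separate create-hive-table or LOAD DATA step is needed.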