However, they cannot hold binary fields (such as database columns of type VARBINARY), and distinguishing between null values and String-based fields containing the value "null" can be problematic (although using the --null-string import option allows you to control the representation of null values).

To handle these conditions, Sqoop also supports SequenceFiles, Avro datafiles, and Parquet files. These binary formats provide the most precise representation possible of the imported data. They also allow data to be compressed while retaining MapReduce’s ability to process different sections of the same file in parallel. However, current versions of Sqoop cannot load Avro datafiles or SequenceFiles into Hive (although you can load Avro into Hive manually, and Parquet can be loaded directly into Hive by Sqoop).
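For example, the on-disk format can be chosen at import time with a flag. The command below is a minimal sketch, reusing the hadoopguide database and widgets table from earlier in the chapter; in the Sqoop releases we have used, --as-sequencefile, --as-avrodatafile, and --as-parquetfile select the binary formats, and delimited text remains the default:

% sqoop import --connect jdbc:mysql://localhost/hadoopguide \
> --table widgets --as-avrodatafile

With a binary format there is no ambiguity between a null column value and a field containing the literal string "null", so the --null-string option is relevant only to text-based imports.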
Another disadvantage of SequenceFiles is that they are Java specific, whereas Avro and Parquet files can be processed by a wide range of languages.

Generated Code

In addition to writing the contents of the database table to HDFS, Sqoop also provides you with a generated Java source file (widgets.java) written to the current local directory. (After running the sqoop import command shown earlier, you can see this file by running ls widgets.java.)

As you’ll learn in "Imports: A Deeper Look" on page 408, Sqoop can use generated code to handle the deserialization of table-specific data from the database source before writing it to HDFS.

The generated class (widgets) is capable of holding a single record retrieved from the imported table.
It can manipulate such a record in MapReduce or store it in a SequenceFile in HDFS. (SequenceFiles written by Sqoop during the import process will store each imported row in the "value" element of the SequenceFile’s key-value pair format, using the generated class.)

It is likely that you don’t want to name your generated class widgets, since each instance of the class refers to only a single record. We can use a different Sqoop tool to generate source code without performing an import; this generated code will still examine the database table to determine the appropriate data types for each field:

% sqoop codegen --connect jdbc:mysql://localhost/hadoopguide \
> --table widgets --class-name Widget

The codegen tool simply generates code; it does not perform the full import. We specified that we’d like it to generate a class named Widget; this will be written to Widget.java. We also could have specified --class-name and other code-generation arguments during the import process we performed earlier. This tool can be used to regenerate code if you accidentally remove the source file, or generate code with different settings than were used during the import.
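As a sketch of that second possibility (this command is illustrative, not a listing from the original chapter), the same code-generation argument can be passed straight to the import, so the class is generated as Widget in the first place:

% sqoop import --connect jdbc:mysql://localhost/hadoopguide \
> --table widgets --class-name Widget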
If you’re working with records imported to SequenceFiles, it is inevitable that you’ll need to use the generated classes (to deserialize data from the SequenceFile storage). You can work with text-file-based records without using generated code, but as we’ll see in "Working with Imported Data" on page 412, Sqoop’s generated code can handle some tedious aspects of data processing for you.

Additional Serialization Systems

Recent versions of Sqoop support Avro-based serialization and schema generation as well (see Chapter 12), allowing you to use Sqoop in your project without integrating with generated code.

Imports: A Deeper Look

As mentioned earlier, Sqoop imports a table from a database by running a MapReduce job that extracts rows from the table, and writes the records to HDFS.
How does MapReduce read the rows? This section explains how Sqoop works under the hood.

At a high level, Figure 15-1 demonstrates how Sqoop interacts with both the database source and Hadoop. Like Hadoop itself, Sqoop is written in Java. Java provides an API called Java Database Connectivity, or JDBC, that allows applications to access data stored in an RDBMS as well as to inspect the nature of this data. Most database vendors provide a JDBC driver that implements the JDBC API and contains the necessary code to connect to their database servers.

Based on the URL in the connect string used to access the database, Sqoop attempts to predict which driver it should load. You still need to download the JDBC driver itself and install it on your Sqoop client. For cases where Sqoop does not know which JDBC driver is appropriate, users can specify the JDBC driver explicitly with the --driver argument. This capability allows Sqoop to work with a wide variety of database platforms.
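For illustration only (the driver class and connect string below are hypothetical), an explicit driver can be named on the command line; the driver’s JAR still has to be installed where Sqoop can find it, as noted above:

% sqoop import --driver com.example.jdbc.ExampleDriver \
> --connect jdbc:example://dbserver.example.com/hadoopguide \
> --table widgets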
Before the import can start, Sqoop uses JDBC to examine the table it is to import. It retrieves a list of all the columns and their SQL data types. These SQL types (VARCHAR, INTEGER, etc.) can then be mapped to Java data types (String, Integer, etc.), which will hold the field values in MapReduce applications. Sqoop’s code generator will use this information to create a table-specific class to hold a record extracted from the table.
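In the Sqoop releases we have used, the default mapping can also be overridden per column with --map-column-java; the particular overrides below are purely illustrative:

% sqoop import --connect jdbc:mysql://localhost/hadoopguide \
> --table widgets --map-column-java id=Long,price=String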
Figure 15-1. Sqoop’s import process

The Widget class from earlier, for example, contains the following methods that retrieve each column from an extracted record:

public Integer get_id();
public String get_widget_name();
public java.math.BigDecimal get_price();
public java.sql.Date get_design_date();
public Integer get_version();
public String get_design_comment();

More critical to the import system’s operation, though, are the serialization methods that form the DBWritable interface, which allow the Widget class to interact with JDBC:

public void readFields(ResultSet __dbResults) throws SQLException;
public void write(PreparedStatement __dbStmt) throws SQLException;

JDBC’s ResultSet interface provides a cursor that retrieves records from a query; the readFields() method here will populate the fields of the Widget object with the columns from one row of the ResultSet’s data.
The write() method shown here allows Sqoop to insert new Widget rows into a table, a process called exporting. Exports are discussed in "Performing an Export" on page 417.

The MapReduce job launched by Sqoop uses an InputFormat that can read sections of a table from a database via JDBC.
The DataDrivenDBInputFormat provided with Hadoop partitions a query’s results over several map tasks.

Reading a table is typically done with a simple query such as:

SELECT col1,col2,col3,... FROM tableName

But often, better import performance can be gained by dividing this query across multiple nodes. This is done using a splitting column.
Using metadata about the table, Sqoop will guess a good column to use for splitting the table (typically the primary key for the table, if one exists). The minimum and maximum values for the primary key column are retrieved, and then these are used in conjunction with a target number of tasks to determine the queries that each map task should issue.

For example, suppose the widgets table had 100,000 entries, with the id column containing values 0 through 99,999. When importing this table, Sqoop would determine that id is the primary key column for the table. When starting the MapReduce job, the DataDrivenDBInputFormat used to perform the import would issue a statement such as SELECT MIN(id), MAX(id) FROM widgets. These values would then be used to interpolate over the entire range of data.
Assuming we specified that five map tasks should run in parallel (with -m 5), this would result in each map task executing queries such as SELECT id, widget_name, ... FROM widgets WHERE id >= 0 AND id < 20000, SELECT id, widget_name, ... FROM widgets WHERE id >= 20000 AND id < 40000, and so on.

The choice of splitting column is essential to parallelizing work efficiently. If the id column were not uniformly distributed (perhaps there are no widgets with IDs between 50,000 and 75,000), then some map tasks might have little or no work to perform, whereas others would have a great deal. Users can specify a particular splitting column when running an import job (via the --split-by argument), to tune the job to the data’s actual distribution. If an import job is run as a single (sequential) task with -m 1, this split process is not performed.
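As a sketch of these two options together (both flags are the ones named above, applied to the running widgets example), the task count and splitting column can be set explicitly when launching the import:

% sqoop import --connect jdbc:mysql://localhost/hadoopguide \
> --table widgets -m 5 --split-by id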
After generating the deserialization code and configuring the InputFormat, Sqoop sends the job to the MapReduce cluster. Map tasks execute the queries and deserialize rows from the ResultSet into instances of the generated class, which are either stored directly in SequenceFiles or transformed into delimited text before being written to HDFS.

Controlling the Import

Sqoop does not need to import an entire table at a time. For example, a subset of the table’s columns can be specified for import. Users can also specify a WHERE clause to include in queries via the --where argument, which bounds the rows of the table to import. For example, if widgets 0 through 99,999 were imported last month, but this month our vendor catalog included 1,000 new types of widget, an import could be configured with the clause WHERE id >= 100000; this will start an import job to retrieve all the new rows added to the source database since the previous import run. User-supplied WHERE clauses are applied before task splitting is performed, and are pushed down into the queries executed by each task.

For more control—to perform column transformations, for example—users can specify a --query argument.
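Two illustrative sketches follow, one for each option. The --where form matches the incremental import described above. For the --query form, in the Sqoop releases we have used, the free-form query must contain a $CONDITIONS placeholder that Sqoop replaces with each task’s split predicate, and --target-dir (plus --split-by, for a parallel import) must be given explicitly; the transformation and target directory shown are invented for illustration:

% sqoop import --connect jdbc:mysql://localhost/hadoopguide \
> --table widgets --where "id >= 100000"

% sqoop import --connect jdbc:mysql://localhost/hadoopguide \
> --query 'SELECT id, UPPER(widget_name) FROM widgets WHERE $CONDITIONS' \
> --split-by id --target-dir widget_names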
Imports and Consistency

When importing data to HDFS, it is important that you ensure access to a consistent snapshot of the source data.