During an import, Sqoop can generate the Hive table definition and then load in the data. Had we not already performed the import, we could have executed this command, which creates the widgets table in Hive based on the copy in MySQL:

% sqoop import --connect jdbc:mysql://localhost/hadoopguide \
> --table widgets -m 1 --hive-import

Running sqoop import with the --hive-import argument will load the data directly from the source database into Hive; it infers a Hive schema automatically based on the schema for the table in the source database. Using this, you can get started working with your data in Hive with only one command.

Regardless of which data import route we chose, we can now use the widgets dataset and the sales dataset together to calculate the most profitable zip code.
Let’s do so, and also save the result of this query in another table for later:

hive> CREATE TABLE zip_profits
    > AS
    > SELECT SUM(w.price * s.qty) AS sales_vol, s.zip FROM SALES s
    > JOIN widgets w ON (s.widget_id = w.id) GROUP BY s.zip;
...
Moving data to: hdfs://localhost/user/hive/warehouse/zip_profits
...
OK

hive> SELECT * FROM zip_profits ORDER BY sales_vol DESC;
...
OK
403.71  90210
28.0    10005
20.0    95014

Importing Large Objects

Most databases provide the capability to store large amounts of data in a single field. Depending on whether this data is textual or binary in nature, it is usually represented as a CLOB or BLOB column in the table. These “large objects” are often handled specially by the database itself.
In particular, most tables are physically laid out on disk as in Figure 15-2. Scanning through rows to determine which ones match the criteria for a particular query typically involves reading all columns of each row from disk. If large objects were stored “inline” in this fashion, they would adversely affect the performance of such scans.
Therefore, large objects are often stored externally from their rows, as in Figure 15-3. Accessing a large object often requires “opening” it through the reference contained in the row.

Figure 15-2. Database tables are typically physically represented as an array of rows, with all the columns in a row stored adjacent to one another

Figure 15-3. Large objects are usually held in a separate area of storage; the main row storage contains indirect references to the large objects

The difficulty of working with large objects in a database suggests that a system such as Hadoop, which is much better suited to storing and processing large, complex data objects, is an ideal repository for such information.
Sqoop can extract large objects from tables and store them in HDFS for further processing.

As in a database, MapReduce typically materializes every record before passing it along to the mapper. If individual records are truly large, this can be very inefficient.

As shown earlier, records imported by Sqoop are laid out on disk in a fashion very similar to a database’s internal structure: an array of records with all fields of a record concatenated together. When running a MapReduce program over imported records, each map task must fully materialize all fields of each record in its input split.
If the contents of a large object field are relevant only for a small subset of the total number of records used as input to a MapReduce program, it would be inefficient to fully materialize all these records. Furthermore, depending on the size of the large object, full materialization in memory may be impossible.

To overcome these difficulties, Sqoop will store imported large objects in a separate file called a LobFile, if they are larger than a threshold size of 16 MB (configurable via the sqoop.inline.lob.length.max setting, in bytes).
The LobFile format can store individual records of very large size (a 64-bit address space is used). Each record in a LobFile holds a single large object. The LobFile format allows clients to hold a reference to a record without accessing the record contents. When records are accessed, this is done through a java.io.InputStream (for binary objects) or java.io.Reader (for character-based objects).

When a record is imported, the “normal” fields will be materialized together in a text file, along with a reference to the LobFile where a CLOB or BLOB column is stored. For example, suppose our widgets table contained a BLOB field named schematic holding the actual schematic diagram for each widget. An imported record might then look like:

2,gizmo,4.00,2009-11-30,4,null,externalLob(lf,lobfile0,100,5011714)

The externalLob(...) text is a reference to an externally stored large object, stored in LobFile format (lf) in a file named lobfile0, with the specified byte offset and length inside that file.

When working with this record, the Widget.get_schematic() method would return an object of type BlobRef referencing the schematic column, but not actually containing its contents.
The BlobRef.getDataStream() method actually opens the LobFile and returns an InputStream, allowing you to access the schematic field’s contents.

When running a MapReduce job processing many Widget records, you might need to access the schematic fields of only a handful of records. This system allows you to incur the I/O costs of accessing only the required large object entries: a big savings, as individual schematics may be several megabytes or more of data.

The BlobRef and ClobRef classes cache references to underlying LobFiles within a map task. If you do access the schematic fields of several sequentially ordered records, they will take advantage of the existing file pointer’s alignment on the next record body.
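To make the access pattern concrete, here is a minimal sketch of a mapper that opens the schematic BLOB only for the records that need it. It assumes the import wrote SequenceFiles whose values are instances of the Sqoop-generated Widget class; the mapper name, the filtering condition, the get_widget_name() accessor, and the exact getDataStream() overload (shown taking the task context to locate the LobFile) are assumptions for illustration rather than Sqoop's definitive API, so check them against the code Sqoop generated for your table and the release you are running.

// Minimal sketch, not Sqoop's own code: count the bytes of the schematic BLOB
// for a selected subset of widgets, touching the LobFile only for those records.
import java.io.IOException;
import java.io.InputStream;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import com.cloudera.sqoop.lib.BlobRef; // package name varies by Sqoop release

public class SchematicSizeMapper
    extends Mapper<LongWritable, Widget, Text, LongWritable> {

  @Override
  protected void map(LongWritable key, Widget widget, Context context)
      throws IOException, InterruptedException {
    // The cheap "normal" fields are available without any LobFile I/O, so we
    // filter on them first. (get_widget_name() assumes the generated
    // get_<column>() naming pattern; adjust to match your generated class.)
    if (!"gizmo".equals(widget.get_widget_name())) {
      return;
    }

    BlobRef schematic = widget.get_schematic();
    long bytes = 0;
    // Opening the stream is the only point at which the LobFile is actually read.
    try (InputStream in = schematic.getDataStream(context)) {
      byte[] buf = new byte[8192];
      for (int n = in.read(buf); n != -1; n = in.read(buf)) {
        bytes += n;
      }
    }
    context.write(new Text(widget.get_widget_name()), new LongWritable(bytes));
  }
}

Because only the matching records ever call getDataStream(), the LobFile I/O scales with the number of schematics you actually inspect, not with the total number of records in the input split.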
Performing an Export

In Sqoop, an import refers to the movement of data from a database system into HDFS. By contrast, an export uses HDFS as the source of data and a remote database as the destination. In the previous sections, we imported some data and then performed some analysis using Hive. We can export the results of this analysis to a database for consumption by other tools.

Before exporting a table from HDFS to a database, we must prepare the database to receive the data by creating the target table. Although Sqoop can infer which Java types are appropriate to hold SQL data types, this translation does not work in both directions (for example, there are several possible SQL column definitions that can hold data in a Java String; this could be CHAR(64), VARCHAR(200), or something else entirely).
Consequently, you must determine which types are most appropriate.

We are going to export the zip_profits table from Hive. We need to create a table in MySQL that has target columns in the same order, with the appropriate SQL types:

% mysql hadoopguide
mysql> CREATE TABLE sales_by_zip (volume DECIMAL(8,2), zip INTEGER);
Query OK, 0 rows affected (0.01 sec)

Then we run the export command:

% sqoop export --connect jdbc:mysql://localhost/hadoopguide -m 1 \
> --table sales_by_zip --export-dir /user/hive/warehouse/zip_profits \
> --input-fields-terminated-by '\0001'
...
14/10/29 12:05:08 INFO mapreduce.ExportJobBase: Transferred 176 bytes in 13.5373 seconds (13.0011 bytes/sec)
14/10/29 12:05:08 INFO mapreduce.ExportJobBase: Exported 3 records.

Finally, we can verify that the export worked by checking MySQL:

% mysql hadoopguide -e 'SELECT * FROM sales_by_zip'
+--------+-------+
| volume | zip   |
+--------+-------+
|  28.00 | 10005 |
| 403.71 | 90210 |
|  20.00 | 95014 |
+--------+-------+

When we created the zip_profits table in Hive, we did not specify any delimiters.
So Hive used its default delimiters: a Ctrl-A character (Unicode 0x0001) between fields and a newline at the end of each record. When we used Hive to access the contents of this table (in a SELECT statement), Hive converted this to a tab-delimited representation for display on the console. But when reading the tables directly from files, we need to tell Sqoop which delimiters to use. Sqoop assumes records are newline-delimited by default, but needs to be told about the Ctrl-A field delimiters. The --input-fields-terminated-by argument to sqoop export specified this information.
Sqoop supports several escape sequences, which start with a backslash (\) character, when specifying delimiters.

In the example syntax, the escape sequence is enclosed in single quotes to ensure that the shell processes it literally. Without the quotes, the leading backslash itself may need to be escaped (e.g., --input-fields-terminated-by \\0001). The escape sequences supported by Sqoop are listed in Table 15-1.

Table 15-1. Escape sequences that can be used to specify nonprintable characters as field and record delimiters in Sqoop

Escape    Description
\b        Backspace.
\n        Newline.
\r        Carriage return.
\t        Tab.
\'        Single quote.
\"        Double quote.
\\        Backslash.
\0        NUL. This will insert NUL characters between fields or lines, or will disable
          enclosing/escaping if used for one of the --enclosed-by,
          --optionally-enclosed-by, or --escaped-by arguments.
\0ooo     The octal representation of a Unicode character’s code point. The actual
          character is specified by the octal value ooo.
\0xhhh    The hexadecimal representation of a Unicode character’s code point. This
          should be of the form \0xhhh, where hhh is the hex value. For example,
          --fields-terminated-by '\0x0d' specifies the carriage return character.

Exports: A Deeper Look

The way Sqoop performs exports is very similar in nature to how it performs imports (see Figure 15-4). Before performing the export, Sqoop picks a strategy based on the database connect string.
For most systems, Sqoop uses JDBC. Sqoop then generates a Java class based on the target table definition. This generated class has the ability to parse records from text files and insert values of the appropriate types into a table (in addition to the ability to read the columns from a ResultSet). A MapReduce job is then launched that reads the source datafiles from HDFS, parses the records using the generated class, and executes the chosen export strategy.

The JDBC-based export strategy builds up batch INSERT statements that will each add multiple records to the target table. Inserting many records per statement performs much better than executing many single-row INSERT statements on most database systems. Separate threads are used to read from HDFS and communicate with the database, to ensure that I/O operations involving different systems are overlapped as much as possible.
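To illustrate why the multi-row strategy pays off, here is a minimal JDBC sketch in the spirit of what the generated export code does: it folds several parsed records into a single INSERT statement against the sales_by_zip table created earlier. The class name, helper method, and connection credentials are illustrative assumptions, not Sqoop's actual generated code, which additionally handles type conversion, configurable batch sizes, and error handling.

// Minimal sketch, assuming the sales_by_zip table and MySQL database from the
// export example above; "user" and "password" are placeholders.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

public class BatchedZipExport {

  /** Inserts a batch of (volume, zip) rows with a single multi-row INSERT. */
  public static void insertBatch(Connection conn, List<double[]> rows)
      throws SQLException {
    // Build "INSERT INTO sales_by_zip (volume, zip) VALUES (?, ?), (?, ?), ...".
    StringBuilder sql =
        new StringBuilder("INSERT INTO sales_by_zip (volume, zip) VALUES ");
    for (int i = 0; i < rows.size(); i++) {
      sql.append(i == 0 ? "(?, ?)" : ", (?, ?)");
    }
    try (PreparedStatement stmt = conn.prepareStatement(sql.toString())) {
      int p = 1;
      for (double[] row : rows) {
        stmt.setDouble(p++, row[0]);    // volume
        stmt.setInt(p++, (int) row[1]); // zip
      }
      stmt.executeUpdate();             // one round trip for many rows
    }
  }

  public static void main(String[] args) throws SQLException {
    try (Connection conn = DriverManager.getConnection(
        "jdbc:mysql://localhost/hadoopguide", "user", "password")) {
      insertBatch(conn, List.of(
          new double[] {403.71, 90210},
          new double[] {28.00, 10005},
          new double[] {20.00, 95014}));
    }
  }
}

Sending the three rows in one statement means one statement parse and one round trip on the database side instead of three, which is where most of the speedup over single-row INSERTs comes from.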