[Figure: Exports are performed in parallel using MapReduce]

For MySQL, Sqoop can employ a direct-mode strategy using mysqlimport. Each map task spawns a mysqlimport process that it communicates with via a named FIFO file on the local filesystem. Data is then streamed into mysqlimport via the FIFO channel, and from there into the database.

Whereas most MapReduce jobs reading from HDFS pick the degree of parallelism (number of map tasks) based on the number and size of the files to process, Sqoop's export system allows users explicit control over the number of tasks. The performance of the export can be affected by the number of parallel writers to the database, so Sqoop uses the CombineFileInputFormat class to group the input files into a smaller number of map tasks.
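For example, the number of export map tasks (and hence the number of parallel database writers) can be set explicitly with the -m (or --num-mappers) argument. The following is a minimal sketch only; the table name, export directory, and choice of four tasks are illustrative placeholders rather than values from this chapter's examples:

% sqoop export --connect jdbc:mysql://localhost/hadoopguide \
> --table my_export_table -m 4 --export-dir my_export_data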
Exports and Transactionality

Due to the parallel nature of the process, often an export is not an atomic operation. Sqoop will spawn multiple tasks to export slices of the data in parallel. These tasks can complete at different times, meaning that even though transactions are used inside tasks, results from one task may be visible before the results of another task.
Moreover, databases often use fixed-size buffers to store transactions. As a result, one transaction cannot necessarily contain the entire set of operations performed by a task. Sqoop commits results every few thousand rows, to ensure that it does not run out of memory. These intermediate results are visible while the export continues. Applications that will use the results of an export should not be started until the export process is complete, or they may see partial results.

To solve this problem, Sqoop can export to a temporary staging table and then, at the end of the job (if the export has succeeded), move the staged data into the destination table in a single transaction. You can specify a staging table with the --staging-table option. The staging table must already exist and have the same schema as the destination. It must also be empty, unless the --clear-staging-table option is also supplied.
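As a hedged sketch (again, the table names and export directory are illustrative placeholders), a staged export might be invoked like this, assuming my_export_table_staging has been created beforehand with the same schema as my_export_table:

% sqoop export --connect jdbc:mysql://localhost/hadoopguide \
> --table my_export_table --export-dir my_export_data \
> --staging-table my_export_table_staging --clear-staging-table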
Using a staging table is slower, since the data must be written twice: first to the staging table, then to the destination table. The export process also uses more space while it is running, since there are two copies of the data while the staged data is being copied to the destination.

Exports and SequenceFiles

The example export reads source data from a Hive table, which is stored in HDFS as a delimited text file.
Sqoop can also export delimited text files that were not Hive tables. For example, it can export text files that are the output of a MapReduce job.

Sqoop can export records stored in SequenceFiles to an output table too, although some restrictions apply. A SequenceFile cannot contain arbitrary record types. Sqoop's export tool will read objects from SequenceFiles and send them directly to the OutputCollector, which passes the objects to the database export OutputFormat. To work with Sqoop, the record must be stored in the "value" portion of the SequenceFile's key-value pair format and must subclass the org.apache.sqoop.lib.SqoopRecord abstract class (as is done by all classes generated by Sqoop).

If you use the codegen tool (sqoop-codegen) to generate a SqoopRecord implementation for a record based on your export target table, you can write a MapReduce program that populates instances of this class and writes them to SequenceFiles. sqoop-export can then export these SequenceFiles to the table.
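As an illustrative sketch (the widgets table and the WidgetHolder class name mirror this chapter's examples, but any table and class name could be used), such a class might be generated with something like:

% sqoop codegen --connect jdbc:mysql://localhost/hadoopguide \
> --table widgets --class-name WidgetHolder --bindir .

The generated class and its compiled JAR can then be used by a MapReduce program that populates WidgetHolder instances and writes them to SequenceFiles.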
Another means by which data may be in SqoopRecord instances in SequenceFiles is if data is imported from a database table to HDFS and modified in some fashion, and then the results are stored in SequenceFiles holding records of the same data type. In this case, Sqoop should reuse the existing class definition to read data from SequenceFiles, rather than generating a new (temporary) record container class to perform the export, as is done when converting text-based records to database rows. You can suppress code generation and instead use an existing record class and JAR by providing the --class-name and --jar-file arguments to Sqoop.
Sqoop will use the specified class, loaded from the specified JAR, when exporting records.

In the following example, we reimport the widgets table as SequenceFiles, and then export it back to the database in a different table:

% sqoop import --connect jdbc:mysql://localhost/hadoopguide \
> --table widgets -m 1 --class-name WidgetHolder --as-sequencefile \
> --target-dir widget_sequence_files --bindir .
...
14/10/29 12:25:03 INFO mapreduce.ImportJobBase: Retrieved 3 records.

% mysql hadoopguide
mysql> CREATE TABLE widgets2(id INT, widget_name VARCHAR(100),
    -> price DOUBLE, designed DATE, version INT, notes VARCHAR(200));
Query OK, 0 rows affected (0.03 sec)
mysql> exit;

% sqoop export --connect jdbc:mysql://localhost/hadoopguide \
> --table widgets2 -m 1 --class-name WidgetHolder \
> --jar-file WidgetHolder.jar --export-dir widget_sequence_files
...
14/10/29 12:28:17 INFO mapreduce.ExportJobBase: Exported 3 records.

During the import, we specified the SequenceFile format and indicated that we wanted the JAR file to be placed in the current directory (with --bindir) so we can reuse it. Otherwise, it would be placed in a temporary directory.
We then created a destination table for the export, which had a slightly different schema (albeit one that is compatible with the original data). Finally, we ran an export that used the existing generated code to read the records from the SequenceFile and write them to the database.

Further Reading

For more information on using Sqoop, consult the Apache Sqoop Cookbook by Kathleen Ting and Jarek Jarcec Cecho (O'Reilly, 2013).

CHAPTER 16
Pig

Apache Pig raises the level of abstraction for processing large datasets. MapReduce allows you, as the programmer, to specify a map function followed by a reduce function, but working out how to fit your data processing into this pattern, which often requires multiple MapReduce stages, can be a challenge. With Pig, the data structures are much richer, typically being multivalued and nested, and the transformations you can apply to the data are much more powerful.
They include joins, for example, which are not for the faint of heart in MapReduce.

Pig is made up of two pieces:

• The language used to express data flows, called Pig Latin.

• The execution environment to run Pig Latin programs. There are currently two environments: local execution in a single JVM and distributed execution on a Hadoop cluster (a quick way to choose between them on the command line is sketched below).
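As a minimal sketch (the script name here is a placeholder), the execution type can be selected with Pig's -x option:

% pig -x local my_script.pig        # run in a single local JVM
% pig -x mapreduce my_script.pig    # run as MapReduce jobs on a Hadoop cluster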
A Pig Latin program is made up of a series of operations, or transformations, that are applied to the input data to produce output. Taken as a whole, the operations describe a data flow, which the Pig execution environment translates into an executable representation and then runs. Under the covers, Pig turns the transformations into a series of MapReduce jobs, but as a programmer you are mostly unaware of this, which allows you to focus on the data rather than the nature of the execution.

Pig is a scripting language for exploring large datasets. One criticism of MapReduce is that the development cycle is very long. Writing the mappers and reducers, compiling and packaging the code, submitting the job(s), and retrieving the results is a time-consuming business, and even with Streaming, which removes the compile and package step, the experience is still involved.
Pig's sweet spot is its ability to process terabytes of data in response to a half-dozen lines of Pig Latin issued from the console. Indeed, it was created at Yahoo! to make it easier for researchers and engineers to mine the huge datasets there. Pig is very supportive of a programmer writing a query, since it provides several commands for introspecting the data structures in your program as it is written. Even more useful, it can perform a sample run on a representative subset of your input data, so you can see whether there are errors in the processing before unleashing it on the full dataset.

Pig was designed to be extensible. Virtually all parts of the processing path are customizable: loading, storing, filtering, grouping, and joining can all be altered by user-defined functions (UDFs).
These functions operate on Pig's nested data model, so they can integrate very deeply with Pig's operators. As another benefit, UDFs tend to be more reusable than the libraries developed for writing MapReduce programs.

In some cases, Pig doesn't perform as well as programs written in MapReduce.
However, the gap is narrowing with each release, as the Pig team implements sophisticated algorithms for applying Pig's relational operators. It's fair to say that unless you are willing to invest a lot of effort optimizing Java MapReduce code, writing queries in Pig Latin will save you time.

Installing and Running Pig

Pig runs as a client-side application. Even if you want to run Pig on a Hadoop cluster, there is nothing extra to install on the cluster: Pig launches jobs and interacts with HDFS (or other Hadoop filesystems) from your workstation.

Installation is straightforward. Download a stable release from http://pig.apache.org/releases.html, and unpack the tarball in a suitable place on your workstation:

% tar xzf pig-x.y.z.tar.gz

It's convenient to add Pig's binary directory to your command-line path.
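For example (a sketch only; the install location shown is an arbitrary choice), you might add lines like these to your shell startup file:

% export PIG_HOME=~/sw/pig-x.y.z
% export PATH=$PATH:$PIG_HOME/bin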