The file prefix is used to ensure that HDFS files created by second-tier agents at the same time don't collide.

In the more usual case of agents running on different machines, the hostname can be used to make the filename unique by configuring a host interceptor (see Table 14-1) and including the %{host} escape sequence in the file path, or prefix:

agent2.sinks.sink2.hdfs.filePrefix = events-%{host}

A diagram of the whole system is shown in Figure 14-6.

Figure 14-6. Load balancing between two agents

Integrating Flume with Applications

An Avro source is an RPC endpoint that accepts Flume events, making it possible to write an RPC client to send events to the endpoint, which can be embedded in any application that wants to introduce events into Flume.

The Flume SDK is a module that provides a Java RpcClient class for sending Event objects to an Avro endpoint (an Avro source running in a Flume agent, usually in another tier). Clients can be configured to fail over or load balance between endpoints, and Thrift endpoints (Thrift sources) are supported too.
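To make this concrete, here is a minimal sketch of a client built with the SDK's RpcClientFactory; the class name, hostname, port, and event body are placeholder values for illustration:

import java.nio.charset.Charset;

import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;

public class FlumeRpcExample {
  public static void main(String[] args) throws EventDeliveryException {
    // Connect to an Avro source listening on the given host and port
    // (placeholders; use the host and port of your agent's Avro source).
    RpcClient client = RpcClientFactory.getDefaultInstance("localhost", 10000);
    try {
      // Build an event from a string body and send it; append() returns once
      // the source has accepted the event.
      Event event = EventBuilder.withBody("Hello, Flume!", Charset.forName("UTF-8"));
      client.append(event);
    } finally {
      client.close();
    }
  }
}

The factory can also build failover and load-balancing clients from a properties configuration; see the Flume Developer Guide for the details.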
The Flume embedded agent offers similar functionality: it is a cut-down Flume agent that runs in a Java application. It has a single special source that your application sends Flume Event objects to by calling a method on the EmbeddedAgent object; the only sinks that are supported are Avro sinks, but it can be configured with multiple sinks for failover or load balancing.

Both the SDK and the embedded agent are described in more detail in the Flume Developer Guide.
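As a rough sketch of how an application might use the embedded agent, the following example configures an in-memory channel and a single Avro sink, following the configuration keys described in the Flume Developer Guide; the agent name, sink hostname, and port are placeholders:

import java.nio.charset.Charset;
import java.util.HashMap;
import java.util.Map;

import org.apache.flume.EventDeliveryException;
import org.apache.flume.agent.embedded.EmbeddedAgent;
import org.apache.flume.event.EventBuilder;

public class EmbeddedAgentExample {
  public static void main(String[] args) throws EventDeliveryException {
    // Configure an in-memory channel and one Avro sink (placeholder host/port).
    Map<String, String> conf = new HashMap<String, String>();
    conf.put("channel.type", "memory");
    conf.put("channel.capacity", "200");
    conf.put("sinks", "sink1");
    conf.put("sink1.type", "avro");
    conf.put("sink1.hostname", "collector.example.com");
    conf.put("sink1.port", "10000");
    conf.put("processor.type", "default");

    EmbeddedAgent agent = new EmbeddedAgent("myagent");
    agent.configure(conf);
    agent.start();
    try {
      // Hand an event to the embedded agent's special source.
      agent.put(EventBuilder.withBody("Hello, Flume!", Charset.forName("UTF-8")));
    } finally {
      agent.stop();
    }
  }
}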
Component Catalog

We've only used a handful of Flume components in this chapter. Flume comes with many more, which are briefly described in Table 14-1. Refer to the Flume User Guide for further information on how to configure and use them.

Table 14-1. Flume components

Source
  Avro: Listens on a port for events sent over Avro RPC by an Avro sink or the Flume SDK.
  Exec: Runs a Unix command (e.g., tail -F /path/to/file) and converts lines read from standard output into events. Note that this source cannot guarantee delivery of events to the channel; see the spooling directory source or the Flume SDK for better alternatives.
  HTTP: Listens on a port and converts HTTP requests into events using a pluggable handler (e.g., a JSON handler or binary blob handler).
  JMS: Reads messages from a JMS queue or topic and converts them into events.
  Netcat: Listens on a port and converts each line of text into an event.
  Sequence generator: Generates events from an incrementing counter. Useful for testing.
  Spooling directory: Reads lines from files placed in a spooling directory and converts them into events.
  Syslog: Reads lines from syslog and converts them into events.
  Thrift: Listens on a port for events sent over Thrift RPC by a Thrift sink or the Flume SDK.
  Twitter: Connects to Twitter's streaming API (1% of the firehose) and converts tweets into events.

Sink
  Avro: Sends events over Avro RPC to an Avro source.
  Elasticsearch: Writes events to an Elasticsearch cluster using the Logstash format.
  File roll: Writes events to the local filesystem.
  HBase: Writes events to HBase using a choice of serializer.
  HDFS: Writes events to HDFS in text, sequence file, Avro, or a custom format.
  IRC: Sends events to an IRC channel.
  Logger: Logs events at INFO level using SLF4J. Useful for testing.
  Morphline (Solr): Runs events through an in-process chain of Morphline commands. Typically used to load data into Solr.
  Null: Discards all events.
  Thrift: Sends events over Thrift RPC to a Thrift source.

Channel
  File: Stores events in a transaction log stored on the local filesystem.
  JDBC: Stores events in a database (embedded Derby).
  Memory: Stores events in an in-memory queue.

Interceptor
  Host: Sets a host header containing the agent's hostname or IP address on all events.
  Morphline: Filters events through a Morphline configuration file. Useful for conditionally dropping events or adding headers based on pattern matching or content extraction.
  Regex extractor: Sets headers extracted from the event body as text using a specified regular expression.
  Regex filtering: Includes or excludes events by matching the event body as text against a specified regular expression.
  Static: Sets a fixed header and value on all events.
  Timestamp: Sets a timestamp header containing the time in milliseconds at which the agent processes the event.
  UUID: Sets an id header containing a universally unique identifier on all events. Useful for later deduplication.
Further Reading

This chapter has given a short overview of Flume. For more detail, see Using Flume by Hari Shreedharan (O'Reilly, 2014). There is also a lot of practical information about designing ingest pipelines (and building Hadoop applications in general) in Hadoop Application Architectures by Mark Grover, Ted Malaska, Jonathan Seidman, and Gwen Shapira (O'Reilly, 2014).

CHAPTER 15
Sqoop
Aaron Kimball

A great strength of the Hadoop platform is its ability to work with data in several different forms. HDFS can reliably store logs and other data from a plethora of sources, and MapReduce programs can parse diverse ad hoc data formats, extracting relevant information and combining multiple datasets into powerful results. But to interact with data in storage repositories outside of HDFS, MapReduce programs need to use external APIs.
Often, valuable data in an organization is stored in structured data stores such as relational database management systems (RDBMSs). Apache Sqoop is an open source tool that allows users to extract data from a structured data store into Hadoop for further processing. This processing can be done with MapReduce programs or other higher-level tools such as Hive. (It's even possible to use Sqoop to move data from a database into HBase.) When the final results of an analytic pipeline are available, Sqoop can export these results back to the data store for consumption by other clients.

In this chapter, we'll take a look at how Sqoop works and how you can use it in your data processing pipeline.

Getting Sqoop

Sqoop is available in a few places.
The primary home of the project is the Apache Software Foundation. This repository contains all the Sqoop source code and documentation. Official releases are available at this site, as well as the source code for the version currently under development. The repository itself contains instructions for compiling the project. Alternatively, you can get Sqoop from a Hadoop vendor distribution.

If you download a release from Apache, it will be placed in a directory such as /home/yourname/sqoop-x.y.z/.
We'll call this directory $SQOOP_HOME. You can run Sqoop by running the executable script $SQOOP_HOME/bin/sqoop.

If you've installed a release from a vendor, the package will have placed Sqoop's scripts in a standard location such as /usr/bin/sqoop. You can run Sqoop by simply typing sqoop at the command line. (Regardless of how you install Sqoop, we'll refer to this script as just sqoop from here on.)

Sqoop 2

Sqoop 2 is a rewrite of Sqoop that addresses the architectural limitations of Sqoop 1. For example, Sqoop 1 is a command-line tool and does not provide a Java API, so it's difficult to embed it in other programs.
Also, in Sqoop 1 every connector has to know about every output format, so it is a lot of work to write new connectors. Sqoop 2 has a server component that runs jobs, as well as a range of clients: a command-line interface (CLI), a web UI, a REST API, and a Java API. Sqoop 2 will also be able to use alternative execution engines, such as Spark. Note that Sqoop 2's CLI is not compatible with Sqoop 1's CLI.

The Sqoop 1 release series is the current stable release series, and is what is used in this chapter. Sqoop 2 is under active development but does not yet have feature parity with Sqoop 1, so you should check that it can support your use case before using it in production.

Running Sqoop with no arguments does not do much of interest:

% sqoop
Try sqoop help for usage.

Sqoop is organized as a set of tools or commands.