The file prefix is used to ensure that HDFS files created by second-tier agents at the same time don't collide.

In the more usual case of agents running on different machines, the hostname can be used to make the filename unique by configuring a host interceptor (see Table 14-1) and including the %{host} escape sequence in the file path, or prefix:

agent2.sinks.sink2.hdfs.filePrefix = events-%{host}

A diagram of the whole system is shown in Figure 14-6.

Figure 14-6. Load balancing between two agents

Integrating Flume with Applications

An Avro source is an RPC endpoint that accepts Flume events, making it possible to write an RPC client to send events to the endpoint, which can be embedded in any application that wants to introduce events into Flume.

The Flume SDK is a module that provides a Java RpcClient class for sending Event objects to an Avro endpoint (an Avro source running in a Flume agent, usually in another tier). Clients can be configured to fail over or load balance between endpoints, and Thrift endpoints (Thrift sources) are supported too.
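To make this concrete, here is a minimal sketch of a client built with the SDK's RpcClientFactory; the class name, hostname, port, and event body are placeholder values for illustration:

import java.nio.charset.Charset;

import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;

public class FlumeRpcExample {
  public static void main(String[] args) throws EventDeliveryException {
    // Connect to an Avro source listening on the given host and port
    // (placeholders; use the host and port of your agent's Avro source).
    RpcClient client = RpcClientFactory.getDefaultInstance("localhost", 10000);
    try {
      // Build an event from a string body and send it; append() returns once
      // the source has accepted the event.
      Event event = EventBuilder.withBody("Hello, Flume!", Charset.forName("UTF-8"));
      client.append(event);
    } finally {
      client.close();
    }
  }
}

The factory can also build failover and load-balancing clients from a properties configuration; see the Flume Developer Guide for the details.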
The Flume embedded agent offers similar functionality: it is a cut-down Flume agent that runs in a Java application. It has a single special source that your application sends Flume Event objects to by calling a method on the EmbeddedAgent object; the only sinks that are supported are Avro sinks, but it can be configured with multiple sinks for failover or load balancing.

Both the SDK and the embedded agent are described in more detail in the Flume Developer Guide.
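As a rough sketch of how an application might use the embedded agent, the following example configures an in-memory channel and a single Avro sink, following the configuration keys described in the Flume Developer Guide; the agent name, sink hostname, and port are placeholders:

import java.nio.charset.Charset;
import java.util.HashMap;
import java.util.Map;

import org.apache.flume.EventDeliveryException;
import org.apache.flume.agent.embedded.EmbeddedAgent;
import org.apache.flume.event.EventBuilder;

public class EmbeddedAgentExample {
  public static void main(String[] args) throws EventDeliveryException {
    // Configure an in-memory channel and one Avro sink (placeholder host/port).
    Map<String, String> conf = new HashMap<String, String>();
    conf.put("channel.type", "memory");
    conf.put("channel.capacity", "200");
    conf.put("sinks", "sink1");
    conf.put("sink1.type", "avro");
    conf.put("sink1.hostname", "collector.example.com");
    conf.put("sink1.port", "10000");
    conf.put("processor.type", "default");

    EmbeddedAgent agent = new EmbeddedAgent("myagent");
    agent.configure(conf);
    agent.start();
    try {
      // Hand an event to the embedded agent's special source.
      agent.put(EventBuilder.withBody("Hello, Flume!", Charset.forName("UTF-8")));
    } finally {
      agent.stop();
    }
  }
}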
Component Catalog

We've only used a handful of Flume components in this chapter. Flume comes with many more, which are briefly described in Table 14-1. Refer to the Flume User Guide for further information on how to configure and use them.

Table 14-1. Flume components

Source
  Avro: Listens on a port for events sent over Avro RPC by an Avro sink or the Flume SDK.
  Exec: Runs a Unix command (e.g., tail -F /path/to/file) and converts lines read from standard output into events. Note that this source cannot guarantee delivery of events to the channel; see the spooling directory source or the Flume SDK for better alternatives.
  HTTP: Listens on a port and converts HTTP requests into events using a pluggable handler (e.g., a JSON handler or binary blob handler).
  JMS: Reads messages from a JMS queue or topic and converts them into events.
  Netcat: Listens on a port and converts each line of text into an event.
  Sequence generator: Generates events from an incrementing counter. Useful for testing.
  Spooling directory: Reads lines from files placed in a spooling directory and converts them into events.
  Syslog: Reads lines from syslog and converts them into events.
  Thrift: Listens on a port for events sent over Thrift RPC by a Thrift sink or the Flume SDK.
  Twitter: Connects to Twitter's streaming API (1% of the firehose) and converts tweets into events.

Sink
  Avro: Sends events over Avro RPC to an Avro source.
  Elasticsearch: Writes events to an Elasticsearch cluster using the Logstash format.
  File roll: Writes events to the local filesystem.
  HBase: Writes events to HBase using a choice of serializer.
  HDFS: Writes events to HDFS in text, sequence file, Avro, or a custom format.
  IRC: Sends events to an IRC channel.
  Logger: Logs events at INFO level using SLF4J. Useful for testing.
  Morphline (Solr): Runs events through an in-process chain of Morphline commands. Typically used to load data into Solr.
  Null: Discards all events.
  Thrift: Sends events over Thrift RPC to a Thrift source.

Channel
  File: Stores events in a transaction log stored on the local filesystem.
  JDBC: Stores events in a database (embedded Derby).
  Memory: Stores events in an in-memory queue.

Interceptor
  Host: Sets a host header containing the agent's hostname or IP address on all events.
  Morphline: Filters events through a Morphline configuration file. Useful for conditionally dropping events or adding headers based on pattern matching or content extraction.
  Regex extractor: Sets headers extracted from the event body as text using a specified regular expression.
  Regex filtering: Includes or excludes events by matching the event body as text against a specified regular expression.
  Static: Sets a fixed header and value on all events.
  Timestamp: Sets a timestamp header containing the time in milliseconds at which the agent processes the event.
  UUID: Sets an id header containing a universally unique identifier on all events. Useful for later deduplication.
Further Reading

This chapter has given a short overview of Flume. For more detail, see Using Flume by Hari Shreedharan (O'Reilly, 2014). There is also a lot of practical information about designing ingest pipelines (and building Hadoop applications in general) in Hadoop Application Architectures by Mark Grover, Ted Malaska, Jonathan Seidman, and Gwen Shapira (O'Reilly, 2014).

CHAPTER 15
Sqoop
Aaron Kimball

A great strength of the Hadoop platform is its ability to work with data in several different forms. HDFS can reliably store logs and other data from a plethora of sources, and MapReduce programs can parse diverse ad hoc data formats, extracting relevant information and combining multiple datasets into powerful results. But to interact with data in storage repositories outside of HDFS, MapReduce programs need to use external APIs.
Often, valuable data in an organization is stored in structured data stores such as relational database management systems (RDBMSs). Apache Sqoop is an open source tool that allows users to extract data from a structured data store into Hadoop for further processing. This processing can be done with MapReduce programs or other higher-level tools such as Hive. (It's even possible to use Sqoop to move data from a database into HBase.) When the final results of an analytic pipeline are available, Sqoop can export these results back to the data store for consumption by other clients.

In this chapter, we'll take a look at how Sqoop works and how you can use it in your data processing pipeline.

Getting Sqoop

Sqoop is available in a few places.
The primary home of the project is the Apache Software Foundation. This repository contains all the Sqoop source code and documentation. Official releases are available at this site, as well as the source code for the version currently under development. The repository itself contains instructions for compiling the project. Alternatively, you can get Sqoop from a Hadoop vendor distribution.

If you download a release from Apache, it will be placed in a directory such as /home/yourname/sqoop-x.y.z/.
We'll call this directory $SQOOP_HOME. You can run Sqoop by running the executable script $SQOOP_HOME/bin/sqoop.

If you've installed a release from a vendor, the package will have placed Sqoop's scripts in a standard location such as /usr/bin/sqoop. You can run Sqoop by simply typing sqoop at the command line. (Regardless of how you install Sqoop, we'll refer to this script as just sqoop from here on.)

Sqoop 2

Sqoop 2 is a rewrite of Sqoop that addresses the architectural limitations of Sqoop 1. For example, Sqoop 1 is a command-line tool and does not provide a Java API, so it's difficult to embed it in other programs.
Also, in Sqoop 1 every connector has to know about every output format, so it is a lot of work to write new connectors. Sqoop 2 has a server component that runs jobs, as well as a range of clients: a command-line interface (CLI), a web UI, a REST API, and a Java API. Sqoop 2 will also be able to use alternative execution engines, such as Spark. Note that Sqoop 2's CLI is not compatible with Sqoop 1's CLI.

The Sqoop 1 release series is the current stable release series, and is what is used in this chapter. Sqoop 2 is under active development but does not yet have feature parity with Sqoop 1, so you should check that it can support your use case before using it in production.

Running Sqoop with no arguments does not do much of interest:

% sqoop
Try sqoop help for usage.

Sqoop is organized as a set of tools or commands.