Tom White - Hadoop The Definitive Guide_ 4 edition - 2015 (811394), page 51

File No. 811394: Tom White - Hadoop The Definitive Guide_ 4 edition - 2015 (Tom White - Hadoop The Definitive Guide_ 4 edition - 2015.pdf), page 51. 2020-08-25, StudIzba

Text from the file (page 51)

Both the Met Office and NCDC data are text based, so we use TextInputFormat for each. But the line format of the two data sources is different, so we use two different mappers. The MaxTemperatureMapper reads NCDC input data and extracts the year and temperature fields. The MetOfficeMaxTemperatureMapper reads Met Office input data and extracts the year and temperature fields. The important thing is that the map outputs have the same types, since the reducers (which are all of the same type) see the aggregated map outputs and are not aware of the different mappers used to produce them.

5. Met Office data is generally available only to the research and academic community. However, there is a small amount of monthly weather station data available at http://www.metoffice.gov.uk/climate/uk/stationdata/.

The MultipleInputs class has an overloaded version of addInputPath() that doesn’t take a mapper:

    public static void addInputPath(Job job, Path path,
        Class<? extends InputFormat> inputFormatClass)

This is useful when you only have one mapper (set using the Job’s setMapperClass() method) but multiple input formats.

Database Input (and Output)

DBInputFormat is an input format for reading data from a relational database, using JDBC.

Because it doesn’t have any sharding capabilities, you need to be careful not to overwhelm the database from which you are reading by running too many mappers. For this reason, it is best used for loading relatively small datasets, perhaps for joining with larger datasets from HDFS using MultipleInputs. The corresponding output format is DBOutputFormat, which is useful for dumping job outputs (of modest size) into a database.

For an alternative way of moving data between relational databases and HDFS, consider using Sqoop, which is described in Chapter 15.

HBase’s TableInputFormat is designed to allow a MapReduce program to operate on data stored in an HBase table.

TableOutputFormat is for writing MapReduce outputs into an HBase table.

Output Formats

Hadoop has output data formats that correspond to the input formats covered in the previous section. The OutputFormat class hierarchy appears in Figure 8-4.

Chapter 8: MapReduce Types and Formats

Figure 8-4. OutputFormat class hierarchy

Text Output

The default output format, TextOutputFormat, writes records as lines of text. Its keys and values may be of any type, since TextOutputFormat turns them to strings by calling toString() on them. Each key-value pair is separated by a tab character, although that may be changed using the mapreduce.output.textoutputformat.separator property.
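As a rough illustration of the line layout just described, the following plain-Java sketch builds one output line from a key and a value. It does not use Hadoop at all: the class name TextLineSketch and the method formatLine are made up for this example, and a null argument is used as a stand-in for a NullWritable key or value, which TextOutputFormat leaves out together with the separator.

```java
// Plain-Java sketch (no Hadoop dependency) of how a text output line is laid out.
// TextLineSketch and formatLine are illustrative names, not Hadoop APIs.
class TextLineSketch {

    /** Default separator between key and value: a single tab character. */
    static final String DEFAULT_SEPARATOR = "\t";

    /**
     * Builds one output line by calling toString() on the key and the value.
     * A null argument stands in for a NullWritable: that field is omitted, and
     * the separator is written only when both key and value are present.
     */
    static String formatLine(Object key, Object value, String separator) {
        StringBuilder line = new StringBuilder();
        if (key != null) {
            line.append(key.toString());
        }
        if (key != null && value != null) {
            line.append(separator);
        }
        if (value != null) {
            line.append(value.toString());
        }
        return line.toString();
    }

    public static void main(String[] args) {
        // Key and value separated by the default tab
        System.out.println(formatLine(1950, 22, DEFAULT_SEPARATOR));
        // Same pair with a comma as the configured separator
        System.out.println(formatLine(1950, 22, ","));
        // Suppressed key: only the value is written, with no separator
        System.out.println(formatLine(null, "raw record", DEFAULT_SEPARATOR));
    }
}
```

With the key suppressed, the line contains only the value, so it can be read back as an ordinary line of text.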

The counterpart to TextOutputFormat for reading in this case is KeyValueTextInputFormat, since it breaks lines into key-value pairs based on a configurable separator (see “KeyValueTextInputFormat” on page 233).

You can suppress the key or the value from the output (or both, making this output format equivalent to NullOutputFormat, which emits nothing) using a NullWritable type. This also causes no separator to be written, which makes the output suitable for reading in using TextInputFormat.

Binary Output

SequenceFileOutputFormat

As the name indicates, SequenceFileOutputFormat writes sequence files for its output. This is a good choice of output if it forms the input to a further MapReduce job, since it is compact and is readily compressed.

Compression is controlled via the static methods on SequenceFileOutputFormat, as described in “Using Compression in MapReduce” on page 107. For an example of how to use SequenceFileOutputFormat, see “Sorting” on page 255.

SequenceFileAsBinaryOutputFormat

SequenceFileAsBinaryOutputFormat, the counterpart to SequenceFileAsBinaryInputFormat, writes keys and values in raw binary format into a sequence file container.

MapFileOutputFormat

MapFileOutputFormat writes map files as output.

The keys in a MapFile must be added in order, so you need to ensure that your reducers emit keys in sorted order.

The reduce input keys are guaranteed to be sorted, but the output keys are under the control of the reduce function, and there is nothing in the general MapReduce contract that states that the reduce output keys have to be ordered in any way.

The extra constraint of sorted reduce output keys is just needed for MapFileOutputFormat.

Multiple Outputs

FileOutputFormat and its subclasses generate a set of files in the output directory. There is one file per reducer, and files are named by the partition number: part-r-00000, part-r-00001, and so on. Sometimes there is a need to have more control over the naming of the files or to produce multiple files per reducer. MapReduce comes with the MultipleOutputs class to help you do this (footnote 6).

An example: Partitioning data

Consider the problem of partitioning the weather dataset by weather station. We would like to run a job whose output is one file per station, with each file containing all the records for that station.

One way of doing this is to have a reducer for each weather station.

To arrange this, we need to do two things. First, write a partitioner that puts records from the same weather station into the same partition. Second, set the number of reducers on the job to be the number of weather stations. The partitioner would look like this:

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // NcdcRecordParser is a helper class from the book's example code.
    public class StationPartitioner extends Partitioner<LongWritable, Text> {

      private NcdcRecordParser parser = new NcdcRecordParser();

      @Override
      public int getPartition(LongWritable key, Text value, int numPartitions) {
        // Parse the raw record, then partition on its station ID
        parser.parse(value);
        return getPartition(parser.getStationId());
      }

      private int getPartition(String stationId) {
        ...
      }
    }

The getPartition(String) method, whose implementation is not shown, turns the station ID into a partition index.

6. The old MapReduce API includes two classes for producing multiple outputs: MultipleOutputFormat and MultipleOutputs. In a nutshell, MultipleOutputs is more fully featured, but MultipleOutputFormat has more control over the output directory structure and file naming. MultipleOutputs in the new API combines the best features of the two multiple output classes in the old API. The code on this book’s website includes old API equivalents of the examples in this section using both MultipleOutputs and MultipleOutputFormat.
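The book leaves getPartition(String) unimplemented. Purely as an illustration, and not as the author's code, a list-based lookup could be sketched in plain Java as follows. The class name StationIndexSketch and the hard-coded station IDs are assumptions made for this example; a real job would have to load the IDs from station metadata.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only: a list-index lookup for station IDs.
// StationIndexSketch and the sample IDs below are hypothetical.
class StationIndexSketch {

    // Hypothetical station IDs, fixed before the job runs
    static final List<String> STATION_IDS =
        Arrays.asList("011990-99999", "012650-99999", "029070-99999");

    /**
     * Returns the position of the station ID in the list, used as the
     * partition index. An ID missing from the list yields -1, which is not a
     * valid partition index, so the caller would have to handle that case.
     */
    static int getPartition(String stationId) {
        return STATION_IDS.indexOf(stationId);
    }

    public static void main(String[] args) {
        System.out.println(getPartition("012650-99999")); // prints 1
        System.out.println(getPartition("999999-99999")); // prints -1 (unknown station)
    }
}
```

The number of reducers would have to equal STATION_IDS.size() for every index to map to a reduce task.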

To do this, it needs a list of all the station IDs; it then just returns the index of the station ID in the list.

There are two drawbacks to this approach. The first is that since the number of partitions needs to be known before the job is run, so does the number of weather stations. Although the NCDC provides metadata about its stations, there is no guarantee that the IDs encountered in the data will match those in the metadata. A station that appears in the metadata but not in the data wastes a reduce task.

Worse, a station that appears in the data but not in the metadata doesn’t get a reduce task; it has to be thrown away. One way of mitigating this problem would be to write a job to extract the unique station IDs, but it’s a shame that we need an extra job to do this.

The second drawback is more subtle. It is generally a bad idea to allow the number of partitions to be rigidly fixed by the application, since this can lead to small or uneven-sized partitions. Having many reducers doing a small amount of work isn’t an efficient way of organizing a job; it’s much better to get reducers to do more work and have fewer of them, as the overhead in running a task is then reduced.

Uneven-sized partitions can be difficult to avoid, too. Different weather stations will have gathered a widely varying amount of data; for example, compare a station that opened one year ago to one that has been gathering data for a century. If a few reduce tasks take significantly longer than the others, they will dominate the job execution time and cause it to be longer than it needs to be.

There are two special cases when it does make sense to allow the application to set the number of partitions (or equivalently, the number of reducers):

Zero reducers
    This is a vacuous case: there are no partitions, as the application needs to run only map tasks.

One reducer
    It can be convenient to run small jobs to combine the output of previous jobs into a single file. This should be attempted only when the amount of data is small enough to be processed comfortably by one reducer.

It is much better to let the cluster drive the number of partitions for a job, the idea being that the more cluster resources there are available, the faster the job can complete.
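For the cluster to choose the partition count, the partitioner has to give a valid answer for any count. A minimal plain-Java sketch of hash-modulo partitioning, the approach taken by Hadoop's HashPartitioner, is given here; the class name HashPartitionSketch and the method partitionFor are illustrative and not part of Hadoop.

```java
// Plain-Java sketch of hash-modulo partitioning (no Hadoop dependency).
// HashPartitionSketch and partitionFor are illustrative names.
class HashPartitionSketch {

    /** Maps a key to a partition index in the range [0, numPartitions). */
    static int partitionFor(Object key, int numPartitions) {
        // Masking with Integer.MAX_VALUE clears the sign bit, so a negative
        // hash code can never produce a negative partition index.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        String stationId = "011990-99999"; // sample key, chosen for illustration
        // The same key is valid for any number of partitions
        for (int n : new int[] {1, 4, 30}) {
            System.out.println(n + " partitions -> " + partitionFor(stationId, n));
        }
    }
}
```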

This is why the default HashPartitioner works so well: it works with any number of partitions and ensures each partition has a good mix of keys, leading to more evenly sized partitions.

If we go back to using HashPartitioner, each partition will contain multiple stations, so to create a file per station, we need to arrange for each reducer to write multiple files. This is where MultipleOutputs comes in.

MultipleOutputs

MultipleOutputs allows you to write data to files whose names are derived from the output keys and values, or in fact from an arbitrary string. This allows each reducer (or mapper in a map-only job) to create more than a single file.

Filenames are of the form name-m-nnnnn for map outputs and name-r-nnnnn for reduce outputs, where name is an arbitrary name that is set by the program and nnnnn is an integer designating the part number, starting from 00000. The part number ensures that outputs written from different partitions (mappers or reducers) do not collide in the case of the same name.

The program in Example 8-5 shows how to use MultipleOutputs to partition the dataset by station.

Example 8-5. Partitioning whole dataset into files named by the station ID using MultipleOutputs

    import java.io.IOException;

    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    // NcdcRecordParser and JobBuilder are helper classes from the book's example code.
    public class PartitionByStationUsingMultipleOutputs extends Configured
        implements Tool {

      static class StationMapper
          extends Mapper<LongWritable, Text, Text, Text> {

        private NcdcRecordParser parser = new NcdcRecordParser();

        @Override
        protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
          parser.parse(value);
          // Key each record by its station ID so records are grouped per station
          context.write(new Text(parser.getStationId()), value);
        }
      }

      static class MultipleOutputsReducer
          extends Reducer<Text, Text, NullWritable, Text> {

        private MultipleOutputs<NullWritable, Text> multipleOutputs;

        @Override
        protected void setup(Context context)
            throws IOException, InterruptedException {
          multipleOutputs = new MultipleOutputs<NullWritable, Text>(context);
        }

        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
          for (Text value : values) {
            // The station ID (the key) is used as the base name of the output file
            multipleOutputs.write(NullWritable.get(), value, key.toString());
          }
        }

        @Override
        protected void cleanup(Context context)
            throws IOException, InterruptedException {
          multipleOutputs.close();
        }
      }

      @Override
      public int run(String[] args) throws Exception {
        Job job = JobBuilder.parseInputAndOutput(this, getConf(), args);
        if (job == null) {
          return -1;
        }

        job.setMapperClass(StationMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setReducerClass(MultipleOutputsReducer.class);
        job.setOutputKeyClass(NullWritable.class);

        return job.waitForCompletion(true) ? 0 : 1;
      }

      public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new PartitionByStationUsingMultipleOutputs(),
            args);
        System.exit(exitCode);
      }
    }

In the reducer, which is where we generate the output, we construct an instance of MultipleOutputs in the setup() method and assign it to an instance variable.
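The name-m-nnnnn and name-r-nnnnn naming scheme described before Example 8-5 can be illustrated with a small plain-Java sketch. It only reproduces the filename pattern and is not how MultipleOutputs builds names internally; the class name OutputNameSketch and the method outputFileName are made up for this example.

```java
import java.util.Locale;

// Plain-Java sketch of the name-m-nnnnn / name-r-nnnnn filename pattern.
// OutputNameSketch and outputFileName are illustrative names, not Hadoop APIs.
class OutputNameSketch {

    /**
     * Builds a filename from a base name, the task type ("m" for map output,
     * "r" for reduce output), and a part number zero-padded to five digits.
     */
    static String outputFileName(String name, boolean isMapOutput, int partNumber) {
        String taskType = isMapOutput ? "m" : "r";
        return String.format(Locale.ROOT, "%s-%s-%05d", name, taskType, partNumber);
    }

    public static void main(String[] args) {
        // A station ID used as the base name, written by the first reducer
        System.out.println(outputFileName("029070-99999", false, 0)); // 029070-99999-r-00000
        // The same base name written by another reducer does not collide
        System.out.println(outputFileName("029070-99999", false, 3)); // 029070-99999-r-00003
        // A map-only job uses "m" instead of "r"
        System.out.println(outputFileName("029070-99999", true, 0));  // 029070-99999-m-00000
    }
}
```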

Characteristics

File type: PDF
Size: 9.6 Mb
Material: Tom White - Hadoop The Definitive Guide_ 4 edition - 2015.pdf
Material type: Book
Subject: (СМРХиОД) Modern Methods of Distributed Data Storage and Processing
University: Lomonosov Moscow State University (MSU)
