Tom White - Hadoop The Definitive Guide_ 4 edition - 2015 (811394), page 48


Properties for controlling split size

Property name                                      Type  Default value                               Description
mapreduce.input.fileinputformat.split.minsize      int   1                                           The smallest valid size in bytes for a file split
mapreduce.input.fileinputformat.split.maxsize (a)  long  Long.MAX_VALUE (i.e., 9223372036854775807)  The largest valid size in bytes for a file split
dfs.blocksize                                      long  128 MB (i.e., 134217728)                    The size of a block in HDFS in bytes

(a) This property is not present in the old MapReduce API (with the exception of CombineFileInputFormat). Instead, it is calculated indirectly as the size of the total input for the job, divided by the guide number of map tasks specified by mapreduce.job.maps (or the setNumMapTasks() method on JobConf). Because the number of map tasks defaults to 1, this makes the maximum split size the size of the input.

The minimum split size is usually 1 byte, although some formats have a lower bound on the split size. (For example, sequence files insert sync entries every so often in the stream, so the minimum split size has to be large enough to ensure that every split has a sync point to allow the reader to resynchronize with a record boundary. See “Reading a SequenceFile” on page 129.)

Applications may impose a minimum split size. By setting this to a value larger than the block size, they can force splits to be larger than a block. There is no good reason for doing this when using HDFS, because doing so will increase the number of blocks that are not local to a map task.

The maximum split size defaults to the maximum value that can be represented by a Java long type. It has an effect only when it is less than the block size, forcing splits to be smaller than a block.

The split size is calculated by the following formula (see the computeSplitSize() method in FileInputFormat):

    max(minimumSize, min(maximumSize, blockSize))

and by default:

    minimumSize < blockSize < maximumSize

so the split size is blockSize. Various settings for these parameters and how they affect the final split size are illustrated in Table 8-6.

Table 8-6. Examples of how to control the split size

Minimum split size  Maximum split size        Block size        Split size  Comment
1 (default)         Long.MAX_VALUE (default)  128 MB (default)  128 MB      By default, the split size is the same as the default block size.
1 (default)         Long.MAX_VALUE (default)  256 MB            256 MB      The most natural way to increase the split size is to have larger blocks in HDFS, either by setting dfs.blocksize or by configuring this on a per-file basis at file construction time.
256 MB              Long.MAX_VALUE (default)  128 MB (default)  256 MB      Making the minimum split size greater than the block size increases the split size, but at the cost of locality.
1 (default)         64 MB                     128 MB (default)  64 MB       Making the maximum split size less than the block size decreases the split size.

Small files and CombineFileInputFormat

Hadoop works better with a small number of large files than a large number of small files.

One reason for this is that FileInputFormat generates splits in such a way that each split is all or part of a single file. If the files are very small (“small” means significantly smaller than an HDFS block) and there are a lot of them, each map task will process very little input, and there will be many map tasks (one per file), each of which imposes extra bookkeeping overhead. Compare a 1 GB file broken into eight 128 MB blocks with 10,000 or so 100 KB files. The 10,000 files use one map each, and the job time can be tens or hundreds of times slower than the equivalent one with a single input file and eight map tasks.

The situation is alleviated somewhat by CombineFileInputFormat, which was designed to work well with small files. Where FileInputFormat creates a split per file, CombineFileInputFormat packs many files into each split so that each mapper has more to process.
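To put numbers on this comparison, here is a minimal plain-Java sketch that runs without Hadoop. It restates the split-size rule max(minimumSize, min(maximumSize, blockSize)) as a standalone function, then counts splits two ways: one or more splits per file, and whole files packed greedily into splits of bounded size. The class SplitCountSketch and the methods splitsPerFile() and splitsWhenCombined() are hypothetical names made up for this illustration. The greedy packing is an assumption that only approximates the idea behind CombineFileInputFormat, since it ignores node and rack locality entirely.

```java
import java.util.Collections;
import java.util.List;

/** Toy model of split counting. Hypothetical illustration, not Hadoop code. */
final class SplitCountSketch {
    static final long KB = 1024L;
    static final long MB = 1024L * KB;

    /** Standalone restatement of max(minimumSize, min(maximumSize, blockSize)). */
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    /** Splits never span files: each file contributes ceil(length / splitSize) splits. */
    static long splitsPerFile(List<Long> fileLengths, long splitSize) {
        long splits = 0;
        for (long length : fileLengths) {
            splits += (length + splitSize - 1) / splitSize; // ceiling division
        }
        return splits;
    }

    /** Assumption: pack whole files greedily into splits of at most maxSplitSize bytes. */
    static long splitsWhenCombined(List<Long> fileLengths, long maxSplitSize) {
        long splits = 0;
        long current = 0; // bytes already placed in the split being filled
        for (long length : fileLengths) {
            if (current > 0 && current + length > maxSplitSize) {
                splits++;      // close the current split and start a new one
                current = 0;
            }
            current += length;
        }
        return current > 0 ? splits + 1 : splits;
    }

    public static void main(String[] args) {
        long splitSize = computeSplitSize(128 * MB, 1L, Long.MAX_VALUE);
        List<Long> oneLargeFile = Collections.singletonList(1024 * MB);
        List<Long> manySmallFiles = Collections.nCopies(10_000, 100 * KB);

        System.out.println(splitsPerFile(oneLargeFile, splitSize));        // 8
        System.out.println(splitsPerFile(manySmallFiles, splitSize));      // 10000
        System.out.println(splitsWhenCombined(manySmallFiles, splitSize)); // 8
    }
}
```

Under these assumptions the 10,000 small files cost 10,000 map tasks when each file gets its own split, but only 8 when packed, the same as the single 1 GB file.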

Crucially, CombineFileInputFormat takes node and rack locality into account when deciding which blocks to place in the same split, so it does not compromise the speed at which it can process the input in a typical MapReduce job.

Of course, if possible, it is still a good idea to avoid the many small files case, because MapReduce works best when it can operate at the transfer rate of the disks in the cluster, and processing many small files increases the number of seeks that are needed to run a job. Also, storing large numbers of small files in HDFS is wasteful of the namenode’s memory. One technique for avoiding the many small files case is to merge small files into larger files by using a sequence file, as in Example 8-4; with this approach, the keys can act as filenames (or a constant such as NullWritable, if not needed) and the values as file contents. But if you already have a large number of small files in HDFS, then CombineFileInputFormat is worth trying.

CombineFileInputFormat isn’t just good for small files. It can bring benefits when processing large files, too, since it will generate one split per node, which may be made up of multiple blocks. Essentially, CombineFileInputFormat decouples the amount of data that a mapper consumes from the block size of the files in HDFS.

Preventing splitting

Some applications don’t want files to be split, as this allows a single mapper to process each input file in its entirety. For example, a simple way to check if all the records in a file are sorted is to go through the records in order, checking whether each record is not less than the preceding one.
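The sortedness check just described can be sketched in plain Java as follows. The class SortedCheckSketch and the method isSorted() are hypothetical names for this illustration, and the list of integers stands in for the records of a file. The main method also shows what goes wrong when the check is applied to pieces of a file independently: each piece can look sorted even though the file as a whole is not.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration of an in-order sortedness check. */
final class SortedCheckSketch {

    /** Returns true if no record is less than the record before it. */
    static <T extends Comparable<? super T>> boolean isSorted(List<T> records) {
        for (int i = 1; i < records.size(); i++) {
            if (records.get(i).compareTo(records.get(i - 1)) < 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<Integer> wholeFile = Arrays.asList(1, 3, 2, 4);
        List<Integer> firstPiece = wholeFile.subList(0, 2);  // [1, 3]
        List<Integer> secondPiece = wholeFile.subList(2, 4); // [2, 4]

        System.out.println(isSorted(firstPiece));  // true
        System.out.println(isSorted(secondPiece)); // true
        System.out.println(isSorted(wholeFile));   // false: 2 follows 3 at the boundary
    }
}
```

The out-of-order pair straddles the boundary between the two pieces, so neither piece can detect it on its own.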

Implemented as a map task, this algorithm will work only if one map processes the whole file. [2]

There are a couple of ways to ensure that an existing file is not split. The first (quick-and-dirty) way is to increase the minimum split size to be larger than the largest file in your system. Setting it to its maximum value, Long.MAX_VALUE, has this effect. The second is to subclass the concrete subclass of FileInputFormat that you want to use, to override the isSplitable() method [3] to return false. For example, here’s a nonsplittable TextInputFormat:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    public class NonSplittableTextInputFormat extends TextInputFormat {
      // Returning false means each file is handed to a single mapper as one split.
      @Override
      protected boolean isSplitable(JobContext context, Path file) {
        return false;
      }
    }

[2] This is how the mapper in SortValidator.RecordStatsChecker is implemented.

[3] In the method name isSplitable(), “splitable” has a single “t.” It is usually spelled “splittable,” which is the spelling I have used in this book.

File information in the mapper

A mapper processing a file input split can find information about the split by calling the getInputSplit() method on the Mapper’s Context object. When the input format derives from FileInputFormat, the InputSplit returned by this method can be cast to a FileSplit to access the file information listed in Table 8-7.

In the old MapReduce API, and the Streaming interface, the same file split information is made available through properties that can be read from the mapper’s configuration. (In the old MapReduce API this is achieved by implementing configure() in your Mapper implementation to get access to the JobConf object.)

In addition to the properties in Table 8-7, all mappers and reducers have access to the properties listed in “The Task Execution Environment” on page 203.

Table 8-7. File split properties

FileSplit method  Property name               Type         Description
getPath()         mapreduce.map.input.file    Path/String  The path of the input file being processed
getStart()        mapreduce.map.input.start   long         The byte offset of the start of the split from the beginning of the file
getLength()       mapreduce.map.input.length  long         The length of the split in bytes

In the next section, we’ll see how to use a FileSplit when we need to access the split’s filename.

Processing a whole file as a record

A related requirement that sometimes crops up is for mappers to have access to the full contents of a file.

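Not splitting the file gets you part of the way there, but you also need to have a RecordReader that delivers the file contents as the value of the record. The listing for WholeFileInputFormat in Example 8-2 shows a way of doing this.

Example 8-2. An InputFormat for reading a whole file as a record

    import java.io.IOException;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class WholeFileInputFormat
        extends FileInputFormat<NullWritable, BytesWritable> {

      // Never split: one mapper must see the whole file.
      @Override
      protected boolean isSplitable(JobContext context, Path file) {
        return false;
      }

      // WholeFileRecordReader is not shown in this section.
      @Override
      public RecordReader<NullWritable, BytesWritable> createRecordReader(
          InputSplit split, TaskAttemptContext context) throws IOException,
          InterruptedException {
        WholeFileRecordReader reader = new WholeFileRecordReader();
        reader.initialize(split, context);
        return reader;
      }
    }

WholeFileInputFormat defines a format where the keys are not used, represented by NullWritable, and the values are the file contents, represented by BytesWritable instances.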
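Since WholeFileRecordReader itself is not shown in this section, the following plain-Java analogue may help illustrate the idea of delivering a file's full contents as the value of a single record. It is only a sketch under stated assumptions: the class WholeFileReaderSketch is hypothetical, it reads from the local filesystem via java.nio.file (so Path here is java.nio.file.Path, not Hadoop's Path), and it merely borrows the method names nextKeyValue() and getCurrentValue() from the shape of a record reader rather than implementing the Hadoop RecordReader class.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical plain-Java analogue: yields one record whose value is the whole file. */
final class WholeFileReaderSketch {
    private final Path file;
    private byte[] value = new byte[0];
    private boolean processed = false;

    WholeFileReaderSketch(Path file) {
        this.file = file;
    }

    /** Advances to the next record. Returns true exactly once, then false. */
    boolean nextKeyValue() throws IOException {
        if (processed) {
            return false;
        }
        value = Files.readAllBytes(file); // the entire file becomes the record value
        processed = true;
        return true;
    }

    /** The value of the current record: the file's full contents. */
    byte[] getCurrentValue() {
        return value;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("whole-file", ".txt");
        try {
            Files.write(tmp, "line one\nline two\n".getBytes(StandardCharsets.UTF_8));
            WholeFileReaderSketch reader = new WholeFileReaderSketch(tmp);
            while (reader.nextKeyValue()) {
                // Runs once; the key is unused, mirroring the NullWritable key above.
                System.out.println(reader.getCurrentValue().length + " bytes in one record");
            }
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

Reading the whole file into memory is reasonable only when files are small enough to fit, which is the design trade-off any whole-file-as-record approach accepts.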
