Tom White - Hadoop: The Definitive Guide, 4th edition - 2015 (811394), page 29


Here’s what Doug Cutting said in response to that question:

“Why didn’t I use Serialization when we first started Hadoop? Because it looked big and hairy and I thought we needed something lean and mean, where we had precise control over exactly how objects are written and read, since that is central to Hadoop. With Serialization you can get some control, but you have to fight for it.

“The logic for not using RMI [Remote Method Invocation] was similar. Effective, high-performance inter-process communications are critical to Hadoop. I felt like we’d need to precisely control how things like connections, timeouts and buffers are handled, and RMI gives you little control over those.”

The problem is that Java Serialization doesn’t meet the criteria for a serialization format listed earlier: compact, fast, extensible, and interoperable.

Chapter 5: Hadoop I/O

Serialization IDL

There are a number of other serialization frameworks that approach the problem in a different way: rather than defining types through code, you define them in a language-neutral, declarative fashion, using an interface description language (IDL). The system can then generate types for different languages, which is good for interoperability. They also typically define versioning schemes that make type evolution straightforward.

Apache Thrift and Google Protocol Buffers are both popular serialization frameworks, and both are commonly used as a format for persistent binary data. There is limited support for these as MapReduce formats;[3] however, they are used internally in parts of Hadoop for RPC and data exchange.

Avro is an IDL-based serialization framework designed to work well with large-scale data processing in Hadoop. It is covered in Chapter 12.

File-Based Data Structures

For some applications, you need a specialized data structure to hold your data. For doing MapReduce-based processing, putting each blob of binary data into its own file doesn’t scale, so Hadoop developed a number of higher-level containers for these situations.

SequenceFile

Imagine a logfile where each log record is a new line of text. If you want to log binary types, plain text isn’t a suitable format. Hadoop’s SequenceFile class fits the bill in this situation, providing a persistent data structure for binary key-value pairs. To use it as a logfile format, you would choose a key, such as timestamp represented by a LongWritable, and the value would be a Writable that represents the quantity being logged.

SequenceFiles also work well as containers for smaller files. HDFS and MapReduce are optimized for large files, so packing files into a SequenceFile makes storing and processing the smaller files more efficient (“Processing a whole file as a record” on page 228 contains a program to pack files into a SequenceFile).[4]

Writing a SequenceFile

To create a SequenceFile, use one of its createWriter() static methods, which return a SequenceFile.Writer instance. There are several overloaded versions, but they all require you to specify a stream to write to (either an FSDataOutputStream or a FileSystem and Path pairing), a Configuration object, and the key and value types. Optional arguments include the compression type and codec, a Progressable callback to be informed of write progress, and a Metadata instance to be stored in the SequenceFile header.

[3] Twitter’s Elephant Bird project includes tools for working with Thrift and Protocol Buffers in Hadoop.

[4] In a similar vein, the blog post “A Million Little Files” by Stuart Sierra includes code for converting a tar file into a SequenceFile.

The keys and values stored in a SequenceFile do not necessarily need to be Writables. Any types that can be serialized and deserialized by a Serialization may be used.

Once you have a SequenceFile.Writer, you then write key-value pairs using the append() method. When you’ve finished, you call the close() method (SequenceFile.Writer implements java.io.Closeable).

Example 5-10 shows a short program to write some key-value pairs to a SequenceFile using the API just described.

Example 5-10. Writing a SequenceFile

    import java.io.IOException;
    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class SequenceFileWriteDemo {

      private static final String[] DATA = {
        "One, two, buckle my shoe",
        "Three, four, shut the door",
        "Five, six, pick up sticks",
        "Seven, eight, lay them straight",
        "Nine, ten, a big fat hen"
      };

      public static void main(String[] args) throws IOException {
        String uri = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        Path path = new Path(uri);

        // Key and value objects are reused for every record
        IntWritable key = new IntWritable();
        Text value = new Text();
        SequenceFile.Writer writer = null;
        try {
          writer = SequenceFile.createWriter(fs, conf, path,
              key.getClass(), value.getClass());

          for (int i = 0; i < 100; i++) {
            key.set(100 - i);
            value.set(DATA[i % DATA.length]);
            // Print the current file position before appending the record
            System.out.printf("[%s]\t%s\t%s\n", writer.getLength(), key, value);
            writer.append(key, value);
          }
        } finally {
          IOUtils.closeStream(writer);
        }
      }
    }

The keys in the sequence file are integers counting down from 100 to 1, represented as IntWritable objects.

The values are Text objects. Before each record is appended to the SequenceFile.Writer, we call the getLength() method to discover the current position in the file. (We will use this information about record boundaries in the next section, when we read the file nonsequentially.) We write the position out to the console, along with the key and value pairs. The result of running it is shown here:

    % hadoop SequenceFileWriteDemo numbers.seq
    [128]   100     One, two, buckle my shoe
    [173]   99      Three, four, shut the door
    [220]   98      Five, six, pick up sticks
    [264]   97      Seven, eight, lay them straight
    [314]   96      Nine, ten, a big fat hen
    [359]   95      One, two, buckle my shoe
    [404]   94      Three, four, shut the door
    [451]   93      Five, six, pick up sticks
    [495]   92      Seven, eight, lay them straight
    [545]   91      Nine, ten, a big fat hen
    ...
    [1976]  60      One, two, buckle my shoe
    [2021]  59      Three, four, shut the door
    [2088]  58      Five, six, pick up sticks
    [2132]  57      Seven, eight, lay them straight
    [2182]  56      Nine, ten, a big fat hen
    ...
    [4557]  5       One, two, buckle my shoe
    [4602]  4       Three, four, shut the door
    [4649]  3       Five, six, pick up sticks
    [4693]  2       Seven, eight, lay them straight
    [4743]  1       Nine, ten, a big fat hen

Reading a SequenceFile

Reading sequence files from beginning to end is a matter of creating an instance of SequenceFile.Reader and iterating over records by repeatedly invoking one of the next() methods. Which one you use depends on the serialization framework you are using. If you are using Writable types, you can use the next() method that takes a key and a value argument and reads the next key and value in the stream into these variables:

    public boolean next(Writable key, Writable val)

The return value is true if a key-value pair was read and false if the end of the file has been reached.

For other, non-Writable serialization frameworks (such as Apache Thrift), you should use these two methods:

    public Object next(Object key) throws IOException
    public Object getCurrentValue(Object val) throws IOException

In this case, you need to make sure that the serialization you want to use has been set in the io.serializations property; see “Serialization Frameworks” on page 126.

If the next() method returns a non-null object, a key-value pair was read from the stream, and the value can be retrieved using the getCurrentValue() method. Otherwise, if next() returns null, the end of the file has been reached.

The program in Example 5-11 demonstrates how to read a sequence file that has Writable keys and values. Note how the types are discovered from the SequenceFile.Reader via calls to getKeyClass() and getValueClass(), and then ReflectionUtils is used to create an instance for the key and an instance for the value. This technique allows the program to be used with any sequence file that has Writable keys and values.

Example 5-11. Reading a SequenceFile

    import java.io.IOException;
    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.util.ReflectionUtils;

    public class SequenceFileReadDemo {

      public static void main(String[] args) throws IOException {
        String uri = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        Path path = new Path(uri);

        SequenceFile.Reader reader = null;
        try {
          reader = new SequenceFile.Reader(fs, path, conf);
          // Create key and value instances from the types recorded in the file
          Writable key = (Writable)
              ReflectionUtils.newInstance(reader.getKeyClass(), conf);
          Writable value = (Writable)
              ReflectionUtils.newInstance(reader.getValueClass(), conf);
          long position = reader.getPosition();
          while (reader.next(key, value)) {
            // Mark records that were preceded by a sync point with an asterisk
            String syncSeen = reader.syncSeen() ? "*" : "";
            System.out.printf("[%s%s]\t%s\t%s\n", position, syncSeen, key, value);
            position = reader.getPosition(); // beginning of next record
          }
        } finally {
          IOUtils.closeStream(reader);
        }
      }
    }

Another feature of the program is that it displays the positions of the sync points in the sequence file.
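As an aside on the reflective instantiation used in Example 5-11, the idea of creating an object from a class that is only known at run time can be sketched in plain Java without Hadoop on the classpath. The class name ReflectiveInstanceSketch and its newInstance() helper are hypothetical stand-ins, not Hadoop's ReflectionUtils, and the sketch omits the Configuration argument that the Hadoop utility takes.

```java
// Minimal sketch, assuming a no-argument constructor is available on the target class.
// ReflectiveInstanceSketch and newInstance() are illustrative names, not Hadoop APIs.
class ReflectiveInstanceSketch {

    // Create an instance of a class that is only known at run time
    static <T> T newInstance(Class<T> cls) {
        try {
            return cls.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot instantiate " + cls.getName(), e);
        }
    }

    public static void main(String[] args) throws ClassNotFoundException {
        // Stands in for a class name read from a file header
        String keyClassName = "java.lang.StringBuilder";
        Class<?> keyClass = Class.forName(keyClassName);
        Object key = newInstance(keyClass);
        System.out.println(key.getClass().getName()); // prints "java.lang.StringBuilder"
    }
}
```

Because the concrete type is looked up rather than hard-coded, the same reading loop can serve files with different key and value classes, which is the property the text attributes to Example 5-11.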

A sync point is a point in the stream that can be used to resynchronize with a record boundary if the reader is “lost” (for example, after seeking to an arbitrary position in the stream). Sync points are recorded by SequenceFile.Writer, which inserts a special entry to mark the sync point every few records as a sequence file is being written. Such entries are small enough to incur only a modest storage overhead, less than 1%. Sync points always align with record boundaries.

Running the program in Example 5-11 shows the sync points in the sequence file as asterisks. The first one occurs at position 2021 (the second one occurs at position 4075, but is not shown in the output):

    % hadoop SequenceFileReadDemo numbers.seq
    [128]   100     One, two, buckle my shoe
    [173]   99      Three, four, shut the door
    [220]   98      Five, six, pick up sticks
    [264]   97      Seven, eight, lay them straight
    [314]   96      Nine, ten, a big fat hen
    [359]   95      One, two, buckle my shoe
    [404]   94      Three, four, shut the door
    [451]   93      Five, six, pick up sticks
    [495]   92      Seven, eight, lay them straight
    [545]   91      Nine, ten, a big fat hen
    [590]   90      One, two, buckle my shoe
    ...
    [1976]  60      One, two, buckle my shoe
    [2021*] 59      Three, four, shut the door
    [2088]  58      Five, six, pick up sticks
    [2132]  57      Seven, eight, lay them straight
    [2182]  56      Nine, ten, a big fat hen
    ...
    [4557]  5       One, two, buckle my shoe
    [4602]  4       Three, four, shut the door
    [4649]  3       Five, six, pick up sticks
    [4693]  2       Seven, eight, lay them straight
    [4743]  1       Nine, ten, a big fat hen

There are two ways to seek to a given position in a sequence file. The first is the seek() method, which positions the reader at the given point in the file. For example, seeking to a record boundary works as expected:

    reader.seek(359);
    assertThat(reader.next(key, value), is(true));
    assertThat(((IntWritable) key).get(), is(95));

But if the position in the file is not at a record boundary, the reader fails when the next() method is called:

    reader.seek(360);
    reader.next(key, value); // fails with IOException

The second way to find a record boundary makes use of sync points. The sync(long position) method on SequenceFile.Reader positions the reader at the next sync point after position. (If there are no sync points in the file after this position, then the reader will be positioned at the end of the file.) Thus, we can call sync() with any position in the stream, not necessarily a record boundary, and the reader will reestablish itself at the next sync point so reading can continue:

    reader.sync(360);
    assertThat(reader.getPosition(), is(2021L));
    assertThat(reader.next(key, value), is(true));
    assertThat(((IntWritable) key).get(), is(59));

SequenceFile.Writer has a method called sync() for inserting a sync point at the current position in the stream.
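To make the resynchronization idea concrete, here is a minimal, Hadoop-free sketch of a length-prefixed record stream with periodic sync entries. Everything in it is an assumption made for illustration: the class SyncPointSketch, the byte layout, the fixed marker, and the interval of three records are hypothetical and are not the real SequenceFile format or API. The toy sync() also accepts a sync entry that starts exactly at the given position.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Toy illustration only: this layout is an assumption, not the real SequenceFile format.
class SyncPointSketch {
    static final int SYNC_ESCAPE = -1; // written where a record length would normally go
    static final byte[] SYNC_MARKER =
        "sync-marker-0001".getBytes(StandardCharsets.US_ASCII); // 16 bytes
    static final int SYNC_INTERVAL = 3; // hypothetical: a sync entry before every third record

    // Serialize records as [int length][payload], inserting sync entries periodically
    static byte[] write(String[] records) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            for (int i = 0; i < records.length; i++) {
                if (i > 0 && i % SYNC_INTERVAL == 0) {
                    out.writeInt(SYNC_ESCAPE);
                    out.write(SYNC_MARKER);
                }
                byte[] payload = records[i].getBytes(StandardCharsets.UTF_8);
                out.writeInt(payload.length);
                out.write(payload);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Offset of the first sync entry at or after position, or data.length if there is none
    static int sync(byte[] data, int position) {
        int entrySize = 4 + SYNC_MARKER.length;
        for (int p = position; p + entrySize <= data.length; p++) {
            boolean escape = ByteBuffer.wrap(data, p, 4).getInt() == SYNC_ESCAPE;
            if (escape && Arrays.equals(
                    Arrays.copyOfRange(data, p + 4, p + entrySize), SYNC_MARKER)) {
                return p;
            }
        }
        return data.length;
    }

    // Read the record starting at offset, skipping a sync entry if one comes first
    static String readRecord(byte[] data, int offset) {
        ByteBuffer buf = ByteBuffer.wrap(data);
        buf.position(offset);
        int length = buf.getInt();
        if (length == SYNC_ESCAPE) {
            buf.position(buf.position() + SYNC_MARKER.length);
            length = buf.getInt();
        }
        byte[] payload = new byte[length];
        buf.get(payload); // throws BufferUnderflowException if offset was not a record boundary
        return new String(payload, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] data = write(new String[] {"r0", "r1", "r2", "r3", "r4", "r5", "r6"});
        int pos = sync(data, 1); // 1 is not a record boundary
        System.out.println(pos + " -> " + readRecord(data, pos)); // prints "18 -> r3"
    }
}
```

In this toy stream, reading from offset 1 misinterprets payload bytes as a length and fails, whereas scanning forward for the marker lands on a position from which records can be read again. This mirrors the contrast between seek(360) and sync(360) in the snippets above.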
