Tom White - Hadoop The Definitive Guide_ 4 edition - 2015 (811394), page 12


Text from the file (page 12)

Alternatively, you could use “pull”-style processing in the new MapReduce API; see Appendix D.

38 | Chapter 2: MapReduce

      if last_key && last_key != key
        puts "#{last_key}\t#{max_val}"
        last_key, max_val = key, val.to_i
      else
        last_key, max_val = key, [max_val, val.to_i].max
      end
    end
    puts "#{last_key}\t#{max_val}" if last_key

Again, the program iterates over lines from standard input, but this time we have to store some state as we process each key group. In this case, the keys are the years, and we store the last key seen and the maximum temperature seen so far for that key. The MapReduce framework ensures that the keys are ordered, so we know that if a key is different from the previous one, we have moved into a new key group. In contrast to the Java API, where you are provided an iterator over each key group, in Streaming you have to find key group boundaries in your program.

For each line, we pull out the key and value. Then, if we’ve just finished a group (last_key && last_key != key), we write the key and the maximum temperature for that group, separated by a tab character, before resetting the maximum temperature for the new key. If we haven’t just finished a group, we just update the maximum temperature for the current key.

The last line of the program ensures that a line is written for the last key group in the input.

We can now simulate the whole MapReduce pipeline with a Unix pipeline (which is equivalent to the Unix pipeline shown in Figure 2-1):

    % cat input/ncdc/sample.txt | \
      ch02-mr-intro/src/main/ruby/max_temperature_map.rb | \
      sort | ch02-mr-intro/src/main/ruby/max_temperature_reduce.rb
    1949    111
    1950    22

The output is the same as that of the Java program, so the next step is to run it using Hadoop itself.

The hadoop command doesn’t support a Streaming option; instead, you specify the Streaming JAR file along with the jar option.
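The reduce logic and the local map, sort, reduce simulation described above can also be sketched in a single Python 3 process. This is an illustrative sketch rather than code from the book: the function names (make_record, map_lines, reduce_sorted, run_pipeline) and the synthetic fixed-width records are assumptions made for the example, while the field offsets follow the book's Python map script (Example 2-9). It uses itertools.groupby to locate key group boundaries in the sorted stream, which plays the role of the last_key comparison in the Streaming scripts.

```python
import re
from itertools import groupby


def make_record(year, temp, quality):
    """Build a synthetic fixed-width line (hypothetical helper, not real NCDC data)."""
    chars = ["0"] * 93
    chars[15:19] = year      # four-character year
    chars[87:92] = temp      # signed five-character temperature, e.g. "+0022"
    chars[92:93] = quality   # one-character quality code
    return "".join(chars)


def map_lines(lines):
    """Map step: emit (year, temperature) for readings that pass the filters."""
    for line in lines:
        val = line.strip()
        year, temp, q = val[15:19], val[87:92], val[92:93]
        if temp != "+9999" and re.match("[01459]", q):
            yield year, int(temp)


def reduce_sorted(pairs):
    """Reduce step: input must be sorted by key so each key forms one group."""
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield key, max(value for _, value in group)


def run_pipeline(lines):
    """Chain map, sort by key, and reduce, mirroring `map | sort | reduce`."""
    mapped = sorted(map_lines(lines), key=lambda kv: kv[0])
    return list(reduce_sorted(mapped))


if __name__ == "__main__":
    sample = [
        make_record("1950", "+0022", "1"),
        make_record("1949", "+0111", "1"),
        make_record("1950", "-0011", "1"),
    ]
    for year, max_temp in run_pipeline(sample):
        print(f"{year}\t{max_temp}")
```

Because groupby only merges adjacent equal keys, the explicit sort is what makes the grouping correct, just as the Streaming reducer relies on the framework delivering keys in order.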

Options to the Streaming program specify the input and output paths and the map and reduce scripts. This is what it looks like:

    % hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
      -input input/ncdc/sample.txt \
      -output output \
      -mapper ch02-mr-intro/src/main/ruby/max_temperature_map.rb \
      -reducer ch02-mr-intro/src/main/ruby/max_temperature_reduce.rb

Hadoop Streaming | 39

When running on a large dataset on a cluster, we should use the -combiner option to set the combiner:

    % hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
      -files ch02-mr-intro/src/main/ruby/max_temperature_map.rb,\
    ch02-mr-intro/src/main/ruby/max_temperature_reduce.rb \
      -input input/ncdc/all \
      -output output \
      -mapper ch02-mr-intro/src/main/ruby/max_temperature_map.rb \
      -combiner ch02-mr-intro/src/main/ruby/max_temperature_reduce.rb \
      -reducer ch02-mr-intro/src/main/ruby/max_temperature_reduce.rb

Note also the use of -files, which we use when running Streaming programs on the cluster to ship the scripts to the cluster.

Python

Streaming supports any programming language that can read from standard input and write to standard output, so for readers more familiar with Python, here’s the same example again.[5] The map script is in Example 2-9, and the reduce script is in Example 2-10.

Example 2-9. Map function for maximum temperature in Python

    #!/usr/bin/env python

    import re
    import sys

    for line in sys.stdin:
        val = line.strip()
        (year, temp, q) = (val[15:19], val[87:92], val[92:93])
        if (temp != "+9999" and re.match("[01459]", q)):
            print "%s\t%s" % (year, temp)

Example 2-10. Reduce function for maximum temperature in Python

    #!/usr/bin/env python

    import sys

    (last_key, max_val) = (None, -sys.maxint)
    for line in sys.stdin:
        (key, val) = line.strip().split("\t")
        if last_key and last_key != key:
            print "%s\t%s" % (last_key, max_val)
            (last_key, max_val) = (key, int(val))
        else:
            (last_key, max_val) = (key, max(max_val, int(val)))

    if last_key:
        print "%s\t%s" % (last_key, max_val)

[5] As an alternative to Streaming, Python programmers should consider Dumbo, which makes the Streaming MapReduce interface more Pythonic and easier to use.

We can test the programs and run the job in the same way we did in Ruby. For example, to run a test:

    % cat input/ncdc/sample.txt | \
      ch02-mr-intro/src/main/python/max_temperature_map.py | \
      sort | ch02-mr-intro/src/main/python/max_temperature_reduce.py
    1949    111
    1950    22

CHAPTER 3
The Hadoop Distributed Filesystem

When a dataset outgrows the storage capacity of a single physical machine, it becomes necessary to partition it across a number of separate machines. Filesystems that manage the storage across a network of machines are called distributed filesystems.

Since they are network based, all the complications of network programming kick in, thus making distributed filesystems more complex than regular disk filesystems. For example, one of the biggest challenges is making the filesystem tolerate node failure without suffering data loss.

Hadoop comes with a distributed filesystem called HDFS, which stands for Hadoop Distributed Filesystem. (You may sometimes see references to “DFS”, informally or in older documentation or configurations, which is the same thing.) HDFS is Hadoop’s flagship filesystem and is the focus of this chapter, but Hadoop actually has a general-purpose filesystem abstraction, so we’ll see along the way how Hadoop integrates with other storage systems (such as the local filesystem and Amazon S3).

The Design of HDFS

HDFS is a filesystem designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.[1] Let’s examine this statement in more detail:

Very large files
    “Very large” in this context means files that are hundreds of megabytes, gigabytes, or terabytes in size. There are Hadoop clusters running today that store petabytes of data.[2]

Streaming data access
    HDFS is built around the idea that the most efficient data processing pattern is a write-once, read-many-times pattern. A dataset is typically generated or copied from source, and then various analyses are performed on that dataset over time. Each analysis will involve a large proportion, if not all, of the dataset, so the time to read the whole dataset is more important than the latency in reading the first record.

Commodity hardware
    Hadoop doesn’t require expensive, highly reliable hardware. It’s designed to run on clusters of commodity hardware (commonly available hardware that can be obtained from multiple vendors)[3] for which the chance of node failure across the cluster is high, at least for large clusters. HDFS is designed to carry on working without a noticeable interruption to the user in the face of such failure.

It is also worth examining the applications for which using HDFS does not work so well. Although this may change in the future, these are areas where HDFS is not a good fit today:

Low-latency data access
    Applications that require low-latency access to data, in the tens of milliseconds range, will not work well with HDFS. Remember, HDFS is optimized for delivering a high throughput of data, and this may be at the expense of latency. HBase (see Chapter 20) is currently a better choice for low-latency access.

Lots of small files
    Because the namenode holds filesystem metadata in memory, the limit to the number of files in a filesystem is governed by the amount of memory on the namenode. As a rule of thumb, each file, directory, and block takes about 150 bytes. So, for example, if you had one million files, each taking one block, you would need at least 300 MB of memory. Although storing millions of files is feasible, billions is beyond the capability of current hardware.[4]

Multiple writers, arbitrary file modifications
    Files in HDFS may be written to by a single writer. Writes are always made at the end of the file, in append-only fashion.

[1] The architecture of HDFS is described in Robert Chansler et al.’s, “The Hadoop Distributed File System,” which appeared in The Architecture of Open Source Applications: Elegance, Evolution, and a Few Fearless Hacks by Amy Brown and Greg Wilson (eds.).
[2] See Konstantin V. Shvachko and Arun C. Murthy, “Scaling Hadoop to 4000 nodes at Yahoo!”, September 30, 2008.
[3] See Chapter 10 for a typical machine specification.
[4] For an exposition of the scalability limits of HDFS, see Konstantin V. Shvachko, “HDFS Scalability: The Limits to Growth”, April 2010.

