
The first two chapters in this part are about data formats. Chapter 12 looks at Avro, a cross-language data serialization library for Hadoop, and Chapter 13 covers Parquet, an efficient columnar storage format for nested data.

The next two chapters look at data ingestion, or how to get your data into Hadoop. Chapter 14 is about Flume, for high-volume ingestion of streaming data.

Chapter 15 is about Sqoop, for efficient bulk transfer of data between structured data stores (like relational databases) and HDFS.

The common theme of the next four chapters is data processing, and in particular using higher-level abstractions than MapReduce. Pig (Chapter 16) is a data flow language for exploring very large datasets. Hive (Chapter 17) is a data warehouse for managing data stored in HDFS and provides a query language based on SQL. Crunch (Chapter 18) is a high-level Java API for writing data processing pipelines that can run on MapReduce or Spark.

Spark (Chapter 19) is a cluster computing framework for large-scale data processing; it provides a directed acyclic graph (DAG) engine, and APIs in Scala, Java, and Python.

Chapter 20 is an introduction to HBase, a distributed column-oriented real-time database that uses HDFS for its underlying storage. And Chapter 21 is about ZooKeeper, a distributed, highly available coordination service that provides useful primitives for building distributed applications.

Finally, Part V is a collection of case studies contributed by people using Hadoop in interesting ways.

Supplementary information about Hadoop, such as how to install it on your machine, can be found in the appendixes.

Figure 1-1. Structure of the book: there are various pathways through the content

CHAPTER 2
MapReduce

MapReduce is a programming model for data processing. The model is simple, yet not too simple to express useful programs in. Hadoop can run MapReduce programs written in various languages; in this chapter, we look at the same program expressed in Java, Ruby, and Python.

Most importantly, MapReduce programs are inherently parallel, thus putting very large-scale data analysis into the hands of anyone with enough machines at their disposal. MapReduce comes into its own for large datasets, so let's start by looking at one.

A Weather Dataset

For our example, we will write a program that mines weather data. Weather sensors collect data every hour at many locations across the globe and gather a large volume of log data, which is a good candidate for analysis with MapReduce because we want to process all the data, and the data is semi-structured and record-oriented.

Data Format

The data we will use is from the National Climatic Data Center, or NCDC.

The data is stored using a line-oriented ASCII format, in which each line is a record. The format supports a rich set of meteorological elements, many of which are optional or with variable data lengths. For simplicity, we focus on the basic elements, such as temperature, which are always present and are of fixed width.

Example 2-1 shows a sample line with some of the salient fields annotated. The line has been split into multiple lines to show each field; in the real file, fields are packed into one line with no delimiters.

Example 2-1. Format of a National Climatic Data Center record

0057
332130   # USAF weather station identifier
99999    # WBAN weather station identifier
19500101 # observation date
0300     # observation time
4
+51317   # latitude (degrees x 1000)
+028783  # longitude (degrees x 1000)
FM-12
+0171    # elevation (meters)
99999
V020
320      # wind direction (degrees)
1        # quality code
N
0072
1
00450    # sky ceiling height (meters)
1        # quality code
C
N
010000   # visibility distance (meters)
1        # quality code
N
9
-0128    # air temperature (degrees Celsius x 10)
1        # quality code
-0139    # dew point temperature (degrees Celsius x 10)
1        # quality code
10268    # atmospheric pressure (hectopascals x 10)
1        # quality code

Datafiles are organized by date and weather station.

There is a directory for each year from 1901 to 2001, each containing a gzipped file for each weather station with its readings for that year. For example, here are the first entries for 1990:

% ls raw/1990 | head
010010-99999-1990.gz
010014-99999-1990.gz
010015-99999-1990.gz
010016-99999-1990.gz
010017-99999-1990.gz
010030-99999-1990.gz
010040-99999-1990.gz
010080-99999-1990.gz
010100-99999-1990.gz
010150-99999-1990.gz

There are tens of thousands of weather stations, so the whole dataset is made up of a large number of relatively small files.
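As an aside, the fixed-width layout in Example 2-1 means a record can be decoded by plain character slicing, with no parser needed. Here is a minimal sketch (ours, not the book's; parse_record is a hypothetical helper) using the same character positions that the awk script in Example 2-2 below uses for the temperature and quality fields:

def parse_record(line):
    """Slice the fields we care about out of one fixed-width NCDC record."""
    year = line[15:19]        # from the observation date field
    temp = int(line[87:92])   # air temperature, degrees Celsius x 10
    q = line[92:93]           # quality code for the temperature reading
    return year, temp, q

Note that Python's slices are 0-based, so line[87:92] corresponds to awk's 1-based substr($0, 88, 5).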

It's generally easier and more efficient to process a smaller number of relatively large files, so the data was preprocessed so that each year's readings were concatenated into a single file. (The means by which this was carried out is described in Appendix C.)

Analyzing the Data with Unix Tools

What's the highest recorded global temperature for each year in the dataset? We will answer this first without using Hadoop, as this information will provide a performance baseline and a useful means to check our results.

The classic tool for processing line-oriented data is awk.

Example 2-2 is a small script to calculate the maximum temperature for each year.

Example 2-2. A program for finding the maximum recorded temperature by year from NCDC weather records

#!/usr/bin/env bash
for year in all/*
do
  echo -ne `basename $year .gz`"\t"
  gunzip -c $year | \
    awk '{ temp = substr($0, 88, 5) + 0;
           q = substr($0, 93, 1);
           if (temp != 9999 && q ~ /[01459]/ && temp > max) max = temp }
         END { print max }'
done

The script loops through the compressed year files, first printing the year, and then processing each file using awk. The awk script extracts two fields from the data: the air temperature and the quality code.

The air temperature value is turned into an integer by adding 0. Next, a test is applied to see whether the temperature is valid (the value 9999 signifies a missing value in the NCDC dataset) and whether the quality code indicates that the reading is not suspect or erroneous. If the reading is OK, the value is compared with the maximum value seen so far, which is updated if a new maximum is found. The END block is executed after all the lines in the file have been processed, and it prints the maximum value.

Here is the beginning of a run:

% ./max_temperature.sh
1901    317
1902    244
1903    289
1904    256
1905    283
...

The temperature values in the source file are scaled by a factor of 10, so this works out as a maximum temperature of 31.7°C for 1901 (there were very few readings at the beginning of the century, so this is plausible).

The complete run for the century took 42 minutes in one run on a single EC2 High-CPU Extra Large instance.

To speed up the processing, we need to run parts of the program in parallel. In theory, this is straightforward: we could process different years in different processes, using all the available hardware threads on a machine. There are a few problems with this, however.

First, dividing the work into equal-size pieces isn't always easy or obvious.

In this case, the file size for different years varies widely, so some processes will finish much earlier than others. Even if they pick up further work, the whole run is dominated by the longest file. A better approach, although one that requires more work, is to split the input into fixed-size chunks and assign each chunk to a process.

Second, combining the results from independent processes may require further processing. In this case, the result for each year is independent of other years, and they may be combined by concatenating all the results and sorting by year.

If using the fixed-size chunk approach, the combination is more delicate. For this example, data for a particular year will typically be split into several chunks, each processed independently. We'll end up with the maximum temperature for each chunk, so the final step is to look for the highest of these maximums for each year (this split-and-merge pattern is sketched in code below).

Third, you are still limited by the processing capacity of a single machine. If the best time you can achieve is 20 minutes with the number of processors you have, then that's it. You can't make it go faster. Also, some datasets grow beyond the capacity of a single machine.
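To make the split-and-merge pattern concrete, here is a minimal sketch of the single-machine parallel approach (our illustration, not the book's; the all/*.gz filenames and field offsets are assumptions carried over from the examples above):

import glob
import gzip
from multiprocessing import Pool

def max_temps_in_chunk(path):
    """Return {year: max temperature x 10} for one gzipped chunk of records."""
    maxima = {}
    with gzip.open(path, "rt") as f:
        for line in f:
            year = line[15:19]
            temp = int(line[87:92])   # degrees Celsius x 10
            q = line[92:93]
            if temp != 9999 and q in "01459":
                maxima[year] = max(maxima.get(year, temp), temp)
    return maxima

def merge(per_chunk_maxima):
    """Combine per-chunk results: the highest of the chunk maxima wins."""
    result = {}
    for maxima in per_chunk_maxima:
        for year, temp in maxima.items():
            result[year] = max(result.get(year, temp), temp)
    return result

if __name__ == "__main__":
    with Pool() as pool:   # one worker process per core by default
        chunks = pool.map(max_temps_in_chunk, glob.glob("all/*.gz"))
    for year, temp in sorted(merge(chunks).items()):
        print(year, temp)

Even this small sketch shows where the bookkeeping creeps in: the merge step, the worker pool, and (not handled here) what happens when a worker fails.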

When we start using multiple machines, a whole host of other factors come into play, mainly falling into the categories of coordination and reliability. Who runs the overall job? How do we deal with failed processes?

So, although it's feasible to parallelize the processing, in practice it's messy. Using a framework like Hadoop to take care of these issues is a great help.

Analyzing the Data with Hadoop

To take advantage of the parallel processing that Hadoop provides, we need to express our query as a MapReduce job. After some local, small-scale testing, we will be able to run it on a cluster of machines.

Map and Reduce

MapReduce works by breaking the processing into two phases: the map phase and the reduce phase. Each phase has key-value pairs as input and output, the types of which may be chosen by the programmer.

The programmer also specifies two functions: the map function and the reduce function.

The input to our map phase is the raw NCDC data. We choose a text input format that gives us each line in the dataset as a text value. The key is the offset of the beginning of the line from the beginning of the file, but as we have no need for this, we ignore it.

Our map function is simple. We pull out the year and the air temperature, because these are the only fields we are interested in.

In this case, the map function is just a data preparation phase, setting up the data in such a way that the reduce function can do its work on it: finding the maximum temperature for each year. The map function is also a good place to drop bad records: here we filter out temperatures that are missing, suspect, or erroneous.

To visualize the way the map works, consider the following sample lines of input data (some unused columns have been dropped to fit the page, indicated by ellipses):

0067011990999991950051507004...9999999N9+00001+99999999999...
0043011990999991950051512004...9999999N9+00221+99999999999...
0043011990999991950051518004...9999999N9-00111+99999999999...
0043012650999991949032412004...0500001N9+01111+99999999999...
0043012650999991949032418004...0500001N9+00781+99999999999...

These lines are presented to the map function as the key-value pairs:

(0, 0067011990999991950051507004...9999999N9+00001+99999999999...)
(106, 0043011990999991950051512004...9999999N9+00221+99999999999...)
(212, 0043011990999991950051518004...9999999N9-00111+99999999999...)
(318, 0043012650999991949032412004...0500001N9+01111+99999999999...)
(424, 0043012650999991949032418004...0500001N9+00781+99999999999...)

The keys are the line offsets within the file, which we ignore in our map function.

The map function merely extracts the year and the air temperature, and emits them as its output (the temperature values have been interpreted as integers):

(1950, 0)
(1950, 22)
(1950, −11)
(1949, 111)
(1949, 78)

The output from the map function is processed by the MapReduce framework before being sent to the reduce function. This processing sorts and groups the key-value pairs by key. So, continuing the example, our reduce function sees the following input:

(1949, [111, 78])
(1950, [0, 22, −11])

Each year appears with a list of all its air temperature readings. All the reduce function has to do now is iterate through the list and pick up the maximum reading:

(1949, 111)
(1950, 22)

This is the final output: the maximum global temperature recorded in each year.

The whole data flow is illustrated in Figure 2-1.
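The chapter goes on to express this logic in Java and via Hadoop Streaming; as a quick illustration (our sketch, not the book's listing), the two functions plus a locally simulated sort-and-group step look like this in Python:

import sys
from itertools import groupby
from operator import itemgetter

def map_func(lines):
    """Emit (year, temperature) pairs, dropping missing or bad-quality readings."""
    for line in lines:
        year, temp, q = line[15:19], int(line[87:92]), line[92:93]
        if temp != 9999 and q in "01459":
            yield year, temp

def reduce_func(year, temps):
    """Pick the maximum reading from one year's list of temperatures."""
    return year, max(temps)

if __name__ == "__main__":
    # Simulate the framework's shuffle: sort by key, then group by key.
    pairs = sorted(map_func(sys.stdin), key=itemgetter(0))
    for year, group in groupby(pairs, key=itemgetter(0)):
        print(*reduce_func(year, (t for _, t in group)), sep="\t")

Run over the full-width records behind the five sample lines above, this prints 1949 111 and 1950 22 (tab-separated), matching the walkthrough.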
