Tom White - Hadoop The Definitive Guide_ 4 edition - 2015 (811394), page 5


Mashups between different information sources make for unexpected and hitherto unimaginable applications.

Take, for example, the Astrometry.net project, which watches the Astrometry group on Flickr for new photos of the night sky. It analyzes each image and identifies which part of the sky it is from, as well as any interesting celestial bodies, such as stars or galaxies. This project shows the kinds of things that are possible when data (in this case, tagged photographic images) is made available and used for something (image analysis) that was not anticipated by the creator.

It has been said that “more data usually beats better algorithms,” which is to say that for some problems (such as recommending movies or music based on past preferences), however fiendish your algorithms, often they can be beaten simply by having more data (and a less sophisticated algorithm).[3]

The good news is that big data is here.

The bad news is that we are struggling to store and analyze it.

Data Storage and Analysis

The problem is simple: although the storage capacities of hard drives have increased massively over the years, access speeds (the rate at which data can be read from drives) have not kept up. One typical drive from 1990 could store 1,370 MB of data and had a transfer speed of 4.4 MB/s,[4] so you could read all the data from a full drive in around five minutes. Over 20 years later, 1-terabyte drives are the norm, but the transfer speed is around 100 MB/s, so it takes more than two and a half hours to read all the data off the disk.

This is a long time to read all data on a single drive, and writing is even slower.
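The read times quoted above follow from dividing capacity by transfer rate. The following is a minimal sketch of that arithmetic, assuming 1 terabyte is taken as 1,000,000 MB and ignoring seeks and any other overhead; the function name `full_read_seconds` is purely illustrative.

```python
def full_read_seconds(capacity_mb: float, transfer_mb_per_s: float) -> float:
    """Time to stream an entire drive once, ignoring seeks and other overhead."""
    return capacity_mb / transfer_mb_per_s


# 1990 drive from the text: 1,370 MB at 4.4 MB/s.
t_1990 = full_read_seconds(1_370, 4.4)

# Later drive from the text: 1 TB at about 100 MB/s.
# Assumption: 1 TB is treated as 1,000,000 MB.
t_1tb = full_read_seconds(1_000_000, 100)

print(f"1990 drive: {t_1990 / 60:.1f} minutes")
print(f"1 TB drive: {t_1tb / 3600:.2f} hours")
```

Capacity in this comparison grew by a factor of roughly 700, while the transfer rate grew by a factor of roughly 20, which is why the full-drive read time went up rather than down.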

The obvious way to reduce the time is to read from multiple disks at once. Imagine if we had 100 drives, each holding one hundredth of the data. Working in parallel, we could read the data in under two minutes.

Using only one hundredth of a disk may seem wasteful. But we can store 100 datasets, each of which is 1 terabyte, and provide shared access to them. We can imagine that the users of such a system would be happy to share access in return for shorter analysis times, and statistically, that their analysis jobs would be likely to be spread over time, so they wouldn’t interfere with each other too much.

There’s more to being able to read and write data in parallel to or from multiple disks, though.

The first problem to solve is hardware failure: as soon as you start using many pieces of hardware, the chance that one will fail is fairly high.
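Both the 100-drive estimate and the failure risk can be checked with a few lines of arithmetic. This is a rough sketch, not a model of any real cluster: it assumes the data is split evenly, that all drives stream concurrently at 100 MB/s, and that drives fail independently. The 3% per-drive failure probability is a made-up illustrative figure, not a number from the text.

```python
def parallel_read_seconds(dataset_mb: float, drives: int, transfer_mb_per_s: float) -> float:
    """Each drive holds an equal slice and all drives stream at the same time."""
    return (dataset_mb / drives) / transfer_mb_per_s


def prob_any_failure(drives: int, p_single: float) -> float:
    """Probability that at least one of `drives` independent drives fails."""
    return 1 - (1 - p_single) ** drives


# 1 TB (taken as 1,000,000 MB) spread over 100 drives at 100 MB/s each.
t_parallel = parallel_read_seconds(1_000_000, 100, 100)
print(f"Parallel read: {t_parallel:.0f} seconds")

# Hypothetical per-drive failure probability over some period (assumption).
P_SINGLE = 0.03
print(f"One drive:  {prob_any_failure(1, P_SINGLE):.2f}")
print(f"100 drives: {prob_any_failure(100, P_SINGLE):.2f}")
```

The same multiplication of hardware that shortens the read also raises the chance that some component fails during the period considered.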

A common way of avoiding data loss is through replication: redundant copies of the data are kept by the system so that in the event of failure, there is another copy available. This is how RAID works, for instance, although Hadoop’s filesystem, the Hadoop Distributed Filesystem (HDFS), takes a slightly different approach, as you shall see later.

3. The quote is from Anand Rajaraman’s blog post “More data usually beats better algorithms,” in which he writes about the Netflix Challenge. Alon Halevy, Peter Norvig, and Fernando Pereira make the same point in “The Unreasonable Effectiveness of Data,” IEEE Intelligent Systems, March/April 2009.

4. These specifications are for the Seagate ST-41600n.

The second problem is that most analysis tasks need to be able to combine the data in some way, and data read from one disk may need to be combined with data from any of the other 99 disks.

Various distributed systems allow data to be combined from multiple sources, but doing this correctly is notoriously challenging. MapReduce provides a programming model that abstracts the problem from disk reads and writes, transforming it into a computation over sets of keys and values. We look at the details of this model in later chapters, but the important point for the present discussion is that there are two parts to the computation (the map and the reduce), and it’s the interface between the two where the “mixing” occurs.
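To make the shape of that model concrete, here is a small in-memory sketch in plain Python. It is not Hadoop's API and runs on a single machine; the names `map_words`, `shuffle`, `reduce_counts`, and `run_job` are hypothetical, and word counting is chosen only as a familiar example. The `shuffle` step stands in for the interface between map and reduce where values for the same key are brought together.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_words(line: str) -> Iterator[Tuple[str, int]]:
    """Map: turn one input record into zero or more (key, value) pairs."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all values by key; this is where data from different records mixes."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def run_job(records: Iterable[str]) -> Dict[str, int]:
    mapped = (pair for record in records for pair in map_words(record))
    grouped = shuffle(mapped)
    return dict(reduce_counts(key, values) for key, values in grouped.items())


result = run_job(["big data beats big algorithms", "more data"])
print(result)
```

Because each call to `map_words` sees only one record and each call to `reduce_counts` sees only one key, both phases could in principle run on many machines at once; only the grouping step needs to move data between them.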

Like HDFS, MapReduce has built-in reliability.

In a nutshell, this is what Hadoop provides: a reliable, scalable platform for storage and analysis. What’s more, because it runs on commodity hardware and is open source, Hadoop is affordable.

Querying All Your Data

The approach taken by MapReduce may seem like a brute-force approach. The premise is that the entire dataset (or at least a good portion of it) can be processed for each query. But this is its power. MapReduce is a batch query processor, and the ability to run an ad hoc query against your whole dataset and get the results in a reasonable time is transformative.

It changes the way you think about data and unlocks data that was previously archived on tape or disk. It gives people the opportunity to innovate with data. Questions that took too long to get answered before can now be answered, which in turn leads to new questions and new insights.

For example, Mailtrust, Rackspace’s mail division, used Hadoop for processing email logs.

One ad hoc query they wrote was to find the geographic distribution of their users. In their words:

“This data was so useful that we’ve scheduled the MapReduce job to run monthly and we will be using this data to help us decide which Rackspace data centers to place new mail servers in as we grow.”

By bringing several hundred gigabytes of data together and having the tools to analyze it, the Rackspace engineers were able to gain an understanding of the data that they otherwise would never have had, and furthermore, they were able to use what they had learned to improve the service for their customers.

Beyond Batch

For all its strengths, MapReduce is fundamentally a batch processing system, and is not suitable for interactive analysis.

You can’t run a query and get results back in a few seconds or less. Queries typically take minutes or more, so it’s best for offline use, where there isn’t a human sitting in the processing loop waiting for results.

However, since its original incarnation, Hadoop has evolved beyond batch processing. Indeed, the term “Hadoop” is sometimes used to refer to a larger ecosystem of projects, not just HDFS and MapReduce, that fall under the umbrella of infrastructure for distributed computing and large-scale data processing.

Many of these are hosted by the Apache Software Foundation, which provides support for a community of open source software projects, including the original HTTP Server from which it gets its name.

The first component to provide online access was HBase, a key-value store that uses HDFS for its underlying storage. HBase provides both online read/write access of individual rows and batch operations for reading and writing data in bulk, making it a good solution for building applications on.

The real enabler for new processing models in Hadoop was the introduction of YARN (which stands for Yet Another Resource Negotiator) in Hadoop 2.

YARN is a cluster resource management system, which allows any distributed program (not just MapReduce) to run on data in a Hadoop cluster.

In the last few years, there has been a flowering of different processing patterns that work with Hadoop. Here is a sample:

Interactive SQL
By dispensing with MapReduce and using a distributed query engine that uses dedicated “always on” daemons (like Impala) or container reuse (like Hive on Tez), it’s possible to achieve low-latency responses for SQL queries on Hadoop while still scaling up to large dataset sizes.

Iterative processing
Many algorithms, such as those in machine learning, are iterative in nature, so it’s much more efficient to hold each intermediate working set in memory, compared to loading from disk on each iteration. The architecture of MapReduce does not allow this, but it’s straightforward with Spark, for example, and it enables a highly exploratory style of working with datasets.

Stream processing
Streaming systems like Storm, Spark Streaming, or Samza make it possible to run real-time, distributed computations on unbounded streams of data and emit results to Hadoop storage or external systems.

Search
The Solr search platform can run on a Hadoop cluster, indexing documents as they are added to HDFS, and serving search queries from indexes stored in HDFS.

Despite the emergence of different processing frameworks on Hadoop, MapReduce still has a place for batch processing, and it is useful to understand how it works since it introduces several concepts that apply more generally (like the idea of input formats, or how a dataset is split into pieces).

Comparison with Other Systems

Hadoop isn’t the first distributed system for data storage and analysis, but it has some unique properties that set it apart from other systems that may seem similar.

Here we look at some of them.

Relational Database Management Systems

Why can’t we use databases with lots of disks to do large-scale analysis? Why is Hadoop needed?

The answer to these questions comes from another trend in disk drives: seek time is improving more slowly than transfer rate. Seeking is the process of moving the disk’s head to a particular place on the disk to read or write data. It characterizes the latency of a disk operation, whereas the transfer rate corresponds to a disk’s bandwidth.

If the data access pattern is dominated by seeks, it will take longer to read or write large portions of the dataset than streaming through it, which operates at the transfer rate. On the other hand, for updating a small proportion of records in a database, a traditional B-Tree (the data structure used in relational databases, which is limited by the rate at which it can perform seeks) works well.
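The trade-off between seek-bound and transfer-bound access can be illustrated with a back-of-the-envelope model. Every figure below is an assumption chosen for illustration rather than a value from the text, apart from the 100 MB/s transfer rate quoted earlier: a 10 ms seek, a 1 TB dataset (taken as 1,000,000 MB), 100-byte records, one seek per updated record, and a full rewrite that reads and writes the dataset once each.

```python
# Illustrative assumptions, not measurements.
SEEK_S = 0.010               # assumed average seek time per record update
TRANSFER_MB_PER_S = 100      # sequential transfer rate
DATASET_MB = 1_000_000       # 1 TB taken as 1,000,000 MB
RECORD_BYTES = 100           # assumed record size

TOTAL_RECORDS = DATASET_MB * 1_000_000 // RECORD_BYTES


def seek_update_seconds(fraction: float) -> float:
    """Update a fraction of the records in place, paying one seek per record."""
    return fraction * TOTAL_RECORDS * SEEK_S


def streaming_rewrite_seconds() -> float:
    """Rewrite everything: stream the dataset in once and back out once."""
    return 2 * DATASET_MB / TRANSFER_MB_PER_S


def break_even_fraction() -> float:
    """Fraction of records at which both approaches take the same time."""
    return streaming_rewrite_seconds() / (TOTAL_RECORDS * SEEK_S)


for fraction in (0.000001, 0.0001, 0.01):
    hours = seek_update_seconds(fraction) / 3600
    print(f"update {fraction:.4%} of records by seeking: {hours:.2f} hours")

print(f"full streaming rewrite: {streaming_rewrite_seconds() / 3600:.2f} hours")
print(f"break-even fraction: {break_even_fraction():.4%}")
```

Under these assumptions, in-place updates win only while a very small share of the records changes; beyond the break-even point, streaming through the whole dataset is faster. A real B-Tree benefits from caching and batching, so the exact crossover will differ.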

For updating the majority of a database, a B-Tree is less efficient than MapReduce, which uses Sort/Merge to rebuild the database.

In many ways, MapReduce can be seen as a complement to a Relational Database Management System (RDBMS). (The differences between the two systems are shown in Table 1-1.) MapReduce is a good fit for problems that need to analyze the whole dataset in a batch fashion, particularly for ad hoc analysis. An RDBMS is good for point queries or updates, where the dataset has been indexed to deliver low-latency retrieval and update times of a relatively small amount of data. MapReduce suits applications where the data is written once and read many times, whereas a relational database is good for datasets that are continually updated.[5]

Table 1-1. RDBMS compared to MapReduce

              Traditional RDBMS          MapReduce
Data size     Gigabytes                  Petabytes
Access        Interactive and batch      Batch
Updates       Read and write many times  Write once, read many times
Transactions  ACID                       None
Structure     Schema-on-write            Schema-on-read
Integrity     High                       Low
Scaling       Nonlinear                  Linear

5. In January 2007, David J. DeWitt and Michael Stonebraker caused a stir by publishing “MapReduce: A major step backwards,” in which they criticized MapReduce for being a poor substitute for relational databases. Many commentators argued that it was a false comparison (see, for example, Mark C. Chu-Carroll’s “Databases are hammers; MapReduce is a screwdriver”), and DeWitt and Stonebraker followed up with “MapReduce II,” where they addressed the main topics brought up by others.

However, the differences between relational databases and Hadoop systems are blurring. Relational databases have started incorporating some of the ideas from Hadoop, and from the other direction, Hadoop systems such as Hive are becoming more interactive (by moving away from MapReduce) and adding features like indexes and transactions that make them look more and more like traditional RDBMSs.

Another difference between Hadoop and an RDBMS is the amount of structure in the datasets on which they operate.

