Tom White - Hadoop: The Definitive Guide, 4th edition - 2015 (811394), page 6


Structured data is organized into entities that have a defined format, such as XML documents or database tables that conform to a particular predefined schema. This is the realm of the RDBMS. Semi-structured data, on the other hand, is looser, and though there may be a schema, it is often ignored, so it may be used only as a guide to the structure of the data: for example, a spreadsheet, in which the structure is the grid of cells, although the cells themselves may hold any form of data. Unstructured data does not have any particular internal structure: for example, plain text or image data.

Hadoop works well on unstructured or semi-structured data because it is designed to interpret the data at processing time (so-called schema-on-read). This provides flexibility and avoids the costly data loading phase of an RDBMS, since in Hadoop it is just a file copy.

Relational data is often normalized to retain its integrity and remove redundancy. Normalization poses problems for Hadoop processing because it makes reading a record a nonlocal operation, and one of the central assumptions that Hadoop makes is that it is possible to perform (high-speed) streaming reads and writes.

A web server log is a good example of a set of records that is not normalized (for example, the client hostnames are specified in full each time, even though the same client may appear many times), and this is one reason that logfiles of all kinds are particularly well suited to analysis with Hadoop.
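As a rough illustration of schema-on-read applied to an unnormalized web server log, the sketch below keeps the raw lines untouched and applies a structure only while reading them. The sample log lines, the regular expression, and the function names (`parse_line`, `requests_per_host`) are assumptions made for this example; it is plain Python, not Hadoop code.

```python
import re
from collections import Counter

# Raw, unnormalized log lines stored as-is (hypothetical sample data).
# Note that the full client hostname is repeated on every line.
RAW_LOG = """\
host-a.example.com - - [10/Oct/2014:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326
host-b.example.com - - [10/Oct/2014:13:55:40 +0000] "GET /about.html HTTP/1.1" 200 1042
host-a.example.com - - [10/Oct/2014:13:56:02 +0000] "GET /missing HTTP/1.1" 404 512
this line is malformed and matches no schema
"""

# The "schema" lives in the reader, not in the stored file.
LINE_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\d+)'
)


def parse_line(line):
    """Interpret one raw line at read time; return None if it does not fit."""
    match = LINE_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    record["size"] = int(record["size"])
    return record


def requests_per_host(raw_text):
    """Stream over the lines once and count requests per client host."""
    counts = Counter()
    for line in raw_text.splitlines():
        record = parse_line(line)
        if record is not None:  # lines that do not fit are simply skipped
            counts[record["host"]] += 1
    return counts


counts = requests_per_host(RAW_LOG)
print(dict(counts))
```

Because nothing is validated when the file is written, "loading" the data is just a copy, and a different reader could later impose a different structure on the same bytes.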

Note that Hadoop can perform joins; it’s just that they are not used as much as in the relational world.

MapReduce, and the other processing models in Hadoop, scales linearly with the size of the data. Data is partitioned, and the functional primitives (like map and reduce) can work in parallel on separate partitions. This means that if you double the size of the input data, a job will run twice as slowly. But if you also double the size of the cluster, a job will run as fast as the original one. This is not generally true of SQL queries.

Grid Computing

The high-performance computing (HPC) and grid computing communities have been doing large-scale data processing for years, using such application program interfaces (APIs) as the Message Passing Interface (MPI). Broadly, the approach in HPC is to distribute the work across a cluster of machines, which access a shared filesystem, hosted by a storage area network (SAN).

This works well for predominantly compute-intensive jobs, but it becomes a problem when nodes need to access larger data volumes (hundreds of gigabytes, the point at which Hadoop really starts to shine), since the network bandwidth is the bottleneck and compute nodes become idle.

Hadoop tries to co-locate the data with the compute nodes, so data access is fast because it is local.[6] This feature, known as data locality, is at the heart of data processing in Hadoop and is the reason for its good performance. Recognizing that network bandwidth is the most precious resource in a data center environment (it is easy to saturate network links by copying data around), Hadoop goes to great lengths to conserve it by explicitly modeling network topology. Notice that this arrangement does not preclude high-CPU analyses in Hadoop.

MPI gives great control to programmers, but it requires that they explicitly handle the mechanics of the data flow, exposed via low-level C routines and constructs such as sockets, as well as the higher-level algorithms for the analyses.

Processing in Hadoop operates only at the higher level: the programmer thinks in terms of the data model (such as key-value pairs for MapReduce), while the data flow remains implicit.

Coordinating the processes in a large-scale distributed computation is a challenge. The hardest aspect is gracefully handling partial failure (when you don’t know whether or not a remote process has failed) and still making progress with the overall computation. Distributed processing frameworks like MapReduce spare the programmer from having to think about failure, since the implementation detects failed tasks and reschedules replacements on machines that are healthy. MapReduce is able to do this because it is a shared-nothing architecture, meaning that tasks have no dependence on one another.

(This is a slight oversimplification, since the output from mappers is fed to the reducers, but this is under the control of the MapReduce system; in this case, it needs to take more care rerunning a failed reducer than rerunning a failed map, because it has to make sure it can retrieve the necessary map outputs and, if not, regenerate them by running the relevant maps again.) So from the programmer’s point of view, the order in which the tasks run doesn’t matter.
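The following toy sketch shows why independent tasks over key-value pairs can be rerun or reordered without changing the result. It runs in a single process, the failure is simulated, and every name in it (`map_task`, `reduce_task`, `run_with_retry`, `run_job`) is hypothetical; it is not the Hadoop API and does not reflect how Hadoop actually schedules work.

```python
from collections import defaultdict


def map_task(partition):
    """Map: emit a (host, 1) pair for every record in one partition."""
    return [(host, 1) for host in partition]


def reduce_task(key, values):
    """Reduce: sum the counts collected for one key."""
    return key, sum(values)


def run_with_retry(task, args, fail_first_attempt=False, max_attempts=3):
    """Run a task, rerunning it when an attempt fails (failure is simulated)."""
    for attempt in range(1, max_attempts + 1):
        try:
            if fail_first_attempt and attempt == 1:
                raise RuntimeError("simulated node failure")
            return task(*args)
        except RuntimeError:
            continue  # a real framework would reschedule on a healthy machine
    raise RuntimeError("task failed on every attempt")


def run_job(partitions, failing_partition=None):
    """Run map, shuffle, and reduce over the given partitions."""
    grouped = defaultdict(list)
    # Map phase: each partition is processed independently of the others.
    for index, partition in enumerate(partitions):
        pairs = run_with_retry(
            map_task, (partition,), fail_first_attempt=(index == failing_partition)
        )
        # Shuffle: group the emitted values by key for the reducers.
        for key, value in pairs:
            grouped[key].append(value)
    # Reduce phase: one independent task per key.
    return dict(
        run_with_retry(reduce_task, (key, values)) for key, values in grouped.items()
    )


partitions = [["a", "b", "a"], ["b", "c"], ["a"]]
clean = run_job(partitions)
with_failure = run_job(partitions, failing_partition=1)
reordered = run_job(list(reversed(partitions)))
print(clean, clean == with_failure == reordered)
```

The programmer writes only `map_task` and `reduce_task`; retrying and ordering are left to the surrounding machinery, which is the division of labor the text describes.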

By contrast, MPI programs have to explicitly manage their own checkpointing and recovery, which gives more control to the programmer but makes them more difficult to write.

[6] Jim Gray was an early advocate of putting the computation near the data. See “Distributed Computing Economics,” March 2003.

Volunteer Computing

When people first hear about Hadoop and MapReduce they often ask, “How is it different from SETI@home?” SETI, the Search for Extra-Terrestrial Intelligence, runs a project called SETI@home in which volunteers donate CPU time from their otherwise idle computers to analyze radio telescope data for signs of intelligent life outside Earth. SETI@home is the most well known of many volunteer computing projects; others include the Great Internet Mersenne Prime Search (to search for large prime numbers) and Folding@home (to understand protein folding and how it relates to disease).

Volunteer computing projects work by breaking the problems they are trying to solve into chunks called work units, which are sent to computers around the world to be analyzed.

For example, a SETI@home work unit is about 0.35 MB of radio telescope data, and takes hours or days to analyze on a typical home computer. When the analysis is completed, the results are sent back to the server, and the client gets another work unit. As a precaution to combat cheating, each work unit is sent to three different machines and needs at least two results to agree to be accepted.

Although SETI@home may be superficially similar to MapReduce (breaking a problem into independent pieces to be worked on in parallel), there are some significant differences. The SETI@home problem is very CPU-intensive, which makes it suitable for running on hundreds of thousands of computers across the world[7] because the time to transfer the work unit is dwarfed by the time to run the computation on it.

Volunteers are donating CPU cycles, not bandwidth.

MapReduce is designed to run jobs that last minutes or hours on trusted, dedicated hardware running in a single data center with very high aggregate bandwidth interconnects. By contrast, SETI@home runs a perpetual computation on untrusted machines on the Internet with highly variable connection speeds and no data locality.

[7] In January 2008, SETI@home was reported to be processing 300 gigabytes a day, using 320,000 computers (most of which are not dedicated to SETI@home; they are used for other things, too).

A Brief History of Apache Hadoop

Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library.

Hadoop has its origins in Apache Nutch, an open source web search engine, itself a part of the Lucene project.

The Origin of the Name “Hadoop”

The name Hadoop is not an acronym; it’s a made-up name. The project’s creator, Doug Cutting, explains how the name came about:

The name my kid gave a stuffed yellow elephant. Short, relatively easy to spell and pronounce, meaningless, and not used elsewhere: those are my naming criteria.

Kids are good at generating such. Googol is a kid’s term.

Projects in the Hadoop ecosystem also tend to have names that are unrelated to their function, often with an elephant or other animal theme (“Pig,” for example). Smaller components are given more descriptive (and therefore more mundane) names. This is a good principle, as it means you can generally work out what something does from its name.

For example, the namenode[8] manages the filesystem namespace.

Building a web search engine from scratch was an ambitious goal, for not only is the software required to crawl and index websites complex to write, but it is also a challenge to run without a dedicated operations team, since there are so many moving parts. It’s expensive, too: Mike Cafarella and Doug Cutting estimated a system supporting a one-billion-page index would cost around $500,000 in hardware, with a monthly running cost of $30,000.[9] Nevertheless, they believed it was a worthy goal, as it would open up and ultimately democratize search engine algorithms.

Nutch was started in 2002, and a working crawler and search system quickly emerged. However, its creators realized that their architecture wouldn’t scale to the billions of pages on the Web.

