Tom White - Hadoop: The Definitive Guide, 4th Edition (2015)
Chapter 3: The Hadoop Distributed Filesystem (excerpt)
On large clusters with many files and blocks, the time it takes for a namenode to start from cold can be 30 minutes or more.

The long recovery time is a problem for routine maintenance, too. In fact, because unexpected failure of the namenode is so rare, the case for planned downtime is actually more important in practice.

Hadoop 2 remedied this situation by adding support for HDFS high availability (HA). In this implementation, there are a pair of namenodes in an active-standby configuration.

In the event of the failure of the active namenode, the standby takes over its duties to continue servicing client requests without a significant interruption. A few architectural changes are needed to allow this to happen:

• The namenodes must use highly available shared storage to share the edit log. When a standby namenode comes up, it reads up to the end of the shared edit log to synchronize its state with the active namenode, and then continues to read new entries as they are written by the active namenode.
• Datanodes must send block reports to both namenodes because the block mappings are stored in a namenode's memory, and not on disk.
• Clients must be configured to handle namenode failover, using a mechanism that is transparent to users.
• The secondary namenode's role is subsumed by the standby, which takes periodic checkpoints of the active namenode's namespace.

There are two choices for the highly available shared storage: an NFS filer, or a quorum journal manager (QJM).

The QJM is a dedicated HDFS implementation, designed for the sole purpose of providing a highly available edit log, and is the recommended choice for most HDFS installations. The QJM runs as a group of journal nodes, and each edit must be written to a majority of the journal nodes. Typically, there are three journal nodes, so the system can tolerate the loss of one of them.
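The majority-write rule is just arithmetic, and a tiny sketch makes the trade-off concrete (this is illustrative Python, not QJM code): with n journal nodes, an edit is durable once a majority acknowledge it, so the number of tolerable failures grows only at every odd n.

```python
def quorum_size(n):
    """Smallest majority of n journal nodes: an edit is durable
    once this many nodes have acknowledged the write."""
    return n // 2 + 1

def tolerated_failures(n):
    """How many journal nodes can be lost while a majority remains."""
    return n - quorum_size(n)

# Three journal nodes (the typical deployment) tolerate one failure;
# five tolerate two. Note that adding a fourth node does NOT raise
# the failure tolerance, which is why odd cluster sizes are used.
for n in (3, 4, 5):
    print(n, "nodes:", "majority", quorum_size(n),
          "tolerates", tolerated_failures(n), "failure(s)")
```

This is the same reasoning behind the "three journal nodes, tolerate the loss of one" figure in the text.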

This arrangement is similar to the way ZooKeeper works, although it is important to realize that the QJM implementation does not use ZooKeeper. (Note, however, that HDFS HA does use ZooKeeper for electing the active namenode, as explained in the next section.)

If the active namenode fails, the standby can take over very quickly (in a few tens of seconds) because it has the latest state available in memory: both the latest edit log entries and an up-to-date block mapping. The actual observed failover time will be longer in practice (around a minute or so), because the system needs to be conservative in deciding that the active namenode has failed.

In the unlikely event of the standby being down when the active fails, the administrator can still start the standby from cold.

This is no worse than the non-HA case, and from an operational point of view it's an improvement, because the process is a standard operational procedure built into Hadoop.

Failover and fencing

The transition from the active namenode to the standby is managed by a new entity in the system called the failover controller. There are various failover controllers, but the default implementation uses ZooKeeper to ensure that only one namenode is active. Each namenode runs a lightweight failover controller process whose job it is to monitor its namenode for failures (using a simple heartbeating mechanism) and trigger a failover should a namenode fail.

Failover may also be initiated manually by an administrator, for example, in the case of routine maintenance.

This is known as a graceful failover, since the failover controller arranges an orderly transition for both namenodes to switch roles.

In the case of an ungraceful failover, however, it is impossible to be sure that the failed namenode has stopped running. For example, a slow network or a network partition can trigger a failover transition, even though the previously active namenode is still running and thinks it is still the active namenode. The HA implementation goes to great lengths to ensure that the previously active namenode is prevented from doing any damage and causing corruption—a method known as fencing.

The QJM only allows one namenode to write to the edit log at one time; however, it is still possible for the previously active namenode to serve stale read requests to clients, so setting up an SSH fencing command that will kill the namenode's process is a good idea.

Stronger fencing methods are required when using an NFS filer for the shared edit log, since it is not possible to only allow one namenode to write at a time (this is why QJM is recommended). The range of fencing mechanisms includes revoking the namenode's access to the shared storage directory (typically by using a vendor-specific NFS command), and disabling its network port via a remote management command. As a last resort, the previously active namenode can be fenced with a technique rather graphically known as STONITH, or "shoot the other node in the head," which uses a specialized power distribution unit to forcibly power down the host machine.

Client failover is handled transparently by the client library. The simplest implementation uses client-side configuration to control failover.
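As a sketch of what this client-side configuration and the SSH fencing setup look like in practice, the fragment below uses property names from stock Hadoop HDFS HA; the nameservice name `mycluster`, the hostnames, and the key path are placeholders, not values from this chapter.

```xml
<!-- hdfs-site.xml (fragment, illustrative values) -->
<!-- A logical nameservice backed by two namenodes -->
<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>
<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn1</name>
  <value>namenode1.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn2</name>
  <value>namenode2.example.com:8020</value>
</property>
<!-- Clients resolve the logical name and try each namenode in turn -->
<property>
  <name>dfs.client.failover.proxy.provider.mycluster</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
<!-- Fencing: SSH into the old active's host and kill the process -->
<property>
  <name>dfs.ha.fencing.methods</name>
  <value>sshfence</value>
</property>
<property>
  <name>dfs.ha.fencing.ssh.private-key-files</name>
  <value>/home/hdfs/.ssh/id_rsa</value>
</property>
```

With this in place, clients address the filesystem as hdfs://mycluster rather than naming either physical namenode.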

The HDFS URI uses a logical hostname that is mapped to a pair of namenode addresses (in the configuration file), and the client library tries each namenode address until the operation succeeds.

The Command-Line Interface

We're going to have a look at HDFS by interacting with it from the command line. There are many other interfaces to HDFS, but the command line is one of the simplest and, to many developers, the most familiar.

We are going to run HDFS on one machine, so first follow the instructions for setting up Hadoop in pseudodistributed mode in Appendix A. Later we'll see how to run HDFS on a cluster of machines to give us scalability and fault tolerance.

There are two properties that we set in the pseudodistributed configuration that deserve further explanation. The first is fs.defaultFS, set to hdfs://localhost/, which is used to set a default filesystem for Hadoop.[5] Filesystems are specified by a URI, and here we have used an hdfs URI to configure Hadoop to use HDFS by default.
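In pseudodistributed mode this property lives in core-site.xml; a minimal sketch of the relevant fragment (the full file is set up by the instructions in Appendix A) is:

```xml
<?xml version="1.0"?>
<!-- core-site.xml: make HDFS on localhost the default filesystem -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost/</value>
  </property>
</configuration>
```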

The HDFS daemons will use this property to determine the host and port for the HDFS namenode. We'll be running it on localhost, on the default HDFS port, 8020. And HDFS clients will use this property to work out where the namenode is running so they can connect to it.

We set the second property, dfs.replication, to 1 so that HDFS doesn't replicate filesystem blocks by the default factor of three. When running with a single datanode, HDFS can't replicate blocks to three datanodes, so it would perpetually warn about blocks being under-replicated. This setting solves that problem.

Basic Filesystem Operations

The filesystem is ready to be used, and we can do all of the usual filesystem operations, such as reading files, creating directories, moving files, deleting data, and listing directories.

You can type hadoop fs -help to get detailed help on every command.

Start by copying a file from the local filesystem to HDFS:

% hadoop fs -copyFromLocal input/docs/quangle.txt \
  hdfs://localhost/user/tom/quangle.txt

This command invokes Hadoop's filesystem shell command fs, which supports a number of subcommands—in this case, we are running -copyFromLocal. The local file quangle.txt is copied to the file /user/tom/quangle.txt on the HDFS instance running on localhost.

In fact, we could have omitted the scheme and host of the URI and picked up the default, hdfs://localhost, as specified in core-site.xml:

% hadoop fs -copyFromLocal input/docs/quangle.txt /user/tom/quangle.txt

We also could have used a relative path and copied the file to our home directory in HDFS, which in this case is /user/tom:

% hadoop fs -copyFromLocal input/docs/quangle.txt quangle.txt

Let's copy the file back to the local filesystem and check whether it's the same:

% hadoop fs -copyToLocal quangle.txt quangle.copy.txt
% md5 input/docs/quangle.txt quangle.copy.txt
MD5 (input/docs/quangle.txt) = e7891a2627cf263a079fb0f18256ffb2
MD5 (quangle.copy.txt) = e7891a2627cf263a079fb0f18256ffb2

[5] In Hadoop 1, the name for this property was fs.default.name. Hadoop 2 introduced many new property names, and deprecated the old ones (see "Which Properties Can I Set?" on page 150). This book uses the new property names.

The MD5 digests are the same, showing that the file survived its trip to HDFS and is back intact.

Finally, let's look at an HDFS file listing. We create a directory first just to see how it is displayed in the listing:

% hadoop fs -mkdir books
% hadoop fs -ls .
Found 2 items
drwxr-xr-x   - tom supergroup          0 2014-10-04 13:22 books
-rw-r--r--   1 tom supergroup        119 2014-10-04 13:21 quangle.txt

The information returned is very similar to that returned by the Unix command ls -l, with a few minor differences.

The first column shows the file mode. The second column is the replication factor of the file (something a traditional Unix filesystem does not have). Remember we set the default replication factor in the site-wide configuration to be 1, which is why we see the same value here. The entry in this column is empty for directories because the concept of replication does not apply to them—directories are treated as metadata and stored by the namenode, not the datanodes.

The third and fourth columns show the file owner and group. The fifth column is the size of the file in bytes, or zero for directories. The sixth and seventh columns are the last modified date and time. Finally, the eighth column is the name of the file or directory.

File Permissions in HDFS

HDFS has a permissions model for files and directories that is much like the POSIX model. There are three types of permission: the read permission (r), the write permission (w), and the execute permission (x).

The read permission is required to read files or list the contents of a directory. The write permission is required to write a file or, for a directory, to create or delete files or directories in it. The execute permission is ignored for a file because you can't execute a file on HDFS (unlike POSIX), and for a directory this permission is required to access its children.

Each file and directory has an owner, a group, and a mode. The mode is made up of the permissions for the user who is the owner, the permissions for the users who are members of the group, and the permissions for users who are neither the owners nor members of the group.

By default, Hadoop runs with security disabled, which means that a client's identity is not authenticated.
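The owner/group/other decomposition of the mode can be illustrated with a short sketch (plain Python for illustration, not Hadoop code) that splits the symbolic mode string reported by hadoop fs -ls into its parts:

```python
def split_mode(mode):
    """Split a symbolic mode such as 'drwxr-xr-x' into the file-type
    flag and the owner/group/other permission triples."""
    file_type, perms = mode[0], mode[1:]
    return {
        "is_directory": file_type == "d",
        "owner": perms[0:3],   # permissions for the owning user
        "group": perms[3:6],   # permissions for the owning group
        "other": perms[6:9],   # permissions for everyone else
    }

# The 'books' directory from the listing earlier in this section:
# owner may write; group and others may only read and, since it is
# a directory, traverse it via the execute bit.
print(split_mode("drwxr-xr-x"))
print(split_mode("-rw-r--r--"))
```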
