Tom White - Hadoop The Definitive Guide_ 4 edition - 2015 (811394), page 23


Scheduling in YARN

To make the time taken for a job to start more predictable, the Fair Scheduler supports preemption. Preemption allows the scheduler to kill containers for queues that are running with more than their fair share of resources so that the resources can be allocated to a queue that is under its fair share. Note that preemption reduces overall cluster efficiency, since the terminated containers need to be reexecuted.

Preemption is enabled globally by setting yarn.scheduler.fair.preemption to true. There are two relevant preemption timeout settings: one for minimum share and one for fair share, both specified in seconds. By default, the timeouts are not set, so you need to set at least one to allow containers to be preempted.

If a queue waits for as long as its minimum share preemption timeout without receiving its minimum guaranteed share, then the scheduler may preempt other containers. The default timeout is set for all queues via the defaultMinSharePreemptionTimeout top-level element in the allocation file, and on a per-queue basis by setting the minSharePreemptionTimeout element for a queue.

Likewise, if a queue remains below half of its fair share for as long as the fair share preemption timeout, then the scheduler may preempt other containers.
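The two timeout conditions above can be sketched as a small decision function. This is only an illustrative model, not the Fair Scheduler's code: the QueueState class, its field names, and may_preempt are all hypothetical, and a timeout of None stands in for "not set".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QueueState:
    """Hypothetical snapshot of a queue; not a YARN class."""
    secs_below_min_share: float        # time spent below the minimum guaranteed share
    secs_below_half_fair_share: float  # time spent below half of the fair share


def may_preempt(queue: QueueState,
                min_share_timeout: Optional[float] = None,
                fair_share_timeout: Optional[float] = None) -> bool:
    """Return True if either configured timeout has expired for the queue.

    A timeout of None models "not set", so that condition can never trigger.
    """
    min_share_expired = (min_share_timeout is not None
                         and queue.secs_below_min_share >= min_share_timeout)
    fair_share_expired = (fair_share_timeout is not None
                          and queue.secs_below_half_fair_share >= fair_share_timeout)
    return min_share_expired or fair_share_expired


starved = QueueState(secs_below_min_share=45, secs_below_half_fair_share=45)
print(may_preempt(starved))                         # False: no timeout is set
print(may_preempt(starved, min_share_timeout=30))   # True: minimum share timeout expired
print(may_preempt(starved, fair_share_timeout=60))  # False: fair share timeout not reached
```

With both timeouts left unset the function never allows preemption, which mirrors the statement that at least one timeout must be configured.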

The default fair share preemption timeout is set for all queues via the defaultFairSharePreemptionTimeout top-level element in the allocation file, and on a per-queue basis by setting fairSharePreemptionTimeout on a queue. The threshold may also be changed from its default of 0.5 by setting defaultFairSharePreemptionThreshold and fairSharePreemptionThreshold (per-queue).

Delay Scheduling

All the YARN schedulers try to honor locality requests. On a busy cluster, if an application requests a particular node, there is a good chance that other containers are running on it at the time of the request. The obvious course of action is to immediately loosen the locality requirement and allocate a container on the same rack. However, it has been observed in practice that waiting a short time (no more than a few seconds) can dramatically increase the chances of being allocated a container on the requested node, and therefore increase the efficiency of the cluster.

This feature is called delay scheduling, and it is supported by both the Capacity Scheduler and the Fair Scheduler.

Every node manager in a YARN cluster periodically sends a heartbeat request to the resource manager (by default, one per second). Heartbeats carry information about the node manager’s running containers and the resources available for new containers, so each heartbeat is a potential scheduling opportunity for an application to run a container.

When using delay scheduling, the scheduler doesn’t simply use the first scheduling opportunity it receives, but waits for up to a given maximum number of scheduling opportunities to occur before loosening the locality constraint and taking the next scheduling opportunity.

For the Capacity Scheduler, delay scheduling is configured by setting yarn.scheduler.capacity.node-locality-delay to a positive integer representing the number of scheduling opportunities that it is prepared to miss before loosening the node constraint to match any node in the same rack.

The Fair Scheduler also uses the number of scheduling opportunities to determine the delay, although it is expressed as a proportion of the cluster size.
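The opportunity-counting idea can be sketched as follows. This is a simplified model under stated assumptions, not scheduler code: pick_node, rack_of, and max_missed are hypothetical names, each heartbeat is reduced to a node name, and falling back to a different rack is left out.

```python
def pick_node(opportunities, requested_node, rack_of, max_missed):
    """Choose a node for one container using a count-based delay.

    opportunities: node names in heartbeat arrival order (one per opportunity).
    rack_of:       dict mapping node name to rack name.
    max_missed:    opportunities the scheduler is prepared to miss before it
                   accepts any node in the requested node's rack.
    Returns the chosen node, or None if no opportunity was acceptable.
    """
    target_rack = rack_of[requested_node]
    missed = 0
    for node in opportunities:
        if node == requested_node:
            return node  # node-local allocation
        if missed >= max_missed and rack_of[node] == target_rack:
            return node  # rack-local allocation after the delay
        missed += 1
    return None


rack_of = {"n1": "r1", "n2": "r1", "n3": "r2"}
heartbeats = ["n2", "n3", "n1", "n2"]

print(pick_node(heartbeats, "n1", rack_of, max_missed=0))  # n2: no delay, same rack accepted
print(pick_node(heartbeats, "n1", rack_of, max_missed=2))  # n1: waiting paid off
```

In the second call the scheduler skips two opportunities and the requested node then reports in, which is the outcome delay scheduling is betting on.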

For the Fair Scheduler, for example, setting yarn.scheduler.fair.locality.threshold.node to 0.5 means that the scheduler should wait until half of the nodes in the cluster have presented scheduling opportunities before accepting another node in the same rack. There is a corresponding property, yarn.scheduler.fair.locality.threshold.rack, for setting the threshold before another rack is accepted instead of the one requested.

Dominant Resource Fairness

When there is only a single resource type being scheduled, such as memory, then the concept of capacity or fairness is easy to determine.

If two users are running applications, you can measure the amount of memory that each is using to compare the two applications. However, when there are multiple resource types in play, things get more complicated. If one user’s application requires lots of CPU but little memory and the other’s requires little CPU and lots of memory, how are these two applications compared?

The way that the schedulers in YARN address this problem is to look at each user’s dominant resource and use it as a measure of the cluster usage.

This approach is called Dominant Resource Fairness, or DRF for short.[9] The idea is best illustrated with a simple example.

Imagine a cluster with a total of 100 CPUs and 10 TB of memory. Application A requests containers of (2 CPUs, 300 GB), and application B requests containers of (6 CPUs, 100 GB). A’s request is (2%, 3%) of the cluster, so memory is dominant since its proportion (3%) is larger than CPU’s (2%).
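The arithmetic for both applications can be checked with a short sketch. It assumes 10 TB means 10,000 GB (decimal units, which is what makes 300 GB come out at 3%); the dictionary keys and the dominant_share helper are hypothetical names used only for illustration.

```python
CLUSTER = {"cpu": 100, "memory_gb": 10_000}  # assumption: 10 TB taken as 10,000 GB


def dominant_share(request, cluster=CLUSTER):
    """Return (resource name, share) for the request's dominant resource."""
    shares = {res: request[res] / cluster[res] for res in cluster}
    dominant = max(shares, key=shares.get)
    return dominant, shares[dominant]


app_a = {"cpu": 2, "memory_gb": 300}
app_b = {"cpu": 6, "memory_gb": 100}

res_a, share_a = dominant_share(app_a)
res_b, share_b = dominant_share(app_b)

print(res_a, f"{share_a:.0%}")  # memory_gb 3%
print(res_b, f"{share_b:.0%}")  # cpu 6%

# Equalizing dominant shares means A can run this many containers for each one of B's.
print(round(share_b / share_a, 6))  # 2.0
```

Each container of B costs twice as much of its dominant resource as a container of A, so equal dominant shares correspond to A holding about twice as many containers as B.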

B’s request is (6%, 1%), so CPU is dominant. Since B’s container requests are twice as big in the dominant resource (6% versus 3%), it will be allocated half as many containers under fair sharing.

By default DRF is not used, so during resource calculations, only memory is considered and CPU is ignored. The Capacity Scheduler can be configured to use DRF by setting yarn.scheduler.capacity.resource-calculator to org.apache.hadoop.yarn.util.resource.DominantResourceCalculator in capacity-scheduler.xml.

For the Fair Scheduler, DRF can be enabled by setting the top-level element defaultQueueSchedulingPolicy in the allocation file to drf.

[9] DRF was introduced in Ghodsi et al.’s “Dominant Resource Fairness: Fair Allocation of Multiple Resource Types,” March 2011.

Further Reading

This chapter has given a short overview of YARN. For more detail, see Apache Hadoop YARN by Arun C. Murthy et al. (Addison-Wesley, 2014).

CHAPTER 5: Hadoop I/O

Hadoop comes with a set of primitives for data I/O. Some of these are techniques that are more general than Hadoop, such as data integrity and compression, but deserve special consideration when dealing with multiterabyte datasets. Others are Hadoop tools or APIs that form the building blocks for developing distributed systems, such as serialization frameworks and on-disk data structures.

Data Integrity

Users of Hadoop rightly expect that no data will be lost or corrupted during storage or processing. However, because every I/O operation on the disk or network carries with it a small chance of introducing errors into the data that it is reading or writing, when the volumes of data flowing through the system are as large as the ones Hadoop is capable of handling, the chance of data corruption occurring is high.

The usual way of detecting corrupted data is by computing a checksum for the data when it first enters the system, and again whenever it is transmitted across a channel that is unreliable and hence capable of corrupting the data.

The data is deemed to be corrupt if the newly generated checksum doesn’t exactly match the original. This technique doesn’t offer any way to fix the data; it is merely error detection. (And this is a reason for not using low-end hardware; in particular, be sure to use ECC memory.) Note that it is possible that it’s the checksum that is corrupt, not the data, but this is very unlikely, because the checksum is much smaller than the data.

A commonly used error-detecting code is CRC-32 (32-bit cyclic redundancy check), which computes a 32-bit integer checksum for input of any size. CRC-32 is used for checksumming in Hadoop’s ChecksumFileSystem, while HDFS uses a more efficient variant called CRC-32C.

Data Integrity in HDFS

HDFS transparently checksums all data written to it and by default verifies checksums when reading data. A separate checksum is created for every dfs.bytes-per-checksum bytes of data.

The default is 512 bytes, and because a CRC-32C checksum is 4 bytes long, the storage overhead is less than 1%.

Datanodes are responsible for verifying the data they receive before storing the data and its checksum. This applies to data that they receive from clients and from other datanodes during replication. A client writing data sends it to a pipeline of datanodes (as explained in Chapter 3), and the last datanode in the pipeline verifies the checksum. If the datanode detects an error, the client receives a subclass of IOException, which it should handle in an application-specific manner (for example, by retrying the operation).

When clients read data from datanodes, they verify checksums as well, comparing them with the ones stored at the datanodes.
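Per-chunk checksumming and verification on read can be illustrated with a minimal sketch. It is not HDFS code: it uses plain CRC-32 from Python's zlib module as a stand-in for CRC-32C (which the standard library does not provide), and the function names are hypothetical.

```python
import zlib

BYTES_PER_CHECKSUM = 512  # default chunk size stated in the text
CHECKSUM_BYTES = 4        # a 32-bit CRC occupies 4 bytes


def checksum_chunks(data, chunk_size=BYTES_PER_CHECKSUM):
    """Compute one CRC-32 value per chunk_size bytes of data."""
    return [zlib.crc32(data[i:i + chunk_size])
            for i in range(0, len(data), chunk_size)]


def find_corrupt_chunks(data, stored, chunk_size=BYTES_PER_CHECKSUM):
    """Recompute checksums and return the indexes of chunks that no longer match."""
    fresh = checksum_chunks(data, chunk_size)
    return [i for i, (new, old) in enumerate(zip(fresh, stored)) if new != old]


original = bytes(range(256)) * 8   # 2,048 bytes, i.e. 4 chunks
stored = checksum_chunks(original)  # computed when the data is written

damaged = bytearray(original)
damaged[600] ^= 0x01                # flip one bit inside the second chunk

print(find_corrupt_chunks(original, stored))        # []
print(find_corrupt_chunks(bytes(damaged), stored))  # [1]
print(CHECKSUM_BYTES / BYTES_PER_CHECKSUM)          # 0.0078125, under 1% overhead
```

Checksumming small chunks rather than whole blocks narrows a mismatch down to 512 bytes, while the 4-in-512 ratio keeps the storage cost below 1%.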

Each datanode keeps a persistent log of checksum verifications, so it knows the last time each of its blocks was verified. When a client successfully verifies a block, it tells the datanode, which updates its log. Keeping statistics such as these is valuable in detecting bad disks.

In addition to block verification on client reads, each datanode runs a DataBlockScanner in a background thread that periodically verifies all the blocks stored on the datanode. This is to guard against corruption due to “bit rot” in the physical storage media. See “Datanode block scanner” on page 328 for details on how to access the scanner reports.

Because HDFS stores replicas of blocks, it can “heal” corrupted blocks by copying one of the good replicas to produce a new, uncorrupt replica.

The way this works is that if a client detects an error when reading a block, it reports the bad block and the datanode it was trying to read from to the namenode before throwing a ChecksumException. The namenode marks the block replica as corrupt so it doesn’t direct any more clients to it or try to copy this replica to another datanode. It then schedules a copy of the block to be replicated on another datanode, so its replication factor is back at the expected level. Once this has happened, the corrupt replica is deleted.

It is possible to disable verification of checksums by passing false to the setVerifyChecksum() method on FileSystem before using the open() method to read a file. The same effect is possible from the shell by using the -ignoreCrc option with the -get or the equivalent -copyToLocal command.
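The replica-healing sequence described earlier in this section (mark corrupt, re-replicate from a good copy, delete the corrupt replica) can be pictured with a toy bookkeeping model. Everything here is hypothetical: ToyNamenode and its methods are illustrative names, a single block is tracked, and real re-replication is asynchronous rather than one method call.

```python
class ToyNamenode:
    """Toy replica bookkeeping for one block; not the HDFS implementation."""

    def __init__(self, replicas, all_datanodes):
        self.replicas = set(replicas)           # datanodes holding a good replica
        self.corrupt = set()                    # datanodes holding a corrupt replica
        self.all_datanodes = list(all_datanodes)

    def report_bad_block(self, datanode):
        """Handle a client's report that the replica on `datanode` failed its checksum."""
        # 1. Mark the replica corrupt so it is neither served nor used as a copy source.
        self.replicas.discard(datanode)
        self.corrupt.add(datanode)
        if not self.replicas:
            raise RuntimeError("no good replica left to copy from")
        # 2. Copy a good replica to a datanode that does not hold the block yet.
        target = next((dn for dn in self.all_datanodes
                       if dn not in self.replicas and dn not in self.corrupt), None)
        if target is None:
            return None  # nowhere to place a new replica; corrupt copy stays marked
        self.replicas.add(target)
        # 3. Replication factor is restored, so the corrupt replica can be deleted.
        self.corrupt.discard(datanode)
        return target


nn = ToyNamenode(replicas=["dn1", "dn2", "dn3"],
                 all_datanodes=["dn1", "dn2", "dn3", "dn4"])
new_home = nn.report_bad_block("dn2")
print(new_home)              # dn4
print(sorted(nn.replicas))   # ['dn1', 'dn3', 'dn4']
```

The ordering matters in the model as in the text: the corrupt replica is removed only after a fresh copy exists, so the block never drops below its remaining good replicas.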
