Tom White - Hadoop The Definitive Guide_ 4 edition - 2015 (811394), страница 18

Файл №811394 Tom White - Hadoop The Definitive Guide_ 4 edition - 2015 (Tom White - Hadoop The Definitive Guide_ 4 edition - 2015.pdf) 18 страницаTom White - Hadoop The Definitive Guide_ 4 edition - 2015 (811394) страница 182020-08-252020-08-25СтудИзба

Tom White - Hadoop The Definitive Guide_ 4 edition - 2015.pdf

Просмтор этого файла доступен только зарегистрированным пользователям. Но у нас супер быстрая регистрация: достаточно только электронной почты!

Регистрация/авторизация

Текст из файла (страница 18)

After the globpicks out an initial set of files to include, the filter is used to refine the results. Forexample:fs.globStatus(new Path("/2007/*/*"), new RegexExcludeFilter("^.*/2007/12/31$"))will expand to /2007/12/30.Filters can act only on a file’s name, as represented by a Path. They can’t use a file’sproperties, such as creation time, as their basis. Nevertheless, they can perform matchingthat neither glob patterns nor regular expressions can achieve.

For example, if you storefiles in a directory structure that is laid out by date (like in the previous section), youcan write a PathFilter to pick out files that fall in a given date range.Deleting DataUse the delete() method on FileSystem to permanently remove files or directories:public boolean delete(Path f, boolean recursive) throws IOExceptionIf f is a file or an empty directory, the value of recursive is ignored. A nonemptydirectory is deleted, along with its contents, only if recursive is true (otherwise, anIOException is thrown).68|Chapter 3: The Hadoop Distributed FilesystemData FlowAnatomy of a File ReadTo get an idea of how data flows between the client interacting with HDFS, the name‐node, and the datanodes, consider Figure 3-2, which shows the main sequence of eventswhen reading a file.Figure 3-2. A client reading data from HDFSThe client opens the file it wishes to read by calling open() on the FileSystem object,which for HDFS is an instance of DistributedFileSystem (step 1 in Figure 3-2).DistributedFileSystem calls the namenode, using remote procedure calls (RPCs), todetermine the locations of the first few blocks in the file (step 2).

For each block, thenamenode returns the addresses of the datanodes that have a copy of that block. Fur‐thermore, the datanodes are sorted according to their proximity to the client (accordingto the topology of the cluster’s network; see “Network Topology and Hadoop” on page70). If the client is itself a datanode (in the case of a MapReduce task, for instance), theclient will read from the local datanode if that datanode hosts a copy of the block (seealso Figure 2-2 and “Short-circuit local reads” on page 308).The DistributedFileSystem returns an FSDataInputStream (an input stream thatsupports file seeks) to the client for it to read data from. FSDataInputStream in turnwraps a DFSInputStream, which manages the datanode and namenode I/O.The client then calls read() on the stream (step 3).

DFSInputStream, which has storedthe datanode addresses for the first few blocks in the file, then connects to the firstData Flow|69(closest) datanode for the first block in the file. Data is streamed from the datanode backto the client, which calls read() repeatedly on the stream (step 4). When the end of theblock is reached, DFSInputStream will close the connection to the datanode, then findthe best datanode for the next block (step 5). This happens transparently to the client,which from its point of view is just reading a continuous stream.Blocks are read in order, with the DFSInputStream opening new connections todatanodes as the client reads through the stream.

It will also call the namenode to retrievethe datanode locations for the next batch of blocks as needed. When the client hasfinished reading, it calls close() on the FSDataInputStream (step 6).During reading, if the DFSInputStream encounters an error while communicating witha datanode, it will try the next closest one for that block.

It will also remember datanodesthat have failed so that it doesn’t needlessly retry them for later blocks. The DFSInputStream also verifies checksums for the data transferred to it from the datanode. If acorrupted block is found, the DFSInputStream attempts to read a replica of the blockfrom another datanode; it also reports the corrupted block to the namenode.One important aspect of this design is that the client contacts datanodes directly toretrieve data and is guided by the namenode to the best datanode for each block.

Thisdesign allows HDFS to scale to a large number of concurrent clients because the datatraffic is spread across all the datanodes in the cluster. Meanwhile, the namenode merelyhas to service block location requests (which it stores in memory, making them veryefficient) and does not, for example, serve data, which would quickly become a bottle‐neck as the number of clients grew.Network Topology and HadoopWhat does it mean for two nodes in a local network to be “close” to each other? In thecontext of high-volume data processing, the limiting factor is the rate at which we cantransfer data between nodes—bandwidth is a scarce commodity. The idea is to use thebandwidth between two nodes as a measure of distance.Rather than measuring bandwidth between nodes, which can be difficult to do in prac‐tice (it requires a quiet cluster, and the number of pairs of nodes in a cluster grows asthe square of the number of nodes), Hadoop takes a simple approach in which thenetwork is represented as a tree and the distance between two nodes is the sum of theirdistances to their closest common ancestor.

Levels in the tree are not predefined, but itis common to have levels that correspond to the data center, the rack, and the node thata process is running on. The idea is that the bandwidth available for each of the followingscenarios becomes progressively less:• Processes on the same node• Different nodes on the same rack70|Chapter 3: The Hadoop Distributed Filesystem• Nodes on different racks in the same data center• Nodes in different data centers8For example, imagine a node n1 on rack r1 in data center d1.

This can be representedas /d1/r1/n1. Using this notation, here are the distances for the four scenarios:• distance(/d1/r1/n1, /d1/r1/n1) = 0 (processes on the same node)• distance(/d1/r1/n1, /d1/r1/n2) = 2 (different nodes on the same rack)• distance(/d1/r1/n1, /d1/r2/n3) = 4 (nodes on different racks in the same data center)• distance(/d1/r1/n1, /d2/r3/n4) = 6 (nodes in different data centers)This is illustrated schematically in Figure 3-3. (Mathematically inclined readers willnotice that this is an example of a distance metric.)Figure 3-3.

Network distance in HadoopFinally, it is important to realize that Hadoop cannot magically discover your networktopology for you; it needs some help (we’ll cover how to configure topology in “NetworkTopology” on page 286). By default, though, it assumes that the network is flat—a singlelevel hierarchy—or in other words, that all nodes are on a single rack in a single datacenter.

For small clusters, this may actually be the case, and no further configuration isrequired.8. At the time of this writing, Hadoop is not suited for running across data centers.Data Flow|71Anatomy of a File WriteNext we’ll look at how files are written to HDFS. Although quite detailed, it is instructiveto understand the data flow because it clarifies HDFS’s coherency model.We’re going to consider the case of creating a new file, writing data to it, then closingthe file. This is illustrated in Figure 3-4.Figure 3-4. A client writing data to HDFSThe client creates the file by calling create() on DistributedFileSystem (step 1 inFigure 3-4).

DistributedFileSystem makes an RPC call to the namenode to create anew file in the filesystem’s namespace, with no blocks associated with it (step 2). Thenamenode performs various checks to make sure the file doesn’t already exist and thatthe client has the right permissions to create the file. If these checks pass, the namenodemakes a record of the new file; otherwise, file creation fails and the client is thrown anIOException. The DistributedFileSystem returns an FSDataOutputStream for theclient to start writing data to. Just as in the read case, FSDataOutputStream wraps aDFSOutputStream, which handles communication with the datanodes and namenode.As the client writes data (step 3), the DFSOutputStream splits it into packets, which itwrites to an internal queue called the data queue.

The data queue is consumed by theDataStreamer, which is responsible for asking the namenode to allocate new blocks bypicking a list of suitable datanodes to store the replicas. The list of datanodes forms apipeline, and here we’ll assume the replication level is three, so there are three nodes in72|Chapter 3: The Hadoop Distributed Filesystemthe pipeline. The DataStreamer streams the packets to the first datanode in the pipeline,which stores each packet and forwards it to the second datanode in the pipeline. Sim‐ilarly, the second datanode stores the packet and forwards it to the third (and last)datanode in the pipeline (step 4).The DFSOutputStream also maintains an internal queue of packets that are waiting tobe acknowledged by datanodes, called the ack queue.

A packet is removed from the ackqueue only when it has been acknowledged by all the datanodes in the pipeline (step 5).If any datanode fails while data is being written to it, then the following actions aretaken, which are transparent to the client writing the data. First, the pipeline is closed,and any packets in the ack queue are added to the front of the data queue so thatdatanodes that are downstream from the failed node will not miss any packets. Thecurrent block on the good datanodes is given a new identity, which is communicated tothe namenode, so that the partial block on the failed datanode will be deleted if the faileddatanode recovers later on.

The failed datanode is removed from the pipeline, and anew pipeline is constructed from the two good datanodes. The remainder of the block’sdata is written to the good datanodes in the pipeline. The namenode notices that theblock is under-replicated, and it arranges for a further replica to be created on anothernode. Subsequent blocks are then treated as normal.It’s possible, but unlikely, for multiple datanodes to fail while a block is being written.As long as dfs.namenode.replication.min replicas (which defaults to 1) are written,the write will succeed, and the block will be asynchronously replicated across the clusteruntil its target replication factor is reached (dfs.replication, which defaults to 3).When the client has finished writing data, it calls close() on the stream (step 6).

Характеристики

Тип файла

PDF-файл

Размер

9,6 Mb

Материал

Tom White - Hadoop The Definitive Guide_ 4 edition - 2015.pdf

Тип материала

Книга

Предмет

(СМРХиОД) Современные методы распределенного хранения и обработки данных

Высшее учебное заведение

МГУ им. Ломоносова

Список файлов книги

tom-white-hadoop-the-definitive-guide_-4-edition-2015.pdf.rar

Tom White - Hadoop The Definitive Guide_ 4 edition - 2015.pdf

Поделитесь ссылкой:

Ставлю 10/10
Все нравится, очень удобный сайт, помогает в учебе. Кроме этого, можно заработать самому, выставляя готовые учебные материалы на продажу здесь. Рейтинги и отзывы на преподавателей очень помогают сориентироваться в начале нового семестра. Спасибо за такую функцию. Ставлю максимальную оценку.

Лучшая платформа для успешной сдачи сессии
Познакомился со СтудИзбой благодаря своему другу, очень нравится интерфейс, количество доступных файлов, цена, в общем, все прекрасно. Даже сам продаю какие-то свои работы.

Студизба ван лав ❤
Очень офигенный сайт для студентов. Много полезных учебных материалов. Пользуюсь студизбой с октября 2021 года. Серьёзных нареканий нет. Хотелось бы, что бы ввели подписочную модель и сделали материалы дешевле 300 рублей в рамках подписки бесплатными.

Отличный сайт
Лично меня всё устраивает - и покупка, и продажа; и цены, и возможность предпросмотра куска файла, и обилие бесплатных файлов (в подборках по авторам, читай, ВУЗам и факультетам). Есть определённые баги, но всё решаемо, да и администраторы реагируют в течение суток.

Маленький отзыв о большом помощнике!
Студизба спасает в те моменты, когда сроки горят, а работ накопилось достаточно. Довольно удобный сайт с простой навигацией и огромным количеством материалов.

Студ. Изба как крупнейший сборник работ для студентов
Тут дофига бывает всего полезного. Печально, что бывают предметы по которым даже одного бесплатного решения нет, но это скорее вопрос к студентам. В остальном всё здорово.

Спасательный островок
Если уже не успеваешь разобраться или застрял на каком-то задание поможет тебе быстро и недорого решить твою проблему.

Всё и так отлично
Всё очень удобно. Особенно круто, что есть система бонусов и можно выводить остатки денег. Очень много качественных бесплатных файлов.

Отзыв о системе "Студизба"
Отличная платформа для распространения работ, востребованных студентами. Хорошо налаженная и качественная работа сайта, огромная база заданий и аудитория.

Отличный помощник
Отличный сайт с кучей полезных файлов, позволяющий найти много методичек / учебников / отзывов о вузах и преподователях.

Отлично помогает студентам в любой момент для решения трудных и незамедлительных задач
Хотелось бы больше конкретной информации о преподавателях. А так в принципе хороший сайт, всегда им пользуюсь и ни разу не было желания прекратить. Хороший сайт для помощи студентам, удобный и приятный интерфейс. Из недостатков можно выделить только отсутствия небольшого количества файлов.

Спасибо за шикарный сайт
Великолепный сайт на котором студент за не большие деньги может найти помощь с дз, проектами курсовыми, лабораторными, а также узнать отзывы на преподавателей и бесплатно скачать пособия.