YARN was introduced in Hadoop 2 to improve the MapReduce implementation, but it is general enough to support other distributed computing paradigms as well.

YARN provides APIs for requesting and working with cluster resources, but these APIs are not typically used directly by user code. Instead, users write to higher-level APIs provided by distributed computing frameworks, which themselves are built on YARN and hide the resource management details from the user.
The situation is illustrated in Figure 4-1, which shows some distributed computing frameworks (MapReduce, Spark, and so on) running as YARN applications on the cluster compute layer (YARN) and the cluster storage layer (HDFS and HBase).

Figure 4-1. YARN applications

There is also a layer of applications that build on the frameworks shown in Figure 4-1. Pig, Hive, and Crunch are all examples of processing frameworks that run on MapReduce, Spark, or Tez (or on all three), and don’t interact with YARN directly.

This chapter walks through the features in YARN and provides a basis for understanding later chapters in Part IV that cover Hadoop’s distributed processing frameworks.

Anatomy of a YARN Application Run

YARN provides its core services via two types of long-running daemon: a resource manager (one per cluster) to manage the use of resources across the cluster, and node managers running on all the nodes in the cluster to launch and monitor containers.
A container executes an application-specific process with a constrained set of resources (memory, CPU, and so on). Depending on how YARN is configured (see “YARN” on page 300), a container may be a Unix process or a Linux cgroup. Figure 4-2 illustrates how YARN runs an application.

Figure 4-2. How YARN runs an application

To run an application on YARN, a client contacts the resource manager and asks it to run an application master process (step 1 in Figure 4-2).
The resource manager then finds a node manager that can launch the application master in a container (steps 2a and 2b).1 Precisely what the application master does once it is running depends on the application. It could simply run a computation in the container it is running in and return the result to the client. Or it could request more containers from the resource manager (step 3), and use them to run a distributed computation (steps 4a and 4b). The latter is what the MapReduce YARN application does, which we’ll look at in more detail in “Anatomy of a MapReduce Job Run” on page 185.

1. It’s also possible for the client to start the application master, possibly outside the cluster, or in the same JVM as the client. This is called an unmanaged application master.

Notice from Figure 4-2 that YARN itself does not provide any way for the parts of the application (client, master, process) to communicate with one another. Most nontrivial YARN applications use some form of remote communication (such as Hadoop’s RPC layer) to pass status updates and results back to the client, but these are specific to the application.
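The client side of step 1 can be made concrete with YARN’s Java client API. The sketch below is illustrative only, assuming the Hadoop 2 YarnClient API: the application name, the application master launch command, and the memory and CPU figures are placeholder values, and local resources (such as the application master’s JAR), classpath setup, and error handling are omitted.

import java.util.Collections;

import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.LocalResource;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class SubmitApplicationSketch {
  public static void main(String[] args) throws Exception {
    YarnClient yarnClient = YarnClient.createYarnClient();
    yarnClient.init(new YarnConfiguration());
    yarnClient.start();

    // Step 1: ask the resource manager for a new application.
    YarnClientApplication app = yarnClient.createApplication();
    ApplicationSubmissionContext appContext = app.getApplicationSubmissionContext();
    appContext.setApplicationName("my-yarn-app"); // placeholder name

    // Describe the container that will run the application master: the
    // command to launch it and the resources the container needs.
    ContainerLaunchContext amContainer = ContainerLaunchContext.newInstance(
        Collections.<String, LocalResource>emptyMap(), // local resources (JARs, etc.)
        Collections.<String, String>emptyMap(),        // environment variables
        Collections.singletonList("/path/to/launch-am.sh"), // placeholder AM command
        null, null, null);                             // service data, tokens, ACLs
    appContext.setAMContainerSpec(amContainer);
    appContext.setResource(Resource.newInstance(1024, 1)); // 1,024 MB, 1 vcore

    // Submitting hands control to the resource manager, which finds a node
    // manager to launch the application master container (steps 2a and 2b).
    ApplicationId appId = appContext.getApplicationId();
    yarnClient.submitApplication(appContext);
    System.out.println("Submitted application " + appId);
  }
}

A real client would typically also package the application master’s JAR as a local resource and poll yarnClient.getApplicationReport(appId) to track the application’s progress.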
Resource Requests

YARN has a flexible model for making resource requests. A request for a set of containers can express the amount of computer resources required for each container (memory and CPU), as well as locality constraints for the containers in that request.

Locality is critical in ensuring that distributed data processing algorithms use the cluster bandwidth efficiently,2 so YARN allows an application to specify locality constraints for the containers it is requesting. Locality constraints can be used to request a container on a specific node or rack, or anywhere on the cluster (off-rack).

2. For more on this topic see “Scaling Out” on page 30 and “Network Topology and Hadoop” on page 70.
Sometimes the locality constraint cannot be met, in which case either no allocation is made or, optionally, the constraint can be loosened. For example, if a specific node was requested but it is not possible to start a container on it (because other containers are running on it), then YARN will try to start a container on a node in the same rack, or, if that’s not possible, on any node in the cluster.

In the common case of launching a container to process an HDFS block (to run a map task in MapReduce, say), the application will request a container on one of the nodes hosting the block’s three replicas, or on a node in one of the racks hosting the replicas, or, failing that, on any node in the cluster.

A YARN application can make resource requests at any time while it is running.
For example, an application can make all of its requests up front, or it can take a more dynamic approach whereby it requests more resources dynamically to meet the changing needs of the application.

Spark takes the first approach, starting a fixed number of executors on the cluster (see “Spark on YARN” on page 571).
MapReduce, on the other hand, has two phases: the map task containers are requested up front, but the reduce task containers are not started until later. Also, if any tasks fail, additional containers will be requested so the failed tasks can be rerun.
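Both the locality constraints and the incremental request pattern described above surface in the AMRMClient API that application masters use to talk to the resource manager. The following is a minimal sketch, assuming the Hadoop 2 AMRMClient API; the hostnames, rack names, priority, and resource figures are hypothetical placeholders.

import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class RequestContainersSketch {
  public static void main(String[] args) throws Exception {
    // This code runs inside the application master process.
    AMRMClient<ContainerRequest> rmClient = AMRMClient.createAMRMClient();
    rmClient.init(new YarnConfiguration());
    rmClient.start();
    rmClient.registerApplicationMaster("", 0, ""); // host, RPC port, tracking URL (unused here)

    // Each container in the request gets 1,024 MB of memory and one vcore.
    Resource capability = Resource.newInstance(1024, 1);

    // Prefer the nodes hosting the block's replicas, then their racks, and
    // allow YARN to relax the constraint to any node if neither is possible.
    String[] nodes = {"node1.example.com", "node2.example.com", "node3.example.com"};
    String[] racks = {"/rack1", "/rack2"};
    ContainerRequest request = new ContainerRequest(
        capability, nodes, racks, Priority.newInstance(1), true /* relaxLocality */);
    rmClient.addContainerRequest(request);

    // Allocated containers come back from subsequent allocate() calls (the
    // periodic heartbeat to the resource manager), at which point the
    // application master asks the chosen node managers to start them.
    // AllocateResponse response = rmClient.allocate(0.1f);
  }
}

Requests can be added at any point while the application is running, which is how an application takes the more dynamic approach described above.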
Application Lifespan

The lifespan of a YARN application can vary dramatically: from a short-lived application of a few seconds to a long-running application that runs for days or even months. Rather than look at how long the application runs for, it’s useful to categorize applications in terms of how they map to the jobs that users run. The simplest case is one application per user job, which is the approach that MapReduce takes.

The second model is to run one application per workflow or user session of (possibly unrelated) jobs. This approach can be more efficient than the first, since containers can be reused between jobs, and there is also the potential to cache intermediate data between jobs. Spark is an example that uses this model.

The third model is a long-running application that is shared by different users. Such an application often acts in some kind of coordination role. For example, Apache Slider has a long-running application master for launching other applications on the cluster. This approach is also used by Impala (see “SQL-on-Hadoop Alternatives” on page 484) to provide a proxy application that the Impala daemons communicate with to request cluster resources.
The “always on” application master means that users have very low-latency responses to their queries since the overhead of starting a new application master is avoided.3

3. The low-latency application master code lives in the Llama project.

Building YARN Applications

Writing a YARN application from scratch is fairly involved, but in many cases is not necessary, as it is often possible to use an existing application that fits the bill.
For example, if you are interested in running a directed acyclic graph (DAG) of jobs, then Spark or Tez is appropriate; or for stream processing, Spark, Samza, or Storm works.4

4. All of these projects are Apache Software Foundation projects.

There are a couple of projects that simplify the process of building a YARN application. Apache Slider, mentioned earlier, makes it possible to run existing distributed applications on YARN. Users can run their own instances of an application (such as HBase) on a cluster, independently of other users, which means that different users can run different versions of the same application.
Slider provides controls to change the number of nodes an application is running on, and to suspend then resume a running application.

Apache Twill is similar to Slider, but in addition provides a simple programming model for developing distributed applications on YARN. Twill allows you to define cluster processes as an extension of a Java Runnable, then runs them in YARN containers on the cluster. Twill also provides support for, among other things, real-time logging (log events from runnables are streamed back to the client) and command messages (sent from the client to runnables).
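As a taste of Twill’s programming model, here is a minimal sketch of a cluster process written against Twill’s AbstractTwillRunnable base class; the class name and log message are hypothetical, and the submission code is omitted.

import org.apache.twill.api.AbstractTwillRunnable;

// A cluster process defined in Twill's programming model. Twill packages
// this class, runs it in a YARN container, and streams its output back to
// the client as real-time log events.
public class EchoRunnable extends AbstractTwillRunnable {
  @Override
  public void run() {
    // Application logic executes inside a YARN container on the cluster.
    System.out.println("Hello from a YARN container");
  }
}

Such a runnable is submitted through Twill’s TwillRunner interface (the YARN-backed implementation is YarnTwillRunnerService, which also requires a ZooKeeper connection string); the exact setup code varies between Twill versions, so consult the Twill documentation.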
In cases where none of these options are sufficient (such as an application that has complex scheduling requirements), the distributed shell application that is a part of the YARN project itself serves as an example of how to write a YARN application. It demonstrates how to use YARN’s client APIs to handle communication between the client or application master and the YARN daemons.

YARN Compared to MapReduce 1

The distributed implementation of MapReduce in the original version of Hadoop (version 1 and earlier) is sometimes referred to as “MapReduce 1” to distinguish it from MapReduce 2, the implementation that uses YARN (in Hadoop 2 and later).

It’s important to realize that the old and new MapReduce APIs are not the same thing as the MapReduce 1 and MapReduce 2 implementations. The APIs are user-facing client-side features and determine how you write MapReduce programs (see Appendix D), whereas the implementations are just different ways of running MapReduce programs. All four combinations are supported: both the old and new MapReduce APIs run on both MapReduce 1 and 2.
In MapReduce 1, there are two types of daemon that control the job execution process: a jobtracker and one or more tasktrackers. The jobtracker coordinates all the jobs run on the system by scheduling tasks to run on tasktrackers. Tasktrackers run tasks and send progress reports to the jobtracker, which keeps a record of the overall progress of each job. If a task fails, the jobtracker can reschedule it on a different tasktracker.
In MapReduce 1, the jobtracker takes care of both job scheduling (matching tasks with tasktrackers) and task progress monitoring (keeping track of tasks, restarting failed or slow tasks, and doing task bookkeeping, such as maintaining counter totals). By contrast, in YARN these responsibilities are handled by separate entities: the resource manager and an application master (one for each MapReduce job). The jobtracker is also responsible for storing job history for completed jobs, although it is possible to run a job history server as a separate daemon to take the load off the jobtracker.