(The default is 8,192 MB, which is normally too low for most setups.)

The next step is to determine how to set memory options for individual jobs. There are two main controls: one for the size of the container allocated by YARN, and another for the heap size of the Java process run in the container.

The memory controls for MapReduce are all set by the client in the job configuration. The YARN settings are cluster settings and cannot be modified by the client.

Container sizes are determined by mapreduce.map.memory.mb and mapreduce.reduce.memory.mb; both default to 1,024 MB. These settings are used by the application master when negotiating for resources in the cluster, and also by the node manager, which runs and monitors the task containers.
The heap size of the Java process is set by mapred.child.java.opts, and defaults to 200 MB. You can also set the Java options separately for map and reduce tasks (see Table 10-4).

Table 10-4. MapReduce job memory properties (set by the client)

    mapreduce.map.memory.mb (int, default: 1024)
        The amount of memory for map containers.

    mapreduce.reduce.memory.mb (int, default: 1024)
        The amount of memory for reduce containers.

    mapred.child.java.opts (String, default: -Xmx200m)
        The JVM options used to launch the container process that runs map and
        reduce tasks. In addition to memory settings, this property can include
        JVM properties for debugging, for example.

    mapreduce.map.java.opts (String, default: -Xmx200m)
        The JVM options used for the child process that runs map tasks.

    mapreduce.reduce.java.opts (String, default: -Xmx200m)
        The JVM options used for the child process that runs reduce tasks.

For example, suppose mapred.child.java.opts is set to -Xmx800m and mapreduce.map.memory.mb is left at its default value of 1,024 MB.
When a map task is run, the node manager will allocate a 1,024 MB container (decreasing the size of its pool by that amount for the duration of the task) and will launch the task JVM configured with an 800 MB maximum heap size. Note that the JVM process will have a larger memory footprint than the heap size, and the overhead will depend on such things as the native libraries that are in use, the size of the permanent generation space, and so on. The important thing is that the physical memory used by the JVM process, including any processes that it spawns, such as Streaming processes, does not exceed its allocation (1,024 MB). If a container uses more memory than it has been allocated, then it may be terminated by the node manager and marked as failed.
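As a minimal sketch, the two controls from this example could be expressed in a job's configuration like so (the values are the ones used above):

    <!-- Job configuration (set by the client) -->
    <property>
      <!-- Container size requested from YARN for each map task, in MB -->
      <name>mapreduce.map.memory.mb</name>
      <value>1024</value>
    </property>
    <property>
      <!-- Maximum heap of the task JVM; must fit within the container,
           leaving headroom for non-heap memory -->
      <name>mapred.child.java.opts</name>
      <value>-Xmx800m</value>
    </property>

Equivalently, a client that uses ToolRunner could pass these as -D options on the command line rather than editing a configuration file.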
YARN schedulers impose a minimum and a maximum on memory allocations. The default minimum is 1,024 MB (set by yarn.scheduler.minimum-allocation-mb), and the default maximum is 8,192 MB (set by yarn.scheduler.maximum-allocation-mb).

There are also virtual memory constraints that a container must meet. If a container's virtual memory usage exceeds a given multiple of the allocated physical memory, the node manager may terminate the process. The multiple is expressed by the yarn.nodemanager.vmem-pmem-ratio property, which defaults to 2.1. In the example used earlier, the virtual memory threshold above which the task may be terminated is 2,150 MB, which is 2.1 × 1,024 MB.

When configuring memory parameters it's very useful to be able to monitor a task's actual memory usage during a job run, and this is possible via MapReduce task counters. The counters PHYSICAL_MEMORY_BYTES, VIRTUAL_MEMORY_BYTES, and COMMITTED_HEAP_BYTES (described in Table 9-2) provide snapshot values of memory usage and are therefore suitable for observation during the course of a task attempt.
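For illustration, the three cluster-side limits just described would be set in yarn-site.xml; the fragment below simply restates the defaults, so it is a sketch rather than a tuning recommendation:

    <configuration>
      <property>
        <!-- Smallest container the scheduler will allocate, in MB -->
        <name>yarn.scheduler.minimum-allocation-mb</name>
        <value>1024</value>
      </property>
      <property>
        <!-- Largest container the scheduler will allocate, in MB -->
        <name>yarn.scheduler.maximum-allocation-mb</name>
        <value>8192</value>
      </property>
      <property>
        <!-- Virtual-to-physical memory ratio enforced by the node manager -->
        <name>yarn.nodemanager.vmem-pmem-ratio</name>
        <value>2.1</value>
      </property>
    </configuration>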
Hadoop also provides settings to control how much memory is used for MapReduce operations. These can be set on a per-job basis and are covered in "Shuffle and Sort" on page 197.

CPU settings in YARN and MapReduce

In addition to memory, YARN treats CPU usage as a managed resource, and applications can request the number of cores they need. The number of cores that a node manager can allocate to containers is controlled by the yarn.nodemanager.resource.cpu-vcores property. It should be set to the total number of cores on the machine, minus a core for each daemon process running on the machine (datanode, node manager, and any other long-running processes).

MapReduce jobs can control the number of cores allocated to map and reduce containers by setting mapreduce.map.cpu.vcores and mapreduce.reduce.cpu.vcores.
Both default to 1, an appropriate setting for normal single-threaded MapReduce tasks, which can only saturate a single core.
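As a sketch of how the two sides fit together, the cluster-side core count goes in yarn-site.xml and the per-job request goes in the job configuration. The machine size here is hypothetical: a 16-core worker that also runs a datanode and a node manager, leaving 14 cores for containers:

    <!-- yarn-site.xml (cluster side): cores available for containers on a
         hypothetical 16-core worker running a datanode and a node manager
         (16 - 2 = 14) -->
    <property>
      <name>yarn.nodemanager.resource.cpu-vcores</name>
      <value>14</value>
    </property>

    <!-- Job configuration (client side): request two cores per map task,
         for example for a mapper that runs two worker threads -->
    <property>
      <name>mapreduce.map.cpu.vcores</name>
      <value>2</value>
    </property>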
While the number of cores is tracked during scheduling (so a container won't be allocated on a machine where there are no spare cores, for example), the node manager will not, by default, limit actual CPU usage of running containers. This means that a container can abuse its allocation by using more CPU than it was given, possibly starving other containers running on the same host. YARN has support for enforcing CPU limits using Linux cgroups. The node manager's container executor class (yarn.nodemanager.container-executor.class) must be set to use the LinuxContainerExecutor class, which in turn must be configured to use cgroups (see the properties under yarn.nodemanager.linux-container-executor), as sketched below.
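A minimal yarn-site.xml sketch of enabling this might look as follows. The class names shown are the ones that ship with Hadoop 2, but the Linux container executor has further prerequisites (such as the yarn.nodemanager.linux-container-executor.* properties and OS-level cgroup setup), so verify the full property set against your Hadoop version:

    <configuration>
      <property>
        <!-- Run containers through the Linux container executor... -->
        <name>yarn.nodemanager.container-executor.class</name>
        <value>org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor</value>
      </property>
      <property>
        <!-- ...and have it enforce CPU limits via cgroups -->
        <name>yarn.nodemanager.linux-container-executor.resources-handler.class</name>
        <value>org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler</value>
      </property>
    </configuration>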
Hadoop Daemon Addresses and Ports

Hadoop daemons generally run both an RPC server for communication between daemons (Table 10-5) and an HTTP server to provide web pages for human consumption (Table 10-6). Each server is configured by setting the network address and port number to listen on. A port number of 0 instructs the server to start on a free port, but this is generally discouraged because it is incompatible with setting cluster-wide firewall policies.

In general, the properties for setting a server's RPC and HTTP addresses serve double duty: they determine the network interface that the server will bind to, and they are used by clients or other machines in the cluster to connect to the server. For example, node managers use the yarn.resourcemanager.resource-tracker.address property to find the address of their resource manager.

It is often desirable for servers to bind to multiple network interfaces, but setting the network address to 0.0.0.0, which works for the server, breaks the second case, since the address is not resolvable by clients or other machines in the cluster. One solution is to have separate configurations for clients and servers, but a better way is to set the bind host for the server. By setting yarn.resourcemanager.hostname to the (externally resolvable) hostname or IP address and yarn.resourcemanager.bind-host to 0.0.0.0, you ensure that the resource manager will bind to all addresses on the machine, while at the same time providing a resolvable address for node managers and clients.
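For instance, a resource manager on a host named resourcemanager.example.com (a hypothetical name) could be configured along these lines in yarn-site.xml:

    <configuration>
      <property>
        <!-- Resolvable name advertised to node managers and clients -->
        <name>yarn.resourcemanager.hostname</name>
        <value>resourcemanager.example.com</value>
      </property>
      <property>
        <!-- Bind the RPC and HTTP servers to all interfaces on the machine -->
        <name>yarn.resourcemanager.bind-host</name>
        <value>0.0.0.0</value>
      </property>
    </configuration>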
In addition to an RPC server, datanodes run a TCP/IP server for block transfers. The server address and port are set by the dfs.datanode.address property, which has a default value of 0.0.0.0:50010.

Table 10-5. RPC server properties

    fs.defaultFS (default: file:///)
        When set to an HDFS URI, this property determines the namenode's RPC
        server address and port. The default port is 8020 if not specified.

    dfs.namenode.rpc-bind-host (default: unset)
        The address the namenode's RPC server will bind to. If not set (the
        default), the bind address is determined by fs.defaultFS. It can be
        set to 0.0.0.0 to make the namenode listen on all interfaces.

    dfs.datanode.ipc.address (default: 0.0.0.0:50020)
        The datanode's RPC server address and port.

    mapreduce.jobhistory.address (default: 0.0.0.0:10020)
        The job history server's RPC server address and port. This is used by
        the client (typically outside the cluster) to query job history.

    mapreduce.jobhistory.bind-host (default: unset)
        The address the job history server's RPC and HTTP servers will bind to.

    yarn.resourcemanager.hostname (default: 0.0.0.0)
        The hostname of the machine the resource manager runs on. Abbreviated
        ${y.rm.hostname} below.

    yarn.resourcemanager.bind-host (default: unset)
        The address the resource manager's RPC and HTTP servers will bind to.

    yarn.resourcemanager.address (default: ${y.rm.hostname}:8032)
        The resource manager's RPC server address and port. This is used by
        the client (typically outside the cluster) to communicate with the
        resource manager.

    yarn.resourcemanager.admin.address (default: ${y.rm.hostname}:8033)
        The resource manager's admin RPC server address and port. This is used
        by the admin client (invoked with yarn rmadmin, typically run outside
        the cluster) to communicate with the resource manager.

    yarn.resourcemanager.scheduler.address (default: ${y.rm.hostname}:8030)
        The resource manager scheduler's RPC server address and port. This is
        used by (in-cluster) application masters to communicate with the
        resource manager.

    yarn.resourcemanager.resource-tracker.address (default: ${y.rm.hostname}:8031)
        The resource manager resource tracker's RPC server address and port.
        This is used by (in-cluster) node managers to communicate with the
        resource manager.

    yarn.nodemanager.hostname (default: 0.0.0.0)
        The hostname of the machine the node manager runs on. Abbreviated
        ${y.nm.hostname} below.

    yarn.nodemanager.bind-host (default: unset)
        The address the node manager's RPC and HTTP servers will bind to.

    yarn.nodemanager.address (default: ${y.nm.hostname}:0)
        The node manager's RPC server address and port. This is used by
        (in-cluster) application masters to communicate with node managers.

    yarn.nodemanager.localizer.address (default: ${y.nm.hostname}:8040)
        The node manager localizer's RPC server address and port.

Table 10-6. HTTP server properties

    dfs.namenode.http-address (default: 0.0.0.0:50070)
        The namenode's HTTP server address and port.

    dfs.namenode.http-bind-host (default: unset)
        The address the namenode's HTTP server will bind to.

    dfs.namenode.secondary.http-address (default: 0.0.0.0:50090)
        The secondary namenode's HTTP server address and port.

    dfs.datanode.http.address (default: 0.0.0.0:50075)
        The datanode's HTTP server address and port. (Note that the property
        name is inconsistent with the ones for the namenode.)

    mapreduce.jobhistory.webapp.address (default: 0.0.0.0:19888)
        The MapReduce job history server's address and port. This property is
        set in mapred-site.xml.

    mapreduce.shuffle.port (default: 13562)
        The shuffle handler's HTTP port number. This is used for serving map
        outputs, and is not a user-accessible web UI. This property is set in
        mapred-site.xml.

    yarn.resourcemanager.webapp.address (default: ${y.rm.hostname}:8088)
        The resource manager's HTTP server address and port.

    yarn.nodemanager.webapp.address (default: ${y.nm.hostname}:8042)
        The node manager's HTTP server address and port.

    yarn.web-proxy.address (default: unset)
        The web app proxy server's HTTP server address and port. If not set
        (the default), then the web app proxy server will run in the resource
        manager process.
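To make the first row of Table 10-5 concrete, here is a sketch of a core-site.xml entry that places the namenode's RPC server on a hypothetical host (namenode.example.com) at the default port:

    <configuration>
      <property>
        <!-- HDFS URI; the namenode's RPC server listens on port 8020
             if no port is given -->
        <name>fs.defaultFS</name>
        <value>hdfs://namenode.example.com:8020</value>
      </property>
    </configuration>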
There is also a setting for controlling which network interfaces the datanodes use as their IP addresses (for HTTP and RPC servers). The relevant property is dfs.datanode.dns.interface, which is set to default to use the default network interface. You can set this explicitly to report the address of a particular interface (eth0, for example).

Other Hadoop Properties

This section discusses some other properties that you might consider setting.

Cluster membership

To aid in the addition and removal of nodes in the future, you can specify a file containing a list of authorized machines that may join the cluster as datanodes or node managers. The file is specified using the dfs.hosts and yarn.resourcemanager.nodes.include-path properties (for datanodes and node managers, respectively), and the corresponding dfs.hosts.exclude and yarn.resourcemanager.nodes.exclude-path properties specify the files used for decommissioning. See "Commissioning and Decommissioning Nodes" on page 334 for further discussion.

Buffer size

Hadoop uses a buffer size of 4 KB (4,096 bytes) for its I/O operations.
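This default is conservative, and it is common to increase it; 128 KB (131,072 bytes) is a frequently used value. As a sketch, the buffer size is set in bytes via the io.file.buffer.size property in core-site.xml:

    <configuration>
      <property>
        <!-- I/O buffer size in bytes; 128 KB is a common choice -->
        <name>io.file.buffer.size</name>
        <value>131072</value>
      </property>
    </configuration>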