The great advantage of this is simplicity, both conceptually (since there is only one configuration to deal with) and operationally (as the Hadoop scripts are sufficient to manage a single configuration setup).

For some clusters, the one-size-fits-all configuration model breaks down. For example, if you expand the cluster with new machines that have a different hardware specification from the existing ones, you need a different configuration for the new machines to take advantage of their extra resources.

In these cases, you need to have the concept of a class of machine and maintain a separate configuration for each class. Hadoop doesn't provide tools to do this, but there are several excellent tools for doing precisely this type of configuration management, such as Chef, Puppet, CFEngine, and Bcfg2.

For a cluster of any size, it can be a challenge to keep all of the machines in sync.
Consider what happens if the machine is unavailable when you push out an update. Who ensures it gets the update when it becomes available? This is a big problem and can lead to divergent installations, so even if you use the Hadoop control scripts for managing Hadoop, it may be a good idea to use configuration management tools for maintaining the cluster.
These tools are also excellent for doing regular maintenance, such as patching security holes and updating system packages.

Environment Settings

In this section, we consider how to set the variables in hadoop-env.sh. There are also analogous configuration files for MapReduce and YARN (but not for HDFS), called mapred-env.sh and yarn-env.sh, where variables pertaining to those components can be set. Note that the MapReduce and YARN files override the values set in hadoop-env.sh.

Java

The location of the Java implementation to use is determined by the JAVA_HOME setting in hadoop-env.sh or the JAVA_HOME shell environment variable, if not set in hadoop-env.sh. It's a good idea to set the value in hadoop-env.sh, so that it is clearly defined in one place and to ensure that the whole cluster is using the same version of Java.
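For example, a single line in hadoop-env.sh pins the JVM for every daemon. The path shown here is only illustrative; point it at wherever your JDK is actually installed:

# hadoop-env.sh -- example JDK location only; substitute your own
export JAVA_HOME=/usr/lib/jvm/java-7-oracle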
Memory heap size

By default, Hadoop allocates 1,000 MB (1 GB) of memory to each daemon it runs. This is controlled by the HADOOP_HEAPSIZE setting in hadoop-env.sh. There are also environment variables to allow you to change the heap size for a single daemon. For example, you can set YARN_RESOURCEMANAGER_HEAPSIZE in yarn-env.sh to override the heap size for the resource manager.
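As a sketch (the sizes shown are placeholders rather than recommendations), both settings are given in megabytes:

# hadoop-env.sh -- heap for each daemon started with this environment, in MB
export HADOOP_HEAPSIZE=1000

# yarn-env.sh -- override the heap for the resource manager only, in MB
export YARN_RESOURCEMANAGER_HEAPSIZE=2000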
Surprisingly, there are no corresponding environment variables for HDFS daemons, despite it being very common to give the namenode more heap space. There is another way to set the namenode heap size, however; this is discussed in the following sidebar.

How Much Memory Does a Namenode Need?

A namenode can eat up memory, since a reference to every block of every file is maintained in memory. It's difficult to give a precise formula because memory usage depends on the number of blocks per file, the filename length, and the number of directories in the filesystem; plus, it can change from one Hadoop release to another.

The default of 1,000 MB of namenode memory is normally enough for a few million files, but as a rule of thumb for sizing purposes, you can conservatively allow 1,000 MB per million blocks of storage.

For example, a 200-node cluster with 24 TB of disk space per node, a block size of 128 MB, and a replication factor of 3 has room for about 12.5 million blocks (or more): 200 × 24,000,000 MB / (128 MB × 3).
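The arithmetic is easy to check in the shell (this just restates the worked example above; it is not a sizing tool):

# 200 nodes x 24,000,000 MB each, divided by (128 MB per block x 3 replicas)
echo $((200 * 24000000 / (128 * 3)))        # => 12500000 blocks
# at roughly 1,000 MB of heap per million blocks:
echo $((200 * 24000000 / (128 * 3) / 1000)) # => 12500 MB of namenode heap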
So in this case, setting the namenode memory to 12,000 MB would be a good starting point.

You can increase the namenode's memory without changing the memory allocated to other Hadoop daemons by setting HADOOP_NAMENODE_OPTS in hadoop-env.sh to include a JVM option for setting the memory size. HADOOP_NAMENODE_OPTS allows you to pass extra options to the namenode's JVM. So, for example, if you were using a Sun JVM, -Xmx2000m would specify that 2,000 MB of memory should be allocated to the namenode.

If you change the namenode's memory allocation, don't forget to do the same for the secondary namenode (using the HADOOP_SECONDARYNAMENODE_OPTS variable), since its memory requirements are comparable to the primary namenode's.
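Putting these together, a hadoop-env.sh fragment along the following lines would do it (the 2,000 MB figure is just the example value from above):

# hadoop-env.sh -- extra JVM options for the namenode and secondary namenode
export HADOOP_NAMENODE_OPTS="-Xmx2000m"
export HADOOP_SECONDARYNAMENODE_OPTS="-Xmx2000m"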
In addition to the memory requirements of the daemons, the node manager allocates containers to applications, so we need to factor these into the total memory footprint of a worker machine; see "Memory settings in YARN and MapReduce" on page 301.

System logfiles

System logfiles produced by Hadoop are stored in $HADOOP_HOME/logs by default. This can be changed using the HADOOP_LOG_DIR setting in hadoop-env.sh. It's a good idea to change this so that logfiles are kept out of the directory that Hadoop is installed in. Changing this keeps logfiles in one place, even after the installation directory changes due to an upgrade. A common choice is /var/log/hadoop, set by including the following line in hadoop-env.sh:

export HADOOP_LOG_DIR=/var/log/hadoop

The log directory will be created if it doesn't already exist. (If it does not exist, confirm that the relevant Unix Hadoop user has permission to create it.) Each Hadoop daemon running on a machine produces two logfiles. The first is the log output written via log4j. This file, whose name ends in .log, should be the first port of call when diagnosing problems because most application log messages are written here.
The standard Hadoop log4j configuration uses a daily rolling file appender to rotate logfiles. Old logfiles are never deleted, so you should arrange for them to be periodically deleted or archived, so as to not run out of disk space on the local node.
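Hadoop leaves the retention policy up to you. One sketch, assuming logs live under /var/log/hadoop and that 30 days of history is enough for your purposes, is a small daily cron job:

#!/bin/sh
# Sketch of a daily cleanup job (e.g., installed as /etc/cron.daily/hadoop-logs):
# remove rotated log4j files older than 30 days. Adjust the directory and the
# retention period to your own layout and policy, or archive instead of deleting.
find /var/log/hadoop -name '*.log.*' -mtime +30 -delete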
The second logfile is the combined standard output and standard error log. This logfile, whose name ends in .out, usually contains little or no output, since Hadoop uses log4j for logging. It is rotated only when the daemon is restarted, and only the last five logs are retained. Old logfiles are suffixed with a number between 1 and 5, with 5 being the oldest file.

Logfile names (of both types) are a combination of the name of the user running the daemon, the daemon name, and the machine hostname. For example, hadoop-hdfs-datanode-ip-10-45-174-112.log.2014-09-20 is the name of a logfile after it has been rotated. This naming structure makes it possible to archive logs from all machines in the cluster in a single directory, if needed, since the filenames are unique.

The username in the logfile name is actually the default for the HADOOP_IDENT_STRING setting in hadoop-env.sh. If you wish to give the Hadoop instance a different identity for the purposes of naming the logfiles, change HADOOP_IDENT_STRING to be the identifier you want.

SSH settings

The control scripts allow you to run commands on (remote) worker nodes from the master node using SSH.
It can be useful to customize the SSH settings, for various reasons. For example, you may want to reduce the connection timeout (using the ConnectTimeout option) so the control scripts don't hang around waiting to see whether a dead node is going to respond. Obviously, this can be taken too far. If the timeout is too low, then busy nodes will be skipped, which is bad.

Another useful SSH setting is StrictHostKeyChecking, which can be set to no to automatically add new host keys to the known hosts files.
The default, ask, prompts the user to confirm that the key fingerprint has been verified, which is not a suitable setting in a large cluster environment.[5]

To pass extra options to SSH, define the HADOOP_SSH_OPTS environment variable in hadoop-env.sh. See the ssh and ssh_config manual pages for more SSH settings.
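For example, the two options discussed above could be combined like this (the timeout value is only illustrative):

# hadoop-env.sh -- SSH options used by the control scripts
export HADOOP_SSH_OPTS="-o ConnectTimeout=10 -o StrictHostKeyChecking=no"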
Important Hadoop Daemon Properties

Hadoop has a bewildering number of configuration properties. In this section, we address the ones that you need to define (or at least understand why the default is appropriate) for any real-world working cluster. These properties are set in the Hadoop site files: core-site.xml, hdfs-site.xml, and yarn-site.xml. Typical instances of these files are shown in Examples 10-1, 10-2, and 10-3.[6] You can learn more about the format of Hadoop's configuration files in "The Configuration API" on page 141.

To find the actual configuration of a running daemon, visit the /conf page on its web server. For example, http://resource-manager-host:8088/conf shows the configuration that the resource manager is running with. This page shows the combined site and default configuration files that the daemon is running with, and also shows which file each property was picked up from.
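The page is served over plain HTTP, so it can also be fetched from the command line or a script; for instance (using the same placeholder hostname as above):

# Dump the live configuration of the resource manager
curl http://resource-manager-host:8088/conf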
[5] For more discussion on the security implications of SSH host keys, consult the article "SSH Host Key Protection" by Brian Hatch.

[6] Notice that there is no site file for MapReduce shown here. This is because the only MapReduce daemon is the job history server, and the defaults are sufficient.

Example 10-1. A typical core-site.xml configuration file

<?xml version="1.0"?>
<!-- core-site.xml -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode/</value>
  </property>
</configuration>

Example 10-2. A typical hdfs-site.xml configuration file

<?xml version="1.0"?>
<!-- hdfs-site.xml -->
<configuration>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/disk1/hdfs/name,/remote/hdfs/name</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/disk1/hdfs/data,/disk2/hdfs/data</value>
  </property>
  <property>
    <name>dfs.namenode.checkpoint.dir</name>
    <value>/disk1/hdfs/namesecondary,/disk2/hdfs/namesecondary</value>
  </property>
</configuration>

Example 10-3. A typical yarn-site.xml configuration file