The script’s location is controlled by the property net.topology.script.file.name. The script must accept a variable number of arguments that are the hostnames or IP addresses to be mapped, and it must emit the corresponding network locations to standard output, separated by whitespace. The Hadoop wiki has an example.

If no script location is specified, the default behavior is to map all nodes to a single network location, called /default-rack.
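The wiki example is not reproduced here, but as a rough sketch of the idea, a topology script can be a small shell wrapper around a host-to-rack lookup table. The data file path and its one-mapping-per-line format below are assumptions made purely for illustration:

#!/bin/bash
# Hypothetical topology script: for each hostname or IP address passed as an
# argument, print the corresponding network location to standard output.
# Assumes a lookup file with lines of the form "<host> <rack>".
DATA_FILE=/etc/hadoop/topology.data

for host in "$@"; do
  rack=$(awk -v h="$host" '$1 == h { print $2; exit }' "$DATA_FILE")
  # Fall back to the default rack for hosts that are not listed.
  echo "${rack:-/default-rack}"
done

Whatever form the script takes, it needs to be executable by the daemon that invokes it, and it is referenced from net.topology.script.file.name in the configuration.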
Cluster Setup and Installation

This section describes how to install and configure a basic Hadoop cluster from scratch using the Apache Hadoop distribution on a Unix operating system. It provides background information on the things you need to think about when setting up Hadoop. For a production installation, most users and operators should consider one of the Hadoop cluster management tools listed at the beginning of this chapter.

Installing Java

Hadoop runs on both Unix and Windows operating systems, and requires Java to be installed. For a production installation, you should select a combination of operating system, Java, and Hadoop that has been certified by the vendor of the Hadoop distribution you are using. There is also a page on the Hadoop wiki that lists combinations that community members have run with success.

Creating Unix User Accounts

It’s good practice to create dedicated Unix user accounts to separate the Hadoop processes from each other, and from other services running on the same machine. The HDFS, MapReduce, and YARN services are usually run as separate users, named hdfs, mapred, and yarn, respectively. They all belong to the same hadoop group.
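On a typical Linux system, the group and the three accounts might be created along the following lines; this is only a sketch, and the exact commands and options depend on your operating system and local conventions:

% sudo groupadd hadoop
% sudo useradd -g hadoop hdfs
% sudo useradd -g hadoop yarn
% sudo useradd -g hadoop mapred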
Installing Hadoop

Download Hadoop from the Apache Hadoop releases page, and unpack the contents of the distribution in a sensible location, such as /usr/local (/opt is another standard choice; note that Hadoop should not be installed in a user’s home directory, as that may be an NFS-mounted directory):

% cd /usr/local
% sudo tar xzf hadoop-x.y.z.tar.gz

You also need to change the owner of the Hadoop files to be the hadoop user and group:

% sudo chown -R hadoop:hadoop hadoop-x.y.z

It’s convenient to put the Hadoop binaries on the shell path too:

% export HADOOP_HOME=/usr/local/hadoop-x.y.z
% export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

Configuring SSH

The Hadoop control scripts (but not the daemons) rely on SSH to perform cluster-wide operations.
For example, there is a script for stopping and starting all the daemons in the cluster. Note that the control scripts are optional—cluster-wide operations can be performed by other mechanisms, too, such as a distributed shell or dedicated Hadoop management applications.

To work seamlessly, SSH needs to be set up to allow passwordless login for the hdfs and yarn users from machines in the cluster.2 The simplest way to achieve this is to generate a public/private key pair and place it in an NFS location that is shared across the cluster.

2. The mapred user doesn’t use SSH, as in Hadoop 2 and later, the only MapReduce daemon is the job history server.

First, generate an RSA key pair by typing the following. You need to do this twice, once as the hdfs user and once as the yarn user:

% ssh-keygen -t rsa -f ~/.ssh/id_rsa

Even though we want passwordless logins, keys without passphrases are not considered good practice (it’s OK to have an empty passphrase when running a local pseudodistributed cluster, as described in Appendix A), so we specify a passphrase when prompted for one. We use ssh-agent to avoid the need to enter a password for each connection.
The private key is in the file specified by the -f option, ~/.ssh/id_rsa, and the public key is stored in a file with the same name but with .pub appended, ~/.ssh/id_rsa.pub.

Next, we need to make sure that the public key is in the ~/.ssh/authorized_keys file on all the machines in the cluster that we want to connect to. If the users’ home directories are stored on an NFS filesystem, the keys can be shared across the cluster by typing the following (first as hdfs and then as yarn):

% cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

If the home directory is not shared using NFS, the public keys will need to be shared by some other means (such as ssh-copy-id).
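For instance, ssh-copy-id can append a public key to a worker’s authorized_keys file over SSH. The hostname below is a placeholder, and the command would be repeated for each worker (and run as each of the hdfs and yarn users):

% ssh-copy-id -i ~/.ssh/id_rsa.pub hdfs@worker-node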
Test that you can SSH from the master to a worker machine by making sure ssh-agent is running,3 and then run ssh-add to store your passphrase. You should be able to SSH to a worker without entering the passphrase again.

3. See its man page for instructions on how to start ssh-agent.
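A test session might look something like the following, where worker-node is a placeholder hostname and the agent is started in the usual way for a Bourne-style shell:

% eval "$(ssh-agent -s)"
% ssh-add
% ssh worker-node

The first command starts the agent, ssh-add prompts for the passphrase once, and the final login should then succeed without any prompt.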
Configuring Hadoop

Hadoop must have its configuration set appropriately to run in distributed mode on a cluster. The important configuration settings to achieve this are discussed in “Hadoop Configuration” on page 292.

Formatting the HDFS Filesystem

Before it can be used, a brand-new HDFS installation needs to be formatted. The formatting process creates an empty filesystem by creating the storage directories and the initial versions of the namenode’s persistent data structures. Datanodes are not involved in the initial formatting process, since the namenode manages all of the filesystem’s metadata, and datanodes can join or leave the cluster dynamically. For the same reason, you don’t need to say how large a filesystem to create, since this is determined by the number of datanodes in the cluster, which can be increased as needed, long after the filesystem is formatted.
Formatting HDFS is a fast operation. Run the following command as the hdfs user:

% hdfs namenode -format

Starting and Stopping the Daemons

Hadoop comes with scripts for running commands and starting and stopping daemons across the whole cluster. To use these scripts (which can be found in the sbin directory), you need to tell Hadoop which machines are in the cluster. There is a file for this purpose, called slaves, which contains a list of the machine hostnames or IP addresses, one per line.

The slaves file lists the machines that the datanodes and node managers should run on. It resides in Hadoop’s configuration directory, although it may be placed elsewhere (and given another name) by changing the HADOOP_SLAVES setting in hadoop-env.sh. Also, this file does not need to be distributed to worker nodes, since it is used only by the control scripts running on the namenode or resource manager.
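For example, the slaves file for a cluster with three workers might contain nothing more than their hostnames (the names here are placeholders), one per line:

% cat $HADOOP_HOME/etc/hadoop/slaves
worker1
worker2
worker3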
The HDFS daemons are started by running the following command as the hdfs user:

% start-dfs.sh

The machine (or machines) that the namenode and secondary namenode run on is determined by interrogating the Hadoop configuration for their hostnames. For example, the script finds the namenode’s hostname by executing the following:

% hdfs getconf -namenodes

By default, this finds the namenode’s hostname from fs.defaultFS.
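If you want to check the value the script is working from, the key can also be queried directly with getconf:

% hdfs getconf -confKey fs.defaultFS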
In slightly more detail, the start-dfs.sh script does the following:

• Starts a namenode on each machine returned by executing hdfs getconf -namenodes4
• Starts a datanode on each machine listed in the slaves file
• Starts a secondary namenode on each machine returned by executing hdfs getconf -secondarynamenodes

4. There can be more than one namenode when running HDFS HA.

The YARN daemons are started in a similar way, by running the following command as the yarn user on the machine hosting the resource manager:

% start-yarn.sh

In this case, the resource manager is always run on the machine from which the start-yarn.sh script was run. More specifically, the script:

• Starts a resource manager on the local machine
• Starts a node manager on each machine listed in the slaves file

Also provided are stop-dfs.sh and stop-yarn.sh scripts to stop the daemons started by the corresponding start scripts.

These scripts start and stop Hadoop daemons using the hadoop-daemon.sh script (or the yarn-daemon.sh script, in the case of YARN). If you use the aforementioned scripts, you shouldn’t call hadoop-daemon.sh directly. But if you need to control Hadoop daemons from another system or from your own scripts, the hadoop-daemon.sh script is a good integration point. Likewise, hadoop-daemons.sh (with an “s”) is handy for starting the same daemon on a set of hosts.
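For example, stopping and restarting the datanode on a single worker, or starting its node manager, might look like this when run on the machine in question, using the daemon names the Hadoop 2 scripts expect:

% hadoop-daemon.sh stop datanode
% hadoop-daemon.sh start datanode
% yarn-daemon.sh start nodemanager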
Finally, there is only one MapReduce daemon—the job history server, which is started as follows, as the mapred user:

% mr-jobhistory-daemon.sh start historyserver

Creating User Directories

Once you have a Hadoop cluster up and running, you need to give users access to it. This involves creating a home directory for each user and setting ownership permissions on it:

% hadoop fs -mkdir /user/username
% hadoop fs -chown username:username /user/username

This is a good time to set space limits on the directory.
The following sets a 1 TB limit on the given user directory:

% hdfs dfsadmin -setSpaceQuota 1t /user/username

Hadoop Configuration

There are a handful of files for controlling the configuration of a Hadoop installation; the most important ones are listed in Table 10-1.

Table 10-1. Hadoop configuration files

Filename                   | Format                   | Description
hadoop-env.sh              | Bash script              | Environment variables that are used in the scripts to run Hadoop
mapred-env.sh              | Bash script              | Environment variables that are used in the scripts to run MapReduce (overrides variables set in hadoop-env.sh)
yarn-env.sh                | Bash script              | Environment variables that are used in the scripts to run YARN (overrides variables set in hadoop-env.sh)
core-site.xml              | Hadoop configuration XML | Configuration settings for Hadoop Core, such as I/O settings that are common to HDFS, MapReduce, and YARN
hdfs-site.xml              | Hadoop configuration XML | Configuration settings for HDFS daemons: the namenode, the secondary namenode, and the datanodes
mapred-site.xml            | Hadoop configuration XML | Configuration settings for MapReduce daemons: the job history server
yarn-site.xml              | Hadoop configuration XML | Configuration settings for YARN daemons: the resource manager, the web app proxy server, and the node managers
slaves                     | Plain text               | A list of machines (one per line) that each run a datanode and a node manager
hadoop-metrics2.properties | Java properties          | Properties for controlling how metrics are published in Hadoop (see “Metrics and JMX” on page 331)
log4j.properties           | Java properties          | Properties for system logfiles, the namenode audit log, and the task log for the task JVM process (“Hadoop Logs” on page 172)
hadoop-policy.xml          | Hadoop configuration XML | Configuration settings for access control lists when running Hadoop in secure mode

These files are all found in the etc/hadoop directory of the Hadoop distribution. The configuration directory can be relocated to another part of the filesystem (outside the Hadoop installation, which makes upgrades marginally easier) as long as daemons are started with the --config option (or, equivalently, with the HADOOP_CONF_DIR environment variable set) specifying the location of this directory on the local filesystem.
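For example, if the configuration is kept in /etc/hadoop/conf (a path chosen purely for illustration), the HDFS daemons could be started against that directory in either of the following ways:

% start-dfs.sh --config /etc/hadoop/conf

% export HADOOP_CONF_DIR=/etc/hadoop/conf
% start-dfs.sh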
Configuration Management

Hadoop does not have a single, global location for configuration information. Instead, each Hadoop node in the cluster has its own set of configuration files, and it is up to administrators to ensure that they are kept in sync across the system. There are parallel shell tools that can help do this, such as dsh or pdsh. This is an area where Hadoop cluster management tools like Cloudera Manager and Apache Ambari really shine, since they take care of propagating changes across the cluster.

Hadoop is designed so that it is possible to have a single set of configuration files that are used for all master and worker machines.
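As a rough sketch of keeping the files in sync without a management tool, the configuration directory could be pushed from the machine holding the master copy, either with a parallel shell such as pdsh or with a simple loop like the one below; the use of rsync and the reuse of the slaves file as the host list are assumptions made for illustration:

for host in $(cat $HADOOP_HOME/etc/hadoop/slaves); do
  # Push the master copy of the configuration directory to each worker.
  rsync -az $HADOOP_HOME/etc/hadoop/ $host:$HADOOP_HOME/etc/hadoop/
done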