For example:

% hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-*-tests.jar \
  TestDFSIO
TestDFSIO.1.7
Missing arguments.
Usage: TestDFSIO [genericOptions] -read [-random | -backward | -skip [-skipSize Size]] |
          -write | -append | -clean [-compression codecClassName] [-nrFiles N]
          [-size Size[B|KB|MB|GB|TB]] [-resFile resultFileName] [-bufferSize Bytes] [-rootDir]
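To actually measure HDFS throughput, you run a write benchmark followed by a matching read benchmark over the same files, and then clean up. The following is a minimal sketch rather than a tuned run: the file count and file size are illustrative values that you should scale to your cluster, and the summary statistics are appended to a local results file (TestDFSIO_results.log by default, which you can change with -resFile):

% hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-*-tests.jar \
  TestDFSIO -write -nrFiles 10 -size 1GB
% hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-*-tests.jar \
  TestDFSIO -read -nrFiles 10 -size 1GB
% hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-*-tests.jar \
  TestDFSIO -clean

Because the read benchmark reads back the files created by the write benchmark, the two sets of results are directly comparable.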
Benchmarking MapReduce with TeraSort

Hadoop comes with a MapReduce program called TeraSort that does a total sort of its input.9 It is very useful for benchmarking HDFS and MapReduce together, as the full input dataset is transferred through the shuffle. The three steps are: generate some random data, perform the sort, then validate the results.

First, we generate some random data using teragen (found in the examples JAR file, not the tests one). It runs a map-only job that generates a specified number of rows of binary data. Each row is 100 bytes long, so to generate one terabyte of data using 1,000 maps, run the following (10t is short for 10 trillion):

% hadoop jar \
  $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
  teragen -Dmapreduce.job.maps=1000 10t random-data

Next, run terasort:

% hadoop jar \
  $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
  terasort random-data sorted-data

The overall execution time of the sort is the metric we are interested in, but it’s instructive to watch the job’s progress via the web UI (http://resource-manager-host:8088/), where you can get a feel for how long each phase of the job takes. Adjusting the parameters mentioned in “Tuning a Job” on page 175 is a useful exercise, too.

As a final sanity check, we validate that the data in sorted-data is, in fact, correctly sorted:

% hadoop jar \
  $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
  teravalidate sorted-data report

This command runs a short MapReduce job that performs a series of checks on the sorted data to check whether the sort is accurate. Any errors can be found in the report/part-r-00000 output file.
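The validation output is just a file in HDFS, so you can inspect it with the filesystem shell. As a rough guide (the exact contents vary between Hadoop versions), a successful run typically contains only a checksum record, while any out-of-order keys show up as error records:

% hadoop fs -cat report/part-r-00000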
9. In 2008, TeraSort was used to break the world record for sorting 1 TB of data; see “A Brief History of Apache Hadoop” on page 12.

Other benchmarks

There are many more Hadoop benchmarks, but the following are widely used:

• TestDFSIO tests the I/O performance of HDFS. It does this by using a MapReduce job as a convenient way to read or write files in parallel.

• MRBench (invoked with mrbench) runs a small job a number of times. It acts as a good counterpoint to TeraSort, as it checks whether small job runs are responsive (a sample invocation is sketched after this list).

• NNBench (invoked with nnbench) is useful for load-testing namenode hardware (also sketched after this list).

• Gridmix is a suite of benchmarks designed to model a realistic cluster workload by mimicking a variety of data-access patterns seen in practice. See the documentation in the distribution for how to run Gridmix.

• SWIM, or the Statistical Workload Injector for MapReduce, is a repository of real-life MapReduce workloads that you can use to generate representative test workloads for your system.

• TPCx-HS is a standardized benchmark based on TeraSort from the Transaction Processing Performance Council.
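MRBench and NNBench live in the same tests JAR used for TestDFSIO earlier. The invocations below are a sketch, not a recommendation: option names and defaults differ between Hadoop releases, and the values for -numRuns, -operation, -maps, -numberOfFiles, and -baseDir are purely illustrative, so run each benchmark with no arguments first to print the usage for your version:

% hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-*-tests.jar \
  mrbench -numRuns 50
% hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-*-tests.jar \
  nnbench -operation create_write -maps 12 -numberOfFiles 1000 \
  -baseDir /benchmarks/NNBench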
User Jobs

For tuning, it is best to include a few jobs that are representative of the jobs that your users run, so your cluster is tuned for these and not just for the standard benchmarks. If this is your first Hadoop cluster and you don’t have any user jobs yet, then either Gridmix or SWIM is a good substitute.

When running your own jobs as benchmarks, you should select a dataset for your user jobs and use it each time you run the benchmarks to allow comparisons between runs. When you set up a new cluster or upgrade a cluster, you will be able to use the same dataset to compare the performance with previous runs.
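One simple way to keep runs comparable is to copy a representative input dataset once to a fixed, well-known location in HDFS and always point your benchmark jobs at that copy. The paths below are purely illustrative:

% hadoop distcp /user/alice/events-2014-09 /benchmarks/user-jobs/input

You then run the same user jobs against /benchmarks/user-jobs/input for every benchmarking run, whether on the existing cluster, a new cluster, or an upgraded one.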
CHAPTER 11

Administering Hadoop

The previous chapter was devoted to setting up a Hadoop cluster. In this chapter, we look at the procedures to keep a cluster running smoothly.

HDFS

Persistent Data Structures

As an administrator, it is invaluable to have a basic understanding of how the components of HDFS (the namenode, the secondary namenode, and the datanodes) organize their persistent data on disk. Knowing which files are which can help you diagnose problems or spot that something is awry.

Namenode directory structure

A running namenode has a directory structure like this:

${dfs.namenode.name.dir}/
├── current
│   ├── VERSION
│   ├── edits_0000000000000000001-0000000000000000019
│   ├── edits_inprogress_0000000000000000020
│   ├── fsimage_0000000000000000000
│   ├── fsimage_0000000000000000000.md5
│   ├── fsimage_0000000000000000019
│   ├── fsimage_0000000000000000019.md5
│   └── seen_txid
└── in_use.lock

Recall from Chapter 10 that the dfs.namenode.name.dir property is a list of directories, with the same contents mirrored in each directory. This mechanism provides resilience, particularly if one of the directories is an NFS mount, as is recommended.
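As a reminder of what that looks like in hdfs-site.xml, here is a sketch with illustrative paths, where the second directory is assumed to be an NFS-mounted directory on a remote filer:

<property>
  <name>dfs.namenode.name.dir</name>
  <value>/disk1/hdfs/name,/remote/hdfs/name</value>
</property>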
The VERSION file is a Java properties file that contains information about the version of HDFS that is running. Here are the contents of a typical file:

#Mon Sep 29 09:54:36 BST 2014
namespaceID=1342387246
clusterID=CID-01b5c398-959c-4ea8-aae6-1e0d9bd8b142
cTime=0
storageType=NAME_NODE
blockpoolID=BP-526805057-127.0.0.1-1411980876842
layoutVersion=-57

The layoutVersion is a negative integer that defines the version of HDFS’s persistent data structures.
This version number has no relation to the release number of the Hadoop distribution. Whenever the layout changes, the version number is decremented (for example, the version after −57 is −58). When this happens, HDFS needs to be upgraded, since a newer namenode (or datanode) will not operate if its storage layout is an older version. Upgrading HDFS is covered in “Upgrades” on page 337.

The namespaceID is a unique identifier for the filesystem namespace, which is created when the namenode is first formatted. The clusterID is a unique identifier for the HDFS cluster as a whole; this is important for HDFS federation (see “HDFS Federation” on page 48), where a cluster is made up of multiple namespaces and each namespace is managed by one namenode. The blockpoolID is a unique identifier for the block pool containing all the files in the namespace managed by this namenode.
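The datanodes record the same clusterID in the VERSION files under their own storage directories, so when a datanode refuses to start after a namenode has been reformatted, comparing the two files is a quick first check. The paths here are illustrative and depend on where dfs.namenode.name.dir and dfs.datanode.data.dir point on your machines:

% grep clusterID /disk1/hdfs/name/current/VERSION /disk1/hdfs/data/current/VERSION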
The cTime property marks the creation time of the namenode’s storage. For newly formatted storage, the value is always zero, but it is updated to a timestamp whenever the filesystem is upgraded.

The storageType indicates that this storage directory contains data structures for a namenode.

The in_use.lock file is a lock file that the namenode uses to lock the storage directory. This prevents another namenode instance from running at the same time with (and possibly corrupting) the same storage directory.

The other files in the namenode’s storage directory are the edits and fsimage files, and seen_txid.
To understand what these files are for, we need to dig into the workings of the namenode a little more.

The filesystem image and edit log

When a filesystem client performs a write operation (such as creating or moving a file), the transaction is first recorded in the edit log.
The namenode also has an in-memory representation of the filesystem metadata, which it updates after the edit log has been modified. The in-memory metadata is used to serve read requests.

Conceptually the edit log is a single entity, but it is represented as a number of files on disk. Each file is called a segment, and has the prefix edits and a suffix that indicates the transaction IDs contained in it. Only one file is open for writes at any one time (edits_inprogress_0000000000000000020 in the preceding example), and it is flushed and synced after every transaction before a success code is returned to the client.
For namenodes that write to multiple directories, the write must be flushed and synced to every copy before returning successfully. This ensures that no transaction is lost due to machine failure.

Each fsimage file is a complete persistent checkpoint of the filesystem metadata. (The suffix indicates the last transaction in the image.) However, it is not updated for every filesystem write operation, because writing out the fsimage file, which can grow to be gigabytes in size, would be very slow. This does not compromise resilience because if the namenode fails, then the latest state of its metadata can be reconstructed by loading the latest fsimage from disk into memory, and then applying each of the transactions from the relevant point onward in the edit log. In fact, this is precisely what the namenode does when it starts up (see “Safe Mode” on page 322).

Each fsimage file contains a serialized form of all the directory and file inodes in the filesystem. Each inode is an internal representation of a file or directory’s metadata and contains such information as the file’s replication level, modification and access times, access permissions, block size, and the blocks the file is made up of. For directories, the modification time, permissions, and quota metadata are stored.

An fsimage file does not record the datanodes on which the blocks are stored. Instead, the namenode keeps this mapping in memory, which it constructs by asking the datanodes for their block lists when they join the cluster and periodically afterward to ensure the namenode’s block mapping is up to date.
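If you want to see what is inside these files, Hadoop ships with an offline image viewer (hdfs oiv) and an offline edits viewer (hdfs oev) that convert them into readable forms without involving a running namenode. The sketch below uses the filenames from the directory listing earlier; the output filenames are arbitrary, and the set of output processors (such as XML) varies between releases:

% hdfs oiv -p XML -i fsimage_0000000000000000019 -o fsimage.xml
% hdfs oev -i edits_0000000000000000001-0000000000000000019 -o edits.xml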
As described, the edit log would grow without bound (even if it was spread across several physical edits files). Though this state of affairs would have no impact on the system while the namenode is running, if the namenode were restarted, it would take a long time to apply each of the transactions in its (very long) edit log.
During this time, the filesystem would be offline, which is generally undesirable.

The solution is to run the secondary namenode, whose purpose is to produce checkpoints of the primary’s in-memory filesystem metadata.1 The checkpointing process proceeds as follows (and is shown schematically in Figure 11-1 for the edit log and image files shown earlier):

1. The secondary asks the primary to roll its in-progress edits file, so new edits go to a new file. The primary also updates the seen_txid file in all its storage directories.

2.