Because clients are remote, it is possible for a client to become an arbitrary user simply by creating an account of that name on the remote system. This is not possible if security is turned on; see “Security” on page 309. Either way, it is worthwhile having permissions enabled (as they are by default; see the dfs.permissions.enabled property) to avoid accidental modification or deletion of substantial parts of the filesystem, either by users or by automated tools or programs.

When permissions checking is enabled, the owner permissions are checked if the client’s username matches the owner, and the group permissions are checked if the client is a member of the group; otherwise, the other permissions are checked.
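To make the check concrete, here is a minimal sketch (not from the book) that inspects and changes a file’s permissions through the FileSystem API; the path /user/tom/quangle.txt and the mode 600 are illustrative assumptions:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;

public class PermissionsExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path file = new Path("/user/tom/quangle.txt"); // illustrative path

    // Print the permissions, owner, and group, much as hadoop fs -ls would
    FileStatus status = fs.getFileStatus(file);
    System.out.printf("%s %s %s %s%n", status.getPermission(),
        status.getOwner(), status.getGroup(), status.getPath());

    // Restrict the file to owner read/write only (mode 600); this typically
    // fails if the client is neither the owner nor the superuser
    fs.setPermission(file, new FsPermission((short) 0600));
  }
}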
There is a concept of a superuser, which is the identity of the namenode process. Permissions checks are not performed for the superuser.

Hadoop Filesystems

Hadoop has an abstract notion of filesystems, of which HDFS is just one implementation. The Java abstract class org.apache.hadoop.fs.FileSystem represents the client interface to a filesystem in Hadoop, and there are several concrete implementations. The main ones that ship with Hadoop are described in Table 3-1.
Table 3-1. Hadoop filesystems (Java implementations are all under org.apache.hadoop)

Local (URI scheme: file; Java implementation: fs.LocalFileSystem)
  A filesystem for a locally connected disk with client-side checksums. Use RawLocalFileSystem for a local filesystem with no checksums. See “LocalFileSystem” on page 99.

HDFS (URI scheme: hdfs; Java implementation: hdfs.DistributedFileSystem)
  Hadoop’s distributed filesystem. HDFS is designed to work efficiently in conjunction with MapReduce.

WebHDFS (URI scheme: webhdfs; Java implementation: hdfs.web.WebHdfsFileSystem)
  A filesystem providing authenticated read/write access to HDFS over HTTP. See “HTTP” on page 54.

Secure WebHDFS (URI scheme: swebhdfs; Java implementation: hdfs.web.SWebHdfsFileSystem)
  The HTTPS version of WebHDFS.

HAR (URI scheme: har; Java implementation: fs.HarFileSystem)
  A filesystem layered on another filesystem for archiving files. Hadoop Archives are used for packing lots of files in HDFS into a single archive file to reduce the namenode’s memory usage. Use the hadoop archive command to create HAR files.

View (URI scheme: viewfs; Java implementation: viewfs.ViewFileSystem)
  A client-side mount table for other Hadoop filesystems. Commonly used to create mount points for federated namenodes (see “HDFS Federation” on page 48).

FTP (URI scheme: ftp; Java implementation: fs.ftp.FTPFileSystem)
  A filesystem backed by an FTP server.

S3 (URI scheme: s3a; Java implementation: fs.s3a.S3AFileSystem)
  A filesystem backed by Amazon S3. Replaces the older s3n (S3 native) implementation.

Azure (URI scheme: wasb; Java implementation: fs.azure.NativeAzureFileSystem)
  A filesystem backed by Microsoft Azure.

Swift (URI scheme: swift; Java implementation: fs.swift.snative.SwiftNativeFileSystem)
  A filesystem backed by OpenStack Swift.

Hadoop provides many interfaces to its filesystems, and it generally uses the URI scheme to pick the correct filesystem instance to communicate with.
For example, the filesystem shell that we met in the previous section operates with all Hadoop filesystems. To list the files in the root directory of the local filesystem, type:

% hadoop fs -ls file:///
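The same scheme-based lookup is available from the Java API through FileSystem.get(). The following is a minimal sketch (not from the book); the hdfs://namenode/ authority is a placeholder for a real namenode address:

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class SchemeDispatch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // The scheme in the URI selects the concrete FileSystem implementation
    FileSystem local = FileSystem.get(URI.create("file:///"), conf);
    FileSystem hdfs = FileSystem.get(URI.create("hdfs://namenode/"), conf);

    System.out.println(local.getClass().getSimpleName()); // LocalFileSystem
    System.out.println(hdfs.getClass().getSimpleName());  // DistributedFileSystem
  }
}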
Although it is possible (and sometimes very convenient) to run MapReduce programs that access any of these filesystems, when you are processing large volumes of data you should choose a distributed filesystem that has the data locality optimization, notably HDFS (see “Scaling Out” on page 30).

Interfaces

Hadoop is written in Java, so most Hadoop filesystem interactions are mediated through the Java API. The filesystem shell, for example, is a Java application that uses the Java FileSystem class to provide filesystem operations. The other filesystem interfaces are discussed briefly in this section. These interfaces are most commonly used with HDFS, since the other filesystems in Hadoop typically have existing tools to access the underlying filesystem (FTP clients for FTP, S3 tools for S3, etc.), but many of them will work with any Hadoop filesystem.

HTTP

By exposing its filesystem interface as a Java API, Hadoop makes it awkward for non-Java applications to access HDFS. The HTTP REST API exposed by the WebHDFS protocol makes it easier for other languages to interact with HDFS.
Note that the HTTP interface is slower than the native Java client, so should be avoided for very large data transfers if possible.

There are two ways of accessing HDFS over HTTP: directly, where the HDFS daemons serve HTTP requests to clients; and via a proxy (or proxies), which accesses HDFS on the client’s behalf using the usual DistributedFileSystem API. The two ways are illustrated in Figure 3-1. Both use the WebHDFS protocol.
Figure 3-1. Accessing HDFS over HTTP directly and via a bank of HDFS proxies

In the first case, the embedded web servers in the namenode and datanodes act as WebHDFS endpoints. (WebHDFS is enabled by default, since dfs.webhdfs.enabled is set to true.) File metadata operations are handled by the namenode, while file read (and write) operations are sent first to the namenode, which sends an HTTP redirect to the client indicating the datanode to stream file data from (or to).

The second way of accessing HDFS over HTTP relies on one or more standalone proxy servers. (The proxies are stateless, so they can run behind a standard load balancer.) All traffic to the cluster passes through the proxy, so the client never accesses the namenode or datanode directly.
This allows for stricter firewall and bandwidth-limiting policies to be put in place. It’s common to use a proxy for transfers between Hadoop clusters located in different data centers, or when accessing a Hadoop cluster running in the cloud from an external network.

The HttpFS proxy exposes the same HTTP (and HTTPS) interface as WebHDFS, so clients can access both using webhdfs (or swebhdfs) URIs. The HttpFS proxy is started independently of the namenode and datanode daemons, using the httpfs.sh script, and by default listens on a different port number (14000).
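To make this concrete, here is a minimal sketch (not from the book) that reads a file over HTTP simply by passing a webhdfs URI to the ordinary FileSystem API; the host, port, and file path are placeholders (an HttpFS proxy would be addressed on its own port, 14000 by default, rather than the namenode’s web port):

import java.io.InputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class WebHdfsCat {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // 50070 is the namenode's default web port in Hadoop 2
    FileSystem fs = FileSystem.get(
        URI.create("webhdfs://namenode-host:50070/"), conf);
    InputStream in = null;
    try {
      in = fs.open(new Path("/user/tom/quangle.txt")); // placeholder path
      IOUtils.copyBytes(in, System.out, 4096, false);
    } finally {
      IOUtils.closeStream(in);
    }
  }
}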
C

Hadoop provides a C library called libhdfs that mirrors the Java FileSystem interface (it was written as a C library for accessing HDFS, but despite its name it can be used to access any Hadoop filesystem). It works using the Java Native Interface (JNI) to call a Java filesystem client. There is also a libwebhdfs library that uses the WebHDFS interface described in the previous section.

The C API is very similar to the Java one, but it typically lags the Java one, so some newer features may not be supported.
You can find the header file, hdfs.h, in the include directory of the Apache Hadoop binary tarball distribution.

The Apache Hadoop binary tarball comes with prebuilt libhdfs binaries for 64-bit Linux, but for other platforms you will need to build them yourself by following the BUILDING.txt instructions at the top level of the source tree.

NFS

It is possible to mount HDFS on a local client’s filesystem using Hadoop’s NFSv3 gateway. You can then use Unix utilities (such as ls and cat) to interact with the filesystem, upload files, and in general use POSIX libraries to access the filesystem from any programming language.
Appending to a file works, but random modifications of a file do not, since HDFS can only write to the end of a file.

Consult the Hadoop documentation for how to configure and run the NFS gateway and connect to it from a client.

FUSE

Filesystem in Userspace (FUSE) allows filesystems that are implemented in user space to be integrated as Unix filesystems. Hadoop’s Fuse-DFS contrib module allows HDFS (or any Hadoop filesystem) to be mounted as a standard local filesystem.
Fuse-DFS is implemented in C using libhdfs as the interface to HDFS. At the time of writing, the Hadoop NFS gateway is the more robust solution to mounting HDFS, so should be preferred over Fuse-DFS.

The Java Interface

In this section, we dig into the Hadoop FileSystem class: the API for interacting with one of Hadoop’s filesystems.6 Although we focus mainly on the HDFS implementation, DistributedFileSystem, in general you should strive to write your code against the FileSystem abstract class, to retain portability across filesystems. This is very useful when testing your program, for example, because you can rapidly run tests using data stored on the local filesystem.

6. In Hadoop 2 and later, there is a new filesystem interface called FileContext with better handling of multiple filesystems (so a single FileContext can resolve multiple filesystem schemes, for example) and a cleaner, more consistent interface. FileSystem is still more widely used, however.
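As an illustration of the advice to code against the abstract FileSystem class, here is a minimal sketch (not from the book; the class and method names are invented for the example) whose logic can run over HDFS in production and over the local filesystem in a test:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DirSize {
  // Works with any FileSystem implementation, not just DistributedFileSystem
  public static long totalSize(FileSystem fs, Path dir) throws IOException {
    long total = 0;
    for (FileStatus status : fs.listStatus(dir)) {
      if (status.isFile()) {
        total += status.getLen();
      }
    }
    return total;
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // A unit test could pass FileSystem.getLocal(conf) here instead
    FileSystem fs = FileSystem.get(conf);
    System.out.println(totalSize(fs, new Path(args[0])));
  }
}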
Reading Data from a Hadoop URL

One of the simplest ways to read a file from a Hadoop filesystem is by using a java.net.URL object to open a stream to read the data from. The general idiom is:

InputStream in = null;
try {
  in = new URL("hdfs://host/path").openStream();
  // process in
} finally {
  IOUtils.closeStream(in);
}

There’s a little bit more work required to make Java recognize Hadoop’s hdfs URL scheme. This is achieved by calling the setURLStreamHandlerFactory() method on URL with an instance of FsUrlStreamHandlerFactory.
This method can be called only once per JVM, so it is typically executed in a static block. This limitation means that if some other part of your program (perhaps a third-party component outside your control) sets a URLStreamHandlerFactory, you won’t be able to use this approach for reading data from Hadoop. The next section discusses an alternative.

Example 3-1 shows a program for displaying files from Hadoop filesystems on standard output, like the Unix cat command.

Example 3-1. Displaying files from a Hadoop filesystem on standard output using a URLStreamHandler

public class URLCat {

  static {
    URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
  }

  public static void main(String[] args) throws Exception {
    InputStream in = null;
    try {
      in = new URL(args[0]).openStream();
      IOUtils.copyBytes(in, System.out, 4096, false);
    } finally {
      IOUtils.closeStream(in);
    }
  }
}

We make use of the handy IOUtils class that comes with Hadoop for closing the stream in the finally clause, and also for copying bytes between the input stream and the output stream (System.out, in this case).
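Assuming the compiled URLCat class has been added to Hadoop’s classpath (for example via the HADOOP_CLASSPATH environment variable), an invocation might look like the following, where the file URL is a placeholder:

% hadoop URLCat hdfs://localhost/user/tom/quangle.txt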