It is for this reason that Flume (see Chapter 14) uses row-oriented formats. The first column-oriented file format in Hadoop was Hive's RCFile, short for Record Columnar File. It has since been superseded by Hive's ORCFile (Optimized Record Columnar File), and Parquet (covered in Chapter 13). Parquet is a general-purpose column-oriented file format based on Google's Dremel, and has wide support across Hadoop components. Avro also has a column-oriented format called Trevni.

PART II. MapReduce

CHAPTER 6. Developing a MapReduce Application

In Chapter 2, we introduced the MapReduce model.
In this chapter, we look at the practical aspects of developing a MapReduce application in Hadoop.

Writing a program in MapReduce follows a certain pattern. You start by writing your map and reduce functions, ideally with unit tests to make sure they do what you expect. Then you write a driver program to run a job, which can run from your IDE using a small subset of the data to check that it is working. If it fails, you can use your IDE's debugger to find the source of the problem.
With this information, you can expand your unit tests to cover this case and improve your mapper or reducer as appropriate to handle such input correctly.

When the program runs as expected against the small dataset, you are ready to unleash it on a cluster.
Running against the full dataset is likely to expose some more issues, which you can fix as before, by expanding your tests and altering your mapper or reducer to handle the new cases. Debugging failing programs in the cluster is a challenge, so we'll look at some common techniques to make it easier.

After the program is working, you may wish to do some tuning, first by running through some standard checks for making MapReduce programs faster and then by doing task profiling. Profiling distributed programs is not easy, but Hadoop has hooks to aid in the process.

Before we start writing a MapReduce program, however, we need to set up and configure the development environment. And to do that, we need to learn a bit about how Hadoop does configuration.

The Configuration API

Components in Hadoop are configured using Hadoop's own configuration API. An instance of the Configuration class (found in the org.apache.hadoop.conf package) represents a collection of configuration properties and their values.
Each property is named by a String, and the type of a value may be one of several, including Java primitives such as boolean, int, long, and float; other useful types such as String, Class, and java.io.File; and collections of Strings.

Configurations read their properties from resources—XML files with a simple structure for defining name-value pairs. See Example 6-1.

Example 6-1. A simple configuration file, configuration-1.xml

<?xml version="1.0"?>
<configuration>
  <property>
    <name>color</name>
    <value>yellow</value>
    <description>Color</description>
  </property>
  <property>
    <name>size</name>
    <value>10</value>
    <description>Size</description>
  </property>
  <property>
    <name>weight</name>
    <value>heavy</value>
    <final>true</final>
    <description>Weight</description>
  </property>
  <property>
    <name>size-weight</name>
    <value>${size},${weight}</value>
    <description>Size and weight</description>
  </property>
</configuration>

Assuming this Configuration is in a file called configuration-1.xml, we can access its properties using a piece of code like this:

Configuration conf = new Configuration();
conf.addResource("configuration-1.xml");
assertThat(conf.get("color"), is("yellow"));
assertThat(conf.getInt("size", 0), is(10));
assertThat(conf.get("breadth", "wide"), is("wide"));

There are a couple of things to note: type information is not stored in the XML file; instead, properties can be interpreted as a given type when they are read.
Also, the get() methods allow you to specify a default value, which is used if the property is not defined in the XML file, as in the case of breadth here.
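Configuration also has typed setters and getters for the other value types listed above. The following is a quick sketch, not an example from the book; the property names are made up, and it assumes the same static assertThat/is imports as the other snippets here:

Configuration conf = new Configuration();
conf.setBoolean("compress.output", true);        // boolean-valued property
conf.setLong("max.bytes", 134217728);            // long-valued property
conf.setStrings("datacenters", "dc1", "dc2");    // stored as the string "dc1,dc2"

assertThat(conf.getBoolean("compress.output", false), is(true));
assertThat(conf.getLong("max.bytes", 0), is(134217728L));
assertThat(conf.getStrings("datacenters")[1], is("dc2"));

The setters simply store the string form of the value; the typed getters parse it back when it is read, which is consistent with type information not being stored in the file itself.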
Combining Resources

Things get interesting when more than one resource is used to define a Configuration. This is used in Hadoop to separate out the default properties for the system, defined internally in a file called core-default.xml, from the site-specific overrides in core-site.xml. The file in Example 6-2 defines the size and weight properties.

Example 6-2. A second configuration file, configuration-2.xml

<?xml version="1.0"?>
<configuration>
  <property>
    <name>size</name>
    <value>12</value>
  </property>
  <property>
    <name>weight</name>
    <value>light</value>
  </property>
</configuration>

Resources are added to a Configuration in order:

Configuration conf = new Configuration();
conf.addResource("configuration-1.xml");
conf.addResource("configuration-2.xml");

Properties defined in resources that are added later override the earlier definitions.
So the size property takes its value from the second configuration file, configuration-2.xml:

assertThat(conf.getInt("size", 0), is(12));

However, properties that are marked as final cannot be overridden in later definitions. The weight property is final in the first configuration file, so the attempt to override it in the second fails, and it takes the value from the first:

assertThat(conf.get("weight"), is("heavy"));

Attempting to override final properties usually indicates a configuration error, so this results in a warning message being logged to aid diagnosis. Administrators mark properties as final in the daemon's site files that they don't want users to change in their client-side configuration files or job submission parameters.

Variable Expansion

Configuration properties can be defined in terms of other properties, or system properties.
For example, the property size-weight in the first configuration file is defined as ${size},${weight}, and these properties are expanded using the values found in the configuration:

assertThat(conf.get("size-weight"), is("12,heavy"));

System properties take priority over properties defined in resource files:

System.setProperty("size", "14");
assertThat(conf.get("size-weight"), is("14,heavy"));

This feature is useful for overriding properties on the command line by using -Dproperty=value JVM arguments.

Note that although configuration properties can be defined in terms of system properties, unless system properties are redefined using configuration properties, they are not accessible through the configuration API.
Hence:

System.setProperty("length", "2");
assertThat(conf.get("length"), is((String) null));
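To try the command-line side of this, a small throwaway program along the following lines could be used. This is a sketch, not from the book: the class name is made up, it deliberately skips the default resources, and it assumes the two example XML files are on the classpath.

import org.apache.hadoop.conf.Configuration;

public class ExpansionDemo {
  public static void main(String[] args) {
    // Load only the two example files, not core-default.xml/core-site.xml
    Configuration conf = new Configuration(false);
    conf.addResource("configuration-1.xml");
    conf.addResource("configuration-2.xml");
    // Prints "12,heavy" when run plainly; running the JVM with -Dsize=14
    // prints "14,heavy", because system properties win during ${...} expansion.
    System.out.println(conf.get("size-weight"));
    // Direct lookups are unaffected: this still prints 12 even with -Dsize=14.
    System.out.println(conf.get("size"));
  }
}

Running it with something like java -Dsize=14 (with the Hadoop client jars and the directory holding the XML files on the classpath) shows the override taking effect on the expanded value only.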
Setting Up the Development Environment

The first step is to create a project so you can build MapReduce programs and run them in local (standalone) mode from the command line or within your IDE. The Maven Project Object Model (POM) in Example 6-3 shows the dependencies needed for building and testing MapReduce programs.

Example 6-3. A Maven POM for building and testing a MapReduce application

<project>
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.hadoopbook</groupId>
  <artifactId>hadoop-book-mr-dev</artifactId>
  <version>4.0</version>
  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <hadoop.version>2.5.1</hadoop.version>
  </properties>
  <dependencies>
    <!-- Hadoop main client artifact -->
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>${hadoop.version}</version>
    </dependency>
    <!-- Unit test artifacts -->
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>4.11</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.mrunit</groupId>
      <artifactId>mrunit</artifactId>
      <version>1.1.0</version>
      <classifier>hadoop2</classifier>
      <scope>test</scope>
    </dependency>
    <!-- Hadoop test artifact for running mini clusters -->
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-minicluster</artifactId>
      <version>${hadoop.version}</version>
      <scope>test</scope>
    </dependency>
  </dependencies>
  <build>
    <finalName>hadoop-examples</finalName>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-compiler-plugin</artifactId>
        <version>3.1</version>
        <configuration>
          <source>1.6</source>
          <target>1.6</target>
        </configuration>
      </plugin>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-jar-plugin</artifactId>
        <version>2.5</version>
        <configuration>
          <outputDirectory>${basedir}</outputDirectory>
        </configuration>
      </plugin>
    </plugins>
  </build>
</project>

The dependencies section is the interesting part of the POM.
(It is straightforward to use another build tool, such as Gradle or Ant with Ivy, as long as you use the same set of dependencies defined here.) For building MapReduce jobs, you only need to have the hadoop-client dependency, which contains all the Hadoop client-side classes needed to interact with HDFS and MapReduce. For running unit tests, we use junit, and for writing MapReduce tests, we use mrunit. The hadoop-minicluster library contains the “mini-” clusters that are useful for testing with Hadoop clusters running in a single JVM.
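As a taste of what the mrunit dependency is for, here is a minimal sketch, not an example from the book: the test class name is made up, and it exercises Hadoop's base Mapper, which passes records through unchanged, using MRUnit's MapDriver.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.Test;

public class IdentityMapperTest {

  @Test
  public void passesRecordsThroughUnchanged() throws IOException {
    // The base Mapper emits its input unchanged, so the expected output
    // record is identical to the input record.
    new MapDriver<LongWritable, Text, LongWritable, Text>()
        .withMapper(new Mapper<LongWritable, Text, LongWritable, Text>())
        .withInput(new LongWritable(0), new Text("a line of input"))
        .withOutput(new LongWritable(0), new Text("a line of input"))
        .runTest();
  }
}

The same fluent withInput/withOutput/runTest pattern is used for testing real mappers and reducers, which we come to later in the chapter.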
Many IDEs can read Maven POMs directly, so you can just point them at the directory containing the pom.xml file and start writing code. Alternatively, you can use Maven to generate configuration files for your IDE. For example, the following creates Eclipse configuration files so you can import the project into Eclipse:

% mvn eclipse:eclipse -DdownloadSources=true -DdownloadJavadocs=true

Managing Configuration

When developing Hadoop applications, it is common to switch between running the application locally and running it on a cluster. In fact, you may have several clusters you work with, or you may have a local “pseudodistributed” cluster that you like to test on (a pseudodistributed cluster is one whose daemons all run on the local machine; setting up this mode is covered in Appendix A).

One way to accommodate these variations is to have Hadoop configuration files containing the connection settings for each cluster you run against and specify which one you are using when you run Hadoop applications or tools.
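For instance, a file you might call hadoop-cluster.xml could name the filesystem and the YARN resource manager for one particular cluster. This is a sketch only: the hostnames are placeholders, and the exact set of properties you need depends on your cluster.

<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode.example.com/</value>
  </property>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>resourcemanager.example.com:8032</value>
  </property>
</configuration>

A file like this can then be selected at run time (for example, with the -conf switch supported by Hadoop's command-line tools) rather than being baked into the program.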