Tom White - Hadoop: The Definitive Guide, 4th edition - 2015 (811394), page 2
Developing a MapReduce Application
    The Configuration API
    Combining Resources
    Variable Expansion
    Setting Up the Development Environment
    Managing Configuration
    GenericOptionsParser, Tool, and ToolRunner
    Writing a Unit Test with MRUnit
    Mapper
    Reducer
    Running Locally on Test Data
    Running a Job in a Local Job Runner
    Testing the Driver
    Running on a Cluster
    Packaging a Job
    Launching a Job
    The MapReduce Web UI
    Retrieving the Results
    Debugging a Job
    Hadoop Logs
    Remote Debugging
    Tuning a Job
    Profiling Tasks
    MapReduce Workflows
    Decomposing a Problem into MapReduce Jobs
    JobControl
    Apache Oozie

7. How MapReduce Works
    Anatomy of a MapReduce Job Run
    Job Submission
    Job Initialization
    Task Assignment
    Task Execution
    Progress and Status Updates
    Job Completion
    Failures
    Task Failure
    Application Master Failure
    Node Manager Failure
    Resource Manager Failure
    Shuffle and Sort
    The Map Side
    The Reduce Side
    Configuration Tuning
    Task Execution
    The Task Execution Environment
    Speculative Execution
    Output Committers

8. MapReduce Types and Formats
    MapReduce Types
    The Default MapReduce Job
    Input Formats
    Input Splits and Records
    Text Input
    Binary Input
    Multiple Inputs
    Database Input (and Output)
    Output Formats
    Text Output
    Binary Output
    Multiple Outputs
    Lazy Output
    Database Output

9. MapReduce Features
    Counters
    Built-in Counters
    User-Defined Java Counters
    User-Defined Streaming Counters
    Sorting
    Preparation
    Partial Sort
    Total Sort
    Secondary Sort
    Joins
    Map-Side Joins
    Reduce-Side Joins
    Side Data Distribution
    Using the Job Configuration
    Distributed Cache
    MapReduce Library Classes

Part III. Hadoop Operations

10. Setting Up a Hadoop Cluster
    Cluster Specification
    Cluster Sizing
    Network Topology
    Cluster Setup and Installation
    Installing Java
    Creating Unix User Accounts
    Installing Hadoop
    Configuring SSH
    Configuring Hadoop
    Formatting the HDFS Filesystem
    Starting and Stopping the Daemons
    Creating User Directories
    Hadoop Configuration
    Configuration Management
    Environment Settings
    Important Hadoop Daemon Properties
    Hadoop Daemon Addresses and Ports
    Other Hadoop Properties
    Security
    Kerberos and Hadoop
    Delegation Tokens
    Other Security Enhancements
    Benchmarking a Hadoop Cluster
    Hadoop Benchmarks
    User Jobs

11. Administering Hadoop
    HDFS
    Persistent Data Structures
    Safe Mode
    Audit Logging
    Tools
    Monitoring
    Logging
    Metrics and JMX
    Maintenance
    Routine Administration Procedures
    Commissioning and Decommissioning Nodes
    Upgrades

Part IV. Related Projects

12. Avro
    Avro Data Types and Schemas
    In-Memory Serialization and Deserialization
    The Specific API
    Avro Datafiles
    Interoperability
    Python API
    Avro Tools
    Schema Resolution
    Sort Order
    Avro MapReduce
    Sorting Using Avro MapReduce
    Avro in Other Languages

13. Parquet
    Data Model
    Nested Encoding
    Parquet File Format
    Parquet Configuration
    Writing and Reading Parquet Files
    Avro, Protocol Buffers, and Thrift
    Parquet MapReduce

14. Flume
    Installing Flume
    An Example
    Transactions and Reliability
    Batching
    The HDFS Sink
    Partitioning and Interceptors
    File Formats
    Fan Out
    Delivery Guarantees
    Replicating and Multiplexing Selectors
    Distribution: Agent Tiers
    Delivery Guarantees
    Sink Groups
    Integrating Flume with Applications
    Component Catalog
    Further Reading

15. Sqoop
    Getting Sqoop
    Sqoop Connectors
    A Sample Import
    Text and Binary File Formats
    Generated Code
    Additional Serialization Systems
    Imports: A Deeper Look
    Controlling the Import
    Imports and Consistency
    Incremental Imports
    Direct-Mode Imports
    Working with Imported Data
    Imported Data and Hive
    Importing Large Objects
    Performing an Export
    Exports: A Deeper Look
    Exports and Transactionality
    Exports and SequenceFiles
    Further Reading

16. Pig
    Installing and Running Pig
    Execution Types
    Running Pig Programs
    Grunt
    Pig Latin Editors
    An Example
    Generating Examples
    Comparison with Databases
    Pig Latin
    Structure
    Statements
    Expressions
    Types
    Schemas
    Functions
    Macros
    User-Defined Functions
    A Filter UDF
    An Eval UDF
    A Load UDF
    Data Processing Operators
    Loading and Storing Data
    Filtering Data
    Grouping and Joining Data
    Sorting Data
    Combining and Splitting Data
    Pig in Practice
    Parallelism
    Anonymous Relations
    Parameter Substitution
    Further Reading

17. Hive
    Installing Hive
    The Hive Shell
    An Example
    Running Hive
    Configuring Hive
    Hive Services
    The Metastore
    Comparison with Traditional Databases
    Schema on Read Versus Schema on Write
    Updates, Transactions, and Indexes
    SQL-on-Hadoop Alternatives
    HiveQL
    Data Types
    Operators and Functions
    Tables
    Managed Tables and External Tables
    Partitions and Buckets
    Storage Formats
    Importing Data
    Altering Tables
    Dropping Tables
    Querying Data
    Sorting and Aggregating
    MapReduce Scripts
    Joins
    Subqueries
    Views
    User-Defined Functions
    Writing a UDF
    Writing a UDAF
    Further Reading

18. Crunch
    An Example
    The Core Crunch API
    Primitive Operations
    Types
    Sources and Targets
    Functions
    Materialization
    Pipeline Execution
    Running a Pipeline
    Stopping a Pipeline
    Inspecting a Crunch Plan
    Iterative Algorithms
    Checkpointing a Pipeline
    Crunch Libraries
    Further Reading

19. Spark
    Installing Spark
    An Example
    Spark Applications, Jobs, Stages, and Tasks
    A Scala Standalone Application
    A Java Example
    A Python Example
    Resilient Distributed Datasets
    Creation
    Transformations and Actions
    Persistence
    Serialization
    Shared Variables
    Broadcast Variables
    Accumulators
    Anatomy of a Spark Job Run
    Job Submission
    DAG Construction
    Task Scheduling
    Task Execution
    Executors and Cluster Managers
    Spark on YARN
    Further Reading

20. HBase
    HBasics
    Backdrop
    Concepts
    Whirlwind Tour of the Data Model
    Implementation
    Installation
    Test Drive
    Clients
    Java
    MapReduce
    REST and Thrift
    Building an Online Query Application
    Schema Design
    Loading Data
    Online Queries
    HBase Versus RDBMS
    Successful Service
    HBase
    Praxis
    HDFS
    UI
    Metrics
    Counters
    Further Reading

21. ZooKeeper
    Installing and Running ZooKeeper
    An Example
    Group Membership in ZooKeeper
    Creating the Group
    Joining a Group
    Listing Members in a Group
    Deleting a Group
    The ZooKeeper Service
    Data Model
    Operations
    Implementation
    Consistency
    Sessions
    States
    Building Applications with ZooKeeper
    A Configuration Service
    The Resilient ZooKeeper Application
    A Lock Service
    More Distributed Data Structures and Protocols
    ZooKeeper in Production
    Resilience and Performance
    Configuration
    Further Reading

Part V. Case Studies

22. Composable Data at Cerner