This process is reversible, so rolling back an upgrade is also straightforward.

After every successful upgrade, you should perform a couple of final cleanup steps:

1. Remove the old installation and configuration files from the cluster.
2. Fix any deprecation warnings in your code and configuration.

Upgrades are where Hadoop cluster management tools like Cloudera Manager and Apache Ambari come into their own. They simplify the upgrade process and also make it easy to do rolling upgrades, where nodes are upgraded in batches (or one at a time for master nodes), so that clients don’t experience service interruptions.

HDFS data and metadata upgrades

If you use the procedure just described to upgrade to a new version of HDFS and it expects a different layout version, then the namenode will refuse to run.
A message like the following will appear in its log:

File system image contains an old layout version -16.
An upgrade to version -18 is required.
Please restart NameNode with -upgrade option.

The most reliable way of finding out whether you need to upgrade the filesystem is by performing a trial on a test cluster.

An upgrade of HDFS makes a copy of the previous version’s metadata and data. Doing an upgrade does not double the storage requirements of the cluster, as the datanodes use hard links to keep two references (for the current and previous version) to the same block of data. This design makes it straightforward to roll back to the previous version of the filesystem, if you need to.
You should understand that any changes made to the data on the upgraded system will be lost after the rollback completes, however.

You can keep only the previous version of the filesystem, which means you can’t roll back several versions. Therefore, to carry out another upgrade to HDFS data and metadata, you will need to delete the previous version, a process called finalizing the upgrade. Once an upgrade is finalized, there is no procedure for rolling back to a previous version.

In general, you can skip releases when upgrading, but in some cases, you may have to go through intermediate releases. The release notes make it clear when this is required.

You should only attempt to upgrade a healthy filesystem.
Before running the upgrade, do a full fsck (see “Filesystem check (fsck)” on page 326). As an extra precaution, you can keep a copy of the fsck output that lists all the files and blocks in the system, so you can compare it with the output of running fsck after the upgrade.

It’s also worth clearing out temporary files before doing the upgrade, both local temporary files and those in the MapReduce system directory on HDFS.

With these preliminaries out of the way, here is the high-level procedure for upgrading a cluster when the filesystem layout needs to be migrated:

1. Ensure that any previous upgrade is finalized before proceeding with another upgrade.
2. Shut down the YARN and MapReduce daemons.
3. Shut down HDFS, and back up the namenode directories.
4. Install the new version of Hadoop on the cluster and on clients.
5. Start HDFS with the -upgrade option.
6. Wait until the upgrade is complete.
7. Perform some sanity checks on HDFS.
8. Start the YARN and MapReduce daemons.
9. Roll back or finalize the upgrade (optional).

While running the upgrade procedure, it is a good idea to remove the Hadoop scripts from your PATH environment variable. This forces you to be explicit about which version of the scripts you are running. It can be convenient to define two environment variables for the old and new installation directories; in the following instructions, we have defined OLD_HADOOP_HOME and NEW_HADOOP_HOME.
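For example (the installation paths and version numbers here are purely illustrative), the two variables might be set like this before starting the upgrade:

% export OLD_HADOOP_HOME=/usr/local/hadoop-2.5.2
% export NEW_HADOOP_HOME=/usr/local/hadoop-2.6.0

With the Hadoop scripts removed from your PATH, every command in the rest of this procedure is then invoked explicitly through one of these two variables.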
Start the upgrade. To perform the upgrade, run the following command (this is step 5 in the high-level upgrade procedure):

% $NEW_HADOOP_HOME/bin/start-dfs.sh -upgrade

This causes the namenode to upgrade its metadata, placing the previous version in a new directory called previous under dfs.namenode.name.dir. Similarly, datanodes upgrade their storage directories, preserving the old copy in a directory called previous.

Wait until the upgrade is complete. The upgrade process is not instantaneous, but you can check the progress of an upgrade using dfsadmin (step 6; upgrade events also appear in the daemons’ logfiles):

% $NEW_HADOOP_HOME/bin/hdfs dfsadmin -upgradeProgress status
Upgrade for version -18 has been completed.
Upgrade is not finalized.

Check the upgrade. This shows that the upgrade is complete. At this stage, you should run some sanity checks (step 7) on the filesystem (e.g., check files and blocks using fsck, test basic file operations).
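As a minimal sketch of such checks (assuming you saved the pre-upgrade fsck listing in a file called fsck-before.txt, an illustrative name, as suggested earlier), you might run:

% $NEW_HADOOP_HOME/bin/hdfs fsck / -files -blocks > fsck-after.txt
% diff fsck-before.txt fsck-after.txt
% $NEW_HADOOP_HOME/bin/hdfs dfs -ls /
% $NEW_HADOOP_HOME/bin/hdfs dfs -cat /path/to/a/known/file | head

Expect the diff to show harmless differences in the fsck headers (such as timestamps); what you are looking for is any change in the set of files and blocks reported.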
You might choose to put HDFS into safe mode while you are running some of these checks (the ones that are read-only) to prevent others from making changes; see “Safe Mode” on page 322.
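A minimal sketch of that pattern, using the dfsadmin safemode commands described in “Safe Mode”:

% $NEW_HADOOP_HOME/bin/hdfs dfsadmin -safemode enter
# ...run the read-only checks here...
% $NEW_HADOOP_HOME/bin/hdfs dfsadmin -safemode leave

Remember that safe mode blocks writes for all clients, so leave it again before running any checks that modify the filesystem.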
Roll back the upgrade (optional). If you find that the new version is not working correctly, you may choose to roll back to the previous version (step 9). This is possible only if you have not finalized the upgrade.

A rollback reverts the filesystem state to before the upgrade was performed, so any changes made in the meantime will be lost. In other words, it rolls back to the previous state of the filesystem, rather than downgrading the current state of the filesystem to a former version.

First, shut down the new daemons:

% $NEW_HADOOP_HOME/bin/stop-dfs.sh

Then start up the old version of HDFS with the -rollback option:

% $OLD_HADOOP_HOME/bin/start-dfs.sh -rollback

This command gets the namenode and datanodes to replace their current storage directories with their previous copies.
The filesystem will be returned to its previous state.

Finalize the upgrade (optional). When you are happy with the new version of HDFS, you can finalize the upgrade (step 9) to remove the previous storage directories.

After an upgrade has been finalized, there is no way to roll back to the previous version.

This step is required before performing another upgrade:

% $NEW_HADOOP_HOME/bin/hdfs dfsadmin -finalizeUpgrade
% $NEW_HADOOP_HOME/bin/hdfs dfsadmin -upgradeProgress status
There are no upgrades in progress.

HDFS is now fully upgraded to the new version.

PART IV. Related Projects

CHAPTER 12. Avro

Apache Avro[1] is a language-neutral data serialization system. The project was created by Doug Cutting (the creator of Hadoop) to address the major downside of Hadoop Writables: lack of language portability.
Having a data format that can be processed by many languages (currently C, C++, C#, Java, JavaScript, Perl, PHP, Python, and Ruby) makes it easier to share datasets with a wider audience than one tied to a single language. It is also more future-proof, allowing data to potentially outlive the language used to read and write it.

But why a new data serialization system? Avro has a set of features that, taken together, differentiate it from other systems such as Apache Thrift or Google’s Protocol Buffers.[2] As in these systems and others, Avro data is described using a language-independent schema. However, unlike in some other systems, code generation is optional in Avro, which means you can read and write data that conforms to a given schema even if your code has not seen that particular schema before.
To achieve this, Avro assumes that the schema is always present, at both read and write time, which makes for a very compact encoding, since encoded values do not need to be tagged with a field identifier.

Avro schemas are usually written in JSON, and data is usually encoded using a binary format, but there are other options, too. There is a higher-level language called Avro IDL for writing schemas in a C-like language that is more familiar to developers. There is also a JSON-based data encoder, which, being human readable, is useful for prototyping and debugging Avro data.

The Avro specification precisely defines the binary format that all implementations must support. It also specifies many of the other features of Avro that implementations should support. One area that the specification does not rule on, however, is APIs: implementations have complete latitude in the APIs they expose for working with Avro data, since each one is necessarily language specific. The fact that there is only one binary format is significant, because it means the barrier for implementing a new language binding is lower and avoids the problem of a combinatorial explosion of languages and formats, which would harm interoperability.
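To make the compactness point concrete, here is a small worked example (the schema and values are invented for illustration; the byte counts follow the encoding rules in the specification). Consider this record schema:

{
  "type": "record",
  "name": "Greeting",
  "fields": [
    {"name": "id", "type": "int"},
    {"name": "message", "type": "string"}
  ]
}

A datum with id 1 and message "foo" encodes to just five bytes: 02 (the zigzag varint encoding of 1), 06 (the length of "foo"), and the three UTF-8 bytes of "foo". No field names, tags, or type markers appear in the encoded data; the reader recovers all of that from the schema.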
Avro has rich schema resolution capabilities. Within certain carefully defined constraints, the schema used to read data need not be identical to the schema that was used to write the data. This is the mechanism by which Avro supports schema evolution. For example, a new, optional field may be added to a record by declaring it in the schema used to read the old data. New and old clients alike will be able to read the old data, while new clients can write new data that uses the new field. Conversely, if an old client sees newly encoded data, it will gracefully ignore the new field and carry on processing as it would have done with old data.
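As a sketch of what this looks like (the record and field names are invented for illustration), the writer's original schema might be:

{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "name", "type": "string"}
  ]
}

and the reader's newer schema adds an optional field with a default:

{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "email", "type": ["null", "string"], "default": null}
  ]
}

Because the added field declares a default value, a client using the new schema can still read data written with the old one (the default is filled in), while a client using the old schema simply ignores the email field in newly written data.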
Avro specifies an object container format for sequences of objects, similar to Hadoop’s sequence file. An Avro datafile has a metadata section where the schema is stored, which makes the file self-describing. Avro datafiles support compression and are splittable, which is crucial for a MapReduce data input format. In fact, support goes beyond MapReduce: all of the data processing frameworks in this book (Pig, Hive, Crunch, Spark) can read and write Avro datafiles.

Avro can be used for RPC, too, although this isn’t covered here. More information is in the specification.

[1] Named after the British aircraft manufacturer from the 20th century.
[2] Avro also performs favorably compared to other serialization libraries, as the benchmarks demonstrate.

Avro Data Types and Schemas

Avro defines a small number of primitive data types, which can be used to build application-specific data structures by writing schemas.
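For instance, a record schema that combines several primitive types might look like this (the record and field names are made up for illustration):

{
  "type": "record",
  "name": "WeatherReading",
  "fields": [
    {"name": "stationId", "type": "string"},
    {"name": "timestamp", "type": "long"},
    {"name": "temperature", "type": "double"},
    {"name": "quality", "type": "int"}
  ]
}

The primitive type names appear directly as field types here; more complex structures such as arrays, maps, and unions are built up from them in the same JSON notation.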