Get usage instructions for the Avro tools by typing java -jar avro-tools-*.jar.

Writing Avro objects to a datafile is similar to writing to a stream. We use a DatumWriter as before, but instead of using an Encoder, we create a DataFileWriter instance with the DatumWriter. Then we can create a new datafile (which, by convention, has a .avro extension) and append objects to it:

File file = new File("data.avro");
DatumWriter<GenericRecord> writer =
    new GenericDatumWriter<GenericRecord>(schema);
DataFileWriter<GenericRecord> dataFileWriter =
    new DataFileWriter<GenericRecord>(writer);
dataFileWriter.create(schema, file);
dataFileWriter.append(datum);
dataFileWriter.close();

The objects that we write to the datafile must conform to the file's schema; otherwise, an exception will be thrown when we call append().

This example demonstrates writing to a local file (java.io.File in the previous snippet), but we can write to any java.io.OutputStream by using the overloaded create() method on DataFileWriter. To write a file to HDFS, for example, we get an OutputStream by calling create() on FileSystem (see "Writing Data" on page 61).
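Putting those two calls together gives something like the following sketch (the HDFS path is hypothetical, and schema and datum are the objects from the snippet above):

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
OutputStream out = fs.create(new Path("data.avro")); // hypothetical HDFS path
DatumWriter<GenericRecord> writer =
    new GenericDatumWriter<GenericRecord>(schema);
DataFileWriter<GenericRecord> dataFileWriter =
    new DataFileWriter<GenericRecord>(writer);
dataFileWriter.create(schema, out); // the OutputStream overload of create()
dataFileWriter.append(datum);
dataFileWriter.close(); // also closes the underlying stream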
Reading back objects from a datafile is similar to the earlier case of reading objects from an in-memory stream, with one important difference: we don't have to specify a schema, since it is read from the file metadata. Indeed, we can get the schema from the DataFileReader instance, using getSchema(), and verify that it is the same as the one we used to write the original object:

DatumReader<GenericRecord> reader = new GenericDatumReader<GenericRecord>();
DataFileReader<GenericRecord> dataFileReader =
    new DataFileReader<GenericRecord>(file, reader);
assertThat("Schema is the same", schema, is(dataFileReader.getSchema()));

DataFileReader is a regular Java iterator, so we can iterate through its data objects by calling its hasNext() and next() methods.
The following snippet checks that there is only one record and that it has the expected field values:

assertThat(dataFileReader.hasNext(), is(true));
GenericRecord result = dataFileReader.next();
assertThat(result.get("left").toString(), is("L"));
assertThat(result.get("right").toString(), is("R"));
assertThat(dataFileReader.hasNext(), is(false));

Rather than using the usual next() method, however, it is preferable to use the overloaded form that takes an instance of the object to be returned (in this case, GenericRecord), since it will reuse the object and save allocation and garbage collection costs for files containing many objects.
The following is idiomatic:

GenericRecord record = null;
while (dataFileReader.hasNext()) {
  record = dataFileReader.next(record);
  // process record
}

If object reuse is not important, you can use this shorter form:

for (GenericRecord record : dataFileReader) {
  // process record
}

For the general case of reading a file on a Hadoop filesystem, use Avro's FsInput to specify the input file using a Hadoop Path object. DataFileReader actually offers random access to Avro datafiles (via its seek() and sync() methods); however, in many cases, sequential streaming access is sufficient, for which DataFileStream should be used. DataFileStream can read from any Java InputStream.
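For example, here is a minimal sketch of the FsInput route, assuming a Hadoop Configuration called conf and the datafile written earlier (the static openReader() factory method returns a FileReader, which is iterable):

SeekableInput input = new FsInput(new Path("data.avro"), conf); // hypothetical path
DatumReader<GenericRecord> datumReader = new GenericDatumReader<GenericRecord>();
FileReader<GenericRecord> fileReader = DataFileReader.openReader(input, datumReader);
for (GenericRecord record : fileReader) {
  // process record
}
fileReader.close();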
Interoperability

To demonstrate Avro's language interoperability, let's write a datafile using one language (Python) and read it back with another (Java).

Python API

The program in Example 12-1 reads comma-separated strings from standard input and writes them as StringPair records to an Avro datafile. As in the Java code for writing a datafile, we create a DatumWriter and a DataFileWriter object. Notice that we have embedded the Avro schema in the code, although we could equally well have read it from a file.

Python represents Avro records as dictionaries; each line that is read from standard in is turned into a dict object and appended to the DataFileWriter.

Example 12-1. A Python program for writing Avro record pairs to a datafile

import os
import string
import sys
from avro import schema
from avro import io
from avro import datafile

if __name__ == '__main__':
  if len(sys.argv) != 2:
    sys.exit('Usage: %s <data_file>' % sys.argv[0])
  avro_file = sys.argv[1]
  writer = open(avro_file, 'wb')
  datum_writer = io.DatumWriter()
  schema_object = schema.parse("""\
{ "type": "record",
  "name": "StringPair",
  "doc": "A pair of strings.",
  "fields": [
    {"name": "left", "type": "string"},
    {"name": "right", "type": "string"}
  ]
}""")
  dfw = datafile.DataFileWriter(writer, datum_writer, schema_object)
  for line in sys.stdin.readlines():
    (left, right) = string.split(line.strip(), ',')
    dfw.append({'left': left, 'right': right})
  dfw.close()

Before we can run the program, we need to install Avro for Python:

% easy_install avro

To run the program, we specify the name of the file to write output to (pairs.avro) and send input pairs over standard in, marking the end of file by typing Ctrl-D:

% python ch12-avro/src/main/py/write_pairs.py pairs.avro
a,1
c,2
b,3
b,2
^D

Avro Tools

Next, we'll use the Avro tools (written in Java) to display the contents of pairs.avro.
The tools JAR is available from the Avro website; here we assume it's been placed in a local directory called $AVRO_HOME. The tojson command converts an Avro datafile to JSON and prints it to the console:

% java -jar $AVRO_HOME/avro-tools-*.jar tojson pairs.avro
{"left":"a","right":"1"}
{"left":"c","right":"2"}
{"left":"b","right":"3"}
{"left":"b","right":"2"}

We have successfully exchanged complex data between two Avro implementations (Python and Java).
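Incidentally, the tools JAR also has a getschema command, which prints the writer's schema stored in a datafile's metadata; running it against pairs.avro should show the StringPair schema that the Python program embedded (output not reproduced here):

% java -jar $AVRO_HOME/avro-tools-*.jar getschema pairs.avro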
Schema Resolution

We can choose to use a different schema for reading the data back (the reader's schema) from the one we used to write it (the writer's schema). This is a powerful tool because it enables schema evolution. To illustrate, consider a new schema for string pairs with an added description field:

{
  "type": "record",
  "name": "StringPair",
  "doc": "A pair of strings with an added field.",
  "fields": [
    {"name": "left", "type": "string"},
    {"name": "right", "type": "string"},
    {"name": "description", "type": "string", "default": ""}
  ]
}

We can use this schema to read the data we serialized earlier because, crucially, we have given the description field a default value (the empty string),4 which Avro will use when there is no such field defined in the records it is reading. Had we omitted the default attribute, we would get an error when trying to read the old data.

4. Default values for fields are encoded using JSON. See the Avro specification for a description of this encoding for each data type.

To make the default value null rather than the empty string, we would instead define the description field using a union with the null Avro type:

{"name": "description", "type": ["null", "string"], "default": null}

When the reader's schema is different from the writer's, we use the constructor for GenericDatumReader that takes two schema objects, the writer's and the reader's, in that order:

DatumReader<GenericRecord> reader =
    new GenericDatumReader<GenericRecord>(schema, newSchema);
Decoder decoder = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
GenericRecord result = reader.read(null, decoder);
assertThat(result.get("left").toString(), is("L"));
assertThat(result.get("right").toString(), is("R"));
assertThat(result.get("description").toString(), is(""));

For datafiles, which have the writer's schema stored in the metadata, we only need to specify the reader's schema explicitly, which we can do by passing null for the writer's schema:

DatumReader<GenericRecord> reader =
    new GenericDatumReader<GenericRecord>(null, newSchema);
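For instance, here is a sketch of reading pairs.avro with the evolved schema, assuming newSchema is the parsed schema shown above; each record's description falls back to its default:

DatumReader<GenericRecord> reader =
    new GenericDatumReader<GenericRecord>(null, newSchema);
DataFileReader<GenericRecord> dataFileReader =
    new DataFileReader<GenericRecord>(new File("pairs.avro"), reader);
for (GenericRecord record : dataFileReader) {
  // description was never written, so it resolves to its default, ""
  System.out.println(record.get("left") + "," + record.get("right")
      + ",[" + record.get("description") + "]");
}
dataFileReader.close();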
Another common use of a different reader's schema is to drop fields in a record, an operation called projection. This is useful when you have records with a large number of fields and you want to read only some of them. For example, this schema can be used to get only the right field of a StringPair:

{
  "type": "record",
  "name": "StringPair",
  "doc": "The right field of a pair of strings.",
  "fields": [
    {"name": "right", "type": "string"}
  ]
}
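Reading with a projection works the same way as the evolved-schema read; a minimal sketch, assuming the projection schema above has been parsed into projectionSchema:

DatumReader<GenericRecord> reader =
    new GenericDatumReader<GenericRecord>(null, projectionSchema);
DataFileReader<GenericRecord> dataFileReader =
    new DataFileReader<GenericRecord>(new File("pairs.avro"), reader);
for (GenericRecord record : dataFileReader) {
  // only right survives the projection; record.get("left") would return null
  System.out.println(record.get("right"));
}
dataFileReader.close();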
The rules for schema resolution have a direct bearing on how schemas may evolve from one version to the next, and are spelled out in the Avro specification for all Avro types. A summary of the rules for record evolution from the point of view of readers and writers (or servers and clients) is presented in Table 12-4.

Table 12-4. Schema resolution of records

New schema     Writer  Reader  Action
Added field    Old     New     The reader uses the default value of the new field, since it is not written by the writer.
Added field    New     Old     The reader does not know about the new field written by the writer, so it is ignored (projection).
Removed field  Old     New     The reader ignores the removed field (projection).
Removed field  New     Old     The removed field is not written by the writer. If the old schema had a default defined for the field, the reader uses this; otherwise, it gets an error. In this case, it is best to update the reader's schema, either at the same time as or before the writer's.

Another useful technique for evolving Avro schemas is the use of name aliases.
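For example, a reader's schema along these lines (a sketch of the idea) could rename the StringPair fields, using aliases to match the names the writer used:

{
  "type": "record",
  "name": "StringPair",
  "doc": "A pair of strings with renamed fields.",
  "fields": [
    {"name": "first", "type": "string", "aliases": ["left"]},
    {"name": "second", "type": "string", "aliases": ["right"]}
  ]
}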