Write to an Avro file from a Spark job in local mode

Avro (row-oriented) and Parquet (column-oriented) are HDFS (Hadoop Distributed File System) friendly binary file formats: they store data in a compressed, splittable form that suits MapReduce/Spark tasks. These formats are used in "write once, read forever" type use cases.

What is the Avro file format?

Avro stores the schema (i.e. the data definition) in JSON format, making it easy to read and interpret, while the data itself is stored in a compact, efficient binary format. Avro files include sync markers that can be used to split large data sets into subsets suitable for MapReduce processing. Because Avro stores the schema in the file header, the data is self-describing. Avro is a language-neutral data serialization system and can be processed by many languages (currently C, C++, C#, Java, Python, and Ruby).

Writing an Avro file from Spark

Step 1: The Avro library needs to be added via the pom.xml as shown below.
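A minimal sketch of the relevant dependencies, assuming Avro 1.8.x and Spark 2.x built for Scala 2.11; the exact versions (and whether avro-mapred needs the hadoop2 classifier) are assumptions and should match your cluster:

```xml
<!-- Illustrative dependencies; versions are assumptions, align them with your cluster -->
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>2.3.0</version>
</dependency>
<dependency>
    <groupId>org.apache.avro</groupId>
    <artifactId>avro</artifactId>
    <version>1.8.2</version>
</dependency>
<dependency>
    <groupId>org.apache.avro</groupId>
    <artifactId>avro-mapred</artifactId>
    <version>1.8.2</version>
    <classifier>hadoop2</classifier>
</dependency>
```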

Step 2: A Spark job that runs in local mode and writes an Avro file to HDFS.
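A minimal sketch of such a job (not the article's exact code): the class name, the "order" schema with order_id/order_name fields, the sample records, and the HDFS path on the Cloudera QuickStart VM are all assumptions for illustration.

```java
import java.util.Arrays;
import java.util.List;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.mapred.AvroKey;
import org.apache.avro.mapreduce.AvroJob;
import org.apache.avro.mapreduce.AvroKeyOutputFormat;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class WriteAvroFromSpark {

    public static void main(String[] args) throws Exception {
        // Local mode: the driver and executors all run inside this single JVM.
        SparkConf conf = new SparkConf().setAppName("write-avro").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Illustrative Avro schema, defined in JSON as described above.
        String schemaString = "{\"type\":\"record\",\"name\":\"order\",\"fields\":["
                + "{\"name\":\"order_id\",\"type\":\"long\"},"
                + "{\"name\":\"order_name\",\"type\":\"string\"}]}";
        Schema schema = new Schema.Parser().parse(schemaString);

        // Three illustrative rows to be written as Avro records.
        List<Tuple2<Long, String>> orders = Arrays.asList(
                new Tuple2<>(1L, "pen"),
                new Tuple2<>(2L, "pencil"),
                new Tuple2<>(3L, "eraser"));

        // Pair RDD: key = AvroKey<GenericRecord>, value = Void (null), because an Avro
        // container file holds records only, not key/value pairs.
        JavaPairRDD<AvroKey<GenericRecord>, Void> avroRdd = sc.parallelize(orders)
                .mapToPair(t -> {
                    // Re-parse the schema inside the task: Schema is not java.io.Serializable.
                    GenericRecord record = new GenericData.Record(new Schema.Parser().parse(schemaString));
                    record.put("order_id", t._1());
                    record.put("order_name", t._2());
                    return new Tuple2<>(new AvroKey<>(record), (Void) null);
                });

        // Register the writer schema on the Hadoop job, then write the container file to HDFS.
        Job job = Job.getInstance();
        AvroJob.setOutputKeySchema(job, schema);
        avroRdd.saveAsNewAPIHadoopFile(
                "hdfs://quickstart.cloudera:8020/user/cloudera/orders-avro", // assumed output path
                AvroKey.class,
                NullWritable.class,
                AvroKeyOutputFormat.class,
                job.getConfiguration());

        sc.close();
    }
}
```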

The records are stored in binary format; the file header embeds the JSON schema, so a raw dump of the file shows the schema text followed by compact binary data.

What is a pair RDD?

It is a key/value RDD, which is useful for operations like joins (inner, left outer, and right outer), reduceByKey, foldByKey, lookup, etc. Java doesn't have a built-in class for key/value pairs, so Spark's Java API uses the scala.Tuple2 class. You create a pair with new Tuple2(elem1, elem2) and access its elements with the _1() and _2() methods. In the example above, a Tuple2 is created with the key being an "AvroKey<GenericRecord>" and the value being "Void". "saveAsNewAPIHadoopFile" is invoked on the pair RDD with "AvroKeyOutputFormat", an org.apache.hadoop.mapreduce.lib.output.FileOutputFormat<K,V> for writing Avro container files. The value is "Void" (i.e. null) because Avro container files can only contain records, not key/value pairs.
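For instance, a minimal pair RDD in the Java API (the data is illustrative and the snippet reuses the JavaSparkContext sc from the sketch above), built from Tuple2 instances and aggregated with reduceByKey:

```java
JavaPairRDD<String, Integer> sales = sc.parallelizePairs(Arrays.asList(
        new Tuple2<>("pen", 1),
        new Tuple2<>("pencil", 2),
        new Tuple2<>("pen", 3)));

// Sum the values per key, then print each pair via _1() and _2().
sales.reduceByKey(Integer::sum)
     .foreach(t -> System.out.println(t._1() + " -> " + t._2()));   // pen -> 4, pencil -> 2
```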

Avro-tools via command-line

Type “avro-tools” to get the possible command options.

Get the schema of an Avro file
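For example, after copying one of the part files from HDFS to the local file system (the file names are illustrative):

```sh
hadoop fs -get /user/cloudera/orders-avro/part-r-00000.avro orders.avro
avro-tools getschema orders.avro
```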

Output:
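(Assuming the illustrative order schema from the Spark job sketch above, the schema embedded in the file is printed as JSON, roughly:)

```json
{
  "type" : "record",
  "name" : "order",
  "fields" : [ {
    "name" : "order_id",
    "type" : "long"
  }, {
    "name" : "order_name",
    "type" : "string"
  } ]
}
```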

Get the contents of an Avro file as JSON
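For example (file name illustrative, as above):

```sh
avro-tools tojson orders.avro
```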

Output:
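(For the three illustrative order records written above, one JSON object is printed per record:)

```json
{"order_id":1,"order_name":"pen"}
{"order_id":2,"order_name":"pencil"}
{"order_id":3,"order_name":"eraser"}
```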

Convert Avro to the Trevni columnar format
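For example, using the totrevni command (file names illustrative):

```sh
avro-tools totrevni orders.avro orders.trv
```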

The values of the same column are stored next to each other: all the order_ids are stored sequentially for the 3 rows, and all the order_names are stored together for the 3 rows.

Convert back to JSON.
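For example, with the trevni_tojson command (file name illustrative):

```sh
avro-tools trevni_tojson orders.trv
```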

Use AvroKeyValueOutputFormat.class instead of AvroKeyOutputFormat.class

This stores the records with a "key" and a "value". The same schema is used for both to keep it simple.
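A minimal sketch of the changed part of the job, reusing sc, orders, schemaString and schema from the earlier sketch; the output path is again an assumption:

```java
import org.apache.avro.mapred.AvroValue;
import org.apache.avro.mapreduce.AvroKeyValueOutputFormat;

// ...

// Pair RDD: key = AvroKey<GenericRecord>, value = AvroValue<GenericRecord>.
JavaPairRDD<AvroKey<GenericRecord>, AvroValue<GenericRecord>> kvRdd = sc.parallelize(orders)
        .mapToPair(t -> {
            GenericRecord record = new GenericData.Record(new Schema.Parser().parse(schemaString));
            record.put("order_id", t._1());
            record.put("order_name", t._2());
            // The same record is used as both key and value to keep it simple.
            return new Tuple2<>(new AvroKey<>(record), new AvroValue<>(record));
        });

// Register schemas for both the key and the value, then write with AvroKeyValueOutputFormat.
Job kvJob = Job.getInstance();
AvroJob.setOutputKeySchema(kvJob, schema);
AvroJob.setOutputValueSchema(kvJob, schema);
kvRdd.saveAsNewAPIHadoopFile(
        "hdfs://quickstart.cloudera:8020/user/cloudera/orders-avro-kv", // assumed output path
        AvroKey.class,
        AvroValue.class,
        AvroKeyValueOutputFormat.class,
        kvJob.getConfiguration());
```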

Again the data is stored in binary format; each entry in the container file is a nested record with "key" and "value" fields, and the header embeds the combined JSON schema.

