02: Spark tutorial – reading a file from HDFS

This extends “Spark tutorial – writing a file from a local file system to HDFS”.

This tutorial assumes that you have set up Cloudera as per the “cloudera quickstart vm tutorial installation” videos that you can find on YouTube or via a Google search. You can install it on VMware (non-commercial use) or on VirtualBox. I am using VMware. Cloudera requires at least 8 GB of RAM, and 16 GB is recommended.

Step 1: The “ReadFromHdfs” Java class.
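Below is a minimal sketch of what the “ReadFromHdfs” class could look like; the package name “com.mytutorial” is a placeholder, and the class simply prints each record to the console.

package com.mytutorial; // placeholder package name

import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class ReadFromHdfs {

    public static void main(String[] args) {
        // args[0] is the HDFS path, e.g. hdfs://localhost:8020/sampledata/sample.txt
        SparkConf conf = new SparkConf().setAppName("ReadFromHdfs");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Each element of the JavaRDD is one record (i.e. one line) from the file(s)
        JavaRDD<String> lines = sc.textFile(args[0]);

        // Bring the records to the driver and print them
        List<String> records = lines.collect();
        for (String record : records) {
            System.out.println(record);
        }

        sc.close();
    }
}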

Step 2: Build the jar file “sequence-file-1.0-SNAPSHOT.jar” with Maven:
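Assuming a standard Maven project (the jar name suggests artifactId “sequence-file” and version “1.0-SNAPSHOT”), a command along these lines produces the jar under the target directory:

# build the application jar from the project root
mvn clean package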

Step 3: Run it as a Spark job with “spark-submit”:
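For example (the class name matches the sketch in Step 1, and “--master yarn” assumes the Cloudera VM’s YARN setup; adjust both to your environment):

spark-submit \
  --class com.mytutorial.ReadFromHdfs \
  --master yarn \
  sequence-file-1.0-SNAPSHOT.jar \
  hdfs://localhost:8020/sampledata/sample.txt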

Output: the records (i.e. lines) read from HDFS are printed to the console.

Key Points

The “args[0]” can be a single text file or a directory that contains multiple files. The “JavaRDD” will contain the individual records from the file(s). This is useful for processing CSV files, which can be split by record.

“args[0]” can be

1) A file: hdfs://localhost:8020/sampledata/sample.txt

2) A folder with many files: hdfs://localhost:8020/sampledata
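Both cases use the same call; a quick sketch with the paths above:

JavaRDD<String> singleFile = sc.textFile("hdfs://localhost:8020/sampledata/sample.txt"); // one file
JavaRDD<String> allFiles = sc.textFile("hdfs://localhost:8020/sampledata");              // every file in the folder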

What if you want to read each whole file into the RDD as a single entry, instead of individual records from all the files?

You need to use the “wholeTextFiles” method.

The “Tuple2<String, String>” will hold the “file name (full HDFS path)” and the “file contents” respectively. You can then process one file at a time. Handy for non-splittable file formats like XML and JSON.
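A minimal sketch, reusing the “sc” context from Step 1:

import org.apache.spark.api.java.JavaPairRDD;
import scala.Tuple2;

// One Tuple2 per file: _1 = file name (full HDFS path), _2 = entire file contents
JavaPairRDD<String, String> files = sc.wholeTextFiles(args[0]);

for (Tuple2<String, String> file : files.collect()) {
    System.out.println("File name: " + file._1);
    System.out.println("Contents: " + file._2);
}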

Reading a sequence file from HDFS

A sequence file consists of binary data stored as key/value pairs. Sequence files support compression, are splittable, and can solve the small-files problem by combining many small text files into a single sequence file.
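A minimal sketch, assuming the sequence file was written with Text keys and Text values (the path “sample.seq” is illustrative, and the Writable classes must match how the file was actually written):

import org.apache.hadoop.io.Text;
import org.apache.spark.api.java.JavaPairRDD;

JavaPairRDD<Text, Text> pairs = sc.sequenceFile(
        "hdfs://localhost:8020/sampledata/sample.seq", Text.class, Text.class);

// Convert Writables to Strings before collecting: Hadoop reuses Writable
// objects, so collecting Text instances directly can yield repeated values
for (String record : pairs.map(pair -> pair._1.toString() + " -> " + pair._2.toString()).collect()) {
    System.out.println(record);
}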

Complete tutorial: Spark tutorial – reading a Sequence File from HDFS

How will you get an RDD for a Hadoop file with an arbitrary InputFormat?

1) By using the “newAPIHadoopFile” function (see the sketch below).

Complete tutorials: 1) Convert XML file To Sequence File with Apache Spark – writing & reading, and 2) Convert XML file To an Avro File with Apache Spark – writing & reading.
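A minimal sketch using the new-API “TextInputFormat” (any InputFormat from the “org.apache.hadoop.mapreduce” package can be substituted):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Returns one (key, value) pair per record as defined by the InputFormat;
// for TextInputFormat the key is the byte offset and the value is the line
JavaPairRDD<LongWritable, Text> hadoopRdd = sc.newAPIHadoopFile(
        args[0],               // file or folder on HDFS
        TextInputFormat.class, // the arbitrary InputFormat
        LongWritable.class,    // key class
        Text.class,            // value class
        new Configuration());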

Why is it named “newAPIHadoopFile”?

old API -> org.apache.hadoop.mapred
new API -> org.apache.hadoop.mapreduce

You need to have the right imports as in:
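For example, with “TextInputFormat” the import must come from the new “mapreduce” package, not the old “mapred” one:

// New API - works with newAPIHadoopFile
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Old API equivalent - do NOT mix this with newAPIHadoopFile
// import org.apache.hadoop.mapred.TextInputFormat;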

