09. Getting started with Spark & Hadoop – client mode on local & cluster mode on yarn

This extends “Setting up & getting started with Spark local mode with Sbt & Scala” and “Setting up & getting started with Hadoop on Mac”.

The Spark job written in Scala will be reading the data from HDFS.

1. CSV data in HDFS

2. Make sure HDFS & YARN services are up

You can check if the services are up with:
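One way to do this, assuming a single-node setup and that the JDK's jps tool is on your PATH, is to look for the expected daemon names in its output. The helper name check_hadoop_services and the daemon list below are illustrative assumptions, not part of Hadoop itself:

```shell
# Daemons expected on a single-node (pseudo-distributed) setup (assumed list).
EXPECTED_DAEMONS="NameNode DataNode ResourceManager NodeManager"

check_hadoop_services() {
  # jps lists running JVM processes as "<pid> <main class name>".
  running="$(jps 2>/dev/null || true)"
  for daemon in $EXPECTED_DAEMONS; do
    # -w matches whole words, so "NameNode" does not match "SecondaryNameNode".
    if printf '%s\n' "$running" | grep -qw "$daemon"; then
      echo "$daemon: up"
    else
      echo "$daemon: down"
    fi
  done
}
```

Calling check_hadoop_services prints one line per daemon; any daemon reported as down means the corresponding service needs starting.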

If the services are down, start them with:
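A minimal sketch using the scripts that ship with Hadoop, assuming its sbin directory is on your PATH. The wrapper name start_hadoop_services is hypothetical:

```shell
start_hadoop_services() {
  # start-dfs.sh brings up the HDFS daemons (NameNode, DataNode, SecondaryNameNode).
  # start-yarn.sh brings up the YARN daemons (ResourceManager, NodeManager).
  start-dfs.sh &&
  start-yarn.sh
}
```

Run start_hadoop_services, give the daemons a few seconds, and then check again that they are listed.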

3. Load the data on to HDFS from the local file system
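The upload can be sketched with the HDFS file system shell. The local file name employees.csv, the HDFS directory /data, and the helper name load_csv_to_hdfs are all hypothetical placeholders; substitute your own:

```shell
LOCAL_CSV="employees.csv"   # hypothetical CSV file on the local file system
HDFS_DIR="/data"            # hypothetical target directory in HDFS

load_csv_to_hdfs() {
  # Create the target directory (no error if it already exists),
  # copy the file, overwriting any previous copy, then list the directory.
  hdfs dfs -mkdir -p "$HDFS_DIR" &&
  hdfs dfs -put -f "$LOCAL_CSV" "$HDFS_DIR/" &&
  hdfs dfs -ls "$HDFS_DIR"
}
```

Run load_csv_to_hdfs from the directory that holds the CSV file; the final listing should show the uploaded file.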

4. Spark job to read from HDFS

This is an extension of the SimpleSpark.java covered in the previous Spark tutorials in the series “setting-up-getting-started-with-scala-sbt-spark-hadoop”.
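As a rough sketch of what such a job might look like (not necessarily the original code), the snippet below writes a small Scala source file. The file location, the object name SimpleSpark, the HDFS URI hdfs://localhost:9000/data/employees.csv, and the presence of a header row in the CSV are all assumptions:

```shell
JOB_FILE="src/main/scala/SimpleSpark.scala"   # assumed sbt layout; an existing file is overwritten
mkdir -p "$(dirname "$JOB_FILE")"

cat > "$JOB_FILE" <<'EOF'
import org.apache.spark.sql.SparkSession

object SimpleSpark {
  def main(args: Array[String]): Unit = {
    // The local master is hard-coded so the job can be run straight from the IDE.
    val spark = SparkSession.builder()
      .appName("SimpleSpark")
      .config("spark.master", "local")
      .getOrCreate()

    // Hypothetical path; host and port must match fs.defaultFS in core-site.xml.
    val df = spark.read
      .option("header", "true") // assumes the first CSV row holds column names
      .csv("hdfs://localhost:9000/data/employees.csv")

    df.show()
    spark.stop()
  }
}
EOF
```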

Within Eclipse you can run it via “Run As” -> “Scala Application”.


5. Package with sbt
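Packaging can be sketched as follows, assuming sbt is installed and the command is run from the project root (the directory containing build.sbt). The helper name package_job is hypothetical:

```shell
package_job() {
  # "sbt package" compiles the project and writes the jar under target/scala-<version>/.
  sbt package &&
  # List the produced jar(s) so the path can be passed to spark-submit.
  ls target/scala-*/*.jar
}
```

The jar path printed at the end is the one to hand to spark-submit in the next step.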

6. Spark-submit local & client mode

Run spark-submit with the master set to local and the deploy mode set to client:
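A sketch of such a command, wrapped in a hypothetical helper named submit_local_client. The main class name and the jar path are assumptions; use the class from your source and the jar that sbt actually produced:

```shell
APP_CLASS="SimpleSpark"                               # assumed main class
APP_JAR="target/scala-2.11/simplespark_2.11-1.0.jar"  # hypothetical jar path

submit_local_client() {
  # --master local runs Spark in a single local JVM;
  # --deploy-mode client keeps the driver in the spark-submit process.
  spark-submit \
    --class "$APP_CLASS" \
    --master local \
    --deploy-mode client \
    "$APP_JAR"
}
```

Because the driver runs in the submitting process, calling submit_local_client prints the job output directly in the terminal.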

You will get the same output as when running it from Eclipse.

7. Spark-submit yarn & cluster mode

Run spark-submit with the master set to yarn and the deploy mode set to cluster:

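First, remove .config("spark.master", "local") from the source code. A master set in the code takes precedence over the --master option of spark-submit, so leaving it in would keep the job running locally. The source code should look like: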
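A sketch of the adjusted source under the same assumptions as before (hypothetical file location, object name SimpleSpark, HDFS URI, and a CSV with a header row); the only intended difference is that no master is set in the code:

```shell
JOB_FILE="src/main/scala/SimpleSpark.scala"   # assumed sbt layout; an existing file is overwritten
mkdir -p "$(dirname "$JOB_FILE")"

cat > "$JOB_FILE" <<'EOF'
import org.apache.spark.sql.SparkSession

object SimpleSpark {
  def main(args: Array[String]): Unit = {
    // No master is set here; it is supplied by spark-submit (--master yarn).
    val spark = SparkSession.builder()
      .appName("SimpleSpark")
      .getOrCreate()

    // Hypothetical path; host and port must match fs.defaultFS in core-site.xml.
    val df = spark.read
      .option("header", "true") // assumes the first CSV row holds column names
      .csv("hdfs://localhost:9000/data/employees.csv")

    df.show()
    spark.stop()
  }
}
EOF
```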

Package it again with sbt, as in step 5.

Set the following environment variables (e.g. ~/.bash_profile or ~/.bashrc):
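For Spark on YARN these are typically HADOOP_CONF_DIR and YARN_CONF_DIR, which point Spark at the client-side Hadoop configuration files. The install location /usr/local/hadoop below is a hypothetical default; adjust it to wherever Hadoop lives on your machine:

```shell
# Hypothetical install location, used only if HADOOP_HOME is not already set.
export HADOOP_HOME="${HADOOP_HOME:-/usr/local/hadoop}"

# Directory holding core-site.xml, hdfs-site.xml and yarn-site.xml.
export HADOOP_CONF_DIR="$HADOOP_HOME/etc/hadoop"
export YARN_CONF_DIR="$HADOOP_HOME/etc/hadoop"
```

After adding the lines, open a new terminal or source the profile file so that spark-submit sees the variables.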

Run it on yarn in cluster mode:
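A sketch of the submission, wrapped in a hypothetical helper named submit_yarn_cluster. As before, the main class name and the jar path are assumptions:

```shell
APP_CLASS="SimpleSpark"                               # assumed main class
APP_JAR="target/scala-2.11/simplespark_2.11-1.0.jar"  # hypothetical jar path

submit_yarn_cluster() {
  # --master yarn hands resource management to YARN;
  # --deploy-mode cluster runs the driver inside a YARN container.
  spark-submit \
    --class "$APP_CLASS" \
    --master yarn \
    --deploy-mode cluster \
    "$APP_JAR"
}
```

Since the driver runs inside the cluster, the terminal only shows the application status; the job's own output ends up in the YARN container logs.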

Check the YARN ResourceManager UI at http://localhost:8088/cluster for the output: click on the application id and then on the logs.
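As an alternative to the web UI, the logs can usually be fetched from the command line. This sketch assumes the yarn CLI is on your PATH and that log aggregation is enabled on the cluster; the helper name show_app_logs is hypothetical:

```shell
show_app_logs() {
  # $1: the application id as shown in the ResourceManager UI.
  yarn logs -applicationId "$1"
}
```

Pass the application id from the UI as the only argument to show_app_logs.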


You can practice the Spark code covered in these Spark tutorials on Apache Zeppelin on Docker.
