02: Spark on Zeppelin – read a file from local file system

Prerequisite: Docker is installed on your machine, on either Mac OS X (e.g. $ brew install --cask docker) or Windows 10. This tutorial extends the earlier one on setting up Apache Zeppelin Notebook.

Step 1: Pull the image from Docker Hub, and build it with the following command.

You can verify the image with the “docker images” command.

Step 2: Place the input file “employees.txt” in the $(pwd)/seed folder.

Step 3: Run the container with the above image.

Note: $(pwd)/seed is the folder on the host system where the employees.txt input file is placed. It is mapped to the container path “/zeppelin/seed”, so files in it are visible inside the container.

You can inspect the container files/logs with the following commands in a separate terminal window:

Get the container ID with the “docker ps” command.

Open a shell in the container with “docker exec -it <container-id> sh”.

Step 4: Open the Zeppelin notebook in a web browser at “http://localhost:8080”. Create a notebook with “spark” as the default interpreter.

You can view the “SQL” output in multiple formats, such as a table, bar chart, or pie chart.

Zeppelin read a file via Spark
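
A minimal sketch of a notebook paragraph that reads the file, assuming Spark 2.0 or later and assuming that employees.txt is comma-delimited with a header row. The column names (id, name, location, salary) and the view name “employees” are hypothetical and used only for illustration; adjust them to match your file.

```scala
import org.apache.spark.sql.SparkSession

// Zeppelin normally predefines `spark`; getOrCreate() should return that same session.
val session = SparkSession.builder().getOrCreate()

// ASSUMPTION: employees.txt is comma-delimited with a header row such as
//   id,name,location,salary
// The file:// scheme makes Spark read from the container's local file system.
val employees = session.read
  .option("header", "true")       // first line holds the column names
  .option("inferSchema", "true")  // let Spark infer numeric column types
  .csv("file:///zeppelin/seed/employees.txt")

employees.printSchema()
employees.show()

// Register a temporary view (hypothetical name) so SQL paragraphs can query the data.
employees.createOrReplaceTempView("employees")
```

Once the view is registered, it can be queried from a paragraph that starts with %sql, which is where the chart views become available.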


You can group by location, and output the total salary per location with the following query:
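
A minimal sketch of such a query, run through the Scala API so that it stands on its own. It assumes Spark 2.0 or later and a comma-delimited employees.txt with a header row; the column names “location” and “salary” and the view name “employees” are assumptions, not taken from the original file.

```scala
import org.apache.spark.sql.SparkSession

// Zeppelin normally predefines `spark`; getOrCreate() should return that same session.
val session = SparkSession.builder().getOrCreate()

// ASSUMPTION: the file has a header row that includes "location" and "salary".
session.read
  .option("header", "true")
  .option("inferSchema", "true")  // needed so salary is summed as a number
  .csv("file:///zeppelin/seed/employees.txt")
  .createOrReplaceTempView("employees")  // hypothetical view name

// Total salary per location.
val totalsByLocation = session.sql(
  """SELECT location, SUM(salary) AS total_salary
    |FROM employees
    |GROUP BY location
    |ORDER BY location""".stripMargin)

totalsByLocation.show()
```

The same SELECT statement can be placed on its own in a %sql paragraph to get Zeppelin's table and chart views of the result.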

Zeppelin spark SQL


Alternatively, you can achieve similar results via Spark DataFrame operations as shown below.
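
A minimal sketch of the DataFrame version, under the same assumptions: Spark 2.0 or later, a comma-delimited employees.txt with a header row, and hypothetical column names “location” and “salary”.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

// Zeppelin normally predefines `spark`; getOrCreate() should return that same session.
val session = SparkSession.builder().getOrCreate()

// ASSUMPTION: the file has a header row that includes "location" and "salary".
val employees = session.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("file:///zeppelin/seed/employees.txt")

// Equivalent of GROUP BY location with SUM(salary), expressed as DataFrame operations.
val totalsByLocation = employees
  .groupBy("location")
  .agg(sum("salary").as("total_salary"))
  .orderBy("location")

totalsByLocation.show()
```

The DataFrame form keeps the whole pipeline in Scala, which can be convenient when the aggregation is only one step in a longer transformation.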
