01B: Spark on Zeppelin – custom Dockerfile

Prerequisite: Docker is installed on your machine, either macOS (e.g. $ brew install --cask docker; older Homebrew versions used $ brew cask install docker) or Windows 10.

What is Apache Zeppelin?

Zeppelin is a web-based notebook for executing arbitrary code in Scala, SQL, Spark, etc., and you can mix languages within a single note. Apache Zeppelin helps data analysts, data scientists, and business users get a better understanding of data. As outlined below, they can quickly explore data, create visualizations, and share their insights as web pages with various stakeholders. For example:

1) Prepare data using Shell by, say, downloading files with curl/wget and then loading them into HDFS.

2) Perform data analytics with Spark (i.e. Scala) or PySpark (i.e. Python).

3) Perform simple visualizations in SQL.

4) Export the results with Shell, and publish to create graphs.

How to install Apache Zeppelin on Docker?

You need around 4 GB of disk space to create a Docker container with Ubuntu and Apache Zeppelin.

[Image: Apache Zeppelin on Docker]

Step 1: Create a folder, say “docker-zepplin”, under a folder named “projects”. Within the docker-zepplin folder, create a file named “Dockerfile” that installs Java on Ubuntu and then the Zeppelin notebook.

The Dockerfile used here was simplified from the one behind the image “apache/zeppelin”, which is already available on Docker Hub. It installs all the conda packages, Python-related packages, and R-language-related packages.
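As a rough illustration of Step 1, the commands below create the project folder and write a minimal Dockerfile into it. This is a sketch and not the author's original file: the base image tag (ubuntu:16.04), the package names, the /zeppelin install path, the archive.apache.org download URL, and the entrypoint location are all assumptions, and it deliberately leaves out the conda, Python, and R packages.

```shell
# Hypothetical location; the article only names the folders "projects" and "docker-zepplin".
PROJECT_DIR="$HOME/projects/docker-zepplin"
mkdir -p "$PROJECT_DIR"

# Write a minimal Dockerfile. The quoted 'EOF' keeps ${...} literal so that
# Docker, not this shell, expands the variables at build time.
cat > "$PROJECT_DIR/Dockerfile" <<'EOF'
# Assumed base image; any recent Ubuntu with a Java 8 package should work similarly.
FROM ubuntu:16.04

ENV ZEPPELIN_VERSION=0.8.0
ENV ZEPPELIN_HOME=/zeppelin
# Listen on all interfaces so the published port is reachable from the host.
ENV ZEPPELIN_ADDR=0.0.0.0

# Java 8 plus the tools needed to fetch the Zeppelin binary.
RUN apt-get update && \
    apt-get install -y --no-install-recommends openjdk-8-jdk-headless wget ca-certificates && \
    rm -rf /var/lib/apt/lists/*

# Download the "bin-all" package (all interpreters) and unpack it into ZEPPELIN_HOME.
RUN wget -q "https://archive.apache.org/dist/zeppelin/zeppelin-${ZEPPELIN_VERSION}/zeppelin-${ZEPPELIN_VERSION}-bin-all.tgz" -O /tmp/zeppelin.tgz && \
    mkdir -p "${ZEPPELIN_HOME}" && \
    tar -xzf /tmp/zeppelin.tgz -C "${ZEPPELIN_HOME}" --strip-components=1 && \
    rm /tmp/zeppelin.tgz

# The entrypoint script lives in the "scripts" folder of the project.
COPY scripts/docker-entrypoint.sh /usr/local/bin/docker-entrypoint.sh
RUN chmod +x /usr/local/bin/docker-entrypoint.sh

EXPOSE 8080
WORKDIR ${ZEPPELIN_HOME}
ENTRYPOINT ["/usr/local/bin/docker-entrypoint.sh"]
EOF
```

Writing the file through a here-document keeps the whole step copy-pasteable; you can equally create the Dockerfile in an editor.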

The page https://zeppelin.apache.org/download.html has the links to the Zeppelin downloads. Get the binary package with all interpreters (e.g. zeppelin-0.8.0-bin-all.tgz).

Step 2: Create a “scripts” folder inside docker-zepplin, and in it create the “docker-entrypoint.sh” file, which is the script the container runs on startup.
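A minimal sketch of Step 2 could look like the following. The script body is an assumption rather than the author's original: it supposes the Dockerfile sets ZEPPELIN_HOME (falling back to a hypothetical /zeppelin) and simply starts Zeppelin in the foreground with the bin/zeppelin.sh launcher that ships in the binary package.

```shell
# Hypothetical project location, matching the folder names used in Step 1.
PROJECT_DIR="$HOME/projects/docker-zepplin"
mkdir -p "$PROJECT_DIR/scripts"

# Write the entrypoint script; the quoted 'EOF' prevents expansion on the host.
cat > "$PROJECT_DIR/scripts/docker-entrypoint.sh" <<'EOF'
#!/bin/bash
set -e
# ZEPPELIN_HOME is assumed to be set by the Dockerfile; fall back to /zeppelin.
ZEPPELIN_HOME="${ZEPPELIN_HOME:-/zeppelin}"
# zeppelin.sh runs the server in the foreground, which keeps the container alive.
exec "$ZEPPELIN_HOME/bin/zeppelin.sh"
EOF

# Make the script executable so that Docker can run it as the entrypoint.
chmod +x "$PROJECT_DIR/scripts/docker-entrypoint.sh"
```

Using exec replaces the shell with the Zeppelin process, so stop signals from Docker reach Zeppelin directly.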

Step 3: Build a Docker image with the docker build command, run from the docker-zepplin folder. The command looks for a file named “Dockerfile” in the current (i.e. “.”) folder.
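The build in Step 3 might be run as follows. The image name and tag (zeppelin-custom:0.8.0) and the project path are hypothetical; substitute your own.

```shell
# Hypothetical names; the article does not specify an image name.
IMAGE_NAME="zeppelin-custom:0.8.0"
PROJECT_DIR="$HOME/projects/docker-zepplin"

# -t names the resulting image; "." is the build context that holds the Dockerfile.
# The subshell keeps the cd from affecting the rest of your session.
( cd "$PROJECT_DIR" && docker build -t "$IMAGE_NAME" . ) \
  || echo "docker build failed; check that Docker is running and the Dockerfile exists" >&2
```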

The build may take a while because it downloads the roughly 1 GB zeppelin-0.8.0-bin-all.tgz. Once the image is created, you can verify it by listing your local images with the docker images command.
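For example, assuming the image was tagged with the hypothetical repository name zeppelin-custom, the check could be:

```shell
# Hypothetical repository name; use whatever you passed to "docker build -t".
IMAGE_REPO="zeppelin-custom"

# Lists only the images in that repository, with their tags, IDs, and sizes.
docker images "$IMAGE_REPO" \
  || echo "could not list images; check that Docker is running" >&2
```

Running docker images without an argument lists every local image instead.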

Step 4: You can now run a Docker container from this image with the docker run command, publishing port 8080 so that the Zeppelin UI is reachable from your machine.
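A sketch of the run command, assuming the hypothetical image name zeppelin-custom:0.8.0 and a container name of zeppelin:

```shell
# Hypothetical names; match IMAGE_NAME to the tag used when building the image.
IMAGE_NAME="zeppelin-custom:0.8.0"
CONTAINER_NAME="zeppelin"

# -d      run in the background
# --rm    remove the container when it stops
# -p      publish container port 8080 (the Zeppelin UI) on host port 8080
docker run -d --rm --name "$CONTAINER_NAME" -p 8080:8080 "$IMAGE_NAME" \
  || echo "docker run failed; check that the image exists and port 8080 is free" >&2
```

You can follow the startup log with docker logs -f zeppelin and stop the container with docker stop zeppelin.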

Step 5: Go to a browser and type: “http://localhost:8080”.
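Zeppelin can take a little while to start, so the page may not load immediately. As an optional convenience, the hypothetical helper below (wait_for_zeppelin is not part of Docker or Zeppelin) polls the URL with curl until it answers or a retry limit is reached.

```shell
# wait_for_zeppelin URL [TRIES] [DELAY_SECONDS]
# Returns 0 as soon as the URL responds successfully, 1 after TRIES failed attempts.
wait_for_zeppelin() {
  url="$1"
  tries="${2:-30}"
  delay="${3:-2}"
  i=0
  while [ "$i" -lt "$tries" ]; do
    # -f: treat HTTP errors as failures; -s: silent; -o /dev/null: discard the page body
    if curl -fs -o /dev/null "$url"; then
      echo "Zeppelin is up at $url"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "Zeppelin did not respond at $url" >&2
  return 1
}
```

Call it as wait_for_zeppelin http://localhost:8080 after starting the container, then open the same URL in the browser.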

[Image: Apache Zeppelin UI]

Step 6: Select the link “Create new note”, name it “Simple Spark with Scala”, and select “spark” as the interpreter.

Type simple Spark code that adds 1 to each number in a given set, for example sc.parallelize(Seq(1, 2, 3)).map(_ + 1).collect() in a %spark paragraph.

Press the play button, and the output will be:

[Image: Spark using Scala]
