Blog Archives

01A: Spark on Zeppelin – Docker pull from Docker hub

Pre-requisite: Docker is installed on your machine, e.g. on Mac OS X ($ brew cask install docker) or on Windows 10. See also: Docker interview Q&As.

What is Apache Zeppelin?

Zeppelin is a web-based notebook for executing arbitrary code in Scala, SQL, Spark, and more, and you can mix languages within a single note. Apache Zeppelin helps data analysts, data scientists, and business users get a better understanding of their data. As described below, you can quickly explore data, create visualizations, and share the insights as web pages with various stakeholders. For example:

1) Prepare data using Shell, e.g. by downloading files with curl/wget, and then ingest them into HDFS.

2) Perform data analytics with Spark (i.e. Scala) or PySpark (i.e. Python).

3) Create simple visualizations with SQL.

4) Export the results with Shell and publish them as graphs.

How to install Apache Zeppelin on Docker

Step 1: Go to Docker Hub at https://hub.docker.com/, which is the repository of images that you can pull to create isolated containers.

Step 2: Search for “Zeppelin“.

Step 3: Select "apache/zeppelin". Click on "Dockerfile" to inspect what gets installed. Click on "Build details" to get the version tag, for example "0.8.0" or "0.7.3".

Step 4: Pull the image from Docker Hub with the following command.

This may take several minutes to download and create an image.… Read more ...


01B: Spark on Zeppelin – custom Dockerfile

Pre-requisite: Docker is installed on your machine, e.g. on Mac OS X ($ brew cask install docker) or on Windows 10. See also: Docker interview Q&As.

What is Apache Zeppelin?

Zeppelin is a web-based notebook for executing arbitrary code in Scala, SQL, Spark, and more, and you can mix languages within a single note. Apache Zeppelin helps data analysts, data scientists, and business users get a better understanding of their data. As described below, you can quickly explore data, create visualizations, and share the insights as web pages with various stakeholders. For example:

1) Prepare data using Shell, e.g. by downloading files with curl/wget, and then ingest them into HDFS.

2) Perform data analytics with Spark (i.e. Scala) or PySpark (i.e. Python).

3) Create simple visualizations with SQL.

4) Export the results with Shell and publish them as graphs.

How to install Apache Zeppelin on Docker?

You need around 4 GB of disk space to create a Docker container with the Ubuntu OS and Apache Zeppelin.

Step 1: Create a folder, say "docker-zeppelin", under a folder named "projects". Within the docker-zeppelin folder, create a file named "Dockerfile" with the following contents. It installs Java on Ubuntu and then the Zeppelin notebook.

The Dockerfile shown below was simplified from that of the "apache/zeppelin" image, which is already available on Docker Hub; the full image installs all the conda packages, Python-related packages, and "R" language related packages.… Read more ...



02: Spark on Zeppelin – read a file from local file system

Pre-requisite: Docker is installed on your machine, e.g. on Mac OS X ($ brew cask install docker) or on Windows 10. See also: Docker interview Q&As. This extends setting up the Apache Zeppelin Notebook.

Step 1: Pull the apache/zeppelin image from Docker Hub with the following command.

You can verify the image with the “docker images” command.

Step 2: Place the input file to read, "employees.txt", in the $(pwd)/seed folder.

Step 3: Run the container with the above image.

Note: $(pwd)/seed is the folder on the host system where the employees.txt input file is placed; it is synchronized with the container path "/zeppelin/seed".

You can inspect the container files/logs with the following commands in a separate terminal window:

Get the container id with:

Shell into the container with:

Step 4: Open the Zeppelin notebook in a web browser at "http://localhost:8080".… Read more ...
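As a rough sketch of what the notebook paragraph can look like (assuming Spark 2.x, where Zeppelin's %spark interpreter exposes a SparkSession named spark, and assuming the container path /zeppelin/seed described above):

%spark
// Read the mounted file from the container's local file system;
// the "file://" prefix avoids going through HDFS.
val employees = spark.read.textFile("file:///zeppelin/seed/employees.txt")
employees.show(10, false)                        // preview the first 10 lines
println(s"line count = ${employees.count()}")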



03: Spark on Zeppelin – DataFrame Operations in Scala

Pre-requisite: Docker is installed on your machine, e.g. on Mac OS X ($ brew cask install docker) or on Windows 10. See also: Docker interview Q&As.

This tutorial extends Apache Zeppelin on Docker Tutorial – Docker pull from Docker Hub and Spark stand-alone to read a file from the local file system.

1. Print the schema of the DataFrame

root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- location: string (nullable = true)
 |-- salary: double (nullable = true)

2. Show the contents of a DataFrame (see the sketch after this excerpt)

3.

Read more ...
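The two operations named in this excerpt can be sketched as follows; the sample rows are made up for illustration, and only the column names and types follow the schema printed above (nullable flags may differ for in-memory data):

%spark
import spark.implicits._

// Hypothetical sample rows matching the columns id, name, location, salary.
val employeesDf = Seq(
  (1, "Anna", "Sydney",    85000.0),
  (2, "Ben",  "Melbourne", 92000.0)
).toDF("id", "name", "location", "salary")

// 1. Print the schema of the DataFrame
employeesDf.printSchema()

// 2. Show the contents of the DataFrame
employeesDf.show(false)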


04: Spark on Zeppelin – DataFrame joins in Scala

This tutorial extends the series: Spark on Apache Zeppelin Tutorials.

1. Create the "Orders" DataFrame (see the join sketch after this excerpt)

2.

Read more ...
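The full post builds the DataFrames step by step; below is a minimal join sketch in which the "Orders" columns and the second ("Customers") DataFrame are assumptions for illustration, not the post's exact data:

%spark
import spark.implicits._

// Hypothetical "Orders" and "Customers" data for illustration.
val orders = Seq(
  (1, 101, 250.0),
  (2, 102, 120.0),
  (3, 999, 75.0)
).toDF("order_id", "customer_id", "amount")

val customers = Seq(
  (101, "Anna"),
  (102, "Ben")
).toDF("customer_id", "name")

// Inner join on the key column; use "left_outer" to keep unmatched orders.
val joined = orders.join(customers, Seq("customer_id"), "inner")
joined.show(false)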


05: Spark on Zeppelin – semi-structured log file

This tutorial extends the series: Spark on Apache Zeppelin Tutorials. Step 1: Pull the apache/zeppelin image from Docker Hub with the following command.

“docker images” will show the image that was created. Step 2: Run the above image to create a container with the following…

Read more ...
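The excerpt stops at the Docker setup; one common way to turn semi-structured log lines into columns, not necessarily the approach used in the full post, is to apply a regular expression in a Scala paragraph. The log format below is hypothetical:

%spark
import spark.implicits._
import org.apache.spark.sql.functions.regexp_extract

// Hypothetical log lines of the form: "2019-01-15 10:22:31 ERROR Payment failed"
val logs = Seq(
  "2019-01-15 10:22:31 ERROR Payment failed",
  "2019-01-15 10:23:02 INFO  Payment retried"
).toDF("line")

// Capture groups: timestamp, level, and the remaining message text.
val pattern = """^(\S+ \S+) (\S+)\s+(.*)$"""
val parsed = logs.select(
  regexp_extract($"line", pattern, 1).as("timestamp"),
  regexp_extract($"line", pattern, 2).as("level"),
  regexp_extract($"line", pattern, 3).as("message")
)
parsed.show(false)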


06: Spark on Zeppelin – RDD operation zipWithIndex

Pre-requisite: Docker is installed on your machine, e.g. on Mac OS X ($ brew cask install docker) or on Windows 10. See also: Docker interview Q&As. This extends setting up the Apache Zeppelin Notebook. Q. Why do we need zipWithIndex? A. In the database world there are various instances where we want to assign a…

Read more ...
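As a minimal illustration of the zipWithIndex operation described above (the sample data is made up, and sc is the SparkContext exposed by Zeppelin's %spark interpreter):

%spark
// zipWithIndex pairs each element with its 0-based position within the RDD.
val names = sc.parallelize(Seq("Anna", "Ben", "Cara"), numSlices = 2)
val indexed = names.zipWithIndex()            // RDD[(String, Long)]
indexed.collect().foreach { case (name, idx) => println(s"$idx -> $name") }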


07: Spark on Zeppelin – window functions in Scala

Pre-requisite: Docker is installed on your machine, e.g. on Mac OS X ($ brew cask install docker) or on Windows 10. See also: Docker interview Q&As. This extends setting up the Apache Zeppelin Notebook. Q. What are the different types of functions in Spark SQL? A. There are 4 types of functions: 1) Built-in…

Read more ...
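As a small, self-contained sketch of a Spark SQL window function in Scala (the department and salary data below are made up):

%spark
import spark.implicits._
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

val employees = Seq(
  ("Sales", "Anna", 90000.0),
  ("Sales", "Ben",  85000.0),
  ("IT",    "Cara", 95000.0)
).toDF("dept", "name", "salary")

// Rank employees within each department by salary using a window function.
val byDept = Window.partitionBy($"dept").orderBy($"salary".desc)
employees.withColumn("rank", row_number().over(byDept)).show(false)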


08: Spark on Zeppelin – convert DataFrames to RDD[Row] and RDD[Row] to DataFrame

Pre-requisite: Docker is installed on your machine, e.g. on Mac OS X ($ brew cask install docker) or on Windows 10. See also: Docker interview Q&As. This extends setting up the Apache Zeppelin Notebook. Important: It is not best practice to mutate values or to use RDDs directly as opposed to using DataFrames.…

Read more ...
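A minimal sketch of the two conversions named in the title, with assumed sample data: a DataFrame goes to RDD[Row] via .rdd, and an RDD[Row] comes back to a DataFrame via createDataFrame with an explicit schema:

%spark
import spark.implicits._
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, IntegerType, StringType}

val df = Seq((1, "Anna"), (2, "Ben")).toDF("id", "name")

// DataFrame -> RDD[Row]
val rowRdd = df.rdd                               // org.apache.spark.rdd.RDD[Row]

// RDD[Row] -> DataFrame (an explicit schema is required)
val schema = StructType(Seq(
  StructField("id", IntegerType, nullable = false),
  StructField("name", StringType, nullable = true)
))
val df2 = spark.createDataFrame(rowRdd, schema)
df2.show(false)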


09: Spark on Zeppelin – convert DataFrames to RDD and RDD to DataFrame

Pre-requisite: Docker is installed on your machine, e.g. on Mac OS X ($ brew cask install docker) or on Windows 10. See also: Docker interview Q&As. This extends setting up the Apache Zeppelin Notebook. Important: It is not best practice to mutate values or to use RDDs directly as opposed to using DataFrames.…

Read more ...
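This title is close to the previous post's; as a sketch of the other common route, going through a typed RDD of a case class rather than RDD[Row], toDF and as[...] can be used (sample data assumed):

%spark
import spark.implicits._

// Hypothetical case class for the sample data.
case class Employee(id: Int, name: String)

// RDD of a case class -> DataFrame via toDF (uses the implicits imported above)
val rdd = sc.parallelize(Seq(Employee(1, "Anna"), Employee(2, "Ben")))
val df = rdd.toDF()

// DataFrame -> typed Dataset of the case class, then down to an RDD
val employeesRdd = df.as[Employee].rdd
employeesRdd.collect().foreach(println)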


