Docker Tutorial: Apache Spark (spark-submit) in Python 3 with virtual env on Cloudera quickstart

Prerequisite: Docker is installed on your Windows or Mac machine, and you have a basic understanding of Docker (see the “Docker tutorials step by step | Hadoop, Hive, Impala & Spark on Cloudera quickstart on Docker” tutorials).

Step 1: Pull the modified image “gdancik/cloudera”, a build of cloudera/quickstart with Python 3.4 and vim installed (vim is aliased to “vi”). The image is available on Docker Hub: https://hub.docker.com/r/gdancik/cloudera.
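The pull itself is a single command:

```shell
# Pull the Python 3 enabled Cloudera quickstart image from Docker Hub
docker pull gdancik/cloudera
```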

Cloudera quickstart Docker with Python 3


Step 2: Run the container from the command line.
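A typical invocation, following the standard cloudera/quickstart run command (the hostname, privileged flag, and port mapping are assumptions; adjust to your setup):

```shell
# Start the quickstart services in an interactive container
# (-p 8888:8888 exposes Hue; add more -p flags for other web UIs as needed)
docker run --hostname=quickstart.cloudera --privileged=true -it \
  -p 8888:8888 gdancik/cloudera /usr/bin/docker-quickstart
```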

Python3

Step 3: Configure Python 3.

The “gdancik/cloudera” image comes with Python 3.

To use Python 3 for pyspark:

If you want the “python” command to point to python3:
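Both tweaks can be sketched as follows (assumption: python3 is already on the PATH in the container):

```shell
# Tell Spark to use the Python 3 interpreter for both driver and workers
export PYSPARK_PYTHON=python3
export PYSPARK_DRIVER_PYTHON=python3

# Optionally make the "python" command itself point to python3
alias python=python3
```

Add the export lines to ~/.bashrc if you want them to survive new shell sessions.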

Install pip3

Step 4: Install pip3.
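One way to get pip3, assuming the image's Python 3.4 does not already ship it, is the standard-library ensurepip module (part of Python since 3.4):

```shell
# Bootstrap pip for the python3 interpreter
python3 -m ensurepip

# Verify the install
pip3 --version
```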

Virtual Environment

Step 5: Install virtualenv.
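With pip3 in place, virtualenv installs as usual:

```shell
# Install virtualenv into the global site-packages
pip3 install virtualenv
```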

Step 6: Create the “projects/my-app” directory.
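For example (the exact parent location is up to you):

```shell
# -p creates intermediate directories as needed
mkdir -p projects/my-app
```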

Step 7: Create a virtual environment named “my-app-env”.
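A sketch, assuming you are working from the directory created in Step 6 (the -p flag pins the interpreter and is strictly optional when virtualenv itself runs under python3):

```shell
cd projects/my-app

# Create a virtual environment backed by python3
virtualenv -p python3 my-app-env
```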

Step 8: Activate the virtual environment.
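Activation is done by sourcing the environment's activate script:

```shell
source my-app-env/bin/activate
# the prompt gains a "(my-app-env)" prefix, e.g.:
# (my-app-env) [root@quickstart my-app]#
```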

The “(my-app-env)” prefix in the prompt means we are inside the “my-app-env” virtual environment. So if you install a package, say “pytest”, it will be installed in the virtual environment's site-packages and not in the global site-packages.

Switch to the global environment

Switch back to the virtual environment

Use the “history” command to find the earlier “source” command and re-run it with “!”
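These three actions can be sketched as:

```shell
# Switch to the global environment:
deactivate

# Switch back to the virtual environment:
source my-app-env/bin/activate

# Or locate the earlier "source" command with "history" and re-run it by
# number with "!", e.g. "!42" if history lists it as entry 42 (hypothetical)
```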

Create a Python project structure

Step 9: Create the project structure and the relevant python files.

simple.py
driver.py
setup.py

We use setup.py to build the .egg (i.e. zip) file containing all the modules. setup.py is a Python file whose presence usually tells you that the module/package you are about to install has been packaged and distributed with Distutils, the standard for distributing Python modules.
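A minimal setup.py might look like the following sketch; the name and version are assumptions chosen to match the “simple-spark==0.0.0” entry that shows up in requirements.txt later:

```shell
# Write a minimal setup.py into the project root
cat > setup.py <<'EOF'
from setuptools import setup, find_packages

setup(
    name='simple-spark',        # assumed project name
    version='0.0.0',
    packages=find_packages()    # pick up the package holding simple.py
)
EOF
```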

tree -L 4

Build an .egg file

Step 10: Let’s build an .egg file with setup.py inside the virtual environment.
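The build itself is one command, run from the project root where setup.py lives:

```shell
# Build the egg; by default the artifact lands in the dist/ directory
python setup.py bdist_egg
```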

View an .egg file
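Since an .egg is just a zip archive, its contents can be listed with unzip (or any zip tool):

```shell
# List the modules packaged into the egg
unzip -l dist/*.egg
```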

spark-submit
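A sketch of the submit command, assuming driver.py from Step 9 is the entry point and the .egg built above carries the supporting modules:

```shell
# --py-files ships the egg to the executors so its modules are importable
spark-submit --py-files dist/*.egg driver.py
```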

Outputs:

Create a requirements.txt file
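Run this while the virtual environment is still active, so only its packages are captured:

```shell
# Record the environment's installed packages, pinned to exact versions
pip freeze > requirements.txt
cat requirements.txt
```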

Switch to the global environment

Let’s install the packages from the “requirements.txt” file. First remove the line “simple-spark==0.0.0”, since that is our own locally built package and cannot be fetched from PyPI.
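A sketch, assuming requirements.txt sits in the current directory:

```shell
# Drop our own locally built package from the pin list
sed -i '/^simple-spark==0.0.0$/d' requirements.txt

# Install the remaining dependencies into the (now global) environment
pip3 install -r requirements.txt
```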

Now the global environment will have the following package dependencies.

