02: Databricks – Spark schemas, casting & PySpark API

Prerequisite: Extends Databricks getting started – Spark, Shell, SQL.

Q: What is a DataFrame?
A: A DataFrame is a data abstraction, with its own domain-specific language (DSL), for working with structured and semi-structured data, i.e. datasets with a schema. DataFrames are immutable, resilient (i.e. fault-tolerant), distributed (i.e. partitioned across multiple machines), held in memory where possible, and processed in parallel across those machines.

Dataframes

DataFrames can be constructed from a wide array of sources such as CSV, JSON, and XML data files, Parquet/Avro files, RDBMS databases, Elasticsearch, and many more. You can also create one programmatically from a List, Array, or Sequence.

Let’s start the tutorial now by creating a DataFrame from a .csv file.

Step 1: Remove all cells in the notebook by clicking the “x” and confirming, or create a new Python notebook. If the cluster is no longer running (it auto-terminates after 2 hours), create a new cluster and attach it to the notebook.

You can press “ctrl+return” to execute the code in the cell.

With inferSchema="true", Spark reads the data and infers each column's type, so “emp_id” is inferred as an integer type.

What if you want to assign your own schema?

Step 2: Let’s define our own schema and make “emp_id” a String type.

You can press “ctrl+return” to execute the code in the cell.

Casting

Step 3: Let’s cast the “emp_id” back to IntegerType from StringType.

Applying a function

Step 4: Let’s calculate a flat 20% tax on the salary as a new “emp_tax” column. The “lit” function creates a column from a literal (constant) value.

Note: When you have multiple cells, you can also use “Run All Above” and “Run All Below”.

Databricks run all above or run all below

Important: PySpark API

Keep the PySpark API documentation (PySpark modules) handy while you code. You can click on “DataFrame” to see what functions are available, for example the withColumn function in the DataFrame module.

Databricks PySpark Modules & API

Check the API for the “lit” function under the “pyspark.sql.functions” module in the left-hand navigation.

PySpark functions

PySpark lit function

