24: PySpark with Hierarchical Data on Databricks

This tutorial extends Getting started with Databricks. Currently spark does not support recursion like you can use in SQL via “Common Table Expression“.

Step 1: Login to Databricks notebook:

Step 2: Create a CLUSTER and it will take a few minutes to come up. This cluster will go down after 2 hours.

Step 3: Create simple hierarchical data with 3 levels as shown below: level-0, level-1 & level-2. The level-0 is the top parent.

Hierarchy Example

Node Hierarchy

CTE example with Node Hierarchy


Spark SQL does not support recursive CTE as discussed later in this post. In most of hierarchical data, depth is unknown, hence you could identify the top level hierarchy of one column from another column using WHILE loop and recursively joining DataFrame as shown below.

Step 4: Loop through the levels breadth first (i.e. left to right) for each level as shown below.


Step 5: Combine the above 3 levels of dataframes vt_level_0, vt_level_1 and vt_level_2.


Spark SQL does not support recursive CTE

Spark SQL does not support recursive CTE (i.e. Common Table Expression) as shown below. It gives an error on the “RECURSIVE” word.

If you run without the “RECURSIVE” key word you will only get one level down from the root as the output as shown below.


🔥 300+ Java Interview FAQs

Java & Big Data Tutorials