25: PySpark SQL With Common Table Expression (i.e. CTE)

This tutorial extends Getting started with Databricks. You can run these examples in a Databricks notebook.

Step 1: Log in to the Databricks notebook:
https://community.cloud.databricks.com/login.html.

Step 2: Create a cluster; it takes a few minutes to come up, and it will shut down after 2 hours.

Step 3: Given the user ratings for various marketing campaigns, how would you go about dividing the ratings into 5 buckets and then selecting the top bucket?


Subquery approach

One way to get the desired output is with subqueries. Subqueries can be clumsy to read because they force you to work from the inside out to follow the logic.

Step 4: NTILE is an analytic function that divides an ordered data set into the specified number of buckets and assigns the appropriate bucket number to each row.

Three queries are nested, from the inside out:

1) Compute the average rating by campaign_id and campaign_name.
2) Divide that output into 5 buckets by average rating in descending order.
3) Filter for the top bucket.
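The three nested steps above can be sketched as one query. Since the sample data from the original screenshots is not available, the snippet below is a self-contained sketch using Python's built-in sqlite3 module, and the `campaign_ratings` table and its columns are assumptions; on Databricks the same SQL would run via `spark.sql(...)` or a `%sql` cell.

```python
import sqlite3

# Databricks would run this SQL via spark.sql(...) or a %sql cell; Python's
# built-in sqlite3 module (SQLite >= 3.25 supports window functions such as
# NTILE) keeps this sketch self-contained. The table name and its columns
# (campaign_ratings: campaign_id, campaign_name, rating) are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE campaign_ratings (campaign_id INT, campaign_name TEXT, rating INT);
INSERT INTO campaign_ratings VALUES
  (1, 'Email Blast', 5), (1, 'Email Blast', 4),
  (2, 'Social Ads',  3), (2, 'Social Ads',  2),
  (3, 'TV Spot',     4), (3, 'TV Spot',     4),
  (4, 'Radio',       1), (4, 'Radio',       2),
  (5, 'Billboard',   3), (5, 'Billboard',   4);
""")

top_bucket = conn.execute("""
SELECT campaign_id, campaign_name, avg_rating      -- 3) filter the top bucket
FROM (
  SELECT campaign_id, campaign_name, avg_rating,   -- 2) NTILE into 5 buckets
         NTILE(5) OVER (ORDER BY avg_rating DESC) AS bucket
  FROM (
    SELECT campaign_id, campaign_name,             -- 1) average per campaign
           AVG(rating) AS avg_rating
    FROM campaign_ratings
    GROUP BY campaign_id, campaign_name
  ) AS averaged
) AS bucketed
WHERE bucket = 1
""").fetchall()
print(top_bucket)  # [(1, 'Email Blast', 4.5)]
```

Reading this requires starting at the innermost SELECT and working outward, which is exactly the readability problem CTEs address.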


Common Table Expression (i.e. CTE) approach

CTEs solve 2 key problems:

1) The logic-on-top-of-logic problem, where you want to perform a data manipulation on the result of another data manipulation, as demonstrated with the subqueries above.

2) Readability: they make your code easier to follow by replacing the clumsy nested queries shown above with named, sequential steps.

Step 5: Run the query directly as SQL.
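The CTE rewrite of the same query might look like the following. In a Databricks `%sql` cell only the query text would be needed; here Python's built-in sqlite3 module and a hypothetical `campaign_ratings` table (campaign_id, campaign_name, rating) are assumptions used to keep the sketch runnable.

```python
import sqlite3

# The sqlite3 setup and sample rows are assumptions; on Databricks only the
# WITH ... SELECT query below would go in a %sql cell.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE campaign_ratings (campaign_id INT, campaign_name TEXT, rating INT);
INSERT INTO campaign_ratings VALUES
  (1, 'Email Blast', 5), (1, 'Email Blast', 4),
  (2, 'Social Ads',  3), (2, 'Social Ads',  2),
  (3, 'TV Spot',     4), (3, 'TV Spot',     4),
  (4, 'Radio',       1), (4, 'Radio',       2),
  (5, 'Billboard',   3), (5, 'Billboard',   4);
""")

top_bucket = conn.execute("""
WITH averaged AS (                 -- 1) average rating per campaign
  SELECT campaign_id, campaign_name, AVG(rating) AS avg_rating
  FROM campaign_ratings
  GROUP BY campaign_id, campaign_name
),
bucketed AS (                      -- 2) divide into 5 buckets, highest first
  SELECT campaign_id, campaign_name, avg_rating,
         NTILE(5) OVER (ORDER BY avg_rating DESC) AS bucket
  FROM averaged
)
SELECT campaign_id, campaign_name, avg_rating   -- 3) keep the top bucket
FROM bucketed
WHERE bucket = 1
""").fetchall()
print(top_bucket)  # [(1, 'Email Blast', 4.5)]
```

Each CTE names one step, so the query reads top to bottom in the same order as the three steps listed earlier.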


Step 6: Run the same Spark SQL from a Python notebook.


