05: Databricks – Spark UDFs

Prerequisite: Extends Databricks getting started – Spark, Shell, SQL.

What is a UDF?

User-Defined Functions (UDFs) are a feature of Spark SQL that lets you define new Column-based functions, extending the vocabulary of Spark SQL’s DSL for transforming Datasets.

Step 1: Create a new Notebook in Databricks, and choose Python as the language.

Step 2: The data has already been uploaded and the table has been created.
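
A minimal sketch of loading that table into a DataFrame, assuming it was saved as a table named employees (the table name is an assumption; “emp_city” and “emp_salary” are the columns used in the later steps):

```python
# Load the previously created table into a DataFrame.
# "employees" is an assumed table name; adjust it to match your own table.
emp_df = spark.table("employees")

emp_df.printSchema()   # expect columns such as emp_city and emp_salary
emp_df.show(5)
```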

Step 3: Let’s create a UDF that calculates a bonus of 10% of the “emp_salary” column.
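
A sketch of this in the DataFrame style, assuming the emp_df DataFrame from Step 2; the names calc_bonus and emp_bonus are illustrative:

```python
from pyspark.sql.functions import udf, col
from pyspark.sql.types import DoubleType

# Plain Python function: 10% of the salary.
def calc_bonus(salary):
    return salary * 0.10

# Wrap it as a Spark UDF with an explicit return type.
calc_bonus_udf = udf(calc_bonus, DoubleType())

# Add the bonus as a new column derived from "emp_salary".
bonus_df = emp_df.withColumn("emp_bonus", calc_bonus_udf(col("emp_salary")))
bonus_df.show()
```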


Spark SQL

Step 4: Let’s use the Spark SQL style.
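
A sketch of the SQL style, computing the same 10% bonus with a plain SQL expression (the employees table name is an assumption):

```python
# Same 10% bonus, expressed directly in SQL against the table.
sql_df = spark.sql("""
    SELECT emp_city,
           emp_salary,
           emp_salary * 0.10 AS emp_bonus
    FROM employees
""")
sql_df.show()
```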


Spark SQL with UDF

Registering the same Python function as a Spark SQL UDF and calling it inside the query gives you the same output, as sketched below.
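
A sketch, assuming the calc_bonus function from Step 3 and the employees table name:

```python
from pyspark.sql.types import DoubleType

# Register the Python function so it can be called from SQL.
spark.udf.register("calc_bonus", calc_bonus, DoubleType())

spark.sql("""
    SELECT emp_city,
           emp_salary,
           calc_bonus(emp_salary) AS emp_bonus
    FROM employees
""").show()
```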

UDF via lambda function

Step 5: Let’s convert our function to a Python lambda function.
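
A sketch of the lambda version, again assuming emp_df from Step 2:

```python
from pyspark.sql.functions import udf, col
from pyspark.sql.types import DoubleType

# Same UDF, but defined inline with a lambda instead of a named function.
calc_bonus_udf = udf(lambda salary: salary * 0.10, DoubleType())

emp_df.withColumn("emp_bonus", calc_bonus_udf(col("emp_salary"))).show()
```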

groupBy, Aggregate & Alias

Step 6: Let’s group by “emp_city”, calculate the average salary, and alias it as “emp_avg_salary_by_city”.
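
A sketch, assuming emp_df from Step 2:

```python
from pyspark.sql.functions import avg

# Average salary per city, with the aggregate column aliased.
avg_df = (emp_df
          .groupBy("emp_city")
          .agg(avg("emp_salary").alias("emp_avg_salary_by_city")))
avg_df.show()
```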


You can rename a single column with “withColumnRenamed”.
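
For example (a sketch; avg_salary is just an illustrative new name):

```python
# Rename a single column on the aggregated DataFrame.
renamed_df = avg_df.withColumnRenamed("emp_avg_salary_by_city", "avg_salary")
renamed_df.show()
```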


