03: Databricks – Spark SCD Type 1

Prerequisite: Extends Databricks getting started – Spark, Shell, SQL.

What is SCD Type 1

SCD stands for Slowly Changing Dimension, and it was explained in 10 Data warehouse interview Q&As.

Step 1: Remove all cells in the notebook with the “x” and then confirm or create a new Python notebook. If the cluster is not running as it auto terminates after 2 hours, create a new cluster and attach it to the notebook.

Step 2: We have already uploaded “employee.csv” to data,

and let’s upload the new delta file employee_delta.csv:

Now, for SCD Type 1::

1) UPDATE record where emp_id=2 with the new salary info in the employee_delta.csv”.

2) INSERT records that are new in the employee_delta.csv”.

NOTE: We don’t have to do DELETE as it is normally done as a logical delete with a new field “active=y” or “active=n”.

INNER JOIN

Inner join two dataframes to find the “emp_id” that is in both employee.csv & employee_delta.csv.

Step 3: The inner join to give you records in both:

Output:

Step 4: Let’s SELECT the column values from employee_delta.csv as it will update the values in employee.csv.

Note: Use “tab” for indentation, and “\” for continuation.

LEFT OUTER JOIN

Left outer join to identify the records that don’t need any change.

Step 5: Add a new cell and do a “left outer join”. We need to filter out records that are not in “employee_delta.csv”.

Output:

RIGHT OUTER JOIN

to get the new records in “employee_delta_df.csv”.

Step 6: Add a new cell and do a “right outer join”. We need to filter out records that are in “employee.csv”.

Output:

UNION ALL

Step 7: Union all three dataframes – emp_updated,emp_no_change_df, and emp_new_df to give us the final values.

Output:

Databricss SCD Type 1

Databricss SCD Type 1

functools reduce(…) function

Alternatively, we can also use the “reduce” function from the functools library, which has the higher order functions (i.e. Functional Programming).

Step 8: functools reduce(..) function in action to combine 3 dataframes.

functools.reduce(function, iterable[, initializer])

*df” is a varargs that takes variable number of arguments. In the above example 3 dataframes.

SQL – show tables

Output:

Databricks – show tables SQL


800+ Java & Big Data Interview Q&As

Top