Apache Spark SQL join types interview Q&As

Q1. What are the different Spark SQL join types?
A1. There are different SQL join types like inner join, left/right outer joins, full outer join, left semi-join, left anti-join and self-join.

Q2. Given the below tables, can give examples of the above join types?

Output:

Customers table:

Orders table:

A2. Here are the join examples. The tables are joined on code & custCode.

inner join

The inner is the default join in Spark and most commonly used. It drops columns that are not matched by the keys. As you can see below only codes 10 & 20 are displayed as they are in both tables.

Output:

full outer join

The full or fullouter join returns all rows from both the tables, and where join expression doesn’t match it returns null on respective record columns.

Output:

left outer join

The left or leftouter join returns all rows from the left table regardless of match found on the right table or not, and it assigns null for those records where no match in the right table. It drops records where no match is found in both tables. The right or rightouter will do the reverse.

Output:

leftsemi join

The leftsemi join is similar to inner join with the difference being leftsemi join returns all columns from the left table and ignores all columns from the right table. The rightsemi does the opposite.

Output:

leftanti join

The leftanti join does the exact opposite of the Spark leftsemi join, where the join returns only the columns from the left table for which no match found in the right table. The rightsemi does the opposite.

Output:

So, the union of “leftsemi” & “leftanti” will give the left table columns of the leftouter join.

Join types are not complete without discussing self join, cartesian (cross) join & theta join. These joins don’t have a specific join types. Let’s create different sample data to demonstrate these joins.

self-join

The self-join can use any of the above mentioned join types to join a table to itself. The self-joins are useful in querying hierarchical data. The example below joins to itself to find the employee names & their managers.

Output:

cartesian or cross join

The cartesian joins generate a “cartesian product”, which is defined as the product of two tables. If all employees above can perform all of the tasks, then you get product of both tables. Cross joins of very large tables can lead to performance issues, hence refer to performance tuning post for strategies to handle cross joins of large tables.

The example below self joins on the length of names.

Output:

theta join

A theta is a join that links tables based on a relationship other than the equality between two columns.

Output:


Java & Big Data Interview FAQs

Java Key Areas Interview Q&As

800+ Java Interview Q&As

Java & Big Data Tutorials