Spark interview Q&As with coding examples in Scala – part 01: Key basics

Some of these basic Apache Spark interview questions can make or break your chance to get an offer.

Q01. Why is “===” used in a DataFrame join condition?

A01. Comparisons with == and != are universal: they compare any two values, no matter what their types are, and that is a problem. The Scala compiler does warn when it can see that two unrelated types are being compared, for example an Int and a String.

But this warning coverage is NOT comprehensive. For example, no warning is issued when one of the operands is statically typed as Any.

Another example is a proxy for some data structure: the proxy and the underlying data have different types. If you accidentally compare a proxy with the underlying type using == or a pattern match, the code still compiles, but the comparison always results in false.
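The proxy pitfall can be sketched in plain Scala. This is only an illustration: UserId is a made-up wrapper type, and the behaviour described assumes default compiler settings.

```scala
// Hypothetical proxy type that wraps a raw Long id.
final case class UserId(value: Long)

val ids: Seq[UserId] = Seq(UserId(1L), UserId(2L))

// contains() accepts any supertype of the element type, so passing a raw
// Long compiles; the UserId elements are then compared with a Long via ==.
val found: Boolean = ids.contains(1L)

// The intended lookup compares like with like.
val foundTyped: Boolean = ids.contains(UserId(1L))

println(s"raw Long: $found, wrapped: $foundTyped")
```

The first lookup can never succeed, because a UserId is never equal to a Long.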

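Scala prides itself on its strong static type system, and “===” is commonly defined as a type-safe equality operator. In Spark, “===” is a method on Column that returns a new Column representing the equality expression, which Spark evaluates for each pair of rows. Using “==” on two Column objects would instead compare the objects themselves and return a plain Boolean, which cannot serve as a join condition.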
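A join of this kind might look like the sketch below. The employees and salaries DataFrames, their column names and the local SparkSession are assumptions made up for illustration.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("triple-equals").getOrCreate()
import spark.implicits._

// Hypothetical sample data.
val employees = Seq((1, "John"), (2, "Peter")).toDF("id", "name")
val salaries = Seq((1, 5000.0), (2, 6000.0)).toDF("emp_id", "salary")

// === builds a Column expression that Spark evaluates row by row.
val joined = employees.join(salaries, employees("id") === salaries("emp_id"))
joined.show()

// == compares the two Column objects on the driver and yields a Boolean.
val notACondition: Boolean = employees("id") == salaries("emp_id")
```

Here notACondition is just a Boolean computed once on the driver, so it says nothing about the rows.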

Q02. When you join DataFrames, how do you know which join strategy Spark uses?
A02. There are 5 join strategies:

1) Broadcast Hash Join
2) Shuffle Hash Join
3) Sort Merge Join
4) Broadcast Nested Loop Join
5) Cartesian Product Join

[Learn more: Spark SQL joins & performance tuning interview questions & answers].

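You can use explain() on the joined DataFrame to print its physical plan, which names the join strategy that Spark picked.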
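A minimal sketch of inspecting the plan follows. The sample DataFrames are hypothetical, and automatic broadcasting is switched off here (an assumption for this demo) because Spark would otherwise broadcast such tiny tables.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("join-strategy").getOrCreate()
import spark.implicits._

// Assumption for the demo: disable automatic broadcast joins.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

// Hypothetical sample data.
val employees = Seq((1, "John"), (2, "Peter")).toDF("id", "name")
val salaries = Seq((1, 5000.0), (2, 6000.0)).toDF("emp_id", "salary")

val joined = employees.join(salaries, employees("id") === salaries("emp_id"))

// Prints the physical plan; look for the join operator name in it.
joined.explain()

// The same plan as a String, in case you want to check it in code.
val plan: String = joined.queryExecution.executedPlan.toString
```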


Sample output: It is a “SortMergeJoin” in this example.

Q03. How do you remove duplicate rows in Spark?
A03. You can use distinct() to remove rows that have the same values on all columns.
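A small sketch with made-up data (the name, age and salary columns are assumptions for illustration):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("distinct").getOrCreate()
import spark.implicits._

// Hypothetical sample data; the first two rows are identical.
val df = Seq(
  ("John", 35, 5000.0),
  ("John", 35, 5000.0),
  ("Peter", 40, 6000.0),
  ("John", 36, 7000.0)
).toDF("name", "age", "salary")

// Keeps one copy of each row that matches on every column.
val deduped = df.distinct()
deduped.show()
```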



Q04. What if you want to remove duplicates on selected columns?
A04. Use dropDuplicates(). Called without arguments it removes duplicates based on all columns, just like distinct(). Pass one or more column names to deduplicate rows based on the selected columns only.
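Both variants can be sketched as follows, again with made-up sample data. Which of the duplicate rows survives is not guaranteed.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("drop-duplicates").getOrCreate()
import spark.implicits._

// Hypothetical sample data.
val df = Seq(
  ("John", 35, 5000.0),
  ("John", 35, 5000.0),
  ("Peter", 40, 6000.0),
  ("John", 36, 7000.0)
).toDF("name", "age", "salary")

// No arguments: duplicates are judged on all columns.
val onAllColumns = df.dropDuplicates()

// One column: a single row is kept per name.
val onName = df.dropDuplicates("name")

// Several columns: a single row is kept per (name, age) pair.
val onNameAndAge = df.dropDuplicates("name", "age")
```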


Q05. How do you get the count by name?
A05. Use “groupBy(…)” followed by “count()”.
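For example, with hypothetical sample data:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("count-by-name").getOrCreate()
import spark.implicits._

// Hypothetical sample data.
val df = Seq(
  ("John", 35, 5000.0),
  ("John", 35, 5000.0),
  ("Peter", 40, 6000.0),
  ("John", 36, 7000.0)
).toDF("name", "age", "salary")

// One output row per name, with a "count" column of type Long.
val countsByName = df.groupBy("name").count()
countsByName.show()

// Collected into a Map only to make the result easy to inspect.
val counts: Map[String, Long] = countsByName.as[(String, Long)].collect().toMap
```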


Q06. How do you get the distinct name counts?
A06. Use the countDistinct(…) function from org.apache.spark.sql.functions.
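A short sketch, reading the question as "how many different names are there" (sample data is made up):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.countDistinct

val spark = SparkSession.builder().master("local[*]").appName("count-distinct").getOrCreate()
import spark.implicits._

// Hypothetical sample data.
val df = Seq(
  ("John", 35, 5000.0),
  ("John", 35, 5000.0),
  ("Peter", 40, 6000.0),
  ("John", 36, 7000.0)
).toDF("name", "age", "salary")

// A single-row DataFrame holding the number of different names.
val distinctNames = df.select(countDistinct("name").as("distinct_names"))
distinctNames.show()

val n: Long = distinctNames.as[Long].head()
```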


Q07. How do you aggregate salary by name?
A07. Use “groupBy(…)” with “agg()” and “sum()”.
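For instance, with hypothetical data and a made-up total_salary alias:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

val spark = SparkSession.builder().master("local[*]").appName("sum-by-name").getOrCreate()
import spark.implicits._

// Hypothetical sample data.
val df = Seq(
  ("John", 35, 5000.0),
  ("John", 36, 7000.0),
  ("Peter", 40, 6000.0)
).toDF("name", "age", "salary")

// Total salary per name.
val totalsByName = df.groupBy("name").agg(sum("salary").as("total_salary"))
totalsByName.show()

val totals: Map[String, Double] = totalsByName.as[(String, Double)].collect().toMap
```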



Q08. How do you calculate average salary by name?
A08. Use “groupBy(…)” with “agg()” and “avg()”.
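The same pattern with avg(), sketched on made-up data:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.avg

val spark = SparkSession.builder().master("local[*]").appName("avg-by-name").getOrCreate()
import spark.implicits._

// Hypothetical sample data.
val df = Seq(
  ("John", 35, 5000.0),
  ("John", 36, 7000.0),
  ("Peter", 40, 4000.0)
).toDF("name", "age", "salary")

// Average salary per name; the result has one row per name.
val avgByName = df.groupBy("name").agg(avg("salary").as("avg_salary"))
avgByName.show()

val averages: Map[String, Double] = avgByName.as[(String, Double)].collect().toMap
```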


Q09. How will you aggregate salary by name & age with different combinations?
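A09. Use “cube(…)”. It aggregates over every combination of the grouping columns: (name, age), (name), (age) and the grand total. A null in a grouping column marks a subtotal across that column.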
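A sketch on hypothetical data, where three distinct (name, age) pairs yield 3 + 2 + 3 + 1 = 9 result rows:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

val spark = SparkSession.builder().master("local[*]").appName("cube").getOrCreate()
import spark.implicits._

// Hypothetical sample data.
val df = Seq(
  ("John", 35, 5000.0),
  ("John", 36, 7000.0),
  ("Peter", 40, 6000.0)
).toDF("name", "age", "salary")

// Groupings produced: (name, age), (name), (age) and the grand total.
val cubed = df.cube("name", "age").agg(sum("salary").as("total_salary"))
cubed.orderBy("name", "age").show()

// The grand-total row has null in both grouping columns.
val grandTotal: Double = cubed
  .filter($"name".isNull && $"age".isNull)
  .select("total_salary")
  .as[Double]
  .head()
```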


Q10. How does rollup() differ from cube()?
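A10. rollup(..) returns a subset of the rows that cube(..) returns. It computes hierarchical subtotals from left to right: rollup("name", "age") groups by (name, age), then (name), then the grand total, and skips the (age)-only grouping.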
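On the same kind of made-up data, rollup yields 3 + 2 + 1 = 6 rows, and none of them is an age-only subtotal:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

val spark = SparkSession.builder().master("local[*]").appName("rollup").getOrCreate()
import spark.implicits._

// Hypothetical sample data.
val df = Seq(
  ("John", 35, 5000.0),
  ("John", 36, 7000.0),
  ("Peter", 40, 6000.0)
).toDF("name", "age", "salary")

// Groupings produced: (name, age), (name) and the grand total.
val rolled = df.rollup("name", "age").agg(sum("salary").as("total_salary"))
rolled.orderBy("name", "age").show()

// Rows that subtotal by age alone would have a null name and a non-null age.
val ageOnlyRows: Long = rolled.filter($"name".isNull && $"age".isNotNull).count()
```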


Q11. How will you rank salary by name?
A11. Window functions to the rescue. These functions perform a calculation over a group of records, called a window, that are in some relation to the current record.

Note: Use “withColumn” to add a new column.
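One way to do this is to rank within a window partitioned by name and ordered by salary, highest first. The data, the descending order and the rank column name are assumptions for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{desc, rank}

val spark = SparkSession.builder().master("local[*]").appName("rank").getOrCreate()
import spark.implicits._

// Hypothetical sample data.
val df = Seq(
  ("John", 35, 5000.0),
  ("John", 36, 7000.0),
  ("Peter", 40, 6000.0)
).toDF("name", "age", "salary")

// One window per name, highest salary first.
val byName = Window.partitionBy("name").orderBy(desc("salary"))

// withColumn adds the rank next to the existing columns.
val ranked = df.withColumn("rank", rank().over(byName))
ranked.show()

val ranks: Set[(String, Double, Int)] =
  ranked.select("name", "salary", "rank").as[(String, Double, Int)].collect().toSet
```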


Q12. How will you display the average salary by name?
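One possible approach, sketched here as an assumption rather than the definitive answer, is to reuse a window partitioned by name so that every row keeps its own columns and gains the average for its name. The sample data and the avg_salary column name are made up.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.avg

val spark = SparkSession.builder().master("local[*]").appName("window-avg").getOrCreate()
import spark.implicits._

// Hypothetical sample data.
val df = Seq(
  ("John", 35, 5000.0),
  ("John", 36, 7000.0),
  ("Peter", 40, 4000.0)
).toDF("name", "age", "salary")

// An unordered window: all rows that share the same name.
val byName = Window.partitionBy("name")

// Unlike groupBy, the row count stays the same.
val withAvg = df.withColumn("avg_salary", avg("salary").over(byName))
withAvg.show()

val result: Set[(String, Double, Double)] =
  withAvg.select("name", "salary", "avg_salary").as[(String, Double, Double)].collect().toSet
```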


