Spark interview Q&As with coding examples in Scala – part 01: Key basics

Some of these basic Apache Spark interview questions can make or break your chance to get an offer.

Q01. Why is “===” used in the DataFrame join below?
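A join of the kind the question refers to might look like the sketch below (the `empDf`/`deptDf` DataFrames and their column names are hypothetical, assuming a spark-shell or Databricks session):

```scala
// Hypothetical DataFrames and column names, for illustration only.
// Note the Column "===" operator in the join condition, not Scala's "==".
val joinedDf = empDf.join(deptDf, empDf("deptId") === deptDf("id"))
```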

A01. Comparisons with == and != in Scala are universal: they compare any two values, no matter what their types are. The Scala compiler does issue a warning when two clearly unrelated types are compared, as shown below:
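For example, in the Scala 2 REPL (the exact wording of the warning can vary slightly between compiler versions):

```scala
scala> 1 == "one"
<console>:12: warning: comparing values of types Int and String using `==' will always yield false
res0: Boolean = false
```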

But this warning coverage is NOT comprehensive: no warning is issued for code like the following:
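One such case, sketched below: comparing instances of two unrelated, non-final classes triggers no warning, because in principle equals could have been overridden to make them comparable.

```scala
class Proxy(val id: Int)
class Data(val id: Int)

// Compiles cleanly with no warning, yet always evaluates to false:
val same = new Proxy(1) == new Data(1)
```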

Another example is when you use a proxy for some data structure: the proxy and the underlying data have different types. If you accidentally compare a proxy with the underlying type using == or a pattern match, the code still compiles, but the comparison will always evaluate to false.

Scala prides itself on its strong static type system. In Spark, “===” (i.e. Column.equalTo) is a type-safe equality operator: it returns a Column expression, which is what a join condition requires, whereas “==” would merely compare the two Column objects as Scala values and return a plain Boolean.
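The type difference can be seen directly (again with hypothetical `empDf`/`deptDf` DataFrames):

```scala
import org.apache.spark.sql.Column

// === builds a Column expression usable in joins and filters:
val cond: Column = empDf("deptId") === deptDf("id")

// == also compiles, but it compares the two Column objects as
// Scala values and yields a Boolean - not a valid join condition:
val wrong: Boolean = empDf("deptId") == deptDf("id")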

Q02. When you join DataFrames, how do you know which join strategy Spark uses?
A02. There are 4 join strategies:

1) Broadcast Join
2) Shuffle Hash Join
3) Sort Merge Join
4) BroadcastNestedLoopJoin

[Learn more: Spark SQL joins & performance tuning interview questions & answers].

You can call explain() on the joined DataFrame,

OR

inspect its physical plan programmatically via queryExecution.executedPlan.

Sample output: It is a “SortMergeJoin” in this example.
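A minimal sketch, assuming a spark-shell or Databricks session where `spark` and its implicits are available (auto-broadcast is disabled here so that even these tiny sample DataFrames demonstrate a SortMergeJoin):

```scala
import spark.implicits._

// Disable auto-broadcast so Spark does not pick a BroadcastHashJoin
// for these tiny sample DataFrames.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)

val df1 = Seq((1, "a"), (2, "b")).toDF("id", "v1")
val df2 = Seq((1, "x"), (2, "y")).toDF("id", "v2")
val joined = df1.join(df2, "id")

joined.explain()                              // prints the physical plan
println(joined.queryExecution.executedPlan)   // same plan, programmatically
// The printed plan contains a "SortMergeJoin" node in this setup.
```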

Q03. How do you remove duplicate rows in Spark?
A03. You can use distinct() to remove rows that have the same values on all columns.

On Databricks notebook – Spark Tutorials on Databricks Notebook.

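A minimal sketch (the sample data is illustrative; a spark-shell or Databricks `spark` session is assumed):

```scala
import spark.implicits._

val df = Seq(
  ("John", 30, 3000),
  ("John", 30, 3000),   // exact duplicate row
  ("Jane", 25, 2500)
).toDF("name", "age", "salary")

df.distinct().show()    // the duplicated John row appears only once
df.distinct().count()   // 2
```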

Q04. What if you want to remove duplicates on selected columns?
A04. Use dropDuplicates() without arguments to remove duplicates based on all columns,

or pass column names to deduplicate rows based on selected columns:

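A minimal sketch with illustrative data (spark-shell or Databricks session assumed):

```scala
import spark.implicits._

val df = Seq(
  ("John", 30, 3000),
  ("John", 35, 3000),
  ("Jane", 25, 2500)
).toDF("name", "age", "salary")

df.dropDuplicates().count()                  // 3 - no row is fully duplicated
df.dropDuplicates("name", "salary").count()  // 2 - the two John rows collapse
```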

Q05. How do you get the count by name?
A05. Use “groupBy(…)” with count().

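For example (sample data is illustrative; `spark` session assumed):

```scala
import spark.implicits._

val df = Seq(("John", 3000), ("John", 4000), ("Jane", 2500)).toDF("name", "salary")

df.groupBy("name").count().show()
// John -> 2, Jane -> 1
```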

Q06. How do you get the distinct name counts?
A06. Use the countDistinct(…) function from org.apache.spark.sql.functions._

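A minimal sketch (illustrative data; `spark` session assumed):

```scala
import org.apache.spark.sql.functions.countDistinct
import spark.implicits._

val df = Seq(("John", 3000), ("John", 4000), ("Jane", 2500)).toDF("name", "salary")

df.agg(countDistinct("name").as("distinct_names")).show()
df.agg(countDistinct("name")).first().getLong(0)  // 2
```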

Q07. How do you aggregate salary by name?
A07. Use “groupBy(…)” with sum(…)

or with agg(…):

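Both variants are sketched below (illustrative data; `spark` session assumed):

```scala
import org.apache.spark.sql.functions.sum
import spark.implicits._

val df = Seq(("John", 3000), ("John", 4000), ("Jane", 2500)).toDF("name", "salary")

df.groupBy("name").sum("salary").show()
// or, equivalently, with agg() - which also lets you alias the result column:
df.groupBy("name").agg(sum("salary").as("total_salary")).show()
// John -> 7000, Jane -> 2500
```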

Q08. How do you calculate average salary by name?
A08. Use agg()/avg().

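For example (illustrative data; `spark` session assumed):

```scala
import org.apache.spark.sql.functions.avg
import spark.implicits._

val df = Seq(("John", 3000), ("John", 4000), ("Jane", 2500)).toDF("name", "salary")

df.groupBy("name").agg(avg("salary").as("avg_salary")).show()
// John -> 3500.0, Jane -> 2500.0
```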

Q09. How will you aggregate salary by name & age with different combinations?
A09. Use “cube(…)”.

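A minimal sketch (illustrative data; `spark` session assumed):

```scala
import org.apache.spark.sql.functions.sum
import spark.implicits._

val df = Seq(
  ("John", 30, 3000),
  ("John", 40, 4000),
  ("Jane", 25, 2500)
).toDF("name", "age", "salary")

// cube() aggregates over every combination of the grouping columns:
// (name, age), (name), (age), and the grand total (both null).
df.cube("name", "age").agg(sum("salary").as("total")).orderBy("name", "age").show()
// 9 rows for this data: 3 (name, age) pairs + 2 names + 3 ages + 1 grand total
```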

Q10. How does rollup() differ from cube()?
A10. rollup(…) returns a subset of cube(…): it computes hierarchical subtotals from left to right.

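Using the same illustrative data as the cube example (`spark` session assumed):

```scala
import org.apache.spark.sql.functions.sum
import spark.implicits._

val df = Seq(
  ("John", 30, 3000),
  ("John", 40, 4000),
  ("Jane", 25, 2500)
).toDF("name", "age", "salary")

// rollup() computes (name, age), (name), and the grand total,
// but NOT (age) alone - that combination only appears with cube().
df.rollup("name", "age").agg(sum("salary").as("total")).show()
// 6 rows for this data: 3 (name, age) pairs + 2 names + 1 grand total
```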

Q11. How will you rank salary by name?
A11. Window aggregate functions to the rescue. These are functions that perform a calculation over a group of records, called a window, that are in some relation to the current record.

Note: Use “withColumn” to add a new column.

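A minimal sketch (illustrative data; `spark` session assumed):

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.rank
import spark.implicits._

val df = Seq(("John", 3000), ("John", 4000), ("Jane", 2500)).toDF("name", "salary")

// Rank salaries within each name, highest salary first.
val byName = Window.partitionBy("name").orderBy($"salary".desc)

df.withColumn("salary_rank", rank().over(byName)).show()
// John/4000 -> 1, John/3000 -> 2, Jane/2500 -> 1
```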

Q12. How will you display the average salary by name?
A12. Use avg(…) over a window partitioned by name, again adding the result with “withColumn” so every row also carries its group’s average.

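A minimal sketch of the windowed-average approach (illustrative data; `spark` session assumed):

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.avg
import spark.implicits._

val df = Seq(("John", 3000), ("John", 4000), ("Jane", 2500)).toDF("name", "salary")

df.withColumn("avg_salary", avg("salary").over(Window.partitionBy("name"))).show()
// Every row keeps its original columns plus its name's average:
// John rows -> 3500.0, Jane row -> 2500.0
```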

Spark Scala on Databricks notebook

You can easily get started on Databricks to practice more examples with Scala by following Getting started with Spark on Databricks.

