Spark interview Q&As with coding examples in Scala – part 10: adding new columns

This covers one of the most popular Spark interview questions – adding new columns to a Spark DataFrame. Most non-trivial Spark jobs add new columns to an existing DataFrame, and these columns are often derived from existing columns.

Q01. What are the different ways in which you can add new columns to a Dataframe?
A01. A new column (or multiple columns) can be added to a Spark DataFrame with the withColumn(..) and select(..) methods on the DataFrame, with map(..) on the underlying RDD or Dataset, and with foldLeft(..) on a Scala collection combined with withColumn(..).

Here are a few examples based on the code below:

Note: This is a simple example for demonstration purposes only. In enterprise applications never use the double data type for monetary calculations – use BigDecimal, a Money class, or represent the amount in cents.
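Here is a minimal sketch of the kind of base DataFrame assumed in the examples below. The column names (id, name, salary), the sample rows and the df1 variable name are made up for illustration – the original article may have used a different dataset:

import org.apache.spark.sql.SparkSession

// On Databricks the spark handle is readily available; the builder is only
// needed when running as a standalone application.
val spark = SparkSession.builder()
  .appName("AddNewColumns")
  .master("local[*]")
  .getOrCreate()

import spark.implicits._

// assumed sample employee data – salary as double for demonstration only
val df1 = Seq(
  (1, "John", 35000.0),
  (2, "Peter", 45000.0),
  (3, "Sam", 55000.0)
).toDF("id", "name", "salary")

df1.show()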

Output:

Let’s add a new derived or calculated column named “bonus”, computed as 10% of the existing column “salary”.

withColumn(..) on Dataframe

col() is a function imported via org.apache.spark.sql.functions._. withColumn() takes 2 args – the new column name & the column value as a Column expression. Instead of col() you can also refer to the column as $“salary” in Scala. The $ sign converts a column name into a Column object via the implicit conversions brought in by the import sqlContext.implicits._ (or spark.implicits._).
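A minimal sketch, assuming the df1 DataFrame above:

import org.apache.spark.sql.functions.col

// add a derived column "bonus" = 10% of the existing "salary" column
val df2 = df1.withColumn("bonus", col("salary") * 0.10)

// equivalent, using the $ syntax (requires import spark.implicits._)
val df2b = df1.withColumn("bonus", $"salary" * 0.10)

df2.show()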

Output:

select(..) on Dataframe

select(..) gives the same result. The col(..) form works without the spark.sqlContext.implicits._ import; with the import you can also use the $ syntax, and there are further equivalent ways of referring to the column – via df1("salary") or an expr(..) – as shown in the sketch below.
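A sketch of some equivalent select(..) forms, again assuming the df1 DataFrame above (the exact variants shown in the original may differ):

import org.apache.spark.sql.functions.{col, expr}

// keep all existing columns and add the derived "bonus" column in one projection
val df3  = df1.select(col("*"), (col("salary") * 0.10).as("bonus"))

// equivalent forms
val df3b = df1.select($"*", ($"salary" * 0.10).alias("bonus"))           // needs import spark.implicits._
val df3c = df1.select(df1("*"), (df1("salary") * 0.10).as("bonus"))
val df3d = df1.select(col("*"), expr("salary * 0.10").as("bonus"))

df3.show()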

Output:

map(..) on RDD

At times you may need to add multiple columns after applying some transformations. In this scenario you can use either map() or foldLeft(). Let’s see an example with a map.

Note: These examples were tested on Databricks, where the spark & sqlContext handles are readily available.
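A minimal sketch using the underlying RDD[Row], assuming the df1 DataFrame above:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

// map over the RDD[Row], appending the derived "bonus" value to each row
val rddWithBonus = df1.rdd.map { row =>
  val salary = row.getAs[Double]("salary")
  Row.fromSeq(row.toSeq :+ salary * 0.10)
}

// extend the original schema with the new column and rebuild the DataFrame
val newSchema = StructType(df1.schema.fields :+ StructField("bonus", DoubleType, nullable = true))
val df4 = spark.createDataFrame(rddWithBonus, newSchema)

df4.show()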

Output:

foldLeft(..) on List & withColumn() on Dataframe
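A hedged sketch of this approach, assuming the df1 DataFrame above: foldLeft(..) iterates over a Scala List of (column name, column expression) pairs – the pairs here are made up for illustration – and calls withColumn(..) on the accumulated DataFrame in each step:

import org.apache.spark.sql.functions.col

// the new columns to add, as (name, expression) pairs
val newColumns = List(
  ("bonus",        col("salary") * 0.10),
  ("total_salary", col("salary") * 1.10)
)

// start the fold with df1 and add one column per iteration
val df5 = newColumns.foldLeft(df1) { case (df, (name, colExpr)) =>
  df.withColumn(name, colExpr)
}

df5.show()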

Output:

select(..) with map(..) on List

The withColumn(..) method of DataFrame can cause performance issues when called repeatedly in a loop, because every call adds another projection to the query plan. Here is an alternative using the select(..) function on the DataFrame & the map(..) function on a Scala collection such as a Seq or List.
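A minimal sketch, assuming the df1 DataFrame above – the existing column names are mapped to Column objects and concatenated with the new derived columns in a single projection (the new column names and expressions are made up):

import org.apache.spark.sql.functions.col

// existing columns as a List[Column]
val existingCols = df1.columns.map(col).toList

// new derived columns as a List[Column]
val newCols = List(
  (col("salary") * 0.10).as("bonus"),
  (col("salary") * 1.10).as("total_salary")
)

// a single select(..) call instead of repeated withColumn(..) calls
val df6 = df1.select(existingCols ++ newCols: _*)

df6.show()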

Q. Can df1.columns.map(x => col(x)) be written as df1.columns.map(col)?
A. Yes. col is a function in the org.apache.spark.sql.functions package. “df1.columns” gives the column names as an Array[String]. Mapping col over that array converts each String into a Column, giving the existing columns as a Seq[Column] (strictly an Array[Column], which can be used wherever a Seq[Column] is expected). Writing map(col) instead of map(x => col(x)) relies on Scala’s eta-expansion.
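For example, assuming the df1 DataFrame above:

import org.apache.spark.sql.functions.col

val cols1 = df1.columns.map(x => col(x))   // Array[Column]
val cols2 = df1.columns.map(col)           // same result via eta-expansion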

Q. What is : _* in Scala?
A.

1) ++ is the concatenation operator for Scala collections, e.g. Seq(1, 2) ++ Seq(3).

2) : is a type ascription – a hint that tells the compiler what type the expression has.

3) _* is a placeholder accepting any value, used with : as the varargs expansion.

So, “Seq[Column] ++ Seq[Column] : _*” concatenates the two sequences and expands the result from Seq[Column] to Column* – it hints to the compiler that you are passing varargs as opposed to a single sequence.

Here is a simpler example:
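The original example is not shown here, so this is a plain Scala sketch of the same idea – expanding a collection into a varargs parameter:

def printAll(values: String*): Unit = values.foreach(println)

val names = List("spark", "scala", "hadoop")

// printAll(names)                      // does not compile: List[String] is not String*
printAll(names: _*)                     // expands the List into varargs
printAll(names ++ List("hive"): _*)     // concatenate first, then expand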

Q02. How will you add 5 new columns for the years from 2017 to 2021 with same value as salary?
A02. Here is the code, written as a standalone Spark application running in local mode using Scala.
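Here is a sketch of the application skeleton assumed by the snippets below. The object name, sample data and column names are made up for illustration, and each of the following snippets is assumed to sit inside this object:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object AddYearColumnsApp extends App {

  // standalone Spark application running in local mode
  val spark = SparkSession.builder()
    .appName("AddYearColumnsApp")
    .master("local[*]")
    .getOrCreate()

  import spark.implicits._

  // assumed sample data
  val df1 = Seq(
    (1, "John", 35000.0),
    (2, "Peter", 45000.0)
  ).toDF("id", "name", "salary")

  // ... snippets from the sections below go here ...
}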

Using withColumn(..)
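One possible sketch, assuming the skeleton above and chaining one withColumn(..) call per year:

val df2 = df1
  .withColumn("2017", col("salary"))
  .withColumn("2018", col("salary"))
  .withColumn("2019", col("salary"))
  .withColumn("2020", col("salary"))
  .withColumn("2021", col("salary"))

df2.show()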

Output:

Using RDD & map(…)
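One possible sketch via the underlying RDD[Row], assuming the skeleton above:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

// append the salary value 5 times to each row
val rddWithYears = df1.rdd.map { row =>
  val salary = row.getAs[Double]("salary")
  Row.fromSeq(row.toSeq ++ Seq.fill(5)(salary))
}

// extend the schema with the 5 year columns
val yearFields = (2017 to 2021).map(y => StructField(y.toString, DoubleType, nullable = true))
val schemaWithYears = StructType(df1.schema.fields ++ yearFields)

val df3 = spark.createDataFrame(rddWithYears, schemaWithYears)
df3.show()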

Output:

Using DataFrame & map(…)

The spark.implicits._ import provides the implicit Encoder needed for the map(..) conversion.
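One possible sketch, assuming the skeleton above – each Row is mapped to a Tuple8 (3 original values + 5 year values) and converted back to a DataFrame with explicit column names:

val df4 = df1.map { row =>
  val id     = row.getAs[Int]("id")
  val name   = row.getAs[String]("name")
  val salary = row.getAs[Double]("salary")
  (id, name, salary, salary, salary, salary, salary, salary)
}.toDF("id", "name", "salary", "2017", "2018", "2019", "2020", "2021")

df4.show()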

Output:

If you had used .toDF() without column names, or had not used .toDF() at all, you would have got _1 to _8 as the column headers, denoting the Tuple8 fields:

Using foldLeft(…) & withColumn()
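One possible sketch, assuming the skeleton above:

// fold over the range of years, adding one column per year
val df5 = (2017 to 2021).foldLeft(df1) { (df, year) =>
  df.withColumn(year.toString, col("salary"))
}

df5.show()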

Output:

Using select(…)

This approach performs better than withColumn(…) when you add new columns in a loop, because all the columns are added in a single projection instead of growing the query plan one withColumn(…) call at a time.
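One possible sketch, assuming the skeleton above:

// build all the new columns up front and add them in a single select(..)
val yearCols = (2017 to 2021).map(year => col("salary").as(year.toString))
val df6 = df1.select(df1.columns.map(col) ++ yearCols: _*)

df6.show()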

Output:

