05: PySpark read from a file & add new columns: interview Q&As with tutorial

This extends the earlier example on PySpark with a schema & a simple join of two DataFrames.

Q08: How would you read CSV data into a Spark DataFrame?
A08: Here is a step-by-step tutorial that reads the employee.csv file into a DataFrame.

Step 1: Create a .csv file named employee.csv under a folder, say /tmp, with a few sample rows as shown below.
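The exact rows are just sample data assumed for this tutorial: an employee id, a name, a department & a salary, with no header row.

```
100,John,Sales,45000
200,Jane,Marketing,54000
300,Sam,Engineering,60000
```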

Step 2: Write a PySpark job named read_csv_spark.py as shown below:
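A minimal sketch of the job (the app name is illustrative):

```python
from pyspark.sql import SparkSession

# Create (or reuse) a Spark session
spark = SparkSession.builder.appName("read_csv_spark").getOrCreate()

# Without a schema or a header row, Spark names the columns
# _c0, _c1, _c2, ... and reads every value as a string
df = spark.read.csv("/tmp/employee.csv")
df.show()
```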

Step 3: Run the job:
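For example, with spark-submit:

```
spark-submit read_csv_spark.py
```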

Outputs:
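With the sample employee.csv above, the output would look along these lines:

```
+---+----+-----------+-----+
|_c0| _c1|        _c2|  _c3|
+---+----+-----------+-----+
|100|John|      Sales|45000|
|200|Jane|  Marketing|54000|
|300| Sam|Engineering|60000|
+---+----+-----------+-----+
```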

Q09: How would you add a new column named “updated_ts” to the DataFrame?
A09: The withColumn function comes to the rescue, as shown below, adding a new column “updated_ts” populated with the current timestamp.
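A sketch of the modified job, saved as, say, add_column_spark.py (the file & app names are illustrative), using current_timestamp() to populate the new column:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import current_timestamp

spark = SparkSession.builder.appName("add_column_spark").getOrCreate()

df = spark.read.csv("/tmp/employee.csv")

# withColumn returns a new DataFrame with the extra column appended;
# current_timestamp() fills it with the time the job runs
df = df.withColumn("updated_ts", current_timestamp())
df.show(truncate=False)
```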

Run the PySpark job:
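```
spark-submit add_column_spark.py
```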

Outputs:
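With the sample data, something along these lines (the timestamp reflects the run time):

```
+---+----+-----------+-----+--------------------------+
|_c0|_c1 |_c2        |_c3  |updated_ts                |
+---+----+-----------+-----+--------------------------+
|100|John|Sales      |45000|2025-01-01 10:15:30.123456|
|200|Jane|Marketing  |54000|2025-01-01 10:15:30.123456|
|300|Sam |Engineering|60000|2025-01-01 10:15:30.123456|
+---+----+-----------+-----+--------------------------+
```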

Q10: What if you want to read the file with your own schema instead of the default column names _c0, _c1, _c2 and _c3?
A10: Here is the code that defines a schema & reads the file “employee.csv” with it.
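A sketch saved as, say, schema_spark.py; the column names in the schema are assumptions based on the sample data:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName("schema_spark").getOrCreate()

# Four string columns matching the four fields in employee.csv
schema = StructType([
    StructField("emp_id", StringType(), True),
    StructField("emp_name", StringType(), True),
    StructField("emp_dept", StringType(), True),
    StructField("emp_salary", StringType(), True),
])

df = spark.read.csv("/tmp/employee.csv", schema=schema)
df.show()
df.printSchema()
```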

Run the PySpark job:
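```
spark-submit schema_spark.py
```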

Outputs:
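With the sample data:

```
+------+--------+-----------+----------+
|emp_id|emp_name|   emp_dept|emp_salary|
+------+--------+-----------+----------+
|   100|    John|      Sales|     45000|
|   200|    Jane|  Marketing|     54000|
|   300|     Sam|Engineering|     60000|
+------+--------+-----------+----------+

root
 |-- emp_id: string (nullable = true)
 |-- emp_name: string (nullable = true)
 |-- emp_dept: string (nullable = true)
 |-- emp_salary: string (nullable = true)
```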

Q11: What if you want to change the emp_id data type from string to int?
A11: The cast function comes to the rescue, as shown below.
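A sketch saved as, say, cast_spark.py, building on the schema above:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName("cast_spark").getOrCreate()

schema = StructType([
    StructField("emp_id", StringType(), True),
    StructField("emp_name", StringType(), True),
    StructField("emp_dept", StringType(), True),
    StructField("emp_salary", StringType(), True),
])

df = spark.read.csv("/tmp/employee.csv", schema=schema)

# Overwrite emp_id with the same column cast from string to int
df = df.withColumn("emp_id", col("emp_id").cast("int"))
df.printSchema()
```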

Run the PySpark job:
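```
spark-submit cast_spark.py
```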

Outputs:
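The printed schema now shows emp_id as an integer:

```
root
 |-- emp_id: integer (nullable = true)
 |-- emp_name: string (nullable = true)
 |-- emp_dept: string (nullable = true)
 |-- emp_salary: string (nullable = true)
```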

Q12: How would you calculate a flat emp_bonus of 10% of the salary?
A12: The withColumn and lit functions come to the rescue, as shown below.
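A sketch saved as, say, bonus_spark.py; the bonus is computed as 10% of emp_salary (Spark implicitly casts the string salary to a double for the multiplication):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName("bonus_spark").getOrCreate()

schema = StructType([
    StructField("emp_id", StringType(), True),
    StructField("emp_name", StringType(), True),
    StructField("emp_dept", StringType(), True),
    StructField("emp_salary", StringType(), True),
])

df = spark.read.csv("/tmp/employee.csv", schema=schema)

# lit(0.10) wraps the flat 10% rate as a Column so it can be
# multiplied against emp_salary
df = df.withColumn("emp_bonus", col("emp_salary") * lit(0.10))
df.show()
```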

Run the PySpark job:
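```
spark-submit bonus_spark.py
```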

Outputs:
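With the sample data:

```
+------+--------+-----------+----------+---------+
|emp_id|emp_name|   emp_dept|emp_salary|emp_bonus|
+------+--------+-----------+----------+---------+
|   100|    John|      Sales|     45000|   4500.0|
|   200|    Jane|  Marketing|     54000|   5400.0|
|   300|     Sam|Engineering|     60000|   6000.0|
+------+--------+-----------+----------+---------+
```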

PySpark loading a file

Note that we can also use the API below to load a file, which is more readable:
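A likely sketch of that style is the DataFrameReader format/option/load chain (the options shown are assumptions):

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName("load_csv_spark").getOrCreate()

schema = StructType([
    StructField("emp_id", StringType(), True),
    StructField("emp_name", StringType(), True),
    StructField("emp_dept", StringType(), True),
    StructField("emp_salary", StringType(), True),
])

# Same read as spark.read.csv(...), expressed as a builder chain
df = (spark.read.format("csv")
      .option("header", "false")
      .schema(schema)
      .load("/tmp/employee.csv"))
df.show()
```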

