Prerequisite: This extends Databricks – Spark problem 1.

Problem: Convert the table below
+--------+------------+---+
|emp_name|     product|qty|
+--------+------------+---+
|    John|    Notebook|  7|
|   Peter|Mobile phone| 12|
|     Sam|      Camera|  3|
|   David|Mobile phone|  9|
|  Elliot|    Notebook|  3|
|  Elliot|      Camera|  1|
|    John|Mobile phone| 13|
|   Peter|Mobile phone|  2|
+--------+------------+---+
to
+--------+------+------------+--------+
|emp_name|Camera|Mobile phone|Notebook|
+--------+------+------------+--------+
|  Elliot|     1|        null|       3|
|    John|  null|          13|       7|
|     Sam|     3|        null|    null|
|   Peter|  null|          14|    null|
|   David|  null|           9|    null|
+--------+------+------------+--------+
Step 1: Create a new Python notebook and attach it to a new cluster.

Step 2: Let's create some sample data using PySpark.
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

sales = (
    ('John', 'Notebook', 7),
    ('Peter', 'Mobile phone', 12),
    ('Sam', 'Camera', 3),
    ('David', 'Mobile phone', 9),
    ('Elliot', 'Notebook', 3),
    ('Elliot', 'Camera', 1),
    ('John', 'Mobile phone', 13),
    ('Peter', 'Mobile phone', 2)
)

# Schema for the three columns; none of them is nullable
data_schema = [StructField('emp_name', StringType(), False),
               StructField('product', StringType(), False),
               StructField('qty', IntegerType(), False)]
final_struc = StructType(fields=data_schema)

# Distribute the tuples as an RDD, then apply the schema to build the DataFrame
rdd = spark.sparkContext.parallelize(sales)
df_sales = spark.createDataFrame(rdd, final_struc)

df_sales.show()
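As a side note (not part of the original walkthrough), the same DataFrame can be built without the explicit parallelize step, since createDataFrame also accepts a local Python collection together with a schema. A minimal equivalent sketch:

# Equivalent one-step construction from the local data
df_sales = spark.createDataFrame(list(sales), schema=final_struc)
df_sales.show()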
pivot()…
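The explanation of pivot() is cut off above, but a minimal sketch of how it could produce the target table, assuming the quantities are aggregated with sum(), looks like this:

# Group by employee, pivot the distinct product values into columns,
# and sum the quantities; products an employee never sold come out as null
df_pivot = (
    df_sales
    .groupBy('emp_name')
    .pivot('product')
    .sum('qty')
)
df_pivot.show()

If the set of products is known up front, listing the pivot values explicitly, e.g. .pivot('product', ['Camera', 'Mobile phone', 'Notebook']), avoids the extra pass Spark would otherwise make to discover the distinct values.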