Prerequisite: This article extends Databricks – Spark Window functions.

Step 1: Create a new Python notebook and attach it to a cluster.

Step 2: Let's create some data using PySpark.
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Sample rows: (emp_id, emp_name, emp_expertise)
emp_list = (
    (1, 'John', 'Java'),
    (2, 'Peter', 'Scala'),
    (2, 'Peter', 'Python'),
    (2, 'Peter', 'Spark'),
    (1, 'John', 'JEE'),
    (6, 'Elliot', 'Unix')
)

# Schema: all three columns are non-nullable
data_schema = [StructField('emp_id', IntegerType(), False),
               StructField('emp_name', StringType(), False),
               StructField('emp_expertise', StringType(), False)]
final_struc = StructType(fields=data_schema)

# Distribute the rows as an RDD, then build the DataFrame from it
emp_rdd = spark.sparkContext.parallelize(emp_list)
emp_df = spark.createDataFrame(emp_rdd, final_struc)
emp_df.show()
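A quick side note: the parallelize step is not strictly required, since spark.createDataFrame can also build the DataFrame straight from a local collection and the same schema. A minimal equivalent sketch, assuming the emp_list and final_struc defined above:

# Skip the explicit RDD and let Spark distribute the local data itself
emp_df = spark.createDataFrame(emp_list, final_struc)
emp_df.show()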
Output:
+------+--------+-------------+
|emp_id|emp_name|emp_expertise|
+------+--------+-------------+
|     1|    John|         Java|
|     2|   Peter|        Scala|
|     2|   Peter|       Python|
|     2|   Peter|        Spark|
|     1|    John|          JEE|
|     6|  Elliot|         Unix|
+------+--------+-------------+
agg() & collect_list()

Step 3: Let's group…
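Since Step 3 introduces agg() and collect_list(), here is a minimal sketch of the grouping it builds toward, using the emp_df created above: collapse each employee's expertise rows into a single array column. Note that collect_list() does not guarantee element order.

from pyspark.sql.functions import collect_list

# One row per employee, with all expertise values gathered into an array
grouped_df = emp_df.groupBy('emp_id', 'emp_name') \
                   .agg(collect_list('emp_expertise').alias('emp_expertise'))
grouped_df.show(truncate=False)

With this data, Peter's row should carry Scala, Python, and Spark together in a single list.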