Blog Archives

01: Databricks getting started – PySpark, Shell, and SQL


Step 1:
Sign up for the Databricks Community Edition at https://databricks.com/try-databricks. Fill in the details (you can leave your mobile number blank), select “COMMUNITY EDITION” and then click “GET STARTED“.

If you already have a cloud Databricks account, you can use that instead.

Step 2: Check your email, click the link in the email, and reset your password.

Step 3: Login to Databricks notebook:
https://community.cloud.databricks.com/login.html.

Step 4: Create a CLUSTER; it will take a few minutes to come up. Note that the cluster auto-terminates after 2 hours.

Step 5: Select “DATA“, and upload a file named “employee.csv”.

Step 6: “Create Table With UI” as shown below:

Note: Please check the “First row is header” checkbox on the LHS so that the column names are taken from the file.

Click on “Create Table“.

Step 7: Click on the “databricks” icon on the LHS menu, and then “Create a Blank Notebook“.

Spark in Python (i.e. PySpark)

Since we created the notebook as “Python“, we don’t have to use the “%python” magic command, as it is the default language. …
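As a minimal sketch of the first PySpark cells (assuming the uploaded file was registered as a table named "employee" via “Create Table With UI”; the table name is an assumption, so use whatever name the Data tab shows):

# Read the table that was created from employee.csv
df = spark.table("employee")

df.printSchema()   # show the inferred schema
df.show(5)         # display the first 5 rows

# In a Python notebook the other languages are reached via magic commands, e.g.
#   %sql SELECT * FROM employee LIMIT 5
#   %sh  ls /databricks/driver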


02: Databricks – Spark schemas, casting & PySpark API

Prerequisite: Extends Databricks getting started – Spark, Shell, SQL. Q: What is a DataFrame? A: A DataFrame is a data abstraction or a domain-specific language (DSL) for working with structured and semi-structured data, i.e. datasets with a schema. DataFrames are immutable, stored in memory, resilient (i.e. fault-tolerant), distributed (i.e. spread…
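As a rough sketch of the kind of code this post walks through (the column names emp_id, emp_name and emp_salary, and the file path, are assumptions for illustration):

from pyspark.sql.types import StructType, StructField, IntegerType, StringType
from pyspark.sql.functions import col

# Explicit schema instead of relying on inferSchema
schema = StructType([
    StructField("emp_id", IntegerType(), True),
    StructField("emp_name", StringType(), True),
    StructField("emp_salary", StringType(), True),   # read as string, cast below
])

df = (spark.read
      .option("header", "true")
      .schema(schema)
      .csv("/FileStore/tables/employee.csv"))        # path is an assumption

# Cast emp_salary from string to double
df = df.withColumn("emp_salary", col("emp_salary").cast("double"))
df.printSchema()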



03: Databricks – Spark SCD Type 1

Prerequisite: Extends Databricks getting started – Spark, Shell, SQL.

What is SCD Type 1?

SCD stands for Slowly Changing Dimension, and it was explained in 10 Data warehouse interview Q&As.

Step 1: Remove all cells in the notebook with the “x” and confirm, or create a new Python notebook. If the cluster is not running (it auto-terminates after 2 hours), create a new cluster and attach it to the notebook.

Step 2: We have already uploaded “employee.csv” under “Data”; now let’s upload the new delta file “employee_delta.csv”:

Now, for SCD Type 1:

1) UPDATE the record where emp_id=2 with the new salary info from “employee_delta.csv”.

2) INSERT the records that are new in “employee_delta.csv”.

NOTE: We don’t have to DELETE, as deletes are normally handled as logical deletes with a new field such as “active=y” / “active=n”.

INNER JOIN

Inner join the two DataFrames to find the “emp_id” values that are in both employee.csv…
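A minimal sketch of one way to put these steps together (the DataFrame names emp and delta and the column names are assumptions; the post itself starts with the inner join above to identify the updated keys):

# emp   = DataFrame loaded from employee.csv
# delta = DataFrame loaded from employee_delta.csv
# Both are assumed to share the key column emp_id and the same set of columns.

# Keys present in both files are the ones to UPDATE (the inner join step)
updated_keys = emp.join(delta, on="emp_id", how="inner").select("emp_id")

# Rows not touched by the delta stay as they are
untouched = emp.join(delta, on="emp_id", how="left_anti")

# SCD Type 1 result: untouched rows plus all delta rows (updates and new inserts)
scd1 = untouched.unionByName(delta)
scd1.show()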


04: Databricks – Spark SCD Type 2

Prerequisite: Extends 03: Databricks – Spark SCD Type 1. What is SCD Type 2 SCD stands for Slowly Changing Dimension, and it was explained in 10 Data warehouse interview Q&As. Step 1: You may have to reattach the cluster to the notebook as clusters auto terminate after 2 hours. Create…
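A rough sketch of the SCD Type 2 idea covered here (the DataFrame names emp and delta, and the columns active, start_date and end_date, are assumptions; emp is assumed to already carry those history columns while delta holds only the business columns):

from pyspark.sql.functions import current_date, lit

changed_keys = delta.select("emp_id")

# Expire the old versions of changed records; Type 2 keeps them as history
expired = (emp.join(changed_keys, "emp_id", "inner")
              .withColumn("active", lit("n"))
              .withColumn("end_date", current_date()))

# Rows not affected by the delta stay as they are
untouched = emp.join(changed_keys, "emp_id", "left_anti")

# New versions of changed rows and brand-new rows come in as active
incoming = (delta.withColumn("active", lit("y"))
                 .withColumn("start_date", current_date())
                 .withColumn("end_date", lit(None).cast("date")))

scd2 = untouched.unionByName(expired).unionByName(incoming)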



04a: Databricks – Spark SCD Type 1 with Merge

Prerequisite: Extends 03: Databricks – Spark SCD Type 1. What is SCD Type 1 SCD stands for Slowly Changing Dimension, and it was explained in 10 Data warehouse interview Q&As. Step 1: You may have to reattach the cluster to the notebook as clusters auto terminate after 2 hours. Create…
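A minimal sketch of the MERGE-based approach (it assumes the dimension has been saved as a Delta table named employee and that the delta file is available as a table or view named employee_delta; both names are assumptions):

spark.sql("""
  MERGE INTO employee AS target
  USING employee_delta AS source
  ON target.emp_id = source.emp_id
  WHEN MATCHED THEN
    UPDATE SET *    -- SCD Type 1: overwrite with the latest values
  WHEN NOT MATCHED THEN
    INSERT *        -- brand-new records
""")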



04b: Databricks – Spark SCD Type 2 with Merge

Prerequisite:…



05: Databricks – Spark UDFs

Prerequisite: Extends Databricks getting started – Spark, Shell, SQL.

What is a UDF?

A User-Defined Function (UDF) is a feature of Spark SQL that lets you define new Column-based functions to extend the vocabulary of Spark SQL’s DSL for transforming Datasets.

Step 1: Create a new Notebook in Databricks, and choose Python as the language.

Step 2: The data is already uploaded and the table has been created.

Step 3: Let’s create a UDF that calculates a bonus of 10% of the “emp_salary”.
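A minimal sketch of such a UDF (the table and column names follow the earlier posts and are assumptions):

from pyspark.sql.functions import udf, col
from pyspark.sql.types import DoubleType

# UDF that returns a 10% bonus on a salary value
@udf(returnType=DoubleType())
def bonus(salary):
    return salary * 0.10 if salary is not None else None

df = spark.table("employee")   # table name is an assumption
df.withColumn("bonus", bonus(col("emp_salary").cast("double"))).show()

If the UDF is also needed in %sql cells, spark.udf.register can be used to give it a SQL name.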



06: Databricks – Spark Window functions

Prerequisite: Extends Databricks getting started – Spark, Shell, SQL. What is a window function? Q. What are the different types of functions in Spark SQL? A. There are 4 types of functions: 1) Built-in functions: from org.apache.spark.sql.functions like to_date(Column e), to_utc_timestamp(Column e), etc. Take values from a single row as…
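A small sketch of a window function in PySpark (the dept_id and emp_salary columns are assumptions for illustration):

from pyspark.sql import Window
from pyspark.sql.functions import row_number, col

# Rank employees by salary within each department
w = Window.partitionBy("dept_id").orderBy(col("emp_salary").desc())

df = spark.table("employee")   # table name is an assumption
df.withColumn("salary_rank", row_number().over(w)).show()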



07: Databricks – groupBy, collect_list & explode

Prerequisite: Extends Databricks – Spark Window functions. Step 1: Create a new Python notebook, and attach it to a cluster. Step 2: Let’s create some data using PySpark.

Output:

agg() & collect_list()

Step 3: Let’s group by “emp_id” and collect all the courses into a list. …
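A minimal sketch of the round trip (the sample rows are made up for illustration):

from pyspark.sql.functions import collect_list, explode

data = [(1, "Spark"), (1, "Scala"), (2, "Python"), (2, "SQL"), (2, "Spark")]
df = spark.createDataFrame(data, ["emp_id", "course"])

# groupBy + agg(collect_list): one row per emp_id with all courses as a list
courses = df.groupBy("emp_id").agg(collect_list("course").alias("courses"))
courses.show(truncate=False)

# explode: back from the list to one row per (emp_id, course)
courses.select("emp_id", explode("courses").alias("course")).show()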



08: Databricks – Spark problem 1

Prerequisite: Extends Databricks – Spark Window functions.

Problem: Convert the below table into one where each column is counted for its occurrence. …


