Pre-requisite: Docker is installed on your machine for Mac OS X (E.g. $ brew cask install docker) or Windows 10. Docker interview Q&As.
This tutorial extends Apache Zeppelin on Docker Tutorial – Docker pull from Docker hub and Spark stand-alone to read a file from local file system
1 2 3 4 5 6 7 | val lines = sc.textFile("file:///zeppelin/seed/employees.txt") case class Employee (id: Integer, name: String, location: String, salary: Double ) val dfEmployees = lines.map(s => s.split(",")).map(s => Employee(s(0).toInt, s(1), s(2), s(3).toDouble)).toDF() |
1. Print the schema of the Dataframe
1 2 | dfEmployees.printSchema() |
root
|– id: integer (nullable = true)
|– name: string (nullable = true)
|– location: string (nullable = true)
|– salary: double (nullable = true)
2. Show contents of a Dataframe
1 2 | dfEmployees.show() |
1 2 3 4 5 6 7 8 9 10 11 12 | +---+---------+----------+--------+ | id| name| location| salary| +---+---------+----------+--------+ | 1| John| USA|100000.0| | 2| Peter| Australia|200000.0| | 3| Sam| USA| 76000.0| | 4| Daniel| France| 86000.0| | 5| Simon| Australia| 96000.0| | 6| Roseanne| France|156000.0| +---+---------+----------+--------+ |
3. Count number of rows in a Dataframe
1 2 | dfEmployees.count() |
res12: Long = 6
4. Add a new column to a Dataframe
1 2 3 | dfEmployees.withColumn("bonus", dfEmployees.col("salary") * 0.02) .show() |
1 2 3 4 5 6 7 8 9 10 11 12 | +---+---------+----------+--------+------+ | id| name| location| salary| bonus| +---+---------+----------+--------+------+ | 1| John| USA|100000.0|2000.0| | 2| Peter| Australia|200000.0|4000.0| | 3| Sam| USA| 76000.0|1520.0| | 4| Daniel| France| 86000.0|1720.0| | 5| Simon| Australia| 96000.0|1920.0| | 6| Roseanne| France|156000.0|3120.0| +---+---------+----------+--------+------+ |
You can drop a column with “dfEmployees.drop(“location”).show()”
5. Select a few columns from a Dataframe
1 2 3 4 5 | dfEmployees.withColumn("bonus", dfEmployees.col("salary") * 0.02) .select("id", "bonus") .show() |
1 2 3 4 5 6 7 8 9 10 11 12 | +---+------+ | id| bonus| +---+------+ | 1|2000.0| | 2|4000.0| | 3|1520.0| | 4|1720.0| | 5|1920.0| | 6|3120.0| +---+------+ |
6. Distinct values
1 2 3 4 5 | dfEmployees.select("location") .distinct() .show() |
1 2 3 4 5 6 7 8 9 | +----------+ | location| +----------+ | Australia| | USA| | France| +----------+ |
7. Sorting
1 2 3 4 | dfEmployees.orderBy("location") .show() |
1 2 3 4 5 6 7 8 9 10 11 12 | +---+---------+----------+--------+ | id| name| location| salary| +---+---------+----------+--------+ | 2| Peter| Australia|200000.0| | 5| Simon| Australia| 96000.0| | 4| Daniel| France| 86000.0| | 6| Roseanne| France|156000.0| | 1| John| USA|100000.0| | 3| Sam| USA| 76000.0| +---+---------+----------+--------+ |
8. Applying SQL queries
1 2 3 | dfEmployees.registerTempTable("employees_tbl") |
1 2 3 4 5 | %sql Select * from employees_tbl |
9. Filtering by a predicate
1 2 3 | dfEmployees.filter(dfEmployees.col("salary") > 100000.00).show() |
1 2 3 4 5 6 7 8 | +---+---------+----------+--------+ | id| name| location| salary| +---+---------+----------+--------+ | 2| Peter| Australia|200000.0| | 6| Roseanne| France|156000.0| +---+---------+----------+--------+ |
10. Grouping & aggregation
1 2 3 | dfEmployees.groupBy("location").agg(sum("salary")).show() |
1 2 3 4 5 6 7 8 9 | +----------+-----------+ | location|sum(salary)| +----------+-----------+ | Australia| 296000.0| | USA| 176000.0| | France| 242000.0| +----------+-----------+ |
11. Map operations on Dataframe columns
We can apply a function on each row of DataFrame using map operation.
1 2 3 | dfEmployees.select("id", "salary").map(row => (row.getInt(0), row.getDouble(1) + 220.0 )).show() |
1 2 3 4 5 6 7 8 9 10 11 12 | +---+--------+ | _1| _2| +---+--------+ | 1|100220.0| | 2|200220.0| | 3| 76220.0| | 4| 86220.0| | 5| 96220.0| | 6|156220.0| +---+--------+ |
12. Get some stats on your data
1 2 3 | dfEmployees.select("salary").describe().show() |
1 2 3 4 5 6 7 8 9 10 11 | +-------+-----------------+ |summary| salary| +-------+-----------------+ | count| 6| | mean| 119000.0| | stddev|48493.29850608226| | min| 76000.0| | max| 200000.0| +-------+-----------------+ |