Blog Archives

00: Apache Spark ecosystem & anatomy interview Q&As

Q01. Can you summarise the Spark eco system?
A01. Apache Spark is a general-purpose cluster computing system. It provides high-level APIs in Java, …

Read more ›



02: Cleansing & pre-processing data in BigData & machine learning with Spark interview Q&As

Q1. Why are data cleansing & pre-processing important in analytics & machine learning? A1. Garbage in gets you garbage out, no matter how good your machine learning algorithm is. …



12 Apache Spark getting started interview Q&As

Q01. Where is Apache Spark used in the Hadoop ecosystem?
A01. Spark is essentially a data processing framework that is faster & more flexible than MapReduce.

Read more ›



14: Q105 – Q108 Spark “map” vs “flatMap” interview questions & answers

Q105. What is the difference between the “map” and “flatMap” operations in Spark? A105. Both map and flatMap are transformation operations in Spark. …

Read more ›
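The distinction the teaser describes can be sketched in plain Python (illustrative only; real Spark code would use `rdd.map` / `rdd.flatMap` on an RDD, not Python lists):

```python
# Plain-Python sketch of the semantics of Spark's map vs flatMap
# (illustrative analogy only -- not the Spark API itself).
lines = ["hello world", "spark rdd"]

# map: exactly one output element per input element
# (here, each line maps to a list of its words)
mapped = [line.split(" ") for line in lines]
# -> [['hello', 'world'], ['spark', 'rdd']]

# flatMap: each input element may yield 0..n output elements,
# and the results are flattened into a single collection
flat_mapped = [word for line in lines for word in line.split(" ")]
# -> ['hello', 'world', 'spark', 'rdd']
```

So `map` preserves the one-to-one shape of the input, while `flatMap` flattens nested results, which is why word-count examples use `flatMap` to split lines into words.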



15: Q109 – Q113 Spark RDD partitioning and “mapPartitions” interview questions & answers

Q109. What is the difference between the “map” and “mapPartitions” transformations in Spark? A109. The map method converts each element of the source RDD into a single element of the result …
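The difference can be illustrated with a plain-Python model of an RDD as a list of partitions (an analogy, not the Spark API; in Spark you would call `rdd.map` and `rdd.mapPartitions`):

```python
# Plain-Python sketch of map vs mapPartitions semantics.
# Model an RDD as a list of partitions, each partition a list of elements.
partitions = [[1, 2, 3], [4, 5]]

# map: the function sees one element at a time
mapped = [[x * 10 for x in part] for part in partitions]

# mapPartitions: the function sees a whole partition (an iterator) at a time,
# so expensive setup (e.g. opening a DB connection) can happen once per
# partition instead of once per element.
def per_partition(it):
    total = sum(it)      # setup/teardown would bracket this loop in Spark
    return [total]       # may return fewer (or more) elements than the input

map_partitioned = [per_partition(iter(part)) for part in partitions]
# mapped -> [[10, 20, 30], [40, 50]]; map_partitioned -> [[6], [9]]
```

Note that `mapPartitions` is free to change the number of output elements per partition, which `map` cannot do.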



17: Spark interview Q&As with coding examples in PySpark (i.e. Python)

Q01. How will you create a Spark context? Q02. How will you create a Dataframe by reading a file from an AWS S3 bucket? …

Read more ›



40+ Apache Spark best practices & optimisation interview FAQs – Part-1

There are many different ways to solve a given big data problem in Spark, but some approaches can hurt performance and lead to memory issues. …



40+ Apache Spark best practices & optimisation interview FAQs – Part-2

This extends 40+ Apache Spark best practices & optimisation interview FAQs – Part-1, where best practices 1-10 were covered with examples & diagrams. #11 Use Spark UI: Running Spark jobs...



40+ Apache Spark best practices & optimisation interview FAQs – Part-3: Partitions & buckets

#31 Bucketing is another data optimisation technique that groups rows with the same bucket key value into a fixed number of “buckets”. Bucketing improves performance in wide transformations and joins by …



5 Spark streaming & Apache Storm interview Q&As

Q116. What is “Spark streaming” in the Spark ecosystem alongside Spark core, Spark SQL, Spark MLlib, Spark GraphX, etc.? A116. Spark is a distributed and scalable batch processing framework that …



6 Delta Lake interview questions & answers

Q01. What is Delta Lake for Apache Spark? A01. Delta Lake is an open source storage layer that brings reliability to data lakes. …

Read more ›



8 Apache Spark repartition Vs. coalesce scenarios interview Q&As

Q01: Why is partitioning of data required in Apache Spark? A01: Partitioning is a key concept in distributed systems, where the data is split into multiple partitions so that you …
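The repartition-vs-coalesce trade-off the title refers to can be sketched with a plain-Python model of partitions (an analogy only; in Spark these are `df.repartition(n)` and `df.coalesce(n)`, and the round-robin placement below is a simplification of what Spark actually does):

```python
# Plain-Python sketch of coalesce vs repartition (not real Spark code).
partitions = [[1], [2, 3], [4], [5, 6]]   # 4 input partitions

# coalesce(n): merge existing partitions locally -- no full shuffle,
# so data only moves between partitions that are being combined,
# but the result can be unevenly sized.
def coalesce(parts, n):
    out = [[] for _ in range(n)]
    for i, part in enumerate(parts):
        out[i % n].extend(part)           # whole partitions are merged
    return out

# repartition(n): a full shuffle -- every individual record is
# redistributed, producing evenly sized partitions at a higher cost.
def repartition(parts, n):
    flat = [x for part in parts for x in part]
    return [flat[i::n] for i in range(n)]

coalesced = coalesce(partitions, 2)         # [[1, 4], [2, 3, 5, 6]]
repartitioned = repartition(partitions, 2)  # [[1, 3, 5], [2, 4, 6]]
```

This is why `coalesce` is preferred when only *reducing* the partition count (e.g. before writing output), while `repartition` is used when even sizing or *increasing* the count matters.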



8 Spark streaming interview Q&As

Q01. What is Spark streaming? A01. Spark streaming extends Spark core/SQL with what is called micro-batching to give near-real-time processing, where the arriving live stream of data from sources …
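The micro-batching idea mentioned above can be sketched in plain Python: an unbounded stream is chopped into small bounded batches, each handed to a batch engine (an analogy for Spark Streaming's DStream model, not Spark code; in Spark the batches are bounded by a time interval rather than a count):

```python
# Plain-Python sketch of micro-batching (illustrative analogy only).
def micro_batches(stream, batch_size):
    batch = []
    for event in stream:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch          # hand one small batch to the batch engine
            batch = []
    if batch:
        yield batch              # flush the final partial batch

events = range(7)                # stand-in for a live event stream
batches = list(micro_batches(events, 3))
# -> [[0, 1, 2], [3, 4, 5], [6]]
```

Because each micro-batch is just a small batch job, Spark can reuse its batch engine and fault-tolerance machinery, at the cost of latency equal to at least one batch interval.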



Apache Spark SQL join types interview Q&As

Q1. What are the different Spark SQL join types?
A1. There are different SQL join types like inner join, left/right outer joins, full outer join,

Read more ›
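The join types named in the teaser can be illustrated with a plain-Python sketch of their semantics (an analogy only; in Spark SQL you would write `df1.join(df2, cond, "inner")` or `"left_outer"`):

```python
# Plain-Python sketch of inner vs left-outer join semantics
# (illustrative analogy only -- not the Spark SQL API).
emps = [("alice", 10), ("bob", 20), ("carol", 99)]   # (name, dept_id)
depts = {10: "eng", 20: "sales"}                     # dept_id -> dept_name

# inner join: keep only rows whose dept_id has a match on the other side
inner = [(n, d, depts[d]) for (n, d) in emps if d in depts]

# left outer join: keep every left row; an unmatched right side becomes
# None (NULL in SQL terms)
left_outer = [(n, d, depts.get(d)) for (n, d) in emps]
# inner      -> [('alice', 10, 'eng'), ('bob', 20, 'sales')]
# left_outer -> adds ('carol', 99, None)
```

A right outer join is the mirror image (keep every right row), and a full outer join keeps unmatched rows from both sides with NULLs on the missing side.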



Debugging Spark applications written in Java locally by connecting to HDFS, Hive and HBase

This extends Remotely debugging Spark submit Jobs in Java. Running Spark in local mode: when you run Spark in local mode, both the Driver and Executor will be running in …



Spark interview Q&As with coding examples in Scala – part 01: Key basics

Some of these basic Apache Spark interview questions can make or break your chance to get an offer.

Q01. Why is “===” used in the below Dataframe join?

Read more ›



Spark interview Q&As with coding examples in Scala – part 02: partition pruning & column projection

This extends Spark interview Q&As with coding examples in Scala – part 1 with the key optimisation concepts.

Partition Pruning

Q13. What do you understand by the concept Partition Pruning?

Read more ›



Spark interview Q&As with coding examples in Scala – part 03: Partitioning, pruning & predicate push down

This extends Spark interview Q&As with coding examples in Scala – part 2 with more coding examples on a Databricks notebook. Prerequisite: Create a free account as per Databricks getting …



Spark interview Q&As with coding examples in Scala – part 04: Memory considerations, lazy computation & caching

This extends Spark interview Q&As with coding examples in Scala – part 3 with more coding examples on a Databricks notebook. Prerequisite: Create a free account as per Databricks getting …



Spark interview Q&As with coding examples in Scala – part 05: Transformations, actions, pipelining & shuffling

This extends Spark interview Q&As with coding examples in Scala – part 4 with more coding examples on a Databricks notebook.

Prerequisite: Create a free account as per Databricks getting started.

Read more ›



Spark interview Q&As with coding examples in Scala – part 06: groupBy, collect_list & explode

This extends Spark interview Q&As with coding examples in Scala – part 5 with more coding examples on a Databricks notebook. Prerequisite: Create a free account as per Databricks getting …



Spark interview Q&As with coding examples in Scala – part 07: map, flatMap, mapPartitions & mapValues

This extends Spark interview Q&As with coding examples in Scala – part 6 with more coding examples on a Databricks notebook. Prerequisite: Create a free account as per Databricks getting …



Spark interview Q&As with coding examples in Scala – part 08: pivot, unpivot, agg & selectExpr

This extends Spark interview Q&As with coding examples in Scala – part 7 with more coding examples on a Databricks notebook. Prerequisite: Create a free account as per Databricks getting …



Spark interview Q&As with coding examples in Scala – part 09: sql.functions

This extends Spark interview Q&As with coding examples in Scala – part 8 with more coding examples on a Databricks notebook. Prerequisite: Create a free account as per Databricks getting …



Spark interview Q&As with coding examples in Scala – part 10: adding new columns

This covers one of the most popular Spark interview questions: adding new columns to Spark Dataframes. Most non-trivial Spark jobs add columns to an existing Dataframe.

Read more ›



Spark interview Q&As with coding examples in Scala – part 11: add column values conditionally, lit & typedLit

This extends Spark interview Q&As with coding examples in Scala – part 10: adding new columns. Q01. How will you add a new column with a fixed value? …

Read more ›



Spark join strategies & performance tuning interview Q&As

Q1. What are the different types of Spark join strategies? A1. There are 3 types of joins: 1) Sort Merge Join – …

Read more ›



Spark understanding DAG for tuning performance interview Q&As

This extends 15 Apache Spark best practices & performance tuning interview FAQs to delve into DAGs, Stages, Tasks, Partitions and Shuffling in Spark. If you can’t read Spark Event Timelines...


