Blog Archives

00: Apache Spark eco system & anatomy interview questions and answers

Q01. Can you summarise the Spark ecosystem?
A01. Apache Spark is a general purpose cluster computing system. It provides high-level APIs in Java, Scala, Python, and R. It has six components: Spark Core, Spark SQL, Spark Streaming, Spark MLlib, Spark GraphX, and SparkR. All the functionality provided by Apache Spark is built on top of Spark Core.… Read more ...


02: Cleansing & pre-processing data in BigData & machine learning with Spark interview questions & answers

Q1. Why are data cleansing & pre-processing important in analytics & machine learning?
A1. Garbage in gets you garbage out, no matter how good your machine learning algorithm is.

Q2. What are the general steps of cleansing data?
A2. General steps involve deduplication, dropping or imputing missing values, fixing structural errors, removing outliers, encoding categorical values, and scaling the features.… Read more ...
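A few of the steps above can be sketched in plain Python (the records, the plausible-age range, and the mean-imputation strategy are made up for illustration; in a real Spark job you would reach for DataFrame operations such as dropDuplicates and na.fill):

```python
# Illustrative records: a duplicate, a missing value, and an outlier.
records = [
    {"id": 1, "age": 34},
    {"id": 1, "age": 34},    # duplicate
    {"id": 2, "age": None},  # missing value
    {"id": 3, "age": 999},   # outlier
    {"id": 4, "age": 29},
]

# 1. Deduplicate on the full record.
seen, deduped = set(), []
for r in records:
    key = (r["id"], r["age"])
    if key not in seen:
        seen.add(key)
        deduped.append(r)

# 2. Remove obvious outliers first, so they do not skew the imputation.
deduped = [r for r in deduped if r["age"] is None or 0 <= r["age"] <= 120]

# 3. Impute missing ages with the mean of the remaining known ages.
known = [r["age"] for r in deduped if r["age"] is not None]
mean_age = sum(known) / len(known)
cleaned = [dict(r, age=r["age"] if r["age"] is not None else mean_age)
           for r in deduped]
```

Note that the order matters: imputing before removing outliers would let the outlier (999) inflate the imputed mean.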


11a: 40+ Apache Spark best practices & optimisation interview FAQs – Part 1

There are so many different ways to solve the big data problems at hand in Spark, but some approaches can significantly impact performance and lead to memory issues. Here are some best practices to keep in mind when writing Spark jobs.

#1 Favor DataFrames, Datasets, or SparkSQL over RDDs: Spark DataFrames, Datasets or SparkSQL are optimised, hence faster than RDDs, especially when working with structured data.… Read more ...



11b: 40+ Apache Spark best practices & optimisation interview FAQs – Part 2 Spark UI

This extends 40+ Apache Spark best practices & optimisation interview FAQs – Part-1, where best practices 1-10 were covered with examples & diagrams.

#11 Use Spark UI: Running Spark jobs without inspecting the Spark UI is a definite NO. It is a very handy debugging & performance tuning tool.… Read more ...



11c: 40+ Apache Spark best practices & optimisation interview FAQs – Part 3 Partitions & buckets

This extends 40+ Apache Spark best practices & optimisation interview FAQs – Part-2 Spark UI. #31 Bucketing is another data optimisation technique that groups rows sharing the same bucket-column value into a fixed number of “buckets”. Bucketing improves performance in wide transformations and joins by minimising or avoiding data “shuffles”….

Read more ...
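The idea behind bucketing can be sketched in plain Python (Spark actually uses a Murmur3-based hash and writes one file set per bucket; the simple modulo "hash" and the rows below are illustrative only). Because rows with the same key always land in the same bucket, two tables bucketed identically on the join key can be joined bucket-by-bucket without a shuffle:

```python
NUM_BUCKETS = 4

def bucket_for(key: int) -> int:
    # Deterministic bucket assignment: same key -> same bucket, always.
    return key % NUM_BUCKETS

# (customer_id, item) rows, made up for illustration.
orders = [(101, "book"), (102, "pen"), (105, "ink"), (101, "lamp")]

buckets = {b: [] for b in range(NUM_BUCKETS)}
for customer_id, item in orders:
    buckets[bucket_for(customer_id)].append((customer_id, item))
```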


11d: 40+ Apache Spark best practices & optimisation interview FAQs – Part 4 Small Files problem

Q38: What is the difference between repartition & coalesce? A38: Repartition evenly distributes the data across the executors, which requires a full reshuffle and hence is slower than coalesce; however, it can improve the overall performance of the job by preventing data skew. Repartition can either…

Read more ...
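The contrast can be sketched in plain Python (the partition contents are made up; this mimics the behaviour, not the Spark API). Repartition touches every row to rebalance, while coalesce only merges existing partitions, so skew can survive a coalesce:

```python
partitions = [[1, 2, 3, 4, 5, 6], [7], [8], [9]]  # skewed input

def repartition(parts, n):
    # Full shuffle: every row is collected and redistributed round-robin.
    rows = [r for p in parts for r in p]
    return [rows[i::n] for i in range(n)]

def coalesce(parts, n):
    # No shuffle: whole partitions are merged; rows never move individually.
    out = [[] for _ in range(n)]
    for i, p in enumerate(parts):
        out[i % n].extend(p)
    return out

even = repartition(partitions, 2)   # sizes 5 and 4 -> balanced
merged = coalesce(partitions, 2)    # sizes 7 and 2 -> still skewed
```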


12 Apache Spark getting started interview questions and answers

Q01. Where is Apache Spark used in the Hadoop ecosystem? A01. Spark is essentially a data processing framework that is faster & more flexible than “Map Reduce”. Spark itself has grown into an ecosystem with Spark SQL, Spark streaming, Spark UI, GraphX, MLlib, and SparkR. Apache Spark…

Read more ...


14: Q105 – Q108 Spark “map” vs “flatMap” interview questions & answers

Q105. What is the difference between “map” and “flatMap” operations in Spark? A105. map and flatMap are transformation operations in Spark. The map transformation is applied to each element of an RDD and returns the result as a new RDD. map takes N elements as input and returns N elements…

Read more ...
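The N-in/N-out versus N-in/M-out contrast can be sketched with plain-Python equivalents of the two transformations (the input lines are made up for illustration):

```python
lines = ["hello world", "spark"]

# map: exactly one output element per input element (N -> N).
mapped = [line.split(" ") for line in lines]
# -> [["hello", "world"], ["spark"]] : 2 in, 2 out (each a list)

# flatMap: each input element may produce 0..M outputs, then flattened (N -> M).
flat_mapped = [word for line in lines for word in line.split(" ")]
# -> ["hello", "world", "spark"] : 2 in, 3 out
```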


15: Q109 – Q113 Spark RDD partitioning and “mapPartitions” interview questions & answers

Q109. What is the difference between “map” and “mapPartitions” transformations in Spark? A109. The method map converts each element of the source RDD into a single element of the result RDD by applying a function. The method mapPartitions converts each partition of the source RDD into multiple elements of the…

Read more ...
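A plain-Python sketch of the difference (partition contents are made up; this mimics the semantics, not the Spark API): map invokes the function once per element, whereas mapPartitions invokes it once per partition with an iterator over that partition, which is handy for per-partition setup such as opening a database connection once per partition instead of once per row:

```python
partitions = [[1, 2, 3], [4, 5]]

calls = {"map": 0, "mapPartitions": 0}

def square(x):
    calls["map"] += 1            # invoked once per ELEMENT
    return x * x

def square_partition(iterator):
    calls["mapPartitions"] += 1  # invoked once per PARTITION
    return [x * x for x in iterator]

mapped = [[square(x) for x in p] for p in partitions]
partition_mapped = [square_partition(p) for p in partitions]
```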


17: Spark interview Q&As with coding examples in pyspark (i.e. python)

Q01. How will you create a Spark context? A01.

Read more ...


5 Spark streaming & Apache Storm Interview Q&As

Q116. What is “Spark streaming” in the Spark ecosystem with Spark core, Spark SQL, Spark MLlib, Spark GraphX, etc.? A116. Spark is a distributed and scalable batch processing framework that supports in-memory processing with sub-second latencies. The batch processes are scheduled with something like Oozie or Unix cron jobs to run…

Read more ...


6 Delta Lake interview questions & answers

Q01. What is Delta Lake for Apache Spark? A01. Delta Lake is an open source storage layer that brings reliability to data lakes. It is a dependency JAR that needs to be added to your Apache Spark project. All you need to make sure is that you have the correct version…

Read more ...


8 Apache Spark repartition Vs. coalesce scenarios interview Q&As

Q01: Why is partitioning of data required in Apache Spark?
A01: Partitioning is a key concept in distributed systems where the data is split into multiple partitions so that you can execute transformations on multiple partitions in parallel, which helps a job to finish faster.

For example, a 512MB text file read from storage can be read into memory for computation and partitioned into 4 blocks of 128MB each.… Read more ...
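The arithmetic above can be sketched as follows (the file and block sizes are the illustrative figures from the example; the real partition count also depends on the input format and Spark's split settings):

```python
import math

file_size_mb = 512
block_size_mb = 128

# Roughly: number of input partitions = ceil(file size / block size),
# so up to that many tasks can read the file in parallel.
num_partitions = math.ceil(file_size_mb / block_size_mb)
# num_partitions == 4
```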



8 Apache Spark streaming interview questions and answers

Q01 What is Spark streaming? A01 Spark streaming extends Spark core/sql with what is called micro-batching to give near-real-time processing, where the live stream of data arriving from sources like Kafka, Kinesis, Flume, the HDFS file system, etc. is divided into batches of a predefined interval. For example, the files that arrive…

Read more ...
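Micro-batching can be sketched in plain Python (the timestamps, payloads, and 10-second interval are made up for illustration; in Spark each resulting batch would be processed as a small job):

```python
batch_interval = 10  # seconds

# (arrival_time_in_seconds, payload) events from a continuous stream.
events = [(1, "a"), (4, "b"), (12, "c"), (13, "d"), (27, "e")]

batches = {}
for ts, payload in events:
    batch_id = ts // batch_interval          # which micro-batch this event joins
    batches.setdefault(batch_id, []).append(payload)

# batch 0 -> ["a", "b"], batch 1 -> ["c", "d"], batch 2 -> ["e"]
```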


Apache Spark SQL join types interview questions and answers

Q1. What are the different Spark SQL join types?
A1. There are different SQL join types like inner join, left/right outer joins, full outer join, left semi-join, left anti-join and self-join.
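The two less familiar join types above can be sketched in plain Python (the tables are made up; this mimics the semantics, not the Spark API): a left semi-join keeps only the left-side rows that have a match, returning left columns only, while a left anti-join keeps only the left-side rows that have no match:

```python
employees = [(1, "Ann"), (2, "Bob"), (3, "Cy")]  # (dept_id, name)
departments = {1, 2}                             # existing dept_ids

# Left semi-join: left rows WITH a match on the right.
semi = [(d, n) for d, n in employees if d in departments]

# Left anti-join: left rows WITHOUT a match on the right.
anti = [(d, n) for d, n in employees if d not in departments]
```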

Q2. Given the below tables, can you give examples of the above join types?

Read more ...


Debugging Spark applications written in Java locally by connecting to HDFS, Hive and HBase

This extends Remotely debugging Spark submit Jobs in Java. Running Spark in local mode: when you run Spark in local mode, both the Driver and the Executor run in the same JVM, which is very handy for debugging the logic of your transformations. You can run within an IDE…

Read more ...


Spark interview Q&As with coding examples in Scala – part 01: Key basics

Some of these basic Apache Spark interview questions can make or break your chance to get an offer.

Q01. Why is “===” used in the below DataFrame join?

Read more ...

Spark interview Q&As with coding examples in Scala – part 02: partition pruning & column projection

This extends Spark interview Q&As with coding examples in Scala – part 1 with the key optimisation concepts.

Partition Pruning

Q13. What do you understand by the concept Partition Pruning?
A13. Spark & Hive table partitioning by year, month, country, department, etc. will optimise reads by storing files in a hierarchy of directories based on the partitioning keys, hence reducing the amount of I/O needed to process your query/data.… Read more ...
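The effect can be sketched in plain Python (the paths are hypothetical; Spark's file-source readers do this pruning from the `key=value` directory names before any data is read):

```python
# Files laid out under a hierarchy of partition directories.
paths = [
    "/data/sales/year=2020/month=01/part-0000.parquet",
    "/data/sales/year=2020/month=02/part-0000.parquet",
    "/data/sales/year=2021/month=01/part-0000.parquet",
]

# A query with WHERE year = 2021 only needs the matching directories;
# everything else is pruned without being opened.
pruned = [p for p in paths if "/year=2021/" in p]
```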


Spark interview Q&As with coding examples in Scala – part 03: Partitioning, pruning & predicate push down

This extends Spark interview Q&As with coding examples in Scala – part 2 with more coding examples on a Databricks notebook. Prerequisite: create a free account as per Databricks getting started. Log in to community.cloud.databricks.com, and click on “Clusters” to create a Spark cluster. You can now create a new workspace…

Read more ...


Spark interview Q&As with coding examples in Scala – part 04: Memory considerations, lazy computation & caching

This extends Spark interview Q&As with coding examples in Scala – part 3 with more coding examples on a Databricks notebook. Prerequisite: create a free account as per Databricks getting started. Log in to community.cloud.databricks.com, and click on “Clusters” to create a Spark cluster. In-memory lazy computation Q27. One of the…

Read more ...

