Blog Archives

00: Apache Spark ecosystem & anatomy interview Q&As

Q01. Can you summarise the Spark eco system?
A01. Apache Spark is a general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R. It has 6 components: Core, Spark SQL, Spark Streaming,

Read more ›
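To make the components concrete, here is a minimal PySpark sketch (assuming a local pyspark installation) showing two of them side by side: Core via the low-level RDD API, and Spark SQL via DataFrames and SQL queries on top of it.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ecosystem-demo").master("local[*]").getOrCreate()

# Spark Core: the low-level RDD API
rdd = spark.sparkContext.parallelize([("alice", 34), ("bob", 28)])

# Spark SQL: the higher-level DataFrame/SQL API built on top of Core
df = rdd.toDF(["name", "age"])
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 30").show()

spark.stop()
```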



02: Cleansing & pre-processing data in BigData & machine learning with Spark interview Q&As

Q1. Why are data cleansing & pre-processing important in analytics & machine learning?
A1. Garbage in gets you garbage out, no matter how good your machine learning algorithm is.

Q2. What are the general steps of cleansing data?
A2.

Read more ›
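To give a taste of those steps, here is a hedged PySpark sketch of a few common cleansing operations; the column names and sample rows are made up for illustration, not taken from the full post.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("cleansing-demo").master("local[*]").getOrCreate()

raw = spark.createDataFrame(
    [(" Alice ", "25", "a@x.com"), ("Bob", None, "b@x.com"), ("Bob", None, "b@x.com")],
    ["name", "age", "email"],
)

cleansed = (
    raw.dropDuplicates()                             # remove exact duplicate rows
       .withColumn("name", F.trim(F.col("name")))    # strip stray whitespace
       .withColumn("age", F.col("age").cast("int"))  # enforce a numeric type
       .fillna({"age": -1})                          # make missing values explicit
)
cleansed.show()
spark.stop()
```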



12 Apache Spark getting started interview Q&As

Q01. Where is Apache Spark used in the Hadoop ecosystem?
A01. Spark is essentially a data processing framework that is faster & more flexible than MapReduce. Spark itself has grown into an ecosystem with Spark SQL, Spark Streaming, Spark UI, GraphX,

Read more ›
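To illustrate the "faster & more flexible" point, here is the classic word count sketched in PySpark; the whole job is a few lines, where a hand-written MapReduce program would need separate mapper and reducer classes.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount-demo").master("local[*]").getOrCreate()
sc = spark.sparkContext

lines = sc.parallelize(["spark is fast", "spark is flexible"])
counts = (
    lines.flatMap(lambda line: line.split())  # "map" side: emit individual words
         .map(lambda word: (word, 1))
         .reduceByKey(lambda a, b: a + b)     # "reduce" side: sum counts per word
)
print(counts.collect())  # e.g. [('spark', 2), ('is', 2), ('fast', 1), ('flexible', 1)]
spark.stop()
```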



14: Q105 – Q108 Spark “map” vs “flatMap” interview questions & answers

Q105. What is the difference between “map” and “flatMap” operations in Spark? A105. map and flatMap are transformation operations in Spark. The map transformation is applied to each element of an RDD and returns the result as a new RDD. … Read more ›
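A quick sketch of the difference on a toy RDD (assuming a local pyspark installation):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("map-vs-flatmap").master("local[*]").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(["a b", "c"])

# map: exactly one output element per input element (here, a list per line)
print(rdd.map(lambda s: s.split()).collect())      # [['a', 'b'], ['c']]

# flatMap: each input element may yield zero or more output elements,
# and the results are flattened into a single RDD
print(rdd.flatMap(lambda s: s.split()).collect())  # ['a', 'b', 'c']
spark.stop()
```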



15: Q109 – Q113 Spark RDD partitioning and “mapPartitions” interview questions & answers

Q109. What is the difference between “map” and “mapPartitions” transformations in Spark? A109. The method map converts each element of the source RDD into a single element of the result RDD by applying a function. The method mapPartitions converts each partition of the source RDD into multiple elements of the...
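A sketch of the contrast on a toy RDD; the per-partition setup comment is a hypothetical use case, not from the original post:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mappartitions-demo").master("local[*]").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(range(1, 9), numSlices=2)

def double_partition(iterator):
    # hypothetical per-partition setup (e.g. opening a DB connection) would
    # go here, paying the cost once per partition rather than once per element
    for x in iterator:
        yield x * 2

print(rdd.map(lambda x: x * 2).collect())             # applied element by element
print(rdd.mapPartitions(double_partition).collect())  # applied partition by partition
spark.stop()
```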



17: Spark interview Q&As with coding examples in pyspark (i.e. python)

Q01. How will you create a Spark context? A01. Q02. How will you create a DataFrame by reading a file from an AWS S3 bucket? A02. Q03. How will you create a DataFrame by reading a table in a database? … Read more ›
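Hedged sketches of the three answers follow; the bucket name, file path, JDBC URL, table name and credentials are all placeholders, not values from the post.

```python
from pyspark.sql import SparkSession

# Q01: in modern PySpark a SparkSession wraps the SparkContext
spark = SparkSession.builder.appName("getting-started").getOrCreate()
sc = spark.sparkContext

# Q02: read from S3 (assumes the hadoop-aws jars and AWS credentials are configured)
s3_df = spark.read.csv("s3a://my-bucket/path/data.csv", header=True, inferSchema=True)

# Q03: read a database table over JDBC (assumes the driver jar is on the classpath)
jdbc_df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://dbhost:5432/mydb")
    .option("dbtable", "my_table")
    .option("user", "user")
    .option("password", "password")
    .load()
)
```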



40+ Apache Spark best practices & optimisation interview FAQs – Part-4: Small files problem

Q38. What causes small files to be created? A38. Files that are a few KBs or MBs in size are considered to be small files. Ideally, anything smaller than the block size (e.g. 128 MB) is a small file. When Spark writes data to storage systems like HDFS, … Read more ›
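One common mitigation, sketched below with a placeholder output path, is to cut the number of partitions before the write so that fewer, larger files land on disk.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("small-files-demo").getOrCreate()

df = spark.range(1_000_000)  # toy DataFrame; real jobs often end up with hundreds of partitions

# Spark writes one file per partition, so a write after a wide transformation
# can produce many KB-sized files. Coalescing first keeps the file count down.
df.coalesce(8).write.mode("overwrite").parquet("/tmp/output/")
```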



40+ Apache Spark best practices & optimisation interview FAQs – Part-1

There are many different ways to solve a big data problem in Spark, but some approaches can lead to performance and memory issues. Here are some best practices to keep in mind when writing Spark jobs.

#1 Favor DataFrames,

Read more ›
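A short sketch of best practice #1: the DataFrame version below goes through the Catalyst optimiser, whereas the RDD lambdas are opaque to Spark.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dataframe-demo").master("local[*]").getOrCreate()

data = [("alice", 34), ("bob", 28)]

# RDD style: opaque Python lambdas that Spark cannot optimise
rdd_result = spark.sparkContext.parallelize(data).filter(lambda t: t[1] > 30).collect()

# DataFrame style: declarative expressions that Catalyst can optimise
df = spark.createDataFrame(data, ["name", "age"])
df.filter(F.col("age") > 30).show()
spark.stop()
```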



40+ Apache Spark best practices & optimisation interview FAQs – Part-2: Spark UI

This extends 40+ Apache Spark best practices & optimisation interview FAQs – Part-1, where best practices 1-10 were covered with examples & diagrams. #11 Use the Spark UI: Running Spark jobs without inspecting the Spark UI is a definite NO. … Read more ›
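As a related sketch, event logging can be switched on so that completed jobs stay inspectable in the History Server UI after the application exits; the log directory below is a placeholder.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("ui-demo")
    .config("spark.eventLog.enabled", "true")
    .config("spark.eventLog.dir", "file:///tmp/spark-events")
    .getOrCreate()
)
# While the job runs, the live Spark UI is served from the driver,
# by default at http://localhost:4040
```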



40+ Apache Spark best practices & optimisation interview FAQs – Part-3: Partitions & buckets

This extends 40+ Apache Spark best practices & optimisation interview FAQs – Part-2: Spark UI. #31 Bucketing is another data optimisation technique that groups data with the same bucket value across a fixed number of “buckets”. Bucketing improves performance in wide transformations and joins by minimising or avoiding data “shuffles”. …
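A sketch of bucketing at write time; the table and column names are placeholders, and note that bucketBy requires a table-based write via saveAsTable.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bucketing-demo").getOrCreate()

df = spark.range(100).withColumnRenamed("id", "customer_id")

# 8 fixed buckets hashed on the join key; later joins on customer_id between
# tables bucketed the same way can minimise or avoid a full shuffle
(
    df.write.bucketBy(8, "customer_id")
      .sortBy("customer_id")
      .mode("overwrite")
      .saveAsTable("customers_bucketed")
)
```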



