Blog Archives

00: Apache Spark ecosystem & anatomy interview questions and answers

Q01. Can you summarise the Spark ecosystem?
A01. Apache Spark is a general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python, and R. It has 6 components: Core, Spark SQL, Spark Streaming, Spark MLlib, Spark GraphX, and SparkR. All the functionality provided by Apache Spark is built on top of Spark Core. Spark Core is the foundation for in-memory parallel and distributed processing of huge datasets with fault-tolerance & recovery.

The Spark SQL component is a distributed framework for structured data processing. Spark Streaming is an add-on API, which allows scalable, high-throughput, fault-tolerant stream processing of live data streams. Spark can access data from sources like Kafka, Flume, Amazon Kinesis or a TCP socket. MLlib is Spark's scalable machine learning library. GraphX is Spark's API for graphs and graph-parallel computation. The key component of SparkR is the SparkR DataFrame.

Q02. What are the key execution components of Apache Spark?
A02. Apache Spark uses a master-slave architecture, where there is one master process & multiple slave (aka worker) processes. This master-slave architecture is applied at two levels:

1) Cluster Management Level: The Application Master is the master and the Node Managers are the slaves. The Application Master is responsible for coordinating the Node Managers to allocate resources like memory & CPU cores.

2) Application Level: The Spark Driver is the master & the Spark executors are the slaves/workers.… Read more ...
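
A minimal PySpark sketch of how the driver is configured to request executors from the cluster manager; the app name and resource values below are illustrative assumptions, not recommendations:

from pyspark.sql import SparkSession

# Hypothetical resource settings for illustration only.
spark = (SparkSession.builder
         .appName("driver-executor-demo")
         .master("yarn")
         .config("spark.executor.instances", "4")  # number of slave/worker processes
         .config("spark.executor.cores", "5")      # cores per executor
         .config("spark.executor.memory", "8g")    # memory per executor
         .getOrCreate())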


02: Cleansing & pre-processing data in BigData & machine learning with Spark interview questions & answers

Q1. Why are data cleansing & pre-processing important in analytics & machine learning?
A1. Garbage in gets you garbage out, no matter how good your machine learning algorithm is.

Q2. What are the general steps of cleansing data?
A2. General steps involve deduplication, dropping/imputing missing values, fixing structural errors, removing…
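
A minimal sketch of these cleansing steps in PySpark; the input path and column names (customer_id, age, country) are hypothetical:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("cleansing-demo").getOrCreate()

# Hypothetical input path & schema for illustration.
df = spark.read.csv("s3a://my-bucket/raw/customers.csv", header=True, inferSchema=True)

cleansed = (df
    .dropDuplicates()                          # deduplication
    .dropna(subset=["customer_id"])            # drop rows missing the key
    .fillna({"age": 0, "country": "unknown"})  # impute missing values
    .withColumn("country",
                F.trim(F.lower(F.col("country")))))  # fix structural errors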

Read more ...


11a: 40+ Apache Spark best practices & optimisation interview FAQs – Part 1

There are many different ways to solve a given big data problem in Spark, but some approaches can degrade performance and lead to memory issues. Here are some best practices to keep in mind when writing Spark jobs. #1 Favor DataFrames, Datasets, or SparkSQL over…
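
A small sketch of that first best practice, with made-up sales data: the RDD version hides the logic inside lambdas, while the DataFrame version lets Spark's optimiser plan the same aggregation:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("df-over-rdd").getOrCreate()

# Hypothetical sales data for illustration.
rows = [("books", 10.0), ("toys", 25.0), ("books", 5.0)]

# RDD API: opaque lambdas, no Catalyst/Tungsten optimisation.
rdd_totals = (spark.sparkContext.parallelize(rows)
              .reduceByKey(lambda a, b: a + b))

# DataFrame API: same result, but the optimiser can plan it.
df_totals = (spark.createDataFrame(rows, ["category", "amount"])
             .groupBy("category").sum("amount"))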

Read more ...


11b: 40+ Apache Spark best practices & optimisation interview FAQs – Part 2 Spark UI

This extends 40+ Apache Spark best practices & optimisation interview FAQs – Part 1, where best practices 1-10 were covered with examples & diagrams.

#11 Use Spark UI: Running Spark jobs without inspecting the Spark UI is a definite NO. It is a very handy debugging & performance tuning tool.

The UI allows you to monitor and inspect the execution of jobs. Stages, tasks, and shuffle writes and reads are concrete concepts that can be monitored from the Spark UI.

Jobs are divided into “stages” at shuffle boundaries. A stage is a physical unit of execution. Each stage is further divided into tasks based on the number of partitions in the RDD, so tasks are the smallest units of execution in Spark. Each task maps to a single core and works on a single partition of data. An executor with 15 cores can have 15 tasks working on 15 partitions in parallel. If your job has many stages, it is performing that many shuffle operations.
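
For instance, here is a minimal word-count sketch (with a hypothetical input path) where the shuffle forced by reduceByKey splits the job into two stages visible in the Spark UI:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stage-demo").getOrCreate()
sc = spark.sparkContext

# The reduceByKey forces a shuffle, so the job splits into two stages:
# map-side work before the shuffle, reduce-side work after it.
counts = (sc.textFile("hdfs:///data/huge.txt")   # stage 1: read & map
            .flatMap(lambda line: line.split())
            .map(lambda w: (w, 1))
            .reduceByKey(lambda a, b: a + b))    # shuffle boundary -> stage 2
counts.count()  # the action that triggers the job shown in the Spark UI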

You can click on a job id to view the DAG (Directed Acyclic Graph) of the RDD objects.

Here is a popular Apache Spark Job Interview question:

Q12: Given that you are processing a huge text file with the following steps, how many jobs & stages will be created in the DAG? What are some of the key considerations to watch out for?… Read more ...


11c: 40+ Apache Spark best practices & optimisation interview FAQs – Part 3 Partitions & buckets

This extends 40+ Apache Spark best practices & optimisation interview FAQs – Part 2 Spark UI. #31 Bucketing is another data optimisation technique that groups data by the value of a bucketing column into a fixed number of “buckets”. Bucketing improves performance in wide transformations and joins by minimising or avoiding data “shuffles”….
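
A minimal sketch of bucketing in PySpark; the table, path, bucket count, and join key (customer_id) are hypothetical:

from pyspark.sql import SparkSession

spark = (SparkSession.builder.appName("bucketing-demo")
         .enableHiveSupport()  # bucketBy requires writing to a table
         .getOrCreate())

# Bucket by the join key so later joins on customer_id can avoid a full shuffle.
orders = spark.read.parquet("s3a://my-bucket/orders/")
(orders.write
    .bucketBy(16, "customer_id")   # fixed number of buckets
    .sortBy("customer_id")
    .mode("overwrite")
    .saveAsTable("orders_bucketed"))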

Read more ...


11d: 40+ Apache Spark best practices & optimisation interview FAQs – Part 4 Small Files problem

Q38: What is the difference between repartition & coalesce?
A38: Repartition distributes the data equally across machines; it requires reshuffling, and is hence slower than coalesce, but it can improve the overall performance of the job because it prevents data skewing by equally distributing the data across the executors. Repartition can either…
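
A minimal sketch of the two calls; the input path, partition counts, and column name are hypothetical:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-demo").getOrCreate()

df = spark.read.parquet("s3a://my-bucket/events/")  # hypothetical input

evened = df.repartition(200, "event_date")  # full shuffle; spreads data evenly
fewer = df.coalesce(10)                     # merges existing partitions; avoids a full shuffle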

Read more ...


12: Apache Spark getting started interview questions and answers

Q01. Where is Apache Spark used in the Hadoop ecosystem?
A01. Spark is essentially a data processing framework that is faster & more flexible than “Map Reduce”. Spark itself has grown into an ecosystem with Spark SQL, Spark Streaming, Spark UI, GraphX, MLlib, and SparkR. Apache Spark…

Read more ...


14: Q105 – Q108 Spark “map” vs “flatMap” interview questions & answers

Q105. What is the difference between “map” and “flatMap” operations in Spark?
A105. map and flatMap are transformation operations in Spark. The map transformation is applied to each element of an RDD and returns the result as a new RDD. map takes N elements as input and returns N elements…
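
A minimal sketch of the N-in/N-out vs N-in/M-out difference, using a tiny made-up RDD:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("map-vs-flatmap").getOrCreate()
sc = spark.sparkContext

lines = sc.parallelize(["a b", "c"])
mapped = lines.map(lambda line: line.split())    # 2 in -> 2 out: [['a','b'], ['c']]
flat = lines.flatMap(lambda line: line.split())  # 2 in -> 3 out: ['a', 'b', 'c']
print(mapped.collect(), flat.collect())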

Read more ...


15: Q109 – Q113 Spark RDD partitioning and “mapPartitions” interview questions & answers

Q109. What is the difference between “map” and “mapPartitions” transformations in Spark?
A109. The map method converts each element of the source RDD into a single element of the result RDD by applying a function. The mapPartitions method converts each partition of the source RDD into multiple elements of the…
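
A minimal sketch of the per-element vs per-partition difference, with a tiny made-up RDD split into two partitions:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mappartitions-demo").getOrCreate()
sc = spark.sparkContext

nums = sc.parallelize([1, 2, 3, 4], numSlices=2)

doubled = nums.map(lambda x: x * 2)  # the function is called once per element

def double_partition(iterator):
    # Called once per partition with an iterator over that whole partition,
    # so per-partition setup (e.g. a DB connection) would happen only twice here.
    return (x * 2 for x in iterator)

doubled2 = nums.mapPartitions(double_partition)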

Read more ...


17: Spark interview Q&As with coding examples in PySpark (i.e. Python)

Q01. How will you create a Spark context? A01.
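The excerpt omits the answer code; a minimal sketch using the standard SparkSession entry point, with a hypothetical app name:

from pyspark.sql import SparkSession

# Modern entry point: SparkSession wraps the SparkContext.
spark = (SparkSession.builder
         .appName("my-app")
         .master("local[*]")
         .getOrCreate())
sc = spark.sparkContext

# Legacy style:
# from pyspark import SparkContext, SparkConf
# sc = SparkContext(conf=SparkConf().setAppName("my-app").setMaster("local[*]"))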

Q02. How will you create a Dataframe by reading a file from AWS S3 bucket? A02.
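Again the answer code is not in the excerpt; a minimal sketch with a hypothetical bucket & key, assuming the cluster already has S3 credentials and the hadoop-aws connector configured:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3-read").getOrCreate()

# Hypothetical bucket/key for illustration.
df = spark.read.csv("s3a://my-bucket/data/input.csv", header=True, inferSchema=True)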

Q03. How will you create a Dataframe by reading a table in a database? A03.
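A minimal sketch via the JDBC data source; the connection details are hypothetical, and the matching JDBC driver jar must be on the classpath:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jdbc-read").getOrCreate()

df = (spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://dbhost:5432/sales")  # hypothetical DB
      .option("dbtable", "public.orders")
      .option("user", "report_user")
      .option("password", "secret")
      .load())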

Read more ...


