12 Apache Spark getting started interview Q&As

Q01. Where is Apache Spark used in the Hadoop ecosystem?
A01. Spark is essentially a data processing framework that is faster & more flexible than MapReduce. Spark itself has grown into an ecosystem with Spark SQL, Spark Streaming, the Spark UI, GraphX, MLlib, and SparkR. Apache Spark can run on Hadoop clusters, as a standalone system, or in the cloud. Spark can be used for fast processing (e.g. transforming sequence files into Avro or Parquet file formats, reading from HBase, Hive, Cassandra, HDFS, etc.), for sophisticated analytics (e.g. machine learning & graph algorithms), and for near real time (i.e. NRT) streaming of “Discretized Streams” or DStreams, which are defined as sequences of RDDs.

Spark Architecture

Q02. Why is Apache Spark favoured over MapReduce as an open source big data processing framework?
A02. Spark gives you a comprehensive and unified framework to manage big data processing requirements with near real time (i.e. NRT) latency.

1) In MapReduce, the output data of each step has to be stored in the distributed file system before the next step can begin. This approach can be very slow for iterative tasks due to data replication across nodes & storage I/O operations. Spark allows the steps to run:

a) completely in memory for performance,
b) by writing everything to disk to handle large data sets and
c) by writing partially to disk & partially processing from the memory to get the best of both performance & ability to work with large data sets.

You have to look at your data and use cases to assess the memory requirements.

2) In MapReduce, if you want to perform complex processing, you need to string together a series of MapReduce jobs, and execute them in sequence.

1) read data from HDFS -> 2) apply map and reduce -> 3) write data back to HDFS -> 1) read data from HDFS -> 2) apply map and reduce -> 3) write data back to HDFS, and so on…

MapReduce Jobs

Spark allows you to develop complex multi-step pipelines using the DAG (i.e. Directed Acyclic Graph) pattern so that different jobs can work with the same data. This not only makes development easier, but also lets Spark perform better even if you write everything to disk instead of processing from memory.

RDDs in Apache Spark

Favour DataFrames over RDDs: Spark DataFrames are optimised, hence faster than RDDs, especially when working with structured data. Use RDDs when you want to perform low-level transformations on unstructured data. Whilst RDDs are very powerful and have many advantages, it is easy to build inefficient transformations with them. DataFrames provide a higher level of abstraction to query and manipulate structured data, and Spark figures out the most efficient way to do what you want by converting your logical plan into a physical plan.

Even though DataFrames & Datasets are favoured over RDDs, it is good to understand RDDs as they are the main abstraction that Spark provides.

Q03. What is an RDD in Spark?
A03. RDD stands for Resilient Distributed Dataset, which is a fault-tolerant collection of elements that can be operated on in parallel. The data is immutable & partitioned so that it can be processed in a distributed manner. RDDs can be cached across the computing nodes in a cluster, and they automatically recover from node failures.

RDDs for in-memory computations

For example, JavaSparkContext’s parallelize method copies the elements of a list of integers across the Hadoop cluster (i.e. the data nodes) to form a distributed dataset, which can then be operated on in parallel, e.g. to sum up the elements. You can also create an RDD from any storage source like text files, sequence files, Avro files, etc.
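A minimal sketch of the above, assuming a local master; the list of integers and the file path are illustrative:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import java.util.Arrays;

public class RddBasics {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("rdd-basics").setMaster("local[*]"));

        // parallelize copies the list across the cluster to form a distributed dataset
        JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));

        // the partitions are operated on in parallel, e.g. to sum up the elements
        int sum = numbers.reduce((a, b) -> a + b);
        System.out.println("Sum = " + sum);

        // an RDD can also be created from a storage source (the path is a placeholder)
        JavaRDD<String> lines = sc.textFile("hdfs:///data/sample.txt");
        System.out.println("Line count = " + lines.count());

        sc.close();
    }
}
```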

Q04. What are the different types of RDD operations?
A04. RDDs support two types of operations:

1) Transformations: take place in the executors in a distributed manner across multiple worker nodes, and create a new dataset from an existing one. For example, “.map” in the example below is a transformation that extracts the values (i.e. _2) from key/value pairs.

2) Actions: return a value to the driver program after running a computation on the dataset. For example, “.collect” in the example below is an action that returns the collection of values.

SparkContext with Executors executing tasks

In the example below, the key of the JavaPairRDD is of type IntWritable & the value is of type BytesWritable. Each element x of the JavaPairRDD is a key/value pair (a Tuple2), so x._1 is the key & x._2 is the value.
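A minimal sketch along those lines, assuming a sequence file of <IntWritable, BytesWritable> pairs (the path is a placeholder):

```java
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import java.util.List;

public class TransformationsAndActions {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("transformations-and-actions").setMaster("local[*]"));

        // key is of type IntWritable & value is of type BytesWritable
        JavaPairRDD<IntWritable, BytesWritable> pairs = sc.sequenceFile(
                "hdfs:///data/input.seq", IntWritable.class, BytesWritable.class);

        // transformation: ".map" extracts the value (x._2) from each key/value pair;
        // x is a Tuple2, so x._1() is the key & x._2() is the value when called from Java
        JavaRDD<byte[]> values = pairs.map(x -> x._2().copyBytes());

        // action: ".collect" returns the collection of values to the driver program
        List<byte[]> collected = values.collect();
        System.out.println("Collected " + collected.size() + " values");

        sc.close();
    }
}
```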

Q05. What is an “RDD Lineage”?
A05. Spark does not replicate data in memory, so in the event of data loss the data is rebuilt using the “RDD lineage”: the graph of transformations that was used to build each RDD, from which lost partitions can be recomputed.
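A quick way to inspect an RDD’s lineage is toDebugString(); a minimal sketch (the transformations are illustrative):

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import java.util.Arrays;

public class RddLineage {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("rdd-lineage").setMaster("local[*]"));

        JavaRDD<Integer> evens = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6))
                .map(n -> n * 10)              // transformation 1
                .filter(n -> n % 20 == 0);     // transformation 2

        // prints the chain of parent RDDs Spark would use to recompute lost partitions
        System.out.println(evens.toDebugString());

        sc.close();
    }
}
```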

Q06. How does Spark support development of complex multi-step pipelines?
A06. Spark allows you to develop complex multi-step pipelines using the DAG (i.e. Directed Acyclic Graph) pattern so that different jobs can work with the same data. This not only makes development easier, but also lets Spark perform better even if you write everything to disk instead of processing from memory.

In Spark, a job is associated with a chain of RDD dependencies organised in a directed acyclic graph (DAG), as depicted further down this post.
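A sketch of a small multi-step pipeline (a word count over an illustrative in-memory input, assuming Spark 2.x): every step is lazy, and only the final action triggers execution of the whole DAG of RDD dependencies.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

import java.util.Arrays;

import scala.Tuple2;

public class DagPipeline {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("dag-pipeline").setMaster("local[*]"));

        // multi-step pipeline: flatMap -> mapToPair -> reduceByKey (all lazy)
        JavaPairRDD<String, Integer> wordCounts = sc
                .parallelize(Arrays.asList("a b a", "b c"))
                .flatMap(line -> Arrays.asList(line.split(" ")).iterator())
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey(Integer::sum);

        // the action triggers execution of the whole DAG
        wordCounts.collect().forEach(t -> System.out.println(t._1() + " -> " + t._2()));

        sc.close();
    }
}
```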

Q08. What is a partition in a Spark job?
A08. Partitioning is the process of logically dividing the data into units that can be processed in parallel to speed up processing. For example, an RDD can be created with 2 partitions, as sketched below.
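A minimal sketch showing an RDD explicitly created with 2 partitions:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import java.util.Arrays;

public class RddPartitions {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("rdd-partitions").setMaster("local[*]"));

        // second argument = number of partitions; each partition is processed by a separate task
        JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6), 2);

        System.out.println("Number of partitions = " + numbers.getNumPartitions());

        sc.close();
    }
}
```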

Q09. How are Spark variables shared across nodes?
A09. When a map or reduce operation is executed on a remote node, it works on separate copies of all the variables used within the operation, and any updates to these variables are not propagated back to the driver program. Spark provides 2 approaches to share variables across the nodes in a cluster.

1) Accumulators: Variables that can be used to aggregate values from worker nodes back to the driver program.

2) Broadcast variables: Shared variable to efficiently distribute large read-only values to all the worker nodes.

Accumulators

Counting the number of blank lines in a given text input.
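A minimal sketch counting blank lines with a Spark 2.x LongAccumulator (the input path is a placeholder):

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.util.LongAccumulator;

public class BlankLineCounter {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("blank-line-counter").setMaster("local[*]"));

        // registered with the SparkContext; executors add to it, only the driver reads it
        LongAccumulator blankLines = sc.sc().longAccumulator("blankLines");

        JavaRDD<String> lines = sc.textFile("hdfs:///data/input.txt");

        lines.foreach(line -> {
            if (line.trim().isEmpty()) {
                blankLines.add(1L);   // aggregated from the worker nodes back to the driver
            }
        });

        System.out.println("Blank lines: " + blankLines.value());
        sc.close();
    }
}
```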

Broadcast variables

Broadcast the list of words to ignore to all the nodes in a cluster.
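A minimal sketch; the list of words to ignore and the input path are illustrative:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class BroadcastIgnoreWords {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("broadcast-ignore-words").setMaster("local[*]"));

        // read-only lookup set shipped once to every worker node
        Set<String> ignore = new HashSet<>(Arrays.asList("a", "an", "the", "of"));
        Broadcast<Set<String>> ignoreWords = sc.broadcast(ignore);

        JavaRDD<String> words = sc.textFile("hdfs:///data/input.txt")
                .flatMap(line -> Arrays.asList(line.toLowerCase().split("\\s+")).iterator());

        // executors read the broadcast value rather than a per-task copy of the set
        JavaRDD<String> kept = words.filter(w -> !ignoreWords.value().contains(w));

        System.out.println("Words kept: " + kept.count());
        sc.close();
    }
}
```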

Q10. What is a SparkContext? What is the difference between a SparkContext & a SparkSession?
A10. A “SparkContext” is the main entry point for a Spark job prior to Spark version 2.0. Starting from Apache Spark 2.0, the SparkSession is the new entry point for Spark applications.

A SparkContext requires a “SparkConf” object, which stores configuration parameters like the app name, the number of cores & the memory size of the executors, etc. With the SparkContext API, separate contexts need to be created for SQL, Hive and Streaming.

This is where the SparkSession comes in handy, as it includes the SQL, Hive & Streaming APIs. Once the SparkSession is instantiated, you can configure Spark’s runtime properties.

Spark job execution model

A “SparkContext/SparkSession” represents the connection to a Spark cluster, and can be used to create RDDs, accumulators and broadcast variables on that cluster.

SparkSession/SparkContext to tasks

Create a SparkContext
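A minimal sketch, assuming a local master; the app name and the executor memory setting are illustrative:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class CreateSparkContext {
    public static void main(String[] args) {
        // SparkConf stores configuration parameters like the app name,
        // master URL, number of cores & executor memory
        SparkConf conf = new SparkConf()
                .setAppName("my-spark-app")
                .setMaster("local[*]")                  // e.g. "yarn" on a cluster
                .set("spark.executor.memory", "2g");    // illustrative setting

        JavaSparkContext sc = new JavaSparkContext(conf);
        System.out.println("Spark version: " + sc.version());
        sc.close();
    }
}
```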

Create RDDs (see the examples under Q03, Q04 & Q08 above)

Create a SparkSession
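A minimal sketch, assuming Spark 2.x+; the app name and the runtime property are illustrative:

```java
import org.apache.spark.sql.SparkSession;

public class CreateSparkSession {
    public static void main(String[] args) {
        // single entry point covering the SQL, Hive & Streaming related APIs
        SparkSession spark = SparkSession.builder()
                .appName("my-spark-app")
                .master("local[*]")
                // .enableHiveSupport()   // add when Hive support is needed & on the classpath
                .getOrCreate();

        // runtime properties can be (re)configured after instantiation
        spark.conf().set("spark.sql.shuffle.partitions", "200");

        // the underlying SparkContext is still accessible if needed
        System.out.println("App id: " + spark.sparkContext().applicationId());

        spark.stop();
    }
}
```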

Shared variables: Accumulators & Broadcast variables (see the examples under Q09 above)

Q11. What is “Spark Streaming”?
A11. Spark is a batch processing platform like Apache Hadoop, and Spark Streaming is a real-time processing tool that runs on top of the Spark engine. Spark Streaming is often compared to Apache Storm, one of the most popular real-time processing platforms for Big Data.

The primitive abstraction in Spark Streaming is still the RDD: a continuous stream of data is represented as a sequence of RDDs known as a “Discretized Stream” or DStream. A “DStream” is created from an input source, such as Apache Kafka, or from the transformation of another DStream.
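A minimal DStream sketch, assuming Spark 2.x and an illustrative socket text source on localhost:9999; each micro-batch becomes one RDD in the DStream:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class DStreamSketch {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("dstream-sketch").setMaster("local[2]");

        // 5-second micro-batches: each batch is one RDD in the DStream
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(5));

        // DStream from an input source (a socket here; Kafka would use a separate connector)
        JavaDStream<String> lines = jssc.socketTextStream("localhost", 9999);

        // a transformation on the DStream is applied to every underlying RDD
        JavaDStream<String> errors = lines.filter(line -> line.contains("ERROR"));
        errors.print();                 // output action on each micro-batch

        jssc.start();                   // start receiving & processing
        jssc.awaitTermination();
    }
}
```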

Q12. What is a Spark Executor?
A12. The “Driver Application” creates tasks & schedules them to run on the “Spark Executors”.

SparkContext with Executors executing tasks

Executors are worker-node processes in charge of running the individual tasks in a given Spark job. They are launched at the beginning of a Spark application and typically run for its entire lifetime. Once they have finished running their tasks, they send the results back to the “Driver Application”. “Spark Executors” also provide in-memory storage for RDDs that are cached.
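Executor resources are typically set through configuration when the application is submitted; a minimal sketch with illustrative values (the input path is a placeholder), where cache() shows executors keeping RDD partitions in memory between actions:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class ExecutorSettings {
    public static void main(String[] args) {
        // illustrative sizing; the master URL is normally supplied via spark-submit
        SparkConf conf = new SparkConf()
                .setAppName("executor-settings")
                .set("spark.executor.instances", "4")
                .set("spark.executor.cores", "2")
                .set("spark.executor.memory", "4g");

        JavaSparkContext sc = new JavaSparkContext(conf);

        // cached partitions are held in the executors' memory
        JavaRDD<String> lines = sc.textFile("hdfs:///data/sample.txt").cache();
        System.out.println(lines.count());   // first action materialises & caches the RDD
        System.out.println(lines.count());   // second action reads from executor memory

        sc.close();
    }
}
```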

Bonus question – pagination using Spark

Q. How do you implement pagination in Spark?
A. Spark SQL pagination can be done via:

1. OFFSET & LIMIT

Applying pagination using OFFSET and LIMIT ordered by some field is the easiest way to paginate, but in a distributed system it can cause performance issues when you have, say, 500 or 1000 pages to paginate through. If you have 100 line items per page, then to get the 101st page you need to look for records 10,001 to 10,100. On a 10-node cluster, each node has to produce its top 10,100 records, so the driver has to sort through 10,100 * 10 = 101,000 records, from which it selects 100 records and discards the first 10,000. Hence, the offset & limit based approach can be expensive.
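A hedged sketch of offset/limit-style pagination; the `orders` table and its `id` ordering column are assumptions. Since the SQL OFFSET clause is only available in newer Spark releases, this emulates it with a row_number() window function, which still has to rank every preceding row and is therefore expensive for deep pages:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class OffsetLimitPagination {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("offset-limit-pagination")
                .master("local[*]")
                .getOrCreate();

        int pageSize = 100;
        int pageNumber = 101;                                // 1-based page number
        long offset = (long) (pageNumber - 1) * pageSize;    // 10,000

        // every row up to the requested page is still ranked, which is the costly part
        Dataset<Row> page = spark.sql(
                "SELECT * FROM ("
              + "  SELECT o.*, row_number() OVER (ORDER BY id) AS rn FROM orders o"
              + ") t WHERE rn > " + offset + " AND rn <= " + (offset + pageSize));

        page.show();
        spark.stop();
    }
}
```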

2. Unique & Sequential ID generation

This approach requires a unique & sequential id in the tables you are querying. This way you can fetch a page with a simple range predicate on the id (e.g. WHERE id BETWEEN 10001 AND 10100) instead of sorting and then discarding all the preceding rows.
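A hedged sketch of the id-range approach; the `orders` table and its unique, sequential `id` column are assumptions:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

import static org.apache.spark.sql.functions.col;

public class IdRangePagination {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("id-range-pagination")
                .master("local[*]")
                .getOrCreate();

        int pageSize = 100;
        int pageNumber = 101;                                    // 1-based page number
        long start = (long) (pageNumber - 1) * pageSize + 1;     // 10,001
        long end = start + pageSize - 1;                         // 10,100

        // a range predicate on a unique, sequential id avoids sorting & discarding
        // all of the preceding rows, and can be pushed down to the data source
        Dataset<Row> page = spark.table("orders")
                .filter(col("id").between(start, end))
                .orderBy(col("id"));

        page.show();
        spark.stop();
    }
}
```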

3. Streaming

Using streaming you can push the data to the client as soon as it becomes available, and the client can start processing it straight away. This solution also reduces memory usage.
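One way to stream results to the caller incrementally is Dataset.toLocalIterator(), which brings partitions back to the driver one at a time rather than collecting everything; a hedged sketch (the `orders` table is an assumption):

```java
import java.util.Iterator;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class StreamedResults {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("streamed-results")
                .master("local[*]")
                .getOrCreate();

        Dataset<Row> orders = spark.table("orders").orderBy("id");

        // partitions are shipped to the driver lazily, so the client can start consuming
        // (e.g. rendering pages) before the full result set has been produced
        Iterator<Row> rows = orders.toLocalIterator();
        while (rows.hasNext()) {
            System.out.println(rows.next());   // push the row to the client here
        }

        spark.stop();
    }
}
```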

