Q01. Can you summarise the Spark ecosystem?
A01. Apache Spark is a general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python, and R. It has six components: Core, Spark SQL, Spark Streaming,
…
Q1. Why are data cleansing & pre-processing important in analytics & machine learning?
A1. Garbage in gets you garbage out, no matter how good your machine learning algorithm is.
Q2. What are the general steps of cleansing data?
A2.
…
Q01. Where is Apache Spark used in the Hadoop ecosystem?
A01. Spark is essentially a data processing framework that is faster & more flexible than MapReduce. Spark itself has grown into an ecosystem with Spark SQL, Spark Streaming, Spark UI, GraphX,
…
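To illustrate the flexibility point above, here is a minimal word-count sketch in Scala (not from the original article; the input path is hypothetical). The equivalent job in classic MapReduce would need separate mapper and reducer classes.

import org.apache.spark.sql.SparkSession

// Word count in a few lines of Spark (hypothetical input path).
val spark = SparkSession.builder().appName("word-count").master("local[*]").getOrCreate()
val sc = spark.sparkContext

val counts = sc.textFile("hdfs:///data/input.txt")   // hypothetical path
  .flatMap(_.split("\\s+"))                          // split each line into words
  .map(word => (word, 1))                            // pair each word with a count of 1
  .reduceByKey(_ + _)                                 // sum the counts per word

counts.take(10).foreach(println)
spark.stop()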
Q105. What is the difference between “map” and “flatMap” operations in Spark?
A105. Both map and flatMap are transformation operations in Spark. The map transformation is applied to each element of an RDD and returns the result as a new RDD. …
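A minimal sketch of the difference (the sample data is illustrative, not from the original answer):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("map-vs-flatMap").master("local[*]").getOrCreate()
val sc = spark.sparkContext

val lines = sc.parallelize(Seq("a b", "c d e"))

// map: exactly one output element per input element -> RDD[Array[String]]
val mapped = lines.map(_.split(" "))
println(mapped.count())      // 2 (one array per input line)

// flatMap: each input element may produce zero or more output elements -> RDD[String]
val words = lines.flatMap(_.split(" "))
println(words.count())       // 5 (individual words)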
Q109. What is the difference between “map” and “mapPartitions” transformations in Spark?
A109. The method map converts each element of the source RDD into a single element of the result RDD by applying a function. The method mapPartitions converts each partition of the source RDD into multiple elements of the...
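A brief sketch contrasting the two (illustrative only); mapPartitions is typically used when there is expensive per-partition setup, such as opening a connection once per partition:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("map-vs-mapPartitions").master("local[*]").getOrCreate()
val nums = spark.sparkContext.parallelize(1 to 10, numSlices = 2)

// map: the function is called once per element.
val squared = nums.map(x => x * x)

// mapPartitions: the function is called once per partition and receives an
// iterator over that partition's elements, so per-partition setup happens once.
val squaredByPartition = nums.mapPartitions { iter =>
  // (hypothetical) expensive setup, e.g. creating a client, would go here
  iter.map(x => x * x)
}

println(squaredByPartition.collect().mkString(", "))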
Q01. How will you create a Spark context?
A01.
Q02. How will you create a DataFrame by reading a file from an AWS S3 bucket?
A02.
Q03. How will you create a DataFrame by reading a table in a database? …
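The answers above are truncated in this excerpt. The following is a minimal sketch of one common way to do each, with hypothetical bucket, URL, and table names; reading from S3 additionally requires the hadoop-aws package and credentials, and the JDBC read requires the driver jar on the classpath:

import org.apache.spark.sql.SparkSession

// In modern Spark, a SparkSession wraps the SparkContext.
val spark = SparkSession.builder().appName("read-examples").getOrCreate()
val sc = spark.sparkContext

// DataFrame from a CSV file in S3 (hypothetical bucket and path).
val s3Df = spark.read
  .option("header", "true")
  .csv("s3a://my-bucket/path/to/file.csv")

// DataFrame from a database table over JDBC (hypothetical connection details).
val jdbcDf = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://dbhost:5432/mydb")
  .option("dbtable", "public.customers")
  .option("user", "app_user")
  .option("password", "secret")
  .load()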
Q38. What causes small files to be created?
A38. Files that are only a few KBs or MBs in size are considered small files; ideally, anything smaller than the block size (e.g. 128 MB) is a small file. When Spark writes data to distributed storage systems like HDFS, …
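The rest of that answer is truncated here. One common mitigation (a general practice, not necessarily the article's specific answer) is to control the number of output files at write time, for example by coalescing or repartitioning before the write; the paths below are hypothetical:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("small-files").getOrCreate()

// Each write task produces at least one file per partition, so hundreds of
// tiny partitions at write time means hundreds of small files.
val df = spark.read.parquet("s3a://my-bucket/input/")   // hypothetical path

df.coalesce(8)                  // cap the number of output files at 8
  .write
  .mode("overwrite")
  .parquet("s3a://my-bucket/output/")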
There are many ways to solve a given big data problem in Spark, but some approaches can lead to performance and memory issues. Here are some best practices to keep in mind when writing Spark jobs.
#1 Favor DataFrames (see the sketch after this list item),
…
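The list above is truncated in this excerpt. As a sketch of why DataFrames are usually preferred (illustrative data and names, not from the original list), the same aggregation is shown below with the DataFrame API, which Spark's Catalyst optimiser can analyse, and with raw RDD lambdas, which it cannot:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("favor-dataframes").master("local[*]").getOrCreate()
import spark.implicits._

val sales = Seq(("books", 10.0), ("books", 5.0), ("toys", 7.5)).toDF("category", "amount")

// DataFrame API: declarative, so Spark can optimise the plan.
val totalsDf = sales.groupBy("category").agg(sum("amount").as("total"))

// Equivalent RDD code: opaque functions that Spark executes as-is.
val totalsRdd = sales.rdd
  .map(row => (row.getString(0), row.getDouble(1)))
  .reduceByKey(_ + _)

totalsDf.show()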
This extends 40+ Apache Spark best practices & optimisation interview FAQs – Part-1, where best practices 1-10 were covered with examples & diagrams. #11 Use the Spark UI: running Spark jobs without inspecting the Spark UI is a definite no. …
This extends 40+ Apache Spark best practices & optimisation interview FAQs – Part-2 (Spark UI). #31 Bucketing is another data optimisation technique that groups data with the same bucket value across a fixed number of “buckets”. Bucketing improves performance in wide transformations and joins by minimising or avoiding data “shuffles”.
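A minimal sketch of bucketing with the DataFrameWriter API (the table, column, and bucket count are illustrative; bucketBy requires writing with saveAsTable):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("bucketing").master("local[*]").getOrCreate()
import spark.implicits._

val orders = Seq((1, 100.0), (2, 50.0), (1, 25.0)).toDF("customer_id", "amount")

// Hash-partition rows into 16 buckets on customer_id and persist as a table.
orders.write
  .bucketBy(16, "customer_id")
  .sortBy("customer_id")
  .mode("overwrite")
  .saveAsTable("orders_bucketed")

// A later join on customer_id between two tables bucketed the same way
// can avoid shuffling the bucketed side(s).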