Q01. Can you summarise the Spark ecosystem?
A01. Apache Spark is a general-purpose cluster computing system. It provides high-level APIs in Java,
…
Q1. Why are data cleansing & pre-processing important in analytics & machine learning? A1. Garbage in, garbage out, no matter how good your machine learning algorithm is. …
Q01. Where is Apache Spark used in the Hadoop ecosystem?
A01. Spark is essentially a data processing framework that is faster & more flexible than MapReduce.
…
Q105. What is the difference between “map” and “flatMap” operations in Spark? A105. Both map and flatMap are transformation operations in Spark. …
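The distinction can be sketched without a cluster. The following is a plain-Python model of the semantics, not Spark's API: the helper names `rdd_map` and `rdd_flat_map` are hypothetical and stand in for the RDD transformations, assuming an RDD can be pictured as a simple list.

```python
from itertools import chain


def rdd_map(func, elements):
    """Model of map: exactly one output element per input element."""
    return [func(x) for x in elements]


def rdd_flat_map(func, elements):
    """Model of flatMap: func returns an iterable per element,
    and all of those iterables are flattened into one result."""
    return list(chain.from_iterable(func(x) for x in elements))


lines = ["hello spark", "map vs flatMap"]

# map keeps one (nested) result per input line
mapped = rdd_map(lambda line: line.split(" "), lines)

# flatMap flattens the per-line word lists into a single list of words
flat_mapped = rdd_flat_map(lambda line: line.split(" "), lines)

print(mapped)
print(flat_mapped)
```

In this model, map always returns as many elements as it was given, whereas flatMap may return fewer or more, depending on how many items the function yields for each input.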
Q109. What is the difference between “map” and “mapPartitions” transformations in Spark? A109. The method map converts each element of the source RDD into a single element of the result...
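A rough way to see the difference is to count how often a costly setup step runs. The sketch below is plain Python that models an RDD as a list of partitions; it is not Spark code, and `rdd_map`, `rdd_map_partitions` and `expensive_setup` are hypothetical names used only for illustration.

```python
def rdd_map(func, partitions):
    """Model of map: func is called once for every element."""
    return [[func(x) for x in part] for part in partitions]


def rdd_map_partitions(func, partitions):
    """Model of mapPartitions: func is called once per partition
    and receives an iterator over that partition's elements."""
    return [list(func(iter(part))) for part in partitions]


setup_calls = {"per_element": 0, "per_partition": 0}


def expensive_setup(counter_key):
    """Stand-in for costly initialisation, e.g. opening a connection."""
    setup_calls[counter_key] += 1
    return lambda x: x * 10


def per_element(x):
    convert = expensive_setup("per_element")  # runs for every element
    return convert(x)


def per_partition(iterator):
    convert = expensive_setup("per_partition")  # runs once per partition
    for x in iterator:
        yield convert(x)


partitions = [[1, 2, 3], [4, 5]]
mapped = rdd_map(per_element, partitions)
mapped_parts = rdd_map_partitions(per_partition, partitions)

print(mapped, mapped_parts, setup_calls)
```

Both calls produce the same values here, but the setup step runs once per element in the map version and only once per partition in the mapPartitions version, which is the usual reason to prefer the latter when initialisation is expensive.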
Q01. How will you create a Spark context? A01. Q02. How will you create a DataFrame by reading a file from an AWS S3 bucket? …
There are many different ways to solve a given big data problem in Spark, but some approaches can degrade performance and cause memory issues.
…
This extends 40+ Apache Spark best practices & optimisation interview FAQs – Part-1, where best practices 1-6 were covered with examples & diagrams. #11 Use Spark UI: Running Spark jobs...
#31 Bucketing is another data optimisation technique that distributes data across a fixed number of “buckets”, so that rows with the same bucket value land in the same bucket. Bucketing improves performance in wide transformations and joins by...
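The idea behind bucketing can be sketched in plain Python. This is a simplified model, not Spark's implementation: `zlib.crc32` is used here only as a deterministic stand-in for Spark's own hash function, and `NUM_BUCKETS`, `bucket_id`, `bucketize` and the sample rows are all hypothetical.

```python
import zlib
from collections import defaultdict

NUM_BUCKETS = 4  # assumed fixed number of buckets


def bucket_id(key, num_buckets=NUM_BUCKETS):
    """Assign a key to a bucket: hash of the key modulo the bucket count.
    crc32 is an assumption made for determinism in this sketch."""
    return zlib.crc32(str(key).encode("utf-8")) % num_buckets


def bucketize(rows, key_index=0):
    """Group rows so that equal keys always land in the same bucket."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[bucket_id(row[key_index])].append(row)
    return buckets


orders = [(101, "book"), (102, "pen"), (101, "lamp"), (103, "desk")]
customers = [(101, "Ann"), (102, "Bob"), (103, "Cy")]

order_buckets = bucketize(orders)
customer_buckets = bucketize(customers)

# Join bucket by bucket: matching keys are guaranteed to share a bucket id,
# so each bucket pair can be joined without looking at any other bucket.
joined = []
for b in range(NUM_BUCKETS):
    names = dict(customer_buckets.get(b, []))
    for customer_id, item in order_buckets.get(b, []):
        if customer_id in names:
            joined.append((customer_id, names[customer_id], item))

print(sorted(joined))
```

Because both datasets were bucketed on the same key with the same bucket count, the join only ever compares rows within the same bucket number, which illustrates why pre-bucketed tables can avoid moving data between buckets at join time.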
Q116. What is “Spark streaming” in the Spark ecosystem with Spark core, Spark SQL, Spark MLlib, Spark GraphX, etc? A116. Spark is a distributed and scalable batch processing framework that...