01: Q01 – Q07 General Big Data, Data Science & Data Analytics Interview Q&As

Q01. How is Big Data used in industries?
A01. The main goal for most organisations is to enhance customer experience, and consequently increase sales. Other goals include cost reduction, better-targeted marketing, fraud detection, identifying data breaches to enhance security, making existing processes more efficient, and healthcare uses ranging from medical records to drug discovery and genetic disease exploration. The list goes on.

Q02. What do you understand by the terms personalization, next best offer, next best action, and recommendation engines?
A02. Big data processing and machine learning techniques can be used for customer personalization. Historical data is gathered from all users and fed into a big data framework to generate statistical models that predict the probability that a user will find a product, service, document, web page, destination, etc. useful.

The historical data includes current geographical location, home location, age, gender, activities on the web site (pages viewed, items viewed, etc), purchase activities (e.g. items rated, items in the shopping cart, signing up for loyalty programs, use of discount coupons, etc), activities on social media, etc. This adds up to a lot of data.

Once valid statistical models for personalization are available, they can be used in real time to personalize the results for individual users.

Next-best offer refers to the use of predictive analytics solutions to identify the products or services your customers are most likely to be interested in for their next purchase.
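
As a rough illustration of a next-best offer, the sketch below scores each product with a small logistic model and offers the one with the highest predicted probability. The product names, feature names and weights are all hypothetical; a real model would learn its weights from the historical data described above.

```python
import math

# Hypothetical per-product weights (assumed for illustration only);
# a real system would learn these from historical user data.
WEIGHTS = {
    "travel_insurance": {"bias": -2.0, "viewed_flights": 1.5, "loyalty_member": 0.5},
    "credit_card": {"bias": -1.0, "viewed_flights": 0.2, "loyalty_member": 1.0},
}


def probability(product, features):
    """Logistic model: estimated probability that the user finds the product useful."""
    w = WEIGHTS[product]
    z = w["bias"] + sum(w.get(name, 0.0) * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


def next_best_offer(features):
    """Return the product with the highest predicted probability for this user."""
    return max(WEIGHTS, key=lambda product: probability(product, features))


user = {"viewed_flights": 1.0, "loyalty_member": 0.0}
print(next_best_offer(user))  # prints "travel_insurance"
```

The same scoring function could rank pages, documents or destinations; only the features and the learned weights would change.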

Recommendation engines are used to create associations between product A and products X, Y and Z for cross-sell and up-sell opportunities. For example, Amazon.com says people who bought Book A also bought Books Y & Z. Products are recommended based on actual customer browsing and buying behavior, rather than a gut feeling.

The algorithm used by Amazon.com is based on a user’s purchase history, the items they have in their shopping cart, items they have rated or liked in the past, and what other customers have viewed or purchased recently. Over 30% of all Amazon sales are reportedly generated by the recommendation engine.
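
The “people who bought Book A also bought Books Y & Z” idea can be sketched with simple co-purchase counting. This is a minimal illustration on made-up order data, not Amazon’s actual algorithm; the `orders` list and the `also_bought` helper are assumptions.

```python
from collections import Counter, defaultdict
from itertools import permutations

# Made-up order history: each set is one customer's purchases.
orders = [
    {"Book A", "Book Y", "Book Z"},
    {"Book A", "Book Y"},
    {"Book A", "Book Z"},
    {"Book B", "Book Y"},
]

# co_bought[x][y] = number of customers who bought both x and y.
co_bought = defaultdict(Counter)
for basket in orders:
    for x, y in permutations(basket, 2):
        co_bought[x][y] += 1


def also_bought(item, top_n=2):
    """Items most often bought together with `item` (ties broken alphabetically)."""
    ranked = sorted(co_bought[item].items(), key=lambda kv: (-kv[1], kv[0]))
    return [other for other, _ in ranked[:top_n]]


print(also_bought("Book A"))  # prints ['Book Y', 'Book Z']
```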

Q03. What is web click data (aka Clickstream analytics)?
A03. Clickstream data is nothing new. It has been recorded and analyzed for years to track and understand an individual’s online behavior. Clickstream analysis is the process of collecting, analyzing and reporting aggregate data about which pages a website visitor visits and in what order. The path a visitor takes through a website is called the clickstream. There are two levels of clickstream analysis: traffic analytics and e-commerce analytics.

Traffic analytics works at the server level: how many pages were served to the user, how long each page took to load, how often the user hit the browser’s back button, etc.

E-commerce analytics focuses on which pages a shopper browses, what is added to or removed from the shopping cart, which items the shopper purchases, whether or not the shopper belongs to a loyalty program, and the shopper’s use of coupon codes and preferred method of payment.
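
To make the term “clickstream” concrete, here is a minimal sketch that rebuilds each visitor’s path from raw page-view events and also produces an aggregate page-view count. The event tuples and field layout are hypothetical; real logs carry many more fields.

```python
from collections import Counter, defaultdict

# Hypothetical raw events, not necessarily in time order:
# (visitor_id, timestamp_seconds, page)
events = [
    ("v1", 30, "/cart"),
    ("v1", 10, "/home"),
    ("v2", 12, "/home"),
    ("v1", 20, "/product/42"),
    ("v2", 25, "/search"),
]


def clickstreams(events):
    """Group events by visitor and order each visitor's pages by timestamp."""
    by_visitor = defaultdict(list)
    for visitor, ts, page in events:
        by_visitor[visitor].append((ts, page))
    return {v: [page for _, page in sorted(hits)] for v, hits in by_visitor.items()}


paths = clickstreams(events)
page_views = Counter(page for _, _, page in events)

print(paths["v1"])          # prints ['/home', '/product/42', '/cart']
print(page_views["/home"])  # prints 2
```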

Clickstream analysis gathers extremely large volumes of data, hence e-businesses rely on big data analytics technologies such as:

1) Distributed data storage, e.g. HDFS (i.e. Hadoop Distributed File System), Amazon S3 (Simple Storage Service), Azure Blob Storage, etc. These are distributed file systems and object stores that can scale out easily, e.g. by adding more nodes.

2) Binary data formats, e.g. Avro, Parquet, ORC, etc. Columnar data formats like Parquet are widely used for read performance.

3) NoSQL (i.e. Not only SQL) databases such as HBase, Cassandra, Amazon DynamoDB, etc. are used to store large volumes of data.

4) Graph databases like Neo4j, Amazon Neptune, etc. Unlike other databases, relationships take first priority in graph databases. This means your application doesn’t have to infer data connections using foreign keys or MapReduce processing. This allows the engine to follow a connection between two nodes in roughly constant time, as opposed to the growing slowdown of SQL queries with many joins in a relational database. For highly connected data, the data model of a graph database is often simpler and more expressive than those of relational or other NoSQL databases.

5) Parallel processing systems like Apache Spark, Hive, Apache Tez, etc for batch processing.

6) Massively Parallel Processing (i.e. MPP) systems like Apache Impala, Presto, and Amazon Redshift.

7) Streaming data processing in real time (e.g. Apache Storm, Apache Kafka, Amazon Kinesis, etc) or near real time (e.g. Spark Streaming).

8) Big Data analytics tools & dashboards both open-source (e.g. Apache Zeppelin) & commercial (e.g. SAS VA (i.e. Visual Analytics)).
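
To illustrate why columnar formats (item 2 above) help read performance, the sketch below contrasts a row-oriented layout with a column-oriented one in plain Python. It only mimics the idea; it does not use the actual Avro, Parquet or ORC libraries, and the sample records are made up.

```python
# Row-oriented layout: all fields of a record are stored together
# (similar in spirit to a row format such as Avro).
rows = [
    {"visitor": "v1", "page": "/home", "load_ms": 120},
    {"visitor": "v2", "page": "/home", "load_ms": 95},
    {"visitor": "v1", "page": "/cart", "load_ms": 210},
]


def to_columns(rows):
    """Pivot records into one list per field (similar in spirit to Parquet or ORC)."""
    return {field: [row[field] for row in rows] for field in rows[0]}


columns = to_columns(rows)

# An aggregate over one field only needs to read that field's column,
# instead of scanning every field of every record.
average_load_ms = sum(columns["load_ms"]) / len(columns["load_ms"])

print(columns["load_ms"])  # prints [120, 95, 210]
```

Analytics queries usually touch a few columns across many rows, which is the access pattern a columnar layout favours.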

Q04. How do IP Intelligence & Geolocation assist analytics?
A04. IP Intelligence and Geolocation technologies complement analytics by providing the tools to further segment and gain deeper insight into customer behavior.

1) You can benchmark your marketing campaign performance by segments (e.g. local, regional, national, international, etc).

2) You can refine the campaign and website performance with better insight into customer behavior.

3) You can use real-time targeting based on user location and other IP data elements to offer discounts & loyalty programs to inspire actions.
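
A minimal sketch of segmenting campaign clicks by location is shown below. The `REGIONS` lookup table is a stand-in for a real IP geolocation database, and the networks are reserved documentation ranges, so every mapping here is an assumption made for illustration.

```python
import ipaddress
from collections import Counter

# Hypothetical lookup table; real deployments use an IP geolocation database.
# These networks are reserved documentation ranges, not real customer addresses.
REGIONS = {
    ipaddress.ip_network("192.0.2.0/24"): "local",
    ipaddress.ip_network("198.51.100.0/24"): "national",
    ipaddress.ip_network("203.0.113.0/24"): "international",
}


def segment(ip):
    """Map an IP address string to a campaign segment, or 'unknown'."""
    address = ipaddress.ip_address(ip)
    for network, region in REGIONS.items():
        if address in network:
            return region
    return "unknown"


clicks = ["192.0.2.10", "192.0.2.77", "203.0.113.5", "10.0.0.1"]
clicks_by_segment = Counter(segment(ip) for ip in clicks)

print(clicks_by_segment["local"])  # prints 2
```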

Q05. What is data science?
A05. It is a field that uses scientific methods, processes, algorithms and systems to extract knowledge or insights from data in structured (e.g. relational data, key/value pairs, graph data, etc), semi-structured (e.g. logs) and unstructured (e.g. PDFs, images, Word documents) forms.

Q05. What algorithms are used by the recommendation engines (aka recommender models)?
A05. Content Based Filtering (i.e. CBF) and Collaborative filtering (i.e. CF) are the most commonly used recommendation algorithms.

CBF relies on similarities between items. For example, if you intend to buy a book on “Java Data structures & algorithms” then a similar book could be “Java coding exercises”. Similarly, Netflix could determine the similarity between two movies in terms of their genre and other features. You allocate match scores to these features to come up with similarities.
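
One common way to turn feature match scores into a similarity is cosine similarity between feature vectors. The sketch below assumes made-up binary features for three books; the third title and all feature values are hypothetical.

```python
import math

# Hypothetical binary features per item, in the order:
# java, algorithms, exercises, cooking
items = {
    "Java Data structures & algorithms": [1, 1, 0, 0],
    "Java coding exercises": [1, 1, 1, 0],
    "Italian cooking basics": [0, 0, 0, 1],
}


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def similar_items(title):
    """Other items ranked from most to least similar to `title`."""
    target = items[title]
    scores = {other: cosine(target, vec) for other, vec in items.items() if other != title}
    return sorted(scores, key=scores.get, reverse=True)


print(similar_items("Java Data structures & algorithms")[0])  # prints "Java coding exercises"
```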

The CF algorithm is not about the qualities of an item, but about making automatic predictions about a user’s interests by compiling preferences from several other users with similar profiles.

The kNN (i.e. k Nearest Neighbour) algorithm finds the nearest items in the whole collection, and can be used in collaborative filtering (i.e. CF). The key idea is “similar preferences”: to recommend something to the user in question, you find people in his/her neighborhood that have similar profiles.

When we want to recommend something to a user, the most logical thing to do is to find people with similar interests, analyze their behavior, and recommend our products or services accordingly. This has 2 key steps:

Step 1: Find the users/items in the database that are most similar to the given user/item. When the similarity is computed from item features, this is the CBF part.

Step 2: Use the ratings from those similar users/items, weighted by how similar each one is, to predict the grade the given user would give to an item they have not rated yet. This is the CF part.
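
The two steps can be sketched as a small user-based kNN predictor. This is one possible reading of the steps, using cosine similarity over co-rated items and a similarity-weighted average; the `ratings` data and function names are made up for illustration.

```python
import math

# Made-up ratings on a 1-5 scale: user -> {item: rating}.
ratings = {
    "alice": {"A": 5, "B": 4, "C": 1},
    "bob": {"A": 5, "B": 5, "C": 1, "D": 5},
    "carol": {"A": 1, "B": 2, "C": 5, "D": 1},
    "dave": {"A": 4, "B": 4, "D": 4},
}


def similarity(u, v):
    """Cosine similarity over the items both users have rated."""
    shared = ratings[u].keys() & ratings[v].keys()
    if not shared:
        return 0.0
    dot = sum(ratings[u][i] * ratings[v][i] for i in shared)
    norm_u = math.sqrt(sum(ratings[u][i] ** 2 for i in shared))
    norm_v = math.sqrt(sum(ratings[v][i] ** 2 for i in shared))
    return dot / (norm_u * norm_v)


def predict(user, item, k=2):
    """Predict the rating `user` would give `item`, or None if no neighbour rated it."""
    # Step 1: find the k most similar users who have rated the item.
    neighbours = sorted(
        ((similarity(user, other), other)
         for other in ratings
         if other != user and item in ratings[other]),
        reverse=True,
    )[:k]
    # Step 2: similarity-weighted average of the neighbours' ratings.
    total = sum(sim for sim, _ in neighbours)
    if total == 0:
        return None
    return sum(sim * ratings[other][item] for sim, other in neighbours) / total


prediction = predict("alice", "D")
```

Here alice’s two nearest neighbours are bob and dave, who rated item D as 5 and 4, so the prediction lands between those two values; carol’s very different tastes are ignored because she is not among the k nearest.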

Q06. What technologies/tools can you use for data analytics?
A06. There is a myriad of tools to statistically analyze data that is stored & organized on Big Data storage systems like HDFS, Amazon S3, etc.

1) Programming languages like R, Python, Java, Scala, etc, and frameworks/tools that use these languages like Spark MLlib, TensorFlow, etc. The Spark MLlib machine learning library provides a Collaborative Filtering implementation that uses “Alternating Least Squares” (i.e. ALS). TensorRec is a recommendation engine framework in TensorFlow.

2) Data mining software suites like SAS VA (i.e. Visual Analytics), IBM’s SPSS Modeler and SPSS Analytics for advanced data analytics, etc.

3) SQL (e.g. Spark SQL, Hive, Amazon Redshift, etc), NoSQL (e.g. HBase, MongoDB, Amazon DynamoDB, etc), and Graph (e.g. Neo4j, Amazon Neptune, etc) databases for ad hoc queries.

Q07. Why use a graph database?
A07. A graph database belongs to the NoSQL family. Instead of structuring data in the traditional table and row model, NoSQL allows the database design to be built around the requirements at hand. NoSQL covers key/value, document & graph databases.

Instead of just understanding the value of specific data, you understand the value of the relationships between data. Real-time recommendation engines are key to online success. In order to make real-time recommendations you must correlate product, customer, inventory, supplier, logistics and social sentiment data. For this kind of relationship-heavy query, graph databases typically outperform relational and other NoSQL data stores when connecting large volumes of buyer and product data to gain insights into customer needs and product trends.

Relational databases can model relationships between items, but traversing those relationships can be expensive as you need to write SQL queries that join tables together. The joining process is computationally expensive, and becomes slower as the number of joins increases, which leads to performance & scalability issues.
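
To show how joins pile up, the sketch below answers “customers who bought Book A also bought ...” in a relational model, using Python’s built-in sqlite3 module. The table and column names are assumptions; even this two-hop question needs three joins.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product  (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE purchase (customer_id INTEGER, product_id INTEGER);
    INSERT INTO product  VALUES (10, 'Book A'), (11, 'Book Y'), (12, 'Book Z');
    INSERT INTO purchase VALUES (1, 10), (1, 11), (2, 10), (2, 12);
""")

# product -> purchase -> purchase (same customer) -> product:
# every extra hop in the relationship chain adds more joins.
query = """
    SELECT DISTINCT rec.title
    FROM product  AS src
    JOIN purchase AS p1  ON p1.product_id  = src.id
    JOIN purchase AS p2  ON p2.customer_id = p1.customer_id
    JOIN product  AS rec ON rec.id         = p2.product_id
    WHERE src.title = 'Book A' AND rec.id <> src.id
    ORDER BY rec.title
"""
also_bought = [row[0] for row in conn.execute(query)]

print(also_bought)  # prints ['Book Y', 'Book Z']
```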

JSON document-based NoSQL databases like MongoDB offer only limited built-in functionality to link records or entities. Most logic requiring the traversal of relationships must be provided by the application layer.

So, graph databases are more suited for real-time recommendation engines as the essence of a graph database is that both records and relationships are treated as first-class citizens.

In a graph database, data is not segregated into separate tables, and there is no need for joins. The database contains not only Customer and Product entities (aka “vertices” or “nodes”), but also relationship entities (aka “edges”) that specify how the nodes are linked to each other. If a customer purchases a product, a new edge element will be added to the database, which explicitly specifies which two objects are linked, and what their relationship is.
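
The node and edge model can be sketched with a small in-memory adjacency structure. This is only an illustration of the idea, not how Neo4j or Amazon Neptune store data; the `Graph` class, the `PURCHASED` relationship name and the sample customers are assumptions.

```python
from collections import defaultdict


class Graph:
    """Tiny in-memory property-less graph: nodes are strings, edges are labelled."""

    def __init__(self):
        # node -> list of (relationship, neighbour); each node holds its own
        # edges, so following a relationship needs no join.
        self.edges = defaultdict(list)

    def add_edge(self, source, relationship, target):
        """Record a relationship; stored in both directions so it can be walked either way."""
        self.edges[source].append((relationship, target))
        self.edges[target].append((relationship, source))

    def neighbours(self, node, relationship):
        return [target for rel, target in self.edges[node] if rel == relationship]


def recommend(graph, customer):
    """Products bought by customers who share a purchase with `customer`."""
    owned = set(graph.neighbours(customer, "PURCHASED"))
    suggestions = set()
    for product in owned:
        for other_customer in graph.neighbours(product, "PURCHASED"):
            suggestions.update(graph.neighbours(other_customer, "PURCHASED"))
    return sorted(suggestions - owned)


g = Graph()
# Each purchase adds one explicit edge between a customer node and a product node.
g.add_edge("customer:Ann", "PURCHASED", "product:Book A")
g.add_edge("customer:Ann", "PURCHASED", "product:Book Y")
g.add_edge("customer:Ben", "PURCHASED", "product:Book A")
g.add_edge("customer:Ben", "PURCHASED", "product:Book Z")

print(recommend(g, "customer:Ann"))  # prints ['product:Book Z']
```

The recommendation is a plain walk along edges: customer to products, products to other customers, other customers to their products.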
