Blog Archives

00: 13 Data Warehouse interview Q&As – Fact Vs Dimension, CDC, SCD, etc – part 1

Q1. What is dimensional modelling in a Data Warehouse (i.e. DWH)?
A1. A dimensional model is a data structure technique optimised for Data Warehousing tools (i.e. OLAP products). Dimensional Modelling comprises Fact and Dimension tables.

A “Fact” is a numeric value (i.e.… Read more ...
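To make the Fact vs Dimension distinction concrete, here is a minimal star-schema sketch in PySpark SQL; the table and column names (sales_fact, date_dim, product_dim) are hypothetical and not taken from the post:

```python
# Minimal star-schema sketch; table/column names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("star-schema-sketch").getOrCreate()

# Dimension tables hold descriptive attributes.
spark.sql("""CREATE TABLE IF NOT EXISTS date_dim
             (date_key INT, calendar_date DATE, month INT, year INT)
             USING parquet""")
spark.sql("""CREATE TABLE IF NOT EXISTS product_dim
             (product_key INT, product_name STRING, category STRING)
             USING parquet""")

# The fact table holds numeric measures plus foreign keys to the dimensions.
spark.sql("""CREATE TABLE IF NOT EXISTS sales_fact
             (date_key INT, product_key INT,
              quantity_sold INT, sale_amount DECIMAL(10,2))
             USING parquet""")

# A typical analytical query joins the fact to its dimensions and aggregates.
spark.sql("""
    SELECT d.year, p.category, SUM(f.sale_amount) AS total_sales
    FROM sales_fact f
    JOIN date_dim d ON f.date_key = d.date_key
    JOIN product_dim p ON f.product_key = p.product_key
    GROUP BY d.year, p.category
""").show()
```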


00: Q1 – Q6 Hadoop based Big Data architecture & basics interview Q&As

There are a number of technologies to ingest & run analytical queries over Big Data (i.e. large volumes of data). Big Data is used in Business Intelligence (i.e. BI) reporting, Data Science, Machine Learning, and Artificial Intelligence (i.e. AI). Processing a large volume of data is intensive on disk I/O, CPU, and memory usage.… Read more ...


01: Databricks interview questions & answers – overview

The best way to prepare for the Databricks interview is via the 28 tutorials on getting started with Databricks & PySpark. These tutorials will not only get you started on Databricks, but also help you prepare for job interviews.… Read more ...



01: High level & low level system design considerations for read heavy systems

Q1. What are some of the design considerations for a read heavy system? A1. Before designing any system, one should gather the functional & non-functional requirements. The SLAs (i.e. Service Level Agreements) have to be clearly defined. Rough-cut capacity planning has to be done in terms of how many… Read more ...
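As a hedged illustration of the rough-cut capacity planning mentioned above, here is a back-of-envelope sketch in Python; every figure in it is an assumed, hypothetical number:

```python
# Back-of-envelope capacity planning for a read-heavy system.
# Every figure below is a hypothetical assumption, not a measured value.
daily_active_users = 10_000_000
reads_per_user_per_day = 50
read_write_ratio = 100            # assume ~100 reads for every write
peak_to_average = 3               # assume peak traffic is ~3x the average

seconds_per_day = 24 * 60 * 60
avg_read_qps = daily_active_users * reads_per_user_per_day / seconds_per_day
peak_read_qps = avg_read_qps * peak_to_average
avg_write_qps = avg_read_qps / read_write_ratio

print(f"average read QPS : {avg_read_qps:,.0f}")   # ~5,787
print(f"peak read QPS    : {peak_read_qps:,.0f}")  # ~17,361
print(f"average write QPS: {avg_write_qps:,.0f}")  # ~58
# Numbers like these motivate caching and read replicas for the read path.
```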


01: High level & low level system design considerations for write heavy systems

This extends High level & low level system design considerations for read heavy systems. Q1. What are some of the design considerations for a write heavy system? A1. Before designing any system, one should gather the functional & non-functional requirements. The SLAs (i.e.… Read more ...
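As one illustration of a write-heavy technique (a generic pattern, not necessarily the post's answer), here is a minimal sketch of buffering writes and flushing them in batches; save_batch is a hypothetical stand-in for a datastore's bulk-write API:

```python
# Minimal sketch of a common write-heavy pattern: buffer incoming writes in
# memory and flush them to the datastore in batches, trading a little latency
# for much higher write throughput.

class BatchingWriter:
    def __init__(self, batch_size=500):
        self.batch_size = batch_size
        self.buffer = []

    def write(self, record):
        self.buffer.append(record)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.buffer:
            save_batch(self.buffer)   # hypothetical bulk-write call
            self.buffer = []

def save_batch(records):
    # Stand-in for a real bulk insert (e.g. a JDBC batch or producer batch).
    print(f"persisted {len(records)} records in one round trip")

writer = BatchingWriter(batch_size=3)
for i in range(7):
    writer.write({"id": i})
writer.flush()  # flush the remaining partial batch
```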



01: Lambda, Kappa & Delta Data Architectures Interview Q&As – Overview

Q1. What is the Lambda Architecture? A1. It is a data-processing architecture designed to handle Big Data by using both real-time stream processing (e.g. Spark Streaming, Apache Storm) and batch processing (e.g. Hive, Pig, Spark batch). This means you have to build two separate pipelines.… Read more ...
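A minimal sketch of the two Lambda pipelines in PySpark is shown below; the paths, Kafka broker, and topic names are hypothetical:

```python
# Hedged sketch of the two Lambda-architecture pipelines in PySpark;
# paths, broker and topic names are hypothetical. The streaming part
# needs the spark-sql-kafka package on the classpath.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lambda-sketch").getOrCreate()

# Batch layer: periodically recompute complete views over all history.
batch_df = spark.read.parquet("/data/events/")               # hypothetical
batch_view = batch_df.groupBy("user_id").agg(F.count("*").alias("events"))
batch_view.write.mode("overwrite").parquet("/views/batch/user_counts/")

# Speed layer: a second, separate pipeline for low-latency incremental views.
stream_df = (spark.readStream
             .format("kafka")
             .option("kafka.bootstrap.servers", "broker:9092")  # hypothetical
             .option("subscribe", "events")
             .load())
speed_view = stream_df.groupBy("key").count()
(speed_view.writeStream
    .outputMode("complete")
    .format("memory")              # in-memory sink, just for the sketch
    .queryName("user_counts_speed")
    .start())
# The serving layer merges the batch view with user_counts_speed at query time.
```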



01: Snowflake interview questions & answers – overview

Q01. What is Snowflake?
A01. Snowflake is a fully managed SaaS (i.e. Software as a Service) offering that provides a single platform for data warehousing, data lakes, data engineering, data science, data application development, and secure sharing of data.

It is built for storing and managing structured (i.e.… Read more ...
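As a quick taste, here is a minimal sketch of querying Snowflake from Python using the official snowflake-connector-python package; the credentials below are placeholders:

```python
# Minimal sketch using the official snowflake-connector-python package
# (pip install snowflake-connector-python); credentials are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="your_account",      # placeholder
    user="your_user",            # placeholder
    password="your_password",    # placeholder
    warehouse="COMPUTE_WH",
    database="DEMO_DB",
    schema="PUBLIC",
)
try:
    cur = conn.cursor()
    cur.execute("SELECT CURRENT_VERSION()")
    print(cur.fetchone())        # e.g. the Snowflake version string
finally:
    conn.close()
```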



02: Q7 – Q15 Hadoop overview & architecture interview Q&As

This extends Q1 – Q6 Hadoop Overview & Architecture interview Q&As. Q7. What are the major machine roles in a Hadoop cluster? A7. The three major categories of machine roles in a Hadoop cluster are 1) Client machines. 2) Master nodes.… Read more ...



10 Distributed storage & computing systems interview Q&As – Big Data

A distributed system consists of multiple software components that run on multiple computers (aka nodes) but operate as a single system. These components can be stateful, stateless, or serverless; they can be written in different languages and run in hybrid environments built on open-source technologies, open standards, and interoperability.… Read more ...


11a: 40+ Apache Spark best practices & optimisation interview FAQs – Part 1

There are many different ways to solve a given big data problem in Spark, but some approaches can adversely impact performance and lead to memory issues. Here are some best practices to keep in mind when writing Spark jobs.… Read more ...
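As a flavour of such best practices, here is one commonly cited example (not necessarily the post's first tip): broadcasting the small side of a join so Spark does not shuffle the large side:

```python
# Generic illustration of one widely cited Spark best practice: broadcast the
# small side of a join to avoid shuffling the large side.
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("broadcast-sketch").getOrCreate()

large_df = spark.range(10_000_000).withColumnRenamed("id", "key")
small_df = spark.createDataFrame(
    [(i, f"label_{i}") for i in range(100)], ["key", "label"])

# The broadcast() hint ships small_df to every executor once.
joined = large_df.join(broadcast(small_df), on="key", how="left")
joined.explain()   # look for BroadcastHashJoin in the physical plan
```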



11b: 40+ Apache Spark best practices & optimisation interview FAQs – Part 2 Spark UI

This extends 40+ Apache Spark best practices & optimisation interview FAQs – Part-1, where best practices 1-10 were covered with examples & diagrams. #11 Use Spark UI: Running Spark jobs without inspecting the Spark UI is a definite NO. It is a very handy debugging & performance tuning tool.… Read more ...
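For a quick way to locate the UI of a running application, PySpark exposes the driver's UI address via a standard SparkContext property:

```python
# The driver exposes the Spark UI address via SparkContext.uiWebUrl.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ui-sketch").getOrCreate()
print(spark.sparkContext.uiWebUrl)   # e.g. http://<driver-host>:4040
```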



11c: 40+ Apache Spark best practices & optimisation interview FAQs – Part 3 Partitions & buckets

This extends 40+ Apache Spark best practices & optimisation interview FAQs – Part-2 Spark UI. #31 Bucketing is another data optimisation technique that groups data with the same bucket value across a fixed number of “buckets”. Bucketing improves performance in wide transformations and joins by minimising or avoiding data “shuffles”.… Read more ...
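A minimal bucketing sketch in PySpark is shown below; the table and column names are made up:

```python
# Hedged sketch of bucketing in PySpark; table and column names are made up.
# Bucketing hashes rows on a column into a fixed number of buckets so that
# joins on that column can avoid a shuffle.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bucketing-sketch").getOrCreate()

orders = spark.range(1_000).withColumnRenamed("id", "customer_id")
(orders.write
    .bucketBy(8, "customer_id")      # 8 buckets, hashed on the join key
    .sortBy("customer_id")
    .mode("overwrite")
    .saveAsTable("orders_bucketed")) # bucketed data must go via saveAsTable
# Joining two tables bucketed on the same key with the same bucket count
# lets Spark plan the join without shuffling either side.
```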



11d: 40+ Apache Spark best practices & optimisation interview FAQs – Part 4 Small Files problem

Q38: What is the difference between repartition & coalesce? A38: Repartition distributes the data evenly across machines, which requires a full reshuffle and is therefore slower than coalesce; however, it can improve the overall performance of the job by preventing data skew through even distribution of data across the executors.… Read more ...
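The contrast is easy to see in a few lines of PySpark; this is a generic sketch, not code from the post:

```python
# Sketch contrasting repartition (full shuffle, even distribution) with
# coalesce (merges existing partitions, no full shuffle).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("repartition-sketch").getOrCreate()
df = spark.range(1_000_000)

evenly_spread = df.repartition(200)           # full shuffle; balances the data
print(evenly_spread.rdd.getNumPartitions())   # 200

fewer_files = df.coalesce(10)                 # narrow op; no full shuffle
print(fewer_files.rdd.getNumPartitions())     # 10
# coalesce is the usual choice for reducing the number of output files,
# e.g. to mitigate the "small files" problem this post discusses.
```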



16: Q114 – Q115 CAP theorem interview Q&As

Q114. What does CAP stand for in CAP theorem? A114. In a distributed system having two or more nodes, and maintaining two or more copies of your data for fault tolerance, the CAP theorem can be depicted & explained as below: Consistency – Every read should give the most recent… Read more ...

Building idempotent data pipelines interview questions & answers

Q01. What is an idempotent data operation?
A01. Idempotent operations produce the same result even when the operation is repeated many times. A pipeline that reads data from a number of source systems and loads it into target RDBMS tables more than once for a given day can end up with duplicate rows in the target tables, causing wrong metrics when aggregated for a dashboard.… Read more ...
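Below is a hedged sketch of one way to make a daily PySpark load idempotent: overwrite just the run date's partition instead of appending. The paths and the run_date column are hypothetical:

```python
# Hedged sketch of an idempotent daily load in PySpark: rather than appending
# (which duplicates rows when the job is re-run), overwrite just the partition
# for the run date. Paths and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("idempotent-load").getOrCreate()
# Overwrite only the partitions present in the incoming data, not the table.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

incoming = (spark.read.parquet("/staging/sales/2024-01-15/")   # hypothetical
            .withColumn("run_date", F.lit("2024-01-15")))

(incoming.write
    .mode("overwrite")
    .partitionBy("run_date")
    .parquet("/warehouse/sales/"))                             # hypothetical
# Re-running the job for 2024-01-15 rewrites that one partition in place,
# so the load yields the same result no matter how many times it runs.
```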



Don't be overwhelmed by the number of Q&As & tech stacks, as nobody knows everything; often, knowing the key Q&As at the right moment makes all the difference.

500+ Java Interview FAQs

Java & Big Data Tutorials
