Blog Archives

0: 25 Big Data Engineering key concepts that Data Engineers, Analysts & Scientists must know

#01 Data Cardinality In data modelling, cardinality is the numerical relationship between the rows of one table and the rows of another. Common cardinalities are one-to-one, one-to-many and many-to-many. Data cardinality also refers to the uniqueness of the values contained in a database column. If most of the values are distinct, then…

Read more ...
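To illustrate the column-uniqueness sense of cardinality, here is a minimal sketch using Python's built-in sqlite3 module (the `customers` table and its data are made-up assumptions for illustration):

```python
import sqlite3

# In-memory database with a hypothetical customers table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, country TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "AU"), (2, "AU"), (3, "NZ"), (4, "US"), (5, "AU")],
)

# Column cardinality = number of distinct values the column holds
distinct, total = conn.execute(
    "SELECT COUNT(DISTINCT country), COUNT(*) FROM customers"
).fetchone()

print(distinct, total)             # 3 distinct countries out of 5 rows
print(round(distinct / total, 2))  # 0.6 - a rough selectivity ratio
```

A low ratio (few distinct values) suggests the column is a poor candidate for an index; a ratio near 1.0 means the values are mostly unique.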

00: 13 Data Warehouse interview Q&As – Fact Vs Dimension, CDC, SCD, etc – part 1

Q1. What is dimensional modelling in a Data Warehouse (i.e. DWH)?
A1. A dimensional model is a data structure technique optimised for Data Warehousing tools (i.e. OLAP products). Dimensional Modelling comprises Fact and Dimension tables.

A “Fact” is a numeric value (aka a measure) that a business wishes to count or sum. A “Dimension” is essentially a descriptive, textual value that gives context to the facts.

For example, “daily store sales”

FACT: price and quantity, which are quantitative values that can be aggregated (e.g. summed) by dimensions such as store, product and time.

Dimensions: Store, Time, Product, etc

The fact table contains the main measures like price and quantity, which represent business events or transactions; the dimension tables add descriptive detail so that effective reports can be generated. The fact table links to many dimension tables via foreign keys. The resulting schema is called a star schema because it looks like a star: multiple dimension tables all connect to a single fact table. This design concept is named dimensional modelling.
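The "daily store sales" star schema above can be sketched with SQLite; the table and column names below are illustrative assumptions, not from the original post:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Dimension tables hold descriptive attributes
cur.execute("CREATE TABLE dim_store   (store_id INTEGER PRIMARY KEY, store_name TEXT)")
cur.execute("CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, product_name TEXT)")

# The fact table holds the measures plus a foreign key to each dimension
cur.execute("""
    CREATE TABLE fact_sales (
        store_id   INTEGER REFERENCES dim_store(store_id),
        product_id INTEGER REFERENCES dim_product(product_id),
        quantity   INTEGER,
        price      REAL
    )
""")

cur.execute("INSERT INTO dim_store VALUES (1, 'CBD'), (2, 'Airport')")
cur.execute("INSERT INTO dim_product VALUES (10, 'Coffee'), (20, 'Tea')")
cur.execute("INSERT INTO fact_sales VALUES (1, 10, 2, 4.5), (1, 20, 1, 4.0), (2, 10, 3, 5.0)")

# Aggregate the measures by a dimension attribute
rows = cur.execute("""
    SELECT s.store_name, SUM(f.quantity * f.price) AS revenue
    FROM fact_sales f
    JOIN dim_store s ON s.store_id = f.store_id
    GROUP BY s.store_name
    ORDER BY s.store_name
""").fetchall()

print(rows)  # [('Airport', 15.0), ('CBD', 13.0)]
```

Note how the query reads naturally: facts are summed, dimensions are grouped by.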

Q2. How would you go about developing the star schema?

Step 1: Identify the business process the DWH should cover. This could be sales, marketing, HR, etc. The selection of the business process also depends on the quality of data available for that process. You can describe the business process with UML (e.g.… Read more ...


00: 13 Data Warehouse interview Q&As – Fact Vs Dimension, CDC, SCD, etc – part 2

This extends Q1 to Q5 at 13 Data Warehouse interview Q&As – Fact Vs Dimension, CDC, SCD, etc – part 1. Q6. What is a Factless Fact table? A6. From the above store sales example we know that the fact table is a collection of many facts having multiple keys joined…

Read more ...

00: 18+ SQL best practices interview Q&As

It is a must to know the order in which the SQL clauses are executed. This is demonstrated with an example below in #5. Keep this order of execution visible and well understood. SQL is very easy to learn, but lots of hands-on experience is required to master:

1) to translate business requirements into SQL.
2) to write efficient & maintainable queries.
3) to break down & reverse engineer complex SQL queries into business requirements in order to enhance or modify them.

Let’s start with the best practices.

#1. Use uppercase for keywords like SELECT, FROM, JOIN, GROUP BY, WHERE, etc. It’s also a good practice to use uppercase for SQL functions like UPPER(col_name), COUNT(*), etc. Another key rule is that each clause such as SELECT, FROM, WHERE, GROUP BY, HAVING, etc. should be on a new line. Proper structure improves the readability of SQL queries.

Avoid: lowercase keywords.
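The layout rule can be seen in a minimal sketch using Python's sqlite3 (the `orders` table is an illustrative assumption); the comment shows the logical execution order the post refers to:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("alice", 50.0), ("alice", 70.0), ("bob", 20.0)])

# Uppercase keywords, one clause per line.
# Logical execution order: FROM -> WHERE -> GROUP BY -> HAVING -> SELECT -> ORDER BY
rows = conn.execute("""
    SELECT UPPER(customer) AS customer,
           COUNT(*)        AS order_count,
           SUM(amount)     AS total
    FROM orders
    WHERE amount > 10
    GROUP BY customer
    HAVING SUM(amount) >= 50
    ORDER BY total DESC
""").fetchall()

print(rows)  # [('ALICE', 2, 120.0)] - bob is filtered out by HAVING
```

Because WHERE runs before SELECT, the column alias `total` cannot be used in WHERE, but it can be used in ORDER BY, which runs last.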

Read more ...

00: 50+ SQL scenarios based interview Q&As on identifying & deleting duplicate records

50+ SQL interview questions and answers to solve real business scenarios. SQL is widely used in building microservices & Big Data projects. Learning SQL syntax is easy, but being able to convert a given business requirement into a query takes lots of practice. These scenario-based interview questions can assess your experience.

Considerations & Tips

It is important to first understand the problem statement, and then ask the right questions to solve the problem.

1) Break down a complex business requirement into sub-sections.

2) Think about the grain of the data. For example, account level, user level, department level, etc. Decide what to GROUP BY, and which columns to include in the GROUP BY for the resulting query.

3) What joins (e.g. inner, left outer, right outer, self, full outer, anti, etc) are required & at what granularity? Do these joins produce one-to-one, one-to-many or many-to-many results?

4) Is ordering of the rows required? Ranking functions may require row ordering.

5) Do I need to aggregate any values? E.g. sum(price), max(date_time), min(date_time), etc.

6) Should I display aggregated values at a group level, or for each row using window functions?

7) Should I transpose rows to columns (i.e. pivot)? Should I convert columns to rows (i.e. unpivot)?

8) Are any subqueries or CTEs required? Will a CTE make my code more readable & maintainable?

9) Any ranking required over a partition (i.e… Read more ...
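As a concrete sketch of the duplicate-record scenario named in the post title, here is one common approach in SQLite (the `emails` table is an illustrative assumption): group by the duplicated grain to identify duplicates, then delete all but one row per group.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emails (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany("INSERT INTO emails (email) VALUES (?)",
                 [("a@x.com",), ("b@x.com",), ("a@x.com",), ("a@x.com",)])

# 1) Identify duplicates: group by the duplicated grain, keep groups with > 1 row
dupes = conn.execute("""
    SELECT email, COUNT(*) AS cnt
    FROM emails
    GROUP BY email
    HAVING COUNT(*) > 1
""").fetchall()
print(dupes)  # [('a@x.com', 3)]

# 2) Delete duplicates, keeping the row with the lowest id per email
conn.execute("""
    DELETE FROM emails
    WHERE id NOT IN (SELECT MIN(id) FROM emails GROUP BY email)
""")
remaining = conn.execute("SELECT COUNT(*) FROM emails").fetchone()[0]
print(remaining)  # 2
```

In databases that support window functions, ROW_NUMBER() OVER (PARTITION BY email ORDER BY id) is another common way to flag the rows to delete.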

00: A roadmap to become a Big Data Engineer – What skills are required?

What is all the hype about becoming a (Big) Data Engineer? There is strong demand for Data Engineers, as organisations have been ramping up their investments in big data related projects since 2019.

Why Big Data?

Are you confused about the various roles like Data Engineer, Technical Business Analyst, DevOps Engineer, Data Scientist, etc? Often Big Data projects will have all of the above roles complementing each other. There will be an overlap in skills: the Data Scientist will have basic programming skills, Data Engineers will be required to have basic DevOps skills, and the technical business analysts will be required to have good SQL & scripting skills.

What do data analysts, engineers & scientists do?

Don’t be overwhelmed by the number of technology/tool stacks, as every company will be using a different combination. Data engineering is vast: there are even teams where technical business analysts build data pipelines. For example, metadata-driven abstraction layers can be built on top of Apache Spark to build ETL pipelines with configs and basic SQL-like joins. There are also drag & drop ETL tools.

Learn the fundamentals first: data modelling & optimisation, data warehouse & data lake concepts, SQL, Git, Regex, Cloud basics, distributed data storage & computing concepts, one mainstream language like Python, metadata management, data lineage, and data governance. Tools and technologies will keep evolving.… Read more ...


00: Apache Spark eco system & anatomy interview questions and answers

Q01. Can you summarise the Spark eco system?
A01. Apache Spark is a general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python, and R. It has six components: Spark Core, Spark SQL, Spark Streaming, Spark MLlib, Spark GraphX, and SparkR. All the functionality provided by Apache Spark is built on top of Spark Core. Spark Core is the foundation of in-memory parallel and distributed processing of huge datasets with fault-tolerance & recovery.

The Spark SQL component is a distributed framework for structured data processing. Spark Streaming is an add-on API which allows scalable, high-throughput, fault-tolerant stream processing of live data streams. Spark can access data from sources like Kafka, Flume, Amazon Kinesis or a TCP socket. MLlib in Spark is a scalable machine learning library. GraphX in Spark is the API for graphs. The key component of SparkR is the SparkR DataFrame.

Q02. What are the key execution components of Apache Spark?
A02. Apache Spark uses a master-slave architecture, where there will be one master process & multiple slave (aka worker) processes. This master-slave architecture is applied at two levels:

1) Cluster Management Level: The Application Master is the master and the Node Managers are the slaves. The Application Master is responsible for coordinating the Node Managers to allocate resources like memory & CPU cores.

2) Application Level: Spark Driver is the master & Spark executors are the slaves/workers.… Read more ...


00: Data Lake Vs. Data Warehouse Vs. Data Lakehouse Vs Data Fabric Vs Data Mesh

Modern data architectures will have both Data Lakes and Data Warehouses. The Data Engineers build the data pipelines for the data analysts and scientists, who build business reports & models to analyse the data.

What is Big Data?

Big Data is huge volumes of structured (e.g. entries in tables with rows & columns), semi-structured (e.g. JSON, XML, log files, etc) and unstructured (e.g. PDFs, audio & video files) data. Big Data is characterised by volume, velocity & variety. The volume relates to the sheer size of datasets, as in petabytes of data coming from IoT (i.e. smart) devices, business transactions, social media, and other streams. The velocity relates to the speed at which new data is being generated, and the speed at which this data needs to be ingested & processed by distributed & scalable systems. The variety indicates the diversity of data sources like IoT devices, mobile towers, social media platforms, billing systems, ERP systems, CRM systems, logging systems, audio & video sources, etc.

Why is Big Data important?

Businesses use Big Data to improve operations, detect fraud/anomalies, create reports to strategise, provide better customer service, predict customer behaviour, create personalised marketing campaigns, and take other actions that can lead to customer satisfaction & increased revenue and profits. Big Data is used in business intelligence reporting, analytics, machine learning & artificial intelligence. Big Data systems are predominantly append-only, where data is never or rarely updated or deleted.… Read more ...

00: Q1 – Q6 Hadoop based Big Data architecture & basics interview Q&As

There are a number of technologies to ingest & run analytical queries over Big Data (i.e. large volumes of data). Big Data is used in Business Intelligence (i.e. BI) reporting, Data Science, Machine Learning, and Artificial Intelligence (i.e. AI). Processing a large volume of data is intensive on disk I/O, CPU, and memory usage. Big Data processing is based on distributed computing, where you will have multiple nodes across several machines to process data in parallel. Each node will have its own dedicated hard disk, CPU, memory, etc. This is known as the “Shared-nothing architecture”. A Hadoop cluster is a collection of nodes.

Here are a few technologies you can use as depicted below:

Hadoop is suitable for massive offline batch processing of structured, semi-structured & unstructured data by building a data lake, whereas MPP (i.e. Massively Parallel Processing) databases like Amazon Redshift, Azure SQL Data Warehouse, GCP’s BigQuery, etc are built for online analytics of large volumes of highly structured data.

The data lakes can be built on HDFS (i.e. Hadoop Distributed File System) or on cloud storage: AWS S3 with Amazon EMR or Databricks for compute, Azure Blob Storage with Azure Data Factory or Databricks for compute, or Google Cloud Storage with Dataproc for compute. Amazon EMR, Azure Data Factory, GCP Dataproc & Databricks make use of Apache Spark as the computation engine on the respective data storages.… Read more ...


01: Data Backfilling interview questions & answers

Q1. What is data backfilling? A1. Backfilling data is the process of reactively processing any missing data for a past time window. Q2. Why do you need to backfill data? A2. There are two types of data loads from source systems to target (aka sink) systems via ETL pipelines: 1)…

Read more ...
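The first step of a backfill is usually working out which partitions of the past time window are missing. A minimal sketch in pure Python (the daily partitioning scheme and the sample dates are illustrative assumptions):

```python
from datetime import date, timedelta

def missing_partitions(start: date, end: date, loaded: set) -> list:
    """Return the dates in [start, end] that have no loaded partition in the target."""
    days = (end - start).days + 1
    expected = [start + timedelta(days=i) for i in range(days)]
    return [d for d in expected if d not in loaded]

# Suppose the target system has loaded only three of the five days in the window
loaded = {date(2023, 1, 1), date(2023, 1, 2), date(2023, 1, 5)}
gaps = missing_partitions(date(2023, 1, 1), date(2023, 1, 5), loaded)
print(gaps)  # [datetime.date(2023, 1, 3), datetime.date(2023, 1, 4)]
```

A backfill job would then re-run the ETL pipeline for each gap date, ideally with idempotent loads so re-processing a partition is safe.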

