00: 13 Data Warehouse interview Q&As – Fact Vs Dimension, CDC, SCD, etc – part 1

Q1. What is dimensional modelling in a Data Warehouse (i.e. DWH)?
A1. A dimensional model is a data structure technique optimised for Data Warehousing tools (i.e. OLAP products). The concept of Dimensional Modelling is comprised of Fact and Dimension tables.

A “Fact” is a numeric value (i.e. aka a measure) that a business wishes to count or sum. A “Dimension” is essentially descriptive value in text for getting at the facts.

For example, “daily store sales”

FACT: price (i.e. sales revenue NOT unit_price) and quantity, which are quantitative values that can be aggregated (e.g sum) by dimensions such as store, product and time.

Dimensions: Store, Time, Product, etc

Star schema dimensional modelling with Facts & Dimensions

Star schema dimensional modelling with Facts & Dimensions (source: https://datawarehouseinfo.com/data-warehouse-star-schema-vs-snowflake-schema/)

The fact table contains the main measures like price and quantity, which represent business events or transactions, used to add detail to dimension data, so that effective reports can be generated. Fact table links to many dimension tables via foreign keys. This resulting schema is called a star schema because it looks like a star. As there are multiple dimension tables, all connecting to single fact table, this design concept is named dimensional modelling.

Q2. How would you go about developing the star schema?

Step 1: Identify the business process the DWH should cover. This could be the sales, marketing, HR, etc. The selection of the Business process also depends on the quality of data available for that process. You can describe the business process with UML (e.g. ERD – Entity Relationship Diagram, use case diagrams and business activity diagrams). You can also use mind-map tools to draw the business processes.

Step 2: Identify the grain of the data the design should cover. Are we interested in daily, weekly, or monthly sales? Do we capture at the order level or line item grain?

Step 3: Identify the dimensions, which are nouns like date, store, customer, product, inventory, etc.

For example, the sales manager wants to find the sales for specific products in different regions on a daily basis.

Dimensions: Product, Region and Time

Grain is “product sale information by region by the day.”

Step 4: Identify the fact, which are rows of numerical values like price or cost per unit, etc.

For example, the sales manager wants to find the sales for specific products in different regions on a daily basis.

Fact: Sum of sales by product by region by time.

Step 5: Build the dimensional model schema, where the fact table will have the foreign keys to the primary keys of the dimensional tables. There are two popular schemas: star schema & snowflake schema.

The snowflake schema is an extension of the star schema. In a snowflake schema, each dimension are normalized and connected to more dimension tables. On the plus side a snowflake schema allows us to reduce redundancy and minimize disk space at the expense of more complex joins required to answer business queries, hence slowing down query performance.

Now a days disk space is cheaper, but query performance is more important, hence star schema is more common in DWH.

Q. When to use a Snowflake schema?
A. In scenarios where an attribute in a dimension table has NULLs for the majority of dimension records, it would be advisable to create a separate dimension table for this attribute, thus transforming into the snowflake schema.

Second scenario is when we have certain attributes that are part of a hierarchy, but are generally queried independently as in “dim_time” dimension where you can have separate tables for dim_weekly, dim_monthly, dim_yearly, dim_weekday, etc as sales volume will be reported separately by day, week, month or year.

Q3. What is a grain in Data Warehouse?
A3. A grain is an atomic level at which the facts are measured. In other words it is the lowest level of information stored in a fact table.

In the above “daily store sales” example the each Fact table record is of the grain: Per product Per store Per day.

Other examples of grain definition are: 1 row per product, 1 row per store, 1 row per product per store, 1 row per product per store per day, etc.

In Data Warehouse design it is important to determine the grain of Data points before designing the schema as Dimension tables need to be consistent with the Fact tables. If we don’t decide on grains upfront the schema design may result in inconsistent reporting & possible rework.

For example, as per the above “daily store sales”, what if the fact table “fact_sales” capture the quantity & price every hour? The dimension table “dim_time” is at the grain of daily. This will cause inconsistent reporting. FIX: The dimension table must add the attribute “hour int”.

Q. What will happen if the fact table “fact_sales” was at daily grain & the dimension table “dim_time” is at hourly grain?
A. What if the requirement from the business is to report sales every hour? If the data warehouse were at a mature stage with 200+ tables then it will be time consuming to change the current schema & ETL (i.e. Extract Transform & Load) jobs to be able to support reporting at an hourly grain.

Q4. What are the different Fact types?
A4. Additive Fact type is where the measures can be aggregated for all dimensions. The “daily store sales” example shown above is of this type as the measures “price” & “quantity” can be aggregated for all dimensions “dim_store”, “dim_product”, “dim_time”, etc.

Semi-Additive Fact type is where the measures can only be aggregated by some, but not all dimensions. For example, if the fact table “fact_sales” captured an additional measure named “stock_at_hand” to keep track of the stock per product per store then this measure is additive across the dimensions “dim_product” & “dim_store”, BUT NOT additive for the dimension “dim_time”.

Non-Additive Fact type is where a measure attribute is not additive across all dimensions. For example, if the fact table “fact_sales” captured an additional measure named “gst_percentage” to keep track of the goods and services tax percentage as in 10% per record then it cannot be aggregated by product or store as it will be incorrect to say that “product A” is 30% GST by aggregating 3 records. The GST for “product A” will be 10% only. Same rationale applies to by store as well.

Q5. What are the different types of Fact Tables in Data Warehouse?
A5. There are 3 types of Fact Tables. Two key aspects that determine Fact table types are “grain” & “fact type (i.e. aggregate or additive nature)”.

1) Transaction Fact Table.
2) Periodic Snapshot Fact Table.
3) Accumulating Snapshot Fact Table.

Fact Table Types Comparison

Fact Table Types Comparison (Source: http://business-intelligence-yedida.blogspot.com/2014/02/comparisons-between-fact-table-types.html)

Transaction fact tables are those where business users can fully aggregate the measures by any related dimension in general without any limitations. In other words mostly of “Additive” fact type & keeps most detailed level. Example, “daily store sales”.

Periodic Snapshot fact tables are used to analyse outcomes like bank balance, stock counts, etc at a regular interval as in every few hours, daily, weekly, monthly, etc. A typical example will be our bank statements with opening & closing balance for a fixed period like a month. Unlike the transactional fact table where measures are fully additive, we wouldn’t want to sum the opening & closing balance across all periods as it would not make any analytic sense. The two distinct values of “opening balance” and “closing balance” were not meant to be summarised as they represent two “snapshots” based on a specific point in time.

Monthly inventory count is another example of periodic snapshot fact tables. The counts by SKU (i.e. Stock Keeping Unit) vary every month. We would not sum the numbers across time periods as that was not the intention of the original number. We might perform an average and the measures can only be partly rolled up at fixed intervals, hence the facts are “Semi-Additive”.

Accumulating Snapshot fact tables have multiple columns to update a normal sequence of events. For example, placing orders will have a sequence of events like order placement, order shipment, and order fulfilment. The fact table will have columns to capture “order_date”, “shipment_date” and “fulfilment_date”. Initially, a row will be inserted, and these date columns will be updated over a period of time as these life-cycle events happen. This is not a common type of fact table, and some database architects prefer to normalize the fact table and generate separate rows for each life cycle event. Useful for logistical applications.

Another example would be submitting a mortgage application with separate columns for approval dates and approval amounts.

Q6 – Q14 at 13 Data Warehouse interview Q&As – Fact Vs Dimension, CDC, SCD, etc – part 2.

300+ Java & Big Data Interview FAQs

16+ Java Key Areas Interview Q&As

800+ Java Interview Q&As

300+ Java & Big Data Tutorials