Q1. What is dimensional modelling in a Data Warehouse (i.e. DWH)?
A1. A dimensional model is a data structure technique optimised for Data Warehousing tools (i.e. OLAP products). Dimensional modelling consists of Fact and Dimension tables.
A “Fact” is a numeric value (i.e. aka a measure) that a business wishes to count or sum. A “Dimension” is essentially a descriptive, textual value used to slice and filter the facts.
For example, “daily store sales”
FACT: price (i.e. sales revenue, NOT unit_price) and quantity, which are quantitative values that can be aggregated (e.g. summed) by dimensions such as store, product and time.
Dimensions: Store, Time, Product, etc
The fact table contains the main measures like price and quantity, which represent business events or transactions, while the dimension tables add descriptive detail so that effective reports can be generated. The fact table links to many dimension tables via foreign keys. The resulting schema is called a star schema because it looks like a star: multiple dimension tables all connecting to a single fact table. This design concept is named dimensional modelling.
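A minimal star schema for the “daily store sales” example can be sketched with Python’s built-in sqlite3. The exact column names are illustrative assumptions; only the table names (fact_sales, dim_store, dim_product, dim_time) come from the examples in this article.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Dimension tables: descriptive attributes, one surrogate key each.
cur.execute("CREATE TABLE dim_store (store_id INTEGER PRIMARY KEY, store_name TEXT)")
cur.execute("CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, product_name TEXT)")
cur.execute("CREATE TABLE dim_time (time_id INTEGER PRIMARY KEY, day TEXT)")

# Fact table: measures plus foreign keys to every dimension -- the centre of the star.
cur.execute("""CREATE TABLE fact_sales (
    store_id INTEGER REFERENCES dim_store(store_id),
    product_id INTEGER REFERENCES dim_product(product_id),
    time_id INTEGER REFERENCES dim_time(time_id),
    quantity INTEGER,
    price REAL)""")

cur.executemany("INSERT INTO dim_store VALUES (?, ?)", [(1, "City"), (2, "Suburb")])
cur.execute("INSERT INTO dim_product VALUES (1, 'Widget')")
cur.execute("INSERT INTO dim_time VALUES (1, '2024-01-01')")
cur.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?, ?)",
                [(1, 1, 1, 2, 10.0), (2, 1, 1, 3, 15.0)])

# Aggregate a measure (price) by a dimension attribute (store_name).
rows = cur.execute("""SELECT s.store_name, SUM(f.price)
                      FROM fact_sales f
                      JOIN dim_store s ON f.store_id = s.store_id
                      GROUP BY s.store_name
                      ORDER BY s.store_name""").fetchall()
print(rows)  # [('City', 10.0), ('Suburb', 15.0)]
```

Note how the report query only touches the fact table and whichever dimensions it needs, which is exactly what makes the star layout easy to aggregate over.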
Q2. How would you go about developing the star schema?
Step 1: Identify the business process the DWH should cover. This could be sales, marketing, HR, etc. The selection of the business process also depends on the quality of the data available for that process. You can describe the business process with diagrams (e.g. ERD – Entity Relationship Diagram, UML use case diagrams and business activity diagrams). You can also use mind-map tools to draw the business processes.
Step 2: Identify the grain of the data the design should cover. Are we interested in daily, weekly, or monthly sales? Do we capture at the order level or line item grain?
Step 3: Identify the dimensions, which are nouns like date, store, customer, product, inventory, etc.
For example, the sales manager wants to find the sales for specific products in different regions on a daily basis.
Dimensions: Product, Region and Time
Grain is “product sale information by region by the day.”
Step 4: Identify the facts, which are numeric measures like price or cost per unit, etc.
Continuing the same example:
Fact: Sum of sales by product by region by time.
Step 5: Build the dimensional model schema, where the fact table will have the foreign keys to the primary keys of the dimensional tables. There are two popular schemas: star schema & snowflake schema.
The snowflake schema is an extension of the star schema. In a snowflake schema, each dimension is normalized and may connect to further dimension tables. On the plus side, a snowflake schema reduces redundancy and minimises disk space, at the expense of the more complex joins required to answer business queries, which slows down query performance.
Nowadays disk space is cheap but query performance matters more, hence the star schema is more common in a DWH.
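The trade-off can be sketched with Python’s built-in sqlite3: the same product dimension, denormalised for a star schema and normalised for a snowflake. The dim_category table and all column names here are hypothetical illustrations.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Star: category is denormalised into the product dimension (redundant text per row).
cur.execute("""CREATE TABLE dim_product_star (
    product_id INTEGER PRIMARY KEY, product_name TEXT, category_name TEXT)""")
cur.execute("INSERT INTO dim_product_star VALUES (1, 'Widget', 'Hardware')")

# Snowflake: category is split into its own table; dim_product holds a foreign key.
cur.execute("CREATE TABLE dim_category (category_id INTEGER PRIMARY KEY, category_name TEXT)")
cur.execute("""CREATE TABLE dim_product_snow (
    product_id INTEGER PRIMARY KEY, product_name TEXT,
    category_id INTEGER REFERENCES dim_category(category_id))""")
cur.execute("INSERT INTO dim_category VALUES (1, 'Hardware')")
cur.execute("INSERT INTO dim_product_snow VALUES (1, 'Widget', 1)")

# Star answers "category of product 1" from one table...
star = cur.execute(
    "SELECT category_name FROM dim_product_star WHERE product_id = 1").fetchone()

# ...while the snowflake needs an extra join for the same answer.
snow = cur.execute("""SELECT c.category_name
                      FROM dim_product_snow p
                      JOIN dim_category c ON p.category_id = c.category_id
                      WHERE p.product_id = 1""").fetchone()
print(star[0], snow[0])  # Hardware Hardware
```

Both designs return the same answer; the snowflake simply pays for its reduced redundancy with the additional join.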
Q. When to use a Snowflake schema?
A. In scenarios where an attribute in a dimension table has NULLs for the majority of dimension records, it is advisable to create a separate dimension table for this attribute, thus transforming the design into a snowflake schema.
The second scenario is when certain attributes are part of a hierarchy but are generally queried independently, as in the “dim_time” dimension, where you can have separate tables for dim_weekly, dim_monthly, dim_yearly, dim_weekday, etc, as sales volume will be reported separately by day, week, month or year.
Q3. What is a grain in Data Warehouse?
A3. The grain is the atomic level at which the facts are measured. In other words, it is the lowest level of detail stored in a fact table.
In the above “daily store sales” example, each fact table record is of the grain: per product, per store, per day.
Other examples of grain definition are: 1 row per product, 1 row per store, 1 row per product per store, 1 row per product per store per day, etc.
In Data Warehouse design it is important to determine the grain of the data points before designing the schema, as dimension tables need to be consistent with the fact tables. If we don’t decide on grains upfront, the schema design may result in inconsistent reporting & possible rework.
For example, as per the above “daily store sales”, what if the fact table “fact_sales” captures the quantity & price every hour? The dimension table “dim_time” is at the grain of daily. This will cause inconsistent reporting. FIX: the dimension table must add the attribute “hour int”.
Q. What will happen if the fact table “fact_sales” was at daily grain & the dimension table “dim_time” is at hourly grain?
A. What if the requirement from the business is to report sales every hour? If the data warehouse is at a mature stage with 200+ tables, then it will be time consuming to change the current schema & ETL (i.e. Extract, Transform & Load) jobs to support reporting at an hourly grain.
Q4. What are the different Fact types?
A4. Additive Fact type is where the measures can be aggregated for all dimensions. The “daily store sales” example shown above is of this type as the measures “price” & “quantity” can be aggregated for all dimensions “dim_store”, “dim_product”, “dim_time”, etc.
Semi-Additive Fact type is where the measures can only be aggregated by some, but not all, dimensions. For example, if the fact table “fact_sales” captured an additional measure named “stock_at_hand” to track the stock per product per store, then this measure is additive across the dimensions “dim_product” & “dim_store”, BUT NOT across the dimension “dim_time”.
Non-Additive Fact type is where a measure attribute cannot be aggregated across any dimension. For example, if the fact table “fact_sales” captured an additional measure named “gst_percentage” to track the goods and services tax percentage (e.g. 10% per record), then it cannot be aggregated by product or store, as it would be incorrect to say that “product A” has 30% GST by summing 3 records. The GST for “product A” remains 10%. The same rationale applies by store as well.
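The three fact types above can be illustrated with plain Python over a few hypothetical fact rows (the row layout and values are made up for this sketch):

```python
# Hypothetical fact rows: (product, store, day, quantity, stock_at_hand, gst_percentage)
rows = [
    ("A", "S1", "d1", 2, 50, 10.0),
    ("A", "S1", "d2", 3, 47, 10.0),
    ("A", "S2", "d1", 1, 20, 10.0),
]

# Additive: quantity can safely be summed across every dimension.
total_quantity = sum(r[3] for r in rows)                  # 6

# Semi-additive: stock_at_hand sums across stores for ONE day...
stock_d1 = sum(r[4] for r in rows if r[2] == "d1")        # 70, a valid snapshot
# ...but summing it across days is meaningless.
stock_sum_wrong = sum(r[4] for r in rows)                 # 117, not a real quantity

# Non-additive: gst_percentage must never be summed; product A stays at 10%.
gst_sum_wrong = sum(r[5] for r in rows if r[0] == "A")    # 30.0 -- incorrect
gst_actual = rows[0][5]                                   # 10.0

print(total_quantity, stock_d1, gst_sum_wrong, gst_actual)  # 6 70 30.0 10.0
```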
Q5. What are the different types of Fact Tables in Data Warehouse?
A5. There are 3 types of Fact Tables. The two key aspects that determine the fact table type are the “grain” & the “fact type” (i.e. additive nature).
1) Transaction Fact Table.
2) Periodic Snapshot Fact Table.
3) Accumulating Snapshot Fact Table.
Transaction fact tables are those where business users can fully aggregate the measures by any related dimension, generally without limitations. In other words, they are mostly of the “Additive” fact type & keep data at the most detailed level. Example: “daily store sales”.
Periodic Snapshot fact tables are used to analyse outcomes like bank balances, stock counts, etc at a regular interval such as every few hours, daily, weekly, monthly, etc. A typical example is a bank statement with opening & closing balances for a fixed period like a month. Unlike the transactional fact table, where measures are fully additive, we wouldn’t want to sum the opening & closing balances across all periods, as that would not make analytic sense. The two distinct values of “opening balance” and “closing balance” are not meant to be summarised, as they represent two “snapshots” at specific points in time.
Monthly inventory counts are another example of periodic snapshot fact tables. The counts by SKU (i.e. Stock Keeping Unit) vary every month. We would not sum the counts across time periods, as that was not the intention of the original number. We might take an average, and the measures can only be partially rolled up at fixed intervals; hence the facts are “Semi-Additive”.
Accumulating Snapshot fact tables have multiple date columns to track the stages of a well-defined sequence of events. For example, placing orders will have a sequence of events like order placement, order shipment, and order fulfilment. The fact table will have columns to capture “order_date”, “shipment_date” and “fulfilment_date”. Initially, a row will be inserted, and these date columns will be updated over time as the life-cycle events happen. This is not a common type of fact table, and some database architects prefer to normalize the fact table and generate separate rows for each life-cycle event. Useful for logistical applications.
Another example would be submitting a mortgage application with separate columns for approval dates and approval amounts.
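The update-in-place behaviour of an accumulating snapshot row can be sketched in plain Python. The record_milestone helper is hypothetical; the milestone column names come from the order example above.

```python
# One row per order; milestone date columns start empty and are filled in over time.
fact_order = {"order_id": 101, "order_date": "2024-01-02",
              "shipment_date": None, "fulfilment_date": None}

def record_milestone(row, column, event_date):
    # Unlike a transaction fact table, we UPDATE the existing row
    # rather than inserting a new one for each life-cycle event.
    row[column] = event_date

record_milestone(fact_order, "shipment_date", "2024-01-05")
record_milestone(fact_order, "fulfilment_date", "2024-01-08")
print(fact_order["shipment_date"], fact_order["fulfilment_date"])  # 2024-01-05 2024-01-08
```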
Q6. What is a Factless Fact table?
A6. From the above store sales example we know that a fact table is a collection of facts with multiple keys joined to one or more dimension tables. Facts contain numeric, additive fields like price & quantity. But a factless fact table is a fact table that does not contain any facts (i.e. measures).
For example, the fact table “FACT_LEAVE” below is used to capture the leave taken by an employee. Whenever an employee takes leave (an event), a record is created with the dimension keys to answer questions like:
— Number of leaves taken by an employee
— The type of leave an employee takes
— Details of the employee who took leave
The above example captures the event of an employee taking leave. You can also create factless fact tables to analyse conditions, coverage or eligibility, such as:
— Products that were sold with a discount
— Products that were sold without a discount
— Products that did not get sold by a store
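A factless FACT_LEAVE table can be sketched with Python’s built-in sqlite3. The dimension table names and all columns are illustrative assumptions; the point is that the fact table holds only keys, and the questions above are answered by counting event rows.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

cur.execute("CREATE TABLE dim_employee (employee_id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("CREATE TABLE dim_leave_type (leave_type_id INTEGER PRIMARY KEY, leave_type TEXT)")

# Factless fact table: only dimension keys, no measures -- each row records an event.
cur.execute("""CREATE TABLE fact_leave (
    employee_id INTEGER REFERENCES dim_employee(employee_id),
    leave_type_id INTEGER REFERENCES dim_leave_type(leave_type_id),
    leave_date TEXT)""")

cur.executemany("INSERT INTO dim_employee VALUES (?, ?)", [(1, "Asha"), (2, "Ben")])
cur.executemany("INSERT INTO dim_leave_type VALUES (?, ?)", [(1, "annual"), (2, "sick")])
cur.executemany("INSERT INTO fact_leave VALUES (?, ?, ?)",
                [(1, 1, "2024-01-02"), (1, 2, "2024-01-09"), (2, 1, "2024-01-02")])

# "Number of leaves taken by an employee": COUNT the event rows, no measure needed.
rows = cur.execute("""SELECT e.name, COUNT(*)
                      FROM fact_leave f
                      JOIN dim_employee e ON f.employee_id = e.employee_id
                      GROUP BY e.name
                      ORDER BY e.name""").fetchall()
print(rows)  # [('Asha', 2), ('Ben', 1)]
```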
Q7. What is data vault modelling?
A7. Data vault modelling is a database modelling method, an evolution of dimensional modelling, designed to provide historical storage of data for OLAP coming in from multiple operational (i.e. OLTP) systems. Data vault modelling is agile, flexible & scalable.
3NF (i.e. Normal Form) is optimal for operational systems (i.e. OLTP – OnLine Transactional Processing).
Star schema (i.e. using dimensional modelling) is optimal for OLAP (i.e. OnLine Analytical Processing) Data Marts.
Data Vault is optimal for Data Warehouse. A Data Mart is a subset of a Data Warehouse, which is oriented around a business unit or team. For example, campaign data mart, HR data mart, retail data mart, etc.
Data Vault modelling uses a hybrid approach encompassing the best of breed between 3rd normal form (3NF) and the star schema used in dimensional modelling. The Data Vault is a detail-oriented, historical-tracking and uniquely-linked set of normalized tables that support one or more functional areas of a business.
Q. Why do you need Data Vault modelling when there are 3NF or dimension modelling?
A. Data Vault modelling handles changes like adding a new data source or decommissioning an existing data source better than 3NF & dimensional modelling, without too much re-engineering.
Data Vault modelling is all about efficiently collecting, integrating, and storing the data from all the source systems. Dimensional modelling is all about publishing the stored data to effectively support decision making.
Q8. In data vault modelling it is imperative to have conformed dimensions. What do you understand by the term conformed dimensions?
A8. Conformed dimensions are dimensions designed so that facts and measures can be categorised and described in the same way across multiple fact tables and/or data marts, ensuring consistent reporting across the enterprise.
A good example of a conformed dimension is date. For example, fiscal year Vs. calendar year and the definition of a week. In the case of fiscal vs. calendar year, one may go with either fiscal or calendar, or an alternative is to have two separate conformed dimensions, one for fiscal year and one for calendar year. The definition of a week differs as finance may use Saturday to Friday, whilst marketing may use Sunday to Saturday. In this case, you should decide on a definition and move on.
Q9. What are the 3 types of tables in Data Vault modelling?
A9. The 3 types of tables are:
Hubs: hold a unique list of Business Keys. Business keys uniquely identify an entity, like the tax file number, social security number, bank account number, product SKU number, etc. All the attributes of a hub, like first name, surname, address, age, etc, are stored in satellite tables.
Links: the relationships among the hub tables. Links give you the flexibility to absorb structural and business rule changes without re-engineering.
Satellites: descriptive data for Hubs. All the attributes (e.g. first name, surname, age, address) of a Hub are stored in satellite tables, whilst the hub will only have the business key(s). Satellites give you the adaptability to record history at any interval you want, plus unquestionable auditability and traceability to your source systems. A hub can have any number of satellites depending on the business requirements.
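The three table types can be sketched with Python’s built-in sqlite3. The table and column names (h_customer, s_customer, load_ts, etc) are illustrative assumptions; the surrogate key as an MD5 hash of the business key follows the hub rules described later in this article.

```python
import hashlib
import sqlite3

def hash_key(business_key: str) -> str:
    # Surrogate key derived as the MD5 of the business key.
    return hashlib.md5(business_key.encode()).hexdigest()

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Hub: the business key only, plus load metadata.
cur.execute("""CREATE TABLE h_customer (
    customer_hkey TEXT PRIMARY KEY, customer_bkey TEXT, load_ts TEXT, record_source TEXT)""")
cur.execute("""CREATE TABLE h_product (
    product_hkey TEXT PRIMARY KEY, product_bkey TEXT, load_ts TEXT, record_source TEXT)""")

# Link: a relationship between hubs, with no descriptive attributes.
cur.execute("""CREATE TABLE l_customer_product (
    link_hkey TEXT PRIMARY KEY,
    customer_hkey TEXT REFERENCES h_customer(customer_hkey),
    product_hkey TEXT REFERENCES h_product(product_hkey),
    load_ts TEXT, record_source TEXT)""")

# Satellite: descriptive attributes of the customer hub, historised by load timestamp.
cur.execute("""CREATE TABLE s_customer (
    customer_hkey TEXT REFERENCES h_customer(customer_hkey),
    load_ts TEXT, first_name TEXT, surname TEXT, record_source TEXT,
    PRIMARY KEY (customer_hkey, load_ts))""")

ck, pk = hash_key("CUST-001"), hash_key("SKU-42")
cur.execute("INSERT INTO h_customer VALUES (?, 'CUST-001', '2024-01-01', 'crm')", (ck,))
cur.execute("INSERT INTO h_product VALUES (?, 'SKU-42', '2024-01-01', 'shop')", (pk,))
cur.execute("INSERT INTO l_customer_product VALUES (?, ?, ?, '2024-01-01', 'shop')",
            (hash_key("CUST-001|SKU-42"), ck, pk))
cur.execute("INSERT INTO s_customer VALUES (?, '2024-01-01', 'Asha', 'Rao', 'crm')", (ck,))

# The descriptive attributes live in the satellite; the hub only carries the key.
row = cur.execute("""SELECT h.customer_bkey, s.first_name
                     FROM h_customer h
                     JOIN s_customer s ON h.customer_hkey = s.customer_hkey""").fetchone()
print(row)  # ('CUST-001', 'Asha')
```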
Q10. What do you understand by the term CDC (i.e. Change Data Capture)?
A10. Change data capture refers to the process or technology for identifying and capturing changes made to a database. CDC covers several methods of detecting only the data that has changed for loading, to avoid row collisions, maintain history and make ELT/ETL more efficient. CDC is essential in DWH & BI to identify changed data via timestamps, row versioning, active indicators, or a combination of these. There are various tools available to perform CDC.
Q11. What is the difference between Slowly Changing Dimension (i.e. SCD) and Change Data Capture (i.e. CDC)?
A11. CDC is about the incremental extraction of changed or new records from the source systems (e.g. databases), so that you don’t download the whole database each time, whilst a slowly changing dimension is about loading the target systems (e.g. databases) by applying the updates in such a way that the original data is preserved.
Some SCD examples relating to Big Data are covered at Apache Hive for Slowly Changing Dimension (i.e. SCD) interview Q&As.
Q12. What are the different SCD types?
A12. There are 6 types of SCD. Types 1 and 2 are the most commonly used.
SCD 0: No changes are allowed, and the dimensions never change.
SCD 1: Update the changed records directly; historical values are discarded and only the current state is kept. This approach is very common and the easiest to implement and maintain, but updates in general are much slower than inserts (you can think of an update as a delete plus an insert). Updates may also require locking to support the ACID compliance of the DBMS, whereas inserts do not.
SCD 2: Tracks changes as new versions, with new columns added such as current_flag, start_date & end_date. There can be only one record per business key with current_flag set to ‘Y’. SCD Type 2 is very powerful as you maintain the history of entire records and can easily perform change-over-time analysis. However, it also comes with more maintenance overhead, increased storage requirements and potential performance impacts if used on very large dimensions.
Even though you can find the latest SCD Type 2 version with GROUP BY and MAX(surrogate_key_id), this approach will be much slower than simply selecting the rows with current_flag=’Y’.
SCD 3: Tracks only the current & previous values of a change. A new column is added to capture the previous value.
SCD 4: Shows current value in a dimension table but tracks all change history in a separate table.
SCD 6: This is a hybrid type that utilises the techniques from SCD Types 1, 2 and 3 (i.e 1 + 2 + 3 = 6 ) to track changes.
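An SCD Type 2 update, as described above, can be sketched in plain Python. The dim_customer row layout and the apply_scd2 helper are hypothetical; the current_flag/start_date/end_date columns follow the description above.

```python
from datetime import date

# In-memory "dim_customer": each dict is one version of a customer record.
dim_customer = [
    {"customer_id": 1, "city": "Sydney", "current_flag": "Y",
     "start_date": date(2023, 1, 1), "end_date": None},
]

def apply_scd2(rows, customer_id, new_city, change_date):
    """Expire the current version and insert a new one (SCD Type 2)."""
    for row in rows:
        if row["customer_id"] == customer_id and row["current_flag"] == "Y":
            if row["city"] == new_city:
                return  # no change detected, nothing to do
            row["current_flag"] = "N"      # close off the old version...
            row["end_date"] = change_date
    rows.append({"customer_id": customer_id, "city": new_city,   # ...and add the new one
                 "current_flag": "Y", "start_date": change_date, "end_date": None})

apply_scd2(dim_customer, 1, "Melbourne", date(2024, 6, 1))

# History is preserved (two versions) and exactly one row is current.
current = [r for r in dim_customer if r["current_flag"] == "Y"]
print(len(dim_customer), current[0]["city"])  # 2 Melbourne
```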
Q13. How will you go about modelling with the data vault technique?
A13. The high level steps are:
Step 1: Start with a simplified 3NF OLTP data model.
Step 2: Start Identifying the unique “Business Keys” (e.g. account number, product SKU, social security number, etc), which are the Hubs. Prefix the Hub tables with “H_”
Step 3: Start establishing the relationships among the “Business Keys“, which are the Links. Prefix the Link tables with “L_”.
Step 4: Start identifying the attributes that describe the “Business Keys”, which are the Satellites.
Step 5: Add standalone components like Calendars and code/descriptions for decoding in Data Marts.
Step 6: Optimise for query performance, and add performance tables such as Bridge tables and Point-In-Time structures.
Hub rules:
1) A Hub must only have attributes such as a Surrogate Key (e.g. a unique value such as the MD5 of the business key), the Business Key (e.g. record source + business key), a Created Timestamp, and the Record Source.
2) Use a nomenclature such as “H_customer” or “Hub_customer” to identify Hub tables.
3) If the hub sizes are small, partitioning is not required.
Link rules:
1) Links are established only among Hubs, NOT satellites.
2) Links must not have NULL or default values for foreign keys.
3) Use a nomenclature such as “L_customer” or “Link_customer” to identify Link tables.
4) Evaluate the grain of the link as in per order or per line item, etc.
5) Links must only have Surrogate key, Created Timestamp, Surrogate Keys from HUB records (i.e. Foreign Keys), and Record Source.
6) Partitioning strategy for LINKS depends upon the size.
Satellite rules:
1) Satellites are associated with Hubs only, and should not contain business keys.
2) The primary key is the “Hub surrogate key + created timestamp”. The record source field is mandatory.
3) Use a nomenclature such as “S_customer” or “Sat_customer” to identify Satellite tables.
4) Any additional derived, enhanced, or augmented business context should reside in separate satellite tables. Business rules can be applied and modelled in special satellites. Any additional identifiers related to a business concept should be put into a separate key satellite.
5) Partition based on business attributes.
Q14. Can you explain “Full Data Load” Vs. “Incremental or Delta” data load?
A14. In a full data load, the whole data set from a source system or source table is loaded into a target system or table by “truncate & load”. This means the data in the target table is fully truncated & then reloaded with the entire source system data.
The “Full Data Load” approach might not be efficient for larger volumes of data, hence the “Incremental or Delta” data load approach, where only the records that have changed since the last load are applied to the target table. There are different approaches to incremental loads:
1) Load all the records from the source table/file into a staging table in the target system, then incrementally load only the newly inserted or updated records from the staging table to the target table based on “unmatched” records ETL logic, left outer joins, merge statements, etc. Based on the requirements, SCD 0 (i.e. reject changes), SCD 1 (i.e. overwrite changes) & SCD 2 (i.e. maintain change history) types can be applied.
2) Maintain a separate job control table in the target system to keep track of the data that was loaded in the last run, by dates and IDs, and use it to incrementally load data. Based on the requirements, SCD 0 (i.e. reject changes), SCD 1 (i.e. overwrite changes) & SCD 2 (i.e. maintain change history) types can be applied.
3) Use the CDC (i.e. Change Data Capture) functionality offered by some databases via redo or transaction logs. This is a riskier option as these vendor functionalities may change.
4) CDC to a real-time messaging engine like Kafka. Kafka is designed for event-driven processing and delivering streaming data to applications. CDC turns databases into a streaming data source where each new transaction is delivered to Kafka in real time, rather than grouped into batches that introduce latency for the Kafka consumers. CDC to Kafka minimises the impact on source systems when done non-intrusively by reading the database redo or transaction logs.
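Approach 2 above (a job control table tracking the last load) can be sketched in plain Python. The watermark field, row layout and incremental_load helper are illustrative assumptions; here the changed rows are applied SCD 1 style (overwrite in place).

```python
# Hypothetical source rows with a monotonically increasing "updated" marker.
source = [
    {"id": 1, "name": "a", "updated": 1},
    {"id": 2, "name": "b", "updated": 2},
    {"id": 3, "name": "c", "updated": 5},
]
target = {}                          # target table keyed by id
job_control = {"last_watermark": 0}  # job control "table": high-water mark of the last run

def incremental_load(source_rows, target_table, control):
    """Load only the rows changed since the last run, then advance the watermark."""
    wm = control["last_watermark"]
    changed = [r for r in source_rows if r["updated"] > wm]  # delta only, not a full scan
    for r in changed:
        target_table[r["id"]] = dict(r)                      # SCD 1 style: overwrite
    if changed:
        control["last_watermark"] = max(r["updated"] for r in changed)
    return len(changed)

first = incremental_load(source, target, job_control)   # initial run loads all 3 rows
source[1]["updated"] = 7                                # row id=2 changes upstream
second = incremental_load(source, target, job_control)  # next run picks up only that row
print(first, second, job_control["last_watermark"])  # 3 1 7
```

The second run moves far less data than a full truncate & load would, which is the entire point of the incremental approach.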