Modelling process
1. identity the business objective.
e.g. for sales, marketing.
Can use Business Matrix table.
2. identity granularity
the lowest level of information, the grain for the business.
3. identity dimensions and attributes.
descriptive information for the business. The attributes applicable to the fact table at the required granularity.
4. identity facts
the measurable data, numbers and transactions
5. build the schema
star schema
conformed dimension
A conformed dimension is a dimension that has the same meaning to multiple facts.
It is referened to by multiple fact tables.
e.g. Fact Sales and Fact Inventory both reference to the Dim Product.
Fact Sales -> Dim Product <- fact Inventory
Dim Date is another common conformed dimension because the date attributes month, quarter, year, etc. mean the same thing for different facts (data marts).
So conformed dimensions bring together different facts. For different data marts, they are logically separately and the conformed dimensions can be with different table names.
role-playing dimension
sometimes a dimension is referenced multiple times by a fact table, with each reference representing a different meaning / role. e.g. sales fact table refers to the same Date Dim for ship date, order date, etc. the ship date, order date are different roles played by the date dimension. Another example is a dim employee may play as approver and developer for a project delivery fact.
Junk dimension
Some dimension are with low cardinality, e.g. sales channel (online, offline), and campaigns (mid year, xmax, no campaign).
Instead of creating a dimension for each of them, we can combine (cross join) them into one Junk Dimension as below.
online, mid year
online, xmax
online, no campaign
offline, mid year
offline, xmax
offline, no campaign
In this way it reduces the number of dimensions and joins required for queries. This only works if the cardinality is low.
Degenerate dimension
A degenerate dimension is a dimension key in the Fact table without its own dimension table.
It's usually some transactional number, e.g. transaction id, ticket id that doesn't have descriptive attributes, but the concept is still important in analysing data. Instead of creating its own dimension table with only the transaction number, we can put the transaction number into the fact table, so we can query fact table by transaction.
This helps to reduce the unnecessary dimension, but may bring in data quality issues, e.g. different source systems have same transaction id.
swappable dimension
A fact table can join to different versions of the same dimension. E.g. the base dimension is Dim Customer, but there can be business customer and direct customer, which may have different attributes. If the analysis is about business customers, the fact sales table joins to business customer dim. If analysing direct customer, then joins to the direct customer dim.
You can create a base Dim Customer, and create database views for different types of customers. Or create physical dim tables for different customer types.
shrunken dimension
A shrunken dimension is for changing the granularity to a higher summary.
e.g. a Date Dimension can be shrunken to a Month Dimension. or city dimension to country dimension.
the shrunken dimension is referenced by aggregated Fact
Outrigger Dimension
parent dimension table, it is used in snowflake style dimensions.
Dimension-to-Dimension table joins
Dimensions can reference to other dimensions. That can be modeled as outrigger dimensions (snowflake), but it can result in explosive growth of the base dimension due to type 2 changes in the outrigger dim forces corresponding type2 changes in the base dim.
Alternatively, the fact table keeps foreign keys to both dim tables. In this way, each dim keeps its only type2 changes, but the relationship betwen the two dims can be discovered only by traversing in the fact table. This can be useful especially when the fact table is periodic snapshot.
Some best practices for dimensional modelling
Denormalize dimensional tables, instead of using 3nf or snowflake type of dimension. The flatten structure helps with performance.
use conformed dimensions to combine different fact tables.
capture changes in dimensional tables. Type 1 simply overwrites. Type2 keeps changes by using e.g. start /end time.
Apply Primary key and foreign key constraints. easier to visualize the relationships, and enforce data integraty.
use Identity column (generated surrogate key).
enforced Check constraints for data quality or any checks.
loading dimensions and facts
usually every dim and fact comes with a surrogate key (unique id) which is usually an incremental big integer.
ETL is not able to load the facts before the dims are loaded, because it is waiting for the dim keys to be generated first before beining able to load the facts (with foreign keys pointing to dims).
Another way is using the hash of the Dim business key (the actual customer id for example) as the surrogate key, in that way, both fact and dim can be loaded in parallel. the downside is hash could have collision, it is rare though.
3 types of Fact table
1. Transaction fact
Just transactions, e.g. sales
2. Periodic snapshot fact
Aggregating the raw fact into period granularity, e.g. summary of sales every quarter.
Usually its a wide table. If no data for a period, still insert a row of nulls.
3. Accumulating snapshot fact
Multiple dates associated with a row. E.g. build start date, complete date, ship date, etc.
The row is filled / updated with accumulating information as it goes. E.g. complete date and total man hours are unknown when build starts, but those information will be updated when available.
In this way, the data warehouse can capture information to do with the timeline.
Slowly changing dimension
Dimension values change over time.
Type 0: retain original value
Type 1: overwrite value
Type 2: Add new row with start & end date
Type 3: prior/original value, and new value as different columns
Factless Fact Table
For tracking event happened between business entities, so as to record the relationship.
But no numeric measurement for the relationship.
e.g.
Consolidated Fact Table
Merge multiple fact tables at the same granularity into one fact table.