Data Warehouse / Data Lake - Data Modelling
What's the Difference Between a Data Warehouse, Data Lake, and Data Mart?
Data warehouses, data lakes, and data marts are different cloud storage solutions.
A data warehouse stores data in a structured format. It is a central repository of preprocessed data for analytics and business intelligence.
A data mart is a data warehouse that serves the needs of a specific business unit, like a company’s finance, marketing, or sales department.
A data lake is a central repository for raw data and unstructured data. You can store data first and process it later on.
What is a Data Warehouse?
A data warehouse is the core analytics system of an organization. The data warehouse will frequently work in conjunction with an operational data store (ODS) to ‘warehouse’ data captured by the various databases used by the business.
Main Characteristics of a Data Warehouse
· Stores large quantities of historical data so old data is not erased when new data is updated
· Captures data from multiple, disparate databases
· Works with ODS to house normalized, cleaned data
· Organized by subject
· OLAP (online analytical processing) application
· The primary data source for data analytics
· Reports and dashboards use data from data warehouses
What is a Data Mart?
A data mart is very similar to a data warehouse. Like a data warehouse, the data mart maintains and houses cleaned data ready for analysis. However, unlike a data warehouse, its scope of visibility is limited.
Main Characteristics of a Data Mart
· Focuses on one subject matter or business unit
· Acts as a mini-data warehouse, holding aggregated data
· Data is limited in scope
· Often uses a star schema or similar structure
· Reports and dashboards use the data from the data mart
What is a Data Lake?
A data lake stores an organization’s raw and processed (unstructured and structured) data at both large and small scales. Unlike a data warehouse or database, a data lake captures anything the organization deems valuable for future use. This can be images, videos, PDFs, anything! The data lake will extract data from multiple disparate data sources and process the data like a data warehouse does.
Main Characteristics of a Data Lake
· Collects all data from many disparate data sources over an extended period
· Meets the needs of various users in the organization
· Data is uploaded without an established methodology
· Data can be processed and cleaned after it has been stored in the data lake
What type of tables in Snowflake
Snowflake offers three types of tables: Temporary, Transient, and Permanent. The default is Permanent.
Temporary tables: Only exist within the session in which they were created and persist only for the remainder of the session.
Transient tables: Persist until explicitly dropped and are available to all users with the appropriate privileges. They are specifically designed for transitory data that needs to be maintained beyond each session (in contrast to temporary tables).
Permanent tables (DEFAULT): Similar to transient tables, with the key difference that they have a Fail-safe period, which provides an additional level of data protection and recovery.
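The distinction between the three table types shows up directly in the DDL keyword used at creation time. As a minimal sketch (the table and column names below are invented for illustration, not taken from the text):

```python
# Sketch: building Snowflake DDL strings for the three table types.
# TEMPORARY and TRANSIENT are the documented Snowflake keywords;
# omitting both yields a permanent table (the default).

def table_ddl(name: str, columns: str, kind: str = "permanent") -> str:
    """Return a CREATE TABLE statement for the given Snowflake table kind."""
    prefixes = {
        "permanent": "",            # default: Time Travel + Fail-safe
        "transient": "TRANSIENT ",  # persists, but no Fail-safe period
        "temporary": "TEMPORARY ",  # visible only within the session
    }
    return f"CREATE {prefixes[kind]}TABLE {name} ({columns});"

print(table_ddl("stg_orders", "id INT, amount NUMBER(10,2)", "transient"))
# CREATE TRANSIENT TABLE stg_orders (id INT, amount NUMBER(10,2));
```

Transient tables are a common choice for staging data, since skipping Fail-safe reduces storage costs for data that can be reloaded.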
Snowflake Virtual warehouse: a virtual warehouse is a cluster of compute resources. It provides the resources (memory, temporary storage, and CPU) needed to perform tasks such as DML operations and SQL execution.
Normalization of Database in Data warehouse: Database normalization is a technique of organizing the data in the database. Normalization is a systematic approach of decomposing tables to eliminate data redundancy (repetition).
Normalization is used for mainly two purposes,
· Eliminating redundant (useless) data.
· Ensuring data dependencies make sense, i.e., data is logically stored.
Normalization rules are divided into the following normal forms:
Ø First Normal Form (1NF) --(All attributes are single valued.)
· It should only have single (atomic) valued attributes/columns.
· Values stored in a column should be of the same domain.
· All the columns in a table should have unique names.
· And the order in which data is stored does not matter.
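A minimal sketch of bringing a table into 1NF by splitting a multi-valued column into atomic rows (the sample employee data is invented for illustration):

```python
# Not in 1NF: the "phones" column holds multiple values per row.
rows = [
    {"emp_id": 1, "name": "Ana", "phones": "555-1, 555-2"},
    {"emp_id": 2, "name": "Raj", "phones": "555-3"},
]

# 1NF: one atomic phone value per row.
rows_1nf = [
    {"emp_id": r["emp_id"], "name": r["name"], "phone": p.strip()}
    for r in rows
    for p in r["phones"].split(",")
]

for r in rows_1nf:
    print(r)
```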
Ø Second Normal Form (2NF) --(An attribute must be dependent upon the entity’s entire unique identifier.)
· It should be in the First Normal form.
· And, it should not have Partial Dependency.
Ø Third Normal Form (3NF) --(No non-UID attribute can be dependent on another non-UID attribute.)
· It is in the Second Normal form.
· And, it doesn't have Transitive Dependency.
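A transitive dependency can be removed by decomposing the table, as in this small sketch (the employee/department data is invented for illustration):

```python
# Transitive dependency: emp_id -> dept_id -> dept_name, so dept_name
# depends on the key only indirectly. This violates 3NF.
employees = [
    {"emp_id": 1, "name": "Ana", "dept_id": 10, "dept_name": "Finance"},
    {"emp_id": 2, "name": "Raj", "dept_id": 10, "dept_name": "Finance"},
    {"emp_id": 3, "name": "Li",  "dept_id": 20, "dept_name": "Sales"},
]

# 3NF decomposition: move dept_name into its own table keyed by dept_id.
departments = {r["dept_id"]: r["dept_name"] for r in employees}
employees_3nf = [
    {k: r[k] for k in ("emp_id", "name", "dept_id")} for r in employees
]

print(departments)       # {10: 'Finance', 20: 'Sales'}
print(employees_3nf[0])  # {'emp_id': 1, 'name': 'Ana', 'dept_id': 10}
```

Note how the redundant repetition of "Finance" disappears: it is now stored once, eliminating the update anomaly.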
“Third normal form is the generally accepted goal for a database design that eliminates redundancy.”
Ø BCNF
· Boyce-Codd Normal Form is a higher version of the Third Normal Form. This form deals with certain types of anomalies that are not handled by 3NF.
Ø Fourth Normal Form
· It is in the Boyce-Codd Normal Form.
· And, it doesn't have Multi-Valued Dependency.
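A multi-valued dependency arises when two independent multi-valued facts about the same key are stored in one table, forcing every combination to be recorded. A small sketch of the 4NF decomposition (the course data is invented for illustration):

```python
# A course's teachers and textbooks vary independently, so one table
# must store every (teacher, book) combination -- a multi-valued
# dependency that violates 4NF.
course_teacher_book = [
    ("DB101", "Smith", "Intro to SQL"),
    ("DB101", "Smith", "Data Modelling"),
    ("DB101", "Jones", "Intro to SQL"),
    ("DB101", "Jones", "Data Modelling"),
]

# 4NF decomposition: two independent relations with no lost information.
course_teacher = sorted({(c, t) for c, t, _ in course_teacher_book})
course_book = sorted({(c, b) for c, _, b in course_teacher_book})

print(course_teacher)  # [('DB101', 'Jones'), ('DB101', 'Smith')]
print(course_book)     # [('DB101', 'Data Modelling'), ('DB101', 'Intro to SQL')]
```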
Denormalization strategies for data retrieval from data warehouses
Vertical and Horizontal Denormalization
Identifying and Non-Identifying Relationships: An identifying relationship means that the child table cannot be uniquely identified without the parent.
Account (AccountID, AccountNum, AccountTypeID)
PersonAccount (AccountID, PersonID, Balance)
Person (PersonID, Name)
The Person-to-PersonAccount and Account-to-PersonAccount relationships are identifying, because a PersonAccount row cannot be uniquely identified without both of its parent keys (AccountID and PersonID).
A non-identifying relationship is one where the child can be identified independently of the parent.
Account( AccountID, AccountNum, AccountTypeID )
AccountType( AccountTypeID, Code, Name, Description )
The relationship between AccountType and Account is non-identifying, because each Account row can be identified on its own (by AccountID) without reference to its parent AccountType row.
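The two kinds of relationship can be sketched in DDL: in the identifying case the parent keys form part of the child's primary key, while in the non-identifying case the child has its own surrogate key. A minimal SQLite sketch following the table layouts above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Person      (PersonID INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE AccountType (AccountTypeID INTEGER PRIMARY KEY,
                          Code TEXT, Name TEXT, Description TEXT);

-- Non-identifying: Account has its own key; AccountTypeID is just a
-- plain foreign key, not part of the primary key.
CREATE TABLE Account (AccountID INTEGER PRIMARY KEY,
                      AccountNum TEXT,
                      AccountTypeID INTEGER REFERENCES AccountType(AccountTypeID));

-- Identifying: PersonAccount's primary key is composed of the parent
-- keys, so a row cannot be identified without its parents.
CREATE TABLE PersonAccount (
    AccountID INTEGER REFERENCES Account(AccountID),
    PersonID  INTEGER REFERENCES Person(PersonID),
    Balance   REAL,
    PRIMARY KEY (AccountID, PersonID)
);
""")
print("schema created")
```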
Dimension Table
Slowly Changing Dimensions: A Slowly Changing Dimension (SCD) is a dimension that stores and manages both current and historical data over time in a data warehouse. Implementing SCDs is considered one of the most critical ETL tasks, as it tracks the history of dimension records.
There are many approaches to handling SCDs. The most popular are:
Type 0 - The passive method
Type 1 - Overwriting the old value
Type 2 - Creating a new additional record
Type 3 - Adding a new column
Type 4 - Using historical table
Type 6 - Combine approaches of types 1,2,3 (1+2+3=6)
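As a sketch of the most common approach, Type 2, a change closes the current dimension row and inserts a new one. The row layout below (effective/end dates plus a current flag) is a common convention, not taken from the text, and the customer data is invented:

```python
from datetime import date

dim_customer = [
    {"sk": 1, "cust_id": "C1", "city": "Oslo",
     "eff_date": date(2020, 1, 1), "end_date": None, "is_current": True},
]

def scd2_update(dim, cust_id, new_city, change_date):
    """Type 2 SCD: expire the current row and append a new version."""
    next_sk = max(r["sk"] for r in dim) + 1
    for r in dim:
        if r["cust_id"] == cust_id and r["is_current"]:
            r["end_date"], r["is_current"] = change_date, False  # expire old row
    dim.append({"sk": next_sk, "cust_id": cust_id, "city": new_city,
                "eff_date": change_date, "end_date": None, "is_current": True})

scd2_update(dim_customer, "C1", "Bergen", date(2024, 6, 1))
print(len(dim_customer))  # 2 rows: full history preserved
```

Facts loaded before the change keep pointing at the old surrogate key, which is exactly how history is preserved.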
Types of Dimensions
Ø Type 1 Overwriting -->In this SCD (Slowly Changing Dimension), the new data overwrites the existing data. Thus the existing data is lost, as it is not stored anywhere else. This is the default type of dimension you create.
Ø Type 2 Creating another dimension record-->This is the most commonly used type of SCD (slowly changing dimension). For this type of slowly changing dimension, add a new record encompassing the change and mark the old record as inactive
Ø Type 3 Creating a current value field--> SCD stores two versions of values for certain selected level attributes. Each record stores the previous value and the current value of the selected attribute. When the value of any of the selected attributes changes, the current value is stored as the old value and the new value becomes the current value
Conformed Dimensions-->A conformed dimension is used in multiple locations. It helps create consistency.
Degenerate Dimensions-->When a dimension attribute is stored as part of the fact table and not in a separate dimension table.
Junk Dimensions-->A junk dimension is a single table with a combination of different and unrelated attributes, used to avoid having a large number of foreign keys in the fact table.
Role-playing Dimensions-->A dimension table that has multiple valid relationships with a fact table.
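A junk dimension is often pre-built as the cross product of its low-cardinality flags, so the fact table carries one surrogate key instead of several foreign keys. A sketch with invented flag names:

```python
from itertools import product

# Three small, unrelated attributes that would otherwise each need
# their own foreign key in the fact table.
payment_types = ["cash", "card"]
is_promo = [True, False]
order_channels = ["web", "store"]

# Junk dimension: one row per combination, keyed by a surrogate key.
junk_dim = [
    {"junk_sk": sk, "payment_type": p, "is_promo": f, "channel": c}
    for sk, (p, f, c) in enumerate(
        product(payment_types, is_promo, order_channels), start=1
    )
]
print(len(junk_dim))  # 2 * 2 * 2 = 8 rows
```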
Types of Fact Tables
Transaction Fact Tables
Snapshot Fact tables
Ø Additive Facts
o Measurements in a fact table that can be summed up across all dimensions
Ø Non-Additive Facts
o Facts that cannot be summed up across any dimension key; percentage and ratio columns are non-additive facts
Ø Semi Additive Facts
o Measurements in a fact table that can be summed up across only a few dimension keys
Ø Factless fact
o In the real world, it is possible to have a fact table that contains no measures or facts. These tables are called “Factless Fact tables”.
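The classic semi-additive fact is an account balance in a snapshot fact table: summing across accounts on one date is meaningful, but summing the same balances across dates is not. A sketch with invented snapshot rows:

```python
snapshots = [
    {"date": "2024-01-31", "account": "A", "balance": 100},
    {"date": "2024-01-31", "account": "B", "balance": 50},
    {"date": "2024-02-29", "account": "A", "balance": 120},
    {"date": "2024-02-29", "account": "B", "balance": 80},
]

# Valid: total balance held on a single date (sum across the account dimension).
jan_total = sum(r["balance"] for r in snapshots if r["date"] == "2024-01-31")
print(jan_total)  # 150

# Invalid as a "total": summing the same balances across dates double-counts.
naive_total = sum(r["balance"] for r in snapshots)
print(naive_total)  # 350 -- not a meaningful balance
```

For fully additive facts (e.g. sales amount in a transaction fact table), both sums would be meaningful; for non-additive facts (ratios), neither would be.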
What is Forward Engineering in Astera DW Builder?
Forward Engineering is a useful feature that allows you to transform a logical data model into a physical data model by generating the database schema. The generated schema is then used to create a new database or alter an existing one. In simple words, it allows you to “replicate” the database schema or changes made to it onto your desired destination.
The Forward Engineering option enables you to synchronize the physical changes you have made to the data model, such as adding or removing entities, indexes, attributes, and relationships with the database. However, this option doesn’t reflect any logical changes, like entity type, SCD field type, etc., as Forward Engineering is only concerned with the database-specific implementation of the data model.
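Conceptually, forward engineering walks the logical model and emits the corresponding physical DDL. The model format and generator below are a hypothetical sketch, not Astera DW Builder's actual representation or output:

```python
# Hypothetical logical model: entity name -> {column: data type}.
model = {
    "DimCustomer": {"CustomerKey": "INT", "Name": "VARCHAR(100)"},
    "FactSales":   {"SalesKey": "INT", "CustomerKey": "INT", "Amount": "DECIMAL(10,2)"},
}

def forward_engineer(entities: dict) -> str:
    """Generate CREATE TABLE statements from the logical model."""
    stmts = []
    for table, cols in entities.items():
        col_defs = ", ".join(f"{name} {dtype}" for name, dtype in cols.items())
        stmts.append(f"CREATE TABLE {table} ({col_defs});")
    return "\n".join(stmts)

print(forward_engineer(model))
```

A real tool would additionally diff the model against the existing database and emit ALTER statements instead of recreating tables.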
How can AWS help with your data storage needs?
AWS provides the broadest selection of analytics services that fit all your data analytics needs. We enable industries and organizations of all sizes to reinvent their business with data. Here are examples of how you can use AWS:
Use Amazon Redshift for your data warehousing and data mart requirements. Get integrated insights by running real-time and predictive analytics on complex, scaled data across your operational databases, data lake, data warehouse, and thousands of third-party datasets. You can automatically create, train, and deploy machine learning models with ease.
Use AWS Lake Formation to build, manage, and secure a data lake within days. Quickly import data from all your data sources, then describe and manage them in a centralized data catalog.
Use Amazon S3 to build a custom data lake for big data analytics, artificial intelligence, machine learning, and high-performance computing applications.
For reference: https://aws.amazon.com/compare/the-difference-between-a-data-warehouse-data-lake-and-data-mart/
https://aws.amazon.com/big-data/datalakes-and-analytics/what-is-a-data-lake/
Continuous Data Protection
Continuous Data Protection (CDP) encompasses a comprehensive set of features that help protect data stored in Snowflake against human error, malicious acts, and software failure. At every stage within the data lifecycle, Snowflake enables your data to be accessible and recoverable in the event of accidental or intentional modification, removal, or corruption.
Feature | Additional Reading
· Network policies for granting or restricting user access to the site based on their IP address | Network Policies
· Verification/authentication required for any user accessing your account | Multi-Factor Authentication (MFA), enabled per user
· Security roles for controlling user access to all objects in the system | Overview of Access Control
· All ingested data stored in Snowflake tables is encrypted using AES-256 strong encryption | Understanding End-to-End Encryption in Snowflake
· Maintenance of historical data (i.e., data that has been changed or deleted) through Snowflake Time Travel | Snowflake Time Travel & Fail-safe
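Time Travel is exposed through the AT and BEFORE clauses in Snowflake SQL (with OFFSET, TIMESTAMP, or STATEMENT parameters). A minimal sketch that builds such a query string; the table name and offset are illustrative:

```python
def time_travel_query(table: str, seconds_ago: int) -> str:
    """Query a table as it existed some number of seconds in the past.

    OFFSET is the documented Snowflake Time Travel parameter and is
    given in seconds as a negative number.
    """
    return f"SELECT * FROM {table} AT(OFFSET => -{seconds_ago});"

print(time_travel_query("orders", 3600))
# SELECT * FROM orders AT(OFFSET => -3600);
```

The permitted travel window depends on the table type and retention settings; once it lapses, permanent tables fall back to the Fail-safe period, which is recoverable only by Snowflake support.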