Data Warehouse/Data Lake - Data Modelling

What's the Difference Between a Data Warehouse, Data Lake, and Data Mart?

Data warehouses, data lakes, and data marts are different data storage solutions.

A data warehouse stores data in a structured format. It is a central repository of preprocessed data for analytics and business intelligence.

A data mart is a data warehouse that serves the needs of a specific business unit, like a company’s finance, marketing, or sales department.

A data lake is a central repository for raw and unstructured data. You can store data first and process it later.


What is a Data Warehouse?

A data warehouse is the core analytics system of an organization. The data warehouse will frequently work in conjunction with an operational data store (ODS) to ‘warehouse’ data captured by the various databases used by the business. 

Main Characteristics of a Data Warehouse

·   Stores large quantities of historical data, so old data is not erased when new data is added

·   Captures data from multiple, disparate databases

·   Works with ODS to house normalized, cleaned data

·   Organized by subject

·   OLAP (online analytical processing) application

·   The primary data source for data analytics

·   Reports and dashboards use data from data warehouses

What is a Data Mart?

A data mart is very similar to a data warehouse. Like a data warehouse, the data mart maintains and houses cleaned data ready for analysis. However, unlike a data warehouse, its scope is limited to a single business unit or subject area.

Main Characteristics of a Data Mart

·   Focuses on one subject matter or business unit

·   Acts as a mini-data warehouse, holding aggregated data

·   Data is limited in scope

·   Often uses a star schema or similar structure

·   Reports and dashboards use the data from the data mart
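
The star schema mentioned above can be sketched with a small in-memory database. The table names, columns, and figures below are illustrative assumptions, not taken from any particular system; SQLite stands in here for the mart's database engine.

```python
import sqlite3

# In-memory database standing in for a small sales data mart (toy schema).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Star schema: a central fact table surrounded by dimension tables.
cur.execute("CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("""CREATE TABLE fact_sales (
                   sale_id INTEGER PRIMARY KEY,
                   product_id INTEGER REFERENCES dim_product(product_id),
                   amount REAL)""")

cur.executemany("INSERT INTO dim_product VALUES (?, ?)",
                [(1, "Widget"), (2, "Gadget")])
cur.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)",
                [(1, 1, 9.99), (2, 1, 9.99), (3, 2, 24.50)])

# A typical mart query: aggregate the fact table, grouped by a dimension attribute.
cur.execute("""SELECT p.name, SUM(f.amount)
               FROM fact_sales AS f
               JOIN dim_product AS p USING (product_id)
               GROUP BY p.name ORDER BY p.name""")
totals = cur.fetchall()
print(totals)  # [('Gadget', 24.5), ('Widget', 19.98)]
```

This is the query shape reports and dashboards typically issue against a mart: join the fact to one or more dimensions, then aggregate.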

What is a Data Lake?

A data lake stores an organization’s raw and processed (unstructured and structured) data at both large and small scales. Unlike a data warehouse or database, a data lake captures anything the organization deems valuable for future use: images, videos, PDFs, anything. The data lake extracts data from multiple disparate data sources and can later process the data like a data warehouse.

Main Characteristics of a Data Lake

·   Collects all data from many disparate data sources over an extended period

·   Meets the needs of various users in the organization

·   Data is uploaded without an established methodology or predefined schema

·   Data can be processed and cleaned on demand, after it has been stored in the data lake


What types of tables are in Snowflake?

Snowflake offers three types of tables: Temporary, Transient, and Permanent. The default is Permanent.

Temporary tables: Only exist within the session in which they were created and persist only for the remainder of the session.

Transient tables: Persist until explicitly dropped and are available to all users with the appropriate privileges. They are specifically designed for transitory data that needs to be maintained beyond each session (in contrast to temporary tables).

Permanent Tables (DEFAULT): Similar to transient tables, with the key difference that they do have a Fail-safe period, which provides an additional level of data protection and recovery.
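
The session-scoping of temporary tables can be illustrated outside Snowflake: SQLite's TEMP tables are likewise visible only to the connection ("session") that created them. The sketch below uses that analogy; it is SQLite syntax, not Snowflake's CREATE TEMPORARY TABLE / CREATE TRANSIENT TABLE DDL, and the table names are invented.

```python
import sqlite3

# Two connections ("sessions") to the same shared in-memory database.
uri = "file:demo?mode=memory&cache=shared"
session_a = sqlite3.connect(uri, uri=True)
session_b = sqlite3.connect(uri, uri=True)

# A regular (permanent) table is visible to every session.
session_a.execute("CREATE TABLE perm_orders (id INTEGER)")

# A TEMP table exists only inside the creating session, mirroring how a
# Snowflake temporary table persists only for the remainder of its session.
session_a.execute("CREATE TEMP TABLE temp_staging (id INTEGER)")

def visible(conn, name):
    # sqlite_master lists shared objects; sqlite_temp_master lists this
    # connection's private temporary objects.
    q = ("SELECT count(*) FROM sqlite_master WHERE name = ? "
         "UNION ALL "
         "SELECT count(*) FROM sqlite_temp_master WHERE name = ?")
    return any(n for (n,) in conn.execute(q, (name, name)))

print(visible(session_a, "temp_staging"))  # True
print(visible(session_b, "temp_staging"))  # False: other sessions never see it
print(visible(session_b, "perm_orders"))   # True
```

Closing `session_a` would drop `temp_staging` automatically, just as ending a Snowflake session drops its temporary tables.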

Snowflake Virtual warehouse: A virtual warehouse is a cluster of compute resources. It provides resources, including memory, temporary storage, and CPU, to perform tasks such as DML operations and SQL execution.

Normalization of a Database in a Data Warehouse: Database normalization is a technique for organizing the data in a database. Normalization is a systematic approach to decomposing tables to eliminate data redundancy (repetition).

Normalization is used mainly for two purposes:

·       Eliminating redundant (duplicate) data.

·       Ensuring data dependencies make sense, i.e., data is logically stored.

 

Normalization rules are divided into the following normal forms:

Ø  First Normal Form (1NF) --(All attributes are single-valued.)

·       It should only have single (atomic) valued attributes/columns.

·       Values stored in a column should be of the same domain.

·       All the columns in a table should have unique names.

·       And the order in which data is stored does not matter.
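
To make the atomic-values rule concrete, here is a small sketch (invented data) of moving a multi-valued column into first normal form:

```python
# Violates 1NF: the "phones" column holds a multi-valued, non-atomic string.
unnormalized = [
    {"emp_id": 1, "phones": "111-1111, 222-2222"},
    {"emp_id": 2, "phones": "333-3333"},
]

# 1NF repair: one atomic phone value per row.
normalized = [
    {"emp_id": row["emp_id"], "phone": phone.strip()}
    for row in unnormalized
    for phone in row["phones"].split(",")
]
print(normalized)
# [{'emp_id': 1, 'phone': '111-1111'},
#  {'emp_id': 1, 'phone': '222-2222'},
#  {'emp_id': 2, 'phone': '333-3333'}]
```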

Ø  Second Normal Form (2NF) --(An attribute must be dependent upon the entity’s entire unique identifier.)

·       It should be in the First Normal form.

·       And, it should not have Partial Dependency.

Ø  Third Normal Form (3NF) --(No non-UID attribute can be dependent on another non-UID attribute.)

·       It is in the Second Normal form.

·       And, it doesn't have Transitive Dependency.

“Third normal form is the generally accepted goal for a database design that eliminates redundancy.”
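
As a minimal sketch of why 3NF eliminates redundancy (toy data, invented column names): in the flat table below, customer_city depends on customer_id rather than on the key order_id, a transitive dependency that 3NF removes.

```python
# A flat table with a transitive dependency:
# order_id -> customer_id -> customer_city.
orders_flat = [
    {"order_id": 1, "customer_id": "C1", "customer_city": "Pune",   "total": 100},
    {"order_id": 2, "customer_id": "C1", "customer_city": "Pune",   "total": 250},
    {"order_id": 3, "customer_id": "C2", "customer_city": "Mumbai", "total": 75},
]

# 3NF decomposition: the non-key attribute customer_city moves to a table
# keyed by the attribute it actually depends on (customer_id).
customers = {r["customer_id"]: r["customer_city"] for r in orders_flat}
orders = [{"order_id": r["order_id"], "customer_id": r["customer_id"],
           "total": r["total"]} for r in orders_flat]

# The city for C1 is now stored once instead of once per order.
print(customers)  # {'C1': 'Pune', 'C2': 'Mumbai'}

# Joining the two tables reconstructs the original flat view losslessly.
rejoined = [dict(r, customer_city=customers[r["customer_id"]]) for r in orders]
assert rejoined == orders_flat
```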

Ø  BCNF

·       Boyce-Codd Normal Form is a higher version of the Third Normal Form. This form deals with certain types of anomalies that are not handled by 3NF.

Ø  Fourth Normal Form

·       It is in the Boyce-Codd Normal Form.

·       And, it doesn't have Multi-Valued Dependency.

Denormalization strategies for data retrieval from data warehouses

Vertical and Horizontal Denormalization: vertical denormalization merges columns from related tables back into one wide table, while horizontal denormalization merges rows that were split across tables back into one table.

Identifying and Non-Identifying Relationships: An identifying relationship means that the child table cannot be uniquely identified without the parent.

Account (AccountID, AccountNum, AccountTypeID)

PersonAccount (AccountID, PersonID, Balance)

Person   (PersonID, Name)

The Person-to-PersonAccount relationship is identifying because a PersonAccount (child) row cannot be uniquely identified without its parent keys: PersonID and AccountID together form the child’s primary key.

A non-identifying relationship is one where the child can be identified independently of the parent.

Account( AccountID, AccountNum, AccountTypeID )

AccountType( AccountTypeID, Code, Name, Description )

The relationship between AccountType and Account is non-identifying because each Account can be uniquely identified by its own AccountID; AccountTypeID is only a regular foreign key, not part of the Account’s identifier.
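
The two relationship kinds come down to primary-key choices, which can be seen by fleshing out the tables above with toy columns. SQLite is used here as a generic SQL engine; the sample values are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

# Non-identifying: AccountTypeID is just a foreign key; an Account row is
# fully identified by its own AccountID.
conn.execute("CREATE TABLE AccountType (AccountTypeID INTEGER PRIMARY KEY, Code TEXT)")
conn.execute("""CREATE TABLE Account (
                    AccountID INTEGER PRIMARY KEY,
                    AccountNum TEXT,
                    AccountTypeID INTEGER REFERENCES AccountType(AccountTypeID))""")

# Identifying: the child's primary key *contains* the parent keys, so a
# PersonAccount row cannot exist or be identified without its parents.
conn.execute("CREATE TABLE Person (PersonID INTEGER PRIMARY KEY, Name TEXT)")
conn.execute("""CREATE TABLE PersonAccount (
                    AccountID INTEGER REFERENCES Account(AccountID),
                    PersonID INTEGER REFERENCES Person(PersonID),
                    Balance REAL,
                    PRIMARY KEY (AccountID, PersonID))""")

conn.execute("INSERT INTO AccountType VALUES (10, 'SAV')")
conn.execute("INSERT INTO Account VALUES (100, 'A-100', 10)")
conn.execute("INSERT INTO Person VALUES (1, 'Asha')")
conn.execute("INSERT INTO PersonAccount VALUES (100, 1, 500.0)")

# A child row pointing at a non-existent parent is rejected outright.
rejected = False
try:
    conn.execute("INSERT INTO PersonAccount VALUES (999, 1, 0.0)")
except sqlite3.IntegrityError:
    rejected = True
print(rejected)  # True
```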


Dimension Table

Slowly Changing Dimensions: A Slowly Changing Dimension (SCD) is a dimension that stores and manages both current and historical data over time in a data warehouse. Handling SCDs is considered one of the most critical ETL tasks in tracking the history of dimension records.

There are many approaches to dealing with SCDs. The most popular are:

Types of Dimensions

Ø  Type 1 (Overwriting) --> The new data overwrites the existing data. The existing data is lost, as it is not stored anywhere else. This is the default type of dimension you create.

Ø  Type 2 (Creating another dimension record) --> This is the most commonly used type of SCD. For this type, add a new record encompassing the change and mark the old record as inactive.

Ø  Type 3 (Creating a current value field) --> This type stores two versions of values for certain selected attributes. Each record stores the previous value and the current value of the selected attribute. When the value changes, the current value is stored as the old value and the new value becomes the current value.
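
The three types can be sketched in a few lines. The snippet below implements Type 2 (the row layout and field names are illustrative assumptions); Type 1 would simply overwrite the attribute in place, and Type 3 would keep a prev_city column next to city.

```python
from datetime import date

# Customer dimension with SCD Type 2 history tracking (toy data).
dim_customer = [
    {"key": 1, "customer_id": "C1", "city": "Pune",
     "valid_from": date(2020, 1, 1), "valid_to": None, "current": True},
]

def scd2_update(dim, customer_id, new_city, change_date):
    """Type 2: expire the current row and insert a new row for the change."""
    for row in dim:
        if row["customer_id"] == customer_id and row["current"]:
            row["valid_to"] = change_date
            row["current"] = False
    dim.append({"key": max(r["key"] for r in dim) + 1,
                "customer_id": customer_id, "city": new_city,
                "valid_from": change_date, "valid_to": None, "current": True})

scd2_update(dim_customer, "C1", "Mumbai", date(2023, 6, 1))

current = [r for r in dim_customer if r["current"]]
history = [r for r in dim_customer if not r["current"]]
print(current[0]["city"], history[0]["city"])  # Mumbai Pune
```

Facts dated before 2023-06-01 can now join to the historical "Pune" row, while new facts join to the current "Mumbai" row.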

Conformed Dimensions --> A dimension that is used in multiple locations, such as across several fact tables or data marts. It helps in creating consistency.

Degenerate Dimensions --> When the dimension attribute is stored as part of the fact table and not in a separate dimension table.

Junk Dimensions --> A junk dimension is a single table with a combination of different and unrelated attributes, used to avoid having a large number of foreign keys in the fact table.

Role-playing Dimensions --> A dimension table that has multiple valid relationships with a fact table (for example, a date dimension used as both order date and ship date).

  

Types of Fact Tables

Transaction Fact Tables

Snapshot Fact Tables

Ø  Additive Facts

o   Measurements in a fact table that can be summed up across all dimensions

Ø  Non-Additive Facts

o   Facts that cannot be summed up across any dimension key; percentage and ratio columns are non-additive facts

Ø  Semi-Additive Facts

o   Measurements in a fact table that can be summed up across only a few dimension keys

Ø  Factless fact

o   In the real world, it is possible to have a fact table that contains no measures or facts. These tables are called “Factless Fact tables”.
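
The additivity distinctions can be checked with a toy fact table (invented numbers): the amount measure sums cleanly across dimensions, while the ratio column must be recomputed from additive components rather than summed.

```python
# Toy fact rows: "amount" is additive; "margin_pct" is a ratio and is not.
facts = [
    {"store": "A", "month": "Jan", "amount": 100.0, "margin_pct": 20.0},
    {"store": "A", "month": "Feb", "amount": 300.0, "margin_pct": 40.0},
]

# Additive fact: summing across the month dimension gives a meaningful total.
total_amount = sum(f["amount"] for f in facts)
print(total_amount)  # 400.0

# Non-additive fact: summing percentages gives nonsense (a "60%" margin).
naive = sum(f["margin_pct"] for f in facts)

# The correct aggregate recomputes the ratio from additive components:
# total profit divided by total amount, i.e. a weighted average.
profit = sum(f["amount"] * f["margin_pct"] / 100 for f in facts)
correct = 100 * profit / total_amount
print(naive, correct)  # 60.0 35.0
```

A semi-additive fact such as an account balance behaves like `amount` across the account dimension but like `margin_pct` across time: balances from different days must not be summed.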


What is Forward Engineering in Astera DW Builder?

Forward Engineering is a useful feature that allows you to transform a logical data model into a physical data model by generating the database schema.  The generated schema is then used to create a new database or alter an existing one. In simple words, it allows you to “replicate” the database schema or changes made to it onto your desired destination.

The Forward Engineering option enables you to synchronize the physical changes you have made to the data model (such as adding or removing entities, indexes, attributes, and relationships) with the database. However, this option doesn’t reflect any logical changes, like entity type, SCD field type, etc., as Forward Engineering is only concerned with the database-specific implementation of the data model.
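
Conceptually, forward engineering walks a logical model and emits the physical DDL. The sketch below is a generic illustration with an invented dict-based model format; it is not Astera DW Builder's actual API or model representation.

```python
import sqlite3

# Invented logical-model format: table name -> {column name: physical type}.
logical_model = {
    "dim_customer": {"customer_key": "INTEGER PRIMARY KEY", "name": "TEXT"},
    "fact_sales":   {"sale_key": "INTEGER PRIMARY KEY", "amount": "REAL"},
}

def forward_engineer(model):
    """Generate CREATE TABLE statements (the physical schema) from the model."""
    ddl = []
    for table, columns in model.items():
        cols = ", ".join(f"{name} {ctype}" for name, ctype in columns.items())
        ddl.append(f"CREATE TABLE {table} ({cols})")
    return ddl

statements = forward_engineer(logical_model)
for stmt in statements:
    print(stmt)

# Applying the generated schema creates a brand-new physical database.
conn = sqlite3.connect(":memory:")
for stmt in statements:
    conn.execute(stmt)
```

Altering an existing database would instead diff the model against the live schema and emit ALTER statements, but the generate-then-apply flow is the same.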

How can AWS help with your data storage needs?

AWS provides a broad selection of analytics services that fit a wide range of data analytics needs, enabling industries and organizations of all sizes to reinvent their business with data.

For reference: https://aws.amazon.com/compare/the-difference-between-a-data-warehouse-data-lake-and-data-mart/

https://aws.amazon.com/big-data/datalakes-and-analytics/what-is-a-data-lake/


 

Continuous Data Protection

Continuous Data Protection (CDP) encompasses a comprehensive set of features that help protect data stored in Snowflake against human error, malicious acts, and software failure. At every stage within the data lifecycle, Snowflake enables your data to be accessible and recoverable in the event of accidental or intentional modification, removal, or corruption. 


Feature --> Additional Reading

·   Network policies for granting or restricting user access to the site based on their IP address --> Network Policies

·   Verification/authentication required for any users accessing your account --> Multi-Factor Authentication (MFA), enabled per user

·   Security roles for controlling user access to all objects in the system --> Overview of Access Control

·   All ingested data stored in Snowflake tables is encrypted using AES-256 strong encryption --> Understanding End-to-End Encryption in Snowflake

·   Maintenance of historical data (i.e., data that has been changed or deleted) through Snowflake Time Travel --> Snowflake Time Travel & Fail-safe