Its Data-warehousing time, I have already written about Databases and their architecture in the previous sections, which you can review once as a refresher before going ahead with this Data-warehousing concepts.
You may find many definitions of Datawarehouse on Google but here is how I define it; Data Warehouse is a group of systems where Data is being gathered from various source systems so that a transformation engine can run on this data and provide an output of data which can be used for various Business needs, such as "Reporting". Another fancy term used for Reporting in this context is BI (Business Insights)
Some recall on the normal business functioning, we all know Data is one of the most important aspect when it comes to dealing with any corporate Business. Every Business has to collect data, store it and use it. Data collection can come from forms which customers or employees fill up, from transactions which happen in an application, file uploads, etc.
This data is normally loaded into disparate applications (into their databases) and used by them, but what if there is a requirement where you need to generate a combined report from this data for e.g., for purposes such as Business reporting, combined statementing, marketing, cross selling/up selling, etc.
Let me setup a scene for you to understand this concept better; a simple "Cross Selling" use-case. In a Bank there may be:
· Current account / savings account customers (CASA)
· Customers of credit cards
· Customers with Loans
Now there maybe cases where a single customer may have CASA account but not have a credit card, or a customer has a CASA account and a credit card but not a loan account, so you would want to sell credit card to a customer who doesn’t have it or sell a loan to the customer who has a credit card, to achieve this you definitely need data from all three systems for the following reasons:
· First to cross verify which customer has multiple products and which one is missing.
· Second is once you identify the customer, you need the customer's Name and contact number so that a sales executive can call the customer and try to sell the product which he currently does not own.
To achieve both the above we need a single report which shows what services a customer already has and what is missing and his contact details (for marketing team to connect with them), but! The main problem is that the data of all these customers is present on 3 disparate systems and needs to be combined and a report is to be generated for the marketing team.
The Report expected would be having the following minimum fields(columns):
· Customer Name
· Customer Contact Number
· CASA account available
· Credit card account available
· Loan account available
This is just one simple use case which I have taken to explain Data Warehouse and ETL, in a Financial institution there are numerous such requirements, not only for Reporting but for applications to use data in a specific format.
So, before I continue with decoding my scenario, let me jump back to the definition of Data Warehouse, we all know that the Data warehouse is a place where we load all our data from Disparate systems, so relating this to our scenario we are going to fetch data from all 3 systems (CASA, Credit Cards & Loans), now this process of fetching is actually called as "Extraction" with respect to Data Warehousing. Please follow the diagram while you read the details below:
Extraction
Now whenever we extract the data from different systems, we have to stage it somewhere that is the reason the Datawarehouse always has a Staging area or a database where the extracted data is staged, don’t get confused by the word staging, here staging can be another Database in the Warehouse, a table or a simple extracted file
Transform
Once the data is brought from all systems into the staging database now the Query/procedure will run to transform that data i.e. it will verify duplicate users present in the 3 systems and collate and show as a single user in the transformed DB, also it will dedupe the contact number and then put the final contact number in the transformed DB, third it would have columns of CASA, Credit Cards & Loans with a “Yes” or “No” status (for the marketing team to know for which missing service the customer needs to be contacted), this new report or table would be now available in the Transformed DB
Load
Once the transformed data is ready it would Load that data into the Marketing DB so that now the marketing users can access that Report and use it for Cross Selling
The above 3 processes which I just explained are called ETL (Extract, Transform & Load), this is how an ETL solution works, ETL is basically the Engine of Data warehousing
I have written a lot of text here; I would urge you to please refer the diagram simultaneously when you are reading so that you can grasp the concept with ease.
Some more Points to remember:
Datawarehouse is not a readily available product which is available in the market for use, it’s a combination of different tools the major one being the ETL tool, it’s basically nothing but an ETL application and standard databases.
The Major purpose of a Data Warehouse is for Business Reporting or Business Analytics, do remember that Reporting is one of the major requirements in a financial institution it can be used for anything from regulatory to transactional to application troubleshooting to Marketing and much more.
There is another concept of Datamart which is like a Sub Datawarehouse which can be department, vertical or any specific group based sub-warehouse within a Primary Warehouse.