Change Data Capture (CDC) is a software design pattern that tracks changed data so that the required action can be taken on it. CDC is an approach to data integration based on identifying, capturing, and delivering the changes made in source enterprise databases to the systems that need them.
Change Data Capture is mostly seen in data warehouse environments, because capturing and preserving the state of data over time is one of the core functions of a data warehouse. CDC can also be used with other kinds of data repositories or database systems.
CDC records data-change activity such as inserts, updates, and deletes applied to tables in a database like SQL Server or Oracle, and makes the details of those changes available in an easily consumed relational format. For each modified row, it captures the column information and the metadata needed to apply the changes from the source to the target database. These are then stored in change tables that mirror the column structure of the tables in the source database.
The structure of the Change Data Capture pattern is simple. It involves a source system holding data that changes over time and a target system that has to take some action based on those changes. The source and the target can even be physically the same system; that does not change the design pattern logically. Several CDC solutions can also coexist in a single system.
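The source-and-target pattern can be illustrated with a minimal Python sketch. It assumes the captured changes arrive as an ordered list of events; the `ChangeEvent` class, its field names, and the `apply_changes` helper are hypothetical names chosen for illustration, not part of any CDC product.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass(frozen=True)
class ChangeEvent:
    """One captured change (hypothetical structure for illustration)."""
    op: str                               # "insert", "update", or "delete"
    key: int                              # primary key of the affected row
    row: Optional[Dict[str, Any]] = None  # new row image; None for deletes


def apply_changes(target: Dict[int, Dict[str, Any]],
                  events: List[ChangeEvent]) -> None:
    """Replay captured changes, in order, against an in-memory target."""
    for event in events:
        if event.op in ("insert", "update"):
            target[event.key] = dict(event.row or {})
        elif event.op == "delete":
            target.pop(event.key, None)
        else:
            raise ValueError(f"unknown operation: {event.op}")


# Changes captured from a hypothetical source table.
events = [
    ChangeEvent("insert", 1, {"name": "Ada", "city": "London"}),
    ChangeEvent("insert", 2, {"name": "Linus", "city": "Helsinki"}),
    ChangeEvent("update", 1, {"name": "Ada", "city": "Paris"}),
    ChangeEvent("delete", 2),
]

target: Dict[int, Dict[str, Any]] = {}
apply_changes(target, events)
print(target)  # {1: {'name': 'Ada', 'city': 'Paris'}}
```

Here the target is just a dictionary, but the same replay logic would apply whether the target is a separate system or lives alongside the source.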
Change Data Capture and ETL
Extract, Transform, Load (ETL) is a data integration process in which data is extracted from multiple sources, transformed, and loaded into a database, data lake, or data warehouse. The data can be extracted either with batch queries (batch-based) or in near real time with Change Data Capture. During the transformation phase, the data is processed and converted to the required format; finally, it is loaded into the target destination.
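As a rough sketch of the three phases with CDC-style extraction, the Python example below reads only the changes recorded after a saved position instead of re-querying the whole source. The `CHANGE_LOG` list, the `seq` position field, and the `extract`, `transform`, `load`, and `run_etl` functions are all assumptions made for illustration.

```python
from typing import Dict, List

# Hypothetical change log exposed by a source system; "seq" is an
# increasing position that marks the order in which changes occurred.
CHANGE_LOG: List[dict] = [
    {"seq": 1, "op": "insert", "id": 1, "email": " Ada@Example.COM "},
    {"seq": 2, "op": "insert", "id": 2, "email": "bob@example.com"},
    {"seq": 3, "op": "update", "id": 1, "email": "ada@newmail.example"},
    {"seq": 4, "op": "delete", "id": 2},
]


def extract(since_seq: int) -> List[dict]:
    """Extract only the changes recorded after the last processed position."""
    return [change for change in CHANGE_LOG if change["seq"] > since_seq]


def transform(change: dict) -> dict:
    """Convert a change to the target format (here: normalize the email)."""
    out = dict(change)
    if "email" in out:
        out["email"] = out["email"].strip().lower()
    return out


def load(warehouse: Dict[int, dict], changes: List[dict]) -> None:
    """Apply the transformed changes to the target store."""
    for change in changes:
        if change["op"] == "delete":
            warehouse.pop(change["id"], None)
        else:
            warehouse[change["id"]] = {"email": change["email"]}


def run_etl(warehouse: Dict[int, dict], since_seq: int) -> int:
    """Run one ETL pass and return the new position to resume from."""
    changes = extract(since_seq)
    load(warehouse, [transform(change) for change in changes])
    return changes[-1]["seq"] if changes else since_seq


warehouse: Dict[int, dict] = {}
position = run_etl(warehouse, since_seq=0)
print(position, warehouse)  # 4 {1: {'email': 'ada@newmail.example'}}
```

A batch-based extract would instead read every source row on each run; keeping a position lets each pass process only what changed since the previous one.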
In traditional ETL processes, transformation was a slow activity, but modern ETL platforms have replaced the older disk-based processing with in-memory processing. This enables real-time data processing, data enrichment, and data analysis. The ETL work ends with loading the data into a target destination.
Optimized Method of CDC
One of the most-used methods of CDC at the application level is the trigger-based process. This involves defining triggers and creating a change log in shadow tables. The triggers fire before or after INSERT, UPDATE, or DELETE commands, each of which indicates a change, and write an entry to the change log. Some databases even have native support for triggers.
The advantage of trigger-based Change Data Capture is that shadow tables provide an immutable, detailed log of all transactions, and that some databases support triggers directly in their SQL API. On the other hand, trigger-based CDC reduces database performance because it requires multiple writes every time a row is inserted, updated, or deleted.
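The trigger-based approach can be tried out with Python's built-in sqlite3 module. The sketch below is only one possible layout: the `customers` table, the `customers_changes` shadow table, the trigger names, and the single-letter operation codes are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT);

-- Shadow table: mirrors the source columns and adds change metadata.
CREATE TABLE customers_changes (
    change_id  INTEGER PRIMARY KEY AUTOINCREMENT,
    operation  TEXT NOT NULL,
    changed_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
    id INTEGER, name TEXT, city TEXT
);

CREATE TRIGGER customers_ai AFTER INSERT ON customers
BEGIN
    INSERT INTO customers_changes (operation, id, name, city)
    VALUES ('I', NEW.id, NEW.name, NEW.city);
END;

CREATE TRIGGER customers_au AFTER UPDATE ON customers
BEGIN
    INSERT INTO customers_changes (operation, id, name, city)
    VALUES ('U', NEW.id, NEW.name, NEW.city);
END;

CREATE TRIGGER customers_ad AFTER DELETE ON customers
BEGIN
    INSERT INTO customers_changes (operation, id, name, city)
    VALUES ('D', OLD.id, OLD.name, OLD.city);
END;
""")

# Ordinary writes against the source table; the triggers log each one.
conn.execute("INSERT INTO customers (id, name, city) VALUES (1, 'Ada', 'London')")
conn.execute("UPDATE customers SET city = 'Paris' WHERE id = 1")
conn.execute("DELETE FROM customers WHERE id = 1")
conn.commit()

change_log = conn.execute(
    "SELECT operation, id, name, city FROM customers_changes ORDER BY change_id"
).fetchall()
for entry in change_log:
    print(entry)
# ('I', 1, 'Ada', 'London')
# ('U', 1, 'Ada', 'Paris')
# ('D', 1, 'Ada', 'Paris')
```

Each statement against `customers` causes a second write to the shadow table, which is the performance cost described above.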