Change Data Capture (5)

On the previous page we saw how to load our facts from the Change Data Capture table.

Here we will see how to add some statistics which are useful to keep, but will also allow us to clean up and/or back up our Change Data Capture table.

First, I will show you a picture of our new Statistics table:

The screenshot below is a complete picture of our adapted Facts loading Workflow. Notice there are two additional blocks of nodes: the group to update the Statistics table (red group on the right) and the group to remove 'old' rows from our Change Data Capture table (blue group n the left). Below are expanded screenshots of each group with a brief explanation.

Above is the group of nodes to update the load_statistics table. At the successful end of our workflow, it inserts a row in the table with the Workflow name, 'Loaded' in the Status column, the highest CHANGE_ID processed so far, the timestamp of the CDC batch of records, and the timestamp of this Workflow.

Above the group of nodes to clean up the CDC table. In this example let's suppose we run our ETL process on a daily base. What we have decided is that, on day+2, we remove the CDC records of day 0, two days ago. So we will always keep one day of backup CDC records, in case we have to re-run our ETL job for some reason.

Depending on the criticality of your data, you could adjust the Row Filter node to keep more days of backup, or replace the SQL executor node to write the 'old' CDC records to some cold storage instead of deleting them.

Page updated

Google Sites

Report abuse