Data Factory - notes


Dynamic content for the source query, directory path, etc. means a single activity or dataset can point to or query different things at run time.


Getting the current date/time (converted from UTC to local time) with dynamic content expressions:

@convertFromUtc(utcNow(), 'AUS Eastern Standard Time', 'yyyy-MM-dd')

@convertFromUtc(utcNow(), 'AUS Eastern Standard Time', 'yyyy-MM-dd HH:mm:ss.ffffff')


Azure Data Factory does NOT support a 'logical OR' when combining the flows from two upstream activities; multiple activity dependencies are always evaluated as a logical AND.


Azure Data Factory does NOT allow running a ForEach loop inside an If Condition, but an If Condition can run inside a ForEach loop.


Upsert can be done with the Copy activity (as the sink write behavior), maybe only for SQL targets?
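For an Azure SQL sink, the Copy activity sink settings look roughly like the sketch below (property names from memory and the key column is made up; check the current connector docs before relying on it):

"sink": {
    "type": "AzureSqlSink",
    "writeBehavior": "upsert",
    "upsertSettings": {
        "useTempDB": true,
        "keys": [ "Id" ]
    }
}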


Wildcard paths

the data lake dataset points to the subfolder, e.g. container/parentfolder/subfolder/

the source of the 'Copy data' activity is set to use a wildcard file path:

    in the path, the container is pre-populated

    the folder path should be the subfolders, e.g. parentfolder/subfolder

    the file name can be *, or abc*
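In the Copy activity JSON this corresponds roughly to the source store settings below (a sketch for an ADLS Gen2 / delimited text source; property names from memory, so treat them as indicative):

"source": {
    "type": "DelimitedTextSource",
    "storeSettings": {
        "type": "AzureBlobFSReadSettings",
        "recursive": true,
        "wildcardFolderPath": "parentfolder/subfolder",
        "wildcardFileName": "abc*"
    }
}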


the source of a 'Data flow':

Note: if the file name contains a colon character (:), which can be part of a time string, the data flow doesn't work and raises a "Relative path in absolute URI" error. Replacing the colon with e.g. '-' works.


Dynamic dataset

a dataset cannot use a variable dynamically in its path, but it can use a parameter

then, when the dataset is used in a pipeline, the activity asks for the parameter value, so a variable can be mapped to the parameter

however, a data flow source doesn't natively support a parameterized dataset yet; there seems to be a workaround though
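A minimal sketch of the pattern (the dataset name 'DataLakeCsv', parameter 'folderPath' and variable 'targetFolder' are made-up names):

inside the dataset, the folder path is set with dynamic content referencing the parameter:
@dataset().folderPath

in the pipeline, the activity's dataset reference passes the variable to that parameter:
"inputs": [
    {
        "referenceName": "DataLakeCsv",
        "type": "DatasetReference",
        "parameters": { "folderPath": "@variables('targetFolder')" }
    }
]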




Handle pipeline parallel run

Azure Data Factory doesn't skip a pipeline run when an existing instance of the same pipeline is still running. That causes duplicate loads of data. One approach is to set a flag in the execution log database table to indicate that an active instance is running.

This requires a 'Script' activity to create and update the flag accordingly.

If the database is not SQL Server, the Script activity is not available, so one way is to create a file in the data lake to indicate the pipeline is running. There is no activity for creating a file either, so set up a Copy activity that copies a static file to a flag file, using e.g. the pipeline name as the file name. When the pipeline finishes, delete the flag file.

In the pipeline, the 'Get Metadata' activity can be used to check whether the flag file exists, and an 'If Condition' can direct the logic flow based on the result.
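A sketch of the check (activity and dataset names are made up; assumes the flag-file dataset takes the file name as a parameter):

flag file name for the current pipeline:
@concat(pipeline().Pipeline, '.running')

'Get Metadata1' on the flag-file dataset, field list: Exists

'If Condition' expression (true = no flag file, safe to run):
@not(activity('Get Metadata1').output.exists)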


Load window

It can be done by a SQL script on the database side (MySQL or SQL Server) and exposed to Azure through a database view.

The view checks the execution log for the last load window's end time (used as the new window's start time) and uses now() as the new window's end time.

When logging the load window, the end time should be rewritten with the max data timestamp if the timestamp column is not the insert datetime.

The reason is: a data row with a timestamp of 12:00 may not be inserted into the system until 12:01, for example, so a window ending at now() could miss it; logging the max loaded timestamp as the end time lets the next window pick it up. But if the timestamp is the actual insert datetime, no later-arriving row can have an insert timestamp earlier than now(), so now() is a safe window end.


If the load window is small enough, just load from where the last run got up to until now. If it is too big, the load window may need to be divided into smaller windows.
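On the ADF side, the window can be read with a Lookup activity ('First row only' enabled) over that view and injected into the source query as dynamic content. A sketch with made-up names ('Lookup LoadWindow', view columns window_start / window_end, table my_table, column updated_at):

@concat('SELECT * FROM my_table WHERE updated_at >= ''',
    activity('Lookup LoadWindow').output.firstRow.window_start,
    ''' AND updated_at < ''',
    activity('Lookup LoadWindow').output.firstRow.window_end,
    '''')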



Generating too many small files

If the pipeline generates too many small files because of many small queries in a loop, one option is to load all the small result sets into a database and export them into a single file at the end. If an 'append' option is available when writing to the file, the small result sets could instead simply be appended into a single file.


Partition in database

When querying data in e.g. MySQL, the table can become quite big over time, so we can partition the table by e.g. date. This achieves two things:

(a) querying the data is more efficient because it is limited to only one partition

(b) deleting old data is more efficient because old partitions can simply be truncated, while deleting old rows from a table involves too much logging and is very slow.


Error handling

In Azure Data Factory there is no pipeline-level 'try-catch' or failure capture, so failure capture has to be set up for every activity.

However, we can create a parent pipeline that runs the pipeline (via an 'Execute Pipeline' activity) and capture the whole child pipeline's failure from there. To log the child pipeline's failure, the child pipeline's name can be retrieved from @activity('Execute Pipeline').output.pipelineName
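On the failure path from the 'Execute Pipeline' activity, the error message can be captured as well, e.g. with an expression like the one below (activity name as above; the exact shape of the error object is worth verifying):

@activity('Execute Pipeline').error.message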


Database view must have explicit column names

When querying e.g. a MySQL view, make sure the view's SELECT statement lists explicit column names instead of SELECT * ...

Using * will cause ADF to be extremely slow.