Data Factory - partition files in a dataflow

data factory, dataflow, partition files


parameterize the data lake source in a dataflow

a dataflow source doesn't accept an input for a dataset parameter

but it does provide Wildcard paths

So a pipeline can pass a value (e.g. a date range) into the dataflow as a dataflow parameter

the date-range parameter is then used in the wildcard path to select the subset of data from the data lake source

e.g. 

   wildcard paths = concat('/site/csv/', $timestamp_date, '/*')

Note the path needs to include all subfolders, even if they have already been specified in the dataset.

The wildcard * is also needed to select the files
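
Putting the pieces together, a hedged sketch in data flow terms (parameter and path names follow the example above; the exact script syntax may differ slightly from what the UI generates):

```
/* Dataflow parameter, declared under Parameters in the data flow: */
parameters{
    timestamp_date as string
}

/* Source options > Wildcard paths, entered as an expression;
   the pipeline's Execute Data Flow activity supplies $timestamp_date: */
concat('/site/csv/', $timestamp_date, '/*')
```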


Split dataflow by column value

Derive a partition path from a column value

e.g. concat('/site/', column_value, '/', date_string, '.csv')

This is the full path of the file, including subfolders and the file name, but not including the data lake container name
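
The derivation can live in a Derived Column transformation. A sketch in data flow script form (the stream name source1 and the output name withPartitionPath are placeholders; column_value and date_string come from the example above):

```
/* Derived Column transformation adding a per-row partition path */
source1 derive(partition_path = concat('/site/', column_value, '/',
                                       date_string, '.csv')) ~> withPartitionPath
```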


Split dataflow in sink by partition path

in the sink settings, the folder/file name can be set "As data in column"

When naming by file, the column value can include folders as well

Set the column data to the partition path derived above.
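
A hedged sketch of the sink in data flow script form; rowUrlColumn is an assumption for how "As data in column" appears in the script, and partition_path / withPartitionPath are the derived column and stream from the previous step:

```
/* Sink writing each row to the path held in its partition_path column */
withPartitionPath sink(allowSchemaDrift: true,
                       validateSchema: false,
                       rowUrlColumn: 'partition_path') ~> sinkByColumn
```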


dataflow sink, Optimize tab - choose partitioning / single partition

Single partition combines all data into a single file, which leads to long write times.

With the default partitioning, it writes multiple part files in parallel.

When a file name is already specified per row in the previous step, there is no need to choose "Single partition".