Data factory - partition files in dataflow
data factory, dataflow, partition files
parameterize datalake source in dataflow
A dataflow source doesn't support passing an input to a dataset parameter,
but it does provide Wildcard paths.
So a pipeline can pass a parameter (e.g. a date range) to the dataflow as a dataflow parameter.
The date range parameter is then used in the wildcard paths to select the subset of data from the data lake source.
e.g.
wildcard paths = concat('/site/csv/', $timestamp_date, '/*')
Note the path needs to include all subfolders, even if they have already been specified in the dataset.
The trailing wildcard * is also needed to select the files themselves.
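The wildcard-path construction above can be sketched outside ADF. This Python snippet mirrors the concat() expression and illustrates which blob paths such a pattern would pick up (the folder names, date value, and file names are made up for illustration; fnmatch's wildcard semantics are only an approximation of ADF's):

```python
from fnmatch import fnmatch

def build_wildcard_path(timestamp_date: str) -> str:
    # Mirrors: concat('/site/csv/', $timestamp_date, '/*')
    # The full folder chain is repeated even though the dataset
    # may already point at /site/csv; the trailing * selects files.
    return "/site/csv/" + timestamp_date + "/*"

pattern = build_wildcard_path("2023-05-01")

# Hypothetical blob listing: only the matching date folder is selected.
blobs = [
    "/site/csv/2023-05-01/part-000.csv",
    "/site/csv/2023-05-02/part-000.csv",
]
matched = [b for b in blobs if fnmatch(b, pattern)]
```

Here `matched` keeps only the files under the 2023-05-01 folder, which is the subset-selection effect the dataflow parameter achieves.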
Split dataflow by column value
Derive a partition path by column value
e.g. concat('/site/', column_value, '/', date_string, '.csv')
This is the full path of the file relative to the data lake container: it includes subfolders and the file name, but not the container name.
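The derived-column expression above can be sketched as a plain string build. This Python snippet mirrors the concat() call (the folder layout and column names are illustrative, not from a real dataflow):

```python
def partition_path(column_value: str, date_string: str) -> str:
    # Mirrors: concat('/site/', column_value, '/', date_string, '.csv')
    # Returns the full path relative to the container:
    # subfolders + file name, but no container name.
    return "/site/" + column_value + "/" + date_string + ".csv"

path = partition_path("storeA", "2023-05-01")
```

Each row ends up carrying its own target path in this derived column, which the sink uses in the next step.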
Split dataflow in sink by partition path
In the sink settings, you can choose to name the folder/file as column data.
When naming the file this way, the column value can include folders as well.
Set the column data to the partition path derived previously.
dataflow sink, optimize - choose partitioning / single partition
Single partition combines all data into a single file, which leads to a long write time.
With the default partitioning, it writes multiple part files from parallel writing.
When a file name is specified in the previous step (name file as column data), there is no need to choose single partition.
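The split-by-column-value behaviour of the sink can be simulated in a few lines. This Python sketch groups rows by their derived partition path, so each distinct path becomes one output file (rows, column names, and paths are invented for illustration):

```python
from collections import defaultdict

# Hypothetical input rows after the derived-column step.
rows = [
    {"site": "storeA", "date": "2023-05-01", "value": 1},
    {"site": "storeB", "date": "2023-05-01", "value": 2},
    {"site": "storeA", "date": "2023-05-01", "value": 3},
]

# Each row carries its own target path; the sink writes
# one file per distinct path, analogous to this grouping.
files = defaultdict(list)
for row in rows:
    path = f"/site/{row['site']}/{row['date']}.csv"
    files[path].append(row)
```

Two distinct paths yield two files here, with storeA's rows combined into one of them, which is why an explicit single-partition setting is unnecessary once the file name comes from column data.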