Datastage is immensely popular due to its pipelining and parallel processing capabilities. Datastage executes its jobs in terms of partitions (separate processing blocks), and this is where partitioning of data plays an important role in how your data is processed. Partitioning refers to how your data is actually split into separate blocks so that they can be processed independently of each other. For example, if your data consists of people of varying ages and your processing only involves calculations on people having the same age, then it makes sense to partition your data on the basis of age so that all the records having the same age end up in a single partition block.
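The age example above can be sketched as a simple key-based (hash) partitioner. This is plain Python, not DataStage code; the function and record names are illustrative:

```python
# Illustrative hash partitioner: records with the same key value
# always land in the same partition, which is the property keyed
# partitioning gives you in DataStage.
def hash_partition(records, key, num_partitions):
    partitions = [[] for _ in range(num_partitions)]
    for record in records:
        # The hash of the key value decides the target partition.
        idx = hash(record[key]) % num_partitions
        partitions[idx].append(record)
    return partitions

people = [
    {"name": "Anu", "age": 30},
    {"name": "Ben", "age": 25},
    {"name": "Cal", "age": 30},
    {"name": "Dee", "age": 25},
]
# On a "four node" run, all age-30 records share one partition,
# and all age-25 records share one partition.
parts = hash_partition(people, "age", 4)
```

Because every record with a given age hashes to the same partition, any per-age calculation can then run on each partition independently.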
As you will know by now, Datastage can run in different partition modes, which is mainly decided by the APT_CONFIG_FILE that is used during the run. So if your job is running on a four-node configuration, your data will be split into four parts. How it is split depends on the partition mode you choose and the keys you provide. You can find the list of keyed/keyless partition methods in your Datastage guides.
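For reference, a four-node APT_CONFIG_FILE typically looks something like the fragment below (host name and paths are illustrative; node3 and node4 would follow the same pattern):

```
{
	node "node1"
	{
		fastname "etlhost"
		pools ""
		resource disk "/data/ds/node1" {pools ""}
		resource scratchdisk "/scratch/ds/node1" {pools ""}
	}
	node "node2"
	{
		fastname "etlhost"
		pools ""
		resource disk "/data/ds/node2" {pools ""}
		resource scratchdisk "/scratch/ds/node2" {pools ""}
	}
}
```

Each node entry defines one logical partition, so a job run against this file processes its data in as many parallel blocks as there are nodes.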
It’s normally advised that you provide suitable partitioning and sorts in your job designs rather than leaving them in the auto partitioning mode. The reason is that you know the data structure and can make judgement calls on the keys that have to be used during partitioning. If you leave the partitioning method as auto, Datastage will choose a partitioning method for you; in the case of keyed partitioning used in stages like sort/join, the partitioning keys chosen will normally be the same as those provided in the stage operation, which in many cases might not even be required.
By default, Datastage automatically inserts sorts and partitions into your job to achieve optimal performance. If you are confident in your design, you can disable these insertions where they are not required. You can change this setting either at the project level or at the job level. The environment variables that you will need to look into for this are
APT_NO_PART_INSERTION – If this is not set, Datastage automatically inserts partitions in your jobs to optimize the performance of the stages. To avoid such insertions, set this variable.
APT_NO_SORT_INSERTION – If this is not set, Datastage automatically inserts sorts in your jobs to optimize the performance of the stages. To avoid such sort insertions, set this variable.
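In a shell session (or a job's before-job environment), setting the two variables looks like this; the value "1" is illustrative, since what matters is that the variable is defined:

```shell
# Disable DataStage's automatic partition insertion.
export APT_NO_PART_INSERTION=1
# Disable DataStage's automatic sort insertion.
export APT_NO_SORT_INSERTION=1
```

In practice you would usually set these through the Administrator client at the project level, or as job-level parameters, rather than in a raw shell.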
Another useful environment variable is APT_SORT_INSERTION_CHECK_ONLY. If this is set, then wherever Datastage inserts a sort, the sort won't actually happen; instead, Datastage only verifies that the data is already in the required sort order. This means that you won't have to specifically enable the above-mentioned variables if the APT_SORT_INSERTION_CHECK_ONLY variable is set.
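The check-only behaviour amounts to replacing the sort with an order verification. A minimal sketch of that idea in plain Python (not DataStage code; the function name is illustrative):

```python
# Verify sort order instead of sorting, analogous to what an inserted
# sort does when APT_SORT_INSERTION_CHECK_ONLY is set: the data passes
# through if it is already ordered on the key.
def is_sorted_on(records, key):
    return all(records[i][key] <= records[i + 1][key]
               for i in range(len(records) - 1))

ordered = [{"age": 25}, {"age": 30}, {"age": 30}]
unordered = [{"age": 30}, {"age": 25}]
# is_sorted_on(ordered, "age") is True; the unordered case is False,
# which in DataStage would surface as a runtime error rather than a
# silent re-sort.
```

A verification pass is much cheaper than a sort, which is why this variable is handy when you know your upstream stages already deliver sorted data.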
Sorting is done in Datastage using two operators: tsort and psort. Tsort is the default sorting mechanism/operator used by Datastage. This operator does not have any additional requirements, unlike the psort operator, which is used when the sort option specified is UNIX sort. In multiple-node environments, the data in each partition is sorted separately and maintained as separate partition blocks. You must remember that in order to get appropriately sorted data, the input data will have to be suitably partitioned before the sort. Sorting is carried out in the disk space mentioned in the sort pool, if such a pool is specified for the sort operation in the APT_CONFIG_FILE. Otherwise the scratch disk is used, which is always specified in the default disk pool. This is what you will normally find as the configurations
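A typical node entry showing both cases might look like the fragment below (host name and paths are illustrative): the first scratchdisk belongs only to the default pool, while the second is also a member of a "sort" pool, so sort work would be directed there when the sort pool is used:

```
	node "node1"
	{
		fastname "etlhost"
		pools ""
		resource disk "/data/ds/node1" {pools ""}
		resource scratchdisk "/scratch/ds/node1" {pools ""}
		resource scratchdisk "/scratch/ds/sort1" {pools "" "sort"}
	}
```

If no "sort" pool is defined, the sort simply spills to the default-pool scratchdisk, as described above.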