DATASTAGE:

DataStage has the following features to aid the design and processing required to build a data warehouse:

• Uses graphical design tools. With simple point-and-click techniques you can draw a scheme to represent your processing requirements.

• Extracts data from any number or type of database.

• Handles all the metadata definitions required to define your data warehouse. You can view and modify the table definitions at any point during the design of your application.

• Aggregates data. You can modify the SQL SELECT statements used to extract data.

• Transforms data. DataStage has a set of predefined transforms and functions you can use to convert your data. You can easily extend this functionality by defining your own transforms.

• Loads the data warehouse.

COMPONENTS OF DATASTAGE:

DataStage consists of a number of client and server components. It has four client components:

1. DataStage Designer. A design interface used to create DataStage applications

(known as jobs). Each job specifies the data sources, the transforms required, and

the destination of the data. Jobs are compiled to create executables that are

scheduled by the Director and run by the Server (mainframe jobs are

transferred and run on the mainframe).

2. DataStage Director. A user interface used to validate, schedule, run, and monitor

DataStage server jobs and parallel jobs.

3. DataStage Manager. A user interface used to view and edit the contents of the

Repository.

4. DataStage Administrator. A user interface used to perform administration tasks

such as setting up DataStage users, creating and moving projects, and setting up

purging criteria.

SERVER COMPONENTS:

There are three server components:

1. Repository. A central store that contains all the information required to build a

data mart or data warehouse.

2. DataStage Server. Runs executable jobs that extract, transform, and load data

into a data warehouse.

3. DataStage Package Installer. A user interface used to install packaged

DataStage jobs and plug-ins.


DATASTAGE PROJECTS:

You always enter DataStage through a DataStage project. When you start a DataStage

client you are prompted to attach to a project. Each project contains:

• DataStage jobs.

• Built-in components. These are predefined components used in a job.

• User-defined components. These are customized components created using the

DataStage Manager. Each user-defined component performs a specific task in a

job.

DATASTAGE JOBS:

There are three basic types of DataStage job:

1. Server jobs. These are compiled and run on the DataStage server. A server job

will connect to databases on other machines as necessary, extract data, process it,

then write the data to the target data warehouse.

2. Parallel jobs. These are compiled and run on the DataStage server in a similar

way to server jobs, but support parallel processing on SMP, MPP, and cluster

systems.

3. Mainframe jobs. These are available only if you have Enterprise MVS Edition

installed. A mainframe job is compiled and run on the mainframe. Data extracted

by such jobs is then loaded into the data warehouse.

SPECIAL ENTITIES:

• Shared containers. These are reusable job elements. They typically comprise a

number of stages and links. Copies of shared containers can be used in any

number of server jobs or parallel jobs and edited as required.

• Job Sequences. A job sequence allows you to specify a sequence of DataStage

jobs to be executed, and actions to take depending on results.

TYPES OF STAGES:

• Built-in stages. Supplied with DataStage and used for extracting, aggregating, transforming, or writing data. All types of job have these stages.

• Plug-in stages. Additional stages that can be installed in DataStage to perform specialized tasks that the built-in stages do not support. Server jobs and parallel jobs can make use of these.

• Job Sequence Stages. Special built-in stages which allow you to define

sequences of activities to run. Only Job Sequences have these.


DATASTAGE NLS:

DataStage has built-in National Language Support (NLS). With NLS installed, DataStage

can do the following:

• Process data in a wide range of languages

• Accept data in any character set into most DataStage fields

• Use local formats for dates, times, and money (server jobs)

• Sort data according to local rules

To load a data mart or data warehouse, you must do the following:

• Set up your project

• Create a job

• Develop the job

• Edit the stages in the job

• Compile the job

• Run the job

SETTING UP YOUR PROJECT:

Before you create any DataStage jobs, you must set up your project by entering

information about your data. This includes the name and location of the tables or files

holding your data and a definition of the columns they contain. Information is stored in

table definitions in the Repository.

STARTING THE DATASTAGE DESIGNER:

To start the DataStage Designer, choose Start → Programs → Ascential DataStage →

DataStage Designer. The Attach to Project dialog box appears:

TO CONNECT TO A PROJECT:

1. Enter the name of your host in the Host system field. This is the name of the

system where the DataStage Server components are installed.


2. Enter your user name in the User name field. This is your user name on the

server system.

3. Enter your password in the Password field.

4. Choose the project to connect to from the Project drop-down list box.

5. Click OK. The DataStage Designer window appears with the New dialog box

open, ready for you to create a new job:

CREATING A JOB:

Jobs are created using the DataStage Designer. For this example, you need to create a

server job, so double-click the New Server Job icon.


Choose File → Save to save the job.

The Create new job dialog box appears:

DEFINING TABLE DEFINITIONS:

For most data sources, the quickest and simplest way to specify a table definition is to

import it directly from your data source or data warehouse.

IMPORTING TABLE DEFINITIONS:

1. In the Repository window of the DataStage Designer, select the Table Definitions

branch, and choose Import → Table Definitions… from the shortcut

menu. The Import Metadata (ODBC Tables) dialog box appears:

2. Choose the data source name from the DSN drop-down list box.

3. Click OK. The updated Import Metadata (ODBC Tables) dialog box displays

all the files for the chosen data source name:


4. Select project.EXAMPLE1 from the Tables list box, where project is the name

of your DataStage project.

5. Click OK. The column information from EXAMPLE1 is imported into DataStage.

6. A table definition is created and is stored under the Table Definitions

→ ODBC → DSNNAME branch in the Repository. The updated DataStage

Designer window displays the new table definition entry in the Repository

window.

DEVELOPING A JOB:

Jobs are designed and developed using the Designer. The job design is developed in the

Diagram window (the one with grid lines). Each data source, the data warehouse, and

each processing step is represented by a stage in the job design. The stages are linked

together to show the flow of data.

For example, we can develop a job with the following three stages (sketched in code after this list):

• A UniVerse stage to represent EXAMPLE1 (the data source).

• A Transformer stage to convert the data in the DATE column from a YYYY-MM-DD date in internal date format to a string giving just the year and month (YYYY-MM).

• A Sequential File stage to represent the file created at run time (the data warehouse in this example).
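The end-to-end flow of this example job can be sketched in ordinary Python (not DataStage code); the sample rows, column values, and output file name below are illustrative assumptions standing in for EXAMPLE1 and the target sequential file.

    # Conceptual sketch of the UniVerse -> Transformer -> Sequential File job.
    import csv
    from datetime import date

    def extract():                       # UniVerse stage: rows from EXAMPLE1 (sample data assumed)
        yield {"CODE": "AA01", "DATE": date(1998, 7, 14), "QTY": 25}
        yield {"CODE": "BB02", "DATE": date(1998, 8, 2), "QTY": 13}

    def transform(row):                  # Transformer stage: DATE -> "YYYY-MM" string
        out = dict(row)
        out["DATE"] = row["DATE"].strftime("%Y-%m")   # same effect as the MONTH.TAG transform
        return out

    def load(rows, path="seqfile.txt"):  # Sequential File stage: comma-delimited output file
        with open(path, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=["CODE", "DATE", "QTY"])
            writer.writerows(rows)

    load(transform(r) for r in extract())

Each function corresponds to one stage, and the data passed between them corresponds to the links drawn in the Diagram window.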

Adding Stages:

Stages are added using the tool palette. This palette contains icons that represent the

components you can add to a job. The palette has different groups to organize the tools

available.


To add a stage:

1. Click the stage button on the tool palette that represents the stage type you want to add.

2. Click in the Diagram window where you want the stage to be positioned. The stage

appears in the Diagram window as a square. You can also drag items from the palette to

the Diagram window.

We recommend that you position your stages as follows:

• Data sources on the left

• Data warehouse on the right

• Transformer stage in the center

When you add stages, they are automatically assigned default names. These names are

based on the type of stage and the number of the item in the Diagram window. You can

use the default names in the example.

Once all the stages are in place, you can link them together to show the flow of data.

Linking Stages

You need to add two links:

• One between the Universe and Transformer stages

• One between the Transformer and Sequential File stages

Links are always made in the direction the data will flow, that is, usually left to right.

When you add links, they are assigned default names. You can use the default names in

the example.

To add a link:

1. Right-click the first stage, hold the mouse button down, and drag the link to the Transformer stage. Release the mouse button.

2. Right-click the Transformer stage and drag the link to the Sequential File stage.

The following screen shows how the Diagram window looks when you have added the

stages and links:


Editing the Stages

Your job design currently displays the stages and the links between them. You must edit

each stage in the job to specify the data to use and what to do with it. Stages are edited in

the job design by double-clicking each stage in turn. Each stage type has its own editor.

Editing the UniVerse Stage

The data source (EXAMPLE1) is represented by a UniVerse stage. You must specify the

data you want to extract from this file by editing the stage.

Double-click the stage to edit it. The UniVerse Stage dialog box appears:

This dialog box has two pages:

• Stage. Displayed by default. This page contains the name of the stage you are editing.

The General tab specifies where the file is found and the connection type.

• Outputs. Contains information describing the data flowing from the stage. You edit this

page to describe the data you want to extract from the file. In this example, the output

from this stage goes to the Transformer stage. To edit the UniVerse stage:

1. Check that you are displaying the General tab on the Stage page.

Choose localuv from the Data source name drop-down list. Localuv is where

EXAMPLE1 is copied to during installation.


The remaining parameters on the General and Details tabs are used to enter logon details

and describe where to find the file. Because EXAMPLE1 is installed in localuv, you do

not have to complete these fields, which are disabled.

2. Click the Outputs tab. The Outputs page appears:

The Outputs page contains the name of the link the data flows along and the following

four tabs:

• General. Contains the name of the table to use and an optional description of the link.

• Columns. Contains information about the columns in the table.

• Selection. Used to enter an optional SQL SELECT clause (an Advanced procedure).

• View SQL. Displays the SQL SELECT statement used to extract the data.

3. Choose dstage.EXAMPLE1 from the Available tables drop-down list.

4. Click Add to add dstage.EXAMPLE1 to the Table names field.

5. Click the Columns tab. The Columns tab appears at the front of the dialog box. You

must specify the columns contained in the file you want to use. Because the column

definitions are stored in a table definition in the Repository, you can load them directly.

6. Click Load…. The Table Definitions window appears with the UniVerse → localuv branch highlighted.

7. Select dstage.EXAMPLE1. The Select Columns dialog box appears, allowing you to

select which column definitions you want to load.

8. In this case you want to load all available column definitions, so just click OK. The

column definitions specified in the table definition are copied to the stage. The Columns

tab contains definitions for the four columns in EXAMPLE1:


9. You can use the Data Browser to view the actual data that is to be output from the

UniVerse stage. Click the View Data… button to open the Data Browser window.

10. Choose File → Save to save your job design so far.

Editing the Transformer Stage

The Transformer stage performs any data conversion required before the data is output to

another stage in the job design. In this example, the Transformer stage is used to convert

the data in the DATE column from a YYYY-MM-DD date in internal date format to a

string giving just the year and month (YYYY-MM).

There are two links in the stage:

• The input from the data source (EXAMPLE1)

• The output to the Sequential File stage

To enable the use of one of the built-in DataStage transforms, you will assign data

elements to the DATE columns input and output from the Transformer stage. A


DataStage data element defines more precisely the kind of data that can appear in a given

column. In this example, you assign the Date data element to the input column, to specify

the date is input to the transform in internal format, and the MONTH.TAG data element

to the output column, to specify that the transform produces a string of the format YYYY-MM.

Double-click the Transformer stage to edit it. The Transformer Editor appears:

1. Working in the upper-left pane of the Transformer Editor, select the input columns that

you want to derive output columns from. Click on the CODE, DATE, and QTY columns

while holding down the Ctrl key.

2. Click the left mouse button again and, keeping it held down, drag the selected

columns to the output link in the upper-right pane. Drop the columns over the Column

Name field by releasing the mouse button. The columns appear in the top pane and the

associated metadata appears in the lower-right pane:

3. In the Data element field for the DSLink3.DATE column, select Date from the drop-down list.

4. In the SQL type field for the DSLink4.DATE column, select Char from the drop-down list.

5. In the Length field for the DSLink4.DATE column, enter 7.

6. In the Data element field for the DSLink4.DATE column, select MONTH.TAG from

the drop-down list. Next you will specify the transform to apply to the input DATE

column to produce the output DATE column. You do this in the upper right pane of the

Transformer Editor.

7. Double-click the Derivation field for the DSLink4.DATE column. The Expression

Editor box appears. At the moment, the box contains the text DSLink3.DATE, which


indicates that the output is directly derived from the input DATE column. Select the text

DSLink3 and delete it by pressing the Delete key.

8. Right-click in the Expression Editor box to open the Suggest Operand menu.

9. Select DS Transform… from the menu. A list of the available transforms appears.

10. Select the MONTH.TAG transform. It appears in the Expression Editor box with the argument field [%Arg1%] highlighted.

11. Right-click to open the Suggest Operand menu again. This time, select Input

Column. A list of available input columns appears:


12. Select DSLink3.DATE. This then becomes the argument for the transform.

13. Click OK to save the changes and exit the Transformer Editor. Once more the small

icon appears on the output link from the transformer stage to indicate that the link now

has column definitions associated with it.

Editing the Sequential File Stage

The data warehouse is represented by a Sequential File stage. The data to be written to

the data warehouse is already specified in the Transformer stage. However, you must

enter the name of a file to which the data is written when the job runs. If the file does not

exist, it is created. Double-click the stage to edit it. The Sequential File Stage dialog

box appears:

This dialog box has two pages:

• Stage. Displayed by default. This page contains the name of the stage you are editing

and two tabs. The General tab specifies the line termination type, and the NLS tab

specifies a character set map to use with the stage (this appears if you have NLS

installed).

• Inputs. Describes the data flowing into the stage. This page only appears when you

have an input to a Sequential File stage. You do not need to edit the column definitions

on this page, because they were all specified in the Transformer stage.


To edit the Sequential File stage:

1. Click the Inputs tab. The Inputs page appears. This page contains:

• The name of the link. This is automatically set to the link name used in the job design.

• General tab. Contains the pathname of the file, an optional description of the link, and

update action choices. You can use the default settings for this example, but you may

want to enter a file name (by default the file is named after the input link).

• Format tab. Determines how the data is written to the file. In this example, the data is written using the default settings, that is, as a comma-delimited file.

• Columns tab. Contains the column definitions for the data you want to extract. This tab

contains the column definitions specified in the Transformer stage’s output link.

2. Enter the pathname of the text file you want to create in the File name field, for

example, seqfile.txt. By default the file is placed in the server project directory (for

example, c:\Ascential\DataStage\Projects\datastage) and is named after the input link, but

you can enter, or browse for, a different directory.

3. Click OK to close the Sequential File Stage dialog box.

4. Choose File → Save to save the job design.

The job design is now complete and ready to be compiled.

Compiling a Job

When you finish your design you must compile it to create an executable job. Jobs are

compiled using the Designer. To compile the job, do one of the following:

• Choose File → Compile.

• Click the Compile button on the toolbar.

The Compile Job window appears:

Running a Job

Executable jobs are scheduled by the DataStage Director and run by the DataStage

Server. You can start the Director from the Designer by choosing Tools → Run

Director.


When the Director is started, the DataStage Director window appears with the status of

all the jobs in your project:

Highlight your job in the Job name column. To run the job, choose Job → Run Now or

click the Run button on the toolbar. The Job Run Options dialog box appears and

allows you to specify any parameter values and to specify any job run limits. In this case,

just click Run. The status changes to Running. When the job is complete, the status

changes to Finished.

Choose File → Exit to close the DataStage Director window.

Developing a Job

The DataStage Designer is used to create and develop DataStage jobs. A DataStage job

populates one or more tables in the target database. There is no limit to the number of

jobs you can create in a DataStage project.

A job design contains:

• Stages to represent the processing steps required

• Links between the stages to represent the flow of data

There are three different types of job in DataStage:

• Server jobs. These are available if you have installed DataStage Server. They run on the DataStage Server, connecting to other data sources as necessary.

• Mainframe jobs. These are available only if you have installed Enterprise MVS

Edition. Mainframe jobs are uploaded to a mainframe, where they are compiled and run.

• Parallel jobs. These are available only if you have installed the Enterprise Edition. These run on DataStage servers that are SMP, MPP, or cluster systems.

There are two other entities that are similar to jobs in the way they appear in the DataStage Designer, and are handled by it. These are:

• Shared containers. These are reusable job elements. They typically comprise a number

of stages and links. Copies of shared containers can be used in any number of server jobs

and parallel jobs and edited as required.

• Job Sequences. A job sequence allows you to specify a sequence of DataStage server

or parallel jobs to be executed, and actions to take depending on results.


STAGES:

A job consists of stages linked together which describe the flow of data from a data

source to a data target (for example, a final data warehouse).

A stage usually has at least one data input and/or one data output. However, some stages

can accept more than one data input, and output to more than one stage. The different

types of job have different stage types. The stages that are available in the DataStage

Designer depend on the type of job that is currently open in the Designer.

Server Job Stages

DataStage offers several built-in stage types for use in server jobs. These are used to

represent data sources, data targets, or conversion stages. These stages are either passive

or active stages. A passive stage handles access to databases for the extraction or writing

of data. Active stages model the flow of data and provide mechanisms for combining data

streams, aggregating data, and converting data from one data type to another.

As well as using the built-in stage types, you can also use plug-in stages for specific

operations that the built-in stages do not support. The Palette organizes stage types into

different groups, according to function:

• Database

• File

• PlugIn

• Processing

• Real Time

Stages and links can be grouped in a shared container. Instances of the shared container

can then be reused in different server jobs. You can also define a local container within a job; this groups stages and links into a single unit, but it can only be used within the job in which it is defined. Each stage type has a set of predefined and editable properties. These

properties are viewed or edited using stage editors.

At this point in your job development you need to decide which stage types to use in your

job design. The following built-in stage types are available for server jobs:


Mainframe Job Stages

DataStage offers several built-in stage types for use in mainframe jobs. These are used to

represent data sources, data targets, or conversion stages.

The Palette organizes stage types into different groups, according to function:

• Database

• File

• Processing

Each stage type has a set of predefined and editable properties. Some stages can be used as data sources and some as data targets; some can be used as both. Processing stages read data from a source, process it, and write it to a data target. These properties are viewed or edited using stage editors. A stage editor exists for each stage type. At this point in your job development you need to decide which stage types to use in your job design.


Parallel Job Processing Stages


SERVER JOBS:

When you design a job you see it in terms of stages and links. When it is compiled, the

DataStage engine sees it in terms of processes that are subsequently run on the server.

How does the DataStage engine define a process? It is here that the distinction between

active and passive stages becomes important. Active stages, such as the Transformer and Aggregator, perform processing tasks, while passive stages, such as the Sequential File and ODBC stages, read or write data sources and provide services to the active stages. At its simplest, each active stage becomes a process. But the situation becomes more complicated when you connect active stages together or passive stages together.


Single Processor and Multi-Processor Systems

The default behavior when compiling DataStage jobs is to run all adjacent active stages

in a single process. This makes good sense when you are running the job on a single

processor system. When you are running on a multi-processor system it is better to run

each active stage in a separate process so the processes can be distributed among

available processors and run in parallel. The enhancements to server jobs at Release 6 of

DataStage make it possible for you to stipulate at design time that jobs should be

compiled in this way. There are two ways of doing this:

• Explicitly – by inserting IPC stages between connected active stages.

• Implicitly – by turning on inter-process row buffering, either project-wide (using the DataStage Administrator) or for individual jobs (in the Job Properties dialog box).

The IPC facility can also be used to produce multiple processes where passive stages are

directly connected. This means that an operation reading from one data source and

writing to another could be divided into a reading process and a writing process able to

take advantage of multiprocessor systems.
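As a rough illustration only (this is plain Python, not DataStage, and the row source and queue size are assumptions), splitting a read-and-write operation into two cooperating processes joined by a bounded channel looks like this:

    # Sketch: a reading process and a writing process joined by a bounded queue,
    # roughly the division of work an IPC stage creates between two passive stages.
    from multiprocessing import Process, Queue

    SENTINEL = None

    def reader(q):                       # plays the role of the stage reading the source
        for row in ({"id": i} for i in range(1000)):   # assumed row source
            q.put(row)
        q.put(SENTINEL)

    def writer(q):                       # plays the role of the stage writing the target
        while (row := q.get()) is not SENTINEL:
            pass                         # the target write would happen here

    if __name__ == "__main__":
        q = Queue(maxsize=128)           # bounded buffer between the two processes
        procs = [Process(target=reader, args=(q,)), Process(target=writer, args=(q,))]
        for p in procs:
            p.start()
        for p in procs:
            p.join()

On a multi-processor machine the two processes can be scheduled on different processors, which is the performance benefit described above.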


Partitioning and Collecting

With the introduction of the enhanced multi-processor support at Release 6, there are

opportunities to further enhance the performance of server jobs by partitioning data. The

Link Partitioner stage allows you to partition the data you are reading so that it can be processed by individual processes running on multiple processors. The Link Collector stage

allows you to collect partitioned data together again for writing to a single data target.

The following diagram illustrates how you might use the Link Partitioner and Link

Collector stages within a job. Both stages are active, and you should turn on inter-process

row buffering at project or job level in order to implement process boundaries.

Aggregator Stages

Aggregator stages classify data rows from a single input link into groups and compute

totals or other aggregate functions for each group. The summed totals for each group are

output from the stage via an output link.

Using an Aggregator Stage


If you want to aggregate the input data in a number of different ways, you can have

several output links, each specifying a different set of properties to define how the input

data is grouped and summarized.
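As a loose sketch of what the stage computes (the REGION and QTY column names here are invented for illustration), grouping rows on a key column and summing a value column looks like this:

    # Sketch of Aggregator behaviour: group rows on a key column, sum a value column.
    from collections import defaultdict

    rows = [                                   # rows arriving on the single input link
        {"REGION": "North", "QTY": 10},
        {"REGION": "North", "QTY": 5},
        {"REGION": "South", "QTY": 7},
    ]

    totals = defaultdict(int)
    for row in rows:
        totals[row["REGION"]] += row["QTY"]    # Group column: REGION; Derivation: sum of QTY

    for region, qty in totals.items():         # one output row per group on the output link
        print({"REGION": region, "QTY": qty})

A job that needed both a per-region and a per-product summary would simply repeat this grouping with a different key on a second output link.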

When you edit an Aggregator stage, the Aggregator Stage dialog box appears:

This dialog box has three pages:

• Stage. Displays the name of the stage you are editing. This page has a General tab, which contains an optional description of the stage and the names of any before- and after-stage routines.

• Inputs. Specifies the column definitions for the data input link.

• Outputs. Specifies the column definitions for the data output link.

Defining Aggregator Input Data

Data to be aggregated is passed from a previous stage in the job design and into the

Aggregator stage via a single input link. The properties of this link and the column

definitions of the data are defined on the Inputs page in the Aggregator Stage dialog

box.

Note: The Aggregator stage does not preserve the order of input rows, even when the

incoming data is already sorted.


The Inputs page has the following field and two tabs:

• Input name. The name of the input link to the Aggregator stage.

• General. Displayed by default. Contains an optional description of the link.

• Columns. Contains a grid displaying the column definitions for the data being written

to the stage, and an optional sort order.

Column name: The name of the column.

Sort Order: Specifies the sort order. This field is blank by default, that is, there is no

sort order. Choose Ascending for ascending order, Descending for descending

order, or Ignore if you do not want the order to be checked.

Key: Indicates whether the column is part of the primary key.

SQL type: The SQL data type.

Length: The data precision. This is the length for CHAR data and the maximum

length for VARCHAR data.

Scale: The data scale factor.

Nullable: Specifies whether the column can contain null values.

Display: The maximum number of characters required to display the column data.

Data element: The type of data in the column.

Description: A text description of the column.

Defining Aggregator Output Data

When you output data from an Aggregator stage, the properties of output links and the

column definitions of the data are defined on the Outputs page in the Aggregator Stage

dialog box.

The Outputs page has the following field and two tabs:

• Output name. The name of the output link. Choose the link to edit from the Output

name drop-down list box. This list box displays all the output links from the stage.

• General. Displayed by default. Contains an optional description of the link.

• Columns. Contains a grid displaying the column definitions for the data being output

from the stage. The grid has the following columns:

Column name. The name of the column.

Group. Specifies whether to group by the data in the column.


Derivation. Contains an expression specifying how the data is aggregated. This is a complex cell, requiring more than one piece of information. Double-clicking the cell opens the Derivation dialog box.

Transformer Stages

Transformer stages do not extract data or write data to a target database. They are used to

handle extracted data, perform any conversions required, and pass data to another

Transformer stage or a stage that writes data to a target data table.

Using a Transformer Stage

Transformer stages can have any number of inputs and outputs. The link from the main

data input source is designated the primary input link. There can only be one primary

input link, but there can be any number of reference inputs.

When you edit a Transformer stage, the Transformer Editor appears. An example

Transformer stage is shown below. In this example, metadata has been defined for the

input and the output links.

Link Area

The top area displays links to and from the Transformer stage, showing their columns and

the relationships between them. The link area is where all column definitions, key

expressions, and stage variables are defined. The link area is divided into two panes; you

can drag the splitter bar between them to resize the panes relative to one another. There is

also a horizontal scroll bar, allowing you to scroll the view left or right. The left pane

shows input links, the right pane shows output links. The input link shown at the top of

the left pane is always the primary link. Any subsequent links are reference links. For all

types of link, key fields are shown in bold. Reference link key fields that have no

expression defined are shown in red (or the color defined in Tools → Options), as are

output columns that have no derivation defined.


Within the Transformer Editor, a single link may be selected at any one time. When

selected, the link’s title bar is highlighted, and arrowheads indicate any selected

columns.

Metadata Area

The bottom area shows the column metadata for input and output links. Again this area is

divided into two panes: the left showing input link metadata and the right showing output

link metadata. The metadata for each link is shown in a grid contained within a tabbed

page. Click the tab to bring the required link to the front. That link is also selected in the

link area.

If you select a link in the link area, its metadata tab is brought to the front automatically.

You can edit the grids to change the column metadata on any of the links. You can also

add and delete metadata.

Input Links

The main data source is joined to the Transformer stage via the primary link, but the

stage can also have any number of reference input links.

A reference link represents a table lookup. These are used to provide information that

might affect the way the data is changed, but do not supply the actual data to be

changed. Reference input columns can be designated as key fields. You can specify key

expressions that are used to evaluate the key fields. The most common use for the key

expression is to specify an equi-join, which is a link between a primary link column and a

reference link column. For example, if your primary input data contains names and

addresses, and a reference input contains names and phone numbers, the reference link

name column is marked as a key field and the key expression refers to the primary link’s

name column. During processing, the name in the primary input is looked up in the

reference input. If the names match, the reference data is consolidated with the primary

data. If the names do not match, i.e., there is no record in the reference input whose key

matches the expression given, all the columns specified for the reference input are set to

the null value.
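A minimal sketch of this lookup, assuming a names-and-addresses primary input and a names-and-phone-numbers reference input as in the example above (the actual values are invented):

    # Sketch of a reference-link lookup keyed on name (an equi-join).
    primary = [
        {"NAME": "Smith", "ADDRESS": "1 High St"},
        {"NAME": "Jones", "ADDRESS": "2 Low Rd"},
    ]
    reference = {"Smith": {"PHONE": "555-0100"}}     # keyed on the reference key column

    for row in primary:
        match = reference.get(row["NAME"])           # key expression: primary name = reference name
        # matched: consolidate the reference data; unmatched: reference columns become null
        row["PHONE"] = match["PHONE"] if match else None
        print(row)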

Output Links

You can have any number of output links from your Transformer stage. You may want to

pass some data straight through the Transformer stage unaltered, but it’s likely that you’ll

want to transform data from some input columns before outputting it from the

Transformer stage. You can specify such an operation by entering a BASIC expression or

by selecting a transform to apply to the data. DataStage has many built-in transforms, or

you can define your own custom transforms that are stored in the Repository and can be

reused as required. The source of an output link column is defined in that column’s

Derivation cell within the Transformer Editor. You can use the Expression Editor to

enter expressions or transforms in this cell. You can also simply drag an input column to

an output column’s Derivation cell, to pass the data straight through the Transformer


stage. In addition to specifying derivation details for individual output columns, you can

also specify constraints that operate on entire output links. A constraint is a BASIC

expression that specifies criteria that data must meet before it can be passed to the output

link. You can also specify a reject link, which is an output link that carries all the data not

output on other links, that is, columns that have not met the criteria. Each output link is

processed in turn. If the constraint expression evaluates to TRUE for an input row, the

data row is output on that link. Conversely, if a constraint expression evaluates to FALSE

for an input row, the data row is not output on that link.

Constraint expressions on different links are independent. If you have more than one

output link, an input row may result in a data row being output from some, none, or all

of the output links. For example, if you consider the data that comes from a paint shop, it

could include information about any number of different colors. If you want to separate

the colors into different files, you would set up different constraints. You could output the

information about green and blue paint on Link A, red and yellow paint on Link B, and

black paint on Link C. When an input row contains information about yellow paint, the

Link A constraint expression evaluates to FALSE and the row is not output on Link A.

However, the input data does satisfy the constraint criterion for Link B and the row is output on Link B. If the input data contains information about white paint, this does

not satisfy any constraint and the data row is not output on Links A, B or C, but will be

output on the reject link. The reject link is used to route data to a table or file that is a

“catch-all” for rows that are not output on any other link. The table or file containing

these rejects is represented by another stage in the job design.
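The paint-shop routing just described can be sketched as follows; the constraints are written as Python predicates purely for illustration (in a real job they would be BASIC constraint expressions on the output links):

    # Sketch of link constraints plus a reject link for rows matching no constraint.
    constraints = {
        "LinkA": lambda row: row["COLOR"] in ("green", "blue"),
        "LinkB": lambda row: row["COLOR"] in ("red", "yellow"),
        "LinkC": lambda row: row["COLOR"] == "black",
    }

    def route(row):
        hits = [link for link, ok in constraints.items() if ok(row)]
        return hits or ["RejectLink"]          # e.g. white paint goes to the reject link

    print(route({"COLOR": "yellow"}))          # ['LinkB']
    print(route({"COLOR": "white"}))           # ['RejectLink']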

Inter-Process Stages

An inter-process (IPC) stage is a passive stage which provides a communication channel

between DataStage processes running simultaneously in the same job. It allows you to

design jobs that run on SMP systems with great performance benefits. To understand the

benefits of using IPC stages, you need to know a bit about how DataStage jobs actually run as processes.

In this example the job will run as two processes, one handling the communication from

sequential file stage to IPC stage, and one handling communication from IPC stage to

ODBC stage. As soon as the Sequential File stage has opened its output link, the IPC

stage can start passing data to the ODBC stage. If the job is running on a multi-processor system, the two processes can run simultaneously, so the transfer will be much faster. You

can also use the IPC stage to explicitly specify that connected active stages should run as


separate processes. This is advantageous for performance on multi-processor systems.

You can also specify this behavior implicitly by turning inter-process row buffering on,

either for the whole project via DataStage Administrator, or individually for a job in its

Job Properties dialog box.

Using the IPC Stage

When you edit an IPC stage, the InterProcess Stage dialog box appears.

This dialog box has three pages:

• Stage. The Stage page has two tabs, General and Properties. The General tab allows you to specify an optional description of the stage. The Properties tab allows you to specify stage properties.


• Inputs. The IPC stage can only have one input link. The Inputs page displays

information about that link.

• Outputs. The IPC stage can only have one output link. The Outputs page displays

information about that link.

Defining IPC Stage Properties

The Properties tab allows you to specify two properties for the IPC stage:

• Buffer Size. Defaults to 128 KB. The IPC stage uses two blocks of memory; one block can be written to while the other is read from. This property defines the size of each block, so that by default 256 KB is allocated in total.

• Timeout. Defaults to 10 seconds. This gives a time limit for how long the stage will wait for a process to connect to it before timing out. This normally will not need changing, but may be important where you are prototyping multi-processor jobs on single-processor platforms and there are likely to be delays.

Both properties are illustrated in the sketch below.
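A rough, non-product sketch of these two properties (a bounded two-slot buffer shared by a writing side and a reading side, with a wait timeout), using Python threads purely for illustration:

    # Sketch of the IPC defaults: two buffer blocks and a connection/read timeout.
    import queue
    import threading

    BUFFER_SIZE = 128 * 1024            # default Buffer Size: 128 KB per block
    TIMEOUT = 10                        # default Timeout in seconds

    buf = queue.Queue(maxsize=2)        # two blocks: one written while the other is read
    DONE = object()

    def writing_side():
        for block in (b"x" * BUFFER_SIZE, b"y" * BUFFER_SIZE):
            buf.put(block, timeout=TIMEOUT)
        buf.put(DONE)

    threading.Thread(target=writing_side).start()

    # the reading side gives up (queue.Empty) if nothing arrives within TIMEOUT
    while (block := buf.get(timeout=TIMEOUT)) is not DONE:
        print("read block of", len(block), "bytes")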

Defining IPC Stage Input Data

The IPC stage can have one input link. This is where the process that is writing

connects.

The Inputs page has two tabs: General and Columns.

• General. The General tab allows you to specify an optional description of the stage.

• Columns. The Columns tab contains the column definitions for the data on the input

link. This is normally populated by the metadata of the stage connecting on the input

side. You can also Load a column definition from the Repository, or type one in yourself

(and Save it to the Repository if required). Note that the metadata on the input link must

be identical to the metadata on the output link.

Defining IPC Stage Output Data

The IPC stage can have one output link. This is where the process that is reading

connects.

The Outputs page has two tabs: General and Columns.

• General. The General tab allows you to specify an optional description of the stage.

• Columns. The Columns tab contains the column definitions for the data on the output link. This is normally populated by the metadata of the stage connecting on the output

side. You can also Load a column definition from the Repository, or type one in yourself

(and Save it to the Repository if required). Note that the metadata on the output link must

be identical to the metadata on the input link.

Link Partitioner Stage:

The Link Partitioner stage is an active stage which takes one input and allows you to

distribute partitioned rows to up to 64 output links. The stage expects the output links to


use the same metadata as the input link. Partitioning your data enables you to take

advantage of a multi-processor system and have the data processed in parallel. It can be

used in conjunction with the Link Collector stage to partition data, process it in parallel,

and then collect it together again before writing it to a single target.

In order for this job to compile and run as intended on a multi-processor system you must

have inter-process buffering turned on, either at project level using the DataStage

Administrator, or at job level from the Job Properties dialog box.

Defining Link Partitioner Stage Properties

The Properties tab allows you to specify two properties for the Link Partitioner stage:

• Partitioning Algorithm. Use this property to specify the method the stage uses to

partition data. Choose from:

– Round-Robin. This is the default method. Using the round-robin method the stage will

write each incoming row to one of its output links in turn.

– Random. Using this method the stage will use a random number generator to distribute

incoming rows evenly across all output links.

– Hash. Using this method the stage applies a hash function to one or more input column

values to determine which output link the row is passed to.

– Modulus. Using this method the stage applies a modulus function to an integer input

column value to determine which output link the row is passed to.

• Partitioning Key. This property is only significant where you have chosen a

partitioning algorithm of Hash or Modulus. For the Hash algorithm, specify one or more

column names separated by commas. These keys are concatenated and a hash function

applied to determine the destination output link. For the Modulus algorithm, specify a single column name which identifies an integer numeric column. The value of this column determines the destination output link. The four methods are sketched below.
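A rough sketch of the four partitioning methods, each returning an output-link index (the column names, hash function, and link count are placeholders, not the stage's actual implementation):

    # Sketch of the four partitioning algorithms: each maps a row to an output link index.
    import itertools
    import random
    import zlib

    N_LINKS = 4                                       # the real stage supports up to 64
    _turn = itertools.count()

    def round_robin(row):
        return next(_turn) % N_LINKS                  # each incoming row to the next link in turn

    def random_partition(row):
        return random.randrange(N_LINKS)              # spread rows evenly at random

    def hash_partition(row, key_columns=("CODE",)):   # keys concatenated, then hashed
        key = "|".join(str(row[c]) for c in key_columns)
        return zlib.crc32(key.encode()) % N_LINKS

    def modulus_partition(row, key_column="QTY"):     # integer column value mod link count
        return int(row[key_column]) % N_LINKS

    row = {"CODE": "AA01", "QTY": 25}
    print(round_robin(row), random_partition(row), hash_partition(row), modulus_partition(row))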

Link Collector Stages

The Link Collector stage is an active stage which takes up to 64 inputs and allows you to

collect data from these links and route it along a single output link. The stage expects the


output link to use the same metadata as the input links. The Link Collector stage can be

used in conjunction with a Link Partitioner stage to enable you to take advantage of a

multi-processor system and have data processed in parallel. The Link Partitioner stage partitions the data, it is processed in parallel, and the Link Collector stage then collects it together again before writing it to a single target.

The following diagram illustrates how the Link Collector stage can be used in a job in

this way.

In order for this job to compile and run as intended on a multi-processor system you must

have inter-process buffering turned on, either at project level using the DataStage

Administrator, or at job level from the Job Properties dialog box.

Defining Link Collector Stage Properties

The Properties tab allows you to specify two properties for the Link Collector stage:

• Collection Algorithm. Use this property to specify the method the stage uses to collect

data. Choose from:

– Round-Robin. This is the default method. Using the round-robin method the stage will

read a row from each input link in turn.

– Sort/Merge. Using the sort/merge method the stage reads multiple sorted inputs and

writes one sorted output.

• Sort Key. This property is only significant where you have chosen a collection algorithm of Sort/Merge. It defines how each of the partitioned data sets is known to be sorted and how the merged output will be sorted. The key has the following format:

Column name [sort order] [, Column name [sort order]] ...

Both collection methods are sketched below.
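A brief sketch of the two collection methods, assuming each input link delivers rows that are already sorted on the key (the ID column and the two-link setup are illustrative only):

    # Sketch of the two collection algorithms for rows arriving on multiple input links.
    import heapq
    from itertools import zip_longest

    link1 = [{"ID": 1}, {"ID": 4}]                    # assume each link yields sorted rows
    link2 = [{"ID": 2}, {"ID": 3}]

    def round_robin_collect(*links):
        # read one row from each input link in turn
        for group in zip_longest(*links):
            yield from (row for row in group if row is not None)

    def sort_merge_collect(*links, key=lambda r: r["ID"]):
        # merge already-sorted inputs into one sorted output
        return heapq.merge(*links, key=key)

    print(list(round_robin_collect(link1, link2)))
    print(list(sort_merge_collect(link1, link2)))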
