WebSphere DataStage and QualityStage
Parallel Job Developer Guide
Version 8
SC18-9891-00
Note
Before using this information and the product that it supports, be sure to read the general information under “Notices and
trademarks” on page 615.
WebSphere DataStage allows you to concentrate on designing your job sequentially, without worrying too
much about how parallel processing will be implemented. You specify the logic of the job, and WebSphere
DataStage determines the best implementation on the available hardware. If required, however, you can
exert more exact control over job implementation.
Once designed, parallel jobs can run on SMP, MPP, or cluster systems. Jobs are scalable: the more
processors you have, the faster the job will run.
Parallel jobs can also be run on a USS system; special instructions for this are given in Chapter 52,
“Parallel jobs on USS,” on page 539.
Note: You must choose to either run parallel jobs on standard UNIX® systems or on USS systems in the
WebSphere DataStage Administrator. You cannot run both types of job at the same time. See
WebSphere DataStage Administrator Client Guide.
WebSphere DataStage also supports server jobs and mainframe jobs. Server jobs are compiled and run on
the server. These are for use on non-parallel systems and SMP systems with up to 64 processors. Server
jobs are described in WebSphere DataStage Server Job Developer Guide. Mainframe jobs are available if you have
Enterprise MVS™ Edition installed. These are loaded onto a mainframe and compiled and run there.
Mainframe jobs are described in WebSphere DataStage Mainframe Job Developer Guide.
The following diagram represents one of the simplest jobs you could have: a data source, a Transformer
(conversion) stage, and the final database. The links between the stages represent the flow of data into or
out of a stage. In a parallel job each stage would correspond to a process. You can have multiple
instances of each process to run on the available processors in your system.
You lay down these stages and links on the canvas of the WebSphere DataStage Designer. You specify the
design as if it were sequential, and WebSphere DataStage determines how the stages will become processes
and how many instances of these will actually be run.
WebSphere DataStage also allows you to store reusable components in the Repository, which can be
incorporated into different job designs. You can import these components, or entire jobs, from other
WebSphere DataStage projects using the Designer. You can also import meta data directly from data
sources and data targets.
Guidance on how to construct your job and define the required meta data using the Designer is in the
WebSphere DataStage Designer Client Guide. Chapter 4 onwards of this manual describes the individual
stage editors that you may use when developing parallel jobs.
Parallel processing
There are two basic types of parallel processing: pipeline and partitioning. WebSphere DataStage allows
you to use both of these methods. The following sections illustrate these methods using a simple parallel
job which extracts data from a data source, transforms it in some way, then writes it to another data
source. In all cases this job would appear the same on your Designer canvas, but you can configure it to
behave in different ways (which are shown diagrammatically).
Pipeline parallelism
If you ran the example job on a system with at least three processors, the stage reading would start on
one processor and start filling a pipeline with the data it had read. The transformer stage would start
running on another processor as soon as there was data in the pipeline, process it and start filling another
pipeline. The stage writing the transformed data to the target database would similarly start writing as
soon as there was data available. Thus all three stages are operating simultaneously. If you were running
sequentially, there would only be one instance of each stage. If you were running in parallel, there would
be as many instances as you had partitions (see next section).
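The following Python sketch is purely illustrative (it shows the idea of pipeline parallelism, not how WebSphere DataStage implements it): a reader, a transformer, and a writer run as separate threads connected by queues, so the transformer starts work as soon as the first rows arrive rather than waiting for the reader to finish.

import queue
import threading

def reader(out_q):
    # Simulates the stage reading from the data source.
    for row in range(1, 17):
        out_q.put(row)
    out_q.put(None)          # end-of-data marker

def transformer(in_q, out_q):
    # Starts processing as soon as rows appear in the pipeline.
    while (row := in_q.get()) is not None:
        out_q.put(row * 10)  # a trivial "conversion"
    out_q.put(None)

def writer(in_q):
    # Simulates the stage writing to the target database.
    while (row := in_q.get()) is not None:
        print("wrote", row)

q1, q2 = queue.Queue(), queue.Queue()
stages = [threading.Thread(target=reader, args=(q1,)),
          threading.Thread(target=transformer, args=(q1, q2)),
          threading.Thread(target=writer, args=(q2,))]
for s in stages:
    s.start()
for s in stages:
    s.join()

All three "stages" are alive at the same time; the writer begins producing output before the reader has finished reading.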
Partition parallelism
Imagine you have the same simple job as described above, but that it is handling very large quantities of
data. In this scenario you could use the power of parallel processing to your best advantage by
partitioning the data into a number of separate sets, with each partition being handled by a separate
instance of the job stages.
Using partition parallelism the same job would effectively be run simultaneously by several processors,
each handling a separate subset of the total data.
At the end of the job the data partitions can be collected back together again and written to a single data
source.
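The idea of partition parallelism can be sketched in the same illustrative way (again a conceptual sketch, not how WebSphere DataStage is implemented): the input data is split into four partitions, each partition is processed by a separate worker process, and the results are collected back into a single data set at the end.

from multiprocessing import Pool

def transform(partition):
    # Each worker handles its own subset of the data independently.
    return [row * 10 for row in partition]

if __name__ == "__main__":
    data = list(range(1, 17))
    # Partition the data into four roughly equal subsets.
    partitions = [data[i::4] for i in range(4)]
    with Pool(processes=4) as pool:
        results = pool.map(transform, partitions)
    # Collect the partitions back together into a single data set.
    collected = [row for part in results for row in part]
    print(collected)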
Repartitioning data
In some circumstances you may want to actually repartition your data between stages. This might
happen, for example, where you want to group data differently. Say you have initially processed data
based on customer last name, but now want to process data grouped by zip code. You will need to
repartition to ensure that all customers sharing the same zip code are in the same group. WebSphere
DataStage allows you to repartition between stages as and when needed.
Further details about how WebSphere DataStage actually partitions data, and collects it together again, are
given in "Partitioning, Repartitioning, and Collecting Data".
SMP systems allow you to scale up the number of processors, which may improve the performance of your
jobs. The improvement gained depends on how your job is limited:
v CPU-limited jobs. In these jobs the memory, memory bus, and disk I/O spend a disproportionate
amount of time waiting for the processor to finish its work. Running a CPU-limited application on
more processors can shorten this waiting time and so speed up overall performance.
In a cluster or MPP environment, you can use the multiple processors and their associated memory and
disk resources in concert to tackle a single job. In this environment, each processor has its own dedicated
memory, memory bus, disk, and disk access. In a shared-nothing environment, parallelization of your job
is likely to improve the performance of CPU-limited, memory-limited, or disk I/O-limited applications.
WebSphere DataStage learns about the shape and size of the system from the configuration file. It
organizes the resources needed for a job according to what is defined in the configuration file. When
your system changes, you change the file not the jobs.
The configuration file describes available processing power in terms of processing nodes. These may, or
may not, correspond to the actual number of processors in your system. You may, for example, want to
always leave a couple of processors free to deal with other activities on your system. The number of
nodes you define in the configuration file determines how many instances of a process will be produced
when you compile a parallel job.
Every MPP, cluster, or SMP environment has characteristics that define the system overall as well as the
individual processors. These characteristics include node names, disk storage locations, and other
distinguishing attributes. For example, certain processors might have a direct connection to a mainframe
for performing high-speed data transfers, while others have access to a tape drive, and still others are
dedicated to running an RDBMS application. You can use the configuration file to set up node pools and
resource pools. A pool defines a group of related nodes or resources, and when you design a parallel job
you can specify that execution be confined to a particular pool.
The configuration file describes every processing node that WebSphere DataStage will use to run your
application. When you run a parallel job, WebSphere DataStage first reads the configuration file to
determine the available system resources.
When you modify your system by adding or removing processing nodes or by reconfiguring nodes, you
do not need to alter or even recompile your parallel job. Just edit the configuration file.
The configuration file also gives you control over parallelization of your job during the development
cycle. For example, by editing the configuration file, you can first run your job on a single processing
node, then on two nodes, then four, then eight, and so on. The configuration file lets you measure system
performance and scalability without actually modifying your job.
You can define and edit the configuration file using the Designer client.
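As an illustration, a minimal configuration file describing two processing nodes on a single server might look like the following sketch. The node names, the host name (etlhost), and the resource paths are placeholders for this example; the actual file supports many more options.

{
  node "node1"
  {
    fastname "etlhost"
    pools ""
    resource disk "/opt/datastage/datasets/node1" {pools ""}
    resource scratchdisk "/opt/datastage/scratch/node1" {pools ""}
  }
  node "node2"
  {
    fastname "etlhost"
    pools ""
    resource disk "/opt/datastage/datasets/node2" {pools ""}
    resource scratchdisk "/opt/datastage/scratch/node2" {pools ""}
  }
}

A job run against this file produces two instances of each parallel stage; adding further node definitions increases the degree of parallelism without any change to the job itself.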
Partitioning
In the simplest scenario you probably won’t be bothered how your data is partitioned. It is enough that it
is partitioned and that the job runs faster. In these circumstances you can safely delegate responsibility
for partitioning to WebSphere DataStage. Once you have identified where you want to partition data,
WebSphere DataStage will work out the best method for doing it and implement it.
The aim of most partitioning operations is to end up with a set of partitions that are as near equal size as
possible, ensuring an even load across your processors.
When performing some operations however, you will need to take control of partitioning to ensure that
you get consistent results. A good example of this would be where you are using an aggregator stage to
summarize your data. To get the answers you want (and need) you must ensure that related data is
grouped together in the same partition before the summary operation is performed on that partition.
WebSphere DataStage lets you do this.
There are a number of different partitioning methods available. Note that all these descriptions assume
you are starting with sequential data; if you are repartitioning already partitioned data, there are some
specific considerations (see "Repartitioning data" later in this chapter).
Round robin partitioner
The first record goes to the first processing node, the second to the second processing node, and so on.
When the last processing node is reached, the next record goes back to the first processing node, so that
records are distributed evenly across the partitions. This is typically the method WebSphere DataStage
uses when initially partitioning sequential data.
(Figure: input data records 1 through 16 dealt out in turn across four processing nodes, Node 1 to Node 4.)
Random partitioner
Records are randomly distributed across all processing nodes. Like round robin, random partitioning can
rebalance the partitions of an input data set to guarantee that each processing node receives an
approximately equal-sized partition. Random partitioning has a slightly higher overhead than round
robin because of the extra processing required to calculate a random value for each record.
(Figure: input data records 1 through 16 distributed at random across four processing nodes, Node 1 to Node 4.)
Same partitioner
The stage using the data set as input performs no repartitioning and takes as input the partitions output
by the preceding stage. With this partitioning method, records stay on the same processing node; that is,
they are not redistributed. Same is the fastest partitioning method. This is normally the method
WebSphere DataStage uses when passing data between stages in your job.
(Figure: with Same partitioning, each of the four partitions, Node 1 to Node 4, passes to the next stage unchanged; every record stays in the partition it arrived in.)
Entire partitioner
Every instance of a stage on every processing node receives the complete data set as input. It is useful
when you want the benefits of parallel execution, but every instance of the operator needs access to the
entire input data set. You are most likely to use this partitioning method with stages that create lookup
tables from their input.
(Figure: with Entire partitioning, every processing node, Node 1 to Node 4, receives a complete copy of the input data set.)
Hash partitioner
Partitioning is based on a function of one or more columns (the hash partitioning keys) in each record.
The hash partitioner examines one or more fields of each input record (the hash key fields). Records with
the same values for all hash key fields are assigned to the same processing node.
This method is useful for ensuring that related records are in the same partition, which may be a
prerequisite for a processing operation. For example, for a remove duplicates operation, you can hash
partition records so that records with the same partitioning key values are on the same node. You can
then sort the records on each node using the hash key fields as sorting key fields, and then remove
duplicates, knowing that all records with identical key values are in the same partition.
Hash partitioning does not necessarily result in an even distribution of data between partitions. For
example, if you hash partition a data set based on a zip code field, where a large percentage of your
records are from one or two zip codes, you can end up with a few partitions containing most of your
records. This behavior can lead to bottlenecks because some nodes are required to process more records
than other nodes.
For example, the diagram shows the possible results of hash partitioning a data set using the field age as
the partitioning key. Each record with a given age is assigned to the same partition, so for example
records with age 36, 40, or 22 are assigned to partition 0. The height of each bar represents the number of
records in the partition.
(Figure: partition size, in records, plotted against partition number 0 to N; each partition contains three age values, for example ages 36, 40, and 22 in partition 0, but the bars are of different heights.)
As you can see, the key values are randomly distributed among the different partitions. The partition
sizes resulting from a hash partitioner are dependent on the distribution of records in the data set so even
though there are three keys per partition, the number of records per partition varies widely, because the
distribution of ages in the population is non-uniform.
When hash partitioning, you should select hashing keys that create a large number of partitions. For
example, hashing by the first two digits of a zip code produces a maximum of 100 partitions. This is not
a large number for a parallel processing system. Instead, you could hash by all five digits of the zip code
to create up to 100,000 partitions. You could also combine a zip code hash with an age hash (assuming a
maximum age of 190) to yield many millions of possible partitions.
You must define a single primary partitioning key for the hash partitioner, and you may define as many
secondary keys as are required by your job. Note, however, that each column can be used only once as a
key. Therefore, the total number of primary and secondary keys must be less than or equal to the total
number of columns in the row.
You specify which columns are to act as hash keys on the Partitioning tab of the stage editor. The data
type of a partitioning key may be any data type except raw, subrecord, tagged aggregate, or vector. By
default, the hash partitioner does case-sensitive comparison. This means that uppercase strings appear
before lowercase strings in a partitioned data set. You can override this default if you want to perform
case insensitive partitioning on string fields.
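The principle can be pictured with a short Python sketch (illustrative only; this is not the hash function WebSphere DataStage uses): each record's key value is hashed, the hash is mapped onto a partition number, and records with identical key values therefore always land in the same partition, while partition sizes depend entirely on how the key values are distributed.

import zlib
from collections import defaultdict

def hash_partition(records, key, num_partitions):
    # Records with the same key value always map to the same partition.
    partitions = defaultdict(list)
    for rec in records:
        h = zlib.crc32(str(rec[key]).encode())
        partitions[h % num_partitions].append(rec)
    return partitions

people = [{"name": "A", "age": 36}, {"name": "B", "age": 40},
          {"name": "C", "age": 36}, {"name": "D", "age": 22}]
for part, recs in sorted(hash_partition(people, "age", 4).items()):
    # Both records with age 36 always appear in the same partition.
    print(part, [r["age"] for r in recs])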
Modulus partitioner
Partitioning is based on a key column modulo the number of partitions. This method is similar to hash
by field, but involves simpler computation.
In data mining, data is often arranged in buckets, that is, each record has a tag containing its bucket
number. You can use the modulus partitioner to partition the records according to this number. The
modulus partitioner assigns each record of an input data set to a partition of its output data set as
determined by a specified key field in the input data set. This field can be the tag field.
The partition number for each record is calculated as follows:
partition_number = fieldname mod number_of_partitions
where:
v fieldname is a numeric field of the input data set.
v number_of_partitions is the number of processing nodes on which the partitioner executes. If a
partitioner is executed on three processing nodes it has three partitions.
In this example, the modulus partitioner partitions a data set containing ten records. Four processing
nodes run the partitioner, and the modulus partitioner divides the data among four partitions. The input
data is as follows:
The bucket is specified as the key field, on which the modulus operation is calculated.
bucket date
64123 1960-03-30
61821 1960-06-27
44919 1961-06-18
22677 1960-09-24
90746 1961-09-15
21870 1960-01-01
87702 1960-12-22
The following table shows the output data set divided among four partitions by the modulus partitioner.
Here are three sample modulus operations corresponding to the values of three of the key fields:
v 22677 mod 4 = 1; the data is written to Partition 1.
v 47330 mod 4 = 2; the data is written to Partition 2.
v 64123 mod 4 = 3; the data is written to Partition 3.
None of the key fields can be divided evenly by 4, so no data is written to Partition 0.
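The partition assignments can be reproduced with a few lines of Python, applying the same modulus calculation to the bucket values listed above:

buckets = [64123, 61821, 44919, 22677, 90746, 21870, 87702]
for b in buckets:
    # partition_number = bucket mod number_of_partitions (four nodes here)
    print(b, "->", "Partition", b % 4)
# 22677 -> Partition 1, 64123 -> Partition 3, and so on; none of these values lands in Partition 0.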
Range partitioner
Divides a data set into approximately equal-sized partitions, each of which contains records with key
columns within a specified range. This method is also useful for ensuring that related records are in the
same partition.
A range partitioner divides a data set into approximately equal size partitions based on one or more
partitioning keys. Range partitioning is often a preprocessing step to performing a total sort on a data set.
In order to use a range partitioner, you have to make a range map. You can do this using the Write
Range Map stage, which is described in Chapter 55.
The range partitioner guarantees that all records with the same partitioning key values are assigned to
the same partition and that the partitions are approximately equal in size so all nodes perform an equal
amount of work when processing the data set.
An example of the results of a range partition is shown below. The partitioning is based on the age key,
and the age range for each partition is indicated by the numbers in each bar. The height of the bar shows
the size of the partition.
All partitions are of approximately the same size. In an ideal distribution, every partition would be
exactly the same size.
However, you typically observe small differences in partition size. In order to size the partitions, the
range partitioner uses a range map to calculate partition boundaries. As shown above, the distribution of
partitioning keys is often not even; that is, some partitions contain many partitioning keys, and others
contain relatively few. However, based on the calculated partition boundaries, the number of records in
each partition is approximately the same.
Range partitioning is not the only partitioning method that guarantees equivalent-sized partitions. The
random and round robin partitioning methods also guarantee that the partitions of a data set are
equivalent in size. However, these partitioning methods are keyless; that is, they do not allow you to
control how records of a data set are grouped together within a partition.
In order to perform range partitioning your job requires a write range map stage to calculate the range
partition boundaries in addition to the stage that actually uses the range partitioner. The write range map
stage uses a probabilistic splitting technique to range partition a data set. This technique is described in
Parallel Sorting on a Shared- Nothing Architecture Using Probabilistic Splitting by DeWitt, Naughton, and
Schneider in Query Processing in Parallel Relational Database Systems by Lu, Ooi, and Tan, IEEE Computer
Society Press, 1994. In order for the stage to determine the partition boundaries, you pass it a sorted
sample of the data set to be range partitioned. From this sample, the stage can determine the appropriate
partition boundaries for the entire data set.
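The following Python sketch illustrates the idea behind a range map (it is not the Write Range Map stage itself): partition boundaries are chosen from a sorted sample of the key values so that each range covers roughly the same number of sampled records, and each incoming record is then assigned to the range that its key value falls into.

import bisect
import random

def make_range_map(sample, num_partitions):
    # Pick boundary values that split the sorted sample into equal-sized ranges.
    sample = sorted(sample)
    step = len(sample) // num_partitions
    return [sample[i * step] for i in range(1, num_partitions)]

def range_partition(key, boundaries):
    # bisect finds which range the key value falls into.
    return bisect.bisect_right(boundaries, key)

ages = [random.gauss(40, 12) for _ in range(10000)]   # values cluster around the mean of 40
boundaries = make_range_map(random.sample(ages, 500), 4)
counts = [0, 0, 0, 0]
for age in ages:
    counts[range_partition(age, boundaries)] += 1
print(boundaries, counts)   # partition sizes come out roughly equal despite the uneven key distribution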
When you come to actually partition your data, you specify the range map to be used by clicking on the
property icon next to the Partition type field. The Partitioning/Collection properties dialog box appears
and allows you to specify a range map.
DB2 partitioner
Partitions an input data set in the same way that DB2® would partition it. For example, if you use this
method to partition an input data set containing update information for an existing DB2 table, records are
assigned to the processing node containing the corresponding DB2 record. Then, during the execution of
the parallel operator, both the input record and the DB2 table record are local to the processing node.
Any reads and writes of the DB2 table would entail no network activity.
See theDB2 Parallel Edition for AIX®, Administration Guide and Reference for more information on DB2
partitioning.
Auto partitioner
The most common method you will see on the WebSphere DataStage stages is Auto. This just means that
you are leaving it to WebSphere DataStage to determine the best partitioning method to use depending
on the type of stage, and what the previous stage in the job has done. Typically WebSphere DataStage
would use round robin when initially partitioning data, and same for the intermediate stages of a job.
Collecting
Collecting is the process of joining the multiple partitions of a single data set back together again into a
single partition. There are various situations where you may want to do this. There may be a stage in
your job that you want to run sequentially rather than in parallel, in which case you will need to collect
all your partitioned data at this stage to make sure it is operating on the whole data set.
Similarly, at the end of a job, you might want to write all your data to a single database, in which case
you need to collect it before you write it.
There might be other cases where you do not want to collect the data at all. For example, you may want
to write each partition to a separate flat file.
Just as for partitioning, in many situations you can leave DataStage to work out the best collecting
method to use. There are situations, however, where you will want to explicitly specify the collection
method.
Note that collecting methods are mostly non-deterministic. That is, if you run the same job twice with the
same data, you are unlikely to get data collected in the same order each time. If order matters, you need
to use the sort merge collection method.
Round robin collector
Reads a record from the first input partition, then from the second partition, and so on. After reaching
the last partition, the collector starts over. After the final record in any partition has been read, that
partition is skipped in the remaining rounds.
(Figure: records from four partitions, Node 1 to Node 4, collected one at a time from each partition in turn into a single data set.)
Ordered collector
Reads all records from the first partition, then all records from the second partition, and so on. This
collection method preserves the order of totally sorted input data sets. In a totally sorted data set, both
the records in each partition and the partitions themselves are ordered. This may be useful as a
preprocessing action before exporting a sorted data set to a single data file.
(Figure: an ordered collection; all the records from the Node 1 partition are read first, then all the records from Node 2, and so on, producing a single totally sorted data set.)
Sort merge collector
Reads records in an order based on one or more columns of the records. The columns used to define
record order are called collecting keys.
In this example, the records consist of three fields. The first-name and last-name fields are strings, and
the age field is an integer. The following figure shows the order of the three records read by the sort
merge collector, based on different combinations of collecting keys.
order read:
1 “Jane” “Smith” 42
2 “Mary” “Davis” 42
3 “Paul” “Smith” 34
1 “Paul” “Smith” 34
2 “Mary” “Davis” 42
3 “Jane” “Smith” 42
1 “Jane” “Smith” 42
2 “Paul” “Smith” 34
3 “Mary” “Davis” 42
The data type of a collecting key can be any type except raw, subrec, tagged, or vector.
By default, the sort merge collector uses ascending sort order and case-sensitive comparisons. Ascending
order means that records with smaller values for a collecting field are processed before records with
larger values. You also can specify descending sorting order, so records with larger values are processed
first.
With a case-sensitive algorithm, records with uppercase strings are processed before records with
lowercase strings. You can override this default to perform case-insensitive comparisons of string fields.
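The behavior of the sort merge collector can be pictured with Python's heapq.merge (an illustrative sketch only): each input partition is already sorted on the collecting key, and the collector repeatedly takes the record with the lowest key value from among the partitions, so the collected output is itself sorted.

import heapq
from operator import itemgetter

# Each partition is already sorted on the collecting key (age, field index 2).
partition_0 = [("Paul", "Smith", 34), ("Jane", "Smith", 42)]
partition_1 = [("Mary", "Davis", 42)]

# Merge the partitions, comparing records on the age field.
for rec in heapq.merge(partition_0, partition_1, key=itemgetter(2)):
    print(rec)
# ('Paul', 'Smith', 34) is read first, followed by the two age-42 records.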
Auto collector
The most common method you will see on the parallel stages is Auto. This normally means that
WebSphere DataStage will eagerly read any row from any input partition as it becomes available, but if it
detects that, for example, the data needs sorting as it is collected, it will do that. This is the fastest
collecting method.
Repartitioning data
If you decide you need to repartition data within your parallel job there are some particular
considerations as repartitioning can affect the balance of data partitions.
For example, if you start with four perfectly balanced partitions and then subsequently repartition into
three partitions, you will lose the perfect balance and be left with, at best, near perfect balance. This is
true even for the round robin method, which only produces perfectly balanced partitions from a sequential
data source. The reason for this is illustrated below. Each node partitions as if it were a single processor
with a single data set, and will always start writing to the first target partition. In the case of four
partitions repartitioning to three, more rows are written to the first target partition. With a very small
data set the effect is pronounced; with a large data set the partitions tend to be more balanced.
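The effect can be reproduced with a short Python sketch (illustrative only): each of the four source partitions independently deals its rows round robin across three target partitions, always starting with the first target, so the first target partition ends up with more rows when the data set is small.

# Four perfectly balanced source partitions of four rows each.
sources = [list(range(i, 16, 4)) for i in range(4)]

targets = [[], [], []]
for part in sources:
    # Each source partition starts its own round robin at target 0.
    for i, row in enumerate(part):
        targets[i % 3].append(row)

print([len(t) for t in targets])   # [8, 4, 4]: the first target receives twice as many rows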
Partitioning icons
Each parallel stage in a job can partition or repartition incoming data before it operates on it. Equally it
can simply accept the partitions in which the data arrives. There is an icon on the input link to a stage
which shows how the stage handles partitioning.
In some cases, a stage has a specific partitioning method associated with it that cannot be overridden.
The stage always uses this method to organize incoming data before processing it. In this case an icon on
the input link tells you that the stage is repartitioning data:
If you have a data link from a stage running sequentially to one running in parallel the following icon is
shown to indicate that the data is being partitioned:
You can specify that you want to accept the existing data partitions by choosing a partitioning method of
same. This is shown by the following icon on the input link:
Partitioning methods are set on the Partitioning tab of the Inputs pages on a stage editor.
In most cases you are best leaving the preserve partitioning flag in its default state. The exception to this
is where preserving existing partitioning is important. The flag will not prevent repartitioning, but it will
warn you that it has happened when you run the job. If the Preserve Partitioning flag is cleared, this
means that the current stage doesn't care what the next stage in the job does about partitioning. On some
stages you can also set the flag to Propagate, in which case the stage takes its setting from the previous
stage in the job.
Collector icon
A stage in the job which is set to run sequentially will need to collect partitioned data before it operates
on it. There is an icon on the input link to a stage which shows that it is collecting data:
Sorting data
You will probably have requirements in your parallel jobs to sort data. WebSphere DataStage has a sort
stage, which allows you to perform complex sorting operations. There are situations, however, where you
require a fairly simple sort as a precursor to a processing operation. For these purposes, WebSphere
DataStage allows you to insert a sort operation in most stage types for incoming data. You do this by
selecting the Sorting option on the Input page Partitioning tab. When you do this you can specify:
v Sorting keys. The field(s) on which data is sorted. You must specify a primary key, but you can also
specify any number of secondary keys. The first key you define is taken as the primary.
v Stable sort (this is the default and specifies that previously sorted data sets are preserved).
v Unique sort (discards records if multiple records have identical sorting key values).
v Case sensitivity.
v Sort direction. Sorted as EBCDIC (ASCII is the default).
If you have NLS enabled, you can also specify the collate convention used.
Some WebSphere DataStage operations require that the data they process is sorted (for example, the
Merge operation). If WebSphere DataStage detects that the input data set is not sorted in such a case, it
will automatically insert a sort operation in order to enable the processing to take place unless you have
explicitly specified otherwise.
Data sets
Inside a WebSphere DataStage parallel job, data is moved around in data sets. These carry meta data with
them, both column definitions and information about the configuration that was in effect when the data
set was created. If for example, you have a stage which limits execution to a subset of available nodes,
and the data set was created by a stage using all nodes, WebSphere DataStage can detect that the data
will need repartitioning.
If required, data sets can be landed as persistent data sets, represented by a Data Set stage (see Chapter 4,
“Data set stage,” on page 67). This is the most efficient way of moving data between linked jobs.
Persistent data sets are stored in a series of files linked by a control file (note that you should not attempt
to manipulate these files using UNIX tools such as rm or mv; always use the tools provided with
WebSphere DataStage).
Metadata
Metadata is information about data. It describes the data flowing through your job in terms of column
definitions, which describe each of the fields making up a data record.
WebSphere DataStage has two alternative ways of handling metadata: through table definitions, or
through schema files. By default, parallel stages derive their meta data from the columns defined on the
Outputs or Input page Columns tab of your stage editor. Additional formatting information is supplied,
where needed, by a Formats tab on the Outputs or Input page. In some cases you can specify that the
stage uses a schema file instead, by explicitly setting a property on the stage editor and specifying the name
and location of the schema file. Note that, if you use a schema file, you should ensure that runtime
column propagation is turned on. Otherwise the column definitions specified in the stage editor will
always override any schema file.
Where is additional formatting information needed? Typically this is where you are reading from, or
writing to, a file of some sort and WebSphere DataStage needs to know more about how data in the file
is formatted.
You can specify formatting information on a row basis, where the information is applied to every column
in every row in the dataset. This is done from the Formats tab (the Formats tab is described with the
stage editors that support it; for example, for Sequential files, see page Input Link Format Tab). You can
also specify formatting for particular columns (which overrides the row formatting) from the Edit
Column Metadata dialog box for each column (see page Field Level).
Table definitions
A table definition is a set of related column definitions that are stored in the Repository. These can be
loaded into stages as and when required.
You can import a table definition from a data source via the Designer. You can also edit and define new
table definitions in the Designer (see WebSphere DataStage Designer Client Guide). If you want, you can edit
individual column definitions once you have loaded them into your stage.
You can also simply type in your own column definition from scratch on the Outputs or Input page
Column tab of your stage editor. When you have entered a set of column definitions you can save them
as a new table definition in the Repository for subsequent reuse in another job.
Note: If you are using a schema file on an NLS system, the schema file needs to be in UTF-8 format. It is,
however, easy to convert text files between two different maps with a WebSphere DataStage job.
Such a job would read data from a text file using a Sequential File stage and specifying the
appropriate character set on the NLS Map page. It would write the data to another file using a
Sequential File stage, specifying the UTF-8 map on the NLS Map page.
Some parallel job stages allow you to use a partial schema. This means that you only need define column
definitions for those columns that you are actually going to operate on. Partial schemas are also described
in Appendix A.
Remember that you should turn runtime column propagation on if you intend to use schema files to
define column meta data.
Data types
When you work with parallel job column definitions, you will see that they have an SQL type associated
with them. This maps onto an underlying data type which you use when specifying a schema via a file,
and which you can view in the Parallel tab of the Edit Column Meta Data dialog box. The underlying
data type is what a parallel job data set understands. The following table summarizes the underlying
data types that column definitions can have:
’%[padding_character][integer]lld’
’%[padding_character][integer]ld’
The integer component specifies a minimum field width. The output column is printed at least this wide, and wider
if necessary. If the column has fewer digits than the field width, it is padded on the left with padding_character to
make up the field width. The default padding character is a space.
For this example c_format specification: ’%09lld’ the padding character is zero (0), and the integers 123456 and
12345678 are printed out as 000123456 and 012345678.
When you work with mainframe data using the CFF stage, the data types are as follows:
COBOL Data Type         Storage            Picture                                 Underlying Data Type             Notes
binary, native binary   2 bytes            S9(1-4) COMP/COMP-5                     int16
binary, native binary   4 bytes            S9(5-9) COMP/COMP-5                     int32
binary, native binary   8 bytes            S9(10-18) COMP/COMP-5                   int64
binary, native binary   2 bytes            9(1-4) COMP/COMP-5                      uint16
binary, native binary   4 bytes            9(5-9) COMP/COMP-5                      uint32
binary, native binary   8 bytes            9(10-18) COMP/COMP-5                    uint64
character               n bytes            X(n)                                    string[n]
character for filler    n bytes            X(n)                                    raw(n)
varchar                 n bytes            X(n)                                    string[max=n]
decimal                 (x+y)/2+1 bytes    9(x)V9(y) COMP-3                        decimal[x+y,y]                   packed
decimal                 (x+y)/2+1 bytes    S9(x)V9(y) COMP-3                       decimal[x+y,y]                   packed
display_numeric         x+y bytes          9(x)V9(y)                               decimal[x+y,y] or string[x+y]    zoned
display_numeric         x+y bytes          S9(x)V9(y)                              decimal[x+y,y] or string[x+y]    zoned, trailing
display_numeric         x+y bytes          S9(x)V9(y) sign is trailing             decimal[x+y,y]                   zoned, trailing
display_numeric         x+y bytes          S9(x)V9(y) sign is leading              decimal[x+y,y]                   zoned, leading
display_numeric         x+y+1 bytes        S9(x)V9(y) sign is trailing separate    decimal[x+y,y]                   separate, trailing
display_numeric         x+y+1 bytes        S9(x)V9(y) sign is leading separate     decimal[x+y,y]                   separate, leading
float                   4 bytes            COMP-1                                  sfloat                           floating point
float                   8 bytes            COMP-2                                  dfloat                           floating point
The Char, VarChar, and LongVarChar SQL types relate to underlying string types where each character is
8-bits and does not require mapping because it represents an ASCII character. You can, however, specify
that these data types are extended, in which case they are taken as ustrings and do require mapping.
(They are specified as such by selecting the Extended check box for the column in the Edit Meta Data
dialog box.) An Extended field appears in the columns grid, and extended Char, VarChar, or
LongVarChar columns have ’Unicode’ in this field. The NChar, NVarChar, and LongNVarChar types
relate to underlying ustring types so do not need to be explicitly extended.
When referring to complex data in WebSphere DataStage column definitions, you can specify fully
qualified column names, for example:
Parent.Child5.Grandchild2
Subrecords
A subrecord is a nested data structure. The column with type subrecord does not itself define any
storage, but the columns it contains do. These columns can have any data type, and you can nest
subrecords one within another. The LEVEL property is used to specify the structure of subrecords. The
following diagram gives an example of a subrecord structure.
Parent (subrecord)
Child1 (string)
Child2 (string)
Child3 (string) LEVEL01
Child4 (string)
Child5(subrecord)
Grandchild1 (string)
Grandchild2 (time) LEVEL02
Grandchild3 (sfloat)
Tagged subrecord
A tagged subrecord is a column whose type can vary. The columns it contains define the possible types,
and a given record holds only one of them at a time. The following is an example of a tagged subrecord
structure.
Parent (tagged)
Child1 (string)
Child2 (int8)
Child3 (raw)
Vector
A vector is a one dimensional array of any type except tagged. All the elements of a vector are of the
same type, and are numbered from 0. The vector can be of fixed or variable length. For fixed length
vectors the length is explicitly stated, for variable length ones a property defines a link field which gives
the length at run time. The following diagram illustrates a vector of fixed length and one of variable
length.
Fixed length
0 1 2 3 4 5 6 7 8
Variable length
0 1 2 3 4 5 6 N
link field = N
You use formatting strings at various places in parallel jobs to specify the format of dates, times, and
timestamps.
Date formats
Format strings are used to control the format of dates.
A date format string can contain one or a combination of the following elements:
Table 1. Date format tags
Tag     Variable width availability    Description                      Value range    Options
%d      import                         Day of month, variable width     1...31         s
%dd     -                              Day of month, fixed width        01...31        s
When you specify a date format string, prefix each component with the percent symbol (%). Separate the
string’s components with a literal character.
Where indicated the tags can represent variable-width date elements. Variable-width date elements can
omit leading zeroes without causing errors.
The following options can be used in the format string where indicated in the table:
s Specify this option to allow leading spaces in date formats. The s option is specified in the form:
%(tag,s)
Where tag is the format string. For example:
%(m,s)
indicates a numeric month of year field in which values can contain leading spaces or zeroes and
be one or two characters wide. If you specified the following date format property:
%(d,s)/%(m,s)/%yyyy
Then the following dates would all be valid:
8/ 8/1958
The u, w, and t options are mutually exclusive. They affect how text is formatted for output. Input dates
will still be correctly interpreted regardless of case.
-N Specify this option to left justify long day or month names so that the other elements in the date
will be aligned.
+N Specify this option to right justify long day or month names so that the other elements in the
date will be aligned.
Names are left justified or right justified within a fixed width field of N characters (where N is between 1
and 99). Names will be truncated if necessary. The following are examples of justification in use:
%dd-%(mmmm,-5)-%yyyy
21-Augus-2006
%dd-%(mmmm,-10)-%yyyy
21-August -2005
%dd-%(mmmm,+10)-%yyyy
21- August-2005
The locale for determining the setting of the day and month names can be controlled through the locale
tag. This has the format:
%(L,’locale’)
Where locale specifies the locale to be set using the language_COUNTRY.variant naming convention
supported by ICU. See NLS Guide for a list of locales. The default locale for month names and weekday
names markers is English unless overridden by a %L tag or the APT_IMPEXP_LOCALE environment
variable (the tag takes precedence over the environment variable if both are set).
Use the locale tag in conjunction with your date format; for example, a format string that includes
%(L,’es’) specifies the Spanish locale and results in a date that uses Spanish month and day names.
The format string is subject to the restrictions laid out in the following table. A format string can contain
at most one tag from each row. In addition some rows are mutually incompatible, as indicated in the
’incompatible with’ column. When some tags are used the format string requires that other tags are
present too, as indicated in the ’requires’ column.
When a numeric variable-width input tag such as %d or %m is used, the field to the immediate right of
the tag (if any) in the format string cannot be either a numeric tag, or a literal substring that starts with a
digit. For example, all of the following format strings are invalid because of this restriction:
%d%m-%yyyy
%d%mm-%yyyy
%(d)%(mm)-%yyyy
%h00 hours
The year_cutoff is the year defining the beginning of the century in which all two-digit years fall. By
default, the year cutoff is 1900; therefore, a two-digit year of 97 represents 1997.
You can specify any four-digit year as the year cutoff. All two-digit years then specify the next possible
year ending in the specified two digits that is the same or greater than the cutoff. For example, if you set
the year cutoff to 1930, the two-digit year 30 corresponds to 1930, and the two-digit year 29 corresponds
to 2029.
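The rule can be made concrete with a small, hypothetical helper function written in Python (it is not part of WebSphere DataStage): given a four-digit year cutoff, a two-digit year maps to the next year ending in those two digits that is greater than or equal to the cutoff.

def resolve_two_digit_year(yy, year_cutoff=1900):
    # Next year ending in the digits yy that is >= year_cutoff.
    century = year_cutoff - (year_cutoff % 100)
    year = century + yy
    return year if year >= year_cutoff else year + 100

print(resolve_two_digit_year(97))            # 1997 with the default 1900 cutoff
print(resolve_two_digit_year(30, 1930))      # 1930
print(resolve_two_digit_year(29, 1930))      # 2029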
You can include literal text in your date format, for example as separators. Any Unicode character other
than null, backslash, or the percent sign can be used (although it is better to avoid control codes and
other non-graphic characters). The following table lists special tags and escape sequences:
Time formats
Format strings are used to control the format of times.
When you specify a time string, prefix each component of the format string with the percent symbol.
Separate the string’s components with a literal character.
Where indicated the tags can represent variable-width date elements. Variable-width time elements can
omit leading zeroes without causing errors.
The following options can be used in the format string where indicated:
s Specify this option to allow leading spaces in time formats. The s option is specified in the form:
%(tag,s)
Where tag is the format string. For example:
%(n,s)
indicates a minute field in which values can contain leading spaces or zeroes and be one or two
characters wide. If you specified the following time format property:
%(h,s):%(n,s):%(s,s)
Then the following times would all be valid:
20: 6:58
20:06:58
The locale for determining the setting of the am/pm string and the default decimal separator can be
controlled through the locale tag. This has the format:
%(L,’locale’)
Where locale specifies the locale to be set using the language_COUNTRY.variant naming convention
supported by ICU. See NLS Guide for a list of locales. The default locale for am/pm string and separators
markers is English unless overridden by a %L tag or the APT_IMPEXP_LOCALE environment variable
(the tag takes precedence over the environment variable if both are set).
Use the locale tag in conjunction with your time format, for example:
%L(’es’)%HH:%nn %aa
The format string is subject to the restrictions laid out in the following table. A format string can contain
at most one tag from each row. In addition some rows are mutually incompatible, as indicated in the
’incompatible with’ column. When some tags are used the format string requires that other tags are
present too, as indicated in the ’requires’ column.
Table 4. Format tag restrictions
Element                 Numeric format tags            Text format tags    Requires       Incompatible with
hour                    %hh, %h, %HH, %H               -                   -              -
am/pm marker            -                              %aa                 hour (%HH)     hour (%hh)
minute                  %nn, %n                        -                   -              -
second                  %ss, %s                        -                   -              -
fraction of a second    %ss.N, %s.N, %SSS, %SSSSSS     -                   -              -
You can include literal text in your date format. Any Unicode character other than null, backslash, or the
percent sign can be used (although it is better to avoid control codes and other non-graphic characters).
The following table lists special tags and escape sequences:
Timestamp formats
Format strings are used to control the format of timestamps.
The timestamp format is the date format and the time format combined. The two formats can be in any
order and their elements can be mixed. The formats are described in “Date formats” on page 30 and
“Time formats” on page 33.
You must prefix each component of the format string with the percent symbol (%).
You create a new shared container in the Designer, add Server job stages as required, and then add the
Server Shared Container to your Parallel job and connect it to the Parallel stages. Server Shared Container
stages used in Parallel jobs have extra pages in their Properties dialog box, which enable you to specify
details about parallel processing and partitioning and collecting data.
You can only use Server Shared Containers in this way on SMP systems (not MPP or cluster systems).
The following limitations apply to the contents of such Server Shared Containers:
v There must be zero or one container inputs, zero or more container outputs, and at least one of either.
v There can be no disconnected flows - all stages must be linked to the input or an output of the
container directly or via an active stage. When the container has an input and one or more outputs,
each stage must connect to the input and at least one of the outputs.
v There can be no synchronization by having a passive stage with both input and output links.
For details on how to use Server Shared Containers, see WebSphere DataStage Designer Client Guide. This
also tells you how to use Parallel Shared Containers, which enable you to package parallel job
functionality in a reusable form.
Parallel jobs have a large number of stages available. They are organized into groups in the tool palette,
or you can drag the stages you use frequently to the Favorites category.
The stage editors are divided into the following basic types:
v Database. These are stages that read or write data contained in a database. Examples of database
stages are the Oracle Enterprise and DB2/UDB Enterprise stages.
v Development/Debug. These are stages that help you when you are developing and troubleshooting
parallel jobs. Examples are the Peek and Row Generator stages.
v File. These are stages that read or write data contained in a file or set of files. Examples of file stages
are the Sequential File and Data Set stages.
v Processing. These are stages that perform some processing on the data that is passing through them.
Examples of processing stages are the Aggregator and Transformer stages.
v Real Time. These are the stages that allow Parallel jobs to be made available as web services. They are
part of the optional Web Services package.
v Restructure. These are stages that deal with and manipulate data containing columns of complex data
type. Examples are Make Subrecord and Make Vector stages.
Parallel jobs also support local containers and shared containers. Local containers allow you to tidy your
designs by putting portions of functionality in a container, the contents of which are viewed on a
separate canvas. Shared containers are similar, but are stored separately in the repository and can be
reused by other parallel jobs. Parallel jobs can use both Parallel Shared Containers and Server Shared
Containers. Using shared containers is described in WebSphere DataStage Designer Client Guide.
The following table lists the available stage types and gives a quick guide to their function:
All of the stage types use the same basic stage editor, but the pages that actually appear when you edit
the stage depend on the exact type of stage you are editing. The following sections describe all the page
types and sub tabs that are available. The individual descriptions of stage editors in the following
chapters tell you exactly which features of the generic editor each stage type uses.
The top Oracle stage has a warning triangle, showing that there is a compilation error. If you hover the
mouse pointer over the stage a tooltip appears, showing the particular errors for that stage.
Any local containers on your canvas will behave like a stage; that is, all the compile errors for stages
within the container are displayed. You have to open a parallel shared container in order to see any
compile problems on the individual stages.
Note: Parallel transformer stages will only show certain errors; to detect C++ errors in the stage, you
have to actually compile the job containing it.
General tab
All stage editors have a General tab. This allows you to enter an optional description of the stage.
Specifying a description here enhances job maintainability.
The properties of most stages are set on the Stage page Properties tab. The following figure shows an
example Properties tab.
The available properties are displayed in a tree structure. They are divided into categories to help you
find your way around them. All the mandatory properties are included in the tree by default and cannot
be removed. Properties that you must set a value for (that is, those that do not have a default value) are
shown in the warning color (red by default), but change to black when you have set a value. You can change the
warning color by opening the Options dialog box (select Tools → Options ... from the Designer main
menu) and choosing the Transformer item from the tree. Reset the Invalid column color by clicking on
the color bar and choosing a new color from the palette.
To set a property, select it in the list and specify the required property value in the property value field.
The title of this field and the method for entering a value changes according to the property you have
selected. In the example above, the Key property is selected so the Property Value field is called Key and
you set its value by choosing one of the available input columns from a drop down list. Key is shown in
red because you must select a key for the stage to work properly. The Information field contains details
about the property you currently have selected in the tree. Where you can browse for a property value,
or insert a job parameter whose value is provided at run time, a right arrow appears next to the field.
Click on this and a menu gives access to the Browse Files dialog box and/or a list of available job
parameters (job parameters are defined in the Job Properties dialog box - see WebSphere DataStage
Designer Client Guide).
Some properties have default values, and you can always return to the default by selecting it in the tree
and choosing Set to default from the shortcut menu.
Some properties can be repeated. In the example above you can add multiple key properties. The Key
property appears in the Available properties to add list when you select the tree top level Properties
node. Click on the Key item to add multiple key properties to the tree. Where a repeatable property
expects a column as an argument, a dialog is available that lets you specify multiple columns at once. To
open this, click the column button next to the properties tree.
The Column Selection dialog box opens. The left pane lists all the available columns; use the right-arrow
buttons to select some or all of them (use the left-arrow buttons to move them back if you change your mind).
A separate property will appear for each column you have selected.
Some properties have dependents. These are properties which somehow relate to or modify the parent
property. They appear under the parent in a tree structure.
For some properties you can supply a job parameter as their value. At runtime the value of this
parameter will be used for the property. Such properties will have an arrow next to their Property Value
box. Click the arrow to get a drop-down menu, then choose Insert job parameter to get a list of currently
defined job parameters to choose from (see WebSphere DataStage Designer Client Guide for information about
job parameters).
You can switch to a multiline editor for entering property values for some properties. Do this by clicking
on the arrow next to their Property Value box and choosing Switch to multiline editor from the menu.
The property capabilities are indicated by different icons in the tree as follows:
The properties for individual stage types are described in the chapter about the stage.
Advanced tab
All stage editors have an Advanced tab. This allows you to:
v Specify the execution mode of the stage. This allows you to choose between Parallel and Sequential
operation. If the execution mode for a particular type of stage cannot be changed, then this drop down
list is disabled. Selecting Sequential operation forces the stage to be executed on a single node. If you
have intermixed sequential and parallel stages this has implications for partitioning and collecting data
between the stages. You can also let WebSphere DataStage decide by choosing the default setting for
the stage (the drop down list tells you whether this is parallel or sequential).
v Set or clear the preserve partitioning flag (this field is not available for all stage types). It indicates
whether the stage wants to preserve partitioning at the next stage of the job. You choose between Set,
Clear and Propagate. For some stage types, Propagate is not available. The operation of each option is
as follows:
– Set. Sets the preserve partitioning flag, this indicates to the next stage in the job that it should
preserve existing partitioning if possible.
Link Ordering tab
The tab allows you to order input links and/or output links as needed. Where link ordering is not
important or is not possible the tab does not appear.
The link label gives further information about the links being ordered. In the example we are looking at
the Link Ordering tab for a Join stage. The join operates in terms of having a left link and a right link,
and this tab tells you which actual link the stage regards as being left and which right. If you use the
arrow keys to change the link order, the link name changes but not the link label. In our example, if you
pressed the down arrow button, DSLink27 would become the left link, and DSLink26 the right.
The following example shows the Link Ordering tab from a Merge stage. In this case you can order both
input links and output links. The Merge stage handles reject links as well as a stream link and the tab
allows you to order these, although you cannot move them to the stream link position. Again the link
labels give the sense of how the links are being used.
NLS Map tab
Select a map from the list, or click the arrow button next to the list to specify a job parameter.
Input page
The Input page gives information about links going into a stage. In the case of a file or database stage an
input link carries data being written to the file or database. In the case of a processing or restructure
stage it carries data that the stage will process before outputting to another stage. Where there are no
input links, the stage editor has no Input page.
Where it is present, the Input page contains various tabs depending on stage type. The only field the
Input page itself contains is Input name, which gives the name of the link being edited. Where a stage
has more than one input link, you can select the link you are editing from the Input name drop-down
list.
The Input page also has a Columns... button. Click this to open a window showing column names from
the meta data defined for this link. You can drag these columns to various fields in the Input page tabs as
required.
Certain stage types will also have a View Data... button. Press this to view the actual data associated
with the specified data source or data target. The button is available if you have defined meta data for
the link. Note the interface allowing you to view the file will be slightly different depending on stage and
link type.
General tab
The Input page always has a General tab. This allows you to enter an optional description of the link.
Specifying a description for each link enhances job maintainability.
Properties tab
Some types of file and database stages can have properties that are particular to specific input links. In
this case the Input page has a Properties tab. This has the same format as the Stage page Properties tab
(see ″Properties Tab″ ).
Partitioning tab
Most parallel stages have a default partitioning or collecting method associated with them. This is used
depending on the execution mode of the stage (i.e., parallel or sequential) and the execution mode of the
immediately preceding stage in the job. For example, if the preceding stage is processing data
sequentially and the current stage is processing in parallel, the data will be partitioned before it enters the
current stage. Conversely if the preceding stage is processing data in parallel and the current stage is
sequential, the data will be collected as it enters the current stage.
You can, if required, override the default partitioning or collecting method on the Partitioning tab. The
selected method is applied to the incoming data as it enters the stage on a particular link, and so the
Partitioning tab appears on the Input page. You can also use the tab to repartition data between two
parallel stages. If both stages are executing sequentially, you cannot select a partition or collection method
and the fields are disabled. The fields are also disabled if the particular stage does not permit selection of
partitioning or collecting methods.
The Partitioning tab also allows you to specify that the data should be sorted as it enters the stage.
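As an informal illustration of this rule, the following Python sketch shows which action the default
behavior implies for each combination of execution modes. It is not product code; the function name and
its return strings are invented for the example.

def default_link_action(preceding_parallel, current_parallel):
    # Sequential stage feeding a parallel stage: data is partitioned.
    if not preceding_parallel and current_parallel:
        return "partition"
    # Parallel stage feeding a sequential stage: data is collected.
    if preceding_parallel and not current_parallel:
        return "collect"
    # Parallel feeding parallel: data can optionally be repartitioned.
    if preceding_parallel and current_parallel:
        return "repartition (optional)"
    # Sequential to sequential: no partition or collection method applies.
    return "none"

print(default_link_action(False, True))   # partition
print(default_link_action(True, False))   # collect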
Format tab
Stages that write to certain types of file (e.g., the Sequential File stage) also have a Format tab which
allows you to specify the format of the file or files being written to.
The Format tab is similar in structure to the Properties tab. A flat file has a number of properties for
which you can set different attributes. Select the property in the tree and select the attributes you want to
set from the Available properties to add box; it will then appear as a dependent property in the property
tree and you can set its value as required. This tab sets the format information for the file at row level.
You can override the settings for individual columns using the Edit Column Metadata dialog box (see
"Field level").
If you click the Load button you can load the format information from a table definition in the
Repository.
Details of the properties you can set are given in the chapter describing the individual stage editors:
v Sequential File stage - Input Link Format Tab
v File Set stage - Input Link Format Tab
v External Target stage - Format Tab
v Column Export stage - Properties Tab
Columns tab
The Input page always has a Columns tab. This displays the column meta data for the selected input link
in a grid.
If you select the options in the Grid Properties dialog box (see WebSphere DataStage Designer Client Guide),
the Columns tab will also display two extra fields: Table Definition Reference and Column Definition
Reference. These show the table definition and individual columns that the columns on the tab were
derived from.
If you click in a row and select Edit Row... from the shortcut menu, the Edit Column Meta Data dialog
box appears, which allows you to edit the row details in a dialog box format. It also has a Parallel tab
which allows you to specify properties that are peculiar to parallel job column definitions. The dialog box
only shows those properties that are relevant for the current link.
The Parallel tab enables you to specify properties that give more detail about each column, and
properties that are specific to the data type. Where you are specifying complex data types, you can
specify a level number, which causes the Level Number field to appear in the grid on the Columns page.
Some table definitions need format information. This occurs where data is being written to a file where
WebSphere DataStage needs additional information in order to be able to locate columns and rows.
Properties for the table definition at row level are set on the Format tab of the relevant stage editor, but
you can override the settings for individual columns using the Parallel tab. The settings are made in a
properties tree under the following categories:
Field level
This has the following properties:
v Bytes to Skip. Skip the specified number of bytes from the end of the previous column to the
beginning of this column.
v Delimiter. Specifies the trailing delimiter of the column. Type an ASCII character or select one of
whitespace, end, none, null, comma, or tab.
– whitespace. The last column of each record will not include any trailing white spaces found at the
end of the record.
– end. The end of a field is taken as the delimiter, i.e., there is no separate delimiter. This is not the
same as a setting of `None’ which is used for fields with fixed-width columns.
– none. No delimiter (used for fixed-width).
– null. ASCII Null character is used.
– comma. ASCII comma character used.
– tab. ASCII tab character used.
v Delimiter string. Specify a string to be written at the end of the column. Enter one or more characters.
This is mutually exclusive with Delimiter, which is the default. For example, specifying `, ` (comma
space - you do not need to enter the inverted commas) would have the column delimited by `, `.
v Drop on input. Select this property when you must fully define the meta data for a data set, but do
not want the column actually read into the data set.
v Prefix bytes. Specifies that this column is prefixed by 1, 2, or 4 bytes containing, as a binary value,
either the column’s length or the tag value for a tagged column. You can use this option with
variable-length fields. Variable-length fields can be either delimited by a character or preceded by a 1-,
2-, or 4-byte prefix containing the field length. WebSphere DataStage inserts the prefix before each field.
This property is mutually exclusive with the Delimiter, Quote, and Final Delimiter properties, which
are used by default. A sketch of this prefix layout is given after this list.
v Print field. This property is intended for use when debugging jobs. Set it to have WebSphere
DataStage produce a message for each of the columns it reads. The message has the format:
Importing N: D
where:
– N is the column name.
– D is the imported data of the column. Non-printable characters contained in D are prefixed with an
escape character and written as C string literals; if the column contains binary data, it is output in
octal format.
v Quote. Specifies that variable length columns are enclosed in single quotes, double quotes, or another
ASCII character or pair of ASCII characters. Choose Single or Double, or enter a character.
v Start position. Specifies the starting position of a column in the record. The starting position can be
either an absolute byte offset from the first record position (0) or the starting position of another
column.
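As a rough illustration of the Prefix bytes layout, the fragment below writes a variable-length string field
preceded by a 2-byte binary length prefix. It is a sketch only, not product code; the big-endian byte order
shown is an assumption for the example, since the actual order is controlled by the Byte order property.

import struct

def write_prefixed_field(value, prefix_bytes=2):
    # Illustrative only: a variable-length field preceded by a binary
    # length prefix of 1, 2, or 4 bytes (big-endian assumed here).
    data = value.encode("ascii")
    fmt = {1: ">B", 2: ">H", 4: ">I"}[prefix_bytes]
    return struct.pack(fmt, len(data)) + data

print(write_prefixed_field("Smith"))   # b'\x00\x05Smith'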
String type
This has the following properties:
v Character Set. Choose from ASCII or EBCDIC (not available for ustring type (Unicode)).
v Default. The default value for a column. This is used for data written by a Generate stage. It also
supplies the value to substitute for a column that causes an error (whether written or read).
v Export EBCDIC as ASCII. Select this to specify that EBCDIC characters are written as ASCII characters
(not available for ustring type (Unicode)).
v Is link field. Selected to indicate that a column holds the length of another, variable-length column of
the record or of the tag value of a tagged record field.
v Import ASCII as EBCDIC. Select this to specify that ASCII characters are read as EBCDIC characters
(not available for ustring type (Unicode)).
v Field max width. The maximum number of bytes in a column represented as a string. Enter a number.
This is useful where you are storing numbers as text. If you are using a fixed-width character set, you
can calculate the length exactly. If you are using a variable-length character set, calculate an adequate
maximum width for your fields. Applies to fields of all data types except date, time, timestamp, and
raw; and record, subrec, or tagged if they contain at least one field of this type.
v Field width. The number of bytes in a column represented as a string. Enter a number. This is useful
where you are storing numbers as text. If you are using a fixed-width charset, you can calculate the
number of bytes exactly. If it’s a variable length encoding, base your calculation on the width and
frequency of your variable-width characters. Applies to fields of all data types except date, time,
timestamp, and raw; and record, subrec, or tagged if they contain at least one field of this type.
v Pad char. Specifies the pad character used when strings or numeric values are written to an external
string representation. Enter a character (single-byte for strings, can be multi-byte for ustrings) or choose
null or space. The pad character is used when the external string representation is larger than required
to hold the written field. In this case, the external string is filled with the pad character to its full
length. Space is the default. Applies to string, ustring, and numeric data types and record, subrec, or
tagged types if they contain at least one field of this type.
Date type
v Byte order. Specifies how multiple byte data types are ordered. Choose from:
– little-endian. The high byte is on the right.
– big-endian. The high byte is on the left.
– native-endian. As defined by the native format of the machine.
v Character Set. Choose from ASCII or EBCDIC.
v Days since. Dates are written as a signed integer containing the number of days since the specified
date. Enter a date in the form %yyyy-%mm-%dd or in the default date format if you have defined a
new one on an NLS system.
v Data Format. Specifies the data representation format of a column. Choose from:
– binary
– text
For dates, binary is equivalent to specifying the julian property for the date field, text specifies that
the data to be written contains a text-based date in the form %yyyy-%mm-%dd or in the default date
format if you have defined a new one on an NLS system.
v Default. The default value for a column. This is used for data written by a Generate stage. It also
supplies the value to substitute for a column that causes an error (whether written or read).
Time type
v Byte order. Specifies how multiple byte data types are ordered. Choose from:
– little-endian. The high byte is on the right.
– big-endian. The high byte is on the left.
– native-endian. As defined by the native format of the machine.
v Character Set. Choose from ASCII or EBCDIC.
v Default. The default value for a column. This is used for data written by a Generate stage. It also
supplies the value to substitute for a column that causes an error (whether written or read).
v Data Format. Specifies the data representation format of a column. Choose from:
– binary
– text
For time, binary is equivalent to midnight_seconds, text specifies that the field represents time in the
text-based form %hh:%nn:%ss or in the default date format if you have defined a new one on an
NLS system.
v Format string. Specifies the format of columns representing time as a string. By default this is
%hh-%mm-%ss. For details about the format, see “Time formats” on page 33
v Is midnight seconds. Select this to specify that times are written as a binary 32-bit integer containing
the number of seconds elapsed from the previous midnight.
Timestamp type
v Byte order. Specifies how multiple byte data types are ordered. Choose from:
– little-endian. The high byte is on the right.
– big-endian. The high byte is on the left.
– native-endian. As defined by the native format of the machine.
v Character Set. Choose from ASCII or EBCDIC.
v Data Format. Specifies the data representation format of a column. Choose from:
– binary
– text
For timestamp, binary specifies that two 32-bit integers are written: the first contains a Julian day
count for the date portion of the timestamp, and the second contains the time portion of the
timestamp as the number of seconds from midnight. Text specifies a text-based timestamp in the form
%yyyy-%mm-%dd %hh:%nn:%ss or in the default date format if you have defined a new one on an NLS
system. A sketch of the binary layout is given after this list.
v Default. The default value for a column. This is used for data written by a Generate stage. It also
supplies the value to substitute for a column that causes an error (whether written or read).
v Format string. Specifies the format of a column representing a timestamp as a string. Defaults to
%yyyy-%mm-%dd %hh:%nn:%ss. The format combines the format for date strings and time strings. See
“Date formats” on page 30 and “Time formats” on page 33.
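To make the binary timestamp layout concrete, here is a small illustrative sketch (not product code) that
packs a timestamp as the two 32-bit integers described above. The helper name, the use of native byte
order, and the Julian day conversion offset are assumptions made for the example only.

import struct
from datetime import datetime

def to_binary_timestamp(ts):
    # First 32-bit integer: Julian day count for the date portion.
    # Python's toordinal() counts from 0001-01-01; adding 1721425 gives
    # the conventional Julian day number (assumed for illustration).
    julian_day = ts.date().toordinal() + 1721425
    # Second 32-bit integer: seconds elapsed since midnight.
    seconds = ts.hour * 3600 + ts.minute * 60 + ts.second
    return struct.pack("=ii", julian_day, seconds)

print(len(to_binary_timestamp(datetime(2006, 12, 26, 9, 30, 0))))   # 8 bytes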
Integer type
v Byte order. Specifies how multiple byte data types are ordered (see the sketch after this list). Choose
from:
– little-endian. The high byte is on the right.
– big-endian. The high byte is on the left.
– native-endian. As defined by the native format of the machine.
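The following illustrative fragment (not product code) shows the same 32-bit integer written in the two
explicit byte orders; native-endian simply uses whichever of these the machine itself uses.

import struct

value = 0x01020304
print(struct.pack("<I", value).hex())   # little-endian: 04030201 (high byte on the right)
print(struct.pack(">I", value).hex())   # big-endian:    01020304 (high byte on the left)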
Decimal type
v Allow all zeros. Specifies whether to treat a packed decimal column containing all zeros (which is
normally illegal) as a valid representation of zero. Select Yes or No.
v Character Set. Choose from ASCII or EBCDIC.
v Decimal separator. Specify the character that acts as the decimal separator (period by default).
v Default. The default value for a column. This is used for data written by a Generate stage. It also
supplies the value to substitute for a column that causes an error (whether written or read).
v Data Format. Specifies the data representation format of a column. Choose from:
– binary
– text
For decimals, binary means packed. Text represents a decimal in a string format with a leading
space or ’-’ followed by decimal digits with an embedded decimal point if the scale is not zero. The
destination string format is: [+ | -]ddd.[ddd] and any precision and scale arguments are ignored.
Float type
v C_format. Perform non-default conversion of data from a string to floating-point data. This property
specifies a C-language format string used for reading floating point strings. This is passed to sscanf().
v Character Set. Choose from ASCII or EBCDIC.
v Default. The default value for a column. This is used for data written by a Generate stage. It also
supplies the value to substitute for a column that causes an error (whether written or read).
v Data Format. Specifies the data representation format of a column. Choose from:
– binary
– text
v Field max width. The maximum number of bytes in a column represented as a string. Enter a number.
This is useful where you are storing numbers as text. If you are using a fixed-width
character set, you can calculate the length exactly. If you are using a variable-length character set,
calculate an adequate maximum width for your fields. Applies to fields of all data types except date,
time, timestamp, and raw; and record, subrec, or tagged if they contain at least one field of this type.
v Field width. The number of bytes in a column represented as a string. Enter a number. This is useful
where you are storing numbers as text. If you are using a fixed-width charset, you can calculate the
number of bytes exactly. If it’s a variable length encoding, base your calculation on the width and
frequency of your variable-width characters. Applies to fields of all data types except date, time,
timestamp, and raw; and record, subrec, or tagged if they contain at least one field of this type.
v In_format. Format string used for conversion of data from string to floating point. This is passed to
sscanf(). By default, WebSphere DataStage invokes the C sscanf() function to convert a numeric field
formatted as a string to floating point data. If this function does not output data in a satisfactory
format, you can specify the in_format property to pass formatting arguments to sscanf().
v Is link field. Selected to indicate that a column holds the length of another, variable-length column
of the record or of the tag value of a tagged record field.
v Out_format. Format string used for conversion of data from floating point to a string. This is passed to
sprintf(). By default, WebSphere DataStage invokes the C sprintf() function to convert a numeric field
formatted as floating point data to a string. If this function does not output data in a satisfactory
format, you can specify the out_format property to pass formatting arguments to sprintf(). A brief
illustration of such format strings is given after this list.
v Pad char. Specifies the pad character used when the floating point number is written to an external
string representation. Enter a character (single-byte for strings, can be multi-byte for ustrings) or choose
null or space. The pad character is used when the external string representation is larger than required
to hold the written field. In this case, the external string is filled with the pad character to its full
length. Space is the default.
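As a rough illustration of the kind of format strings that In_format and Out_format accept, the fragment
below applies C-style conversions to a floating-point value. Python's % operator is used only because it
follows the same printf conventions; in the product the conversion is performed by the C sscanf() and
sprintf() functions, and the particular format strings shown are just examples.

# Illustrative only: printf-style formatting comparable to an out_format value.
value = 3.14159
print("%010.4f" % value)   # -> 00003.1416 (width 10, zero-padded, 4 decimal places)
print("%e" % value)        # -> 3.141590e+00 (scientific notation)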
Nullable
This appears for nullable fields.
v Actual field length. Specifies the number of bytes to fill with the Fill character when a field is
identified as null. When WebSphere DataStage identifies a null field, it will write a field of this length
full of Fill characters. This is mutually exclusive with Null field value.
v Null field length. The length in bytes of a variable-length field that contains a null. When a
variable-length field is read, a length of null field length in the source field indicates that it contains a
null. When a variable-length field is written, WebSphere DataStage writes a length value of null field
length if the field contains a null. This property is mutually exclusive with null field value.
v Null field value. Specifies the value given to a null field if the source is set to null. Can be a number,
string, or C-type literal escape character. For example, you can represent a byte value by \ooo, where
each o is an octal digit.
Generator
If the column is being used in a Row Generator or Column Generator stage, this allows you to specify
extra details about the mock data being generated. The exact fields that appear depend on the data type
of the column being generated. They allow you to specify features of the data being generated, for
example, for integers they allow you to specify if values are random or whether they cycle. If they cycle
you can specify an initial value, an increment, and a limit. If they are random, you can specify a seed
value for the random number generator, whether to include negative numbers, and a limit.
The Generate options available for the various data types are as follows:
v Date. Cycle (Increment, Initial value, Limit) or Random (Limit, Seed, Signed), plus Epoch, Percent
invalid, and Use current date.
v Time. Cycle (Increment, Initial value, Limit) or Random (Limit, Seed, Signed), plus Scale factor and
Percent invalid.
v Timestamp. Cycle (Increment, Initial value, Limit) or Random (Limit, Seed, Signed), plus Epoch,
Percent invalid, and Use current date.
v Integer. Cycle (Increment, Initial value, Limit) or Random (Limit, Seed, Signed).
v Decimal. Cycle (Increment, Initial value, Limit) or Random (Limit, Seed, Signed), plus Percent zero and
Percent invalid.
v Float. Cycle (Increment, Initial value, Limit) or Random (Limit, Seed, Signed).
All data types other than string have two types of operation, cycle and random:
v Cycle. The cycle option generates a repeating pattern of values for a column. It has the following
optional dependent properties (a sketch of the cycle operation is given after this list):
– Increment. The increment value added to produce the field value in the next output record. The
default value is 1 (integer) or 1.0 (float).
– Initial value. The initial field value (the value of the first output record). The default value is 0.
– Limit. The maximum field value. When the generated field value is greater than Limit, it wraps back
to Initial value. The default value of Limit is the maximum allowable value for the field’s data type.
v Random. The random option generates random values for a column. As noted above, you can specify
a seed value for the random number generator, a limit, and whether negative (signed) values are
included.
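The following sketch (illustrative only, not product code) shows the cycle behavior described above for a
numeric column, using invented values for Increment, Initial value, and Limit.

def cycle_values(initial=0, increment=1, limit=4, count=10):
    # Generates a repeating pattern: wraps back to the initial value
    # once the generated value exceeds the limit.
    value = initial
    for _ in range(count):
        yield value
        value += increment
        if value > limit:
            value = initial

print(list(cycle_values()))   # [0, 1, 2, 3, 4, 0, 1, 2, 3, 4]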
Strings
By default the generator stages initialize all bytes of a string field to the same alphanumeric character.
The stages use the following characters, in the following order:
abcdefghijklmnopqrstuvwxyz0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ
For example, a string with a length of 5 would produce successive string fields with the
values:
aaaaa
bbbbb
ccccc
ddddd
...
After the last character, capital Z, values wrap back to lowercase a and the cycle repeats.
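A minimal sketch of this default behavior (illustrative only, not product code):

import itertools

ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"

def default_string_values(length=5):
    # Every byte of each generated field is set to the same character,
    # cycling through the alphabet above and wrapping after capital Z.
    for ch in itertools.cycle(ALPHABET):
        yield ch * length

gen = default_string_values()
print([next(gen) for _ in range(3)])   # ['aaaaa', 'bbbbb', 'ccccc']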
You can also use the algorithm property to determine how string values are generated. This has two
possible values, cycle and alphabet:
v Cycle. Values are assigned to a generated string field as a set of discrete string values to cycle through.
This has the following dependent property:
– Values. Repeat this property to specify the string values that the generated data cycles through.
v Alphabet. Values are assigned to a generated string field as a character string each of whose characters
is taken in turn. This is like the default mode of operation except that you can specify the string cycled
through using the dependent property String.
Decimal
As well as the Type property, decimal columns have the following properties:
v Percent invalid. The percentage of generated columns that will contain invalid values. Set to 10% by
default.
v Percent zero. The percentage of generated decimal columns where all bytes of the decimal are set to
binary zero (0x00). Set to 10% by default.
Date
As well as the Type property, date columns have the following properties:
v Epoch. Use this to specify the earliest generated date value, in the format yyyy-mm-dd (leading zeros
must be supplied for all parts). The default is 1960-01-01.
Time
As well as the Type property, time columns have the following properties:
v Percent invalid. The percentage of generated columns that will contain invalid values. Set to 10% by
default.
v Scale factor. Specifies a multiplier to the increment value for time. For example, a scale factor of 60
and an increment of 1 means the field increments by 60 seconds.
Timestamp
As well as the Type property, timestamp columns have the following properties:
v Epoch. Use this to specify the earliest generated date value, in the format yyyy-mm-dd (leading zeros
must be supplied for all parts). The default is 1960-01-01.
v Use current date. Set this to generate today’s date in this column for every row generated. If you set
this all other properties are ignored.
v Percent invalid. The percentage of generated columns that will contain invalid values. Set to 10% by
default.
v Scale factor. Specifies a multiplier to the increment value for time. For example, a scale factor of 60
and an increment of 1 means the field increments by 60 seconds.
Vectors
If the row you are editing represents a column which is a variable length vector, tick the Variable check
box. The Vector properties appear; these give the size of the vector in one of two ways:
v Link Field Reference. The name of a column containing the number of elements in the variable length
vector. This should have an integer or float type, and have its Is Link field property set.
v Vector prefix. Specifies 1-, 2-, or 4-byte prefix containing the number of elements in the vector.
If the row you are editing represents a column which is a vector of known length, enter the number of
elements in the Vector Occurs box.
Subrecords
If the row you are editing represents a column which is part of a subrecord the Level Number column
indicates the level of the column within the subrecord structure.
If you specify Level numbers for columns, the column immediately preceding will be identified as a
subrecord. Subrecords can be nested, so can contain further subrecords with higher level numbers (i.e.,
level 06 is nested within level 05). Subrecord fields have a Tagged check box to indicate that this is a
tagged subrecord.
Extended
For certain data types the Extended check box appears to allow you to modify the data type as follows:
v Char, VarChar, LongVarChar. Select to specify that the underlying data type is a ustring.
v Time. Select to indicate that the time field includes microseconds.
v Timestamp. Select to indicate that the timestamp field includes microseconds.
Advanced tab
The Advanced tab allows you to specify how WebSphere DataStage buffers data being input to this stage.
By default WebSphere DataStage buffers data in such a way that no deadlocks can arise; a deadlock being
the situation where a number of stages are mutually dependent, and are waiting for input from another
stage and cannot output until they have received it.
The size and operation of the buffer are usually the same for all links on all stages (the default values
that the settings take can be set using environment variables).
The Advanced tab allows you to specify buffer settings on a per-link basis. You should only change the
settings if you fully understand the consequences of your actions (otherwise you might cause deadlock
situations to arise).
Any changes you make on this tab will automatically be reflected in the Output Page Advanced tab of
the stage at the other end of this link.
If you choose the Auto buffer or Buffer options, you can also set the values of the various buffering
parameters:
v Maximum memory buffer size (bytes). Specifies the maximum amount of virtual memory, in bytes,
used per buffer. The default size is 3145728 (3 MB).
v Buffer free run (percent). Specifies how much of the available in-memory buffer to consume before the
buffer resists. This is expressed as a percentage of Maximum memory buffer size. When the amount of
data in the buffer is less than this value, new data is accepted automatically. When the data exceeds it,
the buffer first tries to write some of the data it contains before accepting more.
The default value is 50% of the Maximum memory buffer size. You can set it to greater than 100%, in
which case the buffer continues to store data up to the indicated multiple of Maximum memory buffer
size before writing to disk.
v Queue upper bound size (bytes). Specifies the maximum amount of data buffered at any time using
both memory and disk. The default value is zero, meaning that the buffer size is limited only by the
available disk space as specified in the configuration file (resource scratchdisk). If you set Queue upper
bound size (bytes) to a non-zero value, the amount of data stored in the buffer will not exceed this
value (in bytes) plus one block (where the data stored in a block cannot exceed 32 KB).
If you set Queue upper bound size to a value equal to or slightly less than Maximum memory buffer
size, and set Buffer free run to 1.0, you will create a finite capacity buffer that will not write to disk.
However, the size of the buffer is limited by the virtual memory of your system and you can create
deadlock if the buffer becomes full.
v Disk write increment (bytes). Sets the size, in bytes, of blocks of data being moved to/from disk by
the buffering operator. The default is 1048576 (1 MB). Adjusting this value trades amount of disk access
against throughput for small amounts of data. Increasing the block size reduces disk access, but may
decrease performance when data is being read/written in smaller units. Decreasing the block size
increases throughput, but may increase the amount of disk access. A sketch of how these settings
interact is given below.
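To illustrate how these settings relate to one another, the following sketch works through the default
values. It is a simplified arithmetic model for illustration only, not a description of the buffering
operator's internals.

# Illustrative only: how the default buffer settings relate to each other.
max_memory_buffer = 3_145_728          # Maximum memory buffer size: 3 MB
buffer_free_run   = 0.5                # Buffer free run: 50%
disk_write_inc    = 1_048_576          # Disk write increment: 1 MB

# Below this threshold, newly arriving data is accepted without resistance.
free_run_threshold = max_memory_buffer * buffer_free_run
print(free_run_threshold)              # 1572864.0 bytes (1.5 MB)

# A free run greater than 100% lets the buffer grow to a multiple of the
# maximum memory buffer size before it starts writing to disk.
print(max_memory_buffer * 1.5)         # 4718592.0 bytes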
Output page
The Output page gives information about links going out of a stage. In the case of a file or database stage
an output link carries data being read from the file or database. In the case of a processing or restructure
stage it carries data that the stage has processed. Where there are no output links the stage editor has no
Output page.
Where it is present, the Output page contains various tabs depending on stage type. The only field the
Output page itself contains is Output name, which gives the name of the link being edited. Where a
stage has more than one output link, you can select the link you are editing from the Output name
drop-down list.
The Output page also has a Columns... button. Click Columns... to open a window showing column
names from the meta data defined for this link. You can drag these columns to various fields in the
Output page tabs as required.
Certain stage types will also have a View Data... button. Press this to view the actual data associated
with the specified data source or data target. The button is available if you have defined meta data for
the link.
The Sequential File stage has a Show File... button, rather than View Data... . This shows the flat file as it
has been created on disk.
General tab
The Output page always has a General tab. This allows you to enter an optional description of the link.
Specifying a description for each link enhances job maintainability.
Properties tab
Some types of file and database stages can have properties that are particular to specific output links. In
this case the Output page has a Properties tab. This has the same format as the Stage page Properties tab
(see "Properties tab").
Format tab
Stages that read from certain types of file (e.g., the Sequential File stage) also have a Format tab which
allows you to specify the format of the file or files being read from.
The Format tab is similar in structure to the Properties tab. A flat file has a number of properties for
which you can set different attributes. Select the property in the tree and select the attributes you want to
set from the Available properties to add window; it will then appear as a dependent property in the
property tree and you can set its value as required. This tab sets the format information for the file at row
level. You can override the settings for individual columns using the Edit Column Metadata dialog box
(see Field Level).
Format details are also stored with table definitions, and you can use the Load... button to load a format
from a table definition stored in the Repository.
The short-cut menu from the property tree gives access to the following functions:
v Format as. This applies a predefined template of properties (for example, a fixed-width column format,
DOS newlines as row delimiters, or a COBOL format file).
Details of the properties you can set are given in the chapter describing the individual stage editors:
v Sequential File stage - Output Link Format Tab
v File Set stage - Output Link Format Tab
v External Source stage - Format Tab
v Column Import stage - Properties Tab
Columns tab
The Output page always has a Columns tab. This displays the column meta data for the selected output
link in a grid.
If runtime column propagation is enabled in the WebSphere DataStage Administrator, you can select the
Runtime column propagation option to specify that columns encountered by the stage can be used even
if they are not explicitly defined in the meta data. There are some special considerations when using
runtime column propagation with certain stage types:
v Sequential File
v File Set
v External Source
v External Target
If the selected output link is a reject link, the column meta data grid is read only and cannot be modified.
If you click in a row and select Edit Row... from the shortcut menu, the Edit Column meta data dialog
box appears, which allows you to edit the row details in a dialog box format. It also has a Parallel tab
which allows you to specify properties that are peculiar to parallel job column definitions. The properties
you can specify here are the same as those specified for input links.
Mapping tab
For processing and restructure stages the Mapping tab allows you to specify how the output columns are
derived, i.e., what input columns map onto them or how they are generated.
The left pane shows the input columns and/or the generated columns. These are read only and cannot be
modified on this tab. These columns represent the data that the stage has produced after it has processed
the input data.
The right pane shows the output columns for each link. This has a Derivations field where you can
specify how the column is derived. You can fill it in by dragging input columns over, or by using the
Auto-match facility. If you have not yet defined any output column definitions, dragging columns over
will define them for you. If you have already defined output column definitions, WebSphere DataStage
performs the mapping for you as far as possible: you can do this explicitly using the auto-match facility,
or implicitly by just visiting the Mapping tab and clicking OK (which is the equivalent of auto-matching
on name).
There is also a shortcut menu which gives access to a range of column selection and editing functions,
including the facilities for selecting multiple columns and editing multiple derivations (this functionality
is described in the Transformer chapter).
You may choose not to map all the left hand columns, for example if your output data is a subset of your
input data, but be aware that, if you have Runtime Column Propagation turned on for that link, the data
you have not mapped will appear on the output link anyway.
You can also perform mapping without actually opening the stage editor. Select the stage in the Designer
canvas and choose Auto-map from the shortcut menu.
In the above example the left pane represents the data after it has been joined. The Expression field
shows how the column has been derived, the Column Name shows the column after it has been joined.
The right pane represents the data being output by the stage after the join. In this example the data has
been mapped straight across.
More details about mapping operations for the different stages are given in the individual stage
descriptions.
A shortcut menu can be invoked from the right pane that allows you to:
v Find and replace column names.
v Validate a derivation you have entered.
v Clear an existing derivation.
v Append a new column.
v Select all columns.
v Insert a new column at the current position.
v Delete the selected column or columns.
v Cut and copy columns.
v Paste a whole column.
v Paste just the derivation from a column.
The Find button opens a dialog box which allows you to search for particular output columns.
The Auto-Match button opens a dialog box which will automatically map left pane columns onto right
pane columns according to the specified criteria.
Select Location match to map input columns onto the output ones occupying the equivalent position.
Select Name match to match by names. You can specify that all columns are to be mapped by name, or
only the ones you have selected. You can also specify that prefixes and suffixes are ignored for input and
output columns, and that case can be ignored.
Advanced tab
The Advanced tab allows you to specify how WebSphere DataStage buffers data being output from this
stage. By default WebSphere DataStage buffers data in such a way that no deadlocks can arise; a
deadlock being the situation where a number of stages are mutually dependent, and are waiting for input
from another stage and cannot output until they have received it.
The size and operation of the buffer are usually the same for all links on all stages (the default values
that the settings take can be set using environment variables - see WebSphere DataStage Installation and
Configuration Guide).
Any changes you make on this tab will automatically be reflected in the Input page Advanced tab of the
stage at the other end of this link.
If you choose the Auto buffer or Buffer options, you can also set the values of the various buffering
parameters:
v Maximum memory buffer size (bytes). Specifies the maximum amount of virtual memory, in bytes,
used per buffer. The default size is 3145728 (3 MB).
v Buffer free run (percent). Specifies how much of the available in-memory buffer to consume before the
buffer resists. This is expressed as a percentage of Maximum memory buffer size. When the amount of
data in the buffer is less than this value, new data is accepted automatically. When the data exceeds it,
the buffer first tries to write some of the data it contains before accepting more.
The default value is 50% of the Maximum memory buffer size. You can set it to greater than 100%, in
which case the buffer continues to store data up to the indicated multiple of Maximum memory buffer
size before writing to disk.
v Queue upper bound size (bytes). Specifies the maximum amount of data buffered at any time using
both memory and disk. The default value is zero, meaning that the buffer size is limited only by the
available disk space as specified in the configuration file (resource scratchdisk). If you set Queue upper
bound size (bytes) to a non-zero value, the amount of data stored in the buffer will not exceed this
value (in bytes) plus one block (where the data stored in a block cannot exceed 32 KB).
If you set Queue upper bound size to a value equal to or slightly less than Maximum memory buffer
size, and set Buffer free run to 1.0, you will create a finite capacity buffer that will not write to disk.
However, the size of the buffer is limited by the virtual memory of your system and you can create
deadlock if the buffer becomes full.
v Disk write increment (bytes). Sets the size, in bytes, of blocks of data being moved to/from disk by
the buffering operator. The default is 1048576 (1 MB). Adjusting this value trades amount of disk access
against throughput for small amounts of data. Increasing the block size reduces disk access, but may
decrease performance when data is being read/written in smaller units. Decreasing the block size
increases throughput, but may increase the amount of disk access.
What is a data set? Parallel jobs use data sets to manage data within a job. You can think of each link in a
job as carrying a data set. The Data Set stage allows you to store data being operated on in a persistent
form, which can then be used by other WebSphere DataStage jobs. Data sets are operating system files,
each referred to by a control file, which by convention has the suffix .ds. Using data sets wisely can be
key to good performance in a set of linked jobs. You can also manage data sets independently of a job
using the Data Set Management utility, available from the WebSphere DataStage Designer or Director, see
Chapter 53, “Managing data sets,” on page 547.
The stage editor has up to three pages, depending on whether you are reading or writing a data set:
v Stage Page. This is always present and is used to specify general information about the stage.
v Input Page. This is present when you are writing to a data set. This is where you specify details about
the data set being written to.
v Output Page. This is present when you are reading from a data set. This is where you specify details
about the data set being read from.
Must do’s
WebSphere DataStage has many defaults which means that it can be very easy to include Data Set stages
in a job. This section specifies the minimum steps to take to get a Data Set stage functioning. WebSphere
DataStage provides a versatile user interface, and there are many shortcuts to achieving a particular end;
this section describes the basic methods. You will learn where the shortcuts are as you become familiar
with the product.
The steps required depend on whether you are using the Data Set stage to read or write a data set.
Stage page
The General tab allows you to specify an optional description of the stage. The Advanced tab allows you
to specify how the stage executes.
Advanced tab
This tab allows you to specify the following:
v Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the
contents of the data set are processed by the available nodes as specified in the Configuration file, and
by any node constraints specified on the Advanced tab. In Sequential mode the entire contents of the
data set are processed by the conductor node.
v Combinability mode. This is Auto by default, which allows WebSphere DataStage to combine the
operators that underlie parallel stages so that they run in the same process if it is sensible for this type
of stage.
v Preserve partitioning. You can select Propagate, Set, or Clear. If you select Set, file read operations will
request that the next stage preserves the partitioning as is. Propagate takes the setting of the flag from
the previous stage.
v Node pool and resource constraints. Select this option to constrain parallel execution to the node pool
or pools and/or resource pool or pools specified in the grid. The grid allows you to make choices from
drop down lists populated from the Configuration file.
v Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node
map. You can define a node map by typing node numbers into the text box or by clicking the browse
button to open the Available Nodes dialog box and selecting nodes from there. You are effectively
defining a new node pool for this stage (in addition to any node pools defined in the Configuration
file).
Input page
The Input page allows you to specify details about how the Data Set stage writes data to a data set. The
Data Set stage can have only one input link.
The General tab allows you to specify an optional description of the input link. The Properties tab allows
you to specify details of exactly what the link does. The Columns tab specifies the column definitions of
the data. The Advanced tab allows you to change the default buffering settings for the input link.
Details about Data Set stage properties are given in the following sections. See ″Stage Editors,″ for a
general description of the other tabs.
The following table gives a quick reference list of the properties and their attributes. A more detailed
description of each property follows:
Category/Property      Values                                    Default     Mandatory?   Repeats?   Dependent of
Target/File            pathname                                  N/A         Y            N          N/A
Target/Update Policy   Append/Create (Error if exists)/          Overwrite   Y            N          N/A
                       Overwrite/Use existing (Discard
                       records)/Use existing (Discard
                       records and schema)
Target category
File
The name of the control file for the data set. You can browse for the file or enter a job parameter. By
convention, the file has the suffix .ds.
Update Policy
Specifies what action will be taken if the data set you are writing to already exists. Choose from:
v Append. Append any new data to the existing data.
v Create (Error if exists). WebSphere DataStage reports an error if the data set already exists.
v Overwrite. Overwrites any existing data with new data.
v Use existing (Discard records). Keeps the existing data and discards any new data.
v Use existing (Discard records and schema). Keeps the existing data and discards any new data and its
associated schema.
Partitioning tab
The Partitioning tab allows you to specify details about how the incoming data is partitioned or collected
before it is written to the data set. It also allows you to specify that the data should be sorted before
being written.
By default the stage partitions in Auto mode. This attempts to work out the best partitioning method
depending on execution modes of current and preceding stages and how many nodes are specified in the
Configuration file.
If the Data Set stage is operating in sequential mode, it will first collect the data before writing it to the
file using the default Auto collection method.
If the Data Set stage is set to execute in parallel, then you can set a partitioning method by selecting from
the Partition type drop-down list. This will override any current partitioning.
If the Data Set stage is set to execute in sequential mode, but the preceding stage is executing in parallel,
then you can set a collection method from the Collector type drop-down list. This will override the
default auto collection method.
The Partitioning tab also allows you to specify that data arriving on the input link should be sorted
before being written to the data set. The sort is always carried out within data partitions. If the stage is
partitioning incoming data the sort occurs after the partitioning. If the stage is collecting data, the sort
occurs before the collection. The availability of sorting depends on the partitioning or collecting method
chosen (it is not available with Auto methods).
If NLS is enabled an additional button opens a dialog box allowing you to select a locale specifying the
collate convention for the sort.
You can also specify sort direction, case sensitivity, whether sorted as ASCII or EBCDIC, and whether null
columns will appear first or last for each column. Where you are using a keyed partitioning method, you
can also specify whether the column is used as a key for sorting, for partitioning, or for both. Select the
column in the Selected list and right-click to invoke the shortcut menu.
Output page
The Output page allows you to specify details about how the Data Set stage reads data from a data set.
The Data Set stage can have only one output link.
The General tab allows you to specify an optional description of the output link. The Properties tab
allows you to specify details of exactly what the link does. The Columns tab specifies the column
definitions of incoming data. The Advanced tab allows you to change the default buffering settings for
the output link.
Details about Data Set stage properties and formatting are given in the following sections. See ″Stage
Editors,″ for a general description of the other tabs.
Category/Property      Values      Default   Mandatory?   Repeats?   Dependent of
Source/File            pathname    N/A       Y            N          N/A
Source category
File
The name of the control file for the data set. You can browse for the file or enter a job parameter. By
convention the file has the suffix .ds.
When you edit a Sequential File stage, the Sequential File stage editor appears. This is based on the
generic stage editor described in ″Stage Editors.″
The stage executes in parallel if writing to multiple files, but executes sequentially if writing to a single
file. Each file is written by a single node, although a single node can write more than one file.
When reading or writing a flat file, WebSphere DataStage needs to know something about the format of
the file. The information required is how the file is divided into rows and how rows are divided into
columns. You specify this on the Format tab. Settings for individual columns can be overridden on the
Columns tab using the Edit Column Metadata dialog box.
The stage editor has up to three pages, depending on whether you are reading or writing a file:
v Stage Page. This is always present and is used to specify general information about the stage.
v Input Page. This is present when you are writing to a flat file. This is where you specify details about
the file or files being written to.
v Output Page. This is present when you are reading from a flat file and/or have a reject link. This is
where you specify details about the file or files being read from.
There are one or two special points to note about using runtime column propagation (RCP) with
Sequential stages. See page 98 for details.
The Format tab is set as follows to define that the stage will write a file where each column is delimited
by a comma, there is no final delimiter, and any dates in the data are expected to have the format
mm/dd/yyyy with or without leading zeroes for month and day, rather than yyyy-mm-dd, which is the
default format:
The meta data for the file is defined in the Columns tab as follows:
Table 6. Example metadata for reading a sequential file
Column name   Key   SQL Type   Length   Nullable
OrderID       yes   char       2        no
Price               char       5        no
Quantity            char       2        no
Order_Date          char       10       no
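Given that metadata and the comma-delimited format described above, a row in the file might look like
the following (the values shown are purely illustrative):
10,09.99,02,12/26/2006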
The Format tab is set as follows to define that the stage is reading a fixed width file where each row is
delimited by a UNIX newline, and the columns have no delimiter:
The steps required depend on whether you are using the Sequential File stage to read or write a file.
Writing to a file
v In the Input Link Properties Tab specify the pathname of the file being written to (repeat this for
writing to multiple files). The other properties all have default values, which you can change or not as
required.
v In the Input Link Format Tab specify format details for the file(s) you are writing to, or accept the
defaults (variable length columns enclosed in double quotes and delimited by commas, rows delimited
with UNIX newlines).
v Ensure column meta data has been specified for the file(s) (this can be achieved via a schema file if
required).
Stage page
The General tab allows you to specify an optional description of the stage. The Advanced tab allows you
to specify how the stage executes. The NLS Map tab appears if you have NLS enabled on your system; it
allows you to specify a character set map for the stage.
Advanced tab
This tab allows you to specify the following:
v Execution Mode. The stage can execute in parallel mode or sequential mode. When a stage is reading
or writing a single file the Execution Mode is sequential and you cannot change it. When a stage is
reading or writing multiple files, the Execution Mode is parallel and you cannot change it. In parallel
mode, the files are processed by the available nodes as specified in the Configuration file, and by any
node constraints specified on the Advanced tab. In Sequential mode the entire contents of the file are
processed by the conductor node.
Input page
The Input page allows you to specify details about how the Sequential File stage writes data to one or
more flat files. The Sequential File stage can have only one input link, but this can write to multiple files.
The General tab allows you to specify an optional description of the input link. The Properties tab allows
you to specify details of exactly what the link does. The Partitioning tab allows you to specify how
incoming data is partitioned before being written to the file or files. The Format tab gives information
about the format of the files being written. The Columns tab specifies the column definitions of data
being written. The Advanced tab allows you to change the default buffering settings for the input link.
Details about Sequential File stage properties, partitioning, and formatting are given in the following
sections. See ″Stage Editors,″ for a general description of the other tabs.
The following table gives a quick reference list of the properties and their attributes. A more detailed
description of each property follows.
Category/Property         Values                     Default     Mandatory?   Repeats?   Dependent of
Target/File               pathname                   N/A         Y            Y          N/A
Target/File Update Mode   Append/Create/Overwrite    Overwrite   Y            N          N/A
Target category
File
This property defines the flat file that the incoming data will be written to. You can type in a pathname,
or browse for a file. You can specify multiple files by repeating the File property. Do this by selecting the
Properties item at the top of the tree, and clicking on File in the Available properties to add box. Do this
for each extra file you want to specify.
You must specify at least one file to be written to, which must exist unless you specify a File Update
Mode of Create or Overwrite.
File update mode
This property defines how the specified file or files are updated. The same method applies to all files
being written to. Choose from Append to append to existing files, Overwrite to overwrite existing files,
or Create to create a new file. If you specify the Create property for a file that already exists you will get
an error at runtime.
Options category
Cleanup on failure
This is set to True by default and specifies that the stage will delete any partially written files if the stage
fails for any reason. Set this to False to specify that partially written files should be left.
Reject mode
This specifies what happens to any data records that are not written to a file for some reason. Choose
from Continue to continue operation and discard any rejected rows, Fail to cease writing if any rows are
rejected, or Save to send rejected rows down a reject link.
Filter
This is an optional property. You can use this to specify that the data is passed through a filter program
before being written to the file or files. Specify the filter command, and any required arguments, in the
Property Value box.
Schema file
This is an optional property. By default the Sequential File stage will use the column definitions defined
on the Columns and Format tabs as a schema for writing to the file. You can, however, specify a file
containing a schema instead.
Partitioning tab
The Partitioning tab allows you to specify details about how the incoming data is partitioned or collected
before it is written to the file or files. It also allows you to specify that the data should be sorted before
being written.
By default the stage partitions in Auto mode. This attempts to work out the best partitioning method
depending on execution modes of current and preceding stages and how many nodes are specified in the
Configuration file.
If the Sequential File stage is operating in sequential mode, it will first collect the data before writing it to
the file using the default Auto collection method.
The Partitioning tab allows you to override this default behavior. The exact operation of this tab depends
on:
v Whether the Sequential File stage is set to execute in parallel or sequential mode.
v Whether the preceding stage in the job is set to execute in parallel or sequential mode.
If the Sequential File stage is set to execute in parallel (i.e., is writing to multiple files), then you can set a
partitioning method by selecting from the Partition type drop-down list. This will override any current
partitioning.
If the Sequential File stage is set to execute in sequential mode (i.e., is writing to a single file), but the
preceding stage is executing in parallel, then you can set a collection method from the Collector type
drop-down list. This will override the default auto collection method.
The Partitioning tab also allows you to specify that data arriving on the input link should be sorted
before being written to the file or files. The sort is always carried out within data partitions. If the stage is
partitioning incoming data the sort occurs after the partitioning. If the stage is collecting data, the sort
occurs before the collection. The availability of sorting depends on the partitioning or collecting method
chosen (it is not available with the Auto methods).
If NLS is enabled an additional button opens a dialog box allowing you to select a locale specifying the
collate convention for the sort.
You can also specify sort direction, case sensitivity, whether sorted as ASCII or EBCDIC, and whether null
columns will appear first or last for each column. Where you are using a keyed partitioning method, you
can also specify whether the column is used as a key for sorting, for partitioning, or for both. Select the
column in the Selected list and right-click to invoke the shortcut menu.
If you do not alter any of the Format settings, the Sequential File stage will produce a file of the
following format:
v File comprises variable length columns contained within double quotes.
v All columns are delimited by a comma, except for the final column in a row.
v Rows are delimited by a UNIX newline.
You can use the Format As item from the shortcut menu in the Format tab to quickly change to a
fixed-width column format, using DOS newlines as row delimiters, or producing a COBOL format file.
You can use the Defaults button to change your default settings. Use the Format tab to specify your
required settings, then click Defaults → Save current as default. All your sequential files will use your
settings by default from now on. If your requirements change, you can choose Defaults → Reset defaults
from factory settings to go back to the original defaults as described above. Once you have done this,
you then have to click Defaults → Set current from default for the new defaults to take effect.
To change individual properties, select a property type from the main tree then add the properties you
want to set to the tree structure by clicking on them in the Available properties to set window. You can
then set a value for that property in the Property Value box. Pop-up help for each of the available
properties appears if you hover the mouse pointer over it.
This description uses the terms ″record″ and ″row″ and ″field″ and ″column″ interchangeably.
The following sections list the property types and properties available for each type.
Record level
These properties define details about how data records are formatted in the flat file. Where you can enter
a character, this can usually be an ASCII character or a multi-byte Unicode character (if you have NLS
enabled). The available properties are:
Fill char
Specify an ASCII character or a value in the range 0 to 255. You can also choose Space or Null from a
drop-down list. This character is used to fill any gaps in a written record caused by column positioning
properties. Set to 0 by default (which is the NULL character). For example, to set it to space you could
also type in the space character or enter 32. Note that this value is restricted to one byte, so you cannot
specify a multi-byte Unicode character.
Final delimiter string
Specify a string to be written after the last column of a record in place of the column delimiter. Enter one
or more characters; this precedes the record delimiter if one is used. Mutually exclusive with Final
delimiter, which is the default. For example, if you set Delimiter to comma and Final delimiter string to `,
` (comma space - you do not need to enter the inverted commas) all fields are delimited by a comma,
except the final field, which is delimited by a comma followed by an ASCII space character.
Final delimiter
Specify a single character to be written after the last column of a record in place of the field delimiter.
Type a character or select one of whitespace, end, none, null, tab, or comma. See the note after this list
for an illustration.
v whitespace. The last column of each record will not include any trailing white spaces found at the end
of the record.
v end. The last column of each record does not include the field delimiter. This is the default setting.
v none. The last column of each record does not have a delimiter; used for fixed-width fields.
v null. The last column of each record is delimited by the ASCII null character.
v comma. The last column of each record is delimited by the ASCII comma character.
v tab. The last column of each record is delimited by the ASCII tab character.
[Diagram: final delimiter position, showing the record delimiter and field delimiter]
Intact
The intact property specifies an identifier of a partial schema. A partial schema specifies that only the
column(s) named in the schema can be modified by the stage. All other columns in the row are passed
through unmodified. The file containing the partial schema is specified in the Schema File property on
the Properties tab. This property has a dependent property, Check intact, but this is not relevant to input
links.
Record delimiter string
Specify a string to be written at the end of each record. Enter one or more characters. This is mutually
exclusive with Record delimiter (which is the default), Record type, and Record prefix.
Record delimiter
Specify a single character to be written at the end of each record. Type a character or select one of the
following:
v UNIX Newline (the default)
v null
(To implement a DOS newline, use the Record delimiter string property set to "\R\N" or choose Format
as → DOS line terminator from the shortcut menu.)
Note: Record delimiter is mutually exclusive with Record delimiter string, Record prefix, and Record
type.
Record length
Select Fixed where fixed length fields are being written. WebSphere DataStage calculates the appropriate
length for the record. Alternatively specify the length of fixed records as number of bytes. This is not
used by default (default files are comma-delimited). The record is padded to the specified length with
either zeros or the fill character if one has been specified.
Record Prefix
Specifies that a variable-length record is prefixed by a 1-, 2-, or 4-byte length prefix. It is set to 1 by
default. This is mutually exclusive with Record delimiter, which is the default, and record delimiter string
and record type.
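As an illustration only (not DataStage code), the following Python sketch shows what a record carried behind a 2-byte length prefix looks like. Big-endian byte order is assumed here; the actual ordering of the prefix bytes is governed by the Byte order property.

    import struct

    # A variable-length record preceded by a 2-byte length prefix.
    record = b'"1001","Smith, J","42.50"'
    prefixed = struct.pack(">H", len(record)) + record      # 2-byte length, then the record bytes

    (length,) = struct.unpack(">H", prefixed[:2])           # reading: length first, then that many bytes
    assert prefixed[2:2 + length] == record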
Record type
Specifies that data consists of variable-length blocked records (varying) or implicit records (implicit). If
you choose the implicit property, data is written as a stream with no explicit record boundaries. The end
of the record is inferred when all of the columns defined by the schema have been parsed. The varying
property allows you to specify one of the following IBM blocked or spanned formats: V, VB, VS, VBS, or
VR.
This property is mutually exclusive with Record length, Record delimiter, Record delimiter string, and
Record prefix and by default is not used.
Field defaults
Defines default properties for columns written to the file or files. These are applied to all columns
written, but can be overridden for individual columns from the Columns tab using the Edit Column
Metadata dialog box. Where you can enter a character, this can usually be an ASCII character or a
multi-byte Unicode character (if you have NLS enabled). The available properties are:
v Actual field length. Specifies the number of bytes to fill with the Fill character when a field is
identified as null. When WebSphere DataStage identifies a null field, it will write a field of this length
full of Fill characters. This is mutually exclusive with Null field value.
v Delimiter. Specifies the trailing delimiter of all fields in the record. Type an ASCII character or select
one of whitespace, end, none, null, comma, or tab.
– whitespace. Whitespace characters at the end of a column are ignored, i.e., are not treated as part of
the column.
– end. The end of a field is taken as the delimiter, i.e., there is no separate delimiter. This is not the
same as a setting of `None’ which is used for fields with fixed-width columns.
– none. No delimiter (used for fixed-width).
– null. ASCII Null character is used.
– comma. ASCII comma character is used.
– tab. ASCII tab character is used.
v Delimiter string. Specify a string to be written at the end of each field. Enter one or more characters.
This is mutually exclusive with Delimiter, which is the default. For example, specifying `, ` (comma
space - you do not need to enter the inverted commas) would have each field delimited by `, ` unless
overridden for individual fields.
v Null field length. The length in bytes of a variable-length field that contains a null. When a
variable-length field is written, WebSphere DataStage writes a length value of null field length if the
field contains a null. This property is mutually exclusive with null field value.
v Null field value. Specifies the value written to null field if the source is set to null. Can be a number,
string, or C-type literal escape character. For example, you can represent a byte value by \ooo, where
each o is an octal digit 0 - 7 and the first o is < 4, or by \xhh, where each h is a hexadecimal digit 0 - F.
You must use this form to encode non-printable byte values.
This property is mutually exclusive with Null field length and Actual length. For a fixed width data
representation, you can use Pad char (from the general section of Type defaults) to specify a repeated
trailing character if the value you specify is shorter than the fixed width of the field.
v Prefix bytes. Specifies that each column in the data file is prefixed by 1, 2, or 4 bytes containing, as a
binary value, either the column’s length or the tag value for a tagged field.
You can use this option with variable-length fields. Variable-length fields can be either delimited by a
character or preceded by a 1-, 2-, or 4-byte prefix containing the field length. WebSphere DataStage
inserts the prefix before each field.
This property is mutually exclusive with the Delimiter, Quote, and Final Delimiter properties, which
are used by default.
v Print field. This property is not relevant for input links.
v Quote. Specifies that variable length fields are enclosed in single quotes, double quotes, or another
character or pair of characters. Choose Single or Double, or enter a character. This is set to double
quotes by default.
When writing, WebSphere DataStage inserts the leading quote character, the data, and a trailing quote
character. Quote characters are not counted as part of a field’s length.
v Vector prefix. For fields that are variable length vectors, specifies a 1-, 2-, or 4-byte prefix containing
the number of elements in the vector. You can override this default prefix for individual vectors.
Variable-length vectors must use either a prefix on the vector or a link to another field in order to
specify the number of elements in the vector. If the variable-length vector has a prefix, you use this
property to indicate the prefix length.
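The following minimal Python sketch (illustration only) shows a vector prefix in practice: a 2-byte element count followed by the elements. The elements are assumed to be 32-bit integers, and big-endian byte order is assumed.

    import struct

    # A variable-length vector written with a 2-byte element-count prefix.
    values = [7, 11, 13]
    data = struct.pack(">H", len(values)) + struct.pack(">%di" % len(values), *values)

    (count,) = struct.unpack(">H", data[:2])                # the prefix gives the element count
    elements = struct.unpack(">%di" % count, data[2:2 + 4 * count])
    assert list(elements) == values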
Type defaults
These are properties that apply to all columns of a specific data type unless specifically overridden at the
column level. They are divided into a number of subgroups according to data type.
General
These properties apply to several data types (unless overridden at column level):
v Byte order. Specifies how multiple byte data types (except string and raw data types) are ordered.
Choose from:
– little-endian. The high byte is on the right.
– big-endian. The high byte is on the left.
– native-endian. As defined by the native format of the machine. This is the default.
v Data Format. Specifies the data representation format of a field. Applies to fields of all data types
except string, ustring, and raw and to record, subrec or tagged fields containing at least one field that
is neither string nor raw. Choose from:
– binary
– text (the default)
A setting of binary has different meanings when applied to different data types:
– For decimals, binary means packed.
– For other numerical data types, binary means ″not text″.
– For dates, binary is equivalent to specifying the julian property for the date field.
– For time, binary is equivalent to midnight_seconds.
– For timestamp, binary specifies that the first integer contains a Julian day count for the date portion
of the timestamp and the second integer specifies the time portion of the timestamp as the number
of seconds from midnight. A binary timestamp specifies that two 32-bit integers are written (a sketch of
this layout appears after this list).
By default data is formatted as text, as follows:
– For the date data type, text specifies that the data to be written contains a text-based date in the
form %yyyy-%mm-%dd or in the default date format if you have defined a new one on an NLS
system (see WebSphere DataStage NLS Guide).
– For the decimal data type: a field represents a decimal in a string format with a leading space or ’-’
followed by decimal digits with an embedded decimal point if the scale is not zero. The destination
string format is: [+ | -]ddd.[ddd] and any precision and scale arguments are ignored.
– For numeric fields (int8, int16, int32, uint8, uint16, uint32, sfloat, and dfloat): WebSphere DataStage
assumes that numeric fields are represented as text.
– For the time data type: text specifies that the field represents time in the text-based form
%hh:%nn:%ss or in the default date format if you have defined a new one on an NLS system (see
WebSphere DataStage NLS Guide).
– For the timestamp data type: text specifies a text-based timestamp in the form %yyyy-%mm-%dd
%hh:%nn:%ss or in the default date format if you have defined a new one on an NLS system (see
WebSphere DataStage NLS Guide).
v Field max width. The maximum number of bytes in a column represented as a string. Enter a number.
This is useful where you are storing numbers as text. If you are using a fixed-width character set, you
can calculate the length exactly. If you are using variable-length character set, calculate an adequate
maximum width for your fields. Applies to fields of all data types except date, time, timestamp, and
raw; and record, subrec, or tagged if they contain at least one field of this type.
v Field width. The number of bytes in a field represented as a string. Enter a number. This is useful
where you are storing numbers as text. If you are using a fixed-width character set, you can calculate the
length exactly. If you are using a variable-length character set, calculate an adequate width for your fields.
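The following minimal Python sketch (not the product's internal code) illustrates the binary timestamp layout described in the Data Format property above: a 32-bit Julian day count for the date followed by a 32-bit count of seconds since midnight. The Gregorian-ordinal-to-Julian-day offset and the big-endian byte order are assumptions made only for this illustration.

    import struct
    from datetime import datetime

    ts = datetime(2024, 3, 15, 14, 30, 5)
    julian_day = ts.date().toordinal() + 1721425          # proleptic Gregorian ordinal -> Julian day number
    midnight_seconds = ts.hour * 3600 + ts.minute * 60 + ts.second
    packed = struct.pack(">ii", julian_day, midnight_seconds)

    day, secs = struct.unpack(">ii", packed)
    print(day, secs)   # 2460385 52205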
String
These properties are applied to columns with a string data type, unless overridden at column level.
v Export EBCDIC as ASCII. Select this to specify that EBCDIC characters are written as ASCII
characters. Applies to fields of the string data type and record, subrec, or tagged fields if they contain
at least one field of this type.
v Import ASCII as EBCDIC. Not relevant for input links.
For ASCII-EBCDIC and EBCDIC-ASCII conversion tables, see WebSphere DataStage Developer’s Help.
Decimal
These properties are applied to columns with a decimal data type unless overridden at column level.
v Allow all zeros. Specifies whether to treat a packed decimal column containing all zeros (which is
normally illegal) as a valid representation of zero. Select Yes or No. The default is No.
v Decimal separator. Specify the ASCII character that acts as the decimal separator (period by default).
v Packed. Select an option to specify what the decimal columns contain, choose from:
– Yes to specify that the decimal columns contain data in packed decimal format (the default). This
has the following sub-properties:
Check. Select Yes to verify that data is packed, or No to not verify.
Signed. Select Yes to use the existing sign when writing decimal columns. Select No to write a
positive sign (0xf) regardless of the columns’ actual sign value.
– No (separate) to specify that they contain unpacked decimal with a separate sign byte. This has the
following sub-property:
Sign Position. Choose leading or trailing as appropriate.
– No (zoned) to specify that they contain an unpacked decimal in either ASCII or EBCDIC text. This
has the following sub-property:
Sign Position. Choose leading or trailing as appropriate.
Numeric
These properties apply to integer and float fields unless overridden at column level.
v C_format. Perform non-default conversion of data from integer or floating-point data to a string. This
property specifies a C-language format string used for writing integer or floating point strings. This is
passed to sprintf(). For example, specifying a C-format of %x and a field width of 8 ensures that
integers are written as 8-byte hexadecimal strings.
v In_format. This property is not relevant for input links.
v Out_format. Format string used for conversion of data from integer or floating-point data to a string.
This is passed to sprintf(). By default, WebSphere DataStage invokes the C sprintf() function to convert a
numeric field formatted as either integer or floating point data to a string. If this function does not
output data in a satisfactory format, you can specify the out_format property to pass formatting
arguments to sprintf().
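As noted above, the format string is handed to the C sprintf() family. Python's % operator follows the same conventions for simple conversions, so the following sketch (illustration only) previews the effect of the example given above, a C-format of %x combined with a field width of 8.

    value = 4660
    print("%x" % value)     # 1234     -- plain hexadecimal conversion
    print("%08x" % value)   # 00001234 -- hexadecimal padded to a width of 8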
Date
These properties are applied to columns with a date data type unless overridden at column level. All of
these are incompatible with a Data Format setting of Text.
v Days since. Dates are written as a signed integer containing the number of days since the specified
date. Enter a date in the form %yyyy-%mm-%dd or in the default date format if you have defined a
new one on an NLS system (see WebSphere DataStage NLS Guide).
Time
These properties are applied to columns with a time data type unless overridden at column level. All of
these are incompatible with a Data Format setting of Text.
v Format string. Specifies the format of columns representing time as a string. For details about the
format, see “Time formats” on page 33.
v Is midnight seconds. Select this to specify that times are written as a binary 32-bit integer containing
the number of seconds elapsed from the previous midnight.
Timestamp
These properties are applied to columns with a timestamp data type unless overridden at column level.
v Format string. Specifies the format of a column representing a timestamp as a string. Defaults to
%yyyy-%mm-%dd %hh:%nn:%ss. The format combines the format for date strings and time strings. See
“Date formats” on page 30 and “Time formats” on page 33.
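The following minimal Python sketch (illustration only) renders the default timestamp format string shown above. The mapping of the format tokens to strftime directives is an assumption made purely for this illustration; the authoritative token definitions are in "Date formats" and "Time formats".

    from datetime import datetime

    TOKEN_MAP = {"%yyyy": "%Y", "%mm": "%m", "%dd": "%d",
                 "%hh": "%H", "%nn": "%M", "%ss": "%S"}

    def render(fmt, ts):
        # Translate the assumed tokens into strftime directives, then format.
        for token, directive in TOKEN_MAP.items():
            fmt = fmt.replace(token, directive)
        return ts.strftime(fmt)

    print(render("%yyyy-%mm-%dd %hh:%nn:%ss", datetime(2024, 3, 15, 14, 30, 5)))
    # 2024-03-15 14:30:05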
Output page
The Output page allows you to specify details about how the Sequential File stage reads data from one
or more flat files. The Sequential File stage can have only one output link, but this can read from multiple
files.
It can also have a single reject link. This is typically used when you are writing to a file and provides a
location where records that have failed to be written to a file for some reason can be sent. When you are
reading files, you can use a reject link as a destination for rows that do not match the expected column
definitions.
The Output name drop-down list allows you to choose whether you are looking at details of the main
output link (the stream link) or the reject link.
The General tab allows you to specify an optional description of the output link. The Properties tab
allows you to specify details of exactly what the link does. The Formats tab gives information about the
format of the files being read. The Columns tab specifies the column definitions of the data. The
Advanced tab allows you to change the default buffering settings for the output link.
Details about Sequential File stage properties and formatting are given in the following sections. See
Chapter 3, “Stage editors,” on page 37, for a general description of the other tabs.
The following table gives a quick reference list of the properties and their attributes. A more detailed
description of each property follows.
Source category
File
This property defines the flat file that data will be read from. You can type in a pathname, or browse for
a file. You can specify multiple files by repeating the File property. Do this by selecting the Properties
item at the top of the tree, and clicking on File in the Available properties to add window. Do this for
each extra file you want to specify.
File pattern
Specifies a group of files to import. Specify a file containing a list of files, or a job parameter representing
the file. The file could also contain any valid shell expression, in Bourne shell syntax, that generates a
list of file names.
Read method
This property specifies whether you are reading from a specific file or files or using a file pattern to select
files (e.g., *.txt).
Options category
Missing file mode
Specifies the action to take if one of your File properties has specified a file that does not exist. Choose
from Error to stop the job, OK to skip the file, or Depends, which means the default is Error, unless the
file has a node name prefix of *: in which case it is OK. The default is Depends.
Keep file partitions
Set this to True to partition the imported data set according to the organization of the input file(s). So, for
example, if you are reading three files you will have three partitions. Defaults to False.
Reject mode
Allows you to specify behavior if a read record does not match the expected schema. Choose from
Continue to continue operation and discard any rejected rows, Fail to cease reading if any rows are
rejected, or Save to send rejected rows down a reject link. Defaults to Continue.
Report progress
Choose Yes or No to enable or disable reporting. By default the stage displays a progress report at each
10% interval when it can ascertain file size. Reporting occurs only if the file is greater than 100 KB,
records are fixed length, and there is no filter on the file.
Filter
This is an optional property. You can use this to specify that the data is passed through a filter program
after being read from the files. Specify the filter command, and any required arguments, in the Property
Value box.
File name column
This is an optional property. It adds an extra column of type VarChar to the output of the stage,
containing the pathname of the file the record is read from. You should also add this column manually to
the Columns definitions to ensure that the column is not dropped if you are not using runtime column
propagation, or it is turned off at some point.
Number of readers per node
This is an optional property and only applies to files containing fixed-length records. It is mutually
exclusive with the Read from multiple nodes property. Specifies the number of instances of the file read
operator on a processing node. The default is one operator per node per input data file. If numReaders is
greater than one, each instance of the file read operator reads a contiguous range of records from the
input file. The starting record location in the file for each operator, or seek location, is determined by the
data file size, the record length, and the number of instances of the operator, as specified by numReaders.
The resulting data set contains one partition per instance of the file read operator, as determined by
numReaders.
This provides a way of partitioning the data contained in a single file. Each node reads a single file, but
the file can be divided according to the number of readers per node, and written to separate partitions.
This method can result in better I/O performance on an SMP system.
[Diagram: a single processing node running several readers, each reading a contiguous range of records from one file]
Read from multiple nodes
This is an optional property and only applies to files containing fixed-length records. It is mutually
exclusive with the Number of Readers Per Node property. Set this to Yes to allow individual files to be
read by several nodes. This can improve performance on a cluster system.
WebSphere DataStage knows the number of nodes available, and using the fixed length record size, and
the actual size of the file to be read, allocates the reader on each node a separate region within the file to
process. The regions will be of roughly equal size.
[Diagram: one file read as a partitioned data set, with each node's reader processing a separate region of the file]
Schema file
This is an optional property. By default the Sequential File stage will use the column definitions defined
on the Columns and Format tabs as a schema for reading the file. You can, however, specify a file
containing a schema instead (note, however, that if you have defined columns on the Columns tab, you
should ensure these match the schema file). Type in a pathname or browse for a schema file.
Reject Links
You cannot change the properties of a Reject link. The Properties tab for a reject link is blank.
If you do not alter any of the Format settings, the Sequential File stage will produce a file of the
following format:
v File comprises variable length columns contained within double quotes.
v All columns are delimited by a comma, except for the final column in a row.
v Rows are delimited by a UNIX newline.
You can use the Format As item from the shortcut menu in the Format tab to quickly change to a
fixed-width column format, using DOS newlines as row delimiters, or producing a COBOL format file.
You can use the Defaults button to change your default settings. Use the Format tab to specify your
required settings, then click Defaults → Save current as default. All your sequential files will use your
settings by default from now on. If your requirements change, you can choose Defaults → Reset defaults
from factory settings to go back to the original defaults as described above. Once you have done this,
you then have to click Defaults → Set current from default for the new defaults to take effect.
To change individual properties, select a property type from the main tree then add the properties you
want to set to the tree structure by clicking on them in the Available properties to set window. You can
then set a value for that property in the Property Value box. Pop-up help for each of the available
properties appears if you hover the mouse pointer over it.
Any property that you set on this tab can be overridden at the column level by setting properties for
individual columns on the Edit Column Metadata dialog box (see page Columns Tab).
This description uses the terms ″record″ and ″row″ and ″field″ and ″column″ interchangeably.
The following sections list the property types and properties available for each type.
Record level
These properties define details about how data records are formatted in the flat file. Where you can enter
a character, this can usually be an ASCII character or a multi-byte Unicode character (if you have NLS
enabled). The available properties are:
v Fill char. Does not apply to output links.
v Final delimiter string. Specify the string written after the last column of a record in place of the
column delimiter. Enter one or more characters, this precedes the record delimiter if one is used.
Mutually exclusive with Final delimiter, which is the default. For example, if you set Delimiter to
comma and Final delimiter string to `, ` (comma space - you do not need to enter the inverted
commas) all fields are delimited by a comma, except the final field, which is delimited by a comma
followed by an ASCII space character. WebSphere DataStage skips the specified delimiter string when
reading the file.
v Final delimiter. Specify the single character written after the last column of a record in place of the
field delimiter. Type a character or select one of whitespace, end, none, null, tab, or comma. WebSphere
DataStage skips the specified delimiter string when reading the file. See the following diagram for an
illustration.
[Diagram: final delimiter position relative to the field delimiters]
Field defaults
Defines default properties for columns read from the file or files. These are applied to all columns, but
can be overridden for individual columns from the Columns tab using the Edit Column Metadata dialog
box. Where you can enter a character, this can usually be an ASCII character or a multi-byte Unicode
character (if you have NLS enabled). The available properties are:
v Actual field length. Specifies the actual number of bytes to skip if the field’s length equals the setting
of the null field length property.
v Delimiter. Specifies the trailing delimiter of all fields in the record. Type an ASCII character or select
one of whitespace, end, none, null, comma, or tab. WebSphere DataStage skips the delimiter when
reading.
– whitespace. Whitespace characters at the end of a column are ignored, i.e., are not treated as part of
the column.
– end. The end of a field is taken as the delimiter, i.e., there is no separate delimiter. This is not the
same as a setting of `None’ which is used for fields with fixed-width columns.
– none. No delimiter (used for fixed-width).
– null. ASCII Null character is used.
– comma. ASCII comma character is used.
– tab. ASCII tab character is used.
v Delimiter string. Specify the string at the end of each field. Enter one or more characters. This is
mutually exclusive with Delimiter, which is the default. For example, specifying `, ` (comma space -
you do not need to enter the inverted commas) specifies each field is delimited by `, ` unless
overridden for individual fields. WebSphere DataStage skips the delimiter string when reading.
v Null field length. The length in bytes of a variable-length field that contains a null. When a
variable-length field is read, a length of null field length in the source field indicates that it contains a
null. This property is mutually exclusive with null field value.
v Null field value. Specifies the value given to a null field if the source is set to null. Can be a number,
string, or C-type literal escape character. For example, you can represent a byte value by \ooo, where
each o is an octal digit 0 - 7 and the first o is < 4, or by \xhh, where each h is a hexadecimal digit 0 - F.
You must use this form to encode non-printable byte values.
This property is mutually exclusive with Null field length and Actual length. For a fixed width data
representation, you can use Pad char (from the general section of Type defaults) to specify a repeated
trailing character if the value you specify is shorter than the fixed width of the field.
v Prefix bytes. You can use this option with variable-length fields. Variable-length fields can be either
delimited by a character or preceded by a 1-, 2-, or 4-byte prefix containing the field length. WebSphere
DataStage reads the length prefix but does not include the prefix as a separate field in the data set it
reads from the file.
This property is mutually exclusive with the Delimiter, Quote, and Final Delimiter properties, which
are used by default.
v Print field. This property is intended for use when debugging jobs. Set it to have WebSphere
DataStage produce a message for every field it reads. The message has the format:
Importing N: D
where:
– N is the field name.
– D is the imported data of the field. Non-printable characters contained in D are prefixed with an
escape character and written as C string literals; if the field contains binary data, it is output in octal
format.
v Quote. Specifies that variable length fields are enclosed in single quotes, double quotes, or another
character or pair of characters. Choose Single or Double, or enter a character. This is set to double
quotes by default.
Type Defaults
These are properties that apply to all columns of a specific data type unless specifically overridden at the
column level. They are divided into a number of subgroups according to data type.
General
These properties apply to several data types (unless overridden at column level):
v Byte order. Specifies how multiple byte data types (except string and raw data types) are ordered.
Choose from:
– little-endian. The high byte is on the right.
– big-endian. The high byte is on the left.
– native-endian. As defined by the native format of the machine. This is the default.
v Data Format. Specifies the data representation format of a field. Applies to fields of all data types
except string, ustring, and raw and to record, subrec or tagged fields containing at least one field that
is neither string nor raw. Choose from:
– binary
– text (the default)
A setting of binary has different meanings when applied to different data types:
– For decimals, binary means packed.
– For other numerical data types, binary means ″not text″.
– For dates, binary is equivalent to specifying the julian property for the date field.
– For time, binary is equivalent to midnight_seconds.
– For timestamp, binary specifies that the first integer contains a Julian day count for the date portion
of the timestamp and the second integer specifies the time portion of the timestamp as the number
of seconds from midnight. A binary timestamp specifies that two 32-bit integers are written.
By default data is formatted as text, as follows:
– For the date data type, text specifies that the data read contains a text-based date in the form
%yyyy-%mm-%dd or in the default date format if you have defined a new one on an NLS system
(see WebSphere DataStage NLS Guide).
– For the decimal data type: a field represents a decimal in a string format with a leading space or ’-’
followed by decimal digits with an embedded decimal point if the scale is not zero. The destination
string format is: [+ | -]ddd.[ddd] and any precision and scale arguments are ignored.
– For numeric fields (int8, int16, int32, uint8, uint16, uint32, sfloat, and dfloat): WebSphere DataStage
assumes that numeric fields are represented as text.
– For the time data type: text specifies that the field represents time in the text-based form
%hh:%nn:%ss or in the default date format if you have defined a new one on an NLS system (see
WebSphere DataStage NLS Guide).
– For the timestamp data type: text specifies a text-based timestamp in the form %yyyy-%mm-%dd
%hh:%nn:%ss or in the default date format if you have defined a new one on an NLS system (see
WebSphere DataStage NLS Guide).
String
These properties are applied to columns with a string data type, unless overridden at column level.
v Export EBCDIC as ASCII. Not relevant for output links.
v Import ASCII as EBCDIC. Select this to specify that ASCII characters are read as EBCDIC characters.
Decimal
These properties are applied to columns with a decimal data type unless overridden at column level.
v Allow all zeros. Specifies whether to treat a packed decimal column containing all zeros (which is
normally illegal) as a valid representation of zero. Select Yes or No. The default is No.
v Decimal separator. Specify the ASCII character that acts as the decimal separator (period by default).
v Packed. Select an option to specify what the decimal columns contain, choose from:
– Yes to specify that the decimal fields contain data in packed decimal format (the default). This has
the following sub-properties:
Check. Select Yes to verify that data is packed, or No to not verify.
Signed. Select Yes to use the existing sign when reading decimal fields. Select No to write a positive
sign (0xf) regardless of the fields’ actual sign value.
– No (separate) to specify that they contain unpacked decimal with a separate sign byte. This has the
following sub-property:
Sign Position. Choose leading or trailing as appropriate.
– No (zoned) to specify that they contain an unpacked decimal in either ASCII or EBCDIC text. This
has the following sub-property:
Sign Position. Choose leading or trailing as appropriate.
– No (overpunch) to specify that the field has a leading or end byte that contains a character which
specifies both the numeric value of that byte and whether the number as a whole is negatively or
positively signed. This has the following sub-property:
Sign Position. Choose leading or trailing as appropriate.
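The following minimal Python sketch (illustration only) encodes +12345 in a packed-decimal (COMP-3 style) layout: two digits per byte with the sign in the final low nibble, 0xF matching the "positive sign (0xf)" behavior described above. It is not guaranteed to be byte-identical to the stage's output.

    digits = "12345"
    nibbles = [int(ch) for ch in digits] + [0xF]          # digit nibbles, then the sign nibble
    if len(nibbles) % 2:
        nibbles.insert(0, 0)                              # pad to a whole number of bytes
    packed = bytes((hi << 4) | lo for hi, lo in zip(nibbles[0::2], nibbles[1::2]))
    print(packed.hex())   # 12345f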
Numeric
These properties apply to integer and float fields unless overridden at column level.
v C_format. Perform non-default conversion of data from string data to an integer or floating-point value. This
property specifies a C-language format string used for reading integer or floating point strings. This is
passed to sscanf(). For example, specifying a C-format of %x and a field width of 8 ensures that a 32-bit
integer is formatted as an 8-byte hexadecimal string.
v In_format. Format string used for conversion of data from string to integer or floating-point data. This
is passed to sscanf(). By default, WebSphere DataStage invokes the C sscanf() function to convert a
numeric field formatted as a string to either integer or floating point data. If this function does not
output data in a satisfactory format, you can specify the in_format property to pass formatting
arguments to sscanf().
v Out_format. This property is not relevant for output links.
Date
These properties are applied to columns with a date data type unless overridden at column level. All of
these are incompatible with a Data Format setting of Text.
v Days since. Dates are written as a signed integer containing the number of days since the specified
date. Enter a date in the form %yyyy-%mm-%dd or in the default date format if you have defined a
new one on an NLS system (see WebSphere DataStage NLS Guide).
v Format string. The string format of a date. By default this is %yyyy-%mm-%dd. For details about the
format, see “Date formats” on page 30.
v Is Julian. Select this to specify that dates are written as a numeric value containing the Julian day. A
Julian day specifies the date as the number of days from 4713 BCE January 1, 12:00 hours (noon) GMT.
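The following minimal Python sketch (illustration only) shows the two integer date representations described above. The base date chosen for "days since" and the ordinal-to-Julian-day offset are assumptions made purely for this illustration.

    from datetime import date

    d = date(2024, 3, 15)
    days_since = (d - date(1970, 1, 1)).days       # days since 1970-01-01 -> 19797
    julian_day = d.toordinal() + 1721425           # Julian day number -> 2460385
    print(days_since, julian_day)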
Time
These properties are applied to columns with a time data type unless overridden at column level. All of
these are incompatible with a Data Format setting of Text.
v Format string. Specifies the format of columns representing time as a string. By default this is
%hh-%mm-%ss. For details about the format, see “Time formats” on page 33.
v Is midnight seconds. Select this to specify that times are written as a binary 32-bit integer containing
the number of seconds elapsed from the previous midnight.
Timestamp
These properties are applied to columns with a timestamp data type unless overridden at column level.
v Format string. Specifies the format of a column representing a timestamp as a string. The format
combines the format for date strings and time strings. See “Date formats” on page 30 and “Time
formats” on page 33.
Using RCP with Sequential File stages
Sequential files, unlike most other data sources, do not have inherent column definitions, and so
WebSphere DataStage cannot always tell where there are extra columns that need propagating. You can
only use RCP on sequential files if you have used the Schema File property (see ″Schema File″ on page
Schema File and on page Schema File) to specify a schema which describes all the columns in the
sequential file. You need to specify the same schema file for any similar stages in the job where you want
to propagate columns. Stages that will require a schema file are:
v Sequential File
v File Set
v External Source
v External Target
v Column Import
v Column Export
What is a file set? WebSphere DataStage can generate and name exported files, write them to their
destination, and list the files it has generated in a file whose extension is, by convention, .fs. The data
files and the file that lists them are called a file set. This capability is useful because some operating
systems impose a 2 GB limit on the size of a file and you need to distribute files among nodes to prevent
overruns.
The amount of data that can be stored in each destination data file is limited by the characteristics of the
file system and the amount of free disk space available. The number of files created by a file set depends
on:
v The number of processing nodes in the default node pool
v The number of disks in the export or default disk pool connected to each processing node in the
default node pool
v The size of the partitions of the data set
The File Set stage enables you to create and write to file sets, and to read data back from file sets.
Unlike data sets, file sets carry formatting information that describes the format of the files to be read or
written.
When you edit a File Set stage, the File Set stage editor appears. This is based on the generic stage editor
described in “Stage Editors.”
The stage editor has up to three pages, depending on whether you are reading or writing a file set:
v Stage Page. This is always present and is used to specify general information about the stage.
v Input Page. This is present when you are writing to a file set. This is where you specify details about
the file set being written to.
v Output Page. This is present when you are reading from a file set. This is where you specify details
about the file set being read from.
There are one or two special points to note about using runtime column propagation (RCP) with File Set
stages. See ″Using RCP With File Set Stages″ for details.
Must do’s
WebSphere DataStage has many defaults, which means that it can be very easy to include File Set stages
in a job. This section specifies the minimum steps to take to get a File Set stage functioning. WebSphere
DataStage provides a versatile user interface, and there are many shortcuts to achieving a particular end;
this section describes the basic methods. You will learn where the shortcuts are as you become familiar
with the product.
The steps required depend on whether you are using the File Set stage to read or write a file.
Writing to a file
v In the Input Link Properties Tab specify the pathname of the file set being written to. The other
properties all have default values, which you can change or not as required.
v In the Input Link Format Tab specify format details for the file set you are writing to, or accept the
defaults (variable length columns enclosed in double quotes and delimited by commas, rows delimited
with UNIX newlines).
v Ensure column meta data has been specified for the file set.
Stage page
The General tab allows you to specify an optional description of the stage. The Advanced tab allows you
to specify how the stage executes. The NLS Map tab appears if you have NLS enabled on your system; it
allows you to specify a character set map for the stage.
Advanced tab
This tab allows you to specify the following:
v Execution Mode. This is set to parallel and cannot be changed.
v Combinability mode. This is Auto by default, which allows WebSphere DataStage to combine the
operators that underlie parallel stages so that they run in the same process if it is sensible for this type
of stage.
v Preserve partitioning. You can select Set or Clear. If you select Set, file set read operations will request
that the next stage preserves the partitioning as is (it is ignored for file set write operations).
Input page
The Input page allows you to specify details about how the File Set stage writes data to a file set. The
File Set stage can have only one input link.
The General tab allows you to specify an optional description of the input link. The Properties tab allows
you to specify details of exactly what the link does. The Partitioning tab allows you to specify how
incoming data is partitioned before being written to the file set. The Formats tab gives information about
the format of the files being written. The Columns tab specifies the column definitions of the data. The
Advanced tab allows you to change the default buffering settings for the input link.
Details about File Set stage properties, partitioning, and formatting are given in the following sections.
See ″Stage Editors,″ for a general description of the other tabs.
The following table gives a quick reference list of the properties and their attributes. A more detailed
description of each property follows.
Category/Property               Values                                      Default     Mandatory?  Repeats?  Dependent of
Target/File Set                 pathname                                    N/A         Y           N         N/A
Target/File Set Update Policy   Create (Error if exists)/Overwrite/         Overwrite   Y           N         N/A
                                Use Existing (Discard records)/
                                Use Existing (Discard schema & records)
Target/File Set Schema policy   Write/Omit                                  Write       Y           N         N/A
Target category
File set
This property defines the file set that the incoming data will be written to. You can type in a pathname
of, or browse for, a file set descriptor file (by convention ending in .fs).
File set update policy
Specifies what action will be taken if the file set you are writing to already exists. Choose from:
v Create (Error if exists)
v Overwrite
v Use Existing (Discard records)
v Use Existing (Discard schema & records)
File set schema policy
Specifies whether the schema should be written to the file set. Choose from Write or Omit. The default is
Write.
Options category
Cleanup on failure
This is set to True by default and specifies that the stage will delete any partially written files if the stage
fails for any reason. Set this to False to specify that partially written files should be left.
Single file per partition
Set this to True to specify that one file is written for each partition. The default is False.
Reject mode
Allows you to specify behavior if a record fails to be written for some reason. Choose from Continue to
continue operation and discard any rejected rows, Fail to cease writing if any rows are rejected, or Save
to send rejected rows down a reject link. Defaults to Continue.
Diskpool
This is an optional property. Specify the name of the disk pool into which to write the file set. You can
also specify a job parameter.
File prefix
This is an optional property. Specify a prefix for the name of the file set components. If you do not
specify a prefix, the system writes the following: export.username, where username is your login. You can
also specify a job parameter.
File suffix
This is an optional property. Specify a suffix for the name of the file set components. The suffix is omitted
by default.
This is an optional property. Specify the maximum file size in MB. The value must be equal to or greater
than 1.
Schema file
This is an optional property. By default the File Set stage will use the column definitions defined on the
Columns tab and formatting information from the Format tab as a schema for writing the file. You can,
however, specify a file containing a schema instead (note, however, that if you have defined columns on
the Columns tab, you should ensure these match the schema file). Type in a pathname or browse for a
schema file.
Partitioning tab
The Partitioning tab allows you to specify details about how the incoming data is partitioned or collected
before it is written to the file set. It also allows you to specify that the data should be sorted before being
written.
By default the stage partitions in Auto mode. This attempts to work out the best partitioning method
depending on execution modes of current and preceding stages and how many nodes are specified in the
Configuration file.
If the File Set stage is operating in sequential mode, it will first collect the data before writing it to the file
using the default Auto collection method.
The Partitioning tab allows you to override this default behavior. The exact operation of this tab depends
on:
v Whether the File Set stage is set to execute in parallel or sequential mode.
v Whether the preceding stage in the job is set to execute in parallel or sequential mode.
If the File Set stage is set to execute in parallel, then you can set a partitioning method by selecting from
the Partition type drop-down list. This will override any current partitioning.
If NLS is enabled an additional button opens a dialog box allowing you to select a locale specifying the
collate convention for the sort.
You can also specify sort direction, case sensitivity, whether sorted as ASCII or EBCDIC, and whether null
columns will appear first or last for each column. Where you are using a keyed partitioning method, you
can also specify whether the column is used as a key for sorting, for partitioning, or for both. Select the
column in the Selected list and right-click to invoke the shortcut menu.
If you do not alter any of the Format settings, the File Set stage will produce files of the following format:
v Files comprise variable length columns contained within double quotes.
v All columns are delimited by a comma, except for the final column in a row.
v Rows are delimited by a UNIX newline.
You can use the Format As item from the shortcut menu in the Format tab to quickly change to a
fixed-width column format, using DOS newlines as row delimiters, or producing a COBOL format file.
To change individual properties, select a property type from the main tree then add the properties you
want to set to the tree structure by clicking on them in the Available properties to set window. You can
then set a value for that property in the Property Value box. Pop-up help for each of the available
properties appears if you hover the mouse pointer over it.
Any property that you set on this tab can be overridden at the column level by setting properties for
individual columns on the Edit Column Metadata dialog box (see page Columns Tab).
This description uses the terms ″record″ and ″row″ and ″field″ and ″column″ interchangeably.
The following sections list the Property types and properties available for each type.
Record level
These properties define details about how data records are formatted in the flat file. Where you can enter
a character, this can usually be an ASCII character or a multi-byte Unicode character (if you have NLS
enabled). The available properties are:
Fill char
Specify an ASCII character or a value in the range 0 to 255. You can also choose Space or Null from a
drop-down list. This character is used to fill any gaps in a written record caused by column positioning
properties. Set to 0 by default (which is the NULL character). For example, to set it to space you could
also type in the space character or enter 32. Note that this value is restricted to one byte, so you cannot
specify a multi-byte Unicode character.
Final delimiter string
Specify a string to be written after the last column of a record in place of the column delimiter. Enter one
or more characters; this precedes the record delimiter if one is used. Mutually exclusive with Final
delimiter, which is the default. For example, if you set Delimiter to comma and Final delimiter string to `,
` (comma space - you do not need to enter the inverted commas) all fields are delimited by a comma,
except the final field, which is delimited by a comma followed by an ASCII space character.
Final delimiter
Specify a single character to be written after the last column of a record in place of the field delimiter.
Type a character or select one of whitespace, end, none, null, tab, or comma. See the following diagram
for an illustration.
v whitespace. The last column of each record will not include any trailing white spaces found at the end
of the record.
[Diagram: final delimiter position relative to the field delimiters]
When writing, a space is now inserted after every field except the last in the record. Previously, a space
was inserted after every field including the last. (If you want to revert to the pre-release 7.5 behavior of
inserting a space after the last field, set the APT_FINAL_DELIM_COMPATIBLE environment variable.)
Intact
The intact property specifies an identifier of a partial schema. A partial schema specifies that only the
column(s) named in the schema can be modified by the stage. All other columns in the row are passed
through unmodified. The file containing the partial schema is specified in the Schema File property on
the Properties tab. This property has a dependent property, Check intact, but this is not relevant to input
links.
Record delimiter string
Specify a string to be written at the end of each record. Enter one or more characters. This is mutually
exclusive with Record delimiter (which is the default), Record type, and Record prefix.
Record delimiter
Specify a single character to be written at the end of each record. Type a character or select one of the
following:
v UNIX Newline (the default)
v null
(To implement a DOS newline, use the Record delimiter string property set to "\R\N" or choose Format
as → DOS line terminator from the shortcut menu.)
Note: Record delimiter is mutually exclusive with Record delimiter string, Record prefix, and Record
type.
Record length
Select Fixed where fixed length fields are being written. WebSphere DataStage calculates the appropriate
length for the record. Alternatively specify the length of fixed records as number of bytes. This is not
used by default (default files are comma-delimited).
Record Prefix
Specifies that a variable-length record is prefixed by a 1-, 2-, or 4-byte length prefix. It is set to 1 by
default. This is mutually exclusive with Record delimiter, which is the default, and record delimiter string
and record type.
Record type
Specifies that data consists of variable-length blocked records (varying) or implicit records (implicit). If
you choose the implicit property, data is written as a stream with no explicit record boundaries. The end
of the record is inferred when all of the columns defined by the schema have been parsed. The varying
property allows you to specify one of the following IBM blocked or spanned formats: V, VB, VS, VBS, or
VR.
This property is mutually exclusive with Record length, Record delimiter, Record delimiter string, and
Record prefix and by default is not used.
Field defaults
Defines default properties for columns written to the file or files. These are applied to all columns
written, but can be overridden for individual columns from the Columns tab using the Edit Column
Metadata dialog box. Where you can enter a character, this can usually be an ASCII character or a
multi-byte Unicode character (if you have NLS enabled). The available properties are:
v Actual field length. Specifies the number of bytes to fill with the Fill character when a field is
identified as null. When WebSphere DataStage identifies a null field, it will write a field of this length
full of Fill characters. This is mutually exclusive with Null field value.
v Delimiter. Specifies the trailing delimiter of all fields in the record. Type an ASCII character or select
one of whitespace, end, none, null, comma, or tab.
– whitespace. Whitespace characters at the end of a column are ignored, i.e., are not treated as part of
the column.
– end. The end of a field is taken as the delimiter, i.e., there is no separate delimiter. This is not the
same as a setting of `None’ which is used for fields with fixed-width columns.
– none. No delimiter (used for fixed-width).
– null. ASCII Null character is used.
– comma. ASCII comma character is used.
– tab. ASCII tab character is used.
v Delimiter string. Specify a string to be written at the end of each field. Enter one or more characters.
This is mutually exclusive with Delimiter, which is the default. For example, specifying `, ` (comma
space - you do not need to enter the inverted commas) would have each field delimited by `, ` unless
overridden for individual fields.
v Null field length. The length in bytes of a variable-length field that contains a null. When a
variable-length field is written, WebSphere DataStage writes a length value of null field length if the
field contains a null. This property is mutually exclusive with null field value.
v Null field value. Specifies the value written to null field if the source is set to null. Can be a number,
string, or C-type literal escape character. For example, you can represent a byte value by \ooo, where
each o is an octal digit 0 - 7 and the first o is < 4, or by \xhh, where each h is a hexadecimal digit 0 - F.
You must use this form to encode non-printable byte values.
This property is mutually exclusive with Null field length and Actual length. For a fixed width data
representation, you can use Pad char (from the general section of Type defaults) to specify a repeated
trailing character if the value you specify is shorter than the fixed width of the field.
Type defaults
These are properties that apply to all columns of a specific data type unless specifically overridden at the
column level. They are divided into a number of subgroups according to data type.
General
These properties apply to several data types (unless overridden at column level):
v Byte order. Specifies how multiple byte data types (except string and raw data types) are ordered.
Choose from:
– little-endian. The high byte is on the right.
– big-endian. The high byte is on the left.
– native-endian. As defined by the native format of the machine. This is the default.
v Data Format. Specifies the data representation format of a field. Applies to fields of all data types
except string, ustring, and raw and to record, subrec or tagged fields containing at least one field that
is neither string nor raw. Choose from:
– binary
– text (the default)
A setting of binary has different meanings when applied to different data types:
– For decimals, binary means packed.
– For other numerical data types, binary means ″not text″.
– For dates, binary is equivalent to specifying the julian property for the date field.
– For time, binary is equivalent to midnight_seconds.
– For timestamp, binary specifies that the first integer contains a Julian day count for the date portion
of the timestamp and the second integer specifies the time portion of the timestamp as the number
of seconds from midnight. A binary timestamp specifies that two 32-bit integers are written.
By default data is formatted as text, as follows:
– For the date data type, text specifies that the data to be written contains a text-based date in the
form %yyyy-%mm-%dd or in the default date format if you have defined a new one on an NLS
system (see WebSphere DataStage NLS Guide).
String
These properties are applied to columns with a string data type, unless overridden at column level.
v Export EBCDIC as ASCII. Select this to specify that EBCDIC characters are written as ASCII
characters. Applies to fields of the string data type and record, subrec, or tagged fields if they contain
at least one field of this type.
v Import ASCII as EBCDIC. Not relevant for input links.
For ASCII-EBCDIC and EBCDIC-ASCII conversion tables, see WebSphere DataStage Developer’s Help.
Decimal
These properties are applied to columns with a decimal data type unless overridden at column level.
Numeric
These properties apply to integer and float fields unless overridden at column level.
Date
These properties are applied to columns with a date data type unless overridden at column level. All of
these are incompatible with a Data Format setting of Text.
v Days since. Dates are written as a signed integer containing the number of days since the specified
date. Enter a date in the form %yyyy-%mm-%dd or in the default date format if you have defined a
new one on an NLS system (see WebSphere DataStage NLS Guide).
v Format string. The string format of a date. By default this is %yyyy-%mm-%dd. For details about the
format, see “Date formats” on page 30.
v Is Julian. Select this to specify that dates are written as a numeric value containing the Julian day. A
Julian day specifies the date as the number of days from 4713 BCE January 1, 12:00 hours (noon) GMT.
Time
These properties are applied to columns with a time data type unless overridden at column level. All of
these are incompatible with a Data Format setting of Text.
v Format string. Specifies the format of columns representing time as a string. For details about the
format, see “Time formats” on page 33.
v Is midnight seconds. Select this to specify that times are written as a binary 32-bit integer containing
the number of seconds elapsed from the previous midnight.
Timestamp
These properties are applied to columns with a timestamp data type unless overridden at column level.
v Format string. Specifies the format of a column representing a timestamp as a string. Defaults to
%yyyy-%mm-%dd %hh:%nn:%ss. The format combines the format for date strings and time strings. See
“Date formats” on page 30 and “Time formats” on page 33.
Output page
The Output page allows you to specify details about how the File Set stage reads data from a file set. The
File Set stage can have only one output link. It can also have a single reject link, where rows that have
failed to be written or read for some reason can be sent. The Output name drop-down list allows you to
choose whether you are looking at details of the main output link (the stream link) or the reject link.
The General tab allows you to specify an optional description of the output link. The Properties tab
allows you to specify details of exactly what the link does. The Formats tab gives information about the
format of the files being read. The Columns tab specifies the column definitions of the data. The
Advanced tab allows you to change the default buffering settings for the output link.
Details about File Set stage properties and formatting are given in the following sections. See Chapter 3,
“Stage editors,” on page 37, for a general description of the other tabs.
The following table gives a quick reference list of the properties and their attributes. A more detailed
description of each property follows.
Category/Property                         Values               Default          Mandatory?  Repeats?  Dependent of
Source/File Set                           pathname             N/A              Y           N         N/A
Options/Keep file partitions              True/False           False            Y           N         N/A
Options/Reject mode                       Continue/Fail/Save   Continue         Y           N         N/A
Options/Report progress                   Yes/No               Yes              Y           N         N/A
Options/Filter                            command              N/A              N           N         N/A
Options/Schema file                       pathname             N/A              N           N         N/A
Options/Use Schema Defined in File Set    True/False           False            Y           N         N/A
Options/File Name Column                  column name          fileNameColumn   N           N         N/A
Source category
File set
This property defines the file set that the data will be read from. You can type in a pathname of, or
browse for, a file set descriptor file (by convention ending in .fs).
Options category
Keep file partitions
Set this to True to partition the read data set according to the organization of the input file(s). So, for
example, if you are reading three files you will have three partitions. Defaults to False.
Reject mode
Allows you to specify behavior for read rows that do not match the expected schema. Choose from
Continue to continue operation and discard any rejected rows, Fail to cease reading if any rows are
rejected, or Save to send rejected rows down a reject link. Defaults to Continue.
Report progress
Choose Yes or No to enable or disable reporting. By default the stage displays a progress report at each
10% interval when it can ascertain file size. Reporting occurs only if the file is greater than 100 KB,
records are fixed length, and there is no filter on the file.
Filter
This is an optional property. You can use this to specify that the data is passed through a filter program
after being read from the files. Specify the filter command, and any required arguments, in the Property
Value box. For example, you might specify a UNIX command such as grep, together with any arguments
it requires.
Schema file
This is an optional property. By default the File Set stage will use the column definitions defined on the
Columns and Format tabs as a schema for reading the file. You can, however, specify a file containing a
schema instead (note, however, that if you have defined columns on the Columns tab, you should ensure
these match the schema file). Type in a pathname or browse for a schema file. This property is mutually
exclusive with Use Schema Defined in File Set.
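For illustration only, a schema file is a plain-text record schema; the column names and types in this minimal sketch are invented and do not describe any particular file set:

    record
      (OrderID: int32;
       CustomerName: string[max=30];
       OrderDate: date;
      )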
Use schema defined in file set
When you create a file set you have an option to save the schema along with it. When you read the file
set you can use this schema in preference to the column definitions by setting this property to True. This
property is mutually exclusive with Schema File.
File name column
This is an optional property. It adds an extra column of type VarChar to the output of the stage,
containing the pathname of the file the record is read from. You should also add this column manually to
the Columns definitions to ensure that the column is not dropped if you are not using runtime column
propagation, or if it is turned off at some point.
Reject links
You cannot edit the column definitions for a reject link. For writing file sets, the link uses the
column definitions for the input link. For reading file sets, the link uses a single column called rejected
containing raw data for columns rejected after reading because they do not match the schema.
If you do not alter any of the Format settings, the stage will produce a file of the following format:
v File comprises variable length columns contained within double quotes.
v All columns are delimited by a comma, except for the final column in a row.
v Rows are delimited by a UNIX newline.
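For example, a single row written in this default format might look like the following (the values are invented):

    "1001","Smith","2008-06-15"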
You can use the Format As item from the shortcut menu in the Format tab to quickly change to a
fixed-width column format, using DOS newlines as row delimiters, or producing a COBOL format file.
You can use the Defaults button to change your default settings. Use the Format tab to specify your
required settings, then click Defaults → Save current as default. All your sequential files will use your
settings by default from now on. If your requirements change, you can choose Defaults → Reset defaults
from factory settings to go back to the original defaults as described above. Once you have done this,
you then have to click Defaults → Set current from default for the new defaults to take effect.
Any property that you set on this tab can be overridden at the column level by setting properties for
individual columns on the Edit Column Metadata dialog box (see ″Columns Tab″).
This description uses the terms ″record″ and ″row″ and ″field″ and ″column″ interchangeably.
The following sections list the property types and properties available for each type.
Record level
These properties define details about how data records are formatted in the flat file. Where you can enter
a character, this can usually be an ASCII character or a multi-byte Unicode character (if you have NLS
enabled). The available properties are:
v Fill char. Does not apply to output links.
v Final delimiter string. Specify the string written after the last column of a record in place of the
column delimiter. Enter one or more characters; this precedes the record delimiter if one is used.
Mutually exclusive with Final delimiter, which is the default. For example, if you set Delimiter to
comma and Final delimiter string to `, ` (comma space - you do not need to enter the inverted
commas) all fields are delimited by a comma, except the final field, which is delimited by a comma
followed by an ASCII space character. WebSphere DataStage skips the specified delimiter string when
reading the file.
v Final delimiter. Specify the single character written after the last column of a record in place of the
field delimiter. Type a character or select one of whitespace, end, none, null, tab, or comma. WebSphere
DataStage skips the specified delimiter string when reading the file. See the following diagram for an
illustration.
– whitespace. The last column of each record will not include any trailing white spaces found at the
end of the record.
– end. The last column of each record does not include the field delimiter. This is the default setting.
– none. The last column of each record does not have a delimiter, used for fixed-width fields.
– null. The last column of each record is delimited by the ASCII null character.
– comma. The last column of each record is delimited by the ASCII comma character.
– tab. The last column of each record is delimited by the ASCII tab character.
[Diagram omitted: shows the field delimiter between columns and the record delimiter at the end of the record.]
Field Defaults
Defines default properties for columns read from the file or files. These are applied to all columns, but
can be overridden for individual columns from the Columns tab using the Edit Column Metadata dialog
box. Where you can enter a character, this can usually be an ASCII character or a multi-byte Unicode
character (if you have NLS enabled). The available properties are:
v Actual field length. Specifies the actual number of bytes to skip if the field’s length equals the setting
of the null field length property.
v Delimiter. Specifies the trailing delimiter of all fields in the record. Type an ASCII character or select
one of whitespace, end, none, null, comma, or tab. WebSphere DataStage skips the delimiter when
reading.
– whitespace. Whitespace characters at the end of a column are ignored, i.e., are not treated as part of
the column.
– end. The end of a field is taken as the delimiter, i.e., there is no separate delimiter. This is not the
same as a setting of `None’ which is used for fields with fixed-width columns.
– none. No delimiter (used for fixed-width).
– null. ASCII Null character is used.
– comma. ASCII comma character is used.
– tab. ASCII tab character is used.
v Delimiter string. Specify the string at the end of each field. Enter one or more characters. This is
mutually exclusive with Delimiter, which is the default. For example, specifying `, ` (comma space -
you do not need to enter the inverted commas) specifies each field is delimited by `, ` unless
overridden for individual fields. WebSphere DataStage skips the delimiter string when reading.
Type Defaults
These are properties that apply to all columns of a specific data type unless specifically overridden at the
column level. They are divided into a number of subgroups according to data type.
General
These properties apply to several data types (unless overridden at column level):
v Byte order. Specifies how multiple byte data types (except string and raw data types) are ordered (see
the sketch at the end of this section). Choose from:
– little-endian. The high byte is on the right.
– big-endian. The high byte is on the left.
– native-endian. As defined by the native format of the machine. This is the default.
v Data Format. Specifies the data representation format of a field. Applies to fields of all data types
except string, ustring, and raw and to record, subrec or tagged fields containing at least one field that
is neither string nor raw. Choose from:
– binary
– text (the default)
A setting of binary has different meanings when applied to different data types:
– For decimals, binary means packed.
– For other numerical data types, binary means ″not text″.
– For dates, binary is equivalent to specifying the julian property for the date field.
– For time, binary is equivalent to midnight_seconds.
– For timestamp, binary specifies that the first integer contains a Julian day count for the date portion
of the timestamp and the second integer specifies the time portion of the timestamp as the number
of seconds from midnight. A binary timestamp specifies that two 32-bit integers are written.
By default data is formatted as text, as follows:
– For the date data type, text specifies that the data read contains a text-based date in the form
%yyyy-%mm-%dd or in the default date format if you have defined a new one on an NLS system
(see WebSphere DataStage NLS Guide).
– For the decimal data type: a field represents a decimal in a string format with a leading space or ’-’
followed by decimal digits with an embedded decimal point if the scale is not zero. The destination
string format is: [+ | -]ddd.[ddd] and any precision and scale arguments are ignored.
– For numeric fields (int8, int16, int32, uint8, uint16, uint32, sfloat, and dfloat): WebSphere DataStage
assumes that numeric fields are represented as text.
– For the time data type: text specifies that the field represents time in the text-based form
%hh:%nn:%ss or in the default date format if you have defined a new one on an NLS system (see
WebSphere DataStage NLS Guide).
– For the timestamp data type: text specifies a text-based timestamp in the form %yyyy-%mm-%dd
%hh:%nn:%ss or in the default date format if you have defined a new one on an NLS system (see
WebSphere DataStage NLS Guide).
v Field max width. The maximum number of bytes in a column represented as a string. Enter a number.
This is useful where you are storing numbers as text. If you are using a fixed-width character set, you
can calculate the length exactly. If you are using a variable-length character set, calculate an adequate
maximum width for your fields. Applies to fields of all data types except date, time, timestamp, and
raw; and record, subrec, or tagged if they contain at least one field of this type.
v Field width. The number of bytes in a field represented as a string. Enter a number. This is useful
where you are storing numbers as text. If you are using a fixed-width charset, you can calculate the
number of bytes exactly. If it’s a variable length encoding, base your calculation on the width and
frequency of your variable-width characters. Applies to fields of all data types except date, time,
timestamp, and raw; and record, subrec, or tagged if they contain at least one field of this type.
If you specify neither field width nor field max width, numeric fields written as text have the
following number of bytes as their maximum width:
– 8-bit signed or unsigned integers: 4 bytes
– 16-bit signed or unsigned integers: 6 bytes
– 32-bit signed or unsigned integers: 11 bytes
– 64-bit signed or unsigned integers: 21 bytes
– single-precision float: 14 bytes (sign, digit, decimal point, 7 fraction, ″E″, sign, 2 exponent)
– double-precision float: 24 bytes (sign, digit, decimal point, 16 fraction, ″E″, sign, 3 exponent)
v Pad char. This property is ignored for output links.
v Character set. Specifies the character set. Choose from ASCII or EBCDIC. The default is ASCII. Applies
to all data types except raw and ustring and record, subrec, or tagged containing no fields other than
raw or ustring.
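The following sketch (ordinary C, not DataStage code; the value 0x11223344 is invented) shows what the byte order settings mean for a 32-bit value:

    /* Sketch: byte layout of a 32-bit value under the different byte orders. */
    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    int main(void) {
        uint32_t value = 0x11223344;             /* 0x11 is the high byte */
        unsigned char bytes[4];
        memcpy(bytes, &value, sizeof value);     /* native-endian: however this machine stores it */
        printf("native-endian layout: %02x %02x %02x %02x\n",
               bytes[0], bytes[1], bytes[2], bytes[3]);
        /* big-endian layout:    11 22 33 44  (high byte on the left)
           little-endian layout: 44 33 22 11  (high byte on the right) */
        return 0;
    }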
String
These properties are applied to columns with a string data type, unless overridden at column level.
v Export EBCDIC as ASCII. Not relevant for output links.
v Import ASCII as EBCDIC. Select this to specify that ASCII characters are read as EBCDIC characters.
Decimal
These properties are applied to columns with a decimal data type unless overridden at column level.
v Allow all zeros. Specifies whether to treat a packed decimal column containing all zeros (which is
normally illegal) as a valid representation of zero. Select Yes or No. The default is No.
v Decimal separator. Specify the ASCII character that acts as the decimal separator (period by default).
v Packed. Select an option to specify what the decimal columns contain (a decoding sketch of the packed
format follows this list). Choose from:
– Yes to specify that the decimal fields contain data in packed decimal format (the default). This has
the following sub-properties:
Check. Select Yes to verify that data is packed, or No to not verify.
Signed. Select Yes to use the existing sign when reading decimal fields. Select No to write a positive
sign (0xf) regardless of the fields’ actual sign value.
– No (separate) to specify that they contain unpacked decimal with a separate sign byte. This has the
following sub-property:
Sign Position. Choose leading or trailing as appropriate.
– No (zoned) to specify that they contain an unpacked decimal in either ASCII or EBCDIC text. This
has the following sub-property:
Sign Position. Choose leading or trailing as appropriate.
– No (overpunch) to specify that the field has a leading or end byte that contains a character which
specifies both the numeric value of that byte and whether the number as a whole is negatively or
positively signed. This has the following sub-property:
Sign Position. Choose leading or trailing as appropriate.
v Precision. Specifies the precision of a packed decimal. Enter a number.
v Rounding. Specifies how to round the source field to fit into the destination decimal when reading a
source field to a decimal. Choose from:
– up (ceiling). Truncate source column towards positive infinity. This mode corresponds to the IEEE
754 Round Up mode. For example, 1.4 becomes 2, -1.6 becomes -1.
– down (floor). Truncate source column towards negative infinity. This mode corresponds to the IEEE
754 Round Down mode. For example, 1.6 becomes 1, -1.4 becomes -2.
– nearest value. Round the source column towards the nearest representable value. This mode
corresponds to the COBOL ROUNDED mode. For example, 1.4 becomes 1, 1.5 becomes 2, -1.4
becomes -1, -1.5 becomes -2.
– truncate towards zero. This is the default. Discard fractional digits to the right of the right-most
fractional digit supported by the destination, regardless of sign. For example, if the destination is an
integer, all fractional digits are truncated. If the destination is another decimal with a smaller scale,
truncate to the scale size of the destination decimal. This mode corresponds to the COBOL
INTEGER-PART function. Using this method 1.6 becomes 1, -1.6 becomes -1.
v Scale. Specifies the scale of a source packed decimal.
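To make the packed representation concrete, here is a decoding sketch in ordinary C. It is not DataStage code; the sample bytes are invented, and it assumes the common packed-decimal convention that each byte holds two decimal digits and the final low nibble holds the sign (0xD indicating a negative value):

    /* Sketch: decoding a packed-decimal field of the kind the Packed property describes. */
    #include <stdio.h>

    static long decode_packed(const unsigned char *buf, int len) {
        long value = 0;
        for (int i = 0; i < len; i++) {
            value = value * 10 + (buf[i] >> 4);          /* high nibble digit */
            if (i < len - 1)
                value = value * 10 + (buf[i] & 0x0F);    /* low nibble digit, except in the last byte */
        }
        int sign = buf[len - 1] & 0x0F;                  /* last low nibble holds the sign */
        return (sign == 0x0D) ? -value : value;          /* 0xD means negative */
    }

    int main(void) {
        unsigned char field[] = {0x12, 0x34, 0x5C};      /* packed representation of +12345 */
        printf("%ld\n", decode_packed(field, 3));        /* prints 12345 */
        return 0;
    }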
Numeric
These properties apply to integer and float fields unless overridden at column level.
v C_format. Perform non-default conversion of data from string data to an integer or floating-point value. This
property specifies a C-language format string used for reading integer or floating point strings. This is
passed to sscanf(). For example, specifying a C-format of %x and a field width of 8 ensures that a 32-bit
integer is formatted as an 8-byte hexadecimal string.
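As an illustration of that example, the same conversion can be reproduced with a direct call to sscanf(); the field value below is invented and this is not DataStage code:

    /* Minimal sketch of what a C_format of "%x" implies for reading a hexadecimal text field. */
    #include <stdio.h>

    int main(void) {
        const char field[] = "0000ff1a";          /* 8-byte hexadecimal string field */
        unsigned int value = 0;
        if (sscanf(field, "%x", &value) == 1)      /* the conversion the format string requests */
            printf("parsed value: %u\n", value);   /* prints 65306 */
        return 0;
    }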
Date
These properties are applied to columns with a date data type unless overridden at column level. All of
these are incompatible with a Data Format setting of Text.
v Days since. Dates are written as a signed integer containing the number of days since the specified
date. Enter a date in the form %yyyy-%mm-%dd or in the default date format if you have defined a
new one on an NLS system (see WebSphere DataStage NLS Guide).
v Format string. The string format of a date. By default this is %yyyy-%mm-%dd. For details about the
format, see “Date formats” on page 30.
v Is Julian. Select this to specify that dates are written as a numeric value containing the Julian day. A
Julian day specifies the date as the number of days from 4713 BCE January 1, 12:00 hours (noon) GMT.
Time
These properties are applied to columns with a time data type unless overridden at column level. All of
these are incompatible with a Data Format setting of Text.
v Format string. Specifies the format of columns representing time as a string. By default this is
%hh-%mm-%ss. For details about the format, see “Time formats” on page 33.
v Is midnight seconds. Select this to specify that times are written as a binary 32-bit integer containing
the number of seconds elapsed from the previous midnight.
Timestamp
These properties are applied to columns with a timestamp data type unless overridden at column level.
v Format string. Specifies the format of a column representing a timestamp as a string. The format
combines the format for date strings and time strings. See “Date formats” on page 30 and “Time
formats” on page 33.
Using RCP With File Set Stages
The File Set stage handles a set of sequential files. Sequential files, unlike most other data sources, do not
have inherent column definitions, and so WebSphere DataStage cannot always tell where there are extra
columns that need propagating. You can only use RCP on File Set stages if you have used the Schema
File property (see ″Schema File″) to specify a schema which describes all the columns in the sequential
files referenced by the stage. You need to specify the same schema file for any similar stages in the job
where you want to propagate columns. Stages that will require a schema file are:
v Sequential File
v File Set
v External Source
v External Target
When creating Lookup file sets, one file will be created for each partition. The individual files are
referenced by a single descriptor file, which by convention has the suffix .fs.
When performing lookups, Lookup File Set stages are used with Lookup stages. For more information
about lookup operations, see ″Lookup Stage.″
The stage editor has up to three pages, depending on whether you are creating or referencing a file set:
v Stage Page. This is always present and is used to specify general information about the stage.
v Input Page. This is present when you are creating a lookup table. This is where you specify details
about the file set being created and written to.
v Output Page. This is present when you are reading from a lookup file set, i.e., where the stage is
providing a reference link to a Lookup stage. This is where you specify details about the file set being
read from.
Must do’s
WebSphere DataStage has many defaults, which means that it can be very easy to include Lookup File Set
stages in a job. This section specifies the minimum steps to take to get a Lookup File Set stage
functioning. WebSphere DataStage provides a versatile user interface, and there are many shortcuts to
achieving a particular end; this section describes the basic method. You will learn where the shortcuts are
when you get familiar with the product.
The steps required depend on whether you are using the Lookup File Set stage to create a lookup file set,
or using it in conjunction with a Lookup stage.
Stage page
The General tab allows you to specify an optional description of the stage. The Advanced tab allows you
to specify how the stage executes. The NLS Map tab, which appears if you have NLS enabled on your
system, allows you to specify a character set map for the stage.
Advanced tab
This tab only appears when you are using the stage to create a reference file set (i.e., where the stage has
an input link). Use this tab to specify the following:
v Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the
contents of the table are processed by the available nodes as specified in the Configuration file, and by
any node constraints specified on the Advanced tab. In sequential mode the entire contents of the table
are processed by the conductor node.
Input page
The Input page allows you to specify details about how the Lookup File Set stage writes data to a file set.
The Lookup File Set stage can have only one input link.
The General tab allows you to specify an optional description of the input link. The Properties tab allows
you to specify details of exactly what the link does. The Partitioning tab allows you to specify how
incoming data is partitioned before being written to the file set. The Columns tab specifies the column
definitions of the data. The Advanced tab allows you to change the default buffering settings for the
input link.
Details about Lookup File Set stage properties and partitioning are given in the following sections. See
″Stage Editors″ for a general description of the other tabs.
The following table gives a quick reference list of the properties and their attributes. A more detailed
description of each property follows.
Category/Property | Values | Default | Mandatory? | Repeats? | Dependent of
Lookup Keys/Key | Input column | N/A | Y | Y | N/A
Lookup Keys/Case Sensitive | True/False | True | Y | N | Key
Lookup Keys/Keep | True/False | False | Y | N | Key
Lookup keys category
Key
Specifies the name of a lookup key column. The Key property can be repeated if there are multiple key
columns. The property has two dependent properties:
v Case Sensitive
This is a dependent property of Key and specifies whether the parent key is case sensitive or not. Set
to True by default.
v Keep
This specifies whether the key column needs to be propagated to the output stream after the lookup, or
used in a subsequent condition expression. Set to False by default.
Lookup file set
This property defines the file set that the incoming data will be written to. You can type in a pathname
of, or browse for, a file set descriptor file (by convention ending in .fs).
Range
Specifies whether a range lookup is required. A range lookup compares the value of a source column to a
range of values between two lookup columns. Alternatively, you can compare the value of a lookup
column to a range of values between two source columns. This property has the following dependent
properties:
v Bounded Column
Specifies the input column to use for the lookup file set. The lookup table is built with the bounded
column, which will then be compared to a range of columns from the source data set. This property is
mutually exclusive with Lower Bound and Upper Bound. It has two dependent properties:
– Case Sensitive
This is an optional property. Specifies whether the column is case sensitive or not. Set to True by
default.
– Keep
Specifies whether the column needs to be propagated to the output stream after the lookup, or used
in a subsequent condition expression. Set to False by default.
v Lower Bound
Specifies the input column to use for the lower bound of the range in the lookup table. This cannot be
the same as the upper bound column and is mutually exclusive with Bounded Column. It has two
dependent properties:
– Case Sensitive
This is an optional property. Specifies whether the column is case sensitive or not. Set to True by
default.
– Keep
Specifies whether the column needs to be propagated to the output stream after the lookup, or used
in a subsequent condition expression. Set to False by default.
v Upper Bound
Specifies the input column to use for the upper bound of the range in the lookup table. This cannot be
the same as the lower bound column and is mutually exclusive with Bounded Column. It has two
dependent properties:
– Case Sensitive
This is an optional property. Specifies whether the column value is case sensitive or not. Set to True
by default.
– Keep
Specifies whether the column needs to be propagated to the output stream after the lookup, or used
in a subsequent condition expression. Set to False by default.
Options category
Allow duplicates
Set this to cause multiple copies of duplicate records to be saved in the lookup table without a warning
being issued. Two lookup records are duplicates when all lookup key columns have the same value in the
two records. If you do not specify this option, WebSphere DataStage issues a warning message when it
encounters duplicate records and discards all but the first of the matching records.
Diskpool
This is an optional property. Specify the name of the disk pool into which to write the file set. You can
also specify a job parameter.
Partitioning tab
The Partitioning tab allows you to specify details about how the incoming data is partitioned or collected
before it is written to the lookup file set. It also allows you to specify that the data should be sorted
before being written.
By default the stage will write to the file set in entire mode. The complete data set is written to each
partition.
If the Lookup File Set stage is operating in sequential mode, it will first collect the data before writing it
to the file using the default (auto) collection method.
The Partitioning tab allows you to override this default behavior. The exact operation of this tab depends
on:
v Whether the Lookup File Set stage is set to execute in parallel or sequential mode.
v Whether the preceding stage in the job is set to execute in parallel or sequential mode.
If the Lookup File Set stage is set to execute in parallel, then you can set a partitioning method by
selecting from the Partition type list. This will override any current partitioning.
If the Lookup File Set stage is set to execute in sequential mode, but the preceding stage is executing in
parallel, then you can set a collection method from the Collector type list. This will override the default
auto collection method.
The Partitioning tab also allows you to specify that data arriving on the input link should be sorted
before being written to the file or files. The sort is always carried out within data partitions. If the stage is
partitioning incoming data, the sort occurs after the partitioning. If the stage is collecting data, the sort
occurs before the collection. The availability of sorting depends on the partitioning or collecting method
chosen (it is not available with the Auto methods).
If NLS is enabled an additional button opens a dialog box allowing you to select a locale specifying the
collate convention for the sort.
You can also specify sort direction, case sensitivity, whether sorted as ASCII or EBCDIC, and whether null
columns will appear first or last for each column. Where you are using a keyed partitioning method, you
can also specify whether the column is used as a key for sorting, for partitioning, or for both. Right-click
the column in the Selected list to invoke the shortcut menu.
Output page
The Output page allows you to specify details about how the Lookup File Set stage references a file set.
The Lookup File Set stage can have only one output link, which is a reference link.
The General tab allows you to specify an optional description of the output link. The Properties tab
allows you to specify details of exactly what the link does. The Columns tab specifies the column
definitions of the data. The Advanced tab allows you to change the default buffering settings for the
output link.
Details about Lookup File Set stage properties are given in the following sections. See ″Stage Editors,″ for
a general description of the other tabs.
Category/Property | Values | Default | Mandatory? | Repeats? | Dependent of
Lookup Source/Lookup File Set | pathname | N/A | Y | N | N/A
This property defines the file set that the data will be referenced from. You can type in a pathname of, or
browse for, a file set descriptor file (by convention ending in .fs).
The External Source stage allows you to perform actions such as interfacing with databases not currently
supported by the WebSphere DataStage Enterprise Edition.
When reading output from a program, WebSphere DataStage needs to know something about its format.
The information required is how the data is divided into rows and how rows are divided into columns.
You specify this on the Format tab. Settings for individual columns can be overridden on the Columns
tab using the Edit Column Metadata dialog box.
When you edit an External Source stage, the External Source stage editor appears. This is based on the
generic stage editor described in ″Stage Editors.″
There are one or two special points to note about using runtime column propagation (RCP) with External
Source stages. See ″Using RCP With External Source Stages″ for details.
Stage page
The General tab allows you to specify an optional description of the stage. The Advanced tab allows you
to specify how the stage executes. The NLS Map tab appears if you have NLS enabled on your system; it
allows you to specify a character set map for the stage.
Advanced tab
This tab allows you to specify the following:
v Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the data
input from external programs is processed by the available nodes as specified in the Configuration file,
and by any node constraints specified on the Advanced tab. In Sequential mode all the data from the
source program is processed by the conductor node.
v Combinability mode. This is Auto by default, which allows WebSphere DataStage to combine the
operators that underlie parallel stages so that they run in the same process if it is sensible for this type
of stage.
v Preserve partitioning. You can select Set or Clear. If you select Set, it will request that the next stage
preserves the partitioning as is. Clear is the default.
v Node pool and resource constraints. Select this option to constrain parallel execution to the node pool
or pools and/or resource pool or pools specified in the grid. The grid allows you to make choices from
drop down lists populated from the Configuration file.
v Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node
map. You can define a node map by typing node numbers into the text box or by clicking the browse
button to open the Available Nodes dialog box and selecting nodes from there. You are effectively
defining a new node pool for this stage (in addition to any node pools defined in the Configuration
file).
Output page
The Output page allows you to specify details about how the External Source stage reads data from an
external program. The External Source stage can have only one output link. It can also have a single
reject link, where rows that do not match the expected schema can be sent. The Output name drop-down
list allows you to choose whether you are looking at details of the main output link (the stream link) or
the reject link.
The General tab allows you to specify an optional description of the output link. The Properties tab
allows you to specify details of exactly what the link does. The Format tab gives information about the
format of the files being read. The Columns tab specifies the column definitions of the data. The
Advanced tab allows you to change the default buffering settings for the output link.
Details about External Source stage properties and formatting are given in the following sections. See
″Stage Editors,″ for a general description of the other tabs.
The following table gives a quick reference list of the properties and their attributes. A more detailed
description of each property follows.
Category/Property | Values | Default | Mandatory? | Repeats? | Dependent of
Source/Source Program | string | N/A | Y if Source Method = Specific Program(s) | Y | N/A
Source/Source Programs File | pathname | N/A | Y if Source Method = Program File(s) | Y | N/A
Source/Source Method | Specific Program(s)/Program File(s) | Specific Program(s) | Y | N | N/A
Options/Keep File Partitions | True/False | False | Y | N | N/A
Options/Reject Mode | Continue/Fail/Save | Continue | Y | N | N/A
Options/Schema File | pathname | N/A | N | N | N/A
Source category
Source program
Specifies the name of a program providing the source data. WebSphere DataStage calls the specified
program and passes to it any arguments specified. You can repeat this property to specify multiple
program instances with different arguments. You can use a job parameter to supply program name and
arguments.
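As a purely hypothetical sketch of such a source program, the following C program writes two invented comma-delimited rows to standard output; the program, its rows, and their format are assumptions for illustration only:

    /* Hypothetical source program: an External Source stage could invoke a program
       like this, which writes delimited rows to standard output. */
    #include <stdio.h>

    int main(void) {
        /* Invented sample rows; a real program would obtain them from some external system. */
        printf("\"1001\",\"Smith\",\"2008-06-15\"\n");
        printf("\"1002\",\"Jones\",\"2008-06-16\"\n");
        return 0;
    }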
Source programs file
Specifies a file containing a list of program names and arguments. You can browse for the file or specify a
job parameter. You can repeat this property to specify multiple files.
Source method
This property specifies whether you are directly specifying a program (using the Source Program property)
or using a file to specify a program (using the Source Programs File property).
Options category
Keep file partitions
Set this to True to maintain the partitioning of the read data. Defaults to False.
Reject mode
Allows you to specify behavior if a record fails to be read for some reason. Choose from Continue to
continue operation and discard any rejected rows, Fail to cease reading if any rows are rejected, or Save
to send rejected rows down a reject link. Defaults to Continue.
Schema file
This is an optional property. By default the External Source stage will use the column definitions defined
on the Columns tab and Schema tab as a schema for reading the file. You can, however, specify a file
containing a schema instead (note, however, that if you have defined columns on the Columns tab, you
should ensure these match the schema file). Type in a pathname or browse for a schema file.
This is an optional property. It adds an extra column of type VarChar to the output of the stage,
containing the pathname of the source the record is read from. You should also add this column
manually to the Columns definitions to ensure that the column is not dropped if you are not using
runtime column propagation, or it is turned off at some point.
If you do not alter any of the Format settings, the stage will produce a file of the following format:
v File comprises variable length columns contained within double quotes.
v All columns are delimited by a comma, except for the final column in a row.
v Rows are delimited by a UNIX newline.
You can use the Format As item from the shortcut menu in the Format tab to quickly change to a
fixed-width column format, using DOS newlines as row delimiters, or producing a COBOL format file.
You can use the Defaults button to change your default settings. Use the Format tab to specify your
required settings, then click Defaults → Save current as default. All your sequential files will use your
settings by default from now on. If your requirements change, you can choose Defaults → Reset defaults
from factory settings to go back to the original defaults as described above. Once you have done this,
you then have to click Defaults → Set current from default for the new defaults to take effect.
To change individual properties, select a property type from the main tree then add the properties you
want to set to the tree structure by clicking on them in the Available properties to set window. You can
then set a value for that property in the Property Value box. Pop-up help for each of the available
properties appears if you hover the mouse pointer over it.
Any property that you set on this tab can be overridden at the column level by setting properties for
individual columns on the Edit Column Metadata dialog box (see Columns Tab).
This description uses the terms ″record″ and ″row″ and ″field″ and ″column″ interchangeably.
The following sections list the property types and properties available for each type.
Record level
These properties define details about how data records are formatted in the flat file. Where you can enter
a character, this can usually be an ASCII character or a multi-byte Unicode character (if you have NLS
enabled). The available properties are:
v Fill char. Does not apply to output links.
v Final delimiter string. Specify the string written after the last column of a record in place of the
column delimiter. Enter one or more characters; this precedes the record delimiter if one is used.
Mutually exclusive with Final delimiter, which is the default. For example, if you set Delimiter to
comma and Final delimiter string to `, ` (comma space - you do not need to enter the inverted
commas) all fields are delimited by a comma, except the final field, which is delimited by a comma
followed by an ASCII space character. WebSphere DataStage skips the specified delimiter string when
reading the file.
v Final delimiter. Specify the single character written after the last column of a record in place of the
field delimiter. Type a character or select one of whitespace, end, none, null, tab, or comma. WebSphere
DataStage skips the specified delimiter string when reading the file. See the following diagram for an
illustration.
– whitespace. The last column of each record will not include any trailing white spaces found at the
end of the record.
– end. The last column of each record does not include the field delimiter. This is the default setting.
– none. The last column of each record does not have a delimiter, used for fixed-width fields.
– null. The last column of each record is delimited by the ASCII null character.
– comma. The last column of each record is delimited by the ASCII comma character.
[Diagram omitted: shows the position of the field delimiter and the final delimiter within a record.]
Field Defaults
Defines default properties for columns read from the file or files. These are applied to all columns, but
can be overridden for individual columns from the Columns tab using the Edit Column Metadata dialog
box. Where you can enter a character, this can usually be an ASCII character or a multi-byte Unicode
character (if you have NLS enabled). The available properties are:
v Actual field length. Specifies the actual number of bytes to skip if the field’s length equals the setting
of the null field length property.
Type Defaults
These are properties that apply to all columns of a specific data type unless specifically overridden at the
column level. They are divided into a number of subgroups according to data type.
General
These properties apply to several data types (unless overridden at column level):
v Byte order. Specifies how multiple byte data types (except string and raw data types) are ordered.
Choose from:
– little-endian. The high byte is on the right.
– big-endian. The high byte is on the left.
– native-endian. As defined by the native format of the machine. This is the default.
v Data Format. Specifies the data representation format of a field. Applies to fields of all data types
except string, ustring, and raw and to record, subrec or tagged fields containing at least one field that
is neither string nor raw. Choose from:
– binary
– text (the default)
A setting of binary has different meanings when applied to different data types:
– For decimals, binary means packed.
– For other numerical data types, binary means ″not text″.
– For dates, binary is equivalent to specifying the julian property for the date field.
– For time, binary is equivalent to midnight_seconds.
– For timestamp, binary specifies that the first integer contains a Julian day count for the date portion
of the timestamp and the second integer specifies the time portion of the timestamp as the number
of seconds from midnight. A binary timestamp specifies that two 32-bit integers are written.
By default data is formatted as text, as follows:
– For the date data type, text specifies that the data read contains a text-based date in the form
%yyyy-%mm-%dd or in the default date format if you have defined a new one on an NLS system
(see WebSphere DataStage NLS Guide).
– For the decimal data type: a field represents a decimal in a string format with a leading space or ’-’
followed by decimal digits with an embedded decimal point if the scale is not zero. The destination
string format is: [+ | -]ddd.[ddd] and any precision and scale arguments are ignored.
– For numeric fields (int8, int16, int32, uint8, uint16, uint32, sfloat, and dfloat): WebSphere DataStage
assumes that numeric fields are represented as text.
– For the time data type: text specifies that the field represents time in the text-based form
%hh:%nn:%ss or in the default date format if you have defined a new one on an NLS system (see
WebSphere DataStage NLS Guide).
– For the timestamp data type: text specifies a text-based timestamp in the form %yyyy-%mm-%dd
%hh:%nn:%ss or in the default date format if you have defined a new one on an NLS system (see
WebSphere DataStage NLS Guide).
v Field max width. The maximum number of bytes in a column represented as a string. Enter a number.
This is useful where you are storing numbers as text. If you are using a fixed-width character set, you
can calculate the length exactly. If you are using a variable-length character set, calculate an adequate
maximum width for your fields. Applies to fields of all data types except date, time, timestamp, and
raw; and record, subrec, or tagged if they contain at least one field of this type.
v Field width. The number of bytes in a field represented as a string. Enter a number. This is useful
where you are storing numbers as text. If you are using a fixed-width charset, you can calculate the
number of bytes exactly. If it’s a variable length encoding, base your calculation on the width and
frequency of your variable-width characters. Applies to fields of all data types except date, time,
timestamp, and raw; and record, subrec, or tagged if they contain at least one field of this type.
String
These properties are applied to columns with a string data type, unless overridden at column level.
v Export EBCDIC as ASCII. Not relevant for output links.
v Import ASCII as EBCDIC. Select this to specify that ASCII characters are read as EBCDIC characters.
Decimal
These properties are applied to columns with a decimal data type unless overridden at column level.
v Allow all zeros. Specifies whether to treat a packed decimal column containing all zeros (which is
normally illegal) as a valid representation of zero. Select Yes or No. The default is No.
v Decimal separator. Specify the ASCII character that acts as the decimal separator (period by default).
v Packed. Select an option to specify what the decimal columns contain, choose from:
– Yes to specify that the decimal fields contain data in packed decimal format (the default). This has
the following sub-properties:
Check. Select Yes to verify that data is packed, or No to not verify.
Signed. Select Yes to use the existing sign when reading decimal fields. Select No to write a positive
sign (0xf) regardless of the fields’ actual sign value.
– No (separate) to specify that they contain unpacked decimal with a separate sign byte. This has the
following sub-property:
Sign Position. Choose leading or trailing as appropriate.
– No (zoned) to specify that they contain an unpacked decimal in either ASCII or EBCDIC text. This
has the following sub-property:
Sign Position. Choose leading or trailing as appropriate.
– No (overpunch) to specify that the field has a leading or end byte that contains a character which
specifies both the numeric value of that byte and whether the number as a whole is negatively or
positively signed. This has the following sub-property:
Sign Position. Choose leading or trailing as appropriate.
v Precision. Specifies the precision of a packed decimal. Enter a number.
v Rounding. Specifies how to round the source field to fit into the destination decimal when reading a
source field to a decimal. Choose from:
– up (ceiling). Truncate source column towards positive infinity. This mode corresponds to the IEEE
754 Round Up mode. For example, 1.4 becomes 2, -1.6 becomes -1.
– down (floor). Truncate source column towards negative infinity. This mode corresponds to the IEEE
754 Round Down mode. For example, 1.6 becomes 1, -1.4 becomes -2.
Numeric
These properties apply to integer and float fields unless overridden at column level.
v C_format. Perform non-default conversion of data from string data to an integer or floating-point value. This
property specifies a C-language format string used for reading integer or floating point strings. This is
passed to sscanf(). For example, specifying a C-format of %x and a field width of 8 ensures that a 32-bit
integer is formatted as an 8-byte hexadecimal string.
v In_format. Format string used for conversion of data from string to integer or floating-point data. This
is passed to sscanf(). By default, WebSphere DataStage invokes the C sscanf() function to convert a
numeric field formatted as a string to either integer or floating point data. If this function does not
output data in a satisfactory format, you can specify the in_format property to pass formatting
arguments to sscanf().
v Out_format. This property is not relevant for output links.
Date
These properties are applied to columns with a date data type unless overridden at column level. All of
these are incompatible with a Data Format setting of Text.
v Days since. Dates are written as a signed integer containing the number of days since the specified
date. Enter a date in the form %yyyy-%mm-%dd or in the default date format if you have defined a
new one on an NLS system (see WebSphere DataStage NLS Guide).
v Format string. The string format of a date. By default this is %yyyy-%mm-%dd. For details about the
format, see “Date formats” on page 30.
v Is Julian. Select this to specify that dates are written as a numeric value containing the Julian day. A
Julian day specifies the date as the number of days from 4713 BCE January 1, 12:00 hours (noon) GMT.
Time
These properties are applied to columns with a time data type unless overridden at column level. All of
these are incompatible with a Data Format setting of Text.
v Format string. Specifies the format of columns representing time as a string. By default this is
%hh-%mm-%ss. For details about the format, see “Time formats” on page 33.
v Is midnight seconds. Select this to specify that times are written as a binary 32-bit integer containing
the number of seconds elapsed from the previous midnight.
Timestamp
These properties are applied to columns with a timestamp data type unless overridden at column level.
v Format string. Specifies the format of a column representing a timestamp as a string. The format
combines the format for date strings and time strings. See “Date formats” on page 30 and “Time
formats” on page 33.
Using RCP With External Source Stages
External Source stages, unlike most other data sources, do not have inherent column definitions, and so
WebSphere DataStage cannot always tell where there are extra columns that need propagating. You can
only use RCP on External Source stages if you have used the Schema File property (see ″Schema File″) to
specify a schema which describes all the columns in the sequential files referenced
by the stage. You need to specify the same schema file for any similar stages in the job where you want
to propagate columns. Stages that will require a schema file are:
v Sequential File
v File Set
v External Source
v External Target
v Column Import
v Column Export
The External Target stage allows you to perform actions such as interfacing with databases not currently
supported by the WebSphere DataStage Parallel Extender.
When writing to a program, WebSphere DataStage needs to know something about how to format the
data. The information required is how the data is divided into rows and how rows are divided into
columns. You specify this on the Format tab. Settings for individual columns can be overridden on the
Columns tab using the Edit Column Metadata dialog box.
When you edit an External Target stage, the External Target stage editor appears. This is based on the
generic stage editor described in ″Stage Editors.″
There are one or two special points to note about using runtime column propagation (RCP) with External
Target stages. See ″Using RCP With External Target Stages″ for details.
Stage page
The General tab allows you to specify an optional description of the stage. The Advanced tab allows you
to specify how the stage executes. The NLS Map tab appears if you have NLS enabled on your system; it
allows you to specify a character set map for the stage.
Advanced tab
This tab allows you to specify the following:
v Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the data
output to external programs is processed by the available nodes as specified in the Configuration file,
and by any node constraints specified on the Advanced tab. In Sequential mode all the data from the
source program is processed by the conductor node.
v Combinability mode. This is Auto by default, which allows WebSphere DataStage to combine the
operators that underlie parallel stages so that they run in the same process if it is sensible for this type
of stage.
v Preserve partitioning. You can select Set or Clear. If you select Set, it will request that the next stage
preserves the partitioning as is. Clear is the default.
v Node pool and resource constraints. Select this option to constrain parallel execution to the node pool
or pools and/or resource pool or pools specified in the grid. The grid allows you to make choices from
drop down lists populated from the Configuration file.
v Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node
map. You can define a node map by typing node numbers into the text box or by clicking the browse
button to open the Available Nodes dialog box and selecting nodes from there. You are effectively
defining a new node pool for this stage (in addition to any node pools defined in the Configuration
file).
Input page
The Input page allows you to specify details about how the External Target stage writes data to an
external program. The External Target stage can have only one input link.
The General tab allows you to specify an optional description of the input link. The Properties tab allows
you to specify details of exactly what the link does. The Partitioning tab allows you to specify how
incoming data is partitioned before being written to the external program. The Format tab gives
information about the format of the data being written. The Columns tab specifies the column definitions
of the data. The Advanced tab allows you to change the default buffering settings for the input link.
Details about External Target stage properties, partitioning, and formatting are given in the following
sections. See ″Stage Editors,″ for a general description of the other tabs.
The following table gives a quick reference list of the properties and their attributes. A more detailed
description of each property follows.
Category/Property | Values | Default | Mandatory? | Repeats? | Dependent of
Target/Destination Program | string | N/A | Y if Target Method = Specific Program(s) | Y | N/A
Target/Destination Programs File | pathname | N/A | Y if Target Method = Program File(s) | Y | N/A
Target/Target Method | Specific Program(s)/Program File(s) | Specific Program(s) | Y | N | N/A
Options/Reject Mode | Continue/Fail/Save | Continue | N | N | N/A
Options/Schema File | pathname | N/A | N | N | N/A
Target category
Destination program
This is an optional property. Specifies the name of a program receiving data. WebSphere DataStage calls
the specified program and passes to it any arguments specified. You can repeat this property to specify
multiple program instances with different arguments.
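As a purely hypothetical sketch of such a destination program, the following C program reads the formatted rows from standard input and simply counts them; the behavior is an assumption for illustration only:

    /* Hypothetical target program: an External Target stage could invoke a program
       like this, which consumes the formatted rows from standard input. */
    #include <stdio.h>

    int main(void) {
        char line[1024];
        long count = 0;
        while (fgets(line, sizeof line, stdin) != NULL)   /* one record per line */
            count++;                                      /* a real program would load or forward each row */
        fprintf(stderr, "received %ld rows\n", count);
        return 0;
    }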
Destination programs file
This is an optional property. Specifies a file containing a list of program names and arguments. You can
browse for the file or specify a job parameter. You can repeat this property to specify multiple files.
Target method
This property specifies whether you are directly specifying a program (using the Destination Program
property) or using a file to specify a program (using the Destination Programs File property).
Options category
Reject mode
This is an optional property. Allows you to specify behavior if a record fails to be written for some
reason. Choose from Continue to continue operation and discard any rejected rows, Fail to cease writing
if any rows are rejected, or Save to send rejected rows down a reject link. Defaults to Continue.
Schema file
This is an optional property. By default the External Target stage will use the column definitions defined
on the Columns tab as a schema for writing the file. You can, however, specify a file containing a schema
instead (note, however, that if you have defined columns on the Columns tab, you should ensure these
match the schema file). Type in a pathname or browse for a schema file.
Partitioning tab
The Partitioning tab allows you to specify details about how the incoming data is partitioned or collected
before it is written to the target program. It also allows you to specify that the data should be sorted
before being written.
If the External Target stage is operating in sequential mode, it will first collect the data before writing it
to the file using the default auto collection method.
The Partitioning tab allows you to override this default behavior. The exact operation of this tab depends
on:
v Whether the External Target stage is set to execute in parallel or sequential mode.
v Whether the preceding stage in the job is set to execute in parallel or sequential mode.
If the External Target stage is set to execute in parallel, then you can set a partitioning method by
selecting from the Partitioning type drop-down list. This will override any current partitioning.
If the External Target stage is set to execute in sequential mode, but the preceding stage is executing in
parallel, then you can set a collection method from the Collector type drop-down list. This will override
the default Auto collection method.
The Partitioning tab also allows you to specify that data arriving on the input link should be sorted
before being written to the target program. The sort is always carried out within data partitions. If the
stage is partitioning incoming data the sort occurs after the partitioning. If the stage is collecting data, the
sort occurs before the collection. The availability of sorting depends on the partitioning or collecting
method chosen (it is not available with the Auto methods).
If NLS is enabled an additional button opens a dialog box allowing you to select a locale specifying the
collate convention for the sort.
You can also specify sort direction, case sensitivity, whether sorted as ASCII or EBCDIC, and whether null
columns will appear first or last for each column. Where you are using a keyed partitioning method, you
can also specify whether the column is used as a key for sorting, for partitioning, or for both. Select the
column in the Selected list and right-click to invoke the shortcut menu.
Format tab
The Format tab allows you to supply information about the format of the data you are writing. The tab
has a similar format to the Properties tab.
If you do not alter any of the Format settings, the External Target stage will produce a file of the
following format:
v File comprises variable length columns contained within double quotes.
v All columns are delimited by a comma, except for the final column in a row.
v Rows are delimited by a UNIX newline.
You can use the Format As item from the shortcut menu in the Format tab to quickly change to a
fixed-width column format, using DOS newlines as row delimiters, or producing a COBOL format file.
To change individual properties, select a property type from the main tree then add the properties you
want to set to the tree structure by clicking on them in the Available properties to set window. You can
then set a value for that property in the Property Value box. Pop up help for each of the available
properties appears if you hover the mouse pointer over it.
This description uses the terms ″record″ and ″row″ and ″field″ and ″column″ interchangeably.
The following sections list the Property types and properties available for each type.
Record level
These properties define details about how data records are formatted in the flat file. Where you can enter
a character, this can usually be an ASCII character or a multi-byte Unicode character (if you have NLS
enabled). The available properties are:
Fill char
Specify an ASCII character or a value in the range 0 to 255. You can also choose Space or Null from a
drop-down list. This character is used to fill any gaps in a written record caused by column positioning
properties. Set to 0 by default (which is the NULL character). For example, to set it to space you could
also type in the space character or enter 32. Note that this value is restricted to one byte, so you cannot
specify a multi-byte Unicode character.
Final delimiter string
Specify a string to be written after the last column of a record in place of the column delimiter. Enter one
or more characters; this precedes the record delimiter if one is used. Mutually exclusive with Final
delimiter, which is the default. For example, if you set Delimiter to comma and Final delimiter string to `,
` (comma space - you do not need to enter the inverted commas) all fields are delimited by a comma,
except the final field, which is delimited by a comma followed by an ASCII space character.
Final delimiter
Specify a single character to be written after the last column of a record in place of the field delimiter.
Type a character or select one of whitespace, end, none, null, tab, or comma. See the following diagram
for an illustration.
v whitespace. The last column of each record will not include any trailing white spaces found at the end
of the record.
v end. The last column of each record does not include the field delimiter. This is the default setting.
v none. The last column of each record does not have a delimiter; used for fixed-width fields.
v null. The last column of each record is delimited by the ASCII null character.
v comma. The last column of each record is delimited by the ASCII comma character.
[Diagram omitted: shows the position of the field delimiter and the final delimiter within a record.]
When writing, a space is now inserted after every field except the last in the record. Previously, a space
was inserted after every field including the last. (If you want to revert to the pre-release 7.5 behavior of
inserting a space after the last field, set the APT_FINAL_DELIM_COMPATIBLE environment variable.)
Intact
The intact property specifies an identifier of a partial schema. A partial schema specifies that only the
column(s) named in the schema can be modified by the stage. All other columns in the row are passed
through unmodified. The file containing the partial schema is specified in the Schema File property on
the Properties tab. This property has a dependent property, Check intact, but this is not relevant to input
links.
Record delimiter string
Specify a string to be written at the end of each record. Enter one or more characters. This is mutually
exclusive with Record delimiter (which is the default), Record prefix, and Record type.
Record delimiter
Specify a single character to be written at the end of each record. Type a character or select one of the
following:
v UNIX Newline (the default)
v null
(To implement a DOS newline, use the Record delimiter string property set to ″\R\N″ or choose Format
as → DOS line terminator from the shortcut menu.)
Note: Record delimiter is mutually exclusive with Record delimiter string, Record prefix, and Record
type.
Record length
Select Fixed where fixed length fields are being written. WebSphere DataStage calculates the appropriate
length for the record. Alternatively specify the length of fixed records as number of bytes. This is not
used by default (default files are comma-delimited). The record is padded to the specified length with
either zeros or the fill character if one has been specified.
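The padding behavior can be sketched as follows; the record content and lengths are invented, and the
sketch simply shows the effect of Record length combined with Fill char.
def pad_record(record_bytes: bytes, record_length: int, fill_char: int = 0) -> bytes:
    # Record length: pad the written record to the fixed length with the
    # fill character (0, the NULL character, unless Fill char is set)
    if len(record_bytes) > record_length:
        raise ValueError("record longer than the fixed record length")
    return record_bytes + bytes([fill_char]) * (record_length - len(record_bytes))

# Example: pad a 12-byte record to 20 bytes with spaces (Fill char = 32)
print(pad_record(b"0001SMITH   ", 20, 32))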
Record Prefix
Specifies that a variable-length record is prefixed by a 1-, 2-, or 4-byte length prefix. It is set to 1 by
default. This property is mutually exclusive with Record delimiter, which is the default, and with Record
delimiter string and Record type.
Record type
Specifies that data consists of variable-length blocked records (varying) or implicit records (implicit). If
you choose the implicit property, data is written as a stream with no explicit record boundaries. The end
of the record is inferred when all of the columns defined by the schema have been parsed. The varying
property allows you to specify one of the following IBM blocked or spanned formats: V, VB, VS, VBS, or
VR.
This property is mutually exclusive with Record length, Record delimiter, Record delimiter string, and
Record prefix and by default is not used.
Field defaults
Defines default properties for columns written to the file or files. These are applied to all columns
written, but can be overridden for individual columns from the Columns tab using the Edit Column
Metadata dialog box. Where you can enter a character, this can usually be an ASCII character or a
multi-byte Unicode character (if you have NLS enabled). The available properties are:
v Actual field length. Specifies the number of bytes to fill with the Fill character when a field is
identified as null. When WebSphere DataStage identifies a null field, it will write a field of this length
full of Fill characters. This is mutually exclusive with Null field value.
v Delimiter. Specifies the trailing delimiter of all fields in the record. Type an ASCII character or select
one of whitespace, end, none, null, comma, or tab.
– whitespace. Whitespace characters at the end of a column are ignored, i.e., are not treated as part of
the column.
– end. The end of a field is taken as the delimiter, i.e., there is no separate delimiter. This is not the
same as a setting of `None’ which is used for fields with fixed-width columns.
– none. No delimiter (used for fixed-width).
– null. ASCII Null character is used.
– comma. ASCII comma character is used.
– tab. ASCII tab character is used.
v Delimiter string. Specify a string to be written at the end of each field. Enter one or more characters.
This is mutually exclusive with Delimiter, which is the default. For example, specifying `, ` (comma
space - you do not need to enter the inverted commas) would have each field delimited by `, ` unless
overridden for individual fields.
v Null field length. The length in bytes of a variable-length field that contains a null. When a
variable-length field is written, WebSphere DataStage writes a length value of null field length if the
field contains a null. This property is mutually exclusive with null field value.
v Null field value. Specifies the value written to null field if the source is set to null. Can be a number,
string, or C-type literal escape character. For example, you can represent a byte value by \ooo, where
each o is an octal digit 0 - 7 and the first o is < 4, or by \xhh, where each h is a hexadecimal digit 0 - F.
You must use this form to encode non-printable byte values.
This property is mutually exclusive with Null field length and Actual length. For a fixed width data
representation, you can use Pad char (from the general section of Type defaults) to specify a repeated
trailing character if the value you specify is shorter than the fixed width of the field.
v Prefix bytes. Specifies that each column in the data file is prefixed by 1, 2, or 4 bytes containing, as a
binary value, either the column’s length or the tag value for a tagged field.
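As an illustration of Prefix bytes and Null field length, the following sketch writes one variable-length
field with a 2-byte binary length prefix. It is a simplified reading of the behavior described above; the
byte order and the example values are invented.
import struct

def write_prefixed_field(value, prefix_bytes=2, null_field_length=0, byte_order=">"):
    # Prefix bytes: each column is prefixed by 1, 2, or 4 bytes holding its
    # length as a binary value. Null field length: the length value written
    # when the field contains a null.
    fmt = {1: "B", 2: "H", 4: "I"}[prefix_bytes]
    if value is None:
        return struct.pack(byte_order + fmt, null_field_length)
    data = value.encode("ascii")
    return struct.pack(byte_order + fmt, len(data)) + data

print(write_prefixed_field("Smith"))   # b'\x00\x05Smith'
print(write_prefixed_field(None))      # b'\x00\x00'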
Type defaults
These are properties that apply to all columns of a specific data type unless specifically overridden at the
column level. They are divided into a number of subgroups according to data type.
General
These properties apply to several data types (unless overridden at column level):
v Byte order. Specifies how multiple byte data types (except string and raw data types) are ordered.
Choose from:
– little-endian. The high byte is on the right.
– big-endian. The high byte is on the left.
– native-endian. As defined by the native format of the machine. This is the default.
v Data Format. Specifies the data representation format of a field. Applies to fields of all data types
except string, ustring, and raw and to record, subrec or tagged fields containing at least one field that
is neither string nor raw. Choose from:
– binary
– text (the default)
A setting of binary has different meanings when applied to different data types:
– For decimals, binary means packed.
– For other numerical data types, binary means ″not text″.
– For dates, binary is equivalent to specifying the julian property for the date field.
– For time, binary is equivalent to midnight_seconds.
– For timestamp, binary specifies that the first integer contains a Julian day count for the date portion
of the timestamp and the second integer specifies the time portion of the timestamp as the number
of seconds from midnight. A binary timestamp specifies that two 32-bit integers are written.
By default data is formatted as text, as follows:
– For the date data type, text specifies that the data to be written contains a text-based date in the
form %yyyy-%mm-%dd or in the default date format if you have defined a new one on an NLS
system (see WebSphere DataStage NLS Guide).
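For illustration, the binary timestamp layout described above (a Julian day count followed by the
seconds from midnight, each written as a 32-bit integer) can be sketched outside the product as follows.
This is not DataStage code; the little-endian choice is simply an example of the Byte order setting, and
the timestamp value is invented.
import struct
from datetime import datetime

def pack_binary_timestamp(ts: datetime, byte_order: str = "<") -> bytes:
    # binary timestamp = two 32-bit integers:
    #   first  = Julian day count for the date portion
    #   second = seconds from midnight for the time portion
    julian_day = ts.date().toordinal() + 1721425      # proleptic Julian day number
    seconds = ts.hour * 3600 + ts.minute * 60 + ts.second
    # byte_order "<" = little-endian, ">" = big-endian, "=" = native-endian
    return struct.pack(byte_order + "ii", julian_day, seconds)

print(pack_binary_timestamp(datetime(2000, 1, 1, 12, 0, 0)).hex())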
String
These properties are applied to columns with a string data type, unless overridden at column level.
v Export EBCDIC as ASCII. Select this to specify that EBCDIC characters are written as ASCII
characters. Applies to fields of the string data type and record, subrec, or tagged fields if they contain
at least one field of this type.
v Import ASCII as EBCDIC. Not relevant for input links.
For ASCII-EBCDIC and EBCDIC-ASCII conversion tables, see WebSphere DataStage Developer’s Help.
Decimal
These properties are applied to columns with a decimal data type unless overridden at column level.
Numeric
These properties apply to integer and float fields unless overridden at column level.
Date
These properties are applied to columns with a date data type unless overridden at column level. All of
these are incompatible with a Data Format setting of Text.
v Days since. Dates are written as a signed integer containing the number of days since the specified
date. Enter a date in the form %yyyy-%mm-%dd or in the default date format if you have defined a
new one on an NLS system (see WebSphere DataStage NLS Guide).
v Format string. The string format of a date. By default this is %yyyy-%mm-%dd. For details about the
format, see “Date formats” on page 30.
v Is Julian. Select this to specify that dates are written as a numeric value containing the Julian day. A
Julian day specifies the date as the number of days from 4713 BCE January 1, 12:00 hours (noon) GMT.
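The integer values implied by these two settings can be sketched as follows. This is illustrative Python,
not DataStage code, and the base date shown for Days since is an invented example.
from datetime import date

def days_since(d: date, base: date) -> int:
    # "Days since": the date is written as a signed integer holding the
    # number of days since the specified base date (may be negative)
    return (d - base).days

def julian_day(d: date) -> int:
    # "Is Julian": the date as the number of days from 4713 BCE January 1,
    # 12:00 hours (noon) GMT, i.e. the standard Julian day number
    return d.toordinal() + 1721425

print(days_since(date(2006, 7, 1), base=date(1900, 1, 1)))  # 38897
print(julian_day(date(2000, 1, 1)))                         # 2451545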
Time
These properties are applied to columns with a time data type unless overridden at column level. All of
these are incompatible with a Data Format setting of Text.
v Format string. Specifies the format of columns representing time as a string. For details about the
format, see “Time formats” on page 33.
v Is midnight seconds. Select this to specify that times are written as a binary 32-bit integer containing
the number of seconds elapsed from the previous midnight.
Timestamp
These properties are applied to columns with a timestamp data type unless overridden at column level.
v Format string. Specifies the format of a column representing a timestamp as a string. Defaults to
%yyyy-%mm-%dd %hh:%nn:%ss. The format combines the format for date strings and time strings. See
“Date formats” on page 30 and “Time formats” on page 33.
Output page
The Output page appears if the stage has a Reject link.
The General tab allows you to specify an optional description of the output link.
You cannot change the properties of a Reject link. The Properties tab for a reject link is blank.
Similarly, you cannot edit the column definitions for a reject link. The link uses the column definitions for
the link rejecting the data records.
External Target stages, unlike most other data targets, do not have inherent column definitions, and so
WebSphere DataStage cannot always tell where there are extra columns that need propagating. You can
only use RCP on External Target stages if you have used the Schema File property (see ″Schema File″) to
specify a schema which describes all the columns in the sequential files referenced by the stage. You need
to specify the same schema file for any similar stages in the job where you want to propagate columns.
Stages that will require a schema file are:
v Sequential File
v File Set
v External Source
v External Target
v Column Import
v Column Export
As a source, the CFF stage can have multiple output links and a single reject link. You can read data from
one or more complex flat files, including MVS data sets with QSAM and VSAM files. You can also read
data from files that contain multiple record types. The source data can contain one or more of the
following clauses:
v GROUP
v REDEFINES
v OCCURS
v OCCURS DEPENDING ON
CFF source stages run in parallel mode when they are used to read multiple files, but you can configure
the stage to run sequentially if it is reading only one file with a single reader.
As a target, the CFF stage can have a single input link and a single reject link. You can write data to one
or more complex flat files. You cannot write to MVS data sets or to files that contain multiple record
types.
Figure 1. This job has a Complex Flat File source stage with a single reject link, and a Complex Flat File target stage
with a single reject link.
Column definitions
You must define columns to specify what data the CFF stage will read or write.
If the stage will read data from a file that contains multiple record types, you must first create record
definitions on the Records tab. If the source file contains only one record type, or if the stage will write
data to a target file, then the columns belong to the default record called RECORD_1.
You can load column definitions from a table in the repository, or you can type column definitions into
the columns grid. You can also define columns by dragging a table definition from the Repository
window to the CFF stage icon on the Designer canvas. If you drag a table to a CFF source stage that has
multiple record types, the columns are added to the first record in the stage. You can then propagate the
columns to one or more output links by right-clicking the stage icon and selecting Propagate Columns
from the pop-up menu.
The columns that you define must reflect the actual layout of the file format. If you do not want to
display all of the columns, you can create fillers to replace the unwanted columns by selecting the Create
fillers check box in the Select Columns From Table window. For more information about fillers, see
“Filler creation and expansion” on page 159.
After you define columns, the CFF stage projects the columns to the Selection tab on the Output page if
you are using the stage as a source, or to the Columns tab on the Input page if you are using the stage as
a target.
If you load more than one table definition, the list of columns from the subsequent tables is added to the
end of the current list. If the first column of the subsequent list has a level number higher than the last
column of the current list, the CFF stage inserts a 02 FILLER group item before the subsequent list is
loaded. However, if the first column that is being loaded already has a level number of 02, then no 02
FILLER group item is inserted.
To load columns:
1. Click the Records tab on the Stage page.
2. Click Load to open the Table Definitions window. This window displays all of the repository objects
that are in the current project.
3. Select a table definition in the repository tree and click OK.
4. Select the columns to load in the Select Columns From Table window and click OK.
5. If flattening is an option for any arrays in the column structure, specify how to handle array data in
the Complex File Load Option window. See “Complex file load options” on page 160 for details.
Typing columns
You can also define column metadata by typing column definitions in the columns grid.
To type columns:
1. Click the Records tab on the Stage page.
2. In the Level number field of the grid, specify the COBOL level number where the data is defined. If
you do not specify a level number, a default value of 05 is used.
3. In the Column name field, type the name of the column.
4. In the Native type field, select the native data type.
5. In the Length field, specify the data precision.
6. In the Scale field, specify the data scale factor.
7. Optional: In the Description field, type a description of the column.
You can edit column properties by selecting a property in the properties tree and making changes in the
Value field. Use the Available properties to add field to add optional attributes to the properties tree.
If you need to specify COBOL attributes, open the Edit Column Meta Data window by right-clicking
anywhere in the columns grid and selecting Edit row from the pop-up menu.
Mainframe table definitions frequently contain hundreds of columns. If you do not want to display all of
these columns in the CFF stage, you can create fillers. When you load columns into the stage, select the
Create fillers check box in the Select Columns From Table window. This check box is selected by default
and is available only when you load columns from a simple or complex flat file.
The sequences of unselected columns are collapsed into FILLER items with a storage length that is equal
to the sum of the storage length of each individual column that is being replaced. The native data type is
set to CHARACTER, and the name is set to FILLER_XX_YY, where XX is the start offset and YY is the
end offset. Fillers for elements of a group array or an OCCURS DEPENDING ON (ODO) column have
the name of FILLER_NN, where NN is the element number. The NN begins at 1 for the first unselected
group element and continues sequentially. Any fillers that follow an ODO column are also numbered
sequentially.
See Appendix C, “Fillers,” on page 605 for examples of how fillers are created for different COBOL
structures.
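As an illustration of the naming rule, the following sketch collapses a run of unselected columns into a
single FILLER item. The column layout is invented, and the exact end-offset convention may differ from
what the stage generates.
def make_filler(unselected):
    # unselected: list of (name, start_offset, storage_length) for a run of
    # consecutive columns that were not selected. They collapse into one
    # FILLER item whose length is the sum of the individual storage lengths
    # and whose name is FILLER_<start offset>_<end offset>.
    start = unselected[0][1]
    length = sum(col[2] for col in unselected)
    end = start + length - 1          # offset convention assumed for illustration
    return {"name": f"FILLER_{start}_{end}",
            "native_type": "CHARACTER",
            "length": length}

print(make_filler([("ADDRESS", 40, 30), ("CITY", 70, 20), ("ZIP", 90, 9)]))
# {'name': 'FILLER_40_98', 'native_type': 'CHARACTER', 'length': 59}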
The Complex File Load Option window opens when you load or type column definitions that contain
arrays on the Records tab. This window displays the names of the column definitions and their structure.
Array sizes are shown in parentheses. Select one of the following options to specify how the stage should
process arrays:
v Flatten selective arrays. Specifies that arrays can be selected for flattening on an individual basis. This
option is the default. Right-click an array in the columns list and select Flatten from the pop-up menu.
Columns that cannot be flattened are not available for selection. The array icon changes for the arrays
that will be flattened.
v Flatten all arrays. Flattens all arrays and creates a column for each element of the arrays.
v As is. Passes arrays with no changes.
If you choose to pass an array as is, the columns with arrays are loaded as is.
If you choose to flatten an array, all elements of the array will appear as separate columns in the table
definition. The data is presented as one row at run time. Each array element is given a numeric suffix to
make its name unique.
Consider the following complex flat file structure (in COBOL file definition format):
05 ID PIC X(10)
05 NAME PIC X(30)
05 CHILD PIC X(30) OCCURS 5 TIMES
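If you flatten the CHILD array in this structure, its five elements appear as separate columns, each with
a numeric suffix to make its name unique. A minimal sketch of the effect follows; it is illustrative only,
and the suffix style shown is an assumption rather than the exact names the stage generates.
def flatten_columns(columns):
    # columns: list of (name, occurs). A column with OCCURS n is replaced by
    # n columns, each with a numeric suffix to make its name unique.
    flattened = []
    for name, occurs in columns:
        if occurs <= 1:
            flattened.append(name)
        else:
            flattened.extend(f"{name}_{i}" for i in range(1, occurs + 1))
    return flattened

print(flatten_columns([("ID", 1), ("NAME", 1), ("CHILD", 5)]))
# ['ID', 'NAME', 'CHILD_1', 'CHILD_2', 'CHILD_3', 'CHILD_4', 'CHILD_5']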
Array columns that have redefined fields or OCCURS DEPENDING ON clauses cannot be flattened.
Even if you choose to flatten all arrays in the Complex File Load Option window, these columns are
passed as is.
Columns that are identified in the record ID clause must be in the same physical storage location across
records. The constraint must be a simple equality expression, where a column equals a value.
You can select columns from multiple record types to output from the stage. If you do not select columns
to output on each link, the CFF stage automatically propagates all of the stage columns except group
columns to each empty output link when you click OK to exit the stage.
You can select columns for output in either of the following ways:
v To select individual columns, press Ctrl and click the columns in the Available columns tree, then
click > to move them to the Selected columns list.
v To select all columns from a single record definition, click the record name or any of its columns in
the Available columns tree and click >> to move them to the Selected columns list. By default, group
columns are not included, unless you first select the Enable all group column selection check box.
If you select columns out of order, they are reordered in the Selected columns list to match the
structure of the input columns.
Array columns
If you select an array column for output, the CFF stage passes data to the output link in different ways
depending on the type of array you select.
When you load array columns into a CFF stage, you must specify how to process the array data. You can
pass the data as is, flatten all arrays on input to the stage, or flatten selected arrays on input. You choose
one of these options from the Complex File Load Option window, which opens when you load column
definitions on the Records tab.
If you choose to flatten arrays, the flattening is done at the time that the column metadata is loaded into
the stage. All of the array elements appear as separate columns in the table. Each array column has a
numeric suffix to make its name unique. You can select any or all of these columns for output.
If you choose to pass arrays as is, the array structure is preserved. The data is presented as a single row
at run time for each incoming row. If the array is normalized, the incoming single row is resolved into
multiple output rows.
The following examples show how the CFF stage passes data to the output link from different types of
array columns, including simple, nested, and parallel arrays.
A simple array is a single, one-dimensional array. This example shows the result when you select all
columns as output columns. For each record that is read from the input file, five rows are written to the
output link. The sixth row out of the link causes the second record to be read from the file, starting the
process over again.
Input record:
05 ID PIC X(10)
05 NAME PIC X(30)
05 CHILD PIC X(30) OCCURS 5 TIMES.
Output rows:
Row 1: ID NAME CHILD(1)
Row 2: ID NAME CHILD(2)
Row 3: ID NAME CHILD(3)
Row 4: ID NAME CHILD(4)
Row 5: ID NAME CHILD(5)
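The same expansion can be sketched in ordinary code. The record values are invented; the sketch only
illustrates how one input record becomes five output rows when the array is passed as is and
normalized.
def normalize_simple_array(record, array_field, occurs):
    # One input record with a single OCCURS array produces one output row
    # per array element; the scalar columns are repeated on every row.
    rows = []
    for i in range(occurs):
        row = {k: v for k, v in record.items() if k != array_field}
        row[array_field] = record[array_field][i]
        rows.append(row)
    return rows

rec = {"ID": "0001", "NAME": "Smith", "CHILD": ["Anne", "Ben", "Carl", "Dina", "Ed"]}
for row in normalize_simple_array(rec, "CHILD", occurs=5):
    print(row)
# five rows: ID NAME CHILD(1) through ID NAME CHILD(5)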
This example shows the result when you select a nested array column as output. If you select FIELD-A,
FIELD-C and FIELD-D as output columns, the CFF stage multiplies the OCCURS values at each level. In
this case, 6 rows are written to the output link.
Input record:
05 FIELD-A PIC X(4)
05 FIELD-B OCCURS 2 TIMES.
10 FIELD-C PIC X(4)
10 FIELD-D PIC X(4) OCCURS 3 TIMES.
Output rows:
Row 1: FIELD-A FIELD-C(1) FIELD-D(1,1)
Row 2: FIELD-A FIELD-C(1) FIELD-D(1,2)
Row 3: FIELD-A FIELD-C(1) FIELD-D(1,3)
Row 4: FIELD-A FIELD-C(2) FIELD-D(2,1)
Row 5: FIELD-A FIELD-C(2) FIELD-D(2,2)
Row 6: FIELD-A FIELD-C(2) FIELD-D(2,3)
Parallel arrays are array columns that have the same level number. The first example shows the result
when you select all parallel array columns as output columns. The CFF stage determines the number of
output rows by using the largest subscript. As a result, the smallest array is padded with default values
and the element columns are repeated. In this case, if you select all of the input fields as output columns,
four rows are written to the output link.
Input record:
Output rows:
Row 1: FIELD-A FIELD-B(1) FIELD-C FIELD-D(1) FIELD-E(1)
Row 2: FIELD-A FIELD-B(2) FIELD-C FIELD-D(2) FIELD-E(2)
Row 3: FIELD-A FIELD-C FIELD-D(3) FIELD-E(3)
Row 4: FIELD-A FIELD-C FIELD-E(4)
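The padding rule used here can be sketched as follows: the row count is the largest subscript among the
normalized arrays, and shorter arrays are padded. The record values are invented, and an empty string
stands in for the default padding value in this illustration.
def normalize_parallel_arrays(record, array_fields, pad=""):
    # The number of output rows is the largest subscript among the arrays
    # being normalized; shorter arrays are padded with a default value.
    n_rows = max(len(record[f]) for f in array_fields)
    rows = []
    for i in range(n_rows):
        row = {k: v for k, v in record.items() if k not in array_fields}
        for f in array_fields:
            row[f] = record[f][i] if i < len(record[f]) else pad
        rows.append(row)
    return rows

rec = {"FIELD-A": "a", "FIELD-B": ["b1", "b2"], "FIELD-C": "c",
       "FIELD-D": ["d1", "d2", "d3"], "FIELD-E": ["e1", "e2", "e3", "e4"]}
for row in normalize_parallel_arrays(rec, ["FIELD-B", "FIELD-D", "FIELD-E"]):
    print(row)
# four rows, with FIELD-B padded after row 2 and FIELD-D after row 3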
In the next example, only a subset of the parallel array columns is selected (FIELD-B and FIELD-E).
FIELD-D is passed as is. The number of output rows is determined by the maximum size of the
denormalized columns. In this case, four rows are written to the output link.
Output rows:
Row 1: FIELD-A FIELD-B(1) FIELD-C FIELD-D(1) FIELD-D(2) FIELD-D(3) FIELD-E(1)
Row 2: FIELD-A FIELD-B(2) FIELD-C FIELD-D(1) FIELD-D(2) FIELD-D(3) FIELD-E(2)
Row 3: FIELD-A FIELD-C FIELD-D(1) FIELD-D(2) FIELD-D(3) FIELD-E(3)
Row 4: FIELD-A FIELD-C FIELD-D(1) FIELD-D(2) FIELD-D(3) FIELD-E(4)
This complex example shows the result when you select both parallel array fields and nested array fields
as output. If you select FIELD-A, FIELD-C, and FIELD-E as output columns in this example, the CFF
stage determines the number of output rows by using the largest OCCURS value at each level and
multiplying them. In this case, 3 is the largest OCCURS value at the outer (05) level, and 5 is the largest
OCCURS value at the inner (10) level. Therefore, 15 rows are written to the output link. Some of the
subscripts repeat. In particular, subscripts that are smaller than the largest OCCURS value at each level
start over, including the second subscript of FIELD-C and the first subscript of FIELD-E.
05 FIELD-A PIC X(10)
05 FIELD-B OCCURS 3 TIMES.
10 FIELD-C PIC X(2) OCCURS 4 TIMES.
05 FIELD-D OCCURS 2 TIMES.
10 FIELD-E PIC 9(3) OCCURS 5 TIMES.
Output rows:
Row 1: FIELD-A FIELD-C(1,1) FIELD-E(1,1)
Row 2: FIELD-A FIELD-C(1,2) FIELD-E(1,2)
Row 3: FIELD-A FIELD-C(1,3) FIELD-E(1,3)
Row 4: FIELD-A FIELD-C(1,4) FIELD-E(1,4)
Row 5: FIELD-A FIELD-E(1,5)
Row 6: FIELD-A FIELD-C(2,1) FIELD-E(2,1)
Row 7: FIELD-A FIELD-C(2,2) FIELD-E(2,2)
Row 8: FIELD-A FIELD-C(2,3) FIELD-E(2,3)
Row 9: FIELD-A FIELD-C(2,4) FIELD-E(2,4)
Row 10: FIELD-A FIELD-E(2,5)
Row 11: FIELD-A FIELD-C(3,1)
Row 12: FIELD-A FIELD-C(3,2)
Row 13: FIELD-A FIELD-C(3,3)
Row 14: FIELD-A FIELD-C(3,4)
Row 15: FIELD-A
Group columns
If you select a group column for output, the CFF stage passes data to the output link in different ways
depending on your selection.
Group columns contain elements or subgroups. When you select groups or their elements for output, the
CFF stage handles them in the following manner:
You can set the output link constraint to match the record ID constraint for each selected output record
by clicking Default on the Constraint tab on the Output page. The Default button is available only when
the constraint grid is empty. For more information about record ID constraints, see “Defining record ID
constraints” on page 160.
The order of operators in expressions is determined by SQL standards. After you verify a constraint, any
redundant parentheses might be removed.
For CFF source stages, reject links are supported only if the source file contains a single record type
without any OCCURS DEPENDING ON (ODO) columns. For CFF target stages, reject links are
supported only if the target file does not contain ODO columns.
You cannot change the selection properties of a reject link. The Selection tab for a reject link is blank.
You cannot edit the column definitions for a reject link. For writing files, the reject link uses the input
link column definitions. For reading files, the reject link uses a single column named ″rejected″ that
contains raw data for the columns that were rejected after reading because they did not match the
schema.
Transformer stages can have a single input and any number of outputs. They can also have a reject link
that takes any rows which have not been written to any of the output links by reason of a write failure or
expression evaluation failure.
Unlike most of the other stages in a Parallel job, the Transformer stage has its own user interface. It does
not use the generic interface as described in Chapter 3, “Stage editors,” on page 37.
When you edit a Transformer stage, the Transformer Editor appears. An example Transformer stage is
shown below; this is the same Transformer as illustrated in the job above. The left pane represents input
data and the right pane, output data. The two links carrying data and the link carrying the data that has
failed to meet the criteria are all visible in the editor; the reject link is not.
Must do’s
This section specifies the minimum steps to take to get a Transformer stage functioning. WebSphere
DataStage provides a versatile user interface, and there are many shortcuts to achieving a particular end;
this section describes the basic method. You will learn where the shortcuts are when you get familiar
with the product.
v In the left pane:
– Ensure that you have column metadata defined.
v In the right pane:
– Ensure that you have column metadata defined for each of the output links. The easiest way to do
this is to drag columns across from the input link.
– Define the derivation for each of your output columns. You can leave this as a straight mapping
from an input column, or explicitly define an expression to transform the data before it is output.
– Optionally specify a constraint for each output link. This is an expression which input rows must
satisfy before they are output on a link. Rows that are not output on any of the links can be output
on the otherwise link.
– Optionally specify one or more stage variables. This provides a method of defining expressions
which can be reused in your output columns derivations (stage variables are only visible within the
stage).
Toolbar
The Transformer toolbar contains the following buttons (from left to right):
v Stage properties
v Constraints
v Show all
v Show/hide stage variables
v Cut
v Copy
v Paste
v Find/replace
v Load column definition
v Save column definition
v Column auto-match
v Input link execution order
v Output link execution order
Link area
The top area displays links to and from the Transformer stage, showing their columns and the
relationships between them.
The link area is where all column definitions and stage variables are defined.
The link area is divided into two panes; you can drag the splitter bar between them to resize the panes
relative to one another. There is also a horizontal scroll bar, allowing you to scroll the view left or right.
The left pane shows the input link, the right pane shows output links. Output columns that have no
derivation defined are shown in red.
Within the Transformer Editor, a single link may be selected at any one time. When selected, the link’s
title bar is highlighted, and arrowheads indicate any selected columns within that link.
The meta data for each link is shown in a grid contained within a tabbed page. Click the tab to bring the
required link to the front. That link is also selected in the link area.
If you select a link in the link area, its meta data tab is brought to the front automatically.
You can edit the grids to change the column meta data on any of the links. You can also add and delete
metadata.
As with column meta data grids on other stage editors, edit row in the context menu allows editing of
the full metadata definitions (see ″Columns Tab″ ).
Shortcut menus
The Transformer Editor shortcut menus are displayed by right-clicking the links in the links area.
There are slightly different menus, depending on whether you right-click an input link, an output link, or
a stage variable. The input link menu offers you operations on input columns, the output link menu
offers you operations on output columns and their derivations, and the stage variable menu offers you
operations on stage variables.
If you display the menu from the links area background, you can:
v Open the Stage Properties dialog box in order to specify stage or link properties.
v Open the Constraints dialog box in order to specify a constraint for the selected output link.
v Open the Link Execution Order dialog box in order to specify the order in which links should be
processed.
v Toggle between viewing link relations for all links, or for the selected link only.
v Toggle between displaying stage variables and hiding them.
Right-clicking in the meta data area of the Transformer Editor opens the standard grid editing shortcut
menus.
This section explains some of the basic concepts of using a Transformer stage.
Input link
The input data source is joined to the Transformer stage via the input link.
Output links
You can have any number of output links from your Transformer stage.
You may want to pass some data straight through the Transformer stage unaltered, but it’s likely that
you’ll want to transform data from some input columns before outputting it from the Transformer stage.
You can specify such an operation by entering a transform expression. The source of an output link
column is defined in that column’s Derivation cell within the Transformer Editor. You can use the
Expression Editor to enter expressions in this cell. You can also simply drag an input column to an
output column’s Derivation cell, to pass the data straight through the Transformer stage.
In addition to specifying derivation details for individual output columns, you can also specify
constraints that operate on entire output links. A constraint is an expression that specifies criteria that
data must meet before it can be passed to the output link. You can also specify a constraint otherwise
link, which is an output link that carries all the data not output on other links, that is, columns that have
not met the criteria.
Each output link is processed in turn. If the constraint expression evaluates to TRUE for an input row, the
data row is output on that link. Conversely, if a constraint expression evaluates to FALSE for an input
row, the data row is not output on that link.
Constraint expressions on different links are independent. If you have more than one output link, an
input row may result in a data row being output from some, none, or all of the output links.
For example, if you consider the data that comes from a paint shop, it could include information about
any number of different colors. If you want to separate the colors into different files, you would set up
different constraints. You could output the information about green and blue paint on LinkA, red and
yellow paint on LinkB, and black paint on LinkC.
When an input row contains information about yellow paint, the LinkA constraint expression evaluates to
FALSE and the row is not output on LinkA. However, the input data does satisfy the constraint criterion
for LinkB and the rows are output on LinkB.
If the input data contains information about white paint, this does not satisfy any constraint and the data
row is not output on Links A, B or C, but will be output on the otherwise link. The otherwise link is used
to route data to a table or file that is a ″catch-all″ for rows that are not output on any other link. The
table or file containing these rows is represented by another stage in the job design.
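The routing behavior can be sketched as follows. The column name and constraint tests are invented
stand-ins for the paint example; in the product the constraints are entered as DataStage expressions in
the Constraints dialog box, not as code.
def route_row(row, constraints):
    # Each output link's constraint is evaluated independently; a row can be
    # written to some, none, or all links. Rows written to no link go to the
    # otherwise link.
    written = [link for link, test in constraints.items() if test(row)]
    return written if written else ["otherwise"]

constraints = {
    "LinkA": lambda r: r["COLOUR"] in ("green", "blue"),
    "LinkB": lambda r: r["COLOUR"] in ("red", "yellow"),
    "LinkC": lambda r: r["COLOUR"] == "black",
}
print(route_row({"COLOUR": "yellow"}, constraints))  # ['LinkB']
print(route_row({"COLOUR": "white"}, constraints))   # ['otherwise']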
You can also specify another output link which takes rows that have not been written to any other links
because of write failure or expression evaluation failure. This is specified outside the stage by adding a
link and converting it to a reject link using the shortcut menu. This link is not shown in the Transformer
meta data grid, and derives its meta data from the input link. Its column values are those in the input
row that failed to be written.
You can drag and drop multiple columns, key expressions, or derivations. Use the standard Explorer keys
when selecting the source column cells, then proceed as for a single cell.
You can drag and drop the full column set by dragging the link title.
You can add a column to the end of an existing derivation by holding down the Ctrl key as you drag the
column.
The Find and Replace dialog box appears. It has three tabs:
v Expression Text. Allows you to locate the occurrence of a particular string within an expression, and
replace it if required. You can search up or down, and choose to match case, match whole words, or
neither. You can also choose to replace all occurrences of the string within an expression.
v Columns Names. Allows you to find a particular column and rename it if required. You can search up
or down, and choose to match case, match the whole word, or neither.
v Expression Types. Allows you to find the next empty expression or the next expression that contains
an error. You can also press Ctrl-M to find the next empty expression or Ctrl-N to find the next
erroneous expression.
Note: The find and replace results are shown in the color specified in Tools → Options.
Press F3 to repeat the last search you made without opening the Find and Replace dialog box.
Select facilities
If you are working on a complex job where several links, each containing several columns, go in and out
of the Transformer stage, you can use the select column facility to select multiple columns. This facility is
also available in the Mapping tabs of certain Parallel job stages.
To use the select facilities, choose Select from the link shortcut menu. The Select dialog box appears. It
has three tabs:
v Expression Text. This Expression Text tab allows you to select all columns/stage variables whose
expressions contain text that matches the text specified. The text specified is a simple text match, taking
into account the Match case setting.
v Column Names. The Column Names tab allows you to select all column/stage variables whose Name
contains the text specified. There is an additional Data Type drop down list, that will limit the columns
selected to those with that data type. You can use the Data Type drop down list on its own to select all
columns of a certain data type. For example, all string columns can be selected by leaving the text field
blank, and selecting String as the data type. The data types in the list are generic data types, where
each of the column SQL data types belong to one of these generic types.
v Expression Types. The Expression Types tab allows you to select all columns with either empty
expressions or invalid expressions.
When copying columns, a new column is created with the same meta data as the column it was copied
from.
To delete a column from within the Transformer Editor, select the column you want to delete and click
the cut button or choose Delete Column from the shortcut menu.
The meta data shown does not include column derivations since these are edited in the links area.
Note: To access a vector element in a column derivation, you need to use an expression containing the
vector function - see ″Vector Function″ .
If a derivation is displayed in red (or the color defined in Tools → Options), it means that the Transformer
Editor considers it incorrect.
Once an output link column has a derivation defined that contains any input link columns, then a
relationship line is drawn between the input column and the output column in the link area.
Note: Auto-matching does not take into account any data type incompatibility between matched
columns; the derivations are set regardless.
The Expression Substitution dialog box allows you to make the same change to the expressions of all the
currently selected columns within a link. For example, if you wanted to add a call to the trim() function
around all the string output column expressions in a link, you could do this in two steps. First, use the
Select dialog to select all the string output columns. Then use the Expression Substitution dialog to apply
a trim() call around each of the existing expression values in those selected columns.
You are offered a choice between Whole expression substitution and Part of expression substitution.
Whole expression
With this option the whole existing expression for each column is replaced by the replacement value
specified. This replacement value can be a completely new value, but will usually be a value based on
the original expression value. When specifying the replacement value, the existing value of the column’s
expression can be included in this new value by including ″$1″. This can be included any number of
times.
For example, when adding a trim() call around each expression of the currently selected column set,
having selected the required columns, you would:
1. Select the Whole expression option.
2. Enter a replacement value of:
trim($1)
3. Click OK
If you need to include the actual text $1 in your expression, enter it as ″$$1″.
Part of expression
With this option, only part of each selected expression is replaced rather than the whole expression. The
part of the expression to be replaced is specified by a Regular Expression match.
It is possible that more than one part of an expression string could match the Regular Expression
specified. If Replace all occurrences is checked, then each occurrence of a match will be updated with the
replacement value specified. If it is not checked, then just the first occurrence is replaced.
When replacing part of an expression, the replacement value specified can include that part of the
original expression being replaced. In order to do this, the Regular Expression specified must have round
brackets around its value. ″$1″ in the replacement value will then represent that matched text. If the
Regular Expression is not surrounded by round brackets, then ″$1″ will simply be the text ″$1″.
For complex Regular Expression usage, subsets of the Regular Expression text can be included in round
brackets rather than the whole text. In this case, the entire matched part of the original expression is still
replaced, but ″$1″, ″$2″ etc can be used to refer to each matched bracketed part of the Regular Expression
specified.
Suppose a selected set of columns have derivations that use input columns from `DSLink3’. For example,
two of these derivations could be:
DSLink3.OrderCount + 1
If (DSLink3.Total > 0) Then DSLink3.Total Else -1
You may want to protect the usage of these input columns from null values, and use a zero value instead
of the null. To do this:
1. Select the columns you want to substitute expressions for.
2. Select the Part of expression option.
3. Specify a Regular Expression value of:
(DSLink3\.[a-z,A-Z,0-9]*)
4. Specify a replacement value of
NullToZero($1)
5. Click OK, to apply this to all the selected column derivations.
The first derivation, DSLink3.OrderCount + 1, would become:
NullToZero(DSLink3.OrderCount) + 1
The second derivation,
If (DSLink3.Total > 0) Then DSLink3.Total Else -1
with the Replace all occurrences option selected, would become:
If (NullToZero(DSLink3.Total) > 0)
Then NullToZero(DSLink3.Total)
Else -1
The replacement value can be any form of expression string. For example, in the case above, the
replacement value could have been:
(If (StageVar1 > 50000) Then $1 Else ($1 + 100))
In that case, the first derivation would have become:
(If (StageVar1 > 50000) Then DSLink3.OrderCount
Else (DSLink3.OrderCount + 100)) + 1
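The same substitution can be reproduced outside the product with an ordinary regular expression
replace. In this illustrative sketch, Python's re module stands in for the Expression Substitution dialog
box, with \1 playing the role of ″$1″; the character class used is a cleaned-up equivalent of the pattern
shown above.
import re

pattern = r"(DSLink3\.[a-zA-Z0-9]*)"          # round brackets capture the matched text
derivations = [
    "DSLink3.OrderCount + 1",
    "If (DSLink3.Total > 0) Then DSLink3.Total Else -1",
]
# Replace every occurrence (the "Replace all occurrences" behavior)
for d in derivations:
    print(re.sub(pattern, r"NullToZero(\1)", d))
# NullToZero(DSLink3.OrderCount) + 1
# If (NullToZero(DSLink3.Total) > 0) Then NullToZero(DSLink3.Total) Else -1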
It does not apply where an output column is mapped directly from an input column, with a straight
assignment expression.
If you need to be able to handle nulls in these situations, you should use the null handling functions
described in Appendix B. For example, you could enter an output column derivation expression
including the expression:
1 + NullToZero(InLink.Col1)
This would check the input column to see if it contains a null, and if it did, replace it with 0 (which is
added to 1). Otherwise the value the column contains is added to 1.
Define a constraint by entering an expression in the Constraint field for that link. Once you have done
this, any constraints will appear below the link’s title bar in the Transformer Editor. This constraint
expression will then be checked against the row data at runtime. If the data does not satisfy the
constraint, the row will not be written to that link. It is also possible to define a link which can be used
to catch those rows which have not satisfied the constraints on any previous links.
Note: You can also specify a reject link which will catch rows that have not been written on any
output links due to a write error or null expression error. Define this outside the Transformer stage
by adding a link and using the shortcut menu to convert it to a reject link.
The initial order of the links is the order in which they are added to the stage.
Any stage variables you declare are shown in a table in the right pane of the links area. The table looks
similar to an output link. You can display or hide the table by clicking the Stage Variable button in the
Transformer toolbar or choosing Stage Variable from the background shortcut menu.
The table lists the stage variables together with the expressions used to derive their values. Link lines join
the stage variables with input columns used in the expressions. Links from the right side of the table link
the variables to the output columns that use them.
You perform most of the same operations on a stage variable as you can on an output column (see
Defining Output Column Derivations). A shortcut menu offers the same commands. You cannot, however,
paste a stage variable as a new column, or a column as a new stage variable.
Expression format
The format of an expression is as follows:
KEY:
something_like_this is a token
something_in_italics is a terminal, i.e., doesn’t break down any further
| is a choice between tokens
[ ] is an optional part of the construction
"XXX" is a literal token (i.e., use XXX not
including the quotes)
=================================================
expression ::= function_call |
variable_name |
other_name |
Entering expressions
Whenever the insertion point is in an expression box, you can use the Expression Editor to suggest the
next element in your expression. Do this by right-clicking the box, or by clicking the Suggest button to
the right of the box. This opens the Suggest Operand or Suggest Operator menu. Which menu appears
depends on context, i.e., whether you should be entering an operand or an operator as the next
expression element. The Functions available from this menu are described in Appendix B. The DS macros
are described in WebSphere DataStage Parallel Job Advanced Developer Guide. You can also specify custom
routines for use in the expression editor (see WebSphere DataStage Designer Client Guide).
If there is an error, a message appears and the element causing the error is highlighted in the expression
box. You can either correct the expression or close the Transformer Editor or Transform dialog box.
For any expression, selecting Validate from its shortcut menu will also validate it and show any errors in
a message box.
The Expression Editor is configured by editing the Designer options. This allows you to specify how
`helpful’ the expression editor is. For more information, see WebSphere DataStage Designer Client Guide.
System variables
WebSphere DataStage provides a set of variables containing useful system information that you can
access from an output derivation or constraint.
v @FALSE. The value is replaced with 0.
v @TRUE. The value is replaced with 1.
v @INROWNUM. Input row counter.
v @OUTROWNUM. Output row counter (per link).
v @NUMPARTITIONS. The total number of partitions for the stage.
v @PARTITIONNUM. The partition number for the particular instance.
The stage variables and the columns within a link are evaluated in the order in which they are displayed
on the parallel job canvas. Similarly, the output links are also evaluated in the order in which they are
displayed.
From this sequence, it can be seen that there are certain constructs that would be inefficient to include in
output column derivations, as they would be evaluated once for every output column that uses them.
Such constructs are:
v Where the same part of an expression is used in multiple column derivations.
For example, suppose multiple columns in output links want to use the same substring of an input
column, then the following test may appear in a number of output columns derivations:
IF (DSLINK1.col1[1,3] = "001") THEN ...
In this case, the evaluation of the substring of DSLINK1.col1[1,3] is repeated for each column that uses
it.
This can be made more efficient by moving the substring calculation into a stage variable. By doing
this, the substring is evaluated just once for every input row. In this case, the stage variable definition
for StageVar1 would be:
DSLINK1.col1[1,3]
and each column derivation would start with:
IF (StageVar1 = "001") THEN ...
In fact, this example could be improved further by also moving the string comparison into the stage
variable. The stage variable would be:
IF (DSLink1.col1[1,3] = "001") THEN 1 ELSE 0
and each column derivation would start with:
IF (StageVar1) THEN
This reduces both the number of substring functions evaluated and string comparisons made in the
Transformer.
v Where an expression includes calculated constant values.
For example, a column definition may include a function call that returns a constant value, such as:
Str(" ",20)
This returns a string of 20 spaces. In this case, the function would be evaluated every time the column
derivation is evaluated. It would be more efficient to calculate the constant value just once for the
whole Transformer.
This can be achieved using stage variables. This function could be moved into a stage variable
derivation; but in this case, the function would still be evaluated once for every input row. The
solution here is to move the function evaluation into the initial value of a stage variable.
A stage variable can be assigned an initial value from the Stage Properties dialog box Variables tab. In
this case, the variable would have its initial value set to:
Str(" ", 20)
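Both optimizations can be sketched in ordinary code: the shared test is evaluated once per row (the
stage variable), and the constant padding string is evaluated once for the whole run (the stage variable's
initial value). The column names and values are invented for the illustration.
def transform_rows(rows):
    # Equivalent of giving a stage variable an initial value of Str(" ", 20):
    # the constant is evaluated once for the whole run, not once per row.
    padding = " " * 20

    out = []
    for row in rows:
        # Equivalent of the stage variable
        #   IF (DSLink1.col1[1,3] = "001") THEN 1 ELSE 0
        # evaluated once per row and reused by every column derivation.
        stage_var1 = 1 if row["col1"][0:3] == "001" else 0

        out.append({
            "out1": ("domestic" if stage_var1 else "export") + padding,
            "out2": ("priority" if stage_var1 else "standard") + padding,
        })
    return out

print(transform_rows([{"col1": "001A42"}, {"col1": "17355B"}]))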
Stage page
The Stage page has up to eight tabs:
v General. Allows you to enter an optional description of the stage.
v Variables. Allows you to set up stage variables for use in the stage.
v Surrogate Key. Allows you to generate surrogate keys that can be used in complex lookup and update
operations.
v Advanced. Allows you to specify how the stage executes.
v Link Ordering. Allows you to specify the order in which the output links will be processed.
v Triggers. Allows you to run certain routines at certain points in the stage’s execution.
v NLS Locale. Allows you to select a locale other than the project default to determine collating rules.
v Build. Allows you to override the default compiler and linker flags for this stage.
The Variables tab is described in ″Defining Local Stage Variables″. The Link Ordering tab is described in
″Specifying Link Order″.
General tab
In addition to the Description field, the General page also has an option which lets you control how
many rejected row warnings will appear in the job log when you run the job. Whenever a row is rejected
because it contains a null value, a warning is written to the job log. Potentially there could be a lot of
messages, so this option allows you to set limits. By default, up to 50 messages per partition are allowed,
but you can increase or decrease this, or set it to -1 to allow unlimited messages.
You must create a derivation for the surrogate key column that uses the NextSurrogateKey function.
Advanced tab
The Advanced tab is the same as the Advanced tab of the generic stage editor as described in ″Advanced
Tab″. This tab allows you to specify the following:
v Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the data
is processed by the available nodes as specified in the Configuration file, and by any node constraints
specified on the Advanced tab. In sequential mode the data is processed by the conductor node.
v Combinability mode. This is Auto by default, which allows WebSphere DataStage to combine the
operators that underlie parallel stages so that they run in the same process if it is sensible for this type
of stage.
v Preserve partitioning. This is set to Propagate by default, this sets or clears the partitioning in
accordance with what the previous stage has set. You can also select Set or Clear. If you select Set, the
stage will request that the next stage preserves the partitioning as is.
v Node pool and resource constraints. Select this option to constrain parallel execution to the node pool
or pools and/or resource pool or pools specified in the grid. The grid allows you to make choices from
drop down lists populated from the Configuration file.
v Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node
map. You can define a node map by typing node numbers into the text box or by clicking the browse
button to open the Available Nodes dialog box and selecting nodes from there. You are effectively
defining a new node pool for this stage (in addition to any node pools defined in the Configuration
file).
Triggers tab
The Triggers tab allows you to choose routines to be executed at specific execution points as the
transformer stage runs in a job. The execution point is per-instance, i.e., if a job has two transformer stage
instances running in parallel, the routine will be called twice, once for each instance.
The available execution points are Before-stage and After-stage. At this release, the only available built-in
routine is SetCustomSummaryInfo. You can also define custom routines to be executed; to do this you
define a C function, make it available in a UNIX shared library, and then define a Parallel routine which
calls it (see WebSphere DataStage Designer Client Guide for details on defining a Parallel Routine). Note that
the function should not return a value.
Build tab
This tab allows you to override the compiler and linker flags that have been set for the job or project. The
flags you specify here will take effect for this stage and this stage alone. The flags available are platform
and compiler-dependent.
Input page
The Input page allows you to specify details about data coming into the Transformer stage. The
Transformer stage can have only one input link.
The General tab allows you to specify an optional description of the input link. The Partitioning tab
allows you to specify how incoming data is partitioned. This is the same as the Partitioning tab in the
generic stage editor described in ″Partitioning Tab″. The Advanced tab allows you to change the default
buffering settings for the input link.
Partitioning tab
The Partitioning tab allows you to specify details about how the incoming data is partitioned or collected
when input to the Transformer stage. It also allows you to specify that the data should be sorted on
input.
By default the Transformer stage will attempt to preserve partitioning of incoming data, or use its own
partitioning method according to what the previous stage in the job dictates.
If the Transformer stage is operating in sequential mode, it will first collect the data before writing it to
the file using the default collection method.
The Partitioning tab allows you to override this default behavior. The exact operation of this tab depends
on:
v Whether the stage is set to execute in parallel or sequential mode.
v Whether the preceding stage in the job is set to execute in parallel or sequential mode.
If the Transformer stage is set to execute in parallel, then you can set a partitioning method by selecting
from the Partitioning type drop-down list. This will override any current partitioning.
If the Transformer stage is set to execute in sequential mode, but the preceding stage is executing in
parallel, then you can set a collection method from the Collector type drop-down list. This will override
the default collection method.
The Partitioning tab also allows you to specify that data arriving on the input link should be sorted. The
sort is always carried out within data partitions. If the stage is partitioning incoming data the sort occurs
after the partitioning. If the stage is collecting data, the sort occurs before the collection. The availability
of sorting depends on the partitioning method chosen.
If NLS is enabled an additional button opens a dialog box allowing you to select a locale specifying the
collate convention for the sort.
You can also specify sort direction, case sensitivity, whether sorted as ASCII or EBCDIC, and whether null
columns will appear first or last for each column. Where you are using a keyed partitioning method, you
can also specify whether the column is used as a key for sorting, for partitioning, or for both. Select the
column in the Selected list and right-click to invoke the shortcut menu.
Output page
The Output Page has a General tab which allows you to enter an optional description for each of the
output links on the Transformer stage. It also allows you to switch Runtime column propagation on for
this link, in which case data will automatically be propagated from the input link without you having to
specify meta data for this output link (see ″Runtime Column Propagation″ ). The Advanced tab allows
you to change the default buffering settings for the output links.
The BASIC Transformer stage is similar in appearance and function to the Transformer stage described in
Chapter 11, “Transformer stage,” on page 167. It gives access to BASIC transforms and functions (BASIC
is the language supported by the WebSphere DataStage server engine and available in server jobs). For a
description of the BASIC functions available see WebSphere DataStage Server Job Developer Guide.
You can only use BASIC transformer stages on SMP systems (not on MPP or cluster systems).
Note: If you encounter a problem when running a job containing a BASIC transformer, you could try
increasing the value of the DSIPC_OPEN_TIMEOUT environment variable in the Parallel
Operator specific category of the environment variable dialog box in the DataStage Administrator
(see WebSphere DataStage Administrator Client Guide).
BASIC Transformer stages can have a single input and any number of outputs.
Must do’s
This section specifies the minimum steps to take to get a BASIC Transformer stage functioning.
WebSphere DataStage provides a versatile user interface, and there are many shortcuts to achieving a
particular end; this section describes the basic method. You will learn where the shortcuts are when you
get familiar with the product.
v In the left pane:
– Ensure that you have column metadata defined.
v In the right pane:
– Ensure that you have column metadata defined for each of the output links. The easiest way to do
this is to drag columns across from the input link.
– Define the derivation for each of your output columns. You can leave this as a straight mapping
from an input column, or explicitly define an expression to transform the data before it is output.
– Optionally specify a constraint for each output link. This is an expression which input rows must
satisfy before they are output on a link. Rows that are not output on any of the links can be output
on the otherwise link.
– Optionally specify one or more stage variables. This provides a method of defining expressions
which can be reused in your output columns derivations (stage variables are only visible within the
stage).
Toolbar
The Transformer toolbar contains the following buttons (from left to right):
v Stage properties
v Constraints
v Show all
Link area
The top area displays links to and from the BASIC Transformer stage, showing their columns and the
relationships between them.
The link area is where all column definitions and stage variables are defined.
The link area is divided into two panes; you can drag the splitter bar between them to resize the panes
relative to one another. There is also a horizontal scroll bar, allowing you to scroll the view left or right.
The left pane shows the input link, the right pane shows output links. Output columns that have no
derivation defined are shown in red.
Within the Transformer Editor, a single link may be selected at any one time. When selected, the link’s
title bar is highlighted, and arrowheads indicate any selected columns.
Metadata area
The bottom area shows the column metadata for input and output links. Again this area is divided into
two panes: the left showing input link meta data and the right showing output link meta data.
The meta data for each link is shown in a grid contained within a tabbed page. Click the tab to bring the
required link to the front. That link is also selected in the link area.
If you select a link in the link area, its metadata tab is brought to the front automatically.
You can edit the grids to change the column meta data on any of the links. You can also add and delete
metadata.
Shortcut menus
The BASIC Transformer Editor shortcut menus are displayed by right-clicking the links in the links area.
There are slightly different menus, depending on whether you right-click an input link, an output link, or
a stage variable. The input link menu offers you operations on input columns, the output link menu
offers you operations on output columns and their derivations, and the stage variable menu offers you
operations on stage variables.
If you display the menu from the links area background, you can:
v Open the Stage Properties dialog box in order to specify stage or link properties.
v Open the Constraints dialog box in order to specify a constraint for the selected output link.
v Open the Link Execution Order dialog box in order to specify the order in which links should be
processed.
v Toggle between viewing link relations for all links, or for the selected link only.
v Toggle between displaying stage variables and hiding them.
Right-clicking in the meta data area of the Transformer Editor opens the standard grid editing shortcut
menus.
This section explains some of the basic concepts of using a Transformer stage.
Input link
The input data source is joined to the BASIC Transformer stage via the input link.
Output links
You can have any number of output links from your Transformer stage.
You may want to pass some data straight through the BASIC Transformer stage unaltered, but it’s likely
that you’ll want to transform data from some input columns before outputting it from the BASIC
Transformer stage.
You can specify such an operation by entering an expression or by selecting a transform to apply to the
data. WebSphere DataStage has many built-in transforms, or you can define your own custom transforms
that are stored in the Repository and can be reused as required.
The source of an output link column is defined in that column’s Derivation cell within the Transformer
Editor. You can use the Expression Editor to enter expressions or transforms in this cell. You can also
simply drag an input column to an output column’s Derivation cell, to pass the data straight through the
BASIC Transformer stage.
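For example, assuming a hypothetical input link named DSLink1 with columns Price and CustomerName
(these names are illustrative only, not taken from a real job), output column derivations might be entered
as:
DSLink1.Price * 1.175
Trim(DSLink1.CustomerName)
The first expression transforms the input value; the second simply tidies it before it is output.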
In addition to specifying derivation details for individual output columns, you can also specify
constraints that operate on entire output links. A constraint is a BASIC expression that specifies criteria
that input rows must satisfy before they can be output on a link.
Each output link is processed in turn. If the constraint expression evaluates to TRUE for an input row, the
data row is output on that link. Conversely, if a constraint expression evaluates to FALSE for an input
row, the data row is not output on that link.
Constraint expressions on different links are independent. If you have more than one output link, an
input row may result in a data row being output from some, none, or all of the output links.
For example, if you consider the data that comes from a paint shop, it could include information about
any number of different colors. If you want to separate the colors into different files, you would set up
different constraints. You could output the information about green and blue paint on LinkA, red and
yellow paint on LinkB, and black paint on LinkC.
When an input row contains information about yellow paint, the LinkA constraint expression evaluates to
FALSE and the row is not output on LinkA. However, the input data does satisfy the constraint criterion
for LinkB and the rows are output on LinkB.
If the input data contains information about white paint, this does not satisfy any constraint and the data
row is not output on Links A, B or C, but will be output on the reject link. The reject link is used to route
data to a table or file that is a ″catch-all″ for rows that are not output on any other link. The table or file
containing these rejects is represented by another stage in the job design.
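As an illustration only (the input link and column names here are hypothetical), the constraints for the
three links described above might be entered as:
LinkA: DSLink1.Color = "Green" Or DSLink1.Color = "Blue"
LinkB: DSLink1.Color = "Red" Or DSLink1.Color = "Yellow"
LinkC: DSLink1.Color = "Black"
Rows that satisfy none of these constraints, such as the white paint rows, are picked up by the otherwise
or reject link.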
You can drag and drop multiple columns or derivations. Use the standard Explorer keys when selecting
the source column cells, then proceed as for a single cell.
You can drag and drop the full column set by dragging the link title.
You can add a column to the end of an existing derivation by holding down the Ctrl key as you drag the
column.
The Find and Replace dialog box appears. It has three tabs:
v Expression Text. Allows you to locate the occurrence of a particular string within an expression, and
replace it if required. You can search up or down, and choose to match case, match whole words, or
neither. You can also choose to replace all occurrences of the string within an expression.
v Column Names. Allows you to find a particular column and rename it if required. You can search up
or down, and choose to match case, match the whole word, or neither.
v Expression Types. Allows you to find the next empty expression or the next expression that contains
an error. You can also press Ctrl-M to find the next empty expression or Ctrl-N to find the next
erroneous expression.
Note: The find and replace results are shown in the color specified in Tools → Options.
Press F3 to repeat the last search you made without opening the Find and Replace dialog box.
Select facilities
If you are working on a complex job where several links, each containing several columns, go in and out
of the Transformer stage, you can use the select column facility to select multiple columns. This facility is
also available in the Mapping tabs of certain Parallel job stages.
To use the select facilities, choose Select from the link shortcut menu. The Select dialog box appears. It
has three tabs:
v Expression Text. This Expression Text tab allows you to select all columns/stage variables whose
expressions contain text that matches the text specified. The text specified is a simple text match, taking
into account the Match case setting.
v Column Names. The Column Names tab allows you to select all columns/stage variables whose Name
contains the text specified. There is an additional Data Type drop-down list that limits the columns
selected to those with that data type. You can use the Data Type drop-down list on its own to select all
columns of a certain data type. For example, all string columns can be selected by leaving the text field
blank and selecting String as the data type. The data types in the list are generic data types, where
each of the column SQL data types belongs to one of these generic types.
v Expression Types. The Expression Types tab allows you to select all columns with either empty
expressions or invalid expressions.
When copying columns, a new column is created with the same meta data as the column it was copied
from.
To delete a column from within the Transformer Editor, select the column you want to delete and click
the cut button or choose Delete Column from the shortcut menu.
The meta data shown does not include column derivations since these are edited in the links area.
If a derivation is displayed in red (or the color defined in Tools → Options), it means that the Transformer
Editor considers it incorrect. (In some cases this may simply mean that the derivation does not meet the
strict usage pattern rules of the WebSphere DataStage engine, but will actually function correctly.)
Once an output link column has a derivation defined that contains any input link columns, a
relationship line is drawn between the input column and the output column. In the simplest case there is
a single line, but there can be multiple relationship lines either into or out of a column.
You can choose whether to view the relationships for all links, or just the relationships for the selected
links, using the button in the toolbar.
Note: Auto-matching does not take into account any data type incompatibility between matched
columns; the derivations are set regardless.
The Expression Substitution dialog box allows you to make the same change to the expressions of all the
currently selected columns within a link. For example, if you wanted to add a call to the trim() function
around all the string output column expressions in a link, you could do this in two steps. First, use the
Select dialog to select all the string output columns. Then use the Expression Substitution dialog to apply
a trim() call around each of the existing expression values in those selected columns.
You are offered a choice between Whole expression substitution and Part of expression substitution.
Whole expression
With this option the whole existing expression for each column is replaced by the replacement value
specified. This replacement value can be a completely new value, but will usually be a value based on
the original expression value. When specifying the replacement value, the existing value of the column’s
expression can be included in this new value by including ″$1″. This can be included any number of
times.
For example, when adding a trim() call around each expression of the currently selected column set,
having selected the required columns, you would:
1. Select the Whole expression option.
2. Enter a replacement value of:
trim($1)
3. Click OK
If you need to include the actual text $1 in your expression, enter it as ″$$1″.
Part of expression
With this option, only part of each selected expression is replaced rather than the whole expression. The
part of the expression to be replaced is specified by a Regular Expression match.
It is possible that more than one part of an expression string could match the Regular Expression
specified. If Replace all occurrences is checked, then each occurrence of a match will be updated with the
replacement value specified. If it is not checked, then just the first occurrence is replaced.
When replacing part of an expression, the replacement value specified can include that part of the
original expression being replaced. In order to do this, the Regular Expression specified must have round
brackets around its value. ″$1″ in the replacement value will then represent that matched text. If the
Regular Expression is not surrounded by round brackets, then ″$1″ will simply be the text ″$1″.
Suppose a selected set of columns have derivations that use input columns from ″DSLink3″. For example,
two of these derivations could be:
DSLink3.OrderCount + 1
If (DSLink3.Total > 0) Then DSLink3.Total Else -1
You may want to protect the usage of these input columns from null values, and use a zero value instead
of the null. To do this:
1. Select the columns you want to substitute expressions for.
2. Select the Part of expression option.
3. Specify a Regular Expression value of:
(DSLink3\.[a-zA-Z0-9]*)
4. Specify a replacement value of
NullToZero($1)
5. Click OK, to apply this to all the selected column derivations.
The first example derivation:
DSLink3.OrderCount + 1
would become:
NullToZero(DSLink3.OrderCount) + 1
and
If (DSLink3.Total > 0) Then DSLink3.Total Else -1
would become:
If (NullToZero(DSLink3.Total) > 0) Then DSLink3.Total Else -1
If the Replace all occurrences option is selected, the second expression will become:
If (NullToZero(DSLink3.Total) > 0)
Then NullToZero(DSLink3.Total)
Else -1
The replacement value can be any form of expression string. For example in the case above, the
replacement value could have been:
(If (StageVar1 > 50000) Then $1 Else ($1 + 100))
In which case the first example derivation, DSLink3.OrderCount + 1, would become:
(If (StageVar1 > 50000) Then DSLink3.OrderCount
Else (DSLink3.OrderCount + 100)) + 1
Choose a routine from the drop-down list box. This list box contains all the routines defined as a
Before/After Subroutine under the Routines branch in the Repository. Enter an appropriate value for the
routine’s input argument in the Input Value field.
If you choose a routine that is defined in the Repository, but which was edited but not compiled, a
warning message reminds you to compile the routine when you close the Transformer stage dialog box.
If you installed or imported a job, the Before-stage subroutine or After-stage subroutine field may
reference a routine that does not exist on your system. In this case, a warning message appears when you
close the dialog box. You must install or import the ″missing″ routine or choose an alternative one to use.
A return code of 0 from the routine indicates success; any other code indicates failure and causes a fatal
error when the job is run.
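As a minimal sketch only (the routine and argument names are hypothetical, not routines supplied with
the product), a before-stage or after-stage subroutine written in DataStage BASIC takes an input argument
and an error code argument, and signals success by setting the error code to 0:
Subroutine CheckInputValue(InputArg, ErrorCode)
* Hypothetical before-stage subroutine: fail the job if no input value is supplied.
ErrorCode = 0 ;* 0 indicates success
If Trim(InputArg) = "" Then
   ErrorCode = 1 ;* any non-zero value is treated as failure and aborts the job
End
Return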
A dialog box appears which allows you either to define constraints for any of the Transformer output
links or to define a link as a reject link.
Define a constraint by entering an expression in the Constraint field for that link. Once you have done
this, any constraints will appear below the link’s title bar in the Transformer Editor. This constraint
expression will then be checked against the row data at runtime. If the data does not satisfy the
constraint, the row will not be written to that link. It is also possible to define a link which can be used
to catch these rows which have been rejected from a previous link.
A reject link can be defined by choosing Yes in the Reject Row field and setting the Constraint field as
follows:
v To catch rows which are rejected from a specific output link, set the Constraint field to
linkname.REJECTED. This will be set whenever a row is rejected on the linkname link, whether because
the row fails to match a constraint on that output link, or because a write operation on the target fails
for that row. Note that such an otherwise link should occur after the output link from which it is
defined to catch rejects.
v To catch rows which caused a write failure on an output link, set the Constraint field to
linkname.REJECTEDCODE. The value of linkname.REJECTEDCODE will be non-zero if the row was
rejected because a write operation on that link failed.
v To catch rows which are not output on any other link, leave the Constraint field blank. This acts as a
″catch all″ for rejected rows.
Note: Due to the nature of the ″catch all″ case above, you should only use one reject link whose
Constraint field is blank. To use multiple reject links, you should define them to use the
linkname.REJECTED flag detailed in the first case above.
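For example, with hypothetical output links LinkA and LinkB defined earlier in the link order, a reject
link intended to catch only the rows rejected from LinkA would have Reject Row set to Yes and a
constraint of:
LinkA.REJECTED
A second reject link could use LinkB.REJECTED in the same way.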
Note: Although the link ordering facilities mean that you can use a previous output column to derive
a subsequent output column, we do not encourage this practice, and you will receive a warning
if you do so.
Note: Stage variables are not shown in the output link meta data area at the bottom of the right pane.
The table lists the stage variables together with the expressions used to derive their values. Link lines join
the stage variables with input columns used in the expressions. Links from the right side of the table link
the variables to the output columns that use them.
Variables entered in the Stage Properties dialog box appear in the Stage Variable table in the links pane.
You perform most of the same operations on a stage variable as you can on an output column (see
Defining Output Column Derivations). A shortcut menu offers the same commands. You cannot, however,
paste a stage variable as a new column, or a column as a new stage variable.
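For example (the link, column, and variable names here are hypothetical), you might define a stage
variable StageVar1 with the derivation:
DSLink1.Quantity * DSLink1.UnitPrice
and then reuse it in several output column derivations, such as:
StageVar1 * 0.175
StageVar1 - DSLink1.Discount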
Expression format
The format of an expression is as follows:
KEY:
something_like_this is a token
something_in_italics is a terminal, i.e., it does not break down any further
| is a choice between tokens
[ ] is an optional part of the construction
"XXX" is a literal token (i.e., use XXX, not including the quotes)
Entering expressions
Whenever the insertion point is in an expression box, you can use the Expression Editor to suggest the
next element in your expression. Do this by right-clicking the box, or by clicking the Suggest button to
the right of the box. This opens the Suggest Operand or Suggest Operator menu. Which menu appears
depends on context, i.e., whether you should be entering an operand or an operator as the next
expression element.
You will be offered a different selection on the Suggest Operand menu depending on whether you are
defining key expressions, derivations and constraints, or a custom transform. The Suggest Operator
menu is always the same.
If you enter the name of an input link followed by a period, for example, DailySales., the Expression
Editor displays a list of the column names of that link. If you continue typing, the list selection changes
to match what you type. You can also select a column name using the mouse. Enter a selected column
name into the expression by pressing Tab or Enter. Press Esc to dismiss the list without selecting a
column name.
If there is an error, a message appears and the element causing the error is highlighted in the expression
box. You can either correct the expression or close the Transformer Editor or Transform dialog box.
Within the Transformer Editor, the invalid expressions are shown in red. (In some cases this may simply
mean that the expression does not meet the strict usage pattern rules of the WebSphere DataStage engine,
but will actually function correctly.)
The Expression Editor is configured by editing the Designer options. This allows you to specify how
″helpful″ the expression editor is. For more information, see WebSphere DataStage Designer Client Guide.
Stage page
The Stage page has four tabs:
v General. Allows you to enter an optional description of the stage and specify a before-stage and/or
after-stage subroutine.
v Variables. Allows you to set up stage variables for use in the stage.
v Link Ordering. Allows you to specify the order in which the output links will be processed.
v Advanced. Allows you to specify how the stage executes.
The General tab is described in ″Before-Stage and After-Stage Routines″ . The Variables tab is described in
″Defining Local Stage Variables″. The Link Ordering tab is described in ″Specifying Link Order″.
Advanced tab
The Advanced tab is the same as the Advanced tab of the generic stage editor as described in ″Advanced
Tab″. This tab allows you to specify the following:
v Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the data
is processed by the available nodes as specified in the Configuration file, and by any node constraints
specified on the Advanced tab. In sequential mode the data is processed by the conductor node.
v Combinability mode. This is Auto by default, which allows WebSphere DataStage to combine the
operators that underlie parallel stages so that they run in the same process if it is sensible for this type
of stage.
v Preserve partitioning. This is set to Propagate by default; this setting sets or clears the partitioning in
accordance with what the previous stage has set. You can also select Set or Clear. If you select Set, the
stage will request that the next stage preserves the partitioning as is.
v Node pool and resource constraints. Select this option to constrain parallel execution to the node pool
or pools and/or resource pool or pools specified in the grid. The grid allows you to make choices from
drop down lists populated from the Configuration file.
v Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node
map. You can define a node map by typing node numbers into the text box or by clicking the browse
button to open the Available Nodes dialog box and selecting nodes from there. You are effectively
defining a new node pool for this stage (in addition to any node pools defined in the Configuration
file).
Input page
The Input page allows you to specify details about data coming into the Transformer stage. The
Transformer stage can have only one input link.
The General tab allows you to specify an optional description of the input link.
The Partitioning tab allows you to specify how incoming data is partitioned. This is the same as the
Partitioning tab in the generic stage editor described in ″Partitioning Tab″.
Partitioning tab
The Partitioning tab allows you to specify details about how the incoming data is partitioned or collected
when input to the BASIC Transformer stage. It also allows you to specify that the data should be sorted
on input.
By default the BASIC Transformer stage will attempt to preserve partitioning of incoming data, or use its
own partitioning method according to what the previous stage in the job dictates.
If the BASIC Transformer stage is operating in sequential mode, it will first collect the data before writing
it to the file using the default collection method.
The Partitioning tab allows you to override this default behavior. The exact operation of this tab depends
on:
v Whether the BASIC Transformer stage is set to execute in parallel or sequential mode.
v Whether the preceding stage in the job is set to execute in parallel or sequential mode.
If the BASIC Transformer stage is set to execute in parallel, then you can set a partitioning method by
selecting from the Partitioning type drop-down list. This will override any current partitioning.
If the BASIC Transformer stage is set to execute in sequential mode, but the preceding stage is executing
in parallel, then you can set a collection method from the Collector type drop-down list. This will
override the default collection method.
The Partitioning tab also allows you to specify that data arriving on the input link should be sorted. The
sort is always carried out within data partitions. If the stage is partitioning incoming data the sort occurs
after the partitioning. If the stage is collecting data, the sort occurs before the collection. The availability
of sorting depends on the partitioning method chosen.
You can also specify sort direction, case sensitivity, whether sorted as ASCII or EBCDIC, and whether null
columns will appear first or last for each column. Where you are using a keyed partitioning method, you
can also specify whether the column is used as a key for sorting, for partitioning, or for both. Select the
column in the Selected list and right-click to invoke the shortcut menu.
Output page
The Output Page has a General tab which allows you to enter an optional description for each of the
output links on the BASIC Transformer stage. The Advanced tab allows you to change the default
buffering settings for the output links.
When you edit an Aggregator stage, the Aggregator stage editor appears. This is based on the generic
stage editor described in Chapter 3, “Stage editors,” on page 37.
The aggregator stage gives you access to grouping and summary operations. One of the easiest ways to
expose patterns in a collection of records is to group records with similar characteristics, then compute
statistics on all records in the group. You can then use these statistics to compare properties of the
different groups. For example, records containing cash register transactions might be grouped by the day
of the week to see which day had the largest number of transactions, the largest amount of revenue, etc.
Records can be grouped by one or more characteristics, where record characteristics correspond to
column values. In other words, a group is a set of records with the same value for one or more columns.
For example, transaction records might be grouped by both day of the week and by month. These
groupings might show that the busiest day of the week varies by season.
In addition to revealing patterns in your data, grouping can also reduce the volume of data by
summarizing the records in each group, making it easier to manage. If you group a large volume of data
on the basis of one or more characteristics of the data, the resulting data set is generally much smaller
than the original and is therefore easier to analyze using standard workstation or PC-based tools.
At a practical level, you should be aware that, in a parallel environment, the way that you partition data
before grouping and summarizing it can affect the results. For example, if you partitioned using the
round robin method, records with identical values in the column you are grouping on would end up in
different partitions. If you then performed a sum operation within these partitions you would not be
operating on all the relevant rows. In such circumstances you may want to key partition the data on
one or more of the grouping keys to ensure that your groups are entire.
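As a purely illustrative example, suppose the group key takes the value BUN on four rows whose
Distance values are 100, 200, 300, and 400. If these rows are round robin partitioned across two nodes,
each partition receives only two of them, so a per-partition sum produces two partial results (for example
400 and 600) rather than the group total of 1000. Hash partitioning on the group key places all four rows
in the same partition, so the sum for the group is complete.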
Example
The example data is from a freight carrier who charges customers based on distance, equipment, packing,
and license requirements. They need a report of distance traveled and charges grouped by date and
license type.
The stage first hash partitions the incoming data on the license column, then sorts it on license and date:
Ship Date License Distance Sum Distance Mean Charge Sum Charge Mean
...
2000-06-02 BUN 1126053.00 1563.93 20427400.00 28371.39
If you wanted to go on and work out the sum of the distance and charge sums by license, you could
insert another Aggregator stage that groups on the License column and sums the Distance Sum and
Charge Sum columns.
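One possible configuration for such a stage is sketched below; the output column names (Distance Total,
Charge Total) are assumptions for illustration rather than values taken from the example:
Grouping Keys/Group = License
Aggregations/Aggregation Type = Calculation
Aggregations/Column for Calculation = Distance Sum, with a dependent property specifying a sum
operation and an output column (for example Distance Total)
Aggregations/Column for Calculation = Charge Sum, with a dependent property specifying a sum
operation and an output column (for example Charge Total)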
Must do’s
WebSphere DataStage has many defaults which means that it can be very easy to include Aggregator
stages in a job. This section specifies the minimum steps to take to get an Aggregator stage functioning.
WebSphere DataStage provides a versatile user interface, and there are many shortcuts to achieving a
particular end. This section describes the basic method; you will learn where the shortcuts are as you
become familiar with the product.
Stage page
The General tab allows you to specify an optional description of the stage. The Properties tab lets you
specify what the stage does. The Advanced tab allows you to specify how the stage executes. The NLS
Locale tab appears if you have NLS enabled on your system. It allows you to select a locale other than
the project default to determine collating rules.
Properties tab
The Properties tab allows you to specify properties which determine what the stage actually does. Some
of the properties are mandatory, although many have default settings. Properties without default settings
appear in the warning color (red by default) and turn black when you supply a value for them.
The following list gives a quick reference to the properties and their attributes. A more detailed
description of each property follows.
v Grouping Keys/Group. Values: Input column. Default: N/A. Mandatory? Y. Repeats? Y. Dependent of: N/A.
v Grouping Keys/Case Sensitive. Values: True/False. Default: True. Mandatory? N. Repeats? N. Dependent of: Group.
v Aggregations/Aggregation Type. Values: Calculation/Recalculation/Count Rows. Default: Calculation. Mandatory? Y. Repeats? N. Dependent of: N/A.
v Aggregations/Column for Calculation. Values: Input column. Default: N/A. Mandatory? Y (if Aggregation Type = Calculation). Repeats? Y. Dependent of: N/A.
v Aggregations/Count Output Column. Values: Output column. Default: N/A. Mandatory? Y (if Aggregation Type = Count Rows). Repeats? Y. Dependent of: N/A.
v Aggregations/Summary Column for Recalculation. Values: Input column. Default: N/A. Mandatory? Y (if Aggregation Type = Recalculation). Repeats? Y. Dependent of: N/A.
v Aggregations/Default To Decimal Output. Values: precision, scale. Default: 8,2. Mandatory? N. Repeats? N. Dependent of: N/A.
v Aggregations/Corrected Sum of Squares. Values: Output column. Default: N/A. Mandatory? N. Repeats? N. Dependent of: Column for Calculation & Summary Column for Recalculation.
Specifies the input columns you are using as group keys. Repeat the property to select multiple columns
as group keys. You can use the Column Selection dialog box to select several group keys at once if
required. This property has a dependent property:
v Case Sensitive
Use this to specify whether each group key is case sensitive or not. This is set to True by default; that
is, the values ″CASE″ and ″case″ would end up in different groups.
Aggregations category
Aggregation type
This property allows you to specify the type of aggregation operation your stage is performing. Choose
from Calculate (the default), Recalculate, and Count Rows.
The Calculate aggregate type allows you to summarize the contents of a particular column or columns in
your input data set by applying one or more aggregate functions to it. Select the column to be
aggregated, then select dependent properties to specify the operation to perform on it, and the output
column to carry the result. You can use the Column Selection dialog box to select several columns for
calculation at once if required.
The Count Rows aggregate type performs a count of the number of records within each group. Specify
the column on which the count is output.
This aggregate type allows you to apply aggregate functions to a column that has already been
summarized. This is like calculate but performs the specified aggregate operation on a set of data that has
already been summarized. In practice this means you should have performed a calculate (or recalculate)
operation in a previous Aggregator stage with the Summary property set to produce a subrecord
containing the summary data that is then included with the data set. Select the column to be aggregated,
then select dependent properties to specify the operation to perform on it, and the output column to
carry the result. You can use the Column Selection dialog box to select several columns for recalculation
at once if required.
Weighting column
Configures the stage to increment the count for the group by the contents of the weight column for each
record in the group, instead of by 1. Not available for Summary Column for Recalculation. Setting this
option affects only the following options:
v Percent Coefficient of Variation
v Mean Value
v Sum
v Sum of Weights
v Uncorrected Sum of Squares
The output type of a calculation or recalculation column is double. Setting this property causes it to
default to decimal. You can also set a default precision and scale. (You can also specify that individual
columns have decimal output while others retain the default type of double.)
Options category
Method
The aggregate stage has two modes of operation: hash and sort. Your choice of mode depends primarily
on the number of groupings in the input data set, taking into account the amount of memory available.
You typically use hash mode for a relatively small number of groups; generally, fewer than about 1000
groups per megabyte of memory to be used.
When using hash mode, you should hash partition the input data set by one or more of the grouping key
columns so that all the records in the same group are in the same partition (this happens automatically if
auto is set in the Partitioning tab). However, hash partitioning is not mandatory; you can use any
partitioning method you choose if keeping groups together in a single partition is not important.
If the number of groups is large, which can happen if you specify many grouping keys, or if some
grouping keys can take on many values, you would normally use sort mode. However, sort mode
requires the input data set to have been partition sorted with all of the grouping keys specified as
hashing and sorting keys (this happens automatically if auto is set in the Partitioning tab). Sorting
requires a pregrouping operation: after sorting, all records in a given group in the same partition are
consecutive.
You may want to try both modes with your particular data and application to determine which gives the
better performance. You may find that when calculating statistics on large numbers of groups, sort mode
performs better than hash mode, assuming the input data set can be efficiently sorted before it is passed
to group.
Set this to True to indicate that null is a valid output value when calculating minimum value, maximum
value, mean value, standard deviation, standard error, sum, sum of weights, and variance. If False, the
null value will have 0 substituted when all input values for the calculation column are null. It is False by
default.
Advanced tab
This tab allows you to specify the following:
v Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the
input data set is processed by the available nodes as specified in the Configuration file, and by any
node constraints specified on the Advanced tab. In Sequential mode the entire data set is processed by
the conductor node.
v Combinability mode. This is Auto by default, which allows WebSphere DataStage to combine the
operators that underlie parallel stages so that they run in the same process if it is sensible for this type
of stage.
v Preserve partitioning. This is Set by default. You can select Set or Clear. If you select Set the stage
will request that the next stage in the job attempt to maintain the partitioning.
v Node pool and resource constraints. Select this option to constrain parallel execution to the node pool
or pools and/or resource pool or pools specified in the grid. The grid allows you to make choices from
drop down lists populated from the Configuration file.
v Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node
map. You can define a node map by typing node numbers into the text box or by clicking the browse
button to open the Available Nodes dialog box and selecting nodes from there. You are effectively
defining a new node pool for this stage (in addition to any node pools defined in the Configuration
file).
Input page
The Input page allows you to specify details about the incoming data set.
The General tab allows you to specify an optional description of the input link. The Partitioning tab
allows you to specify how incoming data is partitioned before being grouped and/or summarized. The
Columns tab specifies the column definitions of incoming data. The Advanced tab allows you to change
the default buffering settings for the input link.
Details about Aggregator stage partitioning are given in the following section. See ″Stage Editors,″ for a
general description of the other tabs.
Partitioning tab
The Partitioning tab allows you to specify details about how the incoming data is partitioned or collected
before it is grouped and/or summarized. It also allows you to specify that the data should be sorted
before being operated on.
By default the stage partitions in Auto mode. This attempts to work out the best partitioning method
depending on execution modes of current and preceding stages and how many nodes are specified in the
Configuration file.
If the Aggregator stage is operating in sequential mode, it will first collect the data before writing it to
the file using the default Auto collection method.
The Partitioning tab allows you to override this default behavior. The exact operation of this tab depends
on:
v Whether the Aggregator stage is set to execute in parallel or sequential mode.
v Whether the preceding stage in the job is set to execute in parallel or sequential mode.
If the Aggregator stage is set to execute in parallel, then you can set a partitioning method by selecting
from the Partition type drop-down list. This will override any current partitioning.
If the Aggregator stage is set to execute in sequential mode, but the preceding stage is executing in
parallel, then you can set a collection method from the Collector type drop-down list. This will override
the default collection method.
The Partitioning tab also allows you to specify that data arriving on the input link should be sorted
before being written to the file or files. The sort is always carried out within data partitions. If the stage is
partitioning incoming data the sort occurs after the partitioning. If the stage is collecting data, the sort
occurs before the collection. The availability of sorting depends on the partitioning or collecting method
chosen (it is not available for the default auto modes).
If NLS is enabled an additional button opens a dialog box allowing you to select a locale specifying the
collate convention for the sort.
You can also specify sort direction, case sensitivity, whether sorted as ASCII or EBCDIC, and whether null
columns will appear first or last for each column. Where you are using a keyed partitioning method, you
can also specify whether the column is used as a key for sorting, for partitioning, or for both. Select the
column in the Selected list and right-click to invoke the shortcut menu.
Output page
The Output page allows you to specify details about data output from the Aggregator stage. The
Aggregator stage can have only one output link.
The General tab allows you to specify an optional description of the output link. The Columns tab
specifies the column definitions of incoming data. The Mapping tab allows you to specify the relationship
between the processed data being produced by the Aggregator stage and the Output columns. The
Advanced tab allows you to change the default buffering settings for the output link.
Mapping tab
For the Aggregator stage the Mapping tab allows you to specify how the output columns are derived, i.e.,
what input columns map onto them or how they are generated.
The left pane shows the input columns and/or the generated columns. These are read only and cannot be
modified on this tab.
The right pane shows the output columns for each link. This has a Derivations field where you can
specify how the column is derived. You can fill it in by dragging columns over from the left pane, or by
using the Auto-match facility.
The three stages differ mainly in the memory they use, the treatment of rows with unmatched keys, and
their requirements for data being input (for example, whether it is sorted). See ″Join Versus Lookup″ for
help in deciding which stage to use.
In the Join stage, the input data sets are notionally identified as the ″right″ set and the ″left″ set, and
″intermediate″ sets. You can specify which is which. It has any number of input links and a single output
link.
The data sets input to the Join stage must be key partitioned and sorted. This ensures that rows with the
same key column values are located in the same partition and will be processed by the same node. It also
minimizes memory requirements because fewer rows need to be in memory at any one time. Choosing
the auto partitioning method will ensure that partitioning and sorting is done. If sorting and partitioning
are carried out on separate stages before the Join stage, WebSphere DataStage in auto mode will detect
this and not repartition (alternatively you could explicitly specify the Same partitioning method).
There are two data sets being combined. One is the primary or driving dataset, sometimes called the left
of the join. The other data set(s) are the reference datasets, or the right of the join.
In all cases we are concerned with the size of the reference datasets. If these take up a large amount of
memory relative to the physical RAM memory size of the computer you are running on, then a lookup
stage may thrash because the reference datasets may not fit in RAM along with everything else that has
to be in RAM. This results in very slow performance since each lookup operation can, and typically does,
cause a page fault and an I/O operation.
So, if the reference datasets are big enough to cause trouble, use a join. A join does a high-speed sort on
the driving and reference datasets. This can involve I/O if the data is big enough, but the I/O is all
highly optimized and sequential. Once the sort is over the join processing is very fast and never involves
paging or other I/O.
Example joins
The following examples show what happens to two data sets when each type of join operation is applied
to them. Here are the two data sets:
Inner join
Here is the data set that is output if you perform an inner join on the Price key column:
Must do’s
WebSphere DataStage has many defaults which means that Joins can be simple to set up. This section
specifies the minimum steps to take to get a Join stage functioning. WebSphere DataStage provides a
versatile user interface, and there are many shortcuts to achieving a particular end. This section describes
the basic method; you will learn where the shortcuts are as you become familiar with the product.
v In the Stage page Properties Tab specify the key column or columns that the join will be performed
on.
v In the Stage page Properties Tab specify the join type or accept the default of Inner.
v In the Stage page Link Ordering Tab, check that your links are correctly identified as ″left″, ″right″,
and ″intermediate″ and reorder if required.
v Ensure required column meta data has been specified (this may be done in another stage).
v In the Output Page Mapping Tab, specify how the columns from the input links map onto output
columns.
Properties tab
The Properties tab allows you to specify properties which determine what the stage actually does. Some
of the properties are mandatory, although many have default settings. Properties without default settings
appear in the warning color (red by default) and turn black when you supply a value for them.
The following list gives a quick reference to the properties and their attributes. A more detailed
description of each property follows.
v Join Keys/Key. Values: Input Column. Default: N/A. Mandatory? Y. Repeats? Y. Dependent of: N/A.
v Join Keys/Case Sensitive. Values: True/False. Default: True. Mandatory? N. Repeats? N. Dependent of: Key.
v Options/Join Type. Values: Full Outer/Inner/Left Outer/Right Outer. Default: Inner. Mandatory? Y. Repeats? N. Dependent of: N/A.
Choose the input column you want to join on. You are offered a choice of input columns common to all
links. For a join to work you must join on a column that appears in all input data sets, i.e. have the same
name and compatible data types. If, for example, you select a column called ″name″ from the left link, the
stage will expect there to be an equivalent column called ″name″ on the right link.
You can join on multiple key columns. To do so, repeat the Key property. You can use the Column
Selection dialog box to select several key columns at once if required.
The Key property has a dependent property:
v Case Sensitive
Use this to specify whether each key is case sensitive or not. This is set to True by default; that is, the
values ″CASE″ and ″case″ would not be judged equivalent.
Options category
Join type
Specify the type of join operation you want to perform. Choose one of:
v Full Outer
v Inner
v Left Outer
v Right Outer
Advanced tab
This tab allows you to specify the following:
v Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the
input data is processed by the available nodes as specified in the Configuration file, and by any node
constraints specified on the Advanced tab. In Sequential mode the entire data set is processed by the
conductor node.
v Combinability mode. This is Auto by default, which allows WebSphere DataStage to combine the
operators that underlie parallel stages so that they run in the same process if it is sensible for this type
of stage.
v Preserve partitioning. This is Propagate by default. It adopts the setting which results from ORing the
settings of the input stages, i.e., if either of the input stages uses Set then this stage will use Set. You
can explicitly select Set or Clear. Select Set to request that the next stage in the job attempts to
maintain the partitioning.
v Node pool and resource constraints. Select this option to constrain parallel execution to the node pool
or pools and/or resource pool or pools specified in the grid. The grid allows you to make choices from
drop down lists populated from the Configuration file.
v Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node
map. You can define a node map by typing node numbers into the text box or by clicking the browse
button to open the Available Nodes dialog box and selecting nodes from there. You are effectively
defining a new node pool for this stage (in addition to any node pools defined in the Configuration
file).
In the example, DSLink4 is the left link. Click it to select it, then click the down arrow to convert it into
the right link.
Input page
The Input page allows you to specify details about the incoming data sets. Choose an input link from the
Input name drop down list to specify which link you want to work on.
The General tab allows you to specify an optional description of the input link. The Partitioning tab
allows you to specify how incoming data is partitioned before being joined. The Columns tab specifies
the column definitions of incoming data. The Advanced tab allows you to change the default buffering
settings for the input link.
By default the stage partitions in Auto mode. This attempts to work out the best partitioning method
depending on execution modes of current and preceding stages and how many nodes are specified in the
Configuration file. Auto mode ensures that data being input to the Join stage is key partitioned and
sorted.
If the Join stage is operating in sequential mode, it will first collect the data before writing it to the file
using the default Auto collection method.
The Partitioning tab allows you to override this default behavior. The exact operation of this tab depends
on:
v Whether the Join stage is set to execute in parallel or sequential mode.
v Whether the preceding stage in the job is set to execute in parallel or sequential mode.
If the Join stage is set to execute in parallel, then you can set a partitioning method by selecting from the
Partition type drop-down list. This will override any current partitioning.
If the Join stage is set to execute in sequential mode, but the preceding stage is executing in parallel, then
you can set a collection method from the Collector type drop-down list. This will override the default
collection method.
The Partitioning tab also allows you to explicitly specify that data arriving on the input link should be
sorted before being joined (you might use this if you have selected a partitioning method other than auto
or same). The sort is always carried out within data partitions. If the stage is partitioning incoming data
the sort occurs after the partitioning. If the stage is collecting data, the sort occurs before the collection.
The availability of sorting depends on the partitioning or collecting method chosen (it is not available
with the default auto methods).
If NLS is enabled an additional button opens a dialog box allowing you to select a locale specifying the
collate convention for the sort.
You can also specify sort direction, case sensitivity, whether sorted as ASCII or EBCDIC, and whether null
columns will appear first or last for each column. Where you are using a keyed partitioning method, you
can also specify whether the column is used as a key for sorting, for partitioning, or for both. Select the
column in the Selected list and right-click to invoke the shortcut menu.
Output page
The Output page allows you to specify details about data output from the Join stage. The Join stage can
have only one output link.
The General tab allows you to specify an optional description of the output link. The Columns tab
specifies the column definitions of the data. The Mapping tab allows you to specify the relationship
between the columns being input to the Join stage and the Output columns. The Advanced tab allows
you to change the default buffering settings for the output link.
Details about Join stage mapping are given in the following section. See ″Stage Editors,″ for a general
description of the other tabs.
Mapping tab
For Join stages the Mapping tab allows you to specify how the output columns are derived, i.e., what
input columns map onto them.
The left pane shows the input columns from the links whose tables have been joined. These are read only
and cannot be modified on this tab.
The right pane shows the output columns for the output link. This has a Derivations field where you can
specify how the column is derived. You can fill it in by dragging input columns over, or by using the
Auto-match facility.
The Merge stage is one of three stages that join tables based on the values of key columns. The other two
are:
v Chapter 14, “Join stage,” on page 219
v Chapter 16, “Lookup Stage,” on page 235
The three stages differ mainly in the memory they use, the treatment of rows with unmatched keys, and
their requirements for data being input (for example, whether it is sorted).
The Merge stage combines a master data set with one or more update data sets. The columns from the
records in the master and update data sets are merged so that the output record contains all the columns
from the master record plus any additional columns from each update record that are required. A master
record and an update record are merged only if both of them have the same values for the merge key
column(s) that you specify. Merge key columns are one or more columns that exist in both the master
and update records.
The data sets input to the Merge stage must be key partitioned and sorted. This ensures that rows with
the same key column values are located in the same partition and will be processed by the same node. It
also minimizes memory requirements because fewer rows need to be in memory at any one time.
Choosing the auto partitioning method will ensure that partitioning and sorting is done. If sorting and
partitioning are carried out on separate stages before the Merge stage, WebSphere DataStage in auto mode
will detect this and not repartition (alternatively you could explicitly specify the Same partitioning
method).
As part of preprocessing your data for the Merge stage, you should also remove duplicate records from
the master data set. If you have more than one update data set, you must remove duplicate records from
the update data sets as well. See Chapter 19, “Remove Duplicates Stage,” on page 277 for information
about the Remove Duplicates stage.
Unlike Join stages and Lookup stages, the Merge stage allows you to specify several reject links. You can
route update link rows that fail to match a master row down a reject link that is specific for that link. You
must have the same number of reject links as you have update links. The Link Ordering tab on the Stage
page lets you specify which update links send rejected rows to which reject links. You can also specify
whether to drop unmatched master rows, or output them on the output data link.
Example merge
This example shows what happens to a master data set and two update data sets when they are merged.
The key field is Horse, and all the data sets are sorted in descending order. Here is the master data set:
Horse       Freeze mark  Mchip    Reg. Soc  Level  Last vacc.  Last worm  Last trim  Shoes
William     DAM7         N/A      FPS       Adv    07.07.02    12.10.02   11.05.02   N/A
Robin       DG36         N/A      FPS       Nov    07.07.02    12.10.02   12.03.02   refit
Kayser      N/A          N/A      AHS       Nov    11.12.02    12.10.02   11.05.02   N/A
Heathcliff  A1B1         N/A      N/A       Adv    07.07.02    12.10.02   12.03.02   new
Fairfax     N/A          N/A      FPS       N/A    11.12.02    12.10.02   12.03.02   N/A
Chaz        N/A          a2961da  AHS       Inter  10.02.02    12.10.02   12.03.02   new
Must do’s
WebSphere DataStage has many defaults which means that Merges can be simple to set up. This section
specifies the minimum steps to take to get a Merge stage functioning. WebSphere DataStage provides a
versatile user interface, and there are many shortcuts to achieving a particular end. This section describes
the basic method; you will learn where the shortcuts are as you become familiar with the product.
v In the Stage page Properties Tab specify the key column or columns that the Merge will be performed
on.
v In the Stage page Properties Tab set the Unmatched Masters Mode, Warn on Reject Updates, and Warn
on Unmatched Masters options or accept the defaults.
v In the Stage page Link Ordering Tab, check that your input links are correctly identified as ″master″
and ″update(s)″, and your output links are correctly identified as ″master″ and ″update reject″. Reorder
if required.
v Ensure required column meta data has been specified (this may be done in another stage, or may be
omitted altogether if you are relying on Runtime Column Propagation).
v In the Output Page Mapping Tab, specify how the columns from the input links map onto output
columns.
Stage page
The General tab allows you to specify an optional description of the stage. The Properties tab lets you
specify what the stage does. The Advanced tab allows you to specify how the stage executes. The NLS
Locale tab appears if you have NLS enabled on your system. It allows you to select a locale other than
the project default to determine collating rules.
Properties tab
The Properties tab allows you to specify properties that determine what the stage actually does. Some of
the properties are mandatory, although many have default settings. Properties without default settings
appear in the warning color (red by default) and turn black when you supply a value for them.
v Merge Keys/Key. Values: Input Column. Default: N/A. Mandatory? Y. Repeats? Y. Dependent of: N/A.
v Merge Keys/Sort Order. Values: Ascending/Descending. Default: Ascending. Mandatory? Y. Repeats? N. Dependent of: Key.
v Merge Keys/Nulls position. Values: First/Last. Default: First. Mandatory? N. Repeats? N. Dependent of: Key.
v Merge Keys/Sort as EBCDIC. Values: True/False. Default: False. Mandatory? N. Repeats? N. Dependent of: Key.
v Merge Keys/Case Sensitive. Values: True/False. Default: True. Mandatory? N. Repeats? N. Dependent of: Key.
v Options/Unmatched Masters Mode. Values: Keep/Drop. Default: Keep. Mandatory? Y. Repeats? N. Dependent of: N/A.
v Options/Warn On Reject Masters. Values: True/False. Default: True. Mandatory? Y. Repeats? N. Dependent of: N/A.
v Options/Warn On Reject Updates. Values: True/False. Default: True. Mandatory? Y. Repeats? N. Dependent of: N/A.
This specifies the key column you are merging on. Repeat the property to specify multiple keys. You can
use the Column Selection dialog box to select several keys at once if required. Key has the following
dependent properties:
v Sort Order
Choose Ascending or Descending. The default is Ascending.
v Nulls position
By default columns containing null values appear first in the merged data set. To override this default
so that columns containing null values appear last in the merged data set, select Last.
v Sort as EBCDIC
To sort as in the EBCDIC character set, choose True.
v Case Sensitive
Use this to specify whether each merge key is case sensitive or not. This is set to True by default; i.e.,
the values ″CASE″ and ″case″ would not be judged equivalent.
Options category
Unmatched Masters Mode
Set to Keep by default. It specifies that unmatched rows from the master link are output to the merged
data set. Set to Drop to specify that rejected records are dropped instead.
Warn On Reject Masters
Set to True by default. This will warn you when bad records from the master link are rejected. Set it to
False to receive no warnings.
Warn On Reject Updates
Set to True by default. This will warn you when bad records from any update links are rejected. Set it to
False to receive no warnings.
Advanced Tab
This tab allows you to specify the following:
v Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the
input data is processed by the available nodes as specified in the Configuration file, and by any node
constraints specified on the Advanced tab. In Sequential mode the entire data set is processed by the
conductor node.
v Combinability mode. This is Auto by default, which allows WebSphere DataStage to combine the
operators that underlie parallel stages so that they run in the same process if it is sensible for this type
of stage.
v Preserve partitioning. This is Propagate by default. It adopts the setting which results from ORing the
settings of the input stages, i.e., if any of the input stages uses Set then this stage will use Set. You can
explicitly select Set or Clear. Select Set to request the next stage in the job attempts to maintain the
partitioning.
v Node pool and resource constraints. Select this option to constrain parallel execution to the node pool
or pools and/or resource pool or pools specified in the grid. The grid allows you to make choices from
drop down lists populated from the Configuration file.
v Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node
map. You can define a node map by typing node numbers into the text box or by clicking the browse
button to open the Available Nodes dialog box and selecting nodes from there. You are effectively
defining a new node pool for this stage (in addition to any node pools defined in the Configuration
file).
By default the links will be processed in the order they were added. To rearrange them, choose an input
link and click the up arrow button or the down arrow button.
Input page
The Input page allows you to specify details about the data coming in to be merged. Choose an input
link from the Input name drop down list to specify which link you want to work on.
Details about Merge stage partitioning are given in the following section. See ″Stage Editors,″ for a
general description of the other tabs.
Partitioning tab
The Partitioning tab allows you to specify details about how the incoming data is partitioned or collected
before the merge is performed.
By default the stage uses the auto partitioning method. If the Preserve Partitioning option has been set on
the previous stage in the job, this stage will warn you if it cannot preserve the partitioning of the
incoming data. Auto mode ensures that data being input to the Merge stage is key partitioned and
sorted.
If the Merge stage is operating in sequential mode, it will first collect the data before writing it to the file
using the default auto collection method.
The Partitioning tab allows you to override this default behavior. The exact operation of this tab depends
on:
v Whether the Merge stage is set to execute in parallel or sequential mode.
v Whether the preceding stage in the job is set to execute in parallel or sequential mode.
If the Merge stage is set to execute in parallel, then you can set a partitioning method by selecting from
the Partition type drop-down list. This will override any current partitioning.
If the Merge stage is set to execute in sequential mode, but the preceding stage is executing in parallel,
then you can set a collection method from the Collector type drop-down list. This will override the
default auto collection method.
The Partitioning tab also allows you to specify that data arriving on the input link should be sorted
before the merge is performed (you might use this if you have selected a partitioning method other than
auto or same). The sort is always carried out within data partitions. If the stage is partitioning incoming
data the sort occurs after the partitioning. If the stage is collecting data, the sort occurs before the
collection. The availability of sorting depends on the partitioning or collecting method chosen (it is not
available with the default auto methods).
If NLS is enabled an additional button opens a dialog box allowing you to select a locale specifying the
collate convention for the sort.
You can also specify sort direction, case sensitivity, whether sorted as ASCII or EBCDIC, and whether null
columns will appear first or last for each column. Where you are using a keyed partitioning method, you
can also specify whether the column is used as a key for sorting, for partitioning, or for both. Select the
column in the Selected list and right-click to invoke the shortcut menu.
Output page
The Output page allows you to specify details about data output from the Merge stage. The Merge stage
can have only one master output link carrying the merged data and a number of reject links, each
carrying rejected records from one of the update links. Choose an output link from the Output name
drop down list to specify which link you want to work on.
The General tab allows you to specify an optional description of the output link. The Columns tab
specifies the column definitions of incoming data. The Mapping tab allows you to specify the relationship
between the columns being input to the Merge stage and the Output columns. The Advanced tab allows
you to change the default buffering settings for the output links.
Details about Merge stage mapping are given in the following section. See ″Stage Editors,″ for a general
description of the other tabs.
Reject links
You cannot change the properties of a Reject link. They have the meta data of the corresponding
incoming update link and this cannot be altered.
The left pane shows the columns of the merged data. These are read only and cannot be modified on this
tab. This shows the meta data from the master input link and any additional columns carried on the
update links.
The right pane shows the output columns for the master output link. This has a Derivations field where
you can specify how the column is derived. You can fill it in by dragging input columns over, or by
using the Auto-match facility.
In the above example the left pane represents the incoming data after the merge has been performed. The
right pane represents the data being output by the stage after the merge operation. In this example the
data has been mapped straight across.
The most common use for a lookup is to map short codes in the input data set onto expanded
information from a lookup table which is then joined to the incoming data and output. For example, you
could have an input data set carrying names and addresses of your U.S. customers. The data as presented
identifies state as a two letter U. S. state postal code, but you want the data to carry the full name of the
state. You could define a lookup table that carries a list of codes matched to states, defining the code as
the key column. As the Lookup stage reads each line, it uses the key to look up the state in the lookup
table. It adds the state to a new column defined for the output link, and so the full state name is added
to each address. If any state codes have been incorrectly entered in the data set, the code will not be
found in the lookup table, and so that record will be rejected.
Lookups can also be used for validation of a row. If there is no corresponding entry in a lookup table to
the key’s values, the row is rejected.
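The following sketch, written in Python rather than as a DataStage job, illustrates this behavior with a hypothetical input data set and a small hard-coded lookup table; rows whose code has no entry in the table go to a reject set:

# Conceptual sketch only: a key-based lookup with a reject set.
# The column names and data values are illustrative, not from a real job.
state_lookup = {"CA": "California", "NY": "New York", "TX": "Texas"}

input_rows = [
    {"name": "A. Smith", "state_code": "CA"},
    {"name": "B. Jones", "state_code": "XX"},   # incorrectly entered code: no match
]

output_rows, reject_rows = [], []
for row in input_rows:
    state_name = state_lookup.get(row["state_code"])
    if state_name is None:
        reject_rows.append(row)                  # no entry in the lookup table
    else:
        output_rows.append({**row, "state_name": state_name})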
The Lookup stage is one of three stages that join tables based on the values of key columns. The other
two are:
v Join stage - Chapter 14, “Join stage,” on page 219
v Merge stage - Chapter 15, “Merge Stage,” on page 227
The three stages differ mainly in the memory they use, the treatment of rows with unmatched keys, and
their requirements for data being input (for example, whether it is sorted). See ″Lookup Versus Join″ for
help in deciding which stage to use.
The Lookup stage can have a reference link, a single input link, a single output link, and a single rejects
link. Depending upon the type and setting of the stage(s) providing the look up information, it can have
multiple reference links (where it is directly looking up a DB2 table or Oracle table, it can only have a
single reference link). A lot of the setting up of a lookup operation takes place on the stage providing the
lookup table.
The input link carries the data from the source data set and is known as the primary link. The following
pictures show some example jobs performing lookups.
For each record of the source data set from the primary link, the Lookup stage performs a table lookup
on each of the lookup tables attached by reference links. The table lookup is based on the values of a set
of lookup key columns, one set for each table. The keys are defined on the Lookup stage. For lookups of
data accessed through the Lookup File Set stage, the keys are specified when you create the look up file
set.
You can specify a condition on each of the reference links, such that the stage will only perform a lookup
on that reference link if the condition is satisfied.
Lookup stages do not require data on the input link or reference links to be sorted. Be aware, though,
that large in-memory lookup tables will degrade performance because of their paging requirements.
The optional reject link carries source records that do not have a corresponding entry in the input lookup
tables.
You can also perform a range lookup, which compares the value of a source column to a range of values
between two lookup table columns. If the source column value falls within the required range, a row is
passed to the output link. Alternatively, you can compare the value of a lookup column to a range of
values between two source columns. Range lookups must be based on column values, not constant
values. Multiple ranges are supported.
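The following Python sketch suggests how a range lookup behaves; the band table, column names, and values are hypothetical and are not taken from a DataStage job:

# Conceptual sketch only: a range lookup compares a source value to a range
# held in two lookup-table columns. All names and values are illustrative.
rate_bands = [
    {"low": 0,     "high": 9999,   "rate": 0.5},
    {"low": 10000, "high": 49999,  "rate": 1.0},
    {"low": 50000, "high": 999999, "rate": 1.5},
]

def range_lookup(balance):
    """Return the first band whose [low, high] range contains the balance."""
    for band in rate_bands:
        if band["low"] <= balance <= band["high"]:
            return band
    return None                                   # unmatched: row could be rejected

row = {"customer": "GC23093", "balance": 12500}
band = range_lookup(row["balance"])
output_row = {**row, "rate": band["rate"]} if band else None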
There are some special partitioning considerations for Lookup stages. You need to ensure that the data
being looked up in the lookup table is in the same partition as the input data referencing it. One way of
doing this is to partition the lookup tables using the Entire method. Another way is to partition it in the
same way as the input data (although this implies sorting of the data).
Unlike most of the other stages in a Parallel job, the Lookup stage has its own user interface. It does not
use the generic interface as described in Chapter 3, “Stage editors,” on page 37.
When you edit a Lookup stage, the Lookup Editor appears. An example Lookup stage is shown below.
The left pane represents input data and lookup data, and the right pane represents output data. In this
example, the Lookup stage has a primary link and single reference link, and a single output link. Meta
data has been defined for all links.
There are two data sets being combined. One is the primary or driving data set, sometimes called the left
of the join. The other data set(s) are the reference data sets, or the right of the join.
In all cases the size of the reference data sets is a concern. If these take up a large amount of memory
relative to the physical RAM memory size of the computer you are running on, then a Lookup stage
might thrash because the reference data sets might not fit in RAM along with everything else that has to
be in RAM. This results in very slow performance since each lookup operation can, and typically does,
cause a page fault and an I/O operation.
So, if the reference data sets are big enough to cause trouble, use a join. A join does a high-speed sort on
the driving and reference data sets. This can involve I/O if the data is big enough, but the I/O is all
highly optimized and sequential. After the sort is over, the join processing is very fast and never involves
paging or other I/O.
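As a rough illustration of the difference, the following Python sketch contrasts an in-memory lookup (one probe per driving row) with a sort-merge join (sort both inputs, then a single sequential pass); the data is hypothetical and the merge assumes unique keys on the reference side:

# Conceptual sketch only, not DataStage internals.

# Lookup: the reference data is held in memory and probed once per driving row.
reference = {"A": 1, "B": 2}
driving = ["A", "B", "A"]
lookup_result = [(key, reference.get(key)) for key in driving]

# Join: both inputs are sorted first, then merged in one sequential pass.
left = sorted([("B", "x"), ("A", "y")])
right = sorted([("A", 1), ("B", 2)])              # assumed unique join keys
join_result, i, j = [], 0, 0
while i < len(left) and j < len(right):
    if left[i][0] == right[j][0]:
        join_result.append((left[i][0], left[i][1], right[j][1]))
        i += 1
    elif left[i][0] < right[j][0]:
        i += 1
    else:
        j += 1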
Example Look Up
This example shows what happens when data is looked up in a lookup table. The stage in this case will
look up the interest rate for each customer based on the account type. Here is the data that arrives on the
primary link:
The accounts data set holds the details of customers and their account types, the interest rates are held in
an Oracle table. The lookup stage is set as follows:
All the columns in the accounts data set are mapped over to the output link. The AccountType column in
the accounts data set has been joined to the AccountType column of the interest_rates table. For each row,
the AccountType is looked up in the interest_rates table and the corresponding interest rate is returned.
The reference link has a condition on it. This detects if the balance is null in any of the rows of the
accounts data set. If the balance is null the row is sent to the rejects link (the rejects link does not appear
in the lookup editor because there is nothing you can change).
Must Do’s
WebSphere DataStage has many defaults, which means that lookups can be simple to set up. This section
specifies the minimum steps to take to get a Lookup stage functioning. WebSphere DataStage provides a
versatile user interface, and there are many shortcuts to achieving a particular end; this section describes
the basic method. You will learn where the shortcuts are when you get familiar with the product.
If you want to impose conditions on your lookup, or want to use a reject link, you need to double-click
on the Condition header of a reference link, choose Conditions from the link shortcut menu, or click the
Condition toolbar button. The Lookup Stage Conditions dialog box appears. This allows you to:
v Specify that one of the reference links is allowed to return multiple rows when performing a lookup
without causing an error (choose the relevant reference link from the Multiple rows returned from
link list).
v Specify a condition for the required references. Double-click the Condition box (or press CTRL-E) to
open the expression editor. This expression can access all the columns of the primary link, plus
columns in reference links that are processed before this link.
v Specify what happens if the condition is not met on each link.
v Specify what happens if a lookup fails on each link.
If you want to specify a range lookup, select the Range check box on the stream link or select Range
from the Key Type list on the reference link. Then double-click the Key Expression field to open the
Range dialog box, where you can build the range expression.
Next you need to open the Stage Properties dialog box for the Lookup stage. Do this by choosing the
Stage Properties icon from the stage editor toolbar, or by choosing Stage Properties or Link Properties
from the stage editor shortcut menu (choosing Link Properties will open the dialog box with the link you
are looking at selected, otherwise you might need to choose the correct link from the Input name or
Output name list).
See the Connectivity Guides for these two databases for more details.
If you want to impose conditions on your lookup, or want to use a reject link, you need to double-click
on the Condition header, choose Conditions from the link shortcut menu, or click the Condition toolbar
icon. The Lookup Stage Conditions dialog box appears. This allows you to:
v Specify what happens if a lookup fails on this link.
If you want to specify a range lookup, select the Range check box on the stream link or select Range
from the Key Type list on the reference link. Then double-click the Key Expression field to open the
Range dialog box, where you can build the range expression.
Next you need to open the Stage Properties dialog box for the Lookup stage. Do this by choosing the
Stage Properties icon from the stage editor toolbar, or by choosing Stage Properties or Link Properties
from the stage editor shortcut menu (choosing Link Properties will open the dialog box with the link you
are looking at selected, otherwise you might need to choose the correct link from the Input name or
Output name list).
v In the Stage page Link Ordering Tab, check that your links are correctly identified as ″primary″ and
″lookup(s)″, and reorder if required.
v Unless you have particular partitioning requirements, leave the default auto setting on the Input Page
Partitioning Tab.
See Chapter 7, “Lookup file set stage,” on page 121 for details about the Lookup File Set stage.
As you are using a lookup file set this is all the mapping you need to do; the key column or columns for
the lookup are defined when you create the lookup file set.
If you are performing a range lookup, select the Range check box on the stream link or select Range from
the Key Type list on the reference link. Then double-click the Key Expression field to open the Range
dialog box, where you can build the range expression.
Next you need to open the Stage Properties dialog box for the Lookup stage. Do this by choosing the
Stage Properties icon from the stage editor toolbar, or by choosing Stage Properties or Link Properties
from the stage editor shortcut menu (choosing Link Properties will open the dialog with the link you are
looking at selected, otherwise you might need to choose the correct link from the Input name or Output
name list).
v In the Stage page Link Ordering Tab, check that your links are correctly identified as ″primary″ and
″lookup(s)″, and reorder if required.
v Unless you have particular partitioning requirements, leave the default auto setting on the Input Page
Partitioning Tab.
Toolbar
The Lookup toolbar contains the following buttons (from left to right):
v Stage properties
Link Area
The top area displays links to and from the Lookup stage, showing their columns and the relationships
between them.
The link area is divided into two panes; you can drag the splitter bar between them to resize the panes
relative to one another. There is also a horizontal scroll bar, allowing you to scroll the view left or right.
The left pane shows the input link, the right pane shows output links. Output columns that have an
invalid derivation defined are shown in red. Reference link input key columns with invalid key
expressions are also shown in red.
Within the Lookup Editor, a single link can be selected at any one time. When selected, the link’s title bar
is highlighted, and arrowheads indicate any selected columns within that link.
The meta data for each link is shown in a grid contained within a tabbed page. Click the tab to bring the
required link to the front. That link is also selected in the link area.
If you select a link in the link area, its meta data tab is brought to the front automatically.
You can edit the grids to change the column meta data on any of the links. You can also add and delete
meta data.
As with column meta data grids on other stage editors, Edit Row in the shortcut menu allows editing of
the full meta data definitions (see ″Columns Tab″).
Shortcut Menus
The Lookup Editor shortcut menus are displayed by right-clicking the links in the links area.
There are slightly different menus, depending on whether you right-click an input link, or an output link.
The input link menu offers you operations on input columns, the output link menu offers you operations
on output columns and their derivations.
If you display the menu from the links area background, you can:
v Open the Stage Properties dialog box in order to specify stage or link properties.
v Open the Lookup Stage Conditions dialog box to specify a conditional lookup.
v Open the Link Execution Order dialog box in order to specify the order in which links should be
processed.
v Toggle between viewing link relations for all links, or for the selected link only.
Right-clicking in the meta data area of the Lookup Editor opens the standard grid editing shortcut
menus.
You can drag and drop the full column set by dragging the link title.
The Find and Replace dialog box appears. It has three tabs:
v Expression Text. Allows you to locate the occurrence of a particular string within an expression, and
replace it if required. You can search up or down, and choose to match case, match whole words, or
neither. You can also choose to replace all occurrences of the string within an expression.
v Column Names. Allows you to find a particular column and rename it if required. You can search up
or down, and choose to match case, match the whole word, or neither.
v Expression Types. Allows you to find the next empty expression or the next expression that contains
an error. You can also press Ctrl-M to find the next empty expression or Ctrl-N to find the next
erroneous expression.
Note: The find and replace results are shown in the color specified in Tools → Options.
Press F3 to repeat the last search you made without opening the Find and Replace dialog box.
Select Facilities
If you are working on a complex job where several links, each containing several columns, go in and out
of the Lookup stage, you can use the select column facility to select multiple columns.
To use the select facilities, choose Select from the link shortcut menu. The Select dialog box appears. It
has three tabs:
v Expression Text. The Expression Text tab allows you to select all columns/stage variables whose
expressions contain text that matches the text specified. The text specified is a simple text match, taking
into account the Match case setting.
v Column Names. The Column Names tab allows you to select all column/stage variables whose Name
contains the text specified. There is an additional Data Type list, that will limit the columns selected to
those with that data type. You can use the Data Type list on its own to select all columns of a certain
data type. For example, all string columns can be selected by leaving the text field blank, and selecting
String as the data type. The data types in the list are generic data types, where each of the column SQL
data types belong to one of these generic types.
v Expression Types. The Expression Types tab allows you to select all columns with either empty
expressions or invalid expressions.
When copying columns, a new column is created with the same meta data as the column it was copied
from.
To delete a column from within the Lookup Editor, select the column you want to delete and click the cut
button or choose Delete Column from the shortcut menu.
The meta data shown does not include column derivations since these are edited in the links area.
If a derivation is displayed in red (or the color defined in Tools → Options), it means that the Lookup
Editor considers it incorrect. To see why it is invalid, choose Validate Derivation from the shortcut menu.
Note: Auto-matching does not take into account any data type incompatibility between matched
columns; the derivations are set regardless.
The key expression is an equijoin from a primary input link column. You can specify it in two ways:
v Use drag and drop to drag a primary input link column to the appropriate key expression cell.
v Use copy and paste to copy a primary input link column and paste it on the appropriate key
expression cell.
A relationship link is drawn between the primary input link column and the key expression.
You can also use drag and drop or copy and paste to copy an existing key expression to another input
column, and you can drag or copy multiple selections.
If a key expression is displayed in red (or the color defined in Tools → Options), it means that the
Transformer Editor considers it incorrect. To see why it is invalid, choose Validate Derivation from the
shortcut menu.
Stage Page
The General tab allows you to specify an optional description of the stage. The Advanced tab allows you
to specify how the stage executes. The Link Ordering tab allows you to specify which order the input
links are processed in. The NLS Locale tab appears if you have NLS enabled on your system. It allows
you to select a locale other than the project default to determine collating rules. The Build tab allows you
to override the default compiler and linker flags for this particular stage.
Advanced Tab
This tab allows you to specify the following:
v Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the
input data is processed by the available nodes as specified in the Configuration file, and by any node
constraints specified on the Advanced tab. In Sequential mode the entire data set is processed by the
conductor node.
v Combinability mode. This is Auto by default, which allows WebSphere DataStage to combine the
operators that underlie parallel stages so that they run in the same process if it is sensible for this type
of stage.
v Preserve partitioning. This is Propagate by default. It adopts the setting of the previous stage on the
stream link. You can explicitly select Set or Clear. Select Set to request that the next stage in the job
attempt to maintain the partitioning.
v Node pool and resource constraints. Select this option to constrain parallel execution to the node pool
or pools and/or resource pool or pools specified in the grid. The grid allows you to make choices from
lists populated from the Configuration file.
v Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node
map. You can define a node map by typing node numbers into the text box or by clicking the browse
button to open the Available Nodes dialog box and selecting nodes from there. You are effectively
defining a new node pool for this stage (in addition to any node pools defined in the Configuration
file).
By default the input links will be processed in the order they were added. To rearrange them, choose an
input link and click the up arrow button or the down arrow button.
You can also access this tab by clicking the input link order button in the toolbar, or by choosing Reorder
input links from the shortcut menu.
Input Page
The Input page allows you to specify details about the incoming data set and the reference links. Choose
a link from the Input name list to specify which link you want to work on.
The General tab allows you to specify an optional description of the link. When you are performing an
in-memory lookup, the General tab has two additional fields:
v Save to lookup fileset. Allows you to specify a lookup file set to save the look up data.
v Diskpool. Specify the name of the disk pool into which to write the file set. You can also specify a job
parameter.
The Partitioning tab allows you to specify how incoming data on the source data set link is partitioned.
The Advanced tab allows you to change the default buffering settings for the input link.
Details about Lookup stage partitioning are given in the following section. See Chapter 3, “Stage editors,”
on page 37 for a general description of the other tabs.
Partitioning Tab
The Partitioning tab allows you to specify details about how the incoming data is partitioned or collected
before the lookup is performed. It also allows you to specify that the data should be sorted before the
lookup. Note that you cannot specify partitioning or sorting on the reference links, this is specified in
their source stage.
By default the stage uses the auto partitioning method. If the Preserve Partitioning option has been set on
the previous stage in the job the stage will warn you when the job runs if it cannot preserve the
partitioning of the incoming data.
If the Lookup stage is operating in sequential mode, it will first collect the data before writing it to the
file using the default auto collection method.
The Partitioning tab allows you to override this default behavior. The exact operation of this tab depends
on:
v Whether the Lookup stage is set to execute in parallel or sequential mode.
v Whether the preceding stage in the job is set to execute in parallel or sequential mode.
If the Lookup stage is set to execute in parallel, then you can set a partitioning method by selecting from
the Partition type list. This will override any current partitioning.
You might need to ensure that your lookup tables have been partitioned using the Entire method, so that
the lookup tables will always contain the full set of data that might need to be looked up. For lookup
files and lookup tables being looked up in databases, the partitioning is performed on those stages.
If the Lookup stage is set to execute in sequential mode, but the preceding stage is executing in parallel,
then you can set a collection method from the Collector type list. This will override the default auto
collection method.
The Partitioning tab also allows you to specify that data arriving on the input link should be sorted
before the lookup is performed. The sort is always carried out within data partitions. If the stage is
partitioning incoming data, the sort occurs after the partitioning. If the stage is collecting data, the sort
occurs before the collection. The availability of sorting depends on the partitioning or collecting method
chosen (it is not available for the default auto methods).
If NLS is enabled an additional button opens a dialog box allowing you to select a locale specifying the
collate convention for the sort.
You can also specify sort direction, case sensitivity, whether sorted as ASCII or EBCDIC, and whether null
columns will appear first or last for each column. Where you are using a keyed partitioning method, you
can also specify whether the column is used as a key for sorting, for partitioning, or for both. Select the
column in the Selected list and right-click to invoke the shortcut menu.
The General tab allows you to specify an optional description of the output link. The Advanced tab
allows you to change the default buffering settings for the output links.
Reject Links
You cannot set the mapping or edit the column definitions for a reject link. The link uses the column
definitions for the primary input link.
You can open the Lookup Stage Conditions dialog box by:
v Double-clicking the Condition: bar on a reference link.
v Selecting Conditions from the background shortcut menu.
v Clicking the Conditions toolbar button.
v Selecting Conditions from the link shortcut menu.
Range lookups
You can define a range lookup on the stream link or a reference link of a Lookup stage. On the stream
link, the lookup compares the value of a source column to a range of values between two lookup
columns. On the reference link, the lookup compares the value of a lookup column to a range of values
between two source columns. Multiple ranges are supported.
Expression Format
The format of an expression is as follows:
KEY:
something_like_this is a token
something_in_italics is a terminal, i.e., doesn’t break down any
further
| is a choice between tokens
[ is an optional part of the construction
"XXX" is a literal token (i.e., use XXX not
including the quotes)
Entering Expressions
Whenever the insertion point is in an expression box, you can use the Expression Editor to suggest the
next element in your expression. Do this by right-clicking the box, or by clicking the Suggest button to
the right of the box. This opens the Suggest Operand or Suggest Operator menu. Which menu appears
depends on context, i.e., whether you should be entering an operand or an operator as the next
expression element. The Functions available from this menu are described in Appendix B. The DS macros
are described in WebSphere DataStage Parallel Job Advanced Developer Guide. You can also specify custom
routines for use in the expression editor (see WebSphere DataStage Designer Client Guide).
If there is an error, a message appears and the element causing the error is highlighted in the expression
box. You can either correct the expression or close the Lookup Editor or Lookup dialog box.
For any expression, selecting Validate from its shortcut menu will also validate it and show any errors in
a message box.
The Expression Editor is configured by editing the Designer options. This allows you to specify how
″helpful″ the expression editor is. For more information, see WebSphere DataStage Designer Client Guide.
You specify sorting keys as the criteria on which to perform the sort. A key is a column on which to sort
the data, for example, if you had a name column you might specify that as the sort key to produce an
alphabetical list of names. The first column you specify as a key to the stage is the primary key, but you
can specify additional secondary keys. If multiple rows have the same value for the primary key column,
then WebSphere DataStage uses the secondary columns to sort these rows.
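As a simple illustration of primary and secondary sort keys, the following Python sketch (with hypothetical column names) sorts on a zip code column and uses a last name column only to order rows whose zip codes are equal:

# Conceptual sketch only: primary key "zip", secondary key "last_name".
rows = [
    {"zip": "30301", "last_name": "Smith"},
    {"zip": "10001", "last_name": "Jones"},
    {"zip": "30301", "last_name": "Adams"},
]

# The secondary key is consulted only when primary key values are equal.
rows.sort(key=lambda r: (r["zip"], r["last_name"]))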
You can sort in sequential mode to sort an entire data set or in parallel mode to sort data within
partitions, as shown below:
[Figure: a Sort stage running in parallel; the names in each partition (for example Tom, Dick, Harry,
Jack) are sorted within that partition (Dick, Harry, Jack, Tom).]
The stage uses temporary disk space when performing a sort. It looks in the following locations, in the
following order, for this temporary space.
1. Scratch disks in the disk pool sort (you can create these pools in the configuration file).
2. Scratch disks in the default disk pool (scratch disks are included here by default).
3. The directory specified by the TMPDIR environment variable.
4. The directory /tmp.
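A rough Python sketch of that fallback order is shown below; the scratch disk lists are placeholders for whatever the parallel configuration file actually defines:

import os

# Conceptual sketch only: pick the first usable temporary location in the
# order described above. The pool contents here are hypothetical.
sort_pool_scratch_disks = ["/scratch/sort"]        # scratch disks in the "sort" disk pool
default_pool_scratch_disks = ["/scratch/default"]  # scratch disks in the default disk pool

def pick_scratch_dir():
    candidates = (sort_pool_scratch_disks
                  + default_pool_scratch_disks
                  + ([os.environ["TMPDIR"]] if "TMPDIR" in os.environ else [])
                  + ["/tmp"])
    for candidate in candidates:
        if os.path.isdir(candidate):
            return candidate
    return None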
You may perform a sort for several reasons. For example, you may want to sort a data set by a zip code
column, then by last name within the zip code. Once you have sorted the data set, you can filter the data
set by comparing adjacent records and removing any duplicates.
However, you must be careful when processing a sorted data set: many types of processing, such as
repartitioning, can destroy the sort order of the data. For example, assume you sort a data set on a
system with four processing nodes and store the results to a data set stage. The data set will therefore
have four partitions. You then use that data set as input to a stage executing on a different number of
nodes, possibly due to node constraints. WebSphere DataStage automatically repartitions a data set to
spread out the data set to all nodes in the system, unless you tell it not to, possibly destroying the sort
order.
You must also be careful when using a stage operating sequentially to process a sorted data set. A
sequential stage executes on a single processing node to perform its action. Sequential stages will collect
the data where the data set has more than one partition, which may also destroy the sorting order of its
input data set. You can overcome this if you specify the collection method as follows:
v If the data was range partitioned before being sorted, you should use the ordered collection method to
preserve the sort order of the data set. Using this collection method causes all the records from the first
partition of a data set to be read first, then all records from the second partition, and so on.
v If the data was hash partitioned before being sorted, you should use the sort merge collection method
specifying the same collection keys as the data was partitioned on.
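The following Python sketch suggests the difference between the two collection methods; the partitions and values are illustrative, and each partition is assumed to be already sorted on the collection key:

import heapq

# Conceptual sketch only.
partitions = [
    ["Adams", "Baker"],        # partition 0 (already sorted)
    ["Carter", "Davis"],       # partition 1 (already sorted)
]

# Ordered collection: read all of partition 0, then all of partition 1, and so
# on; this preserves a total order only if the data was range partitioned.
ordered = [row for partition in partitions for row in partition]

# Sort merge collection: merge the partitions on the sort key, so the output
# stays sorted even if rows were distributed by, for example, hash partitioning.
sort_merged = list(heapq.merge(*partitions))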
Note: If you write a sorted data set to an RDBMS there is no guarantee that it will be read back in the
same order unless you specifically structure the SQL query to ensure this.
By default the stage will sort with the native WebSphere DataStage sorter, but you can also specify that it
uses the UNIX sort command.
Examples
Sequential Sort
This job sorts the contents of a sequential file, and writes it to a data set. The data is a list of GlobalCo
customers sorted by customer number. We are going to sort it by customer name instead.
The Sequential File stage runs sequentially because it only has one source file to read. The Sort stage is
set to run sequentially on the Stage page Advanced tab. The sort stage properties are used to specify the
column CUST_NAME as the primary sort key:
When the job is run the data is sorted into a single partition. The Data Set stage, GlobalCo_sorted, is set
to run sequentially to write the data to a single partition. Here is a sample of the sorted data:
"ATLANTIC SCIENTIFIC CORPORATIO","PO BOX 106023","GC27169","GlobalCoUS"
"ATLANTIC UNDERGROUND CONST","11192 60TH ST NORTH","GC23719","GlobalCoUS"
"ATLANTIC UNDERGROUND UTILITIES","203 EDSEL DR","GC23785","GlobalCoUS"
"ATLANTIC WATER CO","PO BOX 30806","GC23993","GlobalCoUS"
"ATLANTIS POOL & SPA","2323 NEPTUNE RD","GC23413","GlobalCoUS"
"ATLAS BOLT & SCREW COMPANY","PO BOX 96113","GC27196","GlobalCoUS"
"ATLAS ENGINEERING","PO BOX 140","GC23093","GlobalCoUS"
"ATLAS FOUNDATION REPAIR","2835 FAY ST","GC23110","GlobalCoUS"
"ATLAS REFRIGERATION & AC","1808 S ERVAY ST","GC23128","GlobalCoUS"
"ATLAS SCREW & SPECIALTY","PO BOX 41389","GC27225","GlobalCoUS"
"ATLAS UTILITY SUPPLY","2301 CARSON STREET","GC23182","GlobalCoUS"
"ATLAS UTILITY SUPPLY COMPANY","1206 PROCTOR ST","GC23200","GlobalCoUS"
"ATMEL Wireless and Microcontro","RÖSSLERSTRASSE 94 ","GC21356","GlobalCoDE"
"ATMOS ENERGY","PO BOX 9001949","GC23912","GlobalCoUS"
"ATNIP DESIGN & SUPPLY CENTER I","200 INDUSTRIAL DR","GC23215","GlobalCoUS"
"ATS","4901 E GRIMES","GC28576","GlobalCoUS"
"ATTALA WATER WORKS BD","509 4TH ST NW","GC23247","GlobalCoUS"
"ATTENTION HOME","","GC15998","GlobalCoUS"
"ATTN: C.L. MAR","","GC28189","GlobalCoUS"
Parallel sort
In the Sort stage we specify parallel execution in the Stage page Advanced tab. In the Input page
Partitioning tab we specify a partitioning type of Hash, and specify the column SOURCE as the hash key.
Because the partitioning takes place on the input link, the data is partitioned before the sort stage actually
tries to sort it. We hash partition to ensure that customers from the same country end up in the same
partition. The data is then sorted within those partitions.
The job is run on a four-node system, so we end up with a data set comprising four partitions.
The following is a sample of the data in partition 2 after partitioning, but before sorting:
"GTM construction ","Avenue De Lotz Et Cosse Ile Be","GC20115","GlobalCoFR"
"Laboratoires BOIRON ","14 Rue De La Ferme","GC20124","GlobalCoFR"
"Sofinco (Banque du Groupe Suez","113 Boulevard De Paris","GC20132","GlobalCoFR"
"Dingremont ","1 Avenue Charles Guillain","GC20140","GlobalCoFR"
"George Agenal - Tailleur de Pi","105 Rue La Fayette","GC20148","GlobalCoFR"
"Le Groupe Shell en France ","33-37 Rue Du Moulin Des Bruyèr","GC20156","GlobalCoFR"
"Bottin ","24 Rue Henri Barbusse","GC20165","GlobalCoFR"
"Crédit du Nord ","9 Rue Du Besly","GC20175","GlobalCoFR"
"LNET Multimedia ","4 Quai Du Point Du Jour","GC20184","GlobalCoFR"
"Merisel ","7 Allée Des Platanes","GC20192","GlobalCoFR"
"Groupe SEB ","100 Rue de Veeweyde","GC20200","GlobalCoFR"
"CSTB : Centre Scientifique et ","40 Rue Gabriel Crie","GC20208","GlobalCoFR"
"GTI-Informatique ","4 Avenue Montaigne","GC20219","GlobalCoFR"
"BIOTEC Biologie appliquée SA -","Chaussée Louise 143","GC20227","GlobalCoFR"
"Rank Xerox France ","4 Rue Jean-Pierre Timbaud","GC20236","GlobalCoFR"
"BHV : Bazar de l’Hotel de Vill","168 Bis Rue Cardinet","GC20246","GlobalCoFR"
"CerChimie ","Domaine De Voluceau","GC20254","GlobalCoFR"
"Renault S.A. [unofficial] ","11 Rue De Cambrai","GC20264","GlobalCoFR"
"Alcatel Communication","21-23 Rue D’Astorg","GC20272","GlobalCoFR"
And here is a sample of the data in partition 2 after it has been processed by the sort stage:
"Groupe BEC : Des entrepreneurs","91-95 Avenue François Arago","GC20900","GlobalCoFR"
"Groupe Courtaud ","2 Avenue Pierre Marzin","GC20692","GlobalCoFR"
"Groupe Courtaud ","99 Avenue Pierre Semard","GC20929","GlobalCoFR"
"Groupe FAYAT:Groupe Industriel","Allee Des Acacias","GC20020","GlobalCoFR"
"Groupe FAYAT:Groupe Industriel","Allee Des Acacias","GC20091","GlobalCoFR"
If we want to bring the data back together into a single partition, for example to write to another
sequential file, we need to be mindful of how it is collected, or we will lose the benefit of the sort. If we
use the sort/merge collection method, specifying the CUST_NAME column as the collection key, we will
end up with a totally sorted data set.
Total sort
You can also perform a total sort on a parallel data set, such that the data is ordered within each partition
and the partitions themselves are ordered. A total sort requires that all similar and duplicate records are
located in the same partition of the data set. Similarity is based on the key fields in a record. The
partitions also need to be approximately the same size so that no one node becomes a processing
bottleneck.
In order to meet these two requirements, the input data is partitioned using the range partitioner. This
guarantees that all records with the same key fields are in the same partition, and calculates the partition
boundaries based on the key field to ensure fairly even distribution. In order to use the range partitioner
you must first take a sample of your input data, sort it, then use it to build a range partition map as
described in Chapter 51, “Write Range Map stage,” on page 533. You then
specify this map when setting up the range partitioner in the Input page Partitioning tab of your Sort
stage.
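The following Python sketch suggests the idea behind a range partition map: boundary keys are taken from a sorted sample of the data, and each incoming key is assigned to the partition whose range it falls in, so the partitions themselves are ordered. The sample values and partition count are hypothetical:

import bisect

# Conceptual sketch only: build a range map from a sorted sample and use it
# to assign keys to ordered partitions.
sample_keys = sorted(["GC2010", "GC2050", "GC2090", "GC2130", "GC2170", "GC2210"])
num_partitions = 3

step = len(sample_keys) // num_partitions
boundaries = [sample_keys[i * step] for i in range(1, num_partitions)]

def partition_for(key):
    """Keys below the first boundary go to partition 0, and so on."""
    return bisect.bisect_right(boundaries, key)

# Sorting each partition locally then yields a totally sorted data set.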
Must do’s
WebSphere DataStage has many defaults, which means that it can be very easy to include Sort stages in a
job. This section specifies the minimum steps to take to get a Sort stage functioning. WebSphere
DataStage provides a versatile user interface, and there are many shortcuts to achieving a particular end;
this section describes the basic method. You will learn where the shortcuts are when you get familiar
with the product.
Stage page
The General tab allows you to specify an optional description of the stage. The Properties tab lets you
specify what the stage does. The Advanced tab allows you to specify how the stage executes. The NLS
Locale tab appears if you have NLS enabled on your system. It allows you to select a locale other than
the project default to determine collating rules.
Properties tab
The Properties tab allows you to specify properties which determine what the stage actually does. Some
of the properties are mandatory, although many have default settings. Properties without default settings
appear in the warning color (red by default) and turn black when you supply a value for them.
The following table gives a quick reference list of the properties and their attributes. A more detailed
description of each property follows.
Category/Property: Sorting Keys/Key. Values: Input Column. Default: N/A. Mandatory? Y. Repeats? Y.
Dependent of: N/A.
Sorting keys category
Key
Specifies the key column for sorting. This property can be repeated to specify multiple key columns. You
can use the Column Selection dialog box to select several keys at once if required. Key has dependent
properties depending on the Sort Utility chosen:
v Sort Order
Options category
Sort utility
The type of sort the stage will carry out. Choose from:
v DataStage. The default. This uses the built-in WebSphere DataStage sorter, you do not require any
additional software to use this option.
v UNIX. This specifies that the UNIX sort command is used to perform the sort.
Stable sort
Applies to a Sort Utility type of DataStage, the default is True. It is set to True to guarantee that this sort
operation will not rearrange records that are already in a properly sorted data set. If set to False no prior
ordering of records is guaranteed to be preserved by the sorting operation.
Allow duplicates
Set to True by default. If False, specifies that, if multiple records have identical sorting key values, only
one record is retained. If Stable Sort is True, then the first record is retained. This property is not available
for the UNIX sort type.
Output statistics
Set False by default. If True it causes the sort operation to output statistics. This property is not available
for the UNIX sort type.
Create cluster key change column
This property appears for sort type DataStage and is optional. It is set False by default. If set True it tells
the Sort stage to create the column clusterKeyChange in each output record. The clusterKeyChange
column is set to 1 for the first record in each group, where groups are defined by using a Sort Key Mode
property. Subsequent records in the group have the clusterKeyChange column set to 0.
Create key change column
This property appears for sort type DataStage and is optional. It is set False by default. If set True it tells
the Sort stage to create the column KeyChange in each output record. The KeyChange column is set to 1
for the first record in each group where the value of the sort key changes. Subsequent records in the
group have the KeyChange column set to 0.
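The following Python sketch (with hypothetical data) shows the kind of flag column being described: a 1 on the first row of each group of equal key values and a 0 on the rest:

# Conceptual sketch only: compute a key change flag over key-sorted rows.
rows = [{"acct": "A"}, {"acct": "A"}, {"acct": "B"}, {"acct": "C"}, {"acct": "C"}]

previous_key = object()            # sentinel that never equals a real key value
for row in rows:
    row["keyChange"] = 1 if row["acct"] != previous_key else 0
    previous_key = row["acct"]
# rows now carry keyChange values 1, 0, 1, 1, 0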
Restrict memory usage (MB)
This is set to 20 by default. It causes the Sort stage to restrict itself to the specified number of megabytes
of virtual memory on a processing node.
We recommend that the number of megabytes specified is smaller than the amount of physical memory
on a processing node.
Workspace
This property appears for sort type UNIX only. Optionally specifies the workspace used by the stage.
Advanced tab
This tab allows you to specify the following:
v Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the
input data is processed by the available nodes as specified in the Configuration file, and by any node
constraints specified on the Advanced tab. In Sequential mode the entire data set is processed by the
conductor node.
v Preserve partitioning. This is Set by default. You can explicitly select Set or Clear. Select Set to request
that the next stage in the job attempt to maintain the partitioning.
v Combinability mode. This is Auto by default, which allows WebSphere DataStage to combine the
operators that underlie parallel stages so that they run in the same process if it is sensible for this type
of stage.
v Node pool and resource constraints. Select this option to constrain parallel execution to the node pool
or pools and/or resource pool or pools specified in the grid. The grid allows you to make choices from
drop down lists populated from the Configuration file.
v Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node
map. You can define a node map by typing node numbers into the text box or by clicking the browse
button to open the Available Nodes dialog box and selecting nodes from there. You are effectively
defining a new node pool for this stage (in addition to any node pools defined in the Configuration
file).
The General tab allows you to specify an optional description of the link. The Partitioning tab allows you
to specify how incoming data on the source data set link is partitioned. The Columns tab specifies the
column definitions of incoming data. The Advanced tab allows you to change the default buffering
settings for the input link.
Details about Sort stage partitioning are given in the following section. See ″Stage Editors,″ for a general
description of the other tabs.
Partitioning tab
The Partitioning tab allows you to specify details about how the incoming data is partitioned or collected
before the sort is performed.
By default the stage uses the auto partitioning method. If the Preserve Partitioning option has been set on
the previous stage in the job the stage will warn you when the job runs if it cannot preserve the
partitioning of the incoming data.
If the Sort stage is operating in sequential mode, it will first collect the data before writing it to the
file using the default auto collection method.
The Partitioning tab allows you to override this default behavior. The exact operation of this tab depends
on:
v Whether the Sort stage is set to execute in parallel or sequential mode.
v Whether the preceding stage in the job is set to execute in parallel or sequential mode.
If the Sort stage is set to execute in parallel, then you can set a partitioning method by selecting from the
Partition type drop-down list. This will override any current partitioning.
If the Sort stage is set to execute in sequential mode, but the preceding stage is executing in parallel, then
you can set a collection method from the Collector type drop-down list. This will override the default
auto collection method.
The Partitioning tab also allows you to specify that data arriving on the input link should be sorted
before the Sort is performed. This is a standard feature of the stage editors; if you make use of it you will
be running a simple sort before the main Sort operation that the stage provides. The sort is always
carried out within data partitions. If the stage is partitioning incoming data the sort occurs after the
partitioning. If the stage is collecting data, the sort occurs before the collection. The availability of sorting
depends on the partitioning or collecting method chosen (it is not available with the default auto
methods).
If NLS is enabled an additional button opens a dialog box allowing you to select a locale specifying the
collate convention for the sort.
You can also specify sort direction, case sensitivity, whether sorted as ASCII or EBCDIC, and whether null
columns will appear first or last for each column. Where you are using a keyed partitioning method, you
can also specify whether the column is used as a key for sorting, for partitioning, or for both. Select the
column in the Selected list and right-click to invoke the shortcut menu.
Output page
The Output page allows you to specify details about data output from the Sort stage. The Sort stage can
have only one output link.
The General tab allows you to specify an optional description of the output link. The Columns tab
specifies the column definitions of the data. The Mapping tab allows you to specify the relationship
between the columns being input to the Sort stage and the Output columns. The Advanced tab allows
you to change the default buffering settings for the output link.
Details about Sort stage mapping are given in the following section. See ″Stage Editors,″ for a general
description of the other tabs.
The left pane shows the columns of the sorted data. These are read only and cannot be modified on this
tab. This shows the meta data from the input link.
The right pane shows the output columns for the output link. This has a Derivations field where you can
specify how the column is derived. You can fill it in by dragging input columns over, or by using the
Auto-match facility.
In the above example the left pane represents the incoming data after the sort has been performed. The
right pane represents the data being output by the stage after the sort operation. In this example the data
has been mapped straight across.
For all methods the meta data of all input data sets must be identical.
The sort funnel method has some particular requirements about its input data. All input data sets must
be sorted by the same key columns that are to be used by the Funnel operation.
Typically all input data sets for a sort funnel operation are hash-partitioned before they’re sorted
(choosing the auto partitioning method will ensure that this is done). Hash partitioning guarantees that
all records with the same key column values are located in the same partition and so are processed on
the same node. If sorting and partitioning are carried out on separate stages before the Funnel stage, this
partitioning must be preserved.
Examples
The Funnel stage, when set to continuous funnel, will combine these into a single data set. The job to
perform the funnel is as follows:
The continuous funnel method does not attempt to impose any order on the data it is processing. It
simply writes rows as they become available on the input links. In our example the stage has written a
row from each input link in turn. A sample of the final, funneled, data is as follows:
"VALUE TOOL REPAIR","4345 OKEECHOBEE BLVD STE F2","GC17697","GlobalCoUS"
"CHAPPEL HILL WSC MAIN STREET","PO BOX 215","GC17712","GlobalCoUS"
"ROBERT WILLIAMS","3615","GC17728","GlobalCoUS"
"USC TICKET OFFICE","REX ENRIGHT CENTER","GC17742","GlobalCoUS"
"STASCO INC","PO BOX 3087","GC17760","GlobalCoUS"
"VASCO BRANDS INC","PO BOX 4040","GC17786","GlobalCoUS"
"ROYAL ST WW IMPROVEMENTS*DO NO","4103 RIVER OAKS DR","GC17798","GlobalCoUS"
"KAY ROSE INC **CASH SALES**","PO BOX 617","GC17808","GlobalCoUS"
"Novatec sa","Bp 28","GC20002","GlobalCoFR"
"Casio - Esselte ","Route De Nozay","GC20012","GlobalCoFR"
"Citélis ","137 Rue Voltaire","GC20021","GlobalCoFR"
"DECATHLON ","3 Rue De La Nouvelle France","GC20032","GlobalCoFR"
"BNP : Banque Nationale de Pari","Sdap","GC20040","GlobalCoFR"
",Acton - Visserie Boulonnerie","Domaine De Voluceau","GC20049","GlobalCoFR"
"SKF France ","5 Rue Noël Pons","GC20059","GlobalCoFR"
"Admiral ","2 Rue Auguste Comte","GC20067","GlobalCoFR"
"Borland ","1 Avenue Newton","GC20076","GlobalCoFR"
Note: If you are running your sort funnel stage in parallel, you should be aware of the various
considerations about sorting data and partitions. These are described in ″Sort Stage.″
When using the sequence method, you need to specify the order in which the Funnel stage processes its
input links, as this affects the order of the sequencing. This is done on the Stage page Link Ordering tab.
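The three funnel methods can be suggested by the following Python sketch; the link contents are illustrative and all links are assumed to carry identical column layouts:

import heapq
from itertools import chain, zip_longest

# Conceptual sketch only.
link1 = ["GC17697", "GC17712", "GC17728"]
link2 = ["GC20002", "GC20012"]

# Continuous funnel: rows are written as they become available; interleaving
# the links here simply suggests that no particular order is imposed.
continuous = [row for pair in zip_longest(link1, link2) for row in pair if row is not None]

# Sequence: all rows from the first link, then the second, in the order set on
# the Stage page Link Ordering tab.
sequence = list(chain(link1, link2))

# Sort funnel: inputs already sorted on the key columns are merged so that the
# combined output remains sorted.
sort_funnel = list(heapq.merge(sorted(link1), sorted(link2)))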
Must do’s
WebSphere DataStage has many defaults, which means that it can be very easy to include Funnel stages
in a job. This section specifies the minimum steps to take to get a Funnel stage functioning. WebSphere
DataStage provides a versatile user interface, and there are many shortcuts to achieving a particular end;
this section describes the basic method. You will learn where the shortcuts are when you get familiar
with the product.
Stage page
The General tab allows you to specify an optional description of the stage. The Properties tab lets you
specify what the stage does. The Advanced tab allows you to specify how the stage executes. The Link
Ordering tab allows you to specify which order the input links are processed in. The NLS Locale tab
appears if you have NLS enabled on your system. It allows you to select a locale other than the project
default to determine collating rules.
Properties tab
The Properties tab allows you to specify properties which determine what the stage actually does. Some
of the properties are mandatory, although many have default settings. Properties without default settings
appear in the warning color (red by default) and turn black when you supply a value for them.
The following table gives a quick reference list of the properties and their attributes. A more detailed
description of each property follows.
Category/Property: Options/Funnel Type. Values: Continuous Funnel/Sequence/Sort Funnel. Default:
Continuous Funnel. Mandatory? Y. Repeats? N. Dependent of: N/A.
Category/Property: Sorting Keys/Key. Values: Input Column. Default: N/A. Mandatory? Y (if Funnel
Type = Sort Funnel). Repeats? Y. Dependent of: N/A.
Category/Property: Sorting Keys/Sort Order. Values: Ascending/Descending. Default: Ascending.
Mandatory? Y (if Funnel Type = Sort Funnel). Repeats? N. Dependent of: Key.
Category/Property: Sorting Keys/Nulls position. Values: First/Last. Default: First. Mandatory? Y (if
Funnel Type = Sort Funnel). Repeats? N. Dependent of: Key.
Options category
Funnel type
Specifies the type of Funnel operation. Choose from Continuous Funnel, Sequence, or Sort Funnel. The
default is Continuous Funnel.
Sorting keys category
Key
This property is only required for Sort Funnel operations. Specify the key column that the sort will be
carried out on. The first column you specify is the primary key; you can add multiple secondary keys by
repeating the Key property. You can use the Column Selection dialog box to select several keys at once if
required.
Advanced tab
This tab allows you to specify the following:
v Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the
input data is processed by the available nodes as specified in the Configuration file, and by any node
constraints specified on the Advanced tab. In Sequential mode the entire data set is processed by the
conductor node.
v Combinability mode. This is Auto by default, which allows WebSphere DataStage to combine the
operators that underlie parallel stages so that they run in the same process if it is sensible for this type
of stage.
By default the input links will be processed in the order they were added. To rearrange them, choose an
input link and click the up arrow button or the down arrow button.
Input page
The Input page allows you to specify details about the incoming data sets. Choose an input link from the
Input name drop down list to specify which link you want to work on.
The General tab allows you to specify an optional description of the input link. The Partitioning tab
allows you to specify how incoming data is partitioned before being funneled. The Columns tab specifies
the column definitions of incoming data. The Advanced tab allows you to change the default buffering
settings for the input link.
Details about Funnel stage partitioning are given in the following section. See ″Stage Editors,″ for a
general description of the other tabs.
By default the stage partitions in Auto mode. This attempts to work out the best partitioning method
depending on execution modes of current and preceding stages and how many nodes are specified in the
Configuration file.
The Partitioning tab allows you to override this default behavior. The exact operation of this tab depends
on:
v Whether the Funnel stage is set to execute in parallel or sequential mode.
v Whether the preceding stage in the job is set to execute in parallel or sequential mode.
If the Funnel stage is set to execute in parallel, then you can set a partitioning method by selecting from
the Partition type drop-down list. This will override any current partitioning.
If you are using the Sort Funnel method, and haven’t partitioned the data in a previous stage, you should
key partition it by choosing the Hash or modulus partition method on this tab.
If the Funnel stage is set to execute in sequential mode, but the preceding stage is executing in parallel,
then you can set a collection method from the Collector type drop-down list. This will override the
default collection method.
The Partitioning tab also allows you to specify that data arriving on the input link should be sorted
before being funneled. The sort is always carried out within data partitions. If the stage is partitioning
incoming data the sort occurs after the partitioning. If the stage is collecting data, the sort occurs before
the collection.
If NLS is enabled an additional button opens a dialog box allowing you to select a locale specifying the
collate convention for the sort.
You can also specify sort direction, case sensitivity, whether sorted as ASCII or EBCDIC, and whether null
columns will appear first or last for each column. Where you are using a keyed partitioning method, you
can also specify whether the column is used as a key for sorting, for partitioning, or for both. Select the
column in the Selected list and right-click to invoke the shortcut menu.
Output page
The Output page allows you to specify details about data output from the Funnel stage. The Funnel stage
can have only one output link.
The General tab allows you to specify an optional description of the output link. The Columns tab
specifies the column definitions of the data. The Mapping tab allows you to specify the relationship
between the columns being input to the Funnel stage and the Output columns. The Advanced tab allows
you to change the default buffering settings for the output link.
Details about Funnel stage mapping are given in the following section. See ″Stage Editors,″ for a general
description of the other tabs.
Mapping tab
For Funnel stages the Mapping tab allows you to specify how the output columns are derived, i.e., what
input columns map onto them or how they are generated.
The left pane shows the input columns. These are read only and cannot be modified on this tab. It is a
requirement of the Funnel stage that all input links have identical meta data, so only one set of column
definitions is shown.
The right pane shows the output columns for each link. This has a Derivations field where you can
specify how the column is derived. You can fill it in by dragging input columns over, or by using the
Auto-match facility.
The Remove Duplicates stage takes a single sorted data set as input, removes all duplicate rows, and
writes the results to an output data set.
Removing duplicate records is a common way of cleansing a data set before you perform further
processing. Two rows are considered duplicates if they are adjacent in the input data set and have
identical values for the key column(s). A key column is any column you designate to be used in
determining whether two rows are identical.
The data set input to the Remove Duplicates stage must be sorted so that all records with identical key
values are adjacent. You can either achieve this using the in-stage sort facilities available on the Input
page Partitioning tab, or have an explicit Sort stage feeding the Remove Duplicates stage.
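The following Python sketch (with illustrative data) shows the behavior described above on key-sorted input: adjacent rows with the same key value are collapsed, retaining either the first or the last row of each group:

# Conceptual sketch only: remove adjacent duplicates on a key column.
rows = [
    ("GC29837", "ANGEL BROTHERS INC"),
    ("GC29837", "ANGEL BROTHERS INC"),
    ("GC29838", "ENCNG WASHINGTON"),
]

def remove_duplicates(sorted_rows, key=lambda r: r[0], retain="First"):
    out = []
    for row in sorted_rows:
        if out and key(out[-1]) == key(row):
            if retain == "Last":
                out[-1] = row        # keep the later row of the group
            continue                  # retain == "First": drop the later row
        out.append(row)
    return out

deduped = remove_duplicates(rows)     # keeps a single GC29837 row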
Example
In the example the data is a list of GlobalCo’s customers. The data contains some duplicate entries, and
we want to remove these.
The first step is to sort the data so that the duplicates are actually next to each other. As with all sorting
operations, there are implications around data partitions if you run the job in parallel (see ″Copy Stage,″
for a discussion of these). You should hash partition the data using the sort keys as hash keys in order to
guarantee that duplicate rows are in the same partition. In our example we sort on the
CUSTOMER_NUMBER column, and our sample of the sorted data shows up some duplicates:
"GC29834","AQUILUS CONDOS","917 N FIRST ST","8/29/1996 "
"GC29835","MINERAL COUNTRY","2991 TELLER CT","8/29/1996 "
"GC29836","ABC WATERWORKS","PO BOX 14093","8/30/1996 "
"GC29837","ANGEL BROTHERS INC","128 SOUTH MAIN ST","8/30/1996 "
"GC29837","ANGEL BROTHERS INC","128 SOUTH MAIN ST","8/30/1996 "
"GC29838","ENCNG WASHINGTON","1305 JOHN SMALL AV","8/30/1996 "
"GC29839","DYNEGY FACILITIES","1000 LOUISIANA STE 5800","8/30/1996 "
"GC29840","LITTLE HAITI GATEWAY","NE 2ND AVENUE AND NE 62ND STRE","8/30/1996 "
Here is a sample of the data after the job has been run and the duplicates removed:
"GC29834","AQUILUS CONDOS","917 N FIRST ST","8/29/1996 "
"GC29835","MINERAL COUNTRY","2991 TELLER CT","8/29/1996 "
"GC29836","ABC WATERWORKS","PO BOX 14093","8/30/1996 "
"GC29837","ANGEL BROTHERS INC","128 SOUTH MAIN ST","8/30/1996 "
"GC29838","ENCNG WASHINGTON","1305 JOHN SMALL AV","8/30/1996 "
"GC29839","DYNEGY FACILITIES","1000 LOUISIANA STE 5800","8/30/1996 "
"GC29840","LITTLE HAITI GATEWAY","NE 2ND AVENUE AND NE 62ND STRE","8/30/1996 "
Must do’s
WebSphere DataStage has many defaults, which means that it can be very easy to include Remove Duplicates stages in a job. This section specifies the minimum steps to take to get a Remove Duplicates stage functioning. WebSphere DataStage provides a versatile user interface, and there are many shortcuts to achieving a particular end; this section describes the basic method, and you will learn where the shortcuts are as you become familiar with the product.
Stage page
The General tab allows you to specify an optional description of the stage. The Properties tab lets you
specify what the stage does. The Advanced tab allows you to specify how the stage executes. The NLS
Locale tab appears if you have NLS enabled on your system. It allows you to select a locale other than
the project default to determine collating rules.
The following table gives a quick reference list of the properties and their attributes. A more detailed
description of each property follows.
Category/Property                            Values        Default   Mandatory?  Repeats?  Dependent of
Keys that Define Duplicates/Key              Input Column  N/A       Y           Y         N/A
Keys that Define Duplicates/Sort as EBCDIC   True/False    False     N           N         Key
Keys that Define Duplicates/Case Sensitive   True/False    True      N           N         Key
Options/Duplicate to retain                  First/Last    First     Y           N         N/A
Keys that define duplicates category
Key
Specifies the key column for the operation. This property can be repeated to specify multiple key columns. You can use the Column Selection dialog box to select several keys at once if required. Key has dependent properties as follows:
v Sort as EBCDIC
To sort as in the EBCDIC character set, choose True.
v Case Sensitive
Use this to specify whether each key is case sensitive or not. It is set to True by default; for example, the values ″CASE″ and ″case″ would not be judged equivalent.
Options category
Duplicate to retain
Specifies which of the duplicate rows encountered to retain. Choose between First and Last. It is set to First by default.
Advanced tab
This tab allows you to specify the following:
v Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the
input data is processed by the available nodes as specified in the Configuration file, and by any node
constraints specified on the Advanced tab. In Sequential mode the entire data set is processed by the
conductor node.
v Combinability mode. This is Auto by default, which allows WebSphere DataStage to combine the
operators that underlie parallel stages so that they run in the same process if it is sensible for this type
of stage.
Input page
The Input page allows you to specify details about the incoming data set from which duplicate rows are to be removed. Choose an input link from the Input name drop down list to specify which link you want to work on.
The General tab allows you to specify an optional description of the link. The Partitioning tab allows you
to specify how incoming data on the source data set link is partitioned. The Columns tab specifies the
column definitions of incoming data. The Advanced tab allows you to change the default buffering
settings for the input link.
Details about Remove Duplicates stage partitioning are given in the following section. See ″Stage
Editors,″ for a general description of the other tabs.
If the Remove Duplicates stage is operating in sequential mode, it will first collect the data before writing
it to the file using the default auto collection method.
The Partitioning tab allows you to override this default behavior. The exact operation of this tab depends
on:
v Whether the Remove Duplicates stage is set to execute in parallel or sequential mode.
v Whether the preceding stage in the job is set to execute in parallel or sequential mode.
If the Remove Duplicates stage is set to execute in parallel, then you can set a partitioning method by
selecting from the Partition type drop-down list. This will override any current partitioning.
If the Remove Duplicates stage is set to execute in sequential mode, but the preceding stage is executing
in parallel, then you can set a collection method from the Collector type drop-down list. This will
override the default auto collection method.
The Partitioning tab also allows you to specify that data arriving on the input link should be sorted
before the remove duplicates operation is performed. The sort is always carried out within data
partitions. If the stage is partitioning incoming data the sort occurs after the partitioning. If the stage is
collecting data, the sort occurs before the collection. The availability of sorting depends on the
partitioning or collecting method chosen (it is not available with the default auto methods).
If NLS is enabled an additional button opens a dialog box allowing you to select a locale specifying the
collate convention for the sort.
You can also specify sort direction, case sensitivity, whether sorted as ASCII or EBCDIC, and whether null
columns will appear first or last for each column. Where you are using a keyed partitioning method, you
can also specify whether the column is used as a key for sorting, for partitioning, or for both. Select the
column in the Selected list and right-click to invoke the shortcut menu.
Output page
The Output page allows you to specify details about data output from the Remove Duplicates stage. The stage can only have one output link.
The General tab allows you to specify an optional description of the output link. The Columns tab specifies the column definitions of the data. The Mapping tab allows you to specify the relationship between the columns being input to the Remove Duplicates stage and the output columns. The Advanced tab allows you to change the default buffering settings for the output link.
Details about Remove Duplicates stage mapping are given in the following section. See ″Stage Editors,″ for a general description of the other tabs.
Mapping tab
For Remove Duplicates stages the Mapping tab allows you to specify how the output columns are
derived, i.e., what input columns map onto them.
The left pane shows the columns of the input data. These are read only and cannot be modified on this
tab. This shows the meta data from the incoming link.
The right pane shows the output columns for the output link. This has a Derivations field where you can specify how the column is derived. You can fill it in by dragging input columns over, or by using the Auto-match facility.
The Compress stage uses the UNIX compress or GZIP utility to compress a data set. It converts a data set
from a sequence of records into a stream of raw binary data. The complement to the Compress stage is
the Expand stage, which is described in Chapter 21, “Expand Stage,” on page 287.
A compressed data set is similar to an ordinary data set and can be stored in a persistent form by a Data
Set stage. However, a compressed data set cannot be processed by many stages until it is expanded, that
is, until its rows are returned to their normal format. Stages that do not perform column-based processing
or reorder the rows can operate on compressed data sets. For example, you can use the Copy stage to
create a copy of the compressed data set.
Because compressing a data set removes its normal record boundaries, the compressed data set must not
be repartitioned before it is expanded.
DataStage puts the existing data set schema as a subrecord to a generic compressed schema. For example, given a data set with a schema of:
a:int32;
b:string[50];
these columns are wrapped as a subrecord within the generic compressed schema. Therefore, when you are looking to reuse a file that has been compressed, ensure that you use the ’compressed schema’ to read the file rather than the schema that had gone into the compression.
Stage page
The General tab allows you to specify an optional description of the stage. The Properties tab lets you
specify what the stage does. The Advanced tab allows you to specify how the stage executes.
Properties tab
The Properties tab allows you to specify properties which determine what the stage actually does. The
stage only has a single property which determines whether the stage uses compress or GZIP.
Category/Property   Values          Default   Mandatory?  Repeats?  Dependent of
Options/Command     compress/gzip   compress  Y           N         N/A
Options category
Command
Specifies whether the stage will use compress (the default) or GZIP.
Advanced tab
This tab allows you to specify the following:
v Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the
input data is processed by the available nodes as specified in the Configuration file, and by any node
constraints specified on the Advanced tab. In Sequential mode the entire data set is processed by the
conductor node.
v Combinability mode. This is Auto by default, which allows WebSphere DataStage to combine the
operators that underlie parallel stages so that they run in the same process if it is sensible for this type
of stage.
v Preserve partitioning. This is Set by default. You can explicitly select Set or Clear. Select Set to request
the next stage should attempt to maintain the partitioning.
v Node pool and resource constraints. Select this option to constrain parallel execution to the node pool
or pools and/or resource pool or pools specified in the grid. The grid allows you to make choices from
drop down lists populated from the Configuration file.
v Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node map. You can define a node map by typing node numbers into the text box or by clicking the browse button to open the Available Nodes dialog box and selecting nodes from there. You are effectively defining a new node pool for this stage (in addition to any node pools defined in the Configuration file).
Input page
The Input page allows you to specify details about the data set being compressed. There is only one
input link.
The General tab allows you to specify an optional description of the link. The Partitioning tab allows you
to specify how incoming data on the source data set link is partitioned. The Columns tab specifies the
column definitions of incoming data. The Advanced tab allows you to change the default buffering
settings for the input link.
Details about Compress stage partitioning are given in the following section. See ″Stage Editors,″ for a
general description of the other tabs.
If the Compress stage is operating in sequential mode, it will first collect the data before writing it to the
file using the default auto collection method.
The Partitioning tab allows you to override this default behavior. The exact operation of this tab depends
on:
v Whether the Compress stage is set to execute in parallel or sequential mode.
v Whether the preceding stage in the job is set to execute in parallel or sequential mode.
If the Compress stage is set to execute in parallel, then you can set a partitioning method by selecting
from the Partition type drop-down list. This will override any current partitioning.
If the Compress stage is set to execute in sequential mode, but the preceding stage is executing in
parallel, then you can set a collection method from the Collector type drop-down list. This will override
the default auto collection method.
The Partitioning tab also allows you to specify that data arriving on the input link should be sorted
before the compression is performed. The sort is always carried out within data partitions. If the stage is
partitioning incoming data the sort occurs after the partitioning. If the stage is collecting data, the sort
occurs before the collection. The availability of sorting depends on the partitioning or collecting method
chosen (it is not available for the default auto methods).
If NLS is enabled an additional button opens a dialog box allowing you to select a locale specifying the
collate convention for the sort.
You can also specify sort direction, case sensitivity, whether sorted as ASCII or EBCDIC, and whether null
columns will appear first or last for each column. Where you are using a keyed partitioning method, you
can also specify whether the column is used as a key for sorting, for partitioning, or for both. Select the
column in the Selected list and right-click to invoke the shortcut menu.
Output page
The Output page allows you to specify details about data output from the Compress stage. The stage
only has one output link.
The General tab allows you to specify an optional description of the output link. The Columns tab
specifies the column definitions of the data. The Advanced tab allows you to change the default buffering
settings for the output link.
The Expand stage uses the UNIX uncompress or GZIP utility to expand a data set. It converts a previously
compressed data set back into a sequence of records from a stream of raw binary data. The complement
to the Expand stage is the Compress stage which is described in Chapter 20, “Compress stage,” on page
283.
Must do’s
WebSphere DataStage has many defaults, which means that it can be very easy to include Expand stages in a job. This section specifies the minimum steps to take to get an Expand stage functioning. WebSphere DataStage provides a versatile user interface, and there are many shortcuts to achieving a particular end; this section describes the basic method, and you will learn where the shortcuts are as you become familiar with the product.
Stage page
The General tab allows you to specify an optional description of the stage. The Properties tab lets you
specify what the stage does. The Advanced tab allows you to specify how the stage executes.
Properties tab
The Properties tab allows you to specify properties which determine what the stage actually does. The
stage only has a single property which determines whether the stage uses uncompress or GZIP.
Options category
Command
Specifies whether the stage will use uncompress (the default) or GZIP.
Advanced tab
This tab allows you to specify the following:
v Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the
input data is processed by the available nodes as specified in the Configuration file, and by any node
constraints specified on the Advanced tab. In Sequential mode the entire data set is processed by the
conductor node.
v Combinability mode. This is Auto by default, which allows WebSphere DataStage to combine the
operators that underlie parallel stages so that they run in the same process if it is sensible for this type
of stage.
v Preserve partitioning. This is Propagate by default. The stage has a mandatory partitioning method of Same; this overrides the preserve partitioning flag, and so the partitioning of the incoming data is always preserved.
v Node pool and resource constraints. Select this option to constrain parallel execution to the node pool
or pools and/or resource pool or pools specified in the grid. The grid allows you to make choices from
drop down lists populated from the Configuration file.
v Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node
map. You can define a node map by typing node numbers into the text box or by clicking the browse
button to open the Available Nodes dialog box and selecting nodes from there. You are effectively
defining a new node pool for this stage (in addition to any node pools defined in the Configuration
file).
Input page
The Input page allows you to specify details about the data set being expanded. There is only one input
link.
The General tab allows you to specify an optional description of the link. The Partitioning tab allows you
to specify how incoming data on the source data set link is partitioned. The Columns tab specifies the
column definitions of incoming data. The Advanced tab allows you to change the default buffering
settings for the input link.
Details about Expand stage partitioning are given in the following section. See ″Stage Editors,″ for a
general description of the other tabs.
By default the stage uses the Same partitioning method and this cannot be altered. This preserves the
partitioning already in place.
The Partitioning tab normally also allows you to specify that data arriving on the input link should be
sorted before the expansion is performed. This facility is not available on the expand stage.
Output page
The Output page allows you to specify details about data output from the Expand stage. The stage only
has one output link.
The General tab allows you to specify an optional description of the output link. The Columns tab
specifies the column definitions of the data. The Advanced tab allows you to change the default buffering
settings for the output link.
The Copy stage copies a single input data set to a number of output data sets. Each record of the input
data set is copied to every output data set. Records can be copied without modification or you can drop
or change the order of columns (to copy with more modification - for example changing column data
types - use the Modify stage as described in Chapter 23, “Modify stage,” on page 301). Copy lets you
make a backup copy of a data set on disk while performing an operation on another copy, for example.
Where you are using a Copy stage with a single input and a single output, you should ensure that you set the Force property in the stage editor to True. This prevents WebSphere DataStage from deciding that the Copy operation is superfluous and optimizing it out of the job.
Example
In this example we are going to copy data from a table containing billing information for GlobalCo’s customers. We are going to copy it to three separate data sets, and in each case we are only copying a subset of the columns. The Copy stage will drop the unwanted columns as it copies the data set.
The column names for the input data set are as follows:
v BILL_TO_NUM
v CUST_NAME
v ADDR_1
v ADDR_2
Must do’s
WebSphere DataStage has many defaults, which means that it can be very easy to include Copy stages in a job. This section specifies the minimum steps to take to get a Copy stage functioning. WebSphere DataStage provides a versatile user interface, and there are many shortcuts to achieving a particular end; this section describes the basic method, and you will learn where the shortcuts are as you become familiar with the product.
Stage page
The General tab allows you to specify an optional description of the stage. The Properties tab lets you
specify what the stage does. The Advanced tab allows you to specify how the stage executes.
Properties tab
The Properties tab allows you to specify properties which determine what the stage actually does. The
Copy stage only has one property.
Category/Property   Values       Default  Mandatory?  Repeats?  Dependent of
Options/Force       True/False   False    N           N         N/A
Options category
Force
Set True to specify that WebSphere DataStage should not try to optimize the job by removing a Copy
operation where there is one input and one output. Set False by default.
Advanced tab
This tab allows you to specify the following:
v Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the
input data is processed by the available nodes as specified in the Configuration file, and by any node
constraints specified on the Advanced tab. In Sequential mode the entire data set is processed by the
conductor node.
v Combinability mode. This is Auto by default, which allows WebSphere DataStage to combine the
operators that underlie parallel stages so that they run in the same process if it is sensible for this type
of stage.
v Preserve partitioning. This is Propagate by default. It adopts the setting of the previous stage. You can
explicitly select Set or Clear. Select Set to request the stage should attempt to maintain the
partitioning.
v Node pool and resource constraints. Select this option to constrain parallel execution to the node pool
or pools and/or resource pool or pools specified in the grid. The grid allows you to make choices from
drop down lists populated from the Configuration file.
v Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node
map. You can define a node map by typing node numbers into the text box or by clicking the browse
button to open the Available Nodes dialog box and selecting nodes from there. You are effectively
defining a new node pool for this stage (in addition to any node pools defined in the Configuration
file).
Input page
The Input page allows you to specify details about the data set being copied. There is only one input
link.
Details about Copy stage partitioning are given in the following section. See ″Stage Editors,″ for a general
description of the other tabs.
If the Copy stage is operating in sequential mode, it will first collect the data before writing it to the file
using the default auto collection method.
The Partitioning tab allows you to override this default behavior. The exact operation of this tab depends
on:
v Whether the Copy stage is set to execute in parallel or sequential mode.
v Whether the preceding stage in the job is set to execute in parallel or sequential mode.
If the Copy stage is set to execute in parallel, then you can set a partitioning method by selecting from
the Partition type drop-down list. This will override any current partitioning.
If the Copy stage is set to execute in sequential mode, but the preceding stage is executing in parallel,
then you can set a collection method from the Collector type drop-down list. This will override the
default auto collection method.
The Partitioning tab also allows you to specify that data arriving on the input link should be sorted
before the copy operation is performed. The sort is always carried out within data
partitions. If the stage is partitioning incoming data the sort occurs after the partitioning. If the stage is
collecting data, the sort occurs before the collection. The availability of sorting depends on the
partitioning or collecting method chosen (it is not available for the default auto methods).
If NLS is enabled an additional button opens a dialog box allowing you to select a locale specifying the
collate convention for the sort.
You can also specify sort direction, case sensitivity, whether sorted as ASCII or EBCDIC, and whether null
columns will appear first or last for each column. Where you are using a keyed partitioning method, you
can also specify whether the column is used as a key for sorting, for partitioning, or for both. Select the
column in the Selected list and right-click to invoke the shortcut menu.
Output page
The Output page allows you to specify details about data output from the Copy stage. The stage can
have any number of output links, choose the one you want to work on from the Output name drop
down list.
The General tab allows you to specify an optional description of the output link. The Columns tab
specifies the column definitions of the data. The Mapping tab allows you to specify the relationship
between the columns being input to the Copy stage and the output columns. The Advanced tab allows
you to change the default buffering settings for the output links.
Details about Copy stage mapping are given in the following section. See ″Stage Editors,″ for a general description of the other tabs.
Mapping tab
For Copy stages the Mapping tab allows you to specify how the output columns are derived, i.e., what
copied columns map onto them.
The left pane shows the copied columns. These are read only and cannot be modified on this tab.
The right pane shows the output columns for the output link. This has a Derivations field where you can
specify how the column is derived. You can fill it in by dragging copied columns over, or by using the
Auto-match facility.
The modify stage alters the record schema of its input data set. The modified data set is then output. You
can drop or keep columns from the schema, or change the type of a column.
Examples
The modify stage is used to drop the REPID, CREDITLIMIT, and COMMENTS columns. To do this, the
stage properties are set as follows:
Specification = DROP REPID, CREDITLIMIT, COMMENTS
The easiest way to specify the outgoing meta data in this example would be to use runtime column propagation. You could, however, choose to specify the meta data manually.
You could achieve the same effect by specifying which columns to keep, rather than which ones to drop.
In the case of this example the required specification to use in the stage properties would be:
KEEP CUSTID, NAME, ADDRESS, CITY, STATE, ZIP, AREA, PHONE
Some data type conversions require you to use a transform command; a list of these, and the available transforms, is given in ″Specification″. The decimal to string conversion is one that can be performed using an explicit transform. In this case, the specification on the Properties page is as follows:
conv_CUSTID:string = string_from_decimal(CUSTID)
Null handling
You can also use the Modify stage to handle columns that might contain null values. Any of the columns in the example, other than CUSTID, could legally contain a null value. You could use the Modify stage to detect when the PHONE column contains a null value, and handle it so no errors occur. In this case, you would give a null-handling specification on the Properties page.
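A minimal sketch of such a specification, assuming the handle_null conversion and an illustrative replacement value (the replacement value shown is an assumption, not taken from the example data), might be:
PHONE = handle_null (PHONE, 'NONE')
This would replace a null in the PHONE column with the literal string rather than letting the null propagate.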
Must do’s
WebSphere DataStage has many defaults, which means that it can be very easy to include Modify stages in a job. This section specifies the minimum steps to take to get a Modify stage functioning. WebSphere DataStage provides a versatile user interface, and there are many shortcuts to achieving a particular end; this section describes the basic method, and you will learn where the shortcuts are as you become familiar with the product.
Stage page
The General tab allows you to specify an optional description of the stage. The Properties tab lets you
specify what the stage does. The Advanced tab allows you to specify how the stage executes.
Properties tab
The Properties tab allows you to specify properties which determine what the stage actually does. The
modify stage only has one property, although you can repeat this as required.
Category/Property       Values   Default  Mandatory?  Repeats?  Dependent of
Options/Specification   string   N/A      Y           Y         N/A
Options category
Specification
If you choose to drop a column or columns, all columns are retained except those you explicitly drop. If you choose to keep a column or columns, all columns are excluded except those you explicitly keep.
WebSphere DataStage can carry out some type conversions automatically; others need you to specify an explicit conversion function. Some conversions are not available. The following table summarizes the availability: ″d″ indicates automatic (default) conversion, ″m″ indicates that manual conversion is required, and a blank square indicates that conversion is not possible:
Source Field
decimal string ustring raw date time timestamp
int8 d dm dm m m m
uint8 d d d
int16 d dm dm
uint16 d dm dm
int32 d dm dm m m
uint32 d m m m
int64 d d d
uint64 d d d
sfloat d d d
dfloat dm dm dm m m
decimal dm dm
string dm d m m m
ustring dm d m m
raw
date m m m
time m m dm
timestamp m m m m
For example:
int8col = uint64col
For example:
day_column:int8 = month_day_from_date (date_column)
The new_type can be any of the destination types that are supported for conversions from the source (that
is, any of the columns marked ″m″ in the above table). For example, you can use the conversion
hours_from_time to convert a time to an int8, or to an int16, int32, dfloat, and so on. WebSphere
DataStage warns you when it is performing an implicit data type conversion, for example
hours_from_time expects to convert a time to an int8, and will warn you if converting to an int16, int32, or dfloat.
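For instance (the column names here are illustrative, not taken from the example data), the following specification performs the conversion but generates such a warning because the result is widened from int8 to int16:
hours_col:int16 = hours_from_time (time_col)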
The following list describes the available conversion functions. The source and destinations are always specified in terms of column names. Preliminary arguments are enclosed in square brackets, and the source column name is enclosed in round brackets.
v date_from_days_since. Arguments: [base_date (date)] (number_col (int32)). Output type: date. Converts an integer field into a date by adding the integer to the specified base date. The base_date must be in the format yyyy-mm-dd and must be either double quoted or a variable. Example: date_col:date = date_from_days_since [″1958-08-18″] (int_col)
v date_from_julian_day. Arguments: (juliandate_col (uint32)). Output type: date. Converts a Julian day to a date. Example: date_col:date = date_from_julian_day (julian_col)
v date_from_string. Arguments: [date_format] (string_col (string)). Output type: date. Converts the string to a date representation using the specified date_format. By default the string format is yyyy-mm-dd. Example: date_col:date = date_from_string [″%yyyy-%mm-%dd″] (string_col)
tableDefinition defines the rows of a string lookup table and has the following form:
{propertyList} (’string’ = value; ’string’ = value; ... )
where:
v propertyList is one or more of the following options; the entire list is enclosed in braces and properties
are separated by commas if there are more than one:
– case_sensitive. Perform a case-sensitive search for matching strings; the default is case-insensitive.
– default_value = defVal. The default numeric value returned for a string that does not match any of
the strings in the table.
– default_string = defString. The default string returned for numeric values that do not match any
numeric value in the table.
v string specifies a comma-separated list of strings associated with value; enclose each string in quotes.
v value specifies a comma-separated list of 16-bit integer values associated with string.
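As an illustrative sketch (the conversion name lookup_int16_from_string and the column names are assumptions made for the purposes of the example), a specification using such a string lookup table might look like this:
flag_code:int16 = lookup_int16_from_string [{default_value = 99} ('y', 'yes' = 1; 'n', 'no' = 0;)] (flag_col)
Here 'y' and 'yes' both map to 1, 'n' and 'no' both map to 0, and any other string returns the default value 99.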
date_format is the standard date formatting string described in “Date and time formats” on page 30
R_type is a string representing the rounding type and should contain one of the following:
v ceil. Round the source field toward positive infinity. For example, 1.4 -> 2, -1.6 -> -1.
v floor. Round the source field toward negative infinity. For example, 1.6 -> 1, -1.4 -> -2.
v round_inf. Round or truncate the source field toward the nearest representable value, breaking ties by rounding positive values toward positive infinity and negative values toward negative infinity. For example, 1.4 -> 1, 1.5 -> 2, -1.4 -> -1, -1.5 -> -2.
v trunc_zero. Discard any fractional digits to the right of the rightmost fractional digit supported in the destination, regardless of sign. For example, if the destination is an integer, all fractional digits are truncated. If the destination is another decimal with a smaller scale, round or truncate to the scale size of the destination decimal. For example, 1.6 -> 1, -1.6 -> -1.
You can specify fix_zero for decimal source columns so that columns containing all zeros (by default
illegal) are treated as a valid decimal with a value of zero.
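As a hedged illustration (the conversion name int32_from_decimal and the column names are assumptions, not taken from the example), a specification that combines a rounding type with fix_zero might be:
int_col:int32 = int32_from_decimal [ceil, fix_zero] (dec_col)
This would round the source decimal toward positive infinity and treat an all-zero decimal as a valid zero.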
Advanced tab
This tab allows you to specify the following:
v Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the
input data is processed by the available nodes as specified in the Configuration file, and by any node
constraints specified on the Advanced tab. In Sequential mode the entire data set is processed by the
conductor node.
v Combinability mode. This is Auto by default, which allows WebSphere DataStage to combine the
operators that underlie parallel stages so that they run in the same process if it is sensible for this type
of stage.
v Preserve partitioning. This is Propagate by default. If you have an input data set, it adopts Set or
Clear from the previous stage. You can explicitly select Set or Clear. Select Set to request the next stage
should attempt to maintain the partitioning.
Input page
The Input page allows you to specify details about the incoming data set you are modifying. There is
only one input link.
The General tab allows you to specify an optional description of the link. The Partitioning tab allows you
to specify how incoming data on the source data set link is partitioned. The Columns tab specifies the
column definitions of incoming data. The Advanced tab allows you to change the default buffering
settings for the input link.
Details about Modify stage partitioning are given in the following section. See ″Stage Editors,″ for a
general description of the other tabs.
If the Modify stage is operating in sequential mode, it will first collect the data before writing it to the file
using the default auto collection method.
The Partitioning tab allows you to override this default behavior. The exact operation of this tab depends
on:
v Whether the Modify stage is set to execute in parallel or sequential mode.
v Whether the preceding stage in the job is set to execute in parallel or sequential mode.
If the Modify stage is set to execute in parallel, then you can set a partitioning method by selecting from
the Partition type drop-down list. This will override any current partitioning.
If the Modify stage is set to execute in sequential mode, but the preceding stage is executing in parallel,
then you can set a collection method from the Collector type drop-down list. This will override the
default auto collection method.
The Partitioning tab also allows you to specify that data arriving on the input link should be sorted
before the modify operation is performed. The sort is always carried out within data partitions. If the
stage is partitioning incoming data the sort occurs after the partitioning. If the stage is collecting data, the
sort occurs before the collection. The availability of sorting depends on the partitioning or collecting
method chosen (it is not available for the default auto methods).
If NLS is enabled an additional button opens a dialog box allowing you to select a locale specifying the
collate convention for the sort.
You can also specify sort direction, case sensitivity, whether sorted as ASCII or EBCDIC, and whether null
columns will appear first or last for each column. Where you are using a keyed partitioning method, you
can also specify whether the column is used as a key for sorting, for partitioning, or for both. Select the
column in the Selected list and right-click to invoke the shortcut menu.
Output page
See ″Stage Editors,″ for a general description of the output tabs.
The Filter stage transfers, unmodified, the records of the input data set which satisfy the specified
requirements and filters out all other records. You can specify different requirements to route rows down
different output links. The filtered out records can be routed to a reject link, if required.
When a record meets the requirements, it is written unchanged to the specified output link. The Where
property supports standard SQL expressions, except when comparing strings.
When quoting in the filter, you should use single, not double, inverted commas.
The input field can contain nulls. If it does, null values are less than all non-null values, unless you specify the operator's nulls last option.
Note: The conversion of numeric data types may result in a loss of range and cause incorrect results.
WebSphere DataStage displays a warning message to that effect when range is lost.
Order of association
As in SQL, expressions are associated left to right. AND and OR have the same precedence. You may
group fields and expressions in parentheses to affect the order of evaluation.
String comparison
WebSphere DataStage sorts string values according to these general rules:
v Characters are sorted in lexicographic order.
v Strings are evaluated by their ASCII value.
Examples
The following are some example Where properties.
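As an illustration (the column names and values here are assumed rather than taken from a sample job), a Where clause that routes high-value open orders down a particular output link might be:
ORDER_VALUE > 1000 AND STATUS = 'OPEN'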
You then select output link 2 in the dependent Output Link property. (You use the Link Ordering tab to
specify the number order of the output links).
In this example the stage only has one output link. You do not need to specify the Output Link property
because the stage will write to the output link by default.
If these conditions are met, the stage writes the row to the output link.
Must do’s
WebSphere DataStage has many defaults, which means that it can be very easy to include Filter stages in a job. This section specifies the minimum steps to take to get a Filter stage functioning. WebSphere DataStage provides a versatile user interface, and there are many shortcuts to achieving a particular end; this section describes the basic method, and you will learn where the shortcuts are as you become familiar with the product.
Stage page
The General tab allows you to specify an optional description of the stage. The Properties tab lets you
specify what the stage does. The Advanced tab allows you to specify how the stage executes. The Link
Ordering tab allows you to specify what order the output links are processed in. The NLS Locale tab
appears if you have NLS enabled on your system. It allows you to select a locale other than the project
default to determine collating rules.
Properties tab
The Properties tab allows you to specify properties which determine what the stage actually does. Some
of the properties are mandatory, although many have default settings. Properties without default settings
appear in the warning color (red by default) and turn black when you supply a value for them.
The following table gives a quick reference list of the properties and their attributes. A more detailed
description of each property follows.
Category/Property               Values                  Default    Mandatory?  Repeats?  Dependent of
Predicates/Where clause         string                  N/A        Y           Y         N/A
Predicates/Output link          Output link             N/A        Y           N         Where clause
Options/Output rejects          True/False              False      Y           N         N/A
Options/Output rows only once   True/False              False      Y           N         N/A
Options/Nulls value             Less Than/Greater Than  Less Than  N           N         N/A
Predicates category
Where clause
Specify a Where statement that a row must satisfy in order to be routed down this link. This is like an SQL Where clause; see ″Specifying the Filter″ for details.
Options category
Output rejects
Set this to true to output rows that satisfy no Where clauses down the reject link (remember to specify
which link is the reject link on the parallel job canvas).
Output rows only once
Set this to true to specify that rows are only output down the link of the first Where clause they satisfy. Set to false to have rows output down the links of all Where clauses that they satisfy.
Nulls value
Specify whether null values are treated as greater than or less than other values.
Advanced tab
This tab allows you to specify the following:
v Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the
input data is processed by the available nodes as specified in the Configuration file, and by any node
constraints specified on the Advanced tab. In Sequential mode the entire data set is processed by the
conductor node.
v Combinability mode. This is Auto by default, which allows WebSphere DataStage to combine the
operators that underlie parallel stages so that they run in the same process if it is sensible for this type
of stage.
v Preserve partitioning. This is Propagate by default. It adopts the setting of the previous stage. You can
explicitly select Set or Clear. Select Set to request the stage should attempt to maintain the
partitioning.
v Node pool and resource constraints. Select this option to constrain parallel execution to the node pool
or pools and/or resource pool or pools specified in the grid. The grid allows you to make choices from
drop down lists populated from the Configuration file.
v Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node
map. You can define a node map by typing node numbers into the text box or by clicking the browse
button to open the Available Nodes dialog box and selecting nodes from there. You are effectively
defining a new node pool for this stage (in addition to any node pools defined in the Configuration
file).
Input page
The Input page allows you to specify details about the data set being filtered. There is only one input link.
The General tab allows you to specify an optional description of the link. The Partitioning tab allows you to specify how incoming data on the source data set link is partitioned. The Columns tab specifies the column definitions of incoming data. The Advanced tab allows you to change the default buffering settings for the input link.
Details about Filter stage partitioning are given in the following section. See ″Stage Editors,″ for a general
description of the other tabs.
If the Filter stage is operating in sequential mode, it will first collect the data before writing it to the file
using the default auto collection method.
The Partitioning tab allows you to override this default behavior. The exact operation of this tab depends
on:
v Whether the Filter stage is set to execute in parallel or sequential mode.
v Whether the preceding stage in the job is set to execute in parallel or sequential mode.
If the Filter stage is set to execute in parallel, then you can set a partitioning method by selecting from
the Partition type drop-down list. This will override any current partitioning.
If the Filter stage is set to execute in sequential mode, but the preceding stage is executing in parallel,
then you can set a collection method from the Collector type drop-down list. This will override the
default auto collection method.
The Partitioning tab also allows you to specify that data arriving on the input link should be sorted
before the filter operation is performed. The sort is always carried out within data
partitions. If the stage is partitioning incoming data the sort occurs after the partitioning. If the stage is
collecting data, the sort occurs before the collection. The availability of sorting depends on the
partitioning or collecting method chosen (it is not available for the default auto methods).
If NLS is enabled an additional button opens a dialog box allowing you to select a locale specifying the
collate convention for the sort.
You can also specify sort direction, case sensitivity, whether sorted as ASCII or EBCDIC, and whether null
columns will appear first or last for each column. Where you are using a keyed partitioning method, you
can also specify whether the column is used as a key for sorting, for partitioning, or for both. Select the
column in the Selected list and right-click to invoke the shortcut menu.
Output page
The Output page allows you to specify details about data output from the Filter stage. The stage can
have any number of output links, plus one reject link, choose the one you want to work on from the
Output name drop down list.
The General tab allows you to specify an optional description of the output link. The Columns tab
specifies the column definitions of the data. The Mapping tab allows you to specify the relationship
between the columns being input to the Filter stage and the output columns. The Advanced tab allows
you to change the default buffering settings for the output links.
Details about Filter stage mapping are given in the following section. See ″Stage Editors,″ for a general description of the other tabs.
Mapping tab
For Filter stages the Mapping tab allows you to specify how the output columns are derived, i.e., what
filtered columns map onto them.
The left pane shows the filtered columns. These are read only and cannot be modified on this tab.
The right pane shows the output columns for each link. This has a Derivations field where you can specify how the column is derived. You can fill it in by dragging the filtered columns over, or by using the Auto-match facility.
The External Filter stage allows you to specify a UNIX command that acts as a filter on the data you are
processing. An example would be to use the stage to grep a data set for a certain string, or pattern, and
discard records which did not contain a match. This can be a quick and efficient way of filtering data.
Must do’s
WebSphere DataStage has many defaults, which means that it can be very easy to include External Filter stages in a job. This section specifies the minimum steps to take to get an External Filter stage functioning. WebSphere DataStage provides a versatile user interface, and there are many shortcuts to achieving a particular end; this section describes the basic method, and you will learn where the shortcuts are as you become familiar with the product.
Stage page
The General tab allows you to specify an optional description of the stage. The Properties tab lets you
specify what the stage does. The Advanced tab allows you to specify how the stage executes.
Properties tab
The Properties tab allows you to specify properties which determine what the stage actually does. Some
of the properties are mandatory, although many have default settings. Properties without default settings
appear in the warning color (red by default) and turn black when you supply a value for them.
The following table gives a quick reference list of the properties and their attributes. A more detailed
description of each property follows.
Options category
Filter command
Specifies the filter command line to be executed and any command line options it requires. For example:
grep
Arguments
Allows you to specify any arguments that the command line requires. For example:
\(cancel\).*\1
Together with the grep command, this would extract all records that contained the string ″cancel″ twice and discard all other records.
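As a further sketch (the search string here is illustrative), you could set Filter Command to:
grep -v
and Arguments to:
CANCELLED
This would discard every record containing the string CANCELLED and pass all other records through unchanged.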
Advanced tab
This tab allows you to specify the following:
v Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the
input data is processed by the available nodes as specified in the Configuration file, and by any node
constraints specified on the Advanced tab. In Sequential mode the entire data set is processed by the
conductor node.
v Combinability mode. This is Auto by default, which allows WebSphere DataStage to combine the
operators that underlie parallel stages so that they run in the same process if it is sensible for this type
of stage.
v Preserve partitioning. This is Propagate by default. It adopts the setting of the previous stage. You can
explicitly select Set or Clear. Select Set to request the next stage should attempt to maintain the
partitioning.
v Node pool and resource constraints. Select this option to constrain parallel execution to the node pool
or pools and/or resource pool or pools specified in the grid. The grid allows you to make choices from
drop down lists populated from the Configuration file.
v Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node
map. You can define a node map by typing node numbers into the text box or by clicking the browse
button to open the Available Nodes dialog box and selecting nodes from there. You are effectively
defining a new node pool for this stage (in addition to any node pools defined in the Configuration
file).
Input page
The Input page allows you to specify details about the data set being filtered. There is only one input
link.
The General tab allows you to specify an optional description of the link. The Partitioning tab allows you
to specify how incoming data on the source data set link is partitioned. The Columns tab specifies the
column definitions of incoming data. The Advanced tab allows you to change the default buffering
settings for the input link.
If the External Filter stage is operating in sequential mode, it will first collect the data before writing it to
the file using the default auto collection method.
The Partitioning tab allows you to override this default behavior. The exact operation of this tab depends
on:
v Whether the External Filter stage is set to execute in parallel or sequential mode.
v Whether the preceding stage in the job is set to execute in parallel or sequential mode.
If the External Filter stage is set to execute in parallel, then you can set a partitioning method by selecting
from the Partition type drop-down list. This will override any current partitioning.
If the External Filter stage is set to execute in sequential mode, but the preceding stage is executing in
parallel, then you can set a collection method from the Collector type drop-down list. This will override
the default auto collection method.
The Partitioning tab also allows you to specify that data arriving on the input link should be sorted
before the filter operation is performed. The sort is always carried out within data
partitions. If the stage is partitioning incoming data the sort occurs after the partitioning. If the stage is
collecting data, the sort occurs before the collection. The availability of sorting depends on the
partitioning or collecting method chosen (it is not available for the default auto methods).
If NLS is enabled an additional button opens a dialog box allowing you to select a locale specifying the
collate convention for the sort.
You can also specify sort direction, case sensitivity, whether sorted as ASCII or EBCDIC, and whether null
columns will appear first or last for each column. Where you are using a keyed partitioning method, you
can also specify whether the column is used as a key for sorting, for partitioning, or for both. Select the
column in the Selected list and right-click to invoke the shortcut menu.
Output page
The Output page allows you to specify details about data output from the External Filter stage. The stage
can only have one output link.
The General tab allows you to specify an optional description of the output link. The Columns tab
specifies the column definitions of the data. The Advanced tab allows you to change the default buffering
settings for the output link.
The Change Capture stage takes two input data sets, denoted before and after, and outputs a single data
set whose records represent the changes made to the before data set to obtain the after data set. The stage
produces a change data set, whose table definition is transferred from the after data set’s table definition
with the addition of one column: a change code with values encoding the four actions: insert, delete,
copy, and edit. The preserve-partitioning flag is set on the change data set.
The compare is based on a set of key columns; rows from the two data sets are assumed to be copies of one another if they have the same values in these key columns. You can also optionally specify
change values. If two rows have identical key columns, you can compare the value columns in the rows
to see if one is an edited copy of the other.
The stage assumes that the incoming data is key-partitioned and sorted in ascending order. The columns
the data is hashed on should be the key columns used for the data compare. You can achieve the sorting
and partitioning using the Sort stage or by using the built-in sorting and partitioning abilities of the
Change Capture stage.
You can use the companion Change Apply stage to combine the changes from the Change Capture stage with the original before data set to reproduce the after data set (see the chapter describing the Change Apply stage).
The Change Capture stage is very similar to the Difference stage described in Chapter 28, “Difference
stage,” on page 351.
This is the data set output by the Change Capture stage (bcol4 is the key column, bcol1 the value
column):
The change_code indicates that, in these three rows, the bcol1 column in the after data set has been
edited. The bcol1 column carries the edited value.
Stage page
The General tab allows you to specify an optional description of the stage. The Properties tab lets you
specify what the stage does. The Advanced tab allows you to specify how the stage executes. The Link
Ordering tab allows you to specify which input link carries the before data set and which the after data
set. The NLS Locale tab appears if you have NLS enabled on your system. It allows you to select a
locale other than the project default to determine collating rules.
Properties tab
The Properties tab allows you to specify properties which determine what the stage actually does. Some
of the properties are mandatory, although many have default settings. Properties without default settings
appear in the warning color (red by default) and turn black when you supply a value for them.
The following table gives a quick reference list of the properties and their attributes. A more detailed
description of each property follows.
Category/Property            Values                Default    Mandatory?  Repeats?  Dependent of
Change Keys/Key              Input Column          N/A        Y           Y         N/A
Change Keys/Case Sensitive   True/False            True       N           N         Key
Change Keys/Sort Order       Ascending/Descending  Ascending  N           N         Key
Change keys category
Key
Specifies the name of a difference key input column. This property can be repeated to specify multiple difference key input columns. You can use the Column Selection dialog box to select several keys at once if required. Key has the following dependent properties:
v Case Sensitive
Use this property to specify whether each key is case sensitive or not. It is set to True by default; for
example, the values ″CASE″ and ″case″ would not be judged equivalent.
v Sort Order
Specify ascending or descending sort order.
v Nulls Position
Specify whether null values should be placed first or last.
Value
Specifies the name of a value input column. You can use the Column Selection dialog box to select several values at once if required. Value has the following dependent properties:
v Case Sensitive
Use this property to specify whether each value is case sensitive or not. It is set to True by default; for example, the values ″CASE″ and ″case″ would not be judged equivalent.
Options category
Change mode
This mode determines how keys and values are specified. Choose Explicit Keys & Values to specify the
keys and values yourself. Choose All keys, Explicit values to specify that value columns must be defined,
but all other columns are key columns unless excluded. Choose Explicit Keys, All Values to specify that
key columns must be defined but all other columns are value columns unless they are excluded.
Log statistics
This property configures the stage to display result information containing the number of input rows and
the number of copy, delete, edit, and insert rows.
Drop output for insert
Specifies to drop (not generate) an output row for an insert result. By default, an output row is always created by the stage.
Drop output for delete
Specifies to drop (not generate) the output row for a delete result. By default, an output row is always created by the stage.
Drop output for edit
Specifies to drop (not generate) the output row for an edit result. By default, an output row is always created by the stage.
Drop output for copy
Specifies to drop (not generate) the output row for a copy result. By default, an output row is not created by the stage.
Code column name
Allows you to specify a different name for the output column carrying the change code generated for each record by the stage. By default the column is called change_code.
Copy code
Allows you to specify an alternative value for the code that indicates the after record is a copy of the
before record. By default this code is 0.
Delete code
Allows you to specify an alternative value for the code that indicates that a record in the before set has been deleted from the after set. By default this code is 2.
Edit code
Allows you to specify an alternative value for the code that indicates the after record is an edited version
of the before record. By default this code is 3.
Insert code
Allows you to specify an alternative value for the code that indicates a new record has been inserted in
the after set that did not exist in the before set. By default this code is 1.
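For example, with the default codes, a row that exists only in the after data set appears in the change data set with a change_code of 1, a row whose value columns differ between the before and after data sets appears with a change_code of 3, and, if you have chosen not to drop copy results, an unchanged row appears with a change_code of 0.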
Advanced tab
This tab allows you to specify the following:
v Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the
input data is processed by the available nodes as specified in the Configuration file, and by any node
constraints specified on the Advanced tab. In Sequential mode the entire data set is processed by the
conductor node.
v Combinability mode. This is Auto by default, which allows WebSphere DataStage to combine the
operators that underlie parallel stages so that they run in the same process if it is sensible for this type
of stage.
v Preserve partitioning. This is Propagate by default. It adopts Set or Clear from the previous stage. You can explicitly select Set or Clear. Select Set to request that the next stage in the job should attempt to maintain the partitioning.
v Node pool and resource constraints. Select this option to constrain parallel execution to the node pool
or pools and/or resource pool or pools specified in the grid. The grid allows you to make choices
from drop down lists populated from the Configuration file.
v Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node
map. You can define a node map by typing node numbers into the text box or by clicking the browse
button to open the Available Nodes dialog box and selecting nodes from there. You are effectively
defining a new node pool for this stage (in addition to any node pools defined in the Configuration
file).
By default the first link added will represent the before set. To rearrange the links, choose an input link
and click the up arrow button or the down arrow button.
The General tab allows you to specify an optional description of the input link. The Partitioning tab
allows you to specify how incoming data is partitioned before being compared. The Columns tab
specifies the column definitions of incoming data. The Advanced tab allows you to change the default
buffering settings for the input link.
Details about Change Capture stage partitioning are given in the following section. See ″Stage Editors,″
for a general description of the other tabs.
Partitioning tab
The Partitioning tab allows you to specify details about how the incoming data is partitioned or collected
before it is compared. It also allows you to specify that the data should be sorted before being operated
on.
By default the stage partitions in Auto mode. This attempts to work out the best partitioning method
depending on execution modes of current and preceding stages and how many nodes are specified in the
Configuration file. In the case of the Change Capture stage, WebSphere DataStage will determine if the
incoming data is key partitioned. If it is, the Same method is used; if not, WebSphere DataStage will hash partition the data and sort it. You could also explicitly choose Hash and take advantage of the on-stage sorting.
If the Change Capture stage is operating in sequential mode, it will first collect the data using the default
Auto collection method.
The Partitioning tab allows you to override this default behavior. The exact operation of this tab depends
on:
v Whether the Change Capture stage is set to execute in parallel or sequential mode.
v Whether the preceding stage in the job is set to execute in parallel or sequential mode.
If the Change Capture stage is set to execute in parallel, then you can set a partitioning method by
selecting from the Partition type drop-down list. This will override any current partitioning.
If the Change Capture stage is set to execute in sequential mode, but the preceding stage is executing in
parallel, then you can set a collection method from the Collector type drop-down list. This will override
the default collection method.
The Partitioning tab also allows you to specify that data arriving on the input link should be sorted
before being compared. The sort is always carried out within data partitions. If the stage is partitioning
incoming data the sort occurs after the partitioning. If the stage is collecting data, the sort occurs before
the collection. The availability of sorting depends on the partitioning or collecting method chosen (it is
not available for the default auto methods).
If NLS is enabled an additional button opens a dialog box allowing you to select a locale specifying the
collate convention for the sort.
You can also specify sort direction, case sensitivity, whether sorted as ASCII or EBCDIC, and whether null
columns will appear first or last for each column. Where you are using a keyed partitioning method, you
can also specify whether the column is used as a key for sorting, for partitioning, or for both. Select the
column in the Selected list and right-click to invoke the shortcut menu.
Output page
The Output page allows you to specify details about data output from the Change Capture stage. The
Change Capture stage can have only one output link.
The General tab allows you to specify an optional description of the output link. The Columns tab
specifies the column definitions of the data. The Mapping tab allows you to specify the relationship
between the columns being input to the Change Capture stage and the Output columns. The Advanced
tab allows you to change the default buffering settings for the output link.
Details about Change Capture stage mapping are given in the following section. See ″Stage Editors,″ for a
general description of the other tabs.
Mapping tab
For the Change Capture stage the Mapping tab allows you to specify how the output columns are
derived, i.e., what input columns map onto them and which column carries the change code data.
The left pane shows the columns from the before/after data sets plus the change code column. These are
read only and cannot be modified on this tab.
The right pane shows the output columns for each link. This has a Derivations field where you can
specify how the column is derived. You can fill it in by dragging input columns over, or by using the
Auto-match facility. By default the data set columns are mapped automatically. You need to ensure that
there is an output column to carry the change code and that this is mapped to the Change_code column.
The before input to Change Apply must have the same columns as the before input that was input to
Change Capture, and an automatic conversion must exist between the types of corresponding columns. In
addition, results are only guaranteed if the contents of the before input to Change Apply are identical (in
value and record order in each partition) to the before input that was fed to Change Capture, and if the
keys are unique.
Note: The change input to Change Apply must have been output from Change Capture without
modification. Because preserve-partitioning is set on the change output of Change Capture, you
will be warned at run time if the Change Apply stage does not have the same number of partitions
as the Change Capture stage. Additionally, both inputs of Change Apply are designated as
partitioned using the Same partitioning method.
The Change Apply stage reads a record from the change data set and from the before data set, compares
their key column values, and acts accordingly:
v If the before keys come before the change keys in the specified sort order, the before record is copied to
the output. The change record is retained for the next comparison.
v If the before keys are equal to the change keys, the behavior depends on the code in the change_code
column of the change record:
– Insert: The change record is copied to the output; the stage retains the same before record for the
next comparison. If key columns are not unique, and there is more than one consecutive insert with
the same key, then Change Apply applies all the consecutive inserts before existing records. This
record order may be different from the after data set given to Change Capture.
– Delete: The value columns of the before and change records are compared. If the value columns are
the same or if the Check Value Columns on Delete is specified as False, the change and before
records are both discarded; no record is transferred to the output. If the value columns are not the
same, the before record is copied to the output and the stage retains the same change record for the next comparison.
Note: If the before input of Change Apply is identical to the before input of Change Capture and
either the keys are unique or copy records are used, then the output of Change Apply is
identical to the after input of Change Capture. However, if the before input of Change Apply
is not the same (different record contents or ordering), or the keys are not unique and copy
records are not used, this is not detected and the rules described above are applied anyway,
producing a result that might or might not be useful.
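The following is a minimal illustrative sketch (in Python, not product code) of the comparison loop just described. It covers only the cases listed above (key ordering, insert, and delete); the function and parameter names (apply_changes, key, values) are invented for this example, and the edit and copy cases are deliberately left out.
def apply_changes(before, changes, key, values, check_value_columns_on_delete=True):
    # Merge a sorted before data set with a sorted change data set.
    INSERT, DELETE = 1, 2                        # stage default change codes
    out = []
    b_iter, c_iter = iter(before), iter(changes)
    brec, crec = next(b_iter, None), next(c_iter, None)
    while brec is not None and crec is not None:
        if key(brec) < key(crec):
            out.append(brec)                     # before record copied to output
            brec = next(b_iter, None)            # change record retained
        elif key(brec) == key(crec):
            code = crec["change_code"]
            if code == INSERT:
                out.append(crec)                 # change record copied to output
                crec = next(c_iter, None)        # before record retained
            elif code == DELETE:
                if not check_value_columns_on_delete or values(brec) == values(crec):
                    brec = next(b_iter, None)    # both records discarded
                    crec = next(c_iter, None)
                else:
                    out.append(brec)             # before copied, change retained
                    brec = next(b_iter, None)
            else:
                brec = next(b_iter, None)        # edit and copy handling omitted here
                crec = next(c_iter, None)
        else:
            crec = next(c_iter, None)            # case not covered by this sketch
    while brec is not None:                      # remaining before records pass through
        out.append(brec)
        brec = next(b_iter, None)
    return out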
Example Data
This example shows a before and change data set, and the data set that is output by the Change Apply stage when it has compared them:
This is the after data set, output by the Change Apply stage (bcol4 is the key column, bcol1 the value
column):
Must do’s
WebSphere DataStage has many defaults which means that it can be very easy to include Change Apply
stages in a job. This section specifies the minimum steps to take to get a Change Apply stage functioning.
WebSphere DataStage provides a versatile user interface, and there are many shortcuts to achieving a particular end. This section describes the basic method; you will learn where the shortcuts are when you get familiar with the product.
Properties tab
The Properties tab allows you to specify properties which determine what the stage actually does. Some
of the properties are mandatory, although many have default settings. Properties without default settings
appear in the warning color (red by default) and turn black when you supply a value for them.
The following table gives a quick reference list of the properties and their attributes. A more detailed
description of each property follows.
Category/Property                       Values                          Default                 Mandatory?  Repeats?  Dependent of
Change Keys/Key                         Input Column                    N/A                     Y           Y         N/A
Change Keys/Case Sensitive              True/False                      True                    N           N         Key
Change Keys/Sort Order                  Ascending/Descending            Ascending               N           N         Key
Change Keys/Nulls Position              First/Last                      First                   N           N         Key
Change Values/Value                     Input Column                    N/A                     N           Y         N/A
Change Values/Case Sensitive            True/False                      True                    N           N         Value
Options/Change Mode                     Explicit Keys & Values/         Explicit Keys & Values  Y           N         N/A
                                        All keys, Explicit values/
                                        Explicit Keys, All Values
Options/Log Statistics                  True/False                      False                   N           N         N/A
Options/Check Value Columns on Delete   True/False                      True                    Y           N         N/A
Options/Code Column Name                string                          change_code             N           N         N/A
Options/Copy Code                       number                          0                       N           N         N/A
Options/Deleted Code                    number                          2                       N           N         N/A
Options/Edit Code                       number                          3                       N           N         N/A
Options/Insert Code                     number                          1                       N           N         N/A
Specifies the name of a difference key input column. This property can be repeated to specify multiple
difference key input columns. You can use the Column Selection dialog box to select several keys at once
if required. Key has the following dependent properties:
v Case Sensitive
Use this property to specify whether each key is case sensitive or not. It is set to True by default; for
example, the values ″CASE″ and ″case″ would not be judged equivalent.
v Sort Order
Specify ascending or descending sort order.
v Nulls Position
Specify whether null values should be placed first or last.
Specifies the name of a value input column. You can use the Column Selection dialog box to select several values at once if required. Value has the following dependent properties:
v Case Sensitive
Use this property to specify whether each value is case sensitive or not. It is set to True by default;
for example, the values ″CASE″ and ″case″ would not be judged equivalent.
Options category
Change mode
This mode determines how keys and values are specified. Choose Explicit Keys & Values to specify the
keys and values yourself. Choose All keys, Explicit values to specify that value columns must be defined,
but all other columns are key columns unless excluded. Choose Explicit Keys, All Values to specify that
key columns must be defined but all other columns are value columns unless they are excluded.
Log statistics
This property configures the stage to display result information containing the number of input records
and the number of copy, delete, edit, and insert records.
Check value columns on delete
Specifies that WebSphere DataStage should not check value columns on deletes. Normally, Change Apply compares the value columns of delete change records to those in the before record to ensure that it is deleting the correct record.
Code column name
Allows you to specify that a different name has been used for the change data set column carrying the change code generated for each record by the stage. By default the column is called change_code.
Copy code
Allows you to specify an alternative value for the code that indicates a record copy. By default this code
is 0.
Deleted code
Allows you to specify an alternative value for the code that indicates a record delete. By default this code is 2.
Edit code
Allows you to specify an alternative value for the code that indicates a record edit. By default this code is
3.
Insert code
Allows you to specify an alternative value for the code that indicates a record insert. By default this code
is 1.
Advanced tab
This tab allows you to specify the following:
v Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the
input data is processed by the available nodes as specified in the Configuration file, and by any node
constraints specified on the Advanced tab. In Sequential mode the entire data set is processed by the
conductor node.
v Combinability mode. This is Auto by default, which allows WebSphere DataStage to combine the
operators that underlie parallel stages so that they run in the same process if it is sensible for this type
of stage.
v Preserve partitioning. This is Propagate by default. It adopts Set or Clear from the previous stage. You can explicitly select Set or Clear. Select Set to request that the next stage in the job should attempt to maintain the partitioning.
v Node pool and resource constraints. Select this option to constrain parallel execution to the node pool
or pools and/or resource pool or pools specified in the grid. The grid allows you to make choices from
drop down lists populated from the Configuration file.
v Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node
map. You can define a node map by typing node numbers into the text box or by clicking the browse
button to open the Available Nodes dialog box and selecting nodes from there. You are effectively
defining a new node pool for this stage (in addition to any node pools defined in the Configuration
file).
By default the first link added will represent the before set. To rearrange the links, choose an input link
and click the up arrow button or the down arrow button.
The General tab allows you to specify an optional description of the input link. The Partitioning tab
allows you to specify how incoming data is partitioned before being compared. The Columns tab
specifies the column definitions of incoming data. The Advanced tab allows you to change the default
buffering settings for the input link.
Details about Change Apply stage partitioning are given in the following section. See ″Stage Editors,″ for
a general description of the other tabs.
Partitioning tab
The change input to Change Apply should have been output from the Change Capture stage without
modification and should have the same number of partitions. Additionally, both inputs of Change Apply
are automatically designated as partitioned using the Same partitioning method.
The standard partitioning and collecting controls are available on the Change Apply stage, however, so
you can override this behavior.
If the Change Apply stage is operating in sequential mode, it will first collect the data using the default Auto collection method.
The Partitioning tab allows you to override the default behavior. The exact operation of this tab depends
on:
v Whether the Change Apply stage is set to execute in parallel or sequential mode.
v Whether the preceding stage in the job is set to execute in parallel or sequential mode.
If the Change Apply stage is set to execute in parallel, then you can set a partitioning method by
selecting from the Partition type drop-down list. This will override any current partitioning.
If the Change Apply stage is set to execute in sequential mode, but the preceding stage is executing in
parallel, then you can set a collection method from the Collector type drop-down list. This will override
the default auto collection method.
The Partitioning tab also allows you to specify that data arriving on the input link should be sorted
before the operation is performed. The sort is always carried out within data partitions. If the stage is
partitioning incoming data the sort occurs after the partitioning. If the stage is collecting data, the sort
occurs before the collection. The availability of sorting depends on the partitioning or collecting method
chosen (it is not available with the default auto methods).
If NLS is enabled an additional button opens a dialog box allowing you to select a locale specifying the
collate convention for the sort.
You can also specify sort direction, case sensitivity, whether sorted as ASCII or EBCDIC, and whether null
columns will appear first or last for each column. Where you are using a keyed partitioning method, you
can also specify whether the column is used as a key for sorting, for partitioning, or for both. Select the
column in the Selected list and right-click to invoke the shortcut menu.
Output page
The Output page allows you to specify details about data output from the Change Apply stage. The
Change Apply stage can have only one output link.
The General tab allows you to specify an optional description of the output link. The Columns tab
specifies the column definitions of the data. The Mapping tab allows you to specify the relationship
between the columns being input to the Change Apply stage and the Output columns. The Advanced tab
allows you to change the default buffering settings for the output link.
Details about Change Apply stage mapping are given in the following section. See ″Stage Editors,″ for a
general description of the other tabs.
Mapping tab
For the Change Apply stage the Mapping tab allows you to specify how the output columns are derived, i.e., what input columns map onto them or how they are generated.
The right pane shows the output columns for the output link. This has a Derivations field where you can
specify how the column is derived. You can fill it in by dragging input columns over, or by using the
Auto-match facility. By default the columns are mapped straight across.
The comparison is performed based on a set of difference key columns. Two records are copies of one
another if they have the same value for all difference keys. You can also optionally specify change values.
If two records have identical key columns, you can compare the value columns to see if one is an edited
copy of the other.
The Difference stage is similar, but not identical, to the Change Capture stage described in Chapter 26,
“Change Capture stage,” on page 331. The Change Capture stage is intended to be used in conjunction
with the Change Apply stage (Chapter 27, “Change Apply stage,” on page 341); it produces a change
data set which contains changes that need to be applied to the before data set to turn it into the after data
set. The Difference stage outputs the before and after rows to the output data set, plus a code indicating
if there are differences. Usually, the before and after data will have the same column names, in which
case the after data set effectively overwrites the before data set and so you only see one set of columns in
the output. You are warned that WebSphere DataStage is doing this. If your before and after data sets
have different column names, columns from both data sets are output; note that any key and value
columns must have the same name.
The stage generates an extra column, Diff, which indicates the result of each record comparison.
Example data
This example shows a before and after data set, and the data set that is output by the Difference stage
when it has compared them.
This is the before data set:
This is the data set output by the Difference stage (Key is the key column, the All non-key Columns are Values property is set to True, and all other settings take the default):
Must do’s
WebSphere DataStage has many defaults which means that it can be very easy to include Difference
stages in a job. This section specifies the minimum steps to take to get a Difference stage functioning.
WebSphere DataStage provides a versatile user interface, and there are many shortcuts to achieving a particular end. This section describes the basic method; you will learn where the shortcuts are when you get familiar with the product.
Stage page
The General tab allows you to specify an optional description of the stage. The Properties tab lets you
specify what the stage does. The Advanced tab allows you to specify how the stage executes. The Link
Ordering tab allows you to specify which input link carries the before data set and which the after data
set. The NLS Locale tab appears if you have NLS enabled on your system. It allows you to select a
locale other than the project default to determine collating rules.
Properties tab
The Properties tab allows you to specify properties which determine what the stage actually does. Some
of the properties are mandatory, although many have default settings. Properties without default settings
appear in the warning color (red by default) and turn black when you supply a value for them.
The following table gives a quick reference list of the properties and their attributes. A more detailed
description of each property follows.
Category/Property                Values        Default  Mandatory?  Repeats?  Dependent of
Difference Keys/Key              Input Column  N/A      Y           Y         N/A
Difference Keys/Case Sensitive   True/False    True     N           N         Key
Specifies the name of a difference key input column. This property can be repeated to specify multiple
difference key input columns. You can use the Column Selection dialog box to select several keys at once
if required. Key has this dependent property:
v Case Sensitive
Use this property to specify whether each key is case sensitive or not. It is set to True by default; for
example, the values ″CASE″ and ″case″ would not be judged equivalent.
All non-key columns are values
Set this to True to indicate that any columns not designated as difference key columns are value columns. It is False by default. The property has this dependent property:
v Case Sensitive
Use this property to specify whether each value is case sensitive or not. It is set to True by default;
for example, the values ″CASE″ and ″case″ would not be judged equivalent. This property is only
available if the All non-Key columns are values property is set to True.
Tolerate unsorted inputs
Specifies that the input data sets are not sorted. This property allows you to process groups of records that may be arranged by the difference key columns but not sorted. The stage processes the input records in the order in which they appear on its input. It is False by default.
Log statistics
This property configures the stage to display result information containing the number of input records
and the number of copy, delete, edit, and insert records. It is False by default.
Drop output for insert
Specifies to drop (not generate) an output record for an insert result. By default, an output record is always created by the stage.
Drop output for delete
Specifies to drop (not generate) the output record for a delete result. By default, an output record is always created by the stage.
Drop output for edit
Specifies to drop (not generate) the output record for an edit result. By default, an output record is always created by the stage.
Drop output for copy
Specifies to drop (not generate) the output record for a copy result. By default, an output record is always created by the stage.
Copy code
Allows you to specify an alternative value for the code that indicates the after record is a copy of the
before record. By default this code is 2.
Deleted code
Allows you to specify an alternative value for the code that indicates that a record in the before set has
been deleted from the after set. By default this code is 1.
Edit code
Allows you to specify an alternative value for the code that indicates the after record is an edited version
of the before record. By default this code is 3.
Insert code
Allows you to specify an alternative value for the code that indicates a new record has been inserted in
the after set that did not exist in the before set. By default this code is 0.
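For quick reference, these Difference stage defaults could be summarized as follows (an illustrative Python sketch, not product syntax); note that they differ from the Change Capture stage defaults:
# Default DiffCode values produced by the Difference stage,
# as described in the preceding property descriptions.
DIFF_CODES = {"insert": 0, "delete": 1, "copy": 2, "edit": 3}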
Advanced tab
This tab allows you to specify the following:
By default the first link added will represent the before set. To rearrange the links, choose an input link
and click the up arrow button or the down arrow button.
Input page
The Input page allows you to specify details about the incoming data sets. The Difference stage expects
two incoming data sets: a before data set and an after data set.
The General tab allows you to specify an optional description of the input link. The Partitioning tab
allows you to specify how incoming data is partitioned before being compared. The Columns tab
specifies the column definitions of incoming data. The Advanced tab allows you to change the default
buffering settings for the input link.
Details about Difference stage partitioning are given in the following section. See ″Stage Editors,″ for a
general description of the other tabs.
Partitioning tab
The Partitioning tab allows you to specify details about how the incoming data is partitioned or collected
before the operation is performed. It also allows you to specify that the data should be sorted before
being operated on.
If the Difference stage is operating in sequential mode, it will first collect the data using the default Auto
collection method.
The Partitioning tab allows you to override this default behavior. The exact operation of this tab depends
on:
v Whether the Difference stage is set to execute in parallel or sequential mode.
v Whether the preceding stage in the job is set to execute in parallel or sequential mode.
If the Difference stage is set to execute in parallel, then you can set a partitioning method by selecting
from the Partition type drop-down list. This will override any current partitioning.
If the Difference stage is set to execute in sequential mode, but the preceding stage is executing in
parallel, then you can set a collection method from the Collector type drop-down list. This will override
the default collection method.
The Partitioning tab also allows you to specify that data arriving on the input link should be sorted
before the operation is performed. The sort is always carried out within data partitions. If the stage is
partitioning incoming data the sort occurs after the partitioning. If the stage is collecting data, the sort
occurs before the collection. The availability of sorting depends on the partitioning or collecting method
chosen (it is not available for the default auto methods).
If NLS is enabled an additional button opens a dialog box allowing you to select a locale specifying the
collate convention for the sort.
You can also specify sort direction, case sensitivity, whether sorted as ASCII or EBCDIC, and whether null
columns will appear first or last for each column. Where you are using a keyed partitioning method, you
can also specify whether the column is used as a key for sorting, for partitioning, or for both. Select the
column in the Selected list and right-click to invoke the shortcut menu.
Output page
The Output page allows you to specify details about data output from the Difference stage. The
Difference stage can have only one output link.
The General tab allows you to specify an optional description of the output link. The Columns tab
specifies the column definitions of the data. The Mapping tab allows you to specify the relationship
between the columns being input to the Difference stage and the Output columns. The Advanced tab
allows you to change the default buffering settings for the output link.
Details about Difference stage mapping are given in the following section. See ″Stage Editors,″ for a
general description of the other tabs.
Mapping tab
For the Difference stage the Mapping tab allows you to specify how the output columns are derived, i.e.,
what input columns map onto them or how they are generated.
The left pane shows the columns from the before/after data sets plus the DiffCode column. These are
read only and cannot be modified on this tab.
The right pane shows the output columns for each link. This has a Derivations field where you can
specify how the column is derived. You can fill it in by dragging input columns over, or by using the
Auto-match facility. By default the data set columns are mapped automatically. You need to ensure that
there is an output column to carry the change code and that this is mapped to the DiffCode column.
The Compare stage performs a column-by-column comparison of records in two presorted input data
sets. You can restrict the comparison to specified key columns.
The Compare stage does not change the table definition, partitioning, or content of the records in either
input data set. It transfers both data sets intact to a single output data set generated by the stage. The
comparison results are also recorded in the output data set.
We recommend that you use runtime column propagation in this stage and allow WebSphere DataStage
to define the output column schema for you. The stage outputs a data set with three columns:
v result. Carries the code giving the result of the comparison.
v first. A subrecord containing the columns of the first input link.
v second. A subrecord containing the columns of the second input link.
If you specify the output link meta data yourself, you should use fully qualified names for the column
definitions (e.g. first.col1, second.col1 etc.), because WebSphere DataStage will not let you specify two lots
of identical column names.
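As an informal illustration (not product output), one row of such an output data set could be pictured as follows; the column names inside the subrecords are taken from the example below and are otherwise arbitrary:
# Shape of one output row when runtime column propagation defines the schema.
output_row = {
    "result": 0,                                 # comparison result code
    "first":  {"bcol1": "...", "bcol4": "..."},  # subrecord: columns from the first input link
    "second": {"bcol1": "...", "bcol4": "..."},  # subrecord: columns from the second input link
}
# Referring to a column explicitly therefore uses a qualified name such as first.bcol1 or second.bcol1.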
Example Data
This example shows two data sets being compared, and the data set that is output by the Compare stage
when it has compared them.
The stage compares on the Key columns bcol1 and bcol4. This is the output data set:
Stage page
The General tab allows you to specify an optional description of the stage. The Properties tab lets you
specify what the stage does. The Advanced tab allows you to specify how the stage executes. The NLS
Locale tab appears if you have NLS enabled on your system. It allows you to select a locale other than
the project default to determine collating rules.
Properties tab
The Properties tab allows you to specify properties which determine what the stage actually does. Some
of the properties are mandatory, although many have default settings. Properties without default settings
appear in the warning color (red by default) and turn black when you supply a value for them.
The following table gives a quick reference list of the properties and their attributes. A more detailed
description of each property follows.
Category/Property                      Values        Default  Mandatory?  Repeats?  Dependent of
Options/Abort On Difference            True/False    False    Y           N         N/A
Options/Warn on Record Count Mismatch  True/False    False    Y           N         N/A
Options/`Equals’ Value                 number        0        N           N         N/A
Options/`First is Empty’ Value         number        1        N           N         N/A
Options/`Greater Than’ Value           number        2        N           N         N/A
Options/`Less Than’ Value              number        -1       N           N         N/A
Options/`Second is Empty’ Value        number        -2       N           N         N/A
Options/Key                            Input Column  N/A      N           Y         N/A
Options/Case Sensitive                 True/False    True     N           N         Key
Abort on difference
This property forces the stage to abort its operation each time a difference is encountered between two corresponding columns in any record of the two input data sets. This is False by default; if you set it to True, you cannot set Warn on Record Count Mismatch.
Warn on record count mismatch
This property directs the stage to output a warning message when a comparison is aborted due to a mismatch in the number of records in the two input data sets. This is False by default; if you set it to True, you cannot set Abort On Difference.
`Equals’ value
Allows you to set an alternative value for the code which the stage outputs to indicate two compared
records are equal. This is 0 by default.
`First is Empty’ value
Allows you to set an alternative value for the code which the stage outputs to indicate the first record is empty. This is 1 by default.
`Greater Than’ value
Allows you to set an alternative value for the code which the stage outputs to indicate the first record is greater than the other. This is 2 by default.
`Less Than’ value
Allows you to set an alternative value for the code which the stage outputs to indicate the second record is greater than the other. This is -1 by default.
`Second is Empty’ value
Allows you to set an alternative value for the code which the stage outputs to indicate the second record is empty. This is -2 by default.
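Taken together, the default codes listed above can be illustrated with the following sketch (Python, not product code; compare_result and keys are invented names):
EQUALS, FIRST_IS_EMPTY, GREATER_THAN, LESS_THAN, SECOND_IS_EMPTY = 0, 1, 2, -1, -2
def compare_result(first, second, keys):
    # first and second each represent one record (as a dict of column values),
    # or None when the corresponding input has no record to offer.
    if first is None:
        return FIRST_IS_EMPTY
    if second is None:
        return SECOND_IS_EMPTY
    a = tuple(first[k] for k in keys)
    b = tuple(second[k] for k in keys)
    if a == b:
        return EQUALS
    return GREATER_THAN if a > b else LESS_THAN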
Key
Allows you to specify one or more key columns. Only these columns will be compared. Repeat the
property to specify multiple columns. You can use the Column Selection dialog box to select several keys
at once if required. The Key property has a dependent property:
v Case Sensitive
Use this to specify whether each key is case sensitive or not. It is set to True by default; for example, the values ″CASE″ and ″case″ would end up in different groups.
Advanced tab
This tab allows you to specify the following:
v Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the
input data is processed by the available nodes as specified in the Configuration file, and by any node
constraints specified on the Advanced tab. In Sequential mode the entire data set is processed by the
conductor node.
By default the first link added will represent the First set. To rearrange the links, choose an input link and
click the up arrow button or the down arrow button.
Input page
The Input page allows you to specify details about the incoming data sets. The Compare stage expects
two incoming data sets.
The General tab allows you to specify an optional description of the input link. The Partitioning tab
allows you to specify how incoming data is partitioned before being compared. The Columns tab
specifies the column definitions of incoming data. The Advanced tab allows you to change the default
buffering settings for the input link.
Details about Compare stage partitioning are given in the following section. See ″Stage Editors,″ for a
general description of the other tabs.
Partitioning tab
If you are running the Compare stage in parallel you must ensure that the incoming data is suitably
partitioned and sorted to make a comparison sensible.
The Partitioning tab allows you to specify details about how the incoming data is partitioned or collected
before it is compared. It also allows you to specify that the data should be sorted before being operated
on.
If the Compare stage is operating in sequential mode, it will first collect the data using the default Auto
collection method.
The Partitioning tab allows you to override this default behavior. The exact operation of this tab depends
on:
v Whether the Compare stage is set to execute in parallel or sequential mode.
v Whether the preceding stage in the job is set to execute in parallel or sequential mode.
If the Compare stage is set to execute in parallel, then you can set a partitioning method by selecting
from the Partition type drop-down list. This will override any current partitioning.
If the Compare stage is set to execute in sequential mode, but the preceding stage is executing in parallel,
then you can set a collection method from the Collector type drop-down list. This will override the
default collection method.
If you are collecting data, the Partitioning tab also allows you to specify that data arriving on the input link should be sorted before being collected and compared. The sort is always carried out within data partitions.
If NLS is enabled an additional button opens a dialog box allowing you to select a locale specifying the
collate convention for the sort.
You can also specify sort direction, case sensitivity, whether sorted as ASCII or EBCDIC, and whether null
columns will appear first or last for each column. Where you are using a keyed partitioning method, you
can also specify whether the column is used as a key for sorting, for partitioning, or for both. Select the
column in the Selected list and right-click to invoke the shortcut menu.
Output page
The Output page allows you to specify details about data output from the Compare stage. The Compare
stage can have only one output link.
The General tab allows you to specify an optional description of the output link. The Columns tab
specifies the column definitions of the data. The Advanced tab allows you to change the default buffering
settings for the output link.
An encoded data set is similar to an ordinary one, and can be written to a data set stage. You cannot use
an encoded data set as an input to stages that perform column-based processing or reorder rows, but
you can input it to stages such as Copy. You can view information about the data set in the data set
viewer, but not the data itself. You cannot repartition an encoded data set, and you will be warned at
runtime if your job attempts to do that.
As the output is always a single stream, you do not have to define meta data for the output link.
Must do’s
WebSphere DataStage has many defaults which means that it can be very easy to include Encode stages
in a job. This section specifies the minimum steps to take to get an Encode stage functioning. WebSphere
DataStage provides a versatile user interface, and there are many shortcuts to achieving a particular end. This section describes the basic method; you will learn where the shortcuts are when you get familiar with the product.
Stage page
The General tab allows you to specify an optional description of the stage. The Properties tab lets you
specify what the stage does. The Advanced tab allows you to specify how the stage executes.
Properties tab
Category/Property      Values        Default  Mandatory?  Repeats?  Dependent of
Options/Command Line   Command Line  N/A      Y           N         N/A
Options category
Command line
Specifies the command line used for encoding the data set. The command line must configure the UNIX
command to accept input from standard input and write its results to standard output. The command
must be located in your search path and be accessible by every processing node on which the Encode
stage executes.
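For example (an illustrative choice rather than a requirement of the stage), a compression utility such as gzip could be specified as the command line, because gzip reads from standard input and writes to standard output when it is given no file arguments; the resulting data set would then be decoded by a Decode stage whose command line reverses the operation.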
Advanced tab
This tab allows you to specify the following:
v Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the
input data is processed by the available nodes as specified in the Configuration file, and by any node
constraints specified on the Advanced tab. In Sequential mode the entire data set is processed by the
conductor node.
v Combinability mode. This is Auto by default, which allows WebSphere DataStage to combine the
operators that underlie parallel stages so that they run in the same process if it is sensible for this type
of stage.
v Preserve partitioning. This is Set by default to request that the next stage in the job should attempt to maintain the partitioning.
v Node pool and resource constraints. Select this option to constrain parallel execution to the node pool
or pools and/or resource pool or pools specified in the grid. The grid allows you to make choices from
drop down lists populated from the Configuration file.
v Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node
map. You can define a node map by typing node numbers into the text box or by clicking the browse
button to open the Available Nodes dialog box and selecting nodes from there. You are effectively
defining a new node pool for this stage (in addition to any node pools defined in the Configuration
file).
Input page
The Input page allows you to specify details about the incoming data sets. The Encode stage can only
have one input link.
The General tab allows you to specify an optional description of the input link. The Partitioning tab
allows you to specify how incoming data is partitioned before being encoded. The Columns tab specifies
the column definitions of incoming data. The Advanced tab allows you to change the default buffering
settings for the input link.
Details about Encode stage partitioning are given in the following section. See ″Stage Editors,″ for a
general description of the other tabs.
Partitioning tab
By default the stage partitions in Auto mode. This attempts to work out the best partitioning method depending on execution modes of current and preceding stages and how many nodes are specified in the Configuration file.
If the Encode stage is operating in sequential mode, it will first collect the data using the default Auto
collection method.
The Partitioning tab allows you to override this default behavior. The exact operation of this tab depends
on:
v Whether the Encode stage is set to execute in parallel or sequential mode.
v Whether the preceding stage in the job is set to execute in parallel or sequential mode.
If the Encode stage is set to execute in parallel, then you can set a partitioning method by selecting from
the Partition type drop-down list. This will override any current partitioning.
If the Encode stage is set to execute in sequential mode, but the preceding stage is executing in parallel,
then you can set a collection method from the Collector type drop-down list. This will override the
default collection method.
If NLS is enabled an additional button opens a dialog box allowing you to select a locale specifying the
collate convention for the sort.
You can also specify sort direction, case sensitivity, whether sorted as ASCII or EBCDIC, and whether null
columns will appear first or last for each column. Where you are using a keyed partitioning method, you
can also specify whether the column is used as a key for sorting, for partitioning, or for both. Select the
column in the Selected list and right-click to invoke the shortcut menu.
Output page
The Output page allows you to specify details about data output from the Encode stage. The Encode
stage can have only one output link.
The General tab allows you to specify an optional description of the output link. The Columns tab allows
you to specify column definitions for the data (although this is optional for an encode stage). The
Advanced tab allows you to change the default buffering settings for the output link.
As the input is always a single stream, you do not have to define meta data for the input link.
Must do’s
WebSphere DataStage has many defaults which means that it can be very easy to include Decode stages
in a job. This section specifies the minimum steps to take to get a Decode stage functioning. WebSphere
DataStage provides a versatile user interface, and there are many shortcuts to achieving a particular end. This section describes the basic method; you will learn where the shortcuts are when you get familiar with the product.
Stage page
The General tab allows you to specify an optional description of the stage. The Properties tab lets you
specify what the stage does. The Advanced tab allows you to specify how the stage executes.
Properties tab
The Properties tab allows you to specify properties which determine what the stage actually does. This
stage only has one property and you must supply a value for this. The property appears in the warning
color (red by default) until you supply a value.
Options category
Command line
Specifies the command line used for decoding the data set. The command line must configure the UNIX
command to accept input from standard input and write its results to standard output. The command
must be located in the search path of your application and be accessible by every processing node on
which the Decode stage executes.
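For example (again an illustrative choice rather than a requirement of the stage), if the data set was encoded using gzip, specifying gzip -d here would decode it, because gzip -d reads compressed data from standard input and writes the uncompressed result to standard output when given no file arguments.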
Advanced tab
This tab allows you to specify the following:
v Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the
input data is processed by the available nodes as specified in the Configuration file, and by any node
constraints specified on the Advanced tab. In Sequential mode the entire data set is processed by the
conductor node.
v Combinability mode. This is Auto by default, which allows WebSphere DataStage to combine the
operators that underlie parallel stages so that they run in the same process if it is sensible for this type
of stage.
v Preserve partitioning. This is Propagate by default. It adopts Set or Clear from the previous stage. You can explicitly select Set or Clear. Select Set to request that the next stage in the job should attempt to maintain the partitioning.
v Node pool and resource constraints. Select this option to constrain parallel execution to the node pool
or pools and/or resource pool or pools specified in the grid. The grid allows you to make choices from
drop down lists populated from the Configuration file.
v Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node
map. You can define a node map by typing node numbers into the text box or by clicking the browse
button to open the Available Nodes dialog box and selecting nodes from there. You are effectively
defining a new node pool for this stage (in addition to any node pools defined in the Configuration
file).
Input page
The Input page allows you to specify details about the incoming data sets. The Decode stage expects a
single incoming data set.
The General tab allows you to specify an optional description of the input link. The Partitioning tab
allows you to specify how incoming data is partitioned before being decoded. The Columns tab specifies
the column definitions of incoming data. The Advanced tab allows you to change the default buffering
settings for the input link.
Details about Decode stage partitioning are given in the following section. See ″Stage Editors,″ for a general description of the other tabs.
Partitioning tab
The Partitioning tab allows you to specify details about how the incoming data is partitioned or collected
before it is decoded. It also allows you to specify that the data should be sorted before being operated on.
If the Decode stage is set to execute in sequential mode, but the preceding stage is executing in parallel,
then you can set a collection method from the Collector type drop-down list. This will override the
default collection method.
Output page
The Output page allows you to specify details about data output from the Decode stage. The Decode
stage can have only one output link.
The General tab allows you to specify an optional description of the output link. The Columns tab
specifies the column definitions for the decoded data.
The Switch stage takes a single data set as input and assigns each input row to an output data set based
on the value of a selector field. The Switch stage performs an operation analogous to a C switch
statement, which causes the flow of control in a C program to branch to one of several cases based on the
value of a selector variable. Rows that satisfy none of the cases are output on the rejects link.
Example
The example Switch stage implements the following switch statement:
switch (selector)
{
case 0: // if selector = 0,
// write record to output data set 0
break;
case 10: // if selector = 10,
// write record to output data set 1
break;
case 12: // if selector = discard value (12)
// skip record
break;
default: // if the selector value matches no case,
// the row is sent down the reject link
};
The column called Select is the selector; the value of this determines which output links the rest of the
row will be output to. The properties of the stage are:
Must do’s
WebSphere DataStage has many defaults which means that it can be very easy to include Switch stages in
a job. This section specifies the minimum steps to take to get a Switch stage functioning. WebSphere
DataStage provides a versatile user interface, and there are many shortcuts to achieving a particular end. This section describes the basic method; you will learn where the shortcuts are when you get familiar with the product.
Stage page
The General tab allows you to specify an optional description of the stage. The Properties tab lets you
specify what the stage does. The Advanced tab allows you to specify how the stage executes. The Link
Ordering tab allows you to specify what order the output links are processed in. The NLS Locale tab
appears if you have NLS enabled on your system. It allows you to select a locale other than the project
default to determine collating rules.
Properties tab
The Properties tab allows you to specify properties which determine what the stage actually does. Some
of the properties are mandatory, although many have default settings. Properties without default settings
appear in the warning color (red by default) and turn black when you supply a value for them.
The following table gives a quick reference list of the properties and their attributes. A more detailed
description of each property follows.
Category/Property           Values                           Default               Mandatory?                        Repeats?  Dependent of
Input/Selector              Input Column                     N/A                   Y                                 N         N/A
Input/Case Sensitive        True/False                       True                  N                                 N         Selector
Input/Selector Mode         User-defined mapping/Auto/Hash   User-defined mapping  Y                                 N         N/A
User-defined Mapping/Case   String                           N/A                   Y (if Selector Mode =             Y         N/A
                                                                                   User-defined mapping)
Options/If not found        Fail/Drop/Output                 Fail                  Y (if Selector Mode is not Hash)  N         N/A
Input category
Selector
Case sensitive
Selector mode
Specifies how you are going to define the case statements for the switch. Choose between:
v User-defined Mapping. This is the default, and means that you must provide explicit mappings from
case values to outputs. If you use this mode you specify the switch expression under the User-defined
Mapping category.
v Auto. This can be used where there are as many distinct selector values as there are output links.
v Hash. The incoming rows are hashed on the selector column modulo the number of output links and
assigned to an output link accordingly.
This property appears if you have chosen a Selector Mode of User-defined Mapping. Specify the case
expression in the case property. It has the following format:
Selector_Value[= Output_Link_Label_Number]
You must specify a selector value for each value of the input column that you want to direct to an output
column. Repeat the Case property to specify multiple values. You can omit the output link label if the
value is intended for the same output link as the case previously specified. For example, the case
statements:
1990=0
1991
1992
1993=1
1994=1
would cause the rows containing the dates 1990, 1991, or 1992 in the selector column to be routed to
output link 0, and the rows containing the dates 1993 to 1994 to be routed to output link 1.
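As an informal illustration (Python, not product code) of this carry-forward rule, the following sketch shows how such a list of case statements could be read into a mapping from selector values to output links; parse_cases is an invented name:
def parse_cases(case_lines):
    # A case value with no "=link" part is routed to the same output link
    # as the previous case statement.
    mapping = {}
    current_link = None
    for line in case_lines:
        if "=" in line:
            value, link = line.split("=", 1)
            current_link = int(link)
        else:
            value = line
        mapping[value.strip()] = current_link
    return mapping
# For the example above, parse_cases(["1990=0", "1991", "1992", "1993=1", "1994=1"])
# returns {"1990": 0, "1991": 0, "1992": 0, "1993": 1, "1994": 1}.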
Options category
If not found
Specifies the action to take if a row fails to match any of the case statements. This does not appear if you
choose a Selector Mode of Hash. Otherwise, choose between:
v Fail. Causes the job to fail.
v Drop. Drops the row.
v Output. Routes the row to the Reject link.
Discard value
You can use this property in conjunction with the case property to specify that rows containing certain
values in the selector column will always be discarded. For example, if you defined the following case
statement:
1995=5
and set the Discard Value property to 5, all rows containing 1995 in the selector column would be routed
to link 5 which has been specified as the discard link and so will be dropped.
Advanced tab
This tab allows you to specify the following:
v Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the
input data is processed by the available nodes as specified in the Configuration file, and by any node
constraints specified on the Advanced tab. In Sequential mode the entire data set is processed by the
conductor node.
v Combinability mode. This is Auto by default, which allows WebSphere DataStage to combine the
operators that underlie parallel stages so that they run in the same process if it is sensible for this type
of stage.
v Preserve partitioning. This is Propagate by default. It adopts Set or Clear from the previous stage. You can explicitly select Set or Clear. Select Set to request that the next stage in the job should attempt to maintain the partitioning.
v Node pool and resource constraints. Select this option to constrain parallel execution to the node pool
or pools and/or resource pool or pools specified in the grid. The grid allows you to make choices from
drop down lists populated from the Configuration file.
v Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node
map. You can define a node map by typing node numbers into the text box or by clicking the browse
button to open the Available Nodes dialog box and selecting nodes from there. You are effectively
defining a new node pool for this stage (in addition to any node pools defined in the Configuration
file).
Input page
The Input page allows you to specify details about the incoming data sets. The Switch stage expects one
incoming data set.
Details about Switch stage partitioning are given in the following section. See ″Stage Editors,″ for a
general description of the other tabs.
Partitioning tab
The Partitioning tab allows you to specify details about how the incoming data is partitioned or collected
before it is switched. It also allows you to specify that the data should be sorted before being operated
on.
By default the stage partitions in Auto mode. This attempts to work out the best partitioning method
depending on execution modes of current and preceding stages and how many nodes are specified in the
Configuration file.
If the Switch stage is operating in sequential mode, it will first collect the data using the default
Auto collection method.
The Partitioning tab allows you to override this default behavior. The exact operation of this tab depends
on:
v Whether the Switch stage is set to execute in parallel or sequential mode.
v Whether the preceding stage in the job is set to execute in parallel or sequential mode.
If the Switch stage is set to execute in parallel, then you can set a partitioning method by selecting from
the Partition type drop-down list. This will override any current partitioning.
If the Switch stage is set to execute in sequential mode, but the preceding stage is executing in parallel,
then you can set a collection method from the Collector type drop-down list. This will override the
default collection method.
The Partitioning tab also allows you to specify that data arriving on the input link should be sorted
before being imported. The sort is always carried out within data partitions. If the stage is partitioning
incoming data the sort occurs after the partitioning. If the stage is collecting data, the sort occurs before
the collection. The availability of sorting depends on the partitioning or collecting method chosen (it is
not available with the default Auto methods).
If NLS is enabled an additional button opens a dialog box allowing you to select a locale specifying the
collate convention for the sort.
You can also specify sort direction, case sensitivity, whether sorted as ASCII or EBCDIC, and whether null
columns will appear first or last for each column. Where you are using a keyed partitioning method, you
can also specify whether the column is used as a key for sorting, for partitioning, or for both. Select the
column in the Selected list and right-click to invoke the shortcut menu.
Output page
The Output page allows you to specify details about data output from the Switch stage. The Switch stage
can have up to 128 output links, and can also have a reject link carrying rows that have been rejected.
Choose the link you are working on from the Output name drop-down list.
The General tab allows you to specify an optional description of the output link. The Columns tab
specifies the column definitions of the data. The Mapping tab allows you to specify the relationship
between the columns being input to the Switch stage and the Output columns. The Advanced tab allows
you to change the default buffering settings for the output links.
Details about Switch stage mapping are given in the following section. See ″Stage Editors,″ for a general
description of the other tabs.
Mapping tab
For the Switch stage the Mapping tab allows you to specify how the output columns are derived.
The left pane shows the columns that have been switched. These are read only and cannot be modified
on this tab.
The right pane shows the output columns for each link.
When you edit an FTP Enterprise stage, the FTP stage editor appears. This is based on the generic stage
editor.
The stage editor has up to three pages, depending on whether you are accessing or transferring a file:
v Stage Page. This is always present and is used to specify general information about the stage.
v Input Link. This is present when you are transferring files to a remote host. Specify details about the
input link here.
v Output Link. This is present when you are accessing files from a remote host. Specify details about the
output link here.
Restartability
You can specify that the FTP operation runs in restartable mode. To do this you set the stage properties
as follows:
1. Specify a restartable mode of restartable transfer.
2. Specify a unique job id for the transfer
3. Optionally specify a checkpoint directory for the transfer directory (if you do not specify a checkpoint
directory, the current working directory is used)
If the FTP operation does not succeed, you can rerun the same job with the restartable mode set to restart
transfer or abandon transfer. For a production environment you could build a job sequence that
performed the transfer, then tested whether it was successful. If it was not, another job in the sequence
could use an FTP stage with the restart transfer option to attempt the transfer again using the
information in the restart directory.
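For example (the property values here are purely illustrative), a first job could run with Restartable Mode
set to Restartable Transfer, a Job Id of 100, and a checkpoint directory of /home/bgamsworth/checkpoint. If
the transfer fails, an identical job with Restartable Mode set to Restart Transfer resumes the transfer using
the information in /home/bgamsworth/checkpoint/pftp_jobid_100; if it fails repeatedly, a third identical job
with Restartable Mode set to Abandon Transfer deletes that restart directory.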
For get operations, WebSphere DataStage reinitiates the FTP transfer at the file boundary. The transfer of
any file that failed partway through is restarted from the beginning (the zero file location). The file URIs
that were transferred completely are not transferred again. Subsequently, the downloaded URIs are
imported to the data set from the temporary folder path.
If the operation repeatedly fails, you can use the abandon transfer option to abandon the transfer and
clear the temporary restart directory.
Must Do’s
WebSphere DataStage has many defaults, which means that it is easy to include FTP Enterprise stages in
a job. This section specifies the minimum steps required to get an FTP stage functioning. WebSphere
DataStage provides a versatile user interface, with many shortcuts available to achieve a particular end.
This section describes the basic method. You will learn where the shortcuts are when you get familiar
with the product.
The steps required depend on whether you are using the FTP Enterprise stage to access or transfer a file.
Accessing Data from a Remote Host Using the FTP Enterprise Stage
In the Output page Properties tab:
1. Specify the URI that the file will be taken from via a get operation. You can specify multiple URIs if
required. You can also use a wildcard character to retrieve multiple files. You can specify an absolute
or relative pathname, see ″URI″ for details of syntax.
2. Ensure column meta data has been specified for the files. (You can achieve this via a schema file if
required.)
Stage Page
The General tab allows you to specify an optional description of the stage. The Properties tab lets you
specify what the stage does. The Advanced tab allows you to specify how the stage executes. Use the
NLS Map (National Language Support) tab to define a character set map for the FTP Enterprise stage.
Input Page
The Input page allows you to specify details about how the stage transfers files to a remote host using
FTP. The FTP Enterprise stage can have only one input link, but this can write to multiple files.
The General tab allows you to specify an optional description of the input link. The Properties tab allows
you to specify details of exactly what the link does. The Partitioning tab allows you to specify how
incoming data is partitioned before being written to the file or files. The Columns tab specifies the
column definitions of data being written. The Advanced tab allows you to change the default buffering
settings for the input link.
Details about FTP Enterprise stage properties and partitioning are given in the following sections. See
″Stage Editors,″ for a general description of the other tabs.
Properties Tab
Use the Properties tab to specify properties, which determine what the stage actually does. The available
properties are displayed in a tree structure. They are divided into categories to help you find your way
around them. All the mandatory properties are included in the tree by default and cannot be removed.
Required properties (that is, those that do not have a default value) are shown in a warning color (red by
default) but change to black when you have set a value.
Category/Property                        Values                                            Default               Mandatory?  Repeats?  Dependent of
Target/URI                               Pathname                                          N/A                   Y           Y         N/A
Target/URI/Open command                  String                                            N/A                   N           N         URI
Connection/ftp command                   Pathname                                          ftp                   N           N         N/A
Connection/Password                      Encrypted                                         N/A                   N           Y         N/A
Connection/User Name                     String                                            N/A                   N           Y         N/A
Options/Force Parallelism                Yes/No                                            No                    N           N         N/A
Options/Overwrite                        Yes/No                                            No                    N           N         N/A
Options/Restartable mode                 Abandon Transfer/Restart Transfer/                Restartable Transfer  N           Y         N/A
                                         Restartable Transfer
Options/Restartable Mode/Checkpointdir   Pathname                                          N/A                   N           N         Restartable Mode
Options/Restartable Mode/Job Id          Integer                                           1                     N           N         Restartable Mode
Options/Schema file                      Pathname                                          N/A                   N           N         N/A
Options/Transfer Type                    Binary/ASCII                                      Binary                N           N         N/A
Transfer Protocol/Transfer Mode          FTP/Secure FTP (SFTP)                             FTP                   N           N         N/A
Target Category
Specify URI and, if required, Open command.
URI
Is a pathname connecting the Stage to a target file on a remote host. It has the Open dependent property.
You can repeat this property to specify multiple URIs. You can specify an absolute or a relative
pathname.
While connecting to the mainframe system, the syntax for an absolute path is:
ftp://host/\’path.filename\’
The syntax for connecting to the mainframe system through USS (Unix System Services) is:
ftp://host//path/filename
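For example (the host and file names are illustrative only):
ftp://mvshost/\'SALES.Q1.DATA\'
names a data set by its fully qualified name on a mainframe host, while
ftp://mvshost//u/user1/sales/q1.txt
names a file in the USS file system on the same host.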
Open command
Is required if you perform any operation besides navigating to the directory where the file exists. There
can be multiple Open commands. This is a dependent property of URI.
Connection Category
Specify the ftp Command, Password and User Name. Only one ftp command can be specified, but
multiple passwords and user names can be specified.
ftp command
Is an optional command that you can specify if you do not want to use the default ftp command. For
example, you could specify /opt/gnu/bin/wuftp. You can enter the path of the command (on the server)
directly in this field. You can also specify a job parameter if you want to be able to specify the ftp
command at run time.
User Name
Specify the user name for the transfer. You can enter it directly in this field, or you can specify a job
parameter if you want to be able to specify the user name at run time. You can specify multiple user
names. User1 corresponds to URI1 and so on. When the number of users is less than the number of URIs,
the last user name is set for the remaining URIs.
If no User Name is specified, the FTP Enterprise Stage tries to use the .netrc file in the home directory.
Password
Enter the password in this field. You can also specify a job parameter if you want to be able to specify
the password at run time. Specify a password for each user name. Password1 corresponds to URI1. When
the number of passwords is less than the number of URIs, the last password is set for the remaining
URIs.
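For example (an illustrative case of the rules above), if you specify three URIs but only two user names
and two passwords, URI1 uses User1 and Password1, while URI2 and URI3 both use User2 and
Password2.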
Transfer Protocol
Select the type of FTP service to transfer files between computers. You can choose either FTP or Secure
FTP (SFTP).
FTP
Select this option if you want to transfer files using the standard FTP protocol. This is a nonsecure
protocol. By default FTP enterprise stage uses this protocol to transfer files.
Secure FTP (SFTP)
Select this option if you want to transfer files between computers over a secure channel. Secure FTP (SFTP)
uses the SSH (Secured Shell) protected channel for data transfer between computers over a nonsecure
network such as a TCP/IP network. Before you can use SFTP to transfer files, you should configure the
SSH connection without any pass phrase for RSA authentication.
Options Category
Force Parallelism
You can set either Yes or No. In general, the FTP Enterprise stage tries to start as many processes as
needed to transfer the n files in parallel. However, you can force the parallel transfer of data by setting
this property to Yes. This allows m processes at a time, where m is the number specified in the
WebSphere DataStage configuration file. If m is less than n, the stage transfers the first m files, then the
next m, until all n files are transferred.
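For example (figures chosen purely for illustration), if the value taken from the configuration file is 4
(m = 4) and there are 10 files to transfer (n = 10), the stage transfers the first 4 files, then the next 4, and
finally the remaining 2.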
When you set Force Parallelism to Yes, you should only give one URI.
Overwrite
Set this option to have any existing files overwritten by this transfer.
Restartable Mode
When you specify a restartable mode of Restartable transfer, WebSphere DataStage creates a directory for
recording information about the transfer in a restart directory. If the transfer fails, you can run an
identical job with the restartable mode property set to Restart transfer, which will reattempt the transfer.
If the transfer repeatedly fails, you can run an identical job with the restartable mode option set to
Abandon transfer, which will delete the restart directory. For more details, see ″Restartability″.
Restartable mode has the following dependent properties:
v Job Id
Identifies a restartable transfer job. This is used to name the restart directory.
v Checkpoint directory
Optionally specifies a checkpoint directory to contain restart directories. If you do not specify this, the
current working directory is used.
For example, if you specify a job_id of 100 and a checkpoint directory of /home/bgamsworth/checkpoint the
files would be written to /home/bgamsworth/checkpoint/pftp_jobid_100.
Schema file
Contains a schema for storing data. Setting this option overrides any settings on the Columns tab. You
can enter the path name of a schema file, or specify a job parameter, so the schema file name can be
specified at run time.
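A minimal sketch of the kind of record schema such a file might contain (the column names and types
are illustrative, not a prescribed layout):
record
  (OrderID: int32;
   CustomerName: string[max=30];
   OrderDate: date;
  )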
Transfer Type
Select a data transfer type to transfer files between computers. You can select either the Binary or ASCII
mode of data transfer. The default data transfer mode is binary.
Partitioning Tab
Use the Partitioning tab to specify details about how the incoming data is partitioned or collected before
it is operated on. Use it also to specify whether the data should be sorted before being operated on. By
default, most stages partition in Auto mode. This mode attempts to work out the best partitioning
method depending on the execution modes of the current and preceding stages and how many nodes are
specified in the Configuration file. Use the Partitioning tab to override this default behavior. The exact
operation of this tab depends on
whether:
v The stage is set to execute in parallel or sequential mode.
v The preceding stage in the job is set to execute in parallel or sequential mode.
If the FTP Enterprise stage is set to execute in parallel, then you can set a partitioning method by
selecting from the Partition type drop-down list. This will override any current partitioning.
If the FTP Enterprise stage is set to execute in sequential mode, but the preceding stage is executing in
parallel, then you can set a collection method from the Collector type drop-down list. This will override
the default auto collection method.
The Partitioning tab also allows you to specify that data arriving on the input link should be sorted
before being written to the file or files. The sort is always carried out within data partitions. If the stage is
partitioning incoming data the sort occurs after the partitioning. If the stage is collecting data, the sort
occurs before the collection. The availability of sorting depends on the partitioning or collecting method
chosen (it is not available with the Auto methods).
If NLS is enabled an additional button opens a dialog box allowing you to select a locale specifying the
collate convention for the sort.
You can also specify sort direction, case sensitivity, whether sorted as ASCII or EBCDIC, and whether
null columns will appear first or last for each column. Where you are using a keyed partitioning
method, you can also specify whether the column is used as a key for sorting, for partitioning, or for
both. Select the column in the Selected list and right-click to invoke the shortcut menu.
Output Page
The Output page allows you to specify details about how the FTP Enterprise stage transfers one or more
files from a remote host using the FTP protocol. The FTP Enterprise stage can have only one output link,
but this can transfer multiple files.
The General tab allows you to specify an optional description of the output link. The Properties tab
allows you to specify details of exactly what the link does. The Columns tab specifies the column
definitions of the data. The Advanced tab allows you to change the default buffering settings for the
output link.
Details about FTP Enterprise stage properties and formatting are given in the following sections. See
″Stage Editors,″ for a general description of the other tabs.
Properties Tab
Use the Properties tab to specify properties, which determine what the stage actually does. The available
properties are displayed in a tree structure. They are divided into categories to help you find your way
around them. All the mandatory properties are included in the tree by default and cannot be removed.
Required properties (that is, those that do not have a default value) are shown in a warning color (red by
default) but change to black when you have set a value.
The following table gives a quick reference list of the properties and their attributes.
Source Category
Specify URI and, if required, Open command.
URI
Is a pathname connecting the Stage to a source file on a remote host. It has the Open dependent property.
You can repeat this property to specify multiple URIs, or you can use a wildcard in the path to retrieve a
number of files. You can specify an absolute or a relative pathname.
While connecting to the mainframe system, the syntax for an absolute path is:
ftp://host/\’path.filename\’
The syntax for connecting to the mainframe system through USS is:
ftp://host//path/filename
Open command
Is required if you perform any operation besides navigating to the directory where the file exists. There
can be multiple Open commands. This is a dependent property of URI.
Connection Category
Specify the ftp Command, Password and User Name. Only one ftp command can be specified, but
multiple passwords and user names can be specified.
ftp command
Is an optional command that you can specify if you do not want to use the default ftp command. For
example, you could specify /opt/gnu/bin/wuftp. You can enter the path of the command (on the server)
directly in this field. You can also specify a job parameter if you want to be able to specify the ftp
command at run time.
User Name
Specify the user name for the transfer. You can enter it directly in this field, or you can specify a job
parameter if you want to be able to specify the user name at run time. You can specify multiple user
names. User1 corresponds to URI1 and so on. When the number of users is less than the number of URIs,
the last user name is set for the remaining URIs.
If no User Name is specified, the FTP Enterprise Stage tries to use the .netrc file in the home directory.
Password
Enter the password in this field. You can also specify a job parameter if you want to be able to specify
the password at run time. Specify a password for each user name. Password1 corresponds to URI1. When
the number of passwords is less than the number of URIs, the last password is set for the remaining
URIs.
Transfer Protocol
Select the type of FTP service to transfer files between computers. You can choose either FTP or Secure
FTP (SFTP).
FTP
Select this option if you want to transfer files using the standard FTP protocol. This is a nonsecure
protocol. By default FTP Enterprise Stage uses this protocol to transfer files.
Secure FTP (SFTP)
Select this option if you want to transfer files between computers over a secure channel. Secure FTP
(SFTP) uses the SSH (Secured Shell) protected channel for data transfer between computers over a
nonsecure network such as a TCP/IP network. Before you can use SFTP to transfer files, you should
configure the SSH connection without any pass phrase for RSA authentication.
Options Category
Force Parallelism
You can set either Yes or No. In general, the FTP Enterprise stage tries to start as many processes as
needed to transfer the n files in parallel. However, you can force the parallel transfer of data by setting
this property to Yes. This allows m processes at a time, where m is the number specified in the
WebSphere DataStage configuration file. If m is less than n, the stage transfers the first m files, then the
next m, until all n files are transferred.
When you set Force Parallelism to Yes, you should only give one URI.
Restartable Mode
When you specify a restartable mode of Restartable transfer, WebSphere DataStage creates a directory for
recording information about the transfer in a restart directory. If the transfer fails, you can run an
identical job with the restartable mode property set to Restart transfer, which will reattempt the transfer.
If the transfer repeatedly fails, you can run an identical job with the restartable mode option set to
Abandon transfer, which will delete the restart directory. For more details, see ″Restartability″.
For get operations, WebSphere DataStage reinitiates the FTP transfer at the file boundary. The transfer of
any file that failed partway through is restarted from the beginning (the zero file location). The file URIs
that were transferred completely are not transferred again. Subsequently, the downloaded URIs are
imported to the data set from the temporary folder path.
For example, if you specify a job_id of 100 and a checkpoint directory of /home/bgamsworth/checkpoint the
files would be written to /home/bgamsworth/checkpoint/pftp_jobid_100.
Schema file
Contains a schema for storing data. Setting this option overrides any settings on the Columns tab. You
can enter the path name of a schema file, or specify a job parameter, so the schema file name can be
specified at run time.
Transfer Type
Select a data transfer type to transfer files between computers. You can select either the Binary or ASCII
mode of data transfer. The default data transfer type is binary.
The Generic stage allows you to call an Orchestrate operator from within a WebSphere DataStage stage
and pass it options as required. See WebSphere DataStage Parallel Job Advanced Developer Guide for more
details about Orchestrate operators.
Must do’s
WebSphere DataStage has many defaults, which means that it can be very easy to include Generic stages
in a job. This section specifies the minimum steps to take to get a Generic stage functioning. WebSphere
DataStage provides a versatile user interface, and there are many shortcuts to achieving a particular end;
this section describes the basic method. You will learn where the shortcuts are when you get familiar
with the product.
Properties tab
The Properties tab allows you to specify properties which determine what the stage actually does. Some
of the properties are mandatory, although many have default settings. Properties without default settings
appear in the warning color (red by default) and turn black when you supply a value for them.
The following table gives a quick reference list of the properties and their attributes. A more detailed
description of each property follows.
Category/Property        Values                  Default   Mandatory?  Repeats?  Dependent of
Options/Operator         Orchestrate operator    N/A       Y           N         N/A
Options/Option name      String                  N/A       N           Y         N/A
Options/Option Value     String                  N/A       N           N         Option name
Options category
Operator
Specify the name of the Orchestrate operator the stage will call.
Option name
Specify the name of an option the operator requires. This has a dependent property:
v Option Value
The value the option is to be set to.
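For example (the operator and option names are shown for illustration only; see the WebSphere DataStage
Parallel Job Advanced Developer Guide for the options that your chosen operator accepts), you might set
Operator to peek, Option name to nrecs, and Option Value to 10 to ask the peek operator to print ten
records from each partition to the job log.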
Advanced tab
This tab allows you to specify the following:
v Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the
input data is processed by the available nodes as specified in the Configuration file, and by any node
constraints specified on the Advanced tab. In Sequential mode the entire data set is processed by the
conductor node.
v Combinability mode. This is Auto by default, which allows WebSphere DataStage to combine the
operators that underlie parallel stages so that they run in the same process if it is sensible for this type
of stage.
v Preserve partitioning. This is Propagate by default. It adopts Set or Clear from the previous stage. You
can explicitly select Set or Clear. Select Set to request that the next stage in the job should attempt to
maintain the partitioning.
v Node pool and resource constraints. Select this option to constrain parallel execution to the node pool
or pools and/or resource pool or pools specified in the grid. The grid allows you to make choices from
drop down lists populated from the Configuration file.
v Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node
map. You can define a node map by typing node numbers into the text box or by clicking the browse
button to open the Available Nodes dialog box and selecting nodes from there. You are effectively
defining a new node pool for this stage (in addition to any node pools defined in the Configuration
file).
Link Ordering tab
To rearrange the links, choose an output link and click the up arrow button or the down arrow button.
Input page
The Input page allows you to specify details about the incoming data sets. The Generic stage can accept
multiple incoming data sets. Select the link whose details you are looking at from the Input name
drop-down list.
The General tab allows you to specify an optional description of the input link. The Partitioning tab
allows you to specify how incoming data is partitioned before being operated on. The Columns tab
specifies the column definitions of incoming data. The Advanced tab allows you to change the default
buffering settings for the input link.
Details about Generic stage partitioning are given in the following section. See ″Stage Editors,″ for a
general description of the other tabs.
Partitioning tab
The Partitioning tab allows you to specify details about how the incoming data is partitioned or collected
before it is operated on. It also allows you to specify that the data should be sorted before being operated
on.
By default the stage partitions in Auto mode. This attempts to work out the best partitioning method
depending on execution modes of current and preceding stages and how many nodes are specified in the
Configuration file.
If the Generic stage is operating in sequential mode, it will first collect the data using the default Auto
collection method.
The Partitioning tab allows you to override this default behavior. The exact operation of this tab depends
on:
v Whether the Generic stage is set to execute in parallel or sequential mode.
v Whether the preceding stage in the job is set to execute in parallel or sequential mode.
If the Generic stage is set to execute in parallel, then you can set a partitioning method by selecting from
the Partition type drop-down list. This will override any current partitioning.
If the Generic stage is set to execute in sequential mode, but the preceding stage is executing in parallel,
then you can set a collection method from the Collector type drop-down list. This will override the
default collection method.
The Partitioning tab also allows you to specify that data arriving on the input link should be sorted
before being operated on. The sort is always carried out within data partitions. If the stage is partitioning
incoming data the sort occurs after the partitioning. If the stage is collecting data, the sort occurs before
the collection. The availability of sorting depends on the partitioning or collecting method chosen (it is
not available for the default auto methods).
If NLS is enabled an additional button opens a dialog box allowing you to select a locale specifying the
collate convention for the sort.
You can also specify sort direction, case sensitivity, whether sorted as ASCII or EBCDIC, and whether null
columns will appear first or last for each column. Where you are using a keyed partitioning method, you
can also specify whether the column is used as a key for sorting, for partitioning, or for both. Select the
column in the Selected list and right-click to invoke the shortcut menu.
Output page
The Output page allows you to specify details about data output from the Generic stage. The Generic
stage can have any number of output links. Select the link whose details you are looking at from the
Output name drop-down list.
A surrogate key is a unique primary key that is not derived from the data that it represents; therefore,
changes to the data will not change the primary key. In a star schema database, surrogate keys are used
to join a fact table to a dimension table.
The Surrogate Key Generator stage can have a single input link, a single output link, both an input link
and an output link, or no links. Job design depends on the purpose of the stage.
You can use a Surrogate Key Generator stage to perform the following tasks:
v Create or delete the key source before other jobs run
v Update a state file with a range of key values
v Generate surrogate key columns and pass them to the next stage in the job
v View the contents of the state file
Generated keys are unsigned 64-bit integers. The key source can be a state file or a database sequence.
You can use the Surrogate Key Generator stage to update a state file, but not a database sequence.
Sequences must be modified with database tools.
You must create a key source before you can use it in a job. The key source can be a state file or a
database sequence.
Before you edit the Surrogate Key Generator stage, you must edit the stage on the input link. In the data
input stage, specify details about the data source, such as the table name and connection properties, and
define the column metadata.
You cannot update database sequences. Use database tools to update sequences.
If you want to pass input columns to the next stage in the job, the Surrogate Key Generator stage can
also have an input link.
The SCD stage reads source data on the input link, performs a dimension table lookup on the reference
link, and writes data on the output link. The output link can pass data to another SCD stage, to a
different type of processing stage, or to a fact table. The dimension update link is a separate output link
that carries changes to the dimension. You can perform these steps in a single job or a series of jobs,
depending on the number of dimensions in your database and your performance requirements.
SCD stages support both SCD Type 1 and SCD Type 2 processing:
SCD Type 1
Overwrites an attribute in a dimension table.
SCD Type 2
Adds a new row to a dimension table.
Each SCD stage processes a single dimension and performs lookups by using an equality matching
technique. If the dimension is a database table, the stage reads the database to build a lookup table in
memory. If a match is found, the SCD stage updates rows in the dimension table to reflect the changed
data. If a match is not found, the stage creates a new row in the dimension table. All of the columns that
are needed to create a new dimension row must be present in the source data.
Input data to SCD stages must accurately represent the order in which events occurred. You might need
to presort your input data by a sequence number or a date field. If a job has multiple SCD stages, you
must ensure that the sort order of the input data is correct for each stage.
If the SCD stage is running in parallel, the input data must be hash partitioned by key. Hash partitioning
allows all records with the same business key to be handled by the same process. The SCD stage divides
the dimension table across processes by building a separate lookup table for each process.
The job design shown in these figures minimizes the use of database facilities. Job 1 builds a lookup table
in memory for the dimension, so the database connection is active only when the table is being created.
Both the output data and the dimension update records are written to flat files. Job 2 and Job 3 use these
files to update the dimension table and to load the fact table later.
This series of jobs represents a single dimension table. If you have multiple dimensions, each has a Job 1
and a Job 2. The output of the last Job 1 is the input to Job 3.
Another strategy is to combine the first two jobs into a single step, as shown in the following figure:
Figure 5. This job performs the dimension lookup and updates the dimension table.
Here the SCD stage provides the necessary column information to the database stage so that it can
generate the correct INSERT and UPDATE SQL statements to update the dimension table. By contrast, the
design in the first three figures requires you to save your output columns from the SCD stage in Job 1 as
a table definition in the repository. You must then load columns from this table definition into the
database stage in Job 2.
The SCD stage uses purpose codes to determine how to build the lookup table for the dimension lookup.
If a dimension has only Type 1 columns, the stage builds the lookup table by using all dimension rows. If
any Type 2 columns exist, the stage builds the lookup table by using only the current rows. If a
dimension has a Current Indicator column, the stage uses the derivation value of this column on the Dim
Update tab to identify the current rows of the dimension table. If a dimension does not have a Current
Indicator column, then the stage uses the Expiration Date column and its derivation value to identify the
current rows. Any dimension columns that are not needed are not used. This technique minimizes the
amount of memory that is required by the lookup table.
Purpose codes are also used to detect dimension changes. The SCD stage compares Type 1 and Type 2
column values to source column values to determine whether to update an existing row, insert a new
row, or expire a row in the dimension table.
Purpose codes are part of the column metadata that the SCD stage propagates to the dimension update
link. You can send this column metadata to a database stage in the same job, or you can save the
metadata on the Columns tab and load it into a database stage in a different job. When the database
stage uses the auto-generated SQL option to perform inserts and updates, it uses the purpose codes to
generate the correct SQL statements.
For more information, see “Selecting purpose codes” on page 410 and “Purpose code definitions” on
page 410.
When the SCD stage performs a dimension lookup, it retrieves the value of the existing surrogate key if a
matching record is found. If a match is not found, the stage obtains a new surrogate key value by using
the derivation of the Surrogate Key column on the Dim Update tab. If you want the SCD stage to
generate new surrogate keys by using a key source that you created with a Surrogate Key Generator
stage, you must use the NextSurrogateKey function to derive the Surrogate Key column. If you want to
use your own method to handle surrogate keys, you should derive the Surrogate Key column from a
source column.
You can replace the dimension information in the source data stream with the surrogate key value by
mapping the Surrogate Key column to the output link.
For more information, see “Specifying information about a key source” on page 411 and “Creating
derivations for dimension columns” on page 411.
Prerequisites
Before you edit the SCD stage, you must edit the stages on the input links:
v In the source stage, specify the file name for the data source and define the column metadata. You
should also verify that the source data is sorted correctly.
v In the dimension reference stage, specify the table name and connection properties for the dimension
table, and define the column metadata.
After you edit the SCD stage, you must edit the stages on the output links:
v If you are updating the dimension table in the current job, specify the name and connection properties
for the dimension table, set the Write Method property to Upsert, and set the Upsert Mode property
to Auto-generated Update & Insert. You do not need to define column metadata because the SCD
stage propagates the column definitions from the dimension reference link to the dimension update
link.
v If you are loading the fact table in the current job, specify the table name and connection properties for
the target database, and select any write method. The column metadata is already defined by the
mappings that you specified in the SCD stage.
You must associate at least one pair of columns, but you can define multiple pairs if required. When you
define more than one pair, the SCD stage combines the match conditions. A successful lookup requires all
associated pairs of columns to match.
Purpose codes apply to columns on the dimension reference link and on the dimension update link.
Select purpose codes according to the type of columns in a dimension:
v If a dimension contains a Type 2 column, you must select a Current Indicator column, an Expiration
Date column, or both. An Effective Date column is optional. You cannot assign Type 2 and Current
Indicator to the same column.
v If a dimension contains only Type 1 columns, no Current Indicator, Effective Date, Expiration Date, or
SK Chain columns are allowed.
After you select purpose codes on the Lookup tab, the SCD stage automatically propagates the column
definitions and purpose codes to the Dim Update tab.
The key source can be a flat file or a database sequence. The key source must exist before the job runs. If
the key source is a flat file, the file must be accessible from all nodes that run the SCD stage.
Calls to the key source are made by the NextSurrogateKey function. On the Dim Update tab, create a
derivation that uses the NextSurrogateKey function for the column that has a purpose code of Surrogate
Key. The NextSurrogateKey function returns the value of the next surrogate key when the SCD stage
creates a new dimension row.
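For example, for a dimension column with the Surrogate Key purpose code named CUST_SK (the column
name is illustrative), the derivation on the Dim Update tab would typically be NextSurrogateKey(), so that
a new value is drawn from the key source each time a new dimension row is created.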
Every dimension column must have a derivation. Columns with a purpose code of Current Indicator or
Expiration Date must also have an Expire derivation. The following requirements apply:
v Columns with a purpose code of Type 1 or Type 2 must be derived from a source column.
v Columns with a purpose code of Current Indicator or Expiration Date must be derived from a literal
value.
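As an illustration only (the literal values are not prescribed), a Current Indicator column might use the
literal 'Y' as its derivation and 'N' as its Expire derivation, while an Expiration Date column might use a
far-future date such as 9999-12-31 as its derivation, with its Expire derivation supplying the date on
which the row ceases to be current.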
You can map input data and dimension data to the output link. Dimension data is mapped from the
lookup table in memory, so new rows and changed data are available for output.
Changes to Type 2 columns take precedence over changes to Type 1 columns. The following table
describes how the SCD stage determines the update action for the dimension table:
The Column Import stage imports data from a single column and outputs it to one or more columns. You
would typically use it to divide data arriving in a single column into multiple columns. The data would
be fixed-width or delimited in some way to tell the Column Import stage where to make the divisions.
The input column must be a string or binary data, the output columns can be any data type.
You supply an import table definition to specify the target columns and their types. This also determines
the order in which data from the import column is written to output columns. Information about the
format of the incoming column (e.g., how it is delimited) is given in the Format tab of the Output page.
You can optionally save reject records, that is, records whose import was rejected, and write them to a
rejects link.
In addition to importing a column you can also pass other columns straight through the stage. So, for
example, you could pass a key column straight through.
Examples
This section gives examples of input and output data from a Column Import stage to give you a better
idea of how the stage works.
The import table definition can either be supplied on the Output Page Columns tab or in a schema file.
For the example, the definition would be:
You have to give WebSphere DataStage information about how to treat the imported data to split it into
the required columns. This is done on the Output page Format Tab. For this example, you specify a data
format of binary to ensure that the contents of col_to_import are interpreted as binary integers, and that
the data has a field delimiter of none.
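As a sketch of the kind of import table definition meant here (the column names and types are
illustrative only), the single input column col_to_import might be split into output columns defined as:
record
  (col1: int32;
   col2: int32;
   col3: int32;
  )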
Must do’s
WebSphere DataStage has many defaults, which means that it can be very easy to include Column Import
stages in a job. This section specifies the minimum steps to take to get a Column Import stage
functioning. WebSphere DataStage provides a versatile user interface, and there are many shortcuts to
achieving a particular end; this section describes the basic method. You will learn where the shortcuts are
when you get familiar with the product.
Stage page
The General tab allows you to specify an optional description of the stage. The Properties tab lets you
specify what the stage does. The Advanced tab allows you to specify how the stage executes.
Properties tab
The Properties tab allows you to specify properties which determine what the stage actually does. Some
of the properties are mandatory, although many have default settings. Properties without default settings
appear in the warning color (red by default) and turn black when you supply a value for them.
The following table gives a quick reference list of the properties and their attributes. A more detailed
description of each property follows.
Input category
Import input column
Specifies the name of the column containing the string or binary data to import.
Output category
Column method
Specifies whether the columns to import should be derived from column definitions on the Output page
Columns tab (Explicit) or from a schema file (Schema File).
Column to import
Specifies an output column. The meta data for this column determines the type that the import column
will be converted to. Repeat the property to specify multiple columns. You can use the Column Selection
dialog box to select multiple columns at once if required. You can specify the properties for each column
using the Parallel tab of the Edit Column Metadata dialog box (accessible from the shortcut menu on the
columns grid of the output Columns tab). The order of the Columns to Import that you specify should
match the order on the Columns tab.
Schema File
Instead of specifying the source data type details via output column definitions, you can use a schema
file (note, however, that if you have defined columns on the Columns tab, you should ensure these
match the schema file). Type in a pathname or browse for a schema file.
Options category
Keep Input Column
Specifies whether the original input column should be transferred to the output data set unchanged in
addition to being imported and converted. Defaults to False.
Reject mode
Advanced tab
This tab allows you to specify the following:
v Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the
input data is processed by the available nodes as specified in the Configuration file, and by any node
constraints specified on the Advanced tab. In Sequential mode the entire data set is processed by the
conductor node.
v Combinability mode. This is Auto by default, which allows WebSphere DataStage to combine the
operators that underlie parallel stages so that they run in the same process if it is sensible for this type
of stage.
v Preserve partitioning. This is Propagate by default. It adopts Set or Clear from the previous stage. You
can explicitly select Set or Clear. Select Set to request that the next stage in the job should attempt to
maintain the partitioning.
v Node pool and resource constraints. Select this option to constrain parallel execution to the node pool
or pools and/or resource pool or pools specified in the grid. The grid allows you to make choices from
drop down lists populated from the Configuration file.
v Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node
map. You can define a node map by typing node numbers into the text box or by clicking the browse
button to open the Available Nodes dialog box and selecting nodes from there. You are effectively
defining a new node pool for this stage (in addition to any node pools defined in the Configuration
file).
Input page
The Input page allows you to specify details about the incoming data sets. The Column Import stage
expects one incoming data set.
The General tab allows you to specify an optional description of the input link. The Partitioning tab
allows you to specify how incoming data is partitioned before being imported. The Columns tab specifies
the column definitions of incoming data. The Advanced tab allows you to change the default buffering
settings for the input link.
Details about Column Import stage partitioning are given in the following section. See ″Stage Editors,″ for
a general description of the other tabs.
Partitioning tab
The Partitioning tab allows you to specify details about how the incoming data is partitioned or collected
before it is imported. It also allows you to specify that the data should be sorted before being operated
on.
By default the stage partitions in Auto mode. This attempts to work out the best partitioning method
depending on execution modes of current and preceding stages and how many nodes are specified in the
Configuration file.
If the Column Import stage is operating in sequential mode, it will first collect the data using the default
Auto collection method.
The Partitioning tab allows you to override this default behavior. The exact operation of this tab depends
on:
v Whether the Column Import stage is set to execute in parallel or sequential mode.
v Whether the preceding stage in the job is set to execute in parallel or sequential mode.
If the Column Import stage is set to execute in parallel, then you can set a partitioning method by
selecting from the Partition type drop-down list. This will override any current partitioning.
If the Column Import stage is set to execute in sequential mode, but the preceding stage is executing in
parallel, then you can set a collection method from the Collector type drop-down list. This will override
the default collection method.
The Partitioning tab also allows you to specify that data arriving on the input link should be sorted
before being imported. The sort is always carried out within data partitions. If the stage is partitioning
incoming data the sort occurs after the partitioning. If the stage is collecting data, the sort occurs before
the collection. The availability of sorting depends on the partitioning or collecting method chosen (it is
not available with the default Auto methods).
You can also specify sort direction, case sensitivity, whether sorted as ASCII or EBCDIC, and whether null
columns will appear first or last for each column. Where you are using a keyed partitioning method, you
can also specify whether the column is used as a key for sorting, for partitioning, or for both. Select the
column in the Selected list and right-click to invoke the shortcut menu.
Output page
The Output page allows you to specify details about data output from the Column Import stage. The
Column Import stage can have only one output link, but can also have a reject link carrying records that
have been rejected.
The General tab allows you to specify an optional description of the output link. The Format tab allows
you to specify details about how data in the column you are importing is formatted so the stage can
divide it into separate columns. The Columns tab specifies the column definitions of the data. The
Mapping tab allows you to specify the relationship between the columns being input to the Column
Import stage and the Output columns. The Advanced tab allows you to change the default buffering
settings for the output links.
Details about Column Import stage mapping are given in the following section. See ″Stage Editors,″ for a
general description of the other tabs.
Mapping tab
For the Column Import stage the Mapping tab allows you to specify how the output columns are
derived.
The left pane shows the columns the stage is deriving from the single imported column. These are read
only and cannot be modified on this tab.
The right pane shows the output columns for each link.
We recommend that you maintain the automatic mappings of the generated columns when using this
stage.
Reject link
You cannot change the details of a Reject link. The link uses the column definitions for the link rejecting
the data records.
Using RCP with Column Import stages
Columns you are importing do not have inherent column definitions, and so WebSphere DataStage
cannot always tell where there are extra columns that need propagating. You can only use RCP on
Column Import stages if you have used the Schema File property (see ″Schema File″) to specify a
schema which describes all the columns in the imported column. You need to specify the same
schema file for any similar stages in the job where you want to propagate columns. Stages that will
require a schema file are:
v Sequential File
v File Set
v External Source
v External Target
v Column Import
v Column Export
The Column Export stage exports data from a number of columns of different data types into a single
column of data type string or binary. It is the complementary stage to Column Import (see Chapter 37,
“Column Import stage,” on page 415).
The input data column definitions determine the order in which the columns are exported to the single
output column. Information about how the single column being exported is delimited is given in the
Formats tab of the Input page. You can optionally save reject records, that is, records whose export was
rejected.
In addition to exporting a column you can also pass other columns straight through the stage. So, for
example, you could pass a key column straight through.
Examples
This section gives examples of input and output data from a Column Export stage to give you a better
idea of how the stage works.
In this example the Column Export stage extracts data from three input columns and outputs two of
them in a single column of type string and passes the other through. The example assumes that the job is
running sequentially. The column definitions for the input data set are as follows:
The following are the rows from the input data set:
value SN code
000.00 0 aaaa
001.00 1 bbbb
002.00 2 cccc
003.00 3 dddd
004.00 4 eeee
005.00 5 ffff
006.00 6 gggg
007.00 7 hhhh
008.00 8 iiii
009.00 9 jjjj
The import table definition is supplied on the Output Page Columns tab. For our example, the definition
would be:
You have to give WebSphere DataStage information about how to delimit the exported data when it
combines it into a single column. This is done on the Input page Format Tab. For this example, you
specify a data format of text, a Field Delimiter of comma, and a Quote type of double.
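As an illustration only (which two columns are exported is an assumption here, and the exact rendering
depends on the settings chosen), if the value and SN columns were exported with the settings above and
code passed through, the exported string column for the first row might look something like
"000.00","0".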
Must do’s
WebSphere DataStage has many defaults, which means that it can be very easy to include Column Export
stages in a job. This section specifies the minimum steps to take to get a Column Export stage
functioning. WebSphere DataStage provides a versatile user interface, and there are many shortcuts to
achieving a particular end; this section describes the basic method. You will learn where the shortcuts are
when you get familiar with the product.
Stage page
The General tab allows you to specify an optional description of the stage. The Properties tab lets you
specify what the stage does. The Advanced tab allows you to specify how the stage executes.
The following table gives a quick reference list of the properties and their attributes. A more detailed
description of each property follows.
Category/Property               Values                    Default    Mandatory?  Repeats?  Dependent of
Options/Export Output Column    Output Column             N/A        Y           N         N/A
Options/Export Column Type      Binary/VarChar            Binary     N           N         N/A
Options/Reject Mode             Continue (warn)/Output    Continue   N           N         N/A
Options/Column to Export        Input Column              N/A        N           Y         N/A
Options/Schema File             Pathname                  N/A        N           N         N/A
Options category
Export output column
Specifies the name of the single column to which the input column or columns are exported.
Reject mode
Column to export
Specifies an input column the stage extracts data from. The format properties for this column can be set
on the Format tab of the Input page. Repeat the property to specify multiple input columns. You can use
the Column Selection dialog box to select multiple columns at once if required. The order of the Columns
to Export that you specify should match the order on the Columns tab. If it does not, the order on the
Columns tab overrides the order of the properties.
Schema file
Instead of specifying the source data details via input column definitions, you can use a schema file
(note, however, that if you have defined columns on the Columns tab, you should ensure these match the
schema file). Type in a pathname or browse for a schema file.
Input page
The Input page allows you to specify details about the incoming data sets. The Column Export stage
expects one incoming data set.
The General tab allows you to specify an optional description of the input link. The Partitioning tab
allows you to specify how incoming data is partitioned before being exported. The Format tab allows you
to specify details about how data in the column you are exporting will be formatted. The Columns tab specifies
the column definitions of incoming data. The Advanced tab allows you to change the default buffering
settings for the input link.
Details about Column Export stage partitioning are given in the following section. See ″Stage Editors,″ for
a general description of the other tabs.
Partitioning tab
The Partitioning tab allows you to specify details about how the incoming data is partitioned or collected
before it is exported. It also allows you to specify that the data should be sorted before being operated
on.
By default the stage partitions in Auto mode. This attempts to work out the best partitioning method
depending on execution modes of current and preceding stages and how many nodes are specified in the
Configuration file.
If the Column Export stage is operating in sequential mode, it will first collect the data using the default
Auto collection method.
The Partitioning tab allows you to override this default behavior. The exact operation of this tab depends
on:
v Whether the Column Export stage is set to execute in parallel or sequential mode.
v Whether the preceding stage in the job is set to execute in parallel or sequential mode.
If the Column Export stage is set to execute in parallel, then you can set a partitioning method by
selecting from the Partition type drop-down list. This will override any current partitioning.
If the Column Export stage is set to execute in sequential mode, but the preceding stage is executing in
parallel, then you can set a collection method from the Collector type drop-down list. This will override
the default collection method.
The Partitioning tab also allows you to specify that data arriving on the input link should be sorted
before being exported. The sort is always carried out within data partitions. If the stage is partitioning
incoming data the sort occurs after the partitioning. If the stage is collecting data, the sort occurs before
the collection. The availability of sorting depends on the partitioning or collecting method chosen (it is
not available for the default Auto methods).
If NLS is enabled an additional button opens a dialog box allowing you to select a locale specifying the
collate convention for the sort.
Format tab
The Format tab allows you to supply information about the format of the column you are exporting. You
use it in the same way as you would to describe the format of a flat file you were writing. The tab has a
similar format to the Properties tab.
Select a property type from the main tree then add the properties you want to set to the tree structure by
clicking on them in the Available properties to add window. You can then set a value for that property
in the Property Value box. Pop-up help for each of the available properties appears if you hover the
mouse pointer over it.
This description uses the terms ″record″ and ″row″ and ″field″ and ″column″ interchangeably.
The following sections list the Property types and properties available for each type.
Record level
These properties define details about how data records are formatted in the flat file. Where you can enter
a character, this can usually be an ASCII character or a multi-byte Unicode character (if you have NLS
enabled). The available properties are:
Fill char
Specify an ASCII character or a value in the range 0 to 255. You can also choose Space or Null from a
drop-down list. This character is used to fill any gaps in a written record caused by column positioning
properties. Set to 0 by default (which is the NULL character). For example, to set it to space you could
also type in the space character or enter 32. Note that this value is restricted to one byte, so you cannot
specify a multi-byte Unicode character.
Final delimiter string
Specify a string to be written after the last column of a record in place of the column delimiter. Enter one
or more characters; this precedes the record delimiter if one is used. Mutually exclusive with Final
delimiter, which is the default. For example, if you set Delimiter to comma and Final delimiter string to `,
` (comma space - you do not need to enter the inverted commas) all fields are delimited by a comma,
except the final field, which is delimited by a comma followed by an ASCII space character.
Final delimiter
Specify a single character to be written after the last column of a record in place of the field delimiter.
Type a character or select one of whitespace, end, none, null, tab, or comma. See the following diagram
for an illustration.
v whitespace. The last column of each record will not include any trailing white spaces found at the end
of the record.
v end. The last column of each record does not include the field delimiter. This is the default setting.
v none. The last column of each record does not have a delimiter; used for fixed-width fields.
v null. The last column of each record is delimited by the ASCII null character.
v comma. The last column of each record is delimited by the ASCII comma character.
v tab. The last column of each record is delimited by the ASCII tab character.
When writing, a space is now inserted after every field except the last in the record. Previously, a space
was inserted after every field including the last. (If you want to revert to the pre-release 7.5 behavior of
inserting a space after the last field, set the APT_FINAL_DELIM_COMPATIBLE environment variable.)
Intact
The intact property specifies an identifier of a partial schema. A partial schema specifies that only the
column(s) named in the schema can be modified by the stage. All other columns in the row are passed
through unmodified. The file containing the partial schema is specified in the Schema File property on
the Properties tab. This property has a dependent property, Check intact, but this is not relevant to input
links.
Record delimiter string
Specify a string to be written at the end of each record. Enter one or more characters. This is mutually
exclusive with Record delimiter (which is the default), Record type, and Record prefix.
Record delimiter
Specify a single character to be written at the end of each record. Type a character or select one of the
following:
v UNIX Newline (the default)
v null
(To implement a DOS newline, use the Record delimiter string property set to ″\R\N″ or choose Format
as → DOS line terminator from the shortcut menu.)
Note: Record delimiter is mutually exclusive with Record delimiter string, Record prefix, and Record
type.
Record length
Select Fixed where fixed length fields are being written. WebSphere DataStage calculates the appropriate
length for the record. Alternatively specify the length of fixed records as number of bytes. This is not
used by default (default files are comma-delimited). The record is padded to the specified length with
either zeros or the fill character if one has been specified.
Record Prefix
Record type
Specifies that data consists of variable-length blocked records (varying) or implicit records (implicit). If
you choose the implicit property, data is written as a stream with no explicit record boundaries. The end
of the record is inferred when all of the columns defined by the schema have been parsed. The varying
property allows you to specify one of the following IBM blocked or spanned formats: V, VB, VS, VBS, or
VR.
This property is mutually exclusive with Record length, Record delimiter, Record delimiter string, and
Record prefix and by default is not used.
Field defaults
Defines default properties for columns written to the file or files. These are applied to all columns
written, but can be overridden for individual columns from the Columns tab using the Edit Column
Metadata dialog box. Where you can enter a character, this can usually be an ASCII character or a
multi-byte Unicode character (if you have NLS enabled). The available properties are:
v Actual field length. Specifies the number of bytes to fill with the Fill character when a field is
identified as null. When WebSphere DataStage identifies a null field, it will write a field of this length
full of Fill characters. This is mutually exclusive with Null field value.
v Delimiter. Specifies the trailing delimiter of all fields in the record. Type an ASCII character or select
one of whitespace, end, none, null, comma, or tab.
– whitespace. Whitespace characters at the end of a column are ignored, i.e., are not treated as part of
the column.
– end. The end of a field is taken as the delimiter, i.e., there is no separate delimiter. This is not the
same as a setting of `None’ which is used for fields with fixed-width columns.
– none. No delimiter (used for fixed-width).
– null. ASCII Null character is used.
– comma. ASCII comma character is used.
– tab. ASCII tab character is used.
v Delimiter string. Specify a string to be written at the end of each field. Enter one or more characters.
This is mutually exclusive with Delimiter, which is the default. For example, specifying `, ` (comma
space - you do not need to enter the inverted commas) would have each field delimited by `, ` unless
overridden for individual fields.
v Null field length. The length in bytes of a variable-length field that contains a null. When a
variable-length field is written, WebSphere DataStage writes a length value of null field length if the
field contains a null. This property is mutually exclusive with null field value.
v Null field value. Specifies the value written to null field if the source is set to null. Can be a number,
string, or C-type literal escape character. For example, you can represent a byte value by \ooo, where
each o is an octal digit 0 - 7 and the first o is < 4, or by \xhh, where each h is a hexadecimal digit 0 - F.
You must use this form to encode non-printable byte values (a short sketch of this encoding follows this list).
This property is mutually exclusive with Null field length and Actual length. For a fixed width data
representation, you can use Pad char (from the general section of Type defaults) to specify a repeated
trailing character if the value you specify is shorter than the fixed width of the field.
v Prefix bytes. Specifies that each column in the data file is prefixed by 1, 2, or 4 bytes containing, as a
binary value, either the column’s length or the tag value for a tagged field.
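To illustrate the escape forms that Null field value accepts, here is a minimal Python sketch that decodes the \ooo (octal) and \xhh (hexadecimal) notations into the byte value they stand for; the function name decode_byte_escape is purely illustrative and not part of DataStage.

    def decode_byte_escape(escape):
        # Illustrative only: turn a C-type literal escape, as accepted by the
        # Null field value property, into the byte value it represents.
        if escape.startswith("\\x"):               # \xhh - hexadecimal digits
            return int(escape[2:], 16)
        if escape.startswith("\\"):                # \ooo - octal digits, first digit < 4
            return int(escape[1:], 8)
        raise ValueError("not a byte escape: %s" % escape)

    assert decode_byte_escape("\\x0A") == 10       # hexadecimal form
    assert decode_byte_escape("\\012") == 10       # octal form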
Type defaults
These are properties that apply to all columns of a specific data type unless specifically overridden at the
column level. They are divided into a number of subgroups according to data type.
General
These properties apply to several data types (unless overridden at column level):
v Byte order. Specifies how multiple byte data types (except string and raw data types) are ordered.
Choose from:
– little-endian. The high byte is on the right.
– big-endian. The high byte is on the left.
– native-endian. As defined by the native format of the machine. This is the default.
v Data Format. Specifies the data representation format of a field. Applies to fields of all data types
except string, ustring, and raw and to record, subrec or tagged fields containing at least one field that
is neither string nor raw. Choose from:
– binary
– text (the default)
A setting of binary has different meanings when applied to different data types:
– For decimals, binary means packed.
– For other numerical data types, binary means ″not text″.
– For dates, binary is equivalent to specifying the julian property for the date field.
– For time, binary is equivalent to midnight_seconds.
– For timestamp, binary specifies that the first integer contains a Julian day count for the date portion
of the timestamp and the second integer specifies the time portion of the timestamp as the number
of seconds from midnight. A binary timestamp specifies that two 32-bit integers are written (a sketch of
this layout follows this list).
By default data is formatted as text, as follows:
– For the date data type, text specifies that the data to be written contains a text-based date in the
form %yyyy-%mm-%dd or in the default date format if you have defined a new one on an NLS
system (see WebSphere DataStage NLS Guide).
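As an illustration of the binary timestamp layout described above, the following Python sketch packs a timestamp as a 32-bit Julian day count followed by a 32-bit seconds-from-midnight count. The constant 1721425 (the offset between Python's proleptic Gregorian ordinal and the Julian day number) and the choice of little-endian byte order are assumptions of this sketch, not values taken from DataStage; the actual byte order is governed by the Byte order property.

    import struct
    from datetime import datetime

    # Offset between Python's proleptic Gregorian ordinal and the Julian day
    # number (an assumption of this sketch, not a DataStage value).
    JULIAN_OFFSET = 1721425

    def pack_binary_timestamp(ts):
        # One 32-bit integer for the Julian day, one for seconds since midnight.
        julian_day = ts.date().toordinal() + JULIAN_OFFSET
        midnight_seconds = ts.hour * 3600 + ts.minute * 60 + ts.second
        return struct.pack("<ii", julian_day, midnight_seconds)   # little-endian chosen here

    packed = pack_binary_timestamp(datetime(1960, 1, 2, 0, 11, 1))
    print(len(packed), struct.unpack("<ii", packed))               # 8 bytes in total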
String
These properties are applied to columns with a string data type, unless overridden at column level.
v Export EBCDIC as ASCII. Select this to specify that EBCDIC characters are written as ASCII
characters. Applies to fields of the string data type and record, subrec, or tagged fields if they contain
at least one field of this type.
v Import ASCII as EBCDIC. Not relevant for input links.
For ASCII-EBCDIC and EBCDIC-ASCII conversion tables, see WebSphere DataStage Developer’s Help.
Decimal
These properties are applied to columns with a decimal data type unless overridden at column level.
Numeric
These properties apply to integer and float fields unless overridden at column level.
Date
These properties are applied to columns with a date data type unless overridden at column level. All of
these are incompatible with a Data Format setting of Text.
v Days since. Dates are written as a signed integer containing the number of days since the specified
date. Enter a date in the form %yyyy-%mm-%dd or in the default date format if you have defined a
new one on an NLS system (see WebSphere DataStage NLS Guide).
v Format string. The string format of a date. By default this is %yyyy-%mm-%dd. For details about the
format, see “Date formats” on page 30.
v Is Julian. Select this to specify that dates are written as a numeric value containing the Julian day. A
Julian day specifies the date as the number of days from 4713 BCE January 1, 12:00 hours (noon) GMT.
Time
These properties are applied to columns with a time data type unless overridden at column level. All of
these are incompatible with a Data Format setting of Text.
v Format string. Specifies the format of columns representing time as a string. For details about the
format, see “Time formats” on page 33.
v Is midnight seconds. Select this to specify that times are written as a binary 32-bit integer containing
the number of seconds elapsed from the previous midnight.
Timestamp
These properties are applied to columns with a timestamp data type unless overridden at column level.
v Format string. Specifies the format of a column representing a timestamp as a string. Defaults to
%yyyy-%mm-%dd %hh:%nn:%ss. The format combines the format for date strings and time strings. See
“Date formats” on page 30 and “Time formats” on page 33.
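By way of illustration, the default format string can be reproduced with an ordinary strftime call; the token mapping below (%yyyy to %Y, %mm to %m, %dd to %d, %hh to %H, %nn to %M, %ss to %S) is an assumption based on the date and time format descriptions referenced above, not a complete list of the supported tokens.

    from datetime import datetime

    # Assumed mapping from the DataStage-style tokens to strftime equivalents.
    TOKEN_MAP = {"%yyyy": "%Y", "%mm": "%m", "%dd": "%d",
                 "%hh": "%H", "%nn": "%M", "%ss": "%S"}

    def format_timestamp(ts, fmt="%yyyy-%mm-%dd %hh:%nn:%ss"):
        for token, strftime_token in TOKEN_MAP.items():
            fmt = fmt.replace(token, strftime_token)
        return ts.strftime(fmt)

    print(format_timestamp(datetime(1955, 12, 22, 12, 59, 0)))    # 1955-12-22 12:59:00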
Output page
The Output page allows you to specify details about data output from the Column Export stage. The
Column Export stage can have only one output link, but can also have a reject link carrying records that
have been rejected.
The General tab allows you to specify an optional description of the output link. The Columns tab
specifies the column definitions of the data. The Mapping tab allows you to specify the relationship
between the columns being input to the Column Export stage and the Output columns. The Advanced
tab allows you to change the default buffering settings for the output links.
Details about Column Export stage mapping are given in the following section. See ″Stage Editors,″ for a
general description of the other tabs.
Mapping tab
The left pane shows the input columns plus the composite column that the stage exports the specified
input columns to. These are read only and cannot be modified on this tab.
The right pane shows the output columns for each link. This has a Derivations field where you can
specify how the column is derived. You can fill it in by dragging input columns over, or by using the
Auto-match facility.
The remaining columns are all being exported to comp_col, which is the specified Export Column. You
could also pass the original columns through the stage, if required.
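The following Python sketch (illustrative only, not DataStage code) mimics this mapping: the chosen input columns are concatenated into the composite column comp_col, while any columns mapped straight across are passed through unchanged. The delimiter and the sample row are hypothetical.

    def column_export(row, export_cols, keep_cols, delimiter="|"):
        # Illustrative only: concatenate the exported columns into comp_col and
        # pass the remaining mapped columns through unchanged.
        out = {name: row[name] for name in keep_cols}
        out["comp_col"] = delimiter.join(str(row[name]) for name in export_cols)
        return out

    row = {"key": "A", "col1": 12, "col2": "Will", "col3": "D"}
    print(column_export(row, export_cols=["col1", "col2", "col3"], keep_cols=["key"]))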
Reject link
You cannot change the details of a Reject link. The link uses the column definitions for the link rejecting
the data rows. Rows will be rejected if they do not match the expected schema.
You can only use RCP on Column Export stages if you have used the Schema File property (see ″Schema
File″ ) to specify a schema which describes all the columns in the column definitions. You need to specify the same
schema file for any similar stages in the job where you want to propagate columns. Stages that will
require a schema file are:
v Sequential File
v File Set
v External Source
v External Target
v Column Import
v Column Export
The Make Subrecord stage combines specified vectors in an input data set into a vector of subrecords
whose columns have the names and data types of the original vectors. You specify the vector columns to
be made into a vector of subrecords and the name of the new subrecord. See ″Complex Data Types″ for
an explanation of vectors and subrecords.
Figure: the input data comprises four columns (Column 1 through Column 4), each carrying a vector. The
figure shows how the four separate columns are combined into a single subrecord.
The Split Subrecord stage performs the inverse operation. See ″Split Subrecord Stage.″
The length of the subrecord vector created by this operator equals the length of the longest vector column
from which it is created. If a variable-length vector column was used in subrecord creation, the subrecord
vector is also of variable length.
Vectors that are smaller than the largest combined vector are padded with default values: NULL for
nullable columns and the corresponding type-dependent value for non-nullable columns. When the Make
Subrecord stage encounters mismatched vector lengths, it warns you by writing to the job log.
You can also use the stage to make a simple subrecord rather than a vector of subrecords. If your input
columns are simple data types rather than vectors, they will be used to build a vector of subrecords of
length 1 - effectively a simple subrecord.
Figure: the input columns (the key column plus Colname1 through Colname4) are combined into a single
output column, Subrec, a subrecord whose elements are Keycol and Colname1 through Colname4.
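A minimal Python sketch of this behavior (illustrative only, not DataStage code) is shown below: several vector columns are combined into a vector of subrecords, with shorter vectors padded to the length of the longest. The column names and the use of None as the pad value are assumptions of the sketch.

    def make_subrecord(row, vector_cols, subrec_name="parent", pad=None):
        # Pad every vector to the length of the longest one, then build one
        # subrecord element per index position.
        longest = max(len(row[name]) for name in vector_cols)
        subrecords = []
        for i in range(longest):
            element = {name: (row[name][i] if i < len(row[name]) else pad)
                       for name in vector_cols}
            subrecords.append(element)
        out = {name: value for name, value in row.items() if name not in vector_cols}
        out[subrec_name] = subrecords                 # the new vector of subrecords
        return out

    row = {"key": "A", "acct": [12, 13, 4, 64], "name": ["Will", "wombat"], "code": ["D"]}
    print(make_subrecord(row, vector_cols=["acct", "name", "code"]))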
Examples
This section gives examples of input and output data from a Make Subrecord stage to give you a better
idea of how the stage works.
In this example the Make Subrecord stage extracts data from four input columns, three of which carry
vectors. The data is output in two columns, one carrying the vectors in a subrecord, and the non-vector
column being passed through the stage. The example assumes that the job is running sequentially. The
column definitions for the input data set are as follows:
The following are the rows from the input data set (superscripts represent the vector index):
The stage outputs the subrecord it builds from the input data in a single column called parent. The
column called key will be output separately. The output column definitions are as follows:
Key Parent
0 1 2 3
row A 12 13 4 64
Will wombat bill William
D 0 pad pad
B 22 6 4 21
Robin Dally Rob RD
G A pad pad
C 76 0 52 2
Beth Betany Bethany Bets
B 7 pad pad
D 4 6 81 0
Heathcliff HC Hchop Horror
A 1 pad pad
E 2 4 6 8
Chaz Swot Chazlet Twerp
C H pad pad
F 18 8 5 8
Kayser Cuddles KB Ibn Kayeed
M 1 pad pad
G 12 10 6 43
Jayne Jane J JD
Must do’s
WebSphere DataStage has many defaults, which means that it can be very easy to include Make
Subrecord stages in a job. This section specifies the minimum steps to take to get a Make Subrecord stage
functioning. WebSphere DataStage provides a versatile user interface, and there are many shortcuts to
achieving a particular end. This section describes the basic method; you will learn where the shortcuts are
when you get familiar with the product.
Stage page
The General tab allows you to specify an optional description of the stage. The Properties tab lets you
specify what the stage does. The Advanced tab allows you to specify how the stage executes.
Properties tab
The Properties tab allows you to specify properties that determine what the stage actually does. Some of
the properties are mandatory, although many have default settings. Properties without default settings
appear in the warning color (red by default) and turn black when you supply a value for them.
The following table gives a quick reference list of the properties and their attributes. A more detailed
description of each property follows.
Category/Property | Values | Default | Mandatory? | Repeats? | Dependent of
Options/Subrecord Output Column | Output Column | N/A | Y | N | N/A
Options/Vector Column for Subrecord | Input Column | N/A | N | Y | Key
Output category
Subrecord output column
Specify the name of the subrecord into which you want to combine the columns specified by the Vector
Column for Subrecord property.
Input category
Vector column for subrecord
Specify the name of the column to include in the subrecord. You can specify multiple columns to be
combined into a subrecord. For each column, specify the property followed by the name of the column to
include. You can use the Column Selection dialog box to select multiple columns at once if required.
Options category
Disable warning of column padding
When the stage combines vectors of unequal length, it pads columns and displays a message to this
effect. Optionally specify this property to disable display of the message.
Advanced tab
This tab allows you to specify the following:
v Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the
input data is processed by the available nodes as specified in the Configuration file, and by any node
constraints specified on the Advanced tab. In Sequential mode the entire data set is processed by the
conductor node.
v Combinability mode. This is Auto by default, which allows WebSphere DataStage to combine the
operators that underlie parallel stages so that they run in the same process if it is sensible for this type
of stage.
v Preserve partitioning. This is Propagate by default. It adopts Set or Clear from the previous stage. You
can explicitly select Set or Clear. Select Set to request that the next stage in the job should attempt to
maintain the partitioning.
v Node pool and resource constraints. Select this option to constrain parallel execution to the node pool
or pools and/or resource pool or pools specified in the grid. The grid allows you to make choices from
drop down lists populated from the Configuration file.
v Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node
map. You can define a node map by typing node numbers into the text box or by clicking the browse
button to open the Available Nodes dialog box and selecting nodes from there. You are effectively
defining a new node pool for this stage (in addition to any node pools defined in the Configuration
file).
Input page
The Input page allows you to specify details about the incoming data sets. The Make Subrecord stage
expects one incoming data set.
Details about Make Subrecord stage partitioning are given in the following section. See ″Stage Editors,″
for a general description of the other tabs.
Partitioning tab
The Partitioning tab allows you to specify details about how the incoming data is partitioned or collected
before it is converted. It also allows you to specify that the data should be sorted before being operated
on.
If the Make Subrecord stage is operating in sequential mode, it will first collect the data using the default
Auto collection method.
The Partitioning tab allows you to override this default behavior. The exact operation of this tab depends
on:
v Whether the Make Subrecord stage is set to execute in parallel or sequential mode.
v Whether the preceding stage in the job is set to execute in parallel or sequential mode.
If the Make Subrecord stage is set to execute in parallel, then you can set a partitioning method by
selecting from the Partition type drop-down list. This will override any current partitioning.
If the Make Subrecord stage is set to execute in sequential mode, but the preceding stage is executing in
parallel, then you can set a collection method from the Collector type drop-down list. This will override
the default collection method.
The Partitioning tab also allows you to specify that data arriving on the input link should be sorted
before being converted. The sort is always carried out within data partitions. If the stage is partitioning
incoming data the sort occurs after the partitioning. If the stage is collecting data, the sort occurs before
the collection. The availability of sorting depends on the partitioning or collecting method chosen (it is
not available for the default Auto methods).
If NLS is enabled an additional button opens a dialog box allowing you to select a locale specifying the
collate convention for the sort.
You can also specify sort direction, case sensitivity, whether sorted as ASCII or EBCDIC, and whether null
columns will appear first or last for each column. Where you are using a keyed partitioning method, you
can also specify whether the column is used as a key for sorting, for partitioning, or for both. Select the
column in the Selected list and right-click to invoke the shortcut menu.
Output page
The Output page allows you to specify details about data output from the Make Subrecord stage. The
Make Subrecord stage can have only one output link.
The General tab allows you to specify an optional description of the output link. The Columns tab
specifies the column definitions of the data. The Advanced tab allows you to change the default buffering
settings for the output link.
The Split Subrecord stage separates an input subrecord field into a set of top-level vector columns. The
stage creates one new vector column for each element of the original subrecord. That is, each
top-level vector column that is created has the same number of elements as the subrecord from which it
was created. The stage outputs columns of the same name and data type as those of the columns that
comprise the subrecord.
Figure: the output data comprises separate top-level columns (for example Column 1, Column 3, and
Column 4), each carrying one of the vectors taken from the subrecord.
The Make Subrecord stage performs the inverse operation (see ″Make Subrecord Stage″).
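Continuing the illustration, the following Python sketch (not DataStage code) performs the inverse of the make_subrecord sketch given for the Make Subrecord stage: each element of the subrecord vector contributes one entry to a top-level vector column of the same name. Column names are hypothetical.

    def split_subrecord(row, subrec_name="parent"):
        # Each subrecord element contributes one entry to a top-level vector
        # column of the same name.
        out = {name: value for name, value in row.items() if name != subrec_name}
        for element in row[subrec_name]:
            for name, value in element.items():
                out.setdefault(name, []).append(value)
        return out

    row = {"key": "A",
           "parent": [{"acct": 12, "name": "Will"}, {"acct": 13, "name": "wombat"}]}
    print(split_subrecord(row))   # {'key': 'A', 'acct': [12, 13], 'name': ['Will', 'wombat']}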
Examples
This section gives examples of input and output data from a Split Subrecord stage to give you a better
idea of how the stage works.
In this example the Split Subrecord stage extracts data from a subrecord containing three vectors. The
data is output in four columns, three carrying the vectors from the subrecord, plus another column which
is passed through the stage. The example assumes that the job is running sequentially. The column
definitions for the input data set are as follows:
The following are the rows from the input data set (superscripts represent the vector index):
Key Parent
0 1 2 3
row A 12 13 4 64
Will wombat bill William
D 0 pad pad
B 22 6 4 21
Robin Dally Rob RD
G A pad pad
C 76 0 52 2
Beth Betany Bethany Bets
B 7 pad pad
D 4 6 81 0
Heathcliff HC Hchop Horror
A 1 pad pad
E 2 4 6 8
Chaz Swot Chazlet Twerp
C H pad pad
F 18 8 5 8
Kayser Cuddles KB Ibn Kayeed
M 1 pad pad
G 12 10 6 43
Jayne Jane J JD
F 2 pad pad
H 12 0 6 43
Ann Anne AK AJK
H E pad pad
I 3 0 7 82
The stage outputs the data it extracts from the subrecord in three separate columns each carrying a
vector. The column called key will be output separately. The output column definitions are as follows:
The output data set will be (superscripts represent the vector index):
Must do’s
WebSphere DataStage has many defaults, which means that it can be very easy to include Split Subrecord
stages in a job. This section specifies the minimum steps to take to get a Split Subrecord stage
functioning. WebSphere DataStage provides a versatile user interface, and there are many shortcuts to
achieving a particular end. This section describes the basic method; you will learn where the shortcuts are
when you get familiar with the product.
Properties tab
The Properties tab allows you to specify properties which determine what the stage actually does. The
Split Subrecord stage has only one property, and you must supply a value for it.
The following table gives a quick reference list of the properties and their attributes. A more detailed
description of each property follows.
Category/Property | Values | Default | Mandatory? | Repeats? | Dependent of
Options/Subrecord Column | Input Column | N/A | Y | N | N/A
Options category
Subrecord column
Specifies the name of the vector whose elements you want to promote to a set of similarly named
top-level columns.
Advanced tab
This tab allows you to specify the following:
v Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the
input data is processed by the available nodes as specified in the Configuration file, and by any node
constraints specified on the Advanced tab. In Sequential mode the entire data set is processed by the
conductor node.
v Combinability mode. This is Auto by default, which allows WebSphere DataStage to combine the
operators that underlie parallel stages so that they run in the same process if it is sensible for this type
of stage.
v Preserve partitioning. This is Propagate by default. It adopts Set or Clear from the previous stage. You
can explicitly select Set or Clear. Select Set to request that the next stage in the job should attempt to
maintain the partitioning.
v Node pool and resource constraints. Select this option to constrain parallel execution to the node pool
or pools and/or resource pool or pools specified in the grid. The grid allows you to make choices from
drop down lists populated from the Configuration file.
v Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node
map. You can define a node map by typing node numbers into the text box or by clicking the browse
button to open the Available Nodes dialog box and selecting nodes from there. You are effectively
defining a new node pool for this stage (in addition to any node pools defined in the Configuration
file).
Input page
The Input page allows you to specify details about the incoming data sets. There can be only one input to
the Split Subrecord stage.
Details about Split Subrecord stage partitioning are given in the following section. See ″Stage Editors,″ for
a general description of the other tabs.
Partitioning tab
The Partitioning tab allows you to specify details about how the incoming data is partitioned or collected
before it is converted. It also allows you to specify that the data should be sorted before being operated
on.
By default the stage partitions in Auto mode. This attempts to work out the best partitioning method
depending on execution modes of current and preceding stages and how many nodes are specified in the
Configuration file.
If the Split Subrecord stage is operating in sequential mode, it will first collect the data using the default
Auto collection method.
The Partitioning tab allows you to override this default behavior. The exact operation of this tab depends
on:
v Whether the Split Subrecord stage is set to execute in parallel or sequential mode.
v Whether the preceding stage in the job is set to execute in parallel or sequential mode.
If the Split Subrecord stage is set to execute in parallel, then you can set a partitioning method by
selecting from the Partition type drop-down list. This will override any current partitioning.
If the Split Subrecord stage is set to execute in sequential mode, but the preceding stage is executing in
parallel, then you can set a collection method from the Collector type drop-down list. This will override
the default collection method.
The Partitioning tab also allows you to specify that data arriving on the input link should be sorted
before being converted. The sort is always carried out within data partitions. If the stage is partitioning
incoming data the sort occurs after the partitioning. If the stage is collecting data, the sort occurs before
the collection. The availability of sorting depends on the partitioning or collecting method chosen (it is
not available for the default Auto methods).
If NLS is enabled an additional button opens a dialog box allowing you to select a locale specifying the
collate convention for the sort.
You can also specify sort direction, case sensitivity, whether sorted as ASCII or EBCDIC, and whether null
columns will appear first or last for each column. Where you are using a keyed partitioning method, you
can also specify whether the column is used as a key for sorting, for partitioning, or for both. Select the
column in the Selected list and right-click to invoke the shortcut menu.
Output page
The Output page allows you to specify details about data output from the Split Subrecord stage. The
Split Subrecord stage can have only one output link.
The General tab allows you to specify an optional description of the output link. The Columns tab
specifies the column definitions of the data. The Advanced tab allows you to change the default buffering
settings for the output link.
The Combine Records stage combines records (i.e., rows) in which particular key-column values are
identical into vectors of subrecords. As input, the stage takes a data set in which one or more columns
are chosen as keys. All adjacent records whose key columns contain the same value are gathered into the
same record as subrecords.
Figure: the input data comprises columns Keycol and Colname1 through Colname4; in the output, all
elements of a given subrecord vector share the same value of Keycol.
The data set input to the Combine Records stage must be key partitioned and sorted. This ensures that
rows with the same key column values are located in the same partition and will be processed by the
same node. Choosing the (auto) partitioning method will ensure that partitioning and sorting are done. If
sorting and partitioning are carried out on separate stages before the Combine Records stage, WebSphere
DataStage in auto mode will detect this and not repartition (alternatively you could explicitly specify the
Same partitioning method).
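As a rough Python illustration (not DataStage code) of this behavior, the sketch below groups adjacent rows that share the same key value and emits one record per group carrying a vector of subrecords; like the stage, it relies on the input already being sorted on the key. The column names keycol and subreccol follow the example below.

    from itertools import groupby

    def combine_records(rows, key="keycol", subrec_name="subreccol"):
        # Relies on the rows already being sorted on the key, so that rows with
        # the same key value are adjacent; each group becomes one output record.
        out = []
        for key_value, group in groupby(rows, key=lambda r: r[key]):
            subrecords = [{name: value for name, value in r.items() if name != key}
                          for r in group]
            out.append({key: key_value, subrec_name: subrecords})
        return out

    rows = [{"keycol": "A", "col1": 1}, {"keycol": "A", "col1": 3}, {"keycol": "B", "col1": 1}]
    print(combine_records(rows))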
Examples
This section gives examples of input and output data from a Combine Records stage to give you a better
idea of how the stage works.
Example 1
This example assumes that the job is running sequentially. The column definitions for the input data set
are as follows:
The following are some rows from the input data set:
Once combined by the stage, each group of rows will be output in a single column called subreccol. This
contains the keycol, col1, col2, and col3 columns. (If you do not take advantage of the runtime column
propagation feature, you would have to set up the subrecord using the Edit Column Meta Data dialog
box to set a level number for each of the columns the subrecord column contains.)
subreccol
vector index col1 col2 col3 Keycol
row 0 1 00:11:01 1960-01-02 A
1 3 08:45:54 1946-09-15 A
row 0 1 12:59:00 1955-12-22 B
1 2 07:33:04 1950-03-10 B
2 2 12:00:00 1967-02-06 B
3 2 07:37:04 1950-03-10 B
4 3 07:56:03 1977-04-14 B
5 3 09:58:02 1960-05-18 B
row 0 1 11:43:02 1980-06-03 C
1 2 01:30:01 1985-07-07 C
2 2 11:30:01 1985-07-07 C
3 3 10:28:02 1992-11-23 C
Example 2
This example shows a more complex structure that can be derived using the Top Level Keys Property.
This can be set to True to indicate that key columns should be left as top level columns and not included
in the subrecord. This example assumes that the job is running sequentially. The same column definitions
are used, except both col1 and keycol are defined as keys.
The Output column definitions have two separate columns defined for the keys, as well as the column
carrying the subrecords:
Must do’s
WebSphere DataStage has many defaults, which means that it can be very easy to include Combine
Records stages in a job. This section specifies the minimum steps to take to get a Combine Records stage
functioning. WebSphere DataStage provides a versatile user interface, and there are many shortcuts to
achieving a particular end. This section describes the basic method; you will learn where the shortcuts are
when you get familiar with the product.
Stage page
The General tab allows you to specify an optional description of the stage. The Properties tab lets you
specify what the stage does. The Advanced tab allows you to specify how the stage executes. The NLS
Locale tab appears if you have NLS enabled on your system. It allows you to select a locale other than
the project default to determine collating rules.
Properties tab
The Properties tab allows you to specify properties which determine what the stage actually does. Some
of the properties are mandatory, although many have default settings. Properties without default settings
appear in the warning color (red by default) and turn black when you supply a value for them.
The following table gives a quick reference list of the properties and their attributes. A more detailed
description of each property follows.
Category/Property | Values | Default | Mandatory? | Repeats? | Dependent of
Options/Subrecord Output Column | Output Column | N/A | Y | N | N/A
Options/Key | Input Column | N/A | Y | Y | N/A
Options/Case Sensitive | True/False | True | N | N | Key
Options/Top Level Keys | True/False | False | N | N | N/A
Outputs category
Subrecord output column
Specify the name of the subrecord that the Combine Records stage creates.
Combine keys category
Key
Specify one or more columns. You can use the Column Selection dialog box to select multiple columns at
once if required. All records whose key columns contain identical values are gathered into the same
record as subrecords. If the Top Level Keys property is set to False, each column becomes the element of
a subrecord.
If the Top Level Keys property is set to True, the key column appears as a top-level column in the output
record as opposed to in the subrecord. All non-key columns belonging to input records with that key
column appear as elements of a subrecord in that key column’s output record. Key has the following
dependent property:
v Case Sensitive
Use this property to specify whether each key is case sensitive or not. It is set to True by default; for
example, the values ″CASE″ and ″case″ would not be judged equivalent.
Options category
Top level keys
Specify whether to leave keys as top-level columns or have them put into the subrecord. False by default.
Advanced tab
This tab allows you to specify the following:
v Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the
input data is processed by the available nodes as specified in the Configuration file, and by any node
constraints specified on the Advanced tab. In Sequential mode the entire data set is processed by the
conductor node.
v Combinability mode. This is Auto by default, which allows WebSphere DataStage to combine the
operators that underlie parallel stages so that they run in the same process if it is sensible for this type
of stage.
v Preserve partitioning. This is Propagate by default. It adopts Set or Clear from the previous stage. You
can explicitly select Set or Clear. Select Set to request that the next stage in the job should attempt to
maintain the partitioning.
v Node pool and resource constraints. Select this option to constrain parallel execution to the node pool
or pools and/or resource pool or pools specified in the grid. The grid allows you to make choices
from drop down lists populated from the Configuration file.
v Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node
map. You can define a node map by typing node numbers into the text box or by clicking the browse
button to open the Available Nodes dialog box and selecting nodes from there. You are effectively
defining a new node pool for this stage (in addition to any node pools defined in the Configuration
file).
Input page
The Input page allows you to specify details about the incoming data sets. The Combine Records stage
expects one incoming data set.
The General tab allows you to specify an optional description of the input link. The Partitioning tab
allows you to specify how incoming data is partitioned before being converted. The Columns tab
specifies the column definitions of incoming data. The Advanced tab allows you to change the default
buffering settings for the input link.
Details about Combine Records stage partitioning are given in the following section. See ″Stage Editors,″
for a general description of the other tabs.
Partitioning tab
The Partitioning tab allows you to specify details about how the incoming data is partitioned or collected
before it is converted. It also allows you to specify that the data should be sorted before being operated
on.
By default the stage partitions in Auto mode. This attempts to work out the best partitioning method
depending on execution modes of current and preceding stages and how many nodes are specified in the
Configuration file. Auto mode ensures that data being input to the Combine Records stage is hash
partitioned and sorted.
If the Combine Records stage is operating in sequential mode, it will first collect the data using the
default Auto collection method.
The Partitioning tab allows you to override this default behavior. The exact operation of this tab depends
on:
v Whether the Combine Records stage is set to execute in parallel or sequential mode.
v Whether the preceding stage in the job is set to execute in parallel or sequential mode.
If the Combine Records stage is set to execute in parallel, then you can set a partitioning method by
selecting from the Partition type drop-down list. This will override any current partitioning.
If the Combine Records stage is set to execute in sequential mode, but the preceding stage is executing in
parallel, then you can set a collection method from the Collector type drop-down list. This will override
the default collection method.
The Partitioning tab also allows you to specify that data arriving on the input link should be sorted
before being converted. The sort is always carried out within data partitions. If the stage is partitioning
incoming data the sort occurs after the partitioning. If the stage is collecting data, the sort occurs before
the collection. The availability of sorting depends on the partitioning or collecting method chosen (it is
not available with the default auto methods).
If NLS is enabled an additional button opens a dialog box allowing you to select a locale specifying the
collate convention for the sort.
You can also specify sort direction, case sensitivity, whether sorted as ASCII or EBCDIC, and whether null
columns will appear first or last for each column. Where you are using a keyed partitioning method, you
can also specify whether the column is used as a key for sorting, for partitioning, or for both. Select the
column in the Selected list and right-click to invoke the shortcut menu.
Output page
The Output page allows you to specify details about data output from the Combine Records stage. The
Combine Records stage can have only one output link.
The General tab allows you to specify an optional description of the output link. The Columns tab
specifies the column definitions of the data. The Advanced tab allows you to change the default buffering
settings for the output link.
The Promote Subrecord stage promotes the columns of an input subrecord to top-level columns. The
number of output columns equals the number of subrecord elements. The data types of the input
subrecord columns determine those of the corresponding top-level columns.
Figure: the input data is a single subrecord column containing Colname1 through Colname4; the output
data has four top-level columns, Colname1 through Colname4.
The stage can also promote the columns in vectors of subrecords, in which case it acts as the inverse of
the Combine Records stage.
Figure: in this case the output data has columns Keycol and Colname1 through Colname4 as top-level
columns.
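For a simple subrecord, the promotion amounts to lifting each subrecord element to a top-level column of the same name and type, as this illustrative Python sketch (not DataStage code) shows; the column names are taken from the first example below.

    def promote_subrecord(row, subrec_name="subrec"):
        # The subrecord's elements become top-level columns of the same names.
        out = {name: value for name, value in row.items() if name != subrec_name}
        out.update(row[subrec_name])
        return out

    print(promote_subrecord({"subrec": {"col1": "A", "col2": "AD", "col3": "Thurs", "col4": "No"}}))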
Examples
This section gives examples of input and output data from a Promote Subrecord stage to give you a
better idea of how the stage works.
Example 1
In this example the Promote Subrecord stage promotes the records of a simple subrecord to top level
columns. It extracts data from a single column containing a subrecord. The data is output in four
columns, each carrying a column from the subrecord. The example assumes that the job is running
sequentially. The column definition for the input data set contains a definition of a single column called
subrec.
subrec
subrecord column
name col1 col2 col3 col4
row 1 AAD Thurs No
row 2 ABD Thurs No
row 3 CAD Weds Yes
row 4 CCC Mon Yes
5 BDD Mon Yes
6 DAK Fri No
7 MDB Tues Yes
The stage outputs the data it extracts from the subrecord in four separate columns of appropriate type.
The output column definitions are as follows:
The Subrecord Column property on the Properties tab of the Promote Subrecord stage is set to ’subrec’.
Example 2
This example shows how the Promote Subrecord stage would operate on an aggregated vector of subrecords,
as would be produced by the Combine Records stage. It assumes that the job is running sequentially. The
column definition for the input data set contains a definition of a single column called subrec.
The following are some rows from the input data set:
subreccol
vector index col1 col2 col3 Keycol
row 0 1 00:11:01 1960-01-02 A
1 3 08:45:54 1946-09-15 A
row 0 1 12:59:00 1955-12-22 B
1 2 07:33:04 1950-03-10 B
Once the columns in the subrecords have been promoted the data will be output in four columns as
follows:
The Subrecord Column property on the Properties tab of the Promote Subrecord stage is set to ’subrec’.
Stage page
The General tab allows you to specify an optional description of the stage. The Properties tab lets you
specify what the stage does. The Advanced tab allows you to specify how the stage executes.
Properties tab
The Promote Subrecord Stage has one property:
Category/Property | Values | Default | Mandatory? | Repeats? | Dependent of
Options/Subrecord Column | Input Column | N/A | Y | N | N/A
Options category
Subrecord column
Specifies the name of the subrecord whose elements will be promoted to top-level columns.
Advanced tab
This tab allows you to specify the following:
v Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the
input data is processed by the available nodes as specified in the Configuration file, and by any node
constraints specified on the Advanced tab. In Sequential mode the entire data set is processed by the
conductor node.
v Combinability mode. This is Auto by default, which allows WebSphere DataStage to combine the
operators that underlie parallel stages so that they run in the same process if it is sensible for this type
of stage.
v Preserve partitioning. This is Propagate by default. It adopts Set or Clear from the previous stage. You
can explicitly select Set or Clear. Select Set to request that the next stage in the job should attempt to
maintain the partitioning.
v Node pool and resource constraints. Select this option to constrain parallel execution to the node pool
or pools and/or resource pool or pools specified in the grid. The grid allows you to make choices from
drop down lists populated from the Configuration file.
v Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node
map. You can define a node map by typing node numbers into the text box or by clicking the browse
button to open the Available Nodes dialog box and selecting nodes from there. You are effectively
defining a new node pool for this stage (in addition to any node pools defined in the Configuration file).
Input page
The Input page allows you to specify details about the incoming data sets. The Promote Subrecord stage
expects one incoming data set.
The General tab allows you to specify an optional description of the input link. The Partitioning tab
allows you to specify how incoming data is partitioned before being converted. The Columns tab
specifies the column definitions of incoming data. The Advanced tab allows you to change the default
buffering settings for the input link.
Details about Promote Subrecord stage partitioning are given in the following section. See ″Stage Editors,″
for a general description of the other tabs.
Partitioning tab
The Partitioning tab allows you to specify details about how the incoming data is partitioned or collected
before it is converted. It also allows you to specify that the data should be sorted before being operated
on.
By default the stage partitions in Auto mode. This attempts to work out the best partitioning method
depending on execution modes of current and preceding stages and how many nodes are specified in the
Configuration file.
If the Promote Subrecord stage is operating in sequential mode, it will first collect the data using the
default Auto collection method.
The Partitioning tab allows you to override this default behavior. The exact operation of this tab depends
on:
v Whether the Promote Subrecord stage is set to execute in parallel or sequential mode.
v Whether the preceding stage in the job is set to execute in parallel or sequential mode.
If the Promote Subrecord stage is set to execute in parallel, then you can set a partitioning method by
selecting from the Partition type drop-down list. This will override any current partitioning.
If the Promote Subrecord stage is set to execute in sequential mode, but the preceding stage is executing
in parallel, then you can set a collection method from the Collector type drop-down list. This will
override the default collection method.
The Partitioning tab also allows you to specify that data arriving on the input link should be sorted
before being converted. The sort is always carried out within data partitions. If the stage is partitioning
incoming data the sort occurs after the partitioning. If the stage is collecting data, the sort occurs before
the collection. The availability of sorting depends on the partitioning methods chosen (it is not available
with the default auto methods).
If NLS is enabled an additional button opens a dialog box allowing you to select a locale specifying the
collate convention for the sort.
You can also specify sort direction, case sensitivity, whether sorted as ASCII or EBCDIC, and whether null
columns will appear first or last for each column. Where you are using a keyed partitioning method, you
can also specify whether the column is used as a key for sorting, for partitioning, or for both. Select the
column in the Selected list and right-click to invoke the shortcut menu.
Output page
The Output page allows you to specify details about data output from the Promote Subrecord stage. The
Promote Subrecord stage can have only one output link.
The General tab allows you to specify an optional description of the output link. The Columns tab
specifies the column definitions of the data. The Advanced tab allows you to change the default buffering
settings for the output link.
The Make Vector stage combines specified columns of an input data record into a vector of columns. The
stage has the following requirements:
v The input columns must form a numeric sequence, and must all be of the same type.
v The numbers must increase by one.
v The columns must be named column_name0 to column_namen, where column_name starts the name
of a column and 0 and n are the first and last of its consecutive numbers.
v The columns do not have to be in consecutive order.
All these columns are combined into a vector of the same length as the number of columns (n+1). The
vector is called column_name. Any input columns that do not have a name of that form will not be
included in the vector but will be output as top level columns.
Figure: the input data comprises five columns named Col0 through Col4, which the stage combines into a
single vector column.
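A minimal Python sketch of this behavior (illustrative only, not DataStage code): columns whose names match the common partial name followed by a number are gathered, in numeric order, into a single vector column, and all other columns are passed through. The sample row mirrors the second example below.

    import re

    def make_vector(row, common_name="col"):
        # Columns named col0..coln are gathered, in numeric order, into a single
        # vector column called col; everything else is passed through.
        pattern = re.compile(r"^%s(\d+)$" % re.escape(common_name))
        numbered = {}
        out = {}
        for name, value in row.items():
            match = pattern.match(name)
            if match:
                numbered[int(match.group(1))] = value
            else:
                out[name] = value
        out[common_name] = [numbered[i] for i in sorted(numbered)]
        return out

    row = {"name": "Will", "code": "D070", "col0": 3, "col1": 6, "col2": 2, "col3": 9, "col4": 9}
    print(make_vector(row))   # {'name': 'Will', 'code': 'D070', 'col': [3, 6, 2, 9, 9]}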
The Split Vector stage performs the inverse operation. See ″Split Vector Stage.″
Examples
This section gives examples of input and output data from a Make Vector stage to give you a better idea
of how the stage works.
Example 1
In this example, all the input data will be included in the output vector. The example assumes that the
job is running sequentially. The column definitions for the input data set are in the following table. Note
the columns all have the same type and names in the form column_nameN:
The following are some rows from the input data set:
The stage outputs the vectors it builds from the input data in a single column called col. You do not have
to explicitly define the output column name; WebSphere DataStage will do this for you as the job runs,
but you may wish to do so to make the job more understandable.
The Column’s Common Partial Name property is set to ’col’ in the Properties tab.
col
Vector index 0 1 2 3 4
row 3 6 2 9 9
row 3 2 7 2 4
Example 2
In this example, there are additional columns as well as the ones that will be included in the vector. The
example assumes that the job is running sequentially. The column definitions for the input data set are
shown below; note the additional columns called name and code:
The following are some rows from the input data set:
The stage outputs the vectors it builds from the input data in a single column called col. The
two other columns are output separately. You do not have to explicitly define the output column names;
WebSphere DataStage will do this for you as the job runs, but you may wish to do so to make the job
more understandable.
The Column’s Common Partial Name property is set to ’col’ in the Properties tab.
col
Vector index 0 1 2 3 4
row Will D070 3 6 2 9 9
row Robin GA36 3 2 7 2 4
row Beth B777 7 8 8 5 3
row Heathcliff A100 4 8 7 1 6
row Chaz CH01 1 6 2 5 1
row Kayser CH02 0 1 6 7 8
row Jayne M122 9 9 6 4 2
row Ann F234 0 8 4 4 3
row Kath HE45 1 7 2 5 3
row Rupert BC11 7 9 4 7 8
Must do’s
WebSphere DataStage has many defaults, which means that it can be very easy to include Make Vector
stages in a job. This section specifies the minimum steps to take to get a Make Vector stage functioning.
WebSphere DataStage provides a versatile user interface, and there are many shortcuts to achieving a
particular end. This section describes the basic method; you will learn where the shortcuts are when you
get familiar with the product.
Stage page
The General tab allows you to specify an optional description of the stage. The Properties tab lets you
specify what the stage does. The Advanced tab allows you to specify how the stage executes.
Properties tab
The Make Vector stage has one property:
Category/Property | Values | Default | Mandatory? | Repeats? | Dependent of
Options/Column’s Common Partial Name | Name | N/A | Y | N | N/A
Options category
Column’s common partial name
Specifies the beginning column_name of the series of consecutively numbered columns column_name0 to
column_namen to be combined into a vector called column_name.
Advanced tab
This tab allows you to specify the following:
v Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the
input data is processed by the available nodes as specified in the Configuration file, and by any node
constraints specified on the Advanced tab. In Sequential mode the entire data set is processed by the
conductor node.
v Combinability mode. This is Auto by default, which allows WebSphere DataStage to combine the
operators that underlie parallel stages so that they run in the same process if it is sensible for this type
of stage.
v Preserve partitioning. This is Propagate by default. It adopts Set or Clear from the previous stage. You
can explicitly select Set or Clear. Select Set to request that the next stage in the job should attempt to
maintain the partitioning.
v Node pool and resource constraints. Select this option to constrain parallel execution to the node pool
or pools and/or resource pool or pools specified in the grid. The grid allows you to make choices from
drop down lists populated from the Configuration file.
v Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node
map. You can define a node map by typing node numbers into the text box or by clicking the browse
button to open the Available Nodes dialog box and selecting nodes from there. You are effectively
defining a new node pool for this stage (in addition to any node pools defined in the Configuration
file).
Input page
The Input page allows you to specify details about the incoming data sets. The Make Vector stage expects
one incoming data set.
The General tab allows you to specify an optional description of the input link. The Partitioning tab
allows you to specify how incoming data is partitioned before being converted. The Columns tab
specifies the column definitions of incoming data. The Advanced tab allows you to change the default
buffering settings for the input link.
Details about Make Vector stage partitioning are given in the following section. See ″Stage Editors,″ for a
general description of the other tabs.
Partitioning tab
The Partitioning tab allows you to specify details about how the incoming data is partitioned or collected
before it is converted. It also allows you to specify that the data should be sorted before being operated
on.
If the Make Vector stage is operating in sequential mode, it will first collect the data using the default
Auto collection method.
The Partitioning tab allows you to override this default behavior. The exact operation of this tab depends
on:
v Whether the Make Vector stage is set to execute in parallel or sequential mode.
v Whether the preceding stage in the job is set to execute in parallel or sequential mode.
If the Make Vector stage is set to execute in parallel, then you can set a partitioning method by selecting
from the Partition type drop-down list. This will override any current partitioning.
If the Make Vector stage is set to execute in sequential mode, but the preceding stage is executing in
parallel, then you can set a collection method from the Collector type drop-down list. This will override
the default collection method.
The Partitioning tab also allows you to specify that data arriving on the input link should be sorted
before being converted. The sort is always carried out within data partitions. If the stage is partitioning
incoming data the sort occurs after the partitioning. If the stage is collecting data, the sort occurs before
the collection. The availability of sorting depends on the partitioning or collecting method chosen (it is
not available with the default Auto methods).
If NLS is enabled an additional button opens a dialog box allowing you to select a locale specifying the
collate convention for the sort.
You can also specify sort direction, case sensitivity, whether sorted as ASCII or EBCDIC, and whether null
columns will appear first or last for each column. Where you are using a keyed partitioning method, you
can also specify whether the column is used as a key for sorting, for partitioning, or for both. Select the
column in the Selected list and right-click to invoke the shortcut menu.
Output page
The Output page allows you to specify details about data output from the Make Vector stage. The Make
Vector stage can have only one output link.
The General tab allows you to specify an optional description of the output link. The Columns tab
specifies the column definitions of the data. The Advanced tab allows you to change the default buffering
settings for the output link.
The Split Vector stage promotes the elements of a fixed-length vector to a set of similarly named top-level
columns. The stage creates columns of the format name0 to namen, where name is the original vector’s
name and 0 and n are the first and last elements of the vector.
Figure: the output data comprises five columns named Col0 through Col4, one for each element of the
input vector.
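As an illustration (not DataStage code), the following Python sketch performs the inverse of the make_vector sketch given for the Make Vector stage: each element of the fixed-length vector becomes a top-level column named name0 to namen. The sample row is hypothetical.

    def split_vector(row, vector_name="col"):
        # Each vector element becomes a top-level column named col0, col1, ...
        out = {name: value for name, value in row.items() if name != vector_name}
        for i, value in enumerate(row[vector_name]):
            out["%s%d" % (vector_name, i)] = value
        return out

    print(split_vector({"name": "Will", "col": [3, 6, 2, 9, 9]}))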
The Make Vector stage performs the inverse operation (see Chapter 43, “Make Vector stage,” on page
473).
Examples
This section gives examples of input and output data from a Split Vector stage to give you a better idea
of how the stage works.
Example 1
The following are some rows from the input data set:
col
Vector index 0 1 2 3 4
row 3 6 2 9 9
row 3 2 7 2 4
row 7 8 8 5 3
row 4 8 7 1 6
row 1 6 2 5 1
row 0 1 6 7 8
row 9 9 6 4 2
row 0 8 4 4 3
row 1 7 2 5 3
row 7 9 4 7 8
The stage splits the columns it extracts from the vector into separate columns called column_nameN. You
do not have to explicitly define the output column names; WebSphere DataStage will do this for you as
the job runs, but you may wish to do so to make the job more understandable.
Example 2
In this example, there are additional columns as well as the ones containing the vector. The example
assumes that the job is running sequentially. The column definitions for the input data set are as follows;
note the additional columns called name and code:
The following are some rows from the input data set:
col
Vector index 0 1 2 3 4
row Will D070 3 6 2 9 9
row Robin GA36 3 2 7 2 4
row Beth B777 7 8 8 5 3
row Heathcliff A100 4 8 7 1 6
row Chaz CH01 1 6 2 5 1
row Kayser CH02 0 1 6 7 8
row Jayne M122 9 9 6 4 2
row Ann F234 0 8 4 4 3
row Kath HE45 1 7 2 5 3
row Rupert BC11 7 9 4 7 8
The stage splits the columns it extracts from the vector into separate columns called column_nameN. You
do not have to explicitly define the output column names; WebSphere DataStage will do this for you as
the job runs, but you may wish to do so to make the job more understandable.
Must do’s
WebSphere DataStage has many defaults, which means that it can be very easy to include Split Vector
stages in a job. This section specifies the minimum steps to take to get a Split Vector stage functioning.
WebSphere DataStage provides a versatile user interface, and there are many shortcuts to achieving a
particular end. This section describes the basic method; you will learn where the shortcuts are when you
get familiar with the product.
Stage page
The General tab allows you to specify an optional description of the stage. The Properties tab lets you
specify what the stage does. The Advanced tab allows you to specify how the stage executes.
Properties tab
The Split Vector stage has one property:
Category/Property | Values | Default | Mandatory? | Repeats? | Dependent of
Options/Vector Column | Name | N/A | Y | N | N/A
Options category
Vector column
Specifies the name of the vector whose elements you want to promote to a set of similarly named
top-level columns.
Advanced tab
This tab allows you to specify the following:
Input page
The Input page allows you to specify details about the incoming data sets. There can be only one input to
the Split Vector stage.
The General tab allows you to specify an optional description of the input link. The Partitioning tab
allows you to specify how incoming data is partitioned before being converted. The Columns tab
specifies the column definitions of incoming data. The Advanced tab allows you to change the default
buffering settings for the input link.
Details about Split Vector stage partitioning are given in the following section. See ″Stage Editors,″ for a
general description of the other tabs.
Partitioning tab
The Partitioning tab allows you to specify details about how the incoming data is partitioned or collected
before it is converted. It also allows you to specify that the data should be sorted before being operated
on.
By default the stage partitions in Auto mode. This attempts to work out the best partitioning method
depending on execution modes of current and preceding stages and how many nodes are specified in the
Configuration file.
If the Split Vector stage is operating in sequential mode, it will first collect the data using the default
Auto collection method.
The Partitioning tab allows you to override this default behavior. The exact operation of this tab depends
on:
v Whether the Split Vector stage is set to execute in parallel or sequential mode.
v Whether the preceding stage in the job is set to execute in parallel or sequential mode.
If the Split Vector stage is set to execute in parallel, then you can set a partitioning method by selecting
from the Partition type drop-down list. This will override any current partitioning.
The Partitioning tab also allows you to specify that data arriving on the input link should be sorted
before being converted. The sort is always carried out within data partitions. If the stage is partitioning
incoming data the sort occurs after the partitioning. If the stage is collecting data, the sort occurs before
the collection. The availability of sorting depends on the partitioning or collecting method chosen (it is
not available with the default Auto methods).
If NLS is enabled an additional button opens a dialog box allowing you to select a locale specifying the
collate convention for the sort.
You can also specify sort direction, case sensitivity, whether sorted as ASCII or EBCDIC, and whether null
columns will appear first or last for each column. Where you are using a keyed partitioning method, you
can also specify whether the column is used as a key for sorting, for partitioning, or for both. Select the
column in the Selected list and right-click to invoke the shortcut menu.
Output page
The Output page allows you to specify details about data output from the Split Vector stage. The Split
Vector stage can have only one output link.
The General tab allows you to specify an optional description of the output link. The Columns tab
specifies the column definitions of the data. The Advanced tab allows you to change the default buffering
settings for the output link.
The Head Stage selects the first N rows from each partition of an input data set and copies the selected
rows to an output data set. You determine which rows are copied by setting properties which allow you
to specify:
v The number of rows to copy
v The partition from which the rows are copied
v The location of the rows to copy
v The number of rows to skip before the copying operation begins
This stage is helpful in testing and debugging applications with large data sets. For example, the Partition
property lets you see data from a single partition to determine if the data is being partitioned as you
want it to be. The Skip property lets you access a certain portion of a data set.
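The following Python sketch (illustrative only, not DataStage code) shows one reading of how the Rows properties interact for a single partition: rows are skipped first, then every period-th row is copied until the requested number of rows has been taken. The exact interaction of Skip and Period in the stage may differ; this is an interpretation of the property descriptions given later in this chapter.

    def head_partition(rows, number_of_rows=10, skip=0, period=1, all_rows=False):
        # Skip the first `skip` rows, then take every `period`-th row until
        # `number_of_rows` rows have been copied (or all of them if all_rows is True).
        taken = 0
        for i, row in enumerate(rows):
            if i < skip or (i - skip) % period != 0:
                continue
            if not all_rows and taken >= number_of_rows:
                break
            yield row
            taken += 1

    print(list(head_partition(range(100), number_of_rows=5, skip=10, period=2)))   # [10, 12, 14, 16, 18]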
Examples
Head stage default behavior
After the job is run, we get a data set comprising four partitions, each containing ten rows. Here is a
sample of partition 0 as input to the Head stage, and partition 0 in its entirety as output by the stage:
"GC20442"," A VALIDER","Bp 18","9/30/1988 "
"GC26211","# 7112 GREENWAY PHOENIX W & S ","16012 N 32ND ST","3/2/1994 "
"GC25543","# 7114 PHOENIZ AZ W & S DC ONL","3622 S 30TH ST","1/7/1994 "
"GC20665","*","Avenue Charles Bedaux ZI du Me","6/6/1989 "
Skipping Data
In this example we are using the same data, but this time we are only interested in partition 0, and are
skipping the first 100 rows before we take our ten rows. The Head stage properties are set as follows:
Must do’s
WebSphere DataStage has many defaults, which means that it can be very easy to include Head stages in
a job. This section specifies the minimum steps to take to get a Head stage functioning. WebSphere
DataStage provides a versatile user interface, and there are many shortcuts to achieving a particular end.
This section describes the basic method; you will learn where the shortcuts are when you get familiar
with the product.
Stage page
The General tab allows you to specify an optional description of the stage. The Properties tab lets you
specify what the stage does. The Advanced tab allows you to specify how the stage executes.
Properties tab
The Properties tab allows you to specify properties which determine what the stage actually does. Some
of the properties are mandatory, although many have default settings. Properties without default settings
appear in the warning color (red by default) and turn black when you supply a value for them.
The following table gives a quick reference list of the properties and their attributes. A more detailed
description of each property follows.
Category/Property                     Values            Default  Mandatory?                     Repeats?  Dependent of
Rows/All Rows                         True/False        False    N                              N         N/A
Rows/Number of Rows (per Partition)   Count             10       N                              N         N/A
Rows/Period (per Partition)           Number            N/A      N                              N         N/A
Rows/Skip (per Partition)             Number            N/A      N                              N         N/A
Partitions/All Partitions             Partition Number  N/A      N                              Y         N/A
Partitions/Partition Number           Number            N/A      Y (if All Partitions = False)  Y         N/A
Rows category
All rows
Copy all input rows to the output data set. You can skip rows before Head performs its copy operation
by using the Skip property. The Number of Rows property is not needed if All Rows is true.
Number of rows (per partition)
Specify the number of rows to copy from each partition of the input data set to the output data set. The default value is 10. The Number of Rows property is not needed if All Rows is true.
Period (per partition)
Copy every Pth record in a partition, where P is the period. You can start the copy operation after records have been skipped by using the Skip property. P must be 1 or greater.
Skip (per partition)
Ignore the first number of rows of each partition of the input data set, where number is the number of rows to skip. The default skip count is 0.
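As an informal illustration only (standard UNIX commands on a flat file holding one partition's data, not DataStage syntax; the file name and the property values are examples), the same row-selection ideas look like this:
  head -n 10 partition0.txt                        # Number of Rows = 10: copy the first ten rows
  tail -n +101 partition0.txt | head -n 10         # Skip = 100: ignore the first 100 rows, then copy ten
  awk 'NR % 7 == 0' partition0.txt | head -n 10    # Period = 7: copy every seventh row, up to ten rows
The Head stage applies the equivalent selection independently to each partition of the input data set.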
Partitions category
All partitions
If False, copy records only from the indicated partition, specified by number. By default, the operator copies rows from all partitions.
Partition number
Specifies particular partitions to perform the Head operation on. You can specify the Partition Number
property multiple times to specify multiple partition numbers.
Advanced tab
This tab allows you to specify the following:
v Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the
input data is processed by the available nodes as specified in the Configuration file, and by any node
constraints specified on the Advanced tab. In Sequential mode the entire data set is processed by the
conductor node.
v Combinability mode. This is Auto by default, which allows WebSphere DataStage to combine the
operators that underlie parallel stages so that they run in the same process if it is sensible for this type
of stage.
v Preserve partitioning. This is Propagate by default. It adopts Set or Clear from the previous stage. You
can explicitly select Set or Clear. Select Set to request that the next stage in the job should attempt to
maintain the partitioning.
v Node pool and resource constraints. Select this option to constrain parallel execution to the node pool
or pools and/or resource pool or pools specified in the grid. The grid allows you to make choices from
drop down lists populated from the Configuration file.
v Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node
map. You can define a node map by typing node numbers into the text box or by clicking the browse
button to open the Available Nodes dialog box and selecting nodes from there. You are effectively
defining a new node pool for this stage (in addition to any node pools defined in the Configuration
file).
Input page
The Input page allows you to specify details about the incoming data sets. The Head stage expects one
input.
The General tab allows you to specify an optional description of the input link. The Partitioning tab
allows you to specify how incoming data is partitioned before being headed. The Columns tab specifies
the column definitions of incoming data. The Advanced tab allows you to change the default buffering
settings for the input link.
Details about Head stage partitioning are given in the following section. See ″Stage Editors,″ for a general
description of the other tabs.
Partitioning tab
The Partitioning tab allows you to specify details about how the incoming data is partitioned or collected
before it is headed. It also allows you to specify that the data should be sorted before being operated on.
By default the stage partitions in Auto mode. This attempts to work out the best partitioning method
depending on execution modes of current and preceding stages, and how many nodes are specified in the
Configuration file.
The Partitioning tab allows you to override this default behavior. The exact operation of this tab depends
on:
v Whether the Head stage is set to execute in parallel or sequential mode.
v Whether the preceding stage in the job is set to execute in parallel or sequential mode.
If the Head stage is set to execute in parallel, then you can set a partitioning method by selecting from
the Partition type drop-down list. This will override any current partitioning.
If the Head stage is set to execute in sequential mode, but the preceding stage is executing in parallel,
then you can set a collection method from the Collector type drop-down list. This will override the
default collection method.
The Partitioning tab also allows you to specify that data arriving on the input link should be sorted
before being headed. The sort is always carried out within data partitions. If the stage is partitioning
incoming data the sort occurs after the partitioning. If the stage is collecting data, the sort occurs before
the collection. The availability of sorting depends on the partitioning or collecting method chosen (it is
not available for the default auto methods).
If NLS is enabled an additional button opens a dialog box allowing you to select a locale specifying the
collate convention for the sort.
You can also specify sort direction, case sensitivity, whether sorted as ASCII or EBCDIC, and whether null
columns will appear first or last for each column. Where you are using a keyed partitioning method, you
can also specify whether the column is used as a key for sorting, for partitioning, or for both. Select the
column in the Selected list and right-click to invoke the shortcut menu.
Output page
The Output page allows you to specify details about data output from the Head stage. The Head stage
can have only one output link.
The General tab allows you to specify an optional description of the output link. The Columns tab
specifies the column definitions of the data. The Mapping tab allows you to specify the relationship
between the columns being input to the Head stage and the Output columns. The Advanced tab allows
you to change the default buffering settings for the output link.
Details about Head stage mapping are given in the following section. See ″Stage Editors,″ for a general
description of the other tabs.
Mapping tab
For the Head stage the Mapping tab allows you to specify how the output columns are derived, i.e., what
input columns map onto them or how they are generated.
The left pane shows the input columns and/or the generated columns. These are read only and cannot be
modified on this tab.
The right pane shows the output columns for each link. This has a Derivations field where you can
specify how the column is derived. You can fill it in by dragging input columns over, or by using the
Auto-match facility.
Chapter 46. Tail stage
The Tail Stage selects the last N records from each partition of an input data set and copies the selected
records to an output data set. You determine which records are copied by setting properties which allow
you to specify:
v The number of records to copy
v The partition from which the records are copied
This stage is helpful in testing and debugging applications with large data sets. For example, the Partition
property lets you see data from a single partition to determine if the data is being partitioned as you
want it to be. The Skip property lets you access a certain portion of a data set.
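As an informal analogy (again a standard UNIX command on a flat file holding one partition's data, not DataStage syntax; the file name and count are examples), the equivalent selection is:
  tail -n 10 partition0.txt    # Number of Rows = 10: copy the last ten rows of the partition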
Examples
The input data set comprises data from the Global customer billing information, which has previously been hash-partitioned into four partitions. We accept the default setting to sample ten rows from the end of each partition.
After the job is run we get a data set comprising four partitions each containing ten rows. Here is a
sample of partition 0 as input to the Tail stage, and partition 0 in its entirety as output by the stage:
"GC22593","ifm Electronic Ltd ","EASTLEIGH ","6/5/1991 "
"GC22998","ifm Electronic Ltd ","6/8 HIGH STREET ","2/7/1992 "
"GC20828","protectic ","31 Place Des Corolles","8/5/1989 "
"GC20910","protectic ","207 Cours Du Médoc","8/5/1989 "
"GC16187","southwest underground","596 w tarpon blvd","3/8/1985 "
"GC13583","ZURN INDUSTRIES","3033 NW 25TH BLVD","7/31/1978 "
"GC13546","ZURN INDUSTRIES INC","3033 NORTHWEST 25TH AVENUE","7/11/1978 "
"GC21204","Zilog Europe INC. "," ","8/5/1989 "
Must do’s
WebSphere DataStage has many defaults, which means that it can be very easy to include Tail stages in a job. This section specifies the minimum steps to take to get a Tail stage functioning. WebSphere DataStage provides a versatile user interface, and there are many shortcuts to achieving a particular end; this section describes the basic method, and you will learn where the shortcuts are as you become familiar with the product.
Properties tab
The Properties tab allows you to specify properties which determine what the stage actually does. Some
of the properties are mandatory, although many have default settings. Properties without default settings
appear in the warning color (red by default) and turn black when you supply a value for them.
The following table gives a quick reference list of the properties and their attributes. A more detailed
description of each property follows.
Category/Property                     Values            Default  Mandatory?                     Repeats?  Dependent of
Rows/Number of Rows (per Partition)   Count             10       N                              N         Key
Partitions/All Partitions             Partition Number  N/A      N                              Y         N/A
Partitions/Partition Number           Number            N/A      Y (if All Partitions = False)  Y         N/A
Rows category
Number of rows (per partition)
Specify the number of rows to copy from each partition of the input data set to the output data set. The
default value is 10.
Partitions category
All partitions
If False, copy records only from the indicated partition, specified by number. By default, the operator
copies records from all partitions.
Partition number
Specifies particular partitions to perform the Tail operation on. You can specify the Partition Number
property multiple times to specify multiple partition numbers.
Advanced tab
This tab allows you to specify the following:
v Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the
input data is processed by the available nodes as specified in the Configuration file, and by any node
constraints specified on the Advanced tab. In Sequential mode the entire data set is processed by the
conductor node.
v Combinability mode. This is Auto by default, which allows WebSphere DataStage to combine the
operators that underlie parallel stages so that they run in the same process if it is sensible for this type
of stage.
Input page
The Input page allows you to specify details about the incoming data sets. The Tail stage expects one
input.
The General tab allows you to specify an optional description of the input link. The Partitioning tab
allows you to specify how incoming data is partitioned before being tailed. The Columns tab specifies the
column definitions of incoming data. The Advanced tab allows you to change the default buffering
settings for the input link.
Details about Tail stage partitioning are given in the following section. See ″Stage Editors,″ for a general
description of the other tabs.
Partitioning tab
The Partitioning tab allows you to specify details about how the incoming data is partitioned or collected
before it is tailed. It also allows you to specify that the data should be sorted before being operated on.
By default the stage partitions in Auto mode. This attempts to work out the best partitioning method
depending on execution modes of current and preceding stages and how many nodes are specified in the
Configuration file. If the Preserve Partitioning option has been set on the previous stage in the job, this
stage will warn if it cannot preserve the partitioning of the incoming data.
If the Tail stage is operating in sequential mode, it will first collect the data using the default Auto
collection method.
The Partitioning tab allows you to override this default behavior. The exact operation of this tab depends
on:
v Whether the Tail stage is set to execute in parallel or sequential mode.
v Whether the preceding stage in the job is set to execute in parallel or sequential mode.
If the Tail stage is set to execute in parallel, then you can set a partitioning method by selecting from the
Partition type drop-down list. This will override any current partitioning.
If the Tail stage is set to execute in sequential mode, but the preceding stage is executing in parallel, then
you can set a collection method from the Collector type drop-down list. This will override the default
collection method.
The Partitioning tab also allows you to specify that data arriving on the input link should be sorted
before being tailed. The sort is always carried out within data partitions. If the stage is partitioning
incoming data the sort occurs after the partitioning. If the stage is collecting data, the sort occurs before
the collection. The availability of sorting depends on the partitioning method chosen.
If NLS is enabled an additional button opens a dialog box allowing you to select a locale specifying the
collate convention for the sort.
You can also specify sort direction, case sensitivity, whether sorted as ASCII or EBCDIC, and whether null
columns will appear first or last for each column. Where you are using a keyed partitioning method, you
can also specify whether the column is used as a key for sorting, for partitioning, or for both. Select the
column in the Selected list and right-click to invoke the shortcut menu.
Output page
The Output page allows you to specify details about data output from the Tail stage. The Tail stage can
have only one output link.
The General tab allows you to specify an optional description of the output link. The Columns tab
specifies the column definitions of the data. The Mapping tab allows you to specify the relationship between the columns being input to the Tail stage and the Output columns. The Advanced tab allows you to change the default buffering settings for the output link.
Details about Tail stage mapping are given in the following section. See ″Stage Editors,″ for a general
description of the other tabs.
Mapping tab
For the Tail stage the Mapping tab allows you to specify how the output columns are derived, i.e., what
input columns map onto them or how they are generated.
The left pane shows the input columns and/or the generated columns. These are read only and cannot be
modified on this tab.
The right pane shows the output columns for each link. This has a Derivations field where you can
specify how the column is derived. You can fill it in by dragging input columns over, or by using the
Auto-match facility.
Chapter 47. Sample stage
The Sample stage samples an input data set. It operates in two modes. In Percent mode, it extracts rows,
selecting them by means of a random number generator, and writes a given percentage of these to each
output data set. You specify the number of output data sets, the percentage written to each, and a seed
value to start the random number generator. You can reproduce a given distribution by repeating the
same number of outputs, the percentage, and the seed value.
In Period mode, it extracts every Nth row from each partition, where N is the period, which you supply. In this case all rows will be output to a single data set, so the stage used in this mode can only have a single output link.
For both modes you can specify the maximum number of rows that you want to sample from each partition.
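As a rough illustration of the two modes (standard UNIX tools on flat files, not DataStage syntax; the file names, the 10% figure, the seed, the period, and the maximum are examples):
  # Percent mode: keep roughly 10% of the rows, reproducibly, using a fixed seed
  awk -v seed=17 'BEGIN { srand(seed) } rand() * 100 < 10' input.txt > sample_10pct.txt
  # Period mode: keep every 5th row of one partition's data, up to a maximum of 100 rows
  awk 'NR % 5 == 0' partition0.txt | head -n 100 > sample_period5.txt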
Examples
In the Stage page Properties tab we specify which percentages are written to which outputs as follows:
When we run the job we end up with three data sets of different sizes as shown in the following tables (you would see this information if you looked at the data sets using the Data Set Manager):
In the Stage page Properties tab we specify the period sample as follows:
When we run the job it produces a sample from each partition. Here is the data sampled from partition 0:
Stage page
The General tab allows you to specify an optional description of the stage. The Properties tab lets you
specify what the stage does. The Advanced tab allows you to specify how the stage executes. The Link
Ordering tab allows you to specify which output links are which.
Properties tab
The Properties tab allows you to specify properties which determine what the stage actually does. Some
of the properties are mandatory, although many have default settings. Properties without default settings
appear in the warning color (red by default) and turn black when you supply a value for them.
The following table gives a quick reference list of the properties and their attributes. A more detailed
description of each property follows.
Category/Property               Values          Default  Mandatory?                    Repeats?  Dependent of
Options/Sample Mode             percent/period  percent  Y                             N         N/A
Options/Percent                 number          N/A      Y (if Sample Mode = Percent)  Y         N/A
Options/Output Link Number      number          N/A      Y                             N         Percent
Options/Seed                    number          N/A      N                             N         N/A
Options/Period (Per Partition)  number          N/A      Y (if Sample Mode = Period)   N         N/A
Options/Max Rows Per Partition  number          N/A      N                             N         N/A
Options category
Sample mode
Specifies the type of sample operation. You can sample on a percentage of input rows (percent), or you can sample the Nth row of every partition (period).
Percent
Specifies the sampling percentage for each output data set when you use a Sample Mode of Percent. You can
repeat this property to specify different percentages for each output data set. The sum of the percentages
specified for all output data sets cannot exceed 100%. You can specify a job parameter if required.
Seed
This is the number used to initialize the random number generator. You can specify a job parameter if
required. This property is only available if Sample Mode is set to percent.
Max rows per partition
This specifies the maximum number of rows that will be sampled from each partition.
Advanced tab
This tab allows you to specify the following:
v Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the
input data is processed by the available nodes as specified in the Configuration file, and by any node
constraints specified on the Advanced tab. In Sequential mode the entire data set is processed by the
conductor node.
v Combinability mode. This is Auto by default, which allows WebSphere DataStage to combine the
operators that underlie parallel stages so that they run in the same process if it is sensible for this type
of stage.
v Preserve partitioning. This is Propagate by default. It adopts Set or Clear from the previous stage. You
can explicitly select Set or Clear. Select Set to request the next stage should attempt to maintain the
partitioning.
v Node pool and resource constraints. Select this option to constrain parallel execution to the node pool
or pools and/or resource pool or pools specified in the grid. The grid allows you to make choices from
drop down lists populated from the Configuration file.
v Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node
map. You can define a node map by typing node numbers into the text box or by clicking the browse
button to open the Available Nodes dialog box and selecting nodes from there. You are effectively
defining a new node pool for this stage (in addition to any node pools defined in the Configuration
file).
Link ordering tab
By default the output links will be processed in the order they were added. To rearrange them, choose an
output link and click the up arrow button or the down arrow button.
Input page
The Input page allows you to specify details about the data set being sampled. There is only one input
link.
The General tab allows you to specify an optional description of the link. The Partitioning tab allows you
to specify how incoming data on the source data set link is partitioned. The Columns tab specifies the
column definitions of incoming data.
Details about Sample stage partitioning are given in the following section. See ″Stage Editors,″ for a
general description of the other tabs.
Partitioning tab
By default the stage uses the auto partitioning method. If the Preserve Partitioning option has been set on
the previous stage in the job, the stage will warn if it cannot preserve the partitioning of the incoming
data.
If the Sample stage is operating in sequential mode, it will first collect the data before writing it to the file
using the default auto collection method.
The Partitioning tab allows you to override this default behavior. The exact operation of this tab depends
on:
v Whether the Sample stage is set to execute in parallel or sequential mode.
v Whether the preceding stage in the job is set to execute in parallel or sequential mode.
If the Sample stage is set to execute in parallel, then you can set a partitioning method by selecting from
the Partition type drop-down list. This will override any current partitioning.
If the Sample stage is set to execute in sequential mode, but the preceding stage is executing in parallel,
then you can set a collection method from the Collector type drop-down list. This will override the
default auto collection method.
The Partitioning tab also allows you to specify that data arriving on the input link should be sorted
before the sample is performed. The sort is always carried out within data partitions. If the stage is
partitioning incoming data the sort occurs after the partitioning. If the stage is collecting data, the sort
occurs before the collection. The availability of sorting depends on the partitioning or collecting method
chosen (it is not available for the default auto methods).
If NLS is enabled an additional button opens a dialog box allowing you to select a locale specifying the
collate convention for the sort.
You can also specify sort direction, case sensitivity, whether sorted as ASCII or EBCDIC, and whether null
columns will appear first or last for each column. Where you are using a keyed partitioning method, you
can also specify whether the column is used as a key for sorting, for partitioning, or for both. Select the
column in the Selected list and right-click to invoke the shortcut menu.
Output page
The Output page allows you to specify details about data output from the Sample stage. In Percent mode, the stage can have any number of output links; in Period mode it can have only one output link.
Choose the link you want to work on from the Output Link drop down list.
The General tab allows you to specify an optional description of the output link. The Columns tab
specifies the column definitions of outgoing data. The Mapping tab allows you to specify the relationship
between the columns being input to the Sample stage and the output columns. The Advanced tab allows you to change the default buffering settings for the output links.
Mapping tab
For Sample stages the Mapping tab allows you to specify how the output columns are derived, i.e., what
input columns map onto them.
The left pane shows the columns of the sampled data. These are read only and cannot be modified on
this tab. This shows the meta data from the incoming link.
The right pane shows the output columns for the output link. This has a Derivations field where you can
specify how the column is derived. You can fill it in by dragging input columns over, or by using the
Auto-match facility.
Chapter 48. Peek stage
The Peek stage lets you print record column values either to the job log or to a separate output link as the stage copies records from its input data set to one or more output data sets. Like the Head stage (Chapter 45, “Head stage,” on page 489) and the Tail stage (Chapter 46, “Tail stage”), the
Peek stage can be helpful for monitoring the progress of your application or to diagnose a bug in your
application.
Must do’s
WebSphere DataStage has many defaults, which means that it can be very easy to include Peek stages in a job. This section specifies the minimum steps to take to get a Peek stage functioning. WebSphere DataStage provides a versatile user interface, and there are many shortcuts to achieving a particular end; this section describes the basic method, and you will learn where the shortcuts are as you become familiar with the product.
Stage page
The General tab allows you to specify an optional description of the stage. The Properties tab lets you
specify what the stage does. The Advanced tab allows you to specify how the stage executes.
Properties tab
The following table gives a quick reference list of the properties and their attributes. A more detailed
description of each property follows.
Category/Property                       Values          Default  Mandatory?                             Repeats?  Dependent of
Rows/All Records (After Skip)           True/False      False    N                                      N         N/A
Rows/Number of Records (Per Partition)  number          10       Y                                      N         N/A
Rows/Period (per Partition)             Number          N/A      N                                      N         N/A
Rows/Skip (per Partition)               Number          N/A      N                                      N         N/A
Columns/Peek All Input Columns          True/False      True     Y                                      N         N/A
Columns/Input Column to Peek            Input Column    N/A      Y (if Peek All Input Columns = False)  Y         N/A
Partitions/All Partitions               True/False      True     Y                                      N         N/A
Partitions/Partition Number             number          N/A      Y (if All Partitions = False)          Y         N/A
Options/Peek Records Output Mode        Job Log/Output  Job Log  N                                      N         N/A
Options/Show Column Names               True/False      True     N                                      N         N/A
Options/Delimiter String                space/nl/tab    space    N                                      N         N/A
Rows category
All records (after skip)
True to print all records from each partition. Set to False by default.
Number of records (per partition)
Specifies the number of records to print from each partition. The default is 10.
Period (per partition)
Print every Pth record in a partition, where P is the period. You can start the copy operation after records have been skipped by using the Skip property. P must be 1 or greater.
Skip (per partition)
Ignore the first number of rows of each partition of the input data set, where number is the number of rows to skip. The default skip count is 0.
Columns category
Peek all input columns
True by default and prints all the input columns. Set to False to specify that only selected columns will be
printed and specify these columns using the Input Column to Peek property.
Input column to peek
If you have set Peek All Input Columns to False, use this property to specify a column to be printed. Repeat the property to specify multiple columns.
Partitions category
All partitions
Set to True by default. Set to False to specify that only certain partitions should have columns printed,
and specify which partitions using the Partition Number property.
Partition number
If you have set All Partitions to False, use this property to specify which partition you want to print
columns from. Repeat the property to specify multiple partitions.
Options category
Peek records output mode
Specifies whether the output should go to an output column (the Peek Records column) or to the job log.
Show column names
If True, causes the stage to print the column name, followed by a colon, followed by the column value. If
False, the stage prints only the column value, followed by a space. It is True by default.
Delimiter string
The string to use as a delimiter on columns. Can be space, tab or newline. The default is space.
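As an informal sketch of the output format only (not DataStage syntax), peeking the first ten records of a comma-separated partition file with Show Column Names = True and the default space delimiter might look like this; the file name and the column names custid and name are invented for the example:
  head -n 10 partition0.csv | awk -F, '{ print "custid:" $1, "name:" $2 }'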
Advanced tab
This tab allows you to specify the following:
v Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the
input data is processed by the available nodes as specified in the Configuration file, and by any node
constraints specified on the Advanced tab. In Sequential mode the entire data set is processed by the
conductor node.
v Combinability mode. This is Auto by default, which allows WebSphere DataStage to combine the
operators that underlie parallel stages so that they run in the same process if it is sensible for this type
of stage.
v Preserve partitioning. This is Propagate by default. It adopts Set or Clear from the previous stage. You
can explicitly select Set or Clear. Select Set to request that the next stage in the job should attempt to
maintain the partitioning.
Link ordering tab
By default the last link added will represent the peek data set. To rearrange the links, choose an output
link and click the up arrow button or the down arrow button.
Input page
The Input page allows you to specify details about the incoming data sets. The Peek stage expects one
incoming data set.
The General tab allows you to specify an optional description of the input link. The Partitioning tab
allows you to specify how incoming data is partitioned before being peeked. The Columns tab specifies
the column definitions of incoming data. The Advanced tab allows you to change the default buffering
settings for the input link.
Details about Peek stage partitioning are given in the following section. See ″Stage Editors,″ for a general
description of the other tabs.
Partitioning tab
The Partitioning tab allows you to specify details about how the incoming data is partitioned or collected
before it is peeked. It also allows you to specify that the data should be sorted before being operated on.
By default the stage partitions in Auto mode. This attempts to work out the best partitioning method
depending on execution modes of current and preceding stages and how many nodes are specified in the
Configuration file. If the Preserve Partitioning option has been set on the previous stage in the job, this
stage will warn if it cannot preserve the partitioning of the incoming data.
If the Peek stage is operating in sequential mode, it will first collect the data using the default Auto
collection method.
The Partitioning tab allows you to override this default behavior. The exact operation of this tab depends
on:
v Whether the Peek stage is set to execute in parallel or sequential mode.
v Whether the preceding stage in the job is set to execute in parallel or sequential mode.
If the Peek stage is set to execute in parallel, then you can set a partitioning method by selecting from the
Partition type drop-down list. This will override any current partitioning.
If the Peek stage is set to execute in sequential mode, but the preceding stage is executing in parallel,
then you can set a collection method from the Collector type drop-down list. This will override the
default collection method.
The Partitioning tab also allows you to specify that data arriving on the input link should be sorted
before being peeked. The sort is always carried out within data partitions. If the stage is partitioning
incoming data the sort occurs after the partitioning. If the stage is collecting data, the sort occurs before
the collection. The availability of sorting depends on the partitioning or collecting method chosen (it is
not available with the default auto methods).
If NLS is enabled an additional button opens a dialog box allowing you to select a locale specifying the
collate convention for the sort.
You can also specify sort direction, case sensitivity, whether sorted as ASCII or EBCDIC, and whether null
columns will appear first or last for each column. Where you are using a keyed partitioning method, you
can also specify whether the column is used as a key for sorting, for partitioning, or for both. Select the
column in the Selected list and right-click to invoke the shortcut menu.
Output page
The General tab allows you to specify an optional description of the output link. The Columns tab
specifies the column definitions of the data. The Mapping tab allows you to specify the relationship
between the columns being input to the Peek stage and the Output columns. The Advanced tab allows
you to change the default buffering settings for the output links.
Details about Peek stage mapping are given in the following section. See ″Stage Editors,″ for a general
description of the other tabs.
Mapping tab
For the Peek stage the Mapping tab allows you to specify how the output columns are derived, i.e., what
input columns map onto them or how they are generated.
The left pane shows the columns being peeked. These are read only and cannot be modified on this tab.
The right pane shows the output columns for each link. This has a Derivations field where you can
specify how the column is derived. You can fill it in by dragging input columns over, or by using the
Auto-match facility.
Chapter 49. Row Generator stage
The Row Generator stage produces a set of mock data fitting the specified meta data. This is useful
where you want to test your job but have no real data available to process. (See also the Column
Generator stage which allows you to add extra columns to existing data sets, Chapter 50, “Column
Generator stage,” on page 525.)
The meta data you specify on the output link determines the columns you are generating.
Examples
We need to tell the stage how many columns the generated data set should contain and what type each column has.
We do this in the Output page Columns tab by specifying the following column definitions:
When we run the job, WebSphere DataStage generates a data set containing the following rows (sample
shown):
We can specify more details about each data type if required to shape the data being generated.
The Edit Column Meta Data dialog box contains different options for each data type. The possible
options are described in ″Generator″. We can use the Next and Previous buttons to go through all our
columns.
Using this dialog box we specify the following for the generated data:
v string
– Algorithm = cycle
– seven separate Values (assorted animals).
v date
– Epoch = 1958-08-18
– Type = cycle
– Increment = 10
v time
– Scale factor = 60
– Type = cycle
– Increment = 1
v timestamp
Here is the data generated by these settings; compare it with the data generated by the default settings.
In this example we are generating a data set comprising two integers. One is generated by cycling, one
by random number generation.
The random integer’s seed value is set to the partition number, and the limit to the total number of
partitions.
When we run this job in parallel, on a system with four nodes, the data generated in partition 0 is as
follows:
integer1 integer2
0 1
4 2
8 3
12 2
16 3
20 1
24 2
28 3
32 3
36 1
40 0
44 3
48 3
52 2
56 0
60 0
64 1
68 3
72 2
76 2
80 0
84 1
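As a conceptual sketch only (awk standing in for the generator, not DataStage syntax), something similar can be produced for partition 0 as follows; the starting value and increment of the cycled column are assumptions chosen to match the sample above, and the exact random sequence will differ:
  # part = this partition's number (0 to 3), partcount = 4 nodes in the Configuration file
  awk -v part=0 -v partcount=4 'BEGIN {
      srand(part)                                       # random column: seed = partition number
      for (i = 0; i < 22; i++)
          print i * partcount, int(rand() * partcount)  # cycled column steps by 4; random column limited to 0-3
  }'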
Must do’s
WebSphere DataStage has many defaults, which means that it can be very easy to include Row Generator stages in a job. This section specifies the minimum steps to take to get a Row Generator stage functioning. WebSphere DataStage provides a versatile user interface, and there are many shortcuts to achieving a particular end; this section describes the basic method, and you will learn where the shortcuts are as you become familiar with the product.
Stage page
The General tab allows you to specify an optional description of the stage. The Advanced tab allows you
to specify how the stage executes.
Advanced tab
This tab allows you to specify the following:
v Execution Mode. The Generate stage executes in Sequential mode by default. You can select Parallel
mode to generate data sets in separate partitions.
v Combinability mode. This is Auto by default, which allows WebSphere DataStage to combine the
operators that underlie parallel stages so that they run in the same process if it is sensible for this type
of stage.
v Preserve partitioning. This is Propagate by default. If you have an input data set, it adopts Set or
Clear from the previous stage. You can explicitly select Set or Clear. Select Set to request the next stage
should attempt to maintain the partitioning.
v Node pool and resource constraints. Select this option to constrain parallel execution to the node pool
or pools and/or resource pool or pools specified in the grid. The grid allows you to make choices from
drop down lists populated from the Configuration file.
v Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node
map. You can define a node map by typing node numbers into the text box or by clicking the browse
button to open the Available Nodes dialog box and selecting nodes from there. You are effectively
defining a new node pool for this stage (in addition to any node pools defined in the Configuration
file).
Output page
The Output page allows you to specify details about data output from the Row Generator stage.
The General tab allows you to specify an optional description of the output link. The Properties tab lets
you specify what the stage does. The Columns tab specifies the column definitions of outgoing data. The
Advanced tab allows you to change the default buffering settings for the output link.
Properties tab
Category/Property           Values    Default  Mandatory?  Repeats?  Dependent of
Options/Number of Records   number    10       Y           N         N/A
Options/Schema File         pathname  N/A      N           N         N/A
Options category
Number of records
The number of records you want your generated data set to contain.
Schema file
By default the stage will take the meta data defined on the output link to base the mock data set on. But
you can specify the column definitions in a schema file, if required. You can browse for the schema file or
specify a job parameter.
Chapter 50. Column Generator stage
The Column Generator stage adds columns to incoming data and generates mock data for these columns
for each data row processed. The new data set is then output. (See also the Row Generator stage which
allows you to generate complete sets of mock data, Chapter 49, “Row Generator stage,” on page 519.)
Example
For our example we are going to generate an extra column for a data set containing a list of
seventeenth-century inhabitants of Woodstock, Oxfordshire. The extra column will contain a unique id for
each row.
The columns for the data input to the Column Generator stage are as follows:
We set the Column Generator properties to add an extra column called uniqueid to our data set as
follows:
v Column Method = Explicit
v Column To Generate = uniqueid
The new column now appears on the Output page Mapping tab and can be mapped across to the output link (so it appears on the Output page Columns tab).
In this example we select the uniqueid column on the Output page Columns tab, then choose Edit Row...
from the shortcut menu. The Edit Column Meta Data dialog box appears and lets us specify more details
about the data that will be generated for the new column. First we change the type from the default of
char to integer. Because we are running the job in parallel, we want to ensure that the id we are
generating will be unique across all partitions, to do this we set the initial value to the partition number
(using the special value `part’) and the increment to the number of partitions (using the special
`partcount’):
When we run the job in parallel on a four-node system the stage will generate the uniqueid column for
each row. Here are samples of partition 0 and partition 1 to show how the unique number is generated:
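With these settings, partition 0 receives the ids 0, 4, 8, ... and partition 1 receives 1, 5, 9, ..., so no two partitions ever produce the same value. As an awk analogy (not the DataStage generator; the input file name is an example):
  # part would be 0, 1, 2 or 3 depending on the partition; partcount is 4 on a four-node system
  awk -v part=0 -v partcount=4 '{ print part + (NR - 1) * partcount "," $0 }' partition0.csv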
Must do’s
WebSphere DataStage has many defaults, which means that it can be very easy to include Column Generator stages in a job. This section specifies the minimum steps to take to get a Column Generator stage functioning. WebSphere DataStage provides a versatile user interface, and there are many shortcuts to achieving a particular end; this section describes the basic method, and you will learn where the shortcuts are as you become familiar with the product.
Stage page
The General tab allows you to specify an optional description of the stage. The Properties tab lets you
specify what the stage does. The Advanced tab allows you to specify how the stage executes.
Properties tab
The Properties tab allows you to specify properties which determine what the stage actually does. Some
of the properties are mandatory, although many have default settings. Properties without default settings
appear in the warning color (red by default) and turn black when you supply a value for them.
The following table gives a quick reference list of the properties and their attributes. A more detailed
description of each property follows.
Category/Property           Values                  Default   Mandatory?  Repeats?                             Dependent of
Options/Column Method       Explicit/Column Method  Explicit  Y           N                                    N/A
Options/Column to Generate  output column           N/A       Y           Y (if Column Method = Explicit)      N/A
Options/Schema File         pathname                N/A       N           Y (if Column Method = Schema File)   N/A
Options category
Column method
Select Explicit if you are going to specify the column or columns you want the stage to generate data for.
Select Schema File if you are supplying a schema file containing the column definitions.
Column to generate
When you have chosen a column method of Explicit, this property allows you to specify which output
columns the stage is generating data for. Repeat the property to specify multiple columns. You can
specify the properties for each column using the Parallel tab of the Edit Column Meta Dialog box
(accessible from the shortcut menu on the columns grid of the output Columns tab). You can use the
Column Selection dialog box to specify several columns at once if required.
Schema file
When you have chosen a column method of Schema File, this property allows you to specify the column definitions in a schema file. You can browse for the schema file or specify a job parameter.
Advanced tab
This tab allows you to specify the following:
v Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the
input data is processed by the available nodes as specified in the Configuration file, and by any node
constraints specified on the Advanced tab. In Sequential mode the entire data set is processed by the
conductor node.
v Combinability mode. This is Auto by default, which allows WebSphere DataStage to combine the
operators that underlie parallel stages so that they run in the same process if it is sensible for this type
of stage.
v Preserve partitioning. This is Propagate by default. If you have an input data set, it adopts Set or
Clear from the previous stage. You can explicitly select Set or Clear. Select Set to request the next stage
should attempt to maintain the partitioning.
v Node pool and resource constraints. Select this option to constrain parallel execution to the node pool
or pools and/or resource pool or pools specified in the grid. The grid allows you to make choices from
drop down lists populated from the Configuration file.
v Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node
map. You can define a node map by typing node numbers into the text box or by clicking the browse
button to open the Available Nodes dialog box and selecting nodes from there. You are effectively
defining a new node pool for this stage (in addition to any node pools defined in the Configuration
file).
Input page
The Input page allows you to specify details about the incoming data set you are adding generated
columns to. There is only one input link and this is optional.
The General tab allows you to specify an optional description of the link. The Partitioning tab allows you
to specify how incoming data on the source data set link is partitioned. The Columns tab specifies the
column definitions of incoming data. The Advanced tab allows you to change the default buffering
settings for the input link.
Details about Generate stage partitioning are given in the following section. See ″Stage Editors,″ for a
general description of the other tabs.
Partitioning on input links
If the Column Generator stage is operating in sequential mode, it will first collect the data before writing
it to the file using the default auto collection method.
The Partitioning tab allows you to override this default behavior. The exact operation of this tab depends
on:
v Whether the Column Generator stage is set to execute in parallel or sequential mode.
v Whether the preceding stage in the job is set to execute in parallel or sequential mode.
If the Column Generator stage is set to execute in parallel, then you can set a partitioning method by
selecting from the Partition type drop-down list. This will override any current partitioning.
If the Column Generator stage is set to execute in sequential mode, but the preceding stage is executing
in parallel, then you can set a collection method from the Collector type drop-down list. This will
override the default auto collection method.
The Partitioning tab also allows you to specify that data arriving on the input link should be sorted
before the column generate operation is performed. The sort is always carried out within data partitions.
If the stage is partitioning incoming data the sort occurs after the partitioning. If the stage is collecting
data, the sort occurs before the collection. The availability of sorting depends on the partitioning or
collecting method chosen (it is not available for the default auto methods).
You can also specify sort direction, case sensitivity, whether sorted as ASCII or EBCDIC, and whether null
columns will appear first or last for each column. Where you are using a keyed partitioning method, you
can also specify whether the column is used as a key for sorting, for partitioning, or for both. Select the
column in the Selected list and right-click to invoke the shortcut menu.
Output page
Details about Column Generator stage mapping are given in the following section. See ″Stage Editors,″ for
a general description of the other tabs.
Mapping tab
For Column Generator stages the Mapping tab allows you to specify how the output columns are
derived, i.e., how the generated data maps onto them.
The left pane shows the generated columns. These are read only and cannot be modified on this tab.
These columns are automatically mapped onto the equivalent output columns.
The right pane shows the output columns for the output link. This has a Derivations field where you can
specify how the column is derived. You can fill it in by dragging input columns over, or by using the
Auto-match facility.
The right pane represents the data being output by the stage after the generate operation.
Chapter 51. Write Range Map stage
The Write Range Map stage takes an input data set produced by sampling and sorting a data set and writes it to a file in a form usable by the range partitioning method. The range partitioning method uses the sampled and sorted data set to determine partition boundaries.
A typical use for the Write Range Map stage would be in a job which used the Sample stage to sample a
data set, the Sort stage to sort it and the Write Range Map stage to write the range map which can then
be used with the range partitioning method to write the original data set to a file set.
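Conceptually, a range map is little more than a sorted list of key values that mark the partition boundaries. As a rough analogy only (standard UNIX tools on a flat sample file, not the actual range map format; the file name, key position, and counts are examples), three boundary keys dividing a 1,000-row sorted sample into four ranges could be derived like this:
  # sort the sampled rows on the key column, then take every 250th key as a partition boundary
  sort -t, -k1,1 sample.csv | awk -F, 'NR % 250 == 0 && NR < 1000 { print $1 }'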
Example
In this example, we sample the data in a flat file then pass it to the Write Range Map stage. The stage
sorts the data itself before constructing a range map and writing it to a file.
Must do’s
WebSphere DataStage has many defaults, which means that it can be very easy to include Write Range Map stages in a job. This section specifies the minimum steps to take to get a Write Range Map stage functioning. WebSphere DataStage provides a versatile user interface, and there are many shortcuts to achieving a particular end; this section describes the basic method, and you will learn where the shortcuts are as you become familiar with the product.
Stage page
The General tab allows you to specify an optional description of the stage. The Advanced tab allows you
to specify how the stage executes. The NLS Locale tab appears if you have NLS enabled on your system.
It allows you to select a locale other than the project default to determine collating rules.
Input page
The Input page allows you to specify details about how the Write Range Map stage writes the range map
to a file. The Write Range Map stage can have only one input link.
The General tab allows you to specify an optional description of the input link. The Properties tab allows
you to specify details of exactly what the link does. The Partitioning tab allows you to view collecting
details. The Columns tab specifies the column definitions of the data. The Advanced tab allows you to
change the default buffering settings for the input link.
Details about Write Range Map stage properties and collecting are given in the following sections. See
″Stage Editors,″ for a general description of the other tabs.
Properties tab
The following table gives a quick reference list of the properties and their attributes. A more detailed
description of each property follows.
Category/Property         Values            Default  Mandatory?  Repeats?  Dependent of
Options/File Update Mode  Create/Overwrite  Create   Y           N         N/A
Options category
File update mode
This is set to Create by default. If the file you specify already exists this will cause an error. Choose
Overwrite to overwrite existing files.
Key
This allows you to specify the key for the range map. Choose an input column from the drop-down list.
You can specify a composite key by specifying multiple key properties. You can use the Column Selection
dialog box to select several keys at once if required. Key has the following dependent properties:
Sort Order
Choose Ascending or Descending. The default is Ascending.
Case Sensitive
This property is optional. Use this to specify whether each group key is case sensitive or not, this
is set to True by default, i.e., the values “CASE” and “case” would not be judged equivalent.
Nulls Position
This property is optional. By default columns containing null values appear first in the sorted
data set. To override this default so that columns containing null values appear last in the sorted
data set, select Last.
Sort as EBCDIC
To sort as in the EBCDIC character set, choose True.
Specify the file that is to hold the range map. You can browse for a file or specify a job parameter.
Partitioning tab
The Partitioning tab normally allows you to specify details about how the incoming data is partitioned or
collected before it is written to the file or files. In the case of the Write Range Map stage execution is
always sequential, so there is never a need to set a partitioning method.
You can set a collection method if collection is required. The following Collection methods are available:
v (Auto). This is the default collection method for the Write Range Map stage. Normally, when you are
using Auto mode, WebSphere DataStage will eagerly read any row from any input partition as it
becomes available.
v Ordered. Reads all records from the first partition, then all records from the second partition, and so
on.
The Partitioning tab also allows you to specify that data arriving on the input link should be sorted
before the write range map operation is performed. If the stage is collecting data, the sort occurs before
the collection. The availability of sorting depends on the collecting method chosen (it is not available for
the default auto methods).
If NLS is enabled an additional button opens a dialog box allowing you to select a locale specifying the
collate convention for the sort.
You can also specify sort direction, case sensitivity, whether sorted as ASCII or EBCDIC, and whether null
columns will appear first or last for each column. Where you are using a keyed partitioning method, you
can also specify whether the column is used as a key for sorting, for partitioning, or for both. Select the
column in the Selected list and right-click to invoke the shortcut menu. Because the partition mode is set
and cannot be overridden, you cannot use the stage sort facilities, so these are disabled.
Chapter 52. Parallel jobs on USS
You specify that you want to run jobs on USS systems in the WebSphere DataStage Administrator client.
This is done on a per-project basis. Once you have elected to deploy jobs in a project to USS, you can no
longer run parallel jobs from that project on your WebSphere DataStage server unless you opt to switch
back. See WebSphere DataStage Administrator Client Guide for details on how to set up a project for
deployment to USS.
Note: You cannot include server shared containers, BASIC Transformer stages, or supplementary stages
in a job intended for deployment on a USS system. The DB2 stage is the only database stage
currently supported.
Set up
To set up the deployment and running of parallel jobs on a USS system, you need to take the following
steps:
1. Use the WebSphere DataStage Administrator to specify a project that will be used for parallel jobs
intended for USS deployment (see WebSphere DataStage Administrator Client Guide).
2. Install the parallel engine on the USS machine and set up access to it as described in IBM WebSphere
DataStage Installation and Administration Guide.
3. On the server machine, set the environment variable APT_ORCHHOME to identify the parallel
engine’s top-level directory on the USS system.
4. On the WebSphere DataStage server machine, construct a suitable configuration file, and set the
APT_CONFIG_FILE environment variable to point to it.
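For example, steps 3 and 4 might be carried out on the server by setting two environment variables; the directory and file names shown are purely illustrative:
  export APT_ORCHHOME=/usr/lpp/pxengine            # parallel engine top-level directory on the USS system
  export APT_CONFIG_FILE=/u/dsadm/uss_config.apt   # configuration file constructed for the USS machine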
Deployment options
There are two options for deploying on USS:
v Under control of WebSphere DataStage. Jobs run under the control of the WebSphere DataStage
Director client. This method suits the scenario where the job developer has direct access to the USS
machine.
v Deploy standalone. Parallel jobs scripts are transferred to the USS machine and run there totally
independently of WebSphere DataStage. This method suits the scenario where jobs are run by
operators or external schedulers, maybe overnight.
You can have both of these options selected at once, if required, so you do not have to decide how to run
a job until you come to run it.
When you are ready to run the job, you start the WebSphere DataStage Director client and select the job
and run it as you would any other job. WebSphere DataStage sends two more files to the USS machine,
As the job runs, logging information is captured from the remotely executing job and placed in the
WebSphere DataStage log in real time. The log messages indicate that they originate from a remote
machine. You can monitor the job from the Director, and collect operational metadata.
Note: Only size-based monitoring is available when jobs run on the USS system: i.e., you cannot set
APT_MONITOR_TIME, only APT_MONITOR_SIZE.
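For example, assuming APT_MONITOR_SIZE is measured in rows processed between monitor updates, you could set it in the job or project environment like this (the value is only an example):
  export APT_MONITOR_SIZE=10000   # update the job monitor every 10,000 rows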
You can run a job on a USS system using the command line or job control interfaces on your WebSphere
DataStage Server as described in the WebSphere DataStage Parallel Job Developer Guide. You can also include
jobs in a job sequence.
There are certain restrictions on the use of the built-in WebSphere DataStage macros when running jobs
on USS:
Macro Restrictions
DSHostName
Name of WebSphere DataStage server, not of USS machine
DSProjectName
Supported
DSJobController
Supported
DSJobName
Supported
DSJobStartTimeStamp
Supported, but gives server date and time
DSJobStartDate
Supported, but gives server date
DSJobStartTime
Supported, but gives server time
DSJobWaveNo
Supported
DSJobInvocationId
Supported
DSProjectMapName
Supported (internal value)
When you deploy under the control of WebSphere DataStage, certain other functions besides running jobs
are available:
v View Data.
v Data set management tool.
v Configuration file editing and validation.
v Deployment of build stages on USS.
v Importing Orchestrate schemas.
Special considerations about these features are described in the following sections.
Data set management tool
The tool is available from USS projects if FTP and remote shell options are enabled. The header bar of the
data set Browse File dialog box indicates that you are browsing data sets on a remote machine.
We recommend that you change the APT_CONFIG_FILE environment variable in your project to point to
the location of your USS configuration file on the server machine (WebSphere DataStage knows the
location on the USS machine and translates as appropriate), although you can set it to point directly to the file on the USS machine itself.
Deploy standalone
With this option selected, you design a job as normal using the Designer. When you compile the job,
WebSphere DataStage produces files which can be transferred to the USS machine using your preferred
method. You can then set the correct execute permissions for the files and run the job on the USS
machine by executing scripts.
If you have specified a remote machine name in the WebSphere DataStage Administrator Project
Properties Remote tab (see WebSphere DataStage Administrator Client Guide), files will automatically be sent
to the USS machine. The job can then be run by executing the scripts on the machine to compile any
transformers the job contains and then run the job.
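A minimal sketch of running a deployed job from a shell on the USS machine, assuming the generated files described under ″Generated files″ below have been transferred to the job directory (the directory path is an example):
  cd /u/dsadm/deploy/uss_project/RT_SC123   # the job's deployment directory (example path)
  chmod +x *.sh                             # give the generated scripts execute permission
  ./pxcompile.sh                            # compile Transformer stages, if the job contains any
  ./pxrun.sh                                # run OshScript.osh using jpdepfile and evdepfile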
Different restrictions apply to the WebSphere DataStage built-in macros when you run a job using the
deploy standalone method:
Macro Restrictions
DSHostName
Name of WebSphere DataStage server, not of USS machine
DSProjectName
Supported
DSJobController
Not supported
DSJobName
Supported
DSJobStartTimeStamp
Not supported
DSJobStartDate
Not supported
DSJobStartTime
Not supported
DSJobWaveNo
Not supported
DSJobInvocationId
Not supported
DSProjectMapName
Supported (internal value)
Details of the files that are created and where they should be transferred to on the USS machine are given
in the following section, ″Implementation Details″.
Implementation details
This section describes the directory structure required for job deployment on USS machines. It also
describes files that are generated by WebSphere DataStage when you compile a job in a USS project.
Directory structure
Each job deployed to the USS machine must have a dedicated directory. If you are allowing WebSphere
DataStage to automatically send files to the USS machine for you at compile time, the files are, by
default, copied to the following directory:
/Base_directory/project_name/RT_SCjobnumber
v Base_directory. You must specify a specific base directory in the WebSphere DataStage Administrator
(see WebSphere DataStage Administrator Client Guide).
v project_name. This is a directory named after the USS project.
v RT_SCjobnumber. This is the directory that holds all the deployment files for a particular job. By default
the job directory is RT_SCjobnumber, where jobnumber is the internal job number allocated by WebSphere
DataStage, but you can change the form of this name in the Administrator client (see WebSphere
DataStage Administrator Client Guide).
On the WebSphere DataStage server the files are copied to the directory:
$DSHOME/../Projects/project_name/RT_SCjobnumber
Generated files
When you compile a parallel job intended for deployment on a USS machine, it produces various files
which are copied into the job directory on the USS machine. The files are as follows:
File Purpose
OshScript.osh
The main parallel job script. This script is run automatically via a remote shell when jobs are run
under the control of WebSphere DataStage. The script needs to be run manually using the
pxrun.sh script when jobs are deployed standalone.
pxrun.sh
This script is run in order to run OshScript.osh when jobs are deployed standalone.
jpdepfile
This is used by pxrun.sh. It contains the job parameters for a job deployed standalone when it is
run. It is based on the default job parameters when the job was compiled.
evdepfile
This is sourced by pxrun.sh. It contains the environment variables for a job deployed standalone
when it is run. It is based on the environment variables set when the job was compiled.
pxcompile.sh
This file is generated if the job contains one or more Transformer stages and the Deploy
Standalone option is selected. It is used to control the compilation of the transformers on the USS
machine.
internalidentifier_jobname_stagename.trx
There is a file for each Transformer stage in the job; it contains the source code for each stage.
internalidentifier_jobname_stagename.trx.sh
This is a script for compiling Transformer stages. There is one for each transformer stage. It is
called by pxcompile.sh; it can be called individually if required.
internalidentifier_jobname_stagename.trx.osh
Parallel job script to compile the corresponding Transformer stage. Called from corresponding .sh
file.
Where you are deploying jobs under the control of WebSphere DataStage, you will also see the following
files in the job directory on the USS machine:
v OshExecute.sh. This executes the job script, OshScript.osh, under the control of WebSphere DataStage.
You should NOT attempt to run this file manually.
v jpfile and evfile. These are visible while the job is actually running and contain the job parameters and
environment variables used for the job run.
If your job contains one or more Transformer stages, you will also see the following files in your job
directory:
v jobnamestagename.trx.so. Object file, one for each Transformer stage.
v jobnamestagename.trx.C. If the compilation of the corresponding Transformer stage fails for some reason,
this file is left behind.
If you deploy jobs under the control of WebSphere DataStage the configuration maintained on the server
will be automatically mirrored on the USS machine when you edit it.
If you deploy jobs standalone, you must ensure that the USS system has a valid configuration file
identified by the environment variable APT_CONFIG_FILE.
Note: You can enter commands in the Custom deployment commands field in the Administrator
Project Properties dialog box Remote page to further automate the process of deployment,
for example:
tar -cvf %j.tar *
cp %j.tar /home/mydeploy
7. When you are ready to run your job, on the USS machine, go to the job directory for the required job.
8. If your job contains Transformer stages, execute the following file:
pxcompile.sh
9. When your Transformer stages have successfully compiled, run the job by executing the following file:
pxrun.sh
You can create and read data sets using the Data Set stage, which is described in Chapter 6, “File set
stage,” on page 99. WebSphere DataStage also provides a utility for managing data sets from outside a
job. This utility is available from the WebSphere DataStage Designer and Director clients.
(Figure: the structure of a data set, showing one or more data files divided into segments.)
The descriptor file for a data set contains the following information:
v Data set header information.
v Creation time and date of the data set.
v The schema of the data set.
v A copy of the configuration file used when the data set was created.
Partitions
The partition grid shows the partitions the data set contains and describes their properties:
v #. The partition number.
v Node. The processing node that the partition is currently assigned to.
v Records. The number of records the partition contains.
v Blocks. The number of blocks the partition contains.
v Bytes. The number of bytes the partition contains.
Segments
Click on an individual partition to display the associated segment details. This contains the following
information:
v #. The segment number.
v Created. Date and time of creation.
v Bytes. The number of bytes in the segment.
v Pathname. The name and path of the file containing the segment in the selected partition.
Click the Refresh button to reread and refresh all the displayed information.
Click the Output button to view a text version of the information displayed in the Data Set Viewer.
You can open a different data set from the viewer by clicking the Open icon on the tool bar. The browse
dialog box opens again and lets you browse for a data set.
Click OK to view the selected data; the Data Viewer window appears.
The new data set will have the same record schema, number of partitions and contents as the original
data set.
Note: You cannot use the UNIX cp command to copy a data set because WebSphere DataStage represents
a single data set with multiple files.
Note: You cannot use the UNIX rm command to remove a data set because WebSphere DataStage represents
a single data set with multiple files. Using rm simply removes the descriptor file, leaving the much
larger data files behind.
WebSphere DataStage learns about the shape and size of the system from the configuration file. It
organizes the resources needed for a job according to what is defined in the configuration file. When
your system changes, you change the file not the jobs.
This chapter describes how to define configuration files that specify what processing, storage, and sorting
facilities on your system should be used to run a parallel job. You can maintain multiple configuration
files and read them into the system according to your varying processing needs.
When you install WebSphere DataStage Enterprise Edition the system is automatically configured to use
the supplied default configuration file. This allows you to run parallel jobs right away, but is not
optimized for your system. Follow the instructions in this chapter to produce a configuration file
specifically for your system.
Configurations editor
The WebSphere DataStage Designer provides a configuration file editor to help you define configuration
files for the parallel engine. To use the editor, choose Tools → Configurations; the Configurations dialog
box appears.
To define a new file, choose (New) from the Configurations drop-down list and type into the upper text
box. Guidance on the operation and format of a configuration file is given in the following sections. Click
Save to save the file at any point. You are asked to specify a configuration name, the config file is then
saved under that name with an .apt extension.
You can verify your file at any time by clicking Check. Verification information is output in the Check
Configuration Output pane at the bottom of the dialog box.
To edit an existing configuration file, choose it from the Configurations drop-down list. You can delete an
existing configuration by selecting it and clicking Delete. You are warned if you are attempting to delete
the last remaining configuration file.
You specify which configuration will be used by setting the APT_CONFIG_FILE environment variable.
This is set on installation to point to the default configuration file, but you can set it on a project wide
level from the WebSphere DataStage Administrator (see WebSphere DataStage Administrator Client Guide) or
for individual jobs from the Job Properties dialog.
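For example, to run jobs against a particular configuration from the command line on the server, you
could set the environment variable in your shell before starting the job (the path shown is illustrative):
export APT_CONFIG_FILE=/opt/IBM/InformationServer/Server/Configurations/4node.apt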
Configuration considerations
The parallel engine’s view of your system is determined by the contents of your current configuration
file. Your file defines the processing nodes and disk space connected to each node that you allocate for
use by parallel jobs. When invoking a parallel job, the parallel engine first reads your configuration file to
determine what system resources are allocated to it and then distributes the job to those resources.
Your ability to modify the parallel engine configuration means that you can control the parallelization of
a parallel job during its development cycle. For example, you can first run the job on one node, then on
two, then on four, and so on. You can measure system performance and scalability without altering
application code.
Optimizing parallelism
The degree of parallelism of a parallel job is determined by the number of nodes you define when you
configure the parallel engine. Parallelism should be optimized for your hardware rather than simply
maximized. Increasing parallelism distributes your work load but it also adds to your overhead because
the number of processes increases. Increased parallelism can actually hurt performance once you exceed
the capacity of your hardware. Therefore you must weigh the gains of added parallelism against the
potential losses in processing efficiency.
Obviously, the hardware that makes up your system influences the degree of parallelism you can
establish.
SMP systems allow you to scale up the number of CPUs and to run your parallel application against
more memory. In general, an SMP system can support multiple logical nodes. Some SMP systems allow
scalability of disk I/O. ″Configuration Options for an SMP″ discusses these considerations.
In a cluster or MPP environment, you can use the multiple CPUs and their associated memory and disk
resources in concert to tackle a single computing problem. In general, you have one logical node per CPU
on an MPP system. ″Configuration Options for an MPP System″ describes these issues.
The properties of your system’s hardware also determine configuration. For example, applications with
large memory requirements, such as sort operations, are best assigned to machines with a lot of memory.
Applications that will access an RDBMS must run on its server nodes; and stages using other proprietary
software, such as SAS, must run on nodes with licenses for that software.
Here are some additional factors that affect the optimal degree of parallelism:
v CPU-intensive applications, which typically perform multiple CPU-demanding operations on each
record, benefit from the greatest possible parallelism, up to the capacity supported by your system.
v Parallel jobs with large memory requirements can benefit from parallelism if they act on data that has
been partitioned and if the required memory is also divided among partitions.
v Applications that are disk- or I/O-intensive, such as those that extract data from and load data into
RDBMSs, benefit from configurations in which the number of logical nodes equals the number of disk
spindles being accessed. For example, if a table is fragmented 16 ways inside a database or if a data set
is spread across 16 disk drives, set up a node pool consisting of 16 processing nodes.
v For some jobs, especially those that are disk-intensive, you must sometimes configure your system to
prevent the RDBMS from having either to redistribute load data or to re-partition the data from an
extract operation.
The most nearly equal partitioning of data contributes to the best overall performance of a job run in
parallel. For example, when hash partitioning, try to ensure that the resulting partitions are evenly
populated. This is referred to as minimizing skew. Experience is the best teacher. Start with smaller data
sets and try different parallelizations while scaling up the data set sizes to collect performance statistics.
SMP systems allow you to scale up the number of CPUs. Increasing the number of processors you use
may or may not improve job performance, however, depending on whether your application is CPU-,
memory-, or I/O-limited. If, for example, a job is CPU-limited, that is, the memory, memory bus, and
disk I/O of your hardware spend a disproportionate amount of time waiting for the CPU to finish its
work, it will benefit from being executed in parallel. Running your job on more processing units will
shorten the waiting time of other resources and thereby speed up the overall application.
All SMP systems allow you to increase your parallel job’s memory access bandwidth. However, none
allow you to increase the memory bus capacity beyond that of the hardware configuration. Therefore,
memory-intensive jobs will also benefit from increased parallelism, provided they do not saturate the
memory bus capacity of your system. If your application is already approaching, or at the memory bus
limit, increased parallelism will not provide performance improvement.
Some SMP systems allow scalability of disk I/O. In those systems, increasing parallelism can increase the
overall throughput rate of jobs that are disk I/O-limited.
For example, the following figure shows a data flow containing three parallel stages:
(Figure: a data flow consisting of three parallel stages.)
For each stage in this data flow, the parallel engine creates a single UNIX process on each logical
processing node (provided that stage combining is not in effect). On an SMP defined as a single logical
node, each stage runs sequentially as a single process, and the parallel engine executes three processes in
total for this job. If the SMP has three or more CPUs, the three processes in the job can be executed
simultaneously by different CPUs. If the SMP has fewer than three CPUs, the processes must be
scheduled by the operating system for execution, and some or all of the processors must execute multiple
processes, preventing true simultaneous execution.
In order for an SMP to run parallel jobs, you configure the parallel engine to recognize the SMP as a
single logical processing node or as multiple logical processing nodes, that is, as M processing nodes,
where N is the number of CPUs on the SMP and M is the number of processing nodes in the
configuration; normally M is less than or equal to N. (Although M can be greater than N when there are
more disk spindles than there are CPUs.)
As a rule of thumb, it is recommended that you create one processing node for every two CPUs in an
SMP. You can modify this configuration to determine the optimal configuration for your system and
application during application testing and evaluation. In fact, in most cases the scheduling performed by
the operating system allows for significantly more than one process per processor to be managed before
performance degradation is seen. The exact number depends on the nature of the processes, bus
bandwidth, caching effects, and other factors.
For example, on an SMP viewed as a single logical processing node, the parallel engine creates a single
UNIX process on the processing node for each stage in a data flow. The operating system on the SMP
schedules the processes to assign each process to an individual CPU.
If the number of processes is less than the number of CPUs, some CPUs may be idle. For jobs containing
many stages, the number of processes may exceed the number of CPUs. If so, the processes will be
scheduled by the operating system.
Suppose you want to configure the parallel engine to recognize an eight-CPU SMP, for example, as two
or more processing nodes. When you configure the SMP as two separate processing nodes, the parallel
engine creates two processes per stage on the SMP. For the three-stage job shown above, configuring an
SMP as more than two parallel engine processing nodes creates at least nine UNIX processes, although
only eight CPUs are available. Process execution must be scheduled by the operating system.
For that reason, configuring the SMP as three or more parallel engine processing nodes can conceivably
degrade performance as compared with that of a one- or two-processing node configuration. This is so
because each CPU in the SMP shares memory, I/O, and network resources with the others. However, this
is not necessarily true if some stages read from and write to disk or the network; in that case, other
processes can use the CPU while the I/O-bound processes are blocked waiting for operations to finish.
(Figure: an example SMP system with four CPUs sharing permanent and temporary storage.)
The table below lists the processing node names and the file systems used by each processing node for
both permanent and temporary storage in this example system:
Node Permanent storage Temporary storage
node0 /orch/s0, /orch/s1 /scratch
node1 /orch/s0, /orch/s1 /scratch
In this example, the parallel engine processing nodes share two file systems for permanent storage. The
nodes also share a local file system (/scratch) for temporary storage.
Here is the configuration file corresponding to this system. ″Configuration Files″ discusses the keywords
and syntax of configuration files.
{
node "node0" {
fastname "node0_byn" /* node name on a fast network */
pools "" "node0" "node0_fddi" /* node pools */
resource disk "/orch/s0" {}
resource disk "/orch/s1" {}
resource scratchdisk "/scratch" {}
}
node "node1" {
fastname "node0_byn"
pools "" "node1" "node1_fddi"
resource disk "/orch/s0" {}
resource disk "/orch/s1" {}
resource scratchdisk "/scratch" {}
}
}
In an MPP environment, you can use the multiple CPUs and their associated memory and disk resources
in concert. In this environment, each CPU has its own dedicated memory, memory bus, disk, and disk
access.
When configuring an MPP, you specify the physical nodes in your system on which the parallel engine
will run your parallel jobs. You do not have to specify all nodes.
(Figure: an MPP system of four nodes, node0 through node3, each with its own CPU and disks,
connected both by a high-speed switch (node0_css through node3_css) and by an Ethernet.)
This figure shows a disk-everywhere configuration. Each node is connected to both a high-speed switch
and an Ethernet. Note that the configuration information below for this MPP would be similar for a
cluster of four SMPs connected by a network.
Note that because this is an MPP system, each node in this configuration has its own disks and hence its
own /orch/s0, /orch/s1, and /scratch. If this were an SMP, the logical nodes would be sharing the same
directories.
Here is the configuration file for this sample system. ″Configuration Files″ discusses the keywords and
syntax of configuration files.
{
node "node0" {
fastname "node0_css" /* node name on a fast network*/
pools "" "node0" "node0_css" /* node pools */
resource disk "/orch/s0" {}
resource disk "/orch/s1" {}
resource scratchdisk "/scratch" {}
}
node "node1" {
fastname "node1_css"
pools "" "node1" "node1_css"
resource disk "/orch/s0" {}
resource disk "/orch/s1" {}
resource scratchdisk "/scratch" {}
}
node "node2" {
fastname "node2_css"
pools "" "node2" "node2_css"
resource disk "/orch/s0" {}
resource disk "/orch/s1" {}
resource scratchdisk "/scratch" {}
}
node "node3" {
fastname "node3_css"
pools "" "node3" "node3_css"
resource disk "/orch/s0" {}
resource disk "/orch/s1" {}
resource scratchdisk "/scratch" {}
}
}
(Figure: a cluster of four physical nodes connected by a high-speed switch and an Ethernet; one of the
nodes is an SMP whose two CPUs are defined in the configuration file for this system as the logical
nodes node1 and node1a.)
In this example, consider the disk definitions for /orch/s0. Since node1 and node1a are logical nodes of
the same physical node, they share access to that disk. Each of the remaining nodes, node0, node2, and
node3, has its own /orch/s0 that is not shared. That is, there are four distinct disks called /orch/s0.
Similarly, /orch/s1 and /scratch are shared between node1 and node1a but not the others.
A cluster may have a node that is not connected to the others by a high-speed switch, as in the following
figure.
(Figure: a cluster in which node4 is connected to the other nodes by Ethernet only, not by the
high-speed switch.)
In this example, node4 is the Conductor, which is the node from which you need to start your
application. By default, the parallel engine communicates between nodes using the fastname, which in
this example refers to the high-speed switch. But because the Conductor is not on that switch, it cannot
use the fastname to reach the other nodes.
Therefore, to enable the Conductor node to communicate with the other nodes, you need to identify each
node on the high-speed switch by its canonicalhostname and give its Ethernet name as its quoted
attribute in the configuration file. ″Configuration Files″ discusses the keywords and syntax of
configuration files.
Note: Since node4 is not on the high-speed switch and we are therefore using it only as the Conductor
node, we have left it out of the default node pool (″″). This causes the parallel engine to avoid
placing stages on node4. See ″Node Pools and the Default Node Pool″ .
Configuration files
This section describes parallel engine configuration files, and their uses and syntax. The parallel engine
reads a configuration file to ascertain what processing and storage resources belong to your system.
Processing resources include nodes; and storage resources include both disks for the permanent storage
of data and disks for the temporary storage of data (scratch disks). The parallel engine uses this
information to determine, among other things, how to arrange resources for parallel execution.
You must define each processing node on which the parallel engine runs jobs and qualify its
characteristics; you must do the same for each disk that will store data. You can specify additional
information about nodes and disks on which facilities such as sorting or SAS operations will be run, and
about the nodes on which to run stages that access the following relational database management
systems: DB2, INFORMIX, and Oracle.
You can maintain multiple configuration files and read them into the system according to your needs.
A sample configuration file is located in the Configurations directory under the Server directory of your
installation, and is called default.apt.
You can give the configuration file a different name or location or both from their defaults. If you do,
assign the new path and file name to the environment variable APT_CONFIG_FILE. If
APT_CONFIG_FILE is defined, the parallel engine uses that configuration file rather than searching in
the default locations. In a production environment, you can define multiple configurations and set
APT_CONFIG_FILE to different path names depending on which configuration you want to use.
You can set APT_CONFIG_FILE on a project wide level from the WebSphere DataStage Administrator
(see WebSphere DataStage Administrator Client Guide) or for individual jobs from the Job Properties dialog.
Note: Although the parallel engine may have been copied to all processing nodes, you need to copy the
configuration file only to the nodes from which you start parallel engine applications (conductor
nodes).
Syntax
Configuration files are text files containing string data. The general form of a configuration file is as
follows:
/* commentary */
{
node "node name" {
<node information>
.
.
.
}
.
.
.
}
Node names
Each node you define is followed by its name enclosed in quotation marks, for example:
node "orch0"
For a single CPU node or workstation, the node’s name is typically the network name of a processing
node on a connection such as a high-speed switch or Ethernet. Issue the following UNIX command to
learn a node’s network name:
$ uname -n
On an SMP, if you are defining multiple logical nodes corresponding to the same physical node, you
replace the network name with a logical node name. In this case, you need a fast name for each logical
node.
If you run an application from a node that is undefined in the corresponding configuration file, each user
must set the environment variable APT_PM_CONDUCTOR_NODENAME to the fast name of the node
invoking the parallel job.
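For example, in a Korn or Bourne shell (the fast name shown is illustrative):
export APT_PM_CONDUCTOR_NODENAME=node0_css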
Options
Each node takes options that define the groups to which it belongs and the storage resources it employs.
Options are as follows:
fastname
Syntax:
fastname "name"
This option takes as its quoted attribute the name of the node as it is referred to on the fastest network in
the system, such as an IBM switch, FDDI, or BYNET. The fastname is the physical node name that stages
use to open connections for high volume data transfers. The attribute of this option is often the network
name. For an SMP, all CPUs share a single connection to the network, and this setting is the same for all
parallel engine processing nodes defined for an SMP. Typically, this is the principal node name, as
returned by the UNIX command uname -n.
pools
Syntax:
pools "node_pool_name0" "node_pool_name1" ...
The pools option indicates the names of the pools to which this node is assigned. The option’s attribute is
the pool name or a space-separated list of names, each enclosed in quotation marks. For a detailed
discussion of node pools, see ″Node Pools and the Default Node Pool″ .
Note that the resource disk and resource scratchdisk options can also take pools as an option, where it
indicates disk or scratch disk pools. For a detailed discussion of disk and scratch disk pools, see ″Disk
and Scratch Disk Pools and Their Defaults″ .
Node pool names can be dedicated. Reserved node pool names include the following names:
DB2 See the DB2 resource below and ″The resource DB2 Option″ .
INFORMIX
See the INFORMIX resource below and ″The resource INFORMIX Option″ .
ORACLE See the ORACLE resource below and ″The resource ORACLE option″ .
resource
Syntax:
resource resource_type "location" [{pools "disk_pool_name"}] |
resource resource_type "value"
canonicalhostname
Syntax:
canonicalhostname "ethernet name"
The canonicalhostname resource takes as its quoted attribute the ethernet name of a node in a cluster that
is unconnected to the Conductor node by the high-speed network. If the Conductor node cannot reach
the unconnected node by a fastname, you must define the unconnected node’s canonicalhostname to
enable communication.
DB2
Syntax:
resource DB2 "node_number" [{pools "instance_owner" ...}]
This option allows you to specify logical names as the names of DB2 nodes. For a detailed discussion of
configuring DB2, see ″The resource DB2 Option″ .
disk
Syntax:
resource disk "directory_path"
[{pools "poolname"...}]
Assign to this option the quoted absolute path name of a directory belonging to a file system connected
to the node. The node reads persistent data from and writes persistent data to this directory. One node
can have multiple disks. Relative path names are not supported.
Typically, the quoted name is the root directory of a file system, but it does not have to be. For example,
the quoted name you assign to disk can be a subdirectory of the file system.
You can group disks in pools. Indicate the pools to which a disk belongs by following its quoted name
with a pools definition enclosed in braces. For a detailed discussion of disk pools, see ″Disk and Scratch
Disk Pools and Their Defaults″ .
INFORMIX
This option allows you to specify logical names as the names of INFORMIX nodes. For a detailed
discussion of configuring INFORMIX, see ″The resource INFORMIX Option″ .
ORACLE
Syntax:
resource ORACLE "nodename"
[{pools "db_server_name" ...}]
This option allows you to define the nodes on which Oracle runs. For a detailed discussion of
configuring Oracle, see ″The resource ORACLE option″ .
sasworkdisk
Syntax:
resource sasworkdisk "directory_path"
[{pools "poolname"...}]
This option is used to specify the path to your SAS work directory. See ″The SAS Resources″ .
scratchdisk
Syntax:
resource scratchdisk "directory_path"
[{pools "poolname"...}]
Assign to this option the quoted absolute path name of a directory on a file system where intermediate
data will be temporarily stored. All users of this configuration must be able to read from and write to this
directory. Relative path names are unsupported.
The directory should be local to the processing node and reside on a different spindle from that of any
other disk space. One node can have multiple scratch disks.
Assign at least 500 MB of scratch disk space to each defined node. Nodes should have roughly equal
scratch space. If you perform sorting operations, your scratch disk space requirements can be
considerably greater, depending upon anticipated use.
We recommend that:
v Every logical node in the configuration file that will run sorting operations have its own sort disk,
where a sort disk is defined as a scratch disk available for sorting that resides in either the sort or
default disk pool.
v Each logical node’s sorting disk be a distinct disk drive. Alternatively, if it is shared among multiple
sorting nodes, it should be striped to ensure better performance.
v For large sorting operations, each node that performs sorting have multiple distinct sort disks on
distinct drives, or striped.
You can group scratch disks in pools. Indicate the pools to which a scratch disk belongs by following its
quoted name with a pools definition enclosed in braces. For more information on disk pools, see ″Disk
and Scratch Disk Pools and Their Defaults″ .
The following sample SMP configuration file defines four logical nodes.
{
node "borodin0" {
fastname "borodin"
pools "compute_1" ""
resource disk "/orch/s0" {} /* disk and scratch disk paths are illustrative */
resource scratchdisk "/scratch" {}
}
/* the remaining three logical nodes are defined similarly */
}
The option pools is followed by the quoted names of the node pools to which the node belongs. A node
can be assigned to multiple pools, as in the following example, where node1 is assigned to the default
pool (″″) as well as the pools node1, node1_css, and pool4.
node "node1"
{
fastname "node1_css"
pools "" "node1" "node1_css" "pool4"
resource disk "/orch/s0" {}
resource scratchdisk "/scratch" {}
}
A node belongs to the default pool unless you explicitly specify a pools list for it, and omit the default
pool name (″″) from the list.
Once you have defined a node pool, you can constrain a parallel stage or parallel job to run only on that
pool, that is, only on the processing nodes belonging to it. If you constrain both a stage and a job, the
stage runs only on the nodes that appear in both pools.
Nodes or resources that name a pool declare their membership in that pool.
We suggest that when you initially configure your system you place all nodes in pools that are named
after the node’s name and fast name. Additionally, include the default node pool, as in the following
example:
node "n1"
{
fastname "nfast"
pools "" "n1" "nfast"
}
By default, the parallel engine executes a parallel stage on all nodes defined in the default node pool. You
can constrain the processing nodes used by the parallel engine either by removing node descriptions from
the configuration file or by constraining a job or stage to a particular node pool.
The general form for assigning disks and scratch disks to pools within a node definition is:
resource disk "disk_name" {pools "disk_pool" ...}
resource scratchdisk "s_disk_name" {pools "s_pool" ...}
where:
v disk_name and s_disk_name are the names of directories.
v disk_pool... and s_pool... are the names of disk and scratch disk pools, respectively.
Pools defined by disk and scratchdisk are not combined; therefore, two pools that have the same name
and belong to both resource disk and resource scratchdisk define two separate pools.
A disk that does not specify a pool is assigned to the default pool. The default pool may also be
identified by ″″ and by { } (the empty pool list). For example, the following code configures the disks for
node1:
node "node1" {
resource disk "/orch/s0" {pools "" "pool1"}
resource disk "/orch/s1" {pools "" "pool1"}
resource disk "/orch/s2" { } /* empty pool list */
resource disk "/orch/s3" {pools "pool2"}
resource scratchdisk "/scratch" {pools "" "scratch_pool1"}
}
In this example:
v The first two disks are assigned to the default pool.
v The first two disks are assigned to pool1.
v The third disk is also assigned to the default pool, indicated by { }.
v The fourth disk is assigned to pool2 and is not assigned to the default pool.
v The scratch disk is assigned to the default scratch disk pool and to scratch_pool1.
Application programmers make use of pools based on their knowledge of both their system and their
application.
The parallel engine uses the default scratch disk for temporary storage other than buffering. If you define
a buffer scratch disk pool for a node in the configuration file, the parallel engine uses that scratch disk
pool rather than the default scratch disk for buffering, and all other scratch disk pools defined are used
for temporary storage other than buffering.
Here is an example configuration file that defines a buffer scratch disk pool:
{
node "node1" {
fastname "node1_css"
pools "" "node1" "node1_css"
resource disk "/orch/s0" {}
resource scratchdisk "/scratch0" {pools "buffer"}
resource scratchdisk "/scratch1" {}
}
node "node2" {
fastname "node2_css"
pools "" "node2" "node2_css"
resource disk "/orch/s0" {}
resource scratchdisk "/scratch0" {pools "buffer"}
resource scratchdisk "/scratch1" {}
}
}
In this example, each processing node has a single scratch disk resource in the buffer pool, so buffering
will use /scratch0 but not /scratch1. However, if /scratch0 were not in the buffer pool, both /scratch0
and /scratch1 would be used because both would then be in the default pool.
The resource DB2 option can also take the pools option. You assign to it the user name of the owner of
each DB2 instance configured to run on each node. DB2 uses the instance to determine the location of
db2nodes.cfg.
node "Db2Node3" {
/* other configuration parameters for node3 */
resource DB2 "3" {pools "Mary" "Bill"}
}
If you now specify a DB2 instance of Mary in your jobs, the location of db2nodes.cfg is
~Mary/sqllib/db2nodes.cfg.
There are two methods for specifying the INFORMIX coserver names in the configuration file.
1. Your configuration file can contain a description of each node, supplying the node name (not a
synonym) as the quoted name of the node. Typically, the node name is the network name of a
processing node as returned by the UNIX command uname -n.
Here is a sample configuration file for a system containing INFORMIX coserver nodes node0, node1,
node2, and node3:
{
node "node0" {
/* configuration parameters for node0 */
}
node "node1" {
/* configuration parameters for node1 */
}
node "node2" {
/* configuration parameters for node2 */
}
node "node3" {
/* configuration parameters for node3 */
}
}
When you specify resource INFORMIX, you must also specify the pools parameter. It indicates the base
name of the coserver groups for each INFORMIX server. These names must correspond to the coserver
group base name using the shared-memory protocol. They also typically correspond to the
DBSERVERNAME setting in the ONCONFIG file. For example, coservers in the group server are typically
named server.1, server.2, and so on.
You can optionally specify the resource ORACLE option to define the nodes on which you want to run
the Oracle stages. If you do, the parallel engine runs the Oracle stages only on the processing nodes for
which resource ORACLE is defined. You can additionally specify the pools parameter of resource
ORACLE to define resource pools, which are groupings of Oracle nodes.
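For example, the following node definition (node, pool, and database server names are illustrative)
makes node1 available to Oracle stages and places it in a resource pool for a particular database server:
node "node1" {
fastname "node1_css"
pools "" "node1"
resource ORACLE "node1" {pools "group1"}
resource disk "/orch/s0" {}
resource scratchdisk "/scratch" {}
}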
The resource names sas and sasworkdisk and the disk pool name sasdataset are all reserved words. Here
is an example of each of these declarations:
resource sas "/usr/sas612/" { }
resource sasworkdisk "/usr/sas/work/" { }
resource disk "/data/sas/" {pools "" "sasdataset"}
While the disks designated as sasworkdisk need not be a RAID configuration, best performance will
result if each parallel engine logical node has its own reserved disk that is not shared with other parallel
engine nodes during sorting and merging. The total size of this space for all nodes should optimally be
equal to the total work space you use when running SAS sequentially (or a bit more, to allow for uneven
distribution of data across partitions).
The number of disks in the sasdataset disk pool is the degree of parallelism of parallel SAS data sets.
Thus if you have 24 processing nodes, each with its associated disk in the sasdataset disk pool, parallel
SAS data sets will be partitioned among all 24 disks, even if the operation preceding the disk write is, for
example, only four-way parallel.
Example
Here a single node, grappelli0, is defined, along with its fast name. Also defined are the path to a SAS
executable, a SAS work disk (corresponding to the SAS work directory), and two disk resources, one for
parallel SAS data sets and one for non-SAS file sets.
node "grappelli0"
{
fastname "grappelli"
pools "" "a"
resource sas "/usr/sas612" { }
resource scratchdisk "/scratch" { }
resource sasworkdisk "/scratch" { }
disk "/data/pds_files/node0" { pools "" "export" }
disk "/data/pds_files/sas" { pools "" "sasdataset" }
}
Sort configuration
You may want to define a sort scratch disk pool to assign scratch disk space explicitly for the storage of
temporary files created by the Sort stage. In addition, if only a subset of the nodes in your configuration
have sort scratch disks defined, we recommend that you define a sort node pool, to specify the nodes on
which the sort stage should run. Nodes assigned to the sort node pool should be those that have scratch
disk space assigned to the sort scratch disk pool.
The parallel engine then runs sort only on the nodes in the sort node pool, if it is defined, and otherwise
uses the default node pool. The Sort stage stores temporary files only on the scratch disks included in the
sort scratch disk pool, if any are defined, and otherwise uses the default scratch disk pool.
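For example, a node intended to run sorts might be defined along the following lines (node, pool, and
path names are illustrative); it belongs to the sort node pool and has a scratch disk in the sort scratch
disk pool:
node "node1" {
fastname "node1_css"
pools "" "node1" "sort"
resource disk "/orch/s0" {}
resource scratchdisk "/sortscratch" {pools "sort"}
resource scratchdisk "/scratch" {}
}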
Allocation of resources
The allocation of resources for a given stage, particularly node and disk allocation, is done in a
multi-phase process. Constraints on which nodes and disk resources are used are taken from the parallel
engine arguments, if any, and matched against any pools defined in the configuration file. Additional
constraints may be imposed by, for example, an explicit requirement for the same degree of parallelism as
the previous stage. After all relevant constraints have been applied, the stage allocates resources,
including instantiation of Player processes on the nodes that are still available and allocation of disks to
be used for temporary and permanent storage of data.
However, you can override the default and set configuration parameters for individual processing nodes.
To do so, you create a parallel engine startup script. If a startup script exists, the parallel engine runs it
on all remote shells before it runs your application.
When you invoke an application, the parallel engine looks for the name and location of a startup script
as follows:
1. It uses the value of the APT_STARTUP_SCRIPT environment variable.
2. It searches the current working directory for a file named startup.apt.
3. Searches for the file install_dir/etc/startup.apt on the system that invoked the parallel engine
application, where install_dir is the top-level directory of the installation.
4. If the script is not found, it does not execute a startup script.
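For example, to use the first of these locations you could set the environment variable in your shell (the
path is illustrative):
export APT_STARTUP_SCRIPT=/home/dsadm/startup.apt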
Here is a template you can use with the Korn shell to write your own startup script:
#!/bin/ksh # specify Korn shell
# your shell commands go here
shift 2
exec $*
You must include the last two lines of the shell script. This prevents your application from running if
your shell script detects an error.
The following startup script for the Bourne shell prints the node name, time, and date for all processing
nodes before your application is run:
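#!/bin/sh # use Bourne shell
# a representative sketch: print the node name and the current date and time
echo "`uname -n`: `date`"
# required final lines of a parallel engine startup script
shift 2
exec $*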
A single script can perform node-specific initialization by means of a case statement. In the following
example, the system has two nodes named node1 and node2. This script performs initialization based on
which node it is running on.
#!/bin/sh # use Bourne shell
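# a representative sketch of node-specific initialization; the commands
# inside each branch are illustrative
case `uname -n` in
node1)
echo "initializing node1"
;;
node2)
echo "initializing node2"
;;
esac
# required final lines of a parallel engine startup script
shift 2
exec $*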
The parallel engine provides the APT_NO_STARTUP_SCRIPT environment variable to prevent the
parallel engine from running the startup script. By default, the parallel engine executes the startup script.
If the variable is set, the parallel engine ignores the startup script. This can be useful for debugging a
startup script.
Let’s assume you are running on a shared-memory multi-processor system, i.e., an SMP box (these are
the most common platforms today). Let’s assume these properties. You can adjust the illustration below
to match your precise situation:
v computer’s hostname ″fastone″
v 6 CPUs
v 4 separate file systems on 4 drives named /fs0 /fs1 /fs2 /fs3
The configuration file to use as a starting point would look like the one below. Note the way the
disk/scratchdisk resources are handled. That’s the real trick here.
{ /* config file allows C-style comments. */
/*
config files look like they have flexible syntax.
They do NOT. Keep all the sub-items of the individual
node specifications in the order shown here.
*/
node "n0" {
pools "" /* on an SMP node pools aren’t used often. */
fastname "fastone"
resource scratchdisk "/fs0/ds/scratch" {} /*start with fs0*/
resource scratchdisk "/fs1/ds/scratch" {}
resource scratchdisk "/fs2/ds/scratch" {}
resource scratchdisk "/fs3/ds/scratch" {}
resource disk "/fs0/ds/disk" {} /* start with fs0 */
resource disk "/fs1/ds/disk" {}
resource disk "/fs2/ds/disk" {}
resource disk "/fs3/ds/disk" {}
The above config file pattern could be called ″give everyone all the disk″. This configuration style works
well when the flow is complex enough that you can’t really figure out and precisely plan for good I/O
utilization. Giving every partition (node) access to all the I/O resources can cause contention, but the
parallel engine tends to use fairly large blocks for I/O so the contention isn’t as much of a problem as
you might think. This configuration style works for any number of CPUs and any number of disks since
it doesn’t require any particular correspondence between them. The heuristic principle at work here is
this: ″When it’s too difficult to figure out precisely, at least go for achieving balance.″
The alternative to the above configuration style is more careful planning of the I/O behavior so as to
reduce contention. You can imagine this could be hard given our hypothetical 6-way SMP with 4 disks
because setting up the obvious one-to-one correspondence doesn’t work. Doubling up some nodes on the
same disk is unlikely to be good for overall performance since we create a hotspot. We could give every
CPU 2 disks and rotate around, but that would be little different than our above strategy. So, let’s
imagine a less constrained environment and give ourselves 2 more disks /fs4 and /fs5. Now a config file
might look like this:
{
node "n0" {
pools ""
fastname "fastone"
resource scratchdisk "/fs0/ds/scratch" {}
resource disk "/fs0/ds/disk" {}
}
node "n1" {
pools ""
fastname "fastone"
resource scratchdisk "/fs1/ds/scratch" {}
resource disk "/fs1/ds/disk" {}
}
node "n2" {
pools ""
fastname "fastone"
resource scratchdisk "/fs2/ds/scratch" {}
resource disk "/fs2/ds/disk" {}
}
node "n3" {
pools ""
fastname "fastone"
resource scratchdisk "/fs3/ds/scratch" {}
resource disk "/fs3/ds/disk" {}
}
node "n4" {
pools ""
fastname "fastone"
resource scratchdisk "/fs4/ds/scratch" {}
resource disk "/fs4/ds/disk" {}
}
node "n5" {
pools ""
fastname "fastone"
resource scratchdisk "/fs5/ds/scratch" {}
resource disk "/fs5/ds/disk" {}
}
} /* end of whole config */
This is simplest, but realize that no single player (stage/operator instance) on any one partition can go
faster than the single disk it has access to.
Any remote system that has a job so deployed must have access to the Parallel Engine in order to run the
job. Such systems must also have the correct runtime libraries for that platform type.
Because these jobs are not run on the WebSphere DataStage Server, server components (such as BASIC
Transformer stages, server shared containers, before and after subroutines, and job control routines)
cannot be used. There is also a limited set of plug-in stages available for use in these jobs.
When you run the jobs, the logging, monitoring, and operational metadata collection facilities provided
by WebSphere DataStage are not available. Deployed jobs do output logging information in internal
parallel engine format, but provision for collecting this is the user’s responsibility.
You develop a Parallel job for deployment using the WebSphere DataStage Designer, and then compile it.
A deployment package is automatically produced at compilation. Such jobs can also be run under the
control of the WebSphere DataStage Server (using Designer or Director clients, or the dsjob command) as
per normal. (Note that running jobs in the `normal’ way runs the executables in the project directory, not
the deployment scripts.)
It is your responsibility to define a configuration file on the remote machine, transfer the deployment
package to the remote computer, and run the job.
The following diagram gives a conceptual view of an example deployment system. In this example,
deployable jobs are transferred to three `conductor node’ machines. Each conductor node has a
configuration file describing the resources that it has available for running the jobs. The jobs then run
under the control of that conductor:
(Figure: deployable scripts are transferred from the WebSphere DataStage server to each of the three
conductor node machines.)
Note: The WebSphere DataStage Server system and the Node systems must be running the same
operating system.
Deployment package
When you compile a job in the WebSphere DataStage Designer with project deployment enabled, the
following files are produced:
v Command shell script
v Environment variable setting source script
v Main Parallel (osh) program script
v Script parameter file
v XML report file (if enabled - see WebSphere DataStage Parallel Job Advanced Developer Guide).
v Compiled transformer binary files (if the job contains any Transformer stages)
v Transformer source compilation scripts
These are the files that will be copied to a job directory in the base directory specified in the
Administrator client for the project. By default the job directory is called RT_SCjobnum, where jobnum is
the internal job number allocated to the job (you can change the form of the name in the Administrator
client).
If you have additional custom components designed outside the job (for example, Custom, Build, or
Wrapped stages) you should ensure that these are available when the job is run. They are not
automatically packaged and deployed.
It is possible to edit the environment variable setting source script manually if required before running
a job. The file can be removed altogether, but it is then your responsibility to set up the environment
before running the job.
If you want to recompile transformers on your deployment platform before running the job, you should
run pxcompile.sh.
Deploying a job
This describes how to design a job on the WebSphere DataStage system in a remote deployment project,
transfer it to the deployment machine, and run it.
1. In the WebSphere DataStage Administrator, specify a remote deployment project as described in
″Enabling a Project for Job Deployment″ .
2. Define a configuration file that describes your remote deployment system. Use the environment
variable APT_CONFIG_FILE to identify it on the remote machine. You can do this in one of three
ways:
v If you are always going to use the same configuration file on the same remote system, define
APT_CONFIG_FILE on a project-wide basis in the WebSphere DataStage Administrator. All your
remote deployment job packages will have that value for APT_CONFIG_FILE.
v To specify the value at individual job level, specify APT_CONFIG_FILE as a job parameter and set
the default value to the location of the configuration file. This will be packaged with that particular
job.
v To specify the value at run time, set the value of APT_CONFIG_FILE to $ENV in the WebSphere
DataStage Administrator and then define APT_CONFIG_FILE as an environment variable on your
remote machine. The job will pick up the value at run time.
Note: The Java plug-in does not run on Red Hat Enterprise Linux® AS 2.1/Red Hat 7.3.
The schema file is a plain text file; this appendix describes its format. A partial schema has the same
format.
Note: If you are using a schema file on an NLS system, the schema file needs to be in UTF-8 format. It is,
however, easy to convert text files between two different maps with a WebSphere DataStage job.
Such a job would read data from a text file using a Sequential File stage and specifying the
appropriate character set on the NLS Map page. It would write the data to another file using a
Sequential File stage, specifying the UTF-8 map on the NLS Map page.
Schema format
A schema contains a record (or row) definition. This describes each column (or field) that will be
encountered within the record, giving column name and data type. The following is an example record
schema:
record (
name:string[255];
address:nullable string[255];
value1:int32;
value2:int32;
date:date)
(The line breaks are there for ease of reading, you would omit these if you were defining a partial
schema, for example record(name:string[255];value1:int32;date:date) is a valid schema.)
You can include comments in schema definition files. A comment is started by a double slash //, and
ended by a newline.
The following sections give special consideration for representing various data types in a schema file.
Date columns
The following examples show various date column definitions:
record (dateField1:date; ) // single date
record (dateField2[10]:date; ) // 10-element date vector
record (dateField3[]:date; ) // variable-length date vector
record (dateField4:nullable date;) // nullable date
(See ″Complex Data Types″ for information about vectors.)
Decimal columns
To define a record field with data type decimal, you must specify the column’s precision, and you may
optionally specify its scale, as follows:
column_name:decimal[ precision, scale];
where precision is greater than or equal to 1 and scale is greater than or equal to 0 and less than precision.
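For example (the field name is illustrative):
record (price:decimal[8,2];) // decimal with precision 8 and scale 2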
Floating-point columns
To define floating-point fields, you use the sfloat (single-precision) or dfloat (double-precision) data type,
as in the following examples:
record (aSingle:sfloat; aDouble:dfloat; ) // float definitions
record (aSingle: nullable sfloat;) // nullable sfloat
record (doubles[5]:dfloat;) // fixed-length vector of dfloats
record (singles[]:sfloat;) // variable-length vector of sfloats
Integer columns
To define integer fields, you use an 8-, 16-, 32-, or 64-bit integer data type (signed or unsigned), as shown
in the following examples:
record (n:int32;) // 32-bit signed integer
record (n:nullable int64;) // nullable, 64-bit signed integer
record (n[10]:int16;) // fixed-length vector of 16-bit signed integers
record (n[]:uint8;) // variable-length vector of 8-bit unsigned integers
Raw columns
You can define a record field that is a collection of untyped bytes, of fixed or variable length. You give
the field data type raw. The definition for a raw field is similar to that of a string field, as shown in the
following examples:
record (var1:raw[];) // variable-length raw field
record (var2:raw;) // variable-length raw field; same as raw[]
record (var3:raw[40];) // fixed-length raw field
record (var4[5]:raw[40];)// fixed-length vector of raw fields
You can specify the maximum number of bytes allowed in the raw field with the optional property max,
as shown in the example below:
record (var7:raw[max=80];)
String columns
You can define string fields of fixed or variable length. For variable-length strings, the string length is
stored as part of the string as a hidden integer. The storage used to hold the string length is not included
in the length of the string.
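For example (field names are illustrative):
record (forename:string[20];) // fixed-length string of 20 characters
record (comment:string;) // variable-length string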
You can specify the maximum length of a string with the optional property max, as shown in the
example below:
record (var7:string[max=80];)
Time columns
By default, the smallest unit of measure for a time value is seconds, but you can instead use
microseconds with the [microseconds] option. The following are examples of time field definitions:
record (tField1:time; ) // single time field in seconds
record (tField2:time[microseconds];) // time field in microseconds
record (tField3[]:time; ) // variable-length time vector
record (tField4:nullable time;) // nullable time
Timestamp columns
Timestamp fields contain both time and date information. In the time portion, you can use seconds (the
default) or microseconds for the smallest unit of measure. For example:
record (tsField1:timestamp;) // single timestamp field in seconds
record (tsField2:timestamp[microseconds];) // timestamp in microseconds
record (tsField3[15]:timestamp; ) // fixed-length timestamp vector
record (tsField4:nullable timestamp;)// nullable timestamp
Vectors
Many of the previous examples show how to define a vector of a particular data type. You define a
vector field by following the column name with brackets []. For a variable-length vector, you leave the
brackets empty, and for a fixed-length vector you put the number of vector elements in the brackets. For
example, to define a variable-length vector of int32, you would use a field definition such as the
following one:
intVec[]:int32;
To define a fixed-length vector of 10 elements of type sfloat, you would use a definition such as:
sfloatVec[10]:sfloat;
You can define a vector of any data type, including string and raw. You cannot define a vector of a vector
or tagged type. You can, however, define a vector of type subrecord, and that subrecord can itself include
a tagged column or a vector.
You can make vector elements nullable, as shown in the following record definition:
record (vInt[]:nullable int32;
vDate[6]:nullable date; )
In the example above, every element of the variable-length vector vInt will be nullable, as will every
element of fixed-length vector vDate. To test whether a vector of nullable elements contains no data, you
must check each element for null.
Subrecords
Record schemas let you define nested field definitions, or subrecords, by specifying the type subrec. A
subrecord itself does not define any storage; instead, the fields of the subrecord define storage. The fields
in a subrecord can be of any data type, including tagged.
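For example, a record containing a subrecord can be defined as follows:
record (intField:int16;
aSubrec:subrec (
aField:int16;
bField:sfloat; );
)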
In this example, the record contains a 16-bit integer field, intField, and a subrecord field, aSubrec. The
subrecord includes two fields: a 16-bit integer and a single-precision float.
Subrecord columns of value data types (including string and raw) can be nullable, and subrecord
columns of subrec or vector types can have nullable elements. A subrecord itself cannot be nullable.
You can define vectors (fixed-length or variable-length) of subrecords. The following example shows a
definition of a fixed-length vector of subrecords:
record (aSubrec[10]:subrec (
aField:int16;
bField:sfloat; );
)
You can also nest subrecords and vectors of subrecords, to any depth of nesting. The following example
defines a fixed-length vector of subrecords, aSubrec, that contains a nested variable-length vector of
subrecords, cSubrec:
record (aSubrec[10]:subrec (
aField:int16;
bField:sfloat;
cSubrec[]:subrec (
cAField:uint8;
cBField:dfloat; );
);
)
Subrecords can include tagged aggregate fields, as shown in the following sample definition:
record (aSubrec:subrec (
aField:string;
bField:int32;
cField:tagged (
dField:int16;
eField:sfloat;
);
);
)
In this example, aSubrec has a string field, an int32 field, and a tagged aggregate field. The tagged
aggregate field cField can have either of two data types, int16 or sfloat.
Tagged columns
You can use schemas to define tagged columns (similar to C unions), with the data type tagged. Defining
a record with a tagged type allows each record of a data set to have a different data type for the tagged
column. When your application writes to a field in a tagged column, WebSphere DataStage updates the
tag, which identifies it as having the type of the column that is referenced.
The fields of a tagged column can be of any data type except tagged or subrec. For example, the
following record defines a tagged field with three alternative data types:
record ( tagField:tagged (
aField:string;
bField:int32;
cField:sfloat;
) ;
)
Partial schemas
Some parallel job stages allow you to use a partial schema. This means that you need to define column
definitions only for those columns that you actually intend to operate on. The stages that allow you to do
this are file stages that have a Format tab. These are:
v Sequential File stage
v File Set stage
v External Source stage
v External Target stage
v Column Import stage
You specify a partial schema using the Intact property on the Format tab of the stage together with the
Schema File property on the corresponding Properties tab. To use this facility, you need to turn Runtime
Column Propagation on, and provide enough information about the columns being passed through to
enable WebSphere DataStage to skip over them as necessary.
In the file defining the partial schema, you need to describe the record and the individual columns.
Describe the record as follows:
v intact. This property specifies that the schema being defined is a partial one. You can optionally specify
a name for the intact schema here as well, which you can then reference from the Intact property of the
Format tab.
v record_length. The length of the record, including record delimiter characters.
v record_delim_string. String giving the record delimiter as an ASCII string in single quotes. (For a
single character delimiter, use record_delim and supply a single ASCII character in single quotes).
Columns that are being passed through intact only need to be described in enough detail to allow
WebSphere DataStage to skip them and locate the columns that are to be operated on.
For example, say you have a sequential file defining rows comprising six fixed-width columns, and you
are interested in the last two. You know that the first four columns together contain 80 characters. Your
partial schema definition might appear as follows:
record { intact=details, record_delim_string = '\r\n' }
( colstoignore: string [80];
name: string [20] { delim=none };
income: uint32 { delim = ",", text };
)
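In this schema, colstoignore describes the first four columns only as a single 80-character block for WebSphere DataStage to skip, while name and income, the two columns of interest, are defined in full.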
This set includes functions that take string arguments or return string values. If you have NLS enabled, the
argument strings or returned strings can be strings or ustrings. The same function is used for either
string type. The only exceptions are the functions StringToUstring() and UstringToString().
Date, Time, and Timestamp functions that specify dates, times, or timestamps in the argument use strings
with specific formats:
For a time, the format is %hh:%nn:%ss or, if extended to include microseconds, %hh:%nn:%ss.x, where x
is the number of decimal places to which seconds are given.
For a timestamp, the format is %yyyy-%mm-%dd %hh:%nn:%ss or, if extended to include microseconds,
%yyyy-%mm-%dd %hh:%nn:%ss.x, where x is the number of decimal places to which seconds are given.
This applies to the arguments date, baseline date, given date, time, timestamp, and base timestamp.
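For example, a time given to two decimal places of seconds would be passed as a string such as 22:30:52.65, and a timestamp extended in the same way as 2008-08-18 22:30:52.65.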
Functions that have days of the week in the argument take a string specifying the day of the week; this
applies to the arguments day of week and origin day.
Mathematical Functions
The following table lists the functions available in the Mathematical category (square brackets indicate an
argument is optional):
Raw functions
The following table lists the functions available in the Raw category (square brackets indicate an
argument is optional):
String functions
The following table lists the functions available in the String category (square brackets indicate an
argument is optional):
v CompareNoCase. Case-insensitive comparison of two strings. Arguments: string1 (string), string2 (string). Output: result (int8).
v CompareNum. Compares the first n characters of the two strings. Arguments: string1 (string), string2 (string), length (int16). Output: result (int8).
v CompareNumNoCase. Caseless comparison of the first n characters of the two strings. Arguments: string1 (string), string2 (string), length (int16). Output: result (int8).
Vector function
The following function can be used within expressions to access an element in a vector column. The
vector index starts at 0.
This can be used as part of, or the whole of, an expression. For example, an expression to add 1 to the
third element of a vector input column 'InLink.col1' would be:
ElementAt(InLink.col1, 2) + 1
Rtype. The rtype argument is a string, and should contain one of the following:
v ceil. Round the source field toward positive infinity. For example, 1.4 -> 2, -1.6 -> -1.
v floor. Round the source field toward negative infinity. For example, 1.6 -> 1, -1.4 -> -2.
v round_inf. Round or truncate the source field toward the nearest representable value, breaking ties by
rounding positive values toward positive infinity and negative values toward negative infinity. For
example, 1.4 -> 1, 1.5 -> 2, -1.4 -> -1, -1.5 -> -2.
v trunc_zero. Discard any fractional digits to the right of the rightmost fractional digit supported in the
destination, regardless of sign. For example, if the destination is an integer, all fractional digits are
truncated. If the destination is another decimal with a smaller scale, round or truncate to the scale size
of the destination decimal. For example, 1.6 -> 1, -1.6 -> -1.
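As an illustration, assuming that the DecimalToDecimal type conversion function described in this guide accepts an rtype string as its optional second argument, an expression such as the following rounds an assumed input column InLink.amount toward positive infinity:
DecimalToDecimal(InLink.amount, "ceil")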
Format string. Date, Time, and Timestamp functions that take a format string (e.g., timetostring(time,
stringformat)) need to have the date format specified. The format strings are described in “Date and time
formats” on page 30.
Where your dates, times, or timestamps convert to or from ustrings, WebSphere DataStage will pick this
up automatically. In these cases the separators in your format string (for example, ':' or '-') can
themselves be Unicode characters.
fix_zero. By default, decimal numbers consisting entirely of zeros are treated as invalid. If the string fix_zero is
specified as a second argument, all-zero decimal values are regarded as valid.
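For example, assuming the DecimalToString type conversion function described in this guide, passing fix_zero as its second argument allows an all-zero value in an assumed input column InLink.decCol to be converted rather than rejected:
DecimalToString(InLink.decCol, "fix_zero")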
Utility functions
The following table lists the functions available in the Utility category (square brackets indicate an
argument is optional):
This appendix gives examples of filler creation for different types of COBOL structures, and explains how
to expand fillers later if you need to reselect any columns.
Creating fillers
Unlike other parallel stages, Complex Flat File stages have stage columns. You load columns on the
Records tab of the Complex Flat File Stage page.
First, you select a table from the Table Definitions window. Next, you select columns from the Select
Columns From Table window. This window has an Available columns tree that displays COBOL
structures such as groups and arrays, and a Selected columns list that displays the columns to be loaded
into the stage. The Create fillers check box is selected by default.
When columns appear on the Records tab, FILLER items are shown for the unselected columns. FILLER
columns have a native data type of CHARACTER and a name of FILLER_XX_YY, where XX is the start
offset and YY is the end offset. Fillers for elements of a group array or an OCCURS DEPENDING ON
(ODO) column have the name FILLER_NN, where NN is the element number. NN begins at 1 for
the first unselected group element and continues sequentially. Any fillers that follow an ODO column are
also numbered sequentially.
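For example, a run of unselected columns starting at offset 2 and ending at offset 9 would be replaced by a single CHARACTER filler named FILLER_2_9, the filler used in the expansion example later in this appendix.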
Level numbers of column definitions, including filler columns, are changed after they are loaded into the
Complex Flat File stage. However, the underlying structure is preserved.
The next example shows what happens when a different group element is selected, in this case, column
C2. Four fillers are created: one for column A, one for column C1, one for GRP2 and its elements, and
one for columns E and G. Column B and GRP1 are preserved for the same reasons as before.
If the selected array column (GRP2) is flattened, both occurrences (GRP2 and GRP2_2) are loaded into the
stage. The file layout is:
Now suppose the REDEFINE column (F) is selected. In this case column E is preserved since it is
redefined by column F. One filler is created for columns A through D3, and another for column G. The
file layout is:
If columns C2 and E are selected, four fillers are created: one for column B, one for column C1, one for
column D, and one for column F. Since an element of GRP1 is selected and GRP1 redefines column A,
both column A and GRP1 are preserved. The first three fillers have the same start offset because they
redefine the same storage area.
If columns C2 and E are selected, this time only two fillers are created: one for column C1 and one for
column F. Column A, column B, GRP1, and column D are preserved because they are redefined by other
columns.
Fillers are created for column A, column C1, columns D1 through D3, and columns E and G. GRP1 is
preserved because it redefines column B. Since GRP2 (an ODO column) depends on column C2, column
C2 is preserved. GRP2 is preserved because it is an ODO column.
Expanding fillers
After you select columns to load into a Complex Flat File stage, the selected columns and fillers appear
on the Records tab on the Stage page. If you need to reselect any columns that are represented by fillers,
it is not necessary to reload your table definition. An Expand filler option allows you to reselect any or
all of the columns from a given filler.
To expand a filler, right-click the filler column in the columns tree and select Expand filler from the
pop-up menu. The Expand Filler window opens. The contents of the given filler are displayed in the
Available columns tree, allowing you to reselect the columns that you need.
If you expand FILLER_2_9, the Available columns tree in the Expand Filler window displays the
contents of FILLER_2_9:
C1
C2
GRP2(2)
D1
D2
D3
If you expand FILLER_3_9 and select column C2, the Records tab now changes to this:
If you continue to expand the fillers, eventually the Records tab will contain all of the original columns in
the table:
A
B
GRP1
C1
C2
GRP2
D1
D2
D3
E
F
G
publib.boulder.ibm.com/infocenter/iisinfsv/v8r0/index.jsp
You can order IBM publications online or through your local IBM representative.
v To order publications online, go to the IBM Publications Center at www.ibm.com/shop/publications/
order.
v To order publications by telephone in the United States, call 1-800-879-2755.
To find your local IBM representative, go to the IBM Directory of Worldwide Contacts at
www.ibm.com/planetwide.
Contacting IBM
You can contact IBM by telephone for customer support, software services, and general information.
Customer support
To contact IBM customer service in the United States or Canada, call 1-800-IBM-SERV (1-800-426-7378).
Software services
To learn about available service options, call one of the following numbers:
v In the United States: 1-888-426-4343
v In Canada: 1-800-465-9600
General information
Accessible documentation
Documentation is provided in XHTML format, which is viewable in most Web browsers.
Syntax diagrams are provided in dotted decimal format. This format is available only if you are accessing
the online documentation using a screen reader.
Your feedback helps IBM to provide quality information. You can use any of the following methods to
provide comments:
v Send your comments using the online readers’ comment form at www.ibm.com/software/awdtools/
rcf/.
v Send your comments by e-mail to comments@us.ibm.com. Include the name of the product, the version
number of the product, and the name and part number of the information (if applicable). If you are
commenting on specific text, please include the location of the text (for example, a title, a table number,
or a page number).
IBM may not offer the products, services, or features discussed in this document in other countries.
Consult your local IBM representative for information on the products and services currently available in
your area. Any reference to an IBM product, program, or service is not intended to state or imply that
only that IBM product, program, or service may be used. Any functionally equivalent product, program,
or service that does not infringe any IBM intellectual property right may be used instead. However, it is
the user’s responsibility to evaluate and verify the operation of any non-IBM product, program, or
service.
IBM may have patents or pending patent applications covering subject matter described in this
document. The furnishing of this document does not grant you any license to these patents. You can send
license inquiries, in writing, to:
For license inquiries regarding double-byte (DBCS) information, contact the IBM Intellectual Property
Department in your country or send inquiries, in writing, to:
The following paragraph does not apply to the United Kingdom or any other country where such
provisions are inconsistent with local law: INTERNATIONAL BUSINESS MACHINES CORPORATION
PROVIDES THIS PUBLICATION "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR
IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some
states do not allow disclaimer of express or implied warranties in certain transactions, therefore, this
statement may not apply to you.
This information could include technical inaccuracies or typographical errors. Changes are periodically
made to the information herein; these changes will be incorporated in new editions of the publication.
IBM may make improvements and/or changes in the product(s) and/or the program(s) described in this
publication at any time without notice.
Any references in this information to non-IBM Web sites are provided for convenience only and do not in
any manner serve as an endorsement of those Web sites. The materials at those Web sites are not part of
the materials for this IBM product and use of those Web sites is at your own risk.
IBM may use or distribute any of the information you supply in any way it believes appropriate without
incurring any obligation to you.
IBM Corporation
J46A/G4
555 Bailey Avenue
San Jose, CA 95141-1003 U.S.A.
Such information may be available, subject to appropriate terms and conditions, including in some cases,
payment of a fee.
The licensed program described in this document and all licensed material available for it are provided
by IBM under terms of the IBM Customer Agreement, IBM International Program License Agreement or
any equivalent agreement between us.
Any performance data contained herein was determined in a controlled environment. Therefore, the
results obtained in other operating environments may vary significantly. Some measurements may have
been made on development-level systems and there is no guarantee that these measurements will be the
same on generally available systems. Furthermore, some measurements may have been estimated through
extrapolation. Actual results may vary. Users of this document should verify the applicable data for their
specific environment.
Information concerning non-IBM products was obtained from the suppliers of those products, their
published announcements or other publicly available sources. IBM has not tested those products and
cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM
products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of
those products.
All statements regarding IBM’s future direction or intent are subject to change or withdrawal without
notice, and represent goals and objectives only.
This information is for planning purposes only. The information herein is subject to change before the
products described become available.
This information contains examples of data and reports used in daily business operations. To illustrate
them as completely as possible, the examples include the names of individuals, companies, brands, and
products. All of these names are fictitious and any similarity to the names and addresses used by an
actual business enterprise is entirely coincidental.
COPYRIGHT LICENSE:
This information contains sample application programs in source language, which illustrate programming
techniques on various operating platforms. You may copy, modify, and distribute these sample programs
in any form without payment to IBM, for the purposes of developing, using, marketing or distributing
application programs conforming to the application programming interface for the operating platform for
which the sample programs are written. These examples have not been thoroughly tested under all
conditions. IBM, therefore, cannot guarantee or imply reliability, serviceability, or function of these
programs.
Each copy or any portion of these sample programs or any derivative work, must include a copyright
notice as follows:
© (your company name) (year). Portions of this code are derived from IBM Corp. Sample Programs. ©
Copyright IBM Corp. _enter the year or years_. All rights reserved.
Trademarks
IBM trademarks and certain non-IBM trademarks are marked at their first occurrence in this document.
Java and all Java-based trademarks and logos are trademarks or registered trademarks of Sun
Microsystems, Inc. in the United States, other countries, or both.
Microsoft®, Windows®, Windows NT®, and the Windows logo are trademarks of Microsoft Corporation in
the United States, other countries, or both.
Intel®, Intel Inside® (logos), MMX and Pentium® are trademarks of Intel Corporation in the United States,
other countries, or both.
UNIX is a registered trademark of The Open Group in the United States and other countries.
Linux is a trademark of Linus Torvalds in the United States, other countries, or both.
Other company, product or service names might be trademarks or service marks of others.
T
table definitions 25
tagged subrecords 30
tail stage 497
tail stage properties 499
time format 34
timestamp format 36
toolbars
Transformer Editor 168, 187, 240
trademarks 617
Transformer Editor
link area 168, 188, 241
meta data area 169, 188, 241
shortcut menus 169, 188, 241
toolbar 168, 187, 240
transformer stage 167
transformer stage properties 179, 251
Transformer stages
basic concepts 170, 189, 242
editing 171, 190, 242
Expression Editor 198
specifying after-stage
subroutines 196
specifying before-stage
subroutines 196
type conversion functions 305
type conversions 303
U
UniVerse stages 73, 383
USS systems 1, 539
V
vector 30
visual cues 41
W
write range map stage 533
write range map stage input
properties 535
Z
z/OS systems 539