Partitioning in DataStage
Agenda
Introduction
Why do we need partitioning
Types of partitioning
Introduction
The strength of DataStage Parallel Extender (PX) is the parallel processing capability it brings to your data extraction and transformation applications. DataStage PX can slice the data into chunks and process them simultaneously. Parallelism in DataStage PX is of two types.
Pipeline parallelism. Partition parallelism.
Types of Parallelism
Parallelism in PX jobs is of two types. Pipeline: the output of a producer operator is processed by a consumer operator before the producer has finished processing its input. Partition: the data is broken into subsets, and each subset is processed by its own instance of an operator at the same time.
Pipeline parallelism
If a job using the parallel extender ran sequentially, each stage would process a single row of data and then pass it to the next process, which would run, process that row, and pass it on. When the same job runs in parallel, the reading stage starts on one node and begins filling a pipeline with the data it has read. The next stage starts running on another node as soon as there is data in the pipeline, processes it, and fills another pipeline.
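As a rough single-process analogy (not DataStage code), Python generators show the pipeline idea: the downstream stage starts consuming rows as soon as the upstream stage produces them, without waiting for the whole input to be read.

```python
def read_stage():
    # Producer: yields rows one at a time instead of materializing all input.
    for i in range(5):
        yield {"id": i}

def transform_stage(rows):
    # Consumer: starts working as soon as the first row arrives in the pipeline.
    for row in rows:
        yield {**row, "doubled": row["id"] * 2}

# Rows flow through both stages one at a time.
results = list(transform_stage(read_stage()))
```

In real PX jobs the stages run as separate processes connected by data links; the generator chain only illustrates the "consume before the producer finishes" behaviour.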
Partition parallelism
When the same job processes a huge volume of data, pipelining alone takes time. We can use the parallel processing power of DataStage by partitioning the data into separate data sets. Each of these sets is then processed on its own node.
Why do we need partitioning
To introduce parallel processing into a job, its data must be partitioned. Partitioning improves performance because each node works on a different partition at the same time.
Types of partitioning
The available partitioning methods are:
Round Robin
Random
Same
Entire
Hash
Modulus
Range
DB2
Auto
Round Robin
The first record goes to the first processing node, the second record to the second processing node, and so on; after the last processing node is reached, the next record goes back to the first node. Round robin produces partitions of (nearly) equal size, so it is useful for redistributing data whose partitions have become unequal.
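The cyclic distribution can be sketched in a few lines of Python (a hypothetical helper, not DataStage code):

```python
def round_robin_partition(records, num_partitions):
    """Distribute records cyclically: record i goes to partition i mod N."""
    partitions = [[] for _ in range(num_partitions)]
    for i, record in enumerate(records):
        partitions[i % num_partitions].append(record)
    return partitions

# 7 records across 3 partitions: sizes differ by at most one.
parts = round_robin_partition(list(range(7)), 3)
# parts == [[0, 3, 6], [1, 4], [2, 5]]
```

Note that partition sizes can never differ by more than one record, which is what makes round robin a good choice for rebalancing.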
Same
The fastest partitioning method. Records stay on the same processing node: the operator uses the output of the preceding stage as-is, with no repartitioning.
Entire
Every processing node of the stage receives the entire data set. Used when the data is small enough to fit into memory and every node needs access to all of the data.
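In effect, Entire is a broadcast: every node gets a full copy rather than a subset. A one-line sketch (hypothetical helper names):

```python
def entire_partition(records, num_partitions):
    """Entire: broadcast a full copy of the data set to every node."""
    return [list(records) for _ in range(num_partitions)]

copies = entire_partition([1, 2, 3], 4)
# Every partition holds the complete data set [1, 2, 3].
```

This is why Entire only makes sense for small data sets, such as reference data for lookups: the memory cost is the data size multiplied by the number of nodes.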
Hash
Partitioning is based on a function of one or more columns chosen as hash keys. This method is used when related records (those with equal key values) must be kept in the same partition. It does not guarantee that partitions are evenly sized. Hash partitioning is commonly used with the Join, Sort, Merge, and Lookup stages.
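A sketch of hash partitioning using Python's built-in `hash` (the function and record layout are illustrative): records with equal keys always land in the same partition, but partition sizes depend on the key distribution.

```python
def hash_partition(records, key, num_partitions):
    """Route each record by a hash of its key column.

    Equal key values always map to the same partition, which is the
    property joins, sorts, and merges rely on."""
    partitions = [[] for _ in range(num_partitions)]
    for record in records:
        p = hash(record[key]) % num_partitions
        partitions[p].append(record)
    return partitions

rows = [{"k": "a", "v": 1}, {"k": "a", "v": 2}, {"k": "b", "v": 3}]
parts = hash_partition(rows, "k", 2)
# Both "k" == "a" rows are guaranteed to be in the same partition.
```

If many records share one key value, that partition grows large while others stay small; that skew is the "not evenly distributed" caveat above.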
Modulus
Partitioning is based on a numeric key column modulo the number of partitions. This method is similar to hashing on a single field, but involves simpler computation.
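The computation really is just one modulo per record, as this sketch shows (hypothetical helper, not DataStage code):

```python
def modulus_partition(records, key, num_partitions):
    """Modulus: partition = numeric key value mod number of partitions."""
    partitions = [[] for _ in range(num_partitions)]
    for record in records:
        partitions[record[key] % num_partitions].append(record)
    return partitions

rows = [{"k": n} for n in range(6)]
parts = modulus_partition(rows, "k", 3)
# parts[0]: k=0, k=3   parts[1]: k=1, k=4   parts[2]: k=2, k=5
```

Like hash, equal keys stay together; unlike hash, the key must be numeric, since the value itself (not a hash of it) is reduced modulo the partition count.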
Range
Divides the data set into approximately equal-sized partitions, each containing records whose key columns fall within a specified range. Like hash, this method ensures that related records end up in the same partition. It requires a range map, which decides which records go to which processing node.
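Conceptually, the range map is a sorted list of split points; each record's key is located within those boundaries to pick its partition. A sketch using the standard `bisect` module (boundary values are illustrative):

```python
import bisect

def range_partition(records, key, boundaries):
    """Range: 'boundaries' is a sorted range map of N-1 split points
    defining N partitions; each record is routed by where its key falls."""
    partitions = [[] for _ in range(len(boundaries) + 1)]
    for record in records:
        p = bisect.bisect_right(boundaries, record[key])
        partitions[p].append(record)
    return partitions

boundaries = [10, 20]  # range map: keys <=10, 11..20, >20
rows = [{"k": v} for v in (5, 10, 15, 25)]
parts = range_partition(rows, "k", boundaries)
# parts == [[{"k": 5}], [{"k": 10}, {"k": 15}], [{"k": 25}]]
```

In DataStage the range map is built by sampling the data beforehand, so that the split points yield approximately equal-sized partitions.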
Range map
DB2
Data is partitioned in the same way as the target DB2 table. Used when writing to a DB2 table; it is the default partitioning method for DB2 stages.
Degree of parallelism
The degree of parallelism is determined by the configuration file: it equals the total number of logical nodes in the default pool, or a subset of them if "constraints" are used.
Constraints assign nodes to specific pools defined in the configuration file, and a stage can reference a pool to restrict where it runs.
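For illustration, a two-node parallel configuration file in the APT config syntax (the node names, host name, and paths are hypothetical; check your own environment's layout):

```
{
    node "node1"
    {
        fastname "etl_server"
        pools ""
        resource disk "/ds/data" {pools ""}
        resource scratchdisk "/ds/scratch" {pools ""}
    }
    node "node2"
    {
        fastname "etl_server"
        pools ""
        resource disk "/ds/data" {pools ""}
        resource scratchdisk "/ds/scratch" {pools ""}
    }
}
```

With both logical nodes in the default pool (pools ""), a job using this file runs with a degree of parallelism of two; naming a node in a different pool and constraining a stage to that pool restricts where that stage executes.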
Partitioner: distributes the rows arriving at a stage across its partitions according to the chosen partitioning method.
Collector: gathers the rows from all partitions back into a single sequential stream, for example before a sequential output stage.