
Partitioning in Datastage

The document discusses different types of partitioning in DataStage Parallel Extender including round robin, random, same, entire, hash, modulus, range, DB2 and auto partitioning. It explains each type of partitioning and provides examples of how and when each would be used.

Uploaded by

Vamsi Karthik
Copyright
© Attribution Non-Commercial (BY-NC)

Partitioning

Agenda
Introduction
Why do we need partitioning
Types of partitioning

2002. Infosys Technologies Ltd.

Introduction
The strength of DataStage Parallel Extender (PX) lies in the parallel processing capability it brings to data extraction and transformation applications. DataStage PX can slice the data into chunks and process them simultaneously. Parallelism in DataStage PX is of two types:
Pipeline parallelism
Partition parallelism

Types of Parallelism
Parallelism in PX jobs is of two types.
Pipeline: the output of a producer operator is processed by a consumer operator before the producer operator has finished processing its input.
Partition: data is broken into partitions, and each partition is processed by a separate instance of the operator at the same time.

Pipeline parallelism
If a Parallel Extender job ran sequentially, each stage would process a single row of data and pass it to the next stage, which would process that row and pass it on in turn. When the same job runs in parallel, the reading stage starts on one node and begins filling a pipeline with the data it has read. The next stage starts running on another node as soon as there is data in the pipeline, processes it, and begins filling another pipeline.
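The idea can be sketched outside DataStage with two threads connected by queues. This is only an illustration of pipelining, not DataStage code: the names reader, transformer and run_pipeline are hypothetical, and doubling each row stands in for a real transformation.

```python
import queue
import threading

SENTINEL = object()  # marks the end of the data stream


def reader(rows, out_q):
    """Reading stage: fills the first pipeline with the rows it has read."""
    for row in rows:
        out_q.put(row)
    out_q.put(SENTINEL)


def transformer(in_q, out_q):
    """Next stage: starts work as soon as a row is available in the pipeline."""
    while True:
        row = in_q.get()
        if row is SENTINEL:
            out_q.put(SENTINEL)
            break
        out_q.put(row * 2)  # placeholder transformation


def run_pipeline(rows):
    # A small bounded queue forces the reader and transformer to overlap.
    q1, q2 = queue.Queue(maxsize=2), queue.Queue()
    t1 = threading.Thread(target=reader, args=(rows, q1))
    t2 = threading.Thread(target=transformer, args=(q1, q2))
    t1.start()
    t2.start()
    results = []
    while True:
        item = q2.get()
        if item is SENTINEL:
            break
        results.append(item)
    t1.join()
    t2.join()
    return results
```

The transformer does not wait for the reader to finish; it consumes rows while the reader is still producing them, which is the behaviour described above.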

Partition parallelism
When the same job processes a huge volume of data, pipelining alone would still take time. We can use the parallel processing power of DataStage by partitioning the data into separate sets. Each of these sets is then processed by a separate node.
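As a rough sketch of the same idea in plain Python, each partition can be handed to its own worker. The names process_partition and run_partitioned are hypothetical, the worker threads stand in for processing nodes, and the doubling step is a placeholder for real stage logic.

```python
from concurrent.futures import ThreadPoolExecutor


def process_partition(partition):
    """The same logic runs on every partition, independently of the others."""
    return [row * 2 for row in partition]  # placeholder transformation


def run_partitioned(partitions):
    """Process each partition on its own worker (one worker per 'node')."""
    with ThreadPoolExecutor(max_workers=max(1, len(partitions))) as pool:
        # map() returns the results in the same order as the input partitions.
        return list(pool.map(process_partition, partitions))
```

How the rows are divided into partitions in the first place is decided by the partitioning method.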

Partition and Pipeline
When more processors are available, pipeline and partition parallelism can be used together to achieve better performance.

Why do we need partitioning
To introduce parallel processing into a job, the data must be partitioned. Partitioning the data is what delivers greater performance, because each node works on a different partition.

Types of partitioning
The following partitioning methods are available:
Round Robin
Random
Same
Entire
Hash
Modulus
Range
DB2
Auto

Round Robin
The first record goes to the first processing node, the second record goes to the second processing node, and so on. Once the last processing node is reached, the next record goes to the first processing node again. This method creates equal-sized partitions, so it is used to resize partitions that are not equal in size. It is also used to create sequences.
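A minimal sketch of this distribution in Python, assuming records arrive as a simple list; round_robin_partition is a hypothetical helper, not a DataStage function.

```python
def round_robin_partition(records, num_nodes):
    """Deal records to nodes in turn, wrapping back to the first node."""
    partitions = [[] for _ in range(num_nodes)]
    for i, record in enumerate(records):
        partitions[i % num_nodes].append(record)
    return partitions
```

With seven records and three nodes the partitions hold three, two and two records, so their sizes never differ by more than one.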

Same
Fastest method of partitioning. Records stay on the processing node they are already on. The operator does no repartitioning; it uses the partitions output by the preceding stage as they are.

Entire
Every processing node of the stage gets the entire data set. Used when the data is small enough to fit into memory and every node needs access to all of it.

Generally used in lookups to create the hash table.
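The sketch below illustrates the idea with hypothetical helpers: entire_partition gives every node its own full copy of a small reference data set, and build_lookup turns one copy into an in-memory table keyed on a chosen column. Neither name comes from DataStage.

```python
def entire_partition(records, num_nodes):
    """Give each node its own complete copy of the data set."""
    return [list(records) for _ in range(num_nodes)]


def build_lookup(partition, key):
    """Build an in-memory lookup table keyed on the chosen column."""
    return {record[key]: record for record in partition}
```

Because every node holds the full reference data, any node can resolve any lookup key locally, at the cost of storing one copy per node.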

Hash
Partitioning is based on a function of the columns chosen as hash keys. This method is used when related records need to be kept in the same partition. It does not ensure that the partitions are evenly sized. This partitioning method is used in Join, Sort, Merge and Lookup stages.
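A small Python sketch of the principle follows. The hash function DataStage uses internally is not given in these slides, so CRC32 is an assumption chosen only because it is deterministic; hash_partition is a hypothetical helper.

```python
import zlib


def hash_partition(records, key, num_nodes):
    """Route each record by a hash of its key column."""
    partitions = [[] for _ in range(num_nodes)]
    for record in records:
        # CRC32 is an illustrative stand-in for the real hash function.
        digest = zlib.crc32(str(record[key]).encode("utf-8"))
        partitions[digest % num_nodes].append(record)
    return partitions
```

All records that share a key value land in the same partition, but if one key value dominates the data, its partition will be much larger than the others.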

Modulus
Partitioning is based on a key column modulo the number of partitions. This method is similar to hash by field, but involves simpler computation.
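The computation is simple enough to show in a few lines of Python, assuming an integer key column; modulus_partition is a hypothetical helper.

```python
def modulus_partition(records, key, num_partitions):
    """Partition number = integer key value modulo the number of partitions."""
    partitions = [[] for _ in range(num_partitions)]
    for record in records:
        partitions[record[key] % num_partitions].append(record)
    return partitions
```

For example, with four partitions a record whose key is 14 goes to partition 2, because 14 modulo 4 is 2.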

Range
Divides a data set into approximately equal-sized partitions, each of which contains records with key columns within a specified range. This method is also useful for ensuring that related records are in the same partition. It needs a range map to be created, which decides which records go to which processing node.
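One way to picture a range map is as a sorted list of upper boundaries, one per partition except the last. The sketch below makes that assumption; it does not reproduce the actual DataStage range map format, and range_partition is a hypothetical helper.

```python
import bisect


def range_partition(records, key, boundaries):
    """Route records using a range map given as sorted upper boundaries.

    Keys up to and including boundaries[0] go to partition 0, keys up to
    and including boundaries[1] go to partition 1, and so on; larger keys
    go to the last partition.
    """
    partitions = [[] for _ in range(len(boundaries) + 1)]
    for record in records:
        index = bisect.bisect_left(boundaries, record[key])
        partitions[index].append(record)
    return partitions
```

With boundaries of 10 and 20, keys 5 and 10 go to the first partition, 11 and 20 to the second, and 21 to the third. The partitions come out roughly equal in size only if the boundaries reflect how the key values are actually distributed.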

DB2
Data is partitioned the same way as the DB2 table. Used when writing to a DB2 table. This is the default partitioning method for DB2 stages.

Degree of parallelism
The degree of parallelism is determined by the configuration file.
It is the total number of logical nodes in the default pool, or a subset of them if "constraints" are used.
Constraints are assigned to specific pools as defined in the configuration file and can be referenced in the stage.

Job performance can be improved by choosing the best configuration for a job.

Partitioning and Collecting Icons

Partitioner

Collector
