
Datastage Interview Questions - Part I

1. What are the Environmental variables in Datastage?

2. Check for Job Errors in datastage

3. What are Stage Variables, Derivations and Constants? 

4. What is Pipeline Parallelism?

5. Debug stages in PX

6. How do you remove duplicates in a dataset?

7. What is the difference between Job Control and Job Sequence 

8. What is the max size of Data set stage?

9. performance in sort stage

10. How to develop the SCD using LOOKUP stage?

12. What are the errors you have experienced with DataStage?

13. What are the main differences between a server job and a parallel job in DataStage?

14. Why you need Modify Stage?

15. What is the difference between the Sequential stage and the Dataset stage? When do you use them?

16. memory allocation while using lookup stage

17. What is a Phantom error in DataStage? How do you overcome this error?

18. Parameter file usage in Datastage

19. Explain the best approach to do an SCD Type 2 mapping in a parallel job.

20. How can we improve the performance of the job while handling a huge amount of data?

21. How can we create read-only jobs in DataStage?

22. How do you implement routines in DataStage? Does anyone have any material on DataStage routines?

23. How will you determine the sequence of jobs to load into data warehouse? 

24. How can we Test jobs in Datastage??

25. DataStage - delete the header and footer on the source sequential file

26. How can we implement Slowly Changing Dimensions in DataStage?.

27. Differentiate Database data and Data warehouse data? 

28. How to run a Shell Script within the scope of a Data stage job? 

29. What is the difference between DataStage and Informatica?

30. Explain about job control language such as (DS_JOBS)

32. What is Invocation ID?

33. How to connect two stages which do not have any common columns between them?

34. In SAP R/3, how do you declare and pass parameters in a parallel job?
35. Difference between Hashfile and Sequential File?

36. How do you fix the error "OCI has fetched truncated data" in DataStage

37. A batch job is scheduled and normally runs in 5 minutes, but after 10 days the run time changes to 10 minutes. What type of error is this and how do you fix it?

38. Which partitioning method should we use for the Aggregator stage in parallel jobs?

39. What is the baseline for implementing partitioning or parallel execution in a DataStage job? For example, is it only advised for more than 2 million records?

40. How do we create an index in DataStage?

41. What is the flow of loading data into fact & dimensional tables? 

42. What is a sequential file that has a single input link?

43. Aggregators – What does the warning “Hash table has grown to ‘xyz’ ….” mean? 

44. What is a hashing algorithm?

45. How do you load partial data after a job fails?

The source has 10,000 records and the job failed after 5,000 records were loaded; the status of the job is 'aborted'. Instead of removing the 5,000 records from the target, how can I resume the load?
46. What are the Orchestrate options in the Generic stage, and what are the option names and values (the name of an Orchestrate operator to call)? What Orchestrate operators are available in DataStage for the AIX environment?

47. Type 30D hash file is GENERIC or SPECIFIC?

48. Is a Hashed file an Active or a Passive stage? When will it be useful?

49. How do you extract job parameters from a file?

50. 
1. What about System variables? 
2. How can we create Containers? 
3. How can we improve the performance of DataStage? 
4. What are the Job parameters? 
5. What is the difference between a routine, a transform and a function? 
6. What are all the third party tools used in DataStage? 
7. How can we implement a Lookup in DataStage Server jobs? 
8. How can we implement Slowly Changing Dimensions in DataStage? 
9. How can we join one Oracle source and a Sequential file? 
10. What are the Iconv and Oconv functions?

Read more: http://www.placementpapers.us/datastage/300-datastage_interview_questions_part_i.html#ixzz1Lvg1Hc2y 
Under Creative Commons License: Attribution

How do you fix the error "OCI has fetched truncated data" in DataStage?

Can we use the Change Capture stage to get the truncated data? Members, please confirm.

Dear friend, not only truncated data: it captures duplicates, edits, inserts and unwanted data, that is, every change between the before and after data.

How did you connect to DB2 in your last project?

Most of the time the data was sent to us in the form of flat files; the data was dumped and sent to us. In cases where we needed to connect to DB2, for look-ups for instance, we used ODBC drivers.
 
Asked by: Interview Candidate
 
Best Answer:
By making use of the DB2/UDB stage. There we basically need to specify connection parameters such as the client instance name, default server and default database.

What are Sequencers?


Sequencers are job control programs that execute other jobs with preset Job parameters.      
 
Asked by: Interview Candidate
 
Best Answer:
A sequencer allows you to synchronize the control flow of multiple activities in a job sequence. It can have multiple input triggers as well as multiple output triggers. The sequencer operates in two modes: ALL mode, in which all of the inputs to the sequencer must be TRUE for any of the sequencer outputs to fire; and ANY mode, in which output triggers can be fired if any of the sequencer inputs are TRUE.
How did you handle an 'Aborted' sequencer?
In almost all cases we have to delete the data inserted by the aborted run from the database manually, fix the job and then run the job again.  
 
Asked by: Interview Candidate
 
Best Answer:
Have you set the compilation options for the sequence so that, in case a job aborts, you do not need to rerun the sequence from the first job? By selecting those compilation options you can run an aborted sequence from the point at which it was aborted.
For example, if you have 10 jobs (job1, job2, job3, etc.) in a sequence and job 5 aborts, then by checking "Add checkpoints so sequence is restartable on failure" and "Automatically handle activities that fail" you can restart this sequence from job 5 only; it will not rerun jobs 1, 2, 3 and 4.
Please check these options in your sequence.
Hope this helps.
Answered by: ritu singhai
What are other performance tunings you have done in your last project to increase the performance of slowly running jobs?

Staged the data coming from ODBC/OCI/DB2UDB stages or any database on the server using Hash/Sequential files for optimum performance and also for data recovery in case the job aborts. Tuned the OCI stage's 'Array Size' and 'Rows per Transaction' for faster inserts, updates and selects.
 
Asked by: Interview Candidate
 
Best Answer:
1. Minimise the usage of Transformers (instead use Copy, Modify, Filter or Row Generator stages).
2. Use SQL code while extracting the data.
3. Handle the nulls.
4. Minimise the warnings.
5. Reduce the number of lookups in a job design.
6. Use not more than 20 stages in a job.
7. Use an IPC stage between two passive stages; it reduces processing time.
8. Drop indexes before loading data and recreate them after loading data into tables.
9. Generally we cannot avoid lookups if our requirements make them compulsory.
10. There is no hard limit on the number of stages such as 20 or 30, but we can break a large job into smaller jobs and use Data Set stages to store the intermediate data.
11. The IPC stage is provided in Server Jobs, not in Parallel Jobs.
12. Check the write cache of the Hash file. If the same hash file is used for lookup as well as for the target, disable this option.
13. If the hash file is used only for lookup, then enable "Preload to memory". This will improve the performance. Also, check the order of execution of the routines.
14. Don't use more than 7 lookups in the same Transformer; introduce new Transformers if it exceeds 7 lookups.
15. Use the "Preload to memory" option on the hash file output.
16. Use "Write to cache" on the hash file input.
17. Write into the error tables only after all the Transformer stages.
18. Reduce the width of the input record - remove the columns that you would not use.
19. Cache the hash files you are reading from and writing into. Make sure your cache is big enough to hold the hash files.
20. Use ANALYZE.FILE or HASH.HELP to determine the optimal settings for your hash files.
       This would also minimize overflow on the hash file.
21. If possible, break the input into multiple threads and run multiple instances of the job.
22. Staged the data coming from ODBC/OCI/DB2UDB stages or any database on the server using Hash/Sequential files for optimum performance and also for data recovery in case the job aborts.
23. Tuned the OCI stage's 'Array Size' and 'Rows per Transaction' numerical values for faster inserts, updates and selects.
24. Tuned the 'Project Tunables' in Administrator for better performance.
25. Used sorted data for the Aggregator.
26. Sorted the data as much as possible in the database and reduced the use of DS Sort for better performance of jobs.
27. Removed the data not used from the source as early as possible in the job.
28. Worked with the DB admin to create appropriate indexes on tables for better performance of DS queries.
29. Converted some of the complex joins/business logic in DS to stored procedures for faster execution of the jobs.
30. If an input file has an excessive number of rows and can be split up, then use standard logic to run jobs in parallel.
31. Before writing a routine or a transform, make sure that the required functionality is not already available in one of the standard routines supplied in the sdk or ds utilities categories.
Constraints are generally CPU intensive and take a significant amount of time to process. This may be the case if the constraint calls routines or external macros, but if it is inline code then the overhead will be minimal.
32. Try to have the constraints in the 'Selection' criteria of the jobs itself. This will eliminate unnecessary records even before joins are made.
33. Tuning should occur on a job-by-job basis.
34. Use the power of the DBMS.
35. Try not to use a Sort stage when you can use an ORDER BY clause in the database.
36. Using a constraint to filter a record set is much slower than performing a SELECT ... WHERE ....
37. Make every attempt to use the bulk loader for your particular database. Bulk loaders are generally faster than using ODBC or OLE.
Where do you use Link-Partitioner and Link-Collector?

Link Partitioner - used for partitioning the data. Link Collector - used for collecting the partitioned data.

Asked by: Interview Candidate

Best Answer:
Link Partitioner and Link Collector are basically used to introduce data parallelism in server jobs. The Link Partitioner splits the data onto many links; once the data is processed, the Link Collector collects the data and passes it on to a single link. These are used in server jobs. In DataStage parallel jobs, these things are built in and taken care of automatically.
Answered by: nikhilanshuman

What are OConv() and IConv() functions and where are they used?

IConv() - converts a string to an internal storage format. OConv() - converts an expression to an output format.

Asked by: Interview Candidate

Best Answer:
Iconv is used to convert a date into the internal format, i.e. a format that only DataStage understands. For example, a date coming in as mm/dd/yyyy will be converted by DataStage into a number such as 740. You can then derive that number into your own format by using Oconv. Suppose you want to change mm/dd/yyyy to dd/mm/yyyy; you would use Iconv and Oconv together:

    Oconv(Iconv(DateStringFromInput, IconvFormat), OconvFormat)

where IconvFormat and OconvFormat are the appropriate date conversion codes (for example "D/MDY[2,2,4]" and "D/DMY[2,2,4]"; see the conversion codes in the help).
Answered by: sekr
What is Metastage?

Asked by: Interview Candidate

Best Answer:
MetaStage is a persistent metadata directory that uniquely synchronizes metadata across multiple separate silos, eliminating re-keying and the manual establishment of cross-tool relationships. Based on patented technology, it provides seamless cross-tool integration throughout the entire Business Intelligence and data integration life cycle and tool sets.
Answered by: spartankiya

We have to create users in the Administrator client and give the necessary privileges to those users.

What are the command line functions that import and export the DS jobs? 

A. dsimport.exe - imports the DataStage components.

B. dsexport.exe - exports the DataStage components.
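
As an illustrative sketch only (the switch names shown are typical but vary by DataStage release, and the host, credentials, project name "dstage1" and job name "MyJob" are placeholders; check the dsexport/dsimport documentation for your version), the commands are usually run from the client installation directory on Windows:

    REM Export a single job from project "dstage1" to a .dsx file
    dsexport.exe /H=dshost /U=dsadm /P=secret /JOB=MyJob dstage1 C:\export\MyJob.dsx

    REM Import the components in the .dsx file back into a project
    dsimport.exe /H=dshost /U=dsadm /P=secret dstage1 C:\export\MyJob.dsx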

What is difference between data stage and informatica

There are some very good articles on these differences which help to get an idea; basically
it depends on what you are trying to accomplish.

What are the requirements for your ETL tool? 

Do you have large sequential files (1 million rows, for example) that need to be compared
every day versus yesterday? 

If so, then ask how each vendor would do that. Think about what process they are going
to do. Are they requiring you to load yesterday's file into a table and do lookups? 

If so, RUN!! Are they doing a match/merge routine that knows how to process this in
sequential files? Then maybe they are the right one. It all depends on what you need the
ETL to do. 

If you are small enough in your data sets, then either would probably be OK.

I want to process 3 files sequentially, one by one. How can I do that? While processing, it should fetch the files automatically.

If the metadata for all the files is the same, then create a job that takes the file name as a parameter, then use the same job in a routine and call the job with a different file name each time; or you can create a sequence to use the job. I think in DataStage 8.0.1 there is an option in the sequence job, namely loop, via which this can be achieved. A command-line sketch is shown below.
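
A minimal sketch of the "same job, different file name parameter" idea driven from the shell (the project name "MyProject", the job name "LoadFileJob", the "InputFile" parameter and the file paths are placeholder names, not taken from the answer):

    #!/bin/sh
    # Run the same job once per file, passing the file name as a job parameter.
    for f in /data/in/file1.txt /data/in/file2.txt /data/in/file3.txt
    do
        dsjob -run -param InputFile=$f -wait -jobstatus MyProject LoadFileJob
        rc=$?
        # With -jobstatus the exit code reflects the job's finishing status
        # (commonly 1 = finished OK, 2 = finished with warnings); check the
        # dsjob documentation for your release before relying on these values.
        echo "LoadFileJob for $f returned status $rc"
    done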

What happens if RCP is disabled?

Runtime column propagation (RCP): if RCP is enabled for a job, and
specifically for those stages whose output connects to a shared container input,
then the metadata will be propagated at run time, so there is no need to map it at
design time.

If RCP is disabled for the job, then OSH has to perform an import and an
export every time the job runs, and the processing time of the job also
increases.

Where does a unix script of DataStage execute, on the client machine or on the server? And if it executes on the server, where exactly will it execute?

DataStage jobs are executed on the server machines only. There is nothing that is
stored on the client machine.

Default nodes for DataStage Parallel Edition

Actually the number of nodes depends on the number of processors in your
system. If your system has two processors, we will get two nodes by
default.

How can we pass parameters to a job by using a file?

You can do this by passing parameters from a unix file and then calling the execution
of a DataStage job; the DS job has the parameters defined (which are passed by unix). A sketch is shown below.
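
A minimal sketch of that idea, assuming a plain-text parameter file of name=value lines; the project name "MyProject", job name "MyJob" and file params.txt are illustrative placeholders, not from the answer:

    #!/bin/sh
    # Build a -param argument for each name=value line in the parameter file,
    # then launch the job with dsjob.
    PARAMS=""
    while IFS= read -r line
    do
        PARAMS="$PARAMS -param $line"
    done < params.txt

    # Note: this simple word-splitting approach assumes parameter values
    # contain no spaces.
    dsjob -run $PARAMS -wait MyProject MyJob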

What is 'insert for update' in DataStage?

I think 'insert for update' means that the updated value is inserted as a new row in order to maintain history.

What are the Repository Tables in DataStage and what are they?

A data warehouse is a repository (centralized as well as distributed) of data, able to
answer any ad hoc, analytical, historical or complex queries. Metadata is data about data.
Examples of metadata include data element descriptions, data type descriptions,
attribute/property descriptions, range/domain descriptions, and process/method
descriptions. The repository environment encompasses all corporate metadata
resources: database catalogs, data dictionaries, and navigation services. Metadata
includes things like the name, length, valid values, and description of a data element.
Metadata is stored in a data dictionary and repository. It insulates the data warehouse
from changes in the schema of operational systems.

In DataStage, under the I/O and Transfer interface tab (input, output and transfer pages)
you will have 4 tabs; the last one is Build, and under that you can find the TABLE NAME.

The DataStage client components are: Administrator - administers DataStage projects and
conducts housekeeping on the server; Designer - creates DataStage jobs that are compiled
into executable programs; Director - used to run and monitor the DataStage jobs;
Manager - allows you to view and edit the contents of the repository.
What is Version Control?

Version Control

stores different versions of DS jobs

runs different versions of the same job

reverts to a previous version of a job

views version histories


What happens if the output of a hash file is connected to a Transformer?
What error does it throw?

If a hash file output is connected to a Transformer stage, the hash file will be considered
as a lookup file when there is a primary link to the same Transformer stage; if there is no
primary link, then the hash file link will be treated as the primary link itself. You can do
SCD in a server job by using this lookup functionality. This will not return any error code.

How can I extract data from DB2 (on IBM iSeries) to the data warehouse via Datastage
as the ETL tool. I mean do I first need to use ODBC to create connectivity and use an
adapter for the extraction and transformation of data? Thanks so much if anybody could
provide an answer.

You would need to install ODBC drivers to connect to the DB2 instance (these do not
come with the regular drivers that we try to install; use the CD provided with the DB2
installation, which has the ODBC drivers to connect to DB2) and then try it out.

How can I connect my DB2 database on AS400 to DataStage? Do I need to use
ODBC first to open the database connectivity and then use an adapter just for
connecting between the two? Thanks a lot for any replies.

You need to configure the ODBC connectivity for the database (DB2 or AS400) in
DataStage.
What is NLS in DataStage? How do we use NLS in DataStage? What are the advantages?
At the time of installation I did not choose the NLS option; now I want to use that
option, what can I do? Do I reinstall DataStage, or first uninstall and install it once
again?

Just reinstall; you can see the option to include NLS.

NLS stands for National Language Support. It is used for including other countries'
languages and character sets so that data in those languages can be handled in DataStage.
