ETL
By Dr. Gabriel
ETL Process
4 major components:
Extracting
Gathering raw data from source systems and storing it in the ETL staging environment
Cleaning and conforming
Processing data to improve its quality, format it, merge it from multiple sources, and enforce conformed dimensions
Delivering
Loading data into data warehouse tables
Managing
Management of ETL environment
ETL: Extracting
Data profiling
Identifying data that changed since the last load
Extraction
ETL: Cleaning and Conforming
Data cleansing
Recording error events
Audit dimensions
Deduping
Creating and maintaining conformed dimensions and facts
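A minimal sketch of the deduping step, assuming incoming records carry a natural key and a last-updated timestamp (the field names `customer_id` and `updated_at` are illustrative); the survivorship rule here simply keeps the most recent record per key:

```python
from datetime import date

def dedupe(records, key="customer_id", stamp="updated_at"):
    """Keep only the most recent record per natural key."""
    latest = {}
    for rec in records:
        k = rec[key]
        if k not in latest or rec[stamp] > latest[k][stamp]:
            latest[k] = rec
    return list(latest.values())

rows = [
    {"customer_id": 1, "name": "Ann",  "updated_at": date(2024, 1, 5)},
    {"customer_id": 1, "name": "Anne", "updated_at": date(2024, 3, 1)},
    {"customer_id": 2, "name": "Bob",  "updated_at": date(2024, 2, 2)},
]
clean = dedupe(rows)  # one surviving row per customer_id
```

Real cleansing pipelines would also log each discarded duplicate as an error/audit event rather than silently dropping it.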
ETL: Delivering
Implementation of SCD (slowly changing dimension) logic
Surrogate key generation
Managing hierarchies in dimensions
Managing special dimensions such as date and time, junk, mini, shrunken, small static, and user-maintained dimensions
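As an illustration of SCD logic, here is a minimal type 2 sketch, assuming the dimension is held as a list of dicts with effective-date columns `valid_from`/`valid_to` and a caller-supplied surrogate key generator (the table layout, column names, and tracked attribute `city` are all hypothetical):

```python
from datetime import date

HIGH_DATE = date(9999, 12, 31)  # open-ended marker for the current version

def apply_scd2(dim, incoming, load_date, next_key):
    """Type 2 handling: expire the current row and insert a new version on change."""
    # Index the current version of each natural key
    current = {r["customer_id"]: r for r in dim if r["valid_to"] == HIGH_DATE}
    for row in incoming:
        cur = current.get(row["customer_id"])
        if cur is not None and cur["city"] == row["city"]:
            continue  # tracked attribute unchanged, nothing to do
        if cur is not None:
            cur["valid_to"] = load_date  # expire the old version
        dim.append({"key": next_key(), "customer_id": row["customer_id"],
                    "city": row["city"],
                    "valid_from": load_date, "valid_to": HIGH_DATE})

keys = iter(range(100, 10_000))
dim = [{"key": 1, "customer_id": 42, "city": "Oslo",
        "valid_from": date(2023, 1, 1), "valid_to": HIGH_DATE}]
apply_scd2(dim, [{"customer_id": 42, "city": "Bergen"}],
           date(2024, 6, 1), lambda: next(keys))
# dim now holds the expired Oslo row plus a new current Bergen row
```

Type 1 (overwrite) and type 3 (previous-value column) follow the same comparison step but differ in how the matched row is updated.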
Mini dimensions
Used to track changes of dimension attributes when the type 2 technique is infeasible
Similar to junk dimensions
Typically used for large dimensions
Combinations can be built in advance or on the fly
Built from dimension table input
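A sketch of building a mini-dimension's combinations in advance, assuming two hypothetical banded attributes split off a large customer dimension:

```python
from itertools import product

# Hypothetical low-cardinality bands carved out of a large customer dimension
age_bands = ["<25", "25-44", "45+"]
income_bands = ["low", "mid", "high"]

# Build every combination in advance, assigning surrogate keys as we go;
# fact rows would then reference demographics_key alongside the customer key.
mini_dim = [
    {"demographics_key": k, "age_band": a, "income_band": i}
    for k, (a, i) in enumerate(product(age_bands, income_bands), start=1)
]
```

Pre-building works when the combination count is small; with many attributes the ETL system instead inserts combinations on the fly as they first appear in the source.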
ETL: Delivering (Cont)
Small static dimensions
Dimensions created by the ETL system without a real source
Lookup dimensions for translations of codes, etc.
User maintained dimensions
Master dimensions without a real source system
Descriptions, groupings, and hierarchies created for reporting and analysis purposes
ETL: Delivering (Cont)
Fact table loading
Building and maintaining bridge dimension tables
Handling late-arriving data
Management of conformed dimensions
Administration of fact tables
Building aggregations
Building OLAP cubes
Transferring DW data to other environments for specific purposes
ETL: Managing
Management of ETL environment
Goals
Reliability
Availability
Manageability
Job scheduler
Backup system
Recovery and restart system
Version control system
ETL: Managing (Cont.)
Version migration system
Workflow monitor
Sorting system
Analyzing dependencies and lineage
Problem escalation system
Parallelization
Security system
Compliance manager
Metadata repository manager
ETL Process
Planning
High-level source-to-target data flow diagram
Selection and implementation of an ETL tool
Development of default strategies for dimension management, error handling, and other processes
Development of data transformation diagrams by target table
Development of job sequencing
ETL Process
Developing one-time historic load
Build and test the historic dimension and fact table loads
Developing incremental load process
Build and test dimension and fact table incremental load processes
Build and test aggregate table loads and/or OLAP processing
Design, build, and test the ETL system automation
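A sketch of an aggregate table load, using SQLite as a stand-in warehouse; the fact and aggregate table names are invented, and the aggregate is simply rebuilt from the fact table after each incremental load:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales_fact (date_key INT, product_key INT, amount REAL)")
con.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)",
                [(20240101, 1, 10.0), (20240101, 1, 5.0), (20240101, 2, 7.5)])

# Rebuild the aggregate from scratch after the fact load completes
con.execute("DROP TABLE IF EXISTS sales_by_product_agg")
con.execute("""CREATE TABLE sales_by_product_agg AS
               SELECT product_key,
                      SUM(amount) AS total_amount,
                      COUNT(*)    AS row_cnt
               FROM sales_fact
               GROUP BY product_key""")
agg = con.execute("SELECT * FROM sales_by_product_agg ORDER BY product_key").fetchall()
```

Full rebuilds keep the logic simple; large warehouses would instead apply only the incremental deltas to the aggregate, or let the platform maintain materialized views.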
ETL Tools: Build vs Buy
Many off-the-shelf tools exist
Benefits are not seen right away:
Setup
Learning curve
High-end tools may not justify their cost for smaller warehouses
Off-the-shelf ETL Tools
Tool                               | Vendor
Oracle Warehouse Builder (OWB)     | Oracle
Data Integrator (BODI)             | Business Objects
IBM Information Server (Ascential) | IBM
SAS Data Integration Studio        | SAS Institute
PowerCenter                        | Informatica
Oracle Data Integrator (Sunopsis)  | Oracle
Data Migrator                      | Information Builders
Integration Services               | Microsoft
Talend Open Studio                 | Talend
DataFlow                           | Group 1 Software (Sagent)
Data Integrator                    | Pervasive
Transformation Server              | DataMirror
Transformation Manager             | ETL Solutions Ltd.
Data Manager                       | Cognos
DT/Studio                          | Embarcadero Technologies
ETL4ALL                            | IKAN
DB2 Warehouse Edition              | IBM
Jitterbit                          | Jitterbit
Pentaho Data Integration           | Pentaho
ETL Specification Document
Can be as large as 100 pages per business process; in reality, the work starts after the high-level design is documented in a few pages.
Source-to-target mappings
Data profiling reports
Physical design decisions
Default strategy for extracting from each major source system
Archival strategy
Data quality tracking and metadata
Default strategy for managing changes to dimension attributes
ETL Specification Document (Cont)
System availability requirements and strategy
Design of the data auditing mechanism
Location of staging areas
Historic and incremental load strategies for each table
Detailed table design
Historic data load parameters (# of months) and volumes (# of rows)
Incremental data volumes
ETL Specification Document (Cont)
Handling of late-arriving data
Load frequency
Handling of changes in each dimension attribute (types 1, 2, 3)
Table partitioning
Overview of data sources; discussion of source-specific characteristics
Extract strategy for the source data
Change data capture logic for each source table
Dependencies
Transformation logic (diagram or pseudocode)
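A minimal snapshot-comparison sketch of change data capture, assuming full extracts of a source table keyed by `id` are available from the previous and current runs (log-based or timestamp-based CDC is usually preferred when the source supports it, and deletes would need a third comparison):

```python
def changed_rows(previous, current, key="id"):
    """Snapshot-compare CDC: report inserts and updates since the last extract."""
    prev = {r[key]: r for r in previous}
    inserts = [r for r in current if r[key] not in prev]
    updates = [r for r in current if r[key] in prev and r != prev[r[key]]]
    return inserts, updates

old = [{"id": 1, "status": "open"}, {"id": 2, "status": "open"}]
new = [{"id": 1, "status": "open"},
       {"id": 2, "status": "closed"},
       {"id": 3, "status": "open"}]
ins, upd = changed_rows(old, new)
# only rows 2 (updated) and 3 (inserted) flow into the incremental load
```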
ETL Specification Document (Cont)
Preconditions to avoid error conditions
Recovery and restart assumptions for each major step of the ETL pipeline
Archiving assumptions for each table
Cleanup steps
Estimated effort
Overall workflow
Job sequencing
Logical dependencies
Loading Pointers
One time historic load
Disable RI constraints (FKs) and re-enable them after the load is complete
Drop indexes and re-create them after the load is complete
Use bulk loading techniques
These optimizations are not always applicable
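A sketch of the drop-index / bulk-load / rebuild-index pattern, using SQLite as a stand-in (the table and index names are invented, and a real warehouse would use the platform's bulk loader rather than `executemany`):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE fact (id INTEGER, amount REAL)")
con.execute("CREATE INDEX ix_fact_id ON fact (id)")

rows = [(i, float(i)) for i in range(10_000)]

con.execute("DROP INDEX ix_fact_id")                       # drop before the load
con.executemany("INSERT INTO fact VALUES (?, ?)", rows)    # bulk insert
con.execute("CREATE INDEX ix_fact_id ON fact (id)")        # rebuild afterwards
con.commit()

count = con.execute("SELECT COUNT(*) FROM fact").fetchone()[0]
```

Loading into an unindexed table avoids maintaining the index row by row; one rebuild at the end is typically much cheaper for large historic loads.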
Loading Pointers (Cont)
Incremental load
Loading Pointers (Cont)
Historic and incremental load logic is sometimes identical, and often similar
Updating aggregations, if necessary
Error handling
Sample: Generation of Surrogate Keys on SQL Server
As simple as:

```sql
DECLARE @i INTEGER
SELECT @i = MAX(ID) + 1 FROM TableName
```

But this may not work with concurrent processes: two sessions can read the same MAX and generate duplicate keys. Alternatively, use a seed table updated atomically:

```sql
CREATE PROCEDURE pGetNextID
    @SeedName  VARCHAR(32),
    @SeedValue BIGINT OUTPUT
AS
    -- Single atomic UPDATE: increments the seed and captures the new value
    UPDATE Lookup_Seed
    SET @SeedValue = SeedValue = SeedValue + 1
    WHERE SeedID = @SeedName
```

Lookup_Seed table:

```sql
CREATE TABLE Lookup_Seed (
    SeedID    VARCHAR(32),
    SeedValue BIGINT
)
```
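The same seed-table idea can be sketched outside SQL Server, here in Python with SQLite, doing the increment and read-back inside one transaction so that concurrent callers cannot observe the same value (the seed name `customer_dim` is illustrative):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Lookup_Seed (SeedID TEXT PRIMARY KEY, SeedValue INTEGER)")
con.execute("INSERT INTO Lookup_Seed VALUES ('customer_dim', 0)")

def get_next_id(con, seed_name):
    """Increment the seed and read it back inside one atomic transaction."""
    with con:  # commits (or rolls back) the whole block as a unit
        con.execute("UPDATE Lookup_Seed SET SeedValue = SeedValue + 1 "
                    "WHERE SeedID = ?", (seed_name,))
        (value,) = con.execute("SELECT SeedValue FROM Lookup_Seed "
                               "WHERE SeedID = ?", (seed_name,)).fetchone()
    return value

first = get_next_id(con, "customer_dim")
second = get_next_id(con, "customer_dim")
```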
Questions ?