100% found this document useful (1 vote)
206 views22 pages

Introduction to ETL Processes

The document discusses the key components of the ETL (extract, transform, load) process: extracting data from source systems, cleaning and conforming the data, delivering the data by loading it into data warehouse tables, and managing the ETL environment. It describes each component in detail, covering topics like data profiling, extraction strategies, data quality processes, dimension and fact table loading, aggregate building, and ETL tool and job management. The document also provides guidance on creating an ETL specification document to design and plan the ETL process and jobs.

Uploaded by

srinkulkarni2967
Copyright
© Attribution Non-Commercial (BY-NC)
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PPT, PDF, TXT or read online on Scribd
100% found this document useful (1 vote)
206 views22 pages

Introduction to ETL Processes

The document discusses the key components of the ETL (extract, transform, load) process: extracting data from source systems, cleaning and conforming the data, delivering the data by loading it into data warehouse tables, and managing the ETL environment. It describes each component in detail, covering topics like data profiling, extraction strategies, data quality processes, dimension and fact table loading, aggregate building, and ETL tool and job management. The document also provides guidance on creating an ETL specification document to design and plan the ETL process and jobs.

Uploaded by

srinkulkarni2967
Copyright
© Attribution Non-Commercial (BY-NC)
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PPT, PDF, TXT or read online on Scribd

ETL

By Dr. Gabriel

ETL Process
4 major components:
Extracting
Gathering raw data from source systems and storing it in ETL staging environment

Cleaning and conforming


Processing data to improve its quality, format it, merge from multiple sources, enforce conformed dimensions

Delivering
Loading data into data warehouse tables

Managing
Management of ETL environment

ETL: Extracting
Data profiling Identifying data that changed since last load extraction

ETL: Cleaning and Conforming


Data cleansing Recording error events Audit dimensions Deduping Creating and maintaining conformed dimensions and facts

ETL: Delivering
Implementation of SCD logic Surrogate key generation Managing hierarchies in dimensions Managing special dimensions such as date and time, junk, mini, shrunken, small static, and usermaintained dimensions
Mini dimensions
used to track changes of dimension attribute when type 2 technique is infeasible. Similar to junk dimensions Typically is used for large dimensions Combinations can be built in advance or on the fly Built from dimension table input

ETL: Delivering (Cont)


Small static dimensions
Dimensions created by the ETL system without real source Lookup dimensions for translations of codes, etc.

User maintained dimensions


Master dimensions without real source system Descriptions, groupings, hierarchies created for reporting and analysis purposes.

ETL: Delivering (Cont)


Fact table loading Building and maintaining bridge dimension tables Handling late arriving data Management of conformed dimensions Administration of fact tables Building aggregations Building OLAP cubes Transferring DW data to other environment for specific purposes

ETL: Managing
Management of ETL environment
Goals
Reliability Availability Manageability

Job scheduler backup system Recovery and restart system Version control system

ETL: Managing (Cont.)


Version migration system Workflow monitor Sorting system Analyzing dependencies and lineage Problem escalation system Parallelization Security system Compliance manager Metadata repository manager

ETL Process
Planning
High level source to target data flow diagram Selection and implementation of ETL tool Development of default strategies for dimension management, error handling, and other processes Development data transformations diagrams by target table Development of job sequencing

ETL Process
Developing one-time historic load
Build and test the historic dimension and fact tables load

Developing incremental load process


Build and test dimension and fact tables incremental load processes Build and test aggregate table loads and/or OLAP processing Design, build, and test the ETL system automation

ETL Tools: Build vs Buy


Many off-the-shelf tools exist Benefits are not seen right away
Setup Learning curve

High-end tools may not justify value for smaller warehouses

Off-the-shelf ETL Tools


Tool Oracle Warehouse Builder (OWB) Data Integrator (BODI) IBM Information Server (Ascential) SAS Data Integration Studio Vendor Oracle Business Objects IBM SAS Institute

PowerCenter
Oracle Data Integrator (Sunopsis) Data Migrator Integration Services Talend Open Studio DataFlow

Informatica
Oracle Information Builders Microsoft Talend Group 1 Software (Sagent)

Data Integrator
Transformation Server Transformation Manager Data Manager DT/Studio ETL4ALL

Pervasive
DataMirror ETL Solutions Ltd. Cognos Embarcadero Technologies IKAN

DB2 Warehouse Edition


Jitterbit Pentaho Data Integration

IBM
Jitterbit Pentaho

ETL Specification Document


Can be as large as 100 pages per business process; In reality, the work starts after the high level design is documented in a few pages. Source-to-target mappings Data profiling reports Physical design decisions Default strategy for extracting from each major source system Archival strategy Data quality tracking and metadata Default strategy for managing changes to dimension attributes

ETL Specification Document (Cont)


System availability requirements and strategy Design of data auditing mechanism Location of staging areas Historic and incremental load strategies for each table
Detailed table design Historic data load parameters (# of months) and volumes (# of rows) Incremental data volumes

ETL Specification Document (Cont)


Handling of late arriving data Load frequency Handling of changes in each dimension attribute (types 1,2,3) Table partitioning Overview of data sources; discussion of sourcespecific characteristics Extract strategy for the source data Change data capture logic for each source table Dependencies Transformation logic (diagram or pseudo code)

ETL Specification Document (Cont)


Preconditions to avoid error conditions Recovery and restart assumptions for each major step of the ETL pipeline Archiving assumptions for each table Cleanup steps Estimated effort

Overall workflow Job sequencing Logical dependencies

Loading Pointers
One time historic load
Disable RI constraints (FKs) and re-enable them after the load is complete Drop indexes and re-create them after the load is complete Use bulk loading techniques Not always the case

Loading Pointers (Cont)


Incremental load

Loading Pointers (Cont)


Sometimes historic and incremental load logic is the same; many times- is similar. Updating aggregations, if necessary Error handling

Sample: Generation of Surrogate Keys on SQL Server


As simple as: DECLARE @i INTEGER SELECT @i = MAX(ID) + 1 FROM TableName But may not work with concurrent processes OR Create PROCEDURE pGetNextID (@SeedName VARCHAR(32), @SeedValue BIGINT OUTPUT) AS UPDATE Lookup_Seed SET @SeedValue = SeedValue = SeedValue + 1 WHERE SeedID = @SeedName Lookup_Seed table: SeedID varchar (32) SeedValue bigint

Questions ?

You might also like