Module 2
ETL Process
Most Important and Most Challenging
• It is not uncommon for a project team to spend as much as 50–70% of the project
effort on ETL functions
• Data extraction can be quite involved, depending on the nature and complexity of the
source systems.
• Metadata on the source systems must contain information on every database and
every data structure that is needed from the source systems.
• Data transformation involves reformatting internal data structures, resequencing
data, applying various forms of conversion techniques, supplying default values
wherever values are missing, and so on.
• The sheer size of the initial load can populate millions of rows in the
data warehouse database.
• It may take two or more weeks to complete the initial physical load.
ETL Requirements and Steps
DATA EXTRACTION
Two categories:
Current Value. (most of the attributes) The value of an
attribute remains constant only until a business transaction
changes it. Data extraction for preserving the history of the
changes in the data warehouse gets quite involved for this
category of data.
Periodic Status. (not as common as the previous category)
The history of the changes is preserved in the source
systems themselves. Therefore, data extraction is relatively
easier.
Data Extraction Techniques
Disadvantages of capture through transaction logs:
• You need to ensure that all log transactions are extracted for data warehouse updates.
• If all of your source systems are database applications, there is no problem with this
technique. However, if some of your source data is in indexed or other flat files, this
option will not work for those cases, as there are no log files for these non-database
applications.
• Data replication is simply a method for creating copies of data in a distributed environment.
• The figure illustrates how replication technology can be used to capture changes to source data.
• The appropriate transaction logs contain all the changes to the various source database tables.
• Here are the broad steps for using replication to capture changes to source data:
1. Identify the source system database table
2. Identify and define target files in the staging area
3. Create mapping between the source table and target files
4. Define the replication mode
5. Schedule the replication process
6. Capture the changes from the transaction logs
7. Transfer captured data from logs to target files
8. Verify transfer of data changes
9. Confirm success or failure of replication
10. In metadata, document the outcome of replication
11. Maintain definitions of sources, targets, and mappings
• Capture through Database Triggers. This option is applicable to source systems that are
database applications.
• Triggers are special stored procedures (programs) that are stored in the database and fired
when certain predefined events occur.
• You can create trigger programs for all events for which you need data to be captured.
• The output of the trigger programs is written to a separate file that will be used to extract data for
the data warehouse. For example, if you need to capture all changes to the records in the
customer table, write a trigger program to capture all updates and deletes in that table (a
sketch appears after this list).
• Advantages:
• Data capture through database triggers occurs right at the source and is therefore quite reliable
• You can capture both before and after images
• Disadvantages:
• building and maintaining trigger programs puts an additional burden on the development effort
• Also, execution of trigger procedures during transaction processing of the source systems puts
additional overhead on the source systems
• This option is applicable only for source data in databases
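As an illustration, here is a minimal sketch of such a trigger in Oracle-style PL/SQL; the CUSTOMER table, the CUSTOMER_CHANGES capture table, and the column names are hypothetical, and the exact trigger syntax varies by DBMS.

CREATE OR REPLACE TRIGGER customer_change_capture
AFTER UPDATE OR DELETE ON customer
FOR EACH ROW
BEGIN
  -- Write the before image, the type of change, and a time stamp
  -- to a separate table read later by the warehouse extract programs
  INSERT INTO customer_changes
         (customer_id, change_type, changed_at, old_name, old_address)
  VALUES (:OLD.customer_id,
          CASE WHEN DELETING THEN 'DELETE' ELSE 'UPDATE' END,
          SYSDATE,
          :OLD.customer_name,
          :OLD.address);
END;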
• Capture in Source Applications. This technique is also referred to as application-assisted
data capture.
• The source application is made to assist in the data capture for the data warehouse.
• You have to modify the relevant application programs that write to the source files and
databases.
• You revise the programs so that, in addition to applying all adds, updates, and deletes to the
source files and database tables, they also record these changes in a separate file.
• Other extract programs can then use this separate file containing the changes to the source
data (a sketch follows this list).
Advantages: Unlike the previous two cases, this technique may be used for all types of source
data, irrespective of whether it is in databases, indexed files, or other flat files.
Disadvantages: You have to revise the programs in the source operational systems and
keep them maintained.
• This could be a formidable task if the number of source system programs is large.
• Also, this technique may degrade the performance of the source applications because of the additional
processing needed to capture the changes on separate files.
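As a minimal sketch of the idea (the tables, columns, and values are purely illustrative), the revised application issues an extra write to a change table inside the same unit of work as its normal update:

-- Normal work of the source application
UPDATE customer
   SET address = '12 New Street', city = 'Albany'
 WHERE customer_id = 1001;

-- Extra statement added for application-assisted data capture:
-- record the change for the warehouse extract programs
INSERT INTO customer_changes (customer_id, change_type, changed_at)
VALUES (1001, 'UPDATE', CURRENT_TIMESTAMP);

COMMIT;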
Deferred data extraction
• Techniques under deferred data extraction do not capture the changes in real
time; the capture happens later.
Capture Based on Date and Time Stamp
• Every time a source record is created or updated, it may be marked with a
stamp showing the date and time.
• The time stamp provides the basis for selecting records for data extraction.
• If you run your data extraction program at midnight every day, each day you
will extract only those records whose date and time stamp is later than midnight of the
previous day (see the sketch after this list).
Advantages:
• Provided that all the relevant source records contain date and time stamps, data capture
based on the date and time stamp can work for any type of source file.
• This technique captures the latest state of the source data.
Disadvantages:
• Any intermediary states between two data extraction runs are lost.
• Deletion of source records presents a special problem. If a source record gets deleted in
between two extract runs, the information about the delete is not detected. You can get
around this by marking the source record for deletion first, doing the extraction run, and
only then physically deleting the record.
• This means you have to add more logic to the source applications
• This technique works well if the number of revised records is small
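A minimal sketch of such an extraction query, assuming the source table carries a hypothetical LAST_UPDATE_TS column and the cutoff of the previous run is supplied by the extract job:

-- Pick up only the records created or changed since the previous run
SELECT *
  FROM customer
 WHERE last_update_ts > TIMESTAMP '2024-06-01 00:00:00'   -- cutoff from the prior run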
Capture by Comparing Files
• If none of the above techniques is feasible for specific source files in
your environment, use this technique as a last resort.
• It is also called the snapshot differential technique because it compares two
snapshots of the source data.
• While performing today’s data extraction for changes to product data,
you do a full file comparison between today’s copy of the product data
and yesterday’s copy.
• You also compare record keys to find the inserts and deletes, and then you capture any
changes between the two copies (a sketch appears after this list).
Advantages: this may be the only feasible option for some legacy data sources
that do not have transaction logs or time stamps on source records
Disadvantages:
• Though simple and straightforward, comparison of full rows in a large file can be very
inefficient
• This technique necessitates the keeping of prior copies of all the relevant source data
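A sketch of the snapshot differential, assuming yesterday's and today's copies of the product data are staged as tables PRODUCT_PREV and PRODUCT_CURR keyed on PRODUCT_ID (the table and column names are illustrative):

-- Inserts: keys present today but not yesterday
SELECT c.product_id, 'INSERT' AS change_type
  FROM product_curr c
  LEFT JOIN product_prev p ON p.product_id = c.product_id
 WHERE p.product_id IS NULL
UNION ALL
-- Deletes: keys present yesterday but not today
SELECT p.product_id, 'DELETE'
  FROM product_prev p
  LEFT JOIN product_curr c ON c.product_id = p.product_id
 WHERE c.product_id IS NULL
UNION ALL
-- Updates: keys in both snapshots whose non-key columns differ
SELECT c.product_id, 'UPDATE'
  FROM product_curr c
  JOIN product_prev p ON p.product_id = c.product_id
 WHERE c.product_name <> p.product_name
    OR c.unit_price   <> p.unit_price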
DATA TRANSFORMATION
• Format Revisions
• Decoding of Fields
• Calculated and Derived Values
• Splitting of Single Fields
• Merging of Information
• Character Set Conversion
• Conversion of Units of Measurement
• Date/Time Conversion
• Summarization
• Key Restructuring
• Deduplication
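A few of these transformation types can be illustrated with a single staging query; the CUSTOMER_STAGE table and its columns are hypothetical:

SELECT cust_id,
       TRIM(first_name) || ' ' || TRIM(last_name)   AS customer_name,   -- merging of information
       CASE gender_code WHEN 'M' THEN 'Male'
                        WHEN 'F' THEN 'Female'
                        ELSE 'Unknown' END           AS gender,          -- decoding of fields
       COALESCE(country, 'USA')                      AS country,         -- supplying a default value
       weight_lb * 0.4536                            AS weight_kg,       -- conversion of units of measurement
       unit_price * quantity                         AS sale_amount      -- calculated and derived value
  FROM customer_stage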
Entity Identification Problem
• If three separate applications deal with customers, you are likely to have three different
customer files supporting those systems. One system may be the old order entry system,
another the customer service support system, and the third the marketing system.
• A very large number of the customers will be common to all three files.
• You must be able to pull together the activities of a single customer from the various source
systems and then match them up with the single record to be loaded into the data warehouse.
• Vendors, suppliers, employees, and sometimes products are the kinds of entities that are prone to this
type of problem
• You have to design matching algorithms to match records from all three files and form groups of
matching records.
• If the matching criteria are too tight, then some records will escape the groups.
• On the other hand, if the matching criteria are too loose, a particular group may include records of more
than one customer.
• A common approach is to solve the entity identification problem in two phases. In the first phase,
all records, irrespective of whether they are duplicates or not, are assigned unique identifiers.
• The second phase consists of reconciling the duplicates periodically through automatic algorithms and
manual verification (a sketch of such a matching query follows).
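A highly simplified sketch of the matching step, assuming the three customer files have been staged as tables with their own staging identifiers; real implementations use much richer matching rules (name standardization, address parsing, fuzzy matching):

SELECT match_name, postal_code,
       COUNT(*)        AS records_in_group,
       MIN(staging_id) AS proposed_master_id
  FROM (SELECT staging_id, UPPER(TRIM(customer_name)) AS match_name, postal_code
          FROM order_entry_customers
        UNION ALL
        SELECT staging_id, UPPER(TRIM(customer_name)), postal_code
          FROM service_customers
        UNION ALL
        SELECT staging_id, UPPER(TRIM(customer_name)), postal_code
          FROM marketing_customers) all_sources
 GROUP BY match_name, postal_code
HAVING COUNT(*) > 1   -- candidate duplicate groups for reconciliation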
Using Transformation Tools
• The desired goal of using transformation tools is to eliminate manual methods altogether; in
practice, this is not completely possible.
• Even if you get the most sophisticated and comprehensive set of transformation tools, be prepared to use
in-house programs here and there.
• Use of automated tools certainly improves efficiency and accuracy.
• you just have to specify the parameters, the data definitions, and the rules to the transformation tool.
• If your input into the tool is accurate, then the rest of the work is performed efficiently by the tool.
• You gain a major advantage from using a transformation tool because of the recording of metadata by the
tool.
• When you specify the transformation parameters and rules, these are stored as metadata by the tool.
• This metadata then becomes part of the overall metadata component of the data warehouse
• When changes occur to transformation functions because of changes in business rules or data definitions,
you just have to enter the changes into the tool.
Using Manual Techniques
• Manual techniques are adequate for smaller data warehouses.
• In such cases, manually coded programs and scripts perform every data transformation. Mostly, these
programs are executed in the data staging area.
• Analysts and programmers who already possess the knowledge and the expertise are able to produce
the programs and scripts.
• this method involves elaborate coding and testing.
• Although the initial cost may be reasonable, ongoing maintenance may escalate the cost
• Unlike automated tools, the manual method is more likely to be prone to errors.
• It may also turn out that several individual programs are required in your environment.
• A major disadvantage relates to metadata.
• Automated tools record their own metadata, but in-house programs have to be designed differently if
you need to store and use metadata.
• Even if the in-house programs record the data transformation metadata initially, every time changes
occur to transformation rules, the metadata has to be maintained.
• This puts an additional burden on the maintenance of manually coded transformation programs.
DATA LOADING
Load
• If the target table to be loaded already exists and contains data, the load
process wipes out the existing data and applies the data from the incoming file.
• If the table is empty before loading, the load process simply
applies the data from the incoming file.
Append
• If data already exists in the table, the append process adds the incoming data while
preserving the data that is already there.
• Consider a data warehouse for hotel occupancy with four
dimensions, namely (a) Hotel, (b) Room, (c) Time, and (d) Customer, and two
measures, (i) Occupied rooms and (ii) Vacant rooms. Draw the information
package diagram, star schema, and snowflake schema.
Fact Table Sizes
Please study the calculations shown below:
• Time dimension: 5 years × 365 days = 1825
• Store dimension: 300 stores reporting daily sales
• Product dimension: 40,000 products in each store (about 4000 sell in each
store daily)
• Promotion dimension: a sold item may be in only one promotion in a store on
a given day
• Maximum number of base fact table records: 1825 × 300 × 4000 × 1 ≈ 2
billion
Online Analytical Processing
(OLAP)
Need for Multidimensional Analysis
• Multidimensional views are inherently
representative of any business model
• Very few models are limited to three dimensions
or fewer
• Decision makers must be able to analyze data
along any number of dimensions, at any level of
aggregation, with the capability of viewing results
in a variety of ways.
• They must have the ability to drill down and roll
up along the hierarchies of every dimension
• Time is a critical dimension
A Sample Data Cube
Figure: a three-dimensional cube of sales data with dimensions Date (1Qtr–4Qtr), item (TV, PC,
VCR), and Country (U.S.A., Canada, Mexico); the sum cells along each dimension hold aggregates
such as the total annual sales of TVs in the U.S.A.
• We have noted a special method for representing a data model with more than three
dimensions using an MDS
• This method is an intuitive way of showing a
hypercube
Multidimensional analysis
• Drill-down
• Roll-up
• Slice
• Dice
• Pivot
Typical OLAP Operations
• Roll up (drill-up): summarize data
– by climbing up hierarchy or by dimension reduction
• Drill down (roll down): reverse of roll-up
– from higher level summary to lower level summary or detailed data,
or introducing new dimensions
• Slice and dice:
– project and select
• Pivot (rotate):
– reorient the cube for visualization, e.g., turning a 3D cube into a series of 2D planes
• Other operations
– drill across: involving (across) more than one fact table
– drill through: through the bottom level of the cube to its back-end
relational tables (using SQL)
Drill-Down and Roll-Up
• Note the specific attributes of the product
dimension: product name, subcategory, category,
product line, and department
• These attributes signify an ascending hierarchical
sequence from product name to department
• A department includes product lines, a product
line includes categories, a category includes
subcategories, and each subcategory consists of
products with individual product names
The figure shows the rolling up to higher hierarchical levels of aggregation and the drilling
down to lower levels of detail.
Also note the sales numbers shown alongside: these are the sales for one particular store in
one particular month at these levels of aggregation.
The sales numbers you notice as you go down the hierarchy are for a single department, a
single product line, a single category, and so on.
Slice-and-Dice or Rotation
• The slice operation performs a selection on one particular
dimension of a given cube and provides a
new sub-cube
• The dice operation performs a selection on two or more dimensions of a
given cube and provides a new sub-cube
• Here, the slice is performed on the dimension "time" using the criterion time = "Q1"
• The dice operation on the cube involves three dimensions and is based on the following
selection criteria (a sketch of the corresponding query follows this list):
– (location = "Toronto" or "Vancouver")
– (time = "Q1" or "Q2")
– (item = "Mobile" or "Modem")
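Expressed against a hypothetical star schema with a SALES fact table and LOCATION, TIME_DIM, and ITEM dimension tables, the dice corresponds to a query such as:

SELECT l.city, t.quarter, i.item_name, SUM(s.sales_amt) AS total_sales
  FROM sales s
  JOIN location l ON l.location_id = s.location_id
  JOIN time_dim t ON t.time_id     = s.time_id
  JOIN item i     ON i.item_id     = s.item_id
 WHERE l.city      IN ('Toronto', 'Vancouver')   -- location = "Toronto" or "Vancouver"
   AND t.quarter   IN ('Q1', 'Q2')               -- time = "Q1" or "Q2"
   AND i.item_name IN ('Mobile', 'Modem')        -- item = "Mobile" or "Modem"
 GROUP BY l.city, t.quarter, i.item_name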
• The pivot operation is also known as rotation.
• It rotates the data axes in view in order to provide an alternative presentation of the data.
Consider the following diagram that shows the pivot operation.
OLTP Compared With OLAP
• On Line Transaction Processing – OLTP
– Maintains a database that is an accurate model of some
real-world enterprise. Supports day-to-day operations.
Characteristics:
• Short simple transactions
• Relatively frequent updates
• Transactions access only a small fraction of the database
• On Line Analytic Processing – OLAP
– Uses information in database to guide strategic decisions.
Characteristics:
• Complex queries
• Infrequent updates
• Transactions access a large fraction of the database
• Data need not be up-to-date
• OLTP-style transaction:
– John Smith, from Schenectady, N.Y., just bought a box of
tomatoes; charge his account; deliver the tomatoes from
our Schenectady warehouse; decrease our inventory of
tomatoes from that warehouse
• OLAP-style transaction:
– How many cases of tomatoes were sold in all northeast
warehouses in the years 2000 and 2001?
OLAP, Data Mining, and Analysis
• The “A” in OLAP stands for “Analytical”
• Many OLAP and Data Mining applications
involve sophisticated analysis methods from
the fields of mathematics, statistical analysis,
and artificial intelligence
• Our main interest is in the database aspects of
these fields, not the sophisticated analysis
techniques
Example
Fact Tables
A Data Cube
Dimension Tables
• The dimensions of the fact table are further
described with dimension tables
• Fact table:
Sales (Market_id, Product_Id, Time_Id, Sales_Amt)
• Dimension Tables:
Market (Market_Id, City, State, Region)
Product (Product_Id, Name, Category, Price)
Time (Time_Id, Week, Month, Quarter)
Star Schema
• The fact and dimension relations can be
displayed in an E-R diagram, which looks
like a star and is called a star schema
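A sketch of the same star schema as table definitions; the data types are only illustrative, and each dimension key reappears as a foreign key in the fact table:

CREATE TABLE Market  (Market_Id  INT PRIMARY KEY, City VARCHAR(40), State VARCHAR(40), Region VARCHAR(40));
CREATE TABLE Product (Product_Id INT PRIMARY KEY, Name VARCHAR(40), Category VARCHAR(40), Price DECIMAL(9,2));
CREATE TABLE Time    (Time_Id    INT PRIMARY KEY, Week INT, Month INT, Quarter INT);  -- "Time" may need quoting in some DBMSs
CREATE TABLE Sales   (Market_Id  INT REFERENCES Market(Market_Id),
                      Product_Id INT REFERENCES Product(Product_Id),
                      Time_Id    INT REFERENCES Time(Time_Id),
                      Sales_Amt  DECIMAL(12,2));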
Aggregation
• Many OLAP queries involve aggregation of the
data in the fact table
• For example, to find the total sales (over time) of
each product in each market, we might use
SELECT S.Market_Id, S.Product_Id, SUM (S.Sales_Amt)
FROM Sales S
GROUP BY S.Market_Id, S.Product_Id
• The aggregation is over the entire time dimension
and thus produces a two-dimensional view of the
data
Aggregation over Time
• The output of the previous query
                         Market_Id
SUM(Sales_Amt)      M1      M2      M3     M4
              P1    3003    1503    ...    ...
Product_Id    P2    6003    2402    ...    ...
              P3    4503    3       ...    ...
              P4    7503    7000    ...    ...
              P5    ...     ...     ...    ...
Drilling Down and Rolling Up
• Some dimension tables form an aggregation hierarchy
Market_Id → City → State → Region
• Executing a series of queries that moves down a
hierarchy (e.g., from aggregation over regions to that
over states) is called drilling down
– Requires the use of the fact table or information more specific
than the requested aggregation (e.g., cities)
• Executing a series of queries that moves up the
hierarchy (e.g., from states to regions) is called rolling
up
Drilling Down
• Drilling down on market: from Region to State
Sales (Market_Id, Product_Id, Time_Id, Sales_Amt)
Market (Market_Id, City, State, Region)
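A sketch of the drill-down query: join the fact table to the Market dimension and group on the more detailed State attribute (the pattern follows the earlier aggregation query):

SELECT M.State, S.Product_Id, SUM (S.Sales_Amt)
FROM Sales S, Market M
WHERE M.Market_Id = S.Market_Id
GROUP BY M.State, S.Product_Id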
Rolling Up
• Rolling up on market, from State to Region
– If we have already created a table, State_Sales, that holds sales aggregated by
state, the roll-up can reuse that smaller table instead of the full fact table (a sketch follows)
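A sketch of the roll-up, assuming State_Sales(State, Product_Id, Sales_Amt) was built with a query like the drill-down above; the DISTINCT subquery avoids double counting, because Market contains one row per market and therefore many rows per state:

SELECT R.Region, T.Product_Id, SUM (T.Sales_Amt)
FROM State_Sales T,
     (SELECT DISTINCT State, Region FROM Market) R
WHERE R.State = T.State
GROUP BY R.Region, T.Product_Id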
Pivoting
• When we view the data as a multi-dimensional
cube and group on a subset of the axes, we are
said to be performing a pivot on those axes
– Pivoting on dimensions D1,…,Dk in a data cube
D1,…,Dk,Dk+1,…,Dn means that we use GROUP BY
A1,…,Ak and aggregate over Ak+1,…An, where Ai is an
attribute of the dimension Di
– Example: Pivoting on Product and Time corresponds to
grouping on Product_id and Quarter and aggregating
Sales_Amt over Market_id:
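A sketch of that query against the Sales and Time tables defined earlier (aggregation over Market_Id happens implicitly because it is not in the GROUP BY):

SELECT S.Product_Id, T.Quarter, SUM (S.Sales_Amt)
FROM Sales S, Time T
WHERE T.Time_Id = S.Time_Id
GROUP BY S.Product_Id, T.Quarter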
Slicing-and-Dicing
• When we use WHERE to specify a particular
value for an axis (or several axes), we are
performing a slice
– Slicing the data cube in the Time dimension
(choosing sales only in week 12) and then pivoting on
Product_id (aggregating over Market_id)

SELECT S.Product_Id, SUM (S.Sales_Amt)
FROM Sales S, Time T
WHERE T.Time_Id = S.Time_Id AND T.Week = 12   -- Slice: restrict to week 12 (week encoding illustrative)
GROUP BY S.Product_Id                         -- Pivot: aggregate over Market_Id
The CUBE Operator
• To construct the following table would take three
queries (shown below)
                         Market_Id
SUM(Sales_Amt)      M1      M2      M3     Total
              P1    3003    1503    ...    ...
Product_Id    P2    6003    2402    ...    ...
              P3    4503    3       ...    ...
              P4    7503    7000    ...    ...
           Total    ...     ...     ...    ...
The Three Queries
• For the table entries, without the totals (aggregation on time)
SELECT S.Market_Id, S.Product_Id, SUM (S.Sales_Amt)
FROM Sales S
GROUP BY S.Market_Id, S.Product_Id
• For the row totals (aggregation on time and supermarkets)
SELECT S.Product_Id, SUM (S.Sales_Amt)
FROM Sales S
GROUP BY S.Product_Id
• For the column totals (aggregation on time and products)
SELECT S.Market_Id, SUM (S.Sales_Amt)
FROM Sales S
GROUP BY S.Market_Id
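With the SQL:1999 CUBE grouping (supported by most current relational DBMSs), the same result, including the row and column totals, can be produced by a single query; the total rows and columns appear with NULL in the rolled-up attribute:

SELECT S.Market_Id, S.Product_Id, SUM (S.Sales_Amt)
FROM Sales S
GROUP BY CUBE (S.Market_Id, S.Product_Id)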
OLAP MODELS
• ROLAP (relational online analytical processing)
• MOLAP (multidimensional online analytical processing)
• DOLAP (desktop online analytical processing), meant to
provide portability to the user
• HOLAP (hybrid online analytical processing), a model that
attempts to combine the strengths and features of both
MOLAP and ROLAP
• Database OLAP refers to a relational database
management system designed to support OLAP
structures and perform OLAP operations
• Web OLAP refers to online analytical processing where
OLAP data is accessible from a web browser
Relational Online Analytical Processing (ROLAP):
• ROLAP is used for large data volumes; the data is stored in
relational tables, and a dynamic multidimensional view of the
data is created at query time.
Multidimensional Online Analytical Processing (MOLAP):
• MOLAP is used for limited data volumes; the data is stored in
multidimensional arrays (cubes), and a static, precomputed
multidimensional view of the data is created.
Applications
• OLAP reporting systems are widely used in business
applications such as:
• Sales and Marketing
• Retail Industry
• Financial Organizations – Budgeting
• Agriculture
• People Management
• Process Management
• Examples of OLAP products are Essbase from
Hyperion Solutions and Express Server from
Oracle.