Module 2
ETL Process
Most Important and Most Challenging
• It is not uncommon for a project team to spend as much as 50–70% of the project
effort on ETL functions
• Data extraction can be quite involved, depending on the nature and complexity of the
source systems.
• Metadata on the source systems must contain information on every database and
every data structure that is needed from the source systems.
• Data transformation involves reformatting internal data structures, resequencing
data, applying various forms of conversion techniques, supplying default values
wherever values are missing, and so on.
• The sheer size of the initial load can populate millions of rows in the
data warehouse database.
• It may take two or more weeks to complete the initial physical load.
ETL Requirements and Steps
DATA EXTRACTION
Two categories:
Current Value. (most of the attributes) The value of an
attribute remains constant only until a business transaction
changes it. Data extraction for preserving the history of the
changes in the data warehouse gets quite involved for this
category of data.
Periodic Status. (not as common as the previous category)
The history of the changes is preserved in the source
systems themselves. Therefore, data extraction is relatively
easier.
Data Extraction Techniques
Disadvantages of capture through transaction logs:
• You need to ensure that all log transactions are extracted for data warehouse updates.
• If all of your source systems are database applications, there is no problem with this
technique. However, if some of your source data is in indexed or other flat files, this
option will not work for those cases, as there are no log files for these non-database
applications.
• Data replication is simply a method for creating copies of data in a distributed environment.
• The figure illustrates how replication technology can be used to capture changes to source data.
• The appropriate transaction logs contain all the changes to the various source database tables.
• Here are the broad steps for using replication to capture changes to source data:
1. Identify the source system database table
2. Identify and define target files in the staging area
3. Create mapping between the source table and target files
4. Define the replication mode
5. Schedule the replication process
6. Capture the changes from the transaction logs
7. Transfer captured data from logs to target files
8. Verify transfer of data changes
9. Confirm success or failure of replication
10. In metadata, document the outcome of replication
11. Maintain definitions of sources, targets, and mappings
• Capture through Database Triggers. This option is applicable to source systems that are
database applications.
• Triggers are special stored procedures (programs) that are stored in the database and fired
when certain predefined events occur.
• You can create trigger programs for all events for which you need data to be captured.
• The output of the trigger programs is written to a separate file that will be used to extract data for
the data warehouse. For example, if you need to capture all changes to the records in the
customer table, write a trigger program to capture all updates and deletes in that table (a
sketch appears after this list).
• Advantages:
• Data capture through database triggers occurs right at the source and is therefore quite reliable
• You can capture both before and after images
• Disadvantages:
• building and maintaining trigger programs puts an additional burden on the development effort
• Also, execution of trigger procedures during transaction processing of the source systems puts
additional overhead on the source systems
• This option is applicable only for source data in databases
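As an illustration, here is a minimal sketch of such a trigger in Oracle-style PL/SQL; the CUSTOMER table, the CUSTOMER_CHANGES capture table, and the column names are hypothetical, and the exact trigger syntax varies by DBMS.

CREATE OR REPLACE TRIGGER customer_change_capture
AFTER UPDATE OR DELETE ON customer
FOR EACH ROW
BEGIN
  -- Write the before image, the type of change, and a time stamp
  -- to a separate table read later by the warehouse extract programs
  INSERT INTO customer_changes
         (customer_id, change_type, changed_at, old_name, old_address)
  VALUES (:OLD.customer_id,
          CASE WHEN DELETING THEN 'DELETE' ELSE 'UPDATE' END,
          SYSDATE,
          :OLD.customer_name,
          :OLD.address);
END;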
• Capture in Source Applications. This technique is also referred to as application-assisted
data capture.
• The source application is made to assist in the data capture for the data warehouse.
• You have to modify the relevant application programs that write to the source files and
databases.
• You revise the programs so that, in addition to applying all adds, updates, and deletes to the
source files and database tables, they also record these changes in a separate file.
• Other extract programs can then use this separate file containing the changes to the source
data (a sketch follows this list).
Advantages: Unlike the previous two cases, this technique may be used for all types of source
data, irrespective of whether it is in databases, indexed files, or other flat files.
Disadvantages: You have to revise the programs in the source operational systems and
keep them maintained.
• This could be a formidable task if the number of source system programs is large.
• Also, this technique may degrade the performance of the source applications because of the additional
processing needed to capture the changes on separate files.
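As a minimal sketch of the idea (the tables, columns, and values are purely illustrative), the revised application issues an extra write to a change table inside the same unit of work as its normal update:

-- Normal work of the source application
UPDATE customer
   SET address = '12 New Street', city = 'Albany'
 WHERE customer_id = 1001;

-- Extra statement added for application-assisted data capture:
-- record the change for the warehouse extract programs
INSERT INTO customer_changes (customer_id, change_type, changed_at)
VALUES (1001, 'UPDATE', CURRENT_TIMESTAMP);

COMMIT;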
Deferred data extraction
• Techniques under deferred data extraction do not capture the changes in real
time; the capture happens later.
Capture Based on Date and Time Stamp
• Every time a source record is created or updated, it may be marked with a
stamp showing the date and time.
• The time stamp provides the basis for selecting records for data extraction.
• If you run your data extraction program at midnight every day, each day you
will extract only those records whose date and time stamp is later than midnight of the
previous day (see the sketch after this list).
Advantages:
• Provided that all the relevant source records contain date and time stamps, data capture
based on the date and time stamp can work for any type of source file.
• This technique captures the latest state of the source data.
Disadvantages:
• Any intermediary states between two data extraction runs are lost.
• Deletion of source records presents a special problem. If a source record gets deleted in
between two extract runs, the information about the delete is not detected. You can get
around this by marking the source record for deletion first, doing the extraction run, and
only then physically deleting the record.
• This means you have to add more logic to the source applications
• This technique works well if the number of revised records is small
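A minimal sketch of such an extraction query, assuming the source table carries a hypothetical LAST_UPDATE_TS column and the cutoff of the previous run is supplied by the extract job:

-- Pick up only the records created or changed since the previous run
SELECT *
  FROM customer
 WHERE last_update_ts > TIMESTAMP '2024-06-01 00:00:00'   -- cutoff from the prior run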
Capture by Comparing Files
• If none of the above techniques is feasible for specific source files in
your environment, use this technique as a last resort.
• It is also called the snapshot differential technique because it compares two
snapshots of the source data.
• While performing today’s data extraction for changes to product data,
you do a full file comparison between today’s copy of the product data
and yesterday’s copy.
• You also compare record keys to find the inserts and deletes, and then you capture any
changes between the two copies (a sketch appears after this list).
Advantages: this may be the only feasible option for some legacy data sources
that do not have transaction logs or time stamps on source records
Disadvantages:
• Though simple and straightforward, comparison of full rows in a large file can be very
inefficient
• This technique necessitates the keeping of prior copies of all the relevant source data
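A sketch of the snapshot differential, assuming yesterday's and today's copies of the product data are staged as tables PRODUCT_PREV and PRODUCT_CURR keyed on PRODUCT_ID (the table and column names are illustrative):

-- Inserts: keys present today but not yesterday
SELECT c.product_id, 'INSERT' AS change_type
  FROM product_curr c
  LEFT JOIN product_prev p ON p.product_id = c.product_id
 WHERE p.product_id IS NULL
UNION ALL
-- Deletes: keys present yesterday but not today
SELECT p.product_id, 'DELETE'
  FROM product_prev p
  LEFT JOIN product_curr c ON c.product_id = p.product_id
 WHERE c.product_id IS NULL
UNION ALL
-- Updates: keys in both snapshots whose non-key columns differ
SELECT c.product_id, 'UPDATE'
  FROM product_curr c
  JOIN product_prev p ON p.product_id = c.product_id
 WHERE c.product_name <> p.product_name
    OR c.unit_price   <> p.unit_price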
DATA TRANSFORMATION
• Format Revisions
• Decoding of Fields
• Calculated and Derived Values
• Splitting of Single Fields
• Merging of Information
• Character Set Conversion
• Conversion of Units of Measurement
• Date/Time Conversion
• Summarization
• Key Restructuring
• Deduplication
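A few of these transformation types can be illustrated with a single staging query; the CUSTOMER_STAGE table and its columns are hypothetical:

SELECT cust_id,
       TRIM(first_name) || ' ' || TRIM(last_name)   AS customer_name,   -- merging of information
       CASE gender_code WHEN 'M' THEN 'Male'
                        WHEN 'F' THEN 'Female'
                        ELSE 'Unknown' END           AS gender,          -- decoding of fields
       COALESCE(country, 'USA')                      AS country,         -- supplying a default value
       weight_lb * 0.4536                            AS weight_kg,       -- conversion of units of measurement
       unit_price * quantity                         AS sale_amount      -- calculated and derived value
  FROM customer_stage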
Entity Identification Problem
• If three separate applications deal with customers, you are likely to have three different
customer files supporting those systems. One system may be the old order entry system,
another the customer service support system, and the third the marketing system.
• A very large number of the customers will be common to all three files.
• You must be able to pull together the activities of a single customer from the various source
systems and then match them up with the single record to be loaded into the data warehouse.
• Vendors, suppliers, employees, and sometimes products are the kinds of entities that are prone to this
type of problem
• You have to design matching algorithms to match records from all three files and form groups of
matching records.
• If the matching criteria are too tight, then some records will escape the groups.
• On the other hand, if the matching criteria are too loose, a particular group may include records of more
than one customer.
• A common approach is to solve the entity identification problem in two phases. In the first phase,
all records, irrespective of whether they are duplicates or not, are assigned unique identifiers.
• The second phase consists of reconciling the duplicates periodically through automatic algorithms and
manual verification (a sketch of such a matching query follows).
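A highly simplified sketch of the matching step, assuming the three customer files have been staged as tables with their own staging identifiers; real implementations use much richer matching rules (name standardization, address parsing, fuzzy matching):

SELECT match_name, postal_code,
       COUNT(*)        AS records_in_group,
       MIN(staging_id) AS proposed_master_id
  FROM (SELECT staging_id, UPPER(TRIM(customer_name)) AS match_name, postal_code
          FROM order_entry_customers
        UNION ALL
        SELECT staging_id, UPPER(TRIM(customer_name)), postal_code
          FROM service_customers
        UNION ALL
        SELECT staging_id, UPPER(TRIM(customer_name)), postal_code
          FROM marketing_customers) all_sources
 GROUP BY match_name, postal_code
HAVING COUNT(*) > 1   -- candidate duplicate groups for reconciliation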
Using Transformation Tools
• The desired goal of using transformation tools is to eliminate manual methods altogether; in
practice, this is not completely possible.
• Even if you get the most sophisticated and comprehensive set of transformation tools, be prepared to use
in-house programs here and there.
• Use of automated tools certainly improves efficiency and accuracy.
• you just have to specify the parameters, the data definitions, and the rules to the transformation tool.
• If your input into the tool is accurate, then the rest of the work is performed efficiently by the tool.
• You gain a major advantage from using a transformation tool because of the recording of metadata by the
tool.
• When you specify the transformation parameters and rules, these are stored as metadata by the tool.
• This metadata then becomes part of the overall metadata component of the data warehouse
• When changes occur to transformation functions because of changes in business rules or data definitions,
you just have to enter the changes into the tool.
Using Manual Techniques
• Manual techniques are adequate for smaller data warehouses.
• In such cases, manually coded programs and scripts perform every data transformation. Mostly, these
programs are executed in the data staging area.
• Analysts and programmers who already possess the knowledge and the expertise are able to produce
the programs and scripts.
• this method involves elaborate coding and testing.
• Although the initial cost may be reasonable, ongoing maintenance may escalate the cost
• Unlike automated tools, the manual method is more likely to be prone to errors.
• It may also turn out that several individual programs are required in your environment.
• A major disadvantage relates to metadata.
• Automated tools record their own metadata, but in-house programs have to be designed differently if
you need to store and use metadata.
• Even if the in-house programs record the data transformation metadata initially, every time changes
occur to transformation rules, the metadata has to be maintained.
• This puts an additional burden on the maintenance of manually coded transformation programs.
DATA LOADING
Load
• If the target table to be loaded already exists and contains data, the load
process wipes out the existing data and applies the data from the incoming file.
• If the table is empty before loading, the load process simply
applies the data from the incoming file.
Append
• If data already exists in the table, the append process adds the incoming data while
preserving the data that is already there.
• Consider a data warehouse for hotel occupancy with four
dimensions, namely (a) Hotel, (b) Room, (c) Time, and (d) Customer, and two
measures, (i) Occupied rooms and (ii) Vacant rooms. Draw the information
package diagram, star schema, and snowflake schema.
Fact Table Sizes
Please study the calculations shown below:
• Time dimension: 5 years × 365 days = 1825
• Store dimension: 300 stores reporting daily sales
• Product dimension: 40,000 products in each store (about 4000 sell in each
store daily)
• Promotion dimension: a sold item may be in only one promotion in a store on
a given day
• Maximum number of base fact table records: 1825 × 300 × 4000 × 1 ≈ 2
billion
Online Analytical Processing
(OLAP)
Need for Multidimensional Analysis
• Multidimensional views are inherently
representative of any business model
• Very few models are limited to three dimensions
or fewer
• Decision makers must be able to analyze data
along any number of dimensions, at any level of
aggregation, with the capability of viewing results
in a variety of ways.
• They must have the ability to drill down and roll
up along the hierarchies of every dimension
• Time is a critical dimension
A Sample Data Cube
Figure: a three-dimensional cube of sales data with dimensions Date (1Qtr–4Qtr), item (TV, PC,
VCR), and Country (U.S.A., Canada, Mexico); the sum cells along each dimension hold aggregates
such as the total annual sales of TVs in the U.S.A.
• We have noted a special method for representing a data model with more than three
dimensions using an MDS
• This method is an intuitive way of showing a
hypercube
Multidimensional analysis
• Drill-down
• Roll-up
• Slice
• Dice
• Pivot
Typical OLAP Operations
• Roll up (drill-up): summarize data
– by climbing up hierarchy or by dimension reduction
• Drill down (roll down): reverse of roll-up
– from higher level summary to lower level summary or detailed data,
or introducing new dimensions
• Slice and dice:
– project and select
• Pivot (rotate):
– reorient the cube for visualization, e.g., turning a 3D cube into a series of 2D planes
• Other operations
– drill across: involving (across) more than one fact table
– drill through: through the bottom level of the cube to its back-end
relational tables (using SQL)
Drill-Down and Roll-Up
• Note the specific attributes of the product
dimension: product name, subcategory, category,
product line, and department
• These attributes signify an ascending hierarchical
sequence from product name to department
• A department includes product lines, a product
line includes categories, a category includes
subcategories, and each subcategory consists of
products with individual product names
The figure shows the rolling up to higher hierarchical levels of aggregation and the drilling
down to lower levels of detail.
Also note the sales numbers shown alongside: these are the sales for one particular store in
one particular month at these levels of aggregation.
The sales numbers you notice as you go down the hierarchy are for a single department, a
single product line, a single category, and so on.
Slice-and-Dice or Rotation
• The slice operation performs a selection on one particular
dimension of a given cube and provides a
new sub-cube
• The dice operation performs a selection on two or more dimensions of a
given cube and provides a new sub-cube
• Here, the slice is performed on the dimension "time" using the criterion time = "Q1"
• The dice operation on the cube involves three dimensions and is based on the following
selection criteria (a sketch of the corresponding query follows this list):
– (location = "Toronto" or "Vancouver")
– (time = "Q1" or "Q2")
– (item = "Mobile" or "Modem")
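Expressed against a hypothetical star schema with a SALES fact table and LOCATION, TIME_DIM, and ITEM dimension tables, the dice corresponds to a query such as:

SELECT l.city, t.quarter, i.item_name, SUM(s.sales_amt) AS total_sales
  FROM sales s
  JOIN location l ON l.location_id = s.location_id
  JOIN time_dim t ON t.time_id     = s.time_id
  JOIN item i     ON i.item_id     = s.item_id
 WHERE l.city      IN ('Toronto', 'Vancouver')   -- location = "Toronto" or "Vancouver"
   AND t.quarter   IN ('Q1', 'Q2')               -- time = "Q1" or "Q2"
   AND i.item_name IN ('Mobile', 'Modem')        -- item = "Mobile" or "Modem"
 GROUP BY l.city, t.quarter, i.item_name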
• The pivot operation is also known as rotation.
• It rotates the data axes in view in order to provide an alternative presentation of the data.
Consider the following diagram that shows the pivot operation.
OLTP Compared With OLAP
• On Line Transaction Processing – OLTP
– Maintains a database that is an accurate model of some
real-world enterprise. Supports day-to-day operations.
Characteristics:
• Short simple transactions
• Relatively frequent updates
• Transactions access only a small fraction of the database
• On Line Analytic Processing – OLAP
– Uses information in database to guide strategic decisions.
Characteristics:
• Complex queries
• Infrequent updates
• Transactions access a large fraction of the database
• Data need not be up-to-date
• OLTP-style transaction:
– John Smith, from Schenectady, N.Y., just bought a box of
tomatoes; charge his account; deliver the tomatoes from
our Schenectady warehouse; decrease our inventory of
tomatoes from that warehouse
• OLAP-style transaction:
– How many cases of tomatoes were sold in all northeast
warehouses in the years 2000 and 2001?
OLAP, Data Mining, and Analysis
• The “A” in OLAP stands for “Analytical”
• Many OLAP and Data Mining applications
involve sophisticated analysis methods from
the fields of mathematics, statistical analysis,
and artificial intelligence
• Our main interest is in the database aspects of
these fields, not the sophisticated analysis
techniques
Example
Fact Tables
A Data Cube
Dimension Tables
• The dimensions of the fact table are further
described with dimension tables
• Fact table:
Sales (Market_id, Product_Id, Time_Id, Sales_Amt)
• Dimension Tables:
Market (Market_Id, City, State, Region)
Product (Product_Id, Name, Category, Price)
Time (Time_Id, Week, Month, Quarter)
Star Schema
• The fact and dimension relations can be
displayed in an E-R diagram, which looks
like a star and is called a star schema
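A sketch of the same star schema as table definitions; the data types are only illustrative, and each dimension key reappears as a foreign key in the fact table:

CREATE TABLE Market  (Market_Id  INT PRIMARY KEY, City VARCHAR(40), State VARCHAR(40), Region VARCHAR(40));
CREATE TABLE Product (Product_Id INT PRIMARY KEY, Name VARCHAR(40), Category VARCHAR(40), Price DECIMAL(9,2));
CREATE TABLE Time    (Time_Id    INT PRIMARY KEY, Week INT, Month INT, Quarter INT);  -- "Time" may need quoting in some DBMSs
CREATE TABLE Sales   (Market_Id  INT REFERENCES Market(Market_Id),
                      Product_Id INT REFERENCES Product(Product_Id),
                      Time_Id    INT REFERENCES Time(Time_Id),
                      Sales_Amt  DECIMAL(12,2));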
Aggregation
• Many OLAP queries involve aggregation of the
data in the fact table
• For example, to find the total sales (over time) of
each product in each market, we might use
SELECT S.Market_Id, S.Product_Id, SUM (S.Sales_Amt)
FROM Sales S
GROUP BY S.Market_Id, S.Product_Id
• The aggregation is over the entire time dimension
and thus produces a two-dimensional view of the
data
Aggregation over Time
• The output of the previous query
                         Market_Id
SUM(Sales_Amt)      M1      M2      M3     M4
              P1    3003    1503    ...    ...
Product_Id    P2    6003    2402    ...    ...
              P3    4503    3       ...    ...
              P4    7503    7000    ...    ...
              P5    ...     ...     ...    ...
Drilling Down and Rolling Up
• Some dimension tables form an aggregation hierarchy
Market_Id → City → State → Region
• Executing a series of queries that moves down a
hierarchy (e.g., from aggregation over regions to that
over states) is called drilling down
– Requires the use of the fact table or information more specific
than the requested aggregation (e.g., cities)
• Executing a series of queries that moves up the
hierarchy (e.g., from states to regions) is called rolling
up
Drilling Down
• Drilling down on market: from Region to State
Sales (Market_Id, Product_Id, Time_Id, Sales_Amt)
Market (Market_Id, City, State, Region)
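A sketch of the drill-down query: join the fact table to the Market dimension and group on the more detailed State attribute (the pattern follows the earlier aggregation query):

SELECT M.State, S.Product_Id, SUM (S.Sales_Amt)
FROM Sales S, Market M
WHERE M.Market_Id = S.Market_Id
GROUP BY M.State, S.Product_Id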
Rolling Up
• Rolling up on market, from State to Region
– If we have already created a table, State_Sales, that holds sales aggregated by
state, the roll-up can reuse that smaller table instead of the full fact table (a sketch follows)
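A sketch of the roll-up, assuming State_Sales(State, Product_Id, Sales_Amt) was built with a query like the drill-down above; the DISTINCT subquery avoids double counting, because Market contains one row per market and therefore many rows per state:

SELECT R.Region, T.Product_Id, SUM (T.Sales_Amt)
FROM State_Sales T,
     (SELECT DISTINCT State, Region FROM Market) R
WHERE R.State = T.State
GROUP BY R.Region, T.Product_Id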
Pivoting
• When we view the data as a multi-dimensional
cube and group on a subset of the axes, we are
said to be performing a pivot on those axes
– Pivoting on dimensions D1,…,Dk in a data cube
D1,…,Dk,Dk+1,…,Dn means that we use GROUP BY
A1,…,Ak and aggregate over Ak+1,…An, where Ai is an
attribute of the dimension Di
– Example: Pivoting on Product and Time corresponds to
grouping on Product_id and Quarter and aggregating
Sales_Amt over Market_id:
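A sketch of that query against the Sales and Time tables defined earlier (aggregation over Market_Id happens implicitly because it is not in the GROUP BY):

SELECT S.Product_Id, T.Quarter, SUM (S.Sales_Amt)
FROM Sales S, Time T
WHERE T.Time_Id = S.Time_Id
GROUP BY S.Product_Id, T.Quarter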
Slicing-and-Dicing
• When we use WHERE to specify a particular
value for an axis (or several axes), we are
performing a slice
– Slicing the data cube in the Time dimension
(choosing sales only in week 12) and then pivoting on
Product_id (aggregating over Market_id)

SELECT S.Product_Id, SUM (S.Sales_Amt)
FROM Sales S, Time T
WHERE T.Time_Id = S.Time_Id AND T.Week = 12   -- Slice: restrict to week 12 (week encoding illustrative)
GROUP BY S.Product_Id                         -- Pivot: aggregate over Market_Id
The CUBE Operator
• To construct the following table would take three
queries (shown below)
                         Market_Id
SUM(Sales_Amt)      M1      M2      M3     Total
              P1    3003    1503    ...    ...
Product_Id    P2    6003    2402    ...    ...
              P3    4503    3       ...    ...
              P4    7503    7000    ...    ...
           Total    ...     ...     ...    ...
The Three Queries
• For the table entries, without the totals (aggregation on time)
SELECT S.Market_Id, S.Product_Id, SUM (S.Sales_Amt)
FROM Sales S
GROUP BY S.Market_Id, S.Product_Id
• For the row totals (aggregation on time and supermarkets)
SELECT S.Product_Id, SUM (S.Sales_Amt)
FROM Sales S
GROUP BY S.Product_Id
• For the column totals (aggregation on time and products)
SELECT S.Market_Id, SUM (S.Sales_Amt)
FROM Sales S
GROUP BY S.Market_Id
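With the SQL:1999 CUBE grouping (supported by most current relational DBMSs), the same result, including the row and column totals, can be produced by a single query; the total rows and columns appear with NULL in the rolled-up attribute:

SELECT S.Market_Id, S.Product_Id, SUM (S.Sales_Amt)
FROM Sales S
GROUP BY CUBE (S.Market_Id, S.Product_Id)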
OLAP MODELS
• ROLAP (relational online analytical processing)
• MOLAP (multidimensional online analytical processing)
• DOLAP (desktop online analytical processing), meant to
provide portability to the user
• HOLAP (hybrid online analytical processing), a model that
attempts to combine the strengths and features of both
MOLAP and ROLAP
• Database OLAP refers to a relational database
management system designed to support OLAP
structures and perform OLAP operations
• Web OLAP refers to online analytical processing where
OLAP data is accessible from a web browser
Relational Online Analytical Processing (ROLAP):
• ROLAP is used for large data volumes; the data is stored in
relational tables, and a dynamic multidimensional view of the
data is created at query time.
Multidimensional Online Analytical Processing (MOLAP):
• MOLAP is used for limited data volumes; the data is stored in
multidimensional arrays (cubes), and a static, precomputed
multidimensional view of the data is created.
Applications
• OLAP reporting systems are widely used in business
applications such as:
• Sales and Marketing
• Retail Industry
• Financial Organizations – Budgeting
• Agriculture
• People Management
• Process Management
• Examples of OLAP products are Essbase from
Hyperion Solutions and Express Server from
Oracle.