
Module 2

ETL Process
Most Important and Most Challenging

• You must perform all three functions of ETL to successfully transform data into information.
• ETL is challenging primarily because of the nature of the source systems.
• Most of the challenges in ETL arise from disparities among the source operational systems.
Time-Consuming and Arduous

• It is not uncommon for a project team to spend as much as 50–70% of the project effort on ETL functions.
• Data extraction can be quite involved, depending on the nature and complexity of the source systems.
• Metadata on the source systems must contain information on every database and every data structure that is needed from the source systems.
• Data transformation involves reformatting internal data structures, resequencing data, applying various forms of conversion techniques, supplying default values wherever values are missing, and so on.
• Because of its sheer massive size, the initial load can populate millions of rows in the data warehouse database.
• It may take two or more weeks to complete the initial physical loading.
ETL Requirements and Steps
DATA EXTRACTION

List of data extraction issues:

• Source identification—identify source applications and source structures.
• Method of extraction—for each data source, define whether the extraction process is manual or tool-based.
• Extraction frequency—for each data source, establish how frequently the data extraction must be done—daily, weekly, quarterly, and so on.
• Time window—for each data source, denote the time window for the extraction process.
• Job sequencing—determine whether the beginning of one job in an extraction job stream has to wait until the previous job has finished successfully.
• Exception handling—determine how to handle input records that cannot be extracted.
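Purely as an illustration of how these decisions might be recorded, here is a sketch of a hypothetical extraction control table; the table name, columns, and value conventions are all assumptions, not part of the original material:

-- One row per data source, capturing the extraction decisions listed above.
CREATE TABLE extract_control (
    source_name        VARCHAR(50),    -- source application / structure
    extraction_method  VARCHAR(20),    -- 'manual' or 'tool-based'
    frequency          VARCHAR(20),    -- 'daily', 'weekly', 'quarterly', ...
    window_start       TIME,           -- time window for the extraction
    window_end         TIME,
    predecessor_job    VARCHAR(50),    -- job sequencing dependency
    exception_action   VARCHAR(100)    -- how unextractable records are handled
);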
Data in Operational Systems

Two categories:
 Current Value (most of the attributes). The value of an attribute remains constant only until a business transaction changes it. Data extraction for preserving the history of the changes in the data warehouse gets quite involved for this category of data.
 Periodic Status (not as common as the previous category). The history of the changes is preserved in the source systems themselves. Therefore, data extraction is relatively easier.
Data Extraction Techniques

• Immediate Data Extraction
  • Capture through Transaction Logs
  • Capture through Database Triggers
  • Capture in Source Applications

• Deferred Data Extraction
  • Capture Based on Date and Time Stamp
  • Capture by Comparing Files
Immediate Data Extraction

• Data extraction is real time: it occurs as transactions happen at the source databases and files.
• Capture through Transaction Logs: uses the transaction logs of the DBMSs, maintained for recovery from possible failures.
• As each transaction adds, updates, or deletes a row in a database table, the DBMS immediately writes entries to the log file.
Advantages:
• There is no extra overhead in the operational systems because logging is already part of transaction processing.

Disadvantages:
• You need to ensure that all log transactions are extracted for data warehouse updates.
• If all of your source systems are database applications, there is no problem with this technique. However, if some of your source data is in indexed or other flat files, this option will not work for those cases, as there are no log files for these nondatabase applications.
• Data replication is simply a method for creating copies of data in a distributed environment.
• The figure illustrates how replication technology can be used to capture changes to source data.
• The appropriate transaction logs contain all the changes to the various source database tables.
• Here are the broad steps for using replication to capture changes to source data:
 Identify the source system database table
 Identify and define target files in the staging area
 Create mapping between the source table and target files
 Define the replication mode
 Schedule the replication process
 Capture the changes from the transaction logs
 Transfer captured data from logs to target files
 Verify transfer of data changes
 Confirm success or failure of replication
 In metadata, document the outcome of replication
 Maintain definitions of sources, targets, and mappings
• Capture through Database Triggers: this option is applicable to source systems that are database applications.
• Triggers are special stored procedures (programs) that are stored in the database and fired when certain predefined events occur.
• You can create trigger programs for all events for which you need data to be captured.
• The output of the trigger programs is written to a separate file that will be used to extract data for the data warehouse. For example, if you need to capture all changes to the records in the customer table, write a trigger program to capture all updates and deletes in that table (see the sketch after this list).
• Advantages:
• Data capture through database triggers occurs right at the source and is therefore quite reliable.
• You can capture both before and after images.
• Disadvantages:
• Building and maintaining trigger programs puts an additional burden on the development effort.
• Also, execution of trigger procedures during transaction processing of the source systems puts additional overhead on the source systems.
• This option is applicable only for source data in databases.
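A minimal sketch of such a trigger is shown below. The customer and customer_changes table names and their columns are assumptions, and trigger syntax varies by DBMS; this sketch follows MySQL-style OLD/NEW row references.

-- Capture updates to the customer table into a separate change table,
-- recording both the before image and the after image of the changed column.
-- All object names here are illustrative.
CREATE TRIGGER customer_update_capture
AFTER UPDATE ON customer
FOR EACH ROW
    INSERT INTO customer_changes
        (customer_id, old_name, new_name, change_type, change_ts)
    VALUES
        (OLD.customer_id, OLD.customer_name, NEW.customer_name, 'U', CURRENT_TIMESTAMP);

A similar trigger on DELETE would write a 'D' row, so that deletes are also captured for the warehouse.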
• Capture in Source Applications: this technique is also referred to as application-assisted data capture.
• The source application is made to assist in the data capture for the data warehouse.
• You have to modify the relevant application programs that write to the source files and databases.
• You revise the programs so that all adds, updates, and deletes to the source files and database tables are also recorded in a separate file.
• Other extract programs can then use the separate file containing the changes to the source data.
Advantages: Unlike the previous two cases, this technique may be used for all types of source data, irrespective of whether it is in databases, indexed files, or other flat files.
Disadvantages: You have to revise the programs in the source operational systems and keep them maintained.
• This could be a formidable task if the number of source system programs is large.
• Also, this technique may degrade the performance of the source applications because of the additional processing needed to capture the changes on separate files.
Deferred data extraction
• Techniques under deferred data extraction do not capture the changes in real time; the capture happens later.
Capture Based on Date and Time Stamp
• Every time a source record is created or updated, it may be marked with a stamp showing the date and time.
• The time stamp provides the basis for selecting records for data extraction.
• If you run your data extraction program at midnight every day, each day you will extract only those records with a date and time stamp later than midnight of the previous day (see the sketch after this list).
Advantages:
• This technique presupposes that all the relevant source records contain date and time stamps; given that, data capture based on date and time stamps can work for any type of source file.
• This technique captures the latest state of the source data.

Disadvantages:
• Any intermediary states between two data extraction runs are lost.
• Deletion of source records presents a special problem. If a source record gets deleted between two extract runs, the information about the delete is not detected. You can get around this by marking the source record for delete first, doing the extraction run, and then physically deleting the record.
• This means you have to add more logic to the source applications.
• This technique works well only if the number of revised records is small.
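A minimal sketch of such an extract query, assuming a hypothetical source table orders with a last_update_ts column; :last_extract_ts stands for the cutoff of the previous run (for example, midnight of the previous day):

-- Select only the rows created or changed since the last extract run.
-- Table, column, and parameter names are illustrative assumptions.
SELECT *
FROM orders
WHERE last_update_ts > :last_extract_ts;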
Capture by Comparing Files
• If none of the above techniques is feasible for specific source files in your environment, use this technique as a last resort.
• It is also called the snapshot differential technique because it compares two snapshots of the source data.
• While performing today's data extraction for changes to product data, you do a full file comparison between today's copy of the product data and yesterday's copy (a sketch in SQL follows this list).
• You also compare record keys to find the inserts and deletes.
• Then you capture any changes between the two copies.
Advantages: This may be the only feasible option for some legacy data sources that do not have transaction logs or time stamps on source records.
Disadvantages:
• Though simple and straightforward, comparison of full rows in a large file can be very inefficient.
• This technique necessitates keeping prior copies of all the relevant source data.
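A rough sketch of the comparison in SQL, assuming the two snapshots have been staged as tables product_today and product_yesterday keyed on product_id (all table and column names are assumptions):

-- Rows only in today's snapshot are inserts, rows only in yesterday's are
-- deletes, and rows present in both with different non-key values are updates.
SELECT COALESCE(t.product_id, y.product_id) AS product_id,
       CASE
           WHEN y.product_id IS NULL THEN 'INSERT'
           WHEN t.product_id IS NULL THEN 'DELETE'
           ELSE 'UPDATE'
       END AS change_type
FROM product_today t
FULL OUTER JOIN product_yesterday y
    ON t.product_id = y.product_id
WHERE y.product_id IS NULL
   OR t.product_id IS NULL
   OR t.product_name <> y.product_name
   OR t.unit_price   <> y.unit_price;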
DATA TRANSFORMATION

• Extracted data is raw data; it cannot be applied to the data warehouse as is.
• All the extracted data must be made usable in the data warehouse.
Quality of data

• A major effort within data transformation is the improvement of data quality.
• This includes filling in the missing values for attributes in the extracted data.
• Data quality is of paramount importance in the data warehouse because the effect of strategic decisions based on incorrect information can be devastating.
Basic tasks in data transformation

• Selection - the beginning of the whole process of data transformation. Select either whole records or parts of several records from the source systems.

• Splitting/joining - the types of data manipulation needed on the selected parts of source records. Sometimes (uncommonly), you will be splitting the selected parts even further during data transformation. Joining of parts selected from many source systems is more widespread in the data warehouse environment.

• Conversion - an all-inclusive task. It includes a large variety of rudimentary conversions of single fields for two primary reasons—one to standardize among the data extractions from disparate source systems, and the other to make the fields usable and understandable to the users.
Basic tasks in data transformation(2)

• Summarization. Sometimes it is not feasible to keep data at the lowest level of detail in the data warehouse. It may be that none of the users ever needs data at the lowest granularity for analysis or querying.
• Enrichment - rearrangement and simplification of individual fields to
make them more useful for the data warehouse environment. You may
use one or more fields from the same input record to create a better
view of the data for the data warehouse. This principle is extended
when one or more fields originate from multiple records, resulting in a
single field for the data warehouse.
Major Transformation Types

• Format Revisions
• Decoding of Fields
• Calculated and Derived Values.
• Splitting of Single Fields.
• Merging of Information.
• Character Set Conversion.
• Conversion of Units of Measurements
• Date/Time Conversion.
• Summarization.
• Key Restructuring.
• Deduplication.
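For illustration, a hedged sketch of a few of these transformation types in SQL; the staging table stage_sales, all column names, and the code-to-description mappings below are assumptions, and standard-SQL string concatenation (||) is assumed:

-- Decoding of fields, calculated/derived values, merging of information,
-- and date conversion applied while reading from a staging table.
SELECT
    CASE cust_status_code                        -- decoding of fields
        WHEN 'A' THEN 'Active'
        WHEN 'I' THEN 'Inactive'
        ELSE 'Unknown'
    END                                          AS customer_status,
    sale_amount - cost_amount                    AS margin,          -- calculated and derived values
    TRIM(first_name) || ' ' || TRIM(last_name)   AS customer_name,   -- merging of information
    CAST(order_date_text AS DATE)                AS order_date       -- date/time conversion
FROM stage_sales;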
Entity Identification Problem
• You are likely to have three different customer files supporting your source systems. One system may be the old order entry system, another the customer service support system, and the third the marketing system.
• A very large number of the customers will be common to all three files.
• You must be able to get the activities of a single customer from the various source systems and then match them up with the single record to be loaded into the data warehouse.
• Vendors, suppliers, employees, and sometimes products are the kinds of entities that are prone to this type of problem.
• You have to design complex algorithms to match records from all three files and form groups of matching records.
• If the matching criteria are too tight, then some records will escape the groups.
• On the other hand, if the matching criteria are too loose, a particular group may include records of more than one customer.
• One approach is to solve the entity identification problem in two phases. In the first phase, all records, irrespective of whether they are duplicates or not, are assigned unique identifiers.
• The second phase consists of reconciling the duplicates periodically through automatic algorithms and manual verification.
Using Transformation Tools

• The desired goal in using transformation tools is to eliminate manual methods altogether; in practice this is not completely possible.
• Even if you get the most sophisticated and comprehensive set of transformation tools, be prepared to use in-house programs here and there.
• Use of automated tools certainly improves efficiency and accuracy.
• You just have to specify the parameters, the data definitions, and the rules to the transformation tool.
• If your input into the tool is accurate, then the rest of the work is performed efficiently by the tool.
• You gain a major advantage from using a transformation tool because of the recording of metadata by the tool.
• When you specify the transformation parameters and rules, these are stored as metadata by the tool.
• This metadata then becomes part of the overall metadata component of the data warehouse.
• When changes occur to transformation functions because of changes in business rules or data definitions, you just have to enter the changes into the tool.
Using Manual Techniques
• Manual techniques are adequate for smaller data warehouses.
• In such cases, manually coded programs and scripts perform every data transformation. Mostly, these
programs are executed in the data staging area.
• Analysts and programmers who already possess the knowledge and the expertise are able to produce
the programs and scripts.
• this method involves elaborate coding and testing.
• Although the initial cost may be reasonable, ongoing maintenance may escalate the cost
• Unlike automated tools, the manual method is more likely to be prone to errors.
• It may also turn out that several individual programs are required in your environment.
• A major disadvantage relates to metadata.
• Automated tools record their own metadata, but in-house programs have to be designed differently if
you need to store and use metadata.
• Even if the in-house programs record the data transformation metadata initially, every time changes
occur to transformation rules, the metadata has to be maintained.
• This puts an additional burden on the maintenance of manually coded transformation programs.
DATA LOADING

• Data loading takes the prepared data, applies it to the data warehouse, and stores it in the database.
• Terminology:
• Initial Load — populating all the data warehouse tables for the very
first time
• Incremental Load — applying ongoing changes as necessary in a
periodic manner
• Full Refresh — completely erasing the contents of one or more
tables and reloading with fresh data (initial load is a refresh of all the
tables)
Applying Data: Techniques and Processes

LOAD, APPEND, DESTRUCTIVE MERGE, CONSTRUCTIVE MERGE
Load

• If the target table to be loaded already exists and data exists in the
table, the load process wipes out the existing data and applies the data
from the incoming file.
• If the table is already empty before loading, the load process simply
applies the data from the incoming file.
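A minimal sketch of the load mode in SQL, assuming a staging table stage_sales feeding a target table dw_sales (both names are illustrative):

-- Wipe out any existing data in the target table, then apply the incoming data.
DELETE FROM dw_sales;                -- or TRUNCATE TABLE dw_sales, where supported
INSERT INTO dw_sales
SELECT * FROM stage_sales;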
Append

• Extension of the load.


• If data already exists in the table, the append process
unconditionally adds the incoming data, preserving the
existing data in the target table.
• When an incoming record is a duplicate of an already
existing record, you may define how to handle an incoming
duplicate:
• The incoming record may be allowed to be added as a duplicate.
• In the other option, the incoming duplicate record may be rejected
during the append process.
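A sketch of the append mode under the same assumed staging and target tables, here taking the option of rejecting incoming duplicates on an assumed key column sale_id:

-- Add incoming rows while preserving existing data; rows whose key already
-- exists in the target are rejected rather than added as duplicates.
INSERT INTO dw_sales
SELECT s.*
FROM stage_sales s
WHERE NOT EXISTS (SELECT 1 FROM dw_sales d WHERE d.sale_id = s.sale_id);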
Destructive Merge

• Applies incoming data to the target data.


• If the primary key of an incoming record matches with the key of an
existing record, update the matching target record.
• If the incoming record is a new record without a match with any existing
record, add the incoming record to the target table.
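A sketch of a destructive merge using the SQL MERGE statement (SQL:2003; exact syntax varies by DBMS), again with assumed table and column names:

-- Update the matching target row when the primary key matches;
-- otherwise insert the incoming row as a new record.
MERGE INTO dw_customer t
USING stage_customer s
    ON (t.customer_id = s.customer_id)
WHEN MATCHED THEN
    UPDATE SET t.customer_name = s.customer_name,
               t.customer_city = s.customer_city
WHEN NOT MATCHED THEN
    INSERT (customer_id, customer_name, customer_city)
    VALUES (s.customer_id, s.customer_name, s.customer_city);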
Constructive Merge

• Slightly different from the destructive merge.


• If the primary key of an incoming record matches the key of an existing record, leave the existing record, add the incoming record, and mark the added record as superseding the old record.
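One way to sketch a constructive merge in plain SQL, assuming the target table dw_customer carries a current_flag column for marking superseded rows (this column and all names are illustrative design assumptions):

-- Keep the existing rows but mark them as no longer current, then add the
-- incoming rows as the superseding versions.
UPDATE dw_customer
SET    current_flag = 'N'
WHERE  customer_id IN (SELECT customer_id FROM stage_customer);

INSERT INTO dw_customer (customer_id, customer_name, customer_city, current_flag)
SELECT customer_id, customer_name, customer_city, 'Y'
FROM   stage_customer;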
Cases:

• Consider a data warehouse for hotel occupancy, where there are four
dimensions namely (a) Hotel (b) Room (c) Time (d) Customer and two
measures (i) Occupied rooms (ii) Vacant rooms. Draw the information
package diagram, star schema, and snowflake schema.
Fact Table Sizes
Please study the calculations shown below:
• Time dimension: 5 years × 365 days = 1825
• Store dimension: 300 stores reporting daily sales
• Product dimension: 40,000 products in each store (about 4000 sell in each
store daily)
• Promotion dimension: a sold item may be in only one promotion in a store on
a given day
• Maximum number of base fact table records: 1825 × 300 × 4000 × 1 = 2
billion
Online Analytical Processing
(OLAP)
Need for Multidimensional Analysis
• Multidimensional views are inherently
representative of any business model
• Very few models are limited to three dimensions
or less
• Decision makers must be able to analyze data
along any number of dimensions, at any level of
aggregation, with the capability of viewing results
in a variety of ways.
• They must have the ability to drill down and roll
up along the hierarchies of every dimension
• Time is a critical dimension
Data Warehouse (OLAP) vs. Operational Database (OLTP)

1. OLAP involves historical processing of information; OLTP involves day-to-day processing.
2. OLAP systems are used by knowledge workers such as executives, managers, and analysts; OLTP systems are used by clerks, DBAs, or database professionals.
3. OLAP is useful in analyzing the business; OLTP is useful in running the business.
4. OLAP focuses on information out; OLTP focuses on data in.
5. OLAP is based on the Star Schema, Snowflake Schema, and Fact Constellation Schema; OLTP is based on the Entity Relationship Model.
6. OLAP contains historical data; OLTP contains current data.
7. OLAP provides summarized and consolidated data; OLTP provides primitive and highly detailed data.
8. OLAP provides a summarized and multidimensional view of data; OLTP provides a detailed and flat relational view of data.
9. The number of OLAP users is in the hundreds; the number of OLTP users is in the thousands.
10. The number of records accessed in OLAP is in the millions; in OLTP it is in the tens.
11. OLAP database size is from 100 GB to 1 TB; OLTP database size is from 100 MB to 1 GB.
12. OLAP is highly flexible; OLTP provides high performance.
Fast Access and Powerful Calculations
• Imagine a business analyst looking for reasons why profitability dipped sharply in recent months across the entire enterprise
List of typical calculations that get
included in the query requests:
• Roll-ups to provide summaries and aggregations along the
hierarchies of the dimensions.
• Drill-downs from the top level to the lowest along the
hierarchies of the dimensions, in combinations among the
dimensions.
• Simple calculations, such as computation of margins (sales
minus costs).
• Share calculations to compute the percentage of parts to
the whole.
• Algebraic equations involving key performance indicators.
• Moving averages and growth percentages.
• Trend analysis using statistical methods.
Limitations of Other Analysis
Methods
• Earliest method was the medium of reports
• Spreadsheets with all their functionality and
features
• SQL has been the accepted interface for
retrieving and manipulating data from
relational databases
• how suitable are these traditional methods?
• The following is a list of notable report generator software. Reporting software is used to generate human-readable reports from various data sources.
• Free software:
– Eclipse BIRT Project
– GNU Enterprise (reporting sub-package)
– JasperReports
– jsreport
– KNIME
– LibreOffice Base
– OpenOffice Base
– Pentaho
– SpagoBI
• Report writers provide two key functions:
– the ability to point and click for generating and issuing SQL calls
– the capability to format the output reports
• They do not support multidimensionality.
• You cannot drill down to lower levels in the dimensions; that will have to come from additional reports.
• You cannot rotate the results by switching rows and columns.
• The report writers do not provide aggregate navigation.
• Once the report is formatted and run, you cannot alter the presentation of the result data sets.
• Spreadsheets were positioned as analysis tools; you can perform "what if" analysis with spreadsheets.
• When you modify the values in some cells, the values in other related cells automatically change.
• With add-in tools you can perform some forms of aggregation and also do a variety of calculations.
• Third-party tools have also enhanced spreadsheet products to present data in three-dimensional formats.
• You can view rows, columns, and pages on spreadsheets.
• Even with enhanced functionality using add-ins, spreadsheets are still very cumbersome to use.
• SQL (Structured Query Language): the original goal was to be an end-user query language, but now everyone agrees that the language is too abstruse even for sophisticated users.
• The SQL vocabulary is ill-suited for analyzing data and exploring relationships.
• In a real-world analysis session, many queries follow one after the other.
• Each query may translate into a number of intricate SQL statements, with each of the statements likely to invoke full table scans, multiple joins, aggregations, groupings, and sorting.
• Even if an analyst could accurately formulate such complex SQL statements, the overhead on the systems would still be enormous and seriously impact response times.
OLAP is the Answer
virtues of OLAP
• Enables analysts, executives, and managers to gain
useful insights from the presentation of data.
• Can reorganize metrics along several dimensions and
allow data to be viewed from different perspectives.
• Supports multidimensional analysis.
• Is able to drill down or roll up within each dimension.
• Is capable of applying mathematical formulas and
calculations to measures.
• Provides fast response, facilitating speed-of-thought
analysis.
OLAP Definition
• "Providing On-Line Analytical Processing to User Analysts," by Dr. E. F. Codd, the acknowledged "father" of the relational database model, was published in 1993 and defined 12 rules or guidelines for an OLAP system; in 1995, six additional rules were included.

"On-Line Analytical Processing (OLAP) is a category of software technology that enables analysts, managers and executives to gain insight into data through fast, consistent, interactive access in a wide variety of possible views of information that has been transformed from raw data to reflect the real dimensionality of the enterprise as understood by the user."
The guidelines proposed by Dr. Codd
A true OLAP system must conform to these guidelines
1. Multidimensional Conceptual View
2. Transparency
3. Accessibility
4. Consistent Reporting Performance
5. Client/Server Architecture
6. Generic Dimensionality
7. Dynamic Sparse Matrix Handling
8. Multiuser Support
9. Unrestricted Cross-dimensional Operations
10. Intuitive Data Manipulation
11. Flexible Reporting
12. Unlimited Dimensions and Aggregation Levels
In addition to these twelve basic guidelines, also take into
account the following requirements, not all distinctly
specified by Dr. Codd
• Drill-through to Detail Level. Allow a smooth transition from the
multidimensional, preaggregated database to the detail record level
of the source data warehouse repository.
• OLAP Analysis Models. Support Dr. Codd’s four analysis models:
exegetical (or descriptive), categorical (or explanatory),
contemplative, and formulaic.
• Treatment of Nonnormalized Data. Prohibit calculations made
within an OLAP system from affecting the external data serving as
the source.
• Storing OLAP Results. Do not deploy write-capable OLAP tools on
top of transactional systems.
• Missing Values. Ignore missing values, irrespective of their source.
• Incremental Database Refresh. Provide for incremental refreshes of
the extracted and aggregated OLAP data.
• SQL Interface. Seamlessly integrate the OLAP system into the
existing enterprise environment.
MAJOR FEATURES AND FUNCTIONS
This STAR schema has three business dimensions, namely product, time, and store. The fact table contains sales.
In the cube representation, products are placed on the X-axis, time on the Y-axis, and stores on the Z-axis.
The intersection points on one slice or plane relate to sales along the product and time business dimensions for store: New York.
1. Query
– Display the total sales of all products for past five years in
all stores.
– Display of Results
– Rows: Year numbers 2000, 1999, 1998, 1997, 1996
– Columns: Total Sales for all products
– Page: One store per page
2. Query
– Compare total sales for all stores, product by product,
between years 2000 and 1999.
– Display of Results
– Rows: Year numbers 2000, 1999; difference; percentage
increase or decrease
– Columns: One column per product, showing all products
– Page: All stores
1. Query
– Show comparison of sales by individual stores, product by product,
between years
– 2000 and 1999 only for those products with reduced sales.
– Display of Results
– Rows: Year numbers 2000, 1999; difference; percentage decrease
– Columns: One column per product, showing only the qualifying
products
– Page: One store per page
2. Query
– Show the results of the previous query, but rotating and switching the
columns with
– rows.
– Display of Results
– Rows: One row per product, showing only the qualifying products
– Columns: Year numbers 2000, 1999; difference; percentage decrease
– Page: One store per page
What are Hypercubes?
How do you represent a four-dimensional model with data points along the edges of a three-dimensional cube?
• The MDS is well suited to represent four dimensions.
• Can you think of the four straight lines of the MDS intuitively representing a "cube" with four primary edges?
• This intuitive representation is a hypercube, a representation that accommodates more than three dimensions.
• At a lower level of simplification, a hypercube can very well accommodate three dimensions.
• A hypercube is a general metaphor for representing multidimensional data.
Multiple logical dimensions can be combined within the same display group. Notice how product and metrics are combined to display as columns; the displayed page represents the sales for store: New York.
For six-dimensional data, product and metrics are combined and represented as columns, store and time are combined as rows, and demographics and promotion as pages.
Multidimensional Data
• Sales volume as a function of product, month, and region. Dimensions: Product, Location, Time.
• Hierarchical summarization paths:
– Product: Industry  Category  Product
– Location: Region  Country  City  Office
– Time: Year  Quarter  Month (or Week)  Day
A Sample Data Cube
The sample cube shows sales by item (TV, PC, VCR), by quarter (1Qtr through 4Qtr, plus a sum), and by country (U.S.A., Canada, Mexico, plus a sum). An aggregate cell such as (TV, sum, U.S.A.) holds the total annual sales of TVs in the U.S.A.
• We have noted a special method for representing a data model with more than three dimensions using an MDS.
• This method is an intuitive way of showing a hypercube.
Multidimensional analysis
• drill-down
• roll-up
• Slice
• dice operation
• pivot
Typical OLAP Operations
• Roll up (drill-up): summarize data
– by climbing up hierarchy or by dimension reduction
• Drill down (roll down): reverse of roll-up
– from higher level summary to lower level summary or detailed data,
or introducing new dimensions
• Slice and dice:
– project and select
• Pivot (rotate):
– reorient the cube, visualization, 3D to series of 2D planes.
• Other operations
– drill across: involving (across) more than one fact table
– drill through: through the bottom level of the cube to its back-end
relational tables (using SQL)

Drill-Down and Roll-Up
• note these specific attributes of the product
dimension: product name, subcategory, category,
product line, and department
• These attributes signify an ascending hierarchical
sequence from product name to department
• A department includes product lines, a product
line includes categories, a category includes
subcategories, and each subcategory consists of
products with individual product names
The figure shows the rolling up to higher hierarchical levels of aggregation and the drilling down to lower levels of detail.
Also note the sales numbers shown alongside; these are sales for one particular store in one particular month at these levels of aggregation.
The sales numbers you notice as you go down the hierarchy are for a single department, a single product line, a single category, and so on.
Slice-and-Dice or Rotation
• The slice operation selects one particular
dimension from a given cube and provides a
new sub-cube
• Dice selects two or more dimensions from a
given cube and provides a new sub-cube
• Here Slice is performed for the dimension "time" using the criterion time = "Q1".
• The dice operation on the cube based on the following selection criteria involves
three dimensions.
– (location = "Toronto" or "Vancouver")
– (time = "Q1" or "Q2")
– (item =" Mobile" or "Modem")
• The pivot operation is also known as rotation.
• It rotates the data axes in view in order to provide an alternative presentation of the data. Consider the following diagram that shows the pivot operation.
OLTP Compared With OLAP
• On Line Transaction Processing – OLTP
– Maintains a database that is an accurate model of some
real-world enterprise. Supports day-to-day operations.
Characteristics:
• Short simple transactions
• Relatively frequent updates
• Transactions access only a small fraction of the database
• On Line Analytic Processing – OLAP
– Uses information in database to guide strategic decisions.
Characteristics:
• Complex queries
• Infrequent updates
• Transactions access a large fraction of the database
• Data need not be up-to-date

• OLTP-style transaction:
– John Smith, from Schenectady, N.Y., just bought a box of
tomatoes; charge his account; deliver the tomatoes from
our Schenectady warehouse; decrease our inventory of
tomatoes from that warehouse
• OLAP-style transaction:
– How many cases of tomatoes were sold in all northeast
warehouses in the years 2000 and 2001?

OLAP, Data Mining, and Analysis
• The “A” in OLAP stands for “Analytical”
• Many OLAP and Data Mining applications
involve sophisticated analysis methods from
the fields of mathematics, statistical analysis,
and artificial intelligence
• Our main interest is in the database aspects of
these fields, not the sophisticated analysis
techniques

Example
Fact Tables

• Many OLAP applications are based on a fact table


• For example, a supermarket application might be
based on a table
Sales (Market_Id, Product_Id, Time_Id, Sales_Amt)
• The table can be viewed as multidimensional
– Market_Id, Product_Id, Time_Id are the dimensions that
represent specific supermarkets, products, and time
intervals
– Sales_Amt is a function of the other three

A Data Cube

• Fact tables can be viewed as an N-dimensional data cube (3-dimensional in our example)
– The entries in the cube are the values for Sales_Amts

Dimension Tables
• The dimensions of the fact table are further
described with dimension tables
• Fact table:
Sales (Market_id, Product_Id, Time_Id, Sales_Amt)
• Dimension Tables:
Market (Market_Id, City, State, Region)
Product (Product_Id, Name, Category, Price)
Time (Time_Id, Week, Month, Quarter)

Star Schema
• The fact and dimension relations can be
displayed in an E-R diagram, which looks
like a star and is called a star schema

Aggregation
• Many OLAP queries involve aggregation of the
data in the fact table
• For example, to find the total sales (over time) of
each product in each market, we might use
SELECT S.Market_Id, S.Product_Id, SUM (S.Sales_Amt)
FROM Sales S
GROUP BY S.Market_Id, S.Product_Id
• The aggregation is over the entire time dimension
and thus produces a two-dimensional view of the
data
Aggregation over Time
• The output of the previous query: SUM(Sales_Amt), with Market_Id across the columns and Product_Id down the rows

             M1      M2      M3      M4
    P1     3003    1503     ...     ...
    P2     6003    2402     ...     ...
    P3     4503       3     ...     ...
    P4     7503    7000     ...     ...
    P5      ...     ...     ...     ...
Drilling Down and Rolling Up
• Some dimension tables form an aggregation hierarchy
Market_Id  City  State  Region
• Executing a series of queries that moves down a
hierarchy (e.g., from aggregation over regions to that
over states) is called drilling down
– Requires the use of the fact table or information more specific
than the requested aggregation (e.g., cities)
• Executing a series of queries that moves up the
hierarchy (e.g., from states to regions) is called rolling
up

Drilling Down
• Drilling down on market: from Region to State
Sales (Market_Id, Product_Id, Time_Id, Sales_Amt)
Market (Market_Id, City, State, Region)

1. SELECT S.Product_Id, M.Region, SUM (S.Sales_Amt)
   FROM Sales S, Market M
   WHERE M.Market_Id = S.Market_Id
   GROUP BY S.Product_Id, M.Region

2. SELECT S.Product_Id, M.State, SUM (S.Sales_Amt)
   FROM Sales S, Market M
   WHERE M.Market_Id = S.Market_Id
   GROUP BY S.Product_Id, M.State
Rolling Up
• Rolling up on market, from State to Region
– If we have already created a table, State_Sales, using

1. SELECT S.Product_Id, M.State, SUM (S.Sales_Amt)
   FROM Sales S, Market M
   WHERE M.Market_Id = S.Market_Id
   GROUP BY S.Product_Id, M.State

then we can roll up from there to:

2. SELECT T.Product_Id, M.Region, SUM (T.Sales_Amt)
   FROM State_Sales T, Market M
   WHERE M.State = T.State
   GROUP BY T.Product_Id, M.Region
Pivoting
• When we view the data as a multi-dimensional
cube and group on a subset of the axes, we are
said to be performing a pivot on those axes
– Pivoting on dimensions D1,…,Dk in a data cube
D1,…,Dk,Dk+1,…,Dn means that we use GROUP BY
A1,…,Ak and aggregate over Ak+1,…An, where Ai is an
attribute of the dimension Di
– Example: Pivoting on Product and Time corresponds to
grouping on Product_id and Quarter and aggregating
Sales_Amt over Market_id:

SELECT S.Product_Id, T.Quarter, SUM (S.Sales_Amt)
FROM Sales S, Time T
WHERE T.Time_Id = S.Time_Id
GROUP BY S.Product_Id, T.Quarter     -- the pivot
Time Hierarchy as a Lattice

• Not all aggregation hierarchies are linear
– The time hierarchy is a lattice
• Weeks are not contained in months
• We can roll up days into weeks or into months, but only months (not weeks) can be rolled up into quarters
Slicing-and-Dicing
• When we use WHERE to specify a particular
value for an axis (or several axes), we are
performing a slice
– Slicing the data cube in the Time dimension
(choosing sales only in week 12) then pivoting to
Product_id (aggregating over Market_id)
SELECT S.Product_Id, SUM (Sales_Amt)
FROM Sales S, Time T
WHERE T.Time_Id = S.Time_Id AND T.Week = 'Wk-12'   -- the slice
GROUP BY S.Product_Id                              -- the pivot
Slicing-and-Dicing
• Typically slicing and dicing involves several queries to
find the “right slice.”
For instance, change the slice and the axes:
• Slicing on Time and Market dimensions then pivoting to Product_id
and Week (in the time dimension)

SELECT S.Product_Id, T.Week, SUM (Sales_Amt)
FROM Sales S, Time T
WHERE T.Time_Id = S.Time_Id      -- the slice
  AND T.Quarter = 4
  AND S.Market_id = 12345
GROUP BY S.Product_Id, T.Week    -- the pivot
The CUBE Operator
• To construct the following table, it would take 3 queries (next slide): SUM(Sales_Amt), with Market_Id across the columns and Product_Id down the rows, including totals

              M1      M2      M3    Total
    P1      3003    1503     ...      ...
    P2      6003    2402     ...      ...
    P3      4503       3     ...      ...
    P4      7503    7000     ...      ...
    Total    ...     ...     ...      ...
The Three Queries
• For the table entries, without the totals (aggregation on time)
SELECT S.Market_Id, S.Product_Id, SUM (S.Sales_Amt)
FROM Sales S
GROUP BY S.Market_Id, S.Product_Id
• For the row totals (aggregation on time and supermarkets)
SELECT S.Product_Id, SUM (S.Sales_Amt)
FROM Sales S
GROUP BY S.Product_Id
• For the column totals (aggregation on time and products)
SELECT S.Market_Id, SUM (S.Sales_Amt)
FROM Sales S
GROUP BY S.Market_Id
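With the CUBE operator itself (part of the SQL:1999 grouped-set extensions and supported by most major DBMSs), a single query can produce the detail cells together with the row totals, column totals, and grand total shown in the earlier table:

-- One query instead of three: CUBE adds the subtotals over each grouping
-- column and the grand total.
SELECT S.Market_Id, S.Product_Id, SUM (S.Sales_Amt)
FROM Sales S
GROUP BY CUBE (S.Market_Id, S.Product_Id)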
OLAP MODELS
• ROLAP (relational online analytical processing)
• MOLAP (multidimensional online analytical processing)
• DOLAP (desktop online analytical processing), meant to provide portability to the user
• HOLAP (hybrid online analytical processing), a model that attempts to combine the strengths and features of both MOLAP and ROLAP
• Database OLAP refers to a relational database
management system designed to support OLAP
structures and perform OLAP operations
• Web OLAP refers to online analytical processing where
OLAP data is accessible from a web browser
Relational Online Analytical Processing (ROLAP):
• ROLAP is used for large data volumes, and the data is stored in relational tables. In ROLAP, a static multidimensional view of the data is created.
Multidimensional Online Analytical Processing (MOLAP):
• MOLAP is used for limited data volumes, and the data is stored in multidimensional arrays. In MOLAP, a dynamic multidimensional view of the data is created.
Applications
• OLAP reporting systems are widely used in business applications such as:
• Sales and Marketing
• Retail Industry
• Financial Organizations – Budgeting
• Agriculture
• People Management
• Process Management
• Examples are Essbase from Hyperion Solutions and Express Server from Oracle.