Module 1
Warehousing
18CS641
Pavan Kumar SP
VVCE MYSURU
Data Mining & Data Warehousing
a. Subject oriented
b. Integrated
c. Time-variant
d. Non-volatile
Subject oriented
Integrated
Time-Variant
• Data is stored to provide information from a historical perspective (e.g., the past 5–
10 years).
• Every key structure in the data warehouse contains a time element, either implicitly
or explicitly.
Non-volatile:
• A data warehouse is always a physically separate store of data transformed from the
application data found in the operational environment.
• Due to this separation, a data warehouse does not require transaction processing,
recovery, and concurrency control mechanisms.
• It usually requires only two operations in data accessing: initial loading and access of
data.
a. Data cleaning
b. Data integration
c. Data consolidation
Organizations use the data warehouse information for decision-making activities as follows.
1. Metadata dictionary translates the query into queries appropriate for the individual
heterogeneous sites involved.
2. Queries are then mapped and sent to local query processors
3. The results from the different sites are integrated into a global answer set.
Bottom tier
• Back-end tools and utilities feed data from operational databases or other external
sources into the bottom tier.
• These tools and utilities perform data extraction, cleaning, transformation, and load
and refresh functions to update the data warehouse.
• The data are extracted using application program interfaces known as gateways.
• A gateway is supported by the underlying DBMS and allows client programs to
generate SQL code to be executed at a server.
• This tier also contains a metadata repository, which stores information about the
data warehouse and its contents.
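The extraction, cleaning, transformation, and load functions performed by the back-end tools can be sketched as follows. This is only a minimal illustration, not a real ETL tool: the source rows, the field names, and the in-memory `warehouse` list are all hypothetical.

```python
# Hypothetical rows pulled from an operational source (all names are illustrative).
source_rows = [
    {"item": " TV ", "city": "vancouver", "amount": "400.0"},
    {"item": "Phone", "city": "Toronto", "amount": None},  # missing amount
    {"item": "phone", "city": "TORONTO", "amount": "150.5"},
]

def extract(rows):
    """Extraction: read rows from the source (here, an in-memory list)."""
    return list(rows)

def clean(rows):
    """Cleaning: drop rows whose amount is missing."""
    return [r for r in rows if r["amount"] is not None]

def transform(rows):
    """Transformation: convert values to one consistent warehouse format."""
    return [
        {
            "item": r["item"].strip().lower(),
            "city": r["city"].strip().title(),
            "amount": float(r["amount"]),
        }
        for r in rows
    ]

def load(rows, store):
    """Load: append the prepared rows to the warehouse store."""
    store.extend(rows)
    return store

warehouse = load(transform(clean(extract(source_rows))), [])
print(warehouse)
```

A refresh would rerun the same pipeline on rows that changed since the last load.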
Middle tier
• It is an OLAP server, typically implemented using either a relational OLAP (ROLAP) or a
multidimensional OLAP (MOLAP) model.
• ROLAP: an extended relational DBMS that maps operations on multidimensional
data to standard relational operations.
• MOLAP: a special-purpose server that directly implements multidimensional data and
operations.
Top tier
• It is a front-end client layer that contains query and reporting tools, analysis tools,
and data mining tools
From the architectural point of view, there are three Data Warehouse Models
a. Enterprise Warehouse
b. Data mart
c. Virtual warehouse
Enterprise Warehouse
• it collects all information about subjects spanning the entire organization
• it provides corporate-wide integration (from one or more operational systems or external
information providers)
• it contains both detailed and summarized data
• size of the data varies from hundreds of gigabytes (GB) to terabytes (TB)
• it can be implemented on traditional mainframes, computer superservers, or
parallel architecture platforms
• it requires extensive business modeling
• implementation cycle is measured in years
Data mart
• it contains a subset of corporate-wide data
• the scope is confined to some specific selected subjects
• it is usually implemented on low-cost departmental servers
• implementation cycle is measured in weeks
• if its design and planning are not enterprise-wide, in the long run, we should involve a
• Depending on the data sources, a data mart is categorized as dependent or independent.
• If the data is sourced from one or more operational systems or external information providers,
the data mart is called an independent data mart.
• If the data is sourced directly from the enterprise data warehouse, the data mart is called
a dependent data mart.
Virtual warehouse
• It is a set of views over an operational database
• For efficient query processing, only some of the possible summary views may be
materialized
• It is easy to build, but it requires excess capacity on the operational database server
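A virtual warehouse can be sketched with an ordinary SQL view. The example below uses Python's built-in `sqlite3` module; the `orders` table, its columns, and the `sales_by_city` view are hypothetical names chosen for illustration.

```python
import sqlite3

# In-memory database standing in for an operational database (hypothetical data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (city TEXT, item TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        ("Vancouver", "TV", 400.0),
        ("Vancouver", "Phone", 150.0),
        ("Toronto", "TV", 300.0),
    ],
)

# The "virtual warehouse" is only a view: no data is copied out of the
# operational table, so every query on the view runs against that table.
conn.execute(
    "CREATE VIEW sales_by_city AS "
    "SELECT city, SUM(amount) AS total FROM orders GROUP BY city"
)

sales_by_city = conn.execute(
    "SELECT city, total FROM sales_by_city ORDER BY city"
).fetchall()
print(sales_by_city)
```

Because the view is evaluated on the operational table each time it is queried, analytical queries compete with operational work, which is why excess capacity is needed on that server.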
What are the pros and cons of the top-down and bottom-up approaches to data
warehouse development?
Top-down approach
Pros
• It is a systematic solution
• It minimizes the integration problem
Cons
• It is expensive
• It takes a long time to develop
• It lacks flexibility
Bottom-up approach
Pros
• Design and development of independent data marts provide flexibility
• Low cost
• Rapid return on investment
Cons
• It may lead to problems while integrating various disparate data marts into a
consistent data warehouse
Step 1
• Distributed data marts can be constructed to integrate different data marts via hub
servers
Step 4
• Metadata also guides the algorithms used for summarization between the current
detailed data and the lightly summarized data, and between the lightly summarized
data and the highly summarized data.
• Metadata should be stored and managed persistently
A metadata repository should contain the following:
1. Description of the data warehouse structure, which includes
a. Warehouse schema and views
b. Dimensions
c. Hierarchies
d. Derived data definitions
e. Data mart locations and contents
2. Operational metadata, which includes
a. Data lineage: data migration and transformation
b. Currency of data: active, archived, or purged
c. Monitoring information: statistics, error reports, audit trails
3. The algorithms used for summarization, which include
a. Measure and dimension definition algorithms
b. Data on granularity
c. Partitions
d. Subject areas
e. Aggregation
f. Predefined queries and reports
4. Mapping from the operational environment to the data warehouse, which includes
a. Source databases and their contents
b. Gateway descriptions
c. Data partitions and data extraction, cleaning, and transformation rules
d. Data refresh and purging rules
e. Security (user authorization and access control)
5. Data related to system performance, which include
a. Indices and profiles that improve data access and retrieval performance
b. Rules for the timing and scheduling of refresh, update, and replication cycles
6. Business metadata, which include
a. Business terms and definitions
b. Data ownership information
c. Charging policies
Dimensions: time, item, branch, and location
Dimension table (item): item_name, brand, type
• The fact table contains the names of the facts (measures) and keys to each of the
related dimension tables.
• We will look at the AllElectronics sales data for items sold per quarter in the city of
Vancouver.
• In this 2-D representation, the sales for Vancouver are shown with respect to the time
dimension (organized in quarters) and the item dimension (classified according to the
types of items sold).
• The fact or measure displayed is dollars sold (in thousands).
• The 3-D data in the table are represented as a series of 2-D tables.
Conceptually, we may also represent the same data in a 3-D data cube, as shown below.
• Given a set of dimensions, we can generate a cuboid for each of the possible subsets
of the given dimensions.
• The result would form a lattice of cuboids, each showing the data at a different level
of summarization or group-by.
• The lattice of cuboids is then referred to as a data cube
• The cuboid that holds the lowest level of summarization is called the base cuboid
• The 0-D cuboid, which holds the highest level of summarization, is called the apex
cuboid.
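The lattice of cuboids described above can be enumerated with a few lines of Python. This is a minimal sketch that ignores concept hierarchies, so n dimensions give 2^n cuboids; the dimension names follow the AllElectronics example, and the function name `cuboids` is hypothetical.

```python
from itertools import combinations

dimensions = ("time", "item", "location")

def cuboids(dims):
    """Return every subset of the dimensions, from the 0-D cuboid up to the n-D cuboid."""
    lattice = []
    for k in range(len(dims) + 1):
        lattice.extend(combinations(dims, k))
    return lattice

lattice = cuboids(dimensions)
apex = lattice[0]   # 0-D cuboid: no group-by, highest level of summarization
base = lattice[-1]  # n-D cuboid: group by all dimensions, lowest level of summarization

for cuboid in lattice:
    print(len(cuboid), "-D cuboid:", cuboid)
```

Each tuple corresponds to one group-by: for example, `("time", "item")` is the 2-D cuboid that summarizes over all locations.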
Star Schema
In a star schema, the data warehouse contains
1. Fact Table - a large central table containing the bulk of the data, with no redundancy
2. Dimension tables- a set of smaller attendant tables, one for each dimension
The schema graph resembles a starburst, with the dimension tables displayed in a radial
pattern around the central fact table.
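A star schema can be sketched with one fact table and its dimension tables, here using Python's built-in `sqlite3` module. The table names, column names, and all rows are hypothetical, and only two dimensions are shown to keep the example short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: one small table per dimension.
CREATE TABLE item_dim (
    item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT, type TEXT);
CREATE TABLE location_dim (
    location_key INTEGER PRIMARY KEY, city TEXT, country TEXT);

-- Fact table: keys to each dimension plus the measure.
CREATE TABLE sales_fact (
    item_key INTEGER REFERENCES item_dim(item_key),
    location_key INTEGER REFERENCES location_dim(location_key),
    dollars_sold REAL);

INSERT INTO item_dim VALUES
    (1, 'TV-100', 'BrandA', 'home entertainment'),
    (2, 'Phone-7', 'BrandB', 'phone');
INSERT INTO location_dim VALUES
    (1, 'Vancouver', 'Canada'),
    (2, 'Chicago', 'USA');
INSERT INTO sales_fact VALUES (1, 1, 605.0), (2, 1, 14.0), (1, 2, 854.0);
""")

# One join per dimension used in the query.
sales_by_type = conn.execute("""
    SELECT i.type, SUM(f.dollars_sold)
    FROM sales_fact AS f
    JOIN item_dim AS i ON f.item_key = i.item_key
    GROUP BY i.type
    ORDER BY i.type
""").fetchall()
print(sales_by_type)
```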
Snowflake schema
• The snowflake schema is a variant of the star schema model
• Some of the dimension tables are normalized by splitting the data into additional
tables.
• The resulting schema graph forms a shape like a snowflake
• The snowflake model may be kept in the normalized form to reduce redundancies
• Snowflake structure can reduce the effectiveness of browsing since more joins will
be needed to execute a query
Example
• The single dimension table for items in the star schema is normalized in the
snowflake schema, resulting in new item and supplier tables
• The item dimension table now contains the attributes item key, item name, brand,
type, and supplier key, where supplier key is linked to the supplier dimension table,
containing supplier key and supplier type information.
• Similarly, the single dimension table for location in the star schema can be
normalized into two new tables: location and city.
• The city key in the new location table links to the city dimension.
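The effect of normalizing the item dimension can be sketched as follows, again with `sqlite3`. The attribute names follow the example above; the table names `sales`, `item`, and `supplier` and all rows are hypothetical. Note the extra join needed to reach the supplier type.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- The supplier attributes are split out of the item dimension.
CREATE TABLE supplier (
    supplier_key INTEGER PRIMARY KEY, supplier_type TEXT);
CREATE TABLE item (
    item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT, type TEXT,
    supplier_key INTEGER REFERENCES supplier(supplier_key));
CREATE TABLE sales (
    item_key INTEGER REFERENCES item(item_key), dollars_sold REAL);

INSERT INTO supplier VALUES (10, 'wholesale'), (20, 'retail');
INSERT INTO item VALUES
    (1, 'TV-100', 'BrandA', 'home entertainment', 10),
    (2, 'Phone-7', 'BrandB', 'phone', 20),
    (3, 'Phone-8', 'BrandB', 'phone', 10);
INSERT INTO sales VALUES (1, 605.0), (2, 14.0), (3, 31.0);
""")

# Two joins: fact -> item -> supplier. In a star schema the supplier type
# would sit directly in the item dimension table and need only one join.
sales_by_supplier_type = conn.execute("""
    SELECT s.supplier_type, SUM(f.dollars_sold)
    FROM sales AS f
    JOIN item AS i ON f.item_key = i.item_key
    JOIN supplier AS s ON i.supplier_key = s.supplier_key
    GROUP BY s.supplier_type
    ORDER BY s.supplier_type
""").fetchall()
print(sales_by_supplier_type)
```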
Fact constellation:
• Sophisticated applications may require multiple fact tables to share dimension
tables.
• This kind of schema can be viewed as a collection of stars and hence is called a
galaxy schema or a fact constellation
Example
Concept Hierarchies
• It defines a sequence of mappings from a set of low-level concepts to higher-level,
more general concepts
Example
Dimension: Location
Level: City
Values: Vancouver, Toronto, New York, and Chicago
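A concept hierarchy for location can be sketched as a chain of lookup tables, one per level. The dictionary and function names below are hypothetical; the mappings cover only the four cities listed above.

```python
# Each dictionary maps one level of the hierarchy to the next, more general level.
city_to_province_or_state = {
    "Vancouver": "British Columbia",
    "Toronto": "Ontario",
    "New York": "New York",
    "Chicago": "Illinois",
}
province_or_state_to_country = {
    "British Columbia": "Canada",
    "Ontario": "Canada",
    "New York": "USA",
    "Illinois": "USA",
}

def generalize(city):
    """Return the path from a city up to its country."""
    region = city_to_province_or_state[city]
    country = province_or_state_to_country[region]
    return [city, region, country]

for city in city_to_province_or_state:
    print(" < ".join(generalize(city)))
```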
Example
Dimension: Price
Intervals: ($X …$Y)
Distributive
• An aggregate function is distributive if it can be computed in a distributed manner.
• Suppose the data are partitioned into n sets.
• We apply the function to each partition, resulting in n aggregate values.
• If the result derived by applying the function to the n aggregate values is the same as
that derived by applying the function to the entire data set (without partitioning),
the function can be computed in a distributed manner.
For example,
sum() can be computed for a data cube by first partitioning the cube into a set of
subcubes, computing sum() for each subcube, and then summing up the partial sums
obtained for each subcube. Hence, sum() is a distributive aggregate function. For the
same reason, count(), min(), and max() are distributive aggregate functions.
• A measure is distributive if it is obtained by applying a distributive aggregate
function.
• Distributive measures can be computed efficiently because the computation can be
partitioned.
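The definition can be checked directly on a small data set. The values, the partitioning, and the helper name `combine_partials` below are hypothetical and serve only as an illustration.

```python
data = [4, 8, 15, 16, 23, 42]
partitions = [data[0:2], data[2:4], data[4:6]]  # n = 3 partitions

def combine_partials(func, parts):
    """Apply func to each partition, then apply func again to the n partial results."""
    return func([func(part) for part in parts])

# sum(), min(), and max() give the same answer with or without partitioning.
sum_ok = combine_partials(sum, partitions) == sum(data)
min_ok = combine_partials(min, partitions) == min(data)
max_ok = combine_partials(max, partitions) == max(data)

# count(): each partition is counted with len(), and the partial counts are added up.
count_ok = sum(len(part) for part in partitions) == len(data)

print(sum_ok, min_ok, max_ok, count_ok)
```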
Algebraic
• An aggregate function is algebraic if it can be computed by an algebraic function with
M arguments (where M is a bounded positive integer), each of which is obtained by
applying a distributive aggregate function.
For example,
1. avg() (average) can be computed by sum()/count(), where both sum() and count()
are distributive aggregate functions.
2. min_N() and max_N() (which find the N minimum and N maximum values,
respectively, in a given set) and standard_deviation() are algebraic aggregate
functions.
• A measure is algebraic if it is obtained by applying an algebraic aggregate
function.
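As a small illustration of avg() as an algebraic function with M = 2 arguments, the sketch below keeps only a (sum, count) pair per partition. The data values and variable names are hypothetical.

```python
partitions = [[2.0, 4.0], [6.0, 8.0, 10.0, 12.0]]

# Per partition, store two distributive values: sum and count.
partials = [(sum(part), len(part)) for part in partitions]

# Combine the partial sums and partial counts, then divide once.
total_sum = sum(s for s, _ in partials)
total_count = sum(c for _, c in partials)
avg = total_sum / total_count

# Averaging the per-partition averages is NOT correct when partition sizes differ.
naive_avg = sum(s / c for s, c in partials) / len(partials)

print(avg, naive_avg)
```

The comparison with `naive_avg` shows why the two distributive values must be carried separately until the final division.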
Holistic:
• An aggregate function is holistic if there is no constant bound on the storage size
needed to describe a subaggregate.
• That is, there does not exist an algebraic function with M arguments (where M is a
constant) that characterizes the computation. Common examples of holistic
functions include median(), mode(), and rank().
• A measure is holistic if it is obtained by applying a holistic aggregate function
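The sketch below illustrates why median() cannot be combined from per-partition results the way sum() can. It uses Python's standard `statistics.median`; the data values are hypothetical.

```python
from statistics import median

partitions = [[1, 2, 9], [3, 4, 5, 6]]
whole = [x for part in partitions for x in part]

true_median = median(whole)  # computed on the entire data set
median_of_medians = median([median(part) for part in partitions])

print(true_median, median_of_medians)
```

The two results differ, so a single stored median per partition is not enough; in general the partition values themselves are needed to get the exact answer.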
Note
• Most large data cube applications require efficient computation of distributive and
algebraic measures.
• Many efficient techniques for this exist.
• It is challenging to compute holistic measures efficiently.
• Efficient techniques to approximate the computation of some holistic measures do
exist.
• Several OLAP data cube operations exist to materialize these different views,
allowing interactive querying and analysis of the data at hand.
• OLAP provides a user-friendly environment for interactive data analysis
• OLAP operations are
1. Roll-up
2. Drill-down
3. Slice and dice
4. Pivot (rotate)
Example
Data Cube represents: AllElectronics sales
Dimensions: location, time, and item
Where
• location is aggregated with respect to city values
• time is aggregated with respect to quarters
• item is aggregated with respect to item types
Measure: dollars sold
Roll-up:
• The roll-up operation (also called drill-up) performs aggregation on a data cube, either
by climbing up a concept hierarchy for a dimension or by dimension reduction.
• Below is the result of a roll-up operation performed on the central cube by climbing
up the concept hierarchy for the location.
• This hierarchy was defined as the total order “street < city < province or state <
country.”
• The roll-up operation shown aggregates the data by ascending the location
hierarchy from the city level to the country level.
• When roll-up is performed by dimension reduction, one or more dimensions are
removed from the given cube.
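Both forms of roll-up can be sketched on a tiny cube stored as a dictionary. The cell values, the city-to-country mapping, and the function names are hypothetical and serve only as an illustration.

```python
from collections import defaultdict

# Cells keyed by (city, quarter, item type); values are dollars sold (hypothetical).
cube = {
    ("Vancouver", "Q1", "phone"): 14.0,
    ("Toronto", "Q1", "phone"): 20.0,
    ("Chicago", "Q1", "phone"): 30.0,
    ("New York", "Q1", "phone"): 40.0,
    ("Vancouver", "Q2", "phone"): 10.0,
}
city_to_country = {
    "Vancouver": "Canada", "Toronto": "Canada",
    "Chicago": "USA", "New York": "USA",
}

def roll_up_location(cells, mapping):
    """Roll-up by climbing the location hierarchy from city to country."""
    result = defaultdict(float)
    for (city, quarter, item_type), dollars in cells.items():
        result[(mapping[city], quarter, item_type)] += dollars
    return dict(result)

def remove_dimension(cells, axis):
    """Roll-up by dimension reduction: drop one key position and sum over it."""
    result = defaultdict(float)
    for key, dollars in cells.items():
        result[key[:axis] + key[axis + 1:]] += dollars
    return dict(result)

by_country = roll_up_location(cube, city_to_country)
without_time = remove_dimension(cube, axis=1)
print(by_country)
print(without_time)
```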
Drill-down
• Drill-down is the reverse of roll-up.
• It navigates from less detailed data to more detailed data.
• Drill-down can be realized by either stepping down a concept hierarchy for a
dimension or introducing additional dimensions.
• The figure below shows the result of a drill-down operation performed on the central
cube by stepping down a concept hierarchy for time defined as “day < month <
quarter < year.”
• Drill-down occurs by descending the time hierarchy from the quarter to the more
detailed month level.
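Drill-down from quarter to month can be sketched as follows. The sketch assumes the month-level data is still available, since detail cannot be recovered from the quarterly totals alone. The sales figures and names are hypothetical.

```python
from collections import defaultdict

month_to_quarter = {
    "Jan": "Q1", "Feb": "Q1", "Mar": "Q1",
    "Apr": "Q2", "May": "Q2", "Jun": "Q2",
}
# Detailed (month-level) data, in dollars sold (hypothetical values).
monthly_sales = {"Jan": 150.0, "Feb": 100.0, "Mar": 150.0, "Apr": 200.0}

def roll_up_to_quarter(monthly):
    """Summarized view: one total per quarter."""
    totals = defaultdict(float)
    for month, dollars in monthly.items():
        totals[month_to_quarter[month]] += dollars
    return dict(totals)

def drill_down(quarter, monthly):
    """Step down the time hierarchy: show the months behind one quarter."""
    return {m: d for m, d in monthly.items() if month_to_quarter[m] == quarter}

quarterly_sales = roll_up_to_quarter(monthly_sales)
q1_detail = drill_down("Q1", monthly_sales)
print(quarterly_sales)
print(q1_detail)
```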
Pivot (rotate)
• Pivot (also called rotate) is a visualization operation that rotates the data axes in
view to provide an alternative data presentation.
• The figure below shows a pivot operation where the item and location axes in a 2-D slice
are rotated.
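A pivot on a 2-D slice amounts to swapping the row and column axes. In the sketch below the slice is a nested dictionary; the values and the function name `pivot` are hypothetical.

```python
# 2-D slice: rows are item types, columns are cities (hypothetical values).
slice_2d = {
    "home entertainment": {"Vancouver": 605, "Toronto": 818},
    "phone": {"Vancouver": 14, "Toronto": 43},
}

def pivot(table):
    """Rotate the axes: former columns become rows and former rows become columns."""
    rotated = {}
    for row, columns in table.items():
        for column, value in columns.items():
            rotated.setdefault(column, {})[row] = value
    return rotated

rotated = pivot(slice_2d)
print(rotated)
```

The data itself is unchanged; only the presentation differs, and pivoting twice returns the original layout.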
Source:
1. Pang-Ning Tan, Michael Steinbach, Vipin Kumar: Introduction to Data Mining, Pearson, First
impression, 2014.
2. Jiawei Han, Micheline Kamber, Jian Pei: Data Mining: Concepts and Techniques, 3rd Edition,
Morgan Kaufmann Publishers, 2012.
Question Bank
1. What is a data warehouse? Explain the three-tier architecture of the data
warehouse. [8 Marks] [Dec 2019/Jan 2020]
2. Explain the schemas of multidimensional data models. [8 Marks] [Dec
2019/Jan 2020]
3. What is a data cube measure? Explain the categorization of measures. [8 Marks]
[Dec 2019/Jan 2020]
4. Explain data cube operations with examples. [8 Marks] [Dec 2019/Jan 2020]
5. Describe a 3-tier data warehouse architecture. [6 Marks] [June/July 2019] [8 Marks]
[Aug/Sept 2020]
6. Compare OLTP and OLAP systems. [6 Marks] [June/July 2019]
7. What is a data warehouse and what are its four key features? [4 Marks] [June/July
2019]
8. Explain with suitable examples the various OLAP operations in a multidimensional
data model. [7 Marks] [June/July 2019] [8 Marks] [Aug/Sept 2020]
9. Explain the following terms with examples. [9 Marks] [June/July 2019] [8 Marks]
[Aug/Sept 2020]
a. Snowflake schema
b. Fact constellation schema
c. Star schema
10. What is data warehousing? Discuss various usage and trends in data warehousing.
[8 Marks] [Aug/Sept 2020]