Module1-Question Bank With Answers (1) - 2
MODULE 1
DATA MINING AND DATA WAREHOUSING
Q. No Question & Answer Marks
1. What is a data warehouse? 4
Key features:
Subject-oriented: A data warehouse is organized around major subjects such as
customer, supplier, product, and sales. Rather than concentrating on the day-to-day
operations and transaction processing of an organization, a data warehouse focuses
on the modeling and analysis of data for decision makers. Hence, data warehouses
typically provide a simple and concise view of particular subject issues by excluding
data that are not useful in the decision support process.
Integrated: A data warehouse is usually constructed by integrating multiple
heterogeneous sources, such as relational databases, flat files, and online transaction
records. Data cleaning and data integration techniques are applied to ensure
consistency in naming conventions, encoding structures, attribute measures, and so
on.
Time-variant: Data are stored to provide information from an historic perspective
(e.g., the past 5–10 years). Every key structure in the data warehouse contains,
either implicitly or explicitly, a time element.
Nonvolatile: Nonvolatile means that previously stored data is not erased when new
data is added. A data warehouse is kept separate from the operational database, and
therefore frequent changes in the operational database are not reflected in the data
warehouse.
These tools and utilities perform data extraction, cleaning, and transformation
(e.g., to merge similar data from different sources into a unified format), as well
as load and refresh functions to update the data warehouse. The data are
extracted using application program interfaces known as gateways.
A gateway is supported by the underlying DBMS and allows client programs to
generate SQL code to be executed at a server.
Examples of gateways include ODBC (Open Database Connectivity) and
OLE DB (Object Linking and Embedding, Database) by Microsoft and
JDBC (Java Database Connectivity).
This tier also contains a metadata repository, which stores information about the
data warehouse and its contents.
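The back-end functions described above (extraction, cleaning, transformation, and load) can be sketched in a few lines. This is only an illustration under stated assumptions: the two source formats, the customer names, and the table name sales are hypothetical, and an in-memory SQLite database stands in for the warehouse.

```python
import sqlite3

# Two hypothetical sources that encode the same kind of fact differently.
source_a = [{"cust": "Alice ", "amount_usd": "120.50"},
            {"cust": "bob", "amount_usd": "80"}]
source_b = [("ALICE", 4000), ("Carol", 2550)]  # (name, amount in cents)

def extract_and_transform():
    """Yield rows in one unified format: (customer, amount in dollars)."""
    for rec in source_a:
        # Cleaning: trim whitespace and normalize the naming convention.
        yield rec["cust"].strip().title(), float(rec["amount_usd"])
    for name, cents in source_b:
        # Transformation: convert cents to dollars to match source_a.
        yield name.strip().title(), cents / 100.0

def load(conn, rows):
    """Load the unified rows into the (hypothetical) warehouse table."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (customer TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(conn, extract_and_transform())

# After integration, "Alice " and "ALICE" are treated as the same customer.
totals = dict(conn.execute(
    "SELECT customer, SUM(amount) FROM sales GROUP BY customer"))
```

The point of the sketch is the unified format: once names and units agree, rows from both sources can be summarized together.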
Tier-3:
The top tier is a front-end client layer, which contains query and reporting tools,
analysis tools, and/or data mining tools (e.g., trend analysis, prediction, and so
on).
The major task of online operational database systems is to perform online transaction
and query processing. These systems are called online transaction processing (OLTP)
systems. They cover most of the day-to-day operations of an organization such as
purchasing, inventory, manufacturing, banking, payroll, registration, and accounting.
Data warehouse systems, on the other hand, serve users or knowledge workers in the role
of data analysis and decision making. Such systems can organize and present data in
various formats in order to accommodate the diverse needs of different users. These
systems are known as Online Analytical Processing (OLAP) systems. The major
distinguishing features of OLTP and OLAP are summarized as follows:
Users and system orientation: An OLTP system is customer-oriented and is used
for transaction and query processing by clerks, clients, and information technology
professionals. An OLAP system is market-oriented and is used for data analysis by
knowledge workers, including managers, executives, and analysts.
Data contents: An OLTP system manages current data that, typically, are too
detailed to be easily used for decision making. An OLAP system manages large
amounts of historic data, provides facilities for summarization and aggregation, and
stores and manages information at different levels of granularity. These features
make the data easier to use for informed decision making.
View: An OLTP system focuses mainly on the current data within an enterprise or
department, without referring to historic data or data in different organizations. In
contrast, an OLAP system often spans multiple versions of a database schema, due
to the evolutionary process of an organization. OLAP systems also deal with
information that originates from different organizations, integrating information
from many data stores. Because of their huge volume, OLAP data are stored on
multiple storage media.
Access patterns: The access patterns of an OLTP system consist mainly of short,
atomic transactions. Such a system requires concurrency control and recovery
mechanisms. However, accesses to OLAP systems are mostly read-only operations
(because most data warehouses store historic rather than up-to-date information),
although many could be complex queries.
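The contrast in access patterns can be illustrated with a small sketch. It assumes a hypothetical account table in an in-memory SQLite database; the transfer function and its names are illustrative only and are not part of the original answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance REAL)")
conn.executemany("INSERT INTO account VALUES (?, ?)", [(1, 100.0), (2, 50.0)])
conn.commit()

def transfer(conn, src, dst, amount):
    """OLTP-style access: a short, atomic read-write transaction."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute("UPDATE account SET balance = balance - ? WHERE id = ?",
                     (amount, src))
        (bal,) = conn.execute("SELECT balance FROM account WHERE id = ?",
                              (src,)).fetchone()
        if bal < 0:
            raise ValueError("insufficient funds")
        conn.execute("UPDATE account SET balance = balance + ? WHERE id = ?",
                     (amount, dst))

transfer(conn, 1, 2, 30.0)
try:
    transfer(conn, 1, 2, 500.0)  # fails, so both updates are undone together
except ValueError:
    pass

# OLAP-style access: a read-only query that summarizes many rows.
total, average = conn.execute(
    "SELECT SUM(balance), AVG(balance) FROM account").fetchone()
```

The transaction touches a few rows and must be recoverable as a unit; the analytical query changes nothing but reads the whole table.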
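Database design: An OLTP system usually adopts an entity-relationship (ER) data
model and an application-oriented database design. An OLAP system typically
adopts either a star or a snowflake model (see Section 4.2.2) and a subject-oriented
database design.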
A major reason for keeping a separate database for decision makers is to help promote
the high performance of both systems. An operational database is designed and tuned
from known tasks and workloads like indexing and hashing using primary keys,
searching for particular records, and optimizing "canned" queries.
On the other hand, data warehouse queries are often complex. They involve the
computation of large data groups at summarized levels, and may require the use of
special data organization, access, and implementation methods based on
multidimensional views. Processing OLAP queries in operational databases would
substantially degrade the performance of operational tasks.
Operational metadata, which include data lineage (history of migrated data and
the sequence of transformations applied to it), currency of data (active, archived,
or purged), and monitoring information (warehouse usage statistics, error
reports, and audit trails).
The algorithms used for summarization, which include measure and dimension
2. Data mart:
A data mart contains a subset of corporate-wide data that is of value to a specific
group of users. The scope is confined to specific selected subjects. For example,
a marketing data mart may confine its subjects to customer, item, and sales. The
data contained in data marts tend to be summarized.
Data marts are usually implemented on low-cost departmental servers that are
UNIX/LINUX- or Windows-based. The implementation cycle of a data mart is
more likely to be measured in weeks rather than months or years. However, it
may involve complex integration in the long run if its design and planning were
not enterprise-wide.
3. Virtual warehouse:
A virtual warehouse is a set of views over operational databases. For efficient
Star schema: The most common modeling paradigm is the star schema, in which
the data warehouse contains (1) a large central table (fact table) containing the bulk of
the data, with no redundancy, and (2) a set of smaller attendant tables
(dimension tables), one for each dimension. The schema graph resembles a starburst,
with the dimension tables displayed in a radial pattern around the central fact table.
Star schema: A fact table in the middle connected to a set of dimension tables
Snowflake schema: A refinement of star schema where some dimensional
hierarchy is normalized into a set of smaller dimension tables, forming a shape
similar to snowflake
Fact constellations: Multiple fact tables share dimension tables, viewed as a
collection of stars, therefore called galaxy schema or fact constellation
Star Schema: The most common modeling paradigm is the star schema, in which the
data warehouse contains (1) a large central table (fact table) containing the bulk of the
data, with no redundancy, and (2) a set of smaller attendant tables (dimension tables),
one for each dimension. The schema graph resembles a starburst, with the dimension
tables displayed in a radial pattern around the central fact table.
Snowflake schema: The snowflake schema is a variant of the star schema model,
where some dimension tables are normalized, thereby further splitting the data into
additional tables. The resulting schema graph forms a shape similar to a snowflake.
Fact constellation: Sophisticated applications may require multiple fact tables to share
dimension tables. This kind of schema can be viewed as a collection of stars, and hence
is called a galaxy schema or a fact constellation.
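A star schema can be sketched with one fact table and two dimension tables. The table names, column names, and rows below are hypothetical and chosen only to mirror the item and location dimensions used elsewhere in these notes; SQLite is used purely for convenience.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE item_dim     (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT, type TEXT);
CREATE TABLE location_dim (location_key INTEGER PRIMARY KEY, city TEXT, country TEXT);
-- Fact table: one key per dimension plus the numeric measures.
CREATE TABLE sales_fact (
    item_key     INTEGER REFERENCES item_dim(item_key),
    location_key INTEGER REFERENCES location_dim(location_key),
    dollars_sold REAL,
    units_sold   INTEGER
);
INSERT INTO item_dim VALUES (1, 'Phone X', 'Acme', 'Mobile'), (2, 'Modem Y', 'Acme', 'Modem');
INSERT INTO location_dim VALUES (1, 'Vancouver', 'Canada'), (2, 'Chicago', 'USA');
INSERT INTO sales_fact VALUES (1, 1, 500.0, 5), (2, 1, 200.0, 4), (1, 2, 300.0, 3);
""")

# A typical star join: the fact table is joined to each dimension table
# and the measure is aggregated by dimension attributes.
rows = conn.execute("""
    SELECT l.country, i.type, SUM(f.dollars_sold)
    FROM sales_fact f
    JOIN item_dim i     ON f.item_key = i.item_key
    JOIN location_dim l ON f.location_key = l.location_key
    GROUP BY l.country, i.type
    ORDER BY l.country, i.type
""").fetchall()
```

In a snowflake variant of this sketch, the country column would move out of location_dim into its own table referenced by a key; in a fact constellation, a second fact table would reuse the same dimension tables.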
The figure shows the result of a roll-up operation. Here, the data is grouped into
countries rather than cities.
When roll-up is performed by dimension reduction, one or more dimensions are
removed from the given cube.
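Both forms of roll-up can be sketched on a tiny cube held in a dictionary. The sales figures and the city-to-country mapping are hypothetical; the function names are illustrative only.

```python
from collections import defaultdict

# Hypothetical base data: (city, quarter) -> dollars sold.
sales = {("Vancouver", "Q1"): 100, ("Toronto", "Q1"): 150,
         ("Chicago", "Q1"): 200, ("New York", "Q1"): 250,
         ("Vancouver", "Q2"): 120, ("Chicago", "Q2"): 180}
country_of = {"Vancouver": "Canada", "Toronto": "Canada",
              "Chicago": "USA", "New York": "USA"}

def roll_up_location(cube):
    """Roll-up by climbing the location hierarchy from city to country."""
    out = defaultdict(int)
    for (city, quarter), dollars in cube.items():
        out[(country_of[city], quarter)] += dollars
    return dict(out)

def drop_time(cube):
    """Roll-up by dimension reduction: the time dimension is removed."""
    out = defaultdict(int)
    for (city, _quarter), dollars in cube.items():
        out[city] += dollars
    return dict(out)

by_country = roll_up_location(sales)
by_city = drop_time(sales)
```

In both cases the result has fewer cells than the input, because several detailed cells are summed into one.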
DRILL DOWN
o This is like zooming in on the data and is therefore the reverse of roll-up. Drill-
down can be realized by either stepping down a concept hierarchy for a
dimension or introducing additional dimensions.
o This is an appropriate operation when the user needs further details or wants to
focus on some particular values of certain dimensions. This adds more details to
the data.
o Initially the concept hierarchy was "day < month < quarter < year."
o On drill-down, the time dimension is descended from the level quarter to the level
of month as shown in the figure.
o Since a drill-down adds more detail to the given data, it can also be performed by
adding new dimensions to a cube.
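Drill-down along the time hierarchy can be sketched as follows. The monthly figures are hypothetical; the sketch assumes the finer (monthly) level is what is actually stored, since detail cannot be recovered from quarterly totals alone.

```python
from collections import defaultdict

# Hypothetical monthly base data: month -> dollars sold.
monthly = {"Jan": 40, "Feb": 35, "Mar": 25, "Apr": 50, "May": 30, "Jun": 20}
quarter_of = {"Jan": "Q1", "Feb": "Q1", "Mar": "Q1",
              "Apr": "Q2", "May": "Q2", "Jun": "Q2"}

def quarterly_view(data):
    """The summarized view the user starts from (level: quarter)."""
    out = defaultdict(int)
    for month, dollars in data.items():
        out[quarter_of[month]] += dollars
    return dict(out)

def drill_down(data, quarter):
    """Step down the hierarchy: show the months that make up one quarter."""
    return {m: v for m, v in data.items() if quarter_of[m] == quarter}

summary = quarterly_view(monthly)
q1_detail = drill_down(monthly, "Q1")
```

The monthly values returned for a quarter always add up to that quarter's total in the summarized view.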
The dice operation defines a subcube by performing a selection on two or more
dimensions.
The figure shows a dice operation on the cube based on the following selection
criteria that involve three dimensions.
(location = "Toronto" or "Vancouver")
(time = "Q1" or "Q2")
(item = "Mobile" or "Modem").
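The selection criteria above can be applied to a small cube as a sketch. The cell values are hypothetical, and the dice function is an illustrative helper rather than any standard API.

```python
# Hypothetical cube: (location, time, item) -> dollars sold.
cube = {
    ("Toronto", "Q1", "Mobile"): 605,
    ("Toronto", "Q2", "Modem"): 400,
    ("Vancouver", "Q1", "Modem"): 825,
    ("Vancouver", "Q3", "Mobile"): 300,  # excluded: time is Q3
    ("Chicago", "Q1", "Mobile"): 440,    # excluded: location is Chicago
    ("Toronto", "Q1", "Phone"): 14,      # excluded: item is Phone
}

def dice(cube, locations, times, items):
    """Keep only the cells that satisfy the selection on all three dimensions."""
    return {key: value for key, value in cube.items()
            if key[0] in locations and key[1] in times and key[2] in items}

subcube = dice(cube,
               locations={"Toronto", "Vancouver"},
               times={"Q1", "Q2"},
               items={"Mobile", "Modem"})
```

The result is still a three-dimensional cube, only smaller; no dimension is removed by a dice.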
PIVOT OR ROTATE
• This is used when the user wishes to re-orient the view of the data cube.
• This may involve
In general terms, dimensions are the perspectives or entities with respect to which an
organization wants to keep records. For example, AllElectronics may create a sales data
warehouse in order to keep records of the store’s sales with respect to the dimensions
time, item, branch, and location. These dimensions allow the store to keep track of
things like monthly sales of items and the branches and locations at which the items were
sold. Each dimension may have a table associated with it, called a dimension table,
which further describes the dimension. For example, a dimension table for item may
contain the attributes item name, brand, and type. Dimension tables can be specified by
users or experts, or automatically generated and adjusted based on data distributions.
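A multidimensional data model is typically organized around a central theme, such as
sales. This theme is represented by a fact table. Facts are numeric measures. Examples
of facts for a sales data warehouse include dollars sold (sales amount in
dollars), units sold (number of units sold), and amount budgeted. The fact table contains
the names of the facts, or measures, as well as keys to each of the related dimension
tables.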
Although we usually think of cubes as 3-D geometric structures, in data warehousing the
data cube is n-dimensional. To gain a better understanding of data cubes and the
multidimensional data model, let’s start by looking at a simple 2-D data cube that is, in
fact, a table or spreadsheet for sales data from AllElectronics. In particular, we will look
at the AllElectronics sales data for items sold per quarter in the city of Vancouver. These
data are shown in Table 4.2.
In this 2-D representation, the sales for Vancouver are shown with respect to the time
dimension (organized in quarters) and the item dimension (organized according to the
types of items sold). The fact or measure displayed is dollars sold (in thousands).
Now, suppose that we would like to view the sales data with a third dimension. For
instance, suppose we would like to view the data according to time and item, as well as
location, for the cities Chicago, New York, Toronto, and Vancouver. These 3-D data
are shown in Table 4.3. The 3-D data in the table are represented as a series of 2-D tables.
Conceptually, we may also represent the same data in the form of a 3-D data cube, as in
Figure 1.3.
Suppose that we would now like to view our sales data with an additional fourth
dimension such as supplier. Viewing things in 4-D becomes tricky. However, we can
think of a 4-D cube as being a series of 3-D cubes, as shown in Figure 1.4. If we
continue in this way, we may display any n-dimensional data as a series of
(n-1)-dimensional "cubes." The data cube is a metaphor for multidimensional data storage.
The actual physical storage of such data may differ from its logical representation. The
important thing to remember is that data cubes are n-dimensional and do not confine data
to 3-D.
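The idea of viewing n-dimensional data as a series of (n-1)-dimensional cubes can be sketched with a dictionary keyed by tuples. The cell values are hypothetical, and as_series_of_tables is an illustrative helper name.

```python
from collections import defaultdict

# Hypothetical 3-D cube: (time, item, location) -> dollars sold.
cube3d = {("Q1", "Mobile", "Vancouver"): 10,
          ("Q1", "Modem", "Vancouver"): 20,
          ("Q1", "Mobile", "Toronto"): 30,
          ("Q2", "Mobile", "Toronto"): 40}

def as_series_of_tables(cube, axis):
    """Split an n-D cube into (n-1)-D cubes, one per value on the given axis."""
    tables = defaultdict(dict)
    for key, value in cube.items():
        rest = key[:axis] + key[axis + 1:]  # the key with that axis removed
        tables[key[axis]][rest] = value
    return dict(tables)

# One 2-D (time, item) table per city.
tables = as_series_of_tables(cube3d, axis=2)
```

The same function works unchanged on a 4-D cube with a supplier dimension, which would then yield one 3-D cube per supplier.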
Figure 1.3 A 3-D data cube representation of the data in Table 4.3, according to time, item,
and location.
Tables 4.2 and 4.3 show the data at different degrees of summarization. In the data
warehousing research literature, a data cube like those shown in Figures 1.3 and 1.4 is
often referred to as a cuboid. Given a set of dimensions, we can generate a cuboid for
each of the possible subsets of the given dimensions. The result would form a lattice of
cuboids, each showing the data at a different level of summarization, or group-by. The
lattice of cuboids is then referred to as a data cube. Figure 1.5 shows a lattice of cuboids
forming a data cube for the dimensions time, item, location, and supplier.
Figure 1.4 A 4-D data cube representation of sales data, according to time, item, location,
and supplier. The measure displayed is dollars sold (in thousands). For improved
readability, only some of the cube values are shown.
Figure 1.5 Lattice of cuboids, making up a 4-D data cube for time, item, location, and
supplier. Each cuboid represents a different degree of summarization
16 What is a data cube? Explain Dimension, Dimension table, Fact and fact table with example. 4
OR
Define data cube in your own words.
A data cube allows data to be modeled and viewed in multiple dimensions. It is defined
by dimensions and facts.
In general terms, dimensions are the perspectives or entities with respect to which an
organization wants to keep records. For example, AllElectronics may create a sales data
warehouse in order to keep records of the store’s sales with respect to the dimensions
time, item, branch, and location. These dimensions allow the store to keep track of
things like monthly sales of items and the branches and locations at which the items were
sold. Each dimension may have a table associated with it, called a dimension table,
which further describes the dimension. For example, a dimension table for item may
contain the attributes item name, brand, and type. Dimension tables can be specified by
users or experts, or automatically generated and adjusted based on data distributions.
A multidimensional data model is typically organized around a central theme, such as
sales. This theme is represented by a fact table. Facts are numeric measures. Think of
them as the quantities by which we want to analyze relationships between dimensions.
Examples of facts for a sales data warehouse include dollars sold (sales amount in
dollars), units sold (number of units sold), and amount budgeted. The fact table contains
the names of the facts, or measures, as well as keys to each of the related dimension
tables.
17 Explain how organizations are using data warehouse. 6
Data warehousing is also very useful from the point of view of heterogeneous database
integration. Organizations typically collect diverse kinds of data and maintain large
databases from multiple, heterogeneous, autonomous, and distributed information
sources. It is highly desirable, yet challenging, to integrate such data and provide easy
and efficient access to it. Much effort has been spent in the database industry and
research community toward achieving this goal.
18 With a neat figure explain the recommended approach for the data warehouse development. 8
A recommended method for the development of data warehouse systems is to implement
the warehouse in an incremental and evolutionary manner, as shown in Figure 4.2.
First, a high-level corporate data model is defined within a reasonably short period (such
as one or two months) that provides a corporate-wide, consistent, integrated view of data
among different subjects and potential usages. This high-level model, although it will
need to be refined in the further development of enterprise data warehouses and
departmental data marts, will greatly reduce future integration problems.
Second, independent data marts can be implemented in parallel with the enterprise
warehouse based on the same corporate data model set noted before.
Third, distributed data marts can be constructed to integrate different data marts via hub
servers. Finally, a multitier data warehouse is constructed where the enterprise
warehouse is the sole custodian of all warehouse data, which is then distributed to the
various dependent data marts.
A data cube allows data to be modelled and viewed in multiple dimensions. It is defined
by dimensions and facts. In general terms, dimensions are the perspectives or entities
with respect to which an organization wants to keep records. For example, AllElectronics
may create a sales data warehouse in order to keep records of the store’s sales with
respect to the dimensions time, item, branch, and location. These dimensions allow the
store to keep track of things like monthly sales of items and the branches and locations at
which the items were sold. Each dimension may have a table associated with it, called a
dimension table, which further describes the dimension. For example, a dimension table
for item may contain the attributes item name, brand, and type. Dimension tables can be
specified by users or experts, or automatically generated and adjusted based on data
distributions.
Although we usually think of cubes as 3-D geometric structures, in data warehousing the
data cube is n-dimensional. To gain a better understanding of data cubes and the
multidimensional data model, let’s start by looking at a simple 2-D data cube that is, in
fact, a table or spreadsheet for sales data from AllElectronics. In particular, we will look
at the AllElectronics sales data for items sold per quarter in the city of Vancouver. These
data are shown in Table 4.2. In this 2-D representation, the sales for Vancouver are
shown with respect to the time dimension (organized in quarters) and the item dimension
(organized according to the types of items sold). The fact or measure displayed is dollars
sold (in thousands).
Now, suppose that we would like to view the sales data with a third dimension. For
instance, suppose we would like to view the data according to time and item, as well as
location, for the cities Chicago, New York, Toronto, and Vancouver. These 3-D data are
shown in Table 4.3. The 3-D data in the table are represented as a series of 2-D tables.
Conceptually, we may also represent the same data in the form of a 3-D data cube, as in
Figure 4.3.
Suppose that we would now like to view our sales data with an additional fourth
dimension such as supplier. Viewing things in 4-D becomes tricky. However, we can
think of a 4-D cube as being a series of 3-D cubes, as shown in Figure 4.4.
20 What are cuboids? Explain the lattice of cuboids for 4-D data cube. 10
In the data warehousing research literature, a data cube is often referred to as a cuboid.
Given a set of dimensions, we can generate a cuboid for each of the possible subsets of
the given dimensions. The result would form a lattice of cuboids, each showing the data
at a different level of summarization, or group-by. The lattice of cuboids is then referred
to as a data cube.
The cube created at the lowest abstraction level is referred to as the base cuboid. The
base cuboid should correspond to an individual entity of interest such as sales or
customer. In other words, the lowest level should be usable, or useful for the analysis. A
cube at the highest level of abstraction is the apex cuboid.
For the sales data in Figure 3.11, the apex cuboid would give one total—the total sales
for all three years, for all item types, and for all branches. Data cubes created for varying
levels of abstraction are often referred to as cuboids, so that a data cube may instead
refer to a lattice of cuboids. Each higher abstraction level further reduces the resulting
data size. When replying to data mining requests, the smallest available cuboid relevant
to the given task should be used.
Figure 1.5 Lattice of cuboids, making up a 4-D data cube for time, item, location, and
supplier. Each cuboid represents a different degree of summarization.
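The lattice for the four dimensions can be enumerated directly, since each cuboid corresponds to one subset of the dimensions. This sketch ignores concept hierarchies within each dimension, so it produces 2^4 = 16 cuboids; the function name is illustrative.

```python
from itertools import combinations

dimensions = ("time", "item", "location", "supplier")

def cuboid_lattice(dims):
    """Group the cuboids by dimensionality, from the 0-D apex to the n-D base."""
    return {k: list(combinations(dims, k)) for k in range(len(dims) + 1)}

lattice = cuboid_lattice(dimensions)
total = sum(len(level) for level in lattice.values())

apex_cuboid = lattice[0][0]                 # () : a single grand total
base_cuboid = lattice[len(dimensions)][0]   # all four dimensions, no summarization
```

Each level k holds the k-D cuboids, for example ("time", "item") among the 2-D cuboids, and each corresponds to a different group-by.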
21 Explain concept hierarchy with example. 6
A concept hierarchy defines a sequence of mappings from a set of low-level
concepts to higher-level, more general concepts.
Consider a concept hierarchy for the dimension location. City values for location
include Vancouver, Toronto, New York, and Chicago. Each city, however, can be
mapped to the province or state to which it belongs. For example, Vancouver can be
mapped to British Columbia, and Chicago to Illinois. The provinces and states can
in turn be mapped to the country (e.g., Canada or the United States) to which they
belong. These mappings form a concept hierarchy for the dimension location,
mapping a set of low-level concepts (i.e., cities) to higher-level, more general
concepts (i.e., countries). This concept hierarchy is illustrated in Figure 4.9.
Figure 4.9 A concept hierarchy for location. Due to space limitations, not all of the
hierarchy nodes are shown, indicated by ellipses between nodes
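The location hierarchy can be sketched as two lookup tables, one per step of the mapping. The entries for Toronto and New York are added here for illustration and are not taken from the answer above; generalize is a hypothetical helper.

```python
# Step 1 of the hierarchy: city -> province or state.
province_of = {"Vancouver": "British Columbia", "Toronto": "Ontario",
               "Chicago": "Illinois", "New York": "New York"}
# Step 2 of the hierarchy: province or state -> country.
country_of = {"British Columbia": "Canada", "Ontario": "Canada",
              "Illinois": "United States", "New York": "United States"}

def generalize(city, level):
    """Map a city to the requested level of the location hierarchy."""
    if level == "city":
        return city
    province = province_of[city]
    if level == "province_or_state":
        return province
    if level == "country":
        return country_of[province]
    raise ValueError(f"unknown level: {level}")

vancouver_country = generalize("Vancouver", "country")
chicago_state = generalize("Chicago", "province_or_state")
```

A roll-up on location simply applies such a mapping to every cell and sums the cells that land on the same higher-level value.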
Many concept hierarchies are implicit within the database schema. For example,
suppose that the dimension location is described by the attributes number, street,
city, province or state, zip code, and country. These attributes are related by a total
order, forming a concept hierarchy such as "street < city < province or state <
country." This hierarchy is shown in Figure 4.10(a). Alternatively, the attributes of a
dimension may be organized in a partial order, forming a lattice. An example of a
partial order for the time dimension based on the attributes day, week, month,
quarter, and year is "day < {month < quarter; week} < year." This lattice structure
is shown in Figure 4.10(b).
Figure 4.10 Hierarchical and lattice structures of attributes in warehouse dimensions: (a) a
hierarchy for location and (b) a lattice for time.
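The difference between a total order and a partial order can be sketched by storing, for each time attribute, the attributes directly above it. The dictionary encodes the lattice for time described above; is_below is an illustrative helper, not a standard function.

```python
# Direct "is generalized by" links in the time lattice:
# day < {month < quarter; week} < year
parents = {"day": {"month", "week"},
           "month": {"quarter"},
           "quarter": {"year"},
           "week": {"year"},
           "year": set()}

def is_below(a, b):
    """Return True if a < b in the partial order (b is reachable from a)."""
    stack, seen = list(parents[a]), set()
    while stack:
        node = stack.pop()
        if node == b:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(parents[node])
    return False

day_to_year = is_below("day", "year")
week_vs_month = is_below("week", "month") or is_below("month", "week")
```

Week and month are not comparable in either direction, which is what makes this a lattice rather than a single chain like the location hierarchy.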