Data Mining and Data Warehousing
Definitions of Data warehouse and Data mining. Application areas, pitfalls in data mining. Data
warehouse architectures. Dimensional modelling. Multidimensional aggregation queries and
view materialization. Data mining algorithms: association rule, classification and prediction,
clustering, scalable algorithms and flexible predictive modelling. Web mining. Text and data
clustering. Automated recommender systems and pattern discovery algorithms.
30h (T); C
It is common to combine some of these steps. For instance, data cleaning and data integration can be performed together as a pre-processing phase that generates a data warehouse. Likewise, data selection and data transformation can be combined, where the consolidation of the data is the result of the selection or, as in the case of data warehouses, the selection is done on transformed data.
KDD is an iterative process and can contain loops between any two steps. Once knowledge is discovered, it is presented to the user; the evaluation measures can then be enhanced, the mining further refined, new data selected or further transformed, or new data sources integrated, in order to obtain different and more appropriate results.
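As a concrete illustration of how these steps can be combined, the sketch below (Python with pandas) performs integration, cleaning, selection, and transformation as one pre-processing pipeline. The file names and column names (customers.csv, transactions.csv, income, amount) are assumed placeholders, not taken from any particular system.

```python
import pandas as pd

# Data integration: combine two hypothetical sources on a shared key.
customers = pd.read_csv("customers.csv")        # e.g. customer_id, age, income
transactions = pd.read_csv("transactions.csv")  # e.g. customer_id, amount
data = customers.merge(transactions, on="customer_id", how="inner")

# Data cleaning: fill missing incomes and drop exact duplicates.
data["income"] = data["income"].fillna(data["income"].median())
data = data.drop_duplicates()

# Data selection and transformation: keep relevant columns, derive a feature.
selected = data[["age", "income", "amount"]].copy()
selected["spend_ratio"] = selected["amount"] / selected["income"]

# The result plays the role of the consolidated, "warehouse-ready" data set
# that the mining and evaluation steps would then iterate over.
print(selected.head())
```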
OLAP is part of a spectrum of decision support tools. Unlike traditional query and report tools that describe what is in a database, OLAP goes further to answer why certain things are true. The user forms a hypothesis about a relationship and verifies it with a series of queries against the data. For example, an analyst may want to determine the factors that lead to loan defaults. He or she might initially hypothesize that people with low incomes are bad credit risks and analyse the database with OLAP to verify or disprove that assumption. If that hypothesis were not borne out by the data, the analyst might then look at high debt as the determinant of risk. If the data does not support this guess either, he or she might then try debt and income together as the best predictor of bad credit risks (Two Crows Corporation, 2005).
In other words, OLAP is used to generate a series of hypothetical patterns and relationships, and queries against the database are used to verify or disprove them. OLAP analysis is basically a deductive process. But when the number of variables to be analysed becomes large, it becomes much more difficult and time-consuming to find a good hypothesis and to analyse the database with OLAP to verify or disprove it.
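A minimal sketch of this deductive workflow is given below, assuming a small hypothetical loans table: the analyst's hypothesis (low income implies more defaults) is checked with a single aggregation query rather than learned from the data.

```python
import pandas as pd

# Hypothetical loan records: income in thousands, defaulted = 1 if the loan failed.
loans = pd.DataFrame({
    "income":    [20, 25, 60, 80, 30, 90, 15, 70],
    "defaulted": [1,   1,  0,  0,  1,  0,  1,  0],
})

# Slice the data into income bands and compare the default rate of each band.
loans["income_band"] = pd.cut(loans["income"], bins=[0, 30, 60, 100],
                              labels=["low", "medium", "high"])
print(loans.groupby("income_band", observed=True)["defaulted"].mean())
```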
Data mining is different from OLAP: rather than verifying hypothetical patterns, it uses the data itself to uncover such patterns and is basically an inductive process. For instance, suppose an analyst wants to identify the risk factors for loan default using a data mining tool. The data mining tool may discover that people with high debt and low incomes are bad credit risks; it may go further and discover a pattern the analyst had not considered, namely that age is also a determinant of risk.
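By contrast, the inductive workflow can be sketched as below: a model is fitted to hypothetical loan records and the discovered pattern (which factors matter) is read off afterwards, without the analyst stating a hypothesis first. The feature names and values are illustrative assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical applicants; columns are debt, income, age; y = 1 means defaulted.
X = np.array([[40, 20, 25], [35, 22, 30], [5, 80, 45],
              [8, 75, 50], [30, 28, 27], [4, 90, 55]])
y = np.array([1, 1, 0, 0, 1, 0])

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# The learned importances may surface a factor (e.g. age) that the analyst
# never thought to test with an OLAP query.
print(dict(zip(["debt", "income", "age"], tree.feature_importances_)))
```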
Data mining and OLAP complement each other: before acting on a discovered pattern, the analyst needs to know the financial implications of using that pattern to govern who gets credit, and an OLAP tool allows the analyst to answer these kinds of questions. OLAP is also complementary in the early stages of the knowledge discovery process.
Data mining techniques are the result of a long process of research and product development. The evolution started when business data was first stored on computers, continued with improvements in data access, and generated technologies that allow users to navigate through their data in real time. Data mining takes this evolutionary process beyond retrospective data access and navigation to prospective and proactive information delivery.
Data mining is a natural development of the increased use of computerized databases to store data and provide answers to business analysts. Traditional query and report tools have been used to describe and extract what is in a database. Data mining is ready for application in the business community because the technologies that support it are now sufficiently mature.
The data mining engine is essential to any data mining system. Its core techniques include the following:
1. Classification: This analysis is used to retrieve important and relevant information about data and metadata. This data mining method helps to classify data into different classes.
2. Clustering: Clustering analysis is a data mining technique for identifying data items that are similar to each other. This process helps to understand the differences and similarities between the data.
3. Regression: Regression analysis is the data mining method of identifying and analyzing the
relationship between variables. It is used to identify the likelihood of a specific variable, given
the presence of other variables.
4. Association Rules: This data mining technique helps to find associations between two or more items and discovers hidden patterns in the data set (see the sketch after this list).
5. Outlier detection: This data mining technique refers to the observation of data items in the dataset that do not match an expected pattern or expected behavior. It can be used in a variety of domains, such as intrusion detection, fraud or fault detection, etc. Outlier detection is also called outlier analysis or outlier mining.
6. Sequential Patterns: This data mining technique helps to discover or identify similar patterns or trends in transaction data over a certain period.
7. Prediction: Prediction uses a combination of the other data mining techniques, such as trends, sequential patterns, clustering, and classification. It analyzes past events or instances in the right sequence to predict a future event.
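As a small illustration of technique 4 above, the sketch below computes the support and confidence of simple one-item rules over a handful of made-up market baskets; the item names and the 0.6 confidence threshold are arbitrary assumptions.

```python
from itertools import combinations

baskets = [
    {"bread", "milk"}, {"bread", "butter", "milk"},
    {"bread", "butter"}, {"milk", "butter"}, {"bread", "milk", "eggs"},
]

def support(itemset):
    """Fraction of baskets containing every item in the itemset."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)

# Enumerate candidate rules a -> b and keep the confident ones.
items = sorted(set().union(*baskets))
for a, b in combinations(items, 2):
    conf = support({a, b}) / support({a}) if support({a}) else 0.0
    if conf >= 0.6:
        print(f"{a} -> {b}: support={support({a, b}):.2f}, confidence={conf:.2f}")
```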
In more applied terms, these techniques can be described as follows:
1. Tracking patterns. One of the most basic techniques in data mining is learning to recognize
patterns in your data sets. This is usually recognition of some aberration in your data happening
at regular intervals, or an ebb and flow of a certain variable over time. For example, you might
see that your sales of a certain product seem to spike just before the holidays, or notice that
warmer weather drives more people to your website.
2. Classification. Classification is a more complex data mining technique that forces you to
collect various attributes together into discernable categories, which you can then use to draw
further conclusions, or serve some function. For example, if you’re evaluating data on individual
customers’ financial backgrounds and purchase histories, you might be able to classify them as
“low,” “medium,” or “high” credit risks. You could then use these classifications to learn even
more about those customers.
4. Outlier detection. In many cases, simply recognizing the overarching pattern can’t give you a
clear understanding of your data set. You also need to be able to identify anomalies, or outliers in
your data. For example, if your purchasers are almost exclusively male, but during one strange
week in July, there’s a huge spike in female purchasers, you’ll want to investigate the spike and
see what drove it, so you can either replicate it or better understand your audience in the process.
5. Clustering. Clustering is very similar to classification, but involves grouping chunks of data
together based on their similarities. For example, you might choose to cluster different
demographics of your audience into different packets based on how much disposable income
they have, or how often they tend to shop at your store (a small clustering sketch follows this list).
6. Regression. Regression, used primarily as a form of planning and modeling, is used to identify
the likelihood of a certain variable, given the presence of other variables. For example, you could
use it to project a certain price, based on other factors like availability, consumer demand, and
competition. More specifically, regression’s main focus is to help you uncover the exact
relationship between two (or more) variables in a given data set.
7. Prediction. Prediction is one of the most valuable data mining techniques, since it’s used to
project the types of data you’ll see in the future. In many cases, just recognizing and
understanding historical trends is enough to chart a somewhat accurate prediction of what will
happen in the future. For example, you might review consumers’ credit histories and past
purchases to predict whether they’ll be a credit risk in the future.
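The clustering technique mentioned above (item 5) can be sketched as follows: hypothetical customers are grouped by disposable income and monthly store visits using k-means; all numbers are made up for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Each row: [disposable_income, visits_per_month] for one hypothetical customer.
customers = np.array([[500, 2], [520, 3], [2500, 8],
                      [2600, 9], [480, 1], [2450, 10]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)
print(kmeans.labels_)           # which cluster each customer falls into
print(kmeans.cluster_centers_)  # the "typical" customer of each cluster
```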
Data preprocessing is a data mining technique that involves transforming raw data into an understandable format. Real-world data is often incomplete, inconsistent, and/or lacking in certain behaviors or trends, and is likely to contain many errors. Data preprocessing is a proven method of resolving such issues and prepares raw data for further processing. It is used in database-driven applications such as customer relationship management and in rule-based applications. Data goes through a series of steps during preprocessing (a small sketch follows the list):
• Data Cleaning: Data is cleansed through processes such as filling in missing values, smoothing the noisy data, or resolving the inconsistencies in the data.
• Data Integration: Data with different representations are put together and conflicts within the data are resolved.
• Data Reduction: This step aims to present a reduced representation of the data in the data warehouse.
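A minimal sketch of the cleaning and reduction steps is shown below, assuming a made-up column of sensor readings: the missing value is filled, noise is smoothed with a rolling median, and the detail is reduced to a compact summary.

```python
import pandas as pd

readings = pd.DataFrame({"sensor": [10.0, 11.0, None, 55.0, 12.0, 11.5, 10.8, 11.2]})

# Data cleaning: fill the missing value, then smooth the noisy spike.
readings["sensor"] = readings["sensor"].fillna(readings["sensor"].median())
readings["smoothed"] = readings["sensor"].rolling(window=3, center=True,
                                                  min_periods=1).median()

# Data reduction: keep a reduced representation instead of every raw reading.
print(readings["smoothed"].agg(["mean", "std", "min", "max"]))
```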
Data mining is not an easy task, as the algorithms used can get very complex and data is not always available in one place; it needs to be integrated from various heterogeneous data sources.
Qualitative Attributes
1. Nominal Attributes – related to names: The values of a nominal attribute are names of things or some kind of symbols. Values of nominal attributes represent some category or state, which is why nominal attributes are also referred to as categorical attributes; there is no order (rank, position) among the values of a nominal attribute.
Example: hair colour (black, brown, grey) or marital status (single, married, divorced).
2. Binary Attributes: A binary attribute has only two values or states, for example yes or no, affected or unaffected, true or false.
i) Symmetric: both values are equally important (e.g., gender).
ii) Asymmetric: the two values are not equally important (e.g., a medical test result, where the positive outcome carries more weight).
3. Ordinal Attributes: Ordinal attributes contain values that have a meaningful sequence or ranking (order) between them, but the magnitude between values is not actually known; the order of values shows what is important but does not indicate how important it is.
Quantitative Attributes
1. Numeric: A numeric attribute is quantitative because it is a measurable quantity, represented in integer or real values. Numeric attributes are of two types: interval and ratio.
i) An interval-scaled attribute has values whose differences are interpretable, but the attribute does not have a true reference point, or zero point. Data on an interval scale can be added and subtracted but cannot be meaningfully multiplied or divided. Consider the example of temperature in degrees Centigrade: if the temperature of one day is twice that of another day, we cannot say that the first day is twice as hot as the other.
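The sketch below contrasts these attribute types in pandas, using made-up column names and values: the nominal column has no order, the ordinal column is declared with an explicit order, and the interval-scaled column supports differences but not meaningful ratios.

```python
import pandas as pd

df = pd.DataFrame({
    "occupation": ["teacher", "engineer", "teacher"],  # nominal: names only
    "employed":   [True, False, True],                 # binary
    "size":       ["small", "large", "medium"],        # ordinal: ranked values
    "temp_c":     [21.5, 30.0, 15.0],                  # interval-scaled numeric
})

# Declare the meaningful order of the ordinal attribute explicitly.
df["size"] = pd.Categorical(df["size"],
                            categories=["small", "medium", "large"], ordered=True)

print(df["size"].max())            # 'large': ordering is meaningful for ordinals
print(df["occupation"].nunique())  # for nominals we can only count categories
print(df["temp_c"].diff())         # differences of interval values are interpretable
```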
Data warehousing is the process of constructing and using the data warehouse. A data warehouse
is constructed by integrating the data from multiple heterogeneous sources. It supports analytical
reporting, structured and/or ad hoc queries, and decision making.
Subject-Oriented: A data warehouse can be used to analyze a particular subject area. For
example, "sales" can be a particular subject.
Integrated: A data warehouse integrates data from multiple data sources. For example, source A
and source B may have different ways of identifying a product, but in a data warehouse, there
will be only a single way of identifying a product.
Time-Variant: Historical data is kept in a data warehouse. For example, one can retrieve data
from 3 months, 6 months, 12 months, or even older data from a data warehouse. This contrasts
with a transaction system, where often only the most recent data is kept. For example, a transaction system may hold the most recent address of a customer, whereas a data warehouse can hold all addresses associated with a customer.
Non-volatile: Once data is in the data warehouse, it will not change. So, historical data in a data
warehouse should never be altered.
Data warehousing involves data cleaning, data integration, and data consolidation. To integrate
heterogeneous databases, we have the following two approaches:
Query-Driven Approach
This is the traditional approach to integrate heterogeneous databases. This approach is used to
build wrappers and integrators on top of multiple heterogeneous databases. These integrators are
also known as mediators.
Process of Query Driven Approach
1. When a query is issued at the client side, a metadata dictionary translates the query into queries appropriate for the individual heterogeneous sites involved.
2. Now these queries are mapped and sent to the local query processor.
3. The results from the heterogeneous sites are integrated into a global answer set, as sketched below.
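A minimal sketch of this mediator/wrapper flow is given below; the wrapper functions, their return values, and the query string are all hypothetical stand-ins for real heterogeneous sources.

```python
def wrapper_sql(query):
    """Stand-in wrapper for an SQL source: returns its local partial answer."""
    return [{"product": "P1", "sales": 120}]

def wrapper_flatfile(query):
    """Stand-in wrapper for a flat-file source."""
    return [{"product": "P1", "sales": 80}, {"product": "P2", "sales": 30}]

def mediator(query, wrappers):
    # Steps 1-2: translate/dispatch the query to each local query processor.
    partial_results = [wrapper(query) for wrapper in wrappers]
    # Step 3: integrate the local answers into a single global answer set.
    combined = {}
    for rows in partial_results:
        for row in rows:
            combined[row["product"]] = combined.get(row["product"], 0) + row["sales"]
    return combined

print(mediator("total sales by product", [wrapper_sql, wrapper_flatfile]))
```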
Disadvantages
This approach has the following disadvantages:
The Query Driven Approach needs complex integration and filtering processes.
It is very inefficient and very expensive for frequent queries.
Update-Driven Approach
Today's data warehouse systems follow an update-driven approach rather than the traditional
approach discussed earlier. In the update-driven approach, the information from multiple
heterogeneous sources is integrated in advance and stored in a warehouse. This information is
available for direct querying and analysis.
Advantages
This approach has the following advantages:
This approach provides high performance.
The data can be copied, processed, integrated, annotated, summarized and restructured in
the semantic data store in advance.
Query processing does not require an interface with the processing at local sources.
• A data warehouse is a database, which is kept separate from the organization's operational
database.
• There is no frequent updating done in a data warehouse.
• It possesses consolidated historical data, which helps the organization to analyze its business.
• A data warehouse helps executives to organize, understand, and use their data to take strategic
decisions.
A data warehouse is kept separate from operational databases for the following reasons:
• An operational database is constructed for well-known tasks and workloads such as searching for particular records, indexing, etc. In contrast, data warehouse queries are often complex and they present a general form of data.
• An operational database query allows read and modify operations, while an OLAP query needs only read-only access to the stored data.
• An operational database maintains current data. On the other hand, a data warehouse maintains
historical data.
Note − A data warehouse does not require transaction processing, recovery, and concurrency controls, because it is physically stored separately from the operational database.
Data Warehouse Applications
As discussed before, a data warehouse helps business executives to organize, analyze, and use their data for decision making. A data warehouse serves as a sole part of a plan-execute-assess "closed-loop" feedback system for enterprise management. Data warehouses are widely used in fields such as financial services, banking services, consumer goods, retail sectors, and controlled manufacturing.
Data Warehouse Design Process: A data warehouse can be built using a top-down approach, a
bottom-up approach, or a combination of both.
The top-down approach starts with the overall design and planning. It is useful in cases where
the technology is mature and well known, and where the business problems that must be solved
are clear and well understood.
The bottom-up approach starts with experiments and prototypes. This is useful in the early
stage of business modeling and technology development. It allows an organization to move
forward at considerably less expense and to evaluate the benefits of the technology before
making significant commitments.
In the combined approach, an organization can exploit the planned and strategic nature of the
top-down approach while retaining the rapid implementation and opportunistic application of the
bottom-up approach.
High quality of data in data warehouses - Data mining tools require integrated, consistent, and cleaned data, and the preprocessing steps that produce such data are very costly. Data warehouses constructed by such preprocessing are valuable sources of high-quality data for OLAP and for data mining as well.
Online selection of data mining functions - Integrating OLAP with multiple data mining functions and online analytical mining provides users with the flexibility to select desired data mining functions and to swap data mining tasks dynamically.
Production applications such as payroll, accounts payable, product purchasing, and inventory control are designed for online transaction processing (OLTP). Such applications gather detailed data from day-to-day operations.
Data Warehouse applications are designed to support the user ad-hoc data requirements, an
activity recently dubbed online analytical processing (OLAP). These include applications such as
forecasting, profiling, summary reporting, and trend analysis.
Production databases are updated continuously, either by hand or via OLTP applications. In contrast, a warehouse database is updated from operational systems periodically, usually during off-hours. As OLTP data accumulates in production databases, it is regularly extracted, filtered, and then loaded into a dedicated warehouse server that is accessible to users. As the warehouse is populated, it must be restructured: tables de-normalized, data cleansed of errors and redundancies, and new fields and keys added to reflect the needs of users for sorting, combining, and summarizing data.
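The periodic extract-filter-load cycle, including the de-normalization step, can be sketched as below; the table contents, column names, and load date are illustrative assumptions.

```python
import pandas as pd

# Extract: detailed OLTP rows accumulated since the last warehouse load.
orders = pd.DataFrame({"order_id": [1, 2, 3], "cust_id": [10, 11, 10],
                       "amount": [250.0, -5.0, 40.0]})
customers = pd.DataFrame({"cust_id": [10, 11], "region": ["North", "South"]})

# Filter/cleanse: drop obviously bad rows and exact duplicates.
orders = orders[orders["amount"] > 0].drop_duplicates()

# Restructure: de-normalize by joining reference data onto the detail rows,
# and add a load date so the warehouse keeps time-variant history.
fact_sales = orders.merge(customers, on="cust_id", how="left")
fact_sales["load_date"] = pd.Timestamp("2024-01-31")

# Load: in a real system this batch would be appended to the warehouse table.
print(fact_sales)
```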
Data warehouses and their architectures vary depending upon the elements of an organization's situation.
Operational System
An operational system is a term used in data warehousing to refer to a system that processes the day-to-day transactions of an organization.
Flat Files
A Flat file system is a system of files in which transactional data is stored, and every file in the
system must have a different name.
Meta Data
A set of data that defines and gives information about other data.
Metadata is used in a data warehouse for a variety of purposes, including:
Metadata summarizes necessary information about data, which can make finding and working with particular instances of data easier. For example, author, date built, date changed, and file size are examples of very basic document metadata.
Metadata is used to direct a query to the most appropriate data source.
Lightly and highly summarized data
This area of the data warehouse stores all the predefined lightly and highly summarized (aggregated) data generated by the warehouse manager.
The goal of summarized information is to speed up query performance. The summarized records are updated continuously as new information is loaded into the warehouse.
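A small sketch of this idea is shown below: detailed records are pre-aggregated into lightly and highly summarized tables (made-up regions, months, and sales figures), so that summary queries read the small tables instead of re-scanning the detail.

```python
import pandas as pd

detail = pd.DataFrame({
    "region": ["North", "North", "South", "South", "South"],
    "month":  ["Jan", "Feb", "Jan", "Jan", "Feb"],
    "sales":  [100, 120, 80, 90, 110],
})

# Lightly summarized: aggregated by region and month.
lightly_summarized = detail.groupby(["region", "month"], as_index=False)["sales"].sum()

# Highly summarized: aggregated by region only.
highly_summarized = detail.groupby("region", as_index=False)["sales"].sum()

# "Total sales per region" is now answered from the small pre-aggregated table.
print(highly_summarized)
```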
End-User Access Tools
The principal purpose of a data warehouse is to provide information to business managers for strategic decision-making. These users interact with the warehouse using end-user access tools.
The examples of some of the end-user access tools can be:
o Reporting and Query Tools
o Application Development Tools
o Executive Information Systems Tools
o Online Analytical Processing Tools
o Data Mining Tools
Data Warehouse Staging Area is a temporary location where a record from source systems is
copied.
Single-Tier Architecture
Single-tier architecture is rarely used in practice. Its purpose is to minimize the amount of data stored; to reach this goal, it removes data redundancies.
In this architecture, the only layer physically available is the source layer, and the data warehouse is virtual. This means that the data warehouse is implemented as a multidimensional view of operational data created by specific middleware, or an intermediate processing layer.
The weakness of this architecture lies in its failure to meet the requirement for separation between analytical and transactional processing. Analysis queries are applied to operational data after the middleware interprets them, so queries affect transactional workloads.
Two-Tier Architecture
The requirement for separation plays an essential role in defining the two-tier architecture for a data warehouse system.
The World Wide Web contains huge amounts of information that provides a rich source for data
mining.
The web poses great challenges for resource and knowledge discovery based on the following
observations:
The web is too huge - The size of the web is very large and rapidly increasing. This makes the web seem too huge for data warehousing and data mining.
Complexity of web pages - Web pages do not have a unifying structure. They are very complex compared to traditional text documents. There are huge numbers of documents in the digital libraries of the web, and these libraries are not arranged in any particular sorted order.
The web is a dynamic information source - The information on the web is rapidly updated. Data such as news, stock markets, weather, sports, and shopping are regularly updated.
Diversity of user communities - The user community on the web is rapidly expanding. These users have different backgrounds, interests, and usage purposes. There are more than 100 million workstations connected to the Internet, and the number is still rapidly increasing.