
CSC 405 Data Mining and Data Warehousing 2 Credits

Definitions of Data warehouse and Data mining. Application areas, pit-falls in data mining. Data
warehouse architectures. Dimensional modelling. Multidimensional aggregation queries and
view materialization. Data mining algorithms: association rule, classification and prediction,
clustering, scalable algorithms and flexible predictive modelling. Web mining. Text and data
clustering. Automated recommender systems and pattern discovery algorithms.
30h (T); C

CSC405 – Data Mining and Data Warehousing


Introduction
There is a huge amount of data available in the information industry. This data is of little use until it is converted into useful information, so it is necessary to analyze this huge amount of data and extract useful information from it.
Extraction of information is not the only process we need to perform; the broader effort also involves other processes such as Data Cleaning, Data Integration, Data Transformation, Data Mining, Pattern Evaluation and Data Presentation. Once all these processes are complete, the resulting information can be used in many applications such as Fraud Detection, Market Analysis, Production Control, Science Exploration, etc.
What is Data Mining?
Data mining is defined as extracting information from huge sets of data. In other words, data mining is the procedure of mining knowledge from data. It can also be described as the process of searching for valuable information in large volumes of data. Data mining is a relatively new, powerful technology with great potential to help companies focus on the most important information in their data warehouses.
Data mining is a cooperative effort of humans and computers: humans design the databases, describe the problems and set the goals, while computers sort through the data and search for patterns that match those goals. The information or knowledge extracted in this way can be used for any of the following applications:
1. Market Analysis
2. Fraud Detection
3. Customer Retention
4. Production Control
5. Science Exploration
Data mining is not the same as KDD (Knowledge Discovery from Data); rather, data mining is a step within the KDD process.
What is Knowledge Discovery?
KDD refers to the overall process of discovering useful knowledge from data.
Some people do not differentiate data mining from knowledge discovery, while others view data mining as an essential step in the process of knowledge discovery. The term KDD was coined by Gregory Piatetsky-Shapiro in 1989 to describe the process of searching for interesting, interpretable, useful and novel patterns in data. Reflecting this conceptualization, data mining is considered by researchers to be a particular step in the larger process of Knowledge Discovery in Databases (KDD).
The knowledge discovery in databases process comprises a series of steps, in chronological order, that starts from raw data collections and ends with some form of new knowledge. These steps include data cleaning, data integration, data selection, data transformation, data mining, pattern evaluation and knowledge presentation. Here is the list of steps involved in the knowledge discovery process (a minimal code sketch of the pipeline follows the figure below):
• Data Cleaning - In this step, noise and inconsistent data are removed.
• Data Integration - In this step, multiple data sources are combined.
• Data Selection - In this step, the data relevant to the analysis task are retrieved from the database.
• Data Transformation - In this step, data are transformed or consolidated into forms appropriate for mining by performing summary or aggregation operations.
• Data Mining - In this step, intelligent methods are applied in order to extract data patterns.
• Pattern Evaluation - In this step, the extracted patterns are evaluated.
• Knowledge Presentation - In this step, the discovered knowledge is represented and presented to the user.
Figure: An abstract view of the process of knowledge discovery from data
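
To make the order of these steps concrete, here is a minimal, illustrative sketch of the pipeline in Python with pandas. The column names, values and income bands are hypothetical, not taken from any particular system.

# Minimal, illustrative sketch of the KDD steps with pandas (hypothetical data).
import pandas as pd

# Raw data from two hypothetical sources
sales = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 3, 3, 3],
    "amount":      [120.0, 80.0, 30.0, 30.0, 500.0, 450.0, None],
})
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "age":         [34, 58, 41],
    "income":      [42000, 28000, 95000],
})

# Data cleaning: remove records with missing values and exact duplicates
sales = sales.dropna().drop_duplicates()

# Data integration: combine the two sources on a shared key
data = sales.merge(customers, on="customer_id")

# Data selection: keep only the columns relevant to the analysis task
data = data[["customer_id", "income", "amount"]]

# Data transformation: aggregate to one summary row per customer
summary = data.groupby("customer_id").agg(income=("income", "first"),
                                          total_spent=("amount", "sum"))

# Data mining: a trivial "pattern" -- average spend per income band
summary["income_band"] = pd.cut(summary["income"],
                                bins=[0, 40000, 80000, float("inf")],
                                labels=["low", "mid", "high"])
pattern = summary.groupby("income_band", observed=True)["total_spent"].mean()

# Pattern evaluation and knowledge presentation
print(pattern.sort_values(ascending=False))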

It is common to combine some of these steps. For instance, data cleaning and data integration can be performed together as a pre-processing phase to generate a data warehouse. Also, data selection and data transformation can be combined, where the consolidation of the data is the result of the selection or, as in the case of data warehouses, the selection is done on transformed data.

KDD is an iterative process and can contain loops between any two steps. Once knowledge is discovered it is presented to the user; the evaluation measures can then be enhanced and the mining further refined, new data can be selected or further transformed, or new data sources can be integrated, in order to obtain different and more appropriate results.

Data Mining and OLAP


The difference between data mining and On-Line Analytical Processing (OLAP) is a very common question among data processing professionals. As we shall see, the two are different tools that can complement each other.

OLAP is part of a spectrum of decision support tools. Unlike traditional query and report tools that describe what is in a database, OLAP goes further to answer why certain things are true. The user forms a hypothesis about a relationship and verifies it with a series of queries against the data. For example, an analyst may want to determine the factors that lead to loan defaults. He or she might initially hypothesize that people with low incomes are bad credit risks and analyse the database with OLAP to verify or disprove that assumption. If that hypothesis were not borne out by the data, the analyst might then look at high debt as the determinant of risk. If the data does not support this guess either, he or she might then try debt and income together as the best predictor of bad credit risks (Two Crows Corporation, 2005).

In other words, OLAP is used to generate a series of hypothetical patterns and relationships and uses queries against the database to verify or disprove them. OLAP analysis is basically a deductive process. But when the number of variables to be analysed becomes large, it becomes much more difficult and time-consuming to find a good hypothesis and to analyse the database with OLAP to verify or disprove it.
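
To make the deductive flavour concrete, here is a hedged sketch (hypothetical loan records and column names, written with pandas rather than an actual OLAP tool) of the kind of aggregation an analyst might run to verify the "low income means bad credit risk" hypothesis.

# Hedged sketch of OLAP-style hypothesis verification (hypothetical data).
import pandas as pd

loans = pd.DataFrame({
    "income":    [18000, 22000, 35000, 41000, 60000, 75000, 90000, 120000],
    "debt":      [9000, 3000, 20000, 5000, 40000, 8000, 10000, 15000],
    "defaulted": [1, 0, 1, 0, 1, 0, 0, 0],
})

# Hypothesis: borrowers with low incomes default more often.
loans["income_band"] = pd.cut(loans["income"],
                              bins=[0, 30000, 70000, float("inf")],
                              labels=["low", "medium", "high"])
default_rate = loans.groupby("income_band", observed=True)["defaulted"].mean()
print(default_rate)   # the analyst inspects the rates to accept or reject the hypothesis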

Data mining is different from OLAP; unlike OLAP, which verifies hypothetical patterns, it uses the data itself to uncover such patterns and is basically an inductive process. For instance, suppose an analyst who wants to identify the risk factors for loan default uses a data mining tool. The data mining tool may discover that people with high debt and low incomes are bad credit risks, and it may go further and uncover a pattern the analyst had not considered, such as age also being a determinant of risk.
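
By contrast, a hedged sketch of the inductive approach hands the same kind of hypothetical loan data to a learning algorithm and lets it surface the risk factors itself. Here a scikit-learn decision tree stands in for whatever data mining tool might actually be used; the data, features and thresholds are invented for illustration.

# Hedged sketch of the inductive approach: let a model find the risk factors.
# Assumes scikit-learn is installed; the loan data below is hypothetical.
from sklearn.tree import DecisionTreeClassifier, export_text

X = [[18000, 9000, 23], [22000, 3000, 45], [35000, 20000, 29], [41000, 5000, 52],
     [60000, 40000, 31], [75000, 8000, 48], [90000, 10000, 60], [120000, 15000, 41]]
y = [1, 0, 1, 0, 1, 0, 0, 0]          # 1 = defaulted
features = ["income", "debt", "age"]

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
# The learned splits may surface factors (e.g. debt or age) the analyst never hypothesized.
print(export_text(tree, feature_names=features))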

Data mining and OLAP complement each other. Before acting on a discovered pattern, the analyst needs to know what the financial implications would be of using that pattern to govern who gets credit; an OLAP tool allows the analyst to answer these kinds of questions. OLAP is also complementary in the early stages of the knowledge discovery process.

The Evolution of Data Mining

Data mining techniques are the result of a long process of research and product development. The evolution started when business data was first stored on computers, continued with improvements in data access, and more recently has generated technologies that allow users to navigate through their data in real time. This evolutionary process takes us beyond retrospective data access and navigation to prospective and proactive information delivery.

Data mining is a natural development of the increased use of computerized databases to store data and provide answers to business questions. Traditional query and report tools have been used to describe and extract what is in a database. Data mining is ready for application in the business community because it is supported by technologies that are now sufficiently mature:

1. Massive data collection
2. Powerful multiprocessor computers
3. Data mining algorithms

Commercial databases are presently growing at an unprecedented rate, and in some sectors, such as retail, the volumes involved can be much larger still. The accompanying need for improved computational engines can now be met in a cost-effective manner with parallel multiprocessor computer technology. Data mining algorithms embody techniques that have existed for at least ten years, but have only recently been implemented as mature, reliable, understandable tools that consistently outperform older statistical methods.

Data Mining Applications


Data mining is highly useful in the following domains:
1. Market Analysis and Management
2. Corporate Analysis & Risk Management
3. Fraud Detection
4. Production Control
5. Customer Retention
6. Science Exploration
7. Sports
8. Lie Detection
9. Intrusion Detection
10. Astrology
11. Internet Web Surf-Aid
12. Customer Relationship Management
13. Future Healthcare
14. Education
15. Manufacturing Engineering

DATA MINING TECHNIQUES / Mining Engine:

The data mining engine is essential to the data mining system. The main techniques it supports include:

1. Classification: This analysis is used to retrieve important and relevant information about data and metadata. This data mining method helps to classify data into different classes.

2. Clustering: Clustering analysis is a data mining technique used to identify data items that are similar to one another. This process helps in understanding the differences and similarities between the data.

3. Regression: Regression analysis is the data mining method of identifying and analyzing the relationship between variables. It is used to estimate the likelihood of a specific variable, given the presence of other variables.

4. Association Rules: This data mining technique helps to find associations between two or more items. It discovers hidden patterns in the data set.

5. Outlier detection: This data mining technique refers to the observation of data items in the dataset which do not match an expected pattern or expected behavior. This technique can be used in a variety of domains, such as intrusion detection, fraud detection or fault detection. Outlier detection is also called Outlier Analysis or Outlier Mining.

6. Sequential Patterns: This data mining technique helps to discover or identify similar patterns or trends in transaction data over a certain period.

7. Prediction: Prediction uses a combination of the other data mining techniques, such as trend analysis, sequential patterns, clustering and classification. It analyzes past events or instances in the right sequence to predict a future event.

DATA MINING TECHNIQUES (IN DETAIL):

1. Tracking patterns. One of the most basic techniques in data mining is learning to recognize
patterns in your data sets. This is usually recognition of some aberration in your data happening
at regular intervals, or an ebb and flow of a certain variable over time. For example, you might
see that your sales of a certain product seem to spike just before the holidays, or notice that
warmer weather drives more people to your website.

2. Classification. Classification is a more complex data mining technique that forces you to
collect various attributes together into discernable categories, which you can then use to draw
further conclusions, or serve some function. For example, if you’re evaluating data on individual
customers’ financial backgrounds and purchase histories, you might be able to classify them as
“low,” “medium,” or “high” credit risks. You could then use these classifications to learn even
more about those customers.
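
As an illustration only, the following sketch classifies customers into hypothetical "low", "medium" and "high" risk labels with a k-nearest-neighbours classifier from scikit-learn; the features, labels and choice of algorithm are assumptions, not a prescribed method.

# Hedged sketch of classification (hypothetical features and labels).
# Assumes scikit-learn is installed.
from sklearn.neighbors import KNeighborsClassifier

# [annual income, number of late payments] for customers with known risk labels
X_train = [[20000, 5], [25000, 4], [50000, 1], [55000, 2], [90000, 0], [85000, 1]]
y_train = ["high", "high", "medium", "medium", "low", "low"]

clf = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)

# Classify a new customer from their attributes
print(clf.predict([[48000, 3]]))   # e.g. ['medium']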

3. Association. Association is related to tracking patterns, but is more specific to dependently linked variables. In this case, you’ll look for specific events or attributes that are highly
correlated with another event or attribute; for example, you might notice that when your
customers buy a specific item, they also often buy a second, related item. This is usually what’s
used to populate “people also bought” sections of online stores.
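
A minimal sketch of the idea, assuming a handful of made-up market baskets: counting how often pairs of items appear together is a simplified stand-in for full association rule mining (Apriori, FP-growth and similar algorithms).

# Hedged sketch of association analysis: counting item pairs bought together.
from itertools import combinations
from collections import Counter

transactions = [            # hypothetical market-basket data
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"beer", "diapers", "bread"},
    {"beer", "diapers"},
    {"bread", "butter", "jam"},
]

pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Pairs that co-occur in at least 2 of the 5 baskets ("support" >= 40%)
for pair, count in pair_counts.most_common():
    if count >= 2:
        print(pair, count / len(transactions))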

4. Outlier detection. In many cases, simply recognizing the overarching pattern can’t give you a
clear understanding of your data set. You also need to be able to identify anomalies, or outliers in
your data. For example, if your purchasers are almost exclusively male, but during one strange
week in July, there’s a huge spike in female purchasers, you’ll want to investigate the spike and
see what drove it, so you can either replicate it or better understand your audience in the process.
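
A minimal sketch of outlier detection, assuming hypothetical weekly purchase counts: values more than two standard deviations from the mean are flagged. Real systems typically use more robust methods.

# Hedged sketch of outlier detection using a simple z-score rule.
import statistics

weekly_female_purchases = [42, 38, 45, 40, 39, 44, 41, 190, 43, 37]

mean = statistics.mean(weekly_female_purchases)
stdev = statistics.stdev(weekly_female_purchases)

for week, value in enumerate(weekly_female_purchases, start=1):
    z = (value - mean) / stdev
    if abs(z) > 2:                      # flag values more than 2 standard deviations out
        print(f"week {week}: {value} looks like an outlier (z = {z:.1f})")
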
5. Clustering. Clustering is very similar to classification, but involves grouping chunks of data
together based on their similarities. For example, you might choose to cluster different
demographics of your audience into different packets based on how much disposable income
they have, or how often they tend to shop at your store.
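
As a hedged illustration, the sketch below clusters a few made-up customers by disposable income and visit frequency using scikit-learn's k-means; the data and the choice of two clusters are assumptions.

# Hedged sketch of clustering customers (hypothetical data).
# Assumes scikit-learn is installed.
from sklearn.cluster import KMeans

# [monthly disposable income, store visits per month]
customers = [[500, 2], [650, 3], [700, 2], [3000, 8], [3200, 10], [2800, 9]]

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)
print(kmeans.labels_)           # cluster id assigned to each customer
print(kmeans.cluster_centers_)  # the "typical" customer of each cluster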

6. Regression. Regression, used primarily as a form of planning and modeling, is used to identify
the likelihood of a certain variable, given the presence of other variables. For example, you could
use it to project a certain price, based on other factors like availability, consumer demand, and
competition. More specifically, regression’s main focus is to help you uncover the exact
relationship between two (or more) variables in a given data set.
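
A minimal sketch of regression under invented numbers: an ordinary least squares fit (via NumPy) relating price to demand and competition, then used to project a price for a new scenario.

# Hedged sketch of regression with hypothetical numbers (ordinary least squares).
import numpy as np

demand      = np.array([120, 150, 90, 200, 170])
competition = np.array([5, 3, 8, 2, 4])
price       = np.array([10.0, 12.5, 8.0, 15.0, 13.0])

# Design matrix with an intercept column
X = np.column_stack([np.ones_like(demand, dtype=float), demand, competition])
coef, *_ = np.linalg.lstsq(X, price, rcond=None)
intercept, b_demand, b_competition = coef
print(f"price ≈ {intercept:.2f} + {b_demand:.3f}*demand + {b_competition:.3f}*competition")

# Project a price for a new scenario
new = np.array([1.0, 160, 4])
print("projected price:", float(new @ coef))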

7. Prediction. Prediction is one of the most valuable data mining techniques, since it’s used to
project the types of data you’ll see in the future. In many cases, just recognizing and
understanding historical trends is enough to chart a somewhat accurate prediction of what will
happen in the future. For example, you might review consumers’ credit histories and past
purchases to predict whether they’ll be a credit risk in the future.

DATA PRE-PROCESSING
Definition - What does Data Preprocessing mean?

Data preprocessing is a data mining technique that involves transforming raw data into an understandable format. Real-world data is often incomplete, inconsistent, and/or lacking in certain behaviors or trends, and is likely to contain many errors. Data preprocessing is a proven method of resolving such issues: it prepares raw data for further processing. Data preprocessing is used in database-driven applications such as customer relationship management and in rule-based applications (like neural networks). Data goes through a series of steps during preprocessing (a short pandas sketch follows this list):

• Data Cleaning: Data is cleansed through processes such as filling in missing values, smoothing noisy data, or resolving inconsistencies in the data.

• Data Integration: Data with different representations are put together and conflicts within the data are resolved.

• Data Transformation: Data is normalized, aggregated and generalized.

• Data Reduction: This step aims to present a reduced representation of the data in a data warehouse.

• Data Discretization: This involves reducing the number of values of a continuous attribute by dividing its range into intervals.
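
A short pandas sketch of these steps, using hypothetical columns; data integration (merging sources) is omitted here because it was already illustrated in the KDD sketch earlier.

# Hedged sketch of preprocessing steps with pandas (hypothetical data).
import pandas as pd

raw = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4],
    "age":         [25, None, 31, 47, 52],
    "income":      [28000, 35000, 35000, None, 82000],
})

# Data cleaning: remove duplicates, fill missing values with the column median
clean = raw.drop_duplicates(subset="customer_id")
clean = clean.fillna(clean.median(numeric_only=True))

# Data transformation: min-max normalize income to the range [0, 1]
clean["income_norm"] = (clean["income"] - clean["income"].min()) / (
    clean["income"].max() - clean["income"].min())

# Data reduction: keep only the columns needed downstream
reduced = clean[["customer_id", "age", "income_norm"]]

# Data discretization: turn age into three labelled intervals
reduced = reduced.assign(age_group=pd.cut(reduced["age"], bins=3,
                                          labels=["young", "middle", "senior"]))
print(reduced)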

Why Data Pre-processing?


Data preprocessing prepares raw data for further processing. The traditional data preprocessing approach is reactive: it starts with data that is assumed to be ready for analysis, and there is no feedback to improve the way the data is collected.

ISSUES in Data mining

Data mining is not an easy task, as the algorithms used can get very complex and data is not always available in one place; it needs to be integrated from various heterogeneous data sources.

The major issues are:

1. Mining Methodology and User Interaction
2. Performance Issues
3. Diverse Data Types Issues

Each of these groups of issues is described below.

Mining Methodology and User Interaction Issues

It refers to the following kinds of issues:


1. Mining different kinds of knowledge in databases - Different users may be interested in different kinds of knowledge. Therefore, it is necessary for data mining to cover a broad range of knowledge discovery tasks.
2. Interactive mining of knowledge at multiple levels of abstraction - The data mining process needs to be interactive because this allows users to focus the search for patterns, providing and refining data mining requests based on the returned results.
3. Incorporation of background knowledge - Background knowledge can be used to guide the discovery process and to express the discovered patterns, not only in concise terms but at multiple levels of abstraction.
4. Data mining query languages and ad hoc data mining - A data mining query language that allows the user to describe ad hoc mining tasks should be integrated with a data warehouse query language and optimized for efficient and flexible data mining.
5. Presentation and visualization of data mining results - Once the patterns are discovered, they need to be expressed in high-level languages and visual representations that are easily understandable.
6. Handling noisy or incomplete data - Data cleaning methods are required to handle noise and incomplete objects while mining the data regularities. Without such methods, the accuracy of the discovered patterns will be poor.
7. Pattern evaluation - The patterns discovered should be interesting; a pattern may be uninteresting because it represents common knowledge or lacks novelty.

Performance Issues

There can be performance-related issues such as the following:


1. Efficiency and scalability of data mining algorithms - In order to effectively extract information from the huge amounts of data in databases, data mining algorithms must be efficient and scalable.
2. Parallel, distributed, and incremental mining algorithms - Factors such as the huge size of databases, the wide distribution of data, and the complexity of data mining methods motivate the development of parallel and distributed data mining algorithms. These algorithms divide the data into partitions which are processed in parallel, and the results from the partitions are then merged. Incremental algorithms incorporate database updates without having to mine the entire data again from scratch.

Diverse Data Types Issues


1. Handling of relational and complex types of data - The database may contain complex data objects, multimedia data objects, spatial data, temporal data, etc. It is not possible for one system to mine all these kinds of data.
2. Mining information from heterogeneous databases and global information systems - The data is available at different data sources on a LAN or WAN. These data sources may be structured, semi-structured or unstructured. Mining knowledge from them therefore adds further challenges to data mining.

DATA OBJECTS AND ATTRIBUTE TYPES:


When we talk about data mining, we are usually discussing knowledge discovery from data. To get to know the data, it is necessary to discuss data objects, data attributes and the types of data attributes. Mining data involves knowing about the data and finding relations between data items, and for this we need to discuss data objects and attributes.
Data objects are an essential part of a database. A data object represents an entity and is essentially a group of attributes of that entity. For example, a sales data object may represent a customer, a sale or a purchase. When data objects are stored in a database, they are referred to as data tuples.
Attribute
An attribute can be seen as a data field that represents a characteristic or feature of a data object. For a customer object, attributes can be customer ID, address, etc. A set of attributes used to describe a given object is known as an attribute vector or feature vector.
Types of attributes: Identifying attribute types is the first step of data preprocessing; we differentiate between the different types of attributes and then preprocess the data accordingly. The attribute types are:
1. Qualitative (Nominal (N), Ordinal (O), Binary (B))
2. Quantitative (Discrete, Continuous)

Qualitative Attributes
1. Nominal Attributes - related to names: The values of a nominal attribute are names of things or some kind of symbols. Values of nominal attributes represent some category or state, which is why nominal attributes are also referred to as categorical attributes; there is no order (rank or position) among the values of a nominal attribute.
Example: hair colour, with values such as black, brown and blond.
2. Binary Attributes: Binary data has only two values/states, for example yes or no, affected or unaffected, true or false.
i) Symmetric: both values are equally important (Gender).
ii) Asymmetric: both values are not equally important (Result).

3. Ordinal Attributes: Ordinal attributes contain values that have a meaningful sequence or ranking (order) between them, but the magnitude between the values is not actually known; the order of the values shows what is more important but does not indicate by how much.

Quantitative Attributes
1. Numeric: A numeric attribute is quantitative because it is a measurable quantity, represented in integer or real values. Numeric attributes are of two types: interval-scaled and ratio-scaled.
i) An interval-scaled attribute has values whose differences are interpretable, but the attribute does not have a true zero point (reference point). Data on an interval scale can be added and subtracted but cannot meaningfully be multiplied or divided. Consider the example of temperature in degrees Centigrade: if the temperature of one day is twice that of another day, we cannot say that one day is twice as hot as the other.

ii) A ratio-scaled attribute is a numeric attribute with an inherent (fixed) zero point. If a measurement is ratio-scaled, we can speak of a value as being a multiple (or ratio) of another value. The values are ordered, and we can also compute the differences between values, as well as the mean, median, mode, quantile range and five-number summary.
2. Discrete: Discrete data have finite values; they can be numerical or categorical. These attributes have a finite or countably infinite set of values.
Example: number of children, or postal codes.
3. Continuous: Continuous data have an infinite number of possible states and are typically represented as floating-point values; there can be many values between 2 and 3.
Example: height, weight or temperature.
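
As a hedged illustration, the sketch below shows how these attribute types might be declared with pandas dtypes; the records and column names are invented.

# Hedged sketch mapping the attribute types above onto pandas dtypes.
import pandas as pd

df = pd.DataFrame({
    "hair_colour":   ["black", "brown", "blond"],   # nominal
    "is_member":     [True, False, True],           # binary (symmetric)
    "grade":         ["C", "A", "B"],                # ordinal
    "temperature_c": [36.6, 37.2, 38.9],             # interval-scaled numeric
    "height_cm":     [172.0, 165.5, 180.2],          # ratio-scaled numeric (true zero)
    "num_children":  [0, 2, 1],                      # discrete
})

# Declare the ordinal attribute with an explicit, meaningful order
df["grade"] = pd.Categorical(df["grade"], categories=["A", "B", "C"], ordered=True)
df["hair_colour"] = df["hair_colour"].astype("category")   # unordered categories

print(df.dtypes)
print(df["grade"].min())   # ordering makes comparisons like min/max meaningful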

Data Warehousing and Data warehouse

Data warehousing is the process of constructing and using the data warehouse. A data warehouse
is constructed by integrating the data from multiple heterogeneous sources. It supports analytical
reporting, structured and/or ad hoc queries, and decision making.

A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision-making process.

Subject-Oriented: A data warehouse can be used to analyze a particular subject area. For
example, "sales" can be a particular subject.

Integrated: A data warehouse integrates data from multiple data sources. For example, source A
and source B may have different ways of identifying a product, but in a data warehouse, there
will be only a single way of identifying a product.

Time-Variant: Historical data is kept in a data warehouse. For example, one can retrieve data from 3 months, 6 months, 12 months, or even older, from a data warehouse. This contrasts with a transaction system, where often only the most recent data is kept. For example, a transaction system may hold only the most recent address of a customer, whereas a data warehouse can hold all addresses associated with a customer.

Non-volatile: Once data is in the data warehouse, it is not changed; historical data in a data warehouse should never be altered.

A data warehouse provides generalized and consolidated data in a multidimensional view. Along with this generalized and consolidated view of the data, a data warehouse also provides Online Analytical Processing (OLAP) tools. These tools help in the interactive and effective analysis of data in a multidimensional space.

Data warehousing involves data cleaning, data integration, and data consolidation. To integrate heterogeneous databases, we have the following two approaches:

• Query Driven Approach
• Update Driven Approach

Query-Driven Approach
This is the traditional approach to integrate heterogeneous databases. This approach is used to
build wrappers and integrators on top of multiple heterogeneous databases. These integrators are
also known as mediators.
Process of Query Driven Approach
1. When a query is issued to a client side, a metadata dictionary translates the query into the
queries, appropriate for the individual heterogeneous site involved.

2. Now these queries are mapped and sent to the local query processor.

3. The results from heterogeneous sites are integrated into a global answer set.

Disadvantages
This approach has the following disadvantages:
• The Query Driven Approach needs complex integration and filtering processes.
• It is very inefficient and very expensive for frequent queries.
• This approach is expensive for queries that require aggregations.

Update-Driven Approach
Today's data warehouse systems follow the update-driven approach rather than the traditional approach discussed earlier. In the update-driven approach, the information from multiple heterogeneous sources is integrated in advance and stored in a warehouse. This information is available for direct querying and analysis.
Advantages
This approach has the following advantages:
• This approach provides high performance.
• The data can be copied, processed, integrated, annotated, summarized and restructured in the semantic data store in advance.
• Query processing does not require an interface with processing at local sources.

Understanding a Data Warehouse

• A data warehouse is a database, which is kept separate from the organization's operational
database.
• There is no frequent updating done in a data warehouse.

• It possesses consolidated historical data, which helps the organization to analyze its business.

• A data warehouse helps executives to organize, understand, and use their data to take strategic
decisions.

• Data warehouse systems help in the integration of a diversity of application systems.

• A data warehouse system helps in consolidated historical data analysis.

Why is a Data Warehouse Kept Separate from Operational Databases?

A data warehouse is kept separate from operational databases for the following reasons:

• An operational database is constructed for well-known tasks and workloads such as searching particular records, indexing, etc. In contrast, data warehouse queries are often complex and present a general form of data.

• Operational databases support concurrent processing of multiple transactions. Concurrency control and recovery mechanisms are required for operational databases to ensure robustness and consistency of the database.

• An operational database query allows read and modify operations, while an OLAP query needs only read-only access to stored data.

• An operational database maintains current data. On the other hand, a data warehouse maintains historical data.

Note - A data warehouse does not require transaction processing, recovery, and concurrency controls, because it is physically stored separately from the operational database.

Data Warehouse Applications
As discussed before, a data warehouse helps business executives to organize, analyze, and use their data for decision making. A data warehouse serves as a sole part of a plan-execute-assess "closed-loop" feedback system for enterprise management. Data warehouses are widely used in fields such as financial services, banking, consumer goods, retail, and controlled manufacturing.

Data Warehouse Design Process: A data warehouse can be built using a top-down approach, a
bottom-up approach, or a combination of both.

The top-down approach starts with the overall design and planning. It is useful in cases where
the technology is mature and well known, and where the business problems that must be solved
are clear and well understood.

The bottom-up approach starts with experiments and prototypes. This is useful in the early
stage of business modeling and technology development. It allows an organization to move
forward at considerably less expense and to evaluate the benefits of the technology before
making significant commitments.

In the combined approach, an organization can exploit the planned and strategic nature of the
top-down approach while retaining the rapid implementation and opportunistic application of the
bottom-up approach.

The warehouse design process consists of the following steps (a small illustrative schema follows the list):

• Choose a business process to model, for example, orders, invoices, shipments, inventory, account administration, sales, or the general ledger. If the business process is organizational and involves multiple complex object collections, a data warehouse model should be followed. However, if the process is departmental and focuses on the analysis of one kind of business process, a data mart model should be chosen.
• Choose the grain of the business process. The grain is the fundamental, atomic level of data to be represented in the fact table for this process, for example, individual transactions, individual daily snapshots, and so on.
• Choose the dimensions that will apply to each fact table record. Typical dimensions are time, item, customer, supplier, warehouse, transaction type, and status.
• Choose the measures that will populate each fact table record. Typical measures are numeric additive quantities like dollars sold and units sold.
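
As a small illustrative example only (the table and column names are invented), the following sketch builds a tiny star schema with Python's standard sqlite3 module, with a fact table at the grain of one sale line item, three dimensions, and two additive measures, and then runs one multidimensional aggregation query over it.

# Hedged sketch of a tiny star schema and one aggregation query (sqlite3).
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_time    (time_id INTEGER PRIMARY KEY, day TEXT, month TEXT);
CREATE TABLE dim_item    (item_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_customer(customer_id INTEGER PRIMARY KEY, name TEXT, city TEXT);

-- Fact table at the grain of one line item per sale, with additive measures
CREATE TABLE fact_sales (
    time_id INTEGER REFERENCES dim_time(time_id),
    item_id INTEGER REFERENCES dim_item(item_id),
    customer_id INTEGER REFERENCES dim_customer(customer_id),
    units_sold INTEGER,
    dollars_sold REAL
);

INSERT INTO dim_time     VALUES (1, '2024-01-03', '2024-01'), (2, '2024-02-10', '2024-02');
INSERT INTO dim_item     VALUES (1, 'Laptop', 'Electronics'), (2, 'Desk', 'Furniture');
INSERT INTO dim_customer VALUES (1, 'Ada', 'Lagos'), (2, 'Tunde', 'Ibadan');
INSERT INTO fact_sales   VALUES (1, 1, 1, 2, 1800.0), (1, 2, 2, 1, 250.0),
                                (2, 1, 2, 1, 900.0);
""")

# Aggregate the measures along the month and category dimensions
for row in con.execute("""
    SELECT t.month, i.category, SUM(f.units_sold), SUM(f.dollars_sold)
    FROM fact_sales f
    JOIN dim_time t ON f.time_id = t.time_id
    JOIN dim_item i ON f.item_id = i.item_id
    GROUP BY t.month, i.category
"""):
    print(row)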

From Data Warehousing (OLAP) to Data Mining (OLAM)


Online Analytical Mining (OLAM) integrates On-Line Analytical Processing (OLAP) with data mining and mining knowledge in multidimensional databases.
Figure: The integration of OLAP and OLAM
Importance of OLAM

OLAM is important for the following reasons:

• High quality of data in data warehouses - Data mining tools need to work on integrated, consistent, and cleaned data. These preprocessing steps are very costly. The data warehouses constructed by such preprocessing are therefore valuable sources of high-quality data for OLAP and data mining alike.

• Available information processing infrastructure surrounding data warehouses - Information processing infrastructure refers to the accessing, integration, consolidation, and transformation of multiple heterogeneous databases, web-accessing and service facilities, and reporting and OLAP analysis tools.

• OLAP-based exploratory data analysis - Exploratory data analysis is required for effective data mining. OLAM provides facilities for data mining on various subsets of data and at different levels of abstraction.

• Online selection of data mining functions - Integrating OLAP with multiple data mining functions and online analytical mining provides users with the flexibility to select desired data mining functions and swap data mining tasks dynamically.

DATA WAREHOUSE ARCHITECTURE

A data warehouse architecture is a method of defining the overall architecture of data communication, processing, and presentation that exists for end-client computing within the enterprise. Each data warehouse is different, but all are characterized by standard vital components.

Production applications such as payroll, accounts payable, product purchasing and inventory control are designed for online transaction processing (OLTP). Such applications gather detailed data from day-to-day operations.

Data warehouse applications are designed to support users' ad hoc data requirements, an activity recently dubbed online analytical processing (OLAP). These include applications such as forecasting, profiling, summary reporting, and trend analysis.

Production databases are updated continuously, either by hand or via OLTP applications. In contrast, a warehouse database is updated from operational systems periodically, usually during off-hours. As OLTP data accumulates in production databases, it is regularly extracted, filtered, and then loaded into a dedicated warehouse server that is accessible to users. As the warehouse is populated, it must be restructured: tables are de-normalized, data is cleansed of errors and redundancies, and new fields and keys are added to reflect the needs of users for sorting, combining, and summarizing data.

Data warehouses and their architectures vary depending upon the elements of an organization's situation.

Three common architectures are:

o Data Warehouse Architecture: Basic
o Data Warehouse Architecture: With Staging Area
o Data Warehouse Architecture: With Staging Area and Data Marts
Data Warehouse Architecture: Basic

Operational System
An operational system is a method used in data warehousing to refer to a system that is used to
process the day-to-day transactions of an organization.
Flat Files
A Flat file system is a system of files in which transactional data is stored, and every file in the
system must have a different name.
Meta Data
A set of data that defines and gives information about other data.
Metadata is used in a data warehouse for a variety of purposes, including:
• Metadata summarizes necessary information about data, which can make finding and working with particular instances of data easier. For example, author, date created, date changed, and file size are examples of very basic document metadata.
• Metadata is used to direct a query to the most appropriate data source.
Lightly and highly summarized data
This area of the data warehouse stores all the predefined lightly and highly summarized (aggregated) data generated by the warehouse manager. The goal of the summarized information is to speed up query performance. The summarized records are updated continuously as new information is loaded into the warehouse.
End-User Access Tools
The principal purpose of a data warehouse is to provide information to business managers for strategic decision-making. These users interact with the warehouse using end-client access tools.
Examples of end-user access tools include:
o Reporting and Query Tools
o Application Development Tools
o Executive Information Systems Tools
o Online Analytical Processing Tools
o Data Mining Tools

Data Warehouse Architecture: With Staging Area


We must clean and process operational data before putting it into the warehouse. We can do this programmatically, although most data warehouses use a staging area instead (a place where data is processed before entering the warehouse). A staging area simplifies data cleansing and consolidation for operational data coming from multiple source systems, especially for enterprise data warehouses where all relevant data of an enterprise is consolidated.

A data warehouse staging area is a temporary location to which records from the source systems are copied.

Data Warehouse Architecture: With Staging Area and Data Marts


We may want to customize our warehouse's architecture for multiple groups within our organization. We can do this by adding data marts. A data mart is a segment of a data warehouse that provides information for reporting and analysis on a section, unit, department or operation in the company, e.g., sales, payroll, production, etc. For example, purchasing, sales, and stock can be separated into their own data marts; a financial analyst can then analyze historical data for purchases and sales, or mine historical information to make predictions about customer behavior.

Properties of Data Warehouse Architectures


1. Separation: Analytical and transactional processing should be kept apart as much as possible.
2. Scalability: Hardware and software architectures should be easy to upgrade as the data volume that has to be managed and processed, and the number of users' requirements that have to be met, progressively increase.
3. Extensibility: The architecture should be able to accommodate new operations and technologies without redesigning the whole system.
4. Security: Monitoring of accesses is necessary because of the strategic data stored in the data warehouse.
5. Administerability: Data warehouse management should not be complicated.
Types of Data Warehouse Architectures
There are mainly three types of warehouse architecture:

1. Single-tier architecture
2. Two-tier architecture
3. Three-tier architecture

Single-Tier Architecture
Single-tier architecture is rarely used in practice. Its purpose is to minimize the amount of data stored; to reach this goal, it removes data redundancies.
In this architecture, the only layer physically available is the source layer. Data warehouses are virtual: the data warehouse is implemented as a multidimensional view of operational data created by specific middleware, or an intermediate processing layer.

The vulnerability of this architecture lies in its failure to meet the requirement for separation between analytical and transactional processing. Analysis queries are applied to operational data after the middleware interprets them. In this way, queries affect transactional workloads.
Two-Tier Architecture
The requirement for separation plays an essential role in defining the two-tier architecture for a data warehouse system.

Although it is typically called a two-layer architecture to highlight the separation between physically available sources and the data warehouse, it in fact consists of four subsequent data flow stages:
1. Source layer: A data warehouse system uses heterogeneous sources of data. That data is stored initially in corporate relational databases or legacy databases, or it may come from information systems outside the corporate walls.
2. Data staging: The data stored in the sources should be extracted, cleansed to remove inconsistencies and fill gaps, and integrated to merge heterogeneous sources into one standard schema. The so-named Extraction, Transformation, and Loading (ETL) tools can combine heterogeneous schemata and extract, transform, cleanse, validate, filter, and load source data into a data warehouse.
3. Data warehouse layer: Information is saved to one logically centralized repository: the data warehouse. The data warehouse can be accessed directly, but it can also be used as a source for creating data marts, which partially replicate data warehouse contents and are designed for specific enterprise departments. Meta-data repositories store information on sources, access procedures, data staging, users, data mart schemas, and so on.
4. Analysis: In this layer, integrated data is efficiently and flexibly accessed to issue reports, dynamically analyze information, and simulate hypothetical business scenarios. It should feature aggregate information navigators, complex query optimizers, and customer-friendly GUIs.
Three-Tier Architecture
The three-tier architecture consists of the source layer (containing multiple source systems), the reconciled layer and the data warehouse layer (containing both data warehouses and data marts). The reconciled layer sits between the source data and the data warehouse.
The main advantage of the reconciled layer is that it creates a standard reference data model for the whole enterprise. At the same time, it separates the problems of source data extraction and integration from those of data warehouse population. In some cases, the reconciled layer is also used directly to better accomplish some operational tasks, such as producing daily reports that cannot be satisfactorily prepared using the corporate applications, or periodically generating data flows to feed external processes so as to benefit from cleaning and integration.
This architecture is especially useful for extensive, enterprise-wide systems. A disadvantage of this structure is the extra file storage space used by the redundant reconciled layer. It also puts the analytical tools a little further away from being real-time.
MINING WORLD WIDE WEB

The World Wide Web contains huge amounts of information that provides a rich source for data
mining.

Challenges in Web Mining

The web poses great challenges for resource and knowledge discovery, based on the following observations:

• The web is too huge - The size of the web is very large and rapidly increasing, which makes the web too huge for conventional data warehousing and data mining.

• Complexity of web pages - Web pages do not have a unifying structure. They are very complex compared to traditional text documents. There are huge numbers of documents in the digital library of the web, and these libraries are not arranged according to any particular sorted order.

• The web is a dynamic information source - The information on the web is rapidly updated. Data such as news, stock markets, weather, sports and shopping are updated regularly.

• Diversity of user communities - The user community on the web is rapidly expanding. These users have different backgrounds, interests, and usage purposes. There are more than 100 million workstations connected to the Internet, and the number is still rapidly increasing.

• Relevancy of information - A particular person is generally interested in only a small portion of the web, while the rest of the web contains information that is not relevant to the user and may swamp the desired results.
