Unit-2-Part-1 Data Mining

Data Mining & Motivating Challenges
UNIT - II
By M. Rajesh Reddy
WHAT IS DATA MINING?
• Data mining is the process of automatically discovering
useful information in large data repositories.
• Data mining techniques are used to find novel and useful
patterns that might otherwise remain unknown.
• They also provide capabilities to predict the outcome of a
future observation.
• Example: predicting whether a newly arrived customer will
spend more than $100 at a department store.
• Not all information discovery tasks are considered to be
data mining.
• For example, the following tasks belong to the area of
information retrieval:
• looking up individual records using a database
management system, or
• finding particular Web pages via a query to an Internet
search engine.
• Data mining techniques have nevertheless been used to
enhance information retrieval systems.
Data Mining and Knowledge Discovery
• Data mining is an integral part of Knowledge Discovery in
Databases (KDD), the overall process of converting raw data
into useful information.
• This process consists of a series of transformation
steps, from data preprocessing to postprocessing of data
mining results.
• Preprocessing transforms the raw input data into an
appropriate format for subsequent analysis.
• Steps involved in data preprocessing:
• fusing (joining) data from multiple sources,
• cleaning data to remove noise and duplicate
observations, and
• selecting records and features that are relevant to the
data mining task at hand.
• Preprocessing is often the most laborious and
time-consuming step in the overall process.
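The three preprocessing steps can be illustrated with a minimal sketch in plain Python. The record layout, the field names (`customer_id`, `amount`, `age`, `favorite_color`), and the rule that a missing amount counts as noise are all assumptions made for illustration; they are not taken from the slides.

```python
# Hypothetical source 1: purchase records (illustrative data only).
purchases = [
    {"customer_id": 1, "amount": 120.0},
    {"customer_id": 2, "amount": 35.5},
    {"customer_id": 2, "amount": 35.5},   # duplicate observation
    {"customer_id": 3, "amount": None},   # missing value, treated as noise here
]

# Hypothetical source 2: customer profiles keyed by customer_id.
profiles = {
    1: {"age": 34, "favorite_color": "red"},
    2: {"age": 51, "favorite_color": "blue"},
    3: {"age": 27, "favorite_color": "green"},
}


def preprocess(purchases, profiles, features):
    """Fuse two sources, clean the result, and keep only selected features."""
    seen = set()
    prepared = []
    for row in purchases:
        # Step 1: fuse (join) the two sources on customer_id.
        profile = profiles.get(row["customer_id"])
        if profile is None:
            continue
        record = {**row, **profile}

        # Step 2: clean by dropping noisy rows and duplicate observations.
        if record["amount"] is None:
            continue
        key = tuple(sorted(record.items()))
        if key in seen:
            continue
        seen.add(key)

        # Step 3: select only the features relevant to the task.
        prepared.append({name: record[name] for name in features})
    return prepared


prepared = preprocess(purchases, profiles, ["customer_id", "amount", "age"])
```

Here `favorite_color` is dropped as irrelevant to a spending-prediction task, which is itself an assumption about the task at hand.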
• Postprocessing:
• ensures that only valid and useful results are
incorporated into the decision support system.
• Visualization
• allows analysts to explore the data and the data
mining results from a variety of viewpoints.
• Statistical measures or hypothesis testing methods can
also be applied during postprocessing
• to eliminate spurious (false) data mining results.
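One way to filter spurious results with a hypothesis test is a permutation test, sketched below. The slides do not name a specific test, so the choice of a permutation test, the pattern names, the sample values, and the 0.05 threshold are all assumptions.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, n_permutations=2000, seed=0):
    """Estimate how often a random relabeling gives a mean gap at least as large."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            at_least_as_extreme += 1
    # Add-one correction so the estimate is never exactly zero.
    return (at_least_as_extreme + 1) / (n_permutations + 1)


def keep_significant(patterns, alpha=0.05):
    """Keep only the patterns whose group difference is unlikely under chance."""
    return [
        name
        for name, (group_a, group_b) in patterns.items()
        if permutation_p_value(group_a, group_b) < alpha
    ]


# Hypothetical mined patterns, each backed by two groups of observations.
patterns = {
    "promo_raises_spend": ([10, 11, 12, 13, 14], [1, 2, 3, 4, 5]),
    "weekday_raises_spend": ([1, 2, 3, 4, 5], [1, 2, 3, 4, 5]),
}
validated = keep_significant(patterns)
```

Only patterns that pass the test would be passed on to the decision support system; the second pattern shows no difference between its groups and is discarded.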
Motivating Challenges:
• The following challenges motivated the development of
data mining:
• Scalability
• High Dimensionality
• Heterogeneous and Complex Data
• Data Ownership and Distribution
• Non-traditional Analysis
• Scalability
• Data sets with sizes on the order of gigabytes, terabytes,
or even petabytes are becoming common.
• Handling them may require:
• special search strategies,
• novel data structures (for efficient access),
• out-of-core algorithms, for data sets that cannot fit
into main memory, and
• sampling, or parallel and distributed algorithms.
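As a sketch of how sampling helps with scalability, reservoir sampling draws a fixed-size uniform sample in a single pass, without holding the full data set in memory. The slides mention sampling only in general terms; this particular algorithm (Algorithm R) and the function name are illustrative choices.

```python
import random


def reservoir_sample(stream, k, seed=0):
    """Return k items drawn uniformly from an iterable of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # both endpoints inclusive
            if j < k:
                reservoir[j] = item
    return reservoir


# The generator is consumed one item at a time, so memory use stays at O(k).
sample = reservoir_sample((x * x for x in range(10_000)), k=5)
```

Because the stream is read once and only `k` items are kept, the same function works on a file or database cursor that is too large for main memory.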

• High Dimensionality
• Data sets with hundreds or thousands of attributes are
now common.
• Example
• Bioinformatics: microarray technology has
produced gene expression data involving
thousands of features.
• Data sets with temporal or spatial components
also tend to have high dimensionality.
• Example: a data set that contains measurements of
temperature at various locations. If the measurements
are repeated over an extended period, the number of
dimensions grows with the number of measurements taken.
• Heterogeneous and Complex Data
• Traditional data analysis methods often deal with data
sets whose attributes are all of the same type, either
continuous or categorical.
• Data mining increasingly has to handle heterogeneous
attributes and more complex data objects. Examples of
such non-traditional types of data include
• collections of Web pages containing semi-structured
text and hyperlinks;
• DNA data with sequential and three-dimensional
structure; and
• climate data with time series measurements.
• Data mining techniques should take into account
relationships in the data, such as
• temporal and spatial autocorrelation,
• graph connectivity, and
• parent-child relationships between the elements in
semi-structured text and XML documents.
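To make temporal autocorrelation concrete, the sketch below computes the sample autocorrelation of a time series at a given lag, using the common estimator that divides the lagged covariance sum by the total sum of squared deviations. The function name and the example series are assumptions for illustration.

```python
def autocorrelation(series, lag=1):
    """Sample autocorrelation of a numeric sequence at the given lag."""
    n = len(series)
    if not 0 < lag < n:
        raise ValueError("lag must be between 1 and len(series) - 1")
    m = sum(series) / n
    denominator = sum((x - m) ** 2 for x in series)
    if denominator == 0:
        raise ValueError("autocorrelation is undefined for a constant series")
    numerator = sum(
        (series[t] - m) * (series[t + lag] - m) for t in range(n - lag)
    )
    return numerator / denominator


# A steadily rising series: neighboring values tend to lie on the same side
# of the mean, so the lag-1 autocorrelation is positive.
trend = autocorrelation([1, 2, 3, 4, 5], lag=1)

# An alternating series: neighboring values lie on opposite sides of the mean.
alternating = autocorrelation([1, -1, 1, -1], lag=1)
```

A value far from zero indicates that consecutive measurements are not independent, which is why methods that assume independent records can mislead on time series data.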
• Data Ownership and Distribution
• Data is often not stored in one location or owned by one organization.
• Instead, it is geographically distributed among resources belonging to
multiple entities.
• This requires the development of distributed data mining techniques.
• Key challenges faced by distributed data mining algorithms:
• (1) how to reduce the amount of communication needed,
• (2) how to effectively consolidate the data mining results obtained
from multiple sources, and
• (3) how to address data security issues.
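Challenges (1) and (2) can be illustrated with a small sketch in which each site computes item counts locally and ships only that summary, which a coordinator then merges. The site names, the transactions, and the use of item counts as the mined result are assumptions; data security (3) is not addressed here.

```python
from collections import Counter


def local_counts(transactions):
    """Run at each site: count how many local transactions contain each item."""
    counts = Counter()
    for transaction in transactions:
        counts.update(set(transaction))  # count each item once per transaction
    return counts


def consolidate(summaries):
    """Run at the coordinator: merge the per-site summaries into global counts."""
    total = Counter()
    for summary in summaries:
        total.update(summary)
    return total


# Hypothetical data held by two separate organizations.
site_a = [["bread", "milk"], ["bread"]]
site_b = [["milk"], ["milk", "eggs"]]

# Only the compact summaries cross the network, never the raw transactions.
global_counts = consolidate([local_counts(site_a), local_counts(site_b)])
```

Sending summaries instead of raw records reduces communication, and merging works here because counts are additive; results that are not additive need a more careful consolidation step.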
• Non-traditional Analysis
• The traditional statistical approach is based on a
hypothesize-and-test paradigm:
• a hypothesis is proposed,
• an experiment is designed to gather the data, and
• the data is then analyzed with respect to the hypothesis.
• This process is labor-intensive, while current data analysis
tasks often require
• the generation and evaluation of thousands of hypotheses.
• Some DM techniques automate the process of hypothesis
generation and evaluation.
• The data sets analyzed in data mining also frequently involve
non-traditional types of data and data distributions.
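Automated hypothesis generation and evaluation can be sketched with itemsets: every candidate itemset is a hypothesis ("these items occur together often"), and its support in the data is the evaluation. This is a brute-force enumeration for illustration, not an efficient algorithm such as Apriori; the transactions and the 0.5 support threshold are assumptions.

```python
from itertools import combinations

# Hypothetical market-basket transactions (illustrative data only).
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
]


def generate_hypotheses(transactions, max_size=2):
    """Yield every candidate itemset of up to max_size items."""
    items = sorted(set().union(*transactions))
    for size in range(1, max_size + 1):
        for candidate in combinations(items, size):
            yield frozenset(candidate)


def support(itemset, transactions):
    """Fraction of transactions that contain every item in the itemset."""
    hits = sum(1 for transaction in transactions if itemset <= transaction)
    return hits / len(transactions)


def frequent_itemsets(transactions, min_support=0.5, max_size=2):
    """Evaluate all candidates and keep those that meet the support threshold."""
    result = {}
    for hypothesis in generate_hypotheses(transactions, max_size):
        value = support(hypothesis, transactions)
        if value >= min_support:
            result[hypothesis] = value
    return result


frequent = frequent_itemsets(transactions)
```

With only four items there are ten candidates of size one or two; the number of candidates grows combinatorially with the number of items, which is why the generation and evaluation have to be automated.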
