Unit-2-Part-1 Data Mining

The document discusses the key concepts of data mining including what data mining is, its relationship to knowledge discovery in databases, and the preprocessing, data mining, and postprocessing steps. It also outlines several motivating challenges for data mining including issues related to scalability, high dimensionality, heterogeneous complex data, data ownership and distribution, and non-traditional analysis.

Uploaded by

basapurprathik
Copyright
© All Rights Reserved

Data Mining & Motivating Challenges

UNIT - II

By M. Rajesh Reddy

WHAT IS DATA MINING?

• Data mining is the process of automatically discovering useful information in large data repositories.
• It is used to find novel and useful patterns that might otherwise remain unknown.
• It also provides capabilities to predict the outcome of a future observation.
  • Example: predicting whether a newly arrived customer will spend more than $100 at a department store.
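
The prediction example above can be sketched in a few lines. This is only an illustration under stated assumptions: the `history` records, the single feature (items in the first basket), and the simple threshold rule are all hypothetical and do not come from the slides.

```python
# Hypothetical history: (items_in_first_basket, spent_more_than_100)
history = [(1, False), (2, False), (3, False), (6, True), (7, True), (9, True)]

def fit_threshold(records):
    """Pick the cut-off on the single feature that misclassifies the fewest records."""
    best_cut, best_errors = None, len(records) + 1
    for cut in sorted({x for x, _ in records}):
        # Rule under test: predict "spends more than $100" when x >= cut.
        errors = sum((x >= cut) != label for x, label in records)
        if errors < best_errors:
            best_cut, best_errors = cut, errors
    return best_cut

def predict(cut, items_in_basket):
    """Apply the learned rule to a future observation."""
    return items_in_basket >= cut

cut = fit_threshold(history)
new_customer_prediction = predict(cut, 8)
```

The point is only that a pattern learned from past records is applied to a customer who has not been observed before. Real predictive models use many attributes and more robust learning methods.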
WHAT IS DATA MINING?

• Not all information discovery tasks are considered to be data mining.
• Examples are tasks from the area of information retrieval:
  • looking up individual records using a database management system, or
  • finding particular Web pages via a query to an Internet search engine.
• Data mining techniques have nevertheless been used to enhance information retrieval systems.
WHAT IS DATA MINING?

Data Mining and Knowledge Discovery

• Data mining is an integral part of Knowledge Discovery in Databases (KDD).
• KDD is the overall process of converting raw data into useful information.
• This process consists of a series of transformation steps, from data preprocessing to postprocessing of the data mining results.
WHAT IS DATA MINING?

• Preprocessing: transforms the raw input data into an appropriate format for subsequent analysis.
• Steps involved in data preprocessing:
  • fusing (joining) data from multiple sources,
  • cleaning data to remove noise and duplicate observations, and
  • selecting records and features that are relevant to the data mining task at hand.
• Preprocessing is often the most laborious and time-consuming step in the overall process.
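
The three preprocessing steps listed above (fuse, clean, select) can be sketched as follows. The record layout, the field names (`id`, `amount`, `age`, `fax`) and the helper functions are hypothetical, chosen only for illustration.

```python
# Hypothetical raw inputs from two sources, keyed by customer id.
purchases = [
    {"id": 1, "amount": 120.0},
    {"id": 2, "amount": None},   # noisy record: missing amount
    {"id": 3, "amount": 45.5},
    {"id": 3, "amount": 45.5},   # duplicate observation
]
profiles = {
    1: {"age": 34, "fax": "n/a"},
    2: {"age": 51, "fax": "n/a"},
    3: {"age": 27, "fax": "n/a"},
}

def fuse(purchases, profiles):
    """Join each purchase with the profile of the same customer."""
    return [{**p, **profiles[p["id"]]} for p in purchases if p["id"] in profiles]

def clean(rows):
    """Drop records with a missing amount and exact duplicates."""
    seen, kept = set(), []
    for row in rows:
        if row["amount"] is None:
            continue
        key = tuple(sorted(row.items()))
        if key in seen:
            continue
        seen.add(key)
        kept.append(row)
    return kept

def select(rows, features):
    """Keep only the features relevant to the mining task."""
    return [{f: row[f] for f in features} for row in rows]

prepared = select(clean(fuse(purchases, profiles)), ["age", "amount"])
```

After these steps `prepared` holds two records, each with only the `age` and `amount` attributes.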
WHAT IS DATA MINING?

• Postprocessing:
  • ensures that only valid and useful results are incorporated into the decision support system.
• Visualization:
  • allows analysts to explore the data and the data mining results from a variety of viewpoints.
• Statistical measures or hypothesis testing methods can also be applied during postprocessing:
  • to eliminate spurious (false) data mining results.
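
One way to filter spurious results with a hypothesis test is an exact permutation test on a difference between two groups. This is a sketch, not a method prescribed by the slides. The function name and the sample values are assumptions, and the full enumeration is only practical for small groups.

```python
from itertools import combinations

def exact_permutation_p_value(group_a, group_b):
    """Two-sided exact permutation test for a difference in group means.

    Returns the fraction of all possible regroupings whose absolute mean
    difference is at least as large as the observed one.
    """
    pooled = list(group_a) + list(group_b)
    n_a, n_b = len(group_a), len(group_b)
    total = sum(pooled)

    def abs_mean_diff(sum_a):
        return abs(sum_a / n_a - (total - sum_a) / n_b)

    observed = abs_mean_diff(sum(group_a))
    splits = extreme = 0
    for idx in combinations(range(len(pooled)), n_a):
        splits += 1
        sum_a = sum(pooled[i] for i in idx)
        # Small tolerance guards against floating-point round-off.
        if abs_mean_diff(sum_a) >= observed - 1e-12:
            extreme += 1
    return extreme / splits

# A clearly separated pattern versus one that chance alone explains.
p_real = exact_permutation_p_value([10, 11, 12, 13], [1, 2, 3, 4])
p_spurious = exact_permutation_p_value([1, 4, 10, 13], [2, 3, 11, 12])
```

A small p-value (here `p_real`) suggests the pattern is unlikely to arise by chance. A large one (here `p_spurious`) marks a result that should not reach the decision support system.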
Motivating Challenges:

• Challenges that motivated the development of data mining:
  • Scalability
  • High Dimensionality
  • Heterogeneous and Complex Data
  • Data Ownership and Distribution
  • Non-traditional Analysis
Motivating Challenges:

• Scalability
  • Data set sizes are on the order of gigabytes, terabytes, or even petabytes.
  • Handling data of this size may require:
    • special search strategies,
    • novel data structures for efficient access to individual records,
    • out-of-core algorithms for data sets that cannot fit into main memory, and
    • sampling, or parallel and distributed algorithms.
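
Two of the ideas above, processing data in chunks that fit in memory and sampling from a stream, can be sketched as follows. The function names, the chunk size, and the fixed seed are assumptions made for this illustration.

```python
import random
from itertools import islice

def chunks(stream, size):
    """Yield successive lists of at most `size` items from any iterable."""
    it = iter(stream)
    while True:
        block = list(islice(it, size))
        if not block:
            return
        yield block

def streaming_mean(stream, chunk_size=1000):
    """Compute a mean in one pass, holding only one chunk in memory at a time."""
    count, total = 0, 0.0
    for block in chunks(stream, chunk_size):
        count += len(block)
        total += sum(block)
    return total / count if count else float("nan")

def reservoir_sample(stream, k, seed=0):
    """Draw k items uniformly at random from a stream of unknown length."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = item
    return sample

mean_value = streaming_mean(range(1, 10001), chunk_size=256)
sample = reservoir_sample(range(1, 10001), k=5)
```

Here `range` stands in for a file or database cursor that is too large to load at once. Neither function ever materializes the full data set.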


Motivating Challenges:

• High Dimensionality
  • Data sets with hundreds or thousands of attributes are now common.
  • Example: in bioinformatics, microarray technology has produced gene expression data involving thousands of features.
  • Data sets with temporal or spatial components also tend to have high dimensionality.
    • Example: a data set that contains measurements of temperature at various locations. If the measurements are repeated over an extended period, the number of dimensions grows in proportion to the number of measurements taken.
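
The temperature example shows how quickly the attribute count grows: every (location, time) pair becomes its own dimension. A minimal sketch, with hypothetical station names and readings:

```python
def flatten_measurements(readings):
    """Turn readings[location] = [value at t0, t1, ...] into one flat record
    with one attribute per (location, time step) pair."""
    return {
        f"{location}@t{t}": value
        for location, series in readings.items()
        for t, value in enumerate(series)
    }

# Hypothetical readings: 3 stations, 4 time steps each.
readings = {
    "station_a": [21.0, 21.4, 22.1, 21.9],
    "station_b": [18.3, 18.0, 18.8, 19.2],
    "station_c": [25.5, 25.1, 24.8, 25.0],
}
record = flatten_measurements(readings)
n_attributes = len(record)

# The same layout with 1,000 stations and daily readings for one year.
n_attributes_large = 1000 * 365
```

Three stations and four time steps already give 12 attributes. A thousand stations measured daily for a year would give 365,000.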
Motivating Challenges:

• Heterogeneous and Complex Data
  • Traditional data analysis methods often deal with data sets whose attributes are all of the same type, either continuous or categorical.
  • Data mining increasingly has to handle heterogeneous attributes and more complex, non-traditional types of data. Examples include:
    • collections of Web pages containing semi-structured text and hyperlinks,
    • DNA data with sequential and three-dimensional structure, and
    • climate data with time series measurements.
  • DM techniques should take into account relationships in the data, such as:
    • temporal and spatial autocorrelation,
    • graph connectivity, and
    • parent-child relationships between the elements in semi-structured text and XML documents.
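
Temporal autocorrelation, one of the relationships listed above, can be measured with the standard lag-k sample autocorrelation. The sketch below is illustrative only, and the two series are made-up examples.

```python
def autocorrelation(series, lag=1):
    """Sample autocorrelation of a series at the given lag.

    Values near +1 mean neighbouring observations tend to be similar,
    values near -1 mean they tend to alternate, values near 0 mean
    little linear dependence between them.
    """
    n = len(series)
    if lag >= n:
        return 0.0
    mean = sum(series) / n
    variance = sum((x - mean) ** 2 for x in series)
    if variance == 0:
        return 0.0
    covariance = sum(
        (series[i] - mean) * (series[i + lag] - mean) for i in range(n - lag)
    )
    return covariance / variance

smooth = [1, 2, 3, 4, 5, 6, 7, 8]    # slowly changing, like a temperature trend
alternating = [1, -1, 1, -1, 1, -1]  # flips sign at every step

r_smooth = autocorrelation(smooth)
r_alternating = autocorrelation(alternating)
```

A method that treats the observations as independent records would ignore this dependence between neighbouring values.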
Motivating Challenges:

• Data Ownership and Distribution
  • Data is often not stored in one location or owned by one organization.
  • Instead, it is geographically distributed among resources belonging to multiple entities.
  • This requires the development of distributed data mining techniques.
  • Key challenges for distributed data mining algorithms:
    • (1) reducing the amount of communication needed,
    • (2) effectively consolidating the data mining results obtained from multiple sources, and
    • (3) addressing data security issues.
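
Challenges (1) and (2) can be illustrated with a small sketch in which each site sends only a compact summary instead of its raw records, and a coordinator merges the summaries. The site data and function names are hypothetical. Security, challenge (3), is not addressed here.

```python
from collections import Counter

def local_summary(transactions):
    """Run at each site: count how many local baskets contain each item.

    Only this summary (one count per item) leaves the site, not the raw data.
    """
    counts = Counter()
    for basket in transactions:
        counts.update(set(basket))
    return len(transactions), counts

def consolidate(summaries):
    """Run at the coordinator: merge site summaries into global item support."""
    total_baskets, total_counts = 0, Counter()
    for n_baskets, counts in summaries:
        total_baskets += n_baskets
        total_counts += counts
    return {item: c / total_baskets for item, c in total_counts.items()}

# Hypothetical data held by two different organizations.
site_a = [["milk", "bread"], ["milk"]]
site_b = [["bread", "eggs"], ["milk", "eggs"]]

global_support = consolidate([local_summary(site_a), local_summary(site_b)])
```

The communication cost depends on the number of distinct items, not on the number of transactions stored at each site.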
Motivating Challenges:

• Non-traditional Analysis
  • The traditional statistical approach is based on a hypothesize-and-test paradigm:
    • a hypothesis is proposed,
    • an experiment is designed to gather the data, and
    • the data is then analyzed with respect to the hypothesis.
  • Current data analysis tasks often require the generation and evaluation of thousands of hypotheses, which is too labor-intensive to do by hand.
  • Some DM techniques automate the process of hypothesis generation and evaluation.
  • Data sets analyzed in data mining also frequently involve non-traditional types of data and data distributions.
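
Automated hypothesis generation and evaluation can be sketched as follows: every combination of items is treated as a candidate hypothesis ("these items are bought together") and scored by its support. The transactions, the threshold, and the function name are assumptions. Practical algorithms prune the candidate space rather than enumerating it in full.

```python
from itertools import combinations

def frequent_itemsets(transactions, size, min_support):
    """Generate every itemset of the given size and keep those whose
    support (fraction of baskets containing the itemset) meets the threshold."""
    baskets = [set(t) for t in transactions]
    items = sorted(set().union(*baskets))
    results = {}
    for candidate in combinations(items, size):  # hypothesis generation
        hits = sum(set(candidate) <= basket for basket in baskets)
        support = hits / len(baskets)            # hypothesis evaluation
        if support >= min_support:
            results[candidate] = support
    return results

# Hypothetical market-basket data.
transactions = [
    ["bread", "milk"],
    ["bread", "diapers", "beer"],
    ["milk", "diapers", "beer"],
    ["bread", "milk", "diapers"],
    ["bread", "milk", "beer"],
]

pairs = frequent_itemsets(transactions, size=2, min_support=0.6)
```

Four items give only six candidate pairs, but the number of candidates grows combinatorially with the number of items. This is why the process has to be automated.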
