Introduction Lecture1gghhhhh

Data Mining
— Introduction —
Why Data Mining?
• The explosive growth of data (abundant data): from terabytes to petabytes
• Data collection and data availability
  • Automated data collection tools, database systems, the Web, a computerized society
• We are drowning in data, but starving for knowledge!
• “Necessity is the mother of invention”: data mining is the automated analysis of massive data sets
Defining Data Mining
• Data mining means sifting through very large amounts of data for useful information. It uses artificial intelligence techniques, neural networks, and advanced statistical tools (such as cluster analysis) to reveal trends, patterns, and relationships that might otherwise have remained undetected. In contrast to an expert system (which draws inferences from the given data on the basis of a given set of rules), data mining attempts to discover hidden rules underlying the data. Also called data surfing.
Data Mining Techniques
• The most commonly used techniques in data mining are:
1. Artificial neural networks: Non-linear predictive models that learn through training and resemble biological neural networks in structure.
2. Decision trees: Tree-shaped structures that represent sets of decisions. These decisions generate rules for the classification of a dataset.
3. Genetic algorithms: Optimization techniques that use processes such as genetic combination, mutation, and natural selection in a design based on the concepts of evolution.
4. Nearest neighbor method: A technique that classifies each record in a dataset based on a combination of the classes of the k record(s) most similar to it in a historical dataset (where k ≥ 1). Sometimes called the k-nearest neighbor technique.
5. Rule induction: The extraction of useful if-then rules from data based on statistical significance.
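The nearest neighbor method can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the function name `knn_classify`, the Euclidean distance, the majority vote, and the tiny "historical dataset" are all assumptions made for the example.

```python
import math
from collections import Counter

def knn_classify(query, history, k=3):
    """Classify `query` by majority vote among its k most similar records.

    `history` is a list of (feature_vector, class_label) pairs.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    # Rank historical records by Euclidean distance to the query
    # (smaller distance = more similar).
    by_distance = sorted(history, key=lambda rec: math.dist(query, rec[0]))
    # Combine the classes of the k nearest records by majority vote.
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical historical dataset: (income in thousands, age) -> customer value.
history = [
    ((25, 22), "low"),
    ((30, 25), "low"),
    ((28, 30), "low"),
    ((90, 45), "high"),
    ((95, 50), "high"),
    ((85, 40), "high"),
]

print(knn_classify((88, 42), history, k=3))
print(knn_classify((27, 24), history, k=1))
```

In practice the attributes would usually be rescaled first, since an attribute with a large numeric range otherwise dominates the distance.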
Applications of Data Mining
• There is a rapidly growing body of successful applications in a wide range of areas, as diverse as:
  • analysis of organic compounds
  • weather forecasting
  • predicting share of television audiences
  • medical diagnosis
  • financial forecasting
  • automatic abstracting
  • credit card fraud detection
  • targeted marketing
  • electric load prediction
  • toxic hazard analysis
  • and many more
Application examples
• Some examples of applications (potential or actual) are:
1. A supermarket chain mines its customer transaction data to optimise targeting of high-value customers.
2. A credit card company can use its data warehouse of customer transactions for fraud detection.
3. A major hotel chain can use survey databases to identify attributes of a 'high-value' prospect.
4. Predicting the probability of default for consumer loan applications by improving the ability to predict bad loans.
5. Reducing fabrication flaws in VLSI chips.
6. Data mining systems can sift through vast quantities of data collected during the semiconductor fabrication process to identify conditions that are causing yield problems.
7. Predicting audience share for television programmers, allowing television executives to arrange show schedules to maximize market share and increase advertising revenues.
8. Predicting the probability that a cancer patient will respond to chemotherapy, thus reducing health-care costs without affecting quality of care.
Knowledge Discovery in Databases (KDD) Process
• The KDD process is defined as “the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable (comprehensible) patterns in data” [Fayyad et al. (1996)].
  • Valid: are the discovered patterns representative of the data?
  • Novel: are the discovered patterns new to the organization?
  • Useful: can the organization use the discovered patterns?
  • Comprehensible: can we understand the discovered patterns?
Knowledge Discovery (KDD) Process
• Data mining is the core of the knowledge discovery process.
[Figure: KDD process flow. Databases → Data Cleaning and Data Integration → Data Warehouse → Data Selection and Transformation → Task-relevant Data → Data Mining → Pattern Evaluation]
KDD Process: Several Key Steps
1. Preprocessing steps:
  • Data cleaning (to remove noise and inconsistent data)
  • Data integration (where multiple data sources may be combined)
  • Data transformation (where data are transformed into forms appropriate for mining)
2. Data mining (an essential process where intelligent methods are applied in order to extract data patterns)
3. Post-processing steps:
  • Pattern evaluation (to identify the truly interesting patterns)
  • Knowledge presentation (presenting the mined knowledge to the user as rules, tables, pie/bar charts, concept hierarchies, trees, etc.)
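The key steps can be read as a chain of functions, each feeding the next. The sketch below is only a toy illustration under stated assumptions: the two "stores", the function names, and the use of simple frequency counting as the mining step are all hypothetical stand-ins.

```python
from collections import Counter

# Hypothetical raw sources: (customer_id, item); None marks a noisy record.
store_a = [(1, "milk"), (2, "bread"), (2, None), (3, "Milk ")]
store_b = [(4, "beer"), (5, "milk"), (5, "bread")]

def clean(records):
    """Data cleaning: drop records with missing values."""
    return [r for r in records if all(v is not None for v in r)]

def integrate(*sources):
    """Data integration: combine multiple sources into one list."""
    return [r for source in sources for r in source]

def transform(records):
    """Data transformation: normalize item names into a form fit for mining."""
    return [(cid, item.strip().lower()) for cid, item in records]

def mine(records):
    """Data mining (a deliberately trivial stand-in): count item frequencies."""
    return Counter(item for _, item in records)

def evaluate(patterns, min_count=2):
    """Pattern evaluation: keep patterns seen at least `min_count` times."""
    return {item: n for item, n in patterns.items() if n >= min_count}

def present(patterns):
    """Knowledge presentation: render the surviving patterns as a small table."""
    return "\n".join(f"{item:<10}{n}" for item, n in sorted(patterns.items()))

prepared = transform(integrate(clean(store_a), clean(store_b)))
interesting = evaluate(mine(prepared))
print(present(interesting))
```

A real system would replace `mine` with one of the techniques listed earlier; the surrounding steps keep the same shape.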
What Is Data Mining?
• Data mining (knowledge discovery from data)
  • Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data
  • Data mining: a misnomer?
• Alternative names
  • Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data dredging, information harvesting, etc.
Data Mining: Confluence of Multiple Disciplines
[Figure: data mining at the center, drawing on database technology, statistics, machine learning, visualization, pattern recognition, algorithms, and other disciplines]
Data Mining Functionalities
• General functionality
  • Descriptive data mining: find human-interpretable patterns that describe the data.
  • Predictive data mining: use some variables to predict unknown or future values of other variables.
Data Mining Tasks…
• Classification [Predictive]
• Clustering [Descriptive]
• Association Rule Discovery [Descriptive]
• Regression [Predictive]
• Deviation Detection [Predictive]
Classification: Definition
• Given a collection of records (training set)
  • Each record contains a set of attributes; one of the attributes is the class.
• Find a model for the class attribute as a function of the values of the other attributes.
• Goal: previously unseen records should be assigned a class as accurately as possible.
  • A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
Classification Example
Attribute types: Refund (categorical), Marital Status (categorical), Taxable Income (continuous), Cheat (class)

Training set:
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Test set:
Refund  Marital Status  Taxable Income  Cheat
No      Single          75K             ?
Yes     Married         50K             ?
No      Married         150K            ?
Yes     Divorced        90K             ?
No      Single          40K             ?
No      Married         80K             ?

[Figure: Training Set → Learn Classifier → Model; the model is then applied to the Test Set]
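The train/test workflow can be sketched with the ten labelled records from the example. Everything beyond the records themselves is an assumption: the 7/3 split, the omission of Taxable Income, and the choice of a very simple "one rule" learner (pick the single attribute whose majority-class table makes the fewest training errors) are made only to keep the sketch short, and are not the method of the slides.

```python
from collections import Counter, defaultdict

# Labelled records from the example: (refund, marital_status, cheat).
# Taxable income is left out to keep the sketch short.
records = [
    ("Yes", "Single", "No"), ("No", "Married", "No"), ("No", "Single", "No"),
    ("Yes", "Married", "No"), ("No", "Divorced", "Yes"), ("No", "Married", "No"),
    ("Yes", "Divorced", "No"), ("No", "Single", "Yes"), ("No", "Married", "No"),
    ("No", "Single", "Yes"),
]

# Assumed split: first 7 records build the model, last 3 validate it.
train, test = records[:7], records[7:]

def learn_one_rule(rows):
    """Return a classifier based on the single most predictive attribute."""
    default = Counter(r[-1] for r in rows).most_common(1)[0][0]
    best = None
    for a in range(len(rows[0]) - 1):
        # For each attribute value, count how often each class occurs.
        counts = defaultdict(Counter)
        for r in rows:
            counts[r[a]][r[-1]] += 1
        table = {v: c.most_common(1)[0][0] for v, c in counts.items()}
        errors = sum(r[-1] != table[r[a]] for r in rows)
        if best is None or errors < best[0]:
            best = (errors, a, table)
    _, attr, table = best
    # Attribute values never seen in training fall back to the majority class.
    return lambda row: table.get(row[attr], default)

model = learn_one_rule(train)
correct = sum(model(r) == r[-1] for r in test)
accuracy = correct / len(test)
print(f"test accuracy: {accuracy:.2f}")
```

With only ten records the accuracy estimate is very noisy; the point of the sketch is the separation between the data used to learn the model and the data used to judge it.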
Clustering Definition
• Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that:
  • Data points in the same cluster are more similar to one another.
  • Data points in separate clusters are less similar to one another.
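One common way to search for such clusters is k-means, shown here as a minimal sketch. The slides do not name an algorithm, so k-means, the Euclidean distance used as the (dis)similarity measure, the sample points, and the starting centroids are all assumptions of this example.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Tiny k-means sketch: alternate assignment and centroid update."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                new_centroids.append(
                    tuple(sum(coord) / len(cluster) for coord in zip(*cluster)))
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # assignments can no longer change
        centroids = new_centroids
    return clusters, centroids

# Hypothetical 2-D data points and starting centroids.
points = [(1, 1), (1.5, 2), (1, 1.5), (8, 8), (9, 9), (8.5, 9.5)]
clusters, centroids = kmeans(points, [(1, 1), (9, 9)])
print(clusters)
```

The result depends on the starting centroids and on the number of clusters chosen, both of which are inputs here rather than something the procedure discovers.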
Association Rule Discovery: Definition
• Association rule mining searches for interesting relationships among items in a given dataset.
• Which items are frequently purchased together by my customers? (Market basket analysis.)

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Rules discovered:
  {Milk} → {Coke} (support = 0.6, confidence = 0.75)
  {Diaper, Milk} → {Beer}

• If a customer buys diapers and milk, then he is very likely to buy beer.
• So, don’t be surprised if you find six-packs stacked next to diapers!
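Support and confidence for the two rules can be checked directly against the five transactions. This is a small sketch; the helper names `count`, `support`, and `confidence` are chosen for the example.

```python
# The five market-basket transactions from the example.
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]

def count(itemset):
    """Number of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions)

def support(antecedent, consequent):
    """Fraction of all transactions containing both sides of the rule."""
    return count(antecedent | consequent) / len(transactions)

def confidence(antecedent, consequent):
    """Among transactions containing the antecedent, the fraction that
    also contain the consequent."""
    return count(antecedent | consequent) / count(antecedent)

for lhs, rhs in [({"Milk"}, {"Coke"}), ({"Diaper", "Milk"}, {"Beer"})]:
    print(sorted(lhs), "->", sorted(rhs),
          f"support={support(lhs, rhs):.2f}",
          f"confidence={confidence(lhs, rhs):.2f}")
```

Milk appears in four of the five transactions and together with Coke in three, which gives the support of 3/5 and the confidence of 3/4 for {Milk} → {Coke}.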
Regression
• Predict the value of a given continuous-valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency.
• Extensively studied in statistics and in the neural network field.
• Examples:
  • Predicting sales amounts of a new product based on advertising expenditure.
  • Predicting wind velocities as a function of temperature, humidity, air pressure, etc.
  • Time series prediction of stock market indices.
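For the linear case, the first example (sales as a function of advertising expenditure) can be sketched with an ordinary least-squares line fit. The data values below are invented for illustration, and `fit_line` is a name chosen for this sketch.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both unnormalized).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: advertising spend and resulting sales (arbitrary units).
advertising = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [12.1, 13.9, 16.2, 17.8, 20.0]

slope, intercept = fit_line(advertising, sales)
predicted = slope * 6.0 + intercept  # prediction for an unseen spend of 6.0
print(f"sales = {slope:.2f} * advertising + {intercept:.2f}")
print(f"predicted sales at 6.0: {predicted:.2f}")
```

A nonlinear dependency would need a different model form; the fitting idea (choose parameters that minimize the prediction error on known data) stays the same.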
Deviation/Anomaly Detection
• Detect significant deviations from normal behavior
• Applications:
  • Credit card fraud detection
  • Network intrusion detection
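A very simple way to flag deviations is to measure how many standard deviations a value lies from the mean (a z-score). The sketch below assumes that approach; the transaction amounts and the threshold of 2.0 are hypothetical, and real fraud or intrusion detection uses far richer models.

```python
from statistics import mean, pstdev

def find_deviations(values, threshold=2.0):
    """Return the values lying more than `threshold` standard
    deviations away from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical: nothing deviates
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical card transaction amounts; one is far from normal behavior.
amounts = [20, 25, 22, 19, 24, 21, 23, 500]
anomalies = find_deviations(amounts)
print(anomalies)
```

Because an extreme value also inflates the mean and the standard deviation, the threshold has to be modest for small samples; robust statistics such as the median are a common alternative.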
Are All the “Discovered” Patterns Interesting?
• Data mining may generate thousands of patterns; not all of them are interesting
• Interestingness measures
  • A pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates some hypothesis that a user seeks to confirm
• Objective vs. subjective interestingness measures
  • Objective (data driven): based on statistics and structures of patterns, e.g., support, confidence (degree of certainty), etc.
  • Subjective (user driven): based on the user’s belief in the data, e.g., unexpectedness (contradicting a user’s belief), novelty (previously unknown), actionability (use of discovered knowledge), etc.
Pattern Interestingness Measure
• Simplicity: e.g., (association) rule length, (decision) tree size
• Certainty (for a rule A → B): e.g., confidence = #(A and B) / #(A), classification reliability or accuracy, certainty factor, rule strength, rule quality
• Support = #(A and B) / #(Domain)
• Coverage = #(A and B) / #(B)
• Novelty: not previously known, surprising
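Objective measures make it possible to prune a large set of candidate patterns automatically. As a sketch, the code below enumerates every single-item rule A → B over the five market-basket transactions, computes support, confidence, and coverage using the formulas above, and keeps only the rules that pass two thresholds. The threshold values and variable names are assumptions of the example.

```python
from itertools import permutations

# Market-basket transactions from the association rule example.
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]
items = sorted(set().union(*transactions))

def count(itemset):
    """Number of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions)

# Candidate patterns: every ordered pair of distinct items, read as A -> B.
rules = []
for a, b in permutations(items, 2):
    both = count({a, b})
    if both == 0:
        continue
    rules.append({
        "rule": f"{{{a}}} -> {{{b}}}",
        "support": both / len(transactions),   # #(A and B) / #(Domain)
        "confidence": both / count({a}),       # #(A and B) / #(A)
        "coverage": both / count({b}),         # #(A and B) / #(B)
    })

# Hypothetical thresholds for what counts as interesting.
MIN_SUPPORT, MIN_CONFIDENCE = 0.6, 0.75
interesting = [r for r in rules
               if r["support"] >= MIN_SUPPORT
               and r["confidence"] >= MIN_CONFIDENCE]

print(f"{len(rules)} candidate rules, {len(interesting)} kept")
for r in interesting:
    print(r["rule"], r["support"], r["confidence"], r["coverage"])
```

Subjective measures such as unexpectedness or actionability cannot be computed this way; they depend on what the user already believes and can act on.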
