Introduction Lecture 1
— Introduction —
Why Data Mining?
The explosive growth of data (abundant data): from terabytes
to petabytes
Data collection and data availability
Automated data collection tools, database systems,
Web, computerized society
We are drowning in data, but starving for knowledge!
“Necessity is the mother of invention”: data mining is the
automated analysis of massive data sets
How Do We Define Data Mining?
Data mining is the process of sifting through very large amounts
of data for useful information. It uses artificial intelligence
techniques, neural networks, and advanced statistical tools
(such as cluster analysis) to reveal trends, patterns, and
relationships that might otherwise have remained
undetected. In contrast to an expert system (which draws
inferences from the given data on the basis of a given set of
rules), data mining attempts to discover hidden rules
underlying the data. It is also called data surfing.
Data Mining Techniques
The most commonly used techniques in data mining are:
1- Artificial neural networks: Non-linear predictive models that
learn through training and resemble biological neural networks
in structure.
Applications of Data Mining
Application examples
Some examples of applications (potential or
actual) are:
1- A supermarket chain mines its customer transaction data to
optimise targeting of high-value customers.
Knowledge Discovery in Databases (KDD) Process
Knowledge Discovery (KDD) Process
Figure: the KDD process, with stages labelled Databases, Data Cleaning, Data Integration, and Task-relevant Data.
KDD Process: Several Key Steps
1. Preprocessing steps:
Data cleaning (to remove noise and inconsistent data).
Data integration (where multiple data sources may be
combined).
Data transformation (where data are transformed into forms
appropriate for mining).
3. Post-processing steps:
Pattern evaluation (to identify the truly interesting patterns).
Knowledge presentation (to present the mined knowledge to
the user as rules, tables, pie/bar charts, concept hierarchies,
trees, etc.).
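The preprocessing steps listed above (cleaning, integration, transformation) can be sketched in a few lines of Python. This is a minimal illustration rather than a prescribed pipeline: the two record sources, the field names (`id`, `age`, `income`), the rule of dropping records with missing values, and the min-max scaling used for the transformation step are all assumptions made for the example.

```python
# Hypothetical records from two sources; None marks a missing (noisy) value.
source_a = [
    {"id": 1, "age": 25, "income": 30000},
    {"id": 2, "age": None, "income": 52000},
]
source_b = [
    {"id": 3, "age": 41, "income": 88000},
    {"id": 4, "age": 33, "income": 61000},
]

def clean(records):
    """Data cleaning: drop records that contain missing values."""
    return [r for r in records if all(v is not None for v in r.values())]

def integrate(*sources):
    """Data integration: combine several sources into one list of records."""
    return [record for source in sources for record in source]

def transform(records, field):
    """Data transformation: min-max scale one numeric field into [0, 1]."""
    values = [r[field] for r in records]
    low, high = min(values), max(values)
    span = (high - low) or 1  # avoid division by zero when all values are equal
    return [{**r, field: (r[field] - low) / span} for r in records]

# Clean each source, combine them, then rescale the assumed "income" field.
prepared = transform(integrate(clean(source_a), clean(source_b)), "income")
for row in prepared:
    print(row)
```

The record with the missing age is dropped during cleaning, so three records reach the transformation step.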
What Is Data Mining?
Data Mining: Confluence of Multiple Disciplines
Figure: Data Mining shown as the confluence of Database Technology, Statistics, Machine Learning, Visualization, Pattern Recognition, Algorithm, and Other Disciplines.
Data Mining Functionalities
General functionality
Descriptive data mining
Find human-interpretable patterns that describe
the data.
Predictive data mining
Use some variables to predict unknown or future
values of other variables.
Data Mining Tasks…
Classification [Predictive]
Clustering [Descriptive]
Association Rule Discovery [Descriptive]
Regression [Predictive]
Deviation Detection [Predictive]
Classification: Definition
Given a collection of records (training set).
Each record contains a set of attributes; one of
the attributes is the class.
Find a model for the class attribute as a
function of the values of the other attributes.
Goal: previously unseen records should be
assigned a class as accurately as possible.
A test set is used to determine the accuracy of
the model. Usually, the given data set is
divided into training and test sets, with the training
set used to build the model and the test set used
to validate it.
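The definition above can be made concrete with a small sketch. The records, the class labels "A" and "B", and the choice of a 1-nearest-neighbour model are assumptions for illustration only; the slides do not prescribe a particular classifier. The data set is divided by position to keep the example deterministic, whereas in practice the division is usually random.

```python
import math

# Hypothetical records: (attributes, class label).
records = [
    ((1.0, 1.0), "A"), ((6.0, 6.0), "B"), ((1.5, 2.0), "A"),
    ((7.0, 6.5), "B"), ((2.0, 1.0), "A"), ((6.5, 7.5), "B"),
    ((1.2, 1.4), "A"), ((6.8, 6.9), "B"), ((4.5, 4.5), "A"), ((5.5, 6.2), "B"),
]

# Divide the data: the first 6 records build the model, the last 4 validate it.
training_set, test_set = records[:6], records[6:]

def classify(attributes, training_set):
    """Assign the class of the closest training record (1-nearest neighbour)."""
    nearest = min(training_set, key=lambda rec: math.dist(attributes, rec[0]))
    return nearest[1]

def accuracy(test_set, training_set):
    """Fraction of test records whose predicted class matches the true class."""
    correct = sum(classify(attrs, training_set) == label for attrs, label in test_set)
    return correct / len(test_set)

print(accuracy(test_set, training_set))
```

One of the four test records lies closer to the "B" training records than to its own class, so the measured accuracy is below 1; this is the kind of error the test set is meant to expose.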
Classification Example
Figure: example tables with columns Tid, Refund (categorical), Marital Status (categorical), Taxable Income (continuous), and Cheat (class), feeding a classifier.
Clustering Definition
Given a set of data points, each having a
set of attributes, and a similarity measure
among them, find clusters such that
Data points in one cluster are more similar to
one another.
Data points in separate clusters are less similar
to one another.
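As a rough sketch of this definition, the code below groups two-dimensional points with k-means. Several things here are assumptions not stated in the slides: k-means itself as the clustering method, Euclidean distance as the (dis)similarity measure, the sample points, and the choice of initial centroids.

```python
import math

def k_means(points, centroids, iterations=10):
    """Assign each point to its nearest centroid, then move each centroid to its cluster mean."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centroid.
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return clusters, centroids

# Hypothetical data points forming two visibly separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.5), (8.5, 9.5)]
clusters, centroids = k_means(points, centroids=[points[0], points[3]])
print(clusters)
print(centroids)
```

Points that end up in the same cluster are closer to one another than to points in the other cluster, which is the property the definition asks for.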
Association Rule Discovery: Definition
Association rule mining searches for interesting relationships
among items in a given dataset.
Which items are frequently purchased by my customers?
Market basket analysis.
TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk
Rules Discovered:
{Milk} → {Coke} (support = 0.6, confidence = 0.75)
{Diaper, Milk} → {Beer}
If a customer buys diapers and milk, then he is very
likely to buy beer.
So, don’t be surprised if you find six-packs stacked
next to diapers!
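The support and confidence of the two rules can be checked directly against the five transactions in the table. This is a minimal sketch; the function names `support` and `confidence` are chosen for the example, and the measures are computed as fractions rather than percentages.

```python
# The five transactions from the table above.
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]

def support(antecedent, consequent):
    """Fraction of all transactions that contain every item of the rule."""
    items = antecedent | consequent
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Among transactions containing the antecedent, the fraction that also contain the consequent."""
    with_antecedent = [t for t in transactions if antecedent <= t]
    return sum(consequent <= t for t in with_antecedent) / len(with_antecedent)

for lhs, rhs in [({"Milk"}, {"Coke"}), ({"Diaper", "Milk"}, {"Beer"})]:
    print(sorted(lhs), "->", sorted(rhs), support(lhs, rhs), round(confidence(lhs, rhs), 2))
```

For {Milk} → {Coke} this gives a support of 0.6 (3 of 5 transactions) and a confidence of 0.75 (3 of the 4 transactions containing Milk). For {Diaper, Milk} → {Beer} it gives a support of 0.4 and a confidence of about 0.67.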
Regression
Predict the value of a given continuous-valued
variable based on the values of other variables,
assuming a linear or nonlinear model of
dependency.
Extensively studied in statistics and neural network fields.
Examples:
Predicting sales amounts of a new product based
on advertising expenditure.
Predicting wind velocities as a function of
temperature, humidity, air pressure, etc.
Time series prediction of stock market indices.
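The first example can be sketched with a simple linear model fitted by ordinary least squares. The advertising and sales figures below are invented for illustration, and the linear form is one possible assumption about the dependency; nothing in the slides fixes either.

```python
# Hypothetical data: advertising expenditure and resulting sales (arbitrary units).
advertising = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [3.1, 4.9, 7.2, 8.8, 11.0]

def fit_line(xs, ys):
    """Ordinary least squares for the model y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept

slope, intercept = fit_line(advertising, sales)

# Predict sales for an expenditure that was not in the data.
predicted = slope * 6.0 + intercept
print(round(slope, 2), round(intercept, 2), round(predicted, 2))
```

The fitted line is then used to predict the continuous target (sales) for a new value of the input variable (advertising expenditure).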
Deviation/Anomaly Detection
Network Intrusion Detection
Are All the “Discovered” Patterns Interesting?
Data mining may generate thousands of patterns; not all of
them are interesting.
Interestingness measures
A pattern is interesting if it is easily understood by humans, valid
on new or test data with some degree of certainty, potentially
useful, novel, or validates some hypothesis that a user seeks to
confirm
Objective vs. subjective interestingness measures
Objective (data-driven): based on statistics and structures of
patterns, e.g., support, confidence (degree of certainty), etc.
Subjective (user-driven): based on the user’s beliefs about the data, e.g.,
unexpectedness (contradicting a user’s belief), novelty (previously
unknown), actionability (use of discovered knowledge), etc.
Pattern Interestingness Measure
Simplicity
e.g., (association) rule length, (decision) tree size.
Certainty (A → B)
e.g., confidence = #(A and B) / #(A), classification
reliability or accuracy, certainty factor, rule strength, rule
quality.
Support = #(A and B) / #(Domain)
Coverage = #(A and B) / #(B)
Novelty
Not previously known, surprising.
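The objective measures above reduce to ratios of record counts. The sketch below assumes the counts are already known; the function name `rule_measures`, the example counts, and the thresholds are hypothetical, and coverage follows the definition given on this slide.

```python
def rule_measures(count_a, count_b, count_a_and_b, count_domain):
    """Objective measures for a rule A -> B, computed from record counts."""
    return {
        "confidence": count_a_and_b / count_a,    # #(A and B) / #(A)
        "support": count_a_and_b / count_domain,  # #(A and B) / #(Domain)
        "coverage": count_a_and_b / count_b,      # #(A and B) / #(B), as defined above
    }

# Hypothetical counts: 40 records satisfy A, 50 satisfy B, 30 satisfy both, out of 200.
measures = rule_measures(count_a=40, count_b=50, count_a_and_b=30, count_domain=200)
print(measures)

# Keep the rule only if it clears user-chosen thresholds (values are illustrative).
is_interesting = measures["support"] >= 0.1 and measures["confidence"] >= 0.7
print(is_interesting)
```

Thresholds like these are one common way to discard uninteresting patterns before the subjective measures (unexpectedness, novelty, actionability) are considered.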