Data Mining Introduction
Companion slides for the text by Dr. M.H.Dunham, Data Mining, Introductory and Advanced Topics,
Prentice Hall, 2002.
Data Mining Outline
PART I
Introduction
Related Concepts
Data Mining Techniques
PART II
Classification
Clustering
Association Rules
PART III
Web Mining
Spatial Mining
Temporal Mining
Introduction Outline
Goal: Provide an overview of data mining.
Database query vs. data mining
Output (database query): precise; a subset of the database
Output (data mining): fuzzy; not a subset of the database
Query Examples
Database
Find all credit applicants with last name of Smith.
Identify customers who have purchased more
than $10,000 in the last month.
Find all customers who have purchased milk.
Data Mining
Find all credit applicants who are poor credit
risks. (classification)
Identify customers with similar buying habits.
(clustering)
Find all items which are frequently purchased
with milk. (association rules)
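The contrast between the two kinds of query can be sketched in a few lines of Python. The transactions and the 0.6 threshold below are hypothetical and chosen only for illustration. The first query returns a precise subset of the stored data; the second derives co-occurrence rates that are not stored anywhere in the database.

```python
from collections import Counter

# Hypothetical market-basket data; not taken from the slides.
transactions = [
    {"milk", "bread", "eggs"},
    {"milk", "bread"},
    {"beer", "chips"},
    {"milk", "cereal", "bread"},
    {"bread", "jam"},
]

# Database-style query: a precise answer that is a subset of the stored data.
bought_milk = [t for t in transactions if "milk" in t]

# Mining-style query: which items appear with milk often enough to matter?
# The 0.6 threshold is an arbitrary assumption for this sketch.
co_counts = Counter(item for t in bought_milk for item in t if item != "milk")
min_confidence = 0.6
frequent_with_milk = {
    item: count / len(bought_milk)
    for item, count in co_counts.items()
    if count / len(bought_milk) >= min_confidence
}

print(frequent_with_milk)
```

Changing the threshold changes the answer, which is one sense in which mining output is fuzzy rather than precise.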
Data Mining Models and Tasks
Basic Data Mining Tasks
Classification maps data into predefined groups or
classes
Supervised learning
Pattern recognition
Prediction
Regression is used to map a data item to a real
valued prediction variable.
Clustering groups similar data together into clusters.
Unsupervised learning
Segmentation
Partitioning
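A minimal sketch of the difference between supervised classification and unsupervised clustering, using hypothetical one-dimensional data. The nearest-class-mean rule and the k-means-style loop are stand-ins picked for brevity, not methods prescribed by the slides. The function names and starting centers are assumptions.

```python
# Hypothetical 1-D data (for example, monthly spend); values are invented.
labeled = [(10, "low"), (12, "low"), (11, "low"), (50, "high"), (55, "high")]

def classify(x, training):
    """Supervised: assign x to the predefined class whose mean is nearest."""
    groups = {}
    for value, label in training:
        groups.setdefault(label, []).append(value)
    means = {label: sum(v) / len(v) for label, v in groups.items()}
    return min(means, key=lambda label: abs(x - means[label]))

def cluster(values, centers, rounds=10):
    """Unsupervised: k-means-style grouping; no class labels are given."""
    for _ in range(rounds):
        groups = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # Keep the old center if a group ends up empty.
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return groups

predicted = classify(48, labeled)
clusters = cluster([v for v, _ in labeled], centers=[0.0, 100.0])
print(predicted, clusters)
```

Classification needs the labels in `labeled`; clustering sees only the values and discovers the groups itself.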
Basic Data Mining Tasks (contd)
Summarization maps data into subsets with
associated simple descriptions.
Characterization
Generalization
Link Analysis uncovers relationships among data.
Affinity Analysis
Association Rules
Sequential Analysis determines sequential patterns.
Ex: Time Series Analysis
Example: Stock Market
Predict future values
Determine similar patterns over time
Classify behavior
Data Mining vs. KDD
Bayes Theorem
Regression Analysis
EM Algorithm
K-Means Clustering
Time Series Analysis
Algorithm Design Techniques
Algorithm Analysis
Data Structures
Neural Networks
Decision Tree Algorithms
KDD Issues
Human Interaction
Overfitting
Outliers
Interpretation
Visualization
Large Datasets
High Dimensionality
KDD Issues (contd)
Multimedia Data
Missing Data
Irrelevant Data
Noisy Data
Changing Data
Integration
Application
Social Implications of DM
Privacy
Profiling
Unauthorized use
Data Mining Metrics
Usefulness
Return on Investment (ROI)
Accuracy
Space/Time
Database Perspective on Data Mining
Scalability
Real World Data
Updates
Ease of Use
Visualization Techniques
Graphical
Geometric
Icon-based
Pixel-based
Hierarchical
Hybrid
Related Concepts Outline
Goal: Examine some areas which are related to
data mining.
Database/OLTP Systems
Fuzzy Sets and Logic
Information Retrieval (Web Search Engines)
Dimensional Modeling
Data Warehousing
OLAP/DSS
Statistics
Machine Learning
Pattern Matching
DB & OLTP Systems
Schema
(ID,Name,Address,Salary,JobNo)
Data Model
ER
Relational
Transaction
Query:
SELECT Name
FROM T
WHERE Salary > 100000
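The query above can be tried against an in-memory SQLite database. This is a sketch only: the column types and the three rows are hypothetical, and the ORDER BY clause is added here just to make the output order deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Schema from the slide; the column types are assumptions.
conn.execute(
    "CREATE TABLE T (ID INTEGER PRIMARY KEY, Name TEXT, Address TEXT, "
    "Salary REAL, JobNo INTEGER)"
)
# Hypothetical rows for illustration.
conn.executemany(
    "INSERT INTO T VALUES (?, ?, ?, ?, ?)",
    [
        (1, "Ann", "12 Oak St", 120000, 10),
        (2, "Bob", "34 Elm St", 85000, 20),
        (3, "Cho", "56 Pine St", 101000, 10),
    ],
)
rows = conn.execute(
    "SELECT Name FROM T WHERE Salary > 100000 ORDER BY ID"
).fetchall()
names = [r[0] for r in rows]
conn.close()

print(names)
```

The result is exact and is always a subset of the stored rows, which is the defining property of a database query.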
Figure: Simple vs. Fuzzy
Information Retrieval
Information Retrieval (IR): retrieving desired information from
textual data.
Library Science
Digital Libraries
Web Search Engines
Traditionally keyword based
Sample query:
Find all documents about data mining.
IR Classification
Dimensional Modeling
View data in a hierarchical manner, more as business
executives might.
Useful in decision support systems and mining
Dimension: collection of logically related attributes;
axis for modeling data.
Facts: data stored
Ex: Dimensions: products, locations, date
Facts: quantity, unit price
Roll Up
Drill Down
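Roll up moves to a coarser level of a dimension hierarchy and drill down moves to a finer one. The sketch below assumes a hypothetical location hierarchy (city within state) and invented fact rows; `total_quantity` is an illustrative helper, not part of any library.

```python
from collections import defaultdict

# Hypothetical fact rows: (product, city, state, date, quantity, unit_price)
facts = [
    ("milk", "Dallas", "TX", "2002-01-05", 10, 2.0),
    ("milk", "Austin", "TX", "2002-01-05", 4, 2.0),
    ("bread", "Dallas", "TX", "2002-01-06", 6, 1.5),
    ("milk", "Tulsa", "OK", "2002-01-06", 3, 2.0),
]

def total_quantity(rows, key):
    """Sum the quantity fact, grouped by the chosen dimension level."""
    totals = defaultdict(int)
    for row in rows:
        totals[key(row)] += row[4]
    return dict(totals)

# Drill down: the finer (state, city) level.
by_city = total_quantity(facts, key=lambda r: (r[2], r[1]))
# Roll up: the coarser state level.
by_state = total_quantity(facts, key=lambda r: r[2])

print(by_city)
print(by_state)
```

The facts themselves never change; only the level of the dimension used for grouping does.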
Why square?
Root Mean Square Error (RMSE)
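One way to see why the errors are squared: signed errors of opposite sign cancel, while squared errors cannot. The sketch below uses the standard definitions (mean squared error is the average of the squared differences, and RMSE is its square root); the sample numbers are hypothetical.

```python
import math

def mse(estimates, actuals):
    """Mean squared error: average of the squared differences."""
    return sum((e - a) ** 2 for e, a in zip(estimates, actuals)) / len(actuals)

def rmse(estimates, actuals):
    """Root mean square error: square root of the MSE, in the data's own units."""
    return math.sqrt(mse(estimates, actuals))

# Hypothetical estimates that miss by -1 and +1.
estimates = [2, 4]
actuals = [3, 3]

signed_total = sum(e - a for e, a in zip(estimates, actuals))  # errors cancel
error = rmse(estimates, actuals)                               # errors do not cancel

print(signed_total, error)
```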
Jackknife Estimate
Jackknife Estimate: estimate of parameter is
obtained by omitting one value from the set of
observed values.
Ex: estimate of mean for X = {x1, ..., xn}
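A minimal sketch of the jackknife idea for the mean: each estimate omits one observation and averages the remaining n - 1 values. The sample values are hypothetical and `jackknife_means` is an illustrative name.

```python
def jackknife_means(values):
    """Return the n leave-one-out estimates of the mean."""
    n = len(values)
    total = sum(values)
    # Omitting x leaves (total - x) spread over n - 1 values.
    return [(total - x) / (n - 1) for x in values]

sample = [1, 3, 9, 15]  # hypothetical observations
leave_one_out = jackknife_means(sample)
jackknife_mean = sum(leave_one_out) / len(leave_one_out)

print(leave_one_out, jackknife_mean)
```

For the mean, averaging the leave-one-out estimates gives back the ordinary sample mean; the spread of the individual estimates shows how sensitive the estimate is to any single observation.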
Maximum Likelihood Estimate (MLE)
Obtain parameter estimates that maximize the
probability that the sample data occurs for the
specific model.
The joint probability of observing the sample data is obtained by
multiplying the individual probabilities. Likelihood function:
L(Θ | x1, ..., xn) = ∏ f(xi | Θ), product over i = 1, ..., n
Maximize L.
MLE Example
Coin toss five times: {H,H,H,H,T}
Assuming a perfect coin with H and T equally likely, the
likelihood of this sequence is: L = (0.5)^5 ≈ 0.03.
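The coin example can be checked numerically. The sketch below multiplies the individual toss probabilities to get the likelihood, then compares the fair coin with the value of P(H) that maximizes L, which for independent tosses is the observed proportion of heads. The function name `likelihood` is an assumption for illustration.

```python
def likelihood(p_heads, tosses):
    """Joint probability of the toss sequence, assuming independent tosses."""
    result = 1.0
    for t in tosses:
        result *= p_heads if t == "H" else 1.0 - p_heads
    return result

tosses = ["H", "H", "H", "H", "T"]

fair = likelihood(0.5, tosses)            # 0.5 ** 5
p_hat = tosses.count("H") / len(tosses)   # proportion of heads: 4/5
best = likelihood(p_hat, tosses)          # 0.8 ** 4 * 0.2

print(fair, p_hat, best)
```

The sequence is more than twice as likely under P(H) = 0.8 as under a fair coin, so 0.8 is the better parameter estimate for this sample.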
Bayes Example (contd)
Calculate P(xi|hj) and P(xi)
Ex: P(x7|h1) = 2/6; P(x4|h1) = 1/6; P(x2|h1) = 2/6; P(x8|h1) = 1/6;
P(xi|h1) = 0 for all other xi.
Predict the class for x4:
Calculate P(hj|x4) for all hj.
Place x4 in class with largest value.
Ex:
P(h1|x4) = P(x4|h1) P(h1) / P(x4)
= (1/6)(0.6) / 0.1 = 1.
Place x4 in class h1.
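The posterior calculation above is a direct application of Bayes' theorem and can be reproduced in a few lines. Only the h1 values appear in this part of the example, so the sketch stops at P(h1|x4); the function name `posterior` is an assumption.

```python
def posterior(likelihood, prior, evidence):
    """Bayes' theorem: P(h|x) = P(x|h) * P(h) / P(x)."""
    return likelihood * prior / evidence

# Values taken from the example: P(x4|h1) = 1/6, P(h1) = 0.6, P(x4) = 0.1.
p_h1_given_x4 = posterior(1 / 6, 0.6, 0.1)

print(round(p_h1_given_x4, 6))
```

Because the posterior for h1 is 1 (up to floating-point rounding), every other class must have posterior 0 for x4, so h1 is the class with the largest value.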
Regression
Predict future values based on
past values
Linear regression assumes that a
linear relationship exists:
y = c0 + c1 x1 + ... + cn xn
Find values to best fit the data
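For a single predictor (n = 1), the best-fit coefficients in the least-squares sense have a closed form. The sketch below assumes ordinary least squares as the fitting criterion and uses hypothetical data that lie exactly on y = 1 + 2x; `fit_line` is an illustrative name.

```python
def fit_line(xs, ys):
    """Least-squares estimates of c0 and c1 for y = c0 + c1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    c1 = sxy / sxx              # slope
    c0 = mean_y - c1 * mean_x   # intercept
    return c0, c1

# Hypothetical past values.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]

c0, c1 = fit_line(xs, ys)
prediction = c0 + c1 * 5  # predict a future value at x = 5

print(c0, c1, prediction)
```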
Linear Regression
Correlation
Examine the degree to which the values for two
variables behave similarly.
Correlation coefficient r:
1 = perfect correlation
-1 = perfect but opposite correlation
0 = no correlation
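The three reference values of r can be reproduced with the usual (Pearson) formula: the sum of products of deviations from the means, divided by the square root of the product of the two sums of squared deviations. The data sets below are hypothetical, chosen to hit each case.

```python
import math

def correlation(xs, ys):
    """Pearson correlation coefficient r for two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

same = correlation([1, 2, 3], [2, 4, 6])             # values move together
opposite = correlation([1, 2, 3], [6, 4, 2])         # values move oppositely
unrelated = correlation([1, 2, 3, 4], [1, -1, -1, 1])  # no linear relationship

print(same, opposite, unrelated)
```

Note that r measures linear association only; the third data set has a clear pattern but no linear trend, so r is 0.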
Similarity Measures
Determine similarity between two
objects.
Similarity characteristics: