Originated From DB Community : IIIT Hyderabad
Data Mining
Automated extraction of interesting patterns from large databases
[Figure: data sources (Business, Government, Research Labs, Internet) → DATA → Mine/Explore → Patterns → Decision Making, with Feedback To Data Sources]
Types of Patterns
Associations
  Coffee buyers usually also purchase sugar
Clustering
  Segments of customers requiring different promotion strategies
Classification
  Customers expected to be loyal

Association Rules
That which is infrequent is not worth worrying about.
Association Rules
D:
Transaction ID   Items
1                Tomato, Potato, Onions
2                Tomato, Potato, Brinjal, Pumpkin
3                Tomato, Potato, Onions, Chilli
4                Lemon, Tamarind

Rule: Tomato, Potato → Onions (confidence: 66%, support: 50%)
Support(X) = |transactions containing X| / |D|
Confidence(R) = support(R) / support(LHS(R))
Problem proposed in [AIS 93]: Find all rules satisfying user-given minimum support and minimum confidence.

Association Rule Applications
E-commerce
  People who have bought Sundara Kandam have also bought Srimad Bhagavatham
Census analysis
  Immigrants are usually male
Sports
  A chess end-game configuration with “white pawn on A7” and “white knight dominating black rook” typically results in a “win for white”.
Medical diagnosis
  Allergy to latex rubber usually co-occurs with allergies to banana and tomato
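
The support and confidence definitions on the Association Rules slide can be checked against the four example transactions. The sketch below is a minimal illustration, not the algorithm from [AIS 93]: the names `support`, `confidence` and `find_rules` are hypothetical, support(R) is taken to mean the support of LHS and RHS together, and the search is plain brute force over every itemset, which is only workable for a toy database like D.

```python
from itertools import combinations

# Transactions D from the Association Rules slide.
D = [
    {"Tomato", "Potato", "Onions"},
    {"Tomato", "Potato", "Brinjal", "Pumpkin"},
    {"Tomato", "Potato", "Onions", "Chilli"},
    {"Lemon", "Tamarind"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs, transactions):
    """support(LHS and RHS together) / support(LHS) for the rule LHS -> RHS."""
    lhs_support = support(lhs, transactions)
    if lhs_support == 0:
        return 0.0  # assumption: a rule whose LHS never occurs gets confidence 0
    return support(set(lhs) | set(rhs), transactions) / lhs_support

def find_rules(transactions, min_support, min_confidence):
    """Brute force: try every itemset and every way to split it into LHS/RHS."""
    items = sorted(set().union(*transactions))
    rules = []
    for size in range(2, len(items) + 1):
        for itemset in combinations(items, size):
            if support(itemset, transactions) < min_support:
                continue
            for k in range(1, size):
                for lhs in combinations(itemset, k):
                    rhs = tuple(i for i in itemset if i not in lhs)
                    if confidence(lhs, rhs, transactions) >= min_confidence:
                        rules.append((lhs, rhs))
    return rules

if __name__ == "__main__":
    print(support({"Tomato", "Potato", "Onions"}, D))
    print(confidence({"Tomato", "Potato"}, {"Onions"}, D))
    for lhs, rhs in find_rules(D, min_support=0.5, min_confidence=0.6):
        print(", ".join(lhs), "->", ", ".join(rhs))
```

For the slide's rule, the itemset {Tomato, Potato, Onions} appears in 2 of 4 transactions (support 50%) and {Tomato, Potato} in 3 of 4, so the confidence is 2/3, which the slide reports as 66%.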
Classification

Outlook    Temp (F)  Humidity (%)  Windy?  Class
sunny      85        85            false   don’t play
sunny      72        95            false   don’t play
sunny      69        70            false   play
overcast   72        90            true    play
sunny      77        69            true    ?
rain       73        76            false   ?

Model relationship between class labels and attributes
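
One way to make "model the relationship between class labels and attributes" concrete is a 1-nearest-neighbour classifier over the four labelled rows. This is a swapped-in technique chosen for brevity, not a method named on the slides. The `distance` function and its `scale` parameter are hypothetical choices: each categorical mismatch counts as 1 and numeric gaps are divided by 10.

```python
# Labelled rows from the Classification slide:
# (outlook, temp, humidity, windy) -> class label.
TRAIN = [
    (("sunny", 85, 85, False), "don't play"),
    (("sunny", 72, 95, False), "don't play"),
    (("sunny", 69, 70, False), "play"),
    (("overcast", 72, 90, True), "play"),
]

def distance(a, b, scale=10.0):
    """Hypothetical mixed-attribute distance.

    Categorical attributes add 1 per mismatch; numeric attributes add
    their absolute difference divided by `scale` (an assumed constant).
    """
    outlook_a, temp_a, hum_a, windy_a = a
    outlook_b, temp_b, hum_b, windy_b = b
    return (
        (outlook_a != outlook_b)
        + (windy_a != windy_b)
        + abs(temp_a - temp_b) / scale
        + abs(hum_a - hum_b) / scale
    )

def classify(record, train=TRAIN):
    """Return the label of the closest labelled row (1-nearest-neighbour)."""
    _, label = min(train, key=lambda row: distance(record, row[0]))
    return label

if __name__ == "__main__":
    # The two unlabelled rows ("?") from the slide.
    for query in [("sunny", 77, 69, True), ("rain", 73, 76, False)]:
        print(query, "->", classify(query))
```

With this particular distance both unlabelled rows land closest to the "sunny 69 70 false play" row. That outcome reflects the assumed distance and the tiny training set, not a claim about the correct labels; a different model or scaling could disagree.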
Applications
Text classification
  Classify emails into spam / non-spam
  Classify web-pages into yahoo-type hierarchy
NLP Problems
  Tagging: Classify words into verbs, nouns, etc.
Risk management, Fraud detection, Computer intrusion detection
  Given the properties of a transaction (items purchased, amount, location, customer profile, etc.)
  Determine if it is a fraud
Machine learning / pattern recognition applications
  Vision
  Speech recognition
  etc.
All of science & knowledge is about predicting the future in terms of the past
  So classification is a very fundamental problem with ultra-wide scope of applications

Clustering
Birds of a feather flock together.
The Clustering Problem

Outlook    Temp (F)  Humidity (%)  Windy?
sunny      75        70            true
sunny      80        90            true
sunny      85        85            false
sunny      72        95            false
sunny      69        70            false
overcast   72        90            true
overcast   73        88            true
overcast   64        65            true
overcast   81        75            false
rain       71        80            true
rain       65        70            true
rain       75        80            false
rain       68        80            false
rain       70        96            false

Find groups of similar records.
Need a function to compute similarity, given 2 input records
Unsupervised learning

Applications
Targeting similar people or objects
  Student tutorial groups
  Hobby groups
  Health support groups
  Customer groups for marketing
  Organizing e-mail
Spatial clustering
  Exam centres
  Locations for a business chain
  Planning a political strategy
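
The Clustering Problem slide asks for a function that computes similarity given 2 input records. The sketch below shows one possible choice for the weather records, under stated assumptions: categorical attributes score 1 on a match and 0 otherwise, numeric attributes score 1 minus their gap divided by the attribute's range in the table, and the four scores are averaged with equal weight. The names `attribute_ranges` and `similarity` are hypothetical, and the slides do not prescribe this measure.

```python
# Records from The Clustering Problem slide:
# (outlook, temp in F, humidity in %, windy).
RECORDS = [
    ("sunny", 75, 70, True),
    ("sunny", 80, 90, True),
    ("sunny", 85, 85, False),
    ("sunny", 72, 95, False),
    ("sunny", 69, 70, False),
    ("overcast", 72, 90, True),
    ("overcast", 73, 88, True),
    ("overcast", 64, 65, True),
    ("overcast", 81, 75, False),
    ("rain", 71, 80, True),
    ("rain", 65, 70, True),
    ("rain", 75, 80, False),
    ("rain", 68, 80, False),
    ("rain", 70, 96, False),
]

def attribute_ranges(records):
    """Return (temp_range, humidity_range) observed in `records`."""
    temps = [r[1] for r in records]
    hums = [r[2] for r in records]
    return max(temps) - min(temps), max(hums) - min(hums)

def similarity(a, b, temp_range, hum_range):
    """Similarity in [0, 1]; 1.0 means the two records are identical.

    Assumes both ranges are non-zero and that every attribute carries
    equal weight, which is an illustrative choice only.
    """
    scores = [
        1.0 if a[0] == b[0] else 0.0,        # outlook
        1.0 - abs(a[1] - b[1]) / temp_range,  # temperature
        1.0 - abs(a[2] - b[2]) / hum_range,   # humidity
        1.0 if a[3] == b[3] else 0.0,        # windy
    ]
    return sum(scores) / len(scores)

if __name__ == "__main__":
    temp_range, hum_range = attribute_ranges(RECORDS)
    print(similarity(RECORDS[0], RECORDS[1], temp_range, hum_range))
```

A clustering algorithm would then group records whose pairwise similarity is high. The choice of weights and scaling decides which groups emerge, which is why the similarity function is called out as a required input to the problem.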
Take Home
Data mining is a mature field.
Good algorithms for core tasks are available.
Focus on applications to challenging kinds of data
  Streams, Distributed data, Multimedia, Web, …
Most effort is in how to map domain problems to data mining problems
  And how to make sense of the output.