Classification - Part 2

[Attribute Selection Measures]


Shubham Kumar
Dept. of CS&IT
MGCUB
Attribute Selection Measures
• An attribute selection measure is a heuristic for selecting the splitting criterion that best separates a given data partition, D, of class-labeled training tuples into individual classes.

• A splitting criterion is considered the best when, after splitting, each resulting partition is pure.

• A partition is pure when all the tuples that fall into it belong to the same class.

• Attribute selection measures are also known as splitting rules because they determine how the tuples at a given node are to be split.

• The measure assigns a rank (score) to each attribute describing the training tuples, and the attribute with the best score is chosen as the splitting attribute for the given tuples.

• If the splitting attribute is continuous-valued, or if we are restricted to binary trees, then a split point or a splitting subset, respectively, must also be determined as part of the splitting criterion.
Partitioning Scenarios (Examples)

[Figure: the three partitioning scenarios for an attribute A: (1) a multiway split on a discrete-valued A, (2) a binary split on a continuous-valued A at a split point, and (3) a binary split on a discrete-valued A using a splitting subset SA.]
1. A is discrete-valued: In this case, the outcomes of the test at node N correspond directly to the known values of A. A branch is created for each known value, aj, of A and labeled with that value (as in the figure). Partition Dj is the subset of class-labeled tuples in D having value aj of A.

2. A is continuous-valued: In this case, the test at node N has two possible outcomes, corresponding to the conditions A ≤ split point and A > split point, respectively, where split point is the split-point returned by the attribute selection method as part of the splitting criterion.

3. A is discrete-valued and a binary tree must be produced: In this case, the test is of the form A ∈ SA, where SA is the splitting subset for A. (A small code sketch of these three test types follows.)
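To make the three test types concrete, here is a minimal Python sketch (not from the notes; the function names and example values are illustrative only) of how each test at node N could be evaluated for a tuple's value of A:

```python
def discrete_test(value, known_values):
    """Multiway split: one branch per known value a_j of a discrete attribute A."""
    return known_values.index(value)          # branch index j -> partition D_j

def continuous_test(value, split_point):
    """Binary split for a continuous A: A <= split_point vs. A > split_point."""
    return value <= split_point               # True -> left branch, False -> right

def binary_subset_test(value, splitting_subset):
    """Binary split for a discrete A when a binary tree is required: is A in S_A?"""
    return value in splitting_subset          # True -> left branch, False -> right

# Illustrative calls (hypothetical values):
print(discrete_test("Youth", ["Youth", "Middle_aged", "Senior"]))   # 0
print(continuous_test(37, split_point=40))                          # True
print(binary_subset_test("High", {"High", "Medium"}))               # True
```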
• According to the algorithm, the tree node created for partition D is labeled with the splitting criterion, and the tuples are partitioned accordingly [also shown in the figure].
• There are three popular attribute selection measures: Information Gain, Gain Ratio, and Gini Index.
• Information Gain:
The attribute with the highest information gain is chosen as the splitting attribute.
This attribute minimizes the information needed to classify the tuples in the resulting partitions.
Let D, the data partition, be a training set of class-labeled tuples.
Let the class label attribute have m distinct values defining m distinct classes, Ci (for i = 1, ..., m). Let Ci,D be the set of tuples of class Ci in D, and let |D| and |Ci,D| denote the number of tuples in D and Ci,D, respectively.
• Then the expected information needed to classify a tuple in D is given by:

Info(D) = − Σ (i = 1 to m) pi log2(pi)

• where pi is the nonzero probability that an arbitrary tuple in D belongs to class Ci and is estimated by |Ci,D| / |D|. Info(D) is the average amount of information needed to identify the class label of a tuple in D. Info(D) is also known as the entropy of D.
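As a small illustration (a sketch, not part of the notes), Info(D) can be computed directly from the per-class tuple counts; using the 9 yes / 5 no split of the AllElectronics example that appears later gives roughly 0.940 bits:

```python
from math import log2

def info(class_counts):
    """Info(D) = -sum_i p_i * log2(p_i), with p_i = |C_i,D| / |D|
    estimated from the per-class tuple counts."""
    total = sum(class_counts)
    return -sum((c / total) * log2(c / total) for c in class_counts if c > 0)

# Example: 9 "yes" and 5 "no" tuples (the training set used later).
print(round(info([9, 5]), 3))   # 0.94
```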
• Now, suppose we have to partition the tuples in D on some attribute A having v distinct values, {a1, a2, ..., av}.
• Then the expected information required to classify a tuple from D based on the partitioning by A is:

InfoA(D) = Σ (j = 1 to v) (|Dj| / |D|) × Info(Dj)

• The term |Dj| / |D| acts as the weight of the j-th partition. InfoA(D) is the expected information required to classify a tuple from D based on the partitioning by A.
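A corresponding sketch for InfoA(D), again illustrative and reusing the info() helper from the sketch above; each inner list holds the class counts of one partition Dj:

```python
def info_a(partition_class_counts):
    """Info_A(D) = sum_j (|D_j| / |D|) * Info(D_j)."""
    total = sum(sum(counts) for counts in partition_class_counts)
    return sum((sum(counts) / total) * info(counts)
               for counts in partition_class_counts)

# Example: partitioning the later training set by age
# (youth: 2 yes / 3 no, middle_aged: 4 yes / 0 no, senior: 3 yes / 2 no).
print(round(info_a([[2, 3], [4, 0], [3, 2]]), 3))   # 0.694
```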

• Information gain is defined as the difference between the original information requirement and the new requirement (i.e., the one obtained after partitioning on A):

Gain(A) = Info(D) – InfoA(D)

• The attribute A with the highest information gain is chosen as the splitting attribute.
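Putting the two together, still as a sketch that assumes the info() and info_a() helpers above are in scope, the gain of an attribute and the choice of splitting attribute look like this; `candidates` is a hypothetical mapping from attribute name to its per-partition class counts:

```python
def gain(total_class_counts, partition_class_counts):
    """Gain(A) = Info(D) - Info_A(D)."""
    return info(total_class_counts) - info_a(partition_class_counts)

def best_split_attribute(total_class_counts, candidates):
    """Return the attribute whose partitioning gives the highest information gain."""
    return max(candidates, key=lambda a: gain(total_class_counts, candidates[a]))
```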
Example:
This is a training set, D, of class-labeled tuples randomly selected from the AllElectronics customer database.
RID Age Income Student Credit rating Class: buys computer
1 Youth High No Fair No
2 Youth High No Excellent No
3 Middle_aged High No Fair Yes
4 Senior Medium No Fair Yes
5 Senior Low Yes Fair Yes
6 Senior Low Yes Excellent No
7 Middle_aged Low Yes Excellent Yes
8 Youth Medium No Fair No
9 Youth Low Yes Fair Yes
10 Senior Medium Yes Fair Yes
11 Youth Medium Yes Excellent Yes
12 Middle_aged Medium No Excellent Yes
13 Middle_aged High Yes Fair Yes
14 Senior Medium No Excellent No
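For the worked computations that follow, the table above can be encoded directly in Python; the field names below (age, income, student, credit, buys) are shorthand chosen for this sketch, not identifiers from the notes:

```python
# The 14 training tuples from the table above, as a list of dicts.
TRAINING_SET = [
    {"age": "Youth",       "income": "High",   "student": "No",  "credit": "Fair",      "buys": "No"},
    {"age": "Youth",       "income": "High",   "student": "No",  "credit": "Excellent", "buys": "No"},
    {"age": "Middle_aged", "income": "High",   "student": "No",  "credit": "Fair",      "buys": "Yes"},
    {"age": "Senior",      "income": "Medium", "student": "No",  "credit": "Fair",      "buys": "Yes"},
    {"age": "Senior",      "income": "Low",    "student": "Yes", "credit": "Fair",      "buys": "Yes"},
    {"age": "Senior",      "income": "Low",    "student": "Yes", "credit": "Excellent", "buys": "No"},
    {"age": "Middle_aged", "income": "Low",    "student": "Yes", "credit": "Excellent", "buys": "Yes"},
    {"age": "Youth",       "income": "Medium", "student": "No",  "credit": "Fair",      "buys": "No"},
    {"age": "Youth",       "income": "Low",    "student": "Yes", "credit": "Fair",      "buys": "Yes"},
    {"age": "Senior",      "income": "Medium", "student": "Yes", "credit": "Fair",      "buys": "Yes"},
    {"age": "Youth",       "income": "Medium", "student": "Yes", "credit": "Excellent", "buys": "Yes"},
    {"age": "Middle_aged", "income": "Medium", "student": "No",  "credit": "Excellent", "buys": "Yes"},
    {"age": "Middle_aged", "income": "High",   "student": "Yes", "credit": "Fair",      "buys": "Yes"},
    {"age": "Senior",      "income": "Medium", "student": "No",  "credit": "Excellent", "buys": "No"},
]
```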
• Here, the class label attribute, buys computer, has two distinct values: yes &
no.
• Therefore, there are two distinct classes (i.e., m = 2). Let class C1 correspond
to yes and class C2 correspond to no.
• There are nine tuples of class yes and five tuples of class no.
• A (root) node N is created for the tuples in D.
• To find the splitting criterion for these tuples, we must compute the information gain of each attribute.
• First, the expected information needed to classify a tuple in D is:

Info(D) = −(9/14) log2(9/14) − (5/14) log2(5/14) = 0.940 bits

• Next, we compute the expected information requirement for each attribute.
(1) Age: We need to look at the distribution of yes and no tuples for each category of age. For the category “youth,” there are two yes tuples and three no tuples. For the category “middle_aged,” there are four yes tuples and zero no tuples. For the category “senior,” there are three yes tuples and two no tuples.
Therefore, the expected information needed to classify a tuple in D if the tuples are partitioned according to age is:

InfoAge(D) = (5/14) × (−(2/5) log2(2/5) − (3/5) log2(3/5))
           + (4/14) × (−(4/4) log2(4/4))
           + (5/14) × (−(3/5) log2(3/5) − (2/5) log2(2/5))
           = 0.694 bits
Hence, the gain in information from such a partitioning would be:

Gain(age) = Info(D) – InfoAge(D) = 0.940 – 0.694 = 0.246 bits

Similarly,
Gain(income) = 0.029 bits, Gain(student) = 0.151 bits, and Gain(credit_rating) = 0.048 bits.
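These gains can be checked with a short, self-contained sketch over the TRAINING_SET list defined above (small differences in the third decimal place come from rounding intermediate values in the notes):

```python
from collections import Counter, defaultdict
from math import log2

def info(counts):
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def gain(data, attribute, label="buys"):
    """Gain(A) = Info(D) - Info_A(D), computed from raw tuples."""
    overall = Counter(t[label] for t in data)
    partitions = defaultdict(Counter)
    for t in data:
        partitions[t[attribute]][t[label]] += 1
    info_a = sum(sum(p.values()) / len(data) * info(list(p.values()))
                 for p in partitions.values())
    return info(list(overall.values())) - info_a

for a in ("age", "income", "student", "credit"):
    print(a, round(gain(TRAINING_SET, a), 3))
# "age" comes out highest (about 0.247 exactly; 0.246 with the rounded
# intermediate values used in the notes), followed by student, credit, and income.
```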
• Since age has the highest information gain among the attributes, it is selected as the splitting attribute.

• According to the decision tree algorithm, node N is labeled with age, branches are grown for each of the attribute's values, and the tuples are then partitioned accordingly.
[Shown in the next figure]

• Here, the tuples falling into the partition for age = middle_aged all belong to the same class. Since they all belong to class “yes,” a leaf should therefore be created at the end of this branch and labeled “yes.”
• For the other resulting partitions, where the classes are not all the same, the decision tree algorithm applies the same splitting process recursively to grow the tree.
The final decision tree (using information gain as the attribute selection measure) is shown in the figure; a minimal recursive sketch of this process follows.
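The recursive splitting described above can be sketched as follows: a minimal ID3-style illustration, reusing gain() and TRAINING_SET from the sketches above. It handles only the basic stopping case of a pure or unsplittable partition, not every terminating condition of the full algorithm:

```python
from collections import Counter

def build_tree(data, attributes, label="buys"):
    """Label a leaf when the partition is pure (or no attributes remain);
    otherwise split on the attribute with the highest information gain
    and recurse on each resulting partition."""
    classes = Counter(t[label] for t in data)
    if len(classes) == 1 or not attributes:
        return classes.most_common(1)[0][0]      # leaf: (majority) class label
    best = max(attributes, key=lambda a: gain(data, a, label))
    branches = {}
    for value in {t[best] for t in data}:
        subset = [t for t in data if t[best] == value]
        branches[value] = build_tree(subset, [a for a in attributes if a != best], label)
    return {best: branches}

# Expected to split on "age" at the root, mirroring the example above.
print(build_tree(TRAINING_SET, ["age", "income", "student", "credit"]))
```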
Reference
• Jiawei Han, Micheline Kamber, and Jian Pei, Data Mining: Concepts and Techniques, 3rd ed., Elsevier, 2012.
