Introduction to Machine Learning Basics
Module 1: Introduction
Part A: Introduction
Sudeshna Sarkar
IIT Kharagpur
Overview of Course
1. Introduction
2. Linear Regression and Decision Trees
3. Instance based learning and Feature Selection
4. Probability and Bayes Learning
5. Support Vector Machines
6. Neural Network
7. Introduction to Computational Learning Theory
8. Clustering
Module 1
1. Introduction
a) Introduction
b) Different types of learning
c) Hypothesis space, Inductive Bias
d) Evaluation, Training and test set, cross-validation
2. Linear Regression and Decision Trees
3. Instance based learning and Feature Selection
4. Probability and Bayes Learning
5. Support Vector Machines
6. Neural Network
7. Introduction to Computational Learning Theory
8. Clustering
Machine Learning History
• 1950s:
– Samuel's checker-playing program
• 1960s:
– Neural network: Rosenblatt's perceptron
– Minsky & Papert prove limitations of Perceptron
• 1970s:
– Symbolic concept induction
– Expert systems and knowledge acquisition bottleneck
– Quinlan's ID3
– Natural language processing (symbolic)
Machine Learning History
• 1980s:
– Advanced decision tree and rule learning
– Learning and planning and problem solving
– Resurgence of neural network
– Valiant's PAC learning theory
– Focus on experimental methodology
• 90's: ML and Statistics
– Data Mining
– Adaptive agents and web applications
– Text learning
– Reinforcement learning
– Ensembles
– Bayes Net learning
• 1994: Self-driving car road test
• 1997: Deep Blue beats Garry Kasparov
Machine Learning History
• Popularity of this field in recent time and the reasons behind that:
– New software / algorithms
• Neural networks
• Deep learning
– New hardware
• GPUs
– Cloud enabled
– Availability of Big Data
• 2009: Google builds self-driving car
• 2011: Watson wins Jeopardy
• 2014: Human vision surpassed by ML systems
Programs vs learning algorithms
[Figure: an algorithmic solution feeds data and a program into the computer to produce output; a machine learning solution feeds data and example outputs into the computer to produce a program (model).]
Machine Learning : Definition
• Learning is the ability to improve one's behaviour based on
experience.
• Build computer systems that automatically improve with
experience
• What are the fundamental laws that govern all learning
processes?
• Machine Learning explores algorithms that can
– learn from data / build a model from data
– use the model for prediction, decision making or solving
some tasks
Machine Learning : Definition
• A computer program is said to learn from
experience E with respect to some class of
tasks T and performance measure P, if its
performance at tasks in T, as measured by P,
improves with experience E.
[Mitchell]
Components of a learning problem
• Task: The behaviour or task being improved.
– For example: classification, acting in an
environment
• Data: The experiences that are being used to
improve performance in the task.
• Measure of improvement:
– For example: increasing accuracy in prediction, acquiring new skills, improved speed and efficiency
Black-box Learner
[Figure: Experiences/Data go into the Learner, which builds Models; a Reasoner uses the Models to address the Problem/Task.]
Many domains and applications
• Speech recognition
• Machine translation
Financial:
• predict if a stock will rise or fall
• predict if a user will click on an ad or not
Application in Business Intelligence
• Forecasting product sales quantities taking
seasonality and trend into account.
• Identifying cross selling promotional opportunities
for consumer goods.
• …
Some other applications
• Fraud detection: credit card providers
• Determine whether or not someone will default on a home mortgage
• Understand consumer sentiment based on unstructured text data
• Forecasting women's conviction rates based on external macroeconomic factors
Learner
[Figure: Experiences/Data go into the Learner, which builds Models; a Reasoner uses the Models to address the Problem/Task.]
• Components of Representation
– Features
– Function class / hypothesis language
Foundations of Machine Learning
Module 1: Introduction
Part B: Different types of learning
Sudeshna Sarkar
IIT Kharagpur
Module 1
1. Introduction
a) Introduction
b) Different types of learning
c) Hypothesis space, Inductive Bias
d) Evaluation, Training and test set, cross-validation
2. Linear Regression and Decision Trees
3. Instance based learning and Feature Selection
4. Probability and Bayes Learning
5. Support Vector Machines
6. Neural Network
7. Introduction to Computational Learning Theory
8. Clustering
Broad types of machine learning
• Supervised Learning
– X,y (pre-classified training examples)
– Given an observation x, what is the best label for y?
• Unsupervised learning
–X
– Given a set of x's, cluster or summarize them
• Semi-supervised Learning
• Reinforcement Learning
– Determine what to do based on rewards and punishments.
Supervised Learning
[Figure: labeled training pairs (Input1, Output1), (Input2, Output2), …, (Input-n, Output-n) are fed to the Learning Algorithm, which produces a Model; given a New Input x, the Model outputs a prediction y.]
Semi-supervised learning
Reinforcement Learning
[Figure: the Agent in state s_t takes action a_t; the Environment returns reward r_t and moves to state s_{t+1}.]
Reinforcement Learning
[Figure: the RLearner interacts with the Environment (state s_t, action a_t, reward r_t, next state s_{t+1}); Q-value updates produce a Policy that maps each state to the best action for the user.]
Supervised Learning
Given:
– a set of input features X₁, …, Xₙ
– a target feature Y
– a set of training examples where the values for the input
features and the target features are given for each
example
– a new example, where only the values for the input
features are given
Predict the values for the target features for the new
example.
– classification when Y is discrete
– regression when Y is continuous
Classification
Differentiating between
low-risk and high-risk
customers from their
income and savings
Regression
Example: price of a used car
• x: car attributes (e.g. mileage)
• y: price
• y = wx + w₀, or more generally y = g(x, θ), where g(·) is the model and θ its parameters
Features
• Often, the individual observations are analyzed
into a set of quantifiable properties which are
called features. May be
– categorical (e.g. "A", "B", "AB" or "O", for blood type)
– ordinal (e.g. "large", "medium" or "small")
– integer-valued (e.g. the number of words in a text)
– real-valued (e.g. height)
Example Data
Training Examples:
Action Author Thread Length Where
e1 skips known new long Home
e2 reads unknown new short Work
e3 skips unknown old long Work
e4 skips known old long home
e5 reads known new short home
e6 skips known old long work
New Examples:
e7 ??? known new short work
e8 ??? unknown new short work
Training and Testing
[Figure: in training, the Training Set is fed to the Learning Algorithm, which outputs a Hypothesis; in testing, the Hypothesis maps a new X to a predicted y.]
Training phase: Input → feature extractor → features; (features, Label) pairs → machine learning algorithm → classifier model.
Testing Phase: Input → feature extractor → features → classifier model → predicted Label.
Classification learning
• Task T:
– input:
– output:
• Performance metric P:
• Experience E:
Classification learning
• Task T:
– input: a set of instances d1,…,dn
• an instance has a set of features
• we can represent an instance as a vector d=<x1,…,xn>
– output: a set of predictions ŷ1,..., ŷn
• one of a fixed set of constant values:
– {+1,-1} or {cancer, healthy}, or {rose, hibiscus, jasmine, …}, or …
• Performance metric P:
• Experience E:
Classification Learning
Task | Instance | Labels
medical diagnosis | patient record: blood pressure diastolic, blood pressure systolic, age, sex (0 or 1), BMI, cholesterol | {-1,+1} = low / high risk of heart disease
finding entity names in text | a word in context: capitalized (0,1), word-after-this-equals-Inc, bigram-before-this-equals-acquired-by, … | {first, later, outside} = first word in name, second or later word in name, not in a name
image recognition | image: 1920×1080 pixels, each with a code for color | {0,1} = no house, house
Classification learning
(We care about performance on the distribution, not the training data.)
• Task T:
– input: a set of instances d1,…,dn
– output: a set of predictions ŷ1,..., ŷn
• Performance metric P:
– Prob (wrong prediction) on examples from D
• Experience E:
– a set of labeled examples (x,y) where y is the true
label for x
– ideally, examples should be sampled from some fixed
distribution D
Classification Learning
Task | Instance | Labels | Getting data
medical diagnosis | patient record: lab readings | risk of heart disease | wait and look for heart disease
finding entity names in text | a word in context: capitalized, nearby words, … | {first, later, outside} | text with manually annotated entities
image recognition | image: pixels | no house, house | hand-labeled images
Representations
1. Decision tree (example: splitting on Weekend)
2. Linear function
3. Multivariate linear function
Sudeshna Sarkar
IIT Kharagpur
Inductive Learning or Prediction
• Given examples of a function (X, F(X))
– Predict function F(X) for new examples X
• Classification
F(X) = Discrete
• Regression
F(X) = Continuous
• Probability estimation
F(X) = Probability(X):
Features
• Features: Properties that describe each
instance in a quantitative manner.
• Feature vector: n-dimensional vector of
features that represent some object
Feature Space
Example: the instance <0.5, 2.8, +> is a point in the 2-D feature space.
[Figure: scatter plots of labeled (+ and −) points in a 2-D feature space (axes roughly 0.0–3.0); query points to be classified are marked "?".]
Representations
2. Linear function
3. Multivariate linear function
4. Single layer perceptron
5. Multi-layer neural networks
Hypothesis Space
• The space of all hypotheses that can, in
principle, be output by a learning algorithm.
Sudeshna Sarkar
IIT Kharagpur
Experimental Evaluation of Learning
Algorithms
• Evaluating the performance of learning systems is
important because:
– Learning systems are usually designed to predict the class of future unlabeled data points.
• Typical choices for Performance Evaluation:
– Error
– Accuracy
– Precision/Recall
• Typical choices for Sampling Methods:
– Train/Test Sets
– K-Fold Cross-validation
Evaluating predictions
• Suppose we want to make a prediction of a
value for a target feature on example x:
– y is the observed value of target feature on
example x.
– is the predicted value of target feature on
example x.
– How is the error measured?
Measures of error
• Absolute error: (1/n) Σᵢ₌₁ⁿ |f(xᵢ) − yᵢ|
• Sum-of-squares error: (1/n) Σᵢ₌₁ⁿ (f(xᵢ) − yᵢ)²
• Number of misclassifications: (1/n) Σᵢ₌₁ⁿ δ(f(xᵢ), yᵢ)
Confusion matrix (hypothesized class vs. true class):
 | True Pos | True Neg
Yes | TP | FP
No | FN | TN
 | P = TP + FN | N = FP + TN
• Accuracy = (TP+TN)/(P+N)
• Precision = TP/(TP+FP)
• Recall / TP rate = TP/P
• FP rate = FP/N
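To make these definitions concrete, here is a minimal Python sketch (not part of the original slides; the example labels are made up) that computes the confusion-matrix counts and the derived metrics:

```python
# Minimal sketch: confusion-matrix counts and derived metrics for binary labels (1 = Pos, 0 = Neg).
def confusion_metrics(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    P, N = tp + fn, fp + tn
    return {
        "accuracy": (tp + tn) / (P + N),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / P if P else 0.0,      # TP rate
        "fp_rate": fp / N if N else 0.0,
    }

print(confusion_metrics([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))
```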
Sample Error and True Error
• The sample error of hypothesis f with respect to
target function c and data sample S is:
error_S(f) = (1/n) Σ_{x∈S} δ(f(x) ≠ c(x))
• The true error (denoted error_D(f)) of hypothesis f with respect to target function c and distribution D is the probability that f will misclassify an instance drawn at random according to D:
error_D(f) = Pr_{x∼D}[f(x) ≠ c(x)]
Why Errors
• Errors in learning are caused by:
– Limited representation (representation bias)
– Limited search (search bias)
– Limited data (variance)
– Limited features (noise)
Difficulties in evaluating hypotheses
with limited data
• Bias in the estimate: The sample error is a poor
estimator of true error
– ==> test the hypothesis on an independent test set
• We divide the examples into:
– Training examples that are used to train the learner
– Test examples that are used to evaluate the learner
• Variance in the estimate: The smaller the test set,
the greater the expected variance.
Validation set
Sudeshna Sarkar
IIT Kharagpur
Regression
• In regression the output is continuous
• Many models could be used – Simplest is linear
regression
– Fit data with the best hyper-plane which "goes
through" the points
[Figure: scatter plot with the dependent variable y (output) on the vertical axis and the input on the horizontal axis, with a fitted line.]
Types of Regression Models
• Simple regression (1 feature): linear or non-linear
• Multiple regression (2+ features): linear or non-linear
Linear regression
• Given an input x compute an
output y
• For example:
Y
- Predict height from age
- Predict house price from
house area
- Predict distance from wall
from sensors
X
Simple Linear Regression Equation
[Figure: regression line E(y) = β₀ + β₁x, with intercept b₀ and slope β₁.]
Linear Regression Model
Y=𝛽0 + 𝛽1 𝑥1 + 𝜖
House Number | Y: Actual Selling Price | X: House Size (100s ft²)
1 | 89.5 | 20.0
2 | 79.9 | 14.8
3 | 83.1 | 20.5
4 | 56.9 | 12.5
5 | 66.6 | 18.0
6 | 82.5 | 14.3
7 | 126.3 | 27.5
8 | 79.3 | 16.5
9 | 119.9 | 24.3
10 | 87.6 | 20.2
11 | 112.6 | 22.0
12 | 120.8 | 19.0
13 | 78.5 | 12.3
14 | 74.3 | 14.0
15 | 74.8 | 16.7
Averages | 88.84 | 18.17
(Sample of 15 houses from the region.)
House price vs size
Linear Regression – Multiple Variables
Yᵢ = b₀ + b₁X₁ + b₂X₂ + … + b_pX_p + e
Regression Model
• Our model assumes that
E(Y | X = x) = b₀ + b₁x   (the "population line")
• The fitted line Y = b₀ + b₁X is chosen to minimize the sum of squared errors
SSE = Σᵢ [yᵢ − (b₀ + b₁xᵢ)]²
• To find the values for the coefficients which minimize the
objective function we take the partial derivates of the
objective function (SSE) with respect to the coefficients. Set
these to 0, and solve.
b₁ = (n Σxy − Σx Σy) / (n Σx² − (Σx)²)
b₀ = (Σy − b₁ Σx) / n
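As an illustration (a sketch added here, not from the slides), these closed-form formulas can be applied directly to the 15-house sample above; the 12th house size is taken as 19.0, consistent with the stated column average:

```python
# Sketch: fit Y = b0 + b1*X by the closed-form least-squares formulas,
# using the house-price data from the slide (X in 100s of sq. ft, Y = selling price).
X = [20.0, 14.8, 20.5, 12.5, 18.0, 14.3, 27.5, 16.5, 24.3, 20.2, 22.0, 19.0, 12.3, 14.0, 16.7]
Y = [89.5, 79.9, 83.1, 56.9, 66.6, 82.5, 126.3, 79.3, 119.9, 87.6, 112.6, 120.8, 78.5, 74.3, 74.8]

n = len(X)
sx, sy = sum(X), sum(Y)
sxy = sum(x * y for x, y in zip(X, Y))
sxx = sum(x * x for x in X)

b1 = (n * sxy - sx * sy) / (n * sxx - sx ** 2)   # slope
b0 = (sy - b1 * sx) / n                          # intercept
print(b0, b1)   # fitted line: predicted price = b0 + b1 * size
```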
Multiple Linear Regression
Y = β₀ + β₁X₁ + β₂X₂ + … + βₙXₙ
h(x) = Σᵢ₌₀ⁿ βᵢxᵢ
• There is a closed form which requires matrix
inversion, etc.
• There are iterative techniques to find weights
– delta rule (also called LMS method) which will update
towards the objective of minimizing the SSE.
Linear Regression
h(x) = Σᵢ₌₀ⁿ βᵢxᵢ
Repeat {
  for i = 1 to m do
    βⱼ := βⱼ + α (y⁽ⁱ⁾ − h(x⁽ⁱ⁾)) xⱼ⁽ⁱ⁾   (for every j)
  end for
} until convergence
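A minimal Python sketch of this update rule (illustrative only: the 1-D data set, learning rate α and number of passes are arbitrary choices; x₀ = 1 supplies the intercept term):

```python
import random

# Sketch: stochastic gradient descent (LMS / delta rule) for linear regression.
# h(x) = sum_j beta_j * x_j, with x_0 = 1 for the intercept.
data = [([1.0, x], 2.0 * x + 1.0 + random.uniform(-0.1, 0.1)) for x in range(10)]
beta = [0.0, 0.0]
alpha = 0.01

for epoch in range(200):                       # "repeat until convergence"
    for x, y in data:                          # for i = 1 to m
        h = sum(b * xj for b, xj in zip(beta, x))
        for j in range(len(beta)):             # for every j
            beta[j] += alpha * (y - h) * x[j]

print(beta)   # should approach roughly [1.0, 2.0]
```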
Thank You
Delta Rule for Classification
[Figure: 1-D training data with class targets z ∈ {0, 1} plotted against x, fitted by the perceptron (step) and by the delta rule (line).]
• What would happen in this adjusted case for perceptron and delta rule, and where would the decision point (i.e. the 0.5 crossing) be?
Delta Rule for Classification
[Figure: the same 1-D data with linear fits.]
• What would happen if we were doing a regression fit with a sigmoid/logistic curve rather than a line?
Delta Rule for Classification
[Figure: the same data fitted with a sigmoid curve, which tracks the 0/1 targets closely.]
• Sigmoid fits many decision cases quite well! This is basically what logistic regression does.
Foundations of Machine Learning
Module 2: Linear Regression and Decision Tree
Part B: Introduction to Decision Tree
Sudeshna Sarkar
IIT Kharagpur
Definition
• A decision tree is a classifier in the form of a
tree structure with two types of nodes:
– Decision node: Specifies a choice or test of
some attribute, with one branch for each
outcome
– Leaf node: Indicates classification of an example
Decision Tree Example 1
Whether to approve a loan
[Figure: decision tree with root "Employed?"; the No branch tests "Credit Score?" (High / Low) and the Yes branch tests "Income?" (High / Low), with leaf nodes giving the classification. A second example tree on the reads/skips data has leaves with counts Skips 6 / Reads 2 and Skips 3 / Reads 7.]
Two Example DTs
Decision Tree for PlayTennis
• Attributes and their values:
– Outlook: Sunny, Overcast, Rain
– Humidity: High, Normal
– Wind: Strong, Weak
– Temperature: Hot, Mild, Cool
Decision Tree for PlayTennis
[Figure: tree with root Outlook (Sunny / Overcast / Rain); the Sunny branch tests Humidity, the Rain branch tests Wind, and the leaves are Yes / No.]
Decision Tree
Decision trees represent disjunctions of conjunctions:
(Outlook=Sunny ∧ Humidity=Normal)
∨ (Outlook=Overcast)
∨ (Outlook=Rain ∧ Wind=Weak)
Searching for a good tree
• How should you go about building a decision tree?
• The space of decision trees is too big for systematic
search.
• Stop and
– return a value for the target feature, or
– a distribution over target feature values
Sudeshna Sarkar
IIT Kharagpur
Top-Down Induction of Decision Trees ID3
Principled Criterion
• Selection of an attribute to test at each node -
choosing the most useful attribute for classifying
examples.
• information gain
– measures how well a given attribute separates the training
examples according to their target classification
– This measure is used to select among the candidate
attributes at each step while growing the tree
– Gain is a measure of how much we can reduce uncertainty (its value lies between 0 and 1)
Entropy
• A measure for
– uncertainty
– purity
– information content
• Information theory: optimal length code assigns (- log2p) bits
to message having probability p
• S is a sample of training examples
– p+ is the proportion of positive examples in S
– p- is the proportion of negative examples in S
• Entropy of S: average optimal number of bits to encode
information about certainty/uncertainty about S
Entropy(S) = p₊(−log₂ p₊) + p₋(−log₂ p₋) = −p₊ log₂ p₊ − p₋ log₂ p₋
Entropy
Gain(S, Humidity) = 0.940 − (7/14)·0.985 − (7/14)·0.592 = 0.151
Gain(S, Wind) = 0.940 − (8/14)·0.811 − (6/14)·1.0 = 0.048
Humidity provides greater info. gain than Wind, w.r.t target classification.
Selecting the Next Attribute
S=[9+,5-]
E=0.940
Outlook
Gain(S, Outlook)
= 0.940 − (5/14)·0.971 − (4/14)·0.0 − (5/14)·0.971
= 0.247
Selecting the Next Attribute
The information gain values for the 4 attributes are:
• Gain(S,Outlook) =0.247
• Gain(S,Humidity) =0.151
• Gain(S,Wind) =0.048
• Gain(S,Temperature) =0.029
Note: 0Log20 =0
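A short Python sketch reproducing these gain values; the per-value class counts (e.g. Humidity=High → [3+,4−]) follow the standard PlayTennis data of Mitchell, which is an assumption beyond what the slide prints:

```python
from math import log2

def entropy(pos, neg):
    """Entropy of a sample with pos positive and neg negative examples (0*log2(0) = 0)."""
    total = pos + neg
    e = 0.0
    for c in (pos, neg):
        p = c / total
        if p > 0:
            e -= p * log2(p)
    return e

S = (9, 5)                                   # whole sample: [9+, 5-], entropy ~0.940

def gain(subsets):
    n = sum(p + q for p, q in subsets)
    return entropy(*S) - sum((p + q) / n * entropy(p, q) for p, q in subsets)

print(gain([(3, 4), (6, 1)]))                # Humidity: High [3+,4-], Normal [6+,1-] -> 0.151
print(gain([(6, 2), (3, 3)]))                # Wind: Weak [6+,2-], Strong [3+,3-]     -> 0.048
print(gain([(2, 3), (4, 0), (3, 2)]))        # Outlook: Sunny, Overcast, Rain         -> 0.247
```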
ID3 Algorithm
[Figure: partially grown tree — root Outlook over examples [D1, D2, …, D14] = [9+, 5−]; the Overcast branch is a Yes leaf, and the other branches (marked "?") still need a test chosen for the node.]
GINI(node) = 1 − Σ_{c ∈ classes} [p(c)]²
GINI_split(A) = Σ_{v ∈ Values(A)} (|S_v| / |S|) · GINI(N_v)
Splitting Based on Continuous Attributes
[Figure: binary split "Taxable Income > 80K?" (Yes / No) vs. a multi-way split "Taxable Income?" into ranges (< 10K, …, > 80K).]
A_c = true if A ≥ c, false otherwise
How to choose c?
• Consider all possible splits and find the best cut
Practical Issues of Classification
• Underfitting and Overfitting
• Missing Values
• Costs of Classification
Hypothesis Space Search in Decision Trees
• Conduct a search of the space of decision trees which
can represent all possible discrete functions.
Bias and Occam’s Razor
Prefer short hypotheses.
Argument in favor:
– Fewer short hypotheses than long hypotheses
– A short hypothesis that fits the data is unlikely to
be a coincidence
– A long hypothesis that fits the data might be a
coincidence
Foundations of Machine Learning
Module 2: Linear Regression and Decision Tree
Part D: Overfitting
Sudeshna Sarkar
IIT Kharagpur
Overfitting
• Learning a tree that classifies the training data
perfectly may not lead to the tree with the best
generalization performance.
– There may be noise in the training data
– May be based on insufficient data
• A hypothesis h is said to overfit the training data
if there is another hypothesis, h’, such that h has
smaller error than h’ on the training data but h
has larger error on the test data than h’.
Overfitting
[Figure: accuracy vs. complexity of the tree — accuracy on training data keeps increasing while accuracy on testing data peaks and then falls.]
Underfitting and Overfitting (Example)
500 circular and 500
triangular data points.
Circular points: 0.5 ≤ sqrt(x₁² + x₂²) ≤ 1
Triangular points: sqrt(x₁² + x₂²) > 1 or sqrt(x₁² + x₂²) < 0.5
Underfitting and Overfitting
Overfitting
Underfitting: when model is too simple, both training and test errors are large
Overfitting due to Noise
Lack of data points makes it difficult to correctly predict the class labels of that region
Notes on Overfitting
• Overfitting results in decision trees that are more
complex than necessary
Pre-Pruning (Early Stopping)
• Evaluate splits before installing them:
– Don’t install splits that don’t look worthwhile
– when no worthwhile splits to install, done
Pre-Pruning (Early Stopping)
• Typical stopping conditions for a node:
– Stop if all instances belong to the same class
– Stop if all the attribute values are the same
• More restrictive conditions:
– Stop if number of instances is less than some user-specified
threshold
– Stop if the class distribution of instances is independent of the available features (e.g., using a χ² test)
– Stop if expanding the current node does not improve impurity
measures (e.g., Gini or information gain).
Reduced-error Pruning
• A post-pruning, cross validation approach
- Partition training data into “grow” set and “validation” set.
- Build a complete tree for the “grow” data
- Until accuracy on validation set decreases, do:
For each non-leaf node in the tree
Temporarily prune the tree below; replace it by majority vote
Test the accuracy of the hypothesis on the validation set
Permanently prune the node with the greatest increase
in accuracy on the validation set.
• Problem: Uses less data to construct the tree
• Sometimes done at the rules level
15
Triple Trade-Off
• There is a trade-off between three factors:
– Complexity of H, c (H),
– Training set size, N,
– Generalization error, E, on new data
• As N increases, E decreases
• As c (H) increases, first E decreases and then E increases
• As c (H) increases, the training error decreases for some time
and then stays constant (frequently at 0)
16
Notes on Overfitting
• overfitting happens when a model is capturing
idiosyncrasies of the data rather than generalities.
– Often caused by too many parameters relative to the
amount of training data.
– E.g. an order-N polynomial can intersect any N+1 data
points
Dealing with Overfitting
• Use more data
• Use a tuning set
• Regularization
• Be a Bayesian
18
Regularization
• In a linear regression model overfitting is
characterized by large weights.
19
Penalize large weights in Linear Regression
• Introduce a penalty term in the loss function.
Regularized Regression
1. L2-regularization (Ridge Regression): minimize Σᵢ (yᵢ − β·xᵢ)² + λ Σⱼ βⱼ²
2. L1-regularization (Lasso): minimize Σᵢ (yᵢ − β·xᵢ)² + λ Σⱼ |βⱼ|
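A minimal numpy sketch of ridge (L2-regularized) regression via the regularized normal equations; the toy data, the value of λ and the choice not to penalize the bias term are illustrative assumptions:

```python
import numpy as np

# Sketch: L2-regularized (ridge) linear regression via the closed form
#   beta = (X^T X + lambda * I)^{-1} X^T y
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(20), rng.normal(size=(20, 3))])   # bias column + 3 features
true_beta = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ true_beta + 0.1 * rng.normal(size=20)

lam = 0.1
I = np.eye(X.shape[1])
I[0, 0] = 0.0                       # conventionally do not penalize the bias term
beta_ridge = np.linalg.solve(X.T @ X + lam * I, X.T @ y)
print(beta_ridge)                    # weights shrunk towards zero relative to ordinary least squares
```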
Foundations of Machine Learning
Module 3: Instance Based Learning and Feature
Selection
Part A: Instance Based Learning
Sudeshna Sarkar
IIT Kharagpur
Instance-Based Learning
• One way of solving tasks of approximating
discrete or real valued target functions
• Have training examples: (xn, f(xn)), n=1..N.
• Key idea:
– just store the training examples
– when a test example is given then find the closest
matches
2
Inductive Assumption
• Training method:
– Save the training examples
• At prediction time:
– Find the k training examples (x1,y1),…(xk,yk) that
are closest to the test example x
– Predict the most frequent class among those yi’s.
What is the decision boundary?
Voronoi diagram
5
Basic k-nearest neighbor classification
• Training method:
– Save the training examples
• At prediction time:
– Find the k training examples (x1,y1),…(xk,yk) that are closest
to the test example x
– Predict the most frequent class among those yi’s.
• Improvements:
– Weighting examples from the neighborhood
– Measuring “closeness”
– Finding “close” examples in a large training set quickly
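A minimal Python sketch of this procedure (illustrative only: Euclidean distance, majority vote, and a tiny made-up training set):

```python
from collections import Counter
from math import dist      # Euclidean distance, Python 3.8+

# Sketch: basic k-nearest-neighbour classification.
train = [((1.0, 1.0), '+'), ((1.2, 0.8), '+'), ((3.0, 3.2), '-'),
         ((2.9, 3.0), '-'), ((1.1, 1.3), '+')]

def knn_predict(x, k=3):
    neighbours = sorted(train, key=lambda ex: dist(ex[0], x))[:k]   # k closest examples
    return Counter(label for _, label in neighbours).most_common(1)[0][0]

print(knn_predict((1.0, 1.1)))   # '+'
print(knn_predict((3.0, 3.0)))   # '-'
```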
k-Nearest Neighbor
Dist(c₁, c₂) = sqrt( Σᵢ₌₁ᴺ (attrᵢ(c₁) − attrᵢ(c₂))² )
k NearestNeighbors = the k cases with minimum Dist(cᵢ, c_test)
prediction_test = (1/k) Σᵢ₌₁ᵏ classᵢ   (or (1/k) Σᵢ₌₁ᵏ valueᵢ)
[Figure: + and o points in (attribute_1, attribute_2) space, illustrating noise in class labels and partially overlapping classes.]
How to choose “k”
• Large k:
– less sensitive to noise (particularly class noise)
– better probability estimates for discrete classes
– larger training sets allow larger values of k
• Small k:
– captures fine structure of problem space better
– may be necessary with small training sets
• Balance must be struck between large and small k
• As training set approaches infinity, and k grows large, kNN
becomes Bayes optimal
[Figures from Hastie, Tibshirani & Friedman (2001), pp. 418–419.]
Distance-Weighted kNN
prediction_test = ( Σᵢ₌₁ᵏ wᵢ · classᵢ ) / ( Σᵢ₌₁ᵏ wᵢ )   (or ( Σᵢ₌₁ᵏ wᵢ · valueᵢ ) / ( Σᵢ₌₁ᵏ wᵢ ))
wₖ = 1 / Dist(cₖ, c_test)
Locally Weighted Averaging
• Let k = number of training points
• Let weight fall-off rapidly with distance
prediction_test = ( Σᵢ₌₁ᵏ wᵢ · classᵢ ) / ( Σᵢ₌₁ᵏ wᵢ )   (or ( Σᵢ₌₁ᵏ wᵢ · valueᵢ ) / ( Σᵢ₌₁ᵏ wᵢ ))
wₖ = e^(−KernelWidth · Dist(cₖ, c_test))
D(c₁, c₂) = sqrt( Σᵢ₌₁ᴺ (attrᵢ(c₁) − attrᵢ(c₂))² )
• assumes spherical classes (equal variance along each attribute)
[Figure: + and o classes in (attribute_1, attribute_2) space.]
Euclidean Distance?
[Figure: two datasets in (attribute_1, attribute_2) space whose classes are separated along different attributes, motivating attribute weights.]
D(c₁, c₂) = sqrt( Σᵢ₌₁ᴺ wᵢ (attrᵢ(c₁) − attrᵢ(c₂))² )
• as number of dimensions increases, distance between points becomes larger and more uniform
• if number of relevant attributes is fixed, increasing the number of less relevant attributes may swamp
distance
• when more irrelevant than relevant dimensions, distance becomes less reliable
• solutions: larger k or KernelWidth, feature selection, feature weights, more complex distance functions
19
K-NN and irrelevant features
[Figure: labeled + and o points with a query point "?", plotted using the relevant features.]
K-NN and irrelevant features
[Figure: the same points with an additional irrelevant feature; the query's nearest neighbours change.]
Ways of rescaling for KNN
Normalized L1 distance:
Scale by IG:
22
Ways of rescaling for KNN
Dot product:
Cosine distance:
(The term weights combine the number of occurrences of term i in doc j with the number of docs in the corpus and the number of docs in the corpus that contain term i, i.e. a TF-IDF style weighting.)
Combining distances to neighbors
Standard KNN: ŷ = argmax_y C(y, Neighbors(x)), where C(y, D′) = |{(x′, y′) ∈ D′ : y′ = y}|
Distance-weighted KNN:
• Curse of Dimensionality:
– often works best with 25 or fewer dimensions
• Run-time cost scales with training set size
• Large training sets will not fit in memory
• Many MBL methods are strict averagers
• Sometimes doesn’t seem to perform as well as other methods
such as neural nets
• Predicted values for regression not continuous
Foundations of Machine Learning
Module 3: Instance Based Learning and
Feature Reduction
Sudeshna Sarkar
IIT Kharagpur
Feature Reduction in ML
- The information about the target class is inherent
in the variables.
- Naïve view:
More features
=> More information
=> More discrimination power.
- In practice:
many reasons why this is not the case!
Curse of Dimensionality
• number of training examples is fixed
=> the classifier’s performance usually will
degrade for a large number of features!
Feature Reduction in ML
- Irrelevant and
- redundant features
- can confuse learners.
Feature Extraction
• Project the original xi , i =1,...,d dimensions to new
𝑘 < 𝑑 dimensions, zj , j =1,...,k
7
Feature Selection Steps
Feature selection is an
optimization problem.
o Step 1: Search the space
of possible feature
subsets.
o Step 2: Pick the subset
that is optimal or near-
optimal with respect to
some objective function.
Feature Selection Steps (cont’d)
Search strategies
– Optimum
– Heuristic
– Randomized
Evaluation strategies
- Filter methods
- Wrapper methods
Evaluating feature subset
• Supervised (wrapper method)
– Train using selected subset
– Estimate error on validation dataset
From Wikipedia
Signal to noise ratio
• Difference in means divided by difference in
standard deviation between the two classes
Sudeshna Sarkar
IIT Kharagpur
Feature extraction - definition
• Given a set of features 𝐹 = {𝑥1 , … , 𝑥𝑁 }
the Feature Extraction ("Construction") problem is to map 𝐹 to some feature set 𝐹′ that maximizes the learner's ability to classify patterns
Feature Extraction
𝒛 = 𝑤𝑇𝒙
PCA
• Assume that N features are linear
combination of M < 𝑁 vectors
zᵢ = wᵢ₁x₁ + ⋯ + w_iN x_N
• What we expect from such basis
– Uncorrelated or otherwise can be reduced further
– Have large variance (e.g. 𝑤𝑖1 have large variation)
or otherwise bear no information
Geometric picture of principal components (PCs)
Geometric picture of principal components (PCs)
Geometric picture of principal components (PCs)
Algebraic definition of PCs
Given a sample of p observations on a vector of N variables
x₁, x₂, …, x_p
[Figure: reconstructions of an image using p = 16, 32, 64, 100 principal components, compared with the original image.]
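A minimal numpy sketch of PCA by eigendecomposition of the sample covariance matrix (illustrative: the data are random and M = 2 is an arbitrary choice):

```python
import numpy as np

# Sketch: project N-dimensional data onto the top-M principal components, z = W^T (x - mean).
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5))   # 100 samples, N = 5 correlated features

Xc = X - X.mean(axis=0)                    # centre the data
cov = np.cov(Xc, rowvar=False)             # N x N covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)     # eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]          # sort by decreasing variance

M = 2
W = eigvecs[:, order[:M]]                  # projection matrix, N x M
Z = Xc @ W                                 # M-dimensional representation of each sample
print(Z.shape, eigvals[order][:M])
```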
Is PCA a good criterion for classification?
• Data variation
determines the
projection direction
• What’s missing?
– Class information
What is a good projection?
[Figure: a projection direction along which the two classes overlap, vs. one with a large between-class distance.]
• Similarly, what is a good criterion?
– Separating different classes
What class information may be useful?
• Between-class distance
– Distance between the centroids of
different classes
• Within-class distance
• Accumulated distance of an instance
to the centroid of its class
J(w) = |m₁ − m₂|² / (s₁² + s₂²)
Multiple Classes
• For 𝑐 classes, compute 𝑐 − 1 discriminants, project
N-dimensional features into 𝑐 − 1 space.
Foundations of Machine Learning
Module 3: Instance Based Learning and
Feature Reduction
Sudeshna Sarkar
IIT Kharagpur
Feature extraction - definition
• Given a set of features 𝐹 = {𝑥1 , … , 𝑥𝑁 }
the Feature Extraction ("Construction") problem is to map 𝐹 to some feature set 𝐹′ that maximizes the learner's ability to classify patterns
Feature Extraction
• Find a projection matrix w from N-dimensional to M-
dimensional vectors that keeps error low
• Assume that N features are linear combination of M < 𝑁
vectors
zᵢ = wᵢ₁x₁ + ⋯ + w_iN x_N
𝒛 = 𝑤𝑇𝒙
x₁, x₂, …, x_p
[Figure: reconstructions of an image using p = 16, 32, 64, 100 principal components, compared with the original image.]
Is PCA a good criterion for classification?
• Data variation
determines the
projection direction
• What’s missing?
– Class information
What is a good projection?
[Figure: a projection direction along which the two classes overlap, vs. one with a large between-class distance.]
• Similarly, what is a good criterion?
– Separating different classes
What class information may be useful?
• Between-class distance
– Distance between the centroids of
different classes
• Within-class distance
• Accumulated distance of an instance
to the centroid of its class
J(w) = |m₁ − m₂|² / (s₁² + s₂²)
Thank You
Foundations of Machine Learning
Module 3: Instance based Learning
and Feature Reduction
Part D: Collaborative Filtering
Sudeshna Sarkar
IIT Kharagpur
Recommender Systems
• Item Prediction: predict a ranked list of items that a user is likely to buy or use.
• Rating Prediction: predict the rating score that a user is likely to give to an item that s/he has not seen or used before. E.g.,
– rating on an unseen movie. In this case, the utility of item s to user u is the rating given to s by u.
3
Recommender Systems
• Content based :
– recommend items similar to the ones the
user preferred in the past
• Collaborative filtering:
– Look at what similar users liked
– Similar users = Similar likes and dislikes
Collaborative Filtering
• Present each user with a vector of ratings
• Two types:
– Yes / No
– Explicit Ratings
• Predict Rating by User-based Nearest
Neighbour
Collaborative Filtering for Rating
Prediction
• User-based Nearest Neighbour
– Neighbour = similar users
– Generate a prediction for an item i by analyzing
ratings for i from users in u’s neighbourhood
Neighborhood formation phase
• Let the record (or profile) of the target user be u (represented as a vector), and the record of another user be v (v ∈ T).
• The similarity between the target user, u, and a neighbor, v, can be calculated using the Pearson's correlation coefficient:
sim(u, v) = Σ_{i∈C} (r_{u,i} − r̄_u)(r_{v,i} − r̄_v) / ( sqrt(Σ_{i∈C} (r_{u,i} − r̄_u)²) · sqrt(Σ_{i∈C} (r_{v,i} − r̄_v)²) )
where C is the set of items rated by both u and v.
Recommendation Phase
• Use the following formula to compute the rating
prediction of item i for target user u
p(u, i) = r̄_u + ( Σ_{v∈V} sim(u, v) · (r_{v,i} − r̄_v) ) / ( Σ_{v∈V} |sim(u, v)| )
where V is the set of k similar users and r_{v,i} is the rating of user v given to item i.
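A small Python sketch of user-based prediction with Pearson similarity (illustrative: the toy ratings matrix, k, and the choice of which mean ratings to use are assumptions):

```python
import numpy as np

# Sketch: user-based collaborative filtering with Pearson similarity.
# Rows = users, columns = items; np.nan marks an unrated item. Toy ratings matrix.
R = np.array([[5, 3, 4, 4, np.nan],
              [3, 1, 2, 3, 3],
              [4, 3, 4, 3, 5],
              [3, 3, 1, 5, 4],
              [1, 5, 5, 2, 1]], dtype=float)

def pearson(u, v):
    common = ~np.isnan(R[u]) & ~np.isnan(R[v])            # items C rated by both users
    ru, rv = R[u, common], R[v, common]
    du, dv = ru - ru.mean(), rv - rv.mean()
    denom = np.sqrt((du ** 2).sum()) * np.sqrt((dv ** 2).sum())
    return (du * dv).sum() / denom if denom else 0.0

def predict(u, i, k=2):
    others = [v for v in range(len(R)) if v != u and not np.isnan(R[v, i])]
    V = sorted(others, key=lambda v: pearson(u, v), reverse=True)[:k]   # k most similar users
    ru_bar = np.nanmean(R[u])
    num = sum(pearson(u, v) * (R[v, i] - np.nanmean(R[v])) for v in V)
    den = sum(abs(pearson(u, v)) for v in V)
    return ru_bar + num / den if den else ru_bar

print(predict(0, 4))   # predicted rating of user 0 for item 4
```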
Issue with the user-based kNN CF
• The problem with the user-based formulation
of collaborative filtering is the lack of
scalability:
– it requires the real-time comparison of the target
user to all user records in order to generate
predictions.
• A variation of this approach that remedies this
problem is called item-based CF.
sim(i, j) = Σ_{u∈U} (r_{u,i} − r̄_u)(r_{u,j} − r̄_u) / ( sqrt(Σ_{u∈U} (r_{u,i} − r̄_u)²) · sqrt(Σ_{u∈U} (r_{u,j} − r̄_u)²) )
p(u, i) = Σ_{j∈J} r_{u,j} · sim(i, j) / Σ_{j∈J} sim(i, j)
where J is the set of k similar items
Thank You
Foundations of Machine Learning
Module 4:
Part A: Probability Basics
Sudeshna Sarkar
IIT Kharagpur
• Probability is the study of randomness and
uncertainty.
• A random experiment is a process whose
outcome is uncertain.
Examples:
– Tossing a coin once or several times
– Tossing a die
– Tossing a coin until one gets Heads
– ...
Events and Sample Spaces
Sample Space
The sample space is the set of all possible outcomes.
Event
An event is any
Simple Events collection of one or
The individual outcomes more simple events
are called simple events. 3
Sample Space
• Sample space Ω : the set of all the possible
outcomes of the experiment
– If the experiment is a roll of a six-sided die, then the
natural sample space is {1, 2, 3, 4, 5, 6}
– Suppose the experiment consists of tossing a coin
three times.
Ω = {(ℎℎℎ, ℎℎ𝑡, ℎ𝑡ℎ, ℎ𝑡𝑡, 𝑡ℎℎ, 𝑡ℎ𝑡, 𝑡𝑡ℎ, 𝑡𝑡𝑡}
– the experiment is the number of customers that
arrive at a service desk during a fixed time period, the
sample space should be the set of nonnegative
integers: Ω = 𝑍 + = 0, 1, 2, 3, …
Events
• Events are subsets of the sample space
o A= {the outcome that the die is even} ={2,4,6}
o B = {exactly two tosses come out tails} = {htt, tht, tth}
o C = {at least two heads} = {hhh, hht, hth, thh}
Probability
• A Probability is a number assigned to each
event in the sample space.
• Axioms of Probability:
– For any event A, 0 ≤ P(A) ≤ 1.
– P(Ω) = 1 and P(∅) = 0
– If A₁, A₂, …, Aₙ is a partition of A, then
P(A) = P(A₁) + P(A₂) + … + P(Aₙ)
Properties of Probability
• For any event A, P(Aᶜ) = 1 − P(A).
• If A ⊆ B, then P(A) ≤ P(B).
• For any two events A and B,
P(A ∪ B) = P(A) + P(B) − P(A ∩ B).
For three events, A, B, and C,
P(A ∪ B ∪ C) = P(A) + P(B) + P(C) − P(A∩B) − P(A∩C) − P(B∩C) + P(A∩B∩C)
Intuitive Development (agrees with axioms)
• Intuitively, the probability of an event a could be defined as:
P(a) = N(a)/N, the fraction of the N repetitions of the experiment in which a occurs, as N becomes large.
Random Variable
• A random variable is a function defined on the
sample space Ω
– maps the outcome of a random event into real
scalar values
X(w)
w
Discrete Random Variables
• Random variables (RVs) which may take on only a
countable number of distinct values
– e.g., the sum of the value of two dies
P(X = xᵢ) = Σⱼ P(X = xᵢ, Y = yⱼ)
          = Σⱼ P(X = xᵢ | Y = yⱼ) P(Y = yⱼ)
Marginalization
Marginalization:
P(X = xᵢ) = Σⱼ P(X = xᵢ, Y = yⱼ) = Σⱼ P(X = xᵢ | Y = yⱼ) P(Y = yⱼ)
Conditional probability:
P(X = x | Y = y) = P(X = x, Y = y) / P(Y = y)
Bayes rule:
P(X = xᵢ | Y = yⱼ) = P(Y = yⱼ | X = xᵢ) P(X = xᵢ) / Σₖ P(Y = yⱼ | X = xₖ) P(X = xₖ)
Independent RVs
P(X = x, Y = y) = P(X = x) P(Y = y), i.e. P(Y = y | X = x) = P(Y = y)
P(X = x, Y = y | Z = z) = P(X = x | Z = z) P(Y = y | Z = z)
More on Conditional Independence
P(X = x, Y = y | Z = z) = P(X = x | Z = z) P(Y = y | Z = z)
P(X = x | Y = y, Z = z) = P(X = x | Z = z)
P(Y = y | X = x, Z = z) = P(Y = y | Z = z)
Continuous Random Variables
• What if X is continuous?
• Probability density function (pdf) instead of
probability mass function (pmf)
• A pdf is any function 𝑓(𝑥) that describes the
probability density in terms of the input
variable x.
PDF
• Properties of a pdf:
– f(x) ≥ 0, ∀x
– ∫ f(x) dx = 1
– f(x) ≤ 1 ???
• Actual probability can be obtained by taking the integral of the pdf
– E.g. the probability of X being between 0 and 1 is
P(0 ≤ X ≤ 1) = ∫₀¹ f(x) dx
Cumulative Distribution Function
• F_X(v) = P(X ≤ v)
• Discrete RVs:
– F_X(v) = Σ_{vᵢ ≤ v} P(X = vᵢ)
• Continuous RVs:
– F_X(v) = ∫_{−∞}^{v} f(x) dx
– (d/dx) F_X(x) = f(x)
Common Distributions
• Normal 𝑋~𝑁(𝜇, 𝜎 2 )
– f(x) = (1 / (σ√(2π))) exp( −(x − μ)² / (2σ²) ),  −∞ < x < ∞
– E.g. the height of the entire population
[Figure: the standard normal density curve for x from −5 to 5, peaking at about 0.4 at x = 0.]
Multivariate Normal
• Generalization to higher dimensions of the
one-dimensional normal
f_X(x₁, …, x_d) = (1 / ((2π)^{d/2} |Σ|^{1/2})) exp( −(1/2) (x − μ)ᵀ Σ⁻¹ (x − μ) )
where μ is the mean vector and Σ the covariance matrix.
Mean and Variance
• Mean (Expectation): μ = E[X]
– Discrete RVs: E[X] = Σ_{vᵢ} vᵢ P(X = vᵢ)
– Continuous RVs: E[X] = ∫ x f(x) dx
• Variance: V(X) = E[(X − μ)²]
– Discrete RVs: V(X) = Σ_{vᵢ} (vᵢ − μ)² P(X = vᵢ)
– Continuous RVs: V(X) = ∫ (x − μ)² f(x) dx
Mean Estimation from Samples
• Given a set of N samples from a distribution,
we can estimate the mean of the distribution by:
μ̂ = (1/N) Σᵢ₌₁ᴺ xᵢ
Variance Estimation from Samples
• Given a set of N samples from a distribution,
we can estimate the variance of the
distribution by:
29
Thank You
Foundations of Machine Learning
Module 4:
Part B: Bayesian Learning
Sudeshna Sarkar
IIT Kharagpur
Probability for Learning
• Probability for classification and modeling
concepts.
• Bayesian probability
– Notion of probability interpreted as partial belief
• Bayesian Estimation
– Calculate the validity of a proposition
• Based on prior estimate of its probability
• and New relevant evidence
Bayes Theorem
• Goal: To determine the most probable hypothesis,
given the data D plus any initial knowledge about the
prior probabilities of the various hypotheses in H.
Bayes Theorem
Bayes Rule: P(h | D) = P(D | h) P(h) / P(D)
• P(h) = prior probability of hypothesis h
• P(D) = prior probability of training data D
• P(h|D) = probability of h given D (posterior density )
• P(D|h) = probability of D given h (likelihood of D given h)
An Example
Does patient have cancer or not?
A patient takes a lab test and the result comes back positive. The test
returns a correct positive result in only 98% of the cases in which the
disease is actually present, and a correct negative result in only 97% of
the cases in which the disease is not present. Furthermore, .008 of the
entire population have this cancer.
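Working the numbers with Bayes' rule (a sketch added here; "positive test" plays the role of the data D):

```python
# Sketch: posterior probability of cancer given a positive test, by Bayes' rule.
p_cancer = 0.008                      # prior: 0.8% of the population
p_pos_given_cancer = 0.98             # correct positive rate (sensitivity)
p_pos_given_no_cancer = 1 - 0.97      # false positive rate (1 - specificity)

p_pos = (p_pos_given_cancer * p_cancer +
         p_pos_given_no_cancer * (1 - p_cancer))          # P(D)
p_cancer_given_pos = p_pos_given_cancer * p_cancer / p_pos
print(p_cancer_given_pos)   # ~0.21: the MAP hypothesis is "no cancer" even after a positive test
```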
MAP and Maximum Likelihood (ML) Hypotheses
h_MAP = argmax_{h∈H} P(h | D)
      = argmax_{h∈H} P(D | h) P(h) / P(D)
      = argmax_{h∈H} P(D | h) P(h)
h_ML = argmax_{h∈H} P(D | h)   (the MAP hypothesis when all hypotheses are equally probable a priori)
Comments:
Computational intensive
Providing a standard for judging the performance of
learning algorithms
Choosing P(h) and P(D|h) reflects our prior
knowledge about the learning task
Maximum likelihood and least-squared error
• Learn a Real-Valued Function:
• Consider any real-valued target function f.
• Training examples (xi,di) are assumed to have Normally
distributed noise ei with zero mean and variance σ2, added
to the true target value f(xi),
dᵢ ~ N(f(xᵢ), σ²)
Assume that ei is drawn independently for each xi .
Compute the ML Hypothesis
h_ML = argmax_{h∈H} Σᵢ₌₁ᵐ [ −ln(σ√(2π)) − (dᵢ − h(xᵢ))² / (2σ²) ]
     = argmin_{h∈H} Σᵢ₌₁ᵐ (dᵢ − h(xᵢ))²
Bayes Optimal Classifier
Question: Given new instance x, what is its most probable classification?
• ℎ𝑀𝐴𝑃 (𝑥) is not the most probable classification!
Example: Let P(h1|D) = .4, P(h2|D) = .3, P(h3 |D) =.3
Given new data x, we have h1(x)=+, h2(x) = -, h3(x) = -
What is the most probable classification of x ?
Bayes optimal classification:
argmax_{vⱼ∈V} Σ_{hᵢ∈H} P(vⱼ | hᵢ) P(hᵢ | D)
where V is the set of all the values a classification can take and vⱼ is one possible such classification.
Example:
P(h₁|D) = .4, P(−|h₁) = 0, P(+|h₁) = 1
P(h₂|D) = .3, P(−|h₂) = 1, P(+|h₂) = 0
P(h₃|D) = .3, P(−|h₃) = 1, P(+|h₃) = 0
Σ_{hᵢ∈H} P(+|hᵢ) P(hᵢ|D) = .4 and Σ_{hᵢ∈H} P(−|hᵢ) P(hᵢ|D) = .6, so the Bayes optimal classification is −.
Gibbs Algorithm
• Bayes optimal classifier is quite computationally
expensive, if H contains a large number of
hypotheses.
• An alternative, less optimal classifier Gibbs algorithm,
defined as follows:
1. Choose a hypothesis randomly according to
P(h|D), where D is the posterior probability
distribution over H.
2. Use it to classify new instance
13
Error for Gibbs Algorithm
• Surprising fact: Assume the expected value is taken over target concepts drawn at random, according to the prior probability distribution assumed by the learner; then (Haussler et al. 1994)
E[error_Gibbs] ≤ 2 · E[error_BayesOptimal]
Module 4:
Part C: Naïve Bayes
Sudeshna Sarkar
IIT Kharagpur
Bayes Theorem
P(h | D) = P(D | h) P(h) / P(D)
Naïve Bayes
• Bayes classification
P(Y | X) ∝ P(X | Y) P(Y) = P(X₁, …, Xₙ | Y) P(Y)
Difficulty: learning the joint probability P(X₁, …, Xₙ | Y)
• Naïve Bayes classification
Assume all input features are conditionally independent!
P(X₁, X₂, …, Xₙ | Y) = P(X₁ | X₂, …, Xₙ, Y) P(X₂, …, Xₙ | Y)
                     = P(X₁ | Y) P(X₂, …, Xₙ | Y)
                     = P(X₁ | Y) P(X₂ | Y) ⋯ P(Xₙ | Y)
Naïve Bayes
Bayes rule:
• Classify (Xnew)
7
Example
Learning Phase
Outlook Play=Yes Play=No Temperature Play=Yes Play=No
Sunny 2/9 3/5 Hot 2/9 2/5
Overcast 4/9 0/5 Mild 4/9 2/5
Rain 3/9 2/5 Cool 3/9 1/5
8
Example
Test Phase
– Given a new instance, predict its label
x’=(Outlook=Sunny, Temperature=Cool, Humidity=High, Wind=Strong)
– Look up tables achieved in the learning phrase
P(Outlook=Sunny|Play=No) = 3/5
P(Outlook=Sunny|Play=Yes) = 2/9
P(Temperature=Cool|Play=No) = 1/5
P(Temperature=Cool|Play=Yes) = 3/9
P(Humidity=High|Play=No) = 4/5
P(Humidity=High|Play=Yes) = 3/9
P(Wind=Strong|Play=No) = 3/5
P(Wind=Strong|Play=Yes) = 3/9
P(Play=No) = 5/14
P(Play=Yes) = 9/14
– Decision making with the MAP rule
P(Yes|x’) ≈ [P(Sunny|Yes)P(Cool|Yes)P(High|Yes)P(Strong|Yes)]P(Play=Yes) = 0.0053
P(No|x’) ≈ [P(Sunny|No) P(Cool|No)P(High|No)P(Strong|No)]P(Play=No) = 0.0206
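The same MAP computation in a few lines of Python (a sketch using the probabilities listed above):

```python
# Sketch: MAP decision for x' = (Sunny, Cool, High, Strong) using the table look-ups above.
p_yes = 9 / 14 * (2 / 9) * (3 / 9) * (3 / 9) * (3 / 9)   # P(Yes) * product of P(feature|Yes)
p_no  = 5 / 14 * (3 / 5) * (1 / 5) * (4 / 5) * (3 / 5)   # P(No)  * product of P(feature|No)
print(round(p_yes, 4), round(p_no, 4))                   # ~0.0053 vs ~0.0206 -> predict No
print("Play =", "Yes" if p_yes > p_no else "No")
```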
Estimating Parameters: Y, Xi discrete-valued
MAP estimates:
Only difference:
“imaginary” examples
Naïve Bayes: Assumptions of Conditional
Independence
Often the Xi are not really conditionally independent
• Classify (Xnew)
Estimating Parameters: Y discrete, Xi continuous
P̂(x | Yes) = (1 / (2.35 √(2π))) exp( −(x − 21.64)² / (2 · 2.35²) )
P̂(x | No)  = (1 / (7.09 √(2π))) exp( −(x − 23.88)² / (2 · 7.09²) )
The independence hypothesis…
• makes computation possible
• yields optimal classifiers when satisfied
• Rarely satisfied in practice, as attributes (variables)
are often correlated.
• To overcome this limitation:
– Bayesian networks combine Bayesian reasoning with
causal relationships between attributes
Thank You
Foundations of Machine Learning
Module 4:
Part D: Bayesian Networks
Sudeshna Sarkar
IIT Kharagpur
Why Bayes Network
• Bayes optimal classifier is too costly to apply
• Naïve Bayes makes overly restrictive
assumptions.
– But all variables are rarely completely independent.
• Bayes network represents conditional
independence relations among the features.
• Representation of causal relations makes the
representation and inference efficient.
[Figure: example Bayesian network over the variables Accident, Late wakeup, Rainy day, Traffic Jam, Meeting postponed, Late for Work, Late for meeting.]
Bayesian Network
• A graphical model that efficiently encodes the joint
probability distribution for a large set of variables
• A Bayesian Network for a set of variables (nodes)
X = { X1,…….Xn}
• Arcs represent probabilistic dependence among
variables
• Lack of an arc denotes a conditional independence
• The network structure S is a directed acyclic graph
• A set P of local probability distributions at each node
(Conditional Probability Table)
Representation in Bayesian Belief
Networks
[Figure: the same example network (Accident, Late wakeup, Rainy day, Traffic Jam, Meeting postponed, Late for Work, Late for meeting).]
A conditional probability table associated with each node specifies the conditional distribution for the variable given its immediate parents in the graph.
Bayesian Networks
• Structure of the graph ⟺ conditional independence relations
In general,
p(X₁, X₂, …, X_N) = Πᵢ p(Xᵢ | parents(Xᵢ))
A → B → C (Markov dependence):
p(A, B, C) = p(C|B) p(B|A) p(A)
Naïve Bayes Model
[Figure: Naïve Bayes as a Bayes net — the class C is the parent of the features Y₁, Y₂, Y₃, …, Yₙ.]
Hidden Markov Model (HMM)
[Figure: HMM — hidden states S₁ → S₂ → S₃ → … → Sₙ form a chain; each hidden state Sₜ emits an observed Yₜ.]
Assumptions:
1. hidden state sequence is Markov
2. observation Yt is conditionally independent of all other
variables given St
Module 5:
Part A: Logistic Regression
Sudeshna Sarkar
IIT Kharagpur
Logistic Regression for classification
• Linear Regression:
h(x) = Σᵢ₌₀ⁿ βᵢxᵢ = βᵀx
• Logistic Regression for classification:
h_β(x) = 1 / (1 + e^(−βᵀx)) = g(βᵀx)
where g(z) = 1 / (1 + e^(−z)) is called the logistic function or the sigmoid function.
Sigmoid function properties
• Bounded between 0 and 1
• 𝑔(𝑧) → 1 as 𝑧 → ∞
• 𝑔(𝑧) → 0 as 𝑧 → −∞
g′(z) = (d/dz) [1 / (1 + e^(−z))]
      = e^(−z) / (1 + e^(−z))²
      = (1 / (1 + e^(−z))) · (1 − 1 / (1 + e^(−z)))
      = g(z)(1 − g(z))
Logistic Regression
• In logistic regression, we learn the conditional distribution
P(y|x)
• Let py(x; 𝛽) be our estimate of P(y|x), where 𝛽 is a vector of
adjustable parameters.
• Assume there are two classes, y = 0 and y = 1 and
𝑃 𝑦 = 1 𝑥 = ℎ𝛽 𝑥
𝑃 𝑦 = 0 𝑥 = 1 − ℎ𝛽 (𝑥)
• Can be written more compactly
𝑃 𝑦 𝑥 = ℎ(𝑥)𝑦 (1 − ℎ 𝑥 )1−𝑦
• We can use the gradient method
Maximize likelihood
L(β) = p(y⃗ | X; β) = Πᵢ₌₁ᵐ p(yᵢ | xᵢ; β)
Taking the log-likelihood ℓ(β) and differentiating with respect to βⱼ (for one example),
∂ℓ(β)/∂βⱼ = ( y(1 − g(βᵀx)) − (1 − y) g(βᵀx) ) xⱼ = (y − h_β(x)) xⱼ
Gradient ascent:
β := β + α ∇_β ℓ(β)
βⱼ := βⱼ + α (y⁽ⁱ⁾ − h_β(x⁽ⁱ⁾)) xⱼ⁽ⁱ⁾
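A minimal numpy sketch of this stochastic gradient-ascent rule (illustrative: the synthetic data, learning rate and number of epochs are arbitrary choices):

```python
import numpy as np

# Sketch: logistic regression trained with the update beta_j += alpha * (y - h(x)) * x_j.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=(200, 2))])      # x_0 = 1 for the intercept
true_beta = np.array([-0.5, 2.0, -1.0])
y = (rng.random(200) < 1 / (1 + np.exp(-X @ true_beta))).astype(float)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

beta = np.zeros(3)
alpha = 0.1
for epoch in range(100):
    for xi, yi in zip(X, y):
        beta += alpha * (yi - sigmoid(xi @ beta)) * xi    # gradient ascent on the log-likelihood
print(beta)    # roughly recovers true_beta
```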
Foundations of Machine Learning
Module 5:
Part B: Introduction to Support
Vector Machine
Sudeshna Sarkar
IIT Kharagpur
1
Support Vector Machines
• SVMs have a clever way to prevent overfitting
• They can use many features without requiring
too much computation.
2
Logistic Regression and Confidence
• Logistic Regression:
𝑝 𝑦 = 1 𝑥 = ℎ𝛽 𝑥 = 𝑔(𝛽𝑇 𝑥)
• Predict 1 on an input x iff ℎ𝛽 𝑥 ≥ 0.5,
equivalently, 𝛽𝑇 𝑥 ≥ 0
• The larger the value of ℎ𝛽 𝑥 , the larger is the probability,
and higher the confidence.
• Similarly, confident prediction of 𝑦 = 0 if 𝛽𝑇 𝑥 ≪ 0
• More confident of prediction from points (instances) located
far from the decision surface.
3
Preventing overfitting with many features
• Suppose a big set of features.
• What is the best separating line
to use?
• Bayesian answer:
– Use all
– Weight each line by its posterior
probability
• Can we approximate the correct
answer efficiently?
4
Support Vectors
• The line that maximizes the
minimum margin.
• This maximum-margin separator is
determined by a subset of the
datapoints.
– called “support vectors”.
– we use the support vectors to
decide which side of the
separator a test case is on. The support vectors are
indicated by the circles
around them.
5
Functional Margin
• Functional Margin of a point (𝑥𝑖 , 𝑦𝑖 ) wrt (𝑤, 𝑏)
– Measured by the distance of a point (𝑥𝑖 , 𝑦𝑖 ) from the
decision boundary (𝑤, 𝑏)
𝛾 𝑖 = 𝑦𝑖 (𝑤 𝑇 𝑥𝑖 + 𝑏)
– Larger functional margin →more confidence for
correct prediction
– Problem: w and b can be scaled to make this value
larger
• Functional Margin of training set
{ 𝑥1 , 𝑦1 , 𝑥2 , 𝑦2 , … , 𝑥𝑚 , 𝑦𝑚 } wrt (𝑤, 𝑏) is
𝛾 = min 𝛾 𝑖
1≤𝑖≤𝑚
6
Geometric Margin
• For a decision surface (w, b), the vector orthogonal to it is given by w.
• The unit-length orthogonal vector is w/‖w‖.
• P = Q + γ · w/‖w‖, where P = (a1, a2) is a data point and Q = (b1, b2) is its projection onto the decision surface.
Geometric Margin
P = Q + γ · w/‖w‖
(b1, b2) = (a1, a2) − γ · w/‖w‖
⟹ wᵀ( (a1, a2) − γ · w/‖w‖ ) + b = 0
⟹ γ = ( wᵀ(a1, a2) + b ) / ‖w‖ = (w/‖w‖)ᵀ (a1, a2) + b/‖w‖
For a labeled point, γ = y · ( (w/‖w‖)ᵀ x + b/‖w‖ )
Geometric margin = functional margin computed with ‖w‖ = 1.
Geometric margin of (w, b) wrt S = {(x₁, y₁), (x₂, y₂), …, (x_m, y_m)}
-- smallest of the geometric margins of the individual points.
Maximize margin width
[Figure: two classes (+1 and −1) in (x1, x2) space separated by a margin.]
• Assume linearly separable training examples.
• The classifier with the maximum margin width is robust to outliers and thus has strong generalization ability.
Maximize Margin Width
• Maximize γ/‖w‖ subject to
yᵢ(wᵀxᵢ + b) ≥ γ for i = 1, 2, …, m
• Scale so that γ = 1
• Maximizing 1/‖w‖ is the same as minimizing ‖w‖²
• Minimize w·w subject to the constraints, for all (xᵢ, yᵢ), i = 1, …, m:
wᵀxᵢ + b ≥ 1 if yᵢ = 1
wᵀxᵢ + b ≤ −1 if yᵢ = −1
Large Margin Linear Classifier
• Formulation:
minimize (1/2)‖w‖²
such that yᵢ(wᵀxᵢ + b) ≥ 1
[Figure: maximum-margin separator in (x1, x2) space, with support vectors x⁺ and x⁻ on the margin; classes denoted +1 and −1.]
Solving the Optimization Problem
minimize (1/2)‖w‖²
s.t. yᵢ(wᵀxᵢ + b) ≥ 1
Foundations of Machine Learning
Module 5:
Part C: Support Vector Machine:
Dual
Sudeshna Sarkar
IIT Kharagpur
1
Solving the Optimization Problem
minimize (1/2)‖w‖²
s.t. yᵢ(wᵀxᵢ + b) ≥ 1
Lagrangian Duality in brief
The Primal Problem:  min_w f(w)
                     s.t. gᵢ(w) ≤ 0, i = 1, …, k
                          hᵢ(w) = 0, i = 1, …, l
The generalized Lagrangian:
L(w, α, β) = f(w) + Σᵢ₌₁ᵏ αᵢ gᵢ(w) + Σᵢ₌₁ˡ βᵢ hᵢ(w)
the α's (αᵢ ≥ 0) and β's are called the Lagrange multipliers
Lemma:
max_{α,β: αᵢ≥0} L(w, α, β) = f(w) if w satisfies the primal constraints, ∞ otherwise
A re-written Primal:
min_w max_{α,β: αᵢ≥0} L(w, α, β)
Lagrangian Duality, cont.
The Primal Problem:  p* = min_w max_{α,β: αᵢ≥0} L(w, α, β)
The Dual Problem:    d* = max_{α,β: αᵢ≥0} min_w L(w, α, β)
If αᵢ > 0 then gᵢ(w) = 0
Solving the Optimization Problem
Quadratic programming with linear constraints:
minimize (1/2)‖w‖²   s.t. yᵢ(wᵀxᵢ + b) ≥ 1
Lagrangian Function:
L_p(w, b, αᵢ) = (1/2)‖w‖² − Σᵢ₌₁ⁿ αᵢ [ yᵢ(wᵀxᵢ + b) − 1 ],   s.t. αᵢ ≥ 0
Solving the Optimization Problem
minimize L_p(w, b, αᵢ) = (1/2)‖w‖² − Σᵢ₌₁ⁿ αᵢ [ yᵢ(wᵀxᵢ + b) − 1 ],   s.t. αᵢ ≥ 0
Minimize wrt w and b for fixed α:
∂L_p/∂w = 0  ⟹  w = Σᵢ₌₁ⁿ αᵢ yᵢ xᵢ
∂L_p/∂b = 0  ⟹  Σᵢ₌₁ⁿ αᵢ yᵢ = 0
Substituting back,
L_p(w, b, α) = Σᵢ₌₁ᵐ αᵢ − (1/2) Σᵢ,ⱼ₌₁ᵐ αᵢ αⱼ yᵢ yⱼ (xᵢᵀxⱼ) − b Σᵢ₌₁ᵐ αᵢ yᵢ
             = Σᵢ₌₁ᵐ αᵢ − (1/2) Σᵢ,ⱼ₌₁ᵐ αᵢ αⱼ yᵢ yⱼ (xᵢᵀxⱼ)
The Dual problem
Now we have the following dual optimization problem:
max_α J(α) = Σᵢ₌₁ᵐ αᵢ − (1/2) Σᵢ,ⱼ₌₁ᵐ αᵢ αⱼ yᵢ yⱼ (xᵢᵀxⱼ)
s.t. αᵢ ≥ 0, i = 1, …, m
     Σᵢ₌₁ᵐ αᵢ yᵢ = 0
Foundations of Machine Learning
Module 5: Support Vector Machine
Part D: SVM – Maximum Margin
with Noise
Sudeshna Sarkar
IIT Kharagpur
1
Linear SVM formulation
Find w and b such that 2/‖w‖ is maximized (subject to the margin constraints).
• What about noisy data points, or data that are not linearly separable?
• With hard constraints this is no longer a feasible QP formulation.
Objective to be minimized
• Minimize
w·w + C · (distance of error points to their correct place)
Minimize   w·w + C Σₖ₌₁ᵐ ξₖ
subject to the m constraints:
w·xₖ + b ≥ 1 − ξₖ if yₖ = 1
w·xₖ + b ≤ −1 + ξₖ if yₖ = −1
≡ yₖ(w·xₖ + b) ≥ 1 − ξₖ, k = 1, …, m
ξₖ ≥ 0, k = 1, …, m
C controls the relative importance of maximizing the margin and fitting the training data; it controls overfitting.
Lagrangian
L(w, b, ξ, α, β) = (1/2) w·w + C Σᵢ₌₁ᵐ ξᵢ − Σᵢ₌₁ᵐ αᵢ [ yᵢ(xᵢ·w + b) − 1 + ξᵢ ] − Σᵢ₌₁ᵐ βᵢ ξᵢ
Setting the derivatives to zero gives w = Σᵢ₌₁ᵐ αᵢ yᵢ xᵢ and Σᵢ₌₁ᵐ αᵢ yᵢ = 0.
Solution to Soft Margin Classification
• 𝑥𝑖 with non-zero 𝛼𝑖 will be support vectors.
• Solution to the dual problem is:
w = Σᵢ₌₁ᵐ αᵢ yᵢ xᵢ
b = yₖ(1 − ξₖ) − Σᵢ₌₁ᵐ αᵢ yᵢ (xᵢ · xₖ)   for any k s.t. αₖ > 0
For classification,
f(x) = Σᵢ₌₁ᵐ αᵢ yᵢ (xᵢ · x) + b
(no need to compute w explicitly)
Thank You
10
Foundations of Machine Learning
Module 5: Support Vector Machine
Part E: Nonlinear SVM and Kernel
function
Sudeshna Sarkar
IIT Kharagpur
1
Non-linear decision surface
• We saw how to deal with datasets which are linearly
separable with noise.
• What if the decision boundary is truly non-linear?
• Idea: Map data to a high dimensional space where it
is linearly separable.
– Using a bigger set of features will make the computation
slow?
– The “kernel” trick to make the computation fast.
2
Non-linear SVMs: Feature Space
Φ: 𝑥 → 𝜙(𝑥)
Non-linear SVMs: Feature Space
Φ: x → φ(x)
g(x) = wᵀφ(x) + b = Σ_{i∈SV} αᵢ yᵢ φ(xᵢ)ᵀ φ(x) + b
K(xₐ, x_b) = φ(xₐ) · φ(x_b)
Often K(xₐ, x_b) may be very inexpensive to compute even if φ(xₐ) may be extremely high dimensional.
Kernel Example
2-dimensional vectors x = [x₁ x₂]
let K(xᵢ, xⱼ) = (1 + xᵢ · xⱼ)²
We need to show that 𝐾 𝑥𝑖 , 𝑥𝑗 = 𝜙 𝑥𝑖 . 𝜙(𝑥𝑗 )
K(xi,xj) = (1 + xixj)2,
= 1+ xi12xj12 + 2 xi1xj1 xi2xj2+ xi22xj22 + 2xi1xj1 + 2xi2xj2
= [1 xi12 √2 xi1xi2 xi22 √2xi1 √2xi2].[1 xj12 √2 xj1xj2 xj22 √2xj1 √2xj2]
= φ(xi). φ(xj),
where φ(x) = [1 x12 √2 x1x2 x22 √2x1 √2x2]
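A quick numeric check of this identity (a small sketch; the two test vectors are arbitrary):

```python
import numpy as np

# Sketch: verify that K(xi, xj) = (1 + xi . xj)^2 equals phi(xi) . phi(xj).
def phi(x):
    x1, x2 = x
    return np.array([1, x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2,
                     np.sqrt(2) * x1, np.sqrt(2) * x2])

xi, xj = np.array([0.7, -1.2]), np.array([2.0, 0.5])
K = (1 + xi @ xj) ** 2
print(K, phi(xi) @ phi(xj))   # the two numbers agree
```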
Commonly-used kernel functions
• Linear kernel: 𝐾 𝑥𝑖 , 𝑥𝑗 = 𝑥𝑖 . 𝑥𝑗
• Polynomial of power p:
𝐾 𝑥𝑖 , 𝑥𝑗 = (1 + 𝑥𝑖 . 𝑥𝑗 )𝑝
• Gaussian (radial-basis function):
2
𝑥𝑖 −𝑥𝑗
−
𝐾 𝑥𝑖 , 𝑥𝑗 = 𝑒 2𝜎2
• Sigmoid
𝐾 𝑥𝑖 , 𝑥𝑗 = tanh(𝛽0 𝑥𝑖 . 𝑥𝑗 + 𝛽1 )
In general, functions that satisfy Mercer’s condition can
be kernel functions.
9
Kernel Functions
• Kernel function can be thought of as a similarity measure
between the input objects
• Not all similarity measure can be used as kernel function.
• Mercer's condition states that any positive semi-definite
kernel K(x, y), i.e.
𝐾(𝑥𝑖 , 𝑥𝑗 )𝑐𝑖 𝑐𝑗 ≥ 0
𝑖,𝑗
• can be expressed as a dot product in a high dimensional
space.
10
SVM examples
such that 0 ≤ αᵢ ≤ C and Σᵢ₌₁ⁿ αᵢ yᵢ = 0
Foundations of Machine Learning
Module 5: Support Vector Machine
Part F: SVM – Solution to the Dual
Problem
Sudeshna Sarkar
IIT Kharagpur
1
The SMO algorithm
The SMO algorithm can efficiently solve the dual problem.
First we discuss Coordinate Ascent.
Coordinate Ascent
• Consider solving the unconstrained optimization problem:
max_α W(α₁, α₂, …, αₙ)
(For the SVM dual, the constraint Σᵢ αᵢ yᵢ = 0 must also be maintained.)
Foundations of Machine Learning
Sudeshna Sarkar
IIT Kharagpur
Introduction
• Inspired by the human brain.
• Some NNs are models of biological neural networks
• Human brain contains a massively interconnected
net of 10¹⁰–10¹¹ (10 billion) neurons (cortical cells)
– Massive parallelism – large number of simple
processing units
– Connectionism – highly interconnected
– Associative distributed memory
• Pattern and strength of synaptic connections
Neuron
Neural Unit
ANNs
• ANNs incorporate the two fundamental components of
biological neural nets:
1. Nodes - Neurones
2. Weights - Synapses
Perceptrons
• Basic unit in a neural network: Linear separator
– N inputs, x1 ... xn
– Weights for each input, w1 ... wn
– A bias input x0 (constant) and associated weight w0
– Weighted sum of inputs, 𝑦 = σ𝑛𝑖=0 𝑤𝑖 𝑥𝑖
– A threshold function, i.e., 1 if y > 0, −1 if y <= 0
[Figure: perceptron — inputs x₀, x₁, …, xₙ with weights w₀, w₁, …, wₙ feed a summing unit y = Σᵢ wᵢxᵢ, followed by the threshold φ = 1 if y > 0, −1 otherwise.]
Perceptron training rule
Updates perceptron weights for a training ex as follows:
𝑤𝑖 = 𝑤𝑖 + 𝛥𝑤𝑖
𝛥𝑤𝑖 = 𝜂 𝑦 − 𝑦ො 𝑥𝑖
• If the data is linearly separable and 𝜂 is sufficiently small, it will
converge to a hypothesis that classifies all training data correctly in a
finite number of iterations
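A minimal Python sketch of the perceptron training rule, illustrated on the linearly separable Boolean AND function (the data choice, η and number of passes are assumptions for illustration):

```python
# Sketch: the perceptron training rule w_i += eta * (y - y_hat) * x_i on the (linearly
# separable) Boolean AND function; x_0 = 1 is the bias input.
examples = [((1, 0, 0), -1), ((1, 0, 1), -1), ((1, 1, 0), -1), ((1, 1, 1), 1)]
w = [0.0, 0.0, 0.0]
eta = 0.1

def predict(x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1

for _ in range(20):                          # a few passes over the data
    for x, y in examples:
        y_hat = predict(x)
        for i in range(len(w)):
            w[i] += eta * (y - y_hat) * x[i]

print(w, [predict(x) for x, _ in examples])  # classifies all four examples correctly
```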
Gradient Descent
• Perceptron training rule may not converge if points are not
linearly separable
• Gradient descent by changing the weights by the total error
for all training points.
– If the data is not linearly separable, then it will converge to
the best fit
Linear neurons
• The neuron has a real-valued output which is a weighted sum of its inputs:
ŷ = Σᵢ wᵢxᵢ = wᵀx
• Define the error as the squared residuals summed over all training cases:
E = (1/2) Σⱼ (yⱼ − ŷⱼ)²
• Differentiate to get error derivatives for the weights:
∂E/∂wᵢ = (1/2) Σⱼ₌₁..ₘ (∂ŷⱼ/∂wᵢ)(∂Eⱼ/∂ŷⱼ) = − Σⱼ₌₁..ₘ xᵢ,ⱼ (yⱼ − ŷⱼ)
• The batch delta rule changes the weights in proportion to their error derivatives summed over all training cases:
Δwᵢ = −η ∂E/∂wᵢ
Error Surface
• The error surface lies in a space with a horizontal axis for each
weight and one vertical axis for the error.
– For a linear neuron, it is a quadratic bowl.
– Vertical cross-sections are parabolas.
– Horizontal cross-sections are ellipses.
Batch Line and Stochastic Learning
Batch Learning Stochastic/ Online Learning
• Steepest descent on the For each example compute the
error surface gradient.
1
𝐸 = (𝑦 − 𝑦) ො 2
2
𝜕𝐸 1 𝜕ෝ𝑦. 𝜕𝐸𝑗
=
𝜕𝑤𝑖 2 𝜕𝑤𝑖 𝜕ෝ 𝑦.
= −𝑥𝑖 (𝑦. − 𝑦ෝ. )
Computation at Units
• Compute a 0-1 or a graded function of the
weighted sum of the inputs
• φ(·) is the activation function
[Figure: unit with inputs x₁, …, xₙ and weights w₁, …, wₙ computing φ(w·x), where w·x = Σᵢ wᵢxᵢ.]
Neuron Model: Logistic Unit
φ(z) = 1 / (1 + e^(−z)) = 1 / (1 + e^(−w·x))
φ′(z) = φ(z)(1 − φ(z))
E = (1/2) Σ_d (y_d − ŷ_d)² = (1/2) Σ_d (y_d − φ(w·x_d))²
∂E/∂wᵢ = (1/2) Σ_d (∂E_d/∂ŷ_d)(∂ŷ_d/∂wᵢ)
       = − Σ_d (y_d − ŷ_d) φ′(w·x_d) xᵢ,d
       = − Σ_d (y_d − ŷ_d) ŷ_d (1 − ŷ_d) xᵢ,d
Training Rule: Δwᵢ = η Σ_d (y_d − ŷ_d) ŷ_d (1 − ŷ_d) xᵢ,d
Thank You
Foundations of Machine Learning
Module 6: Neural Network
Part B: Multi-layer Neural
Network
Sudeshna Sarkar
IIT Kharagpur
Limitations of Perceptrons
• Perceptrons have a monotinicity property:
If a link has positive weight, activation can only increase as the
corresponding input value increases (irrespective of other
input values)
• Can’t represent functions where input interactions can cancel
one another’s effect (e.g. XOR)
• Can represent only linearly separable functions
A solution: multiple layers
[Figure: a two-layer network — input layer (x₁, x₂), hidden layer (z₁, z₂), output layer (y) — shown both as a layered graph and as regions in input space.]
Power/Expressiveness of Multilayer
Networks
• Can represent interactions among inputs
• Two layer networks can represent any Boolean
function, and continuous functions (within a
tolerance) as long as the number of hidden units is
sufficient and appropriate activation functions used
• Learning algorithms exist, but weaker guarantees
than perceptron learning algorithms
Multilayer Network
[Figure: multilayer network — input layer, first hidden layer, second hidden layer, output layer.]
Two-layer back-propagation neural network
[Figure: two-layer back-propagation network — inputs xᵢ connect through weights wᵢⱼ to hidden units j, which connect through weights wⱼₖ to output units k producing yₖ; input signals flow forward through the input, hidden and output layers, and error signals flow backward.]
The back-propagation training algorithm
• Step 1: Initialisation
Set all the weights and threshold levels of the network to
random numbers uniformly distributed inside a small range
[Figure: small network with inputs x₁, x₂ feeding hidden units z through weights vᵢⱼ (and biases v₀ⱼ), and the hidden units feeding output y₁ through weights wⱼ₁ (and bias w₀₁); x → z → y.]
Backprop
• Initialization
– Set all the weights and threshold levels of the network to
random numbers uniformly distributed inside a small
range
• Forward computing:
– Apply an input vector x to input units
– Compute activation/output vector z on hidden layer
𝑧𝑗 = 𝜑(σ𝑖 𝑣𝑖𝑗 𝑥𝑖 )
– Compute the output vector y on output layer
𝑦𝑘 = 𝜑(σ𝑗 𝑤𝑗𝑘 𝑧𝑗 )
y is the result of the computation.
Learning for BP Nets
• Update of weights in W (between output and hidden layers):
– delta rule
• Not applicable to updating V (between input and hidden)
– don’t know the target values for hidden units z1, Z2, … ,ZP
• Solution: Propagate errors at output units to hidden units to
drive the update of weights in V (again by delta rule)
(error BACKPROPAGATION learning)
• Error backpropagation can be continued downward if the net
has more than one hidden layer.
• How to compute errors on hidden units?
Derivation
• For one output neuron, the error function is
1
𝐸 = (𝑦 − 𝑦) ො 2
2
• For each unit 𝑗, the output 𝑜𝑗 is defined as
𝑛
𝑜𝑗 = 𝜑 𝑛𝑒𝑡𝑗 = 𝜑 𝑤𝑘𝑗 𝑜𝑘
𝑘=1
The input 𝑛𝑒𝑡𝑗 to a neuron is the weighted sum of outputs 𝑜𝑘
of previous 𝑛 neurons.
• Finding the derivative of the error:
𝜕𝐸 𝜕𝐸 𝜕𝑜𝑗 𝜕𝑛𝑒𝑡𝑗
=
𝜕𝑤𝑖𝑗 𝜕𝑜𝑗 𝜕𝑛𝑒𝑡𝑗 𝜕𝑤𝑖𝑗
Derivation
• Finding the derivative of the error:
$\frac{\partial E}{\partial w_{ij}} = \frac{\partial E}{\partial o_j}\,\frac{\partial o_j}{\partial net_j}\,\frac{\partial net_j}{\partial w_{ij}}$
$\frac{\partial net_j}{\partial w_{ij}} = \frac{\partial}{\partial w_{ij}}\sum_{k=1}^{n} w_{kj}\, o_k = o_i$
$\frac{\partial o_j}{\partial net_j} = \frac{\partial}{\partial net_j}\varphi(net_j) = \varphi(net_j)\bigl(1 - \varphi(net_j)\bigr)$
Consider $E$ as a function of the inputs of all neurons $Z = \{z_1, z_2, \dots\}$ receiving input from neuron $j$:
$\frac{\partial E(o_j)}{\partial o_j} = \frac{\partial E(net_{z_1}, net_{z_2}, \dots)}{\partial o_j}$
Taking the total derivative with respect to $o_j$, a recursive expression for the derivative is obtained:
$\frac{\partial E}{\partial o_j} = \sum_{l} \frac{\partial E}{\partial net_{z_l}}\,\frac{\partial net_{z_l}}{\partial o_j} = \sum_{l} \frac{\partial E}{\partial o_l}\,\frac{\partial o_l}{\partial net_{z_l}}\, w_{j z_l}$
• Therefore, the derivative with respect to $o_j$ can be calculated if all the derivatives with respect to the outputs $o_{z_l}$ of the next layer – the one closer to the output neuron – are known.
• Putting it all together:
$\frac{\partial E}{\partial w_{ij}} = \delta_j\, o_i$
with
$\delta_j = \frac{\partial E}{\partial o_j}\,\frac{\partial o_j}{\partial net_j} = \begin{cases} (o_j - t_j)\, o_j (1 - o_j) & \text{if } j \text{ is an output neuron} \\ \bigl(\sum_{l \in Z} \delta_{z_l} w_{jl}\bigr)\, o_j (1 - o_j) & \text{if } j \text{ is an inner neuron} \end{cases}$
To update the weight $w_{ij}$ using gradient descent, one must choose a learning rate $\eta$:
$\Delta w_{ij} = -\eta\, \frac{\partial E}{\partial w_{ij}}$
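As a quick numeric check of these formulas, the following sketch computes the delta for one output unit and the resulting weight update; the values (o_j = 0.8, target t_j = 1.0, input o_i = 0.5, eta = 0.1) are made up for illustration.

# delta for an output neuron: (o_j - t_j) * o_j * (1 - o_j)
o_j, t_j, o_i, eta = 0.8, 1.0, 0.5, 0.1
delta_j = (o_j - t_j) * o_j * (1 - o_j)   # = -0.032
dE_dw = delta_j * o_i                     # dE/dw_ij = delta_j * o_i = -0.016
dw = -eta * dE_dw                         # gradient-descent step = +0.0016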
Backpropagation Algorithm
Thank You
Foundations of Machine Learning
Module 6: Neural Network
Part C: Neural Network and
Backpropagation Algorithm
Sudeshna Sarkar
IIT Kharagpur
Single layer Perceptron
• Single layer perceptrons learn linear decision boundaries
[Figure: left, a linearly separable data set in the (x1, x2) plane, with class I (x, y = 1) separated from class II (o, y = -1) by a line; right, the XOR data set, where no single line separates the two classes.]
Boolean OR
input x1, input x2 → output: (0,0) → 0, (0,1) → 1, (1,0) → 1, (1,1) → 1
A perceptron with weights w0 = -0.5, w1 = 1, w2 = 1 computes OR.
[Figure: the OR points in the (x1, x2) plane are linearly separable.]
Boolean AND
input x1, input x2 → output: (0,0) → 0, (0,1) → 0, (1,0) → 0, (1,1) → 1
A perceptron with weights w0 = -1.5, w1 = 1, w2 = 1 computes AND.
[Figure: the AND points in the (x1, x2) plane are linearly separable.]
x2
Boolean XOR
+ -
XOR
input input
ouput
x1 x2 x1
- +
0 0 0
0 1 1
1 0 1
1 1 0
Boolean XOR
input x1, input x2 → output: (0,0) → 0, (0,1) → 1, (1,0) → 1, (1,1) → 0
A two-layer network computes XOR: hidden unit h1 computes OR(x1, x2) (bias -0.5, weights 1, 1), hidden unit h2 computes AND(x1, x2) (bias -1.5, weights 1, 1), and the output unit o has bias -0.5 with weight +1 on h1 and -1 on h2.
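A minimal sketch verifying this construction with threshold units, using the bias and weight values shown above (the helper names are mine, not from the slides).

def step(a):
    return 1 if a > 0 else 0          # threshold (perceptron) activation

def xor_net(x1, x2):
    h_or  = step(-0.5 + x1 + x2)      # hidden unit computing OR  (w0 = -0.5, w1 = w2 = 1)
    h_and = step(-1.5 + x1 + x2)      # hidden unit computing AND (w0 = -1.5, w1 = w2 = 1)
    return step(-0.5 + h_or - h_and)  # output unit combines them into XOR

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, "->", xor_net(x1, x2))   # prints 0, 1, 1, 0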
Representation Capability of NNs
• Single layer nets have limited representation power (the linear
separability problem). Multi-layer nets (or nets with non-linear
hidden units) may overcome the linear inseparability problem.
• Every Boolean function can be represented by a network with
a single hidden layer.
• Every bounded continuous function can be approximated, with
arbitrarily small error, by a network with one hidden layer.
• Any function can be approximated to arbitrary accuracy by a
network with two hidden layers.
Multilayer Network
[Figure: inputs feed into a first hidden layer, then a second hidden layer, then an output layer that produces the outputs.]
Two-layer back-propagation neural network
[Figure: input signals x1, ..., xn enter the input layer; the hidden layer (weights wij) and the output layer (weights wjk) produce outputs y1, ..., yn2; error signals propagate backwards from the output layer.]
Derivation
• For one output neuron, the error function is
$E = \frac{1}{2}(y - o)^2$
• For each unit $j$, the output $o_j$ is defined as
$o_j = \varphi(net_j) = \varphi\left(\sum_{k=1}^{n} w_{kj}\, o_k\right)$
The input $net_j$ to a neuron is the weighted sum of the outputs $o_k$ of the previous $n$ neurons.
• Finding the derivative of the error:
$\frac{\partial E}{\partial w_{ij}} = \frac{\partial E}{\partial o_j}\,\frac{\partial o_j}{\partial net_j}\,\frac{\partial net_j}{\partial w_{ij}} = \left(\sum_{l} \frac{\partial E}{\partial o_l}\,\frac{\partial o_l}{\partial net_{z_l}}\, w_{j z_l}\right)\varphi(net_j)\bigl(1 - \varphi(net_j)\bigr)\, o_i$
$\frac{\partial E}{\partial w_{ij}} = \delta_j\, o_i$
with
$\delta_j = \frac{\partial E}{\partial o_j}\,\frac{\partial o_j}{\partial net_j} = \begin{cases} (o_j - y_j)\, o_j (1 - o_j) & \text{if } j \text{ is an output neuron} \\ \bigl(\sum_{l \in Z} \delta_{z_l} w_{jl}\bigr)\, o_j (1 - o_j) & \text{if } j \text{ is an inner neuron} \end{cases}$
To update the weight $w_{ij}$ using gradient descent, one must choose a learning rate $\eta$:
$\Delta w_{ij} = -\eta\, \frac{\partial E}{\partial w_{ij}}$
Backpropagation Algorithm
Initialize all weights to small random numbers.
Until satisfied, do
– For each training example, do
• Input the training example to the network and compute the network outputs
• For each output unit $k$: $\delta_k \leftarrow o_k (1 - o_k)(y_k - o_k)$
• For each hidden unit $h$: $\delta_h \leftarrow o_h (1 - o_h)\sum_{k} w_{hk}\,\delta_k$
• Update each network weight: $w_{ij} \leftarrow w_{ij} + \eta\,\delta_j\, x_{ij}$, where $x_{ij}$ is the input from unit $i$ into unit $j$
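Putting the derivation and the algorithm together, here is a minimal NumPy sketch of stochastic backpropagation for one hidden layer, trained on the XOR data as a toy example. The layer sizes, learning rate, epoch count, and bias handling (appending a constant 1) are illustrative choices, not values given in the slides.

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])   # toy data: XOR
Y = np.array([[0.], [1.], [1.], [0.]])

rng = np.random.default_rng(0)
V = rng.uniform(-0.5, 0.5, size=(3, 4))   # input (+bias) -> hidden weights
W = rng.uniform(-0.5, 0.5, size=(5, 1))   # hidden (+bias) -> output weights
eta = 0.5

def forward(x):
    xb = np.append(x, 1.0)                # constant 1 plays the role of the threshold
    z = sigmoid(xb @ V)
    zb = np.append(z, 1.0)
    o = sigmoid(zb @ W)
    return xb, z, zb, o

for epoch in range(20000):
    for x, y in zip(X, Y):
        xb, z, zb, o = forward(x)
        delta_o = o * (1 - o) * (y - o)              # output-unit delta
        delta_h = z * (1 - z) * (W[:-1] @ delta_o)   # hidden-unit delta (bias row excluded)
        W += eta * np.outer(zb, delta_o)             # w <- w + eta * delta * input
        V += eta * np.outer(xb, delta_h)

# Outputs should end up close to [0, 1, 1, 0]; exact values depend on the initialisation.
print([round(float(forward(x)[3][0]), 2) for x in X])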
Stopping
1. Fixed maximum number of epochs: most naïve
2. Keep track of the training and validation error
curves.
Overfitting in ANNs
Local Minima
Sudeshna Sarkar
IIT Kharagpur
Deep Learning
• Breakthrough results in
– Image classification
– Speech Recognition
– Machine Translation
– Multi-modal learning
Deep Neural Network
• Problem: training networks with many hidden layers
doesn’t work very well
• Local minima, very slow training if initialize with zero
weights.
• Diffusion of gradient.
Hierarchical Representation
• Hierarchical Representation help represent complex
functions.
• NLP: character -> word -> chunk -> clause -> sentence
• Image: pixel -> edge -> texton -> motif -> part -> object
• Deep Learning: learning a hierarchy of internal
representations
• Learned internal representation at the hidden layers
(trainable feature extractor)
• Feature learning
[Figure: Input -> Trainable Feature Extractor -> ... -> Trainable Feature Extractor -> Trainable Classifier -> Output]
Unsupervised Pre-training
We will use greedy, layer-wise pre-training
Train one layer at a time
Fix the parameters of previous hidden layers
Previous layers viewed as feature extraction
find hidden unit features that are more common in training
input than in random inputs
Tuning the Classifier
• After pre-training of the layers
– Add output layer
– Train the whole network using
supervised learning (Back propagation)
Deep neural network
• Feed forward NN
• Stacked Autoencoders (multilayer neural net
with target output = input)
• Stacked restricted Boltzmann machine
• Convolutional Neural Network
A Deep Architecture: Multi-Layer Perceptron
[Figure: the output layer y (here predicting a supervised target) sits on top of hidden layers h3, h2, h1, which learn more abstract representations as you head up, over the input layer x of raw sensory inputs.]
A Neural Network
• Training : Back
Propagation of Error
– Calculate total error at
the top
– Calculate contributions
to error at each step
going backwards
[Figure: a network with an input layer, a hidden layer, and an output layer.]
Training of neural networks
• Forward Propagation :
– Sum inputs, produce
activation
– feed-forward
• $\tanh(x) = \dfrac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$
• $\mathrm{sigmoid}(x) = \dfrac{1}{1 + e^{-x}}$
• Rectified linear: $\mathrm{relu}(x) = \max(0, x)$
– Simplifies backprop
– Makes learning faster
– Makes features sparse
→ Preferred option
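A small NumPy sketch of these three activations (nothing here beyond the formulas above; the sample inputs are arbitrary).

import numpy as np

def tanh(x):
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))   # same as np.tanh(x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)    # rectified linear: max(0, x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(tanh(x), sigmoid(x), relu(x), sep="\n")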
Autoencoder
Unlabeled training examples: a set $\{x^{(1)}, x^{(2)}, x^{(3)}, \dots\}$, $x^{(i)} \in \mathbb{R}^n$.
Set the target values to be equal to the inputs: $y^{(i)} = x^{(i)}$.
The network is trained to output the input (learn the identity function): $h_{W,b}(x) \approx x$.
The solution may be trivial!
[Figure: an autoencoder network whose hidden units a1, a2, a3 form a compressed representation of the input.]
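As a minimal illustration (not from the slides), an autoencoder can be trained with any multi-layer regressor by using the inputs themselves as targets; here scikit-learn's MLPRegressor is assumed to be available, and the data and the bottleneck size of 2 are arbitrary choices.

import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))            # unlabeled data, x^(i) in R^10

# One small hidden layer forces a compressed (non-trivial) representation.
ae = MLPRegressor(hidden_layer_sizes=(2,), activation='tanh',
                  max_iter=5000, random_state=0)
ae.fit(X, X)                              # targets = inputs: learn h(x) ~ x
X_rec = ae.predict(X)                     # reconstructions
print(np.mean((X - X_rec) ** 2))          # reconstruction error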
Autoencoders and sparsity
1. Place constraints on the
network, like limiting the
number of hidden units, to
discover interesting structure
about the data.
2. Impose sparsity constraint.
a neuron is “active” if its output
value is close to 1
It is “inactive” if its output value is
close to 0.
constrain the neurons to be inactive
most of the time.
Auto-Encoders
Stacked Auto-Encoders
• Do supervised training on the last layer using final
features
• Then do supervised training on the entire network
to fine- tune all weights
[Figure: stacked auto-encoders with a final output layer computing $y_i = \dfrac{e^{z_i}}{\sum_j e^{z_j}}$.]
Convolutional Neural Networks
• A CNN consists of a number of convolutional and subsampling layers.
• The input to a convolutional layer is an m x m x r image, where m x m is the height and width of the image and r is the number of channels, e.g. an RGB image has r = 3.
• A convolutional layer has k filters (or kernels) of size n x n x q, where n is smaller than the dimension of the image and q can either be the same as the number of channels r or smaller, and may vary for each kernel.
Convolutional Neural Networks
[Figure: a CNN architecture alternating convolutional and subsampling layers.]
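To make the filter idea concrete, here is a minimal NumPy sketch of a single-channel 2-D convolution with "valid" padding; it illustrates the operation only and is not code from the course.

import numpy as np

def conv2d_valid(image, kernel):
    m, _ = image.shape
    n, _ = kernel.shape
    out = np.zeros((m - n + 1, m - n + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+n, j:j+n] * kernel)   # dot product of patch and filter
    return out

image = np.arange(36, dtype=float).reshape(6, 6)   # a toy 6 x 6 single-channel "image"
kernel = np.array([[1., 0.], [0., -1.]])           # a 2 x 2 filter
print(conv2d_valid(image, kernel).shape)           # (5, 5) feature map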
Goal of Learning Theory
• Two core aspects of ML
– Algorithm Design. How to optimize?
– Confidence that a learned rule will be effective on future data.
• We need particular settings (models)
– Probably Approximately Correct (PAC):
$\Pr\left[P(c \oplus h) \leq \epsilon\right] \geq 1 - \delta$
[Figure: the error region $c \oplus h$ is the symmetric difference between the target concept c and the hypothesis h.]
Prototypical Concept Learning Task
• Given
– Instances X (e.g., $X = \mathbb{R}^d$ or $X = \{0,1\}^d$)
– Distribution $\mathcal{D}$ over X
– Target function c
– Hypothesis space $\mathcal{H}$
– Training examples $S = \{\langle x_i, c(x_i)\rangle\}$, with the $x_i$ drawn i.i.d. from $\mathcal{D}$
[Figure: instance space X with positive and negative examples, the target concept c, and a hypothesis h.]
• Determine
– A hypothesis $h \in \mathcal{H}$ s.t. $h(x) = c(x)$ for all $x$ in S?
– A hypothesis $h \in \mathcal{H}$ s.t. $h(x) = c(x)$ for all $x$ in X?
• An algorithm does optimization over S and finds a hypothesis h.
• Goal: find h which has small error over $\mathcal{D}$
Computational Learning Theory
• Can we be certain about how the learning algorithm
generalizes?
• We would have to see all the examples.
• Inductive inference: generalizing beyond the training data is
impossible unless we add more assumptions (e.g., priors over H).
[Figure: instance space X with positive and negative examples, the target concept c, and a hypothesis h.]
We need a bias!
Function Approximation
• How many labeled examples are needed to determine which of the
$2^{2^N}$ hypotheses is the correct one?
• All $2^N$ instances in X must be labeled!
• Inductive inference: generalizing beyond the training data is
impossible unless we add more assumptions (e.g., bias)
$H = \{h : X \to Y\}$, $|H| = 2^{|X|} = 2^{2^N}$
[Figure: instance space X with the target concept c and two hypotheses h1, h2 that both fit the labeled examples.]
Error of a hypothesis
The true error of hypothesis h, with respect to the target
concept c and observation distribution 𝒟 is the probability that h
will misclassify an instance drawn according to 𝒟
$error_{\mathcal{D}}(h) \equiv \Pr_{x \sim \mathcal{D}}\left[c(x) \neq h(x)\right]$
In a perfect world, we’d like the true error to be 0.
Consistent Case
Theorem
$m \geq \frac{1}{\epsilon}\left(\ln|H| + \ln\frac{1}{\delta}\right)$
labeled examples are sufficient so that with probability $1 - \delta$, all $h \in H$ with
$err_{\mathcal{D}}(h) \geq \epsilon$ have $err_S(h) > 0$.
Inconsistent Case
What if there is no perfect h?
Theorem: After m examples, with probability $\geq 1 - \delta$, all $h \in H$ have
$|err_{\mathcal{D}}(h) - err_S(h)| < \epsilon$, for
$m \geq \frac{1}{2\epsilon^2}\left(\ln|H| + \ln\frac{2}{\delta}\right)$
Sample complexity: example
• 𝒞 : Conjunction of n Boolean literals. Is 𝒞 PAC-learnable?
$|\mathcal{H}| = 3^n$
$m \geq \frac{1}{\varepsilon}\left(n \ln 3 + \ln\frac{1}{\delta}\right)$
• Concrete examples:
– δ=ε=0.05, n=10 gives 280 examples
– δ=0.01, ε=0.05, n=10 gives 312 examples
– δ=ε=0.01, n=10 gives 1,560 examples
– δ=ε=0.01, n=50 gives 5,954 examples
• Result holds for any consistent learner, such as FindS.
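A quick sketch reproducing the concrete numbers above from the bound $m \geq \frac{1}{\varepsilon}\left(n\ln 3 + \ln\frac{1}{\delta}\right)$; the helper name is mine.

import math

def pac_sample_size(n, eps, delta):
    # Sufficient examples for conjunctions of n Boolean literals (|H| = 3^n)
    return math.ceil((n * math.log(3) + math.log(1 / delta)) / eps)

print(pac_sample_size(10, 0.05, 0.05))   # 280
print(pac_sample_size(10, 0.05, 0.01))   # 312
print(pac_sample_size(10, 0.01, 0.01))   # 1560
print(pac_sample_size(50, 0.01, 0.01))   # 5954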
Sample Complexity of Learning
Arbitrary Boolean Functions
• Consider any boolean function over n boolean features
such as the hypothesis space of DNF or decision trees.
There are $2^{2^n}$ of these, so a sufficient number of
examples to learn a PAC concept is:
$m \geq \frac{1}{\varepsilon}\left(\ln 2^{2^n} + \ln\frac{1}{\delta}\right) = \frac{1}{\varepsilon}\left(2^n \ln 2 + \ln\frac{1}{\delta}\right)$
Thank You
Concept Learning Task
“Days in which Aldo enjoys swimming”
Example Sky AirTemp Humidity Wind Water Forecast EnjoySport
1 Sunny Warm Normal Strong Warm Same Yes
2 Sunny Warm High Strong Warm Same Yes
3 Rainy Cold High Strong Warm Change No
4 Sunny Warm High Strong Cool Change Yes
Thank You
Foundations of Machine Learning
Module 7: Computational
Learning Theory
Part A
Sudeshna Sarkar
IIT Kharagpur
Sample Complexity: Infinite
Hypothesis Spaces
• Need some measure of the expressiveness of infinite
hypothesis spaces.
• The Vapnik-Chervonenkis (VC) dimension provides
such a measure, denoted VC(H).
• Analogous to ln|H|, there are bounds for sample
complexity using VC(H).
Shattering
• Consider a hypothesis for the 2-class problem.
• A set of 𝑁 points (instances) can be labeled as + or
− in 2𝑁 ways.
• If for every such labeling a function can be found in
ℋ consistent with this labeling, we say that the set
of instances is shattered by ℋ.
Three points in R2
• It is enough to find one set of three points that can be
shattered.
• It is not necessary to be able to shatter every possible set of
three points in 2 dimensions
Shattering Instances
• Consider 2 instances described using a single real-
valued feature being shattered by a single
interval.
[Figure: two instances x and y on the real line, shattered by a single interval.]
Shattering Instances (cont)
But 3 instances cannot be shattered by a single interval.
[Figure: three instances x, y, z on the real line. Every labeling can be realised except the one that labels x and z positive and y negative: a single interval cannot contain x and z while excluding y.]
VC Dimension
• The Vapnik-Chervonenkis dimension, VC(H). of hypothesis
space H defined over instance space X is the size of the largest
finite subset of X shattered by H. If arbitrarily large finite
subsets of X can be shattered then VC(H) = ∞.
• For single intervals on the real line, all sets of 2 instances can
be shattered, but no set of 3 instances can, so VC(H) = 2.
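A small brute-force sketch (mine, not from the slides) that checks whether single intervals [a, b] on the real line can shatter a given set of points; it relies on the fact that any labeling realisable by some interval is realisable by an interval whose endpoints are data points (or by the empty interval).

from itertools import product

def interval_shatters(points):
    # Hypotheses: "label x positive iff a <= x <= b". Can every labeling be realised?
    ends = sorted(points) + [min(points) - 1]          # data points plus an "empty interval" endpoint
    for labeling in product([0, 1], repeat=len(points)):
        realised = False
        for a in ends:
            for b in ends:
                h = tuple(1 if a <= x <= b else 0 for x in points)
                if h == labeling:
                    realised = True
        if not realised:
            return False
    return True

print(interval_shatters([1.0, 2.0]))        # True: 2 points can be shattered
print(interval_shatters([1.0, 2.0, 3.0]))   # False: the labeling +,-,+ cannot be realised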
VC Dimension
• An unbiased hypothesis space shatters the entire instance
space.
• The larger the subset of X that can be shattered, the more
expressive (and less biased) the hypothesis space is.
• The VC dimension of the set of oriented lines in 2-d is
three.
VC Dimension Example
Consider axis-parallel rectangles in the real-plane,
i.e. conjunctions of intervals on two real-valued
features. Some set of 4 instances can be shattered, but no set of 5 can
(any axis-parallel rectangle containing the 4 outermost points of a
5-point set must also contain the remaining point).
• Therefore VC(H) = 4
• Generalizes to axis-parallel hyper-rectangles (conjunctions of
intervals in n dimensions): VC(H) = 2n.
Upper Bound on Sample Complexity with VC
A sufficient number of examples for PAC learning with hypothesis space H is (Blumer et al., 1989):
$m \geq \frac{1}{\varepsilon}\left(4\log_2\frac{2}{\delta} + 8\,VC(H)\log_2\frac{13}{\varepsilon}\right)$
Sample Complexity Lower Bound with VC
• There is also a general lower bound on the minimum number of
examples necessary for PAC learning (Ehrenfeucht, et al., 1989):
Consider any concept class C such that $VC(C) \geq 2$, any learner L,
and any $0 < \varepsilon < 1/8$, $0 < \delta < 1/100$.
Then there exists a distribution $\mathcal{D}$ and a target concept in C such that if
L observes fewer than
$\max\left[\frac{1}{\varepsilon}\log_2\frac{1}{\delta},\ \frac{VC(C)-1}{32\,\varepsilon}\right]$
examples, then with probability at least δ, L outputs a hypothesis
having error greater than ε.
• Ignoring constant factors, this lower bound is the same as the upper
bound except for the extra $\log_2(1/\varepsilon)$ factor in the upper bound.
Thank You
Foundations of Machine Learning
Sudeshna Sarkar
IIT Kharagpur
What is Ensemble Classification?
• Use multiple learning algorithms (classifiers)
• Combine the decisions
• Can be more accurate than the individual classifiers
• Generate a group of base-learners
• Different learners use different
– Algorithms
– Hyperparameters
– Representations (Modalities)
– Training sets
Why should it work?
• Works well only if the individual classifiers
disagree
– Error rate < 0.5 and errors are independent
– The ensemble error rate is highly correlated with the correlations
among the errors made by the different learners
Bias vs. Variance
• We would like low bias error and low variance error
• Ensembles using multiple trained (high variance/low
bias) models can average out the variance, leaving
just the bias
– Less worry about overfit (stopping criteria, etc.)
with the base models
Combining Weak Learners
• Combining weak learners
– Assume n independent models, each having accuracy of
70%.
– If all n give the same class output then you can be confident
it is correct with probability $1 - (1 - 0.7)^n$.
– Normally not completely independent, but unlikely that all n
would give the same output
• Accuracy better than the base accuracy of the models by using
the majority output.
– If $n_1$ models say class 1 and $n_2 < n_1$ models say class 2, then
P(class 1) = 1 − Binomial($n$, $n_2$, 0.7), where
$P(r) = \frac{n!}{r!\,(n-r)!}\, p^r (1 - p)^{n-r}$
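A quick sketch computing the majority-vote accuracy of n independent base classifiers, each with accuracy p = 0.7, using the binomial formula above (the helper name is mine).

import math

def majority_vote_accuracy(n, p=0.7):
    # Probability that more than half of n independent classifiers are correct.
    k_min = n // 2 + 1
    return sum(math.comb(n, r) * p**r * (1 - p)**(n - r) for r in range(k_min, n + 1))

for n in (1, 5, 11, 21):
    print(n, round(majority_vote_accuracy(n), 3))   # 0.7, 0.837, 0.922, and rising toward 1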
Ensemble Creation Approaches
• Get less correlated errors between models
– Injecting randomness
• initial weights (eg, NN), different learning parameters,
different splits (eg, DT) etc.
– Different Training sets
• Bagging, Boosting, different features, etc.
– Forcing differences
• different objective functions
– Different machine learning model
Ensemble Combining Approaches
• Unweighted Voting (e.g. Bagging)
• Weighted voting – based on accuracy (e.g. Boosting),
Expertise, etc.
• Stacking - Learn the combination function
Combine Learners: Voting
• Unweighted voting
• Linear combination (weighted vote), with weight ∝ accuracy or weight ∝ 1/variance:
$y = \sum_{j=1}^{L} w_j d_j, \quad w_j \geq 0,\ \sum_{j=1}^{L} w_j = 1$
• Bayesian combination:
$P(C_i \mid x) = \sum_{\text{all models } M_j} P(C_i \mid x, M_j)\, P(M_j)$
Fixed Combination Rules
Bayes Optimal Classifier
• The Bayes Optimal Classifier is an ensemble of all the
hypotheses in the hypothesis space.
• On average, no other ensemble can outperform it.
• The vote for each hypothesis
– proportional to the likelihood that the training dataset
would be sampled from a system if that hypothesis were
true.
– is multiplied by the prior probability of that hypothesis.
$y = \underset{c_j \in C}{\arg\max} \sum_{h_i \in H} P(c_j \mid h_i)\, P(T \mid h_i)\, P(h_i)$
• y is the predicted class,
• C is the set of all possible classes,
• H is the hypothesis space,
• T is the training data.
The Bayes Optimal Classifier represents a hypothesis
that is not necessarily in H.
But it is the optimal hypothesis in the ensemble space.
Practicality of Bayes Optimal Classifier
• Cannot be practically implemented.
• Most hypothesis spaces are too large
• Many hypotheses output a class or a value, and not
probability
• Estimating the prior probability for each
hypothesis is not always possible.
BMA
• If the $d_j$ are independent:
$\mathrm{Var}(y) = \mathrm{Var}\left(\frac{1}{L}\sum_j d_j\right) = \frac{1}{L^2}\,\mathrm{Var}\left(\sum_j d_j\right) = \frac{1}{L^2}\, L\, \mathrm{Var}(d_j) = \frac{1}{L}\,\mathrm{Var}(d_j)$
i.e., averaging over L independent models reduces the variance by a factor of L.
Sudeshna Sarkar
IIT Kharagpur
Bagging
• Bagging = “bootstrap aggregation”
– Draw N items from X with replacement
• Desired: base learners with high variance (unstable)
– Decision trees and ANNs are unstable
– K-NN is stable
• Use bootstrapping to generate L training sets and
train one base-learner with each (Breiman, 1996)
• Use voting
Bagging
• Sampling with replacement
AdaBoost
Given: $(x_1, y_1), \dots, (x_m, y_m)$ where $x_i \in X$, $y_i \in Y = \{-1, +1\}$
Initialize $D_1(i) = 1/m$.
For $t = 1, \dots, T$:
– Train weak learner using distribution $D_t$.
– Get weak classifier $h_t : X \to \mathbb{R}$.
– Choose $\alpha_t \in \mathbb{R}$ to minimize training error: $\alpha_t = \frac{1}{2}\ln\frac{1 - \epsilon_t}{\epsilon_t}$
– Update: $D_{t+1}(i) = \frac{D_t(i)\exp(-\alpha_t y_i h_t(x_i))}{Z_t}$, where $Z_t = \sum_{i=1}^{m} D_t(i)\exp(-\alpha_t y_i h_t(x_i))$ is a normalization factor.
Output the final classifier: $H(x) = \mathrm{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$
Strong weak classifiers
• If each classifier is (at least slightly) better than random: $\epsilon_t < 0.5$
• Then the training error of the final classifier drops exponentially fast:
$\frac{1}{m}\sum_{i=1}^{m}\delta\bigl(H(x_i) \neq y_i\bigr) \leq \prod_t Z_t \leq \exp\left(-2\sum_{t=1}^{T}\left(\tfrac{1}{2} - \epsilon_t\right)^2\right)$
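A minimal sketch of these AdaBoost updates using decision stumps from scikit-learn as the weak learner; the use of DecisionTreeClassifier and the toy data are my choices, not from the slides.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, T=10):
    m = len(y)
    D = np.full(m, 1.0 / m)                # D_1(i) = 1/m
    stumps, alphas = [], []
    for t in range(T):
        h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=D)
        pred = h.predict(X)
        eps = np.clip(np.sum(D[pred != y]), 1e-10, 1 - 1e-10)   # weighted error eps_t
        alpha = 0.5 * np.log((1 - eps) / eps)
        D = D * np.exp(-alpha * y * pred)  # up-weight misclassified points
        D = D / D.sum()                    # normalize by Z_t
        stumps.append(h)
        alphas.append(alpha)
    return stumps, alphas

def predict(stumps, alphas, X):
    agg = sum(a * h.predict(X) for h, a in zip(stumps, alphas))
    return np.sign(agg)                    # H(x) = sign(sum_t alpha_t h_t(x))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)   # toy labels in {-1, +1}
stumps, alphas = adaboost(X, y, T=20)
print(np.mean(predict(stumps, alphas, X) == y))   # training accuracy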
Illustrating AdaBoost
[Figure: AdaBoost over three boosting rounds on a small data set. Starting from equal weights, each round re-weights the data points, fits a new weak classifier, and assigns it a coefficient (here α₁ = 1.9459, α₂ = 2.9323, α₃ = 3.8744); the final classifier combines the three rounds.]
Thank You
Foundations of Machine Learning
Module 9: Clustering
Part A: Introduction and kmeans
Sudeshna Sarkar
IIT Kharagpur
Unsupervised learning
• Unsupervised learning:
– Data with no target attribute. Describe hidden structure from
unlabeled data.
– Explore the data to find some intrinsic structures in them.
• Clustering: the task of grouping a set of objects in such a
way that objects in the same group (called a cluster) are
more similar to each other than to those in other clusters.
• Useful for
– Automatically organizing data.
– Understanding hidden structure in data.
– Preprocessing for further analysis.
Applications: News Clustering (Google)
Gene Expression Clustering
Other Applications
• Biology: classification of plants and animal kingdom
given their features
• Marketing: Customer Segmentation based on a
database of customer data containing their
properties and past buying records
• Clustering weblog data to discover groups of similar
access patterns.
• Recognize communities in social networks.
An illustration
• This data set has four natural clusters.
[Figure: a 2-D scatter plot with four visually separated groups of points.]
Aspects of clustering
• A clustering algorithm, such as
– Partitional clustering, e.g. k-means
– Hierarchical clustering, e.g. AHC
– Mixture of Gaussians
• A distance or similarity function
– such as Euclidean, Minkowski, cosine
• Clustering quality
– Inter-cluster distance maximized
– Intra-cluster distance minimized
The quality of a clustering result depends on the algorithm, the distance function, and the application.
Major Clustering Approaches
• Partitioning: Construct various partitions and then evaluate
them by some criterion
• Hierarchical: Create a hierarchical decomposition of the set of
objects using some criterion
• Model-based: Hypothesize a model for each cluster and find
best fit of models to data
• Density-based: Guided by connectivity and density functions
• Graph-Theoretic Clustering
Partitioning Algorithms
• Partitioning method: Construct a partition of a
database D of m objects into a set of k clusters
• Given a k, find a partition of k clusters that optimizes
the chosen partitioning criterion
– Global optimal: exhaustively enumerate all partitions
– Heuristic method: k-means (MacQueen, 1967)
Hierarchical Clustering
[Figure: an example hierarchy — animal splits into vertebrate and invertebrate.]
• Pearson correlation
$r(x_i, x_j) = \frac{\mathrm{Cov}(x_i, x_j)}{\sigma_{x_i}\sigma_{x_j}}$
Quality of Clustering
• Internal evaluation:
– assign the best score to the algorithm that produces clusters with high
similarity within a cluster and low similarity between clusters, e.g.,
Davies-Bouldin index
$DB = \frac{1}{k}\sum_{i=1}^{k}\max_{j \neq i}\frac{\sigma_i + \sigma_j}{d(c_i, c_j)}$
• External evaluation:
– evaluated based on data such as known class labels and external
benchmarks, e.g., Rand Index, Jaccard Index, F-measure
$RI = \frac{TP + TN}{TP + FP + FN + TN}$
$J(A, B) = \frac{|A \cap B|}{|A \cup B|} = \frac{TP}{TP + FP + FN}$
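A small sketch (not from the slides) computing the Rand and Jaccard indices from pair counts, given a clustering and ground-truth labels; the toy labelings are illustrative.

from itertools import combinations

def pair_counts(labels_true, labels_pred):
    TP = FP = FN = TN = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        if same_true and same_pred:
            TP += 1
        elif not same_true and same_pred:
            FP += 1
        elif same_true and not same_pred:
            FN += 1
        else:
            TN += 1
    return TP, FP, FN, TN

labels_true = [0, 0, 0, 1, 1, 1]
labels_pred = [0, 0, 1, 1, 1, 1]
TP, FP, FN, TN = pair_counts(labels_true, labels_pred)
print((TP + TN) / (TP + FP + FN + TN))   # Rand index
print(TP / (TP + FP + FN))               # Jaccard index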
Thank You
Foundations of Machine Learning
Module 9: Clustering
Part B: kmeans clustering
Sudeshna Sarkar
IIT Kharagpur
Partitioning Algorithms
• Given k
• Construct a partition of m objects 𝑋 = {𝒙𝟏 , 𝒙𝟐 , … , 𝒙𝒎 }
where 𝒙𝒊 = (𝑥𝑖1 , 𝑥𝑖2 , … , 𝑥𝑖𝑛 ) is a vector in a real-valued space 𝑋 ⊆ ℝ𝑛 , n is the
number of attributes.
• into a set of k clusters 𝑆 = {𝑆1 , 𝑆2 , … , 𝑆𝑘 }
• The cluster mean 𝜇𝑖 serves as a prototype of the cluster 𝑆𝑖 .
• Find k clusters that optimizes a chosen criterion
– E.g., the within-cluster sum of squares (WCSS)
(sum of distance functions of each point in the cluster to the cluster
mean)
$\underset{S}{\arg\min}\sum_{i=1}^{k}\sum_{x \in S_i}\left\| x - \mu_i \right\|^2$
Stopping/convergence criterion (any one of the following):
1. no re-assignments of data points to different clusters
2. no (or minimum) change of centroids
3. minimum decrease in the sum of squared error
$SSE = \sum_{i=1}^{k}\sum_{x \in S_i}\left\| x - \mu_i \right\|^2$
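A minimal NumPy sketch of the k-means iteration, stopping when the centroids no longer change; the data, k = 3, and the random initialisation are illustrative choices, and for simplicity it assumes no cluster becomes empty.

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # initial means
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid (Euclidean distance).
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assign = d.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points.
        new_centroids = np.array([X[assign == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):   # convergence: centroids unchanged
            break
        centroids = new_centroids
    return centroids, assign

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2)) for c in [(0, 0), (3, 3), (0, 3)]])
centroids, assign = kmeans(X, k=3)
print(centroids)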
K-means illustrated
[Figure: successive iterations of k-means — points are assigned to the nearest centroid and the centroids are recomputed until convergence.]
Similarity / Distance measures
• Distance metric (scale-dependent)
– Minkowski family of distance measures
$d(\boldsymbol{x_i}, \boldsymbol{x_j}) = \left(\sum_{s=1}^{n}\left| x_{is} - x_{js} \right|^p\right)^{1/p}$
Manhattan (p = 1), Euclidean (p = 2)
– Cosine distance
Similarity / Distance measures
• Correlation coefficients (scale-invariant)
• Mahalanobis distance
$d(x_i, x_j) = \sqrt{(x_i - x_j)^\top \Sigma^{-1} (x_i - x_j)}$
• Pearson correlation
$r(x_i, x_j) = \frac{\mathrm{Cov}(x_i, x_j)}{\sigma_{x_i}\sigma_{x_j}}$
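These measures are a few lines of NumPy each; a minimal sketch (helper names are mine), with the covariance matrix Σ estimated from illustrative random data for the Mahalanobis distance.

import numpy as np

def minkowski(xi, xj, p):
    return np.sum(np.abs(xi - xj) ** p) ** (1.0 / p)   # p=1: Manhattan, p=2: Euclidean

def mahalanobis(xi, xj, Sigma_inv):
    d = xi - xj
    return np.sqrt(d @ Sigma_inv @ d)

X = np.random.default_rng(0).normal(size=(100, 3))
Sigma_inv = np.linalg.inv(np.cov(X, rowvar=False))
print(minkowski(X[0], X[1], p=2), mahalanobis(X[0], X[1], Sigma_inv))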
Convergence of K-Means
• Recomputation monotonically decreases each squared error, since
($m_j$ is the number of members in cluster j):
$\sum_i (x_i - a)^2$ reaches its minimum for the $a$ satisfying
$\sum_i -2(x_i - a) = 0$, i.e. $\sum_i x_i = m_j a$, so
$a = \frac{1}{m_j}\sum_i x_i = c_j$ (the cluster mean).
Advantages
• Fast, robust, and easy to understand.
• Relatively efficient: O(tkmn), with t iterations, k clusters, m objects, n attributes
• Gives best results when the clusters are distinct or
well separated from each other.
Disadvantages
• Requires a priori specification of the number of
cluster centers.
• Hard assignment of data points to clusters
• Euclidean distance measures can unequally
weight underlying factors.
• Applicable only when a mean is defined, i.e. fails
for categorical data.
• Finds only a local optimum
K-Means on RGB image
[Figure: each pixel x_i = {r_i, g_i, b_i} is fed to a k-means classifier, which outputs a cluster assignment C(x_i) together with cluster parameters θ_1, ..., θ_k, one per cluster.]
Module 9: Clustering
Part C: Hierarchical Clustering
Sudeshna Sarkar
IIT Kharagpur
Hierarchical Clustering
[Figure: an example hierarchy — animal splits into vertebrate and invertebrate.]
Dendrogram: Hierarchical Clustering
Dendrogram
– Given an input set S
– nodes represent subsets of S
– Features of the tree:
– The root is the whole input set S.
– The leaves are the individual elements of S.
– The internal nodes are defined as the union of their children.
[Figure: a dendrogram over six points, with merge heights on the vertical axis.]
Dendrogram: Hierarchical Clustering
Dendrogram
– Each level of the tree represents a partition of the input data into several (nested) clusters or groups.
– May be cut at any level: each connected component forms a cluster.
[Figure: cutting the dendrogram at a chosen height yields a flat clustering.]
Hierarchical clustering
Hierarchical Agglomerative Clustering
• Initially each data point forms a cluster.
• Compute the distance matrix between the
clusters.
• Repeat
– Merge the two closest clusters
– Update the distance matrix
• Until only a single cluster remains.
[Figure: initially every point p1, p2, ..., p12 is its own cluster, with a full distance/proximity matrix between the points.]
Intermediate State
• After some merging steps, we have some clusters
[Figure: five clusters C1, ..., C5 and the distance/proximity matrix between them.]
Intermediate State
Merge the two closest clusters (C2 and C5) and update the
distance matrix.
[Figure: clusters C2 and C5 are merged, and the proximity matrix must be updated for the new cluster.]
After Merging
• Update the distance matrix
[Figure: after merging C2 and C5, the distances between the new cluster C2 ∪ C5 and each remaining cluster (C1, C3, C4) must be recomputed, shown as '?' entries in the matrix.]
Closest Pair
• A few ways to measure distances of two clusters.
• Single-link
– Similarity of the most similar (single-link)
• Complete-link
– Similarity of the least similar points
• Centroid
– Clusters whose centroids (centers of gravity) are the
most similar
• Average-link
– Average cosine between pairs of elements
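These linkage criteria are available directly in SciPy; a minimal sketch (the two-blob data are illustrative) comparing single, complete, and average linkage.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, size=(20, 2)), rng.normal(3, 0.3, size=(20, 2))])

for method in ("single", "complete", "average"):
    Z = linkage(X, method=method)                      # bottom-up merges, closest pair first
    labels = fcluster(Z, t=2, criterion="maxclust")    # cut the dendrogram into 2 clusters
    print(method, np.bincount(labels)[1:])             # cluster sizes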
Distance between two clusters
• Single-link distance between clusters Ci and Cj
is the minimum distance between any object
in Ci and any object in Cj
Single-link clustering: example
• Determined by one pair of points, i.e., by one
link in the proximity graph.
I1 I2 I3 I4 I5
I1 1.00 0.90 0.10 0.65 0.20
I2 0.90 1.00 0.70 0.60 0.50
I3 0.10 0.70 1.00 0.40 0.30
I4 0.65 0.60 0.40 1.00 0.80
I5 0.20 0.50 0.30 0.80 1.00
[Figure: the resulting single-link dendrogram over points 1–5.]
Complete link method
• The distance between two clusters is the distance of
two furthest data points in the two clusters.
$sim(c_i, c_j) = \min_{x \in c_i,\, y \in c_j} sim(x, y)$
Complete-link clustering: example
• Distance between clusters is determined by
the two most distant points in the different
clusters
I1 I2 I3 I4 I5
I1 1.00 0.90 0.10 0.65 0.20
I2 0.90 1.00 0.70 0.60 0.50
I3 0.10 0.70 1.00 0.40 0.30
I4 0.65 0.60 0.40 1.00 0.80
I5 0.20 0.50 0.30 0.80 1.00
[Figure: the resulting complete-link dendrogram over points 1–5.]
Complete Link Example
Computational Complexity
• In the first iteration, all HAC methods need to compute the
similarity of all pairs of N initial instances, which is O(N²).
• In each of the subsequent N−2 merging iterations, compute the
distance between the most recently created cluster and all other
existing clusters.
• In order to maintain an overall O(N²) performance, computing
the similarity to each other cluster must be done in constant time.
– Often O(N³) if done naively, or O(N² log N) if done more cleverly
Average Link Clustering
• Similarity of two clusters = average similarity between
any object in Ci and any object in Cj
$sim(c_i, c_j) = \frac{1}{|C_i|\,|C_j|}\sum_{x \in C_i}\sum_{y \in C_j} sim(x, y)$
• Compromise between single and complete link. Less
susceptible to noise and outliers.
• Two options:
– Averaged across all ordered pairs in the merged
cluster
– Averaged over all pairs between the two original
clusters
The complexity
• All the algorithms are at least O(n²), where n is the
number of data points.
• Single link can be done in O(n²).
• Complete and average links can be done in O(n² log n).
• Due to the complexity, hierarchical clustering is hard to use
for large data sets.
Model-based clustering
• Assume data generated from 𝑘 probability
distributions
• Goal: find the distribution parameters
• Algorithm: Expectation Maximization (EM)
• Output: Distribution parameters and a soft
assignment of points to clusters
Model-based clustering
• Assume 𝑘 probability distributions with
parameters 𝜃1 , 𝜃2 , … , 𝜃𝑘
• Given data 𝑋, compute 𝜃1 , 𝜃2 , … , 𝜃𝑘 such that
𝑃𝑟(𝑋|𝜃1 , 𝜃2 , … , 𝜃𝑘 ) [likelihood] or
ln 𝑃𝑟(𝑋|𝜃1 , 𝜃2 , … , 𝜃𝑘 ) [log likelihood]
is maximized.
• Every point $x \in X$ may be generated by multiple
distributions with some probability
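As a hedged illustration (the slides do not prescribe a library), the EM procedure for a mixture of Gaussians is available in scikit-learn; the two-blob data and k = 2 below are arbitrary choices.

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(100, 2)), rng.normal(5, 1, size=(100, 2))])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)   # runs EM internally
print(gmm.means_)                 # estimated distribution parameters (cluster means)
print(gmm.predict_proba(X[:3]))   # soft assignment of points to clusters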
EM Algorithm
• Initialize the parameters 𝜃1 , 𝜃2 , … , 𝜃𝑘 randomly
• Let each parameter correspond to a cluster center (mean)
• Iterate between two steps
– Expectation step: (probabilistically) assign points to
clusters