
Introduction to Machine Learning

Module 1: Introduction
Part A: Introduction

Sudeshna Sarkar
IIT Kharagpur
Overview of Course
1. Introduction
2. Linear Regression and Decision Trees
3. Instance based learning
Feature Selection
4. Probability and Bayes Learning
5. Support Vector Machines
6. Neural Network
7. Introduction to Computational Learning Theory
8. Clustering
Module 1
1. Introduction
a) Introduction
b) Different types of learning
c) Hypothesis space, Inductive Bias
d) Evaluation, Training and test set, cross-validation
2. Linear Regression and Decision Trees
3. Instance based learning
Feature Selection
4. Probability and Bayes Learning
5. Support Vector Machines
6. Neural Network
7. Introduction to Computational Learning Theory
8. Clustering
Machine Learning History
• 1950s:
– Samuel's checker-playing program
• 1960s:
– Neural network: Rosenblatt's perceptron
– Minsky & Papert prove limitations of Perceptron
• 1970s:
– Symbolic concept induction
– Expert systems and knowledge acquisition bottleneck
– Quinlan's ID3
– Natural language processing (symbolic)
Machine Learning History
• 1980s:
– Advanced decision tree and rule learning
– Learning and planning and problem solving
– Resurgence of neural network
– Valiant's PAC learning theory
– Focus on experimental methodology
• 1990s: ML and Statistics
– Data Mining
– Adaptive agents and web applications
– Text learning
– Reinforcement learning
– Ensembles
– Bayes Net learning
• 1994: Self-driving car road test
• 1997: Deep Blue beats Garry Kasparov
Machine Learning History
• Popularity of this field in recent times, and the reasons behind it:
– New software / algorithms
• Neural networks
• Deep learning
– New hardware
• GPUs
– Cloud enabled
– Availability of Big Data
• 2009: Google builds self-driving car
• 2011: Watson wins Jeopardy
• 2014: Human vision surpassed by ML systems
Programs vs learning algorithms
Algorithmic solution: Data + Program → Computer → Output
Machine learning solution: Data + Output → Computer → Program
Machine Learning : Definition
• Learning is the ability to improve one's behaviour based on
experience.
• Build computer systems that automatically improve with
experience
• What are the fundamental laws that govern all learning
processes?
• Machine Learning explores algorithms that can
– learn from data / build a model from data
– use the model for prediction, decision making or solving
some tasks
Machine Learning : Definition
• A computer program is said to learn from
experience E with respect to some class of
tasks T and performance measure P, if its
performance at tasks in T, as measured by P,
improves with experience E.
[Mitchell]
Components of a learning problem
• Task: The behaviour or task being improved.
– For example: classification, acting in an
environment
• Data: The experiences that are being used to
improve performance in the task.
• Measure of improvement:
– For example: increasing accuracy in prediction, acquiring new skills, improved speed and efficiency
Black-box Learner
Inputs: Experiences / Data, a Problem / Task, and Background knowledge / Bias.
Output: an Answer / Performance.
Learner
Experiences / Data and Background knowledge / Bias feed the Learner, which builds Models; a Reasoner uses the Models on the Problem / Task to produce the Answer / Performance.
Many domains and applications
Medicine:
• Diagnose a disease
– Input: symptoms, lab measurements, test results,
DNA tests,
– Output: one of a set of possible diseases, or "none of the above"
• Data: historical medical records
• Learn: which future patients will respond best to
which treatments
Many domains and applications
Vision:
• say what objects appear in an image
• convert hand-written digits to characters 0..9
• detect where objects appear in an image
Many domains and applications
Robot control:
• Design autonomous mobile robots that learn
from experience to
– Play soccer
– Navigate from their own experience
Many domains and applications
NLP:
• detect where entities are mentioned in NL
• detect what facts are expressed in NL
• detect if a product/movie review is positive,
negative, or neutral

Speech recognition
Machine translation
Many domains and applications
Financial:
• predict if a stock will rise or fall
• predict if a user will click on an ad or not
Application in Business Intelligence
• Forecasting product sales quantities taking
seasonality and trend into account.
• Identifying cross selling promotional opportunities
for consumer goods.
• …
Some other applications
• Fraud detection : Credit card Providers
• determine whether or not someone will
default on a home mortgage.
• Understand consumer sentiment based on unstructured text data.
• Forecasting women's conviction rates based on external macroeconomic factors.
Learner
Experiences / Data and Background knowledge / Bias feed the Learner, which builds Models; a Reasoner uses the Models on the Problem / Task to produce the Answer / Performance.
Design a Learner
1. Choose the training experience
2. Choose the target function (that is to be learned)
3. Choose how to represent the target function
4. Choose a learning algorithm to infer the target function
Choosing a Model Representation
• The richer the representation, the more useful
it is for subsequent problem solving.
• The richer the representation, the more
difficult it is to learn.

• Components of Representation
– Features
– Function class / hypothesis language
Foundations of Machine Learning
Module 1: Introduction
Part B: Different types of learning

Sudeshna Sarkar
IIT Kharagpur
Module 1
1. Introduction
a) Introduction
b) Different types of learning
c) Hypothesis space, Inductive Bias
d) Evaluation, Training and test set, cross-validation
2. Linear Regression and Decision Trees
3. Instance based learning
Feature Selection
4. Probability and Bayes Learning
5. Neural Network
6. Support Vector Machines
7. Introduction to Computational Learning Theory
8. Clustering
Broad types of machine learning
• Supervised Learning
– X,y (pre-classified training examples)
– Given an observation x, what is the best label for y?

• Unsupervised learning
–X
– Given a set of x's, cluster or summarize them

• Semi-supervised Learning
• Reinforcement Learning
– Determine what to do based on rewards and punishments.
Supervised Learning
Training data X, y: pairs (Input1, Output1), (Input2, Output2), …, (Input-n, Output-n).
The Learning Algorithm builds a Model from these pairs; given a New Input x, the Model produces Output y.
Unsupervised Learning
Training data X: Input1, Input2, Input3, …, Input-n (no labels).
The Learning Algorithm groups the inputs into Clusters.
Semi-supervised learning
Reinforcement Learning
The Agent observes state st and takes action at; the Environment returns reward rt+1 and the next state st+1.
Reinforcement Learning
The RLearner interacts with the Environment in the same loop (state st, action at, reward rt+1, next state st+1), updating its Q-values from each (state, action) experience; the learned Policy then maps each state to the best action.
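As a rough sketch of the Q-value update loop in the diagram above (mine, not part of the slides), here is one tabular Q-learning step in Python; the state/action indexing, learning rate alpha and discount gamma are assumptions for the example.

import numpy as np

def q_learning_step(Q, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """One tabular Q-learning update: move Q[s, a] toward r + gamma * max_a' Q[s', a']."""
    td_target = reward + gamma * np.max(Q[next_state])
    Q[state, action] += alpha * (td_target - Q[state, action])
    return Q

# Tiny usage with 3 states and 2 actions (all values hypothetical).
Q = np.zeros((3, 2))
Q = q_learning_step(Q, state=0, action=1, reward=1.0, next_state=2)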
Supervised Learning
Given:
– a set of input features X1, …, Xn
– a target feature Y
– a set of training examples where the values for the input features and the target feature are given for each example
– a new example, where only the values for the input features are given
Predict the value of the target feature for the new example:
– classification when Y is discrete
– regression when Y is continuous
Classification

Example: Credit scoring

Differentiating between
low-risk and high-risk
customers from their
income and savings
Regression

Example: Price of a used car
x: car attributes (e.g. mileage)
y: price
Fit y = w x + w0, or more generally y = g(x, θ), where g(·) is the model and θ its parameters.
Features
• Often, the individual observations are analyzed
into a set of quantifiable properties which are
called features. May be
– categorical (e.g. "A", "B", "AB" or "O", for blood type)
– ordinal (e.g. "large", "medium" or "small")
– integer-valued (e.g. the number of words in a text)
– real-valued (e.g. height)
Example Data
Training Examples:
Action Author Thread Length Where
e1 skips known new long Home
e2 reads unknown new short Work
e3 skips unknown old long Work
e4 skips known old long home
e5 reads known new short home
e6 skips known old long work

New Examples:
e7 ??? known new short work
e8 ??? unknown new short work
Training: the Training Set is fed to the Learning Algorithm, which outputs a Hypothesis.
Testing: the Hypothesis is applied to new inputs X to produce predicted y.
Training phase: a feature extractor turns each input into features; the machine learning algorithm learns from (features, label) pairs.
Testing Phase: the feature extractor is applied to a new input, and the classifier model predicts its label.
Classification learning
• Task T:
– input:
– output:
• Performance metric P:
• Experience E:
Classification learning
• Task T:
– input: a set of instances d1,…,dn
• an instance has a set of features
• we can represent an instance as a vector d=<x1,…,xn>
– output: a set of predictions ŷ1,..., ŷn
• one of a fixed set of constant values:
– {+1,-1} or {cancer, healthy}, or {rose, hibiscus, jasmine, …}, or …
• Performance metric P:
• Experience E:
Classification Learning
Task: medical diagnosis — Instance: patient record (diastolic blood pressure, systolic blood pressure, age, sex (0 or 1), BMI, cholesterol) — Labels: {-1,+1} = low / high risk of heart disease
Task: finding entity names in text — Instance: a word in context (capitalized (0,1), word-after-this-equals-Inc, bigram-before-this-equals-acquired-by, …) — Labels: {first, later, outside} = first word in name, second or later word in name, not in a name
Task: image recognition — Instance: image (1920×1080 pixels, each with a code for color) — Labels: {0,1} = no house / house
Classification learning
• Task T:
– input: a set of instances d1,…,dn
– output: a set of predictions ŷ1,..., ŷn
• Performance metric P:
– Prob (wrong prediction) on examples from D
(we care about performance on the distribution D, not on the training data)
• Experience E:
– a set of labeled examples (x,y) where y is the true label for x
– ideally, examples should be sampled from some fixed distribution D
Classification Learning
Task: medical diagnosis — Instance: patient record (lab readings) — Labels: risk of heart disease — Getting data: wait and look for heart disease
Task: finding entity names in text — Instance: a word in context (capitalized, nearby words, ...) — Labels: {first, later, outside} — Getting data: text with manually annotated entities
Task: image recognition — Instance: image (pixels) — Labels: no house / house — Getting data: hand-labeled images
Representations
1. Decision Tree (e.g., Weekend? Yes → EatOut; No → Late? Yes → EatOut, No → Home)
2. Linear function
3. Multivariate linear function
4. Single layer perceptron
5. Multi-layer neural network
Hypothesis Space
• One way to think about a supervised learning
machine is as a device that explores a
"hypothesis space".
– Each setting of the parameters in the machine is a
different hypothesis about the function that maps
input vectors to output vectors.
Terminology
• Features: The number of features or distinct
traits that can be used to describe each item
in a quantitative manner.
• Feature vector: n-dimensional vector of
numerical features that represent some object
• Instance Space X: Set of all possible objects
describable by features.
• Example (x,y): Instance x with label y=f(x).
Terminology
• Concept c: Subset of objects from X (c is
unknown).
• Target Function f: Maps each instance x ∈ X to
target label y ∈ Y
• Example (x,y): Instance x with label y=f(x).
• Training Data S: Collection of examples observed
by learning algorithm.
Used to discover potentially predictive relationships
Foundations of Machine Learning
Module 1: Introduction
Part c: Hypothesis Space and Inductive Bias

Sudeshna Sarkar
IIT Kharagpur
Inductive Learning or Prediction
• Given examples of a function (X, F(X))
– Predict function F(X) for new examples X
• Classification
F(X) = Discrete
• Regression
F(X) = Continuous
• Probability estimation
F(X) = Probability(X):
Features
• Features: Properties that describe each
instance in a quantitative manner.
• Feature vector: n-dimensional vector of
features that represent some object
Feature Space
Each example is a point in feature space, e.g. <0.5, 2.8, +>.
(Scatter plot of positive (+) and negative (−) examples over two numeric features.)
Slide by Jesse Davis: University of Washington
Terminology
Hypothesis: a function for labeling examples.
(The same scatter plot, partitioned into a + region and a − region; query points marked "?" receive the label of their region.)
Slide by Jesse Davis: University of Washington
Terminology
Hypothesis Space: the set of legal hypotheses.
(The same scatter plot of + and − examples.)
Slide by Jesse Davis: University of Washington
Representations
1. Decision Tree (e.g., Weekend? Yes → EatOut; No → Late? Yes → EatOut, No → Home)
2. Linear function
3. Multivariate linear function
4. Single layer perceptron
5. Multi-layer neural networks
Hypothesis Space
• The space of all hypotheses that can, in
principle, be output by a learning algorithm.

• We can think about a supervised learning


machine as a device that explores a
"hypothesis space".
– Each setting of the parameters in the machine is a
different hypothesis about the function that maps
input vectors to output vectors.
Terminology
• Example (x,y): Instance x with label y.
• Training Data S: Collection of examples observed by
learning algorithm.
• Instance Space X: Set of all possible objects
describable by features.
• Concept c: Subset of objects from X (c is unknown).
• Target Function f: Maps each instance x ∈ X to target
label y ∈ Y
Classifier
• Hypothesis h: Function that approximates f.
• Hypothesis Space ℋ : Set of functions we allow for
approximating f.
• The set of hypotheses that can be produced, can be
restricted further by specifying a language bias.
• Input: Training set 𝒮 ⊆ 𝑋 × 𝑌
• Output: A hypothesis ℎ ∈ ℋ
Hypothesis Spaces
• If there are 4 (in general N) binary input features, there are 2^(2^4) = 65,536 (in general 2^(2^N)) possible Boolean functions.
• We cannot figure out which one is correct unless we see every possible input–output pair — all 2^4 (2^N) of them.
Example
Hypothesis language
1. may contain representations of all polynomial
functions from X to Y if X = ℛ 𝑛 and Y = ℛ,
2. may be able to represent all conjunctive concepts over X when X = 𝐵ⁿ and Y = 𝐵 (with B the set of booleans).
• Hypothesis language reflects an inductive bias
that the learner has
Inductive Bias
• Need to make assumptions
– Experience alone doesn't allow us to make conclusions about unseen data instances

• Two types of bias:


– Restriction: Limit the hypothesis space
(e.g., look at rules)
– Preference: Impose ordering on hypothesis space
(e.g., more general, consistent with data)
Inductive learning
• Inductive learning: Inducing a general function
from training examples
– Construct hypothesis h to agree with c on the training
examples.
– A hypothesis is consistent if it agrees with all training
examples.
– A hypothesis is said to generalize well if it correctly predicts the value of y for novel examples.
• Inductive Learning is an Ill Posed Problem:
Unless we see all possible examples the data is not sufficient
for an inductive learning algorithm to find a unique solution.
Inductive Learning Hypothesis
• Any hypothesis h found to approximate the
target function c well over a sufficiently large
set of training examples D will also
approximate the target function well over
other unobserved examples.
Learning as Refining the Hypothesis
Space
• Concept learning is a task of searching a hypothesis
space of possible representations looking for the
representation(s) that best fits the data, given the
bias.
• The tendency to prefer one hypothesis over another
is called a bias.
• Given a representation, data, and a bias, the problem
of learning can be reduced to one of search.
Occam's Razor
⁻ A classical example of Inductive Bias

• the simplest consistent hypothesis about the


target function is actually the best
Some more Types of Inductive Bias
• Minimum description length: when forming a
hypothesis, attempt to minimize the length of the
description of the hypothesis.

• Maximum margin: when drawing a boundary


between two classes, attempt to maximize the width
of the boundary (SVM)
Important issues in Machine Learning
• What are good hypothesis spaces?
• Algorithms that work with the hypothesis spaces
• How to optimize accuracy over future data points
(overfitting)
• How can we have confidence in the result? (How much training data is needed – a statistical question)
• Are some learning problems computationally
intractable?
Generalization
• Components of generalization error
– Bias: how much does the average model over all training sets differ from the true model?
• Error due to inaccurate assumptions/simplifications
made by the model
– Variance: how much models estimated from
different training sets differ from each other
Underfitting and Overfitting
• Underfitting: model is too simple to represent all the relevant class characteristics
– High bias and low variance
– High training error and high test error
• Overfitting: model is too complex and fits irrelevant characteristics (noise) in the data
– Low bias and high variance
– Low training error and high test error
Foundations of Machine Learning
Module 1: Introduction
Part D: Evaluation and Cross validation

Sudeshna Sarkar
IIT Kharagpur
Experimental Evaluation of Learning
Algorithms
• Evaluating the performance of learning systems is
important because:
– Learning systems are usually designed to predict the class of future unlabeled data points.
• Typical choices for Performance Evaluation:
– Error
– Accuracy
– Precision/Recall
• Typical choices for Sampling Methods:
– Train/Test Sets
– K-Fold Cross-validation
Evaluating predictions
• Suppose we want to make a prediction of a
value for a target feature on example x:
– y is the observed value of the target feature on example x.
– f(x) is the predicted value of the target feature on example x.
– How is the error measured?
Measures of error
• Absolute error: (1/n) Σ_{i=1..n} |f(xᵢ) − yᵢ|
• Sum of squares error: (1/n) Σ_{i=1..n} (f(xᵢ) − yᵢ)²
• Number of misclassifications: (1/n) Σ_{i=1..n} δ(f(xᵢ), yᵢ)
• δ(f(x), y) is 1 if f(x) ≠ y, and 0 otherwise.
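A minimal sketch (mine, not from the slides) of these three error measures, assuming predictions and targets come as NumPy arrays:

import numpy as np

def error_measures(y_pred, y_true):
    """Return mean absolute error, mean squared error, and misclassification rate."""
    y_pred, y_true = np.asarray(y_pred), np.asarray(y_true)
    abs_err = np.mean(np.abs(y_pred - y_true))
    sq_err = np.mean((y_pred - y_true) ** 2)
    misclass = np.mean(y_pred != y_true)   # delta(f(x), y): 1 wherever the prediction differs
    return abs_err, sq_err, misclass

# Hypothetical example values.
print(error_measures([1, 0, 1, 1], [1, 1, 1, 0]))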


Confusion Matrix
                          True class
Hypothesized class        Pos        Neg
Yes                       TP         FP
No                        FN         TN
                          P = TP+FN  N = FP+TN
• Accuracy = (TP+TN)/(P+N)
• Precision = TP/(TP+FP)
• Recall / TP rate = TP/P
• FP Rate = FP/N
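A small helper (my own sketch, not from the slides) computing these four metrics from the confusion-matrix counts:

def confusion_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall (TP rate) and FP rate from confusion-matrix counts."""
    p, n = tp + fn, fp + tn          # actual positives and negatives
    accuracy = (tp + tn) / (p + n)
    precision = tp / (tp + fp)
    recall = tp / p                  # true-positive rate
    fp_rate = fp / n
    return accuracy, precision, recall, fp_rate

# Hypothetical counts.
print(confusion_metrics(tp=40, fp=10, fn=5, tn=45))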
Sample Error and True Error
• The sample error of hypothesis f with respect to target function c and data sample S is:
error_S(f) = (1/n) Σ_{x∈S} δ(f(x), c(x))
• The true error (denoted error_D(f)) of hypothesis f with respect to target function c and distribution D is the probability that f will misclassify an instance drawn at random according to D:
error_D(f) = Pr_{x∈D} [f(x) ≠ c(x)]
Why Errors
• Errors in learning are caused by:
– Limited representation (representation bias)
– Limited search (search bias)
– Limited data (variance)
– Limited features (noise)
Difficulties in evaluating hypotheses
with limited data
• Bias in the estimate: The sample error is a poor
estimator of true error
– ==> test the hypothesis on an independent test set
• We divide the examples into:
– Training examples that are used to train the learner
– Test examples that are used to evaluate the learner
• Variance in the estimate: The smaller the test set,
the greater the expected variance.
Validation set

Validation fails to use all the available data


k-fold cross-validation
1. Split the data into k equal subsets
2. Perform k rounds of learning; on each round
– 1/k of the data is held out as a test set and
– the remaining examples are used as training data.
3. Compute the average test set score of the k rounds
K-fold cross validation
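A bare-bones illustration of this procedure (my own sketch, not from the slides); model is assumed to be any estimator with scikit-learn-style fit and score methods:

import numpy as np

def k_fold_cv(model, X, y, k=5):
    """Split the data into k folds; train on k-1 folds, test on the held-out fold, average the scores."""
    X, y = np.asarray(X), np.asarray(y)
    folds = np.array_split(np.random.permutation(len(X)), k)
    scores = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model.fit(X[train_idx], y[train_idx])
        scores.append(model.score(X[test_idx], y[test_idx]))
    return np.mean(scores)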
Trade-off
• In machine learning, there is always a trade-
off between
– complex hypotheses that fit the training data well
– simpler hypotheses that may generalise better.
• As the amount of training data increases, the
generalization error decreases.
Foundations of Machine Learning
Module 2: Linear Regression and Decision Tree
Part A: Linear Regression

Sudeshna Sarkar
IIT Kharagpur
Regression
• In regression the output is continuous
• Many models could be used – Simplest is linear
regression
– Fit data with the best hyper-plane which "goes
through" the points

(Plot: y, the dependent variable (output), against x, the independent variable (input).)


A Simple Example: Fitting a Polynomial
• The green curve is the true
function (which is not a
polynomial)

• We may use a loss function that


measures the squared error in
the prediction of y(x) from x.

(from Bishop's book on Machine Learning)

Some fits to the data: which is best?
(from Bishop)
Types of Regression Models
• Simple (1 feature): Linear or Non-Linear
• Multiple (2+ features): Linear or Non-Linear
Linear regression
• Given an input x compute an output y
• For example:
– Predict height from age
– Predict house price from house area
– Predict distance from wall from sensors
Simple Linear Regression Equation
E(y) = β0 + β1 x
(Regression line with intercept β0 and slope β1.)
Linear Regression Model

• Relationship between variables is a linear function:
Y = β0 + β1 x1 + ε
(population Y-intercept β0, population slope β1, random error ε)
House Number | Y: Actual Selling Price | X: House Size (100s ft²)
1  | 89.5  | 20.0
2  | 79.9  | 14.8
3  | 83.1  | 20.5
4  | 56.9  | 12.5
5  | 66.6  | 18.0
6  | 82.5  | 14.3
7  | 126.3 | 27.5
8  | 79.3  | 16.5
9  | 119.9 | 24.3
10 | 87.6  | 20.2
11 | 112.6 | 22.0
12 | 120.8 | 19.0
13 | 78.5  | 12.3
14 | 74.3  | 14.0
15 | 74.8  | 16.7
Averages | 88.84 | 18.17
(Sample of 15 houses from the region.)
House price vs size
Linear Regression – Multiple Variables
Yi = b0 + b1X1 + b2X2 + … + bpXp + ε
• b0 is the intercept (i.e. the average value for Y if all the X's are zero); bj is the slope for the jth variable Xj
Regression Model
• Our model assumes that
E(Y | X = x) = b0 + b1x   (the "population line")
• Population line: Yi = b0 + b1X1 + b2X2 + … + bpXp + ε
• Least Squares line: Ŷi = b̂0 + b̂1X1 + b̂2X2 + … + b̂pXp
We use β̂0 through β̂p as guesses for b0 through bp, and Ŷi as a guess for Yi. The guesses will not be perfect.
Assumption
• The data may not form a perfect line.
• When we actually take a measurement (i.e., observe the data), we observe:
Yi = b0 + b1Xi + εi,
where εi is the random error associated with the ith observation.
Assumptions about the Error
• E(εi) = 0 for i = 1, 2, …, n.
• Var(εi) = σ², where σ is unknown.
• The errors εi are independent.
• The εi are normally distributed (with mean 0 and standard deviation σ).
The regression line
The least-squares regression line is the unique line such that
the sum of the squared vertical (y) distances between the
data points and the line is the smallest possible.
Criterion for choosing what line to draw: method of least squares
• The method of least squares chooses the line (β̂0 and β̂1) that makes the sum of squares of the residuals Σ εi² as small as possible.
• It minimizes
Σ_{i=1..n} [yi − (b0 + b1xi)]²
for the given observations (xi, yi).
How do we "learn" parameters
• For the 2-d problem
Y = b0 + b1X
• To find the values of the coefficients which minimize the objective function, we take the partial derivatives of the objective function (SSE) with respect to the coefficients, set these to 0, and solve:
b1 = (n Σxy − Σx Σy) / (n Σx² − (Σx)²)
b0 = (Σy − b1 Σx) / n
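A short sketch (mine, not from the slides) of these closed-form formulas; the example numbers reuse a few rows of the house-price table above:

import numpy as np

def simple_linear_regression(x, y):
    """Closed-form least-squares fit of y = b0 + b1*x."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    b1 = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / (n * np.sum(x ** 2) - np.sum(x) ** 2)
    b0 = (np.sum(y) - b1 * np.sum(x)) / n
    return b0, b1

# House-price style usage: price vs size (100s of sq. ft).
sizes = [20.0, 14.8, 20.5, 12.5, 18.0]
prices = [89.5, 79.9, 83.1, 56.9, 66.6]
print(simple_linear_regression(sizes, prices))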
Multiple Linear Regression
Y = β0 + β1X1 + β2X2 + … + βnXn
h(x) = Σ_{i=0..n} βi xi
• There is a closed form which requires matrix inversion, etc.
• There are iterative techniques to find the weights
– the delta rule (also called the LMS method), which updates the weights towards the objective of minimizing the SSE.
Linear Regression
h(x) = Σ_{i=0..n} βi xi
To learn the parameters θ (the βi):
• Make h(x) close to y for the available training examples.
• Define a cost function J(θ):
J(θ) = (1/2) Σ_{i=1..m} (h(x⁽ⁱ⁾) − y⁽ⁱ⁾)²
• Find θ that minimizes J(θ).
LMS Algorithm
• Start a search algorithm (e.g. gradient descent) with an initial guess of θ.
• Repeatedly update θ to make J(θ) smaller, until it converges to a minimum:
βj := βj − α ∂J(θ)/∂βj
• J is a convex quadratic function, so it has a single global minimum; gradient descent eventually converges to the global minimum.
• At each iteration this algorithm takes a step in the direction of steepest descent (the negative direction of the gradient).
LMS Update Rule
• If you have only one training example (x, y):
∂J(θ)/∂θj = ∂/∂θj [ (1/2)(h(x) − y)² ]
          = (h(x) − y) · ∂/∂θj ( Σ_{i=0..n} θi xi − y )
          = (h(x) − y) xj
• For a single training example, this gives the update rule:
βj := βj + α (y⁽ⁱ⁾ − h(x⁽ⁱ⁾)) xj⁽ⁱ⁾
m training examples
Repeat until convergence {
θj := θj + α Σ_{i=1..m} (y⁽ⁱ⁾ − h(x⁽ⁱ⁾)) xj⁽ⁱ⁾   (for every j)
}
Batch Gradient Descent: looks at every example on each step.
Stochastic gradient descent
• Repeatedly run through the training set.
• Whenever a training point is encountered, update the parameters according to the gradient of the error with respect to that training example only.
Repeat {
for i = 1 to m do
θj := θj + α (y⁽ⁱ⁾ − h(x⁽ⁱ⁾)) xj⁽ⁱ⁾   (for every j)
end for
} until convergence
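A compact sketch (mine, not from the slides) of both update schemes for linear regression; the learning rate, epoch count and toy data are arbitrary choices:

import numpy as np

def batch_gd(X, y, alpha=0.01, epochs=100):
    """Batch gradient descent for h(x) = theta . x (X already includes a column of 1s)."""
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        residual = y - X @ theta              # (y - h(x)) for every example
        theta += alpha * X.T @ residual       # sum of per-example gradients
    return theta

def stochastic_gd(X, y, alpha=0.01, epochs=100):
    """Stochastic (per-example) gradient descent for the same model."""
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            theta += alpha * (yi - xi @ theta) * xi
    return theta

# Hypothetical usage: fit y ≈ 1 + 2x.
x = np.linspace(0, 1, 50)
X = np.column_stack([np.ones_like(x), x])
y = 1 + 2 * x
print(batch_gd(X, y), stochastic_gd(X, y))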
Thank You
Delta Rule for Classification
(Plots: class label z ∈ {0, 1} against input x for two datasets.)
• What would happen in this adjusted case for the perceptron and the delta rule, and where would the decision point (i.e. the 0.5 crossing) be?
(CS 478 – Regression)
Delta Rule for Classification
• Leads to misclassifications even though the data is linearly separable
• For the delta rule the objective function is to minimize the regression line SSE, not to maximize classification accuracy
Delta Rule for Classification
• What would happen if we were doing a regression fit with a sigmoid/logistic curve rather than a line?
Delta Rule for Classification
• A sigmoid fits many decision cases quite well! This is basically what logistic regression does.
Foundations of Machine Learning
Module 2: Linear Regression and Decision Tree
Part B: Introduction to Decision Tree

Sudeshna Sarkar
IIT Kharagpur
Definition
• A decision tree is a classifier in the form of a
tree structure with two types of nodes:
– Decision node: Specifies a choice or test of
some attribute, with one branch for each
outcome
– Leaf node: Indicates classification of an example
Decision Tree Example 1
Whether to approve a loan:
Employed? — No → Credit Score? (High → Approve, Low → Reject); Yes → Income? (High → Approve, Low → Reject)
Decision Tree Example 3
Issues
• Given some training examples, what decision tree
should be generated?
• One proposal: prefer the smallest tree that is
consistent with the data (Bias)
– the tree with the least depth?
– the tree with the fewest nodes?
• Possible method:
– search the space of decision trees for the smallest decision
tree that fits the data
Example Data
Training Examples:
Action Author Thread Length Where
e1 skips known new long Home
e2 reads unknown new short Work
e3 skips unknown old long Work
e4 skips known old long home
e5 reads known new short home
e6 skips known old long work
New Examples:
e7 ??? known new short work
e8 ??? unknown new short work
Possible splits
Root: skips 9, reads 9.
Split on length: long → skips 7, reads 0; short → skips 2, reads 9.
Split on thread: new → skips 3, reads 7; old → skips 6, reads 2.
Two Example DTs
Decision Tree for PlayTennis
• Attributes and their values:
– Outlook: Sunny, Overcast, Rain
– Humidity: High, Normal
– Wind: Strong, Weak
– Temperature: Hot, Mild, Cool

• Target concept - Play Tennis: Yes, No


Decision Tree for PlayTennis
Outlook? — Sunny → Humidity? (High → No, Normal → Yes); Overcast → Yes; Rain → Wind? (Strong → No, Weak → Yes)
Decision Tree for PlayTennis
Outlook? — Sunny → Humidity? (High → No, Normal → Yes); …
• Each internal node tests an attribute
• Each branch corresponds to an attribute value
• Each leaf node assigns a classification
Decision Tree for PlayTennis
Query: Outlook=Sunny, Temperature=Hot, Humidity=High, Wind=Weak → PlayTennis = No
(Follow the tree: Sunny → Humidity? = High → No.)
Decision Tree
Decision trees represent disjunctions of conjunctions. The PlayTennis tree corresponds to:
(Outlook=Sunny ∧ Humidity=Normal) ∨ (Outlook=Overcast) ∨ (Outlook=Rain ∧ Wind=Weak)
Searching for a good tree
• How should you go about building a decision tree?
• The space of decision trees is too big for systematic
search.

• Stop and
– return the a value for the target feature or
– a distribution over target feature values

• Choose a test (e.g. an input feature) to split on.


– For each value of the test, build a subtree for those
examples with this value for the test.
Top-Down Induction of Decision Trees ID3

1. Which node to proceed with?


1. A ← the “best” decision attribute for the next node
2. Assign A as decision attribute for node
3. For each value of A create new descendant
4. Sort training examples to leaf node according to the
attribute value of the branch
5. If all training examples are perfectly classified (same
value of target attribute) stop, else iterate over new
leaf nodes. 2. When to stop?
Choices
• When to stop
– no more input features
– all examples are classified the same
– too few examples to make an informative split

• Which test to split on


– split gives smallest error.
– With multi-valued features
• split on all values or
• split values into half.
Foundations of Machine Learning
Module 2: Linear Regression and Decision Tree
Part C: Learning Decision Tree

Sudeshna Sarkar
IIT Kharagpur
Top-Down Induction of Decision Trees ID3

1. Which node to proceed with?


1. A ← the “best” decision attribute for the next node
2. Assign A as decision attribute for node
3. For each value of A create new descendant
4. Sort training examples to leaf node according to the
attribute value of the branch
5. If all training examples are perfectly classified (same
value of target attribute) stop, else iterate over new
leaf nodes. 2. When to stop?
Choices
• When to stop
– no more input features
– all examples are classified the same
– too few examples to make an informative split

• Which test to split on


– split gives smallest error.
– With multi-valued features
• split on all values or
• split values into half.
Which Attribute is ”best”?

[29+,35-] split by A1: True → [21+, 5-], False → [8+, 30-]
[29+,35-] split by A2: True → [18+, 33-], False → [11+, 2-]
Principled Criterion
• Selection of an attribute to test at each node -
choosing the most useful attribute for classifying
examples.
• information gain
– measures how well a given attribute separates the training
examples according to their target classification
– This measure is used to select among the candidate
attributes at each step while growing the tree
– Gain is measure of how much we can reduce
uncertainty (Value lies between 0,1)
Entropy
• A measure for
– uncertainty
– purity
– information content
• Information theory: optimal length code assigns (- log2p) bits
to message having probability p
• S is a sample of training examples
– p+ is the proportion of positive examples in S
– p- is the proportion of negative examples in S
• Entropy of S: average optimal number of bits to encode
information about certainty/uncertainty about S
Entropy(S) = p+(-log2p+) + p-(-log2p-) = -p+log2p+- p-log2p-
Entropy

• The entropy is 0 if the outcome


is ``certain”.
• The entropy is maximum if we
have no knowledge of the
system (or any outcome is
equally possible).

• S is a sample of training examples


• p+ is the proportion of positive examples
• p- is the proportion of negative examples
• Entropy measures the impurity of S
Entropy(S) = -p+log2p+- p-log2 p-
Information Gain
Gain(S,A): expected reduction in entropy due to partitioning S
on attribute A

Gain(S,A) = Entropy(S) − Σ_{v∈values(A)} (|Sv|/|S|) · Entropy(Sv)


Entropy([29+,35-]) = -29/64 log2 29/64 – 35/64 log2 35/64
= 0.99

[29+,35-] split by A1: True → [21+, 5-], False → [8+, 30-]
[29+,35-] split by A2: True → [18+, 33-], False → [11+, 2-]


Information Gain
Entropy([21+,5-]) = 0.71    Entropy([18+,33-]) = 0.94
Entropy([8+,30-]) = 0.74    Entropy([11+,2-]) = 0.62
Gain(S,A1) = Entropy(S) − (26/64)·Entropy([21+,5-]) − (38/64)·Entropy([8+,30-]) = 0.27
Gain(S,A2) = Entropy(S) − (51/64)·Entropy([18+,33-]) − (13/64)·Entropy([11+,2-]) = 0.12
Training Examples
Day Outlook Temp Humidity Wind Tennis?
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No
Selecting the Next Attribute
S = [9+,5-], E = 0.940
Split on Humidity: High → [3+,4-], E = 0.985; Normal → [6+,1-], E = 0.592
Gain(S, Humidity) = 0.940 − (7/14)·0.985 − (7/14)·0.592 = 0.151
Split on Wind: Weak → [6+,2-], E = 0.811; Strong → [3+,3-], E = 1.0
Gain(S, Wind) = 0.940 − (8/14)·0.811 − (6/14)·1.0 = 0.048
Humidity provides greater information gain than Wind, w.r.t. the target classification.
Selecting the Next Attribute
S = [9+,5-], E = 0.940
Split on Outlook: Sunny → [2+,3-], E = 0.971; Overcast → [4+,0-], E = 0.0; Rain → [3+,2-], E = 0.971
Gain(S, Outlook) = 0.940 − (5/14)·0.971 − (4/14)·0.0 − (5/14)·0.971 = 0.247
Selecting the Next Attribute
The information gain values for the 4 attributes are:
• Gain(S,Outlook) =0.247
• Gain(S,Humidity) =0.151
• Gain(S,Wind) =0.048
• Gain(S,Temperature) =0.029

where S denotes the collection of training examples

Note: 0·log₂0 = 0
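A small sketch (mine, not from the slides) of the entropy and information-gain computations above; the tiny Wind example at the end is hypothetical:

import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels, using the convention 0*log2(0) = 0."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values() if c > 0)

def information_gain(examples, labels, attribute):
    """Gain(S, A) = Entropy(S) - sum_v (|S_v|/|S|) * Entropy(S_v)."""
    n = len(labels)
    gain = entropy(labels)
    for value in set(e[attribute] for e in examples):
        subset = [l for e, l in zip(examples, labels) if e[attribute] == value]
        gain -= (len(subset) / n) * entropy(subset)
    return gain

# Hypothetical abbreviated usage.
examples = [{"Wind": "Weak"}, {"Wind": "Strong"}, {"Wind": "Weak"}, {"Wind": "Weak"}]
labels = ["Yes", "No", "Yes", "No"]
print(information_gain(examples, labels, "Wind"))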
ID3 Algorithm
[D1,…,D14] [9+,5-], split on Outlook:
Sunny → Ssunny = [D1,D2,D8,D9,D11] [2+,3-] (?)
Overcast → [D3,D7,D12,D13] [4+,0-] (Yes)
Rain → [D4,D5,D6,D10,D14] [3+,2-] (?)
Test for the Sunny node:
Gain(Ssunny, Humidity) = 0.970 − (3/5)·0.0 − (2/5)·0.0 = 0.970
Gain(Ssunny, Temp.) = 0.970 − (2/5)·0.0 − (2/5)·1.0 − (1/5)·0.0 = 0.570
Gain(Ssunny, Wind) = 0.970 − (2/5)·1.0 − (3/5)·0.918 = 0.019
ID3 Algorithm (resulting tree)
Outlook? — Sunny → Humidity? (High → No [D1,D2,D8]; Normal → Yes [D9,D11]); Overcast → Yes [D3,D7,D12,D13]; Rain → Wind? (Strong → No [D6,D14]; Weak → Yes [D4,D5,D10])
Splitting Rule: GINI Index
• GINI Index – a measure of node impurity
GINI(Node) = 1 − Σ_{c∈classes} [p(c)]²
GINI_split(A) = Σ_{v∈Values(A)} (|Sv|/|S|) · GINI(Nv)
Splitting Based on Continuous Attributes

(i) Binary split: Taxable Income > 80K? → Yes / No
(ii) Multi-way split: Taxable Income? → < 10K, [10K, 25K), [25K, 50K), [50K, 80K), > 80K


Continuous Attribute – Binary Split
• For a continuous attribute
– Partition the continuous values of attribute A into a discrete set of intervals
– Create a new boolean attribute Ac by choosing a threshold c:
Ac = true if A ≥ c, false otherwise
How to choose c?
• consider all possible splits and find the best cut
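A naive sketch (mine, not from the slides) of the "consider all possible splits" idea, scoring each candidate threshold c by the weighted Gini of the two resulting subsets (any impurity measure could be substituted); the gini helper repeats the one sketched earlier:

from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_threshold(values, labels):
    """Try a cut between each pair of adjacent sorted values; return the lowest weighted-Gini cut."""
    pairs = sorted(zip(values, labels))
    best_c, best_score = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                                   # identical values give no usable cut
        c = (pairs[i - 1][0] + pairs[i][0]) / 2.0      # midpoint candidate threshold
        left = [l for v, l in pairs if v < c]
        right = [l for v, l in pairs if v >= c]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best_score:
            best_c, best_score = c, score
    return best_c, best_score

# Hypothetical usage: taxable income values with a class label.
print(best_threshold([60, 70, 75, 85, 90, 95], ["No", "No", "No", "Yes", "Yes", "No"]))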
Practical Issues of Classification
• Underfitting and Overfitting

• Missing Values

• Costs of Classification
Hypothesis Space Search in Decision Trees
• Conduct a search of the space of decision trees which
can represent all possible discrete functions.

• Goal: to find the best decision tree

• Finding a minimal decision tree consistent with a set of data


is NP-hard.

• Perform a greedy heuristic search: hill climbing without


backtracking

• Statistics-based decisions using all data

Bias and Occam’s Razor
Prefer short hypotheses.
Argument in favor:
– Fewer short hypotheses than long hypotheses
– A short hypothesis that fits the data is unlikely to
be a coincidence
– A long hypothesis that fits the data might be a
coincidence

Foundations of Machine Learning
Module 2: Linear Regression and Decision Tree
Part D: Overfitting

Sudeshna Sarkar
IIT Kharagpur
Overfitting
• Learning a tree that classifies the training data
perfectly may not lead to the tree with the best
generalization performance.
– There may be noise in the training data
– May be based on insufficient data
• A hypothesis h is said to overfit the training data
if there is another hypothesis, h’, such that h has
smaller error than h’ on the training data but h
has larger error on the test data than h’.
Overfitting
• Learning a tree that classifies the training data perfectly may not
lead to the tree with the best generalization performance.
– There may be noise in the training data
– May be based on insufficient data
• A hypothesis h is said to overfit the training data if there is another
hypothesis, h’, such that h has smaller error than h’ on the training
data but h has larger error on the test data than h’.

On training

accuracy On testing

Complexity of tree
Underfitting and Overfitting (Example)
500 circular and 500 triangular data points.
Circular points: 0.5 ≤ sqrt(x1² + x2²) ≤ 1
Triangular points: sqrt(x1² + x2²) < 0.5 or sqrt(x1² + x2²) > 1
Underfitting and Overfitting
Overfitting

Underfitting: when model is too simple, both training and test errors are large
Overfitting due to Noise

Decision boundary is distorted by noise point


Overfitting due to Insufficient Examples

Lack of data points makes it difficult to predict correctly the class labels
of that region
Notes on Overfitting
• Overfitting results in decision trees that are more
complex than necessary

• Training error no longer provides a good estimate of


how well the tree will perform on previously unseen
records
Avoid Overfitting
• How can we avoid overfitting a decision tree?
– Prepruning: Stop growing when data split not statistically
significant
– Postpruning: Grow full tree then remove nodes

• Methods for evaluating subtrees to prune:


– Minimum description length (MDL):
Minimize: size(tree) + size(misclassifications(tree))
– Cross-validation

Pre-Pruning (Early Stopping)
• Evaluate splits before installing them:
– Don’t install splits that don’t look worthwhile
– when no worthwhile splits to install, done
Pre-Pruning (Early Stopping)
• Typical stopping conditions for a node:
– Stop if all instances belong to the same class
– Stop if all the attribute values are the same
• More restrictive conditions:
– Stop if number of instances is less than some user-specified
threshold
– Stop if the class distribution of instances is independent of the available features (e.g., using a χ² test)
– Stop if expanding the current node does not improve impurity
measures (e.g., Gini or information gain).
Reduced-error Pruning
• A post-pruning, cross validation approach
- Partition training data into “grow” set and “validation” set.
- Build a complete tree for the “grow” data
- Until accuracy on validation set decreases, do:
For each non-leaf node in the tree
Temporarily prune the tree below; replace it by majority vote
Test the accuracy of the hypothesis on the validation set
Permanently prune the node with the greatest increase
in accuracy on the validation test.
• Problem: Uses less data to construct the tree
• Sometimes done at the rules level

General Strategy: Overfit and Simplify


Reduced Error Pruning
Model Selection & Generalization
• Learning is an ill-posed problem; data is not sufficient
to find a unique solution
• The need for inductive bias, assumptions about H
• Generalization: How well a model performs on new
data
• Overfitting: H more complex than C or f
• Underfitting: H less complex than C or f

Triple Trade-Off
• There is a trade-off between three factors:
– Complexity of H, c(H)
– Training set size, N
– Generalization error, E, on new data
• As N increases, E decreases
• As c(H) increases, first E decreases and then E increases (overfitting)
• As c(H) increases, the training error decreases for some time and then stays constant (frequently at 0)
Notes on Overfitting
• overfitting happens when a model is capturing
idiosyncrasies of the data rather than generalities.
– Often caused by too many parameters relative to the
amount of training data.
– E.g. an order-N polynomial can intersect any N+1 data
points
Dealing with Overfitting
• Use more data
• Use a tuning set
• Regularization
• Be a Bayesian

Regularization
• In a linear regression model overfitting is
characterized by large weights.

Penalize large weights in Linear Regression
• Introduce a penalty term in the loss function.
Regularized Regression
1. L2-Regularization (Ridge Regression): add λ Σj βj² to the loss
2. L1-Regularization (Lasso): add λ Σj |βj| to the loss
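A small sketch (mine, not from the slides) of L2-regularized (ridge) regression via its closed form (XᵀX + λI)⁻¹Xᵀy; the regularization strength lam is an arbitrary choice and, for simplicity, every coefficient including the bias column is penalized:

import numpy as np

def ridge_regression(X, y, lam=1.0):
    """Closed-form L2-regularized least squares: beta = (X^T X + lam*I)^-1 X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

# Hypothetical usage on house-price style data (bias column of 1s, then size).
X = np.array([[1, 20.0], [1, 14.8], [1, 20.5], [1, 12.5]])
y = np.array([89.5, 79.9, 83.1, 56.9])
print(ridge_regression(X, y, lam=0.1))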
Foundations of Machine Learning
Module 3: Instance Based Learning and Feature
Selection
Part A: Instance Based Learning

Sudeshna Sarkar
IIT Kharagpur
Instance-Based Learning
• One way of solving tasks of approximating
discrete or real valued target functions
• Have training examples: (xn, f(xn)), n=1..N.
• Key idea:
– just store the training examples
– when a test example is given then find the closest
matches

Inductive Assumption

• Similar inputs map to similar outputs


– If not true => learning is impossible
– If true => learning reduces to defining “similar”

• Not all similarities created equal


– predicting a person’s weight may depend on
different attributes than predicting their IQ
Basic k-nearest neighbor classification

• Training method:
– Save the training examples
• At prediction time:
– Find the k training examples (x1,y1),…(xk,yk) that
are closest to the test example x
– Predict the most frequent class among those yi’s.

• Example:
[Link]

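A minimal sketch (mine, not from the slides) of this store-the-examples, majority-vote-of-the-k-closest procedure, using Euclidean distance:

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_test, k=3):
    """Predict the most frequent class among the k training points closest to x_test."""
    dists = np.linalg.norm(np.asarray(X_train) - np.asarray(x_test), axis=1)
    nearest = np.argsort(dists)[:k]
    return Counter(np.asarray(y_train)[nearest]).most_common(1)[0][0]

# Hypothetical 2-D example.
X_train = [[1, 1], [1, 2], [5, 5], [6, 5]]
y_train = ["+", "+", "o", "o"]
print(knn_predict(X_train, y_train, [1.5, 1.5], k=3))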
What is the decision boundary?
Voronoi diagram

Basic k-nearest neighbor classification

• Training method:
– Save the training examples
• At prediction time:
– Find the k training examples (x1,y1),…(xk,yk) that are closest
to the test example x
– Predict the most frequent class among those yi’s.

• Improvements:
– Weighting examples from the neighborhood
– Measuring “closeness”
– Finding “close” examples in a large training set quickly

k-Nearest Neighbor
Dist(c1, c2) = sqrt( Σ_{i=1..N} (attr_i(c1) − attr_i(c2))² )
k-NearestNeighbors = the k cases c_i with minimum Dist(c_i, c_test)
prediction_test = (1/k) Σ_{i=1..k} class_i   (or (1/k) Σ_{i=1..k} value_i)
• Averaging over k points is more reliable when:
– there is noise in the attributes
– there is noise in the class labels
– the classes partially overlap
(Scatter plot: overlapping + and o classes over attribute_1 and attribute_2.)
How to choose “k”

• Large k:
– less sensitive to noise (particularly class noise)
– better probability estimates for discrete classes
– larger training sets allow larger values of k
• Small k:
– captures fine structure of problem space better
– may be necessary with small training sets
• Balance must be struck between large and small k
• As training set approaches infinity, and k grows large, kNN
becomes Bayes optimal
From Hastie, Tibshirani, Friedman 2001 p418
From Hastie, Tibshirani, Friedman 2001 p418
From Hastie, Tibshirani, Friedman 2001 p419
Distance-Weighted kNN
• The tradeoff between small and large k can be difficult
– use a large k, but put more emphasis on nearer neighbors?
prediction_test = Σ_{i=1..k} w_i · class_i / Σ_{i=1..k} w_i   (or Σ_{i=1..k} w_i · value_i / Σ_{i=1..k} w_i)
with w_k = 1 / Dist(c_k, c_test)
Locally Weighted Averaging
• Let k = number of training points
• Let the weight fall off rapidly with distance:
prediction_test = Σ_{i=1..k} w_i · class_i / Σ_{i=1..k} w_i   (or the analogous weighted average of values)
with w_k = 1 / e^(KernelWidth · Dist(c_k, c_test))
• KernelWidth controls the size of the neighborhood that has a large effect on the value (analogous to k)
Locally Weighted Regression
• All algs so far are strict averagers: interpolate,
but can’t extrapolate
• Do weighted regression, centered at test
point, weight controlled by distance and
KernelWidth
• Local regressor can be linear, quadratic, n-th
degree polynomial, neural net, …
• Yields piecewise approximation to surface that
typically is more complex than local regressor
Euclidean Distance
D(c1, c2) = sqrt( Σ_{i=1..N} (attr_i(c1) − attr_i(c2))² )
• Gives all attributes equal weight?
– only if the scales of the attributes and of their differences are similar
– scale attributes to equal range or equal variance
• Assumes spherical classes
(Scatter plot: roughly spherical + and o classes over attribute_1 and attribute_2.)
Euclidean Distance?
(Scatter plots: elongated / elliptical + and o classes.)
• What if classes are not spherical?
• What if some attributes are more/less important than other attributes?
• What if some attributes have more/less noise in them than other attributes?
Weighted Euclidean Distance
D(c1, c2) = sqrt( Σ_{i=1..N} w_i · (attr_i(c1) − attr_i(c2))² )
• large weights => attribute is more important
• small weights => attribute is less important
• zero weights => attribute doesn't matter
• Weights allow kNN to be effective with axis-parallel elliptical classes
• Where do weights come from?
Curse of Dimensionality
• as the number of dimensions increases, distances between points become larger and more uniform
• if the number of relevant attributes is fixed, increasing the number of less relevant attributes may swamp the distance
• when there are more irrelevant than relevant dimensions, distance becomes less reliable
• solutions: larger k or KernelWidth, feature selection, feature weights, more complex distance functions
D(c1, c2) = sqrt( Σ_{i=1..relevant} (attr_i(c1) − attr_i(c2))² + Σ_{j=1..irrelevant} (attr_j(c1) − attr_j(c2))² )
K-NN and irrelevant features
(Plots of + and o examples: along the relevant feature alone the classes separate cleanly; adding an irrelevant feature dimension spreads the points, so the query point "?" picks up neighbors of the wrong class.)
Ways of rescaling for KNN
Normalized L1 distance; scaling by information gain (IG); modified value distance metric.
Ways of rescaling for KNN
Dot product; cosine distance.
TFIDF weights for text: for doc j, feature i: x_i = tf_i,j · idf_i, where tf_i,j is the number of occurrences of term i in doc j, and idf_i is computed from the number of docs in the corpus and the number of docs in the corpus that contain term i.
Combining distances to neighbors
Standard KNN:
ŷ = argmax_y C(y, Neighbors(x)),   where C(y, D′) = |{(x′, y′) ∈ D′ : y′ = y}|
Distance-weighted KNN: ŷ = argmax_y C(y, Neighbors(x)), with
C(y, D′) = Σ_{(x′,y′)∈D′ : y′=y} SIM(x, x′)
or C(y, D′) = 1 − Π_{(x′,y′)∈D′ : y′=y} (1 − SIM(x, x′))
where SIM(x, x′) = 1 − Δ(x, x′)
Advantages of Memory-Based Methods
• Lazy learning: don’t do any work until you know what you
want to predict (and from what variables!)
– never need to learn a global model
– many simple local models taken together can represent a more
complex global model
– better focussed learning
– handles missing values, time varying distributions, ...
• Very efficient cross-validation
• Intelligible learning method to many users
• Nearest neighbors support explanation and training
• Can use any distance metric: string-edit distance, …
Weaknesses of Memory-Based Methods

• Curse of Dimensionality:
– often works best with 25 or fewer dimensions
• Run-time cost scales with training set size
• Large training sets will not fit in memory
• Many MBL methods are strict averagers
• Sometimes doesn’t seem to perform as well as other methods
such as neural nets
• Predicted values for regression not continuous
Foundations of Machine Learning
Module 3: Instance Based Learning and
Feature Reduction

Part B: Feature Selection

Sudeshna Sarkar
IIT Kharagpur
Feature Reduction in ML
- The information about the target class is inherent
in the variables.
- Naïve view:
More features
=> More information
=> More discrimination power.
- In practice:
many reasons why this is not the case!
Curse of Dimensionality
• number of training examples is fixed
=> the classifier’s performance usually will
degrade for a large number of features!
Feature Reduction in ML
- Irrelevant and
- redundant features
- can confuse learners.

- Limited training data.


- Limited computational resources.
- Curse of dimensionality.
Feature Selection
Problem of selecting some subset of features, while
ignoring the rest

Feature Extraction
• Project the original xi , i =1,...,d dimensions to new
𝑘 < 𝑑 dimensions, zj , j =1,...,k

Criteria for selection/extraction:


either improve or maintain the classification
accuracy, simplify classifier complexity.
Feature Selection - Definition
• Given a set of features 𝐹 = {𝑥1 , … , 𝑥𝑛 }
the Feature Selection problem is
to find a subset 𝐹′ ⊆ 𝐹 that maximizes the learner's ability to classify patterns.
• Formally, 𝐹′ should maximize some scoring function.
Subset selection
• d initial features
• There are 2^d possible subsets
• Criteria to decide which subset is the best:
– the classifier based on these m features has the lowest probability of error of all such classifiers
• Can't go over all 2^d possibilities
• Need some heuristics
Feature Selection Steps

Feature selection is an
optimization problem.
o Step 1: Search the space
of possible feature
subsets.
o Step 2: Pick the subset
that is optimal or near-
optimal with respect to
some objective function.
Feature Selection Steps (cont’d)

Search strategies
– Optimum
– Heuristic
– Randomized

Evaluation strategies
- Filter methods
- Wrapper methods
Evaluating feature subset
• Supervised (wrapper method)
– Train using selected subset
– Estimate error on validation dataset

• Unsupervised (filter method)


– Look at input only
– Select the subset that has the most information
Evaluation Strategies
Filter Methods Wrapper Methods
Subset selection
• Select uncorrelated features
• Forward search
– Start from empty set of features
– Try each of remaining features
– Estimate classification/regression error for adding specific
feature
– Select feature that gives maximum improvement in
validation error
– Stop when no significant improvement
• Backward search
– Start with original set of size d
– Drop features with smallest impact on error
Feature selection
Univariate (looks at each feature independently of others)
– Pearson correlation coefficient
– F-score Univariate methods measure
– Chi-square some type of correlation
– Signal to noise ratio between two random variables
– mutual information • the label (yi) and a fixed feature
– Etc. (xij for fixed j)
• Rank features by importance
• Ranking cut-off is determined by user
Pearson correlation coefficient
• Measures the correlation between two variables
• Formula for Pearson correlation:
r = Σ_i (x_i − x̄)(y_i − ȳ) / sqrt( Σ_i (x_i − x̄)² · Σ_i (y_i − ȳ)² )
• The correlation r is between +1 and −1.
– +1 means perfect positive correlation
– −1 means perfect correlation in the other direction
(Scatter-plot illustrations from Wikipedia.)
Signal to noise ratio
• Difference in means divided by difference in
standard deviation between the two classes

S2N(X,Y) = (μX - μY)/(σX – σY)

• Large values indicate a strong correlation
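A rough sketch (mine, not from the slides) of univariate feature ranking, scoring each feature by the absolute Pearson correlation between that feature column and the label:

import numpy as np

def rank_features_by_correlation(X, y):
    """Return feature indices sorted by |Pearson correlation with y|, most relevant first."""
    X, y = np.asarray(X, float), np.asarray(y, float)
    scores = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])]
    return np.argsort(scores)[::-1], scores

# Hypothetical data: feature 0 tracks the label, feature 1 is noise.
X = np.array([[1.0, 0.3], [2.0, 0.1], [3.0, 0.4], [4.0, 0.2]])
y = np.array([0, 0, 1, 1])
print(rank_features_by_correlation(X, y))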


Multivariate feature selection
• Multivariate (considers all features simultaneously)
• Consider the vector w for any linear classifier.
• Classification of a point x is given by wTx+w0.
• Small entries of w will have little effect on the dot
product and therefore those features are less relevant.
• For example if w = (10, .01, -9) then features 0 and 2 are
contributing more to the dot product than feature 1.
– A ranking of features given by this w is 0, 2, 1.
Multivariate feature selection
• The w can be obtained from any linear classifier.
• A variant of this approach is called recursive feature
elimination:
– Compute w on all features
– Remove feature with smallest wi
– Recompute w on reduced data
– If stopping criterion not met then go to step 2
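A schematic sketch (mine, not from the slides) of recursive feature elimination; make_model is assumed to build any linear model exposing a scikit-learn-style weight vector coef_:

import numpy as np

def recursive_feature_elimination(make_model, X, y, n_keep):
    """Repeatedly fit a linear model and drop the feature with the smallest |weight|."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_keep:
        model = make_model()
        model.fit(X[:, remaining], y)
        weights = np.abs(np.ravel(model.coef_))   # magnitude of each surviving feature's weight
        remaining.pop(int(np.argmin(weights)))    # remove the least influential feature
    return remaining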
Feature Extraction
Foundations of Machine Learning
Module 3: Instance Based Learning and
Feature Reduction

Part C: Feature Extraction

Sudeshna Sarkar
IIT Kharagpur
Feature extraction - definition
• Given a set of features 𝐹 = {𝑥1 , … , 𝑥𝑁 }
the Feature Extraction ("Construction") problem is to map 𝐹 to some feature set 𝐹′′ that maximizes the learner's ability to classify patterns.
Feature Extraction

• Find a projection matrix w from N-dimensional to


M-dimensional vectors that keeps error low

𝒛 = 𝑤𝑇𝒙
PCA
• Assume that the N features are linear combinations of M < 𝑁 vectors:
𝑧𝑖 = 𝑤𝑖1 𝑥1 + ⋯ + 𝑤𝑖𝑁 𝑥𝑁
• What we expect from such a basis:
– Uncorrelated, or otherwise it can be reduced further
– Have large variance (i.e. show large variation), or otherwise bear no information
Geometric picture of principal components (PCs)
Algebraic definition of PCs
Given a sample of p observations on a vector of N variables x1, x2, …, xp ∈ ℝ^N,
define the first principal component of the sample by the linear transformation
z1j = a1ᵀ xj = Σ_{i=1..N} a_i1 x_ij,   j = 1, 2, …, p,
where the vectors a1 = (a11, a21, …, aN1) and xj = (x1j, x2j, …, xNj),
and a1 is chosen such that var[z1] is maximum.
PCA
• Choose directions such that a total variance of data
will be maximum
– Maximize Total Variance
• Choose directions that are orthogonal
– Minimize correlation
• Choose 𝑀 < 𝑁 orthogonal directions which
maximize total variance
PCA
• 𝑁-dimensional feature space
• N × 𝑁 symmetric covariance matrix estimated from
samples 𝐶𝑜𝑣 𝒙 = Σ
• Select 𝑀 largest eigenvalue of the covariance matrix
and associated 𝑀 eigenvectors
• The first eigenvector will be a direction with largest
variance
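A compact sketch (mine, not from the slides) of this recipe: estimate the covariance matrix, take its top-M eigenvectors, and project:

import numpy as np

def pca(X, M):
    """Project N-dimensional data onto the M directions of largest variance."""
    X = np.asarray(X, float)
    X_centered = X - X.mean(axis=0)
    cov = np.cov(X_centered, rowvar=False)            # N x N covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)            # eigenvalues in ascending order
    top = eigvecs[:, np.argsort(eigvals)[::-1][:M]]   # M eigenvectors with largest eigenvalues
    return X_centered @ top                           # projected data (z = W^T x per sample)

# Hypothetical usage: reduce 3-D points to 2-D.
X = np.random.randn(100, 3)
print(pca(X, M=2).shape)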
PCA for image compression
(Reconstructions using p = 1, 2, 4, 8, 16, 32, 64, 100 principal components, alongside the original image.)
Is PCA a good criterion for classification?

• Data variation
determines the
projection direction
• What’s missing?
– Class information
What is a good projection?
• Similarly, what is a good criterion?
– Separating different classes
(Illustration: one projection where the two classes overlap, another where the two classes are separated.)
What class information may be useful?
• Between-class distance
– Distance between the centroids
of different classes

Between-class distance
What class information may be useful?
• Between-class distance
– Distance between the centroids of
different classes
• Within-class distance
• Accumulated distance of an instance
to the centroid of its class

• Linear discriminant analysis (LDA) finds


most discriminant projection by
• maximizing between-class distance
• and minimizing within-class distance Within-class distance
Linear Discriminant Analysis
• Find a low-dimensional space such that when
𝒙 is projected, classes are well-separated
Means and Scatter after projection
Good Projection
• Means are as far away as possible
• Scatter is as small as possible
• Fisher Linear Discriminant:
J(w) = (m1 − m2)² / (s1² + s2²)
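A short sketch (mine, not from the slides) of the two-class Fisher discriminant direction w ∝ S_W⁻¹(m1 − m2), which maximizes the J(w) criterion above:

import numpy as np

def fisher_lda_direction(X1, X2):
    """Two-class Fisher LDA: w proportional to S_W^-1 (m1 - m2)."""
    X1, X2 = np.asarray(X1, float), np.asarray(X2, float)
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    # Within-class scatter: sum of the two class scatter matrices.
    S_w = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
    w = np.linalg.solve(S_w, m1 - m2)
    return w / np.linalg.norm(w)

# Hypothetical usage: project both classes onto the discriminant direction.
X1 = np.random.randn(20, 2) + [2, 2]
X2 = np.random.randn(20, 2) - [2, 2]
w = fisher_lda_direction(X1, X2)
print(X1 @ w, X2 @ w)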
Multiple Classes
• For 𝑐 classes, compute 𝑐 − 1 discriminants, project
N-dimensional features into 𝑐 − 1 space.
Foundations of Machine Learning
Module 3: Instance Based Learning and
Feature Reduction

Part C: Feature Extraction

Sudeshna Sarkar
IIT Kharagpur
Feature extraction - definition
• Given a set of features 𝐹 = {𝑥1 , … , 𝑥𝑁 }
the Feature Extraction(“Construction”) problem is
is to map 𝐹 to some feature set 𝐹′′ that maximizes
the learner’s ability to classify patterns
Feature Extraction
• Find a projection matrix w from N-dimensional to M-
dimensional vectors that keeps error low
• Assume that N features are linear combination of M < 𝑁
vectors
𝑧𝑖 = 𝑤𝑖1 𝑥𝑖1 + ⋯ + 𝑤𝑖𝑑 𝑥𝑖𝑁
𝒛 = 𝑤𝑇𝒙

• What we expect from such basis


– Uncorrelated, cannot be reduced further
– Have large variance or otherwise bear no information
Geometric picture of principal components (PCs)
Geometric picture of principal components (PCs)
Geometric picture of principal components (PCs)
Algebraic definition of PCs
Given a sample of p observations on a vector of N variables

x , x ,, x  
1 2 p
N

define the first principal component of the sample


by the linear transformation
N
z1  w x j   wi1 xij ,
T
1 j  1,2, , p.
i 1

where the vector w1  ( w11 , w21 ,, wN 1 )


x j  ( x1 j , x2 j ,, x Nj )
is chosen such that var[ z1 ] is maximum.
PCA
PCA
• Choose directions such that a total variance of data
will be maximum
1. Maximize Total Variance
• Choose directions that are orthogonal
2. Minimize correlation

• Choose 𝑀 < 𝑁 orthogonal directions which


maximize total variance
PCA
• 𝑁-dimensional feature space
• N × 𝑁 symmetric covariance matrix estimated from
samples 𝐶𝑜𝑣 𝒙 = Σ
• Select 𝑀 largest eigenvalue of the covariance matrix
and associated 𝑀 eigenvectors
• The first eigenvector will be a direction with largest
variance
PCA for image compression

p=1 p=2 p=4 p=8

Original
p=16 p=32 p=64 p=100 Image
Is PCA a good criterion for classification?

• Data variation
determines the
projection direction
• What’s missing?
– Class information
What is a good projection?
Two classes
• Similarly, what is a
overlap
good criterion?
– Separating different
classes

Two classes are


separated
What class information may be useful?
• Between-class distance
– Distance between the centroids
of different classes

Between-class distance
What class information may be useful?
• Between-class distance
– Distance between the centroids of
different classes
• Within-class distance
• Accumulated distance of an instance
to the centroid of its class

• Linear discriminant analysis (LDA) finds


most discriminant projection by
• maximizing between-class distance
• and minimizing within-class distance Within-class distance
Linear Discriminant Analysis
• Find a low-dimensional space such that when
𝒙 is projected, classes are well-separated
Means and Scatter after projection
Good Projection
• Means are as far apart as possible
• Scatter is as small as possible
• Fisher Linear Discriminant:
    J(w) = (m1 − m2)² / (s1² + s2²)
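Below is a minimal two-class Fisher LDA sketch in NumPy. It uses the standard closed-form maximizer of this criterion, w ∝ S_W⁻¹(m1 − m2); the data and names are illustrative:

    import numpy as np

    def fisher_lda(X1, X2):
        """Two-class Fisher discriminant: returns the projection direction w."""
        m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
        # Within-class scatter S_W = S1 + S2
        S1 = (X1 - m1).T @ (X1 - m1)
        S2 = (X2 - m2).T @ (X2 - m2)
        S_W = S1 + S2
        # Maximizing J(w) gives w proportional to S_W^{-1} (m1 - m2)
        w = np.linalg.solve(S_W, m1 - m2)
        return w / np.linalg.norm(w)

    # Usage: project both classes onto w and compare the 1-D distributions
    X1 = np.random.randn(50, 3) + [2, 0, 0]
    X2 = np.random.randn(50, 3) - [2, 0, 0]
    w = fisher_lda(X1, X2)
    z1, z2 = X1 @ w, X2 @ w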
Thank You
Foundations of Machine Learning
Module 3: Instance based Learning
and Feature Reduction
Part D: Collaborative Filtering
Sudeshna Sarkar
IIT Kharagpur
Recommender Systems
• Item Prediction: predict a ranked list of items that a user is likely to buy or use.
• Rating Prediction: predict the rating score that a user is likely to give to an item
  that s/he has not seen or used before, e.g.,
  – the rating on an unseen movie, book, webpage, etc. In this case, the utility of
    item s to user u is the rating given to s by u.
The Recommendation Problem
• We have a set of users U and a set of items S to be
recommended to the users.
• Let p be a utility function that measures the usefulness of item s (∈ S) to user u (∈ U), i.e.,
  – p : U × S → R, where R is a totally ordered set (e.g., non-negative integers or real
    numbers in a range)
• Objective
  – Learn p based on the past data
  – Use p to predict the utility value of each item s (∈ S) to each user u (∈ U)
Recommender Systems
• Content based :
– recommend items similar to the ones the
user preferred in the past
• Collaborative filtering:
– Look at what similar users liked
– Similar users = Similar likes and dislikes
Collaborative Filtering
• Present each user with a vector of ratings
• Two types:
– Yes / No
– Explicit Ratings
• Predict Rating by User-based Nearest
Neighbour
Collaborative Filtering for Rating
Prediction
• User-based Nearest Neighbour
– Neighbour = similar users
– Generate a prediction for an item i by analyzing
ratings for i from users in u’s neighbourhood
Neighborhood formation phase
• Let the record (or profile) of the target user be u
(represented as a vector), and the record of another
user be v (v  T).
• The similarity between the target user, u, and a
neighbor, v, can be calculated using the Pearson’s
correlation coefficient:

    sim(u, v) = Σ_{i∈C} (r_{u,i} − r̄_u)(r_{v,i} − r̄_v)
                / [ √(Σ_{i∈C} (r_{u,i} − r̄_u)²) · √(Σ_{i∈C} (r_{v,i} − r̄_v)²) ]

where C is the set of items rated by both u and v.
Recommendation Phase
• Use the following formula to compute the rating
prediction of item i for target user u

    p(u, i) = r̄_u + Σ_{v∈V} sim(u, v) · (r_{v,i} − r̄_v) / Σ_{v∈V} sim(u, v)

where V is the set of the k most similar users, and r_{v,i} is the rating of user v given to item i.
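A minimal sketch of user-based CF with Pearson similarity, assuming a small dense ratings matrix with NaN for missing entries (data and names are illustrative; the denominator uses |sim| for numerical stability):

    import numpy as np

    def pearson_sim(ru, rv):
        """Pearson correlation over items co-rated by users u and v."""
        common = ~np.isnan(ru) & ~np.isnan(rv)
        if common.sum() < 2:
            return 0.0
        a, b = ru[common] - np.nanmean(ru), rv[common] - np.nanmean(rv)
        denom = np.sqrt((a ** 2).sum()) * np.sqrt((b ** 2).sum())
        return float(a @ b / denom) if denom > 0 else 0.0

    def predict(R, u, i, k=2):
        """Predict user u's rating for item i from the k most similar users."""
        sims = np.array([pearson_sim(R[u], R[v])
                         if v != u and not np.isnan(R[v, i]) else -np.inf
                         for v in range(R.shape[0])])
        V = np.argsort(sims)[::-1][:k]                   # k nearest neighbours
        num = sum(sims[v] * (R[v, i] - np.nanmean(R[v])) for v in V)
        den = sum(abs(sims[v]) for v in V)
        return np.nanmean(R[u]) + (num / den if den > 0 else 0.0)

    # Ratings matrix: rows = users, columns = items, NaN = unrated
    R = np.array([[5, 3, np.nan, 1],
                  [4, np.nan, 4, 1],
                  [1, 1, 5, 4],
                  [1, np.nan, 4, 4.]])
    print(predict(R, u=0, i=2))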
Issue with the user-based kNN CF
• The problem with the user-based formulation
of collaborative filtering is the lack of
scalability:
– it requires the real-time comparison of the target
user to all user records in order to generate
predictions.
• A variation of this approach that remedies this
problem is called item-based CF.



Item-based CF
• The item-based approach works by comparing
items based on their pattern of ratings across
users. The similarity of items i and j is
computed as follows:

    sim(i, j) = Σ_{u∈U} (r_{u,i} − r̄_u)(r_{u,j} − r̄_u)
                / [ √(Σ_{u∈U} (r_{u,i} − r̄_u)²) · √(Σ_{u∈U} (r_{u,j} − r̄_u)²) ]


Recommendation phase
• After computing the similarity between items
we select a set of k most similar items to the
target item and generate a predicted value of
user u’s rating

    p(u, i) = Σ_{j∈J} r_{u,j} · sim(i, j) / Σ_{j∈J} sim(i, j)

where J is the set of the k most similar items.
Thank You
Foundations of Machine Learning

Module 4:
Part A: Probability Basics

Sudeshna Sarkar
IIT Kharagpur
• Probability is the study of randomness and
uncertainty.
• A random experiment is a process whose
outcome is uncertain.
Examples:
– Tossing a coin once or several times
– Tossing a die
– Tossing a coin until one gets Heads
– ...
Events and Sample Spaces
Sample Space
  The sample space is the set of all possible outcomes.
Simple Events
  The individual outcomes are called simple events.
Event
  An event is any collection of one or more simple events.
Sample Space
• Sample space Ω : the set of all the possible
outcomes of the experiment
– If the experiment is a roll of a six-sided die, then the
natural sample space is {1, 2, 3, 4, 5, 6}
– Suppose the experiment consists of tossing a coin
three times.
    Ω = {hhh, hht, hth, htt, thh, tht, tth, ttt}
  – If the experiment is the number of customers that arrive at a service desk during a
    fixed time period, the sample space should be the set of nonnegative integers:
    Ω = Z₊ = {0, 1, 2, 3, …}
Events
• Events are subsets of the sample space
  o A = {the outcome of the die roll is even} = {2, 4, 6}
  o B = {exactly two tosses come out tails} = {htt, tht, tth}
  o C = {at least two heads} = {hhh, hht, hth, thh}
Probability
• A Probability is a number assigned to each
event in the sample space.
• Axioms of Probability:
  – For any event A, 0 ≤ P(A) ≤ 1.
  – P(Ω) = 1 and P(∅) = 0
  – If A1, A2, …, An is a partition of A, then
    P(A) = P(A1) + P(A2) + … + P(An)
Properties of Probability
• For any event A, P(Aᶜ) = 1 − P(A).
• If A ⊆ B, then P(A) ≤ P(B).
• For any two events A and B,
    P(A ∪ B) = P(A) + P(B) − P(A ∩ B).
• For three events A, B, and C,
    P(A ∪ B ∪ C) = P(A) + P(B) + P(C)
                   − P(A ∩ B) − P(A ∩ C) − P(B ∩ C)
                   + P(A ∩ B ∩ C)
Intuitive Development (agrees with axioms)
• Intuitively, the probability of an event a could be defined as:
    P(a) = lim_{n→∞} N(a) / n
  where N(a) is the number of times event a happens in n trials.
Random Variable
• A random variable is a function defined on the
sample space Ω
– maps the outcome of a random event into real
scalar values


X(w)
w
Discrete Random Variables
• Random variables (RVs) which may take on only a
countable number of distinct values
– e.g., the sum of the values of two dice

• X is a RV with arity k if it can take on exactly one


value out of k values,
– e.g., the possible values that X can take on are
2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
Probability of Discrete RV
• Probability mass function (pmf): P(X = xi)
• Simple facts about the pmf
  – Σi P(X = xi) = 1
  – P(X = xi ∧ X = xj) = 0 if i ≠ j
  – P(X = xi ∨ X = xj) = P(X = xi) + P(X = xj) if i ≠ j
  – P(X = x1 ∨ X = x2 ∨ … ∨ X = xk) = 1
Common Distributions
• Uniform X ~ U[1, ⋯, N]
  – X takes values 1, 2, …, N
  – P(X = i) = 1/N
  – E.g. picking balls of different colors from a box
• Binomial X ~ Bin(n, p)
  – X takes values 0, 1, …, n
  – P(X = i) = (n choose i) pⁱ (1 − p)^(n−i)
  – E.g. coin flips
Joint Distribution
• Given two discrete RVs X and Y, their joint
distribution is the distribution of X and Y together
– e.g.
you and your friend each toss a coin 10 times
P(You get 5 heads AND you friend get 7 heads)
• Σx Σy P(X = x ∧ Y = y) = 1
  – e.g. Σ_{i=0}^{10} Σ_{j=0}^{10} P(You get i heads AND your friend gets j heads) = 1
Conditional Probability
 
• P(X = x | Y = y) is the probability of X = x, given the occurrence of Y = y
  – E.g. you get 0 heads, given that your friend gets 3 heads
• P(X = x | Y = y) = P(X = x ∧ Y = y) / P(Y = y)
Law of Total Probability
• Given two discrete RVs X and Y, which take values in {x1, …, xm} and {y1, …, yn}, we have

    P(X = xi) = Σj P(X = xi ∧ Y = yj)
              = Σj P(X = xi | Y = yj) P(Y = yj)
Marginalization

    P(X = xi) = Σj P(X = xi ∧ Y = yj)            (marginal = sum of joint probabilities)
              = Σj P(X = xi | Y = yj) P(Y = yj)  (conditional × marginal)
Bayes Rule
• X and Y are discrete RVs…

P X  x  Y  y
P X  x Y  y 
P Y  y

 
P Y  y j X  xi P  X  xi 
 
P X  xi Y  y j 
 P Y  y
k j 
X  xk P  X  xk 
Independent RVs

• X and Y are independent means that X = x does not affect the probability of Y = y
• Definition: X and Y are independent iff
  – P(XY) = P(X) P(Y)
  – P(X = x ∧ Y = y) = P(X = x) P(Y = y)
More on Independence

P X  x  Y  y   P X  x P Y  y 

P X  x Y  y  P X  x P Y  y X  x  P Y  y 

• E.g. no matter how many heads you get, your


friend will not be affected, and vice versa
Conditionally Independent RVs
• Intuition: X and Y are conditionally
independent given Z means that once Z is
known, the value of X does not add any
additional information about Y
• Definition: X and Y are conditionally
independent given Z iff

P X  x  Y  y Z  z  P X  x Z  z  P Y  y Z  z 
More on Conditional Independence

P X  x  Y  y Z  z  P X  x Z  z  P Y  y Z  z 

P  X  x Y  y, Z  z   P  X  x Z  z 

P  Y  y X  x, Z  z   P  Y  y Z  z 
Continuous Random Variables
• What if X is continuous?
• Probability density function (pdf) instead of
probability mass function (pmf)
• A pdf is any function 𝑓(𝑥) that describes the
probability density in terms of the input
variable x.
PDF
• Properties of a pdf
  – f(x) ≥ 0, ∀x
  – ∫ f(x) dx = 1
  – f(x) ≤ 1 ???  (not required: a density value may exceed 1)
• Actual probability can be obtained by taking the integral of the pdf
  – E.g. the probability of X being between 0 and 1 is
    P(0 ≤ X ≤ 1) = ∫₀¹ f(x) dx
Cumulative Distribution Function
• F_X(v) = P(X ≤ v)
• Discrete RVs
  – F_X(v) = Σ_{vi ≤ v} P(X = vi)
• Continuous RVs
  – F_X(v) = ∫_{−∞}^{v} f(x) dx
  – (d/dx) F_X(x) = f(x)
Common Distributions

• Normal X ~ N(μ, σ²)
  – f(x) = (1 / (√(2π) σ)) exp( −(x − μ)² / (2σ²) ),   −∞ < x < ∞
  – E.g. the height of the entire population

[Figure: the bell-shaped density f(x) of the standard normal, plotted for x from −5 to 5]
Multivariate Normal
• Generalization to higher dimensions of the
one-dimensional normal
    f_X(x1, …, xd) = (1 / ((2π)^{d/2} |Σ|^{1/2})) exp( −½ (x − μ)ᵀ Σ⁻¹ (x − μ) )

where μ is the mean vector and Σ the covariance matrix.
Mean and Variance
• Mean (Expectation): μ = E[X]
  – Discrete RVs:    E[X] = Σ_{vi} vi P(X = vi)
  – Continuous RVs:  E[X] = ∫ x f(x) dx
• Variance: V[X] = E[(X − μ)²]
  – Discrete RVs:    V[X] = Σ_{vi} (vi − μ)² P(X = vi)
  – Continuous RVs:  V[X] = ∫ (x − μ)² f(x) dx
Mean Estimation from Samples
• Given a set of N samples from a distribution,
we can estimate the mean of the distribution
by:
    μ̂ = (1/N) Σ_{i=1}^{N} xi
Variance Estimation from Samples
• Given a set of N samples from a distribution,
we can estimate the variance of the
distribution by:
    σ̂² = (1/N) Σ_{i=1}^{N} (xi − μ̂)²
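A tiny NumPy check of these estimators (the sample data is illustrative):

    import numpy as np

    samples = np.random.normal(loc=2.0, scale=3.0, size=10_000)

    mean_hat = samples.sum() / len(samples)                      # (1/N) * sum of x_i
    var_hat = ((samples - mean_hat) ** 2).sum() / len(samples)   # (1/N) * sum of (x_i - mean)^2

    print(mean_hat, var_hat)   # approximately 2.0 and 9.0 for large N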
Thank You
Foundations of Machine Learning

Module 4:
Part B: Bayesian Learning

Sudeshna Sarkar
IIT Kharagpur
Probability for Learning
• Probability for classification and modeling
concepts.
• Bayesian probability
– Notion of probability interpreted as partial belief
• Bayesian Estimation
– Calculate the validity of a proposition
• Based on prior estimate of its probability
• and New relevant evidence
Bayes Theorem
• Goal: To determine the most probable hypothesis,
given the data D plus any initial knowledge about the
prior probabilities of the various hypotheses in H.
Bayes Theorem

Bayes Rule:   P(h | D) = P(D | h) P(h) / P(D)
• P(h) = prior probability of hypothesis h
• P(D) = prior probability of training data D
• P(h|D) = probability of h given D (posterior density )
• P(D|h) = probability of D given h (likelihood of D given h)
An Example
Does patient have cancer or not?
A patient takes a lab test and the result comes back positive. The test
returns a correct positive result in only 98% of the cases in which the
disease is actually present, and a correct negative result in only 97% of
the cases in which the disease is not present. Furthermore, .008 of the
entire population have this cancer.

    P(cancer) = 0.008,       P(¬cancer) = 0.992
    P(+ | cancer) = 0.98,    P(− | cancer) = 0.02
    P(+ | ¬cancer) = 0.03,   P(− | ¬cancer) = 0.97

    P(cancer | +)  = P(+ | cancer) P(cancer) / P(+)
    P(¬cancer | +) = P(+ | ¬cancer) P(¬cancer) / P(+)
Maximum A Posteriori (MAP) Hypothesis
    P(h | D) = P(D | h) P(h) / P(D)

The goal of Bayesian learning: the most probable hypothesis given the training data
(the Maximum A Posteriori hypothesis)

    h_MAP = argmax_{h∈H} P(h | D)
          = argmax_{h∈H} P(D | h) P(h) / P(D)
          = argmax_{h∈H} P(D | h) P(h)
Maximum Likelihood (ML) Hypothesis
    h_MAP = argmax_{h∈H} P(h | D)
          = argmax_{h∈H} P(D | h) P(h) / P(D)
          = argmax_{h∈H} P(D | h) P(h)

• If every hypothesis in H is equally probable a priori,


we only need to consider the likelihood of the data D
given h, P(D|h). Then, hMAP becomes the Maximum
Likelihood,
    h_ML = argmax_{h∈H} P(D | h)
MAP Learner
For each hypothesis h in H, calculate the posterior probability
    P(h | D) = P(D | h) P(h) / P(D)
Output the hypothesis h_MAP with the highest posterior probability:
    h_MAP = argmax_{h∈H} P(h | D)

Comments:
  Computationally intensive
  Provides a standard for judging the performance of learning algorithms
  Choosing P(h) and P(D|h) reflects our prior knowledge about the learning task
Maximum likelihood and least-squared error
• Learn a Real-Valued Function:
• Consider any real-valued target function f.
• Training examples (xi, di) are assumed to have Normally distributed noise ei with zero
  mean and variance σ², added to the true target value f(xi), i.e., di ~ N(f(xi), σ²).
  Assume that ei is drawn independently for each xi.
Compute ML Hypo

    h_ML = argmax_{h∈H} p(D | h)
         = argmax_{h∈H} Π_{i=1}^{m} (1 / √(2πσ²)) exp( −(di − h(xi))² / (2σ²) )
         = argmax_{h∈H} Σ_{i=1}^{m} [ −½ ln(2πσ²) − (di − h(xi))² / (2σ²) ]
         = argmin_{h∈H} Σ_{i=1}^{m} (di − h(xi))²
Bayes Optimal Classifier
Question: Given new instance x, what is its most probable classification?
• ℎ𝑀𝐴𝑃 (𝑥) is not the most probable classification!
Example: Let P(h1|D) = .4, P(h2|D) = .3, P(h3 |D) =.3
Given new data x, we have h1(x)=+, h2(x) = -, h3(x) = -
What is the most probable classification of x ?
Bayes optimal classification:

    argmax_{vj ∈ V} Σ_{hi ∈ H} P(vj | hi) P(hi | D)

where V is the set of all the values a classification can take and vj is one possible such
classification.

Example:
    P(h1|D) = .4,  P(−|h1) = 0,  P(+|h1) = 1
    P(h2|D) = .3,  P(−|h2) = 1,  P(+|h2) = 0
    P(h3|D) = .3,  P(−|h3) = 1,  P(+|h3) = 0

    Σ_{hi∈H} P(+ | hi) P(hi | D) = .4
    Σ_{hi∈H} P(− | hi) P(hi | D) = .6

so the Bayes optimal classification of x is −, even though h_MAP(x) = +.
Why “Optimal”?
• Optimal in the sense that no other classifier
using the same H and prior knowledge can
outperform it on average

12
Gibbs Algorithm
• Bayes optimal classifier is quite computationally
expensive, if H contains a large number of
hypotheses.
• An alternative, less optimal classifier Gibbs algorithm,
defined as follows:
  1. Choose a hypothesis h randomly, according to the posterior probability
     distribution P(h|D) over H.
2. Use it to classify new instance

13
Error for Gibbs Algorithm
• Surprising fact: Assume the expected value is
taken over target concepts drawn at random,
according to the prior probability distribution
assumed by the learner, then (Haussler et al.
1994)

    E_f[ error_{X,f}( GibbsClassifier ) ] ≤ 2 · E_f[ error_{X,f}( BayesOptimal ) ]

where f denotes a target function and X denotes the instance space.
Thank You
Foundations of Machine Learning

Module 4:
Part C: Naïve Bayes

Sudeshna Sarkar
IIT Kharagpur
Bayes Theorem
    P(h | D) = P(D | h) P(h) / P(D)
Naïve Bayes
• Bayes classification
    P(Y | X) ∝ P(X | Y) P(Y) = P(X1, …, Xn | Y) P(Y)
Difficulty: learning the joint probability P(X1, …, Xn | Y)
• Naïve Bayes classification
  Assume all input features are conditionally independent given Y:
    P(X1, X2, …, Xn | Y) = P(X1 | X2, …, Xn, Y) P(X2, …, Xn | Y)
                         = P(X1 | Y) P(X2, …, Xn | Y)
                         = P(X1 | Y) P(X2 | Y) ⋯ P(Xn | Y)
Naïve Bayes
Bayes rule:
    P(Y = yk | X1, …, Xn) = P(Y = yk) P(X1, …, Xn | Y = yk) / Σj P(Y = yj) P(X1, …, Xn | Y = yj)

Assuming conditional independence among the Xi's:
    P(Y = yk | X1, …, Xn) ∝ P(Y = yk) Πi P(Xi | Y = yk)

So, the classification rule for Xnew = <X1, …, Xn> is:
    Y_new ← argmax_{yk} P(Y = yk) Πi P(Xi_new | Y = yk)
Naïve Bayes Algorithm – discrete Xi

• Train Naïve Bayes (examples)
    for each* value yk
      estimate P(Y = yk)
    for each* value xij of each attribute Xi
      estimate P(Xi = xij | Y = yk)

• Classify (Xnew)
    Y_new ← argmax_{yk} P(Y = yk) Πi P(Xi_new | Y = yk)

* probabilities must sum to 1, so we only need to estimate n − 1 of the n parameters
Estimating Parameters: Y, Xi discrete-valued

Maximum likelihood estimates (MLEs):

    P̂(Y = yk) = #D{Y = yk} / |D|
    P̂(Xi = xij | Y = yk) = #D{Xi = xij ∧ Y = yk} / #D{Y = yk}

where #D{condition} is the number of items in set D for which the condition holds.
Example
• Example: Play Tennis

7
Example
Learning Phase
  Outlook    Play=Yes  Play=No        Temperature  Play=Yes  Play=No
  Sunny        2/9       3/5          Hot            2/9       2/5
  Overcast     4/9       0/5          Mild           4/9       2/5
  Rain         3/9       2/5          Cool           3/9       1/5

  Humidity   Play=Yes  Play=No        Wind         Play=Yes  Play=No
  High         3/9       4/5          Strong         3/9       3/5
  Normal       6/9       1/5          Weak           6/9       2/5

  P(Play=Yes) = 9/14      P(Play=No) = 5/14
Example
Test Phase
– Given a new instance, predict its label
x’=(Outlook=Sunny, Temperature=Cool, Humidity=High, Wind=Strong)
– Look up the tables obtained in the learning phase:
    P(Outlook=Sunny | Play=No) = 3/5        P(Outlook=Sunny | Play=Yes) = 2/9
    P(Temperature=Cool | Play=No) = 1/5     P(Temperature=Cool | Play=Yes) = 3/9
    P(Humidity=High | Play=No) = 4/5        P(Humidity=High | Play=Yes) = 3/9
    P(Wind=Strong | Play=No) = 3/5          P(Wind=Strong | Play=Yes) = 3/9
    P(Play=No) = 5/14                       P(Play=Yes) = 9/14
– Decision making with the MAP rule:
    P(Yes | x') ∝ [P(Sunny|Yes) P(Cool|Yes) P(High|Yes) P(Strong|Yes)] P(Play=Yes) = 0.0053
    P(No | x')  ∝ [P(Sunny|No) P(Cool|No) P(High|No) P(Strong|No)] P(Play=No) = 0.0206
Given the fact P(Yes|x’) < P(No|x’), we label x’ to be “No”.
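The same calculation can be scripted; below is a minimal sketch that hard-codes the tables above (a toy illustration, not a general implementation):

    # Conditional probability tables from the learning phase
    p_yes, p_no = 9/14, 5/14
    cond = {
        "Yes": {"Outlook=Sunny": 2/9, "Temp=Cool": 3/9, "Humidity=High": 3/9, "Wind=Strong": 3/9},
        "No":  {"Outlook=Sunny": 3/5, "Temp=Cool": 1/5, "Humidity=High": 4/5, "Wind=Strong": 3/5},
    }
    x_new = ["Outlook=Sunny", "Temp=Cool", "Humidity=High", "Wind=Strong"]

    score = {}
    for label, prior in [("Yes", p_yes), ("No", p_no)]:
        s = prior
        for feature in x_new:
            s *= cond[label][feature]      # naive Bayes: multiply the class-conditionals
        score[label] = s

    print(score)                           # {'Yes': ~0.0053, 'No': ~0.0206}
    print(max(score, key=score.get))       # 'No'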

Estimating Parameters: Y, Xi discrete-valued

If we are unlucky, our MLE estimate for P(Xi | Y) may be zero.

MAP estimates: the only difference is adding a few "imaginary" examples, e.g.

    P̂(Xi = xij | Y = yk) = ( #D{Xi = xij ∧ Y = yk} + l ) / ( #D{Y = yk} + l·J )

where l is the number of imaginary examples per value and J the number of distinct values of Xi.
Naïve Bayes: Assumptions of Conditional
Independence
Often the Xi are not really conditionally independent

• We can use Naïve Bayes in many cases anyway


– often the right classification, even when not the right
probability
Gaussian Naïve Bayes (continuous X)
• Algorithm: Continuous-valued Features
– Conditional probability often modeled with the normal
distribution

    P(Xi | Y = yk) modeled as N(μ_ik, σ_ik²)
Sometimes we assume the variance is
  – independent of Y (i.e., σ_i),
  – or independent of Xi (i.e., σ_k),
  – or both (i.e., σ)
Gaussian Naïve Bayes Algorithm – continuous Xi
(but still discrete Y)
• Train Naïve Bayes (examples)
    for each value yk
      estimate* P(Y = yk)
    for each attribute Xi, estimate the class-conditional mean μ_ik and variance σ_ik

• Classify (Xnew)
    Y_new ← argmax_{yk} P(Y = yk) Πi N(Xi_new; μ_ik, σ_ik)
Estimating Parameters: Y discrete, Xi continuous

Maximum likelihood estimates:

    μ̂_ik = Σj Xi^j δ(Y^j = yk) / Σj δ(Y^j = yk)
    σ̂²_ik = Σj (Xi^j − μ̂_ik)² δ(Y^j = yk) / Σj δ(Y^j = yk)

where j indexes the training examples, i the features, and k the classes;
δ(z) = 1 if z is true, else 0.
Naïve Bayes
• Example: Continuous-valued Features
– Temperature is naturally of continuous value.
Yes: 25.2, 19.3, 18.5, 21.7, 20.1, 24.3, 22.8, 23.1, 19.8
No: 27.3, 30.1, 17.4, 29.5, 15.1
– Estimate the mean and variance for each class:
    μ = (1/N) Σ_{n=1}^{N} xn ,    σ² = (1/N) Σ_{n=1}^{N} (xn − μ)²

    μ_Yes = 21.64, σ_Yes = 2.35
    μ_No  = 23.88, σ_No  = 7.09

– Learning phase: output two Gaussian models for P(temp | C):

    P̂(x | Yes) = (1 / (2.35 √(2π))) exp( −(x − 21.64)² / (2 · 2.35²) )
    P̂(x | No)  = (1 / (7.09 √(2π))) exp( −(x − 23.88)² / (2 · 7.09²) )
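A small NumPy sketch reproducing this continuous-feature case (names are illustrative; note that the unbiased estimator, ddof=1, reproduces the 2.35 / 7.09 values quoted above):

    import numpy as np

    temp_yes = np.array([25.2, 19.3, 18.5, 21.7, 20.1, 24.3, 22.8, 23.1, 19.8])
    temp_no  = np.array([27.3, 30.1, 17.4, 29.5, 15.1])

    def gaussian(x, mu, sigma):
        return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

    # Class-conditional parameters
    mu_yes, sd_yes = temp_yes.mean(), temp_yes.std(ddof=1)
    mu_no,  sd_no  = temp_no.mean(),  temp_no.std(ddof=1)
    p_yes, p_no = 9 / 14, 5 / 14      # priors from the Play Tennis table

    x = 22.0                          # a new temperature reading
    score_yes = gaussian(x, mu_yes, sd_yes) * p_yes
    score_no  = gaussian(x, mu_no,  sd_no)  * p_no
    print("Yes" if score_yes > score_no else "No")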
The independence hypothesis…
• makes computation possible
• yields optimal classifiers when satisfied
• Rarely satisfied in practice, as attributes (variables)
are often correlated.
• To overcome this limitation:
– Bayesian networks combine Bayesian reasoning with
causal relationships between attributes
Thank You
Foundations of Machine Learning

Module 4:
Part D: Bayesian Networks

Sudeshna Sarkar
IIT Kharagpur
Why Bayes Network
• Bayes optimal classifier is too costly to apply
• Naïve Bayes makes overly restrictive
assumptions.
– But all variables are rarely completely independent.
• Bayes network represents conditional
independence relations among the features.
• Representation of causal relations makes the
representation and inference efficient.
[Figure: example Bayesian network over the variables Accident, Late wakeup, Rainy day,
 Traffic Jam, Meeting postponed, Late for Work, and Late for meeting]
Bayesian Network
• A graphical model that efficiently encodes the joint
probability distribution for a large set of variables
• A Bayesian Network for a set of variables (nodes)
X = { X1,…….Xn}
• Arcs represent probabilistic dependence among
variables
• Lack of an arc denotes a conditional independence
• The network structure S is a directed acyclic graph
• A set P of local probability distributions at each node
(Conditional Probability Table)
Representation in Bayesian Belief Networks

[Figure: the same example network, with a conditional probability table attached to each node]

• The conditional probability table associated with each node specifies the conditional
  distribution for the variable given its immediate parents in the graph.
• Each node is asserted to be conditionally independent of its non-descendants, given its
  immediate parents.
Inference in Bayesian Networks
• Computes posterior probabilities given evidence about
some nodes
• Exploits probabilistic independence for efficient
computation.
• Unfortunately, exact inference of probabilities in
general for an arbitrary Bayesian Network is known to
be NP-hard.
• In theory, approximate techniques (such as Monte
Carlo Methods) can also be NP-hard, though in
practice, many such methods were shown to be useful.
• Efficient algorithms leverage the structure of the graph
6
Applications of Bayesian Networks
• Diagnosis:       P(cause | symptom) = ?
• Prediction:      P(symptom | cause) = ?
• Classification:  P(class | data)
• Decision-making (given a cost function)

[Figure: small networks with cause nodes C1, C2 pointing to symptom nodes]
Bayesian Networks
• Structure of the graph ⇒ conditional independence relations

In general,
    p(X1, X2, …, XN) = Πi p(Xi | parents(Xi))
i.e., the full joint distribution is represented by the graph-structured factorization.
• Requires that the graph is acyclic (no directed cycles)

• 2 components to a Bayesian network


– The graph structure (conditional independence
assumptions)
– The numerical probabilities (for each variable given its
parents)
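As a minimal illustration of this factorization, the sketch below encodes a tiny two-node network (Rain → Traffic) with hypothetical CPT values and evaluates the joint as a product of local conditionals:

    # P(Rain) and P(Traffic | Rain) -- hypothetical numbers for illustration only
    p_rain = {True: 0.2, False: 0.8}
    p_traffic_given_rain = {True: {True: 0.7, False: 0.3},
                            False: {True: 0.1, False: 0.9}}

    def joint(rain, traffic):
        # p(Rain, Traffic) = p(Rain) * p(Traffic | parents(Traffic))
        return p_rain[rain] * p_traffic_given_rain[rain][traffic]

    # The joint table sums to 1, as it must
    total = sum(joint(r, t) for r in (True, False) for t in (True, False))
    print(joint(True, True), total)   # 0.14, 1.0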
Examples
• A, B, C with no edges — marginal independence:
    p(A, B, C) = p(A) p(B) p(C)

• A → B, A → C — conditionally independent effects:
    p(A, B, C) = p(B|A) p(C|A) p(A)
    B and C are conditionally independent given A
    (e.g. A: disease, B: symptom 1, C: symptom 2)

• A → C ← B — independent causes ("explaining away"):
    p(A, B, C) = p(C|A,B) p(A) p(B)
    (e.g. A: Traffic, B: Late wakeup, C: Late)

• A → B → C — Markov dependence:
    p(A, B, C) = p(C|B) p(B|A) p(A)
Naïve Bayes Model
[Figure: Naïve Bayes as a network — class C is the parent of features Y1, Y2, Y3, …, Yn]

Hidden Markov Model (HMM)

[Figure: hidden states S1 → S2 → S3 → … → Sn, each St emitting an observed Yt]

Assumptions:
  1. the hidden state sequence is Markov
  2. observation Yt is conditionally independent of all other variables given St

Widely used in sequence learning, e.g., speech recognition, POS tagging.
Inference is linear in n.
Learning Bayesian Belief Networks
1. The network structure is given in advance and all the
variables are fully observable in the training examples.
– estimate the conditional probabilities.
2. The network structure is given in advance but only
some of the variables are observable in the training
data.
– Similar to learning the weights for the hidden units of
a Neural Net: Gradient Ascent Procedure
3. The network structure is not known in advance.
– Use a heuristic search or constraint-based technique to
search through potential structures.
14
Thank You
Foundations of Machine Learning

Module 5:
Part A: Logistic Regression

Sudeshna Sarkar
IIT Kharagpur
Logistic Regression for classification
• Linear Regression:
    h(x) = Σ_{i=0}^{n} βi xi = βᵀx

• Logistic function:
    g(z) = 1 / (1 + e^{−z})

• Logistic Regression for classification:
    hβ(x) = g(βᵀx) = 1 / (1 + e^{−βᵀx})

  g(z) = 1 / (1 + e^{−z}) is called the logistic function or the sigmoid function.
Sigmoid function properties
• Bounded between 0 and 1
• g(z) → 1 as z → ∞
• g(z) → 0 as z → −∞

    d/dz g(z) = d/dz [ 1 / (1 + e^{−z}) ]
              = e^{−z} / (1 + e^{−z})²
              = (1 / (1 + e^{−z})) · (1 − 1 / (1 + e^{−z}))
              = g(z) (1 − g(z))
Logistic Regression
• In logistic regression, we learn the conditional distribution
P(y|x)
• Let py(x; 𝛽) be our estimate of P(y|x), where 𝛽 is a vector of
adjustable parameters.
• Assume there are two classes, y = 0 and y = 1 and
𝑃 𝑦 = 1 𝑥 = ℎ𝛽 𝑥
𝑃 𝑦 = 0 𝑥 = 1 − ℎ𝛽 (𝑥)
• Can be written more compactly
𝑃 𝑦 𝑥 = ℎ(𝑥)𝑦 (1 − ℎ 𝑥 )1−𝑦
• We can use the gradient method to fit β.
Maximize likelihood
    L(β) = p(y⃗ | X; β)
         = Π_{i=1}^{m} p(yi | xi; β)
         = Π_{i=1}^{m} h(xi)^{yi} (1 − h(xi))^{1−yi}

    ℓ(β) = log L(β)
         = Σ_{i=1}^{m} [ yi log h(xi) + (1 − yi) log(1 − h(xi)) ]
• How do we maximize the likelihood? Gradient ascent
  – Updates: β = β + α ∇β ℓ(β)

Assume one training example (x, y), and take derivatives to derive the stochastic
gradient ascent rule:

    ∂ℓ(β)/∂βj = ( y / g(βᵀx) − (1 − y) / (1 − g(βᵀx)) ) · ∂g(βᵀx)/∂βj
              = ( y / g(βᵀx) − (1 − y) / (1 − g(βᵀx)) ) · g(βᵀx)(1 − g(βᵀx)) · ∂(βᵀx)/∂βj
              = ( y (1 − g(βᵀx)) − (1 − y) g(βᵀx) ) xj
              = ( y − hβ(x) ) xj

    β = β + α ∇β ℓ(β)
    βj = βj + α ( y^(i) − hβ(x^(i)) ) xj^(i)
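A minimal NumPy sketch of this update rule (batch gradient ascent on a toy dataset; all names and values are illustrative):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def train_logistic(X, y, alpha=0.1, iters=1000):
        """Gradient ascent on the log-likelihood: beta += alpha * X^T (y - h)."""
        X = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend x0 = 1 for the bias term
        beta = np.zeros(X.shape[1])
        for _ in range(iters):
            h = sigmoid(X @ beta)
            beta += alpha * X.T @ (y - h)              # sum over i of (y_i - h(x_i)) x_i
        return beta

    # Toy 1-D data: class 1 tends to have larger x
    X = np.array([[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]])
    y = np.array([0, 0, 0, 1, 1, 1])
    beta = train_logistic(X, y)
    print(sigmoid(np.array([1, 2.25]) @ beta))   # P(y=1 | x=2.25), near the boundary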
Foundations of Machine Learning
Module 5:
Part B: Introduction to Support
Vector Machine
Sudeshna Sarkar
IIT Kharagpur

1
Support Vector Machines
• SVMs have a clever way to prevent overfitting
• They can use many features without requiring
too much computation.

2
Logistic Regression and Confidence
• Logistic Regression:
𝑝 𝑦 = 1 𝑥 = ℎ𝛽 𝑥 = 𝑔(𝛽𝑇 𝑥)
• Predict 1 on an input x iff ℎ𝛽 𝑥 ≥ 0.5,
equivalently, 𝛽𝑇 𝑥 ≥ 0
• The larger the value of ℎ𝛽 𝑥 , the larger is the probability,
and higher the confidence.
• Similarly, confident prediction of 𝑦 = 0 if 𝛽𝑇 𝑥 ≪ 0
• More confident of prediction from points (instances) located
far from the decision surface.

3
Preventing overfitting with many features
• Suppose a big set of features.
• What is the best separating line
to use?
• Bayesian answer:
– Use all
– Weight each line by its posterior
probability
• Can we approximate the correct
answer efficiently?

4
Support Vectors
• The line that maximizes the
minimum margin.
• This maximum-margin separator is
determined by a subset of the
datapoints.
– called “support vectors”.
– we use the support vectors to
decide which side of the
separator a test case is on. The support vectors are
indicated by the circles
around them.

5
Functional Margin
• Functional Margin of a point (𝑥𝑖 , 𝑦𝑖 ) wrt (𝑤, 𝑏)
– Measured by the distance of a point (𝑥𝑖 , 𝑦𝑖 ) from the
decision boundary (𝑤, 𝑏)
𝛾 𝑖 = 𝑦𝑖 (𝑤 𝑇 𝑥𝑖 + 𝑏)
– Larger functional margin →more confidence for
correct prediction
– Problem: w and b can be scaled to make this value
larger
• Functional Margin of training set
{ 𝑥1 , 𝑦1 , 𝑥2 , 𝑦2 , … , 𝑥𝑚 , 𝑦𝑚 } wrt (𝑤, 𝑏) is
𝛾 = min 𝛾 𝑖
1≤𝑖≤𝑚

6
Geometric Margin
• For a decision surface (w, b), the vector orthogonal to it is given by w.
• The unit-length orthogonal vector is w / ‖w‖.
• For a point P at distance γ from the surface, with Q its projection onto the surface:
    P = Q + γ · w/‖w‖

[Figure: point P = (a1, a2), its projection Q = (b1, b2) onto the hyperplane (w, b)]
Geometric Margin
    P = Q + γ w/‖w‖
    (b1, b2) = (a1, a2) − γ w/‖w‖

  ⇒ wᵀ( (a1, a2) − γ w/‖w‖ ) + b = 0
  ⇒ γ = ( wᵀ(a1, a2) + b ) / ‖w‖ = (w/‖w‖)ᵀ (a1, a2) + b/‖w‖

For a labeled point, the geometric margin is
    γ = y · ( (w/‖w‖)ᵀ (a1, a2) + b/‖w‖ )

Geometric margin = functional margin computed with ‖w‖ = 1.
The geometric margin of (w, b) w.r.t. S = {(x1, y1), (x2, y2), …, (xm, ym)} is the smallest
of the geometric margins of the individual points.
Maximize margin width
• Assume linearly separable training examples.
• The classifier with the maximum margin width is robust to outliers and thus has strong
  generalization ability.

[Figure: two classes (+1 and −1) in the (x1, x2) plane, separated by a hyperplane with margin]
Maximize Margin Width
• Maximize γ/‖w‖ subject to
    yi (wᵀxi + b) ≥ γ   for i = 1, 2, …, m
• Scale so that γ = 1
• Maximizing 1/‖w‖ is the same as minimizing ‖w‖²
• Minimize w·w subject to the constraints, for all (xi, yi), i = 1, …, m:
    wᵀxi + b ≥ 1    if yi = 1
    wᵀxi + b ≤ −1   if yi = −1
Large Margin Linear Classifier
• Formulation:
    minimize ½ ‖w‖²
    such that yi (wᵀxi + b) ≥ 1

[Figure: the margin between the two classes (+1 and −1), with support vectors x+ and x−
 lying on the margin boundaries]
Solving the Optimization Problem
    minimize ½ ‖w‖²
    s.t. yi (wᵀxi + b) ≥ 1

• Optimization problem with convex quadratic objectives and


linear constraints
• Can be solved using QP.
• Lagrange duality to get the optimization problem’s dual form,
– Allow us to use kernels to get optimal margin classifiers to work
efficiently in very high dimensional spaces.
– Allow us to derive an efficient algorithm for solving the above
optimization problem that will typically do much better than generic
QP software.

12
Foundations of Machine Learning
Module 5:
Part C: Support Vector Machine:
Dual
Sudeshna Sarkar
IIT Kharagpur

1
Solving the Optimization Problem
    minimize ½ ‖w‖²
    s.t. yi (wᵀxi + b) ≥ 1

• Optimization problem with convex quadratic objectives and


linear constraints
• Can be solved using QP.
• Lagrange duality to get the optimization problem’s dual form,
– Allow us to use kernels to get optimal margin classifiers to work
efficiently in very high dimensional spaces.
– Allow us to derive an efficient algorithm for solving the above
optimization problem that will typically do much better than generic
QP software.

2
Lagrangian Duality in brief
The Primal Problem:
    min_w f(w)
    s.t. gi(w) ≤ 0,  i = 1, …, k
         hi(w) = 0,  i = 1, …, l

The generalized Lagrangian:
    L(w, α, β) = f(w) + Σ_{i=1}^{k} αi gi(w) + Σ_{i=1}^{l} βi hi(w)

the α's (αi ≥ 0) and β's are called the Lagrange multipliers.

Lemma:
    max_{α,β: αi≥0} L(w, α, β) = f(w) if w satisfies the primal constraints, ∞ otherwise.

A re-written primal:
    min_w max_{α,β: αi≥0} L(w, α, β)
Lagrangian Duality, cont.
The Primal Problem:  p* = min_w max_{α,β: αi≥0} L(w, α, β)
The Dual Problem:    d* = max_{α,β: αi≥0} min_w L(w, α, β)

Theorem (weak duality):
    d* = max_{α,β: αi≥0} min_w L(w, α, β) ≤ min_w max_{α,β: αi≥0} L(w, α, β) = p*

Theorem (strong duality):
    Iff there exists a saddle point of L(w, α, β), we have d* = p*.
The KKT conditions
If there exists some saddle point of L, then it satisfies the
following "Karush-Kuhn-Tucker" (KKT) conditions:

    ∂L(w, α, β)/∂wi = 0,   i = 1, …, k
    ∂L(w, α, β)/∂βi = 0,   i = 1, …, l
    αi gi(w) = 0,          i = 1, …, m
    gi(w) ≤ 0,             i = 1, …, m
    αi ≥ 0,                i = 1, …, m

Theorem: If w*, α* and b* satisfy the KKT conditions, then they also give a solution to the
primal and the dual problems.
Support Vectors
• Only a few 𝛼𝑖 ’s can be nonzero
• Call the training data points whose 𝛼𝑖 ’s are
nonzero the support vectors
    αi gi(w) = 0,  i = 1, …, m

If αi > 0, then gi(w) = 0.
Solving the Optimization Problem
Quadratic programming with linear constraints:
    minimize ½ ‖w‖²
    s.t. yi (wᵀxi + b) ≥ 1

Lagrangian function:
    minimize Lp(w, b, αi) = ½ ‖w‖² − Σ_{i=1}^{n} αi [ yi (wᵀxi + b) − 1 ]
    s.t. αi ≥ 0
Solving the Optimization Problem
    minimize Lp(w, b, αi) = ½ ‖w‖² − Σ_{i=1}^{n} αi [ yi (wᵀxi + b) − 1 ],   s.t. αi ≥ 0

Minimize w.r.t. w and b for fixed α:
    ∂Lp/∂w = 0  ⇒  w = Σ_{i=1}^{n} αi yi xi
    ∂Lp/∂b = 0  ⇒  Σ_{i=1}^{n} αi yi = 0

Substituting back:
    Lp(w, b, α) = Σ_{i=1}^{m} αi − ½ Σ_{i,j=1}^{m} αi αj yi yj (xiᵀxj) − b Σ_{i=1}^{m} αi yi
                = Σ_{i=1}^{m} αi − ½ Σ_{i,j=1}^{m} αi αj yi yj (xiᵀxj)
The Dual problem
Now we have the following dual optimization problem:

    max_α J(α) = Σ_{i=1}^{m} αi − ½ Σ_{i,j=1}^{m} αi αj yi yj (xiᵀxj)
    s.t. αi ≥ 0, i = 1, …, m
         Σ_{i=1}^{m} αi yi = 0

This is a quadratic programming problem.
  – A global maximum of the αi can always be found.
Support vector machines
• Once we have the Lagrange multipliers {𝛼𝑗 } we can
reconstruct the parameter vector 𝑤 as a weighted
combination of the training examples:
w a y x
m
w   a i yi x i i i i
i 1 iSV

• For testing with a new data z


– Compute
 
wT z  b   a i yi xTi z  b
iSV

and classify z as class 1 if the sum is positive, and class 2


otherwise
Note: w need not be formed explicitly
Solving the Optimization Problem
• The discriminant function is:
      g(x) = wᵀx + b = Σ_{i∈SV} αi yi xiᵀx + b

• It relies on a dot product between the test point x and the support vectors xi.
• Solving the optimization problem involved computing the dot products xiᵀxj between all
  pairs of training points.
• The optimal w is a linear combination of a small number of data points.
Foundations of Machine Learning
Module 5: Support Vector Machine
Part D: SVM – Maximum Margin
with Noise
Sudeshna Sarkar
IIT Kharagpur

1
Linear SVM formulation
Find w and b such that
    2/‖w‖ is maximized
and for each of the m training points (xi, yi):
    yi (w·xi + b) ≥ 1

Equivalently, find w and b such that
    ‖w‖² = w·w is minimized
and for each of the m training points (xi, yi):
    yi (w·xi + b) ≥ 1
Limitations of previous SVM
formulation
• What if the data is
not linearly
separable?

• Or noisy data
points?

Extend the definition of maximum margin to allow


non-separating planes.
3
How to formulate?
• Minimize 𝑤 2 = 𝑤. 𝑤 and number of
misclassifications, i.e., minimize
𝑤. 𝑤 + #(training errors)

• No longer QP formulation

4
Objective to be minimized
• Minimize
    w·w + C · (distance of error points to their correct place)
Maximum Margin with Noise
[Figure: separating hyperplane with margin M = 2/√(w·w); points x1, x2, x3 violate the
 margin with slack variables ξk]

Minimize
    w·w + C Σ_{k=1}^{m} ξk
subject to the m constraints
    w·xk + b ≥ 1 − ξk    if yk = 1
    w·xk + b ≤ −1 + ξk   if yk = −1
  ≡ yk (w·xk + b) ≥ 1 − ξk,  k = 1, …, m
    ξk ≥ 0,  k = 1, …, m

C controls the relative importance of maximizing the margin and fitting the training data;
it controls overfitting.
Lagrangian
    L(w, b, ξ, α, β) = ½ w·w + C Σ_{i=1}^{m} ξi
                       − Σ_{i=1}^{m} αi [ yi (xi·w + b) − 1 + ξi ] − Σ_{i=1}^{m} βi ξi

αi's and βi's are Lagrange multipliers (≥ 0).
Dual Formulation
Find α1, α2, …, αm s.t.
    max_α J(α) = Σ_{i=1}^{m} αi − ½ Σ_{i,j=1}^{m} αi αj yi yj (xiᵀxj)

  Linear SVM (separable case):   s.t. αi ≥ 0, i = 1, …, m;       Σ_{i=1}^{m} αi yi = 0
  With noise (soft margin):      s.t. 0 ≤ αi ≤ C, i = 1, …, m;   Σ_{i=1}^{m} αi yi = 0
Solution to Soft Margin Classification
• 𝑥𝑖 with non-zero 𝛼𝑖 will be support vectors.
• Solution to the dual problem is:
    w = Σ_{i=1}^{m} αi yi xi
    b = yk (1 − ξk) − Σ_{i=1}^{m} αi yi (xi·xk)   for any k s.t. αk > 0

For classification,
    f(x) = Σ_{i=1}^{m} αi yi (xi·x) + b
(no need to compute w explicitly)
Thank You

10
Foundations of Machine Learning
Module 5: Support Vector Machine
Part E: Nonlinear SVM and Kernel
function
Sudeshna Sarkar
IIT Kharagpur

1
Non-linear decision surface
• We saw how to deal with datasets which are linearly
separable with noise.
• What if the decision boundary is truly non-linear?
• Idea: Map data to a high dimensional space where it
is linearly separable.
– Using a bigger set of features will make the computation
slow?
– The “kernel” trick to make the computation fast.

2
Non-linear SVMs: Feature Space

Φ: 𝑥 → 𝜙(𝑥)
Kernel
• Original input attributes is mapped to a new set of
input features via feature mapping Φ.
• Since the algorithm can be written in terms of the
scalar product, we replace 𝑥𝑎 . 𝑥𝑏 with 𝜙 𝑥𝑎 . 𝜙(𝑥𝑏 )
• For certain Φ’s there is a simple operation on two
vectors in the low-dim space that can be used to
compute the scalar product of their two images in the
high-dim space
𝐾 𝑥𝑎 , 𝑥𝑏 = 𝜙 𝑥𝑎 . 𝜙(𝑥𝑏 )
Let the kernel do the work rather than do the scalar
product in the high dimensional space.
5
Nonlinear SVMs: The Kernel Trick
• With this mapping, our discriminant function is now:

      g(x) = wᵀφ(x) + b = Σ_{i∈SV} αi φ(xi)ᵀφ(x) + b
• We only use the dot product of feature vectors in both


the training and test.
• A kernel function is defined as a function that
corresponds to a dot product of two feature vectors in
some expanded feature space:
𝐾 𝑥𝑎 , 𝑥𝑏 = 𝜙 𝑥𝑎 . 𝜙(𝑥𝑏 )
The kernel trick

𝐾 𝑥𝑎 , 𝑥𝑏 = 𝜙 𝑥𝑎 . 𝜙(𝑥𝑏 )
Often 𝐾 𝑥𝑎 , 𝑥𝑏 may be very inexpensive to compute even if
𝜙 𝑥𝑎 may be extremely high dimensional.
Kernel Example
ഥ = [𝑥1 𝑥2 ]
2-dimensional vectors 𝒙
let 𝑲 𝒙𝒊 , 𝒙𝒋 = (𝟏 + 𝒙𝒊 . 𝒙𝒋 )𝟐
We need to show that 𝐾 𝑥𝑖 , 𝑥𝑗 = 𝜙 𝑥𝑖 . 𝜙(𝑥𝑗 )

  K(xi, xj) = (1 + xi·xj)²
            = 1 + xi1²xj1² + 2 xi1xj1 xi2xj2 + xi2²xj2² + 2 xi1xj1 + 2 xi2xj2
            = [1, xi1², √2 xi1xi2, xi2², √2 xi1, √2 xi2] · [1, xj1², √2 xj1xj2, xj2², √2 xj1, √2 xj2]
            = φ(xi) · φ(xj),
  where φ(x) = [1, x1², √2 x1x2, x2², √2 x1, √2 x2]
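A quick numeric check of this identity (the test vectors are illustrative):

    import numpy as np

    def phi(x):
        x1, x2 = x
        return np.array([1, x1**2, np.sqrt(2)*x1*x2, x2**2, np.sqrt(2)*x1, np.sqrt(2)*x2])

    def K(a, b):
        return (1 + a @ b) ** 2          # polynomial kernel of degree 2

    xi, xj = np.array([1.0, 2.0]), np.array([3.0, -1.0])
    print(K(xi, xj), phi(xi) @ phi(xj))  # both print 4.0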
Commonly-used kernel functions
• Linear kernel: 𝐾 𝑥𝑖 , 𝑥𝑗 = 𝑥𝑖 . 𝑥𝑗
• Polynomial of power p:
𝐾 𝑥𝑖 , 𝑥𝑗 = (1 + 𝑥𝑖 . 𝑥𝑗 )𝑝
• Gaussian (radial-basis function):
    K(xi, xj) = exp( −‖xi − xj‖² / (2σ²) )
• Sigmoid:
    K(xi, xj) = tanh(β0 xi·xj + β1)
In general, functions that satisfy Mercer’s condition can
be kernel functions.
9
Kernel Functions
• Kernel function can be thought of as a similarity measure
between the input objects
• Not all similarity measure can be used as kernel function.
• Mercer's condition states that any positive semi-definite
kernel K(x, y), i.e.

෍ 𝐾(𝑥𝑖 , 𝑥𝑗 )𝑐𝑖 𝑐𝑗 ≥ 0
𝑖,𝑗
• can be expressed as a dot product in a high dimensional
space.

10
SVM examples

[Figure: decision boundaries learned by SVMs on example 2-D datasets]


Examples for Non Linear SVMs –
Gaussian Kernel

[Figure: non-linear decision boundaries obtained with the Gaussian kernel]


Nonlinear SVM: Optimization
• Formulation (Lagrangian dual problem):
    maximize Σ_{i=1}^{n} αi − ½ Σ_{i=1}^{n} Σ_{j=1}^{n} αi αj yi yj K(xi, xj)
    such that 0 ≤ αi ≤ C
              Σ_{i=1}^{n} αi yi = 0

• The solution of the discriminant function is
    g(x) = Σ_{i∈SV} αi yi K(xi, x) + b
Performance
• Support Vector Machines work very well in practice.
– The user must choose the kernel function and its
parameters
• They can be expensive in time and space for big datasets
– The computation of the maximum-margin hyper-plane
depends on the square of the number of training cases.
– We need to store all the support vectors.
• The kernel trick can also be used to do PCA in a much higher-
dimensional space, thus giving a non-linear version of PCA in
the original space.
Multi-class classification
• SVMs can only handle two-class outputs
• Learn N SVM’s
– SVM 1 learns Class1 vs REST
– SVM 2 learns Class2 vs REST
– :
– SVM N learns ClassN vs REST
• Then to predict the output for a new input, just
predict with each SVM and find out which one puts
the prediction the furthest into the positive region.
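A minimal one-vs-rest sketch using scikit-learn's SVC (this assumes scikit-learn is available; the data here is synthetic and the parameters are illustrative):

    import numpy as np
    from sklearn.svm import SVC

    # Synthetic 3-class data: three Gaussian blobs in 2-D
    rng = np.random.RandomState(0)
    X = np.vstack([rng.randn(30, 2) + c for c in ([0, 0], [4, 0], [0, 4])])
    y = np.repeat([0, 1, 2], 30)

    # Train one binary SVM per class: class k vs REST
    models = [SVC(kernel='linear', C=1.0).fit(X, (y == k).astype(int)) for k in range(3)]

    def predict(x):
        # decision_function gives the signed distance to each separating hyperplane;
        # pick the class whose SVM pushes x furthest into the positive region
        scores = [m.decision_function(x.reshape(1, -1))[0] for m in models]
        return int(np.argmax(scores))

    print(predict(np.array([3.8, 0.2])))   # expected: class 1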
Thank You

16
Foundations of Machine Learning
Module 5: Support Vector Machine
Part F: SVM – Solution to the Dual
Problem
Sudeshna Sarkar
IIT Kharagpur

1
The SMO algorithm
The SMO algorithm can efficiently solve the dual problem.
First we discuss Coordinate Ascent.
Coordinate Ascent
• Consider solving the unconstrained optimization problem:
max 𝑊(𝛼1 , 𝛼2 , … , 𝛼𝑛 )
𝛼

Loop until convergence: {
  for i = 1 to n {
    αi := argmax_{α̂i} W(α1, …, α̂i, …, αn)
  }
}
Coordinate ascent

• Ellipses are the contours of the function.


• At each step, the path is parallel to one of the axes.
Sequential minimal optimization
• Constrained optimization:
    max_α J(α) = Σ_{i=1}^{m} αi − ½ Σ_{i,j=1}^{m} αi αj yi yj (xi·xj)
    s.t. 0 ≤ αi ≤ C, i = 1, …, m
         Σ_{i=1}^{m} αi yi = 0

• Question: can we do coordinate ascent along one direction at a time (i.e., hold all α[−i]
  fixed, and update αi)?
The SMO algorithm
    max_α W(α) = Σ_{i=1}^{m} αi − ½ Σ_{i,j=1}^{m} αi αj yi yj (xi·xj)
    s.t. 0 ≤ αi ≤ C, i = 1, …, m
         Σ_{i=1}^{m} αi yi = 0

• Choose a set of αi's satisfying the constraints.
• Because of the equality constraint, α1 is exactly determined by the other α's.
• We have to update at least two of them simultaneously to keep satisfying the constraints.
The SMO algorithm
Repeat till convergence {
1. Select some pair ai and aj to update next
(using a heuristic that tries to pick the two that
will allow us to make the biggest progress
towards the global maximum).
2. Re-optimize W(α) with respect to αi and αj, while holding all the other αk's (k ≠ i, j)
   fixed.
}
• The update to ai and aj can be computed very
efficiently.
Thank You

7
Foundations of Machine Learning

Module 6: Neural Network


Part A: Introduction

Sudeshna Sarkar
IIT Kharagpur
Introduction
• Inspired by the human brain.
• Some NNs are models of biological neural networks
• Human brain contains a massively interconnected
net of 1010-1011 (10 billion) neurons (cortical cells)
– Massive parallelism – large number of simple
processing units
– Connectionism – highly interconnected
– Associative distributed memory
• Pattern and strength of synaptic connections
Neuron

Neural Unit
ANNs
• ANNs incorporate the two fundamental components of
biological neural nets:
1. Nodes - Neurones
2. Weights - Synapses
Perceptrons
• Basic unit in a neural network: Linear separator
– N inputs, x1 ... xn
– Weights for each input, w1 ... wn
– A bias input x0 (constant) and associated weight w0
– Weighted sum of inputs, 𝑦 = σ𝑛𝑖=0 𝑤𝑖 𝑥𝑖
– A threshold function, i.e., 1 if y > 0, −1 if y <= 0
[Figure: inputs x0, x1, …, xn with weights w0, w1, …, wn feeding a summation unit
 y = Σi wi xi, followed by the threshold φ = 1 if y > 0, −1 otherwise]
Perceptron training rule
Updates perceptron weights for a training ex as follows:
𝑤𝑖 = 𝑤𝑖 + 𝛥𝑤𝑖

𝛥𝑤𝑖 = 𝜂 𝑦 − 𝑦ො 𝑥𝑖
• If the data is linearly separable and 𝜂 is sufficiently small, it will
converge to a hypothesis that classifies all training data correctly in a
finite number of iterations
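A minimal NumPy perceptron implementing this update rule (the toy data and names are illustrative):

    import numpy as np

    def train_perceptron(X, y, eta=0.1, epochs=20):
        """Perceptron rule: w_i <- w_i + eta * (y - y_hat) * x_i, with x_0 = 1 as the bias input."""
        X = np.hstack([np.ones((X.shape[0], 1)), X])
        w = np.zeros(X.shape[1])
        for _ in range(epochs):
            for xi, yi in zip(X, y):
                y_hat = 1 if w @ xi > 0 else -1
                w += eta * (yi - y_hat) * xi
        return w

    # Linearly separable toy data with labels in {-1, +1}
    X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
    y = np.array([1, 1, -1, -1])
    w = train_perceptron(X, y)
    print(np.sign(np.hstack([1, [0.5, 0.5]]) @ w))   # predicted label for a new point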
Gradient Descent
• Perceptron training rule may not converge if points are not
linearly separable
• Gradient descent by changing the weights by the total error
for all training points.
– If the data is not linearly separable, then it will converge to
the best fit
Linear neurons
• The neuron has a real-valued output which is a weighted sum of its inputs:
    ŷ = Σi wi xi = wᵀx
• Define the error as the squared residuals summed over all training cases:
    E = ½ Σj (yj − ŷj)²
• Differentiate to get the error derivatives for the weights:
    ∂E/∂wi = Σ_{j=1..m} (∂ŷj/∂wi)(∂Ej/∂ŷj) = −Σ_{j=1..m} xi,j (yj − ŷj)
• The batch delta rule changes the weights in proportion to their error derivatives summed
  over all training cases:
    Δwi = −η ∂E/∂wi
Error Surface
• The error surface lies in a space with a horizontal axis for each
weight and one vertical axis for the error.
– For a linear neuron, it is a quadratic bowl.
– Vertical cross-sections are parabolas.
– Horizontal cross-sections are ellipses.
Batch Line and Stochastic Learning
Batch learning: steepest descent on the error surface defined by the total error over all
training cases.

Stochastic / online learning: for each example, compute the gradient of that example's
error and update immediately:
    E = ½ (y − ŷ)²
    ∂E/∂wi = (∂ŷ/∂wi)(∂E/∂ŷ) = −xi (y − ŷ)
Computation at Units
• Compute a 0-1 or a graded function of the
weighted sum of the inputs
• φ(·) is the activation function

[Figure: inputs x1, …, xn with weights w1, …, wn; the unit computes w·x = Σi wi xi and
 outputs φ(w·x)]
Neuron Model: Logistic Unit
    φ(z) = 1 / (1 + e^{−z}),   so the unit output is ŷ = φ(w·x) = 1 / (1 + e^{−w·x})
    φ′(z) = φ(z)(1 − φ(z))

    E = ½ Σd (yd − ŷd)² = ½ Σd (yd − φ(w·xd))²

    ∂E/∂wi = Σd (∂Ed/∂ŷd)(∂ŷd/∂wi)
           = −Σd (yd − ŷd) φ′(w·xd) xi,d
           = −Σd (yd − ŷd) ŷd (1 − ŷd) xi,d

Training rule:   Δwi = η Σd (yd − ŷd) ŷd (1 − ŷd) xi,d
Thank You
Foundations of Machine Learning
Module 6: Neural Network
Part B: Multi-layer Neural
Network
Sudeshna Sarkar
IIT Kharagpur
Limitations of Perceptrons
• Perceptrons have a monotonicity property:
  if a link has positive weight, activation can only increase as the corresponding input
  value increases (irrespective of other input values)
• Can’t represent functions where input interactions can cancel
one another’s effect (e.g. XOR)
• Can represent only linearly separable functions
A solution: multiple layers
[Figure: a network with input layer (x1, x2), hidden layer (z1, z2), and output layer (y),
 alongside the corresponding decision regions]
Power/Expressiveness of Multilayer
Networks
• Can represent interactions among inputs
• Two layer networks can represent any Boolean
function, and continuous functions (within a
tolerance) as long as the number of hidden units is
sufficient and appropriate activation functions used
• Learning algorithms exist, but weaker guarantees
than perceptron learning algorithms
Multilayer Network

[Figure: a feed-forward network: input layer → first hidden layer → second hidden layer →
 output layer]
Two-layer back-propagation neural network
[Figure: input signals flow forward from the input layer through the hidden layer (weights
 wij) to the output layer (weights wjk); error signals propagate backwards]
The back-propagation training algorithm
• Step 1: Initialisation
Set all the weights and threshold levels of the network to
random numbers uniformly distributed inside a small range
1

v01
v11 1
x1 1 1 w11
v21 w01

1 y1
v22
x2 2 2 w21
v22
Input v02 Output

1
x z y
Backprop
• Initialization
– Set all the weights and threshold levels of the network to
random numbers uniformly distributed inside a small
range
• Forward computing:
– Apply an input vector x to input units
– Compute activation/output vector z on hidden layer
𝑧𝑗 = 𝜑(σ𝑖 𝑣𝑖𝑗 𝑥𝑖 )
– Compute the output vector y on output layer
𝑦𝑘 = 𝜑(σ𝑗 𝑤𝑗𝑘 𝑧𝑗 )
y is the result of the computation.
Learning for BP Nets
• Update of weights in W (between output and hidden layers):
– delta rule
• Not applicable to updating V (between input and hidden)
– don’t know the target values for hidden units z1, Z2, … ,ZP
• Solution: Propagate errors at output units to hidden units to
drive the update of weights in V (again by delta rule)
(error BACKPROPAGATION learning)
• Error backpropagation can be continued downward if the net
has more than one hidden layer.
• How to compute errors on hidden units?
Derivation
• For one output neuron, the error function is
    E = ½ (y − ŷ)²
• For each unit j, the output oj is defined as
    oj = φ(netj) = φ( Σ_{k=1}^{n} wkj ok )
  The input netj to a neuron is the weighted sum of the outputs ok of the previous n neurons.
• Finding the derivative of the error:
    ∂E/∂wij = (∂E/∂oj)(∂oj/∂netj)(∂netj/∂wij)
Derivation
• Finding the derivative of the error:
    ∂E/∂wij = (∂E/∂oj)(∂oj/∂netj)(∂netj/∂wij)

    ∂netj/∂wij = ∂/∂wij ( Σ_{k=1}^{n} wkj ok ) = oi
    ∂oj/∂netj = ∂φ(netj)/∂netj = φ(netj)(1 − φ(netj))
Consider E as a function of the inputs of all neurons Z = {z1, z2, …} receiving input from
neuron j:
    ∂E(oj)/∂oj = ∂E(netz1, netz2, …)/∂oj
Taking the total derivative with respect to oj, a recursive expression for the derivative
is obtained:
    ∂E/∂oj = Σl (∂E/∂netzl)(∂netzl/∂oj) = Σl (∂E/∂ol)(∂ol/∂netzl) wjzl
• Therefore, the derivative with respect to 𝑜𝑗 can be calculated if all the derivatives
with respect to the outputs 𝑜𝑧𝑙 of the next layer – the one closer to the output
neuron – are known.
• Putting it all together:
    ∂E/∂wij = δj oi
  with
    δj = (∂E/∂oj)(∂oj/∂netj)
       = (oj − tj) oj (1 − oj)             if j is an output neuron
       = ( Σ_{l∈Z} δzl wjl ) oj (1 − oj)   if j is an inner neuron

To update the weight wij using gradient descent, one must choose a learning rate η:
    Δwij = −η ∂E/∂wij
Backpropagation Algorithm
Thank You
Foundations of Machine Learning
Module 6: Neural Network
Part C: Neural Network and
Backpropagation Algorithm

Sudeshna Sarkar
IIT Kharagpur
Single layer Perceptron
• Single layer perceptrons learn linear decision boundaries

[Figure: a linearly separable two-class dataset in the (x1, x2) plane, and the XOR
 configuration, which is not linearly separable]

x: class I (y = 1)
o: class II (y = −1)
Boolean OR

  x1  x2 | output
   0   0 |   0
   0   1 |   1
   1   0 |   1
   1   1 |   1

[Figure: OR is linearly separable in the (x1, x2) plane; a single unit with weights
 w1 = 1, w2 = 1 and bias w0 = −0.5 implements it]
Boolean AND

  x1  x2 | output
   0   0 |   0
   0   1 |   0
   1   0 |   0
   1   1 |   1

[Figure: AND is linearly separable; a single unit with weights w1 = 1, w2 = 1 and bias
 w0 = −1.5 implements it]
Boolean XOR

  x1  x2 | output
   0   0 |   0
   0   1 |   1
   1   0 |   1
   1   1 |   0

[Figure: XOR is not linearly separable — no single line separates the two classes]
Boolean XOR with a two-layer network

  x1  x2 | output
   0   0 |   0
   0   1 |   1
   1   0 |   1
   1   1 |   0

[Figure: hidden unit h1 = OR(x1, x2) (weights 1, 1, bias −0.5), hidden unit
 h2 = AND(x1, x2) (weights 1, 1, bias −1.5); output unit o with weights +1 from h1,
 −1 from h2, and bias −0.5]
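A quick check that these hand-set weights compute XOR, using threshold units (a toy verification, not part of the original slides):

    def step(z):
        return 1 if z > 0 else 0

    def xor_net(x1, x2):
        h1 = step(1*x1 + 1*x2 - 0.5)        # OR unit
        h2 = step(1*x1 + 1*x2 - 1.5)        # AND unit
        return step(1*h1 - 1*h2 - 0.5)      # output: OR AND (NOT AND)

    for x1 in (0, 1):
        for x2 in (0, 1):
            print(x1, x2, xor_net(x1, x2))  # prints 0, 1, 1, 0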
Representation Capability of NNs
• Single layer nets have limited representation power (linear
separability problem). Multi-layer nets (or nets with non-
linear hidden units) may overcome linear inseparability
problem.
• Every Boolean function can be represented by a network with
a single hidden layer.
• Every bounded continuous function can be approximated with
arbitrarily small error, by network with one hidden layer
• Any function can be approximated to arbitrary accuracy by a
network with two hidden layers.
Multilayer Network

[Figure: a feed-forward network: input layer → first hidden layer → second hidden layer →
 output layer]
Two-layer back-propagation neural network
[Figure: input signals flow forward from the input layer through the hidden layer (weights
 wij) to the output layer (weights wjk); error signals propagate backwards]
Derivation
• For one output neuron, the error function is
    E = ½ (y − o)²
• For each unit j, the output oj is defined as
    oj = φ(netj) = φ( Σ_{k=1}^{n} wkj ok )
  The input netj to a neuron is the weighted sum of the outputs ok of the previous n neurons.
• Finding the derivative of the error:
    ∂E/∂wij = (∂E/∂oj)(∂oj/∂netj)(∂netj/∂wij)
            = [ Σl (∂E/∂ol)(∂ol/∂netzl) wjzl ] · φ(netj)(1 − φ(netj)) · oi
    ∂E/∂wij = δj oi
  with
    δj = (∂E/∂oj)(∂oj/∂netj)
       = (oj − yj) oj (1 − oj)             if j is an output neuron
       = ( Σ_{l∈Z} δzl wjl ) oj (1 − oj)   if j is an inner neuron

To update the weight wij using gradient descent, one must choose a learning rate η:
    Δwij = −η ∂E/∂wij
Backpropagation Algorithm
Initialize all weights to small random numbers.
Until satisfied, do
  – For each training example, do
    • Input the training example to the network and compute the network outputs
    • For each output unit k:
        δk ← ok (1 − ok)(yk − ok)
    • For each hidden unit h:
        δh ← oh (1 − oh) Σ_{k∈outputs} wh,k δk
    • Update each network weight wi,j:
        wi,j ← wi,j + Δwi,j    where    Δwi,j = η δj xi,j

(Notation: xd = input, yd = target output, od = observed unit output, wij = weight from i to j)
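A compact NumPy sketch of this algorithm for a 2–2–1 network with sigmoid units, trained on XOR (a minimal illustration; the learning rate, epoch count and seed are arbitrary, and a different seed may be needed if training lands in a local minimum):

    import numpy as np

    rng = np.random.default_rng(0)
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    y = np.array([[0], [1], [1], [0]], dtype=float)

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # Small random weights (last row of each matrix holds the bias weight)
    V = rng.uniform(-0.5, 0.5, (3, 2))   # input (+bias) -> hidden
    W = rng.uniform(-0.5, 0.5, (3, 1))   # hidden (+bias) -> output
    eta = 0.5

    for _ in range(20000):
        for x, t in zip(X, y):
            # Forward pass
            xb = np.append(x, 1.0)
            h = sigmoid(xb @ V)
            hb = np.append(h, 1.0)
            o = sigmoid(hb @ W)
            # Backward pass: the delta rules from the algorithm above
            delta_o = o * (1 - o) * (t - o)
            delta_h = h * (1 - h) * (W[:2] @ delta_o)
            # Weight updates: delta_w = eta * delta_j * x_ij
            W += eta * np.outer(hb, delta_o)
            V += eta * np.outer(xb, delta_h)

    for x in X:
        xb = np.append(x, 1.0)
        o = sigmoid(np.append(sigmoid(xb @ V), 1.0) @ W)
        print(x, np.round(o, 2))   # approaches 0, 1, 1, 0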
Backpropagation
• Gradient descent over entire network weight vector
• Can be generalized to arbitrary directed graphs
• Will find a local, not necessarily global error minimum
• May include weight momentum 𝛼
∆𝑤𝑖,𝑗 𝑛 = 𝜂𝛿𝑗 𝑥𝑖,𝑗 + 𝛼∆𝑤𝑖,𝑗 𝑛 − 1
• Training may be slow.
• Using network after training is very fast
Training practices: batch vs. stochastic
vs. mini-batch gradient descent
• Batch gradient descent:
  1. Calculate outputs for the entire dataset
  2. Accumulate the errors, back-propagate and update
  (Too slow to converge; can get stuck in local minima)
• Stochastic/online gradient descent:
  1. Feed forward a training example
  2. Back-propagate the error and update the parameters
  (Converges to the solution faster; often helps get the system out of local minima)
• Mini-batch gradient descent: update after each small batch of examples; learning proceeds
  in epochs
Stopping
• Train the NN on the entire training set over and over
again
• Each such episode of training is called an “epoch”

Stopping
1. Fixed maximum number of epochs: most naïve
2. Keep track of the training and validation error
curves.
Overfitting in ANNs
Local Minima

• NN can get stuck in local minima for small networks.


• For most large networks (many weights) local minima rarely occurs.
• It is unlikely that you are in a minima in every dimension
simultaneously.
ANN
• Highly expressive non-linear functions
• Highly parallel network of logistic function units
• Minimizes sum of squared training errors
• Can add a regularization term (weight squared)
• Local minima
• Overfitting
Thank You
Foundations of Machine Learning

Module 6: Neural Network


Part D: Deep Neural Network

Sudeshna Sarkar
IIT Kharagpur
Deep Learning
• Breakthrough results in
– Image classification
– Speech Recognition
– Machine Translation
– Multi-modal learning
Deep Neural Network
• Problem: training networks with many hidden layers
doesn’t work very well
• Local minima, very slow training if initialize with zero
weights.
• Diffusion of gradient.
Hierarchical Representation
• Hierarchical Representation help represent complex
functions.
• NLP: character ->word -> Chunk -> Clause -> Sentence
• Image: pixel > edge -> texton -> motif -> part -> object
• Deep Learning: learning a hierarchy of internal
representations
• Learned internal representation at the hidden layers
(trainable feature extractor)
• Feature learning

[Pipeline: Input → Trainable Feature Extractor → … → Trainable Feature Extractor →
 Trainable Classifier → Output]
Unsupervised Pre-training
 We will use greedy, layer wise pre-training
 Train one layer at a time
 Fix the parameters of previous hidden layers
 Previous layers viewed as feature extraction
 find hidden unit features that are more common in training
input than in random inputs
Tuning the Classifier
• After pre-training of the layers
– Add output layer
– Train the whole network using
supervised learning (Back propagation)
Deep neural network
• Feed forward NN
• Stacked Autoencoders (multilayer neural net
with target output = input)
• Stacked restricted Boltzmann machine
• Convolutional Neural Network
A Deep Architecture: Multi-Layer Perceptron
[Figure: output layer y (predicting a supervised target) sits on top of hidden layers
 h3, h2, h1 — which learn increasingly abstract representations — over the input layer x
 (raw sensory inputs)]
A Neural Network
• Training: back-propagation of error
  – Calculate the total error at the top
  – Calculate the contributions to the error at each step going backwards
  – The weights are modified as the error is propagated

[Figure: input layer → hidden layer → output layer]
Training Deep Networks
• Difficulties of supervised training of deep networks
1. Early layers of MLP do not get trained well
• Diffusion of Gradient – error attenuates as it propagates to
earlier layers
• Leads to very slow training
• the error to earlier layers drops quickly as the top layers
"mostly" solve the task
2. Often not enough labeled data available while there may be
lots of unlabeled data
3. Deep networks tend to have more local minima problems
than shallow networks during supervised training

10
Training of neural networks
• Forward propagation:
  – Sum inputs, produce activation
  – feed-forward

[Figure: input layer → hidden layer → output layer, with example activation functions]
Activation Functions
Non-linearity
• tanh(x) = (eˣ − e⁻ˣ) / (eˣ + e⁻ˣ)
• sigmoid(x) = 1 / (1 + e⁻ˣ)
• Rectified linear: relu(x) = max(0, x)
  – Simplifies backprop
  – Makes learning faster
  – Makes features sparse
  → Preferred option
Autoencoder
Unlabeled training examples {x⁽¹⁾, x⁽²⁾, x⁽³⁾, …}, with x⁽ⁱ⁾ ∈ ℝⁿ.
Set the target values to be equal to the inputs: y⁽ⁱ⁾ = x⁽ⁱ⁾.
The network is trained to output the input (learn the identity function):
    h_{w,b}(x) ≈ x
The solution may be trivial!
Autoencoders and sparsity
1. Place constraints on the
network, like limiting the
number of hidden units, to
discover interesting structure
about the data.
2. Impose sparsity constraint.
a neuron is “active” if its output
value is close to 1
It is “inactive” if its output value is
close to 0.
constrain the neurons to be inactive
most of the time.
Auto-Encoders

15
Stacked Auto-Encoders
• Do supervised training on the last layer using final
features
• Then do supervised training on the entire network
to fine- tune all weights

    yi = e^{zi} / Σj e^{zj}      (softmax output layer)
Convolutional Neural netwoks
• A CNN consists of a number of convolutional and
subsampling layers.
• Input to a convolutional layer is a m x m x r image
where m x m is the height and width of the image
and r is the number of channels, e.g. an RGB image
has r=3
• Convolutional layer will have k filters (or kernels)
• size n x n x q
• n is smaller than the dimension of the image and,
• q can either be the same as the number of
channels r or smaller and may vary for each kernel
Convolutional Neural netwoks

Convolutional layers consist of a rectangular grid of neurons


Each neuron takes inputs from a rectangular section of the previous layer
the weights for this rectangular section are the same for each neuron in the
convolutional layer.
Pooling: Using features obtained after
Convolution for Classification
The pooling layer takes small rectangular
blocks from the convolutional layer and
subsamples it to produce a single output
from that block : max, average, etc.
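A tiny NumPy illustration of 2×2 max pooling on a feature map (the values are illustrative):

    import numpy as np

    fmap = np.array([[1, 3, 2, 0],
                     [4, 6, 1, 2],
                     [7, 2, 9, 4],
                     [1, 0, 3, 5]])

    # Split the 4x4 map into non-overlapping 2x2 blocks and take the max of each block
    pooled = fmap.reshape(2, 2, 2, 2).max(axis=(1, 3))
    print(pooled)   # [[6 2]
                    #  [7 9]]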
CNN properties
• CNN takes advantage of the sub-structure of
the input
• Achieved with local connections and tied
weights followed by some form of pooling
which results in translation invariant features.

• CNN are easier to train and have many fewer


parameters than fully connected networks
with the same number of hidden units.
Recurrent Neural Network (RNN)
Thank You
Foundations of Machine Learning
Module 7: Computational
Learning Theory
Part A: Finite Hypothesis Space
Sudeshna Sarkar
IIT Kharagpur
Goal of Learning Theory
• To understand
– What kinds of tasks are learnable?
– What kind of data is required for learnability?
– What are the (space, time) requirements of the learning
algorithm.?
• To develop and analyze models
– Develop algorithms that provably meet desired criteria
– Prove guarantees for successful algorithms

2
Goal of Learning Theory
• Two core aspects of ML
– Algorithm Design. How to optimize?
– Confidence for rule effectiveness on future data.
• We need particular settings (models)
  – Probably Approximately Correct (PAC):
      Pr[ P(c ⊕ h) ≤ ε ] ≥ 1 − δ

[Figure: concept c and hypothesis h as regions of the instance space; c ⊕ h is the error
 region]
Prototypical Concept Learning Task
• Given
  – Instances X (e.g., X = ℝᵈ or X = {0,1}ᵈ)
  – Distribution 𝒟 over X
  – Target function c
  – Hypothesis space ℋ
  – Training examples S = {(xi, c(xi))}, with xi i.i.d. from 𝒟
• Determine
  – A hypothesis h ∈ ℋ s.t. h(x) = c(x) for all x in S?
  – A hypothesis h ∈ ℋ s.t. h(x) = c(x) for all x in X?
• An algorithm does optimization over S and finds a hypothesis h.
• Goal: find h which has small error over 𝒟.

[Figure: instance space X with target concept c and hypothesis h as overlapping regions]
Computational Learning Theory
• Can we be certain about how the learning algorithm
generalizes?
• We would have to see all the examples.

• Inductive inference – generalizing beyond the training data is impossible unless we add
  more assumptions (e.g., priors over H).
  We need a bias!
Function Approximation
• How many labeled examples are needed in order to determine which of the 2^(2^N)
  hypotheses is the correct one?
• All 2^N instances in X must be labeled!
• Inductive inference: generalizing beyond the training data is impossible unless we add
  more assumptions (e.g., bias)

    H = { h : X → Y },   |H| = 2^|X| = 2^(2^N)

[Figure: instance space X with target c and two hypotheses h1, h2 consistent with the
 labeled points]
Error of a hypothesis
The true error of hypothesis h, with respect to the target
concept c and observation distribution 𝒟 is the probability that h
will misclassify an instance drawn according to 𝒟
    error_𝒟(h) ≡ Pr_{x~𝒟}[ c(x) ≠ h(x) ]
In a perfect world, we’d like the true error to be 0.

Bias: Fix hypothesis space H


c may not be in H => Find h close to c
A hypothesis h is approximately correct if
𝑒𝑟𝑟𝑜𝑟𝒟 ℎ ≤ 𝜀
PAC model
• Goal: h has small error over D.
• True error:  error_D(h) = Pr_{x~D}[ h(x) ≠ c*(x) ]
  How often h(x) ≠ c*(x) over future instances drawn at random from D
• But, we can only measure:
  Training error:  error_S(h) = (1/m) Σi I( h(xi) ≠ c*(xi) )
  How often h(x) ≠ c*(x) over the training instances
• Sample complexity: bound error_D(h) in terms of error_S(h)
Probably Approximately Correct Learning
• PAC learning concerns efficient learning.
• We would like to prove that with high probability an (efficient) learning algorithm will find a hypothesis that is approximately identical to the hidden target concept.
• We specify two parameters, ε and δ, and require that with probability at least (1 − δ) the system learns a concept with error at most ε.
Sample Complexity for Supervised Learning
Theorem
m ≥ (1/ε) [ ln|H| + ln(1/δ) ]
labeled examples are sufficient so that with probability 1 − δ, all h ∈ H with error_D(h) ≥ ε have error_S(h) > 0.
• inversely linear in 𝜖
• logarithmic in |H|
• 𝜖 error parameter: D might place low weight on certain parts of the
space
• 𝛿 confidence parameter: there is a small chance the examples we
get are not representative of the distribution
Sample Complexity for Supervised
Learning
Theorem: m ≥ (1/ε) [ ln|H| + ln(1/δ) ] labeled examples are sufficient so that with probability 1 − δ, all h ∈ H with error_D(h) ≥ ε have error_S(h) > 0.
Proof: Assume k bad hypotheses H_bad = {h_1, h_2, …, h_k} with err_D(h_i) ≥ ε.
• Fix h_i. The probability that h_i is consistent with the first training example is ≤ 1 − ε. The probability that h_i is consistent with the first m training examples is ≤ (1 − ε)^m.
• The probability that at least one h_i is consistent with the first m training examples is ≤ k(1 − ε)^m ≤ |H|(1 − ε)^m.
• Calculate the value of m so that |H|(1 − ε)^m ≤ δ.
• Using the fact that 1 − x ≤ e^{−x}, it is sufficient to set |H| e^{−εm} ≤ δ.
Sample Complexity: Finite Hypothesis
Spaces Realizable Case
PAC: How many examples suffice to guarantee small error whp.
Theorem
m ≥ (1/ε) [ ln|H| + ln(1/δ) ]
labeled examples are sufficient so that with probability 1 − δ, all h ∈ H with err_D(h) ≥ ε have err_S(h) > 0.
Statistical Learning Way:
With probability at least 1 − δ, all h ∈ H s.t. err_S(h) = 0 have
err_D(h) ≤ (1/m) [ ln|H| + ln(1/δ) ]
Derivation of the bound on m:
P(consist(H_bad, D)) ≤ |H| e^{−εm} ≤ δ
e^{−εm} ≤ δ / |H|
−εm ≤ ln( δ / |H| )
m ≥ −(1/ε) ln( δ / |H| )    (flip inequality)
m ≥ (1/ε) ln( |H| / δ )
m ≥ (1/ε) [ ln(1/δ) + ln|H| ]
Sample complexity: inconsistent finite |ℋ|
• For a single hypothesis to have misleading training error
Pr[ error_𝒟(f) ≥ error_S(f) + ε ] ≤ e^{−2mε²}
• We want to ensure that the best hypothesis has error bounded in this way
– So consider that any one of them could have a large error:
Pr[ (∃f ∈ ℋ) error_𝒟(f) ≥ error_S(f) + ε ] ≤ |ℋ| e^{−2mε²}
• From this we can derive the bound for the number of samples needed:
m ≥ (1 / 2ε²) ( ln|ℋ| + ln(1/δ) )
Sample Complexity: Finite Hypothesis Spaces
Consistent Case
Theorem
m ≥ (1/ε) [ ln|H| + ln(1/δ) ]
labeled examples are sufficient so that with probability 1 − δ, all h ∈ H with err_D(h) ≥ ε have err_S(h) > 0.
Inconsistent Case
What if there is no perfect h?
Theorem: After m examples, with probability ≥ 1 − δ, all h ∈ H have |err_D(h) − err_S(h)| < ε, for
m ≥ (1 / 2ε²) [ ln|H| + ln(2/δ) ]
Sample complexity: example
• 𝒞 : Conjunction of n Boolean literals. Is 𝒞 PAC-learnable?
|ℋ| = 3^n
m ≥ (1/ε) ( n ln 3 + ln(1/δ) )
• Concrete examples:
– δ=ε=0.05, n=10 gives 280 examples
– δ=0.01, ε=0.05, n=10 gives 312 examples
– δ=ε=0.01, n=10 gives 1,560 examples
– δ=ε=0.01, n=50 gives 5,954 examples
• Result holds for any consistent learner, such as FindS.
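The bound above is easy to evaluate directly; a small Python sketch that reproduces the numbers listed (the helper name pac_bound is ours, not from the slides):

import math

def pac_bound(ln_H, eps, delta):
    # m >= (1/eps)(ln|H| + ln(1/delta)); ln|H| is passed directly to avoid huge numbers
    return math.ceil((ln_H + math.log(1.0 / delta)) / eps)

# Conjunctions of n Boolean literals: |H| = 3^n, so ln|H| = n ln 3
for n, eps, delta in [(10, 0.05, 0.05), (10, 0.05, 0.01), (10, 0.01, 0.01), (50, 0.01, 0.01)]:
    print(n, eps, delta, pac_bound(n * math.log(3), eps, delta))
# prints 280, 312, 1560 and 5954, matching the examples above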
Sample Complexity of Learning
Arbitrary Boolean Functions
• Consider any Boolean function over n Boolean features, such as the hypothesis space of DNF formulas or decision trees. There are 2^(2^n) of these, so a sufficient number of examples to learn a PAC concept is:
m ≥ (1/ε) ( ln 2^(2^n) + ln(1/δ) ) = (1/ε) ( 2^n ln 2 + ln(1/δ) )
• δ=ε=0.05, n=10 gives 14,256 examples
• δ=ε=0.05, n=20 gives 14,536,410 examples
• δ=ε=0.05, n=50 gives 1.561×10^16 examples
Thank You
Concept Learning Task
“Days in which Aldo enjoys swimming”
Example Sky AirTemp Humidity Wind Water Forecast EnjoySport
1 Sunny Warm Normal Strong Warm Same Yes
2 Sunny Warm High Strong Warm Same Yes
3 Rainy Cold High Strong Warm Change No
4 Sunny Warm High Strong Cool Change Yes

• Hypothesis representation: conjunction of constraints on the 6 instance attributes
• “?” : any value is acceptable
• a single required value specified for the attribute
• “∅” : no value is acceptable
Concept Learning
h = (?, Cold, High, ?, ?, ?)
indicates that Aldo enjoys his favorite sport on cold days with high humidity.
Most general hypothesis: (?, ?, ?, ?, ?, ?)
Most specific hypothesis: (∅, ∅, ∅, ∅, ∅, ∅)
Find-S Algorithm
1. Initialize h to the most specific hypothesis in ℋ
2. For each positive training instance x
For each attribute constraint ai in h
IF the constraint ai in h is satisfied by x
THEN do nothing
ELSE replace ai in h by next more general
constraint satisfied by x
3. Output hypothesis h
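A minimal Python sketch of Find-S for this attribute-vector representation (using None for the null constraint ∅ and '?' for “any value” is an implementation choice, not prescribed by the slides):

def find_s(examples):
    # examples: list of (attribute_tuple, label) pairs; label True = positive instance
    n = len(examples[0][0])
    h = [None] * n                      # start from the most specific hypothesis
    for x, positive in examples:
        if not positive:
            continue                    # Find-S ignores negative examples
        for i, a in enumerate(x):
            if h[i] is None:
                h[i] = a                # first positive example: copy its values
            elif h[i] != a:
                h[i] = '?'              # generalize: any value is acceptable
    return tuple(h)

data = [
    (('Sunny', 'Warm', 'Normal', 'Strong', 'Warm', 'Same'), True),
    (('Sunny', 'Warm', 'High', 'Strong', 'Warm', 'Same'), True),
    (('Rainy', 'Cold', 'High', 'Strong', 'Warm', 'Change'), False),
    (('Sunny', 'Warm', 'High', 'Strong', 'Cool', 'Change'), True),
]
print(find_s(data))   # ('Sunny', 'Warm', '?', 'Strong', '?', '?')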
Concept Learning
Example Sky AirTemp Humidity Wind Water Forecast EnjoySport
1 Sunny Warm Normal Strong Warm Same Yes
2 Sunny Warm High Strong Warm Same Yes
3 Rainy Cold High Strong Warm Change No
4 Sunny Warm High Strong Cool Change Yes

Finding a Maximally Specific Hypothesis with the Find-S Algorithm
h1 ← (∅, ∅, ∅, ∅, ∅, ∅)
h2 ← (Sunny, Warm, Normal, Strong, Warm, Same)
h3 ← (Sunny, Warm, ?, Strong, Warm, Same)
h4 ← (Sunny, Warm, ?, Strong, ?, ?)
Thank You
Foundations of Machine Learning
Module 7: Computational
Learning Theory
Part A
Sudeshna Sarkar
IIT Kharagpur
Sample Complexity: Infinite
Hypothesis Spaces
• Need some measure of the expressiveness of infinite
hypothesis spaces.
• The Vapnik-Chervonenkis (VC) dimension provides
such a measure, denoted VC(H).
• Analogous to ln|H|, there are bounds for sample complexity using VC(H).
Shattering
• Consider a hypothesis space ℋ for the 2-class problem.
• A set of N points (instances) can be labeled as + or − in 2^N ways.
• If for every such labeling a function can be found in ℋ consistent with this labeling, we say that the set of instances is shattered by ℋ.
Three points in R2
• It is enough to find one set of three points that can be
shattered.
• It is not necessary to be able to shatter every possible set of
three points in 2 dimensions
Shattering Instances
• Consider 2 instances described using a single real-valued feature being shattered by a single interval.
[Figure: two points x and y on the real line; every +/− labeling of {x, y} can be picked out by one interval.]
Shattering Instances (cont)
But 3 instances cannot be shattered by a single interval.
Of the 2³ = 8 possible +/− labelings of three points x < y < z on the line, a single interval can realize all of them except the labeling that makes x and z positive and y negative; hence the three points cannot be shattered.
VC Dimension
• The Vapnik-Chervonenkis dimension, VC(H), of hypothesis space H defined over instance space X is the size of the largest finite subset of X shattered by H. If arbitrarily large finite subsets of X can be shattered then VC(H) = ∞.
• If there exists at least one subset of X of size d that can be shattered then VC(H) ≥ d.
• If no subset of size d can be shattered, then VC(H) < d.
• For single intervals on the real line, all sets of 2 instances can be shattered, but no set of 3 instances can, so VC(H) = 2.
VC Dimension
• An unbiased hypothesis space shatters the entire instance
space.
• The larger the subset of X that can be shattered, the more
expressive (and less biased) the hypothesis space is.
• The VC dimension of the set of oriented lines in 2-d is three.
• Since there are 2^m partitions of m instances, in order for H to shatter m instances we need |H| ≥ 2^m.
• Since shattering m instances requires |H| ≥ 2^m, it follows that VC(H) ≤ log₂|H|.
VC Dimension Example
Consider axis-parallel rectangles in the real-plane,
i.e. conjunctions of intervals on two real-valued
features. Some 4 instances can be shattered.
Some 4 instances cannot be shattered.
VC Dimension Example (cont)
• No set of five instances can be shattered, since among any five points there are at most 4 distinct extreme points (the min and max on each of the 2 dimensions), and any rectangle that includes these 4 extreme points necessarily also includes the 5th point.
• Therefore VC(H) = 4.
• This generalizes to axis-parallel hyper-rectangles (conjunctions of intervals in n dimensions): VC(H) = 2n.
Upper Bound on Sample Complexity with VC
• Using VC dimension as a measure of expressiveness, the following number of examples has been shown to be sufficient for PAC learning (Blumer et al., 1989):
m ≥ (1/ε) ( 4 log₂(2/δ) + 8 VC(H) log₂(13/ε) )
• Compared to the previous result using ln|H|, this bound has some extra constants and an extra log₂(1/ε) factor. Since VC(H) ≤ log₂|H|, this can provide a tighter upper bound on the number of examples needed for PAC learning.
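The bound above can be evaluated directly; a small sketch (the example values of VC(H), ε and δ are arbitrary illustrations):

import math

def vc_sample_bound(vc_dim, eps, delta):
    # Blumer et al. (1989): (1/eps)(4 log2(2/delta) + 8 VC(H) log2(13/eps))
    return math.ceil((4 * math.log2(2 / delta) + 8 * vc_dim * math.log2(13 / eps)) / eps)

# e.g. axis-parallel rectangles in the plane have VC(H) = 4
print(vc_sample_bound(vc_dim=4, eps=0.05, delta=0.05))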
Sample Complexity Lower Bound with VC
• There is also a general lower bound on the minimum number of
examples necessary for PAC learning (Ehrenfeucht, et al., 1989):
Consider any concept class C such that VC(C) ≥ 2, any learner L, and any 0 < ε < 1/8, 0 < δ < 1/100.
Then there exists a distribution D and a target concept in C such that if L observes fewer than
max[ (1/ε) log₂(1/δ), (VC(C) − 1) / (32ε) ]
examples, then with probability at least δ, L outputs a hypothesis having error greater than ε.
• Ignoring constant factors, this lower bound is the same as the upper bound except for the extra log₂(1/ε) factor in the upper bound.
Thank You
Foundations of Machine Learning
Module 8: Ensemble Learning
Part A
Sudeshna Sarkar
IIT Kharagpur
What is Ensemble Classification?
• Use multiple learning algorithms (classifiers)
• Combine the decisions
• Can be more accurate than the individual classifiers
• Generate a group of base-learners
• Different learners use different
– Algorithms
– Hyperparameters
– Representations (Modalities)
– Training sets
Why should it work?
• Works well only if the individual classifiers
disagree
– Error rate < 0.5 and errors are independent
– Error rate is highly correlated with the correlations
of the errors made by the different learners
Bias vs. Variance
• We would like low bias error and low variance error
• Ensembles using multiple trained (high variance/low
bias) models can average out the variance, leaving
just the bias
– Less worry about overfit (stopping criteria, etc.)
with the base models
Combining Weak Learners
• Combining weak learners
– Assume n independent models, each having accuracy of
70%.
– If all n give the same class output then you can be confident it is correct with probability 1 − (1 − 0.7)^n.
– Normally not completely independent, but unlikely that all n
would give the same output
• Accuracy better than the base accuracy of the models by using
the majority output.
– If n1 models say class 1 and n2 < n1 models say class 2, then P(class1) = 1 − Binomial(n, n2, 0.7),
where the binomial probability of exactly r successes in n trials is
P(r) = [ n! / ( r! (n − r)! ) ] p^r (1 − p)^(n−r)
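A quick sketch of how majority voting over n independent 70%-accurate classifiers boosts accuracy, using the binomial distribution from scipy.stats (the choice of n values is illustrative):

from scipy.stats import binom

def majority_vote_accuracy(n, p=0.7):
    # P(more than half of n independent classifiers, each correct with prob. p, are correct)
    return binom.sf(n // 2, n, p)      # sf(k) = P(X > k) for X ~ Binomial(n, p)

for n in [1, 5, 11, 21]:
    print(n, round(majority_vote_accuracy(n), 3))
# ensemble accuracy grows with n: roughly 0.70, 0.84, 0.92, 0.97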
Ensemble Creation Approaches
• Get less correlated errors between models
– Injecting randomness
• initial weights (eg, NN), different learning parameters,
different splits (eg, DT) etc.
– Different Training sets
• Bagging, Boosting, different features, etc.
– Forcing differences
• different objective functions
– Different machine learning model
Ensemble Combining Approaches
• Unweighted Voting (e.g. Bagging)
• Weighted voting – based on accuracy (e.g. Boosting),
Expertise, etc.
• Stacking - Learn the combination function
Combine Learners: Voting
• Unweighted voting
• Linear combination (weighted vote)
• weight ∝ accuracy
• weight ∝ 1/variance
y = Σ_{j=1}^{L} w_j d_j,   with w_j ≥ 0 and Σ_{j=1}^{L} w_j = 1
• Bayesian combination:
P(C_i | x) = Σ_{all models M_j} P(C_i | x, M_j) P(M_j)
Fixed Combination Rules
Bayes Optimal Classifier
• The Bayes Optimal Classifier is an ensemble of all the
hypotheses in the hypothesis space.
• On average, no other ensemble can outperform it.
• The vote for each hypothesis
– is proportional to the likelihood that the training dataset would be sampled from a system if that hypothesis were true, and
– is multiplied by the prior probability of that hypothesis.
In symbols (the standard form of the Bayes optimal classifier): y = argmax_{c_j ∈ C} Σ_{h_i ∈ H} P(c_j | h_i) P(T | h_i) P(h_i), where
• y is the predicted class,
• C is the set of all possible classes,
• H is the hypothesis space,
• T is the training data.
The Bayes Optimal Classifier represents a hypothesis
that is not necessarily in H.
But it is the optimal hypothesis in the ensemble space.
Practicality of Bayes Optimal Classifier
• Cannot be practically implemented.
• Most hypothesis spaces are too large
• Many hypotheses output a class or a value, and not
probability
• Estimating the prior probability for each hypothesis is not always possible.
BMA (Bayesian Model Averaging)
• All possible models in the model space are used, weighted by their probability of being the “correct” model.
• Optimal given the correct model space and priors.
Why are Ensembles Successful?
• Bayesian perspective:
PC i | x    PC | x ,M PM 
allmodelsMj
i j j

• If dj are independent
 1  1   1
Var  y   Var   d j   2 Var   d j   2 L  Var d j   Var d j 
1
  
 j L  L  j  L L

Bias does not change, variance decreases by L


• If dependent, error increase with positive correlation
  1 
Vary   2 Var  d j   2  Vard j   2 Cov(di , d j )
1
L  j  L  j j i j 
Challenge for developing Ensemble Models
• The main challenge is to obtain base models which are independent and make independent kinds of errors.
• Independence between two base classifiers can be assessed by measuring the degree of overlap in the training examples they misclassify: |A ∩ B| / |A ∪ B|, where A and B are the sets of examples misclassified by the two classifiers.
Thank You
Foundations of Machine Learning
Module 8: Ensemble Learning
Part B: Bagging and Boosting
Sudeshna Sarkar
IIT Kharagpur
Bagging
• Bagging = “bootstrap aggregation”
– Draw N items from X with replacement
• Works best with base learners that have high variance (are unstable)
– Decision trees and ANNs are unstable
– K-NN is stable
• Use bootstrapping to generate L training sets and
train one base-learner with each (Breiman, 1996)
• Use voting
Bagging
• Sampling with replacement
• Build a classifier on each bootstrap sample
• Each example has probability (1 − 1/n)^n of not being selected in a given bootstrap sample (≈ 0.37 for large n)
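A minimal NumPy sketch of the bootstrap-and-vote idea; fit_base_learner is a hypothetical placeholder for any unstable base learner (e.g. a decision tree trainer) returning an object with a .predict method:

import numpy as np

def bagging_predict(X_train, y_train, X_test, fit_base_learner, L=25, seed=None):
    rng = np.random.default_rng(seed)
    n = len(X_train)
    votes = []
    for _ in range(L):
        idx = rng.integers(0, n, size=n)              # bootstrap: sample n items with replacement
        model = fit_base_learner(X_train[idx], y_train[idx])
        votes.append(model.predict(X_test))
    votes = np.stack(votes).astype(int)               # shape (L, n_test); assumes integer class labels
    # unweighted majority vote over the L predictions for each test point
    return np.array([np.bincount(col).argmax() for col in votes.T])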
Boosting
• An iterative procedure. Adaptively change distribution of
training data.
– Initially, all N records are assigned equal weights
– Weights change at the end of boosting round
• On each iteration t:
– Weight each training example by how incorrectly it was
classified
– Learn a hypothesis: ℎ𝑡
– A strength for this hypothesis: 𝛼𝑡
• Final classifier:
– A linear combination of the votes of the different
classifiers weighted by their strength
• “weak” learners
– P(correct) > 50%, but not necessarily much better
Adaboost
• Boosting can turn a weak algorithm into a strong
learner.
• Input: S={ 𝑥1 , 𝑦1 , … , (𝑥𝑚 , 𝑦𝑚 ) }
• 𝐷𝑡 (𝑖) : weight of i th training example
• Weak learner A
• For 𝑡 = 1,2, … , 𝑇
– Construct 𝐷𝑡 on {𝑥1 , 𝑥2 …}
– Run A on 𝐷𝑡 producing ℎ𝑡 : 𝑋 → {−1,1}
𝜖𝑡 =error of ℎ𝑡 over 𝐷𝑡
Given: (x_1, y_1), …, (x_m, y_m) where x_i ∈ X, y_i ∈ Y = {−1, +1}
Initialize D_1(i) = 1/m.
For t = 1, …, T:
– Train weak learner using distribution D_t.
– Get weak classifier h_t: X → ℝ.
– Choose α_t ∈ ℝ.
– Update:
D_{t+1}(i) = D_t(i) exp( −α_t y_i h_t(x_i) ) / Z_t
where Z_t is a normalization factor:
Z_t = Σ_{i=1}^{m} D_t(i) exp( −α_t y_i h_t(x_i) )
Output the final classifier:
H(x) = sign( Σ_{t=1}^{T} α_t h_t(x) )
Given: (x_1, y_1), …, (x_m, y_m) where x_i ∈ X, y_i ∈ Y = {−1, +1}
Initialize D_1(i) = 1/m.
For t = 1, …, T:
– Train weak learner using distribution D_t.
– Get weak classifier h_t: X → ℝ.
– Choose α_t to minimize training error:
α_t = (1/2) ln( (1 − ε_t) / ε_t ),   where   ε_t = Σ_{i=1}^{m} D_t(i) δ( h_t(x_i) ≠ y_i )
– Update:
D_{t+1}(i) = D_t(i) exp( −α_t y_i h_t(x_i) ) / Z_t
where Z_t is a normalization factor:
Z_t = Σ_{i=1}^{m} D_t(i) exp( −α_t y_i h_t(x_i) )
Output the final classifier:
H(x) = sign( Σ_{t=1}^{T} α_t h_t(x) )
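A compact NumPy sketch of these update rules, using decision stumps as the weak learners (the stump search is an illustrative choice, not part of the slides):

import numpy as np

def stump_fit(X, y, D):
    # weak learner: pick the (feature, threshold, sign) with smallest weighted error
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = np.where(X[:, j] >= thr, sign, -sign)
                err = np.sum(D * (pred != y))
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    return best[1:]

def stump_predict(stump, X):
    j, thr, sign = stump
    return np.where(X[:, j] >= thr, sign, -sign)

def adaboost(X, y, T=10):
    m = len(y)
    D = np.full(m, 1.0 / m)                       # D_1(i) = 1/m
    ensemble = []
    for _ in range(T):
        stump = stump_fit(X, y, D)
        pred = stump_predict(stump, X)
        eps = np.clip(np.sum(D * (pred != y)), 1e-10, 1 - 1e-10)   # weighted error of h_t
        alpha = 0.5 * np.log((1 - eps) / eps)     # alpha_t = (1/2) ln((1 - eps_t)/eps_t)
        D = D * np.exp(-alpha * y * pred)         # up-weight misclassified examples
        D /= D.sum()                              # normalize (divide by Z_t)
        ensemble.append((alpha, stump))
    return ensemble

def adaboost_predict(ensemble, X):
    return np.sign(sum(a * stump_predict(s, X) for a, s in ensemble))   # H(x) = sign(sum_t alpha_t h_t(x))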
Strong weak classifiers
• If each classifier is (at least slightly) better than random: ε_t < 0.5
• It can be shown that AdaBoost will achieve zero training error (exponentially fast):
(1/m) Σ_{i=1}^{m} δ( H(x_i) ≠ y_i ) ≤ Π_t Z_t ≤ exp( −2 Σ_{t=1}^{T} (1/2 − ε_t)² )
Illustrating AdaBoost
[Figure: a toy one-dimensional data set of + and − points, each starting with equal weight 0.1. After every boosting round the misclassified points get larger weights; the weak classifiers of rounds 1, 2 and 3 receive strengths α = 1.9459, 2.9323 and 3.8744, and their weighted combination classifies all points correctly.]
Thank You
Foundations of Machine Learning
Module 9: Clustering
Part A: Introduction and kmeans
Sudeshna Sarkar
IIT Kharagpur
Unsupervised learning
• Unsupervised learning:
– Data with no target attribute. Describe hidden structure from
unlabeled data.
– Explore the data to find some intrinsic structures in them.
• Clustering: the task of grouping a set of objects in such a
way that objects in the same group (called a cluster) are
more similar to each other than to those in other clusters.
• Useful for
– Automatically organizing data.
– Understanding hidden structure in data.
– Preprocessing for further analysis.
Applications: News Clustering (Google)
Gene Expression Clustering
Other Applications
• Biology: classification of plants and animal kingdom
given their features
• Marketing: Customer Segmentation based on a
database of customer data containing their
properties and past buying records
• Clustering weblog data to discover groups of similar
access patterns.
• Recognize communities in social networks.
An illustration
• This data set has four natural clusters.
[Figure: scatter plot of the data showing the four natural clusters.]
Aspects of clustering
• A clustering algorithm, such as
– Partitional clustering, e.g., kmeans
– Hierarchical clustering, e.g., AHC
– Mixture of Gaussians
• A distance or similarity function
– such as Euclidean, Minkowski, cosine
• Clustering quality
– Inter-cluster distance: maximized
– Intra-cluster distance: minimized
The quality of a clustering result depends on the algorithm, the distance function, and the application.
Major Clustering Approaches
• Partitioning: Construct various partitions and then evaluate
them by some criterion
• Hierarchical: Create a hierarchical decomposition of the set of
objects using some criterion
• Model-based: Hypothesize a model for each cluster and find
best fit of models to data
• Density-based: Guided by connectivity and density functions
• Graph-Theoretic Clustering
Partitioning Algorithms
• Partitioning method: Construct a partition of a
database D of m objects into a set of k clusters
• Given a k, find a partition of k clusters that optimizes
the chosen partitioning criterion
– Global optimal: exhaustively enumerate all partitions
– Heuristic method: k-means (MacQueen, 1967)
Hierarchical Clustering
[Figure: taxonomy tree — animal → vertebrate (fish, reptile, amphib., mammal) and invertebrate (worm, insect, crustacean).]
• Produce a nested sequence of clusters.
• One approach: recursive application of a partitional clustering algorithm.
Model Based Clustering
• A model is hypothesized
• e.g., assume the data is
generated by a mixture of
underlying probability
distributions
• Fit the data to model
Density based Clustering
• Based on density
connected points
• Locates regions of high
density separated by
regions of low density
• e.g., DBSCAN
Graph Theoretic Clustering
• Weights of edges
between items (nodes)
based on similarity
• E.g., look for minimum
cut in a graph
(Dis)similarity measures
• Distance metric (scale-dependent)
– Minkowski family of distance measures
d(x_i, x_j) = ( Σ_{s=1}^{m} |x_is − x_js|^p )^{1/p}
Manhattan (p = 1), Euclidean (p = 2)
– Cosine distance
cosine(x_i, x_j) = (x_i · x_j) / ( ‖x_i‖ ‖x_j‖ )
(Dis)similarity measures
• Correlation coefficients (scale-invariant)
• Mahalanobis distance
d(x_i, x_j) = √( (x_i − x_j)ᵀ Σ⁻¹ (x_i − x_j) )
• Pearson correlation
r(x_i, x_j) = Cov(x_i, x_j) / ( σ_{x_i} σ_{x_j} )
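A small NumPy sketch of these measures (the toy data used to estimate the Mahalanobis covariance is purely illustrative):

import numpy as np

def minkowski(a, b, p=2):
    return np.sum(np.abs(a - b) ** p) ** (1.0 / p)     # p=1 Manhattan, p=2 Euclidean

def cosine_similarity(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def mahalanobis(a, b, cov):
    d = a - b
    return np.sqrt(d @ np.linalg.inv(cov) @ d)

X = np.random.default_rng(0).normal(size=(100, 3))     # toy data to estimate a covariance
a, b = X[0], X[1]
print(minkowski(a, b, p=1), minkowski(a, b, p=2))
print(cosine_similarity(a, b))
print(mahalanobis(a, b, np.cov(X, rowvar=False)))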
Quality of Clustering
• Internal evaluation:
– assign the best score to the algorithm that produces clusters with high
similarity within a cluster and low similarity between clusters, e.g.,
Davies-Bouldin index
DB = (1/k) Σ_{i=1}^{k} max_{j≠i} [ (σ_i + σ_j) / d(c_i, c_j) ]
• External evaluation:
– evaluated based on data such as known class labels and external benchmarks, e.g., Rand Index, Jaccard Index, F-measure
RI = (TP + TN) / (TP + FP + FN + TN)
J(A, B) = |A ∩ B| / |A ∪ B| = TP / (TP + FP + FN)
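A quick sketch computing the Rand and Jaccard indices from pair counts (the two label vectors are toy examples):

from itertools import combinations

def pair_counts(labels_true, labels_pred):
    # count pairs of points on which the two labelings agree/disagree
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        if same_true and same_pred:
            tp += 1
        elif not same_true and same_pred:
            fp += 1
        elif same_true and not same_pred:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

tp, fp, fn, tn = pair_counts([0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 1, 1])
print("Rand index:", (tp + tn) / (tp + fp + fn + tn))
print("Jaccard   :", tp / (tp + fp + fn))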
Thank You
Foundations of Machine Learning
Module 9: Clustering
Part B: kmeans clustering
Sudeshna Sarkar
IIT Kharagpur
Partitioning Algorithms
• Given k
• Construct a partition of m objects 𝑋 = {𝒙𝟏 , 𝒙𝟐 , … , 𝒙𝒎 }
where 𝒙𝒊 = (𝑥𝑖1 , 𝑥𝑖2 , … , 𝑥𝑖𝑛 ) is a vector in a real-valued space 𝑋 ⊆ ℝ𝑛 , n is the
number of attributes.
• into a set of k clusters 𝑆 = {𝑆1 , 𝑆2 , … , 𝑆𝑘 }
• The cluster mean 𝜇𝑖 serves as a prototype of the cluster 𝑆𝑖 .
• Find k clusters that optimize a chosen criterion
– E.g., the within-cluster sum of squares (WCSS)
(sum of distance functions of each point in the cluster to the cluster mean)
argmin_S Σ_{i=1}^{k} Σ_{x∈S_i} ‖x − μ_i‖²
Heuristic method: k-means (MacQueen, 1967)
K-means algorithm
Given k
1. Randomly choose k data points (seeds) to be the
initial cluster centres
2. Assign each data point to the closest cluster centre
3. Re-compute the cluster centres using the current
cluster memberships.
4. If a convergence criterion is not met, go to 2.
Stopping/convergence criterion
Stop when one of the following holds:
1. no re-assignments of data points to different clusters, OR
2. no (or minimal) change of centroids, OR
3. minimal decrease in the sum of squared error
SSE = Σ_{i=1}^{k} Σ_{x∈S_i} ‖x − μ_i‖²
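A compact NumPy sketch of this loop (random-point initialization and the convergence test on centroid movement are implementation choices):

import numpy as np

def kmeans(X, k, max_iter=100, tol=1e-6, seed=None):
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=k, replace=False)]     # 1. random seeds
    for _ in range(max_iter):
        # 2. assign each point to the closest cluster centre
        dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. re-compute the cluster centres from current memberships
        new_centres = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centres[j] for j in range(k)])
        # 4. stop when the centres barely move
        if np.linalg.norm(new_centres - centres) < tol:
            break
        centres = new_centres
    return centres, labels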
Kmeans illustrated
Similarity / Distance measures
• Distance metric (scale-dependent)
– Minkowski family of distance measures
d(x_i, x_j) = ( Σ_{s=1}^{n} |x_is − x_js|^p )^{1/p}
Manhattan (p = 1), Euclidean (p = 2)
– Cosine distance
Similarity / Distance measures
• Correlation coefficients (scale-invariant)
• Mahalanobis distance
d(x_i, x_j) = √( (x_i − x_j)ᵀ Σ⁻¹ (x_i − x_j) )
• Pearson correlation
r(x_i, x_j) = Cov(x_i, x_j) / ( σ_{x_i} σ_{x_j} )
Convergence of K-Means
• Recomputation monotonically decreases each square
error since
(m_j is the number of members in cluster j):
Σ_i ‖x_i − a‖² reaches its minimum for:
Σ_i −2(x_i − a) = 0
Σ_i x_i = Σ_i a = m_j a
a = (1/m_j) Σ_i x_i = c_j
• K-means typically converges quickly
Time Complexity
• Computing distance between two items is O(n)
where n is the dimensionality of the vectors.
• Reassigning clusters: O(km) distance
computations, or O(kmn).
• Computing centroids: Each item gets added
once to some centroid: O(mn).
• Assume these two steps are each done once
for t iterations: O(tknm).
Advantages
• Fast, robust, and easy to understand.
• Relatively efficient: O(tkmn)
• Gives best result when data set are distinct or
well separated from each other.
Disadvantages
• Requires apriori specification of the number of
cluster centers.
• Hard assignment of data points to clusters
• Euclidean distance measures can unequally
weight underlying factors.
• Applicable only when mean is defined i.e. fails
for categorical data.
• Only local optima
K-Means on an RGB image
Each pixel x_i = {r_i, g_i, b_i} is passed to the classifier (k-means), which outputs the classification result x_i → C(x_i), using the cluster parameters θ_1 for C_1, θ_2 for C_2, …, θ_k for C_k.
BCS Summer School, example fromChristopher
Bishop’s Book
M.
Exeter, 2003 Bishop
BCS Summer School, example fromChristopher
Bishop’s Book
M.
Exeter, 2003 Bishop
BCS Summer School, example fromChristopher
Bishop’s Book
M.
Exeter, 2003 Bishop
BCS Summer School, example fromChristopher
Bishop’s Book
M.
Exeter, 2003 Bishop
BCS Summer School, example fromChristopher
Bishop’s Book
M.
Exeter, 2003 Bishop
BCS Summer School, Christopher M.
Exeter, 2003 Bishop
BCS Summer School, example fromChristopher
Bishop’s Book
M.
Exeter, 2003 Bishop
BCS Summer School, example fromChristopher
Bishop’s Book
M.
Exeter, 2003 Bishop
BCS Summer School, example fromChristopher
Bishop’s Book
M.
Exeter, 2003 Bishop
Model-based clustering
• Assume 𝑘 probability distributions with
parameters 𝜃1 , 𝜃2 , … , 𝜃𝑘
• Given data 𝑋, compute 𝜃1 , 𝜃2 , … , 𝜃𝑘 such that
𝑃𝑟(𝑋|𝜃1 , 𝜃2 , … , 𝜃𝑘 ) [likelihood] or
ln 𝑃𝑟(𝑋|𝜃1 , 𝜃2 , … , 𝜃𝑘 ) [log likelihood]
is maximized.
• Every point 𝑥𝜖𝑋 may be generated by multiple
distributions with some probability
EM Algorithm
• Initialize the parameters θ_1, θ_2, …, θ_k randomly
• Let each parameter correspond to a cluster center (mean)
• Iterate between two steps
– Expectation step: (probabilistically) assign points to clusters
– Maximization step: estimate model parameters that maximize the likelihood for the given assignment of points
EM Algorithm
Expectation step: (probabilistically) assign points to clusters
compute Prob(point | mean)
Prob(mean | point) = Prob(mean) Prob(point | mean) / Prob(point)
Maximization step: estimate model parameters that maximize the likelihood for the given assignment of points
Each mean = weighted average of the points
Weight = Prob(mean | point)
EM Algorithm
• Initialize 𝑘 cluster centers
• Iterate between two steps
– Expectation step: assign points to clusters
Pr(x_i ∈ C_k) = Pr(x_i | C_k) / Σ_j Pr(x_i | C_j)
w_k = Σ_i Pr(x_i ∈ C_k) / n
– Maximization step: estimate model parameters
r_k = (1/n) Σ_{i=1}^{n} [ Pr(x_i ∈ C_k) / Σ_j Pr(x_i ∈ C_j) ]
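A short NumPy sketch of this E/M alternation for a 1-D mixture of Gaussians (equal mixture weights and unit variances are simplifying assumptions to keep the sketch small):

import numpy as np
from scipy.stats import norm

def em_1d_gaussians(x, k=2, iters=50, seed=None):
    rng = np.random.default_rng(seed)
    means = rng.choice(x, size=k, replace=False)          # initialize k cluster centers
    for _ in range(iters):
        # E-step: responsibility of each cluster for each point, Pr(x_i in C_k)
        dens = np.stack([norm.pdf(x, loc=m, scale=1.0) for m in means])
        resp = dens / dens.sum(axis=0, keepdims=True)
        # M-step: each mean becomes the responsibility-weighted average of the points
        means = (resp * x).sum(axis=1) / resp.sum(axis=1)
    return means

x = np.concatenate([np.random.default_rng(0).normal(0, 1, 100),
                    np.random.default_rng(1).normal(6, 1, 100)])
print(em_1d_gaussians(x, k=2, seed=2))   # approximately the two true means, 0 and 6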
K-means Algorithm
• Goal: represent a data set in terms of K
clusters each of which is summarized by a
prototype 𝜇𝑘
• Initialize prototypes, then iterate between two
phases:
– E-step: assign each data point to nearest prototype
– M-step: update prototypes to be the cluster
means
Thank You
Foundations of Machine Learning
Module 9: Clustering
Part C: Hierarchical Clustering
Sudeshna Sarkar
IIT Kharagpur
Hierarchical Clustering
[Figure: taxonomy tree — animal → vertebrate (fish, reptile, amphib., mammal) and invertebrate (worm, insect, crustacean).]
• Produce a nested sequence of clusters.
• One approach: recursive application of a partitional clustering algorithm.
Types of hierarchical clustering
• Agglomerative (bottom up) clustering: It builds the
dendrogram (tree) from the bottom level, and
– merges the most similar (or nearest) pair of clusters
– stops when all the data points are merged into a single cluster
(i.e., the root cluster).
• Divisive (top down) clustering: It starts with all data
points in one cluster, the root.
– Splits the root into a set of child clusters. Each child cluster is
recursively divided further
– stops when only singleton clusters of individual data points
remain, i.e., each cluster with only a single point
Dendrogram: Hierarchical Clustering
[Figure: dendrogram over points 1–6, with merge heights between 0 and 0.2.]
– Given an input set S
– nodes represent subsets of S
– Features of the tree:
– The root is the whole input set S.
– The leaves are the individual elements of S.
– The internal nodes are defined as the union of their children.
Dendrogram: Hierarchical Clustering
– Each level of the tree represents a partition of the input data into several (nested) clusters or groups.
– May be cut at any level: each connected component forms a cluster.
Hierarchical clustering
Hierarchical Agglomerative Clustering
• Initially each data point forms a cluster.
• Compute the distance matrix between the
clusters.
• Repeat
– Merge the two closest clusters
– Update the distance matrix
• Until only a single cluster remains.
Different definitions of the distance lead to different algorithms.
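A brief sketch using SciPy's agglomerative clustering routines (scipy.cluster.hierarchy.linkage and fcluster); the toy data and the choice of single linkage are illustrative:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.vstack([np.random.default_rng(0).normal(0, 0.5, (20, 2)),
               np.random.default_rng(1).normal(4, 0.5, (20, 2))])

Z = linkage(X, method='single')                    # 'complete' or 'average' give the other variants
labels = fcluster(Z, t=2, criterion='maxclust')    # cut the dendrogram into 2 clusters
print(labels)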
Initialization
• Each individual point is taken as a cluster
• Construct distance/proximity matrix
[Figure: initial distance/proximity matrix over the points p1, p2, p3, …, with the corresponding dendrogram to be built.]
Intermediate State
• After some merging steps, we have some clusters
[Figure: distance/proximity matrix over the current clusters C1–C5.]
Intermediate State
Merge the two closest clusters (C2 and C5) and update the
distance matrix.
[Figure: the proximity matrix before merging, with the C2–C5 entry being the smallest inter-cluster distance.]
After Merging
• Update the distance matrix
[Figure: updated distance matrix in which C2 and C5 are replaced by the merged cluster C2 ∪ C5; its distances to C1, C3 and C4 must be recomputed.]
Closest Pair
• A few ways to measure distances of two clusters.
• Single-link
– Similarity of the most similar (single-link)
• Complete-link
– Similarity of the least similar points
• Centroid
– Clusters whose centroids (centers of gravity) are the
most similar
• Average-link
– Average cosine between pairs of elements
Distance between two clusters
• Single-link distance between clusters Ci and Cj
is the minimum distance between any object
in Ci and any object in Cj
sim(C_i, C_j) = max_{x∈C_i, y∈C_j} sim(x, y)
Single Link Example
It can result in “straggly” (long and thin) clusters due to the chaining effect.
Single-link clustering: example
• Determined by one pair of points, i.e., by one
link in the proximity graph.
Similarity matrix:
     I1   I2   I3   I4   I5
I1  1.00 0.90 0.10 0.65 0.20
I2  0.90 1.00 0.70 0.60 0.50
I3  0.10 0.70 1.00 0.40 0.30
I4  0.65 0.60 0.40 1.00 0.80
I5  0.20 0.50 0.30 0.80 1.00
[Figure: the corresponding single-link dendrogram over items 1–5.]
Complete link method
• The distance between two clusters is the distance of
two furthest data points in the two clusters.
sim(c_i, c_j) = min_{x∈c_i, y∈c_j} sim(x, y)
• Makes “tighter,” more spherical clusters that are typically preferable.
• It is sensitive to outliers because they are far away.
Complete-link clustering: example
• Distance between clusters is determined by
the two most distant points in the different
clusters
Similarity matrix:
     I1   I2   I3   I4   I5
I1  1.00 0.90 0.10 0.65 0.20
I2  0.90 1.00 0.70 0.60 0.50
I3  0.10 0.70 1.00 0.40 0.30
I4  0.65 0.60 0.40 1.00 0.80
I5  0.20 0.50 0.30 0.80 1.00
[Figure: the corresponding complete-link dendrogram over items 1–5.]
Complete Link Example
Computational Complexity
• In the first iteration, all HAC methods need to compute the similarity of all pairs of N initial instances, which is O(N²).
• In each of the subsequent N−2 merging iterations, compute the distance between the most recently created cluster and all other existing clusters.
• In order to maintain an overall O(N²) performance, computing similarity to each other cluster must be done in constant time.
– Often O(N³) if done naively, or O(N² log N) if done more cleverly.
Average Link Clustering
• Similarity of two clusters = average similarity between
any object in Ci and any object in Cj
sim(c_i, c_j) = [ 1 / ( |C_i| · |C_j| ) ] Σ_{x∈C_i} Σ_{y∈C_j} sim(x, y)
• Compromise between single and complete link. Less
susceptible to noise and outliers.
• Two options:
– Averaged across all ordered pairs in the merged
cluster
– Averaged over all pairs between the two original
clusters
The complexity
• All the algorithms are at least O(n²), where n is the number of data points.
• Single link can be done in O(n²).
• Complete and average link can be done in O(n² log n).
• Due to the complexity, these methods are hard to use for large data sets.
Model-based clustering
• Assume data generated from 𝑘 probability
distributions
• Goal: find the distribution parameters
• Algorithm: Expectation Maximization (EM)
• Output: Distribution parameters and a soft
assignment of points to clusters
Thank You