Machine Learning: Linear Models For Classification
Marcello Restelli
Outline
1 Linear Classification
2 Discriminant Functions
  Least Squares
  The Perceptron Algorithm
Classification Problems
Linear Classification
The model is a linear function of the parameters, transformed by a fixed (possibly nonlinear) activation function f:
y(x, w) = f(x^T w + w_0)
Notation
In two-class problems, we have a binary target value t ∈ {0, 1}, where t = 1 denotes the positive class and t = 0 the negative class
We can interpret the value of t as the probability of the positive class
The output of the model can then be interpreted as the probability that the model assigns to the positive class
If there are K classes, we can use a 1-of-K encoding scheme:
t is a vector of length K containing a single 1 for the correct class and 0 elsewhere
Example: if K = 5, then an input that belongs to class 2 corresponds to the target vector t = (0, 1, 0, 0, 0)^T
We can interpret t as a vector of class probabilities
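As a concrete sketch of the encoding (Python with NumPy; the helper name one_hot is ours, not from the slides):

    import numpy as np

    def one_hot(k, K):
        """Return the 1-of-K target vector for class index k (0-based)."""
        t = np.zeros(K)
        t[k] = 1.0
        return t

    one_hot(1, 5)  # array([0., 1., 0., 0., 0.]) -- the class-2 example with K = 5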
By Bayes' rule, the posterior class probabilities are
P(C_k | x) = p(x | C_k) p(C_k) / p(x)
Two classes
y(x) = x^T w + w_0
Assign x to class C_1 if y(x) ≥ 0, and to class C_2 otherwise
Decision boundary: y(x) = 0
Given two points x_A and x_B on the decision surface:
y(x_A) = y(x_B) = 0  ⇒  w^T (x_A − x_B) = 0
Hence w is orthogonal to every vector lying on the decision surface: w determines the orientation of the boundary, and w_0 its offset from the origin
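A minimal sketch of this two-class decision rule (the weights here are made up for illustration):

    import numpy as np

    w, w0 = np.array([1.0, 1.0]), -1.0   # hypothetical weights: boundary x1 + x2 - 1 = 0

    def classify(x):
        """Assign x to C1 if y(x) = x^T w + w0 >= 0, else to C2."""
        return "C1" if x @ w + w0 >= 0 else "C2"

    classify(np.array([2.0, 0.5]))  # 'C1' (y = 1.5)
    classify(np.array([0.0, 0.0]))  # 'C2' (y = -1.0)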
Multiple classes
Combine the K linear discriminants into y(x) = W̃^T x̃, where
W̃ is a (D + 1) × K matrix whose k-th column is w̃_k = (w_k0, w_k^T)^T
x̃ = (1, x^T)^T
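As a sketch, the multi-class decision rule assigns x to the class whose column of W̃ gives the largest activation (the random weights are placeholders):

    import numpy as np

    rng = np.random.default_rng(0)
    D, K = 3, 4
    W_tilde = rng.normal(size=(D + 1, K))        # k-th column is w~_k = (w_k0, w_k^T)^T

    def predict(x):
        """Assign x to the class k maximizing y_k(x)."""
        x_tilde = np.concatenate(([1.0], x))     # x~ = (1, x^T)^T
        y = W_tilde.T @ x_tilde                  # y(x) = W~^T x~
        return int(np.argmax(y))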
The reason for this failure is that least squares corresponds to maximum likelihood under a Gaussian conditional distribution, an assumption that binary target vectors clearly do not satisfy
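For reference, a minimal sketch of the least-squares fit on 1-of-K targets that exhibits this problem (the function name is ours):

    import numpy as np

    def fit_least_squares(X, T):
        """Closed-form fit of W~ to one-hot targets T: minimizes ||X~ W~ - T||^2."""
        X_tilde = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend the bias column
        W_tilde, *_ = np.linalg.lstsq(X_tilde, T, rcond=None)
        return W_tilde                                        # shape (D + 1, K)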
Perceptron
Perceptron Criterion
We are seeking a vector w such that w^T φ(x_n) > 0 when x_n ∈ C_1 and w^T φ(x_n) < 0 otherwise
Using targets t_n ∈ {−1, +1}, the perceptron criterion assigns
  zero error to correctly classified patterns
  −w^T φ(x_n) t_n to misclassified patterns x_n (proportional to their distance from the decision boundary)
The loss function to be minimized is
L_P(w) = − Σ_{n ∈ M} w^T φ(x_n) t_n
where M is the set of misclassified patterns
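In code, the criterion sums the negated margins of the misclassified set M only (a sketch, assuming a precomputed feature matrix Phi whose rows are φ(x_n)^T):

    import numpy as np

    def perceptron_loss(w, Phi, t):
        """L_P(w) = -sum_{n in M} w^T phi(x_n) t_n, with t_n in {-1, +1}."""
        margins = (Phi @ w) * t              # > 0 for correctly classified patterns
        return -margins[margins < 0].sum()   # M = misclassified patterns (margin < 0)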
Algorithm
Input: data set x_n ∈ R^D, t_n ∈ {−1, +1}, for n = 1, ..., N
Initialize w_0
k ← 0
repeat
    k ← k + 1
    n ← k mod N
    if t̂_n ≠ t_n then
        w_{k+1} ← w_k + φ(x_n) t_n
    end if
until convergence
where t̂_n = sign(w_k^T φ(x_n)) is the current prediction for pattern x_n
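The pseudocode above translates directly to a short training loop (a sketch; max_epochs is a safeguard we add, since convergence is only guaranteed for linearly separable data):

    import numpy as np

    def perceptron_train(Phi, t, max_epochs=1000):
        """Cyclic perceptron updates; Phi is N x M with rows phi(x_n)^T, t in {-1,+1}^N."""
        N, M = Phi.shape
        w = np.zeros(M)                        # initialize w_0
        for _ in range(max_epochs):
            converged = True
            for n in range(N):                 # n <- k mod N: sweep the patterns cyclically
                if np.sign(Phi[n] @ w) != t[n]:
                    w = w + Phi[n] * t[n]      # w_{k+1} <- w_k + phi(x_n) t_n
                    converged = False
            if converged:                      # a full sweep with no updates
                break
        return w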
Logistic Regression
For K classes, the posterior probabilities are given by a softmax over linear functions of the features:
y_nk = p(C_k | φ_n) = exp(w_k^T φ_n) / Σ_j exp(w_j^T φ_n)
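A numerically stable sketch of these posteriors (the columns of W are the w_k):

    import numpy as np

    def softmax_posteriors(W, phi):
        """y_k = exp(w_k^T phi) / sum_j exp(w_j^T phi)."""
        a = W.T @ phi          # activations a_k = w_k^T phi
        a -= a.max()           # shift for numerical stability; the ratios are unchanged
        e = np.exp(a)
        return e / e.sum()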
[Figure: the logistic sigmoid and the perceptron's step activation, plotted over w · φ(x) ∈ [−10, 10]]
Logistic regression: y(x, w) = 1 / (1 + e^{−w · φ(x)})
Perceptron: y(x, w) = 1 if w · φ(x) > 0, and 0 otherwise
Both algorithms use the same update rule:
w ← w − α (y(x_n, w) − t_n) φ_n
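A sketch of one stochastic-gradient step with this shared rule, instantiated for logistic regression (here t_n ∈ {0, 1}; the function names are ours):

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def sgd_step(w, phi_n, t_n, alpha=0.1):
        """w <- w - alpha * (y(x_n, w) - t_n) * phi_n."""
        y_n = sigmoid(w @ phi_n)    # logistic-regression output for pattern n
        return w - alpha * (y_n - t_n) * phi_n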