Lecture-03: Support Vector Machines – Separable Case

1 Linear binary classification


Let the input space be X = R^N, where N ⩾ 1 is the number of dimensions, the output space Y = {−1, 1}, and the target function be some mapping c : X → Y. We will also use ∥x∥ to denote the 2-norm of x ∈ X.
Definition 1.1. For a vector w ∈ R^N and scalar b ∈ R, we define a hyperplane as a set of points
$$E_{w,b} \triangleq \left\{ x \in \mathbb{R}^N : \frac{\langle w, x \rangle}{\|w\|} = -\frac{b}{\|w\|} \right\}.$$

Definition 1.2. The distance of a point from a set is defined as d(x, A) ≜ min{d(x, y) : y ∈ A}. For x, y ∈ R^N, the distance is d(x, y) ≜ ∥x − y∥₂.
Lemma 1.3. For a vector w ∈ R^N and b ∈ R, we have d(0, Ew,b) = |b|/∥w∥.

Proof. For a vector w, we define the unit vector u ≜ w/∥w∥. It follows that x0 ≜ −(b/∥w∥)u lies on the hyperplane Ew,b; the point x0 is parallel to the unit vector u and at distance d(0, x0) = |b|/∥w∥ from the origin. Any point x ∈ Ew,b on the hyperplane can be written as a sum of two orthogonal vectors x = x0 + (x − x0), where ⟨x − x0, w⟩ = 0. Therefore, d(0, x)² = d(0, x0)² + d(x0, x)² ⩾ d(0, x0)², and hence d(0, Ew,b) = d(0, x0).
Remark 1. A hyperplane Ew,b = {x ∈ R^N : ⟨w, x⟩ + b = 0} is defined in terms of the unit vector w/∥w∥ and its signed distance −b/∥w∥ from the origin.


Lemma 1.4. The distance of any point x ∈ R^N to a hyperplane Ew,b is given by d(x, Ew,b) = |⟨w, x⟩ + b|/∥w∥.

Proof. Let u = w/∥w∥ be the unit vector in the direction of w. Any point y on the hyperplane Ew,b can be written as a sum of two orthogonal vectors y = x0 + (y − x0), where ⟨y − x0, w⟩ = 0 and x0 = −(b/∥w∥)u. Any point x ∈ R^N can be represented as x = ⟨x, u⟩u + v such that ⟨v, w⟩ = 0. Therefore,
$$d(x, E_{w,b})^2 = \min_{y \in E_{w,b}} d(x, y)^2 = \min_{y \in E_{w,b}} d\big(x_0 + (y - x_0),\, \langle x, u\rangle u + v\big)^2 \geq \left(\frac{\langle x, w \rangle + b}{\|w\|}\right)^2,$$
with equality attained at y = x0 + v ∈ Ew,b, since the component of x − y along u has length |⟨x, u⟩ + b/∥w∥| = |⟨x, w⟩ + b|/∥w∥.

Remark 2. The distance of a point x ∈ R^N from the hyperplane Ew,b is given by d(x, Ew,b). If ⟨w, x⟩ + b > 0, then the point x lies above the hyperplane Ew,b, and if ⟨w, x⟩ + b < 0, then the point x lies below the hyperplane Ew,b.
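As a quick illustration of Lemma 1.4 and Remark 2, here is a minimal sketch in Python with numpy (the lecture itself uses no code; the numeric values below are made up) that computes the signed quantity ⟨w, x⟩ + b and the distance of x to Ew,b.

import numpy as np

def signed_value(w, b, x):
    # <w, x> + b; its sign tells on which side of E_{w,b} the point x lies (Remark 2)
    return np.dot(w, x) + b

def distance_to_hyperplane(w, b, x):
    # d(x, E_{w,b}) = |<w, x> + b| / ||w||  (Lemma 1.4)
    return abs(signed_value(w, b, x)) / np.linalg.norm(w)

# illustrative values, not taken from the lecture
w, b, x = np.array([3.0, 4.0]), -5.0, np.array([2.0, 1.0])
print(signed_value(w, b, x))            # 5.0 > 0, so x lies above E_{w,b}
print(distance_to_hyperplane(w, b, x))  # 5.0 / 5.0 = 1.0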
Assumption 1.5. We are given a training sample z ∈ (X × Y)^m consisting of m labeled training examples zi = (xi, yi) ∈ X × Y, where each example xi is generated i.i.d. from a fixed but unknown distribution D, and the label yi = c(xi) for an unknown concept c : X → Y.
Assumption 1.6. We define the hypothesis set as a collection of separating hyperplanes
H ≜ {x ↦ sign(⟨w, x⟩ + b) : w ∈ R^N, b ∈ R}.

Remark 3. Any hypothesis h ∈ H is identified by the pair (w, b) such that h(x) = sign(⟨w, x⟩ + b) for all x ∈ R^N. A hypothesis h ∈ H labels positively all points falling on one side of the hyperplane Ew,b ≜ {x ∈ R^N : ⟨w, x⟩ + b = 0} and labels negatively all others. This problem is referred to as the linear binary classification problem.
Remark 4. For the Hamming loss function (y, y′) ↦ 1{y ̸= y′}, the generalization error is R(h) = P{h(x) ̸= c(x)}.
The objective is to select an h ∈ H such that the generalization error R(h) is minimized.
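The generalization error R(h) involves the unknown distribution D and concept c, so it cannot be computed directly; only its empirical counterpart on a labeled sample can. A minimal numpy sketch of a hypothesis from H and its empirical 0-1 error (the names and sample-as-rows layout are my own, not the lecture's):

import numpy as np

def h(w, b, X):
    # hypothesis from H: x -> sign(<w, x> + b), applied to each row of X
    return np.sign(X @ w + b)

def empirical_error(w, b, X, y):
    # fraction of labeled examples with h(x_i) != y_i (empirical 0-1 loss)
    return float(np.mean(h(w, b, X) != y))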

2 SVMs — separable case
Support vector machines are among the most theoretically well-motivated and practically effective binary classification algorithms. We first introduce the algorithm for separable datasets, and then present
its general version for non-separable datasets.
Assumption 2.1. Consider a training sample of size m denoted by z ∈ (X × Y)^m and define the disjoint sets T−1 ≜ {i ∈ [m] : yi = −1} and T1 ≜ {i ∈ [m] : yi = 1}. We assume that T1 and T−1 are non-empty and can be separated by a hyperplane. That is, there exists a hyperplane Ew,b such that [m] = T−1 ∪ T1 and

T1 = {i ∈ [m] : ⟨w, xi⟩ + b > 0} ,    T−1 = {i ∈ [m] : ⟨w, xi⟩ + b < 0} .

Remark 5. For such a hyperplane Ew,b , we have yi h( xi ) > 0 for all i ∈ [m].
Remark 6. Let Ew,b be one of infinite such planes. Which hyperplane should a learning algorithm select?
The solution Ew∗ ,b∗ returned by the SVM algorithm is the hyperplane with the maximum margin, or the
distance to the closest points, and is thus known as the maximum-margin hyperplane.

2.1 Primal optimization problem


Assumption 2.1 confirms the existence of at least one pair (w, b) such that ⟨w, xi ⟩ + b ̸= 0 for all i ∈ [m].
For any training sample z and hyperplane Ew,b, we define the minimum d0(z, w, b) ≜ min_{i∈[m]} |⟨w, xi⟩ + b|. In terms of this minimum, we can write the margin, the minimum distance of the sample z to the hyperplane Ew,b, as
$$\rho(w, b) \triangleq \min_{i \in [m]} d(x_i, E_{w,b}) = \min_{i \in [m]} \frac{|\langle w, x_i \rangle + b|}{\|w\|} = \frac{d_0(z, w, b)}{\|w\|}.$$
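A one-line numpy version of this margin, under the same sample-as-rows convention as the earlier sketches (an illustration of the formula, not part of the lecture):

import numpy as np

def margin(w, b, X):
    # rho(w, b) = min_i |<w, x_i> + b| / ||w||, with the points x_i stored as rows of X
    return float(np.min(np.abs(X @ w + b)) / np.linalg.norm(w))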
Correct classification is achieved for a labeled example (xi, yi) when the label yi = sign(⟨w, xi⟩ + b). Since |⟨w, xi⟩ + b| ⩾ d0(z, w, b) for every labeled example zi = (xi, yi), correct classification is achieved when

yi (⟨w, xi ⟩ + b) ⩾ d0 (z, w, b) for all i ∈ [m].

SVM finds the maximum margin hyperplane that solves the following problem

$$\min_{w,b} \;\; \frac{1}{2} \frac{\|w\|^2}{d_0(z, w, b)^2} \qquad \text{subject to: } y_i(\langle w, x_i \rangle + b) \geq d_0(z, w, b) \text{ for all } i \in [m].$$

We observe that the solution to this optimization problem remains unchanged under scaling of (w, b). Normalizing the pair (w, b) by d0(z, w, b), we get the canonical hyperplane Ew,b such that min_{i∈[m]} |⟨w, xi⟩ + b| = 1. The marginal hyperplanes are defined to be parallel to the separating hyperplane and passing through the closest points on the negative or positive sides. Since they are parallel to the separating hyperplane, they admit the same normal vector w. By the definition of the canonical representation, for a point x on a marginal hyperplane we have |⟨w, x⟩ + b| = 1, and thus the marginal hyperplanes are ⟨w, x⟩ + b = ±1. Hence our original problem, finding (w, b) that maximizes the margin ρ while correctly separating all points, is equivalent to

$$\min_{w,b} \;\; \frac{1}{2} \|w\|^2 \qquad \text{subject to: } y_i(\langle w, x_i \rangle + b) \geq 1 \text{ for all } i \in [m]. \qquad (1)$$

The objective function F : w ↦ (1/2)∥w∥² is infinitely differentiable, its gradient is ∇F(w) = w, and its Hessian is the identity matrix ∇²F(w) = I, with strictly positive eigenvalues. Therefore, ∇²F(w) ≻ 0 and F is strictly convex. The constraints are all defined by the affine functions gi : (w, b) ↦ 1 − yi(⟨w, xi⟩ + b) and are thus qualified. Thus the optimization problem in (1) has a unique solution and can be solved as a quadratic program.
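Since (1) is a quadratic program, it can be handed to any off-the-shelf QP solver. Below is a minimal sketch using cvxpy (a library choice of mine, not something the lecture prescribes) on a tiny made-up separable sample:

import numpy as np
import cvxpy as cp

# toy separable sample: rows of X are the x_i, y holds the labels y_i (made up)
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
m, N = X.shape

w = cp.Variable(N)
b = cp.Variable()
# minimize (1/2)||w||^2 subject to y_i (<w, x_i> + b) >= 1 for all i, as in (1)
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)), constraints).solve()

w_star, b_star = w.value, b.value   # the maximum-margin hyperplane E_{w*, b*}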

2.2 Support vectors


In this section, we will show that the normal vector w to the resulting hyperplane is a linear combination
of some feature vectors, referred to as support vectors. Consider the dual variables αi ⩾ 0 for all i ∈ [m]

associated to the m affine constraints, and let α ≜ (αi : i ∈ [m]). Then, we can define the Lagrangian for all canonical pairs (w, b) ∈ R^{N+1} and Lagrange dual variables α ∈ R^m_+ as
$$L(w, b, \alpha) \triangleq \frac{1}{2}\|w\|^2 - \sum_{i=1}^m \alpha_i \left[ y_i (\langle w, x_i \rangle + b) - 1 \right].$$

Since the primal problem in (1) has a convex cost function with affine constraints, Ew∗,b∗ is the optimal separating canonical hyperplane if and only if there exists α∗ ∈ R^m_+ that satisfies the following three KKT conditions:
$$\nabla_w L\big|_{w=w^*} = w^* - \sum_{i=1}^m \alpha_i^* y_i x_i = 0, \qquad \nabla_b L\big|_{b=b^*} = -\sum_{i=1}^m \alpha_i^* y_i = 0, \qquad \alpha_i^*\left[ y_i(\langle w^*, x_i \rangle + b^*) - 1 \right] = 0.$$

Remark 7. The complementary condition implies that αi∗ = 0 if the labeled points are not on the sup-
porting hyperplane, i.e. yi (⟨w∗ , xi ⟩ + b∗ ) ̸= 1.
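Given candidate primal and dual solutions, these conditions can be checked numerically. The sketch below assumes a sample (X, y) and arrays w_star, b_star, alpha_star in the layout of the earlier sketches (alpha_star is obtained from the dual problem of Section 2.3); the tolerance is arbitrary.

import numpy as np

def check_kkt(w_star, b_star, alpha_star, X, y, tol=1e-6):
    slack = y * (X @ w_star + b_star) - 1.0                    # y_i(<w*, x_i> + b*) - 1
    stationarity_w = np.allclose(w_star, (alpha_star * y) @ X, atol=tol)
    stationarity_b = abs(alpha_star @ y) <= tol
    complementary = np.all(np.abs(alpha_star * slack) <= tol)  # alpha_i^* [ ... ] = 0
    feasible = np.all(slack >= -tol) and np.all(alpha_star >= -tol)
    return stationarity_w and stationarity_b and complementary and feasible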
Definition 2.2 (Support vectors). We can define the support vectors as the examples or feature vectors
for which the corresponding Lagrange variable αi∗ ̸= 0, i.e.

S ≜ {i ∈ [m] : αi∗ ̸= 0} ⊆ {i ∈ [m] : yi (⟨w∗ , xi ⟩ + b∗ ) = 1} .


Remark 8. The optimal primal variables (w∗ , b∗ ) in the SVM solution are the stationary points of the
associated Lagrangian, and hence we can write the normal vector as a linear combination of support
vectors, i.e.
$$w^* = \sum_{i=1}^m \alpha_i^* y_i x_i = \sum_{i \in S} \alpha_i^* y_i x_i. \qquad (2)$$
Remark 9. Support vectors completely determine the maximum-margin hyperplane solution. Vectors
not lying on the marginal hyperplane do not affect the definition of these hyperplanes.
Remark 10. The slope of the hyperplane w∗ is unique but the support vectors are not unique. A hyper-
plane is sufficiently determined by N + 1 points in N dimensions. Thus, when more than N + 1 points
lie on a marginal hyperplane, different choices are possible for the N + 1 support vectors.
Remark 11. We have expressed the normal vector w∗ of the optimal hyperplane in terms of the optimal dual variables α∗ ∈ R^m_+. We have not yet found the optimal dual variables or the normalized distance b∗.

2.3 Dual optimization problem


In this section, we will show that the hypothesis h ∈ H and the distance b can be expressed in terms of inner products. To this end, we look at the dual form of the constrained primal optimization problem (1). Recall that the dual function is F(α) = inf_{w,b} L(w, b, α). The Lagrangian L is minimized at the optimal primal variables (w∗, b∗) satisfying ∇w L(w∗, b∗) = ∇b L(w∗, b∗) = 0, which yields the optimal normal vector w∗ = ∑_{i=1}^m αi yi xi in terms of the dual variables α ∈ R^m_+, as expressed in (2), together with the constraint ∑_{i=1}^m αi yi = 0.
Definition 2.3 (Gram matrix). For a labeled sample z ∈ (X × Y)^m, we define the Gram matrix A ∈ R^{m×m} entrywise by Aij ≜ ⟨yi xi, yj xj⟩ for all i, j ∈ [m].
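In numpy, the Gram matrix of Definition 2.3 can be formed in two lines (sample-as-rows convention as before; an illustration only):

import numpy as np

def gram_matrix(X, y):
    # A_ij = <y_i x_i, y_j x_j>, i.e. the Gram matrix of the vectors y_i x_i
    Z = y[:, None] * X      # row i is y_i x_i
    return Z @ Z.T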
Remark 12. The matrix A is the Gram matrix associated with vectors (y1 x1 , . . . , ym xm ) and hence is pos-
itive semidefinite. We can easily check that for any α ∈ Rm , we have
$$\alpha^T A \alpha = \sum_{i,j \in [m]} \langle \alpha_i y_i x_i, \alpha_j y_j x_j \rangle = \Big\| \sum_{i \in [m]} \alpha_i y_i x_i \Big\|^2 \geq 0.$$

Substituting w∗ = ∑_{i=1}^m αi yi xi, the constraint ∑_{i=1}^m αi yi = 0, and the definition of the Gram matrix A into the Lagrangian L(w∗, b∗, α), we can write the dual function as F(α) = L(w∗, b∗, α) = ∑_{i=1}^m αi − (1/2) ∑_{i=1}^m ∑_{j=1}^m αi Aij αj. Therefore, we can write the dual SVM optimization problem as
$$\max_{\alpha} \;\; \|\alpha\|_1 - \frac{1}{2} \alpha^T A \alpha \qquad \text{subject to: } \alpha_i \geq 0 \text{ for all } i \in [m], \ \text{and} \ \sum_{i=1}^m \alpha_i y_i = 0. \qquad (3)$$
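Here is a sketch of solving (3) with cvxpy, continuing the toy data (X, y) from the primal sketch (the library and variable names are my choices, not the lecture's). On the feasible set α ⩾ 0 we have ∥α∥₁ = ∑_i αi, and by Remark 12 the quadratic term αᵀAα equals ∥∑_i αi yi xi∥², which is how it is written below.

import numpy as np
import cvxpy as cp

m = len(y)
alpha = cp.Variable(m)
# ||alpha||_1 - (1/2) alpha^T A alpha, using alpha^T A alpha = ||sum_i alpha_i y_i x_i||^2
objective = cp.Maximize(cp.sum(alpha) - 0.5 * cp.sum_squares(cp.multiply(alpha, y) @ X))
constraints = [alpha >= 0, cp.sum(cp.multiply(alpha, y)) == 0]
cp.Problem(objective, constraints).solve()

alpha_star = alpha.value
support = np.where(alpha_star > 1e-6)[0]   # indices in S (numerical threshold)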

The objective function G : α ↦ ∥α∥₁ − (1/2) αᵀAα is infinitely differentiable, and its Hessian is given by
∇2 G = − A ⪯ 0, and hence G is a concave function. Since the constraints are affine and convex, the
dual maximization problem (3) is equivalent to a convex optimization problem. Since G is a quadratic
function of Lagrange variables α, this dual optimization problem is also a quadratic program, as in the
case of the primal optimization. Since the constraints are affine, they are qualified and strong duality
holds. Thus, the primal and dual problems are equivalent, i.e., the solution α∗ of the dual problem (3)
can be used directly to determine the hypothesis returned by SVMs. Using (2) for the normal to the
supporting hyperplane, we can write the hypothesis
$$h(x) = \operatorname{sign}(\langle w^*, x \rangle + b^*) = \operatorname{sign}\left( \sum_{i=1}^m \alpha_i^* y_i \langle x_i, x \rangle + b^* \right).$$

For any support vector xi with i ∈ S, we have yi = ⟨w∗, xi⟩ + b∗, and hence we can write, for all j ∈ S,
$$b^* = y_j - \sum_{i=1}^m \alpha_i^* y_i \langle x_i, x_j \rangle. \qquad (4)$$
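Continuing the dual sketch above, (2) and (4) recover w∗ and b∗ (illustrative names; any index j ∈ S works):

import numpy as np

w_star = (alpha_star * y) @ X        # equation (2): sum_i alpha_i^* y_i x_i
j = support[0]                       # any support vector index j in S
b_star = y[j] - X[j] @ w_star        # equation (4): y_j - sum_i alpha_i^* y_i <x_i, x_j>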

Combining the above two results, we get, for any j ∈ S,
$$h(x) = \operatorname{sign}\left( \sum_{i \in S} \alpha_i^* y_i \langle x_i, x - x_j \rangle + y_j \right).$$
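As Remark 13 below emphasizes, prediction therefore needs only inner products with the support vectors. A small sketch, reusing alpha_star, support, b_star and (X, y) from the sketches above (illustrative names):

import numpy as np

def predict(x):
    # h(x) = sign( sum_{i in S} alpha_i^* y_i <x_i, x> + b* )
    scores = alpha_star[support] * y[support] * (X[support] @ x)
    return np.sign(np.sum(scores) + b_star)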

Remark 13. The hypothesis solution depends only on inner products between vectors and not directly
on the vectors themselves.
Remark 14. Since (4) holds for all i ∈ S, that is, for all i such that αi∗ ̸= 0, we can write
$$0 = \sum_{i=1}^m \alpha_i^* y_i b^* = \sum_{i=1}^m \alpha_i^* y_i^2 - \sum_{i,j=1}^m \alpha_i^* A_{i,j} \alpha_j^* = \sum_{i=1}^m \alpha_i^* - \|w^*\|^2.$$
That is, we can write the optimal margin ρ as ρ² = 1/∥w∗∥² = 1/∥α∗∥₁.
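This identity gives a cheap consistency check on a computed solution (reusing w_star and alpha_star from the sketches above; purely illustrative):

import numpy as np

rho_sq_from_w = 1.0 / float(w_star @ w_star)          # 1 / ||w*||^2
rho_sq_from_alpha = 1.0 / float(np.sum(alpha_star))   # 1 / ||alpha*||_1, since alpha* >= 0
print(np.isclose(rho_sq_from_w, rho_sq_from_alpha, atol=1e-6))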
