Lecture-03: Support Vector Machines – Separable Case

1 Linear binary classification


Let the input space be X = R^N, where N ⩾ 1 is the number of dimensions, the output space Y = {−1, 1}, and the target function be some mapping c : X → Y. We will also use ∥x∥ to denote the 2-norm of x ∈ X.
Definition 1.1. For a vector w ∈ R^N and scalar b ∈ R, we define a hyperplane as a set of points
$$E_{w,b} \triangleq \left\{ x \in \mathbb{R}^N : \frac{\langle w, x \rangle}{\|w\|} = -\frac{b}{\|w\|} \right\}.$$

Definition 1.2. The distance of a point from a set is defined as d(x, A) ≜ min{d(x, y) : y ∈ A}. For x, y ∈ R^N, the distance is d(x, y) ≜ ∥x − y∥₂.
Lemma 1.3. For a vector w ∈ R^N and b ∈ R, we have d(0, Ew,b) = |b|/∥w∥.

Proof. For a vector w, we define the unit vector u ≜ w/∥w∥. It follows that x0 ≜ −(b/∥w∥)u lies on the hyperplane Ew,b; the point x0 is parallel to the unit vector u and at distance d(0, x0) = |b|/∥w∥ from the origin. Any point x ∈ Ew,b on the hyperplane can be written as a sum of two orthogonal vectors x = x0 + (x − x0), where ⟨x − x0, w⟩ = 0. Therefore, d(0, x)² = d(0, x0)² + d(x0, x)² ⩾ d(0, x0)², and hence d(0, Ew,b) = d(0, x0).
Remark 1. A hyperplane Ew,b = {x ∈ R^N : ⟨w, x⟩ + b = 0} is defined in terms of the unit vector w/∥w∥ and its signed distance −b/∥w∥ from the origin.


Lemma 1.4. The distance of any point x ∈ R^N to a hyperplane Ew,b is given by d(x, Ew,b) = |⟨w, x⟩ + b|/∥w∥.

Proof. Let u = w/∥w∥ be the unit vector in the direction of w. Any point y on the hyperplane Ew,b can be written as a sum of two orthogonal vectors y = x0 + (y − x0), where ⟨y − x0, w⟩ = 0 and x0 = −(b/∥w∥)u. Any point x ∈ R^N can be represented as x = ⟨x, u⟩u + v such that ⟨v, w⟩ = 0. Therefore,
$$d(x, E_{w,b})^2 = \min_{y \in E_{w,b}} d(x, y)^2 = \min_{y \in E_{w,b}} d\big(x_0 + (y - x_0),\, \langle x, u\rangle u + v\big)^2 \geq \left(\frac{\langle x, w \rangle + b}{\|w\|}\right)^2,$$
with equality attained at y = x0 + v ∈ Ew,b, since the component of x − y along u has length |⟨x, u⟩ + b/∥w∥| = |⟨x, w⟩ + b|/∥w∥.

Remark 2. The distance of a point x ∈ R^N from the hyperplane Ew,b is given by d(x, Ew,b). If ⟨w, x⟩ + b > 0, then the point x lies above the hyperplane Ew,b, and if ⟨w, x⟩ + b < 0, then the point x lies below the hyperplane Ew,b.
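As a quick illustration of Lemma 1.4 and Remark 2, here is a minimal sketch in Python with numpy (the lecture itself uses no code; the numeric values below are made up) that computes the signed quantity ⟨w, x⟩ + b and the distance of x to Ew,b.

import numpy as np

def signed_value(w, b, x):
    # <w, x> + b; its sign tells on which side of E_{w,b} the point x lies (Remark 2)
    return np.dot(w, x) + b

def distance_to_hyperplane(w, b, x):
    # d(x, E_{w,b}) = |<w, x> + b| / ||w||  (Lemma 1.4)
    return abs(signed_value(w, b, x)) / np.linalg.norm(w)

# illustrative values, not taken from the lecture
w, b, x = np.array([3.0, 4.0]), -5.0, np.array([2.0, 1.0])
print(signed_value(w, b, x))            # 5.0 > 0, so x lies above E_{w,b}
print(distance_to_hyperplane(w, b, x))  # 5.0 / 5.0 = 1.0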
Assumption 1.5. We are given a training sample z ∈ (X × Y)^m consisting of m labeled training examples zi = (xi, yi) ∈ X × Y, where each example xi is generated i.i.d. from a fixed but unknown distribution D, and the label yi = c(xi) for an unknown concept c : X → Y.
Assumption 1.6. We define the hypothesis set as a collection of separating hyperplanes
H ≜ {x ↦ sign(⟨w, x⟩ + b) : w ∈ R^N, b ∈ R}.

Remark 3. Any hypothesis h ∈ H is identified by the pair (w, b) such that h(x) = sign(⟨w, x⟩ + b) for all x ∈ R^N. A hypothesis h ∈ H labels positively all points falling on one side of the hyperplane Ew,b ≜ {x ∈ R^N : ⟨w, x⟩ + b = 0} and labels negatively all others. This problem is referred to as the linear binary classification problem.
Remark 4. For the Hamming loss function (y, y′) ↦ 1{y ̸= y′}, the generalization error is R(h) = P{h(x) ̸= c(x)}.
The objective is to select an h ∈ H such that the generalization error R(h) is minimized.
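The generalization error R(h) involves the unknown distribution D and concept c, so it cannot be computed directly; only its empirical counterpart on a labeled sample can. A minimal numpy sketch of a hypothesis from H and its empirical 0-1 error (the names and sample-as-rows layout are my own, not the lecture's):

import numpy as np

def h(w, b, X):
    # hypothesis from H: x -> sign(<w, x> + b), applied to each row of X
    return np.sign(X @ w + b)

def empirical_error(w, b, X, y):
    # fraction of labeled examples with h(x_i) != y_i (empirical 0-1 loss)
    return float(np.mean(h(w, b, X) != y))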

2 SVMs — separable case
Support vector machines are among the most theoretically well-motivated and practically effective binary classification algorithms. We first introduce the algorithm for separable datasets, and then present
its general version for non-separable datasets.
Assumption 2.1. Consider a training sample of size m denoted by z ∈ (X × Y)^m and define the disjoint sets T−1 ≜ {i ∈ [m] : yi = −1} and T1 ≜ {i ∈ [m] : yi = 1}. We assume that T1 and T−1 are non-empty and can be separated by a hyperplane. That is, there exists a hyperplane Ew,b such that [m] = T−1 ∪ T1 and

T1 = {i ∈ [m] : ⟨w, xi⟩ + b > 0} ,    T−1 = {i ∈ [m] : ⟨w, xi⟩ + b < 0} .

Remark 5. For such a hyperplane Ew,b , we have yi h( xi ) > 0 for all i ∈ [m].
Remark 6. Let Ew,b be one of infinite such planes. Which hyperplane should a learning algorithm select?
The solution Ew∗ ,b∗ returned by the SVM algorithm is the hyperplane with the maximum margin, or the
distance to the closest points, and is thus known as the maximum-margin hyperplane.

2.1 Primal optimization problem


Assumption 2.1 confirms the existence of at least one pair (w, b) such that ⟨w, xi ⟩ + b ̸= 0 for all i ∈ [m].
For any training sample z and hyperplane Ew,b, we define the minimum d0(z, w, b) ≜ min_{i∈[m]} |⟨w, xi⟩ + b|. In terms of this minimum, we can write the margin, the minimum distance of the sample z to the hyperplane Ew,b, as
$$\rho(w, b) \triangleq \min_{i \in [m]} d(x_i, E_{w,b}) = \min_{i \in [m]} \frac{|\langle w, x_i \rangle + b|}{\|w\|} = \frac{d_0(z, w, b)}{\|w\|}.$$
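A one-line numpy version of this margin, under the same sample-as-rows convention as the earlier sketches (an illustration of the formula, not part of the lecture):

import numpy as np

def margin(w, b, X):
    # rho(w, b) = min_i |<w, x_i> + b| / ||w||, with the points x_i stored as rows of X
    return float(np.min(np.abs(X @ w + b)) / np.linalg.norm(w))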
Correct classification is achieved for a labeled example (xi, yi) when the label yi = sign(⟨w, xi⟩ + b). Since |⟨w, xi⟩ + b| ⩾ d0(z, w, b) for every labeled example zi = (xi, yi), correct classification is achieved when

yi (⟨w, xi ⟩ + b) ⩾ d0 (z, w, b) for all i ∈ [m].

SVM finds the maximum margin hyperplane that solves the following problem

$$\min_{w,b} \;\; \frac{1}{2} \frac{\|w\|^2}{d_0(z, w, b)^2} \qquad \text{subject to: } y_i(\langle w, x_i \rangle + b) \geq d_0(z, w, b) \text{ for all } i \in [m].$$

We observe that the solution to this optimization problem remains unchanged under scaling of (w, b). Normalizing the pair (w, b) by d0(z, w, b), we get the canonical hyperplane Ew,b such that min_{i∈[m]} |⟨w, xi⟩ + b| = 1. The marginal hyperplanes are defined to be parallel to the separating hyperplane and passing through the closest points on the negative or positive sides. Since they are parallel to the separating hyperplane, they admit the same normal vector w. By the definition of the canonical representation, for a point x on a marginal hyperplane we have |⟨w, x⟩ + b| = 1, and thus the marginal hyperplanes are ⟨w, x⟩ + b = ±1. Hence our original problem, finding (w, b) that maximizes the margin ρ while correctly separating all points, is equivalent to

$$\min_{w,b} \;\; \frac{1}{2} \|w\|^2 \qquad \text{subject to: } y_i(\langle w, x_i \rangle + b) \geq 1 \text{ for all } i \in [m]. \qquad (1)$$

The objective function F : w ↦ (1/2)∥w∥² is infinitely differentiable, its gradient is ∇F(w) = w, and its Hessian is the identity matrix ∇²F(w) = I, with strictly positive eigenvalues. Therefore, ∇²F(w) ≻ 0 and F is strictly convex. The constraints are all defined by the affine functions gi : (w, b) ↦ 1 − yi(⟨w, xi⟩ + b) and are thus qualified. Thus the optimization problem in (1) has a unique solution and can be solved as a quadratic program.
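Since (1) is a quadratic program, it can be handed to any off-the-shelf QP solver. Below is a minimal sketch using cvxpy (a library choice of mine, not something the lecture prescribes) on a tiny made-up separable sample:

import numpy as np
import cvxpy as cp

# toy separable sample: rows of X are the x_i, y holds the labels y_i (made up)
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
m, N = X.shape

w = cp.Variable(N)
b = cp.Variable()
# minimize (1/2)||w||^2 subject to y_i (<w, x_i> + b) >= 1 for all i, as in (1)
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)), constraints).solve()

w_star, b_star = w.value, b.value   # the maximum-margin hyperplane E_{w*, b*}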

2.2 Support vectors


In this section, we will show that the normal vector w to the resulting hyperplane is a linear combination
of some feature vectors, referred to as support vectors. Consider the dual variables αi ⩾ 0 for all i ∈ [m]

associated to the m affine constraints, and let α ≜ (αi : i ∈ [m]). Then, we can define the Lagrangian for all canonical pairs (w, b) ∈ R^{N+1} and Lagrange dual variables α ∈ R^m_+ as
$$L(w, b, \alpha) \triangleq \frac{1}{2}\|w\|^2 - \sum_{i=1}^m \alpha_i \left[ y_i (\langle w, x_i \rangle + b) - 1 \right].$$

Since the primal problem in (1) has a convex cost function with affine constraints, Ew∗,b∗ is the optimal separating canonical hyperplane if and only if there exists α∗ ∈ R^m_+ that satisfies the following three KKT conditions:
$$\nabla_w L\big|_{w=w^*} = w^* - \sum_{i=1}^m \alpha_i^* y_i x_i = 0, \qquad \nabla_b L\big|_{b=b^*} = -\sum_{i=1}^m \alpha_i^* y_i = 0, \qquad \alpha_i^*\left[ y_i(\langle w^*, x_i \rangle + b^*) - 1 \right] = 0.$$

Remark 7. The complementary condition implies that αi∗ = 0 if the labeled points are not on the sup-
porting hyperplane, i.e. yi (⟨w∗ , xi ⟩ + b∗ ) ̸= 1.
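Given candidate primal and dual solutions, these conditions can be checked numerically. The sketch below assumes a sample (X, y) and arrays w_star, b_star, alpha_star in the layout of the earlier sketches (alpha_star is obtained from the dual problem of Section 2.3); the tolerance is arbitrary.

import numpy as np

def check_kkt(w_star, b_star, alpha_star, X, y, tol=1e-6):
    slack = y * (X @ w_star + b_star) - 1.0                    # y_i(<w*, x_i> + b*) - 1
    stationarity_w = np.allclose(w_star, (alpha_star * y) @ X, atol=tol)
    stationarity_b = abs(alpha_star @ y) <= tol
    complementary = np.all(np.abs(alpha_star * slack) <= tol)  # alpha_i^* [ ... ] = 0
    feasible = np.all(slack >= -tol) and np.all(alpha_star >= -tol)
    return stationarity_w and stationarity_b and complementary and feasible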
Definition 2.2 (Support vectors). We can define the support vectors as the examples or feature vectors
for which the corresponding Lagrange variable αi∗ ̸= 0, i.e.

S ≜ {i ∈ [m] : αi∗ ̸= 0} ⊆ {i ∈ [m] : yi (⟨w∗ , xi ⟩ + b∗ ) = 1} .


Remark 8. The optimal primal variables (w∗ , b∗ ) in the SVM solution are the stationary points of the
associated Lagrangian, and hence we can write the normal vector as a linear combination of support
vectors, i.e.
$$w^* = \sum_{i=1}^m \alpha_i^* y_i x_i = \sum_{i \in S} \alpha_i^* y_i x_i. \qquad (2)$$
Remark 9. Support vectors completely determine the maximum-margin hyperplane solution. Vectors
not lying on the marginal hyperplane do not affect the definition of these hyperplanes.
Remark 10. The slope of the hyperplane w∗ is unique but the support vectors are not unique. A hyper-
plane is sufficiently determined by N + 1 points in N dimensions. Thus, when more than N + 1 points
lie on a marginal hyperplane, different choices are possible for the N + 1 support vectors.
Remark 11. We have expressed the normal vector w∗ of the optimal hyperplane in terms of the optimal dual variables α∗ ∈ R^m_+. We have not yet found the optimal dual variables or the normalized distance b∗.

2.3 Dual optimization problem


In this section, we will show that the hypothesis h ∈ H and the distance b can be expressed in terms of inner products. To this end, we look at the dual form of the constrained primal optimization problem (1). Recall that the dual function is F(α) = inf_{w,b} L(w, b, α). The Lagrangian L is minimized at the optimal primal variables (w∗, b∗) satisfying ∇w L(w∗, b∗) = ∇b L(w∗, b∗) = 0, which yields the optimal normal vector w∗ = ∑_{i=1}^m αi yi xi in terms of the dual variables α ∈ R^m_+, as expressed in (2), together with the constraint ∑_{i=1}^m αi yi = 0.
Definition 2.3 (Gram matrix). For a labeled sample z ∈ (X × Y)^m, we define the Gram matrix A ∈ R^{m×m} entrywise by Aij ≜ ⟨yi xi, yj xj⟩ for all i, j ∈ [m].
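In numpy, the Gram matrix of Definition 2.3 can be formed in two lines (sample-as-rows convention as before; an illustration only):

import numpy as np

def gram_matrix(X, y):
    # A_ij = <y_i x_i, y_j x_j>, i.e. the Gram matrix of the vectors y_i x_i
    Z = y[:, None] * X      # row i is y_i x_i
    return Z @ Z.T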
Remark 12. The matrix A is the Gram matrix associated with vectors (y1 x1 , . . . , ym xm ) and hence is pos-
itive semidefinite. We can easily check that for any α ∈ Rm , we have
$$\alpha^T A \alpha = \sum_{i,j \in [m]} \langle \alpha_i y_i x_i, \alpha_j y_j x_j \rangle = \Big\| \sum_{i \in [m]} \alpha_i y_i x_i \Big\|^2 \geq 0.$$

Substituting w∗ = ∑_{i=1}^m αi yi xi, the constraint ∑_{i=1}^m αi yi = 0, and the definition of the Gram matrix A into the Lagrangian L(w∗, b∗, α), we can write the dual function as F(α) = L(w∗, b∗, α) = ∑_{i=1}^m αi − (1/2) ∑_{i=1}^m ∑_{j=1}^m αi Aij αj. Therefore, we can write the dual SVM optimization problem as
$$\max_{\alpha} \;\; \|\alpha\|_1 - \frac{1}{2} \alpha^T A \alpha \qquad \text{subject to: } \alpha_i \geq 0 \text{ for all } i \in [m], \ \text{and} \ \sum_{i=1}^m \alpha_i y_i = 0. \qquad (3)$$
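Here is a sketch of solving (3) with cvxpy, continuing the toy data (X, y) from the primal sketch (the library and variable names are my choices, not the lecture's). On the feasible set α ⩾ 0 we have ∥α∥₁ = ∑_i αi, and by Remark 12 the quadratic term αᵀAα equals ∥∑_i αi yi xi∥², which is how it is written below.

import numpy as np
import cvxpy as cp

m = len(y)
alpha = cp.Variable(m)
# ||alpha||_1 - (1/2) alpha^T A alpha, using alpha^T A alpha = ||sum_i alpha_i y_i x_i||^2
objective = cp.Maximize(cp.sum(alpha) - 0.5 * cp.sum_squares(cp.multiply(alpha, y) @ X))
constraints = [alpha >= 0, cp.sum(cp.multiply(alpha, y)) == 0]
cp.Problem(objective, constraints).solve()

alpha_star = alpha.value
support = np.where(alpha_star > 1e-6)[0]   # indices in S (numerical threshold)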

The objective function G : α ↦ ∥α∥₁ − (1/2) αᵀAα is infinitely differentiable, and its Hessian is given by
∇2 G = − A ⪯ 0, and hence G is a concave function. Since the constraints are affine and convex, the
dual maximization problem (3) is equivalent to a convex optimization problem. Since G is a quadratic
function of Lagrange variables α, this dual optimization problem is also a quadratic program, as in the
case of the primal optimization. Since the constraints are affine, they are qualified and strong duality
holds. Thus, the primal and dual problems are equivalent, i.e., the solution α∗ of the dual problem (3)
can be used directly to determine the hypothesis returned by SVMs. Using (2) for the normal to the
supporting hyperplane, we can write the hypothesis
$$h(x) = \operatorname{sign}(\langle w^*, x \rangle + b^*) = \operatorname{sign}\left( \sum_{i=1}^m \alpha_i^* y_i \langle x_i, x \rangle + b^* \right).$$

For any support vector xi with i ∈ S, we have yi = ⟨w∗, xi⟩ + b∗, and hence we can write, for all j ∈ S,
$$b^* = y_j - \sum_{i=1}^m \alpha_i^* y_i \langle x_i, x_j \rangle. \qquad (4)$$
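Continuing the dual sketch above, (2) and (4) recover w∗ and b∗ (illustrative names; any index j ∈ S works):

import numpy as np

w_star = (alpha_star * y) @ X        # equation (2): sum_i alpha_i^* y_i x_i
j = support[0]                       # any support vector index j in S
b_star = y[j] - X[j] @ w_star        # equation (4): y_j - sum_i alpha_i^* y_i <x_i, x_j>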

Combining the above two results, we get, for any j ∈ S,
$$h(x) = \operatorname{sign}\left( \sum_{i \in S} \alpha_i^* y_i \langle x_i, x - x_j \rangle + y_j \right).$$
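As Remark 13 below emphasizes, prediction therefore needs only inner products with the support vectors. A small sketch, reusing alpha_star, support, b_star and (X, y) from the sketches above (illustrative names):

import numpy as np

def predict(x):
    # h(x) = sign( sum_{i in S} alpha_i^* y_i <x_i, x> + b* )
    scores = alpha_star[support] * y[support] * (X[support] @ x)
    return np.sign(np.sum(scores) + b_star)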

Remark 13. The hypothesis solution depends only on inner products between vectors and not directly
on the vectors themselves.
Remark 14. Since (4) holds for all i ∈ S, that is, for all i such that αi∗ ̸= 0, we can write
$$0 = \sum_{i=1}^m \alpha_i^* y_i b^* = \sum_{i=1}^m \alpha_i^* y_i^2 - \sum_{i,j=1}^m \alpha_i^* A_{i,j} \alpha_j^* = \sum_{i=1}^m \alpha_i^* - \|w^*\|^2.$$
That is, we can write the optimal margin ρ as ρ² = 1/∥w∗∥² = 1/∥α∗∥₁.
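This identity gives a cheap consistency check on a computed solution (reusing w_star and alpha_star from the sketches above; purely illustrative):

import numpy as np

rho_sq_from_w = 1.0 / float(w_star @ w_star)          # 1 / ||w*||^2
rho_sq_from_alpha = 1.0 / float(np.sum(alpha_star))   # 1 / ||alpha*||_1, since alpha* >= 0
print(np.isclose(rho_sq_from_w, rho_sq_from_alpha, atol=1e-6))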
