Maximal Margin Classifier – Construction from Scratch

1. Problem Setup

We have a binary classification problem with n training observations:


- Feature vectors: x_i ∈ R^p for i = 1,...,n
- Class labels: y_i ∈ {−1, +1}

We seek a linear classifier defined by a hyperplane:


β_0 + β^T x = 0

where β = (β_1,...,β_p)^T is the normal vector to the hyperplane and β_0 is the intercept.

The classifier predicts:


ŷ(x) = sign(β_0 + β^T x)

A point x_i with label y_i is classified correctly if:


y_i (β_0 + β^T x_i) > 0
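
As a minimal sketch of these two rules, assuming NumPy (the function names predict and is_correct are illustrative, not from the text):

import numpy as np

def predict(X, beta0, beta):
    # sign(beta_0 + beta^T x) for each row x of X; points exactly on the
    # hyperplane give 0 and would need a tie-breaking convention
    return np.sign(X @ beta + beta0)

def is_correct(X, y, beta0, beta):
    # True where y_i * (beta_0 + beta^T x_i) > 0
    return y * (X @ beta + beta0) > 0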

2. Distance from a Point to a Hyperplane

Consider the hyperplane:


β_0 + β^T x = 0

We want the perpendicular distance from x_i to this hyperplane.

Parameterize the line through x_i in the direction of the unit normal β / ||β||:
x(t) = x_i − t · (β / ||β||)

The point x(t*) on this line that lies on the hyperplane satisfies:
β_0 + β^T x(t*) = 0

Substitute x(t*):
β_0 + β^T(x_i − t* β / ||β||) = 0
β_0 + β^T x_i − t* (β^T β / ||β||) = 0
β_0 + β^T x_i − t* ||β|| = 0

Solve for t*:


t* = (β_0 + β^T x_i) / ||β||

The perpendicular distance from x_i to the hyperplane is |t*|, hence:


dist(x_i, hyperplane) = |β_0 + β^T x_i| / ||β||

If we want a signed distance consistent with the label y_i, we use:


signed_dist(x_i) = y_i (β_0 + β^T x_i) / ||β||
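
A short numerical check of these formulas, assuming NumPy (the hyperplane and point below are made up for illustration):

import numpy as np

def signed_distance(x, y, beta0, beta):
    # y * (beta_0 + beta^T x) / ||beta||
    return y * (beta0 + beta @ x) / np.linalg.norm(beta)

beta, beta0 = np.array([1.0, 1.0]), -1.0      # hyperplane x_1 + x_2 - 1 = 0
x_i, y_i = np.array([2.0, 2.0]), +1
# |(-1) + 4| / sqrt(2) = 3 / sqrt(2), approx 2.1213
print(signed_distance(x_i, y_i, beta0, beta))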

3. Hyperplane Scaling Invariance

If β_0 + β^T x = 0 defines a hyperplane, then for any k ≠ 0, the equation


k(β_0 + β^T x) = 0
or equivalently
(kβ_0) + (kβ)^T x = 0
defines the same set of points x. Therefore, a hyperplane is invariant under nonzero
scaling of (β_0, β).

Thus, we can impose a constraint on β, such as ||β|| = 1, without changing which geometric
hyperplanes we can represent. This simply chooses a normalized representation of the
hyperplane.
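
A quick numerical confirmation of this invariance, assuming NumPy (the values of k, β and x are arbitrary):

import numpy as np

beta, beta0 = np.array([1.0, 1.0]), -1.0
x, k = np.array([2.0, 2.0]), 5.0

dist   = abs(beta0 + beta @ x) / np.linalg.norm(beta)
dist_k = abs(k * beta0 + (k * beta) @ x) / np.linalg.norm(k * beta)
print(np.isclose(dist, dist_k))   # True: scaling (beta_0, beta) leaves the distance unchanged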

4. Functional Margin and Geometric Margin

For a classifier (β_0, β), the functional margin of (x_i, y_i) is:
γ̂_i = y_i (β_0 + β^T x_i)

This is positive if x_i is correctly classified, negative otherwise. The functional margin
of the classifier on the dataset is:
γ̂ = min_i γ̂_i = min_i y_i (β_0 + β^T x_i)

The geometric margin uses the actual perpendicular distance:


γ_i = y_i (β_0 + β^T x_i) / ||β||
and the geometric margin of the classifier is:
γ = min_i γ_i = min_i y_i (β_0 + β^T x_i) / ||β||

Note the relationship:


γ = γ̂ / ||β||

so the geometric margin is invariant to scaling of (β_0, β).
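
Both margins can be computed directly from these definitions; a sketch assuming NumPy, with a two-point toy dataset chosen only for illustration:

import numpy as np

def margins(X, y, beta0, beta):
    # functional margin: min_i y_i (beta_0 + beta^T x_i); geometric margin: divide by ||beta||
    functional = np.min(y * (X @ beta + beta0))
    return functional, functional / np.linalg.norm(beta)

X = np.array([[2.0, 2.0], [0.0, 0.0]])
y = np.array([+1, -1])
beta, beta0 = np.array([1.0, 1.0]), -1.0
print(margins(X, y, beta0, beta))   # functional margin 1.0, geometric margin 1/sqrt(2), approx 0.707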

5. Optimization Problem for the Maximal Margin Hyperplane

We assume the data are linearly separable, so there exists some (β_0, β) such that:
y_i (β_0 + β^T x_i) > 0 for all i

The maximal margin hyperplane is the one that maximizes the geometric margin γ. We use the
scaling freedom to impose ||β|| = 1. Under this normalization, the perpendicular distance
from x_i to the hyperplane becomes:
dist(x_i, hyperplane) = y_i (β_0 + β^T x_i)
since ||β|| = 1.

We then define M to be the (common) lower bound on these distances:


y_i (β_0 + β^T x_i) ≥ M for all i

Given ||β|| = 1, this inequality says each observation lies on the correct side of the
hyperplane and at least distance M away from it.

Hence the optimization problem is:

maximize M
subject to ||β||^2 = 1
y_i (β_0 + β^T x_i) ≥ M, for all i = 1,...,n

In component form this is:


maximize M
subject to ∑_{j=1}^p β_j^2 = 1
y_i (β_0 + β_1 x_{i1} + ... + β_p x_{ip}) ≥ M, for all i
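
For a fixed candidate (β_0, β), the largest feasible M is found by normalizing to ||β|| = 1 and taking the smallest constraint value; a sketch assuming NumPy (the function name is illustrative):

import numpy as np

def achieved_margin(X, y, beta0, beta):
    # rescale so that ||beta|| = 1, then M = min_i y_i (beta_0 + beta^T x_i)
    norm = np.linalg.norm(beta)
    return np.min(y * (X @ (beta / norm) + beta0 / norm))

# The maximal margin hyperplane is the (beta_0, beta) that maximizes this quantity
# over all separating hyperplanes.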

6. Interpretation of the Constraints

- Constraint y_i (β_0 + β^T x_i) ≥ M:


If M > 0, then each training point satisfies y_i (β_0 + β^T x_i) ≥ M > 0. Thus, every
point is on the correct side of the hyperplane, and its perpendicular distance to the
hyperplane is at least M (because ||β|| = 1). This provides a buffer or cushion, not just
correctness.

- Constraint ∑ β_j^2 = 1 (or ||β|| = 1):


This does not change which hyperplanes we can express (due to scaling invariance).
Instead, it ensures that y_i (β_0 + β^T x_i) directly equals the perpendicular distance to
the hyperplane, making M a true geometric margin.

Therefore, M is exactly the margin of the hyperplane: the smallest perpendicular distance
of any training observation to the hyperplane.

7. Connection to the Standard Hard-Margin SVM Formulation

The above problem is:


maximize M
subject to ||β|| = 1
y_i (β_0 + β^T x_i) ≥ M, for all i

We can rescale (β_0, β) and M to obtain the common SVM form. Define:
w = β / M, b = β_0 / M

Then:
w^T x_i + b = (β^T x_i + β_0) / M

From y_i (β_0 + β^T x_i) ≥ M we get:


y_i (w^T x_i + b) = y_i (β_0 + β^T x_i) / M ≥ 1

Also, since ||β|| = 1:


||w|| = ||β / M|| = 1 / M => M = 1 / ||w||

Maximizing M is thus equivalent to minimizing ||w||, or equivalently minimizing (1/2) ||w||^2. This yields the usual hard-margin SVM primal problem:

minimize (1/2) ||w||^2
subject to y_i (w^T x_i + b) ≥ 1, for all i

Thus, the optimization problem with variables (β_0, β, M) and constraints (∑ β_j^2 = 1,
y_i (β_0 + β^T x_i) ≥ M) is just another way of expressing the construction of the maximal
margin classifier: it explicitly maximizes the geometric margin M.
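
The equivalence can also be checked numerically. The sketch below, which assumes NumPy and scikit-learn (not used anywhere in the derivation above), approximates the hard-margin problem with a linear SVC and a very large C on separable toy data, then recovers the margin as M = 1 / ||w||:

import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.5]])   # linearly separable toy data
y = np.array([+1, +1, -1, -1])

# A very large C approximates: minimize (1/2)||w||^2 s.t. y_i (w^T x_i + b) >= 1
clf = SVC(kernel="linear", C=1e10).fit(X, y)
w, b = clf.coef_.ravel(), clf.intercept_[0]

print("margin M = 1/||w|| =", 1.0 / np.linalg.norm(w))
print("min_i y_i (w^T x_i + b) =", np.min(y * (X @ w + b)))   # approx 1 at the optimum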
