Maximal Margin Classifier – Construction from Scratch
1. Problem Setup
We have a binary classification problem with n training observations:
- Feature vectors: x_i ∈ R^p for i = 1,...,n
- Class labels: y_i ∈ {−1, +1}
We seek a linear classifier defined by a hyperplane:
β_0 + β^T x = 0
where β = (β_1,...,β_p)^T is the normal vector to the hyperplane and β_0 is the intercept.
The classifier predicts:
ŷ(x) = sign(β_0 + β^T x)
A point x_i with label y_i is classified correctly if:
y_i (β_0 + β^T x_i) > 0
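As a concrete illustration, here is a minimal sketch of this prediction rule in Python, assuming NumPy is available; the function names predict and correctly_classified are introduced only for this example.

    import numpy as np

    def predict(beta0, beta, X):
        # Predicted label: sign(beta_0 + beta^T x) for each row x of X.
        # Points exactly on the hyperplane map to 0.
        return np.sign(beta0 + X @ beta)

    def correctly_classified(beta0, beta, X, y):
        # True where y_i * (beta_0 + beta^T x_i) > 0, i.e. x_i lies on its correct side.
        return y * (beta0 + X @ beta) > 0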
2. Distance from a Point to a Hyperplane
Consider the hyperplane:
β_0 + β^T x = 0
We want the perpendicular distance from x_i to this hyperplane.
Parameterize the line through x_i in the direction of the unit normal β / ||β||:
x(t) = x_i − t · (β / ||β||)
The point x(t*) on this line that lies on the hyperplane satisfies:
β_0 + β^T x(t*) = 0
Substitute x(t*):
β_0 + β^T(x_i − t* β / ||β||) = 0
β_0 + β^T x_i − t* (β^T β / ||β||) = 0
β_0 + β^T x_i − t* ||β|| = 0
Solve for t*:
t* = (β_0 + β^T x_i) / ||β||
The perpendicular distance from x_i to the hyperplane is |t*|, hence:
dist(x_i, hyperplane) = |β_0 + β^T x_i| / ||β||
If we want a signed distance consistent with the label y_i, we use:
signed_dist(x_i) = y_i (β_0 + β^T x_i) / ||β||
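Both quantities are straightforward to compute; a minimal sketch, assuming NumPy (function names are ours):

    import numpy as np

    def distance_to_hyperplane(beta0, beta, x):
        # Perpendicular distance |beta_0 + beta^T x| / ||beta||.
        return abs(beta0 + beta @ x) / np.linalg.norm(beta)

    def signed_distance(beta0, beta, x, y):
        # Signed distance y * (beta_0 + beta^T x) / ||beta||;
        # positive exactly when x lies on the side its label y indicates.
        return y * (beta0 + beta @ x) / np.linalg.norm(beta)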
3. Hyperplane Scaling Invariance
If β_0 + β^T x = 0 defines a hyperplane, then for any k ≠ 0, the equation
k(β_0 + β^T x) = 0
or equivalently
(kβ_0) + (kβ)^T x = 0
defines the same set of points x. Therefore, a hyperplane is invariant under nonzero
scaling of (β_0, β).
Thus, we can impose a constraint on β, such as ||β|| = 1, without changing which geometric
hyperplanes we can represent. This simply chooses a normalized representation of the
hyperplane.
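A quick numerical check of this invariance, with illustrative values only:

    import numpy as np

    beta0, beta = 1.0, np.array([3.0, -4.0])   # ||beta|| = 5
    x = np.array([2.0, 1.0])
    k = -2.5                                   # any nonzero rescaling

    d_original = abs(beta0 + beta @ x) / np.linalg.norm(beta)
    d_scaled = abs(k * beta0 + (k * beta) @ x) / np.linalg.norm(k * beta)
    assert np.isclose(d_original, d_scaled)    # same hyperplane, same distance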
4. Functional Margin and Geometric Margin
For a classifier (β_0, β), the functional margin of (x_i, y_i) is:
γ̂_i = y_i (β_0 + β^T x_i)
This is positive if x_i is correctly classified, negative otherwise. The functional margin
of the classifier on the dataset is:
γ̂ = min_i γ̂_i = min_i y_i (β_0 + β^T x_i)
The geometric margin uses the actual perpendicular distance:
γ_i = y_i (β_0 + β^T x_i) / ||β||
and the geometric margin of the classifier is:
γ = min_i γ_i = min_i y_i (β_0 + β^T x_i) / ||β||
Note the relationship:
γ = γ̂ / ||β||
so the geometric margin is invariant to positive rescaling of (β_0, β), while the functional
margin scales along with it.
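A minimal sketch of both margins, assuming NumPy (function names are ours); the comments at the end spell out how each behaves under positive rescaling:

    import numpy as np

    def functional_margin(beta0, beta, X, y):
        # gamma_hat = min_i y_i (beta_0 + beta^T x_i); changes if (beta_0, beta) is rescaled.
        return np.min(y * (beta0 + X @ beta))

    def geometric_margin(beta0, beta, X, y):
        # gamma = gamma_hat / ||beta||; unchanged by positive rescaling of (beta_0, beta).
        return functional_margin(beta0, beta, X, y) / np.linalg.norm(beta)

    # For any k > 0:
    #   functional_margin(k*beta0, k*beta, X, y) == k * functional_margin(beta0, beta, X, y)
    #   geometric_margin(k*beta0, k*beta, X, y)  == geometric_margin(beta0, beta, X, y)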
5. Optimization Problem for the Maximal Margin Hyperplane
We assume the data are linearly separable, so there exists some (β_0, β) such that:
y_i (β_0 + β^T x_i) > 0 for all i
The maximal margin hyperplane is the one that maximizes the geometric margin γ. We use the
scaling freedom to impose ||β|| = 1. Under this normalization, the signed distance from x_i
to the hyperplane, which equals the perpendicular distance whenever x_i is correctly
classified, becomes:
signed_dist(x_i) = y_i (β_0 + β^T x_i)
since ||β|| = 1.
We then define M to be a common lower bound on these signed distances:
y_i (β_0 + β^T x_i) ≥ M for all i
Given ||β|| = 1, this inequality says each observation lies on the correct side of the
hyperplane and at least distance M away from it.
Hence the optimization problem is:
maximize M
subject to ||β||^2 = 1
y_i (β_0 + β^T x_i) ≥ M, for all i = 1,...,n
In component form this is:
maximize M
subject to ∑_{j=1}^p β_j^2 = 1
y_i (β_0 + β_1 x_{i1} + ... + β_p x_{ip}) ≥ M, for all i
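One way to make this concrete is to hand the problem, exactly as written, to a general-purpose constrained solver. The sketch below assumes NumPy/SciPy and linearly separable data; the function name maximal_margin_direct is ours, and SLSQP is used purely for illustration, since the equality constraint ||β|| = 1 makes this formulation non-convex (the convex reformulation in Section 7 is what one would use in practice).

    import numpy as np
    from scipy.optimize import minimize

    def maximal_margin_direct(X, y):
        # Solve: maximize M  s.t.  ||beta|| = 1  and  y_i (beta_0 + beta^T x_i) >= M.
        # Parameter vector theta = (beta_0, beta_1, ..., beta_p, M).
        p = X.shape[1]
        constraints = [
            {"type": "eq",   "fun": lambda th: th[1:p+1] @ th[1:p+1] - 1.0},
            {"type": "ineq", "fun": lambda th: y * (th[0] + X @ th[1:p+1]) - th[-1]},
        ]
        theta0 = np.zeros(p + 2)
        theta0[1] = 1.0                        # start from a unit-norm beta
        res = minimize(lambda th: -th[-1],     # maximize M  <=>  minimize -M
                       theta0, constraints=constraints, method="SLSQP")
        beta0, beta, M = res.x[0], res.x[1:p+1], res.x[-1]
        return beta0, beta, M

On small separable toy datasets this tends to recover the maximal margin hyperplane up to numerical tolerance, but because the feasible set is non-convex, convergence from an arbitrary starting point is not guaranteed.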
6. Interpretation of the Constraints
- Constraint y_i (β_0 + β^T x_i) ≥ M:
If M > 0, then each training point satisfies y_i (β_0 + β^T x_i) ≥ M > 0. Thus, every
point is on the correct side of the hyperplane, and its perpendicular distance to the
hyperplane is at least M (because ||β|| = 1). This provides a buffer or cushion, not just
correctness.
- Constraint ∑ β_j^2 = 1 (or ||β|| = 1):
This does not change which hyperplanes we can express (due to scaling invariance).
Instead, it ensures that y_i (β_0 + β^T x_i) is exactly the signed perpendicular distance to
the hyperplane, so that M is a true geometric margin.
Therefore, M is exactly the margin of the hyperplane: the smallest perpendicular distance
of any training observation to the hyperplane.
7. Connection to the Standard Hard-Margin SVM Formulation
The above problem is:
maximize M
subject to ||β|| = 1
y_i (β_0 + β^T x_i) ≥ M, for all i
We can rescale (β_0, β) and M to obtain the common SVM form. Define:
w = β / M, b = β_0 / M
Then:
w^T x_i + b = (β^T x_i + β_0) / M
From y_i (β_0 + β^T x_i) ≥ M we get:
y_i (w^T x_i + b) = y_i (β_0 + β^T x_i) / M ≥ 1
Also, since ||β|| = 1:
||w|| = ||β / M|| = 1 / M => M = 1 / ||w||
Maximizing M is thus equivalent to minimizing ||w||, or equivalently minimizing (1/2)
||w||^2. This yields the usual hard-margin SVM primal problem:
minimize (1/2) ||w||^2
subject to y_i (w^T x_i + b) ≥ 1, for all i
Thus, the optimization problem with variables (β_0, β, M) and constraints (∑ β_j^2 = 1,
y_i (β_0 + β^T x_i) ≥ M) is just another way of expressing the construction of the maximal
margin classifier: it explicitly maximizes the geometric margin M.
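A minimal sketch of this standard form, assuming the data are linearly separable and the cvxpy package is available (the function name hard_margin_svm is ours). It solves the primal QP and then maps the solution back to the (β_0, β, M) parameterization via M = 1 / ||w||.

    import numpy as np
    import cvxpy as cp

    def hard_margin_svm(X, y):
        # Hard-margin primal: minimize (1/2)||w||^2  s.t.  y_i (w^T x_i + b) >= 1.
        p = X.shape[1]
        w = cp.Variable(p)
        b = cp.Variable()
        objective = cp.Minimize(0.5 * cp.sum_squares(w))
        constraints = [cp.multiply(y, X @ w + b) >= 1]
        cp.Problem(objective, constraints).solve()

        # Map back to (beta_0, beta, M): M = 1/||w||, beta = M*w, beta_0 = M*b.
        M = 1.0 / np.linalg.norm(w.value)
        return M * b.value, M * w.value, M

On separable toy data, the margin M returned here should agree, up to numerical tolerance, with the optimal M of the (β_0, β, M) formulation in Section 5.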