Machine Learning
PAC-Learning and VC-Dimension
Marcello Restelli
April 4, 2017
Outline
1 PAC-Learning
2 VC-Dimension
PAC-Learning
Overfitting happens because the training error is a bad estimate of the generalization error
Can we infer something about generalization error from training error?
Overfitting happens when the learner doesn’t see “enough” examples
Can we estimate how many examples are enough?
A Simple Setting...
Given
Set of instances X
Set of hypotheses H
Set of possible target concepts C (Boolean functions)
Training instances generated by a fixed, unknown probability distribution
P over X
Learner observes sequence D of training examples ⟨x, c(x)⟩, for some
target concept c ∈ C
Instances x are drawn from distribution P
Teacher provides deterministic target value c(x) for each instance
Learner must output a hypothesis h estimating c
h is evaluated by its performance on subsequent instances drawn according to P:
L_true(h) = Pr_{x∼P}[c(x) ≠ h(x)]
We want to bound L_true given L_train
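To make the definition concrete, here is a minimal Python sketch that estimates L_true(h) by Monte Carlo sampling from P; the target concept c, the hypothesis h, and the uniform distribution P are made-up choices for illustration.

import random

# Hypothetical example: estimate L_true(h) = Pr_{x~P}[c(x) != h(x)] by sampling.
# Both the target concept c and the hypothesis h are assumed threshold functions.
def c(x):            # unknown target concept (assumption for illustration)
    return x > 0.3

def h(x):            # learner's hypothesis
    return x > 0.5

def estimate_true_error(num_samples=100_000, seed=0):
    rng = random.Random(seed)
    errors = sum(c(x) != h(x) for x in (rng.random() for _ in range(num_samples)))
    return errors / num_samples

print(estimate_true_error())   # ~0.2: the mass of P (uniform on [0, 1]) where c and h disagree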
Version Spaces
First consider when training error of h is zero
Version space VS_{H,D}: the subset of hypotheses in H consistent with the training data D
[Figure: hypothesis space H with example hypotheses annotated by their errors, e.g. (L_train = 0.2, L_true = 0.1), (L_train = 0, L_true = 0.2), (L_train = 0.4, L_true = 0.3), (L_train = 0, L_true = 0.1), (L_train = 0.1, L_true = 0.3), (L_train = 0.3, L_true = 0.2); the version space contains the hypotheses with L_train = 0]
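As an illustration, a brute-force version-space computation over a tiny finite hypothesis space; the threshold classifiers on {0, ..., 9} are an assumed toy class, not part of the slides.

# A minimal sketch: enumerate the hypotheses consistent with the training data.
# Hypotheses are thresholds h_t(x) = 1 if x >= t else 0 (an assumption for illustration).
X = range(10)
H = {t: (lambda x, t=t: int(x >= t)) for t in range(11)}   # 11 threshold hypotheses

# Labeled training data generated by the (unknown) target threshold t = 4.
D = [(x, int(x >= 4)) for x in (1, 3, 6, 8)]

# Version space: hypotheses consistent with every training example.
VS = [t for t, h in H.items() if all(h(x) == y for x, y in D)]
print(VS)   # thresholds consistent with D, i.e. [4, 5, 6]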
Can we bound the error in the version space?
How Likely Is the Learner to Pick a Bad Hypothesis?
Theorem
If the hypothesis space H is finite and D is a sequence of N ≥ 1 independent random examples of some target concept c, then for any 0 ≤ ε ≤ 1, the probability that VS_{H,D} contains a hypothesis with true error greater than ε is less than |H| e^{−εN}:
Pr(∃h ∈ H : L_train(h) = 0 ∧ L_true(h) ≥ ε) ≤ |H| e^{−εN}
How Likely Is the Learner to Pick a Bad Hypothesis?
Proof.
Pr((L_train(h₁) = 0 ∧ L_true(h₁) ≥ ε) ∨ · · · ∨ (L_train(h_|H|) = 0 ∧ L_true(h_|H|) ≥ ε))
≤ Σ_{h∈H} Pr(L_train(h) = 0 ∧ L_true(h) ≥ ε)    (union bound)
≤ Σ_{h∈H} Pr(L_train(h) = 0 | L_true(h) ≥ ε)    (bound using Bayes' rule)
≤ Σ_{h∈H} (1 − ε)^N    (bound on the individual hᵢ's)
≤ |H| (1 − ε)^N    (k ≤ |H|)
≤ |H| e^{−εN}    (1 − ε ≤ e^{−ε}, for 0 ≤ ε ≤ 1)
Using a Probably Approximately Correct (PAC) Bound
If we want this probability to be at most δ:
|H| e^{−εN} ≤ δ
Pick ε and δ, compute N:
N ≥ (1/ε) (ln|H| + ln(1/δ))
Pick N and δ, compute ε:
ε ≥ (1/N) (ln|H| + ln(1/δ))
Note: the number of M-ary Boolean functions is 2^{2^M}, so the bounds have an exponential dependency on the number of features M
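A small Python sketch of both uses of the bound; the values of |H|, ε and δ are arbitrary.

from math import log, ceil

# A minimal sketch of the two uses of the finite-|H| PAC bound above.
def samples_needed(H_size, eps, delta):
    """N >= (1/eps) * (ln|H| + ln(1/delta))"""
    return ceil((log(H_size) + log(1 / delta)) / eps)

def error_bound(H_size, N, delta):
    """eps >= (1/N) * (ln|H| + ln(1/delta))"""
    return (log(H_size) + log(1 / delta)) / N

# Example numbers (arbitrary): |H| = 2**20 hypotheses, eps = 0.05, delta = 0.01.
print(samples_needed(2**20, 0.05, 0.01))   # ~370 examples
print(error_bound(2**20, 1000, 0.01))      # eps guaranteed with probability 1 - delta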
Example: Learning Conjunctions
Suppose H contains conjunctions of constraints on up to M Boolean
attributes (i.e., M literals)
|H| = 3^M (each attribute is either required true, required false, or ignored)
How many examples are sufficient to ensure with probability at least (1 − δ) that every h in VS_{H,D} satisfies L_true(h) ≤ ε?
N ≥ (1/ε) (ln 3^M + ln(1/δ)) = (1/ε) (M ln 3 + ln(1/δ))
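For example, plugging |H| = 3^M into the bound, with illustrative values of M, ε and δ:

from math import log, ceil

# Illustrative numbers (assumptions): M = 10 Boolean attributes, eps = 0.1, delta = 0.05.
M, eps, delta = 10, 0.1, 0.05
N = ceil((M * log(3) + log(1 / delta)) / eps)
print(N)   # ~140 examples suffice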
PAC Learning
Consider a class C of possible target concepts defined over a set of instances
X of length n, and a Learner L using hypothesis space H.
Definition
C is PAC-learnable if there exists an algorithm L such that for every f ∈ C, for any distribution P, for any ε such that 0 ≤ ε < 1/2, and δ such that 0 ≤ δ < 1, algorithm L, with probability at least 1 − δ, outputs a concept h such that L_true(h) ≤ ε using a number of samples that is polynomial in 1/ε and 1/δ
Definition
C is efficiently PAC-learnable by L using H iff for all c ∈ C, distributions P over X, ε such that 0 < ε < 1/2, and δ such that 0 < δ < 1/2, learner L will with probability at least (1 − δ) output a hypothesis h ∈ H such that L_true(h) ≤ ε, in time that is polynomial in 1/ε, 1/δ, M and size(c)
Agnostic Learning
Usually the training error is not equal to zero: the version space is empty!
What Happens with Inconsistent Hypotheses?
We need to bound the gap between training and true errors:
L_true(h) ≤ L_train(h) + ε
Using the Hoeffding bound: for N i.i.d. coin flips X₁, . . . , X_N, where Xᵢ ∈ {0, 1} and 0 < ε < 1, we define the empirical mean X̄ = (1/N)(X₁ + · · · + X_N), obtaining the following bound:
Pr(E[X] − X̄ > ε) ≤ e^{−2Nε²}
Theorem
Hypothesis space H finite, dataset D with N i.i.d. samples, 0 < ε < 1: for any learned hypothesis h:
Pr(L_true(h) − L_train(h) > ε) ≤ |H| e^{−2Nε²}
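A quick empirical check of the Hoeffding bound; the coin bias, N, ε and the number of trials are arbitrary.

import math, random

# A minimal sketch: flip N biased coins, measure how often the empirical mean
# underestimates the true mean by more than eps, and compare with exp(-2*N*eps**2).
def simulate(p=0.6, N=100, eps=0.1, trials=20_000, seed=0):
    rng = random.Random(seed)
    bad = 0
    for _ in range(trials):
        mean = sum(rng.random() < p for _ in range(N)) / N
        bad += (p - mean) > eps            # event {E[X] - X_bar > eps}
    return bad / trials

print(simulate())                          # empirical frequency of the bad event
print(math.exp(-2 * 100 * 0.1**2))         # Hoeffding bound: ~0.135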
PAC Bound and Bias-Variance Tradeoff
L_true(h) ≤ L_train(h) + √( (ln|H| + ln(1/δ)) / (2N) )
(the L_train(h) term plays the role of the bias, the square-root term that of the variance)
For large |H|
Low bias (assuming we can find a good h)
High variance (because the bound is looser)
For small |H|
High bias (is there a good h?)
Low variance (tighter bound)
Given δ and ε, how large should N be?
N ≥ (1/(2ε²)) (ln|H| + ln(1/δ))
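A minimal sketch of the two directions of this agnostic bound, with arbitrary values for |H|, N, ε and δ.

from math import log, sqrt, ceil

# A minimal sketch of the agnostic PAC bound above (numbers are arbitrary).
def generalization_gap(H_size, N, delta):
    """Bound on L_true - L_train holding with probability 1 - delta."""
    return sqrt((log(H_size) + log(1 / delta)) / (2 * N))

def samples_needed_agnostic(H_size, eps, delta):
    """N >= (1 / (2 * eps**2)) * (ln|H| + ln(1/delta))"""
    return ceil((log(H_size) + log(1 / delta)) / (2 * eps**2))

print(generalization_gap(2**20, 1000, 0.05))        # gap shrinks as N grows
print(samples_needed_agnostic(2**20, 0.05, 0.05))   # note the 1/eps**2 dependence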
What about Continuous Hypothesis Spaces?
Continuous hypothesis space
|H| = ∞
Infinite variance???
Example: Learning Axis Aligned Rectangles
We want to learn an unknown target axis-aligned rectangle R
We have randomly drawn samples, each labeled with whether the point is contained in R or not
Consider the hypothesis corresponding to the tightest rectangle R′ around the positive samples
The error region is the difference between R and R′, which can be seen as the union of four rectangular regions
[Figure: target rectangle R, the tightest rectangle R′ around the positive (+) samples, and the four strips forming R \ R′]
Example: Learning Axis Aligned Rectangles
[Figure: rectangles R and R′ as on the previous slide, with one of the four error strips highlighted]
In each of these four regions we want an error of less than ε/4
When N samples are drawn, the bad event for one region is that all N samples fall outside it, which happens with probability at most (1 − ε/4)^N
The same holds for the other three regions, and so by the union bound the probability of some bad event is at most 4(1 − ε/4)^N
We want the probability of a bad event to be less than δ:
4(1 − ε/4)^N ≤ δ
By exploiting the inequality (1 − x) ≤ e^{−x}, we get:
N ≥ (4/ε) ln(4/δ)
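A small sketch that computes this sample size and runs one illustrative experiment; the target rectangle, ε and δ are made-up numbers.

from math import log, ceil
import random

# Sample size from the bound N >= (4/eps) * ln(4/delta), plus a Monte Carlo sanity check.
eps, delta = 0.1, 0.05
N = ceil(4 / eps * log(4 / delta))
print(N)   # ~176 samples

rng = random.Random(0)
R = (0.2, 0.7, 0.3, 0.9)                      # assumed target rectangle (x1, x2, y1, y2)
inside = lambda p, r: r[0] <= p[0] <= r[1] and r[2] <= p[1] <= r[3]
pts = [(rng.random(), rng.random()) for _ in range(N)]
pos = [p for p in pts if inside(p, R)]
Rp = (min(x for x, _ in pos), max(x for x, _ in pos),
      min(y for _, y in pos), max(y for _, y in pos))    # tightest rectangle R'
test = [(rng.random(), rng.random()) for _ in range(100_000)]
err = sum(inside(p, R) != inside(p, Rp) for p in test) / len(test)
print(err)   # estimated true error of R', typically well below eps = 0.1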
What about Continuous Hypothesis Spaces?
Continuous hypothesis space
|H| = ∞
Infinite variance???
What matters is the number of points that can be classified exactly!
Question: can we bound the error as a function of the number of points that can be exactly labeled?
VC-Dimension
Shattering a Set of Instances
Definition (Dichotomy)
A dichotomy of a set S is a partition of S into two disjoint subsets
Definition (Shattering)
A set of instances S is shattered by hypothesis space H if and only if for
every dichotomy of S there exists some hypothesis in H consistent with this
dichotomy
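A brute-force shattering check for a finite hypothesis class; the 1-D threshold classifiers used here are an assumed toy example.

from itertools import product

# A minimal sketch: S is shattered by H iff every labeling of S is realized by some h in H.
def is_shattered(points, hypotheses):
    realized = {tuple(h(x) for x in points) for h in hypotheses}
    return all(lab in realized for lab in product([0, 1], repeat=len(points)))

# Assumed toy class: thresholds h_t(x) = 1 if x >= t else 0.
thresholds = [lambda x, t=t: int(x >= t) for t in range(11)]

print(is_shattered([5], thresholds))      # True: a single point can be shattered
print(is_shattered([3, 7], thresholds))   # False: the labeling (1, 0) is not realizable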
Example: Three Instances Shattered
[Figure: three instances and hypotheses realizing their dichotomies, panels (a) and (b)]
Example: Four Instances Shattered
[Figure: four instances labeled + − / − + (XOR configuration)]
VC Dimension
Definition
The Vapnik-Chervonenkis dimension, VC(H), of hypothesis space H
defined over instance space X is the size of the largest finite subset of X
shattered by H. If arbitrarily large finite sets of X can be shattered by H, then
VC(H) ≡ ∞
VC Dimension of Linear Decision Surfaces
How many points can a linear boundary classify exactly in 1-D?
2
How many points can a linear boundary classify exactly in 2-D?
3 (see the brute-force check after this list)
How many points can a linear boundary classify exactly in M-D?
M+1
Rule of thumb: number of parameters in model often matches max
number of points
But in general it can be completely different!
There are problems where the number of parameters is infinite (e.g., SVMs) and the VC dimension is finite!
There can also be a hypothesis space with 1 parameter and infinite
VC-dimension!
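A randomized brute-force check (not a proof) of the 2-D case referenced above; the point coordinates and the number of random hyperplanes tried are arbitrary.

import random
from itertools import product

# Sketch: three points in general position can be shattered by linear classifiers
# sign(w1*x + w2*y + b), while the XOR configuration of four points cannot.
def linearly_realizable(points, labels, trials=50_000, seed=0):
    rng = random.Random(seed)
    for _ in range(trials):
        w1, w2, b = [rng.gauss(0, 1) for _ in range(3)]
        if all(int(w1 * x + w2 * y + b > 0) == lab for (x, y), lab in zip(points, labels)):
            return True
    return False

three = [(0, 0), (1, 0), (0, 1)]
four_xor = [(0, 0), (1, 1), (1, 0), (0, 1)]
print(all(linearly_realizable(three, lab) for lab in product([0, 1], repeat=3)))      # True
print(all(linearly_realizable(four_xor, lab) for lab in product([0, 1], repeat=4)))   # False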
VC-Dimension Examples
Examples:
Linear classifier
VC(H) = M + 1, for M features plus constant term
Neural networks
VC(H) = number of parameters
Local minima mean NNs will probably not find the best parameters
1-Nearest neighbor
VC(H) = ∞
SVM with Gaussian Kernel
VC(H) = ∞
Sample Complexity from VC Dimension
How many randomly drawn examples suffice to guarantee error of at most ε with probability at least (1 − δ)?
N ≥ (1/ε) (4 log₂(2/δ) + 8 VC(H) log₂(13/ε))
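A minimal sketch that evaluates this sample-complexity bound for an assumed linear classifier with M = 10 features.

from math import log2, ceil

# N >= (1/eps) * (4*log2(2/delta) + 8*VC(H)*log2(13/eps)); all numbers are illustrative.
def samples_needed_vc(vc_dim, eps, delta):
    return ceil((4 * log2(2 / delta) + 8 * vc_dim * log2(13 / eps)) / eps)

# Linear classifier with M = 10 features: VC(H) = M + 1 = 11.
print(samples_needed_vc(11, 0.1, 0.05))   # about 6400 examples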
PAC Bound using VC dimension
L_true(h) ≤ L_train(h) + √( (VC(H) (ln(2N/VC(H)) + 1) + ln(4/δ)) / N )
Same bias/variance tradeoff as always
Now, just a function of VC(H)
Structural Risk Minimization: choose the hypothesis space H to
minimize the above bound on expected true error!
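A minimal Structural Risk Minimization sketch in Python: the candidate hypothesis spaces, their VC dimensions and training errors are made-up numbers used only to illustrate choosing H by the bound.

from math import log, sqrt

# Pick the hypothesis space whose training error plus VC confidence term is smallest.
def vc_bound(train_err, vc_dim, N, delta=0.05):
    return train_err + sqrt((vc_dim * (log(2 * N / vc_dim) + 1) + log(4 / delta)) / N)

N = 2000
candidates = {            # name: (L_train, VC(H)) -- assumed values
    "degree-1": (0.15, 3),
    "degree-5": (0.08, 7),
    "degree-20": (0.02, 22),
}
for name, (err, vc) in candidates.items():
    print(name, round(vc_bound(err, vc, N), 3))
# SRM selects the space with the smallest bound, not the smallest training error.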
VC Dimension Properties
Theorem
The VC dimension of a finite hypothesis space H (|H| < ∞) is bounded from above:
VC(H) ≤ log₂(|H|)
Proof.
If VC(H) = d, then H must realize all 2^d labelings of a shattered set of size d, so it contains at least 2^d distinct functions: |H| ≥ 2^d, hence VC(H) ≤ log₂(|H|)
Theorem
Concept class C with VC(C) = ∞ is not PAC-learnable.