Lecture 12
Statistical Learning: First Steps
Sasha Rakhlin
Oct 17, 2019
Outline
Setup
Perceptron
What is Generalization?
log(1 + 2 + 3) = log(1) + log(2) + log(3)
log(1 + 1.5 + 5) = log(1) + log(1.5) + log(5)
log(2 + 2) = log(2) + log(2)
log(1 + 1.25 + 9) = log(1) + log(1.25) + log(9)
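(These identities are not typos: in each case the sum of the numbers equals their product, so the log of the sum really is the sum of the logs. The quick Python check below verifies all four; the danger, presumably the point of the slide, is "generalizing" from such examples to a rule that is false in general.)

import math

# each tuple has sum == product, so log(sum) == sum of logs holds exactly
for t in [(1, 2, 3), (1, 1.5, 5), (2, 2), (1, 1.25, 9)]:
    lhs = math.log(sum(t))
    rhs = sum(math.log(x) for x in t)
    print(t, abs(lhs - rhs) < 1e-12)   # True for every triple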
Supervised Learning: data S = {(X1 , Y1 ), . . . , (Xn , Yn )} are i.i.d. from an unknown distribution P.
Learning algorithm: a mapping {(X1 , Y1 ), . . . , (Xn , Yn )} ↦ f̂n .
Goals:
▸ Prediction: small expected loss
L(f̂n ) = E_{X,Y} ℓ(Y, f̂n (X)).
Here (X, Y) ∼ P. Interpretation: good prediction on a random example from the same population.
▸ Estimation: small ∥f̂n − f∗ ∥ or ∥θ̂ − θ∗ ∥, where f∗ or θ∗ are parameters of P (e.g. the regression function f∗ (x) = E[Y∣X = x], or f∗ (x) = ⟨θ∗ , x⟩, etc.).
In this course, we mostly focus on prediction, but will also outline connections between prediction and estimation.
Why not estimate the underlying distribution P first?
This is in general a harder problem than prediction. Consider classification.
We might be attempting to learn parts/properties of the distribution that
are irrelevant, while all we care about is the “boundary” between the two
classes.
Key difficulty: our goals are in terms of unknown quantities related to the unknown P. We have to use empirical data instead. This is the purview of statistics.
For instance, we can calculate the empirical loss of f ∶ X → Y:
L̂(f) = (1/n) ∑_{i=1}^{n} ℓ(Y_i , f(X_i ))
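In code the empirical loss is one line (a minimal sketch; the loss ℓ, the predictor f, and the synthetic data here are placeholders, not from the slides):

import numpy as np

def empirical_loss(loss, f, X, Y):
    # L̂(f) = (1/n) Σ_i loss(Y_i, f(X_i))
    return np.mean([loss(y, f(x)) for x, y in zip(X, Y)])

# illustrative usage: squared loss with an arbitrary fixed linear predictor
rng = np.random.default_rng(0)
theta = np.array([1.0, -2.0, 0.5])
X = rng.standard_normal((100, 3))
Y = X @ theta + 0.1 * rng.standard_normal(100)
print(empirical_loss(lambda y, yh: (y - yh) ** 2, lambda x: x @ theta, X, Y))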
The function x ↦ f̂n (x) = f̂n (x; X1 , Y1 , . . . , Xn , Yn ) is random, since it depends on the random data S = (X1 , Y1 , . . . , Xn , Yn ). Thus, the risk
L(f̂n ) = E[ℓ(f̂n (X), Y) ∣ S]
= E[ℓ(f̂n (X; X1 , Y1 , . . . , Xn , Yn ), Y) ∣ S]
is a random variable. We might aim for EL(f̂n ) small, or L(f̂n ) small with high probability (over the training data).
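To see this randomness concretely, here is a small simulation (a sketch under assumed Gaussian data, with ordinary least squares standing in for the learning algorithm): each fresh draw of the training sample S yields a different f̂n, hence a different value of L(f̂n), estimated here on a large test sample.

import numpy as np

rng = np.random.default_rng(1)
theta_star = np.array([1.0, -1.0])

def risk_of_one_fit(n_train=30, n_test=100_000):
    # one draw of S -> one fitted OLS predictor -> its (estimated) risk L(f̂n)
    X = rng.standard_normal((n_train, 2))
    Y = X @ theta_star + rng.standard_normal(n_train)
    theta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
    Xt = rng.standard_normal((n_test, 2))
    Yt = Xt @ theta_star + rng.standard_normal(n_test)
    return np.mean((Yt - Xt @ theta_hat) ** 2)

risks = [risk_of_one_fit() for _ in range(20)]
print(np.round(risks, 3))   # 20 different values: L(f̂n) is a random variable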
Quiz: what is random here?
1. L̂(f) for a given fixed f
2. f̂n
3. L̂(f̂n )
4. L(f̂n )
5. L(f) for a given fixed f
It is important that these are understood before we proceed further.
Theoretical analysis of performance is typically easier if f̂n has a closed form (in terms of the training data).
E.g. ordinary least squares: f̂n (x) = x^T (X^T X)^{-1} X^T Y.
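As a sketch in numpy (assuming a design matrix X with full column rank; lstsq is used rather than an explicit inverse for numerical stability):

import numpy as np

def ols_predict(x, X, Y):
    # f̂n(x) = x^T (X^T X)^{-1} X^T Y, computed via least squares
    theta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return x @ theta_hat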
Unfortunately, most ML and many statistical procedures are not explicitly defined but arise as
▸ solutions to an optimization objective (e.g. logistic regression), or
▸ iterative procedures without an immediately obvious objective function (e.g. AdaBoost, Random Forests, etc.)
The Gold Standard
Within the framework we set up, the smallest expected loss is achieved by
the Bayes optimal function
f∗ = arg min_f L(f)
where the minimization is over all (measurable) prediction rules f ∶ X → Y.
The value of the lowest expected loss is called the Bayes error:
L(f∗ ) = inf_f L(f)
Of course, we cannot calculate any of these quantities since P is unknown.
Bayes Optimal Function
The Bayes optimal function f∗ takes the following forms in two particular cases (a toy numerical example follows):
▸ Binary classification (Y = {0, 1}) with the indicator loss:
f∗ (x) = I{η(x) ≥ 1/2}, where η(x) = E[Y∣X = x]
[Figure: plot of η(x), thresholded at 1/2]
▸ Regression (Y = R) with squared loss:
f∗ (x) = η(x), where η(x) = E[Y∣X = x]
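For intuition, here is a toy case where P is known, so η and both Bayes rules can be written down exactly (a hypothetical equal-prior Gaussian mixture, not from the slides):

import numpy as np

def eta(x):
    # η(x) = P(Y = 1 | X = x) for the toy model X|Y=1 ~ N(+1,1), X|Y=0 ~ N(-1,1),
    # P(Y=1) = 1/2; the Gaussian density ratio simplifies to a logistic in 2x
    return 1.0 / (1.0 + np.exp(-2.0 * x))

def f_star_classification(x):
    # Bayes rule under indicator loss: I{η(x) >= 1/2}, i.e. threshold at x = 0 here
    return (eta(x) >= 0.5).astype(int)

f_star_regression = eta   # Bayes rule under squared loss when Y ∈ {0, 1}

print(f_star_classification(np.array([-0.3, 0.0, 0.7])))   # -> [0 1 1]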
The big question: is there a way to construct a learning algorithm with a
guarantee that
L(f̂n ) − L(f∗ )
is small for large enough sample size n?
Consistency
An algorithm that ensures
lim_{n→∞} L(f̂n ) = L(f∗ ) almost surely
is called consistent. Consistency ensures that our algorithm is approaching
the best possible prediction performance as the sample size increases.
The good news: consistency is possible to achieve.
▸ easy if X is a finite or countable set
▸ not too hard if X is infinite, and the underlying relationship between x
and y is “continuous”
The bad news...
In general, we cannot prove anything quantitative about L(f̂n ) − L(f∗ ),
unless we make further assumptions (incorporate prior knowledge).
“No Free Lunch” Theorems: unless we posit assumptions,
▸ For any algorithm f̂n , any n, and any ε > 0, there exists a distribution P such that L(f∗ ) = 0 and
EL(f̂n ) ≥ 1/2 − ε
▸ For any algorithm f̂n , and any sequence an that converges to 0, there exists a probability distribution P such that L(f∗ ) = 0 and for all n
EL(f̂n ) ≥ an
References: Devroye, Györfi, Lugosi, A Probabilistic Theory of Pattern Recognition; Bousquet, Boucheron, Lugosi (2004).
Is this really “bad news”?
Not really. We always have some domain knowledge.
Two ways of incorporating prior knowledge:
▸ Direct way: assumptions on distribution P (e.g. margin, smoothness of
regression function, etc)
▸ Indirect way: redefine the goal to perform as well as a reference set F of predictors:
L(f̂n ) − inf_{f∈F} L(f)
F encapsulates our inductive bias.
We often make both of these assumptions.
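As a toy illustration of the relative goal (a sketch; the finite class F and the distribution P below are hypothetical, and the infimum over F is estimated by Monte Carlo on a large sample):

import numpy as np

rng = np.random.default_rng(2)
F = [lambda x, a=a: a * x for a in (0.0, 0.5, 1.0, 2.0)]   # small reference class

def est_risk(f, n=100_000):
    # estimate L(f) = E (Y - f(X))^2 under the toy model Y = X + noise
    X = rng.standard_normal(n)
    Y = X + 0.5 * rng.standard_normal(n)
    return np.mean((Y - f(X)) ** 2)

best_in_class = min(est_risk(f) for f in F)   # stand-in for inf_{f in F} L(f)
print(best_in_class)   # the benchmark against which L(f̂n) is compared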
We start our study of Statistical Learning with the classical Perceptron
algorithm.
Reason: simplicity. We will give a three-line proof of Perceptron, followed
by two interesting consequences with one-line proofs each. These
consequences are, perhaps, the easiest nontrivial statistical guarantees I can
think of.
Perceptron
(x1 , y1 ), . . . , (xT , yT ) ∈ X × {±1} (T may or may not be the same as n)
Maintain a hypothesis wt ∈ R^d (initialize w1 = 0).
On round t:
▸ Consider (xt , yt )
▸ Form prediction ŷt = sign(⟨wt , xt ⟩)
▸ If ŷt ≠ yt , update
wt+1 = wt + yt xt
else
wt+1 = wt
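A direct transcription of the update (a minimal sketch, assuming labels in {±1} and the examples as rows of X; note that sign(0) = 0 never matches a ±1 label, so a zero-margin prediction counts as a mistake, matching the convention I{y⟨w, x⟩ ≤ 0} used in the recap below):

import numpy as np

def perceptron(X, Y):
    # one pass of the Perceptron over the sequence (x_t, y_t)
    w = np.zeros(X.shape[1])        # w_1 = 0
    mistakes = 0
    for x, y in zip(X, Y):
        if y * (w @ x) <= 0:        # mistaken (or zero-margin) prediction
            w = w + y * x           # update: w_{t+1} = w_t + y_t x_t
            mistakes += 1
    return w, mistakes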
Perceptron
[Figure: geometric picture of the Perceptron update, showing the weight vector wt and a misclassified example xt]
For simplicity, suppose all data are in a unit ball, ∥xt ∥ ≤ 1.
Margin with respect to (x1 , y1 ), . . . , (xT , yT ):
γ = max_{∥w∥=1} min_{i∈[T]} (yi ⟨w, xi ⟩)+ ,
where (a)+ = max{0, a}.
Theorem (Novikoff ’62). Perceptron makes at most 1/γ² mistakes (and corrections) on any sequence of examples with margin γ.
Proof: Let m be the number of mistakes after T iterations. If a mistake is made on round t, then yt ⟨wt , xt ⟩ ≤ 0, so
∥wt+1 ∥² = ∥wt + yt xt ∥² ≤ ∥wt ∥² + 2 yt ⟨wt , xt ⟩ + 1 ≤ ∥wt ∥² + 1,
using ∥xt ∥ ≤ 1. Since w changes only on mistake rounds,
∥wT+1 ∥² ≤ m.
For the optimal hyperplane w∗ , on any mistake round,
γ ≤ ⟨w∗ , yt xt ⟩ = ⟨w∗ , wt+1 − wt ⟩ .
Hence, summing over the m mistake rounds (the rest contribute zero), telescoping, and applying Cauchy-Schwarz with ∥w∗ ∥ = 1,
mγ ≤ ⟨w∗ , wT+1 ⟩ ≤ ∥wT+1 ∥ ≤ √m.
Thus m ≤ 1/γ².
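The bound is easy to check numerically (a sanity-check sketch on synthetic separable data with a planted unit-norm separator; we cycle through the data until a mistake-free pass and count total mistakes):

import numpy as np

rng = np.random.default_rng(0)
w_star = np.array([1.0, 0.0])                    # planted unit-norm separator
gamma = 0.1
X = rng.uniform(-1.0, 1.0, size=(2000, 2))
keep = (np.linalg.norm(X, axis=1) <= 1.0) & (np.abs(X @ w_star) >= gamma)
X = X[keep]                                      # ||x_t|| <= 1 and margin >= gamma
Y = np.sign(X @ w_star)

w, mistakes, clean = np.zeros(2), 0, False
while not clean:                                 # cycle until a clean pass
    clean = True
    for x, y in zip(X, Y):
        if y * (w @ x) <= 0:
            w, mistakes, clean = w + y * x, mistakes + 1, False

print(mistakes, "<=", 1 / gamma**2)              # Novikoff: at most 1/gamma^2 mistakes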
Recap
For any T and (x1 , y1 ), . . . , (xT , yT ),
∑_{t=1}^{T} I{yt ⟨wt , xt ⟩ ≤ 0} ≤ D²/γ²
where γ = γ(x1∶T , y1∶T ) is the margin and D = D(x1∶T , y1∶T ) = max_t ∥xt ∥.
Let w∗ denote the max-margin hyperplane, ∥w∗ ∥ = 1.