Addendum: Bias Variance
Christopher Ré
October 25, 2020
By setting ∇θ ℓ(θ, λ) = 0 we can solve for the θ̂ that minimizes the above problem. Explicitly, we have:

θ̂ = (X^T X + λI)^{-1} X^T y    (1)
Since σᵢ² ≥ 0 for all i ∈ [d], if we set λ > 0 then X^T X + λI is full rank, and the inverse of (X^T X + λI) exists. In turn, this means there is a unique such θ̂.
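As a concrete illustration, here is a minimal NumPy sketch (dimensions, data, and the value of λ are arbitrary choices for the example) of computing θ̂ from Eq. 1 in the underdetermined regime, where it is precisely λ > 0 that makes the matrix invertible:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 20, 30, 0.1          # underdetermined: d > n, so X^T X alone is singular
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

# Closed-form ridge estimate from Eq. 1: (X^T X + lambda I)^{-1} X^T y
A = X.T @ X + lam * np.eye(d)
theta_hat = np.linalg.solve(A, X.T @ y)

# With lambda > 0 the matrix A is full rank, so theta_hat is unique
print(np.linalg.matrix_rank(A))  # prints 30 (= d)
```

Note that we solve the linear system rather than forming the inverse explicitly; this is the standard numerically preferable route.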
We will not write out the full formal argument, but it suffices to make one observation that is immediate from Eq. 1: the variance of θ̂ is proportional to the eigenvalues of (X^T X + λI)^{-1}. To see this, observe that the eigenvalues of an inverse are just the inverses of the eigenvalues:

eig((X^T X + λI)^{-1}) = { 1/(σ₁² + λ), ..., 1/(σ_d² + λ) }
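This eigenvalue identity is easy to check numerically. The following sketch (sizes and λ chosen arbitrarily) confirms that the eigenvalues of (X^T X + λI)^{-1} are exactly 1/(σᵢ² + λ):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, lam = 50, 5, 2.0
X = rng.standard_normal((n, d))

# sigma_i^2 are the eigenvalues of X^T X
sig2 = np.linalg.eigvalsh(X.T @ X)

# eigenvalues of the inverse are the inverses of the shifted eigenvalues
inv_eigs = np.linalg.eigvalsh(np.linalg.inv(X.T @ X + lam * np.eye(d)))
print(np.allclose(np.sort(inv_eigs), np.sort(1.0 / (sig2 + lam))))  # True
```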
Now, condition on the points we draw, namely X. Then recall that the randomness is in the label noise (recall the linear regression model y ∼ Xθ* + N(0, τ²I) = N(Xθ*, τ²I)).
Recall a fact about the multivariate normal distribution: if y ∼ N(μ, Σ), then Ay ∼ N(Aμ, AΣA^T) for any fixed matrix A. Applying this with A = (X^T X + λI)^{-1} X^T gives

E[θ̂] = (X^T X + λI)^{-1} X^T X θ*

and the matrix multiplying θ* has eigenvalues σᵢ²/(σᵢ² + λ) < 1. The last line suggests that the more regularization we add (the larger the λ), the more the estimated θ̂ will be shrunk towards 0. In other words, regularization adds bias (towards zero in this case). Though we pay the cost of higher bias, we gain by reducing the variance of θ̂. To see this bias-variance tradeoff concretely, observe the covariance matrix of θ̂:
C := Cov[θ̂]
  = (X^T X + λI)^{-1} X^T (τ² I) X (X^T X + λI)^{-1}

and

eig(C) = { τ²σ₁²/(σ₁² + λ)², ..., τ²σ_d²/(σ_d² + λ)² }
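We can sanity-check this covariance formula, and see the variance reduction directly, with a short NumPy sketch (dimensions, τ, and the λ grid are arbitrary choices for the example):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, tau = 40, 4, 0.5
X = rng.standard_normal((n, d))
sig2 = np.linalg.eigvalsh(X.T @ X)   # the sigma_i^2

def cov_eigs(lam):
    # C = (X^T X + lam I)^{-1} X^T (tau^2 I) X (X^T X + lam I)^{-1}
    M = np.linalg.inv(X.T @ X + lam * np.eye(d))
    C = M @ X.T @ (tau**2 * np.eye(n)) @ X @ M
    return np.linalg.eigvalsh(C)

for lam in [0.0, 1.0, 10.0]:
    predicted = tau**2 * sig2 / (sig2 + lam)**2
    print(lam, np.allclose(np.sort(cov_eigs(lam)), np.sort(predicted)))
```

Increasing λ shrinks every eigenvalue of C, i.e. the variance of θ̂ decreases monotonically in λ, which is the variance side of the tradeoff.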
Gradient Descent. We show that you can initialize gradient descent in a way that effectively regularizes underdetermined least squares, even with no regularization penalty (λ = 0). Our first observation is that any point x ∈ R^d can be decomposed into two orthogonal components x₀, x₁ such that x = x₀ + x₁, with x₀ ∈ Null(X) and x₁ ∈ Range(X^T).
Recall that Null(X) and Range(X^T) are orthogonal subspaces by the fundamental theorem of linear algebra. Writing P₀ for the projection onto the null space and P₁ for the projection onto the range, we have x₀ = P₀(x) and x₁ = P₁(x).
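These projections are straightforward to build numerically. In the sketch below (sizes arbitrary), P₁ is computed as pinv(X) X, which projects onto Range(X^T), and P₀ = I − P₁; the checks confirm the orthogonal decomposition:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 5, 8                       # d > n, so Null(X) is nontrivial
X = rng.standard_normal((n, d))

P1 = np.linalg.pinv(X) @ X        # projection onto Range(X^T)
P0 = np.eye(d) - P1               # projection onto Null(X)

x = rng.standard_normal(d)
x0, x1 = P0 @ x, P1 @ x
# x splits into orthogonal pieces, and the null piece is killed by X
print(np.allclose(x, x0 + x1), np.allclose(x0 @ x1, 0.0), np.allclose(X @ x0, 0.0))
```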
If one initializes at a point θ, we observe that the gradient is orthogonal to the null space. That is, if g(θ) = X^T (Xθ − y), then g(θ) lies in Range(X^T), so g(θ)^T P₀(v) = 0 for any v ∈ R^d. But then each gradient step θ ← θ − η g(θ) leaves the null-space component unchanged: P₀(θ) is the same before and after the update. That is, no learning happens in the null space. Whatever portion of the initialization lies in the null space stays there throughout execution.
A key property of the Moore-Penrose pseudoinverse is that if θ̂ = (X^T X)⁺ X^T y, then P₀(θ̂) = 0. Hence, the gradient descent solution initialized at θ₀ can be written θ̂ + P₀(θ₀). Two immediate observations: (1) if we initialize at θ₀ = 0 (or at any θ₀ with P₀(θ₀) = 0), gradient descent converges to the minimum-norm solution θ̂; (2) the null-space component of the initialization persists unchanged in the final solution, so initializing near zero acts as an implicit regularizer even though λ = 0.
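The following sketch demonstrates this decomposition empirically. The data here is synthetic (X is built with singular values in [1, 2] purely so that plain gradient descent with a fixed step size converges quickly); the claim being checked is that the iterate converges to θ̂ plus the null component of the initialization:

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 5, 8                              # underdetermined: Null(X) is nontrivial

# Build X with singular values in [1, 2] so gradient descent converges quickly
U, _ = np.linalg.qr(rng.standard_normal((n, n)))
V, _ = np.linalg.qr(rng.standard_normal((d, n)))
X = U @ np.diag(np.linspace(1.0, 2.0, n)) @ V.T
y = rng.standard_normal(n)

P0 = np.eye(d) - np.linalg.pinv(X) @ X   # projection onto Null(X)

theta0 = rng.standard_normal(d)          # init with a generic null-space component
theta = theta0.copy()
for _ in range(2000):                    # plain GD on least squares, lambda = 0
    theta -= 0.1 * X.T @ (X @ theta - y)

theta_pinv = np.linalg.pinv(X.T @ X) @ X.T @ y
# the limit is the pseudoinverse solution plus the null component of the init
print(np.allclose(theta, theta_pinv + P0 @ theta0))  # True
```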
We have argued that there are many ways to arrive at equivalent solutions, and that this lets us understand the effect of the model-fitting procedure itself as regularization. Many modern methods of machine learning, including dropout and data augmentation, are not penalty-based, but their effect is still understood as regularization. One contrast with the penalty methods above is that how much they effectively regularize often depends on properties of the data; in some sense, they adapt to the data. A final comment is that, in the same sense, adding more data regularizes the model as well!