
Some Calculations from Bias Variance

Christopher Ré
October 25, 2020

This note contains a reprise of the eigenvalue arguments for understanding how
variance is reduced by regularization. We also describe different ways regularization
can arise, including regularization that comes from the algorithm or from the
initialization. This note contains some additional calculations from the lecture and
Piazza, just so that we have typeset versions of them. They contain no new
information beyond the lecture, but they do supplement the notes.
Recall we have a design matrix X ∈ ℝ^{n×d} and labels y ∈ ℝ^n. We are interested
in the underdetermined case n < d, so that rank(X) ≤ n < d. We consider the
following optimization problem for least squares with a regularization parameter
λ ≥ 0:
ℓ(θ; λ) = min_{θ ∈ ℝ^d} (1/2)∥Xθ − y∥^2 + (λ/2)∥θ∥^2

Normal Equations  Computing derivatives as we did for the normal equations, we see that:

∇_θ ℓ(θ; λ) = X^T(Xθ − y) + λθ = (X^T X + λI)θ − X^T y

By setting ∇_θ ℓ(θ; λ) = 0 we can solve for the θ̂ that minimizes the above problem.
Explicitly, we have:

θ̂ = (X^T X + λI)^{-1} X^T y    (1)
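As a quick numerical illustration of Eq. 1 (a minimal numpy sketch, not part of the original note; the sizes n = 20, d = 50 and the value λ = 0.1 are arbitrary choices):

    import numpy as np

    rng = np.random.default_rng(0)
    n, d, lam = 20, 50, 0.1                # underdetermined: n < d
    X = rng.standard_normal((n, d))        # design matrix X in R^{n x d}
    y = rng.standard_normal(n)             # labels y in R^n

    # theta_hat = (X^T X + lam I)^{-1} X^T y from Eq. 1; solving the linear
    # system is numerically preferable to forming the inverse explicitly.
    theta_hat = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)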

To see that the inverse in Eq. 1 exists, we observe that X^T X is a symmetric,
real d×d matrix, so it has d eigenvalues (some may be 0). Moreover, it is positive
semidefinite, and we capture this by writing eig(X^T X) = {σ_1^2, . . . , σ_d^2}. Now,
inspired by the regularized problem, we examine:

eig(X^T X + λI) = {σ_1^2 + λ, . . . , σ_d^2 + λ}

Since σ_i^2 ≥ 0 for all i ∈ [d], if we set λ > 0 then X^T X + λI is full rank, and the
inverse of (X^T X + λI) exists. In turn, this means there is a unique such θ̂.
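Continuing the sketch from Eq. 1 (same X, lam, d), one can check numerically that X^T X has at most n nonzero eigenvalues, while adding λI lifts the whole spectrum by λ, so the regularized matrix is invertible:

    eigs = np.linalg.eigvalsh(X.T @ X)                        # d eigenvalues, all >= 0
    eigs_reg = np.linalg.eigvalsh(X.T @ X + lam * np.eye(d))

    print(int(np.sum(eigs > 1e-8)))                           # numerical rank: at most n < d
    print(np.allclose(eigs_reg, eigs + lam))                  # each eigenvalue shifted up by lam
    print(bool(eigs_reg.min() > 0))                           # full rank, so the inverse exists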

Variance  Recall that in bias-variance, we are concerned with the variance of θ̂
as we sample the training set. We want to argue that as the regularization
parameter λ increases, the variance in the fitted θ̂ decreases. We won't carry
out the full formal argument, but it suffices to make one observation that is
immediate from Eq. 1: the variance of θ̂ is proportional to the eigenvalues of
(X^T X + λI)^{-1}. To see this, observe that the eigenvalues of an inverse are just
the inverse of the eigenvalues:
eig((X^T X + λI)^{-1}) = {1/(σ_1^2 + λ), . . . , 1/(σ_d^2 + λ)}

Now, condition on the points we draw, namely X. Then, recall that the randomness
is in the label noise (recall the linear regression model y ∼ Xθ* + N(0, τ^2 I) =
N(Xθ*, τ^2 I)).
Recall a fact about the multivariate normal distribution:

if y ∼ N(µ, Σ) then Ay ∼ N(Aµ, AΣA^T)
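A quick Monte Carlo check of this fact (a sketch only; the particular A, µ, and Σ below are arbitrary, and rng comes from the earlier sketch):

    mu = np.array([1.0, -2.0])
    Sigma = np.array([[2.0, 0.5],
                      [0.5, 1.0]])
    A = np.array([[1.0, 2.0],
                  [0.0, 3.0],
                  [1.0, -1.0]])

    samples = rng.multivariate_normal(mu, Sigma, size=500_000)   # rows are draws of y
    Ay = samples @ A.T                                           # apply y -> A y to each draw

    print(np.allclose(Ay.mean(axis=0), A @ mu, atol=0.02))       # empirical mean is approx. A mu
    print(np.allclose(np.cov(Ay.T), A @ Sigma @ A.T, atol=0.1))  # empirical covariance is approx. A Sigma A^T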

Using linearity, we can verify that the expectation of θ̂ is

E[θ̂] = E[(X^T X + λI)^{-1} X^T y]
     = E[(X^T X + λI)^{-1} X^T (Xθ* + N(0, τ^2 I))]
     = E[(X^T X + λI)^{-1} X^T (Xθ*)]        (the noise term has mean zero)
     = (X^T X + λI)^{-1} (X^T X) θ*          (essentially a “shrunk” θ*)
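In the eigenbasis of X^T X, the matrix (X^T X + λI)^{-1}(X^T X) scales the component of θ* along the i-th eigenvector by σ_i^2/(σ_i^2 + λ) ≤ 1, which is the sense in which θ* is “shrunk”. A small sketch of this (reusing X, lam, d, and rng from the sketch after Eq. 1; theta_star is an arbitrary ground-truth vector introduced only for illustration):

    theta_star = rng.standard_normal(d)

    # (X^T X + lam I)^{-1} (X^T X), the shrinkage operator, applied to theta_star
    shrink = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ X)
    E_theta_hat = shrink @ theta_star

    # Every eigenvalue of the shrinkage operator lies in [0, 1), so the expected
    # estimate is pulled towards 0: its norm is smaller than that of theta_star.
    print(np.linalg.norm(E_theta_hat) < np.linalg.norm(theta_star))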

The last line of the derivation suggests that the more regularization we add (the
larger the λ), the more the estimated θ̂ will be shrunk towards 0. In other words,
regularization adds bias (towards zero in this case). Though we pay the cost of
higher bias, we gain by reducing the variance of θ̂. To see this bias-variance
tradeoff concretely, observe the covariance matrix of θ̂:

C := Cov[θ̂] = ((X^T X + λI)^{-1} X^T) (τ^2 I) (X (X^T X + λI)^{-1})

and

eig(C) = {τ^2 σ_1^2 / (σ_1^2 + λ)^2, . . . , τ^2 σ_d^2 / (σ_d^2 + λ)^2}

Notice that the entire spectrum of the covariance is a decreasing function of λ.
By decomposing in the eigenvalue basis, we can see that the variance
E[∥θ̂ − E[θ̂]∥^2] (the trace of C) is a decreasing function of λ, as desired.
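A sketch of this variance reduction (reusing X and d from earlier; the noise level τ and the two values of λ compared below are arbitrary choices):

    tau = 1.0

    def cov_spectrum(lam_val):
        # Cov[theta_hat] = (X^T X + lam I)^{-1} X^T (tau^2 I) X (X^T X + lam I)^{-1}
        M = np.linalg.solve(X.T @ X + lam_val * np.eye(d), X.T)
        return np.linalg.eigvalsh(tau ** 2 * M @ M.T)

    # Increasing lambda shrinks every eigenvalue of the covariance.
    print(bool(np.all(cov_spectrum(1.0) <= cov_spectrum(0.1) + 1e-12)))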

Gradient Descent  We show that you can initialize gradient descent in a way
that effectively regularizes underdetermined least squares, even with no
regularization penalty (λ = 0). Our first observation is that any point x ∈ ℝ^d can
be decomposed into two orthogonal components x_0, x_1 such that

x = x_0 + x_1   with   x_0 ∈ Null(X) and x_1 ∈ Range(X^T).

Recall that Null(X) and Range(X^T) are orthogonal subspaces by the fundamental
theorem of linear algebra. We write P_0 for the projection onto the null space and
P_1 for the projection onto the range; then x_0 = P_0(x) and x_1 = P_1(x).
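One concrete way to form these projections is through the pseudoinverse: P_1 = X^+ X projects onto Range(X^T) and P_0 = I − P_1 onto Null(X). A sketch (reusing X, d, and rng from earlier; the test point x is arbitrary):

    P1 = np.linalg.pinv(X) @ X           # projection onto Range(X^T), the row space of X
    P0 = np.eye(d) - P1                  # projection onto Null(X)

    x = rng.standard_normal(d)
    x0, x1 = P0 @ x, P1 @ x
    print(np.allclose(x, x0 + x1))       # x decomposes as x0 + x1
    print(np.allclose(X @ x0, 0))        # x0 lies in the null space of X
    print(abs(x0 @ x1) < 1e-8)           # the two components are orthogonal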
If one initializes at a point θ, then we observe that the gradient is orthogonal to
the null space. That is, if g(θ) = X^T(Xθ − y), then g(θ)^T P_0(v) = 0 for any v ∈ ℝ^d.
But then:

P_0(θ^(t+1)) = P_0(θ^(t) − α g(θ^(t))) = P_0(θ^(t)) − α P_0(g(θ^(t))) = P_0(θ^(t))

That is, no learning happens in the null space: whatever component of the
initialization lies in the null space stays there throughout execution.
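A sketch of this (reusing X, y, d, P0, and rng from above; the step size and iteration count are arbitrary): the null-space component of the iterate is the same before and after running gradient descent.

    alpha, T = 1e-3, 500
    theta = rng.standard_normal(d)                 # an arbitrary initialization theta^(0)
    null_component_at_init = P0 @ theta

    for _ in range(T):
        grad = X.T @ (X @ theta - y)               # gradient of the unregularized objective
        theta = theta - alpha * grad

    print(np.allclose(P0 @ theta, null_component_at_init))   # no learning happens in the null space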
A key property of the Moore-Penrose pseudoinverse is that if θ̂ = (X^T X)^+ X^T y,
then P_0(θ̂) = 0. Hence, the gradient descent solution initialized at θ_0 can be
written θ̂ + P_0(θ_0). Two immediate observations:

• Using the Moore-Penrose inverse acts as regularization, because it selects
the solution θ̂.

• So does gradient descent, provided that we initialize at θ_0 = 0 (see the
sketch below). This is particularly interesting, as many modern machine
learning techniques operate in these underdetermined regimes.
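A sketch of the second point (reusing X, y, and d from earlier; the step size and iteration count are arbitrary but large enough to converge): gradient descent initialized at zero lands on the same solution the pseudoinverse selects.

    theta_pinv = np.linalg.pinv(X) @ y             # equals (X^T X)^+ X^T y; P0(theta_pinv) = 0

    theta = np.zeros(d)                            # initialize at theta_0 = 0
    for _ in range(20_000):
        theta = theta - 1e-3 * X.T @ (X @ theta - y)

    print(np.allclose(theta, theta_pinv, atol=1e-6))   # the two solutions coincide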

We've argued that there are many ways to arrive at equivalent solutions, and that
this lets us understand the effect of the model-fitting procedure itself as
regularization. Many modern methods of machine learning, including dropout and
data augmentation, are not penalty-based, but their effect is still understood as
regularization. One contrast with the methods above is that how much they
effectively regularize often depends on some property of the data; in some sense,
they adapt to the data. A final comment is that, in the same sense as above,
adding more data regularizes the model as well!
