Notes: Linear Regression
1 Linear Regression
Given training data $\{(x_n, y_n)\}_{n=1}^{N}$, where $x_n \in \mathbb{R}^D$ and $y_n \in \mathbb{R}$, linear regression learns a model vector $w \in \mathbb{R}^D$ by minimizing the square error defined on the training data:
$$w^* = \operatorname*{argmin}_{w} \frac{1}{2} \sum_{n=1}^{N} (w^T x_n - y_n)^2 \qquad (1)$$
$$= \operatorname*{argmin}_{w} \underbrace{\frac{1}{2} \|Xw - y\|_2^2}_{f(w)}, \qquad (2)$$
where $X \in \mathbb{R}^{N \times D}$ is the matrix whose $n$-th row is $x_n^T$ and $y = [y_1, \ldots, y_N]^T$.
1. The gradient of $f(w)$ is
$$\nabla f(w) = X^T X w - X^T y.$$
2. The function $f(w)$ is a convex function. To see this, you can first assume $D = 1$, $N = 1$; then the function becomes the single-variable function $f(w) = \frac{1}{2}(xw - y)^2$, which is clearly convex. In general, we can verify the convexity of a function from its second-order derivative. In the linear regression case,
$$\nabla^2 f(w) = X^T X,$$
which is positive semidefinite ($v^T X^T X v = \|Xv\|^2 \geq 0$ for any $v$), so $f(w)$ is convex.
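As a quick sanity check, the gradient formula above can be verified numerically by finite differences. The following NumPy sketch uses small synthetic data (the sizes and random data are assumptions for illustration only):

```python
import numpy as np

# Synthetic data (assumed for illustration): N = 20 samples, D = 3 features.
rng = np.random.default_rng(0)
X = rng.standard_normal((20, 3))
y = rng.standard_normal(20)
w = rng.standard_normal(3)

def f(w):
    # f(w) = (1/2) ||Xw - y||^2, the squared-error objective.
    return 0.5 * np.sum((X @ w - y) ** 2)

# Analytic gradient: grad f(w) = X^T X w - X^T y.
grad = X.T @ X @ w - X.T @ y

# Central finite-difference approximation of each coordinate.
eps = 1e-6
grad_fd = np.array([
    (f(w + eps * e) - f(w - eps * e)) / (2 * eps)
    for e in np.eye(3)
])

print(np.allclose(grad, grad_fd, atol=1e-4))  # prints True
```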
For a convex and continuously differentiable function, we know $w^*$ is a global minimum of $f(w)$ if and only if $\nabla f(w^*) = 0$.
So, solving (2) is equivalent to finding a $w^*$ such that $\nabla f(w^*) = 0$. This means $w^*$ is a minimum if and only if it satisfies the following equation:
$$\text{normal equation:} \quad X^T X w^* = X^T y. \qquad (3)$$
This is called the "normal equation" for linear regression. To solve (3), we consider the following two cases:
When $X^T X$ is invertible, eq (3) directly implies that $w^* = (X^T X)^{-1} X^T y$ is the unique solution of linear regression. This often happens when we face an over-determined system, where the number of samples is much larger than the number of variables ($N \gg D$). An intuitive way to see this: when $N \gg D$, we have many training samples to fit but not enough degrees of freedom (number of variables), so it is unlikely that the data can be fit perfectly, and the minimizer is uniquely determined.
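In the invertible case, the closed-form solution can be computed directly, as in this NumPy sketch (the synthetic data and sizes are assumptions for illustration):

```python
import numpy as np

# Over-determined case (N >> D): the normal equation has a unique solution.
# Synthetic noiseless data (assumed), so the minimizer recovers w_true.
rng = np.random.default_rng(1)
N, D = 100, 3
w_true = np.array([2.0, -1.0, 0.5])
X = rng.standard_normal((N, D))
y = X @ w_true

# Solve X^T X w = X^T y. Using np.linalg.solve avoids forming the
# explicit inverse (X^T X)^{-1}.
w_star = np.linalg.solve(X.T @ X, X.T @ y)
print(w_star)  # recovers [2.0, -1.0, 0.5] up to floating-point error
```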
When $X^T X$ is not invertible, equation (3) does not have a unique solution. In fact, (3) will have an infinite number of solutions in this case, and so does our linear regression problem: there will be an infinite number of solutions $w^*$ that achieve the same minimal square error on the training data.
$X^T X$ will not be invertible when $N < D$. To illustrate why we have an infinite number of solutions, consider a two-dimensional problem ($D = 2$) where we have only one training sample $x_1 = [1, -1]$, $y_1 = 1$. We can see that $w = [a + 1, a]$ for any $a \in \mathbb{R}$ will get 0 training error:
$$w^T x_1 = (a + 1) - a = 1 = y_1.$$
This is true for any problem with $N < D$: in this case, you can always find a nonzero vector $v$ in the null space of $X$ (a vector such that $Xv = 0$), and then for a solution $w^*$, any vector $w^* + av$ with $a \in \mathbb{R}$ will get the same square error as $w^*$. This case ($N < D$) is also called the under-determined problem, since you have too many degrees of freedom in your problem and not enough constraints (data).
So how do we find a solution in the under-determined case? We can use the following approach to find the minimum-norm solution $w^+$: letting $W = \operatorname*{argmin}_w \|Xw - y\|^2$ denote the set of solutions, we aim to find the minimum-norm solution
$$w^+ = \operatorname*{argmin}_{w \in W} \|w\|_2. \qquad (4)$$
Definition 1 (Pseudo Inverse). If $A = U \Sigma V^T$ is the SVD of an m-by-n matrix $A$ with rank $r$, the pseudo inverse of $A$ can be defined by $A^+ = V \Sigma^+ U^T$. Note that $A^+$ is an n-by-m matrix, and $\Sigma^+$ is an n-by-m diagonal matrix with $\Sigma^+ = \operatorname{diag}[\frac{1}{\sigma_1}, \frac{1}{\sigma_2}, \ldots, \frac{1}{\sigma_r}, 0, \ldots, 0]$. If $A$ is invertible, we must have $m = n = r$, and in this case we can easily verify that the pseudo inverse is the same as the normal inverse matrix.
We claim that the minimum-norm solution of (4) is
$$w^+ = X^+ y.$$
To see this, let $X = U \Sigma V^T$ be the SVD of $X$. Then
$$\|Xw - y\| = \|U \Sigma V^T w - y\| = \|\Sigma V^T w - U^T y\|,$$
where the second equation comes from the fact that $U$ is unitary ($\|Uv\| = \|v\|$ for any $v$ if $U^T U = U U^T = I$).
Letting $z = V^T w$, we have $\|z\| = \|w\|$ (since $V$ is unitary). Therefore, finding the least-norm solution of $\|\Sigma V^T w - U^T y\|$ is equivalent to finding the least-norm solution of
$$\min_z \|\Sigma z - U^T y\|^2.$$
For this new problem, since $\Sigma$ is diagonal, it is clear that the minimum-norm solution is
$$z^+ = \Sigma^+ U^T y.$$
Therefore, the minimum-norm solution of the original system is
$$w^+ = V z^+ = V \Sigma^+ U^T y.$$
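The construction $w^+ = V \Sigma^+ U^T y$ can be carried out directly with an SVD, and compared against NumPy's built-in pseudo inverse. This is a sketch on synthetic under-determined data (the sizes and random data are assumptions; the random $X$ here has full row rank, so $w^+$ fits the data exactly):

```python
import numpy as np

# Under-determined case (N < D), synthetic data (assumed for illustration).
rng = np.random.default_rng(2)
N, D = 3, 5
X = rng.standard_normal((N, D))
y = rng.standard_normal(N)

# Manual pseudo inverse: X = U Sigma V^T, so w+ = V Sigma^+ U^T y.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
w_plus = Vt.T @ ((U.T @ y) / s)

# Same result via the built-in pseudo inverse.
w_pinv = np.linalg.pinv(X) @ y
print(np.allclose(w_plus, w_pinv))

# w+ achieves zero residual here, and adding a null-space direction v
# (X v = 0) keeps the residual but strictly increases the norm.
_, _, Vt_full = np.linalg.svd(X)
v = Vt_full[-1]  # last right singular vector spans part of the null space
print(np.allclose(X @ w_plus, y))
print(np.linalg.norm(w_plus + v) > np.linalg.norm(w_plus))
```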
1.1 Computational time
To compute the closed-form solution of linear regression, we can:
1. Compute $X^T X$ ($O(nd^2)$ time) and $X^T y$ ($O(nd)$ time);
2. Invert $X^T X$ ($O(d^3)$ time) and multiply the result with $X^T y$ ($O(d^2)$ time).
So the total time in this case is $O(nd^2 + d^3)$. In practice, one can replace these steps by Gaussian elimination, which can reduce the time to $O(nd^2)$.
1. When d is small, nd2 is not too expensive, so a closed form solution can
be easily computed for linear regression.
2. When d is large, nd2 is usually too large and we need to use other iterative
algorithms to solve linear regression (next lecture).
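As a preview of such iterative methods, here is a minimal gradient-descent sketch that only uses the gradient formula $\nabla f(w) = X^T X w - X^T y$ from above, so each iteration costs $O(nd)$. The synthetic data, step-size rule, and iteration count are assumptions chosen for this small example:

```python
import numpy as np

# Synthetic data (assumed for illustration).
rng = np.random.default_rng(3)
N, D = 50, 4
X = rng.standard_normal((N, D))
y = rng.standard_normal(N)

# Step size 1 / lambda_max(X^T X) guarantees convergence for this
# quadratic objective (an assumed, conservative choice).
eta = 1.0 / np.linalg.norm(X.T @ X, 2)

w = np.zeros(D)
Xty = X.T @ y
for _ in range(1000):
    # Gradient step; X.T @ (X @ w) costs O(nd) per iteration.
    w -= eta * (X.T @ (X @ w) - Xty)

# Compare against the closed-form solution of the normal equation.
w_star = np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(w, w_star, atol=1e-6))  # prints True
```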