Notes: Linear Regression
1 Linear Regression
Given training data $\{(x_n, y_n)\}_{n=1}^{N}$, where $x_n \in \mathbb{R}^D$ and $y_n \in \mathbb{R}$, linear regression learns a model vector $w \in \mathbb{R}^D$ by minimizing the square error defined on the training data:
$$w^* = \operatorname*{argmin}_{w} \frac{1}{2} \sum_{n=1}^{N} (w^T x_n - y_n)^2 \qquad (1)$$
$$= \operatorname*{argmin}_{w} \underbrace{\frac{1}{2} \|Xw - y\|_2^2}_{f(w)}, \qquad (2)$$
where $X \in \mathbb{R}^{N \times D}$ is the matrix whose $n$-th row is $x_n^T$ and $y = [y_1, \ldots, y_N]^T$.
1. The gradient of $f(w)$ is
$$\nabla f(w) = X^T X w - X^T y.$$
2. The function $f(w)$ is a convex function. To see this, you can first assume $D = 1$, $N = 1$; then the function becomes the single-variable function $f(w) = \frac{1}{2}(xw - y)^2$, which is clearly convex. In general, we can verify the convexity of a function from its second-order derivative. In the linear regression case,
$$\nabla^2 f(w) = X^T X,$$
which is positive semidefinite ($v^T X^T X v = \|Xv\|^2 \geq 0$ for any $v$), so $f(w)$ is convex.
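As a quick sanity check, the gradient formula above can be verified numerically by finite differences. The following NumPy sketch uses small synthetic data (the sizes and random data are assumptions for illustration only):

```python
import numpy as np

# Synthetic data (assumed for illustration): N = 20 samples, D = 3 features.
rng = np.random.default_rng(0)
X = rng.standard_normal((20, 3))
y = rng.standard_normal(20)
w = rng.standard_normal(3)

def f(w):
    # f(w) = (1/2) ||Xw - y||^2, the squared-error objective.
    return 0.5 * np.sum((X @ w - y) ** 2)

# Analytic gradient: grad f(w) = X^T X w - X^T y.
grad = X.T @ X @ w - X.T @ y

# Central finite-difference approximation of each coordinate.
eps = 1e-6
grad_fd = np.array([
    (f(w + eps * e) - f(w - eps * e)) / (2 * eps)
    for e in np.eye(3)
])

print(np.allclose(grad, grad_fd, atol=1e-4))  # prints True
```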
For a convex and continuously differentiable function, we know $w^*$ is a global minimum of $f(w)$ if and only if $\nabla f(w^*) = 0$.
So, solving (2) is equivalent to finding a $w^*$ such that $\nabla f(w^*) = 0$. This means $w^*$ is a minimum if and only if it satisfies the following equation:
$$\text{normal equation:} \quad X^T X w^* = X^T y. \qquad (3)$$
This is called the "normal equation" for linear regression. To solve (3), we consider the following two cases:
When $X^T X$ is invertible, eq (3) directly implies that $w^* = (X^T X)^{-1} X^T y$ is the unique solution of linear regression. This often happens when we face an over-determined system, where the number of samples is much larger than the number of variables ($N \gg D$). An intuitive way to see this: when $N \gg D$, we have many training samples to fit but not enough degrees of freedom (number of variables), so it is unlikely that the data can be fit perfectly, and the minimizer is uniquely determined.
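In the invertible case, the closed-form solution can be computed directly, as in this NumPy sketch (the synthetic data and sizes are assumptions for illustration):

```python
import numpy as np

# Over-determined case (N >> D): the normal equation has a unique solution.
# Synthetic noiseless data (assumed), so the minimizer recovers w_true.
rng = np.random.default_rng(1)
N, D = 100, 3
w_true = np.array([2.0, -1.0, 0.5])
X = rng.standard_normal((N, D))
y = X @ w_true

# Solve X^T X w = X^T y. Using np.linalg.solve avoids forming the
# explicit inverse (X^T X)^{-1}.
w_star = np.linalg.solve(X.T @ X, X.T @ y)
print(w_star)  # recovers [2.0, -1.0, 0.5] up to floating-point error
```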
When $X^T X$ is not invertible, equation (3) does not have a unique solution. In fact, (3) will have an infinite number of solutions in this case, and so does our linear regression problem: there will be an infinite number of solutions $w^*$ that achieve the same minimal square error on the training data.
$X^T X$ will not be invertible when $N < D$. To illustrate why we have an infinite number of solutions, consider a two-dimensional problem ($D = 2$) where we have only one training sample $x_1 = [1, -1]$, $y_1 = 1$. We can see that $w = [a + 1, a]$ for any $a \in \mathbb{R}$ will get 0 training error:
$$w^T x_1 = (a + 1) - a = 1 = y_1.$$
This is true for any problem with $N < D$: in this case, you can always find a nonzero vector $v$ in the null space of $X$ (a vector such that $Xv = 0$), and then for a solution $w^*$, any vector $w^* + av$ with $a \in \mathbb{R}$ will get the same square error as $w^*$. This case ($N < D$) is also called the under-determined problem, since you have too many degrees of freedom in your problem and not enough constraints (data).
So how do we find a solution in the under-determined case? We can use the following approach to find the minimum-norm solution $w^+$: letting $W = \operatorname*{argmin}_w \|Xw - y\|^2$ denote the set of solutions, we aim to find the minimum-norm solution
$$w^+ = \operatorname*{argmin}_{w \in W} \|w\|_2. \qquad (4)$$
Definition 1 (Pseudo Inverse). If $A = U \Sigma V^T$ is the SVD of an m-by-n matrix $A$ with rank $r$, the pseudo inverse of $A$ can be defined by $A^+ = V \Sigma^+ U^T$. Note that $A^+$ is an n-by-m matrix, and $\Sigma^+$ is an n-by-m diagonal matrix with $\Sigma^+ = \operatorname{diag}[\frac{1}{\sigma_1}, \frac{1}{\sigma_2}, \ldots, \frac{1}{\sigma_r}, 0, \ldots, 0]$. If $A$ is invertible, we must have $m = n = r$, and in this case we can easily verify that the pseudo inverse is the same as the normal inverse matrix.
We claim that the minimum-norm solution of (4) is
$$w^+ = X^+ y.$$
To see this, let $X = U \Sigma V^T$ be the SVD of $X$. Then
$$\|Xw - y\| = \|U \Sigma V^T w - y\| = \|\Sigma V^T w - U^T y\|,$$
where the second equation comes from the fact that $U$ is unitary ($\|Uv\| = \|v\|$ for any $v$ if $U^T U = U U^T = I$).
Letting $z = V^T w$, we have $\|z\| = \|w\|$ (since $V$ is unitary). Therefore, finding the least-norm solution of $\|\Sigma V^T w - U^T y\|$ is equivalent to finding the least-norm solution of
$$\min_z \|\Sigma z - U^T y\|^2.$$
For this new problem, since $\Sigma$ is diagonal, it is clear that the minimum-norm solution is
$$z^+ = \Sigma^+ U^T y.$$
Therefore, the minimum-norm solution of the original system is
$$w^+ = V z^+ = V \Sigma^+ U^T y.$$
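The construction $w^+ = V \Sigma^+ U^T y$ can be carried out directly with an SVD, and compared against NumPy's built-in pseudo inverse. This is a sketch on synthetic under-determined data (the sizes and random data are assumptions; the random $X$ here has full row rank, so $w^+$ fits the data exactly):

```python
import numpy as np

# Under-determined case (N < D), synthetic data (assumed for illustration).
rng = np.random.default_rng(2)
N, D = 3, 5
X = rng.standard_normal((N, D))
y = rng.standard_normal(N)

# Manual pseudo inverse: X = U Sigma V^T, so w+ = V Sigma^+ U^T y.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
w_plus = Vt.T @ ((U.T @ y) / s)

# Same result via the built-in pseudo inverse.
w_pinv = np.linalg.pinv(X) @ y
print(np.allclose(w_plus, w_pinv))

# w+ achieves zero residual here, and adding a null-space direction v
# (X v = 0) keeps the residual but strictly increases the norm.
_, _, Vt_full = np.linalg.svd(X)
v = Vt_full[-1]  # last right singular vector spans part of the null space
print(np.allclose(X @ w_plus, y))
print(np.linalg.norm(w_plus + v) > np.linalg.norm(w_plus))
```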
1.1 Computational time
To compute the closed-form solution of linear regression, we can:
1. Compute $X^T X$ ($O(nd^2)$ time) and $X^T y$ ($O(nd)$ time);
2. Invert $X^T X$ ($O(d^3)$ time) and multiply the result with $X^T y$ ($O(d^2)$ time).
So the total time in this case is $O(nd^2 + d^3)$. In practice, one can replace these steps by Gaussian elimination, which can reduce the time to $O(nd^2)$.
1. When d is small, nd2 is not too expensive, so a closed form solution can
be easily computed for linear regression.
2. When d is large, nd2 is usually too large and we need to use other iterative
algorithms to solve linear regression (next lecture).
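As a preview of such iterative methods, here is a minimal gradient-descent sketch that only uses the gradient formula $\nabla f(w) = X^T X w - X^T y$ from above, so each iteration costs $O(nd)$. The synthetic data, step-size rule, and iteration count are assumptions chosen for this small example:

```python
import numpy as np

# Synthetic data (assumed for illustration).
rng = np.random.default_rng(3)
N, D = 50, 4
X = rng.standard_normal((N, D))
y = rng.standard_normal(N)

# Step size 1 / lambda_max(X^T X) guarantees convergence for this
# quadratic objective (an assumed, conservative choice).
eta = 1.0 / np.linalg.norm(X.T @ X, 2)

w = np.zeros(D)
Xty = X.T @ y
for _ in range(1000):
    # Gradient step; X.T @ (X @ w) costs O(nd) per iteration.
    w -= eta * (X.T @ (X @ w) - Xty)

# Compare against the closed-form solution of the normal equation.
w_star = np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(w, w_star, atol=1e-6))  # prints True
```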