
Accreditation, Testing & Quality Assurance Department

Peer-Assisted Study Sessions Program (PASS)


MATH2050 - Linear Algebra
Week 07, October 31, 2025
Fall 2025 Semester

Session 5: Least-squares
and Regularized Least-squares
1 Introduction
Let (1, 0), (2, 1), and (3, 3) be three points in the plane. How can you find the line y = c0 + c1 x that “best fits” these
points? One way is to note that if the three points were collinear, then the following system of equations would be
consistent:
c0 + c1 = 0
c0 + 2c1 = 1
c0 + 3c1 = 3
This system can be written in the matrix form Ax = b, where

A = \begin{bmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \end{bmatrix}, \qquad b = \begin{bmatrix} 0 \\ 1 \\ 3 \end{bmatrix}, \qquad x = \begin{bmatrix} c_0 \\ c_1 \end{bmatrix}.

Because the points are not collinear, however, the system is inconsistent. Although it is impossible to find x such that
Ax = b, you can look for an x that minimizes the norm of the error ∥Ax − b∥. The solution

x = \begin{bmatrix} c_0 \\ c_1 \end{bmatrix}

of this minimization problem is called the least squares regression line

y = c0 + c1 x.
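As a quick numerical cross-check, here is a minimal sketch that solves this small problem directly, assuming NumPy is available (np.linalg.lstsq computes the least squares solution):

    import numpy as np

    # Data points (1, 0), (2, 1), (3, 3); columns of A are [1, x]
    A = np.array([[1.0, 1.0],
                  [1.0, 2.0],
                  [1.0, 3.0]])
    b = np.array([0.0, 1.0, 3.0])

    # np.linalg.lstsq returns (solution, residuals, rank, singular values)
    (c0, c1), *_ = np.linalg.lstsq(A, b, rcond=None)
    print(c0, c1)   # roughly -1.667 and 1.5, i.e. y = -5/3 + (3/2) x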


Given an m × n matrix A and a vector b in Rm , the least squares problem is to find x in Rn such that

∥Ax − b∥2

is minimized.
REMARK: The term least squares comes from the fact that minimizing ∥Ax − b∥ is equivalent to
minimizing ∥Ax − b∥2 , which is a sum of squares.

Example. We consider the least squares problem with data

A = \begin{bmatrix} 2 & 0 \\ -1 & 1 \\ 0 & 2 \end{bmatrix}, \qquad b = \begin{bmatrix} 1 \\ 0 \\ -1 \end{bmatrix}.
The over-determined set of three equations in two variables Ax = b,

2x1 = 1, −x1 + x2 = 0, 2x2 = −1,

has no solution. (From the first equation we have x1 = 1/2, and from the last equation we have x2 = −1/2; but then
the second equation does not hold.) The corresponding least squares problem is

minimize (2x1 − 1)2 + (−x1 + x2 )2 + (2x2 + 1)2 .

Its unique solution is x̂ = (1/3, −1/3). The least squares approximate solution x̂ does not satisfy the equations
Ax = b; the corresponding residuals are

r̂ = Ax̂ − b = (−1/3, −2/3, 1/3),

with sum of squares value ∥Ax̂ − b∥2 = 2/3. Let us compare this to another choice of x, x̃ = (1/2, −1/2), which
corresponds to (exactly) solving the first and last of the three equations in Ax = b. It gives the residual

r̃ = Ax̃ − b = (0, −1, 0),

with sum of squares value ∥Ax̃ − b∥2 = 1.


The column interpretation tells us that

(1/3) \begin{bmatrix} 2 \\ -1 \\ 0 \end{bmatrix} + (-1/3) \begin{bmatrix} 0 \\ 1 \\ 2 \end{bmatrix} = \begin{bmatrix} 2/3 \\ -2/3 \\ -2/3 \end{bmatrix}
is the linear combination of the columns of A that is closest to b.
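The same numbers can be reproduced with a short NumPy sketch (NumPy assumed; the values below restate the results worked out above):

    import numpy as np

    A = np.array([[2.0, 0.0], [-1.0, 1.0], [0.0, 2.0]])
    b = np.array([1.0, 0.0, -1.0])

    x_hat = np.linalg.lstsq(A, b, rcond=None)[0]
    print(x_hat)                          # approximately [ 1/3, -1/3 ]
    print(np.sum((A @ x_hat - b) ** 2))   # approximately 2/3

    x_tilde = np.array([0.5, -0.5])       # solves only the first and third equations exactly
    print(np.sum((A @ x_tilde - b) ** 2)) # 1.0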

Now, we will derive several expressions for the solution of the least squares problem, under one assumption on
the data matrix A: the columns of A are linearly independent.

2 Ordinary Least-Squares (OLS)


2.1 Overview
In linear algebra, we often want to solve the system

Ax = b.

This system represents finding a set of coefficients x that, when applied to the columns of matrix A, produces the
vector b. Often, there is no exact solution (e.g., you have more equations than unknowns, so A is “tall”, i.e. the matrix
has more rows than columns).
The “best” approximate solution is found by minimizing the error. The Ordinary Least-Squares (OLS) method
finds the vector x that minimizes the squared L2 -norm (the sum of squares) of the error, or “residual” r (r = Ax − b):
Its objective function is:
Find x that minimizes ∥Ax − b∥22


2.2 Method 1
We know that any minimizer x̂ of the function
f (x) = ∥Ax − b∥2
must satisfy
∂f/∂xi (x̂) = 0,   i = 1, . . . , n,
which we can express as the vector equation
∇f (x̂) = 0,
where ∇f (x̂) is the gradient of f evaluated at x̂. So first, we need to find the gradient of f .
First, we can express the least squares objective as the squared L2 norm:
f (x) = ∥Ax − b∥2 .
We can rewrite this as a dot product:
f (x) = (Ax − b)T (Ax − b).
Expanding this expression:
f (x) = xT AT Ax − xT AT b − bT Ax + bT b.
Since xT AT b is a scalar (xT ∈ R1×n , AT ∈ Rn×m , b ∈ Rm ), its transpose is equal to itself:
(xT AT b)T = bT Ax = xT AT b.
Hence, the two middle terms are identical. Therefore,
f (x) = xT AT Ax − 2bT Ax + bT b.
Now we take the derivative with respect to the vector x. Using standard matrix calculus identities:
∂/∂x (xT Y x) = (Y + Y T)x,   and for symmetric Y = Y T,   ∂/∂x (xT Y x) = 2Y x,

and

∂/∂x (yT x) = y,

we have:

∂f(x)/∂x = ∂/∂x (xT AT Ax) − ∂/∂x (2bT Ax) + ∂/∂x (bT b).

Simplifying,

∂f(x)/∂x = 2AT Ax − 2AT b + 0.
Therefore, the gradient of f (x) with respect to x is
∇x f (x) = 2AT Ax − 2AT b.
Equivalently, it can be written in compact form as
∇x f (x) = 2AT (Ax − b).

2.3 Method 2
Writing the least squares objective out as a sum, we get

f(x) = ∥Ax − b∥2 = Σ_{i=1}^m ( Σ_{j=1}^n Aij xj − bi )2 .

To find ∇f(x)k , we take the partial derivative of f with respect to xk . Differentiating the sum term by term, we get

∇f(x)k = ∂f/∂xk (x) = 2 Σ_{i=1}^m ( Σ_{j=1}^n Aij xj − bi ) Aik = Σ_{i=1}^m 2(AT)ki (Ax − b)i = ( 2AT(Ax − b) )k .

This is our formula written out in terms of its components.


Continuing the derivation of the solution of the least squares problem, from either method we get:
∇f (x) = 2AT (Ax − b)
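This gradient formula can be checked against finite differences with a small sketch (assuming NumPy; the test matrix and point are arbitrary illustration data):

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.standard_normal((5, 3))
    b = rng.standard_normal(5)
    x = rng.standard_normal(3)

    grad = 2 * A.T @ (A @ x - b)          # the formula derived above

    # central finite differences of f(x) = ||Ax - b||^2, one coordinate at a time
    f = lambda v: np.sum((A @ v - b) ** 2)
    eps = 1e-6
    numeric = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps) for e in np.eye(3)])

    print(np.allclose(grad, numeric, atol=1e-4))   # True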


2.4 Solution using normal equations


Now, we have the same derivation of the solution of the least squares problem from both methods. Any minimizer x̂
of ∥Ax − b∥2 must satisfy
∇f (x̂) = 2AT (Ax̂ − b) = 0,
which can be written as
AT Ax̂ = AT b.
These equations are called the normal equations. The coefficient matrix AT A is the Gram matrix associated with
A; its entries are inner products of columns of A.
Our assumption that the columns of A are linearly independent implies that the Gram matrix AT A is invertible. This
implies that
x̂ = (AT A)−1 AT b
This is the only solution of the normal equations. So this must be the unique solution of the least squares problem.
The matrix (AT A)−1 AT is the pseudo-inverse of the matrix A. So we can write the solution of the least squares
problem in the simple form
x̂ = A† b
A† is a left inverse of A, which means that x̂ = A† b solves Ax = b if this set of over-determined equations has a
solution. But now we see that x̂ = A† b is the least squares approximate solution, i.e., it minimizes ∥Ax − b∥2 . And if
there is a solution of Ax = b, then x̂ = A† b is it.
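Both expressions can be evaluated directly in a short sketch (assuming NumPy), reusing the example matrix from Section 1:

    import numpy as np

    A = np.array([[2.0, 0.0], [-1.0, 1.0], [0.0, 2.0]])
    b = np.array([1.0, 0.0, -1.0])

    # Normal equations: solve (A^T A) x = A^T b rather than forming an explicit inverse
    x_ne = np.linalg.solve(A.T @ A, A.T @ b)

    # Same answer via the pseudo-inverse (np.linalg.pinv computes A† from the SVD)
    x_pinv = np.linalg.pinv(A) @ b

    print(x_ne, x_pinv)   # both approximately [ 1/3, -1/3 ]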

2.5 Solution via QR factorization


The QR factorization gives a simple formula for the pseudo-inverse. If A is left-invertible, its columns are linearly
independent and the QR factorization A = QR exists. We have

AT A = (QR)T (QR) = RT QT QR = RT R,

so
A† = (AT A)−1 AT = (RT R)−1 (QR)T = R−1 R−T RT QT = R−1 QT .
We can use the QR factorization to compute the least squares approximate solution. Let A = QR be the QR
factorization of A (which exists by our assumption that its columns are linearly independent). We have already seen
that the pseudo-inverse A† can be expressed as

A† = R−1 QT ,

so we have
x̂ = R−1 QT b

To compute x̂, we first multiply b by QT ; then we compute R−1 (QT b) using back substitution. This is summarized in
the following algorithm, which computes the least squares approximate solution x̂, given A and b.

Least Squares via QR Factorization


Given an m × n matrix A with linearly independent columns and an m-vector b.
1. QR factorization. Compute the QR factorization A = QR.

2. Compute QT b.
3. Back substitution. Solve the triangular equation Rx̂ = QT b.
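The three steps of this algorithm can be sketched as follows (assuming NumPy; np.linalg.qr returns the reduced factorization by default, and the back substitution loop is written out explicitly):

    import numpy as np

    def lstsq_qr(A, b):
        """Least squares via QR: factor A = QR, form Q^T b, then back-substitute R x = Q^T b."""
        Q, R = np.linalg.qr(A)             # step 1: reduced QR, Q is m x n, R is n x n upper triangular
        y = Q.T @ b                        # step 2
        n = R.shape[0]
        x = np.zeros(n)
        for i in range(n - 1, -1, -1):     # step 3: back substitution
            x[i] = (y[i] - R[i, i + 1:] @ x[i + 1:]) / R[i, i]
        return x

    A = np.array([[2.0, 0.0], [-1.0, 1.0], [0.0, 2.0]])
    b = np.array([1.0, 0.0, -1.0])
    print(lstsq_qr(A, b))                  # approximately [ 1/3, -1/3 ]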


3 Regularized Least-Squares
3.1 Overview
Regularization adds a “penalty” term to the original objective function. The goal is no longer just to fit the data, but
to balance two competing goals:
• Goal 1: Fit the data well (keep ∥Ax − b∥22 small).
• Goal 2: Keep the solution x “simple” by penalizing large coefficients (keep ∥x∥22 small).
The most common type of regularized least-squares is Tikhonov regularization, also known as Ridge Regression
or L2 Regularization.
Its objective function is:

Find x that minimizes ∥Ax − b∥22 + λ∥x∥22

• ∥Ax − b∥22 is the original “fit” term.


• ∥x∥22 is the “penalty” term. It is the squared L2 -norm of x (the sum of the squares of its coefficients).
• λ (lambda) is a non-negative scalar, called the regularization parameter (or hyperparameter). It controls the
trade-off.
– If λ = 0, you get the original OLS problem.
– If λ → ∞, you force x to be all zeros (a very “simple” but useless model).
– A good λ finds the right balance.

3.2 Solution
Find the vector x that minimizes the regularized objective function:
J(x) = ∥Ax − b∥22 + λ∥x∥22 .

3.2.1 Expand the Objective Function


Let’s expand the norms using the identity ∥v∥22 = v T v.
First Term: ∥Ax − b∥22 . This term becomes (Ax − b)T (Ax − b). Paying attention to transpose rules like (AB)T =
B T AT , we will have:

(Ax − b)T (Ax − b) = (xT AT − bT )(Ax − b)


= (xT AT )(Ax) − (xT AT )b − (bT )(Ax) + (bT )b
= xT AT Ax − xT AT b − bT Ax + bT b.
Now, the terms xT AT b and bT Ax are both scalars (1 × 1 matrices). The transpose of a scalar is just itself. Let’s see:
(bT Ax)T = xT AT (bT )T = xT AT b.
This means xT AT b and bT Ax are identical. We can combine them:

∥Ax − b∥22 = xT (AT A)x − 2xT AT b + bT b.


Second Term: λ∥x∥22 . This one is much easier:
λ∥x∥22 = λ(xT x).
Full Function J(x): Putting it all together, our function to minimize is:
J(x) = xT (AT A)x − 2xT AT b + bT b + λxT x.
We can group the xT (·)x terms:
J(x) = xT (AT A + λI)x − 2xT (AT b) + bT b.
(Note: we used λxT x = xT (λI)x to factor it into the main quadratic term.)


3.2.2 Take the Gradient


We need to find the gradient of J(x) with respect to the vector x, which we write as ∇x J(x).
We’ll use two standard rules from matrix calculus:

1. Gradient of a quadratic form:


∇x (xT M x) = (M + M T )x.
If M is symmetric (like our AT A + λI), this simplifies to

∇x (xT M x) = 2M x.

2. Gradient of a linear term:


∇x (xT c) = c,
where c is a vector that does not depend on x.

Let’s apply this to our function J(x):

J(x) = xT (AT A + λI)x − 2xT (AT b) + bT b,

where Part 1 is the quadratic term xT (AT A + λI)x, Part 2 is the linear term −2xT (AT b), and Part 3 is the constant bT b.

• Gradient of Part 1: The matrix M = AT A + λI is symmetric. So, we use rule (1), gradient of a quadratic
form:
∇x (Part 1) = 2(AT A + λI)x.

• Gradient of Part 2: The vector c = AT b is constant. Using rule (2), gradient of a linear term:

∇x (−2xT AT b) = −2(AT b).

• Gradient of Part 3: The term bT b is a constant scalar, so its derivative is zero:

∇x (bT b) = 0.

Combining the gradients:


∇x J(x) = 2(AT A + λI)x − 2AT b.

3.2.3 Set Gradient to Zero and Solve


To find the minimum, we set the gradient to the zero vector and solve for x:

∇x J(x) = 0

2(AT A + λI)x − 2AT b = 0


Divide by 2:
(AT A + λI)x − AT b = 0
Move the AT b term to the other side:
(AT A + λI)x = AT b
This is a standard linear system M x = c. To solve for x, we left-multiply by the inverse of the matrix M = (AT A+λI):

x = (AT A + λI)−1 AT b
This is the closed-form solution for regularized least squares.
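A minimal sketch of this formula, assuming NumPy (the example matrix is reused from Section 1 and the λ values are arbitrary illustration choices):

    import numpy as np

    def ridge(A, b, lam):
        """Solve (A^T A + lam * I) x = A^T b, the regularized normal equations."""
        n = A.shape[1]
        return np.linalg.solve(A.T @ A + lam * np.eye(n), A.T @ b)

    A = np.array([[2.0, 0.0], [-1.0, 1.0], [0.0, 2.0]])
    b = np.array([1.0, 0.0, -1.0])

    for lam in [0.0, 0.1, 1.0, 10.0]:
        x = ridge(A, b, lam)
        print(lam, x, np.linalg.norm(x))   # larger lam shrinks x toward zero

With lam = 0 this reproduces the OLS solution; as lam grows, the norm of x decreases, matching the trade-off described above.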


4 Practice Problem Set


[2]
Question 1: Find the least squares solution of the system Ax = b.

a. A = \begin{bmatrix} 2 & 1 \\ 1 & 2 \\ 1 & 1 \end{bmatrix}, \quad b = \begin{bmatrix} 2 \\ 0 \\ -3 \end{bmatrix}.

b. A = \begin{bmatrix} 1 & 0 & 1 \\ 1 & 1 & 1 \\ 0 & 1 & 1 \\ 1 & 1 & 0 \end{bmatrix}, \quad b = \begin{bmatrix} 4 \\ -1 \\ 0 \\ 1 \end{bmatrix}.

c. A = \begin{bmatrix} 0 & 2 & 1 \\ 1 & 1 & -1 \\ 2 & 1 & 0 \\ 1 & 1 & 1 \\ 0 & 2 & -1 \end{bmatrix}, \quad b = \begin{bmatrix} 1 \\ 0 \\ 1 \\ -1 \\ 0 \end{bmatrix}.

To find the least-squares solution x̂, we don’t solve Ax = b. Instead, we solve a related, “square” system called the
normal equations:

(AT A)x̂ = AT b
The solution x̂ is then:

x̂ = (AT A)−1 (AT b)


Let’s apply this four-step process to each of your problems:

1. Calculate AT A.
2. Calculate AT b.
3. Set up the system (AT A)x̂ = AT b.

4. Solve for x̂.

(a)

Given: A = \begin{bmatrix} 2 & 1 \\ 1 & 2 \\ 1 & 1 \end{bmatrix}, \quad b = \begin{bmatrix} 2 \\ 0 \\ -3 \end{bmatrix}.

1. Calculate AT A:

AT = \begin{bmatrix} 2 & 1 & 1 \\ 1 & 2 & 1 \end{bmatrix}
\;\Rightarrow\;
AT A = \begin{bmatrix} 2 & 1 & 1 \\ 1 & 2 & 1 \end{bmatrix} \begin{bmatrix} 2 & 1 \\ 1 & 2 \\ 1 & 1 \end{bmatrix}
= \begin{bmatrix} 4+1+1 & 2+2+1 \\ 2+2+1 & 1+4+1 \end{bmatrix}
= \begin{bmatrix} 6 & 5 \\ 5 & 6 \end{bmatrix}.

2. Calculate AT b:

AT b = \begin{bmatrix} 2 & 1 & 1 \\ 1 & 2 & 1 \end{bmatrix} \begin{bmatrix} 2 \\ 0 \\ -3 \end{bmatrix}
= \begin{bmatrix} 4+0-3 \\ 2+0-3 \end{bmatrix}
= \begin{bmatrix} 1 \\ -1 \end{bmatrix}.

3. Set up the system:


    
\begin{bmatrix} 6 & 5 \\ 5 & 6 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 1 \\ -1 \end{bmatrix}.

4. Solve for x̂:

(AT A)−1 = \frac{1}{(6)(6) − (5)(5)} \begin{bmatrix} 6 & -5 \\ -5 & 6 \end{bmatrix} = \frac{1}{11} \begin{bmatrix} 6 & -5 \\ -5 & 6 \end{bmatrix}.

Now multiply by AT b:

x̂ = (AT A)−1 AT b = \frac{1}{11} \begin{bmatrix} 6 & -5 \\ -5 & 6 \end{bmatrix} \begin{bmatrix} 1 \\ -1 \end{bmatrix} = \frac{1}{11} \begin{bmatrix} 6+5 \\ -5-6 \end{bmatrix} = \frac{1}{11} \begin{bmatrix} 11 \\ -11 \end{bmatrix} = \begin{bmatrix} 1 \\ -1 \end{bmatrix}.


Therefore, x̂ = (1, −1).

(b)

Given: A = \begin{bmatrix} 1 & 0 & 1 \\ 1 & 1 & 1 \\ 0 & 1 & 1 \\ 1 & 1 & 0 \end{bmatrix}, \quad b = \begin{bmatrix} 4 \\ -1 \\ 0 \\ 1 \end{bmatrix}.

1. Calculate AT A:

AT = \begin{bmatrix} 1 & 1 & 0 & 1 \\ 0 & 1 & 1 & 1 \\ 1 & 1 & 1 & 0 \end{bmatrix}
\;\Rightarrow\;
AT A = \begin{bmatrix} 1 & 1 & 0 & 1 \\ 0 & 1 & 1 & 1 \\ 1 & 1 & 1 & 0 \end{bmatrix} \begin{bmatrix} 1 & 0 & 1 \\ 1 & 1 & 1 \\ 0 & 1 & 1 \\ 1 & 1 & 0 \end{bmatrix} = \begin{bmatrix} 3 & 2 & 2 \\ 2 & 3 & 2 \\ 2 & 2 & 3 \end{bmatrix}.

2. Calculate AT b:

AT b = \begin{bmatrix} 1 & 1 & 0 & 1 \\ 0 & 1 & 1 & 1 \\ 1 & 1 & 1 & 0 \end{bmatrix} \begin{bmatrix} 4 \\ -1 \\ 0 \\ 1 \end{bmatrix}
= \begin{bmatrix} 4-1+0+1 \\ 0-1+0+1 \\ 4-1+0+0 \end{bmatrix}
= \begin{bmatrix} 4 \\ 0 \\ 3 \end{bmatrix}.

3. Set up the system:

\begin{bmatrix} 3 & 2 & 2 \\ 2 & 3 & 2 \\ 2 & 2 & 3 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} = \begin{bmatrix} 4 \\ 0 \\ 3 \end{bmatrix}.

4. Solve for x̂: For a 3 × 3 system, it’s often easiest to use Gaussian elimination (row reduction) on the augmented matrix:

\left[\begin{array}{ccc|c} 3 & 2 & 2 & 4 \\ 2 & 3 & 2 & 0 \\ 2 & 2 & 3 & 3 \end{array}\right]
\xrightarrow{R_1 \leftarrow R_1 - R_3}
\left[\begin{array}{ccc|c} 1 & 0 & -1 & 1 \\ 2 & 3 & 2 & 0 \\ 2 & 2 & 3 & 3 \end{array}\right]
\xrightarrow{R_2 \leftarrow R_2 - 2R_1}
\left[\begin{array}{ccc|c} 1 & 0 & -1 & 1 \\ 0 & 3 & 4 & -2 \\ 2 & 2 & 3 & 3 \end{array}\right]
\xrightarrow{R_3 \leftarrow R_3 - 2R_1}
\left[\begin{array}{ccc|c} 1 & 0 & -1 & 1 \\ 0 & 3 & 4 & -2 \\ 0 & 2 & 5 & 1 \end{array}\right]

\xrightarrow{R_2 \leftarrow R_2 - R_3}
\left[\begin{array}{ccc|c} 1 & 0 & -1 & 1 \\ 0 & 1 & -1 & -3 \\ 0 & 2 & 5 & 1 \end{array}\right]
\xrightarrow{R_3 \leftarrow R_3 - 2R_2}
\left[\begin{array}{ccc|c} 1 & 0 & -1 & 1 \\ 0 & 1 & -1 & -3 \\ 0 & 0 & 7 & 7 \end{array}\right]
\xrightarrow{R_3 \leftarrow R_3 / 7}
\left[\begin{array}{ccc|c} 1 & 0 & -1 & 1 \\ 0 & 1 & -1 & -3 \\ 0 & 0 & 1 & 1 \end{array}\right].

Now, use back-substitution:


x3 = 1


x2 − x3 = −3 ⇒ x2 − 1 = −3 ⇒ x2 = −2
x1 − x3 = 1 ⇒ x1 − 1 = 1 ⇒ x1 = 2

 
Therefore, x̂ = (2, −2, 1).

(c)

Given: A = \begin{bmatrix} 0 & 2 & 1 \\ 1 & 1 & -1 \\ 2 & 1 & 0 \\ 1 & 1 & 1 \\ 0 & 2 & -1 \end{bmatrix}, \quad b = \begin{bmatrix} 1 \\ 0 \\ 1 \\ -1 \\ 0 \end{bmatrix}.

1. Calculate AT A:

AT = \begin{bmatrix} 0 & 1 & 2 & 1 & 0 \\ 2 & 1 & 1 & 1 & 2 \\ 1 & -1 & 0 & 1 & -1 \end{bmatrix}
\;\Rightarrow\;
AT A = \begin{bmatrix} 0 & 1 & 2 & 1 & 0 \\ 2 & 1 & 1 & 1 & 2 \\ 1 & -1 & 0 & 1 & -1 \end{bmatrix} \begin{bmatrix} 0 & 2 & 1 \\ 1 & 1 & -1 \\ 2 & 1 & 0 \\ 1 & 1 & 1 \\ 0 & 2 & -1 \end{bmatrix} = \begin{bmatrix} 6 & 4 & 0 \\ 4 & 11 & 0 \\ 0 & 0 & 4 \end{bmatrix}.

2. Calculate AT b:

AT b = \begin{bmatrix} 0 & 1 & 2 & 1 & 0 \\ 2 & 1 & 1 & 1 & 2 \\ 1 & -1 & 0 & 1 & -1 \end{bmatrix} \begin{bmatrix} 1 \\ 0 \\ 1 \\ -1 \\ 0 \end{bmatrix}
= \begin{bmatrix} 0+0+2-1+0 \\ 2+0+1-1+0 \\ 1+0+0-1+0 \end{bmatrix}
= \begin{bmatrix} 1 \\ 2 \\ 0 \end{bmatrix}.

3. Set up the system:

\begin{bmatrix} 6 & 4 & 0 \\ 4 & 11 & 0 \\ 0 & 0 & 4 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} = \begin{bmatrix} 1 \\ 2 \\ 0 \end{bmatrix}.

4. Solve for x̂:


This system is simpler than it looks. We can write it out:

1. 6x1 + 4x2 = 1
2. 4x1 + 11x2 = 2
3. 4x3 = 0
From equation (3), we immediately get x3 = 0.

Now we solve the 2 × 2 system for x1 and x2 :


From (1): 4x2 = 1 − 6x1 ⇒ x2 = (1 − 6x1)/4

Substitute this into (2):

4x1 + 11 · (1 − 6x1)/4 = 2
Multiply by 4 to clear the fraction:

16x1 + 11(1 − 6x1 ) = 8 ⇒ 16x1 + 11 − 66x1 = 8 ⇒ −50x1 = −3


⇒ x1 = 3/50 (or 0.06)
Now find x2 :

x2 = (1 − 6(3/50))/4 = (1 − 18/50)/4 = (32/50)/4 = 8/50 = 4/25 (or 0.16)

3
 
 50  0.06
Therefore, x̂ =  4  or x̂ = 0.16
 
 25  0
0

[2]
Question 2: The table shows the numbers of doctorate degrees y awarded in the education fields in the
United States during the years 2001 to 2004. Find the least squares regression line for the data. Let t represent
the year, with t = 1 corresponding to 2001. Then use the model to predict the numbers of doctorate degrees
for the year 2010. (Source: U.S. National Science Foundation)

Year 2001 2002 2003 2004

Doctorate degrees, y 6337 6487 6627 6635

Our goal is to find the line y = c0 + c1 t that best fits the data.

1. Set up the Data


Based on the problem, t = 1 corresponds to 2001. Our data points (t, y) are:

(1, 6337), (2, 6487), (3, 6627), (4, 6635)

2. Formulate the Least-Squares System


We are looking for the coefficients x = \begin{bmatrix} c_0 \\ c_1 \end{bmatrix}
that best solve the system Ax = b.

• The matrix A is formed using the t-values (with a column of 1s for the c0 intercept):

A = \begin{bmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \\ 1 & 4 \end{bmatrix}

• The vector b is the list of y-values (doctorate degrees):

b = \begin{bmatrix} 6337 \\ 6487 \\ 6627 \\ 6635 \end{bmatrix}


3. Solve the Normal Equations


We find the least-squares solution x̂ by solving:

(AT A)x̂ = AT b

Step 3a: Calculate AT A

AT = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & 2 & 3 & 4 \end{bmatrix}

AT A = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & 2 & 3 & 4 \end{bmatrix} \begin{bmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \\ 1 & 4 \end{bmatrix}
= \begin{bmatrix} 1+1+1+1 & 1+2+3+4 \\ 1+2+3+4 & 1^2+2^2+3^2+4^2 \end{bmatrix}
= \begin{bmatrix} 4 & 10 \\ 10 & 30 \end{bmatrix}
Step 3b: Calculate AT b

AT b = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & 2 & 3 & 4 \end{bmatrix} \begin{bmatrix} 6337 \\ 6487 \\ 6627 \\ 6635 \end{bmatrix}
= \begin{bmatrix} 6337 + 6487 + 6627 + 6635 \\ 1(6337) + 2(6487) + 3(6627) + 4(6635) \end{bmatrix}
= \begin{bmatrix} 26086 \\ 65732 \end{bmatrix}
Step 3c: Solve the System

\begin{bmatrix} 4 & 10 \\ 10 & 30 \end{bmatrix} \begin{bmatrix} c_0 \\ c_1 \end{bmatrix} = \begin{bmatrix} 26086 \\ 65732 \end{bmatrix}
This gives two equations:
4c0 + 10c1 = 26086
10c0 + 30c1 = 65732
Divide the first by 2 and the second by 10:
2c0 + 5c1 = 13043
c0 + 3c1 = 6573.2
From the second equation:
c0 = 6573.2 − 3c1
Substitute into the first:
2(6573.2 − 3c1 ) + 5c1 = 13043
13146.4 − 6c1 + 5c1 = 13043
13146.4 − c1 = 13043 =⇒ c1 = 103.4
Now find c0 :
c0 = 6573.2 − 3(103.4) = 6573.2 − 310.2 = 6263
Thus, the least-squares regression line is:
y = 6263 + 103.4t

4. Predict for the Year 2010


• The pattern is t = Year − 2000.

• For 2010, t = 2010 − 2000 = 10.

y = 6263 + 103.4t
y(10) = 6263 + 103.4(10) = 6263 + 1034 = 7297
Therefore, the least-squares regression line is y = 6263 + 103.4t. The model predicts 7297 doctorate degrees for the
year 2010.
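The same fit and prediction can be reproduced with a short NumPy sketch (NumPy assumed; the numbers restate the values computed above):

    import numpy as np

    t = np.array([1., 2., 3., 4.])
    y = np.array([6337., 6487., 6627., 6635.])

    A = np.column_stack([np.ones_like(t), t])       # columns [1, t]
    c0, c1 = np.linalg.solve(A.T @ A, A.T @ y)      # normal equations
    print(c0, c1)                                   # 6263.0, 103.4
    print(c0 + c1 * 10)                             # prediction for 2010: 7297.0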


[2]
Question 3: The table shows the world carbon dioxide emissions y (in millions of metric tons) during the
years 1999 to 2003. Find the least squares regression quadratic polynomial for the data. Let t represent the
year, with t = −1 corresponding to 1999. Then use the model to predict the world carbon dioxide emissions
for the year 2008.

Year 1999 2000 2001 2002 2003

CO2 , y 10 9 10 13 18

Our goal is to find the coefficients c0 , c1 , c2 for the polynomial:

y = c 0 + c 1 t + c 2 t2

1. Set Up the Data


The problem states that t = −1 corresponds to 1999. We can build our data table:

Year t y (CO2 ) t2
1999 −1 10 1
2000 0 9 0
2001 1 10 1
2002 2 13 4
2003 3 18 9

2. Formulate the Least-Squares System


We are looking for the coefficient vector x̂ = \begin{bmatrix} c_0 \\ c_1 \\ c_2 \end{bmatrix}
that best solves the system Ax̂ = b.

• The matrix A is built from the polynomial terms [1, t, t2 ]:

A = \begin{bmatrix} 1 & -1 & 1 \\ 1 & 0 & 0 \\ 1 & 1 & 1 \\ 1 & 2 & 4 \\ 1 & 3 & 9 \end{bmatrix}

• The vector b is the list of y-values:

b = \begin{bmatrix} 10 \\ 9 \\ 10 \\ 13 \\ 18 \end{bmatrix}

3. Use the Normal Equations


We find the best-fit solution x̂ by solving the normal equations:

(AT A)x̂ = AT b


Step 3a: Calculate AT A

AT = \begin{bmatrix} 1 & 1 & 1 & 1 & 1 \\ -1 & 0 & 1 & 2 & 3 \\ 1 & 0 & 1 & 4 & 9 \end{bmatrix}

AT A = \begin{bmatrix} 5 & 5 & 15 \\ 5 & 15 & 35 \\ 15 & 35 & 99 \end{bmatrix}
Step 3b: Calculate AT b

AT b = \begin{bmatrix} 1 & 1 & 1 & 1 & 1 \\ -1 & 0 & 1 & 2 & 3 \\ 1 & 0 & 1 & 4 & 9 \end{bmatrix} \begin{bmatrix} 10 \\ 9 \\ 10 \\ 13 \\ 18 \end{bmatrix} = \begin{bmatrix} 60 \\ 80 \\ 234 \end{bmatrix}
Step 3c: Set Up and Solve the System

\begin{bmatrix} 5 & 5 & 15 \\ 5 & 15 & 35 \\ 15 & 35 & 99 \end{bmatrix} \begin{bmatrix} c_0 \\ c_1 \\ c_2 \end{bmatrix} = \begin{bmatrix} 60 \\ 80 \\ 234 \end{bmatrix}
We simplify by dividing the first row by 5 and the second row by 5:

1. c0 + c1 + 3c2 = 12
2. c0 + 3c1 + 7c2 = 16
3. 15c0 + 35c1 + 99c2 = 234

Subtract (1) from (2):


2c1 + 4c2 = 4 =⇒ c1 + 2c2 = 2 =⇒ c1 = 2 − 2c2
From (1):
c0 = 12 − c1 − 3c2
Substitute c1 = 2 − 2c2 :
c0 = 12 − (2 − 2c2 ) − 3c2 = 10 − c2
Substitute c0 and c1 into (3):
15(10 − c2 ) + 35(2 − 2c2 ) + 99c2 = 234
150 − 15c2 + 70 − 70c2 + 99c2 = 234
220 + 14c2 = 234 =⇒ 14c2 = 14 =⇒ c2 = 1
Now find c0 and c1 :
c0 = 10 − c2 = 9, c1 = 2 − 2c2 = 0
Therefore,
c0 = 9, c1 = 0, c2 = 1
Thus, the least-squares regression quadratic polynomial is:

y = 9 + t2

4. Predict Emissions for 2008


1. Find the t-value for 2008:
t = Year − 2000 = 2008 − 2000 = 8

2. Use the model to predict:


y(8) = 9 + (8)2 = 9 + 64 = 73

Final Solution:
• Least-Squares Polynomial: y = 9 + t2


• Prediction for 2008: 73 million metric tons of CO2
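The quadratic fit and prediction can also be reproduced with a short sketch (assuming NumPy; np.vander builds the [1, t, t2] columns):

    import numpy as np

    t = np.array([-1., 0., 1., 2., 3.])
    y = np.array([10., 9., 10., 13., 18.])

    A = np.vander(t, 3, increasing=True)            # columns [1, t, t^2]
    c = np.linalg.solve(A.T @ A, A.T @ y)
    print(c)                                        # approximately [9, 0, 1], i.e. y = 9 + t^2
    print(c @ np.array([1., 8., 64.]))              # prediction for 2008 (t = 8): 73.0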

[1]
Question 4

Approximating a vector as a multiple of another one. In the special case n = 1, the general least
squares problem reduces to finding a scalar x that minimizes ∥ax − b∥2 , where a and b are m-vectors (we write
the matrix A here in lower case, since it is an m-vector). Assuming a and b are nonzero, show that

∥ax̂ − b∥2 = ∥b∥2 (sin θ)2 ,

where θ = ∠(a, b). This shows that the optimal relative error in approximating one vector by a multiple of
another one depends on their angle.

We wish to find a scalar x that minimizes


∥ax − b∥2 ,
where a and b are nonzero m-vectors.

Expanding the squared norm:

∥ax − b∥2 = (ax − b)T (ax − b) = x2 ∥a∥2 − 2x(aT b) + ∥b∥2 .

Let
f (x) = x2 ∥a∥2 − 2x(aT b) + ∥b∥2 .
Differentiate with respect to x:
f ′ (x) = 2x∥a∥2 − 2(aT b).
Set f ′ (x) = 0 to find the minimizer:

2x̂∥a∥2 − 2(aT b) = 0 ⇒ x̂ = (aT b) / ∥a∥2 .

Substitute x̂ into f (x):


∥ax̂ − b∥2 = ∥ a · (aT b)/∥a∥2 − b ∥2 = ∥ ((aT b)/∥a∥2) a − b ∥2 .
Using the expanded form:

∥ax̂ − b∥2 = x̂2 ∥a∥2 − 2x̂(aT b) + ∥b∥2
          = ((aT b)/∥a∥2)2 ∥a∥2 − 2 ((aT b)/∥a∥2)(aT b) + ∥b∥2
          = (aT b)2/∥a∥2 − 2(aT b)2/∥a∥2 + ∥b∥2
          = ∥b∥2 − (aT b)2/∥a∥2 .

Recall the definition of the angle θ between vectors a and b:

cos θ = (aT b) / (∥a∥ ∥b∥) ⇒ aT b = ∥a∥ ∥b∥ cos θ.

Substitute into the expression:

(aT b)2 / ∥a∥2 = ∥a∥2 ∥b∥2 cos2 θ / ∥a∥2 = ∥b∥2 cos2 θ.
Thus,
∥ax̂ − b∥2 = ∥b∥2 − ∥b∥2 cos2 θ = ∥b∥2 (1 − cos2 θ).


Using the identity sin2 θ = 1 − cos2 θ, we conclude:

∥ax̂ − b∥2 = ∥b∥2 sin2 θ
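A quick numerical sanity check of this identity (a sketch with arbitrary example vectors, assuming NumPy):

    import numpy as np

    a = np.array([1., 2., 2.])
    b = np.array([3., 0., 4.])

    x_hat = (a @ b) / (a @ a)                       # optimal scalar multiple
    lhs = np.sum((x_hat * a - b) ** 2)              # ||a x̂ - b||^2

    cos_theta = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
    rhs = (b @ b) * (1 - cos_theta ** 2)            # ||b||^2 sin^2(theta)

    print(lhs, rhs)                                 # the two values agree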

[1]
Question 5

Least-squares with orthonormal columns. Suppose the m × n matrix Q has orthonormal columns and
b is an m-vector. Show that
x̂ = QT b
is the vector that minimizes ∥Qx − b∥2 .

We are given an m × n matrix Q with orthonormal columns (so QT Q = In ) and an m-vector b.


We wish to find the vector x̂ that minimizes
∥Qx − b∥2 .

Expand the squared norm:


∥Qx − b∥2 = (Qx − b)T (Qx − b) = xT QT Qx − 2xT QT b + bT b.
Since Q has orthonormal columns, QT Q = In , so:
∥Qx − b∥2 = xT x − 2xT QT b + ∥b∥2 = ∥x∥2 − 2xT (QT b) + ∥b∥2 .

Let
f (x) = ∥x∥2 − 2xT (QT b) + ∥b∥2 .
This is a quadratic function in x. To find the minimizer, take the gradient and set it to zero:
∇f (x) = 2x − 2QT b = 0 ⇒ x = QT b.

Alternatively, the least squares problem minimizes ∥Qx − b∥2 . The normal equations are:
QT Qx = QT b.
Since QT Q = In , we immediately obtain:
x̂ = QT b.
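A small numeric illustration (a sketch assuming NumPy; Q is built by orthonormalizing a random tall matrix):

    import numpy as np

    rng = np.random.default_rng(0)
    Q, _ = np.linalg.qr(rng.standard_normal((6, 3)))   # a 6 x 3 matrix with orthonormal columns
    b = rng.standard_normal(6)

    x_hat = Q.T @ b                                     # the claimed minimizer
    x_ref = np.linalg.lstsq(Q, b, rcond=None)[0]        # generic least squares solver
    print(np.allclose(x_hat, x_ref))                    # True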

[1]
Question 6

Least angle property of least squares. Suppose the m × n matrix A has linearly independent columns,
and b is an m-vector. Let x̂ = A† b denote the least squares approximate solution of Ax = b.

(a) Show that for any n-vector x,


(Ax)T b = (Ax)T (Ax̂),
i.e., the inner product of Ax and b is the same as the inner product of Ax and Ax̂.
Hint: Use (Ax)T b = xT (AT b) and (AT A)x̂ = AT b.
(b) Show that when Ax̂ and b are both nonzero, we have

(Ax̂)T b / (∥Ax̂∥ ∥b∥) = ∥Ax̂∥ / ∥b∥ .

The left-hand side is the cosine of the angle between Ax̂ and b.
Hint: Apply part (a) with x = x̂.
(c) Least angle property of least squares. The choice x = x̂ minimizes the distance between Ax and b.
Show that x = x̂ also minimizes the angle between Ax and b. (You can assume that A and b are nonzero.)
Remark: For any positive scalar α, x = αx̂ also minimizes the angle between Ax and b.


(a)
Start with the left-hand side:
(Ax)T b = xT (AT b).
Since x̂ is the least squares solution, it satisfies the normal equations:

AT Ax̂ = AT b.

Substitute into the expression:


xT (AT b) = xT (AT Ax̂) = (Ax)T (Ax̂).
Thus,
(Ax)T b = (Ax)T (Ax̂)

(b)
From part (a), with x = x̂:
(Ax̂)T b = (Ax̂)T (Ax̂) = ∥Ax̂∥2 .
Divide both sides by ∥Ax̂∥ ∥b∥:
(Ax̂)T b / (∥Ax̂∥ ∥b∥) = ∥Ax̂∥2 / (∥Ax̂∥ ∥b∥) = ∥Ax̂∥ / ∥b∥ .

Therefore,

(Ax̂)T b / (∥Ax̂∥ ∥b∥) = ∥Ax̂∥ / ∥b∥

(c)
We wish to show that the least squares solution x̂ = A† b minimizes the angle between Ax and b. Since the cosine
function is decreasing on [0, π], minimizing the angle θ is equivalent to maximizing cos θ, where

cos θ = (Ax)T b / (∥Ax∥ ∥b∥).

From part (a), for any vector x, we have:


(Ax)T b = (Ax)T (Ax̂).
Substituting into the expression for cos θ:

cos θ = (Ax)T (Ax̂) / (∥Ax∥ ∥b∥).    (1)
By the Cauchy–Schwarz inequality:
(Ax)T (Ax̂) ≤ ∥Ax∥ ∥Ax̂∥,
with equality if and only if Ax is a positive scalar multiple of Ax̂. Therefore:

cos θ ≤ (∥Ax∥ ∥Ax̂∥) / (∥Ax∥ ∥b∥) = ∥Ax̂∥ / ∥b∥ .

Thus, for any x, the cosine of the angle between Ax and b is at most ∥Ax̂∥ / ∥b∥.
On the other hand, when x = x̂, from equation (1), we have:

cos θ = (Ax̂)T (Ax̂) / (∥Ax̂∥ ∥b∥) = ∥Ax̂∥2 / (∥Ax̂∥ ∥b∥) = ∥Ax̂∥ / ∥b∥ ,

so the maximum value of cos θ is achieved at x = x̂; hence x = x̂ also minimizes the angle between Ax and b.
Moreover, if Ax is a positive multiple of Ax̂, then equality holds in the Cauchy–Schwarz inequality. Since A has
linearly independent columns, this occurs if and only if x is a positive multiple of x̂.


x = x̂ maximizes cos θ, thereby minimizing the angle θ between Ax and b.
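The least angle property can be illustrated numerically with a sketch (assuming NumPy; the comparison points x are random draws):

    import numpy as np

    rng = np.random.default_rng(1)
    A = rng.standard_normal((6, 2))
    b = rng.standard_normal(6)
    x_hat = np.linalg.lstsq(A, b, rcond=None)[0]

    def cos_angle(u, v):
        return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

    best = cos_angle(A @ x_hat, b)
    others = max(cos_angle(A @ rng.standard_normal(2), b) for _ in range(1000))
    print(best, others)    # best is the largest cosine, i.e. the smallest angle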

[1]
Question 7

Weighted least squares. In least squares, the objective (to be minimized) is


∥Ax − b∥2 = Σ_{i=1}^m (ãTi x − bi)2 ,

where ãTi are the rows of A, and the n-vector x is to be chosen.


In the weighted least squares problem, we minimize the objective

Σ_{i=1}^m wi (ãTi x − bi)2 ,

where wi are given positive weights. The weights allow us to assign different importance to the components of
the residual vector.

(a) Show that the weighted least squares objective can be expressed as ∥D(Ax − b)∥2 for an appropriate
diagonal matrix D. This allows us to solve the weighted least squares problem as a standard least squares
problem, by minimizing
∥Bx − d∥2 , where B = DA and d = Db.

(b) Show that when A has linearly independent columns, so does the matrix B.
(c) The least squares approximate solution is given by

x̂ = (AT A)−1 AT b.

Give a similar formula for the solution of the weighted least squares problem. You might want to use the
matrix W = diag(w) in your formula.

(a)
1. Expressing the WLS Objective as ∥D(Ax − b)∥2 . The weighted least-squares (WLS) objective is given as:

Σ_{i=1}^m wi (ãTi x − bi)2

We are given that wi > 0, so we can write wi = (√wi)2 . The objective becomes:

Σ_{i=1}^m (√wi)2 (ãTi x − bi)2 = Σ_{i=1}^m ( √wi (ãTi x − bi) )2

This sum of squares is the squared norm of a vector. Let’s define a vector v whose i-th component is vi = √wi (ãTi x − bi). Then the objective is ∥v∥2 .
Now, let’s look at the vector Ax − b. Its components are:
Ax − b = \begin{bmatrix} ã_1^T x − b_1 \\ ã_2^T x − b_2 \\ \vdots \\ ã_m^T x − b_m \end{bmatrix}

To get our vector v, we need to multiply the i-th component of Ax − b by √wi . We can do this with a diagonal matrix.


Let’s define D as the m × m diagonal matrix with the square roots of the weights on its diagonal:

D = diag(√w1 , √w2 , . . . , √wm ) = \begin{bmatrix} \sqrt{w_1} & 0 & \cdots & 0 \\ 0 & \sqrt{w_2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sqrt{w_m} \end{bmatrix}

Now, let’s compute D(Ax − b):

D(Ax − b) = \begin{bmatrix} \sqrt{w_1} & & 0 \\ & \ddots & \\ 0 & & \sqrt{w_m} \end{bmatrix} \begin{bmatrix} ã_1^T x − b_1 \\ ã_2^T x − b_2 \\ \vdots \\ ã_m^T x − b_m \end{bmatrix} = \begin{bmatrix} \sqrt{w_1}(ã_1^T x − b_1) \\ \sqrt{w_2}(ã_2^T x − b_2) \\ \vdots \\ \sqrt{w_m}(ã_m^T x − b_m) \end{bmatrix} = v

The squared norm of this vector is:


∥D(Ax − b)∥2 = Σ_{i=1}^m ( √wi (ãTi x − bi) )2 ,

which is exactly the WLS objective function.

2. Expressing as a Standard Least-Squares Problem ∥Bx − d∥2 . Using the distributive property of matrix
multiplication, we can rewrite the expression inside the norm:

D(Ax − b) = (DA)x − (Db)

If we define B = DA and d = Db, the objective function becomes:

∥Bx − d∥2

This is now in the form of a standard least-squares problem, where we want to find the vector x that minimizes the
squared norm of the residual vector Bx − d.

(b)
To show that a matrix M has linearly independent columns, we must show that the only solution to M x = 0 is the
trivial solution x = 0.
• We are given that A has linearly independent columns, i.e., Ax = 0 ⇒ x = 0.
• We want to show that B = DA has linearly independent columns, i.e., if we have Bx = 0, we can imply that
x = 0.
So we suppose that
Bx = 0
Substituting B = DA gives
(DA)x = 0
This is
D(Ax) = 0
Let y = Ax. The equation becomes
Dy = 0
Take D = diag(√w1 , √w2 , . . . , √wm ) with wi > 0, so

D = \begin{bmatrix} \sqrt{w_1} & 0 & \cdots & 0 \\ 0 & \sqrt{w_2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sqrt{w_m} \end{bmatrix}


Writing y = [ y1 , . . . , ym ]T , the equation Dy = 0 is

\begin{bmatrix} \sqrt{w_1}\, y_1 \\ \sqrt{w_2}\, y_2 \\ \vdots \\ \sqrt{w_m}\, y_m \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix}.

Since each wi > 0, we have √wi ≠ 0, hence √wi yi = 0 implies yi = 0 for all i.
Therefore y = 0.
By the definition y = Ax, we have Ax = 0.
Because A has linearly independent columns, Ax = 0 implies x = 0.
Thus
Bx = 0 ⇒ x = 0
Therefore, B = DA also has linearly independent columns.

(c)
The solution x̂ to a standard least-squares problem

min_x ∥Bx − d∥2

(where B has linearly independent columns) is obtained by solving the normal equations:

B T B x̂ = B T d.

Thus, the solution is:


x̂ = (B T B)−1 B T d.
We can find the solution to our WLS problem by substituting B = DA and d = Db into this formula.

Step 1: Substitute B and d.


x̂wls = ((DA)T (DA))−1 (DA)T (Db)

Step 2: Simplify the Terms. Recall that the transpose of a product satisfies (DA)T = AT DT .
Since D is diagonal, it is symmetric; hence DT = D.
Therefore, (DA)T = AT D.

Step 3: Simplify the B T B Term.

B T B = (DA)T (DA) = (AT D)(DA) = AT (DD)A = AT D2 A

Step 4: Simplify the B T d Term.

B T d = (DA)T (Db) = (AT D)(Db) = AT (DD)b = AT D2 b


Step 5: Analyze D2 . The matrix D = diag(√w1 , √w2 , . . . , √wm ). Hence,

D2 = diag((√w1)2 , (√w2)2 , . . . , (√wm)2 ) = diag(w1 , w2 , . . . , wm ).

This is precisely the weight matrix W . Therefore, W = D2 .


Step 6: Substitute W into the Expressions.

B T B = AT W A, B T d = AT W b

Plugging these back into the formula for x̂wls , we get:

x̂wls = (AT W A)−1 (AT W b)

This is the solution to the weighted least-squares problem. It has the same form as the standard least-squares solution
but includes the weight matrix W , which gives higher importance to equations with larger weights.
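Both formulations can be compared numerically in a short sketch (assuming NumPy; the data and weights are arbitrary illustration values):

    import numpy as np

    rng = np.random.default_rng(2)
    A = rng.standard_normal((8, 3))
    b = rng.standard_normal(8)
    w = rng.uniform(0.5, 2.0, size=8)               # positive weights

    D = np.diag(np.sqrt(w))                         # scale rows by sqrt(w): ordinary least squares on (DA, Db)
    x1 = np.linalg.lstsq(D @ A, D @ b, rcond=None)[0]

    W = np.diag(w)                                  # x̂ = (A^T W A)^{-1} A^T W b
    x2 = np.linalg.solve(A.T @ W @ A, A.T @ W @ b)

    print(np.allclose(x1, x2))                      # True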

[1]
Question 8

Network tomography. A network consists of n links, labeled 1, . . . , n. A path through the network is a
subset of the links. (The order of the links on a path does not matter here.) Each link has a (positive) delay,
which is the time it takes to traverse it. We let d denote the n-vector that gives the link delays. The total
travel time of a path is the sum of the delays of the links on the path.
Our goal is to estimate the link delays (i.e., the vector d), from a large number of (noisy) measurements of the
travel times along different paths. This data is given to you as an N × n matrix P , where
Pij = 1 if link j is on path i, and Pij = 0 otherwise,
and an N -vector t whose entries are the (noisy) travel times along the N paths. You can assume that N > n.
You will choose your estimate d̂ by minimizing the RMS deviation between the measured travel times (t) and
the travel times predicted by the sum of the link delays. Explain how to do this, and give a matrix expression
for d̂. If your expression requires assumptions about the data P or t, state them explicitly.

Note. The RMS (Root Mean Square) deviation between two vectors u and v of length N is defined as:
RMS(u, v) = √( (1/N) Σ_{i=1}^N (ui − vi)2 ) .

This measures the average magnitude of the errors between u and v. Minimizing the RMS deviation is equivalent
to minimizing the sum of squared errors
Σ_{i=1}^N (ui − vi)2 ,

since the square root and the factor 1/N are monotonic functions. Thus, minimizing the RMS deviation is the
same as minimizing the squared Euclidean norm

∥u − v∥2 .

Remark. This problem arises in several contexts. The network could be a computer network, and a path
gives the sequence of communication links data packets traverse. The network could be a transportation system,
with the links representing road segments.

1. Formulating the Model


Our first goal is to create a predicted travel time for each path based on the unknown link delays.
Let d be the n × 1 vector of unknown link delays:

d = \begin{bmatrix} d_1 \\ \vdots \\ d_n \end{bmatrix}   (delay of link j is dj)

Let tpredicted be the N × 1 vector of predicted travel times for the N paths.


The problem states that the total travel time for a path is the sum of the delays of the links on that path.
Let’s look at a single path, path i. The predicted travel time for this path, (tpredicted )i , is:

(tpredicted )i = (sum of delays dj for all links j on path i)

The matrix P (N × n) is defined to help us express this sum.


Define: Pij = 1 if link j is on path i, and Pij = 0 otherwise.
Then, we can write the sum for path i as:

(tpredicted)i = Pi1 d1 + Pi2 d2 + · · · + Pin dn = Σ_{j=1}^n Pij dj

This expression is exactly the i-th row of the matrix–vector product P d. If we do this for all N paths, we get the
complete vector of predicted travel times:
tpredicted = P d = \begin{bmatrix} \sum_{j=1}^{n} P_{1j} d_j \\ \sum_{j=1}^{n} P_{2j} d_j \\ \vdots \\ \sum_{j=1}^{n} P_{Nj} d_j \end{bmatrix}

Hence, our linear model is:


P d = tpredicted

2. Setting up the Minimization Problem


• We are given t, the N × 1 vector of measured (noisy) travel times. We want to find the delay vector dˆ that makes
our predicted times P d as close as possible to the measured times t.
• The problem states that we must minimize the RMS deviation between the measured travel times (t) and the
travel times predicted by the sum of the link delays (P d).

As the note in the problem explains, minimizing the RMS deviation:


RMS(P d, t) = √( (1/N) Σ_{i=1}^N ((P d)i − ti)2 )

is equivalent to minimizing the sum of squared errors:


Σ_{i=1}^N ((P d)i − ti)2

In vector form, this is the squared Euclidean norm of the residual vector r = P d − t:

Minimize ∥P d − t∥22

This is a standard linear least-squares problem. We are trying to find the vector dˆ that minimizes this objective
function.

3. Solving for the Optimal dˆ


We are looking for the vector dˆ that solves the least-squares problem:

d̂ = arg min_d ∥P d − t∥22

This problem is solved by finding the dˆ that satisfies the normal equations. The normal equations are derived by
taking the gradient of the objective function with respect to d and setting it to zero.


The normal equations for this problem are:


(P T P ) d̂ = P T t

To find our estimate d̂, we isolate it by multiplying both sides by the inverse of the matrix P T P .
This gives us the final matrix expression for d̂:

d̂ = (P T P )−1 P T t

This formula provides the optimal least-squares estimate for the delay vector d, minimizing the total squared error
between the measured and predicted travel times.

4. Required Assumptions
We are asked to state any assumptions needed for this expression. For the least-squares solution to work, we must be
able to compute the inverse (P T P )−1 .

• The matrix P T P is an n × n matrix (since P is N × n and P T is n × N ).


• For this n × n matrix to be invertible, it must have full rank n.

• This is true if and only if the N × n matrix P has linearly independent columns.

P has linearly independent columns.
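A small simulated instance of this estimator, as a sketch (assuming NumPy; the path matrix, true delays, and noise level are made up for illustration):

    import numpy as np

    rng = np.random.default_rng(3)
    n, N = 5, 40                                    # 5 links, 40 measured paths
    P = (rng.random((N, n)) < 0.5).astype(float)    # random 0/1 path-link incidence matrix
    d_true = rng.uniform(1.0, 5.0, size=n)          # "true" link delays (unknown in practice)
    t = P @ d_true + 0.1 * rng.standard_normal(N)   # noisy travel-time measurements

    d_hat = np.linalg.solve(P.T @ P, P.T @ t)       # d̂ = (P^T P)^{-1} P^T t
    print(d_true)
    print(d_hat)                                    # close to d_true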

[1]
Question 9

Least squares and QR factorization. Suppose A is an m × n matrix with linearly independent columns
and QR factorization A = QR, and b is an m-vector. The vector Ax̂ is the linear combination of the columns of
A that is closest to the vector b, i.e., it is the projection of b onto the set of linear combinations of the columns
of A.

(a) Show that Ax̂ = QQT b. (The matrix QQT is called the projection matrix.)
(b) Show that ∥Ax̂ − b∥2 = ∥b∥2 − ∥QT b∥2 . (This is the square of the distance between b and the closest linear
combination of the columns of A.)

(a)
1. Start with the Normal Equations. The least-squares solution x̂ is the unique vector that satisfies the normal
equations:
AT Ax̂ = AT b
2. Substitute A = QR into the equations.

(QR)T (QR)x̂ = (QR)T b

(RT QT )(QR)x̂ = RT QT b
3. Use the property QT Q = In .
RT (QT Q)Rx̂ = RT QT b
RT (In )Rx̂ = RT QT b
RT Rx̂ = RT QT b
4. Isolate x̂. Since R is invertible, RT is also invertible. We can multiply both sides on the left by (RT )−1 :

(RT )−1 (RT Rx̂) = (RT )−1 (RT QT b)

((RT )−1 RT )Rx̂ = ((RT )−1 RT )QT b


In Rx̂ = In QT b
Rx̂ = QT b
This is a very useful intermediate result. It shows that solving for x̂ using QR is equivalent to solving a triangular
system Rx̂ = QT b, which can be efficiently solved by back-substitution.

5. Find the projection Ax̂. Now we want to find the vector Ax̂:

Ax̂ = (QR)x̂ = Q(Rx̂)

From step 4, we know that Rx̂ = QT b. Substituting this, we get:

Ax̂ = Q(QT b)

Ax̂ = QQT b

This completes the proof. As the problem notes, QQT is the m × m matrix that projects any vector b onto the column
space of Q, which is the same as the column space of A.

(b)
1. Use the Geometry of Projections. The problem is asking for the squared norm of the residual vector

r = Ax̂ − b (or equivalently r = b − Ax̂, since the norm is the same).

By definition, the projection Ax̂ is the part of b that lies in the column space of A. The residual r = b − Ax̂ is the
part of b that is orthogonal to the column space of A.
This means that b can be decomposed into two orthogonal vectors: Ax̂ and (b − Ax̂):

b = Ax̂ + (b − Ax̂)

2. Apply the Pythagorean Theorem. Because Ax̂ and (b − Ax̂) are orthogonal, the Pythagorean theorem applies:

∥b∥2 = ∥Ax̂∥2 + ∥b − Ax̂∥2

We want to find ∥Ax̂ − b∥2 = ∥b − Ax̂∥2 . Rearranging the equation gives:

∥b − Ax̂∥2 = ∥b∥2 − ∥Ax̂∥2

3. Find an expression for ∥Ax̂∥2 . From the QR decomposition, we have two useful expressions:

Ax̂ = QRx̂, Rx̂ = QT b

Using the first one:


∥Ax̂∥2 = ∥QRx̂∥2
Because Q has orthonormal columns, it preserves vector norms (it is an isometry). That is:

∥Qv∥2 = v T QT Qv = v T In v = ∥v∥2

Therefore,
∥Ax̂∥2 = ∥Q(Rx̂)∥2 = ∥Rx̂∥2
From step 4 previously, Rx̂ = QT b. Substituting this gives:

∥Ax̂∥2 = ∥QT b∥2

4. Substitute back into the Pythagorean theorem. From step 2, we had:

∥Ax̂ − b∥2 = ∥b∥2 − ∥Ax̂∥2


Substitute our result from step 3:


∥Ax̂ − b∥2 = ∥b∥2 − ∥QT b∥2
This completes the proof. This result is computationally very useful - it gives the sum of squared errors directly,
without ever needing to explicitly compute x̂ or Ax̂.
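Both identities can be checked numerically with a short sketch (assuming NumPy; the test data are random):

    import numpy as np

    rng = np.random.default_rng(4)
    A = rng.standard_normal((7, 3))
    b = rng.standard_normal(7)

    Q, R = np.linalg.qr(A)
    x_hat = np.linalg.lstsq(A, b, rcond=None)[0]

    print(np.allclose(A @ x_hat, Q @ (Q.T @ b)))    # (a): A x̂ = Q Q^T b
    lhs = np.sum((A @ x_hat - b) ** 2)
    rhs = b @ b - (Q.T @ b) @ (Q.T @ b)
    print(np.isclose(lhs, rhs))                     # (b): ||A x̂ - b||^2 = ||b||^2 - ||Q^T b||^2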

[1]
Question 10

Minimizing a squared norm plus an affine function. A generalization of the least squares problem adds
an affine function to the least squares objective,

minimize ∥Ax − b∥2 + cT x + d,

where the n-vector x is the variable to be chosen, and the (given) data are the m × n matrix A, the m-vector
b, the n-vector c, and the number d. We will use the same assumption we use in least squares: The columns of
A are linearly independent. This generalized problem can be solved by reducing it to a standard least squares
problem, using a trick called completing the square.
Show that the objective of the problem above can be expressed in the form

∥Ax − b∥2 + cT x + d = ∥Ax − b + f ∥2 + g,

for some m-vector f and some constant g. It follows that we can solve the generalized least squares problem
by minimizing ∥Ax − (b − f )∥, an ordinary least squares problem with solution

x̂ = A† (b − f ).

Hints. Express the norm squared term on the right-hand side as ∥(Ax − b) + f ∥2 and expand it. Then argue
that the equality above holds provided 2AT f = c. One possible choice is
f = (1/2) (A†)T c.
(You must justify these statements.)

Our goal is to show that the original objective function:

J(x) = ∥Ax − b∥2 + cT x + d

is equal to the new objective function:


K(x) = ∥Ax − b + f ∥2 + g
for a specific choice of the m-vector f and the scalar g.

1. Expand the Right-Hand Side


We will start with the expression on the right-hand side, K(x), and expand it. It is helpful to group the term Ax − b
together.

K(x) = ∥(Ax − b) + f ∥2 + g
We can expand the squared norm ∥u + v∥2 as (u + v)T (u + v) = ∥u∥2 + 2uT v + ∥v∥2 . Let u = Ax − b and v = f .

K(x) = (∥Ax − b∥2 + 2(Ax − b)T f + ∥f ∥2 ) + g


Now, let’s distribute the term 2(Ax − b)T f :

K(x) = ∥Ax − b∥2 + 2(xT AT − bT )f + ∥f ∥2 + g


K(x) = ∥Ax − b∥2 + 2xT AT f − 2bT f + ∥f ∥2 + g
To compare this to J(x), we need to group the terms that depend on x. The term 2xT AT f is a scalar, so we can take
its transpose without changing its value. We do this to get x on the right side of the product:


2xT (AT f ) = 2(AT f )T x


Substituting this back into our expression for K(x):

K(x) = ∥Ax − b∥2 + 2(AT f )T x + (∥f ∥2 − 2bT f + g)

2. Equate the Two Forms


Now we set the original objective J(x) equal to our expanded form of K(x):

J(x) = K(x)
∥Ax − b∥2 + cT x + d = ∥Ax − b∥2 + 2(AT f )T x + (∥f ∥2 − 2bT f + g)

We can cancel the ∥Ax − b∥2 term from both sides, leaving:

cT x + d = 2(AT f )T x + (∥f ∥2 − 2bT f + g)


For this equation to hold true for any vector x, the terms that multiply x must be identical, and the constant terms
must be identical.
a) Matching the x terms:

cT x = 2(AT f )T x
Taking the transpose of both sides gives the condition for f :

c = 2(AT f )   or equivalently   AT f = (1/2) c
This proves the statement from the hint: the equality holds provided 2AT f = c.
b) Matching the constant terms:

d = ∥f ∥2 − 2bT f + g
We can solve this for g to find the value of the constant:

g = d − ∥f ∥2 + 2bT f

3. Justify the Choice of f


We must show that the suggested choice f = (1/2)(A†)T c satisfies the condition we found in step 2a (2AT f = c).
We are given that A has linearly independent columns, so its pseudoinverse A† is defined as
A† = (AT A)−1 AT .
Let’s find the transpose of the pseudoinverse:
(A†)T = ( (AT A)−1 AT )T = (AT )T ( (AT A)−1 )T .

Since AT A is a symmetric matrix, its inverse (AT A)−1 is also symmetric, meaning

( (AT A)−1 )T = (AT A)−1 .
Therefore,
(A† )T = A(AT A)−1 .
Now, substitute f = (1/2)(A†)T c = (1/2) A(AT A)−1 c into our condition 2AT f :

2AT f = 2AT ( (1/2) A(AT A)−1 c )
      = AT A(AT A)−1 c
      = (AT A)(AT A)−1 c
      = In c
      = c
This proves that the suggested choice for f is a valid solution.


4. Conclusion
We have successfully shown that by choosing:
1. f = (1/2)(A†)T c (which satisfies 2AT f = c)
2. g = d − ∥f ∥2 + 2bT f
The original objective function is equivalent to the new one:

∥Ax − b∥2 + cT x + d = ∥Ax − b + f ∥2 + g


Since Ax − b + f = Ax − (b − f ), this is:

∥Ax − (b − f )∥2 + g
To solve the original problem, we must find the x that minimizes this expression. Since g is a constant, we only need
to minimize the squared norm term. This is a standard least-squares problem:

min ∥Ax − (b − f )∥2


x

The solution is given by x̂ = A† (b − f ), as stated in the problem.
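A numeric check of the completed-square form and the resulting minimizer, as a sketch (assuming NumPy; A, b, c, d are arbitrary test data):

    import numpy as np

    rng = np.random.default_rng(5)
    A = rng.standard_normal((6, 3))
    b = rng.standard_normal(6)
    c = rng.standard_normal(3)
    d = 2.0

    A_pinv = np.linalg.pinv(A)              # A† = (A^T A)^{-1} A^T
    f = 0.5 * A_pinv.T @ c                  # satisfies 2 A^T f = c
    g = d - f @ f + 2 * b @ f

    x = rng.standard_normal(3)              # at any x, the two forms of the objective agree
    left = np.sum((A @ x - b) ** 2) + c @ x + d
    right = np.sum((A @ x - b + f) ** 2) + g
    print(np.isclose(left, right))          # True

    print(A_pinv @ (b - f))                 # x̂ = A†(b - f), the minimizer of the generalized problem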


Acknowledgments
I, Le Mai Thanh Son (Zeref), PASS Leader for MATH2050 - Linear Algebra course this semester, would like to express
my gratitude to the following individuals and resources for their contributions in the Linear Algebra field that I use
in this document:

• Introduction to Applied Linear Algebra – Vectors, Matrices, and Least Squares [1]
• Elementary Linear Algebra [2]

Their contributions have been instrumental in shaping the outcomes of this work.
