Pass Fall2025 Math2050 Week 7
Session 5: Least-squares
and Regularized Least-squares
1 Introduction
Let (1, 0), (2, 1), and (3, 3) be three points in the plane. How can you find the line y = c0 + c1 x that “best fits” these
points? One way is to note that if the three points were collinear, then the following system of equations would be
consistent:
c0 + c1 = 0
c0 + 2c1 = 1
c0 + 3c1 = 3
This system can be written in the matrix form Ax = b, where
1 1
0 " #
c0
A = 1 2 , b = 1 , x= .
1 3 c1
3
Because the points are not collinear, however, the system is inconsistent. Although it is impossible to find x such that
Ax = b, you can look for an x that minimizes the norm of the error ∥Ax − b∥. The solution
" #
c0
x=
c1
y = c0 + c1 x.
Given an m × n matrix A and a vector b in Rm , the least squares problem is to find x in Rn such that
∥Ax − b∥2
is minimized.
REMARK: The term least squares comes from the fact that minimizing ∥Ax − b∥ is equivalent to
minimizing ∥Ax − b∥2 , which is a sum of squares.
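As a quick check of this introductory example, here is a minimal NumPy sketch (only the three data points come from the text above; the use of np.linalg.lstsq is an illustrative choice):

```python
import numpy as np

# the three points (1, 0), (2, 1), (3, 3) from the introduction
t = np.array([1.0, 2.0, 3.0])
y = np.array([0.0, 1.0, 3.0])

# columns of A: a column of ones (for c0) and the t-values (for c1)
A = np.column_stack([np.ones_like(t), t])

# least squares solution of the inconsistent system A x = y
c0, c1 = np.linalg.lstsq(A, y, rcond=None)[0]
print(c0, c1)   # roughly -1.667 and 1.5, i.e. the best-fit line y = -5/3 + (3/2) t
```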
As a simple numerical example (taken from [1]), consider the 3 × 2 system Ax = b with
A = [2 0; −1 1; 0 2],   b = (1, 0, −1),
which has no solution. (From the first equation we have x1 = 1/2, and from the last equation we have x2 = −1/2; but then the second equation does not hold.) The corresponding least squares problem is to minimize
∥Ax − b∥² = (2x1 − 1)² + (−x1 + x2)² + (2x2 + 1)².
Its unique solution is x̂ = (1/3, −1/3). The least squares approximate solution x̂ does not satisfy the equations Ax = b; the corresponding residuals are
Ax̂ − b = (−1/3, −2/3, 1/3),
with sum of squares value ∥Ax̂ − b∥² = 2/3. Let us compare this to another choice of x, x̃ = (1/2, −1/2), which corresponds to (exactly) solving the first and last of the three equations in Ax = b. It gives the residual
Ax̃ − b = (0, −1, 0),
with sum of squares value ∥Ax̃ − b∥² = 1, which is larger.
Now, we will derive several expressions for the solution of the least squares problem, under one assumption on the data matrix A: the columns of A are linearly independent.
2 Least-Squares
2.1 Overview
Consider a system of linear equations
Ax = b.
This system represents finding a set of coefficients x that, when applied to the columns of the matrix A, produces the vector b. Often, there is no exact solution (e.g., you have more equations than unknowns, so A is “tall”, i.e. the matrix has more rows than columns).
The “best” approximate solution is found by minimizing the error. The Ordinary Least-Squares (OLS) method
finds the vector x that minimizes the squared L2 -norm (the sum of squares) of the error, or “residual” r (r = Ax − b):
Its objective function is:
Find x that minimizes ∥Ax − b∥₂².
2.2 Method 1
We know that any minimizer x̂ of the function
f(x) = ∥Ax − b∥²
must satisfy
∂f/∂x_i (x̂) = 0,   i = 1, . . . , n,
which we can express as the vector equation
∇f (x̂) = 0,
where ∇f (x̂) is the gradient of f evaluated at x̂. So first, we need to find the gradient of f .
First, we can express the least squares objective as the squared L2 norm:
f (x) = ∥Ax − b∥2 .
We can rewrite this as a dot product:
f (x) = (Ax − b)T (Ax − b).
Expanding this expression:
f (x) = xT AT Ax − xT AT b − bT Ax + bT b.
Since xT AT b is a scalar (xT ∈ R1×n , AT ∈ Rn×m , b ∈ Rm ), its transpose is equal to itself:
(xT AT b)T = bT Ax = xT AT b.
Hence, the two middle terms are identical. Therefore,
f (x) = xT AT Ax − 2bT Ax + bT b.
Now we take the derivative with respect to the vector x. Using standard matrix calculus identities:
∂/∂x (xᵀY x) = (Y + Yᵀ)x,   and for symmetric Y = Yᵀ,   ∂/∂x (xᵀY x) = 2Y x,
and
∂/∂x (yᵀx) = y,
we have:
∂f(x)/∂x = ∂/∂x (xᵀAᵀAx) − ∂/∂x (2bᵀAx) + ∂/∂x (bᵀb).
Simplifying,
∂f(x)/∂x = 2AᵀAx − 2Aᵀb + 0.
Therefore, the gradient of f (x) with respect to x is
∇x f (x) = 2AT Ax − 2AT b.
Equivalently, it can be written in compact form as
∇x f (x) = 2AT (Ax − b).
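As a sanity check on this formula, here is a minimal NumPy sketch (the small random A, b and x are assumed purely for illustration) comparing the gradient 2Aᵀ(Ax − b) with a finite-difference approximation:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))
b = rng.standard_normal(5)
x = rng.standard_normal(3)

f = lambda x: np.sum((A @ x - b) ** 2)      # f(x) = ||Ax - b||^2
grad = 2 * A.T @ (A @ x - b)                # gradient formula derived above

# central finite-difference approximation of the gradient
eps = 1e-6
num = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps) for e in np.eye(3)])
print(np.allclose(grad, num, atol=1e-4))    # True
```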
2.3 Method 2
Writing the least squares objective out as a sum, we get
f(x) = ∥Ax − b∥² = ∑_{i=1}^m ( ∑_{j=1}^n A_{ij} x_j − b_i )².
To find ∇f(x)_k, we take the partial derivative of f with respect to x_k. Differentiating the sum term by term, we get
∇f(x)_k = ∂f/∂x_k (x) = 2 ∑_{i=1}^m ( ∑_{j=1}^n A_{ij} x_j − b_i ) A_{ik} = 2 ∑_{i=1}^m (Aᵀ)_{ki} (Ax − b)_i = ( 2Aᵀ(Ax − b) )_k .
2.4 Solution via QR factorization
Setting the gradient to zero gives the normal equations AᵀAx̂ = Aᵀb. Since the columns of A are linearly independent, AᵀA is invertible, so
x̂ = (AᵀA)⁻¹Aᵀb = A†b,
where A† = (AᵀA)⁻¹Aᵀ is the pseudo-inverse of A. Now let A = QR be a QR factorization of A (QᵀQ = I and R upper triangular with nonzero diagonal entries). Then
AᵀA = (QR)ᵀ(QR) = RᵀQᵀQR = RᵀR,
so
A† = (AᵀA)⁻¹Aᵀ = (RᵀR)⁻¹(QR)ᵀ = R⁻¹R⁻ᵀRᵀQᵀ = R⁻¹Qᵀ.
We can use the QR factorization to compute the least squares approximate solution. Let A = QR be the QR
factorization of A (which exists by our assumption that its columns are linearly independent). We have already seen
that the pseudo-inverse A† can be expressed as
A† = R−1 QT ,
so we have
x̂ = R−1 QT b
To compute x̂, we first multiply b by QT ; then we compute R−1 (QT b) using back substitution. This is summarized in
the following algorithm, which computes the least squares approximate solution x̂, given A and b.
1. QR factorization. Compute the QR factorization A = QR.
2. Compute Qᵀb.
3. Back substitution. Solve the triangular equation Rx̂ = Qᵀb.
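A minimal NumPy sketch of this algorithm (the tall random matrix A and vector b are assumed for illustration); np.linalg.solve plays the role of back substitution here, since R is upper triangular:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 3))   # tall matrix with (almost surely) independent columns
b = rng.standard_normal(6)

Q, R = np.linalg.qr(A)            # step 1: reduced QR factorization, R is 3x3 upper triangular
y = Q.T @ b                       # step 2: compute Q^T b
x_hat = np.linalg.solve(R, y)     # step 3: solve the triangular system R x = Q^T b

# agrees with the least squares solution x = (A^T A)^{-1} A^T b
print(np.allclose(x_hat, np.linalg.lstsq(A, b, rcond=None)[0]))  # True
```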
3 Regularized Least-Squares
3.1 Overview
Regularization adds a “penalty” term to the original objective function. The goal is no longer just to fit the data, but
to balance two competing goals:
• Goal 1: Fit the data well (keep ∥Ax − b∥₂² small).
• Goal 2: Keep the solution x “simple” by penalizing large coefficients (keep ∥x∥₂² small).
The most common type of regularized least-squares is Tikhonov regularization, also known as Ridge Regression
or L2 Regularization.
Its objective function is
J(x) = ∥Ax − b∥₂² + λ∥x∥₂²,
where the parameter λ > 0 controls the trade-off between the two goals.
3.2 Solution
Find the vector x that minimizes the regularized objective function:
J(x) = ∥Ax − b∥₂² + λ∥x∥₂².
Expanding the objective,
J(x) = xᵀ(AᵀA + λI)x − 2(Aᵀb)ᵀx + bᵀb.
We use two standard gradient rules: (1) ∇x(xᵀMx) = 2Mx for symmetric M, and (2) ∇x(cᵀx) = c for a constant vector c.
• Gradient of Part 1: The matrix M = AᵀA + λI is symmetric. So, by rule (1), ∇x(xᵀ(AᵀA + λI)x) = 2(AᵀA + λI)x.
• Gradient of Part 2: The vector c = Aᵀb is constant, so by rule (2), ∇x(−2cᵀx) = −2c = −2Aᵀb.
• Gradient of Part 3: ∇x(bᵀb) = 0.
Setting the gradient to zero,
∇xJ(x) = 2(AᵀA + λI)x − 2Aᵀb = 0  ⇒  x = (AᵀA + λI)⁻¹Aᵀb.
(Note that AᵀA + λI is invertible for any λ > 0, even if the columns of A are not linearly independent.)
This is the closed-form solution for regularized least squares.
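A minimal NumPy sketch of the closed-form ridge solution (random A, b and the value λ = 0.5 are assumed for illustration); it also checks that the gradient of J vanishes at the computed x̂:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((8, 3))
b = rng.standard_normal(8)
lam = 0.5                                     # regularization weight λ (an assumed value)

n = A.shape[1]
x_ridge = np.linalg.solve(A.T @ A + lam * np.eye(n), A.T @ b)

# gradient of J(x) = ||Ax-b||^2 + λ||x||^2 should vanish at the minimizer
grad = 2 * (A.T @ (A @ x_ridge - b) + lam * x_ridge)
print(np.allclose(grad, 0))                   # True
```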
Question 1: Find the least-squares solution x̂ of Ax = b for each of the following systems.
To find the least-squares solution x̂, we don’t solve Ax = b directly. Instead, we solve a related, “square” system called the normal equations:
(AᵀA)x̂ = Aᵀb.
The solution x̂ is then found as follows:
1. Calculate AᵀA.
2. Calculate Aᵀb.
3. Set up the system (AᵀA)x̂ = Aᵀb.
4. Solve the system for x̂.
(a)
Given: A = [2 1; 1 2; 1 1],   b = (2, 0, −3).
1. Calculate AT A:
Aᵀ = [2 1 1; 1 2 1]  ⇒  AᵀA = [2 1 1; 1 2 1] [2 1; 1 2; 1 1] = [(4 + 1 + 1) (2 + 2 + 1); (2 + 2 + 1) (1 + 4 + 1)] = [6 5; 5 6].
2. Calculate AT b:
Aᵀb = [2 1 1; 1 2 1] (2, 0, −3) = ((4 + 0 − 3), (2 + 0 − 3)) = (1, −1).
3. Set up the system (AᵀA)x̂ = Aᵀb and solve:
[6 5; 5 6] (x1, x2) = (1, −1).
Solving (for example, by elimination: subtracting the two equations gives x1 − x2 = 2, and substituting back gives x1 = 1, x2 = −1), we obtain
x̂ = (1, −1).
(b)
Given: A = [1 0 1; 1 1 1; 0 1 1; 1 1 0],   b = (4, −1, 0, 1).
1. Calculate AT A:
Aᵀ = [1 1 0 1; 0 1 1 1; 1 1 1 0]  ⇒  AᵀA = [3 2 2; 2 3 2; 2 2 3].
2. Calculate AT b:
Aᵀb = [1 1 0 1; 0 1 1 1; 1 1 1 0] (4, −1, 0, 1) = ((4 − 1 + 0 + 1), (0 − 1 + 0 + 1), (4 − 1 + 0 + 0)) = (4, 0, 3).
3. Set up the system (AᵀA)x̂ = Aᵀb, i.e. [3 2 2; 2 3 2; 2 2 3] x̂ = (4, 0, 3).
4. Solve for x̂: For a 3 × 3 system, it’s often easiest to use Gaussian elimination (row reduction) on the augmented
matrix:
[3 2 2 | 4; 2 3 2 | 0; 2 2 3 | 3]
→ (R1 ← R1 − R3)                 [1 0 −1 | 1; 2 3 2 | 0; 2 2 3 | 3]
→ (R2 ← R2 − 2R1, R3 ← R3 − 2R1) [1 0 −1 | 1; 0 3 4 | −2; 0 2 5 | 1]
→ (R2 ← R2 − R3)                 [1 0 −1 | 1; 0 1 −1 | −3; 0 2 5 | 1]
→ (R3 ← R3 − 2R2)                [1 0 −1 | 1; 0 1 −1 | −3; 0 0 7 | 7]
→ (R3 ← R3/7)                    [1 0 −1 | 1; 0 1 −1 | −3; 0 0 1 | 1].
Back substitution gives:
x3 = 1
x2 − x3 = −3 ⇒ x2 − 1 = −3 ⇒ x2 = −2
x1 − x3 = 1 ⇒ x1 − 1 = 1 ⇒ x1 = 2
Therefore, x̂ = (2, −2, 1).
(c)
Given: A = [0 2 1; 1 1 −1; 2 1 0; 1 1 1; 0 2 −1],   b = (1, 0, 1, −1, 0).
1. Calculate AT A:
Aᵀ = [0 1 2 1 0; 2 1 1 1 2; 1 −1 0 1 −1]  ⇒  AᵀA = [6 4 0; 4 11 0; 0 0 4].
2. Calculate AT b:
Aᵀb = [0 1 2 1 0; 2 1 1 1 2; 1 −1 0 1 −1] (1, 0, 1, −1, 0) = ((0 + 0 + 2 − 1 + 0), (2 + 0 + 1 − 1 + 0), (1 + 0 + 0 − 1 + 0)) = (1, 2, 0).
3. Solve the normal equations (AᵀA)x̂ = Aᵀb, which give the system
(1) 6x1 + 4x2 = 1
(2) 4x1 + 11x2 = 2
(3) 4x3 = 0
From equation (3), we immediately get x3 = 0.
Eliminate x2 from (1) and (2): multiplying (1) by 11 and (2) by 4 gives 66x1 + 44x2 = 11 and 16x1 + 44x2 = 8; subtracting, 50x1 = 3,
⇒ x1 = 3/50 (or 0.06).
Now find x2: from (1), 4x2 = 1 − 6(3/50) = 32/50, so x2 = 8/50 = 4/25 (or 0.16).
Therefore, x̂ = (3/50, 4/25, 0) = (0.06, 0.16, 0).
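The three answers above can be checked with a short NumPy sketch (np.linalg.lstsq solves the same normal equations internally):

```python
import numpy as np

# the three systems (a), (b), (c) from Question 1
systems = {
    "(a)": ([[2, 1], [1, 2], [1, 1]], [2, 0, -3]),
    "(b)": ([[1, 0, 1], [1, 1, 1], [0, 1, 1], [1, 1, 0]], [4, -1, 0, 1]),
    "(c)": ([[0, 2, 1], [1, 1, -1], [2, 1, 0], [1, 1, 1], [0, 2, -1]], [1, 0, 1, -1, 0]),
}

for name, (A, b) in systems.items():
    A = np.array(A, dtype=float)
    b = np.array(b, dtype=float)
    x_hat = np.linalg.lstsq(A, b, rcond=None)[0]
    print(name, np.round(x_hat, 4))
# expected: (a) (1, -1), (b) (2, -2, 1), (c) (0.06, 0.16, 0)
```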
[2]
Question 2: The table shows the numbers of doctorate degrees y awarded in the education fields in the
United States during the years 2001 to 2004. Find the least squares regression line for the data. Let t represent
the year, with t = 1 corresponding to 2001. Then use the model to predict the numbers of doctorate degrees
for the year 2010. (Source: U.S. National Science Foundation)
Our goal is to find the line y = c0 + c1 t that best fits the data.
• The matrix A is formed using the t-values (with a column of 1s for the c0 intercept):
A = [1 1; 1 2; 1 3; 1 4].
• The vector b contains the four y-values from the table.
Solving the normal equations (AᵀA)x̂ = Aᵀb gives c0 = 6263 and c1 = 103.4, so
y = 6263 + 103.4t.
y(10) = 6263 + 103.4(10) = 6263 + 1034 = 7297
Therefore, the least-squares regression line is y = 6263 + 103.4t. The model predicts 7297 doctorate degrees for the
year 2010.
[2]
Question 3: The table shows the world carbon dioxide emissions y (in millions of metric tons) during the
years 1999 to 2004. Find the least squares regression quadratic polynomial for the data. Let t represent the
year, with t = −1 corresponding to 1999. Then use the model to predict the world carbon dioxide emissions
for the year 2008.
We seek the least-squares quadratic model y = c0 + c1 t + c2 t², using the data from the table:
Year t y (CO2 ) t2
1999 −1 10 1
2000 0 9 0
2001 1 10 1
2002 2 13 4
2003 3 18 9
The rows of A are (1, t, t²) and b is the vector of y-values, so
A = [1 −1 1; 1 0 0; 1 1 1; 1 2 4; 1 3 9],   b = (10, 9, 10, 13, 18),
and we seek the coefficient vector x̂ = (c0, c1, c2) that best solves the system Ax = b.
The normal equations (AᵀA)x̂ = Aᵀb (with the first two equations divided through by 5) give:
1. c0 + c1 + 3c2 = 12
2. c0 + 3c1 + 7c2 = 16
3. 15c0 + 35c1 + 99c2 = 234
Solving this system gives c0 = 9, c1 = 0, c2 = 1, so y = 9 + t².
Final Solution:
• Least-Squares Polynomial: y = 9 + t²
• Prediction for 2008: since t = −1 corresponds to 1999, the year 2008 corresponds to t = 8, and y(8) = 9 + 8² = 73 (million metric tons).
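A short NumPy sketch confirming the quadratic fit and the 2008 prediction (the t and y values are taken from the table above):

```python
import numpy as np

t = np.array([-1, 0, 1, 2, 3], dtype=float)
y = np.array([10, 9, 10, 13, 18], dtype=float)

# design matrix with columns 1, t, t^2
A = np.column_stack([np.ones_like(t), t, t**2])
c0, c1, c2 = np.linalg.lstsq(A, y, rcond=None)[0]
print(np.round([c0, c1, c2], 6))     # approximately [9, 0, 1], i.e. y = 9 + t^2

# prediction for 2008, i.e. t = 8
print(c0 + c1 * 8 + c2 * 64)         # ~73.0
```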
[1]
Question 4
Approximating a vector as a multiple of another one. In the special case n = 1, the general least
squares problem reduces to finding a scalar x that minimizes ∥ax − b∥2 , where a and b are m-vectors (we write
the matrix A here in lower case, since it is an m-vector). Assuming a and b are nonzero, show that the optimal relative error satisfies
∥x̂a − b∥ / ∥b∥ = sin θ,
where θ = ∠(a, b). This shows that the optimal relative error in approximating one vector by a multiple of
another one depends on their angle.
Let
f(x) = ∥xa − b∥² = x²∥a∥² − 2x(aᵀb) + ∥b∥².
Differentiate with respect to x:
f′(x) = 2x∥a∥² − 2(aᵀb).
Set f ′ (x) = 0 to find the minimizer:
2x̂∥a∥² − 2(aᵀb) = 0  ⇒  x̂ = (aᵀb) / ∥a∥².
By the definition of the angle θ between a and b,
cos θ = (aᵀb) / (∥a∥ ∥b∥)  ⇒  aᵀb = ∥a∥ ∥b∥ cos θ.
Substituting x̂ = (aᵀb)/∥a∥² back into f,
∥ax̂ − b∥² = f(x̂) = (aᵀb)²/∥a∥² − 2(aᵀb)²/∥a∥² + ∥b∥² = ∥b∥² − (aᵀb)²/∥a∥²,
and
(aᵀb)²/∥a∥² = (∥a∥² ∥b∥² cos²θ)/∥a∥² = ∥b∥² cos²θ.
Thus,
∥ax̂ − b∥² = ∥b∥² − ∥b∥² cos²θ = ∥b∥²(1 − cos²θ) = ∥b∥² sin²θ,
and therefore ∥x̂a − b∥/∥b∥ = sin θ, as claimed.
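A quick numerical illustration of this identity (a minimal NumPy sketch with random nonzero vectors a and b assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
a = rng.standard_normal(5)
b = rng.standard_normal(5)

x_hat = (a @ b) / (a @ a)                                  # minimizer of ||x a - b||^2
cos_theta = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
theta = np.arccos(np.clip(cos_theta, -1, 1))

rel_err = np.linalg.norm(x_hat * a - b) / np.linalg.norm(b)  # optimal relative error
print(np.isclose(rel_err, np.sin(theta)))                    # True
```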
[1]
Question 5
Least-squares with orthonormal columns. Suppose the m × n matrix Q has orthonormal columns and
b is an m-vector. Show that
x̂ = QT b
is the vector that minimizes ∥Qx − b∥2 .
Let
f(x) = ∥Qx − b∥² = ∥Qx∥² − 2(Qx)ᵀb + ∥b∥² = ∥x∥² − 2xᵀ(Qᵀb) + ∥b∥²,
where we used ∥Qx∥² = xᵀQᵀQx = xᵀx = ∥x∥².
This is a quadratic function in x. To find the minimizer, take the gradient and set it to zero:
∇f (x) = 2x − 2QT b = 0 ⇒ x = QT b.
Alternatively, the least squares problem minimizes ∥Qx − b∥2 . The normal equations are:
QT Qx = QT b.
Since QT Q = In , we immediately obtain:
x̂ = QT b.
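A minimal NumPy sketch of this result (Q is obtained from the QR factorization of an assumed random matrix, so its columns are orthonormal):

```python
import numpy as np

rng = np.random.default_rng(4)
Q, _ = np.linalg.qr(rng.standard_normal((6, 3)))   # Q has orthonormal columns
b = rng.standard_normal(6)

x_hat = Q.T @ b
print(np.allclose(x_hat, np.linalg.lstsq(Q, b, rcond=None)[0]))  # True
```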
[1]
Question 6
Least angle property of least squares. Suppose the m × n matrix A has linearly independent columns,
and b is an m-vector. Let x̂ = A† b denote the least squares approximate solution of Ax = b.
(a) Show that for any n-vector x,
(Ax)ᵀb = (Ax)ᵀ(Ax̂);
i.e., the inner product of Ax with b equals the inner product of Ax with Ax̂. Hint: use (Ax)ᵀb = xᵀ(Aᵀb) and the normal equations (AᵀA)x̂ = Aᵀb.
(b) Show that when Ax̂ and b are both nonzero,
(Ax̂)ᵀb / (∥Ax̂∥ ∥b∥) = ∥Ax̂∥ / ∥b∥.
The left-hand side is the cosine of the angle between Ax̂ and b.
Hint: Apply part (a) with x = x̂.
(c) Least angle property of least squares. The choice x = x̂ minimizes the distance between Ax and b.
Show that x = x̂ also minimizes the angle between Ax and b. (You can assume that A and b are nonzero.)
Remark: For any positive scalar α, x = αx̂ also minimizes the angle between Ax and b.
(a)
Start with the left-hand side:
(Ax)T b = xT (AT b).
Since x̂ is the least squares solution, it satisfies the normal equations
AᵀAx̂ = Aᵀb.
Substituting Aᵀb = AᵀAx̂ gives (Ax)ᵀb = xᵀ(AᵀAx̂) = (Ax)ᵀ(Ax̂), as required.
(b)
From part (a), with x = x̂:
(Ax̂)T b = (Ax̂)T (Ax̂) = ∥Ax̂∥2 .
Divide both sides by ∥Ax̂∥ ∥b∥:
(Ax̂)ᵀb / (∥Ax̂∥ ∥b∥) = ∥Ax̂∥² / (∥Ax̂∥ ∥b∥) = ∥Ax̂∥ / ∥b∥.
Therefore,
(Ax̂)ᵀb / (∥Ax̂∥ ∥b∥) = ∥Ax̂∥ / ∥b∥.   (1)
(c)
We wish to show that the least squares solution x̂ = A† b minimizes the angle between Ax and b. Since the cosine
function is decreasing on [0, π], minimizing the angle θ is equivalent to maximizing cos θ, where
cos θ = (Ax)ᵀb / (∥Ax∥ ∥b∥).
By part (a) and the Cauchy–Schwarz inequality, (Ax)ᵀb = (Ax)ᵀ(Ax̂) ≤ ∥Ax∥ ∥Ax̂∥. Thus, for any x, the cosine of the angle between Ax and b is at most ∥Ax̂∥/∥b∥.
On the other hand, when x = x̂, equation (1) shows that cos θ = ∥Ax̂∥/∥b∥, so the maximum value of cos θ is attained at x = x̂; hence x = x̂ also minimizes the angle between Ax and b.
Moreover, if Ax is a positive multiple of Ax̂, then equality holds in the Cauchy–Schwarz inequality. Since A has
linearly independent columns, this occurs if and only if x is a positive multiple of x̂.
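A quick numerical illustration of the least angle property (a minimal NumPy sketch with assumed random data; no randomly chosen x should achieve a larger cosine than x̂):

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((6, 3))
b = rng.standard_normal(6)
x_hat = np.linalg.lstsq(A, b, rcond=None)[0]

def cos_angle(u, v):
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

best = cos_angle(A @ x_hat, b)
trials = [cos_angle(A @ rng.standard_normal(3), b) for _ in range(1000)]
# x_hat attains the largest cosine, equal to ||A x_hat|| / ||b||  (part (b))
print(best >= max(trials), np.isclose(best, np.linalg.norm(A @ x_hat) / np.linalg.norm(b)))
```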
[1]
Question 7
Weighted least squares. In ordinary least squares we minimize ∥Ax − b∥² = ∑_{i=1}^m (ã_iᵀx − b_i)², where ã_iᵀ are the rows of A. In the weighted least squares problem we instead minimize the objective
∑_{i=1}^m w_i (ã_iᵀx − b_i)²,
where w_i are given positive weights. The weights allow us to assign different importance to the components of
the residual vector.
(a) Show that the weighted least squares objective can be expressed as ∥D(Ax − b)∥2 for an appropriate
diagonal matrix D. This allows us to solve the weighted least squares problem as a standard least squares
problem, by minimizing
∥Bx − d∥2 , where B = DA and d = Db.
(b) Show that when A has linearly independent columns, so does the matrix B.
(c) The least squares approximate solution is given by
x̂ = (AT A)−1 AT b.
Give a similar formula for the solution of the weighted least squares problem. You might want to use the
matrix W = diag(w) in your formula.
(a)
1. Expressing the WLS Objective as ∥D(Ax − b)∥2 . The weighted least-squares (WLS) objective is given as:
∑_{i=1}^m w_i (ã_iᵀx − b_i)².
We are given that w_i > 0, so we can write w_i = (√w_i)². The objective becomes
∑_{i=1}^m (√w_i)² (ã_iᵀx − b_i)² = ∑_{i=1}^m ( √w_i (ã_iᵀx − b_i) )².
This sum of squares is the squared norm of a vector. Let’s define a vector v whose i-th component is v_i = √w_i (ã_iᵀx − b_i). Then the objective is ∥v∥².
Now, let’s look at the vector Ax − b. Its components are:
Ax − b = ( ã_1ᵀx − b_1, ã_2ᵀx − b_2, . . . , ã_mᵀx − b_m ).
To get our vector v, we need to multiply the i-th component of Ax − b by √w_i. We can do this with a diagonal matrix: let D be the m × m diagonal matrix with the square roots of the weights on its diagonal,
D = diag(√w_1, √w_2, . . . , √w_m).
Then v = D(Ax − b), and the weighted least-squares objective equals ∥D(Ax − b)∥².
2. Expressing as a Standard Least-Squares Problem ∥Bx − d∥². Using the distributive property of matrix multiplication, we can rewrite the expression inside the norm as
D(Ax − b) = DAx − Db = Bx − d,  where B = DA and d = Db,
so the objective is ∥Bx − d∥².
This is now in the form of a standard least-squares problem, where we want to find the vector x that minimizes the
squared norm of the residual vector Bx − d.
(b)
To show that a matrix M has linearly independent columns, we must show that the only solution to M x = 0 is the
trivial solution x = 0.
• We are given that A has linearly independent columns, i.e., Ax = 0 ⇒ x = 0.
• We want to show that B = DA has linearly independent columns, i.e., that Bx = 0 implies x = 0.
So we suppose that
Bx = 0
Substituting B = DA gives
(DA)x = 0
This is
D(Ax) = 0
Let y = Ax. The equation becomes
Dy = 0
Since D = diag(√w_1, √w_2, . . . , √w_m) with every w_i > 0, all diagonal entries of D are positive, so D is invertible and Dy = 0 implies y = 0. Hence Ax = y = 0, and because A has linearly independent columns, x = 0. Therefore Bx = 0 implies x = 0; that is, B = DA has linearly independent columns.
(c)
The solution x̂ to the standard least-squares problem of minimizing ∥Bx − d∥² (where B has linearly independent columns) is obtained by solving the normal equations
BᵀBx̂ = Bᵀd.
Step 1: Substitute B = DA and d = Db.
Step 2: Simplify the terms. Recall that the transpose of a product satisfies (DA)ᵀ = AᵀDᵀ. Since D is diagonal, it is symmetric, hence Dᵀ = D, and therefore (DA)ᵀ = AᵀD. Moreover D² = diag(w) = W, so
BᵀB = AᵀD·DA = AᵀWA,   Bᵀd = AᵀD·Db = AᵀWb.
The normal equations become (AᵀWA)x̂ = AᵀWb, so
x̂ = (AᵀWA)⁻¹AᵀWb.
This is the solution to the weighted least-squares problem. It has the same form as the standard least-squares solution but includes the weight matrix W, which gives higher importance to equations with larger weights.
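A minimal NumPy sketch of part (c) (random A, b and assumed positive weights w), checking that (AᵀWA)⁻¹AᵀWb agrees with solving the equivalent standard problem with B = DA, d = Db:

```python
import numpy as np

rng = np.random.default_rng(6)
A = rng.standard_normal((7, 3))
b = rng.standard_normal(7)
w = rng.uniform(0.5, 2.0, size=7)              # positive weights (assumed values)

W = np.diag(w)
D = np.diag(np.sqrt(w))

# weighted solution via the formula from part (c)
x_w = np.linalg.solve(A.T @ W @ A, A.T @ W @ b)

# same answer from the equivalent standard problem: minimize ||Bx - d||^2
B, d = D @ A, D @ b
print(np.allclose(x_w, np.linalg.lstsq(B, d, rcond=None)[0]))   # True
```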
[1]
Question 8
Network tomography. A network consists of n links, labeled 1, . . . , n. A path through the network is a
subset of the links. (The order of the links on a path does not matter here.) Each link has a (positive) delay,
which is the time it takes to traverse it. We let d denote the n-vector that gives the link delays. The total
travel time of a path is the sum of the delays of the links on the path.
Our goal is to estimate the link delays (i.e., the vector d), from a large number of (noisy) measurements of the
travel times along different paths. This data is given to you as an N × n matrix P , where
P_ij = 1 if link j is on path i, and P_ij = 0 otherwise,
and an N -vector t whose entries are the (noisy) travel times along the N paths. You can assume that N > n.
You will choose your estimate d̂ by minimizing the RMS deviation between the measured travel times t and
the travel times predicted by the sum of the link delays. Explain how to do this, and give a matrix expression
for d̂. If your expression requires assumptions about the data P or t, state them explicitly.
Note. The RMS (Root Mean Square) deviation between two vectors u and v of length N is defined as:
RMS(u, v) = √( (1/N) ∑_{i=1}^N (u_i − v_i)² ).
This measures the average magnitude of the errors between u and v. Minimizing the RMS deviation is equivalent
to minimizing the sum of squared errors ∑_{i=1}^N (u_i − v_i)², since the square root and the factor 1/N are monotonic functions. Thus, minimizing the RMS deviation is the same as minimizing the squared Euclidean norm ∥u − v∥².
Remark. This problem arises in several contexts. The network could be a computer network, and a path
gives the sequence of communication links data packets traverse. The network could be a transportation system,
with the links representing road segments.
Let tpredicted be the N × 1 vector of predicted travel times for the N paths.
The problem states that the total travel time for a path is the sum of the delays of the links on that path.
Let’s look at a single path, path i. The predicted travel time for this path is (t_predicted)_i = ∑_{j=1}^n P_ij d_j.
This expression is exactly the i-th row of the matrix–vector product P d. If we do this for all N paths, we get the
complete vector of predicted travel times:
t_predicted = P d = ( ∑_{j=1}^n P_1j d_j, ∑_{j=1}^n P_2j d_j, . . . , ∑_{j=1}^n P_Nj d_j ).
We choose d̂ to minimize the sum of squared deviations between t and t_predicted; in vector form, this is the squared Euclidean norm of the residual vector r = P d − t:
Minimize ∥P d − t∥₂².
This is a standard linear least-squares problem. We are trying to find the vector dˆ that minimizes this objective
function.
This problem is solved by finding the dˆ that satisfies the normal equations. The normal equations are derived by
taking the gradient of the objective function with respect to d and setting it to zero.
(PᵀP) d̂ = Pᵀt  ⇒  d̂ = (PᵀP)⁻¹ Pᵀt
This formula provides the optimal least-squares estimate for the delay vector d, minimizing the total squared error
between the measured and predicted travel times.
4. Required Assumptions
We are asked to state any assumptions needed for this expression. For the least-squares solution to work, we must be
able to compute the inverse (P T P )−1 .
• This is true if and only if the N × n matrix P has linearly independent columns.
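A minimal NumPy sketch of the whole procedure on synthetic data (the sizes, the random incidence matrix P, the true delays and the noise level are all assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(7)
n, N = 4, 20                                    # 4 links, 20 measured paths (toy sizes)
d_true = rng.uniform(1.0, 3.0, size=n)          # hypothetical true link delays

P = (rng.random((N, n)) < 0.5).astype(float)    # random path/link incidence matrix
t = P @ d_true + 0.05 * rng.standard_normal(N)  # noisy travel-time measurements

# least-squares estimate d_hat = (P^T P)^{-1} P^T t, assuming P has independent columns
d_hat = np.linalg.lstsq(P, t, rcond=None)[0]
print(np.round(d_true, 2), np.round(d_hat, 2))  # estimate is close to the true delays
```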
[1]
Question 9
Least squares and QR factorization. Suppose A is an m × n matrix with linearly independent columns
and QR factorization A = QR, and b is an m-vector. The vector Ax̂ is the linear combination of the columns of
A that is closest to the vector b, i.e., it is the projection of b onto the set of linear combinations of the columns
of A.
(a) Show that Ax̂ = QQT b. (The matrix QQT is called the projection matrix.)
(b) Show that ∥Ax̂ − b∥2 = ∥b∥2 − ∥QT b∥2 . (This is the square of the distance between b and the closest linear
combination of the columns of A.)
(a)
1. Start with the Normal Equations. The least-squares solution x̂ is the unique vector that satisfies the normal
equations:
AT Ax̂ = AT b
2. Substitute A = QR into the equations.
(RT QT )(QR)x̂ = RT QT b
3. Use the property QT Q = In .
RT (QT Q)Rx̂ = RT QT b
RT (In )Rx̂ = RT QT b
RT Rx̂ = RT QT b
4. Isolate x̂. Since R is invertible, RT is also invertible. We can multiply both sides on the left by (RT )−1 :
In Rx̂ = In QT b
Rx̂ = QT b
This is a very useful intermediate result. It shows that solving for x̂ using QR is equivalent to solving a triangular
system Rx̂ = QT b, which can be efficiently solved by back-substitution.
5. Find the projection Ax̂. Now we want to find the vector Ax̂:
Ax̂ = (QR)x̂ = Q(Rx̂) = Q(Qᵀb) = QQᵀb.
This completes the proof. As the problem notes, QQT is the m × m matrix that projects any vector b onto the column
space of Q, which is the same as the column space of A.
(b)
1. Use the Geometry of Projections. The problem is asking for the squared norm of the residual vector b − Ax̂.
By definition, the projection Ax̂ is the part of b that lies in the column space of A. The residual r = b − Ax̂ is the
part of b that is orthogonal to the column space of A.
This means that b can be decomposed into two orthogonal vectors: Ax̂ and (b − Ax̂):
b = Ax̂ + (b − Ax̂)
2. Apply the Pythagorean Theorem. Because Ax̂ and (b − Ax̂) are orthogonal, the Pythagorean theorem gives
∥b∥² = ∥Ax̂∥² + ∥b − Ax̂∥²,  i.e.,  ∥Ax̂ − b∥² = ∥b∥² − ∥Ax̂∥².
3. Find an expression for ∥Ax̂∥2 . From the QR decomposition, we have two useful expressions:
Ax̂ = Q(Rx̂),  and for any vector v, ∥Qv∥² = vᵀQᵀQv = vᵀI_n v = ∥v∥².
Therefore,
∥Ax̂∥² = ∥Q(Rx̂)∥² = ∥Rx̂∥².
From step 4 previously, Rx̂ = Qᵀb. Substituting this gives ∥Ax̂∥² = ∥Qᵀb∥², and therefore ∥Ax̂ − b∥² = ∥b∥² − ∥Ax̂∥² = ∥b∥² − ∥Qᵀb∥², as required.
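Both identities of this question can be checked numerically with a minimal NumPy sketch (random A and b assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(8)
A = rng.standard_normal((6, 3))
b = rng.standard_normal(6)

Q, R = np.linalg.qr(A)
x_hat = np.linalg.lstsq(A, b, rcond=None)[0]

# (a) A x_hat equals the projection Q Q^T b
print(np.allclose(A @ x_hat, Q @ (Q.T @ b)))                  # True
# (b) ||A x_hat - b||^2 = ||b||^2 - ||Q^T b||^2
lhs = np.linalg.norm(A @ x_hat - b) ** 2
rhs = np.linalg.norm(b) ** 2 - np.linalg.norm(Q.T @ b) ** 2
print(np.isclose(lhs, rhs))                                   # True
```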
[1]
Question 10
Minimizing a squared norm plus an affine function. A generalization of the least squares problem adds
an affine function to the least squares objective: minimize
∥Ax − b∥² + cᵀx + d,
where the n-vector x is the variable to be chosen, and the (given) data are the m × n matrix A, the m-vector
b, the n-vector c, and the number d. We will use the same assumption we use in least squares: The columns of
A are linearly independent. This generalized problem can be solved by reducing it to a standard least squares
problem, using a trick called completing the square.
Show that the objective of the problem above can be expressed in the form
∥Ax − (b − f)∥² + g
for some m-vector f and some constant g. It follows that we can solve the generalized least squares problem
by minimizing ∥Ax − (b − f )∥, an ordinary least squares problem with solution
x̂ = A† (b − f ).
Hints. Express the norm squared term on the right-hand side as ∥(Ax − b) + f ∥2 and expand it. Then argue
that the equality above holds provided 2AT f = c. One possible choice is
f = ½ (A†)ᵀ c.
(You must justify these statements.)
We want to choose the m-vector f and the constant g so that the original objective J(x) = ∥Ax − b∥² + cᵀx + d equals
K(x) = ∥(Ax − b) + f∥² + g = ∥Ax − (b − f)∥² + g.
We can expand the squared norm ∥u + v∥2 as (u + v)T (u + v) = ∥u∥2 + 2uT v + ∥v∥2 . Let u = Ax − b and v = f .
Requiring J(x) = K(x) for all x then gives
∥Ax − b∥² + cᵀx + d = ∥Ax − b∥² + 2(Aᵀf)ᵀx + (∥f∥² − 2bᵀf + g).
a) Matching the linear terms: We can cancel the ∥Ax − b∥² term from both sides; matching the terms that depend on x then requires
cᵀx = 2(Aᵀf)ᵀx for all x.
Taking the transpose of both sides gives the condition for f :
c = 2(Aᵀf), i.e., Aᵀf = ½ c.
This proves the statement from the hint: the equality holds provided 2AT f = c.
b) Matching the constant terms:
d = ∥f ∥2 − 2bT f + g
We can solve this for g to find the value of the constant:
g = d − ∥f ∥2 + 2bT f
4. Conclusion
We have successfully shown that by choosing:
1. f = ½ (A†)ᵀc, which satisfies 2Aᵀf = Aᵀ(A†)ᵀc = (A†A)ᵀc = c, since A†A = In;
2. g = d − ∥f ∥2 + 2bT f
The original objective function is equivalent to the new one:
∥Ax − (b − f )∥2 + g
To solve the original problem, we must find the x that minimizes this expression. Since g is a constant, we only need to minimize the squared norm term. This is a standard least-squares problem: minimize ∥Ax − (b − f)∥², whose solution is x̂ = A†(b − f).
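A minimal NumPy sketch of the completing-the-square construction (random A, b, c and the constant d = 1.7 are assumed for illustration), checking that J and K agree everywhere and that the gradient of J vanishes at x̂ = A†(b − f):

```python
import numpy as np

rng = np.random.default_rng(9)
A = rng.standard_normal((6, 3))
b = rng.standard_normal(6)
c = rng.standard_normal(3)
d = 1.7                                          # arbitrary constant (assumed value)

J = lambda x: np.sum((A @ x - b) ** 2) + c @ x + d

A_pinv = np.linalg.pinv(A)                       # A^dagger = (A^T A)^{-1} A^T
f = 0.5 * A_pinv.T @ c
g = d - f @ f + 2 * b @ f

K = lambda x: np.sum((A @ x - (b - f)) ** 2) + g
x_hat = A_pinv @ (b - f)                         # minimizer of the generalized problem

x_test = rng.standard_normal(3)
print(np.isclose(J(x_test), K(x_test)))          # the two objectives agree
print(np.allclose(2 * A.T @ (A @ x_hat - b) + c, 0))  # gradient of J vanishes at x_hat
```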
Acknowledgments
I, Le Mai Thanh Son (Zeref), PASS Leader for the MATH2050 Linear Algebra course this semester, would like to express my gratitude to the authors of the following resources, whose work on linear algebra I have drawn on in this document:
• Introduction to Applied Linear Algebra – Vectors, Matrices, and Least Squares [1]
• Elementary Linear Algebra [2]
Their contributions have been instrumental in shaping the outcomes of this work.