Mathematical Foundations for Data
Science (SSCSH ZC416)
Lecture-1
BITS Pilani Dr. Pritee Agarwal
(pritee.a@[Link])
Pilani Campus
Agenda
• Matrices
• Gauss Elimination
• Consistency of Linear Systems
Matrices
A matrix is a rectangular array of numbers or functions
which we will enclose in brackets. For example,
(1)   [0.3  1  5;  0  0.2  16],   [a11 a12 a13; a21 a22 a23; a31 a32 a33],   [e^(−x)  2x²; e^(6x)  4x],   [a1  a2  a3],   [4; 1/2]
The numbers (or functions) are called entries or, less
commonly, elements of the matrix.
The first matrix in (1) has two rows, which are the
horizontal lines of entries.
Matrix – Notations
We shall denote matrices by capital boldface letters A, B,
C, … , or by writing the general entry in brackets; thus A =
[ajk], and so on.
By an m × n matrix (read m by n matrix) we mean a matrix
with m rows and n columns—rows always come first! m × n
is called the size of the matrix. Thus an m × n matrix is of the
form
(2)   A = [ajk] = [a11 a12 … a1n; a21 a22 … a2n; … ; am1 am2 … amn]
Vectors
A vector is a matrix with only one row or column. Its entries
are called the components of the vector.
We shall denote vectors by lowercase boldface letters a, b, …
or by its general component in brackets, a = [aj], and so on.
Our special vectors in (1) suggest that a (general) row vector
is of the form a = [a1 a2 … an].
A column vector is of the form b = [b1; b2; … ; bm].
Equality of Matrices
Two matrices A = [ajk] and B = [bjk] are equal, written A = B, if
and only if (1) they have the same size and (2) the
corresponding entries are equal, that is, a11 = b11, a12 = b12, and
so on.
Matrices that are not equal are called different. Thus, matrices
of different sizes are always different.
Algebra of Matrices
Addition of Matrices
The sum of two matrices A = [ajk] and B = [bjk] of the same size is written A + B and
has the entries ajk + bjk obtained by adding the corresponding entries of A and B.
Matrices of different sizes cannot be added.
Scalar Multiplication (Multiplication by a Number)
The product of any m × n matrix A = [ajk] and any scalar c (number c) is written cA
and is the m × n matrix cA = [cajk] obtained by multiplying each entry of A by c.
(a) A + B = B + A                                      (a) c(A + B) = cA + cB
(b) (A + B) + C = A + (B + C) (written A + B + C)      (b) (c + k)A = cA + kA
(c) A + 0 = A                                          (c) c(kA) = (ck)A (written ckA)
(d) A + (−A) = 0.                                      (d) 1A = A.
Here 0 denotes the zero matrix (of size m × n), that is, the m × n matrix with all entries
zero.
Matrix Multiplication
Multiplication of a Matrix by a Matrix
The product C = AB (in this order) of an m × n matrix A = [ajk]
times an r × p matrix B = [bjk] is defined if and only if r = n and
is then the m × p matrix C = [cjk] with entries
(1)   cjk = aj1 b1k + aj2 b2k + … + ajn bnk = Σ_{l=1}^{n} ajl blk,   j = 1, …, m;  k = 1, …, p.
The condition r = n means that the second factor, B, must have
as many rows as the first factor has columns, namely n. A
diagram of sizes that shows when matrix multiplication is
possible is as follows:
A B = C
[m × n] [n × p] = [m × p].
Matrix Multiplication
EXAMPLE 1
Matrix Multiplication
AB = [3 5 −1; 4 0 2; −6 −3 2] [2 −2 3 1; 5 0 7 8; 9 −4 1 1] = [22 −2 43 42; 26 −16 14 6; −9 4 −37 −28]
Here c11 = 3 · 2 + 5 · 5 + (−1) · 9 = 22, and so on. The entry in
the box is c23 = 4 · 3 + 0 · 7 + 2 · 1 = 14. The product BA is
not defined.
Matrix Multiplication
Matrix Multiplication Is Not Commutative,
AB ≠ BA in General
This is illustrated by Example 1, where one of the two
products is not even defined. But it also holds for square
matrices. For instance,
[1 1; 100 100] [−1 1; 1 −1] = [0 0; 0 0],
but
[−1 1; 1 −1] [1 1; 100 100] = [99 99; −99 −99].
It is interesting that this also shows that AB = 0 does not
necessarily imply BA = 0 or A = 0 or B = 0.
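This behaviour is easy to verify numerically. The following NumPy sketch (added here; not part of the original slides) multiplies the two matrices from the example in both orders:

import numpy as np

A = np.array([[1, 1], [100, 100]])
B = np.array([[-1, 1], [1, -1]])

print(A @ B)   # [[0 0], [0 0]]   -> AB is the zero matrix
print(B @ A)   # [[99 99], [-99 -99]]   -> BA is not zero, so AB != BA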
Transposition of Matrices &
Vectors
Transposition of Matrices and Vectors
The transpose of an m × n matrix A = [ajk] is the n × m matrix
AT (read A transpose) that has the first row of A as its first
column, the second row of A as its second column, and so on.
Thus the transpose of A in (2) is AT = [akj], written out
(9)   AT = [akj] = [a11 a21 … am1; a12 a22 … am2; … ; a1n a2n … amn]
As a special case, transposition converts row vectors to
column vectors and conversely.
Transposition of Matrices
Rules for transposition are
(10)
(a) (AT)T = A
(b) (A + B)T = AT + BT
(c) (cA)T = cAT
(d) (AB)T = BT AT.
CAUTION! Note that in (10d) the transposed matrices are in reversed order.
Special Matrices
• Symmetric: aij = aji
• Skew Symmetric : aij = - aji
• Triangular: Upper Triangular aij = 0 for all i > j
Lower Triangular aij = 0 for all i < j
• Diagonal Matrix: aij = 0 for all i ≠ j
• Sparse Matrix: Many zero entries and only a few non-zero entries
Linear Systems
A linear system of m equations in n unknowns x1, … , xn is
a set of equations of the form
(1)   a11 x1 + a12 x2 + … + a1n xn = b1
      a21 x1 + a22 x2 + … + a2n xn = b2
      …
      am1 x1 + am2 x2 + … + amn xn = bm
The system is called linear because each variable xj appears
in the first power only, just as in the equation of a straight
line. a11, … , amn are given numbers, called the coefficients
of the system. b1, … , bm on the right are also given numbers.
If all the bj are zero, then (1) is called a homogeneous
system. If at least one bj is not zero, then (1) is called a
nonhomogeneous system.
Elementary Row Operations
Elementary Operations for Equations:
Interchange of two equations
Addition of a constant multiple of one equation to another
equation
Multiplication of an equation by a nonzero constant c
Clearly, the interchange of two equations does not alter the
solution set. Neither does their addition because we can
undo it by a corresponding subtraction. Similarly for their
multiplication, which we can undo by multiplying the new
equation by 1/c (since c ≠ 0), producing the original equation.
Elementary Row Operations
We now call a linear system S1 row-equivalent to a linear system S2 if S1 can be
obtained from S2 by (finitely many!) row operations. This justifies Gauss
elimination and establishes the following result.
Theorem 1
Row-Equivalent Systems
Row-equivalent linear systems have the same set of solutions.
Because of this theorem, systems having the same solution sets are often called
equivalent systems. But note well that we are dealing with row operations. No
column operations on the augmented matrix are permitted in this context
because they would generally alter the solution set.
Row Echelon Form (REF)
At the end of the elementary row operations, the resulting form of the
coefficient matrix, the augmented matrix, and the system itself is called
the row echelon form.
In it, rows of zeros, if present, are the last rows, and, in each
nonzero row, the leftmost nonzero entry is farther to the right
than in the previous row. For instance, a coefficient matrix and its
augmented matrix in row echelon form are

(8)   [3 2 1; 0 −1/3 1/3; 0 0 0]   and   [3 2 1 | 3; 0 −1/3 1/3 | −2; 0 0 0 | 12].
Note that we do not require that the leftmost nonzero entries be 1
since this would have no theoretic or numeric advantage.
Row Echelon Form and
Information from it
The original system of m equations in n unknowns has
augmented matrix [A | b]. This is to be row reduced to
matrix [R | f].
The two systems Ax = b and Rx = f are equivalent: if either
one has a solution, so does the other, and the solutions are
identical.
Linear Systems – Matrix form
Matrix Form of the Linear System (1).
From the definition of matrix multiplication we see that the
m equations of (1) may be written as a single vector
equation
(2) Ax = b
where the coefficient matrix A = [ajk] is the m × n matrix in (2), and
x = [x1; x2; … ; xn]   and   b = [b1; b2; … ; bm]
are column vectors.
Matrix Form of Linear System
Matrix Form of the Linear System (1). (continued)
We assume that the coefficients ajk are not all zero, so that A is not a zero matrix.
Note that x has n components, whereas b has m components. The matrix
à = [A | b] = [a11 … a1n | b1; a21 … a2n | b2; … ; am1 … amn | bm]
is called the augmented matrix of the system (1). The dashed vertical line could
be omitted, as we shall do later. It is merely a reminder that the last column of
à did not come from matrix A but came from vector b. Thus, we augmented the
matrix A.
Gauss Elimination and Back
Substitution
Triangular form:
Triangular means that all the nonzero entries of the
corresponding coefficient matrix lie above the diagonal and
form an upside-down 90° triangle. Then we can solve the
system by back substitution.
Since a linear system is completely determined by its
augmented matrix, Gauss elimination can be done by
merely considering the matrices.
(We do this again in the next example, emphasizing the matrices by writing
them first and the equations behind them, just as a help in order not to lose
track.)
Gauss Elimination
Solve the linear system
x1 − x2 + x3 = 0
−x1 + x2 − x3 = 0
10x2 + 25x3 = 90
20x1 + 10x2 = 80.
Solution by Gauss Elimination.
This system could be solved rather quickly by noticing its
particular form. But this is not the point. The point is that the
Gauss elimination is systematic and will work in general,
also for large systems. We apply it to our system and then do
back substitution.
Example – Gauss Elimination
Solution by Gauss Elimination. (continued)
As indicated, let us write the augmented matrix of the system first and then the
system itself:
Augmented Matrix à Equations
Pivot 1 Pivot 1 x1 x2 x3 0
1 1 1 0
1 1 1 0 x1 x2 x3 0
Eliminate
Eliminate 0 10 25 90 10 x2 25 x3 90
20 10 0 80 20 x1 10 x2 80.
Gauss Elimination
Solution by Gauss Elimination. (continued)
Step 1. Elimination of x1
Call the first row of A the pivot row and the first equation
the pivot equation. Call the coefficient 1 of its x1-term the
pivot in this step. Use this equation to eliminate x1
(get rid of x1) in the other equations. For this, do:
Add 1 times the pivot equation to the second equation.
Add −20 times the pivot equation to the fourth equation.
This corresponds to row operations on the augmented
matrix as indicated in BLUE behind the new matrix in (3). So
the operations are performed on the preceding matrix.
Gauss Elimination
Solution by Gauss Elimination. (continued)
Step 1. Elimination of x1 (continued)
The result is
(3)   [ 1  −1   1 |  0 ]                         x1 − x2 + x3 = 0
      [ 0   0   0 |  0 ]   Row 2 + Row 1         0 = 0
      [ 0  10  25 | 90 ]                         10x2 + 25x3 = 90
      [ 0  30 −20 | 80 ]   Row 4 − 20 Row 1      30x2 − 20x3 = 80.
Gauss Elimination
Solution by Gauss Elimination. (continued)
Step 2. Elimination of x2
The first equation remains as it is. We want the new second equation to serve as
the next pivot equation. But since it has no x2-term (in fact, it is 0 = 0), we must
first change the order of the equations and the corresponding rows of the new
matrix. We put 0 = 0 at the end and move the third equation and the fourth
equation one place up. This is called partial pivoting (as opposed to the rarely
used total pivoting, in which the order of the unknowns is also changed).
Gauss Elimination
Solution by Gauss Elimination. (continued)
Step 2. Elimination of x2 (continued)
It gives
                    [ 1  −1   1 |  0 ]           x1 − x2 + x3 = 0
Pivot 10 →          [ 0  10  25 | 90 ]           10x2 + 25x3 = 90
Eliminate 30x2 →    [ 0  30 −20 | 80 ]           30x2 − 20x3 = 80
                    [ 0   0   0 |  0 ]           0 = 0.
Gauss Elimination
Solution by Gauss Elimination. (continued)
Step 2. Elimination of x2 (continued)
To eliminate x2, do:
Add −3 times the pivot equation to the third equation.
The result is
(4)   [ 1  −1   1 |    0 ]                          x1 − x2 + x3 = 0
      [ 0  10  25 |   90 ]                          10x2 + 25x3 = 90
      [ 0   0 −95 | −190 ]   Row 3 − 3 Row 2        −95x3 = −190
      [ 0   0   0 |    0 ]                          0 = 0.
Gauss Elimination
Solution by Gauss Elimination. (continued)
Back Substitution. Determination of x3, x2, x1 (in this order)
Working backward from the last to the first equation of this
“triangular” system (4), we can now readily find x3, then x2,
and then x1:
−95x3 = −190,   hence x3 = 2
10x2 + 25x3 = 90,   hence x2 = (90 − 25 · 2)/10 = 4
x1 − x2 + x3 = 0,   hence x1 = x2 − x3 = 2.
This is the answer to our problem. The solution is unique.
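A quick numerical cross-check of this example, added here as a sketch (NumPy is not part of the original slides). Because the four equations are consistent, a least-squares solve of the rectangular system returns the exact solution:

import numpy as np

# coefficient matrix and right-hand side of the 4-equation system above
A = np.array([[ 1, -1,  1],
              [-1,  1, -1],
              [ 0, 10, 25],
              [20, 10,  0]], dtype=float)
b = np.array([0, 0, 90, 80], dtype=float)

# the system is consistent, so least squares recovers the exact solution
x, *_ = np.linalg.lstsq(A, b, rcond=None)
print(x)   # approximately [2. 4. 2.]  -> x1 = 2, x2 = 4, x3 = 2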
Gauss Elimination
At the end of the Gauss elimination (before the back substitution), the row
echelon form of the augmented matrix
• is in upper triangular form
• has its first r rows non-zero
• has exactly m − r zero rows (for a system of m equations)
• for a consistent system, the last m − r entries of the right-hand side are also zero
• if any of the last m − r right-hand-side entries is non-zero, the system is inconsistent
• costs O(n³) operations for an n × n system
• facilitates the back substitution
Solution to System of Linear
Equations
A linear system is called
• overdetermined if it has more equations than unknowns
• determined if m = n, and
• underdetermined if it has fewer equations than unknowns.
Furthermore, a system is called
• consistent if it has at least one solution (thus, one solution or infinitely many
solutions),
• inconsistent if it has no solutions at all, as in
x1 + x2 = 1,
x1 + x2 = 0
Solution
The number of nonzero rows, r, in the row-reduced
coefficient matrix R is called the rank of R and also the
rank of A. Here is the method for determining whether
Ax = b has solutions and what they are:
(a) No solution. If r is less than m (meaning that R actually
has at least one row of all 0s) and at least one of the numbers
fr+1, fr+2, … , fm is not zero, then the system Rx = f is
inconsistent: No solution is possible. Therefore the system
Ax = b is inconsistent as well.
Solution
If the system is consistent (either r = m, or r < m and all the
numbers fr+1, fr+2, … , fm are zero), then there are solutions.
(b) Unique solution. If the system is consistent and r = n,
there is exactly one solution, which can be found by back
substitution.
(c) Infinitely many solutions. To obtain any of these
solutions, choose values of xr+1, … , xn arbitrarily. Then solve
the rth equation for xr (in terms of those arbitrary values),
then the (r − 1)st equation for xr−1, and so on up the line.
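The three cases above can be checked mechanically by comparing rank A with rank [A | b] and with n. The helper below is an illustrative sketch added here (the function name classify is my own, not from the slides):

import numpy as np

def classify(A, b):
    """Classify Ax = b as inconsistent, unique, or infinitely many solutions."""
    A = np.asarray(A, dtype=float)
    b = np.asarray(b, dtype=float).reshape(-1, 1)
    n = A.shape[1]
    r_A = np.linalg.matrix_rank(A)
    r_Ab = np.linalg.matrix_rank(np.hstack([A, b]))
    if r_A < r_Ab:
        return "no solution"
    return "unique solution" if r_A == n else "infinitely many solutions"

print(classify([[1, 1], [1, 1]], [1, 0]))   # no solution (the inconsistent example above)
print(classify([[1, 1], [1, -1]], [1, 0]))  # unique solution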
Mathematical Foundations for Data
Science (SSCSH ZC416)
Lecture-2
BITS Pilani Dr. Pritee Agarwal
(pritee.a@[Link])
Pilani Campus
Agenda
• Fields
• Vector spaces and subspaces
• Linear independence and dependence
• Basis and dimensions
• Linear transformations
• Affine Spaces
Field – Definition and
Examples
Group: (G, ∗) is a group if
i. ∗ is closed
ii. ∗ is associative
iii. ∗ has an identity
iv. every element has an inverse under ∗
Eg: ⟨R, +⟩, ⟨R \ {0}, ×⟩
⟨G, ∗⟩ is Abelian if a ∗ b = b ∗ a ∀ a, b ∈ G
Eg: ⟨R, +⟩, ⟨R \ {0}, ×⟩ are Abelian
⟨F, +, ·⟩ is a field if ⟨F, +⟩ and ⟨F \ {0}, ·⟩ are Abelian groups and · distributes over +
Eg: ⟨R, +, ·⟩, ⟨C, +, ·⟩, ⟨Q, +, ·⟩
Vector Space
Real Vector Space
A nonempty set V of elements a, b, … is called a real vector
space (or real linear space), and these elements are called
vectors (regardless of their nature, which will come out from
the context or will be left arbitrary) if, in V, there are defined
two algebraic operations (called vector addition and scalar
multiplication) as follows.
I. Vector addition associates with every pair of vectors a and b
of V a unique vector of V, called the sum of a and b and
denoted by a + b, such that the following axioms are satisfied.
Vector Space
Real Vector Space (continued 1)
I.1 Commutativity. For any two vectors a and b of V,
a + b = b + a.
I.2 Associativity. For any three vectors a, b, c of V,
(a + b) + c = a + (b + c) (written a + b + c).
I.3 There is a unique vector in V, called the zero vector and
denoted by 0, such that for every a in V,
a + 0 = a.
I.4 For every a in V, there is a unique vector in V that is
denoted by −a and is such that
a + (−a) = 0.
Vector Space
Real Vector Space (continued 2)
II. Scalar multiplication. The real numbers are called scalars. Scalar
multiplication associates with every a in V and every scalar c a unique
vector of V, called the product of c and a and denoted by ca (or ac) such
that the following axioms are satisfied.
II.1 Distributivity. For every scalar c and vectors a and b in V,
c(a + b) = ca + cb.
II.2 Distributivity. For all scalars c and k and every a in V,
(c + k)a = ca + ka.
II.3 Associativity. For all scalars c and k and every a in V,
c(ka) = (ck)a (written cka).
II.4 For every a in V,
1a = a.
Subspace
By a subspace of a vector space V we mean
“a nonempty subset of V (including V itself) that forms a
vector space with respect to the two algebraic operations
(addition and scalar multiplication) defined for the vectors
of V.”
• A subspace (W, +, ·) sits within a vector space
• W ≠ ∅ and W ⊆ (V, +, ·) over F is a subspace if
• 0 ∈ W, and α w1 + w2 ∈ W for all w1, w2 ∈ W and α ∈ F
• Ex: V = {(x1, x2) | x1, x2 ∈ R} over R, W = {(x1, 0) | x1 ∈ R}
• The set of singular matrices is not a subspace of M2×2
Linear Dependence and
Independence of Vectors
Given any set of m vectors a(1), … , a(m) (with the same
number of components), a linear combination of these
vectors is an expression of the form
c1a(1) + c2 a(2) + … + cma(m)
where c1, c2, … , cm are any scalars.
Now consider the equation
(1) c1a(1) + c2 a(2) + … + cma(m) = 0
Clearly, this vector equation (1) holds if we choose all cj’s zero,
because then it becomes 0 = 0.
If this is the only m-tuple of scalars for which (1) holds,
then our vectors a(1) … , a(m) are said to form a linearly
independent set or, more briefly, we call them linearly
independent.
Linear Dependence and
Independence of Vectors
Otherwise, if (1) also holds with scalars not all zero, we
call these vectors linearly dependent.
This means that we can express at least one of the vectors
as a linear combination of the other vectors. For instance,
if (1) holds with, say, c1 ≠ 0, we can solve (1) for a(1):
a(1) = k2 a(2) + … + kma(m) where kj = −cj /c1.
The rank of a matrix A is the maximum number of linearly
independent row vectors of A.
It is denoted by rank A.
Linear Dependence and
Independence of Vectors
Linear Independence and Dependence of Vectors
Consider p vectors that each have n components. Then these
vectors are linearly independent if the matrix formed, with these
vectors as row vectors, has rank p.
However, these vectors are linearly dependent if that matrix has
rank less than p.
Linear Dependence of Vectors
Consider p vectors each having n components. If n < p, then these vectors are linearly
dependent.
Linear Dependence and
Independence
Let S = {v1, v2, …, vn} ⊆ V
Elements of S are LI if
Σ_{i=1}^{n} αi vi = 0   with αi = 0 ∀ i as the only solution
Elements of S are LD if
Σ_{i=1}^{n} αi vi = 0   has at least one non-zero solution
Eg: V = Rⁿ over R
LI and LD are related to rank
Basis and Dimension
The maximum number of linearly independent vectors in V is
called the dimension of V and is denoted by dim V.
A linearly independent set in V consisting of a maximum
possible number of vectors in V is called a basis for V.
The number of vectors of a basis for V equals dim V.
The vector space Rn consisting of all vectors with n components
(n real numbers) has dimension n.
•R over R One dimensional vector space
•C over C One dimensional vector space
•R over Q Infinite dimensional
Basis and Dimension
The set of all linear combinations of given vectors a(1), … ,
a(p) with the same number of components is called the
span of these vectors. Obviously, a span is a vector space.
If in addition, the given vectors a(1), … , a(p) are linearly
independent, then they form a basis for that vector space.
This then leads to another equivalent definition of basis.
A set of vectors is a basis for a vector space V if (1) the
vectors in the set are linearly independent, and if (2) any
vector in V can be expressed as a linear combination of the
vectors in the set. If (2) holds, we also say that the set of
vectors spans the vector space V.
Row Space and Column
Space
○ If A is an m × n matrix
■ the subspace of Rn spanned by the row vectors of A is called
the row space of A
■ the subspace of Rm spanned by the column vectors is called
the column space of A
The solution space of the homogeneous system of equation Ax = 0,
which is a subspace of Rn, is called the nullspace of A.
A(m×n) = [a11 a12 … a1n; a21 a22 … a2n; … ; am1 am2 … amn],
c1 = [a11; a21; … ; am1],  c2 = [a12; a22; … ; am2],  …,  cn = [a1n; a2n; … ; amn]
Basis for Row Space and
Column Space
If a matrix R is in row echelon form
– the row vectors with the leading 1’s (i.e., the nonzero
row vectors) form a basis for the row space of R
– the column vectors with the leading 1’s of the row
vectors form a basis for the column space of R
Basis for Row Space
Find a basis of the row space of
A = [1 3 1 3; 0 1 1 0; −3 0 6 −1; 3 4 −2 1; 2 0 −4 −2]
Sol:
Row reduction gives the row echelon form
B = [1 3 1 3; 0 1 1 0; 0 0 0 1; 0 0 0 0; 0 0 0 0]   (nonzero rows w1, w2, w3)
A basis for RS(A) = {the nonzero row vectors of B}
= {w1, w2, w3} = {(1, 3, 1, 3), (0, 1, 1, 0), (0, 0, 0, 1)}
Basis for Column Space
Find a basis for the column space of the matrix A, where
A = [1 2 −1 −2 0; 2 4 −1 1 0; 3 6 −1 4 1; 0 0 1 5 0]   (columns a1, a2, a3, a4, a5)
Reduce A to the reduced row-echelon form
E = [1 2 0 3 0; 0 0 1 5 0; 0 0 0 0 1; 0 0 0 0 0] = [e1 e2 e3 e4 e5]
e2 = 2e1  ⟹  a2 = 2a1
e4 = 3e1 + 5e3  ⟹  a4 = 3a1 + 5a3
{a1, a3, a5} is a basis for the column space of A
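Both examples can be reproduced with SymPy; rref() returns the reduced row-echelon form together with the pivot column indices. This is an added sketch (not from the slides) using the matrices as reconstructed above:

import sympy as sp

# row space example
A = sp.Matrix([[1, 3, 1, 3],
               [0, 1, 1, 0],
               [-3, 0, 6, -1],
               [3, 4, -2, 1],
               [2, 0, -4, -2]])
R, _ = A.rref()
# nonzero rows of the RREF form a basis for RS(A) (equivalent to the REF basis above)
print([tuple(R.row(i)) for i in range(R.rows) if any(R.row(i))])

# column space example: pivot columns of the RREF point to a basis of CS(A)
B = sp.Matrix([[1, 2, -1, -2, 0],
               [2, 4, -1,  1, 0],
               [3, 6, -1,  4, 1],
               [0, 0,  1,  5, 0]])
_, pivots = B.rref()
print(pivots)                       # (0, 2, 4) -> columns a1, a3, a5
print([B.col(j).T for j in pivots])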
Solution Space/ Null Space
● Determine a basis and the dimension of the solution space of the
homogeneous system
2x1 + 2x2 – x3 + x5 = 0
-x1 + x2 + 2x3 – 3x4 + x5 = 0
x1 + x2 – 2x3 – x5 = 0
x3+ x4 + x5 = 0
○ The general solution of the given system is
x1 = -s-t, x2 = s,
x3 = -t, x4 = 0, x5 = t
○ Therefore, the solution vectors can be written as
v1 = [−1; 1; 0; 0; 0]   and   v2 = [−1; 0; −1; 0; 1]
Solution Space/ Null Space
Find the solution space of a homogeneous system Ax = 0.
A = [1 2 −2 1; 3 6 −5 4; 1 2 0 3]
Row reduction gives the row echelon form
[1 2 0 3; 0 0 1 1; 0 0 0 0]
x1 = −2s − 3t,  x2 = s,  x3 = −t,  x4 = t
x = [x1; x2; x3; x4] = [−2s − 3t; s; −t; t] = s[−2; 1; 0; 0] + t[−3; 0; −1; 1] = s v1 + t v2
NS(A) = {s v1 + t v2 | s, t ∈ R}
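The same null-space basis can be obtained with SymPy's nullspace(); this is an added sketch (not from the slides) using the matrix as reconstructed above:

import sympy as sp

A = sp.Matrix([[1, 2, -2, 1],
               [3, 6, -5, 4],
               [1, 2,  0, 3]])

basis = A.nullspace()               # basis vectors of NS(A)
for v in basis:
    print(v.T)                      # (-2, 1, 0, 0) and (-3, 0, -1, 1)
print(A * basis[0], A * basis[1])   # both products are the zero vector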
Rank of a matrix
(Row and column space have equal dimensions)
If A is an m × n matrix, then the row space and the column
space of A have the same dimension.
The dimension of the row (or column) space of a
matrix A is called the rank of A.
Nullity of Matrix
§ The dimension of the nullspace of A is called the
nullity of A
§ Notes: rank(AT) = dim(RS(AT)) = dim(CS(A)) = rank(A)
Therefore rank(AT ) = rank(A)
(Dimension of the solution space)
If A is an m × n matrix of rank r, then the dimension
of the solution space of Ax = 0 is n − r. That is,
rank(A) + nullity(A) = n
Rank and Nullity of Matrix
rank(A): the number of leading variables in the solution of Ax = 0
(the number of nonzero rows in the row-echelon form of A).
nullity(A): the number of free variables (non-leading variables) in the
solution of Ax = 0.
If A is an m × n matrix and rank(A) = r, then
Fundamental Space      Dimension
RS(A) = CS(Aᵀ)         r
CS(A) = RS(Aᵀ)         r
NS(A)                  n − r
NS(Aᵀ)                 m − r
Rank and Nullity of Matrix
● Find the rank and nullity of the matrix
A = [−1 2 0 4 5 −3; 3 −7 2 0 1 4; 2 −5 2 4 6 1; 4 −9 2 −4 −4 7]
● Solution:
○ The reduced row-echelon form of A is
[1 0 −4 −28 −37 13; 0 1 −2 −12 −16 5; 0 0 0 0 0 0; 0 0 0 0 0 0]
○ Since there are two nonzero rows, the row space and column
space are both two-dimensional, so rank(A) = 2.
Rank and Nullity of Matrix
○ The corresponding system of equations will be
x1 – 4x3 – 28x4 – 37x5 + 13x6 = 0
x2 – 2x3 – 12x4 – 16 x5+ 5 x6 = 0
○ It follows that the general solution of the system is
x1 = 4r + 28s + 37t – 13u, x2 = 2r + 12s + 16t – 5u,
x3 = r, x4 = s, x5 = t, x6 = u
or
[x1; x2; x3; x4; x5; x6] = r[4; 2; 1; 0; 0; 0] + s[28; 12; 0; 1; 0; 0] + t[37; 16; 0; 0; 1; 0] + u[−13; −5; 0; 0; 0; 1]
Thus, nullity(A) = 4
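As an added cross-check (not part of the slides), NumPy and SciPy give the rank and an orthonormal null-space basis directly, and rank plus nullity equals the number of columns; the matrix below is the one reconstructed above:

import numpy as np
from scipy.linalg import null_space

A = np.array([[-1,  2, 0,  4,  5, -3],
              [ 3, -7, 2,  0,  1,  4],
              [ 2, -5, 2,  4,  6,  1],
              [ 4, -9, 2, -4, -4,  7]], dtype=float)

r = np.linalg.matrix_rank(A)
N = null_space(A)                      # orthonormal basis of NS(A)
print(r, N.shape[1])                   # 2 4  -> rank 2, nullity 4
print(r + N.shape[1] == A.shape[1])    # rank + nullity = number of columns (6)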
Lecture 3
Math Foundations Team
Analytic geometry
▶ We have studied vector spaces in the previous lecture.
▶ Now we would like to provide some geometric interpretation
to these concepts.
▶ We shall take a close look at geometric vectors and the
concepts of lengths of vectors and angles between vectors.
▶ But first we need to add the concept of an inner product to
our vector space.
Norms
▶ A norm on a vector space is a function ∥.∥ : V → R, x → ∥x ∥
which assigns to each vector x a length ∥x ∥ such that for all
λ ∈ R and x , y ∈ V the following properties hold:
▶ Absolutely homogeneous: ∥λx ∥ = |λ|∥x ∥
▶ Triangle inequality: ∥x + y∥ ≤ ∥x∥ + ∥y∥
▶ Positive definite: ∥x∥ ≥ 0 and ∥x∥ = 0 =⇒ x = 0
▶ Manhattan norm: ∥x∥₁ = Σ_{i=1}^{n} |xi|, where |.| represents
absolute value.
▶ Euclidean norm: ∥x∥₂ = √( Σ_{i=1}^{n} xi² ).
Inner products
▶ Dot product in Rⁿ is given by xᵀy = Σ_{i=1}^{n} xi yi
▶ A bilinear mapping Ω is a mapping with two arguments and is
linear in both arguments: Let V be a vector space such that
x , y , z ∈ V , and let λ, ψ ∈ R. Then we have
Ω(λx + ψy, z) = λΩ(x, z) + ψΩ(y, z), and
Ω(x, λy + ψz) = λΩ(x, y) + ψΩ(x, z).
▶ Let V be a vector space and Ω : V × V → R be a bilinear
mapping that takes two vectors as arguments and returns a
real number. Then Ω is called symmetric if
Ω(x, y) = Ω(y, x). Also Ω is called positive-definite if
∀x ∈ V \ {0}: Ω(x, x) > 0, and Ω(0, 0) = 0.
Inner products
▶ A positive-definite, symmetric bilinear mapping
Ω : V × V → R is called an inner product. To denote an inner
product on V we generally write ⟨x , y ⟩.
▶ The pair (V , ⟨., .⟩) is called an inner product space.
▶ Next we introduce the concept of symmetric, positive-definite
matrices and show we can express an inner product using such
matrices.
▶ We recall that in a vector space V any vector x can be written
as a linear combination of the basis vectors. We use this to
express an inner product in terms of a matrix.
Symmetric, positive-definite matrices
Theorem: For a real-valued, finite-dimensional vector space V and
an ordered basis B of V , it holds that ⟨., .⟩ : V × V → R is an
inner product if and only if there exists a symmetric, positive
definite matrix A ∈ R n×n with ⟨x , y ⟩ = x̂ T Aŷ .
Proof: One direction →: ⟨., .⟩ is an inner product =⇒ A is
symmetric, positive-definite such that ⟨x , y ⟩ = x̂ T Aŷ .
Other direction ←: A is symmetric, positive definite such that the
operation ⟨x , y ⟩ is defined as ⟨x , y ⟩ = x̂ T Aŷ =⇒ the operation
defined is an inner product.
Symmetric, positive-definite matrices
▶ We prove the → direction.
▶ Let ⟨x , y ⟩ be the inner product between the vectors x , y in V .
We can write x in terms of sayPn basis vectors as
x = i=1 ψi bi . Similarly y = i=1 λi bi .
Pi=n i=n
▶ Since the inner product is bilinear we can write ⟨x , y ⟩ =
ψi bi , i=1 λi bi ⟩ = i=1 j=1 ψi ⟨bi , bj ⟩λj = x̂ Aŷ
Pi=n Pi=n Pi=n Pj=n T
⟨ i=1
where Aij = ⟨bi , bj ⟩.
▶ Here x̂ , ŷ are vectors which represent the coordinates of the
original vectors x , y with respect to the basis vectors.
Symmetric, positive-definite matrices
▶ This means that the inner product is entirely determined
through the matrix A. The symmetry of the inner product
means that Aij = ⟨bi , bj ⟩ = Aji = ⟨bj , bi ⟩. Thus A is
symmetric.
▶ The positive-definiteness of the inner product means that
∀x ∈ V \{0}, x T Ax > 0.
Symmetric, positive-definite matrices
▶ Now let us consider an operation op such that x op y = x̂ T Aŷ
where A is a symmetric, positive definite matrix.
▶ We shall show that ”op” is an inner product by showing that
it has all the properties of an inner product:
▶ ”op” has symmetry because x op y = x̂ T Aŷ and
y opx = ŷ T Ax̂ = ŷ T (Ax̂ ). By a property of the dot product
we can write ŷ T (Ax̂ ) = (Ax̂ )T ŷ = x̂ T AT ŷ = x̂ T Aŷ where
the last equality in the chain is possible since A is symmetric.
▶ ”op” also has bilinearity since we see that for
r ∈ R, (r x )op y = (r x̂ )T Aŷ = r x̂ T Aŷ = r x op y .
▶ (x + y )op z = (x̂ + ŷ )T Aẑ = x̂ T Aẑ + ŷ T Aẑ = x op z + y op z .
▶ Finally if x is a non-zero vector then x̂ is also a non-zero
vector, x op x = x̂ T Ax̂ > 0 since we are given that A is
positive-definite.
Symmetric, positive definite matrices
▶ Can a symmetric, positive-definite matrix have less than full
rank? We have x T Ax > 0 for all non-zero x . Thus x = 0 is
the only vector allowed in the nullspace. The nullspace is
0-dimensional so A has full rank.
▶ What can be said about the diagonal elements of a
positive-definite matrix? From (ei )T Aei > 0 where ei is the
ith canonical basis vector, we see that Aii > 0. Thus the
diagonal entries are all strictly positive.
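A small NumPy sketch of this construction, added here with an illustrative symmetric, positive-definite matrix A of my own choosing (not from the slides): the mapping (x, y) → x̂ᵀAŷ is symmetric and positive on the vectors tested, and A itself has strictly positive eigenvalues.

import numpy as np

# an illustrative symmetric, positive-definite matrix (my choice, not from the slides)
A = np.array([[2.0, 0.5],
              [0.5, 1.0]])

def inner(x, y, A=A):
    """Inner product induced by A: <x, y> = x^T A y (coordinates w.r.t. a fixed basis)."""
    return x @ A @ y

x = np.array([1.0, -2.0])
y = np.array([0.5, 3.0])

print(np.isclose(inner(x, y), inner(y, x)))      # symmetry
print(inner(x, x) > 0)                           # positivity on this x
print(np.all(np.linalg.eigvalsh(A) > 0))         # A is positive definite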
Lengths and distances
▶ Inner products and norms are closely related in the sense that
any inner product induces a norm: ∥x∥ = √⟨x, x⟩
▶ Not every norm is induced by an inner product, for example
the Manhattan norm.
▶ For an inner product vector space (V, ⟨., .⟩), the induced norm
∥.∥ satisfies the Cauchy-Schwarz inequality: |⟨x, y⟩| ≤ ∥x∥∥y∥.
Why is this true?
Cauchy-Schwarz inequality
▶ Let u and v be two vectors and let us consider the length of
the vector u − αv where α is a constant.
▶ The squared length of the vector u − αv is greater than or equal to
zero: ∥u − αv∥² = ⟨u − αv, u − αv⟩ = (u − αv)ᵀ(u − αv).
▶ We can expand the dot product:
(u − αv)ᵀ(u − αv) = uᵀu − αuᵀv − αvᵀu + α²vᵀv ≥ 0
▶ Now set α = uᵀv / vᵀv to get uᵀu − (uᵀv)²/(vᵀv) ≥ 0, which leads us to
(uᵀu)(vᵀv) ≥ (uᵀv)², which is the Cauchy-Schwarz inequality.
Metric space
▶ Consider an inner product space (V, ⟨., .⟩). Define d(x, y), the
distance between two vectors x and y, to be
d(x, y) = ∥x − y∥ = √⟨x − y, x − y⟩.
▶ If we use the dot product as the inner product, then the
distance is called the Euclidean distance.
▶ The mapping d : V × V → R is called a metric.
Properties of a metric space
A metric d has the following properties:
▶ d is positive-definite, which means d(x, y) ≥ 0 ∀x, y ∈ V and
d(x, y) = 0 =⇒ x = y.
▶ d is symmetric, which means d(x, y) = d(y, x) ∀x, y ∈ V.
▶ d obeys the triangle inequality as follows:
d(x, z) ≤ d(x, y) + d(y, z) ∀x, y, z ∈ V
Inner products and metrics seem to be very similar in terms of their
properties - however there is one important difference. When x and
y are close to each other the inner product is large but the distance
metric is small. On the other hand when x and y are far apart,
then the inner product is small but the distance metric is large.
Angles and orthogonality
▶ In addition to being able to capture the lengths of vectors and
the distance between vectors, inner products can also capture
the angle ω between two vectors and can thus capture the
geometry of a vector space.
▶ The key to using the inner product to characterize the angle
between two vectors is the Cauchy-Schwarz inequality.
▶ Assume that x and y are not the 0 vector. Then the
Cauchy-Schwarz inequality tells us that
−1 ≤ xᵀy / (∥x∥∥y∥) ≤ 1     (1)
Angles and orthogonality
▶ Since the Cauchy-Schwarz ratio lies between -1 and 1 we can
set it equal to the cosine of a unique angle ω ∈ [0, π] such that
cos(ω) = ⟨x, y⟩ / (∥x∥∥y∥)     (2)
▶ The angle ω is the angle between two vectors. What does it
capture?
▶ The notion of angle captures similarity of orientation between
two vectors. When the cosine is close to one (the dot product is
close to ∥x∥∥y∥), the vectors are more or less pointing in the
same direction and ω ≈ 0.
Angles and orthogonality
▶ Food for thought: Suppose we choose vectors x and y
uniformly at random in high dimensions. What happens to the
dot product between the vectors and hence the angle between
them?
▶ To choose a vector uniformly at random over a sphere let
every component in the vector be an independent Gaussian
random variable of mean 0 and unit variance.
▶ Write a small program to see what happens ...
Angles and orthogonality
▶ A key feature of the inner product is that we can use it to
characterize vectors that are orthogonal.
▶ Two vectors x and y are orthogonal if and only if the inner
product between them is 0. For an orthogonal pair of vectors
x , y we can write x ⊥ y .
▶ By the above definition the 0-vector is orthogonal to all
vectors.
▶ Vectors which are orthogonal with respect to one inner
product need not be orthogonal with respect to another inner
product.
Example - angles and orthogonality
▶ Consider the vectors x = [1, 1]T and y = [−1, 1]T
▶ With respect to the inner product defined as a dot product we
see that ⟨x , y ⟩ = x T y = 1 ∗ −1 + 1 ∗ 1 = 0.
▶ With respect to the inner product ⟨x, y⟩ = xᵀ [2 0; 0 1] y, the angle
between the two vectors x and y becomes
cos(ω) = ⟨x, y⟩ / (∥x∥∥y∥)
Example - angles and orthogonality
▶ Continuing with our example we have
cos(ω) = xᵀAy / √( (xᵀAx)(yᵀAy) )
       = (2x1y1 + x2y2) / √( (2x1² + x2²)(2y1² + y2²) )
       = −1/3
where A = [2 0; 0 1].
▶ Thus, with respect to the new definition of inner product the
vectors x and y are no longer orthogonal.
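The example can be checked numerically; the sketch below (added, not from the slides) computes cos(ω) under the plain dot product and under the inner product defined by A = [2 0; 0 1]:

import numpy as np

x = np.array([1.0, 1.0])
y = np.array([-1.0, 1.0])
A = np.array([[2.0, 0.0],
              [0.0, 1.0]])

def cos_angle(x, y, M):
    """cos(omega) for the inner product <u, v> = u^T M v."""
    return (x @ M @ y) / np.sqrt((x @ M @ x) * (y @ M @ y))

print(cos_angle(x, y, np.eye(2)))   # 0.0 -> orthogonal under the dot product
print(cos_angle(x, y, A))           # -1/3 -> no longer orthogonal under A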
Orthonormal matrix
▶ A square matrix A ∈ R n×n is an orthogonal matrix if and only
if its columns are orthonormal:
AT A = I = AAT
AT = A−1
▶ If the columns of a matrix are orthonormal, why are its rows
orthonormal too? This follows from the fact that the
left-inverse of a square matrix is the same as the right-inverse.
Let A be a square matrix with B and C the left and right
inverses of A: BA = I = AC =⇒ B = C . Why is this true?
Orthonormal matrix
▶ Transformations by an orthonormal matrix preserve lengths.
This can be seen as follows, using the dot product as the
definition of the inner product:
∥Ax∥² = (Ax)ᵀ(Ax) = xᵀAᵀAx = xᵀIx = xᵀx.
▶ An example of an orthonormal matrix is the 2D-rotation
matrix, which can be expressed as [cos θ  −sin θ; sin θ  cos θ], where θ is
the angle of rotation.
Orthonormal matrix
▶ Also the angle between two vectors x and y does not change
after transformation by an orthonormal matrix. This can be
seen as follows:
cos(ω) = (Ax)ᵀ(Ay) / (∥Ax∥ ∥Ay∥)
       = xᵀAᵀAy / (∥x∥ ∥y∥)
       = xᵀy / (∥x∥ ∥y∥)
Orthonormal basis
▶ We already looked at the concept of a basis of a vector space,
and found that for the vector space R n we need n basis
vectors.
▶ Our basis vectors needed only to be linearly independent - we
can ensure linear independence by ensuring that our basis
vectors point in different directions, so that a linear
combination of n − 1 basis vectors cannot cancel out the nth
basis vector.
▶ Now we will look at a special case of a basis where the vectors
are all mutually orthogonal in the sense of the inner product,
and each vector is of unit length. We call such a basis an
orthonormal basis.
Orthonormal basis
▶ Question: Can you immediately think of an orthonormal basis
for R n ? Is an orthonormal basis for a vector space unique?
▶ Formal definition of an orthonormal basis: Consider an
n-dimensional vector space V and n basis vectors
{b1 , b2 , . . . bn }. If it is true that
∀i, j = 1, . . . n, i ̸= j ⟨bi , bj ⟩ = 0 and ⟨bi , bi ⟩ = 1, then the
basis is called an orthonormal basis.
▶ If the basis vectors are only mutually orthogonal but not of
length unity, then we have an orthogonal basis.
Gram-Schmidt process
▶ Given a set of basis vectors for a vector space, can we convert
the given basis into an orthogonal basis? Yes, we shall use
Gaussian elimination to construct such a basis.
▶ Let us start with an example: Consider R² and two basis
vectors v1 = (3, 1)ᵀ and v2 = (2, 2)ᵀ. Put these vectors into the
columns of a matrix A such that A = [3 2; 1 2].
▶ The next step is to perform Gaussian elimination on the
following augmented matrix: [AᵀA | Aᵀ] = [10 8 | 3 1; 8 8 | 2 2]
▶ On performing Gaussian elimination on this augmented matrix
we end up with [1 0.8 | 0.3 0.1; 0 1 | −0.25 0.75]
Gram-Schmidt process
▶ Note that after the completion of Gaussian elimination the
two rows on the right hand side are orthogonal. They form a
basis for R 2 . We can normalize the vectors to get an
orthonormal basis.
▶ What is the justification for this technique?
▶ First we see that when the m × n matrix A has full column
rank, then the matrix AT A is positive definite. To see this
note that any solution x to Ax = 0 is also a solution to
AT Ax = 0 and vice-versa. Why is this the case?
▶ When A has full column rank, there are no non-trivial solutions to
Ax = 0. Thus the fact that there are no non-trivial solutions
to Ax = 0 means that ∀x ∈ Rⁿ, x ≠ 0: xᵀAᵀAx = ∥Ax∥² > 0.
Elementary transformations
▶ One of the steps of Gaussian elimination is the subtraction
of a multiple of a given row from a row below it. This step
can be achieved by pre-multiplication of the given matrix by
an elementary matrix. An elementary matrix is like an identity
matrix except that one of the entries below the diagonal is
allowed to be non-zero.
▶ To show how the process of elimination works using an
elementary matrix, consider the matrix A = [a11 a12 a13; a21 a22 a23; a31 a32 a33]
and assume that we want to subtract two times the first row
from the second row.
Elementary transformations
This can be accomplished by the following elementary matrix
E = [1 0 0; −2 1 0; 0 0 1], so that the product
EA = [1 0 0; −2 1 0; 0 0 1] [a11 a12 a13; a21 a22 a23; a31 a32 a33]
   = [a11 a12 a13; a21 − 2a11  a22 − 2a12  a23 − 2a13; a31 a32 a33]
Product of elementary transformations
▶ A series of Gaussian elimination steps can be represented as a
product of elementary transformations acting on A:
Em Em−1 . . . E1 A.
▶ The product of lower triangular matrices can be seen to be
lower triangular, and the inverse of a lower triangular matrix
can also be seen as a lower triangular matrix.
▶ Thus the action of Gaussian elimination operations can be
seen in the following terms L−1 A = U where the product of
the elementary transformations is represented as the inverse of
a lower triangular matrix for notational convenience, and the
right hand side U is an upper triangular matrix. Thus we have
A = LU .
Final argument
▶ Returning to our problem we are performing Gaussian
elimination on the matrix AT A where A contains the basis
vectors as its columns. Upon Gaussian elimination on the
augmented matrix we reduce [AᵀA | Aᵀ] to get [U | L⁻¹Aᵀ]
where AT A = LU .
▶ Now we shall show that Q T = L−1 AT is an orthogonal matrix
whose rows are orthogonal.
▶ Consider Q T Q = L−1 AT A(L−1 )T = U (L−1 )T = some
upper triangular matrix
▶ But Q T Q is a symmetric matrix and can only be upper
triangular if it is diagonal. Therefore Q is an orthogonal
matrix whose columns are orthogonal. They can be
normalized to obtain an orthonormal basis.
Lecture 4
Math Foundations Team
Matrix decompositions
▶ We studied vectors and how to manipulate them in preceding
lectures.
▶ Mappings and transformations of vectors can be conveniently
described in terms of operations performed by matrices.
▶ In this lecture we shall study three aspects of matrices: how
to summarize matrices, how matrices can be decomposed, and
how the decompositions can be used for matrix
approximations.
Determinant and trace
A determinant of order n × n is a scalar associated with an n × n
matrix and is denoted as follows:
det(A) = |a11 a12 … a1n; a21 a22 … a2n; … ; an1 an2 … ann|
We have a cofactor formula to calculate a determinant of order n:
for n = 1, we have det(A) = a11 . For n ≥ 2 we have
D = det(A) = aj1 Cj1 + aj2 Cj2 + . . . + ajn Cjn
= a1k C1k + a2k C2k + . . . + ank Cnk
The first line above represents expansion along the jth row, while
the second line represents expansion along the kth column.
Cofactor formula
▶ In the preceding slide, the Cjk = (−1)j+k Mjk where Mjk
represents the n − 1 order determinant of the submatrix of A
obtained by removing the jth row and kth column.
▶ Mjk is called the minor of ajk in D and Cjk is called the
cofactor of ajk in D.
▶ Our definition for the determinant in the previous slide shows
that the n × n determinant is defined in terms of
(n − 1) × (n − 1) determinants which in turn are defined in
terms of (n − 2) × (n − 2) determinants recursively.
▶ Let us examine the computation for a simple 3 × 3
determinant.
Example
Let us compute D = |a11 a12 a13; a21 a22 a23; a31 a32 a33|.
For the entries in the second row the minors are
M21 = |a12 a13; a32 a33|,  M22 = |a11 a13; a31 a33|,  M23 = |a11 a12; a31 a32|.
The cofactors are
C21 = (−1)²⁺¹ M21 = −M21,  C22 = (−1)²⁺² M22 = M22,  C23 = (−1)²⁺³ M23 = −M23.
Expanding along the second row we can write
D = a21 C21 + a22 C22 + a23 C23 .
Behaviour of determinant
Theorem: We can state the following for a nth order determinant
under elementary row operations:(a) interchanging two rows
multiplies the value of the determinant by −1, and (b) adding a
multiple of a row to another row does not change the value of the
determinant.
Proof Sketch Let us look at how to prove (a). The proof is by
induction. The statement holds for n = 2 since |a b; c d| = ad − bc
whereas |c d; a b| = bc − ad = −(ad − bc). We make the induction
hypothesis that the statement is true for all determinants of order
(n − 1). Let D represent the original determinant and E represent
the determinant with rows interchanged.
Proof sketch continued
Let us expand D and E along a row that is not interchanged.
We have
D = Σ_{k=1}^{n} (−1)ʲ⁺ᵏ ajk Mjk
E = Σ_{k=1}^{n} (−1)ʲ⁺ᵏ ajk Njk
where Njk in E is obtained by exchanging two rows of the minor
Mjk in D. Mjk and Njk are determinants of order n − 1 where one
of the determinants has a pair of rows interchanged as compared
to the other determinant. Therefore Mjk = −Njk, and D = −E.
Proof sketch continued
Let us now look at adding multiples of a row to another row.
▶ Add c times row i to row j. Then we get a new determinant
D′ whose entries in the jth row are ajk + c aik. Expanding the
determinant D′ along the jth row, we have
D′ = Σ_{k=1}^{n} (ajk + c aik)(−1)ʲ⁺ᵏ Mjk
   = Σ_{k=1}^{n} (−1)ʲ⁺ᵏ ajk Mjk + c Σ_{k=1}^{n} (−1)ʲ⁺ᵏ aik Mjk.
▶ The summation can be written as D ′ = D1 + cD2 where
D1 = D and D2 represents the determinant of a matrix similar
to the one we started out with except that rows j and i both
have coefficients aik in them. Thus two rows of D2 are equal -
this makes D2 = 0 → why is this?
Proof sketch continued
▶ In part (a) we showed that interchanging two rows will negate
the determinant. If we interchange rows i and j we will get
the same determinant since the two rows are identical. But
one of the determinants must be the negative of the other.
This is only possible when both determinants are zero. Thus
D2 = 0 and D ′ = D.
▶ Bottom line → Adding a multiple of one row to another row
does not change the determinant.
▶ This will lead us to our next result.
Another result
Theorem An n × n matrix A has rank n if and only if its
determinant is not equal to zero.
Proof sketch: A has full rank =⇒ det(A) ̸= 0: Gaussian
elimination reduces A to upper triangular matrix U = (Uij ) whose
determinant is the product of all the elements Uii . But we know
that det(A) = (−1)s det(U ) where s is the number of row
interchanges performed during Gaussian elimination. Since A has
full rank, the columns of A are linearly independent and the only
solution to Ax = 0 is x = 0. The system Ax = 0 has the same set
of solutions as Ux = 0, so this means U has only pivot columns.
The pivots are all the elements Uii . The product of all pivots is
non-zero, and hence det(U ) = det(A) ̸= 0.
Another result
det(A) ̸= 0 =⇒ A has full rank: If the determinant of A is
non-zero, det(A) = (−1)ˢ det(U) = Π_{i=1}^{n} Uii means that all the
Uii are non-zero. Therefore all the columns of U are pivot columns
with the pivots being the Uii . The pivot columns are all linearly
independent, so the only solution to Ux = 0 is x = 0. This is also
the only solution to Ax = 0 which means A has full rank.
Trace
▶ The trace of an n × n square matrix A is defined as
tr(A) = Σ_{i=1}^{n} aii, i.e. the sum of the diagonal elements of A.
▶ The trace has the following properties:
▶ tr (A + B ) = tr (A) + tr (B ), for A, B ∈ Rn×n
▶ tr (αA) = αtr (A), α ∈ R
▶ tr (In ) = n
▶ tr (AB ) = tr (BA) for A ∈ Rn×k , B ∈ Rk×n .
▶ The proofs of these properties are not difficult.
Characteristic polynomial
▶ For λ ∈ R and A ∈ Rn×n we can define pA (λ) = det(A − λI )
and show that it can be written as
c0 + c1λ + … + cn−1λⁿ⁻¹ + (−1)ⁿλⁿ where c0, c1, …, cn−1 ∈ R.
▶ We can show that c0 = det(A) and cn−1 = (−1)n−1 tr (A)
▶ To see that c0 = det(A), set λ = 0 in det(A − λI ) to get
pA (0) = det(A) = c0
▶ The formula for cn−1 takes a little bit of work - let us expand a
3 × 3 determinant:
det(A − λI) = |a11 − λ  a12  a13; a21  a22 − λ  a23; a31  a32  a33 − λ|
Characteristic polynomial
▶ Expanding the determinant along the first row we see that the
(a11 − λ)C11 term contains the product Π_{i=1}^{3} (aii − λ), which
contains the powers λ³ and λ². The other contributors to the
determinant, i.e. a12C12 and a13C13, expand into terms where
the maximum power of λ is 1.
▶ Carrying this analogy over to the general case of n > 3 we see
that, expanding along the first row, the first contributor to the
determinant will have the term Π_{i=1}^{n} (aii − λ) and subsequent
contributors will have a maximum power of λⁿ⁻² as the
minors for each such contributor will kill off a term containing
λ in a given row and column.
Characteristic polynomial
▶ Thus in the determinant expansion to obtain the characteristic
polynomial we see that the coefficient of λⁿ⁻¹ can only come
from the expansion of Π_{i=1}^{n} (aii − λ) and can be seen
to be (−1)ⁿ⁻¹ Σ_{i=1}^{n} aii = (−1)ⁿ⁻¹ tr(A).
▶ As a corollary to this argument we can see that the coefficient
to λn in the characteristic polynomial is (−1)n
▶ We will use the characteristic polynomial to compute
eigenvalues and eigenvectors.
Eigenvalues and eigenvectors
▶ Let A ∈ Rⁿˣⁿ be a square matrix. Then λ ∈ R is an eigenvalue
of A and x ∈ Rⁿ \ 0 is the corresponding eigenvector of λ if
Ax = λx. This equation is called the eigenvalue equation.
▶ The following statements are equivalent:
▶ λ is an eigenvalue of A ∈ Rn×n .
▶ There exists an x ∈ Rn \ 0 with Ax = λx , or equivalently,
(A − λIn )x = 0 can be solved non-trivially, i.e., x ̸= 0.
▶ rank(A − λIn ) < n.
▶ det(A − λIn ) = 0.
▶ If x is an eigenvector corresponding to a particular eigenvalue
λ, c x , c ∈ R \ 0 is also an eigenvector.
Eigenvalues and eigenvectors - example
▶ Consider the matrix A = [1 1; 1 1]. The characteristic
polynomial det(A − λI) = (1 − λ)² − 1 and setting it to zero
gives us the roots of the characteristic polynomial:
(1 − λ)² − 1 = 0 has roots λ = 2, 0.
▶ What are the eigenvectors? For λ = 0 we solve for Ax = 0x,
so we find the nullspace of the matrix A. Using Gaussian
elimination we convert Ax = 0 to Ux = 0 where U = [1 1; 0 0].
Thus we discover the eigenvector [1; −1] for λ = 0.
▶ Similarly we discover the eigenvector [1; 1] for λ = 2.
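The same eigenvalues and eigenvectors come out of NumPy; this is an added check, not part of the slides (NumPy normalizes the eigenvectors to unit length and may order the eigenvalues differently):

import numpy as np

A = np.array([[1.0, 1.0],
              [1.0, 1.0]])

vals, vecs = np.linalg.eig(A)
print(vals)                                 # eigenvalues 2 and 0 (order may differ)
print(vecs)                                 # columns proportional to (1, 1) and (1, -1)
print(np.allclose(A @ vecs, vecs * vals))   # Ax = lambda x for each column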
Eigenvalues and eigenvectors - example
▶ The general procedure to find eigenvalues and eigenvectors is
to first find the roots of the characteristic polynomials and
then find the nullspaces of the matrices A − λI for the
different roots λ.
▶ Does every n × n matrix have a full set of eigenvectors, i.e n
eigenvectors?
▶ Look at [0 1; 0 0]. What are its eigenvalues and eigenvectors?
▶ Point to ponder Looking at the equation Ax = λx it seems
that the action of A on x is to preserve the direction of x but
scale it up or down according to λ. Does this mean that a
rotation matrix has no eigenvalues and eigenvectors?
Some additional properties
▶ λ is an eigenvalue of A if and only if λ is a root of the
characteristic polynomial pA (λ) of A. This can be easily seen
as a consequence of the definition of pA (λ).
▶ For A ∈ Rn×n , the set of eigenvectors corresponding to an
eigenvalue λ spans a subspace of Rn called the Eigenspace of
A with respect to λ and is denoted by Eλ .
▶ The set of all eigenvalues of A is called the spectrum of A.
▶ Look at the eigenvalues and eigenspace of the n × n identity
matrix In . It has one eigenvalue λ = 1 and the eigenspace is
Rn . Every canonical vector is a basis vector for the eigenspace.
Some additional properties
▶ A matrix and its transpose have the same eigenvalues. To see
this, first note that det(A) = det(AT ). Then det(A − λI ) =
det((A − λI )T ) = det(AT − λI T ) = det(AT − λI ). The last
expression in the chain of equalities is the characteristic
polynomial for pAT (λ). Thus we have pA (λ) = pAT (λ) which
means the characteristic polynomials are equal and so the
roots of the polynomials or the eigenvalues must be equal.
▶ The eigenspace Eλ is the nullspace of A − λI .
▶ Symmetric, positive-definite matrices always have positive,
real eigenvalues.
Some theorems
▶ The eigenvectors x1 , x2 . . . xn of a n × n matrix A with n
distinct eigenvalues are linearly independent → why?
▶ Given a matrix A ∈ Rm×n we can show that AT A ∈ Rn×n is a
symmetric, positive-definite matrix when the rank of A = n.
Why is this true? Clearly AT A is a symmetric matrix and it is
positive definite since x T AT Ax = ∥Ax ∥2 > 0 ∀x ∈ Rn \ 0
since the nullspaces of AT A and A are the same, and A is a
full column rank matrix.
▶ The matrix AT A is important in machine learning since it
figures in the least-squares solution to a data matrix
represented as A where n represents the number of features
and m is the number of data vectors.
Spectral theorem
Theorem: If A ∈ Rn×n is symmetric there exists an orthonormal
basis of the corresponding vector space V consisting of the
eigenvectors of A, and each eigenvalue is real.
Proof: We will not attempt a full proof of this theorem but
provide some intuitions about why it is true. The theorem relies on
the following three statements, shown in the next slide.
Spectral theorem
▶ All roots of the characteristic polynomial pA (λ) are real.
▶ For each eigenvalue λ we can compute an orthonormal basis
for its eigenspace. We can string together the orthonormal
bases for the different eigenvalues of A to come up with the
vectors v1 , v2 ...
▶ The dimension of the eigenspace Eλ , called its geometric
multiplicity, is the same as the algebraic multiplicity of λ
which is the number of times λ appears as a root of the
characteristic polynomial.
▶ All the basis vectors from the different Eigenspaces combine
to provide an orthonormal basis for Rn .
Complex vectors
▶ In the old formulation with real vectors, length-squared
according to the Euclidean norm was x1² + x2² + … + xn². If the
xi are complex we should take length-squared to be
|x1|² + |x2|² + … + |xn|² where |.| denotes modulus. For the
complex number a + bi, the modulus is
√((a + bi)(a − bi)) = √(a² + b²)
▶ For complex vectors we would like to preserve the idea as
possible that ∥x∥2 = x T x . If we keep the old definition of
inner product for complex vectors we will not get a real
number as length as shown in the next bullet.
▶ Let x = [1 + i; 2 + i]. We have
xᵀx = (1 + i)² + (2 + i)² = 1 + 2i + i² + 4 + 4i + i² = 6i + 3.
Hermitian matrices
▶ We modify the inner product between two complex vectors x
and y to xᴴy, where xᴴ = x̄ᵀ (the conjugate transpose of x).
▶ Now xᴴx = x̄1x1 + … + x̄nxn = |x1|² + … + |xn|² = ∥x∥² according to the new
definition of length.
▶ A Hermitian matrix is a generalization of a symmetric matrix.
▶ Instead of requiring Aᵀ = A, we say a matrix is Hermitian if
it is equal to its conjugate-transpose, i.e. A is a Hermitian
matrix if Aᴴ = A, that is, Āᵀ = A.
▶ As an example consider the matrix A = [1  3 − i; 3 + i  4]. It is a
Hermitian matrix since Aᴴ = [1  3 − i; 3 + i  4] = A.
Spectral theorem
We shall now show that all eigenvalues for a symmetric matrix are
real. Let Ax = λx . Then premultiplying with x H on both sides we
have x H Ax = λx H x
Now x H Ax is a 1 × 1 matrix. Taking the Hermitian of this matrix
we have (x H Ax )H = x H AH x = x H Ax , so the Hermitian of the
matrix is itself which means that the matrix is real.
On the right hand side we note that x H x is real, so this means
that λ must be real.
Spectral theorem
Let us show that eigenvectors belonging to different eigenvalues
are orthogonal. Let Ax = λx and Ay = µy . Then we have
y H Ax = λy H x
x H Ay = µx H y
But x H Ay = (y H AH x )H = (y H Ax )H = λx H y . We already know
that x H Ay = µx H y . This means λx H y = µx H y . Since λ ̸= µ,
this must mean x H y = 0.
This shows that eigenvectors corresponding to different eigenvalues
are orthogonal.
Spectral theorem
▶ So we see that the eigenvalues of a symmetric matrix are real
and eigenvectors belonging to different eigenvalues are
orthogonal.
▶ This suggests that one can string together all the orthonormal
bases for the different eigenvalues and get an orthonormal
basis for Rn .
▶ But who is to say that when we string together the basis
vectors for all the eigenvalues, we will have enough vectors to
describe Rn ? We need n basis vectors and might end up
having fewer than n vectors.
▶ If the eigenvalues are all different, we can see that we will
indeed have enough basis vectors. But what about when there
are repeated eigenvalues?
Spectral theorem
▶ We need one more piece to complete the puzzle and show
that we will have enough eigenvectors to complete the
orthonormal basis - this part we shall not prove!
▶ As a consequence of the spectral theorem we can write a real
symmetric matrix A as A = Q ΛQ T where Q is an
orthonormal basis (think orthonormal basis vectors for
example), and Λ is a diagonal matrix consisting of non-zero
entries only along the diagonal.
▶ The spectral theorem can be used in a machine learning
context since we can take the data matrix A and create a
symmetric matrix out of it - AT A and AAT which are both
used in Singular-Value Decomposition and PCA.
Trace and eigenvalues
▶ We can show that the sum of the eigenvalues of a matrix is
equal to the trace of the matrix, i.e. Σ_{i=1}^{n} λi = Σ_{i=1}^{n} aii. To
see why this is true, note that the characteristic polynomial
pA(λ) can be written as Π_{i=1}^{n} (λi − λ). The coefficient of
λⁿ⁻¹ in this expansion is (−1)ⁿ⁻¹ Σ_{i=1}^{n} λi. Early on in this
lecture we showed from a direct expansion of the determinant
that the coefficient of λⁿ⁻¹ is (−1)ⁿ⁻¹ Σ_{i=1}^{n} aii. Thus we
have our result.
▶ The product of all eigenvalues is the determinant of the
matrix, i.e. det(A) = Π_{i=1}^{n} λi. To see why this is true, let us
once again look at the factorisation of pA(λ) as
det(A − λI) = pA(λ) = Π_{i=1}^{n} (λi − λ). Setting λ = 0 in this
equation gives the result.
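Both identities are easy to confirm numerically; the matrix in the added sketch below is illustrative (not from the slides):

import numpy as np

# any square matrix will do; this one is illustrative
A = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 4.0]])

lam = np.linalg.eigvals(A)
print(np.isclose(lam.sum(), np.trace(A)))        # sum of eigenvalues = trace
print(np.isclose(lam.prod(), np.linalg.det(A)))  # product of eigenvalues = determinant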
Cholesky decomposition
Theorem A symmetric, positive-definite matrix A can be
factorized into a product A = LLT where L is a lower-triangular
matrix with positive elements.
For an example 3 × 3 matrix we can write
[a11 a12 a13; a21 a22 a23; a31 a32 a33] = [l11 0 0; l21 l22 0; l31 l32 l33] [l11 l21 l31; 0 l22 l32; 0 0 l33]
We can solve for the elements of the lower triangular matrix to get
l11 = √a11,  l22 = √(a22 − l21²),  l33 = √(a33 − (l31² + l32²)).
For the elements below the diagonal we have l21 = a21/l11, l31 = a31/l11
and l32 = (a32 − l31 l21)/l22.
An application of Cholesky decomposition
▶ In the Data Science / Machine Learning context, distributions
on data are frequently multivariate Gaussian.
▶ Multivariate Gaussian distributions are governed by a
covariance matrix which is symmetric, positive-definite.
▶ We may need to draw samples from such distributions which
is where the Cholesky decomposition finds an important
application.
▶ To generate a sample from a multivariate Gaussian
distribution, we factor the covariance matrix into its Cholesky
factor L, generate a Gaussian random vector x on
independent variables which is easy to do, and compute Lx
which will be a sample according to the required distribution.
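A minimal sketch of this sampling recipe (added here; the covariance matrix Sigma is an illustrative choice, not from the slides):

import numpy as np

rng = np.random.default_rng(0)

# an illustrative symmetric, positive-definite covariance matrix
Sigma = np.array([[4.0, 1.2],
                  [1.2, 1.0]])
mu = np.zeros(2)

L = np.linalg.cholesky(Sigma)          # Sigma = L L^T
z = rng.standard_normal((2, 10000))    # independent standard normal samples
samples = mu[:, None] + L @ z          # samples with covariance approximately Sigma

print(np.allclose(L @ L.T, Sigma))
print(np.cov(samples))                 # close to Sigma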
Lecture 5
Math Foundations Team
Introduction
▶ In the previous lecture, we discussed eigenvalues and
eigenvectors of matrices
▶ In this lecture, we will look at two related methods for
factorizing matrices into canonical forms.
▶ The first one is known as eigenvalue decomposition. It uses
the concepts of eigenvalues and eigenvectors to generate the
decomposition
▶ The second method known as singular value decomposition or
SVD is applicable to all matrices
Diagonal Matrices
▶ A diagonal matrix is a matrix that has value zero on all off
diagonal elements.
D = diag(d1, …, dn), i.e. the matrix with d1, …, dn on the diagonal and zeros elsewhere.
▶ For a diagonal matrix D, the determinant is the product of its
diagonal entries.
▶ A matrix power Dk is given by each diagonal element raised
to the power k.
▶ The inverse of a diagonal matrix is obtained by taking the
reciprocal of each (non-zero) diagonal entry.
Diagonalizable Matrices
▶ A matrix A ∈ Rn×n is diagonalizable if there exists an
invertible matrix P ∈ Rn×n and a diagonal matrix D such that
D = P−1 AP
▶ In the definition of diagonalization, it is required that P is an
invertible matrix. Assume p1 , p2 , ....., pn are the n columns of
P
▶ Rewriting we get AP = PD. By observing that D is a
diagonal matrix, we can simplify as
Api = λi pi
where λi is the i th diagonal entry in D.
Diagonalizable Matrix
▶ Consider a square matrix A = [1 4; 2 3]
▶ Consider the invertible matrix P = [−2 1; 1 1]
▶ Now consider the product P⁻¹AP as follows
[−2 1; 1 1]⁻¹ · [1 4; 2 3] · [−2 1; 1 1] = [−1 0; 0 5]
Eigendecomposition of a matrix
▶ Recall the existence of eigenvalues and eigenvectors for square
matrices
▶ Eigenvalues can be used to create a matrix decomposition
known as Eigenvalue Decomposition
▶ A square matrix A ∈ Rn×n can be factored into
A = PDP−1
▶ where P is an invertible matrix of eigenvectors of A assuming
we can find n eigenvectors that form a basis of Rn
▶ and D is a diagonal matrix whose diagonal entries are the
eigenvalues of A
Example of Eigendecomposition
Let us compute the eigendecomposition of the matrix A
    A = [ 2.5  −1  ]
        [ −1   2.5 ]
▶ Step 1: Find the eigenvalues and eigenvectors
    A − λI = [ 2.5 − λ    −1    ]
             [   −1     2.5 − λ ]
▶ The characteristic equation is given by det(A − λI) = 0
▶ This leads to the equation λ² − 5λ + 21/4 = 0
▶ Solving the quadratic equation gives us λ1 = 3.5 and λ2 = 1.5
Example of Eigendecomposition
▶ The eigenvector corresponding to λ1 = 3.5 is derived as
    p1 = [ −1/√2 ]
         [  1/√2 ]
▶ The eigenvector corresponding to λ2 = 1.5 is derived as
    p2 = [ 1/√2 ]
         [ 1/√2 ]
▶ Step 2: Construct the matrix P to diagonalize A; its columns are
eigenvectors of A (taken here up to sign)
    P = [  1/√2   1/√2 ]
        [ −1/√2   1/√2 ]
Example of Eigendecomposition
▶ The inverse of matrix P is given by
    P⁻¹ = [ 1/√2  −1/√2 ]
          [ 1/√2   1/√2 ]
▶ The eigendecomposition of the matrix A is given by
    A = [  1/√2   1/√2 ] [ 3.5   0  ] [ 1/√2  −1/√2 ]
        [ −1/√2   1/√2 ] [  0   1.5 ] [ 1/√2   1/√2 ]
▶ In summary we have obtained the required matrix
factorization using eigenvalues and eigenvectors.
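The same factorization can be reproduced with NumPy (an illustrative check; np.linalg.eigh is used because A is symmetric, and it returns the eigenvalues in ascending order, i.e. 1.5 before 3.5).

```python
# Eigendecomposition A = P D P^T of the symmetric matrix from the example.
import numpy as np

A = np.array([[2.5, -1.0],
              [-1.0, 2.5]])

eigvals, P = np.linalg.eigh(A)       # eigvals = [1.5, 3.5]
D = np.diag(eigvals)
print(np.allclose(P @ D @ P.T, A))   # True, since P is orthogonal
```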
Symmetric Matrices and Diagonalizability
▶ Recall that a matrix A is called symmetric if A = Aᵀ, for example
    A = [ −2  1 ]
        [  1  1 ]
▶ A symmetric matrix A ∈ Rn×n can always be diagonalized.
▶ This follows directly from the spectral theorem discussed in the
previous lecture.
▶ Moreover, the spectral theorem states that we can find an
orthogonal matrix P of eigenvectors of A.
Motivation for Singular Value Decomposition
▶ The singular value decomposition (SVD) of a matrix is a
central matrix decomposition method in linear algebra.
▶ The eigenvalue decomposition is applicable to square matrices
only.
▶ The singular value decomposition exists for all rectangular
matrices.
▶ SVD involves writing a matrix as a product of three matrices
U, Σ and Vᵀ.
▶ The three component matrices are derived by applying the
eigenvalue decomposition discussed previously.
Singular Value Decomposition Theorem
▶ Let A ∈ Rm×n be a rectangular matrix. Assume that A has rank r .
▶ The Singular value decomposition of A is defined as
A = UΣVT
▶ U ∈ Rm×m is an orthogonal matrix with column vectors ui where
i = 1, ..., m
▶ V ∈ Rn×n is an orthogonal matrix with column vectors vj where
j = 1, ..., n
▶ Σ is an m × n matrix with Σii = σi ≥ 0 (and σi > 0 for i = 1, ..., r )
▶ The diagonal entries σi , i = 1, ..., r of Σ are called the singular
values.
▶ By convention, the singular values are ordered, i.e. σ1 ≥ σ2 ≥ . . . ≥ σr .
Properties of Σ
▶ The singular value matrix Σ is unique.
▶ Observe that the Σ ∈ Rm×n matrix is rectangular. In
particular, Σ is of the same size as A.
▶ This means that Σ has a diagonal submatrix that contains the
singular values and needs additional zero padding.
▶ Specifically, if m > n, then the matrix Σ has diagonal
structure up to row n and then consists of zero rows.
▶ If m < n, the matrix Σ has a diagonal structure up to column
m and columns that consist of 0 from m + 1 to n.
Construction of V
▶ It can be observed that
AT A = VΣT ΣVT
▶ Since AT A has the following eigendecomposition
AT A = PDPT
▶ Therefore, the eigenvectors of AT A that compose P are the
right-singular vectors V of A.
▶ The eigenvalues of AT A are the squared singular values of Σ
Construction of U
▶ It can be observed that
AAT = UΣVT VΣT UT
▶ Since AAT has the following eigendecomposition
AAT = SDST
▶ Therefore, the eigenvectors of AAT that compose S are the
left-singular vectors U of A
Construction of U continued
▶ A = UΣVT can be rearranged to obtain a simple formulation
for ui
▶ By postmultiplying by V we get AV = UΣVT V
▶ By observing that V is orthogonal we obtain a simple form
AV = UΣ
▶ This is equivalent to the following
    ui = (1/σi) Avi ,   ∀ i = 1, 2, . . . , r
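The construction of V, Σ and U described above can be sketched directly in NumPy (a hedged sketch under the stated assumptions; the matrix A is the example used on the following slides).

```python
# Build an SVD of A from the eigendecomposition of A^T A, then U from u_i = A v_i / sigma_i.
import numpy as np

A = np.array([[1.0, 0.0, 1.0],
              [-2.0, 1.0, 0.0]])

eigvals, V = np.linalg.eigh(A.T @ A)          # eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]             # reorder: largest first
eigvals, V = eigvals[order], V[:, order]

sigma = np.sqrt(np.clip(eigvals, 0.0, None))  # singular values
r = int(np.sum(sigma > 1e-12))                # rank of A

U = np.column_stack([A @ V[:, i] / sigma[i] for i in range(r)])
Sigma_r = np.diag(sigma[:r])

print(np.allclose(U @ Sigma_r @ V[:, :r].T, A))   # reconstructs A
```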
Computing Singular Value Decomposition 1
▶ We want to find the SVD of the following rectangular matrix A
    A = [  1  0  1 ]
        [ −2  1  0 ]
▶ Let us consider the matrix AᵀA derived from A, given by
    AᵀA = [  5  −2  1 ]
          [ −2   1  0 ]
          [  1   0  1 ]
▶ It is a symmetric matrix
Computing Singular Value Decomposition 2
▶ Derive the eigendecomposition of AT A in the form PDP T
▶ The matrix P is given by
    P = [  5/√30    0     −1/√6 ]
        [ −2/√30   1/√5   −2/√6 ]
        [  1/√30   2/√5    1/√6 ]
▶ The matrix D is given by
    D = [ 6  0  0 ]
        [ 0  1  0 ]
        [ 0  0  0 ]
Computing Singular Value Decomposition 3
Now we construct the singular value matrix Σ
▶ The matrix Σ has the same dimensions as A. In this case Σ is
hence a 2 × 3 matrix.
▶ The diagonal entries of this submatrix are obtained by taking
the square roots of the eigenvalues 6 and 1, respectively.
▶ The singular-value matrix Σ is given by:
    Σ = [ √6  0  0 ]
        [  0  1  0 ]
▶ The last column is a column of zeros only.
Computing Singular Value Decomposition 4
The left singular vectors are obtained as the normalized images of the
right singular vectors. Recall that ui = (1/σi) Avi .
▶ The first vector
    u1 = (1/σ1) A v1 = [ 1/√5 , −2/√5 ]ᵀ
▶ The second vector
    u2 = (1/σ2) A v2 = [ 2/√5 , 1/√5 ]ᵀ
Final Step : Combining U, Σ and V
We compile all three matrices together to generate the SVD
▶   A = U Σ Vᵀ

        [  1/√5   2/√5 ] [ √6  0  0 ] [  5/√30    0     −1/√6 ]ᵀ
      = [ −2/√5   1/√5 ] [  0  1  0 ] [ −2/√30   1/√5   −2/√6 ]
                                      [  1/√30   2/√5    1/√6 ]

▶ The matrix U is a 2 × 2 matrix satisfying the orthogonality
property.
▶ The matrix V is a 3 × 3 matrix satisfying the orthogonality
property.
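For comparison, NumPy's built-in routine produces the same singular values for this example (an illustrative check; the column signs of U and V may differ).

```python
# Cross-check the worked example with np.linalg.svd.
import numpy as np

A = np.array([[1.0, 0.0, 1.0],
              [-2.0, 1.0, 0.0]])

U, s, Vt = np.linalg.svd(A)
print(np.round(s**2, 6))        # squared singular values: [6. 1.]

Sigma = np.zeros_like(A)
Sigma[:2, :2] = np.diag(s)
print(np.allclose(U @ Sigma @ Vt, A))   # True
```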
Comparing SVD and EVD
▶ The left-singular vectors of A are eigenvectors of AAT
▶ The right-singular vectors of A are eigenvectors of AT A
▶ The non-zero singular values of A are the square roots of the
nonzero eigenvalues of AT A.
▶ The SVD always exists for any matrix in Rm×n
▶ The eigendecomposition is only defined for square matrices in
Rn×n and only exists if we can find a basis of eigenvectors of
Rn
Comparing SVD and EVD
▶ The vectors in the eigendecomposition matrix P are not
necessarily orthogonal.
▶ On the other hand, the vectors in the matrices U and V in the
SVD are orthonormal.
▶ Both the eigendecomposition and the SVD are compositions
of three linear mappings.
▶ A key difference between the eigendecomposition and the
SVD is that in the SVD, domain and codomain can be of
different dimensions.
▶ In the SVD, the left and right singular vector matrices U and
V are generally not inverses of each other.
Comparing SVD and EVD 3
▶ In the eigendecomposition, the matrices P and P⁻¹ in the
decomposition are inverses of each other.
▶ In the SVD, the entries in the diagonal matrix Σ are all real
and nonnegative.
▶ In the eigendecomposition, the diagonal entries need not
always be real.
▶ The left-singular vectors of A are eigenvectors of AAᵀ.
▶ The right-singular vectors of A are eigenvectors of AᵀA.
Matrix Approximation
▶ We considered the SVD as a way to factorize A = UΣVT into
the product of three matrices, where U and V are orthogonal
and Σ contains the singular values on its main diagonal.
▶ Instead of doing the full SVD factorization, we will now
investigate how the SVD allows us to represent a matrix A as
a sum of simpler matrices Ai
▶ This representation lends itself to a matrix
approximation scheme that is cheaper to compute than the
full SVD.
Matrix Approximation continued
▶ A matrix A ∈ Rm×n of rank r can be written as a sum of
rank-1 matrices so that A = Σ_{i=1}^{r} σi ui viᵀ
▶ The diagonal structure of the singular value matrix Σ
multiplies only matching left and right singular vectors ui viT
and scales them by the corresponding singular value σi .
▶ All cross terms ui vjᵀ with i ≠ j vanish because Σ is a diagonal
matrix (Σij = 0 for i ≠ j).
▶ Any term for i > r would vanish because the corresponding
singular value is 0.
Rank k Approximation
▶ We summed up the r individual rank-1 matrices to obtain a
rank r matrix A.
▶ If the sum does not run over all rank-1 matrices Ai = σi ui viᵀ,
i = 1, ..., r , but only up to an intermediate value k, we obtain a
rank-k approximation.
▶ The approximation represented by Â(k) is defined as follows
    Â(k) = Σ_{i=1}^{k} σi ui viᵀ
▶ To measure the difference between A and its rank-k
approximation we need the notion of a norm which is
introduced next
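A minimal sketch of the rank-k approximation built from NumPy's SVD (the 4 × 3 matrix is a made-up example); the error printed at the end anticipates the spectral norm introduced next.

```python
# Rank-k approximation A_hat(k) = sum_{i<=k} sigma_i u_i v_i^T.
import numpy as np

def rank_k_approx(A, k):
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

A = np.array([[3.0, 1.0, 1.0],
              [1.0, 3.0, 1.0],
              [1.0, 1.0, 3.0],
              [0.0, 2.0, 2.0]])

A1 = rank_k_approx(A, 1)
print(np.linalg.norm(A - A1, 2))               # spectral-norm error
print(np.linalg.svd(A, compute_uv=False)[1])   # equals the second singular value
```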
Spectral Norm of a matrix
▶ We introduce the notation of a subscript in the matrix norm
▶ Spectral norm of a matrix: for x ∈ Rn , x ≠ 0, the spectral
norm of a matrix A ∈ Rm×n is defined as
    ∥A∥2 = max_x ∥Ax∥2 / ∥x∥2
where ∥y∥2 is the Euclidean norm of y.
▶ Theorem : The spectral norm of a matrix A is its largest
singular value
Example : Spectral Norm of a matrix
▶ Example : Consider the following matrix A
    A = [ 1  2 ]
        [ 3  4 ]
▶ Singular value decomposition of this matrix will provide the
matrix Σ as follows
    Σ = [ 5.465    0   ]
        [   0    0.366 ]
▶ The 2 singular values are 5.4650 and 0.366.
▶ By definition the spectral norm is the largest singular value.
▶ Hence, the spectral norm is 5.4650
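A one-line check with NumPy (illustrative): the spectral norm returned by np.linalg.norm(A, 2) coincides with the largest singular value.

```python
# Spectral norm of the example matrix equals its largest singular value.
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])

print(np.linalg.norm(A, 2))                    # ~5.465
print(np.linalg.svd(A, compute_uv=False)[0])   # same value
```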
Lecture 6
Math Foundations Team
Introduction
Many algorithms in machine learning optimize an objective
function with respect to a set of desired model parameters that
control how well a model explains the data: Finding good
parameters can be phrased as an optimization problem.
Examples include: linear regression, where we look at curve-fitting
problems and optimize linear weight parameters to maximize the
likelihood; neural-network auto-encoders for dimensionality
reduction and data compression.
Differentiation of Univariate Functions
For h > 0, the derivative of f at x is defined as the limit
    df/dx = lim_{h→0} ( f(x + h) − f(x) ) / h          (1)
The derivative of f points in the direction of steepest ascent of f .
Derivative of a Polynomial
To compute the derivative of f(x) = xⁿ, n ∈ N, using the definition,
where C(n, i) = n!/(i!(n − i)!) denotes the binomial coefficient:

    df/dx = lim_{h→0} ( f(x + h) − f(x) ) / h
          = lim_{h→0} ( (x + h)ⁿ − xⁿ ) / h
          = lim_{h→0} ( Σ_{i=0}^{n} C(n, i) x^{n−i} h^{i} − xⁿ ) / h        (2)
          = lim_{h→0} ( Σ_{i=1}^{n} C(n, i) x^{n−i} h^{i} ) / h
Derivative of a Polynomial
    df/dx = lim_{h→0} Σ_{i=1}^{n} C(n, i) x^{n−i} h^{i−1}
          = lim_{h→0} C(n, 1) x^{n−1} + lim_{h→0} Σ_{i=2}^{n} C(n, i) x^{n−i} h^{i−1}     (3)
          = n x^{n−1}
Taylor polynomial
The Taylor polynomial is a representation of a function f as a
finite sum of terms. These terms are determined using derivatives
of f evaluated at x0 .
Definition: The Taylor polynomial of degree n of f : R → R at x0
is defined as
    Tn(x) = Σ_{k=0}^{n} ( f^{(k)}(x0) / k! ) (x − x0)^k          (4)

where f^{(k)}(x0) is the k-th derivative of f at x0 , which we assume
exists.
Taylor series
Definition: The Taylor series of a smooth (infinitely many times
continuously differentiable) function f : R → R at x0 is defined as

    T∞(x) = Σ_{k=0}^{∞} ( f^{(k)}(x0) / k! ) (x − x0)^k          (5)

For x0 = 0, we obtain the Maclaurin series as a special instance
of the Taylor series.
Remark: In general, a Taylor polynomial of degree n is an
approximation of a function, which does not need to be a
polynomial. The Taylor polynomial is similar to f in a
neighborhood around x0 . However, a Taylor polynomial of degree n
is an exact representation of a polynomial f of degree k ≤ n since
all derivatives f (i) = 0, for i > k.
Taylor Polynomial example
Consider the polynomial f(x) = x⁴. Find the Taylor polynomial T6
evaluated at x0 = 1.
We compute f^{(k)}(1) for k = 0, 1, 2, . . . , 6:
f(1) = 1, f′(1) = 4, f′′(1) = 12, f^{(3)}(1) = 24, f^{(4)}(1) = 24,
f^{(5)}(1) = 0, f^{(6)}(1) = 0. The desired Taylor polynomial is
    T6(x) = Σ_{k=0}^{6} ( f^{(k)}(x0) / k! ) (x − x0)^k
          = 1 + 4(x − 1) + (12/2!)(x − 1)² + (24/3!)(x − 1)³ + (24/4!)(x − 1)⁴
          = 1 + 4(x − 1) + 6(x − 1)² + 4(x − 1)³ + (x − 1)⁴
          = x⁴ = f(x)                                             (6)

We obtain an exact representation of the original function.
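The exactness can also be verified symbolically; the following SymPy snippet (an illustrative check, not part of the slides) reassembles T6 from the derivatives at x0 = 1.

```python
# Degree-6 Taylor polynomial of f(x) = x**4 around x0 = 1.
import sympy as sp

x = sp.symbols('x')
f = x**4
x0 = 1

T6 = sum(f.diff(x, k).subs(x, x0) / sp.factorial(k) * (x - x0)**k for k in range(7))
print(sp.expand(T6))   # x**4, i.e. the Taylor polynomial equals f exactly
```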
Taylor Series example
Consider the smooth function f (x) = sin(x) + cos(x). We compute
Taylor series expansion of f at x0 = 0, which is the Maclaurin
series expansion of f . We obtain the following derivatives:
    f(0) = sin(0) + cos(0) = 1
    f′(0) = cos(0) − sin(0) = 1
    f′′(0) = −sin(0) − cos(0) = −1
    f^{(3)}(0) = −cos(0) + sin(0) = −1
    f^{(4)}(0) = sin(0) + cos(0) = f(0) = 1
The coefficients in our Taylor series are only ±1 (since sin(0) = 0),
each of which occurs twice before switching to the other one.
Furthermore, f^{(k+4)}(0) = f^{(k)}(0).
Taylor Series example
Therefore, the full Taylor series expansion of f at x0 = 0 is given by
    T∞(x) = Σ_{k=0}^{∞} ( f^{(k)}(x0) / k! ) (x − x0)^k
          = 1 + x − (1/2!) x² − (1/3!) x³ + (1/4!) x⁴ + (1/5!) x⁵ − . . .
          = 1 − (1/2!) x² + (1/4!) x⁴ ∓ . . . + x − (1/3!) x³ + (1/5!) x⁵ ∓ . . .      (7)
          = Σ_{k=0}^{∞} (−1)^k x^{2k} / (2k)!  +  Σ_{k=0}^{∞} (−1)^k x^{2k+1} / (2k+1)!
          = cos(x) + sin(x)
Differentiation Rules
We denote the derivative of f by f′.
▶ Product Rule: (f(x)g(x))′ = f′(x)g(x) + f(x)g′(x)
▶ Sum Rule: (f(x) + g(x))′ = f′(x) + g′(x)
▶ Quotient Rule: ( f(x)/g(x) )′ = ( f′(x)g(x) − f(x)g′(x) ) / (g(x))²
▶ Chain Rule: (g(f(x)))′ = (g ∘ f)′(x) = g′(f(x)) f′(x)
Example: Chain Rule
Compute the derivative of the function h(x) = (2x + 1)⁴.
    h(x) = (2x + 1)⁴ = g(f(x)),
    f(x) = 2x + 1,
    g(f) = f⁴.
The derivatives of f and g are
    f′(x) = 2,
    g′(f) = 4f³,
so
    h′(x) = g′(f) f′(x) = (4f³) · 2 = 8(2x + 1)³
Partial Differentiation and Gradients
Differentiation applies to functions f of a scalar variable x ∈ R. In
the following, we consider the general case where the function f
depends on one or more variables x ∈ R n , e.g.,f (x) = f (x1 , x2 ).
The generalization of the derivative to functions of several
variables is the gradient. We find the gradient of the function f
with respect to x by varying one variable at a time and keeping the
others constant. The gradient is then the collection of these partial
derivatives.
Partial derivatives and Gradients
Definition: For a function f : Rn → R, x → f (x), x ∈ R n of n
variables x1 , . . . , xn we define the partial derivatives as
    ∂f/∂x1 = lim_{h→0} ( f(x1 + h, x2 , . . . , xn) − f(x1 , x2 , . . . , xn) ) / h
    ∂f/∂x2 = lim_{h→0} ( f(x1 , x2 + h, . . . , xn) − f(x1 , x2 , . . . , xn) ) / h
    ...
    ∂f/∂xn = lim_{h→0} ( f(x1 , x2 , . . . , xn + h) − f(x1 , x2 , . . . , xn) ) / h
We collect them in the row vector called the gradient of f or
Jacobian:

    ∇x f = grad f = df/dx = [ ∂f(x)/∂x1 , ∂f(x)/∂x2 , . . . , ∂f(x)/∂xn ]      (8)
Example 1: Find the partial derivatives of f(x, y) = (x + 2y³)²

    ∂f(x, y)/∂x = 2(x + 2y³) · ∂(x + 2y³)/∂x = 2(x + 2y³)            (9)

    ∂f(x, y)/∂y = 2(x + 2y³) · ∂(x + 2y³)/∂y = 12y²(x + 2y³)        (10)

Here we used the chain rule to compute the partial derivatives.
Example 2
Find the partial derivatives of f(x1 , x2) = x1² x2 + x1 x2³

    ∂f(x1 , x2)/∂x1 = 2x1 x2 + x2³                                  (11)

    ∂f(x1 , x2)/∂x2 = x1² + 3x1 x2²                                 (12)

So the gradient is then

    df/dx = [ ∂f(x1 , x2)/∂x1 , ∂f(x1 , x2)/∂x2 ]
          = [ 2x1 x2 + x2³ , x1² + 3x1 x2² ] ∈ R^{1×2}              (13)
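The same gradient can be obtained symbolically; the short SymPy sketch below (illustrative, not from the slides) reproduces (11)–(13).

```python
# Gradient of f(x1, x2) = x1**2 * x2 + x1 * x2**3 as a row vector.
import sympy as sp

x1, x2 = sp.symbols('x1 x2')
f = x1**2 * x2 + x1 * x2**3

grad = sp.Matrix([[sp.diff(f, x1), sp.diff(f, x2)]])
print(grad)   # Matrix([[2*x1*x2 + x2**3, x1**2 + 3*x1*x2**2]])
```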
Basic rules of partial differentiation
When we compute derivatives with respect to vectors x ∈ Rn we
need to pay attention: Our gradients now involve vectors and
matrices, and matrix multiplication is not commutative i.e., the
order matters.
    Product rule:  ∂/∂x ( f(x) g(x) ) = (∂f/∂x) g(x) + f(x) (∂g/∂x)        (14)

    Sum rule:      ∂/∂x ( f(x) + g(x) ) = ∂f/∂x + ∂g/∂x                    (15)

    Chain rule:    ∂/∂x (g ∘ f)(x) = ∂/∂x ( g(f(x)) ) = (∂g/∂f)(∂f/∂x)     (16)
Chain Rule
Consider a function f : R² → R of two variables x1 , x2 .
Furthermore, x1(t) and x2(t) are themselves functions of t.
To compute the gradient of f with respect to t, we need to apply
the chain rule for multivariate functions as

    df/dt = [ ∂f/∂x1  ∂f/∂x2 ] [ ∂x1(t)/∂t ]  =  (∂f/∂x1)(∂x1/∂t) + (∂f/∂x2)(∂x2/∂t)     (17)
                               [ ∂x2(t)/∂t ]
where d denotes the gradient and ∂ partial derivatives.
Example
Consider f(x1 , x2) = x1² + 2x2 , where x1 = sin t and x2 = cos t, then

    df/dt = (∂f/∂x1)(∂x1/∂t) + (∂f/∂x2)(∂x2/∂t)
          = 2 sin t · (∂ sin t/∂t) + 2 · (∂ cos t/∂t)
          = 2 sin t cos t − 2 sin t = 2 sin t (cos t − 1)
is the corresponding derivative of f with respect to t.
If f (x1 , x2 ) is a function of x1 and x2 , where x1 (s, t) and x2 (s, t)
are themselves functions of two variables s and t, the chain rule
yields the partial derivatives:
∂f ∂f ∂x1 ∂f ∂x2
= + (18)
∂s ∂x1 ∂s ∂x2 ∂s
∂f ∂f ∂x1 ∂f ∂x2
= + (19)
∂t ∂x1 ∂t ∂x2 ∂t
and the gradient is obtained by the matrix multiplication
    df/d(s, t) = (∂f/∂x) (∂x/∂(s, t))
               = [ ∂f/∂x1  ∂f/∂x2 ] [ ∂x1/∂s  ∂x1/∂t ]
                                    [ ∂x2/∂s  ∂x2/∂t ]
Gradients of Vector-Valued Functions
We have discussed partial derivatives and gradients of functions
f : Rn → R mapping to the real numbers. Now we will generalize
the concept of the gradient to vector-valued functions
f : Rn → Rm , where n ≥ 1 and m > 1.
For a function f : Rn → Rm and a vector x = [x1 , . . . , xn]ᵀ the
corresponding vector of function values is given as

    f(x) = [ f1(x) , . . . , fm(x) ]ᵀ ∈ Rm                          (20)

where each fi : Rn → R.
Gradients of Vector-Valued Functions
Therefore, the partial derivative of a vector-valued function
f : Rn → Rm w.r.t. xi ∈ R, i = 1, . . . , n, is given as the vector

    ∂f/∂xi = [ ∂f1/∂xi , . . . , ∂fm/∂xi ]ᵀ
           = [ lim_{h→0} ( f1(x1 , ..., xi−1 , xi + h, xi+1 , ..., xn) − f1(x) ) / h ,
               . . . ,
               lim_{h→0} ( fm(x1 , ..., xi−1 , xi + h, xi+1 , ..., xn) − fm(x) ) / h ]ᵀ ∈ Rm
Gradients of Vector-Valued Functions
We know that the gradient of f with respect to a vector is the row
vector of the partial derivatives. Every partial derivative ∂f/∂xi is
itself a column vector. Therefore, we obtain the gradient of
f : Rn → Rm with respect to x ∈ Rn by collecting these partial
derivatives:

    df(x)/dx = [ ∂f(x)/∂x1  . . .  ∂f(x)/∂xn ]

               [ ∂f1(x)/∂x1  . . .  ∂f1(x)/∂xn ]
             = [    ...                ...     ]  ∈ Rm×n
               [ ∂fm(x)/∂x1  . . .  ∂fm(x)/∂xn ]
Example 1: Gradients of Vector-Valued Functions
Given f (x) = Ax, f (x) ∈ RM , A ∈ RM×N , x ∈ RN
Since f : RN → RM , it follows that df /dx ∈ RM×N . To compute
the gradient we determine the partial derivatives of f w.r.t xj :
    fi(x) = Σ_{j=1}^{N} Aij xj   =⇒   ∂fi/∂xj = Aij                 (21)
We obtain the gradient using the Jacobian:

            [ ∂f1/∂x1  . . .  ∂f1/∂xN ]   [ A11  . . .  A1N ]
    df/dx = [   ...              ...  ] = [  ...        ... ] = A ∈ RM×N     (22)
            [ ∂fM/∂x1  . . .  ∂fM/∂xN ]   [ AM1  . . .  AMN ]
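As a numerical sanity check (an illustrative sketch; A and x0 are made-up values), a central-difference Jacobian of f(x) = Ax recovers A.

```python
# Finite-difference Jacobian of f(x) = A x equals A.
import numpy as np

A = np.array([[1.0, 2.0, 0.0],
              [0.0, -1.0, 3.0]])
f = lambda x: A @ x

x0 = np.array([0.5, -1.0, 2.0])
eps = 1e-6
J = np.column_stack([(f(x0 + eps * e) - f(x0 - eps * e)) / (2 * eps)
                     for e in np.eye(3)])
print(np.allclose(J, A))   # True
```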
Example 2: Gradients of Vector-Valued Functions
Consider the function h : R → R, h(t) = (f ∘ g)(t) with

    f(x) = exp(x1 x2²),
    x = [ x1 , x2 ]ᵀ = g(t) = [ t cos t , t sin t ]ᵀ                  (23)

and compute the gradient of h w.r.t. t. Since f : R² → R and
g : R → R² we note that

    ∂f/∂x ∈ R^{1×2}  and  ∂g/∂t ∈ R^{2×1}                            (24)
The desired gradient is computed by applying the chain rule:

    dh/dt = (∂f/∂x)(∂x/∂t) = [ ∂f/∂x1  ∂f/∂x2 ] [ ∂x1/∂t ]
                                                [ ∂x2/∂t ]

          = [ exp(x1 x2²) x2²   2 exp(x1 x2²) x1 x2 ] [ cos t − t sin t ]
                                                      [ sin t + t cos t ]

          = exp(x1 x2²) ( x2² (cos t − t sin t) + 2 x1 x2 (sin t + t cos t) )

where x1 = t cos t and x2 = t sin t.
Lecture 7
Math Foundations Team
Introduction
▶ In the last lecture, we discussed differentiation of univariate
functions, partial differentiation, gradients, and gradients of
vector-valued functions.
▶ Now we will look into gradients of matrices and some useful
identities for computing gradients.
▶ Finally, we will discuss backpropagation and automatic
differentiation.
Gradients of Matrices
For the gradient of an m × n matrix A with respect to a p × q matrix
B, the resulting Jacobian is an (m × n) × (p × q) object, i.e., a
four-dimensional tensor J, whose entries are given as

    Jijkl = ∂Aij / ∂Bkl

Since we can consider Rm×n as Rmn , we can reshape our matrices
into vectors of length mn and pq respectively. The gradient between
these vectors then results in a Jacobian of size mn × pq.
Gradients of Matrices
Gradients of Matrices
Let f = Ax where A ∈ Rm×n and x ∈ Rn , then

    ∂f/∂A ∈ R^{m×(m×n)}

By definition,

              [ ∂f1/∂A ]
    ∂f/∂A =   [   ...  ] ,    ∂fi/∂A ∈ R^{1×(m×n)}
              [ ∂fm/∂A ]
Gradients of Matrices
Now, we have

    fi = Σ_{j=1}^{n} Aij xj ,   i = 1, · · · , m.

Therefore, by taking partial derivatives with respect to Aiq

    ∂fi/∂Aiq = xq .

Hence, the i-th row becomes

    ∂fi/∂Ai,: = xᵀ ∈ R^{1×1×n}
    ∂fi/∂Ak,: = 0ᵀ ∈ R^{1×1×n} ,  for k ≠ i
Hence, by stacking these partial derivatives over all rows of A, we get

              [ 0ᵀ  ]
              [ ... ]
    ∂fi/∂A =  [ xᵀ  ]  ∈ R^{1×m×n}
              [ ... ]
              [ 0ᵀ  ]

where xᵀ appears in the i-th position.
Gradients of Matrices with respect to Matrices
Let B ∈ Rm×n and f : Rm×n → Rn×n with
f (B) = B T B =: K ∈ Rn×n
Then, we have

    ∂K/∂B ∈ R^{(n×n)×(m×n)} .

Moreover,

    ∂Kpq/∂B ∈ R^{1×(m×n)} ,  for p, q = 1, · · · , n

where Kpq is the (p, q)-th entry of K = f(B).
Gradients of Matrices with respect to Matrices
Let the i-th column of B be bi , then

    Kpq = bpᵀ bq = Σ_{l=1}^{m} Blp Blq

Computing the partial derivative, we get

    ∂Kpq/∂Bij = Σ_{l=1}^{m} ∂/∂Bij ( Blp Blq ) = ∂pqij
Gradients of Matrices with respect to Matrices
Clearly, we have

    ∂pqij = Biq   if j = p, p ≠ q
    ∂pqij = Bip   if j = q, p ≠ q
    ∂pqij = 2Biq  if j = p, p = q
    ∂pqij = 0     otherwise

where p, q, j = 1, · · · , n and i = 1, · · · , m.
Useful Identities for Computing Gradients
▶ ∂/∂X ( f(X)ᵀ ) = ( ∂f(X)/∂X )ᵀ
▶ ∂/∂X ( tr(f(X)) ) = tr( ∂f(X)/∂X )
▶ ∂/∂X ( det(f(X)) ) = det(f(X)) tr( f(X)⁻¹ ∂f(X)/∂X )
▶ ∂/∂X ( f(X)⁻¹ ) = −f(X)⁻¹ ( ∂f(X)/∂X ) f(X)⁻¹
Useful Identities for Computing Gradients
▶ ∂( aᵀ X⁻¹ b )/∂X = −( X⁻¹ )ᵀ a bᵀ ( X⁻¹ )ᵀ
▶ ∂( xᵀ a )/∂x = aᵀ
▶ ∂( aᵀ x )/∂x = aᵀ
▶ ∂( aᵀ X b )/∂X = a bᵀ
▶ ∂( xᵀ B x )/∂x = xᵀ ( B + Bᵀ )
▶ ∂/∂s ( (x − As)ᵀ W (x − As) ) = −2 (x − As)ᵀ W A
for symmetric W .
Backpropagation and Automatic Differentiation
Consider the function
    f(x) = √( x² + exp(x²) ) + cos( x² + exp(x²) )

Taking derivatives,

    df/dx = ( 2x + 2x exp(x²) ) / ( 2 √( x² + exp(x²) ) )
            − sin( x² + exp(x²) ) ( 2x + 2x exp(x²) )

          = 2x ( 1 / ( 2 √( x² + exp(x²) ) ) − sin( x² + exp(x²) ) ) ( 1 + exp(x²) )
Motivation
▶ A naive implementation of the gradient can be significantly
more expensive than computing the function itself; such lengthy
expressions impose unnecessary overhead.
▶ We need an efficient way to compute the gradient of an error
function with respect to the parameters of the model.
▶ For training deep neural network models, the backpropagation
algorithm is one such method.
Backpropagation and Automatic Differentiation
In neural networks with multiple layers
fi (xi−1 ) = σ(Ai−1 xi−1 + bi−1 )
where xi−1 is the output of layer i − 1 and σ is an activation
function.
Backpropagation
To train these models, the gradient of the loss function L with
respect to all model parameters θj = {Aj , bj }, j = 0, · · · , K − 1, and the
inputs of each layer needs to be computed. Consider

    f0 := x
    fi := σi ( Ai−1 fi−1 + bi−1 ) ,  i = 1, · · · , K .

We have to find θj = {Aj , bj }, j = 0, · · · , K − 1 such that

    L(θ) = ∥ y − fK(θ, x) ∥²

is minimized, where θ = {A0 , b0 , · · · , AK−1 , bK−1 }.
Backpropagation
Using the chain rule, we get

    ∂L/∂θK−1 = (∂L/∂fK) (∂fK/∂θK−1)

    ∂L/∂θK−2 = (∂L/∂fK) (∂fK/∂fK−1) (∂fK−1/∂θK−2)

    ∂L/∂θK−3 = (∂L/∂fK) (∂fK/∂fK−1) (∂fK−1/∂fK−2) (∂fK−2/∂θK−3)

    ∂L/∂θi   = (∂L/∂fK) (∂fK/∂fK−1) · · · (∂fi+2/∂fi+1) (∂fi+1/∂θi)
Backpropagation
If the partial derivatives ∂L/∂θi+1 are computed, then the computation
can be reused to compute ∂L/∂θi .
Example
Consider the function

    f(x) = √( x² + exp(x²) ) + cos( x² + exp(x²) )

Let

    a = x² , b = exp(a) , c = a + b , d = √c , e = cos(c)  ⇒  f = d + e
Example
    ⇒ ∂a/∂x = 2x
      ∂b/∂a = exp(a)
      ∂c/∂a = 1 = ∂c/∂b
      ∂d/∂c = 1 / (2√c)
      ∂e/∂c = −sin(c)
      ∂f/∂d = 1 = ∂f/∂e
Example
Thus, we have
    ∂f/∂c = (∂f/∂d)(∂d/∂c) + (∂f/∂e)(∂e/∂c)
    ∂f/∂b = (∂f/∂c)(∂c/∂b)
    ∂f/∂a = (∂f/∂b)(∂b/∂a) + (∂f/∂c)(∂c/∂a)
    ∂f/∂x = (∂f/∂a)(∂a/∂x)
Example
Substituting the results, we get
    ∂f/∂c = 1 · ( 1 / (2√c) ) + 1 · ( −sin(c) )
    ∂f/∂b = (∂f/∂c) · 1
    ∂f/∂a = (∂f/∂b) exp(a) + (∂f/∂c) · 1
    ∂f/∂x = (∂f/∂a) · 2x
Thus, the computation for calculating the derivative is of similar
complexity as the computation of the function itself.
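The forward and backward passes of this example translate directly into code. The following sketch (illustrative; the evaluation point x = 0.5 is arbitrary) mirrors the intermediate variables a, b, c, d, e defined above and reuses them in the backward pass.

```python
# Forward pass to evaluate f, backward pass to accumulate df/dx.
import math

def f_and_grad(x):
    # forward pass: build the intermediate variables
    a = x * x
    b = math.exp(a)
    c = a + b
    d = math.sqrt(c)
    e = math.cos(c)
    f = d + e
    # backward pass: apply the chain rule in reverse order, reusing results
    df_dd, df_de = 1.0, 1.0
    df_dc = df_dd * (1.0 / (2.0 * math.sqrt(c))) + df_de * (-math.sin(c))
    df_db = df_dc * 1.0
    df_da = df_db * math.exp(a) + df_dc * 1.0
    df_dx = df_da * 2.0 * x
    return f, df_dx

value, grad = f_and_grad(0.5)
print(value, grad)   # the gradient matches the closed-form derivative above
```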
Formalization of Automatic Differentiation
Let x1 , · · · , xd : input variables.
xd+1 , · · · , xD−1 : intermediate variables.
xD : output variable, then we have,
xi = gi (xPa(xi ) )
Note that the gi are elementary functions, also called forward
propagation functions, and xPa(xi) is the set of parent nodes
of the variable xi in the graph.
Formalization of Automatic Differentiation
Now,

    f = xD   ⇒   ∂f/∂xD = 1

For other variables, using the chain rule, we get

    ∂f/∂xi = Σ_{xj : xi ∈ Pa(xj)} (∂f/∂xj)(∂xj/∂xi)
           = Σ_{xj : xi ∈ Pa(xj)} (∂f/∂xj)(∂gj/∂xi)

The last equation is the backpropagation of the gradient through
the computation graph. For neural network training, we
backpropagate the error of the prediction with respect to the label.
Lecture 8
Math Foundations Team
Introduction
▶ Till now we have discussed Taylor/Maclaurin series,
partial derivatives and gradients.
▶ Now we are interested in higher-order derivatives.
▶ We also discuss the multivariate Taylor series and its use in the
expansion of a function of several variables.
Higher-Order Derivatives
Consider a function f : R2 → R
Notations for higher-order partial derivatives:

∂²f/∂x² : second partial derivative of f with respect to x
∂ⁿf/∂xⁿ : n-th partial derivative of f with respect to x
∂²f/∂y∂x = ∂/∂y ( ∂f/∂x ) : the partial derivative obtained by first
partially differentiating with respect to x and then with respect to y
∂²f/∂x∂y = ∂/∂x ( ∂f/∂y ) : the partial derivative obtained by first
partially differentiating with respect to y and then with respect to x
Hessian Matrix
The Hessian is the collection of all second-order partial derivatives.
If f(x, y) is a twice (continuously) differentiable function, then
∂²f/∂x∂y = ∂²f/∂y∂x , i.e., the order of differentiation does not
matter, and the corresponding Hessian matrix

        [ ∂²f/∂x²    ∂²f/∂x∂y ]
    H = [ ∂²f/∂x∂y   ∂²f/∂y²  ]

is symmetric. The Hessian is denoted as ∇²_{x,y} f(x, y).
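SymPy can assemble the Hessian directly; the sketch below (illustrative, not from the slides) uses the two-variable function f(x, y) = x² + 2xy + y³ that appears later in this lecture.

```python
# Hessian of f(x, y) = x**2 + 2*x*y + y**3 and its value at (1, 2).
import sympy as sp

x, y = sp.symbols('x y')
f = x**2 + 2*x*y + y**3

H = sp.hessian(f, (x, y))
print(H)               # Matrix([[2, 2], [2, 6*y]]) -- symmetric, as expected
print(H.subs({y: 2}))  # Hessian at y = 2 (it does not depend on x): Matrix([[2, 2], [2, 12]])
```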
Linearization and Multivariate Taylor Series
The gradient ∇f of a function f is often used for a locally linear
approximation of f around x0 :
f (x) ≈ f (x0 ) + (∇x f )(x0 )(x − x0 ) (1)
Here (∇x f)(x0) is the gradient of f with respect to x, evaluated
at x0 . The figure illustrates the linear approximation of a function f at
an input x0 . The original function is approximated by a straight
line.
Linearization and Multivariate Taylor Series...
This approximation is locally accurate, but the farther we move
away from x0 the worse the approximation gets. Equation (1) is a
special case of a multivariate Taylor series expansion of f at x0 ,
where we consider only the first two terms. We discuss the more
general case in the following, which will allow for better
approximations.
Multivariate Taylor Series
Consider a function f : R^D → R, x ↦ f(x), x ∈ R^D , that is smooth
at x0 . When we define the difference vector δ := x − x0 , the
multivariate Taylor series of f at x0 is defined as

    f(x) = Σ_{k=0}^{∞} ( Dx^k f(x0) / k! ) δ^k                        (2)

where Dx^k f(x0) is the k-th (total) derivative of f with respect to x,
evaluated at x0 .
Taylor Polynomial
The Taylor polynomial of degree n of f at x0 contains the first
n + 1 components of the series in (2) and is defined as

    Tn(x) = Σ_{k=0}^{n} ( Dx^k f(x0) / k! ) δ^k                       (3)
In (2) and (3), we used the slightly sloppy notation of δ^k , which is
not defined for vectors x ∈ R^D , D > 1, and k > 1. Note that both
Dx^k f and δ^k are k-th order tensors, i.e., k-dimensional arrays.
Taylor Polynomial...
In general, we obtain the terms in the Taylor series, where
Dx^k f(x0) δ^k contains k-th order polynomials. Now that we have
defined the Taylor series for vector fields, let us explicitly write down
the first terms Dx^k f(x0) δ^k of the Taylor series expansion.
Taylor Series Expansion of a Function with Two Variables
Consider the function f (x, y ) = x 2 + 2xy + y 3 .
We want to compute the Taylor series expansion of f at
(x0 , y0 ) = (1, 2).
Before we start, let us discuss what to expect: The function in
f (x, y ) is a polynomial of degree 3. We are looking for a Taylor
series expansion,which itself is a linear combination of polynomials.
Therefore, we do not expect the Taylor series expansion to contain
terms of fourth or higher order to express a third-order polynomial.
This means that it should be sufficient to determine the first four
terms of f(x) = Σ_{k=0}^{∞} ( Dx^k f(x0) / k! ) δ^k for an exact alternative
representation of f(x, y). To determine the Taylor series
expansion, we start with the constant term and the first-order
derivatives, which are given by f(1, 2) = 13, ∂f/∂x = 2x + 2y (so
∂f/∂x (1, 2) = 6) and ∂f/∂y = 2x + 3y² (so ∂f/∂y (1, 2) = 14).
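A symbolic cross-check of this whole example (an illustrative SymPy sketch, not from the slides): substituting x = 1 + dx, y = 2 + dy and expanding reproduces exactly the constant, linear, quadratic and cubic Taylor terms around (1, 2).

```python
# Expand f(x, y) = x**2 + 2*x*y + y**3 around (x0, y0) = (1, 2).
import sympy as sp

dx, dy = sp.symbols('dx dy')
x, y = 1 + dx, 2 + dy
f_shifted = sp.expand(x**2 + 2*x*y + y**3)

print(f_shifted)
# 13 + 6*dx + 14*dy + dx**2 + 2*dx*dy + 6*dy**2 + dy**3  (up to term ordering),
# matching f(1, 2) = 13, the gradient (6, 14), the Hessian term and the cubic term.
```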
Taylor Series Expansion of a Function with Two Variables...
When we collect the second-order partial derivatives, we obtain the
Hessian

        [ ∂²f/∂x²    ∂²f/∂x∂y ]   [ 2   2  ]                [ 2   2 ]
    H = [ ∂²f/∂x∂y   ∂²f/∂y²  ] = [ 2  6y  ] ,   H(1, 2) =  [ 2  12 ]
Taylor Series Expansion of a Function with Two Variables...
The third-order derivatives are obtained by differentiating the entries
of the Hessian once more with respect to x and y.
Taylor Series Expansion of a Function with Two Variables...
Since most second-order partial derivatives in the Hessian are
constant, the only nonzero third-order partial derivative is
∂³f/∂y³ = 6, which gives ∂³f/∂y³ (1, 2) = 6. Higher-order derivatives
and the mixed derivatives of degree 3 (e.g., ∂³f/∂x²∂y) vanish, such that

    Dx³ f(1, 2) δ³ = 6 (y − 2)³ ,
which collects all cubic terms of the Taylor series. Overall, the
(exact) Taylor series expansion of f at (x0 , y0 ) = (1, 2) is