Calculus With Vectors and Matrices
Stockholms universitet
October 1999
1 Differentiation

For a function f of a vector argument x, we will write the derivative in any of the following ways:
$$\frac{\partial f(x)}{\partial x^{T}}=\nabla_{x}f(x)=\nabla f(x)=f'(x)=f_{x}(x)\tag{1}$$
We assume that we know what a partial derivative is and how to calculate it. What this section will tell us is how to arrange the partial derivatives into a matrix (the gradient), and the rules of arithmetic that follow from this arrangement.
and
$$\underbrace{\frac{\partial f(x)}{\partial x}}_{n\times m}\triangleq\left[\frac{\partial f(x)}{\partial x^{T}}\right]^{T},\tag{3}$$
where $A^{T}$ is the transpose of $A$.
The Jacobian is defined as
$$J_{f}(x)\triangleq\det\left[\frac{\partial f(x)}{\partial x^{T}}\right].\tag{4}$$
Remark 1.1 Sometimes the gradient $\frac{\partial f(x)}{\partial x^{T}}$ itself is called the Jacobian. Here the Jacobian is defined as the determinant of the gradient.
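Derivatives arranged this way are easy to approximate numerically, which makes the rules below simple to check. A minimal finite-difference sketch in Python/NumPy; the function f, the evaluation point, and the step size are arbitrary choices of mine:

```python
# Finite-difference approximation of the gradient df/dx' and the Jacobian.
import numpy as np

def gradient(f, x, h=1e-6):
    """df/dx': rows index components of f, columns index components of x."""
    fx = f(x)
    G = np.zeros((fx.size, x.size))
    for j in range(x.size):
        e = np.zeros(x.size)
        e[j] = h
        G[:, j] = (f(x + e) - fx) / h
    return G

f = lambda x: np.array([x[0]**2 + x[1], x[0] * x[1]])
x = np.array([1.0, 3.0])

G = gradient(f, x)           # analytically [[2*x1, 1], [x2, x1]] = [[2, 1], [3, 1]]
print(G)
print(np.linalg.det(G))      # the Jacobian J_f(x): det of the gradient, here -1
```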
The following properties of the gradient follow straightforwardly from the definition (a numerical check appears after the list).
1. Let x be an n × 1 vector and A an m × n matrix of constants. Then
$$\frac{\partial}{\partial x^{T}}\left[Ax\right]=A.\tag{5}$$
2. Let x be an n × 1 vector and A an n × m matrix of constants. Then
$$\frac{\partial}{\partial x}\left[x^{T}A\right]=A^{T}.\tag{6}$$
3. Let x be an n × 1 vector and A an n × n matrix. Then
$$\frac{\partial}{\partial x^{T}}\left[x^{T}Ax\right]=x^{T}\left(A+A^{T}\right).\tag{7}$$
4. Let x be an n × 1 vector and A an n × n symmetric matrix. Then
$$\frac{\partial}{\partial x^{T}}\left[x^{T}Ax\right]=2x^{T}A.\tag{8}$$
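Rules (5), (7), and (8) can be verified with the finite-difference sketch below; A, x, and the step size are arbitrary test values of mine (rule (6) differentiates with respect to x rather than x', so it is not repeated here):

```python
# Numerical check of rules (5), (7), and (8).
import numpy as np

rng = np.random.default_rng(0)
n = 4
A = rng.standard_normal((n, n))
x = rng.standard_normal(n)
h = 1e-6

def row_grad(f, x, h=1e-6):
    """df/dx' of a scalar function, as a length-n row."""
    return np.array([(f(x + h * np.eye(len(x))[j]) - f(x)) / h
                     for j in range(len(x))])

# Rule (5): d(Ax)/dx' = A
J = np.column_stack([(A @ (x + h * np.eye(n)[j]) - A @ x) / h for j in range(n)])
print(np.allclose(J, A, atol=1e-4))

# Rule (7): d(x'Ax)/dx' = x'(A + A')
g = row_grad(lambda z: z @ A @ z, x)
print(np.allclose(g, x @ (A + A.T), atol=1e-4))

# Rule (8): with A symmetric the same derivative is 2x'A
S = A + A.T
g = row_grad(lambda z: z @ S @ z, x)
print(np.allclose(g, 2 * x @ S, atol=1e-4))
```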
We define the matrix of second partial derivatives (the Hessian) as follows.

Definition 1.3 Let f : Rⁿ → R have continuous first and second partial derivatives. Then the Hessian of f is
$$\frac{\partial^{2}f(x)}{\partial x\,\partial x^{T}}\triangleq\frac{\partial}{\partial x}\left[\frac{\partial f(x)}{\partial x^{T}}\right].\tag{9}$$
Since the second partial derivatives are continuous, the Hessian is symmetric.
For example, if f(x) = x^{T}Ax with A an n × n symmetric matrix, then
$$\frac{\partial^{2}f(x)}{\partial x\,\partial x^{T}}=2A.\tag{10}$$
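A quick numerical confirmation of (10), using central second differences; the matrix, the point, and the step size are my own test choices:

```python
# Numerical check that the Hessian of f(x) = x'Ax (A symmetric) is 2A.
import numpy as np

rng = np.random.default_rng(1)
n = 3
A = rng.standard_normal((n, n))
A = A + A.T                              # make A symmetric
x = rng.standard_normal(n)
f = lambda z: z @ A @ z

h = 1e-4
H = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        ei, ej = h * np.eye(n)[i], h * np.eye(n)[j]
        # central second difference for d2f / dxi dxj
        H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                   - f(x - ei + ej) + f(x - ei - ej)) / (4 * h**2)

print(np.allclose(H, 2 * A, atol=1e-5))  # True
```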
Occasionally we run into matrix-valued functions, and the way forward then is to vectorize them.

Definition 1.4 Let $A=\begin{bmatrix}a_{1}&a_{2}&\cdots&a_{n}\end{bmatrix}$ be an m × n matrix, where each $a_{i}$ is an m × 1 column. Then
$$\operatorname{vec}(A)\triangleq\begin{bmatrix}a_{1}\\a_{2}\\\vdots\\a_{n}\end{bmatrix},\tag{11}$$
an mn × 1 vector.
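In NumPy terms, vec is simply a column-major reshape; the matrix below is an arbitrary example of mine:

```python
# vec() stacks the columns of A: a column-major ('F') reshape.
import numpy as np

A = np.array([[1, 4],
              [2, 5],
              [3, 6]])                  # 3 x 2, columns a1 = (1,2,3), a2 = (4,5,6)
vecA = A.reshape(-1, 1, order="F")      # 6 x 1: the columns stacked on top of each other
print(vecA.ravel())                     # [1 2 3 4 5 6]
```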
Having defined the vec operator, we quickly run into cases where we need the Kronecker product.
Definition 1.6 Let A and B be m × n and k × l matrices, respectively. Denote the element in the i:th row and j:th column of A by $a_{ij}$. Then the Kronecker product of A and B is the mk × nl matrix
$$A\otimes B\triangleq\begin{bmatrix}a_{11}B&\cdots&a_{1n}B\\\vdots&&\vdots\\a_{m1}B&\cdots&a_{mn}B\end{bmatrix}.$$
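NumPy's np.kron implements this definition, and it is worth recording the textbook-standard identity vec(AXB) = (B' ⊗ A) vec X, which ties Definitions 1.4 and 1.6 together; the test matrices below are arbitrary choices of mine:

```python
# Check of the standard identity vec(AXB) = (B' kron A) vec(X).
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((2, 3))
X = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 5))

vec = lambda M: M.reshape(-1, 1, order="F")   # column-major vec
lhs = vec(A @ X @ B)
rhs = np.kron(B.T, A) @ vec(X)
print(np.allclose(lhs, rhs))                  # True
```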
Proof. Exercise.
Proposition 1 Whenever the following expressions are defined, the identities hold.
$$\underbrace{\frac{\partial f(x)}{\partial A^{T}}}_{nm\times k}\triangleq\frac{\partial f(x)}{\partial\left(\operatorname{vec}A\right)^{T}},\tag{15}$$
For example, if f(Φ) = Φk, where Φ is an n × m matrix and k an m × 1 vector of constants, then
$$\frac{\partial f(\Phi)}{\partial\Phi^{T}}=k^{T}\otimes I_{n}.\tag{16}$$
We are now in a position to state rather general versions of the product and chain rules.
1.2 The product rule
Let A(x) have n rows and B(x) have k columns, both differentiable functions of the vector x. Then
$$\frac{\partial\operatorname{vec}\left[A(x)B(x)\right]}{\partial x^{T}}=\left[B(x)^{T}\otimes I_{n}\right]\frac{\partial\operatorname{vec}A(x)}{\partial x^{T}}+\left[I_{k}\otimes A(x)\right]\frac{\partial\operatorname{vec}B(x)}{\partial x^{T}}.\tag{17}$$
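The product rule can be checked by finite differences. In the sketch below the dimensions and the particular smooth A(x) and B(x) are arbitrary choices of mine, with A(x) having n rows and B(x) having k columns:

```python
# Finite-difference check of the product rule (17).
import numpy as np

n, p, k, q = 2, 3, 2, 4                  # A(x) is n x p, B(x) is p x k, x in R^q
rng = np.random.default_rng(3)
CA = rng.standard_normal((n * p, q))     # coefficients defining A(x) and B(x)
CB = rng.standard_normal((p * k, q))

vec = lambda M: M.reshape(-1, order="F")
A = lambda x: (CA @ np.sin(x)).reshape(n, p, order="F")
B = lambda x: (CB @ np.cos(x)).reshape(p, k, order="F")

def jac(g, x, h=1e-6):
    """Finite-difference d vec g(x) / dx'."""
    g0 = vec(g(x))
    return np.column_stack([(vec(g(x + h * np.eye(len(x))[j])) - g0) / h
                            for j in range(len(x))])

x = rng.standard_normal(q)
lhs = jac(lambda z: A(z) @ B(z), x)
rhs = (np.kron(B(x).T, np.eye(n)) @ jac(A, x)
       + np.kron(np.eye(k), A(x)) @ jac(B, x))
print(np.allclose(lhs, rhs, atol=1e-4))  # True up to discretization error
```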
In particular, when B is a matrix of constants,
$$\frac{\partial\operatorname{vec}\left(A(x)B\right)}{\partial x^{T}}=\left[B^{T}\otimes I_{n}\right]\frac{\partial\operatorname{vec}A(x)}{\partial x^{T}}.\tag{20}$$
Corollary 1.1 When we have vector- rather than matrix-valued functions, the formula simplifies. Let f and g be n × 1 vector-valued functions with partial derivatives at x ∈ R^l. Then
$$\frac{\partial}{\partial x^{T}}\left[f(x)^{T}g(x)\right]=g(x)^{T}\frac{\partial f(x)}{\partial x^{T}}+f(x)^{T}\frac{\partial g(x)}{\partial x^{T}}.\tag{21}$$
Example 1.2 Suppose we would like to differentiate f(Ω) ≜ Ω^{-1} with respect to Ω, where Ω is an invertible n × n matrix. Since ΩΩ^{-1} = I is constant,
$$\frac{\partial\operatorname{vec}\left(\Omega\Omega^{-1}\right)}{\partial\left(\operatorname{vec}\Omega\right)^{T}}=\frac{\partial\operatorname{vec}\left(I\right)}{\partial\left(\operatorname{vec}\Omega\right)^{T}}=0,$$
so that, by the product rule,
$$0=\left[\Omega^{-T}\otimes I_{n}\right]\frac{\partial\operatorname{vec}\Omega}{\partial\left(\operatorname{vec}\Omega\right)^{T}}+\left[I_{n}\otimes\Omega\right]\frac{\partial\operatorname{vec}\Omega^{-1}}{\partial\left(\operatorname{vec}\Omega\right)^{T}}.$$
Since $\partial\operatorname{vec}\Omega/\partial\left(\operatorname{vec}\Omega\right)^{T}=I_{n^{2}}$, it follows that
$$\frac{\partial\operatorname{vec}\Omega^{-1}}{\partial\left(\operatorname{vec}\Omega\right)^{T}}=-\left[I_{n}\otimes\Omega\right]^{-1}\left[\Omega^{-T}\otimes I_{n}\right]=-\Omega^{-T}\otimes\Omega^{-1}.$$
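A finite-difference check of this result; the test matrix is an arbitrary well-conditioned choice of mine:

```python
# Check: d vec(Omega^-1) / d(vec Omega)' = -(Omega^-T kron Omega^-1).
import numpy as np

rng = np.random.default_rng(4)
n = 3
Om = rng.standard_normal((n, n)) + n * np.eye(n)   # safely invertible
vec = lambda M: M.reshape(-1, order="F")

h = 1e-6
J = np.zeros((n * n, n * n))
for j in range(n * n):
    dOm = np.zeros(n * n)
    dOm[j] = h                                      # perturb one vec element
    J[:, j] = (vec(np.linalg.inv(Om + dOm.reshape(n, n, order="F")))
               - vec(np.linalg.inv(Om))) / h

Oi = np.linalg.inv(Om)
print(np.allclose(J, -np.kron(Oi.T, Oi), atol=1e-4))  # True
```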
Proposition 1.5 (the chain rule) Let f and g have partial derivatives at x, and let h(x) = (f ∘ g)(x) = f(g(x)). Define y = g(x). Then h has partial derivatives at x and
$$\frac{\partial h(x)}{\partial x^{T}}=\frac{\partial f(y)}{\partial y^{T}}\frac{\partial g(x)}{\partial x^{T}}.\tag{22}$$
Proof. The scalar chain rule and the definition of matrix multiplication.

With an alternative piece of notation, we may consider expressions such as
$$\frac{\partial f(B(A))}{\partial\operatorname{vec}(A)^{T}}.$$
Here we use the chain rule to find that
$$\frac{\partial f(B(A))}{\partial\operatorname{vec}(A)^{T}}=\frac{\partial f(B)}{\partial\operatorname{vec}(B)^{T}}\,\frac{\partial\operatorname{vec}B(A)}{\partial\operatorname{vec}(A)^{T}}.$$
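The chain rule, too, is easy to verify numerically; the functions f and g and the dimensions below are arbitrary choices of mine:

```python
# Finite-difference check of the chain rule (22) for h = f o g.
import numpy as np

rng = np.random.default_rng(5)
M = rng.standard_normal((3, 4))          # g: R^4 -> R^3
W = rng.standard_normal((2, 3))          # f: R^3 -> R^2
g = lambda x: np.tanh(M @ x)
f = lambda y: W @ (y ** 2)

def jac(fun, x, h=1e-6):
    f0 = fun(x)
    return np.column_stack([(fun(x + h * np.eye(len(x))[j]) - f0) / h
                            for j in range(len(x))])

x = rng.standard_normal(4)
lhs = jac(lambda z: f(g(z)), x)          # dh/dx'
rhs = jac(f, g(x)) @ jac(g, x)           # (df/dy')(dg/dx')
print(np.allclose(lhs, rhs, atol=1e-4))  # True
```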
2 Remarks on unvectorizing
The definition of the derivative of a matrix function with respect to a matrix given above was stated originally by Magnus and Neudecker (1999). It has some great advantages, but the vectorized arrangement is not always the handiest one; think, for example, of the case where we differentiate a matrix with respect to a scalar. Then the whole derivative could just as well be arranged as a matrix of the same dimensions as the original. Similarly, for the quadratic form x^{T}Ax we have
$$\frac{\partial x^{T}Ax}{\partial A^{T}}=\frac{\partial x^{T}Ax}{\partial\left(\operatorname{vec}A\right)^{T}}=x^{T}\otimes x^{T},$$
so that the result is an n² × 1 vector. But often it is nicer to define the result as the n × n matrix
$$\frac{\partial x^{T}Ax}{\partial A^{T}}=\frac{\partial x^{T}Ax}{\partial A}=xx^{T}.$$
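The two arrangements contain the same numbers: reshaping the n²-element vec-derivative column by column recovers xx'. A small demonstration with an x of my choosing:

```python
# 'Unvectorizing' the derivative of x'Ax with respect to vec A gives xx'.
import numpy as np

x = np.array([1.0, 2.0, 3.0])
n = x.size

flat = np.kron(x, x)                     # the n^2 entries of the vec-derivative
M = flat.reshape(n, n, order="F")        # back to an n x n matrix
print(np.allclose(M, np.outer(x, x)))    # True: the matrix is xx'
```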
Competing with it are the unvectorized definitions
$$\frac{\partial f(A)}{\partial A}\triangleq\begin{bmatrix}\frac{\partial f(A)}{\partial a_{11}}&\cdots&\frac{\partial f(A)}{\partial a_{m1}}\\\vdots&&\vdots\\\frac{\partial f(A)}{\partial a_{1n}}&\cdots&\frac{\partial f(A)}{\partial a_{mn}}\end{bmatrix}$$
and
$$\frac{\partial f(A)}{\partial A^{T}}=\left[\frac{\partial f(A)}{\partial A}\right]^{T},$$
where $a_{ij}$ is the element on the ith row and jth column of A; these are often nicer in practice. Consider, for example,
$$\frac{\partial\left(x^{T}\Omega^{-1}x\right)}{\partial\Omega^{T}}.$$
By Example 1.2 and the chain rule,
$$\frac{\partial\operatorname{vec}\left(x^{T}\Omega^{-1}x\right)}{\partial\left(\operatorname{vec}\Omega\right)^{T}}=-\left(x^{T}\otimes x^{T}\right)\left(\Omega^{-T}\otimes\Omega^{-1}\right),$$
so that, unvectorized,
$$\frac{\partial\left(x^{T}\Omega^{-1}x\right)}{\partial\Omega^{T}}=-\Omega^{-1}xx^{T}\Omega^{-1}.$$
Example 2.1 (Maximum likelihood estimation of the population mean and variance) Let µ be an unknown n × 1 column vector and let Ω be an unknown n × n matrix. Let the data be a random sample from the N(µ, Ω) distribution.
Using the 'unvectorized' definition of the matrix derivative and noting that, with this definition,
$$\frac{\partial\ln\left|\Omega\right|}{\partial\Omega^{T}}=\Omega^{-1},$$
the first-order conditions for µ and Ω are quick to derive. But such convenient formulas have to be collected case by case; there is no general product or chain rule for the definition of the matrix derivative given here. This is one of Magnus & Neudecker's reasons for advocating the definition I give in the previous section.
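The ln|Ω| formula can likewise be checked entrywise for a symmetric Ω; the test values are mine:

```python
# Entrywise check: for symmetric Omega, the derivatives of ln|Omega|
# arrange into Omega^-1.
import numpy as np

rng = np.random.default_rng(7)
n = 3
Om = rng.standard_normal((n, n))
Om = Om @ Om.T + n * np.eye(n)           # symmetric positive definite
f = lambda W: np.log(np.linalg.det(W))

h = 1e-6
D = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        E = np.zeros((n, n))
        E[i, j] = h
        D[i, j] = (f(Om + E) - f(Om)) / h

print(np.allclose(D, np.linalg.inv(Om), atol=1e-4))  # True
```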
Suppose f : Rⁿ → R is differentiable at a point x₀ of an open set S ⊆ Rⁿ. Then there is a function r such that
1. For all x ∈ S,
$$f(x)=f(x_{0})+\frac{\partial f(x_{0})}{\partial x^{T}}\left(x-x_{0}\right)+r\left(x-x_{0}\right).\tag{24}$$
2.
$$\lim_{h\to0}\frac{r(h)}{\left\|h\right\|}=0.\tag{25}$$
If f in addition has second partial derivatives at x₀, then there is a function r such that

1. For all x ∈ S,
$$f(x)=f(x_{0})+\frac{\partial f(x_{0})}{\partial x^{T}}\left(x-x_{0}\right)+\frac{1}{2}\left(x-x_{0}\right)^{T}\frac{\partial^{2}f(x_{0})}{\partial x\,\partial x^{T}}\left(x-x_{0}\right)+r\left(x-x_{0}\right).\tag{26}$$
2.
$$\lim_{h\to0}\frac{r(h)}{\left\|h\right\|^{2}}=0.\tag{27}$$
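The content of (26)-(27) is that the remainder shrinks faster than ‖h‖². A numerical illustration with a function, an expansion point, and a direction of my own choosing:

```python
# The second-order Taylor remainder r(h) satisfies r(h)/||h||^2 -> 0.
import numpy as np

f = lambda x: np.exp(x[0]) * np.cos(x[1])
x0 = np.array([0.5, 1.0])
grad = np.array([np.exp(0.5) * np.cos(1.0), -np.exp(0.5) * np.sin(1.0)])
hess = np.array([[np.exp(0.5) * np.cos(1.0), -np.exp(0.5) * np.sin(1.0)],
                 [-np.exp(0.5) * np.sin(1.0), -np.exp(0.5) * np.cos(1.0)]])

d = np.array([1.0, -2.0])                # a fixed direction
for t in [1e-1, 1e-2, 1e-3]:
    h = t * d
    taylor = f(x0) + grad @ h + 0.5 * h @ hess @ h
    r = f(x0 + h) - taylor
    print(t, r / np.dot(h, h))           # the ratios shrink roughly like t
```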
If f is infinitely differentiable, we may ask whether the Taylor series
$$f(x_{0})+\frac{\partial f(x_{0})}{\partial x^{T}}\left(x-x_{0}\right)+\frac{1}{2}\left(x-x_{0}\right)^{T}\frac{\partial^{2}f(x_{0})}{\partial x\,\partial x^{T}}\left(x-x_{0}\right)+\cdots\tag{28}$$
converges to f(x). This series may fail to converge, or it may converge to a number different from f(x). In other words, not all infinitely differentiable functions are represented by their Taylor series. For example, the function f : R → R with f(x) = e^{-1/x²} for x ≠ 0 and f(0) = 0 is infinitely differentiable with f⁽ⁿ⁾(0) = 0 for all n = 0, 1, 2, ..., and so the Taylor series around x₀ = 0 converges to zero everywhere, not to f(x).
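The point is that this f is flatter at the origin than any power of x, which is why every Taylor coefficient vanishes. Numerically (the exponent 20 and the sample points are my own choices):

```python
# e^{-1/x^2} vanishes faster than any power of x as x -> 0.
import numpy as np

f = lambda x: np.exp(-1.0 / x**2)
for x in [0.1, 0.05, 0.025]:
    print(x, f(x) / x**20)   # collapses toward zero (underflows to 0.0 at the end)
```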
3 Integrating over Rn
3.1 Definition
Within the Riemann theory, integrating over rectangles (and the n-dimensional boxes that generalize them) is straightforward.
Consider first the case n = 2: let E = {(x, y) : a₂ ≤ x ≤ b₂, a₁ ≤ y ≤ b₁}, let f be continuous on E, and define $\varphi(y)\triangleq\int_{a_{2}}^{b_{2}}f(x,y)\,dx$. Then
$$\int_{E}f(x)\,dx=\int_{a_{1}}^{b_{1}}\left[\int_{a_{2}}^{b_{2}}f(x,y)\,dx\right]dy=\int_{a_{1}}^{b_{1}}\varphi(y)\,dy.\tag{29}$$
Happily, the order of integration does not matter under our assumptions. Let f be a continuous function. Then
$$\int_{a_{1}}^{b_{1}}\left[\int_{a_{2}}^{b_{2}}f(x,y)\,dx\right]dy=\int_{E}f(x)\,dx=\int_{a_{2}}^{b_{2}}\left[\int_{a_{1}}^{b_{1}}f(x,y)\,dy\right]dx.\tag{30}$$
Remark 3.1 If you think this is a surprising result, recall that integrals are just sums, and sums (avoiding pathologies where infinity is involved) are the same regardless of the order in which we add the terms.
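Since the iterated integrals really are iterated sums in the limit, a grid approximation shows the two orders agreeing; the integrand and the bounds below are my own example:

```python
# Iterated Riemann sums over a rectangle agree in either order, cf. (30).
import numpy as np

a1, b1 = 0.0, 1.0                        # bounds for y
a2, b2 = 0.0, 2.0                        # bounds for x
N = 500
dy, dx = (b1 - a1) / N, (b2 - a2) / N
y = a1 + (np.arange(N) + 0.5) * dy       # midpoint grids
x = a2 + (np.arange(N) + 0.5) * dx
X, Y = np.meshgrid(x, y)                 # F[i, j] = f(x_j, y_i)
F = np.exp(-X * Y) * np.cos(X + Y)

phi = F.sum(axis=1) * dx                 # phi(y) ~ inner integral over x
I_xy = phi.sum() * dy                    # then integrate over y
psi = F.sum(axis=0) * dy                 # or integrate over y first
I_yx = psi.sum() * dx
print(I_xy, I_yx)                        # the two orders agree
```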
3.2 Change of variables
For f : X → R, the set
$$S_{f}=\left\{x\in X:f(x)\neq0\right\}\tag{31}$$
is called the support of f.
Theorem (change of variables) Let T be a one-to-one, continuously differentiable mapping of an open set E ⊂ Rⁿ into Rⁿ such that the Jacobian J_T(x) ≠ 0 for all x ∈ E. Let f be a continuous function from Rⁿ into R whose support is compact and lies in T(E). Then
$$\int_{\mathbb{R}^{n}}f(y)\,dy=\int_{\mathbb{R}^{n}}f\left(T(x)\right)\left|J_{T}(x)\right|\,dx.\tag{32}$$
Remark 3.2 The reason for having |J_T(x)| instead of J_T(x) is that, with the definition of the integral used in this section, we integrate over subsets of Rⁿ without regard for their orientation. For example, in the scalar case, we consider $\int_{a}^{b}f(x)\,dx$ and $\int_{b}^{a}f(x)\,dx$ to be the same. Given that these are defined to be the same, we must take steps to ensure that, say, the change of variables T(x) = -x makes no difference, and that is guaranteed by taking the absolute value of the Jacobian.
Example 3.1 (from Econometrics II; calculating the volume of a cylinder) We integrate the function f that equals a constant c on the disk x² + y² ≤ k² and zero outside it, and it turns out to be convenient to use the change of variables approach, noting with satisfaction that f has compact support. Looking at the picture, it seems that polar coordinates are the natural choice, so we take E = (0, ∞) × (0, 2π) and define T on E via
$$T(r,\theta)=\begin{bmatrix}r\cos\theta\\r\sin\theta\end{bmatrix}.\tag{36}$$
Apparently the Jacobian is
$$J_{T}(r,\theta)=\det\begin{bmatrix}\cos\theta&-r\sin\theta\\\sin\theta&r\cos\theta\end{bmatrix}=r\tag{37}$$
and
c if 0 ≤ r ≤ k and 0 ≤ θ ≤ 2π
f (T (r, θ)) = (38)
0 otherwise.
Hence
$$\int_{\mathbb{R}^{2}}f(x,y)\,d(x,y)=\int_{0}^{2\pi}\int_{0}^{k}cr\,dr\,d\theta=ck^{2}\pi.\tag{39}$$
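As a sanity check, a crude Riemann sum over a Cartesian grid approaches ck²π; c, k, and the grid size below are my own choices:

```python
# Riemann-sum check of (39): the volume of the cylinder is c*k^2*pi.
import numpy as np

c, k, N = 2.0, 1.5, 2000
xs = np.linspace(-k, k, N)
dx = xs[1] - xs[0]
X, Y = np.meshgrid(xs, xs)
F = np.where(X**2 + Y**2 <= k**2, c, 0.0)  # f = c on the disk, 0 outside

print(F.sum() * dx * dx)                   # ~ 14.137
print(c * k**2 * np.pi)                    # 14.1372...
```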
References
Magnus, J. and H. Neudecker (1999). Matrix Differential Calculus With Appli-
cations in Statistics and Econometrics. John Wiley and Sons.