Calculus With Vectors and Matrices
Stockholms universitet
October 1999
1 Differentiation

For a function f of a vector argument x, we will write the derivative in any of the following ways:
$$\frac{\partial f(x)}{\partial x^{T}}=\nabla_{x}f(x)=\nabla f(x)=f'(x)=f_{x}(x)\tag{1}$$
We assume that we know what a partial derivative is and how to calculate it. What this section will tell us is how to arrange the partial derivatives into a matrix (the gradient), and the rules of arithmetic that follow from this arrangement.
and
$$\underbrace{\frac{\partial f(x)}{\partial x}}_{n\times m}\triangleq\left[\frac{\partial f(x)}{\partial x^{T}}\right]^{T},\tag{3}$$
where $A^{T}$ is the transpose of $A$.
The Jacobian is defined as
$$J_{f}(x)\triangleq\det\left[\frac{\partial f(x)}{\partial x^{T}}\right].\tag{4}$$
Remark 1.1 Sometimes the gradient $\frac{\partial f(x)}{\partial x^{T}}$ itself is called the Jacobian. Here the Jacobian is defined as the determinant of the gradient.
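Derivatives arranged this way are easy to approximate numerically, which makes the rules below simple to check. A minimal finite-difference sketch in Python/NumPy; the function f, the evaluation point, and the step size are arbitrary choices of mine:

```python
# Finite-difference approximation of the gradient df/dx' and the Jacobian.
import numpy as np

def gradient(f, x, h=1e-6):
    """df/dx': rows index components of f, columns index components of x."""
    fx = f(x)
    G = np.zeros((fx.size, x.size))
    for j in range(x.size):
        e = np.zeros(x.size)
        e[j] = h
        G[:, j] = (f(x + e) - fx) / h
    return G

f = lambda x: np.array([x[0]**2 + x[1], x[0] * x[1]])
x = np.array([1.0, 3.0])

G = gradient(f, x)           # analytically [[2*x1, 1], [x2, x1]] = [[2, 1], [3, 1]]
print(G)
print(np.linalg.det(G))      # the Jacobian J_f(x): det of the gradient, here -1
```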
The following properties of the gradient follow straightforwardly from the definition (a numerical check appears after the list).
1. Let x be an n × 1 vector and A an m × n matrix of constants. Then
$$\frac{\partial}{\partial x^{T}}\left[Ax\right]=A.\tag{5}$$
2. Let x be an n × 1 vector and A an n × m matrix of constants. Then
$$\frac{\partial}{\partial x}\left[x^{T}A\right]=A^{T}.\tag{6}$$
3. Let x be an n × 1 vector and A an n × n matrix. Then
$$\frac{\partial}{\partial x^{T}}\left[x^{T}Ax\right]=x^{T}\left(A+A^{T}\right).\tag{7}$$
4. Let x be an n × 1 vector and A an n × n symmetric matrix. Then
$$\frac{\partial}{\partial x^{T}}\left[x^{T}Ax\right]=2x^{T}A.\tag{8}$$
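Rules (5), (7), and (8) can be verified with the finite-difference sketch below; A, x, and the step size are arbitrary test values of mine (rule (6) differentiates with respect to x rather than x', so it is not repeated here):

```python
# Numerical check of rules (5), (7), and (8).
import numpy as np

rng = np.random.default_rng(0)
n = 4
A = rng.standard_normal((n, n))
x = rng.standard_normal(n)
h = 1e-6

def row_grad(f, x, h=1e-6):
    """df/dx' of a scalar function, as a length-n row."""
    return np.array([(f(x + h * np.eye(len(x))[j]) - f(x)) / h
                     for j in range(len(x))])

# Rule (5): d(Ax)/dx' = A
J = np.column_stack([(A @ (x + h * np.eye(n)[j]) - A @ x) / h for j in range(n)])
print(np.allclose(J, A, atol=1e-4))

# Rule (7): d(x'Ax)/dx' = x'(A + A')
g = row_grad(lambda z: z @ A @ z, x)
print(np.allclose(g, x @ (A + A.T), atol=1e-4))

# Rule (8): with A symmetric the same derivative is 2x'A
S = A + A.T
g = row_grad(lambda z: z @ S @ z, x)
print(np.allclose(g, 2 * x @ S, atol=1e-4))
```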
We define the matrix of second partial derivatives (the Hessian) as follows.

Definition 1.3 Let f : Rⁿ → R have continuous first and second partial derivatives. Then the Hessian of f is
$$\frac{\partial^{2}f(x)}{\partial x\,\partial x^{T}}\triangleq\frac{\partial}{\partial x}\left[\frac{\partial f(x)}{\partial x^{T}}\right].\tag{9}$$
Since the second partial derivatives are continuous, the Hessian is symmetric.
For example, if f(x) = x^{T}Ax with A an n × n symmetric matrix, then
$$\frac{\partial^{2}f(x)}{\partial x\,\partial x^{T}}=2A.\tag{10}$$
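A quick numerical confirmation of (10), using central second differences; the matrix, the point, and the step size are my own test choices:

```python
# Numerical check that the Hessian of f(x) = x'Ax (A symmetric) is 2A.
import numpy as np

rng = np.random.default_rng(1)
n = 3
A = rng.standard_normal((n, n))
A = A + A.T                              # make A symmetric
x = rng.standard_normal(n)
f = lambda z: z @ A @ z

h = 1e-4
H = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        ei, ej = h * np.eye(n)[i], h * np.eye(n)[j]
        # central second difference for d2f / dxi dxj
        H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                   - f(x - ei + ej) + f(x - ei - ej)) / (4 * h**2)

print(np.allclose(H, 2 * A, atol=1e-5))  # True
```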
Occasionally we run into matrix-valued functions, and the way forward then is to vectorize them.

Definition 1.4 Let $A=\begin{bmatrix}a_{1}&a_{2}&\cdots&a_{n}\end{bmatrix}$ be an m × n matrix, where each $a_{i}$ is an m × 1 column. Then
$$\operatorname{vec}(A)\triangleq\begin{bmatrix}a_{1}\\a_{2}\\\vdots\\a_{n}\end{bmatrix},\tag{11}$$
an mn × 1 vector.
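In NumPy terms, vec is simply a column-major reshape; the matrix below is an arbitrary example of mine:

```python
# vec() stacks the columns of A: a column-major ('F') reshape.
import numpy as np

A = np.array([[1, 4],
              [2, 5],
              [3, 6]])                  # 3 x 2, columns a1 = (1,2,3), a2 = (4,5,6)
vecA = A.reshape(-1, 1, order="F")      # 6 x 1: the columns stacked on top of each other
print(vecA.ravel())                     # [1 2 3 4 5 6]
```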
Having defined the vec operator, we quickly run into cases where we need the Kronecker product.
Definition 1.6 Let A and B be m × n and k × l matrices, respectively. Denote the element in the i:th row and j:th column of A by $a_{ij}$. Then the Kronecker product of A and B is the mk × nl matrix
$$A\otimes B\triangleq\begin{bmatrix}a_{11}B&\cdots&a_{1n}B\\\vdots&&\vdots\\a_{m1}B&\cdots&a_{mn}B\end{bmatrix}.$$
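NumPy's np.kron implements this definition, and it is worth recording the textbook-standard identity vec(AXB) = (B' ⊗ A) vec X, which ties Definitions 1.4 and 1.6 together; the test matrices below are arbitrary choices of mine:

```python
# Check of the standard identity vec(AXB) = (B' kron A) vec(X).
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((2, 3))
X = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 5))

vec = lambda M: M.reshape(-1, 1, order="F")   # column-major vec
lhs = vec(A @ X @ B)
rhs = np.kron(B.T, A) @ vec(X)
print(np.allclose(lhs, rhs))                  # True
```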
Proof. Exercise.
Proposition 1 Whenever the following expressions are defined, the identities hold.
$$\underbrace{\frac{\partial f(x)}{\partial A^{T}}}_{nm\times k}\triangleq\frac{\partial f(x)}{\partial\left(\operatorname{vec}A\right)^{T}},\tag{15}$$
For example, if f(Φ) = Φk, where Φ is an n × m matrix and k an m × 1 vector of constants, then
$$\frac{\partial f(\Phi)}{\partial\Phi^{T}}=k^{T}\otimes I_{n}.\tag{16}$$
We are now in a position to state rather general versions of the product and chain rules.
1.2 The product rule
Let A(x) have n rows and B(x) have k columns, both differentiable functions of the vector x. Then
$$\frac{\partial\operatorname{vec}\left[A(x)B(x)\right]}{\partial x^{T}}=\left[B(x)^{T}\otimes I_{n}\right]\frac{\partial\operatorname{vec}A(x)}{\partial x^{T}}+\left[I_{k}\otimes A(x)\right]\frac{\partial\operatorname{vec}B(x)}{\partial x^{T}}.\tag{17}$$
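The product rule can be checked by finite differences. In the sketch below the dimensions and the particular smooth A(x) and B(x) are arbitrary choices of mine, with A(x) having n rows and B(x) having k columns:

```python
# Finite-difference check of the product rule (17).
import numpy as np

n, p, k, q = 2, 3, 2, 4                  # A(x) is n x p, B(x) is p x k, x in R^q
rng = np.random.default_rng(3)
CA = rng.standard_normal((n * p, q))     # coefficients defining A(x) and B(x)
CB = rng.standard_normal((p * k, q))

vec = lambda M: M.reshape(-1, order="F")
A = lambda x: (CA @ np.sin(x)).reshape(n, p, order="F")
B = lambda x: (CB @ np.cos(x)).reshape(p, k, order="F")

def jac(g, x, h=1e-6):
    """Finite-difference d vec g(x) / dx'."""
    g0 = vec(g(x))
    return np.column_stack([(vec(g(x + h * np.eye(len(x))[j])) - g0) / h
                            for j in range(len(x))])

x = rng.standard_normal(q)
lhs = jac(lambda z: A(z) @ B(z), x)
rhs = (np.kron(B(x).T, np.eye(n)) @ jac(A, x)
       + np.kron(np.eye(k), A(x)) @ jac(B, x))
print(np.allclose(lhs, rhs, atol=1e-4))  # True up to discretization error
```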
In particular, when B is a matrix of constants,
$$\frac{\partial\operatorname{vec}\left(A(x)B\right)}{\partial x^{T}}=\left[B^{T}\otimes I_{n}\right]\frac{\partial\operatorname{vec}A(x)}{\partial x^{T}}.\tag{20}$$
Corollary 1.1 When we have vector- rather than matrix-valued functions, the formula simplifies. Let f and g be n × 1 vector-valued functions with partial derivatives at x ∈ R^l. Then
$$\frac{\partial}{\partial x^{T}}\left[f(x)^{T}g(x)\right]=g(x)^{T}\frac{\partial f(x)}{\partial x^{T}}+f(x)^{T}\frac{\partial g(x)}{\partial x^{T}}.\tag{21}$$
Example 1.2 Suppose we would like to differentiate f(Ω) ≜ Ω^{-1} with respect to Ω, where Ω is an invertible n × n matrix. Since ΩΩ^{-1} = I is constant,
$$\frac{\partial\operatorname{vec}\left(\Omega\Omega^{-1}\right)}{\partial\left(\operatorname{vec}\Omega\right)^{T}}=\frac{\partial\operatorname{vec}\left(I\right)}{\partial\left(\operatorname{vec}\Omega\right)^{T}}=0,$$
so that, by the product rule,
$$0=\left[\Omega^{-T}\otimes I_{n}\right]\frac{\partial\operatorname{vec}\Omega}{\partial\left(\operatorname{vec}\Omega\right)^{T}}+\left[I_{n}\otimes\Omega\right]\frac{\partial\operatorname{vec}\Omega^{-1}}{\partial\left(\operatorname{vec}\Omega\right)^{T}}.$$
Since $\partial\operatorname{vec}\Omega/\partial\left(\operatorname{vec}\Omega\right)^{T}=I_{n^{2}}$, it follows that
$$\frac{\partial\operatorname{vec}\Omega^{-1}}{\partial\left(\operatorname{vec}\Omega\right)^{T}}=-\left[I_{n}\otimes\Omega\right]^{-1}\left[\Omega^{-T}\otimes I_{n}\right]=-\Omega^{-T}\otimes\Omega^{-1}.$$
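A finite-difference check of this result; the test matrix is an arbitrary well-conditioned choice of mine:

```python
# Check: d vec(Omega^-1) / d(vec Omega)' = -(Omega^-T kron Omega^-1).
import numpy as np

rng = np.random.default_rng(4)
n = 3
Om = rng.standard_normal((n, n)) + n * np.eye(n)   # safely invertible
vec = lambda M: M.reshape(-1, order="F")

h = 1e-6
J = np.zeros((n * n, n * n))
for j in range(n * n):
    dOm = np.zeros(n * n)
    dOm[j] = h                                      # perturb one vec element
    J[:, j] = (vec(np.linalg.inv(Om + dOm.reshape(n, n, order="F")))
               - vec(np.linalg.inv(Om))) / h

Oi = np.linalg.inv(Om)
print(np.allclose(J, -np.kron(Oi.T, Oi), atol=1e-4))  # True
```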
Proposition 1.5 (the chain rule) Let f and g have partial derivatives at x, and let h(x) = (f ∘ g)(x) = f(g(x)). Define y = g(x). Then h has partial derivatives at x and
$$\frac{\partial h(x)}{\partial x^{T}}=\frac{\partial f(y)}{\partial y^{T}}\frac{\partial g(x)}{\partial x^{T}}.\tag{22}$$
Proof. The scalar chain rule and the definition of matrix multiplication.

With an alternative piece of notation, we may consider expressions such as
$$\frac{\partial f(B(A))}{\partial\operatorname{vec}(A)^{T}}.$$
Here we use the chain rule to find that
$$\frac{\partial f(B(A))}{\partial\operatorname{vec}(A)^{T}}=\frac{\partial f(B)}{\partial\operatorname{vec}(B)^{T}}\,\frac{\partial\operatorname{vec}B(A)}{\partial\operatorname{vec}(A)^{T}}.$$
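The chain rule, too, is easy to verify numerically; the functions f and g and the dimensions below are arbitrary choices of mine:

```python
# Finite-difference check of the chain rule (22) for h = f o g.
import numpy as np

rng = np.random.default_rng(5)
M = rng.standard_normal((3, 4))          # g: R^4 -> R^3
W = rng.standard_normal((2, 3))          # f: R^3 -> R^2
g = lambda x: np.tanh(M @ x)
f = lambda y: W @ (y ** 2)

def jac(fun, x, h=1e-6):
    f0 = fun(x)
    return np.column_stack([(fun(x + h * np.eye(len(x))[j]) - f0) / h
                            for j in range(len(x))])

x = rng.standard_normal(4)
lhs = jac(lambda z: f(g(z)), x)          # dh/dx'
rhs = jac(f, g(x)) @ jac(g, x)           # (df/dy')(dg/dx')
print(np.allclose(lhs, rhs, atol=1e-4))  # True
```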
2 Remarks on unvectorizing
The definition of the derivative of a matrix function with respect to a matrix given above was stated originally by Magnus and Neudecker (1999). It has some great advantages, but the vectorized arrangement is not always the handiest one; think, for example, of the case where we differentiate a matrix with respect to a scalar. Then the whole derivative could just as well be arranged as a matrix of the same dimensions as the original. Similarly, for the quadratic form x^{T}Ax we have
$$\frac{\partial x^{T}Ax}{\partial A^{T}}=\frac{\partial x^{T}Ax}{\partial\left(\operatorname{vec}A\right)^{T}}=x^{T}\otimes x^{T},$$
so that the result is an n² × 1 vector. But often it is nicer to define the result as the n × n matrix
$$\frac{\partial x^{T}Ax}{\partial A^{T}}=\frac{\partial x^{T}Ax}{\partial A}=xx^{T}.$$
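The two arrangements contain the same numbers: reshaping the n²-element vec-derivative column by column recovers xx'. A small demonstration with an x of my choosing:

```python
# 'Unvectorizing' the derivative of x'Ax with respect to vec A gives xx'.
import numpy as np

x = np.array([1.0, 2.0, 3.0])
n = x.size

flat = np.kron(x, x)                     # the n^2 entries of the vec-derivative
M = flat.reshape(n, n, order="F")        # back to an n x n matrix
print(np.allclose(M, np.outer(x, x)))    # True: the matrix is xx'
```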
Competing with it are the unvectorized definitions
$$\frac{\partial f(A)}{\partial A}\triangleq\begin{bmatrix}\frac{\partial f(A)}{\partial a_{11}}&\cdots&\frac{\partial f(A)}{\partial a_{m1}}\\\vdots&&\vdots\\\frac{\partial f(A)}{\partial a_{1n}}&\cdots&\frac{\partial f(A)}{\partial a_{mn}}\end{bmatrix}$$
and
$$\frac{\partial f(A)}{\partial A^{T}}=\left[\frac{\partial f(A)}{\partial A}\right]^{T},$$
where $a_{ij}$ is the element on the ith row and jth column of A; these are often nicer in practice. Consider, for example,
$$\frac{\partial\left(x^{T}\Omega^{-1}x\right)}{\partial\Omega^{T}}.$$
By Example 1.2 and the chain rule,
$$\frac{\partial\operatorname{vec}\left(x^{T}\Omega^{-1}x\right)}{\partial\left(\operatorname{vec}\Omega\right)^{T}}=-\left(x^{T}\otimes x^{T}\right)\left(\Omega^{-T}\otimes\Omega^{-1}\right),$$
so that, unvectorized,
$$\frac{\partial\left(x^{T}\Omega^{-1}x\right)}{\partial\Omega^{T}}=-\Omega^{-1}xx^{T}\Omega^{-1}.$$
Example 2.1 (Maximum likelihood estimation of the population mean and variance) Let µ be an unknown n × 1 column vector and let Ω be an unknown n × n matrix. Let the data be a random sample from the N(µ, Ω) distribution.
Using the 'unvectorized' definition of the matrix derivative and noting that, with this definition,
$$\frac{\partial\ln\left|\Omega\right|}{\partial\Omega^{T}}=\Omega^{-1},$$
the first-order conditions for µ and Ω are quick to derive. But such convenient formulas have to be collected case by case; there is no general product or chain rule for the definition of the matrix derivative given here. This is one of Magnus & Neudecker's reasons for advocating the definition I give in the previous section.
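The ln|Ω| formula can likewise be checked entrywise for a symmetric Ω; the test values are mine:

```python
# Entrywise check: for symmetric Omega, the derivatives of ln|Omega|
# arrange into Omega^-1.
import numpy as np

rng = np.random.default_rng(7)
n = 3
Om = rng.standard_normal((n, n))
Om = Om @ Om.T + n * np.eye(n)           # symmetric positive definite
f = lambda W: np.log(np.linalg.det(W))

h = 1e-6
D = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        E = np.zeros((n, n))
        E[i, j] = h
        D[i, j] = (f(Om + E) - f(Om)) / h

print(np.allclose(D, np.linalg.inv(Om), atol=1e-4))  # True
```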
Suppose f : Rⁿ → R is differentiable at a point x₀ of an open set S ⊆ Rⁿ. Then there is a function r such that
1. For all x ∈ S,
$$f(x)=f(x_{0})+\frac{\partial f(x_{0})}{\partial x^{T}}\left(x-x_{0}\right)+r\left(x-x_{0}\right).\tag{24}$$
2.
$$\lim_{h\to0}\frac{r(h)}{\left\|h\right\|}=0.\tag{25}$$
If f in addition has second partial derivatives at x₀, then there is a function r such that

1. For all x ∈ S,
$$f(x)=f(x_{0})+\frac{\partial f(x_{0})}{\partial x^{T}}\left(x-x_{0}\right)+\frac{1}{2}\left(x-x_{0}\right)^{T}\frac{\partial^{2}f(x_{0})}{\partial x\,\partial x^{T}}\left(x-x_{0}\right)+r\left(x-x_{0}\right).\tag{26}$$
2.
$$\lim_{h\to0}\frac{r(h)}{\left\|h\right\|^{2}}=0.\tag{27}$$
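The content of (26)-(27) is that the remainder shrinks faster than ‖h‖². A numerical illustration with a function, an expansion point, and a direction of my own choosing:

```python
# The second-order Taylor remainder r(h) satisfies r(h)/||h||^2 -> 0.
import numpy as np

f = lambda x: np.exp(x[0]) * np.cos(x[1])
x0 = np.array([0.5, 1.0])
grad = np.array([np.exp(0.5) * np.cos(1.0), -np.exp(0.5) * np.sin(1.0)])
hess = np.array([[np.exp(0.5) * np.cos(1.0), -np.exp(0.5) * np.sin(1.0)],
                 [-np.exp(0.5) * np.sin(1.0), -np.exp(0.5) * np.cos(1.0)]])

d = np.array([1.0, -2.0])                # a fixed direction
for t in [1e-1, 1e-2, 1e-3]:
    h = t * d
    taylor = f(x0) + grad @ h + 0.5 * h @ hess @ h
    r = f(x0 + h) - taylor
    print(t, r / np.dot(h, h))           # the ratios shrink roughly like t
```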
If f is infinitely differentiable, we may ask whether the Taylor series
$$f(x_{0})+\frac{\partial f(x_{0})}{\partial x^{T}}\left(x-x_{0}\right)+\frac{1}{2}\left(x-x_{0}\right)^{T}\frac{\partial^{2}f(x_{0})}{\partial x\,\partial x^{T}}\left(x-x_{0}\right)+\cdots\tag{28}$$
converges to f(x). This series may fail to converge, or it may converge to a number different from f(x). In other words, not all infinitely differentiable functions are represented by their Taylor series. For example, the function f : R → R with f(x) = e^{-1/x²} for x ≠ 0 and f(0) = 0 is infinitely differentiable with f⁽ⁿ⁾(0) = 0 for all n = 0, 1, 2, ..., and so the Taylor series around x₀ = 0 converges to zero everywhere, not to f(x).
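The point is that this f is flatter at the origin than any power of x, which is why every Taylor coefficient vanishes. Numerically (the exponent 20 and the sample points are my own choices):

```python
# e^{-1/x^2} vanishes faster than any power of x as x -> 0.
import numpy as np

f = lambda x: np.exp(-1.0 / x**2)
for x in [0.1, 0.05, 0.025]:
    print(x, f(x) / x**20)   # collapses toward zero (underflows to 0.0 at the end)
```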
3 Integrating over Rn
3.1 Definition
Within the Riemann theory, integrating over rectangles (and the n-dimensional boxes that generalize them) is straightforward.
Consider first the case n = 2: let E = {(x, y) : a₂ ≤ x ≤ b₂, a₁ ≤ y ≤ b₁}, let f be continuous on E, and define $\varphi(y)\triangleq\int_{a_{2}}^{b_{2}}f(x,y)\,dx$. Then
$$\int_{E}f(x)\,dx=\int_{a_{1}}^{b_{1}}\left[\int_{a_{2}}^{b_{2}}f(x,y)\,dx\right]dy=\int_{a_{1}}^{b_{1}}\varphi(y)\,dy.\tag{29}$$
Happily, the order of integration does not matter under our assumptions. Let f be a continuous function. Then
$$\int_{a_{1}}^{b_{1}}\left[\int_{a_{2}}^{b_{2}}f(x,y)\,dx\right]dy=\int_{E}f(x)\,dx=\int_{a_{2}}^{b_{2}}\left[\int_{a_{1}}^{b_{1}}f(x,y)\,dy\right]dx.\tag{30}$$
Remark 3.1 If you think this is a surprising result, recall that integrals are just sums, and sums (avoiding pathologies where infinity is involved) are the same regardless of the order in which we add the terms.
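Since the iterated integrals really are iterated sums in the limit, a grid approximation shows the two orders agreeing; the integrand and the bounds below are my own example:

```python
# Iterated Riemann sums over a rectangle agree in either order, cf. (30).
import numpy as np

a1, b1 = 0.0, 1.0                        # bounds for y
a2, b2 = 0.0, 2.0                        # bounds for x
N = 500
dy, dx = (b1 - a1) / N, (b2 - a2) / N
y = a1 + (np.arange(N) + 0.5) * dy       # midpoint grids
x = a2 + (np.arange(N) + 0.5) * dx
X, Y = np.meshgrid(x, y)                 # F[i, j] = f(x_j, y_i)
F = np.exp(-X * Y) * np.cos(X + Y)

phi = F.sum(axis=1) * dx                 # phi(y) ~ inner integral over x
I_xy = phi.sum() * dy                    # then integrate over y
psi = F.sum(axis=0) * dy                 # or integrate over y first
I_yx = psi.sum() * dx
print(I_xy, I_yx)                        # the two orders agree
```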
3.2 Change of variables
For f : X → R, the set
$$S_{f}=\left\{x\in X:f(x)\neq0\right\}\tag{31}$$
is called the support of f.
Theorem (change of variables) Let T be a one-to-one, continuously differentiable mapping of an open set E ⊂ Rⁿ into Rⁿ such that the Jacobian J_T(x) ≠ 0 for all x ∈ E. Let f be a continuous function from Rⁿ into R whose support is compact and lies in T(E). Then
$$\int_{\mathbb{R}^{n}}f(y)\,dy=\int_{\mathbb{R}^{n}}f\left(T(x)\right)\left|J_{T}(x)\right|\,dx.\tag{32}$$
Remark 3.2 The reason for having |J_T(x)| instead of J_T(x) is that, with the definition of the integral used in this section, we integrate over subsets of Rⁿ without regard for their orientation. For example, in the scalar case, we consider $\int_{a}^{b}f(x)\,dx$ and $\int_{b}^{a}f(x)\,dx$ to be the same. Given that these are defined to be the same, we must take steps to ensure that, say, the change of variables T(x) = -x makes no difference, and that is guaranteed by taking the absolute value of the Jacobian.
Example 3.1 (from Econometrics II; calculating the volume of a cylinder) We integrate the function f that equals a constant c on the disk x² + y² ≤ k² and zero outside it, and it turns out to be convenient to use the change of variables approach, noting with satisfaction that f has compact support. Looking at the picture, it seems that polar coordinates are the natural choice, so we take E = (0, ∞) × (0, 2π) and define T on E via
$$T(r,\theta)=\begin{bmatrix}r\cos\theta\\r\sin\theta\end{bmatrix}.\tag{36}$$
Apparently the Jacobian is
$$J_{T}(r,\theta)=\det\begin{bmatrix}\cos\theta&-r\sin\theta\\\sin\theta&r\cos\theta\end{bmatrix}=r\tag{37}$$
and
c if 0 ≤ r ≤ k and 0 ≤ θ ≤ 2π
f (T (r, θ)) = (38)
0 otherwise.
Hence
$$\int_{\mathbb{R}^{2}}f(x,y)\,d(x,y)=\int_{0}^{2\pi}\int_{0}^{k}cr\,dr\,d\theta=ck^{2}\pi.\tag{39}$$
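As a sanity check, a crude Riemann sum over a Cartesian grid approaches ck²π; c, k, and the grid size below are my own choices:

```python
# Riemann-sum check of (39): the volume of the cylinder is c*k^2*pi.
import numpy as np

c, k, N = 2.0, 1.5, 2000
xs = np.linspace(-k, k, N)
dx = xs[1] - xs[0]
X, Y = np.meshgrid(xs, xs)
F = np.where(X**2 + Y**2 <= k**2, c, 0.0)  # f = c on the disk, 0 outside

print(F.sum() * dx * dx)                   # ~ 14.137
print(c * k**2 * np.pi)                    # 14.1372...
```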
References
Magnus, J. and H. Neudecker (1999). Matrix Differential Calculus With Appli-
cations in Statistics and Econometrics. John Wiley and Sons.