Math For Data Science
Omar Hijab*
Copyright ©2022 — 2024 Omar Hijab. All Rights Reserved.
Preface
For Python coding, the standard data science libraries are used. A Python
index lists the Python functions used in the text.
Sections and figures are numbered sequentially within each chapter, and
equations are numbered sequentially within each section, so §3.3 is the third
section in the third chapter, Figure 7.11 is the eleventh figure in the seventh
chapter, and (3.2.1) is the first equation in the second section of the third
chapter.
⋆ under construction ⋆
Contents

Preface
1 Data Sets
1.1 Introduction
1.2 The MNIST Dataset
1.3 Averages and Vector Spaces
1.4 Two Dimensions
1.5 Complex Numbers
1.6 Mean and Covariance
1.7 High Dimensions
2 Linear Geometry
2.1 Vectors and Matrices
2.2 Products
2.3 Matrix Inverse
2.4 Span and Linear Independence
2.5 Zero Variance Directions
2.6 Pseudo-Inverse
2.7 Projections
2.8 Basis
2.9 Rank
4 Counting
4.1 Permutations and Combinations
4.2 Graphs
4.3 Binomial Theorem
4.4 Exponential Function
5 Probability
5.1 Binomial Probability
5.2 Probability
5.3 Random Variables
5.4 Normal Distribution
5.5 Chi-squared Distribution
6 Statistics
6.1 Estimation
6.2 Z-test
6.3 T-test
6.4 Two Means
6.5 Variances
6.6 Maximum Likelihood Estimates
6.7 Chi-Squared Tests
7 Calculus
7.1 Calculus
7.2 Entropy and Information
7.3 Multi-variable Calculus
7.4 Back Propagation
7.5 Convex Functions
7.6 Multinomial Probability
A Appendices
A.1 SQL
A.2 Minimizing Sequences
References
Python
Index
Chapter 1
Data Sets
In this chapter we explore examples of data sets and some simple Python
code. We also review the geometry of vectors in the plane and properties of
2 × 2 matrices, introduce the mean and covariance of a dataset, then present
a first taste of what higher dimensions might look like.
1.1 Introduction
Geometrically, a dataset is a sample of N points x1 , x2 , . . . , xN in d-
dimensional space Rd . Algebraically, a dataset is an N × d matrix.
Practically speaking, as we shall see, the following are all representations
of datasets
matrix = CSV file = spreadsheet = SQL table = array = dataframe
Each point x = (t1 , t2 , . . . , td ) in the dataset is a sample or an example,
and the components t1 , t2 , . . . , td of a sample point x are its features or
attributes. As such, d-dimensional space Rd is feature space.
Sometimes one of the features is separated out as the label. In this case,
the dataset is a labelled dataset.
As examples, we look at two datasets, the Iris dataset and the MNIST
dataset. The Iris dataset contains 150 examples of four features of Iris flowers,
and there are three classes of Irises, Setosa, Versicolor, and Virginica, with
50 samples from each class.
The four features are sepal length and width, and petal length and width
(Figure 1.1). For each example, the class is the label corresponding to that
example, so the Iris dataset is labelled.
iris = datasets.load_iris(as_frame=True)
dataset = iris["frame"]
dataset
The MNIST dataset consists of 60,000 images of hand-written digits (Fig-
ure 1.2). There are 10 classes of images, corresponding to each digit 0, 1,
. . . , 9. We seek to compress the images while preserving as much as possible
of the images’ characteristics.
Each image is a grayscale 28 × 28 pixel image. Since 28² = 784, each image
is a point in d = 784 dimensions. Here there are N = 60000 samples and
d = 784 features.
This subsection is included just to give a flavor. All unfamiliar words are
explained in detail in Chapter 2. If preferred, just skip to the next subsection.
Suppose we have a dataset of N points
x1 , x2 , . . . , xN
If this is your first exposure to data science, there will be a learning curve,
because here there are three kinds of thinking: Data science (Datasets, PCA,
descent, networks), math (linear algebra, probability, statistics, calculus),
and Python (numpy, pandas, scipy, sympy, matplotlib). It may help to
read the code examples and the important math principles first,
then dive into details as needed.
To illustrate and make concrete concepts as they are introduced, we use
Python code throughout. We run Python code in a Jupyter notebook.
Jupyter is an IDE, an integrated development environment. Jupyter sup-
ports many frameworks, including Python, Sage, Julia, and R. A useful
Jupyter feature is the ability to measure the amount of execution time of
a code cell by including at the start of the cell
%%time
(This code requires keras, tensorflow and related modules if not already
installed.)
The National Institute of Standards and Technology (NIST) is a physical sciences laboratory and non-regulatory agency of the United States Department of Commerce.
Since this dataset is for demonstration purposes, these images are coarse.
Since each image consists of 784 pixels, and each pixel shading is a number,
each image is a point x in Rd = R784 .
Figure 1.4: Original and projections: n = 784, 600, 350, 150, 50, 10, 1.
For the second image in Figure 1.2, reducing dimension from d = 784 to
n equal to 600, 350, 150, 50, 10, and 1, we have the images in Figure 1.4.
Compressing each image to a point in n = 3 dimensions and plotting all
N = 60000 points yields Figure 1.5. All this is discussed in §3.4.
The top left image in Figure 1.4 is given by a 784-dimensional point which
is imported as an array pixels of shape (28,28).
pixels = train_X[1]
grid()
scatter(2,3)
show()
3. Do for loops over i and j in range(28) and use scatter to plot points
at location (i,j) with size given by pixels[i,j], then show.
pixels = train_X[1]
grid()
for i in range(28):
    for j in range(28): scatter(i,j, s = pixels[i,j])
show()
imshow(pixels, cmap="gray_r")
We end the section by discussing the Python import command. The last
code snippet can be rewritten
plt.imshow(pixels, cmap="gray_r")
or as
imshow(pixels, cmap="gray_r")
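As a sketch of the two conventions (the exact imports used in the text are assumed here), the qualified style attaches the module name plt to each call, while the wildcard style makes the function names available directly:

# qualified import: functions are called with the plt prefix
import matplotlib.pyplot as plt
plt.imshow(pixels, cmap="gray_r")
plt.show()

# wildcard import: functions are called without a prefix
from matplotlib.pyplot import *
imshow(pixels, cmap="gray_r")
show()

Here pixels is the 28 × 28 array from the previous snippet.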
1.3 Averages and Vector Spaces

L = [x_1,x_2,...,x_N].
The totality of possible samples is called the population or the sample space. For example, the
sample space consists of all real numbers and we take N = 5 samples from
this population
Or, the sample space consists of all integers and we take N = 5 samples from
this population
Or, the sample space consists of all rational numbers and we take N = 5
samples from this population
Or, the sample space consists of all Python strings and we take N = 5
samples from this population
L_4 = ['a2e?','#%T','7y5,','kkk>><</','[[)*+']
Or, the sample space consists of all HTML colors and we take N = 5 samples
from this population
def hexcolor():
    # choice here is assumed to be random.choice from Python's standard library
    return "#" + ''.join([choice('0123456789abcdef') for _ in range(6)])
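The list of five sampled colors is not shown in this excerpt. A minimal sketch, with the hypothetical name L_5 for the sample:

from random import choice

# hypothetical sample of N = 5 HTML colors
L_5 = [hexcolor() for _ in range(5)]
print(L_5)    # e.g. five strings like '#3fa2c7'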
we call the members of the population “vectors”, even though the members
may be anything, as long as they satisfy the basic rules of a vector space.
In a vector space V , the rules are:
8. 1v = v and 0v = 0
9. r(sv) = (rs)v.
A vector is an arrow joining two points (Figure 1.8). Given two points
m = (a, b) and x = (c, d), the vector joining them is
v = x − m = (c − a, d − b).
Then m is the tail of v, and x is the head of v. For example, the vector
joining m = (1, 2) to x = (3, 4) is v = (2, 2).
Given a point x, we would like to associate to it a vector v in a uniform
manner. However, this cannot be done without a second point, a reference
point. Given a dataset of points x1 , x2 , . . . , xN , the most convenient choice
for the reference point is the mean m of the dataset. This results in a dataset
of vectors v1 , v2 , . . . , vN , where vk = xk − m, k = 1, 2, . . . , N .
The dataset v1, v2, . . . , vN is centered: its mean is zero,

(v1 + v2 + · · · + vN)/N = 0.
So datasets can be points x1 , x2 , . . . , xN with mean m, or vectors v1 , v2 , . . . ,
vN with mean zero (Figure 1.9). This distinction is crucial, when measuring
the dimension of a dataset (§2.8).
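A quick numerical check of centering (a sketch, with the numpy imports written out explicitly):

from numpy import array, mean, allclose
from numpy.random import random

N = 20
dataset = array([ [random(), random()] for _ in range(N) ])
m = mean(dataset, axis=0)
vectors = dataset - m                          # centered dataset
print(allclose(mean(vectors, axis=0), 0))      # True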
Figure 1.9: A dataset as points x1, x2, . . . , xN around the mean m, and as centered vectors v1 = x1 − m, v2 = x2 − m, . . . , vN = xN − m.
Figure: A statistic f assigns to each item in the sample space a value in V.
Usually, we can’t take sample means from a population, we instead take
the sample mean of a statistic associated to the population. A statistic is
an assignment of a number f (item) to each item in the population. For
example, the human population on Earth is not a vector space (they can’t
be added), but their heights is a vector space (heights can be added). For the
list L4 , a statistic might be the length of the string. For the HTML colors, a
statistic is the HTML code of the color.
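For instance, here is a sketch of the sample mean of the string-length statistic on L4 (plain Python, no libraries needed):

L_4 = ['a2e?','#%T','7y5,','kkk>><</','[[)*+']
lengths = [len(s) for s in L_4]            # the statistic f = string length
sample_mean = sum(lengths) / len(lengths)
print(lengths, sample_mean)                # [4, 3, 4, 8, 5] 4.8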
dataset = array([1.23,4.29,-3.3,555])
mean(dataset)            # mean of a 1-d array
mean(dataset, axis=0)    # same as above: axis 0 is the only axis
mean(dataset, axis=1)    # raises an error: a 1-d array has no axis 1
N = 20
dataset = array([ [random(), random()] for _ in range(N) ])
mean = mean(dataset,axis=0)
grid()
X = dataset[:,0]
Y = dataset[:,1]
scatter(X,Y)
scatter(*mean)
show()
In this code, scatter expects two positional arguments, the x and the y components of a point, or two lists of x and y components separately. The unpacking operator * unpacks mean from a single pair into its separate x and y components, as *mean. Also, for scatter, dataset is separated into its two columns.
In the cartesian plane, each vector v has a shadow. This is the triangle
constructed by dropping the perpendicular from the tip of v to the x-axis, as
in Figure 1.13. This cannot be done unless one first draws a horizontal line
(the x-axis), then a vertical line (the y-axis). In this manner, each vector v
has cartesian coordinates v = (x, y). In particular, the vector 0 = (0, 0), the
zero vector, corresponds to the origin.
In the cartesian plane, vectors v1 = (x1 , y1 ) and v2 = (x2 , y2 ) are added
by adding their coordinates,
Addition of vectors
If v1 = (x1 , y1 ) and v2 = (x2 , y2 ), then
v1 + v2 = (x1 + x2 , y1 + y2 ). (1.4.1)
Because points and vectors are interchangeable, the same formula is used
for addition P + P ′ of points P and P ′ .
v1 = (1,2)
v2 = (3,4)
v1 + v2 == (1+3,2+4) # returns False
v1 = [1,2]
v2 = [3,4]
v1 + v2 == [1+3,2+4] # returns False
v1 = array([1,2])
v2 = array([3,4])
v1 + v2 == array([1+3,2+4]) # returns True
v = array([1,2])
3*v == array([3,6]) # returns True
Scaling of vectors

If v = (x, y) and t is a scalar, then tv = (tx, ty).

Thus multiplying v by s, and then multiplying the result by t, has the same effect as multiplying v by ts, in a single step. Because points and vectors are interchangeable, the same formula is used for scaling points P by t, written tP.
v1 − v2 = v1 + (−v2 ).
This gives
Subtraction of vectors
If v1 = (x1 , y1 ) and v2 = (x2 , y2 ), then
v1 − v2 = (x1 − x2 , y1 − y2 ) (1.4.2)
v1 = array([1,2])
v2 = array([3,4])
v1 - v2 == array([1-3,2-4]) # returns True
Distance Formula
If v1 = (x1, y1) and v2 = (x2, y2), then the distance between v1 and v2 is

|v1 − v2| = √((x1 − x2)² + (y1 − y2)²).
Figure 1.16: Polar coordinates: the point (x, y) at distance r from the origin 0 and at angle θ from the x-axis.
In Python,
v = array([1,2])
norm(v) == sqrt(5)   # returns True
The unit circle consists of the vectors which are distance 1 from the origin
0. When v is on the unit circle, the magnitude of v is 1, and we say v is a
unit vector. In this case, the line formed by the scalings of v intersects the
unit circle at ±v (Figure 1.17).
When v is a unit vector, r = 1, and (Figure 1.16),
v = (x, y) = (cos θ, sin θ). (1.4.3)
The unit circle intersects the horizontal axis at the vectors (1, 0), and
(−1, 0), and intersects the vertical axis at the vectors (0, 1), and (0, −1).
These four vectors are equally spaced on the unit circle (Figure 1.17).
Figure 1.17: A unit vector v and its negative −v, where the line of scalings of v intersects the unit circle.
More generally, any circle with center (a, b) and radius r consists of vectors
v = (x, y) satisfying
(x − a)² + (y − b)² = r².
Let R be a point on the unit circle, and let t > 0. From this, we see the scaled point tR is on the circle with center (0, 0) and radius t. Moreover, if Q is any point, Q + tR is on the circle with center Q and radius t.
Given this, it is easy to check that v2 − v1 is the vector joining the tip of v1 to the tip of v2 (Figure 1.18).

Figure 1.18: The vectors v1, v2, and v2 − v1.
Now we discuss the dot product in two dimensions. We have two vectors v1 = (x1, y1) and v2 = (x2, y2) in the plane R2. The dot product of v1 and v2 is given algebraically as

v1 · v2 = x1 x2 + y1 y2 ,
or geometrically as
v1 · v2 = |v1 | |v2 | cos θ,
where θ is the angle between v1 and v2 . To show that these are the same, below we derive the dot product identity (1.4.4).
v1 = array([1,2])
v2 = array([3,4])
dot(v1,v2) == 1*3 + 2*4 # returns True
As a consequence of the dot product identity, we have code for the angle
between two vectors,
def angle(u,v):
    a = dot(u,v)
    b = dot(u,u)
    c = dot(v,v)
    theta = arccos(a / sqrt(b*c))
    return degrees(theta)
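For example, a sketch of calling angle (assuming the numpy names used inside it — dot, arccos, sqrt, degrees — are imported in the notebook):

from numpy import array, dot, arccos, sqrt, degrees

u = array([1,0])
v = array([1,1])
print(angle(u,v))   # approximately 45.0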
Cauchy-Schwarz Inequality

If u and v are any two vectors, then |u · v| ≤ |u| |v|.
a² = d² + f²  and  c² = e² + f².

Also b = e + d, so

b² = (e + d)² = e² + 2ed + d².

Then

c² = e² + f² = (b − d)² + f²
   = f² + d² + b² − 2db
   = a² + b² − 2ab cos θ,

since d² + f² = a² and d = a cos θ, so we get (1.4.6).
Figure 1.19: A triangle with sides a, b, c; the altitude f splits the base b into segments d and e.
Next, connect Figures 1.18 and 1.19 by noting a = |v2 | and b = |v1 | and
c = |v2 − v1 |.
Now go back to deriving (1.4.4). By vector addition, we have
v2 − v1 = (x2 − x1 , y2 − y1 ),
thus
c² = a² + b² − 2(x1 x2 + y1 y2 ). (1.4.7)
Comparing the terms in (1.4.6) and (1.4.7), we arrive at (1.4.4).
v · v ⊥ = (x, y) · (−y, x) = 0.
Figure 1.21: The perpendicular vectors ±v⊥ and the points ±P⊥ on the unit circle.
From Figure 1.21, we see points P and P ′ on the unit circle satisfy P ·P ′ =
0 iff P ′ = ±P ⊥ .
ax + by = 0, cx + dy = 0. (1.4.9)
In (1.4.9), multiply the first equation by d and the second by b and subtract, obtaining

(ad − bc)x = d(ax + by) − b(cx + dy) = 0.

In (1.4.9), multiply the first equation by c and the second by a and subtract, obtaining

(bc − ad)y = c(ax + by) − a(cx + dy) = 0.
From here, we see there are two cases: det(A) = 0 and det(A) ≠ 0. When det(A) ≠ 0, the only solution of (1.4.9) is (x, y) = (0, 0). When det(A) = 0, (x, y) = (−b, a) is a solution of both equations in (1.4.9). We have shown
Homogeneous System
Inhomogeneous System
AA′ = \begin{pmatrix} u · u′ & u · v′ \\ u′ · v & u′ · v′ \end{pmatrix}.
U(θ)U(θ′) = \begin{pmatrix} cos θ & −sin θ \\ sin θ & cos θ \end{pmatrix} \begin{pmatrix} cos θ′ & −sin θ′ \\ sin θ′ & cos θ′ \end{pmatrix} = \begin{pmatrix} cos(θ + θ′) & −sin(θ + θ′) \\ sin(θ + θ′) & cos(θ + θ′) \end{pmatrix} = U(θ + θ′).
AA−1 = A−1 A = I.
(AB)−1 = B −1 A−1 .
(AB)t = B t At .
Av = w,
where
A = \begin{pmatrix} a & b \\ c & d \end{pmatrix}, v = \begin{pmatrix} x \\ y \end{pmatrix}, w = \begin{pmatrix} e \\ f \end{pmatrix}.
v = A−1 w,
where A−1 is the inverse matrix. We study inverse matrices in depth in §2.3.
The matrix (1.4.10) is symmetric if b = c. A symmetric matrix looks like

Q = \begin{pmatrix} a & b \\ b & c \end{pmatrix},

and satisfies Qt = Q.
Orthogonal Matrices
Here we wrote u ⊗ v as a single block, and also in terms of rows and columns.
If we do this the other way, we get
v ⊗ u = \begin{pmatrix} ca & cb \\ da & db \end{pmatrix},
so
(u ⊗ v)t = v ⊗ u.
When u = v, u ⊗ v = v ⊗ v is a symmetric matrix.
Here is code for tensor.
def tensor(u,v):
    return array([ [ a*b for b in v] for a in u ])
det(u ⊗ v) = 0.
This is true no matter what the vectors u and v are. Check this yourself.
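One way to check this numerically (a sketch, using the tensor function above and numpy's det):

from numpy import array
from numpy.linalg import det

u = array([1,2])
v = array([3,4])
print(det(tensor(u,v)))   # 0.0, up to roundoff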
Notice by definition of u ⊗ v,
Quadratic Form

If

Q = \begin{pmatrix} a & b \\ b & c \end{pmatrix}  and  v = (x, y),

then

v · Qv = ax² + 2bxy + cy².
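A quick numerical check of the quadratic form identity (a sketch; the numbers are arbitrary):

from numpy import array, dot

a, b, c = 2.0, 1.0, 3.0
Q = array([[a,b],[b,c]])
x, y = 1.5, -0.5
v = array([x,y])
print(dot(v, dot(Q,v)), a*x**2 + 2*b*x*y + c*y**2)   # both 3.75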
This ability of points in the plane to follow the usual rules of arithmetic is
unique to one and two dimensions, and not present in any other dimension.
When thought of in this manner, points in the plane are called complex
numbers, and the plane is the complex plane.
P′′ = P P′ = (xx′ − yy′, x′y + xy′),
P′′ = P/P′ = (xx′ + yy′, x′y − xy′).    (1.5.1)
P P ′ = (xx′ − yy ′ , x′ y + xy ′ ), (1.5.3)
Because of this, we can write z = x instead of z = (x, 0) for points on the horizontal axis, and we call the horizontal axis the real axis.
P P̄′ is the hermitian product of P and P′.
Similarly, let i = (0, 1). Then the point i is on the vertical axis, and, using (1.5.1), one can check that ix = (0, 1)(x, 0) = (0, x).

Thus the vertical axis consists of all points of the form ix. These are called
imaginary numbers, and the vertical axis is the imaginary axis.
Using i, any point P = (x, y) may be written
P = x + iy,
since x + iy = (x, 0) + (y, 0)(0, 1) = (x, 0) + (0, y) = (x, y). This leads to
Figure 1.23. In this way, real numbers x are considered complex numbers
with zero imaginary part, x = x + 0i.
Figure 1.23: The complex number 3 + 2i plotted in the complex plane.
Square Root of −1

i² = (0, 1)(0, 1) = (−1, 0) = −1.

In terms of i, division is given by

z/z′ = (x + iy)/(x′ + iy′) = ((xx′ + yy′) + i(x′y − xy′))/(x′² + y′²).
In particular, one can always “move” the i from the denominator by the
formula
1/z = 1/(x + iy) = (x − iy)/(x² + y²) = z̄/|z|².
Here x² + y² = r² = |z|² is the absolute value squared of z, and z̄ is the conjugate of z.
If (r, θ) and (r′ , θ′ ) are the polar coordinates of complex numbers P and
P ′ , and (r′′ , θ′′ ) are the polar coordinates of the product P ′′ = P P ′ ,
then
r′′ = rr′ and θ′′ = θ + θ′ .
From this and (1.5.1), using (x, y) = (cos θ, sin θ), (x′ , y ′ ) = (cos θ′ , sin θ′ ),
we have the addition formulas
sin(θ + θ′ ) = sin θ cos θ′ + cos θ sin θ′ ,
(1.5.5)
cos(θ + θ′ ) = cos θ cos θ′ − sin θ sin θ′ .
We will need the roots of unity in §3.2. This generalizes square roots,
cube roots, etc.
A point ω is a root of unity if ω d = 1 for some power d. If d is the power,
we say ω is a d-th root of unity.
For example, the square roots of unity are ±1, since (±1)² = 1. Here we have
1 = cos 0 + i sin 0, −1 = cos π + i sin π.
The fourth roots of unity are ±1, ±i, since (±1)⁴ = 1, (±i)⁴ = 1. Here we have
1 = cos 0 + i sin 0,
i = cos(π/2) + i sin(π/2),
−1 = cos π + i sin π,
−i = cos(3π/2) + i sin(3π/2).
Figure: The d-th roots of unity for d = 2, 3, 4 (panels ω² = 1, ω³ = 1, ω⁴ = 1).
If ω^d = 1, then

(ω^k)^d = (ω^d)^k = 1^k = 1.
With ω given by (1.5.6), this implies

1, ω, ω², . . . , ω^(d−1)

are all d-th roots of unity. For example, for the cube roots of unity,

1³ = 1, ω³ = 1, (ω²)³ = 1.
Figure: The d-th roots of unity for d = 5, 6, 15 (panels ω⁵ = 1, ω⁶ = 1, ω¹⁵ = 1).
Summarizing,
Roots of Unity
If
ω = cos(2π/d) + i sin(2π/d),
the d-th roots of unity are
1, ω, ω 2 , . . . , ω d−1 .
ω k = cos(2πk/d) + i sin(2πk/d), k = 0, 1, 2, . . . , d − 1.
Here is sympy code for the roots of unity. We use display instead of
print to pretty-print the output.
x = symbols('x')
d = 5
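The rest of the snippet is not shown in this excerpt. One plausible completion (a sketch, using sympy's solve; the loop over the roots is an assumption):

from sympy import symbols, solve, init_printing
from IPython.display import display

init_printing()
x = symbols('x')
d = 5
# the d-th roots of unity are the roots of x**d - 1
for root in solve(x**d - 1, x):
    display(root)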
import numpy as np
np.roots([a,b,c])
Since the cube roots of unity are the roots of the polynomial p(x) = x³ − 1,
the code
import numpy as np
np.roots([1,0,0,-1])
1.6 Mean and Covariance

Above |x| stands for the length of the vector x, or the distance of the point x to the origin. When d = 2 and we are in two dimensions, this was defined in §1.4. For general d, this is defined in §2.1. In this section we continue to focus on two dimensions d = 2.
The mean or sample mean is

m = \frac{1}{N} \sum_{k=1}^N x_k = \frac{x_1 + x_2 + \cdots + x_N}{N}.
The mean m is a point in feature space. The first result is
Point of Best-fit
The mean is the point of best-fit: The mean minimizes the mean-square
distance to the dataset (Figure 1.26).
Figure 1.26: MSD for the mean (green) versus MSD for a random point (red).
Using (1.4.8),

|a + b|² = |a|² + 2a · b + |b|²

for vectors a and b, it is easy to derive the above result. Insert a = xk − m and b = m − x to get

MSD(x) = MSD(m) + \frac{2}{N} \sum_{k=1}^N (x_k − m) · (m − x) + |m − x|².
Since the centered vectors xk − m sum to zero, the middle term vanishes, so we have MSD(x) = MSD(m) + |m − x|² ≥ MSD(m).
N = 20
dataset = array([ [random(),random()] for _ in range(N) ])
m = mean(dataset,axis=0)
p = array([random(),random()])
grid()
X = dataset[:,0]
Y = dataset[:,1]
scatter(X,Y)
for v in dataset:
    plot([m[0],v[0]],[m[1],v[1]],c='green')
    plot([p[0],v[0]],[p[1],v[1]],c='red')
show()
def tensor(u,v):
    return array([ [ a*b for b in v] for a in u ])
N = 20
dataset = array([ [random(),random()] for _ in range(N) ])
m = mean(dataset,axis=0)
# center dataset
vectors = dataset - m
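The covariance matrix can then be assembled from the centered vectors with the tensor function above (a sketch; it averages v ⊗ v over the dataset, which is the biased covariance of (1.6.1)):

# biased covariance: average of v (tensor) v over the centered vectors
Q = sum([ tensor(v,v) for v in vectors ]) / N
print(Q)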
Since

(±4, ±4) ⊗ (±4, ±4) = \begin{pmatrix} 16 & 16 \\ 16 & 16 \end{pmatrix}, (±2, ±2) ⊗ (±2, ±2) = \begin{pmatrix} 4 & 4 \\ 4 & 4 \end{pmatrix}, (0, 0) ⊗ (0, 0) = \begin{pmatrix} 0 & 0 \\ 0 & 0 \end{pmatrix},
Notice

Q = 8 (1, 1) ⊗ (1, 1),

which, as we see below (§2.5), reflects the fact that the points of this dataset lie on a line. Here the line is y = x + 1.
The covariance matrix as written in (1.6.1) is the biased covariance matrix.
If the denominator is instead N − 1, the matrix is the unbiased covariance
matrix.
For datasets with large N , it doesn’t matter, since N and N − 1 are
almost equal. For simplicity, here we divide by N , and we only consider the
biased covariance matrix.
In practice, datasets are standardized before computing their covariance.
The covariance of standardized datasets — the correlation matrix — is the
same whether one starts with bias or not (§2.2).
In numpy, the Python covariance constructor is
N = 20
dataset = array([ [random(),random()] for _ in range(N) ])
Q = cov(dataset,bias=True,rowvar=False)
This returns the same result as the previous code for Q. Notice here there is no need to compute the mean; this is taken care of automatically. The same result is also returned by

Q = cov(dataset.T,bias=True)
We call (1.6.3) the total variance of the dataset. Thus the total variance
equals MSD(m).
In Python, the total variance is
Q = cov(dataset.T,bias=True)
Q.trace()
proj_u v = (v · u) u.

Figure: The projection proj_u v of the vector v onto the line through the unit vector u.
Because the reduced dataset and projected dataset are essentially the
same, we also refer to q as the variance of the projected dataset. Thus we
conclude (see §1.4 for v · Qv)
This shows that the dataset lies on the line passing through m and perpen-
dicular to (1, −1).
u · Qu = ax² + 2bxy + cy² = 1.
The covariance ellipse and inverse covariance ellipses described above are
centered at the origin (0, 0). When a dataset has mean m and covariance Q,
the ellipses are drawn centered at m, as in Figures 1.30, 1.31, and 1.32.
Here is the code for Figure 1.28. The ellipses drawn here are centered at
the origin.
L, delta = 4, .1
x = arange(-L,L,delta)
y = arange(-L,L,delta)
X,Y = meshgrid(x, y)
a, b, c = 9, 0, 4
det = a*c - b**2
A, B, C = c/det, -b/det, a/det
def ellipse(a,b,c,levels,color):
    contour(X,Y,a*X**2 + 2*b*X*Y + c*Y**2,levels,colors=color)
grid()
ellipse(a,b,c,[1],'blue')
ellipse(A,B,C,[1],'red')
show()
x1 , x2 , . . . , xN , and y1 , y2 , . . . , yN .
Suppose the mean of this dataset is m = (mx , my ). Then, by the formula for
tensor product, the covariance matrix is
Q = \begin{pmatrix} a & b \\ b & c \end{pmatrix},

where

a = \frac{1}{N} \sum_{k=1}^N (x_k − m_x)^2, b = \frac{1}{N} \sum_{k=1}^N (x_k − m_x)(y_k − m_y), c = \frac{1}{N} \sum_{k=1}^N (y_k − m_y)^2.
From this, we see a is the variance of the x-features, and c is the variance
of y-features. We also see b is a measure of the correlation between the x
and y features.
Standardizing the dataset means to center the dataset and to place the x
and y features on the same scale. For example, the x-features may be close
and

y_1, y_2, . . . , y_N → y′_1 = \frac{y_1 − m_y}{\sqrt c}, y′_2 = \frac{y_2 − m_y}{\sqrt c}, . . . , y′_N = \frac{y_N − m_y}{\sqrt c}.
This results in a new dataset v_1 = (x′_1, y′_1), v_2 = (x′_2, y′_2), . . . , v_N = (x′_N, y′_N) that is centered,

(v_1 + v_2 + · · · + v_N)/N = 0,

with each feature standardized to have unit variance,

\frac{1}{N} \sum_{k=1}^N (x′_k)^2 = 1, \frac{1}{N} \sum_{k=1}^N (y′_k)^2 = 1.
For example,

Q = \begin{pmatrix} 9 & 2 \\ 2 & 4 \end{pmatrix}  ⟹  ρ = \frac{b}{\sqrt{ac}} = \frac{1}{3}  ⟹  Q′ = \begin{pmatrix} 1 & 1/3 \\ 1/3 & 1 \end{pmatrix}.
corrcoef(dataset.T)
Here again, we input the transpose of the dataset if our default is vectors
as rows. Notice the 1/N cancels in the definition of ρ. Because of this,
corrcoef is the same whether we deal with biased or unbiased covariance
matrices.
u · Qu = max_{|v|=1} v · Qv.
Since the sine function varies between +1 and −1, we conclude the projected
variance varies between
1 − ρ ≤ v · Qv ≤ 1 + ρ,
and

θ = π/4, v+ = (1/√2, 1/√2)  ⟹  v+ · Qv+ = 1 + ρ,

θ = 3π/4, v− = (−1/√2, 1/√2)  ⟹  v− · Qv− = 1 − ρ.
Thus the best-aligned vector v+ is at 45°, and the worst-aligned vector v− is at 135° (Figure 1.29).
Actually, the above is correct only if ρ > 0. When ρ < 0, it’s the other
way. The correct answer is
1 − |ρ| ≤ v · Qv ≤ 1 + |ρ|,
Here are two randomly generated datasets. For the dataset in Figure 1.30, the mean and covariance are

m = (0.46563359, 0.59153958), Q = \begin{pmatrix} 0.09652275 & 0.00939796 \\ 0.00939796 & 0.0674424 \end{pmatrix}.

For the dataset in Figure 1.31, the mean and covariance are

m = (0.48785572, 0.51945499), Q = \begin{pmatrix} 0.08266583 & −0.00976249 \\ −0.00976249 & 0.08298294 \end{pmatrix}.
Here is code for Figures 1.30, 1.31, and 1.32. The code incorporates the
formulas for λ± and v± .
N = 50
X = array([ random() for _ in range(N) ])
Y = array([ random() for _ in range(N) ])
scatter(X,Y,s=2)
m = mean([X,Y],axis=1)
Q = cov(X,Y,bias=True)
a, b, c = Q[0,0], Q[0,1], Q[1,1]
delta = .01
x = arange(0,1,delta)
y = arange(0,1,delta)
X,Y = meshgrid(x, y)
def ellipse(a,b,c,d,e,levels,color):
    det = a*c - b**2
    A, B, C = c/det, -b/det, a/det
    # inverse covariance ellipse centered at (d,e)
    Z = A*(X-d)**2 + 2*B*(X-d)*(Y-e) + C*(Y-e)**2
    contour(X,Y,Z,levels,colors=color)
    for pm in [+1,-1]:
        lamda = (a+c)/2 + pm * sqrt(b**2 + (a-c)**2/4)
        sigma = sqrt(lamda)
        len = sqrt(b**2 +(a-lamda)**2)
        axesX = [d+sigma*b/len,d-sigma*b/len]
        axesY = [e-sigma*(a-lamda)/len,e+sigma*(a-lamda)/len]
        plot(axesX,axesY,linewidth=.5)
grid()
levels = [.5,1,1.5,2]
ellipse(a,b,c,*m,levels,'red')
show()
1.7 High Dimensions

(1/4)(4√2 − 4) = √2 − 1.
plot([-2,2],[-2,2],color='black')
axes.add_patch(square)
axes.add_patch(circle1)
axes.add_patch(circle2)
axes.add_patch(circle3)
axes.add_patch(circle4)
axes.add_patch(circle)
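The construction of square, circle1, ..., circle4, and circle is not shown in this excerpt. A minimal sketch of how such patches might be built with matplotlib, assuming the two-dimensional picture discussed above: a square of edge 4, four blue circles of radius 1 at (±1, ±1), and a red circle of radius √2 − 1 at the origin.

from numpy import sqrt
from matplotlib.pyplot import subplots
from matplotlib.patches import Rectangle, Circle

fig, axes = subplots()
axes.set_aspect('equal')

# square of edge 4 centered at the origin
square = Rectangle((-2,-2), 4, 4, fill=False)
# four blue circles of radius 1
circle1 = Circle(( 1, 1), 1, color='blue')
circle2 = Circle((-1, 1), 1, color='blue')
circle3 = Circle(( 1,-1), 1, color='blue')
circle4 = Circle((-1,-1), 1, color='blue')
# red circle of radius sqrt(2) - 1 at the origin
circle = Circle((0,0), sqrt(2)-1, color='red')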
Since the edge-length of the cube is 4, the radius of each blue ball is 1. Since the length of the diagonal of the cube is 4√3, the radius of the red ball is

(1/4)(4√3 − 4) = √3 − 1.
Notice there are 8 blue balls.
In two dimensions, when a region is scaled by a factor t, its area increases
by the factor t2 . In three dimensions, when a region is scaled by a factor t,
its volume increases by the factor t3 . We conclude: In d dimensions, when
a region is scaled by a factor t, its (d-dimensional) volume increases by the
factor td . This is called the scaling principle.
In d dimensions, the edge-length of the cube remains 4, the radius of each blue ball remains 1, and there are 2^d blue balls. Since the length of the diagonal of the cube is 4√d, the same calculation results in the radius of the red ball equal to r = √d − 1.
By the scaling principle, the volume of the red ball equals r^d times the volume of the blue ball. We conclude the following:

• Since r = √d − 1 = 1 exactly when d = 4, we have: In four dimensions, the red ball and the blue balls are the same size.

• Since there are 2^d blue balls, the ratio of the volume of the red ball over the total volume of all the blue balls is r^d / 2^d.

• Since r^d = 2^d exactly when r = 2, and since r = √d − 1 = 2 exactly when d = 9, we have: In nine dimensions, the volume of the red ball equals the sum total of the volumes of all blue balls.

• Since r = √d − 1 > 2 exactly when d > 9, we have: In ten or more dimensions, the red ball sticks out of the cube.

• Since the length of the semi-diagonal is 2√d, for any dimension d, the radius of the red ball r = √d − 1 is less than half the length of the semi-diagonal. As the dimension grows without bound, the proportion of the diagonal covered by the red ball converges to 1/2.
The code for Figure 1.35 is as follows. For 3d plotting, the module mayavi
is better than matplotlib.
from itertools import product

pm1 = [-1,1]
for center in product(pm1,pm1,pm1):
    # blue balls: color (0,0,1)
    ball(*center,1,(0,0,1))
# black wire cube: color (0,0,0)
outline(color=(0,0,0))
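The helper ball is not defined in this excerpt. A minimal sketch of one way it might be written with mayavi, assuming the signature ball(x0, y0, z0, r, color) used in the call above; it draws a parametric sphere with mlab.mesh:

from numpy import pi, sin, cos, mgrid
from mayavi import mlab

def ball(x0, y0, z0, r, color):
    # parametric sphere of radius r centered at (x0, y0, z0)
    phi, theta = mgrid[0:pi:50j, 0:2*pi:50j]
    x = x0 + r*sin(phi)*cos(theta)
    y = y0 + r*sin(phi)*sin(theta)
    z = z0 + r*cos(phi)
    mlab.mesh(x, y, z, color=color)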
Chapter 2
Linear Geometry
v = array([1,2,3])
v.shape
v = Matrix([1,2,3])
v.shape
The first v.shape returns (3,), and the second v.shape returns (3,1). In
either case, v is a 3-dimensional vector.
Vectors are added component by component: With
we have
together are the standard basis. Similarly, in Rd , we have the standard basis
e1 , e2 , . . . , ed .
A = array([[1,6,11],[2,7,12],[3,8,13],[4,9,14],[5,10,15]])
A.shape
A = Matrix([[1,6,11],[2,7,12],[3,8,13],[4,9,14],[5,10,15]])
A.shape
Note the transpose operation interchanges rows and columns: the rows of At
are the columns of A. In both numpy or sympy, the transpose of A is A.T.
A d-dimensional vector v may be written as a 1 × d matrix (a row vector)

v = (t1  t2  · · ·  td),

or as a d × 1 matrix (a column vector).
• A, B: any matrix
• Q: symmetric matrix
• P : projections
# column vectors
u = Matrix([1,2,3,4,5])
v = Matrix([6,7,8,9,10])
w = Matrix([11,12,13,14,15])
# 5x3 matrix
A = Matrix.hstack(u,v,w)
# column vector
b = Matrix([1,1,1,1,1])
# 5x4 matrix
M = Matrix.hstack(A,b)
In general, for any sympy matrix A, column vectors can be hstacked and
row vectors can be vstacked. For any matrix A, the code
returns True. Note we use the unpacking operator * to unpack the list, before
applying hstack.
In numpy, there is hstack and vstack, but we prefer column_stack and
row_stack, so the code
both return True. Here col refers to rows of At , hence refers to the columns
of A.
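The two snippets referred to above are not reproduced in this excerpt. One plausible reconstruction (a sketch; the example matrices are assumptions):

# sympy: hstack the columns of A back into A
from sympy import Matrix
A = Matrix([[1,6,11],[2,7,12],[3,8,13],[4,9,14],[5,10,15]])
print(A == Matrix.hstack(*[A.col(j) for j in range(A.cols)]))   # True

# numpy: the rows of A.T are the columns of A
from numpy import array, column_stack, array_equal
B = array([[1,6,11],[2,7,12],[3,8,13],[4,9,14],[5,10,15]])
print(array_equal(B, column_stack([col for col in B.T])))       # True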
A.shape == (A.rows,A.cols)
A = zeros(2,3)
B = ones(2,2)
C = Matrix([[1,2],[3,4]])
D = B + C
E = 5 * C
F = eye(4)
A, B, C, D, E, F
returns

\begin{pmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix}, \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix}, \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}, \begin{pmatrix} 2 & 3 \\ 4 & 5 \end{pmatrix}, \begin{pmatrix} 5 & 10 \\ 15 & 20 \end{pmatrix}, \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}.
A = diag(1,2,3,4)
B = diag(-1, ones(2, 2), Matrix([5, 7, 5]))
A, B
returns

A = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 2 & 0 & 0 \\ 0 & 0 & 3 & 0 \\ 0 & 0 & 0 & 4 \end{pmatrix}, B = \begin{pmatrix} −1 & 0 & 0 & 0 \\ 0 & 1 & 1 & 0 \\ 0 & 1 & 1 & 0 \\ 0 & 0 & 0 & 5 \\ 0 & 0 & 0 & 7 \\ 0 & 0 & 0 & 5 \end{pmatrix}.
It is straightforward to convert back and forth between numpy and sympy. In the code

A = diag(1,2,3,4)
B = array(A)
C = Matrix(B)

A and C are sympy matrices, and B is a numpy array.
For the Iris dataset, the mean (§1.3) is given by the following code.
iris = datasets.load_iris()
dataset = iris["data"]
m = mean(dataset,axis=0)
vectors = dataset - m
2.2 Products
Let t be a scalar, u, v, w be vectors, and let A, B be matrices. We already
know how to compute tu, tv, and tA, tB. In this section, we compute the dot
product u · v, the matrix-vector product Av, and the matrix-matrix product
AB.
These products are not defined unless the dimensions “match”. In numpy,
these products are written dot; in sympy, these products are written *.
In §1.4, we defined the dot product in two dimensions. We now generalize
to any dimension d. Suppose u, v are vectors in Rd . Then their dot product
u · v is the scalar obtained by multiplying corresponding features and then
summing the products. This only works if the dimensions of u and v agree.
In other words, if u = (s1 , s2 , . . . , sd ) and v = (t1 , t2 , . . . , td ), then
u · v = s1 t1 + s2 t2 + · · · + sd td . (2.2.1)
As in §1.4, we always have rows on the left, and columns on the right.
In Python,
u = array([1,2,3])
v = array([4, 5, 6])
u = Matrix([1,2,3])
v = Matrix([4, 5, 6])
sqrt(dot(v,v))
sqrt(v.T * v)
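A sketch of the two products side by side (the example vectors are assumptions):

# numpy: dot products are written dot
from numpy import array, dot, sqrt
u = array([1,2,3])
v = array([4,5,6])
print(dot(u,v))            # 32
print(sqrt(dot(v,v)))      # length of v, sqrt(77)

# sympy: products are written *
from sympy import Matrix
U = Matrix([1,2,3])
V = Matrix([4,5,6])
print(U.dot(V))            # 32
print((U.T * V)[0,0])      # also 32, as the entry of a 1x1 matrix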
As in §1.4,

Dot Product

The dot product u · v (2.2.1) satisfies

u · v = |u| |v| cos θ,

where θ is the angle between u and v.

In two dimensions, this was equation (1.4.4) in §1.4. Since any two vectors lie in a two-dimensional plane, this remains true in any dimension.
Based on this, we can compute the angle θ,

cos θ = \frac{u · v}{|u| |v|} = \frac{u · v}{\sqrt{(u · u)(v · v)}}.
Here is code for the angle θ,
def angle(u,v):
    a = dot(u,v)
    b = dot(u,u)
    c = dot(v,v)
    theta = arccos(a / sqrt(b*c))
    return degrees(theta)
Cauchy-Schwarz Inequality

The dot product of two vectors is absolutely less or equal to the product of their lengths,

|u · v| ≤ |u| |v|.

Two vectors u and v satisfying u · v = 0 are orthogonal. With this understood, the zero vector is orthogonal to every vector. The converse is true as well: If u · v = 0 for every v, then in particular, u · u = 0, which implies u = 0.
Vectors v1 , . . . , vN are said to be orthonormal if they are both unit vectors
and orthogonal. Orthogonal nonzero vectors can be made orthonormal by
dividing each vector by its length.
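For example, a sketch of normalizing a pair of orthogonal vectors:

from numpy import array, dot
from numpy.linalg import norm

v1 = array([1.0, 1.0, 0.0])
v2 = array([1.0, -1.0, 0.0])
print(dot(v1, v2))                  # 0: orthogonal
u1, u2 = v1/norm(v1), v2/norm(v2)   # divide by the lengths
print(norm(u1), norm(u2))           # 1.0 1.0: orthonormal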
|a + b| = (a + b) · v ≤ |a| + |b|.
A,B,dot(A,B)
A,B,A*B
returns

AB = \begin{pmatrix} 70 & 80 & 90 \\ 158 & 184 & 210 \end{pmatrix}.
(Av)t = v t At .
dot(A,B).T == dot(B.T,A.T)
In terms of row vectors and column vectors, this is automatic. For exam-
ple,
(Au) · v = (Au)t v = (ut At )v = ut (At v) = u · (At v).
In Python,
dot(dot(A,u),v) == dot(u,dot(A.T,v))
dot(dot(A.T,u),v) == dot(u,dot(A,v))
As a consequence,
Then the identities (1.4.14) and (1.4.15) hold in general. Using the tensor
product, we have
Iff is short for if and only if.
Tensor Identity
Let A be a matrix with rows v1 , v2 , . . . , vN . Then
At A = v1 ⊗ v1 + v2 ⊗ v2 + · · · + vN ⊗ vN . (2.2.7)
Multiplying (2.2.7) by xt on the left and x on the right, and using (1.4.15),
we see (2.2.7) is equivalent to
By matrix-vector multiplication,
Ax = (v1 · x, v2 · x, . . . , vN · x).
Since |Ax|² is the sum of the squares of its components, this derives (2.2.8).
and
∥A∥² = trace(At A). (2.2.11)
By replacing A by At , the same results hold for columns.
Q = dot(vectors.T,vectors)/N
Q = cov(dataset,rowvar=False)
or
Q = cov(dataset.T)
After downloading the Iris dataset as in §2.1, the mean, covariance, and
total variance are
m = (5.84, 3.05, 3.76, 1.2), Q = \begin{pmatrix} 0.68 & −0.04 & 1.27 & 0.51 \\ −0.04 & 0.19 & −0.32 & −0.12 \\ 1.27 & −0.32 & 3.09 & 1.29 \\ 0.51 & −0.12 & 1.29 & 0.58 \end{pmatrix}, 4.54.    (2.2.13)
In Python,
from sklearn.preprocessing import StandardScaler

# standardize dataset
vectors = StandardScaler().fit_transform(dataset)
Qcorr = corrcoef(dataset.T)
Qcov = cov(vectors.T,bias=True)
allclose(Qcov,Qcorr)
returns True.
2.3 Matrix Inverse

Ax = b. (2.3.1)
In this section, we use the inverse A−1 and the pseudo-inverse A+ to solve
(2.3.1).
However, it’s very easy to construct matrices A and vectors b for which
the linear system (2.3.1) has no solutions at all! For example, take A the
zero matrix and b any non-zero vector. Because of this, we must be careful
when solving (2.3.1).
Ax = b =⇒ x = A−1 b. (2.3.3)
# solving Ax=b
x = A.inv() * b
# solving Ax=b
x = dot(inv(A) , b)
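When A is invertible, numpy's solve can also be used directly (a sketch, assuming A and b are the numpy array and vector from the previous snippet; solve does not form the inverse explicitly):

from numpy.linalg import solve

# solving Ax=b without computing inv(A)
x = solve(A, b)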
x+ = A+ b =⇒ Ax+ = b.
• no solutions, or
• A is invertible,
• A = 0 and b = 0.
The pseudo-inverse provides a single systematic procedure for deciding
among these three possibilities. The pseudo-inverse is available in numpy and
sympy as pinv. In this section, we focus on using Python to solve Ax = b,
postponing concepts to §2.6.
How do we use the above result? Given A and b, using Python, we
compute x = A+ b. Then we check, by multiplying in Python, equality of Ax
and b.
The rest of the section consists of examples of solving linear systems. The
reader is encouraged to work out the examples below in Python. However,
because some linear systems have more than one solution, and the imple-
mentation of Python on your laptop may be different than on my laptop, our
solutions may differ.
It can be shown that if the entries of A are integers, then the entries of
A+ are fractions. This fact is reflected in sympy, but not in numpy, as the
default in numpy is to work with floats.
Let
# vectors
u = array([1,2,3,4,5])
v = array([6,7,8,9,10])
w = array([11,12,13,14,15])
# arrange as columns
A = column_stack([u,v,w])
pinv(A)
returns

A+ = \frac{1}{150} \begin{pmatrix} −37 & −20 & −3 & 14 & 31 \\ −10 & −5 & 0 & 5 & 10 \\ 17 & 10 & 3 & −4 & −11 \end{pmatrix}.
Alternatively, in sympy,
# column vectors
u = Matrix([1,2,3,4,5])
v = Matrix([6,7,8,9,10])
w = Matrix([11,12,13,14,15])
A = Matrix.hstack(u,v,w)
A.pinv()
For

c = (−9, −3, 3, 9, 10),

we have

x+ = A+ c = \frac{1}{15} (82, 25, −32).
However, for this x+ , we have
We solve
Bx = u, Bx = v, Bx = w
by constructing the candidates
B + u, B + v, B + w,
Let

C = At = \begin{pmatrix} 1 & 2 & 3 & 4 & 5 \\ 6 & 7 & 8 & 9 & 10 \\ 11 & 12 & 13 & 14 & 15 \end{pmatrix}

and let f = (0, −5, −10). Then

C+ = (At)+ = (A+)t = \frac{1}{150} \begin{pmatrix} −37 & −10 & 17 \\ −20 & −5 & 10 \\ −3 & 0 & 3 \\ 14 & 5 & −4 \\ 31 & 10 & −11 \end{pmatrix}

and

x+ = C+ f = \frac{1}{50} (32, 35, 38, 41, 44).
Once we confirm equality of Cx+ and f , which is the case, we obtain a
solution x+ of Cx = f .
2.4 Span and Linear Independence

x = (t1 , t2 , . . . , td ).
Then
Ax = t1 v1 + t2 v2 + · · · + td vd , (2.4.2)
In other words, Ax is the linear combination

t1 v1 + t2 v2 + · · · + td vd

of the vectors. For example, span(b) of a single vector b is the line through
b, and span(u, v, w) is the set of all linear combinations ru + sv + tw.
Span Definition I
The span of v1 , v2 , . . . , vd is the set S of all linear combinations of v1 ,
v2 , . . . , vd , and we write
S = span(v1 , v2 , . . . , vd ).
Span Definition II
Let A be the matrix with columns v1 , v2 , v3 , . . . , vd . Then
span(v1 , v2 , . . . , vd ) is the set S of all vectors of the form Ax.
span(v1 , v2 , . . . , vd ) = span(w1 , w2 , . . . , wN ).
Thus there are many choices of spanning vectors for a given span.
For example, let u, v, w be the columns of A in (2.3.4). Let ⊂ mean "is contained in". Then

span(u, v) ⊂ span(u, v, w),

since adding a third vector can only increase the linear combination possibilities. On the other hand, since w = 2v − u, we also have

span(u, v, w) ⊂ span(u, v).

It follows that

span(u, v, w) = span(u, v).
Let A be a matrix. The column space of A is the span of its columns. For
A as in (2.3.4), the column space of A is span(u, v, w). The code
# column vectors
u = Matrix([1,2,3,4,5])
v = Matrix([6,7,8,9,10])
w = Matrix([11,12,13,14,15])
A = Matrix.hstack(u,v,w)
A.columnspace()

returns a minimal set of vectors spanning the column space of A. The column rank of A is the number of vectors returned.
For example, for A as in (2.3.4), this code returns
Ax = t1 v1 + t2 v2 + · · · + td vd .
By (2.4.3),
This code returns two orthonormal vectors a/|a| and b/|b|, where
a = (8, 9, 10, 11, 12), b = (11, 6, 1, −4, −9),
and |a| = √510, |b| = √255.
We conclude the column space of A can be described in at least three
ways,
span(a, b) = span(u, v, w) = span(u, v).
Explicitly, a and b are linear combinations of u, v, w,
15a = 2u + 5v + 8w, 30b = −173u − 50v + 73w, (2.4.4)
and u, v, w are linear combinations of a and b,
51u = 16a − 7b, 51v = 41a − 2b, w = 2v − u. (2.4.5)
By (2.4.3), to derive (2.4.4), we solve Ax = a and Ax = b for x. But this
was done in §2.3.
Similarly, let B be the matrix with columns a and b, and solve Bx = u,
Bx = v, Bx = w, obtaining (2.4.5). This was done in §2.3.
As a general rule, sympy.columnspace returns vectors in close to original
form, and scipy.linalg.orth orthonormalizes the spanning vectors.
For example, let b = (−9, −3, 3, 9, 10) and let Ā = (A, b). Using Python,
check the column rank of Ā is 3. Since the column rank of A is 2, we conclude
b is not in the column space of A: b is not a linear combination of u, v, w.
When (2.4.6) holds, b is a linear combination of the columns of A. How-
ever, (2.4.6) does not tell us which linear combination. According to (2.4.3),
finding the linear combination is equivalent to solving Ax = b.
then
(r, s, t) = re1 + se2 + te3 .
This shows the vectors e1 , e2 , e3 span R3 , or
R3 = span(e1 , e2 , e3 ).
e1 = (1, 0, 0, . . . , 0, 0)
e2 = (0, 1, 0, . . . , 0, 0)
e3 = (0, 0, 1, . . . , 0, 0) (2.4.7)
... = ...
ed = (0, 0, 0, . . . , 0, 1)
Then e1 , e2 , . . . , ed span Rd , so
d-dimensional Space
Rd is a span.
span(a, b, c, d, e) = span(a, f ).
t1 v1 + t2 v2 + · · · + td vd = 0.
ru + sv + tw = 1u − 2v + 1w = 0 (2.4.8)
If r ≠ 0, then

u = −(s/r)v − (t/r)w.

If s ≠ 0, then

v = −(r/s)u − (t/s)w.

If t ≠ 0, then

w = −(r/t)u − (s/t)v.
A.nullspace()
This says the null space of A consists of all multiples of (1, −2, 1). Since the
code
[r,s,t] = A.nullspace()[0]
null_space(A)
A Versus At A
Let A be any matrix. The null space of A equals the null space of At A.
|Ax|² = Ax · Ax = x · At Ax = 0,
t1 v1 + t2 v2 + · · · + td vd = 0.
Take the dot product of both sides with v1 . Since the dot products of any
two vectors is zero, and each vector has length one, we obtain
t1 = t1 v1 · v1 = t1 v1 · v1 + t2 v2 · v1 + · · · + td vd · v1 = 0.
u = array([1,2,3,4,5])
v = array([6,7,8,9,10])
w = array([11,12,13,14,15])
B = row_stack([u,v,w])
null_space(B)
Ax = (v1 · x, v2 · x, . . . , vN · x).
v1 · x = 0, v2 · x = 0, . . . , vN · x = 0,
Since the row space is the orthogonal complement of the null space, and
the null space of A equals the null space of At A, we conclude
A Versus At A
Let A be any matrix. Then the row space of A equals the row space
of At A.
Now replace A by At in this last result. Since the row space of At equals
the column space of A, and AAt is symmetric, we also have
A Versus AAt
Let A be any matrix. Then the column space of A equals the column
space of AAt .
• If x1 and x2 are in the null space, and r1 and r2 are scalars, then so is
r1 x1 + r2 x2 , because
m = (x1 + x2 + · · · + xN)/N.
Center the dataset (see §1.3)
v1 = x1 − m, v2 = x2 − m, . . . , vN = xN − m,
v1 · b, v2 · b, . . . , vN · b.
b · Qb = 0.
b · (x − m) = 0.
b · (x − m) = 0.
a(x − x0 ) + b(y − y0 ) = 0, or ax + by = c,
point in the plane, then (x, y, z) − (x0 , y0 , z0 ) is orthogonal to (a, b, c), so the
equation of the plane is
(a, b, c) · ((x, y, z) − (x0 , y0 , z0 )) = 0, or ax + by + cz = d,
where d = ax0 + by0 + cz0 .
Suppose we have a dataset in R3 with mean m = (3, 2, 1), and covariance
Q = \begin{pmatrix} 1 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{pmatrix}. (2.5.2)
Let b = (2, −1, −1). Then Qb = 0, so b · Qb = 0. We conclude the dataset
lies in the plane
(2, −1, −1) · ((x, y, z) − (x0 , y0 , z0 )) = 0, or 2x − y − z = 3.
In this case, the dataset is two-dimensional, as it lies in a plane.
If a dataset has covariance the 3 × 3 identity matrix I, then b · Ib is never
zero unless b = 0. Such a dataset is three-dimensional, it does not lie in a
plane.
Sometimes there may be several zero variance directions. For example,
for the covariance (2.5.2) and u = (2, −1, −1), v = (0, 1, −1), we have both
u · Qu = 0 and v · Qv = 0.
From this we see the dataset corresponding to this Q lies in two planes: The plane orthogonal to u, and the plane orthogonal to v. But the intersection of
two planes is a line, so this dataset lies in a line, which means it is one-
dimensional.
Which line does this dataset lie in? Well, the line has to pass through the
mean, and is orthogonal to u and v. If we find a vector b satisfying b · u = 0
and b · v = 0, then the line will pass through m and will be parallel to b. But
we know how to find such a vector. Let A be the matrix with rows u, v. Then
b in the nullspace of A fulfills the requirements. We obtain b = (1, 1, 1).
Based on the above result, here is code that returns zero variance direc-
tions.
def zero_variance(dataset):
    Q = cov(dataset.T)
    return null_space(Q)
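For example, a sketch on a small dataset lying on the line y = x + 1 (the dataset here is an assumption; null_space is scipy's, as above):

from numpy import array, cov
from scipy.linalg import null_space

dataset = array([ [t, t + 1.0] for t in range(5) ])
print(zero_variance(dataset))
# one column, proportional to (1, -1) up to sign: the zero variance direction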
(1, 2, 3, 4, 5), (6, 7, 8, 9, 10), (11, 12, 13, 14, 15), (16, 17, 18, 19, 20).
2.6 Pseudo-Inverse
What exactly is the pseudo-inverse? It turns out the answer is best under-
stood geometrically.
Think of b and Ax as points, and measure the distance between them,
and think of x and the origin 0 as points, and measure the distance between
them (Figure 2.1).
Figure 2.1: The point x and the origin 0 in the source space; the points b and Ax in the target space.
Even though the point x+ may not solve Ax = b, this procedure (Figure
2.2) results in a uniquely determined x+ : While there may be several points
x∗ , there is only one x+ .
Figure 2.2: The points x, Ax, the points x∗ , Ax∗ , and the point x+ .
The results in this section are as follows. Let A be any matrix. There
is a unique matrix A+ — the pseudo-inverse of A — with the following
properties.
• x+ = A+ b is a solution of
• In either case,
At Ax = At b. (2.6.2)
Zero Residual
x is a solution of (2.3.1) iff the residual is zero.
x + 6y + 11z = −9
2x + 7y + 12z = −3
3x + 8y + 13z = 3        (2.6.3)
4x + 9y + 14z = 9
5x + 10y + 15z = 10
Let b be any vector, not necessarily in the column space of A. To see how
close we can get to solving (2.3.1), we minimize the residual (2.6.1). We say
x∗ is a residual minimizer if
Regression Equation
x∗ is a residual minimizer iff x∗ solves the regression equation.
At (Ax∗ − b) · v = (Ax∗ − b) · Av = 0.
At (Ax∗ − b) = 0,
Multiple Solutions
Any two residual minimizers differ by a vector in the nullspace of A.
Since we know from above there is a residual minimizer in the row space
of A, we always have a minimum norm residual minimizer.
Let v be in the null space of A, and write
x∗ · v ≥ 0.
Since both ±v are in the null space of A, this implies ±x∗ · v ≥ 0, hence
x∗ · v = 0. Since the row space is the orthogonal complement of the null
space, the result follows.
Uniqueness
If x1+ and x2+ are minimum norm residual minimizers, then v = x1+ − x2+ is zero, so x1+ = x2+.
We know any two solutions of the linear system (2.3.1) differ by a vector in
the null space of A (2.4.10), and any two solutions of the regression equation
(2.6.2) differ by a vector in the null space of A (above).
If x is a solution of (2.3.1), then, by multiplying by At , x is a solution of
the regression equation (2.6.2). Since x+ = A+ b is a solution of the regression
equation, x+ = x + v for some v in the null space of A, so
Ax+ = A(x + v) = Ax + Av = b + 0 = b.
This shows x+ is a solution of the linear system. Since all other solutions
differ by a vector v in the null space of A, this establishes the result.
Now we can state when Ax = b is solvable,
Solvability of Ax = b
Properties of Pseudo-Inverse
A. AA+ A = A
B. A+ AA+ = A+
(2.6.8)
C. AA+ is symmetric
D. A+ A is symmetric
u = A+ Au + v. (2.6.9)
Au = AA+ Au.
A+ w = A+ AA+ w + v
for some v in the null space of A. But both A+ w and A+ AA+ w are in the
row space of A, hence so is v. Since v is in both the null space and the row
space, v is orthogonal to itself, so v = 0. This implies A+ AA+ w = A+ w.
Since w was any vector, we obtain B.
Since A+ b solves the regression equation, At AA+ b = At b for any vector
b. Hence At AA+ = At . Let P = AA+ . Now
(x − A+ Ax) · A+ Ay = 0.
x · P y = P x · P y = x · P tP y
Also we have
2.7 Projections
In this section, we study projection matrices P , and we show
Figure 2.3: The projection Pb = tu of b onto the line span(u), with b − Pb orthogonal to u.
Let u be a unit vector, and let b be any vector. Let span(u) be the line
through u (Figure 2.3). The projection of b onto span(u) is the vector v in
span(u) that is closest to b.
It turns out this closest vector v equals P b for some matrix P , the pro-
jection matrix. Since span(u) is a line, the projected vector P b is a multiple
tu of u.
From Figure 2.3, b − P b is orthogonal to u, so
0 = (b − P b) · u = b · u − P b · u = b · u − t u · u = b · u − t.
Hence t = b · u, and the projected vector is P b = (b · u)u. Note P (P b) = P b, since P b is already on the line. If U is the matrix with the single column u, we obtain

P = U U t.
To summarize, the projected vector is the vector (b · u)u, and the reduced
vector is the scalar b · u. Now we project onto a plane. If U is the matrix
with the single column u, then the reduced vector is U t b and the projected
vector is U U t b.
2. P b = b if b is in S,
P = AA+ . (2.7.2)
establishing 3.
def project(A,b):
    Aplus = pinv(A)
    x = dot(Aplus,b)   # reduced
    return dot(A,x)    # projected
For A as in (2.3.4) and b = (−9, −3, 3, 9, 10) the reduced vector onto the
column space of A is
x = A+ b = \frac{1}{15} (82, 25, −32),
and the projected vector onto the column space of A is
P b = Ax = AA+ b = (−8, −3, 2, 7, 12).
The projection matrix onto the column space of A is

P = AA+ = \frac{1}{10} \begin{pmatrix} 6 & 4 & 2 & 0 & −2 \\ 4 & 3 & 2 & 1 & 0 \\ 2 & 2 & 2 & 2 & 2 \\ 0 & 1 & 2 & 3 & 4 \\ −2 & 0 & 2 & 4 & 6 \end{pmatrix}.
P = A+ A. (2.7.3)
def project_to_ortho(U,b):
    x = dot(U.T,b)   # reduced
    return dot(U,x)  # projected
dataset vk in Rd , k = 1, 2, . . . , N
reduced U t vk in Rn , k = 1, 2, . . . , N
projected U U t vk in Rd , k = 1, 2, . . . , N
If S is a span in Rd , then
Rd = S ⊕ S ⊥ . (2.7.5)
v = P v + (v − P v),
and the null space and row space are orthogonal to each other.
P = I − A+ A. (2.7.7)
But this was already done in §2.3, since P b = AA+ b = Ax+ where x+ = A+ b
is a residual minimizer.
2.8 Basis
Figure: Spanning vectors and linearly independent vectors, and the corresponding bases, orthogonal bases, and orthonormal bases.
Span of N Vectors
e1 = (1, 0, . . . , 0),
e2 = (0, 1, 0, . . . , 0),
... = ...
ed = (0, 0, . . . , 0, 1),
The dimension of Rd is d.
matrix_rank(vectors)
In particular, since 712 < 784, approximately 10% of pixels are never
touched by any image. For example, the most likely pixel to remain un-
touched is at the top left corner (0, 0). Thus there are 72 zero variance
directions for this dataset.
We pose the following question: What is the least n for which the first
n images are linearly dependent? Since the dimension of the feature space
Rd is d, we must have n ≤ 784. To answer the question, we compute the
row rank of the first n vectors for n = 1, 2, 3, . . . , and continue until we have
linear dependence of v1 , v2 , . . . , vn .
If we save the MNIST dataset as a centered array vectors, as in §2.1,
and run the code below, we obtain n = 560 (Figure 2.7). matrix_rank is
discussed in §2.9.
def find_first_defect(vectors):
    d = len(vectors[0])
    previous = 0
    for n in range(len(vectors)):
        r = matrix_rank(vectors[:n+1,:])
        print((r,n+1),end=",")
        if r == previous: break
        if r == d: break
        previous = r
This we call the dimension staircase. For example, Figure 2.8 is the dimen-
sion staircase for
v1 = (1, 0, 0), v2 = (0, 1, 0), v3 = (1, 1, 0), v4 = (3, 4, 0), v5 = (0, 0, 1).
With the MNIST dataset loaded as vectors, here is code returning Figure
2.9. This code is not efficient, but it works. It takes 57041 vectors in the
dataset to fill up 712 dimensions.
def dimension_staircase(vectors):
    d = vectors[0].size
    N = len(vectors)
    rmax = matrix_rank(vectors)
    dimensions = [ ]
    basis = [ ]
    for n in range(1,N):
        r = matrix_rank(vectors[:n,:])
        print((r,n),end=",")
        dimensions.append(r)
        if r == rmax: break
    stairs(dimensions, range(n+1))
v1 is a linear combination of b1 , b2 , . . . , bd ,
v1 = t1 b1 + t2 b2 + · · · + td bd .
Since v1 ≠ 0, at least one of the coefficients, say t1, is not zero, so we can solve

b1 = \frac{1}{t_1} (v_1 − t_2 b_2 − t_3 b_3 − · · · − t_d b_d).
This shows
span(v1 , v2 , . . . , vN ) = span(v1 , b2 , b3 , . . . , bd ).
Repeating the same logic, v2 is a linear combination of v1 , b2 , b3 , . . . , bd ,
v2 = s1 v1 + t2 b2 + t3 b3 + · · · + td bd .
If all the coefficients of b2 , b3 , . . . , bd are zero, then v2 is a multiple of v1 ,
contradicting linear independence of v1 , v2 , . . . , vN . Thus there is at least
one coefficient, say t2 , which is not zero. Solving for b2 , we obtain
b2 = \frac{1}{t_2} (v_2 − s_1 v_1 − t_3 b_3 − · · · − t_d b_d).
This shows
span(v1 , v2 , . . . , vN ) = span(v1 , v2 , b3 , b4 , . . . , bd ).
Repeating the same logic, v3 is a linear combination of v1 , v2 , b3 , b4 , . . . , bd ,
v3 = s1 v1 + s2 v2 + t3 b3 + t4 b4 + · · · + td bd .
If all the coefficients of b3 , b4 , . . . , bd are zero, then v3 is a linear combination
of v1 , v2 , contradicting linear independence of v1 , v2 , . . . , vN . Thus there is
at least one coefficient, say t3 , which is not zero. Solving for b3 , we obtain
b3 = \frac{1}{t_3} (v_3 − s_1 v_1 − s_2 v_2 − t_4 b_4 − · · · − t_d b_d).
This shows
span(v1 , v2 , . . . , vN ) = span(v1 , v2 , v3 , b4 , b5 , . . . , bd ).
Continuing in this manner, we eventually arrive at
span(v1 , v2 , . . . , vN ) = · · · = span(v1 , v2 , . . . , vd ).
This shows vN is a linear combination of v1 , v2 , . . . , vd . This shows N = d,
because N > d contradicts linear independence. Since d is the minimal
spanning number, this shows v1 , v2 , . . . , vN is a minimal spanning set for S.
2.9 Rank
If A is an N ×d matrix, then (Figure 2.10) x 7→ Ax is a linear transformation
that sends a vector x in Rd (the source space) to the vector Ax in RN (the
target space). The transpose At goes in the reverse direction: The linear
transformation b 7→ At b sends a vector b in RN (the target space) to the
vector At b in Rd (the source space).
It follows that for an N × d matrix, the dimension of the source space is
d, and the dimension of the target space is N ,
dim(source space) = d, dim(target space) = N.
Figure 2.10: A sends x in the source space to Ax in the target space; At sends b in the target space back to At b in the source space.
By (2.4.2), the column space is in the target space, and the row space is
in the source space. Thus we always have
0 ≤ row rank ≤ d and 0 ≤ column rank ≤ N.
For A as in (2.3.4), the column rank is 2, the row rank is 2, and the nullity
is 1. Thus the column space is a 2-d plane in R5 , the row space is a 2-d plane
in R3 , and the null space is a 1-d line in R3 .
Rank Theorem
Let A be any matrix. Then
A.rank()
matrix_rank(A)
returns the rank of a matrix. The main result implies rank(A) = rank(At ),
so
For any N × d matrix, the rank is never greater than min(N, d).
C = CI = CAB = IB = B,
so B = C is the inverse of A.
The first two assertions are in §2.2. For the last assertion, assume U is a square matrix. From §2.4, orthonormality of the rows implies linear independence of the rows.
Orthogonal Matrix
A matrix U is orthogonal iff its rows are an orthonormal basis iff its
columns are an orthonormal basis.
Since
U u · U v = u · U t U v = u · v,
U preserves dot products. Since lengths are dot products, U also preserves
lengths. Since angles are computed from dot products, U also preserves
angles. Summarizing,
As a consequence,
I = v1 ⊗ v1 + v2 ⊗ v2 + · · · + vd ⊗ vd .
and
|v|² = |v · v1|² + |v · v2|² + · · · + |v · vd|². (2.9.4)
To derive the main result, first we recall (2.7.6). From the definition of
dimension, we can rewrite (2.7.6) as
Ax = A(u + v) = Au + Av = Av.
This shows the column space consists of vectors of the form Av with v in the
row space.
Let v1 , v2 , . . . , vr be a basis for the row space. From the previous para-
graph, it follows Av1 , Av2 , . . . , Avr spans the column space of A. We claim
Av1 , Av2 , . . . , Avr are linearly independent. To check this, we write
If v is the vector t1 v1 +t2 v2 +· · ·+tr vr , this shows v is in the null space. But v
is a linear combination of basis vectors of the row space, so v is also in the row
space. Since the row space is the orthogonal complement of the null space,
we must have v orthogonal to itself. Thus v = 0, or t1 v1 +t2 v2 +· · ·+tr vr = 0.
But v1 , v2 , . . . , vr is a basis. By linear independence of v1 , v2 , . . . , vr , we
conclude t1 = 0, . . . , tr = 0. This establishes the claim, hence Av1 , Av2 , . . . ,
Avr is a basis for the column space. This shows r is the dimension of the
column space, which is by definition the column rank. Since by construction,
r is also the row rank, this establishes the rank theorem.
Chapter 3
Principal Components
To keep things simple, assume both the source space and the target space
are R2 ; then A is 2 × 2.
The unit circle (in red in Figure 3.1) is the set of vectors u satisfying
|u| = 1. The image of the unit circle (also in red in Figure 3.1) is the set of
vectors of the form
{Au : |u| = 1}.
The annulus is the set (the region between the dashed circles in Figure 3.1)
of vectors b satisfying
{b : σ2 < |b| < σ1 }.
It turns out the image is an ellipse, and this ellipse lies in the annulus.
Thus the numbers σ1 and σ2 constrain how far the image of the unit circle
is from the origin, and how near the image is to the origin.
This shows the image of the unit circle is the inverse covariance ellipse (§1.6)
corresponding to the covariance Q, with major axis length 2σ1 and minor
axis length 2σ2 .
These reflect vectors across the horizontal axis, and across the vertical axis.
Recall an orthogonal matrix is a matrix U satisfying U t U = I = U U t
(2.9.2). Every orthogonal matrix U is a rotation V or a rotation times a
reflection V R.
The SVD decomposition (§3.3) states that every matrix A can be written
as a product
A = \begin{pmatrix} a & b \\ c & d \end{pmatrix} = U S V.
Here S is a diagonal matrix as above, and U , V are orthogonal and rotation
matrices as above.
In more detail, apart from a possible reflection, there are scalings σ1 and
σ2 and angles α and β, so that A transforms vectors by first rotating by α,
then scaling by (σ1 , σ2 ), then rotating by β (Figure 3.2).
A nonzero vector v is an eigenvector of the square matrix A, corresponding to the eigenvalue λ, if

Av = λv. (3.2.1)
(Figure: eigendata versus singular data. Any matrix has singular data σ, u, v, and its row rank equals its column rank. A square matrix has eigendata λ, v; for an invertible matrix λ ≠ 0; a symmetric matrix has real eigenvalues and orthonormal eigenvectors; a covariance (nonnegative) matrix has λ ≥ 0; a positive matrix has λ > 0.)
# general square matrix: use eig
from numpy import array
from numpy.linalg import eig, eigh

A = array([[2,1],[1,2]])
lamda, U = eig(A)
lamda

# symmetric matrix: use eigh
Q = array([[2,1],[1,2]])
lamda, U = eigh(Q)
lamda
Since Av = λv is the same as

(A − λI)v = Av − λv = 0,

λ is an eigenvalue of A exactly when A − λI is not invertible.
If v is a unit eigenvector of a symmetric matrix Q, Qv = λv, then

v · Qv = v · λv = λ v · v = λ.

If u and v are eigenvectors of Q, say Qu = λu and Qv = µv, then, by the symmetry of Q,

µ u · v = u · (µv) = u · Qv = v · Qu = v · (λu) = λ u · v.

This implies

(µ − λ) u · v = 0.

If λ ≠ µ, we must have u · v = 0. We conclude: eigenvectors of a symmetric matrix corresponding to distinct eigenvalues are orthogonal.
QU = U E. (3.2.3)
allclose(dot(Q,v), lamda*v)
returns True.
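As a quick numerical check of (3.2.3), the products QU and U E can be compared directly; a minimal sketch:

from numpy import array, diag, dot, allclose
from numpy.linalg import eigh

Q = array([[2,1],[1,2]])
lamda, U = eigh(Q)
E = diag(lamda)
print(allclose(dot(Q,U), dot(U,E)))   # True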
λ1 ≥ λ2 ≥ · · · ≥ λd .
Diagonalization (EVD)

Let Q be a symmetric d × d matrix. Then there is an orthogonal matrix U and a diagonal matrix E with Q = U EU t .

In other words, with the correct choice of orthonormal basis, the matrix
Q becomes a diagonal matrix E.
The orthonormal basis eigenvectors v1 , v2 , . . . , vd are the principal compo-
nents of the matrix Q. The eigenvalues and eigenvectors of Q, taken together,
are the eigendata of Q. The code
Q = array([[2,1],[1,2]])
lamda, U = eigh(Q)
lamda, U
returns the eigenvalues [1, 3] and the matrix U = [u, v] with columns

u = (1/√2, −1/√2), v = (1/√2, 1/√2).
Q = array([[2,1],[1,2]])
lamda, U = eigh(Q)
V = U.T
E = diag(lamda)
allclose(Q,dot(U,dot(E,V)))
returns True.
from sympy import Matrix, init_printing
init_printing()

Q = Matrix([[2,1],[1,2]])
# eigenvalues
Q.eigenvals()
# eigenvectors
Q.eigenvects()
U, E = Q.diagonalize()
rank(Q) = rank(E) = r.
Using sympy,
Q = Matrix([[2,1],[1,2]])
U, E = Q.diagonalize()
display(U,E)
returns
U = \begin{pmatrix} 1 & 1 \\ -1 & 1 \end{pmatrix}, \qquad E = \begin{pmatrix} 1 & 0 \\ 0 & 3 \end{pmatrix}.
Also,
from sympy import symbols, Matrix
a, b, c = symbols('a b c', real=True)

Q = Matrix([[a,b],[b,c]])
U, E = Q.diagonalize()
display(Q,U,E)
returns
Q = \begin{pmatrix} a & b \\ b & c \end{pmatrix}, \qquad U = \begin{pmatrix} \dfrac{a-c-\sqrt{D}}{2b} & \dfrac{a-c+\sqrt{D}}{2b} \\ 1 & 1 \end{pmatrix},

and

E = \frac12 \begin{pmatrix} a+c-\sqrt{D} & 0 \\ 0 & a+c+\sqrt{D} \end{pmatrix}, \qquad D = (a-c)^2 + 4b^2.
display is used to pretty-print the output.
Pseudo-Inverse (EVD)
Qx = b
has a solution x for every vector b iff all eigenvalues are nonzero, in
which case
x = (1/λ1)(b · v1)v1 + (1/λ2)(b · v2)v2 + · · · + (1/λd)(b · vd)vd. (3.2.5)
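Formula (3.2.5) can be checked on a small example; the matrix Q and vector b below are arbitrary choices.

from numpy import array, dot, allclose
from numpy.linalg import eigh

Q = array([[2.,1.],[1.,2.]])
b = array([1., 5.])
lamda, U = eigh(Q)     # columns of U are the eigenvectors
# x = sum over k of (1/lamda_k)(b . v_k) v_k
x = sum((dot(b,U[:,k])/lamda[k]) * U[:,k] for k in range(len(lamda)))
print(allclose(dot(Q,x), b))   # True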
trace(Q) = λ1 + λ2 + · · · + λd . (3.2.6)
Q² is symmetric with eigenvalues λ1², λ2², . . . , λd². Applying the last result to Q², we have

trace(Q²) = λ1² + λ2² + · · · + λd².
(Figure: the ellipse determined by Q, with semi-axes ±√λ1 v1 and ±√λ2 v2 .)
Q = λ1 v1 ⊗ v1 + λ2 v2 ⊗ v2 + · · · + λd vd ⊗ vd . (3.2.7)
v = (v · v1 ) v1 + (v · v2 ) v2 + · · · + (v · vd ) vd .
Let bk = √λk vk , k = 1, . . . , d. Then, by (3.2.7), the covariance matrix of the 2d points ±b1 , ±b2 , . . . , ±bd ,

(2/(2d)) (b1 ⊗ b1 + b2 ⊗ b2 + · · · + bd ⊗ bd ),

equals Q/d.
λ1 = max v · Qv, (3.2.8)

where the maximum is over all unit vectors v. We say a unit vector b is best-fit
for Q or best-aligned with Q if the maximum is achieved at v = b: λ1 = b · Qb.
When Q is a covariance matrix, this means the unit vector b is chosen so that
the variance b · Qb of the dataset projected onto b is maximized.
An eigenvalue λ1 of Q is the top eigenvalue if λ1 ≥ λ for any other
eigenvalue. An eigenvalue λ1 of Q is the bottom eigenvalue if λ1 ≤ λ for any
other eigenvalue.
A Calculation
Suppose λ, a, b, c, d are real numbers and suppose we know
(λ + at + bt²)/(1 + ct + dt²) ≤ λ, for all t real.

Then a = λc. To see this, note that for t small the denominator is positive, so cross-multiplying gives λ + at + bt² ≤ λ + λct + λdt², or (a − λc)t ≤ (λd − b)t². Letting t → 0 through positive and through negative values forces a − λc = 0.
If v is a unit eigenvector of Q with eigenvalue λ, then

λ1 ≥ v · Qv = v · (λv) = λ v · v = λ,

so λ1 is greater than or equal to every eigenvalue. Now let v1 be a unit vector achieving the maximum (3.2.8):

λ1 = v1 · Qv1 ≥ v · Qv (3.2.9)

for all unit vectors v. Let u be any vector. Then for any real t,
v = (v1 + tu)/|v1 + tu|.

Inserting this v into (3.2.9) and applying the calculation above gives

u · Qv1 = λ1 u · v1 ,

or

u · (Qv1 − λ1 v1 ) = 0.

Since u is an arbitrary vector, Qv1 = λ1 v1 : the maximizing vector v1 is an eigenvector of Q with eigenvalue λ1 .
Just as the maximum variance (3.2.8) is the top eigenvalue λ1 , the mini-
mum variance
λd = min_{|v|=1} v · Qv, (3.2.10)
λ1 ≥ λ2 ≥ λ3 ≥ · · · ≥ λd .
v1 , v2 , v3 , . . . , vd .
(Figure: the span S of the top eigenvectors v1 , v2 , v3 and its orthogonal complement T = S⊥ .)
Sλ = {v : Qv = λv}
the eigenspace corresponding to λ. For example, suppose the top three eigen-
values are equal: λ1 = λ2 = λ3 , with b1 , b2 , b3 the corresponding eigenvectors.
Calling this common value λ, the eigenspace is Sλ = span(b1 , b2 , b3 ). Since
b1 , b2 , b3 are orthonormal, dim(Sλ ) = 3. In Python, the eigenspaces Sλ are
obtained from the matrix U above: The columns of U are an orthonormal basis
for the entire space, so selecting the columns corresponding to a specific λ
yields an orthonormal basis for Sλ .
Let (evs,U) be the list of eigenvalues and matrix U whose columns are
the eigenvectors. Then the eigenvectors are the rows of U t . Here is code for
selecting just the eigenvectors corresponding to eigenvalue s.
lamda, U = eigh(Q)
V = U.T
V[isclose(lamda,s)]
The function isclose(a,b) returns True when a and b are numerically close.
Using this boolean, we extract only those rows of V whose corresponding
eigenvalue is close to s.
All this can be readily computed in Python. For the Iris dataset, we have
the covariance matrix in (2.2.13). The eigenvalues sum to the trace,

4.54 = trace(Q) = λ1 + λ2 + λ3 + λ4 .
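For reference, here is a sketch of how these Iris eigendata can be computed; the file name iris.csv follows the usage elsewhere in the text, and selecting the numeric columns is an assumption about the file's layout.

from pandas import read_csv
from numpy import cov
from numpy.linalg import eigh

df = read_csv("iris.csv")
data = df.select_dtypes("number").to_numpy()   # the four measurements

Q = cov(data.T)
lamda, U = eigh(Q)     # eigenvalues in increasing order
print(lamda.sum())     # equals trace(Q), about 4.54
print(lamda[-1])       # top eigenvalue, about 4.2
print(U[:,-1])         # corresponding eigenvector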
For the Iris dataset, the top eigenvalue is λ1 = 4.2, it has multiplicity 1, and
its corresponding list of eigenvectors contains only one eigenvector,
def row(i,d):
    # row i of the d x d matrix Q(d)
    v = [0]*d
    v[i] = 2
    if i > 0: v[i-1] = -1
    if i < d-1: v[i+1] = -1
    if i == 0: v[d-1] += -1
    if i == d-1: v[0] += -1
    return v
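The code below refers to the full matrix Q(d); a minimal sketch of how it can be assembled from row is

from numpy import array

def Q(d):
    # stack the rows into the d x d circulant matrix Q(d)
    return array([ row(i,d) for i in range(d) ])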
# using sympy
from sympy import Matrix
# using numpy
from numpy import *
(Figure 3.6: two masses m1 , m2 between walls, with displacements x1 , x2 .)
To explain where these matrices come from, look at the mass-spring sys-
tems in Figures 3.6 and 3.7. Here we have springs attached to masses and
160 CHAPTER 3. PRINCIPAL COMPONENTS
walls on either side. At rest, the springs are the same length. When per-
turbed, some springs are compressed and some stretched. In Figure 3.6, let
x1 and x2 denote the displacement of each mass from its rest position.
When extended by x, each spring fights back by exerting a force kx
proportional to the displacement x. For example, look at the mass m1 . The
spring to its left is extended by x1 , so exerts a force of −kx1 . Here the minus
indicates pulling to the left. On the other hand, the spring to its right is
extended by x2 − x1 , so it exerts a force +k(x2 − x1 ). Here the plus indicates
pulling to the right. Adding the forces from either side, the total force on
m1 is −k(2x1 − x2 ). For m2 , the spring to its left exerts a force −k(x2 − x1 ),
and the spring to its right exerts a force −kx2 , so the total force on m2 is
−k(2x2 − x1 ). We obtain the force vector

−k \begin{pmatrix} 2x_1 - x_2 \\ -x_1 + 2x_2 \end{pmatrix} = −k \begin{pmatrix} 2 & -1 \\ -1 & 2 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}.
However, as you can see, the matrix here is not exactly Q(2).
(Figure 3.7: five masses m1 , . . . , m5 between walls, with displacements x1 , . . . , x5 .)
Repeating the same logic for the five masses in Figure 3.7, we obtain the force vector

−k \begin{pmatrix} 2x_1 - x_2 \\ -x_1 + 2x_2 - x_3 \\ -x_2 + 2x_3 - x_4 \\ -x_3 + 2x_4 - x_5 \\ -x_4 + 2x_5 \end{pmatrix} = −k \begin{pmatrix} 2 & -1 & 0 & 0 & 0 \\ -1 & 2 & -1 & 0 & 0 \\ 0 & -1 & 2 & -1 & 0 \\ 0 & 0 & -1 & 2 & -1 \\ 0 & 0 & 0 & -1 & 2 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \end{pmatrix}.
But, again, the matrix here is not Q(5). Notice, if we place one mass and
two springs in Figure 3.6, we obtain the 1 × 1 matrix 2.
To obtain Q(2) and Q(5), we place the springs along a circle, as in Figures
3.8 and 3.9. Now we have as many springs as masses. Repeating the same
logic, this time we obtain Q(2) and Q(5). Notice if we place one mass and
one spring in Figure 3.8, d = 1, we obtain the 1 × 1 matrix Q(1) = 0: There
is no force if we move a single mass around the circle, because the spring is
not being stretched.
(Figures 3.8 and 3.9: the masses m1 , . . . , m5 arranged on a circle, with as many springs as masses.)
p(t) = 2 − t − t^{d−1} ,

and let

v1 = (1, ω, ω², ω³, . . . , ω^{d−1}).
Then Qv1 is
Qv1 = \begin{pmatrix} 2 - \omega - \omega^{d-1} \\ -1 + 2\omega - \omega^2 \\ -\omega + 2\omega^2 - \omega^3 \\ \vdots \\ -\omega^{d-2} + 2\omega^{d-1} - 1 \end{pmatrix} = p(\omega) \begin{pmatrix} 1 \\ \omega \\ \omega^2 \\ \omega^3 \\ \vdots \\ \omega^{d-1} \end{pmatrix} = p(\omega) v_1.
vk = (1, ω^k , ω^{2k} , ω^{3k} , . . . , ω^{(d−1)k} ).

By (1.5.7), Qvk = p(ω^k )vk , and

p(ω^k ) = 2 − ω^k − ω^{−k} = 2 − 2 cos(2πk/d).

Thus the eigenvalues of Q(d) are 2 − 2 cos(2πk/d), k = 0, 1, . . . , d − 1.
Eigenvalues of Q(d)
Q(2) = (4, 0)
Q(3) = (3, 3, 0)
Q(4) = (4, 2, 2, 0)
Q(5) = (5/2 + √5/2, 5/2 + √5/2, 5/2 − √5/2, 5/2 − √5/2, 0)
Q(6) = (4, 3, 3, 1, 1, 0)
Q(8) = (4, 2 + √2, 2 + √2, 2, 2, 2 − √2, 2 − √2, 0)
Q(10) = (4, 5/2 + √5/2, 5/2 + √5/2, 3/2 + √5/2, 3/2 + √5/2,
         5/2 − √5/2, 5/2 − √5/2, 3/2 − √5/2, 3/2 − √5/2, 0)
Q(12) = (4, 2 + √3, 2 + √3, 3, 3, 2, 2, 1, 1, 2 − √3, 2 − √3, 0).
The matrices Q(d) are circulant matrices. Each row in Q(d) is obtained
from the row above by shifting the entries to the right. The trick of using
the roots of unity to compute the eigenvalues and eigenvectors works for any
circulant matrix.
Our last topic is the distribution of the eigenvalues for large d. How are
the eigenvalues scattered? Figure 3.10 plots the eigenvalues for Q(50) using
the code below.
from numpy import arange, sort, cos, pi
from numpy.linalg import eigh
from matplotlib.pyplot import stairs, scatter, legend, show

d = 50
# eigenvalues computed numerically
lamda = eigh(Q(d))[0]
stairs(lamda,range(d+1),label="numpy")
# eigenvalues from the formula 2 - 2cos(2*pi*k/d)
k = arange(d)
lamda = 2 - 2*cos(2*pi*k/d)
increasing = sort(lamda)
scatter(k,lamda,s=5,label="unordered")
scatter(k,increasing,c="red",s=5,label="increasing order")
legend()
show()
Figure 3.10 shows the eigenvalues tend to cluster near the top λ1 ∼ 4 and
the bottom λd = 0 for d large. Using the double-angle formula,
λk = 4 sin²(πk/d), k = 0, 1, 2, . . . , d − 1.
Solving for k/d in terms of λ, and multiplying by two to account for the double multiplicity, we obtain² the proportion of eigenvalues below the threshold λ,

#{k : λk ≤ λ}/d ≈ (2/π) arcsin(√λ/2), 0 ≤ λ ≤ 4. (3.2.13)
² This is an approximate equality: The ratio of the two sides approaches 1 as d → ∞.
This arcsine law is valid for a wide class of matrices, not just Q(d), as the
matrix size d grows without bound, d → ∞.
Equivalently, the derivative of the arcsine law (3.2.13) exhibits (see (7.1.9))
the eigenvalue clustering near the ends (Figure 3.11).
from numpy import arange, sqrt, pi
from matplotlib.pyplot import plot, text, show

lamda = arange(0.1,3.9,.01)
density = 1/(pi*sqrt(lamda*(4-lamda)))
plot(lamda,density)
# r"..." means raw string
f = r"$\displaystyle\frac1{\pi\sqrt{\lambda(4-\lambda)}}$"
text(.5,.45,f,usetex=True,fontsize="x-large")
show()
3.3 Singular Value Decomposition
A positive number σ > 0 is a singular value of the matrix A if there are unit vectors u and v satisfying

Av = σu, At u = σv.

When this happens, v is a right singular vector and u is a left singular vector
associated to σ.
Some books allow singular values to be zero. Here we insist that sin-
gular values be positive. Contrast singular values with eigenvalues: While
eigenvalues may be negative or zero, for us singular values are positive.
The definition immediately implies
For our first example, let

A = \begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix}.

Then Av = λv implies λ = 1 and v = (1, 0). Thus A has only one eigenvalue,
equal to 1. Set

Q = At A = \begin{pmatrix} 1 & 1 \\ 1 & 2 \end{pmatrix}.
Since Q is symmetric, Q has two eigenvalues λ1 , λ2 and corresponding eigen-
vectors v1 , v2 . Moreover, as we saw in an earlier section, v1 , v2 may be chosen
orthonormal.
The eigenvalues of Q are given by
0 = det(Q − λI) = λ2 − 3λ + 1.
Qv = At Av = At (σu) = σ 2 v. (3.3.2)
Thus v1 , u1 are right and left singular vectors corresponding to the singular
value σ1 of A. Similarly, if we set u2 = Av2 /σ2 , then v2 , u2 are right and left
singular vectors corresponding to the singular value σ2 of A.
0 = λ1 v1 ·v2 = Qv1 ·v2 = (At Av1 )·v2 = (Av1 )·(Av2 ) = σ1 u1 ·σ2 u2 = σ1 σ2 u1 ·u2 .
A Versus Q

Let A be any matrix. Then A and Q = At A have the same row space and the same null space, hence the same rank, and the singular values of A are the square roots of the positive eigenvalues of Q.

Since the rank equals the dimension of the row space, the first part follows
from §2.4.
Qv = At Av = At (σu) = σAt u = σ²v,

so v is an eigenvector of At A corresponding to λ = σ² > 0. Conversely, if Qv = λv and λ > 0, then set σ = √λ and u = Av/σ. Then
Avk = σk uk , At uk = σk vk , k = 1, 2, . . . , r, (3.3.3)
and
Avk = 0, At uk = 0 for k > r.
The proof is very simple once we remember the rank of Q equals the
number of positive eigenvalues of Q. By the eigenvalue decomposition, there
is an orthonormal basis of the source space v1 , v2 , . . . and λ1 ≥ λ2 ≥ · · · ≥
λr > 0 such that Qvk = λk vk , k = 1, . . . , r, and Qvk = 0, k > r.
Setting σk = √λk and uk = Avk /σk , k = 1, . . . , r, as in our first example, we have (3.3.3), and, again as in our first example, u1 , u2 , . . . , ur are
orthonormal.
Assume A is N × d. Then the source space is Rd , and the target space
is RN . By construction, vr+1 , vr+2 , . . . , vd is an orthonormal basis for the
null space of A. Set u1 = Av1 /σ1 , u2 = Av2 /σ2 , . . . , ur = Avr /σr . Since
Avr+1 = 0, . . . , Avd = 0, u1 , u2 , . . . , ur is an orthonormal basis for the
column space of A.
Since the column space of A is the row space of At , the column space
of A is the orthogonal complement of the nullspace of At (2.7.6). Choose
ur+1 , ur+2 , . . . , uN any orthonormal basis for the nullspace of At . Then
{u1 , u2 , . . . , ur } and {ur+1 , ur+2 , . . . , uN } are orthogonal. From this, u1 , u2 ,
. . . , uN is an orthonormal basis for the target.
For our second example, let a and b be nonzero vectors, possibly of dif-
ferent sizes, and let A be the matrix
A = a ⊗ b, At = b ⊗ a.
Then
Av = (v · b)a = σu and At u = (u · a)b = σv.
Since the range of A equals span(a), the rank of A equals one.
Since σ > 0, v is a multiple of b and u is a multiple of a. If we write
v = tb and u = sa and plug in, we get

t|b|² a = σsa, s|a|² b = σtb,

so t|b|² = σs and s|a|² = σt. Multiplying these, σ² = |a|²|b|². Thus there is only one singular value of A, equal to |a| |b|. This is not
surprising since the rank of A is one.
In a similar manner, one sees the only singular value of the 1 × n matrix
A = a equals σ = |a|.
Our third example is
A = \begin{pmatrix} 0 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix}. (3.3.4)
Then
At = \begin{pmatrix} 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 \end{pmatrix}, \qquad Q = At A = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 \end{pmatrix}.
Here we have (N, d) = (6, 4), r = 3. In either case, S has the same shape
N × d as A.
Let U be the matrix with columns u1 , u2 , . . . , uN , and let V be the matrix
with rows v1 , v2 , . . . , vd . Then V t has columns v1 , v2 , . . . , vd .
AV t = U S.

Since V t V = I, multiplying on the right by V gives

A = U SV.
Summarizing,
Diagonalization (SVD)

Every N × d matrix A can be written as A = U SV , where U and V are orthogonal and S is an N × d diagonal matrix whose diagonal entries are the singular values of A.

In Python,

from numpy.linalg import svd

U, sigma, V = svd(A)
# sigma is a vector of singular values
print(U.shape, sigma.shape, V.shape)
print(U, sigma, V)
Given the relation between the singular values of A and the eigenvalues
of Q = At A, we also can conclude
from numpy import mean, cov, dot
from numpy.linalg import svd, eigh

# center dataset
m = mean(dataset,axis=0)
A = dataset - m
# rows of V are right
# singular vectors of A
V = svd(A)[2]
# either normalization works here: scaling Q
# does not change its eigenvectors
Q = cov(dataset.T,bias=False)
Q = cov(dataset.T,bias=True)
# columns of U are
# eigenvectors of Q
U = eigh(Q)[1]
# compare columns of U
# and rows of V
U, V
returns
U = \begin{pmatrix} 0.36 & -0.66 & -0.58 & 0.32 \\ -0.08 & -0.73 & 0.6 & -0.32 \\ 0.86 & 0.18 & 0.07 & -0.48 \\ 0.36 & 0.07 & 0.55 & 0.75 \end{pmatrix}, \qquad V = \begin{pmatrix} 0.36 & -0.08 & 0.86 & 0.36 \\ -0.66 & -0.73 & 0.18 & 0.07 \\ 0.58 & -0.6 & -0.07 & -0.55 \\ 0.32 & -0.32 & -0.48 & 0.75 \end{pmatrix}.
This shows the columns of U are identical to the rows of V , except for the
third column of U , which is the negative of the third row of V .
and
A+ uk = 0, (A+ )t vk = 0 for k > r.
Qvk = λk vk , k = 1, . . . , d.
λ1 ≥ λ2 ≥ · · · ≥ λd ,
in PCA one takes the most significant components, those components whose
eigenvalues are near the top eigenvalue. For example, one can take the top
two eigenvalues λ1 ≥ λ2 and their eigenvectors v1 , v2 , and project the dataset
onto the plane span(v1 , v2 ). The projected dataset can then be visualized
as points in the plane. Similarly, one can take the top three eigenvalues
λ1 ≥ λ2 ≥ λ3 and their eigenvectors v1 , v2 , v3 and project the dataset onto
the space span(v1 , v2 , v3 ). This can then be visualized as points in three
dimensions.
Recall the MNIST dataset consists of N = 60000 points in d = 784
dimensions. After we download the dataset,
dataset = train_X.reshape((60000,784))
labels = train_y
Q = cov(dataset.T)
totvar = Q.trace()
# eigenvalues in decreasing order
lamda = sort(eigh(Q)[0])[::-1]
# eigenvalues as a fraction of the total variance
percent = lamda/totvar
# cumulative sums
sums = cumsum(percent)
data = array([percent,sums])
print(data.T[:20].round(decimals=3))
d = len(lamda)
from matplotlib.pyplot import stairs
stairs(percent,range(d+1))
The left column in Figure 3.12 lists the top twenty eigenvalues as a per-
centage of their sum. For example, the top eigenvalue λ1 is around 10% of
the total variance. The right column lists the cumulative sums of the eigen-
values, so the third entry in the right column is the sum of the top three
eigenvalues, λ1 + λ2 + λ3 = 22.97%.
This results in Figures 3.12 and 3.13. Here we sort the array eig in
decreasing order, then we cumsum the array to obtain the cumulative sums.
Because the rank of the MNIST dataset is 712 (§2.9), the bottom 72 =
784 − 712 eigenvalues are exactly zero. A full listing shows that many more
eigenvalues are near zero, and the second column in Figure 3.12 shows the
top ten eigenvalues alone sum to almost 50% of the total variance.
def pca(dataset,n):
    Q = cov(dataset.T)
    # columns of U are
    # eigenvectors of Q
    lamda, U = eigh(Q)
    # decreasing eigenvalue sort
    order = lamda.argsort()[::-1]
    # the top n columns of U, in sorted order,
    # become the columns of V
    V = U[:,order[:n]]
    P = dot(V,V.T)
    return P
In the code, lamda is sorted in decreasing order, and the sorting order is
saved as order. To obtain the top n eigenvectors, we sort the first n columns
U[:,order[:n]] in the same order, resulting in the d × n matrix V . The
code then returns the projection matrix P = V V t (2.7.4).
Instead of working with the covariance Q, as discussed at the start of
the section, we can work directly with the dataset, using svd, to obtain the
eigenvectors.
def pca_with_svd(dataset,n):
    # center dataset
    m = mean(dataset,axis=0)
    vectors = dataset - m
    # rows of V are
    # right singular vectors
    V = svd(vectors)[2]
    # no need to sort, already in decreasing order
    U = V[:n].T # top n rows as columns
    P = dot(U,U.T)
    return P
Let v = dataset[1] be the second image in the MNIST dataset, and let
Q be the covariance of the dataset. Then the code below returns the image
compressed down to n = 784, 600, 350, 150, 50, 10, 1 dimensions, returning
Figure 1.4.
figure(figsize=(10,5))
# eight subplots: the original image, then seven projections
rows, cols = 2, 4
subplot(rows,cols,1)
imshow(reshape(v,(28,28)),cmap="gray_r")
for i, n in enumerate([784,600,350,150,50,10,1], start=2):
    projv = dot(pca(dataset,n), v)
    A = reshape(projv,(28,28))
    subplot(rows,cols,i)
    imshow(A,cmap="gray_r")
show()
If you run out of memory trying this code, cut down the dataset from
60,000 points to 10,000 points or fewer. The code works with pca or with
pca_with_svd.
from sklearn.decomposition import PCA

N = len(dataset)
n = 10
engine = PCA(n_components = n)
reduced = engine.fit_transform(dataset)
reduced.shape
and returns (N, n) = (60000, 10). The following code computes the projected
dataset
projected = engine.inverse_transform(reduced)
projected.shape
Figure 3.14: Original and projections: n = 784, 600, 350, 150, 50, 10, 1.
figure(figsize=(10,5))
# eight subplots
rows, cols = 2, 4
Now we project all vectors of the MNIST dataset onto two and three dimensions.
grid()
legend(loc='upper right')
show()
grid()
legend(loc='upper right')
show()
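The scatter plots referred to here can be produced along the following lines; this is a minimal sketch, assuming dataset and labels are the arrays defined earlier, with marker sizes and colors left as arbitrary choices.

from numpy import cov, dot
from numpy.linalg import eigh
from matplotlib.pyplot import scatter, grid, legend, show

Q = cov(dataset.T)
lamda, U = eigh(Q)
order = lamda.argsort()[::-1]
V = U[:,order[:2]]            # top two eigenvectors as columns
reduced = dot(dataset, V)     # N x 2 coordinates

for digit in range(10):
    pts = reduced[labels == digit]
    scatter(pts[:,0], pts[:,1], s=1, label=str(digit))

grid()
legend(loc='upper right')
show()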
%matplotlib notebook
from matplotlib.pyplot import *
from mpl_toolkits import mplot3d
P = axes(projection='3d')
P.set_axis_off()
legend(loc='upper right')
show()
The three dimensional plot of the complete MNIST dataset is Figure 1.5
in §1.2. The command %matplotlib notebook allows the figure to be rotated
and scaled.
Given a dataset and a collection of cluster means, each point is assigned to the nearest mean. Here is the code.

from numpy.linalg import norm

def nearest_index(x,means):
    # index of the mean closest to x
    i = 0
    for j,m in enumerate(means):
        n = means[i]
        if norm(x - m) < norm(x - n): i = j
    return i

def assign_clusters(dataset,means):
    clusters = [ [] for m in means ]
    for x in dataset:
        i = nearest_index(x,means)
        clusters[i].append(x)
    return [ c for c in clusters if len(c) > 0 ]

def update_means(clusters):
    return [ mean(c,axis=0) for c in clusters ]
from numpy import array
from numpy.random import random

d = 2
k,N = 7,100

def random_vector(d):
    return array([ random() for _ in range(d) ])

close_enough = False
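The iteration itself is not shown here; a minimal sketch of a loop built from the helpers above follows (the stopping rule and the tolerance are assumptions).

from numpy.linalg import norm

means = [ random_vector(d) for _ in range(k) ]
dataset = [ random_vector(d) for _ in range(N) ]

while not close_enough:
    clusters = assign_clusters(dataset,means)
    new_means = update_means(clusters)
    # stop once the number of clusters and the means themselves stop changing
    if len(new_means) == len(means):
        close_enough = all(norm(m - n) < 1e-6 for m,n in zip(means,new_means))
    means = new_means
    print([ len(c) for c in clusters ])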
This code returns the sizes of the clusters after each iteration. Here is code
that plots a cluster.
def plot_cluster(mean,cluster,color,marker):
    for v in cluster:
        scatter(v[0],v[1], s=50, c=color, marker=marker)
    scatter(mean[0], mean[1], s=100, c=color, marker='*')
d = 2
k,N = 7,100

def random_vector(d):
    return array([ random() for _ in range(d) ])

close_enough = False
figure(figsize=(4,4))
grid()
for v in dataset: scatter(v[0],v[1],s=20,c='black')
show()
Chapter 4

Counting
Some of the material in this chapter is first seen in high school. Because
repeating the exposure leads to a deeper understanding, we review it in a
manner useful to the later chapters.
Why are there six possibilities? Because there are three ways of choosing
the first ball, then two ways of choosing the second ball, then one way of
choosing the third ball, so the total number of ways is
6 = 3 × 2 × 1.
n! = n × (n − 1) × (n − 2) × · · · × 2 × 1.
Notice also
(n + 1)! = (n + 1) × n × (n − 1) × · · · × 2 × 1 = (n + 1) × n!,
Permutations of n Objects
The number of ways of selecting n objects from a collection of n distinct
objects is n!.
We also have
1! = 1, 0! = 1.
It’s clear that 1! = 1. It’s less clear that 0! = 1, but it’s reasonable if you
think about it: The number of ways of selecting from zero balls results in
only one possibility — no balls.
More generally, we can consider the selection of k balls from a bag con-
taining n distinct balls. There are two varieties of selections that can be
made: Ordered selections and unordered selections. An ordered selection is
a permutation, and an unordered selection is a combination. In particular,
when k = n, n! is the number of ways of permuting n objects.
Notice P (x, k) is defined for any real number x by the same formula,
C(n, k) = P (n, k)/k! = n!/((n − k)! k!).
For example,
5×4
P (5, 2) = 5 × 4 = 20, C(5, 2) = = 10,
2×1
so we have twenty ordered pairs
(1, 2), (1, 3), (1, 4), (1, 5), (2, 1), (2, 3), (2, 4), (2, 5), (3, 1), (3, 2),
(3, 4), (3, 5), (4, 1), (4, 2), (4, 3), (4, 5), (5, 1), (5, 2), (5, 3), (5, 4)
{1, 2}, {1, 3}, {1, 4}, {1, 5}, {2, 3}, {2, 4}, {2, 5}, {3, 4}, {3, 5}, {4, 5}.
1, 2, 3, . . . , n − 1, n,

each of which is at most n, so

n! < nⁿ.

However, because half of the factors are less than n/2, we expect an approximation smaller than nⁿ, maybe something like (n/2)ⁿ or (n/3)ⁿ.
To be systematic about it, assume an approximation of the form1
n! ∼ e (n/e)ⁿ, for n large, (4.1.1)
for some constant e. We seek the best constant e that fits here. In this
approximation, we multiply by e so that (4.1.1) is an equality when n = 1.
Using the binomial theorem, in §4.4 we show
3 (n/3)ⁿ ≤ n! ≤ 2 (n/2)ⁿ, n ≥ 1. (4.1.2)
Based on this, a constant e satisfying (4.1.1) must lie between 2 and 3,
2 ≤ e ≤ 3.
To figure out the best constant e to pick, we see how much both sides
of (4.1.1) increase when we replace n by n + 1. Write (4.1.1) with n + 1
replacing n, obtaining
n+1
n+1
(n + 1)! ∼ e . (4.1.3)
e
Dividing the left sides of (4.1.1), (4.1.3) yields
(n + 1)!
= (n + 1).
n!
Dividing the right sides yields
e((n + 1)/e)^{n+1} / (e(n/e)ⁿ) = (n + 1) · (1/e) · (1 + 1/n)ⁿ. (4.1.4)
To make these quotients match as closely as possible, we should choose
e ∼ (1 + 1/n)ⁿ. (4.1.5)
Choosing n = 1, 2, 3, . . . , 100, . . . results in
4.2 Graphs
A graph consists of nodes and edges. For example, the graphs in Figure 4.2
each have four nodes and three edges. The left graph is directed, in that a
direction is specified for each edge. The graph on the right is undirected, no
direction is specified.
(Figure: a weighted directed graph, with edge weights −3, 7.4, 2, and 0.)
Let wij be the weight on the edge (i, j) in a weighted directed graph. The
weight matrix of a weighted directed graph is the matrix W = (wij ).
If the graph is unweighted, then we set A = (aij ), where
aij = 1 if i and j are adjacent, and aij = 0 if not.
In this case, A consists of ones and zeros, and is called the adjacency matrix.
If the graph is also undirected, then the adjacency matrix is symmetric,
aij = aji .
When m = 0, there are no edges, and we say the graph is empty. When
m = n(n − 1)/2, there are the maximum number of edges, and we say the
graph is complete. The complete graph with n nodes is written Kn (Figure
4.5).
The cycle graph Cn with n nodes is as in Figure 4.5. The graph Cn has
n edges. The cycle graph C3 is a triangle.
d1 ≥ d2 ≥ d3 ≥ · · · ≥ dn
(d1 , d2 , d3 , . . . , dn )
Handshaking Lemma
If the order is n, the size is m, and the degrees are d1 , d2 , . . . , dn , then
d1 + d2 + · · · + dn = \sum_{k=1}^{n} dk = 2m.
To see this, we consider two cases. First case, assume there are no isolated
nodes. Then the degree sequence is
n − 1 ≥ d1 ≥ d2 ≥ · · · ≥ dn ≥ 1.
n − 2 ≥ d1 ≥ d2 ≥ . . . dn−1 ≥ 1.
A graph is regular if all the node degrees are equal. If the node degrees are
all equal to k, we say the graph is k-regular. From the handshaking lemma,
for a k-regular graph, we have kn = 2m, so
m = kn/2.
For example, because 2m is even, there are no 3-regular graphs with 11 nodes.
Both Kn and Cn are regular, with Kn being (n − 1)-regular, and Cn being
2-regular.
A walk on a graph is a sequence of nodes v1 , v2 , v3 , . . . where each
consecutive pair vi , vi+1 of nodes are adjacent. For example, if v1 , v2 , v3 ,
v4 , v5 , v6 are the nodes (in any order) of the complete graph K6 , then v1 →
v2 → v3 → v4 → v2 is a walk. A path is a walk with no backtracking: A
path visits each node at most once. A closed walk is a walk that ends where
it starts. A cycle is a closed walk with no backtracking.
Two nodes a and b are connected if there is a walk starting at a and ending
at b. If a and b are connected, then there is a path starting at a and ending
at b, since we can cut out the cycles of the walk. A graph is connected if every
two nodes are connected. A graph is disconnected if it is not connected. For
4.2. GRAPHS 201
For example, the empty graph has adjacency matrix given by the zero matrix.
Since our graphs are undirected, the adjacency matrix is symmetric.
Let 1 be the vector 1 = (1, 1, 1, . . . , 1). The adjacency matrix of the
complete graph Kn is the n×n matrix A with all ones except on the diagonal.
If I is the n × n identity matrix, then this adjacency matrix is
A=1⊗1−I
Notice there are ones on the sub-diagonal, and ones on the super-diagonal,
and ones in the upper-right and lower-left corners.
202 CHAPTER 4. COUNTING
For any adjacency matrix A, the sum of each row is equal to the degree
of the node corresponding to that row. This is the same as saying
A1 = (d1 , d2 , . . . , dn ).
A1 = k1,
v1 = 1 ≥ |vj |, j = 2, 3, . . . , n.
Since the sum a11 + a12 + · · · + a1n equals the degree d1 of node 1, this implies
Top Eigenvalue
For a k-regular graph, k is the top eigenvalue of the adjacency matrix
A.
A1 = (1 · 1)1 − 1 = n1 − 1 = (n − 1)1,
2 cos(2πk/n), k = 0, 1, 2, . . . , n − 1,
Ā = A(Ḡ) = 1 ⊗ 1 − I − A(G).
Now aik akj is either 0 or 1, and equals 1 exactly if there is a 2-step path from
i to j. Hence
Notice a 2-step walk between i and j is the same as a 2-step path between i
and j.
When i = j, (A²)ii is the number of 2-step walks connecting i to i, which is the degree di . Summing over i counts each edge twice, so we have

(1/2) trace(A²) = m = number of edges.
Similarly, (A3 )ij is the number of 3-step walks connecting i and j. Since
a 3-step walk from i to i is the same as a triangle, (A3 )ii is the number
of triangles in the graph passing through i. Since the trace is the sum of
the diagonal elements, trace(A3 ) counts the number of triangles. But this
overcounts by a factor of 3! = 6, since three labels may be rearranged in six
ways. Hence
(1/6) trace(A³) = number of triangles.
Connected Graph
Let A be the adjacency matrix. Then the graph is connected if for
every i ̸= j, there is a k with (Ak )ij > 0.
Hence P is orthogonal,
P P t = I, P −1 = P t .
4.2. GRAPHS 207
Using permutation matrices, we can say two graphs are isomorphic if their
adjacency matrices A, A′ satisfy

A′ = P AP −1 = P AP t

for some permutation matrix P .
A graph is bipartite if the nodes can be divided into two groups, with
adjacency only between nodes across groups. If we call the two groups even
and odd, then odd nodes are never adjacent to odd nodes, and even nodes
are never adjacent to even nodes.
The complete bipartite graph is the bipartite graph with maximum num-
ber of edges: Every odd node is adjacent to every even node. The complete
bipartite graph with n odd nodes with m even nodes is written Knm . Then
the order of Kmn is n + m.
Let a = (1, 1, . . . , 1, 0, 0, . . . , 0) be the vector with n ones and m zeros,
and let b = 1 − a. Then b has n zeros and m ones, and the adjacency matrix
of Knm is
A = A(Knm ) = a ⊗ b + b ⊗ a.
208 CHAPTER 4. COUNTING
Recall we have
(a ⊗ b)v = (b · v)a.
From this, we see the column space of A = a⊗b+b⊗a is span(a, b). Thus the
rank of A is 2, and the nullspace of A consists of the orthogonal complement
span(a, b)⊥ of span(a, b). Using this, we compute the eigenvalues of A.
Since the nullspace is span(a, b)⊥ , any vector orthogonal to a and to b
is an eigenvector for λ = 0. Hence the eigenvalue λ = 0 has multiplicity
n + m − 2. Since trace(A) = 0, the sum of the eigenvalues is zero, and the
remaining two eigenvalues are ±λ ̸= 0.
Let v be an eigenvector for λ ̸= 0. Then v is orthogonal to the nullspace
of A, so v must be a linear combination of a and b, v = ra+sb. Since a·b = 0,
Aa = nb, Ab = ma.
Hence
λv = Av = A(ra + sb) = rnb + sma.
4.2. GRAPHS 209
Applying A again,

λ²v = A(λv) = rnAb + smAa = nm(ra + sb) = nm v,

so λ² = nm and the nonzero eigenvalues are λ = ±√(nm). For example, for the graph in Figure 4.8, the nonzero eigenvalues are λ = ±√(3 × 5) = ±√15.
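Assuming the graph in Figure 4.8 is the complete bipartite graph with n = 3 and m = 5, this can be checked numerically; a sketch:

from numpy import zeros, outer, sort
from numpy.linalg import eigvalsh

n, m = 3, 5
a = zeros(n + m); a[:n] = 1        # n ones followed by m zeros
b = 1 - a
A = outer(a,b) + outer(b,a)        # adjacency matrix of K_{3,5}
print(sort(eigvalsh(A)))           # -sqrt(15), six zeros, sqrt(15)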
L = B t B.
Both the laplacian matrix and the adjacency matrix are n × n. What is the
connection between them?
Laplacian
The laplacian satisfies
L = D − A,
where D = diag(d1 , d2 , . . . , dn ) is the diagonal degree matrix.
For example, for the cycle graph C6 , the degree matrix is 2I, and the
laplacian is the matrix we saw in §3.2,
L = Q(6) = \begin{pmatrix} 2 & -1 & 0 & 0 & 0 & -1 \\ -1 & 2 & -1 & 0 & 0 & 0 \\ 0 & -1 & 2 & -1 & 0 & 0 \\ 0 & 0 & -1 & 2 & -1 & 0 \\ 0 & 0 & 0 & -1 & 2 & -1 \\ -1 & 0 & 0 & 0 & -1 & 2 \end{pmatrix}.
Similarly,
Thus
(a + x)² = a² + 2ax + x²
(a + x)³ = a³ + 3a²x + 3ax² + x³
(a + x)⁴ = a⁴ + 4a³x + 6a²x² + 4ax³ + x⁴ (4.3.4)
(a + x)⁵ = ⋆a⁵ + ⋆a⁴x + ⋆a³x² + ⋆a²x³ + ⋆ax⁴ + ⋆x⁵.
and

\binom{3}{0} = 1, \binom{3}{1} = 3, \binom{3}{2} = 3, \binom{3}{3} = 1

and

\binom{4}{0} = 1, \binom{4}{1} = 4, \binom{4}{2} = 6, \binom{4}{3} = 4, \binom{4}{4} = 1

and

\binom{5}{0} = ⋆, \binom{5}{1} = ⋆, \binom{5}{2} = ⋆, \binom{5}{3} = ⋆, \binom{5}{4} = ⋆, \binom{5}{5} = ⋆.

With this notation, the number

\binom{n}{k} (4.3.5)
is the coefficient of an−k xk when you multiply out (a + x)n . This is the
binomial coefficient. Here n is the degree of the binomial, and k, which
specifies the term in the resulting sum, varies from 0 to n (not 1 to n).
It is important to remember that, in this notation, the binomial (a + x)2
expands into the sum of three terms a2 , 2ax, x2 . These are term 0, term
1, and term 2. Alternatively, one says these are the zeroth term, the first
term, and the second term. Thus the second term in the expansion of the
binomial (a + x)⁴ is 6a²x², and the binomial coefficient \binom{4}{2} = 6. In general,
the binomial (a + x)ⁿ of degree n expands into a sum of n + 1 terms.
Since the binomial coefficient \binom{n}{k} is the coefficient of a^{n−k}x^k when you multiply out (a + x)ⁿ, we have the
Binomial Theorem
The binomial (a + x)ⁿ equals

\binom{n}{0} aⁿ + \binom{n}{1} a^{n−1}x + \binom{n}{2} a^{n−2}x² + · · · + \binom{n}{n−1} ax^{n−1} + \binom{n}{n} xⁿ. (4.3.6)
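The theorem can be checked symbolically for small n; for example, with sympy (a sketch),

from sympy import symbols, expand

a, x = symbols('a x')
print(expand((a + x)**4))
# a**4 + 4*a**3*x + 6*a**2*x**2 + 4*a*x**3 + x**4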
For example, the term 42 a2 x2 corresponds to choosing two a’s, and two x’s,
In Pascal’s triangle, the very top row has one number in it: This is the
zeroth row corresponding to n = 0 and the binomial expansion of (a+x)0 = 1.
The first row corresponds to n = 1; it contains the numbers (1, 1), which
correspond to the binomial expansion of (a + x)1 = 1a + 1x. We say the
zeroth entry (k = 0) in the first row (n = 1) is 1 and the first entry (k = 1)
in the first row is 1. Similarly, the zeroth entry (k = 0) in the second row
(n = 2) is 1, and the second entry (k = 2) in the second row (n = 2) is 1.
The second entry (k = 2) in the fourth row (n = 4) is 6. For every row, the
entries are counted starting from k = 0, and end with k = n, so there are
n + 1 entries in row n. With this understood, the k-th entry in the n-th row
is the binomial coefficient n-choose-k. So 10-choose-2 is
\binom{10}{2} = 45.
We can learn a lot about the binomial coefficients from this triangle.
First, we have 1’s all along the left edge. Next, we have 1’s all along the
right edge. Similarly, one step in from the left or right edge, we have the row
number. Thus we have
\binom{n}{0} = 1 = \binom{n}{n}, \qquad \binom{n}{1} = n = \binom{n}{n−1}, \qquad n ≥ 1.
Note also Pascal’s triangle has a left-to-right symmetry: If you read off
the coefficients in a particular row, you can’t tell if you’re reading them from
left to right, or from right to left. It’s the same either way: The fifth row is
(1, 5, 10, 10, 5, 1). In terms of our notation, this is written
\binom{n}{k} = \binom{n}{n−k}, 0 ≤ k ≤ n;
Let's work this out when n = 3. Then the left side is (a + x)⁴. From (4.3.4), we get

\binom{4}{0} a⁴ + \binom{4}{1} a³x + \binom{4}{2} a²x² + \binom{4}{3} ax³ + \binom{4}{4} x⁴
= (a + x) \left( \binom{3}{0} a³ + \binom{3}{1} a²x + \binom{3}{2} ax² + \binom{3}{3} x³ \right)
= \binom{3}{0} a⁴ + \binom{3}{1} a³x + \binom{3}{2} a²x² + \binom{3}{3} ax³
+ \binom{3}{0} a³x + \binom{3}{1} a²x² + \binom{3}{2} ax³ + \binom{3}{3} x⁴
= \binom{3}{0} a⁴ + \left( \binom{3}{1} + \binom{3}{0} \right) a³x + \left( \binom{3}{2} + \binom{3}{1} \right) a²x²
+ \left( \binom{3}{3} + \binom{3}{2} \right) ax³ + \binom{3}{3} x⁴.
This allows us to build Pascal’s triangle (Figure 4.9), where, apart from
the ones on either end, each term (“the child”) in a given row is the sum of
the two terms (“the parents”) located directly above in the previous row.
We conclude the sum of the binomial coefficients along the n-th row of Pascal's triangle is 2ⁿ (remember n starts from 0).
Now insert x = 1 and a = −1. You get

0 = \binom{n}{0} − \binom{n}{1} + \binom{n}{2} − · · · ± \binom{n}{n−1} ± \binom{n}{n}.
Hence: the alternating2 sum of the binomial coefficients along the n-th row
of Pascal’s triangle is zero.
We now show
2
Alternating means the plus-minus pattern + − + − + − . . . .
Binomial Coefficient
The binomial coefficient \binom{n}{k} equals C(n, k),

\binom{n}{k} = \frac{n · (n − 1) · · · · · (n − k + 1)}{1 · 2 · · · · · k} = \frac{n!}{k!(n − k)!}, 1 ≤ k ≤ n. (4.3.10)

C(n, k) + C(n, k − 1) = \frac{n!}{k!(n − k)!} + \frac{n!}{(k − 1)!(n − k + 1)!}
= \frac{n!}{(k − 1)!(n − k)!} \left( \frac{1}{k} + \frac{1}{n − k + 1} \right)
= \frac{n!(n + 1)}{(k − 1)!(n − k)! \, k(n − k + 1)}
= \frac{(n + 1)!}{k!(n + 1 − k)!} = C(n + 1, k).
The formula (4.3.10) is easy to remember: There are k terms in the numerator
as well as the denominator, the factors in the denominator increase starting
from 1, and the factors in the numerator decrease starting from n.
In Python, the code
comb(n,k)
comb(n,k,exact=True)
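Here comb can be taken from scipy.special; a quick check, with illustrative values,

from scipy.special import comb

print(comb(5, 2))               # 10.0, a float
print(comb(5, 2, exact=True))   # 10, an exact integer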
The binomial coefficient nk makes sense even for fractional n. This can
Rewriting this by pulling out the first two terms k = 0 and k = 1 leads to
\left(1 + \frac{1}{n}\right)^n = 1 + 1 + \sum_{k=2}^{n} \frac{1}{k!} \left(1 − \frac{1}{n}\right)\left(1 − \frac{2}{n}\right) \cdots \left(1 − \frac{k−1}{n}\right). (4.4.1)
From (4.4.1), we can tell a lot. First, since all terms are positive, we see
(1 + 1/n)ⁿ ≥ 2, n ≥ 1.
By (4.4.3), we arrive at
2 ≤ (1 + 1/n)ⁿ ≤ 3, n ≥ 1. (4.4.4)
Summarizing, we established the following strengthening of (4.1.5).
Euler’s Constant
The limit

e = lim_{n→∞} (1 + 1/n)ⁿ (4.4.5)
exists and satisfies 2 ≤ e ≤ 3.
Since we’ve shown bn increases faster than an , and cn increases faster than
bn , we have derived (4.1.2).
To summarize,
Euler’s Constant
Euler’s constant satisfies
e = \sum_{k=0}^{∞} \frac{1}{k!} = 1 + 1 + \frac12 + \frac16 + \frac1{24} + \frac1{120} + \frac1{720} + . . .
Depositing one dollar in a bank offering 100% interest returns two dollars
after one year. Depositing one dollar in a bank offering the same annual
interest compounded at mid-year returns
(1 + 1/2)² = 2.25

dollars after one year.
Depositing one dollar in a bank offering the same annual interest com-
pounded at n intermediate time points returns (1 + 1/n)n dollars after one
year.
Passing to the limit, depositing one dollar in a bank and continuously
compounding at an annual interest rate of 100% returns e dollars after one
year. Because of this, (4.4.5) is often called the compound-interest formula.
Exponential Function
For any real number x, the limit
exp x = lim_{n→∞} (1 + x/n)ⁿ (4.4.6)
exists. In particular, exp 0 = 1 and exp 1 = e.
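To watch the limit (4.4.6) numerically at x = 1, here is a small sketch; the values of n are arbitrary choices.

for n in [1, 10, 100, 1000, 10000, 100000]:
    print(n, (1 + 1/n)**n)
# the printed values increase toward e = 2.71828...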
preceding one,
(1 − x) = 1 − x
(1 − x)² = 1 − 2x + x² ≥ 1 − 2x
(1 − x)³ = (1 − x)(1 − x)² ≥ (1 − x)(1 − 2x) = 1 − 3x + 2x² ≥ 1 − 3x
(1 − x)⁴ = (1 − x)(1 − x)³ ≥ (1 − x)(1 − 3x) = 1 − 4x + 3x² ≥ 1 − 4x
... ...
This shows the limit exp x in (4.4.6) is well-defined when x < 0, and
1
exp(−x) = , for all x.
exp x
Exponential Series
The exponential function is always positive and satisfies, for every real
number x,
exp x = \sum_{k=0}^{∞} \frac{x^k}{k!} = 1 + x + \frac{x²}{2} + \frac{x³}{6} + \frac{x⁴}{24} + \frac{x⁵}{120} + \frac{x⁶}{720} + . . . (4.4.10)
Law of Exponents
For real numbers x and y,

exp x · exp y = exp(x + y).

To derive this, multiply out the product of two series

(a0 + a1 + a2 + a3 + . . . )(b0 + b1 + b2 + b3 + . . . )
Thus
\left( \sum_{k=0}^{∞} a_k \right) \left( \sum_{m=0}^{∞} b_m \right) = \sum_{n=0}^{∞} \left( \sum_{k=0}^{n} a_k b_{n−k} \right).
Now insert
a_k = \frac{x^k}{k!}, \qquad b_{n−k} = \frac{y^{n−k}}{(n − k)!}.
Then the n-th term in the resulting sum equals, by the binomial theorem,
\sum_{k=0}^{n} a_k b_{n−k} = \sum_{k=0}^{n} \frac{x^k}{k!} \frac{y^{n−k}}{(n − k)!} = \frac{1}{n!} \sum_{k=0}^{n} \binom{n}{k} x^k y^{n−k} = \frac{1}{n!} (x + y)^n.
Thus
exp x · exp y = \left( \sum_{k=0}^{∞} \frac{x^k}{k!} \right) \left( \sum_{m=0}^{∞} \frac{y^m}{m!} \right) = \sum_{n=0}^{∞} \frac{(x + y)^n}{n!} = exp(x + y).
Exponential Notation
For any real number x,
eˣ = exp x.

Chapter 5

Probability
[57, 49, 55, 44, 55, 50, 49, 50, 53, 49, 53, 50, 51, 53, 53, 54, 48, 51, 50, 53].
On the other hand, suppose someone else repeats the same experiment 20
times with a different coin, and obtains
[69, 70, 79, 74, 63, 70, 68, 71, 71, 73, 65, 63, 68, 71, 71, 64, 73, 70, 78, 67].
In this case, one suspects the two coins are statistically distinct, and have
different probabilities of obtaining heads.
In this section, we study how the probabilities of coin-tossing behave,
with the goal of answering the question: Is a given coin fair?
Prob(X2 = 1 | X1 = 0) = \frac{Prob(X1 = 0 and X2 = 1)}{Prob(X1 = 0)} = \frac{qp}{q} = p = Prob(X2 = 1),
so
P rob(X2 = 1 | X1 = 0) = P rob(X2 = 1).
Thus X1 = 0 has no effect on the probability that X2 = 1, and similarly for
the other possibilities. This is often referred to as the independence of the
coin tosses. We conclude
Independent Coin-Tossing
P rob(Xn = 1) = p, P rob(Xn = 0) = q = 1 − p, n ≥ 1.
P (X = a) = p, P (X = b) = q, P (X = c) = r.
230 CHAPTER 5. PROBABILITY
E(X) = ap + bq + cr.
For example,
E(Xn ) = 1 · p + 0 · (1 − p) = p,
Let
Sn = X1 + X2 + · · · + Xn .
5.1. BINOMIAL PROBABILITY 231
Since Xk = 1 when the k-th toss is heads, and Xk = 0 when the k-th toss is
tails, Sn is the number of heads in n tosses.
The mean of Sn is
which is the same as saying P rob(r < p < r +dr) = dr. By (5.1.6), we obtain
Prob(Sn = k) = \int_0^1 \binom{n}{k} r^k (1 − r)^{n−k} dr.
We now turn things around: Suppose we toss the coin n times, and obtain
k heads. How can we use this data to estimate the coin’s probability of heads
p?
To this end, we introduce the fundamental
Bayes Theorem
Prob(A | B) = \frac{Prob(B | A) · Prob(A)}{Prob(B)}. (5.1.8)
Prob(A | B) = \frac{Prob(A and B)}{Prob(B)} = \frac{Prob(A and B)}{Prob(A)} · \frac{Prob(A)}{Prob(B)} = Prob(B | A) · \frac{Prob(A)}{Prob(B)}.
Prob(p = r | Sn = k) = Prob(Sn = k | p = r) · \frac{Prob(p = r)}{Prob(Sn = k)}. (5.1.9)
Notice because of the extra factor (n + 1), this is not equal to (5.1.6).
In (5.1.6), p is fixed, and k is the variable. In (5.1.10), k is fixed, and r is
the variable. This a posteriori distribution for (n, k) = (10, 7) is plotted in
Figure 5.1. Notice this distribution is concentrated about k/n = 7/10 = .7.
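The plot below shows this density; here is a sketch of its definition together with the imports the plotting snippet uses (the function name f and the use of scipy's comb are assumptions).

from numpy import arange
from scipy.special import comb
from matplotlib.pyplot import grid, plot, show

n, k = 10, 7
# the a posteriori density (5.1.10): (n+1) times the binomial probability
def f(r):
    return (n + 1) * comb(n, k) * r**k * (1 - r)**(n - k)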
grid()
X = arange(0,1,.01)
plot(X,f(X),color="blue",linewidth=.5)
show()
Because Bayes Theorem is so useful, here are two alternate forms. First, since Prob(B) = Prob(B | A) Prob(A) + Prob(B | Aᶜ) Prob(Aᶜ),

Prob(A | B) = \frac{Prob(B | A) Prob(A)}{Prob(B | A) Prob(A) + Prob(B | Aᶜ) Prob(Aᶜ)}. (5.1.11)
Let
p = σ(z) = \frac{1}{1 + e^{−z}}. (5.1.12)
This is the logistic function or sigmoid function (Figure 5.2). The logistic
function takes as inputs real numbers z, and returns as outputs probabilities
p (Figure 5.3). Think of the input z as an activation energy, and the output
p as the probability of activation. In Python, σ is the expit function.

from scipy.special import expit

p = expit(z)
If the result is tails, select a point x at random with normal probability with
mean mT , or
Prob(x | T) ∼ e^{−|x − mT|²/2}.

This says the groups are centered around the points mH and mT respec-
tively.
Given a point x, what is the probability x is in the heads group? In other
words, what is
P rob(H | x)?
This question is begging for Bayes theorem.
Let
w = mH − mT , \qquad w0 = −\frac12 |mH|² + \frac12 |mT|².
Since P rob(H) = P rob(T ), here we have P rob(A) = P rob(Ac ). Inserting the
probabilities and simplifying leads to
log \frac{Prob(x | H) Prob(H)}{Prob(x | T) Prob(T)} = w · x + w0 . (5.1.14)
By (5.1.13), this leads to
P rob(H | x) = σ(w · x + w0 ).
5.2 Probability
A probability is often described as
the extent to which an event is likely to occur, measured by the
ratio of the favorable outcomes to the whole number of outcomes
possible.
We explain what this means by describing the basic terminology:
• An experiment is a procedure that yields an outcome, out of a set of
possible outcomes. For example, tossing a coin is an experiment that
yields one of two outcomes, heads or tails, which we also write as 1 or
0. Rolling a six-sided die yields outcomes 1, 2, 3, 4, 5, 6. Rolling two
six-sided dice yields 36 outcomes (1, 1), (1, 2),. . . . Flipping a coin three
times yields 23 = 8 outcomes
or
000, 001, 010, 011, 100, 101, 110, 111.
(1, 6), (2, 5), (3, 4), (4, 3), (5, 2), (6, 1),
#(E) = 35, which is the number of ways you can choose three things
out of seven things:
#(E) = 7-choose-3 = \binom{7}{3} = \frac{7 · 6 · 5}{1 · 2 · 3} = 35.
1. 0 ≤ P rob(s) ≤ 1,
2. The sum of the probabilities of all outcomes equals one.
• Outcomes are equally likely when they have the same probability. When
this is so, we must have
Prob(E) = \frac{#(E)}{#(S)}.
For example,
1. A coin is fair if the outcomes are equally likely. For one toss of a
fair coin, P rob(heads) = 1/2.
2. More generally, tossing a coin results in outcomes
P rob(head) = p, P rob(tail) = 1 − p,
from numpy.random import binomial

p = .5
n = 10
N = 20
v = binomial(n,p,N)
print(v)
returns
[9 6 7 4 4 4 3 3 7 5 6 4 6 9 4 5 4 7 6 7]
p = .5
for n in [5,50,500]: print(binomial(n,p,1))
This returns the count of heads after 5 tosses, 50 tosses, and 500 tosses,
Figure 5.4: 100,000 sessions, with 5, 15, 50, and 500 tosses per session.
3, 28, 266
The proportions are the count divided by the total number of tosses in
the experiment. For the above three experiments, the proportions after 5
tosses, 50 tosses, and 500 tosses, are
Now we repeat each experiment 100,000 times and we plot the results in
a histogram.
N = 100000
p = .5
for n in [5,50,500]:
data = binomial(n,p,N)
hist(data,bins=n,edgecolor ='Black')
grid()
show()
The takeaway from these graphs is the two fundamental results of probability:

1. Law of Large Numbers. For large sample size, the proportion of heads is close to the probability of heads.

2. Central Limit Theorem. For large sample size, the shape of the
graph of the proportions or counts is approximately normal. The nor-
mal distribution is studied in §5.4. Another way of saying this is: For
large sample size, the shape of the sample mean histogram is approxi-
mately normal.
The law of large numbers is qualitative and the central limit theorem
is quantitative. While the law of large numbers says one thing is close to
another, it does not say how close. The central limit theorem provides a
numerical measure of closeness, using the normal distribution.
Roll two six-sided dice. Let A be the event that at least one dice is an
even number, and let B be the event that the sum is 6. Then
A = {(2, ∗), (4, ∗), (6, ∗), (∗, 2), (∗, 4), (∗, 6)} .
B = {(1, 5), (2, 4), (3, 3), (4, 2), (5, 1)} .
The intersection of A and B is the event of outcomes in both events:
When we ask for the chance that X lies in the interval [a, b], we are asking for Prob(a <
X < b). If we don’t know anything about X, then we can’t figure out the
probability, and there is nothing we can say. Knowing something about X
means knowing the distribution of X: Where X is more likely to be and
where X is less likely to be. In effect, a random variable is a quantity X
whose probabilities P rob(a < X < b) can be computed.
For example, take the Iris dataset and let X be the petal length of an iris
(Figure 5.6) selected at random. Here the number of samples is N = 150.
from pandas import read_csv

df = read_csv("iris.csv")
petal_length = df["Petal_length"].to_numpy()
m = E(X) = \frac{1}{N} \sum_{k=1}^{N} x_k .

E(X²) = \frac{1}{N} \sum_{k=1}^{N} x_k² .
In general, given any function f (x), we have the mean of f (x1 ), f (x2 ), . . . ,
f (xN ),
E(f (X)) = \frac{1}{N} \sum_{k=1}^{N} f (x_k ). (5.3.1)
If we let

f (x) = 1 if 1 < x < 3, and f (x) = 0 otherwise,

then

E(f (X)) = \frac{1}{N} \sum_{k=1}^{N} f (x_k ) = \frac{#\{samples satisfying 1 < x_k < 3\}}{N}.
But this is the probability that a randomly selected iris has petal length X
between 1 and 3,
P rob(1 < X < 3) = E(f (X)),
To see how the iris petal lengths are distributed, we plot a histogram,
grid()
hist(petal_length,bins=20)
show()
from numpy.random import default_rng

rng = default_rng()

# n = batch_size
def random_batch_mean(n):
    rng.shuffle(petal_length)
    return mean(petal_length[:n])

random_batch_mean(5)
The five petal lengths are selected by first shuffling the petal lengths,
then selecting the first five petal_length[:5]. Now repeat this computation
100,000 times, for batch sizes 1, 5, 15, 50. The resulting histograms are in
Figure 5.8. Notice in the first subplot, the batch is size n = 1, so we recover
the base histogram Figure 5.7. Figure 5.8 is of course another illustration of
the central limit theorem.
N = 100000

for n in [1,5,15,50]:
    Xbar = [ random_batch_mean(n) for _ in range(N) ]
    hist(Xbar,bins=50)
    grid()
    show()
P rob(X = 1) = p, P rob(X = 0) = 1 − p,
Prob(X = x) = pˣ(1 − p)^{1−x} , x = 0, 1.

(Figure 5.9: the bernoulli distribution, with mass 1 − p at 0 and p at 1.)
5.10, when the probability P rob(a < X < b) is given by the green area in
Figure 5.10. Thus
Then the green areas in Figure 5.10 is the difference between two areas, hence
equal
cdfX (b) − cdfX (a).
For the bernoulli distribution in Figure 5.9, the cdf is in Figure 5.12.
Because the bernoulli random variable takes on only the values x = 0, 1,
these are the values where the cdf P rob(X ≤ x) jumps.
This is the population mean. It does not depend on a sampling of the popu-
lation.
For example, suppose the population consists of 100 balls, of which 30
are red, 20 are green, and 50 are blue. The cost of each ball is
X(ball) = $1 if the ball is red, $2 if green, $3 if blue.

Then

pred = Prob(red) = #(red)/#(balls) = 30/100 = .3,
pgreen = Prob(green) = #(green)/#(balls) = 20/100 = .2,
pblue = Prob(blue) = #(blue)/#(balls) = 50/100 = .5.

Then the average cost of a ball equals

E(X) = pred · 1 + pgreen · 2 + pblue · 3 = \frac{30 · 1 + 20 · 2 + 50 · 3}{100} = \frac{x1 + x2 + · · · + x100}{100}.
The variance is
Var(X) = E((X − µ)²) = p1 (x1 − µ)² + p2 (x2 − µ)² + p3 (x3 − µ)² + · · · = \sum_{k=1}^{N} p_k (x_k − µ)².
µ = E(X) = x1 p1 + x2 p2 = 1 · p + 0 · (1 − p) = p,

Var(X) = σ² = E((X − µ)²).

We conclude

E(X²) = µ² + σ² = (E(X))² + Var(X). (5.3.2)
Let X have mean µ and variance σ 2 , and write
Z = \frac{X − µ}{σ}.

Then

E(Z) = \frac{1}{σ} E(X − µ) = \frac{E(X) − µ}{σ} = \frac{µ − µ}{σ} = 0,

and

E(Z²) = \frac{1}{σ²} E((X − µ)²) = \frac{σ²}{σ²} = 1.
We conclude Z has mean zero and variance one.
A random variable is standard if its mean is zero and its variance is one.
The variable Z is the standardization of X. For example, the standardization
of the bernoulli random variable is
\frac{X − p}{\sqrt{p(1 − p)}}.
e^t = 1 + t + \frac{t²}{2!} + \frac{t³}{3!} + . . .

where t is any real number. The number e, Euler's constant (§4.4), is approximately 2.7, as can be seen from

e = e¹ = 1 + 1 + \frac{1}{2!} + \frac{1}{3!} + · · · = 1 + 1 + \frac12 + \frac16 + . . .
Since X has real values, so does tX, so etX is also a random variable.
The moment generating function is the mean of etX ,
M (t) = M_X (t) = E(e^{tX}) = 1 + tE(X) + \frac{t²}{2!} E(X²) + \frac{t³}{3!} E(X³) + . . .

For example, for the smartphone random variable X = 0, 1 with Prob(X = 1) = p, we have X² = X, X³ = X, . . . , so

M (t) = 1 + tE(X) + \frac{t²}{2!} E(X²) + \frac{t³}{3!} E(X³) + · · · = 1 + tp + \frac{t²}{2!} p + \frac{t³}{3!} p + . . .
which equals

M (t) = (1 − p) + pe^t .
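As a quick numerical check that the series above sums to (1 − p) + pe^t, here is a small sketch; the values of p and t are arbitrary.

from math import exp, factorial

p, t = 0.3, 1.7
series = 1 + sum(p * t**k / factorial(k) for k in range(1, 50))
print(series, (1 - p) + p*exp(t))   # the two printed values agree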
In §5.2, we discussed independence of events. Now we do the same for
random variables. Let X and Y be random variables. We say X and Y are
uncorrelated if the expectations multiply,
from sympy import symbols, solve

a,b,c = symbols('a,b,c')
eq1 = a + 2*b + c - 1
eq2 = a - b - (a-c)*(a+b)
solutions = solve([eq1,eq2],a,b)
print(solutions)
Since

M_{X+Y}(t) = \frac{1}{12} \sum_{k=1}^{12} e^{tk} = \frac{1}{12} \frac{e^{13t} − e^t}{e^t − 1},

we obtain

\frac{1}{12} \frac{e^{13t} − e^t}{e^t − 1} = M_Y(t) · \frac{1}{6} \frac{e^{7t} − e^t}{e^t − 1}.

Factoring e^{13t} − e^t = (e^{6t} + 1)(e^{7t} − e^t), we obtain

M_Y(t) = \frac12 (e^{6t} + 1).

This says

Prob(Y = 0) = \frac12 , \qquad Prob(Y = 6) = \frac12 ,
and all other probabilities are zero.
E(X n ) = E(Y n ), n ≥ 1.
X1 , X2 , . . . , Xn x1 , x2 , . . . , xn
mean is
X̄ = \frac{X1 + X2 + · · · + Xn}{n} = \frac{1}{n} \sum_{k=1}^{n} X_k .

Then

E(X̄) = \frac{1}{n} E(X1 + X2 + · · · + Xn ) = \frac{1}{n} (E(X1 ) + E(X2 ) + · · · + E(Xn )) = \frac{1}{n} · nµ = µ.
We conclude the mean of the sample mean equals the population mean.
Now let σ 2 be the common variance of X1 , X2 , . . . , Xn . Since σ 2 =
E(X 2 ) − E(X)2 , we have
E(Xk2 ) = µ2 + σ 2 .
When i ≠ j, by independence, E(Xi Xj ) = E(Xi )E(Xj ) = µ². Hence

E(X̄²) = \frac{1}{n²} \sum_{i,j} E(Xi Xj )
= \frac{1}{n²} \left( \sum_{i≠j} E(Xi Xj ) + \sum_k E(Xk²) \right)
= \frac{1}{n²} \left( n(n − 1)µ² + n(µ² + σ²) \right)
= µ² + \frac{1}{n} σ².
E(X̄) = µ and Var(X̄) = \frac{σ²}{n}.
Sn = X1 + X2 + · · · + Xn .
grid()
z = arange(mu-3*sdev,mu+3*sdev,.01)
Then
X ∼ N (µ, σ) ⟺ Z = \frac{X − µ}{σ} ∼ N (0, 1).
A normal distribution is a standard normal distribution when µ = 0 and
σ = 1.
Sn = X1 + X2 + · · · + Xn
we expect the chance that Z < 0 should equal 1/2. In other words, because
of the symmetry of the curve, we expect to be 50% confident that Z < 0, or
0 is at the 50-th percentile level. So
When
P rob(Z < z) = p,
we say z is the z-score z corresponding to the p-value p. Equivalently, we say
our confidence that Z < z is p, or the percentile of z equals 100p. In Python,
the relation between z and p (Figure 5.16) is specified by
p = Z.cdf(z)
z = Z.ppf(p)
ppf is the percentile point function, and cdf is the cumulative distribution
function.
In Figure 5.17, the red areas are the lower tail p-value P rob(Z < z), the
two-tail p-value P rob(|Z| > z), and the upper tail p-value P rob(Z > z).
and
Prob(|Z| < z) = Prob(−z < Z < z) = Prob(Z < z) − Prob(Z < −z),
and
P rob(Z > z) = 1 − P rob(Z < z).
To go backward, suppose we are given P rob(|Z| < z) = p and we want
to compute the cutoff z. Then Prob(|Z| > z) = 1 − p, so Prob(Z > z) = (1 − p)/2. This implies Prob(Z < z) = (1 + p)/2.
In Python,
In Python,
# p = P(|Z| < z)
z = Z.ppf((1+p)/2)
Now let’s zoom in closer to the graph and mark off 1, 2, 3 on the hor-
izontal axis to obtain specific colored areas as in Figure 5.18. These areas
are governed by the 68-95-99 rule (Table 5.19). Our confidence that |Z| < 1
equals the blue area 0.685, our confidence that |Z| < 2 equals the sum of
the blue plus green areas 0.955, and our confidence that |Z| < 3 equals the
sum of the blue plus green plus red areas 0.997. This is summarized in Table
5.19.
The possibility |Z| > 1 is called a 1-sigma event, |Z| > 2 a 2-sigma event,
and so on. So a 2-sigma event is 95.5% unlikely, or 4.5% likely. An event
is considered statistically significant if it’s a 2-sigma event or more. In other
words, something is significant if it’s unlikely. A six-sigma event |Z| > 6 is
2 in a billion. You want a plane crash to be six-sigma.
Figure 5.18: 68%, 95%, 99% confidence cutoffs for standard normal.
Figure 5.18 is not to scale, because a 1-sigma event should be where the
curve inflects from convex to concave (in the figure this happens closer to
2.7). Moreover, according to Table 5.19, the left-over white area should be
.03% (3 parts in 10,000), which is not what the figure suggests.
a = Z.ppf(.15)
b = Z.ppf(.9)
Here are three examples. In the first example, suppose student grades are
normally distributed with mean 80 and variance 16. This says the average
of all grades is 80, and the SD is 4. If a grade is g, the standardized grade is
z = \frac{g − 80}{4}.
rng = default_rng()
x = 70
mean, sdev = 80, 4
p = Z(mean,sdev).cdf(x)
for n in range(2,200):
    q = 1 - (1-p)**n
    print(n, q)
Here is the code for computing tail probabilities for the sample mean X̄
drawn from a normally distributed population with mean µ and standard
deviation σ. When n = 1, this applies to a single normal random variable.
########################
# P-values
########################

def pvalue(mean,sdev,n,xbar,type):
    Xbar = Z(mean,sdev/sqrt(n))
    if type == "lower-tail": p = Xbar.cdf(xbar)
    elif type == "upper-tail": p = 1 - Xbar.cdf(xbar)
    elif type == "two-tail": p = 2*(1 - Xbar.cdf(abs(xbar)))
    else:
        print("What's the tail type?")
        return
    print("type: ",type)
    print("mean,sdev,n,xbar: ",mean,sdev,n,xbar)
    print("p-value: ",p)
    z = sqrt(n) * (xbar - mean) / sdev
    print("z-score: ",z)
type = "upper-tail"
mean = 80
sdev = 4
n = 1
xbar = 90
pvalue(mean,sdev,n,xbar,type)
M_U (u) = E(e^{uU}) = \frac{1}{\sqrt{1 − 2u}}.

Since

\frac{1}{\sqrt{1 − 2u}} = E(e^{uU}) = \sum_{n=0}^{∞} \frac{u^n}{n!} E(U^n),

comparing coefficients of u^n/n! shows

E(U^n) = (−2)^n n! \binom{−1/2}{n}, n = 0, 1, 2, . . . (5.5.1)

M_U (t) = E(e^{tU}) = \frac{1}{(1 − 2t)^{d/2}}.
Going back to the question posed at the beginning of the section, we have
X and Y independent standard normal and we want
d = 2
u = 1
U(d).cdf(u)
returns 0.39.
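Here U(d) denotes the chi-squared distribution with d degrees of freedom; with scipy, the same number can be reproduced as follows (a sketch).

from scipy.stats import chi2

print(chi2(2).cdf(1))   # about 0.39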
3
Geometrically, P rob(U < 1) is the probability that a normally distributed point is
inside the unit sphere in d-dimensional space.
E(U²) = \sum_{k,ℓ=1}^{d} E(Z_k² Z_ℓ²)
= \sum_{k≠ℓ} E(Z_k²)E(Z_ℓ²) + \sum_{k=1}^{d} E(Z_k⁴)
= d(d − 1) · 1 + d · 3 = d² + 2d.
Because

\frac{1}{(1 − 2t)^{d/2}} \frac{1}{(1 − 2t)^{d′/2}} = \frac{1}{(1 − 2t)^{(d+d′)/2}},
we obtain
X = (X1 , X2 , . . . , Xn )
in Rn .
If X is a random vector in Rd , its mean is the vector
Qij = E(Xi Xj ), 1 ≤ i, j ≤ d.
is never negative.
A random vector X is normal with mean µ and variance Q if for every
vector w, w · X is normal with mean w · µ and variance w · Qw.
Then µ is the mean of X, and Q is the variance X. The random vector
X is standard normal if µ = 0 and Q = I.
From §5.3, we see
Z = (Z1 , Z2 , . . . , Zd )
From this, X and Y are independent when B = 0. Thus, for normal random
vectors, independence and uncorrelatedness are the same.
and
Q+ = U S + U t .
If we set Y = U t X = (Y1 , Y2 , . . . , Yd ), then
so X · v = 0.
It is easy to check Q3 = Q and Q2 is symmetric, so (§2.3) Q+ = Q. Since
X · v = 0,
X · Q+ X = X · QX = X · (X − (v · X)v) = |X|2 .
We conclude
Singular Chi-squared
We use the above to derive the distribution of the sample variance. Let
X1 , X2 , . . . , Xn be a random sample, and let X̄ be the sample mean,
X̄ = \frac{X1 + X2 + · · · + Xn}{n}.

Let S² be the sample variance,

S² = \frac{(X1 − X̄)² + (X2 − X̄)² + · · · + (Xn − X̄)²}{n − 1}. (5.5.3)
Since (n − 1)S 2 is a sum-of-squares similar to (5.5.2), we expect (n − 1)S 2
to be chi-squared. In fact this is so, but the degree is n − 1, not n. We will
show
Now let
X = Z − (Z · v)v = (Z1 − Z̄, Z2 − Z̄, . . . .Zn − Z̄).
E(X ⊗ X) = I − v ⊗ v.
Hence
(n − 1)S 2 = |X|2
is chi-squared with degree n − 1.
Now X and Z · v are uncorrelated, since
Chapter 6

Statistics
6.1 Estimation
In statistics, like any science, we start with a guess or an assumption or
hypothesis, then we take a measurement, then we accept or modify our
guess/assumption based on the result of the measurement. This is common
sense, and applies to everything in life, not just statistics.
For example, suppose you see a sign on the UNH campus saying
(Figure: the hypothesis testing flowchart — starting from a hypothesis H, take a sample and compute a p-value; if p > α, do not reject H; if p < α, reject H.)
Here is a geometric example. The null hypothesis and the alternate hy-
pothesis are
In §2.2, there is code (2.2) returning the angle angle(u,v) between two
vectors. To test this hypothesis, we run the code
6.1. ESTIMATION 285
from numpy.random import randn

N = 784
for _ in range(20):
    u = randn(N)
    v = randn(N)
    print(angle(u,v))
86.27806537791886
87.91436653824776
93.00098725550777
92.73766421951748
90.005139015804
87.99643434444482
89.77813370637857
96.09801014394806
90.07032573539982
89.37679070400239
91.3405728939376
86.49851399221568
87.12755619082597
88.87980905998855
89.80377324818076
91.3006921339982
91.43977096117017
88.52516224405458
86.89606919838387
90.49100744167357
from numpy.random import binomial

N = 784
for _ in range(20):
    u = binomial(1,.5,N)
    v = binomial(1,.5,N)
    print(angle(u,v))
59.43464627897324
59.14345748418916
60.31453922165891
60.38024365702492
59.24709660805488
59.27165957992343
61.21424657806321
60.55756381536082
61.59468919876665
61.33296028237481
60.03925473033243
60.25732069941224
61.77018692842784
60.672901794058326
59.628519516164666
59.41272458020638
58.43172340007064
59.863796136907744
59.45156367988921
59.95835532791699
Here we see strong evidence that H0 is false, as the angles are now close to
60◦ .
The difference between the two scenarios is the distribution. In the first
scenario, we have randn(n): the components are distributed according to a
standard normal. In the second scenario, we have binomial(1,.5,N): the
components are distributed according to a fair coin toss. To see how the
distribution affects things, we bring in the law of large numbers, which is
discussed in §5.3.
Let X1 , X2 , . . . , Xn be a simple random sample from some population,
and let µ be the population mean. Recall this means X1 , X2 , . . . , Xn are
i.i.d. random variables, with µ = E(X). The sample mean is
X̄ = \frac{X1 + X2 + · · · + Xn}{n}.
Then we have the

Law of Large Numbers

For large sample size n, the sample mean X̄ is close to the population mean µ.
We use the law of large numbers to explain the closeness of the vector
angles to specific values.
Assume u = (x1 , x2 , . . . , xn ), and v = (y1 , y2 , . . . , yn ) where all compo-
nents are selected independently of each other, and each is selected according
to the same distribution.
Let U = (X1 , X2 , . . . , Xn ), V = (Y1 , Y2 , . . . , Yn ), be the corresponding
random variables. Then X1 , X2 , . . . , Xn and Y1 , Y2 , . . . , Yn are independent
and identically distributed (i.i.d.), with population mean E(X) = E(Y ).
From this, X1 Y1 , X2 Y2 , . . . , Xn Yn are i.i.d. random variables with popu-
lation mean E(XY ). By the law of large numbers,¹

\frac{X1 Y1 + X2 Y2 + · · · + Xn Yn}{n} ≈ E(XY ),

so

U · V = X1 Y1 + X2 Y2 + · · · + Xn Yn ≈ n E(XY ).

¹ ≈ means the ratio of the two sides approaches 1 as n grows without bound.
cos(θ) = (U · V) / √( (U · U)(V · V) ) ≈ µ² / (µ² + σ²).

We conclude that, for large n, cos(θ) is approximately µ²/(µ² + σ²).
For the standard normal, µ = 0, so cos(θ) ≈ 0 and the angle is close to 90°.
For the fair coin toss, µ = p and σ² = p(1 − p) with p = 1/2, so

cos(θ) ≈ µ²/(µ² + σ²) = p²/(p² + p(1 − p)) = p = 1/2,

and the angle is close to 60°.
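As a quick numerical check of this approximation, the predicted angle arccos(µ²/(µ² + σ²)) can be computed directly. This is a minimal sketch, assuming the usual numpy imports; the function name is not from the text.

from numpy import arccos, degrees

def predicted_angle(mu, sigma2):
    # predicted angle between two long random vectors with i.i.d. components
    return degrees(arccos(mu**2/(mu**2 + sigma2)))

print(predicted_angle(0, 1))      # standard normal components: 90 degrees
print(predicted_angle(.5, .25))   # fair coin toss components: 60 degrees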
6.2 Z-test
Suppose we want to estimate the proportion of American college students
who have a smart phone. Instead of asking every student, we take a sample
and make an estimate based on the sample.
p = .7
n = 25
N = 1000
v = binomial(n,p,N)/n
hist(v,edgecolor ='Black')
show()
|p − X̄| < ϵ,

then

(L, U) = (X̄ − ϵ, X̄ + ϵ)

is a confidence interval.
With the above setup, we have the population proportion p, and the four
sample characteristics

• sample size n,
• sample mean X̄,
• margin of error ϵ,
• confidence level α.

Suppose we do not know p, but we know n and X̄. We say the margin of
error is ϵ, at confidence level α, if

(L, U) = (X̄ − ϵ, X̄ + ϵ),

where ϵ is determined by the score z∗ satisfying

Prob(|Z| > z∗) = α.

Let σ/√n be the standard error. By the central limit theorem,

α ≈ Prob( |X̄ − p| / √(p(1 − p)) > z∗/√n ).
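For example, here is a minimal sketch of this computation for n = 20 and X̄ = .7 at 95% confidence, with Z the standard normal from scipy.stats and α interpreted as the tail probability, as in the code below; the numbers match answer 1 below.

from numpy import sqrt
from scipy.stats import norm as Z

n, xbar, alpha = 20, .7, .05           # tail probability .05, i.e. 95% confidence
zstar = Z.ppf(1 - alpha/2)
epsilon = zstar * sqrt(xbar*(1 - xbar)/n)
L, U = xbar - epsilon, xbar + epsilon  # approximately (.5, .9), so epsilon = .2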
##########################
# Confidence Interval - Z
##########################
def confidence_interval(xbar,sdev,n,alpha,type):
    Xbar = Z(xbar,sdev/sqrt(n))
    if type == "two-tail":
        U = Xbar.ppf(1-alpha/2)
        L = Xbar.ppf(alpha/2)
    elif type == "upper-tail":
        U = Xbar.ppf(1-alpha)
        L = xbar
    elif type == "lower-tail":
        U = xbar
        L = Xbar.ppf(alpha)
    else: print("what's the test type?"); return
    return L, U
alpha = .02
sdev = 228
n = 35
xbar = 95
type = "two-tail"
L, U = confidence_interval(xbar,sdev,n,alpha,type)
Now we can answer the questions posed at the start of the section. Here
are the answers.
1. When n = 20, α = .95, and X̄ = .7, we have [L, U ] = [.5, .9], so ϵ = .2.
• H0 : µ = µ0
• Ha : µ ̸= µ0 or µ < µ0 or µ > µ0 .
• Ha : µ ̸= 0.
Here the significance level is α = .02 and µ0 = 0. To decide whether to
reject H0 or not, compute the standardized test statistic
z = √n · (x̄ − µ0)/σ = 2.465.
Since z is a sample from an approximately normal distribution Z, the p-value
Hypothesis Testing
There are three types of alternative hypotheses Ha :
µ < µ0 , µ > µ0 , µ ̸= µ0 .
In the Python code below, instead of working with the standardized statis-
tic Z, we work directly with X̄, which is normally distributed with mean µ0
and standard deviation σ/√n.
###################
# Hypothesis Z-test
###################

xbar = 122
n = 10
type = "upper-tail"
mu0 = 120
sdev = 2
alpha = .01

# Xbar is normal with mean mu0 and standard deviation sdev/sqrt(n)
Xbar = Z(mu0,sdev/sqrt(n))

if type == "upper-tail": p = 1 - Xbar.cdf(xbar)
elif type == "lower-tail": p = Xbar.cdf(xbar)
else: p = 2*(1 - Xbar.cdf(mu0 + abs(xbar - mu0)))   # two-tail

print("pvalue: ",p)
if p < alpha: print("reject H0")
else: print("do not reject H0")
• Ha : µ > µ0
If a driver’s measured average speed is X̄ = 122, the above code rejects H0 .
This is consistent with the confidence interval cutoff we found above.
There are two types of possible errors we can make. A Type I error is
when H0 is true but we reject it, and a Type II error is when H0 is false
but we fail to reject it.

                     H0 is true           H0 is false
do not reject H0     1 − α                Type II error: β
reject H0            Type I error: α      Power: 1 − β
This calculation was for a two-tail test. When the test is upper-tail or
lower-tail, a similar calculation leads to the code
############################
# Type1 and Type2 errors - Z
############################
def type2_error(type,mu0,mu1,sdev,n,alpha):
print("significance,mu0,mu1, sdev, n: ",
,→ alpha,mu0,mu1,sdev,n)
print("prob of type1 error: ", alpha)
delta = sqrt(n) * (mu0 - mu1) / sdev
if type == "lower-tail":
zstar = Z.ppf(alpha)
type2 = 1 - Z.cdf(delta + zstar)
elif type == "upper-tail":
zstar = Z.ppf(1-alpha)
type2 = Z.cdf(delta + zstar)
elif type == "two-tail":
zstar = Z.ppf(1 - alpha/2)
type2 = Z.cdf(delta + zstar) - Z.cdf(delta - zstar)
else: print("what's the test type?"); return
print("test type: ",type)
print("zstar: ", zstar)
print("delta: ", delta)
print("prob of type2 error: ", type2)
print("power: ", 1 - type2)
mu0 = 120
mu1 = 122
sdev = 2
n = 10
alpha = .01
type = "upper-tail"
type2_error(type,mu0,mu1,sdev,n,alpha)
A type II error is when we do not reject the null hypothesis and yet it’s
false. The power of a test is the probability of rejecting the null hypothesis
when it’s false (Figure 6.3). If the probability of a type II error is β, then
the power is 1 − β.
Going back to the driving speed example, what is the chance that someone
driving at µ1 = 122 is not caught? This is a type II error; using the above
6.3 T -test
Let X1 , X2 , . . . , Xn be a simple random sample from a population. We
repeat the previous section when we know neither the population mean µ,
nor the population variance σ 2 . We only know the sample mean
X1 + X2 + · · · + Xn
X̄ =
n
and the sample variance
S² = ( Σₖ₌₁ⁿ (Xk − X̄)² ) / (n − 1).
Here N is a constant to make the total area under the graph equal to one
(Figure 6.4). In other words, (6.3.1) is the pdf of the t-distribution.
When the interval [a, b] is not small, the correct formula is obtained by
integration, which means dividing [a, b] into many small intervals and sum-
ming. We will not use this density formula directly.
for d in [3,4,7]:
t = arange(-3,3,.01)
plot(t,T(d).pdf(t),label="d = "+str(d))
plot(t,Z.pdf(t),"--",label=r"d = $\infty$")
grid()
legend()
show()
Xk = µ + σZk,

√n · (X̄ − µ)/S = √n · Z̄ / √( (1/(n − 1)) Σₖ₌₁ⁿ (Zk − Z̄)² ) = √n · Z̄ / √( U/(n − 1) ).

Using the last result with d = n − 1, we arrive at the main result in this
section.²

² Geometrically, Prob(T > 1) is the probability that a normally distributed point is
inside the light cone in (d + 1)-dimensional spacetime.
##########################
# Confidence Interval - T
##########################
def confidence_interval(xbar,s,n,alpha,type):
d = n-1
if type == "two-tail":
tstar = T(d).ppf(1-alpha/2)
L = xbar - tstar * s / sqrt(n)
U = xbar + tstar * s / sqrt(n)
elif type == "upper-tail":
tstar = T(d).ppf(1-alpha)
L = xbar
U = xbar + tstar* s / sqrt(n)
elif type == "lower-tail":
tstar = T(d).ppf(alpha)
L = xbar + tstar* s / sqrt(n)
U = xbar
else: print("what's the test type?"); return
print("type: ",type)
return L, U
n = 10
xbar = 120
s = 2
alpha = .01
type = "upper-tail"
print("significance, s, n, xbar: ", alpha,s,n,xbar)
L,U = confidence_interval(xbar,s,n,alpha,type)
print("lower, upper: ", L,U)
Going back to the driving speed example from §6.2, instead of assuming
the population standard deviation is σ = 2, we compute the sample standard
deviation and find it’s S = 2. Recomputing with T (9), instead of Z, we see
(L, U ) = (120, 121.78), so the cutoff now is µ∗ = 121.78, as opposed to
µ∗ = 121.47 there.
• H0 : µ = µ0
• Ha : µ ̸= µ0 .
###################
# Hypothesis T-test
###################
xbar = 122
n = 10
type = "upper-tail"
mu0 = 120
s = 2
alpha = .01
ttest(mu0, s, n, xbar,type)
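The function ttest is not defined in this excerpt; a minimal sketch consistent with the Z-test code, using T and sqrt as above and the globally defined alpha, might be:

def ttest(mu0, s, n, xbar, type):
    t = sqrt(n) * (xbar - mu0) / s          # standardized test statistic
    if type == "upper-tail": p = 1 - T(n-1).cdf(t)
    elif type == "lower-tail": p = T(n-1).cdf(t)
    else: p = 2 * (1 - T(n-1).cdf(abs(t)))  # two-tail
    print("pvalue: ", p)
    if p < alpha: print("reject H0")
    else: print("do not reject H0")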
• H0 : µ = µ0
• Ha : µ > µ0
########################
# Type1 and Type2 errors
########################

def type2_error(type,mu0,mu1,s,n,alpha):
    d = n-1
    print("significance,mu0,mu1,n: ", alpha,mu0,mu1,n)
    print("prob of type1 error: ", alpha)
    delta = sqrt(n) * (mu0 - mu1) / s
    if type == "lower-tail":
        tstar = T(d).ppf(alpha)
        type2 = 1 - T(d).cdf(delta + tstar)
    elif type == "upper-tail":
        tstar = T(d).ppf(1-alpha)
        type2 = T(d).cdf(delta + tstar)
    elif type == "two-tail":
        tstar = T(d).ppf(1 - alpha/2)
        type2 = T(d).cdf(delta + tstar) - T(d).cdf(delta - tstar)
    else: print("what's the test type?"); return
    print("prob of type2 error: ", type2)
    print("power: ", 1 - type2)

type2_error(type,mu0,mu1,s,n,alpha)
6.4 Two Means

Similarly,

Var(S_Y²) = 2σ_Y⁴ / (m − 1).
Before, with a single mean, we used the result that

T = Z / √(U/n)

is T-distributed with degree n when

1. U is chi-squared with degree n,
2. Z is N(0, 1),
3. Z and U are independent.
We apply this same result this time, but we proceed more carefully. To
begin, X̄ and Ȳ are normal with means µ_X and µ_Y and variances σ²/n and
σ²/m respectively. Hence

( (X̄ − Ȳ) − (µ_X − µ_Y) ) / √( σ²/n + σ²/m ) ∼ N(0, 1).
Next,

(n − 1)S_X²/σ²  and  (m − 1)S_Y²/σ²

are chi-squared of degrees n − 1 and m − 1 respectively, so their sum

(n − 1)S_X²/σ² + (m − 1)S_Y²/σ²
##################################
# Confidence Interval - Two means
##################################
import numpy as np
from scipy.stats import t
T = t
def confidence_interval(xbar,ybar,varx,vary,nx,ny,alpha):
tstar = T.ppf(1-alpha/2, nx+ny-2)
varp = (nx-1)*varx+(ny-1)*vary
n = nx+ny-2
varp = varp/n
s_p = np.sqrt(varp)
h = 1/nx + 1/ny
L = xbar - ybar - tstar * s_p * np.sqrt(h)
U = xbar - ybar + tstar * s_p * np.sqrt(h)
return L, U
Now we turn to the question of what to do when the variances σ_X² and
σ_Y² are not equal. In this case, by independence, the population variance of
X̄ − Ȳ is the sum of the population variances of X̄ and Ȳ, which is

σ_B² = σ_X²/n + σ_Y²/m.    (6.4.1)
Hence

( (X̄ − Ȳ) − (µ_X − µ_Y) ) / √( σ_X²/n + σ_Y²/m ) ∼ N(0, 1).
We want to replace the population variance (6.4.1) by the sample variance

S_B² = S_X²/n + S_Y²/m.
Because SB2 is not a straight sum, but is a more complicated linear combina-
tion of variances, SB2 is not chi-squared.
Welch’s approximation is to assume it is chi-squared with degree r, and
to figure out the best r for this. More exactly, we seek the best choice of r
so that
r S_B² / σ_B² = r S_B² / ( σ_X²/n + σ_Y²/m )

is close to chi-squared with degree r. By construction, we multiplied S_B² by
r/σ_B² so that its mean equals r,

E( r S_B² / σ_B² ) = (r/σ_B²) E(S_B²) = r.
Since the variance of a chi-squared with degree r is 2r, we compute the
variance and set it equal to 2r,
2r = Var( r S_B² / σ_B² ) = ( r²/(σ_B²)² ) Var(S_B²).    (6.4.2)
By independence,
Var(S_B²) = Var(S_X²/n) + Var(S_Y²/m) = (1/n²) Var(S_X²) + (1/m²) Var(S_Y²).
But (n − 1)S_X²/σ_X² and (m − 1)S_Y²/σ_Y² are chi-squared, so

Var(S_B²) = 2σ_X⁴/(n²(n − 1)) + 2σ_Y⁴/(m²(m − 1)).    (6.4.3)
Combining (6.4.2) and (6.4.3), we arrive at Welch's approximation for the
degrees of freedom,

r = σ_B⁴ / ( σ_X⁴/(n²(n − 1)) + σ_Y⁴/(m²(m − 1)) )
  = ( σ_X²/n + σ_Y²/m )² / ( σ_X⁴/(n²(n − 1)) + σ_Y⁴/(m²(m − 1)) ).
In practice, this expression for r is never an integer, so one rounds it to the
closest integer, and the population variances σ_X² and σ_Y² are replaced by the
sample variances S_X² and S_Y².
We summarize the results.
Welch’s T-statistic
If we have independent simple random samples, then the statistic
X̄ − Ȳ − (µX − µY )
T = r
2
SX S2
+ Y
n m
is approximately distributed according to a T -distribution with degrees
of freedom 2 2
SX SY2
+
n m
r= 4 .
SX SY4
+
n2 (n − 1) m2 (m − 1)
6.5 Variances

Let X1, X2, . . . , Xn be a normally distributed simple random sample with
mean 0 and variance 1.
Then we know

U = X1² + X2² + · · · + Xn²

is chi-squared with degree n. The chi-squared score χ²_{α,n} is defined by

Prob(U ≤ χ²_{α,n}) = α.

For a normally distributed simple random sample with variance σ², we also
know, from §5.5,

(n − 1)S²/σ² ∼ χ²_{n−1}.
Let

a = χ²_{α/2,n−1},    b = χ²_{1−α/2,n−1}.    (6.5.1)

By definition of the score χ²_{α,n}, we have

Prob( a ≤ (n − 1)S²/σ² ≤ b ) = 1 − α.

Solving for σ², this is the same as

Prob( (n − 1)S²/b ≤ σ² ≤ (n − 1)S²/a ) = 1 − α.
We conclude

Confidence Interval

A (1 − α)100% confidence interval for the population variance σ² is

(n − 1)S²/b ≤ σ² ≤ (n − 1)S²/a

where a and b are the χ²_{n−1} scores at significance α/2 and 1 − α/2.
##############################
# Confidence Interval - Chi2
##############################
def confidence_interval(s2,n,alpha):
a = chi2.ppf(alpha/2,n-1)
b = chi2.ppf(1-alpha/2,n-1)
L = (n-1)*s2/b
U = (n-1)*s2/a
return L, U
L, U = 1.99, 14.0
For hypothesis testing, given the hypotheses

• H0 : σ = σ0
• Ha : σ ≠ σ0,

one compares the p-value of the standardized test statistic (n − 1)S²/σ0² to
the required significance score, whether two-tail, upper-tail, or lower-tail.
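A minimal sketch of this test, using chi2 from scipy.stats as in the code above; the function name is an illustration, not the book's code.

def chi2_test(s2, sigma0, n, alpha, type="two-tail"):
    u = (n-1)*s2/sigma0**2                   # standardized test statistic
    lower, upper = chi2.cdf(u, n-1), 1 - chi2.cdf(u, n-1)
    if type == "upper-tail": p = upper
    elif type == "lower-tail": p = lower
    else: p = 2*min(lower, upper)            # two-tail
    print("reject H0" if p < alpha else "do not reject H0")
    return p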
Now we consider two populations with two variances. For this, we intro-
duce the F-distribution. If U1, U2 are independent chi-squared random vari-
ables with degrees n1 and n2, then

F = (U1/n1) / (U2/n2)

is F-distributed with degrees n1 and n2.
alpha = .05
dfn, dfd = n-1, m-1   # degrees of freedom for samples of sizes n and m
a = f.ppf(alpha/2,dfn,dfd)
b = f.ppf(1-alpha/2,dfn,dfd)
Then

Prob( a_α < (S_X²/S_Y²)(σ_Y²/σ_X²) < b_α ) = 1 − α,

which may be rewritten

Prob( (1/b_α)(S_X²/S_Y²) < σ_X²/σ_Y² < (1/a_α)(S_X²/S_Y²) ) = 1 − α.
This leads to the confidence interval endpoints

L = (1/√b_α)(S_X/S_Y),    U = (1/√a_α)(S_X/S_Y).
L = 0.31389215230779993, U = 1.6621265193149342
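Assembled into a function, here is a sketch of this confidence interval, with f from scipy.stats and the degrees of freedom n − 1 and m − 1 assumed as above; the function name is illustrative.

from numpy import sqrt
from scipy.stats import f

def f_confidence_interval(varx, vary, n, m, alpha):
    a = f.ppf(alpha/2, n-1, m-1)
    b = f.ppf(1-alpha/2, n-1, m-1)
    ratio = sqrt(varx/vary)                # S_X / S_Y
    return ratio/sqrt(b), ratio/sqrt(a)    # (L, U) for sigma_X/sigma_Y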
6.7 Chi-Squared Tests

Xk = 0, 1, 2, . . . , d − 1.

Z = √n · (X̄ − p) / √(p(1 − p))    (6.7.1)

is approximately standard normal for large enough sample size, and conse-
quently U = Z² is approximately chi-squared with degree one. Pearson's test
generalizes this from d = 2 categories to d > 2 categories.
Given a category j, let #j denote the number of times Xk = j, 1 ≤ k ≤ n.
Then #j is the count that Xk = j, and p̂j = #j/n is the observed frequency,
in n samples. Let pj be the expected frequency. Then

√n (p̂j − pj) = √n (#j/n − pj),    0 ≤ j < d,

are approximately normal for large n. Based on this, Pearson [19] showed

Goodness-Of-Fit Test

Let p̂ = (p̂1, p̂2, . . . , p̂d) be the observed frequencies and p =
(p1, p2, . . . , pd) the expected frequencies. Then, for large n, the statistic

u = n Σⱼ (p̂j − pj)²/pj    (6.7.2)

is approximately chi-squared with degree d − 1.
def goodness_of_fit(observed,expected):
# assume len(observed) == len(expected)
d = len(observed)
n = sum(observed)
u = sum([ (observed[i] - expected[i])**2/expected[i] for i in range(d) ])
deg = d-1
pvalue = 1 - U(deg).cdf(u)
return pvalue
Suppose a die is rolled n = 120 times, and the observed counts are O1, O2,
. . . , O6. Notice

O1 + O2 + O3 + O4 + O5 + O6 = 120.

d = 6
alpha = .05
ustar = U(d-1).ppf(1-alpha)

Since this returns u∗ = 11.07 and u > u∗, we conclude the die is not fair.
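For example, a call with hypothetical counts (not the counts from the table above) looks like this; for a fair die and n = 120 rolls, each expected count is 120/6 = 20.

observed = [10, 15, 20, 25, 30, 20]   # hypothetical counts summing to 120
expected = [20]*6
print(goodness_of_fit(observed, expected))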
We now derive the goodness-of-fit test. For each category 0 ≤ j < d, let

X̃kj = 1/√pj  if Xk = j,    and    X̃kj = 0  if Xk ≠ j.

Then E(X̃kj) = √pj, and

E(X̃ki X̃kj) = 1  if i = j,    and    E(X̃ki X̃kj) = 0  if i ≠ j.

If

µ = (√p1, √p2, . . . , √pd)    and    X̃k = (X̃k1, X̃k2, . . . , X̃kd),

then

E(X̃k) = µ,    E(X̃k ⊗ X̃k) = I.
From this,
V ar(X̃k ) = E(X̃k ⊗ X̃k ) − E(X̃k ) ⊗ E(X̃k ) = I − µ ⊗ µ.
From (5.3.8), we conclude the random vector
Z = √n ( (1/n) Σₖ₌₁ⁿ X̃k − µ )
has mean zero and variance I − µ ⊗ µ. By the central limit theorem, Z is
approximately normal for large n.
Since
|µ|² = (√p0)² + (√p1)² + · · · + (√p_{d−1})² = p0 + p1 + · · · + p_{d−1} = 1,
µ is a unit vector. By the singular chi-squared result in §5.5, |Z|2 is approx-
imately chi-squared with degree d − 1. Using
Zj = √n ( p̂j/√pj − √pj ),

we write |Z|² out,

|Z|² = Σⱼ₌₁ᵈ Zj² = n Σⱼ₌₁ᵈ ( p̂j/√pj − √pj )² = n Σⱼ₌₁ᵈ (p̂j − pj)²/pj,
obtaining (6.7.2).
Is a person’s gender correlated with their party affiliation, or are the two
variables independent? To answer this, we use the
#{k : Xk = i, Yk = j}
r̂ij = , i = 1, 2, . . . , d, j = 1, 2, . . . , e.
n
If X1, X2, . . . , Xn and Y1, Y2, . . . , Yn are independent, then, for large
n, the statistic

n Σᵢⱼ (r̂ij − p̂i q̂j)² / (p̂i q̂j),    (6.7.3)

summed over i = 1, . . . , d and j = 1, . . . , e, is approximately chi-squared
with degree (d − 1)(e − 1). Here p̂i and q̂j are the observed marginal
frequencies of Xk and Yk. Expanding the square, (6.7.3) equals

−n + n Σᵢⱼ (observed)²/expected,

with observed = r̂ij and expected = p̂i q̂j.
The code

def chi2_independence(table):
    observed = table
    n = observed.sum()
    d = len(observed)
    e = len(observed.T)
    expected = outer(observed.sum(axis=1),observed.sum(axis=0))/n   # counts under independence
    u = ((observed - expected)**2/expected).sum()
    pvalue = 1 - chi2.cdf(u,(d-1)*(e-1))
    return pvalue
table = array([[68,56,32],[52,72,20]])
chi2_independence(table)
returns a p-value of 0.0401, so, at the 5% significance level, the effects are
not independent.
Calculus
7.1 Calculus
In this section, we focus on single-variable calculus, and in §7.3, we review
multi-variable calculus. Recall the slope of a line y = mx + b equals m.
Let y = f (x) be a function as in Figure 7.1, and let a be a fixed point. The
derivative of f (x) at the point a is the slope of the line tangent to the graph
of f (x) at a. Then the derivative at a point a is a number f ′ (a) possibly
depending on a.
[Figure 7.1: the graph of y = f(x), with the point a marked on the x-axis.]
Since the tangent line at a passes through the point (a, f (a)), and its
Using these properties, we determine the formula for f ′ (a). Suppose the
derivative is bounded between two extremes m and L at every point x in an
interval [a, b], say
m ≤ f ′ (x) ≤ L, a ≤ x ≤ b.
Then by A, the derivative of h(x) = f (x)−mx at x equals h′ (x) = f ′ (x)−m.
By assumption, h′ (x) ≥ 0 on [a, b], so, by B, h(b) ≥ h(a). Since h(a) =
f (a) − ma and h(b) = f (b) − mb, this leads to
(f(b) − f(a)) / (b − a) ≥ m.

Repeating this same argument with f(x) − Lx, and using C, leads to

(f(b) − f(a)) / (b − a) ≤ L.

We have shown

m ≤ (f(b) − f(a)) / (b − a) ≤ L.    (7.1.1)
Derivative Definition
f′(a) = lim_{x→a} (f(x) − f(a)) / (x − a).    (7.1.2)
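Numerically, the limit (7.1.2) can be approximated by difference quotients with smaller and smaller x − a. This is a minimal sketch; f = sin and a = π/4 are arbitrary choices, not from the text.

from numpy import sin, pi

f, a = sin, pi/4
for h in [.1, .01, .001]:
    print((f(a + h) - f(a))/h)    # approaches cos(pi/4) = 0.7071...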
[Diagram: the composition x → u = f(x) → y = g(u).]
Using the chain rule, the power rule can be derived for any rational number n,
positive or negative. For example, since (√x)² = x, we can write x = f(g(x))
with f(x) = x² and g(x) = √x. By the chain rule,

1 = (x)′ = f′(g(x)) g′(x) = 2 g(x) g′(x) = 2 √x (√x)′.

Solving for (√x)′ yields

(√x)′ = 1/(2√x),
which is (7.1.3) with n = 1/2. In this generality, the variable x is restricted
to positive values only.
For example,

(xⁿ)″ = (n xⁿ⁻¹)′ = n(n − 1) xⁿ⁻² = (n!/(n − 2)!) xⁿ⁻² = P(n, 2) xⁿ⁻²
We use the above to derive the Taylor series. Suppose f (x) is given by a
finite or infinite sum
f (x) = c0 + c1 x + c2 x2 + c3 x3 + . . . (7.1.4)
Then f (0) = c0 . Taking derivatives, by the sum, product, and power rules,
f′(x) = c1 + 2c2 x + 3c3 x² + 4c4 x³ + . . .
f″(x) = 2c2 + 3 · 2 c3 x + 4 · 3 c4 x² + . . .                  (7.1.5)
f‴(x) = 3 · 2 c3 + 4 · 3 · 2 c4 x + . . .
f⁽⁴⁾(x) = 4 · 3 · 2 c4 + . . .

Inserting x = 0, we obtain f′(0) = c1, f″(0) = 2c2, f‴(0) = 3 · 2 c3, f⁽⁴⁾(0) =
4 · 3 · 2 c4. This can be encapsulated by f⁽ⁿ⁾(0) = n! cn, n = 0, 1, 2, 3, 4, . . . ,
which is best written

f⁽ⁿ⁾(0)/n! = cn,    n ≥ 0.
Going back to (7.1.4), we derived the Taylor series centered at zero,

f(x) = Σₙ₌₀^∞ ( f⁽ⁿ⁾(0)/n! ) xⁿ.

More generally, let a be a fixed point. Then any function f(x) can be
expanded in powers (x − a)ⁿ, and we have the Taylor series centered at a,

f(x) = Σₙ₌₀^∞ ( f⁽ⁿ⁾(a)/n! ) (x − a)ⁿ.
We review the derivative of sine and cosine. Recall the angle θ in radians
is the length of the subtended arc (in red) in Figure 7.3. Following the figure,
with P = (x, y), we have x = cos θ, y = sin θ. By the figure, the arclength θ
is greater than the diagonal, which in turn is greater than y. Moreover θ is
less than 1 − x + y, so
y < θ < 1 − x + y.
[Figure 7.3: the unit circle, with P = (x, y), the arc θ, and the segment 1 − x.]

which implies

1 − (1 − cos θ)/θ < (sin θ)/θ < 1.    (7.1.8)
From (1.5.5),

sin(θ + ϕ) = sin θ cos ϕ + cos θ sin ϕ,

so

lim_{ϕ→0} (sin(θ + ϕ) − sin θ)/ϕ = lim_{ϕ→0} ( sin θ · (cos ϕ − 1)/ϕ + cos θ · (sin ϕ)/ϕ ) = cos θ.
Thus the derivative of sine is cosine,
(sin θ)′ = cos θ.
Similarly,
(cos θ)′ = − sin θ.
Using the chain rule, we compute the derivative of the inverse arcsin x of
sin θ. Since

θ = arcsin x  ⟺  x = sin θ,

we have

1 = x′ = (sin θ)′ = θ′ · cos θ = θ′ · √(1 − x²),

or

(arcsin x)′ = θ′ = 1/√(1 − x²).
We use this to compute the derivative of the arcsine law (3.2.13). With
x = √λ/2, by the chain rule,

( (2/π) arcsin(√λ/2) )′ = (2/π) · 1/√(1 − x²) · x′
                        = (2/π) · 1/√(1 − λ/4) · 1/(4√λ) = 1/( π √(λ(4 − λ)) ).    (7.1.9)
This shows the derivative of the arcsine law is the density in Figure 3.11.
For the parabola in Figure 7.4, y = x2 so, by the power rule, y ′ = 2x.
Since y ′ > 0 when x > 0 and y ′ < 0 when x < 0, this agrees with the
increase/decrease of the graph. In particular, the minimum of the parabola
occurs when y ′ = 0.
[Figure 7.4: the parabola y = x². Figure 7.5: a graph with the points −1, −c, c, 1
(c = 1/√3) marked on the x-axis.]
(ex )(n) = ex , n ≥ 0,
writing the Taylor series centered at zero for the exponential function yields
the exponential series (4.4.10).
For example, the function in Figure 7.6 is convex near x = a, and the
graph lies above its tangent line at a.
so g(x) is convex, so g(x) lies above its tangent line at x = a. Since g(a) = 0
and g ′ (a) = 0, the tangent line is 0, and we conclude g(x) ≥ 0, which is the
left half of (7.1.10). Similarly, if f ′′ (x) ≤ L, then pL (x) − f (x) is convex,
leading to the right half of (7.1.10).
Figure 7.6: Tangent parabolas pm(x) (green), pL(x) (red), L > m > 0.
t = (f′(b) − f′(a)) / (b − a)  ⟹  L ≥ t ≥ m,

which implies

t² − (m + L)t + mL = (t − m)(t − L) ≤ 0,    a ≤ t ≤ b.
This yields
y = log x ⇐⇒ x = ey .
log(ey ) = y, elog x = x.
From here, we see the logarithm is defined only for x > 0 and is strictly
increasing (Figure 7.7).
Since e0 = 1,
log 1 = 0.
Since e∞ = ∞ (Figure 4.10),
log ∞ = ∞.
log 0 = −∞.
We also see log x is negative when 0 < x < 1, and positive when x > 1.
Moreover, by the law of exponents,
ab = eb log a .
Then, by definition,
log(ab ) = b log a,
and c c
ab = eb log a = ebc log a = abc .
x = ey =⇒ 1 = x′ = (ey )′ = ey y ′ = xy ′ ,
so
1
y = log x =⇒ y′ = .
x
Derivative of the Logarithm
1
y = log x =⇒ y′ = . (7.1.12)
x
For gradient descent, we need the relation between a convex function and
its dual. If f (x) is convex, its convex dual is
Below we see g(p) is also convex. This may not always exist, but we will
work with cases where no problems arise.
Let q > 0. The simplest example is

f(x) = (q/2) x²  ⟹  g(p) = p²/(2q).
For each p, the point x where px − f (x) equals the maximum g(p) — the
maximizer — depends on p. If we denote the maximizer by x = x(p), then
Hence
g(p) = px − f (x) ⇐⇒ p = f ′ (x).
Also, by the chain rule, differentiating g(p) = p x(p) − f(x(p)) with respect to p,

g′(p) = x + (p − f′(x)) x′(p) = x.

Thus f′(x) is the inverse function of g′(p). Since g(p) = px − f(x) is the
same as f(x) = px − g(p), we have

f′(g′(p)) = p.

Differentiating again with respect to p,

f″(g′(p)) g″(p) = 1.
We derived
Notice the derivatives of σ and its inverse σ −1 are reciprocals. This result
holds in general, and is called the inverse function theorem.
The partition function is
Then Z ′ (z) = σ(z) and Z ′′ (z) = σ ′ (z) = σ(1 − σ) > 0. This shows Z(z) is
strictly convex.
The maximum
max(pz − Z(z))
z
which simplifies to I(p). Thus the convex dual of the partition function is the
information. The information is studied further in §7.2, and the multinomial
extension is in §7.6.
This makes sense because the binomial coefficient C(n, k) is defined for any
real number n (4.3.12), (4.3.13).
In summation notation,

(a + x)ⁿ = Σₖ₌₀^∞ C(n, k) aⁿ⁻ᵏ xᵏ.    (7.1.21)
The only difference between (4.3.7) and (7.1.21) is the upper limit of the
summation, which is set to infinity. When n is a whole number, by (4.3.10),
we have
n
= 0, for k > n,
k
so (7.1.21) is a sum of n + 1 terms, and equals (4.3.7) exactly. When n is not
a whole number, the sum (7.1.21) is an infinite sum.
Actually, in §5.5, we will need the special case a = 1, which we write in
slightly different notation,

(1 + x)^p = Σₙ₌₀^∞ C(p, n) xⁿ.    (7.1.22)
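A quick check of (7.1.22) for a non-integer exponent; binom from scipy.special accepts real arguments, and the values p = .5, x = .3 are arbitrary choices.

from scipy.special import binom

p, x = .5, .3
partial = sum(binom(p, n) * x**n for n in range(20))
print(partial, (1 + x)**p)    # both approximately 1.14017...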
f (x) = (a + x)n .
so

f⁽ᵏ⁾(0)/k! = ( n(n − 1)(n − 2) · · · (n − k + 1)/k! ) aⁿ⁻ᵏ = C(n, k) aⁿ⁻ᵏ.

Writing out the Taylor series,

(a + x)ⁿ = Σₖ₌₀^∞ ( f⁽ᵏ⁾(0)/k! ) xᵏ = Σₖ₌₀^∞ C(n, k) aⁿ⁻ᵏ xᵏ,
or
(1 − t) log a + t log b ≤ log ((1 − t)a + tb) .
Since the inequality sign is reversed, this shows
for 0 ≤ t ≤ 1.
Since x > 0, y ′′ < 0, which shows log x is in fact strictly concave everywhere
it is defined.
Since log x is strictly concave,
1
log = − log x
x
is strictly convex.
Thus H ′ (p) = 0 when p = 1/2, H ′ (p) > 0 on p < 1/2, and H ′ (p) < 0 on
p > 1/2. Since this implies H(p) is increasing on p < 1/2, and decreasing on
p > 1/2, p = 1/2 is a global maximum of the graph.
Notice as p increases, 1 − p decreases, so (1 − p)/p decreases. Since log is
increasing, as p increases, H ′ (p) decreases. Thus H(p) is concave.
Taking the second derivative, by the chain rule and the quotient rule,

H″(p) = ( log((1 − p)/p) )′ = −1/( p(1 − p) ),
A crucial aspect of Figure 7.8 is its limiting values at the edges p = 0 and
p = 1,
H(0) = lim H(p) and H(1) = lim H(p).
p→0 p→1
= − lim 2p log(2p)
p→0
Then I ′ (p) is the inverse of the derivative σ(x) (7.1.16) of the dual Z(x)
(7.1.19) of I(p), as it should be (7.1.14).
Toss a coin n times, and let #n (p) be the number of outcomes where
the proportion of heads is p. Then we have the approximation
In more detail, using (4.1.6), one can derive the asymptotic equality

#n(p) ≈ (1/√(2πn)) · (1/√(p(1 − p))) · e^{nH(p)},    for n large.    (7.2.5)
Figure 7.9 is returned by the code below, which compares both sides of
the asymptotic equality (7.2.5) for n = 10.
n = 10
def H(p): return - p*log(p) - (1-p)*log(1-p)
p = arange(.01,.99,.01)
grid()
plot(p, comb(n, n*p), label="binomial coefficient")
plot(p, exp(n*H(p))/sqrt(2*n*pi*p*(1-p)), label="entropy approximation")
title("number of tosses " + "$n=" + str(n) +"$", usetex=True)
legend()
show()
Then
I(q, q) = 0,
which agrees with our understanding that I(p, q) measures the difference in
information between p and q. Because I(p, q) is not symmetric in p, q, we
think of q as a base or reference probability, against which we compare p.
Equivalently, instead of measuring relative information, we can measure
the relative entropy,
H(p, q) = −I(p, q).
Since
I(p, q) = −H(p) − p log(q) − (1 − p) log(1 − q)
and H(0) = 0 = H(1), I(p, q) is well-defined for p = 0, and p = 1,
d²I(p, q)/dp² = −H″(p) = 1/( p(1 − p) ),

d²I(p, q)/dq² = p/q² + (1 − p)/(1 − q)²,
In more detail, using (4.1.6), one can derive the asymptotic equality

Pn(p, q) ≈ (1/√(2πn)) · (1/√(p(1 − p))) · e^{nH(p,q)},    for n large.    (7.2.7)
The law of large numbers (§6.1) states that the proportion of heads equals
approximately q for large n. Therefore, when p ̸= q, we expect the probability
that the proportion of heads equal p should become successively smaller as
n get larger, and in fact vanish when n = ∞. Since H(p, q) < 0 when p ̸= q,
(7.2.7) implies this is so. Thus (7.2.7) may be viewed as a quantitative
strengthening of the law of large numbers, in the setting of coin-tossing.
The partial derivative in the k-th direction is just the one-dimensional deriva-
tive considering xk as the independent variable, with all other xj ’s constants.
Below we exhibit the multi-variable chain rule in two ways. The first in-
terpretation is geometric, and involves motion in time and directional deriva-
tives. This interpretation is relevant to gradient descent, §8.3.
The second interpretation is combinatorial, and involves repeated com-
positions of functions. This interpretation is relevant to computing gradients
in networks, specifically backpropagation §7.4, §8.2.
These two interpretations work together when training neural networks,
§8.4.
For the first interpretation of the chain rule, suppose the components x1 ,
x2 , . . . , xd are functions of a single variable t (usually time), so we have
x1 = x1 (t), x2 = x2 (t), ..., xd = xd (t).
Inserting these into f (x1 , x2 , . . . , xd ), we obtain a function
f (t) = f (x1 (t), x2 (t), . . . , xd (t))
of a single variable t. Then we have
The Rd -valued function x(t) = (x1 (t), x2 (t), . . . , xd (t)) represents a curve
or path in Rd , and the vector
df
= ∇f (x(t)) · x′ (t).
dt
d
f (x + tv) = ∇f (x) · v. (7.3.3)
dt t=0
r = f (x) = sin x,
1
s = g(x) = ,
1 + e−x
t = h(x) = x2 ,
u = r + s + t,
y = k(u) = cos u.
[Diagram: the graph with input x, intermediate nodes r, s, t, their sum u, and
output y = k(u).]
We obtain
dy
= −0.90 ∗ 0.71 − 0.90 ∗ 0.22 − 0.90 ∗ 1.57 = −2.25.
dx
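A finite-difference check of this value, taking x = π/4, which matches the numbers above; this is a sketch assuming the usual numpy imports.

from numpy import sin, cos, exp, pi

def y(x): return cos(sin(x) + 1/(1 + exp(-x)) + x**2)

x, h = pi/4, 1e-6
print((y(x + h) - y(x - h))/(2*h))    # approximately -2.25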
The chain rule is discussed in further detail in §7.4.
∇f (x∗ ) = 0.
In this case,

∂f/∂xi = (1/2) Σⱼ₌₁ᵈ qij xj + (1/2) Σⱼ₌₁ᵈ qji xj − bi = (Qx − b)i.
Quadratic Convexity
Let Q be a symmetric matrix and b a vector. The quadratic function
1
f (x) = x · Qx − b · x
2
has gradient
∇f (x) = Qx − b. (7.3.6)
Moreover f (x) is convex everywhere exactly when Q is a covariance
matrix, Q ≥ 0.
By (2.2.2),
where θ is the angle between the vector v and the gradient vector ∇f (x).
Since −1 ≤ cos θ ≤ 1, we conclude
dy dy dr
r = f (x), y = g(r) =⇒ = · .
dx dr dx
In this section, we work out the implications of the chain rule on repeated
compositions of functions.
Suppose
r = f (x) = sin x,
1
s = g(r) = ,
1 + e−r
y = h(s) = s2 .
[Figure 7.12: the chain x → r = f(x) → s = g(r) → y = h(s).]
The chain in Figure 7.12 has four nodes and four edges. The outputs at
the nodes are x, r, s, y. Start with output x = π/4. Evaluating the functions
in order,
Notice these values are evaluated in the forward direction: x then r then s
then y. This is forward propagation.
Now we evaluate the derivatives of the output y with respect to x, r, s,
dy/dx,    dy/dr,    dy/ds.
With the above values for x, r, s, we have
dy/ds = 2s = 2 ∗ 0.670 = 1.340.
Since g is the logistic function, by (7.1.17),
From this,
dy/dr = (dy/ds) · (ds/dr) = 1.340 ∗ g′(r) = 1.340 ∗ 0.221 = 0.296.
r = x2 ,
s = r 2 = x4 ,
y = s2 = x 8 .
This is the same function h(x) = x2 composed with itself three times. With
x = 5, we have
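The functions f, g, h and their derivatives df, dg, dh used below are not defined in this excerpt; a minimal sketch consistent with the first chain (r = sin x, s the logistic of r, y = s²) is:

from numpy import sin, cos, exp

def f(x): return sin(x)
def df(x): return cos(x)

def g(r): return 1/(1 + exp(-r))
def dg(r): return g(r)*(1 - g(r))

def h(s): return s**2
def dh(s): return 2*s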
func_chain = [f,g,h]
der_chain = [df,dg,dh]
Then we evaluate the output vector x = (x, r, s, y), leading to the first
version of forward propagation,
def forward_prop(x_in,func_chain):
x = [x_in]
while func_chain:
f = func_chain.pop(0) # first func
x_out = f(x_in)
x.append(x_out) # insert at end
x_in = x_out
return x
# dy/dy = 1
delta_out = 1
def backward_prop(delta_out,x,der_chain):
delta = [delta_out]
while der_chain:
# discard last output
x.pop(-1)
df = der_chain.pop(-1) # last der
der = df(x[-1])
# chain rule -- multiply by previous der
der = der * delta[0]
delta.insert(0,der) # insert at start
return delta
delta = backward_prop(delta_out,x,der_chain)
d = 3
func_chain, der_chain = [h]*d, [dh]*d
x_in, delta_out = 5, 1
x = forward_prop(x_in,func_chain)
delta = backward_prop(delta_out,x,der_chain)
Now we work with the network in Figure 7.13, using the multi-variable
a = f (x, y) = x + y,
b = g(y, z) = max(y, z),
J = h(a, b) = ab.
The composite function is
J = (x + y) max(y, z),
x +
a
y J
∗
b
z max
Here there are three input nodes x, y, z, and three hidden nodes +, max,
∗. Starting with inputs (x, y, z) = (1, 2, 0), and plugging in, we obtain node
outputs
(x, y, z, a, b, J) = (1, 2, 0, 3, 2, 6)
(Figure 7.15). This is forward propagation.
Then

∂a/∂x = 1,    ∂a/∂y = 1.
[Figure 7.14: the (y, z) plane split by the diagonal y = z; for y < z, max(y, z) = z
and ∂g/∂y = 0, ∂g/∂z = 1; for y > z, max(y, z) = y and ∂g/∂y = 1, ∂g/∂z = 0.]
Let

1(y > z) = 1 if y > z,    and    1(y > z) = 0 if y < z.

By Figure 7.14, since y = 2 and z = 0,

∂b/∂y = 1(y > z) = 1,    ∂b/∂z = 1(z > y) = 0.
By the chain rule,

∂J/∂x = (∂J/∂a)(∂a/∂x) = 2 ∗ 1 = 2,
∂J/∂y = (∂J/∂a)(∂a/∂y) + (∂J/∂b)(∂b/∂y) = 2 ∗ 1 + 3 ∗ 1 = 5,
∂J/∂z = (∂J/∂b)(∂b/∂z) = 3 ∗ 0 = 0.

Hence we have

( ∂J/∂x, ∂J/∂y, ∂J/∂z, ∂J/∂a, ∂J/∂b, ∂J/∂J ) = (2, 5, 0, 2, 3, 1).
The outputs (blue) and the derivatives (red) are displayed in Figure 7.15.
[Figure 7.15: the graph of Figure 7.13 with node outputs (1, 2, 0, 3, 2, 6) in blue
and derivatives (2, 5, 0, 2, 3, 1) in red.]
d = 6
w = [ [None]*d for _ in range(d) ]
w[0][3] = w[1][3] = w[1][4] = w[2][4] = w[3][5] = w[4][5] = 1
More generally, in a weighted directed graph, the weights wij are numeric
scalars. In this case, for each node j, let

x⁻j = (w1j x1, w2j x2, . . . , wdj xd).    (7.4.1)

Then x⁻j is the list of node signals, each weighted accordingly. If (i, j) is
not an edge, then wij = 0, so xi does not appear in x⁻j: In other words, x⁻j
lists only the signals incoming at node j. For example,

x5 = f5(x⁻5) = f5(w15 x1, w75 x7, w25 x2).
and
f4 (x, y) = x + y, f5 (y, z) = max(y, z), J(a, b) = ab.
Note there is nothing incoming at the input nodes, so there is no point
defining f1 , f2 , f3 .
activate = [None]*d
def incoming(x,w,j):
return [ outgoing(x,w,i) * w[i][j] if w[i][j] else 0 for i
,→ in range(d) ]
def outgoing(x,w,j):
if x[j] != None: return x[j]
else: return activate[j](*incoming(x,w,j))
Let xin be the outgoing vector over the input nodes. If there are m input
nodes, and d nodes in total, then the length of xin is m, and the length of x
is d. In the example above, xin = (x, y, z).
We assume the nodes are ordered so that the initial portion of x equals
xin ,
m = len(x_in)
x[:m] = x_in
def forward_prop(x_in,w):
d = len(w)
x = [None]*d
m = len(x_in)
x[:m] = x_in
for j in range(m,d): x[j] = outgoing(x,w,j)
return x
For this code to work, we assume there are no cycles in the graph: All
backward paths end at inputs.
Let xout be the output nodes. For Figure 7.13, this means xout = (J).
Then by forward propagation, J is also a function of all node outputs. For
Figure 7.13, this means J is a function of x, y, z, a, b.
Therefore, at each node i, we have the derivatives

δi = (∂J/∂xi)(xi),    i = 1, 2, . . . , d.
Then δ = (δ1 , δ2 , . . . , δd ) is the gradient vector. We first compute the deriva-
tives of J with respect to the output nodes xout , and we assume these deriva-
tives are assembled into a vector δout .
In Figure 7.13, there is one output node J, and
δJ = ∂J/∂J = 1.
Hence δout = (1).
We assume the nodes are ordered so that the terminal portion of x equals
xout and the terminal portion of δ equals δout ,
d = len(x)
m = len(x_out)
x[d-m:] = x_out
delta[d-m:] = delta_out
∂J/∂xi = Σ_{i→j} (∂J/∂xj) · (∂xj/∂xi) = Σ_{i→j} (∂J/∂xj) · (∂fj/∂xi) · wij,

so

δi = Σ_{i→j} δj · gij · wij.

The code is
The code is
def derivative(x,delta,g,i):
if delta[i] != None: return delta[i]
else:
return sum([ derivative(x,delta,g,j) *
,→ g[i][j](*incoming(x,g,j)) * w[i][j] if g[i][j] != None
,→ else 0 for j in range(d) ])
def backward_prop(x,delta_out,g):
d = len(g)
delta = [None]*d
m = len(delta_out)
delta[d-m:] = delta_out
for i in range(d-m): delta[i] = derivative(x,delta,g,i)
return delta
7.5 Convex Functions

[Figure: two points x0 and x1 and the convex combination (1 − t)x0 + tx1 on
the segment joining them.]
Here we write the sublevel set of level 1. One can have sublevel sets corre-
sponding to any level c, f (x) ≤ c. For example, in Figure 7.16, the (blue)
interior of the square, together with the square itself, is a sublevel set. Sim-
ilarly, the interior of the ellipse, together with the ellipse itself, is a sublevel
set. The interiors of the ellipsoids, together with the ellipsoids themselves,
in Figure 7.22 are sublevel sets. Note we always consider the level set to be
part of the sublevel set.
The level set f (x) = 1 is the boundary of the sublevel set f (x) ≤ 1. Thus
the square and the ellipse in Figure 7.16 are boundaries of their respective
sublevel sets, and the covariance ellipsoid x · Qx = 1 is the boundary of the
sublevel set x · Qx ≤ 1.
A scalar function f(x) is convex if¹ for any two points x0 and x1 in Rd,
This says the line segment joining any two points (x0 , f (x0 )) and (x1 , f (x1 ))
on the graph of f (x) lies above the graph of f (x). For example, in two
dimensions, the function f (x) = f (x1 , x2 ) = x21 + x22 /4 is convex because its
graph is the paraboloid in Figure 7.19.
More generally, given points x1 , x2 , . . . , xN , a linear combination
t1 x1 + t2 x2 + · · · + tN xN
¹ We only consider convex functions that are continuous.
Figure 7.19: Convex: The line segment lies above the graph.
Quadratic is Convex
If Q is a nonnegative matrix and b is a vector, then
1
f (x) = x · Qx − b · x
2
This was derived in the previous section, but here we present a more
geometric proof.
To derive this result, let x0 and x1 be any points, and let v = x1 − x0 .
Then x0 + tv = (1 − t)x0 + tx1 and x1 = x0 + v. Let g0 = Qx0 − b. By (7.3.5),
f(x0 + tv) = f(x0) + tv · (Qx0 − b) + (t²/2) v · Qv = f(x0) + tv · g0 + (t²/2) v · Qv.    (7.5.3)

Inserting t = 1 in (7.5.3), we have f(x1) = f(x0) + v · g0 + v · Qv/2. Since
t² ≤ t for 0 ≤ t ≤ 1 and v · Qv ≥ 0, by (7.5.3),

f((1 − t)x0 + tx1) = f(x0 + tv)
    ≤ f(x0) + tv · g0 + (t/2) v · Qv
    = (1 − t)f(x0) + tf(x0) + tv · g0 + (t/2) v · Qv
    = (1 − t)f(x0) + tf(x1).

When Q is invertible, then v · Qv > 0, and we have strict convexity.
[Figure: points x1, . . . , x7 in the plane and their convex hull.]
from numpy.random import default_rng
from scipy.spatial import ConvexHull

rng = default_rng()
points = rng.random((30,2))    # random points in the plane (not shown above)
hull = ConvexHull(points)
for facet in hull.simplices:
    plot(points[facet, 0], points[facet, 1], 'r--')
grid()
show()
If f (x) is a function, its graph is the set of points (x, y) in Rd+1 satisfying
y = f (x), and its epigraph is the set of points (x, y) satisfying y ≥ f (x).
If f (x) is defined on Rd , its sublevel sets are in Rd , and its epigraph is in
Rd+1 . Then f (x) is a convex function exactly when its epigraph is a convex
set (Figure 7.19). From convex functions, there are other ways to get convex
sets:
E: f (x) ≤ 1
is a convex set.
H: n · (x − x0 ) = 0. (7.5.4)
n · (x − x0 ) < 0 n · (x − x0 ) = 0 n · (x − x0 ) > 0.
[Figure: a hyperplane through x0 with normal vector n; on either side of it,
n · (x − x0) has opposite signs.]
Separating Hyperplane
Let E be a convex set and let x∗ be a point not in E. Then there is a
hyperplane separating x∗ and E: For some x0 in E and nonzero n,
[Figure: a convex set E, a point x∗ outside E, the closest point x0 in E, and
the separating hyperplane with normal n = x∗ − x0.]
Expanding, we have
0 ≤ 2(x0 − x∗ ) · v + t|v|2 , 0 ≤ t ≤ 1.
Since this is true for small positive t, sending t → 0, results in v·(x0 −x∗ ) ≥ 0.
Setting n = x∗ − x0 , we obtain
x in E =⇒ (x − x0 ) · n ≤ 0. (7.5.6)
where the minimum is taken over all vectors x. A minimizer is the location of
the bottom of the graph of the function. For example, the parabola (Figure
7.4) and the relative information (Figure 7.10) both have global minimizers.
We say a function f (x) is strictly convex if g(t) = f (a + tv) is strictly
convex for every point a and direction v. This is the same as saying the
inequality (7.5.1) is strict for 0 < t < 1.
We say a function f (x) is proper if the sublevel set f (x) ≤ c is bounded
for every level c. This is same as saying f (x) rises to +∞ as |x| → ∞.
Remember, if x is a scalar, |x| = ±x is the absolute value, and if x =
(x1, x2, . . . , xd) is in Rd,

|x| = √( x1² + x2² + · · · + xd² ).
For example, the functions in Figure 7.4 is proper and strictly convex,
while the function in Figure 7.5 is proper but neither convex nor strictly
convex.
Intuitively, if f (x) goes up to +∞ when x is far away, then its graph must
have a minimizer at some point x∗ .
To see this, suppose f (x) is not proper. Then there would be a sequence
x1 , x2 , . . . in the row space of A satisfying |xn | → ∞ while f (xn ) remains
bounded, say f (xn ) ≤ c for some constant c. Let x′n = xn /|xn |. Then x′n are
unit vectors in the row space of A, hence x′n is a bounded sequence. From
§A.2, this implies x′n subconverges to some x∗ , necessarily a unit vector in
the row space of A.
By the triangle inequality (2.2.4),

|Ax′n| = (1/|xn|)|Axn| ≤ (1/|xn|)( |Axn − b| + |b| ) ≤ (1/|xn|)( c + |b| ).
Properness of Residual

When the N × d matrix A has rank d, the residual

f(x) = |Ax − b|²

is proper on Rd.
To see this, pick any point a. Then, by properness, the sublevel set S
given by f (x) ≤ f (a) is bounded. By continuity of f (x), there is a minimizer
x∗ (see §A.2). Since for all x outside the sublevel set, we have f (x) > f (a),
x∗ is a global minimizer.
As a consequence,
Let a be any point, and v any direction, and let g(t) = f (a + tv). Then
g ′ (0) = ∇f (a) · v.
∂²f/∂xi∂xj,    1 ≤ i, j ≤ d,
(d²/dt²) f(x + tv) |_{t=0} = v · Qv.    (7.5.14)

This implies

(m/2)|x − a|² ≤ f(x) − f(a) − ∇f(a) · (x − a) ≤ (L/2)|x − a|².    (7.5.15)

(m/2)|x − x∗|² ≤ f(x) − f(x∗) ≤ (L/2)|x − x∗|².    (7.5.16)
Here the maximum is over all vectors x, and p = (p1 , p2 , . . . , pd ), the dual
variable, also has d features. We will work in situations where a maximizer
exists in (7.5.17).
Let Q > 0 be a positive matrix. The simplest example is

f(x) = (1/2) x · Qx  ⟹  g(p) = (1/2) p · Q⁻¹p.

This is established by the identity

(1/2)(p − Qx) · Q⁻¹(p − Qx) = (1/2) p · Q⁻¹p − p · x + (1/2) x · Qx.    (7.5.18)
To see this, note the left side is greater or equal to zero. Since the left side
equals zero iff p = Qx, we are led to (7.5.17). The next simplest example is
the partition function, see below.
If x is a maximizer in (7.5.17), then the derivative is zero,
p = ∇f (x) ⇐⇒ x = ∇g(p).
This yields
Using this, and writing out (7.5.15) for g(p) instead of f (x) (we skip the
details) yields
(p − q) · (x − a) ≥ (mL/(m + L)) |x − a|² + (1/(m + L)) |p − q|².    (7.5.20)
This is derived by using (7.5.19), the details are in [2]. This result is used
in gradient descent.
Because

q2 = e^{z2}/(e^{z1} + e^{z2}) = 1/(1 + e^{−(z2−z1)}) = σ(z2 − z1).
Because of this, the softmax function is the multinomial analog of the logistic
function, and we use the same symbol σ to denote both functions.
z = array([z1,z2,z3])
q = softmax(z)
or
σ(z) = σ(z + a1).
To guarantee uniqueness of a global minimum of Z, we have to restrict
attention to the subspace of vectors z = (z1 , z2 , . . . , zd ) orthogonal to 1, the
vectors satisfying
z1 + z2 + · · · + zd = 0.
Now suppose z is orthogonal to 1. Since the exponential function is
convex,

e^{Z}/d = (1/d) Σₖ₌₁ᵈ e^{zk} ≥ exp( (1/d) Σₖ₌₁ᵈ zk ) = e⁰ = 1.
This establishes
Define
log p = (log p1 , log p2 , . . . , log pd ).
Then the inverse of p = σ(z) is
z = Z1 + log p. (7.6.4)
The function
I(p) = p · log p = Σₖ₌₁ᵈ pk log pk    (7.6.5)
This implies

p · z = Σₖ₌₁ᵈ pk zk = Σₖ₌₁ᵈ pk log(e^{zk})
      ≤ log( Σₖ₌₁ᵈ pk e^{zk} ) = log( Σₖ₌₁ᵈ e^{zk + log pk} ) = Z(z + log p).
For all z,

Z(z) = max_p ( p · z − I(p) ).

Since

D²I(p) = diag( 1/p1, 1/p2, . . . , 1/pd ),
we see I(p) is strictly convex, and H(p) is strictly concave.
In Python, the entropy is
p = array([p_1,p_2,p_3])
entropy(p)
Now

∂²Z/∂zj∂zk = ∂σj/∂zk = σj − σjσk  if j = k,    and    −σjσk  if j ≠ k.

Hence we have

v · Qv = Σₖ₌₁ᵈ qk vk² − (v · q)² = Σₖ₌₁ᵈ qk (vk − v̄)²,
zj ≤ c, j = 1, 2, . . . , d.
−zj ≤ (d − 1)c, j = 1, 2, . . . , d.
We conclude
We have shown
and
I(p, q) = I(p) − p · log q. (7.6.12)
Similarly, the relative entropy is
p = array([p1,p2,p3])
q = array([q1,q2,q3])
entropy(p,q)
returns the relative information, not the relative entropy. See below for more
on this terminology confusion.
This identity is the direct analog of (7.5.18). The identity (7.5.18) arises
naturally in least squares, or linear regression. Similarly, (7.6.14) arises in
logistic regression.
The cross-information is
d
X
Icross (p, q) = − pk log qk ,
k=1
Since I(p, σ(z)) and Icross (p, σ(z)) differ by the constant I(p), we also have
This will be useful in logistic regression. Table 7.25 summarizes the situation.
H = −I Information Entropy
Absolute I(p) H(p)
Cross Icross (p, q) Hcross (p, q)
Relative I(p, q) H(p, q)
Curvature Convex Concave
Error I(p, q) with q = σ(z)
Table 7.25: The third row is the sum of the first and second rows, and the
H column is the negative of the I column.
Here is the multinomial analog of (7.2.6). Suppose a dice has d faces, and
suppose the probability of rolling the k-th face in a single roll is qk . Then
q = (q1 , q2 , . . . , qd ) is a probability distribution. Let p = (p1 , p2 , . . . , pd ) be
another probability distribution. Roll the dice n times, and let Pn (p, q) be
the probability that the proportion of times the k-th face is rolled equals pk ,
k = 1, 2, . . . , d. Then we have the approximation
Machine Learning
8.1 Overview
This first section is an overview of the chapter. Here is a summary of the
structure of neural networks.
• In a directed graph, there are input nodes, output nodes, and hidden
nodes.
• A network is a weighed directed graph (§4.2) where the nodes are neu-
rons (§7.4).
Because wij = 0 if (i, j) is not an edge, the nonzero entries in the incoming
list at node j correspond to the edges incoming at node j.
A neural network is a network where every activation function is restricted
to be a function of the sum of the entries of the incoming list.
For example, all the networks in this section are neural networks, but the
network in Figure 7.13 is not a neural network.
Let

x⁻j = Σ_{i→j} wij xi    (8.2.1)

be the sum of the incoming list at node j. Then, in a neural network, the
outgoing signal at node j is

xj = fj(x⁻j) = fj( Σ_{i→j} wij xi ).    (8.2.2)
x = (x1 , x2 , . . . , xd ),
In a network, in §7.4, x⁻j was a list or vector; in a neural network, x⁻j is a
scalar.
scalar.
Let W be the weight matrix. If the network has d nodes, the activation
vector is
f = (f1 , f2 , . . . , fd ).
Then a neural network may be written in vector-matrix form
x = f (W t x).
However, this representation is more useful when the network has structure,
for example in a dense shallow layer (8.2.12).
y = f (w1 x1 + w2 x2 + · · · + wd xd ) = f (w · x)
Neural Network
Every neural network is a combination of perceptrons.
[Figure: a perceptron with inputs x1, x2, x3, weights w1, w2, w3, activation f,
and output y.]
y = f (w1 x1 + w2 x2 + · · · + wd xd + w0 ) = f (w · x + w0 ).
P rob(H | x) = σ(w · x + w0 ).
Perceptrons gained wide exposure after Minsky and Papert’s famous 1969
book [15], from which Figure 8.2 is taken.
[Figure 8.3: a network with input nodes x1, x2, neurons f3, f4, f5, f6, output
nodes x5, x6, and weights w13, w14, w23, w24, w35, w36, w45, w46.]
Let xin and xout be the outgoing vectors corresponding to the input and
output nodes. Then the network in Figure 8.3 has outgoing vectors
Here are the incoming and outgoing signals at each of the four neurons f3 ,
f4 , f5 , f6 .
[Figure: an edge (i, j) with weight wij joining neurons fi and fj, carrying the
signal xi to node j.]
def incoming(x,w,j):
return sum([ outgoing(x,w,i)*w[i][j] if w[i][j] != None
,→ else 0 for i in range(d) ])
def outgoing(x,w,j):
if x[j] != None: return x[j]
else: return activate[j](incoming(x,w,j))
We assume the nodes are ordered so that the initial portion of x equals
xin ,
m = len(x_in)
x[:m] = x_in
def forward_prop(x_in,w):
d = len(w)
x = [None]*d
m = len(x_in)
x[:m] = x_in
for j in range(m,d): x[j] = outgoing(x,w,j)
return x
activate = [None]*d
activate[2] = relu
activate[3] = id
activate[4] = sigmoid
activate[5] = tanh
x_in = [1.5,2.5]
x = forward_prop(x_in,w)
Let
y1 = 0.427, y2 = −0.288, y = (y1 , y2 )
be targets, and let J(xout , y) be a function of the outputs xout of the output
nodes, and the targets y. For example, for Figure 8.3, xout = (x5 , x6 ) and we
may take J to be the mean square error
J(xout, y) = (1/2)(x5 − y1)² + (1/2)(x6 − y2)²,    (8.2.6)
The code for this J is
def J(x_out,y):
m = len(y)
return sum([ (x_out[i] - y[i])**2/2 for i in range(m) ])
y0 = [0.132,-0.954]
y = [0.427, -0.288]
J(x_out,y0), J(x_out,y)
∂J/∂x⁻j,    f′j(x⁻j),    ∂J/∂xj.    (8.2.7)
These are the downstream derivative, local derivative, and upstream derivative
at node j. (The terminology reflects the fact that derivatives are computed
backward.)
[Figure: at node i, the downstream derivative ∂J/∂x⁻i, the local derivative f′i,
and the upstream derivative ∂J/∂xi.]
From (8.2.2),

∂xj/∂x⁻j = f′j(x⁻j).    (8.2.8)

By the chain rule and (8.2.8), the key relation between these derivatives
is

∂J/∂x⁻i = (∂J/∂xi) · f′i(x⁻i),    (8.2.9)

or

downstream = upstream × local.
def local(x,w,i):
return der_dict[activate[i]](incoming(x,w,i))
Let

δi = ∂J/∂x⁻i,    i = 1, 2, . . . , d.
d = len(x)
m = len(x_out)
x[d-m:] = x_out
delta[d-m:] = delta_out
δ5 = ∂J/∂x⁻5,    δ6 = ∂J/∂x⁻6,    δout = (δ5, δ6)

∂J/∂x5 = (x5 − y1) = −0.294.
σ′(x⁻5) = σ(x⁻5)(1 − σ(x⁻5)) = x5(1 − x5) = 0.114.
Similarly,
δ6 = −0.059.
We conclude
δout = (−0.0337, −0.059).
The code for this is
def delta_out(x_out,y,w):
d =len(w)
m = len(y)
return [ (x_out[i] - y[i]) * local(x,w,d-m+i) for i in
,→ range(m) ]
delta_out(x_out,y,w)
∂J/∂x⁻i = Σ_{i→j} (∂J/∂x⁻j) · (∂x⁻j/∂xi) · (∂xi/∂x⁻i)
        = ( Σ_{i→j} (∂J/∂x⁻j) · wij ) · f′i(x⁻i).
The code is
def downstream(x,delta,w,i):
if delta[i] != None: return delta[i]
else:
upstream = sum([ downstream(x,delta,w,j) * w[i][j] if
,→ w[i][j] != None else 0 for j in range(d) ])
return upstream * local(x,w,i)
def backward_prop(x,y,w):
d = len(w)
delta = [None]*d
m = len(y)
x_out = x[d-m:]
delta[d-m:] = delta_out(x_out,y,w)
for i in range(d-m): delta[i] = downstream(x,delta,w,i)
return delta
delta = backward_prop(x,y,w)
returns
∂x⁻j/∂wij = xi,
∂wij
We have shown
∂J/∂wij = xi · δj.    (8.2.11)
[Figure: a dense shallow layer with input nodes x1, x2, x3, x4, each connected
to three neurons with activation f and outputs z1, z2, z3.]
Our convention is to let wij denote the weight on the edge (i, j). With this
convention, the formulas (8.2.1), (8.2.2) reduce to the matrix multiplication
formulas
z − = W t x, z = f (W t x). (8.2.12)
Thus a dense shallow network can be thought of as a vector-valued percep-
tron. This allows for vectorized forward and back propagation.
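For example, a single dense layer can be evaluated in one line. This is a minimal sketch assuming numpy and a sigmoid activation; the function names are illustrative, not from the text.

from numpy import exp

def sigmoid(z): return 1/(1 + exp(-z))

def forward_layer(x, W, f=sigmoid):
    # vectorized forward propagation through one dense layer: z = f(W^t x)
    return f(W.T @ x)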
8.3 Gradient Descent

This goal is so general that any concrete insight one provides toward it is
widely useful in many settings. The setting we have in
mind is f = J, where J is the error from the previous section.
Usually f (w) is a measure of cost or lack of compatibility. Because of
this, f (w) is called the loss function or cost function.
A neural network is a black box with inputs x and outputs y, depending on
unknown weights w. To train the network is to select weights w in response
to training data (x, y). The optimal weights w∗ are selected as minimizers
of a loss function f (w) measuring the error between predicted outputs and
actual outputs, corresponding to given training inputs.
From §7.5, if the loss function f (w) is continuous and proper, there is
a global minimizer w∗ . If f (w) is in addition strictly convex, w∗ is unique.
Moreover, if the gradient of the loss function is g = ∇f (w), then w∗ is a
critical point, g ∗ = ∇f (w∗ ) = 0.
Inserting a = w and b = w⁺,
Solving for w⁺,

w⁺ ≈ w − g(w)/g′(w).

Since the global minimizer w∗ satisfies f′(w∗) = 0, we insert g(w) = f′(w)
in the above approximation,

w⁺ ≈ w − f′(w)/f″(w).
wn+1 = wn − f′(wn)/f″(wn),    n = 1, 2, . . .
def newton(loss,grad,curv,w,num_iter):
g = grad(w)
c = curv(w)
trajectory = array([[w],[loss(w)]])
for _ in range(num_iter):
w -= g/c
trajectory = column_stack([trajectory,[w,loss(w)]])
g = grad(w)
c = curv(w)
if allclose(g,0): break
return trajectory
u0 = -2.72204813
w0 = 2.45269774
num_iter = 20
trajectory = newton(loss,grad,curv,w0,num_iter)
def plot_descent(a,b,loss,curv,delta,trajectory):
w = arange(a,b,delta)
plot(w,loss(w),color='red',linewidth=1)
plot(w,curv(w),"--",color='blue',linewidth=1)
plot(*trajectory,color='green',linewidth=1)
scatter(*trajectory,s=10)
ylim(-15,10)
delta = .01
plot_descent(u0,w0,loss,curv,delta,trajectory)
f(w) = f(w1, w2, . . . ).

w1⁺ = w1 − t ∂f/∂w1,
w2⁺ = w2 − t ∂f/∂w2,
. . .

In other words,
In practice, the learning rate is selected by trial and error. Which learning
rate does the theory recommend?
Given an initial point w0 , the sublevel set at w0 (see §7.5) consists of all
points w where f (w) ≤ f (w0 ). Only the part of the sublevel set that is
connected to w0 counts.
Figure 8.10: Double well cost function and sublevel sets at w0 and at w1, with
the points u0, a, b, c, w1, w0 marked on the w-axis.
In Figure 8.10, the sublevel set at w0 is the interval [u0 , w0 ]. The sublevel
set at w1 is the interval [b, w1 ]. Notice we do not include any points to the
left of b in the sublevel set at w1 , because points to the left of b are separated
from w1 by the gap at the point b.
Suppose the second derivative D2 f (w) is never greater than a constant L
on the sublevel set. This means
To see this, fix w and let S be the sublevel set {w′ : f (w′ ) ≤ f (w)}. Since
the gradient pushes f down, for t > 0 small, w+ stays in S. Insert x = w+
and a = w into the right half of (7.5.15) and simplify. This leads to
f(w⁺) ≤ f(w) − t|∇f(w)|² + (t²L/2)|∇f(w)|².

Since tL ≤ 1 when 0 ≤ t ≤ 1/L, we have t²L ≤ t. This derives (8.3.3).
The curvature of the loss function and the learning rate are inversely
proportional. Where the curvature of the graph of f (w) is large, the learning
rate 1/L is small, and gradient descent proceeds in small time steps.
f(wn+1) ≤ f(wn) − (1/(2L)) |∇f(wn)|².

Since f(wn) and f(wn+1) both converge to f(w∗), and ∇f(wn) converges to
∇f(w∗), we conclude

f(w∗) ≤ f(w∗) − (1/(2L)) |∇f(w∗)|².
For example, let f(w) = w⁴ − 6w² + 2w (Figures 8.9, 8.10, 8.11). Then

f′(w) = 4w³ − 12w + 2,    f″(w) = 12w² − 12.

Thus the inflection points (where f″(w) = 0) are ±1 and, in Figure 8.10, the
critical points are a, b, c.
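In code, a sketch of the loss, grad, curv functions used by newton and gd above (not shown in this excerpt), following the formulas just given:

def loss(w): return w**4 - 6*w**2 + 2*w
def grad(w): return 4*w**3 - 12*w + 2
def curv(w): return 12*w**2 - 12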
Let u0 and w0 be the points satisfying f (w) = 5 as in Figure 8.11.
Then u0 = −2.72204813 and w0 = 2.45269774, so f ′′ (u0 ) = 76.914552 and
f ′′ (w0 ) = 60.188. Thus we may choose L = 76.914552. With this L, the
short-step gradient descent starting at w0 is guaranteed to converge to one
of the three critical points. In fact, the sequence converges to the right-most
critical point c (Figure 8.10).
This exposes a flaw in basic gradient descent. Gradient descent may con-
verge to a local minimizer, and miss the global minimizer. In §8.8, modified
gradient descent will address some of these shortcomings.
def gd(loss,grad,w,learning_rate,num_iter):
g = grad(w)
trajectory = array([[w],[loss(w)]])
for _ in range(num_iter):
w -= learning_rate * g
trajectory = column_stack([trajectory,[w,loss(w)]])
g = grad(w)
if allclose(g,0): break
return trajectory
u0 = -2.72204813
w0 = 2.45269774
L = 76.914552
learning_rate = 1/L
num_iter = 100
trajectory = gd(loss,grad,w0,learning_rate,num_iter)
ylim(-15,10)
delta = .01
plot_descent(u0,w0,loss,curv,delta,trajectory)
xin → xout .
Given inputs xin and target outputs y, we seek to modify the weight matrix
W so that the input-output map is
xin → y.
This is training.
Let (§8.2)

x⁻ = (x⁻1, x⁻2, . . . , x⁻d),    x = (x1, x2, . . . , xd),    δ = (δ1, δ2, . . . , δd)
Let wij be the weight along an edge (i, j), let xi be the outgoing signal
from the i-th node, and let δj be the downstream derivative of the
output J with respect to the j-th node. Then the derivative ∂J/∂wij
equals xi δj . In this partial sense,
∇W J = x ⊗ δ. (8.4.2)
def update_weights(x,delta,w,learning_rate):
d = len(w)
for i in range(d):
for j in range(d):
if w[i][j]:
w[i][j] = w[i][j] - learning_rate*x[i]*delta[j]
def train_nn(x_in,y,w0,learning_rate,n_iter):
trajectory = []
cost = 1
# build a local copy
w = [ row[:] for row in w0 ]
d = len(w0)
for _ in range(n_iter):
x = forward_prop(x_in,w)
delta = backward_prop(x,y,w)
update_weights(x,delta,w,learning_rate)
m = len(y)
x_out = x[d-m:]
cost = J(x_out,y)
trajectory.append(cost)
if allclose(0,cost): break
return w, trajectory
Here n_iter is the maximum number of iterations allowed, and the iterations
stop if the cost J is close to zero.
The cost or error function J enters the code only through the function
delta_out, which is part of the function backward_prop.
Let W0 be the weight matrix (8.2.4). Then
x_in = [1.5,2.5]
learning_rate = .01
y0 = 0.4265356063
y1 = -0.2876478137
y = [y0,y1]
n_iter = 10000
w, trajectory = train_nn(x_in,y,w0,learning_rate,n_iter)
returns the cost trajectory, which can be plotted using the code
for lr in [.01,.02,.03,.035]:
w, trajectory = train_nn(x_in,y,w0,lr,n_iter)
n = len(trajectory)
label = str(n) + ", " + str(lr)
plot(range(n),trajectory,label=label)
grid()
legend()
show()
Figure 8.12: Cost trajectory and number of iterations as learning rate varies.
∇W J(x, y, W ) = (x ⊗ x)W − x ⊗ y.
[Figure: a linear-regression network with inputs x1, x2, x3, x4, output z = W^t x,
target y, and mean-square error J = |z − y|²/2.]
∇W J(W ) = XX t W − XY t ,
[Figure: a logistic-regression network with inputs x1, x2, x3, x4, output z = W^t x,
probabilities q = σ(z), target p, and error J = I(p, q).]
and, by (7.6.9),
W1 = 0 (8.5.7)
Σⱼ₌₁ᵈ vj² qj − ( Σⱼ₌₁ᵈ vj qj )² = Σⱼ₌₁ᵈ (vj − v̄)² qj.
Now we turn to properness of J(W ). There are two cases: Strict proba-
bilities, and one-hot encoded probabilities. Here is the first case.
The convex hull is discussed in §7.5, see Figures 7.20 and 7.21. In the
next section, we show a simple example of how this works.
Note if the span of Ki ∩ Kj is full-rank, then the span of the dataset itself
is full-rank, hence, from a previous result in this section, J(W ) is strictly
convex.
Suppose J(W ) ≤ c. Then J(x, p, W ) ≤ c for every sample x and corre-
sponding target p.
Let x be a sample in Ki , and let z = W t x. Then the corresponding target
p satisfies pi = 1, and I(p) = 0. If j ≠ i, by (7.6.14),

|zi − zj| ≤ c,    for x in Ki ∩ Kj.

Summing over j ≠ i, and using z1 + z2 + · · · + zd = 0,

d|zi| = |(d − 1)zi + zi| = | Σ_{j≠i} (zi − zj) | ≤ (d − 1)C.

This implies

|z| = ( Σᵢ₌₁ᵈ zi² )^{1/2} ≤ C√d
∇z J(x, p, W ) = 0
can always be solved, so there is at least one minimum for each J(x, p, W ).
Here properness ultimately depends on properness of the partition function
Z(z).
In one-hot encoded regression, ∇W J(x, p, W ) = 0 can never be solved,
because q = σ(z) is always strict and p is one-hot encoded, see (8.5.6). Nev-
ertheless, properness of J(W ) is achievable, hence ∇W J(W ) = 0 is solvable,
if there is sufficient overlap between the sample categories.
In linear regression, the minimizer is expressible in terms of the regression
equation, and thus can be solved in principle using the pseudo-inverse. In
practice, when the dimensions are high, gradient descent may be the only
option for linear regression. In logistic regression, the minimizer cannot be
found in closed form, so we have no choice but to apply gradient descent,
even for low dimensions.
y = w · x = w1 x1 + w2 x2 + · · · + wd xd .
For example, Figure 8.16 is a dataset and Figure 8.15 is a plot of popu-
lation versus employed, with the mean and the regression line shown.
Linear Regression
We work out the regression equation in the plane, when both features x
and y are scalar. In this case, w = (m, b) and
X is the N × 2 matrix whose k-th row is (xk, 1), and Y is the N × 1 column
vector whose k-th entry is yk.
In the scalar case, the regression equation (8.6.3) is 2 × 2. To simplify
the computation of X t X, let
x̄ = (1/N) Σₖ₌₁ᴺ xk,    ȳ = (1/N) Σₖ₌₁ᴺ yk.
Then (x̄, ȳ) is the mean of the dataset. Also, let x and y denote the vectors
(x1 , x2 , . . . , xN ) and (y1 , y1 , . . . , yN ), and let, as in §1.6,
cov(x, y) = (1/N) Σₖ₌₁ᴺ (xk − x̄)(yk − ȳ) = (1/N) x · y − x̄ȳ.
Then cov(x, y) is the covariance between x and y,
t x · x x̄ t x·y
X X=N , X Y =N .
x̄ 1 ȳ
With w = (m, b), the regression equation reduces to
(x · x)m + x̄b = x · y,
mx̄ + b = ȳ.
The second equation says the regression line passes through the mean (x̄, ȳ).
Multiplying the second equation by x̄ and subtracting the result from the
first equation cancels the b and leads to
cov(x, x)m = (x · x − x̄2 )m = (x · y − x̄ȳ) = cov(x, y).
This derives
The regression line in two dimensions passes through the mean (x̄, ȳ)
and has slope
m = cov(x, y)/cov(x, x).
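In code, the boxed result gives the slope and intercept directly from the covariances. This is a minimal sketch; the arrays x and y are made-up data, not the standardized Longley columns used below.

from numpy import array

x = array([1.0, 2.0, 3.0, 4.0, 5.0])
y = array([2.1, 2.9, 4.2, 4.8, 6.1])

xbar, ybar = x.mean(), y.mean()
cov_xy = ((x - xbar)*(y - ybar)).mean()
cov_xx = ((x - xbar)**2).mean()

m = cov_xy/cov_xx       # slope of the regression line
b = ybar - m*xbar       # intercept: the line passes through the mean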
df = read_csv("longley.csv")
X = df["Population"].to_numpy()
Y = df["Employed"].to_numpy()
# center and standardize both columns
X = X - mean(X)
Y = Y - mean(Y)
varx = sum(X**2)/len(X)
vary = sum(Y**2)/len(Y)
X = X/sqrt(varx)
Y = Y/sqrt(vary)
After this, we compute the optimal weight w∗ and construct the polyno-
mial. The regression equation is solved using the pseudo-inverse (§2.3).
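The weight computation itself is not shown above; here is one plausible sketch using the pseudo-inverse on a matrix of powers of X, together with a hypothetical helper f(x, d) matching the plotting code below. The actual definitions in the text may differ.

from numpy import column_stack
from scipy.linalg import pinv

def fit_poly(X, Y, d):
    # feature matrix with columns 1, x, x^2, ..., x^(d-1)
    A = column_stack([X**k for k in range(d)])
    return pinv(A) @ Y                       # optimal weight w*

def f(x, d):
    # evaluate the degree d-1 polynomial fit to (X, Y)
    w = fit_poly(X, Y, d)
    return column_stack([x**k for k in range(d)]) @ w

xmin, xmax = X.min(), X.max()                # plotting interval used below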
figure(figsize=(12,12))
# six subplots
rows, cols = 3,2
# x interval
x = arange(xmin,xmax,.01)

for i in range(6):
    d = 3 + 2*i # degree = d-1
    subplot(rows, cols,i+1)
    plot(X,Y,"o",markersize=2)
    plot([0],[0],marker="o",color="red",markersize=4)
    plot(x,f(x,d),color="blue",linewidth=.5)
    xlabel("degree = %s" % str(d-1))
    grid()

show()
Running this code with degree 1 returns Figure 8.15. Taking the degree too high can lead to overfitting, as happens here for degree 12.
More generally, we may only know the amount of study time xk , and the
probability pk that the student passed, where now 0 ≤ pk ≤ 1.
x p x p x p x p x p
0.5 0 .75 0 1.0 0 1.25 0 1.5 0
1.75 0 1.75 1 2.0 0 2.25 1 2.5 0
2.75 1 3.0 0 3.25 1 3.5 0 4.0 1
4.25 1 4.5 1 4.75 1 5.0 1 5.5 1
Let σ(z) be the sigmoid function. Then, as in the previous section, the
goal is to minimize the loss function
J(m, b) = ∑_{k=1}^N I(pₖ , qₖ ),    qₖ = σ(m xₖ + b).    (8.6.4)
Once we have the minimizer (m, b), we have the best-fit curve
q = σ(mx + b)
(Figure 8.18).
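For concreteness, here is a sketch of evaluating (8.6.4) on the table above. It assumes I(p, q) is the cross entropy −p log q − (1 − p) log(1 − q); for the 0–1 targets here this coincides with the relative entropy, and either choice gives the same minimizer.

from numpy import array, log
from scipy.special import expit

X = array([0.5, 0.75, 1.0, 1.25, 1.5, 1.75, 1.75, 2.0, 2.25, 2.5,
           2.75, 3.0, 3.25, 3.5, 4.0, 4.25, 4.5, 4.75, 5.0, 5.5])
P = array([0,0,0,0,0,0,1,0,1,0,1,0,1,0,1,1,1,1,1,1])

def J(m, b):
    q = expit(m*X + b)                                  # q_k = sigma(m x_k + b)
    return -(P*log(q) + (1 - P)*log(1 - q)).sum()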
If the targets pk are one-hot encoded, the dataset is as follows.
x p x p x p x p x p
0.5 (1,0) .75 (1,0) 1.0 (1,0) 1.25 (1,0) 1.5 (1,0)
1.75 (1,0) 1.75 (0,1) 2.0 (1,0) 2.25 (0,1) 2.5 (1,0)
2.75 (0,1) 3.0 (1,0) 3.25 (0,1) 3.5 (1,0) 4.0 (0,1)
4.25 (0,1) 4.5 (0,1) 4.75 (0,1) 5.0 (0,1) 5.5 (0,1)
X = [0.5, 0.75, 1.0, 1.25, 1.5, 1.75, 1.75, 2.0, 2.25, 2.5,
     2.75, 3.0, 3.25, 3.5, 4.0, 4.25, 4.5, 4.75, 5.0, 5.5]
P = [0,0,0,0,0,0,1,0,1,0,1,0,1,0,1,1,1,1,1,1]
x = arange(0,6,.01)
plot(x,expit(m*x+b))
scatter(X,P)
grid()
show()
To apply the results from the previous section, first we incorporate the
bias and rewrite the dataset as
(x1 , 1), (x2 , 1), . . . , (xN , 1), N = 20.
Clearly the dataset is full-rank in R², hence J(m, b) is strictly convex. Each sample x in the dataset is in R², and each target is one-hot encoded as (p, 1 − p). This implies the weight matrix must satisfy (8.5.7), W1 = 0, so

W = [ b  −b ; m  −m ].
[Figure 8.21: a two-output network on inputs 1 and x, with weights b, m producing z and weights −b, −m producing −z, outputs (q, 1 − q) = σ(z, −z), target p, and loss J = I(p, q).]
Since here d = 2, the networks in Figures 8.21 and 8.22 are equivalent.
In Figure 8.21, σ is the softmax function. In Figure 8.22, σ is the sigmoid
function.
[Figure 8.22: a one-output network on inputs 1 and x, with weights b and m producing z, output q = σ(z), target p, and loss J = I(p, q).]
Figure 8.18 is a plot of x against p. However, the dataset, with the bias
input included, has two inputs x, 1 and one output p, and should be plotted
in three dimensions (x, 1, p). Then (Figure 8.23) samples lie on the line (x, 1)
in the horizontal plane, and p is on the vertical axis. In particular, K0 and
K1 computed in the next paragraph are in the horizontal plane.
Referring to Figure 8.18, the convex hulls K0 and K1 are in feature space, which here is the horizontal plane R². Now the convex hull K0 of the samples corresponding to pk = 0 is the line segment joining (.5, 1) and (3.5, 1), and the convex hull K1 of the samples corresponding to pk = 1 is the line segment joining (1.75, 1) and (5.5, 1). Since K0 ∩ K1 is the line segment joining (1.75, 1) and (3.5, 1), the span of K0 ∩ K1 is the whole horizontal plane R² in Figure 8.23. By the results of the previous section, J(w) is proper.
Here is the descent code.
X = [0.5, 0.75, 1.0, 1.25, 1.5, 1.75, 1.75, 2.0, 2.25, 2.5,
     2.75, 3.0, 3.25, 3.5, 4.0, 4.25, 4.5, 4.75, 5.0, 5.5]
P = [0,0,0,0,0,0,1,0,1,0,1,0,1,0,1,1,1,1,1,1]

def gradient(m,b):
    # gradient of J(m,b): dJ/dm = sum_k (q_k - p_k) x_k, dJ/db = sum_k (q_k - p_k)
    q = expit(m*array(X) + b)
    return array([sum((q - P)*X), sum(q - P)])

# gradient descent
w = array([0,0]) # starting m,b
t = .01 # learning rate
for _ in range(100000): # iteration count not specified in the text
    g = gradient(*w)
    w = w - t*g
The resulting minimizer is m = 1.49991537, b = −4.06373862.
The Iris dataset consists of 150 samples divided into three groups, leading to three convex hulls K0 , K1 , K2 in R⁴. If the dataset is projected onto the top two principal components, then the projections of these three hulls do not intersect pairwise (Figure 8.24). It follows we have no guarantee the logistic loss is proper.
On the other hand, the MNIST dataset consists of 60,000 samples divided into ten groups. If the MNIST dataset is projected onto the top two principal components, the projections of the ten convex hulls K0 , K1 , . . . , K9 onto R² do intersect (Figure 8.25).
This does not guarantee that the ten convex hulls K0 , K1 , . . . , K9 in R784
intersect, but at least this is so for the 2d projection of the MNIST dataset.
Therefore the logistic loss of the 2d projection of the MNIST dataset is
proper.
Recall this means the eigenvalues of the symmetric matrix D²f (w) are between m and L. In this situation, the condition number¹ r = m/L is between zero and one: 0 < r ≤ 1.
In the previous section, we saw that basic gradient descent converges to a critical point. If f (w) is strictly convex, there is exactly one critical point, the global minimum, so gradient descent converges to the global minimum. As a first example, consider the quadratic loss

f (w) = (1/2) w · Qw − b · w,    (8.7.2)
where Q is a covariance matrix. Then D2 f (w) = Q. If the eigenvalues of Q
are between positive constants m and L, then f (w) is smooth and strictly
convex.
By (7.3.6), the gradient for this example is g = Qw − b. Hence the
minimizer is the unique solution w∗ = Q−1 b of the linear system Qw =
b. Thus gradient descent is a natural tool for solving linear systems and
computing inverses, at least for covariance matrices Q.
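As an illustration, basic gradient descent with the gradient g = Qw − b solves Qw = b; here is a minimal sketch. The matrix Q, the vector b, and the iteration count are made-up placeholders, and the learning rate 1/L comes from the largest eigenvalue.

from numpy import array, allclose
from numpy.linalg import eigvalsh, solve

Q = array([[2.0, 1.0], [1.0, 3.0]])    # symmetric positive definite
b = array([1.0, 2.0])

L = eigvalsh(Q).max()                  # largest eigenvalue
t = 1/L                                # learning rate
w = array([0.0, 0.0])
for _ in range(1000):
    g = Q @ w - b                      # gradient of f(w) = w.Qw/2 - b.w
    w = w - t*g

print(allclose(w, solve(Q, b)))        # True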
By (7.5.16), f (w) lies between two quadratics,
(m/2) |w − w∗ |² ≤ f (w) − f (w∗ ) ≤ (L/2) |w − w∗ |².    (8.7.3)
¹ In the literature, the condition number is often defined as L/m.
How far we are from our goal w∗ can be measured by the error E(w) =
|w − w∗ |2 . Another measure of error is E(w) = f (w) − f (w∗ ). The goal is to
drive the error between w and w∗ to zero.
When f (w) is smooth and strictly convex in the sense of (8.7.1), the
estimate (8.7.3) shows these two error measures are equivalent. We use both
measures below.
Gradient Descent I
Let r = m/L and set E(w) = f (w)−f (w∗ ). Then the descent sequence
w0 , w1 , w2 , . . . given by (8.3.1) with learning rate
t = 1/L

converges to w∗ at the rate

E(wn ) ≤ (1 − r)ⁿ E(w0 ),    n = 1, 2, . . . .
Gradient Descent II
Let r = m/L and set E(w) = |w − w∗ |2 . Then the descent sequence
w0 , w1 , w2 , . . . given by (8.3.1) with learning rate
t = 2/(m + L)
converges to w∗ at the rate
E(wn ) ≤ ((1 − r)/(1 + r))^{2n} E(w0 ),    n = 1, 2, . . . .    (8.7.6)
For example, if L = 6 and m = 2, then r = 1/3, the learning rates are 1/6
versus 1/4, and the convergence rates are 2/3 versus 1/4. Even though GD-II
improves on GD-I, the improvement is not substantial. In the next section, we
use momentum to derive better convergence rates.
Let g be the gradient of the loss function at a point w. Then the line
passing through w in the direction of g is w − tg. When the loss function is
quadratic (7.3.6), f (w − tg) is a quadratic function of the scalar variable t.
In this case, the minimizer t along the line w − tg is explicitly computable as
t = (g · g)/(g · Qg).
This leads to gradient descent with varying time steps t0 , t1 , t2 , . . . . As a
consequence, one can show the error is lowered as follows,
E(w+ ) = ( 1 − 1/((u · Qu)(u · Q⁻¹ u)) ) E(w),    u = g/|g|.
w◦ = w + s(w − w− ). (8.8.1)
Here s is the decay rate. The momentum term reflects the direction induced
by the previous step. Because this mimics the behavior of a ball rolling
downhill, gradient descent with momentum is also called heavy ball descent.
Then the descent sequence w0 , w1 , w2 , . . . is generated by

w+ = w◦ − t∇f (w) = w + s(w − w− ) − t∇f (w).    (8.8.2)

Here we have two hyperparameters, the learning rate and the decay rate.
wn = w∗ + ρn v, Qv = λv. (8.8.5)
Inserting this into (8.8.3) and using Qw∗ = b leads to the quadratic equation
ρ2 = (1 − tλ + s)ρ − s.
Suppose the loss function f (w) is quadratic (8.7.2), let r = m/L, and
set E(w) = |w − w∗ |2 . Let C be given by (8.8.11). Then the descent
sequence w0 , w1 , w2 , . . . given by (8.8.2) with learning rate and decay
rate

t = (1/L) · 4/(1 + √r)²,    s = ((1 − √r)/(1 + √r))²,

converges to w∗ at the rate

E(wn ) ≤ 4C ((1 − √r)/(1 + √r))^{2n} E(w0 ),    n = 1, 2, . . .    (8.8.12)
w◦ = w + s(w − w− ),
w+ = w◦ − t∇f (w◦ ).    (8.8.13)
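Here is a sketch of the update (8.8.13) on the same quadratic example; evaluating the gradient at w◦ rather than at w is what distinguishes it from the heavy ball update (8.8.2). The Q, b, and iteration count are placeholders.

from numpy import array, sqrt
from numpy.linalg import eigvalsh, norm, solve

Q = array([[2.0, 1.0], [1.0, 3.0]])
b = array([1.0, 2.0])
wstar = solve(Q, b)

lam = eigvalsh(Q)
m, L = lam.min(), lam.max()
r = m/L
t = 1/L
s = (1 - sqrt(r))/(1 + sqrt(r))       # decay rate

w_prev = w = array([0.0, 0.0])        # w_{-1} = w_0
for _ in range(200):
    wo = w + s*(w - w_prev)           # momentum step (8.8.1)
    g = Q @ wo - b                    # gradient evaluated at w°
    w_prev, w = w, wo - t*g           # update (8.8.13)

print(norm(w - wstar))                # essentially zero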
Starting from w0 , and setting w−1 = w0 , here it turns out the loss sequence f (w0 ), f (w1 ), f (w2 ), . . . is not always decreasing. Because of this, we seek another function V (w) where the corresponding sequence V (w0 ), V (w1 ), V (w2 ), . . . is decreasing.
To explain this, it’s best to assume w∗ = 0 and f (w∗ ) = 0. This can
always be arranged by translating the coordinate system. Then it turns out
V (w) = f (w) + (L/2) |w − ρw− |²,    (8.8.14)
with a suitable choice of ρ, does the job. With the choices
t = 1/L,    s = (1 − √r)/(1 + √r),    ρ = 1 − √r,
we will show
V (w+ ) ≤ ρV (w). (8.8.15)
In fact, we see below (8.8.22), (8.8.23) that V is reduced by an additional
quantity proportional to the momentum term.
The choice t = 1/L is natural, coming from basic gradient descent (8.3.3).
The derivation of (8.8.15) below forces the choices for s and ρ.
Given a point w, while w+ is well-defined by (8.8.13), it is not clear what
w− means. There are two ways to insert meaning here. Either evaluate V (w)
along a sequence w0 , w1 , w2 , . . . and set, as before, wn− = wn−1 , or work with
the function W (w) = V (w+ ) instead of V (w). If we assume (w+ )− = w,
then W (w) is well-defined. With this understood, we nevertheless stick with
V (w) as in (8.8.14) to simplify the calculations.
We first show how (8.8.15) implies the result. Using (w0 )− = w0 and
(8.7.3),
V (w0 ) = f (w0 ) + (L/2) |w0 − ρw0 |² = f (w0 ) + (m/2) |w0 |² ≤ 2f (w0 ).
Moreover f (w) ≤ V (w). Iterating (8.8.15), we obtain

f (wn ) ≤ V (wn ) ≤ ρⁿ V (w0 ) ≤ 2ρⁿ f (w0 ),    n = 1, 2, . . . .

This derives

Accelerated Gradient Descent
Let r = m/L and set E(w) = f (w) − f (w∗ ). Then the descent sequence w0 , w1 , w2 , . . . given by (8.8.13) with learning rate and decay rate

t = 1/L,    s = (1 − √r)/(1 + √r),

converges to w∗ at the rate

E(wn ) ≤ 2(1 − √r)ⁿ E(w0 ),    n = 1, 2, . . . .
While the convergence rate for accelerated descent is slightly worse than
heavy ball descent, the value of accelerated descent is its validity for all
convex functions satisfying (8.7.1), and the fact, also due to Nesterov [16],
that this convergence rate is best-possible among all such functions.
Now we derive (8.8.15). Assume (w+ )− = w and w∗ = 0, f (w∗ ) = 0. We
know w◦ = (1 + s)w − sw− and w+ = w◦ − tg ◦ , where g ◦ = ∇f (w◦ ).
By the basic descent step (8.3.1) with w◦ replacing w, (8.3.3) implies
f (w+ ) ≤ f (w◦ ) − (t/2) |g ◦ |².    (8.8.17)
Here we used t = 1/L.
By (7.5.15) with x = w and a = w◦ ,
f (w◦ ) ≤ f (w) − g ◦ · (w − w◦ ) − (m/2) |w − w◦ |².    (8.8.18)
By (7.5.15) with x = w∗ = 0 and a = w◦ ,
f (w◦ ) ≤ g ◦ · w◦ − (m/2) |w◦ |².    (8.8.19)
Multiply (8.8.18) by ρ and (8.8.19) by 1 − ρ and add, then insert the sum
into (8.8.17). After some simplification, this yields
f (w+ ) ≤ ρf (w) + g ◦ · (w◦ − ρw) − (r/2t)( ρ|w − w◦ |² + (1 − ρ)|w◦ |² ) − (t/2) |g ◦ |².    (8.8.20)
Since
(w◦ − ρw) − tg ◦ = w+ − ρw,
we have
(1/2t) |w+ − ρw|² = (1/2t) |w◦ − ρw|² − g ◦ · (w◦ − ρw) + (t/2) |g ◦ |².
Adding this to (8.8.20) leads to
V (w+ ) ≤ ρf (w) − (r/2t)( ρ|w − w◦ |² + (1 − ρ)|w◦ |² ) + (1/2t) |w◦ − ρw|².    (8.8.21)
Let

R(a, b) = r( ρs²|b|² + (1 − ρ)|a + sb|² ) − |(1 − ρ)a + sb|² + ρ|(1 − ρ)a + ρb|²,

which is positive.
Appendices
A.1 SQL
Recall matrices (§2.1), datasets, CSV files, spreadsheets, arrays, and dataframes are basically the same objects.
Databases are collections of tables, where a table is another object similar to the above. Hence matrices, datasets, CSV files, spreadsheets, arrays, dataframes, and tables are all interchangeable objects (A.1.2), and we may convert between any of them. Here is a summary of basic SQL commands.
select from
limit
select distinct
where/where not <column>
where <column> = <data> and/or <column> = <data>
order by <column1>,<column2>
insert into table (<column1>,<column2>,...) \
values (<data1>, <data2>, ...)
is null
update <table> set <column> = <data> where ...
like <regex> (%, _, [abc], [a-f], [!abc])
delete from <table> where ...
select min(<column>) from <table> (also max, count, avg)
where <column> in/not in (<data array>)
between/not between <data1> and <data2>
as
join (left, right, inner, full)
create database <database>
drop database <database>
create table <table>
truncate <table>
alter table <table> add <column> <datatype>
alter table <table> drop column <column>
insert into <table> select
This is an unordered listing of key-value pairs. Here the keys are the strings dish, price, and quantity. Keys need not be strings; they may be integers or any immutable Python objects. Since a Python list is mutable, a key cannot be a list. Values may be any Python objects, so a value may be a list. In a dict, values are accessed through their keys. For example, item1["dish"] returns 'Hummus'.
A list-of-dicts is simply a Python list whose elements are Python dicts,
for example,
len(L), L[0]["dish"]
returns
(2,'Hummus')
returns True.
s = dumps(L)
Now print L and print s. Even though L and s “look” the same, L is a
list, and s is a string. To emphasize this point, note
L1 = loads(s)
Then L == L1 returns True. Strings having this form are called JSON
strings, and are easy to store in a database as VARCHARs (see Figure A.4).
The basic object in the Python module pandas is the dataframe (Figures
A.1, A.2, A.4, A.5). The pandas module can convert a dataframe df to
many, many other formats
df = DataFrame(L)
df
L1 = df.to_dict('records')
L == L1
returns True. Here the option 'records' returns a list-of-dicts; other options return a dict-of-dicts or other combinations.
To convert a CSV file into a dataframe, use the code
menu_df = read_csv("menu.csv")
menu_df
To go the other way, to convert the dataframe df to the CSV file menu1.csv, use the code
df.to_csv("menu1.csv")
df.to_csv("menu2.csv",index=False)
protocol = "mysql+pymysql://"
credentials = "username:password"
server = "@servername"
port = ":3306"
uri = protocol + credentials + server + port
This string contains your database username, your database password, the database server name, the server port, and the protocol. If the database is "rawa", the URI is
database = "/rawa"
uri = protocol + credentials + server + port + database
engine = sqlalchemy.create_engine(uri)
df.to_sql('Menu',engine,if_exists='replace')
df.to_sql('Menu',engine)
One benefit of this syntax is the automatic closure of the connection upon completion. This completes the discussion of how to convert between dataframes and SQL tables, and hence of conversions between any of the objects in (A.1.2).
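Going in the other direction, a table can be read back into a dataframe with read_sql. Here is a minimal sketch using the same engine; the query is an example, not from the text, and the with-block closes the connection automatically.

from pandas import read_sql
from sqlalchemy import create_engine, text

engine = create_engine(uri)
with engine.connect() as connection:
    menu_df = read_sql(text("select * from Menu"), connection)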
/* Menu */
dish varchar
price integer
/* ordersin */
orderid integer
created datetime
customerid integer
items json
/* ordersout */
orderid integer
subtotal integer
tip integer
tax integer
total integer
To achieve this task, we download the CSV files menu.csv and orders.csv, then we carry out these steps. (price and tip in menu.csv and orders.csv are in cents.)
5. Add a key items to OrdersIn whose values are JSON strings specifying
the items ordered in orders, using the prices in menu (these are in cents
so they are INTs). The JSON string is that of a list-of-dicts of the form discussed above, L = [item1, item2] (see row 0 in Figure A.4).
Do this by looping over each order in the list-of-dicts orders, then
looping over each item in the list-of-dicts menu, and extracting the
quantity ordered of the item item in the order order.
6. Add a key subtotal to OrdersOut whose values (in cents) are com-
puted from the above values.
Add a key tax to OrdersOut whose values (in cents) are computed
using the Connecticut tax rate 7.35%. Tax is applied to the sum of
subtotal and tip.
Add a key total to OrdersOut whose values (in cents) are computed
from the above values (subtotal, tax, tip).
# step 1
from pandas import *

protocol = "https://"
server = "math.temple.edu"
path = "/~hijab/teaching/csv_files/restaurant/"
url = protocol + server + path
# read the two CSV files into dataframes
menu_df = read_csv(url + "menu.csv")
orders_df = read_csv(url + "orders.csv")
# step 2
menu = menu_df.to_dict('records')
orders = orders_df.to_dict('records')
# step 3
OrdersIn = []
for r in orders:
    d = {}
    d["orderId"] = r["orderId"]
    d["created"] = r["created"]
    d["customerId"] = r["customerId"]
    OrdersIn.append(d)
# step 4
OrdersOut = []

for r in orders:
    d = {}
    d["orderId"] = r["orderId"]
    d["tip"] = r["tip"]
    OrdersOut.append(d)
# step 5
from json import *
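# A sketch of step 5 (not the text's code). It assumes, hypothetically, that
# each row of orders has one column per dish holding the quantity ordered.
for i, order in enumerate(orders):
    items = []
    for item in menu:
        quantity = order[item["dish"]]
        if quantity > 0:
            items.append({"dish": item["dish"], "price": item["price"],
                          "quantity": quantity})
    OrdersIn[i]["items"] = dumps(items)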
# step 6
for i,r in enumerate(OrdersOut):
    items = loads(OrdersIn[i]["items"])
    subtotal = sum([ item["price"]*item["quantity"] for item in items ])
    r["subtotal"] = subtotal
    tip = OrdersOut[i]["tip"]
    tax = int(.0735*(tip + subtotal))
    r["tax"] = tax
    r["total"] = subtotal + tip + tax
# step 7
ordersin_df = DataFrame(OrdersIn)
ordersout_df = DataFrame(OrdersOut)
# step 8
import sqlalchemy
from sqlalchemy import create_engine, text

engine = create_engine(uri)
dtype1 = { "dish":sqlalchemy.String(60), "price":sqlalchemy.Integer }
dtype2 = {
"orderId":sqlalchemy.Integer,
"created":sqlalchemy.String(30),
"customerId":sqlalchemy.Integer,
"items":sqlalchemy.String(1000)
}
dtype3 = {
"orderId":sqlalchemy.Integer,
"tip":sqlalchemy.Integer,
"subtotal":sqlalchemy.Integer,
"tax":sqlalchemy.Integer,
"total":sqlalchemy.Integer
}
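Presumably the three dataframes are then written out with these column types; here is a sketch (the table names and the if_exists choice are assumptions, following the schema listed earlier).

menu_df.to_sql('Menu', engine, dtype=dtype1, if_exists='replace', index=False)
ordersin_df.to_sql('ordersin', engine, dtype=dtype2, if_exists='replace', index=False)
ordersout_df.to_sql('ordersout', engine, dtype=dtype3, if_exists='replace', index=False)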
protocol = "mysql+pymysql://"
credentials = "username:password@"
server = "servername"
port = ":3306"
database = "/rawa"
uri = protocol + credentials + server + port + database
engine = sqlalchemy.create_engine(uri)
connection = engine.connect()
num = 0
print(num)
def num_plates(item):
    dishes = loads(item)
    return sum( [ dish["quantity"] for dish in dishes ] )
Then we use map to apply this function to every element in the series df["items"], resulting in another series. Then we sum the resulting series.
num = df["items"].map(num_plates).sum()
print(num)
Since the total number of plates is 14,949, and the total number of orders
is 4970, the average number of plates per order is 3.76.
A.2 Minimizing Sequences

Throughout the text, questions about the existence of minimizers were ignored. In this section, which may safely be skipped, we review the foundational material supporting the existence of minimizers.
The first issue that must be clarified is the difference between the minimum and the infimum. In a given situation, it is possible that there is no minimum. By contrast, in any reasonable situation, there is always an infimum.
For example, since y = eˣ is an increasing function, the minimum

min_{0≤x≤1} eˣ = min{eˣ | 0 ≤ x ≤ 1} = e⁰ = 1

is attained at the minimizer x = 0. By contrast, consider the values 1/x for x ≥ 1.
In this situation, the minimizer does not exist, but, since the values of 1/x are
arbitrarily close to 0, we say the infimum is 0. Since there is no minimizer,
there is no minimum value. Also, even though 0 is the infimum, we do not
say ∞ is the “infimizer”, since ∞ is not an actual number.
When S is infinite, a minimum need not exist. When the minimum exists, we write m = min S.
If S is bounded below, then S has many lower bounds. The greatest
among these lower bounds is the infimum of S. A foundational axiom for
real numbers is that the infimum always exists. When m is the infimum of
S, we write m = inf S.
Existence of Infima
Any collection S of real numbers that is bounded below has an infimum:
There is a lower bound m for S that is greater than any other lower
bound b for S.
For example, for S = [0, 1], inf S = 0 and min S = 0, and, for S = (0, 1),
inf S = 0, but min S does not exist. For both these sets S, it is clear that 0
is the infimum. The power of the axiom comes from its validity for any set
S of scalars that is bounded below, no matter how complicated.
By definition, the infimum of S is the lower bound for S that is greater
than any other lower bound for S. From this, if min S exists, then inf S =
min S.
e1 ≥ e2 ≥ · · · ≥ 0.
inf_{n≥1} eₙ = 0.
Error Sequence
An error sequence e1 ≥ e2 ≥ · · · ≥ 0 converges to zero iff for any ϵ > 0,
there is an N > 0 with
0 ≤ en < ϵ, n ≥ N.
or we write xn → x∗ .
Note this definition of convergence is consistent with the previous definition, since an error sequence e1 , e2 , . . . converges to zero (in the first sense) iff

lim_{n→∞} eₙ = 0.
Here it is important that the indices n1 < n2 < n3 < . . . be strictly increasing.
If a sequence x1 , x2 , . . . has a subsequence x′1 , x′2 , . . . converging to x∗ ,
then we say the sequence x1 , x2 , . . . subconverges to x∗ . For example, the
sequence 1, −1, 1, −1, 1, −1, . . . subconverges to 1 and also subconverges
to −1, as can be seen by considering the odd-indexed terms and the even-
indexed terms separately.
Note a subsequence of an error sequence converging to zero is also an
error sequence converging to zero. As a consequence, if a sequence converges
to x∗ , then every subsequence of the sequence converges to x∗ . From this
it follows that the sequence 1, −1, 1, −1, 1, −1, . . . does not converge to
anything: it bounces back and forth between ±1.
I0 ⊃ I1 ⊃ I2 ⊃ . . . ,
x∗ = inf_{n≥1} xₙ∗

must exist.
A minimizer is a vector x∗ satisfying f (x∗ ) = m. As we saw above, a
minimizer may or may not exist, and, when the minimizer does exist, there
may be several minimizers.
A minimizing sequence for f (x) over S is a sequence x1 , x2 , . . . of vectors in S such that the corresponding values f (x1 ), f (x2 ), . . . are decreasing and converge to m = inf_S f (x) as n → ∞. In other words, x1 , x2 , . . . is a minimizing sequence for f (x) over S if

f (x1 ) ≥ f (x2 ) ≥ f (x3 ) ≥ · · ·

and

inf_S f (x) = inf_{n≥1} f (xn ).
Either f (x1 ) = m, or

0 < f (x1 ) − m < (f (x0 ) − m)/2.
Existence of Minimizers
If f (x) is continuous on Rd and S is a bounded set in Rd , then there
is a minimizer x∗ ,
f (x∗ ) = inf_{x in S} f (x).    (A.2.2)
[1] Joshua Akey, Genome 560: Introduction to Statistical Genomics, 2008. https://www.gs.washington.edu/academics/courses/akey/56008/lecture/lecture1.pdf.
[2] Sébastien Bubeck, Convex Optimization: Algorithms and Complexity, Foundations
and Trends in Machine Learning, vol. 8, Now Publishers, 2015.
[3] Harald Cramér, Mathematical Methods of Statistics, Princeton University Press, 1946.
[4] Marc Peter Deisenroth, A. Aldo Faisal, and Cheng Soon Ong, Mathematics for Machine Learning, Cambridge University Press, 2020.
[5] J. L. Doob, Probability and Statistics, Transactions of the American Mathematical
Society 36 (1934), 759-775.
[6] R. A. Fisher, The conditions under which χ2 measures the discrepancy between ob-
servation and hypothesis, Journal of the Royal Statistical Society 87 (1924), 442-450.
[7] Ian Goodfellow, Yoshua Bengio, and Aaron Courville, Deep Learning, MIT Press, 2016.
[8] Google, Machine Learning. https://developers.google.com/machine-learning.
[9] Robert M. Gray, Toeplitz and Circulant Matrices: A Review, Foundations and Trends
in Communications and Information Theory 2 (2006), no. 3, 155-239.
[10] T. L. Heath, The Works of Archimedes, Cambridge University, 1897.
[11] Lily Jiang, A Visual Explanation of Gradient Descent Methods, 2020. https://towardsdatascience.com/a-visual-explanation-of-gradient-descent-methods-momentum-adagrad-rmsprop-adam-f898b102325c.
[12] J. W. Longley, An Appraisal of Least Squares Programs for the Electronic Computer from the Point of View of the User, Journal of the American Statistical Association 62 (1967), no. 319, 819-841.
[13] David G. Luenberger and Yinyu Ye, Linear and Nonlinear Programming, Springer,
2008.
[14] Ioannis Mitliagkas, Theoretical principles for deep learning, lecture notes, 2019. https://mitliagkas.github.io/ift6085-dl-theory-class-2019/.
Python

*, 8, 16
all, 188
append, 187
def.angle, 23, 73
def.assign_clusters, 187
def.backward_prop, 359, 367, 409
def.ball, 61
def.chi2_independence, 321
def.confidence_interval, 293, 305, 310, 313
def.delta_out, 408
def.derivative, 366
def.dimension_staircase, 128
def.downstream, 409
def.ellipse, 50, 57
def.find_first_defect, 126
def.forward_prop, 358, 365, 404
def.gd, 420
def.goodness_of_fit, 318
def.H, 345
def.hexcolor, 10
def.incoming, 364, 403
def.J, 405
def.local, 406
def.nearest_index, 187
def.newton, 413
def.num_plates, 471
def.outgoing, 364, 403
def.pca, 180
def.pca_with_svd, 180
def.plot_cluster, 188
def.plot_descent, 414
def.project, 119
def.project_to_ortho, 120
def.pvalue, 270
def.random_batch_mean, 247
def.random_vector, 188
def.tensor, 32
def.train_nn, 422
def.ttest, 306
def.type2_error, 299, 307
def.update_means, 187
def.update_weights, 422
def.zero_variance, 106
def.ztest, 297
diag, 173
dict, 459
display, 148
enumerate, 181
floor, 165
import, 8
pandas.DataFrame.to_numpy, 70
pandas.DataFrame.to_sql, 463
pandas.read_csv, 437, 461
pandas.read_sql, 463
random.choice, 10
random.random, 15
scipy.linalg.null_space, 97, 98
scipy.linalg.orth, 91
scipy.linalg.pinv, 85
scipy.spatial.ConvexHull, 371
    simplices, 372
scipy.special.comb, 216
scipy.special.expit, 236, 441
scipy.special.softmax, 385
scipy.stats.chi2, 274
scipy.stats.norm, 261
scipy.stats.t, 302, 305
sklearn.datasets.load_iris, 2
sklearn.decomposition.PCA, 182
sklearn.preprocessing.StandardScaler, 81
sort, 178
sqlalchemy.create_engine, 463
sqlalchemy.text, 463
sympy.*, 71
sympy.diag, 70
sympy.diagonalize, 147
sympy.eigenvects, 147
sympy.init_printing, 147
sympy.Matrix, 63
sympy.Matrix.col, 68
sympy.Matrix.cols, 68
sympy.Matrix.columnspace, 90
sympy.Matrix.eye, 69
sympy.Matrix.hstack, 66, 85
sympy.Matrix.inv, 82
sympy.Matrix.nullspace, 96
sympy.Matrix.ones, 69
sympy.Matrix.rank, 132
sympy.Matrix.row, 68
sympy.Matrix.rows, 68
sympy.Matrix.rowspace, 94
sympy.Matrix.zeros, 69
sympy.RootOf, 42
sympy.shape, 63
sympy.solve, 255
sympy.symbols, 42
tuple, 18
zip, 185
Index