Math For Data Science
Omar Hijab*
Copyright ©2022 — 2024 Omar Hijab. All Rights Reserved.
Preface
For Python coding, the standard data science libraries are used. A Python
index lists the Python functions used in the text.
Sections and figures are numbered sequentially within each chapter, and
equations are numbered sequentially within each section, so §3.3 is the third
section in the third chapter, Figure 7.11 is the eleventh figure in the seventh
chapter, and (3.2.1) is the first equation in the second section of the third
chapter.
⋆ under construction ⋆
Contents

Preface
1 Data Sets
1.1 Introduction
1.2 The MNIST Dataset
1.3 Averages and Vector Spaces
1.4 Two Dimensions
1.5 Complex Numbers
1.6 Mean and Covariance
1.7 High Dimensions
2 Linear Geometry
2.1 Vectors and Matrices
2.2 Products
2.3 Matrix Inverse
2.4 Span and Linear Independence
2.5 Zero Variance Directions
2.6 Pseudo-Inverse
2.7 Projections
2.8 Basis
2.9 Rank
4 Counting
4.1 Permutations and Combinations
4.2 Graphs
4.3 Binomial Theorem
4.4 Exponential Function
5 Probability
5.1 Binomial Probability
5.2 Probability
5.3 Random Variables
5.4 Normal Distribution
5.5 Chi-squared Distribution
6 Statistics
6.1 Estimation
6.2 Z-test
6.3 T-test
6.4 Two Means
6.5 Variances
6.6 Maximum Likelihood Estimates
6.7 Chi-Squared Tests
7 Calculus
7.1 Calculus
7.2 Entropy and Information
7.3 Multi-variable Calculus
7.4 Back Propagation
7.5 Convex Functions
7.6 Multinomial Probability
A Appendices
A.1 SQL
A.2 Minimizing Sequences
References
Python
Index
Chapter 1
Data Sets
In this chapter we explore examples of data sets and some simple Python
code. We also review the geometry of vectors in the plane and properties of
2 × 2 matrices, introduce the mean and covariance of a dataset, then present
a first taste of what higher dimensions might look like.
1.1 Introduction
Geometrically, a dataset is a sample of N points x1 , x2 , . . . , xN in d-
dimensional space Rd . Algebraically, a dataset is an N × d matrix.
Practically speaking, as we shall see, the following are all representations
of datasets
matrix = CSV file = spreadsheet = SQL table = array = dataframe
Each point x = (t1 , t2 , . . . , td ) in the dataset is a sample or an example,
and the components t1 , t2 , . . . , td of a sample point x are its features or
attributes. As such, d-dimensional space Rd is feature space.
Sometimes one of the features is separated out as the label. In this case,
the dataset is a labelled dataset.
As examples, we look at two datasets, the Iris dataset and the MNIST
dataset. The Iris dataset contains 150 examples of four features of Iris flowers,
and there are three classes of Irises, Setosa, Versicolor, and Virginica, with
50 samples from each class.
The four features are sepal length and width, and petal length and width
(Figure 1.1). For each example, the class is the label corresponding to that
example, so the Iris dataset is labelled.
iris = datasets.load_iris(as_frame=True)
dataset = iris["frame"]
dataset
The MNIST dataset consists of 60,000 images of hand-written digits (Fig-
ure 1.2). There are 10 classes of images, corresponding to each digit 0, 1,
. . . , 9. We seek to compress the images while preserving as much as possible
of the images’ characteristics.
Each image is a grayscale 28 × 28 pixel image. Since 28² = 784, each image
is a point in d = 784 dimensions. Here there are N = 60000 samples and
d = 784 features.
This subsection is included just to give a flavor. All unfamiliar words are
explained in detail in Chapter 2. If preferred, just skip to the next subsection.
Suppose we have a dataset of N points
x1 , x2 , . . . , xN
If this is your first exposure to data science, there will be a learning curve,
because here there are three kinds of thinking: Data science (Datasets, PCA,
descent, networks), math (linear algebra, probability, statistics, calculus),
and Python (numpy, pandas, scipy, sympy, matplotlib). It may help to
read the code examples and the important math principles first,
then dive into details as needed.
To illustrate and make concrete concepts as they are introduced, we use
Python code throughout. We run Python code in a Jupyter notebook.
Jupyter is an IDE, an integrated development environment. Jupyter sup-
ports many frameworks, including Python, Sage, Julia, and R. A useful
Jupyter feature is the ability to measure the amount of execution time of
a code cell by including at the start of the cell
%%time
(This code requires keras, tensorflow and related modules if not already
installed.)
The National Institute of Standards and Technology (NIST) is a physical sciences laboratory and non-regulatory agency of the United States Department of Commerce.
Since this dataset is for demonstration purposes, these images are coarse.
Since each image consists of 784 pixels, and each pixel shading is a number,
each image is a point x in Rd = R784 .
Figure 1.4: Original and projections: n = 784, 600, 350, 150, 50, 10, 1.
For the second image in Figure 1.2, reducing dimension from d = 784 to
n equal to 600, 350, 150, 50, 10, and 1, we have the images in Figure 1.4.
Compressing each image to a point in n = 3 dimensions and plotting all
N = 60000 points yields Figure 1.5. All this is discussed in §3.4.
The top left image in Figure 1.4 is given by a 784-dimensional point which
is imported as an array pixels of shape (28,28).
pixels = train_X[1]
grid()
scatter(2,3)
show()
3. Do for loops over i and j in range(28) and use scatter to plot points
at location (i,j) with size given by pixels[i,j], then show.
pixels = train_X[1]
grid()
for i in range(28):
    for j in range(28): scatter(i,j, s = pixels[i,j])
show()
imshow(pixels, cmap="gray_r")
We end the section by discussing the Python import command. The last
code snippet can be rewritten
plt.imshow(pixels, cmap="gray_r")
or as
imshow(pixels, cmap="gray_r")
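As a sketch of the two conventions (the exact imports used in the text are assumed here), the qualified style attaches the module name plt to each call, while the wildcard style makes the function names available directly:

# qualified import: functions are called with the plt prefix
import matplotlib.pyplot as plt
plt.imshow(pixels, cmap="gray_r")
plt.show()

# wildcard import: functions are called without a prefix
from matplotlib.pyplot import *
imshow(pixels, cmap="gray_r")
show()

Here pixels is the 28 × 28 array from the previous snippet.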
1.3 Averages and Vector Spaces

L = [x_1,x_2,...,x_N].
The totality of possible samples is called the population or the sample space. For example, the
sample space consists of all real numbers and we take N = 5 samples from
this population
Or, the sample space consists of all integers and we take N = 5 samples from
this population
Or, the sample space consists of all rational numbers and we take N = 5
samples from this population
Or, the sample space consists of all Python strings and we take N = 5
samples from this population
L_4 = ['a2e?','#%T','7y5,','kkk>><</','[[)*+']
Or, the sample space consists of all HTML colors and we take N = 5 samples
from this population
def hexcolor():
    # choice here is assumed to be random.choice from Python's standard library
    return "#" + ''.join([choice('0123456789abcdef') for _ in range(6)])
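The list of five sampled colors is not shown in this excerpt. A minimal sketch, with the hypothetical name L_5 for the sample:

from random import choice

# hypothetical sample of N = 5 HTML colors
L_5 = [hexcolor() for _ in range(5)]
print(L_5)    # e.g. five strings like '#3fa2c7'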
we call the members of the population “vectors”, even though the members
may be anything, as long as they satisfy the basic rules of a vector space.
In a vector space V , the rules are:
8. 1v = v and 0v = 0
9. r(sv) = (rs)v.
A vector is an arrow joining two points (Figure 1.8). Given two points
m = (a, b) and x = (c, d), the vector joining them is
v = x − m = (c − a, d − b).
Then m is the tail of v, and x is the head of v. For example, the vector
joining m = (1, 2) to x = (3, 4) is v = (2, 2).
Given a point x, we would like to associate to it a vector v in a uniform
manner. However, this cannot be done without a second point, a reference
point. Given a dataset of points x1 , x2 , . . . , xN , the most convenient choice
for the reference point is the mean m of the dataset. This results in a dataset
of vectors v1 , v2 , . . . , vN , where vk = xk − m, k = 1, 2, . . . , N .
The dataset v1, v2, . . . , vN is centered: its mean is zero,

(v1 + v2 + · · · + vN)/N = 0.
So datasets can be points x1 , x2 , . . . , xN with mean m, or vectors v1 , v2 , . . . ,
vN with mean zero (Figure 1.9). This distinction is crucial, when measuring
the dimension of a dataset (§2.8).
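A quick numerical check of centering (a sketch, with the numpy imports written out explicitly):

from numpy import array, mean, allclose
from numpy.random import random

N = 20
dataset = array([ [random(), random()] for _ in range(N) ])
m = mean(dataset, axis=0)
vectors = dataset - m                          # centered dataset
print(allclose(mean(vectors, axis=0), 0))      # True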
Figure 1.9: A dataset as points x1, x2, . . . , xN around the mean m, and as centered vectors v1 = x1 − m, v2 = x2 − m, . . . , vN = xN − m.
Figure: A statistic f assigns to each item in the sample space a value in V.
Usually, we can’t take sample means from a population, we instead take
the sample mean of a statistic associated to the population. A statistic is
an assignment of a number f (item) to each item in the population. For
example, the human population on Earth is not a vector space (they can’t
be added), but their heights is a vector space (heights can be added). For the
list L4 , a statistic might be the length of the string. For the HTML colors, a
statistic is the HTML code of the color.
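For instance, here is a sketch of the sample mean of the string-length statistic on L4 (plain Python, no libraries needed):

L_4 = ['a2e?','#%T','7y5,','kkk>><</','[[)*+']
lengths = [len(s) for s in L_4]            # the statistic f = string length
sample_mean = sum(lengths) / len(lengths)
print(lengths, sample_mean)                # [4, 3, 4, 8, 5] 4.8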
dataset = array([1.23,4.29,-3.3,555])
mean(dataset)            # mean of a 1-d array
mean(dataset, axis=0)    # same as above: axis 0 is the only axis
mean(dataset, axis=1)    # raises an error: a 1-d array has no axis 1
N = 20
dataset = array([ [random(), random()] for _ in range(N) ])
mean = mean(dataset,axis=0)
grid()
X = dataset[:,0]
Y = dataset[:,1]
scatter(X,Y)
scatter(*mean)
show()
In this code, scatter expects two positional arguments, the x and the y components of a point, or two lists of x and y components separately. The unpacking operator * unpacks mean from a single pair into its separate x and y components, as *mean. Also, for scatter, dataset is separated into its two columns.
In the cartesian plane, each vector v has a shadow. This is the triangle
constructed by dropping the perpendicular from the tip of v to the x-axis, as
in Figure 1.13. This cannot be done unless one first draws a horizontal line
(the x-axis), then a vertical line (the y-axis). In this manner, each vector v
has cartesian coordinates v = (x, y). In particular, the vector 0 = (0, 0), the
zero vector, corresponds to the origin.
In the cartesian plane, vectors v1 = (x1 , y1 ) and v2 = (x2 , y2 ) are added
by adding their coordinates,
Addition of vectors
If v1 = (x1 , y1 ) and v2 = (x2 , y2 ), then
v1 + v2 = (x1 + x2 , y1 + y2 ). (1.4.1)
Because points and vectors are interchangeable, the same formula is used
for addition P + P ′ of points P and P ′ .
v1 = (1,2)
v2 = (3,4)
v1 + v2 == (1+3,2+4) # returns False
v1 = [1,2]
v2 = [3,4]
v1 + v2 == [1+3,2+4] # returns False
v1 = array([1,2])
v2 = array([3,4])
v1 + v2 == array([1+3,2+4]) # returns True
v = array([1,2])
3*v == array([3,6]) # returns True
Scaling of vectors

If v = (x, y) and t is a scalar, then tv = (tx, ty).

Thus multiplying v by s, and then multiplying the result by t, has the same effect as multiplying v by ts, in a single step. Because points and vectors are interchangeable, the same formula is used for scaling points P by t, written tP.
v1 − v2 = v1 + (−v2 ).
This gives
Subtraction of vectors
If v1 = (x1 , y1 ) and v2 = (x2 , y2 ), then
v1 − v2 = (x1 − x2 , y1 − y2 ) (1.4.2)
v1 = array([1,2])
v2 = array([3,4])
v1 - v2 == array([1-3,2-4]) # returns True
Distance Formula
If v1 = (x1, y1) and v2 = (x2, y2), then the distance between v1 and v2 is

|v1 − v2| = √((x1 − x2)² + (y1 − y2)²).
Figure 1.16: Polar coordinates: the point (x, y) at distance r from the origin 0 and at angle θ from the x-axis.
In Python,
v = array([1,2])
norm(v) == sqrt(5)   # returns True
The unit circle consists of the vectors which are distance 1 from the origin
0. When v is on the unit circle, the magnitude of v is 1, and we say v is a
unit vector. In this case, the line formed by the scalings of v intersects the
unit circle at ±v (Figure 1.17).
When v is a unit vector, r = 1, and (Figure 1.16),
v = (x, y) = (cos θ, sin θ). (1.4.3)
The unit circle intersects the horizontal axis at the vectors (1, 0), and
(−1, 0), and intersects the vertical axis at the vectors (0, 1), and (0, −1).
These four vectors are equally spaced on the unit circle (Figure 1.17).
Figure 1.17: A unit vector v and its negative −v, where the line of scalings of v intersects the unit circle.
More generally, any circle with center (a, b) and radius r consists of vectors
v = (x, y) satisfying
(x − a)² + (y − b)² = r².
Let R be a point on the unit circle, and let t > 0. From this, we see the scaled point tR is on the circle with center (0, 0) and radius t. Moreover, if Q is any point, Q + tR is on the circle with center Q and radius t.
Given this, it is easy to check that v2 − v1 is the vector joining the tip of v1 to the tip of v2 (Figure 1.18).

Figure 1.18: The vectors v1, v2, and v2 − v1.
Now we discuss the dot product in two dimensions. We have two vectors v1 = (x1, y1) and v2 = (x2, y2) in the plane R2. The dot product of v1 and v2 is given algebraically as

v1 · v2 = x1 x2 + y1 y2 ,
or geometrically as
v1 · v2 = |v1 | |v2 | cos θ,
where θ is the angle between v1 and v2 . To show that these are the same, below we derive the dot product identity (1.4.4).
v1 = array([1,2])
v2 = array([3,4])
dot(v1,v2) == 1*3 + 2*4 # returns True
As a consequence of the dot product identity, we have code for the angle
between two vectors,
def angle(u,v):
    a = dot(u,v)
    b = dot(u,u)
    c = dot(v,v)
    theta = arccos(a / sqrt(b*c))
    return degrees(theta)
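For example, a sketch of calling angle (assuming the numpy names used inside it — dot, arccos, sqrt, degrees — are imported in the notebook):

from numpy import array, dot, arccos, sqrt, degrees

u = array([1,0])
v = array([1,1])
print(angle(u,v))   # approximately 45.0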
Cauchy-Schwarz Inequality

If u and v are any two vectors, then |u · v| ≤ |u| |v|.
a² = d² + f²  and  c² = e² + f².

Also b = e + d, so

b² = (e + d)² = e² + 2ed + d².

Then

c² = e² + f² = (b − d)² + f²
   = f² + d² + b² − 2db
   = a² + b² − 2ab cos θ,

since d² + f² = a² and d = a cos θ, so we get (1.4.6).
Figure 1.19: A triangle with sides a, b, c; the altitude f splits the base b into segments d and e.
Next, connect Figures 1.18 and 1.19 by noting a = |v2 | and b = |v1 | and
c = |v2 − v1 |.
Now go back to deriving (1.4.4). By vector addition, we have
v2 − v1 = (x2 − x1 , y2 − y1 ),
thus
c² = a² + b² − 2(x1 x2 + y1 y2 ). (1.4.7)
Comparing the terms in (1.4.6) and (1.4.7), we arrive at (1.4.4).
v · v ⊥ = (x, y) · (−y, x) = 0.
Figure 1.21: The perpendicular vectors ±v⊥ and the points ±P⊥ on the unit circle.
From Figure 1.21, we see points P and P ′ on the unit circle satisfy P ·P ′ =
0 iff P ′ = ±P ⊥ .
ax + by = 0, cx + dy = 0. (1.4.9)
In (1.4.9), multiply the first equation by d and the second by b and subtract, obtaining

(ad − bc)x = d(ax + by) − b(cx + dy) = 0.

In (1.4.9), multiply the first equation by c and the second by a and subtract, obtaining

(bc − ad)y = c(ax + by) − a(cx + dy) = 0.
From here, we see there are two cases: det(A) = 0 and det(A) ≠ 0. When det(A) ≠ 0, the only solution of (1.4.9) is (x, y) = (0, 0). When det(A) = 0, (x, y) = (−b, a) is a solution of both equations in (1.4.9). We have shown
Homogeneous System
Inhomogeneous System
AA′ = \begin{pmatrix} u · u′ & u · v′ \\ u′ · v & u′ · v′ \end{pmatrix}.
U(θ)U(θ′) = \begin{pmatrix} cos θ & −sin θ \\ sin θ & cos θ \end{pmatrix} \begin{pmatrix} cos θ′ & −sin θ′ \\ sin θ′ & cos θ′ \end{pmatrix} = \begin{pmatrix} cos(θ + θ′) & −sin(θ + θ′) \\ sin(θ + θ′) & cos(θ + θ′) \end{pmatrix} = U(θ + θ′).
AA−1 = A−1 A = I.
(AB)−1 = B −1 A−1 .
(AB)t = B t At .
Av = w,
where
A = \begin{pmatrix} a & b \\ c & d \end{pmatrix}, v = \begin{pmatrix} x \\ y \end{pmatrix}, w = \begin{pmatrix} e \\ f \end{pmatrix}.
v = A−1 w,
where A−1 is the inverse matrix. We study inverse matrices in depth in §2.3.
The matrix (1.4.10) is symmetric if b = c. A symmetric matrix looks like

Q = \begin{pmatrix} a & b \\ b & c \end{pmatrix},

and satisfies Qt = Q.
Orthogonal Matrices
Here we wrote u ⊗ v as a single block, and also in terms of rows and columns.
If we do this the other way, we get
v ⊗ u = \begin{pmatrix} ca & cb \\ da & db \end{pmatrix},
so
(u ⊗ v)t = v ⊗ u.
When u = v, u ⊗ v = v ⊗ v is a symmetric matrix.
Here is code for tensor.
def tensor(u,v):
    return array([ [ a*b for b in v] for a in u ])
det(u ⊗ v) = 0.
This is true no matter what the vectors u and v are. Check this yourself.
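One way to check this numerically (a sketch, using the tensor function above and numpy's det):

from numpy import array
from numpy.linalg import det

u = array([1,2])
v = array([3,4])
print(det(tensor(u,v)))   # 0.0, up to roundoff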
Notice by definition of u ⊗ v,
Quadratic Form

If

Q = \begin{pmatrix} a & b \\ b & c \end{pmatrix}  and  v = (x, y),

then

v · Qv = ax² + 2bxy + cy².
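A quick numerical check of the quadratic form identity (a sketch; the numbers are arbitrary):

from numpy import array, dot

a, b, c = 2.0, 1.0, 3.0
Q = array([[a,b],[b,c]])
x, y = 1.5, -0.5
v = array([x,y])
print(dot(v, dot(Q,v)), a*x**2 + 2*b*x*y + c*y**2)   # both 3.75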
This ability of points in the plane to follow the usual rules of arithmetic is
unique to one and two dimensions, and not present in any other dimension.
When thought of in this manner, points in the plane are called complex
numbers, and the plane is the complex plane.
P′′ = P P′ = (xx′ − yy′, x′y + xy′),
P′′ = P/P′ = (xx′ + yy′, x′y − xy′).    (1.5.1)
P P ′ = (xx′ − yy ′ , x′ y + xy ′ ), (1.5.3)
Because of this, we can write z = x instead of z = (x, 0) for points on the horizontal axis, and we call the horizontal axis the real axis.
P P̄′ is the hermitian product of P and P′.
Similarly, let i = (0, 1). Then the point i is on the vertical axis, and, using (1.5.1), one can check that ix = (0, 1)(x, 0) = (0, x).

Thus the vertical axis consists of all points of the form ix. These are called
imaginary numbers, and the vertical axis is the imaginary axis.
Using i, any point P = (x, y) may be written
P = x + iy,
since x + iy = (x, 0) + (y, 0)(0, 1) = (x, 0) + (0, y) = (x, y). This leads to
Figure 1.23. In this way, real numbers x are considered complex numbers
with zero imaginary part, x = x + 0i.
Figure 1.23: The complex number 3 + 2i plotted in the complex plane.
Square Root of −1

i² = (0, 1)(0, 1) = (−1, 0) = −1.

In terms of i, division is given by

z/z′ = (x + iy)/(x′ + iy′) = ((xx′ + yy′) + i(x′y − xy′))/(x′² + y′²).
In particular, one can always “move” the i from the denominator by the
formula
1/z = 1/(x + iy) = (x − iy)/(x² + y²) = z̄/|z|².
Here x² + y² = r² = |z|² is the absolute value squared of z, and z̄ is the conjugate of z.
If (r, θ) and (r′ , θ′ ) are the polar coordinates of complex numbers P and
P ′ , and (r′′ , θ′′ ) are the polar coordinates of the product P ′′ = P P ′ ,
then
r′′ = rr′ and θ′′ = θ + θ′ .
From this and (1.5.1), using (x, y) = (cos θ, sin θ), (x′ , y ′ ) = (cos θ′ , sin θ′ ),
we have the addition formulas
sin(θ + θ′ ) = sin θ cos θ′ + cos θ sin θ′ ,
(1.5.5)
cos(θ + θ′ ) = cos θ cos θ′ − sin θ sin θ′ .
We will need the roots of unity in §3.2. This generalizes square roots,
cube roots, etc.
A point ω is a root of unity if ω d = 1 for some power d. If d is the power,
we say ω is a d-th root of unity.
For example, the square roots of unity are ±1, since (±1)² = 1. Here we have
1 = cos 0 + i sin 0, −1 = cos π + i sin π.
The fourth roots of unity are ±1, ±i, since (±1)⁴ = 1, (±i)⁴ = 1. Here we have
1 = cos 0 + i sin 0,
i = cos(π/2) + i sin(π/2),
−1 = cos π + i sin π,
−i = cos(3π/2) + i sin(3π/2).
Figure: The d-th roots of unity for d = 2, 3, 4 (panels ω² = 1, ω³ = 1, ω⁴ = 1).
If ω^d = 1, then

(ω^k)^d = (ω^d)^k = 1^k = 1.
With ω given by (1.5.6), this implies

1, ω, ω², . . . , ω^(d−1)

are all d-th roots of unity. For example, for the cube roots of unity,

1³ = 1, ω³ = 1, (ω²)³ = 1.
Figure: The d-th roots of unity for d = 5, 6, 15 (panels ω⁵ = 1, ω⁶ = 1, ω¹⁵ = 1).
Summarizing,
Roots of Unity
If
ω = cos(2π/d) + i sin(2π/d),
the d-th roots of unity are
1, ω, ω 2 , . . . , ω d−1 .
ω k = cos(2πk/d) + i sin(2πk/d), k = 0, 1, 2, . . . , d − 1.
Here is sympy code for the roots of unity. We use display instead of
print to pretty-print the output.
x = symbols('x')
d = 5
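The rest of the snippet is not shown in this excerpt. One plausible completion (a sketch, using sympy's solve; the loop over the roots is an assumption):

from sympy import symbols, solve, init_printing
from IPython.display import display

init_printing()
x = symbols('x')
d = 5
# the d-th roots of unity are the roots of x**d - 1
for root in solve(x**d - 1, x):
    display(root)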
import numpy as np
np.roots([a,b,c])
Since the cube roots of unity are the roots of the polynomial p(x) = x³ − 1,
the code
import numpy as np
np.roots([1,0,0,-1])
1.6 Mean and Covariance

Above |x| stands for the length of the vector x, or the distance of the point x to the origin. When d = 2 and we are in two dimensions, this was defined in §1.4. For general d, this is defined in §2.1. In this section we continue to focus on two dimensions d = 2.
The mean or sample mean is

m = \frac{1}{N} \sum_{k=1}^N x_k = \frac{x_1 + x_2 + \cdots + x_N}{N}.
The mean m is a point in feature space. The first result is
Point of Best-fit
The mean is the point of best-fit: The mean minimizes the mean-square
distance to the dataset (Figure 1.26).
Figure 1.26: MSD for the mean (green) versus MSD for a random point (red).
Using (1.4.8),

|a + b|² = |a|² + 2a · b + |b|²

for vectors a and b, it is easy to derive the above result. Insert a = xk − m and b = m − x to get

MSD(x) = MSD(m) + \frac{2}{N} \sum_{k=1}^N (x_k − m) · (m − x) + |m − x|².
Since the centered vectors xk − m sum to zero, the middle term vanishes, so we have MSD(x) = MSD(m) + |m − x|² ≥ MSD(m).
N = 20
dataset = array([ [random(),random()] for _ in range(N) ])
m = mean(dataset,axis=0)
p = array([random(),random()])
grid()
X = dataset[:,0]
Y = dataset[:,1]
scatter(X,Y)
for v in dataset:
    plot([m[0],v[0]],[m[1],v[1]],c='green')
    plot([p[0],v[0]],[p[1],v[1]],c='red')
show()
def tensor(u,v):
    return array([ [ a*b for b in v] for a in u ])
N = 20
dataset = array([ [random(),random()] for _ in range(N) ])
m = mean(dataset,axis=0)
# center dataset
vectors = dataset - m
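The covariance matrix can then be assembled from the centered vectors with the tensor function above (a sketch; it averages v ⊗ v over the dataset, which is the biased covariance of (1.6.1)):

# biased covariance: average of v (tensor) v over the centered vectors
Q = sum([ tensor(v,v) for v in vectors ]) / N
print(Q)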
Since

(±4, ±4) ⊗ (±4, ±4) = \begin{pmatrix} 16 & 16 \\ 16 & 16 \end{pmatrix}, (±2, ±2) ⊗ (±2, ±2) = \begin{pmatrix} 4 & 4 \\ 4 & 4 \end{pmatrix}, (0, 0) ⊗ (0, 0) = \begin{pmatrix} 0 & 0 \\ 0 & 0 \end{pmatrix},
Notice

Q = 8 (1, 1) ⊗ (1, 1),

which, as we see below (§2.5), reflects the fact that the points of this dataset lie on a line. Here the line is y = x + 1.
The covariance matrix as written in (1.6.1) is the biased covariance matrix.
If the denominator is instead N − 1, the matrix is the unbiased covariance
matrix.
For datasets with large N , it doesn’t matter, since N and N − 1 are
almost equal. For simplicity, here we divide by N , and we only consider the
biased covariance matrix.
In practice, datasets are standardized before computing their covariance.
The covariance of standardized datasets — the correlation matrix — is the
same whether one starts with bias or not (§2.2).
In numpy, the Python covariance constructor is
N = 20
dataset = array([ [random(),random()] for _ in range(N) ])
Q = cov(dataset,bias=True,rowvar=False)
This returns the same result as the previous code for Q. Notice here there is no need to compute the mean; this is taken care of automatically. The same result is also returned by

Q = cov(dataset.T,bias=True)
We call (1.6.3) the total variance of the dataset. Thus the total variance
equals MSD(m).
In Python, the total variance is
Q = cov(dataset.T,bias=True)
Q.trace()
proj_u v = (v · u) u.

Figure: The projection proj_u v of the vector v onto the line through the unit vector u.
Because the reduced dataset and projected dataset are essentially the
same, we also refer to q as the variance of the projected dataset. Thus we
conclude (see §1.4 for v · Qv)
This shows that the dataset lies on the line passing through m and perpen-
dicular to (1, −1).
u · Qu = ax² + 2bxy + cy² = 1.
The covariance ellipse and inverse covariance ellipses described above are
centered at the origin (0, 0). When a dataset has mean m and covariance Q,
the ellipses are drawn centered at m, as in Figures 1.30, 1.31, and 1.32.
Here is the code for Figure 1.28. The ellipses drawn here are centered at
the origin.
L, delta = 4, .1
x = arange(-L,L,delta)
y = arange(-L,L,delta)
X,Y = meshgrid(x, y)
a, b, c = 9, 0, 4
det = a*c - b**2
A, B, C = c/det, -b/det, a/det
def ellipse(a,b,c,levels,color):
    contour(X,Y,a*X**2 + 2*b*X*Y + c*Y**2,levels,colors=color)
grid()
ellipse(a,b,c,[1],'blue')
ellipse(A,B,C,[1],'red')
show()
x1 , x2 , . . . , xN , and y1 , y2 , . . . , yN .
Suppose the mean of this dataset is m = (mx , my ). Then, by the formula for
tensor product, the covariance matrix is
Q = \begin{pmatrix} a & b \\ b & c \end{pmatrix},

where

a = \frac{1}{N} \sum_{k=1}^N (x_k − m_x)^2, b = \frac{1}{N} \sum_{k=1}^N (x_k − m_x)(y_k − m_y), c = \frac{1}{N} \sum_{k=1}^N (y_k − m_y)^2.
From this, we see a is the variance of the x-features, and c is the variance
of y-features. We also see b is a measure of the correlation between the x
and y features.
Standardizing the dataset means to center the dataset and to place the x
and y features on the same scale. For example, the x-features may be close
and

y_1, y_2, . . . , y_N → y′_1 = \frac{y_1 − m_y}{\sqrt c}, y′_2 = \frac{y_2 − m_y}{\sqrt c}, . . . , y′_N = \frac{y_N − m_y}{\sqrt c}.
This results in a new dataset v_1 = (x′_1, y′_1), v_2 = (x′_2, y′_2), . . . , v_N = (x′_N, y′_N) that is centered,

(v_1 + v_2 + · · · + v_N)/N = 0,

with each feature standardized to have unit variance,

\frac{1}{N} \sum_{k=1}^N (x′_k)^2 = 1, \frac{1}{N} \sum_{k=1}^N (y′_k)^2 = 1.
For example,

Q = \begin{pmatrix} 9 & 2 \\ 2 & 4 \end{pmatrix}  ⟹  ρ = \frac{b}{\sqrt{ac}} = \frac{1}{3}  ⟹  Q′ = \begin{pmatrix} 1 & 1/3 \\ 1/3 & 1 \end{pmatrix}.
corrcoef(dataset.T)
Here again, we input the transpose of the dataset if our default is vectors
as rows. Notice the 1/N cancels in the definition of ρ. Because of this,
corrcoef is the same whether we deal with biased or unbiased covariance
matrices.
u · Qu = max_{|v|=1} v · Qv.
Since the sine function varies between +1 and −1, we conclude the projected
variance varies between
1 − ρ ≤ v · Qv ≤ 1 + ρ,
and

θ = π/4, v+ = (1/√2, 1/√2)  ⟹  v+ · Qv+ = 1 + ρ,

θ = 3π/4, v− = (−1/√2, 1/√2)  ⟹  v− · Qv− = 1 − ρ.
Thus the best-aligned vector v+ is at 45°, and the worst-aligned vector v− is at 135° (Figure 1.29).
Actually, the above is correct only if ρ > 0. When ρ < 0, it’s the other
way. The correct answer is
1 − |ρ| ≤ v · Qv ≤ 1 + |ρ|,
Here are two randomly generated datasets. For the dataset in Figure 1.30, the mean and covariance are

m = (0.46563359, 0.59153958), Q = \begin{pmatrix} 0.09652275 & 0.00939796 \\ 0.00939796 & 0.0674424 \end{pmatrix}.

For the dataset in Figure 1.31, the mean and covariance are

m = (0.48785572, 0.51945499), Q = \begin{pmatrix} 0.08266583 & −0.00976249 \\ −0.00976249 & 0.08298294 \end{pmatrix}.
Here is code for Figures 1.30, 1.31, and 1.32. The code incorporates the
formulas for λ± and v± .
N = 50
X = array([ random() for _ in range(N) ])
Y = array([ random() for _ in range(N) ])
scatter(X,Y,s=2)
m = mean([X,Y],axis=1)
Q = cov(X,Y,bias=True)
a, b, c = Q[0,0], Q[0,1], Q[1,1]
delta = .01
x = arange(0,1,delta)
y = arange(0,1,delta)
X,Y = meshgrid(x, y)
def ellipse(a,b,c,d,e,levels,color):
    det = a*c - b**2
    A, B, C = c/det, -b/det, a/det
    # inverse covariance ellipse centered at (d,e)
    Z = A*(X-d)**2 + 2*B*(X-d)*(Y-e) + C*(Y-e)**2
    contour(X,Y,Z,levels,colors=color)
    for pm in [+1,-1]:
        lamda = (a+c)/2 + pm * sqrt(b**2 + (a-c)**2/4)
        sigma = sqrt(lamda)
        len = sqrt(b**2 +(a-lamda)**2)
        axesX = [d+sigma*b/len,d-sigma*b/len]
        axesY = [e-sigma*(a-lamda)/len,e+sigma*(a-lamda)/len]
        plot(axesX,axesY,linewidth=.5)
grid()
levels = [.5,1,1.5,2]
ellipse(a,b,c,*m,levels,'red')
show()
1.7 High Dimensions

(1/4)(4√2 − 4) = √2 − 1.
plot([-2,2],[-2,2],color='black')
axes.add_patch(square)
axes.add_patch(circle1)
axes.add_patch(circle2)
axes.add_patch(circle3)
axes.add_patch(circle4)
axes.add_patch(circle)
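The construction of square, circle1, ..., circle4, and circle is not shown in this excerpt. A minimal sketch of how such patches might be built with matplotlib, assuming the two-dimensional picture discussed above: a square of edge 4, four blue circles of radius 1 at (±1, ±1), and a red circle of radius √2 − 1 at the origin.

from numpy import sqrt
from matplotlib.pyplot import subplots
from matplotlib.patches import Rectangle, Circle

fig, axes = subplots()
axes.set_aspect('equal')

# square of edge 4 centered at the origin
square = Rectangle((-2,-2), 4, 4, fill=False)
# four blue circles of radius 1
circle1 = Circle(( 1, 1), 1, color='blue')
circle2 = Circle((-1, 1), 1, color='blue')
circle3 = Circle(( 1,-1), 1, color='blue')
circle4 = Circle((-1,-1), 1, color='blue')
# red circle of radius sqrt(2) - 1 at the origin
circle = Circle((0,0), sqrt(2)-1, color='red')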
Since the edge-length of the cube is 4, the radius of each blue ball is 1. Since the length of the diagonal of the cube is 4√3, the radius of the red ball is

(1/4)(4√3 − 4) = √3 − 1.
Notice there are 8 blue balls.
In two dimensions, when a region is scaled by a factor t, its area increases
by the factor t2 . In three dimensions, when a region is scaled by a factor t,
its volume increases by the factor t3 . We conclude: In d dimensions, when
a region is scaled by a factor t, its (d-dimensional) volume increases by the
factor td . This is called the scaling principle.
In d dimensions, the edge-length of the cube remains 4, the radius of each blue ball remains 1, and there are 2^d blue balls. Since the length of the diagonal of the cube is 4√d, the same calculation results in the radius of the red ball equal to r = √d − 1.
By the scaling principle, the volume of the red ball equals r^d times the volume of the blue ball. We conclude the following:

• Since r = √d − 1 = 1 exactly when d = 4, we have: In four dimensions, the red ball and the blue balls are the same size.

• Since there are 2^d blue balls, the ratio of the volume of the red ball over the total volume of all the blue balls is r^d / 2^d.

• Since r^d = 2^d exactly when r = 2, and since r = √d − 1 = 2 exactly when d = 9, we have: In nine dimensions, the volume of the red ball equals the sum total of the volumes of all blue balls.

• Since r = √d − 1 > 2 exactly when d > 9, we have: In ten or more dimensions, the red ball sticks out of the cube.

• Since the length of the semi-diagonal is 2√d, for any dimension d, the radius of the red ball r = √d − 1 is less than half the length of the semi-diagonal. As the dimension grows without bound, the proportion of the diagonal covered by the red ball converges to 1/2.
The code for Figure 1.35 is as follows. For 3d plotting, the module mayavi
is better than matplotlib.
from itertools import product

pm1 = [-1,1]
for center in product(pm1,pm1,pm1):
    # blue balls: color (0,0,1)
    ball(*center,1,(0,0,1))
# black wire cube: color (0,0,0)
outline(color=(0,0,0))
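The helper ball is not defined in this excerpt. A minimal sketch of one way it might be written with mayavi, assuming the signature ball(x0, y0, z0, r, color) used in the call above; it draws a parametric sphere with mlab.mesh:

from numpy import pi, sin, cos, mgrid
from mayavi import mlab

def ball(x0, y0, z0, r, color):
    # parametric sphere of radius r centered at (x0, y0, z0)
    phi, theta = mgrid[0:pi:50j, 0:2*pi:50j]
    x = x0 + r*sin(phi)*cos(theta)
    y = y0 + r*sin(phi)*sin(theta)
    z = z0 + r*cos(phi)
    mlab.mesh(x, y, z, color=color)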
Chapter 2
Linear Geometry
v = array([1,2,3])
v.shape
v = Matrix([1,2,3])
v.shape
The first v.shape returns (3,), and the second v.shape returns (3,1). In
either case, v is a 3-dimensional vector.
Vectors are added component by component: With
we have
together are the standard basis. Similarly, in Rd , we have the standard basis
e1 , e2 , . . . , ed .
A = array([[1,6,11],[2,7,12],[3,8,13],[4,9,14],[5,10,15]])
A.shape
A = Matrix([[1,6,11],[2,7,12],[3,8,13],[4,9,14],[5,10,15]])
A.shape
Note the transpose operation interchanges rows and columns: the rows of At
are the columns of A. In both numpy or sympy, the transpose of A is A.T.
A d-dimensional vector v may be written as a 1 × d matrix (a row vector)

v = (t1  t2  · · ·  td),

or as a d × 1 matrix (a column vector).
• A, B: any matrix
• Q: symmetric matrix
• P : projections
# column vectors
u = Matrix([1,2,3,4,5])
v = Matrix([6,7,8,9,10])
w = Matrix([11,12,13,14,15])
# 5x3 matrix
A = Matrix.hstack(u,v,w)
# column vector
b = Matrix([1,1,1,1,1])
# 5x4 matrix
M = Matrix.hstack(A,b)
In general, for any sympy matrix A, column vectors can be hstacked and
row vectors can be vstacked. For any matrix A, the code
returns True. Note we use the unpacking operator * to unpack the list, before
applying hstack.
In numpy, there is hstack and vstack, but we prefer column_stack and
row_stack, so the code
both return True. Here col refers to rows of At , hence refers to the columns
of A.
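The two snippets referred to above are not reproduced in this excerpt. One plausible reconstruction (a sketch; the example matrices are assumptions):

# sympy: hstack the columns of A back into A
from sympy import Matrix
A = Matrix([[1,6,11],[2,7,12],[3,8,13],[4,9,14],[5,10,15]])
print(A == Matrix.hstack(*[A.col(j) for j in range(A.cols)]))   # True

# numpy: the rows of A.T are the columns of A
from numpy import array, column_stack, array_equal
B = array([[1,6,11],[2,7,12],[3,8,13],[4,9,14],[5,10,15]])
print(array_equal(B, column_stack([col for col in B.T])))       # True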
A.shape == (A.rows,A.cols)
A = zeros(2,3)
B = ones(2,2)
C = Matrix([[1,2],[3,4]])
D = B + C
E = 5 * C
F = eye(4)
A, B, C, D, E, F
returns

\begin{pmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix}, \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix}, \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}, \begin{pmatrix} 2 & 3 \\ 4 & 5 \end{pmatrix}, \begin{pmatrix} 5 & 10 \\ 15 & 20 \end{pmatrix}, \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}.
A = diag(1,2,3,4)
B = diag(-1, ones(2, 2), Matrix([5, 7, 5]))
A, B
returns

A = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 2 & 0 & 0 \\ 0 & 0 & 3 & 0 \\ 0 & 0 & 0 & 4 \end{pmatrix}, B = \begin{pmatrix} −1 & 0 & 0 & 0 \\ 0 & 1 & 1 & 0 \\ 0 & 1 & 1 & 0 \\ 0 & 0 & 0 & 5 \\ 0 & 0 & 0 & 7 \\ 0 & 0 & 0 & 5 \end{pmatrix}.
It is straightforward to convert back and forth between numpy and sympy. In the code

A = diag(1,2,3,4)
B = array(A)
C = Matrix(B)

A and C are sympy matrices, and B is a numpy array.
For the Iris dataset, the mean (§1.3) is given by the following code.
iris = datasets.load_iris()
dataset = iris["data"]
m = mean(dataset,axis=0)
vectors = dataset - m
2.2 Products
Let t be a scalar, u, v, w be vectors, and let A, B be matrices. We already
know how to compute tu, tv, and tA, tB. In this section, we compute the dot
product u · v, the matrix-vector product Av, and the matrix-matrix product
AB.
These products are not defined unless the dimensions “match”. In numpy,
these products are written dot; in sympy, these products are written *.
In §1.4, we defined the dot product in two dimensions. We now generalize
to any dimension d. Suppose u, v are vectors in Rd . Then their dot product
u · v is the scalar obtained by multiplying corresponding features and then
summing the products. This only works if the dimensions of u and v agree.
In other words, if u = (s1 , s2 , . . . , sd ) and v = (t1 , t2 , . . . , td ), then
u · v = s1 t1 + s2 t2 + · · · + sd td . (2.2.1)
As in §1.4, we always have rows on the left, and columns on the right.
In Python,
u = array([1,2,3])
v = array([4, 5, 6])
u = Matrix([1,2,3])
v = Matrix([4, 5, 6])
sqrt(dot(v,v))
sqrt(v.T * v)
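A sketch of the two products side by side (the example vectors are assumptions):

# numpy: dot products are written dot
from numpy import array, dot, sqrt
u = array([1,2,3])
v = array([4,5,6])
print(dot(u,v))            # 32
print(sqrt(dot(v,v)))      # length of v, sqrt(77)

# sympy: products are written *
from sympy import Matrix
U = Matrix([1,2,3])
V = Matrix([4,5,6])
print(U.dot(V))            # 32
print((U.T * V)[0,0])      # also 32, as the entry of a 1x1 matrix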
As in §1.4,

Dot Product

The dot product u · v (2.2.1) satisfies

u · v = |u| |v| cos θ,

where θ is the angle between u and v.

In two dimensions, this was equation (1.4.4) in §1.4. Since any two vectors lie in a two-dimensional plane, this remains true in any dimension.
Based on this, we can compute the angle θ,

cos θ = \frac{u · v}{|u| |v|} = \frac{u · v}{\sqrt{(u · u)(v · v)}}.
Here is code for the angle θ,
def angle(u,v):
    a = dot(u,v)
    b = dot(u,u)
    c = dot(v,v)
    theta = arccos(a / sqrt(b*c))
    return degrees(theta)
Cauchy-Schwarz Inequality

The dot product of two vectors is absolutely less or equal to the product of their lengths,

|u · v| ≤ |u| |v|.

Two vectors u and v satisfying u · v = 0 are orthogonal. With this understood, the zero vector is orthogonal to every vector. The converse is true as well: If u · v = 0 for every v, then in particular, u · u = 0, which implies u = 0.
Vectors v1 , . . . , vN are said to be orthonormal if they are both unit vectors
and orthogonal. Orthogonal nonzero vectors can be made orthonormal by
dividing each vector by its length.
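For example, a sketch of normalizing a pair of orthogonal vectors:

from numpy import array, dot
from numpy.linalg import norm

v1 = array([1.0, 1.0, 0.0])
v2 = array([1.0, -1.0, 0.0])
print(dot(v1, v2))                  # 0: orthogonal
u1, u2 = v1/norm(v1), v2/norm(v2)   # divide by the lengths
print(norm(u1), norm(u2))           # 1.0 1.0: orthonormal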
|a + b| = (a + b) · v ≤ |a| + |b|.
A,B,dot(A,B)
A,B,A*B
returns

AB = \begin{pmatrix} 70 & 80 & 90 \\ 158 & 184 & 210 \end{pmatrix}.
(Av)t = v t At .
dot(A,B).T == dot(B.T,A.T)
In terms of row vectors and column vectors, this is automatic. For exam-
ple,
(Au) · v = (Au)t v = (ut At )v = ut (At v) = u · (At v).
In Python,
dot(dot(A,u),v) == dot(u,dot(A.T,v))
dot(dot(A.T,u),v) == dot(u,dot(A,v))
As a consequence,
Then the identities (1.4.14) and (1.4.15) hold in general. Using the tensor
product, we have
Iff is short for if and only if.
Tensor Identity
Let A be a matrix with rows v1 , v2 , . . . , vN . Then
At A = v1 ⊗ v1 + v2 ⊗ v2 + · · · + vN ⊗ vN . (2.2.7)
Multiplying (2.2.7) by xt on the left and x on the right, and using (1.4.15),
we see (2.2.7) is equivalent to
By matrix-vector multiplication,
Ax = (v1 · x, v2 · x, . . . , vN · x).
Since |Ax|² is the sum of the squares of its components, this derives (2.2.8).
and
∥A∥² = trace(At A). (2.2.11)
By replacing A by At , the same results hold for columns.
Q = dot(vectors.T,vectors)/N
Q = cov(dataset,rowvar=False)
or
Q = cov(dataset.T)
After downloading the Iris dataset as in §2.1, the mean, covariance, and
total variance are
m = (5.84, 3.05, 3.76, 1.2), Q = \begin{pmatrix} 0.68 & −0.04 & 1.27 & 0.51 \\ −0.04 & 0.19 & −0.32 & −0.12 \\ 1.27 & −0.32 & 3.09 & 1.29 \\ 0.51 & −0.12 & 1.29 & 0.58 \end{pmatrix}, 4.54.    (2.2.13)
In Python,
from sklearn.preprocessing import StandardScaler

# standardize dataset
vectors = StandardScaler().fit_transform(dataset)
Qcorr = corrcoef(dataset.T)
Qcov = cov(vectors.T,bias=True)
allclose(Qcov,Qcorr)
returns True.
2.3 Matrix Inverse

Ax = b. (2.3.1)
In this section, we use the inverse A−1 and the pseudo-inverse A+ to solve
(2.3.1).
However, it’s very easy to construct matrices A and vectors b for which
the linear system (2.3.1) has no solutions at all! For example, take A the
zero matrix and b any non-zero vector. Because of this, we must be careful
when solving (2.3.1).
Ax = b =⇒ x = A−1 b. (2.3.3)
# solving Ax=b
x = A.inv() * b
# solving Ax=b
x = dot(inv(A) , b)
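When A is invertible, numpy's solve can also be used directly (a sketch, assuming A and b are the numpy array and vector from the previous snippet; solve does not form the inverse explicitly):

from numpy.linalg import solve

# solving Ax=b without computing inv(A)
x = solve(A, b)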
x+ = A+ b =⇒ Ax+ = b.
• no solutions, or
• A is invertible,
• A = 0 and b = 0.
The pseudo-inverse provides a single systematic procedure for deciding
among these three possibilities. The pseudo-inverse is available in numpy and
sympy as pinv. In this section, we focus on using Python to solve Ax = b,
postponing concepts to §2.6.
How do we use the above result? Given A and b, using Python, we
compute x = A+ b. Then we check, by multiplying in Python, equality of Ax
and b.
The rest of the section consists of examples of solving linear systems. The
reader is encouraged to work out the examples below in Python. However,
because some linear systems have more than one solution, and the imple-
mentation of Python on your laptop may be different than on my laptop, our
solutions may differ.
It can be shown that if the entries of A are integers, then the entries of
A+ are fractions. This fact is reflected in sympy, but not in numpy, as the
default in numpy is to work with floats.
Let
# vectors
u = array([1,2,3,4,5])
v = array([6,7,8,9,10])
w = array([11,12,13,14,15])
# arrange as columns
A = column_stack([u,v,w])
pinv(A)
returns

A+ = \frac{1}{150} \begin{pmatrix} −37 & −20 & −3 & 14 & 31 \\ −10 & −5 & 0 & 5 & 10 \\ 17 & 10 & 3 & −4 & −11 \end{pmatrix}.
Alternatively, in sympy,
# column vectors
u = Matrix([1,2,3,4,5])
v = Matrix([6,7,8,9,10])
w = Matrix([11,12,13,14,15])
A = Matrix.hstack(u,v,w)
A.pinv()
For

c = (−9, −3, 3, 9, 10),

we have

x+ = A+ c = \frac{1}{15} (82, 25, −32).
However, for this x+ , we have
We solve
Bx = u, Bx = v, Bx = w
by constructing the candidates
B + u, B + v, B + w,
Let

C = At = \begin{pmatrix} 1 & 2 & 3 & 4 & 5 \\ 6 & 7 & 8 & 9 & 10 \\ 11 & 12 & 13 & 14 & 15 \end{pmatrix}

and let f = (0, −5, −10). Then

C+ = (At)+ = (A+)t = \frac{1}{150} \begin{pmatrix} −37 & −10 & 17 \\ −20 & −5 & 10 \\ −3 & 0 & 3 \\ 14 & 5 & −4 \\ 31 & 10 & −11 \end{pmatrix}

and

x+ = C+ f = \frac{1}{50} (32, 35, 38, 41, 44).
Once we confirm equality of Cx+ and f , which is the case, we obtain a
solution x+ of Cx = f .
2.4 Span and Linear Independence

x = (t1 , t2 , . . . , td ).
Then
Ax = t1 v1 + t2 v2 + · · · + td vd , (2.4.2)
In other words, Ax is the linear combination

t1 v1 + t2 v2 + · · · + td vd

of the vectors. For example, span(b) of a single vector b is the line through
b, and span(u, v, w) is the set of all linear combinations ru + sv + tw.
Span Definition I
The span of v1 , v2 , . . . , vd is the set S of all linear combinations of v1 ,
v2 , . . . , vd , and we write
S = span(v1 , v2 , . . . , vd ).
Span Definition II
Let A be the matrix with columns v1 , v2 , v3 , . . . , vd . Then
span(v1 , v2 , . . . , vd ) is the set S of all vectors of the form Ax.
span(v1 , v2 , . . . , vd ) = span(w1 , w2 , . . . , wN ).
Thus there are many choices of spanning vectors for a given span.
For example, let u, v, w be the columns of A in (2.3.4). Let ⊂ mean "is contained in". Then

span(u, v) ⊂ span(u, v, w),

since adding a third vector can only increase the linear combination possibilities. On the other hand, since w = 2v − u, we also have

span(u, v, w) ⊂ span(u, v).

It follows that

span(u, v, w) = span(u, v).
Let A be a matrix. The column space of A is the span of its columns. For
A as in (2.3.4), the column space of A is span(u, v, w). The code
# column vectors
u = Matrix([1,2,3,4,5])
v = Matrix([6,7,8,9,10])
w = Matrix([11,12,13,14,15])
A = Matrix.hstack(u,v,w)
A.columnspace()

returns a minimal set of vectors spanning the column space of A. The column rank of A is the number of vectors returned.
For example, for A as in (2.3.4), this code returns
Ax = t1 v1 + t2 v2 + · · · + td vd .
By (2.4.3),
This code returns two orthonormal vectors a/|a| and b/|b|, where
a = (8, 9, 10, 11, 12), b = (11, 6, 1, −4, −9),
and |a| = √510, |b| = √255.
We conclude the column space of A can be described in at least three
ways,
span(a, b) = span(u, v, w) = span(u, v).
Explicitly, a and b are linear combinations of u, v, w,
15a = 2u + 5v + 8w, 30b = −173u − 50v + 73w, (2.4.4)
and u, v, w are linear combinations of a and b,
51u = 16a − 7b, 51v = 41a − 2b, w = 2v − u. (2.4.5)
By (2.4.3), to derive (2.4.4), we solve Ax = a and Ax = b for x. But this
was done in §2.3.
Similarly, let B be the matrix with columns a and b, and solve Bx = u,
Bx = v, Bx = w, obtaining (2.4.5). This was done in §2.3.
As a general rule, sympy.columnspace returns vectors in close to original
form, and scipy.linalg.orth orthonormalizes the spanning vectors.
For example, let b = (−9, −3, 3, 9, 10) and let Ā = (A, b). Using Python,
check the column rank of Ā is 3. Since the column rank of A is 2, we conclude
b is not in the column space of A: b is not a linear combination of u, v, w.
When (2.4.6) holds, b is a linear combination of the columns of A. How-
ever, (2.4.6) does not tell us which linear combination. According to (2.4.3),
finding the linear combination is equivalent to solving Ax = b.
then
(r, s, t) = re1 + se2 + te3 .
This shows the vectors e1 , e2 , e3 span R3 , or
R3 = span(e1 , e2 , e3 ).
e1 = (1, 0, 0, . . . , 0, 0)
e2 = (0, 1, 0, . . . , 0, 0)
e3 = (0, 0, 1, . . . , 0, 0) (2.4.7)
... = ...
ed = (0, 0, 0, . . . , 0, 1)
Then e1 , e2 , . . . , ed span Rd , so
d-dimensional Space
Rd is a span.
span(a, b, c, d, e) = span(a, f ).
t1 v1 + t2 v2 + · · · + td vd = 0.
ru + sv + tw = 1u − 2v + 1w = 0 (2.4.8)
If r ≠ 0, then

u = −(s/r)v − (t/r)w.

If s ≠ 0, then

v = −(r/s)u − (t/s)w.

If t ≠ 0, then

w = −(r/t)u − (s/t)v.
A.nullspace()
This says the null space of A consists of all multiples of (1, −2, 1). Since the
code
[r,s,t] = A.nullspace()[0]
null_space(A)
A Versus At A
Let A be any matrix. The null space of A equals the null space of At A.
|Ax|² = Ax · Ax = x · At Ax = 0,
t1 v1 + t2 v2 + · · · + td vd = 0.
Take the dot product of both sides with v1 . Since the dot products of any
two vectors is zero, and each vector has length one, we obtain
t1 = t1 v1 · v1 = t1 v1 · v1 + t2 v2 · v1 + · · · + td vd · v1 = 0.
u = array([1,2,3,4,5])
v = array([6,7,8,9,10])
w = array([11,12,13,14,15])
B = row_stack([u,v,w])
null_space(B)
Ax = (v1 · x, v2 · x, . . . , vN · x).
v1 · x = 0, v2 · x = 0, . . . , vN · x = 0,
Since the row space is the orthogonal complement of the null space, and
the null space of A equals the null space of At A, we conclude
A Versus At A
Let A be any matrix. Then the row space of A equals the row space
of At A.
Now replace A by At in this last result. Since the row space of At equals
the column space of A, and AAt is symmetric, we also have
A Versus AAt
Let A be any matrix. Then the column space of A equals the column
space of AAt .
• If x1 and x2 are in the null space, and r1 and r2 are scalars, then so is
r1 x1 + r2 x2 , because
m = (x1 + x2 + · · · + xN)/N.
Center the dataset (see §1.3)
v1 = x1 − m, v2 = x2 − m, . . . , vN = xN − m,
v1 · b, v2 · b, . . . , vN · b.
b · Qb = 0.
b · (x − m) = 0.
b · (x − m) = 0.
a(x − x0 ) + b(y − y0 ) = 0, or ax + by = c,
point in the plane, then (x, y, z) − (x0 , y0 , z0 ) is orthogonal to (a, b, c), so the
equation of the plane is
(a, b, c) · ((x, y, z) − (x0 , y0 , z0 )) = 0, or ax + by + cz = d,
where d = ax0 + by0 + cz0 .
Suppose we have a dataset in R3 with mean m = (3, 2, 1), and covariance
Q = \begin{pmatrix} 1 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{pmatrix}. (2.5.2)
Let b = (2, −1, −1). Then Qb = 0, so b · Qb = 0. We conclude the dataset
lies in the plane
(2, −1, −1) · ((x, y, z) − (x0 , y0 , z0 )) = 0, or 2x − y − z = 3.
In this case, the dataset is two-dimensional, as it lies in a plane.
If a dataset has covariance the 3 × 3 identity matrix I, then b · Ib is never
zero unless b = 0. Such a dataset is three-dimensional, it does not lie in a
plane.
Sometimes there may be several zero variance directions. For example,
for the covariance (2.5.2) and u = (2, −1, −1), v = (0, 1, −1), we have both
u · Qu = 0 and v · Qv = 0.
From this we see the dataset corresponding to this Q lies in two planes: The plane orthogonal to u, and the plane orthogonal to v. But the intersection of
two planes is a line, so this dataset lies in a line, which means it is one-
dimensional.
Which line does this dataset lie in? Well, the line has to pass through the
mean, and is orthogonal to u and v. If we find a vector b satisfying b · u = 0
and b · v = 0, then the line will pass through m and will be parallel to b. But
we know how to find such a vector. Let A be the matrix with rows u, v. Then
b in the nullspace of A fulfills the requirements. We obtain b = (1, 1, 1).
Based on the above result, here is code that returns zero variance direc-
tions.
def zero_variance(dataset):
    Q = cov(dataset.T)
    return null_space(Q)
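For example, a sketch on a small dataset lying on the line y = x + 1 (the dataset here is an assumption; null_space is scipy's, as above):

from numpy import array, cov
from scipy.linalg import null_space

dataset = array([ [t, t + 1.0] for t in range(5) ])
print(zero_variance(dataset))
# one column, proportional to (1, -1) up to sign: the zero variance direction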
(1, 2, 3, 4, 5), (6, 7, 8, 9, 10), (11, 12, 13, 14, 15), (16, 17, 18, 19, 20).
2.6 Pseudo-Inverse
What exactly is the pseudo-inverse? It turns out the answer is best under-
stood geometrically.
Think of b and Ax as points, and measure the distance between them,
and think of x and the origin 0 as points, and measure the distance between
them (Figure 2.1).
Figure 2.1: The point x and the origin 0 in the source space; the points b and Ax in the target space.
Even though the point x+ may not solve Ax = b, this procedure (Figure
2.2) results in a uniquely determined x+ : While there may be several points
x∗ , there is only one x+ .
Figure 2.2: The points x, Ax, the points x∗ , Ax∗ , and the point x+ .
The results in this section are as follows. Let A be any matrix. There
is a unique matrix A+ — the pseudo-inverse of A — with the following
properties.
• x+ = A+ b is a solution of
• In either case,
At Ax = At b. (2.6.2)
Zero Residual
x is a solution of (2.3.1) iff the residual is zero.
x + 6y + 11z = −9
2x + 7y + 12z = −3
3x + 8y + 13z = 3        (2.6.3)
4x + 9y + 14z = 9
5x + 10y + 15z = 10
Let b be any vector, not necessarily in the column space of A. To see how
close we can get to solving (2.3.1), we minimize the residual (2.6.1). We say
x∗ is a residual minimizer if
Regression Equation
x∗ is a residual minimizer iff x∗ solves the regression equation.
At (Ax∗ − b) · v = (Ax∗ − b) · Av = 0.
At (Ax∗ − b) = 0,
Multiple Solutions
Any two residual minimizers differ by a vector in the nullspace of A.
Since we know from above there is a residual minimizer in the row space
of A, we always have a minimum norm residual minimizer.
Let v be in the null space of A, and write
x∗ · v ≥ 0.
Since both ±v are in the null space of A, this implies ±x∗ · v ≥ 0, hence
x∗ · v = 0. Since the row space is the orthogonal complement of the null
space, the result follows.
Uniqueness
If x1+ and x2+ are minimum norm residual minimizers, then v = x1+ − x2+ is zero, so x1+ = x2+.
We know any two solutions of the linear system (2.3.1) differ by a vector in
the null space of A (2.4.10), and any two solutions of the regression equation
(2.6.2) differ by a vector in the null space of A (above).
If x is a solution of (2.3.1), then, by multiplying by At , x is a solution of
the regression equation (2.6.2). Since x+ = A+ b is a solution of the regression
equation, x+ = x + v for some v in the null space of A, so
Ax+ = A(x + v) = Ax + Av = b + 0 = b.
This shows x+ is a solution of the linear system. Since all other solutions
differ by a vector v in the null space of A, this establishes the result.
Now we can state when Ax = b is solvable,
Solvability of Ax = b
Properties of Pseudo-Inverse
A. AA+ A = A
B. A+ AA+ = A+
(2.6.8)
C. AA+ is symmetric
D. A+ A is symmetric
u = A+ Au + v. (2.6.9)
Au = AA+ Au.
A+ w = A+ AA+ w + v
for some v in the null space of A. But both A+ w and A+ AA+ w are in the
row space of A, hence so is v. Since v is in both the null space and the row
space, v is orthogonal to itself, so v = 0. This implies A+ AA+ w = A+ w.
Since w was any vector, we obtain B.
Since A+ b solves the regression equation, At AA+ b = At b for any vector
b. Hence At AA+ = At . Let P = AA+ . Now
(x − A+ Ax) · A+ Ay = 0.
x · P y = P x · P y = x · P tP y
Also we have
2.7 Projections
In this section, we study projection matrices P , and we show
Figure 2.3: The projection Pb = tu of b onto the line span(u), with b − Pb orthogonal to u.
Let u be a unit vector, and let b be any vector. Let span(u) be the line
through u (Figure 2.3). The projection of b onto span(u) is the vector v in
span(u) that is closest to b.
It turns out this closest vector v equals P b for some matrix P , the pro-
jection matrix. Since span(u) is a line, the projected vector P b is a multiple
tu of u.
From Figure 2.3, b − P b is orthogonal to u, so
0 = (b − P b) · u = b · u − P b · u = b · u − t u · u = b · u − t.
Hence t = b · u, and the projected vector is P b = (b · u)u. Note P (P b) = P b, since P b is already on the line. If U is the matrix with the single column u, we obtain

P = U U t.
To summarize, the projected vector is the vector (b · u)u, and the reduced
vector is the scalar b · u. Now we project onto a plane. If U is the matrix
with the single column u, then the reduced vector is U t b and the projected
vector is U U t b.
2. P b = b if b is in S,
P = AA+ . (2.7.2)
establishing 3.
def project(A,b):
    Aplus = pinv(A)
    x = dot(Aplus,b)   # reduced
    return dot(A,x)    # projected
For A as in (2.3.4) and b = (−9, −3, 3, 9, 10) the reduced vector onto the
column space of A is
x = A+ b = \frac{1}{15} (82, 25, −32),
and the projected vector onto the column space of A is
P b = Ax = AA+ b = (−8, −3, 2, 7, 12).
The projection matrix onto the column space of A is

P = AA+ = \frac{1}{10} \begin{pmatrix} 6 & 4 & 2 & 0 & −2 \\ 4 & 3 & 2 & 1 & 0 \\ 2 & 2 & 2 & 2 & 2 \\ 0 & 1 & 2 & 3 & 4 \\ −2 & 0 & 2 & 4 & 6 \end{pmatrix}.
P = A+ A. (2.7.3)
def project_to_ortho(U,b):
    x = dot(U.T,b)   # reduced
    return dot(U,x)  # projected
dataset vk in Rd , k = 1, 2, . . . , N
reduced U t vk in Rn , k = 1, 2, . . . , N
projected U U t vk in Rd , k = 1, 2, . . . , N
If S is a span in Rd , then
Rd = S ⊕ S ⊥ . (2.7.5)
v = P v + (v − P v),
and the null space and row space are orthogonal to each other.
P = I − A+ A. (2.7.7)
But this was already done in §2.3, since P b = AA+ b = Ax+ where x+ = A+ b
is a residual minimizer.
2.8 Basis
Figure: Spanning vectors and linearly independent vectors, and the corresponding bases, orthogonal bases, and orthonormal bases.
Span of N Vectors
e1 = (1, 0, . . . , 0),
e2 = (0, 1, 0, . . . , 0),
... = ...
ed = (0, 0, . . . , 0, 1),
The dimension of Rd is d.
matrix_rank(vectors)
In particular, since 712 < 784, approximately 10% of pixels are never
touched by any image. For example, the most likely pixel to remain un-
touched is at the top left corner (0, 0). Thus there are 72 zero variance
directions for this dataset.
We pose the following question: What is the least n for which the first
n images are linearly dependent? Since the dimension of the feature space
Rd is d, we must have n ≤ 784. To answer the question, we compute the
row rank of the first n vectors for n = 1, 2, 3, . . . , and continue until we have
linear dependence of v1 , v2 , . . . , vn .
If we save the MNIST dataset as a centered array vectors, as in §2.1,
and run the code below, we obtain n = 560 (Figure 2.7). matrix_rank is
discussed in §2.9.
def find_first_defect(vectors):
    d = len(vectors[0])
    previous = 0
    for n in range(len(vectors)):
        r = matrix_rank(vectors[:n+1,:])
        print((r,n+1),end=",")
        if r == previous: break
        if r == d: break
        previous = r
This we call the dimension staircase. For example, Figure 2.8 is the dimen-
sion staircase for
v1 = (1, 0, 0), v2 = (0, 1, 0), v3 = (1, 1, 0), v4 = (3, 4, 0), v5 = (0, 0, 1).
With the MNIST dataset loaded as vectors, here is code returning Figure
2.9. This code is not efficient, but it works. It takes 57041 vectors in the
dataset to fill up 712 dimensions.
def dimension_staircase(vectors):
    d = vectors[0].size
    N = len(vectors)
    rmax = matrix_rank(vectors)
    dimensions = [ ]
    basis = [ ]
    for n in range(1,N):
        r = matrix_rank(vectors[:n,:])
        print((r,n),end=",")
        dimensions.append(r)
        if r == rmax: break
    stairs(dimensions, range(n+1))
v1 is a linear combination of b1 , b2 , . . . , bd ,
v1 = t1 b1 + t2 b2 + · · · + td bd .
Since v1 ≠ 0, at least one of the coefficients, say t1, is not zero, so we can solve

b1 = \frac{1}{t_1} (v_1 − t_2 b_2 − t_3 b_3 − · · · − t_d b_d).
This shows
span(v1 , v2 , . . . , vN ) = span(v1 , b2 , b3 , . . . , bd ).
Repeating the same logic, v2 is a linear combination of v1 , b2 , b3 , . . . , bd ,
v2 = s1 v1 + t2 b2 + t3 b3 + · · · + td bd .
If all the coefficients of b2 , b3 , . . . , bd are zero, then v2 is a multiple of v1 ,
contradicting linear independence of v1 , v2 , . . . , vN . Thus there is at least
one coefficient, say t2 , which is not zero. Solving for b2 , we obtain
b2 = \frac{1}{t_2} (v_2 − s_1 v_1 − t_3 b_3 − · · · − t_d b_d).
This shows
span(v1 , v2 , . . . , vN ) = span(v1 , v2 , b3 , b4 , . . . , bd ).
Repeating the same logic, v3 is a linear combination of v1 , v2 , b3 , b4 , . . . , bd ,
v3 = s1 v1 + s2 v2 + t3 b3 + t4 b4 + · · · + td bd .
If all the coefficients of b3 , b4 , . . . , bd are zero, then v3 is a linear combination
of v1 , v2 , contradicting linear independence of v1 , v2 , . . . , vN . Thus there is
at least one coefficient, say t3 , which is not zero. Solving for b3 , we obtain
b3 = \frac{1}{t_3} (v_3 − s_1 v_1 − s_2 v_2 − t_4 b_4 − · · · − t_d b_d).
This shows
span(v1 , v2 , . . . , vN ) = span(v1 , v2 , v3 , b4 , b5 , . . . , bd ).
Continuing in this manner, we eventually arrive at
span(v1 , v2 , . . . , vN ) = · · · = span(v1 , v2 , . . . , vd ).
This shows vN is a linear combination of v1 , v2 , . . . , vd . This shows N = d,
because N > d contradicts linear independence. Since d is the minimal
spanning number, this shows v1 , v2 , . . . , vN is a minimal spanning set for S.
2.9 Rank
If A is an N ×d matrix, then (Figure 2.10) x 7→ Ax is a linear transformation
that sends a vector x in Rd (the source space) to the vector Ax in RN (the
target space). The transpose At goes in the reverse direction: The linear
transformation b 7→ At b sends a vector b in RN (the target space) to the
vector At b in Rd (the source space).
It follows that for an N × d matrix, the dimension of the source space is
d, and the dimension of the target space is N ,
dim(source space) = d, dim(target space) = N.
Figure 2.10: A sends x in the source space to Ax in the target space; At sends b in the target space back to At b in the source space.
By (2.4.2), the column space is in the target space, and the row space is
in the source space. Thus we always have
0 ≤ row rank ≤ d and 0 ≤ column rank ≤ N.
For A as in (2.3.4), the column rank is 2, the row rank is 2, and the nullity
is 1. Thus the column space is a 2-d plane in R5 , the row space is a 2-d plane
in R3 , and the null space is a 1-d line in R3 .
Rank Theorem
Let A be any matrix. Then
A.rank()
matrix_rank(A)
returns the rank of a matrix. The main result implies rank(A) = rank(At ),
so
For any N × d matrix, the rank is never greater than min(N, d).
C = CI = CAB = IB = B,
so B = C is the inverse of A.
The first two assertions are in §2.2. For the last assertion, assume U is a square matrix. From §2.4, orthonormality of the rows implies linear independence of the rows.
Orthogonal Matrix
A matrix U is orthogonal iff its rows are an orthonormal basis iff its
columns are an orthonormal basis.
Since
U u · U v = u · U t U v = u · v,
U preserves dot products. Since lengths are dot products, U also preserves
lengths. Since angles are computed from dot products, U also preserves
angles. Summarizing,
As a consequence,
I = v1 ⊗ v1 + v2 ⊗ v2 + · · · + vd ⊗ vd .
and
|v|² = |v · v1|² + |v · v2|² + · · · + |v · vd|². (2.9.4)
To derive the main result, first we recall (2.7.6). From the definition of
dimension, we can rewrite (2.7.6) as
Ax = A(u + v) = Au + Av = Av.
This shows the column space consists of vectors of the form Av with v in the
row space.
Let v1 , v2 , . . . , vr be a basis for the row space. From the previous para-
graph, it follows Av1 , Av2 , . . . , Avr spans the column space of A. We claim
Av1 , Av2 , . . . , Avr are linearly independent. To check this, we write
If v is the vector t1 v1 +t2 v2 +· · ·+tr vr , this shows v is in the null space. But v
is a linear combination of basis vectors of the row space, so v is also in the row
space. Since the row space is the orthogonal complement of the null space,
we must have v orthogonal to itself. Thus v = 0, or t1 v1 +t2 v2 +· · ·+tr vr = 0.
But v1 , v2 , . . . , vr is a basis. By linear independence of v1 , v2 , . . . , vr , we
conclude t1 = 0, . . . , tr = 0. This establishes the claim, hence Av1 , Av2 , . . . ,
Avr is a basis for the column space. This shows r is the dimension of the
column space, which is by definition the column rank. Since by construction,
r is also the row rank, this establishes the rank theorem.
Chapter 3
Principal Components
To keep things simple, assume both the source space and the target space
are R2 ; then A is 2 × 2.
The unit circle (in red in Figure 3.1) is the set of vectors u satisfying
|u| = 1. The image of the unit circle (also in red in Figure 3.1) is the set of
vectors of the form
{Au : |u| = 1}.
The annulus is the set (the region between the dashed circles in Figure 3.1)
of vectors b satisfying
{b : σ2 < |b| < σ1 }.
It turns out the image is an ellipse, and this ellipse lies in the annulus.
Thus the numbers σ1 and σ2 constrain how far the image of the unit circle
is from the origin, and how near the image is to the origin.
This shows the image of the unit circle is the inverse covariance ellipse (§1.6)
corresponding to the covariance Q, with major axis length 2σ1 and minor
axis length 2σ2 .
These reflect vectors across the horizontal axis, and across the vertical axis.
Recall an orthogonal matrix is a matrix U satisfying U t U = I = U U t
(2.9.2). Every orthogonal matrix U is a rotation V or a rotation times a
reflection V R.
The SVD decomposition (§3.3) states that every matrix A can be written
as a product
A = \begin{pmatrix} a & b \\ c & d \end{pmatrix} = U S V.
Here S is a diagonal matrix as above, and U , V are orthogonal and rotation
matrices as above.
In more detail, apart from a possible reflection, there are scalings σ1 and
σ2 and angles α and β, so that A transforms vectors by first rotating by α,
then scaling by (σ1 , σ2 ), then rotating by β (Figure 3.2).
A nonzero vector v is an eigenvector of the square matrix A, corresponding to the eigenvalue λ, if

Av = λv. (3.2.1)
(Figure: eigendata versus singular data. Any matrix has singular data σ, u, v, and its row rank equals its column rank. A square matrix has eigendata λ, v; for an invertible matrix λ ≠ 0; a symmetric matrix has real eigenvalues and orthonormal eigenvectors; a covariance (nonnegative) matrix has λ ≥ 0; a positive matrix has λ > 0.)
# general square matrix: use eig
from numpy import array
from numpy.linalg import eig, eigh

A = array([[2,1],[1,2]])
lamda, U = eig(A)
lamda

# symmetric matrix: use eigh
Q = array([[2,1],[1,2]])
lamda, U = eigh(Q)
lamda
Since Av = λv is the same as

(A − λI)v = Av − λv = 0,

λ is an eigenvalue of A exactly when A − λI is not invertible.
If v is a unit eigenvector of a symmetric matrix Q, Qv = λv, then

v · Qv = v · λv = λ v · v = λ.

If u and v are eigenvectors of Q, say Qu = λu and Qv = µv, then, by the symmetry of Q,

µ u · v = u · (µv) = u · Qv = v · Qu = v · (λu) = λ u · v.

This implies

(µ − λ) u · v = 0.

If λ ≠ µ, we must have u · v = 0. We conclude: eigenvectors of a symmetric matrix corresponding to distinct eigenvalues are orthogonal.
QU = U E. (3.2.3)
allclose(dot(Q,v), lamda*v)
returns True.
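As a quick numerical check of (3.2.3), the products QU and U E can be compared directly; a minimal sketch:

from numpy import array, diag, dot, allclose
from numpy.linalg import eigh

Q = array([[2,1],[1,2]])
lamda, U = eigh(Q)
E = diag(lamda)
print(allclose(dot(Q,U), dot(U,E)))   # True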
λ1 ≥ λ2 ≥ · · · ≥ λd .
Diagonalization (EVD)

Let Q be a symmetric d × d matrix. Then there is an orthogonal matrix U and a diagonal matrix E with Q = U EU t .

In other words, with the correct choice of orthonormal basis, the matrix
Q becomes a diagonal matrix E.
The orthonormal basis eigenvectors v1 , v2 , . . . , vd are the principal compo-
nents of the matrix Q. The eigenvalues and eigenvectors of Q, taken together,
are the eigendata of Q. The code
Q = array([[2,1],[1,2]])
lamda, U = eigh(Q)
lamda, U
returns the eigenvalues [1, 3] and the matrix U = [u, v] with columns

u = (1/√2, −1/√2), v = (1/√2, 1/√2).
Q = array([[2,1],[1,2]])
lamda, U = eigh(Q)
V = U.T
E = diag(lamda)
allclose(Q,dot(U,dot(E,V)))
returns True.
from sympy import Matrix, init_printing
init_printing()

Q = Matrix([[2,1],[1,2]])
# eigenvalues
Q.eigenvals()
# eigenvectors
Q.eigenvects()
U, E = Q.diagonalize()
rank(Q) = rank(E) = r.
Using sympy,
Q = Matrix([[2,1],[1,2]])
U, E = Q.diagonalize()
display(U,E)
returns
U = \begin{pmatrix} 1 & 1 \\ -1 & 1 \end{pmatrix}, \qquad E = \begin{pmatrix} 1 & 0 \\ 0 & 3 \end{pmatrix}.
Also,
from sympy import symbols, Matrix
a, b, c = symbols('a b c', real=True)

Q = Matrix([[a,b],[b,c]])
U, E = Q.diagonalize()
display(Q,U,E)
returns
Q = \begin{pmatrix} a & b \\ b & c \end{pmatrix}, \qquad U = \begin{pmatrix} \dfrac{a-c-\sqrt{D}}{2b} & \dfrac{a-c+\sqrt{D}}{2b} \\ 1 & 1 \end{pmatrix},

and

E = \frac12 \begin{pmatrix} a+c-\sqrt{D} & 0 \\ 0 & a+c+\sqrt{D} \end{pmatrix}, \qquad D = (a-c)^2 + 4b^2.
display is used to pretty-print the output.
Pseudo-Inverse (EVD)
Qx = b
has a solution x for every vector b iff all eigenvalues are nonzero, in
which case
x = (1/λ1)(b · v1)v1 + (1/λ2)(b · v2)v2 + · · · + (1/λd)(b · vd)vd. (3.2.5)
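Formula (3.2.5) can be checked on a small example; the matrix Q and vector b below are arbitrary choices.

from numpy import array, dot, allclose
from numpy.linalg import eigh

Q = array([[2.,1.],[1.,2.]])
b = array([1., 5.])
lamda, U = eigh(Q)     # columns of U are the eigenvectors
# x = sum over k of (1/lamda_k)(b . v_k) v_k
x = sum((dot(b,U[:,k])/lamda[k]) * U[:,k] for k in range(len(lamda)))
print(allclose(dot(Q,x), b))   # True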
trace(Q) = λ1 + λ2 + · · · + λd . (3.2.6)
Q² is symmetric with eigenvalues λ1², λ2², . . . , λd². Applying the last result to Q², we have

trace(Q²) = λ1² + λ2² + · · · + λd².
(Figure: the ellipse determined by Q, with semi-axes ±√λ1 v1 and ±√λ2 v2 .)
Q = λ1 v1 ⊗ v1 + λ2 v2 ⊗ v2 + · · · + λd vd ⊗ vd . (3.2.7)
v = (v · v1 ) v1 + (v · v2 ) v2 + · · · + (v · vd ) vd .
Let bk = √λk vk , k = 1, . . . , d. Then, by (3.2.7), the covariance matrix of the 2d points ±b1 , ±b2 , . . . , ±bd ,

(2/(2d)) (b1 ⊗ b1 + b2 ⊗ b2 + · · · + bd ⊗ bd ),

equals Q/d.
λ1 = max v · Qv, (3.2.8)

where the maximum is over all unit vectors v. We say a unit vector b is best-fit
for Q or best-aligned with Q if the maximum is achieved at v = b: λ1 = b · Qb.
When Q is a covariance matrix, this means the unit vector b is chosen so that
the variance b · Qb of the dataset projected onto b is maximized.
An eigenvalue λ1 of Q is the top eigenvalue if λ1 ≥ λ for any other
eigenvalue. An eigenvalue λ1 of Q is the bottom eigenvalue if λ1 ≤ λ for any
other eigenvalue.
A Calculation
Suppose λ, a, b, c, d are real numbers and suppose we know
(λ + at + bt²)/(1 + ct + dt²) ≤ λ, for all t real.

Then a = λc. To see this, note that for t small the denominator is positive, so cross-multiplying gives λ + at + bt² ≤ λ + λct + λdt², or (a − λc)t ≤ (λd − b)t². Letting t → 0 through positive and through negative values forces a − λc = 0.
If v is a unit eigenvector of Q with eigenvalue λ, then

λ1 ≥ v · Qv = v · (λv) = λ v · v = λ,

so λ1 is greater than or equal to every eigenvalue. Now let v1 be a unit vector achieving the maximum (3.2.8):

λ1 = v1 · Qv1 ≥ v · Qv (3.2.9)

for all unit vectors v. Let u be any vector. Then for any real t,
v = (v1 + tu)/|v1 + tu|.

Inserting this v into (3.2.9) and applying the calculation above gives

u · Qv1 = λ1 u · v1 ,

or

u · (Qv1 − λ1 v1 ) = 0.

Since u is an arbitrary vector, Qv1 = λ1 v1 : the maximizing vector v1 is an eigenvector of Q with eigenvalue λ1 .
Just as the maximum variance (3.2.8) is the top eigenvalue λ1 , the mini-
mum variance
λd = min_{|v|=1} v · Qv, (3.2.10)
λ1 ≥ λ2 ≥ λ3 ≥ · · · ≥ λd .
v1 , v2 , v3 , . . . , vd .
(Figure: the span S of the top eigenvectors v1 , v2 , v3 and its orthogonal complement T = S⊥ .)
Sλ = {v : Qv = λv}
the eigenspace corresponding to λ. For example, suppose the top three eigen-
values are equal: λ1 = λ2 = λ3 , with b1 , b2 , b3 the corresponding eigenvectors.
Calling this common value λ, the eigenspace is Sλ = span(b1 , b2 , b3 ). Since
b1 , b2 , b3 are orthonormal, dim(Sλ ) = 3. In Python, the eigenspaces Sλ are
obtained from the matrix U above: The columns of U are an orthonormal basis
for the entire space, so selecting the columns corresponding to a specific λ
yields an orthonormal basis for Sλ .
Let (evs,U) be the list of eigenvalues and matrix U whose columns are
the eigenvectors. Then the eigenvectors are the rows of U t . Here is code for
selecting just the eigenvectors corresponding to eigenvalue s.
lamda, U = eigh(Q)
V = U.T
V[isclose(lamda,s)]
The function isclose(a,b) returns True when a and b are numerically close.
Using this boolean, we extract only those rows of V whose corresponding
eigenvalue is close to s.
All this can be readily computed in Python. For the Iris dataset, we have
the covariance matrix in (2.2.13). The eigenvalues sum to the trace,

4.54 = trace(Q) = λ1 + λ2 + λ3 + λ4 .
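For reference, here is a sketch of how these Iris eigendata can be computed; the file name iris.csv follows the usage elsewhere in the text, and selecting the numeric columns is an assumption about the file's layout.

from pandas import read_csv
from numpy import cov
from numpy.linalg import eigh

df = read_csv("iris.csv")
data = df.select_dtypes("number").to_numpy()   # the four measurements

Q = cov(data.T)
lamda, U = eigh(Q)     # eigenvalues in increasing order
print(lamda.sum())     # equals trace(Q), about 4.54
print(lamda[-1])       # top eigenvalue, about 4.2
print(U[:,-1])         # corresponding eigenvector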
For the Iris dataset, the top eigenvalue is λ1 = 4.2, it has multiplicity 1, and
its corresponding list of eigenvectors contains only one eigenvector,
def row(i,d):
    # row i of the d x d matrix Q(d)
    v = [0]*d
    v[i] = 2
    if i > 0: v[i-1] = -1
    if i < d-1: v[i+1] = -1
    if i == 0: v[d-1] += -1
    if i == d-1: v[0] += -1
    return v
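The code below refers to the full matrix Q(d); a minimal sketch of how it can be assembled from row is

from numpy import array

def Q(d):
    # stack the rows into the d x d circulant matrix Q(d)
    return array([ row(i,d) for i in range(d) ])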
# using sympy
from sympy import Matrix
# using numpy
from numpy import *
(Figure 3.6: two masses m1 , m2 between walls, with displacements x1 , x2 .)
To explain where these matrices come from, look at the mass-spring sys-
tems in Figures 3.6 and 3.7. Here we have springs attached to masses and
160 CHAPTER 3. PRINCIPAL COMPONENTS
walls on either side. At rest, the springs are the same length. When per-
turbed, some springs are compressed and some stretched. In Figure 3.6, let
x1 and x2 denote the displacement of each mass from its rest position.
When extended by x, each spring fights back by exerting a force kx
proportional to the displacement x. For example, look at the mass m1 . The
spring to its left is extended by x1 , so exerts a force of −kx1 . Here the minus
indicates pulling to the left. On the other hand, the spring to its right is
extended by x2 − x1 , so it exerts a force +k(x2 − x1 ). Here the plus indicates
pulling to the right. Adding the forces from either side, the total force on
m1 is −k(2x1 − x2 ). For m2 , the spring to its left exerts a force −k(x2 − x1 ),
and the spring to its right exerts a force −kx2 , so the total force on m2 is
−k(2x2 − x1 ). We obtain the force vector

−k \begin{pmatrix} 2x_1 - x_2 \\ -x_1 + 2x_2 \end{pmatrix} = −k \begin{pmatrix} 2 & -1 \\ -1 & 2 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}.
However, as you can see, the matrix here is not exactly Q(2).
(Figure 3.7: five masses m1 , . . . , m5 between walls, with displacements x1 , . . . , x5 .)
Repeating the same logic for the five masses in Figure 3.7, we obtain the force vector

−k \begin{pmatrix} 2x_1 - x_2 \\ -x_1 + 2x_2 - x_3 \\ -x_2 + 2x_3 - x_4 \\ -x_3 + 2x_4 - x_5 \\ -x_4 + 2x_5 \end{pmatrix} = −k \begin{pmatrix} 2 & -1 & 0 & 0 & 0 \\ -1 & 2 & -1 & 0 & 0 \\ 0 & -1 & 2 & -1 & 0 \\ 0 & 0 & -1 & 2 & -1 \\ 0 & 0 & 0 & -1 & 2 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \end{pmatrix}.
But, again, the matrix here is not Q(5). Notice, if we place one mass and
two springs in Figure 3.6, we obtain the 1 × 1 matrix 2.
To obtain Q(2) and Q(5), we place the springs along a circle, as in Figures
3.8 and 3.9. Now we have as many springs as masses. Repeating the same
logic, this time we obtain Q(2) and Q(5). Notice if we place one mass and
one spring in Figure 3.8, d = 1, we obtain the 1 × 1 matrix Q(1) = 0: There
is no force if we move a single mass around the circle, because the spring is
not being stretched.
(Figures 3.8 and 3.9: the masses m1 , . . . , m5 arranged on a circle, with as many springs as masses.)
p(t) = 2 − t − t^{d−1} ,

and let

v1 = (1, ω, ω², ω³, . . . , ω^{d−1}).
Then Qv1 is
Qv1 = \begin{pmatrix} 2 - \omega - \omega^{d-1} \\ -1 + 2\omega - \omega^2 \\ -\omega + 2\omega^2 - \omega^3 \\ \vdots \\ -\omega^{d-2} + 2\omega^{d-1} - 1 \end{pmatrix} = p(\omega) \begin{pmatrix} 1 \\ \omega \\ \omega^2 \\ \omega^3 \\ \vdots \\ \omega^{d-1} \end{pmatrix} = p(\omega) v_1.
vk = (1, ω^k , ω^{2k} , ω^{3k} , . . . , ω^{(d−1)k} ).

By (1.5.7), Qvk = p(ω^k )vk , and

p(ω^k ) = 2 − ω^k − ω^{−k} = 2 − 2 cos(2πk/d).

Thus the eigenvalues of Q(d) are 2 − 2 cos(2πk/d), k = 0, 1, . . . , d − 1.
Eigenvalues of Q(d)
Q(2) = (4, 0)
Q(3) = (3, 3, 0)
Q(4) = (4, 2, 2, 0)
Q(5) = (5/2 + √5/2, 5/2 + √5/2, 5/2 − √5/2, 5/2 − √5/2, 0)
Q(6) = (4, 3, 3, 1, 1, 0)
Q(8) = (4, 2 + √2, 2 + √2, 2, 2, 2 − √2, 2 − √2, 0)
Q(10) = (4, 5/2 + √5/2, 5/2 + √5/2, 3/2 + √5/2, 3/2 + √5/2,
         5/2 − √5/2, 5/2 − √5/2, 3/2 − √5/2, 3/2 − √5/2, 0)
Q(12) = (4, 2 + √3, 2 + √3, 3, 3, 2, 2, 1, 1, 2 − √3, 2 − √3, 0).
The matrices Q(d) are circulant matrices. Each row in Q(d) is obtained
from the row above by shifting the entries to the right. The trick of using
the roots of unity to compute the eigenvalues and eigenvectors works for any
circulant matrix.
Our last topic is the distribution of the eigenvalues for large d. How are
the eigenvalues scattered? Figure 3.10 plots the eigenvalues for Q(50) using
the code below.
from numpy import arange, sort, cos, pi
from numpy.linalg import eigh
from matplotlib.pyplot import stairs, scatter, legend, show

d = 50
# eigenvalues computed numerically
lamda = eigh(Q(d))[0]
stairs(lamda,range(d+1),label="numpy")
# eigenvalues from the formula 2 - 2cos(2*pi*k/d)
k = arange(d)
lamda = 2 - 2*cos(2*pi*k/d)
increasing = sort(lamda)
scatter(k,lamda,s=5,label="unordered")
scatter(k,increasing,c="red",s=5,label="increasing order")
legend()
show()
Figure 3.10 shows the eigenvalues tend to cluster near the top λ1 ∼ 4 and
the bottom λd = 0 for d large. Using the double-angle formula,
λk = 4 sin²(πk/d), k = 0, 1, 2, . . . , d − 1.
Solving for k/d in terms of λ, and multiplying by two to account for the double multiplicity, we obtain² the proportion of eigenvalues below the threshold λ,

#{k : λk ≤ λ}/d ≈ (2/π) arcsin(√λ/2), 0 ≤ λ ≤ 4. (3.2.13)
² This is an approximate equality: The ratio of the two sides approaches 1 as d → ∞.
This arcsine law is valid for a wide class of matrices, not just Q(d), as the
matrix size d grows without bound, d → ∞.
Equivalently, the derivative of the arcsine law (3.2.13) exhibits (see (7.1.9))
the eigenvalue clustering near the ends (Figure 3.11).
from numpy import arange, sqrt, pi
from matplotlib.pyplot import plot, text, show

lamda = arange(0.1,3.9,.01)
density = 1/(pi*sqrt(lamda*(4-lamda)))
plot(lamda,density)
# r"..." means raw string
f = r"$\displaystyle\frac1{\pi\sqrt{\lambda(4-\lambda)}}$"
text(.5,.45,f,usetex=True,fontsize="x-large")
show()
3.3 Singular Value Decomposition
A positive number σ > 0 is a singular value of the matrix A if there are unit vectors u and v satisfying

Av = σu, At u = σv.

When this happens, v is a right singular vector and u is a left singular vector
associated to σ.
Some books allow singular values to be zero. Here we insist that sin-
gular values be positive. Contrast singular values with eigenvalues: While
eigenvalues may be negative or zero, for us singular values are positive.
The definition immediately implies
For our first example, let

A = \begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix}.

Then Av = λv implies λ = 1 and v = (1, 0). Thus A has only one eigenvalue,
equal to 1. Set

Q = At A = \begin{pmatrix} 1 & 1 \\ 1 & 2 \end{pmatrix}.
Since Q is symmetric, Q has two eigenvalues λ1 , λ2 and corresponding eigen-
vectors v1 , v2 . Moreover, as we saw in an earlier section, v1 , v2 may be chosen
orthonormal.
The eigenvalues of Q are given by
0 = det(Q − λI) = λ2 − 3λ + 1.
Qv = At Av = At (σu) = σ 2 v. (3.3.2)
Thus v1 , u1 are right and left singular vectors corresponding to the singular
value σ1 of A. Similarly, if we set u2 = Av2 /σ2 , then v2 , u2 are right and left
singular vectors corresponding to the singular value σ2 of A.
0 = λ1 v1 ·v2 = Qv1 ·v2 = (At Av1 )·v2 = (Av1 )·(Av2 ) = σ1 u1 ·σ2 u2 = σ1 σ2 u1 ·u2 .
A Versus Q

Let A be any matrix. Then A and Q = At A have the same row space and the same null space, hence the same rank, and the singular values of A are the square roots of the positive eigenvalues of Q.

Since the rank equals the dimension of the row space, the first part follows
from §2.4.
Qv = At Av = At (σu) = σAt u = σ²v,

so v is an eigenvector of At A corresponding to λ = σ² > 0. Conversely, if Qv = λv and λ > 0, then set σ = √λ and u = Av/σ. Then
Avk = σk uk , At uk = σk vk , k = 1, 2, . . . , r, (3.3.3)
and
Avk = 0, At uk = 0 for k > r.
The proof is very simple once we remember the rank of Q equals the
number of positive eigenvalues of Q. By the eigenvalue decomposition, there
is an orthonormal basis of the source space v1 , v2 , . . . and λ1 ≥ λ2 ≥ · · · ≥
λr > 0 such that Qvk = λk vk , k = 1, . . . , r, and Qvk = 0, k > r.
Setting σk = √λk and uk = Avk /σk , k = 1, . . . , r, as in our first example, we have (3.3.3), and, again as in our first example, u1 , u2 , . . . , ur are
orthonormal.
Assume A is N × d. Then the source space is Rd , and the target space
is RN . By construction, vr+1 , vr+2 , . . . , vd is an orthonormal basis for the
null space of A. Set u1 = Av1 /σ1 , u2 = Av2 /σ2 , . . . , ur = Avr /σr . Since
Avr+1 = 0, . . . , Avd = 0, u1 , u2 , . . . , ur is an orthonormal basis for the
column space of A.
Since the column space of A is the row space of At , the column space
of A is the orthogonal complement of the nullspace of At (2.7.6). Choose
ur+1 , ur+2 , . . . , uN any orthonormal basis for the nullspace of At . Then
{u1 , u2 , . . . , ur } and {ur+1 , ur+2 , . . . , uN } are orthogonal. From this, u1 , u2 ,
. . . , uN is an orthonormal basis for the target.
For our second example, let a and b be nonzero vectors, possibly of dif-
ferent sizes, and let A be the matrix
A = a ⊗ b, At = b ⊗ a.
Then
Av = (v · b)a = σu and At u = (u · a)b = σv.
Since the range of A equals span(a), the rank of A equals one.
Since σ > 0, v is a multiple of b and u is a multiple of a. If we write
v = tb and u = sa and plug in, we get

t|b|² a = σsa, s|a|² b = σtb,

so t|b|² = σs and s|a|² = σt. Multiplying these, σ² = |a|²|b|². Thus there is only one singular value of A, equal to |a| |b|. This is not
surprising since the rank of A is one.
In a similar manner, one sees the only singular value of the 1 × n matrix
A = a equals σ = |a|.
Our third example is
A = \begin{pmatrix} 0 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix}. (3.3.4)
Then
At = \begin{pmatrix} 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 \end{pmatrix}, \qquad Q = At A = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 \end{pmatrix}.
Here we have (N, d) = (6, 4), r = 3. In either case, S has the same shape
N × d as A.
Let U be the matrix with columns u1 , u2 , . . . , uN , and let V be the matrix
with rows v1 , v2 , . . . , vd . Then V t has columns v1 , v2 , . . . , vd .
AV t = U S.

Since V t V = I, multiplying on the right by V gives

A = U SV.
Summarizing,
Diagonalization (SVD)

Every N × d matrix A can be written as A = U SV , where U and V are orthogonal and S is an N × d diagonal matrix whose diagonal entries are the singular values of A.

In Python,

from numpy.linalg import svd

U, sigma, V = svd(A)
# sigma is a vector of singular values
print(U.shape, sigma.shape, V.shape)
print(U, sigma, V)
Given the relation between the singular values of A and the eigenvalues
of Q = At A, we also can conclude
from numpy import mean, cov, dot
from numpy.linalg import svd, eigh

# center dataset
m = mean(dataset,axis=0)
A = dataset - m
# rows of V are right
# singular vectors of A
V = svd(A)[2]
# either normalization works here: scaling Q
# does not change its eigenvectors
Q = cov(dataset.T,bias=False)
Q = cov(dataset.T,bias=True)
# columns of U are
# eigenvectors of Q
U = eigh(Q)[1]
# compare columns of U
# and rows of V
U, V
returns
U = \begin{pmatrix} 0.36 & -0.66 & -0.58 & 0.32 \\ -0.08 & -0.73 & 0.6 & -0.32 \\ 0.86 & 0.18 & 0.07 & -0.48 \\ 0.36 & 0.07 & 0.55 & 0.75 \end{pmatrix}, \qquad V = \begin{pmatrix} 0.36 & -0.08 & 0.86 & 0.36 \\ -0.66 & -0.73 & 0.18 & 0.07 \\ 0.58 & -0.6 & -0.07 & -0.55 \\ 0.32 & -0.32 & -0.48 & 0.75 \end{pmatrix}.
This shows the columns of U are identical to the rows of V , except for the
third column of U , which is the negative of the third row of V .
and
A+ uk = 0, (A+ )t vk = 0 for k > r.
Qvk = λk vk , k = 1, . . . , d.
λ1 ≥ λ2 ≥ · · · ≥ λd ,
in PCA one takes the most significant components, those components whose
eigenvalues are near the top eigenvalue. For example, one can take the top
two eigenvalues λ1 ≥ λ2 and their eigenvectors v1 , v2 , and project the dataset
onto the plane span(v1 , v2 ). The projected dataset can then be visualized
as points in the plane. Similarly, one can take the top three eigenvalues
λ1 ≥ λ2 ≥ λ3 and their eigenvectors v1 , v2 , v3 and project the dataset onto
the space span(v1 , v2 , v3 ). This can then be visualized as points in three
dimensions.
Recall the MNIST dataset consists of N = 60000 points in d = 784
dimensions. After we download the dataset,
dataset = train_X.reshape((60000,784))
labels = train_y
Q = cov(dataset.T)
totvar = Q.trace()
# eigenvalues in decreasing order
lamda = sort(eigh(Q)[0])[::-1]
# eigenvalues as a fraction of the total variance
percent = lamda/totvar
# cumulative sums
sums = cumsum(percent)
data = array([percent,sums])
print(data.T[:20].round(decimals=3))
d = len(lamda)
from matplotlib.pyplot import stairs
stairs(percent,range(d+1))
The left column in Figure 3.12 lists the top twenty eigenvalues as a per-
centage of their sum. For example, the top eigenvalue λ1 is around 10% of
the total variance. The right column lists the cumulative sums of the eigen-
values, so the third entry in the right column is the sum of the top three
eigenvalues, λ1 + λ2 + λ3 = 22.97%.
This results in Figures 3.12 and 3.13. Here we sort the array eig in
decreasing order, then we cumsum the array to obtain the cumulative sums.
Because the rank of the MNIST dataset is 712 (§2.9), the bottom 72 =
784 − 712 eigenvalues are exactly zero. A full listing shows that many more
eigenvalues are near zero, and the second column in Figure 3.12 shows the
top ten eigenvalues alone sum to almost 50% of the total variance.
def pca(dataset,n):
    Q = cov(dataset.T)
    # columns of U are
    # eigenvectors of Q
    lamda, U = eigh(Q)
    # decreasing eigenvalue sort
    order = lamda.argsort()[::-1]
    # the top n columns of U, in sorted order,
    # become the columns of V
    V = U[:,order[:n]]
    P = dot(V,V.T)
    return P
In the code, lamda is sorted in decreasing order, and the sorting order is
saved as order. To obtain the top n eigenvectors, we sort the first n columns
U[:,order[:n]] in the same order, resulting in the d × n matrix V . The
code then returns the projection matrix P = V V t (2.7.4).
Instead of working with the covariance Q, as discussed at the start of
the section, we can work directly with the dataset, using svd, to obtain the
eigenvectors.
def pca_with_svd(dataset,n):
    # center dataset
    m = mean(dataset,axis=0)
    vectors = dataset - m
    # rows of V are
    # right singular vectors
    V = svd(vectors)[2]
    # no need to sort, already in decreasing order
    U = V[:n].T # top n rows as columns
    P = dot(U,U.T)
    return P
Let v = dataset[1] be the second image in the MNIST dataset, and let
Q be the covariance of the dataset. Then the code below returns the image
compressed down to n = 784, 600, 350, 150, 50, 10, 1 dimensions, returning
Figure 1.4.
figure(figsize=(10,5))
# eight subplots: the original image, then seven projections
rows, cols = 2, 4
subplot(rows,cols,1)
imshow(reshape(v,(28,28)),cmap="gray_r")
for i, n in enumerate([784,600,350,150,50,10,1], start=2):
    projv = dot(pca(dataset,n), v)
    A = reshape(projv,(28,28))
    subplot(rows,cols,i)
    imshow(A,cmap="gray_r")
show()
If you run out of memory trying this code, cut down the dataset from
60,000 points to 10,000 points or fewer. The code works with pca or with
pca_with_svd.
from sklearn.decomposition import PCA

N = len(dataset)
n = 10
engine = PCA(n_components = n)
reduced = engine.fit_transform(dataset)
reduced.shape
and returns (N, n) = (60000, 10). The following code computes the projected
dataset
projected = engine.inverse_transform(reduced)
projected.shape
Figure 3.14: Original and projections: n = 784, 600, 350, 150, 50, 10, 1.
figure(figsize=(10,5))
# eight subplots
rows, cols = 2, 4
Now we project all vectors of the MNIST dataset onto two and three dimensions.
grid()
legend(loc='upper right')
show()
grid()
legend(loc='upper right')
show()
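The scatter plots referred to here can be produced along the following lines; this is a minimal sketch, assuming dataset and labels are the arrays defined earlier, with marker sizes and colors left as arbitrary choices.

from numpy import cov, dot
from numpy.linalg import eigh
from matplotlib.pyplot import scatter, grid, legend, show

Q = cov(dataset.T)
lamda, U = eigh(Q)
order = lamda.argsort()[::-1]
V = U[:,order[:2]]            # top two eigenvectors as columns
reduced = dot(dataset, V)     # N x 2 coordinates

for digit in range(10):
    pts = reduced[labels == digit]
    scatter(pts[:,0], pts[:,1], s=1, label=str(digit))

grid()
legend(loc='upper right')
show()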
%matplotlib notebook
from matplotlib.pyplot import *
from mpl_toolkits import mplot3d
P = axes(projection='3d')
P.set_axis_off()
legend(loc='upper right')
show()
The three dimensional plot of the complete MNIST dataset is Figure 1.5
in §1.2. The command %matplotlib notebook allows the figure to be rotated
and scaled.
Given a dataset and a collection of cluster means, each point is assigned to the nearest mean. Here is the code.

from numpy.linalg import norm

def nearest_index(x,means):
    # index of the mean closest to x
    i = 0
    for j,m in enumerate(means):
        n = means[i]
        if norm(x - m) < norm(x - n): i = j
    return i

def assign_clusters(dataset,means):
    clusters = [ [] for m in means ]
    for x in dataset:
        i = nearest_index(x,means)
        clusters[i].append(x)
    return [ c for c in clusters if len(c) > 0 ]

def update_means(clusters):
    return [ mean(c,axis=0) for c in clusters ]
from numpy import array
from numpy.random import random

d = 2
k,N = 7,100

def random_vector(d):
    return array([ random() for _ in range(d) ])

close_enough = False
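The iteration itself is not shown here; a minimal sketch of a loop built from the helpers above follows (the stopping rule and the tolerance are assumptions).

from numpy.linalg import norm

means = [ random_vector(d) for _ in range(k) ]
dataset = [ random_vector(d) for _ in range(N) ]

while not close_enough:
    clusters = assign_clusters(dataset,means)
    new_means = update_means(clusters)
    # stop once the number of clusters and the means themselves stop changing
    if len(new_means) == len(means):
        close_enough = all(norm(m - n) < 1e-6 for m,n in zip(means,new_means))
    means = new_means
    print([ len(c) for c in clusters ])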
This code returns the sizes of the clusters after each iteration. Here is code
that plots a cluster.
def plot_cluster(mean,cluster,color,marker):
    for v in cluster:
        scatter(v[0],v[1], s=50, c=color, marker=marker)
    scatter(mean[0], mean[1], s=100, c=color, marker='*')
d = 2
k,N = 7,100

def random_vector(d):
    return array([ random() for _ in range(d) ])

close_enough = False
figure(figsize=(4,4))
grid()
for v in dataset: scatter(v[0],v[1],s=20,c='black')
show()
Chapter 4

Counting
Some of the material in this chapter is first seen in high school. Because
repeating the exposure leads to a deeper understanding, we review it in a
manner useful to the later chapters.
Why are there six possibilities? Because there are three ways of choosing
the first ball, then two ways of choosing the second ball, then one way of
choosing the third ball, so the total number of ways is
6 = 3 × 2 × 1.
n! = n × (n − 1) × (n − 2) × · · · × 2 × 1.
Notice also
(n + 1)! = (n + 1) × n × (n − 1) × · · · × 2 × 1 = (n + 1) × n!,
Permutations of n Objects
The number of ways of selecting n objects from a collection of n distinct
objects is n!.
We also have
1! = 1, 0! = 1.
It’s clear that 1! = 1. It’s less clear that 0! = 1, but it’s reasonable if you
think about it: The number of ways of selecting from zero balls results in
only one possibility — no balls.
More generally, we can consider the selection of k balls from a bag con-
taining n distinct balls. There are two varieties of selections that can be
made: Ordered selections and unordered selections. An ordered selection is
a permutation, and an unordered selection is a combination. In particular,
when k = n, n! is the number of ways of permuting n objects.
Notice P (x, k) is defined for any real number x by the same formula,
C(n, k) = P (n, k)/k! = n!/((n − k)! k!).
For example,
5×4
P (5, 2) = 5 × 4 = 20, C(5, 2) = = 10,
2×1
so we have twenty ordered pairs
(1, 2), (1, 3), (1, 4), (1, 5), (2, 1), (2, 3), (2, 4), (2, 5), (3, 1), (3, 2),
(3, 4), (3, 5), (4, 1), (4, 2), (4, 3), (4, 5), (5, 1), (5, 2), (5, 3), (5, 4)
{1, 2}, {1, 3}, {1, 4}, {1, 5}, {2, 3}, {2, 4}, {2, 5}, {3, 4}, {3, 5}, {4, 5}.
1, 2, 3, . . . , n − 1, n,

each of which is at most n, so

n! < nⁿ.

However, because half of the factors are less than n/2, we expect an approximation smaller than nⁿ, maybe something like (n/2)ⁿ or (n/3)ⁿ.
To be systematic about it, assume an approximation of the form1
n! ∼ e (n/e)ⁿ, for n large, (4.1.1)
for some constant e. We seek the best constant e that fits here. In this
approximation, we multiply by e so that (4.1.1) is an equality when n = 1.
Using the binomial theorem, in §4.4 we show
3 (n/3)ⁿ ≤ n! ≤ 2 (n/2)ⁿ, n ≥ 1. (4.1.2)
Based on this, a constant e satisfying (4.1.1) must lie between 2 and 3,
2 ≤ e ≤ 3.
To figure out the best constant e to pick, we see how much both sides
of (4.1.1) increase when we replace n by n + 1. Write (4.1.1) with n + 1
replacing n, obtaining
n+1
n+1
(n + 1)! ∼ e . (4.1.3)
e
Dividing the left sides of (4.1.1), (4.1.3) yields
(n + 1)!
= (n + 1).
n!
Dividing the right sides yields
e((n + 1)/e)^{n+1} / (e(n/e)ⁿ) = (n + 1) · (1/e) · (1 + 1/n)ⁿ. (4.1.4)
To make these quotients match as closely as possible, we should choose
e ∼ (1 + 1/n)ⁿ. (4.1.5)
Choosing n = 1, 2, 3, . . . , 100, . . . results in
4.2 Graphs
A graph consists of nodes and edges. For example, the graphs in Figure 4.2
each have four nodes and three edges. The left graph is directed, in that a
direction is specified for each edge. The graph on the right is undirected, no
direction is specified.
(Figure: a weighted directed graph, with edge weights −3, 7.4, 2, and 0.)
Let wij be the weight on the edge (i, j) in a weighted directed graph. The
weight matrix of a weighted directed graph is the matrix W = (wij ).
If the graph is unweighted, then we set A = (aij ), where
aij = 1 if i and j are adjacent, and aij = 0 if not.
In this case, A consists of ones and zeros, and is called the adjacency matrix.
If the graph is also undirected, then the adjacency matrix is symmetric,
aij = aji .
When m = 0, there are no edges, and we say the graph is empty. When
m = n(n − 1)/2, there are the maximum number of edges, and we say the
graph is complete. The complete graph with n nodes is written Kn (Figure
4.5).
The cycle graph Cn with n nodes is as in Figure 4.5. The graph Cn has
n edges. The cycle graph C3 is a triangle.
d1 ≥ d2 ≥ d3 ≥ · · · ≥ dn
(d1 , d2 , d3 , . . . , dn )
Handshaking Lemma
If the order is n, the size is m, and the degrees are d1 , d2 , . . . , dn , then
d1 + d2 + · · · + dn = \sum_{k=1}^{n} dk = 2m.
To see this, we consider two cases. First case, assume there are no isolated
nodes. Then the degree sequence is
n − 1 ≥ d1 ≥ d2 ≥ · · · ≥ dn ≥ 1.
n − 2 ≥ d1 ≥ d2 ≥ . . . dn−1 ≥ 1.
A graph is regular if all the node degrees are equal. If the node degrees are
all equal to k, we say the graph is k-regular. From the handshaking lemma,
for a k-regular graph, we have kn = 2m, so
m = kn/2.
For example, because 2m is even, there are no 3-regular graphs with 11 nodes.
Both Kn and Cn are regular, with Kn being (n − 1)-regular, and Cn being
2-regular.
A walk on a graph is a sequence of nodes v1 , v2 , v3 , . . . where each
consecutive pair vi , vi+1 of nodes are adjacent. For example, if v1 , v2 , v3 ,
v4 , v5 , v6 are the nodes (in any order) of the complete graph K6 , then v1 →
v2 → v3 → v4 → v2 is a walk. A path is a walk with no backtracking: A
path visits each node at most once. A closed walk is a walk that ends where
it starts. A cycle is a closed walk with no backtracking.
Two nodes a and b are connected if there is a walk starting at a and ending
at b. If a and b are connected, then there is a path starting at a and ending
at b, since we can cut out the cycles of the walk. A graph is connected if every
two nodes are connected. A graph is disconnected if it is not connected. For
4.2. GRAPHS 201
For example, the empty graph has adjacency matrix given by the zero matrix.
Since our graphs are undirected, the adjacency matrix is symmetric.
Let 1 be the vector 1 = (1, 1, 1, . . . , 1). The adjacency matrix of the
complete graph Kn is the n×n matrix A with all ones except on the diagonal.
If I is the n × n identity matrix, then this adjacency matrix is
A=1⊗1−I
Notice there are ones on the sub-diagonal, and ones on the super-diagonal,
and ones in the upper-right and lower-left corners.
202 CHAPTER 4. COUNTING
For any adjacency matrix A, the sum of each row is equal to the degree
of the node corresponding to that row. This is the same as saying
A1 = (d1 , d2 , . . . , dn ).
A1 = k1,
v1 = 1 ≥ |vj |, j = 2, 3, . . . , n.
Since the sum a11 + a12 + · · · + a1n equals the degree d1 of node 1, this implies
Top Eigenvalue
For a k-regular graph, k is the top eigenvalue of the adjacency matrix
A.
A1 = (1 · 1)1 − 1 = n1 − 1 = (n − 1)1,
2 cos(2πk/n), k = 0, 1, 2, . . . , n − 1,
Ā = A(Ḡ) = 1 ⊗ 1 − I − A(G).
Now aik akj is either 0 or 1, and equals 1 exactly if there is a 2-step path from
i to j. Hence
Notice a 2-step walk between i and j is the same as a 2-step path between i
and j.
When i = j, (A²)ii is the number of 2-step walks connecting i to i, which is the degree di . Summing over i counts each edge twice, so we have

(1/2) trace(A²) = m = number of edges.
Similarly, (A3 )ij is the number of 3-step walks connecting i and j. Since
a 3-step walk from i to i is the same as a triangle, (A3 )ii is the number
of triangles in the graph passing through i. Since the trace is the sum of
the diagonal elements, trace(A3 ) counts the number of triangles. But this
overcounts by a factor of 3! = 6, since three labels may be rearranged in six
ways. Hence
(1/6) trace(A³) = number of triangles.
Connected Graph
Let A be the adjacency matrix. Then the graph is connected if for
every i ̸= j, there is a k with (Ak )ij > 0.
Hence P is orthogonal,
P P t = I, P −1 = P t .
4.2. GRAPHS 207
Using permutation matrices, we can say two graphs are isomorphic if their
adjacency matrices A, A′ satisfy

A′ = P AP −1 = P AP t

for some permutation matrix P .
A graph is bipartite if the nodes can be divided into two groups, with
adjacency only between nodes across groups. If we call the two groups even
and odd, then odd nodes are never adjacent to odd nodes, and even nodes
are never adjacent to even nodes.
The complete bipartite graph is the bipartite graph with maximum num-
ber of edges: Every odd node is adjacent to every even node. The complete
bipartite graph with n odd nodes with m even nodes is written Knm . Then
the order of Kmn is n + m.
Let a = (1, 1, . . . , 1, 0, 0, . . . , 0) be the vector with n ones and m zeros,
and let b = 1 − a. Then b has n zeros and m ones, and the adjacency matrix
of Knm is
A = A(Knm ) = a ⊗ b + b ⊗ a.
208 CHAPTER 4. COUNTING
Recall we have
(a ⊗ b)v = (b · v)a.
From this, we see the column space of A = a⊗b+b⊗a is span(a, b). Thus the
rank of A is 2, and the nullspace of A consists of the orthogonal complement
span(a, b)⊥ of span(a, b). Using this, we compute the eigenvalues of A.
Since the nullspace is span(a, b)⊥ , any vector orthogonal to a and to b
is an eigenvector for λ = 0. Hence the eigenvalue λ = 0 has multiplicity
n + m − 2. Since trace(A) = 0, the sum of the eigenvalues is zero, and the
remaining two eigenvalues are ±λ ̸= 0.
Let v be an eigenvector for λ ̸= 0. Then v is orthogonal to the nullspace
of A, so v must be a linear combination of a and b, v = ra+sb. Since a·b = 0,
Aa = nb, Ab = ma.
Hence
λv = Av = A(ra + sb) = rnb + sma.
4.2. GRAPHS 209
Applying A again,

λ²v = A(λv) = rnAb + smAa = nm(ra + sb) = nm v,

so λ² = nm and the nonzero eigenvalues are λ = ±√(nm). For example, for the graph in Figure 4.8, the nonzero eigenvalues are λ = ±√(3 × 5) = ±√15.
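Assuming the graph in Figure 4.8 is the complete bipartite graph with n = 3 and m = 5, this can be checked numerically; a sketch:

from numpy import zeros, outer, sort
from numpy.linalg import eigvalsh

n, m = 3, 5
a = zeros(n + m); a[:n] = 1        # n ones followed by m zeros
b = 1 - a
A = outer(a,b) + outer(b,a)        # adjacency matrix of K_{3,5}
print(sort(eigvalsh(A)))           # -sqrt(15), six zeros, sqrt(15)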
L = B t B.
Both the laplacian matrix and the adjacency matrix are n × n. What is the
connection between them?
Laplacian
The laplacian satisfies
L = D − A,
where D = diag(d1 , d2 , . . . , dn ) is the diagonal degree matrix.
For example, for the cycle graph C6 , the degree matrix is 2I, and the
laplacian is the matrix we saw in §3.2,
L = Q(6) = \begin{pmatrix} 2 & -1 & 0 & 0 & 0 & -1 \\ -1 & 2 & -1 & 0 & 0 & 0 \\ 0 & -1 & 2 & -1 & 0 & 0 \\ 0 & 0 & -1 & 2 & -1 & 0 \\ 0 & 0 & 0 & -1 & 2 & -1 \\ -1 & 0 & 0 & 0 & -1 & 2 \end{pmatrix}.
Similarly,
Thus
(a + x)² = a² + 2ax + x²
(a + x)³ = a³ + 3a²x + 3ax² + x³
(a + x)⁴ = a⁴ + 4a³x + 6a²x² + 4ax³ + x⁴ (4.3.4)
(a + x)⁵ = ⋆a⁵ + ⋆a⁴x + ⋆a³x² + ⋆a²x³ + ⋆ax⁴ + ⋆x⁵.
and

\binom{3}{0} = 1, \binom{3}{1} = 3, \binom{3}{2} = 3, \binom{3}{3} = 1

and

\binom{4}{0} = 1, \binom{4}{1} = 4, \binom{4}{2} = 6, \binom{4}{3} = 4, \binom{4}{4} = 1

and

\binom{5}{0} = ⋆, \binom{5}{1} = ⋆, \binom{5}{2} = ⋆, \binom{5}{3} = ⋆, \binom{5}{4} = ⋆, \binom{5}{5} = ⋆.

With this notation, the number

\binom{n}{k} (4.3.5)
is the coefficient of an−k xk when you multiply out (a + x)n . This is the
binomial coefficient. Here n is the degree of the binomial, and k, which
specifies the term in the resulting sum, varies from 0 to n (not 1 to n).
It is important to remember that, in this notation, the binomial (a + x)2
expands into the sum of three terms a2 , 2ax, x2 . These are term 0, term
1, and term 2. Alternatively, one says these are the zeroth term, the first
term, and the second term. Thus the second term in the expansion of the
binomial (a + x)⁴ is 6a²x², and the binomial coefficient \binom{4}{2} = 6. In general,
the binomial (a + x)ⁿ of degree n expands into a sum of n + 1 terms.
Since the binomial coefficient \binom{n}{k} is the coefficient of a^{n−k}x^k when you multiply out (a + x)ⁿ, we have the
Binomial Theorem
The binomial (a + x)ⁿ equals

\binom{n}{0} aⁿ + \binom{n}{1} a^{n−1}x + \binom{n}{2} a^{n−2}x² + · · · + \binom{n}{n−1} ax^{n−1} + \binom{n}{n} xⁿ. (4.3.6)
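The theorem can be checked symbolically for small n; for example, with sympy (a sketch),

from sympy import symbols, expand

a, x = symbols('a x')
print(expand((a + x)**4))
# a**4 + 4*a**3*x + 6*a**2*x**2 + 4*a*x**3 + x**4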
For example, the term 42 a2 x2 corresponds to choosing two a’s, and two x’s,
In Pascal’s triangle, the very top row has one number in it: This is the
zeroth row corresponding to n = 0 and the binomial expansion of (a+x)0 = 1.
The first row corresponds to n = 1; it contains the numbers (1, 1), which
correspond to the binomial expansion of (a + x)1 = 1a + 1x. We say the
zeroth entry (k = 0) in the first row (n = 1) is 1 and the first entry (k = 1)
in the first row is 1. Similarly, the zeroth entry (k = 0) in the second row
(n = 2) is 1, and the second entry (k = 2) in the second row (n = 2) is 1.
The second entry (k = 2) in the fourth row (n = 4) is 6. For every row, the
entries are counted starting from k = 0, and end with k = n, so there are
n + 1 entries in row n. With this understood, the k-th entry in the n-th row
is the binomial coefficient n-choose-k. So 10-choose-2 is
\binom{10}{2} = 45.
We can learn a lot about the binomial coefficients from this triangle.
First, we have 1’s all along the left edge. Next, we have 1’s all along the
right edge. Similarly, one step in from the left or right edge, we have the row
number. Thus we have
\binom{n}{0} = 1 = \binom{n}{n}, \qquad \binom{n}{1} = n = \binom{n}{n−1}, \qquad n ≥ 1.
Note also Pascal’s triangle has a left-to-right symmetry: If you read off
the coefficients in a particular row, you can’t tell if you’re reading them from
left to right, or from right to left. It’s the same either way: The fifth row is
(1, 5, 10, 10, 5, 1). In terms of our notation, this is written
\binom{n}{k} = \binom{n}{n−k}, 0 ≤ k ≤ n;
Let's work this out when n = 3. Then the left side is (a + x)⁴. From (4.3.4), we get

\binom{4}{0} a⁴ + \binom{4}{1} a³x + \binom{4}{2} a²x² + \binom{4}{3} ax³ + \binom{4}{4} x⁴
= (a + x) \left( \binom{3}{0} a³ + \binom{3}{1} a²x + \binom{3}{2} ax² + \binom{3}{3} x³ \right)
= \binom{3}{0} a⁴ + \binom{3}{1} a³x + \binom{3}{2} a²x² + \binom{3}{3} ax³
+ \binom{3}{0} a³x + \binom{3}{1} a²x² + \binom{3}{2} ax³ + \binom{3}{3} x⁴
= \binom{3}{0} a⁴ + \left( \binom{3}{1} + \binom{3}{0} \right) a³x + \left( \binom{3}{2} + \binom{3}{1} \right) a²x²
+ \left( \binom{3}{3} + \binom{3}{2} \right) ax³ + \binom{3}{3} x⁴.
This allows us to build Pascal’s triangle (Figure 4.9), where, apart from
the ones on either end, each term (“the child”) in a given row is the sum of
the two terms (“the parents”) located directly above in the previous row.
We conclude the sum of the binomial coefficients along the n-th row of Pascal's triangle is 2ⁿ (remember n starts from 0).
Now insert x = 1 and a = −1. You get

0 = \binom{n}{0} − \binom{n}{1} + \binom{n}{2} − · · · ± \binom{n}{n−1} ± \binom{n}{n}.
Hence: the alternating2 sum of the binomial coefficients along the n-th row
of Pascal’s triangle is zero.
We now show
2
Alternating means the plus-minus pattern + − + − + − . . . .
Binomial Coefficient
The binomial coefficient \binom{n}{k} equals C(n, k),

\binom{n}{k} = \frac{n · (n − 1) · · · · · (n − k + 1)}{1 · 2 · · · · · k} = \frac{n!}{k!(n − k)!}, 1 ≤ k ≤ n. (4.3.10)

C(n, k) + C(n, k − 1) = \frac{n!}{k!(n − k)!} + \frac{n!}{(k − 1)!(n − k + 1)!}
= \frac{n!}{(k − 1)!(n − k)!} \left( \frac{1}{k} + \frac{1}{n − k + 1} \right)
= \frac{n!(n + 1)}{(k − 1)!(n − k)! \, k(n − k + 1)}
= \frac{(n + 1)!}{k!(n + 1 − k)!} = C(n + 1, k).
The formula (4.3.10) is easy to remember: There are k terms in the numerator
as well as the denominator, the factors in the denominator increase starting
from 1, and the factors in the numerator decrease starting from n.
In Python, the code
comb(n,k)
comb(n,k,exact=True)
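Here comb can be taken from scipy.special; a quick check, with illustrative values,

from scipy.special import comb

print(comb(5, 2))               # 10.0, a float
print(comb(5, 2, exact=True))   # 10, an exact integer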
The binomial coefficient nk makes sense even for fractional n. This can
Rewriting this by pulling out the first two terms k = 0 and k = 1 leads to
\left(1 + \frac{1}{n}\right)^n = 1 + 1 + \sum_{k=2}^{n} \frac{1}{k!} \left(1 − \frac{1}{n}\right)\left(1 − \frac{2}{n}\right) \cdots \left(1 − \frac{k−1}{n}\right). (4.4.1)
From (4.4.1), we can tell a lot. First, since all terms are positive, we see
(1 + 1/n)ⁿ ≥ 2, n ≥ 1.
By (4.4.3), we arrive at
2 ≤ (1 + 1/n)ⁿ ≤ 3, n ≥ 1. (4.4.4)
Summarizing, we established the following strengthening of (4.1.5).
Euler’s Constant
The limit

e = lim_{n→∞} (1 + 1/n)ⁿ (4.4.5)
exists and satisfies 2 ≤ e ≤ 3.
Since we’ve shown bn increases faster than an , and cn increases faster than
bn , we have derived (4.1.2).
To summarize,
Euler’s Constant
Euler’s constant satisfies
e = \sum_{k=0}^{∞} \frac{1}{k!} = 1 + 1 + \frac12 + \frac16 + \frac1{24} + \frac1{120} + \frac1{720} + . . .
Depositing one dollar in a bank offering 100% interest returns two dollars
after one year. Depositing one dollar in a bank offering the same annual
interest compounded at mid-year returns
(1 + 1/2)² = 2.25

dollars after one year.
Depositing one dollar in a bank offering the same annual interest com-
pounded at n intermediate time points returns (1 + 1/n)n dollars after one
year.
Passing to the limit, depositing one dollar in a bank and continuously
compounding at an annual interest rate of 100% returns e dollars after one
year. Because of this, (4.4.5) is often called the compound-interest formula.
Exponential Function
For any real number x, the limit
exp x = lim_{n→∞} (1 + x/n)ⁿ (4.4.6)
exists. In particular, exp 0 = 1 and exp 1 = e.
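To watch the limit (4.4.6) numerically at x = 1, here is a small sketch; the values of n are arbitrary choices.

for n in [1, 10, 100, 1000, 10000, 100000]:
    print(n, (1 + 1/n)**n)
# the printed values increase toward e = 2.71828...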
preceding one,
(1 − x) = 1 − x
(1 − x)² = 1 − 2x + x² ≥ 1 − 2x
(1 − x)³ = (1 − x)(1 − x)² ≥ (1 − x)(1 − 2x) = 1 − 3x + 2x² ≥ 1 − 3x
(1 − x)⁴ = (1 − x)(1 − x)³ ≥ (1 − x)(1 − 3x) = 1 − 4x + 3x² ≥ 1 − 4x
... ...
This shows the limit exp x in (4.4.6) is well-defined when x < 0, and
1
exp(−x) = , for all x.
exp x
Exponential Series
The exponential function is always positive and satisfies, for every real
number x,
exp x = \sum_{k=0}^{∞} \frac{x^k}{k!} = 1 + x + \frac{x²}{2} + \frac{x³}{6} + \frac{x⁴}{24} + \frac{x⁵}{120} + \frac{x⁶}{720} + . . . (4.4.10)
Law of Exponents
For real numbers x and y,

exp x · exp y = exp(x + y).

To derive this, multiply out the product of two series

(a0 + a1 + a2 + a3 + . . . )(b0 + b1 + b2 + b3 + . . . )
Thus
\left( \sum_{k=0}^{∞} a_k \right) \left( \sum_{m=0}^{∞} b_m \right) = \sum_{n=0}^{∞} \left( \sum_{k=0}^{n} a_k b_{n−k} \right).
Now insert
a_k = \frac{x^k}{k!}, \qquad b_{n−k} = \frac{y^{n−k}}{(n − k)!}.
Then the n-th term in the resulting sum equals, by the binomial theorem,
\sum_{k=0}^{n} a_k b_{n−k} = \sum_{k=0}^{n} \frac{x^k}{k!} \frac{y^{n−k}}{(n − k)!} = \frac{1}{n!} \sum_{k=0}^{n} \binom{n}{k} x^k y^{n−k} = \frac{1}{n!} (x + y)^n.
Thus
exp x · exp y = \left( \sum_{k=0}^{∞} \frac{x^k}{k!} \right) \left( \sum_{m=0}^{∞} \frac{y^m}{m!} \right) = \sum_{n=0}^{∞} \frac{(x + y)^n}{n!} = exp(x + y).
Exponential Notation
For any real number x,
eˣ = exp x.

Chapter 5

Probability
[57, 49, 55, 44, 55, 50, 49, 50, 53, 49, 53, 50, 51, 53, 53, 54, 48, 51, 50, 53].
On the other hand, suppose someone else repeats the same experiment 20
times with a different coin, and obtains
[69, 70, 79, 74, 63, 70, 68, 71, 71, 73, 65, 63, 68, 71, 71, 64, 73, 70, 78, 67].
In this case, one suspects the two coins are statistically distinct, and have
different probabilities of obtaining heads.
In this section, we study how the probabilities of coin-tossing behave,
with the goal of answering the question: Is a given coin fair?
Prob(X2 = 1 | X1 = 0) = \frac{Prob(X1 = 0 and X2 = 1)}{Prob(X1 = 0)} = \frac{qp}{q} = p = Prob(X2 = 1),
so
P rob(X2 = 1 | X1 = 0) = P rob(X2 = 1).
Thus X1 = 0 has no effect on the probability that X2 = 1, and similarly for
the other possibilities. This is often referred to as the independence of the
coin tosses. We conclude
Independent Coin-Tossing
P rob(Xn = 1) = p, P rob(Xn = 0) = q = 1 − p, n ≥ 1.
P (X = a) = p, P (X = b) = q, P (X = c) = r.
230 CHAPTER 5. PROBABILITY
E(X) = ap + bq + cr.
For example,
E(Xn ) = 1 · p + 0 · (1 − p) = p,
Let
Sn = X1 + X2 + · · · + Xn .
5.1. BINOMIAL PROBABILITY 231
Since Xk = 1 when the k-th toss is heads, and Xk = 0 when the k-th toss is
tails, Sn is the number of heads in n tosses.
The mean of Sn is
which is the same as saying P rob(r < p < r +dr) = dr. By (5.1.6), we obtain
Prob(Sn = k) = \int_0^1 \binom{n}{k} r^k (1 − r)^{n−k} dr.
We now turn things around: Suppose we toss the coin n times, and obtain
k heads. How can we use this data to estimate the coin’s probability of heads
p?
To this end, we introduce the fundamental
Bayes Theorem
Prob(A | B) = \frac{Prob(B | A) · Prob(A)}{Prob(B)}. (5.1.8)
Prob(A | B) = \frac{Prob(A and B)}{Prob(B)} = \frac{Prob(A and B)}{Prob(A)} · \frac{Prob(A)}{Prob(B)} = Prob(B | A) · \frac{Prob(A)}{Prob(B)}.
Prob(p = r | Sn = k) = Prob(Sn = k | p = r) · \frac{Prob(p = r)}{Prob(Sn = k)}. (5.1.9)
Notice because of the extra factor (n + 1), this is not equal to (5.1.6).
In (5.1.6), p is fixed, and k is the variable. In (5.1.10), k is fixed, and r is
the variable. This a posteriori distribution for (n, k) = (10, 7) is plotted in
Figure 5.1. Notice this distribution is concentrated about k/n = 7/10 = .7.
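The plot below shows this density; here is a sketch of its definition together with the imports the plotting snippet uses (the function name f and the use of scipy's comb are assumptions).

from numpy import arange
from scipy.special import comb
from matplotlib.pyplot import grid, plot, show

n, k = 10, 7
# the a posteriori density (5.1.10): (n+1) times the binomial probability
def f(r):
    return (n + 1) * comb(n, k) * r**k * (1 - r)**(n - k)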
grid()
X = arange(0,1,.01)
plot(X,f(X),color="blue",linewidth=.5)
show()
Because Bayes Theorem is so useful, here are two alternate forms. First, since Prob(B) = Prob(B | A) Prob(A) + Prob(B | Aᶜ) Prob(Aᶜ),

Prob(A | B) = \frac{Prob(B | A) Prob(A)}{Prob(B | A) Prob(A) + Prob(B | Aᶜ) Prob(Aᶜ)}. (5.1.11)
Let
p = σ(z) = \frac{1}{1 + e^{−z}}. (5.1.12)
This is the logistic function or sigmoid function (Figure 5.2). The logistic
function takes as inputs real numbers z, and returns as outputs probabilities
p (Figure 5.3). Think of the input z as an activation energy, and the output
p as the probability of activation. In Python, σ is the expit function.

from scipy.special import expit

p = expit(z)
If the result is tails, select a point x at random with normal probability with
mean mT , or
Prob(x | T) ∼ e^{−|x − mT|²/2}.

This says the groups are centered around the points mH and mT respec-
tively.
Given a point x, what is the probability x is in the heads group? In other
words, what is
P rob(H | x)?
This question is begging for Bayes theorem.
Let
w = mH − mT , \qquad w0 = −\frac12 |mH|² + \frac12 |mT|².
Since P rob(H) = P rob(T ), here we have P rob(A) = P rob(Ac ). Inserting the
probabilities and simplifying leads to
log \frac{Prob(x | H) Prob(H)}{Prob(x | T) Prob(T)} = w · x + w0 . (5.1.14)
By (5.1.13), this leads to
P rob(H | x) = σ(w · x + w0 ).
5.2 Probability
A probability is often described as
the extent to which an event is likely to occur, measured by the
ratio of the favorable outcomes to the whole number of outcomes
possible.
We explain what this means by describing the basic terminology:
• An experiment is a procedure that yields an outcome, out of a set of
possible outcomes. For example, tossing a coin is an experiment that
yields one of two outcomes, heads or tails, which we also write as 1 or
0. Rolling a six-sided die yields outcomes 1, 2, 3, 4, 5, 6. Rolling two
six-sided dice yields 36 outcomes (1, 1), (1, 2),. . . . Flipping a coin three
times yields 23 = 8 outcomes
or
000, 001, 010, 011, 100, 101, 110, 111.
(1, 6), (2, 5), (3, 4), (4, 3), (5, 2), (6, 1),
#(E) = 35, which is the number of ways you can choose three things
out of seven things:
#(E) = 7-choose-3 = \binom{7}{3} = \frac{7 · 6 · 5}{1 · 2 · 3} = 35.
1. 0 ≤ P rob(s) ≤ 1,
2. The sum of the probabilities of all outcomes equals one.
• Outcomes are equally likely when they have the same probability. When
this is so, we must have
Prob(E) = \frac{#(E)}{#(S)}.
For example,
1. A coin is fair if the outcomes are equally likely. For one toss of a
fair coin, P rob(heads) = 1/2.
2. More generally, tossing a coin results in outcomes
P rob(head) = p, P rob(tail) = 1 − p,
from numpy.random import binomial

p = .5
n = 10
N = 20
v = binomial(n,p,N)
print(v)
returns
[9 6 7 4 4 4 3 3 7 5 6 4 6 9 4 5 4 7 6 7]
p = .5
for n in [5,50,500]: print(binomial(n,p,1))
This returns the count of heads after 5 tosses, 50 tosses, and 500 tosses,
Figure 5.4: 100,000 sessions, with 5, 15, 50, and 500 tosses per session.
3, 28, 266
The proportions are the count divided by the total number of tosses in
the experiment. For the above three experiments, the proportions after 5
tosses, 50 tosses, and 500 tosses, are
Now we repeat each experiment 100,000 times and we plot the results in
a histogram.
N = 100000
p = .5
for n in [5,50,500]:
data = binomial(n,p,N)
hist(data,bins=n,edgecolor ='Black')
grid()
show()
The takeaway from these graphs is the two fundamental results of probability:

1. Law of Large Numbers. For large sample size, the proportion of heads is close to the probability of heads.

2. Central Limit Theorem. For large sample size, the shape of the
graph of the proportions or counts is approximately normal. The nor-
mal distribution is studied in §5.4. Another way of saying this is: For
large sample size, the shape of the sample mean histogram is approxi-
mately normal.
The law of large numbers is qualitative and the central limit theorem
is quantitative. While the law of large numbers says one thing is close to
another, it does not say how close. The central limit theorem provides a
numerical measure of closeness, using the normal distribution.
Roll two six-sided dice. Let A be the event that at least one dice is an
even number, and let B be the event that the sum is 6. Then
A = {(2, ∗), (4, ∗), (6, ∗), (∗, 2), (∗, 4), (∗, 6)} .
B = {(1, 5), (2, 4), (3, 3), (4, 2), (5, 1)} .
The intersection of A and B is the event of outcomes in both events:
When we ask for the chance that X lies in the interval [a, b], we are asking for Prob(a <
X < b). If we don’t know anything about X, then we can’t figure out the
probability, and there is nothing we can say. Knowing something about X
means knowing the distribution of X: Where X is more likely to be and
where X is less likely to be. In effect, a random variable is a quantity X
whose probabilities P rob(a < X < b) can be computed.
For example, take the Iris dataset and let X be the petal length of an iris
(Figure 5.6) selected at random. Here the number of samples is N = 150.
from pandas import read_csv

df = read_csv("iris.csv")
petal_length = df["Petal_length"].to_numpy()
m = E(X) = \frac{1}{N} \sum_{k=1}^{N} x_k .

E(X²) = \frac{1}{N} \sum_{k=1}^{N} x_k² .
In general, given any function f (x), we have the mean of f (x1 ), f (x2 ), . . . ,
f (xN ),
E(f (X)) = \frac{1}{N} \sum_{k=1}^{N} f (x_k ). (5.3.1)
If we let

f (x) = 1 if 1 < x < 3, and f (x) = 0 otherwise,

then

E(f (X)) = \frac{1}{N} \sum_{k=1}^{N} f (x_k ) = \frac{#\{samples satisfying 1 < x_k < 3\}}{N}.
But this is the probability that a randomly selected iris has petal length X
between 1 and 3,
P rob(1 < X < 3) = E(f (X)),
To see how the iris petal lengths are distributed, we plot a histogram,
grid()
hist(petal_length,bins=20)
show()
from numpy.random import default_rng

rng = default_rng()

# n = batch_size
def random_batch_mean(n):
    rng.shuffle(petal_length)
    return mean(petal_length[:n])

random_batch_mean(5)
The five petal lengths are selected by first shuffling the petal lengths,
then selecting the first five petal_length[:5]. Now repeat this computation
100,000 times, for batch sizes 1, 5, 15, 50. The resulting histograms are in
Figure 5.8. Notice in the first subplot, the batch is size n = 1, so we recover
the base histogram Figure 5.7. Figure 5.8 is of course another illustration of
the central limit theorem.
N = 100000

for n in [1,5,15,50]:
    Xbar = [ random_batch_mean(n) for _ in range(N) ]
    hist(Xbar,bins=50)
    grid()
    show()
P rob(X = 1) = p, P rob(X = 0) = 1 − p,
Prob(X = x) = pˣ(1 − p)^{1−x} , x = 0, 1.

(Figure 5.9: the bernoulli distribution, with mass 1 − p at 0 and p at 1.)
5.10, when the probability P rob(a < X < b) is given by the green area in
Figure 5.10. Thus
Then the green areas in Figure 5.10 is the difference between two areas, hence
equal
cdfX (b) − cdfX (a).
For the bernoulli distribution in Figure 5.9, the cdf is in Figure 5.12.
Because the bernoulli random variable takes on only the values x = 0, 1,
these are the values where the cdf P rob(X ≤ x) jumps.
This is the population mean. It does not depend on a sampling of the popu-
lation.
For example, suppose the population consists of 100 balls, of which 30
are red, 20 are green, and 50 are blue. The cost of each ball is
X(ball) = $1 if the ball is red, $2 if green, $3 if blue.

Then

pred = Prob(red) = #(red)/#(balls) = 30/100 = .3,
pgreen = Prob(green) = #(green)/#(balls) = 20/100 = .2,
pblue = Prob(blue) = #(blue)/#(balls) = 50/100 = .5.

Then the average cost of a ball equals

E(X) = pred · 1 + pgreen · 2 + pblue · 3 = \frac{30 · 1 + 20 · 2 + 50 · 3}{100} = \frac{x1 + x2 + · · · + x100}{100}.
The variance is
Var(X) = E((X − µ)²) = p1 (x1 − µ)² + p2 (x2 − µ)² + p3 (x3 − µ)² + · · · = \sum_{k=1}^{N} p_k (x_k − µ)².
µ = E(X) = x1 p1 + x2 p2 = 1 · p + 0 · (1 − p) = p,

Var(X) = σ² = E((X − µ)²).

We conclude

E(X²) = µ² + σ² = (E(X))² + Var(X). (5.3.2)
Let X have mean µ and variance σ 2 , and write
Z = \frac{X − µ}{σ}.

Then

E(Z) = \frac{1}{σ} E(X − µ) = \frac{E(X) − µ}{σ} = \frac{µ − µ}{σ} = 0,

and

E(Z²) = \frac{1}{σ²} E((X − µ)²) = \frac{σ²}{σ²} = 1.
We conclude Z has mean zero and variance one.
A random variable is standard if its mean is zero and its variance is one.
The variable Z is the standardization of X. For example, the standardization
of the bernoulli random variable is
\frac{X − p}{\sqrt{p(1 − p)}}.
e^t = 1 + t + \frac{t²}{2!} + \frac{t³}{3!} + . . .

where t is any real number. The number e, Euler's constant (§4.4), is approximately 2.7, as can be seen from

e = e¹ = 1 + 1 + \frac{1}{2!} + \frac{1}{3!} + · · · = 1 + 1 + \frac12 + \frac16 + . . .
Since X has real values, so does tX, so etX is also a random variable.
The moment generating function is the mean of etX ,
M (t) = M_X (t) = E(e^{tX}) = 1 + tE(X) + \frac{t²}{2!} E(X²) + \frac{t³}{3!} E(X³) + . . .

For example, for the smartphone random variable X = 0, 1 with Prob(X = 1) = p, we have X² = X, X³ = X, . . . , so

M (t) = 1 + tE(X) + \frac{t²}{2!} E(X²) + \frac{t³}{3!} E(X³) + · · · = 1 + tp + \frac{t²}{2!} p + \frac{t³}{3!} p + . . .
which equals

M (t) = (1 − p) + pe^t .
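As a quick numerical check that the series above sums to (1 − p) + pe^t, here is a small sketch; the values of p and t are arbitrary.

from math import exp, factorial

p, t = 0.3, 1.7
series = 1 + sum(p * t**k / factorial(k) for k in range(1, 50))
print(series, (1 - p) + p*exp(t))   # the two printed values agree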
In §5.2, we discussed independence of events. Now we do the same for
random variables. Let X and Y be random variables. We say X and Y are
uncorrelated if the expectations multiply,
from sympy import symbols, solve

a,b,c = symbols('a,b,c')
eq1 = a + 2*b + c - 1
eq2 = a - b - (a-c)*(a+b)
solutions = solve([eq1,eq2],a,b)
print(solutions)
Since

M_{X+Y}(t) = \frac{1}{12} \sum_{k=1}^{12} e^{tk} = \frac{1}{12} \frac{e^{13t} − e^t}{e^t − 1},

we obtain

\frac{1}{12} \frac{e^{13t} − e^t}{e^t − 1} = M_Y(t) · \frac{1}{6} \frac{e^{7t} − e^t}{e^t − 1}.

Factoring e^{13t} − e^t = (e^{6t} + 1)(e^{7t} − e^t), we obtain

M_Y(t) = \frac12 (e^{6t} + 1).

This says

Prob(Y = 0) = \frac12 , \qquad Prob(Y = 6) = \frac12 ,
and all other probabilities are zero.
E(X n ) = E(Y n ), n ≥ 1.
X1 , X2 , . . . , Xn x1 , x2 , . . . , xn
mean is
X̄ = \frac{X1 + X2 + · · · + Xn}{n} = \frac{1}{n} \sum_{k=1}^{n} X_k .

Then

E(X̄) = \frac{1}{n} E(X1 + X2 + · · · + Xn ) = \frac{1}{n} (E(X1 ) + E(X2 ) + · · · + E(Xn )) = \frac{1}{n} · nµ = µ.
We conclude the mean of the sample mean equals the population mean.
Now let σ 2 be the common variance of X1 , X2 , . . . , Xn . Since σ 2 =
E(X 2 ) − E(X)2 , we have
E(Xk2 ) = µ2 + σ 2 .
When i ≠ j, by independence, E(Xi Xj ) = E(Xi )E(Xj ) = µ². Hence

E(X̄²) = \frac{1}{n²} \sum_{i,j} E(Xi Xj )
= \frac{1}{n²} \left( \sum_{i≠j} E(Xi Xj ) + \sum_k E(Xk²) \right)
= \frac{1}{n²} \left( n(n − 1)µ² + n(µ² + σ²) \right)
= µ² + \frac{1}{n} σ².
E(X̄) = µ and Var(X̄) = \frac{σ²}{n}.
Sn = X1 + X2 + · · · + Xn .
grid()
z = arange(mu-3*sdev,mu+3*sdev,.01)
Then
X ∼ N (µ, σ) ⟺ Z = \frac{X − µ}{σ} ∼ N (0, 1).
A normal distribution is a standard normal distribution when µ = 0 and
σ = 1.
Sn = X1 + X2 + · · · + Xn
we expect the chance that Z < 0 should equal 1/2. In other words, because
of the symmetry of the curve, we expect to be 50% confident that Z < 0, or
0 is at the 50-th percentile level. So
When
P rob(Z < z) = p,
we say z is the z-score z corresponding to the p-value p. Equivalently, we say
our confidence that Z < z is p, or the percentile of z equals 100p. In Python,
the relation between z and p (Figure 5.16) is specified by
p = Z.cdf(z)
z = Z.ppf(p)
ppf is the percentile point function, and cdf is the cumulative distribution
function.
In Figure 5.17, the red areas are the lower tail p-value P rob(Z < z), the
two-tail p-value P rob(|Z| > z), and the upper tail p-value P rob(Z > z).
and
Prob(|Z| < z) = Prob(−z < Z < z) = Prob(Z < z) − Prob(Z < −z),
and
P rob(Z > z) = 1 − P rob(Z < z).
To go backward, suppose we are given P rob(|Z| < z) = p and we want
to compute the cutoff z. Then Prob(|Z| > z) = 1 − p, so Prob(Z > z) = (1 − p)/2. This implies Prob(Z < z) = (1 + p)/2.
In Python,
In Python,
# p = P(|Z| < z)
z = Z.ppf((1+p)/2)
Now let’s zoom in closer to the graph and mark off 1, 2, 3 on the hor-
izontal axis to obtain specific colored areas as in Figure 5.18. These areas
are governed by the 68-95-99 rule (Table 5.19). Our confidence that |Z| < 1
equals the blue area 0.685, our confidence that |Z| < 2 equals the sum of
the blue plus green areas 0.955, and our confidence that |Z| < 3 equals the
sum of the blue plus green plus red areas 0.997. This is summarized in Table
5.19.
The possibility |Z| > 1 is called a 1-sigma event, |Z| > 2 a 2-sigma event,
and so on. So a 2-sigma event is 95.5% unlikely, or 4.5% likely. An event
is considered statistically significant if it’s a 2-sigma event or more. In other
words, something is significant if it’s unlikely. A six-sigma event |Z| > 6 is
2 in a billion. You want a plane crash to be six-sigma.
Figure 5.18: 68%, 95%, 99% confidence cutoffs for standard normal.
Figure 5.18 is not to scale, because a 1-sigma event should be where the
curve inflects from convex to concave (in the figure this happens closer to
2.7). Moreover, according to Table 5.19, the left-over white area should be
.03% (3 parts in 10,000), which is not what the figure suggests.
a = Z.ppf(.15)
b = Z.ppf(.9)
Here are three examples. In the first example, suppose student grades are
normally distributed with mean 80 and variance 16. This says the average
of all grades is 80, and the SD is 4. If a grade is g, the standardized grade is
z = \frac{g − 80}{4}.
rng = default_rng()
x = 70
mean, sdev = 80, 4
p = Z(mean,sdev).cdf(x)
for n in range(2,200):
    q = 1 - (1-p)**n
    print(n, q)
Here is the code for computing tail probabilities for the sample mean X̄
drawn from a normally distributed population with mean µ and standard
deviation σ. When n = 1, this applies to a single normal random variable.
########################
# P-values
########################

def pvalue(mean,sdev,n,xbar,type):
    Xbar = Z(mean,sdev/sqrt(n))
    if type == "lower-tail": p = Xbar.cdf(xbar)
    elif type == "upper-tail": p = 1 - Xbar.cdf(xbar)
    elif type == "two-tail": p = 2*(1 - Xbar.cdf(abs(xbar)))
    else:
        print("What's the tail type?")
        return
    print("type: ",type)
    print("mean,sdev,n,xbar: ",mean,sdev,n,xbar)
    print("p-value: ",p)
    z = sqrt(n) * (xbar - mean) / sdev
    print("z-score: ",z)
type = "upper-tail"
mean = 80
sdev = 4
n = 1
xbar = 90
pvalue(mean,sdev,n,xbar,type)
M_U (u) = E(e^{uU}) = \frac{1}{\sqrt{1 − 2u}}.

Since

\frac{1}{\sqrt{1 − 2u}} = E(e^{uU}) = \sum_{n=0}^{∞} \frac{u^n}{n!} E(U^n),

comparing coefficients of u^n/n! shows

E(U^n) = (−2)^n n! \binom{−1/2}{n}, n = 0, 1, 2, . . . (5.5.1)

M_U (t) = E(e^{tU}) = \frac{1}{(1 − 2t)^{d/2}}.
Going back to the question posed at the beginning of the section, we have
X and Y independent standard normal and we want
d = 2
u = 1
U(d).cdf(u)
returns 0.39.
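Here U(d) denotes the chi-squared distribution with d degrees of freedom; with scipy, the same number can be reproduced as follows (a sketch).

from scipy.stats import chi2

print(chi2(2).cdf(1))   # about 0.39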
3
Geometrically, P rob(U < 1) is the probability that a normally distributed point is
inside the unit sphere in d-dimensional space.
E(U²) = \sum_{k,ℓ=1}^{d} E(Z_k² Z_ℓ²)
= \sum_{k≠ℓ} E(Z_k²)E(Z_ℓ²) + \sum_{k=1}^{d} E(Z_k⁴)
= d(d − 1) · 1 + d · 3 = d² + 2d.
Because

\frac{1}{(1 − 2t)^{d/2}} \frac{1}{(1 − 2t)^{d′/2}} = \frac{1}{(1 − 2t)^{(d+d′)/2}},
we obtain
X = (X1 , X2 , . . . , Xn )
in Rn .
If X is a random vector in Rd , its mean is the vector
Qij = E(Xi Xj ), 1 ≤ i, j ≤ d.
is never negative.
A random vector X is normal with mean µ and variance Q if for every
vector w, w · X is normal with mean w · µ and variance w · Qw.
Then µ is the mean of X, and Q is the variance X. The random vector
X is standard normal if µ = 0 and Q = I.
From §5.3, we see
Z = (Z1 , Z2 , . . . , Zd )
From this, X and Y are independent when B = 0. Thus, for normal random
vectors, independence and uncorrelatedness are the same.
and
Q+ = U S + U t .
If we set Y = U t X = (Y1 , Y2 , . . . , Yd ), then
so X · v = 0.
It is easy to check Q3 = Q and Q2 is symmetric, so (§2.3) Q+ = Q. Since
X · v = 0,
X · Q+ X = X · QX = X · (X − (v · X)v) = |X|2 .
We conclude
Singular Chi-squared
We use the above to derive the distribution of the sample variance. Let
X1 , X2 , . . . , Xn be a random sample, and let X̄ be the sample mean,
X̄ = \frac{X1 + X2 + · · · + Xn}{n}.

Let S² be the sample variance,

S² = \frac{(X1 − X̄)² + (X2 − X̄)² + · · · + (Xn − X̄)²}{n − 1}. (5.5.3)
Since (n − 1)S 2 is a sum-of-squares similar to (5.5.2), we expect (n − 1)S 2
to be chi-squared. In fact this is so, but the degree is n − 1, not n. We will
show
Now let
X = Z − (Z · v)v = (Z1 − Z̄, Z2 − Z̄, . . . .Zn − Z̄).
E(X ⊗ X) = I − v ⊗ v.
Hence
(n − 1)S 2 = |X|2
is chi-squared with degree n − 1.
Now X and Z · v are uncorrelated, since
Chapter 6

Statistics
6.1 Estimation
In statistics, like any science, we start with a guess or an assumption or
hypothesis, then we take a measurement, then we accept or modify our
guess/assumption based on the result of the measurement. This is common
sense, and applies to everything in life, not just statistics.
For example, suppose you see a sign on the UNH campus saying
(Figure: the hypothesis testing flowchart — starting from a hypothesis H, take a sample and compute a p-value; if p > α, do not reject H; if p < α, reject H.)
Here is a geometric example. The null hypothesis and the alternate hy-
pothesis are
In §2.2, there is code (2.2) returning the angle angle(u,v) between two
vectors. To test this hypothesis, we run the code
6.1. ESTIMATION 285
from numpy.random import randn

N = 784
for _ in range(20):
    u = randn(N)
    v = randn(N)
    print(angle(u,v))
86.27806537791886
87.91436653824776
93.00098725550777
92.73766421951748
90.005139015804
87.99643434444482
89.77813370637857
96.09801014394806
90.07032573539982
89.37679070400239
91.3405728939376
86.49851399221568
87.12755619082597
88.87980905998855
89.80377324818076
91.3006921339982
91.43977096117017
88.52516224405458
86.89606919838387
90.49100744167357
from numpy.random import binomial

N = 784
for _ in range(20):
    u = binomial(1,.5,N)
    v = binomial(1,.5,N)
    print(angle(u,v))
59.43464627897324
59.14345748418916
60.31453922165891
60.38024365702492
59.24709660805488
59.27165957992343
61.21424657806321
60.55756381536082
61.59468919876665
61.33296028237481
60.03925473033243
60.25732069941224
61.77018692842784
60.672901794058326
59.628519516164666
59.41272458020638
58.43172340007064
59.863796136907744
59.45156367988921
59.95835532791699
Here we see strong evidence that H0 is false, as the angles are now close to
60◦ .
The difference between the two scenarios is the distribution. In the first
scenario, we have randn(n): the components are distributed according to a
standard normal. In the second scenario, we have binomial(1,.5,N): the
components are distributed according to a fair coin toss. To see how the
distribution affects things, we bring in the law of large numbers, which is
discussed in §5.3.
Let X1 , X2 , . . . , Xn be a simple random sample from some population,
and let µ be the population mean. Recall this means X1 , X2 , . . . , Xn are
i.i.d. random variables, with µ = E(X). The sample mean is
X̄ = \frac{X1 + X2 + · · · + Xn}{n}.
Then we have the

Law of Large Numbers

For large sample size n, the sample mean X̄ is close to the population mean µ.
We use the law of large numbers to explain the closeness of the vector
angles to specific values.
Assume u = (x1 , x2 , . . . , xn ), and v = (y1 , y2 , . . . , yn ) where all compo-
nents are selected independently of each other, and each is selected according
to the same distribution.
Let U = (X1 , X2 , . . . , Xn ), V = (Y1 , Y2 , . . . , Yn ), be the corresponding
random variables. Then X1 , X2 , . . . , Xn and Y1 , Y2 , . . . , Yn are independent
and identically distributed (i.i.d.), with population mean E(X) = E(Y ).
From this, X1 Y1 , X2 Y2 , . . . , Xn Yn are i.i.d. random variables with popu-
lation mean E(XY ). By the law of large numbers,¹

\frac{X1 Y1 + X2 Y2 + · · · + Xn Yn}{n} ≈ E(XY ),

so

U · V = X1 Y1 + X2 Y2 + · · · + Xn Yn ≈ n E(XY ).

¹ ≈ means the ratio of the two sides approaches 1 as n grows without bound.
cos(θ) = (U · V) / √( (U · U)(V · V) ) ≈ µ² / (µ² + σ²).

We conclude that, for large n, cos(θ) is approximately µ²/(µ² + σ²).
For the standard normal, µ = 0, so cos(θ) ≈ 0 and the angle is close to 90°.
For the fair coin toss, µ = p and σ² = p(1 − p) with p = 1/2, so

cos(θ) ≈ µ²/(µ² + σ²) = p²/(p² + p(1 − p)) = p = 1/2,

and the angle is close to 60°.
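As a quick numerical check of this approximation, the predicted angle arccos(µ²/(µ² + σ²)) can be computed directly. This is a minimal sketch, assuming the usual numpy imports; the function name is not from the text.

from numpy import arccos, degrees

def predicted_angle(mu, sigma2):
    # predicted angle between two long random vectors with i.i.d. components
    return degrees(arccos(mu**2/(mu**2 + sigma2)))

print(predicted_angle(0, 1))      # standard normal components: 90 degrees
print(predicted_angle(.5, .25))   # fair coin toss components: 60 degrees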
6.2 Z-test
Suppose we want to estimate the proportion of American college students
who have a smart phone. Instead of asking every student, we take a sample
and make an estimate based on the sample.
p = .7
n = 25
N = 1000
v = binomial(n,p,N)/n
hist(v,edgecolor ='Black')
show()
|p − X̄| < ϵ,

then

(L, U) = (X̄ − ϵ, X̄ + ϵ)

is a confidence interval.
With the above setup, we have the population proportion p, and the four
sample characteristics

• sample size n,
• sample mean X̄,
• margin of error ϵ,
• confidence level α.

Suppose we do not know p, but we know n and X̄. We say the margin of
error is ϵ, at confidence level α, if

(L, U) = (X̄ − ϵ, X̄ + ϵ),

where ϵ is determined by the score z∗ satisfying

Prob(|Z| > z∗) = α.

Let σ/√n be the standard error. By the central limit theorem,

α ≈ Prob( |X̄ − p| / √(p(1 − p)) > z∗/√n ).
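For example, here is a minimal sketch of this computation for n = 20 and X̄ = .7 at 95% confidence, with Z the standard normal from scipy.stats and α interpreted as the tail probability, as in the code below; the numbers match answer 1 below.

from numpy import sqrt
from scipy.stats import norm as Z

n, xbar, alpha = 20, .7, .05           # tail probability .05, i.e. 95% confidence
zstar = Z.ppf(1 - alpha/2)
epsilon = zstar * sqrt(xbar*(1 - xbar)/n)
L, U = xbar - epsilon, xbar + epsilon  # approximately (.5, .9), so epsilon = .2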
##########################
# Confidence Interval - Z
##########################
def confidence_interval(xbar,sdev,n,alpha,type):
    Xbar = Z(xbar,sdev/sqrt(n))
    if type == "two-tail":
        U = Xbar.ppf(1-alpha/2)
        L = Xbar.ppf(alpha/2)
    elif type == "upper-tail":
        U = Xbar.ppf(1-alpha)
        L = xbar
    elif type == "lower-tail":
        U = xbar
        L = Xbar.ppf(alpha)
    else: print("what's the test type?"); return
    return L, U
alpha = .02
sdev = 228
n = 35
xbar = 95
type = "two-tail"
L, U = confidence_interval(xbar,sdev,n,alpha,type)
Now we can answer the questions posed at the start of the section. Here
are the answers.
1. When n = 20, α = .95, and X̄ = .7, we have [L, U ] = [.5, .9], so ϵ = .2.
• H0 : µ = µ0
• Ha : µ ̸= µ0 or µ < µ0 or µ > µ0 .
• Ha : µ ̸= 0.
Here the significance level is α = .02 and µ0 = 0. To decide whether to
reject H0 or not, compute the standardized test statistic
z = √n · (x̄ − µ0)/σ = 2.465.
Since z is a sample from an approximately normal distribution Z, the p-value
Hypothesis Testing
There are three types of alternative hypotheses Ha :
µ < µ0 , µ > µ0 , µ ̸= µ0 .
In the Python code below, instead of working with the standardized statis-
tic Z, we work directly with X̄, which is normally distributed with mean µ0
and standard deviation σ/√n.
###################
# Hypothesis Z-test
###################

xbar = 122
n = 10
type = "upper-tail"
mu0 = 120
sdev = 2
alpha = .01

# Xbar is normal with mean mu0 and standard deviation sdev/sqrt(n)
Xbar = Z(mu0,sdev/sqrt(n))

if type == "upper-tail": p = 1 - Xbar.cdf(xbar)
elif type == "lower-tail": p = Xbar.cdf(xbar)
else: p = 2*(1 - Xbar.cdf(mu0 + abs(xbar - mu0)))   # two-tail

print("pvalue: ",p)
if p < alpha: print("reject H0")
else: print("do not reject H0")
• Ha : µ > µ0
If a driver’s measured average speed is X̄ = 122, the above code rejects H0 .
This is consistent with the confidence interval cutoff we found above.
There are two types of possible errors we can make. A Type I error is
when H0 is true but we reject it, and a Type II error is when H0 is false
but we fail to reject it.

                     H0 is true           H0 is false
do not reject H0     1 − α                Type II error: β
reject H0            Type I error: α      Power: 1 − β
This calculation was for a two-tail test. When the test is upper-tail or
lower-tail, a similar calculation leads to the code
############################
# Type1 and Type2 errors - Z
############################
def type2_error(type,mu0,mu1,sdev,n,alpha):
print("significance,mu0,mu1, sdev, n: ",
,→ alpha,mu0,mu1,sdev,n)
print("prob of type1 error: ", alpha)
delta = sqrt(n) * (mu0 - mu1) / sdev
if type == "lower-tail":
zstar = Z.ppf(alpha)
type2 = 1 - Z.cdf(delta + zstar)
elif type == "upper-tail":
zstar = Z.ppf(1-alpha)
type2 = Z.cdf(delta + zstar)
elif type == "two-tail":
zstar = Z.ppf(1 - alpha/2)
type2 = Z.cdf(delta + zstar) - Z.cdf(delta - zstar)
else: print("what's the test type?"); return
print("test type: ",type)
print("zstar: ", zstar)
print("delta: ", delta)
print("prob of type2 error: ", type2)
print("power: ", 1 - type2)
mu0 = 120
mu1 = 122
sdev = 2
n = 10
alpha = .01
type = "upper-tail"
type2_error(type,mu0,mu1,sdev,n,alpha)
A type II error is when we do not reject the null hypothesis and yet it’s
false. The power of a test is the probability of rejecting the null hypothesis
when it’s false (Figure 6.3). If the probability of a type II error is β, then
the power is 1 − β.
Going back to the driving speed example, what is the chance that someone
driving at µ1 = 122 is not caught? This is a type II error; using the above
6.3 T -test
Let X1 , X2 , . . . , Xn be a simple random sample from a population. We
repeat the previous section when we know neither the population mean µ,
nor the population variance σ 2 . We only know the sample mean
X1 + X2 + · · · + Xn
X̄ =
n
and the sample variance
S² = ( Σₖ₌₁ⁿ (Xk − X̄)² ) / (n − 1).
Here N is a constant to make the total area under the graph equal to one
(Figure 6.4). In other words, (6.3.1) is the pdf of the t-distribution.
When the interval [a, b] is not small, the correct formula is obtained by
integration, which means dividing [a, b] into many small intervals and sum-
ming. We will not use this density formula directly.
for d in [3,4,7]:
t = arange(-3,3,.01)
plot(t,T(d).pdf(t),label="d = "+str(d))
plot(t,Z.pdf(t),"--",label=r"d = $\infty$")
grid()
legend()
show()
Xk = µ + σZk,

√n · (X̄ − µ)/S = √n · Z̄ / √( (1/(n − 1)) Σₖ₌₁ⁿ (Zk − Z̄)² ) = √n · Z̄ / √( U/(n − 1) ).

Using the last result with d = n − 1, we arrive at the main result in this
section.²

² Geometrically, Prob(T > 1) is the probability that a normally distributed point is
inside the light cone in (d + 1)-dimensional spacetime.
##########################
# Confidence Interval - T
##########################
def confidence_interval(xbar,s,n,alpha,type):
d = n-1
if type == "two-tail":
tstar = T(d).ppf(1-alpha/2)
L = xbar - tstar * s / sqrt(n)
U = xbar + tstar * s / sqrt(n)
elif type == "upper-tail":
tstar = T(d).ppf(1-alpha)
L = xbar
U = xbar + tstar* s / sqrt(n)
elif type == "lower-tail":
tstar = T(d).ppf(alpha)
L = xbar + tstar* s / sqrt(n)
U = xbar
else: print("what's the test type?"); return
print("type: ",type)
return L, U
n = 10
xbar = 120
s = 2
alpha = .01
type = "upper-tail"
print("significance, s, n, xbar: ", alpha,s,n,xbar)
L,U = confidence_interval(xbar,s,n,alpha,type)
print("lower, upper: ", L,U)
Going back to the driving speed example from §6.2, instead of assuming
the population standard deviation is σ = 2, we compute the sample standard
deviation and find it’s S = 2. Recomputing with T (9), instead of Z, we see
(L, U ) = (120, 121.78), so the cutoff now is µ∗ = 121.78, as opposed to
µ∗ = 121.47 there.
• H0 : µ = µ0
• Ha : µ ̸= µ0 .
###################
# Hypothesis T-test
###################
xbar = 122
n = 10
type = "upper-tail"
mu0 = 120
s = 2
alpha = .01
ttest(mu0, s, n, xbar,type)
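The function ttest is not defined in this excerpt; a minimal sketch consistent with the Z-test code, using T and sqrt as above and the globally defined alpha, might be:

def ttest(mu0, s, n, xbar, type):
    t = sqrt(n) * (xbar - mu0) / s          # standardized test statistic
    if type == "upper-tail": p = 1 - T(n-1).cdf(t)
    elif type == "lower-tail": p = T(n-1).cdf(t)
    else: p = 2 * (1 - T(n-1).cdf(abs(t)))  # two-tail
    print("pvalue: ", p)
    if p < alpha: print("reject H0")
    else: print("do not reject H0")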
• H0 : µ = µ0
• Ha : µ > µ0
########################
# Type1 and Type2 errors
########################

def type2_error(type,mu0,mu1,s,n,alpha):
    d = n-1
    print("significance,mu0,mu1,n: ", alpha,mu0,mu1,n)
    print("prob of type1 error: ", alpha)
    delta = sqrt(n) * (mu0 - mu1) / s
    if type == "lower-tail":
        tstar = T(d).ppf(alpha)
        type2 = 1 - T(d).cdf(delta + tstar)
    elif type == "upper-tail":
        tstar = T(d).ppf(1-alpha)
        type2 = T(d).cdf(delta + tstar)
    elif type == "two-tail":
        tstar = T(d).ppf(1 - alpha/2)
        type2 = T(d).cdf(delta + tstar) - T(d).cdf(delta - tstar)
    else: print("what's the test type?"); return
    print("prob of type2 error: ", type2)
    print("power: ", 1 - type2)

type2_error(type,mu0,mu1,s,n,alpha)
6.4 Two Means

Similarly,

Var(S_Y²) = 2σ_Y⁴ / (m − 1).
Before, with a single mean, we used the result that

T = Z / √(U/n)

is T-distributed with degree n when

1. U is chi-squared with degree n,
2. Z is N(0, 1),
3. Z and U are independent.
We apply this same result this time, but we proceed more carefully. To
begin, X̄ and Ȳ are normal with means µ_X and µ_Y and variances σ²/n and
σ²/m respectively. Hence

( (X̄ − Ȳ) − (µ_X − µ_Y) ) / √( σ²/n + σ²/m ) ∼ N(0, 1).
Next,

(n − 1)S_X²/σ²  and  (m − 1)S_Y²/σ²

are chi-squared of degrees n − 1 and m − 1 respectively, so their sum

(n − 1)S_X²/σ² + (m − 1)S_Y²/σ²
##################################
# Confidence Interval - Two means
##################################
import numpy as np
from scipy.stats import t
T = t
def confidence_interval(xbar,ybar,varx,vary,nx,ny,alpha):
tstar = T.ppf(1-alpha/2, nx+ny-2)
varp = (nx-1)*varx+(ny-1)*vary
n = nx+ny-2
varp = varp/n
s_p = np.sqrt(varp)
h = 1/nx + 1/ny
L = xbar - ybar - tstar * s_p * np.sqrt(h)
U = xbar - ybar + tstar * s_p * np.sqrt(h)
return L, U
Now we turn to the question of what to do when the variances σ_X² and
σ_Y² are not equal. In this case, by independence, the population variance of
X̄ − Ȳ is the sum of the population variances of X̄ and Ȳ, which is

σ_B² = σ_X²/n + σ_Y²/m.    (6.4.1)
Hence

( (X̄ − Ȳ) − (µ_X − µ_Y) ) / √( σ_X²/n + σ_Y²/m ) ∼ N(0, 1).
We want to replace the population variance (6.4.1) by the sample variance

S_B² = S_X²/n + S_Y²/m.
Because SB2 is not a straight sum, but is a more complicated linear combina-
tion of variances, SB2 is not chi-squared.
Welch’s approximation is to assume it is chi-squared with degree r, and
to figure out the best r for this. More exactly, we seek the best choice of r
so that
r S_B² / σ_B² = r S_B² / ( σ_X²/n + σ_Y²/m )

is close to chi-squared with degree r. By construction, we multiplied S_B² by
r/σ_B² so that its mean equals r,

E( r S_B² / σ_B² ) = (r/σ_B²) E(S_B²) = r.
Since the variance of a chi-squared with degree r is 2r, we compute the
variance and set it equal to 2r,
2r = Var( r S_B² / σ_B² ) = ( r²/(σ_B²)² ) Var(S_B²).    (6.4.2)
By independence,
Var(S_B²) = Var(S_X²/n) + Var(S_Y²/m) = (1/n²) Var(S_X²) + (1/m²) Var(S_Y²).
But (n − 1)S_X²/σ_X² and (m − 1)S_Y²/σ_Y² are chi-squared, so

Var(S_B²) = 2σ_X⁴/(n²(n − 1)) + 2σ_Y⁴/(m²(m − 1)).    (6.4.3)
Combining (6.4.2) and (6.4.3), we arrive at Welch's approximation for the
degrees of freedom,

r = σ_B⁴ / ( σ_X⁴/(n²(n − 1)) + σ_Y⁴/(m²(m − 1)) )
  = ( σ_X²/n + σ_Y²/m )² / ( σ_X⁴/(n²(n − 1)) + σ_Y⁴/(m²(m − 1)) ).
In practice, this expression for r is never an integer, so one rounds it to the
closest integer, and the population variances σ_X² and σ_Y² are replaced by the
sample variances S_X² and S_Y².
We summarize the results.
Welch’s T-statistic
If we have independent simple random samples, then the statistic
X̄ − Ȳ − (µX − µY )
T = r
2
SX S2
+ Y
n m
is approximately distributed according to a T -distribution with degrees
of freedom 2 2
SX SY2
+
n m
r= 4 .
SX SY4
+
n2 (n − 1) m2 (m − 1)
6.5 Variances

Let X1, X2, . . . , Xn be a normally distributed simple random sample with
mean 0 and variance 1.
Then we know

U = X1² + X2² + · · · + Xn²

is chi-squared with degree n. The chi-squared score χ²_{α,n} is defined by

Prob(U ≤ χ²_{α,n}) = α.

For a normally distributed simple random sample with variance σ², we also
know, from §5.5,

(n − 1)S²/σ² ∼ χ²_{n−1}.
Let

a = χ²_{α/2,n−1},    b = χ²_{1−α/2,n−1}.    (6.5.1)

By definition of the score χ²_{α,n}, we have

Prob( a ≤ (n − 1)S²/σ² ≤ b ) = 1 − α.

Solving for σ², this is the same as

Prob( (n − 1)S²/b ≤ σ² ≤ (n − 1)S²/a ) = 1 − α.
We conclude

Confidence Interval

A (1 − α)100% confidence interval for the population variance σ² is

(n − 1)S²/b ≤ σ² ≤ (n − 1)S²/a

where a and b are the χ²_{n−1} scores at significance α/2 and 1 − α/2.
##############################
# Confidence Interval - Chi2
##############################
def confidence_interval(s2,n,alpha):
a = chi2.ppf(alpha/2,n-1)
b = chi2.ppf(1-alpha/2,n-1)
L = (n-1)*s2/b
U = (n-1)*s2/a
return L, U
L, U = 1.99, 14.0
For hypothesis testing, given the hypotheses

• H0 : σ = σ0
• Ha : σ ≠ σ0,

one compares the p-value of the standardized test statistic (n − 1)S²/σ0² to
the required significance score, whether two-tail, upper-tail, or lower-tail.
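A minimal sketch of this test, using chi2 from scipy.stats as in the code above; the function name is an illustration, not the book's code.

def chi2_test(s2, sigma0, n, alpha, type="two-tail"):
    u = (n-1)*s2/sigma0**2                   # standardized test statistic
    lower, upper = chi2.cdf(u, n-1), 1 - chi2.cdf(u, n-1)
    if type == "upper-tail": p = upper
    elif type == "lower-tail": p = lower
    else: p = 2*min(lower, upper)            # two-tail
    print("reject H0" if p < alpha else "do not reject H0")
    return p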
Now we consider two populations with two variances. For this, we intro-
duce the F-distribution. If U1, U2 are independent chi-squared random vari-
ables with degrees n1 and n2, then

F = (U1/n1) / (U2/n2)

is F-distributed with degrees n1 and n2.
alpha = .05
dfn, dfd = n-1, m-1   # degrees of freedom for samples of sizes n and m
a = f.ppf(alpha/2,dfn,dfd)
b = f.ppf(1-alpha/2,dfn,dfd)
Then

Prob( a_α < (S_X²/S_Y²)(σ_Y²/σ_X²) < b_α ) = 1 − α,

which may be rewritten

Prob( (1/b_α)(S_X²/S_Y²) < σ_X²/σ_Y² < (1/a_α)(S_X²/S_Y²) ) = 1 − α.
This leads to the confidence interval endpoints

L = (1/√b_α)(S_X/S_Y),    U = (1/√a_α)(S_X/S_Y).
L = 0.31389215230779993, U = 1.6621265193149342
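Assembled into a function, here is a sketch of this confidence interval, with f from scipy.stats and the degrees of freedom n − 1 and m − 1 assumed as above; the function name is illustrative.

from numpy import sqrt
from scipy.stats import f

def f_confidence_interval(varx, vary, n, m, alpha):
    a = f.ppf(alpha/2, n-1, m-1)
    b = f.ppf(1-alpha/2, n-1, m-1)
    ratio = sqrt(varx/vary)                # S_X / S_Y
    return ratio/sqrt(b), ratio/sqrt(a)    # (L, U) for sigma_X/sigma_Y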
6.7 Chi-Squared Tests

Xk = 0, 1, 2, . . . , d − 1.

Z = √n · (X̄ − p) / √(p(1 − p))    (6.7.1)

is approximately standard normal for large enough sample size, and conse-
quently U = Z² is approximately chi-squared with degree one. Pearson's test
generalizes this from d = 2 categories to d > 2 categories.
Given a category j, let #j denote the number of times Xk = j, 1 ≤ k ≤ n.
Then #j is the count that Xk = j, and p̂j = #j/n is the observed frequency,
in n samples. Let pj be the expected frequency. Then

√n (p̂j − pj) = √n (#j/n − pj),    0 ≤ j < d,

are approximately normal for large n. Based on this, Pearson [19] showed

Goodness-Of-Fit Test

Let p̂ = (p̂1, p̂2, . . . , p̂d) be the observed frequencies and p =
(p1, p2, . . . , pd) the expected frequencies. Then, for large n, the statistic

u = n Σⱼ (p̂j − pj)²/pj    (6.7.2)

is approximately chi-squared with degree d − 1.
def goodness_of_fit(observed,expected):
# assume len(observed) == len(expected)
d = len(observed)
n = sum(observed)
u = sum([ (observed[i] - expected[i])**2/expected[i] for i in range(d) ])
deg = d-1
pvalue = 1 - U(deg).cdf(u)
return pvalue
Suppose a die is rolled n = 120 times, and the observed counts are O1, O2,
. . . , O6. Notice

O1 + O2 + O3 + O4 + O5 + O6 = 120.

d = 6
alpha = .05
ustar = U(d-1).ppf(1-alpha)

Since this returns u∗ = 11.07 and u > u∗, we conclude the die is not fair.
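For example, a call with hypothetical counts (not the counts from the table above) looks like this; for a fair die and n = 120 rolls, each expected count is 120/6 = 20.

observed = [10, 15, 20, 25, 30, 20]   # hypothetical counts summing to 120
expected = [20]*6
print(goodness_of_fit(observed, expected))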
We now derive the goodness-of-fit test. For each category 0 ≤ j < d, let

X̃kj = 1/√pj  if Xk = j,    and    X̃kj = 0  if Xk ≠ j.

Then E(X̃kj) = √pj, and

E(X̃ki X̃kj) = 1  if i = j,    and    E(X̃ki X̃kj) = 0  if i ≠ j.

If

µ = (√p1, √p2, . . . , √pd)    and    X̃k = (X̃k1, X̃k2, . . . , X̃kd),

then

E(X̃k) = µ,    E(X̃k ⊗ X̃k) = I.
From this,
V ar(X̃k ) = E(X̃k ⊗ X̃k ) − E(X̃k ) ⊗ E(X̃k ) = I − µ ⊗ µ.
From (5.3.8), we conclude the random vector
Z = √n ( (1/n) Σₖ₌₁ⁿ X̃k − µ )
has mean zero and variance I − µ ⊗ µ. By the central limit theorem, Z is
approximately normal for large n.
Since
|µ|² = (√p0)² + (√p1)² + · · · + (√p_{d−1})² = p0 + p1 + · · · + p_{d−1} = 1,
µ is a unit vector. By the singular chi-squared result in §5.5, |Z|2 is approx-
imately chi-squared with degree d − 1. Using
Zj = √n ( p̂j/√pj − √pj ),

we write |Z|² out,

|Z|² = Σⱼ₌₁ᵈ Zj² = n Σⱼ₌₁ᵈ ( p̂j/√pj − √pj )² = n Σⱼ₌₁ᵈ (p̂j − pj)²/pj,
obtaining (6.7.2).
Is a person’s gender correlated with their party affiliation, or are the two
variables independent? To answer this, we use the
#{k : Xk = i, Yk = j}
r̂ij = , i = 1, 2, . . . , d, j = 1, 2, . . . , e.
n
If X1, X2, . . . , Xn and Y1, Y2, . . . , Yn are independent, then, for large
n, the statistic

n Σᵢⱼ (r̂ij − p̂i q̂j)² / (p̂i q̂j),    (6.7.3)

summed over i = 1, . . . , d and j = 1, . . . , e, is approximately chi-squared
with degree (d − 1)(e − 1). Here p̂i and q̂j are the observed marginal
frequencies of Xk and Yk. Expanding the square, (6.7.3) equals

−n + n Σᵢⱼ (observed)²/expected,

with observed = r̂ij and expected = p̂i q̂j.
The code

def chi2_independence(table):
    observed = table
    n = observed.sum()
    d = len(observed)
    e = len(observed.T)
    expected = outer(observed.sum(axis=1),observed.sum(axis=0))/n   # counts under independence
    u = ((observed - expected)**2/expected).sum()
    pvalue = 1 - chi2.cdf(u,(d-1)*(e-1))
    return pvalue
table = array([[68,56,32],[52,72,20]])
chi2_independence(table)
returns a p-value of 0.0401, so, at the 5% significance level, the effects are
not independent.
Calculus
7.1 Calculus
In this section, we focus on single-variable calculus, and in §7.3, we review
multi-variable calculus. Recall the slope of a line y = mx + b equals m.
Let y = f (x) be a function as in Figure 7.1, and let a be a fixed point. The
derivative of f (x) at the point a is the slope of the line tangent to the graph
of f (x) at a. Then the derivative at a point a is a number f ′ (a) possibly
depending on a.
[Figure 7.1: the graph of y = f(x), with the point a marked on the x-axis.]
Since the tangent line at a passes through the point (a, f (a)), and its
Using these properties, we determine the formula for f ′ (a). Suppose the
derivative is bounded between two extremes m and L at every point x in an
interval [a, b], say
m ≤ f ′ (x) ≤ L, a ≤ x ≤ b.
Then by A, the derivative of h(x) = f (x)−mx at x equals h′ (x) = f ′ (x)−m.
By assumption, h′ (x) ≥ 0 on [a, b], so, by B, h(b) ≥ h(a). Since h(a) =
f (a) − ma and h(b) = f (b) − mb, this leads to
(f(b) − f(a)) / (b − a) ≥ m.

Repeating this same argument with f(x) − Lx, and using C, leads to

(f(b) − f(a)) / (b − a) ≤ L.

We have shown

m ≤ (f(b) − f(a)) / (b − a) ≤ L.    (7.1.1)
Derivative Definition
f′(a) = lim_{x→a} (f(x) − f(a)) / (x − a).    (7.1.2)
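Numerically, the limit (7.1.2) can be approximated by difference quotients with smaller and smaller x − a. This is a minimal sketch; f = sin and a = π/4 are arbitrary choices, not from the text.

from numpy import sin, pi

f, a = sin, pi/4
for h in [.1, .01, .001]:
    print((f(a + h) - f(a))/h)    # approaches cos(pi/4) = 0.7071...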
[Diagram: the composition x → u = f(x) → y = g(u).]
Using the chain rule, the power rule can be derived for any rational number n,
positive or negative. For example, since (√x)² = x, we can write x = f(g(x))
with f(x) = x² and g(x) = √x. By the chain rule,

1 = (x)′ = f′(g(x)) g′(x) = 2 g(x) g′(x) = 2 √x (√x)′.

Solving for (√x)′ yields

(√x)′ = 1/(2√x),
which is (7.1.3) with n = 1/2. In this generality, the variable x is restricted
to positive values only.
For example,

(xⁿ)″ = (n xⁿ⁻¹)′ = n(n − 1) xⁿ⁻² = (n!/(n − 2)!) xⁿ⁻² = P(n, 2) xⁿ⁻²
We use the above to derive the Taylor series. Suppose f (x) is given by a
finite or infinite sum
f (x) = c0 + c1 x + c2 x2 + c3 x3 + . . . (7.1.4)
Then f (0) = c0 . Taking derivatives, by the sum, product, and power rules,
f′(x) = c1 + 2c2 x + 3c3 x² + 4c4 x³ + . . .
f″(x) = 2c2 + 3 · 2 c3 x + 4 · 3 c4 x² + . . .                  (7.1.5)
f‴(x) = 3 · 2 c3 + 4 · 3 · 2 c4 x + . . .
f⁽⁴⁾(x) = 4 · 3 · 2 c4 + . . .

Inserting x = 0, we obtain f′(0) = c1, f″(0) = 2c2, f‴(0) = 3 · 2 c3, f⁽⁴⁾(0) =
4 · 3 · 2 c4. This can be encapsulated by f⁽ⁿ⁾(0) = n! cn, n = 0, 1, 2, 3, 4, . . . ,
which is best written

f⁽ⁿ⁾(0)/n! = cn,    n ≥ 0.
Going back to (7.1.4), we derived the Taylor series centered at zero,

f(x) = Σₙ₌₀^∞ ( f⁽ⁿ⁾(0)/n! ) xⁿ.

More generally, let a be a fixed point. Then any function f(x) can be
expanded in powers (x − a)ⁿ, and we have the Taylor series centered at a,

f(x) = Σₙ₌₀^∞ ( f⁽ⁿ⁾(a)/n! ) (x − a)ⁿ.
We review the derivative of sine and cosine. Recall the angle θ in radians
is the length of the subtended arc (in red) in Figure 7.3. Following the figure,
with P = (x, y), we have x = cos θ, y = sin θ. By the figure, the arclength θ
is greater than the diagonal, which in turn is greater than y. Moreover θ is
less than 1 − x + y, so
y < θ < 1 − x + y.
[Figure 7.3: the unit circle, with P = (x, y), the arc θ, and the segment 1 − x.]

which implies

1 − (1 − cos θ)/θ < (sin θ)/θ < 1.    (7.1.8)
From (1.5.5),

sin(θ + ϕ) = sin θ cos ϕ + cos θ sin ϕ,

so

lim_{ϕ→0} (sin(θ + ϕ) − sin θ)/ϕ = lim_{ϕ→0} ( sin θ · (cos ϕ − 1)/ϕ + cos θ · (sin ϕ)/ϕ ) = cos θ.
Thus the derivative of sine is cosine,
(sin θ)′ = cos θ.
Similarly,
(cos θ)′ = − sin θ.
Using the chain rule, we compute the derivative of the inverse arcsin x of
sin θ. Since

θ = arcsin x  ⟺  x = sin θ,

we have

1 = x′ = (sin θ)′ = θ′ · cos θ = θ′ · √(1 − x²),

or

(arcsin x)′ = θ′ = 1/√(1 − x²).
We use this to compute the derivative of the arcsine law (3.2.13). With
x = √λ/2, by the chain rule,

( (2/π) arcsin(√λ/2) )′ = (2/π) · 1/√(1 − x²) · x′
                        = (2/π) · 1/√(1 − λ/4) · 1/(4√λ) = 1/( π √(λ(4 − λ)) ).    (7.1.9)
This shows the derivative of the arcsine law is the density in Figure 3.11.
For the parabola in Figure 7.4, y = x2 so, by the power rule, y ′ = 2x.
Since y ′ > 0 when x > 0 and y ′ < 0 when x < 0, this agrees with the
increase/decrease of the graph. In particular, the minimum of the parabola
occurs when y ′ = 0.
[Figure 7.4: the parabola y = x². Figure 7.5: a graph with the points −1, −c, c, 1
(c = 1/√3) marked on the x-axis.]
(ex )(n) = ex , n ≥ 0,
writing the Taylor series centered at zero for the exponential function yields
the exponential series (4.4.10).
For example, the function in Figure 7.6 is convex near x = a, and the
graph lies above its tangent line at a.
so g(x) is convex, so g(x) lies above its tangent line at x = a. Since g(a) = 0
and g ′ (a) = 0, the tangent line is 0, and we conclude g(x) ≥ 0, which is the
left half of (7.1.10). Similarly, if f ′′ (x) ≤ L, then pL (x) − f (x) is convex,
leading to the right half of (7.1.10).
Figure 7.6: Tangent parabolas pm(x) (green), pL(x) (red), L > m > 0.
t = (f′(b) − f′(a)) / (b − a)  ⟹  L ≥ t ≥ m,

which implies

t² − (m + L)t + mL = (t − m)(t − L) ≤ 0,    a ≤ t ≤ b.
This yields
y = log x ⇐⇒ x = ey .
log(ey ) = y, elog x = x.
From here, we see the logarithm is defined only for x > 0 and is strictly
increasing (Figure 7.7).
Since e0 = 1,
log 1 = 0.
Since e∞ = ∞ (Figure 4.10),
log ∞ = ∞.
log 0 = −∞.
We also see log x is negative when 0 < x < 1, and positive when x > 1.
Moreover, by the law of exponents,
ab = eb log a .
Then, by definition,
log(ab ) = b log a,
and c c
ab = eb log a = ebc log a = abc .
x = ey =⇒ 1 = x′ = (ey )′ = ey y ′ = xy ′ ,
so
1
y = log x =⇒ y′ = .
x
Derivative of the Logarithm
1
y = log x =⇒ y′ = . (7.1.12)
x
For gradient descent, we need the relation between a convex function and
its dual. If f (x) is convex, its convex dual is
Below we see g(p) is also convex. This may not always exist, but we will
work with cases where no problems arise.
Let q > 0. The simplest example is

f(x) = (q/2) x²  ⟹  g(p) = p²/(2q).
For each p, the point x where px − f (x) equals the maximum g(p) — the
maximizer — depends on p. If we denote the maximizer by x = x(p), then
Hence
g(p) = px − f (x) ⇐⇒ p = f ′ (x).
Also, by the chain rule, differentiating g(p) = p x(p) − f(x(p)) with respect to p,

g′(p) = x + (p − f′(x)) x′(p) = x.

Thus f′(x) is the inverse function of g′(p). Since g(p) = px − f(x) is the
same as f(x) = px − g(p), we have

f′(g′(p)) = p.

Differentiating again with respect to p,

f″(g′(p)) g″(p) = 1.
We derived
Notice the derivatives of σ and its inverse σ −1 are reciprocals. This result
holds in general, and is called the inverse function theorem.
The partition function is
Then Z ′ (z) = σ(z) and Z ′′ (z) = σ ′ (z) = σ(1 − σ) > 0. This shows Z(z) is
strictly convex.
The maximum
max(pz − Z(z))
z
which simplifies to I(p). Thus the convex dual of the partition function is the
information. The information is studied further in §7.2, and the multinomial
extension is in §7.6.
This makes sense because the binomial coefficient C(n, k) is defined for any
real number n (4.3.12), (4.3.13).
In summation notation,

(a + x)ⁿ = Σₖ₌₀^∞ C(n, k) aⁿ⁻ᵏ xᵏ.    (7.1.21)
The only difference between (4.3.7) and (7.1.21) is the upper limit of the
summation, which is set to infinity. When n is a whole number, by (4.3.10),
we have
n
= 0, for k > n,
k
so (7.1.21) is a sum of n + 1 terms, and equals (4.3.7) exactly. When n is not
a whole number, the sum (7.1.21) is an infinite sum.
Actually, in §5.5, we will need the special case a = 1, which we write in
slightly different notation,

(1 + x)^p = Σₙ₌₀^∞ C(p, n) xⁿ.    (7.1.22)
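A quick check of (7.1.22) for a non-integer exponent; binom from scipy.special accepts real arguments, and the values p = .5, x = .3 are arbitrary choices.

from scipy.special import binom

p, x = .5, .3
partial = sum(binom(p, n) * x**n for n in range(20))
print(partial, (1 + x)**p)    # both approximately 1.14017...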
f (x) = (a + x)n .
so

f⁽ᵏ⁾(0)/k! = ( n(n − 1)(n − 2) · · · (n − k + 1)/k! ) aⁿ⁻ᵏ = C(n, k) aⁿ⁻ᵏ.

Writing out the Taylor series,

(a + x)ⁿ = Σₖ₌₀^∞ ( f⁽ᵏ⁾(0)/k! ) xᵏ = Σₖ₌₀^∞ C(n, k) aⁿ⁻ᵏ xᵏ,
or
(1 − t) log a + t log b ≤ log ((1 − t)a + tb) .
Since the inequality sign is reversed, this shows
for 0 ≤ t ≤ 1.
Since x > 0, y ′′ < 0, which shows log x is in fact strictly concave everywhere
it is defined.
Since log x is strictly concave,
1
log = − log x
x
is strictly convex.
Thus H ′ (p) = 0 when p = 1/2, H ′ (p) > 0 on p < 1/2, and H ′ (p) < 0 on
p > 1/2. Since this implies H(p) is increasing on p < 1/2, and decreasing on
p > 1/2, p = 1/2 is a global maximum of the graph.
Notice as p increases, 1 − p decreases, so (1 − p)/p decreases. Since log is
increasing, as p increases, H ′ (p) decreases. Thus H(p) is concave.
Taking the second derivative, by the chain rule and the quotient rule,

H″(p) = ( log((1 − p)/p) )′ = −1/( p(1 − p) ),
A crucial aspect of Figure 7.8 is its limiting values at the edges p = 0 and
p = 1,
H(0) = lim H(p) and H(1) = lim H(p).
p→0 p→1
= − lim 2p log(2p)
p→0
Then I ′ (p) is the inverse of the derivative σ(x) (7.1.16) of the dual Z(x)
(7.1.19) of I(p), as it should be (7.1.14).
Toss a coin n times, and let #n (p) be the number of outcomes where
the proportion of heads is p. Then we have the approximation
In more detail, using (4.1.6), one can derive the asymptotic equality

#n(p) ≈ (1/√(2πn)) · (1/√(p(1 − p))) · e^{nH(p)},    for n large.    (7.2.5)
Figure 7.9 is returned by the code below, which compares both sides of
the asymptotic equality (7.2.5) for n = 10.
n = 10
def H(p): return - p*log(p) - (1-p)*log(1-p)
p = arange(.01,.99,.01)
grid()
plot(p, comb(n, n*p), label="binomial coefficient")
plot(p, exp(n*H(p))/sqrt(2*n*pi*p*(1-p)), label="entropy approximation")
title("number of tosses " + "$n=" + str(n) +"$", usetex=True)
legend()
show()
Then
I(q, q) = 0,
which agrees with our understanding that I(p, q) measures the difference in
information between p and q. Because I(p, q) is not symmetric in p, q, we
think of q as a base or reference probability, against which we compare p.
Equivalently, instead of measuring relative information, we can measure
the relative entropy,
H(p, q) = −I(p, q).
Since
I(p, q) = −H(p) − p log(q) − (1 − p) log(1 − q)
and H(0) = 0 = H(1), I(p, q) is well-defined for p = 0, and p = 1,
d²I(p, q)/dp² = −H″(p) = 1/( p(1 − p) ),

d²I(p, q)/dq² = p/q² + (1 − p)/(1 − q)²,
In more detail, using (4.1.6), one can derive the asymptotic equality

Pn(p, q) ≈ (1/√(2πn)) · (1/√(p(1 − p))) · e^{nH(p,q)},    for n large.    (7.2.7)
The law of large numbers (§6.1) states that the proportion of heads equals
approximately q for large n. Therefore, when p ̸= q, we expect the probability
that the proportion of heads equal p should become successively smaller as
n get larger, and in fact vanish when n = ∞. Since H(p, q) < 0 when p ̸= q,
(7.2.7) implies this is so. Thus (7.2.7) may be viewed as a quantitative
strengthening of the law of large numbers, in the setting of coin-tossing.
The partial derivative in the k-th direction is just the one-dimensional deriva-
tive considering xk as the independent variable, with all other xj ’s constants.
Below we exhibit the multi-variable chain rule in two ways. The first in-
terpretation is geometric, and involves motion in time and directional deriva-
tives. This interpretation is relevant to gradient descent, §8.3.
The second interpretation is combinatorial, and involves repeated com-
positions of functions. This interpretation is relevant to computing gradients
in networks, specifically backpropagation §7.4, §8.2.
These two interpretations work together when training neural networks,
§8.4.
For the first interpretation of the chain rule, suppose the components x1 ,
x2 , . . . , xd are functions of a single variable t (usually time), so we have
x1 = x1 (t), x2 = x2 (t), ..., xd = xd (t).
Inserting these into f (x1 , x2 , . . . , xd ), we obtain a function
f (t) = f (x1 (t), x2 (t), . . . , xd (t))
of a single variable t. Then we have
The Rd -valued function x(t) = (x1 (t), x2 (t), . . . , xd (t)) represents a curve
or path in Rd , and the vector
df
= ∇f (x(t)) · x′ (t).
dt
d
f (x + tv) = ∇f (x) · v. (7.3.3)
dt t=0
r = f (x) = sin x,
1
s = g(x) = ,
1 + e−x
t = h(x) = x2 ,
u = r + s + t,
y = k(u) = cos u.
[Diagram: the graph with input x, intermediate nodes r, s, t, their sum u, and
output y = k(u).]
We obtain
dy
= −0.90 ∗ 0.71 − 0.90 ∗ 0.22 − 0.90 ∗ 1.57 = −2.25.
dx
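A finite-difference check of this value, taking x = π/4, which matches the numbers above; this is a sketch assuming the usual numpy imports.

from numpy import sin, cos, exp, pi

def y(x): return cos(sin(x) + 1/(1 + exp(-x)) + x**2)

x, h = pi/4, 1e-6
print((y(x + h) - y(x - h))/(2*h))    # approximately -2.25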
The chain rule is discussed in further detail in §7.4.
∇f (x∗ ) = 0.
In this case,

∂f/∂xi = (1/2) Σⱼ₌₁ᵈ qij xj + (1/2) Σⱼ₌₁ᵈ qji xj − bi = (Qx − b)i.
Quadratic Convexity
Let Q be a symmetric matrix and b a vector. The quadratic function
1
f (x) = x · Qx − b · x
2
has gradient
∇f (x) = Qx − b. (7.3.6)
Moreover f (x) is convex everywhere exactly when Q is a covariance
matrix, Q ≥ 0.
By (2.2.2),
where θ is the angle between the vector v and the gradient vector ∇f (x).
Since −1 ≤ cos θ ≤ 1, we conclude
dy dy dr
r = f (x), y = g(r) =⇒ = · .
dx dr dx
In this section, we work out the implications of the chain rule on repeated
compositions of functions.
Suppose
r = f (x) = sin x,
1
s = g(r) = ,
1 + e−r
y = h(s) = s2 .
[Figure 7.12: the chain x → r = f(x) → s = g(r) → y = h(s).]
The chain in Figure 7.12 has four nodes and four edges. The outputs at
the nodes are x, r, s, y. Start with output x = π/4. Evaluating the functions
in order,
Notice these values are evaluated in the forward direction: x then r then s
then y. This is forward propagation.
Now we evaluate the derivatives of the output y with respect to x, r, s,
dy/dx,    dy/dr,    dy/ds.
With the above values for x, r, s, we have
dy/ds = 2s = 2 ∗ 0.670 = 1.340.
Since g is the logistic function, by (7.1.17),
From this,
dy/dr = (dy/ds) · (ds/dr) = 1.340 ∗ g′(r) = 1.340 ∗ 0.221 = 0.296.
r = x2 ,
s = r 2 = x4 ,
y = s2 = x 8 .
This is the same function h(x) = x2 composed with itself three times. With
x = 5, we have
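The functions f, g, h and their derivatives df, dg, dh used below are not defined in this excerpt; a minimal sketch consistent with the first chain (r = sin x, s the logistic of r, y = s²) is:

from numpy import sin, cos, exp

def f(x): return sin(x)
def df(x): return cos(x)

def g(r): return 1/(1 + exp(-r))
def dg(r): return g(r)*(1 - g(r))

def h(s): return s**2
def dh(s): return 2*s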
func_chain = [f,g,h]
der_chain = [df,dg,dh]
Then we evaluate the output vector x = (x, r, s, y), leading to the first
version of forward propagation,
def forward_prop(x_in,func_chain):
x = [x_in]
while func_chain:
f = func_chain.pop(0) # first func
x_out = f(x_in)
x.append(x_out) # insert at end
x_in = x_out
return x
# dy/dy = 1
delta_out = 1
def backward_prop(delta_out,x,der_chain):
delta = [delta_out]
while der_chain:
# discard last output
x.pop(-1)
df = der_chain.pop(-1) # last der
der = df(x[-1])
# chain rule -- multiply by previous der
der = der * delta[0]
delta.insert(0,der) # insert at start
return delta
delta = backward_prop(delta_out,x,der_chain)
d = 3
func_chain, der_chain = [h]*d, [dh]*d
x_in, delta_out = 5, 1
x = forward_prop(x_in,func_chain)
delta = backward_prop(delta_out,x,der_chain)
Now we work with the network in Figure 7.13, using the multi-variable
a = f (x, y) = x + y,
b = g(y, z) = max(y, z),
J = h(a, b) = ab.
The composite function is
J = (x + y) max(y, z),
x +
a
y J
∗
b
z max
Here there are three input nodes x, y, z, and three hidden nodes +, max,
∗. Starting with inputs (x, y, z) = (1, 2, 0), and plugging in, we obtain node
outputs
(x, y, z, a, b, J) = (1, 2, 0, 3, 2, 6)
(Figure 7.15). This is forward propagation.
Then

∂a/∂x = 1,    ∂a/∂y = 1.
[Figure 7.14: the (y, z) plane split by the diagonal y = z; for y < z, max(y, z) = z
and ∂g/∂y = 0, ∂g/∂z = 1; for y > z, max(y, z) = y and ∂g/∂y = 1, ∂g/∂z = 0.]
Let

1(y > z) = 1 if y > z,    and    1(y > z) = 0 if y < z.

By Figure 7.14, since y = 2 and z = 0,

∂b/∂y = 1(y > z) = 1,    ∂b/∂z = 1(z > y) = 0.
By the chain rule,

∂J/∂x = (∂J/∂a)(∂a/∂x) = 2 ∗ 1 = 2,
∂J/∂y = (∂J/∂a)(∂a/∂y) + (∂J/∂b)(∂b/∂y) = 2 ∗ 1 + 3 ∗ 1 = 5,
∂J/∂z = (∂J/∂b)(∂b/∂z) = 3 ∗ 0 = 0.

Hence we have

( ∂J/∂x, ∂J/∂y, ∂J/∂z, ∂J/∂a, ∂J/∂b, ∂J/∂J ) = (2, 5, 0, 2, 3, 1).
The outputs (blue) and the derivatives (red) are displayed in Figure 7.15.
[Figure 7.15: the graph of Figure 7.13 with node outputs (1, 2, 0, 3, 2, 6) in blue
and derivatives (2, 5, 0, 2, 3, 1) in red.]
d = 6
w = [ [None]*d for _ in range(d) ]
w[0][3] = w[1][3] = w[1][4] = w[2][4] = w[3][5] = w[4][5] = 1
More generally, in a weighted directed graph, the weights wij are numeric
scalars. In this case, for each node j, let

x⁻j = (w1j x1, w2j x2, . . . , wdj xd).    (7.4.1)

Then x⁻j is the list of node signals, each weighted accordingly. If (i, j) is
not an edge, then wij = 0, so xi does not appear in x⁻j: In other words, x⁻j
lists only the signals incoming at node j. For example,

x5 = f5(x⁻5) = f5(w15 x1, w75 x7, w25 x2).
and
f4 (x, y) = x + y, f5 (y, z) = max(y, z), J(a, b) = ab.
Note there is nothing incoming at the input nodes, so there is no point
defining f1 , f2 , f3 .
activate = [None]*d
def incoming(x,w,j):
return [ outgoing(x,w,i) * w[i][j] if w[i][j] else 0 for i
,→ in range(d) ]
def outgoing(x,w,j):
if x[j] != None: return x[j]
else: return activate[j](*incoming(x,w,j))
Let xin be the outgoing vector over the input nodes. If there are m input
nodes, and d nodes in total, then the length of xin is m, and the length of x
is d. In the example above, xin = (x, y, z).
We assume the nodes are ordered so that the initial portion of x equals
xin ,
m = len(x_in)
x[:m] = x_in
def forward_prop(x_in,w):
d = len(w)
x = [None]*d
m = len(x_in)
x[:m] = x_in
for j in range(m,d): x[j] = outgoing(x,w,j)
return x
For this code to work, we assume there are no cycles in the graph: All
backward paths end at inputs.
Let xout be the output nodes. For Figure 7.13, this means xout = (J).
Then by forward propagation, J is also a function of all node outputs. For
Figure 7.13, this means J is a function of x, y, z, a, b.
Therefore, at each node i, we have the derivatives

δi = (∂J/∂xi)(xi),    i = 1, 2, . . . , d.
Then δ = (δ1 , δ2 , . . . , δd ) is the gradient vector. We first compute the deriva-
tives of J with respect to the output nodes xout , and we assume these deriva-
tives are assembled into a vector δout .
In Figure 7.13, there is one output node J, and
δJ = ∂J/∂J = 1.
Hence δout = (1).
We assume the nodes are ordered so that the terminal portion of x equals
xout and the terminal portion of δ equals δout ,
d = len(x)
m = len(x_out)
x[d-m:] = x_out
delta[d-m:] = delta_out
∂J/∂xi = Σ_{i→j} (∂J/∂xj) · (∂xj/∂xi) = Σ_{i→j} (∂J/∂xj) · (∂fj/∂xi) · wij,

so

δi = Σ_{i→j} δj · gij · wij.

The code is
The code is
def derivative(x,delta,g,i):
if delta[i] != None: return delta[i]
else:
return sum([ derivative(x,delta,g,j) *
,→ g[i][j](*incoming(x,g,j)) * w[i][j] if g[i][j] != None
,→ else 0 for j in range(d) ])
def backward_prop(x,delta_out,g):
d = len(g)
delta = [None]*d
m = len(delta_out)
delta[d-m:] = delta_out
for i in range(d-m): delta[i] = derivative(x,delta,g,i)
return delta
7.5 Convex Functions

[Figure: two points x0 and x1 and the convex combination (1 − t)x0 + tx1 on
the segment joining them.]
Here we write the sublevel set of level 1. One can have sublevel sets corre-
sponding to any level c, f (x) ≤ c. For example, in Figure 7.16, the (blue)
interior of the square, together with the square itself, is a sublevel set. Sim-
ilarly, the interior of the ellipse, together with the ellipse itself, is a sublevel
set. The interiors of the ellipsoids, together with the ellipsoids themselves,
in Figure 7.22 are sublevel sets. Note we always consider the level set to be
part of the sublevel set.
The level set f (x) = 1 is the boundary of the sublevel set f (x) ≤ 1. Thus
the square and the ellipse in Figure 7.16 are boundaries of their respective
sublevel sets, and the covariance ellipsoid x · Qx = 1 is the boundary of the
sublevel set x · Qx ≤ 1.
A scalar function f(x) is convex if¹ for any two points x0 and x1 in Rd,
This says the line segment joining any two points (x0 , f (x0 )) and (x1 , f (x1 ))
on the graph of f (x) lies above the graph of f (x). For example, in two
dimensions, the function f (x) = f (x1 , x2 ) = x21 + x22 /4 is convex because its
graph is the paraboloid in Figure 7.19.
More generally, given points x1 , x2 , . . . , xN , a linear combination
t1 x1 + t2 x2 + · · · + tN xN
¹ We only consider convex functions that are continuous.
Figure 7.19: Convex: The line segment lies above the graph.
Quadratic is Convex
If Q is a nonnegative matrix and b is a vector, then
1
f (x) = x · Qx − b · x
2
This was derived in the previous section, but here we present a more
geometric proof.
To derive this result, let x0 and x1 be any points, and let v = x1 − x0 .
Then x0 + tv = (1 − t)x0 + tx1 and x1 = x0 + v. Let g0 = Qx0 − b. By (7.3.5),
f(x0 + tv) = f(x0) + tv · (Qx0 − b) + (t²/2) v · Qv = f(x0) + tv · g0 + (t²/2) v · Qv.    (7.5.3)

Inserting t = 1 in (7.5.3), we have f(x1) = f(x0) + v · g0 + v · Qv/2. Since
t² ≤ t for 0 ≤ t ≤ 1 and v · Qv ≥ 0, by (7.5.3),

f((1 − t)x0 + tx1) = f(x0 + tv)
    ≤ f(x0) + tv · g0 + (t/2) v · Qv
    = (1 − t)f(x0) + tf(x0) + tv · g0 + (t/2) v · Qv
    = (1 − t)f(x0) + tf(x1).

When Q is invertible, then v · Qv > 0, and we have strict convexity.
[Figure: points x1, . . . , x7 in the plane and their convex hull.]
from numpy.random import default_rng
from scipy.spatial import ConvexHull

rng = default_rng()
points = rng.random((30,2))    # random points in the plane (not shown above)
hull = ConvexHull(points)
for facet in hull.simplices:
    plot(points[facet, 0], points[facet, 1], 'r--')
grid()
show()
If f (x) is a function, its graph is the set of points (x, y) in Rd+1 satisfying
y = f (x), and its epigraph is the set of points (x, y) satisfying y ≥ f (x).
If f (x) is defined on Rd , its sublevel sets are in Rd , and its epigraph is in
Rd+1 . Then f (x) is a convex function exactly when its epigraph is a convex
set (Figure 7.19). From convex functions, there are other ways to get convex
sets:
E: f (x) ≤ 1
is a convex set.
H: n · (x − x0 ) = 0. (7.5.4)
n · (x − x0 ) < 0 n · (x − x0 ) = 0 n · (x − x0 ) > 0.
[Figure: a hyperplane through x0 with normal vector n; on either side of it,
n · (x − x0) has opposite signs.]
Separating Hyperplane
Let E be a convex set and let x∗ be a point not in E. Then there is a
hyperplane separating x∗ and E: For some x0 in E and nonzero n,
[Figure: a convex set E, a point x∗ outside E, the closest point x0 in E, and
the separating hyperplane with normal n = x∗ − x0.]
Expanding, we have
0 ≤ 2(x0 − x∗ ) · v + t|v|2 , 0 ≤ t ≤ 1.
Since this is true for small positive t, sending t → 0, results in v·(x0 −x∗ ) ≥ 0.
Setting n = x∗ − x0 , we obtain
x in E =⇒ (x − x0 ) · n ≤ 0. (7.5.6)
where the minimum is taken over all vectors x. A minimizer is the location of
the bottom of the graph of the function. For example, the parabola (Figure
7.4) and the relative information (Figure 7.10) both have global minimizers.
We say a function f (x) is strictly convex if g(t) = f (a + tv) is strictly
convex for every point a and direction v. This is the same as saying the
inequality (7.5.1) is strict for 0 < t < 1.
We say a function f (x) is proper if the sublevel set f (x) ≤ c is bounded
for every level c. This is same as saying f (x) rises to +∞ as |x| → ∞.
Remember, if x is a scalar, |x| = ±x is the absolute value, and if x =
(x1, x2, . . . , xd) is in Rd,

|x| = √( x1² + x2² + · · · + xd² ).
For example, the functions in Figure 7.4 is proper and strictly convex,
while the function in Figure 7.5 is proper but neither convex nor strictly
convex.
Intuitively, if f (x) goes up to +∞ when x is far away, then its graph must
have a minimizer at some point x∗ .
To see this, suppose f (x) is not proper. Then there would be a sequence
x1 , x2 , . . . in the row space of A satisfying |xn | → ∞ while f (xn ) remains
bounded, say f (xn ) ≤ c for some constant c. Let x′n = xn /|xn |. Then x′n are
unit vectors in the row space of A, hence x′n is a bounded sequence. From
§A.2, this implies x′n subconverges to some x∗ , necessarily a unit vector in
the row space of A.
By the triangle inequality (2.2.4),

|Ax′n| = (1/|xn|)|Axn| ≤ (1/|xn|)( |Axn − b| + |b| ) ≤ (1/|xn|)( c + |b| ).
Properness of Residual

When the N × d matrix A has rank d, the residual

f(x) = |Ax − b|²

is proper on Rd.
To see this, pick any point a. Then, by properness, the sublevel set S
given by f (x) ≤ f (a) is bounded. By continuity of f (x), there is a minimizer
x∗ (see §A.2). Since for all x outside the sublevel set, we have f (x) > f (a),
x∗ is a global minimizer.
As a consequence,
Let a be any point, and v any direction, and let g(t) = f (a + tv). Then
g ′ (0) = ∇f (a) · v.
∂²f/∂xi∂xj,    1 ≤ i, j ≤ d,
(d²/dt²) f(x + tv) |_{t=0} = v · Qv.    (7.5.14)

This implies

(m/2)|x − a|² ≤ f(x) − f(a) − ∇f(a) · (x − a) ≤ (L/2)|x − a|².    (7.5.15)

(m/2)|x − x∗|² ≤ f(x) − f(x∗) ≤ (L/2)|x − x∗|².    (7.5.16)
Here the maximum is over all vectors x, and p = (p1 , p2 , . . . , pd ), the dual
variable, also has d features. We will work in situations where a maximizer
exists in (7.5.17).
Let Q > 0 be a positive matrix. The simplest example is

f(x) = (1/2) x · Qx  ⟹  g(p) = (1/2) p · Q⁻¹p.

This is established by the identity

(1/2)(p − Qx) · Q⁻¹(p − Qx) = (1/2) p · Q⁻¹p − p · x + (1/2) x · Qx.    (7.5.18)
To see this, note the left side is greater or equal to zero. Since the left side
equals zero iff p = Qx, we are led to (7.5.17). The next simplest example is
the partition function, see below.
If x is a maximizer in (7.5.17), then the derivative is zero,
p = ∇f (x) ⇐⇒ x = ∇g(p).
This yields
Using this, and writing out (7.5.15) for g(p) instead of f (x) (we skip the
details) yields
(p − q) · (x − a) ≥ (mL/(m + L)) |x − a|² + (1/(m + L)) |p − q|².    (7.5.20)
This is derived by using (7.5.19), the details are in [2]. This result is used
in gradient descent.
Because

q2 = e^{z2}/(e^{z1} + e^{z2}) = 1/(1 + e^{−(z2−z1)}) = σ(z2 − z1).
Because of this, the softmax function is the multinomial analog of the logistic
function, and we use the same symbol σ to denote both functions.
z = array([z1,z2,z3])
q = softmax(z)
or
σ(z) = σ(z + a1).
To guarantee uniqueness of a global minimum of Z, we have to restrict
attention to the subspace of vectors z = (z1 , z2 , . . . , zd ) orthogonal to 1, the
vectors satisfying
z1 + z2 + · · · + zd = 0.
Now suppose z is orthogonal to 1. Since the exponential function is
convex,

e^{Z}/d = (1/d) Σₖ₌₁ᵈ e^{zk} ≥ exp( (1/d) Σₖ₌₁ᵈ zk ) = e⁰ = 1.
This establishes
Define
log p = (log p1 , log p2 , . . . , log pd ).
Then the inverse of p = σ(z) is
z = Z1 + log p. (7.6.4)
The function
I(p) = p · log p = Σₖ₌₁ᵈ pk log pk    (7.6.5)
This implies

p · z = Σₖ₌₁ᵈ pk zk = Σₖ₌₁ᵈ pk log(e^{zk})
      ≤ log( Σₖ₌₁ᵈ pk e^{zk} ) = log( Σₖ₌₁ᵈ e^{zk + log pk} ) = Z(z + log p).
For all z,

Z(z) = max_p ( p · z − I(p) ).

Since

D²I(p) = diag( 1/p1, 1/p2, . . . , 1/pd ),
we see I(p) is strictly convex, and H(p) is strictly concave.
In Python, the entropy is
p = array([p_1,p_2,p_3])
entropy(p)
Now

∂²Z/∂zj∂zk = ∂σj/∂zk = σj − σjσk  if j = k,    and    −σjσk  if j ≠ k.

Hence we have

v · Qv = Σₖ₌₁ᵈ qk vk² − (v · q)² = Σₖ₌₁ᵈ qk (vk − v̄)²,
zj ≤ c, j = 1, 2, . . . , d.
−zj ≤ (d − 1)c, j = 1, 2, . . . , d.
We conclude
We have shown
and
I(p, q) = I(p) − p · log q. (7.6.12)
Similarly, the relative entropy is
p = array([p1,p2,p3])
q = array([q1,q2,q3])
entropy(p,q)
returns the relative information, not the relative entropy. See below for more
on this terminology confusion.
This identity is the direct analog of (7.5.18). The identity (7.5.18) arises
naturally in least squares, or linear regression. Similarly, (7.6.14) arises in
logistic regression.
The cross-information is
d
X
Icross (p, q) = − pk log qk ,
k=1
Since I(p, σ(z)) and Icross (p, σ(z)) differ by the constant I(p), we also have
This will be useful in logistic regression. Table 7.25 summarizes the situation.
H = −I Information Entropy
Absolute I(p) H(p)
Cross Icross (p, q) Hcross (p, q)
Relative I(p, q) H(p, q)
Curvature Convex Concave
Error I(p, q) with q = σ(z)
Table 7.25: The third row is the sum of the first and second rows, and the
H column is the negative of the I column.
Here is the multinomial analog of (7.2.6). Suppose a dice has d faces, and
suppose the probability of rolling the k-th face in a single roll is qk . Then
q = (q1 , q2 , . . . , qd ) is a probability distribution. Let p = (p1 , p2 , . . . , pd ) be
another probability distribution. Roll the dice n times, and let Pn (p, q) be
the probability that the proportion of times the k-th face is rolled equals pk ,
k = 1, 2, . . . , d. Then we have the approximation
Machine Learning
8.1 Overview
This first section is an overview of the chapter. Here is a summary of the
structure of neural networks.
• In a directed graph, there are input nodes, output nodes, and hidden
nodes.
• A network is a weighed directed graph (§4.2) where the nodes are neu-
rons (§7.4).
Because wij = 0 if (i, j) is not an edge, the nonzero entries in the incoming
list at node j correspond to the edges incoming at node j.
A neural network is a network where every activation function is restricted
to be a function of the sum of the entries of the incoming list.
For example, all the networks in this section are neural networks, but the
network in Figure 7.13 is not a neural network.
Let

x⁻j = Σ_{i→j} wij xi    (8.2.1)

be the sum of the incoming list at node j. Then, in a neural network, the
outgoing signal at node j is

xj = fj(x⁻j) = fj( Σ_{i→j} wij xi ).    (8.2.2)
x = (x1 , x2 , . . . , xd ),
In a network, in §7.4, x⁻j was a list or vector; in a neural network, x⁻j is a
scalar.
scalar.
Let W be the weight matrix. If the network has d nodes, the activation
vector is
f = (f1 , f2 , . . . , fd ).
Then a neural network may be written in vector-matrix form
x = f (W t x).
However, this representation is more useful when the network has structure,
for example in a dense shallow layer (8.2.12).
y = f (w1 x1 + w2 x2 + · · · + wd xd ) = f (w · x)
Neural Network
Every neural network is a combination of perceptrons.
[Figure: a perceptron with inputs x1, x2, x3, weights w1, w2, w3, activation f,
and output y.]
y = f (w1 x1 + w2 x2 + · · · + wd xd + w0 ) = f (w · x + w0 ).
P rob(H | x) = σ(w · x + w0 ).
Perceptrons gained wide exposure after Minsky and Papert’s famous 1969
book [15], from which Figure 8.2 is taken.
[Figure 8.3: a network with input nodes x1, x2, neurons f3, f4, f5, f6, output
nodes x5, x6, and weights w13, w14, w23, w24, w35, w36, w45, w46.]
Let xin and xout be the outgoing vectors corresponding to the input and
output nodes. Then the network in Figure 8.3 has outgoing vectors
Here are the incoming and outgoing signals at each of the four neurons f3 ,
f4 , f5 , f6 .
[Figure: an edge (i, j) with weight wij joining neurons fi and fj, carrying the
signal xi to node j.]
def incoming(x,w,j):
return sum([ outgoing(x,w,i)*w[i][j] if w[i][j] != None
,→ else 0 for i in range(d) ])
def outgoing(x,w,j):
if x[j] != None: return x[j]
else: return activate[j](incoming(x,w,j))
We assume the nodes are ordered so that the initial portion of x equals
xin ,
m = len(x_in)
x[:m] = x_in
def forward_prop(x_in,w):
d = len(w)
x = [None]*d
m = len(x_in)
x[:m] = x_in
for j in range(m,d): x[j] = outgoing(x,w,j)
return x
activate = [None]*d
activate[2] = relu
activate[3] = id
activate[4] = sigmoid
activate[5] = tanh
x_in = [1.5,2.5]
x = forward_prop(x_in,w)
Let
y1 = 0.427, y2 = −0.288, y = (y1 , y2 )
be targets, and let J(xout , y) be a function of the outputs xout of the output
nodes, and the targets y. For example, for Figure 8.3, xout = (x5 , x6 ) and we
may take J to be the mean square error
J(xout, y) = (1/2)(x5 − y1)² + (1/2)(x6 − y2)²,    (8.2.6)
The code for this J is
def J(x_out,y):
m = len(y)
return sum([ (x_out[i] - y[i])**2/2 for i in range(m) ])
y0 = [0.132,-0.954]
y = [0.427, -0.288]
J(x_out,y0), J(x_out,y)
∂J/∂x⁻j,    f′j(x⁻j),    ∂J/∂xj.    (8.2.7)
These are the downstream derivative, local derivative, and upstream derivative
at node j. (The terminology reflects the fact that derivatives are computed
backward.)
[Figure: at node i, the downstream derivative ∂J/∂x⁻i, the local derivative f′i,
and the upstream derivative ∂J/∂xi.]
From (8.2.2),

∂xj/∂x⁻j = f′j(x⁻j).    (8.2.8)

By the chain rule and (8.2.8), the key relation between these derivatives
is

∂J/∂x⁻i = (∂J/∂xi) · f′i(x⁻i),    (8.2.9)

or

downstream = upstream × local.
def local(x,w,i):
return der_dict[activate[i]](incoming(x,w,i))
Let

δi = ∂J/∂x⁻i,    i = 1, 2, . . . , d.
d = len(x)
m = len(x_out)
x[d-m:] = x_out
delta[d-m:] = delta_out
δ5 = ∂J/∂x⁻5,    δ6 = ∂J/∂x⁻6,    δout = (δ5, δ6)

∂J/∂x5 = (x5 − y1) = −0.294.
σ′(x⁻5) = σ(x⁻5)(1 − σ(x⁻5)) = x5(1 − x5) = 0.114.
Similarly,
δ6 = −0.059.
We conclude
δout = (−0.0337, −0.059).
The code for this is
def delta_out(x_out,y,w):
d =len(w)
m = len(y)
return [ (x_out[i] - y[i]) * local(x,w,d-m+i) for i in
,→ range(m) ]
delta_out(x_out,y,w)
∂J/∂x⁻i = Σ_{i→j} (∂J/∂x⁻j) · (∂x⁻j/∂xi) · (∂xi/∂x⁻i)
        = ( Σ_{i→j} (∂J/∂x⁻j) · wij ) · f′i(x⁻i).
The code is
def downstream(x,delta,w,i):
if delta[i] != None: return delta[i]
else:
upstream = sum([ downstream(x,delta,w,j) * w[i][j] if
,→ w[i][j] != None else 0 for j in range(d) ])
return upstream * local(x,w,i)
def backward_prop(x,y,w):
d = len(w)
delta = [None]*d
m = len(y)
x_out = x[d-m:]
delta[d-m:] = delta_out(x_out,y,w)
for i in range(d-m): delta[i] = downstream(x,delta,w,i)
return delta
delta = backward_prop(x,y,w)
returns
∂x⁻j/∂wij = xi,
∂wij
We have shown
∂J/∂wij = xi · δj.    (8.2.11)
[Figure: a dense shallow layer with input nodes x1, x2, x3, x4, each connected
to three neurons with activation f and outputs z1, z2, z3.]
Our convention is to let wij denote the weight on the edge (i, j). With this
convention, the formulas (8.2.1), (8.2.2) reduce to the matrix multiplication
formulas
z − = W t x, z = f (W t x). (8.2.12)
Thus a dense shallow network can be thought of as a vector-valued percep-
tron. This allows for vectorized forward and back propagation.
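For example, a single dense layer can be evaluated in one line. This is a minimal sketch assuming numpy and a sigmoid activation; the function names are illustrative, not from the text.

from numpy import exp

def sigmoid(z): return 1/(1 + exp(-z))

def forward_layer(x, W, f=sigmoid):
    # vectorized forward propagation through one dense layer: z = f(W^t x)
    return f(W.T @ x)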
8.3 Gradient Descent

This goal is so general that any concrete insight one provides toward it is
widely useful in many settings. The setting we have in
mind is f = J, where J is the error from the previous section.
Usually f (w) is a measure of cost or lack of compatibility. Because of
this, f (w) is called the loss function or cost function.
A neural network is a black box with inputs x and outputs y, depending on
unknown weights w. To train the network is to select weights w in response
to training data (x, y). The optimal weights w∗ are selected as minimizers
of a loss function f (w) measuring the error between predicted outputs and
actual outputs, corresponding to given training inputs.
From §7.5, if the loss function f (w) is continuous and proper, there is
a global minimizer w∗ . If f (w) is in addition strictly convex, w∗ is unique.
Moreover, if the gradient of the loss function is g = ∇f (w), then w∗ is a
critical point, g ∗ = ∇f (w∗ ) = 0.
Inserting a = w and b = w⁺,
Solving for w⁺,

w⁺ ≈ w − g(w)/g′(w).

Since the global minimizer w∗ satisfies f′(w∗) = 0, we insert g(w) = f′(w)
in the above approximation,

w⁺ ≈ w − f′(w)/f″(w).
wn+1 = wn − f′(wn)/f″(wn),    n = 1, 2, . . .
def newton(loss,grad,curv,w,num_iter):
g = grad(w)
c = curv(w)
trajectory = array([[w],[loss(w)]])
for _ in range(num_iter):
w -= g/c
trajectory = column_stack([trajectory,[w,loss(w)]])
g = grad(w)
c = curv(w)
if allclose(g,0): break
return trajectory
u0 = -2.72204813
w0 = 2.45269774
num_iter = 20
trajectory = newton(loss,grad,curv,w0,num_iter)
def plot_descent(a,b,loss,curv,delta,trajectory):
w = arange(a,b,delta)
plot(w,loss(w),color='red',linewidth=1)
plot(w,curv(w),"--",color='blue',linewidth=1)
plot(*trajectory,color='green',linewidth=1)
scatter(*trajectory,s=10)
ylim(-15,10)
delta = .01
plot_descent(u0,w0,loss,curv,delta,trajectory)
f(w) = f(w1, w2, . . . ).

w1⁺ = w1 − t ∂f/∂w1,
w2⁺ = w2 − t ∂f/∂w2,
. . .

In other words,
In practice, the learning rate is selected by trial and error. Which learning
rate does the theory recommend?
Given an initial point w0 , the sublevel set at w0 (see §7.5) consists of all
points w where f (w) ≤ f (w0 ). Only the part of the sublevel set that is
connected to w0 counts.
Figure 8.10: Double well cost function and sublevel sets at w0 and at w1, with
the points u0, a, b, c, w1, w0 marked on the w-axis.
In Figure 8.10, the sublevel set at w0 is the interval [u0 , w0 ]. The sublevel
set at w1 is the interval [b, w1 ]. Notice we do not include any points to the
left of b in the sublevel set at w1 , because points to the left of b are separated
from w1 by the gap at the point b.
Suppose the second derivative D2 f (w) is never greater than a constant L
on the sublevel set. This means
To see this, fix w and let S be the sublevel set {w′ : f (w′ ) ≤ f (w)}. Since
the gradient pushes f down, for t > 0 small, w+ stays in S. Insert x = w+
and a = w into the right half of (7.5.15) and simplify. This leads to
f(w⁺) ≤ f(w) − t|∇f(w)|² + (t²L/2)|∇f(w)|².

Since tL ≤ 1 when 0 ≤ t ≤ 1/L, we have t²L ≤ t. This derives (8.3.3).
The curvature of the loss function and the learning rate are inversely
proportional. Where the curvature of the graph of f (w) is large, the learning
rate 1/L is small, and gradient descent proceeds in small time steps.
f(wn+1) ≤ f(wn) − (1/(2L)) |∇f(wn)|².

Since f(wn) and f(wn+1) both converge to f(w∗), and ∇f(wn) converges to
∇f(w∗), we conclude

f(w∗) ≤ f(w∗) − (1/(2L)) |∇f(w∗)|².
For example, let f(w) = w⁴ − 6w² + 2w (Figures 8.9, 8.10, 8.11). Then

f′(w) = 4w³ − 12w + 2,    f″(w) = 12w² − 12.

Thus the inflection points (where f″(w) = 0) are ±1 and, in Figure 8.10, the
critical points are a, b, c.
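In code, a sketch of the loss, grad, curv functions used by newton and gd above (not shown in this excerpt), following the formulas just given:

def loss(w): return w**4 - 6*w**2 + 2*w
def grad(w): return 4*w**3 - 12*w + 2
def curv(w): return 12*w**2 - 12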
Let u0 and w0 be the points satisfying f (w) = 5 as in Figure 8.11.
Then u0 = −2.72204813 and w0 = 2.45269774, so f ′′ (u0 ) = 76.914552 and
f ′′ (w0 ) = 60.188. Thus we may choose L = 76.914552. With this L, the
short-step gradient descent starting at w0 is guaranteed to converge to one
of the three critical points. In fact, the sequence converges to the right-most
critical point c (Figure 8.10).
This exposes a flaw in basic gradient descent. Gradient descent may con-
verge to a local minimizer, and miss the global minimizer. In §8.8, modified
gradient descent will address some of these shortcomings.
def gd(loss,grad,w,learning_rate,num_iter):
g = grad(w)
trajectory = array([[w],[loss(w)]])
for _ in range(num_iter):
w -= learning_rate * g
trajectory = column_stack([trajectory,[w,loss(w)]])
g = grad(w)
if allclose(g,0): break
return trajectory
u0 = -2.72204813
w0 = 2.45269774
L = 76.914552
learning_rate = 1/L
num_iter = 100
trajectory = gd(loss,grad,w0,learning_rate,num_iter)
ylim(-15,10)
delta = .01
plot_descent(u0,w0,loss,curv,delta,trajectory)
xin → xout .
Given inputs xin and target outputs y, we seek to modify the weight matrix
W so that the input-output map is
xin → y.
This is training.
Let (§8.2)

x⁻ = (x⁻1, x⁻2, . . . , x⁻d),    x = (x1, x2, . . . , xd),    δ = (δ1, δ2, . . . , δd)
Let wij be the weight along an edge (i, j), let xi be the outgoing signal
from the i-th node, and let δj be the downstream derivative of the
output J with respect to the j-th node. Then the derivative ∂J/∂wij
equals xi δj . In this partial sense,
∇W J = x ⊗ δ. (8.4.2)
def update_weights(x,delta,w,learning_rate):
d = len(w)
for i in range(d):
for j in range(d):
if w[i][j]:
w[i][j] = w[i][j] - learning_rate*x[i]*delta[j]
def train_nn(x_in,y,w0,learning_rate,n_iter):
trajectory = []
cost = 1
# build a local copy
w = [ row[:] for row in w0 ]
d = len(w0)
for _ in range(n_iter):
x = forward_prop(x_in,w)
delta = backward_prop(x,y,w)
update_weights(x,delta,w,learning_rate)
m = len(y)
x_out = x[d-m:]
cost = J(x_out,y)
trajectory.append(cost)
if allclose(0,cost): break
return w, trajectory
Here n_iter is the maximum number of iterations allowed, and the iterations
stop if the cost J is close to zero.
The cost or error function J enters the code only through the function
delta_out, which is part of the function backward_prop.
Let W0 be the weight matrix (8.2.4). Then
x_in = [1.5,2.5]
learning_rate = .01
y0 = 0.4265356063
y1 = -0.2876478137
y = [y0,y1]
n_iter = 10000
w, trajectory = train_nn(x_in,y,w0,learning_rate,n_iter)
returns the cost trajectory, which can be plotted using the code
for lr in [.01,.02,.03,.035]:
w, trajectory = train_nn(x_in,y,w0,lr,n_iter)
n = len(trajectory)
label = str(n) + ", " + str(lr)
plot(range(n),trajectory,label=label)
grid()
legend()
show()
Figure 8.12: Cost trajectory and number of iterations as learning rate varies.
∇W J(x, y, W ) = (x ⊗ x)W − x ⊗ y.
[Figure: a linear-regression network with inputs x1, x2, x3, x4, output z = W^t x,
target y, and mean-square error J = |z − y|²/2.]
∇W J(W ) = XX t W − XY t ,
[Figure: a logistic-regression network with inputs x1, x2, x3, x4, output z = W^t x,
probabilities q = σ(z), target p, and error J = I(p, q).]
and, by (7.6.9),
W1 = 0 (8.5.7)
Σⱼ₌₁ᵈ vj² qj − ( Σⱼ₌₁ᵈ vj qj )² = Σⱼ₌₁ᵈ (vj − v̄)² qj.
Now we turn to properness of J(W ). There are two cases: Strict proba-
bilities, and one-hot encoded probabilities. Here is the first case.
The convex hull is discussed in §7.5, see Figures 7.20 and 7.21. In the
next section, we show a simple example of how this works.
Note if the span of Ki ∩ Kj is full-rank, then the span of the dataset itself
is full-rank, hence, from a previous result in this section, J(W ) is strictly
convex.
Suppose J(W ) ≤ c. Then J(x, p, W ) ≤ c for every sample x and corre-
sponding target p.
Let x be a sample in Ki , and let z = W t x. Then the corresponding target
p satisfies pi = 1, and I(p) = 0. If j ≠ i, by (7.6.14),

|zi − zj| ≤ c,    for x in Ki ∩ Kj.

Summing over j ≠ i, and using z1 + z2 + · · · + zd = 0,

d|zi| = |(d − 1)zi + zi| = | Σ_{j≠i} (zi − zj) | ≤ (d − 1)C.

This implies

|z| = ( Σᵢ₌₁ᵈ zi² )^{1/2} ≤ C√d
∇z J(x, p, W ) = 0
can always be solved, so there is at least one minimum for each J(x, p, W ).
Here properness ultimately depends on properness of the partition function
Z(z).
In one-hot encoded regression, ∇W J(x, p, W ) = 0 can never be solved,
because q = σ(z) is always strict and p is one-hot encoded, see (8.5.6). Nev-
ertheless, properness of J(W ) is achievable, hence ∇W J(W ) = 0 is solvable,
if there is sufficient overlap between the sample categories.
In linear regression, the minimizer is expressible in terms of the regression
equation, and thus can be solved in principle using the pseudo-inverse. In
practice, when the dimensions are high, gradient descent may be the only
option for linear regression. In logistic regression, the minimizer cannot be
found in closed form, so we have no choice but to apply gradient descent,
even for low dimensions.
y = w · x = w1 x1 + w2 x2 + · · · + wd xd .
For example, Figure 8.16 is a dataset and Figure 8.15 is a plot of popu-
lation versus employed, with the mean and the regression line shown.
Linear Regression
We work out the regression equation in the plane, when both features x
and y are scalar. In this case, w = (m, b) and
X is the N × 2 matrix whose k-th row is (xk, 1), and Y is the N × 1 column
vector whose k-th entry is yk.
In the scalar case, the regression equation (8.6.3) is 2 × 2. To simplify
the computation of X t X, let
x̄ = (1/N) Σₖ₌₁ᴺ xk,    ȳ = (1/N) Σₖ₌₁ᴺ yk.
Then (x̄, ȳ) is the mean of the dataset. Also, let x and y denote the vectors
(x1 , x2 , . . . , xN ) and (y1 , y1 , . . . , yN ), and let, as in §1.6,
cov(x, y) = (1/N) Σₖ₌₁ᴺ (xk − x̄)(yk − ȳ) = (1/N) x · y − x̄ȳ.
Then cov(x, y) is the covariance between x and y,
t x · x x̄ t x·y
X X=N , X Y =N .
x̄ 1 ȳ
With w = (m, b), the regression equation reduces to
(x · x)m + x̄b = x · y,
mx̄ + b = ȳ.
The second equation says the regression line passes through the mean (x̄, ȳ).
Multiplying the second equation by x̄ and subtracting the result from the
first equation cancels the b and leads to
cov(x, x)m = (x · x − x̄2 )m = (x · y − x̄ȳ) = cov(x, y).
This derives
The regression line in two dimensions passes through the mean (x̄, ȳ)
and has slope
m = cov(x, y)/cov(x, x).
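In code, the boxed result gives the slope and intercept directly from the covariances. This is a minimal sketch; the arrays x and y are made-up data, not the standardized Longley columns used below.

from numpy import array

x = array([1.0, 2.0, 3.0, 4.0, 5.0])
y = array([2.1, 2.9, 4.2, 4.8, 6.1])

xbar, ybar = x.mean(), y.mean()
cov_xy = ((x - xbar)*(y - ybar)).mean()
cov_xx = ((x - xbar)**2).mean()

m = cov_xy/cov_xx       # slope of the regression line
b = ybar - m*xbar       # intercept: the line passes through the mean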
df = read_csv("longley.csv")
X = df["Population"].to_numpy()
Y = df["Employed"].to_numpy()
# center and standardize both columns
X = X - mean(X)
Y = Y - mean(Y)
varx = sum(X**2)/len(X)
vary = sum(Y**2)/len(Y)
X = X/sqrt(varx)
Y = Y/sqrt(vary)
After this, we compute the optimal weight w∗ and construct the polyno-
mial. The regression equation is solved using the pseudo-inverse (§2.3).
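The weight computation itself is not shown above; here is one plausible sketch using the pseudo-inverse on a matrix of powers of X, together with a hypothetical helper f(x, d) matching the plotting code below. The actual definitions in the text may differ.

from numpy import column_stack
from scipy.linalg import pinv

def fit_poly(X, Y, d):
    # feature matrix with columns 1, x, x^2, ..., x^(d-1)
    A = column_stack([X**k for k in range(d)])
    return pinv(A) @ Y                       # optimal weight w*

def f(x, d):
    # evaluate the degree d-1 polynomial fit to (X, Y)
    w = fit_poly(X, Y, d)
    return column_stack([x**k for k in range(d)]) @ w

xmin, xmax = X.min(), X.max()                # plotting interval used below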
figure(figsize=(12,12))
# six subplots
rows, cols = 3,2
# x interval
x = arange(xmin,xmax,.01)

for i in range(6):
    d = 3 + 2*i # degree = d-1
    subplot(rows, cols,i+1)
    plot(X,Y,"o",markersize=2)
    plot([0],[0],marker="o",color="red",markersize=4)
    plot(x,f(x,d),color="blue",linewidth=.5)
    xlabel("degree = %s" % str(d-1))
    grid()

show()
Running this code with degree 1 returns Figure 8.15. Taking the degree too high can lead to overfitting, as happens here for degree 12.
More generally, we may only know the amount of study time xk , and the
probability pk that the student passed, where now 0 ≤ pk ≤ 1.
x p x p x p x p x p
0.5 0 .75 0 1.0 0 1.25 0 1.5 0
1.75 0 1.75 1 2.0 0 2.25 1 2.5 0
2.75 1 3.0 0 3.25 1 3.5 0 4.0 1
4.25 1 4.5 1 4.75 1 5.0 1 5.5 1
Let σ(z) be the sigmoid function. Then, as in the previous section, the
goal is to minimize the loss function
J(m, b) = ∑_{k=1}^N I(pₖ , qₖ ),    qₖ = σ(m xₖ + b).    (8.6.4)
Once we have the minimizer (m, b), we have the best-fit curve
q = σ(mx + b)
(Figure 8.18).
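For concreteness, here is a sketch of evaluating (8.6.4) on the table above. It assumes I(p, q) is the cross entropy −p log q − (1 − p) log(1 − q); for the 0–1 targets here this coincides with the relative entropy, and either choice gives the same minimizer.

from numpy import array, log
from scipy.special import expit

X = array([0.5, 0.75, 1.0, 1.25, 1.5, 1.75, 1.75, 2.0, 2.25, 2.5,
           2.75, 3.0, 3.25, 3.5, 4.0, 4.25, 4.5, 4.75, 5.0, 5.5])
P = array([0,0,0,0,0,0,1,0,1,0,1,0,1,0,1,1,1,1,1,1])

def J(m, b):
    q = expit(m*X + b)                                  # q_k = sigma(m x_k + b)
    return -(P*log(q) + (1 - P)*log(1 - q)).sum()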
If the targets pk are one-hot encoded, the dataset is as follows.
x p x p x p x p x p
0.5 (1,0) .75 (1,0) 1.0 (1,0) 1.25 (1,0) 1.5 (1,0)
1.75 (1,0) 1.75 (0,1) 2.0 (1,0) 2.25 (0,1) 2.5 (1,0)
2.75 (0,1) 3.0 (1,0) 3.25 (0,1) 3.5 (1,0) 4.0 (0,1)
4.25 (0,1) 4.5 (0,1) 4.75 (0,1) 5.0 (0,1) 5.5 (0,1)
X = [0.5, 0.75, 1.0, 1.25, 1.5, 1.75, 1.75, 2.0, 2.25, 2.5,
     2.75, 3.0, 3.25, 3.5, 4.0, 4.25, 4.5, 4.75, 5.0, 5.5]
P = [0,0,0,0,0,0,1,0,1,0,1,0,1,0,1,1,1,1,1,1]
x = arange(0,6,.01)
plot(x,expit(m*x+b))
scatter(X,P)
grid()
show()
To apply the results from the previous section, first we incorporate the
bias and rewrite the dataset as
(x1 , 1), (x2 , 1), . . . , (xN , 1), N = 20.
Clearly the dataset is full-rank in R², hence J(m, b) is strictly convex. Each sample x in the dataset is in R², and each target is one-hot encoded as (p, 1 − p). This implies the weight matrix must satisfy (8.5.7), W1 = 0, so

W = [ b  −b ; m  −m ].
[Figure 8.21: a two-output network on inputs 1 and x, with weights b, m producing z and weights −b, −m producing −z, outputs (q, 1 − q) = σ(z, −z), target p, and loss J = I(p, q).]
Since here d = 2, the networks in Figures 8.21 and 8.22 are equivalent.
In Figure 8.21, σ is the softmax function. In Figure 8.22, σ is the sigmoid
function.
[Figure 8.22: a one-output network on inputs 1 and x, with weights b and m producing z, output q = σ(z), target p, and loss J = I(p, q).]
Figure 8.18 is a plot of x against p. However, the dataset, with the bias
input included, has two inputs x, 1 and one output p, and should be plotted
in three dimensions (x, 1, p). Then (Figure 8.23) samples lie on the line (x, 1)
in the horizontal plane, and p is on the vertical axis. In particular, K0 and
K1 computed in the next paragraph are in the horizontal plane.
Referring to Figure 8.18, the convex hulls K0 and K1 are in feature space, which here is the horizontal plane R². Now the convex hull K0 of the samples corresponding to pk = 0 is the line segment joining (.5, 1) and (3.5, 1), and the convex hull K1 of the samples corresponding to pk = 1 is the line segment joining (1.75, 1) and (5.5, 1). Since K0 ∩ K1 is the line segment joining (1.75, 1) and (3.5, 1), the span of K0 ∩ K1 is the whole horizontal plane R² in Figure 8.23. By the results of the previous section, J(w) is proper.
Here is the descent code.
X = [0.5, 0.75, 1.0, 1.25, 1.5, 1.75, 1.75, 2.0, 2.25, 2.5,
     2.75, 3.0, 3.25, 3.5, 4.0, 4.25, 4.5, 4.75, 5.0, 5.5]
P = [0,0,0,0,0,0,1,0,1,0,1,0,1,0,1,1,1,1,1,1]

def gradient(m,b):
    # gradient of J(m,b): dJ/dm = sum_k (q_k - p_k) x_k, dJ/db = sum_k (q_k - p_k)
    q = expit(m*array(X) + b)
    return array([sum((q - P)*X), sum(q - P)])

# gradient descent
w = array([0,0]) # starting m,b
t = .01 # learning rate
for _ in range(100000): # iteration count not specified in the text
    g = gradient(*w)
    w = w - t*g
The resulting minimizer is m = 1.49991537, b = −4.06373862.
The Iris dataset consists of 150 samples divided into three groups, leading to three convex hulls K0 , K1 , K2 in R⁴. If the dataset is projected onto the top two principal components, then the projections of these three hulls do not intersect pairwise (Figure 8.24). It follows we have no guarantee the logistic loss is proper.
On the other hand, the MNIST dataset consists of 60,000 samples divided into ten groups. If the MNIST dataset is projected onto the top two principal components, the projections of the ten convex hulls K0 , K1 , . . . , K9 onto R² do intersect (Figure 8.25).
This does not guarantee that the ten convex hulls K0 , K1 , . . . , K9 in R784
intersect, but at least this is so for the 2d projection of the MNIST dataset.
Therefore the logistic loss of the 2d projection of the MNIST dataset is
proper.
Recall this means the eigenvalues of the symmetric matrix D²f (w) are between m and L. In this situation, the condition number¹ r = m/L is between zero and one: 0 < r ≤ 1.
In the previous section, we saw that basic gradient descent converges to a critical point. If f (w) is strictly convex, there is exactly one critical point, the global minimum, so gradient descent converges to the global minimum. As a first example, consider the quadratic loss

f (w) = (1/2) w · Qw − b · w,    (8.7.2)
where Q is a covariance matrix. Then D2 f (w) = Q. If the eigenvalues of Q
are between positive constants m and L, then f (w) is smooth and strictly
convex.
By (7.3.6), the gradient for this example is g = Qw − b. Hence the
minimizer is the unique solution w∗ = Q−1 b of the linear system Qw =
b. Thus gradient descent is a natural tool for solving linear systems and
computing inverses, at least for covariance matrices Q.
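As an illustration, basic gradient descent with the gradient g = Qw − b solves Qw = b; here is a minimal sketch. The matrix Q, the vector b, and the iteration count are made-up placeholders, and the learning rate 1/L comes from the largest eigenvalue.

from numpy import array, allclose
from numpy.linalg import eigvalsh, solve

Q = array([[2.0, 1.0], [1.0, 3.0]])    # symmetric positive definite
b = array([1.0, 2.0])

L = eigvalsh(Q).max()                  # largest eigenvalue
t = 1/L                                # learning rate
w = array([0.0, 0.0])
for _ in range(1000):
    g = Q @ w - b                      # gradient of f(w) = w.Qw/2 - b.w
    w = w - t*g

print(allclose(w, solve(Q, b)))        # True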
By (7.5.16), f (w) lies between two quadratics,
(m/2) |w − w∗ |² ≤ f (w) − f (w∗ ) ≤ (L/2) |w − w∗ |².    (8.7.3)
¹ In the literature, the condition number is often defined as L/m.
How far we are from our goal w∗ can be measured by the error E(w) =
|w − w∗ |2 . Another measure of error is E(w) = f (w) − f (w∗ ). The goal is to
drive the error between w and w∗ to zero.
When f (w) is smooth and strictly convex in the sense of (8.7.1), the
estimate (8.7.3) shows these two error measures are equivalent. We use both
measures below.
Gradient Descent I
Let r = m/L and set E(w) = f (w)−f (w∗ ). Then the descent sequence
w0 , w1 , w2 , . . . given by (8.3.1) with learning rate
t = 1/L

converges to w∗ at the rate

E(wn ) ≤ (1 − r)ⁿ E(w0 ),    n = 1, 2, . . . .
Gradient Descent II
Let r = m/L and set E(w) = |w − w∗ |2 . Then the descent sequence
w0 , w1 , w2 , . . . given by (8.3.1) with learning rate
t = 2/(m + L)
converges to w∗ at the rate
E(wn ) ≤ ((1 − r)/(1 + r))^{2n} E(w0 ),    n = 1, 2, . . . .    (8.7.6)
For example, if L = 6 and m = 2, then r = 1/3, the learning rates are 1/6
versus 1/4, and the convergence rates are 2/3 versus 1/4. Even though GD-II
improves on GD-I, the improvement is not substantial. In the next section, we
use momentum to derive better convergence rates.
Let g be the gradient of the loss function at a point w. Then the line
passing through w in the direction of g is w − tg. When the loss function is
quadratic (7.3.6), f (w − tg) is a quadratic function of the scalar variable t.
In this case, the minimizer t along the line w − tg is explicitly computable as
t = (g · g)/(g · Qg).
This leads to gradient descent with varying time steps t0 , t1 , t2 , . . . . As a
consequence, one can show the error is lowered as follows,
E(w+ ) = ( 1 − 1/((u · Qu)(u · Q⁻¹ u)) ) E(w),    u = g/|g|.
w◦ = w + s(w − w− ). (8.8.1)
Here s is the decay rate. The momentum term reflects the direction induced
by the previous step. Because this mimics the behavior of a ball rolling
downhill, gradient descent with momentum is also called heavy ball descent.
Then the descent sequence w0 , w1 , w2 , . . . is generated by

w+ = w◦ − t∇f (w) = w + s(w − w− ) − t∇f (w).    (8.8.2)

Here we have two hyperparameters, the learning rate and the decay rate.
wn = w∗ + ρn v, Qv = λv. (8.8.5)
Inserting this into (8.8.3) and using Qw∗ = b leads to the quadratic equation
ρ2 = (1 − tλ + s)ρ − s.
Suppose the loss function f (w) is quadratic (8.7.2), let r = m/L, and
set E(w) = |w − w∗ |2 . Let C be given by (8.8.11). Then the descent
sequence w0 , w1 , w2 , . . . given by (8.8.2) with learning rate and decay
rate

t = (1/L) · 4/(1 + √r)²,    s = ((1 − √r)/(1 + √r))²,

converges to w∗ at the rate

E(wn ) ≤ 4C ((1 − √r)/(1 + √r))^{2n} E(w0 ),    n = 1, 2, . . .    (8.8.12)
w◦ = w + s(w − w− ),
w+ = w◦ − t∇f (w◦ ).    (8.8.13)
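Here is a sketch of the update (8.8.13) on the same quadratic example; evaluating the gradient at w◦ rather than at w is what distinguishes it from the heavy ball update (8.8.2). The Q, b, and iteration count are placeholders.

from numpy import array, sqrt
from numpy.linalg import eigvalsh, norm, solve

Q = array([[2.0, 1.0], [1.0, 3.0]])
b = array([1.0, 2.0])
wstar = solve(Q, b)

lam = eigvalsh(Q)
m, L = lam.min(), lam.max()
r = m/L
t = 1/L
s = (1 - sqrt(r))/(1 + sqrt(r))       # decay rate

w_prev = w = array([0.0, 0.0])        # w_{-1} = w_0
for _ in range(200):
    wo = w + s*(w - w_prev)           # momentum step (8.8.1)
    g = Q @ wo - b                    # gradient evaluated at w°
    w_prev, w = w, wo - t*g           # update (8.8.13)

print(norm(w - wstar))                # essentially zero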
Starting from w0 , and setting w−1 = w0 , here it turns out the loss sequence f (w0 ), f (w1 ), f (w2 ), . . . is not always decreasing. Because of this, we seek another function V (w) where the corresponding sequence V (w0 ), V (w1 ), V (w2 ), . . . is decreasing.
To explain this, it’s best to assume w∗ = 0 and f (w∗ ) = 0. This can
always be arranged by translating the coordinate system. Then it turns out
V (w) = f (w) + (L/2) |w − ρw− |²,    (8.8.14)
with a suitable choice of ρ, does the job. With the choices
t = 1/L,    s = (1 − √r)/(1 + √r),    ρ = 1 − √r,
we will show
V (w+ ) ≤ ρV (w). (8.8.15)
In fact, we see below (8.8.22), (8.8.23) that V is reduced by an additional
quantity proportional to the momentum term.
The choice t = 1/L is natural, coming from basic gradient descent (8.3.3).
The derivation of (8.8.15) below forces the choices for s and ρ.
Given a point w, while w+ is well-defined by (8.8.13), it is not clear what
w− means. There are two ways to insert meaning here. Either evaluate V (w)
along a sequence w0 , w1 , w2 , . . . and set, as before, wn− = wn−1 , or work with
the function W (w) = V (w+ ) instead of V (w). If we assume (w+ )− = w,
then W (w) is well-defined. With this understood, we nevertheless stick with
V (w) as in (8.8.14) to simplify the calculations.
We first show how (8.8.15) implies the result. Using (w0 )− = w0 and
(8.7.3),
V (w0 ) = f (w0 ) + (L/2) |w0 − ρw0 |² = f (w0 ) + (m/2) |w0 |² ≤ 2f (w0 ).
Moreover f (w) ≤ V (w). Iterating (8.8.15), we obtain

f (wn ) ≤ V (wn ) ≤ ρⁿ V (w0 ) ≤ 2ρⁿ f (w0 ),    n = 1, 2, . . . .

This derives

Accelerated Gradient Descent
Let r = m/L and set E(w) = f (w) − f (w∗ ). Then the descent sequence w0 , w1 , w2 , . . . given by (8.8.13) with learning rate and decay rate

t = 1/L,    s = (1 − √r)/(1 + √r),

converges to w∗ at the rate

E(wn ) ≤ 2(1 − √r)ⁿ E(w0 ),    n = 1, 2, . . . .
While the convergence rate for accelerated descent is slightly worse than
heavy ball descent, the value of accelerated descent is its validity for all
convex functions satisfying (8.7.1), and the fact, also due to Nesterov [16],
that this convergence rate is best-possible among all such functions.
Now we derive (8.8.15). Assume (w+ )− = w and w∗ = 0, f (w∗ ) = 0. We
know w◦ = (1 + s)w − sw− and w+ = w◦ − tg ◦ , where g ◦ = ∇f (w◦ ).
By the basic descent step (8.3.1) with w◦ replacing w, (8.3.3) implies
f (w+ ) ≤ f (w◦ ) − (t/2) |g ◦ |².    (8.8.17)
Here we used t = 1/L.
By (7.5.15) with x = w and a = w◦ ,
f (w◦ ) ≤ f (w) − g ◦ · (w − w◦ ) − (m/2) |w − w◦ |².    (8.8.18)
By (7.5.15) with x = w∗ = 0 and a = w◦ ,
f (w◦ ) ≤ g ◦ · w◦ − (m/2) |w◦ |².    (8.8.19)
Multiply (8.8.18) by ρ and (8.8.19) by 1 − ρ and add, then insert the sum
into (8.8.17). After some simplification, this yields
f (w+ ) ≤ ρf (w) + g ◦ · (w◦ − ρw) − (r/2t)( ρ|w − w◦ |² + (1 − ρ)|w◦ |² ) − (t/2) |g ◦ |².    (8.8.20)
Since
(w◦ − ρw) − tg ◦ = w+ − ρw,
we have
(1/2t) |w+ − ρw|² = (1/2t) |w◦ − ρw|² − g ◦ · (w◦ − ρw) + (t/2) |g ◦ |².
Adding this to (8.8.20) leads to
V (w+ ) ≤ ρf (w) − (r/2t)( ρ|w − w◦ |² + (1 − ρ)|w◦ |² ) + (1/2t) |w◦ − ρw|².    (8.8.21)
Let

R(a, b) = r( ρs²|b|² + (1 − ρ)|a + sb|² ) − |(1 − ρ)a + sb|² + ρ|(1 − ρ)a + ρb|²,

which is positive.
Appendices
A.1 SQL
Recall matrices (§2.1), datasets, CSV files, spreadsheets, arrays, and dataframes are basically the same objects.
Databases are collections of tables, where a table is another object similar to the above. Hence matrices, datasets, CSV files, spreadsheets, arrays, dataframes, and tables are all interchangeable objects (A.1.2), and we may convert between any of them. Here is a summary of basic SQL commands.
select from
limit
select distinct
where/where not <column>
where <column> = <data> and/or <column> = <data>
order by <column1>,<column2>
insert into table (<column1>,<column2>,...) \
values (<data1>, <data2>, ...)
is null
update <table> set <column> = <data> where ...
like <regex> (%, _, [abc], [a-f], [!abc])
delete from <table> where ...
select min(<column>) from <table> (also max, count, avg)
where <column> in/not in (<data array>)
between/not between <data1> and <data2>
as
join (left, right, inner, full)
create database <database>
drop database <database>
create table <table>
truncate <table>
alter table <table> add <column> <datatype>
alter table <table> drop column <column>
insert into <table> select
This is an unordered listing of key-value pairs. Here the keys are the strings dish, price, and quantity. Keys need not be strings; they may be integers or any immutable Python objects. Since a Python list is mutable, a key cannot be a list. Values may be any Python objects, so a value may be a list. In a dict, values are accessed through their keys. For example, item1["dish"] returns 'Hummus'.
A list-of-dicts is simply a Python list whose elements are Python dicts,
for example,
len(L), L[0]["dish"]
returns
(2,'Hummus')
returns True.
s = dumps(L)
Now print L and print s. Even though L and s “look” the same, L is a
list, and s is a string. To emphasize this point, note
L1 = loads(s)
Then L == L1 returns True. Strings having this form are called JSON
strings, and are easy to store in a database as VARCHARs (see Figure A.4).
The basic object in the Python module pandas is the dataframe (Figures
A.1, A.2, A.4, A.5). The pandas module can convert a dataframe df to
many, many other formats
df = DataFrame(L)
df
L1 = df.to_dict('records')
L == L1
returns True. Here the option 'records' returns a list-of-dicts; other options return a dict-of-dicts or other combinations.
To convert a CSV file into a dataframe, use the code
menu_df = read_csv("menu.csv")
menu_df
To go the other way, to convert the dataframe df to the CSV file menu1.csv, use the code
df.to_csv("menu1.csv")
df.to_csv("menu2.csv",index=False)
protocol = "mysql+pymysql://"
credentials = "username:password"
server = "@servername"
port = ":3306"
uri = protocol + credentials + server + port
This string contains your database username, your database password, the database server name, the server port, and the protocol. If the database is "rawa", the URI is
database = "/rawa"
uri = protocol + credentials + server + port + database
engine = sqlalchemy.create_engine(uri)
df.to_sql('Menu',engine,if_exists='replace')
df.to_sql('Menu',engine)
One benefit of this syntax is the automatic closure of the connection upon completion. This completes the discussion of how to convert between dataframes and SQL tables, and hence of conversions between any of the objects in (A.1.2).
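Going in the other direction, a table can be read back into a dataframe with read_sql. Here is a minimal sketch using the same engine; the query is an example, not from the text, and the with-block closes the connection automatically.

from pandas import read_sql
from sqlalchemy import create_engine, text

engine = create_engine(uri)
with engine.connect() as connection:
    menu_df = read_sql(text("select * from Menu"), connection)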
/* Menu */
dish varchar
price integer
/* ordersin */
orderid integer
created datetime
customerid integer
items json
/* ordersout */
orderid integer
subtotal integer
tip integer
tax integer
total integer
To achieve this task, we download the CSV files menu.csv and orders.csv, then we carry out these steps. (price and tip in menu.csv and orders.csv are in cents.)
5. Add a key items to OrdersIn whose values are JSON strings specifying
the items ordered in orders, using the prices in menu (these are in cents
so they are INTs). The JSON string is that of a list-of-dicts of the form discussed above, L = [item1, item2] (see row 0 in Figure A.4).
Do this by looping over each order in the list-of-dicts orders, then
looping over each item in the list-of-dicts menu, and extracting the
quantity ordered of the item item in the order order.
6. Add a key subtotal to OrdersOut whose values (in cents) are com-
puted from the above values.
Add a key tax to OrdersOut whose values (in cents) are computed
using the Connecticut tax rate 7.35%. Tax is applied to the sum of
subtotal and tip.
Add a key total to OrdersOut whose values (in cents) are computed
from the above values (subtotal, tax, tip).
# step 1
from pandas import *

protocol = "https://"
server = "math.temple.edu"
path = "/~hijab/teaching/csv_files/restaurant/"
url = protocol + server + path
# read the two CSV files into dataframes
menu_df = read_csv(url + "menu.csv")
orders_df = read_csv(url + "orders.csv")
# step 2
menu = menu_df.to_dict('records')
orders = orders_df.to_dict('records')
# step 3
OrdersIn = []
for r in orders:
    d = {}
    d["orderId"] = r["orderId"]
    d["created"] = r["created"]
    d["customerId"] = r["customerId"]
    OrdersIn.append(d)
# step 4
OrdersOut = []

for r in orders:
    d = {}
    d["orderId"] = r["orderId"]
    d["tip"] = r["tip"]
    OrdersOut.append(d)
# step 5
from json import *
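# A sketch of step 5 (not the text's code). It assumes, hypothetically, that
# each row of orders has one column per dish holding the quantity ordered.
for i, order in enumerate(orders):
    items = []
    for item in menu:
        quantity = order[item["dish"]]
        if quantity > 0:
            items.append({"dish": item["dish"], "price": item["price"],
                          "quantity": quantity})
    OrdersIn[i]["items"] = dumps(items)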
# step 6
for i,r in enumerate(OrdersOut):
    items = loads(OrdersIn[i]["items"])
    subtotal = sum([ item["price"]*item["quantity"] for item in items ])
    r["subtotal"] = subtotal
    tip = OrdersOut[i]["tip"]
    tax = int(.0735*(tip + subtotal))
    r["tax"] = tax
    r["total"] = subtotal + tip + tax
# step 7
ordersin_df = DataFrame(OrdersIn)
ordersout_df = DataFrame(OrdersOut)
# step 8
import sqlalchemy
from sqlalchemy import create_engine, text

engine = create_engine(uri)
dtype1 = { "dish":sqlalchemy.String(60), "price":sqlalchemy.Integer }
dtype2 = {
"orderId":sqlalchemy.Integer,
"created":sqlalchemy.String(30),
"customerId":sqlalchemy.Integer,
"items":sqlalchemy.String(1000)
}
dtype3 = {
"orderId":sqlalchemy.Integer,
"tip":sqlalchemy.Integer,
"subtotal":sqlalchemy.Integer,
"tax":sqlalchemy.Integer,
"total":sqlalchemy.Integer
}
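Presumably the three dataframes are then written out with these column types; here is a sketch (the table names and the if_exists choice are assumptions, following the schema listed earlier).

menu_df.to_sql('Menu', engine, dtype=dtype1, if_exists='replace', index=False)
ordersin_df.to_sql('ordersin', engine, dtype=dtype2, if_exists='replace', index=False)
ordersout_df.to_sql('ordersout', engine, dtype=dtype3, if_exists='replace', index=False)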
protocol = "mysql+pymysql://"
credentials = "username:password@"
server = "servername"
port = ":3306"
database = "/rawa"
uri = protocol + credentials + server + port + database
engine = sqlalchemy.create_engine(uri)
connection = engine.connect()
num = 0
print(num)
def num_plates(item):
    dishes = loads(item)
    return sum( [ dish["quantity"] for dish in dishes ] )
Then we use map to apply this function to every element in the series df["items"], resulting in another series. Then we sum the resulting series.
num = df["items"].map(num_plates).sum()
print(num)
Since the total number of plates is 14,949, and the total number of orders
is 4970, the average number of plates per order is 3.76.
A.2 Minimizing Sequences

Throughout the text, questions about the existence of minimizers were ignored. In this section, which may safely be skipped, we review the foundational material supporting the existence of minimizers.
The first issue that must be clarified is the difference between the minimum and the infimum. In a given situation, it is possible that there is no minimum. By contrast, in any reasonable situation, there is always an infimum.
For example, since y = eˣ is an increasing function, the minimum

min_{0≤x≤1} eˣ = min{eˣ | 0 ≤ x ≤ 1} = e⁰ = 1

is attained at the minimizer x = 0. By contrast, consider the values 1/x for x ≥ 1.
In this situation, the minimizer does not exist, but, since the values of 1/x are
arbitrarily close to 0, we say the infimum is 0. Since there is no minimizer,
there is no minimum value. Also, even though 0 is the infimum, we do not
say ∞ is the “infimizer”, since ∞ is not an actual number.
When S is infinite, a minimum need not exist. When the minimum exists, we write m = min S.
If S is bounded below, then S has many lower bounds. The greatest
among these lower bounds is the infimum of S. A foundational axiom for
real numbers is that the infimum always exists. When m is the infimum of
S, we write m = inf S.
Existence of Infima
Any collection S of real numbers that is bounded below has an infimum:
There is a lower bound m for S that is greater than any other lower
bound b for S.
For example, for S = [0, 1], inf S = 0 and min S = 0, and, for S = (0, 1),
inf S = 0, but min S does not exist. For both these sets S, it is clear that 0
is the infimum. The power of the axiom comes from its validity for any set
S of scalars that is bounded below, no matter how complicated.
By definition, the infimum of S is the lower bound for S that is greater
than any other lower bound for S. From this, if min S exists, then inf S =
min S.
e1 ≥ e2 ≥ · · · ≥ 0.
inf_{n≥1} eₙ = 0.
Error Sequence
An error sequence e1 ≥ e2 ≥ · · · ≥ 0 converges to zero iff for any ϵ > 0,
there is an N > 0 with
0 ≤ en < ϵ, n ≥ N.
or we write xn → x∗ .
Note this definition of convergence is consistent with the previous definition, since an error sequence e1 , e2 , . . . converges to zero (in the first sense) iff

lim_{n→∞} eₙ = 0.
Here it is important that the indices n1 < n2 < n3 < . . . be strictly increasing.
If a sequence x1 , x2 , . . . has a subsequence x′1 , x′2 , . . . converging to x∗ ,
then we say the sequence x1 , x2 , . . . subconverges to x∗ . For example, the
sequence 1, −1, 1, −1, 1, −1, . . . subconverges to 1 and also subconverges
to −1, as can be seen by considering the odd-indexed terms and the even-
indexed terms separately.
Note a subsequence of an error sequence converging to zero is also an
error sequence converging to zero. As a consequence, if a sequence converges
to x∗ , then every subsequence of the sequence converges to x∗ . From this
it follows that the sequence 1, −1, 1, −1, 1, −1, . . . does not converge to
anything: it bounces back and forth between ±1.
I0 ⊃ I1 ⊃ I2 ⊃ . . . ,
x∗ = inf_{n≥1} xₙ∗

must exist.
A minimizer is a vector x∗ satisfying f (x∗ ) = m. As we saw above, a
minimizer may or may not exist, and, when the minimizer does exist, there
may be several minimizers.
A minimizing sequence for f (x) over S is a sequence x1 , x2 , . . . of vectors in S such that the corresponding values f (x1 ), f (x2 ), . . . are decreasing and converge to m = inf_S f (x) as n → ∞. In other words, x1 , x2 , . . . is a minimizing sequence for f (x) over S if

f (x1 ) ≥ f (x2 ) ≥ f (x3 ) ≥ · · ·

and

inf_S f (x) = inf_{n≥1} f (xn ).
Either f (x1 ) = m, or

0 < f (x1 ) − m < (f (x0 ) − m)/2.
Existence of Minimizers
If f (x) is continuous on Rd and S is a bounded set in Rd , then there
is a minimizer x∗ ,
f (x∗ ) = inf_{x in S} f (x).    (A.2.2)
[1] Joshua Akey, Genome 560: Introduction to Statistical Genomics, 2008. https://www.gs.washington.edu/academics/courses/akey/56008/lecture/lecture1.pdf.
[2] Sébastien Bubeck, Convex Optimization: Algorithms and Complexity, Foundations
and Trends in Machine Learning, vol. 8, Now Publishers, 2015.
[3] Harald Cramér, Mathematical Methods of Statistics, Princeton University Press, 1946.
[4] Marc Peter Deisenroth, A. Aldo Faisal, and Cheng Soon Ong, Mathematics for Machine Learning, Cambridge University Press, 2020.
[5] J. L. Doob, Probability and Statistics, Transactions of the American Mathematical
Society 36 (1934), 759-775.
[6] R. A. Fisher, The conditions under which χ2 measures the discrepancy between ob-
servation and hypothesis, Journal of the Royal Statistical Society 87 (1924), 442-450.
[7] Ian Goodfellow, Yoshua Bengio, and Aaron Courville, Deep Learning, MIT Press, 2016.
[8] Google, Machine Learning. https://developers.google.com/machine-learning.
[9] Robert M. Gray, Toeplitz and Circulant Matrices: A Review, Foundations and Trends
in Communications and Information Theory 2 (2006), no. 3, 155-239.
[10] T. L. Heath, The Works of Archimedes, Cambridge University, 1897.
[11] Lily Jiang, A Visual Explanation of Gradient Descent Methods, 2020. https://towardsdatascience.com/a-visual-explanation-of-gradient-descent-methods-momentum-adagrad-rmsprop-adam-f898b102325c.
[12] J. W. Longley, An Appraisal of Least Squares Programs for the Electronic Computer from the Point of View of the User, Journal of the American Statistical Association 62 (1967), no. 319, 819-841.
[13] David G. Luenberger and Yinyu Ye, Linear and Nonlinear Programming, Springer,
2008.
[14] Ioannis Mitliagkas, Theoretical principles for deep learning, lecture notes, 2019. https://mitliagkas.github.io/ift6085-dl-theory-class-2019/.
Python

*, 8, 16
all, 188
append, 187
def.angle, 23, 73
def.assign_clusters, 187
def.backward_prop, 359, 367, 409
def.ball, 61
def.chi2_independence, 321
def.confidence_interval, 293, 305, 310, 313
def.delta_out, 408
def.derivative, 366
def.dimension_staircase, 128
def.downstream, 409
def.ellipse, 50, 57
def.find_first_defect, 126
def.forward_prop, 358, 365, 404
def.gd, 420
def.goodness_of_fit, 318
def.H, 345
def.hexcolor, 10
def.incoming, 364, 403
def.J, 405
def.local, 406
def.nearest_index, 187
def.newton, 413
def.num_plates, 471
def.outgoing, 364, 403
def.pca, 180
def.pca_with_svd, 180
def.plot_cluster, 188
def.plot_descent, 414
def.project, 119
def.project_to_ortho, 120
def.pvalue, 270
def.random_batch_mean, 247
def.random_vector, 188
def.tensor, 32
def.train_nn, 422
def.ttest, 306
def.type2_error, 299, 307
def.update_means, 187
def.update_weights, 422
def.zero_variance, 106
def.ztest, 297
diag, 173
dict, 459
display, 148
enumerate, 181
floor, 165
import, 8
pandas.DataFrame.to_numpy, 70
pandas.DataFrame.to_sql, 463
pandas.read_csv, 437, 461
pandas.read_sql, 463
random.choice, 10
random.random, 15
scipy.linalg.null_space, 97, 98
scipy.linalg.orth, 91
scipy.linalg.pinv, 85
scipy.spatial.ConvexHull, 371
    simplices, 372
scipy.special.comb, 216
scipy.special.expit, 236, 441
scipy.special.softmax, 385
scipy.stats.chi2, 274
scipy.stats.norm, 261
scipy.stats.t, 302, 305
sklearn.datasets.load_iris, 2
sklearn.decomposition.PCA, 182
sklearn.preprocessing.StandardScaler, 81
sort, 178
sqlalchemy.create_engine, 463
sqlalchemy.text, 463
sympy.*, 71
sympy.diag, 70
sympy.diagonalize, 147
sympy.eigenvects, 147
sympy.init_printing, 147
sympy.Matrix, 63
sympy.Matrix.col, 68
sympy.Matrix.cols, 68
sympy.Matrix.columnspace, 90
sympy.Matrix.eye, 69
sympy.Matrix.hstack, 66, 85
sympy.Matrix.inv, 82
sympy.Matrix.nullspace, 96
sympy.Matrix.ones, 69
sympy.Matrix.rank, 132
sympy.Matrix.row, 68
sympy.Matrix.rows, 68
sympy.Matrix.rowspace, 94
sympy.Matrix.zeros, 69
sympy.RootOf, 42
sympy.shape, 63
sympy.solve, 255
sympy.symbols, 42
tuple, 18
zip, 185
Index