Mma PDF
(Working Title)
Preface

I Vectors

1 Vectors
1.1 Vectors
1.2 Vector addition
1.3 Scalar-vector multiplication
1.4 Inner product
1.5 Complexity of vector computations

2 Linear functions
2.1 Linear functions
2.2 Taylor approximation
2.3 Regression model

4 Clustering
4.1 Clustering
4.2 A clustering objective
4.3 The k-means algorithm
4.4 Examples
4.5 Applications

5 Linear independence
5.1 Linear dependence
5.2 Basis
5.3 Orthonormal vectors
5.4 Gram-Schmidt algorithm
II Matrices

6 Matrices
6.1 Matrices
6.2 Zero and identity matrices
6.3 Transpose and addition
6.4 Matrix-vector multiplication
6.5 Complexity

Appendices
A Notation
B Complexity
Index
Preface
This book is meant to provide a basic introduction to vectors, matrices, and least
squares methods, with a focus on applications. Vectors, matrices, and least squares
are basic topics in applied linear algebra. They are now used in a very wide range
of applications, including data fitting, machine learning and artificial intelligence,
tomography, navigation, image processing, finance, and control, to name just a few.
Our goal is to give the beginning student, with little or no prior exposure to linear
algebra, a good grounding in the basic ideas, as well as an appreciation for how
they are used in many applications.
The background required of the reader is familiarity with basic mathematical
notation. We use calculus in just a few places, but it does not play a critical
role and is not a strict prerequisite. Even though the book covers many topics
that are traditionally taught as part of probability and statistics, such as fitting
mathematical models to data, no knowledge of or background in probability and
statistics is needed.
The book covers less mathematics than a typical one on applied linear algebra.
We use only one theoretical concept from linear algebra, linear independence, and
only one computational tool, the QR factorization; our approach to most applications
relies on only one method, least squares (or some extension). In this sense
the book aims for intellectual economy: We cover many applications with just a
few basic ideas, concepts, and methods. The mathematics we do present, however,
is complete, in that we carefully justify every mathematical statement. In contrast
to most other applied linear algebra books, however, we describe many applications,
including some that are typically considered advanced topics, like document
classification, control, state estimation, and portfolio optimization.
The book does not require any knowledge of computer programming, and can
be used as a conventional textbook, by reading the chapters and working the
exercises. This approach misses out on one of the most compelling reasons to learn
the material: You can actually use the ideas and methods described in this book
to do practical things like build a prediction model from data, enhance images, or
optimize an investment portfolio. The growing power of computers, together with
the development of high level computer languages and packages that support vector
and matrix computation, have made it easy to use the methods described in this
book for real applications. We hope that every student of this book will complement
their reading with computer programming exercises and projects, including
some that involve real data and problems.
If you read the whole book, work some of the exercises, and carry out computer
exercises to implement or use the ideas and methods, you will learn quite a lot.
While there will still be much for you to learn, you will know many of the basic
ideas behind modern data science and many other application areas, and you will
be empowered to use the methods for your own applications.
The book is structured into three parts. Part I introduces the reader to vectors,
and various vector operations and functions like addition, inner product, distance,
and angle. We also describe how vectors are used to represent word counts in a
document, time series, attributes of a patient, sales of a product, an audio track,
or an image. Part II does the same for matrices, culminating with matrix inverses.
Part III, on least squares, is the payoff, at least in terms of the applications. We
show how the simple and natural idea of approximately solving a set of
overdetermined equations, and a few extensions of this basic idea, can be used to solve
a wide range of practical problems.
The whole book can be covered in a 15 week (semester) course; a 10 week
(quarter) course can cover most of the material, by skipping a few applications and
perhaps the last two chapters on nonlinear least squares.
Part I
Vectors
Chapter 1
Vectors
1.1 Vectors
A vector is an ordered finite list of numbers. Vectors are usually written as vertical
arrays, surrounded by square or curved brackets, as in
$$\begin{bmatrix} -1.1 \\ 0.0 \\ 3.6 \\ -7.2 \end{bmatrix} \quad \text{or} \quad \left( \begin{array}{c} -1.1 \\ 0.0 \\ 3.6 \\ -7.2 \end{array} \right).$$
The elements (or entries, coefficients, components) of a vector are the values in the
array. The size (also called dimension or length) of the vector is the number of
elements it contains. The vector above, for example, has size four; its third entry
is 3.6. A vector of size n is called an n-vector. A 1-vector is considered to be the
same as a number, i.e., we do not distinguish between the 1-vector [ 1.3 ] and the
number 1.3.
We often use symbols to denote vectors. If we denote an n-vector using the
symbol a, the ith element of the vector a is denoted ai , where the subscript i is an
integer index that runs from 1 to n, the size of the vector.
Two vectors a and b are equal, which we denote a = b, if they have the same
size, and each of the corresponding entries is the same. If a and b are n-vectors,
then a = b means a1 = b1 , . . . , an = bn .
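These definitions are easy to mirror in code. The following is a rough sketch in Python, representing a vector as a plain list of numbers; the variable names are ours, chosen for illustration:

```python
# a vector represented as a Python list of numbers
a = [-1.1, 0.0, 3.6, -7.2]  # a 4-vector
b = [-1.1, 0.0, 3.6, -7.2]

n = len(a)      # the size of the vector
third = a[2]    # Python indexes from 0, so the third entry a_3 is a[2]

# a = b requires equal sizes and equal corresponding entries
equal = len(a) == len(b) and all(ai == bi for ai, bi in zip(a, b))
print(n, third, equal)  # → 4 3.6 True
```

Note the off-by-one between the mathematical convention (indices 1 to n) and Python's 0-based indexing.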
4 1 Vectors
The numbers or values of the elements in a vector are called scalars. We will
focus on the case that arises in most applications, where the scalars are real num-
bers. In this case we refer to vectors as real vectors. (Occasionally other types of
scalars arise, for example, complex numbers, in which case we refer to the vector
as a complex vector.) The set of all real numbers is written as R, and the set of all
real n-vectors is denoted Rn , so a ∈ Rn is another way to say that a is an n-vector
with real entries. Here we use set notation: a ∈ Rn means that a is an element of
the set Rn ; see appendix A.
Block or stacked vectors. It is sometimes useful to define a vector by concatenating
or stacking vectors, as in
a = (b1 , b2 , . . . , bm , c1 , c2 , . . . , cn , d1 , d2 , . . . , dp ).
Subvectors. In the equation above, we say that b, c, and d are subvectors or slices
of a, with sizes m, n, and p, respectively. One notation used to denote subvectors
uses colon notation. If a is a vector, then ar:s is the vector of size s − r + 1, with
entries ar , . . . , as :
ar:s = (ar , . . . , as ).
The subscript r : s is called the index range. Thus, in our example above, we have
a1:m = b.
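Stacking and slicing both have direct analogs in code. A small Python sketch (the helper `subvector` is ours, added to make the 1-based index range explicit):

```python
# a formed by stacking b, c, and d
b = [1.0, 2.0]
c = [3.0, 4.0, 5.0]
d = [6.0]
a = b + c + d   # list concatenation plays the role of stacking

# the subvector a_{r:s}, with 1-based indices as in the text,
# corresponds to the Python slice a[r-1:s]
def subvector(a, r, s):
    return a[r - 1:s]

print(subvector(a, 1, 2) == b)  # → True, i.e., a_{1:2} = b
print(subvector(a, 3, 5) == c)  # → True, i.e., a_{3:5} = c
```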
Notational conventions. Some authors try to use notation that helps the reader
distinguish between vectors and scalars (numbers). For example, Greek letters
(α, β, . . . ) might be used for numbers, and lower-case letters (a, x, f , . . . ) for
vectors. Other notational conventions include vectors given in bold font (g), or
vectors written with arrows above them (~a). These notational conventions are not
standardized, so you should be prepared to figure out what things are (i.e., scalars
or vectors) despite the author's notational scheme (if any exists).
Zero vectors. A zero vector is a vector with all elements equal to zero. Sometimes
the zero vector of size n is written as 0n , where the subscript denotes the size.
But usually a zero vector is denoted just 0, the same symbol used to denote the
number 0. In this case you have to figure out the size of the zero vector from the
context. (We will see how this is done later.)
Even though zero vectors of different sizes are different vectors, we use the same
symbol 0 to denote them. In computer programming this is called overloading: the
symbol 0 is overloaded because it can mean different things depending on the
context (e.g., the equation it appears in).
Unit vectors. A (standard) unit vector is a vector with all elements equal to zero,
except one element which is equal to one. The ith unit vector (of size n) is the
unit vector with ith element one, and is denoted ei . For example, the vectors
$$e_1 = \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}, \qquad e_2 = \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix}, \qquad e_3 = \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix}$$
are the three unit vectors of size 3. The notation for unit vectors is an example of
the ambiguity in notation noted above. Here, ei denotes the ith unit vector, and
not the ith element of a vector e. Thus we can describe the ith unit n-vector ei as
$$(e_i)_j = \begin{cases} 1 & j = i \\ 0 & j \neq i. \end{cases}$$
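The defining formula translates directly into code. A Python sketch (the function name is ours):

```python
def unit_vector(i, n):
    """The ith standard unit vector e_i of size n (1-based index i)."""
    return [1.0 if j == i else 0.0 for j in range(1, n + 1)]

e2 = unit_vector(2, 3)
print(e2)  # → [0.0, 1.0, 0.0]
```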
Ones vector. We use the notation 1n for the n-vector with all its elements equal
to one. We also write 1 if the size of the vector can be determined from the
context. (Some authors use e to denote a vector of all ones, but we will not use
this notation.) The vector 1 is sometimes called the ones vector.
Sparsity. A vector is said to be sparse if many of its entries are zero; its sparsity
pattern is the set of indices of its nonzero entries. The number of nonzero entries
of an n-vector x is denoted nnz(x). Unit vectors are sparse, since they have only
one nonzero entry. The zero vector is the sparsest possible vector, since it has no
nonzero entries. Sparse vectors arise in many applications.
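As a small illustration, nnz and one common way of storing a sparse vector (keeping only the indices and values of nonzero entries) can be sketched in Python:

```python
def nnz(x):
    """Number of nonzero entries of x."""
    return sum(1 for xi in x if xi != 0)

x = [0.0, 1.3, 0.0, 0.0, -2.1]
print(nnz(x))          # → 2
print(nnz([0.0] * 4))  # → 0  (the zero vector has no nonzero entries)

# a simple sparse representation: 1-based index -> value, nonzeros only
sparse_x = {i: xi for i, xi in enumerate(x, start=1) if xi != 0}
print(sparse_x)        # → {2: 1.3, 5: -2.1}
```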
6 1 Vectors
Figure 1.1 The 2-vector x specifies the position (shown as a dot) with coordinates x1 and x2 in a plane.
Examples
An n-vector can be used to represent n quantities or values in an application. In
some cases the values are similar in nature (for example, they are given in the same
physical units); in others, the quantities represented by the entries of the vector are
quite different from each other. We briefly describe below some typical examples,
many of which we will see throughout the book.
Color. A 3-vector can represent a color, with its entries giving the Red, Green,
and Blue (RGB) intensity values (often between 0 and 1). The vector (0, 0, 0)
represents black, the vector (0, 1, 0) represents a bright pure green color, and the
vector (1, 0.5, 0.5) represents a shade of pink. This is illustrated in figure 1.3.
Values across a population. An n-vector can give the values of some quantity
across a population of individuals or entities. For example, an n-vector b can give
the blood pressure of a collection of n patients, with bi the blood pressure of patient
i, for i = 1, . . . , n.
Time series. An n-vector can represent a time series or signal, that is, the value
of some quantity at different times. (The entries in a vector that represents a time
series are sometimes called samples, especially when the quantity is something
measured.) An audio (sound) signal can be represented as a vector whose entries
[Figure 1.4: plot of hourly temperature xi (°F) versus i.]
give the value of acoustic pressure at equally spaced times (typically 48000 or 44100
per second). A vector might give the hourly rainfall (or temperature, or barometric
pressure) at some location, over some time period. When a vector represents a time
series, it is natural to plot xi versus i with lines connecting consecutive time series
values. (These lines carry no information; they are added only to make the plot
easier to understand visually.) An example is shown in figure 1.4, where the
48-vector x gives the hourly temperature in downtown Los Angeles.
Daily return. A vector can represent the daily return of a stock, i.e., its fractional
increase (or decrease if negative) in value over the day. For example the return time
series vector (−0.022, +0.014, +0.004) means the stock price went down 2.2% on
the first day, then up 1.4% the next day, and up again 0.4% on the third day. In
this example, the samples are not uniformly spaced in time; the index refers to
trading days, and does not include weekends or market holidays. A vector can
represent the daily (or quarterly, hourly, or minute-by-minute) value of any other
quantity of interest for an asset, such as price or volume.
Cash flow. A cash flow into and out of an entity (say, a company) can be repre-
sented by a vector, with positive representing payments to the entity, and negative
representing payment by the entity. For example, with entries giving cash flow
each quarter, the vector (1000, −10, −10, −10, −1010) represents a one-year loan of
$1000, with 1% interest-only payments made each quarter, and the principal and
last interest payment at the end.
Word count and histogram. A vector of length n can represent the number of
times each word in a dictionary of n words appears in a document. For example,
(25, 2, 0) means that the first dictionary word appears 25 times, the second one
twice, and the third one not at all. (Typical dictionaries used for document word
counts have many more than 3 elements.) A small example is shown in figure 1.6. A
variation is to have the entries of the vector give the histogram of word frequencies
in the document, so that, e.g., x5 = 0.003 means that 0.3% of all the words in the
document are the fifth word in the dictionary.
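A word count vector can be computed with a few lines of Python. This is a toy sketch (no stemming or stop-word handling, and the dictionary and document are made up for illustration):

```python
def word_count(document, dictionary):
    """Count how often each dictionary word occurs in the document."""
    words = document.lower().split()
    return [words.count(w) for w in dictionary]

dictionary = ["word", "in", "number", "horse", "the", "document"]
doc = "the horse is in the barn in the document"
print(word_count(doc, dictionary))  # → [0, 2, 0, 1, 3, 1]
```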
It is common practice to count variations of a word (say, the same word stem
with different endings) as the same word; for example, rain, rains, raining, and
word      3
in        2
number    1
horse     0
the       4
document  1
Figure 1.6 A snippet of text (top), the dictionary (bottom left), and word
count vector (bottom right).
rained might all be counted as rain. Reducing each word to its stem is called
stemming. It is also common practice to exclude words that are too common (such
as a or the), which are referred to as stop words, as well as words that are extremely rare.
Vector entry labels. In applications such as the ones described above each entry
of a vector has a meaning, such as the count of a specific word in a document, the
number of shares of a specific stock held in a portfolio, or the rainfall in a specific
hour. It is common to keep a separate list of labels or tags that explain or annotate
the meaning of the vector entries. As an example, we might associate the portfolio
vector (100, 50, 20) with the list of ticker symbols (AAPL, INTC, AMZN), so we
know that assets 1, 2, and 3 are Apple, Intel, and Amazon. In some applications,
such as an image, the meaning or ordering of the entries follow known conventions
or standards.
1.2 Vector addition

Two vectors of the same size can be added together by adding the corresponding
elements, to form another vector of the same size, called the sum of the vectors.
Vector subtraction is defined similarly, by subtracting corresponding elements.
The result of vector subtraction is called the difference of the two vectors.
Properties. Several properties of vector addition are easily verified. For any
vectors a, b, and c of the same size we have the following.
Vector addition is commutative: a + b = b + a.
Vector addition is associative: (a + b) + c = a + (b + c). We can therefore
write both as a + b + c.
a + 0 = 0 + a = a. Adding the zero vector to a vector has no effect. (This
is an example where the size of the zero vector follows from the context: It
must be the same as the size of a.)
a − a = 0. Subtracting a vector from itself yields the zero vector. (Here too
the size of 0 is the size of a.)
To show that these properties hold, we argue using the definition of vector
addition and vector equality. As an example, let us show that for any n-vectors a
and b, we have a + b = b + a. The ith entry of a + b is, by the definition of vector
addition, ai + bi . The ith entry of b + a is bi + ai . For any two numbers we have
ai + bi = bi + ai , so the ith entries of the vectors a + b and b + a are the same.
This is true for all of the entries, so by the definition of vector equality, we have
a + b = b + a.
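The definition and the properties above can be checked numerically. A Python sketch (the function name is ours):

```python
def vector_add(a, b):
    """Elementwise sum of two vectors of the same size."""
    assert len(a) == len(b), "vectors must have the same size"
    return [ai + bi for ai, bi in zip(a, b)]

a = [1.0, 2.0, 3.0]
b = [-1.0, 0.5, 2.0]
zero = [0.0, 0.0, 0.0]

print(vector_add(a, b))                      # → [0.0, 2.5, 5.0]
print(vector_add(a, b) == vector_add(b, a))  # → True (commutativity)
print(vector_add(a, zero) == a)              # → True (a + 0 = a)
```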
Figure 1.7 Left. The lowest dark arrow shows the displacement a; the
displacement b, shown as a dashed arrow, starts from the head of the
displacement a and ends at the sum displacement a + b, shown as the longer dark
arrow. Right. The displacement b + a.
Verifying identities like the ones above, and many others we will encounter
later, can be tedious. But it is important to understand that the various properties
we will list can be derived using elementary arguments like the one above. We
recommend that the reader select a few of the properties we will see, and attempt
to derive them, just to see that it can be done. (Deriving all of them is overkill.)
Examples.
Displacements between two points. If the vectors p and q represent the
positions of two points in 2-D or 3-D space, then p − q is the displacement vector
from q to p, as illustrated in figure 1.9.
Word counts. If a and b are word count vectors (using the same dictionary)
Figure 1.9 The vector p − q represents the displacement from the point
represented by q to the point represented by p.
for two documents, the sum a + b is the word count vector of a new document
created by combining the original two (in either order). The word count
difference vector a b gives the number of times more each word appears in
the first document than the second.
Bill of materials. Suppose q1 , . . . , qN are n-vectors that give the quantities of
n different resources required to accomplish N tasks. Then the sum n-vector
q1 + · · · + qN gives the bill of materials for completing all N tasks.
Market clearing. Suppose the n-vector qi represents the amounts of n goods
or resources produced (when positive) or consumed (when negative) by agent
i, for i = 1, . . . , N , so (q5 )4 = −3.2 means that agent 5 consumes 3.2 units of
resource 4. The sum s = q1 + · · · + qN is the n-vector of total net surplus of
the resources (or shortfall, when the entries are negative). When s = 0, we
have a closed market, which means that the total amount of each resource
produced by the agents balances the total amount consumed. In other words,
the n resources are exchanged among the agents. In this case we say that the
market clears (with the resource vectors q1 , . . . , qN ).
Audio addition. When a and b are vectors representing audio signals over
the same period of time, the sum a + b is an audio signal that is perceived as
containing both audio signals combined into one. If a represents a recording of
a voice, and b a recording of music (of the same length), the audio signal a + b
will be perceived as containing both the voice recording and, simultaneously,
the music.
Feature differences. If f and g are n-vectors that give n feature values for two
items, the difference vector d = f g gives the difference in feature values for
the two objects. For example, d7 = 0 means that the two objects have the
same value for feature 7; d3 = 1.67 means that the first object's third feature
value exceeds the second object's third feature value by 1.67.
Time series. If a and b represent time series of the same quantity, such as
daily profit at two different stores, then a + b represents a time series which is
the total daily profit at the two stores. An example (with monthly rainfall)
is shown in figure 1.10.
[Figure 1.10: monthly rainfall (inches) in Los Angeles and San Francisco, and their sum, versus month k = 1, . . . , 12.]
Scalar-vector multiplication can also be written with the scalar on the right, as in
$$\begin{bmatrix} -1 \\ 9 \\ 6 \end{bmatrix} (1.5) = \begin{bmatrix} -1.5 \\ 13.5 \\ 9 \end{bmatrix}.$$
The meaning is the same: It is the vector obtained by multiplying each element
by the scalar. A similar notation is a/2, where a is a vector, meaning (1/2)a. The
scalar-vector product (−1)a is written simply as −a. Note that 0 a = 0 (where the
left-hand zero is the scalar zero, and the right-hand zero is a vector zero of the
same size as a).
Scalar-vector multiplication satisfies several properties that are easily verified.
For any scalars β and γ, and any n-vector a, it is associative,
(βγ)a = β(γa),
and left-distributive over scalar addition,
(β + γ)a = βa + γa.
When scalar multiplication is written with the scalar on the right, we have the
right-distributive property:
a(β + γ) = aβ + aγ.
Scalar multiplication also distributes over vector addition: β(a + b) = βa + βb
for any scalar β and any n-vectors a and b. In this equation, both of the + symbols
refer to the addition of n-vectors.
b = b1 e1 + · · · + bn en . (1.1)
In this equation bi are the entries in b (i.e., scalars), and ei is the ith unit vector.
A specific example is
$$\begin{bmatrix} -1 \\ 3 \\ 5 \end{bmatrix} = (-1) \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix} + 3 \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix} + 5 \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix}.$$
Figure 1.11 1.5a represents the displacement in the direction of the
displacement a, with magnitude scaled by 1.5; (−1.5)a represents the displacement
in the opposite direction, also with magnitude scaled by 1.5.
Examples.
Displacements. When a vector a represents a displacement, and β > 0, βa
is a displacement in the same direction as a, with its magnitude scaled by
β. When β < 0, βa represents a displacement in the opposite direction of a,
with magnitude scaled by |β|. This is illustrated in figure 1.11.
Materials requirements. Suppose the n-vector q is the bill of materials for
producing one unit of some product, i.e., qi is the amount of raw material i
required to produce one unit of product. To produce α units of the product
will then require raw materials given by αq. (Here we assume that α ≥ 0.)
Audio scaling. If a is a vector representing an audio signal, the scalar-vector
product βa is perceived as the same audio signal, but changed in volume
(loudness) by the factor |β|. For example, when β = 1/2 (or β = −1/2), βa
is perceived as the same audio signal, but quieter.
Audio mixing. When a1 , . . . , am are vectors representing audio signals (over
the same period of time, for example, simultaneously recorded), they are
called tracks. The linear combination β1 a1 + · · · + βm am is perceived as a
mixture (also called a mix ) of the audio tracks, with relative loudness given
by |β1 |, . . . , |βm |. A producer in a studio, or a sound engineer at a live show,
chooses values of β1 , . . . , βm to give a good balance between the different
instruments, vocals, and drums.
Cash flow replication. Suppose that c1 , . . . , cm are vectors that represent cash
flows, such as particular types of loans or investments. The linear combination
f = β1 c1 + · · · + βm cm represents another cash flow. We say that the cash
flow f has been replicated by the (linear combination of the) original cash flows.
1.4 Inner product

The (standard) inner product (also called the dot product) of two n-vectors a
and b is defined as the scalar
aT b = a1 b1 + a2 b2 + · · · + an bn ,
the sum of the products of corresponding entries.
When n = 1, the inner product reduces to the usual product of two numbers.
Properties. The inner product satisfies some simple properties that are easily
verified from the definition. If a, b, and c are vectors of the same size, and γ is a
scalar, we have the following.
Commutativity. aT b = bT a. The order of the two vector arguments in the
inner product does not matter.
Associativity with scalar multiplication. (γa)T b = γ(aT b), so we can write
both as γaT b.
Distributivity with vector addition. (a + b)T c = aT c + bT c. The inner product
can be distributed across vector addition.
These can be combined to obtain other identities, such as aT (γb) = γ(aT b), or
aT (b + γc) = aT b + γaT c. As another useful example, we have, for any vectors
a, b, c, d of the same size,
(a + b)T (c + d) = aT c + aT d + bT c + bT d.
This formula expresses an inner product on the left-hand side as a sum of four
inner products on the right-hand side, and is analogous to expanding a product of
sums in algebra. Note that on the left-hand side, the two addition symbols refer to
vector addition, whereas on the right-hand side, the three addition symbols refer
to scalar (number) addition.
General examples.
Unit vector. eTi a = ai . The inner product of a vector with the ith standard
unit vector gives (or picks out) the ith element of a.
Sum. 1T a = a1 + · · · + an . The inner product of a vector with the vector of
ones gives the sum of the elements of the vector.
Average. (1/n)T a = (a1 + · · · + an )/n. The inner product of an n-vector with
the vector 1/n gives the average of the elements of the vector.
Sum of squares. aT a = a1² + · · · + an² . The inner product of a vector with
itself gives the sum of the squares of the elements of the vector.
Selective sum. Let b be a vector all of whose entries are either 0 or 1. Then
bT a is the sum of the elements in a for which bi = 1.
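Each of these general examples is a one-line computation once we have an inner product function. A Python sketch (the function name and the test vector are ours):

```python
def inner(a, b):
    """Inner product a^T b of two vectors of the same size."""
    assert len(a) == len(b), "vectors must have the same size"
    return sum(ai * bi for ai, bi in zip(a, b))

a = [1.0, 2.0, 3.0, 4.0]
n = len(a)
ones = [1.0] * n

print(inner([0.0, 0.0, 1.0, 0.0], a))  # → 3.0  (picks out a_3)
print(inner(ones, a))                  # → 10.0 (sum of entries)
print(inner(ones, a) / n)              # → 2.5  (average)
print(inner(a, a))                     # → 30.0 (sum of squares)
print(inner([1.0, 0.0, 1.0, 0.0], a))  # → 4.0  (selective sum a_1 + a_3)
```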
Block vectors. If the vectors a and b are block vectors, and the corresponding
blocks have the same sizes (in which case we say they conform), then we have
T
a1 b1
aT b = ... ... = aT1 b1 + + aTk bk .
ak bk
The inner product of block vectors is the sum of the inner products of the blocks.
Net present value. If the n-vector c represents a cash flow over n periods, and
d = (1, 1/(1 + r), . . . , 1/(1 + r)^{n−1} ), then the inner product dT c
is the discounted total of the cash flow, i.e., its net present value (NPV), with
interest rate r.
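The NPV computation can be sketched in a few lines of Python; the cash flow and interest rate below are made up for illustration:

```python
def npv(c, r):
    """Net present value of cash flow c at per-period interest rate r."""
    d = [(1.0 + r) ** -i for i in range(len(c))]   # discount vector
    return sum(di * ci for di, ci in zip(d, c))    # inner product d^T c

# a hypothetical cash flow: two interest payments, then principal + interest
c = [0.1, 0.1, 1.1]
print(round(npv(c, 0.03), 4))  # → 1.2339
```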
Portfolio return. Suppose the n-vector r gives the fractional returns of n assets
over some investment period, with
$$r_i = \frac{p_i^{\mathrm{final}} - p_i^{\mathrm{initial}}}{p_i^{\mathrm{initial}}}, \qquad i = 1, \ldots, n,$$
where $p_i^{\mathrm{initial}}$ and $p_i^{\mathrm{final}}$ are the prices of asset i at the beginning and end
of the investment period. If h is an n-vector giving our portfolio, with hi
denoting the dollar value of asset i held, then the inner product rT h is the
total return of the portfolio, in dollars, over the period. If w represents the
fractional (dollar) holdings of our portfolio, then rT w gives the total return
of the portfolio. For example, if rT w = 0.09, then our portfolio return is 9%.
If we had invested $10000 initially, we would have earned $900.
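A small numerical sketch in Python (prices and holdings are hypothetical):

```python
p_initial = [100.0, 50.0]
p_final = [106.0, 49.0]

# per-asset returns r_i = (p_final_i - p_initial_i) / p_initial_i
r = [(pf - pi) / pi for pi, pf in zip(p_initial, p_final)]
print(r)  # → [0.06, -0.02]

w = [0.6, 0.4]  # fractional holdings of the two assets
total_return = sum(ri * wi for ri, wi in zip(r, w))
print(round(total_return, 3))  # → 0.028, i.e., a 2.8% portfolio return
```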
1.5 Complexity of vector computations

Computer representation of numbers and vectors. Real numbers are stored in
computers in floating point format, which represents each real number
approximately, using 64 bits; this level of accuracy is sufficient for almost all practical
applications. Integers are stored in a more compact format, and are represented
exactly. Vectors are stored as arrays of floating point numbers (or integers, when
the entries are all integers). Storing an n-vector requires 8n bytes. Current
memory and storage devices, with capacities measured in many gigabytes (10^9
bytes), can easily store vectors with dimensions in the millions or billions. Sparse
vectors are stored in a more efficient way that keeps track of indexes and values of
the nonzero entries.
Flop counts and complexity. So far we have seen only a few vector operations,
like scalar multiplication, vector addition, and the inner product. How quickly
these operations can be carried out depends very much on the computer hardware
and software, and the size of the vector.
A very rough estimate of the time required to carry out some computation, such
as an inner product, can be found by counting the total number of basic arithmetic
operations (addition, subtraction, multiplication, and division of two numbers).
Since numbers are stored in floating point format on computers, these operations
are called floating point operations, or FLOPs. This term is in such common use
that the acronym is now written in lower case letters, as flops, and the speed with
which a computer can carry out flops is expressed in Gflop/s (gigaflops per second,
i.e., billions of flops per second). Typical current values are in the range of
1–10 Gflop/s, but this can vary by several orders of magnitude. The actual time it
takes a computer to carry out some computation depends on many other factors
beyond the total number of flops required, so time estimates based on counting
flops are very crude, and are not meant to be more accurate than a factor of ten
or so. For this reason, gross approximations (such as ignoring a factor of 2) can be
used when counting the flops required in a computation.
The complexity of an operation is the number of flops required to carry it out, as
a function of the size or sizes of the input to the operation. Usually the complexity
is highly simplified, dropping terms that are small or negligible (compared to other
terms) when the sizes of the inputs are large.
Since we do not expect flop counts to predict the running time with an accuracy
better than a factor of 2, such simplifications are justified. The order of a
computation is its simplified flop count with the leading coefficient dropped; for
example, the inner product of two n-vectors requires 2n − 1 flops (n multiplications
and n − 1 additions), so it is an order n computation. The order is useful in
understanding how the time to execute
the computation will scale when the size of the operands changes. An order n
computation should take around 10 times longer to carry out its computation on
an input that is 10 times bigger.
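This scaling can be observed directly with a rough timing experiment. A Python sketch (the measured times are machine-dependent, so no particular values should be expected; the point is the roughly 10x growth):

```python
import time

def inner(a, b):
    """Inner product, an order n computation."""
    return sum(ai * bi for ai, bi in zip(a, b))

for n in [10**5, 10**6]:
    a = [1.0] * n
    b = [2.0] * n
    t0 = time.perf_counter()
    result = inner(a, b)
    elapsed = time.perf_counter() - t0
    # inner product is order n: 10x the size should take roughly 10x the time
    print(n, result, elapsed)
```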
Chapter 2
Linear functions
In this chapter we introduce linear and affine functions, and describe some common
settings where they arise, including regression models.
As an example, some quantity of interest for an airplane can be expressed as a
scalar-valued function f (x) of the 4-vector x, where x1 is the angle of attack of the airplane (i.e., the angle
between the airplane body and its direction of motion), x2 is its air speed, x3 is
the air density, and x4 is the angle of the airplane elevator control surface.
f (x) = aT x = a1 x1 + a2 x2 + · · · + an xn (2.1)
for any n-vector x. This function gives the inner product of its n-dimensional
argument x with some (fixed) n-vector a. We can also think of f as forming a
weighted sum of the elements of x; the elements of a give the weights used in
forming the weighted sum.
Superposition and linearity. The inner product function f defined in (2.1) satisfies
the property
f (αx + βy) = aT (αx + βy)
= aT (αx) + aT (βy)
= α(aT x) + β(aT y)
= αf (x) + βf (y)
for all n-vectors x, y, and all scalars α, β. This property is called superposition.
A function that satisfies the superposition property is called linear. We have just
shown that the inner product with a fixed vector is a linear function.
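The superposition property can be verified numerically for a specific inner product function. A Python sketch (the coefficient vector and test data are ours):

```python
a = [2.0, -1.0, 0.5]

def f(x):
    """The inner product function f(x) = a^T x."""
    return sum(ai * xi for ai, xi in zip(a, x))

x = [1.0, 0.0, 2.0]
y = [0.0, 3.0, -1.0]
alpha, beta = 1.5, -2.0

lhs = f([alpha * xi + beta * yi for xi, yi in zip(x, y)])  # f(αx + βy)
rhs = alpha * f(x) + beta * f(y)                           # αf(x) + βf(y)
print(abs(lhs - rhs) < 1e-12)  # → True
```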
The superposition equality
f (α1 x1 + · · · + αk xk ) = α1 f (x1 ) + · · · + αk f (xk ),
which holds for any linear function f , any scalars α1 , . . . , αk , and any n-vectors
x1 , . . . , xk , can be derived by applying two-term superposition repeatedly:
f (α1 x1 + · · · + αk xk ) = α1 f (x1 ) + f (α2 x2 + · · · + αk xk )
= α1 f (x1 ) + α2 f (x2 ) + f (α3 x3 + · · · + αk xk )
...
= α1 f (x1 ) + · · · + αk f (xk ).
In the first step we apply two-term superposition to the argument
α1 x1 + (1)(α2 x2 + · · · + αk xk ),
and in the subsequent steps we proceed in the same way.
f (x) = f (x1 e1 + · · · + xn en )
= x1 f (e1 ) + · · · + xn f (en )
= aT x, (2.3)
where a = (f (e1 ), . . . , f (en )). This formula,
which holds for any linear scalar-valued function f , has several interesting
implications. Suppose, for example, that the linear function f is given as a subroutine
(or a physical system) that computes (or results in the output) f (x) when we give
the argument (or input) x. Once we have found f (e1 ), . . . , f (en ), by n calls to the
subroutine (or n experiments), we can predict (or simulate) what f (x) will be, for
any vector x, using the formula (2.3).
The representation of a linear function f as f (x) = aT x is unique, which means
that there is only one vector a for which f (x) = aT x holds for all x. To see this,
suppose that we have f (x) = aT x for all x, and also f (x) = bT x for all x. Taking
x = ei , we have f (ei ) = aT ei = ai , using the formula f (x) = aT x. Using the
formula f (x) = bT x, we have f (ei ) = bT ei = bi . These two numbers must be the
same, so we have ai = bi . Repeating this argument for i = 1, . . . , n, we conclude
that the corresponding elements in a and b are the same, so a = b.
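The probing idea (recover a from n calls to f) is easy to demonstrate. A Python sketch, where the "black box" function and its hidden coefficients are made up for illustration:

```python
def f(x):
    # a linear function given as a "black box" subroutine;
    # its coefficient vector (3, -1, 2) is treated as unknown
    return 3.0 * x[0] - 1.0 * x[1] + 2.0 * x[2]

n = 3
a = []
for i in range(n):
    e_i = [1.0 if j == i else 0.0 for j in range(n)]
    a.append(f(e_i))  # a_i = f(e_i)
print(a)  # → [3.0, -1.0, 2.0]

# now a^T x reproduces f(x) for any x
x = [1.0, 2.0, 3.0]
print(sum(ai * xi for ai, xi in zip(a, x)), f(x))  # → 7.0 7.0
```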
Examples.

Average. The average of the entries of an n-vector x, given by
(x1 + · · · + xn )/n, is denoted avg(x) (and sometimes x̄). The average of a
vector is a linear function. It can be expressed as avg(x) = aT x with
a = (1/n, . . . , 1/n) = 1/n.
A formula analogous to (2.3) is
f (x) = f (0) + x1 (f (e1 ) − f (0)) + · · · + xn (f (en ) − f (0)),
which holds when f is affine, and x is any n-vector. This formula shows that for
an affine function, once we know the n + 1 numbers f (0), f (e1 ), . . . , f (en ), we can
predict (or reconstruct or evaluate) f (x) for any n-vector x.
In some contexts affine functions are called linear. For example, when x is a
scalar, the function f defined as f (x) = αx + β is sometimes referred to as a linear
function of x, perhaps because its graph is a line. But when β ≠ 0, f is not a linear
function of x, in the standard mathematical sense; it is an affine function of x.
In this book we will distinguish between linear and affine functions. Two simple
examples are shown in figure 2.1.
Figure 2.1 Left. The function f is linear. Right. The function g is affine,
but not linear.
Table 2.1 Loadings on a bridge (first three columns), and the associated
measured sag at a certain point (fourth column) and the predicted sag using
the affine model constructed from the first three experiments (fifth column).
denote the distance that a specific point on the bridge sags, in centimeters, due
to the load w. This is shown in figure 2.2. For weights the bridge is designed
to handle, the sag is very well approximated as a linear function s = f (w). This
function can be expressed as an inner product, s = cT w, for some n-vector c. From
the equation s = c_1 w_1 + \cdots + c_n w_n, we see that c_1 w_1 is the amount of the sag that
is due to the weight w1 , and similarly for the other weights. The coefficients ci ,
which have units of cm/ton, are called compliances, and give the sensitivity of the
sag with respect to loads applied at the n locations.
The vector c can be computed by (numerically) solving a partial differential
equation, given the detailed design of the bridge and the mechanical properties of
the steel used to construct it. This is always done during the design of a bridge.
The vector c can also be measured once the bridge is built, using the formula (2.3).
We apply the load w = e1 , which means that we place a one ton load at the first
load position on the bridge, with no load at the other positions. We can then
measure the sag, which is c1 . We repeat this experiment, moving the one ton load
to positions 2, 3, . . . , n, which gives us the coefficients c2 , . . . , cn . At this point
we have the vector c, so we can now predict what the sag will be with any other
loading. To check our measurements (and linearity of the sag function) we might
measure the sag under other more complicated loadings, and in each case compare
our prediction (i.e., cT w) with the actual measured sag.
Table 2.1 shows what the results of these experiments might look like, with each
row representing an experiment (i.e., placing the loads and measuring the sag). In
the last two rows we compare the measured sag and the predicted sag, using the
linear function with coefficients found in the first three experiments.
\hat f(x) = f(z) + \frac{\partial f}{\partial x_1}(z)(x_1 - z_1) + \cdots + \frac{\partial f}{\partial x_n}(z)(x_n - z_n),   (2.4)

where \frac{\partial f}{\partial x_i}(z) denotes the partial derivative of f with respect to its ith argument, evaluated at the n-vector z. The hat appearing over f on the left-hand side is a common notational hint that it is an approximation of the function f.
The first-order Taylor approximation \hat f(x) is a very good approximation of f(x) when all x_i are near the associated z_i. Sometimes \hat f is written with a second vector argument, as \hat f(x; z), to show the point z at which the approximation is developed.
The first term in the Taylor approximation is a constant; the other terms can be
interpreted as the contribution to the (approximate) change in the function value
(from f (z)) due to the change in xi (from zi ).
Evidently \hat f is an affine function of x. (It is sometimes called the linear approximation of f near z, even though it is in general affine, and not linear.) It can be written compactly using inner product notation as

\hat f(x) = f(z) + \nabla f(z)^T (x - z),

where \nabla f(z) is an n-vector, the gradient of f (at z),

\nabla f(z) = \begin{bmatrix} \frac{\partial f}{\partial x_1}(z) \\ \vdots \\ \frac{\partial f}{\partial x_n}(z) \end{bmatrix}.   (2.5)

The first term in the Taylor approximation (2.4) is the constant f(z), the value of the function when x = z. The second term is the inner product of the gradient of f at z and the deviation or perturbation of x from z, i.e., x - z.
We can express the first-order Taylor approximation as a linear function plus a constant,

\hat f(x) = \nabla f(z)^T x + (f(z) - \nabla f(z)^T z),

but the form (2.4) is perhaps easier to interpret.
The first-order Taylor approximation gives us an organized way to construct an affine approximation of a function f : R^n \to R, near a given point z, when there is a formula or equation that describes f, and it is differentiable. A simple example, for n = 1, is shown in figure 2.3. Over the full x-axis scale shown, the Taylor approximation \hat f does not give a good approximation of the function f. But for x near z, the Taylor approximation is very good.
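A small Python sketch of the first-order Taylor approximation (illustrative only; the function f below is made up, and the partial derivatives are estimated numerically, while the text assumes they are available analytically):

```python
def num_grad(f, z, h=1e-6):
    """Numerical gradient of f at z via central differences."""
    g = []
    for i in range(len(z)):
        zp, zm = list(z), list(z)
        zp[i] += h
        zm[i] -= h
        g.append((f(zp) - f(zm)) / (2 * h))
    return g

def taylor_hat(f, z, x):
    """First-order Taylor approximation f(z) + grad f(z)^T (x - z)."""
    g = num_grad(f, z)
    return f(z) + sum(gi * (xi - zi) for gi, xi, zi in zip(g, x, z))

# A hypothetical function of two variables.
def f(x):
    return x[0] * x[1] + x[0] ** 2

approx = taylor_hat(f, [1.0, 2.0], [1.1, 2.1])
exact = f([1.1, 2.1])
# approx is close to exact when x is near z, as described in the text
```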
Figure 2.3 A function f of one variable, and the first-order Taylor approximation \hat f(x) = f(z) + f'(z)(x - z) at z.
Table 2.2 Some values of x (first column), the function value f(x) (second column), the Taylor approximation \hat f(x) (third column), and the error (fourth column).
Table 2.2 shows f(x) and \hat f(x), and the approximation error |\hat f(x) - f(x)|, for some values of x relatively near z. We can see that \hat f is indeed a very good approximation of f, especially when x is near z.
2.3 Regression model
\hat y = x^T \beta + v,   (2.6)

Often we omit the tildes, and simply write this as \hat y = x^T \beta, where we assume that
the first feature in x is the constant 1. A feature that always has the value 1 is
not particularly informative or interesting, but it does simplify the notation in a
regression model.
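A regression model prediction is simply an inner product plus an offset. A minimal sketch in Python (the offset and feature values here are assumptions for illustration, not data quoted from the book):

```python
def regression_predict(x, beta, v):
    """Regression model prediction y_hat = x^T beta + v."""
    return sum(xi * bi for xi, bi in zip(x, beta)) + v

beta = [148.73, -18.85]   # weights: house area, number of bedrooms
v = 54.40                 # offset (an assumed value for this sketch)
x = [0.846, 1.0]          # hypothetical house: area in 1000 sq. ft., bedrooms
y_hat = regression_predict(x, beta, v)   # predicted price, thousands of dollars
```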
Table 2.3 Five houses with associated feature vectors shown in the second
and third columns. The fourth and fifth column give the actual price, and
the price predicted by the regression model.
If y represents the selling price of the house, in thousands of dollars, the regression
model y = xT + v predicts the price in terms of the attributes or features. This
regression model is not meant to describe an exact relationship between the house
attributes and its selling price; it is a model or approximation. Indeed, we would
expect such a model to give, at best, only a crude approximation of selling price.
As a specific numerical example, consider the regression model parameters

\beta = (148.73, -18.85),   v = 54.40.   (2.8)
These parameter values were found using the methods we will see in chapter 13,
based on records of sales for 774 houses in the Sacramento area. Table 2.3 shows
the feature vectors x for five houses that sold during the period, the actual sale
price y, and the predicted price \hat y from the regression model above. Figure 2.4
shows the predicted and actual sale prices for 774 houses, including the five houses
in the table, on a scatter plot, with actual price on the horizontal axis and predicted
price on the vertical axis.
We can see that this particular regression model gives reasonable, but not very
accurate, predictions of the actual sale price. (Regression models for house prices
that are used in practice use many more than two regressors, and are much more
accurate.)
The model parameters in (2.8) are readily interpreted. The parameter \beta_1 = 148.73 is the amount the regression model price prediction increases (in thousands of dollars) when the house area increases by 1000 square feet (with the same number of bedrooms). The parameter \beta_2 = -18.85 is the price prediction increase with the addition of one bedroom, with the total house area held constant, in units of thousands of dollars per bedroom. It might seem strange that \beta_2 is negative, since one imagines that adding a bedroom to a house would increase its sale price, not
one imagines that adding a bedroom to a house would increase its sale price, not
Figure 2.4 Scatter plot of actual and predicted sale prices for 774 houses
sold in Sacramento during a five-day period.
decrease it. The regression model (2.8) does predict that adding a bedroom to
a house will increase its sale price, provided the additional bedroom has an area
exceeding around 127 square feet. In any case, the regression model is crude enough
that any interpretation is dubious.
Chapter 3

Norm and distance
In this chapter we focus on the norm of a vector, a measure of its length, and on
related concepts like distance, angle, standard deviation, and correlation.
3.1 Norm
The Euclidean norm of an n-vector x (named after the ancient Greek mathematician Euclid), denoted \|x\|, is the squareroot of the sum of the squares of its elements,

\|x\| = \sqrt{x_1^2 + x_2^2 + \cdots + x_n^2}.
The Euclidean norm is sometimes written with a subscript 2, as \|x\|_2. (The subscript 2 indicates that the entries of x are raised to the 2nd power.) Other less widely used terms for the Euclidean norm of a vector are the magnitude, or length, of a vector. (The term length should be avoided, since it is also often used to refer to the dimension of the vector.) We can express the Euclidean norm in terms of the inner product of x with itself:

\|x\| = \sqrt{x^T x}.
We use the same notation for the norm of vectors of different dimensions.
As simple examples, we have

\|(2, -1, 2)\| = \sqrt{9} = 3,   \|(0, -1)\| = 1.
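Computing the norm directly from its definition can be sketched in a few lines of Python:

```python
import math

def norm(x):
    """Euclidean norm: the squareroot of the sum of squares of the entries."""
    return math.sqrt(sum(xi ** 2 for xi in x))

norm([2.0, -1.0, 2.0])   # 3.0
```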
When x is a scalar, i.e., a 1-vector, the Euclidean norm is the same as the
absolute value of x. Indeed, the Euclidean norm can be considered a generalization
or extension of the absolute value or magnitude, that applies to vectors. The double
bar notation is meant to suggest this. Like the absolute value of a number, the
norm of a vector is a (numerical) measure of its size. We say a vector is small if
its norm is a small number, and we say it is large if its norm is a large number.
(The numerical values of the norm that qualify for small or large depend on the
particular application and context.)
When we form the norm of a vector x, the entries all have equal status. This
makes sense when the entries of the vector x represent the same type of quantity,
using the same units (say, at different times or locations), for example meters
or dollars. When the entries of a vector represent different types of quantities,
the units should be chosen so the numerical values of the entries are comparable.
(Another approach is to use a weighted norm, described later on page 40.)
Properties of norm. Some important properties of the Euclidean norm are given below. Here x and y are vectors of the same size, and \beta is a scalar.

Homogeneity. \|\beta x\| = |\beta| \|x\|. Multiplying a vector by a scalar multiplies the norm by the absolute value of the scalar.

Triangle inequality. \|x + y\| \leq \|x\| + \|y\|. The Euclidean norm of a sum of two vectors is no more than the sum of their norms. (The name of this property will be explained later.)

Nonnegativity. \|x\| \geq 0.

Definiteness. \|x\| = 0 only if x = 0.
The last two properties together, which state that the norm is always nonnegative,
and zero only when the vector is zero, are called positive definiteness. The first,
third, and fourth properties are easy to show directly from the definition of the
norm. As an example, let's verify the definiteness property. If \|x\| = 0, then we also have \|x\|^2 = 0, which means that x_1^2 + \cdots + x_n^2 = 0. This is a sum of n nonnegative numbers, which is zero. We can conclude that each of the n numbers is zero, since if any of them were nonzero the sum would be positive. So we conclude that x_i^2 = 0 for i = 1, \ldots, n, and therefore x_i = 0 for i = 1, \ldots, n; and thus, x = 0.
Establishing the second property, the triangle inequality, is not as easy; we will
give a derivation a bit later.
Any real-valued function of an n-vector that satisfies the four properties listed
above is called a (general) norm. But in this book we will only use the Euclidean
norm, so from now on, we refer to the Euclidean norm as the norm.
Norm of a sum. A useful formula for the norm of the sum of two vectors x and y is

\|x + y\| = \sqrt{\|x\|^2 + 2 x^T y + \|y\|^2}.   (3.1)

To derive this formula, we start with the square of the norm of x + y and use various properties of the inner product:

\|x + y\|^2 = (x + y)^T (x + y)
            = x^T x + x^T y + y^T x + y^T y
            = \|x\|^2 + 2 x^T y + \|y\|^2.

Taking the squareroot of both sides yields the formula (3.1) above. In the first line, we use the definition of the norm. In the second line, we expand the inner product. In the third line we use the definition of the norm, and the fact that x^T y = y^T x.
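The identity (3.1) is easy to check numerically; a small sketch, using arbitrary example vectors:

```python
import math

def norm(x):
    return math.sqrt(sum(xi ** 2 for xi in x))

def inner(x, y):
    return sum(xi * yi for xi, yi in zip(x, y))

x = [1.0, 2.0, -1.0]
y = [3.0, 0.0, 2.0]
lhs = norm([xi + yi for xi, yi in zip(x, y)])                    # ||x + y||
rhs = math.sqrt(norm(x) ** 2 + 2 * inner(x, y) + norm(y) ** 2)   # formula (3.1)
# lhs and rhs agree
```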
Norm of block vectors. The norm-squared of a stacked vector is the sum of the norm-squared values of its subvectors. For example, with d = (a, b, c) (where a, b, and c are vectors), we have

\|d\|^2 = \|a\|^2 + \|b\|^2 + \|c\|^2.

This idea is often used in reverse, to express the sum of the norm-squared values of some vectors as the norm-squared value of a block vector formed from them. We can write the equality above in terms of norms as

\|(a, b, c)\| = \sqrt{\|a\|^2 + \|b\|^2 + \|c\|^2} = \|(\|a\|, \|b\|, \|c\|)\|.

In words: The norm of a stacked vector is the norm of the vector formed from the norms of the subvectors. (Which is quite a mouthful.) The right-hand side of the equation above should be carefully read. The outer norm symbols enclose a 3-vector, with (scalar) entries \|a\|, \|b\|, and \|c\|.
Chebyshev inequality. Suppose that x is an n-vector, and that k of its entries satisfy |x_i| \geq a, where a > 0. Then

\|x\|^2 = x_1^2 + \cdots + x_n^2 \geq k a^2,

since k of the numbers in the sum are at least a^2, and the other n - k numbers are nonnegative. We can conclude that k \leq \|x\|^2/a^2, which is called the Chebyshev inequality. When \|x\|^2/a^2 \geq n, the inequality tells us nothing, since we always have k \leq n. In other cases it limits the number of entries in a vector that can be large. For a > \|x\|, the inequality is k \leq \|x\|^2/a^2 < 1, so we conclude that k = 0 (since k is an integer). In other words, no entry of a vector can be larger in magnitude than the norm of the vector.

The Chebyshev inequality is easier to interpret in terms of the RMS value of a vector. We can write it as

\frac{k}{n} \leq \left( \frac{\mathrm{rms}(x)}{a} \right)^2,   (3.2)
where k is, as above, the number of entries of x with absolute value at least a. The
left-hand side is the fraction of entries of the vector that are at least a in absolute
value. The right-hand side is the inverse square of the ratio of a to rms(x). It says,
for example, that no more than 1/25 = 4% of the entries of a vector can exceed
its RMS value by more than a factor of 5. The Chebyshev inequality partially
justifies the idea that the RMS value of a vector gives an idea of the size of a
typical entry: It states that not too many of the entries of a vector can be much
bigger (in absolute value) than its RMS value. (A converse statement can also be
made: At least one entry of a vector has absolute value as large as the RMS value
of the vector.)
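The Chebyshev inequality can be checked on any example vector; a small sketch:

```python
def chebyshev(x, a):
    """Return k, the number of entries with |x_i| >= a, and the
    Chebyshev bound ||x||^2 / a^2 (k <= bound always holds)."""
    k = sum(1 for xi in x if abs(xi) >= a)
    bound = sum(xi ** 2 for xi in x) / a ** 2
    return k, bound

k, bound = chebyshev([1.0, -4.0, 0.5, 2.0, -0.1], 2.0)   # k = 2
```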
Weighted norm. One useful variation is the weighted norm,

\|x\| = \sqrt{(x_1/w_1)^2 + \cdots + (x_n/w_n)^2},

where w_1, \ldots, w_n are given positive weights, used to assign more or less importance to the different elements of the n-vector x. If all the weights are one, the weighted norm reduces to the usual (unweighted) norm.
Weighted norms arise naturally when the elements of the vector x have different physical units, or natural ranges of values. One common rule of thumb is to choose w_i equal to the typical value of |x_i| in the application or setting. This choice of weights brings all the terms in the sum to the same order, one. We can also imagine that the weights have the same physical units as the elements x_i, which makes the terms in the sum (and therefore the norm as well) unitless.
3.2 Distance
Euclidean distance. We can use the norm to define the Euclidean distance between two vectors a and b as the norm of their difference:

\mathrm{dist}(a, b) = \|a - b\|.
For one, two, and three dimensions, this distance is exactly the usual distance
between points with coordinates a and b, as illustrated in figure 3.1. But the
Euclidean distance is defined for vectors of any dimension; we can refer to the
distance between two vectors of dimension 100. Since we only use the Euclidean
norm in this book, we will refer to the Euclidean distance between vectors as,
simply, the distance between the vectors.
If a and b are n-vectors, we refer to the RMS value of the difference, \|a - b\|/\sqrt{n}, as the RMS deviation between the two vectors.
When the distance between two n-vectors x and y is small, we say they are close or nearby, and when the distance \|x - y\| is large, we say they are far. The particular numerical values of \|x - y\| that correspond to close or far depend on the particular application.
Figure 3.1 The norm of the displacement b - a is the distance between the points with coordinates a and b.
Triangle inequality. We can now explain where the triangle inequality gets its
name. Consider a triangle in two or three dimensions, whose vertices have coordi-
nates a, b, and c. The lengths of the sides are the distances between the vertices,
\mathrm{dist}(a, b) = \|a - b\|,   \mathrm{dist}(b, c) = \|b - c\|,   \mathrm{dist}(a, c) = \|a - c\|.

Geometric intuition tells us that the length of any side of a triangle cannot exceed the sum of the lengths of the other two sides. For example, we have

\|a - c\| \leq \|a - b\| + \|b - c\|.   (3.3)

This follows from the triangle inequality, since

\|a - c\| = \|(a - b) + (b - c)\| \leq \|a - b\| + \|b - c\|.
This is illustrated in figure 3.2.
Examples.
Feature distance. If x and y represent vectors of n features of two objects,
the quantity kx yk is called the feature distance, and gives a measure of
how different the objects are (in terms of the feature values). Suppose for
example the feature vectors are associated with patients in a hospital, with
entries such as weight, age, presence of chest pain, difficulty breathing, and
the results of tests. We can use feature vector distance to say that one patient
case is near another one (at least in terms of their feature vectors).
Figure 3.2 Triangle with vertices at coordinates a, b, and c, and side lengths \|a - b\|, \|b - c\|, and \|a - c\|.
Figure 3.3 The point z3 is the nearest neighbor of x among the points z1 ,
. . . , z6 .
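Finding the nearest neighbor is a direct application of the distance; a small sketch, with made-up points:

```python
import math

def dist(a, b):
    """Euclidean distance ||a - b||."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def nearest_neighbor(x, points):
    """The point in `points` at minimum Euclidean distance from x."""
    return min(points, key=lambda z: dist(x, z))

points = [[0.0, 0.0], [2.0, 1.0], [5.0, 5.0]]
nearest_neighbor([1.8, 1.2], points)   # [2.0, 1.0]
```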
RMS prediction error. Suppose that the n-vector y represents a time series of some quantity, for example, hourly temperature at some location, and \hat y is another n-vector that represents an estimate or prediction of the time series y, based on other information. The difference \hat y - y is called the prediction error, and its RMS value \mathrm{rms}(\hat y - y) is called the RMS prediction error. If this value is small (say, compared to \mathrm{rms}(y)) the prediction is good.
Table 3.1 Pairwise word count histogram distances between five Wikipedia
articles.
different authors. As an example we form the word count histograms for the
5 Wikipedia articles with titles Veterans Day, Memorial Day, Academy
Awards, Golden Globe Awards, and Super Bowl, using a dictionary of
4423 words. (More detail is given in 4.4.) The pairwise distances between
the word count histograms is shown in table 3.1. We can see that pairs of
related articles have smaller word count histogram distances than less related
pairs of articles.
As a simple example consider the vector x = (1, -2, 3, 2). Its mean or average value is avg(x) = 1, so the de-meaned vector is \tilde x = (0, -3, 2, 1). Its standard deviation is std(x) = 1.872. We interpret this number as a typical value by which the entries differ from the mean of the entries. These numbers are 0, 3, 2, and 1, so 1.872 is reasonable.
We should warn the reader that another slightly different definition of the standard deviation of a vector is widely used, in which the denominator \sqrt{n} in (3.4) is replaced with \sqrt{n-1} (for n \geq 2). In this book we will only use the definition (3.4).
The average, RMS value, and standard deviation of a vector are related by the
formula
\mathrm{rms}(x)^2 = \mathrm{avg}(x)^2 + \mathrm{std}(x)^2.   (3.5)

This formula makes sense: \mathrm{rms}(x)^2 is the mean square value of the entries of x, which can be expressed as the square of the mean value, plus the mean square fluctuation of the entries of x around their mean value. We can derive this formula
from our vector notation formula for std(x) given above. We have

\mathrm{std}(x)^2 = (1/n) \|x - (1^T x/n) 1\|^2
                 = (1/n) \left( \|x\|^2 - 2 x^T ((1^T x/n) 1) + ((1^T x/n) 1)^T ((1^T x/n) 1) \right)
                 = (1/n) \left( \|x\|^2 - (2/n)(1^T x)^2 + n (1^T x/n)^2 \right)
                 = (1/n) \|x\|^2 - \mathrm{avg}(x)^2,

which can be re-arranged to obtain the identity (3.5) above. This derivation uses many of the properties for norms and inner products, and should be read carefully to understand every step. In the second line, we expand the norm-square of the sum of two vectors. In the third line we use the commutative property of scalar-vector multiplication, moving scalars such as (1^T x/n) to the front of each term, and also the fact that 1^T 1 = n.
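The identity (3.5) can also be verified numerically, here on the example vector x = (1, -2, 3, 2) used earlier (for which std(x) \approx 1.872):

```python
import math

def avg(x):
    return sum(x) / len(x)

def rms(x):
    return math.sqrt(sum(xi ** 2 for xi in x) / len(x))

def std(x):
    m = avg(x)
    return rms([xi - m for xi in x])

x = [1.0, -2.0, 3.0, 2.0]
lhs = rms(x) ** 2                 # 4.5
rhs = avg(x) ** 2 + std(x) ** 2   # also 4.5, as (3.5) asserts
```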
Examples.
Mean return and risk. Suppose that an n-vector represents a time series of
return on an investment, expressed as a percentage, in n time periods over
some interval of time. Its average gives the mean return over the whole
interval, often shortened to its return. Its standard deviation is a measure of
how variable the return is, from period to period, over the time interval, i.e.,
how much it typically varies from its mean, and is often called the (per period)
risk of the investment. Multiple investments can be compared by plotting
them on a risk-return plot, which gives the mean and standard deviation of
the returns of each of the investments over some interval. A desirable return
history vector has high mean return and low risk; this means that the returns
in the different periods are consistently high. Figure 3.4 shows an example.
Temperature or rainfall. Suppose that an n-vector is a time series of the
daily average temperature at a particular location, over a one year period.
Its average gives the average temperature at that location (over the year) and
its standard deviation is a measure of how much the temperature varied from
its average value. We would expect the average temperature to be high and
the standard deviation to be low in a tropical location, and the opposite for
a location with high latitude.
Adding a constant. For any vector x and any number a, we have std(x+a1) =
std(x). Adding a constant to every entry of a vector does not change its
standard deviation.
z = \frac{1}{\mathrm{std}(x)} (x - \mathrm{avg}(x) 1).
This vector is called the standardized version of x. It has mean zero, and standard
deviation one. Its entries are sometimes called the z-scores associated with the
original entries of x. For example, z4 = 1.4 means that x4 is 1.4 standard deviations
above the mean of the entries of x. Figure 3.5 shows an example.
The standardized values for a vector give a simple way to interpret the original values in the vector. For example, if an n-vector x gives the values of some medical test of n patients admitted to a hospital, the standardized values or z-scores tell us how high or low, compared to the population, each patient's value is. A value z_6 = -3.2, for example, means that patient 6 has a very low value of the measurement; whereas z_{22} = 0.3 says that patient 22's value is quite close to the average value.
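Standardization can be sketched directly from its definition:

```python
import math

def standardize(x):
    """Return z = (x - avg(x) 1) / std(x), the vector of z-scores."""
    n = len(x)
    m = sum(x) / n                                     # avg(x)
    s = math.sqrt(sum((xi - m) ** 2 for xi in x) / n)  # std(x)
    return [(xi - m) / s for xi in x]

z = standardize([1.0, -2.0, 3.0, 2.0])
# z has average zero and standard deviation one
```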
3.4 Angle
Cauchy-Schwarz inequality. An important inequality that relates norms and inner products is the Cauchy-Schwarz inequality:

|a^T b| \leq \|a\| \|b\|

for any n-vectors a and b. Written out in terms of the entries, this is

|a_1 b_1 + \cdots + a_n b_n| \leq \left( a_1^2 + \cdots + a_n^2 \right)^{1/2} \left( b_1^2 + \cdots + b_n^2 \right)^{1/2},
The Cauchy-Schwarz inequality can be verified as follows. It clearly holds if a = 0 or b = 0, so suppose a and b are nonzero, and let \alpha = \|a\| and \beta = \|b\|. Then

0 \leq \|\beta a - \alpha b\|^2 = \|\beta a\|^2 - 2 (\beta a)^T (\alpha b) + \|\alpha b\|^2 = 2 \|a\|^2 \|b\|^2 - 2 \|a\| \|b\| (a^T b).

Dividing by 2 \|a\| \|b\| yields a^T b \leq \|a\| \|b\|. Applying this inequality to a and -b we obtain -a^T b \leq \|a\| \|b\|. Putting these two inequalities together we get the Cauchy-Schwarz inequality, |a^T b| \leq \|a\| \|b\|.

This argument also reveals the conditions on a and b under which they satisfy the Cauchy-Schwarz inequality with equality. This occurs only if \|\beta a - \alpha b\| = 0, i.e., \beta a = \alpha b. This means that each vector is a scalar multiple of the other (in the case when they are nonzero). This statement remains true when either a or b is zero. So the Cauchy-Schwarz inequality holds with equality when one of the vectors is a multiple of the other; in all other cases, it holds with strict inequality.
Verification of triangle inequality. We can use the Cauchy-Schwarz inequality to verify the triangle inequality. Let a and b be any vectors. Then

\|a + b\|^2 = \|a\|^2 + 2 a^T b + \|b\|^2 \leq \|a\|^2 + 2 \|a\| \|b\| + \|b\|^2 = (\|a\| + \|b\|)^2,

where we used the Cauchy-Schwarz inequality in the second step. Taking the squareroot we get the triangle inequality, \|a + b\| \leq \|a\| + \|b\|.
Angle between vectors. The angle between two nonzero vectors a, b is defined as

\theta = \arccos \left( \frac{a^T b}{\|a\| \|b\|} \right),

where arccos denotes the inverse cosine, normalized to lie in the interval [0, \pi]. In other words, we define \theta as the unique number between 0 and \pi that satisfies

a^T b = \|a\| \|b\| \cos \theta.

The angle between a and b is written as \angle(a, b), and is sometimes expressed in degrees. (The default angle unit is radians; 360° is 2\pi radians.) For example, \angle(a, b) = 60° means \angle(a, b) = \pi/3, i.e., a^T b = (1/2) \|a\| \|b\|.
The angle coincides with the usual notion of angle between vectors, when they
have dimension two or three. For example, the angle between the vectors a =
(1, 2, 1) and b = (2, 0, 3) is

\arccos \left( \frac{5}{\sqrt{6} \sqrt{13}} \right) = \arccos(0.5661) = 0.9690 = 55.52°
(to 4 digits). But the definition of angle is more general; we can refer to the angle
between two vectors with dimension 100.
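Computing the angle from its definition, as a Python sketch that reproduces the worked example above:

```python
import math

def angle(a, b):
    """Angle in radians between nonzero vectors a and b."""
    inner = sum(ai * bi for ai, bi in zip(a, b))
    na = math.sqrt(sum(ai ** 2 for ai in a))
    nb = math.sqrt(sum(bi ** 2 for bi in b))
    return math.acos(inner / (na * nb))

theta = angle([1.0, 2.0, 1.0], [2.0, 0.0, 3.0])   # about 0.9690 radians
theta_deg = math.degrees(theta)                    # about 55.52 degrees
```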
The angle is a symmetric function of a and b: we have \angle(a, b) = \angle(b, a). The angle is not affected by scaling each of the vectors by a positive scalar: we have, for any vectors a and b, and any positive numbers \alpha and \beta,

\angle(\alpha a, \beta b) = \angle(a, b).
Acute and obtuse angles. Angles are classified according to the sign of a^T b.

If the angle is zero, which means a^T b = \|a\| \|b\|, the vectors are aligned. Each vector is a positive multiple of the other (assuming the vectors are nonzero).

If the angle is \theta = 180°, which means a^T b = -\|a\| \|b\|, the vectors are anti-aligned. Each vector is a negative multiple of the other (assuming the vectors are nonzero).

If the angle is \theta = 90°, which means a^T b = 0, the vectors are orthogonal. If the angle is less than 90° (a^T b > 0) the angle is acute, and if it is greater than 90° (a^T b < 0) the angle is obtuse.
Figure 3.6 From left to right: examples of orthogonal, aligned, and anti-
aligned vectors, vectors that make an acute and an obtuse angle.
Figure 3.7 Two points a and b on a sphere with radius R and center at the origin. The spherical distance between the points is equal to R \angle(a, b).
Examples.
Spherical distance. Suppose a and b are 3-vectors that represent two points
that lie on a sphere of radius R (for example, locations on earth). The
spherical distance between them, measured along the sphere, is given by R \angle(a, b). This is illustrated in figure 3.7.
Table 3.2 Pairwise angles (in degrees) between word histograms of five
Wikipedia articles.
For vectors x and y we can express the norm of their sum in terms of the angle between them:

\|x + y\|^2 = \|x\|^2 + 2 x^T y + \|y\|^2 = \|x\|^2 + 2 \|x\| \|y\| \cos\theta + \|y\|^2,   (3.6)

where \theta = \angle(x, y). (The first equality comes from (3.1).) From this we can make several observations.

If x and y are aligned (\theta = 0°), we have \|x + y\| = \|x\| + \|y\|. Thus, their norms add.

If x and y are orthogonal (\theta = 90°), we have \|x + y\|^2 = \|x\|^2 + \|y\|^2. In this case the norm-squared values add, and we have \|x + y\| = \sqrt{\|x\|^2 + \|y\|^2}.
Correlation coefficient. Suppose a and b are n-vectors, with associated de-meaned vectors \tilde a = a - \mathrm{avg}(a) 1 and \tilde b = b - \mathrm{avg}(b) 1. Assuming these de-meaned vectors are not zero, we define their correlation coefficient as

\rho = \frac{\tilde a^T \tilde b}{\|\tilde a\| \|\tilde b\|}.   (3.7)

Thus, \rho = \cos\theta, where \theta = \angle(\tilde a, \tilde b). We can also express the correlation coefficient in terms of the vectors u and v obtained by standardizing a and b. With u = \tilde a / \mathrm{std}(a) and v = \tilde b / \mathrm{std}(b), we have

\rho = u^T v / n.   (3.8)

(We use \|u\| = \|v\| = \sqrt{n}.)
This is a symmetric function of the vectors: the correlation coefficient between a and b is the same as the correlation coefficient between b and a. The Cauchy-Schwarz inequality tells us that the correlation coefficient ranges between -1 and +1. For this reason, the correlation coefficient is sometimes expressed as a percentage. For example, \rho = 30% means \rho = 0.3. When \rho = 0, we say the vectors are uncorrelated.
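The correlation coefficient, computed per (3.7) from the de-meaned vectors, as a short sketch:

```python
import math

def corrcoef(a, b):
    """Correlation coefficient of a and b, computed from their
    de-meaned versions (assumed nonzero)."""
    n = len(a)
    at = [ai - sum(a) / n for ai in a]   # de-meaned a
    bt = [bi - sum(b) / n for bi in b]   # de-meaned b
    num = sum(x * y for x, y in zip(at, bt))
    den = math.sqrt(sum(x * x for x in at)) * math.sqrt(sum(y * y for y in bt))
    return num / den

corrcoef([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])   # 1.0  (perfectly correlated)
corrcoef([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])   # -1.0 (perfectly anticorrelated)
```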
The correlation coefficient tells us how the entries in the two vectors vary together. Roughly speaking, high correlation (say, \rho = 0.8) means that entries of a
Figure 3.8 Three pairs of vectors a, b of length 10, with correlation coefficients 0.968 (top), -0.988 (middle), and 0.004 (bottom).
and b are typically above their mean for many of the same entries. The extreme case \rho = 1 occurs only if the de-meaned vectors \tilde a and \tilde b are aligned, which means that each is a positive multiple of the other, and the other extreme case \rho = -1 occurs only when \tilde a and \tilde b are negative multiples of each other. This idea is illustrated in figure 3.8, which shows the entries of two vectors, as well as a scatter plot of them, for cases with correlation near 1, near -1, and near 0.
The correlation coefficient is often used when the vectors represent time series,
such as the returns on two investments over some time interval, or the rainfall in
two locations over some time interval. If they are highly correlated (say, \rho > 0.8),
the two time series are typically above their mean values at the same times. For
example, we would expect the rainfall time series at two nearby locations to be
highly correlated. As another example, we might expect the returns of two similar
companies, in the same business area, to be highly correlated.
Standard deviation of sum. We can derive a formula for the standard deviation of a sum from (3.6):

\mathrm{std}(a + b) = \sqrt{\mathrm{std}(a)^2 + 2 \rho \, \mathrm{std}(a) \, \mathrm{std}(b) + \mathrm{std}(b)^2},   (3.9)

where \rho is the correlation coefficient of a and b. To derive this from (3.6) we let \tilde a and \tilde b denote the de-meaned versions of a and b. Then \tilde a + \tilde b is the de-meaned version of a + b, and \mathrm{std}(a + b)^2 = \|\tilde a + \tilde b\|^2 / n. Now using (3.6) and \rho = \cos \angle(\tilde a, \tilde b), we get

\mathrm{std}(a + b)^2 = \left( \|\tilde a\|^2 + 2 \rho \|\tilde a\| \|\tilde b\| + \|\tilde b\|^2 \right)/n = \mathrm{std}(a)^2 + 2 \rho \, \mathrm{std}(a) \, \mathrm{std}(b) + \mathrm{std}(b)^2,

which is the formula (3.9). If \rho = 1 we have \mathrm{std}(a + b) = \mathrm{std}(a) + \mathrm{std}(b); when \rho < 1, the standard deviation of the sum is smaller than \mathrm{std}(a) + \mathrm{std}(b) (unless one of them is zero). When \rho = -1, the standard deviation of the sum is as small as it can be,

\mathrm{std}(a + b) = |\mathrm{std}(a) - \mathrm{std}(b)|.
Hedging investments. Suppose that vectors a and b are time series of returns for two assets with the same return (average) \mu and risk (standard deviation) \sigma, and correlation coefficient \rho. (These are the traditional symbols used.) The vector c = (a + b)/2 is the time series of returns for an investment with 50% in each of the assets. This blended investment has the same return as the original assets, since

\mathrm{avg}(c) = \mathrm{avg}((a + b)/2) = (\mathrm{avg}(a) + \mathrm{avg}(b))/2 = \mu.

The risk of the blended investment is

\mathrm{std}(c) = \mathrm{std}(a + b)/2 = \sqrt{2\sigma^2 + 2\rho\sigma^2}/2 = \sigma \sqrt{(1 + \rho)/2},

using (3.9). From this we see that the risk of the blended investment is never more than the risk of the original assets, and is smaller when the correlation \rho of the original asset returns is smaller. When the returns are uncorrelated, the risk is a factor 1/\sqrt{2} = 0.707 smaller than the risk of the original assets. If the asset returns are strongly negatively correlated (i.e., \rho is near -1), the risk of the blended investment is much smaller than the risk of the original assets. Investing in two assets with uncorrelated, or negatively correlated, returns is called hedging (which is short for hedging your bets). Hedging reduces risk.
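Under the assumptions above (equal risk \sigma and return correlation \rho), the risk of the 50/50 blend implied by (3.9) can be sketched as:

```python
import math

def blended_risk(sigma, rho):
    """Risk (standard deviation) of the 50/50 blend of two assets with
    equal risk sigma and return correlation rho, via formula (3.9):
    std((a + b)/2) = std(a + b)/2 = sigma * sqrt((1 + rho)/2)."""
    return sigma * math.sqrt((1 + rho) / 2)

blended_risk(1.0, 0.0)    # 0.7071... (uncorrelated: risk shrinks by 1/sqrt(2))
blended_risk(1.0, -0.9)   # 0.2236... (negative correlation: much smaller risk)
blended_risk(1.0, 1.0)    # 1.0      (perfect correlation: no risk reduction)
```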
3.5 Complexity
Computing the norm of an n-vector requires n multiplications (to square each entry), n - 1 additions (to add the squares), and one squareroot. Even though computing the squareroot typically takes more time than computing the product or sum of two numbers, it is counted as just one flop. So computing the norm takes 2n flops. The cost of computing the RMS value of an n-vector is the same, since we can ignore the two flops involved in division by \sqrt{n}. Computing the distance between two vectors costs 3n flops, and computing the angle between them costs 6n flops. All of these operations have order n.
De-meaning an n-vector requires 2n flops (n for forming the average and an-
other n flops for subtracting the average from each entry). Computing the standard
deviation costs 4n flops, and standardizing an n-vector costs 5n flops. The corre-
lation coefficient between two vectors costs 8n flops to compute. These operations
also have order n.
As a slightly more involved computation, suppose that we wish to determine the nearest neighbor among a collection of k n-vectors z_1, \ldots, z_k to another n-vector x. (This will come up in the next chapter.) The simple approach is to compute the distances \|x - z_i\| for i = 1, \ldots, k, and then find the minimum of these. (Sometimes a comparison of two numbers is also counted as a flop.) The cost of this is 3kn flops to compute the distances, and k - 1 comparisons to find the minimum. The latter term can be ignored, so the flop count is 3kn. The order of finding the nearest neighbor in a collection of k n-vectors is kn.
Chapter 4
Clustering
In this chapter we consider the task of clustering a collection of vectors into groups
or clusters of vectors that are close to each other, as measured by the distance
between pairs of them. We describe a famous clustering method, called the k-
means algorithm, and give some typical applications.
The material in this chapter will not be used in the sequel. But the ideas, and
the k-means algorithm in particular, are widely used in practical applications, and
rely only on the ideas developed in the previous three chapters. So this chapter
can be considered an interlude that covers useful material that builds on the ideas
developed so far.
4.1 Clustering
Suppose we have N n-vectors, x1 , . . . , xN . The goal of clustering is to group or
partition the vectors (if possible) into k groups or clusters, with the vectors in each
group close to each other. Clustering is very widely used in many application areas,
typically (but not always) when the vectors represent features of objects.
Normally we have k \ll N, which means that there are many more vectors than
groups. Typical applications use values of k that range from a handful to a few
hundred or more, with values of N that range from hundreds to billions. Part of
the task of clustering a collection of vectors is to determine whether or not the
vectors can be divided into k groups, with vectors in each group near each other.
Of course this depends on k, the number of clusters, and the particular data, i.e.,
the vectors x1 , . . . , xN .
Figure 4.1 shows a simple example, with N = 300 2-vectors, shown as small
circles. We can easily see that this collection of vectors can be divided into k = 3
clusters, shown on the right with the colors representing the different clusters. We
could partition these data into other numbers of clusters, but we can see that k = 3
is a good value.
This example is not typical in several ways. First, the vectors have dimension
n = 2. Clustering any set of 2-vectors is easy: We simply scatter plot the values
Figure 4.1 300 points in a plane. The points can be clustered in the three
groups shown on the right.
and check visually if the data are clustered, and if so, how many clusters there are.
In almost all applications n is larger than 2 (and typically, much larger than 2),
in which case this simple visual method cannot be used. The second way in which
it is not typical is that the points are very well clustered. In most applications,
the data are not as cleanly clustered as in this simple example; there are several or
even many points that lie in between clusters. Finally, in this example, it is clear
that the best choice of k is k = 3. In real examples, it can be less clear what the
best value of k is. But even when the clustering is not as clean as in this example,
and the best value of k is not clear, clustering can be very useful in practice.
Before we delve more deeply into the details of clustering and clustering algo-
rithms, we list some common applications where clustering is used.
Examples.
Topic discovery. Suppose xi are word histograms associated with N docu-
ments. A clustering algorithm partitions the documents into k groups, which
typically can be interpreted as groups of documents with the same or similar
topics, genre, or author. Since the clustering algorithm runs automatically
and without any understanding of what the words in the dictionary mean,
this is sometimes called automatic topic discovery.
Patient clustering. If xi are feature vectors associated with N patients ad-
mitted to a hospital, a clustering algorithm clusters the patients into k groups
of similar patients (at least in terms of their feature vectors).
Customer market segmentation. Suppose the vector xi gives the quantities (or
dollar values) of n items purchased by customer i over some period of time. A
clustering algorithm will group the customers into k market segments, which
are groups of customers with similar purchasing patterns.
ZIP-code clustering. Suppose that xi is a vector giving n quantities or statis-
tics for the residents of ZIP-code i, such as numbers of residents in various
age groups, household size, education statistics, and income statistics. (In
this example N is around 40000.) A clustering algorithm might be used to
cluster the 40000 ZIP-codes into, say, k = 100 groups of ZIP-codes with
similar statistics.
Student clustering. Suppose the vector xi gives the detailed grading record
of student i in a course, i.e., her grades on each question in the quizzes,
homework assignments, and exams. A clustering algorithm might be used to
cluster the students into k = 10 groups of students who performed similarly.
Survey response clustering. Suppose N people respond to a survey that asks n
questions, each answered on a five-point ordered scale such as Strongly Disagree,
Disagree, Neutral, Agree, Strongly Agree. (This is called a Likert scale.) Suppose
the n-vector x_i encodes the selections of respondent i on the n questions, using
the numerical coding −2, −1, 0, +1, +2 for the responses above. A clustering
algorithm can be used to cluster the respondents into k groups, each with similar
responses to the survey.
Weather zones. For each of N counties we have a 24-vector w_i that gives the
average monthly temperature in the first 12 entries and the average monthly
rainfall in the last 12 entries. (We can standardize all the temperatures, and
all the rainfall data, so they have a typical range between −1 and +1.) The
vector w_i summarizes the annual weather pattern in county i. A clustering
algorithm can be used to cluster the counties into k groups that have similar
weather patterns, called weather zones. This clustering can be shown on a
map, and used to recommend landscape plantings depending on zone.
Daily energy use patterns. The 24-vectors ui give the average (electric) en-
ergy use for N customers over some period (say, a month) for each hour of
the day. A clustering algorithm partitions customers into groups, each with
similar patterns of daily energy consumption. We might expect a clustering
algorithm to discover which customers have a swimming pool, an electric
water heater, or solar panels.
In each of these examples, it would be quite informative to know that the vectors
can be well clustered into, say, k = 5 or k = 37 groups. This can be used to develop
insight into the data. By examining the clusters we can often understand them,
and assign labels or descriptions to them.
(Here we are using the notation of sets; see appendix A.) Formally, we can express
these index sets in terms of the group assignment vector c as
G_j = {i | c_i = j},  for j = 1, …, k.
A clustering objective. We can now give a single number that we use to judge a
choice of clustering, along with a choice of the group representatives. We define

J^clust = (1/N) Σ_{i=1}^{N} ‖x_i − z_{c_i}‖²,

which is the mean squared distance from the vectors to their associated representatives.
Note that J^clust depends on the cluster assignments (i.e., c), as well as the
choice of the group representatives z_1, …, z_k. The smaller J^clust is, the better the
clustering. An extreme case is J^clust = 0, which means that the distance between
every original vector and its assigned representative is zero. This happens only
when the original collection of vectors takes only k different values, and each vector
is assigned to the representative it is equal to. (This extreme case would probably
not occur in practice.)
Our choice of clustering objective J^clust makes sense, since it encourages all
points to be near their associated representative, but there are other reasonable
choices as well.
Partitioning the vectors with the representatives fixed. Suppose that the group
representatives z_1, …, z_k are fixed, and we seek the group assignments c_1, …, c_N
that achieve the smallest possible value of J^clust. It turns out that this problem
can be solved exactly.
The objective J^clust is a sum of N terms. The choice of c_i (i.e., the group
to which we assign the vector x_i) only affects the ith term in J^clust, which is
(1/N)‖x_i − z_{c_i}‖². We can choose c_i to minimize just this term, since c_i does not
affect the other N − 1 terms in J^clust. How do we choose c_i to minimize this term?
This is easy: We simply choose c_i to be the value of j that minimizes ‖x_i − z_j‖
over j. In other words, we should assign each data vector x_i to its nearest neighbor
among the representatives. This choice of assignment is very natural, and easily
carried out.
So when the group representatives are fixed, we can readily find the best group
assignment (i.e., the one that minimizes J^clust), by assigning each vector to its
nearest representative. With this choice of group assignment, we have (by the way
the assignment is made)

J^clust = (1/N) Σ_{i=1}^{N} min_{j=1,…,k} ‖x_i − z_j‖².

This has a simple interpretation: It is the mean of the squared distance from the
data vectors to their closest representative.
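The assignment step can be sketched in Python with NumPy. (This is our own sketch; the function name and example data are not from the text.)

```python
import numpy as np

def assign_groups(x, z):
    """Assign each vector x[i] to its nearest representative.

    Returns the group assignment c (c[i] is the index of the nearest
    representative) and the resulting value of J^clust, the mean of the
    squared distances to the closest representatives.
    """
    c = []
    total = 0.0
    for xi in x:
        d2 = [np.sum((xi - zj) ** 2) for zj in z]
        j = int(np.argmin(d2))
        c.append(j)
        total += d2[j]
    return c, total / len(x)

x = [np.array([0.0, 0.0]), np.array([0.1, 0.2]), np.array([5.0, 5.0])]
z = [np.array([0.0, 0.1]), np.array([5.0, 5.0])]
c, J = assign_groups(x, z)   # c = [0, 0, 1]
```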
Optimizing the group representatives with the assignment fixed. Now we turn
to the problem of choosing the group representatives, with the clustering (group
assignments) fixed, in order to minimize our objective J clust . It turns out that this
problem also has a simple and natural solution.
We start by re-arranging the sum of N terms into k sums, each associated with
one group. We write

J^clust = J_1 + ⋯ + J_k,

where

J_j = (1/N) Σ_{i∈G_j} ‖x_i − z_j‖²

is the contribution to the objective J^clust from the vectors in group j. (The sum
here means that we should add up all terms of the form ‖x_i − z_j‖², for any i ∈ G_j,
i.e., for any vector x_i in group j; see appendix A.)
The choice of group representative z_j only affects the term J_j; it has no effect
on the other terms in J^clust. So we can choose each z_j to minimize J_j. Thus we
should choose the vector z_j so as to minimize the mean square distance to the
vectors in group j. This problem has a very simple solution: We should choose z_j
to be the average (or mean or centroid) of the vectors x_i in its group:

z_j = (1/|G_j|) Σ_{i∈G_j} x_i,

where |G_j| is standard mathematical notation for the number of elements in the
set G_j, i.e., the size of group j.
So if we fix the group assignments, we minimize J^clust by choosing each group
representative to be the average or centroid of the vectors assigned to its group.
(This is sometimes called the group centroid or cluster centroid.)
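In code, the representative update is a per-group mean. (A sketch with our own names; a group that happens to be empty is simply skipped.)

```python
import numpy as np

def update_representatives(x, c, k):
    """Replace each representative z_j by the centroid of the vectors
    assigned to group j. Groups with no assigned vectors are dropped."""
    z = []
    for j in range(k):
        group = [x[i] for i in range(len(x)) if c[i] == j]
        if group:
            z.append(sum(group) / len(group))
    return z

x = [np.array([0.0, 0.0]), np.array([2.0, 0.0]), np.array([5.0, 5.0])]
c = [0, 0, 1]
z = update_representatives(x, c, 2)   # z[0] = (1, 0), z[1] = (5, 5)
```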
Ties in step 1 can be broken by assigning x_i to the group associated with one
of the closest representatives with the smallest value of j.
It is possible that in step 1, one or more of the groups can be empty, i.e.,
contain no vectors. In this case we simply drop this group (and its repre-
sentative). When this occurs, we end up with a partition of the vectors into
fewer than k groups.
If the group assignments found in step 1 are the same in two successive
iterations, the representatives in step 2 will also be the same. It follows that
the group assignments and group representatives will never change in future
iterations, so we should stop the algorithm. This is what we mean by "until
convergence."
Convergence. The fact that J^clust decreases in each step implies that the k-means
algorithm converges in a finite number of steps. However, depending on the initial
choice of representatives, the algorithm can, and does, converge to different final
partitions, with different objective values.
The k-means algorithm is a heuristic, which means it cannot guarantee that the
partition it finds minimizes our objective J^clust. For this reason it is common to
run the k-means algorithm several times, with different initial representatives, and
choose the one among them with the smallest final value of J^clust. Despite the fact
that the k-means algorithm is a heuristic, it is very useful in practical applications,
and very widely used.
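Putting the two steps together gives a minimal k-means sketch. (This is our own sketch, not a reference implementation; for simplicity an empty group keeps its old representative rather than being dropped.)

```python
import numpy as np

def kmeans(x, k, max_iters=100, seed=0):
    """Minimal k-means: alternate assignment and centroid updates until
    the assignment stops changing. Returns the assignment c, the
    representatives z, and the final value of J^clust."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    z = x[rng.choice(len(x), size=k, replace=False)]  # random initial reps
    c = None
    for _ in range(max_iters):
        # step 1: assign each vector to its nearest representative
        d2 = ((x[:, None, :] - z[None, :, :]) ** 2).sum(axis=2)
        c_new = d2.argmin(axis=1)
        if c is not None and np.array_equal(c, c_new):
            break                      # assignments repeated: converged
        c = c_new
        # step 2: move each representative to its group centroid
        # (an empty group keeps its old representative, for simplicity)
        z = np.array([x[c == j].mean(axis=0) if np.any(c == j) else z[j]
                      for j in range(k)])
    d2 = ((x[:, None, :] - z[None, :, :]) ** 2).sum(axis=2)
    return c, z, d2.min(axis=1).mean()

x = [[0, 0], [0, 1], [10, 10], [10, 11]]
c, z, J = kmeans(x, k=2)   # two clear clusters; J^clust = 0.25
```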
Figure 4.3 shows a few iterations generated by the k-means algorithm, applied
to the example of figure 4.1. We take k = 3 and start with randomly chosen group
representatives. The final clustering is shown in figure 4.4. Figure 4.5 shows how
the clustering objective decreases in each step.
Figure 4.3 Three iterations of the k-means algorithm (iterations 1, 2, and 10
are shown). The group representatives are shown as large stars. In each row,
the left-hand plot shows the result of partitioning the vectors into the 3 groups
(step 1 of algorithm 4.1). The right-hand plot shows the updated representatives
(step 2 of the algorithm).
Figure 4.5 The clustering objective J^clust after step 1 of each iteration.
the representatives found using the training data set. Roughly speaking, this test
tells us whether our clustering works on data that it has not seen. If the mean
square distances for the training and test sets are reasonably close to each other, we
conclude that k is not too big; if the mean square distance for the test set is much
larger, we conclude that k is too large. We will encounter the idea of validation
again, and discuss it in more detail, in §13.2.
4.4 Examples
4.4.1 Image clustering
Figure 4.6 25 images of handwritten digits from the MNIST data set. Each
image has size 28 × 28, and can be represented by a 784-vector.
Figure 4.7 Clustering objective J^clust after each iteration of the k-means
algorithm, for three initial partitions, on digits of the MNIST set.
Figure 4.10 Clustering objective J^clust after each iteration of the k-means
algorithm, for three initial partitions, on Wikipedia word count histograms.
We can validate the clustering found using a separate test set consisting of
10000 images of handwritten digits. The objective value on the training set (60000
images) is J^clust = 35.17; the objective value on the test set (10000 images), with
the cluster representatives found from the training set, is 35.09. These numbers
are very close, so we conclude that k = 20 is not too large.
We start with a corpus of N = 500 Wikipedia articles, compiled from weekly lists
of the most popular articles between September 6, 2015, and June 11, 2016. We
remove the section titles and reference sections (bibliography, notes, references,
further reading), and convert each document to a list of words. The conversion
removed numbers and stop words, and applied a stemming algorithm to nouns and
verbs. We then formed a dictionary of all the words that appear in at least 20
documents. This resulted in a dictionary of 4423 words. Each document in the
corpus is represented by a word histogram vector of length 4423.
We apply the k-means algorithm with k = 9, and 20 randomly chosen initial
partitions. The k-means algorithm converges to similar but slightly different clus-
terings of the documents in each case. Figure 4.10 shows the clustering objective
versus iteration of the k-means algorithm for three of these, including the one that
gave the lowest final value of J^clust, which we use below.
Table 4.1 summarizes the clustering with the lowest value of J^clust. For each of
the nine clusters we show the largest ten coefficients of the word histogram of the
cluster representative. Table 4.2 gives the size of each cluster and the titles of the
ten articles closest to the cluster representative.
4.5 Applications
Clustering, and the k-means algorithm in particular, has many uses and applica-
tions. It can be used for exploratory data analysis, to get an idea of what a large
collection of vectors looks like. When k is small enough, say less than a few tens,
it is common to examine the group representatives, and some of the vectors in the
associated groups, to interpret or label the groups. Clustering can also be used for
more specific directed tasks, a few of which we describe below.
Table 4.2 Cluster sizes and titles of 10 documents closest to the cluster
representatives.
72 4 Clustering
Chapter 5
Linear independence
In this chapter we explore the concept of linear independence, which will play an
important role in the sequel.
The converse is also true: If any vector in a collection of vectors is a linear combi-
nation of the other vectors, then the collection of vectors is linearly dependent.
Following standard mathematical language usage, we will say "the vectors a_1,
…, a_k are linearly dependent" to mean "the list of vectors a_1, …, a_k is linearly
dependent." But it must be remembered that linear dependence is an attribute of
a collection of vectors, and not of individual vectors.
β_1 a_1 + ⋯ + β_k a_k = 0  (5.1)
Examples.
A list consisting of a single vector is linearly dependent only if the vector is
zero. It is linearly independent only if the vector is nonzero.
Any list of vectors containing the zero vector is linearly dependent.
A list of two vectors is linearly dependent if and only if one of the vectors
is a multiple of the other one. More generally, a list of vectors is linearly
dependent if any one of the vectors is a multiple of another one.
The vectors

a_1 = (0.2, −7, 8.6),   a_2 = (−0.1, 2, −1),   a_3 = (0, −1, 2.2)

are linearly dependent, since a_1 + 2a_2 − 3a_3 = 0. We can express any of
these vectors as a linear combination of the other two. For example, we have
a_2 = (−1/2)a_1 + (3/2)a_3.
The vectors

a_1 = (1, 0, 0),   a_2 = (0, −1, 1),   a_3 = (−1, 1, 1)

are linearly independent. To see this, suppose β_1 a_1 + β_2 a_2 + β_3 a_3 = 0. This
means that

β_1 − β_3 = 0,   −β_2 + β_3 = 0,   β_2 + β_3 = 0.

Adding the last two equations we find that 2β_3 = 0, so β_3 = 0. Using this,
the first equation is then β_1 = 0, and the second equation is β_2 = 0.
The standard unit n-vectors e_1, …, e_n are linearly independent. To see this,
suppose that (5.1) holds. We have

0 = β_1 e_1 + ⋯ + β_n e_n = (β_1, …, β_n),

so we conclude that β_1 = ⋯ = β_n = 0.
x = β_1 a_1 + ⋯ + β_k a_k.

When the vectors a_1, …, a_k are linearly independent, the coefficients that form x
are unique: If we also have

x = γ_1 a_1 + ⋯ + γ_k a_k,

then subtracting the two equations gives

0 = (β_1 − γ_1)a_1 + ⋯ + (β_k − γ_k)a_k.

Since a_1, …, a_k are linearly independent, all of the coefficients β_i − γ_i are zero,
i.e., β_i = γ_i for i = 1, …, k.
5.2 Basis
Independence-dimension inequality. If the n-vectors a_1, …, a_k are linearly
independent, then k ≤ n. In words:

A linearly independent collection of n-vectors can have at most n elements.

We will prove this fundamental fact below; but first, we describe the concept of
basis, which relies on the independence-dimension inequality.
Examples.
The n standard unit n-vectors e_1, …, e_n are a basis. Any n-vector b can be
written as the linear combination
b = b1 e 1 + + bn e n .
This expansion is unique, which means that there is no other linear combi-
nation of e1 , . . . , en that equals b.
The vectors

a_1 = (1.2, −2.6),   a_2 = (−0.3, −3.7)

are a basis. The vector b = (1, 1) can be expressed in only one way as a linear
combination of them:

b = 0.6513 a_1 − 0.7280 a_2.

(The coefficients are given here to 4 significant digits. We will see later how
these coefficients can be computed.)
Cash flows and single period loans. As a practical example, we consider cash
flows over n periods, with positive entries meaning income or cash in and negative
entries meaning payments or cash out. We define the single-period loan cash flow
vectors as

l_i = (0_{i−1}, 1, −(1 + r), 0_{n−i−1}),   i = 1, …, n − 1,

where r ≥ 0 is the per-period interest rate. The cash flow l_i represents a loan of $1
in period i, which is paid back in period i + 1 with interest r. (The subscripts on
the zero vectors above give their dimensions.) Scaling l_i changes the loan amount;
scaling l_i by a negative coefficient converts it into a loan to another entity (which
is paid back in period i + 1 with interest).
The vectors e_1, l_1, …, l_{n−1} are a basis. (The first vector e_1 represents income of
$1 in period 1.) To see this, we show that they are linearly independent. Suppose
that

α_1 e_1 + α_2 l_1 + ⋯ + α_n l_{n−1} = 0.

The last entry of the left-hand side is −(1 + r)α_n = 0, which implies that α_n = 0 (since 1 + r > 0).
Using α_n = 0, the second to last entry is −(1 + r)α_{n−1} = 0, so we conclude that
α_{n−1} = 0. Continuing this way we find that α_{n−2}, …, α_2 are all zero. The first
entry of the equation above, α_1 + α_2 = 0, then implies α_1 = 0. We conclude that
the vectors e_1, l_1, …, l_{n−1} are linearly independent, and therefore a basis.
This means that any cash flow n-vector c can be expressed as a linear combination
of (i.e., replicated by) an initial payment and one-period loans:

c = α_1 e_1 + α_2 l_1 + ⋯ + α_n l_{n−1}.

In this example we can work out what the coefficients are. Using a similar
argument as the one above that establishes linear independence we get

α_n = −c_n/(1 + r),   α_{n−1} = −c_{n−1}/(1 + r) − c_n/(1 + r)²,

and so on. Finally, we have

α_1 = c_1 + c_2/(1 + r) + ⋯ + c_n/(1 + r)^{n−1},
which is exactly the net present value (NPV) of the cash flow, with interest rate r.
Thus we see that any cash flow can be replicated as an income in period 1 equal
to its net present value, plus a linear combination of one-period loans at interest
rate r.
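The backward recursion for the coefficients is easy to check numerically. (A sketch; the function names and the example cash flow are ours.)

```python
import numpy as np

def loan_vector(i, n, r):
    """Single-period loan cash flow l_i = (0_{i-1}, 1, -(1+r), 0_{n-i-1})."""
    l = np.zeros(n)
    l[i - 1] = 1.0
    l[i] = -(1.0 + r)
    return l

def replication_coefficients(c, r):
    """Coefficients alpha with c = alpha_1 e_1 + alpha_2 l_1 + ... + alpha_n l_{n-1}.

    Works backwards through the entries of c, as in the argument above;
    alpha_1 comes out equal to the net present value of c."""
    n = len(c)
    alpha = np.zeros(n)
    alpha[n - 1] = -c[n - 1] / (1 + r)      # last entry of c
    for j in range(n - 2, 0, -1):           # entries n-1, ..., 2 of c
        alpha[j] = (alpha[j + 1] - c[j]) / (1 + r)
    alpha[0] = c[0] - alpha[1]              # first entry: alpha_1 + alpha_2 = c_1
    return alpha

r = 0.05
c = np.array([1.0, -2.0, 3.0])
alpha = replication_coefficients(c, r)
e1 = np.array([1.0, 0.0, 0.0])
recon = (alpha[0] * e1 + alpha[1] * loan_vector(1, 3, r)
         + alpha[2] * loan_vector(2, 3, r))       # replicates c exactly
npv = c[0] + c[1] / (1 + r) + c[2] / (1 + r) ** 2  # alpha[0] equals this
```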
with

β = (1/α_j) ( Σ_{i=1}^{j−1} β_i α_i + Σ_{i=j+1}^{k} β_{i−1} α_i ).

Since the vectors a_i = (b_i, α_i) are linearly independent, the equality (5.2) only
holds when all the coefficients β_i and β are zero. This in turn implies that
the vectors c_1, …, c_{k−1} are linearly independent. By the induction hypothesis
k − 1 ≤ n − 1, so we have established that k ≤ n.
combined into one statement about the inner products of pairs of vectors in the
collection: a_1, …, a_k is orthonormal means that

a_i^T a_j = 1 if i = j,   and   a_i^T a_j = 0 if i ≠ j.
β_1 a_1 + ⋯ + β_k a_k = 0.

Taking the inner product of both sides with a_i gives

0 = a_i^T (β_1 a_1 + ⋯ + β_k a_k)
  = β_1 (a_i^T a_1) + ⋯ + β_k (a_i^T a_k)
  = β_i,

since a_i^T a_j = 0 for j ≠ i and a_i^T a_i = 1. Thus, the only linear combination of a_1,
…, a_k that is zero is the one with all coefficients zero.
x = β_1 a_1 + ⋯ + β_k a_k.

Taking the inner product of the left- and right-hand sides of this equation with a_i
yields

a_i^T x = a_i^T (β_1 a_1 + ⋯ + β_k a_k) = β_i,
Jørgen Pedersen Gram and Erhard Schmidt, although it was already known before
their work.
If the vectors are linearly independent, the Gram-Schmidt algorithm produces
an orthonormal collection of vectors q_1, …, q_k with the following properties: For
each i = 1, …, k, a_i is a linear combination of q_1, …, q_i, and q_i is a linear
combination of a_1, …, a_i. If the vectors a_1, …, a_{j−1} are linearly independent, but
a_1, …, a_j are linearly dependent, the algorithm detects this and terminates. In
other words, the Gram-Schmidt algorithm finds the first vector a_j that is a linear
combination of the previous vectors a_1, …, a_{j−1}.
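The algorithm can be sketched in Python with NumPy. (Our own sketch; the tolerance used to decide that an intermediate vector is zero is an implementation choice, not part of the exact algorithm.)

```python
import numpy as np

def gram_schmidt(a, tol=1e-10):
    """Gram-Schmidt sketch. Returns (q, independent): the orthonormal
    vectors produced so far, and whether all input vectors were
    (numerically) linearly independent."""
    q = []
    for aj in a:
        qt = np.array(aj, dtype=float)
        for qi in q:
            qt = qt - (qi @ aj) * qi      # orthogonalization step
        norm = np.linalg.norm(qt)
        if norm <= tol:                   # a_j depends on earlier vectors
            return q, False
        q.append(qt / norm)               # normalization step
    return q, True

# three linearly independent 3-vectors
a = [np.array([1.0, 0.0, 0.0]), np.array([0.0, -1.0, 1.0]),
     np.array([-1.0, 1.0, 1.0])]
q, independent = gram_schmidt(a)          # independent is True
```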
Analysis of Gram-Schmidt algorithm. Let us show that the following hold, for
i = 1, …, k, assuming a_1, …, a_k are linearly independent.
1. q̃_i ≠ 0, so the normalization in step 3 does not involve division by zero.
2. q_1, …, q_i are orthonormal.
3. a_i is a linear combination of q_1, …, q_i.
4. q_i is a linear combination of a_1, …, a_i.
Step 3 of the algorithm ensures that q_1, …, q_i are normalized; to show they are
orthogonal we will show that q_i ⊥ q_j for j = 1, …, i − 1. (Our induction hypothesis
tells us that q_r ⊥ q_s for r, s < i.) For any j = 1, …, i − 1, we have (using step 1 of
the algorithm)

q̃_i^T q_j = a_i^T q_j − (q_1^T a_i)(q_1^T q_j) − ⋯ − (q_{i−1}^T a_i)(q_{i−1}^T q_j) = a_i^T q_j − q_j^T a_i = 0,

using q_j^T q_k = 0 for j ≠ k and q_j^T q_j = 1. (This explains why step 1 is called the
orthogonalization step: We subtract from a_i a linear combination of q_1, …, q_{i−1}
that ensures q̃_i ⊥ q_j for j < i.) Since q_i = (1/‖q̃_i‖)q̃_i, we have q_i^T q_j = 0 for
j = 1, …, i − 1. So assertion 2 holds for i.
It is immediate that a_i is a linear combination of q_1, …, q_i:

a_i = (q_1^T a_i) q_1 + ⋯ + (q_{i−1}^T a_i) q_{i−1} + ‖q̃_i‖ q_i.

Now suppose that

β_1 a_1 + ⋯ + β_k a_k = 0.  (5.6)

Taking the inner product of both sides with q_k gives

0 = q_k^T (β_1 a_1 + ⋯ + β_k a_k)
  = β_1 q_k^T a_1 + ⋯ + β_{k−1} q_k^T a_{k−1} + β_k q_k^T a_k
  = β_k ‖q̃_k‖,

since q_k^T a_i = 0 for i < k (each a_i is a linear combination of q_1, …, q_i) and
q_k^T a_k = ‖q̃_k‖. Since q̃_k ≠ 0, it follows that β_k = 0, and so

β_1 a_1 + ⋯ + β_{k−1} a_{k−1} = 0.
where we use the fact that Σ_{i=1}^{k} (i − 1) = k(k − 1)/2. The complexity of the
Gram-Schmidt algorithm is 2nk²; its order is nk². We can guess that its running
time grows linearly with the lengths of the vectors n, and quadratically with the
number of vectors k.
In the special case of k = n, the complexity of the Gram-Schmidt method is
2n³. For example, if the Gram-Schmidt algorithm is used to determine whether a
collection of 1000 1000-vectors is linearly independent (and therefore a basis), the
computational cost is around 2 × 10⁹ flops. On a modern computer, we can
expect this to take on the order of one second.
When the Gram-Schmidt algorithm is implemented, a variation on it called the
modified Gram-Schmidt algorithm is typically used. This algorithm produces the
same results as the Gram-Schmidt algorithm (algorithm 5.1), but is less sensitive to the small
round-off errors that occur when arithmetic calculations are done using floating
point numbers. (We do not consider round-off error in this book.)
Part II
Matrices
Chapter 6
Matrices
In this chapter we introduce matrices and some basic operations on them. We give
some applications in which they arise.
6.1 Matrices
A matrix is a rectangular array of numbers written between rectangular brackets,
as in

[  0     1   −2.3   0.1 ]
[ 1.3    4   −0.1   0   ]
[ 4.1   −1    0     1.7 ].

It is also common to use large parentheses instead of rectangular brackets, as in

(  0     1   −2.3   0.1 )
( 1.3    4   −0.1   0   )
( 4.1   −1    0     1.7 ).
An important attribute of a matrix is its size or dimensions, i.e., the numbers of
rows and columns. The matrix above has 3 rows and 4 columns, so its size is 3 × 4.
A matrix of size m × n is called an m × n matrix.
The elements (or entries or coefficients) of a matrix are the values in the array.
The i, j element is the value in the ith row and jth column, denoted by double
subscripts: the i, j element of a matrix A is denoted A_{ij} (or A_{i,j}, when i or j is
more than one digit or character). The positive integers i and j are called the (row
and column) indices. If A is an m × n matrix, then the row index i runs from 1 to
m and the column index j runs from 1 to n. Row indices go from top to bottom,
so row 1 is the top row and row m is the bottom row. Column indices go from left
to right, so column 1 is the left column and column n is the right column.
If the matrix above is B, then we have B_{13} = −2.3, B_{32} = −1. The row index
of the bottom left element (which has value 4.1) is 3; its column index is 1.
Two matrices are equal if they have the same size, and the corresponding entries
are all equal. As with vectors, we normally deal with matrices with entries that
90 6 Matrices
are real numbers, which will be our assumption unless we state otherwise. The set
of real m × n matrices is denoted R^{m×n}. But matrices with complex entries, for
example, do arise in some applications.
Square, tall, and wide matrices. A square matrix has an equal number of rows
and columns. A square matrix of size n × n is said to be of order n. A tall matrix
has more rows than columns (size m × n with m > n). A wide matrix has more
columns than rows (size m × n with n > m).
Notational conventions. Many authors (including us) tend to use capital letters
to denote matrices, and lower case letters for (column or row) vectors. But this
convention is not standardized, so you should be prepared to figure out whether
a symbol represents a matrix, column vector, row vector, or a scalar, from con-
text. (The more considerate authors will tell you what the symbols represent, for
example, by referring to the matrix A when introducing it.)
An m × n matrix A has n columns, given by the (m-vectors)

a_j = (A_{1j}, …, A_{mj}),

for j = 1, …, n. The same matrix has m rows, given by the (1 × n row vectors)

b_i = [ A_{i1}  ⋯  A_{in} ],

for i = 1, …, m. As a specific example, the 2 × 3 matrix

[ 1  2  3 ]
[ 4  5  6 ]

has rows [1 2 3] and [4 5 6], and columns (1, 4), (2, 5), and (3, 6).
Block matrices and submatrices. It is sometimes useful to consider matrices whose
entries are themselves matrices, as in

A = [ B  C ]
    [ D  E ],

where B, C, D, and E are matrices. Such matrices are called block matrices; the
elements B, C, D, and E are called blocks or submatrices of A. The submatrices can
be referred to by their block row and column indices; for example, C is the 1,2
block of A.
Block matrices must have the right dimensions to fit together. Matrices in the
same (block) row must have the same number of rows (i.e., the same height);
matrices in the same (block) column must have the same number of columns (i.e.,
the same width). In the example above, B and C must have the same number of
rows, and C and E must have the same number of columns. Matrix blocks placed
next to each other in the same row are said to be concatenated ; matrix blocks
placed above each other are called stacked.
As an example, consider

B = [ 0  2  3 ],   C = [ −1 ],   D = [ 2  2  1 ],   E = [ 4 ],
                                     [ 1  3  5 ]        [ 4 ]

so that the block matrix above is

A = [ 0  2  3  −1 ]
    [ 2  2  1   4 ]
    [ 1  3  5   4 ].
(Note that we have dropped the left and right brackets that delimit the blocks.
This is similar to the way we drop the brackets in a 1 1 matrix to get a scalar.)
We can also divide a larger matrix (or vector) into blocks. In this context the
blocks are called submatrices of the big matrix. As with vectors, we can use colon
notation to denote submatrices. If A is an m × n matrix, and p, q, r, s are integers
with 1 ≤ p ≤ q ≤ m and 1 ≤ r ≤ s ≤ n, then A_{p:q,r:s} denotes the submatrix

A_{p:q,r:s} = [ A_{p,r}    A_{p,r+1}    ⋯  A_{p,s}   ]
              [ A_{p+1,r}  A_{p+1,r+1}  ⋯  A_{p+1,s} ]
              [    ⋮          ⋮              ⋮       ]
              [ A_{q,r}    A_{q,r+1}    ⋯  A_{q,s}   ].
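In NumPy the same submatrix is obtained with a slice, after accounting for 0-based, half-open indexing. (A usage sketch, not from the text.)

```python
import numpy as np

A = np.array([[ 1,  2,  3,  4],
              [ 5,  6,  7,  8],
              [ 9, 10, 11, 12]])

# The book's A_{p:q,r:s} uses 1-based inclusive ranges; NumPy slices are
# 0-based and half-open, so A_{p:q,r:s} corresponds to A[p-1:q, r-1:s].
sub = A[1:3, 0:2]    # the book's A_{2:3,1:2}
```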
Examples
Table interpretation. The most direct interpretation of a matrix is as a table of
numbers that depend on two indexes, i and j. (A vector is a list of numbers that
depend on only one index.) In this case the rows and columns of the matrix usually
have some simple interpretation. Some examples are given below.
for asset 4. The 3rd row of R is an n-row-vector that gives the returns of all
assets in the universe in period 3.
An example of an asset return matrix, with a universe of n = 4 assets over
T = 3 periods, is shown in table 6.1.
often called a data matrix or feature matrix. Its jth column is the feature n-vector
for the jth object (in this context sometimes called the jth example). The ith row
of the data matrix X is an N -row-vector whose entries are the values of the ith
feature across the examples. We can also directly interpret the entries of the data
matrix: Xij (which is a number) is the value of the ith feature for the jth example.
As another example, a 3 M matrix can be used to represent a collection of
M locations or positions in 3-D space, with its jth column giving the jth position.
R = {(1, 2), (1, 3), (2, 1), (2, 4), (3, 4), (4, 1)}.  (6.2)

Such a relation can be represented by the n × n matrix A with A_{ij} = 1 if
(i, j) ∈ R, and A_{ij} = 0 otherwise. This matrix is called the adjacency matrix
associated with the graph. The relation (6.2), for example, is represented by the matrix

A = [ 0  1  1  0 ]
    [ 1  0  0  1 ]
    [ 0  0  0  1 ]
    [ 1  0  0  0 ].
This is the adjacency matrix of the associated graph, shown in figure 6.1. (We will
encounter another matrix associated with a directed graph in §7.3.)
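Building the adjacency matrix from a relation takes one assignment per pair. (A sketch; the function name is ours.)

```python
import numpy as np

def adjacency_matrix(relation, n):
    """n x n matrix A with A_{ij} = 1 exactly when the pair (i, j) is
    in the relation; indices are 1-based, as in the text."""
    A = np.zeros((n, n), dtype=int)
    for i, j in relation:
        A[i - 1, j - 1] = 1
    return A

R = [(1, 2), (1, 3), (2, 1), (2, 4), (3, 4), (4, 1)]
A = adjacency_matrix(R, 4)   # the 4 x 4 matrix of relation (6.2)
```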
6.2 Zero and identity matrices
For example, if

A = [ 1  2  3 ]
    [ 4  5  6 ],

then

[ I  A ]   [ 1  0  1  2  3 ]
[ 0  I ] = [ 0  1  4  5  6 ]
           [ 0  0  1  0  0 ]
           [ 0  0  0  1  0 ]
           [ 0  0  0  0  1 ].

The dimensions of the two identity matrices follow from the size of A. The identity
matrix in the 1,1 position must be 2 × 2, and the identity matrix in the 2,2 position
must be 3 × 3. This also determines the size of the zero matrix in the 2,1 position.
The importance of the identity matrix will become clear later, in §10.1.
Sparse matrices. A matrix A is said to be sparse if many of its entries are zero,
or (put another way) just a few of its entries are nonzero. Its sparsity pattern is
the set of indices (i, j) for which A_{ij} ≠ 0. The number of nonzeros of a sparse
matrix A is the number of entries in its sparsity pattern, and is denoted nnz(A). If
A is m × n we have nnz(A) ≤ mn. Its density is nnz(A)/(mn), which is no more
than one. Densities of sparse matrices that arise in applications are typically small
or very small, as in 10⁻² or 10⁻⁴. There is no precise definition of how small the
density must be for a matrix to qualify as sparse. A famous definition of sparse
matrix due to Wilkinson is: A matrix is sparse if it has enough zero entries that it
pays to take advantage of them. Sparse matrices can be stored and manipulated
efficiently on a computer.
Many common matrices are sparse. An n × n identity matrix is sparse, since
it has only n nonzeros, so its density is 1/n. The zero matrix is the sparsest
possible matrix, since it has no nonzero entries. Several special sparsity patterns
have names; we describe some important ones below.
Like sparse vectors, sparse matrices arise in many applications. A typical cus-
tomer purchase history matrix (see page 93) is sparse, since each customer has
likely only purchased a small fraction of all the products available.
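The quantities nnz and density are easy to compute; here is a small sketch using a dense NumPy array. (Real sparse-matrix libraries store only the nonzeros, which this sketch does not attempt; the function names are ours.)

```python
import numpy as np

def nnz(A):
    """Number of nonzero entries of A."""
    return int(np.count_nonzero(A))

def density(A):
    """nnz(A) / (mn); always at most one."""
    return nnz(A) / A.size

I = np.eye(100)              # 100 x 100 identity
print(nnz(I), density(I))    # prints: 100 0.01  (density 1/n)
```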
(Note that in the first example, one of the diagonal elements is also zero.)
The notation diag(a_1, …, a_n) is used to compactly describe the n × n diagonal
matrix A with diagonal entries A_{11} = a_1, …, A_{nn} = a_n. This notation is not yet
standard, but is coming into more prevalent use. As examples, the matrices above
would be expressed as
If we transpose a matrix twice, we get back the original matrix: (AT )T = A. (The
superscript T in the transpose is the same one used to denote the inner product of
two n-vectors; we will soon see how they are related.)
Row and column vectors. Transposition converts row vectors into column vectors
and vice versa. It is sometimes convenient to express a row vector as a^T, where a
is a column vector. For example, we might refer to the m rows of an m × n matrix
A as a_1^T, …, a_m^T, where a_1, …, a_m are (column) n-vectors. As an example, the
second row of the matrix

[ 0  7  3 ]
[ 4  0  1 ]

can be written as (the row vector) (4, 0, 1)^T.
It is common to extend concepts from (column) vectors to row vectors, by
applying the concept to the transposed row vectors. We say that a set of row
vectors is linearly dependent (or independent) if their transposes (which are column
vectors) are linearly dependent (or independent). For example, "the rows of a
matrix A are linearly independent" means that the columns of A^T are linearly
independent. As another example, "the rows of a matrix A are orthonormal" means
that their transposes, the columns of A^T, are orthonormal. Clustering the rows of
a matrix X means clustering the columns of X^T.
Transpose of block matrix. The transpose of a block matrix has the simple form
(shown here for a 2 × 2 block matrix)

[ A  B ]T   [ A^T  C^T ]
[ C  D ]  = [ B^T  D^T ],
Two matrices of the same size can be added together. The result is another matrix
of the same size, obtained by adding the corresponding elements of the two matrices.
For example,
\[
\begin{bmatrix} 0 & 4 \\ 7 & 0 \\ 3 & 1 \end{bmatrix} +
\begin{bmatrix} 1 & 2 \\ 2 & 3 \\ 0 & 4 \end{bmatrix} =
\begin{bmatrix} 1 & 6 \\ 9 & 3 \\ 3 & 5 \end{bmatrix}.
\]
Matrix subtraction is similar. As an example,
\[
\begin{bmatrix} 1 & 6 \\ 9 & 3 \end{bmatrix} - I =
\begin{bmatrix} 0 & 6 \\ 9 & 2 \end{bmatrix}.
\]
(This gives another example where we have to figure out the size of the identity
matrix. Since we can only add or subtract matrices of the same size, I refers to a
2 2 identity matrix.)
Commutativity. A + B = B + A.
(The reader should check that the two sums in (6.3) give the entries of the vector
on the right-hand side.)
\[
y_i = b_i^T x, \quad i = 1, \ldots, m,
\]
where b_i^T is the ith row of A. The matrix-vector product can also be interpreted in
terms of the columns of A. If a_k is the kth column of A, then y = Ax can be
written
\[
y = x_1 a_1 + x_2 a_2 + \cdots + x_n a_n.
\]
This shows that y = Ax is a linear combination of the columns of A; the coefficients
in the linear combination are the elements of x.
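The two interpretations above can be checked against each other with a short sketch in plain Python (the matrix and vector here are made up for illustration):

```python
def matvec_rows(A, x):
    # y_i = inner product of the ith row of A with x
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in A]

def matvec_cols(A, x):
    # y = x_1 a_1 + ... + x_n a_n, a linear combination of the columns of A
    m, n = len(A), len(A[0])
    y = [0.0] * m
    for k in range(n):          # add x_k times the kth column of A
        for i in range(m):
            y[i] += x[k] * A[i][k]
    return y

A = [[0.0, 7.0, 3.0],
     [4.0, 0.0, 1.0]]
x = [1.0, -1.0, 2.0]
print(matvec_rows(A, x))   # both interpretations give the same vector
print(matvec_cols(A, x))
```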
(where entries not shown are zero, and entries with diagonal dots are −1 or
1, continuing the pattern) is called the difference matrix. The vector Dx is
Application examples.
Feature matrix and weight vector. Suppose X is a feature matrix, whose N
columns x1 , . . . , xN are feature n-vectors for N objects or examples. Let
the n-vector w be a weight vector, and let si = xTi w be the score associated
with object i using the weight vector w. Then we can write s = X T w, where
s is the N -vector of scores of the objects.
Portfolio return time series. Suppose that R is a T × n asset return matrix,
that gives the returns of n assets over T periods. Let h denote the n-vector
of dollar value investments in the assets over the T periods, so, e.g., h3 = 200
means that we have invested $200 in asset 3. (Short positions are denoted
by negative entries in h.) Then Rh, which is a T -vector, is the time series of
the portfolio profit (in $) over the periods 1, . . . , T . If w is a set of portfolio
weights (with 1T w = 1), then Rw is the time series of portfolio returns, given
as a fraction.
As an example, consider a portfolio of the 4 assets in table 6.1, with weights
w = (0.4, 0.3, −0.2, 0.5). The product Rw = (0.00213, −0.00201, 0.00241)
gives the portfolio returns over the three periods in the example.
Polynomial evaluation at multiple points. Suppose the entries of the n-vector
c are the coefficients of a polynomial p of degree n − 1 or less:
\[
p(t) = c_1 + c_2 t + \cdots + c_{n-1} t^{n-2} + c_n t^{n-1}.
\]
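Evaluating p at several points t1 , . . . , tm is then a matrix-vector product Ac, with Aij = ti^(j−1). A minimal Python sketch (the coefficients and points are chosen arbitrarily):

```python
def polyvals(c, ts):
    # Evaluate p(t) = c[0] + c[1] t + ... + c[n-1] t^(n-1) at each point in ts,
    # as the matrix-vector product A c with A_ij = t_i^(j-1) (a Vandermonde matrix).
    A = [[t ** j for j in range(len(c))] for t in ts]
    return [sum(a_ij * c_j for a_ij, c_j in zip(row, c)) for row in A]

c = [1.0, 0.0, -2.0]          # p(t) = 1 - 2 t^2
print(polyvals(c, [0.0, 1.0, 2.0]))
```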
Inner product. When a and b are n-vectors, aT b is exactly the inner product of a
and b, obtained from the rules for transposing matrices and matrix-vector product.
We start with the n-(column) vector a, consider it as an n × 1 matrix, and transpose
it to obtain the 1 × n row vector aT. Now we multiply this 1 × n matrix by the n-vector
b, which we consider an n × 1 matrix, to obtain the 1 × 1 matrix aT b, which we also
consider a scalar. So the notation aT b for the inner product is just a very special
case of matrix-vector multiplication.
A(u + v) = Au + Av.
6.5 Complexity
Computer representation of matrices. An m × n matrix is usually represented
on a computer as an m × n array of floating point numbers, which requires 8mn
bytes. In some software systems symmetric matrices are represented in a more
efficient way, by only storing the upper triangular elements in the matrix, in some
specific order. This reduces the memory requirement by around a factor of two.
Sparse matrices are represented by various methods that encode for each nonzero
element its row index i (an integer), its column index j (an integer) and its value
Aij (a floating point number). When the row and column indices are represented
using 4 bytes, this requires a total of around 16 nnz(A) bytes.
Chapter 7
Matrix examples
In this chapter we describe some special matrices that occur often in applications.
Scaling. Scaling is the mapping y = ax, where a is a scalar. This can be expressed
as y = Ax with A = aI. This mapping stretches a vector by the factor |a| (or
shrinks it when |a| < 1), and it flips the vector (reverses its direction) if a < 0.
Reflection. Suppose that y is the vector obtained by reflecting x through the line
that passes through the origin, inclined θ radians with respect to horizontal. Then
we have
\[
y = \begin{bmatrix} \cos(2\theta) & \sin(2\theta) \\ \sin(2\theta) & -\cos(2\theta) \end{bmatrix} x.
\]
Figure 7.1 From left to right: a dilation with A = diag(2, 2/3), a counterclockwise
rotation by π/6 radians, and a reflection through a line that
makes an angle of π/4 radians with the horizontal line.
Projection onto a line. The projection of the point x onto a set is the point in
the set that is closest to x. Suppose y is the projection of x onto the line that
passes through the origin, inclined θ radians with respect to horizontal. Then we
have
\[
y = \begin{bmatrix} (1/2)(1 + \cos(2\theta)) & (1/2)\sin(2\theta) \\ (1/2)\sin(2\theta) & (1/2)(1 - \cos(2\theta)) \end{bmatrix} x.
\]
Some of these geometric transformations are illustrated in figure 7.1.
Finding the matrix. One simple method to find the matrix associated with a
linear geometric transformation is to find its columns. The ith column is the vector
obtained by applying the transformation to ei . As a simple example consider
(clockwise) rotation by 90° in 2-D. Rotating the vector e1 = (1, 0) by 90° gives
(0, −1); rotating e2 = (0, 1) by 90° gives (1, 0). So rotation by 90° is given by
\[
y = \begin{bmatrix} 0 & 1 \\ -1 & 0 \end{bmatrix} x.
\]
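The column-by-column recipe can be sketched in Python: we apply a rotation helper (a hypothetical function written here for illustration) to e1 and e2 and read off the columns; the rounding at the end only cleans up floating point error.

```python
import math

def rotation_cw_90():
    # Find the matrix of clockwise rotation by 90 degrees by applying the
    # transformation to the unit vectors e1, e2: the results are its columns.
    def rotate(v, theta):      # counterclockwise rotation by theta radians
        x, y = v
        return (x * math.cos(theta) - y * math.sin(theta),
                x * math.sin(theta) + y * math.cos(theta))
    theta = -math.pi / 2       # clockwise by 90 degrees
    col1 = rotate((1.0, 0.0), theta)   # image of e1
    col2 = rotate((0.0, 1.0), theta)   # image of e2
    return [[col1[0], col2[0]],
            [col1[1], col2[1]]]

A = rotation_cw_90()
print([[round(a) for a in row] for row in A])
```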
7.2 Selectors
An m × n selector matrix A is one in which each row is a unit vector (transposed):
\[
A = \begin{bmatrix} e_{k_1}^T \\ \vdots \\ e_{k_m}^T \end{bmatrix},
\]
where k1 , . . . , km are integers in the range 1, . . . , n. When it multiplies a vector,
it simply copies the ki th entry of x into the ith entry of y = Ax:
\[
y = (x_{k_1}, x_{k_2}, \ldots, x_{k_m}).
\]
In words, each entry of Ax is a selection of an entry of x.
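A small Python sketch of a selector matrix (the index list and vector are made up for illustration; the indices are 1-based, as in the text):

```python
def selector_matrix(ks, n):
    # m x n matrix whose ith row is the unit vector e_{ks[i]} (transposed);
    # the entries of ks are 1-based indices into an n-vector.
    return [[1 if j == k - 1 else 0 for j in range(n)] for k in ks]

def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]

x = [10, 20, 30, 40]
A = selector_matrix([3, 1, 3], 4)   # copies entries 3, 1, 3 of x
print(matvec(A, x))
```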
The identity matrix, and the reverser matrix
\[
A = \begin{bmatrix} e_n^T \\ \vdots \\ e_1^T \end{bmatrix}
  = \begin{bmatrix}
      0 & 0 & \cdots & 0 & 1 \\
      0 & 0 & \cdots & 1 & 0 \\
      \vdots & \vdots & & \vdots & \vdots \\
      0 & 1 & \cdots & 0 & 0 \\
      1 & 0 & \cdots & 0 & 0
    \end{bmatrix}
\]
are special cases of selector matrices. (The reverser matrix reverses the order of
the entries of a vector: Ax = (xn , xn−1 , . . . , x2 , x1 ).) Another one is the r : s slicing
matrix, which can be described as the block matrix
\[
A = \begin{bmatrix} 0_{m \times (r-1)} & I_{m \times m} & 0_{m \times (n-s)} \end{bmatrix},
\]
where m = s r + 1. (We show the dimensions of the blocks for clarity.) We have
Ax = xr:s , i.e., multiplying by A gives the r : s slice of a vector.
The incidence matrix is evidently sparse, since it has only two nonzero entries
in each column (one with value 1 and the other with value −1). The jth column is
associated with the jth edge; the indices of its two nonzero entries give the nodes
that the edge connects. The ith row of A corresponds to node i; its nonzero entries
tell us which edges connect to the node, and whether they point into or away from
the node. The incidence matrix for the graph shown in figure 7.2 is
\[
A = \begin{bmatrix}
-1 & -1 & 0 & 1 & 0 \\
1 & 0 & -1 & 0 & 0 \\
0 & 0 & 1 & -1 & -1 \\
0 & 1 & 0 & 0 & 1
\end{bmatrix}.
\]
Figure 7.2 Directed graph with four vertices and five edges.
x = (0.6, 0.3, 0.6, −0.1, −0.3), together with the external source vector
s = (1, 0, −1, 0), satisfies Ax + s = 0. This flow can be explained in words: The
unit external flow entering node 1 splits three ways, with 0.6 flowing up, 0.3
flowing right, and 0.1
flowing diagonally up (on edge 4). The upward flow on edge 1 passes through
node 2, where flow is conserved, and proceeds right on edge 3 towards node 3.
The rightward flow on edge 2 passes through node 4, where flow is conserved, and
proceeds up on edge 5 to node 3. The one unit of excess flow arriving at node 3 is
removed as external flow.
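The conservation condition can be checked numerically; the incidence matrix and flow values below use the sign conventions assumed in this section:

```python
# Incidence matrix of the graph in figure 7.2 (4 nodes, 5 edges);
# A[i][j] is +1 if edge j points to node i, -1 if it points away.
A = [[-1, -1, 0, 1, 0],
     [1, 0, -1, 0, 0],
     [0, 0, 1, -1, -1],
     [0, 1, 0, 0, 1]]
x = [0.6, 0.3, 0.6, -0.1, -0.3]   # edge flows (negative = against edge direction)
s = [1.0, 0.0, -1.0, 0.0]         # external sources at the nodes

# Flow conservation requires (Ax)_i + s_i = 0 at every node i.
residual = [sum(a * f for a, f in zip(row, x)) + si for row, si in zip(A, s)]
print(residual)   # all entries are (numerically) zero
```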
arises in many applications, and is called the Laplacian (associated with the graph).
It can be expressed as
\[
L(v) = \sum_{\text{edges } (k,l)} (v_k - v_l)^2,
\]
which is the sum of the squares of the differences of v across all edges in the graph.
The Laplacian is small when the potentials of nodes that are connected by edges
are near each other.
The Laplacian is used as a measure of the non-smoothness (roughness) of a po-
tential function on a graph. A set of node potentials with small Laplacian can be
thought of as smoothly varying across the graph. Conversely a set of potentials
with large Laplacian can be thought of as non-smooth or rough. The Laplacian
will arise as a measure of roughness in several applications we will encounter later.
As a simple example, consider the potential vector v = (1, −1, 2, −1) for the
graph shown in figure 7.2. For this set of potentials, the differences across the edges
are relatively large, with AT v = (−2, −2, 3, −1, −3), and the associated Laplacian
is ‖AT v‖2 = 27. Now consider the potential vector v = (1, 2, 2, 1). The associated
edge differences are AT v = (1, 0, 0, −1, −1), and the Laplacian has the much smaller
value ‖AT v‖2 = 3.
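A short sketch computing the Laplacian as ‖AT v‖², using the incidence matrix sign conventions assumed above:

```python
# Incidence matrix of the graph in figure 7.2 (signs as assumed in this section)
A = [[-1, -1, 0, 1, 0],
     [1, 0, -1, 0, 0],
     [0, 0, 1, -1, -1],
     [0, 1, 0, 0, 1]]

def laplacian(A, v):
    # L(v) = ||A^T v||^2: sum of squared potential differences across the edges
    n_edges = len(A[0])
    diffs = [sum(A[i][j] * v[i] for i in range(len(A))) for j in range(n_edges)]
    return sum(d * d for d in diffs)

print(laplacian(A, [1, -1, 2, -1]))   # rough potentials: large Laplacian
print(laplacian(A, [1, 2, 2, 1]))     # smooth potentials: small Laplacian
```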
Chain graph. The incidence matrix and the Laplacian function have a particularly
simple form for the simple chain graph shown in figure 7.3, with n vertices and
n − 1 edges. The n × (n − 1) incidence matrix is the transpose of the difference
matrix D described on page 100, in (6.4). The Laplacian is then
\[
L(v) = (v_2 - v_1)^2 + \cdots + (v_n - v_{n-1})^2,
\]
the sum of squares of the differences between consecutive entries of the n-vector v.

Figure 7.4 Two vectors of length 100, with Laplacians L(a) = 1.14 and
L(b) = 8.99.
This is used as a measure of the non-smoothness of the vector v. Figure 7.4 shows
an example.
7.4 Convolution
The convolution of an n-vector a and an m-vector b is the (n + m − 1)-vector
denoted c = a ∗ b, with entries
\[
c_k = \sum_{i+j=k+1} a_i b_j, \quad k = 1, \ldots, n + m - 1, \qquad (7.2)
\]
where the subscript in the sum means that we should sum over all values of i and
j in their index ranges 1, . . . , n and 1, . . . , m, for which the sum i + j is k + 1. For
example with n = 4, m = 3, we have
c1 = a1 b1
c2 = a1 b2 + a2 b1
c3 = a1 b3 + a2 b2 + a3 b1
c4 = a2 b3 + a3 b2 + a4 b1
c5 = a3 b3 + a4 b2
c6 = a4 b3 .
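The definition (7.2) can be implemented directly in a few lines of Python (the vectors below are chosen arbitrarily):

```python
def convolve(a, b):
    # c_k = sum over i + j = k + 1 of a_i b_j (1-based), k = 1, ..., n + m - 1
    n, m = len(a), len(b)
    c = [0] * (n + m - 1)
    for i in range(n):
        for j in range(m):
            c[i + j] += a[i] * b[j]   # 0-based: index i + j plays the role of k - 1
    return c

a = [1, 0, -1, 2]
b = [2, 1, -1]
print(convolve(a, b))   # a vector of length n + m - 1 = 6
```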
then the coefficients of the product polynomial p(x)q(x) are represented by c = a ∗ b.
To see this we will show that ck is the coefficient of x^(k−1) in p(x)q(x). We expand the
product polynomial into mn terms, and collect those terms associated with x^(k−1).
These terms have the form a_i b_j x^(i+j−2), for i and j that satisfy i + j − 2 = k − 1,
i.e., i + j = k + 1. It follows that c_k = \sum_{i+j=k+1} a_i b_j, which agrees with the
convolution formula (7.2).
a b = T (b)a = T (a)b,
The matrices T (b) and T (a) are called Toeplitz matrices, since the entries on any
diagonal (i.e., indices with i − j constant) are the same. The columns of the Toeplitz
matrix T (a) are simply shifted versions of the vector a, padded with zero entries.
Examples.
Time series smoothing. Suppose the n-vector x is a time series, and a =
(1/3, 1/3, 1/3). Then y = a ∗ x can be interpreted as a smoothed version of
the original time series: for 3 ≤ i ≤ n, yi is the average of xi , xi−1 , xi−2 .
(We can drop the qualifier 3 ≤ i ≤ n, by defining xi to be zero outside
the index range 1, . . . , n.) The time series y is called the (3-period) moving
average of the time series x. Figure 7.5 shows an example.
First order differences. If the n-vector x is a time series and a = (1, −1), the
time series y = a ∗ x gives the first order differences in the series x:
\[
y = (x_1, x_2 - x_1, x_3 - x_2, \ldots, x_n - x_{n-1}, -x_n).
\]
Figure 7.5 A time series xk (top) and its moving average (x ∗ b)k (bottom),
for k up to 100.
where the indices are restricted to their ranges (or alternatively, we assume that
Aij and Bkl are zero, when the indices are out of range). This is not denoted
C = A ∗ B, however, in standard mathematical notation. So we will use the
notation C = A ? B.
The same properties that we observed for 1-D convolution hold for 2-D convo-
lution: We have A ? B = B ? A, (A ? B) ? C = A ? (B ? C), and for fixed B, A ? B
is a linear function.
Figure 7.6 An 8 × 9 image and its convolution with the point spread func-
tion (7.3).
blurring of the fine details. This is illustrated in figure 7.6 for the 8 × 9 matrix
\[
X = \begin{bmatrix}
1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\
1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\
1 & 1 & 0 & 0 & 0 & 0 & 0 & 1 & 1 \\
1 & 1 & 1 & 0 & 1 & 1 & 0 & 1 & 1 \\
1 & 1 & 1 & 0 & 1 & 1 & 0 & 1 & 1 \\
1 & 1 & 1 & 0 & 1 & 1 & 0 & 1 & 1 \\
1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\
1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1
\end{bmatrix}. \qquad (7.4)
\]
\[
D_{\text{hor}} = \begin{bmatrix} 1 & -1 \end{bmatrix},
\]
the pixel values in the image Y = X ? B are the horizontal first order differences
of those in X:
\[
Y_{ij} = X_{ij} - X_{i,j-1}, \quad i = 1, \ldots, m, \quad j = 2, \ldots, n
\]
(and Y_{i1} = X_{i1}, Y_{i,n+1} = −X_{in} for i = 1, . . . , m). With the point spread function
\[
D_{\text{ver}} = \begin{bmatrix} 1 \\ -1 \end{bmatrix},
\]
the pixel values in the image Y = X ? B are the vertical first order differences of
those in X:
\[
Y_{ij} = X_{ij} - X_{i-1,j}, \quad i = 2, \ldots, m, \quad j = 1, \ldots, n
\]
(and Y_{1j} = X_{1j}, Y_{m+1,j} = −X_{mj} for j = 1, . . . , n). As an example, the
convolutions of the matrix (7.4) with Dhor and Dver are
\[
X ? D_{\text{hor}} = \begin{bmatrix}
1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & -1 \\
1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & -1 \\
1 & 0 & -1 & 0 & 0 & 0 & 0 & 1 & 0 & -1 \\
1 & 0 & 0 & -1 & 1 & 0 & -1 & 1 & 0 & -1 \\
1 & 0 & 0 & -1 & 1 & 0 & -1 & 1 & 0 & -1 \\
1 & 0 & 0 & -1 & 1 & 0 & -1 & 1 & 0 & -1 \\
1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & -1 \\
1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & -1
\end{bmatrix}
\]
and
\[
X ? D_{\text{ver}} = \begin{bmatrix}
1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & -1 & -1 & -1 & -1 & -1 & 0 & 0 \\
0 & 0 & 1 & 0 & 1 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
-1 & -1 & -1 & -1 & -1 & -1 & -1 & -1 & -1
\end{bmatrix}.
\]
Figure 7.7 shows the effect of convolution on a larger image. The figure shows
an image of size 512 × 512 and its convolution with the 8 × 8 matrix B with constant
entries Bij = 1/64.
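The horizontal-difference convolution can be sketched in plain Python, taking X as zero outside its index range (the small test image here is made up):

```python
def conv2d_hor_diff(X):
    # Y = X ? Dhor with Dhor = [1 -1]: horizontal first order differences,
    # Y[i][j] = X[i][j] - X[i][j-1], with X treated as zero out of range.
    m, n = len(X), len(X[0])
    Y = [[0] * (n + 1) for _ in range(m)]
    for i in range(m):
        for j in range(n + 1):
            left = X[i][j] if j < n else 0
            right = X[i][j - 1] if j > 0 else 0
            Y[i][j] = left - right
    return Y

X = [[1, 1, 0, 1],
     [1, 1, 1, 1]]
print(conv2d_hor_diff(X))   # an m x (n+1) image of differences
```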
Figure 7.7 512 × 512 image and the 519 × 519 image that results from the
convolution of the first image with an 8 × 8 matrix with constant entries
1/64.
Chapter 8
Linear equations
In this chapter we consider vector-valued linear and affine functions, and systems
of linear equations.
holds for all n-vectors x and y and all scalars α and β. It is a good exercise to
parse this simple looking equation, since it involves overloading of notation. On
the left-hand side, the scalar-vector multiplications αx and βy involve n-vectors,
and the sum αx + βy is the sum of two n-vectors. The function f maps n-vectors to
m-vectors, so f (αx + βy) is an m-vector. On the right-hand side, the scalar-vector
multiplications and the sum are those for m-vectors. Finally, the equality sign is
equality between two m-vectors.
\[
\begin{aligned}
f(\alpha x + \beta y) &= A(\alpha x + \beta y) \\
&= A(\alpha x) + A(\beta y) \\
&= \alpha(Ax) + \beta(Ay) \\
&= \alpha f(x) + \beta f(y).
\end{aligned}
\]
Thus we can associate with every matrix A a linear function f (x) = Ax.
The converse is also true. Suppose f is a function that maps n-vectors to m-
vectors, and is linear, i.e., (8.1) holds for all n-vectors x and y and all scalars α
and β. Then there exists an m × n matrix A such that f (x) = Ax for all x. This
can be shown in the same way as for scalar-valued functions in §2.1, by showing
that if f is linear, then
\[
f(x) = x_1 f(e_1) + x_2 f(e_2) + \cdots + x_n f(e_n), \qquad (8.2)
\]
where ek is the kth unit vector of size n. The right-hand side can also be written
as a matrix-vector product Ax, with
A = [ f (e1 ) f (e2 ) · · · f (en ) ].
The expression (8.2) is the same as (2.3), but here f (x) and f (ek ) are vectors. The
implications are exactly the same: a linear vector valued function f is completely
characterized by evaluating f at the n unit vectors e1 , . . . , en .
As in §2.1 it is easily shown that the matrix-vector representation of a linear
function is unique. If f : R^n → R^m is a linear function, then there exists exactly
one matrix A such that f (x) = Ax for all x.
(This is the n × n identity matrix with the order of its columns reversed. It
is sometimes called the reverser matrix.)
f (x) = (x1 , x1 + x2 , x1 + x2 + x3 , . . . , x1 + x2 + · · · + xn ).
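The column-by-column construction of A from f (e1 ), . . . , f (en ) can be sketched in Python; the running-sum function above serves as the example linear function:

```python
def matrix_of(f, n):
    # Build A column by column: the kth column of A is f(e_k)
    cols = []
    for k in range(n):
        e = [0.0] * n
        e[k] = 1.0
        cols.append(f(e))
    m = len(cols[0])
    return [[cols[k][i] for k in range(n)] for i in range(m)]

def running_sum(x):
    # f(x) = (x1, x1 + x2, ..., x1 + ... + xn), a linear function
    out, s = [], 0.0
    for xi in x:
        s += xi
        out.append(s)
    return out

A = matrix_of(running_sum, 3)
print(A)   # lower triangular matrix of ones
```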
Examples of functions that are not linear. Here we list some examples of func-
tions f that map n-vectors x to n-vectors f (x) that are not linear. In each case
we show a superposition counterexample.
Absolute value. f replaces each element of x with its absolute value: f (x) =
(|x1 |, |x2 |, . . . , |xn |).
The absolute value function is not linear. For example, with n = 1, x = 1,
y = 0, α = −1, β = 0, we have
\[
f(\alpha x + \beta y) = 1 \neq \alpha f(x) + \beta f(y) = -1,
\]
f (αx + βy) = αf (x) + βf (y)
holds for all n-vectors x, y, and all scalars α, β that satisfy α + β = 1. In other
words, superposition holds for affine combinations of vectors. (For linear functions,
superposition holds for any linear combinations of vectors.)
The matrix A and the vector b in the representation of an affine function as
f (x) = Ax + b are unique. These parameters can be obtained by evaluating f at
the vectors 0, e1 , . . . , en , where ek is the kth unit vector in R^n. We have
\[
A = \begin{bmatrix} f(e_1) - f(0) & f(e_2) - f(0) & \cdots & f(e_n) - f(0) \end{bmatrix}, \qquad b = f(0).
\]
Just like affine scalar-valued functions, affine vector valued functions are often
called linear, even though they are linear only when the vector b is zero.
evaluated at the point z. The rows of the Jacobian are ∇fi (z)T , for i = 1, . . . , m.
As in the scalar valued case, Taylor approximation is sometimes written with a
second argument as f̂ (x; z) to show the point z around which the approximation is
made. Evidently the Taylor approximation f̂ is an affine function of x. (It is
often called a linear approximation of f , even though it is not, in general, a linear
function.)
\[
\hat{y} = x^T \beta + v, \qquad (8.4)
\]
where the n-vector x is a feature vector for some object, β is an n-vector of weights,
v is a constant (the offset), and ŷ is the (scalar) value of the regression model
prediction.
Now suppose we have a set of N objects (also called samples or examples), with
feature vectors x1 , . . . , xN . The regression model predictions associated with the
examples are given by
\[
\hat{y}_i = x_i^T \beta + v, \quad i = 1, \ldots, N.
\]
where X̃ is the new feature matrix, with a new first row of ones, and β̃ = (v, β) is the
vector of regression model parameters. This is often written without the tildes, as
ŷ = X T β, by simply including the constant feature one as the first feature.
The equation above shows that the N -vector of predictions for the N examples
is a linear function of the model parameters (v, β). The N -vector of prediction
errors is an affine function of the model parameters.
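A sketch checking that stacking a constant feature reproduces the predictions ŷi = xiT β + v. All numbers are made up; note that for convenience each row of X here holds one example's feature vector, a layout choice of the sketch, not of the text:

```python
def regression_predictions(X, beta, v):
    # yhat_i = x_i^T beta + v for each feature vector x_i (rows of X here)
    return [sum(x_ij * b_j for x_ij, b_j in zip(x, beta)) + v for x in X]

def stacked_predictions(X, beta_tilde):
    # Same predictions with a constant feature 1 prepended to each example
    # and beta_tilde = (v, beta)
    return [sum(x_ij * b_j for x_ij, b_j in zip([1.0] + x, beta_tilde)) for x in X]

X = [[0.75, 1.25], [1.5, -0.25]]    # two examples, two features each
beta, v = [2.0, -1.0], 0.5
print(regression_predictions(X, beta, v))
print(stacked_predictions(X, [v] + beta))   # identical predictions
```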
Examples.
x1 + x2 = 1, x1 = −1, x1 − x2 = 0
is written as Ax = b with
\[
A = \begin{bmatrix} 1 & 1 \\ 1 & 0 \\ 1 & -1 \end{bmatrix}, \qquad
b = \begin{bmatrix} 1 \\ -1 \\ 0 \end{bmatrix}.
\]
It has no solutions.
x1 + x2 = 1, x2 + x3 = 2
is written as Ax = b with
\[
A = \begin{bmatrix} 1 & 1 & 0 \\ 0 & 1 & 1 \end{bmatrix}, \qquad
b = \begin{bmatrix} 1 \\ 2 \end{bmatrix}.
\]
8.3.1 Examples
x1 a1 + · · · + xn an = b,
a1 R1 + · · · + ap Rp −→ b1 P1 + · · · + bq Pq .
Here R1 , . . . , Rp are the reactants, P1 , . . . , Pq are the products, and the numbers
a1 , . . . , ap and b1 , . . . , bq are positive numbers that tell us how many of each of
these molecules is involved in the reaction. They are typically integers, but can be
scaled arbitrarily; we could double all of these numbers, for example, and we still
have the same reaction. As a simple example, we have the electrolysis of water,
2H2 O −→ 2H2 + O2 ,
which has one reactant, water (H2 O) and two products: molecular hydrogen (H2 ),
and molecular oxygen (O2 ). The coefficients tell us that 2 water molecules create
2 hydrogen molecules and 1 oxygen molecule. The coefficients in a reaction can
be multiplied by any nonzero numbers; for example, we could write the reaction
above as 3H2 O −→ 3H2 + (3/2)O2 . By convention reactions are written with all
coefficients integers, with greatest common divisor one.
In a chemical reaction the numbers of constituent atoms must balance. This
means that for each atom appearing in any of the reactants or products, the total
amount on the left-hand side must equal the total amount on the right-hand side.
(If any of the reactants or products is charged, i.e., an ion, then the total charge
must also balance.) In the simple water electrolysis reaction above, for example,
we have 4 hydrogen atoms on the left (2 water molecules, each with 2 hydrogen
atoms), and 4 on the right (2 hydrogen molecules, each with 2 hydrogen atoms).
The oxygen atoms also balance, so this reaction is balanced.
Balancing a chemical reaction with specified reactants and products, i.e., finding
the numbers a1 , . . . , ap and b1 , . . . , bq , can be expressed as a system of linear
equations. We can express the requirement that the reaction balances as a set of
m equations, where m is the number of different atoms appearing in the chemical
reaction. We define the m × p matrix R by
\[
R_{ij} = \text{number of atoms of type } i \text{ in reactant } R_j, \quad i = 1, \ldots, m, \quad j = 1, \ldots, p.
\]
(The entries of R are nonnegative integers.) The matrix R is interesting; for
example, its jth column gives the chemical formula for reactant Rj . We let a denote the
p-vector with entries a1 , . . . , ap . Then, the m-vector Ra gives the total number of
atoms of each type appearing in the reactants. We define an m × q matrix P in a
similar way, so the m-vector P b gives the total number of atoms of each type that
appears in the products.
These equations are easily solved, and have the solution (1, 1, 1/2). (Multiplying
these coefficients by 2 gives the reaction given above.)
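A toy sketch of the balance equations for the electrolysis reaction; this is the 2 × 2 case worked by hand, not a general balancing solver:

```python
# Balance a1 H2O -> b1 H2 + b2 O2 by fixing a1 = 1 and solving the
# atom-balance equations R a = P b for b1, b2.
R = [[2],     # hydrogen atoms per molecule of H2O
     [1]]     # oxygen atoms per molecule of H2O
P = [[2, 0],  # hydrogen atoms in H2 and O2
     [0, 2]]  # oxygen atoms in H2 and O2

a1 = 1.0
b1 = R[0][0] * a1 / P[0][0]   # hydrogen balance: 2 a1 = 2 b1
b2 = R[1][0] * a1 / P[1][1]   # oxygen balance:   1 a1 = 2 b2
print([a1, b1, b2])           # the solution (1, 1, 1/2) from the text
```

Multiplying these coefficients by 2 recovers the conventional form 2H2O → 2H2 + O2.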
Figure 8.1 A node in a diffusion system with label 1, exogenous flow s1 , and
three incident edges.
In a diffusion system, the flows must satisfy (flow) conservation, which means
that at each node, the total flow entering each node from adjacent edges and the
exogenous source, must be zero. This is illustrated in figure 8.1, which shows three
edges adjacent to node 1, two entering node 1 (flows 1 and 2), and one (flow 3)
leaving node 1, and an exogenous flow. Flow conservation at this node is expressed
as
f1 + f2 − f3 + s1 = 0.
Rf = AT e, (8.7)
(I − A)x = d.
This model of the sector inputs and outputs of an economy was developed by
Wassily Leontief in the late 1940s, and is now known as Leontief input-output
analysis. He was awarded the Nobel Prize in Economics for this work in 1973.
Chapter 9
Linear dynamical systems
xt+1 = At xt , t = 1, 2, . . . . (9.1)
Here the n × n matrices At are called the dynamics matrices. The equation above
is called the dynamics or update equation, since it gives us the next value of x, i.e.,
xt+1 , as a function of the current value xt . Often the dynamics matrix does not
depend on t, in which case the linear dynamical system is called time-invariant.
If we know xt (and At , At+1 , . . . ,) we can determine xt+1 , xt+2 , . . . , simply
by iterating the dynamics equation (9.1). In other words: If we know the current
value of x, we can find all future values. In particular, we do not need to know the
past states. This is why xt is called the state of the system. Roughly speaking, it
contains all the information needed to determine its future evolution.
Linear dynamical system with input. There are many variations on and exten-
sions of the basic linear dynamical system model (9.1), some of which we will
encounter in the sequel. As an example, we can add an additional term to the
update equation:
xt+1 = At xt + Bt ut + ct , t = 1, 2, . . . . (9.2)
Here ut is an m-vector called the input (at time t) and Bt are n × m matrices
called the input matrices. The vector ct is called the offset. The input and offset
are used to model other factors that affect the time evolution of the state. Another
name for the input ut is exogenous variable, since, roughly speaking, it comes from
outside the system.
Markov model. The linear dynamical system (9.1) is sometimes called a Markov
model (after the famous mathematician Andrey Markov). Markov studied systems
in which the next state value depends on the current one, and not on the previous
state values xt1 , xt2 , . . . . The linear dynamical system (9.1) is the special case
of a Markov system where the next state is a linear function of the current state.
In a variation on the Markov model, called a (linear) K-Markov model, the
next state xt+1 depends on the current state and K − 1 previous states. Such a
system has the form
\[
x_{t+1} = A_1 x_t + A_2 x_{t-1} + \cdots + A_K x_{t-K+1}, \quad t = K, K + 1, \ldots . \qquad (9.3)
\]
Models of this form are used in time series analysis and econometrics, where they
are called auto-regressive models. We will see later that the Markov model (9.3)
can be reduced to a standard linear dynamical system (9.1).
Simulation. If we know the dynamics (and input) matrices, and the state at time
t, we can find the future state trajectory xt+1 , xt+2 , . . . by iterating the equa-
tion (9.1) (or (9.2), provided we also know the input sequence ut , ut+1 , . . . ). This
is called simulating the linear dynamical system. Simulation makes predictions
about the future state of a system. (To the extent that (9.1) is only an approx-
imation or model of some real system, we must be careful when interpreting the
results.) We can carry out what-if simulations, to see what would happen if the
system changes in some way, or if a particular set of inputs occurs.
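Simulation of a time-invariant system xt+1 = Axt takes only a few lines of Python (the 2 × 2 dynamics matrix below is a made-up example):

```python
def simulate(A, x1, T):
    # Iterate x_{t+1} = A x_t, returning the trajectory x_1, ..., x_T
    xs = [x1]
    for _ in range(T - 1):
        x = xs[-1]
        xs.append([sum(a * xi for a, xi in zip(row, x)) for row in A])
    return xs

A = [[1.0, 1.0],
     [0.0, 1.0]]          # position accumulates the (constant) velocity
traj = simulate(A, [0.0, 1.0], 4)
print(traj)
```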
Figure 9.1 Age distribution in the US in 2010. (United States Census Bureau,
[Link].)
individual people. Also, note that the model does not track people 100 and older.
The distribution of ages in the US in 2010 is shown in figure 9.1.
The birth rate is given by a 100-vector b, where bi is the average number of
births per person with age i − 1, i = 1, . . . , 100. (This is half the average number
of births per woman with age i − 1, assuming equal numbers of men and women
in the population.) Of course bi is approximately zero for i < 13 and i > 50. The
approximate birth rates for the US in 2010 are shown in figure 9.2. The death rate
is given by a 100-vector d, where di is the portion of those aged i − 1 who will die
this year. The death rates for the US in 2010 are shown in figure 9.3.
To derive the dynamics equation (9.1), we find xt+1 in terms of xt , taking into
account only births and deaths, and not immigration. The number of 0-year olds
next year is the total number of births this year:
(xt+1 )1 = bT xt .
The number of i-year olds next year is the number of (i − 1)-year-olds this year,
minus those who die:
\[
(x_{t+1})_{i+1} = (1 - d_i)(x_t)_i, \quad i = 1, \ldots, 99.
\]
We can assemble these equations into the time-invariant linear dynamical system
\[
x_{t+1} = A x_t, \quad t = 1, 2, \ldots, \qquad (9.4)
\]
Figure 9.2 Approximate birth rate versus age in the US in 2010. The figure is
based on statistics for age groups of five years (hence, the piecewise-constant
shape) and assumes an equal number of men and women in each age group.
(Martin J.A., Hamilton B.E., Ventura S.J. et al., Births: Final data for 2010.
National Vital Statistics Reports; vol. 61, no. 1. National Center for Health
Statistics, 2012.)
Figure 9.3 Death rate versus age, for ages 0–99, in the US in 2010. (Centers
for Disease Control and Prevention, National Center for Health Statistics,
[Link].)
Figure 9.4 Predicted age distribution in the US in 2020.
where A is given by
\[
A = \begin{bmatrix}
b_1 & b_2 & b_3 & \cdots & b_{98} & b_{99} & b_{100} \\
1 - d_1 & 0 & 0 & \cdots & 0 & 0 & 0 \\
0 & 1 - d_2 & 0 & \cdots & 0 & 0 & 0 \\
\vdots & \vdots & \vdots & & \vdots & \vdots & \vdots \\
0 & 0 & 0 & \cdots & 1 - d_{98} & 0 & 0 \\
0 & 0 & 0 & \cdots & 0 & 1 - d_{99} & 0
\end{bmatrix}.
\]
We can use this model to predict the total population in 10 years (not including
immigration), or to predict the number of school age children, or retirement age
adults. Figure 9.4 shows the predicted age distribution in 2020, computed by
iterating the model xt+1 = Axt for t = 1, . . . , 10, with initial value x1 given by
the 2010 age distribution of figure 9.1. Note that the distribution is based on an
approximate model, since we neglect the effect of immigration, and assume that the
death and birth rates remain constant and equal to the values shown in figures 9.2
and 9.3.
Population dynamics models are used to carry out projections of the future age
distribution, which in turn is used to predict how many retirees there will be in
some future year. They are also used to carry out various what-if analyses, to
predict the effect of changes in birth or death rates on the future age distribution.
It is easy to include the effects of immigration and emigration in the population
dynamics model (9.4), by simply adding a 100-vector ut :
xt+1 = Axt + ut ,
which is a time-invariant linear dynamical system of the form (9.2), with input ut
and B = I. The vector ut gives the net immigration in year t over all ages; (ut )i
is the number of immigrants in year t of age i − 1. (Negative entries mean net
emigration.)
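A sketch of the population dynamics matrix and a short simulation; the three-age-group rates below are invented purely for illustration:

```python
def population_matrix(b, d):
    # Dynamics matrix for the population model: first row holds the birth
    # rates, the subdiagonal carries the survival rates 1 - d_i
    n = len(b)
    A = [[0.0] * n for _ in range(n)]
    A[0] = list(b)
    for i in range(n - 1):
        A[i + 1][i] = 1.0 - d[i]
    return A

# Toy 3-age-group example (made-up rates)
b = [0.0, 1.0, 0.0]          # only the middle age group gives birth
d = [0.5, 0.0, 1.0]          # half of the youngest survive; the oldest all die
A = population_matrix(b, d)
x = [100.0, 0.0, 0.0]        # initial age distribution
for _ in range(2):           # two years of x_{t+1} = A x_t
    x = [sum(a * xi for a, xi in zip(row, x)) for row in A]
print(x)
```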
Figure 9.5 Simulation of an epidemic dynamics model, showing the fractions
xt of susceptible, infected, recovered, and deceased individuals versus time t.
Figure 9.6 A mass m moving along a line, with position p(τ ) and applied
force f (τ ).

\[
m \frac{d^2 p}{d\tau^2}(\tau) = -\eta \frac{dp}{d\tau}(\tau) + f(\tau),
\]
where m > 0 is the mass, f (τ ) is the external force acting on the mass at time τ ,
and η > 0 is the drag coefficient. The right-hand side is the total force acting on
the mass; the first term is the drag force, which is proportional to the velocity and
in the opposite direction.
Introducing the velocity of the mass, v(τ ) = dp(τ )/dτ , we can write the equation
We simulate this system for a period of 2.5 seconds, starting from the initial state
x1 = (0, 0), which corresponds to the mass starting at rest (zero velocity) at position
0. The simulation involves iterating the dynamics equation from k = 1 to k = 250.
Figure 9.7 shows the force, position, and velocity of the mass, with the axes labeled
using continuous time τ .
Figure 9.7 Simulation of mass moving along a line. Applied force (top),
position (middle), and velocity (bottom).
xt+1 = xt + Asc ft + pt − st , t = 1, 2, . . . .
B = [ Asc I −I ].
A = I,
(Note that Asc refers to the supply chain graph incidence matrix, while A is the
dynamics matrix in (9.2).) This gives
Figure 9.8 A simple supply chain with three locations (nodes 1, 2, 3), shipments
f1 , f2 , f3 along the edges, and purchase and sale quantities pi and si
at each location.
A simple example is shown in figure 9.8. The supply chain dynamics equation
is
\[
x_{t+1} = x_t + \begin{bmatrix}
-1 & -1 & 0 & 1 & 0 & 0 & -1 & 0 & 0 \\
1 & 0 & -1 & 0 & 1 & 0 & 0 & -1 & 0 \\
0 & 1 & 1 & 0 & 0 & 1 & 0 & 0 & -1
\end{bmatrix}
\begin{bmatrix} f_t \\ p_t \\ s_t \end{bmatrix}, \quad t = 1, 2, \ldots .
\]
It is a good exercise to check that the matrix-vector product (the middle term of
the right-hand side) gives the amount of commodity added at each location, as a
result of shipment and purchasing.
Chapter 10
Matrix multiplication
There are several ways to remember this rule. To find the i, j element of the
product C = AB, you need to know the ith row of A and the jth column of B.
The summation above can be interpreted as moving left to right along the ith row
of A while moving top to bottom down the jth column of B. As you go, you
keep a running sum of the product of elements, one from A and one from B.
As a specific example, we have
\[
\begin{bmatrix} -1.5 & 3 & 2 \\ 1 & -1 & 0 \end{bmatrix}
\begin{bmatrix} -1 & -1 \\ 0 & -2 \\ 1 & 0 \end{bmatrix}
= \begin{bmatrix} 3.5 & -4.5 \\ -1 & 1 \end{bmatrix}.
\]
To find the 1, 2 entry of the right-hand matrix, we move along the first row of
the left-hand matrix, and down the second column of the middle matrix, to get
(−1.5)(−1) + (3)(−2) + (2)(0) = −4.5.
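The inner-product rule can be implemented as a short Python function; the matrices below form a 2 × 3 times 3 × 2 example of the same shape as the one discussed in this section:

```python
def matmul(A, B):
    # C_ij = inner product of the ith row of A and the jth column of B
    m, p, n = len(A), len(B), len(B[0])
    assert len(A[0]) == p, "inner dimensions must agree"
    return [[sum(A[i][k] * B[k][j] for k in range(p)) for j in range(n)]
            for i in range(m)]

A = [[-1.5, 3.0, 2.0],
     [1.0, -1.0, 0.0]]
B = [[-1.0, -1.0],
     [0.0, -2.0],
     [1.0, 0.0]]
print(matmul(A, B))
```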
Matrix-matrix multiplication includes as special cases several other types of
multiplication (or product) we have encountered so far.
We have
\[
AB = \begin{bmatrix} -6 & 11 \\ -3 & -3 \end{bmatrix}, \qquad
BA = \begin{bmatrix} -9 & -3 \\ 17 & 0 \end{bmatrix}.
\]
Two matrices A and B that satisfy AB = BA are said to commute. (Note that for
AB = BA to make sense, A and B must both be square.)
Properties of matrix multiplication. The following properties hold and are easy
to verify from the definition of matrix multiplication. We assume that A, B, and
C are matrices for which all the operations below are valid, and that γ is a scalar.
Associativity: (AB)C = A(BC). Therefore we can write the product simply
as ABC.
Associativity with scalar multiplication: γ(AB) = (γA)B, where γ is a scalar
and A and B are matrices (that can be multiplied). This is also equal to
A(γB). (Note that the products γA and γB are defined as scalar-matrix
products, but in general, unless A and B have one row, not as matrix-matrix
products.)
Distributivity with addition: Matrix multiplication distributes across matrix
addition: A(B+C) = AB+AC and (A+B)C = AC +BC. On the right-hand
sides of these equations we use the higher precedence of matrix multiplication
over addition, so, for example, AC + BC is interpreted as (AC) + (BC).
Transpose of product. The transpose of a product is the product of the
transposes, but in the opposite order: (AB)T = B T AT .
From these properties we can derive others. For example, if A, B, C, and D are
square matrices of the same size, we have the identity
(A + B)(C + D) = AC + AD + BC + BD.
This is the same as the usual formula for expanding a product of sums of scalars;
but with matrices, we must be careful to preserve the order of the products.
for any matrices A, B, . . . , H for which the matrix products above make sense. This
formula is the same as the formula for multiplying two 2 2 matrices (i.e., with
scalar entries); but when the entries of the matrix are themselves matrices (as in
the block matrix above), we must be careful to preserve the multiplication order.
Multiple linear equations. We can use the column interpretation of matrix
multiplication to express a set of k linear equations with the same m × n coefficient
matrix F ,
F xi = gi , i = 1, . . . , k,
in the compact form
F X = G,
where X = [x1 xk ] and G = [g1 gk ].
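As a quick numerical check (a NumPy sketch, with a made-up matrix F and right-hand sides), stacking the unknowns and the right-hand sides as columns turns the k equations F x_i = g_i into the single matrix equation F X = G:

```python
import numpy as np

# Hypothetical 3 x 2 coefficient matrix F and two right-hand sides.
F = np.array([[1.0, 2.0], [0.0, 1.0], [3.0, -1.0]])
x1, x2 = np.array([1.0, 1.0]), np.array([2.0, -1.0])
g1, g2 = F @ x1, F @ x2          # consistent by construction

X = np.column_stack([x1, x2])    # X = [x1 x2]
G = np.column_stack([g1, g2])    # G = [g1 g2]

assert np.allclose(F @ X, G)     # F X = G holds column by column
```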
This shows that the rows of AB are obtained by applying B^T to the transposed row vectors a_k of A.
Thus we can interpret the matrix-matrix product as the mn inner products a_i^T b_j arranged in an m × n matrix.
The entries of the Gram matrix G = A^T A give all inner products of pairs of columns of A: G_ij = a_i^T a_j. Note that a Gram matrix is symmetric, since a_i^T a_j = a_j^T a_i. This can also be seen using the transpose of product rule:

G^T = (A^T A)^T = A^T (A^T)^T = A^T A = G.

The Gram matrix will play an important role later in this book.
As an example, suppose the m × n matrix A gives the membership of m items in n groups, with entries

A_ij = 1 if item i is in group j,   A_ij = 0 if item i is not in group j.

(So the jth column of A gives the membership in the jth group, and the ith row gives the groups that item i is in.) In this case the Gram matrix G has a nice interpretation: G_ij is the number of items that are in both groups i and j, and G_ii is the number of items in group i.
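This interpretation is easy to verify numerically; the sketch below uses NumPy and a small made-up membership matrix:

```python
import numpy as np

# Membership matrix: A[i, j] = 1 if item i is in group j (hypothetical data).
A = np.array([[1, 0, 1],
              [1, 1, 0],
              [0, 1, 1],
              [1, 1, 1]])

G = A.T @ A  # Gram matrix: G[i, j] counts items in both groups i and j

assert G[0, 0] == 3            # items 0, 1, 3 are in group 0
assert G[0, 1] == 2            # items 1 and 3 are in both groups 0 and 1
assert np.array_equal(G, G.T)  # Gram matrices are symmetric
```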
The product can also be written as a sum of outer products of the columns a_i of A and the rows b_i^T of B:

AB = a_1 b_1^T + ··· + a_p b_p^T.
In some special cases the complexity is less than 2mnp flops. As an example, when we compute the n × n Gram matrix G = A^T A of an m × n matrix A, we only need to compute the entries in the upper (or lower) half of G, since G is symmetric. This saves around half the flops, so the complexity is around mn^2 flops. But the order is the same.
Order of multiplication. When we compute a product of three matrices, D = ABC, with A of size m × n, B of size n × p, and C of size p × q, we can compute (AB)C (using 2mp(n + q) flops) or A(BC) (using 2nq(m + p) flops); the results are the same. Computing (AB)C is the cheaper of the two when

1/n + 1/q < 1/m + 1/p.
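A small sketch of this comparison (the dimensions below are made up for illustration):

```python
# Flop counts for the two ways of computing D = ABC, with A m x n,
# B n x p, C p x q.
m, n, p, q = 100, 5, 100, 5

flops_AB_first = 2 * m * n * p + 2 * m * p * q   # (AB)C
flops_BC_first = 2 * n * p * q + 2 * m * n * q   # A(BC)

# The rule from the text: (AB)C is cheaper exactly when
# 1/n + 1/q < 1/m + 1/p.
rule_says_AB_first = 1 / n + 1 / q < 1 / m + 1 / p

assert (flops_AB_first < flops_BC_first) == rule_says_AB_first
```

With these dimensions A(BC) is 20 times cheaper, which is why the order of multiplication matters in practice.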
In words: to find h(x), we first apply the function g, to obtain the partial result
g(x) (which is a p-vector); then we apply the function f to this result, to obtain
h(x) (which is an m-vector). In the formula h(x) = f(g(x)), f appears to the left of g; but when we evaluate h(x), we apply g first. The composition h is evidently a linear function that can be written as h(x) = Cx with C = AB.
Using this interpretation of matrix multiplication as composition of linear func-
tions, it is easy to understand why in general AB ≠ BA, even when the dimensions are compatible. Evaluating the function h(x) = ABx means we first evaluate
y = Bx, and then z = Ay. Evaluating the function BAx means we first evaluate
y = Ax, and then z = By. In general, the order matters. As an example, take the 2 × 2 matrices

A = [ −1  0 ]      B = [ 0  1 ]
    [  0  1 ],         [ 1  0 ],

for which

AB = [ 0  −1 ]      BA = [  0  1 ]
     [ 1   0 ],          [ −1  0 ].

The mapping f(x) = Ax = (−x_1, x_2) changes the sign of the first element of the vector x. The mapping g(x) = Bx = (x_2, x_1) reverses the order of the two elements of x. If we evaluate f(g(x)) = ABx = (−x_2, x_1), we first reverse the order, and then change the sign of the first element. This result is obviously different from g(f(x)) = BAx = (x_2, −x_1), obtained by changing the sign of the first element, and then reversing the order of the elements.
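The sketch below checks a pair of matrices of this form with NumPy (here A negates the first entry and B swaps the two entries):

```python
import numpy as np

A = np.array([[-1, 0], [0, 1]])  # change the sign of the first entry
B = np.array([[0, 1], [1, 0]])   # swap the two entries

x = np.array([3, 5])
assert np.array_equal(A @ B @ x, [-5, 3])  # swap, then negate first
assert np.array_equal(B @ A @ x, [5, -3])  # negate first, then swap
assert not np.array_equal(A @ B, B @ A)    # the products differ
```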
D_n x = (x_2 − x_1, . . . , x_n − x_{n−1}).

The left-hand matrix is associated with the second difference linear function that maps 5-vectors into 3-vectors. The middle matrix D_4 is associated with the difference function that maps 4-vectors into 3-vectors. The right-hand matrix D_5 is associated with the difference function that maps 5-vectors into 4-vectors.
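The difference matrices and their composition can be sketched in NumPy as follows (diff_matrix is a hypothetical helper name):

```python
import numpy as np

def diff_matrix(n):
    # (n-1) x n difference matrix: (D_n x)_i = x_{i+1} - x_i
    return np.eye(n - 1, n, k=1) - np.eye(n - 1, n)

D4, D5 = diff_matrix(4), diff_matrix(5)
Delta = D4 @ D5   # 3 x 5 second difference matrix

x = np.array([1.0, 4.0, 9.0, 16.0, 25.0])  # squares of 1..5
assert np.allclose(D5 @ x, [3, 5, 7, 9])   # first differences
assert np.allclose(Delta @ x, [2, 2, 2])   # second differences are constant
```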
Composition of affine functions. The composition of two affine functions is affine: if f(x) = Ax + b and g(x) = Cx + d, then h(x) = f(g(x)) = A(Cx + d) + b = Āx + b̄, where Ā = AC and b̄ = Ad + b.
The function h is differentiable and its partial derivatives follow from those of f and g via the chain rule:

∂h_i/∂x_j (z) = ∂f_i/∂y_1 (g(z)) ∂g_1/∂x_j (z) + ··· + ∂f_i/∂y_p (g(z)) ∂g_p/∂x_j (z),

which can be expressed in compact matrix form as

Dh(z) = Df(g(z)) Dg(z).
The same result can be interpreted as a composition of two affine functions, the
first order Taylor approximation of f at g(z),
[Figure 10.1: a directed graph on five vertices.]
Each term in the sum is 0 or 1, and equal to one only if there is an edge from vertex
j to vertex k and an edge from vertex k to vertex i, i.e., a path of length exactly
two from vertex j to vertex i via vertex k. By summing over all k, we obtain the
total number of paths of length two from j to i. The adjacency matrix A for the
graph in figure 10.1, for example, and its square are given by
    [ 0 1 0 0 1 ]         [ 1 0 1 1 0 ]
    [ 1 0 1 0 0 ]         [ 0 1 1 1 2 ]
A = [ 0 0 1 1 1 ],  A^2 = [ 1 0 1 2 1 ].
    [ 1 0 0 0 0 ]         [ 0 1 0 0 1 ]
    [ 0 0 0 1 0 ]         [ 1 0 0 0 0 ]
We can verify there is exactly one path of length two from vertex 1 to itself, i.e., the path (1, 2, 1), and one path of length two from vertex 3 to vertex 1, i.e., the path (3, 2, 1). There are two paths of length two from vertex 4 to vertex 3: the paths (4, 3, 3) and (4, 5, 3), so (A^2)_{34} = 2.
The property extends to higher powers of A. If ℓ is a positive integer, then the i, j element of A^ℓ is the number of paths of length ℓ from vertex j to vertex i. This can be proved by induction on ℓ. We have already shown the result for ℓ = 2. Assume that it is true that the elements of A^ℓ give the paths of length ℓ between the different vertices. Consider the expression for the i, j element of A^{ℓ+1}:

(A^{ℓ+1})_{ij} = Σ_{k=1}^n A_{ik} (A^ℓ)_{kj}.
The kth term in the sum is equal to the number of paths of length ℓ from j to k if there is an edge from k to i, and is equal to zero otherwise. Therefore it is equal to the number of paths of length ℓ + 1 from j to i that end with the edge (k, i), i.e., of the form (j, . . . , k, i). By summing over all k we obtain the total number of paths of length ℓ + 1 from vertex j to i. This can be verified in the example. The third power of A is

      [ 1 1 1 1 2 ]
      [ 2 0 2 3 1 ]
A^3 = [ 2 1 1 2 2 ].
      [ 1 0 1 1 0 ]
      [ 0 1 0 0 1 ]
The (A^3)_{24} = 3 paths of length three from vertex 4 to vertex 2 are (4, 3, 3, 2), (4, 5, 3, 2), and (4, 5, 1, 2).
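A NumPy sketch of this path-counting property, using the adjacency matrix above (0-indexed in code):

```python
import numpy as np

# Adjacency matrix of the graph in figure 10.1: A[i, j] = 1 if there is
# an edge from vertex j+1 to vertex i+1 (vertices are 0-indexed here).
A = np.array([[0, 1, 0, 0, 1],
              [1, 0, 1, 0, 0],
              [0, 0, 1, 1, 1],
              [1, 0, 0, 0, 0],
              [0, 0, 0, 1, 0]])

A2 = np.linalg.matrix_power(A, 2)
A3 = np.linalg.matrix_power(A, 3)

assert A2[0, 0] == 1  # one path of length two from vertex 1 to itself
assert A2[2, 3] == 2  # two paths of length two from vertex 4 to vertex 3
assert A3[1, 3] == 3  # three paths of length three from vertex 4 to vertex 2
```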
Figure 10.2 Contribution factor per age in 2010 to the total population in 2020. The value for age i − 1 is the ith component of the row vector 1^T A^{10}.
(The first term agrees with the formula for x_{t+ℓ} with no input.) The other terms are readily interpreted. The term A^j B u_{t+ℓ−j} is the contribution to the state x_{t+ℓ} due to the input at time t + ℓ − j.
10.4 QR factorization
Matrices with orthonormal columns. As an application of Gram matrices, we can
express the condition that the n-vectors a1 , . . . , ak are orthonormal in a simple
way using matrix notation:
AT A = I,
where A is the n × k matrix with columns a_1, . . . , a_k. There is no standard term for a matrix whose columns are orthonormal; we refer to such a matrix as 'a matrix whose columns are orthonormal'. But a square matrix that satisfies A^T A = I is called orthogonal; its columns are an orthonormal basis. Orthogonal matrices have many uses, and arise in many applications.
We have already encountered some orthogonal matrices, including identity ma-
trices, 2-D reflections and rotations (page 105), and permutation matrices (page 108).
Norm, inner product, and angle properties. Suppose the columns of the m × n matrix A are orthonormal, and x and y are any n-vectors. Then multiplication by A preserves norms, inner products, and angles: ‖Ax‖ = ‖x‖, (Ax)^T (Ay) = x^T y, and ∠(Ax, Ay) = ∠(x, y). To verify the second property, note that

(Ax)^T (Ay) = (x^T A^T)(Ay)
            = x^T (A^T A) y
            = x^T (I y)
            = x^T y.

In the first line, we use the transpose-of-product rule; in the second, we re-associate a product of 4 matrices (considering the row vector x^T and column vector x as matrices); in the third line we use A^T A = I, and in the fourth line we use Iy = y.
From the second property we can derive the first one: By taking y = x we get (Ax)^T (Ax) = x^T x; taking the square root of each side gives ‖Ax‖ = ‖x‖. The third property, angle preservation, follows from the first two, since

∠(Ax, Ay) = arccos( (Ax)^T (Ay) / (‖Ax‖ ‖Ay‖) ) = arccos( x^T y / (‖x‖ ‖y‖) ) = ∠(x, y).
where q_i is the vector obtained in the first step of the Gram-Schmidt algorithm, as

a_i = R_{1i} q_1 + ··· + R_{ii} q_i,

where R_{ij} = q_i^T a_j for i < j and R_{ii} = ‖q̃_i‖. Defining R_{ij} = 0 for i > j, we can express the equations above in compact matrix form as

A = QR.
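A numerical sketch with NumPy's QR routine (which may differ from Gram-Schmidt by sign choices in Q and R) on a small matrix with independent columns:

```python
import numpy as np

A = np.array([[-3.0, -4.0], [4.0, 6.0], [1.0, 1.0]])
Q, R = np.linalg.qr(A)   # Q is 3 x 2 with orthonormal columns, R is 2 x 2

assert np.allclose(Q @ R, A)             # A = QR
assert np.allclose(Q.T @ Q, np.eye(2))   # columns of Q are orthonormal
assert np.allclose(R, np.triu(R))        # R is upper triangular
```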
Matrix inverses
In this chapter we introduce the concept of matrix inverse. We show how matrix
inverses can be used to solve linear equations, and how they can be computed using
the QR factorization.
Examples.

If A is a number (i.e., a 1 × 1 matrix), then a left inverse X is the same as the inverse of the number. In this case, A is left-invertible whenever A is nonzero, and it has only one left inverse.

Any nonzero n-vector a, considered as an n × 1 matrix, is left-invertible. For any index i with a_i ≠ 0, the row n-vector x = (1/a_i) e_i^T satisfies xa = 1.
The matrix

    [ −3 −4 ]
A = [  4  6 ]
    [  1  1 ]

has two different left inverses,

B = (1/9) [ −11 −10  16 ]      C = (1/2) [ 0 −1  6 ]
          [   7   8 −11 ],               [ 0  1 −4 ].
Left-invertibility and column independence. If A has a left inverse C, then the columns of A are linearly independent: if Ax = 0, then

0 = C(Ax) = (CA)x = Ix = x,

which shows that the only linear combination of the columns of A that is 0 is the one with all coefficients zero.
We will see below that the converse is also true; a matrix has a left inverse if and only if its columns are linearly independent. So the generalization of 'a number has an inverse if and only if it is nonzero' is 'a matrix has a left inverse if and only if its columns are linearly independent'.
Solving linear equations with a left inverse. Suppose that Ax = b holds, and that C is a left inverse of A. Then

Cb = C(Ax) = (CA)x = Ix = x,

which means that x = Cb is a solution of the set of linear equations. The columns of A are linearly independent (since it has a left inverse), so there is only one solution of the linear equations Ax = b; in other words, x = Cb is the solution of Ax = b.
Now suppose there is no x that satisfies the linear equations Ax = b, and
let C be a left inverse of A. Then x = Cb does not satisfy Ax = b, since no
vector satisfies this equation by assumption. This gives a way to check if the linear
equations Ax = b have a solution, and to find one when there is one, provided we
have a left inverse of A. We simply test whether A(Cb) = b. If this holds, then we
have found a solution of the linear equations; if it does not, then we can conclude
that there is no solution of Ax = b.
In summary, a left inverse can be used to determine whether or not a solution of
an over-determined set of linear equations exists, and when it does, find the unique
solution.
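A sketch of this test with NumPy, using the pseudo-inverse as one particular left inverse of a made-up tall matrix:

```python
import numpy as np

A = np.array([[-3.0, -4.0], [4.0, 6.0], [1.0, 1.0]])
C = np.linalg.pinv(A)                    # one particular left inverse of A
assert np.allclose(C @ A, np.eye(2))     # CA = I

b_good = A @ np.array([1.0, 2.0])        # in the range of A by construction
b_bad = b_good + np.array([1.0, 0.0, 0.0])  # perturbed off the range

# Ax = b has a solution exactly when A(Cb) = b.
assert np.allclose(A @ (C @ b_good), b_good)
assert not np.allclose(A @ (C @ b_bad), b_bad)
```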
Right inverse. Now we turn to the closely related concept of right inverse. A
matrix X that satisfies
AX = I
is called a right inverse of A. The matrix A is right-invertible if a right inverse
exists. Any right inverse has the same dimensions as AT .
Left and right inverse of matrix transpose. If A has a right inverse B, then B T
is a left inverse of AT , since B T AT = (AB)T = I. If A has a left inverse C, then
C T is a right inverse of AT , since AT C T = (CA)T = I. This observation allows us
to map all the results for left invertibility given above to similar results for right
invertibility. Some examples are given below.
A matrix is right invertible if and only if its rows are linearly independent.
A tall matrix cannot have a right inverse. Only square or wide matrices can
be right invertible.
Solving linear equations with a right inverse. Consider the set of m linear equa-
tions in n variables Ax = b. Suppose A is right-invertible, with right inverse B.
This implies that A is square or wide, so the linear equations Ax = b are square or
under-determined.
Then for any m-vector b, the n-vector x = Bb satisfies the equation Ax = b.
To see this, we note that
Ax = A(Bb) = (AB)b = Ib = b.
Examples. Consider the matrix appearing in the example above on page 157,

    [ −3 −4 ]
A = [  4  6 ],
    [  1  1 ]

and the two left inverses

B = (1/9) [ −11 −10  16 ]      C = (1/2) [ 0 −1  6 ]
          [   7   8 −11 ],               [ 0  1 −4 ].
(Recall that B T and C T are both right inverses of AT .) We can find a solution
of AT y = b for any vector b.
Left and right inverse of matrix product. Suppose A and D are compatible for the matrix product AD (i.e., the number of columns in A is equal to the number of rows in D). If A has a right inverse B and D has a right inverse E, then EB is a right inverse of AD. This follows from

(AD)(EB) = A(DE)B = AIB = AB = I.
If A has a left inverse C and D has a left inverse F , then F C is a left inverse
of AD. This follows from
(F C)(AD) = F (CA)D = F D = I.
11.2 Inverse
If a matrix is left- and right-invertible, then the left and right inverses are unique
and equal. To see this, suppose that AX = I and Y A = I, i.e., X is any right
inverse and Y is any left inverse of A. Then we have
X = (Y A)X = Y (AX) = Y,
i.e., any left inverse of A is equal to any right inverse of A. This implies that the right inverse is unique: if we have AX̃ = I, then the argument above tells us that X̃ = Y, so we have X̃ = X, i.e., there is only one right inverse of A. A similar argument shows that Y (which is the same as X) is the only left inverse of A.
When a matrix A has both a left inverse Y and a right inverse X, we call the
matrix X = Y simply the inverse of A, and denote it as A1 . We say that A is
invertible or nonsingular. A square matrix that is not invertible is called singular.
If A is invertible, its inverse satisfies

AA^{-1} = A^{-1}A = I.
Solving linear equations with the inverse. Consider the square system of n linear equations with n variables, Ax = b. If A is invertible, then for any n-vector b,

x = A^{-1} b    (11.1)

is the unique solution of Ax = b.
So B is a right inverse of A.
We have just shown that for a square matrix A,

left-invertible ⇒ columns are linearly independent ⇒ right-invertible.

(The symbol ⇒ means that the left-hand condition implies the right-hand condition.) Applying the same result to the transpose of A allows us to also conclude that

right-invertible ⇒ rows are linearly independent ⇒ left-invertible.
So all six of these conditions are equivalent; if any one of them holds, so do the
other five.
In summary, for a square matrix A, the following are equivalent.
A is invertible.
The columns of A are linearly independent.
The rows of A are linearly independent.
A has a left inverse.
A has a right inverse.
Examples.
The identity matrix I is invertible, with inverse I^{-1} = I, since II = I.
A diagonal matrix A is invertible if and only if its diagonal entries are nonzero. The inverse of an n × n diagonal matrix A with nonzero diagonal entries is

         [ 1/A_{11}     0      ···     0      ]
A^{-1} = [    0      1/A_{22}  ···     0      ],
         [    ⋮          ⋮      ⋱      ⋮      ]
         [    0          0     ···  1/A_{nn}  ]

since

          [ A_{11}/A_{11}       0        ···       0        ]
AA^{-1} = [      0        A_{22}/A_{22}  ···       0        ] = I.
          [      ⋮              ⋮         ⋱        ⋮        ]
          [      0              0        ···  A_{nn}/A_{nn} ]
In compact notation, we have

A^{-1} = diag(1/A_{11}, . . . , 1/A_{nn}).

Note that the inverse on the left-hand side of this equation is the matrix inverse, while the inverses appearing on the right-hand side are scalar inverses.
As a non-obvious example, the matrix

    [ 1 2 3 ]
A = [ 0 2 2 ]
    [ 3 4 4 ]

is invertible.
Inverse of matrix product. If A and B are invertible (hence, square) and of the
same size, then AB is invertible, and
(AB)^{-1} = B^{-1} A^{-1}.    (11.2)
x = β_1 a_1 + ··· + β_n a_n.

So the dual basis gives us a simple way to find the coefficients in the expansion of a vector in the a_1, . . . , a_n basis. We can summarize this as the identity

x = (b_1^T x) a_1 + ··· + (b_n^T x) a_n,

which holds for any n-vector x. This explicit formula shows how to express an arbitrary vector x as a linear combination of a basis a_1, . . . , a_n. To get the coefficients, we take the inner product with the dual basis vectors.
b_1^T = [ 1/2  1/2 ],   b_2^T = [ 1/2  1/2 ].
Negative matrix powers. We can now give a meaning to matrix powers with negative integer exponents. Suppose A is a square invertible matrix and k is a positive integer. Then by repeatedly applying property (11.2), we get

(A^k)^{-1} = (A^{-1})^k.

We denote this matrix as A^{-k}.
L_{11} x_1 = 0
L_{21} x_1 + L_{22} x_2 = 0
L_{31} x_1 + L_{32} x_2 + L_{33} x_3 = 0
    ⋮
L_{n1} x_1 + L_{n2} x_2 + ··· + L_{n,n−1} x_{n−1} + L_{nn} x_n = 0.
Since L_{11} ≠ 0, the first equation implies x_1 = 0. Using x_1 = 0, the second equation reduces to L_{22} x_2 = 0. Since L_{22} ≠ 0, we conclude that x_2 = 0. Using x_1 = x_2 = 0, the third equation now reduces to L_{33} x_3 = 0, and since L_{33} is assumed to be nonzero, we have x_3 = 0. Continuing this argument, we find that all entries of x are zero, and this shows that the columns of L are linearly independent. It follows that L is invertible.
A similar argument can be followed to show that an upper triangular matrix with nonzero diagonal elements is invertible. One can also simply note that if R is upper triangular, then L = R^T is lower triangular with the same diagonal, and use the formula (L^T)^{-1} = (L^{-1})^T for the inverse of the transpose.
Complexity of back substitution. The first step requires 1 flop (division by R_{nn}). The next step requires one multiply, one subtraction, and one division, for a total of 3 flops. The kth step requires k − 1 multiplies, k − 1 subtractions, and one division, for a total of 2k − 1 flops. The total number of flops for back substitution is then

1 + 3 + 5 + ··· + (2n − 1) = n^2

flops.
(The formula above was allegedly discovered by the mathematician Carl Friedrich
Gauss when he was a child. Here is his argument, for the case when n is even: Lump
the first entry in the sum together with the last entry, the second entry together
with the second-to-last entry, and so on. Each of these pairs add up to 2n; since
there are n/2 such pairs, the total is (n/2)(2n) = n2 . A similar argument works
when n is odd.)
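A minimal back substitution sketch in Python (back_substitution is a hypothetical helper name), with the per-row multiply/subtract/divide pattern visible in the loop body:

```python
import numpy as np

def back_substitution(R, y):
    # Solve Rx = y for upper triangular R with nonzero diagonal entries,
    # working from the last equation upward.
    n = len(y)
    x = np.zeros(n)
    for k in reversed(range(n)):
        # one multiply and one subtraction per already-computed entry,
        # plus one division for this row
        x[k] = (y[k] - R[k, k + 1:] @ x[k + 1:]) / R[k, k]
    return x

R = np.array([[2.0, 1.0, -1.0],
              [0.0, 3.0, 2.0],
              [0.0, 0.0, 4.0]])
y = np.array([1.0, 7.0, 8.0])
x = back_substitution(R, y)
assert np.allclose(R @ x, y)
assert np.allclose(x, [1.0, 1.0, 2.0])
```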
Solving linear equations using the QR factorization. The formula (11.3) for the inverse of a matrix in terms of its QR factorization suggests a method for solving a square system of linear equations Ax = b with A invertible. The solution

x = A^{-1} b = R^{-1} Q^T b    (11.4)

can be found by first computing the QR factorization A = QR, then computing the matrix-vector product y = Q^T b, and finally solving the triangular equation Rx = y by back substitution.

The first step requires 2n^3 flops (see §5.4), the second step requires 2n^2 flops, and the third step requires n^2 flops. The total number of flops is then

2n^3 + 3n^2 ≈ 2n^3,

so the order is n^3, cubic in the number of variables, which is the same as the number of equations.
In the complexity analysis above, we found that the first step, the QR factor-
ization, dominates the other two; that is, the cost of the other two is negligible
in comparison to the cost of the first step. This has some interesting practical
implications, which we discuss below.
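A sketch of the three-step method with NumPy (np.linalg.solve stands in for a dedicated triangular solver in the last step):

```python
import numpy as np

def solve_via_qr(A, b):
    Q, R = np.linalg.qr(A)        # step 1: QR factorization, ~2n^3 flops
    y = Q.T @ b                   # step 2: matrix-vector product, ~2n^2 flops
    return np.linalg.solve(R, y)  # step 3: back substitution, ~n^2 flops

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))   # random square matrix, invertible in practice
b = rng.standard_normal(5)
x = solve_via_qr(A, b)
assert np.allclose(A @ x, b)
```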
Factor-solve methods with multiple right-hand sides. Now suppose that we must solve several sets of linear equations,

A x_i = b_i,   i = 1, . . . , k,

all with the same coefficient matrix A, but different right-hand sides. Solving the k problems independently, by applying algorithm 11.2 k times, costs 2kn^3 flops. A more efficient method exploits the fact that A is the same matrix in each problem, so we can re-use the matrix factorization in step 1 and only need to repeat steps 2 and 3 to compute x_i = R^{-1} Q^T b_i for i = 1, . . . , k. (This is sometimes referred to as factorization caching, since we save or cache the factorization after carrying it out, for later use.) The cost of this method is 2n^3 + 3kn^2 flops, or approximately 2n^3 flops if k ≪ n. The (surprising) conclusion is that we can solve multiple sets of linear equations, with the same coefficient matrix A, at essentially the same cost as solving one set of linear equations.
Computing the matrix inverse. We can now describe a method to compute the
inverse B = A1 of an (invertible) n n matrix A. We first compute the QR
factorization of A, so A1 = R1 QT . We can write this as RB = QT , which,
written out by columns is
Rbi = qi , i = 1, . . . , n,
where bi is the ith column of B and qi is the ith column of QT . We can solve these
equations using back substitution, to get the columns of the inverse B.
The complexity of this method is 2n^3 flops (for the QR factorization) and n^3 for n back substitutions, each of which costs n^2 flops. So we can compute the matrix inverse in around 3n^3 flops.
This gives an alternative method for solving the square set of linear equations
Ax = b: We first compute the inverse matrix A1 , and then the matrix-vector
product x = (A^{-1})b. This method has a higher flop count than directly solving the equations using algorithm 11.2 (3n^3 versus 2n^3), so algorithm 11.2 is the usual method of choice. While the matrix inverse appears in many formulas (such as the solution of a set of linear equations), it is computed far less often.
Sparse linear equations. Systems of linear equations with sparse coefficient ma-
trix arise in many applications. By exploiting the sparsity of the coefficient matrix,
these linear equations can be solved far more efficiently than by using the generic
algorithm 11.2. One method is to use the same basic algorithm 11.2, replacing
the QR factorization with a variant that handles sparse matrices (see page 156).
The memory usage and computational complexity of these methods depend in a complicated way on the sparsity pattern of the coefficient matrix. When these methods work well, the memory usage is typically a modest multiple of nnz(A) + n, the number of scalars required to specify the problem data A and b, which is typically much smaller than n^2 + n, the number of scalars required to store A and b if they are not sparse. The flop count for solving sparse linear equations is also typically closer in order to nnz(A) than n^3, the order when the matrix A is not sparse.
11.4 Examples
Polynomial interpolation. The 4-vector c gives the coefficients of a cubic polynomial,

p(x) = c_1 + c_2 x + c_3 x^2 + c_4 x^3

(see pages 126 and 101). We seek the coefficients that satisfy p(t_i) = y_i at four given points t_i, which we can write as the square system Ac = y with

    [ 1  −1.1  (−1.1)^2  (−1.1)^3 ]
A = [ 1  −0.4  (−0.4)^2  (−0.4)^3 ]
    [ 1   0.2   (0.2)^2   (0.2)^3 ].
    [ 1   0.8   (0.8)^2   (0.8)^3 ]
(to 4 decimal places). This is illustrated in figure 11.1, which shows the two cu-
bic polynomials that interpolate the two sets of points shown as filled circles and
squares, respectively.
The columns of A1 are interesting: They give the coefficients of a polynomial
that evaluates to 0 at three of the points, and 1 at the other point. For example,
Figure 11.1 Cubic interpolants through two sets of points, shown as circles and squares.
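A sketch of the interpolation computation with NumPy, using the four points above and made-up values to interpolate:

```python
import numpy as np

# Interpolation points from the example, and hypothetical values to match.
points = np.array([-1.1, -0.4, 0.2, 0.8])
values = np.array([1.0, 2.0, 0.5, -1.0])

# Vandermonde matrix: row i is (1, t_i, t_i^2, t_i^3).
A = np.vander(points, 4, increasing=True)
c = np.linalg.solve(A, values)           # coefficients c_1, ..., c_4

p = np.polynomial.polynomial.polyval(points, c)
assert np.allclose(p, values)            # the cubic interpolates the points
```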
Balancing chemical reactions. (See page 126 for background.) We consider the problem of balancing the chemical reaction

a_1 Cr_2O_7^{2−} + a_2 Fe^{2+} + a_3 H^+  →  b_1 Cr^{3+} + b_2 Fe^{3+} + b_3 H_2O,

where the superscript gives the charge of each reactant and product. There are 4 atoms (Cr, O, Fe, H) and charge to balance. The reactant and product matrices are (using the order just listed)

    [  2  0  0 ]        [ 1  0  0 ]
    [  7  0  0 ]        [ 0  0  1 ]
R = [  0  1  0 ],   P = [ 0  1  0 ].
    [  0  0  1 ]        [ 0  0  2 ]
    [ −2  2  1 ]        [ 3  3  0 ]
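A sketch of the balancing computation with NumPy, appending the normalization a_1 = 1 to fix the scale of the coefficients:

```python
import numpy as np

# Reactant and product matrices from the text (rows: Cr, O, Fe, H, charge).
R = np.array([[2, 0, 0],
              [7, 0, 0],
              [0, 1, 0],
              [0, 0, 1],
              [-2, 2, 1]], dtype=float)
P = np.array([[1, 0, 0],
              [0, 0, 1],
              [0, 1, 0],
              [0, 0, 2],
              [3, 3, 0]], dtype=float)

# Balance R a = P b; add the normalization a_1 = 1 to fix the scale.
M = np.block([[R, -P], [np.eye(1, 6)]])
rhs = np.concatenate([np.zeros(5), [1.0]])
coeffs = np.linalg.solve(M, rhs)
assert np.allclose(coeffs, [1, 6, 14, 2, 6, 7])
```

This recovers the familiar balanced reaction Cr_2O_7^{2−} + 6 Fe^{2+} + 14 H^+ → 2 Cr^{3+} + 6 Fe^{3+} + 7 H_2O.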
Figure 11.2 Lagrange polynomials associated with the points −1.1, −0.4, 0.2, 0.8.
11.5 Pseudo-inverse
Linearly independent columns and Gram invertibility. We first show that an m × n matrix A has linearly independent columns if and only if its n × n Gram matrix A^T A is invertible.

First suppose that the columns of A are linearly independent. Let x be an n-vector which satisfies (A^T A)x = 0. Multiplying on the left by x^T we get

0 = x^T 0 = x^T (A^T Ax) = x^T A^T Ax = ‖Ax‖^2,

which implies that Ax = 0. Since the columns of A are linearly independent, we conclude that x = 0. Since the only solution of (A^T A)x = 0 is x = 0, we conclude that A^T A is invertible.
Now let's show the converse. Suppose the columns of A are linearly dependent, which means there is a nonzero n-vector x which satisfies Ax = 0. Multiplying on the left by A^T we get (A^T A)x = 0. This shows that the Gram matrix A^T A is singular.
This particular left inverse of A will come up in the sequel, and has a name, the pseudo-inverse of A (also called the Moore-Penrose inverse). It is denoted A† (or A^+):

A† = (A^T A)^{-1} A^T.    (11.5)

When A is square, the pseudo-inverse A† reduces to the ordinary inverse:

A† = (A^T A)^{-1} A^T = A^{-1} A^{-T} A^T = A^{-1} I = A^{-1}.

Note that this equation does not make sense (and certainly is not correct) when A is not square.
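A quick NumPy check of (11.5) on a small matrix with independent columns (np.linalg.pinv computes the same pseudo-inverse; forming (A^T A)^{-1} explicitly is shown for illustration only):

```python
import numpy as np

A = np.array([[-3.0, -4.0], [4.0, 6.0], [1.0, 1.0]])  # independent columns

A_dag = np.linalg.inv(A.T @ A) @ A.T       # formula (11.5)
assert np.allclose(A_dag @ A, np.eye(2))   # a left inverse of A
assert np.allclose(A_dag, np.linalg.pinv(A))  # matches the library routine
```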
This matrix has linearly independent columns, and QR factorization with (to 4 digits)

    [ −0.5883   0.4576 ]
Q = [  0.7845   0.5230 ],    R = [ 5.0990  7.2563 ]
    [  0.1961  −0.7191 ]         [ 0       0.5883 ].
Chapter 12

Least squares
The least squares problem is

minimize ‖Ax − b‖^2,    (12.1)

where we should specify that the variable is x (meaning that we should choose x). The matrix A and the vector b are called the data for the problem (12.1), which means that they are given to us when we are asked to choose x. The quantity to be minimized, ‖Ax − b‖^2, is called the objective function (or just objective) of the least squares problem (12.1).
The problem (12.1) is sometimes called linear least squares to emphasize that
the residual r (whose norm squared we are to minimize) is an affine function of x,
and to distinguish it from the nonlinear least squares problem, in which we allow
the residual r to be an arbitrary function of x. We will study the nonlinear least
squares problem in chapter 18.
Any vector x̂ that satisfies ‖Ax̂ − b‖^2 ≤ ‖Ax − b‖^2 for all x is a solution of the least squares problem (12.1). Such a vector is called a least squares approximate solution of Ax = b. It is very important to understand that a least squares approximate solution x̂ of Ax = b need not satisfy the equations Ax̂ = b; it simply makes the norm of the residual as small as it can be. Some authors use the confusing phrase 'x̂ solves Ax = b in the least squares sense', but we emphasize that a least squares approximate solution x̂ does not, in general, solve the equation Ax = b.

If ‖Ax̂ − b‖ (which we call the optimal residual norm) is small, then we can say that x̂ approximately solves Ax = b. On the other hand, if there is an n-vector x that satisfies Ax = b, then it is a solution of the least squares problem, since its associated residual norm is zero.
Another name for the least squares problem (12.1), typically used in data fitting applications (the topic of the next chapter), is regression. We say that x̂, a solution of the least squares problem, is the result of regressing the vector b onto the columns of A.
Column interpretation. If the columns of A are the m-vectors a_1, . . . , a_n, then

Ax = x_1 a_1 + ··· + x_n a_n,

so the least squares problem asks for the linear combination of the columns of A that is closest to b.
Row interpretation. Suppose the rows of A are the n-vectors a_1^T, . . . , a_m^T, so the residual components are given by

r_i = a_i^T x − b_i,   i = 1, . . . , m.

The least squares objective is then

‖Ax − b‖^2 = r_1^2 + ··· + r_m^2,

the sum of the squares of the residuals in m scalar linear equations. Minimizing this sum of squares of the residuals is a reasonable compromise if our goal is to choose x so that all of them are small.
As an example, the set of three equations in two variables

2x_1 = 1,   −x_1 + x_2 = 0,   2x_2 = −1,

has no solution. (From the first equation we have x_1 = 1/2, and from the last equation we have x_2 = −1/2; but then the second equation does not hold.) The corresponding least squares problem is

minimize (2x_1 − 1)^2 + (−x_1 + x_2)^2 + (2x_2 + 1)^2.

This least squares problem can be solved using the methods described in the next section (or simple calculus). Its unique solution is x̂ = (1/3, −1/3). The least squares approximate solution x̂ does not satisfy the equations Ax̂ = b; the corresponding residuals are

r̂ = Ax̂ − b = (−1/3, −2/3, 1/3),

with sum of squares value ‖Ax̂ − b‖^2 = 2/3. Let us compare this to another choice of x, x̃ = (1/2, −1/2), which corresponds to (exactly) solving the first and last of the three equations in Ax = b. It gives residual

r̃ = Ax̃ − b = (0, −1, 0),
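This example is easy to reproduce with NumPy's least squares routine:

```python
import numpy as np

# The three equations 2x1 = 1, -x1 + x2 = 0, 2x2 = -1 in matrix form.
A = np.array([[2.0, 0.0], [-1.0, 1.0], [0.0, 2.0]])
b = np.array([1.0, 0.0, -1.0])

x_hat, *_ = np.linalg.lstsq(A, b, rcond=None)
assert np.allclose(x_hat, [1/3, -1/3])

r_hat = A @ x_hat - b
assert np.isclose(r_hat @ r_hat, 2/3)   # optimal sum of squares
```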
12.2 Solution
In this section we derive several expressions for the solution of the least squares problem (12.1), under one assumption on the data matrix A:

the columns of A are linearly independent.    (12.2)

Figure 12.1 Level curves of the function ‖Ax − b‖^2 = (2x_1 − 1)^2 + (−x_1 + x_2)^2 + (2x_2 + 1)^2.
Solution via calculus. In this section we find the solution of the least squares problem using some basic results from calculus, reviewed in §C.2. (We will also give an independent verification of the result, that does not rely on calculus, below.) We know that any minimizer x̂ of the function f(x) = ‖Ax − b‖^2 must satisfy

∂f/∂x_i (x̂) = 0,   i = 1, . . . , n,

which we can express as the vector equation

∇f(x̂) = 0,

where ∇f(x̂) is the gradient of f evaluated at x̂. This gradient can be expressed in matrix form as

∇f(x) = 2A^T (Ax − b).    (12.3)
This formula can be derived from the chain rule given on page 150, and the gradient of the sum of squares function, given in §C.1. For completeness, we will derive the formula (12.3) from scratch here. Writing the least squares objective out as a sum, we get

f(x) = ‖Ax − b‖^2 = Σ_{i=1}^m ( Σ_{j=1}^n A_{ij} x_j − b_i )^2.

To find ∇f(x)_k, we take the partial derivative of f with respect to x_k:

∇f(x)_k = Σ_{i=1}^m 2 ( Σ_{j=1}^n A_{ij} x_j − b_i ) (A_{ik})
        = Σ_{i=1}^m 2 (A^T)_{ki} (Ax − b)_i
        = ( 2A^T (Ax − b) )_k.
Setting the gradient to zero, we find that any minimizer x̂ must satisfy

(A^T A) x̂ = A^T b.    (12.4)

These equations are called the normal equations. The coefficient matrix A^T A is the Gram matrix associated with the columns of A; its entries are inner products of columns of A.

Our assumption (12.2) that the columns of A are linearly independent implies that the Gram matrix A^T A is invertible (§11.5, page 172). This implies that

x̂ = (A^T A)^{-1} A^T b    (12.5)

is the only solution of the normal equations (12.4). So this must be the unique solution of the least squares problem (12.1).

We have already encountered the matrix (A^T A)^{-1} A^T that appears in (12.5): It is the pseudo-inverse of the matrix A, given in (11.5). So we can write the solution of the least squares problem in the simple form

x̂ = A† b.    (12.6)
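A numerical sketch comparing three equivalent expressions for the solution, on made-up data (solving the normal equations directly is shown for illustration; it can be less accurate than QR-based methods):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((20, 4))   # tall; columns independent in practice
b = rng.standard_normal(20)

x_normal = np.linalg.solve(A.T @ A, A.T @ b)      # normal equations (12.4)
x_pinv = np.linalg.pinv(A) @ b                    # pseudo-inverse (12.6)
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)   # library least squares

assert np.allclose(x_normal, x_pinv)
assert np.allclose(x_normal, x_lstsq)
```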
where we use (A^T A)x̂ = A^T b in the third line. With this simplification, (12.7) reduces to

‖Ax − b‖^2 = ‖A(x − x̂)‖^2 + ‖Ax̂ − b‖^2.

This shows that x̂ minimizes ‖Ax − b‖^2; we now show that it is the unique minimizer. Suppose equality holds above, that is, ‖Ax − b‖^2 = ‖Ax̂ − b‖^2. Then we have ‖A(x − x̂)‖^2 = 0, which implies A(x − x̂) = 0. Since A has linearly independent columns, we conclude that x − x̂ = 0, i.e., x = x̂. So the only x with ‖Ax − b‖^2 = ‖Ax̂ − b‖^2 is x = x̂; for all x ≠ x̂, we have ‖Ax − b‖^2 > ‖Ax̂ − b‖^2.
Row form. The formula for the least squares approximate solution can be expressed in a useful form in terms of the rows a_i^T of the matrix A:

x̂ = (A^T A)^{-1} A^T b = ( Σ_{i=1}^m a_i a_i^T )^{-1} ( Σ_{i=1}^m b_i a_i ).    (12.8)
x̂ = R^{-1} Q^T b.    (12.9)

2. Compute Q^T b.
Backslash notation. The very close relation between solving a square set of linear equations, and finding the least squares approximate solution of an over-determined set of equations, has inspired a common notation for both that is used in some software packages for manipulating matrices: A\b is taken to mean the solution A^{-1}b when A is square (and invertible), and the least squares approximate solution A†b when A is tall (with linearly independent columns). This backslash notation is not standard mathematical notation, so we will not use it in this book.
Complexity. The complexity of the first step of algorithm 12.1 is 2mn^2 flops. The second step involves a matrix-vector multiplication, which takes 2mn flops. The third step requires n^2 flops. The total number of flops is

2mn^2 + 2mn + n^2 ≈ 2mn^2,

neglecting the second and third terms, which are smaller than the first by factors of n and 2m, respectively. The order of the algorithm is mn^2. The complexity is linear in the row dimension of A and quadratic in the number of variables.
Algorithm 12.1 is another example of a factor-solve algorithm. Suppose we need
to solve several least squares problems
minimize ‖Ax_k − b_k‖^2,
Sparse least squares. Least squares problems with sparse A arise in several appli-
cations and can be solved more efficiently, for example by using a QR factorization
tailored for sparse matrices (see page 156) in the generic algorithm 12.1.
Another simple approach for exploiting sparsity of A is to solve the normal equations A^T Ax̂ = A^T b by solving a larger (but sparse) system of equations,

[ 0   A^T ] [ x ]   [ 0 ]
[ A   −I  ] [ y ] = [ b ].    (12.10)

This is a square set of m + n linear equations. Its coefficient matrix is sparse when A is sparse. If (x̂, ŷ) satisfies these equations, it is easy to see that x̂ satisfies the normal equations; conversely, if x̂ satisfies the normal equations, then (x̂, ŷ) satisfies (12.10) with ŷ = Ax̂ − b. Any method for solving a sparse system of linear equations can be used to solve (12.10).
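A sketch of this approach with SciPy's sparse machinery, on a made-up sparse matrix, checked against a dense least squares solve:

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

rng = np.random.default_rng(2)
m, n = 50, 8
A_dense = rng.standard_normal((m, n)) * (rng.random((m, n)) < 0.3)
A = sp.csr_matrix(A_dense)   # sparse data matrix
b = rng.standard_normal(m)

# Augmented system (12.10): [[0, A^T], [A, -I]] [x; y] = [0; b].
K = sp.bmat([[None, A.T], [A, -sp.eye(m)]], format="csc")
rhs = np.concatenate([np.zeros(n), b])
xy = spla.spsolve(K, rhs)
x_hat = xy[:n]

x_ref, *_ = np.linalg.lstsq(A_dense, b, rcond=None)
assert np.allclose(x_hat, x_ref)
```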
12.4 Examples
Advertising purchases. We have m demographic groups or audiences that we want to advertise to, with a target number of impressions or views for each group, which we give as a vector v^des. (The entries are positive.) To reach these audiences, we purchase advertising in n different channels (say, different web publishers, radio, print, . . . ), in amounts that we give as an n-vector s. (The entries of s should be nonnegative, a requirement we ignore here.) The m × n matrix R gives the number of impressions in each demographic group per dollar spending in the channels: R_ij is the number of impressions in group i per dollar spent on advertising in channel j. (These entries are estimated, and are all nonnegative.) The jth column of R gives the effectiveness or reach (in impressions per dollar) for channel j. The ith row of R
[Figure 12.2: impressions per dollar in each of the 10 groups, for channels 1, 2, and 3.]
shows which media demographic group i is exposed to. The total number of impressions in each demographic group is the m-vector v, which is given by v = Rs. The goal is to find s so that v = Rs ≈ v^des. We can do this using least squares, by choosing s to minimize ‖Rs − v^des‖^2. (We are not guaranteed that the resulting channel spend vector will be nonnegative.) This least squares formulation does not take into account the total cost of the advertising; we will see in chapter 16 how this can be done.
We consider a simple numerical example, with n = 3 channels and m = 10 demographic groups, and matrix

    [ 0.97  1.86  0.41 ]
    [ 1.23  2.18  0.53 ]
    [ 0.80  1.24  0.62 ]
    [ 1.29  0.98  0.51 ]
R = [ 1.10  1.23  0.69 ],
    [ 0.67  0.34  0.54 ]
    [ 0.87  0.26  0.62 ]
    [ 1.10  0.16  0.48 ]
    [ 1.92  0.22  0.71 ]
    [ 1.29  0.12  0.62 ]

with units of 1000 views per dollar. The entries of this matrix vary over an 18:1 range, so the 3 channels are quite different in terms of their audience reach; see figure 12.2.
We take v^des = 10^3 · 1, i.e., our goal is to reach one million customers in each of the 10 demographic groups. Least squares gives an advertising budget allocation s that achieves a views vector with RMS error 132, or 13.2% of the target values. The views vector is shown in figure 12.3.

Figure 12.3  Views vector that best approximates the target of one million impressions in each group.
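The numbers in this example are easy to reproduce. The sketch below solves the least squares problem with the matrix R given above (in 1000 views per dollar) and target v^des = 10^3 · 1; the printed values should be treated as illustrative.

```python
import numpy as np

R = np.array([
    [0.97, 1.86, 0.41],
    [1.23, 2.18, 0.53],
    [0.80, 1.24, 0.62],
    [1.29, 0.98, 0.51],
    [1.10, 1.23, 0.69],
    [0.67, 0.34, 0.54],
    [0.87, 0.26, 0.62],
    [1.10, 0.16, 0.48],
    [1.92, 0.22, 0.71],
    [1.29, 0.12, 0.62],
])  # 1000 views per dollar
v_des = 1000 * np.ones(10)  # one million views per group

# Least squares choice of channel spending s
s, *_ = np.linalg.lstsq(R, v_des, rcond=None)
v = R @ s
rms_error = np.sqrt(np.mean((v - v_des) ** 2))
print("spend:", np.round(s, 1))
print("RMS error:", round(rms_error, 1))
```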
Figure 12.4  A square area divided into a 25 × 25 grid. The circles show the positions of 10 lamps; the number in parentheses next to each circle is the height of the lamp. The left-hand plot shows the illumination pattern with all lamps set to power one. The right-hand plot shows the illumination pattern for the lamp powers that minimize the sum square deviation from a desired uniform illumination of one.
Illumination. A set of n lamps illuminates an area divided into m regions or pixels. The illumination levels form the m-vector Ap, where p is the n-vector of lamp powers and the entries of the m × n matrix A give each lamp's contribution to each pixel. To obtain illumination close to the uniform target 1, we choose p to minimize ‖Ap − 1‖². (We are not guaranteed that these powers are nonnegative, or less than the maximum allowed power level.)
An example is shown in figure 12.4. The area is a 25 × 25 grid with m = 625 pixels, each (say) 1m square. The lamps are at various heights ranging from 3m to 6m, and at the positions shown in the figure. The illumination decays with an inverse square law, so A_ij is proportional to d_ij^−2, where d_ij is the (3-D) distance between the center of pixel i and the position of lamp j. The matrix A is scaled so that when all lamps have power one, the average illumination level is one. The desired illumination pattern is 1, i.e., uniform with value 1.
With p = 1 (all lamp powers equal to one), the resulting illumination pattern is shown on the left of figure 12.4. The RMS illumination error is 0.24. We can see that the corners are quite a bit darker than the center, and there are pronounced bright spots directly beneath each lamp. Using least squares we find the lamp powers

    p = (1.46, 0.79, 2.97, 0.74, 0.08, 0.21, 0.21, 2.05, 0.91, 1.47).

The resulting illumination pattern has an RMS error of 0.14, about half the RMS error with all lamp powers set to one. The illumination pattern is shown on the right of figure 12.4; we can see that the illumination is more uniform than when all lamps have power 1. Most illumination values are near the target level 1, with the corners a bit darker and the illumination a bit brighter directly below each lamp, but less so than when all lamps have power one. This is clear from figure 12.5, which shows the histograms of pixel illumination values for all lamp powers one, and for lamp powers p.
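A sketch of the illumination problem follows. The actual lamp coordinates are not tabulated in the text, so the positions below are hypothetical (random, with heights between 3 and 6 meters); the qualitative conclusion, that the least squares powers beat all-ones powers, still holds.

```python
import numpy as np

np.random.seed(1)
side, n_lamps = 25, 10
m = side * side  # 625 pixels

# Hypothetical lamp positions (the figure's coordinates are not given)
lamp_xy = np.random.uniform(0, side, size=(n_lamps, 2))
lamp_h = np.random.uniform(3, 6, size=n_lamps)

# Pixel centers on the 25 x 25 grid
xs, ys = np.meshgrid(np.arange(side) + 0.5, np.arange(side) + 0.5)
pix = np.column_stack([xs.ravel(), ys.ravel()])

# Inverse square law: A_ij proportional to d_ij^{-2}
d2 = ((pix[:, None, :] - lamp_xy[None, :, :]) ** 2).sum(axis=2) + lamp_h**2
A = 1.0 / d2
A /= (A @ np.ones(n_lamps)).mean()  # scale so all-ones powers give average level 1

rms = lambda v: np.sqrt(np.mean(v**2))
err_ones = rms(A @ np.ones(n_lamps) - 1)
p, *_ = np.linalg.lstsq(A, np.ones(m), rcond=None)
err_ls = rms(A @ p - 1)
print(round(err_ones, 3), round(err_ls, 3))  # least squares error is smaller
```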
[Figure 12.5: Histograms of pixel illumination values (intensity versus number of pixels), for all lamp powers equal to one and for the least squares lamp powers p.]
13  Least squares data fitting

In this chapter we introduce one of the most important applications of least squares methods: the problem of data fitting. The goal is to find a mathematical model, or an approximate model, of some relation, given some observed data.
13.1  Least squares data fitting

We suppose that x is an n-vector of features (the independent variable) and y is a scalar outcome (the dependent variable), related by y ≈ f(x) for some function f.

Data. We don't know f, but we might have some idea about its general form. We do have some data, given by

    x_1, . . . , x_N,    y_1, . . . , y_N,

where the n-vector x_i is the feature vector and the scalar y_i is the associated value of the outcome for data sample i. These data are also called observations, examples, or measurements, depending on the context.

Model. We choose a model f̂, a function meant to approximate f, so that

    ŷ = f̂(x) ≈ y.

We consider models that are linear in their parameters,

    f̂(x) = θ_1 f_1(x) + · · · + θ_p f_p(x),

where f_1, . . . , f_p are chosen basis functions and θ_1, . . . , θ_p are the model parameters.
190 13 Least squares data fitting
Prediction error. Our goal is to choose the model f̂ so that it is consistent with the data, i.e., we have y_i ≈ f̂(x_i), for i = 1, . . . , N. (There is another goal in choosing f̂, which we will discuss in §13.2.) For data sample i, our model predicts the value ŷ_i = f̂(x_i), so the prediction error or residual for this data point is

    r_i = y_i − ŷ_i.
Least squares model fitting. A very common method for choosing the model parameters is to minimize the RMS prediction error, which is the same as minimizing the sum of squares of the prediction errors, ‖r‖². We now show that this is a least squares problem. Expressing f̂(x_i) in terms of the model parameters, we have

    ŷ_i = A_i1 θ_1 + · · · + A_ip θ_p,    i = 1, . . . , N,

where the N × p matrix A has entries

    A_ij = f_j(x_i),    i = 1, . . . , N,    j = 1, . . . , p.

In matrix-vector form, ŷ = Aθ, where θ = (θ_1, . . . , θ_p), and

    ‖r‖² = ‖Aθ − y‖².

Choosing θ to minimize this is evidently a least squares problem, of the same form as (12.1). Provided the columns of A are linearly independent, we can solve this least squares problem to find θ̂, the model parameter values that minimize the norm of the prediction error on our data set, as

    θ̂ = (A^T A)^{−1} A^T y.        (13.1)

We say that the model parameter values θ̂ are obtained by least squares fitting on the data set.
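A minimal sketch of least squares model fitting from a user-chosen list of basis functions (the helper name ls_fit and the tiny data set are illustrative, not from the text):

```python
import numpy as np

def ls_fit(basis, xs, ys):
    """Least squares fit of f(x) = theta_1 f_1(x) + ... + theta_p f_p(x).

    basis: list of functions f_j; xs: data inputs; ys: outcomes.
    Returns theta-hat, computed via lstsq for numerical stability.
    """
    A = np.array([[f(x) for f in basis] for x in xs])  # A_ij = f_j(x_i)
    theta, *_ = np.linalg.lstsq(A, np.array(ys), rcond=None)
    return theta

# Example: data generated by y = 2 + 3x is fit exactly with basis {1, x}
basis = [lambda x: 1.0, lambda x: x]
xs = [0.0, 1.0, 2.0, 3.0]
ys = [2.0, 5.0, 8.0, 11.0]
theta = ls_fit(basis, xs, ys)
print(theta)  # approximately [2, 3]
```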
We can interpret each term in ‖Aθ − y‖². The term ŷ = Aθ is the N-vector of outcomes predicted by our model, with the parameter vector θ. The term y is the N-vector of actual observed or measured outcomes. The difference Aθ − y is the N-vector of prediction errors. Finally, ‖Aθ − y‖² is the sum of squares of the prediction errors. This is minimized by the least squares fit θ = θ̂.

The number ‖Aθ̂ − y‖² is called the minimum sum square error (for the given model basis and data set). The number ‖Aθ̂ − y‖²/N is called the minimum mean square error (MMSE) (of our model, on the data set). Its square root is the minimum RMS fitting error. The model performance on the data set can be visualized by plotting ŷ_i versus y_i on a scatter plot, with a dashed line showing ŷ = y for reference.
Least squares fit with a constant. We start with the simplest possible fit: We take p = 1, with f_1(x) = 1 for all x. In this case the model f̂ is a constant function, with f̂(x) = θ_1 for all x. Least squares fitting in this case is the same as choosing the best constant value θ_1 to approximate the data y_1, . . . , y_N.

In this simple case, A is the N × 1 matrix 1, which always has linearly independent columns (since it has one column, which is nonzero). The formula (13.1) is then

    θ̂_1 = (A^T A)^{−1} A^T y = (1/N) 1^T y = avg(y),

where we use 1^T 1 = N. So the best constant fit to the data is simply its mean,

    f̂(x) = avg(y).

The RMS fit to the data (i.e., the RMS value of the optimal residual) is

    rms(avg(y)1 − y) = std(y),

the standard deviation of the data. This gives a nice interpretation of the average value and the standard deviation of the outcomes. It is common to compare the RMS fitting error for a more sophisticated model with the standard deviation of the outcomes, which is the optimal RMS fitting error for a constant model.
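These two identities, best constant equals the mean and optimal RMS residual equals the standard deviation, are easy to check numerically:

```python
import numpy as np

np.random.seed(2)
y = np.random.randn(20) * 3 + 5  # arbitrary data

# Constant model: A = 1 (an N x 1 matrix of ones)
A = np.ones((len(y), 1))
theta, *_ = np.linalg.lstsq(A, y, rcond=None)

rms = lambda v: np.sqrt(np.mean(v**2))
print(theta[0], np.mean(y))             # best constant = mean
print(rms(np.mean(y) - y), np.std(y))   # optimal RMS residual = std
```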
Figure 13.1  The constant fit f̂(x) = avg(y) to N = 20 data points (left), and a scatter plot of ŷ versus y (right).
Linear independence of the columns of A fails exactly when some nonzero combination of the basis function values vanishes on the data; taking the last coefficient to be nonzero, this says that the values of the pth basis function can be expressed as a linear combination of the values of the first p − 1 basis functions on the given data set. Evidently, then, the pth basis function is redundant (on the given data set).
Straight-line fit. We take basis functions f_1(x) = 1 and f_2(x) = x. Our model has the form

    f̂(x) = θ_1 + θ_2 x,

which is a straight line when plotted. (This is perhaps why f̂ is sometimes called a linear model, even though it is in general an affine, and not linear, function of x.) Figure 13.2 shows an example. The matrix A is given by
Figure 13.2  Least squares fit of a straight line to 50 points (x_i, y_i) in a plane.

    A = [ 1  x_1
          1  x_2
          ⋮   ⋮
          1  x_N ] = [ 1  x ],
where x is the N-vector of values (x_1, . . . , x_N). Provided that there are at least two different values appearing in x_1, . . . , x_N, this matrix has linearly independent columns. The parameters in the optimal straight-line fit to the data are given by

    (θ̂_1, θ̂_2) = (A^T A)^{−1} A^T y.
This expression is simple enough for us to work out explicitly, although there is no computational advantage to doing so. The Gram matrix is

    A^T A = [ N      1^T x
              1^T x  x^T x ].

The 2-vector A^T y is

    A^T y = [ 1^T y
              x^T y ],

so we have (using the formula for the inverse of a 2 × 2 matrix)

    (θ̂_1, θ̂_2) = (1 / (N x^T x − (1^T x)²)) [  x^T x  −1^T x ] [ 1^T y ]
                                              [ −1^T x    N    ] [ x^T y ].

Multiplying the scalar term by N², and dividing the matrix and vector terms by N, we can express this as

    (θ̂_1, θ̂_2) = (1 / (rms(x)² − avg(x)²)) [ rms(x)²  −avg(x) ] [ avg(y)  ]
                                             [ −avg(x)     1    ] [ x^T y/N ].
The optimal slope θ̂_2 of the straight-line fit can be expressed more simply in terms of the correlation coefficient ρ between the data vectors x and y, and their standard deviations. We have

    θ̂_2 = (N x^T y − (1^T x)(1^T y)) / (N x^T x − (1^T x)²)
        = (x − avg(x)1)^T (y − avg(y)1) / ‖x − avg(x)1‖²
        = ρ std(y) / std(x).

In the last step we used the definitions

    ρ = (x − avg(x)1)^T (y − avg(y)1) / (N std(x) std(y)),    std(x) = ‖x − avg(x)1‖ / √N

from chapter 3. From the first of the two normal equations, N θ̂_1 + (1^T x) θ̂_2 = 1^T y, we also obtain a simple expression for θ̂_1:

    θ̂_1 = avg(y) − θ̂_2 avg(x).

Putting these results together, we can write the least squares fit as

    f̂(u) = avg(y) + ρ (std(y) / std(x)) (u − avg(x)).
This can be expressed in the more symmetric form

    (f̂(u) − avg(y)) / std(y) = ρ (u − avg(x)) / std(x),

which has a nice interpretation. The left-hand side is the difference between the predicted response value and the mean response value, divided by its standard deviation. The right-hand side is the correlation coefficient ρ times the same quantity, computed for the independent variable.

The least squares straight-line fit is used in many application areas.
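The closed-form expressions above are easy to verify against a generic least squares solve on synthetic data:

```python
import numpy as np

np.random.seed(3)
x = np.random.randn(50)
y = 1.5 * x + 0.7 + 0.3 * np.random.randn(50)

# Generic least squares fit of theta_1 + theta_2 * x
A = np.column_stack([np.ones_like(x), x])
theta, *_ = np.linalg.lstsq(A, y, rcond=None)

# Closed form: slope = rho * std(y) / std(x), intercept from the means
rho = np.corrcoef(x, y)[0, 1]
slope = rho * np.std(y) / np.std(x)
intercept = np.mean(y) - slope * np.mean(x)
print(np.allclose(theta, [intercept, slope]))  # True
```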
Asset α and β in finance. In finance, the straight-line fit is used to compare the returns of an individual asset, y_i, to the returns of the whole market, x_i. (The return of the whole market is typically taken to be a sum of the individual asset returns, weighted by their capitalizations.) The straight-line model for predicting the asset return y from the market return x is typically written in the form

    ŷ = (r^rf + α) + β (x − avg(x)),

where r^rf is the risk-free interest rate over the period. (Comparing this to our straight-line model, we find that θ̂_2 = β, and θ̂_1 = r^rf + α − β avg(x).) The parameter β tells us how much of the return of the particular asset is explained, or predicted, by the market return, and the parameter α tells us what the average return is, over and above the risk-free interest rate. This model is so common that the terms 'Alpha' and 'Beta' are widely used in finance. (Though not always with exactly the same meaning, since there are a few variations on how the parameters are defined.)
Figure 13.3 World petroleum consumption between 1980 and 2013 (dots)
and least squares straight-line fit (data from [Link]).
Time series trend. Suppose the data represent a time series, with y_i a sample of some quantity at time (epoch) x_i = i. The straight-line fit to the time series data,

    ŷ_i = θ̂_1 + θ̂_2 i,    i = 1, . . . , N,

is called the trend line. Its slope θ̂_2 is interpreted as the trend in the quantity over time. Subtracting the trend line from the original time series we get the de-trended time series, y − ŷ. The de-trended time series shows how the time series compares with its straight-line fit: When it is positive, the time series is above its straight-line fit, and when it is negative, it is below the straight-line fit.
An example is shown in figures 13.3 and 13.4. Figure 13.3 shows world petroleum
consumption versus year, along with the straight-line fit. Figure 13.4 shows the
de-trended world petroleum consumption.
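A de-trending sketch on synthetic data (the petroleum figures themselves are not tabulated in the text, so the series below is made up):

```python
import numpy as np

np.random.seed(4)
N = 34  # e.g. annual samples, 1980-2013
t = np.arange(1, N + 1)
y = 55 + 0.9 * t + 2.0 * np.random.randn(N)  # synthetic trending series

# Fit the trend line y_hat_i = theta_1 + theta_2 * i
A = np.column_stack([np.ones(N), t])
theta, *_ = np.linalg.lstsq(A, y, rcond=None)
y_hat = A @ theta
detrended = y - y_hat  # residual; orthogonal to the columns of A

print("trend (slope):", round(theta[1], 2))
print("mean of de-trended series:", round(detrended.mean(), 10))
```

Because the residual is orthogonal to the constant column of A, the de-trended series always has zero mean.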
In many applications, the de-trended time series has a clear periodic component, i.e., a component that repeats itself periodically. As an example, figure 13.5 shows an estimate of the road traffic (total number of miles traveled in vehicles) in the US, for each month between January 2000 and December 2014. The most striking aspect of the time series is the pattern that is repeated every year, with a peak in the summer and a minimum in the winter. In addition there is a slowly increasing long term trend. The bottom figure shows the least squares fit of a sum of three components,

    ŷ = ŷ^const + ŷ^lin + ŷ^seas,

where ŷ^const = θ_1 1 and ŷ^lin = θ_2 (1, 2, . . . , N) are the constant and linear (trend) components, and the third component is periodic with period P = 12. This periodic, or seasonal, component can be expressed as

    ŷ^seas = (θ_{3:2+P}, θ_{3:2+P}, . . . , θ_{3:2+P}),
Figure 13.5  Top: Vehicle miles traveled in the US, per month, in the period January 2000 to December 2014 (U.S. Department of Transportation, Bureau of Transportation Statistics, [Link]). Bottom: Least squares fit of a sum of three time series: a constant, a linear trend, and a seasonal component with a 12-month period.
which consists of the pattern (θ_3, . . . , θ_{2+P}), repeated N/P times. The least squares fit is computed by minimizing ‖Aθ − y‖², where θ is a (P + 2)-vector and

    A = [ 1   1       1  0  · · ·  0
          1   2       0  1  · · ·  0
          ⋮   ⋮       ⋮  ⋮          ⋮
          1   P       0  0  · · ·  1
          1   P+1     1  0  · · ·  0
          1   P+2     0  1  · · ·  0
          ⋮   ⋮       ⋮  ⋮          ⋮
          1   2P      0  0  · · ·  1
          ⋮   ⋮       ⋮  ⋮          ⋮
          1   N−P+1   1  0  · · ·  0
          1   N−P+2   0  1  · · ·  0
          ⋮   ⋮       ⋮  ⋮          ⋮
          1   N       0  0  · · ·  1 ].

In this example, N = 15P = 180. The residual or prediction error in this case is called the de-trended, seasonally adjusted series.
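The matrix A above can be assembled directly; a sketch on synthetic monthly data follows. Note that the constant column equals the sum of the P seasonal indicator columns, so A has linearly dependent columns; np.linalg.lstsq still returns a (minimum-norm) least squares solution, and the fitted values ŷ are unique.

```python
import numpy as np

np.random.seed(5)
P = 12            # period (months)
N = 15 * P        # 180 samples
t = np.arange(1, N + 1)

# Columns: constant, linear trend, and P seasonal indicator columns
seasonal = np.tile(np.eye(P), (N // P, 1))
A = np.column_stack([np.ones(N), t, seasonal])  # N x (P + 2)

# Synthetic series: small trend plus a yearly pattern plus noise
pattern = np.sin(2 * np.pi * np.arange(P) / P)
y = 2.0 + 0.01 * t + np.tile(pattern, N // P) + 0.1 * np.random.randn(N)

theta, *_ = np.linalg.lstsq(A, y, rcond=None)
y_hat = A @ theta
rms = np.sqrt(np.mean((y - y_hat) ** 2))
print("RMS fitting error:", round(rms, 3))
```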
Polynomial fit. A simple extension beyond the straight-line fit is a polynomial fit, with basis functions

    f_i(x) = x^{i−1},    i = 1, . . . , p,

so that

    f̂(x) = θ_1 + θ_2 x + · · · + θ_p x^{p−1}.

In this case the matrix A is

    A = [ 1  x_1  · · ·  x_1^{p−1}
          1  x_2  · · ·  x_2^{p−1}
          ⋮   ⋮            ⋮
          1  x_N  · · ·  x_N^{p−1} ],
i.e., it is a Vandermonde matrix (see (6.6)). Its columns are linearly independent provided the numbers x_1, . . . , x_N include at least p different values. Figure 13.6 shows an example of the least squares fit of polynomials of degree 2, 6, 10, and 15 to a set of 100 data points. Since any polynomial of degree less than r is also a polynomial of degree less than s, for r ≤ s, it follows that the RMS fit attained by a polynomial with a larger degree is smaller (or at least, no larger) than that obtained by a fit with a smaller degree polynomial. This suggests that we should use the largest degree polynomial that we can, since this results in the smallest residual and the best RMS fit. But we will see in §13.2 that this is not true, and explore rational methods for choosing a model from among several candidates.
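The monotone decrease of the training fit with degree can be checked directly (np.vander builds the Vandermonde matrix; the data here are synthetic):

```python
import numpy as np

np.random.seed(6)
x = np.random.uniform(-1, 1, 100)
y = np.sin(3 * x) + 0.2 * np.random.randn(100)

def poly_rms(deg):
    # A has columns 1, x, ..., x^deg (a Vandermonde matrix)
    A = np.vander(x, deg + 1, increasing=True)
    theta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return np.sqrt(np.mean((A @ theta - y) ** 2))

errs = [poly_rms(d) for d in range(0, 16)]
print(np.round(errs, 3))  # non-increasing in the degree
```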
Figure 13.6 Least squares polynomial fits of degree 2, 6, 10, and 15 to 100
points.
Piecewise-linear fit. A piecewise-linear function with knot points a_1 < · · · < a_k is a continuous function that is affine in between the knot points. Such a function can be written as a linear combination of the k + 2 basis functions

    f_1(x) = 1,    f_2(x) = x,    f_{i+2}(x) = (x − a_i)_+,    i = 1, . . . , k,

where (u)_+ = max{u, 0}. These basis functions are shown in figure 13.7 for k = 2 knot points at a_1 = −1, a_2 = 1. An example of a piecewise-linear fit with these knot points is shown in figure 13.8.

[Figure 13.7: The basis functions (x + 1)_+ and (x − 1)_+. Figure 13.8: A piecewise-linear fit f̂(x) with knot points at −1 and 1.]
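A sketch of a piecewise-linear fit with knots at −1 and 1, on synthetic kinked data:

```python
import numpy as np

np.random.seed(7)
x = np.random.uniform(-3, 3, 100)
y = np.abs(x) + 0.1 * np.random.randn(100)  # data with a kink

knots = [-1.0, 1.0]
plus = lambda u: np.maximum(u, 0.0)  # (u)_+ = max{u, 0}

# Basis: 1, x, (x - a_1)_+, (x - a_2)_+
A = np.column_stack([np.ones_like(x), x] + [plus(x - a) for a in knots])
theta, *_ = np.linalg.lstsq(A, y, rcond=None)
rms = np.sqrt(np.mean((A @ theta - y) ** 2))
print(np.round(theta, 2), round(rms, 3))
```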
13.1.2 Regression

Recall that the regression model has the form

    ŷ = x^T β + v,

where β is the weight vector and v is the offset. We can put this model in our general data fitting form using the basis functions f_1(x) = 1, and

    f_i(x) = x_{i−1},    i = 2, . . . , n + 1,

so p = n + 1. The regression model can then be expressed as

    ŷ = x^T θ_{2:n+1} + θ_1,

and we see that β = θ_{2:n+1} and v = θ_1. The N × (n + 1) matrix A in our general data fitting form is given by

    A = [ 1  X^T ],

where X is the n × N data matrix whose columns are x_1, . . . , x_N.
House price regression. In §2.3 we described a simple regression model for the selling price of a house based on two attributes, area and number of bedrooms. The values of the parameters β and v given in (2.8) were computed by least squares fitting on a set of data consisting of 774 house sales in Sacramento over a 5-day period. The RMS fitting error for the model is 74.8 (in thousands of dollars). For comparison, the standard deviation of the prices in the data set is 112.8. So this very basic regression model predicts the prices substantially better than a constant model (i.e., the mean price of the houses in the data set).
When the dependent variable y is positive and varies over a large range, it is common to replace it with its logarithm w = log y, and then use least squares to develop a model ŵ = ĝ(x) for w. We then form our estimate of y using ŷ = e^{ĝ(x)}. When we fit a model ŵ = ĝ(x) to the logarithm w = log y, the fitting error for w can be interpreted in terms of the percentage or relative error between ŷ and y, defined as

    η = max{ŷ/y, y/ŷ} − 1.

So η = 0.1 means either ŷ = 1.1y (i.e., we over-estimate by 10%) or ŷ = (1/1.1)y (i.e., we under-estimate by 10%). The connection between the relative error between ŷ and y, and the residual r in predicting w, is

    η = e^{|r|} − 1.
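The identity η = e^{|r|} − 1 is a one-liner to check (the values below are arbitrary):

```python
import math

def rel_err(y_hat, y):
    # eta = max{y_hat/y, y/y_hat} - 1
    return max(y_hat / y, y / y_hat) - 1

y = 250.0
for y_hat in (200.0, 250.0, 300.0):
    r = math.log(y) - math.log(y_hat)  # residual in predicting w = log y
    assert math.isclose(rel_err(y_hat, y), math.exp(abs(r)) - 1)
print("identity holds for the tested values")
```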
[Figure 13.9: Hourly temperature at Los Angeles International Airport (in °F) and the predictions of an AR model, versus the hour index k.]
13.2 Validation

Generalization ability. In this section we address a key point in model fitting: The goal of model fitting is typically not just to achieve a good fit on the given data set, but rather to achieve a good fit on new data that we have not yet seen. This leads us to a basic question: How well can we expect a model to predict y for future or other unknown values of x? Without some assumptions about the future data, there is no good way to answer this question.

One very common assumption is that the data are described by a formal probability model. With this assumption, techniques from probability and statistics can be used to predict how well a model will work on new, unseen data. This approach has been very successful in many applications, and we hope that you will learn about these methods in another course. In this book, however, we will take a simple intuitive approach to this issue.
The ability of a model to predict the outcomes for new unseen data values is
called its generalization ability. If it predicts outcomes well, it is said to have good
generalization ability; in the opposite case, it is said to have poor generalization
ability. So our question is: How can we assess the generalization ability of a model?
Validation on a test set. A simple but effective method for assessing the generalization ability of a model is called out-of-sample validation. We divide the data we have into two sets: a training set and a test set (also called a validation set). This is often done randomly, with 80% of the data put into the training set and 20% put into the test set. Another common choice for the split ratio is 90%/10%. A common way to describe this is to say that 20% of the data were reserved for validation.
To fit our model, we use only the data in the training set. The model that we
come up with is based only on the data in the training set; the data in the test set
has never been seen by the model. Then we judge the model by its RMS fit on the
test set. Since the model was developed without any knowledge of the test set data,
the test data is effectively data that are new and unseen, and the performance of
our model on this data gives us at least an idea of how our model will perform on
new, unseen data. If the RMS prediction error on the test set is large, then we can
conclude that our model has poor generalization ability. Assuming that the test
data is typical of future data, the RMS prediction error on the test set is what
we might guess our RMS prediction error will be on new data.
If the RMS prediction error of the model on the training set is similar to the
RMS prediction error on the test set, we have increased confidence that our model
has reasonable generalization ability. (A more sophisticated validation method
called cross-validation, described below, can be used to gain even more confidence.)
For example, if our model achieves an RMS prediction error of 10% (compared
to rms(y)) on the training set and 11% on the test set, we can guess that it will have
a similar RMS prediction error on other unseen data. But there is no guarantee
of this, without further assumptions about the data. The basic assumption we are
making here is that the future data will look like the test data, or that the test
data were typical. Ideas from statistics can make this idea more precise, but we
will leave this idea informal and intuitive.
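A minimal out-of-sample validation sketch with an 80/20 random split, on synthetic data:

```python
import numpy as np

np.random.seed(8)
x = np.random.uniform(-1, 1, 200)
y = 1 + 2 * x + 0.1 * np.random.randn(200)

# Random 80/20 train/test split
idx = np.random.permutation(200)
train, test = idx[:160], idx[160:]

# Fit a straight line on the training set only
A = lambda u: np.column_stack([np.ones_like(u), u])
theta, *_ = np.linalg.lstsq(A(x[train]), y[train], rcond=None)

rms = lambda u, v: np.sqrt(np.mean((A(u) @ theta - v) ** 2))
print("train RMS:", round(rms(x[train], y[train]), 3))
print("test RMS:", round(rms(x[test], y[test]), 3))
```

Similar train and test RMS errors, as here, suggest the model is not over-fit.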
Over-fitting. When the RMS prediction error on the training set is much smaller
than the RMS prediction error on the test set, we say that the model is over-fit.
13.2 Validation 205
It tells us that, for the purposes of making predictions on new, unseen data, the
model is much less valuable than its performance on the training data suggests.
Good models, which perform well on new data, do not suffer from over-fit.
Roughly speaking, an over-fit model trusts the data it has seen (i.e., the training
set) too much; it is too sensitive to the changes in the data that will likely be seen
in the future data. One method for avoiding over-fit is to keep the model simple;
another technique, called regularization, is discussed in chapter 15.
Choosing among different models. We can use least squares fitting to fit multiple
models to the same data. For example, in univariate fitting, we can fit a constant,
an affine function, a quadratic, or a higher order polynomial. Which is the best
model among these? Assuming that the goal is to make good predictions on new,
unseen data, we should choose the model with the smallest RMS prediction error on
the test set. Since the RMS prediction error on the test set is only a guess about
what we might expect for performance on new, unseen data, we can soften this
advice to we should choose a model that has test set RMS error that is near the
minimum over the candidates.
We observed earlier that when we add basis functions to a model, our fitting
error on the training data can only decrease (or stay the same). But this is not
true for the test error. The test error need not decrease when we add more basis
functions. Indeed, when we have too many basis functions, we can expect over-fit,
i.e., larger error on the test set.
If we have a sequence of basis functions f1 , f2 , . . . , we can consider models
based on using just f1 (which is typically the constant function 1), then f1 and
f2 , and so on. As we increase p, the number of basis functions, our training error
will go down (or stay the same). But the test error typically decreases, and then
starts to increase when p is too large, and the resulting models suffer from over-
fit. The intuition for this typical behavior is that for p too small, our model is
too simple to fit the data well, and so cannot make good predictions; when p is
too large, our model is too complex and suffers from over-fit, and so makes poor
predictions. Somewhere in the middle, where the model achieves near minimum
test set performance, is a good choice (or several good choices) of p.
To illustrate these ideas, we consider the example shown in figure 13.6. Using a
training set of 100 points, we find least squares fits of polynomials of degrees 0, 1,
. . . , 20. (The polynomial fits of degrees 2, 6, 10, and 15 are shown in the figure.)
We now obtain a new set of data for validation, also with 100 points. These test
data are plotted along with the polynomial fits obtained from the training data
in figure 13.10. This is a real check of our models, since these data points were
not used to develop the models. Figure 13.11 shows the RMS training and test
errors for polynomial fits of different degrees. We can see that the RMS training
error decreases with every increase in degree. The RMS test error decreases until
degree 6 and starts to increase for degrees larger than 6. This plot suggests that a
polynomial fit of degree 6 is a reasonable choice.
With a 6th degree polynomial, the relative RMS error on both the training and test sets is around 0.3. It is a good sign, in terms of generalization ability, that the training and test errors are similar. While there are no guarantees, we can guess that the 6th degree polynomial model will have a relative RMS error around
Figure 13.10 The polynomial fits of figure 13.6 evaluated on a test set of 100
points.
Figure 13.11 RMS error versus polynomial degree for the fitting example
in figures 13.6 and 13.10. Circles indicate RMS errors on the training set.
Squares show RMS errors on the test set.
0.3 on new, unseen data, provided the new, unseen data is sufficiently similar to
the test set data.
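The qualitative behavior in figure 13.11, training error monotone in the degree while test error is evaluated on data the model never saw, can be reproduced on synthetic data:

```python
import numpy as np

np.random.seed(9)
def make_data(n=100):
    x = np.random.uniform(-1, 1, n)
    return x, np.sin(4 * x) + 0.3 * np.random.randn(n)

x_tr, y_tr = make_data()
x_te, y_te = make_data()  # separate test set, never used for fitting

def fit_rms(deg):
    A = lambda u: np.vander(u, deg + 1, increasing=True)
    theta, *_ = np.linalg.lstsq(A(x_tr), y_tr, rcond=None)
    r = lambda u, v: np.sqrt(np.mean((A(u) @ theta - v) ** 2))
    return r(x_tr, y_tr), r(x_te, y_te)

for deg in (0, 2, 6, 10, 15):
    tr, te = fit_rms(deg)
    print(deg, round(tr, 3), round(te, 3))
```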
Cross-validation. Cross-validation is an extension of out-of-sample validation in which we divide the data into several sets, or folds, and fit the model with each fold removed in turn, testing it on the removed fold.

Table 13.1  Five-fold validation for the simple regression model of the house sales data set.

Table 13.2  Five-fold validation for the constant model of the house sales data set.

As an example, we apply five-fold cross-validation to the house sales data. We randomly partition the data set of 774 sales records into five folds, four of size 155 and one of size 154. Then we fit five regression models, each of the form

    ŷ = v + β_1 x_1 + β_2 x_2,

to the data set after removing one of the folds. Table 13.1 summarizes the results.
The model parameters for the 5 different regression models are not exactly the
same, but quite similar. The train and test RMS errors are reasonably similar,
which suggests that our model does not suffer from over-fit. Scanning the RMS
error on the test sets, we can expect that our prediction error on new houses will be around 70-80 (thousand dollars) RMS. We can also see that the model parameters change a bit, but not drastically, across the folds. This gives us more confidence that, for example, β_2 being negative is not a fluke of the data.
For comparison, table 13.2 shows the RMS errors for the constant model y = v,
where v is the mean price of the training set. The results suggest that the constant
model can predict house prices with a prediction error around 105-120 (thousand
dollars).
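A five-fold cross-validation sketch on synthetic 'house' data (the actual sales records are not included in the text, so the features, coefficients, and noise level below are made up):

```python
import numpy as np

np.random.seed(10)
N = 774
X = np.column_stack([np.random.uniform(0.5, 3.0, N),   # e.g. area (1000 sq ft)
                     np.random.randint(1, 6, N)])      # e.g. bedrooms
y = 150 * X[:, 0] - 20 * X[:, 1] + 40 * np.random.randn(N)

idx = np.random.permutation(N)
folds = np.array_split(idx, 5)  # sizes 155, 155, 155, 155, 154

rms = lambda v: np.sqrt(np.mean(v**2))
for k, fold in enumerate(folds):
    train = np.setdiff1d(idx, fold)
    A = lambda M: np.column_stack([np.ones(len(M)), M])
    theta, *_ = np.linalg.lstsq(A(X[train]), y[train], rcond=None)
    print(k, round(rms(A(X[train]) @ theta - y[train]), 1),
             round(rms(A(X[fold]) @ theta - y[fold]), 1))
```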
Figure 13.12 shows the scatter plots of actual and regression model predicted
prices for each of the five training and test sets. The results for training and
test sets are reasonably similar in each case, which gives us confidence that the
regression model will have similar performance on new, unseen houses.
Figure 13.12  Scatter plots of actual and predicted prices for the five simple regression models of table 13.1. The horizontal axis is the actual selling price; the vertical axis is the predicted price, both in thousands of dollars. Blue circles are samples in the training set, red circles samples in the test set.
Validating time series predictions. When the original data are unordered, for
example, patient records or customer purchase histories, the division of the data
into a training and test set is typically done randomly. This same method can
be used to validate a time series prediction model, such as an AR model, but it
does not give the best emulation of how the model will ultimately be used. In
practice, the model will be trained on past data and then used to make predictions
on future data. When the training data in a time series prediction model are
randomly chosen, the model is being built with some knowledge of the future, a
phenomemon called look-ahead or peak-ahead. Look-ahead can make a model look
better than it is really is at making predictions.
To avoid look-ahead, the training set for a time series prediction model is typ-
ically taken to be the data examples up to some point in time, and the test data
are chosen as points that are past that time (and sometimes, at least M samples
past that time, taking into account the memory of the predictor). In this way we
can say that the model is being tested by making predictions on data it has never
seen. As an example, we might train an AR model for some daily quantity using
data from the years 2006 through 2008, and then test the resulting AR model on
the data from year 2009.
As an example, we return to the AR model of hourly temperatures at Los Angeles International Airport described on page 202 and shown in figure 13.9. We divide the one month of data into a training set (May 1-24) and a test set (May 25-31). The coefficients in an AR model are computed using the 24 · 24 − 8 = 568 samples in the training set. The RMS error on the training set is 1.03 °F. The RMS prediction error on the test set is 0.98 °F, which is similar to the RMS prediction error on the training set, giving us confidence that the AR model is not over-fit. (The fact that the test RMS error is very slightly smaller than the training RMS error has no significance.) Figure 13.13 shows the prediction on the first five days of the test set. The predictions look very similar to those shown in figure 13.9.
[Figure 13.13: Hourly temperature at Los Angeles International Airport (in °F) and the AR model predictions on the first five days of the test set, versus the hour index k.]
13.3 Feature engineering

Adding new features to get a richer model. In many cases the basis functions include the constant one, i.e., we have f_1(x) = 1. (This is equivalent to having the offset in the basic regression model.) It is also very common to include the original features as well, as in f_i(x) = x_{i−1}, i = 2, . . . , n + 1. If we do this, we are effectively starting with the basic regression model; we can then add new features to get a richer model. In this case we have p > n, so there are more mapped features than original features. (Whether or not it is a good idea to add the new features can be determined by out-of-sample validation or cross-validation.)
Dimension reduction. In some cases, and especially when the number n of the
original features is very large, the feature mappings are used to construct a smaller
set of p < n features. In this case we can think of the feature mappings or basis
functions as a dimension reduction or data aggregation procedure.
Standardizing features. It is common to replace each original feature with a scaled and shifted version,

    f_{i+1}(x) = (x_i − b_i) / a_i,    i = 1, . . . , n,

so that across the data set, the average value of each mapped feature is near zero, and the standard deviation is around one. (This is done by choosing b_i to be near the mean of the feature i values over the data set, and choosing a_i to be near the standard deviation of the values.) This is called standardizing or z-scoring the features. The standardized feature values are easily interpretable since they correspond to z-values; for example, f_3(x) = +3.3 means that the value of original feature 2 is quite a bit above the typical value. The standardization of each original feature is typically the first step in feature engineering. The constant feature is not standardized. (In fact, it cannot be standardized since its standard deviation across the data set is zero.)
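A standardization sketch (the two synthetic features are stand-ins for, say, area and bedrooms):

```python
import numpy as np

np.random.seed(11)
X = np.column_stack([np.random.uniform(50, 200, 100),   # e.g. area
                     np.random.uniform(1, 5, 100)])     # e.g. bedrooms

b = X.mean(axis=0)   # per-feature means
a = X.std(axis=0)    # per-feature standard deviations
Z = (X - b) / a      # standardized (z-scored) features

print(np.round(Z.mean(axis=0), 10))  # near zero
print(np.round(Z.std(axis=0), 10))   # near one
```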
Winsorizing features. When the data include some very large values that are thought to be errors (say, in collecting the data), it is common to clip or winsorize the data. This means that we set any values that exceed some chosen maximum absolute value to that value. Assuming, for example, that a feature entry x_5 has already been standardized (so it represents z-scores across the examples), we replace x_5 with its winsorized value (with threshold 3),

    x̃_5 = { x_5    |x_5| ≤ 3
            3      x_5 > 3
            −3     x_5 < −3.
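With numpy, winsorizing is a clip:

```python
import numpy as np

x = np.array([-4.2, -1.0, 0.3, 2.9, 7.5])  # standardized feature values
x_wins = np.clip(x, -3, 3)                 # winsorize with threshold 3
print(x_wins)  # values outside [-3, 3] are clipped to the threshold
```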
Log transform. When feature values are positive and vary over a wide range,
it is common to replace them with their logarithms. If the feature value also
includes the value 0 (so the logarithm is undefined) a common variation on the log
transformation is to use xk = log(xk + 1). This compresses the range of values that
we encounter. As an example, suppose the original features record the number
of visits to websites over some time period. These can easily vary over a range of
10000:1 (or even more) for a very popular website and a less popular one; taking the
logarithm of the visit counts gives a feature with less variation, which is possibly
more interpretable. (The decision as to whether to use the original feature values
or their logarithms can be decided by validation.)
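A sketch of the log transform on hypothetical visit-count data:

```python
import numpy as np

# x_k -> log(x_k + 1): defined even at 0, and compresses a wide positive range.
visits = np.array([0.0, 9.0, 999.0, 9999.0])  # illustrative website visit counts
log_features = np.log(visits + 1.0)
# raw counts spanning a 10000:1 range map to values from 0 up to log(10000), about 9.2
```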
Expanding categoricals. Some features take on only a few values, such as −1 and
1 or 0 and 1, which might represent some value like presence or absence of some
symptom. (Such features are called Boolean.) A Likert scale response (see page 57)
naturally only takes on a small number of values, such as −2, −1, 0, 1, 2. Another
example is an original feature that takes on the values 1, 2, . . . , 7, representing
the day of the week. Such features are called categorical in statistics, since they
specify which category the example is in, and not some real number.
Expanding a categorical feature with l values means replacing it with a set of
l − 1 new features, each of which is Boolean, and simply records whether or not
the original feature has the associated value. (When all these features are zero,
it means the original feature had the default value.) As an example, suppose the
original feature x1 takes on only the values −1, 0, and 1. Using the feature value
0 as the default feature value, we replace x1 with the two mapped features

    f1(x) = 1 if x1 = −1, 0 otherwise,        f2(x) = 1 if x1 = 1, 0 otherwise.

In words, f1(x) tells us if x1 has the value −1, and f2(x) tells us if x1 has the value
1. (We do not need a new feature for the default value x1 = 0; this corresponds to
f1(x) = f2(x) = 0.) There is no need to expand an original feature that is Boolean
(i.e., takes on two values).
As an example, consider a model that is used to predict house prices based on
various features that include the number of bedrooms, that ranges from 1 to 5 (say).
In the basic regression model, we use the number of bedrooms directly as a feature.
If we expand this categorical feature, using 2 bedrooms as the default, we have 4
Boolean features that correspond to a house having 1, 3, 4, and 5 bedrooms. In the
basic model there is one parameter value that corresponds to value per bedroom;
we multiply this parameter by the number of bedrooms to get the contribution to
our price prediction. When we expand the bedroom categorical feature, we have
4 parameters in our model, which assign the amounts to add to our prediction for
houses with 1, 3, 4, and 5 bedrooms, respectively.
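Expanding a categorical feature is a small lookup. A sketch for the three-valued example above (the function name is illustrative):

```python
# Expand a categorical feature with values -1, 0, 1, using 0 as the default:
# two Boolean features record whether x1 = -1 and whether x1 = 1.
def expand_categorical(x1, values=(-1, 1)):
    return [1.0 if x1 == v else 0.0 for v in values]

assert expand_categorical(-1) == [1.0, 0.0]  # f1 = 1: x1 has the value -1
assert expand_categorical(0)  == [0.0, 0.0]  # default value: both features zero
assert expand_categorical(1)  == [0.0, 1.0]  # f2 = 1: x1 has the value 1
```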
Generalized additive model. We introduce new features that are nonlinear functions
of the original features, such as, for each xi, the functions min{xi + a, 0} and
max{xi − b, 0}, where a and b are parameters. A common choice, assuming that
xi has already been standardized, is a = b = 1. This leads to the predictor

    ŷ = ψ1(x1) + ⋯ + ψn(xn),

where each ψi is a piecewise-linear function of xi, with kink or knot points at the
values −a and +b. This model has 3n parameters. This model is a sum of functions
of the original features, and is called a generalized additive model.
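The three basis functions per original feature can be sketched as follows, with a = b = 1 (names illustrative):

```python
import numpy as np

# For each entry x_i, form the features min(x_i + a, 0), x_i, max(x_i - b, 0).
# Their weighted sum is a piecewise-linear function with knots at -a and b.
def gam_features(x, a=1.0, b=1.0):
    x = np.asarray(x, dtype=float)
    return np.column_stack([np.minimum(x + a, 0.0), x, np.maximum(x - b, 0.0)])

F = gam_features([-2.0, 0.0, 2.0])
# rows: [-1, -2, 0] for x = -2;  [0, 0, 0] for x = 0;  [0, 2, 1] for x = 2
```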
Products and interactions. New features can be developed from pairs of original
features, such as their products. From the original features we can add xi xj, for
i, j = 1, . . . , n, i ≤ j. Products are used to model interactions among the features.
Product features are easily interpretable when the original features are Boolean,
i.e., take the values 0 or 1. Thus xi = 1 means that feature i is present or has
occurred, and xi xj = 1 exactly when both feature i and feature j have occurred.
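A sketch of forming the product features xi xj for i ≤ j (the function name is illustrative):

```python
# All pairwise products x_i x_j, i <= j, appended as interaction features.
def product_features(x):
    n = len(x)
    return [x[i] * x[j] for i in range(n) for j in range(i, n)]

# Boolean example: a product is 1 exactly when both features are 1.
product_features([1.0, 0.0, 1.0])  # [1.0, 0.0, 1.0, 0.0, 0.0, 1.0]
```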
Custom mappings. In many applications custom mappings of the raw data are
used as additional features, in addition to the original features given. For example,
in a model meant to predict an asset's future price using prior prices, we might also
use the highest and lowest prices over the last week. Another well-known example
in financial models is the price-to-earnings ratio, constructed from the price and
(last) earnings features.
In document analysis applications word count features are typically replaced
with term frequency inverse document frequency (TFIDF) values, which scale the
raw count values by a function of the frequency with which the word appears
across the given set of documents, usually in such a way that uncommon words are
given more weight. (There are many variations on the particular scaling function
to use. Which one to use in a given application can be determined by out of sample
or cross-validation.)
Predictions from other models. In many applications there are existing models
for the data. A common trick is to use the predictions of these models as features
in your model. In this case you can describe your model as one that combines
or blends the raw data available with predictions made from one or more existing
models to create a new prediction.
Random features. The new features are given by a nonlinear function of a random
linear combination of the original features. To add K new features of this type,
we first generate a random K × n matrix R. We then generate new features as
(Rx)+ or |Rx|, where (·)+ and | · | are applied elementwise to the vector Rx. (Other
nonlinear functions can be used as well.)
This approach to generating new features is quite counter-intuitive, since you
would imagine that feature engineering should be done using detailed knowledge
of, and intuition about, the particular application. Nevertheless this method can
be very effective in some applications.
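A sketch of generating random features in Python (the sizes here are small and illustrative):

```python
import numpy as np

# K new features (Rx)_+ or |Rx|, with R a random K x n matrix and the
# nonlinearity applied elementwise to the vector Rx.
rng = np.random.default_rng(0)
n, K = 20, 1000
R = rng.standard_normal((K, n))
x = rng.standard_normal(n)           # an original feature vector
feat_plus = np.maximum(R @ x, 0.0)   # (Rx)_+, elementwise positive part
feat_abs  = np.abs(R @ x)            # |Rx|, elementwise absolute value
# either vector of K nonnegative values is appended to the original features
```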
13.3.4 Summary
The discussion above makes it clear that there is much art in choosing features to
use in a model. But it is important to keep several things in mind when creating
new features:
– Try simple models first. Start with a constant, then a simple regression model,
and so on. You can compare more sophisticated models against these.

– Adding new features can easily lead to over-fit. (This will show up when
validating the model.) The most straightforward way to avoid over-fit is to
keep the model simple. We mention here that another approach to avoiding
over-fit, called regularization (covered in chapter 15), can be very effective
when combined with feature engineering.
ŷ = β1 x1 + β2 x2 + v

given in §2.3. Here we examine a more complicated model, with 8 basis functions,

    ŷ = θ1 f1(x) + ⋯ + θ8 f8(x).
The first function is the constant f1 (x) = 1. The next two are functions of x1 , the
area of the house,
In words, f2(x) is the area of the house, and f3(x) is the amount by which the
area exceeds 1.5 (i.e., 1500 square feet). The weighted sum θ2 f2(x) + θ3 f3(x) is
a piecewise-linear function of the house area, with one knot at 1.5. The function
f4 (x) is equal to the number of bedrooms x2 . The function f5 (x) is equal to x3 , i.e.,
one if the property is a condominium, and zero otherwise. The last three functions
are again Boolean, and indicate the location of the house. We partition the 62
ZIP codes present in the data set into four groups, corresponding to different areas
around the center of Sacramento, as shown in table 13.3.
The coefficients in the least squares fit are
Figure 13.14 Scatter plot of actual and predicted prices, both in thousand
dollars, for a model with eight parameters.
Figure 13.15 Scatter plots of actual and predicted prices for the five models
of table 13.4, one panel per fold (Fold 1 through Fold 5). The horizontal axis
is the actual selling price, the vertical axis is the predicted price, both in
thousand dollars. Blue circles are samples in the training set, red circles
samples in the test set.
In this chapter we consider the problem of fitting a model to data where the outcome
takes on values like true or false (as opposed to being numbers, as in chapter 13).
We will see that least squares can be used for this problem as well.
14.1 Classification
In the data fitting problem of chapter 13, the goal is to reproduce or predict the
outcome y, which is a (scalar) number. In a classification problem, the outcome or
dependent variable y takes on only a finite number of values, and for this reason is
sometimes called a label, or in statistics, a categorical. In the simplest case, y has
only two values, for example true or false, or spam or not spam. This is called
the 2-way classification problem, or the Boolean classification problem. We start
by considering the Boolean classification problem.
We will encode y as a real number, taking y = +1 to mean true and y = −1
to mean false. As in real-valued data fitting, we assume that an approximate
relationship of the form y ≈ f(x) holds, where f : Rⁿ → {−1, +1}. (This notation
means that the function f takes an n-vector argument, and gives a resulting value
that is either +1 or −1.) Our model will have the form ŷ = f̂(x), where f̂ : Rⁿ →
{−1, +1}. The model f̂ is also called a classifier, since it classifies n-vectors into
those for which f̂(x) = +1 and those for which f̂(x) = −1.
Fraud detection. The vector x gives a set of features associated with a credit
card holder, such as her average monthly spending levels, median price of
purchases over the last week, number of purchases in different categories, av-
erage balance, and so on, as well as some features associated with a particular
proposed transaction. The outcome y is +1 for a fraudulent transaction, and
−1 otherwise. The data used to create the classifier is taken from historical
data, that includes (some) examples of transactions that were later verified
to be fraudulent and (many) that were verified to be bona fide.
Boolean document classification. The vector x is a word count (or histogram)
vector for a document, and the outcome y is +1 if the document has some
specific topic (say, politics) and −1 otherwise. The data used to construct the
classifier might come from a corpus of documents with their topics labeled.
Disease detection. The examples correspond to patients, with outcome y =
+1 meaning the patient has a particular disease, and y = −1 meaning they do
not. The vector x contains relevant medical features associated with the pa-
tient, including for example age, sex, results of tests, and specific symptoms.
The data used to build the model come from hospital records or a medical
study; the outcome is the associated diagnosis, confirmed by a doctor.
Prediction errors. For a given data point (x, y), with predicted outcome ŷ = f̂(x),
there are only four possibilities:

– True positive. y = +1 and ŷ = +1.
– True negative. y = −1 and ŷ = −1.
– False positive. y = −1 and ŷ = +1.
– False negative. y = +1 and ŷ = −1.
In the first two cases the predicted label is correct, and in the last two cases, the
predicted label is an error. We refer to the third case as a false positive or type I
error, and we refer to the fourth case as a false negative or type II error. In some
applications we care equally about making the two types of errors; in others we
may care more about making one type of error than another.
Error rate and confusion matrix. For a given data set (x1, y1), . . . , (xN, yN) and
model f̂, we can count the numbers of each of the four possibilities that occur
across the data set, and display them in a contingency table or confusion matrix,
which is a 2 × 2 table with the columns corresponding to the value of ŷi and the
rows corresponding to the value of yi. (This is the convention used in machine
learning; in statistics, the rows and columns are sometimes reversed.) The entries
give the total number of each of the four cases listed above, as shown in table 14.1.
The diagonal entries correspond to correct decisions, with the upper left entry the
number of true positives, and the lower right entry the number of true negatives.
The off-diagonal entries correspond to errors, with the upper right entry the number
of false negatives, and the lower left entry the number of false positives. The total
of the four numbers is N , the number of examples in the data set. Sometimes the
totals of the rows and columns are shown, as in table 14.1.
                          Prediction
Outcome     ŷ = +1       ŷ = −1       Total
y = +1      Ntp          Nfn          Np
y = −1      Nfp          Ntn          Nn
All         Ntp + Nfp    Nfn + Ntn    N

Table 14.1 Confusion matrix of a Boolean classifier on a data set with N examples.
                              Prediction
Outcome             ŷ = +1 (spam)   ŷ = −1 (not spam)   Total
y = +1 (spam)       95              32                  127
y = −1 (not spam)   19              1120                1139
All                 114             1152                1266

Table 14.2 Confusion matrix of a spam detector on a data set of 1266 examples.
Various performance metrics are expressed in terms of the numbers in the con-
fusion matrix.
– The error rate is the total number of errors (of both kinds) divided by the
total number of examples, i.e., (Nfp + Nfn)/N.

– The true positive rate (also known as the sensitivity or recall rate) is Ntp/Np.
This gives the fraction of the data points with y = +1 for which we correctly
guessed ŷ = +1.

– The false positive rate (also known as the false alarm rate) is Nfp/Nn. The
false positive rate is the fraction of data points with y = −1 for which we
incorrectly guess ŷ = +1.

– The specificity or true negative rate is one minus the false positive rate, i.e.,
Ntn/Nn. The true negative rate is the fraction of the data points with y = −1
for which we correctly guess ŷ = −1.
A good classifier will have small (near zero) error rate and false positive rate, and
high (near one) true positive rate and true negative rate. Which of these metrics
is more important depends on the particular application.
An example confusion matrix is given in table 14.2 for the performance of a
spam detector on a data set of N = 1266 examples (emails) of which 127 are spam
(y = +1) and the remaining 1139 are not spam (y = −1). On the data set, this
classifier has 95 true positives and 1120 true negatives, 19 false positives, and 32
false negatives. Its error rate is (19 + 32)/1266 = 4.03%. Its true positive rate
is 95/127 = 74.8% (meaning it is detecting around 75% of the spam in the data
set), and its false positive rate is 19/1139 = 1.67% (meaning it incorrectly labeled
around 1.7% of the non-spam messages as spam).
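The metrics above are simple ratios of confusion-matrix entries. A sketch using the spam-detector counts from table 14.2:

```python
# Confusion-matrix counts from table 14.2: Ntp, Nfn, Nfp, Ntn.
Ntp, Nfn, Nfp, Ntn = 95, 32, 19, 1120
Np, Nn = Ntp + Nfn, Nfp + Ntn      # totals with y = +1 and y = -1
N = Np + Nn                        # 1266 examples

error_rate = (Nfp + Nfn) / N       # (19 + 32)/1266, about 4.03%
tpr = Ntp / Np                     # true positive rate (recall), about 74.8%
fpr = Nfp / Nn                     # false positive rate, about 1.67%
tnr = Ntn / Nn                     # true negative rate = 1 - fpr
```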
                  Prediction
Outcome     ŷ = +1    ŷ = −1    Total
y = +1      46        4         50
y = −1      7         93        100
All         53        97        150

Table 14.3 Confusion matrix for a Boolean classifier of the Iris data set.
If β4 is the coefficient with the largest magnitude, then we can say that x4 is the
feature that contributes the most to our classification decision.
We illustrate least squares classification with a famous data set, first used in the
1930s by the statistician Ronald Fisher. The data are measurements of four at-
tributes of three types of iris flowers: Iris Setosa, Iris Versicolour, and Iris Vir-
ginica. The data set contains 50 examples of each class. The four attributes are:
We compute a Boolean classifier of the form (14.2) that distinguishes the class Iris
Virginica from the other two classes. Using the entire set of 150 examples we find
the coefficients
The confusion matrix associated with this classifier is shown in table 14.3. The
error rate is 7.3%.
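The least squares classifier itself takes only a few lines. The sketch below uses synthetic data rather than the Iris data set, and all names are illustrative:

```python
import numpy as np

# Least squares Boolean classifier fhat(x) = sign(x^T beta + v):
# fit (v, beta) by least squares against the +/-1 labels.
rng = np.random.default_rng(1)
N, n = 200, 4
X = rng.standard_normal((N, n))
beta_true = np.array([1.0, -2.0, 0.5, 0.0])
y = np.where(X @ beta_true + 0.3 > 0, 1.0, -1.0)   # synthetic +/-1 labels

A = np.column_stack([np.ones(N), X])               # constant feature first
theta, *_ = np.linalg.lstsq(A, y, rcond=None)      # theta = (v, beta)
yhat = np.sign(A @ theta)                          # predicted labels
error_rate = np.mean(yhat != y)                    # small on this easy data
```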
Validation. To test our least squares classification method, we apply 5-fold cross-
validation. We randomly divide the data set into 5 folds of 30 examples (10 for
each class). The results are shown in table 14.4. The test data sets contain only
30 examples, so a single prediction error changes the test error rate significantly
(i.e., by 3.3%). This explains the seemingly large variation seen in the
test set error rates. We might guess that the classifier will perform on new unseen
data with an error rate in the 7–10% range, but our test sets are not large enough
to predict future performance more accurately than this. (This is an example of
the limitation of cross-validation when the data set is small; see the discussion on
page 210.)
Table 14.4 Five-fold validation for the Boolean classifier of the Iris data set.
Figure 14.1 Location of the pixels used as features in the handwritten digit
classification example.
We now consider a much larger example, the MNIST data set described in 4.4.1.
The (training) data set contains 60,000 images of size 28 by 28. (A few samples are
shown in figure 4.6.) The number of examples per digit varies between 5421 (for
digit five) and 6742 (for digit one). The pixel intensities are scaled to lie between 0
and 1. We remove the pixels that are nonzero in fewer than 600 training examples.
The remaining 493 pixels are shown as the white area in figure 14.1. There is also
a separate test set containing 10,000 images. Here we will consider classifiers to
distinguish the digit zero from the other nine digits.
In this first experiment, we use the 493 pixel intensities, plus an additional
feature with value 1, as the n = 494 features in the least squares classifier (14.1).
The performance on the (training) data set is shown in the confusion matrix in
table 14.5. The error rate is 1.6%, the true positive rate is 87.1%, and the false
positive rate is 0.3%.
Figure 14.2 shows the distribution of the values of f̃(xi) for the two classes in
                  Prediction
Outcome     ŷ = +1    ŷ = −1    Total
y = +1      5158      765       5923
y = −1      169       53910     54077
All         5325      54675     60000

Table 14.5 Confusion matrix for a classifier for recognizing the digit zero,
on a training set of 60000 examples.
                  Prediction
Outcome     ŷ = +1    ŷ = −1    Total
y = +1      864       116       980
y = −1      42        8978      9020
All         906       9094      10000

Table 14.6 Confusion matrix for the classifier for recognizing the digit zero,
on a test set of 10000 examples.
the training set. The interval [−2.1, 2.1] is divided into 100 intervals of equal width.
For each interval, the height of the blue bar is the fraction of the total number
of training examples xi from class +1 (digit zero) that have a value f̃(xi) in the
interval. The height of the red bar is the fraction of the total number of training
examples from class −1 (digits 1–9) with f̃(xi) in the interval. The vertical dashed
line shows the decision boundary: For f̃(xi) to the left (i.e., negative) we guess
that digit i is from class −1, i.e., digits 1–9; for f̃(xi) to the right of the dashed
line, we guess that digit i is from class +1, i.e., digit 0. False positives correspond
to red bars to the right of the dashed line, and false negatives correspond to blue
bars to the left of the line.
Figure 14.3 shows the values of the coefficients βk, displayed as an image. We
can interpret this image as a map of the sensitivity of our classifier to the pixel
values. Pixels with βi = 0 are not used at all; pixels with larger positive values of
βi are locations where the larger the image pixel value, the more likely we are to
guess that the image represents the digit zero.
Validation. The performance of the least squares classifier on the test set is shown
in the confusion matrix in table 14.6. For the test set the error rate is 1.6%, the
true positive rate is 88.2%, and the false positive rate is 0.5%. These performance
metrics are similar to those for the training data, which suggests that our classifier
is not over-fit, and gives us some confidence in our classifier.
Figure 14.2 The distribution of the values of f̃(xi) in the Boolean classi-
fier (14.1) for recognizing the digit zero, over all elements xi of the training
set. The red bars correspond to the digits from class −1, i.e., the digits 1–9;
the blue bars correspond to the digits from class +1, i.e., the digit 0.
Figure 14.3 The coefficients βk in the least squares classifier that distin-
guishes the digit zero from the other nine digits.
Training set:
                  Prediction
Outcome     ŷ = +1    ŷ = −1    Total
y = +1      5813      110       5923
y = −1      15        54062     54077
All         5828      54172     60000

Test set:
                  Prediction
Outcome     ŷ = +1    ŷ = −1    Total
y = +1      963       17        980
y = −1      7         9013      9020
All         970       9030      10000

Table 14.7 Confusion matrices for the Boolean classifier to recognize the
digit zero after addition of 5000 new features. The first table is for the
training set; the second is for the test set.
One useful modification of the least squares classifier (14.1) is to skew the decision
boundary, by subtracting a constant α from f̃(x) before taking the sign:

    f̂(x) = sign(f̃(x) − α).     (14.3)
Figure 14.4 The distribution of the values of f̃(xi) in the Boolean classi-
fier (14.1) for recognizing the digit zero, after addition of 5000 new features.
Example. We examine the skewed threshold least squares classifier (14.3) for the
example described above, where we attempt to detect whether or not a handwritten
digit is zero. Figure 14.5 shows how the error, true positive, and false positive rates
depend on the decision threshold α, for the training set data. We can see that as α
increases, the true positive rate decreases, as does the false positive rate. We
can see that for this particular case the total error rate is minimized by choosing
α = −0.1, which gives error rate 1.4%, slightly lower than the basic least squares
classifier. The limiting cases when α is negative enough, or positive enough, are
readily understood. When α is very negative, the prediction is always ŷ = +1; our
error rate is then the fraction of the data set with y = −1. When α is very positive,
the prediction is always ŷ = −1, which gives an error rate equal to the fraction of
the data set with y = +1.
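The effect of the threshold α can be sketched directly from the real-valued predictions f̃(xi) (the values below are toy data, purely illustrative):

```python
import numpy as np

# True positive and false positive rates of yhat = sign(ftilde - alpha).
def rates(ftilde, y, alpha):
    yhat = np.where(ftilde > alpha, 1, -1)
    tpr = np.mean(yhat[y == 1] == 1)    # fraction of y = +1 guessed +1
    fpr = np.mean(yhat[y == -1] == 1)   # fraction of y = -1 guessed +1
    return tpr, fpr

ftilde = np.array([-1.2, -0.4, -0.1, 0.2, 0.7, 1.5])  # toy predictions
y      = np.array([-1, -1, 1, -1, 1, 1])              # toy labels
tpr0, fpr0 = rates(ftilde, y, 0.0)   # basic classifier: tpr 2/3, fpr 1/3
tpr1, fpr1 = rates(ftilde, y, 0.8)   # larger alpha: tpr 1/3, fpr 0
```

Increasing α can only move predictions from +1 to −1, so both rates are nonincreasing in α, as in figure 14.5.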
The same information (without the total error rate) is plotted in the traditional
ROC curve shown in figure 14.6. The dots show the basic least squares classifier,
Figure 14.5 True positive, false positive, and total error rate versus decision
threshold α. The vertical dashed line is shown for decision threshold α = 0.25.
Figure 14.6 ROC curve (true positive rate versus false positive rate) for the
handwritten zero classifier, with the points α = −0.25, α = 0, and α = 0.25
marked.
with α = 0, and the skewed threshold least squares classifiers for α = 0.25 and
α = −0.25. These curves are for the training data; the same curves for the test
data look similar, giving us some confidence that our classifiers will have similar
performance on new, unseen data.
Disease diagnosis. The labels are a set of diseases (including one label that
corresponds to disease-free), and the features are medically relevant values,
such as patient attributes and the results of tests. Such a classifier carries
out diagnosis (of the diseases corresponding to the labels). The classifier is
trained on cases in which a definitive diagnosis has been made.
Prediction errors and confusion matrix. For a multi-class classifier f̂ and a given
data point (x, y), with predicted outcome ŷ = f̂(x), there are K² possibilities, cor-
responding to all pairs of values of y, the actual outcome, and ŷ, the predicted
outcome. For a given data set (training or validation set) with N elements, the
numbers of each of the K² occurrences are arranged into a K × K confusion matrix,
where Nij is the number of data points for which y = i and ŷ = j.
The K diagonal entries N11, . . . , NKK correspond to the cases when the predic-
tion is correct; the K² − K off-diagonal entries Nij, i ≠ j, correspond to prediction
errors. For each i, Nii is the number of data points with label i for which we cor-
rectly guessed ŷ = i. For i ≠ j, Nij is the number of data points for which we have
mistaken label i (its true value) for the label j (our incorrect guess). For K = 2
(Boolean classification) there are only two types of prediction errors, false positive
and false negative. For K > 2 the situation is more complicated, since there are
many more types of errors a predictor can make. From the entries of the confusion
matrix we can derive various measures of the accuracy of the predictions. We let
                 Prediction ŷ
Outcome y    Dislike    Neutral    Like
Dislike      183        10         5
Neutral      7          61         8
Like         3          13         210

Table 14.8 Confusion matrix of a multi-class classifier with K = 3 labels, on
a data set of 500 examples.
Ni (with one index) denote the total number of data points for which y = i, i.e.,
Ni = Ni1 + ⋯ + NiK. We have N = N1 + ⋯ + NK.
The simplest measure is the overall error rate, which is the total number of
errors (the sum of all off-diagonal entries in the confusion matrix) divided by the
data set size (the sum of all entries in the confusion matrix):
    (1/N) Σ_{i≠j} Nij = 1 − (1/N) Σ_i Nii.
This measure implicitly assumes that all errors are equally bad. In many applica-
tions this is not the case; for example, some medical mis-diagnoses might be worse
for a patient than others.
We can also look at the rate with which we predict each label correctly. The
quantity Nii /Ni is called the true label i rate. It is the fraction of data points with
label y = i for which we correctly predicted ŷ = i. (The true label i rates reduce
to the true positive and true negative rates for Boolean classifiers.)
A simple example, with K = 3 labels (Dislike, Neutral, and Like), and a total
number N = 500 data points, is shown in table 14.8. Out of 500 data points, 454
(the sum of the diagonal entries) were classified correctly. The remaining 46 data
points (the sum of the off-diagonal entries) correspond to the 6 different types of
errors. The overall error rate is 46/500 = 9.2%. The true label Dislike rate is
183/(183 + 10 + 5) = 92.4%, i.e., among the data points with label Dislike, we
correctly predicted the label on 92.4% of the data. The true label Neutral rate is
61/(7 + 61 + 8) = 80.3%, and the true label Like rate is 210/(3 + 13 + 210) = 92.9%.
The idea behind the least squares Boolean classifier can be extended to handle
multi-class classification problems. For each possible label value, we construct a
new data set with the Boolean label +1 if the label has the given value, and −1
otherwise. (This is sometimes called a one-versus-others classifier.) From these K
Boolean classifiers we must create a classifier that chooses one of the K possible
labels. We do this by selecting the label for which the least squares regression fit
has the highest value, which roughly speaking is the one with the highest level of
confidence.
    f̂(x) = argmax_{k=1,...,K} f̃k(x),

where f̃k is the least squares regression model for label k against the others. The
notation argmax means the index of the largest value among the numbers f̃k(x),
for k = 1, . . . , K. Note that f̃k(x) is the real-valued prediction for the Boolean
classifier for class k versus not class k; it is not the Boolean classifier, which is
sign(f̃k(x)).
As an example consider a multi-class classification problem with 3 labels. We
construct 3 different least squares classifiers, for 1 versus 2 or 3, for 2 versus 1 or
3, and for 3 versus 1 or 2. Suppose for a given feature vector x, we find that
The largest of these three numbers is f̃3(x), so our prediction is f̂(x) = 3. We can
interpret these numbers and our final decision. The first classifier is fairly confident
that the label is not 1. According to the second classifier, the label could be 2,
but it does not have high confidence in this prediction. Finally, the third classifier
predicts the label is 3, and moreover has relatively high confidence in this guess.
So our final guess is label 3. (This interpretation suggests that if we had to make a
second guess, it should be label 2.) Of course here we are anthropomorphizing the
individual label classifiers, since they do not have beliefs or levels of confidence in
their predictions. But the story is helpful in understanding the motivation behind
the classifier above.
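A sketch of the one-versus-others construction on synthetic data (all names and sizes are illustrative):

```python
import numpy as np

# One-versus-others multi-class classifier: fit K least squares models on
# +/-1 labels, then predict fhat(x) = argmax_k ftilde_k(x).
rng = np.random.default_rng(2)
N, n, K = 300, 5, 3
X = rng.standard_normal((N, n))
W = rng.standard_normal((n, K))
y = np.argmax(X @ W, axis=1) + 1                   # synthetic labels 1..K

A = np.column_stack([np.ones(N), X])               # constant feature first
Theta = np.column_stack([
    np.linalg.lstsq(A, np.where(y == k, 1.0, -1.0), rcond=None)[0]
    for k in range(1, K + 1)
])                                                 # one column per label
yhat = np.argmax(A @ Theta, axis=1) + 1            # choose largest ftilde_k
error_rate = np.mean(yhat != y)
```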
Skewed decisions. In a Boolean classifier we can skew the decision threshold (see
§14.2.3) to trade off the true positive and false positive rates. In a K-class classifier,
an analogous method can be used to trade off the K true label i rates. We apply
an offset αk to f̃k(x) before finding the largest value. This gives the predictor

    f̂(x) = argmax_{k=1,...,K} ( f̃k(x) − αk ),

where αk are constants chosen to trade off the true label k rates. If we decrease αk,
we predict f̂(x) = k more often, so all entries of the kth column in the confusion
matrix increase. This increases our rate of true positives for label k (since Nkk
increases), which is good. But it can decrease the true positive rates for the other
labels.
                          Prediction
Class          Setosa    Versicolour    Virginica    Total
Setosa         40        0              0            40
Versicolour    0         27             13           40
Virginica      0         4              36           40
All            40        31             49           120

Table 14.9 Confusion matrix for a 3-class classifier of the Iris data set, on a
training set of 120 examples.
    θ1 + ⋯ + θK = −(K − 2) e1,

where θk is the coefficient vector for distinguishing class k from the others. Once
we have computed θ1, . . . , θK−1, we can find θK by simple vector subtraction.
This explains why for the Boolean classification case we have K = 2, but we
only have to solve one least squares problem. In §14.2 we compute one coefficient
vector θ; if the same problem were to be considered a K-class problem with K = 2,
we would have θ1 = θ. (This one distinguishes class 1 versus class 2.) The other
coefficient vector is then θ2 = −θ1. (This one distinguishes class 2 versus class 1.)
We compute a 3-class classifier for the Iris data set described on page 225. The
examples are randomly partitioned into a training set of size 120, containing 40
examples of each class, and a test set of size 30, with 10 examples of each class.
The 3 × 3 confusion matrix for the training set is given in table 14.9. The error
rate is 14.2%. The results for the test set are in table 14.10. The error rate is
13.3%, similar enough to the training error rate to give us some confidence in our
classifier. The true Setosa rate is 100% for both train and test sets, suggesting
that our classifier can detect this type well. The true Versicolour rate is 67.5% for
the train data, and 60% for the test set. The true Virginica rate is 90% for the
                          Prediction
Class          Setosa    Versicolour    Virginica    Total
Setosa         10        0              0            10
Versicolour    0         6              4            10
Virginica      0         0              10           10
All            10        6              14           30

Table 14.10 Confusion matrix for a 3-class classifier of the Iris data set, on
a test set of 30 examples.
train data, and 100% for the test set. This suggests that our classifier can detect
Virginica well, but perhaps not as well as Setosa. (The 100% true Virginica rate
on the test set is a matter of luck, due to the very small number of test examples
of each type; see the discussion on page 210.)
    f̂k(x) = sign(xT βk + vk),

to distinguish digit k from the other digits. The ten Boolean classifiers are combined
into a multi-class classifier. The 10 × 10 confusion matrices for the training set and
the test set are given in tables 14.11 and 14.12.
The error rate on the training set is 14.5%; on the test set it is 13.9%. The true
label rates on the test set range from 73.5% for digit 5 to 97.5% for digit 1.
Many of the entries of the confusion matrix make sense. From the first row of
the matrix, we see a handwritten 0 was rarely mistakenly classified as a 1, 2, or
9; presumably these digits look different enough that they are easily distinguished.
The most common error (80) corresponds to y = 9, ŷ = 4, i.e., mistakenly identifying
a handwritten 9 as a 4. This makes sense since these two digits can look very
similar.
Feature engineering. After adding the 5000 randomly generated new features the
training set error is reduced to about 1.5%, and the test set error to 2.6%. The
confusion matrices are given in tables 14.13 and 14.14. Since we have (substantially)
reduced the error in the test set, we conclude that adding these 5000 new features
was a successful exercise in feature engineering.
Table 14.11  Confusion matrix on the training data set. Rows give the true
digit; columns give the prediction.

 Digit      0      1      2      3      4      5      6      7      8      9   Total
     0   5669      8     21     19     25     46     65      4     60      6    5923
     1      2   6543     36     17     20     30     14     14     60      6    6742
     2     99    278   4757    153    116     17    234     92    190     22    5958
     3     38    172    174   5150     31    122     59    122    135    128    6131
     4     13    104     41      5   5189     52     45     24     60    309    5842
     5    164     94     30    448    103   3974    185     44    237    142    5421
     6    104     78     77      2     64    106   5448      0     36      3    5918
     7     55    191     36     48    165      9      4   5443     13    301    6265
     8     69    492     64    225    102    220     64     21   4417    177    5851
     9     67     66     26    115    365     12      4    513     39   4742    5949
   All   6280   8026   5262   6182   6180   4588   6122   6277   5247   5836   60000
Table 14.12  Confusion matrix on the test set.

 Digit      0      1      2      3      4      5      6      7      8      9   Total
     0    944      0      1      2      2      8     13      2      7      1     980
     1      0   1107      2      2      3      1      5      1     14      0    1135
     2     18     54    815     26     16      0     38     22     39      4    1032
     3      4     18     22    884      5     16     10     22     20      9    1010
     4      0     22      6      0    883      3      9      1     12     46     982
     5     24     19      3     74     24    656     24     13     38     17     892
     6     17      9     10      0     22     17    876      0      7      0     958
     7      5     43     14      6     25      1      1    883      1     49    1028
     8     14     48     11     31     26     40     17     13    756     18     974
     9     16     10      3     17     80      0      1     75      4    803    1009
   All   1042   1330    887   1042   1086    742    994   1032    898    947   10000
Table 14.13  Confusion matrix on the training data set, after adding the 5000
new features.

 Digit      0      1      2      3      4      5      6      7      8      9   Total
     0   5888      1      2      1      3      2     10      0     14      2    5923
     1      1   6679     27      6     11      0      0     10      6      2    6742
     2     11      7   5866      6     12      0      3     22     26      5    5958
     3      1      4     31   5988      0     27      0     24     34     22    6131
     4      1     15      3      0   5748      1     13      4      5     52    5842
     5      6      2      4     26      7   5335     23      2      9      7    5421
     6      8      5      0      0      3     15   5875      0     11      1    5918
     7      3     25     23      4      8      0      1   6159      5     37    6265
     8      5     16     11     12      9     17     11      7   5749     14    5851
     9     10      5      1     29     41     16      2     35     25   5785    5949
   All   5934   6759   5968   6072   5842   5413   5938   6263   5884   5927   60000
Table 14.14  Confusion matrix on the test set, after adding the 5000 new
features.

 Digit      0      1      2      3      4      5      6      7      8      9   Total
     0    972      0      0      2      0      1      1      1      3      0     980
     1      0   1126      3      1      1      0      3      0      1      0    1135
     2      6      0    998      3      2      0      4      7     11      1    1032
     3      0      0      3    977      0     13      0      5      8      4    1010
     4      2      1      3      0    953      0      6      3      1     13     982
     5      2      0      1      5      0    875      5      0      3      1     892
     6      8      3      0      0      4      6    933      0      4      0     958
     7      0      8     12      0      2      0      1    992      3     10    1028
     8      3      1      3      6      4      3      2      2    946      4     974
     9      4      3      1     12     11      7      1      3      3    964    1009
   All    997   1142   1024   1006    977    905    956   1013    983    997   10000
Chapter 15
Multi-objective least squares

In this chapter we consider the problem of choosing a vector that achieves a compromise in making two or more norm squared objectives small. The idea is widely used in data fitting, image reconstruction, control, and other applications.
Multi-objective least squares via weighted sum. A standard method for finding
a value of x that gives a compromise in making all the objectives small is to choose
x to minimize a weighted sum objective

\[ J = \lambda_1 J_1 + \cdots + \lambda_k J_k = \lambda_1 \|A_1 x - b_1\|^2 + \cdots + \lambda_k \|A_k x - b_k\|^2, \tag{15.1} \]

where λ₁, …, λ_k are positive weights that express our relative desire for the terms
to be small. If we choose all λᵢ to be one, the weighted sum objective is the sum
of the objective terms; we give each of them equal weight. If λ₂ is twice as large
as λ₁, it means that we attach twice as much weight to the objective J₂ as to J₁.
Roughly speaking, we care twice as strongly that J₂ should be small, compared
242 15 Multi-objective least squares
to our desire that J1 should be small. We will discuss later how to choose these
weights.
Scaling all the weights in the weighted sum objective (15.1) by any positive
number is the same as scaling the weighted sum objective J by the number, which
does not change its minimizers. Since we can scale the weights by any positive
number, it is common to choose λ₁ = 1. This makes the first objective term J₁
our primary objective; we can interpret the other weights as being relative to the
primary objective.
Weighted sum least squares via stacking. We can minimize the weighted sum
objective function (15.1) by expressing it as a standard least squares problem. We
start by expressing J as the norm squared of a single vector:

\[ J = \left\| \begin{bmatrix} \sqrt{\lambda_1}\,(A_1 x - b_1) \\ \vdots \\ \sqrt{\lambda_k}\,(A_k x - b_k) \end{bmatrix} \right\|^2 = \|Ax - b\|^2, \]

where A and b are the stacked matrix and vector

\[ A = \begin{bmatrix} \sqrt{\lambda_1}\,A_1 \\ \vdots \\ \sqrt{\lambda_k}\,A_k \end{bmatrix}, \qquad b = \begin{bmatrix} \sqrt{\lambda_1}\,b_1 \\ \vdots \\ \sqrt{\lambda_k}\,b_k \end{bmatrix}. \tag{15.2} \]

Provided the columns of the stacked matrix A are linearly independent, the minimizer is

\[ \hat x = (A^T A)^{-1} A^T b = (\lambda_1 A_1^T A_1 + \cdots + \lambda_k A_k^T A_k)^{-1} (\lambda_1 A_1^T b_1 + \cdots + \lambda_k A_k^T b_k). \tag{15.3} \]

This reduces to our standard formula for the solution of a least squares problem
when k = 1 and λ₁ = 1. (In fact, when k = 1, λ₁ does not matter.) We can
compute x̂ via the QR factorization of A.
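The stacking method can be sketched numerically. In this minimal example (synthetic data; the sizes and weights are illustrative), the stacked least squares solution is checked against formula (15.3):

```python
import numpy as np

# Weighted-sum least squares via stacking, with synthetic data:
# minimize lambda_1 ||A_1 x - b_1||^2 + lambda_2 ||A_2 x - b_2||^2.
rng = np.random.default_rng(1)
A1, b1 = rng.standard_normal((10, 5)), rng.standard_normal(10)
A2, b2 = rng.standard_normal((10, 5)), rng.standard_normal(10)
lams = [1.0, 2.0]

# Stacked matrix and vector of (15.2): blocks scaled by sqrt(lambda_i).
A = np.vstack([np.sqrt(l) * Ai for l, Ai in zip(lams, [A1, A2])])
b = np.concatenate([np.sqrt(l) * bi for l, bi in zip(lams, [b1, b2])])
x_stack = np.linalg.lstsq(A, b, rcond=None)[0]

# Formula (15.3): weighted sums of Gram matrices and of A_i^T b_i.
G = lams[0] * A1.T @ A1 + lams[1] * A2.T @ A2
h = lams[0] * A1.T @ b1 + lams[1] * A2.T @ b2
x_formula = np.linalg.solve(G, h)

assert np.allclose(x_stack, x_formula)
```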
Independent columns of stacked matrix. Our assumption (12.2) that the columns
of A in (15.2) are linearly independent is not the same as assuming that each of
A1 , . . . , Ak has linearly independent columns. We can state the condition that A
has linearly independent columns as: There is no nonzero vector x that satisfies
Ai x = 0 for i = 1, . . . , k. This implies that if just one of the matrices A1 , . . . , Ak
has linearly independent columns, then A does.
15.1 Multi-objective least squares 243
The stacked matrix A can have linearly independent columns even when none
of the matrices A₁, …, A_k do. This can happen when mᵢ < n for all i, i.e., all
Aᵢ are wide. However, we must have m₁ + ··· + m_k ≥ n, since A must be tall (or
square) for the linearly independent columns assumption to hold.
Optimal trade-off curve. We start with the special case of two objectives (also
called the bi-criterion problem), and write the weighted sum objective as

J = J₁ + λJ₂ = ‖A₁x − b₁‖² + λ‖A₂x − b₂‖²,

where λ > 0 is the relative weight put on the second objective, compared to the
first. For small λ, we care much more about J₁ being small than J₂ being small;
for large λ, we care much less about J₁ being small than J₂ being small.
Let x̂(λ) denote the weighted sum least squares solution x as a function of λ,
assuming the stacked matrices have linearly independent columns. These points
are called Pareto optimal, which means there is no point z that satisfies

‖A₁z − b₁‖² ≤ ‖A₁x̂(λ) − b₁‖²,   ‖A₂z − b₂‖² ≤ ‖A₂x̂(λ) − b₂‖²,

with one of the inequalities holding strictly. Roughly speaking, there is no point z
that is at least as good as x̂(λ) in one of the objectives, and beats it on the other one. To
see why this is the case, we note that any such z would have a value of J = J₁ + λJ₂ that is
less than that achieved by x̂(λ), which minimizes J, a contradiction.
We can plot the two objectives ‖A₁x̂(λ) − b₁‖² and ‖A₂x̂(λ) − b₂‖² against each
other, as λ varies over (0, ∞), to understand the trade-off of the two objectives.
This curve is called the optimal trade-off curve of the two objectives. There is no
point z that achieves values of J₁ and J₂ lying below and to the left of the
optimal trade-off curve.
Simple example. We consider a simple example with two objectives, with A₁ and
A₂ both 10 × 5 matrices. The entries of the weighted least squares solution x̂(λ)
are plotted against λ in figure 15.1. On the left, where λ is small, x̂(λ) is very close
to the least squares approximate solution for A₁, b₁. On the right, where λ is large,
x̂(λ) is very close to the least squares approximate solution for A₂, b₂. In between
the behavior of x̂(λ) is very interesting; for instance, we can see that x̂₃(λ) first
increases with increasing λ before eventually decreasing.
Figure 15.2 shows the values of the two objectives J₁ and J₂ versus λ. As
expected, J₁ increases as λ increases, and J₂ decreases as λ increases. (It can
be shown that this always holds.) Roughly speaking, as λ increases we put more
emphasis on making J₂ small, which comes at the expense of making J₁ bigger.
The optimal trade-off curve for this bi-criterion problem is plotted in figure 15.3.
The left end-point corresponds to minimizing ‖A₁x − b₁‖², and the right end-point
corresponds to minimizing ‖A₂x − b₂‖². We can conclude, for example, that there
is no vector z that achieves ‖A₁z − b₁‖² ≤ 2.60 and ‖A₂z − b₂‖² ≤ 1.96.
The steep slope of the optimal trade-off curve near the left end-point means
that we can achieve a substantial reduction in J2 with only a small increase in J1 .
The small slope of the optimal trade-off curve near the right end-point means that
we can achieve a substantial reduction in J1 with only a small increase in J2 . This
is quite typical, and indeed, is why multi-criterion least squares is useful.
Figure 15.1 The entries x̂₁(λ), …, x̂₅(λ) of the weighted least squares solution
of the bi-criterion example, plotted against λ.
Figure 15.2 The objectives J₁(λ) and J₂(λ) of the bi-criterion example, versus λ.

Figure 15.3 Optimal trade-off curve for the bi-criterion least squares problem
of figures 15.1 and 15.2. The points corresponding to λ = 0.1, λ = 1, and
λ = 10 are marked on the curve.
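Sweeping λ over a logarithmic grid, as in these figures, can be sketched as follows (synthetic data; the specific matrices are illustrative, not the book's example):

```python
import numpy as np

# Trace the optimal trade-off curve of a synthetic bi-criterion problem:
# for each lambda, minimize ||A1 x - b1||^2 + lambda ||A2 x - b2||^2.
rng = np.random.default_rng(2)
A1, b1 = rng.standard_normal((10, 5)), rng.standard_normal(10)
A2, b2 = rng.standard_normal((10, 5)), rng.standard_normal(10)

def solve(lam):
    # Stacked least squares solve for one value of the weight lambda.
    A = np.vstack([A1, np.sqrt(lam) * A2])
    b = np.concatenate([b1, np.sqrt(lam) * b2])
    return np.linalg.lstsq(A, b, rcond=None)[0]

lams = np.logspace(-3, 3, 50)
J1 = [np.sum((A1 @ solve(l) - b1) ** 2) for l in lams]
J2 = [np.sum((A2 @ solve(l) - b2) ** 2) for l in lams]

# As the text notes, J1 is nondecreasing and J2 nonincreasing in lambda.
assert all(a <= b + 1e-6 for a, b in zip(J1, J1[1:]))
assert all(a >= b - 1e-6 for a, b in zip(J2, J2[1:]))
```

Plotting J₂ against J₁ over the grid traces out the optimal trade-off curve.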
Using multi-objective least squares. In the rest of this chapter we will see several
specific applications of multi-objective least squares. Here we give some general
remarks on how it is used in applications.
First we identify a primary objective J1 that we would like to be small. The
objective J1 is typically the one that would be used in an ordinary single-objective
least squares approach, such as the mean square error of a model on some training
data, or the mean-square deviation from some target or goal.
We also identify one or more secondary objectives J2 , J3 , . . . , Jk , that we would
also like to be small. These secondary objectives are typically generic ones, like
the desire that some parameters be small or smooth, or close to some previous
or prior value. In estimation applications these secondary objectives typically cor-
respond to some kind of prior knowledge or assumption about the vector x that
we seek. We wish to minimize our primary objective, but are willing to accept an
increase in it, if this gives a sufficient decrease in the secondary objectives.
The weights are treated like knobs that we change (turn or tune or tweak)
to achieve a value of x that we like (or can live with).
15.2 Control
In control applications, the goal is to decide on a set of actions or inputs, specified
by an n-vector x, that achieve some goals. The actions result in some outputs or
effects, given by an m-vector y. We consider here the case when the inputs and
outputs are related by an affine model
y = Ax + b.

The primary objective is the norm squared deviation of the output from a desired
or target output y^des,

J₁ = ‖y − y^des‖² = ‖Ax + b − y^des‖².

The main objective is to choose an action x so that the output is as close as possible to the
desired value.
There are many possible secondary objectives. The simplest one is the norm
squared value of the input, J2 = kxk2 , so the problem is to optimally trade off
missing the target output (measured by ky y des k2 ), and keeping the input small
(measured by kxk2 ).
Another common secondary objective has the form J₂ = ‖x − x^nom‖², where
x^nom is a nominal or standard value for the input. In this case the secondary
objective is to keep the input close to the nominal value. This objective is sometimes
used when x represents a new choice for the input, and x^nom is the current value.
In this case the goal is to get the output near its target, while not changing the
input much from its current value.
Dynamics. The system can also be dynamic, meaning that we take into account
time variation of the input and output. In the simplest case x is the time series of
a scalar input, so xi is the action taken in period i, and yi is the (scalar) output in
period i. In this setting, y^des is a desired trajectory for the output. A very common
model for dynamic systems, with x and y representing scalar input and
output time series, is a convolution: y = h ∗ x. In this case, A is Toeplitz, and b
represents a time series, which is what the output would be with x = 0.
As a typical example in this category, the input xi can represent the torque
applied to the drive wheels of a locomotive (say, over 1 second intervals), and yi is
the locomotive speed.
In addition to the usual secondary objective J₂ = ‖x‖², it is common to have
an objective that the input should be smooth, i.e., not vary too rapidly over time.
This is achieved with the objective ‖Dx‖², where D is the (n−1) × n first difference
matrix

\[ D = \begin{bmatrix}
-1 & 1 & 0 & \cdots & 0 & 0 & 0 \\
0 & -1 & 1 & \cdots & 0 & 0 & 0 \\
\vdots & \vdots & \vdots & & \vdots & \vdots & \vdots \\
0 & 0 & 0 & \cdots & -1 & 1 & 0 \\
0 & 0 & 0 & \cdots & 0 & -1 & 1
\end{bmatrix}. \tag{15.4} \]
In §17.2 we will see another way to carry out control with a dynamic system.
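The first difference matrix (15.4) and the smoothness objective ‖Dx‖² can be sketched directly (the small size n = 6 and the input values are illustrative):

```python
import numpy as np

# Build the (n-1) x n first difference matrix D of (15.4) and evaluate the
# smoothness penalty ||D x||^2 on a sample input sequence x.
n = 6
D = np.zeros((n - 1, n))
for i in range(n - 1):
    D[i, i], D[i, i + 1] = -1.0, 1.0

x = np.array([1.0, 2.0, 4.0, 4.0, 3.0, 1.0])
assert np.allclose(D @ x, np.diff(x))          # (Dx)_i = x_{i+1} - x_i
smoothness_penalty = np.sum((D @ x) ** 2)      # sum of squared consecutive differences
assert np.isclose(smoothness_penalty, 10.0)    # 1 + 4 + 0 + 1 + 4
```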
If we guess that x has the value x̂, then we are implicitly making the guess that v
has the value y − Ax̂. If we assume that smaller values of v (measured by ‖v‖) are
more plausible than larger values, then a sensible choice for x̂ is the least squares
approximate solution, which minimizes ‖Ax̂ − y‖². We will take this as our primary
objective.
15.3 Estimation and inversion 249
Suppose that the T -vector y is a (measured) time series, that we believe is a noisy
version of a periodic time series, i.e., one that repeats itself every P periods. We
might also know or assume that the periodic time series is smooth, i.e., its adjacent
values are not too far apart.
Periodicity arises in many time series. For example, we would expect a time
series of hourly temperature at some location to approximately repeat itself every
24 hours, or the monthly snowfall in some region to approximately repeat itself
every 12 months. (Periodicity with a 24-hour period is called diurnal; periodicity
with a yearly period is called seasonal or annual.) As another example, we might
expect daily total sales at a restaurant to approximately repeat weekly. The
goal is to get an estimate of Tuesday's total sales, given some historical daily sales
data.
The periodic time series will be represented by a P -vector x, which gives its
values over one period. It corresponds to the full time series
ŷ = (x, x, …, x),

which just repeats x, where we assume here for simplicity that T is a multiple of
P. (If this is not the case, the last x is replaced with a slice of the form x_{1:k}.) We
can express ŷ as ŷ = Ax, where A is the T × P selector matrix

\[ A = \begin{bmatrix} I \\ \vdots \\ I \end{bmatrix}, \]

a stack of T/P identity matrices of size P × P.
We might also want the periodic time series to be smooth, i.e., the differences
between adjacent values,

x₂ − x₁, …, x_P − x_{P−1}, x₁ − x_P,

to be small. (Note that we include the wrap-around pair x_P and x₁ here.) We measure non-smoothness
as ‖D^circ x‖², where D^circ is the P × P circular difference matrix

\[ D^{\mathrm{circ}} = \begin{bmatrix}
-1 & 1 & 0 & \cdots & 0 & 0 & 0 \\
0 & -1 & 1 & \cdots & 0 & 0 & 0 \\
\vdots & \vdots & \vdots & & \vdots & \vdots & \vdots \\
0 & 0 & 0 & \cdots & -1 & 1 & 0 \\
0 & 0 & 0 & \cdots & 0 & -1 & 1 \\
1 & 0 & 0 & \cdots & 0 & 0 & -1
\end{bmatrix}. \]
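The circular difference matrix can be built in one line with a column roll of the identity (a small sketch; P = 5 and the test vector are illustrative):

```python
import numpy as np

# The P x P circular difference matrix: rows give the differences between
# adjacent entries of x, including the wrap-around difference x_1 - x_P.
P = 5
Dcirc = -np.eye(P) + np.roll(np.eye(P), 1, axis=1)

x = np.array([3.0, 1.0, 4.0, 1.0, 5.0])
diffs = Dcirc @ x
assert np.allclose(diffs[:-1], x[1:] - x[:-1])   # x_2 - x_1, ..., x_P - x_{P-1}
assert np.isclose(diffs[-1], x[0] - x[-1])       # wrap-around difference
```

Note that the circular differences always sum to zero, since each entry of x appears once with each sign.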
For λ = 0 we recover the simple averaging mentioned above; as λ gets bigger, the
estimated signal becomes smoother, ultimately converging to a constant (which is
the mean of the original time series data).
The time series Ax is called the extracted seasonal component of the given time
series data y (assuming we are considering yearly variation). Subtracting this from
the original data yields the time series y − Ax, which is called the seasonally adjusted
time series.
The parameter λ can be chosen using validation. This can be done by selecting
a time interval over which to build the estimate, and another one to validate it.
For example, with 4 years of data, we might train our model on the first 3 years of
data, and test it on the last year of data.
Example. In figure 15.4 we apply this method to a series of hourly ozone measurements.
The top plot shows hourly measurements over a period of 14 days
(July 1–14, 2014). We represent these values by a 336-vector c, with c_{24(j−1)+i},
i = 1, …, 24, defined as the hourly values on day j, for j = 1, …, 14. As indicated
by the gaps in the graph, a number of measurements are missing from the record
(only 275 of the 336 = 24 · 14 measurements are available). We use the notation
M_j ⊆ {1, 2, …, 24} to denote the set containing the indices of the available measurements
on day j. For example, M₈ = {1, 2, 3, 4, 6, 7, 8, 23, 24}, because on July
8, the measurements at 4AM, and from 8AM to 9PM, are missing. The middle and
bottom plots show two periodic time series. The time series are parametrized
by a 24-vector x, repeated 14 times to get the full series (x, x, …, x). The two
estimates of x in the figure were computed by minimizing

\[ \sum_{j=1}^{14} \sum_{i \in M_j} \bigl( x_i - \log(c_{24(j-1)+i}) \bigr)^2 + \lambda \left( \sum_{i=1}^{23} (x_{i+1} - x_i)^2 + (x_1 - x_{24})^2 \right), \]

with λ = 1 and λ = 100.
Figure 15.4 Top. Hourly ozone level at Azusa, California, during the first
14 days of July 2014 (California Environmental Protection Agency, Air Resources
Board, [Link]). Measurements start at 12AM on July 1st,
and end at 11PM on July 14. Note the large number of missing measurements.
In particular, all 4AM measurements are missing. Middle. Smooth
periodic least squares fit to logarithmically transformed measurements, using
λ = 1. Bottom. Smooth periodic least squares fit using λ = 100.
the reconstructed image. Specifically, suppose the vector x has length MN and
contains the pixel intensities of an M × N image X stored column-wise. Let D_h
be the M(N−1) × MN matrix

\[ D_h = \begin{bmatrix}
-I & I & 0 & \cdots & 0 & 0 & 0 \\
0 & -I & I & \cdots & 0 & 0 & 0 \\
\vdots & \vdots & \vdots & & \vdots & \vdots & \vdots \\
0 & 0 & 0 & \cdots & -I & I & 0 \\
0 & 0 & 0 & \cdots & 0 & -I & I
\end{bmatrix}, \]

where all blocks have size M × M, and let D_v be the N(M−1) × MN matrix

\[ D_v = \begin{bmatrix}
D & 0 & \cdots & 0 \\
0 & D & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & D
\end{bmatrix}, \]

where D is the (M−1) × M first difference matrix of (15.4).
With these definitions the penalty term in (15.5) is the sum of squared differences
of intensities at adjacent pixels in a row or column:

\[ \|D_h x\|^2 + \|D_v x\|^2 = \sum_{i=1}^{M} \sum_{j=1}^{N-1} (X_{ij} - X_{i,j+1})^2 + \sum_{i=1}^{M-1} \sum_{j=1}^{N} (X_{ij} - X_{i+1,j})^2. \]
This quantity is the Laplacian (see page 110) of the graph that connects each
pixel to its left and right, and up and down, neighbors.
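For column-wise stacking, D_h and D_v can be assembled as Kronecker products, and the penalty identity above checked numerically (a small sketch; the image size and contents are illustrative):

```python
import numpy as np

# Build D_h and D_v for an M x N image stored column-wise, and verify that
# ||D_h x||^2 + ||D_v x||^2 equals the sum of squared differences between
# horizontally and vertically adjacent pixels.
M, N = 4, 5

def diff_matrix(n):
    # (n-1) x n first difference matrix: rows are (..., -1, 1, ...).
    return -np.eye(n - 1, n) + np.eye(n - 1, n, k=1)

Dh = np.kron(diff_matrix(N), np.eye(M))   # M(N-1) x MN: differences across columns
Dv = np.kron(np.eye(N), diff_matrix(M))   # N(M-1) x MN: differences within columns

rng = np.random.default_rng(3)
X = rng.standard_normal((M, N))
x = X.flatten(order="F")                  # column-wise stacking

penalty = np.sum((Dh @ x) ** 2) + np.sum((Dv @ x) ** 2)
direct = np.sum((X[:, :-1] - X[:, 1:]) ** 2) + np.sum((X[:-1, :] - X[1:, :]) ** 2)
assert np.isclose(penalty, direct)
```

In practice these operators are stored as sparse matrices, since for real images MN is in the hundreds of thousands or more.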
Example. In figures 15.5 and 15.6 we illustrate this method for an image of
size 512 × 512. The blurred, noisy image is shown in the left part of figure 15.5.
Figure 15.6 shows the estimates x̂, obtained by minimizing (15.5), for four different
values of the parameter λ. The best result (in this case, judged by eye) is obtained
for λ around 0.007 and is shown on the right in figure 15.5.
15.3.4 Tomography
In tomography, the vector x represents the values of some quantity (such as density)
in n voxels (or pixels) over a 3-D (or 2-D) region of interest. The entries
of the vector y are measurements obtained by passing a beam of radiation through
Figure 15.5 Left: Blurred, noisy image. Right: Result of regularized least
squares deblurring with λ = 0.007.
the region of interest, and measuring the intensity of the beam after it exits the
region.
A familiar application is the computer-aided tomography (CAT) scan used in
medicine. In this application, beams of X-rays are sent through a patient, and
an array of detectors measure the intensity of the beams after passing through
the patient. These intensity measurements are related to the integral of the X-
ray absorption along the beam. Tomography is also used in applications such as
manufacturing, to assess internal damage or certify quality of a welded joint.
Line integral measurements. For simplicity we will assume that each beam is a
single line, and that the received value yi is the integral of the quantity over the
region, plus some measurement noise. (The same method can be used when more
complex beam shapes are used.) We consider the 2-D case.
Let d(x, y) denote the density (say) at the position (x, y) in the region. (Here
x and y are the scalar 2-D coordinates, not the vectors x and y in the estimation
problem.) We assume that d(x, y) = 0 outside the region of interest. A line through
the region is defined by the set of points

p(t) = (x₀ + t cos θ, y₀ + t sin θ),

where (x₀, y₀) denotes a (base) point on the line, and θ is the angle of the line with
respect to the x-axis. The parameter t gives the distance along the line from the
point (x₀, y₀). The line integral of d is given by

∫ d(p(t)) dt.
We assume that m lines are specified (i.e., by their base points and angles), and
the measurement yi is the line integral of d, plus some noise, which is presumed
small.
We divide the region of interest into n pixels (or voxels in the 3-D case), and
assume that the density has the constant value xi over pixel i. Figure 15.7 illus-
trates this for a simple example with n = 25 pixels. (In applications the number
of pixels or voxels is in the thousands or millions.) The line integral is then given
by the sum of xi (the density in pixel i) times the length of the intersection of the
line with pixel i. In figure 15.7, with the pixels numbered row-wise starting at the
top left corner, with width and height one, the line integral for the line shown is
a linear combination of the values xᵢ of the pixels that the line crosses.
The coefficient of xᵢ is the length of the intersection of the line with pixel i.
Figure 15.7 A square region of interest divided into 25 pixels, and a line
passing through it.
Example. A simple 2-D example is shown in figures 15.8–15.10. Figure 15.8 shows
the geometry of the m = 4000 lines and the square region of interest.
The square is divided into 100 × 100 pixels, so n = 10000.
The density of the object we are imaging is shown in figure 15.9. In
this object the density of each pixel is either 0 or 1 (shown as white or black,
respectively). We reconstruct or estimate the object density from the 4000
(noisy) line integral measurements by solving the regularized least squares problem
of minimizing

‖Ax − y‖² + λ‖Dx‖²,

where ‖Dx‖² is the sum of squares of the differences of the pixel values from their
neighbors. Figure 15.10 shows the results for six different values of λ. We can see
that for small λ the reconstruction is relatively sharp, but suffers from noise. For
large λ the noise in the reconstruction is smaller, but it is too smooth.
Figure 15.8 The square region at the center of the picture is surrounded
by 100 points shown as circles. 40 lines (beams) emanate from each point.
(The lines are shown for two points only.) This gives a total of 4000 lines
that intersect the region.
Figure 15.10 Regularized least squares reconstruction for six values of the
regularization parameter: λ = 10⁻², 10⁻¹, 1, 5, 10, and 100.
15.4 Regularized data fitting 259
In data fitting, the regularized objective has the form

‖Aθ − y‖² + λ‖θ_{2:p}‖²,

which penalizes all the parameters except θ₁, the coefficient of the constant feature.
For the regression model this is

‖Xᵀβ + v1 − y‖² + λ‖β‖².

Here we penalize β being large (because this leads to sensitivity of the model), but
not the offset v. Choosing β and v to minimize this weighted objective is called ridge
regression.
Regularization path. We get a different model for every choice of λ. The way
the parameters change with λ is called the regularization path. When p is small
enough (say, less than 15 or so) the parameter values can be plotted, with λ on the
horizontal axis. Usually only 30 or 50 values of λ are considered, typically spaced
logarithmically over a large range.
An appropriate value of λ can be chosen via out-of-sample or cross validation.
As λ increases, the RMS fit on the training data worsens (increases). But (as with
model order) the test set RMS prediction error typically decreases as λ increases,
and then, when λ gets too big, it increases. A good choice of regularization parameter
is one which approximately minimizes the test set RMS prediction error. When
a range of values of λ approximately minimize the RMS error, common practice
is to take the largest value of λ. The idea here is to use the model of minimum
sensitivity, as measured by ‖θ_{2:p}‖², among those that make good predictions on the
test set.
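The λ sweep described above can be sketched with synthetic data (the feature dimension, sample size, and true coefficients below are illustrative):

```python
import numpy as np

# Ridge regression over a logarithmically spaced grid of lambda values:
# minimize ||X^T beta + v 1 - y||^2 + lambda ||beta||^2, offset v unpenalized.
rng = np.random.default_rng(4)
p, n = 4, 30
XT = rng.standard_normal((n, p))          # rows are the feature vectors x_i^T
y = XT @ np.array([1.0, -0.5, 0.0, 2.0]) + 0.1 * rng.standard_normal(n)

def ridge(lam):
    # Stack [1 X^T] on top of [0 sqrt(lam) I]; only beta is penalized.
    A = np.hstack([np.ones((n, 1)), XT])
    Areg = np.hstack([np.zeros((p, 1)), np.sqrt(lam) * np.eye(p)])
    theta = np.linalg.lstsq(np.vstack([A, Areg]),
                            np.concatenate([y, np.zeros(p)]), rcond=None)[0]
    return theta[0], theta[1:]            # offset v, coefficients beta

lams = np.logspace(-5, 5, 30)
norms = [np.sum(ridge(l)[1] ** 2) for l in lams]
# The penalized norm ||beta||^2 shrinks as lambda grows along the path.
assert all(a >= b - 1e-9 for a, b in zip(norms, norms[1:]))
```

Evaluating the RMS error of each model on a held-out test set over the same grid gives the validation curve used to pick λ.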
Example. We illustrate these ideas with a small example with synthetic (simulated)
data. We start with a signal, shown in figure 15.11, consisting of a constant
plus four sinusoids:

\[ s(t) = c + \sum_{k=1}^{4} a_k \cos(\omega_k t + \phi_k), \]

with coefficients given in (15.7).
(Note that the model is exact when the parameters are chosen as θ₁ = c, θ_k = a_{k−1},
k = 2, …, 5. This rarely occurs in practice.) We fit our model using regularized
least squares on 10 noisy samples of the function, shown as the blue circles in
figure 15.11. We will test the model obtained on the 20 noisy data points shown
as the red circles in figure 15.11.
Figure 15.12 shows the regularization path and the RMS training and test errors
as functions of the regularization parameter λ, as it varies over a large range. The
regularization path shows that as λ increases, the parameters θ₂, …, θ₅ get smaller
(i.e., shrink), converging towards zero as λ gets very large. We can see that the
training prediction error increases with increasing λ (since we are trading off model
sensitivity for sum square fitting error). The test error follows a typical pattern:
It first decreases to a minimum value, then increases again. The minimum test
error occurs around λ = 0.079; any choice between around λ = 0.065 and λ = 0.100
(say) would be reasonable. The horizontal dashed lines show the true values of
the coefficients (i.e., the ones we used to synthesize the data) given in (15.7). We
can see that for λ near 0.079, our estimated parameters are very close to the true
values.
Figure 15.11 A signal s(t) and 30 noisy samples. Ten of the samples are
used as training set, the other 20 as test set.
Here B is the (p−1) × p matrix with rows e₂ᵀ, …, e_pᵀ, so that Bθ = θ_{2:p}.
Suppose the vector x satisfies Ax = 0 and √λ Bx = 0. From the last p − 1 of
these equations we get xᵢ = 0 for i = 2, …, p, which implies that x₂ = ··· = x_p = 0.
Using these values of x₂, …, x_p, and the fact that the first column of A is 1, the
top m equations become x₁1 = 0, and we conclude that x₁ = 0 as well. So the
columns of the stacked matrix are always linearly independent.
Figure 15.12 Top. RMS training and test errors as a function of the regularization
parameter λ. Bottom. The regularization path. The dashed
horizontal lines show the values of the coefficients used to generate the data.
15.5 Complexity 263
Regularization also allows us to fit a model when there are more parameters than
data points, in which case the matrix A is wide. Regularization is often
the key to success in feature engineering, which can greatly increase the number of
features.
15.5 Complexity
In the general case we can minimize the weighted sum objective (15.1) by creating
the stacked matrix and vector A and b in (15.2), and then using the QR factorization
to solve the resulting least squares problem. The computational cost of this method
is order mn2 flops, where m = m1 + + mk is the sum of heights of the matrices
A1 , . . . , Ak .
When using multi-objective least squares, it is common to minimize the weighted
sum objective for some, or even many, different choices of weights. Assuming that
the weighted sum objective is minimized for L different values of the weights, the
total computational cost is order Lmn2 flops.
The matrix appearing in the inverse is a weighted sum of the Gram matrices Gi =
ATi Ai associated with the matrices Ai . We can compute x by forming these Gram
matrices Gi , along with the vectors hi = ATi bi , then forming the weighted sums
G = λ₁G₁ + ··· + λ_kG_k,   h = λ₁h₁ + ··· + λ_kh_k,

and then solving the n × n system of equations Gx̂ = h. The Gram matrices Gᵢ and
vectors hᵢ need only be computed once, at a cost of order mn² flops; after that,
each choice of weights requires only order (k + n)n² flops to form G and h and solve
for x̂, so the total cost for L choices of weights is order mn² + L(k + n)n²
flops. When m is much larger than k + n, which is a common occurrence, this cost
is smaller than Lmn², the cost for the simple method.
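The Gram-matrix caching idea can be sketched as follows (synthetic data; the sizes are illustrative):

```python
import numpy as np

# Precompute G_i = A_i^T A_i and h_i = A_i^T b_i once, then solve for many
# choices of weights at order n^3 flops each instead of re-stacking.
rng = np.random.default_rng(5)
k, n = 3, 5
As = [rng.standard_normal((20, n)) for _ in range(k)]
bs = [rng.standard_normal(20) for _ in range(k)]
Gs = [A.T @ A for A in As]                  # cached Gram matrices
hs = [A.T @ b for A, b in zip(As, bs)]

def solve(lams):
    G = sum(l * Gi for l, Gi in zip(lams, Gs))
    h = sum(l * hi for l, hi in zip(lams, hs))
    return np.linalg.solve(G, h)

# Agrees with the stacked least squares solution for any positive weights.
lams = [1.0, 0.5, 2.0]
A = np.vstack([np.sqrt(l) * Ai for l, Ai in zip(lams, As)])
b = np.concatenate([np.sqrt(l) * bi for l, bi in zip(lams, bs)])
assert np.allclose(solve(lams), np.linalg.lstsq(A, b, rcond=None)[0])
```

(Solving via the Gram matrix squares the condition number, so in practice the QR-based stacked solve is preferred when accuracy is a concern.)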
As a simple example consider Tikhonov regularization. We will compute the minimizer of

J = ‖Ax − b‖² + λ‖x − x^des‖²,

where the m × n matrix A is wide, i.e., m < n, and λ > 0. (Here we drop the
subscript on A, b, and m since we have only one matrix in this problem.) The
associated (m + n) × n stacked matrix (see (15.2))

\[ \tilde A = \begin{bmatrix} A \\ \sqrt{\lambda}\, I \end{bmatrix} \]

always has linearly independent columns. Using the QR factorization to solve the
stacked least squares problem requires order (m + n)n² flops, which grows like n³.
We will show now how this special problem can be solved far more efficiently when
m is much smaller than n, using something called the kernel trick. Recall that the
minimizer of J is given by (see (15.3))

x̂ = (AᵀA + λI)⁻¹(Aᵀb + λx^des).

We will use the identity

(AᵀA + λI)⁻¹Aᵀ = Aᵀ(AAᵀ + λI)⁻¹,    (15.8)

which holds for any matrix A and any λ > 0. Note that the left-hand side of
the identity involves the inverse of an n × n matrix, whereas the right-hand side
involves the inverse of a (smaller) m × m matrix.

To show the identity (15.8), we first observe that the matrices AᵀA + λI and
AAᵀ + λI are invertible. We start with the equation

Aᵀ(AAᵀ + λI) = (AᵀA + λI)Aᵀ,

and multiply each side by (AᵀA + λI)⁻¹ on the left and (AAᵀ + λI)⁻¹ on the right,
which yields the identity above.
Using (15.8) we can express the minimizer of J as

x̂ = x^des + Aᵀ(AAᵀ + λI)⁻¹(b − Ax^des).

We can compute the term (AAᵀ + λI)⁻¹(b − Ax^des) by computing the QR factorization
of the (m + n) × m matrix

\[ \tilde A = \begin{bmatrix} A^T \\ \sqrt{\lambda}\, I \end{bmatrix}, \]

which has a cost of order (m + n)m² flops. The other operations involve matrix-vector
products and have order (at most) mn flops, so we can use this method to compute
x̂ in order (m + n)m² flops. This complexity grows only linearly in n.
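The identity and the resulting expression for x̂ are easy to check numerically (synthetic data; sizes are illustrative):

```python
import numpy as np

# Numerical check of the kernel trick: the n x n solve and the m x m solve
# give the same Tikhonov-regularized minimizer when A is wide.
rng = np.random.default_rng(6)
m, n, lam = 5, 50, 0.1
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)
xdes = rng.standard_normal(n)

# Direct n x n solve: xhat = (A^T A + lam I)^{-1} (A^T b + lam xdes).
x_direct = np.linalg.solve(A.T @ A + lam * np.eye(n), A.T @ b + lam * xdes)

# Kernel trick, m x m solve: xhat = xdes + A^T (A A^T + lam I)^{-1} (b - A xdes).
x_kernel = xdes + A.T @ np.linalg.solve(A @ A.T + lam * np.eye(m), b - A @ xdes)

assert np.allclose(x_direct, x_kernel)
```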
Chapter 16
Constrained least squares
In this chapter we discuss a useful extension of the least squares problem that
includes linear equality constraints. Like least squares, the constrained least squares
problem can be reduced to a set of linear equations, which can be solved using the
QR factorization.
Figure 16.1 Least squares fit of two cubic polynomials to 140 points, with
continuity constraints p(a) = q(a) and p′(a) = q′(a).
x that minimizes ‖Ax − b‖²). Indeed each of these problems can be considered a
special case of the constrained least squares problem (16.1).
The constrained least squares problem can also be thought of as a limit of a bi-objective
least squares problem, with primary objective ‖Ax − b‖² and secondary
objective ‖Cx − d‖². Roughly speaking, we put infinite weight on the second
objective, so that any nonzero value is unacceptable (which forces x to satisfy
Cx = d). So we would expect (and it can be verified) that minimizing the weighted
objective

‖Ax − b‖² + μ‖Cx − d‖²,

for a very large value of μ yields a vector close to a solution of the constrained least
squares problem (16.1). We will encounter this idea again in chapter 19, when we
consider the nonlinear constrained least squares problem.
with a given, and p(x) and q(x) polynomials of degree three or less,

p(x) = θ₁ + θ₂x + θ₃x² + θ₄x³,   q(x) = θ₅ + θ₆x + θ₇x² + θ₈x³.

We also impose the condition that p(a) = q(a) and p′(a) = q′(a), so that f̂(x) is
continuous and has a continuous first derivative at x = a. Suppose the N data
points (x₁, y₁), …, (x_N, y_N) are numbered so that x₁, …, x_M ≤ a and x_{M+1}, …, x_N > a.
16.1 Constrained least squares problem 267
The conditions p(a) − q(a) = 0 and p′(a) − q′(a) = 0 are two linear equations
in the coefficients,

θ₁ + θ₂a + θ₃a² + θ₄a³ − θ₅ − θ₆a − θ₇a² − θ₈a³ = 0,
θ₂ + 2θ₃a + 3θ₄a² − θ₆ − 2θ₇a − 3θ₈a² = 0.

We can determine the coefficients θ = (θ₁, …, θ₈) that minimize the sum of squares
of the prediction errors, subject to the continuity constraints, by solving the constrained
least squares problem

minimize   ‖Aθ − b‖²
subject to Cθ = d.
The matrices and vectors A, b, C, d are defined as

\[ A = \begin{bmatrix}
1 & x_1 & x_1^2 & x_1^3 & 0 & 0 & 0 & 0 \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\
1 & x_M & x_M^2 & x_M^3 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & x_{M+1} & x_{M+1}^2 & x_{M+1}^3 \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\
0 & 0 & 0 & 0 & 1 & x_N & x_N^2 & x_N^3
\end{bmatrix}, \qquad
b = \begin{bmatrix} y_1 \\ \vdots \\ y_M \\ y_{M+1} \\ \vdots \\ y_N \end{bmatrix}, \]

and

\[ C = \begin{bmatrix}
1 & a & a^2 & a^3 & -1 & -a & -a^2 & -a^3 \\
0 & 1 & 2a & 3a^2 & 0 & -1 & -2a & -3a^2
\end{bmatrix}, \qquad
d = \begin{bmatrix} 0 \\ 0 \end{bmatrix}. \]
This method is easily extended to piecewise-polynomial functions with more than
two intervals. Functions of this kind are called splines.
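The continuity constraint matrix above can be sketched directly; in this small check (the knot a and the test coefficients are illustrative), any θ whose two halves are equal polynomials satisfies Cθ = d:

```python
import numpy as np

# The continuity constraint matrix C and vector d for two cubic pieces
# joined at x = a: p(a) = q(a) and p'(a) = q'(a).
a = 1.5
C = np.array([
    [1, a, a**2, a**3, -1, -a, -a**2, -a**3],      # p(a) - q(a) = 0
    [0, 1, 2*a, 3*a**2, 0, -1, -2*a, -3*a**2],     # p'(a) - q'(a) = 0
])
d = np.zeros(2)

# If p and q have identical coefficients they coincide, so the constraints hold.
theta = np.concatenate([np.array([2.0, -1.0, 0.5, 0.3])] * 2)
assert np.allclose(C @ theta, d)
```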
Figure 16.2 Advertising with budget constraint. The optimal views vector
is the solution of the constrained least squares problem with budget con-
straint. The scaled views vector is obtained by scaling the unconstrained
least squares solution so that it satisfies the budget constraint. Hence this
is a scalar multiple of the views vector of figure 12.3.
An important special case of the constrained least squares problem (16.1) is when
A = I and b = 0:

minimize   ‖x‖²
subject to Cx = d.    (16.2)
In this problem we seek the vector of smallest or least norm that satisfies the linear
equations Cx = d. For this reason the problem (16.2) is called the least norm
problem or minimum-norm problem.
Figure 16.3 Left: The bang-bang force sequence f^bb. Right: The resulting
position of the mass p(t).
Example. The 10-vector f represents a series of forces applied, each for one second,
to a unit mass on a surface with no friction. The mass starts with zero velocity
and position. By Newton's laws, its final velocity and position are given by

v^fin = f₁ + f₂ + ··· + f₁₀,
p^fin = (19/2)f₁ + (17/2)f₂ + ··· + (1/2)f₁₀.

Now suppose we want to choose a force sequence that results in v^fin = 0, p^fin = 1,
i.e., a force sequence that moves the mass to a resting position one meter to the
right. There are many such force sequences; for example f^bb = (1, −1, 0, …, 0).
This force sequence accelerates the mass to velocity 1 after one second, then
decelerates it over the next second, so it arrives after two seconds with velocity 0,
at the destination position 1. After that it applies zero force, so the mass stays
where it is, at rest at position 1. The superscript 'bb' refers to bang-bang, which
means that the force applies a large force to get the mass moving (the first bang)
and another large force (the second bang) to slow it to zero velocity. The force and
position versus time for this choice of f are shown in figure 16.3.
Now we ask, what is the smallest force sequence that can achieve v fin = 0,
fin
p = 1, where smallest is measured by the sum of squares of the applied forces,
kf k2 = f12 + + f10
2
? This problem can be posed as a least norm problem,
2
minimize kf
k
1 1 1 1 0
subject to f= ,
19/2 17/2 3/2 1/2 1
with variable f . The solution f ln , and the resulting position, are shown in fig-
ure 16.4. The norm square of the least norm solution f ln is 0.0121; in contrast,
the norm square of the bang-bang force sequence is 2, a factor of 165 times larger.
(Note the very different vertical axis scales in figures 16.4 and 16.3.)
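This example is easy to check numerically. The sketch below (NumPy; the variable names are ours, not the book's) forms the constraint data, computes the least norm force via the pseudo-inverse formula, and compares its squared norm with that of the bang-bang sequence.

```python
import numpy as np

# Rows of C map a force sequence f to (final velocity, final position).
C = np.vstack([np.ones(10),                    # v_fin = f_1 + ... + f_10
               (19 - 2 * np.arange(10)) / 2])  # p_fin = (19/2) f_1 + ... + (1/2) f_10
d = np.array([0.0, 1.0])                       # target: v_fin = 0, p_fin = 1

# Least norm solution f = C^T (C C^T)^{-1} d.
f_ln = C.T @ np.linalg.solve(C @ C.T, d)

# Bang-bang sequence from the text.
f_bb = np.zeros(10)
f_bb[:2] = [1.0, -1.0]

print(np.sum(f_ln ** 2))  # approx 0.0121
print(np.sum(f_bb ** 2))  # 2.0
```

Both sequences satisfy the constraints exactly; the least norm sequence is smaller by a factor of about 165 in the sum-of-squares measure, matching the numbers quoted above.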
Figure 16.4 Left: The smallest force sequence f ln that transfers the mass
over a unit distance in 10 steps. Right: The resulting position of the mass
p(t).
16.2 Solution
Optimality conditions via Lagrange multipliers. We will use the method of Lagrange multipliers (see §C.3) to solve the constrained least squares problem (16.1). Later we give an independent verification, which does not rely on calculus or Lagrange multipliers, that the solution we derive is correct.
We first write the constrained least squares problem with the constraints given as a list of p scalar equality constraints:

    c_i^T x = d_i,    i = 1, . . . , p,

where c_i^T are the rows of C. The Lagrangian function for the problem is

    L(x, z) = ‖Ax − b‖^2 + z_1 (c_1^T x − d_1) + ··· + z_p (c_p^T x − d_p),

where z is the p-vector of Lagrange multipliers. Expanding the objective ‖Ax − b‖^2 as a sum of terms involving the entries of x (as was done on page 181) and taking the partial derivative of L with respect to x_i we obtain

    ∂L/∂x_i (x̂, ẑ) = 2 Σ_{j=1}^{n} (A^T A)_{ij} x̂_j − 2(A^T b)_i + Σ_{j=1}^{p} ẑ_j (c_j)_i = 0.
Since the matrix on the left has linearly independent columns (by assumption), we conclude that x = 0. The first block equation above then becomes C^T z = 0. But by our assumption that the columns of C^T are linearly independent, we have z = 0. So (x, z) = 0, which is a contradiction.
The converse is also true. First suppose that the rows of C are dependent. Then there is a nonzero vector z with C^T z = 0. Then

    [ 2A^T A   C^T ] [ 0 ]
    [   C       0  ] [ z ]  =  0,

which shows the KKT matrix is not invertible. Now suppose that the stacked matrix in (16.5) has dependent columns, which means there is a nonzero vector x for which

    [ A ]
    [ C ] x = 0.

Direct calculation shows that

    [ 2A^T A   C^T ] [ x ]
    [   C       0  ] [ 0 ]  =  0,

which shows that the KKT matrix is not invertible.
When the conditions (16.5) hold, the constrained least squares problem (16.1) has the (unique) solution x̂, given by

    [ x̂ ]     [ 2A^T A   C^T ]^−1 [ 2A^T b ]
    [ ẑ ]  =  [   C       0  ]     [   d    ].        (16.6)

(This formula also gives us ẑ, the set of Lagrange multipliers.) From (16.6), we observe that the solution x̂ is a linear function of (b, d).
from which we conclude that ‖Ax̂ − b‖^2 ≤ ‖Ax − b‖^2. So x̂ minimizes ‖Ax − b‖^2 subject to Cx = d.
It remains to show that for x ≠ x̂, we have the strict inequality ‖Ax − b‖^2 > ‖Ax̂ − b‖^2, which by the equation above is equivalent to ‖A(x − x̂)‖^2 > 0. If this is not the case, then A(x − x̂) = 0. We also have C(x − x̂) = 0, and so

    [ A ]
    [ C ] (x − x̂) = 0.

By our assumption that the matrix on the left has linearly independent columns, we conclude that x = x̂.
The second step cannot fail, provided the assumption (16.5) holds. Let us analyze the complexity of this algorithm. The first step is multiplying an n × m matrix by an m × n matrix, which requires 2mn^2 flops. (In fact we can get away with half this number, since the Gram matrix is symmetric, and we only have to compute the entries on and above the diagonal.) The second step requires the solution of a square system of n + p equations, which costs 2(n + p)^3 flops, so the total is

    2mn^2 + 2(n + p)^3

flops. This grows linearly in m and cubically in n and p. The assumption (16.5) implies p ≤ n, so in terms of order, (n + p)^3 can be replaced with n^3.
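As a sketch (not the book's reference code; the function and variable names are ours), the KKT approach can be implemented by forming and solving the system (16.6) directly:

```python
import numpy as np

def cls_solve(A, b, C, d):
    """Solve: minimize ||Ax - b||^2 subject to Cx = d, via the KKT system (16.6)."""
    n, p = A.shape[1], C.shape[0]
    # Assemble the (n + p) x (n + p) KKT matrix and the right-hand side.
    K = np.block([[2 * A.T @ A, C.T],
                  [C, np.zeros((p, p))]])
    rhs = np.concatenate([2 * A.T @ b, d])
    xz = np.linalg.solve(K, rhs)
    return xz[:n], xz[n:]  # solution x and Lagrange multipliers z

# Small random instance; the conditions (16.5) hold with probability one.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 10))
b = rng.standard_normal(20)
C = rng.standard_normal((4, 10))
d = rng.standard_normal(4)
x, z = cls_solve(A, b, C, d)
print(np.linalg.norm(C @ x - d))  # essentially zero: the constraints hold
```

The returned pair (x, z) satisfies both blocks of the KKT equations, which can be checked by evaluating the residuals directly.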
to simplify (16.7). This factorization exists because the stacked matrix has linearly independent columns, by our assumption (16.5). In (16.8) we also partition Q in two blocks Q_1 and Q_2, of size m × n and p × n, respectively. If we make the substitutions A = Q_1 R, C = Q_2 R, and A^T A + C^T C = R^T R in (16.7) we obtain
We multiply the first equation on the left by R^{−T} (which we know exists) to get
We now use the second part of the assumption (16.5) to show that the matrix Q_2^T = R^{−T} C^T has linearly independent columns. Suppose Q_2^T z = R^{−T} C^T z = 0. Multiplying on the left by R^T gives C^T z = 0. Since C has linearly independent rows, this implies z = 0, and we conclude that the columns of Q_2^T are linearly independent.
The matrix Q_2^T therefore has a QR factorization Q_2^T = Q̃R̃. Substituting this into (16.10) gives

    R̃^T R̃ w = 2 R̃^T Q̃^T Q_1^T b − 2d,

which we can write as

    R̃ w = 2 Q̃^T Q_1^T b − 2 R̃^{−T} d.

We can use this to compute w, first by computing R̃^{−T} d (by forward substitution), then forming the right-hand side, and then solving for w using back substitution. Once we know w, we can find x̂ from (16.9). The method is summarized in the following algorithm; its last step solves

    R x̂ = Q_1^T b − (1/2) Q_2^T w

by back substitution.
    p ≤ n ≤ m + p,

and therefore (m + p)n^2 ≥ np^2. So the flop count above is no more than 3(m + p)n^2 flops. In particular, its order is (m + p)n^2.
Sparse constrained least squares. Constrained least squares problems with sparse
matrices A and C arise in many applications; we will see several examples in the
next chapter. Just as for solving linear equations, or (unconstrained) least squares
problems, there are methods that exploit the sparsity in A and C to solve con-
strained least squares problems more efficiently than the generic algorithms 16.1
or 16.2. The simplest such methods follow these basic algorithms, replacing the
QR factorizations with sparse QR factorizations (see page 156).
One potential problem with forming the KKT matrix as in algorithm 16.1 is that the Gram matrix A^T A can be far less sparse than the matrix A. This problem can be avoided using a trick analogous to the one used on page 184 to solve sparse (unconstrained) least squares problems. We form the square set of m + n + p linear equations

    [ 0      A^T    C^T ] [ x ]   [ 0 ]
    [ A   −(1/2)I    0  ] [ y ] = [ b ]
    [ C      0       0  ] [ z ]   [ d ].
If (x̂, ŷ, ẑ) satisfies these equations, it is easy to see that (x̂, ẑ) satisfies the KKT equations (16.4); conversely, if (x̂, ẑ) satisfies the KKT equations (16.4), then (x̂, ŷ, ẑ) satisfies the equations above, with ŷ = 2(Ax̂ − b). Provided A and C are sparse,
the coefficient matrix above is sparse, and any method for solving a sparse system
of linear equations can be used to solve it.
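A minimal sketch of this trick using SciPy's sparse routines (the construction of A and C below is ours, chosen so that the conditions (16.5) hold by design):

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

m, n, p = 200, 100, 10
rng = np.random.default_rng(0)
# Sparse A with an identity block on top, so its columns are independent.
A = sp.vstack([sp.eye(n), sp.random(m - n, n, density=0.05, random_state=1)]).tocsc()
# Sparse C with an identity block on the left, so its rows are independent.
C = sp.hstack([sp.eye(p), sp.random(p, n - p, density=0.2, random_state=2)]).tocsc()
b = rng.standard_normal(m)
d = rng.standard_normal(p)

# The (m + n + p)-square system above; the Gram matrix A^T A is never formed.
M = sp.bmat([[None, A.T, C.T],
             [A, -0.5 * sp.eye(m), None],
             [C, None, None]], format="csc")
sol = spla.spsolve(M, np.concatenate([np.zeros(n), b, d]))
x, y, z = sol[:n], sol[n:n + m], sol[n + m:]
```

Here y recovers 2(Ax̂ − b), and (x, z) satisfies the KKT equations (16.4), which can be verified against the dense KKT solve.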
Solution of least norm problem. Here we specialize the solution of the general
constrained least squares problem (16.1) given above to the special case of the least
norm problem (16.2).
We start with the conditions (16.5). The stacked matrix is in this case

    [ I ]
    [ C ],

which always has linearly independent columns. So the conditions (16.5) reduce to: C has linearly independent rows. We make this assumption now.
For the least norm problem, the KKT equations (16.4) reduce to

    [ 2I   C^T ] [ x̂ ]   [ 0 ]
    [ C     0  ] [ ẑ ]  = [ d ].
We can solve this using the methods for general constrained least squares, or derive the solution directly, which we do now. The first block row of this equation is 2x̂ + C^T ẑ = 0, so

    x̂ = −(1/2) C^T ẑ.

We substitute this into the second block equation, C x̂ = d, to obtain

    −(1/2) C C^T ẑ = d.

Since the rows of C are linearly independent, CC^T is invertible, so we have

    ẑ = −2 (CC^T)^{−1} d.

Substituting this expression for ẑ into the formula for x̂ above gives

    x̂ = C^T (CC^T)^{−1} d.        (16.11)
We have seen the matrix in this formula before: it is the pseudo-inverse of a wide matrix with linearly independent rows. So we can express the solution of the least norm problem (16.2) in the very compact form

    x̂ = C† d.

In §11.5, we saw that C† is a right inverse of C; here we see that not only does x̂ = C† d satisfy Cx = d, but it gives the vector of least norm that satisfies Cx = d.
In §11.5, we also saw that the pseudo-inverse of C can be expressed as C† = Q R^{−T}, where C^T = QR is the QR factorization of C^T. The solution of the least norm problem can therefore be expressed as

    x̂ = Q R^{−T} d,

and this leads to an algorithm for solving the least norm problem via the QR factorization.
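The QR-based computation takes only a few lines. The sketch below (names are ours) assumes, as stated above, that C has linearly independent rows:

```python
import numpy as np

def least_norm(C, d):
    """Least norm solution x = C_dagger d = Q R^{-T} d, where C^T = QR."""
    Q, R = np.linalg.qr(C.T)             # reduced QR factorization of C^T
    return Q @ np.linalg.solve(R.T, d)   # R^T is triangular: forward substitution

C = np.array([[1.0, 1.0, 1.0],
              [0.0, 1.0, 2.0]])
d = np.array([3.0, 3.0])
x = least_norm(C, d)
print(x)
```

For this small instance the least norm solution works out to (1, 1, 1), and the result agrees with the pseudo-inverse formula (16.11).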
short position plus our initial amount to be invested into asset 3. We do not invest
in asset 2 at all.
The leverage L of the portfolio is given by

    L = |w_1| + ··· + |w_n|,

the sum of the absolute values of the weights. If all entries of w are nonnegative (which is called a long-only portfolio), we have L = 1; if some entries are negative, then L > 1. If a portfolio has a leverage of 5, it means that for every $1 of portfolio value, we have $3 of total long holdings and $2 of total short holdings. (Other definitions of leverage are used, for example, (L − 1)/2.)
Multi-period investing with allocation weights. The investments are held for T periods of, say, one day each. (The periods could just as well be hours, weeks, or months.) We describe the investment returns by the T × n matrix R, where R_{tj} is the fractional return of asset j in period t. Thus R_{61} = 0.02 means that asset 1 gained 2% in period 6, and R_{82} = −0.03 means that asset 2 lost 3% over period 8. The jth column of R is the return time series for asset j; the tth row of R gives the returns of all assets in period t. It is often assumed that one of the assets is cash, which has a constant (positive) return μ^rf, where the superscript stands for risk-free. If the risk-free asset is asset n, then the last column of R is μ^rf 1.
Suppose we invest a total (positive) amount V_t at the beginning of period t, so we invest V_t w_j in asset j. At the end of period t, the dollar value of asset j is V_t w_j (1 + R_{tj}), and the dollar value of the whole portfolio is

    V_{t+1} = Σ_{j=1}^{n} V_t w_j (1 + R_{tj}) = V_t (1 + r_t^T w),

where r_t^T is the tth row of R. We assume V_{t+1} is positive; if the total portfolio value becomes negative we say that the portfolio has gone bust and stop trading.
The total (fractional) return of the portfolio over period t, i.e., its fractional increase in value, is

    (V_{t+1} − V_t) / V_t = (V_t (1 + r_t^T w) − V_t) / V_t = r_t^T w.

Note that we invest the total portfolio value in each period according to the weights w. This entails buying and selling assets so that the dollar value fractions are once again given by w. This is called re-balancing the portfolio.
The portfolio return in each of the T periods can be expressed compactly using
matrix-vector notation as
r = Rw,
where r is the T -vector of portfolio returns in the T periods, i.e., the time series
of portfolio returns. (Note that r is a T -vector, which represents the time series
of total portfolio return, whereas rt is an n-vector, which gives the returns of the
n assets in period t.) If asset n is risk-free, and we choose the allocation w = e_n, then r = Re_n = μ^rf 1, i.e., we obtain a constant return in each period of μ^rf.
where V_1 is the total amount initially invested in period t = 1. This total value time series is often plotted using V_1 = $10000 as the initial investment, by convention. The product in (17.1) arises from re-investing our total portfolio value (including any past gains or losses) in each period. In the simple case when the last asset is risk-free and we choose w = e_n, the total value grows as V_t = V_1 (1 + μ^rf)^{t−1}. This is called compounded interest at rate μ^rf.
When the returns r_t are small (say, a few percent), and T is not too big (say, a few hundred), we can approximate the product above using the sum or average of the returns. To do this we expand the product in (17.1) into a sum of terms, each of which involves a product of some of the returns. One term involves none of the returns, and is V_1. There are t − 1 terms that involve just one return, which have the form V_1 r_s, for s = 1, . . . , t − 1. All other terms in the expanded product involve the product of at least two returns, and so can be neglected since we assume that the returns are small. This leads to the approximation

    V_t ≈ V_1 + V_1 (r_1 + ··· + r_{t−1}),

and in particular

    V_{T+1} ≈ V_1 + T avg(r) V_1.
This approximation suggests that to maximize our total final portfolio value, we
should seek high return, i.e., a large value for avg(r).
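A quick numerical illustration of the compounded value (17.1) and its linear approximation, on hypothetical returns of our own choosing (not the book's example):

```python
import numpy as np

V1, T = 10000.0, 200
rng = np.random.default_rng(0)
r = 0.002 * rng.standard_normal(T) + 0.0005   # small hypothetical daily portfolio returns

# Exact total value (17.1): V_{t+1} = V_t (1 + r_t), compounded.
V = V1 * np.concatenate([[1.0], np.cumprod(1 + r)])

# Linear approximation V_{T+1} ~ V1 (1 + T avg(r)).
V_approx = V1 * (1 + T * np.mean(r))
print(V[-1], V_approx)  # close, since the returns are small
```

The two numbers agree to within a couple of percent here, consistent with the claim that products of two or more small returns can be neglected.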
Portfolio return and risk. The choice of weight vector w is judged by the result-
ing portfolio return time series r = Rw. The portfolio mean return (over the T
periods), often shortened to just the return, is given by avg(r). The portfolio risk
(over the T periods) is the standard deviation of portfolio return, std(r).
The quantities avg(r) and std(r) give the per-period return and risk. They
are often converted to their equivalent values for one year, which are called the
annualized return and risk, and reported as percentages. If there are P periods in
one year, these are given by

    P avg(r),    √P std(r),

respectively. For example, suppose each period is one (trading) day. There are about 250 trading days in one year, so the annualized return and risk are given by 250 avg(r) and √250 std(r) ≈ 15.81 std(r). Thus a daily return sequence r with per-period (daily) return 0.05% (0.0005) and risk 0.5% (0.005) has an annualized return and risk of 12.5% and 7.9%, respectively. (The square root of P in the risk annualization comes from the assumption that the fluctuations in the returns vary randomly and independently from period to period.)
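The arithmetic in this example can be checked directly:

```python
import numpy as np

P = 250                     # trading days per year
avg_daily = 0.0005          # per-period (daily) return, 0.05%
std_daily = 0.005           # per-period (daily) risk, 0.5%

ann_return = P * avg_daily           # 0.125, i.e., 12.5%
ann_risk = np.sqrt(P) * std_daily    # about 0.079, i.e., 7.9%
print(ann_return, round(ann_risk, 4))
```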
We want to choose w so that we achieve high return and low risk. This means
that we seek portfolio returns rt that are consistently high. This is an optimization
problem with two objectives, return and risk. Since there are two objectives, there
is a family of solutions that trade off return and risk. For example, when the last asset is risk-free, the portfolio weight w = e_n achieves zero risk (which is the smallest possible value), and return μ^rf. We will see that other choices of w will lead
to higher return, but with higher risk as well. Portfolio weights that minimize risk
for a given level of return (or maximize return for a given level of risk) are called
Pareto optimal. The risk and return of this family of weights are typically plotted
on a risk-return plot, with risk on the horizontal axis and return on the vertical
axis. Individual assets can be considered (very simple) portfolios, corresponding
to w = ej . In this case the corresponding portfolio return and risk are simply the
return and risk of asset j (over the same T periods).
One approach is to fix the return of the portfolio to be some given value ρ, and minimize the risk over all portfolios that achieve the required return. Doing this for many values of ρ produces (different) portfolio allocation vectors that trade off risk and return. Requiring that the portfolio return be ρ can be expressed as

    μ^T w = ρ,

where μ = R^T 1 / T is the n-vector of the average asset returns. This is a single linear equation in w. Assuming that it holds, we can express the square of the risk as

    std(r)^2 = (1/T) ‖r − avg(r) 1‖^2 = (1/T) ‖r − ρ 1‖^2.

Thus to minimize risk (squared), with return value ρ, we must solve the linearly constrained least squares problem

    minimize    ‖Rw − ρ1‖^2
    subject to  [ 1^T ]       [ 1 ]
                [ μ^T ] w  =  [ ρ ].        (17.2)

(We dropped the factor 1/T from the objective, which does not affect the solution.) This is a constrained least squares problem with two linear equality constraints. The first constraint sets the sum of the allocation weights to one, and the second requires that the mean portfolio return is ρ.
The portfolio optimization problem has solution

    [ w  ]     [ 2R^T R   1   μ ]^−1 [ 2ρT μ ]
    [ z1 ]  =  [ 1^T      0   0 ]     [   1   ]
    [ z2 ]     [ μ^T      0   0 ]     [   ρ   ],        (17.3)

where z1 and z2 are Lagrange multipliers for the equality constraints (which we don't care about).
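A sketch of solving (17.2) via its KKT system (17.3), on synthetic returns of our own (the function name and the data are ours, not the book's example):

```python
import numpy as np

def port_opt(R, rho):
    """Pareto optimal weights: minimize ||Rw - rho 1||^2 s.t. 1^T w = 1, mu^T w = rho."""
    T, n = R.shape
    mu = R.T @ np.ones(T) / T                 # n-vector of average asset returns
    ones = np.ones((n, 1))
    K = np.block([[2 * R.T @ R, ones, mu.reshape(n, 1)],
                  [ones.T, np.zeros((1, 2))],
                  [mu.reshape(1, n), np.zeros((1, 2))]])
    rhs = np.concatenate([2 * rho * T * mu, [1.0, rho]])
    return np.linalg.solve(K, rhs)[:n]        # drop the Lagrange multipliers

rng = np.random.default_rng(0)
R = 0.01 * rng.standard_normal((500, 5)) + 0.0004   # synthetic daily return matrix
w = port_opt(R, rho=0.0004)
print(round(w.sum(), 6))  # 1.0
```

The returned weights satisfy both constraints: they sum to one, and the mean portfolio return equals the requested ρ.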
As a historical note, the portfolio optimization problem (17.2) is not exactly the
same as the one proposed by Markowitz. His formulation used a statistical model
of returns, where instead we are using a set of actual (or realized ) returns.
Future returns and the big assumption. The portfolio optimization problem (17.2)
suffers from what would appear to be a serious conceptual flaw: It requires us to
know the asset returns over the periods t = 1, . . . , T , in order to compute the op-
timal allocation to use over those periods. This is silly: If we knew any future
returns, we would be able to achieve as large a portfolio return as we like, by sim-
ply putting large positive weights on the assets with positive returns and negative
weights on those with negative returns. The whole challenge in investing is that
we do not know future returns.
Assume the current time is period T, so we know the (so-called realized) return matrix R. The portfolio weight w found by solving (17.2), based on the observed returns in periods t = 1, . . . , T, can still be useful, when we make one (big) assumption:

    Future asset returns will be similar to past asset returns.        (17.4)

In other words, if the asset returns for future periods T + 1, T + 2, . . . are similar in nature to the past periods t = 1, . . . , T, then the portfolio allocation w found by solving (17.2) could be a wise choice to use in future periods.
Every time you invest, you are warned that the assumption (17.4) need not
hold; you are required to acknowledge that past performance is no guarantee of
future performance. The assumption (17.4) often holds well enough to be useful,
but in times of market shift it need not.
This situation is similar to that encountered when fitting models to observed
data, as in chapters 13 and 14. The model is trained on past data that you have
observed; but it will be used to make predictions on future data, that you have not
yet seen. A model is useful only to the extent that future data looks like past data.
And this is an assumption which often (but not always) holds reasonably well.
Just as in model fitting, investment allocation vectors can (and should) be
validated before being used. For example, we determine the weight vector by
solving (17.2) using past returns data over some past training period, and check
the performance on some other past testing period. If the portfolio performance
over the training and testing periods are reasonably consistent, we gain confidence
(but no guarantee) that the weight vector will work in future periods. For example,
we might determine the weights using the realized returns from two years ago, and
then test these weights by the performance of the portfolio over last year. If the test
works out, we use the weights for next year. In portfolio optimization, validation
is sometimes called back-testing, since you are testing the investment method on
previous realized returns, to get an idea of how the method will work on (unknown)
future returns.
The basic assumption (17.4) often holds less well than the analogous assumption
in data fitting, i.e., that future data looks like past data. For this reason we expect
less coherence between the training and test performance of a portfolio, compared
to a generic data fitting application. This is especially so when the test period has
a small number of periods in it, like 100; see the discussion on page 210.
Figure 17.1 The open circles show annualized risk and return for 20 assets
(19 stocks and one risk-free asset with a return of 1%). The solid line shows
risk and return for the Pareto optimal portfolios. The dots show risk and
return for three Pareto optimal portfolios with 10%, 20%, and 40% return,
and the portfolio with weights wi = 1/n.
17.1.3 Example
We use daily return data for 19 stocks over a period of 2000 days. After adding a risk-free asset with a 1% annual return, we obtain a 2000 × 20 return matrix R. The circles in figure 17.1 show the annualized risk and return for the 20 assets, i.e., the pairs

    ( √250 std(Re_i), 250 avg(Re_i) ),    i = 1, . . . , 20.
It also shows the Pareto-optimal risk-return curve, and the risk and return for the
uniform portfolio with equal weights wi = 1/n. The annualized risk, return, and
the leverage for five portfolios (the four Pareto-optimal portfolios indicated in the
figure, and the 1/n portfolio) are given in table 17.1. Figure 17.2 shows the total
portfolio value (17.1) for the five portfolios. Figure 17.3 shows the portfolio values
for a different test period of 500 days.
17.1.4 Variations
There are many variations on the basic portfolio optimization problem (17.2). We
describe a few of them here.
                    Return          Risk
    Portfolio    Train   Test   Train   Test   Leverage
    Risk-free    0.01    0.01   0.00    0.00   1.00
    10%          0.10    0.08   0.09    0.07   1.96
    20%          0.20    0.15   0.18    0.15   3.03
    40%          0.40    0.30   0.38    0.31   5.48
    1/n          0.10    0.21   0.23    0.13   1.00

Table 17.1 Annualized risk, return, and leverage for five portfolios.
Figure 17.2 Total value over time for five portfolios: the risk-free portfolio
with 1% annual return, the Pareto optimal portfolios with 10%, 20%, and
40% return, and the uniform portfolio. The total value is computed using
the 2000 20 daily return matrix R.
Figure 17.3 Value over time for the five portfolios in figure 17.2 over a test
period of 500 days.
xt+1 = At xt + Bt ut , t = 1, 2, . . . . (17.6)
yt = Ct xt , t = 1, 2, . . . . (17.7)
We usually have m ≤ n and p ≤ n, i.e., there are fewer inputs and outputs than states.
In control applications, the input ut represents quantities that we can choose
or manipulate, like control surface deflections or engine thrust on an airplane. The
state xt , input ut , and output yt typically represent deviations from some standard
or desired operating condition, for example, the deviation of aircraft speed and
altitude from the desired values. For this reason it is desirable to have xt , yt , and
ut small.
Linear quadratic control refers to the problem of choosing the input and state sequences, over a time period t = 1, . . . , T, so as to minimize a sum of squares objective, subject to the dynamics equations (17.6), the output equations (17.7), and additional linear equality constraints. (In 'linear quadratic', 'linear' refers to the linear dynamics, and 'quadratic' refers to the objective function, which is a sum of squares.)
Most control problems include an initial state constraint, which has the form
x1 = xinit , where xinit is a given initial state. Some control problems also include a
final state constraint xT = xdes , where xdes is a given (desired) final (also called
terminal or target) state.
The objective function has the form J = J_output + ρ J_input, where

    J_output = ‖y_1‖^2 + ··· + ‖y_T‖^2,    J_input = ‖u_1‖^2 + ··· + ‖u_{T−1}‖^2.

The positive parameter ρ weights the input objective J_input relative to the output objective J_output.
The linear quadratic control problem (with initial and final state constraints) is

    minimize    J_output + ρ J_input
    subject to  x_{t+1} = A_t x_t + B_t u_t,    t = 1, . . . , T − 1,        (17.8)
                x_1 = x^init,    x_T = x^des,

where the variables to be chosen are x_1, . . . , x_T and u_1, . . . , u_{T−1}.
    z = (x_1, . . . , x_T, u_1, . . . , u_{T−1}).
In this matrix, (block) entries not shown are zero, and the identity matrices in the lower right corner have dimension m. (The lines in the matrix delineate the portions related to the states and the inputs.) The dynamics constraints, and the initial and final state constraints, can be expressed as Cz = d, with

    C = [ A_1  −I                  |  B_1                 ]
        [       A_2  −I            |        B_2           ]
        [            ⋱    ⋱       |             ⋱        ]
        [             A_{T−1}  −I  |               B_{T−1} ]
        [ I                        |                      ]
        [                       I  |                      ]

    d = (0, . . . , 0, x^init, x^des),

where (block) entries not shown are zero. (The vertical line separates the portions of the matrix associated with the states and the inputs, and the last two block rows are the initial and final state constraints.)
The solution ẑ of the constrained least squares problem gives us the optimal input trajectory and the associated optimal state (and output) trajectory. The solution ẑ is a linear function of b and d; since here b = 0, it is a linear function of x^init and x^des.
Complexity. The large constrained least squares problem (17.9) has dimensions

    ñ = Tn + (T − 1)m,    m̃ = Tp + (T − 1)m,    p̃ = (T + 1)n,

so using one of the standard methods described above would require order

    (m̃ + p̃) ñ^2 ≈ T^3 (m + p + n)(m + n)^2

flops. But the matrices A and C are very sparse, and by exploiting this sparsity (see page 275), the large constrained least squares problem can be solved in order T(m + p + n)(m + n)^2 flops, which grows only linearly in T.
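The stacking construction above can be sketched as follows, for time-invariant matrices (a dense sketch that ignores sparsity; the function name and the double-integrator example are ours):

```python
import numpy as np

def lq_control(A, B, C, x_init, x_des, T, rho):
    """Solve (17.8) as a constrained least squares problem in
    z = (x_1, ..., x_T, u_1, ..., u_{T-1})."""
    n, m, p = A.shape[0], B.shape[1], C.shape[0]
    nz = T * n + (T - 1) * m
    # Objective: ||A_til z||^2 = J_output + rho * J_input (so b = 0).
    A_til = np.zeros((T * p + (T - 1) * m, nz))
    for t in range(T):
        A_til[t * p:(t + 1) * p, t * n:(t + 1) * n] = C
    for t in range(T - 1):
        A_til[T * p + t * m:T * p + (t + 1) * m,
              T * n + t * m:T * n + (t + 1) * m] = np.sqrt(rho) * np.eye(m)
    # Constraints: dynamics, then initial state, then final state.
    C_til = np.zeros(((T + 1) * n, nz))
    d = np.zeros((T + 1) * n)
    for t in range(T - 1):
        C_til[t * n:(t + 1) * n, t * n:(t + 1) * n] = A
        C_til[t * n:(t + 1) * n, (t + 1) * n:(t + 2) * n] = -np.eye(n)
        C_til[t * n:(t + 1) * n, T * n + t * m:T * n + (t + 1) * m] = B
    C_til[(T - 1) * n:T * n, :n] = np.eye(n)          # x_1 = x_init
    C_til[T * n:, (T - 1) * n:T * n] = np.eye(n)      # x_T = x_des
    d[(T - 1) * n:T * n] = x_init
    d[T * n:] = x_des
    # Solve via the KKT system (16.6), with b = 0.
    nc = (T + 1) * n
    K = np.block([[2 * A_til.T @ A_til, C_til.T],
                  [C_til, np.zeros((nc, nc))]])
    z = np.linalg.solve(K, np.concatenate([np.zeros(nz), d]))[:nz]
    return z[:T * n].reshape(T, n), z[T * n:].reshape(T - 1, m)

# Double-integrator example (ours): move from x = (1, 0) to rest at the origin.
A = np.array([[1.0, 1.0], [0.0, 1.0]])
B = np.array([[0.0], [1.0]])
C = np.array([[1.0, 0.0]])
x, u = lq_control(A, B, C, [1.0, 0.0], [0.0, 0.0], T=10, rho=1.0)
```

The returned trajectories satisfy the dynamics and the initial and final state constraints exactly, which is a useful sanity check on the construction.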
17.2.1 Example
C= 0.218 3.597 1.683 ,
with initial condition xinit = (0.496, 0.745, 1.394), and target or desired final state
xdes = 0, and T = 100. In this example, both the input ut and the output yt have
dimension one, i.e., are scalar.
Figure 17.4 shows the output when the input is zero,

    y_t = C A^{t−1} x^init,    t = 1, . . . , T,

which is called the open-loop output. Figure 17.5 shows the optimal trade-off curve of the objectives J_input and J_output, found by varying the parameter ρ, solving the problem (17.9), and evaluating the objectives J_input and J_output. The points corresponding to the values ρ = 0.05, ρ = 0.2, and ρ = 1 are shown as circles. As always, increasing ρ has the effect of decreasing J_input, at the cost of increasing J_output.
The optimal input and output trajectories for these three values of ρ are shown in figure 17.6. Here too we see that for larger ρ, the input is smaller but the output is larger.
Figure 17.5 Optimal trade-off curve of the objectives Jinput and Joutput .
17.2.2 Variations
There are many variations on the basic linear quadratic control problem described
above. We describe some of them here.
Figure 17.6 Optimal inputs (left) and outputs (right) for ρ = 0.05 (top), ρ = 0.2 (center), and ρ = 1 (bottom).
    u_t = K x_t

for t = 1, 2, . . .. The matrix K is called the state feedback gain matrix. State feedback control is very widely used in practical applications, especially ones where there is no fixed future time T when the state must take on some desired value; instead, it is desired that both x_t and u_t should be small and converge to zero. One practical advantage of linear state feedback control is that we can find the state feedback gain matrix K ahead of time; when the system is operating, we determine the input values using one simple matrix-vector multiplication. Here we show how an appropriate state feedback gain matrix K can be found using linear quadratic control.
Let ẑ denote the solution of the linear quadratic control problem, i.e., the solution of the linearly constrained least squares problem (17.8), with x^des = 0. The solution ẑ is a linear function of x^init and x^des; since here x^des = 0, ẑ is a linear function of x^init = x_1. Since û_1, the optimal input at t = 1, is a slice or subvector of ẑ, we conclude that û_1 is a linear function of x_1, and so can be written as û_1 = K x_1 for some m × n matrix K. The columns of K can be found by solving (17.8) with initial conditions x^init = e_1, . . . , e_n. This can be done efficiently by factorizing the coefficient matrix once, and then carrying out n solves.
This matrix generally provides a good choice of state feedback gain matrix. With this choice, the input u_1 under state feedback control and under linear quadratic control are the same; for t > 1, the two inputs differ. An interesting phenomenon, beyond the scope of this book, is that the state feedback gain matrix K found this way does not depend very much on T, provided it is chosen large enough.
Example. For the example described in §17.2.1 the state feedback gain matrix for ρ = 1 is

    K = [ 0.308   2.659   1.446 ].
In figure 17.7, we plot the input and output trajectories with linear quadratic
control (in blue) and using the simpler linear state feedback control ut = Kxt .
We can see that the input sequence found using linear quadratic control achieves
yT = 0 exactly; the input sequence found by linear state feedback control makes
yT small, but not zero.
Figure 17.7 The blue curves are the solutions of (17.8) for ρ = 1. The red curves are the inputs and outputs that result from the constant state feedback u_t = K x_t.
xt+1 = At xt + Bt wt , yt = Ct xt + vt , t = 1, 2, . . . . (17.10)
Here the n-vector xt is the state of the system, the p-vector yt is the measurement,
the m-vector wt is the input or process noise, and the p-vector vt is the measurement
noise or residual. The matrices At , Bt , and Ct are the dynamics, input, and output
matrices, respectively.
In state estimation, we know the matrices At , Bt , and Ct over the time period
t = 1, . . . , T , as well as the measurements y1 , . . . , yT , but we do not know the
process or measurement noises. The goal is to guess or estimate the state sequence
x1 , . . . , xT . State estimation is widely used in many application areas, including
all guidance and navigation systems, such as GPS (global positioning system).
Since we do not know the process or measurement noises, we cannot exactly
deduce the state sequence. Instead we will guess or estimate the state sequence
x1 , . . . , xT and process noise sequence w1 , . . . , wT 1 , subject to the requirement
that they satisfy the dynamic system model (17.10). When we guess the state
sequence, we implicitly guess that the measurement noise is vt = yt Ct xt . We
make one fundamental assumption: The process and measurement noises are both
small, or at least, not too large.
Our primary objective is the sum of squares of the norms of the measurement residuals,

    J_meas = ‖y_1 − C_1 x_1‖^2 + ··· + ‖y_T − C_T x_T‖^2.

If this quantity is small, it means that the proposed state sequence guess is consistent with our measurements. Note that the quantities in the squared norms above are the same as v_t.
The secondary objective is the sum of squares of the norms of the process noise,

    J_proc = ‖w_1‖^2 + ··· + ‖w_{T−1}‖^2.
Our prior assumption that the process noise is small corresponds to this objective
being small.
Estimation versus control. The least squares state estimation problem is very
similar to the linear quadratic control problem, but the interpretation is quite dif-
ferent. In the control problem, we can choose the inputs; they are under our control.
Once we choose the inputs, we know the state sequence. In the control problem,
the inputs are typically actions that we take to affect the state trajectory. In the
estimation problem, the inputs (called process noise in the estimation problem) are
unknown, and the problem is to guess them. Our job is to guess the state sequence,
which we do not know. This is a passive task. We are not choosing inputs to affect
the state; rather, we are observing the outputs and hoping to deduce the state
sequence. The mathematical formulations of the two problems, however, are very
closely related. The close connection between the two problems is sometimes called
control/estimation duality.
Formulation as constrained least squares problem. The least squares state estimation problem (17.11) can be formulated as a linearly constrained least squares problem, using stacking. We define the stacked vector

    z = (x_1, . . . , x_T, w_1, . . . , w_{T−1}).

The objective in (17.11) can be expressed as ‖Az − b‖^2, with

    A = [ C_1                |                ]        b = [ y_1 ]
        [      ⋱            |                ]            [  ⋮  ]
        [           C_T      |                ]            [ y_T ]
        [                    |  √λ I          ]            [  0  ]
        [                    |       ⋱       ]            [  ⋮  ]
        [                    |         √λ I   ]            [  0  ]

where (block) entries not shown are zero.
The constraints in (17.11) can be expressed as Cz = d, with d = 0 and

    C = [ A_1  −I                  |  B_1                 ]
        [       A_2  −I            |        B_2           ]
        [            ⋱    ⋱       |             ⋱        ]
        [             A_{T−1}  −I  |               B_{T−1} ].
The constrained least squares problem has dimensions

    ñ = Tn + (T − 1)m,    m̃ = Tp + (T − 1)m,    p̃ = (T − 1)n,

so using one of the standard methods described above would require order

    (p̃ + m̃) ñ^2 ≈ T^3 (m + p + n)(m + n)^2

flops. As in the case of linear quadratic control, the matrices A and C are very sparse, and by exploiting this sparsity (see page 275), the large constrained least squares problem can be solved in order T(m + p + n)(m + n)^2 flops, which grows only linearly in T.
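A dense sketch of this construction (ignoring sparsity; the function name and the test trajectory are ours), for time-invariant A, B, C:

```python
import numpy as np

def lq_estimate(A, B, C, y, lam):
    """Minimize sum ||y_t - C x_t||^2 + lam * sum ||w_t||^2
    subject to x_{t+1} = A x_t + B w_t, in z = (x_1..x_T, w_1..w_{T-1})."""
    T = len(y)
    n, m, p = A.shape[0], B.shape[1], C.shape[0]
    nz = T * n + (T - 1) * m
    A_til = np.zeros((T * p + (T - 1) * m, nz))
    b_til = np.zeros(T * p + (T - 1) * m)
    for t in range(T):
        A_til[t * p:(t + 1) * p, t * n:(t + 1) * n] = C
        b_til[t * p:(t + 1) * p] = y[t]
    for t in range(T - 1):
        A_til[T * p + t * m:T * p + (t + 1) * m,
              T * n + t * m:T * n + (t + 1) * m] = np.sqrt(lam) * np.eye(m)
    C_til = np.zeros(((T - 1) * n, nz))
    for t in range(T - 1):
        C_til[t * n:(t + 1) * n, t * n:(t + 1) * n] = A
        C_til[t * n:(t + 1) * n, (t + 1) * n:(t + 2) * n] = -np.eye(n)
        C_til[t * n:(t + 1) * n, T * n + t * m:T * n + (t + 1) * m] = B
    nc = (T - 1) * n
    K = np.block([[2 * A_til.T @ A_til, C_til.T],
                  [C_til, np.zeros((nc, nc))]])
    rhs = np.concatenate([2 * A_til.T @ b_til, np.zeros(nc)])
    z = np.linalg.solve(K, rhs)[:nz]
    return z[:T * n].reshape(T, n), z[T * n:].reshape(T - 1, m)

# Mass-in-2-D model from the example; noiseless data, so the
# estimate should recover the true trajectory exactly.
A = np.array([[1., 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0], [0, 0, 0, 1]])
B = np.array([[0., 0], [0, 0], [1, 0], [0, 1]])
C = np.array([[1., 0, 0, 0], [0, 1, 0, 0]])
xs = [np.array([0.0, 0.0, 1.0, 0.5])]
for _ in range(7):
    xs.append(A @ xs[-1])
y = [C @ xv for xv in xs]
xhat, what = lq_estimate(A, B, C, y, lam=1.0)
```

With noiseless measurements generated by zero process noise, the objective value 0 is attainable, so the estimator returns the true states and zero process noise; with noisy y the same call trades the two objectives off through λ.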
The least squares state estimation problem was formulated around 1960 by
Rudolf Kalman and others (in a statistical framework). He and others developed a
particular recursive algorithm for solving the problem, and the whole method has
come to be known as Kalman filtering. For this work Kalman was awarded the
Kyoto Prize in 1985.
17.3.1 Example
We consider a system with n = 4, p = 2, and m = 2, and time-invariant matrices

    A = [ 1  0  1  0 ]        B = [ 0  0 ]        C = [ 1  0  0  0 ]
        [ 0  1  0  1 ]            [ 0  0 ]            [ 0  1  0  0 ].
        [ 0  0  1  0 ]            [ 1  0 ]
        [ 0  0  0  1 ]            [ 0  1 ]
This is a very simple model of motion of a mass moving in 2-D. The first two
components of xt represent the position coordinates; components 3 and 4 represent
the velocity coordinates. The input wt acts like a force on the mass, since it adds to
the velocity. We think of the 2-vector Cxt as the exact or true position of the mass
at period t. The measurement yt = Cxt + vt is a noisy measurement of the mass
position. We will estimate the state trajectory over t = 1, . . . , T , with T = 100.
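A short sketch of this model simulates the mass and its noisy position measurements; the noise magnitudes and the initial state are illustrative assumptions.

```python
import numpy as np

# Simulate the 2-D mass model; noise magnitudes and the initial state
# are illustrative assumptions.
np.random.seed(1)
A = np.array([[1., 0., 1., 0.],
              [0., 1., 0., 1.],
              [0., 0., 1., 0.],
              [0., 0., 0., 1.]])
B = np.array([[0., 0.], [0., 0.], [1., 0.], [0., 1.]])
C = np.array([[1., 0., 0., 0.], [0., 1., 0., 0.]])

T = 100
x = np.zeros((T, 4))
x[0] = np.array([0.0, 0.0, 1.0, 0.5])      # initial position and velocity
y = np.zeros((T, 2))
for t in range(T - 1):
    w = 0.05 * np.random.randn(2)          # small random force
    x[t + 1] = A @ x[t] + B @ w            # state update
for t in range(T):
    v = 0.3 * np.random.randn(2)           # measurement noise
    y[t] = C @ x[t] + v                    # noisy position measurement
print(y[:3])
```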
In figure 17.8 the 100 measured positions $y_t$ are shown as circles in 2-D. The solid black line shows $Cx_t$, i.e., the actual position of the mass. We solve the least squares state estimation problem (17.11) for a range of values of $\lambda$. The estimated trajectories $C\hat x_t$ for three values of $\lambda$ are shown as red lines. We can see that $\lambda = 1$ is too small for this example: The estimated state places too much trust in the measurements, and follows the measurement noise. We can also see that $\lambda = 10^5$ is too large: The estimated state is very smooth (since the estimated process noise is small), but the imputed measurement errors are too large. In this example the choice of $\lambda$ is simple, since we have the true position trajectory. We will see later how $\lambda$ can be chosen using validation in the general case.
17.3 Linear quadratic state estimation 297
[Figure 17.8: The circles show 100 noisy measurements in 2-D. Top left: the black line is the exact position $Cx_t$. Three other plots: the red lines are estimated trajectories $C\hat x_t$ for $\lambda = 1$, $\lambda = 10^3$, and $\lambda = 10^5$.]
17.3.2 Variations
Known initial state. There are several interesting variations on the state estimation problem. For example, we might know the initial state $x_1$. In this case we simply add the equality constraint $x_1 = x_1^{\rm known}$.
Missing measurements. Another useful variation on the least squares state estimation problem allows for missing measurements, i.e., we only know $y_t$ for $t \in \mathcal T$, where $\mathcal T$ is the set of times for which we have a measurement. We can handle this variation in two (equivalent) ways: We can either replace $\sum_{t=1}^{T} \|v_t\|^2$ with $\sum_{t \in \mathcal T} \|v_t\|^2$, or we can consider $y_t$ for $t \notin \mathcal T$ to be optimization variables as well. (Both lead to the same state sequence estimate.) When there are missing measurements, we can estimate what the missing measurements might have been, by taking
$$\hat y_t = C_t \hat x_t, \quad t \notin \mathcal T.$$
(Here we assume that $\hat v_t = 0$.)
17.3.3 Validation
The technique of estimating what a missing measurement might have been directly gives us a method to validate a quadratic state estimation method, and in particular, to choose $\lambda$. To do this, we remove some of the measurements (say, 20%), and carry out least squares state estimation pretending that those measurements are missing, i.e., we only sum the measurement errors over the measurements we have. Our state estimate produces predicted values for the missing (really, held back) measurements, which we can compare to the actual measurements. We choose a value of $\lambda$ that approximately minimizes this (test) prediction error. For each value of $\lambda$ we compute the RMS train and test errors
$$E^{\rm train} = \left( \frac{1}{80p} \sum_{t \in \mathcal T} \|C\hat x_t - y_t\|^2 \right)^{1/2}, \qquad
E^{\rm test} = \left( \frac{1}{20p} \sum_{t \notin \mathcal T} \|C\hat x_t - y_t\|^2 \right)^{1/2}.$$
The training error (squared and scaled) appears directly in our minimization problem. The test error, however, is a good test of our estimation method, since it compares predictions of positions (in this example) with measurements of position that were not used to form the estimates. The errors are shown in figure 17.9, as a function of the parameter $\lambda$. We can clearly see that for $\lambda < 100$ or so, we are over-fit, since the test RMS error substantially exceeds the train RMS error. We can also see that $\lambda$ around $10^3$ is a good choice.
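The error computation can be sketched as follows; the estimates stand in for $C\hat x_t$, since the full estimation problem is not solved here, and all concrete numbers are illustrative.

```python
import numpy as np

# Train/test RMS error with an 80/20 held-out split; the estimates
# y_hat stand in for C x_hat_t (all values illustrative).
np.random.seed(2)
T, p = 100, 2
y_meas = np.random.randn(T, p)                   # measured positions y_t
y_hat = y_meas + 0.1 * np.random.randn(T, p)     # stand-in estimates
train = np.zeros(T, dtype=bool)
train[np.random.choice(T, 80, replace=False)] = True

E_train = np.sqrt(np.sum((y_hat[train] - y_meas[train])**2) / (80 * p))
E_test = np.sqrt(np.sum((y_hat[~train] - y_meas[~train])**2) / (20 * p))
print(E_train, E_test)   # both near the noise level 0.1
```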
[Figure 17.9: Training and test RMS error versus $\lambda$ for the state estimation example.]
Chapter 18
Nonlinear least squares

In this chapter we consider the problem of solving a set of possibly nonlinear equations
$$f_i(x) = 0, \quad i = 1, \ldots, m,$$
which we write in vector form as
$$f(x) = 0, \qquad (18.1)$$
where $f(x) = (f_1(x), \ldots, f_m(x))$ is an $m$-vector, and the zero vector on the right-hand side has dimension $m$. We can think of $f$ as a function that maps $n$-vectors to $m$-vectors.
302 18 Nonlinear least squares
When we cannot find a solution of the equations (18.1), we can seek an approximate solution by finding $\hat x$ that minimizes the sum of squares of the residuals, i.e., by solving the problem
$$\text{minimize} \quad \|f(x)\|^2 = f_1(x)^2 + \cdots + f_m(x)^2, \qquad (18.2)$$
where the $n$-vector $x$ is the variable to be found. This means finding $\hat x$ for which $\|f(\hat x)\|^2 \le \|f(x)\|^2$ holds for all $x$. We refer to such a point as a least squares approximate solution of (18.1), or more directly, as a solution of the nonlinear least squares problem (18.2). When the function $f$ is affine, the nonlinear least squares problem (18.2) reduces to the (linear) least squares problem from chapter 12.
The nonlinear least squares problem (18.2) includes the problem of solving
nonlinear equations (18.1) as a special case, since any x that satisfies f (x) = 0 is
also a solution of the nonlinear least squares problem. But as in the case of linear
equations, the least squares approximate solution of a set of nonlinear equations is
often very useful even when it does not solve the equations. So we will focus on
the nonlinear least squares problem (18.2).
18.1 Nonlinear equations and least squares

Optimality condition. Any minimizer $\hat x$ of $\|f(x)\|^2$ must satisfy
$$\frac{\partial}{\partial x_i} \|f(x)\|^2 = 0, \quad i = 1, \ldots, n,$$
or, in vector form, $\nabla \|f(\hat x)\|^2 = 0$ (see §C.2). This gradient can be expressed as
$$\nabla \|f(x)\|^2 = \nabla \left( \sum_{i=1}^{m} f_i(x)^2 \right) = 2 \sum_{i=1}^{m} f_i(x) \nabla f_i(x) = 2\, Df(x)^T f(x),$$
where the $m \times n$ matrix $Df(x)$ is the derivative or Jacobian matrix of the function $f$ at the point $x$, i.e., the matrix of its partial derivatives (see §8.3). So if $\hat x$ minimizes $\|f(x)\|^2$, it must satisfy
$$2\, Df(\hat x)^T f(\hat x) = 0. \qquad (18.3)$$
This optimality condition must hold for any solution of the nonlinear least squares
problem (18.2). But the optimality condition can also hold for other points that
are not solutions of the nonlinear least squares problem. For this reason the opti-
mality condition (18.3) is called a necessary condition for optimality, because it is
necessarily satisfied for any solution $\hat x$. It is not a sufficient condition for optimality, since the optimality condition (18.3) is not enough (i.e., is not sufficient) to guarantee that the point is a solution of the nonlinear least squares problem.
When the function f is affine, the optimality conditions (18.3) reduce to the
normal equations (12.4), the optimality conditions for the (linear) least squares
problem.
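The identity $\nabla \|f(x)\|^2 = 2\,Df(x)^T f(x)$ can be checked against a finite-difference gradient, here for a hypothetical $f$ from 2-vectors to 3-vectors (chosen only for illustration).

```python
import numpy as np

# Numerical check of grad ||f(x)||^2 = 2 Df(x)^T f(x) for a hypothetical
# function f from 2-vectors to 3-vectors.
def f(x):
    return np.array([x[0]**2 - x[1], np.sin(x[1]), x[0] * x[1]])

def Df(x):  # Jacobian: row i is the gradient of f_i
    return np.array([[2 * x[0], -1.0],
                     [0.0, np.cos(x[1])],
                     [x[1], x[0]]])

x = np.array([0.7, -0.3])
g_analytic = 2 * Df(x).T @ f(x)

h = 1e-6
g_numeric = np.zeros(2)
for i in range(2):
    e = np.zeros(2)
    e[i] = h
    # central difference of ||f||^2 in coordinate i
    g_numeric[i] = (np.sum(f(x + e)**2) - np.sum(f(x - e)**2)) / (2 * h)

print(g_analytic, g_numeric)   # the two gradients agree
```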
[Figure 18.1: Supply $S(p)$ and demand $D(p)$ as functions of the price, shown on the horizontal axis. They intersect at the point shown as a circle; the corresponding price is the equilibrium price.]
18.1.5 Examples
In this section we list a few applications that reduce to solving a set of nonlinear
equations, or a nonlinear least squares problem.
(The vector f (p) is called the excess supply, at the set of prices p.) This is
shown in figure 18.1 for a simple case with n = 1.
$$\frac{\partial R_i}{\partial x_i}(x) = 0, \quad i = 1, \ldots, n.$$
(As in linear least squares model fitting, we can add a regularization term to
this objective function.)
The Gauss–Newton algorithm alternates between forming an affine approximation of $f$ at the current iterate, and solving the associated linear least squares problem to find the next iterate. This combines two of the most powerful ideas in applied mathematics: Calculus is used to form an affine approximation of a function near a given point, and least squares is used to compute an approximate solution of the resulting affine equations.
We now describe the algorithm in more detail. At each iteration $k$, we form the affine approximation $\hat f$ of $f$ at the current iterate $x^{(k)}$, given by the Taylor approximation
$$\hat f(x; x^{(k)}) = f(x^{(k)}) + Df(x^{(k)})(x - x^{(k)}), \qquad (18.5)$$
where the $m \times n$ matrix $Df(x^{(k)})$ is the Jacobian or derivative matrix of $f$ (see §8.3 and §C.1). The affine function $\hat f(x; x^{(k)})$ is a very good approximation of $f(x)$ provided $x$ is near $x^{(k)}$, i.e., $\|x - x^{(k)}\|$ is small.
The next iterate $x^{(k+1)}$ is then taken to be the minimizer of $\|\hat f(x; x^{(k)})\|^2$, the norm squared of the affine approximation of $f$ at $x^{(k)}$. Assuming that the derivative matrix $Df(x^{(k)})$ has linearly independent columns (which requires $m \ge n$), we have
$$x^{(k+1)} = x^{(k)} - \left( Df(x^{(k)})^T Df(x^{(k)}) \right)^{-1} Df(x^{(k)})^T f(x^{(k)}).$$
1. Form affine approximation at current iterate using calculus. Evaluate the Jacobian $Df(x^{(k)})$ and define
$$\hat f(x; x^{(k)}) = f(x^{(k)}) + Df(x^{(k)})(x - x^{(k)}).$$
2. Update iterate using linear least squares. Set $x^{(k+1)}$ as the minimizer of $\|\hat f(x; x^{(k)})\|^2$,
$$x^{(k+1)} = x^{(k)} - \left( Df(x^{(k)})^T Df(x^{(k)}) \right)^{-1} Df(x^{(k)})^T f(x^{(k)}).$$
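The two steps can be sketched as follows, for a hypothetical exponential-fit residual $f_i(\theta) = \theta_1 e^{\theta_2 t_i} - y_i$ with synthetic noise-free data (all concrete values are assumptions for illustration); starting near the true parameters, the iteration converges.

```python
import numpy as np

# Gauss-Newton for a hypothetical exponential-fit residual
#   f_i(theta) = theta_1 exp(theta_2 t_i) - y_i,
# with synthetic noise-free data, so the minimum residual is zero.
t = np.linspace(0, 3, 20)
theta_true = np.array([1.0, -0.5])
y = theta_true[0] * np.exp(theta_true[1] * t)

def f(th):
    return th[0] * np.exp(th[1] * t) - y

def Df(th):  # Jacobian of f with respect to theta
    e = np.exp(th[1] * t)
    return np.column_stack([e, th[0] * t * e])

theta = np.array([1.2, -0.4])              # starting point near theta_true
for k in range(20):
    J, r = Df(theta), f(theta)
    # step 2: minimize the norm of the affine approximation
    theta = theta - np.linalg.solve(J.T @ J, J.T @ r)

print(theta)   # close to theta_true
```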
which occurs if and only if $Df(x^{(k)})^T f(x^{(k)}) = 0$ (since we assume that $Df(x^{(k)})$ has linearly independent columns). Roughly speaking, the Gauss–Newton algorithm stops only when the optimality condition (18.3) holds.
We can also observe that
$$\|\hat f(x^{(k+1)}; x^{(k)})\| \le \|f(x^{(k)})\|$$
holds, since $x^{(k+1)}$ minimizes $\|\hat f(x; x^{(k)})\|^2$, and $\hat f(x^{(k)}; x^{(k)}) = f(x^{(k)})$. Roughly speaking, the norm of the residual of the approximation goes down in each iteration. This is not the same as
$$\|f(x^{(k+1)})\| \le \|f(x^{(k)})\|,$$
i.e., the norm of the residual going down in each iteration, which is what we would like.
For the special case m = n, the Gauss-Newton algorithm reduces to another famous
algorithm for solving a set of n nonlinear equations in n variables, called the Newton
algorithm. (The algorithm is sometimes called the Newton-Raphson algorithm,
since Newton developed the method only for the special case n = 1, and Joseph
Raphson later extended it to the case n > 1.)
[Figure 18.2: One iteration of the Newton algorithm for solving an equation $f(x) = 0$ in one variable, showing $f(x)$, the affine approximation $\hat f(x; x^{(k)})$, and the iterates $x^{(k)}$ and $x^{(k+1)}$.]
2. Update iterate by solving linear equations. Set $x^{(k+1)}$ as the solution of $\hat f(x; x^{(k)}) = 0$,
$$x^{(k+1)} = x^{(k)} - Df(x^{(k)})^{-1} f(x^{(k)}).$$
The basic Newton algorithm shares the same shortcomings as the basic Gauss-
Newton algorithm, i.e., it can diverge, and the iterations terminate if the derivative
matrix is not invertible.
and is illustrated in figure 18.2. To update $x^{(k)}$ we form the Taylor approximation
$$\hat f(x; x^{(k)}) = f(x^{(k)}) + f'(x^{(k)})(x - x^{(k)})$$
and set it to zero to find the next iterate $x^{(k+1)}$. If $f'(x^{(k)}) \ne 0$, the solution of $\hat f(x; x^{(k)}) = 0$ is given by the right-hand side of (18.8). If $f'(x^{(k)}) = 0$, the Newton algorithm terminates with an error.
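For the function used in figures 18.3 and 18.4, $f(x) = (e^x - e^{-x})/(e^x + e^{-x}) = \tanh(x)$, we have $f'(x) = 1 - \tanh(x)^2$, so the Newton update simplifies to $x - \sinh(2x)/2$. The behavior at the two starting points can be reproduced directly:

```python
import math

# Newton's method for f(x) = tanh(x): f'(x) = 1 - tanh(x)^2, so the
# update x - f(x)/f'(x) simplifies to x - sinh(2x)/2.
def newton_tanh(x, iters=10):
    for _ in range(iters):
        if abs(x) > 50:        # stop before overflow: iterates have diverged
            return x
        x = x - 0.5 * math.sinh(2 * x)
    return x

print(newton_tanh(0.95))   # converges to 0
print(newton_tanh(1.15))   # diverges
```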
[Figure 18.3: The first iterations of the Newton algorithm for solving $f(x) = 0$, for two starting points: $x^{(1)} = 0.95$ and $x^{(1)} = 1.15$.]
[Figure 18.4: Value of $f(x^{(k)})$ versus iteration number $k$ for Newton's method in the example of figure 18.3, started at $x^{(1)} = 0.95$ and $x^{(1)} = 1.15$.]
18.3 Levenberg-Marquardt algorithm 311
Since $\lambda^{(k)}$ is positive, the stacked matrix in this least squares problem has linearly independent columns, even when $Df(x^{(k)})$ does not. It follows that the solution of the least squares problem exists and is unique.
From the normal equations of the least squares problem we can derive a useful expression for $x^{(k+1)}$:
$$\left( Df(x^{(k)})^T Df(x^{(k)}) + \lambda^{(k)} I \right) x^{(k+1)}
= Df(x^{(k)})^T \left( Df(x^{(k)})\, x^{(k)} - f(x^{(k)}) \right) + \lambda^{(k)} x^{(k)}$$
$$= \left( Df(x^{(k)})^T Df(x^{(k)}) + \lambda^{(k)} I \right) x^{(k)} - Df(x^{(k)})^T f(x^{(k)}),$$
and therefore
$$x^{(k+1)} = x^{(k)} - \left( Df(x^{(k)})^T Df(x^{(k)}) + \lambda^{(k)} I \right)^{-1} Df(x^{(k)})^T f(x^{(k)}). \qquad (18.11)$$
Stopping criteria. The algorithm is stopped before the maximum number of iterations if any of the following conditions hold.

- $\|f(x^{(k+1)})\|^2$ is small enough. This means we have (almost) solved $f(x) = 0$ (and also, almost minimized $\|f(x)\|^2$).
- $\|x^{(k+1)} - x^{(k)}\|$ is small. This means that the algorithm has (almost) converged, and that the optimality condition (18.3) almost holds.
- The test in step 3 fails too many consecutive times, or $\lambda^{(k)}$ becomes larger than some given maximum value.
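A minimal scalar sketch of the algorithm uses update (18.11) together with a simple accept/reject rule; the specific adjustment factors (0.8 and 2) are illustrative assumptions. It is applied here to the scalar equation $\tanh(x) = 0$, for which Newton's method can diverge.

```python
import math

# Levenberg-Marquardt for the scalar equation f(x) = tanh(x) = 0,
# following update (18.11); the 0.8/2 lambda adjustment factors are
# illustrative choices.
def lm_tanh(x, lam, iters=50):
    for _ in range(iters):
        fx = math.tanh(x)
        dfx = 1 - fx * fx                      # derivative of tanh
        xt = x - dfx * fx / (dfx * dfx + lam)  # tentative iterate (18.11)
        if abs(math.tanh(xt)) < abs(fx):       # residual decreased: accept
            x, lam = xt, 0.8 * lam
        else:                                  # reject: increase damping
            lam = 2.0 * lam
    return x

x_star = lm_tanh(1.15, 1.0)
print(x_star)   # near 0, the solution
```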
Even when the algorithm converges normally (the second case), we can say very
little for sure about the point computed. The point found may be a minimizer of
$\|f(x)\|^2$, or perhaps not. As with many other heuristic algorithms, the point found
is often very useful in applications, even if we cannot be sure that it solves the
nonlinear least squares problem.
18.3.1 Examples
Nonlinear equation. Our first example is the sigmoid function (18.9) from page 309. We saw in figures 18.3 and 18.4 that the Gauss–Newton method, which reduces to Newton's method in this case, diverges when the initial value $x^{(1)}$ is 1.15. The Levenberg–Marquardt algorithm, however, solves this problem. Figure 18.5 shows the value of the residual $f(x^{(k)})$, and the value of $\lambda^{(k)}$, for the Levenberg–Marquardt algorithm started from $x^{(1)} = 1.15$ and $\lambda^{(1)} = 1$. It converges to the solution $x = 0$ in around 10 iterations.
[Figure 18.5: Values of $f(x^{(k)})$ and $\lambda^{(k)}$ versus the iteration number $k$ for the Levenberg–Marquardt algorithm applied to $f(x) = (\exp(x) - \exp(-x))/(\exp(x) + \exp(-x))$. The starting point is $x^{(1)} = 1.15$ and $\lambda^{(1)} = 1$.]

Equilibrium prices. We illustrate algorithm 18.1 with a small instance of the equilibrium price problem, in which $E^{\rm d}$ and $E^{\rm s}$ are the demand and supply elasticity matrices, $d^{\rm nom}$ and $s^{\rm nom}$ are the nominal demand and supply vectors, and the log and exp appearing in the equations apply to vectors elementwise. Figure 18.6 shows the contour lines of $\|f(p)\|^2$, where $f(p) = S(p) - D(p)$ is the excess supply, for elasticity matrices
$$E^{\rm d} = \begin{bmatrix} -0.5 & 0.2 \\ 0 & -0.5 \end{bmatrix}, \qquad
E^{\rm s} = \begin{bmatrix} 0.5 & -0.3 \\ -0.15 & 0.8 \end{bmatrix}.$$
Figure 18.7 shows the iterates of algorithm 18.1, started at $p = (3, 9)$ and $\lambda^{(1)} = 1$. The values of $\|f(p^{(k)})\|^2$ and the regularization parameter $\lambda^{(k)}$ versus iteration $k$ are shown in figure 18.8.
with $\lambda^{(1)} = 0.1$. Figure 18.11 shows the iterates $x^{(k)}$ for the three starting points. When started at (1.8, 3.5) (blue circles) or (3.0, 1.5) (brown diamonds) the algorithm converges to (1.18, 0.82), the point that minimizes $\|f(x)\|^2$.
[Figure 18.6: Contour lines of the square norm of the excess supply $f(p) = S(p) - D(p)$ for a small example with two commodities, plotted over $(p_1, p_2)$. The point marked with a star shows the equilibrium prices, for which $f(p) = 0$.]
[Figure 18.7: Iterates $p^{(k)}$ of algorithm 18.1 for the equilibrium price example, plotted in the $(p_1, p_2)$ plane.]
[Figure 18.8: Cost function $\|f(p^{(k)})\|^2$ and regularization parameter $\lambda^{(k)}$ versus iteration number $k$ in the example of figure 18.7.]
[Figure 18.11: The Levenberg–Marquardt iterates $x^{(k)}$ for three starting points, together with the graph and contour lines of $\|f(x)\|$, plotted over the $(x_1, x_2)$ plane.]
Example. Figure 18.13 shows a nonlinear model fitting example. The model is an exponentially decaying sinusoid
$$\hat f(x; \theta) = \theta_1 e^{\theta_2 x} \cos(\theta_3 x + \theta_4),$$
[Figure 18.12: Cost function $\|f(x^{(k)})\|^2$ and regularization parameter $\lambda^{(k)}$ versus iteration number $k$ for the three starting points.]
18.5 Nonlinear least squares classification

In least squares classification (chapter 14), we fit a model $\hat f(x; \theta)$ to the data points $(x_i, y_i)$, $i = 1, \ldots, N$, where $y_i \in \{-1, +1\}$, using linear least squares. The parameters $\theta_1, \ldots, \theta_p$ are chosen to minimize the sum of squares objective
$$\sum_{i=1}^{N} (\hat f(x_i) - y_i)^2. \qquad (18.12)$$

[Figure 18.14: The solid line minimizes the sum of the squares of the orthogonal distances of points to the graph of the polynomial.]
The classification error on the training set is more directly measured by the objective
$$\sum_{i=1}^{N} \left( \mathrm{sign}(\hat f(x_i)) - y_i \right)^2, \qquad (18.13)$$
which is 4 times the number of classification errors we make on the training set. To see this, we note that when $\mathrm{sign}(\hat f(x_i)) = y_i$, which means that a correct prediction was made on the $i$th data point, we have $(\mathrm{sign}(\hat f(x_i)) - y_i)^2 = 0$. When $\mathrm{sign}(\hat f(x_i)) \ne y_i$, which means that an incorrect prediction was made on the $i$th data point, one of the values is $+1$ and the other is $-1$, so we have $(\mathrm{sign}(\hat f(x_i)) - y_i)^2 = 4$.
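The 4-times-the-errors relation can be verified on random illustrative data:

```python
import numpy as np

# Check that sum_i (sign(fhat_i) - y_i)^2 equals 4 times the number of
# classification errors, on random illustrative data.
np.random.seed(3)
y = np.sign(np.random.randn(200))          # labels in {-1, +1}
fhat = np.random.randn(200)                # stand-in classifier values
pred = np.sign(fhat)

loss = np.sum((pred - y)**2)               # the objective (18.13)
errors = np.sum(pred != y)                 # number of prediction errors
print(loss, errors)
```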
The objective (18.13) is what we really want; the least squares objective (18.12)
is a surrogate for what we want. But we cannot use the Levenberg-Marquardt algo-
rithm to minimize the objective (18.13), since the sign function is not differentiable.
To get around this, we replace the sign function with a differentiable approximation,
for example the sigmoid function
$$\phi(u) = \frac{e^u - e^{-u}}{e^u + e^{-u}}, \qquad (18.14)$$
[Figure 18.15: The sigmoid function $\phi(u)$.]
shown in figure 18.15. We choose $\theta$ by solving the nonlinear least squares problem of minimizing
$$\sum_{i=1}^{N} \left( \phi(\hat f(x_i)) - y_i \right)^2. \qquad (18.15)$$
All three of these objectives can be expressed as
$$\sum_{i=1}^{N} \ell(\hat f(x_i), y_i),$$
where $\ell(u, y)$ is a loss function. For the linear least squares objective (18.12), the loss function is $\ell(u, y) = (u - y)^2$. For the nonlinear least squares objective with the sign function (18.13), the loss function is $\ell(u, y) = (\mathrm{sign}(u) - y)^2$. For the differentiable nonlinear least squares objective (18.15), the loss function is $\ell(u, y) = (\phi(u) - y)^2$. Roughly speaking, the loss function $\ell(u, y)$ tells us how bad it is to have $\hat f(x_i) = u$ when $y_i = y$.
Since the outcome $y$ takes on only two values, $-1$ and $+1$, we can plot the loss functions as functions of $u$ for these two values of $y$. Figure 18.16 shows these three loss functions, with the value for $y = -1$ in the left column and the value for $y = +1$ in the right column. We can see that all three loss functions discourage prediction errors, since their values are higher when $\mathrm{sign}(u) \ne y$ than when $\mathrm{sign}(u) = y$.
The loss function for nonlinear least squares classification with the sign function
(shown in the middle row) assesses a cost of 0 for a correct prediction and 4 for
an incorrect prediction. The loss function for nonlinear least squares classification
Training set:

             Prediction
Outcome      y = +1    y = -1     Total
y = +1         5627       296      5923
y = -1          148     53929     54077
All            5775     54225     60000

Test set:

             Prediction
Outcome      y = +1    y = -1     Total
y = +1          945        35       980
y = -1           40      8980      9020
All             985      9015     10000

Table 18.1 Confusion matrices for a Boolean classifier to recognize the digit zero. The first table is for the training set; the second is for the test set.
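From the training-set table, the classification error rate can be recomputed directly from the off-diagonal counts:

```python
# Error counts from table 18.1 (training set): the off-diagonal
# entries are the misclassified examples.
fp, fn, total = 148, 296, 60000   # false positives, false negatives
error_rate = 100.0 * (fp + fn) / total
print(round(error_rate, 2))       # 0.74 (percent)
```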
with the sigmoid function (shown in the bottom row) is a smooth approximation
of this.
[Figure 18.16: The loss functions $\ell(u, y)$ for linear least squares classification (top row, $(u+1)^2$ and $(u-1)^2$), nonlinear least squares classification with the sign function (middle row), and nonlinear least squares classification with the sigmoid function (bottom row). The left column shows $\ell(u, -1)$ and the right column shows $\ell(u, +1)$.]
18.5 Nonlinear least squares classification 325
Train
Classification error (%) Test
0
109 106 103 100 103 106
[Figure 18.18: The distribution of the values of $\hat f(x_i)$ (positive and negative examples) used in the Boolean classifier (14.1) for recognizing the digit zero. The function $\hat f$ was computed by solving the nonlinear least squares problem (18.15).]
[Figure 18.19: Train and test classification error versus Levenberg–Marquardt iteration.]
             Prediction
Outcome      y = +1    y = -1     Total
y = +1          967        13       980
y = -1           11      9009      9020
All             978      9022     10000

Table 18.2 Confusion matrix on the test set for the Boolean classifier to recognize the digit zero after addition of 5000 new features.
In this example the algorithm takes several tens of iterations to converge, i.e., until the stopping criterion for the nonlinear least squares problem is satisfied. But in this application we are more interested in the performance of the classifier than in minimizing the objective of the nonlinear least squares problem. Figure 18.19 shows the classification error of the classifier (on the train and test data sets) with parameter $\theta^{(k)}$, the $k$th iterate of the Levenberg–Marquardt algorithm. We can see that the classification errors reach their final values of 0.7% after just a few iterations. This phenomenon is very typical in nonlinear data fitting problems.
Feature engineering. After adding the 5000 random features used in chapter 14, we obtain the classification errors shown in figure 18.20. The error on the training set is zero for small $\lambda$. For $\lambda = 1000$, the error on the test set is 0.24%, with the confusion matrix given in table 18.2. The distribution of $\hat f(x_i)$ on the training set in figure 18.21 shows why the training error is zero. Figure 18.22 shows the classification errors versus Levenberg–Marquardt iteration.
[Figure 18.20: Train and test classification error (%) versus $\lambda$ after addition of the 5000 new features.]
[Figure 18.21: The distribution of the values of $\hat f(x_i)$ (positive and negative examples) used in the Boolean classifier (14.1) for recognizing the digit zero, after addition of 5000 new features. The function $\hat f$ was computed by solving the nonlinear least squares problem (18.15).]
[Figure 18.22: Train and test classification error versus Levenberg–Marquardt iteration, after addition of the 5000 new features.]
Multi-class classifier. Next we apply the nonlinear least squares method to the multi-class classification problem of recognizing the ten digits in the MNIST data set. For each digit $k$, we compute a Boolean classifier $\hat f_k(x) = x^T \beta_k + v_k$ by solving a regularized nonlinear least squares problem (18.16). The same value of $\lambda$ is used in the ten nonlinear least squares problems. The Boolean classifiers are combined into a multi-class classifier
$$\hat f(x) = \underset{k = 1, \ldots, 10}{\mathrm{argmax}} \; \hat f_k(x).$$
Figure 18.23 shows the classification errors versus $\lambda$. The test set confusion matrix (for $\lambda = 1$) is given in table 18.3. The classification error on the test set is 7.6%, down from the 13.9% error we obtained for the same set of features with the least squares method of chapter 14.
Feature engineering. Figure 18.24 shows the error rates when we add the 5000 randomly generated features. The train and test error rates are now 0.02% and 2%. The test set confusion matrix for $\lambda = 1000$ is given in table 18.4. This classifier has matched human performance in classifying digits correctly. Further, or more sophisticated, feature engineering can bring the test error well below what humans achieve.
[Figure 18.23: Train and test classification error (%) versus $\lambda$ for the multi-class classifier.]
        Prediction
Digit     0     1     2     3     4     5     6     7     8     9   Total
0       964     0     0     2     0     2     5     3     3     1     980
1         0  1112     4     3     0     1     4     1    10     0    1135
2         5     5   934    13     7     3    13    10    38     4    1032
3         3     0    19   926     1    21     2     8    21     9    1010
4         1     2     4     2   917     0     7     1    10    38     982
5        10     2     2    31    10   782    17     7    23     8     892
6         8     3     3     1     5    20   910     1     7     0     958
7         2     6    25     5    11     5     0   947     4    23    1028
8        13    10     4    18    16    27     8     9   865     4     974
9         8     6     0    12    43    11     1    19    23   886    1009
All    1014  1146   995  1013  1010   872   967  1006  1004   973   10000

Table 18.3 Confusion matrix for the test set. The error rate is 7.6%.
[Figure 18.24: Train and test error rates (%) versus $\lambda$ after adding the 5000 features.]
        Prediction
Digit     0     1     2     3     4     5     6     7     8     9   Total
0       972     1     1     1     0     1     1     1     2     0     980
1         0  1124     2     2     0     0     3     1     3     0    1135
2         5     0  1006     1     3     0     2     6     9     0    1032
3         0     0     3   986     0     5     0     3     7     6    1010
4         0     0     4     1   966     0     4     1     0     6     982
5         2     0     2     5     2   875     5     0     1     0     892
6         7     2     0     1     3     2   941     0     2     0     958
7         1     7     6     1     2     0     0  1003     3     5    1028
8         3     0     0     4     4     5     3     4   949     2     974
9         2     5     0     5     6     4     1     6     2   978    1009
All     992  1139  1024  1007   986   892   960  1025   978   997   10000

Table 18.4 Confusion matrix for the test set after adding 5000 features. The error rate is 2.0%.
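The stated error rates can be recomputed from the table diagonals, since correct predictions lie on the diagonal:

```python
import numpy as np

# Recompute the multi-class test error rates from the diagonals of
# tables 18.3 and 18.4.
diag_183 = np.array([964, 1112, 934, 926, 917, 782, 910, 947, 865, 886])
diag_184 = np.array([972, 1124, 1006, 986, 966, 875, 941, 1003, 949, 978])
total = 10000

err_183 = 100.0 * (total - diag_183.sum()) / total
err_184 = 100.0 * (total - diag_184.sum()) / total
print(err_183, err_184)   # about 7.6 and 2.0 percent
```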
Chapter 19
Constrained nonlinear least squares
$$\begin{array}{ll} \text{minimize} & \|f(x)\|^2 \\ \text{subject to} & g(x) = 0, \end{array} \qquad (19.1)$$
where the n-vector x is the variable to be found. Here f (x) is an m-vector, and
$g(x)$ is a $p$-vector. We sometimes write out the components of $f(x)$ and $g(x)$, to express the problem as
$$\begin{array}{ll} \text{minimize} & f_1(x)^2 + \cdots + f_m(x)^2 \\ \text{subject to} & g_i(x) = 0, \quad i = 1, \ldots, p. \end{array}$$
We refer to fi (x) as the ith (scalar) residual, and gi (x) = 0 as the ith (scalar)
equality constraint. When the functions f and g are affine, the equality constrained
nonlinear least squares problem (19.1) reduces to the (linear) least squares problem
with equality constraints from chapter 16.
We say that a point $x$ is feasible for the problem (19.1) if it satisfies $g(x) = 0$. A point $\hat x$ is a solution of the problem (19.1) if it is feasible and has smallest objective among all feasible points, i.e., whenever $g(x) = 0$, we have $\|f(\hat x)\|^2 \le \|f(x)\|^2$.
332 19 Constrained nonlinear least squares
Like the nonlinear least squares problem, or solving a set of nonlinear equations,
the constrained nonlinear least squares problem is in general hard to solve exactly.
But the Levenberg-Marquardt algorithm for solving the (unconstrained) nonlinear
least squares problem (18.2) can be leveraged to handle the problem with equality
constraints. We will describe a basic algorithm below, the penalty algorithm, and
a variation on it that works much better in practice, the augmented Lagrangian al-
gorithm. These algorithms are heuristics for (approximately) solving the nonlinear
least squares problem (19.1).
Linear equality constraints. One special case of the constrained nonlinear least
squares problem (19.1) is when the constraint function g is affine, in which case
the constraints $g(x) = 0$ can be written $Cx = d$ for some $p \times n$ matrix $C$ and
a p-vector d. In this case the problem (19.1) is called a nonlinear least squares
problem with linear equality constraints. It can be (approximately) solved by
the Levenberg-Marquardt algorithm described above, by simply adding the linear
equality constraints to the linear least squares problem that is solved in step 2.
The more challenging problem is the case when g is not affine.
Using Lagrange multipliers (see §C.3) we can derive a condition that any solution of the constrained nonlinear least squares problem (19.1) must satisfy. The Lagrangian for the problem (19.1) is
$$L(x, z) = \|f(x)\|^2 + z_1 g_1(x) + \cdots + z_p g_p(x) = \|f(x)\|^2 + g(x)^T z, \qquad (19.2)$$
where the $p$-vector $z$ is the vector of Lagrange multipliers. The method of Lagrange multipliers tells us that for any solution $\hat x$ of (19.1), there is a set of Lagrange multipliers $\hat z$ that satisfies
$$\frac{\partial L}{\partial x_i}(\hat x, \hat z) = 0, \quad i = 1, \ldots, n, \qquad
\frac{\partial L}{\partial z_i}(\hat x, \hat z) = 0, \quad i = 1, \ldots, p$$
(provided the rows of $Dg(\hat x)$ are linearly independent). The $p$-vector $\hat z$ is called an optimal Lagrange multiplier.
The second set of equations can be written as $g_i(\hat x) = 0$, $i = 1, \ldots, p$, or in vector form
$$g(\hat x) = 0, \qquad (19.3)$$
i.e., $\hat x$ is feasible, which we already knew. The first set of equations can be written in vector form as
$$2\, Df(\hat x)^T f(\hat x) + Dg(\hat x)^T \hat z = 0. \qquad (19.4)$$
This equation is the extension of the condition (18.3) for the unconstrained non-
linear least squares problem (18.2). The equation (19.4), together with (19.3), i.e.,
x is feasible, form the optimality conditions for the problem (19.1).
If x is a solution of the constrained nonlinear least squares problem (19.1), then
it satisfies the optimality condition (19.4) for some Lagrange multiplier vector z
19.2 Penalty algorithm 333
(provided the rows of Dg(x) are linearly independent). So x and z satisfy the
optimality conditions.
These optimality conditions are not sufficient, however; there can be choices of
x and z that satisfy them, but x is not a solution of the constrained nonlinear least
squares problem.
1. Solve unconstrained nonlinear least squares problem. Set $x^{(k+1)}$ to be the (approximate) minimizer of
$$\|f(x)\|^2 + \mu^{(k)} \|g(x)\|^2,$$
using the Levenberg–Marquardt algorithm, starting from the point $x^{(k)}$.

The penalty algorithm is stopped early if $\|g(x^{(k)})\|$ is small enough, i.e., the equality constraint is almost satisfied.
The penalty algorithm is simple and easy to implement, but has an important drawback: The parameter $\mu^{(k)}$ rapidly increases with the iterations (as it must, to drive $g(x^{(k)})$ to zero).
Defining
$$z^{(k+1)} = 2 \mu^{(k)} g(x^{(k+1)})$$
as our estimate of a suitable Lagrange multiplier in iteration $k + 1$, we see that the optimality condition (19.4) (almost) holds for $x^{(k+1)}$ and $z^{(k+1)}$. (The feasibility condition $g(x^{(k)}) = 0$ only holds in the limit as $k \to \infty$.)
Augmented Lagrangian. The augmented Lagrangian for the problem (19.1), with parameter $\mu > 0$, is defined as
$$L_\mu(x, z) = L(x, z) + \mu \|g(x)\|^2 = \|f(x)\|^2 + g(x)^T z + \mu \|g(x)\|^2. \qquad (19.7)$$
This is the Lagrangian, augmented with the new term $\mu \|g(x)\|^2$; alternatively, it can be interpreted as the composite objective function (19.5) used in the penalty algorithm, with the Lagrange multiplier term $g(x)^T z$ added.
The augmented Lagrangian (19.7) is also the ordinary Lagrangian associated with the problem
$$\begin{array}{ll} \text{minimize} & \|f(x)\|^2 + \mu \|g(x)\|^2 \\ \text{subject to} & g(x) = 0. \end{array}$$
This problem is equivalent to the original constrained nonlinear least squares problem (19.1): A point $x$ is a solution of one if and only if it is a solution of the other. (This follows since the term $\mu \|g(x)\|^2$ is zero for any feasible $x$.)
1. Solve unconstrained nonlinear least squares problem. Set $x^{(k+1)}$ to be the (approximate) minimizer of
$$\|f(x)\|^2 + \mu^{(k)} \left\| g(x) + z^{(k)}/(2\mu^{(k)}) \right\|^2,$$
using the Levenberg–Marquardt algorithm, starting from the point $x^{(k)}$.
2. Update $z^{(k)}$:
$$z^{(k+1)} = z^{(k)} + 2 \mu^{(k)} g(x^{(k+1)}).$$
3. Update $\mu^{(k)}$:
$$\mu^{(k+1)} = \begin{cases} \mu^{(k)} & \|g(x^{(k+1)})\| < 0.25\, \|g(x^{(k)})\| \\ 2 \mu^{(k)} & \|g(x^{(k+1)})\| \ge 0.25\, \|g(x^{(k)})\|. \end{cases}$$
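The steps above can be sketched on a small hypothetical problem: minimize $\|x - a\|^2$ subject to $g(x) = x^T x - 1 = 0$, whose solution is $a/\|a\|$ with optimal multiplier $\|a\| - 1$. The inner minimization in step 1 is done here by a few Gauss–Newton steps on the equivalent stacked residual $(f(x),\ \sqrt{\mu}\,(g(x) + z/(2\mu)))$; all concrete values are illustrative assumptions.

```python
import numpy as np

# Augmented Lagrangian sketch on a hypothetical problem:
#   minimize ||x - a||^2  subject to  g(x) = x'x - 1 = 0,
# whose solution is a/||a|| with optimal multiplier ||a|| - 1.
a = np.array([2.0, 0.0])

def g(x):
    return x @ x - 1.0

def inner_min(x, z, mu, steps=20):
    # Gauss-Newton on the stacked residual (f(x), sqrt(mu)(g(x) + z/(2 mu))).
    for _ in range(steps):
        F = np.concatenate([x - a, [np.sqrt(mu) * (g(x) + z / (2 * mu))]])
        J = np.vstack([np.eye(2), 2 * np.sqrt(mu) * x])
        x = x + np.linalg.lstsq(J, -F, rcond=None)[0]
    return x

x, z, mu = np.array([0.0, 0.5]), 0.0, 1.0
for _ in range(30):
    x_new = inner_min(x, z, mu)
    z = z + 2 * mu * g(x_new)              # step 2: multiplier update
    if abs(g(x_new)) >= 0.25 * abs(g(x)):  # step 3: raise mu if g stagnates
        mu = 2 * mu
    x = x_new

print(x, z)   # near (1, 0) with multiplier z near 1
```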
Figure 19.1 shows the contour lines of the cost function $\|f(x)\|^2$ (solid lines) and the constraint function $g(x)$ (dashed lines). The point $\hat x = (0, 0)$ is optimal, with corresponding Lagrange multiplier $\hat z = -2$. One can verify that $g(\hat x) = 0$ and
$$2\, Df(\hat x)^T f(\hat x) + Dg(\hat x)^T \hat z
= 2 \begin{bmatrix} 1 & 0 \\ -1 & 2 \end{bmatrix} \begin{bmatrix} 1 \\ 1 \end{bmatrix}
- 2 \begin{bmatrix} 1 \\ 1 \end{bmatrix} = 0.$$
The vertical jumps in the optimality condition norm occur in steps 2 and 3 of the augmented Lagrangian algorithm, and in step 2 of the penalty algorithm, when the parameters $\mu$ and $z$ are updated. Figure 19.5 shows the value of the penalty parameter versus the cumulative number of Levenberg–Marquardt iterations in the two algorithms.
19.4 Nonlinear control 337
[Figure 19.1: Contour lines of the cost function $\|f(x)\|^2$ (solid lines, over $(x_1, x_2)$) and the constraint function $g(x)$ (dashed lines, at levels $g(x) = -1, 0, 1$) for a nonlinear least squares problem in two variables with one equality constraint.]
$$x_{k+1} = f(x_k, u_k), \quad k = 1, 2, \ldots, N,$$
where the $n$-vector $x_k$ is the state, and the $m$-vector $u_k$ is the input or control, at time period $k$. The function $f : \mathbf{R}^{n+m} \to \mathbf{R}^n$ specifies what the next state is, as a function of the current state and the current input. When $f$ is an affine function, this reduces to a linear dynamical system.
In nonlinear control, the goal is to choose the inputs $u_1, \ldots, u_N$ to achieve some goal for the state and input trajectories. In many problems the initial state $x_1$ is given, and the final state $x_N$ is specified. Subject to these constraints, we may wish the control inputs to be small and smooth, which suggests that we minimize
$$\sum_{k=1}^{N} \|u_k\|^2 + \gamma \sum_{k=1}^{N-1} \|u_{k+1} - u_k\|^2,$$
where $\gamma > 0$ is a parameter used to trade off input size and smoothness. (In many nonlinear control problems the objective also involves the state trajectory.)
We can formulate the nonlinear control problem, with a norm squared objective that involves the state and input, as a large constrained nonlinear least squares problem, and then solve it using the augmented Lagrangian algorithm. We illustrate this with a specific example.
[Figures 19.2 and 19.3: Iterates $x^{(2)}, \ldots, x^{(7)}$ of the augmented Lagrangian algorithm and of the penalty algorithm for the example of figure 19.1, the latter with penalty parameter values $\mu^{(1)} = 1$, $\mu^{(2)} = 2$, $\mu^{(3)} = 4$, $\mu^{(4)} = 8$, $\mu^{(5)} = 16$, $\mu^{(6)} = 32$.]
[Figure 19.4: Feasibility and optimality condition residuals versus the cumulative number of Levenberg–Marquardt iterations in the augmented Lagrangian algorithm (top) and the penalty algorithm (bottom).]
[Figure 19.5: Penalty parameter versus the cumulative number of Levenberg–Marquardt iterations for the augmented Lagrangian and penalty algorithms.]

[Figure 19.6: A car with position $p = (p_1, p_2)$, orientation $\theta$, wheelbase $L$, and steering angle $\phi$.]
Control of a car. Consider a car with position $p = (p_1, p_2)$ and orientation (angle) $\theta$. The car has wheelbase (length) $L$, steering angle $\phi$, and speed $s$ (which can be negative, meaning the car moves in reverse). This is illustrated in figure 19.6. The wheelbase $L$ is a known constant; all of the other quantities $p$, $\theta$, $\phi$, and $s$ are functions of time. The dynamics of the car motion are given by the differential equations
$$\frac{dp_1}{dt}(t) = s(t) \cos\theta(t), \qquad
\frac{dp_2}{dt}(t) = s(t) \sin\theta(t), \qquad
\frac{d\theta}{dt}(t) = (s(t)/L) \tan\phi(t).$$
Here we assume that the steering angle $\phi$ is always less than $90^\circ$, so the tangent term in the last equation makes sense. The first two equations state that the car is moving in the direction $\theta(t)$ (its orientation) at speed $s(t)$. The last equation gives
the change in orientation as a function of the car speed and the steering angle. For
a fixed steering angle and speed, the car moves in a circle.
We can control the speed $s$ and the steering angle $\phi$; the goal is to move the car over some time period from a given initial position and orientation to a specified final position and orientation.
We now discretize the equations in time. We take a small time interval h, and
obtain the approximations

p1((k + 1)h) ≈ p1(kh) + h s(kh) cos θ(kh),
p2((k + 1)h) ≈ p2(kh) + h s(kh) sin θ(kh),
θ((k + 1)h) ≈ θ(kh) + h (s(kh)/L) tan φ(kh).

We will use these approximations to derive nonlinear state equations for the car
motion, with state xk = (p1(kh), p2(kh), θ(kh)) and input uk = (s(kh), φ(kh)). We
have

xk+1 = f(xk, uk),

with

f(xk, uk) = xk + h (uk)1 ( cos (xk)3, sin (xk)3, (tan (uk)2)/L ).
We now consider the nonlinear optimal control problem

minimize    ∑k=1,...,N ‖uk‖^2 + ∑k=1,...,N−1 ‖uk+1 − uk‖^2
subject to  x2 = f(0, u1)                                   (19.12)
            xk+1 = f(xk, uk), k = 2, . . . , N − 1
            xfinal = f(xN, uN),
Figure 19.7 shows solutions of this problem for different values of xfinal. They are
computed using the augmented Lagrangian algorithm. The algorithm is started at
the same starting point for each example: the starting point for the input variables
uk is randomly chosen, and the starting point for the states xk is zero.
Figure 19.7 Solution trajectories of (19.12) for different end states xfinal. The
outline of the car shows the position (p1(kh), p2(kh)), orientation θ(kh), and
the steering angle φ(kh) at time kh.
Figure 19.8 The two inputs (speed and steering angle) for the trajectories
in figure 19.7.
[Figure: feasibility and optimality condition residuals versus cumulative Levenberg-Marquardt iterations.]
Notation
Vectors
[x1 . . . xn]^T (column vector) n-vector with entries x1, . . . , xn.
(x1 , . . . , xn ) n-vector with entries x1 , . . . , xn .
xi The ith element of a vector x.
xr:s Subvector with entries from r to s.
0 Vector with all entries zero.
1 Vector with all entries one.
ei The ith standard basis vector.
xT y Inner product of vectors x and y.
kxk Norm of vector x.
rms(x) RMS value of a vector x.
avg(x) Average of entries of a vector x.
std(x) Standard deviation of a vector x.
dist(x, y) Distance between vectors x and y.
∠(x, y) Angle between vectors x and y.
x ⊥ y Vectors x and y are orthogonal.
Matrices
[X11 . . . X1n; . . . ; Xm1 . . . Xmn] m × n matrix with entries X11, . . . , Xmn.
Ellipsis notation
In this book we use standard mathematical ellipsis notation in lists and sums. We
write k, . . . , l to mean the list of all integers from k to l. For example, 3, . . . , 7
means 3, 4, 5, 6, 7. This notation is used to describe a list of numbers or vectors,
or in sums, as in ∑i=1,...,n ai, which we also write as a1 + · · · + an. Both of these
mean the sum of the n terms a1, a2, . . . , an.
Sets
In a few places in this book we encounter the mathematical concept of sets. The
notation {a1 , . . . , an } refers to a set with elements a1 , . . . , an . This is not the
same as the vector with elements a1 ,. . . , an , which is denoted (a1 , . . . , an ). For
sets the order does not matter, so, for example, we have {1, 2, 6} = {6, 1, 2}. We
can also specify a set by giving conditions that its entries must satisfy, using the
notation {x | condition(x)}, which means the set of x that satisfy the condition,
which depends on x. We say that a set contains its elements, or that the elements
are in the set, using the symbol ∈, as in 2 ∈ {1, 2, 6}. The symbol ∉ means not in,
or not an element of, as in 3 ∉ {1, 2, 6}.
We can use sets to describe a sum over some elements in a list. The notation
∑i∈S xi means the sum over all xi for which i is in the set S. As an example,
∑i∈{1,2,6} ai means a1 + a2 + a6.
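A sum over an index set maps directly to code; a small Python sketch (the values of the ai below are made up for illustration):

```python
a = {1: 10, 2: 20, 3: 30, 6: 60}   # a_i for a few indices i
S = {1, 2, 6}

# sum over i in S of a_i -- since S is a set, the order of its elements
# does not matter
total = sum(a[i] for i in S)        # 10 + 20 + 60 = 90
```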
A few sets have specific names: R is the set of real numbers (or scalars), and
Rn is the set of all n-vectors. So α ∈ R means that α is a number, and x ∈ Rn
means that x is an n-vector.
Appendix B
Complexity
Vector operations

αx        n
x + y     n
x^T y     2n
‖x‖       2n
‖x − y‖   3n
rms(x)    2n
std(x)    4n
∠(x, y)   6n

Matrix operations

αA        mn
A + B     mn
Ax        2mn
AC        2mnp
A^T A     mn^2
Big-times-small-squared mnemonic
Many of the complexities listed above that involve two dimensions can be remem-
bered using a simple mnemonic: The cost is
2 × (big) × (small)^2 flops,
where big and small refer to big and small problem dimensions. We list some
specific examples of this rule below.
In the QR factorization of an m × n matrix, we have m ≥ n, so m is the big
dimension and n is the small dimension. The complexity is 2mn^2 flops.
Computing the pseudo-inverse A† of an m × n matrix A when A is tall
(and has independent columns) costs 2mn^2 flops. When A is wide (and has
independent rows), it is 2nm^2 flops.
For least squares, we have m ≥ n, so m is the big dimension and n is the small
dimension. The cost of computing the least squares approximate solution is
2mn^2 flops.
For the least norm problem, we have p ≤ n, so n is the big dimension and p
is the small dimension. The cost is 2np^2 flops.
The constrained least squares problem involves two matrices A and C, and
three dimensions that satisfy m + p ≥ n. The numbers m + p and n are the
big and small dimensions of the stacked matrix [A; C] (A stacked above C).
The cost of solving the constrained least squares problem is 2(m + p)n^2 flops.
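The mnemonic is easy to encode and check against the specific complexities listed above; a small Python sketch:

```python
def big_times_small_squared(big, small):
    """Flop count 2 * (big) * (small)^2 from the mnemonic."""
    return 2 * big * small ** 2

# QR factorization of a tall m x n matrix (m >= n): 2mn^2 flops.
m, n = 1000, 30
qr_flops = big_times_small_squared(m, n)          # 2 * 1000 * 900 = 1_800_000

# Least norm problem with p equations in n variables (p <= n): 2np^2 flops.
n, p = 500, 20
least_norm_flops = big_times_small_squared(n, p)  # 2 * 500 * 400 = 400_000
```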
Appendix C
Derivatives and optimization
Calculus does not play a big role in this book, except in chapters 18 and 19 (on
nonlinear least squares and constrained nonlinear least squares), where we use
derivatives, Taylor approximations, and the method of Lagrange multipliers. In
this appendix we collect some basic material about derivatives and optimization,
focussing on the few results and formulas we use.
C.1 Derivatives
C.1.1 Scalar-valued function of a scalar
Suppose f is a scalar-valued function of a scalar variable. The quantity

lim t→0 ( f(z + t) − f(z) ) / t

(if the limit exists) is called the derivative of the function f at the point z. It gives
the slope of the graph of f at the point (z, f (z)). We denote the derivative of f at
z as f 0 (z). We can think of f 0 as a scalar-valued function of a scalar variable; this
function is called the derivative (function) of f .
Taylor approximation. Let us fix the number z. The (first-order) Taylor approx-
imation of the function f at the point z is defined as

f̂(x) = f(z) + f′(z)(x − z).

We sometimes write the Taylor approximation with a second argument, separated
by a semicolon, to denote the point z where the approximation is made. Using this
notation, the left-hand side of the equation above is written f̂(x; z). The Taylor
approximation is sometimes called the linearized approximation of f at z. (Here
linear uses informal mathematical language, where affine is sometimes called linear.)
The Taylor approximation function f̂ is affine, i.e., a linear function plus a constant.
The Taylor approximation f̂ satisfies f̂(z; z) = f(z), i.e., at the point z it agrees
with the function f. For x near z, f̂(x; z) is a very good approximation of f(x).
For x not close enough to z, however, the approximation can be poor.
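A quick numerical illustration of these properties, using f(x) = exp(x) (chosen here only as an example):

```python
import math

def f(x):
    return math.exp(x)

def fhat(x, z):
    # First-order Taylor approximation of f at z: f(z) + f'(z) (x - z).
    # For f = exp, the derivative is f'(z) = exp(z).
    return math.exp(z) + math.exp(z) * (x - z)

z = 0.0
# At x = z the approximation agrees with f exactly.
exact = fhat(z, z) - f(z)                 # 0.0
# Near z the approximation is very good ...
near = abs(fhat(0.1, z) - f(0.1))         # about 5e-3
# ... but far from z it can be poor.
far = abs(fhat(2.0, z) - f(2.0))          # about 4.4
```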
Partial derivative. The partial derivative of f at the point z, with respect to its
ith argument, is defined as

∂f/∂xi (z) = lim t→0 ( f(z1, . . . , zi−1, zi + t, zi+1, . . . , zn) − f(z) ) / t
           = lim t→0 ( f(z + t ei) − f(z) ) / t,
(if the limit exists). Roughly speaking, the partial derivative is the derivative with
respect to the ith argument, with all other arguments fixed.

Gradient. The partial derivatives of f at the point z can be collected into an
n-vector, the gradient of f (at z):

∇f(z) = ( ∂f/∂x1 (z), . . . , ∂f/∂xn (z) ).

Taylor approximation. The (first-order) Taylor approximation of f at the point
z is the function

f̂(x) = f(z) + ∂f/∂x1 (z)(x1 − z1) + · · · + ∂f/∂xn (z)(xn − zn)
     = f(z) + ∇f(z)^T (x − z).
As an example, consider the function f(x) = x1^2 + · · · + xn^2 = ‖x‖^2, which is
the sum of squares of the arguments. The partial derivatives are

∂f/∂xi (z) = 2zi, i = 1, . . . , n.

In vector notation, the gradient is

∇f(z) = 2z.
(Note the resemblance to the formula for the derivative of the square of a scalar
variable.)
There are rules for the gradient of a combination of functions similar to those
for functions of a scalar. For example, if f(x) = a g(x) + b h(x), we have

∇f(z) = a ∇g(z) + b ∇h(z).
Jacobian. Suppose f : Rn → Rm is a vector-valued function of an n-vector. The
partial derivatives of the components of f(x) with respect to the
components of x, evaluated at z, are arranged into an m × n matrix denoted Df(z),
called the derivative or Jacobian matrix of f at z. (In the notation Df (z), the
D and f go together; Df does not represent, say, a matrix-vector product.) The
derivative matrix is defined by
Df(z)ij = ∂fi/∂xj (z), i = 1, . . . , m, j = 1, . . . , n.
The rows of the Jacobian are ∇fi(z)^T, for i = 1, . . . , m. For m = 1, i.e., when
f is a scalar-valued function, the derivative matrix is a row vector of size n, the
transpose of the gradient of the function. The derivative matrix of a vector-valued
function of a vector is a generalization of the derivative of a scalar-valued function
of a scalar.
Finding Jacobians. We can always find the derivative matrix by calculating par-
tial derivatives of the entries of f with respect to the components of the argument
vector. In many cases the result simplifies using matrix-vector notation. As an
example, let us find the derivative of the (scalar-valued) function
h(x) = ‖f(x)‖^2 = f1(x)^2 + · · · + fm(x)^2. Its partial derivatives are

∂h/∂xj (z) = 2f1(z) ∂f1/∂xj (z) + · · · + 2fm(z) ∂fm/∂xj (z).

Arranging these to form the row vector Dh(z), we see we can write this using
matrix multiplication as

Dh(z) = 2f(z)^T Df(z).
(Note the analogy to the formula for the scalar-valued function of a scalar variable
h(x) = f(x)^2, which has derivative h′(z) = 2f′(z)f(z).)
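The identity Dh(z) = 2f(z)^T Df(z) can be checked numerically with finite differences; a small NumPy sketch, with an f invented purely for illustration:

```python
import numpy as np

def f(x):
    # An example vector-valued function f : R^2 -> R^2.
    return np.array([x[0] ** 2, x[0] * x[1]])

def Df(x):
    # Its Jacobian, computed by hand from the partial derivatives.
    return np.array([[2 * x[0], 0.0],
                     [x[1], x[0]]])

def h(x):
    return f(x) @ f(x)          # h(x) = ||f(x)||^2

def numerical_gradient(func, z, t=1e-6):
    # Finite-difference approximation of the partial derivatives of func at z.
    n = len(z)
    g = np.zeros(n)
    for i in range(n):
        e = np.zeros(n); e[i] = 1.0
        g[i] = (func(z + t * e) - func(z)) / t
    return g

z = np.array([1.0, 2.0])
analytic = 2 * f(z) @ Df(z)     # Dh(z) = 2 f(z)^T Df(z), as a row vector
numeric = numerical_gradient(h, z)
```

For this f and z, both `analytic` and `numeric` come out (approximately) equal to (12, 4).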
Many of the formulas for derivatives in the scalar case also hold for the vector
case, with scalar multiplication replaced with matrix multiplication (provided the
order of the terms is correct). As an example, consider the composition function
f(x) = g(h(x)), where h : Rn → Rk and g : Rk → Rm. The Jacobian or derivative
matrix of f at z is given by

Df(z) = Dg(h(z)) Dh(z).
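The chain rule can be verified numerically the same way; a sketch with g and h invented for illustration:

```python
import numpy as np

def h(x):                        # h : R^2 -> R^2
    return np.array([x[0] + x[1], x[0] * x[1]])

def Dh(x):
    return np.array([[1.0, 1.0],
                     [x[1], x[0]]])

def g(y):                        # g : R^2 -> R^2
    return np.array([np.sin(y[0]), y[0] * y[1]])

def Dg(y):
    return np.array([[np.cos(y[0]), 0.0],
                     [y[1], y[0]]])

def f(x):                        # the composition f = g o h
    return g(h(x))

def numerical_jacobian(func, z, t=1e-6):
    # Build the Jacobian column by column from finite differences.
    n = len(z)
    cols = []
    for i in range(n):
        e = np.zeros(n); e[i] = 1.0
        cols.append((func(z + t * e) - func(z)) / t)
    return np.column_stack(cols)

z = np.array([1.0, 2.0])
chain = Dg(h(z)) @ Dh(z)         # Df(z) = Dg(h(z)) Dh(z)
numeric = numerical_jacobian(f, z)
```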
C.2 Optimization
Derivative condition for minimization. Suppose h is a scalar-valued function of
a scalar argument. If x̂ minimizes h(x), we must have h′(x̂) = 0. This fact is easily
understood: If h′(x̂) ≠ 0, then by taking a point x slightly less than x̂ (if h′(x̂) > 0)
or slightly more than x̂ (if h′(x̂) < 0), we would obtain h(x) < h(x̂), which shows
that x̂ does not minimize h. This leads to the classic calculus-based method
for finding a minimizer of a function h: Find the derivative, and set it equal to
zero. One subtlety here is that there can be (and generally are) points that satisfy
h′(z) = 0, but are not minimizers of h. So we generally need to check which of the
solutions of h′(z) = 0 are in fact minimizers of h.
Gradient condition for minimization. This basic calculus-based method for find-
ing a minimizer of a scalar-valued function can be generalized to functions with
vector arguments. If the n-vector x̂ minimizes h : Rn → R, then we must have

∂h/∂xi (x̂) = 0, i = 1, . . . , n.

In vector notation, we must have

∇h(x̂) = 0.

As in the case of a scalar argument, this is easily seen to hold if x̂ minimizes h. Also
as in the case of a scalar argument, there can be points that satisfy ∇h(z) = 0 but
are not minimizers of h. So we need to check whether points found this way are in
fact minimizers of h.
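As an illustration of the gradient condition, for the least squares objective h(x) = ‖Ax − b‖^2 the condition ∇h(x̂) = 2A^T(Ax̂ − b) = 0 gives the normal equations; a NumPy sketch with random data:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)

# Setting the gradient of h(x) = ||Ax - b||^2 to zero gives the
# normal equations (A^T A) x = A^T b.
xhat = np.linalg.solve(A.T @ A, A.T @ b)

grad = 2 * A.T @ (A @ xhat - b)   # should be (numerically) zero
```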
Lagrange multipliers. We now consider the problem of minimizing a scalar-valued
function f of an n-vector x, subject to the requirement that the p constraints

gi(x) = 0, i = 1, . . . , p,

must hold, where gi : Rn → R are given functions. We can write the constraints
in compact vector form g(x) = 0, where g(x) = (g1(x), . . . , gp(x)), and express the
problem as

minimize f(x)
subject to g(x) = 0.

We seek a solution of this optimization problem, i.e., a point x̂ that satisfies g(x̂) =
0 (i.e., is feasible) and, for any other x that satisfies g(x) = 0, we have f(x̂) ≤ f(x).
The method of Lagrange multipliers is an extension of the derivative or gradient
conditions for (unconstrained) minimization that handles constrained optimization
problems.
KKT conditions. The KKT conditions (named for Karush, Kuhn, and Tucker)
state that if x̂ is a solution of the constrained optimization problem, then there is
a vector z that satisfies

∂L/∂xi (x̂, z) = 0, i = 1, . . . , n,    ∂L/∂zi (x̂, z) = 0, i = 1, . . . , p,

where L(x, z) = f(x) + z1 g1(x) + · · · + zp gp(x) is the Lagrangian function of the
problem, and the entries of z are called Lagrange multipliers.
(This is provided the rows of Dg(x̂) are linearly independent, a technical condition
we ignore.) As in the unconstrained case, there are pairs x, z that satisfy the KKT
conditions but x is not a solution of the constrained optimization problem.
The KKT conditions give us a method for solving the constrained optimization
problem that is similar to the approach for the unconstrained optimization
problem. We attempt to solve the KKT equations for x and z; then we check to
see if any of the points found are really solutions.
We can simplify the KKT conditions, and express them compactly using matrix
notation. The last p equations can be expressed as gi(x̂) = 0, which we already
knew. The first n can be expressed as

∇x L(x̂, z) = 0,

where ∇x denotes the gradient with respect to the xi arguments. This can be
written as

∇f(x̂) + z1 ∇g1(x̂) + · · · + zp ∇gp(x̂) = ∇f(x̂) + Dg(x̂)^T z = 0.

These conditions will hold for a solution of the problem (assuming the rows of Dg(x̂)
are independent). But there are points that satisfy them but are not solutions.
Appendix D
What's next
In this appendix we list some further topics of study that are closely related to the
material in this book, give a different perspective on the same material, complement
it, or provide useful extensions.
Mathematics
Probability and statistics. In this book we do not use probability and statistics,
even though we cover multiple topics that are traditionally addressed using ideas
from probability and statistics, including data fitting and classification, control,
state estimation, and portfolio optimization. Further study of many of the topics
in this book requires a background in basic probability and statistics, and we
strongly urge you to learn this material.
Abstract linear algebra. This book covers some of the most important basic ideas
from linear algebra, such as linear independence. In a more abstract course you
will learn about vector spaces, subspaces, nullspace and range. Eigenvalues and
singular values are very useful topics that we do not cover in this book.
Optimization. Least squares and its extensions, studied in this book, are special
cases of the more general convex optimization problem. Convex optimization
problems can be solved efficiently and exactly, and include a very wide range of
practically useful problems that arise in many application areas, including all of
the ones we have seen in this book. We would strongly urge you to learn convex
optimization, which is very widely used in many applications.
Applications
Machine learning and artificial intelligence. This book covers some of the basic
ideas of machine learning and artificial intelligence, including a first exposure to
clustering, data fitting, classification, validation, and feature engineering. In a fur-
ther course on this material, you will learn about unsupervised learning methods
beyond k-means, such as principal components analysis and nonnegative matrix
factorization, and more sophisticated clustering methods. You will also learn about
more sophisticated regression and classification methods, such as logistic regression
and the support vector machine, as well as methods for computing model parame-
ters that scale to extremely large scale problems. Additional topics might include
feature engineering and deep neural networks.
Linear dynamical systems, control, and estimation. We cover the very basics
of each of these topics; entire courses cover them in much more detail, including
applications in aerospace, navigation, and GPS.
Signal and image processing. Traditional signal processing, which is still very
widely used, focusses on convolution; more modern approaches use convex opti-
mization, especially in non-real-time applications, like image enhancement or med-
ical image reconstruction. You will find whole courses on signal processing for a
specific application area, like communications, speech, audio, and radar; for im-
age processing, there are whole courses on microscopy, computational photography,
tomography, and medical imaging.
Time series analysis. Time series analysis, and especially prediction, play an
important role in many application areas, including finance and supply chain opti-
mization. It is typically taught in a statistics or operations research course, or as
a specialty course in a specific area such as econometrics.
Index
diag, 96 return, 8
k-means return matrix, 92
algorithm, 60 attribute vector, 10
complexity, 65 audio
validation, 62 addition, 13
z-score, 212 mixing, 17, 102
auto-regressive model, 202
acute angle, 48
addition back substitution, 165
audio, 13 balancing chemical reactions, 126, 169
matrix, 98 basis, 75
vector, 11 dual, 163
adjacency matrix, 94, 108 functions, 190
advertising, 184, 267 orthonormal, 80
affine Ben Afleck, 71
approximation, 30 bill of materials, 13
combination, 16 birth rate, 133
function, 28, 121 block
algorithm matrix, 91, 145
k-means, 60 vector, 4
back substitution, 165 byte, 22, 103
constrained least squares, 273
Gauss-Newton, 306 Carl Friedrich Gauss, 306
Gram-Schmidt, 80, 155 cash flow
least norm, 276 discounted, 21
least squares, 183 replication, 17
Levenberg-Marquardt, 306 vector, 8, 77
Newton, 308 categorical feature, 213
solving linear equations, 166 Cauchy-Schwarz inequality, 47
aligned vectors, 48 centroid, 60
angle, 47 chain graph, 110
document, 49 chain rule, 150, 354, 357
orthogonal, 48 Chebyshev inequality, 39, 46
annualized chemical
return and risk, 281 equilibrium, 305
anti-aligned vectors, 48 reaction balance, 126
approximation circular difference matrix, 251
affine, 30 classification, 221
least squares, 178 handwritten digit, 226, 323
Taylor, 30 iris flower, 225
AR model, 202 closed-loop, 151
argmax, 235 cluster centroid, 60
argument of function, 25 clustering, 55
asset digits, 65
allocation, 279 objective, 58
alpha and beta, 194 optimal, 59
risk, 44 subtraction
annualized, 281 matrix, 98
RMS, 38 subvector, 4
deviation, 40 sum
prediction error, 42 vector, 11
ROC, 229 superposition, 26, 119
Ronald Fisher, 225 supply chain dynamics, 140
root-mean-square, 38 support vector machine, 224
round-off error, 85 survey response, 57
row vector, 90
linearly independent, 97 tall matrix, 90
Rudolf Kalman, 296
running sum, 101, 121 term frequency inverse document frequency,
214
scalar, 3 test data set, 204
scalar-matrix multiplication, 99 test set, 204
scalar-vector TFID, 214
multiplication, 15 thermal resistance, 128
score, 20 time series
seasonal component, 198 auto-regressive model, 202
second difference matrix, 149 de-trended, 195
segment, 18 vector, 7
shaping demand, 247 time-invariant, 131
short position, 7, 21 time-series
sigmoid function, 309 prediction validation, 210
sign function, 224 Toeplitz matrix, 112
signal, 7 topic discovery, 56, 68
singular matrix, 160 tracking, 291
skewed classifier, 229 training data set, 204
slice, 4 training set, 204
social network graph, 98 transpose, 97
sparse triangle inequality, 38, 41, 48
constrained least squares, 275 triangular matrix, 96, 164
least squares, 184
linear equations, 168, 276 uncorrelated, 50
matrix, 96 under-determined, 125
QR factorization, 156 unit vector, 5
vector, 23 upper triangular matrix, 96
sparse matrix
multiplication, 148 validation, 62, 203, 204
sparsity, 5 k-means, 62
spherical distance, 49 clustering, 62
square matrix, 90 limitations, 210
stacked set, 204
matrix, 91 Vandermonde matrix, 198
vector, 4 variable, 177
standard deviation, 43, 191 vector, 3
standardization, 46 addition, 11
standardized features, 212 affine combination, 16
state, 131 aligned, 48
state feedback control, 151 angle, 47
stemming, 10, 68 anti-aligned, 48
stop words, 10 basis, 75
straight-line fit, 192 block, 4
submatrix, 91 cash flow, 8, 77
subset clustering, 55
vector, 10 coefficients, 3