Machine Learning and Data Mining Notes
Lecture Notes
CSC 411/D11
Computer Science Department
University of Toronto
Version: February 6, 2012
Contents
Conventions and Notation
2 Linear Regression
2.1 The 1D case
2.2 Multidimensional inputs
2.3 Multidimensional outputs
3 Nonlinear Regression
3.1 Basis function regression
3.2 Overfitting and Regularization
3.3 Artificial Neural Networks
3.4 K-Nearest Neighbors
4 Quadratics
4.1 Optimizing a quadratic
7 Estimation
7.1 Learning a binomial distribution
7.2 Bayes’ Rule
7.3 Parameter estimation
7.3.1 MAP, ML, and Bayes’ Estimates
7.4 Learning Gaussians
8 Classification
8.1 Class Conditionals
8.2 Logistic Regression
8.3 Artificial Neural Networks
8.4 K-Nearest Neighbors Classification
8.5 Generative vs. Discriminative models
8.6 Classification by LS Regression
8.7 Naïve Bayes
8.7.1 Discrete Input Features
8.7.2 Learning
9 Gradient Descent
9.1 Finite differences
10 Cross Validation
10.1 Cross-Validation
11 Bayesian Methods
11.1 Bayesian Regression
11.2 Hyperparameters
11.3 Bayesian Model Selection
14 Lagrange Multipliers
14.1 Examples
14.2 Least-Squares PCA in one-dimension
14.3 Multiple constraints
14.4 Inequality constraints
15 Clustering
15.1 K-means Clustering
15.2 K-medoids Clustering
15.3 Mixtures of Gaussians
15.3.1 Learning
15.3.2 Numerical issues
15.3.3 The Free Energy
15.3.4 Proofs
15.3.5 Relation to K-means
15.3.6 Degeneracy
15.4 Determining the number of clusters
18 AdaBoost
18.1 Decision stumps
18.2 Why does it work?
18.3 Early stopping
Aside:
Text in “aside” boxes provides extra background or information that you are not
required to know for this course.
Acknowledgements
Graham Taylor and James Martens assisted with preparation of these notes.
1. The Artificial Intelligence View. Learning is central to human knowledge and intelligence,
and, likewise, it is also essential for building intelligent machines. Years of effort in AI
have shown that trying to build intelligent computers by programming all the rules cannot be
done; automatic learning is crucial. For example, we humans are not born with the ability
to understand language; we learn it, and it makes sense to try to have computers learn
language instead of trying to program it all in.
3. The Stats View. Machine learning is the marriage of computer science and statistics: com-
putational techniques are applied to statistical problems. Machine learning has been applied
to a vast number of problems in many contexts, beyond the typical statistics problems. Ma-
chine learning is often designed with different considerations than statistics (e.g., speed is
often more important than accuracy).
2. Application: The model is used to make decisions about some new test data.
For example, in the spam filtering case, the training data constitutes email messages labeled as ham
or spam, and each new email message that we receive (and which we wish to classify) is test data.
However, there are other ways in which machine learning is used as well.
1. Supervised Learning, in which the training data is labeled with the correct answers, e.g.,
“spam” or “ham.” The two most common types of supervised learning are classification
(where the outputs are discrete labels, as in spam filtering) and regression (where the outputs
are real-valued).
2. Unsupervised learning, in which we are given a collection of unlabeled data, which we wish
to analyze and discover patterns within. The two most important examples are dimension
reduction and clustering.
3. Reinforcement learning, in which an agent (e.g., a robot or controller) seeks to learn the
optimal actions to take based on the outcomes of past actions.
There are many other types of machine learning as well, for example:
4. Active learning, in which obtaining data is expensive, and so an algorithm must determine
which training data to acquire.
1. How do we parameterize the model we fit? For the example in Figure 1, how do we param-
eterize the curve; should we try to explain the data with a linear function, a quadratic, or a
sinusoidal curve?
2. What criteria (e.g., objective function) do we use to judge the quality of the fit? For example,
when fitting a curve to noisy data, it is common to measure the quality of the fit in terms of
the squared error between the data we are given and the fitted curve. When minimizing the
squared error, the resulting fit is usually called a least-squares estimate.
3. Some types of models and some model parameters can be very expensive to optimize well.
How long are we willing to wait for a solution, or can we use approximations (or hand-
tuning) instead?
4. Ideally we want to find a model that will provide useful predictions in future situations. That
is, although we might learn a model from training data, we ultimately care about how well
it works on future test data. When a model fits training data well, but performs poorly on
test data, we say that the model has overfit the training data; i.e., the model has fit properties
of the input that are not particularly relevant to the task at hand (e.g., Figure 1, top row and
bottom left). Such properties are referred to as noise. When this happens we say that the
model does not generalize well to the test data. Rather, it produces predictions on the test
data that are much less accurate than you might have hoped for given the fit to the training
data.
Machine learning provides a wide selection of options by which to answer these questions,
along with the vast experience of the community as to which methods tend to be successful on
a particular class of data-set. Some more advanced methods provide ways of automating some
of these choices, such as automatically selecting between alternative models, and there is some
beautiful theory that assists in gaining a deeper understanding of learning. In practice, there is no
single “silver bullet” for all learning. Using machine learning in practice requires that you make
use of your own prior knowledge and experimentation to solve problems. But with the tools of
machine learning, you can do amazing things!
Figure 1: A simple regression problem. The blue circles are measurements (the training data), and
the red curves are possible fits to the data. There is no one “right answer;” the solution we prefer
depends on the problem. Ideally we want to find a model that provides good predictions for new
inputs (i.e., locations on the x-axis for which we had no training data). We will often prefer simple,
smooth models like that in the lower right.
2 Linear Regression
In regression, our goal is to learn a mapping from one real-valued space to another. Linear re-
gression is the simplest form of regression: it is easy to understand, often quite effective, and very
efficient to learn and use.
To estimate w and b, we solve for the w and b that minimize this objective function. This can be
done by setting the derivatives to zero and solving.
$$\frac{dE}{db} = -2 \sum_i \left( y_i - (w x_i + b) \right) = 0 \tag{3}$$
Figure 2: An example of linear regression: the red line is fit to the blue data points.
where we define x̄ and ȳ as the averages of the x’s and y’s, respectively. This equation for b∗ still
depends on w, but we can nevertheless substitute it back into the energy function:
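$$E(w, b) = \sum_i \left( (y_i - \bar{y}) - w (x_i - \bar{x}) \right)^2 \tag{6}$$
Then:
$$\frac{dE}{dw} = -2 \sum_i \left( (y_i - \bar{y}) - w (x_i - \bar{x}) \right) (x_i - \bar{x}) \tag{7}$$
Solving $\frac{dE}{dw} = 0$ then gives:
$$w^* = \frac{\sum_i (y_i - \bar{y})(x_i - \bar{x})}{\sum_i (x_i - \bar{x})^2} \tag{8}$$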
The values w∗ and b∗ are the least-squares estimates for the parameters of the linear regression.
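As a concrete illustration, the closed-form estimates can be computed in a few lines. The following is a minimal Python sketch, not part of the original notes; the function name fit_line and the sample data are hypothetical. It uses Equation 8 for the slope, and b∗ = ȳ − w∗x̄, which follows from setting the derivative in Equation 3 to zero.

```python
def fit_line(xs, ys):
    """Least-squares slope w and intercept b for the model y = w*x + b."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Numerator and denominator of Equation 8.
    num = sum((y - y_bar) * (x - x_bar) for x, y in zip(xs, ys))
    den = sum((x - x_bar) ** 2 for x in xs)
    w = num / den
    # Setting dE/db = 0 (Equation 3) gives b = y_bar - w * x_bar.
    b = y_bar - w * x_bar
    return w, b


# Hypothetical data lying exactly on the line y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
w, b = fit_line(xs, ys)
print(w, b)
```

The sketch assumes that the inputs are not all identical; otherwise the denominator of Equation 8 is zero and the slope is undefined.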
Footnote 1: Above we used subscripts to index the training set, while here we are using the subscript
to index the elements of the input and weight vectors. In what follows the context should make it
clear what the index denotes.
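For convenience, we can fold the bias b into the weights, if we augment the inputs with an
additional 1. In other words, if we define
$$\tilde{w} = \begin{bmatrix} w_1 \\ \vdots \\ w_D \\ b \end{bmatrix}, \quad \tilde{x} = \begin{bmatrix} x_1 \\ \vdots \\ x_D \\ 1 \end{bmatrix} \tag{10}$$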
If we stack the outputs in a vector and the inputs in a matrix, then we can also write this as:
where
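$$y = \begin{bmatrix} y_1 \\ \vdots \\ y_N \end{bmatrix}, \quad \tilde{X} = \begin{bmatrix} x_1^T & 1 \\ \vdots & \vdots \\ x_N^T & 1 \end{bmatrix} \tag{14}$$
and $\|\cdot\|$ is the usual Euclidean norm, i.e., $\|v\|^2 = \sum_i v_i^2$. (You should verify for yourself that
Equations 12 and 13 are equivalent.)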
Equation 13 is known as a linear least-squares problem, and can be solved by methods from
linear algebra. We can rewrite the objective function as:
We can optimize this by setting all values of dE/dwi = 0 and solving the resulting system of
equations (we will cover this in more detail later in Chapter 4; in the meantime, if this is unclear,
start by reviewing your linear algebra and vector calculus). The solution is given by:
(You may wish to verify for yourself that this reduces to the solution for the 1D case in Section
2.1; however, this takes quite a lot of linear algebra and a little cleverness.) The matrix
$\tilde{X}^+ \equiv (\tilde{X}^T \tilde{X})^{-1} \tilde{X}^T$ is called the pseudoinverse of $\tilde{X}$, and so the solution can also be written:
In MATLAB, one can directly solve the system of equations using the backslash operator:
w̃∗ = X̃\y (19)
There are some subtle differences between these two ways of solving the system of equations. We
will not concern ourselves with these here except to say that I recommend using the backslash
operator rather than the pseudoinverse.
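To make the multidimensional case concrete, here is a minimal Python sketch (standard library only) that augments each input with a 1 and solves the normal equations (X̃ᵀX̃)w̃ = X̃ᵀy by Gaussian elimination. The helper names solve and fit_linear and the sample data are hypothetical. It illustrates the closed-form solution only and is not a substitute for a numerically robust solver such as MATLAB's backslash operator.

```python
def solve(A, rhs):
    """Solve the square system A z = rhs by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [r] for row, r in zip(A, rhs)]  # augmented matrix [A | rhs]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for r in reversed(range(n)):  # back substitution
        s = sum(M[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (M[r][n] - s) / M[r][r]
    return z


def fit_linear(X, y):
    """Return [w_1, ..., w_D, b] minimizing the squared error ||y - X_tilde w_tilde||^2."""
    Xt = [list(row) + [1.0] for row in X]  # append 1 to each input to fold in the bias
    d = len(Xt[0])
    # Normal equations: (X_tilde^T X_tilde) w_tilde = X_tilde^T y
    A = [[sum(r[i] * r[j] for r in Xt) for j in range(d)] for i in range(d)]
    rhs = [sum(r[i] * yi for r, yi in zip(Xt, y)) for i in range(d)]
    return solve(A, rhs)


# Hypothetical data generated exactly from y = 1*x1 + 2*x2 + 3.
X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
y = [3.0, 4.0, 5.0, 6.0, 7.0]
w_tilde = fit_linear(X, y)
print(w_tilde)
```

The sketch assumes X̃ᵀX̃ is invertible, which requires at least D + 1 data points that do not all lie in a lower-dimensional affine subspace.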
where $\tilde{X}^T = [\tilde{x}_1 \; \tilde{x}_2 \; \ldots \; \tilde{x}_N]$. With a little thought you can see that this really amounts to K
distinct estimation problems, the solutions for which are given by $\tilde{w}_j^* = \tilde{X}^+ y'_j$.
Another common convention is to stack up everything into a matrix equation, i.e.,
$$E(\tilde{W}) = \|Y - \tilde{X}\tilde{W}\|_F^2 \tag{24}$$
where $Y = [y'_1 \ldots y'_K]$, and $\|\cdot\|_F$ denotes the Frobenius norm: $\|Y\|_F^2 = \sum_{i,j} Y_{i,j}^2$. You should
verify that Equations (23) and (24) are equivalent representations of the energy function in
Equation (22). Finally, the solution is again provided by the pseudoinverse:
$$\tilde{W}^* = \tilde{X}^+ Y \tag{25}$$
or, in MATLAB, W̃∗ = X̃\Y.
3 Nonlinear Regression
Sometimes linear models are not sufficient to capture the real-world phenomena, and thus nonlinear
models are necessary. In regression, all such models will have the same basic form, i.e.,
y = f (x) (26)
In linear regression, we have f (x) = Wx + b; the parameters W and b must be fit to data.
What nonlinear function do we choose? In principle, f (x) could be anything: it could involve
linear functions, sines and cosines, summations, and so on. However, the form we choose will
make a big difference to the effectiveness of the regression: a more general model will require
more data to fit, and different models are more appropriate for different problems. Ideally, the
form of the model would be matched exactly to the underlying phenomenon. If we’re modeling a
linear process, we’d use a linear regression; if we were modeling a physical process, we could, in
principle, model f (x) by the equations of physics.
In many situations, we do not know much about the underlying nature of the process being
modeled, or else modeling it precisely is too difficult. In these cases, we typically turn to a few
models in machine learning that are widely used and quite effective for many problems. These
methods include basis function regression (including Radial Basis Functions), Artificial Neural
Networks, and k-Nearest Neighbors.
There is one other important choice to be made, namely, the choice of objective function for
learning, or, equivalently, the underlying noise model. In this section we extend the LS estimators
introduced in the previous chapter to include one or more terms to encourage smoothness in the
estimated models. It is hoped that smoother models will tend to overfit the training data less and
therefore generalize somewhat better.
for the 1D case. The functions $b_k(x)$ are called basis functions. Often it will be convenient to
express this model in vector form, for which we define $b(x) = [b_1(x), \ldots, b_M(x)]^T$ and
$w = [w_1, \ldots, w_M]^T$ where M is the number of basis functions. We can then rewrite the model as
$$y = f(x) = b(x)^T w \tag{28}$$
Two common choices of basis functions are polynomials and Radial Basis Functions (RBF).
A simple, common basis for polynomials is the set of monomials, i.e.,
$$b_0(x) = 1, \quad b_1(x) = x, \quad b_2(x) = x^2, \quad b_3(x) = x^3, \quad \ldots \tag{29}$$
Footnote 2: In the machine learning and statistics literature, these representations are often referred
to as linear regression, since they are linear functions of the “features” $b_k(x)$.
Figure 3: The first three basis functions of a polynomial basis, and Radial Basis Functions
Radial Basis Functions and the resulting regression model are given by
$$b_k(x) = e^{-\frac{(x - c_k)^2}{2\sigma^2}}, \tag{31}$$
$$f(x) = \sum_k w_k e^{-\frac{(x - c_k)^2}{2\sigma^2}}, \tag{32}$$
where $c_k$ is the center (i.e., the location) of the basis function and $\sigma^2$ determines the width of the
basis function. Both of these are parameters of the model that must be determined somehow.
In practice there are many other possible choices for basis functions, including sinusoidal func-
tions, and other types of polynomials. Also, basis functions from different families, such as mono-
mials and RBFs, can be combined. We might, for example, form a basis using the first few poly-
nomials and a collection of RBFs. In general we ideally want to choose a family of basis functions
such that we get a good fit to the data with a small basis set so that the number of weights to be
estimated is not too large.
To fit these models, we can again use least-squares regression, by minimizing the sum of
squared residual error between model predictions and the training data outputs:
$$E(w) = \sum_i (y_i - f(x_i))^2 = \sum_i \left( y_i - \sum_k w_k b_k(x_i) \right)^2 \tag{33}$$
To minimize this function with respect to w, we note that this objective function has the same form
as that for linear regression in the previous chapter, except that the inputs are now the bk (x) values.
In particular, E is still quadratic in the weights w, and hence the weights w can be estimated the
same way. That is, we can rewrite the objective function in matrix-vector form to produce
where $\|\cdot\|$ denotes the Euclidean norm, and the elements of the matrix B are given by $B_{i,j} = b_j(x_i)$
(for row i and column j). In MATLAB the least-squares estimate can be computed as w∗ = B\y.
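The construction of B may be easier to see in code. The following Python sketch, with hypothetical names (rbf, design_matrix, predict) and hypothetical centers and width, builds the matrix B for 1D inputs with Gaussian basis functions (Equation 31) and evaluates the model of Equation 32 for a given weight vector. Solving for the weights themselves is then an ordinary linear least-squares problem in B.

```python
import math


def rbf(x, c, sigma2):
    """Equation 31: Gaussian basis function with center c and width parameter sigma2."""
    return math.exp(-(x - c) ** 2 / (2.0 * sigma2))


def design_matrix(xs, centers, sigma2):
    """B[i][j] = b_j(x_i): one row per data point, one column per basis function."""
    return [[rbf(x, c, sigma2) for c in centers] for x in xs]


def predict(x, w, centers, sigma2):
    """Equation 32: f(x) = sum_k w_k b_k(x)."""
    return sum(wk * rbf(x, c, sigma2) for wk, c in zip(w, centers))


centers = [0.0, 1.0, 2.0]  # hypothetical basis centers
sigma2 = 0.5               # hypothetical width parameter (sigma squared)
B = design_matrix([0.0, 1.0], centers, sigma2)
for row in B:
    print(row)
```

Each row of B is the feature vector b(x_i)ᵀ, so the vector of model predictions on the training inputs is Bw.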
Picking the other parameters. The positions of the centers and the widths of the RBF basis
functions cannot be solved for directly in closed form, so we need some other criteria to select
them. If we optimize these parameters for the squared error, then we will end up with one basis
center at each data point, each with a tiny width, so that the model exactly fits the data. This is a
problem, as such a model will not usually provide good predictions for inputs other than those in
the training set.
The following heuristics instead are commonly used to determine these parameters without
overfitting the training data. To pick the basis centers:
1. Place the centers uniformly spaced in the region containing the data. This is quite simple,
but can place basis functions in empty regions, and requires an impractical number of
centers in higher-dimensional input spaces.
2. Place one center at each data point. This is used more often, since it limits the number of
centers needed, although it can also be expensive if the number of data points is large.
3. Cluster the data, and use one center for each cluster. We will cover clustering methods later
in the course.
To pick the width parameter:
1. Manually try different values of the width and pick the best by trial-and-error.
2. Use the average squared distances (or median distances) to neighboring centers, scaled by a
constant, to be the width. This approach also allows you to use different widths for different
basis functions, and it allows the basis functions to be spaced non-uniformly.
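A rough Python sketch of two of these heuristics for 1D inputs follows: uniformly spaced centers, and a per-center width set to a scaled average squared distance to the adjacent centers. The function names, the scale parameter, and the use of only the immediately adjacent centers as "neighbors" are assumptions made for illustration.

```python
def uniform_centers(x_min, x_max, m):
    """Return m centers evenly spaced over [x_min, x_max]; assumes m >= 2."""
    step = (x_max - x_min) / (m - 1)
    return [x_min + k * step for k in range(m)]


def neighbor_widths(centers, scale=1.0):
    """For each center (in sorted order), return sigma^2 as the scaled mean
    squared distance to its adjacent centers."""
    cs = sorted(centers)
    widths = []
    for k, c in enumerate(cs):
        # Left and right neighbors, where they exist.
        neighbors = cs[max(k - 1, 0):k] + cs[k + 1:k + 2]
        mean_sq = sum((c - n) ** 2 for n in neighbors) / len(neighbors)
        widths.append(scale * mean_sq)
    return widths


centers = uniform_centers(0.0, 10.0, 6)
widths = neighbor_widths(centers, scale=1.0)
print(centers)
print(widths)
```

With uniformly spaced centers every width comes out the same. The per-center computation matters when the centers are spaced non-uniformly, for example when they come from clustering.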
In later chapters we will discuss other methods for determining these and other parameters of
models.
1. The problem is insufficiently constrained: for example, if we have ten measurements and ten
model parameters, then we can often obtain a perfect fit to the data.
2. Fitting noise: overfitting can occur when the model is so powerful that it can fit the data and
also the random noise in the data.
There are two important solutions to the overfitting problem: adding prior knowledge and handling
uncertainty. The latter one we will discuss later in the course.
In many cases, there is some sort of prior knowledge we can leverage. A very common as-
sumption is that the underlying function is likely to be smooth, for example, having small deriva-
tives. Smoothness distinguishes the examples in Figure 4. There is also a practical reason to
prefer smoothness, in that assuming smoothness reduces model complexity: it is easier to estimate
smooth models from small datasets. In the extreme, if we make no prior assumptions about the
nature of the fit then it is impossible to learn and generalize at all; smoothness assumptions are one
way of constraining the space of models so that we have any hope of learning from small datasets.
One way to add smoothness is to parameterize the model in a smooth way (e.g., making the
width parameter for RBFs larger; using only low-order polynomial basis functions), but this limits
the expressiveness of the model. In particular, when we have lots and lots of data, we would like
the data to be able to “overrule” the smoothness assumptions. With large widths, it is impossible
to get highly-curved models no matter what the data says.
Instead, we can add regularization: an extra term to the learning objective function that prefers
smooth models. For example, for RBF regression with scalar outputs, and with many other types
of basis functions or multi-dimensional outputs, this can be done with an objective function of the
form:
$$E(w) = \underbrace{\|y - Bw\|^2}_{\text{data term}} + \underbrace{\lambda \|w\|^2}_{\text{smoothness term}} \tag{35}$$
This objective function has two terms. The first term, called the data term, measures the model fit
to the training data. The second term, often called the smoothness term, penalizes non-smoothness
(rapid changes in f(x)). This particular smoothness term ($\lambda\|w\|^2$) is called weight decay, because it
tends to make the weights smaller (see footnote 3). Weight decay implicitly leads to smoothness
with RBF basis functions because the basis functions themselves are smooth, so rapid changes in the
slope of f (i.e., high curvature) can only be created in RBFs by adding and subtracting basis
functions with large weights. (Ideally, we might directly penalize non-smoothness, e.g., using an
objective term that directly penalizes the integral of the squared curvature of f(x), but this is
usually impractical.)
Footnote 3: Estimation with this objective function is sometimes called Ridge Regression in Statistics.
This regularized least-squares objective function is still quadratic with respect to w and can
be optimized in closed-form. To see this, we can rewrite it as follows:
To minimize E(w), as above, we solve the normal equations ∇E(w) = 0 (i.e., ∂E/∂wi = 0 for
all i). This yields the following regularized LS estimate for w:
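As a sketch of how this works out, the standard derivation starting from Equation 35 is given below. The identity matrix I (of size M × M, where M is the number of basis functions) is notation introduced here for illustration.

```latex
\begin{align*}
E(w) &= \|y - Bw\|^2 + \lambda \|w\|^2 \\
     &= w^T (B^T B + \lambda I) w - 2 y^T B w + y^T y \\
\nabla E(w) &= 2 (B^T B + \lambda I) w - 2 B^T y = 0 \\
w^* &= (B^T B + \lambda I)^{-1} B^T y
\end{align*}
```

With λ = 0 this reduces to the unregularized least-squares estimate. For any λ > 0 the matrix BᵀB + λI is positive definite and therefore invertible, even when BᵀB alone is not.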
This equation describes a process whereby a linear regressor with weights $w^{(2)}$ is applied to x.
The output of this regressor is then put through the nonlinear Sigmoid function, the outputs of
which act as features to another linear regressor. Thus, note that the inner weights $w^{(2)}$ are distinct
parameters from the outer weights $w_j^{(1)}$. As usual, it is easiest to interpret this model in the 1D
case, i.e.,
$$y = f(x) = \sum_j w_j^{(1)} g\left( w_j^{(2)} x + b_j^{(2)} \right) + b^{(1)} \tag{42}$$
Figure 5(left) shows plots of g(wx) for different values of w, and Figure 5(right) shows g(x+b)
for different values of b. As can be seen from the figures, the sigmoid function acts more or less
like a step function for large values of w, and more like a linear ramp for small values of w. The
bias b shifts the function left or right. Hence, the neural network is a linear combination of shifted
(smoothed) step functions, linear ramps, and the bias term.
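The 1D model can be written out directly. The following Python sketch evaluates Equation 42 for a network with two hidden units. The parameter values are hypothetical and chosen only so that one unit behaves like a steep step and the other like a gentle ramp. In the code, w1 and b1 denote the outer weights and bias (superscript (1) in the text), and w2 and b2 denote the inner ones (superscript (2)).

```python
import math


def g(a):
    """Sigmoid g(a) = 1 / (1 + e^{-a})."""
    return 1.0 / (1.0 + math.exp(-a))


def network(x, w1, b1, w2, b2):
    """Equation 42: y = sum_j w1[j] * g(w2[j] * x + b2[j]) + b1."""
    hidden = [g(w2j * x + b2j) for w2j, b2j in zip(w2, b2)]
    return sum(w1j * hj for w1j, hj in zip(w1, hidden)) + b1


# Hypothetical parameters for two hidden units.
w1, b1 = [1.0, -1.0], 0.5          # outer weights and outer bias
w2, b2 = [10.0, 0.1], [0.0, 0.0]   # inner weights (steep step, gentle ramp) and inner biases
y0 = network(0.0, w1, b1, w2, b2)
print(y0)
```

Note that math.exp(-a) overflows for very large negative a (roughly a < −709), so a practical implementation would guard against extreme inputs.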
To learn an artificial neural network, we can again write a regularized squared-error objective
function:
$$E(w, b) = \|y - f(x)\|^2 + \lambda \|w\|^2 \tag{43}$$
Figure 4: Least-squares curve fitting of an RBF. (a) Point data (blue circles) was taken from a sine
curve, and a curve was fit to the points by a least-squares fit. The horizontal axis is x, the vertical
axis is y, and the red curve is the estimated f (x). In this case, the fit is essentially perfect. The
curve representation is a sum of Gaussian basis functions. (b) Overfitting. Random noise was
added to the data points, and the curve was fit again. The curve exactly fits the data points, which
does not reproduce the original curve (a green, dashed line) very well. (c) Underfitting. Adding
a smoothness term makes the resulting curve too smooth. (In this case, weight decay was used,
along with reducing the number of basis functions). (d) Reducing the strength of the smoothness
term yields a better fit.
Figure 5: Left: Sigmoids $g(wx) = 1/(1 + e^{-wx})$ for various values of w, ranging from linear ramps
to smooth steps to nearly hard steps. Right: Sigmoids $g(x + b) = 1/(1 + e^{-x-b})$ with different
shifts b (the legend shows g(x−4), g(x), and g(x+4)).
where w comprises the weights at both levels for all j. Note that we regularize by applying weight
decay to the weights (both inner and outer), but not the biases, since only the weights affect the
smoothness of the resulting function (why?).
Unfortunately, this objective function cannot be optimized in closed form, and numerical
optimization procedures must be used. We will study one such method, gradient descent, in
Chapter 9.
where the set $N_K(x)$ contains the indices of the K training points closest to x. Alternatively, we
might take a weighted average of the K-nearest neighbors to give more influence to training points
close to x than to those further away:
$$y = \frac{\sum_{i \in N_K(x)} w(x_i) y_i}{\sum_{i \in N_K(x)} w(x_i)}, \quad w(x_i) = e^{-\|x_i - x\|^2 / (2\sigma^2)} \tag{45}$$
where $\sigma^2$ is an additional parameter to the algorithm. The parameters K and σ control the degree
of smoothing performed by the algorithm. In the extreme case of K = 1, the algorithm produces
a piecewise-constant function.
K-nearest neighbors is simple and easy to implement; it doesn’t require us to muck about at
all with different choices of basis functions or regularizations. However, it doesn’t compress the
data at all: we have to keep around the entire training set in order to use it, which could be very
expensive, and we must search the whole data set to make predictions. (The cost of searching
can be mitigated with spatial data structures designed for searching, such as k-d trees and
locality-sensitive hashing. We will not cover these methods here.)
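Both variants fit in one short function. The Python sketch below handles scalar inputs and uses a brute-force search over the whole training set, as described above. The function name knn_predict and the sample data are hypothetical. Passing sigma2 switches from the plain average to the weighted average of Equation 45.

```python
import math


def knn_predict(x, xs, ys, k, sigma2=None):
    """K-nearest-neighbor regression for scalar inputs; assumes 1 <= k <= len(xs).

    With sigma2=None, return the plain average of the K nearest outputs.
    Otherwise weight each neighbor by exp(-(x_i - x)^2 / (2 * sigma2)).
    """
    # Indices of the K training points closest to x (the set N_K(x)).
    nearest = sorted(range(len(xs)), key=lambda i: abs(xs[i] - x))[:k]
    if sigma2 is None:
        return sum(ys[i] for i in nearest) / k
    weights = [math.exp(-(xs[i] - x) ** 2 / (2.0 * sigma2)) for i in nearest]
    return sum(w * ys[i] for w, i in zip(weights, nearest)) / sum(weights)


xs = [0.0, 1.0, 2.0, 3.0]  # hypothetical training inputs
ys = [0.0, 1.0, 4.0, 9.0]  # hypothetical training outputs
plain = knn_predict(1.2, xs, ys, k=2)
weighted = knn_predict(1.2, xs, ys, k=2, sigma2=0.5)
print(plain, weighted)
```

In this example the query x = 1.2 is closer to the training point at 1.0 than to the one at 2.0, so the weighted estimate is pulled toward the output at 1.0 relative to the plain average.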
4 Quadratics
The objective functions used in linear least-squares and regularized least-squares are multidimen-
sional quadratics. We now analyze multidimensional quadratics further. We will see many more
uses of quadratics further in the course, particularly when dealing with Gaussian distributions.
The general form of a one-dimensional quadratic is given by:
f (x) = w2 x2 + w1 x + w0 (46)
This can also be written in a slightly different way (called standard form):
where $a = w_2$, $b = -w_1/(2w_2)$, $c = w_0 - w_1^2/(4w_2)$. These two forms are equivalent, and it is
easy to go back and forth between them (e.g., given a, b, c, what are w0, w1, w2?). In the latter
form, it is easy to visualize the shape of the curve: it is a bowl, with minimum (or maximum) at
b. The “width” of the bowl is determined by the magnitude of a, and the sign of a tells us which
direction the bowl points (a positive means a convex bowl, a negative means a concave bowl).
Finally, c tells us how high or low the bowl goes (at x = b). We will now generalize these
intuitions for higher-dimensional quadratics.
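The equivalence of the two forms can be checked by expanding the standard form, assumed here to be f(x) = a(x − b)² + c, and substituting the definitions of a, b, and c given above:

```latex
\begin{align*}
a(x - b)^2 + c &= w_2 \left( x + \frac{w_1}{2 w_2} \right)^2 + w_0 - \frac{w_1^2}{4 w_2} \\
&= w_2 x^2 + w_1 x + \frac{w_1^2}{4 w_2} + w_0 - \frac{w_1^2}{4 w_2} \\
&= w_2 x^2 + w_1 x + w_0
\end{align*}
```

Going the other way, matching coefficients in a(x − b)² + c = ax² − 2abx + ab² + c gives w2 = a, w1 = −2ab, and w0 = c + ab².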
The general form for a 2D quadratic function is:
You should verify for yourself that these different forms are equivalent by multiplying out all the
elements of f(x), either in the 2D case or, using summations, in the general N-D case.
For many manipulations we will want to do later, it is helpful for A to be symmetric, i.e., to
have $w_{i,j} = w_{j,i}$. In fact, it should be clear that these off-diagonal entries are redundant. So, if we
are given a quadratic for which A is asymmetric, we can symmetrize it as:
$$f(x) = x^T \left( \tfrac{1}{2}(A + A^T) \right) x + b^T x + c = x^T \tilde{A} x + b^T x + c \tag{54}$$
and use $\tilde{A} = \frac{1}{2}(A + A^T)$ instead. You should confirm for yourself that this is equivalent to the
original quadratic.
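One way to confirm this is to note that the quadratic term xᵀAx is a scalar, so it equals its own transpose. A short worked check:

```latex
\begin{align*}
x^T A x &= (x^T A x)^T = x^T A^T x \\
x^T A x &= \tfrac{1}{2} x^T A x + \tfrac{1}{2} x^T A^T x
         = x^T \left( \tfrac{1}{2} (A + A^T) \right) x
         = x^T \tilde{A} x
\end{align*}
```

The linear and constant terms are untouched, so the symmetrized quadratic takes the same value as the original at every x.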
As before, we can convert the quadratic to a form that leads to clearer interpretation:
where $\mu = -\frac{1}{2} A^{-1} b$, $d = c - \mu^T A \mu$, assuming that $A^{-1}$ exists. Note the similarity here to the
1-D case. As before, this function is a bowl-shape in N dimensions, with curvature specified by
the matrix A, and with a single stationary point µ (see footnote 4). However, fully understanding
the shape of f(x) is a bit more subtle and interesting.
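The definitions of µ and d can be checked by expanding a centered quadratic. The sketch below assumes the form f(x) = (x − µ)ᵀA(x − µ) + d with A symmetric, which is consistent with the definitions above:

```latex
\begin{align*}
(x - \mu)^T A (x - \mu) + d
  &= x^T A x - 2 \mu^T A x + \mu^T A \mu + d \\
  &= x^T A x + (A^{-1} b)^T A x + \mu^T A \mu + (c - \mu^T A \mu) \\
  &= x^T A x + b^T x + c
\end{align*}
```

The second line substitutes µ = −½A⁻¹b and d = c − µᵀAµ. The last line uses the symmetry of A, which gives (A⁻¹b)ᵀA = bᵀA⁻¹A = bᵀ.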
$$f(x) = x^T A x + b^T x + c. \tag{56}$$
The stationary points occur where all partial derivatives are zero, i.e., ∂f /∂xi = 0 for all i. The
gradient of a function is the vector comprising the partial derivatives of the function, i.e.,
At stationary points it must therefore be true that ∇f = [0, . . . , 0]T . Let us assume that A is
symmetric (if it is not, then we can symmetrize it as above). Equation 56 is a very common form
of cost function (e.g. the log probability of a Gaussian as we will later see), and so the form of its
gradient is important to examine.
Footnote 4: A stationary point means a setting of x where the gradient is zero.
Due to the linearity of the differentiation operator, we can look at each of the three terms of
Eq. 56 separately. The last (constant) term does not depend on x and so we can ignore it because
its derivative is zero. Let us examine the first term. If we write out the individual terms within the
vectors/matrices, we get:
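\begin{align}
& \begin{pmatrix} x_1 & \ldots & x_N \end{pmatrix}
  \begin{pmatrix} a_{11} & \ldots & a_{1N} \\ \vdots & \ddots & \vdots \\ a_{N1} & \ldots & a_{NN} \end{pmatrix}
  \begin{pmatrix} x_1 \\ \vdots \\ x_N \end{pmatrix} \tag{58} \\
&= \left( x_1 a_{11} + x_2 a_{21} + \ldots + x_N a_{N1} \quad x_1 a_{12} + x_2 a_{22} + \ldots \right. \tag{59} \\
&\qquad \left. \ldots + x_1 a_{1N} + x_2 a_{2N} + \ldots + x_N a_{NN} \right)
  \begin{pmatrix} x_1 \\ \vdots \\ x_N \end{pmatrix} \tag{60} \\
&= x_1^2 a_{11} + x_1 x_2 a_{21} + \ldots + x_1 x_N a_{N1} + x_1 x_2 a_{12} + x_2^2 a_{22} + \ldots + x_N x_2 a_{N2} + \ldots \tag{61} \\
&\qquad \ldots + x_1 x_N a_{1N} + x_2 x_N a_{2N} + \ldots + x_N^2 a_{NN} \tag{62} \\
&= \sum_{ij} a_{ij} x_i x_j \tag{63}
\end{align}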
The ith element of the gradient corresponds to ∂f /∂xi . So in the expression above, for the
terms in the gradient corresponding to each xi , we only need to consider the terms involving xi
(others will have derivative zero), namely
$$x_i^2 a_{ii} + \sum_{j \neq i} x_i x_j (a_{ij} + a_{ji}) \tag{64}$$
We can write a single expression for all of the $x_i$ using matrix/vector form:
$$\frac{\partial x^T A x}{\partial x} = (A + A^T) x. \tag{66}$$
You should multiply this out for yourself to see that this corresponds to the individual terms above.
If we assume that A is symmetric, then we have
$$\frac{\partial x^T A x}{\partial x} = 2Ax. \tag{67}$$
This is also a very helpful rule that you should remember. The next term in the cost function, $b^T x$,
has an even simpler gradient. Note that this is simply a dot product, and the result is a scalar:
$$b^T x = b_1 x_1 + b_2 x_2 + \ldots + b_N x_N. \tag{68}$$
Only one term corresponds to each $x_i$ and so $\partial f / \partial x_i = b_i$. We can again express this in
matrix/vector form:
$$\frac{\partial b^T x}{\partial x} = b. \tag{69}$$
This is another helpful rule that you will encounter again. If we use both of the expressions we
have just derived, and set the gradient of the cost function to zero, we get:
$$\frac{\partial f(x)}{\partial x} = 2Ax + b = [0, \ldots, 0]^T \tag{70}$$
The optimum is given by the solution to this system of equations (called normal equations):
$$x = -\frac{1}{2} A^{-1} b \tag{71}$$
In the case of scalar x, this reduces to x = −b/2a. For linear regression with multi-dimensional
inputs above (see Equation 18), $A = \tilde{X}^T \tilde{X}$ and $b = -2\tilde{X}^T y$, with $\tilde{X}$ and y as defined in
Equation 14. As an exercise, convince yourself that this is true.
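These rules are easy to check numerically. The Python sketch below uses hypothetical values of A and b for a 2D quadratic. It computes the stationary point from Equation 71, confirms that the analytic gradient vanishes there, and compares the analytic gradient with a central finite-difference estimate at another point. The 2×2 inverse is written out explicitly to keep the example self-contained.

```python
def quad(A, b, c, x):
    """f(x) = x^T A x + b^T x + c (Equation 56)."""
    n = len(x)
    xAx = sum(A[i][j] * x[i] * x[j] for i in range(n) for j in range(n))
    return xAx + sum(bi * xi for bi, xi in zip(b, x)) + c


def grad(A, b, x):
    """Gradient (A + A^T) x + b, which equals 2 A x + b when A is symmetric."""
    n = len(x)
    return [sum((A[i][j] + A[j][i]) * x[j] for j in range(n)) + b[i] for i in range(n)]


def stationary_point_2d(A, b):
    """x = -1/2 A^{-1} b (Equation 71), using the explicit inverse of a 2x2 matrix."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    inv = [[A[1][1] / det, -A[0][1] / det], [-A[1][0] / det, A[0][0] / det]]
    return [-0.5 * (inv[i][0] * b[0] + inv[i][1] * b[1]) for i in range(2)]


A = [[2.0, 1.0], [1.0, 3.0]]  # hypothetical symmetric, invertible matrix
b = [-4.0, -7.0]              # hypothetical linear term
x_star = stationary_point_2d(A, b)

# Central finite-difference estimate of the gradient at an arbitrary point x0.
x0, h = [0.3, -1.2], 1e-6
numeric = []
for i in range(2):
    xp, xm = x0[:], x0[:]
    xp[i] += h
    xm[i] -= h
    numeric.append((quad(A, b, 0.0, xp) - quad(A, b, 0.0, xm)) / (2 * h))

print(x_star, grad(A, b, x_star))
print(numeric, grad(A, b, x0))
```

The constant c is set to zero in the finite-difference check because it does not affect the gradient.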
Moreover, let us assert the rule “A implies B”, which we will write as A → B. Then, if A is
known to be true, we may deduce logically that B must also be true (if my car is stolen then it
won’t be in the parking spot where I left it). Alternatively, if I find my car where I left it (“B is
false,” written B̄), then I may infer that it was not stolen (Ā) by the contrapositive B̄ → Ā.
Classical logic provides a model of how humans might reason, and a model of how we might
build an “intelligent” computer. Unfortunately, classical logic has a significant shortcoming: it
assumes that all knowledge is absolute. Logic requires that we know some facts about the world
with absolute certainty, and then, we may deduce only those facts which must follow with absolute
certainty.
In the real world, there are almost no facts that we know with absolute certainty — most of
what we know about the world we acquire indirectly, through our five senses, or from dialogue with
other people. One can therefore conclude that most of what we know about the world is uncertain.
(Finding something that we know with certainty has occupied generations of philosophers.)
For example, suppose I discover that my car is not where I remember leaving it (B). Does
this mean that it was stolen? No, there are many other explanations — maybe I have forgotten
where I left it or maybe it was towed. However, the knowledge of B makes A more plausible:
even though I do not know it to be stolen, theft becomes a more likely scenario than before. The
actual degree of plausibility depends on other contextual information — did I park it in a safe
neighborhood?, did I park it in a handicapped zone?, etc.
Predicting the weather is another task that requires reasoning with uncertain information.
While we can make some predictions with great confidence (e.g. we can reliably predict that it
will not snow in June, north of the equator), we are often faced with much more difficult questions
(will it rain today?) which we must infer from unreliable sources of information (e.g., the weather
report, clouds in the sky, yesterday’s weather, etc.). In the end, we usually cannot determine for
certain whether it will rain, but we do get a degree of certainty upon which to base decisions and
decide whether or not to carry an umbrella.
Another important example of uncertain reasoning occurs whenever you meet someone new —
at this time, you immediately make hundreds of inferences (mostly unconscious) about who this
person is and what their emotions and goals are. You make these decisions based on the person’s
appearance, the way they are dressed, their facial expressions, their actions, the context in which
you meet, and what you have learned from previous experience with other people. Of course, you
have no conclusive basis for forming opinions (e.g., the panhandler you meet on the street may
be a method actor preparing for a role). However, we need to be able to make judgements about
other people based on incomplete information; otherwise, normal interpersonal interaction would
be impossible (e.g., how do you really know that everyone isn’t out to get you?).
What we need is a way of discussing not just true or false statements, but statements that have
varying levels of certainty. In addition, we would like to be able to use our beliefs to reason about
the world and interpret it. As we gain new information, our beliefs should change to reflect our
greater knowledge. For example, for any two propositions A and B (that may be true or false), if
A → B, then strong belief in A should increase our belief in B. Moreover, strong belief in B may
sometimes increase our belief in A as well.
• The joint probability of two statements A and B — denoted P (A, B) — is the probability
that both statements are true. (i.e., the probability that the statement “A ∧ B” is true).
(Clearly, P (A, B) = P (B, A).)
• All of the above rules can be made conditional on additional information. For example,
given an additional statement C, we can write the Sum Rule as:
$$ \sum_i P(A_i \mid C) = 1 \tag{75} $$
From these rules, we further derive many more expressions to relate probabilities. For example,
one important operation is called marginalization:
$$ P(B) = \sum_i P(A_i, B) \tag{77} $$
if Ai are mutually-exclusive statements, of which exactly one must be true. In the simplest case,
where the statement A may be true or false, we can derive:
$$ P(B) = P(A, B) + P(\bar{A}, B) \tag{78} $$
The derivation of this formula is straightforward, using the basic rules of probability theory:
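\begin{align}
P(A) + P(\bar{A}) &= 1, & \text{Sum rule} \tag{79} \\
P(A \mid B) + P(\bar{A} \mid B) &= 1, & \text{Conditioning} \tag{80} \\
P(A \mid B) P(B) + P(\bar{A} \mid B) P(B) &= P(B), & \text{Algebra} \tag{81} \\
P(A, B) + P(\bar{A}, B) &= P(B), & \text{Product rule} \tag{82}
\end{align}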
Marginalization gives us a useful way to compute the probability of a statement B that is inter-
twined with many other uncertain statements.
Another useful concept is the notion of independence. Two statements are independent if and
only if P (A, B) = P (A)P (B). If A and B are independent, then it follows that P (A|B) = P (A)
(by combining the Product Rule with the definition of independence). Intuitively, this means that
whether or not B is true tells you nothing about whether A is true.
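The rules above (marginalization, the Product Rule, and the test for independence) can be tried out on a small table. The following sketch uses a made-up joint distribution over two binary statements; the numbers are hypothetical and were chosen so that A and B happen to be independent.

```python
import numpy as np

# Hypothetical joint distribution P(A, B) over two binary statements.
# Rows index A (true, false); columns index B (true, false).
joint = np.array([[0.12, 0.28],
                  [0.18, 0.42]])

p_B = joint.sum(axis=0)      # marginalization: P(B) = sum_i P(A_i, B)
p_A = joint.sum(axis=1)      # marginalization over B instead
p_A_given_B = joint / p_B    # Product Rule rearranged: P(A|B) = P(A, B) / P(B)

# Independence holds if and only if P(A, B) = P(A) P(B) for every entry.
independent = np.allclose(joint, np.outer(p_A, p_B))
print(p_A, p_B, independent)
```

Because this particular table factorizes, each column of p_A_given_B equals p_A: knowing B tells us nothing about A.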
In the rest of these notes, I will always use probabilities as statements about variables. For
example, suppose we have a variable x that indicates whether there are one, two, or three people
in a room (i.e., the only possibilities are x = 1, x = 2, x = 3). Then, by the sum rule, we can
derive P (x = 1) + P (x = 2) + P (x = 3) = 1. Probabilities can also describe the range of a real
variable. For example, P (y < 5) is the probability that the variable y is less than 5. (We’ll discuss
continuous random variables and probability densities in more detail in the next chapter.)
To summarize:
Once we have these rules — and a suitable model — we can derive any probability that we
want. With some experience, you should be able to derive any desired probability (e.g., P (A|C))
given a basic model.
these notes, I will use probabilities specifically to refer to values of variables, e.g., P (c = heads)
is the probability that the coin lands heads.
What is the probability that the coin lands heads? This probability should be some real number
θ, 0 ≤ θ ≤ 1. For most coins, we would say θ = .5. What does this number mean? The number θ
is a representation of our belief about the possible values of c. Some examples:
θ = 0      we are absolutely certain the coin will land tails
θ = 1/3    we believe that tails is twice as likely as heads
θ = 1/2    we believe heads and tails are equally likely
θ = 1      we are absolutely certain the coin will land heads
Formally, we denote the probability of the coin coming up heads as P (c = heads), so P (c =
heads) = θ. In general, we denote the probability of a specific event as P (event). By the
Sum Rule, we know P (c = heads) + P (c = tails) = 1, and thus P (c = tails) = 1 − θ.
Once we flip the coin and observe the result, then we can be pretty sure that we know the value
of c; there is no practical need to model the uncertainty in this measurement. However, suppose
we do not observe the coin flip, but instead hear about it from a friend, who may be forgetful or
untrustworthy. Let f be a variable indicating how the friend claims the coin landed, i.e. f = heads
means the friend says that the coin came up heads. Suppose the friend says the coin landed heads
— do we believe him, and, if so, with how much certainty? As we shall see, probabilistic reasoning
obtains quantitative values that, qualitatively, match our common sense very effectively.
Suppose we know something about our friend’s behaviour. We can represent our beliefs with
probabilities; for example, P (f = heads|c = heads) represents our belief that the
friend says “heads” when the coin landed heads. Because the friend can only say one thing, we
can apply the Sum Rule to get:
P (f = heads|c = heads) + P (f = tails|c = heads) = 1 (83)
P (f = heads|c = tails) + P (f = tails|c = tails) = 1 (84)
If our friend always tells the truth, then we know P (f = heads|c = heads) = 1 and P (f =
tails|c = heads) = 0. If our friend usually lies, then, for example, we might have P (f = heads|c =
heads) = .3.
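To see how these quantities combine, here is a small sketch that computes P(c = heads | f = heads) using only the Product Rule and marginalization. The function name and the particular numbers passed to it are hypothetical; in particular, the value of P(f = heads | c = tails) has to be supplied as a further assumption about the friend.

```python
def posterior_heads(theta, p_say_heads_if_heads, p_say_heads_if_tails):
    """P(c = heads | f = heads), from the Product Rule and marginalization."""
    joint_heads = p_say_heads_if_heads * theta           # P(f = heads, c = heads)
    joint_tails = p_say_heads_if_tails * (1.0 - theta)   # P(f = heads, c = tails)
    # Marginalization gives P(f = heads) = joint_heads + joint_tails.
    return joint_heads / (joint_heads + joint_tails)

# A fair coin (theta = 0.5) and three hypothetical friends.
truthful = posterior_heads(0.5, 1.0, 0.0)   # always tells the truth
reliable = posterior_heads(0.5, 0.9, 0.1)   # right 90% of the time
liar = posterior_heads(0.5, 0.3, 0.7)       # usually lies
print(truthful, reliable, liar)
```

With a friend who usually lies, hearing "heads" makes heads less plausible than it was before, which matches common sense.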
for k = 0, 1, . . . , n, where
$$ \binom{n}{k} = \frac{n!}{k!\,(n-k)!}. \tag{86} $$
A multinomial distribution is a natural extension of the binomial distribution to an experiment
with k mutually exclusive outcomes, having probabilities $p_j$, for j = 1, . . . , k. Of course, to be
valid probabilities, $\sum_j p_j = 1$. For example, rolling a die can yield one of six values, each with
probability 1/6 (assuming the die is fair). Given n trials, the multinomial distribution specifies the
distribution over the number of each of the possible outcomes. Given n trials and k possible outcomes
with probabilities $p_j$, the distribution over the event that outcome j occurs $x_j$ times (and of course
$\sum_j x_j = n$) is the multinomial distribution given by
$$ P(X_1 = x_1, X_2 = x_2, \ldots, X_k = x_k) = \frac{n!}{x_1!\, x_2! \cdots x_k!}\; p_1^{x_1} p_2^{x_2} \cdots p_k^{x_k} \tag{87} $$
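Equation 87 is easy to evaluate directly. The sketch below uses only the Python standard library; the function name is my own, and the die example follows the one in the text.

```python
from math import factorial, prod

def multinomial_pmf(counts, probs):
    """Equation 87: probability that outcome j occurs counts[j] times in n = sum(counts) trials."""
    n = sum(counts)
    coefficient = factorial(n)
    for x in counts:
        coefficient //= factorial(x)   # exact: every intermediate quotient is an integer
    return coefficient * prod(p ** x for p, x in zip(probs, counts))

# Six rolls of a fair die in which every face appears exactly once.
fair_die = [1.0 / 6.0] * 6
p_each_face_once = multinomial_pmf([1, 1, 1, 1, 1, 1], fair_die)

# With k = 2 the formula reduces to the binomial distribution.
p_two_heads_in_three = multinomial_pmf([2, 1], [0.5, 0.5])
print(p_each_face_once, p_two_heads_in_three)
```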
is a valid PDF. I will use the convention of upper-case P for discrete probabilities, and lower-case
p for PDFs.
With the PDF we can specify the probability that the random variable x falls within a given
range:
$$ P(x_0 \le x \le x_1) = \int_{x_0}^{x_1} p(x)\, dx \tag{92} $$
This can be visualized by plotting the curve p(x). Then, to determine the probability that x falls
within a range, we compute the area under the curve for that range.
The PDF can be thought of as the infinite limit of a discrete distribution, i.e., a discrete dis-
tribution with an infinite number of possible outcomes. Specifically, suppose we create a discrete
distribution with N possible outcomes, each corresponding to a range on the real number line.
Then, suppose we increase N towards infinity, so that each outcome shrinks to a single real num-
ber; a PDF is defined as the limiting case of this discrete distribution.
There is an important subtlety here: a probability density is not a probability per se. For
one thing, there is no requirement that p(x) ≤ 1. Moreover, the probability that x attains any
one specific value out of the infinite set of possible values is always zero, e.g.
$P(x = 5) = \int_5^5 p(x)\, dx = 0$ for any PDF p(x). People (myself included) are sometimes sloppy in referring
to p(x) as a probability, but it is not a probability; rather, it is a function that can be used in
computing probabilities.
Joint distributions are defined in a natural way. For two variables x and y, the joint PDF p(x, y)
defines the probability that (x, y) lies in a given domain D:
$$ P((x, y) \in D) = \int_{(x,y) \in D} p(x, y)\, dx\, dy \tag{93} $$
we can write p(x|y), which provides a PDF for x for every value of y. (It must be the case that
$\int p(x|y)\, dx = 1$, since p(x|y) is a PDF over values of x.)
In general, for all of the rules for manipulating discrete distributions there are analogous rules
for continuous distributions:
Probability rules for PDFs:
• $p(x) \ge 0$, for all x
• $\int_{-\infty}^{\infty} p(x)\, dx = 1$
• $P(x_0 \le x \le x_1) = \int_{x_0}^{x_1} p(x)\, dx$
• Sum rule: $\int_{-\infty}^{\infty} p(x)\, dx = 1$
• Product rule: $p(x, y) = p(x|y)\,p(y) = p(y|x)\,p(x)$.
• Marginalization: $p(y) = \int_{-\infty}^{\infty} p(x, y)\, dx$
• We can also add conditional information, e.g. $p(y|z) = \int_{-\infty}^{\infty} p(x, y|z)\, dx$
• Independence: Variables x and y are independent if: $p(x, y) = p(x)\,p(y)$.
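To make the distinction between a density and a probability concrete, the sketch below uses a hypothetical PDF, p(x) = 2x on [0, 1] and zero elsewhere, together with a simple midpoint-rule integrator. Both the density and the helper names are assumptions made for illustration.

```python
def pdf(x):
    """Hypothetical density: p(x) = 2x on [0, 1], zero elsewhere."""
    return 2.0 * x if 0.0 <= x <= 1.0 else 0.0

def integrate(f, lo, hi, steps=10000):
    """Midpoint-rule approximation of the integral of f over [lo, hi]."""
    width = (hi - lo) / steps
    return width * sum(f(lo + (k + 0.5) * width) for k in range(steps))

density_at_one = pdf(1.0)                # a density value may exceed 1
total = integrate(pdf, 0.0, 1.0)         # but the density still integrates to 1
p_interval = integrate(pdf, 0.5, 1.0)    # P(0.5 <= x <= 1) is an area under the curve
p_point = integrate(pdf, 0.5, 0.5)       # a single point has zero width, hence zero probability
print(density_at_one, total, p_interval, p_point)
```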
The variance of a scalar variable x is the expected squared deviation from the mean:
$$ E_{p(x)}[(x - \mu)^2] = \int (x - \mu)^2\, p(x)\, dx \tag{96} $$
The variance of a distribution tells us how uncertain, or “spread-out” the distribution is. For a very
narrow distribution Ep(x) [(x − µ)2 ] will be small.
The covariance of a vector x is a matrix:
$$ \Sigma = \mathrm{cov}(x) = E_{p(x)}[(x - \mu)(x - \mu)^T] = \int (x - \mu)(x - \mu)^T\, p(x)\, dx \tag{97} $$
By inspection, we can see that the diagonal entries of the covariance matrix are the variances of
the individual entries of the vector:
$$ \Sigma_{ii} = \mathrm{var}(x_i) = E_{p(x)}[(x_i - \mu_i)^2] \tag{98} $$
between variables xi and xj . If the covariance is a large positive number, then we expect xi to be
larger than µi when xj is larger than µj . If the covariance is zero and we know no other information,
then knowing xi > µi does not tell us whether or not it is likely that xj > µj .
One goal of statistics is to infer properties of distributions. In the simplest case, the sample
mean of a collection of N data points $x_{1:N}$ is just their average: $\bar{x} = \frac{1}{N} \sum_i x_i$. The sample
covariance of a set of data points is: $\frac{1}{N} \sum_i (x_i - \bar{x})(x_i - \bar{x})^T$. The covariance of the data points
tells us how “spread-out” the data points are.
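The sample mean and sample covariance can be computed in a few lines. This sketch assumes the data are stored with one point per row, and it uses the 1/N normalization given above; note that NumPy's np.cov divides by N − 1 unless bias=True is passed.

```python
import numpy as np

def sample_mean_and_covariance(X):
    """X has shape (N, D): one data point per row."""
    N = X.shape[0]
    x_bar = X.mean(axis=0)
    centred = X - x_bar
    cov = centred.T @ centred / N   # sum of outer products, with the 1/N normalization
    return x_bar, cov

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))       # synthetic data, for illustration only
x_bar, cov = sample_mean_and_covariance(X)
print(x_bar)
print(cov)
```

The diagonal entries of cov are the sample variances of the individual coordinates.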
Equations 100 and 101 are equivalent. The latter simply says: x is distributed uniformly in the
range from x0 to x1 , and it is impossible that x lies outside of that range.
The mean of a uniform distribution U(x0 , x1 ) is (x1 + x0 )/2. The variance is (x1 − x0 )2 /12.
[Figure 6: three histogram panels, for N = 1, N = 2, and N = 10; horizontal axis from 0 to 1, vertical axis from 0 to 3.]
Figure 6: Histogram plots of the mean of N uniformly distributed numbers for various values of
N . The effect of the Central Limit Theorem is seen: as N increases, the distribution becomes more
Gaussian. (Figure from Pattern Recognition and Machine Learning by Chris Bishop.)
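The experiment behind Figure 6 can be reproduced with a short simulation. The sketch below is my own and is not the code used to produce the figure; it draws many means of N uniform numbers on [0, 1] and reports their variance, which should be close to 1/(12N) because the variance of U(0, 1) is 1/12 and averaging N independent draws divides the variance by N.

```python
import numpy as np

rng = np.random.default_rng(0)

def means_of_uniforms(N, trials=100_000):
    """Draw `trials` independent means, each of N numbers uniform on [0, 1]."""
    return rng.uniform(0.0, 1.0, size=(trials, N)).mean(axis=1)

# The distribution of the mean narrows (and looks more Gaussian) as N grows.
variances = {N: means_of_uniforms(N).var() for N in (1, 2, 10)}
for N, v in variances.items():
    print(N, v, 1.0 / (12.0 * N))
```

Passing the output of means_of_uniforms to a histogram routine gives plots of the kind shown in the figure.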
The simplest case is a Gaussian PDF over a scalar value x, in which case the PDF is:
$$ p(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{1}{2\sigma^2} (x - \mu)^2 \right) \tag{102} $$
(The notation exp(a) is the same as $e^a$.) The Gaussian has two parameters, the mean µ, and
the variance σ 2 . The mean specifies the center of the distribution, and the variance tells us how
“spread-out” the PDF is.
The PDF for a D-dimensional vector x, the elements of which are jointly distributed with the
Gaussian density function, is given by
$$ p(x \mid \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^D |\Sigma|}} \exp\left( -(x - \mu)^T \Sigma^{-1} (x - \mu)/2 \right) \tag{103} $$
where µ is the mean vector, and Σ is the D×D covariance matrix, and |A| denotes the determinant
of matrix A. An important special case is when the Gaussian is isotropic (rotationally invariant).
In this case the covariance matrix can be written as Σ = σ 2 I where I is the identity matrix. This is
called a spherical or isotropic covariance matrix. In this case, the PDF reduces to:
$$ p(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{(2\pi)^D \sigma^{2D}}} \exp\left( -\frac{1}{2\sigma^2} \|x - \mu\|^2 \right). \tag{104} $$
The Gaussian distribution is used frequently enough that it is useful to denote its PDF in a
simple way. We will define a function G to be the Gaussian density function, i.e.,
$$ G(x; \mu, \Sigma) \equiv \frac{1}{\sqrt{(2\pi)^D |\Sigma|}} \exp\left( -(x - \mu)^T \Sigma^{-1} (x - \mu)/2 \right) \tag{105} $$
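A direct implementation of G can be useful for checking derivations numerically. The following is a minimal sketch with NumPy; the function name gaussian_density is my own, and no attempt is made to handle a singular or badly conditioned Σ.

```python
import numpy as np

def gaussian_density(x, mu, Sigma):
    """Evaluate G(x; mu, Sigma) from Equation 105 at a single point x."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    mu = np.atleast_1d(np.asarray(mu, dtype=float))
    Sigma = np.atleast_2d(np.asarray(Sigma, dtype=float))
    D = x.shape[0]
    diff = x - mu
    # Solve Sigma y = diff rather than inverting Sigma explicitly.
    mahalanobis_sq = diff @ np.linalg.solve(Sigma, diff)
    normaliser = np.sqrt((2.0 * np.pi) ** D * np.linalg.det(Sigma))
    return np.exp(-0.5 * mahalanobis_sq) / normaliser

# Isotropic case: Equation 105 should agree with Equation 104.
x = np.array([0.5, -1.0, 2.0])
mu = np.array([0.0, 0.0, 1.0])
sigma2 = 2.0
general = gaussian_density(x, mu, sigma2 * np.eye(3))
isotropic = np.exp(-np.sum((x - mu) ** 2) / (2.0 * sigma2)) / np.sqrt((2.0 * np.pi) ** 3 * sigma2 ** 3)
print(general, isotropic)
```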
When formulating problems and manipulating PDFs this functional notation will be useful. When
we want to specify that a random vector has a Gaussian PDF, it is common to use the notation:
$$ x \sim \mathcal{N}(\mu, \Sigma) \tag{106} $$
Equations 103 and 106 essentially say the same thing. Equation 106 says that x is Gaussian, and
Equation 103 specifies (evaluates) the density for an input x.
The covariance matrix Σ of a Gaussian must be symmetric and positive definite, which in
particular implies that |Σ| > 0. Otherwise, the formula does not correspond to a valid PDF;
for instance, Equation 103 is no longer real-valued if |Σ| < 0, and is undefined if |Σ| = 0.
6.3.1 Diagonalization
A useful way to understand a Gaussian is to diagonalize the exponent. The exponent of the Gaus-
sian is quadratic, and so its shape is essentially elliptical. Through diagonalization we find the
major axes of the ellipse, and the variance of the distribution along those axes. Seeing the Gaus-
sian this way often makes it easier to interpret the distribution.
As a reminder, the eigendecomposition of a real-valued symmetric matrix Σ yields a set of
orthonormal vectors ui and scalars λi such that
$$ \Sigma u_i = \lambda_i u_i \tag{107} $$
Equivalently, if we combine the eigenvalues and eigenvectors into matrices U = [u1 , ..., uN ] and
Λ = diag(λ1 , ..., λN ), then we have
$$ \Sigma U = U \Lambda \tag{108} $$
Since U is orthonormal:
$$ \Sigma = U \Lambda U^T \tag{109} $$
The inverse of Σ is straightforward, since U is orthonormal, and hence $U^{-1} = U^T$:
$$ \Sigma^{-1} = \left( U \Lambda U^T \right)^{-1} = U \Lambda^{-1} U^T \tag{110} $$
(If any of these steps are not familiar to you, you should refresh your memory of them.)
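Now, consider the negative log of the Gaussian (i.e., the exponent); i.e., let
$$ f(x) = \frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu). \tag{111} $$
Substituting in the diagonalization gives:
\begin{align}
f(x) &= \frac{1}{2} (x - \mu)^T U \Lambda^{-1} U^T (x - \mu) \tag{112} \\
&= \frac{1}{2} z^T z \tag{113}
\end{align}
where
$$ z = \mathrm{diag}(\lambda_1^{-1/2}, \ldots, \lambda_N^{-1/2})\, U^T (x - \mu) \tag{114} $$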
This new function $f(z) = z^T z/2 = \sum_i z_i^2/2$ is a quadratic, with new variables $z_i$. Given variables
x, we can convert them to the z representation by applying Eq. 114, and, if all eigenvalues are
nonzero, we can convert back by inverting Eq. 114. Hence, we can write our Gaussian in this new
coordinate system as (see footnote 5):
$$ \frac{1}{\sqrt{(2\pi)^N}} \exp\left( -\frac{1}{2} \|z\|^2 \right) = \prod_i \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{1}{2} z_i^2 \right) \tag{115} $$
[Figure 7: an ellipse centred at µ in the (x1, x2) plane, with axes along u1 and u2 of half-lengths $\lambda_1^{1/2}$ and $\lambda_2^{1/2}$.]
Figure 7: The red curve shows the elliptical surface of constant probability density for a Gaussian
in a two-dimensional space on which the density is exp(−1/2) of its value at x = µ. The major
axes of the ellipse are defined by the eigenvectors ui of the covariance matrix, with corresponding
eigenvalues λi . (Figure from Pattern Recognition and Machine Learning by Chris Bishop.) (Note: y1 and
y2 in the figure should read z1 and z2 .)
It is easy to see that for the quadratic form of f (z), its level sets (i.e., the surfaces f (z) = c for
constant c) are hyperspheres. Equivalently, it is clear from Eq. 115 that z is a Gaussian random vector
with an isotropic covariance, so the different elements of z are uncorrelated. In other words, the
value of this transformation is that we have decomposed the original N -D quadratic with many
interactions between the variables into a much simpler Gaussian, composed of N independent
variables. This convenient geometrical form can be seen in Figure 7. For example, if we consider an
individual zi variable in isolation (i.e., consider a slice of the function f (z)), that slice will look
like a 1D bowl.
We can also understand the local curvature of f with a slightly different diagonalization.
Specifically, let $v = U^T (x - \mu)$. Then,
$$ f(v) = \frac{1}{2} v^T \Lambda^{-1} v = \frac{1}{2} \sum_i \frac{v_i^2}{\lambda_i} \tag{116} $$
If we plot a cross-section of this function, then we have a 1D bowl shape with variance given by
λi . In other words, the eigenvalues tell us the variance of the Gaussian in different dimensions.
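The diagonalization can be checked numerically. The sketch below assumes a hypothetical 2D mean and covariance of my own choosing, and uses NumPy's eigh routine for symmetric matrices. It verifies that f computed from Equation 111 agrees with z^T z/2, and that samples mapped through Equation 114 have approximately identity covariance.

```python
import numpy as np

# Hypothetical 2D Gaussian parameters, chosen for illustration.
mu = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])

lam, U = np.linalg.eigh(Sigma)   # eigenvalues lam[i]; columns of U are the eigenvectors u_i

def to_z(x):
    """Equation 114: z = diag(lam^{-1/2}) U^T (x - mu)."""
    return (U.T @ (x - mu)) / np.sqrt(lam)

x = np.array([0.3, 0.5])
f_x = 0.5 * (x - mu) @ np.linalg.solve(Sigma, x - mu)   # Equation 111
z = to_z(x)
f_z = 0.5 * z @ z                                        # Equation 113

# Apply the same map to many samples, one per row.
rng = np.random.default_rng(0)
samples = rng.multivariate_normal(mu, Sigma, size=200_000)
Z = (samples - mu) @ U / np.sqrt(lam)
cov_Z = np.cov(Z, rowvar=False)   # should be close to the identity matrix
print(f_x, f_z)
print(cov_Z)
```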
5
The normalizing |Σ| disappears due to the nature of change-of-variables in PDFs, which we won’t discuss here.
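[Figure 8: two panels. Left: contours of p(xa , xb ) with the line xb = 0.7 marked; axes xa and xb , each from 0 to 1. Right: the curves p(xa ) and p(xa |xb = 0.7); horizontal axis xa from 0 to 1, vertical axis from 0 to 10.]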
Figure 8: Left: The contours of a Gaussian distribution p(xa , xb ) over two variables. Right: The
marginal distribution p(xa ) (blue curve) and the conditional distribution p(xa |xb ) for xb = 0.7 (red
curve). (Figure from Pattern Recognition and Machine Learning by Chris Bishop.)
Then one can show straightforwardly that the marginal PDFs for the components xa and xb are
also Gaussian, i.e.,
xa ∼ N (µa , Σaa ) , xb ∼ N (µb , Σbb ). (118)
With a little more work one can also show that the conditional distributions are Gaussian. For
example, the conditional distribution of xa given xb satisfies
where
$$ \mu = \Sigma \left( \Sigma_1^{-1} \mu_1 + \Sigma_2^{-1} \mu_2 \right), \tag{121} $$
$$ \Sigma = \left( \Sigma_1^{-1} + \Sigma_2^{-1} \right)^{-1}. \tag{122} $$
Note that the linear transformation of a Gaussian random variable is also Gaussian. For exam-
ple, if we apply a transformation such that y = Ax where x ∼ N (x|µ, Σ), we have y ∼
N (y|Aµ, AΣAT ).
7 Estimation
We now consider the problem of determining unknown parameters of the world based on mea-
surements. The general problem is one of inference, which describes the probabilities of these
unknown parameters. Given a model, these probabilities can be derived using Bayes’ Rule. The
simplest use of these probabilities is to perform estimation, in which we attempt to come up with
single “best” estimates of the unknown parameters.
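Model: Coin-Flipping
\begin{equation}
\begin{aligned}
\theta &\sim \mathcal{U}(0, 1) \\
P(c = \text{heads}) &= \theta \\
P(c_{1:N} \mid \theta) &= \prod_i p(c_i \mid \theta)
\end{aligned}
\tag{123}
\end{equation}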
Suppose we wish to learn about a coin by flipping it 1000 times and observing the results
c1:1000 , where the coin landed heads 750 times. What is our belief about θ, given this data? We
now need to solve for p(θ|c1:1000 ), i.e., our belief about θ after seeing the 1000 coin flips. To do
this, we apply the basic rules of probability theory, beginning with the Product Rule:
$$ p(\theta \mid c_{1:1000}) = \frac{P(c_{1:1000} \mid \theta)\, p(\theta)}{P(c_{1:1000})} \tag{125} $$
6
We would usually expect a coin to be fair, i.e., the prior distribution for θ is peaked near 0.5.
Figure 9: Posterior probability of θ from two different experiments: one with a single coin flip
(landing heads), and 1000 coin flips (750 of which land heads). Note that the latter distribution is
much more peaked.
where Z is a constant (evaluating it requires more advanced math, but it is not necessary for our
purposes). Hence, the final probability distribution is:
which is plotted in Figure 9. This form gives a probability distribution over θ that expresses our
belief about θ after we’ve flipped the coin 1000 times.
Suppose we just take the peak of this distribution; from the graph, it can be seen that the peak
is at θ = .75. This makes sense: if a coin lands heads 75% of the time, then we would probably
estimate that it will land heads 75% of the time in the future. More generally, suppose the coin
lands heads H times out of N flips; we can compute the peak of the distribution as follows:
(Deriving this is a good exercise to do on your own; hint: minimize the negative log of p(θ|c1:N )).
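Before deriving the peak analytically, it can help to find it numerically. The sketch below assumes the coin-flipping model above with its uniform prior, under which the posterior is proportional to θ^H (1 − θ)^(N−H); it works with the log of that expression to avoid underflow and searches a grid of θ values. The grid and the function names are my own choices.

```python
import numpy as np

def log_posterior_unnormalised(theta, H, N):
    """log of theta^H (1 - theta)^(N - H), i.e. the log posterior up to a constant."""
    return H * np.log(theta) + (N - H) * np.log1p(-theta)

def posterior_peak(H, N):
    """Grid search for the peak of the posterior over theta in (0, 1)."""
    grid = np.linspace(0.001, 0.999, 999)   # spacing of 0.001
    return grid[np.argmax(log_posterior_unnormalised(grid, H, N))]

peak_1000_flips = posterior_peak(750, 1000)
peak_one_flip = posterior_peak(1, 1)   # a single flip that landed heads
print(peak_1000_flips, peak_one_flip)
```

For the single flip, the peak sits at the upper end of the grid: the posterior is increasing in θ, so its maximum is at the boundary.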
Solving for the desired distribution, gives a seemingly simple but powerful result, known widely
as Bayes’ Rule:
Bayes’ Rule:
$$ p(\text{model} \mid \text{data}) = \frac{p(\text{data} \mid \text{model})\, p(\text{model})}{p(\text{data})} $$
The different terms in Bayes’ Rule are used so often that they all have names:
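$$ \underbrace{p(\text{model} \mid \text{data})}_{\text{posterior}} = \frac{\overbrace{P(\text{data} \mid \text{model})}^{\text{likelihood}}\;\; \overbrace{p(\text{model})}^{\text{prior}}}{\underbrace{p(\text{data})}_{\text{evidence}}} \tag{131} $$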
• The likelihood distribution describes the likelihood of data given model; it reflects our
assumptions about how the data was generated.
• The prior distribution describes our assumptions about model before observing the
data.
• The posterior distribution describes our knowledge of model, incorporating both the data
and the prior.
• The evidence is useful in model selection, and will be discussed later. Here, its only role is
to normalize the posterior PDF.
of some unknown variables from observed data. In this chapter, we outline the problem, and
describe some of the main ways to do this, including Maximum A Posteriori (MAP), and Maximum
Likelihood (ML). Estimation is the most common form of learning — given some data from the
world, we wish to “learn” how the world behaves, which we will describe in terms of a set of
unknown variables.
Strictly speaking, parameter estimation is not justified by Bayesian probability theory, and
can lead to a number of problems, such as overfitting and nonsensical results in extreme cases.
Nonetheless, it is widely used in many problems.
Note that we don’t need to be able to evaluate the evidence term p(D) for MAP learning, since
there are no θ terms in it.
Very often, we will assume that we have no prior assumptions about the value of θ, which we
express as a uniform prior: p(θ) is a uniform distribution over some suitably large range. In this
case, the p(θ) term can also be ignored from MAP learning, and we are left with only maximizing
the likelihood. Hence, the Maximum Likelihood (ML) learning principle (i.e., estimator) is
It often turns out that it is more convenient to minimize the negative-log of the objective func-
tion. Because “− ln” is a monotonic decreasing function, we can pose MAP estimation as:
We can see that the objective conveniently breaks into a part corresponding to the likelihood
and a part corresponding to the prior.
One problem with this approach is that all model uncertainty is ignored. We are choosing
to put all our faith in the most probable model. This sometimes has surprising and undesirable
consequences. For example, in the coin tossing example above, if one were to flip a coin just once
and see a head, then the estimator in Eqn. (129) would tell us that the probability of the outcome
being heads is 1. Sometimes a more suitable estimator is the expected value of the posterior
distribution, rather than its maximum. This is called the Bayes’ estimate.
In the coin tossing case above, you can show that the expected value of θ, under the posterior
provides an estimate of the probability that is biased toward 1/2. That is:
$$ \int_0^1 p(\theta \mid c_{1:N})\, \theta\, d\theta = \frac{H + 1}{N + 2} \tag{138} $$
You can see that this value is always somewhat biased toward 1/2, but converges to the MAP
estimate as N increases. Interestingly, even when there is no data whatsoever, in which case
the MAP estimate is undefined, the Bayes’ estimate is simply 1/2.
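The two estimators are easy to compare side by side. In the sketch below, bayes_estimate implements Equation 138, while map_estimate assumes the peak of the posterior is H/N, which is consistent with the peak at θ = .75 quoted earlier for 750 heads in 1000 flips. The numerical integration is only a sanity check of Equation 138, again assuming the posterior is proportional to θ^H (1 − θ)^(N−H).

```python
from fractions import Fraction

def map_estimate(H, N):
    """Peak of the posterior under the uniform prior, assumed to be H / N (needs N > 0)."""
    return Fraction(H, N)

def bayes_estimate(H, N):
    """Posterior mean from Equation 138."""
    return Fraction(H + 1, N + 2)

def posterior_mean_numeric(H, N, steps=20000):
    """Midpoint-rule estimate of the posterior mean, for checking Equation 138."""
    width = 1.0 / steps
    thetas = [(k + 0.5) * width for k in range(steps)]
    weights = [t ** H * (1.0 - t) ** (N - H) for t in thetas]
    return sum(t * w for t, w in zip(thetas, weights)) / sum(weights)

for H, N in [(1, 1), (2, 10), (750, 1000)]:
    print(H, N, map_estimate(H, N), bayes_estimate(H, N))
print(bayes_estimate(0, 0))   # no data at all
```

After a single flip that lands heads, the MAP estimate is 1 while the Bayes' estimate is 2/3, which illustrates the bias toward 1/2 described above.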
Solving for µ and Σ by setting ∂L/∂µ = 0 and ∂L/∂Σ = 0 (subject to the constraint that Σ is
symmetric) gives the maximum likelihood estimates (see footnote 7):
$$ \mu^* = \frac{1}{N} \sum_i x_i \tag{145} $$
$$ \Sigma^* = \frac{1}{N} \sum_i (x_i - \mu^*)(x_i - \mu^*)^T \tag{146} $$
7
Warning: the calculation for the optimal covariance matrix involves Lagrange multipliers and is not easy.
The ML estimates make intuitive sense; we estimate the Gaussian’s mean to be the sample mean
of the data, and the Gaussian’s covariance to be the sample covariance of the data. Maximum
likelihood estimates usually make sense intuitively. This is very helpful when debugging your
math — you can sometimes find bugs in derivations simply because the ML estimates do not look
right.
$$ y = w^T b(x) + n \tag{147} $$
$$ n \sim \mathcal{N}(0, \sigma^2). \tag{148} $$
We add this random variable to the regression equation in (147) to represent the fact that most
models and most measurements involve some degree of error. We’ll refer to this error as noise.
It is straightforward to show from basic probability theory that Equation (147) implies that,
given x and w, y is also Gaussian (i.e., has a Gaussian density), i.e.,
$$ p(y \mid x, w) = G(y;\, w^T b(x), \sigma^2) \equiv \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(y - w^T b(x))^2 / 2\sigma^2} \tag{149} $$
(G is defined in the previous chapter.) It follows that, for a collection of N independent training
points, (y1:N , x1:N ), the likelihood is given by
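\begin{align}
p(y_{1:N} \mid w, x_{1:N}) &= \prod_{i=1}^{N} G(y_i;\, w^T b(x_i), \sigma^2) \notag \\
&= \frac{1}{(2\pi\sigma^2)^{N/2}} \exp\left( -\sum_{i=1}^{N} \frac{(y_i - w^T b(x_i))^2}{2\sigma^2} \right) \tag{150}
\end{align}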
Furthermore, let us assume the following (weight decay) prior distribution over the unknown
weights w:
w ∼ N (0, αI) . (151)
That is, for w ∈ RM ,
$$ p(w) = \prod_{k=1}^{M} \frac{1}{\sqrt{2\pi\alpha}}\, e^{-w_k^2 / 2\alpha} = \frac{1}{(2\pi\alpha)^{M/2}}\, e^{-w^T w / 2\alpha}. \tag{152} $$
Now, to estimate the model parameters (i.e., w), let’s consider the posterior distribution over
w conditioned on our N training pairs, (xi , yi ). Based on the formulation above, assuming inde-
pendent training samples, it follows that
Note that p(w|x1:N ) = p(w), since we can assume that x alone provides no information about w.
In MAP estimation, we want to find the parameters w that maximize their posterior probability:
Furthermore, we can multiply by a constant, without changing where the optima are, so let us
multiply the whole expression by 2σ 2 . Then, if we define λ = σ 2 /α, we have the exact same
objective function as used in nonlinear regression with regularization. Hence, nonlinear least-
squares with regularization is a form of MAP estimation, and can be optimized the same way.
When the measurements are very reliable, σ is small and we give the regularizer less influence
on the estimate. But when the data are relatively noisy, so that σ is larger, the regularizer has more
influence.
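Here is a sketch of this MAP estimate in code. It assumes the scaled objective is the sum of squared residuals plus λ w^T w with λ = σ²/α, and it minimizes that objective by solving the linear system obtained from setting its gradient to zero. The polynomial basis, the synthetic data, and the function names are hypothetical choices made for illustration.

```python
import numpy as np

def polynomial_basis(x, degree):
    """Hypothetical choice of b(x): the monomials 1, x, ..., x^degree, one row per input."""
    return np.vander(x, degree + 1, increasing=True)

def map_weights(B, y, sigma2, alpha):
    """Minimise sum_i (y_i - w^T b(x_i))^2 + lam * w^T w, with lam = sigma2 / alpha."""
    lam = sigma2 / alpha
    M = B.shape[1]
    # Setting the gradient to zero gives (B^T B + lam I) w = B^T y.
    return np.linalg.solve(B.T @ B + lam * np.eye(M), B.T @ y)

def objective(w, B, y, lam):
    residual = y - B @ w
    return residual @ residual + lam * w @ w

rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 20)
sigma2, alpha = 0.01, 1.0
y = np.sin(np.pi * x) + rng.normal(0.0, np.sqrt(sigma2), size=x.shape)   # noisy targets

B = polynomial_basis(x, degree=5)
w_map = map_weights(B, y, sigma2, alpha)
lam = sigma2 / alpha
gradient = -2.0 * B.T @ (y - B @ w_map) + 2.0 * lam * w_map   # should be ~0 at the optimum
print(w_map)
```

Increasing sigma2 (noisier data) or decreasing alpha (a tighter prior) both increase λ and pull the weights toward zero.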
8 Classification
In classification, we are trying to learn a map from an input space to some finite output space. In
the simplest case we simply detect whether or not the input has some property. For example,
we might want to determine whether or not an email is spam, or whether an image contains a face.
A task in the health care field is to determine, given a set of observed symptoms, whether or not a
person has a disease. These detection tasks are binary classification problems.
In multi-class classification problems we are interested in determining to which of multiple
categories the input belongs. For example, given a recorded voice signal we might wish to rec-
ognize the identity of a speaker (perhaps from a set of people whose voice properties are given in
advance). Another well studied example is optical character recognition, the recognition of letters
or numbers from images of handwritten or printed characters.
The input x might be a vector of real numbers, or a discrete feature vector. In the case of
binary classification problems the output y might be an element of the set {−1, 1}, while for
a multi-class classification problem with N categories the output might be an integer in
{1, . . . , N }.
The general goal of classification is to learn a decision boundary, often specified as the level
set of a function, e.g., a(x) = 0. The purpose of the decision boundary is to identify the regions
of the input space that correspond to each class. For binary classification the decision boundary is
the surface in the feature space that separates the test inputs into two classes; points x for which
a(x) < 0 are deemed to be in one class, while points for which a(x) > 0 are in the other. The
points on the decision boundary, a(x) = 0, are those inputs for which the two classes are equally
probable.
In this chapter we introduce several basic methods for classification. We focus mainly on
binary classification problems for which the methods are conceptually straightforward, easy
to implement, and often quite effective. In subsequent chapters we discuss some of the more
sophisticated methods that might be needed for more challenging problems.
That is, if the posterior probability of C1 is larger than the probability of C2 , then we might classify
the input as belonging to class 1. Equivalently, we can compare their ratio to 1:
$$ \frac{P(C_1 \mid x)}{P(C_2 \mid x)} > 1\ ? \tag{162} $$
If this ratio is greater than 1 (i.e. P (C1 |x) > P (C2 |x)) then we classify x as belonging to class 1,
and class 2 otherwise.
The quantities P (Ci |x) can be computed using Bayes’ Rule as:
$$ P(C_i \mid x) = \frac{p(x \mid C_i)\, P(C_i)}{p(x)} \tag{163} $$
Gaussian Class Conditionals. As a concrete example, consider a generative model in which the
inputs associated with the ith class (for i = 1, 2) are modeled with a Gaussian distribution, i.e.,
Also, let’s assume that the prior class probabilities are equal:
$$ P(C_i) = \frac{1}{2}. \tag{166} $$
The values of µi and Σi can be estimated by maximum likelihood on the individual classes in the
training data.
Given these models, you can show that the log of the posterior ratio (164) is given by
$$ a(x) = -\frac{1}{2} (x - \mu_1)^T \Sigma_1^{-1} (x - \mu_1) - \frac{1}{2} \ln |\Sigma_1| + \frac{1}{2} (x - \mu_2)^T \Sigma_2^{-1} (x - \mu_2) + \frac{1}{2} \ln |\Sigma_2| \tag{167} $$
The sign of this function determines the class of x, since the ratio of posterior class probabilities
is greater than 1 when this log is greater than zero. Since a(x) is quadratic in x, the decision
boundary (i.e., the set of points satisfying a(x) = 0) is a conic section (e.g., a parabola, an ellipse,
a line, etc.). Furthermore, in the special case where Σ1 = Σ2 , the decision boundary is linear
(why?).
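Equation 167 translates directly into code. The sketch below assumes equal class priors, as in Equation 166; the class means and covariance used in the example are hypothetical. With a shared covariance the quadratic terms in x cancel, so a(x) is an affine function of x, which the last lines check at a midpoint.

```python
import numpy as np

def log_posterior_ratio(x, mu1, Sigma1, mu2, Sigma2):
    """a(x) from Equation 167, assuming equal class priors."""
    def log_density_up_to_constant(mu, Sigma):
        d = x - mu
        _, logdet = np.linalg.slogdet(Sigma)   # log |Sigma|, computed stably
        return -0.5 * d @ np.linalg.solve(Sigma, d) - 0.5 * logdet
    return log_density_up_to_constant(mu1, Sigma1) - log_density_up_to_constant(mu2, Sigma2)

def classify(x, mu1, Sigma1, mu2, Sigma2):
    """Return 1 if a(x) > 0, and 2 otherwise."""
    return 1 if log_posterior_ratio(x, mu1, Sigma1, mu2, Sigma2) > 0 else 2

# Hypothetical classes that share a covariance matrix.
mu1, mu2 = np.array([0.0, 0.0]), np.array([2.0, 0.0])
Sigma = np.eye(2)

p, q = np.array([-1.0, 3.0]), np.array([4.0, -2.0])
a_p = log_posterior_ratio(p, mu1, Sigma, mu2, Sigma)
a_q = log_posterior_ratio(q, mu1, Sigma, mu2, Sigma)
a_mid = log_posterior_ratio(0.5 * (p + q), mu1, Sigma, mu2, Sigma)
print(a_p, a_q, a_mid)   # a_mid equals the average of a_p and a_q when a(x) is affine
```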
[Figure 10: two panels; horizontal axes from −15 to 10, vertical axes from −20 to 25; contours labelled 0.]
Figure 10: GCC classification boundaries for two cases. Note that the decision boundary is linear
when both classes have the same covariance.
At this point, we can forget about the generative model (e.g., the Gaussian distributions) that we
started with, and use this as our entire model. In other words, rather than learning a distribution
over each class, we learn only the conditional probability of y given x. As a result, we have
fewer parameters to learn since the number of parameters in logistic regression is linear in the
dimension of the input vector, while learning a Gaussian covariance requires a quadratic number
of parameters. With fewer parameters we can learn models more effectively with less data. On the
other hand, we cannot perform other tasks that we could with the generative model (e.g., sampling
from the model; classify data with noisy or missing measurements).
We can learn logistic regression with maximum likelihood. In particular, given data {xi , yi },
we minimize the negative log of:
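\begin{align}
p(\{x_i, y_i\} \mid w, b) &\propto p(\{y_i\} \mid \{x_i\}, w, b) \notag \\
&= \prod_i p(y_i \mid x_i, w, b) \notag \\
&= \prod_{i : y_i = C_1} P(C_1 \mid x_i) \prod_{i : y_i = C_2} \left( 1 - P(C_1 \mid x_i) \right) \tag{174}
\end{align}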
In the first step above we have assumed that the input features are independent of the weights in
the logistic regressor, i.e., p({xi }) = p({xi }|w, b). So this term can be ignored in the likelihood
since it is constant with respect to the unknowns. In the second step we have assumed that the
input-output pairs are independent, so the joint likelihood is the product of the likelihoods for each
input-output pair.
The decision boundary for logistic regression is linear; in 2D, it is a line. To see this, recall
that the decision boundary is the set of points P (C1 |x) = 1/2. Solving for x gives the points
wT x + b = 0, which is a line in 2D, or a hyperplane in higher dimensions.
Although this objective function cannot be optimized in closed-form, it is convex, which means
that it has a single minimum. Therefore, we can optimize it with gradient descent (or any other
gradient-based search technique), which will be guaranteed to find the global minimum.
If the classes are linearly separable, this approach will lead to very large values of the weights
w, since as the magnitude of w tends to infinity, the function g(a(x)) behaves more and more like
a step function and thus assigns higher likelihood to the data. This can be prevented by placing a
weight-decay prior on w: p(w) = G(w; 0, σ²).
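As a minimal sketch of this procedure (not part of the original notes), the following Python code fits a
two-class logistic regressor by gradient descent on the negative log-likelihood, with the weight-decay
term added. The function names, the 0/1 label encoding, the fixed step size, and the iteration count
are all assumptions made for illustration.

```python
import math

def sigmoid(a):
    """Logistic function g(a) = 1 / (1 + exp(-a)), written to avoid overflow."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)

def fit_logistic(xs, ys, sigma2=10.0, step=0.1, iters=2000):
    """Gradient descent on the negative log posterior of (w, b).

    xs: list of feature lists; ys: labels, 1 for class C1 and 0 for class C2
    (an encoding assumed here). sigma2 is the variance of the Gaussian
    weight-decay prior on w; no prior is placed on b.
    """
    d = len(xs[0])
    w, b = [0.0] * d, 0.0
    for _ in range(iters):
        grad_w = [wj / sigma2 for wj in w]      # gradient of ||w||^2 / (2 sigma^2)
        grad_b = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)
            for j in range(d):
                grad_w[j] += (p - y) * x[j]     # gradient of the negative log-likelihood
            grad_b += p - y
        w = [wj - step * gj for wj, gj in zip(w, grad_w)]
        b -= step * grad_b
    return w, b

# Linearly separable 1D data: the prior keeps the weight finite.
w, b = fit_logistic([[-2.0], [-1.0], [1.0], [2.0]], [0, 0, 1, 1])
```

The decision boundary is the set of points where the weighted sum plus b is zero; for this symmetric
data set it sits at the origin.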
Multiclass classification. Logistic regression can also be applied to multiclass classification, i.e.,
where we wish to classify a data point as belonging to one of K classes. In this case, the probability
of data vector x being in class i is:
$$P(C_i \mid \mathbf{x}) = \frac{e^{-\mathbf{w}_i^T \mathbf{x}}}{\sum_{k=1}^{K} e^{-\mathbf{w}_k^T \mathbf{x}}} \tag{175}$$
You should be able to see that this is equivalent to the method described above in the two-class case.
Furthermore, it is straightforward to show that this is a sensible choice of probability: 0 ≤ P (Ci |x),
and Σk P (Ck |x) = 1 (verify these for yourself).
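The two properties can also be checked numerically. The sketch below evaluates Equation (175) for a
list of weight vectors; the function name and the list-based representation are assumptions for
illustration. Subtracting the largest score before exponentiating does not change the ratios, and it
guards against overflow.

```python
import math

def class_probabilities(ws, x):
    """P(C_i | x) as in Equation (175): exp(-w_i^T x), normalized over the K classes."""
    scores = [-sum(wj * xj for wj, xj in zip(w, x)) for w in ws]
    m = max(scores)                          # shift by the max; the ratios are unchanged
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = class_probabilities([[1.0, 0.0], [0.5, 2.0], [-1.0, 0.3]], [1.0, 3.0])
```

With K = 2 the first probability reduces to a logistic function of the difference of the two weight
vectors applied to x, which is the two-class model.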
where
$$\operatorname{sign}(z) = \begin{cases} -1 & z \le 0 \\ \phantom{-}1 & z > 0 \end{cases} \tag{178}$$
Alternatively, we might take a weighted average of the K-nearest neighbors:
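$$y = \operatorname{sign}\Bigl(\sum_{i \in N_K(\mathbf{x})} w(\mathbf{x}_i)\, y_i\Bigr), \qquad w(\mathbf{x}_i) = e^{-\|\mathbf{x}_i - \mathbf{x}\|^2 / 2\sigma^2} \tag{179}$$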
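A minimal sketch of both voting rules, assuming labels of −1 and +1 and inputs stored as lists of
numbers (the function name and the optional sigma argument are illustrative choices, not part of the
notes):

```python
import math

def knn_classify(train_x, train_y, x, k, sigma=None):
    """Classify x from its k nearest training points; labels are -1 or +1.

    If sigma is given, each neighbour is weighted by
    exp(-||x_i - x||^2 / (2 sigma^2)) as in (179); otherwise all weights are 1.
    """
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    nearest = sorted(zip(train_x, train_y), key=lambda p: sq_dist(p[0], x))[:k]
    total = 0.0
    for xi, yi in nearest:
        weight = 1.0 if sigma is None else math.exp(-sq_dist(xi, x) / (2 * sigma ** 2))
        total += weight * yi
    return 1 if total > 0 else -1            # sign(z) as in (178): -1 when z <= 0

train_x = [[0.4], [2.0], [2.1]]
train_y = [1, -1, -1]
plain = knn_classify(train_x, train_y, [0.5], k=3)                # majority vote
weighted = knn_classify(train_x, train_y, [0.5], k=3, sigma=0.5)  # distance-weighted vote
```

Here the unweighted vote is outvoted by the two distant points of class −1, while the weighted vote
follows the single very close point of class +1.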
Figure 12: For two classes and planar inputs, the decision boundary for a 1NN classifier (the
bold black curve) is a subset of the perpendicular bisecting line segments (green) between pairs of
neighbouring points (obtained with a Voronoi tessellation).
1. Generative models, such as the GCC, describe the complete probability of the data p(x, y).
2. Discriminative models, such as LR, ANNs, and KNN, describe the conditional probability
of the output given the input: p(y|x)
The same distinction occurs in regression and classification, e.g., KNN is a discriminative method
that can be used for either classification or regression.
The distinction is clearest when comparing LR with GCC with equal covariances, since they
are both linear classifiers, but the training algorithms are different. This is because they have
different goals; LR is optimized for classification performance, whereas the GCC is a “complete”
model of the probability of the data that is then pressed into service for classification. As a
consequence, GCC may perform poorly with non-Gaussian data. Conversely, LR is not premised on any
particular form of distribution for the two class distributions. On the other hand, LR can only be
used for classification, whereas the GCC can be used for other tasks, e.g., to sample new x data, to
classify noisy inputs or inputs with outliers, and so on.
Figure 13: In this example there are two classes, one with a small isotropic covariance, and one
with an anisotropic covariance. One can clearly see that the data are linearly separable (i.e., a line
exists that correctly separates the input training samples). Despite this, LS regression does not
separate the training data well. Rather, the LS regression decision boundary produces 5 incorrectly
classified training points.
The distinctions between generative and discriminative models become more significant in
more complex problems. Generative models allow us to put more prior knowledge into how we
build the model, but classification may often involve difficult optimization of p(y|x); discriminative
methods are typically more efficient and generic, but are harder to specialize to particular problems.
for labeled training data {xi , yi }. Given the optimal regression weights, one could then perform
regression on subsequent test inputs and use the sign of the output to determine the output class.
In simple cases this can perform well, but in general it will perform poorly. This is because the
objective function in linear regression measures the distance from the modeled class labels (which
can be any real number) to the true class labels, which may not provide an accurate measure of how
well the model has classified the data. For example, a linear regression model will tend to produce
predicted labels that lie outside the range of the class labels for “extreme” members of a given
class (e.g. 5 when the class label is 1), causing the error to be measured as high even when the
classification (given, say, by the sign of the predicted label) is correct. In such a case the decision
boundary may be shifted towards such an extreme case, potentially reducing the number of correct
classifications made by the model. Figure 13 demonstrates this with a simple example.
The problem arises from the fact that the constraint that y ∈ {−1, 1} is not built into the model
(the regression algorithm knows nothing about it), and so the model wastes considerable
representational power trying to reproduce this effect. It is much better to build this constraint
into the model.
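The effect is easy to reproduce in one dimension. The following sketch uses a made-up data set (the
numbers are chosen only for illustration) in which a single “extreme” member of class +1 sits far
from the rest; the helper name fit_line is also an assumption.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = w*x + b for scalar inputs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    w = sxy / sxx
    return w, my - w * mx

# Class -1 lies at x < 0 and class +1 at x > 0, so any threshold
# between -1 and 0.5 separates the training data perfectly.
xs = [-2.0, -1.0, 0.5, 1.0, 2.0, 40.0]
ys = [-1, -1, 1, 1, 1, 1]

w, b = fit_line(xs, ys)
predicted = [1 if w * x + b > 0 else -1 for x in xs]
errors = sum(p != y for p, y in zip(predicted, ys))
boundary = -b / w                      # where the fitted line crosses zero
```

For this data the point at x = 40 flattens the fitted line so much that its zero crossing moves to
the left of every training point, and both members of class −1 are misclassified even though the
classes are linearly separable.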
With this assumption, rather than estimating one d-dimensional density, we instead estimate d 1-
dimensional densities. This is important because each 1D Gaussian only has two parameters, its
mean and variance, both of which are scalars. So the model has 2d unknowns. In the Gaussian
case, the Naïve Bayes model effectively replaces the general d × d covariance matrix by a diagonal
matrix. There are d entries along the diagonal of the covariance matrix; the ith entry is the variance
of xi |C. This model is not as expressive but it is much easier to estimate.
“business”), or another attribute (e.g., F4 = 1 might mean that the mail headers appear forged).
Similarly, a classifier to distinguish news stories between sports and financial news might be based
on particular words and phrases such as “team,” “baseball,” and “mutual funds.”
To understand the complexity of discrete class conditional models in general (i.e., without using
the Naïve Bayes model), consider the distribution over 3 inputs, for class C = 1, i.e., P (F1:3 | C =
1). (There will be another model for C = 0, but for our little thought experiment here we’ll just
consider the model for C = 1.) Using basic rules of probability, we find that
P (F1:3 | C = 1) = P (F1 | C = 1, F2 , F3 ) P (F2 , F3 | C = 1)
= P (F1 | C = 1, F2 , F3 ) P (F2 | C = 1, F3 ) P (F3 | C = 1) (182)
Now, given C = 1 we know that F3 is either 0 or 1 (i.e., it is a coin toss), and to model it we simply
want to know the probability P (F3 = 1 | C = 1). Of course the probability that F3 = 0 is simply
1 − P (F3 = 1 | C = 1). In other words, with one parameter we can model the third factor above,
P (F3 | C = 1).
Now consider the second factor P (F2 | C = 1, F3 ). In this case, because F2 depends on F3 ,
and there are two possible states of F3 , there are two distributions we need to model, namely
P (F2 | C = 1, F3 = 0) and P (F2 | C = 1, F3 = 1). Accordingly, we will need two parameters, one
for P (F2 = 1 | C = 1, F3 = 0) and one for P (F2 = 1 | C = 1, F3 = 1). Using the same logic, to
model P (F1 | C = 1, F2 , F3 ) will require one model parameter for each possible setting of (F2 , F3 ),
and of course there are 2^2 such settings. For D-dimensional binary inputs, there are O(2^(D−1))
parameters that one needs to learn. The number of parameters required grows prohibitively large
as D increases.
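The count can be made concrete by adding up the parameters factor by factor. This short sketch
(the function name is hypothetical) follows the chain-rule factorization used above:

```python
def full_model_parameter_count(D):
    """Free parameters in P(F_1:D | C = 1) for binary features, without independence assumptions."""
    # The factor for feature i is conditioned on the remaining D - i features,
    # so it needs one parameter per joint setting of those features: 2**(D - i).
    return sum(2 ** (D - i) for i in range(1, D + 1))

counts = {D: full_model_parameter_count(D) for D in (3, 10, 20)}
```

For D = 3 this gives 4 + 2 + 1 = 7 parameters, and in general the total is 2^D − 1 for each class.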
The Naïve Bayes model, by comparison, has only D parameters to be learned for each class. The
assumption of Naïve Bayes is that the features are all conditionally independent given the class. The
independence assumption is often very naïve, yet the algorithm often works well nonetheless.
This means that the likelihood of a feature vector for a particular class j is given by
$$P(F_{1:D} \mid C = j) = \prod_i P(F_i \mid C = j) \tag{183}$$
where C denotes a class C ∈ {1, 2, ...K}. The probabilities P (Fi |C) are parameters of the model:
P (Fi = 1|C = j) = ai,j (184)
We must also define class priors P (C = j) = bj .
To classify a new feature vector using this model, we choose the class with maximum proba-
bility given the features. By Bayes’ Rule this is:
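\begin{align}
P(C = j \mid F_{1:D}) &= \frac{P(F_{1:D} \mid C = j)\, P(C = j)}{P(F_{1:D})} \tag{185} \\
&= \frac{\bigl(\prod_i P(F_i \mid C = j)\bigr)\, P(C = j)}{\sum_{\ell=1}^{K} P(F_{1:D}, C = \ell)} \tag{186} \\
&= \frac{\prod_{i: F_i = 1} a_{i,j} \prod_{i: F_i = 0} (1 - a_{i,j})\, b_j}{\sum_{\ell=1}^{K} \prod_{i: F_i = 1} a_{i,\ell} \prod_{i: F_i = 0} (1 - a_{i,\ell})\, b_\ell} \tag{187}
\end{align}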
If we wish to find the class with maximum posterior probability, we need only compute the
numerator. The denominator in (187) is of course the same for all classes j. If the posterior
probabilities themselves are needed, the denominator is simply the sum of the numerators over all
classes, so one divides the numerator for each class by that sum.
The above computation involves the product of many numbers, some of which might be quite
small. This can lead to underflow. For example, if you take the product a1 a2 ...aN , and all ai << 1,
then the computation may evaluate to zero in floating point, even though the final computation
after normalization should not be zero. If this happens for all classes, then the denominator will be
zero, and you get a divide-by-zero error, even though, mathematically, the denominator cannot be
zero. To avoid these problems, it is safer to perform the computations in the log-domain:
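$$\alpha_j = \Bigl(\sum_{i: F_i = 1} \ln a_{i,j} + \sum_{i: F_i = 0} \ln(1 - a_{i,j})\Bigr) + \ln b_j \tag{188}$$
$$\gamma = \max_j \alpha_j \tag{189}$$
$$P(C = j \mid F_{1:D}) = \frac{\exp(\alpha_j - \gamma)}{\sum_\ell \exp(\alpha_\ell - \gamma)} \tag{190}$$
which, as you can see by inspection, is mathematically equivalent to the original form, but will not
evaluate to zero for at least one class: the class with the largest αj contributes exp(0) = 1.
Because γ is the largest αj , every exponent is at most zero, so the exponentials cannot overflow either.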
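A minimal sketch of the log-domain computation, assuming the parameters are stored as nested lists
(a[i][j] for P(Fi = 1 | C = j) and b[j] for the class prior, with classes indexed from 0; these
conventions and the function name are illustrative). The sketch subtracts the largest αj, the usual
choice for keeping every exponent at or below zero.

```python
import math

def naive_bayes_posterior(features, a, b):
    """Posterior P(C = j | F_1:D), computed in the log domain.

    features: list of 0/1 values F_i.
    a[i][j]: P(F_i = 1 | C = j), assumed to lie strictly between 0 and 1.
    b[j]: class prior P(C = j), assumed positive.
    """
    alphas = []
    for j in range(len(b)):
        alpha = math.log(b[j])
        for i, f in enumerate(features):
            alpha += math.log(a[i][j]) if f == 1 else math.log(1.0 - a[i][j])
        alphas.append(alpha)
    gamma = max(alphas)                  # largest log-score; exponents below are all <= 0
    unnorm = [math.exp(alpha - gamma) for alpha in alphas]
    total = sum(unnorm)                  # at least 1, since the best class contributes exp(0)
    return [u / total for u in unnorm]

# With 2000 features the direct product 0.01**2000 underflows to 0.0,
# but the log-domain computation still returns a valid distribution.
D = 2000
posterior = naive_bayes_posterior([1] * D, [[0.01, 0.02]] * D, [0.5, 0.5])
```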
8.7.2 Learning
For a collection of N training vectors Fk , each with an associated class label Ck , we can learn the
parameters by maximizing the data likelihood (i.e., the probability of the data given the model).
This is equivalent to estimating multinomial distributions (in the case of binary features, binomial
distributions), and reduces to simple counting of features.
Suppose there are Nk examples of class k, and N examples total. Then the prior estimate is
simply:
$$b_k = \frac{N_k}{N} \tag{191}$$
Similarly, if class k has Ni,k examples where Fi = 1, then
$$a_{i,k} = \frac{N_{i,k}}{N_k} \tag{192}$$
With large numbers of features and small datasets, it is likely that some features will never be
seen for a class, giving a class probability of zero for that feature. We might wish to regularize, to
prevent this extreme model from occurring. We can modify the learning rule as follows:
$$a_{i,k} = \frac{N_{i,k} + \alpha}{N_k + 2\alpha} \tag{193}$$
for some small value α. In the extreme case where there are no examples of class k at all
(Nk = 0), the probability ai,k will be set to 1/2, corresponding to no knowledge. As the
number of examples Nk becomes large, the role of α will become smaller and smaller.
In general, given a multinomial distribution with a large number of classes and a small
training set, we might end up with estimates of prior probability bk being zero for some classes.
This might be undesirable for various reasons, or be inconsistent with our prior beliefs. Again, to
avoid this situation, we can regularize the maximum likelihood estimator with our prior belief
that all classes should have a nonzero probability. In doing so we can estimate the class prior
probabilities as
$$b_k = \frac{N_k + \beta}{N + K\beta} \tag{194}$$
for some small value of β. When there are no observations whatsoever, all classes are given
probability 1/K. When there are observations the estimated probabilities will lie between Nk /N
and 1/K (converging to Nk /N as N → ∞).
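The counting rules translate almost directly into code. The sketch below is one possible
arrangement, assuming binary feature vectors stored as lists and class labels numbered 0 to K − 1
(the notes number classes from 1); the function name and the default values of alpha and beta are
illustrative choices.

```python
def learn_naive_bayes(F, C, K, alpha=1.0, beta=1.0):
    """Estimate a[i][k] and b[k] by counting, with the smoothing of (193) and (194).

    F: list of binary feature vectors; C: class labels in 0..K-1.
    Setting alpha = beta = 0 gives the maximum likelihood estimates (191) and (192),
    which are undefined for a class that has no training examples.
    """
    N, D = len(F), len(F[0])
    N_k = [0] * K                              # number of examples of each class
    N_ik = [[0] * K for _ in range(D)]         # examples of class k with F_i = 1
    for f, c in zip(F, C):
        N_k[c] += 1
        for i in range(D):
            N_ik[i][c] += f[i]
    b = [(N_k[k] + beta) / (N + K * beta) for k in range(K)]
    a = [[(N_ik[i][k] + alpha) / (N_k[k] + 2 * alpha) for k in range(K)]
         for i in range(D)]
    return a, b

# Three training vectors, three classes; class 2 is never observed.
a, b = learn_naive_bayes([[1, 0], [1, 1], [0, 0]], [0, 0, 1], K=3)
```

In this example the unobserved class still receives a nonzero prior, and its feature probabilities
are all 1/2.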
Derivation. Here we derive just the per-class probability assuming two classes, ignoring the
feature vectors; this case reduces to estimating a binomial distribution. The full estimation can
easily be derived in the same way.
Suppose we observe N examples of class 0, and M examples of class 1; what is b0 , the proba-
bility of observing class 0? Using maximum likelihood estimation, we maximize:
\begin{align}
\prod_i P(C_i = k) &= \Bigl(\prod_{i: C_i = 0} P(C_i = 0)\Bigr) \Bigl(\prod_{i: C_i = 1} P(C_i = 1)\Bigr) \tag{195} \\
&= b_0^N\, b_1^M \tag{196}
\end{align}
Furthermore, in order for the class probabilities to be a valid distribution, it is required that b0 +b1 =
1, and that bk ≥ 0. In order to enforce the first constraint, we set b1 = 1 − b0 :
$$\prod_i P(C_i = k) = b_0^N (1 - b_0)^M \tag{197}$$
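The step from this likelihood to its maximizer can be sketched as follows; the symbol L(b0) is
introduced here only for illustration. Taking the negative log of the likelihood and setting its
derivative to zero gives

$$L(b_0) = -N \ln b_0 - M \ln(1 - b_0)$$
$$\frac{dL}{db_0} = -\frac{N}{b_0} + \frac{M}{1 - b_0} = 0 \quad\Longrightarrow\quad N (1 - b_0) = M b_0$$

Solving the last equation for b0 gives the estimate: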
$$b_0^* = \frac{N}{N + M} \tag{200}$$
which, fortunately, is guaranteed to satisfy the constraint b0 ≥ 0.
9 Gradient Descent
There are many situations in which we wish to minimize an objective function with respect to a
parameter vector:
w∗ = arg min E(w) (201)
w
but no closed-form solution for the minimum exists. In machine learning, this optimization is
normally a data-fitting objective function, but similar problems arise throughout computer science,
numerical analysis, physics, finance, and many other fields.
The solution we will use in this course is called gradient descent. It works for any differentiable
energy function. However, it does not come with many guarantees: it is only guaranteed to
find a local minimum in the limit of infinite computation time.
Gradient descent is iterative. First, we obtain an initial estimate w1 of the unknown parameter
vector. How we obtain this vector depends on the problem; one approach is to randomly-sample
values for the parameters. Then, from this initial estimate, we note that the direction of steepest
descent from this point is to follow the negative gradient −∇E of the objective function evaluated
at w1 . The gradient is defined as a vector of derivatives with respect to each of the parameters:
$$\nabla E \equiv \begin{bmatrix} \frac{dE}{dw_1} \\ \vdots \\ \frac{dE}{dw_N} \end{bmatrix} \tag{202}$$
The key point is that, if we follow the negative gradient direction for a small enough distance, the
objective function is guaranteed to decrease. (This can be shown by considering a Taylor-series
approximation to the objective function.)
It is easiest to visualize this process by considering E(w) as a surface parameterized by w; we
are trying to find the deepest pit in the surface. We do so by taking small downhill steps in the
negative gradient direction.
The entire process, in its simplest form, can be summarized as follows:
Note that this process depends on three choices: the initialization, the termination conditions,
and the step-size λ. For the termination condition, one can run until a preset number of steps has
elapsed, or monitor convergence, i.e., terminate when
$$|E(\mathbf{w}_{i+1}) - E(\mathbf{w}_i)| < \epsilon \tag{203}$$
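The process described above can be sketched in a few lines of Python. This is a minimal fixed-step
version written for illustration: the function name, the default step size, and the cap on the
number of iterations are assumptions, and the objective and its gradient are supplied by the caller.

```python
def gradient_descent(E, grad_E, w, lam=0.1, eps=1e-10, max_iters=10000):
    """Minimize E by fixed-step gradient descent, starting from the list w.

    lam is the step size (lambda); the loop stops when the change in E
    falls below eps, as in (203), or after max_iters steps.
    """
    e_old = E(w)
    for _ in range(max_iters):
        g = grad_E(w)
        w = [wi - lam * gi for wi, gi in zip(w, g)]   # step along the negative gradient
        e_new = E(w)
        if abs(e_new - e_old) < eps:
            break
        e_old = e_new
    return w

# Example objective: E(w) = (w1 - 1)^2 + 2 (w2 + 3)^2, minimized at (1, -3).
E = lambda w: (w[0] - 1) ** 2 + 2 * (w[1] + 3) ** 2
grad_E = lambda w: [2 * (w[0] - 1), 4 * (w[1] + 3)]
w_star = gradient_descent(E, grad_E, [0.0, 0.0])
```

If the step size is too large the iterates can overshoot and diverge, which is one reason line
search or more advanced methods are often preferred in practice.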
There are many, many more advanced methods for numerical optimization. For unconstrained
optimization, I recommend the L-BFGS-B library, which is available for download on the web. It
is written in Fortran, but there are wrappers for various languages out there. This method will be
vastly superior to gradient descent for most problems.
Aside:
The term backpropagation is sometimes used to refer to an efficient algorithm for
computing derivatives for Artificial Neural Networks. Confusingly, this term is also
used to refer to gradient descent (without line search) for ANNs.
10 Cross Validation
Suppose we must choose between two possible ways to fit some data. How do we choose between
them? Simply measuring how well they fit the data would mean that we always try to fit the
data as closely as possible; the best method for fitting the data is simply to memorize it in a big
look-up table. However, fitting the data is no guarantee that we will be able to generalize to new
measurements. As another example, consider the use of polynomial regression to model a function
given a set of data points. Higher-order polynomials will always fit the data as well or better than
a low-order polynomial; indeed, an N − 1 degree polynomial will fit N data points exactly (to
within numerical error). So just fitting the data as well as we can usually produces models with
many parameters, and they are not going to generalize to new inputs in almost all cases of interest.
The general solution is to evaluate models by testing them on a new data set (the “test set”),
distinct from the training set. This measures how predictive the model is: Is it useful in new
situations? More generally, we often wish to obtain empirical estimates of performance. This
can be useful for finding errors in implementation, comparing competing models and learning
algorithms, and detecting over or under fitting in a learned model.
10.1 Cross-Validation
The idea of empirical performance evaluation can also be used to determine model parameters that
might otherwise be hard to determine. Examples of such model parameters include the constant K
in the K-Nearest Neighbors approach or the σ parameter in the Radial Basis Function approach.
Hold-out Validation. In the simplest method, we first partition our data randomly into a “training
set” and a “validation set.” Let K be the unknown model parameter. We pick a range of
possible values for K (e.g., K = 1, ..., 5). For each possible value of K, we learn a model with
that K on the training set, and compute that model’s error on the validation set. For example, the
error on the validation set might be just the squared error, Σi ||yi − f (xi )||². We then pick the K
which has the smallest validation set error. The same idea can be applied if we have more model
parameters (e.g., the σ in KNN); however, we must then try many possible combinations of K and σ to
find the best.
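As a concrete sketch, the code below selects K for a simple 1D K-nearest-neighbour regressor using a
single random split. The regressor, the 70/30 split, and all function names are assumptions made for
illustration.

```python
import random

def knn_predict(train, x, K):
    """Average the outputs of the K training points nearest to x (1D inputs)."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:K]
    return sum(y for _, y in nearest) / K

def squared_error(train, held_out, K):
    """Sum of squared prediction errors on the held-out (x, y) pairs."""
    return sum((y - knn_predict(train, x, K)) ** 2 for x, y in held_out)

def holdout_select(data, candidates, frac_train=0.7, seed=0):
    """Pick the K with the smallest squared error on a random validation split."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    n_train = int(frac_train * len(shuffled))
    train, val = shuffled[:n_train], shuffled[n_train:]
    errors = {K: squared_error(train, val, K) for K in candidates}
    return min(errors, key=errors.get), errors

data = [(x / 10.0, (x / 10.0) ** 2) for x in range(40)]
best_K, errors = holdout_select(data, candidates=[1, 2, 3, 4, 5])
```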
There is a significant problem with this approach: we use less training data when fitting the
other model parameters, and so we will only get good results if our initial training set is rather
large. If large amounts of data are expensive or impossible to obtain this can be a serious problem.
N -Fold Cross Validation. We can use the data much more efficiently by N -fold cross-validation.
In this approach, we randomly partition the training data into N sets of equal size and run the
learning algorithm N times. Each time, a different one of the N sets is deemed the test set, and
the model is trained on the remaining N − 1 sets. The value of K is scored by averaging the error
across the N test errors. We can then pick the value of K that has the lowest score, and then learn
model parameters for this K.
A good choice for N is N = M , where M is the number of data points, so that each model is
trained on M − 1 points and tested on the one remaining point. This is called
Leave-one-out cross-validation.
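A minimal sketch of N-fold cross-validation for the same kind of model-parameter search, again using
a simple 1D K-nearest-neighbour regressor as a stand-in model (the regressor and all names are
assumptions for illustration). Setting n_folds to the number of data points gives leave-one-out
cross-validation.

```python
import random

def knn_predict(train, x, K):
    """Average the outputs of the K training points nearest to x (1D inputs)."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:K]
    return sum(y for _, y in nearest) / K

def cross_validate(data, candidates, n_folds=5, seed=0):
    """Score each candidate K by its average held-out squared error over the folds."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    folds = [shuffled[i::n_folds] for i in range(n_folds)]   # random parts of near-equal size
    scores = {}
    for K in candidates:
        total = 0.0
        for i, test in enumerate(folds):
            train = [p for j, fold in enumerate(folds) if j != i for p in fold]
            total += sum((y - knn_predict(train, x, K)) ** 2 for x, y in test)
        scores[K] = total / n_folds
    return min(scores, key=scores.get), scores

data = [(x / 10.0, (x / 10.0) ** 2) for x in range(20)]
best_K, scores = cross_validate(data, candidates=[1, 2, 3], n_folds=4)
loo_K, loo_scores = cross_validate(data, candidates=[1, 2, 3], n_folds=len(data))
```

After the best K has been chosen, the model would normally be refit with that K on all of the
training data.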
Issues with Cross Validation. Cross validation is a very simple and empirical way of comparing
models. However, there are a number of issues to keep in mind:
• The method can be very time-consuming, since many training runs may be needed. For
models with more than a few parameters, cross validation may be too inefficient to be useful.
• Because a reduced dataset is used for training, there must be sufficient training data so that
all relevant phenomena of the problem exist in both the training data and the test data.
• It is safest to use a random partition, to avoid the possibility that there are unmodeled cor-
relations in the data. For example, if the data was collected over a period of a week, it is
possible that data from the beginning of the week has a different structure than the data later
in the week.
Aside:
Testing machine learning algorithms is very much like testing scientific theories:
scientific theories must be predictive, or, that is, falsifiable. Scientific theories must
also describe plausible models of reality, whereas machine learning methods need
only be useful for making decisions. However, statistical inference and learning first
arose as theories of scientific hypothesis testing, and remain closely related today.
One of the most famous examples is the case of planetary motion. Prior to Newton,
astronomers described the motion of the planets through onerous tabulation of
measurements — essentially, big lookup tables. These tables were not especially
predictive, and needed to be updated constantly. Newton’s equations of motion,
which could describe the motion of the planets with only a few simple equations,
were vastly simpler and yet also more effective at predicting motion, and became the
accepted theories of motion.
However, there remained some anomalies. Two astronomers, John Couch Adams
and Urbain Le Verrier, thought that these discrepancies might be due to a new,
as-yet-undiscovered planet. Using techniques similar to modern regression, but with
laborious hand-calculation, they independently deduced the position, mass, and
orbit of the new planet. By observing in the predicted directions, astronomers were
indeed able to observe a new planet, which was later named Neptune. This provided
powerful validation for their models.
11 Bayesian Methods
So far, we have considered statistical methods which select a single “best” model given the data.
This approach can have problems, such as over-fitting when there is not enough data to fully con-
strain the model fit. In contrast, in the “pure” Bayesian approach, as much as possible we only com-
pute distributions over unknowns; we never maximize anything. For example, consider a model
parameterized by some weight vector w, and some training data D that comprises input-output
pairs xi , yi , for i = 1...N . The posterior probability distribution over the parameters, conditioned
on the data is, using Bayes’ rule, given by
$$p(\mathbf{w} \mid D) = \frac{p(D \mid \mathbf{w})\, p(\mathbf{w})}{p(D)} \tag{205}$$
The reason we want to fit the model in the first place is to allow us to make predictions with future
test data. That is, given some future input xnew , we want to use the model to predict ynew . To
accomplish this task through estimation in previous chapters, we used optimization to find ML or
MAP estimates of w, e.g., by maximizing (205).
In a Bayesian approach, rather than estimating a single best value for w, we compute (or
approximate) the entire posterior distribution p(w|D). Given the entire distribution, we can still
make predictions with the following integral:
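\begin{align}
p(y_{new} \mid D, \mathbf{x}_{new}) &= \int p(y_{new}, \mathbf{w} \mid D, \mathbf{x}_{new})\, d\mathbf{w} \notag \\
&= \int p(y_{new} \mid \mathbf{w}, D, \mathbf{x}_{new})\, p(\mathbf{w} \mid D, \mathbf{x}_{new})\, d\mathbf{w} \tag{206}
\end{align}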
The first step in this equality follows from the Sum Rule. The second follows from the Product
Rule. Additionally, the outputs ynew and training data D are independent conditioned on w, so
p(ynew |w, D) = p(ynew |w). That is, given w, we have all available information about making
predictions that we could possibly get from the training data D (according to the model). Finally,
given D, it is safe to assume that xnew , in itself, provides no information about w. With these
assumptions we have the following expression for our predictions:
$$p(y_{new} \mid D, \mathbf{x}_{new}) = \int p(y_{new} \mid \mathbf{w}, \mathbf{x}_{new})\, p(\mathbf{w} \mid D)\, d\mathbf{w} \tag{207}$$
The integral in (207) is rarely easy to compute, often involving intractable integrals or expo-
nentially large summations. Thus, Bayesian methods often rely on numerical approximations, such
as Monte Carlo sampling; MAP estimation can also be viewed as an approximation. However, in
a few cases, the Bayesian computations can be done exactly, as in the regression case discussed
below.
for a fixed set of basis functions b(x) = [b1 (x), ...bM (x)]T .
To complete the model, we also need to define a “prior” distribution over the weights w (de-
noted p(w)) which expresses what we believe about w, in absence of any training data. One might
be tempted to assign a constant density over all possible weights. There are several problems with
this. First, the result cannot be a valid probability distribution since no choice of the constant
will give the density a finite integral. We could, instead, choose a uniform distribution with finite
bounds, however, this will make the resulting computations more complex.
More importantly, a uniform prior is often inappropriate; we often find that smoother functions
are more likely in practice (at least for functions that we have any hope in learning), and so we
should employ a prior that prefers smooth functions. A choice of prior that does so is a Gaussian
prior:
$$\mathbf{w} \sim \mathcal{N}(\mathbf{0}, \alpha^{-1} \mathbf{I}) \tag{209}$$
which expresses a prior belief that smooth functions are more likely. This prior also has the ad-
ditional benefit that it will lead to tractable integrals later on. Note that this prior depends on a
parameter α; we will see later in this chapter how this “hyperparameter” can be determined auto-
matically as well.
As developed in previous chapters on regression, the data likelihood function that follows from
the above model definition (with the input and output components of the training dataset denoted
x1:N and y1:N ) is
$$p(y_{1:N} \mid \mathbf{x}_{1:N}, \mathbf{w}) = \prod_{i=1}^{N} p(y_i \mid \mathbf{x}_i, \mathbf{w}) \tag{210}$$
In the negative log-domain, using Equations (208) and (209), the model is given by:
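\begin{align*}
-\ln p(\mathbf{w} \mid \mathbf{x}_{1:N}, y_{1:N}) &= -\sum_i \ln p(y_i \mid \mathbf{x}_i, \mathbf{w}) - \ln p(\mathbf{w}) + \ln p(y_{1:N} \mid \mathbf{x}_{1:N}) \\
&= \frac{1}{2\sigma^2} \sum_i (y_i - f(\mathbf{x}_i))^2 + \frac{\alpha}{2} \|\mathbf{w}\|^2 + \text{constants}
\end{align*}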
As above in the regression notes, it is useful if we collect the training outputs into a single vector,
i.e., y = [y1 , ..., yN ]^T , and we collect all the basis functions evaluated at each of the inputs into a
matrix B with elements Bi,j = bj (xi ). In doing so we can simplify the log posterior as follows:
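\begin{align}
-\ln p(\mathbf{w} \mid \mathbf{x}_{1:N}, y_{1:N}) &= \frac{1}{2\sigma^2} \|\mathbf{y} - \mathbf{B}\mathbf{w}\|^2 + \frac{\alpha}{2} \|\mathbf{w}\|^2 + \text{constants} \notag \\
&= \frac{1}{2\sigma^2} (\mathbf{y} - \mathbf{B}\mathbf{w})^T (\mathbf{y} - \mathbf{B}\mathbf{w}) + \frac{\alpha}{2} \mathbf{w}^T \mathbf{w} + \text{constants} \notag \\
&= \frac{1}{2} \mathbf{w}^T (\mathbf{B}^T \mathbf{B} / \sigma^2 + \alpha \mathbf{I}) \mathbf{w} - \frac{1}{2} \mathbf{y}^T \mathbf{B} \mathbf{w} / \sigma^2 - \frac{1}{2} \mathbf{w}^T \mathbf{B}^T \mathbf{y} / \sigma^2 + \text{constants} \notag \\
&= \frac{1}{2} (\mathbf{w} - \bar{\mathbf{w}})^T \mathbf{K}^{-1} (\mathbf{w} - \bar{\mathbf{w}}) + \text{constants} \tag{212}
\end{align}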
where
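$$\mathbf{K} = \bigl(\mathbf{B}^T \mathbf{B} / \sigma^2 + \alpha \mathbf{I}\bigr)^{-1} \tag{213}$$
$$\bar{\mathbf{w}} = \mathbf{K} \mathbf{B}^T \mathbf{y} / \sigma^2 \tag{214}$$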
(The last step of the derivation uses the method of completing the square. It is easiest to verify
the last step by going backwards, that is, by multiplying out (w − w̄)^T K^(−1) (w − w̄).)
The derivation above tells us that the posterior distribution over the weight vector is a
multidimensional Gaussian with mean w̄ and covariance matrix K, i.e.,
$$p(\mathbf{w} \mid \mathbf{x}_{1:N}, y_{1:N}) = G(\mathbf{w}; \bar{\mathbf{w}}, \mathbf{K})$$
In other words, our belief about w once we have seen the data is specified by a Gaussian density.
We believe that w̄ is the most probable value for w, but we have uncertainty about this estimate, as
determined by the covariance K. The covariance expresses our uncertainty about these parameters.
If the covariance is very small, then we have a lot of confidence in the MAP estimate. The nature
of the posterior distribution is illustrated visually in Figure 14. Note that w̄ is the MAP estimate
for regression, since it maximizes the posterior.
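For a model with only two weights, the posterior mean and covariance can be computed by hand without
any matrix library. The sketch below assumes the line model y = w0 x + w1 with basis functions
b(x) = [x, 1] and scalar inputs; the function name and the example data are hypothetical.

```python
def posterior(xs, ys, sigma2, alpha):
    """Posterior mean w_bar and covariance K for y = w0*x + w1 plus Gaussian noise.

    sigma2 is the noise variance and alpha the prior precision, as in (213) and (214).
    """
    # Entries of K^{-1} = B^T B / sigma2 + alpha * I, a symmetric 2x2 matrix.
    p00 = sum(x * x for x in xs) / sigma2 + alpha
    p01 = sum(xs) / sigma2
    p11 = len(xs) / sigma2 + alpha
    det = p00 * p11 - p01 * p01
    K = [[p11 / det, -p01 / det],
         [-p01 / det, p00 / det]]          # inverse of a 2x2 matrix
    # The vector B^T y / sigma2.
    r0 = sum(x * y for x, y in zip(xs, ys)) / sigma2
    r1 = sum(ys) / sigma2
    w_bar = [K[0][0] * r0 + K[0][1] * r1,
             K[1][0] * r0 + K[1][1] * r1]
    return w_bar, K

# Noise-free points on the line y = 2x + 1, with a very weak prior.
w_bar, K = posterior([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0], sigma2=0.01, alpha=1e-6)
```

With no data at all the function returns the prior itself (zero mean, covariance I/α), and the
entries of K shrink as more observations are added.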
Prediction. For a new data point xnew , the predictive distribution for ynew is given by:
$$p(y_{new} \mid \mathbf{x}_{new}, D) = \int p(y_{new} \mid \mathbf{x}_{new}, D, \mathbf{w})\, p(\mathbf{w} \mid D)\, d\mathbf{w}$$
Figure 14: Iterative posterior computation for a linear regression model: y = w0 x + w1 . The top
row shows the prior distribution, and several fair samples from the prior distribution. The second
row shows the likelihood over w after observing a single data point (i.e., an x, y pair), along with
the resulting posterior (the normalized product of the likelihood and the prior), and then several fair
samples from the posterior. The third row shows the likelihood when a new observation is added to
the previous observation, followed by the corresponding posterior and random samples from the
posterior. The final row shows the result of 20 observations.
Copyright © 2011 Aaron Hertzmann and David Fleet
The predictive distribution may be viewed as a function from xnew to a distribution over values of
ynew . An example of this for an RBF model is given in Figure 15.
This is the Bayesian way to do regression. To predict a new value ynew for an input xnew ,
we don’t estimate a single model w. Instead we average over all possible models, weighting the
different models according to their posterior probability.
11.2 Hyperparameters
There are often implicit parameters in our model that we hold fixed, such as the covariance con-
stants in linear regression, or the parameters that govern the prior distribution over the weights.
These are usually called “hyperparameters.” For example, in the RBF model, the hyperparameters
comprise α, σ², and the parameters of the basis functions (e.g., the width of the
basis functions). Thus far we have assumed that the hyperparameters were “known” (which means
that someone must set them by hand), or estimated by cross-validation (which has a number of pit-
falls, including long computation times, especially for large numbers of hyperparameters). Instead
of either of these approaches, we may apply the Bayesian approach in order to directly estimate
these values as well.
To find a MAP estimate for the α parameter in the above linear regression example we compute:
where
$$p(\alpha \mid \mathbf{x}_{1:N}, y_{1:N}) = \frac{p(y_{1:N} \mid \mathbf{x}_{1:N}, \alpha)\, p(\alpha)}{p(y_{1:N} \mid \mathbf{x}_{1:N})} \tag{217}$$
and
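\begin{align*}
p(y_{1:N} \mid \mathbf{x}_{1:N}, \alpha) &= \int p(y_{1:N}, \mathbf{w} \mid \mathbf{x}_{1:N}, \alpha)\, d\mathbf{w} \\
&= \int p(y_{1:N} \mid \mathbf{x}_{1:N}, \mathbf{w}, \alpha)\, p(\mathbf{w} \mid \alpha)\, d\mathbf{w} \\
&= \int \Bigl(\prod_i p(y_i \mid \mathbf{x}_i, \mathbf{w}, \alpha)\Bigr)\, p(\mathbf{w} \mid \alpha)\, d\mathbf{w}
\end{align*}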
For RBF regression, this objective function can be computed in closed-form. However, depend-
ing on the form of the prior over the hyperparameters, it is often necessary to use some form of
numerical optimization, such as gradient descent.
Figure 15: Predictive distribution for an RBF model (with 9 basis functions), trained on noisy
sinusoidal data. The green curve is the true underlying sinusoidal function. The blue circles are
data points. The red curve is the mean prediction as a function of the input. The pink region
represents 1 standard deviation. Note how this region shrinks close to where more data points are
observed. (Figure from Pattern Recognition and Machine Learning by Chris Bishop.)
it can be expensive, and, more importantly, inaccurate if small amounts of data are available. In
general one intuition is that we want to choose simple models over complex models to avoid
overfitting, insofar as they provide equivalent fits to the data. Below we consider a Bayesian approach
to model selection which provides just such a bias to simple models.
The goal of model selection is to choose the best model from some set of candidate models
{Mi }, i = 1, ..., L, based on some observed data D. This may be done either with a maximum likelihood
approach (picking the model that assigns the largest likelihood to the data) or a MAP approach
(picking the model with the highest posterior probability). If we take a uniform prior over models
(i.e. p(Mi ) is a constant for all i = 1...L) then these approaches can be seen to be equivalent since:
\begin{align*}
p(M_i \mid D) &= \frac{p(D \mid M_i)\, p(M_i)}{p(D)} \\
&\propto p(D \mid M_i)
\end{align*}
In practice a uniform prior over models may not be appropriate, but the design of suitable priors
in these cases will depend significantly on one’s knowledge of the application domain. So here we
will assume a uniform prior over models and focus on p(D|Mi ).
In some sense, whenever we estimate a parameter in a model we are doing model selection
where the family of models is indexed by the different values of that parameter. However the term
“model selection” can also mean choosing the best model from some set of parametric models
that are parameterized differently. A good example of this would be choosing the number of basis
functions to use in an RBF regression model. Another simple example is choosing the polynomial
degree for polynomial regression.
The key quantity for Bayesian model selection is p(D|Mi ), often called the marginal data
likelihood. Given two models, M1 and M2 , we will choose the model M1 when p(D|M1 ) >
p(D|M2 ). To specify these quantities in more detail we need to take the model parameters into
account. Different models may have different numbers of parameters (e.g., polynomials of dif-
ferent degrees), or entirely different parameterizations (e.g., RBFs and neural networks). In what
follows, let wi be the vector of parameters for model Mi . In the case of regression, for example,
wi might comprise the regression weights and hyper-parameters like the weight on the regularizer.
The extent to which a model explains (or fits) the data depends on the choice of the right
parameters. Using the sum rule and the product rule, it follows that we can write the marginal data
likelihood as
$$p(D \mid M_i) = \int p(D, \mathbf{w}_i \mid M_i)\, d\mathbf{w}_i = \int p(D \mid \mathbf{w}_i, M_i)\, p(\mathbf{w}_i \mid M_i)\, d\mathbf{w}_i \tag{218}$$
This tells us that the ideal model is one that assigns high prior probability p(wi |Mi ) to every weight
vector that also yields a high value of the likelihood p(D|wi , Mi ) (i.e., to parameter vectors that
fit the data well). One can also recognize that the product of the data likelihood and the prior in
the integrand is proportional to the posterior over the parameters that we previously maximized to
find MAP estimates of the model parameters. (Footnote 8)
Footnote 8: This is the same quantity we compute when optimizing hyper-parameters (which is a type of model selection) and
Typically, a “complex model” that assigns a significant posterior probability mass to complex
data will be able to assign significantly less mass to simpler data than a simpler model would. This
is because the integral of the probability mass must sum to 1 and so a complex model will have
less mass to spend on simpler data. Also, since a complex model will require higher-dimensional
parameterizations, mass must be spread over a higher-dimensional space and hence more thinly.
This phenomenon is visualized in Figure 17.
As an aid to intuition to explain why this marginal data likelihood helps us choose good models,
we consider a simple approximation to the marginal data likelihood p(D|Mi ) (depicted in Figure
16 for a scalar parameter w). First, as is common in many problems of interest, we assume the posterior
distribution over the model parameters, $p(w_i|D, M_i) \propto p(D|w_i, M_i)\,p(w_i|M_i)$, to have a strong
peak at the MAP parameter estimate $w_i^{MAP}$. Accordingly we can approximate the integral in
Equation (218) as the height of the peak, i.e., $p(D|w_i^{MAP}, M_i)\,p(w_i^{MAP}|M_i)$, multiplied by its
width $\Delta w_i^{posterior}$:
$$\int p(D|w_i, M_i)\,p(w_i|M_i)\,dw_i \approx p(D|w_i^{MAP}, M_i)\,p(w_i^{MAP}|M_i)\,\Delta w_i^{posterior}$$
We then assume that the prior distribution over parameters $p(w_i|M_i)$ is a relatively broad uniform
with width $\Delta w_i^{prior}$, so $p(w_i|M_i) \approx 1/\Delta w_i^{prior}$. This yields a further approximation:
$$\int p(D|w_i, M_i)\,p(w_i|M_i)\,dw_i \approx \frac{p(D|w_i^{MAP}, M_i)\,\Delta w_i^{posterior}}{\Delta w_i^{prior}}$$
Taking the logarithm, this becomes
$$\ln p(D|w_i^{MAP}, M_i) + \ln \frac{\Delta w_i^{posterior}}{\Delta w_i^{prior}}$$
Intuitively, this approximation tells us that models with wider prior distributions on the param-
eters will tend to assign less likelihood to the data because the wider prior captures a larger variety
of data (so the density is spread thinner over the data-space). Similarly, models that have a very
narrow peak around their modes are generally less preferable because they assign a lower prob-
ability mass to the surrounding area (and so a slightly perturbed setting of the parameters would
provide a poor fit to the data, suggesting that over-fitting has occurred).
From another perspective, note that in most cases of interest we can assume that $\Delta w_i^{posterior} < \Delta w_i^{prior}$;
i.e., the posterior width will be less than the width of the prior. The log ratio is maximal
when the prior and posterior widths are equal. For example, a complex model with many parameters,
or a very broad prior over the parameters, will necessarily assign a small prior probability to any
single value (including those under the posterior peak). A simpler model will assign a higher prior
probability to the useful parameter values (i.e., those under the posterior peak). When the model is
too simple, however, the likelihood term in the integrand will be low for every parameter setting,
which lowers the marginal data likelihood. So, as models become more complex, the likelihood term
increases because the model fits the data better, but the log ratio $\ln(\Delta w_i^{posterior}/\Delta w_i^{prior})$
acts as a penalty on unnecessarily complex models.
Figure 16: A visualization of the width-based evidence approximation, showing the posterior width
$\Delta w^{posterior}$ around $w^{MAP}$ and the prior width $\Delta w^{prior}$ along the w axis. (Figure from Pattern
Recognition and Machine Learning by Chris Bishop.)
Footnote 8 (continued): also corresponds to the denominator “p(D)” in Bayes’ rule for finding the posterior probability of a particular setting
of the parameters wi . Note that above we generally wrote p(D) and not p(D|Mi ) because we were only considering
a single model, and so it was not necessary to condition on it.
By selecting a model that assigns the highest posterior probability to the data we are automat-
ically balancing model complexity with the ability of the model to capture the data. This can be
seen as the mathematical realization of Occam’s Razor.
Model averaging. To be fully Bayesian, arguably, we shouldn’t select a single “best” model but
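should instead combine estimates from all models according to their respective posterior probabilities:
$$p(y_{new}|D, x_{new}) = \sum_i p(y_{new}|M_i, D, x_{new})\,p(M_i|D) \tag{219}$$
(Plot for Figure 17: p(D) on the vertical axis against data sets D on the horizontal axis, with curves for M1, M2, and M3 and a marked data set D0.)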
Figure 17: The x-axis is data complexity from simplest to most complex, and models Mi are
indexed in order of increasing complexity. Note that in this example M2 is the best model choice
for data D0 since it simultaneously is complex enough to assign mass to D0 but not so complex
that it must spread its mass too thinly. (Figure from Pattern Recognition and Machine Learning by Chris
Bishop.)
2. Sampling from distributions for which a simple sampling algorithm is not available.
Recall that expectation of a function φ(x) of a continuous variable x with respect to a distribution
p(x) is defined as:
$$E_{p(x)}[\phi(x)] \equiv \int p(x)\,\phi(x)\,dx \tag{220}$$
Monte Carlo methods approximate this integral by drawing N samples from p(x)
xi ∼ p(x) (221)
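Furthermore, the variance of this estimate is inversely proportional to the number of samples:
$$\mathrm{var}_{p(x_{1:N})}\left[\frac{1}{N}\sum_i \phi(x_i)\right] = \frac{1}{N^2}\sum_i \mathrm{var}_{p(x_{1:N})}[\phi(x_i)] = \frac{1}{N^2}\,N\,\mathrm{var}_{p(x_i)}[\phi(x_i)] = \frac{1}{N}\,\mathrm{var}_{p(x)}[\phi(x)] \tag{224}$$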
Hence, the more samples we get, the better our estimate will be; in the limit, the estimator will
converge to the true value.
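As a rough illustration of the 1/N behaviour, the following sketch estimates E[x^2] under a standard normal (whose true value is 1) and compares the spread of the estimator at two sample sizes. The function names, the choice of φ(x) = x^2, and the sample sizes are assumptions made for this example only.

```python
import random
import statistics

def monte_carlo_expectation(phi, sampler, n, rng):
    """Estimate E_p[phi(x)] by averaging phi over n samples drawn from p."""
    return sum(phi(sampler(rng)) for _ in range(n)) / n

def square(x):
    return x * x

def standard_normal(rng):
    return rng.gauss(0.0, 1.0)

rng = random.Random(0)
# E[x^2] under a standard normal is exactly 1.
estimate = monte_carlo_expectation(square, standard_normal, 100_000, rng)

# Repeat the estimator many times at two sample sizes to compare spreads.
small = [monte_carlo_expectation(square, standard_normal, 100, rng) for _ in range(200)]
large = [monte_carlo_expectation(square, standard_normal, 1600, rng) for _ in range(200)]
var_small = statistics.variance(small)
var_large = statistics.variance(large)
```

With 16 times as many samples, the empirical variance of the estimator should be roughly 16 times smaller, up to sampling noise.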
Dealing with unnormalized distributions. We often wish to compute the expected value of a
distribution for which evaluating the normalization constant is difficult. For example, the posterior
distribution over parameters w given data D is:
$$p(w|D) = \frac{p(D|w)\,p(w)}{p(D)} \tag{225}$$
The posterior mean and covariance (w̄ = E[w] and E[(w − w̄)(w − w̄)T ]) can be useful to
understand this posterior, i.e., what we believe the parameter values are “on average,” and how
much uncertainty there is in the parameters. The numerator of p(w|D) is typically easy to compute,
but p(D) entails an integral which is often intractable, and thus must be handled numerically.
Most generally, we can write the problem as computing the expected value with respect to a
distribution p(x) defined as
$$p(x) \equiv \frac{1}{Z}P^*(x), \qquad Z = \int P^*(x)\,dx \tag{226}$$
Monte Carlo methods will allow us to handle distributions of this form.
In other words, we can compute the desired expectation by sampling values xi from q(x), and then
computing
$$E_{q}\left[\frac{p(x)}{q(x)}\,\phi(x)\right] \approx \frac{1}{N}\sum_i \frac{p(x_i)}{q(x_i)}\,\phi(x_i) \tag{232}$$
It often happens that p and/or q are known only up to multiplicative constants. That is,
$$p(x) \equiv \frac{1}{Z_p}P^*(x) \tag{233}$$
$$q(x) \equiv \frac{1}{Z_q}Q^*(x) \tag{234}$$
where P ∗ and Q∗ are easy to evaluate but the constants Zp and Zq are not.
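Then we have:
$$E_{p(x)}[\phi(x)] = \int \frac{\frac{1}{Z_p}P^*(x)}{\frac{1}{Z_q}Q^*(x)}\,\phi(x)\,q(x)\,dx = \frac{Z_q}{Z_p}\,E_{q(x)}\left[\frac{P^*(x)}{Q^*(x)}\,\phi(x)\right] \tag{235}$$
and so it remains to approximate $Z_q/Z_p$. If we substitute φ(x) = 1, the above formula states that
$$\frac{Z_q}{Z_p}\,E_{q(x)}\left[\frac{P^*(x)}{Q^*(x)}\right] = 1 \tag{236}$$
and so $Z_p/Z_q = E_{q(x)}\left[P^*(x)/Q^*(x)\right]$. Thus we have:
$$E_{p(x)}[\phi(x)] = \frac{E_{q(x)}\left[\frac{P^*(x)}{Q^*(x)}\,\phi(x)\right]}{E_{q(x)}\left[\frac{P^*(x)}{Q^*(x)}\right]} \tag{237}$$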
Figure 18: Importance sampling may be used to sample relatively complicated distributions like
this bimodal p(x) by instead sampling simpler distributions like this unimodal q(x). Note that in
this example, sampling from q(x) will produce many samples that will be given a very low weight
since q(x) has a lot of mass where p(x) is near zero (in the center of the plot). On the other hand,
q(x) has ample mass around the two modes of p(x) and so it is a relatively good choice. If q(x) had
very little mass around one of the modes of p(x), the estimate given by importance sampling would
have a very high variance (unless |φ(x)| was small enough there to compensate for the difference).
(Figure from Pattern Recognition and Machine Learning by Chris Bishop.)
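The ratio form in Equation (237) can be sketched numerically by replacing both expectations with sample averages over draws from q. In the sketch below the target P*, the proposal Q*, and all function names are illustrative assumptions: P* is an unnormalized Gaussian with mean 1, and the proposal is a wider zero-mean Gaussian that is easy to sample.

```python
import math
import random

def p_star(x):
    # Unnormalized target: a Gaussian with mean 1 and unit variance.
    return math.exp(-0.5 * (x - 1.0) ** 2)

def q_star(x):
    # Unnormalized proposal density: a zero-mean Gaussian with std 2.
    return math.exp(-0.5 * (x / 2.0) ** 2)

def importance_estimate(phi, n, rng):
    """Approximate E_p[phi] as a ratio of weighted sample averages under q."""
    xs = [rng.gauss(0.0, 2.0) for _ in range(n)]  # samples from q
    weights = [p_star(x) / q_star(x) for x in xs]
    numerator = sum(w * phi(x) for w, x in zip(weights, xs))
    return numerator / sum(weights)  # the 1/N factors cancel in the ratio

rng = random.Random(1)
mean_estimate = importance_estimate(lambda x: x, 50_000, rng)
```

Neither normalization constant is ever evaluated; only P*(x)/Q*(x) is needed, which is the point of the ratio form.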
Sample u ∼ Uniform[0, 1]
if u ≤ α then
xt+1 ← x′
else
xt+1 ← xt
end if
t←t+1
end loop
Amazingly, it can be shown that, if x1 is a sample from p(x), then every subsequent xt is also
a sample from p(x), if they are considered in isolation. The samples are correlated to each other
via the Markov Chain, but the marginal distribution of any individual sample is p(x).
So far we assumed that x1 is a sample from the target distribution, but, of course, obtaining
this first sample is itself difficult. Instead, we must perform a process called burn-in: we initialize
with any x1 , and then discard the first T samples obtained by the algorithm; if we pick a large
enough value of T , we are guaranteed that the remaining samples are valid samples from the target
distribution. However, there is no exact method for determining a sufficient T , and so heuristics
and/or experimentation must be used.
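A minimal sketch of the accept/reject loop with burn-in follows. It assumes a symmetric Gaussian random-walk proposal, for which the acceptance probability is taken to be α = min(1, P*(x′)/P*(x)); the target, step size, chain length, and burn-in length T are all hypothetical choices for illustration.

```python
import math
import random

def log_p_star(x):
    # Unnormalized log-density of a Gaussian with mean 3 and std 0.5.
    return -0.5 * ((x - 3.0) / 0.5) ** 2

def metropolis(n_steps, x_init, step, rng):
    x = x_init
    chain = []
    for _ in range(n_steps):
        proposal = x + rng.gauss(0.0, step)
        # Symmetric proposal, so alpha = min(1, P*(x') / P*(x)).
        alpha = math.exp(min(0.0, log_p_star(proposal) - log_p_star(x)))
        if rng.random() <= alpha:
            x = proposal  # accept the proposal
        chain.append(x)   # on rejection the previous state is repeated
    return chain

rng = random.Random(2)
chain = metropolis(20_000, x_init=-10.0, step=0.5, rng=rng)
burn_in = 2_000  # T: discard the early samples, which still depend on x_init
kept = chain[burn_in:]
posterior_mean = sum(kept) / len(kept)
```

The chain is deliberately started far from the target so that the early samples are visibly unrepresentative; the value of `burn_in` here is a guess, in line with the remark that no exact rule for T exists.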
Figure 19: MCMC applied to a 2D elliptical Gaussian with a proposal distribution consisting
of a circular Gaussian centered on the previous sample. Green lines indicate accepted proposals
while red lines indicate rejected ones. (Figure from Pattern Recognition and Machine Learning by Chris
Bishop.)
The matrix W can be viewed as containing a set of C basis vectors W = [w1 , ..., wC ]. If we
also assume Gaussian noise in the measurements, this model is the same as the linear regression
model studied earlier, but now the x’s are unknown in addition to the linear parameters.
13.2 Reconstruction
Suppose we have learned a PCA model, and are given a new ynew value; how do we estimate its
corresponding xnew ? This can be done by minimizing
||ynew − (Wxnew + b)||2 (244)
This is a linear least-squares problem, and can be solved with standard methods (in MATLAB,
implemented by the backslash operator). However W is orthonormal, and thus its transpose is the
pseudoinverse, so the solution is given simply by:
x∗new = WT (ynew − b) (245)
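\begin{align}
\mathrm{mean}(x) &\equiv \frac{1}{N}\sum_i x_i = \frac{1}{N}\sum_i W^T(y_i - b) \tag{246} \\
&= \frac{1}{N}W^T\left(\sum_i y_i - N b\right) \tag{247} \\
&= 0 \tag{248}
\end{align}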
Variance maximization. PCA can also be defined in the following way; in fact, this is the orig-
inal definition of PCA, and the one that is often meant when people discuss PCA. However, this
formulation is exactly equivalent to the one discussed above. In this formulation, we wish to find the first
principal component w1 to maximize the variance of the first coordinate of the data:
$$\mathrm{var}(x_1) = \frac{1}{N}\sum_i x_{1,i}^2 = \frac{1}{N}\sum_i \left(w_1^T(y_i - b)\right)^2 \tag{249}$$
such that $||w_1||^2 = 1$. Then, we wish to choose the second principal component to be a unit
vector and orthogonal to the first component, while maximizing the variance of x2 . The remaining
principal components are also defined in this recursive way, so that each component wi is a unit
vector, orthogonal to all previous basis vectors.
Uncorrelated coefficients. It is straightforward to show that the covariance matrix of the PCA
coefficients is just the upper left C × C submatrix of Λ (i.e., the diagonal matrix containing the
C leading eigenvalues of K):
\begin{align}
\mathrm{cov}(x) &\equiv \frac{1}{N}\sum_i \left(W^T(y_i - b)\right)\left(W^T(y_i - b)\right)^T \tag{250} \\
&= W^T\left(\frac{1}{N}\sum_i (y_i - b)(y_i - b)^T\right)W \tag{251} \\
&= W^T K W \tag{252} \\
&= W^T V \Lambda V^T W \tag{253} \\
&= \tilde{\Lambda} \tag{254}
\end{align}
where Λ̃ is the diagonal matrix containing the C leading eigenvalues in Λ. This simple derivation
also shows that the marginal variances of the PCA coefficients are given by the eigenvalues; i.e.,
var(xj ) = λj .
Out of Subspace Error. The total variance in the data is given by the sum of the eigenvalues
of the sample covariance matrix K. The variance captured by the PCA subspace representation is
the sum of the first C eigenvalues. The total amount of variance lost in the representation is given
by the sum of the remaining eigenvalues. In fact, one can show that the least-squares error in the
approximation to the original data provided by the optimal (ML) model parameters, W∗ , {x∗i },
and b∗ , is given by
$$\sum_i ||y_i - (W^* x_i^* + b^*)||^2 = \sum_{j=C+1}^{D} \lambda_j . \tag{255}$$
When learning a PCA model it is common to use the ratio of the total LS error and the total variance
in the training data (i.e., the sum of all eigenvalues). One needs to choose C to be large enough
that this ratio is small (often 0.1 or less).
13.4 Whitening
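Whitening is a preprocessing step that replaces the data with a representation that has zero mean and unit
covariance. Given measurements {yi }, we replace them with {zi } given by
$$z_i = \tilde{\Lambda}^{-\frac{1}{2}} W^T (y_i - b) = \tilde{\Lambda}^{-\frac{1}{2}} x_i \tag{256}$$
where Λ̃ is a diagonal matrix of the first C eigenvalues.
Then, the sample mean of the z’s is equal to 0:
$$\mathrm{mean}(z) = \mathrm{mean}(\tilde{\Lambda}^{-\frac{1}{2}} x_i) = \tilde{\Lambda}^{-\frac{1}{2}}\,\mathrm{mean}(x) = 0 \tag{257}$$
To derive the sample covariance, we will first compute the covariance of the untruncated values
$\tilde{z} \equiv \Lambda^{-\frac{1}{2}} V^T (y - b)$:
\begin{align}
\mathrm{cov}(\tilde{z}) &\equiv \frac{1}{N}\sum_i \Lambda^{-\frac{1}{2}} V^T (y_i - b)(y_i - b)^T V \Lambda^{-\frac{1}{2}} \tag{258} \\
&= \Lambda^{-\frac{1}{2}} V^T \left(\frac{1}{N}\sum_i (y_i - b)(y_i - b)^T\right) V \Lambda^{-\frac{1}{2}} \tag{259} \\
&= \Lambda^{-\frac{1}{2}} V^T K V \Lambda^{-\frac{1}{2}} \tag{260} \\
&= \Lambda^{-\frac{1}{2}} V^T V \Lambda V^T V \Lambda^{-\frac{1}{2}} \tag{261} \\
&= I \tag{262}
\end{align}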
Since z is just the first C elements of z̃, z also has sample covariance I.
13.5 Modeling
PCA is sometimes used to model data likelihood, e.g., we can use it as a form of a “prior”. For
example, suppose we have noisy measurements of some y values and wish to estimate their true
values. If we parameterize the unknown y values by their corresponding x values instead, then
we constrain the estimated values to lie in the low-dimensional subspace of the original data.
However, this approach implies a uniform prior over x values, which may be inadequate, while
being intolerant to deviations from the subspace. A better approach with an inherently probabilistic
model is described below.
Evaluating this integral will give us p(y), however, there is a simpler way to solve for the Gaussian
distribution.
Since we know that y is Gaussian, all we need to do is derive its mean and covariance, which
can be done as follows (using the fact that mathematical expectation is linear):
mean(y) = E[y] = E[Wx + b + n] (266)
= WE[x] + b + E[n] (267)
= b (268)
cov(y) = E[(y − b)(y − b)T ] (269)
= E[(Wx + b + n − b)(Wx + b + n − b)T ] (270)
= E[(Wx + n)(Wx + n)T ] (271)
= E[WxxT WT ] + E[WxnT ] + E[nxT WT ] + E[nnT ] (272)
= WE[xxT ]WT + WE[x]E[nT ] + E[n]E[xT ]WT + σ 2 I (273)
= WWT + σ 2 I (274)
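(Figure 20: three panels showing p(x) over x with a sample x̂, the conditional p(y|x̂) in the (y1, y2) plane, and the marginal p(y) in the (y1, y2) plane.)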
Hence
y ∼ N (b, WWT + σ 2 I) (275)
In other words, learning a PPCA model is equivalent to learning a particular form of a Gaussian
distribution. This is illustrated in Figure 20. The PPCA model is not as general as learning a full
Gaussian model with a D × D covariance matrix; however, it uses fewer numbers to represent the
Gaussian (CD + 1 versus D²/2 + D/2; why?). Because the representation is more compact, it can
be estimated from smaller datasets, and requires less memory to store the model.
These differences will be significant when D is large; e.g., if D = 100, the full covariance
matrix would require 5050 parameters and thus require hundreds of thousands of data points to
estimate reliably. However, if the effective dimensionality is, say, 2 or 3, then the PPCA representation
will only have a few hundred parameters and will require many fewer measurements.
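The counts quoted above can be checked with a few lines of arithmetic. This sketch counts only the numbers needed for the covariance (the mean b adds D more in both cases); the function names are chosen for illustration.

```python
def full_covariance_params(d):
    """Free parameters in a symmetric D x D covariance matrix: D^2/2 + D/2."""
    return d * (d + 1) // 2

def ppca_covariance_params(d, c):
    """Numbers used by PPCA for the covariance: the D x C matrix W plus sigma^2."""
    return c * d + 1

full = full_covariance_params(100)
counts = {c: ppca_covariance_params(100, c) for c in (2, 3)}
```

For D = 100 this gives 5050 for the full covariance, against 201 and 301 for C = 2 and C = 3.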
Learning. The PPCA model can be learned by Maximum Likelihood, i.e., by minimizing:
\begin{align}
L(W, b, \sigma^2) &= -\ln \prod_{i=1}^{N} G(y_i; b, WW^T + \sigma^2 I) \tag{276} \\
&= \frac{1}{2}\sum_i (y_i - b)^T (WW^T + \sigma^2 I)^{-1} (y_i - b) + \frac{N}{2}\ln (2\pi)^D |WW^T + \sigma^2 I| \tag{277}
\end{align}
This can be optimized in closed form. The solution is very similar to the conventional PCA
case:
1. Let $b = \frac{1}{N}\sum_i y_i$
2. Let $K = \frac{1}{N}\sum_i (y_i - b)(y_i - b)^T$
4. Assume that the eigenvalues are sorted from largest to smallest (λi ≥ λi+1 ). If this is not the
case, sort them (and their corresponding eigenvectors).
5. Let $\sigma^2 = \frac{1}{D-C}\sum_{j=C+1}^{D} \lambda_j$. In words, the estimated noise variance is equal to the average
marginal data variance over all directions that are orthogonal to the C principal directions
(i.e., this is the average variance (per dimension) of the data that is lost in the approximation
of the data in the C dimensional subspace).
6. Let Ṽ be the matrix comprising the first C eigenvectors: Ṽ = [V1 , ...VC ], and let Λ̃ be the
diagonal matrix with the C leading eigenvalues: Λ̃ = diag(λ1 , ...λC ).
7. $W = \tilde{V}(\tilde{\Lambda} - \sigma^2 I)^{\frac{1}{2}}$.
Note that this solution is similar to that in the conventional PCA case with whitening, except that
(a) the noise variance is estimated, and (b) the noise is removed from the variances of the remaining
eigenvalues.
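Steps 5 to 7 can be sketched as follows, assuming the eigenvalues and eigenvectors of K have already been computed and sorted by some other routine. The function name, the list-of-lists matrix layout, and the 3D example values are hypothetical.

```python
import math

def ppca_from_eigen(eigenvalues, eigenvectors, c):
    """eigenvalues: sorted descending; eigenvectors[j] is the j-th eigenvector (length D).

    Returns (W, sigma2) with W stored as D rows of C entries.
    """
    d = len(eigenvalues)
    # Step 5: average the discarded eigenvalues.
    sigma2 = sum(eigenvalues[c:]) / (d - c)
    # Step 7: scale each leading eigenvector by sqrt(lambda_j - sigma^2).
    scales = [math.sqrt(eigenvalues[j] - sigma2) for j in range(c)]
    w = [[eigenvectors[j][row] * scales[j] for j in range(c)] for row in range(d)]
    return w, sigma2

# Hypothetical 3D example with axis-aligned eigenvectors.
lams = [5.0, 1.5, 0.5]
vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
W, sigma2 = ppca_from_eigen(lams, vecs, c=1)
# Model covariance W W^T + sigma^2 I, to compare against the eigenvalues.
model_cov = [[sum(W[r][k] * W[s][k] for k in range(1)) + (sigma2 if r == s else 0.0)
              for s in range(3)] for r in range(3)]
```

In this example the model covariance keeps the leading variance of 5 exactly and replaces the two discarded variances (1.5 and 0.5) with their average of 1.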
instead of maximizing
\begin{align}
p(y_{1:N}, x_{1:N}|W, b, \sigma^2) &= \prod_i p(y_i, x_i|W, b, \sigma^2) \tag{281} \\
&= \prod_i p(y_i|x_i, W, b, \sigma^2)\,p(x_i) \tag{282}
\end{align}
By integrating out x, we are estimating fewer parameters and thus can get better estimates. Loosely
speaking, doing so might be viewed as being “more Bayesian.” Suppose we did instead try to
W ← 2W (285)
x ← x/2 (286)
By doing this replacement arbitrarily many times, we can get infinitesimal values for x. This
indicates that the objective function is degenerate; using it will lead to very poor results.
Note, however, that this arises using the same model as before, but without marginalizing out
x. This illustrates a general principle: the more parameters you estimate (instead of marginalizing
out), the greater the danger of biased and/or degenerate solutions.
14 Lagrange Multipliers
The Method of Lagrange Multipliers is a powerful technique for constrained optimization. While
it has applications far beyond machine learning (it was originally developed to solve physics equa-
tions), it is used for several key derivations in machine learning.
The problem set-up is as follows: we wish to find extrema (i.e., maxima or minima) of a
differentiable objective function
If we have no constraints on the problem, then the extrema are solutions to the following system
of equations:
∇E = 0 (288)
which is equivalent to writing $dE/dx_i = 0$ for all i. This equation says that there is no way to infinitesimally
perturb x to get a different value for E; the objective function is locally flat.
Now, however, our goal will be to find extrema subject to a single constraint:
g(x) = 0 (289)
In other words, we want to find the extrema among the set of points x that satisfy g(x) = 0.
It is sometimes possible to reparameterize the problem in order to eliminate the constraints
(i.e., so that the new parameterization includes all possible solutions to g(x) = 0), however, this
can be awkward in some cases, and impossible in others.
Given the constraint g(x) = 0, we are no longer looking for a point where no perturbation in
any direction changes E. Instead, we need to find a point at which perturbations that satisfy the
constraints do not change E. This can be expressed by the following condition:
∇E + λ∇g = 0 (290)
for some arbitrary scalar value λ. The expression ∇E = −λ∇g says that any perturbation to x that
changes E also makes the constraint become violated. Hence, perturbations that do not change g
do not change E either. Hence, our goal is to find a point x that satisfies this condition and also
g(x) = 0
In the Method of Lagrange Multipliers, we define a new objective function, called the La-
grangian:
L(x, λ) = E(x) + λg(x) (291)
Now we will instead find the extrema of L with respect to both x and λ. The key fact is that
extrema of the unconstrained objective L are the extrema of the original constrained prob-
lem. So we have eliminated the nasty constraints by changing the objective function and also
introducing new unknowns.
Figure 21: The set of solutions to g(x) = 0 visualized as a curve. The gradient ∇g is always normal
to the curve. At an extremal point, ∇E is parallel to ∇g. (Figure from Pattern Recognition and
Machine Learning by Chris Bishop.)
To see why, let’s look at the extrema of L. The extrema to L occur when
\begin{align}
\frac{dL}{d\lambda} &= g(x) = 0 \tag{292} \\
\frac{dL}{dx} &= \nabla E + \lambda \nabla g = 0 \tag{293}
\end{align}
which are exactly the conditions given above. Using the Lagrangian is just a convenient way of
combining these two constraints into one unconstrained optimization.
14.1 Examples
Minimizing on a circle. We begin with a simple geometric example. We have the following
constrained optimization problem:
$$\arg\min_{x,y}\; x + y \tag{294}$$
$$\text{subject to } x^2 + y^2 = 1 \tag{295}$$
In other words, we want to find the point on a circle that minimizes x + y; the problem is visualized
in Figure 22. Here, E(x, y) = x + y and g(x, y) = x² + y² − 1. The Lagrangian for this problem
is:
$$L(x, y, \lambda) = x + y + \lambda(x^2 + y^2 - 1) \tag{296}$$
Figure 22: Illustration of the maximization on a circle problem. (Image from Wikipedia.)
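As a numerical sanity check on this example, the sketch below searches the circle by brute force and then recovers the multiplier from the stationarity condition. It parameterizes the circle by an angle purely to enumerate feasible points; the grid size and variable names are assumptions of the example.

```python
import math

def objective(theta):
    # Points on the unit circle satisfy the constraint by construction.
    return math.cos(theta) + math.sin(theta)

n = 100_000
thetas = [2.0 * math.pi * k / n for k in range(n)]
best = min(thetas, key=objective)
x, y = math.cos(best), math.sin(best)
# At a constrained extremum grad E = -lambda * grad g, i.e. (1, 1) = -lambda * (2x, 2y).
lam = -1.0 / (2.0 * x)
# Residual of the second stationarity equation, 1 + 2 * lambda * y = 0.
residual_y = 1.0 + 2.0 * lam * y
```

The minimizer found this way has x and y approximately equal and negative, and the same λ satisfies both components of ∇E + λ∇g = 0 up to grid resolution.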
P (e = k) = pk (300)
For example, coin-flipping is a binomial distribution where K = 2 and e = 1 might indicate that
the coin lands heads.
Suppose we observe N events; the likelihood of the data is:
$$\prod_{i=1}^{N} P(e_i|p) = \prod_k p_k^{N_k} \tag{301}$$
where Nk is the number of times that e = k, i.e., the number of occurrences of the k-th event. To
estimate this distribution, we can minimize the negative log-likelihood:
$$\arg\min\; -\sum_k N_k \ln p_k \tag{302}$$
$$\text{subject to } \sum_k p_k = 1,\; p_k \ge 0, \text{ for all } k \tag{303}$$
The constraints are required in order to ensure that the p’s form a valid probability distribution.
One way to optimize this problem is to reparameterize: set $p_K = 1 - \sum_{k=1}^{K-1} p_k$, substitute in, and
then optimize the unconstrained problem in closed-form. While this method does work in this case,
it breaks the natural symmetry of the problem, resulting in some messy calculations. Moreover,
this method often cannot be generalized to other problems.
The Lagrangian for this problem is:
$$L(p, \lambda) = -\sum_k N_k \ln p_k + \lambda\left(\sum_k p_k - 1\right) \tag{304}$$
Here we omit the constraint that pk ≥ 0 and hope that this constraint will be satisfied by the
solution (it will). Setting the gradient to zero gives:
\begin{align}
\frac{dL}{dp_k} &= -\frac{N_k}{p_k} + \lambda = 0 \quad \text{for all } k \tag{305} \\
\frac{dL}{d\lambda} &= \sum_k p_k - 1 = 0 \tag{306}
\end{align}
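Equation (305) gives p_k = N_k/λ, and substituting this into Equation (306) forces λ = Σ_k N_k. The sketch below applies that solution to a hypothetical set of counts and compares the resulting negative log-likelihood with that of another valid distribution; the function names and numbers are illustrative only.

```python
import math

def multinomial_mle(counts):
    """Solve -N_k / p_k + lambda = 0 with sum_k p_k = 1: lambda = N, p_k = N_k / N."""
    lam = sum(counts)
    return [n_k / lam for n_k in counts], lam

def neg_log_likelihood(counts, probs):
    return -sum(n_k * math.log(p_k) for n_k, p_k in zip(counts, probs) if n_k > 0)

counts = [3, 5, 2]  # hypothetical observed counts N_k
p_hat, lam = multinomial_mle(counts)
nll_hat = neg_log_likelihood(counts, p_hat)
nll_other = neg_log_likelihood(counts, [0.4, 0.4, 0.2])
```

The estimated probabilities are simply the observed frequencies, and they are automatically non-negative, which is why the omitted inequality constraint causes no trouble here.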
Maximum variance PCA. In the original formulation of PCA, the goal is to find a low-dimensional
projection of N data points y
x = wT (y − b) (309)
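such that the variance of the $x_i$’s is maximized, subject to the constraint that $w^T w = 1$. The
Lagrangian is:
\begin{align}
L(w, b, \lambda) &= \frac{1}{N}\sum_i \left(x_i - \frac{1}{N}\sum_j x_j\right)^2 + \lambda(w^T w - 1) \tag{310} \\
&= \frac{1}{N}\sum_i \left(w^T(y_i - b) - \frac{1}{N}\sum_j w^T(y_j - b)\right)^2 + \lambda(w^T w - 1) \tag{311} \\
&= \frac{1}{N}\sum_i \left(w^T\left((y_i - b) - \frac{1}{N}\sum_j (y_j - b)\right)\right)^2 + \lambda(w^T w - 1) \tag{312} \\
&= \frac{1}{N}\sum_i \left(w^T(y_i - \bar{y})\right)^2 + \lambda(w^T w - 1) \tag{313} \\
&= \frac{1}{N}\sum_i w^T(y_i - \bar{y})(y_i - \bar{y})^T w + \lambda(w^T w - 1) \tag{314} \\
&= w^T\left(\frac{1}{N}\sum_i (y_i - \bar{y})(y_i - \bar{y})^T\right) w + \lambda(w^T w - 1) \tag{315}
\end{align}
where $\bar{y} = \sum_i y_i / N$. Solving dL/dw = 0 gives:
$$\left(\frac{1}{N}\sum_i (y_i - \bar{y})(y_i - \bar{y})^T\right) w = \lambda w \tag{316}$$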
This is just the eigenvector equation: in other words, w must be an eigenvector of the sample
covariance of the y’s, and λ must be the corresponding eigenvalue. In order to determine which
one, we can substitute this equality into the Lagrangian to get:
\begin{align}
L &= w^T \lambda w + \lambda(w^T w - 1) \tag{317} \\
&= \lambda \tag{318}
\end{align}
since $w^T w = 1$. Since our goal is to maximize the variance, we choose the eigenvector w which
has the largest eigenvalue λ.
We have not yet selected b, but it is clear that the value of the objective function does not
depend on b, so we might as well set it to be the mean of the data, $b = \sum_i y_i / N$, which results in
the x’s having zero mean: $\sum_i x_i / N = 0$.
The vector w is called the first principal component. The Lagrangian is:
$$L(w, x_{1:N}, b, \lambda) = \sum_i ||y_i - (w x_i + b)||^2 + \lambda(||w||^2 - 1) \tag{321}$$
There are several sets of unknowns, and we derive their optimal values each in turn.
xi = wT (yi − b) (323)
Basis vector (w). To make things simpler, we will define ỹi = (yi − b) as the mean-subtracted
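data points, and the reconstructions are then $x_i = w^T \tilde{y}_i$, and the objective function is:
\begin{align}
L &= \sum_i ||\tilde{y}_i - w x_i||^2 + \lambda(w^T w - 1) \tag{329} \\
&= \sum_i ||\tilde{y}_i - w w^T \tilde{y}_i||^2 + \lambda(w^T w - 1) \tag{330} \\
&= \sum_i (\tilde{y}_i - w w^T \tilde{y}_i)^T (\tilde{y}_i - w w^T \tilde{y}_i) + \lambda(w^T w - 1) \tag{331} \\
&= \sum_i \left(\tilde{y}_i^T \tilde{y}_i - 2\tilde{y}_i^T w w^T \tilde{y}_i + \tilde{y}_i^T w w^T w w^T \tilde{y}_i\right) + \lambda(w^T w - 1) \tag{332} \\
&= \sum_i \tilde{y}_i^T \tilde{y}_i - \sum_i (\tilde{y}_i^T w)^2 + \lambda(w^T w - 1) \tag{333}
\end{align}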
This is exactly the eigenvector equation, meaning that extrema for L occur when w is an eigenvector
of the matrix $\sum_i \tilde{y}_i \tilde{y}_i^T$, and λ is the corresponding eigenvalue. Multiplying both sides by 1/N ,
we see this matrix has the same eigenvectors as the data covariance:
$$\left(\frac{1}{N}\sum_i (y_i - b)(y_i - b)^T\right) w = \frac{\lambda}{N} w \tag{336}$$
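Now we must determine which eigenvector to use. We rewrite Equation 333 as:
\begin{align}
L &= \sum_i \tilde{y}_i^T \tilde{y}_i - \sum_i w^T \tilde{y}_i \tilde{y}_i^T w + \lambda(w^T w - 1) \tag{337} \\
&= \sum_i \tilde{y}_i^T \tilde{y}_i - w^T\left(\sum_i \tilde{y}_i \tilde{y}_i^T\right) w + \lambda(w^T w - 1) \tag{338}
\end{align}
and substitute in Equation 335:
\begin{align}
L &= \sum_i \tilde{y}_i^T \tilde{y}_i - \lambda w^T w + \lambda(w^T w - 1) \tag{340} \\
&= \sum_i \tilde{y}_i^T \tilde{y}_i - \lambda \tag{341}
\end{align}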
again using wT w = 1. We must pick the eigenvalue λ that gives the smallest value of L. Hence,
we pick the largest eigenvalue, and set w to be the corresponding eigenvector.
where we have introduced K Lagrange multipliers λk . The constraints can be combined into a
single Lagrangian:
$$L(x, \lambda_{1:K}) = E(x) + \sum_k \lambda_k g_k(x) \tag{345}$$
Figure 23: Illustration of the condition for inequality constraints: the solution may lie on the
boundary of the constraint region, or in the interior. (Figure from Pattern Recognition and Machine
Learning by Chris Bishop.)
15 Clustering
Clustering is an unsupervised learning problem in which our goal is to discover “clusters” in the
data. A cluster is a collection of data that are similar in some way.
Clustering is often used for several different problems. For example, a market researcher might
want to identify distinct groups of the population with similar preferences and desires. When
working with documents you might want to find clusters of documents based on the occurrence
frequency of certain words. For example, this might allow one to discover financial documents,
legal documents, or email from friends. Working with image collections you might find clusters
of images which are images of people versus images of buildings. Often when we are given large
amounts of complicated data we want to look for some underlying structure in the data, which
might reflect certain natural kinds within the training data. Clustering can also be used to compress
data, by replacing all of the elements in a cluster with a single representative element.
The clustering is mutually exclusive: each data vector i can be assigned to only one cluster, so
$\sum_{j=1}^{K} L_{i,j} = 1$. Along the way, we will also be estimating a center cj for each cluster.
The full objective function for K-means clustering is:
$$E(c, L) = \sum_{i,j} L_{i,j}\,||y_i - c_j||^2 \tag{351}$$
This objective function penalizes the distance between each data point and the center of the cluster
to which it is assigned. Hence, to minimize this error, we want to bring each cluster center close to
the data it has been assigned, and we also want to assign the data to nearby centers.
This objective function cannot be optimized in closed-form, and so an iterative method is re-
quired. It includes discrete variables (the labels L), and so gradient-based methods aren’t directly
applicable. Instead, we use a strategy called coordinate descent, in which we alternate between
closed-form optimization of one set of variables holding the other variables fixed. That is, we first
pick initial values, then we alternate between updating the labels for the current centers, and then
updating the centers for the current labels.
Each step of the optimization is guaranteed to lower (or leave unchanged) the objective function until the
algorithm converges (you should be able to show that each step is optimal). However, there is no guarantee
that the algorithm will find the global optimum, and indeed it may easily get trapped in a poor local
minimum.
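The alternation described above can be sketched in a few lines. This is a minimal illustration rather than a tuned implementation: the function names, the stopping rule (labels unchanged), the handling of empty clusters, and the synthetic two-cluster data are all assumptions.

```python
import random

def sq_dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def kmeans(points, centers, n_iters=100):
    """Alternate label and center updates until the labels stop changing."""
    labels = None
    for _ in range(n_iters):
        # Label update: assign each point to its nearest center.
        new_labels = [min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
                      for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        # Center update: each center becomes the mean of its assigned points.
        for j in range(len(centers)):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = [sum(c) / len(members) for c in zip(*members)]
    error = sum(sq_dist(p, centers[l]) for p, l in zip(points, labels))
    return centers, labels, error

rng = random.Random(3)
points = ([[rng.gauss(0, 0.5), rng.gauss(0, 0.5)] for _ in range(50)]
          + [[rng.gauss(10, 0.5), rng.gauss(10, 0.5)] for _ in range(50)])
# Initialize with two data points as centers (one from each half, for a reproducible demo).
centers, labels, error = kmeans(points, [list(points[0]), list(points[50])])
```

The returned `error` is the objective value, which can be used to compare runs started from different initializations.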
Initialization. The algorithm is sensitive to initialization, and poor initialization can sometimes
lead to very poor results. Here are a few strategies that can be used to initialize the algorithm:
1. Random labeling. Initialize the labeling L randomly, and then run the center-update step to
determine the initial centers. This approach is not recommended because the initial centers
will likely end up just being very close to the mean of the data.
2. Random initial centers. We could try to place initial center locations randomly, e.g., by
random sampling in the bounding box of the data. However, it is very likely that some of the
centers will fall into empty regions of the feature space, and will therefore be assigned no
data. Getting a good initialization this way can be difficult.
3. Random data points as centers. This method works much better: use a random subset of
the data as the initial center locations.
5. Multiple restarts. In multiple restarts, we run K-means multiple times, each time with a
different random initialization (using one of the above methods). We then take the best clus-
tering out of all of the runs, based on the value of the objective function above in Equation
(351).
Figure 24: K-means applied to a dataset sampled from three Gaussian distributions. The data
assigned to each cluster are drawn as circles with distinct colours. The cluster centers are shown
as red stars.
Another key question is how one chooses the number of clusters, i.e., K. A common approach
is to fix K based on some prior knowledge or computational constraints. One can also try different
values of K, adding another term to the objective function to penalize model complexity.
This algorithm provides a quality guarantee: it gives a clustering that is no worse than twice
the error of the optimal clustering.
K-medoids clustering can also be improved by coordinate descent. The labeling step is the
same as in K-means. However, the cluster updates must be done by brute-force search for each
candidate cluster center update.
P (L = j|θ) = aj (352)
p(y|θ, L = j) = G(y; µj , Kj ) (353)
To sample a single data point from this (generative) model, we first randomly select a Gaussian
component according to their prior probabilities {aj }, and then we randomly sample from the
corresponding Gaussian component. The likelihood of a single data point can be derived by the
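product rule and the sum rule as follows:
\begin{align}
p(y|\theta) &= \sum_{j=1}^{K} p(y, L = j|\theta) \tag{354} \\
&= \sum_{j=1}^{K} p(y|L = j, \theta)\,P(L = j|\theta) \tag{355} \\
&= \sum_{j=1}^{K} a_j \frac{1}{\sqrt{(2\pi)^D |K_j|}}\, e^{-\frac{1}{2}(y - \mu_j)^T K_j^{-1} (y - \mu_j)} \tag{356}
\end{align}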
where D is the dimension of data vectors. This model can be interpreted as a linear combination
(or blend) of Gaussians: we get a multimodal distribution by adding together unimodal Gaussians.
Interestingly, the MoG model is similar to the Gaussian Class-Conditional model that we used for
classification; the difference is that the class labels will no longer be included in the training set.
The approach of building models by mixtures is quite general and can be used for
many other types of distributions as well; for example, we could build a mixture of Student-t
distributions, or a mixture of a Gaussian and a uniform, and so on.
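The two-stage sampling procedure and the mixture density can be sketched for scalar data (D = 1), where each covariance K_j reduces to a variance. The weights, means, and standard deviations below are hypothetical values chosen for the example.

```python
import math
import random

def sample_mog(weights, means, stds, rng):
    """Pick a component j with probability a_j, then sample from that Gaussian."""
    j = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    return rng.gauss(means[j], stds[j]), j

def mog_density(y, weights, means, stds):
    """Evaluate p(y) = sum_j a_j G(y; mu_j, sigma_j^2) for scalar y."""
    return sum(a * math.exp(-0.5 * ((y - m) / s) ** 2) / (s * math.sqrt(2.0 * math.pi))
               for a, m, s in zip(weights, means, stds))

rng = random.Random(4)
a, mu, sd = [0.3, 0.7], [-2.0, 3.0], [0.5, 1.0]
draws = [sample_mog(a, mu, sd, rng) for _ in range(20_000)]
frac_first = sum(1 for _, j in draws if j == 0) / len(draws)
density_at_mode = mog_density(3.0, a, mu, sd)
```

The fraction of draws from the first component should be close to its prior a_1, and the density is bimodal because the two means are well separated.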
Figure 25: Mixture of Gaussians model applied to a dataset generated from three Gaussians. The
resulting γ is visualized on the right. The data points are shown as colored circles. The color
is determined by the cluster with the highest posterior assignment probability γij . One standard
deviation ellipses are shown for each Gaussian. Note that the blue points are well isolated and there
is little ambiguity in their assignments. The other two distributions overlap, and one can see how
the orientation and eccentricity of the covariance structure (the ellipses) influence the assignment
probabilities.
15.3.1 Learning
Given a data set y1:N , where each data point is assumed to be drawn independently from the model,
we learn the model parameters, θ, by minimizing the negative log-likelihood of the data:
\begin{align}
L(\theta) &= -\ln p(y_{1:N}|\theta) \tag{357} \\
&= -\sum_i \ln p(y_i|\theta) \tag{358}
\end{align}
Note that this is a constrained optimization, since we require $a_j \ge 0$ and $\sum_j a_j = 1$. Furthermore,
Kj must be a symmetric, positive-definite matrix to be a covariance matrix. Unfortunately, this
optimization cannot be performed in closed-form.
One approach is to perform the optimization by gradient descent. There are a few
issues associated with doing so. First, some care is required to avoid numerical issues, as discussed
below. Second, this learning is a constrained optimization, due to constraints on the values of the
a’s. One solution is to project onto the constraints during optimization: at each gradient descent
step (and inside the line search loop), we clamp all negative a values to zero and renormalize the
a’s so that they sum to one. Another option is to reparameterize the problem to be unconstrained.
Specifically, we define new variables βj , and define the a’s as functions of the βs, e.g.,
    a_j(\beta) = \frac{e^{\beta_j}}{\sum_{k=1}^{K} e^{\beta_k}}    (359)
This definition ensures that, for any choice of the βs, the a’s will satisfy the constraints. We
substitute this expression into the model definition and then optimize for the βs instead of the a’s with
gradient descent. Similarly, we can enforce the constraints on the covariance matrix by
reparameterization; this is normally done using an upper-triangular matrix U such that K = U^T U.
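The two reparameterizations can be sketched as follows. This is a minimal illustration, not the course's implementation; the function names `mixing_weights` and `upper_triangular_to_covariance` are made up for this example.

```python
import math

def mixing_weights(beta):
    """Map unconstrained beta_j to a_j = exp(beta_j) / sum_k exp(beta_k)."""
    exps = [math.exp(b) for b in beta]
    total = sum(exps)
    return [e / total for e in exps]

def upper_triangular_to_covariance(U):
    """Return K = U^T U for a square upper-triangular U given as a list of rows."""
    n = len(U)
    return [[sum(U[k][r] * U[k][c] for k in range(n)) for c in range(n)]
            for r in range(n)]

# Any real-valued betas give non-negative weights that sum to one.
a = mixing_weights([0.3, -1.2, 2.0])

# Any upper-triangular U with non-zero diagonal gives a symmetric,
# positive-definite K.
K = upper_triangular_to_covariance([[2.0, 0.5],
                                    [0.0, 1.0]])
```

Gradient descent would then update the β values and the entries of U freely, with no projection step.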
An alternative to gradient descent is the Expectation-Maximization algorithm, or EM. EM
is a quite general algorithm for “hidden variable” problems; in this case, the labels L are “hid-
den” (or “unobserved”). In EM, we define a probabilistic labeling variable γi,j . The variable
γi,j corresponds to the probability that data point i came from cluster j: γi,j is meant to estimate
P (L = j|yi ). In EM, we optimize both θ and γ together. The algorithm alternates between the
“E-step” which updates the γs, and the “M-step” which updates the model parameters θ.
Note that the E-step is the same as classification in the Gaussian Class-Conditional model.
The EM algorithm is a local optimization algorithm, and so the results will depend on initial-
ization. Initialization strategies similar to those used for K-means above can be used.
1. Many computations can be performed directly in the log domain. For example, it may be
large values of β could lead to the above expression being zero for all j, even though the
expression must sum to one. This may arise, for example, when computing the γ updates,
which have the above form. The solution is to make use of the identity:
    \frac{e^{-\beta_j}}{\sum_k e^{-\beta_k}} = \frac{e^{-\beta_j + C}}{\sum_k e^{-\beta_k + C}}    (363)
for any value of C. We can choose C to prevent underflow; a suitable choice is C = min_j βj .
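A small sketch of this shift, assuming the β values are held in a plain Python list (the name `normalized_exp_neg` is illustrative):

```python
import math

def normalized_exp_neg(beta):
    """Return e^{-beta_j} / sum_k e^{-beta_k}, using the shift C = min_j beta_j."""
    C = min(beta)
    shifted = [math.exp(-(b - C)) for b in beta]  # the largest term is exp(0) = 1
    total = sum(shifted)
    return [s / total for s in shifted]

beta = [1000.0, 1001.0, 1003.0]
naive = [math.exp(-b) for b in beta]   # every entry underflows to 0.0
stable = normalized_exp_neg(beta)      # well-defined and sums to one
```

With the shift, at least one term in the denominator equals one, so the sum can never underflow to zero.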
3. Underflow can also occur when evaluating
    \ln \sum_j e^{-\beta_j}    (364)
The EM algorithm is a coordinate descent algorithm for optimizing the free energy, subject
to the constraint that Σj γi,j = 1 and the constraints on a. In other words, EM can be written
compactly as:
pick initial values for γ and θ
loop
    E-step:
        γ ← arg min_γ F(θ, γ)
    M-step:
        θ ← arg min_θ F(θ, γ)
end loop
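The loop above can be sketched for a one-dimensional mixture of two Gaussians. The update formulas used here are the standard responsibility-weighted ones and are an assumption of this example, as are the toy data and the initial values.

```python
import math

def gauss(y, mu, var):
    """Density of a 1D Gaussian with mean mu and variance var."""
    return math.exp(-0.5 * (y - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

def neg_log_likelihood(ys, a, mu, var):
    return -sum(math.log(sum(aj * gauss(y, mj, vj)
                             for aj, mj, vj in zip(a, mu, var))) for y in ys)

def e_step(ys, a, mu, var):
    """gamma[i][j]: posterior probability that point i came from component j."""
    gamma = []
    for y in ys:
        joint = [aj * gauss(y, mj, vj) for aj, mj, vj in zip(a, mu, var)]
        total = sum(joint)
        gamma.append([p / total for p in joint])
    return gamma

def m_step(ys, gamma):
    """Responsibility-weighted estimates of mixing weights, means, variances."""
    a, mu, var = [], [], []
    for j in range(len(gamma[0])):
        nj = sum(g[j] for g in gamma)
        mj = sum(g[j] * y for g, y in zip(gamma, ys)) / nj
        vj = sum(g[j] * (y - mj) ** 2 for g, y in zip(gamma, ys)) / nj
        a.append(nj / len(ys))
        mu.append(mj)
        var.append(vj)
    return a, mu, var

ys = [-2.1, -1.9, -2.0, -1.7, 3.0, 3.2, 2.8, 3.1]   # toy data, two clusters
a, mu, var = [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0]     # arbitrary initialization
history = [neg_log_likelihood(ys, a, mu, var)]
for _ in range(20):
    gamma = e_step(ys, a, mu, var)
    a, mu, var = m_step(ys, gamma)
    history.append(neg_log_likelihood(ys, a, mu, var))
```

Recording the negative log-likelihood after every iteration makes it easy to confirm that it never increases.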
However, the free energy is different from the negative log-likelihood L(θ) that we initially set
out to minimize. Fortunately, the free energy has the following important properties:
• When the value of γ is optimal, the Free Energy is equal to the negative log-likelihood:
      L(\theta) = \min_\gamma F(\theta, \gamma)
We can use this fact to evaluate the negative log-likelihood simply by running an E-step and
then computing the free energy. In fact, this is often more efficient than directly computing
the negative log-likelihood. The proof is given in the next section.
• The minima of the free energy are also minima of the negative log-likelihood:
      \min_\theta L(\theta) = \min_{\theta, \gamma} F(\theta, \gamma)
This follows from the previous property. Hence, optimizing the free energy is the same as
optimizing the negative log-likelihood.
• The Free Energy is an upper-bound on the negative log-likelihood:
      F(\theta, \gamma) \ge L(\theta)
for all values of γ. This observation gives a sanity check for debugging the free energy
computation.
The Free Energy also provides a very helpful tool for debugging: any step of an implementation
that increases the free energy must be incorrect. The term Free Energy arises from its original
definition in statistical physics.
15.3.4 Proofs
The content of this section is not required material for this course and you may skip it. Here we
outline proofs for the key features of the free energy.
EM updates. The steps of the EM algorithm may be derived by solving arg minγ F(θ, γ) and
arg minθ F(θ, γ). In most cases, the derivations generalize familiar ones, e.g., weighted least-
squares. The a and γ parameters are multinomial distributions, and optimization of them requires
Lagrange multipliers or reparameterization. One may ignore the positivity constraint, as it turns
out to be automatically satisfied. The details will be skipped here.
Equality after the E-step. The E-step computes the optimal value for γ:
Hence,
    L(\theta) = \min_\gamma F(\theta, \gamma)    (380)
Bound. An important building block in proving that F(θ, γ) ≥ L(θ) is Jensen’s Inequality,
which applies since ln is a “concave” function and Σj bj = 1, bj ≥ 0.
    \ln \sum_j b_j x_j \ge \sum_j b_j \ln x_j , or    (381)
    -\ln \sum_j b_j x_j \le -\sum_j b_j \ln x_j    (382)
We can then derive the bound as follows, dropping the dependence on θ for brevity:
    L(\theta) = -\sum_i \ln \sum_j p(y_i, L = j)    (383)
              = -\sum_i \ln \sum_j \gamma_{i,j} \frac{p(y_i, L = j)}{\gamma_{i,j}}    (384)
              \le -\sum_{i,j} \gamma_{i,j} \ln \frac{p(y_i, L = j)}{\gamma_{i,j}}    (385)
              = F(\theta, \gamma)    (386)
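The bound and the equality after the E-step can be checked numerically for a single data point. The joint values p(y, L = j) below are made-up numbers chosen only for illustration.

```python
import math

def neg_log_lik(joint):
    """L = -ln sum_j p(y, L = j) for one data point."""
    return -math.log(sum(joint))

def free_energy(joint, gamma):
    """F = -sum_j gamma_j ln(p(y, L = j) / gamma_j) for one data point."""
    return -sum(g * math.log(p / g) for p, g in zip(joint, gamma) if g > 0)

joint = [0.02, 0.10, 0.005]                        # hypothetical p(y, L = j)
L = neg_log_lik(joint)
posterior = [p / sum(joint) for p in joint]        # the E-step choice of gamma
F_opt = free_energy(joint, posterior)              # equals L
F_other = free_energy(joint, [0.6, 0.2, 0.2])      # any other gamma: F >= L
```

A check of this kind is a cheap way to debug a free energy implementation.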
15.3.6 Degeneracy
There is a degeneracy in the MoG objective function. Suppose we center one Gaussian at one of
the data points, so that cj = yi . The error for this data point will be zero, and by reducing the
variance of this Gaussian, we can always increase the likelihood of the data. In the limit as this
Gaussian’s variance goes to zero, the data likelihood goes to infinity. Hence, some effort may be
required to avoid this situation. This degeneracy can also be avoided by using a more Bayesian
form of the algorithm, e.g., marginalizing out the cluster centers rather than estimating them.
where θ are the model parameters. This evaluation is somewhat mathematically involved. A very
coarse approximation to this computation is the Bayesian Information Criterion (BIC).
where st is the state of the system at time t. Intuitively, this property says the probability of a state
at time t is completely determined by the system state at the previous time step. More generally,
for any set A of indices less than t and set of indices B greater than t we have:
Another useful identity which also follows directly from the Markov property is:
Discrete Markov Models. Discrete Markov models are an important example of Markov chains.
Each state st can take on one of a discrete set of states, and the probability of transitioning from
one state to another is governed by a probability table for the whole sequence of states. More
concretely, st ∈ {1, ..., K} for some finite K and, for all times t, P (st = j|st−1 = i) = Aij
where A is a parameter of the model that is a fixed matrix of valid probabilities (so that Aij ≥ 0 and
Σ_{j=1}^{K} Aij = 1). To fully characterize the model, we also require a distribution over states for the
first time-step: P (s1 = i) = ai .
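As an illustration, a discrete Markov model can be stored as an initial distribution `a` and a row-stochastic table `A`. The sketch below uses 0-based state indices and made-up probabilities; the function names are hypothetical.

```python
import random

def sequence_probability(states, a, A):
    """P(s_1:T) = a[s_1] times the product of A[s_{t-1}][s_t] over t = 2..T."""
    p = a[states[0]]
    for prev, cur in zip(states, states[1:]):
        p *= A[prev][cur]
    return p

def sample_chain(T, a, A, rng):
    """Draw s_1 from a, then each s_t from row s_{t-1} of A."""
    states = rng.choices(range(len(a)), weights=a)
    for _ in range(T - 1):
        states.append(rng.choices(range(len(a)), weights=A[states[-1]])[0])
    return states

a = [0.6, 0.4]                     # hypothetical P(s_1 = i)
A = [[0.7, 0.3],                   # hypothetical A[i][j] = P(s_t = j | s_{t-1} = i)
     [0.2, 0.8]]
p = sequence_probability([0, 0, 1], a, A)       # 0.6 * 0.7 * 0.3
chain = sample_chain(10, a, A, random.Random(0))
```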
where
    P(s_{1:T}) = P(s_1) \prod_{t=2}^{T} P(s_t \mid s_{t-1})    (392)
and
    P(y_{1:T} \mid s_{1:T}) = \prod_{t=1}^{T} p(y_t \mid s_t)    (393)
Figure 27: Illustration of the variables in an HMM, and their conditional dependencies. (Figure
from Pattern Recognition and Machine Learning by Chris Bishop.)
Figure 28: The hidden states of an HMM correspond to a state machine. (Figure from Pattern
Recognition and Machine Learning by Chris Bishop.)
Figure 29: Illustration of sampling a sequence of datapoints from an HMM. (Figure from Pattern
Recognition and Machine Learning by Chris Bishop.)
and Σ1:K . As a short-hand, we will denote these parameters by a variable θ = {a, A, µ1:K , Σ1:K }.
Note that, if Aij = aj for all i, then this model is equivalent to a Mixture-of-Gaussians model
with mixing proportions given by the aj ’s, since the distribution over states at any instant does not
depend on the previous state.
In the remainder of this chapter, we will discuss algorithms for computing with HMMs.
The naive approach is to simply enumerate every possible state sequence and choose the one that
maximizes the above conditional probability. Since there are K^T possible state-sequences, this
approach is clearly infeasible for sequences of more than a few steps.
Fortunately, we can take advantage of the Markov property to perform this computation much
more efficiently. The Viterbi algorithm is a dynamic programming approach to finding the most
likely sequence of states s1:T given θ and a sequence of observations y1:T .
We begin by defining the following quantity for each state and each time-step:
(Henceforth, we omit θ from these equations for brevity.) This quantity tells us the likelihood that
the most-likely sequence up to time t ends at state i, given the data up to time t. We will compute
this quantity recursively. The base case is simply:
Once we have computed δ for all time-steps and all states, we can determine the final state of
the most-likely sequence as:
since p(y1:T ) does not depend on the state sequence. We can then backtrack through δ to determine
the states of each previous time-step, by finding which state j was used to compute each maximum
in the recursive step above. These states would normally be stored during the recursive process so
that they can be looked-up later.
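A compact sketch of the procedure, working in the log domain to avoid underflow. The discrete emission table `B` and all probabilities are assumptions made for this example, and a brute-force search over all K^T sequences is included only as a check.

```python
import itertools
import math

def viterbi(a, A, lik):
    """lik[t][i] = p(y_t | s_t = i). Returns the most likely state sequence."""
    K = len(a)
    delta = [[math.log(a[i]) + math.log(lik[0][i]) for i in range(K)]]
    back = []                                   # back[t-1][i]: best predecessor of i
    for t in range(1, len(lik)):
        row, ptr = [], []
        for i in range(K):
            j = max(range(K), key=lambda j: delta[-1][j] + math.log(A[j][i]))
            row.append(delta[-1][j] + math.log(A[j][i]) + math.log(lik[t][i]))
            ptr.append(j)
        delta.append(row)
        back.append(ptr)
    state = max(range(K), key=lambda i: delta[-1][i])
    path = [state]
    for ptr in reversed(back):                  # backtrack through stored choices
        state = ptr[state]
        path.append(state)
    return path[::-1]

def brute_force(a, A, lik):
    """Enumerate all K^T sequences; feasible only for tiny problems."""
    def joint(s):
        p = a[s[0]] * lik[0][s[0]]
        for t in range(1, len(lik)):
            p *= A[s[t - 1]][s[t]] * lik[t][s[t]]
        return p
    return list(max(itertools.product(range(len(a)), repeat=len(lik)), key=joint))

a = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]                    # B[i][y] = p(y | s = i), hypothetical
obs = [0, 0, 1, 1]
lik = [[B[i][y] for i in range(2)] for y in obs]
path = viterbi(a, A, lik)
```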
Figure 30: Illustration of the αt (i) values computed during the Forward Recursion. (Figure from
Pattern Recognition and Machine Learning by Chris Bishop.)
Note that this is identical to the Viterbi algorithm, except that maximization over j has been re-
placed by summation.
In the Backward Recursion we compute:
Figure 31: Illustration of the steps of the Forward Recursion and the Backward Recursion (Figure
from Pattern Recognition and Machine Learning by Chris Bishop.)
The probability that the hidden sequence had state i at time t is:
The normalizing constant — which is also the likelihood of the entire sequence, p(y1:T ) — can be
computed by the following formula:
    p(y_{1:T}) = \sum_i p(s_t = i, y_{1:T})    (413)
               = \sum_i \alpha_t(i) \beta_t(i)    (414)
The result of this summation will be the same regardless of which time-step t we choose to do the
summation over.
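The following sketch runs unscaled forward and backward recursions on a short sequence and evaluates the sum at every time-step. It assumes the usual definitions α_t(i) = p(y_{1:t}, s_t = i) and β_t(i) = p(y_{t+1:T} | s_t = i); the model parameters are made up.

```python
def forward(a, A, lik):
    """alpha[t][i], with lik[t][i] = p(y_t | s_t = i)."""
    K = len(a)
    alpha = [[a[i] * lik[0][i] for i in range(K)]]
    for t in range(1, len(lik)):
        prev = alpha[-1]
        alpha.append([lik[t][i] * sum(A[j][i] * prev[j] for j in range(K))
                      for i in range(K)])
    return alpha

def backward(A, lik):
    """beta[t][i], with beta[T-1][i] = 1 in 0-based time indexing."""
    K = len(A)
    beta = [[1.0] * K]
    for t in range(len(lik) - 2, -1, -1):
        nxt = beta[0]
        beta.insert(0, [sum(A[i][j] * lik[t + 1][j] * nxt[j] for j in range(K))
                        for i in range(K)])
    return beta

a = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]                    # hypothetical emission table
obs = [0, 0, 1, 1, 0]
lik = [[B[i][y] for i in range(2)] for y in obs]

alpha, beta = forward(a, A, lik), backward(A, lik)
# One estimate of p(y_1:T) per time-step; they should all agree.
likelihoods = [sum(al * be for al, be in zip(alpha[t], beta[t]))
               for t in range(len(obs))]
# Posterior state probabilities gamma_t(i).
gamma = [[alpha[t][i] * beta[t][i] / likelihoods[t] for i in range(2)]
         for t in range(len(obs))]
```

The unscaled form is only safe for short sequences; longer ones need the scaling described later in these notes.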
The probability that the hidden sequence transitioned from state i at time t to state j at time
t + 1 is:
Note that the denominator gives an expression for p(y1:T ), which can be computed for any value
of t.
As before, even evaluating this objective function requires K^T steps, and methods like gradient
descent will be impractical. Instead, we can use the EM algorithm. Note that, since an HMM
is a generalization of a Mixture-of-Gaussians, EM for HMMs will be a generalization of EM for
MoGs. The EM algorithm applied to HMMs is also known as the Baum-Welch Algorithm.
The algorithm alternates between the following two steps:
    a_i = \gamma_1(i)    (419)
    \mu_i = \frac{\sum_t \gamma_t(i) y_t}{\sum_t \gamma_t(i)}    (420)
    \Sigma_i = \frac{\sum_t \gamma_t(i) (y_t - \mu_i)(y_t - \mu_i)^T}{\sum_t \gamma_t(i)}    (421)
    A_{ij} = \frac{\sum_t \xi_t(i, j)}{\sum_k \sum_t \xi_t(i, k)}    (422)
towards 0. Thus the limit of machine precision will quickly be reached and the computed values
will underflow (evaluate to zero).
The solution is to compute normalized terms in the Forward-Backward recursions:
    \hat{\alpha}_t(i) = \frac{\alpha_t(i)}{\prod_{m=1}^{t} c_m}    (423)
    \hat{\beta}_t(i) = \frac{\beta_t(i)}{\prod_{m=t+1}^{T} c_m}    (424)
Specifically, we use ct = p(yt |y1:t−1 ). It can then be seen that, if we use α̂ and β̂ in the M-step
instead of α and β, the ct terms will cancel out (you can see this by substituting the formulas for
γ and ξ into the M-step). We then must choose ct to keep the scaling of α̂ and β̂ within machine
precision.
In the base case for the forward recursion, we set:
    c_1 = \sum_{i=1}^{K} p(y_1 \mid s_1 = i) a_i    (425)
    \hat{\alpha}_1(i) = \frac{p(y_1 \mid s_1 = i) a_i}{c_1}    (426)
(This may be implemented by first computing the numerator of α̂, and then summing it to get c1 ).
The recursion for computing α̂ is
    c_t = \sum_i p(y_t \mid s_t = i) \sum_{j=1}^{K} A_{ji} \hat{\alpha}_{t-1}(j)    (427)
    \hat{\alpha}_t(i) = \frac{p(y_t \mid s_t = i) \sum_{j=1}^{K} A_{ji} \hat{\alpha}_{t-1}(j)}{c_t}    (428)
In the backward step, the base case is:
β̂T (i) = 1 (429)
and the recursive case is
    \hat{\beta}_t(i) = \frac{\sum_{j=1}^{K} A_{ij} p(y_{t+1} \mid s_{t+1} = j) \hat{\beta}_{t+1}(j)}{c_{t+1}}    (430)
using the same ct values computed in the forward recursion.
The γ and ξ variables can then be computed as
    \gamma_t(i) = \hat{\alpha}_t(i) \hat{\beta}_t(i)    (431)
    \xi_t(i, j) = \frac{\hat{\alpha}_t(i) p(y_{t+1} \mid s_{t+1} = j) A_{ij} \hat{\beta}_{t+1}(j)}{c_{t+1}}    (432)
It can be shown that ct = p(yt |y1:t−1 ). Hence, once the recursion is complete, we can compute
the data likelihood as
    p(y_{1:T}) = \prod_t c_t    (433)
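A sketch of the scaled forward recursion follows. Because the product of the ct values itself underflows for long sequences, this example accumulates the sum of ln ct instead; that choice, the emission table, and the observation sequence are assumptions of the example.

```python
import math

def scaled_forward(a, A, lik):
    """Return (alpha_hat, c), with lik[t][i] = p(y_t | s_t = i)."""
    K = len(a)
    unnorm = [lik[0][i] * a[i] for i in range(K)]       # numerator of alpha_hat_1
    c = [sum(unnorm)]
    alpha_hat = [[u / c[0] for u in unnorm]]
    for t in range(1, len(lik)):
        prev = alpha_hat[-1]
        unnorm = [lik[t][i] * sum(A[j][i] * prev[j] for j in range(K))
                  for i in range(K)]
        c.append(sum(unnorm))                           # c_t = p(y_t | y_1:t-1)
        alpha_hat.append([u / c[-1] for u in unnorm])
    return alpha_hat, c

a = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]                  # hypothetical emission table
obs = [0, 1] * 5000                           # a long sequence, T = 10000
lik = [[B[i][y] for i in range(2)] for y in obs]

alpha_hat, c = scaled_forward(a, A, lik)
log_likelihood = sum(math.log(ct) for ct in c)   # ln p(y_1:T), finite

naive_product = 1.0
for ct in c:
    naive_product *= ct                          # underflows to 0.0
```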
    F(\theta, \xi) = -\sum_i \gamma_1(i) \ln a_i - \sum_{i,j} \sum_{t=1}^{T-1} \xi_t(i, j) \ln A_{ij} - \sum_i \sum_{t=1}^{T} \gamma_t(i) \ln p(y_t \mid s_t = i)
                     + \sum_{i,j} \sum_{t=1}^{T-1} \xi_t(i, j) \ln \xi_t(i, j) - \sum_i \sum_{t=2}^{T-2} \gamma_t(i) \ln \gamma_t(i)    (435)
Warning! Since we weren’t able to find any formula for the free energy, we derived it from
scratch (see below). In our tests, it didn’t precisely match the negative log-likelihood. So there
might be a mistake here, although the free energy did decrease as expected.
Derivation. This material is very advanced and not required for the course. It is mainly here
because we couldn’t find it elsewhere.
As a short-hand, we define s = s1:T to be a variable representing an entire state sequence. The
likelihood of a data sequence is:
    p(y_{1:T}) = \sum_s p(y_{1:T}, s)    (437)
marginals of ξ:
    \gamma_t(i) = q(s_t = i) = \sum_{s \setminus \{i\}} q(s)    (438)
    \xi_t(i, j) = q(s_t = i, s_{t+1} = j) = \sum_{s \setminus \{i,j\}} q(s)    (439)
While these sequences may often be similar, they can be different as well. For example, it is
possible that the most likely states for two consecutive time-steps do not have a valid transition
between them, i.e., if s∗t = i and s∗t+1 = j, it is possible (though unlikely) that Aij = 0. This
illustrates that these two ways to create sequences of states answer two different questions: what
sequence is jointly most likely? And, for each time-step, what is the most likely state just for that
time-step?
where dist(x, w, b) is the Euclidean distance from the feature point φ(x) to the hyperplane defined
by w and b. With this objective function we are maximizing the distance from the decision boundary
w^T φ(x) + b = 0 to the nearest point i. The constraints force us to find a decision boundary
that classifies all training data correctly. That is, for the classifier to classify a training point
correctly, yi and w^T φ(xi ) + b should have the same sign, in which case their product must be positive.
Figure 32: Left: the margin for a decision boundary is the distance to the nearest data point. Right:
In SVMs, we find the boundary with maximum margin. (Figure from Pattern Recognition and Machine
Learning by Chris Bishop.)
It can be shown that the distance from a point φ(xi ) to a hyperplane w^T φ(x) + b = 0 is given
by |w^T φ(xi ) + b| / ||w||, or, since yi tells us the sign of f (xi ), yi (w^T φ(xi ) + b) / ||w||.
This can be seen intuitively by writing the hyperplane in the form f (x) = w^T (φ(xi ) − p), where p
is a point on the hyperplane, so that w^T p = −b. The vector from φ(xi ) to the hyperplane projected
onto w/||w|| gives a vector from the hyperplane to the point; the length of this vector is the desired
distance.
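The distance formula is easy to check numerically. In this sketch the feature vector φ(x) is passed in directly as a list, and the function name `margin_distance` is illustrative.

```python
import math

def margin_distance(w, b, phi_x, y):
    """y * (w^T phi(x) + b) / ||w||; positive when the point is on its correct side."""
    f = sum(wi * xi for wi, xi in zip(w, phi_x)) + b
    return y * f / math.sqrt(sum(wi * wi for wi in w))

d = margin_distance([3.0, 4.0], -5.0, [3.0, 4.0], +1)          # (9 + 16 - 5) / 5
# Rescaling w and b by the same factor leaves the distance unchanged.
d_scaled = margin_distance([30.0, 40.0], -50.0, [3.0, 4.0], +1)
# The same point with the opposite label is on the wrong side.
d_wrong = margin_distance([3.0, 4.0], -5.0, [3.0, 4.0], -1)
```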
Substituting this expression for the distance function into the above objective function, we get:
    \max_{w,b} \min_i \frac{y_i (w^T \phi(x_i) + b)}{\|w\|}    (458)
    such that, for all i, y_i (w^T \phi(x_i) + b) \ge 0    (459)
Note that, because of the normalization by ||w|| in (458), the scale of w is arbitrary in this objective
function. That is, if we were to multiply w and b by some real scalar α, the factors of α in the
numerator and denominator will cancel one another. Now, suppose that we choose the scale so
that the nearest point to the hyperplane, xi , satisfies yi (wT φ(xi ) + b) = 1. With this assumption
the mini in Eqn (458) becomes redundant and can be removed. Thus we can rewrite the objective
function and the constraint as
    \max_{w,b} \frac{1}{\|w\|}    (460)
    such that, for all i, y_i (w^T \phi(x_i) + b) \ge 1    (461)
Finally, as a last step, since maximizing 1/||w|| is the same as minimizing ||w||^2 /2, we can
re-express the optimization problem as
    \min_{w,b} \frac{1}{2} \|w\|^2    (462)
    such that, for all i, y_i (w^T \phi(x_i) + b) \ge 1    (463)
This optimization problem is a quadratic program, or QP, because the objective function is
quadratic and the constraints are linear in the unknowns. Because the objective here is convex, this
QP has a single global minimum, which can be found efficiently with current optimization packages.
In order to understand this optimization problem, we can see that the constraints will be “active”
for only a few datapoints. That is, only a few datapoints will be close to the margin, thereby
constraining the solution. These points are called the support vectors. Small movements of
the other data points have no effect on the decision boundary. Indeed, the decision boundary is
determined only by the support vectors. Of course, moving points to within the margin of the
decision boundary will change which points are support vectors, and thus change the decision
boundary. This is in contrast to the probabilistic methods we have seen earlier in the course, in
which the positions of all data points affect the location of the decision boundary.
As discussed above, we aim to both maximize the margin and minimize violation of the mar-
gin constraints. This objective function is still a QP, and so can be optimized with a QP library.
However, it does have a much larger number of optimization variables, namely, one ξ must now
be optimized for each datapoint. In practice, SVMs are normally optimized with special-purpose
optimization procedures designed specifically for SVMs.
Figure 33: The slack variables satisfy ξi ≥ 1 for misclassified points, and 0 < ξi < 1 for points close to
the decision boundary. (Figure from Pattern Recognition and Machine Learning by Chris Bishop.)
(Note that yf (x) > 0 is the same as requiring that y and f (x) have the same sign.) This loss
function says that we pay a penalty of 1 when we misclassify a new input, and a penalty of zero if
we classify it correctly.
Ideally, we would choose the classifier to minimize the loss over the new test data that we are
given; of course, we don’t know the true labels, and instead we optimize the following surrogate
objective function over the training data:
    E(w) = \sum_i L(x_i, y_i) + \lambda R(w)    (467)
9 A loss function specifies a measure of the quality of a solution to an optimization problem. It is the penalty
function that tells us how badly we want to penalize errors in a model’s ability to fit the data. In probabilistic methods
it is typically the negative log likelihood or the negative log posterior.
Figure 34: Loss functions, E(z), for learning, for z = y f (x). Black: 0-1 loss. Red: LR loss.
Green: Quadratic loss ((z − 1)^2). Blue: Hinge loss. (Figure from Pattern Recognition and Machine
Learning by Chris Bishop.)
where R(w) is a regularizer meant to prevent overfitting (and thus improve performance on future
data). The basic assumption is that loss on the training set should correspond to loss on the test
set. If we can get the classifier to have small loss on the training data, while also being smooth,
then the loss we pay on new data ought to not be too big either. This optimization framework is
equivalent to MAP estimation as discussed previously10 ; however, here we are not at all concerned
with probabilities. We only care about whether the classifier gets the right answers or not.
Unfortunately, optimizing a classifier for the 0-1 loss is very difficult: it is not differentiable
everywhere, and, where it is differentiable, the gradient is zero everywhere. There are a set of
algorithms called Perceptron Learning which attempt to do this; of these, the Voted Perceptron
algorithm is considered one of the best. However, these methods are somewhat complex to analyze
and we will not discuss them further. Instead, we will use other loss functions that approximate
0-1 loss.
We can see that maximum likelihood logistic regression is equivalent to optimization with the
following loss function:
    L_{LR} = \ln\left(1 + e^{-y f(x)}\right)    (468)
which is the negative log-likelihood of a single data vector. This function is a poor approximation
to the 0-1 loss, and, if all we care about is getting the labels right (and not the class probabilities),
then we ought to search for a better approximation.
SVMs minimize the slack variables, which, from the constraints, can be seen to give the hinge
loss:
    L_{hinge} = \begin{cases} 1 - y f(x) & \text{if } 1 - y f(x) > 0 \\ 0 & \text{otherwise} \end{cases}    (469)
10
However, not all loss functions can be viewed as the negative log of a valid likelihood function, although all
negative-log likelihoods can be viewed as loss functions for learning.
This loss function is zero for points that are classified correctly (with distance to the decision
boundary at least 1); hence, it is insensitive to correctly-classified points far from the boundary. It
increases linearly for misclassified points, not nearly as quickly as the LR loss.
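The three losses discussed so far can be compared directly as functions of z = y f(x). This is a small sketch; the function names are chosen for this example.

```python
import math

def zero_one_loss(z):
    """Penalty 0 when y f(x) > 0, otherwise 1."""
    return 0.0 if z > 0 else 1.0

def lr_loss(z):
    """Logistic regression loss ln(1 + e^{-z})."""
    return math.log(1.0 + math.exp(-z))

def hinge_loss(z):
    """Hinge loss: 1 - z when that is positive, otherwise 0."""
    return max(0.0, 1.0 - z)

zs = [-2.0, -1.0, 0.0, 0.5, 1.0, 2.0]
table = [(z, zero_one_loss(z), lr_loss(z), hinge_loss(z)) for z in zs]
```

Tabulating the values shows that the hinge loss is exactly zero once z ≥ 1, while the LR loss stays positive for every z.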
The minus sign with the second term is used because we are minimizing with respect to the first
term, but maximizing the second.
Setting the derivatives dL/dw = 0 and dL/db = 0 gives the following constraints on the solution:
    w = \sum_i a_i y_i \phi(x_i)    (471)
    \sum_i y_i a_i = 0    (472)
Using (471) we can substitute for w in (470). Then simplifying the result, and making use of the
next constraint (472), one can derive what is often called the dual Lagrangian:
    L(a_{1:N}) = \sum_i a_i - \frac{1}{2} \sum_i \sum_j a_i a_j y_i y_j \phi(x_i)^T \phi(x_j)    (473)
While this objective function is actually more expensive to evaluate than the primal Lagrangian
(i.e., 470), it does lead to the following modified form
    L(a_{1:N}) = \sum_i a_i - \frac{1}{2} \sum_i \sum_j a_i a_j y_i y_j k(x_i, x_j)    (474)
where k(xi , xj ) = φ(xi )^T φ(xj ) is called a kernel function. For example, if we used the basic
linear features, i.e., φ(x) = x, then k(xi , xj ) = xi^T xj .
The advantage of the kernel function representation is that it frees us from thinking about the
features directly; the classifier can be specified solely in terms of the kernel. Any kernel that
satisfies a specific technical condition11 is a valid kernel. For example, one of the most commonly-
used kernels is the “RBF kernel”:
    k(x, z) = e^{-\gamma \|x - z\|^2}    (475)
which corresponds to a vector of features φ(x) with infinite dimensionality! (Specifically, each
element of φ is a Gaussian basis function with vanishing variance).
Note that, just as most constraints in Eq. (463) are not “active”, the same will be true here.
That is, only some constraints will be active (i.e., those for the support vectors), and for all other
constraints, ai = 0. Hence, once the model is learned, most of the training data can be discarded; only the
support vectors and their a values matter.
The one final thing we need to do is estimate the bias b. We now know the values for ai for
all support vectors (i.e., for data constraints that are considered active), and hence we know w.
Accordingly, for all support vectors we know, by assumption above, that
    y_i f(x_i) = y_i (w^T \phi(x_i) + b) = 1 .    (476)
From this one can easily solve for b.
Applying the SVM to new data. For the kernel representation to be useful, we need to be able
to classify new data without needing to evaluate the weights. This can be done as follows:
    f(x_{new}) = w^T \phi(x_{new}) + b    (477)
               = \left( \sum_i a_i y_i \phi(x_i) \right)^T \phi(x_{new}) + b    (478)
               = \sum_i a_i y_i k(x_i, x_{new}) + b    (479)
Generalizing the kernel representation to non-separable datasets (i.e., with slack variables) is
straightforward, but will not be covered in this course.
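The prediction rule in (479) only needs the support vectors, their labels and a values, the bias, and a kernel. The sketch below uses the RBF kernel; the support vectors and a values are hand-picked for illustration (they satisfy Σi yi ai = 0) and are not the output of a trained SVM.

```python
import math

def rbf_kernel(x, z, gamma=1.0):
    """k(x, z) = exp(-gamma * ||x - z||^2)."""
    return math.exp(-gamma * sum((xi - zi) ** 2 for xi, zi in zip(x, z)))

def svm_decision(x_new, support_x, support_y, support_a, b, kernel=rbf_kernel):
    """f(x_new) = sum_i a_i y_i k(x_i, x_new) + b; the weights w are never formed."""
    return sum(a * y * kernel(x, x_new)
               for a, y, x in zip(support_a, support_y, support_x)) + b

# Hypothetical support vectors, labels, and multipliers.
support_x = [[0.0, 0.0], [2.0, 0.0]]
support_y = [+1, -1]
support_a = [1.0, 1.0]
b = 0.0

f_left = svm_decision([0.1, 0.0], support_x, support_y, support_a, b)
f_right = svm_decision([1.9, 0.0], support_x, support_y, support_a, b)
f_mid = svm_decision([1.0, 0.0], support_x, support_y, support_a, b)
```

The predicted class is the sign of the returned value; the midpoint between the two support vectors lies on the decision boundary in this symmetric example.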
Figure 35: Nonlinear classification boundary learned using kernel SVM (with an RBF kernel).
The circled points are the support vectors; curves are isocontours of the decision function (e.g., the
decision boundary f (x) = 0, etc.) (Figure from Pattern Recognition and Machine Learning by Chris
Bishop.)
17.6 Software
Like many methods in machine learning there is freely available software on the web. For SVM
classification and regression there is well-known software developed by Thorsten Joachims, called
SVMlight, (URL: http://svmlight.joachims.org/ ).
18 AdaBoost
Boosting is a general strategy for learning classifiers by combining simpler ones. The idea of
boosting is to take a “weak classifier” — that is, any classifier that will do at least slightly better
than chance — and use it to build a much better classifier, thereby boosting the performance of the
weak classification algorithm. This boosting is done by averaging the outputs of a collection of
weak classifiers. The most popular boosting algorithm is AdaBoost, so-called because it is “adap-
tive.”12 AdaBoost is extremely simple to use and implement (far simpler than SVMs), and often
gives very effective results. There is tremendous flexibility in the choice of weak classifier as well.
Boosting is a specific example of a general class of learning algorithms called ensemble methods,
which attempt to build better learning algorithms by combining multiple simpler algorithms.
Suppose we are given training data {(xi , yi )}, i = 1, ..., N , where xi ∈ R^K and yi ∈ {−1, 1}. And
suppose we are given a (potentially large) number of weak classifiers, denoted fm (x) ∈ {−1, 1},
and a 0-1 loss function I, defined as
    I(f_m(x), y) = \begin{cases} 0 & \text{if } f_m(x_i) = y_i \\ 1 & \text{if } f_m(x_i) \neq y_i \end{cases}    (480)
for i from 1 to N , w_i^{(1)} = 1
for m = 1 to M do
    Fit weak classifier m to minimize the objective function:
        \epsilon_m = \frac{\sum_{i=1}^{N} w_i^{(m)} I(f_m(x_i) \neq y_i)}{\sum_i w_i^{(m)}}
    where I(f_m(x_i) \neq y_i) = 1 if f_m(x_i) \neq y_i and 0 otherwise
    \alpha_m = \ln \frac{1 - \epsilon_m}{\epsilon_m}
    for all i do
        w_i^{(m+1)} = w_i^{(m)} e^{\alpha_m I(f_m(x_i) \neq y_i)}
    end for
end for
After learning, the final classifier is based on a linear combination of the weak classifiers:
    g(x) = \mathrm{sign}\left( \sum_{m=1}^{M} \alpha_m f_m(x) \right)    (481)
Essentially, AdaBoost is a greedy algorithm that builds up a “strong classifier”, i.e., g(x),
incrementally, by optimizing the weights for, and adding, one weak classifier at a time.
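A minimal sketch of the algorithm follows, assuming the weak classifiers come from a fixed pool of one-dimensional threshold functions. The pool, the toy data, and the early exit when the weighted error is 0 or at least 1/2 are choices made for this example.

```python
import math

def adaboost(xs, ys, pool, M):
    """Return a list of (alpha_m, f_m) pairs chosen greedily from pool."""
    w = [1.0] * len(xs)
    ensemble = []
    for _ in range(M):
        def weighted_error(f):
            return sum(wi for wi, x, y in zip(w, xs, ys) if f(x) != y) / sum(w)
        f_m = min(pool, key=weighted_error)
        eps = weighted_error(f_m)
        if eps <= 0.0 or eps >= 0.5:
            # Guard added for this sketch: alpha would be infinite or non-positive.
            if eps <= 0.0:
                ensemble.append((1.0, f_m))
            break
        alpha = math.log((1.0 - eps) / eps)
        # Increase the weight of every misclassified point by a factor e^alpha.
        w = [wi * math.exp(alpha) if f_m(x) != y else wi
             for wi, x, y in zip(w, xs, ys)]
        ensemble.append((alpha, f_m))
    return ensemble

def predict(ensemble, x):
    """g(x) = sign of the alpha-weighted vote."""
    return 1 if sum(alpha * f(x) for alpha, f in ensemble) >= 0 else -1

# Toy 1D data that no single threshold classifier can separate.
xs = [0, 1, 2, 3, 4, 5]
ys = [1, 1, -1, -1, 1, 1]
pool = [(lambda x, c=c, s=s: s if x > c else -s)
        for c in [-0.5, 0.5, 1.5, 2.5, 3.5, 4.5, 5.5] for s in (1, -1)]

best_single_errors = min(sum(1 for x, y in zip(xs, ys) if f(x) != y) for f in pool)
ensemble = adaboost(xs, ys, pool, M=3)
predictions = [predict(ensemble, x) for x in xs]
```

On this data the best single weak classifier still makes two mistakes, while the combination of three classifies every training point correctly.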
12
AdaBoost was called adaptive because, unlike previous boosting algorithms, it does not need to know error bounds
on the weak classifiers, nor does it need to know the number of classifiers in advance.
Figure 36: Illustration of the steps of AdaBoost. The decision boundary is shown in green for each
step, and the decision stump for each step shown as a dashed line. The results are shown after 1, 2,
3, 6, 10, and 150 steps of AdaBoost. (Figure from Pattern Recognition and Machine Learning by Chris
Bishop.)
Figure 37: 50 steps of AdaBoost used to learn a classifier with decision stumps.
where the value in the parentheses is 1 if the k-th element of the vector x is greater than c, and
-1 otherwise. The scalar s is either -1 or 1, which allows the classifier to respond with class 1
when xk ≤ c. Accordingly, there are three parameters to a decision stump:
• k ∈ {1, ..., K}
• c ∈ R
• s ∈ {−1, 1}
Because the number of possible parameter settings is relatively small, a decision stump is often
trained by brute force: discretize the real numbers from the smallest to the largest value in the
training set, enumerate all possible classifiers, and pick the one with the lowest training error. One
can be more clever in the discretization: between each pair of data points, only one classifier must
be tested (since any stump in this range will give the same value). More sophisticated methods, for
example, based on binning the data, or building CDFs of the data, may also be possible.
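The brute-force search with one threshold per gap between neighbouring data values might look like the sketch below. The function name `train_stump`, the optional weights argument, and the toy data are assumptions of this example; feature indices are 0-based.

```python
def train_stump(X, y, w=None):
    """Return (k, c, s, err) minimizing the weighted training error.

    The stump predicts s if x[k] > c and -s otherwise.
    """
    N = len(X)
    w = w or [1.0 / N] * N
    best = None
    for k in range(len(X[0])):
        values = sorted(set(x[k] for x in X))
        # One threshold below all data, then one midpoint per neighbouring pair.
        thresholds = [values[0] - 1.0] + [(u + v) / 2 for u, v in zip(values, values[1:])]
        for c in thresholds:
            for s in (1, -1):
                err = sum(wi for wi, x, yi in zip(w, X, y)
                          if (s if x[k] > c else -s) != yi)
                if best is None or err < best[3]:
                    best = (k, c, s, err)
    return best

X = [[1.0, 5.0], [2.0, 1.0], [3.0, 4.0], [4.0, 2.0]]
y = [1, -1, 1, -1]
k, c, s, err = train_stump(X, y)   # only the second feature separates the classes
```

Passing the current AdaBoost weights as `w` would make the same search minimize the weighted error instead of the plain training error.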
Loss function view. Here we discuss the loss function interpretation of AdaBoost. As was shown
(years after AdaBoost was first invented), AdaBoost can be viewed as greedy optimization of a
particular loss function. We define f (x) = (1/2) Σm αm fm (x), and rewrite the classifier as g(x) =
sign(f (x)) (the factor of 1/2 has no effect on the classifier output). AdaBoost can then be viewed
as optimizing the exponential loss:
which must be optimized with respect to the weights α and the parameters of the weak classifiers.
The optimization process is greedy and sequential: we add one weak classifier at a time, choosing it
and its α to be optimal with respect to E, and then never change it again. Note that the exponential
loss is an upper-bound on the 0-1 loss:
    L_{exp}(x, y) \ge L_{0-1}(x, y)    (485)
Hence, if exponential loss of zero is achieved, then the 0-1 loss is zero as well, and all training
points are correctly classified.
Consider the weak classifier fm to be added at step m. The entire objective function can be
written to separate out the contribution of this classifier:
    E = \sum_i e^{-\frac{1}{2} y_i \sum_{j=1}^{m-1} \alpha_j f_j(x_i) - \frac{1}{2} y_i \alpha_m f_m(x_i)}    (486)
      = \sum_i e^{-\frac{1}{2} y_i \sum_{j=1}^{m-1} \alpha_j f_j(x_i)} \, e^{-\frac{1}{2} y_i \alpha_m f_m(x_i)}    (487)
Since we are holding constant the first m − 1 terms, we can replace them with a single constant
w_i^{(m)} = e^{-\frac{1}{2} y_i \sum_{j=1}^{m-1} \alpha_j f_j(x_i)}. Note that these are the same weights computed by the
recursion used by AdaBoost: w_i^{(m)} \propto w_i^{(m-1)} e^{-\frac{1}{2} y_i \alpha_{m-1} f_{m-1}(x_i)}. (There is a
proportionality constant that can be ignored).
Hence, we have
    E = \sum_i w_i^{(m)} e^{-\frac{1}{2} y_i \alpha_m f_m(x_i)}    (488)
We can split this into two summations, one for data correctly classified by fm , and one for those
misclassified:
    E = \sum_{i: f_m(x_i) = y_i} w_i^{(m)} e^{-\frac{\alpha_m}{2}} + \sum_{i: f_m(x_i) \neq y_i} w_i^{(m)} e^{\frac{\alpha_m}{2}}    (489)
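As a worked step, minimizing (489) over αm recovers the weight used in the algorithm. The shorthand W_c and W_e (total weight of correctly and incorrectly classified points) is introduced here only for this derivation.

```latex
\begin{align*}
W_c &= \sum_{i:\, f_m(x_i) = y_i} w_i^{(m)}, \qquad
W_e = \sum_{i:\, f_m(x_i) \neq y_i} w_i^{(m)}, \\
E &= W_c\, e^{-\alpha_m/2} + W_e\, e^{\alpha_m/2}, \\
\frac{\partial E}{\partial \alpha_m}
  &= -\tfrac{1}{2} W_c\, e^{-\alpha_m/2} + \tfrac{1}{2} W_e\, e^{\alpha_m/2} = 0
  \;\Longrightarrow\; e^{\alpha_m} = \frac{W_c}{W_e}, \\
\alpha_m &= \ln \frac{W_c}{W_e} = \ln \frac{1 - \epsilon_m}{\epsilon_m},
  \qquad \epsilon_m = \frac{W_e}{W_c + W_e}.
\end{align*}
```

For a fixed αm > 0, E grows with W_e, which is why the weak classifier is chosen to minimize the weighted error.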
Figure 38: Loss functions for learning: Black: 0-1 loss. Blue: Hinge Loss. Red: Logistic re-
gression. Green: Exponential loss. (Figure from Pattern Recognition and Machine Learning by Chris
Bishop.)
Problems with the loss function view. The exponential loss is not a very good loss function to
use in general. For example, if we directly optimize the exponential loss over all variables in the
classifier (e.g., with gradient descent), we will often get terrible performance. So the loss-function
interpretation of AdaBoost does not tell the whole story.
Margin view. One might expect that, when AdaBoost reaches zero training set error, adding any
new weak classifiers would cause overfitting. In practice, the opposite often occurs: continuing to
add weak classifiers actually improves test set performance in many situations. One explanation
comes from looking at the margins: adding classifiers tends to increase the margin size. The formal
details of this will not be discussed here.
extremely simple and smooth. During the learning process, we add more and more complexity to
the model to improve the fit to the data. At some point, adding additional complexity to the model
overfits: we are no longer modeling the decision boundary we wish to fit, but are fitting the noise
in the data instead. We use the test set to determine when overfitting begins, and stop learning at
that point.
Early stopping can be used for most iterative learning algorithms. For example, suppose we use
gradient descent to learn a regression algorithm. If we begin with weights w = 0, we are beginning
with a very smooth curve. Each step of gradient descent will make the curve less smooth, as the
entries of w get larger and larger; stopping early can prevent w from getting too large (and thus
too non-smooth).
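A toy sketch of this idea for a one-parameter regression model y ≈ w x, starting from w = 0 and stopping as soon as the error on held-out data goes up. The data, learning rate, and function names are invented for this example.

```python
def sq_error(w, xs, ys):
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys))

def fit_with_early_stopping(train, held_out, lr=0.01, max_steps=1000):
    """Gradient descent on the training error; returns (best_w, steps_taken)."""
    (xt, yt), (xh, yh) = train, held_out
    w = 0.0                                   # start from the smoothest model
    best_w, best_err = w, sq_error(w, xh, yh)
    for step in range(1, max_steps + 1):
        grad = 2.0 * sum(x * (w * x - y) for x, y in zip(xt, yt))
        w -= lr * grad
        err = sq_error(w, xh, yh)
        if err > best_err:                    # held-out error rose: stop here
            return best_w, step
        best_w, best_err = w, err
    return best_w, max_steps

train = ([1.0, 2.0, 3.0], [2.2, 4.1, 6.3])    # least-squares slope is 29.3 / 14
held_out = ([1.0, 2.0], [1.9, 3.8])           # held-out data prefers slope 1.9
w_stop, steps = fit_with_early_stopping(train, held_out)
```

The returned weight is smaller than the one gradient descent would reach at convergence on the training data alone.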
Early stopping is very simple and very general; however, it is heuristic, as the final result one
gets will depend on the particulars in the optimization algorithm being used, and not just on the
objective function. However, AdaBoost’s procedure is suboptimal anyway (once a weak classifier
is added, it is never updated).
An even more aggressive form of early stopping is to simply stop learning at a fixed number of
iterations, or by some other criterion unrelated to test set error (e.g., when the result “looks good”).
In fact, practitioners often use early stopping to regularize unintentionally, simply because they
halt the optimizer before it has converged, e.g., because the convergence threshold is set too high,
or because they are too impatient to wait.