MATHEMATICAL FOUNDATIONS OF MACHINE LEARNING
(NMAG 469, FALL TERM 2018-2019)
HÔNG VÂN LÊ ∗
Contents
1. Learning, machine learning and artificial intelligence
1.1. Learning, inductive learning and machine learning
1.2. A brief history of machine learning
1.3. Current tasks and types of machine learning
1.4. Basic questions in mathematical foundations of machine learning
1.5. Conclusion
2. Statistical models and frameworks for supervised learning
2.1. Discriminative model of supervised learning
2.2. Generative model of supervised learning
2.3. Empirical Risk Minimization and overfitting
2.4. Conclusion
3. Statistical models and frameworks for unsupervised learning and reinforcement learning
3.1. Statistical models and frameworks for density estimation
3.2. Statistical models and frameworks for clustering
3.3. Statistical models and frameworks for dimension reduction and manifold learning
3.4. Statistical model and framework for reinforcement learning
3.5. Conclusion
4. Fisher metric and maximum likelihood estimator
4.1. The space of all probability measures and total variation norm
4.2. Fisher metric on a statistical model
4.3. MSE and Cramér-Rao inequality
4.4. Efficient estimators and MLE
4.5. Consistency of MLE
4.6. Conclusion
5. Consistency of a learning algorithm
5.1. Consistent learning algorithm and its sample complexity
5.2. Uniformly consistent learning and VC-dimension
It is not knowledge, but the act of learning ... which grants the greatest
enjoyment.
Carl Friedrich Gauss
1.3. Current tasks and types of machine learning. Now I shall describe what current machine learning can perform and how it does so.
1.3.1. Main tasks of current machine learning. Let us give a short description of current applications of machine learning.
Classification task assigns a category to each item. In mathematical language, a category is an element of a countable set. For example, document classification may assign items with categories such as politics, email spam, sports, or weather, while image classification may assign items with categories such as landscape, portrait, or animal. The number of categories in such tasks is often small, but it can be large in some applications.
3 The term “regression” was coined by Francis Galton in the nineteenth century to describe a biological phenomenon: the heights of descendants of tall ancestors tend to regress down towards the normal average (a phenomenon also known as regression toward the mean of a population). For Galton, regression had only this biological meaning, but his work was later extended by Udny Yule and Karl Pearson to a more general statistical context: movement toward the mean of a statistical population. Galton’s method of investigation was non-standard at the time: first he collected the data, then he guessed the relationship model of the events.
4 He founded the world’s first university statistics department at University College
London in 1911, the Biometrical Society and Biometrika, the first journal of mathematical
statistics and biometry.
5 Fisher introduced the main models of statistical inference in the unified framework of
parametric statistics. He described different problems of estimating functions from given
data (the problems of discriminant analysis, regression analysis, and density estimation)
as the problems of parameter estimation of specific (parametric) models and suggested the
maximum likelihood method for estimating the unknown parameters in all these models.
1.3.2. Main types of machine learning. The type of a machine learning task is defined by the type of interaction between the learner and the environment. More precisely, we consider the types of training data, i.e., the data available to the learner before making a decision or prediction, the outcomes, and the test data that are used to evaluate and apply the learning algorithm.
By Definition 1.1, a learning process is successful if its predictions improve when we have more data. For example, in a classification problem, the learning machine has to predict the category of a new item from a specific set, after seeing a lot of labeled data consisting of items and their categories. The classification task is a typical task in supervised learning, where we can explain how and why a learning machine works and how and why a machine learns successfully. Mathematical foundations of machine learning aim to answer these questions in mathematical language.
Question 1.3. What is the mathematical model of learning?
To answer Question 1.3 we need to specify our definition of learning in a
mathematical language which can be used to build instructions for machines.
Question 1.4. How to quantify the difficulty/complexity of a learning problem?
We quantify the difficulty of a problem in terms of its time complexity, which is the minimum time a computer program needs to solve the problem, and in terms of its resource complexity, which measures the data storage capacity and energy resources needed to solve the problem. If the complexity of a problem is very large, then we cannot learn it. So Question 1.4 contains the sub-question “why can we learn a problem?”
Question 1.5. How to choose a learning algorithm?
Clearly we want to have the best learning algorithm, once we know a model of machine learning which specifies the set of possible predictors (decisions) and the associated error/reward function.
By Definition 1.1, a learning process is successful if its prediction/estimation improves with the increase of data. Thus the notion of success of a learning process requires a mathematical treatment of the asymptotic rate of error/reward in the presence of complexity of the problem.
Question 1.6. Is there a mathematical theory underlying intelligence?
I shall discuss this speculative question in the last lecture.
6 Classically, elements of X are called random variables, where the word “variable” means “unknown”. When X is an input space (resp. an output space), its elements are also called independent (resp. dependent) variables. Since nowadays the word “variable” has a different meaning, cf. [Ghahramani2013, p. 4], I would avoid the term “random variable” in this situation.
such that $h_S(x)$ predicts the label of an (unseen) instance x with the least error.
• The error function, also called a risk function, measures the discrepancy between a hypothesis h ∈ H and an ideal predictor. The error function is a central notion in learning theory. This function should be defined as the averaged discrepancy of h(x) and y, where (x, y) runs over $\mathcal X \times \mathcal Y$. The averaging is calculated using the probability measure $\mu := \mu_{\mathcal X \times \mathcal Y}$ that governs the distribution of labeled pairs (x, y(x)). Thus a risk function R must depend on µ, so we denote it by $R_\mu$. It is accepted that the risk function $R_\mu$ is defined as follows.
(2.2)    $R^L_\mu(f) := \int_{\mathcal X \times \mathcal Y} L(x, y, f)\, d\mu.$
$L_2(\mathbb R, \mu) := \{f \in \mathbb R^{\mathbb R} \mid i_2(f) \in L_2(\mathcal X \times \mathbb R, \mu)\}.$
Now we let $\mathcal F := L_2(\mathcal X, \mu)$. Let $Y$ denote the function on $\mathbb R$ such that $Y(y) = y$. Assume that $Y \in L_2(\mathbb R, \mu)$ and define the quadratic loss function $L : \mathcal X \times \mathcal Y \times \mathcal F \to \mathbb R$,
(2.12)    $L(x, y, h) := |y - h(x)|^2.$
The expected risk $R^L_\mu$ is called the $L_2$-risk, also known as the mean squared error (MSE). Show that the regression function $r(x) := E_\mu(i_2(Y)|X = x)$ belongs to $\mathcal F$ and minimizes the $L_2(\mu)$-risk.
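The following numeric sketch (data, names, and parameters are mine, not from the lecture) illustrates the claim of the exercise: the empirical conditional mean achieves essentially the minimal $L_2$-risk, here the noise variance, while any other predictor does worse.

```python
# Sketch: the regression function (conditional mean) minimizes the L2-risk.
import numpy as np

rng = np.random.default_rng(0)
x = rng.integers(0, 5, size=100_000)            # discrete X, so E[Y|X=x] is easy
y = 2.0 * x + rng.normal(0.0, 1.0, size=x.shape)

r = {v: y[x == v].mean() for v in range(5)}     # empirical regression function

def l2_risk(h):
    return np.mean((y - np.vectorize(h)(x)) ** 2)

print(l2_risk(lambda v: r[v]))           # ~1.0, the noise variance
print(l2_risk(lambda v: 2.0 * v + 0.5))  # strictly larger
```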
Definition 2.8. A model of supervised learning with the aim of estimating the conditional distribution P(y ∈ B|x), in particular a conditional density function p(y|x), or the joint distribution of (x, y), is called a generative model of supervised learning.
(2.15)    $\hat R^L_S(h) := \frac{1}{n}\sum_{i=1}^n L(x_i, y_i, h) \in \mathbb R.$
If L is fixed, then we also omit the superscript L.
The empirical risk is a function of two variables: the “empirical data” S and the predictor h. Given S, a learner can compute $\hat R_S(h)$ for any function $h : \mathcal X \to \mathcal Y$. A minimizer of the empirical risk should also “approximately” minimize the expected risk. This is the empirical risk minimization principle, abbreviated as ERM.
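A minimal code sketch of the ERM principle, under the toy assumption of a finite hypothesis class of threshold classifiers; `empirical_risk` implements (2.15) for the 0-1 loss.

```python
# ERM sketch: compute empirical risks over a finite class and pick the minimizer.
import numpy as np

def empirical_risk(S, h, loss):
    return np.mean([loss(x, y, h) for x, y in S])

l01 = lambda x, y, h: float(h(x) != y)                         # 0-1 loss
S = [(0.2, 0), (0.4, 0), (0.6, 1), (0.9, 1)]                   # labeled sample
H = {t: (lambda x, t=t: int(x > t)) for t in (0.1, 0.5, 0.8)}  # thresholds

erm = min(H, key=lambda t: empirical_risk(S, H[t], l01))
print(erm, empirical_risk(S, H[erm], l01))                     # 0.5, risk 0.0
```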
Remark 2.10. We note that
(2.16)    $\hat R^L_S(h) = \frac{1}{n}\, R^{L^{(n)}}_{\mu_S}(h)$
where $\mu_S$ is the Dirac measure on $(\mathcal X \times \mathcal Y)^n$ associated to S, see (2.6). If h is fixed, by the weak law of large numbers, the RHS of (2.16) converges in probability to the expected risk $R^L_\mu(h)$, so we could hope to find a condition
where $p^n_\theta(S_n)$ is the density of the probability measure $\mu^n_\theta$ on $\mathcal X^n$. It follows that the minimizer θ of the empirical risk $\hat R^L_{S_n}$ is the maximizer of the log-likelihood function $\log[p^n_\theta(S_n)]$. According to the ERM principle, the minimizer θ of $\hat R^L_{S_n}$ should provide an “approximation” of the density $p_u$ of the unknown probability measure $\mu_u$.
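As a hedged illustration of this equivalence, the sketch below maximizes the log-likelihood of an assumed Gaussian family by minimizing the empirical negative log-likelihood; the family, data, and optimizer are my choices, not part of the lecture.

```python
# MLE as ERM: minimize the empirical negative log-likelihood of a Gaussian model.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
Sn = rng.normal(3.0, 2.0, size=1000)          # i.i.d. sample from the unknown law

def neg_log_likelihood(theta):
    m, s = theta[0], abs(theta[1]) + 1e-12    # mean and (positive) std
    return np.sum(0.5 * ((Sn - m) / s) ** 2 + np.log(s))

theta_hat = minimize(neg_log_likelihood, x0=[0.0, 1.0]).x
print(theta_hat)   # close to (3, 2), matching the closed-form Gaussian MLE
```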
Remark 3.1. (1) For $\mathcal X = \mathbb R$ the ERM principle for the expected log-likelihood function holds. Namely, one can show that the minimum of the risk functional in (3.2), if it exists, is attained at a function $p^*_u$ which may differ from $p_u$ only on a set of zero measure, see [Vapnik1998, p. 30] for a proof.
(2) Note that minimizing the expected log-likelihood function $R_\mu(\theta)$ is the same as minimizing the following modified risk function [Vapnik2000, p. 32]
(3.4)    $R^*_\mu(\theta) := R_\mu(\theta) + \int_{\mathcal X} \log p_u(x)\, p_u(x)\, d\mu_0 = -\int_{\mathcal X} \log \frac{p_\theta(x)}{p_u(x)}\, p_u(x)\, d\mu_0.$
The expression on the RHS of (3.4) is the Kullback-Leibler divergence $KL(p_\theta\mu_0, \mu_u)$ that is used in statistics for measuring the divergence between $p_\theta\mu_0$ and $\mu_u$.
8 For many important statistical models in machine learning the condition that the parameterization is a 1-1 map does not hold, and we refer to [AJLS2017] for a general treatment.
Note that the RHS measures the accuracy of $\hat p^{PR}_{S_n}(x_0)$ in probability w.r.t. $S_n \in \mathbb R^n$. This is an important concept of accuracy in the presence of uncertainty. It has been proved that under certain conditions on the kernel function K and the infinite dimensional statistical model P of densities, the $MSE(\hat f_{PR}, x_0)$ converges to zero uniformly on $\mathbb R$ as h goes to zero [Tsybakov2009, Theorem 1.1, p. 9].
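Below is a minimal sketch of the Parzen-Rosenblatt estimator with a Gaussian kernel; the bandwidths and sample size are illustrative choices of mine, not values from [Tsybakov2009].

```python
# Parzen-Rosenblatt density estimate at a point x0.
import numpy as np

def parzen_rosenblatt(x0, sample, h):
    u = (x0 - sample) / h
    K = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)   # Gaussian kernel
    return K.sum() / (len(sample) * h)

rng = np.random.default_rng(2)
Sn = rng.normal(0.0, 1.0, size=5000)
for h in (1.0, 0.3, 0.1):
    print(h, parzen_rosenblatt(0.0, Sn, h))
    # approaches the true density 1/sqrt(2*pi) ~ 0.3989 for suitable h
```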
Remark 3.2. In this subsection we discussed two popular models of machine learning for density estimation using the ERM principle, which works under certain conditions. We postpone the important Bayesian model of machine learning and the stochastic approximation method for finding a minimizer of the expected risk function using i.i.d. data to later parts of our course.
3.2. Statistical models and frameworks for clustering. Clustering is the process of grouping similar objects x ∈ X together. There are two possible types of grouping: partitional clustering, where we partition the objects into disjoint sets, and hierarchical clustering, where we create a nested tree of partitions. To formalize the notion of similarity we introduce a quasi-distance function on X, that is, a function $d : \mathcal X \times \mathcal X \to \mathbb R_+$ that is symmetric and satisfies d(x, x) = 0 for all x ∈ X.
A popular approach to clustering starts by defining a cost function over a parameterized set of possible clusterings, and the goal of the clustering algorithm is to find a partitioning (clustering) of minimal cost. Under this paradigm, the clustering task is turned into an optimization problem. The function to be minimized is called the objective function: a function G from pairs of an input (X, d) and a proposed clustering solution $C = (C_1, \cdots, C_k)$ to positive real numbers. Given G, the goal of a clustering algorithm is defined as finding, for a given input (X, d), a clustering C
so that G((X, d), C) is minimized. In order to reach that goal, one has
to apply some appropriate search algorithm. As it turns out, most of the
resulting optimization problems are NP-hard, and some are even NP-hard
to approximate.
Example 3.3. The k-means objective function is one of the most popular clustering objectives. In k-means the data is partitioned into disjoint sets $C_1, \cdots, C_k$, where each $C_i$ is represented by a centroid $\mu_i := \mu_i(C_i)$. It is assumed that the input set X is embedded in some larger metric space $(\mathcal X', d)$ and $\mu_i \in \mathcal X'$. We define $\mu_i$ as follows:
$\mu_i(C_i) := \arg\min_{\mu \in \mathcal X'} \sum_{x \in C_i} d(x, \mu).$
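Lloyd's algorithm is the standard heuristic for this objective with d the squared Euclidean distance; the sketch below is my own illustration and, since k-means is NP-hard, it only finds a local minimum of G.

```python
# Lloyd's algorithm: alternate assignment to nearest centroid and re-centering.
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # assign each point to its nearest centroid (squared Euclidean d)
        labels = np.argmin(((X[:, None] - centroids[None]) ** 2).sum(-1), axis=1)
        # recompute each centroid as the mean of its cluster
        # (empty clusters are ignored here for brevity)
        centroids = np.array([X[labels == i].mean(axis=0) for i in range(k)])
    return labels, centroids

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5.0])
labels, C = kmeans(X, k=2)
print(C)   # one centroid near (0, 0), one near (5, 5)
```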
$W(S_n) := \{W(x_i)\} \subset \mathbb R^m.$
Assume that $\xi_1, \cdots, \xi_d \in \mathbb R^d$ are eigenvectors of $C(S_m)$ with eigenvalues $\lambda_1 \ge \cdots \ge \lambda_d \ge 0$. Show that any $(W, U) \in \mathcal F$ with $W(x_j) = 0$ for all $j \ge m + 1$ is a solution of (3.11).
Thus a PCA problem can be solved using linear algebra methods.
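A sketch of this linear-algebra solution: PCA by eigendecomposition of the empirical covariance matrix; the notation and data below are mine.

```python
# PCA via the eigendecomposition of the empirical covariance matrix C(S_n).
import numpy as np

def pca(X, k):
    Xc = X - X.mean(axis=0)                   # center the data
    C = Xc.T @ Xc / len(X)                    # empirical covariance C(S_n)
    eigvals, eigvecs = np.linalg.eigh(C)      # eigenvalues in ascending order
    W = eigvecs[:, ::-1][:, :k]               # top-k eigenvectors xi_1..xi_k
    return Xc @ W, W                          # scores and projection matrix

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 0.0], [0.0, 0.3]])
scores, W = pca(X, k=1)
print(W)   # close to (+-1, 0): the direction of largest variance
```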
• Manifold learning and autoencoder. In real life, data are not concentrated on a linear subspace of $\mathbb R^d$ but around a submanifold $M \subset \mathbb R^d$. The current challenge in the ML community is to reduce the representation of data in $\mathbb R^d$ using not all of $\mathbb R^d$ but only the data concentrated around M. For that purpose we use an autoencoder, which is a non-linear analogue of PCA.
In an autoencoder we learn a pair of functions: an encoder function $\psi : \mathbb R^d \to \mathbb R^k$, and a decoder function $\varphi : \mathbb R^k \to \mathbb R^d$. The goal of the learning process is to find a pair of functions $(\psi, \varphi)$ such that the reconstruction error
$R_{S_n}(\psi, \varphi) := \sum_{i=1}^n \|x_i - \varphi(\psi(x_i))\|^2$
is small. Without any restriction, taking ψ and φ to be identity maps would give zero reconstruction error; we therefore must restrict ψ and φ in some way. In PCA, we constrain k < d and further restrict ψ and φ to be linear functions.
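The following sketch trains a tiny linear autoencoder by gradient descent on the reconstruction error $R_{S_n}(\psi, \varphi)$; with linear maps and k < d its optimum recovers the PCA subspace. Step size and iteration count are ad hoc assumptions of mine.

```python
# Linear autoencoder psi(x) = A x, phi(z) = B z trained by gradient descent.
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(500, 2)) @ np.array([[3.0, 0.0], [0.0, 0.3]])
k, lr = 1, 0.01
A = rng.normal(scale=0.1, size=(k, 2))       # encoder weights
B = rng.normal(scale=0.1, size=(2, k))       # decoder weights

for _ in range(2000):
    Z = X @ A.T                              # encoded sample
    R = Z @ B.T - X                          # reconstruction residuals
    gB = 2 * R.T @ Z / len(X)                # gradient of mean error in B
    gA = 2 * (R @ B).T @ X / len(X)          # gradient of mean error in A
    A -= lr * gA
    B -= lr * gB

R = X @ A.T @ B.T - X
print(np.mean(np.sum(R ** 2, axis=1)))       # small: near the minor variance
```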
for an arbitrary measurable space $(\mathcal X, \Sigma)$. This geometry induces the Fisher metric on any statistical model $P \subset \mathcal P(\mathcal X)$ satisfying a mild condition.
Let us fix some notation. Recall that a signed finite measure µ on X is a function $\mu : \Sigma \to \mathbb R$ which satisfies all axioms of a measure except that µ need not take non-negative values. Now we set
$\mathcal M(\mathcal X) := \{\mu : \mu$ a finite measure on $\mathcal X\},$
$\mathcal S(\mathcal X) := \{\mu : \mu$ a signed finite measure on $\mathcal X\}.$
It is known that $\mathcal S(\mathcal X)$ is a Banach space whose norm is given by the total variation of a signed measure, defined as
$\|\mu\|_{TV} := \sup \sum_{i=1}^n |\mu(A_i)|,$
where the supremum is taken over all finite partitions $\mathcal X = \sqcup_{i=1}^n A_i$ with $A_i \in \Sigma$.
(4.11)    $= -\varepsilon \int_{\mathcal X} p_\theta(x)\, \partial_v \log p_\theta(x)\, d\mu_0$
(4.12)    $\quad - \varepsilon^2 \int_{\mathcal X} p_\theta(x)\, (\partial_v)^2 \log p_\theta(x)\, d\mu_0 + O(\varepsilon^3).$
Using the mean value $\varphi_{\hat\sigma}$, we define the variance of $\hat\sigma$ w.r.t. φ as the deviation of $\varphi \circ \hat\sigma$ from its mean value $\varphi_{\hat\sigma}$. We set for all $l \in (\mathbb R^n)^*$
(4.16)    $V^\varphi_\xi[\hat\sigma](l) := E_\xi[(\varphi^l \circ \hat\sigma - \varphi^l_{\hat\sigma}) \cdot (\varphi^l \circ \hat\sigma - \varphi^l_{\hat\sigma})].$
The RHS of (4.16) is well-defined, since $\hat\sigma \in L^2_\varphi(P, \mathcal X)$. It is a quadratic form on $(\mathbb R^n)^*$ and will be denoted by $V^\varphi_\xi[\hat\sigma]$.
Exercise 4.11. ([JLS2017]) Prove the following formula
(4.17)    $MSE^\varphi_\xi[\hat\sigma](l, k) = V^\varphi_\xi[\hat\sigma](l, k) + \langle b^\varphi_{\hat\sigma}(\xi), l\rangle \cdot \langle b^\varphi_{\hat\sigma}(\xi), k\rangle$
for all ξ ∈ P and all $l, k \in (\mathbb R^n)^*$.
By Proposition 3.3 in [JLS2017], since P is 2-integrable, the function $\varphi^l_{\hat\sigma} := \langle \varphi_{\hat\sigma}(\xi), l\rangle$ is differentiable, i.e., there is a differential $d\varphi^l_{\hat\sigma} \in T^*_\xi P$ for any ξ ∈ P such that $\partial_v \varphi^l_{\hat\sigma}(\xi) = d\varphi^l_{\hat\sigma}(v)$ for all $v \in T_\xi P$. Here for a function f on P we define $\partial_v f(\xi) := \frac{d}{dt}\big|_{t=0} f(c(t))$, where $c(t) \subset P$ is a curve with c(0) = ξ and $\dot c(0) = v$. The differentiability of $\varphi^l_{\hat\sigma}$ is proved by differentiation under the integral
(4.18)    $\partial_v \varphi^l_{\hat\sigma} = \partial_v \int_{\mathcal X} \varphi^l \circ \hat\sigma\, d\xi = \int_{\mathcal X} \big(\varphi^l \circ \hat\sigma(x) - E_\xi(\varphi^l \circ \hat\sigma)\big) \cdot \log v\, d\xi,$
see [AJLS2017, JLS2017] for more detail.
Recall that the Fisher metric on any 2-integrable statistical model is non-degenerate.9 We now regard $\|d\varphi^l_{\hat\sigma}\|^2_{g^{-1}}(\xi)$ as a quadratic form on $(\mathbb R^n)^*$.
9 In the literature, e.g., [Amari2016, AJLS2017, Borovkov1998], one considers the Fisher metric on a parametrized statistical model, i.e., the metric obtained by pulling back the Fisher metric on P via the parameterization map $p : \Theta \to P$. This “parameterized” Fisher metric may be degenerate.
$\|d\varphi^l_{\hat\sigma}\|^2_{g^{-1}}(\xi)$. Hence the Cramér-Rao inequality in Theorem 4.12 becomes the well-known Cramér-Rao inequality for an unbiased estimator
(4.21)    $V_\xi := V_\xi[\hat\sigma] \ge g^{-1}(\xi).$
4.4. Efficient estimators and MLE.
Definition 4.15. Assume that $P \subset \mathcal P(\mathcal X)$ is a 2-integrable statistical model and $\varphi : P \to \mathbb R^n$ is a feature map. An estimator $\hat\sigma \in L^2_\varphi(P, \mathcal X)$ is called efficient, if the Cramér-Rao inequality for $\hat\sigma$ becomes an equality, i.e., $V^\varphi_\xi[\hat\sigma] = \|d\varphi_{\hat\sigma}\|^2_{g^{-1}}(\xi)$ for any ξ ∈ P.
Then we define the MSE and variance of the $\varphi_k$-estimator $\varphi_k \circ \hat\sigma_k$, which can be estimated using the Cramér-Rao inequality. It turns out that the efficient unbiased estimator w.r.t. MSE is the MLE. The notion of a φ-estimator allows us to define the notion of a consistent sequence of φ-estimators, which formalizes the notion of asymptotically accurate estimators.
we have
$\mu^m\{S \in Z^m : |R^L_\mu(A_{erm}(S)) - R^L_{\mu,H}| \le 2\varepsilon\}$
$\ge \mu^m\{S \in Z^m : R^L_\mu(A_{erm}(S)) \le R^L_\mu(h_\varepsilon) + \varepsilon\}$
$\ge \mu^m\{S \in Z^m : R^L_\mu(A_{erm}(S)) \le \hat R^L_S(h_\varepsilon) + \tfrac\varepsilon2 \ \&\ |\hat R^L_S(h_\varepsilon) - R^L_\mu(h_\varepsilon)| < \tfrac\varepsilon2\}$
$\ge \mu^m\{S \in Z^m : R^L_\mu(A_{erm}(S)) \le \hat R^L_S(h_\varepsilon) + \varepsilon\} - \tfrac\delta2$
$\ge \mu^m\{S \in Z^m : |R^L_\mu(A_{erm}(S)) - \hat R^L_S(A(S))| \le \tfrac\varepsilon2\} - \tfrac\delta2 \ \ge\ 1 - \delta$
13 We refer to [AJLS2017, p. 293] and [Bogachev2007, p. 188] for the definition of $\mu^\infty$.
since $\hat R^L_S(A(S)) \le \hat R^L_S(h_\varepsilon)$. This completes the proof of Lemma 5.5.
Theorem 5.6. (cf. [SSBD2014, Corollary 4.6, p. 57]) Let $(Z, H, L, \mathcal P(Z))$ be a unified learning model. If H is finite and $L(Z \times H) \subset [0, c]$ for some $c < \infty$, then the ERM algorithm is uniformly consistent.
Proof. By Lemma 5.5, it suffices to find for each $(\varepsilon, \delta) \in (0,1) \times (0,1)$ a number $m_H(\varepsilon, \delta)$ such that for all $m \ge m_H(\varepsilon, \delta)$ and for all $\mu \in \mathcal P(Z)$ we have
(5.6)    $\mu^m\Big(\bigcap_{h \in H}\{S \in Z^m : |\hat R^L_S(h) - R^L_\mu(h)| \le \varepsilon\}\Big) \ge 1 - \delta.$
In order to prove (5.6) it suffices to establish the following inequality:
(5.7)    $\sum_{h \in H} \mu^m\{S \in Z^m \mid |\hat R^L_S(h) - R^L_\mu(h)| > \varepsilon\} < \delta.$
Since #H < ∞, it suffices to find $m_H(\varepsilon, \delta)$ such that when $m \ge m_H(\varepsilon, \delta)$ each summand in the RHS of (5.7) is small enough. For this purpose we shall apply the well-known Hoeffding inequality, which specifies the rate of convergence in the weak law of large numbers, see Subsection 11.2.
To apply Hoeffding’s inequality to the proof of Theorem 5.6 we observe that for each h ∈ H the functions
$\theta^h_i(z) := L(h, z) \in [0, c]$
are i.i.d. $\mathbb R$-valued random variables on Z. Furthermore we have for any h ∈ H and $S = (z_1, \cdots, z_m)$
$\hat R^L_S(h) = \frac{1}{m}\sum_{i=1}^m \theta^h_i(z_i), \qquad R^L_\mu(h) = \bar\theta^h.$
Hence the Hoeffding inequality implies
(5.8)    $\mu^m\{S \in Z^m : |\hat R^L_S(h) - R^L_\mu(h)| > \varepsilon\} \le 2\exp(-2m\varepsilon^2 c^{-2}).$
Now plugging
$m \ge m_H(\varepsilon, \delta) := \frac{\log(2\#(H)/\delta)}{2\varepsilon^2 c^{-2}}$
into (5.8) we obtain (5.7). This completes the proof of Theorem 5.6.
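The bound can be made concrete in a few lines; the sketch below computes the sample complexity from (5.8) and checks Hoeffding's inequality by simulation (the uniform losses are an arbitrary choice of mine).

```python
# Sample complexity m_H = c^2 * log(2|H|/delta) / (2 eps^2), plus a Monte Carlo
# check of Hoeffding's inequality for a single hypothesis h.
import numpy as np

def m_H(eps, delta, card_H, c=1.0):
    return int(np.ceil(c ** 2 * np.log(2 * card_H / delta) / (2 * eps ** 2)))

eps, delta = 0.1, 0.05
m = m_H(eps, delta, card_H=1000)
print(m)                                         # sample size sufficient for (5.6)

rng = np.random.default_rng(5)
losses = rng.uniform(0, 1, size=(10_000, m))     # i.i.d. losses in [0, c], mean 0.5
deviations = np.abs(losses.mean(axis=1) - 0.5)
print((deviations > eps).mean(), 2 * np.exp(-2 * m * eps ** 2))  # freq <= bound
```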
Definition 5.7. The function mH : (0, 1)2 → R defined by the requirement
that mH (ε, δ) is the least number for which (5.6) holds is called the sample
complexity of a (unified) learning model (Z, H, L, P ).
Remark 5.8. (1) We have proved that the sample complexity of the algorithm $A_{erm}$ is upper bounded by the sample complexity $m_H$ of (Z, H, L, P).
(2) The definition of the sample complexity $m_H$ in our lecture is different from the definition of the sample complexity in [SSBD2014, Definition 3.1, p. 43], which is equivalent to the notion of the sample complexity $m_A$ of a learning algorithm in our lecture.
Theorem 5.9. Let X be an infinite domain set, Y = {0, 1} and L(0−1) the
0-1 loss function. Then there is no uniformly consistent learning algorithm
on a unified learning model (X × Y, H, L(0−1) , PX (X × Y)).
$\mu^{C[k]}_f := (\Gamma_f)_* \mu^{C[k]}_{\mathcal X} \in \mathcal P_{\mathcal X}(\mathcal X \times \mathcal Y).$
$X_S := \bigcup_{i=1}^n Pr_i(S).$
Note that S being distributed by $\mu^n_f$ means that $S = \{(x_1, f(x_1)), \cdots, (x_n, f(x_n))\}$, so S is essentially distributed by the uniform probability measure $(\mu_{\mathcal X})^n$. Let us compute and estimate the double integral in the LHS of (5.10) using (2.9).
Exercise 5.18. Show that ΓF (n) = n + 1 for the set F of all threshold
functions.
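A brute-force check of Exercise 5.18 (my own sketch): on n distinct points, threshold functions realize exactly n + 1 dichotomies.

```python
# Count the dichotomies that threshold functions x -> 1{x > t} realize on n points.
import numpy as np

def gamma_thresholds(points):
    points = np.sort(points)
    thresholds = np.concatenate(([points[0] - 1.0],
                                 (points[:-1] + points[1:]) / 2,
                                 [points[-1] + 1.0]))
    dichotomies = {tuple((points > t).astype(int)) for t in thresholds}
    return len(dichotomies)

for n in range(1, 8):
    print(n, gamma_thresholds(np.random.rand(n)))   # prints n, n + 1
```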
It follows from Lemma 5.19 that for any $(\varepsilon, \delta) \in (0,1)^2$ there exists m such that
$\frac{4 + \sqrt{\log \Gamma_H(2m)}}{\delta\sqrt{2m}} < \varepsilon$
and therefore by (5.14) for any $(\varepsilon, \delta) \in (0,1)^2$ the value $m_H(\varepsilon, \delta)$ is finite. This completes the proof of Theorem 5.15.
5.4. Conclusions. In this lecture we define the notion of the (uniform) consistency of a learning algorithm A on a unified learning model (Z, H, L, P) and characterize this notion via the sample complexity of A. We relate the consistency of the ERM algorithm to the uniform convergence of the law of large numbers over the parameter space H and use it to prove the uniform consistency of ERM in the binary classification problem $(\mathcal X, H \subset \{0,1\}^{\mathcal X}, L^{(0-1)}, \mathcal P(\mathcal X \times \{0,1\}))$. We show that the finiteness of the VC-dimension of H is a necessary and sufficient condition for the existence of a uniformly consistent learning algorithm on a binary classification problem $(\mathcal X \times \{0,1\}, H \subset \{0,1\}^{\mathcal X}, L^{(0-1)}, \mathcal P_{\mathcal X}(\mathcal X \times \{0,1\}))$.
In the second step, we reduce the problem of estimating an upper bound for the sample complexity $m_H$ to the problem of estimating upper bounds for the sample complexities $m_{D_j}$, where $\{D_j \mid j \in [1, l]\}$ is a cover of H, using the covering number. Namely, we have the following easy inequality:
(6.3)    $\rho^m\{S \in Z^m \mid \sup_{f \in H}|MSE_\rho(f) - MSE_S(f)| \ge \varepsilon\} \le \sum_{j=1}^l \rho^m\{S \in Z^m \mid \sup_{f \in D_j}|MSE_\rho(f) - MSE_S(f)| \ge \varepsilon\}.$
Combining the last relation with (6.3), we derive the following desired upper
estimate for the sample complexity mH .
Proposition 6.4. Assume that for all f ∈ H we have $|f(x) - y| \le M$ ρ-a.e. Then for all ε > 0 we have
$\rho^m\{z \in Z^m : \sup_{f \in H}|MSE_\rho(f) - MSE_z(f)| \le \varepsilon\} \ge 1 - N\Big(H, \frac{\varepsilon}{8M}\Big)\, 2\exp\Big(-\frac{m\varepsilon^2}{4(2\sigma^2 + \frac13 M^2\varepsilon)}\Big)$
where $\sigma^2 = \sup_{f \in H} V_\rho(f_Y^2)$.
This completes the proof of Theorem 6.1.
Exercise 6.5. Let L2 denote the instantaneous quadratic loss function in
(2.12). Derive from Theorem 6.1 an upper bound for the sample complexity
mH (ε, δ) of the learning model (X , H ⊂ Cn (X ), L2 , PB (X ×Rn )), where H is
compact and PB (X × Rn ) is the space of Borel measures on the topological
space X × Rn .
where $\{\sigma_i \in Z_2 \mid i \in [1, n]\}$ and $\mu_{Z_2}$ is the counting measure on $Z_2$, see (5.9). If S is distributed according to a probability measure $\mu^n$ on $Z^n$, then the Rademacher complexity of G w.r.t. µ is given by
$R_{n,\mu}(G) := E_{\mu^n}[\hat R_S(G)].$
The Rademacher n-complexity $R_n(Z, H, L, \mu)$ (resp. the Rademacher empirical n-complexity $\hat R_S(Z, H, L)$) is defined to be the complexity $R_{n,\mu}(G^L_H)$ (resp. the empirical complexity $\hat R_S(G^L_H)$), where $G^L_H$ is the family associated to the model (Z, H, L, µ) by (6.4).
Example 6.8. Let us consider a learning model $(\mathcal X \times Z_2, H \subset (Z_2)^{\mathcal X}, L^{(0-1)}, \mu)$. For a sample $S = \{(x_1, y_1), \cdots, (x_m, y_m)\} \in (\mathcal X \times Z_2)^m$ denote by Pr(S) the sample $(x_1, \cdots, x_m) \in \mathcal X^m$. We shall show that
(6.5)    $\hat R_S(G^{(0-1)}_H) = \frac12 \hat R_{Pr(S)}(H).$
Using the identity
$L^{(0-1)}(x, y, h) = 1 - \delta_{y, h(x)} = \frac12\big(1 - y\, h(x)\big)$
we compute
$\hat R_S(G^{(0-1)}_H) = E_{(\mu_{Z_2})^m}\Big[\sup_{h \in H} \frac1m \sum_{i=1}^m \sigma_i\big(1 - \delta_{y_i, h(x_i)}\big)\Big]$
$= E_{(\mu_{Z_2})^m}\Big[\sup_{h \in H} \frac1m \sum_{i=1}^m \sigma_i\, \frac{1 - y_i h(x_i)}{2}\Big]$
$\overset{E\sigma_i = 0}{=} \frac12\, E_{(\mu_{Z_2})^m}\Big[\sup_{h \in H} \frac1m \sum_{i=1}^m -\sigma_i y_i h(x_i)\Big]$
$= \frac12\, E_{(\mu_{Z_2})^m}\Big[\sup_{h \in H} \frac1m \sum_{i=1}^m \sigma_i h(x_i)\Big] = \frac12 \hat R_{Pr(S)}(H),$
as required.
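The empirical Rademacher complexity can be estimated by Monte Carlo over σ; the finite class of threshold classifiers below is a toy assumption of mine.

```python
# Monte Carlo estimate of the empirical Rademacher complexity R_S(H) for a
# finite class H of {-1, +1}-valued threshold classifiers.
import numpy as np

rng = np.random.default_rng(6)
x = np.sort(rng.uniform(0, 1, size=20))                    # the sample Pr(S)
H = np.array([np.where(x > t, 1, -1) for t in np.linspace(0, 1, 11)])

def rademacher(H, trials=20_000):
    m = H.shape[1]
    sigma = rng.choice([-1, 1], size=(trials, m))
    # sup over h of (1/m) sum_i sigma_i h(x_i), averaged over sigma draws
    return np.mean(np.max(sigma @ H.T / m, axis=1))

print(rademacher(H))   # small, and it shrinks as the sample size m grows
```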
We have the following relation between the empirical Rademacher complexity and the Rademacher complexity, using the McDiarmid concentration inequality, see (11.4) and [MRT2012, (3.14), p. 36]:
(6.6)    $\mu^n\Big\{S \in Z^n \,\Big|\, R_{n,\mu}(G^L_H) \le \hat R_S(G^L_H) + \sqrt{\frac{\ln(2/\delta)}{2n}}\Big\} \ge 1 - \delta/2.$
Theorem 6.9. (see e.g. [SSBD2014, Theorems 26.3, 26.5, p. 377-378]) Assume that (Z, H, L, µ) is a learning model with |L(z, h)| < c for all z ∈ Z and all h ∈ H. Then for any δ > 0 and any h ∈ H we have
(6.7)    $\mu^n\Big\{S \in Z^n \,\Big|\, R^L_\mu(h) - \hat R^L_S(h) \le R_{n,\mu}(G^L_H) + c\sqrt{\frac{2\ln(2/\delta)}{n}}\Big\} \ge 1 - \delta,$
(6.8)    $\mu^n\Big\{S \in Z^n \,\Big|\, R^L_\mu(h) - \hat R^L_S(h) \le \hat R_S(G^L_H) + 4c\sqrt{\frac{2\ln(4/\delta)}{n}}\Big\} \ge 1 - \delta,$
(6.9)    $\mu^n\Big\{S \in Z^n \,\Big|\, R^L_\mu(A_{erm}(S)) - R^L_\mu(h) \le 2\hat R_S(G^L_H) + 5c\sqrt{\frac{2\ln(8/\delta)}{n}}\Big\} \ge 1 - \delta.$
From (6.10), using the Markov inequality, we obtain the following bound for the sample complexity $m_{A_{erm}}$ in terms of the Rademacher complexity:
(6.11)    $\mu^n\Big\{S \in Z^n \,\Big|\, R^L_\mu(A_{erm}) - R^L_{\mu,H} \le \frac{2 R_{n,\mu}(G^L_H)}{\delta}\Big\} \ge 1 - \delta.$
Remark 6.10. (1) The first two assertions of Theorem 6.9 give an upper bound for a “half” of the sample complexity $m_H$ of a unified learning model (Z, H, L, µ) by the (empirical) Rademacher complexity $R_n(G)$ of the associated family G. The last assertion of Theorem 6.9 is derived from the second assertion and the Hoeffding inequality.
(2) For the binary classification problem $(\mathcal X \times \{0,1\}, H \subset \{0,1\}^{\mathcal X}, L^{(0-1)}, \mathcal P(\mathcal X \times \{0,1\}))$ there exists a close relationship between the Rademacher complexity and the growth function $\Gamma_H(m)$, see [MRT2012, Lemma 3.1, Theorem 3.2, p. 37] for a detailed discussion.
Recall that $R^L_{\mu,H} := \inf_{h \in H} R^L_\mu(h)$ quantifies the optimal performance of a learner in H. Then we decompose the difference between the expected risk of a predictor h ∈ H and the Bayes risk as follows:
(6.12)    $R^L_\mu(h) - R^L_{b,\mu} = (R^L_\mu(h) - R^L_{\mu,H}) + (R^L_{\mu,H} - R^L_{b,\mu}).$
The first term in the RHS of (6.12) is called the estimation error of h, cf.
(6.1), and the second term is called the approximation error. If h = Aerm (S)
is a minimizer of the empirical risk R̂SL , then the estimation error of Aerm (S)
is also called the sample error [CS2001, p. 9].
The approximation error quantifies how well the hypothesis class H is suited for the problem under consideration. The estimation error measures how well the hypothesis h performs relative to the best hypotheses in H. Typically, the approximation error decreases when enlarging H but the sample error increases, as demonstrated by the No-Free-Lunch Theorem 5.9, because P must be enlarged as H is enlarged.
Example 6.11. (cf. [Vapnik2000, p. 19], [CS2001, p. 4, 5]) Let us compute the error decomposition of a discriminative model $(\mathcal X \times \mathbb R, H \subset \mathbb R^{\mathcal X}, L_2, \mathcal P_B(\mathcal X \times \mathbb R))$ for the regression problem. Let $\pi : \mathcal X \times \mathbb R \to \mathcal X$ denote the natural projection. Then the measure ρ, the push-forward measure $\pi_*(\rho) \in \mathcal P(\mathcal X)$ and the conditional probability measure ρ(y|x) on each fiber $\pi^{-1}(x) = \mathbb R$ are related as follows:
$\int_{\mathcal X \times \mathbb R} \varphi(x, y)\, d\rho = \int_{\mathcal X}\Big(\int_{\mathbb R} \varphi(x, y)\, d\rho(y|x)\Big) d\pi_*(\rho)$
for $\varphi \in L^1(\rho)$, similar to the Fubini theorem, see Subsection 10.2.5.
Let us compute the Bayes risk $R^L_{b,\mu}$ for $L = L_2$. The maximal subspace $H_{L,\mu}$ where the expected loss $R^L_\mu$ is well defined is the space $L^2(\mathcal X, \pi_*(\rho))$. We claim that the regression function of ρ, see Exercise 2.7,
$r_\rho(x) := E_\rho(i_2(Y)|X = x) = \int_{\mathbb R} y\, d\rho(y|x)$
minimizes the $MSE_{\pi_*(\rho)}$ defined on the space $L^2(\mathcal X, \pi_*(\rho))$. Indeed, for any $f \in L^2(\mathcal X, \pi_*(\rho))$ we have
(6.13)    $MSE_{\pi_*(\rho)}(f) = \int_{\mathcal X} (f(x) - r_\rho(x))^2\, d\pi_*(\rho) + MSE_{\pi_*(\rho)}(r_\rho).$
Equation (6.13) implies that the Bayes risk $R^L_{b,\pi_*(\rho)}$ is $MSE_{\pi_*(\rho)}(r_\rho)$. It follows that the approximation error of a hypothesis class H equals
$MSE_{\pi_*(\rho)}(f_{min}) = \int_{\mathcal X} (f_{min} - r_\rho(x))^2\, d\pi_*(\rho) + MSE_{\pi_*(\rho)}(r_\rho),$
where $f_{min}$ is a minimizer of $MSE_{\pi_*(\rho)}$ on H.
function $\hat R^{(0-1)}_S$. Thus, given S, we need to find a strategy for selecting one of these ERMs, or equivalently for selecting a separating hyperplane $H_{(w,b)}$, since the associated half-space $H^+_{(w,b)}$ is defined by $H_{(w,b)}$ and any training value $(x_i, y_i)$. The standard approach in the SVM framework is to choose $H_{(w,b)}$ that maximizes the distance to the closest points $x_i \in [Pr(S)]$. This approach is called the hard SVM rule. To formulate the hard SVM rule we need a formula for the distance of a point to a hyperplane $H_{(w,b)}$.
Lemma 7.4 (Distance to a hyperplane). Let V be a real Hilbert space and $H_{(w,b)} := \{z \in V \mid \langle z, w\rangle + b = 0\}$. The distance of a point x ∈ V to $H_{(w,b)}$ is given by
(7.2)    $\rho(x, H_{(w,b)}) := \inf_{z \in H_{(w,b)}} \|x - z\| = \frac{|\langle x, w\rangle + b|}{\|w\|}.$
Proof. Since $H_{(w,b)} = H_{(w,b)/\lambda}$ for all λ > 0, it suffices to prove (7.2) for the case $\|w\| = 1$, and hence we can assume that $w = e_1$. Now formula (7.2) follows immediately, noting that $H_{(e_1,b)} = H_{(e_1,0)} - b e_1$.
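A quick numeric check of (7.2), with an arbitrary hyperplane and point of my choosing; the brute-force minimum over a parameterization of $H_{(w,b)}$ agrees with the closed form.

```python
# Verify the distance formula (7.2) against a brute-force minimum over H_(w,b).
import numpy as np

w, b = np.array([3.0, 4.0]), 1.0          # H = {z : <z, w> + b = 0}
x = np.array([2.0, 1.0])

closed_form = abs(x @ w + b) / np.linalg.norm(w)          # (7.2)

t = np.linspace(-10, 10, 200001)
z = np.stack([t, (-b - 3.0 * t) / 4.0], axis=1)           # points of H
print(closed_form, np.min(np.linalg.norm(x - z, axis=1))) # both 2.2
```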
Let $H_{(w,b)}$ separate $S = \{(x_1, y_1), \cdots, (x_m, y_m)\}$ correctly. Then we have
$y_i = \operatorname{sign}(\langle x_i, w\rangle + b) \implies |\langle x_i, w\rangle + b| = y_i(\langle x_i, w\rangle + b).$
Hence, by Lemma 7.4, the distance between $H_{(w,b)}$ and S is
(7.3)    $\rho(S, H_{(w,b)}) := \min_i \rho(x_i, H_{(w,b)}) = \frac{\min_i y_i(\langle x_i, w\rangle + b)}{\|w\|}.$
The distance $\rho(S, H_{(w,b)})$ is also called the margin of a hyperplane $H_{(w,b)}$ w.r.t. S. The hyperplanes that are parallel to the separating hyperplane and pass through the closest points on the negative or positive sides are called marginal.
Denote by $H_S$ the subset of $\pi(H_{lin}) = H_A(V)$ that consists of hyperplanes separating S. Set
(7.4)    $A^*_{hs}(S) := \arg\max_{H_{(w,b)'} \in H_S} \rho(S, H_{(w,b)'}).$
Proof. If $H_{(w,b)}$ separates S then $\rho(S, H_{(w,b)}) = \min_i y_i(\langle w, x_i\rangle + b)$. Since the constraint $\|w\| \le 1$ does not affect $H_{(w,b)}$, which is invariant under a positive rescaling, (7.3) implies that
(7.6)    $\max_{(w,b): \|w\| \le 1}\ \min_i y_i(\langle w, x_i\rangle + b) \ge \max_{H_{(w,b)'} \in H_S} \rho(S, H_{(w,b)'}).$
produces a solution $(w, b) := (w_0/\|w_0\|, b_0/\|w_0\|)$ of the optimization problem (7.5).
Proof. Let $(w_0, b_0)$ be a solution of (7.7). We shall show that $(w_0/\|w_0\|, b_0/\|w_0\|)$ is a solution of (7.5). It suffices to show that the margin of the hyperplane $H_{(w_0,b_0)}$ is greater than or equal to the margin of the hyperplane associated to a (and hence any) solution of (7.5).
Let $(w^*, b^*)$ be a solution of Equation (7.5). Set
$\gamma^* := \min_i y_i(\langle w^*, x_i\rangle + b^*),$
which is the margin of the hyperplane $H_{(w^*,b^*)}$ by (7.3). Therefore for all i we have
$y_i(\langle w^*, x_i\rangle + b^*) \ge \gamma^*,$
or equivalently
$y_i\Big(\Big\langle \frac{w^*}{\gamma^*}, x_i\Big\rangle + \frac{b^*}{\gamma^*}\Big) \ge 1.$
Hence the pair $\big(\frac{w^*}{\gamma^*}, \frac{b^*}{\gamma^*}\big)$ satisfies the condition of the quadratic optimization problem in (7.7). It follows that
$\|w_0\| \le \Big\|\frac{w^*}{\gamma^*}\Big\| = \frac{1}{\gamma^*}.$
7.2. Soft SVM. Now we consider the case when the sample set S is not separable. There are at least two possibilities to overcome this difficulty. The first one is to find a nonlinear embedding of patterns into a high-dimensional space. To realize this approach we use a kernel trick that embeds the patterns in an infinite dimensional Hilbert space, which we shall learn in the next lecture. The second way is to seek a predictor $\operatorname{sign} f_{(w,b)}$ such that $H_{(w,b)} = f^{-1}_{(w,b)}(0)$ still has maximal margin in some sense. More precisely, we shall relax the hard SVM rule (7.7) by replacing the constraint
(7.8)    $y_i(\langle w, x_i\rangle + b) \ge 1$
by the relaxed constraint
(7.9)    $y_i(\langle w, x_i\rangle + b) \ge 1 - \xi_i,$
where $\xi_i \ge 0$ are called the slack variables. Slack variables are commonly used in optimization to define relaxed versions of some constraints. In our case a slack variable $\xi_i$ measures the amount by which the vector $x_i$ violates the original inequality (7.8).
The relaxed hard SVM rule is called the soft SVM rule.
Definition 7.10. The soft SVM algorithm $A_{ss} : (\mathbb R^d \times Z_2)^m \to H_{lin}$ with slack variables $\xi \in \mathbb R^m_{\ge 0}$ is defined as follows.
The loss function for the soft SVM learning machine is the hinge loss function $L^{hinge} : H_{lin} \times (V \times \{\pm1\}) \to \mathbb R$ defined as follows:
(7.12)    $L^{hinge}(h_{(w,b)}, (x, y)) := \max\{0,\, 1 - y(\langle w, x\rangle + b)\}.$
Hence the empirical hinge risk function is defined as follows for $S = \{(x_1, y_1), \cdots, (x_m, y_m)\}$:
$R^{hinge}_S(h_{(w,b)}) = \frac1m \sum_{i=1}^m \max\{0,\, 1 - y_i(\langle w, x_i\rangle + b)\}.$
Lemma 7.11. Equation (7.10) with constraint (7.11) for $A_{ss}$ is equivalent to the following regularized risk minimization problem, which does not depend on the slack variables ξ:
Proof. Let us fix $(w_0, b_0)$ and minimize the RHS of (7.10) under the constraint (7.11). It is straightforward to see that $\xi_i = L^{hinge}(h_{(w,b)}, (x_i, y_i))$. Using this and comparing (7.10) with (7.13), we complete the proof of Lemma 7.11.
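As a hedged illustration, the sketch below minimizes the regularized hinge risk of (7.13) by subgradient descent rather than by solving the quadratic program of Definition 7.10; the data and hyperparameters are arbitrary choices of mine.

```python
# Subgradient descent on lambda*||w||^2 + (1/m) sum_i max{0, 1 - y_i(<w,x_i>+b)}.
import numpy as np

rng = np.random.default_rng(7)
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.hstack([-np.ones(50), np.ones(50)])
w, b, lam, lr = np.zeros(2), 0.0, 0.01, 0.05

for _ in range(500):
    margins = y * (X @ w + b)
    active = margins < 1                     # points with nonzero hinge loss
    gw = 2 * lam * w - (y[active, None] * X[active]).sum(axis=0) / len(X)
    gb = -y[active].sum() / len(X)
    w, b = w - lr * gw, b - lr * gb

print(np.mean(np.sign(X @ w + b) == y))      # training accuracy near 1.0
```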
Remark 7.13. The hinge loss function $L^{hinge}$ enjoys several good properties that justify the preference for $L^{hinge}$ as a loss function over the zero-one loss function $L^{(0-1)}$, see [SSBD2014, Subsection 12.3, p. 167] for a discussion.
7.3. Sample complexities of SVM.
Exercise 7.14. Prove that $VC\dim H_{lin} = \dim V + 1$.
From the Fundamental Theorem of binary classification 5.15 and Exercise 7.14 we obtain immediately the following
Proposition 7.15. The binary classification problem $(V \times Z_2, H_{lin}, L^{(0-1)}, \mathcal P(V \times Z_2))$ has a uniformly consistent learning algorithm if and only if V is finite dimensional.
In what follows we shall show the uniform consistency of hard SVM and soft SVM, replacing the statistical model $\mathcal P(\mathbb R^d \times Z_2)$ by a smaller class.
Definition 7.16. ([SSBD2014, Definition 15.3]) Let µ be a distribution on $V \times Z_2$. We say that µ is separable with a (γ, ρ)-margin if there exists $(w^*, b^*) \in V \times \mathbb R$ such that $\|w^*\| = 1$ and such that
$\mu\{(x, y) \in V \times Z_2 \mid y(\langle w^*, x\rangle + b^*) \ge \gamma \text{ and } \|x\| \le \rho\} = 1.$
Similarly, we say that µ is separable with a (γ, ρ)-margin using a homogeneous half-space if the preceding holds with a half-space defined by a vector $(w^*, 0)$.
Definition 7.16 means that the set of labeled pairs $(x, y) \in V \times Z_2$ that satisfy the condition
$y(\langle w^*, x\rangle + b^*) \ge \gamma \text{ and } \|x\| \le \rho$
has full µ-measure, where µ is a separable measure with a (γ, ρ)-margin.
Remark 7.17. (1) We regard an affine function $f_{(w,b)} : V \to \mathbb R$ as a linear function $f_{w'} : V' \to \mathbb R$, where $V' = \langle e_1\rangle_{\mathbb R} \oplus V$, i.e., we incorporate the bias term b of $f_{(w,b)}$ in (7.1) into the term w as an extra coordinate. More precisely, we set
$w' := b e_1 + w \quad\text{and}\quad x' := e_1 + x.$
Then
$f_{(w,b)}(x) = f_{w'}(x').$
Note that the natural projection of the zero set $f^{-1}_{w'}(0) \subset V'$ to V is the zero set $H_{(w,b)}$ of $f_{(w,b)}$.
(2) By Remark 7.17 (1), we can always assume that a separable measure with a (γ, ρ)-margin is one that uses a homogeneous half-space, by enlarging the instance space V.
Denote by P(γ,ρ) (V ×Z2 ) the subset of P(V ×Z2 ) that consists of separable
measures with a (γ, ρ)-margin using a homogeneous half-space. Using the
Rademacher complexity, see [SSBD2014, Theorem 26.13, p. 384], we have
Exercise 7.20. Using the Markov inequality, derive from Theorem 7.19 an upper bound for the sample complexity of the soft SVM.
Theorem 7.19 and Exercise 7.20 imply that we can control the sample complexity of a soft SVM algorithm as a function of norms in the underlying Hilbert space V, independently of the dimension of V. This becomes highly significant when we learn classifiers $h : V \to Z_2$ via embeddings into high dimensional feature spaces.
8.1. Kernel trick. It is known that a solution of a hard SVM can be expressed as a linear combination of support vectors (Exercise 7.9). If the number of support vectors is less than the dimension of the instance space, then this property simplifies the search for a solution of the hard SVM. Below we shall show that this property is a consequence of the Representer Theorem concerning solutions of a special optimization problem. The optimization problem we are interested in is of the following form:
(8.2)    $w_0 = \arg\min_{w \in W}\Big[f\big(\langle w, \psi(x_1)\rangle, \cdots, \langle w, \psi(x_m)\rangle\big) + R(\|w\|)\Big].$
Setting
$f(a_1, \cdots, a_m) := \begin{cases} 0 & \text{if } y_i a_i \ge 1 \text{ for all } i,\\ \infty & \text{otherwise,}\end{cases}$
we obtain Equation (7.7) of hard SVM for homogeneous vectors (w, 0), replacing $a_i$ by $\langle w, x_i\rangle$. The general case of non-homogeneous solutions (w, b) is reduced to the homogeneous case by Remark 7.17.
(2) Plugging into Equation (8.2)
$R(a) := \lambda a^2, \qquad f(a_1, \cdots, a_m) := \frac1m \sum_{i=1}^m \max\{0,\, 1 - y_i a_i\},$
we obtain Equation (7.13) of soft SVM for a homogeneous solution $A_{ss}(S)$, identifying $A_{ss}(S)$ with its parameter (w, 0), S with $\{(x_1, y_1), \cdots, (x_m, y_m)\}$, and replacing $a_i$ with $\langle w, x_i\rangle$.
Theorem 8.2 (Representer Theorem). Let $\psi : \mathcal X \to W$ be a feature mapping from an instance space X to a Hilbert space W and $w_0$ a solution of (8.2). Then the projection of $w_0$ to the subspace $\langle\psi(x_1), \cdots, \psi(x_m)\rangle_{\mathbb R}$ in W is also a solution of (8.2).
Proof. Assume that $w_0$ is a solution of (8.2). Then we can write
$w_0 = \sum_{i=1}^m \alpha_i \psi(x_i) + u$
(8.5)    $\arg\min_{\alpha \in \mathbb R^m} f\Big(\sum_{j=1}^m \alpha_j G_{j1}, \cdots, \sum_{j=1}^m \alpha_j G_{jm}\Big) + R\Big(\sqrt{\sum_{i,j=1}^m \alpha_i \alpha_j G_{ji}}\Big).$
Recall that the solution $w_0 = \sum_{j=1}^m \alpha_j \psi(x_j)$ of the hard (resp. soft) SVM optimization problem, where $(\alpha_1, \cdots, \alpha_m)$ is a solution of (8.5), produces a “nonlinear” classifier $\hat w_0 : \mathcal X \to Z_2$ associated to $w_0$ as follows:
$\hat w_0(x) := \operatorname{sign} w_0(x),$
where
(8.6)    $w_0(x) := \langle w_0, \psi(x)\rangle = \sum_{j=1}^m \alpha_j \langle\psi(x_j), \psi(x)\rangle = \sum_{j=1}^m \alpha_j K(x_j, x).$
To compute (8.6) we need to know only the kernel function K, and neither the mapping ψ nor the inner product $\langle\cdot,\cdot\rangle$ on the Hilbert space W.
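The following sketch evaluates the classifier (8.6) from kernel values alone; the coefficients α below are placeholders standing in for a solution of (8.5), and the kernel and points are my own choices.

```python
# Kernel evaluation of the classifier (8.6): only K(x_j, x) is needed, not psi.
import numpy as np

def K(x, z, degree=2):                        # polynomial PSD kernel
    return (1.0 + x @ z) ** degree

x_train = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
alpha = np.array([0.5, -0.3, 0.2])            # hypothetical solution of (8.5)

def w0(x):
    return sum(a * K(xj, x) for a, xj in zip(alpha, x_train))

print(np.sign(w0(np.array([0.5, 0.5]))))      # the "nonlinear" classifier's output
```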
This motivates the following question.
Problem 8.3. Find a sufficient and necessary condition for a kernel func-
tion, also called a kernel, K : X × X → R such that K can be written as
K(x, x0 ) = hψ(x), ψ(x0 )i for a feature mapping ψ : X → W , where W is a
real Hilbert space.
Definition 8.4. If K satisfies the condition in Problem 8.3 we shall say
that K is generated by a (feature) mapping ψ. The target Hilbert space is
also called a feature space.
8.2. PSD kernels and reproducing kernel Hilbert spaces.
Set
$W := \Big\{f \in \mathbb R^{\mathcal X} \,\Big|\, f = \sum_{i=1}^{N(f)} a_i K_{x_i},\ a_i \in \mathbb R \text{ and } N(f) < \infty\Big\}.$
The PSD property of K implies that the inner product is positive semi-definite, i.e.,
$\Big\langle \sum_i \alpha_i K_{x_i},\ \sum_j \alpha_j K_{x_j}\Big\rangle \ge 0.$
Since the inner product is positive semi-definite, the Cauchy-Schwarz inequality implies for f ∈ W and x ∈ X
(8.7)    $\langle f, K_x\rangle^2 \le \langle f, f\rangle\langle K_x, K_x\rangle.$
Since for all x, y we have $K_y(x) = K(y, x) = \langle K_y, K_x\rangle$, it follows that for all f ∈ W we have
(8.8)    $f(x) = \langle f, K_x\rangle.$
Using (8.8), we obtain from (8.7) for all x ∈ X
$|f(x)|^2 \le \langle f, f\rangle K(x, x).$
This proves that the inner product on W is positive definite and hence W is a pre-Hilbert space. Let H be the completion of W. The map $x \mapsto K_x$ is the desired mapping from X to H. This completes the proof of Theorem 8.6.
Example 8.7. (1) (Polynomial kernels). Assume that P is a polynomial in one variable with non-negative coefficients. Then the polynomial kernel of the form $K : \mathbb R^d \times \mathbb R^d \to \mathbb R$, $(x, y) \mapsto P(\langle x, y\rangle)$ is a PSD kernel. This follows from the observations that if $K_i : \mathcal X \times \mathcal X \to \mathbb R$, i = 1, 2, are PSD kernels then $(K_1 + K_2)(x, y) := K_1(x, y) + K_2(x, y)$ is a PSD kernel, and $(K_1 \cdot K_2)(x, y) := K_1(x, y) \cdot K_2(x, y)$ is a PSD kernel. In particular, $K(x, y) := (1 + \langle x, y\rangle)^2$ is a PSD kernel.
(2) (Exponential kernel). For any γ > 0 the kernel $K(x, y) := \exp(\gamma \cdot \langle x, y\rangle)$ is a PSD kernel, since it is the limit of polynomials in $\langle x, y\rangle$ with non-negative coefficients.
Exercise 8.8. (1) Show that the Gaussian kernel $K(x, y) := \exp(-\frac{\gamma}{2}\|x - y\|^2)$ is a PSD kernel.
(2) Let X = B(0, 1) be the open ball of radius 1 centered at the origin $0 \in \mathbb R^d$. Show that $K(x, y) := (1 - \langle x, y\rangle)^{-p}$ is a PSD kernel for any p > 0.
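An empirical sanity check (not a proof) for Exercise 8.8 (1): Gram matrices of the Gaussian kernel have no negative eigenvalues; the data and γ are arbitrary.

```python
# Check that a Gaussian-kernel Gram matrix is positive semi-definite.
import numpy as np

gamma = 0.5
rng = np.random.default_rng(8)
X = rng.normal(size=(30, 3))
sq = ((X[:, None] - X[None]) ** 2).sum(-1)    # pairwise ||x - y||^2
G = np.exp(-gamma / 2 * sq)                   # Gram matrix of the kernel
print(np.linalg.eigvalsh(G).min() >= -1e-10)  # True: all eigenvalues >= 0
```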
Hence H is a RKHS.
To show the uniqueness of a RKHS H such that K is the reproducing kernel of H, we assume that there exists another RKHS H′ such that for all x, y ∈ X there exist $k_x, k_y \in H'$ with the following properties:
$K(x, y) = \langle k_x, k_y\rangle \quad\text{and}\quad f(x) = \langle f, k_x\rangle \text{ for all } f \in H'.$
15In other words, the vector structure on H is induced from the vector structure on R
via the evaluation map.
8.4. Conclusion. In this section we learned the kernel trick, which simplifies solving the hard SVM and soft SVM optimization problems using an embedding of patterns into a Hilbert space. The kernel trick is based on the theory of RKHS. The main difficulty of the kernel method is that we still have no general method of selecting a suitable kernel for a concrete problem. Another open problem is to improve the upper bound for the sample complexity of SVM algorithms, i.e., to find new conditions on µ ∈ P such that the sample complexity of $A_{hk}$, $A_{sk}$ computed w.r.t. µ is bounded.
9. Neural networks
In the last lecture we examined kernel based SVMs, which are generalizations of linear classifiers, also called perceptrons.
Today we shall examine other generalizations of linear classifiers: artificial neural networks, shortened as neural networks (non-artificial neural networks are called biological neural networks). The idea behind neural networks is that many neurons can be joined together by communication links to carry out complex computations. Neural networks achieve outstanding performance on many important problems in computer vision, speech recognition, and natural language processing.
Note that under “a neural network” one may think of a computing device, a learning model, a hypothesis class of a learning model, or a class of (sequences of) multivariable functions.
In today’s lecture we shall investigate several types of neural networks and their expressive power, i.e., the class of functions that can be realized as elements in a hypothesis class of a neural network. In the next lecture we shall discuss the current learning algorithm on neural networks: stochastic gradient descent.
• The i-th input node gives the output $x_i$. If the input space is $\mathbb R^n$ then we have n + 1 input nodes, one of them being the “constant” neuron, whose output is 1.
• There is a neuron in the hidden layer that has no incoming edges. This neuron outputs the constant σ(0).
• A feedforward neural network (FNN) has an underlying acyclic directed graph. Each FNN (E, V, w, σ) represents a multivariate function $h_{V,E,\sigma,w} : \mathbb R^n \to \mathbb R^m$ which is obtained by composing the computing instruction of each neuron on directed paths from input neurons to output neurons; a minimal forward pass is sketched after this list. For each architecture (V, E, σ) of a FNN we denote by
$H_{V,E,\sigma} = \{h_{V,E,\sigma,w} : w \in \mathbb R^E\}$
the underlying hypothesis class of functions from the input space to the output space of the network.
• A recurrent neural network (RNN) has an underlying directed graph with a cycle. By unrolling its cycles in time, a RNN defines a map $r : \mathbb N^+ \to \{FNN\}$ such that $[r(n)] \subset [r(n+1)]$, where [r(n)] is the underlying graph of r(n), see [GBC2016, §10.2, p. 368] and [Graves2012, §3.2, p. 22]. Thus a RNN can be regarded as a sequence of multivariate functions which serves in a discriminative model for supervised sequence labelling.
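Here is the minimal forward pass referred to above, for a fully connected architecture with sigmoid activation; the weights are random placeholders of mine, not a trained network.

```python
# Forward pass of h_{V,E,sigma,w}: compose affine maps with the activation.
import numpy as np

sigma = lambda a: 1.0 / (1.0 + np.exp(-a))       # sigmoid activation

def forward(x, weights, biases):
    for W, b in zip(weights[:-1], biases[:-1]):
        x = sigma(W @ x + b)                     # hidden layers
    return weights[-1] @ x + biases[-1]          # linear output layer

rng = np.random.default_rng(9)
sizes = [3, 5, 5, 2]                             # n = 3 inputs, m = 2 outputs
weights = [rng.normal(size=(sizes[i + 1], sizes[i])) for i in range(3)]
biases = [rng.normal(size=sizes[i + 1]) for i in range(3)]
print(forward(rng.normal(size=3), weights, biases))
```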
Then $\{g_i \mid i \in [1, k]\}$ are linear classifiers and therefore can be implemented by the neurons in $V_1$. Now set
$f(x) := \operatorname{sign}\Big(\sum_{i=1}^k g_i(x) + k - 1\Big),$
which is also a linear classifier. This completes the proof of Proposition 9.4.
In the general case we have the following Universal Approximation Theorem, see e.g. [Haykin2008, p. 167].
Theorem 9.5. Let φ be a nonconstant, bounded and monotone increasing continuous function. For any $m \in \mathbb N$, ε > 0 and any function $F \in C^0([0,1]^m)$ there exist an integer $m_1 \in \mathbb N$ and constants $\alpha_i, b_i, w_{ij}$, where $i \in [1, m_1]$, $j \in [1, m]$, such that
$f(x_1, \cdots, x_m) := \sum_{i=1}^{m_1} \alpha_i\, \varphi\Big(\sum_{j=1}^m w_{ij} x_j + b_i\Big)$
satisfies $|f(x) - F(x)| < \varepsilon$ for all $x \in [0,1]^m$.
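A hedged illustration of Theorem 9.5: with random inner weights held fixed, the outer coefficients $\alpha_i$ of a one-hidden-layer sigmoid network are fitted by least squares to approximate $F(x) = \sin(2\pi x)$ on [0, 1]; all numerical choices are ad hoc.

```python
# Approximate F with a random-feature sigmoid expansion and least squares.
import numpy as np

rng = np.random.default_rng(10)
m1 = 50                                          # number of hidden units
w, b = rng.normal(size=m1) * 20, rng.normal(size=m1) * 10
phi = lambda a: 1.0 / (1.0 + np.exp(-a))

x = np.linspace(0, 1, 200)
F = np.sin(2 * np.pi * x)
Phi = phi(np.outer(x, w) + b)                    # 200 x m1 design matrix
alpha, *_ = np.linalg.lstsq(Phi, F, rcond=None)
print(np.max(np.abs(Phi @ alpha - F)))           # small uniform error
```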
It follows from Proposition 9.6 that the sample complexity of the ERM algorithm for $(V \times Z_2, H_{V,E,\operatorname{sign}}, L^{(0-1)}, \mathcal P(V \times Z_2))$ is finite. But the running time of the ERM algorithm in a neural network $H_{V,E,\operatorname{sign}}$ is non-polynomial, and therefore it is impractical to use it [SSBD2014, Theorem 20.7, p. 276]. The solution is to use stochastic gradient descent, which we shall learn in the next lecture.
9.3.2. Neural networks for the regression problem. In neural networks with hypothesis class $H_{V,E,\sigma}$ for regression problems one often chooses the activation function σ to be the sigmoid function $\sigma(a) := (1 + e^{-a})^{-1}$, and the loss function L to be $L_2$, i.e., $L(x, y, h_w) := \frac12\|h_w(x) - y\|^2$ for $h_w \in H_{V,E,\sigma}$ and $x \in \mathcal X = \mathbb R^n$, $y \in \mathcal Y = \mathbb R^m$.
9.3.3. Neural networks for generative models in supervised learning. In generative models of supervised learning we need to estimate the conditional distribution p(t|x). In many regression problems p is chosen as follows [Bishop2006, (5.12), p. 232]:
(9.3)    $p(t|x) = \mathcal N(t \mid y(x, w), \beta^{-1}) = \sqrt{\frac{\beta}{2\pi}}\, \exp\Big(-\frac{\beta}{2}\big(t - y(x, w)\big)^2\Big),$
where β is an unknown parameter and y(x, w) is the expected value of t. Thus the learning model is of the form (X, H, L, P), where H := {y(x, w)} is parameterized by the parameters w, β, and a statistical model P is a subset of
Since $\mu_B(A|x)$ is a B-measurable function on X for each A, by [Bogachev2007, Theorem 2.12.3, p. 144] on B-measurable functions, we have for any $E \in \Sigma_0$
(10.8)    $\mu(A \cap \pi^{-1}(E)) = \int_E \mu_y(A)\, d\pi_*(\mu)(y).$
Then the density f(x|y) is called the conditional density of ξ subject to the condition that η = y.
Theorem 10.10. ([Borovkov1998, Theorem 2, §20, p. 108]) If the joint distribution of ξ and η in $X \times Y$ has density function f(x, y) w.r.t. the product of measures $\mu \in \mathcal P(X)$ and $\lambda \in \mathcal P(Y)$, then the function
$f(x|y) := \frac{f(x, y)}{q(y)}, \quad\text{where } q(y) = \int f(x, y)\, d\mu(x),$
is the conditional density of ξ given η = y, and the function q(y) is the density of η w.r.t. the measure λ.
10.2.5. Conditional measures and transition measures. Regular conditional measures in Definition 10.7 are examples of transition measures, for which we shall have a generalized version of the Fubini theorem (Theorem 10.12).
Definition 10.11. ([Bogachev2007, Definition 10.7.1, p. 384]) Let $(X_1, \mathcal B_1)$ and $(X_2, \mathcal B_2)$ be a pair of measurable spaces. A transition measure for this pair is a function $P(\cdot|\cdot) : X_1 \times \mathcal B_2 \to \mathbb R$ with the following properties:
(i) for every fixed $x \in X_1$ the function $B \mapsto P(x|B)$ is a measure on $\mathcal B_2$;
(ii) for every fixed $B \in \mathcal B_2$ the function $x \mapsto P(x|B)$ is measurable w.r.t. $\mathcal B_1$.
In the case where transition measures are probabilities in the second argument, they are called transition probabilities.
Theorem 10.12. Let P(·|·) be a transition probability for spaces $(X_1, \mathcal B_1)$ and $(X_2, \mathcal B_2)$ and let ν be a probability measure on $\mathcal B_1$. Then there exists a unique probability measure µ on $(X_1 \times X_2, \mathcal B_1 \otimes \mathcal B_2)$ with
(10.12)    $\mu(B_1 \times B_2) = \int_{B_1} P(x|B_2)\, d\nu(x) \quad\text{for all } B_1 \in \mathcal B_1,\ B_2 \in \mathcal B_2.$
17 Borovkov did not use the terminology “regular conditional probability”; instead he uses the terminology “conditional distribution” and “conditional density”.