Kernel Methodsfor Machine Learningwith Mathand Pytho
Kernel Methodsfor Machine Learningwith Mathand Pytho
and Python
Joe Suzuki
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature
Singapore Pte Ltd. 2022
This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether
the whole or part of the material is concerned, specifically the rights of reprinting, reuse of illustrations,
recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or
information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar
methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, expressed or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd.
The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721,
Singapore
Preface
Among machine learning methods, kernels have always been a particular weakness
of mine. I tried to read “Introduction to Kernel Methods” by Kenji Fukumizu (in
Japanese) but failed many times. I invited Prof. Fukumizu to give an intensive lecture
at Osaka University and listened to the course for a week with the students, but I
could not understand the book’s essence. When I first started writing this book, my
goal was to rid my sense of weakness. However, now that this book is completed, I
can tell readers how they can overcome their own kernel weaknesses.
Most people, even machine learning researchers, do not understand kernels and
use them. If you open this page, I believe you have a positive feeling that you want
to overcome your weakness.
The shortest path I would most recommend for achieving this is to learn mathe-
matics by starting from the basics. Kernels work according to the mathematics behind
them. It is essential to think through this concept until you understand it. The mathe-
matics needed to understand kernels are called functional analysis (Chap. 2). Even if
you know linear algebra or differential and integral calculus, you may be confused.
Vectors are finite dimensional, but a set of functions is infinite dimensional and can
be treated as linear algebra. If the concept of completeness is new to you, I hope you
will take the time to learn about it. However, if you get through this second chapter,
I think you will understand everything about kernels.
This book is the third volume (of six) in the 100 Exercises for Building Logic set.
Since this is a book, there must be a reason for publishing it (the so-called cause)
when existing books on kernels can be found. The following are some of the features
of this book.
1. The mathematical propositions of kernels are proven, and the correct conclu-
sions are stated so that the reader can reach the essence of kernels.
v
vi Preface
The author wishes to thank Mr. Bing Yuan Zhang, Mr. Tian Le Yang, Mr. Ryosuke
Shimmura, Mr. Tomohiro Kamei, Ms. Rieko Tasaka, Mr. Keito Odajima, Mr. Daiki
Fujii, Mr. Hongming Huang, and all graduate students at Osaka University, for
pointing out logical errors in mathematical expressions and programs. Furthermore, I
would like to take this opportunity to thank Dr. Hidetoshi Matsui (Shiga University),
Dr. Michio Yamamoto (Okayama University), and Dr. Yoshikazu Terada (Osaka
University) for their advice on functional data analysis in seminars and workshops.
This English book is based mainly on the Japanese book published by Kyoritsu
Shuppan Co., Ltd. in 2021. The author would like to thank Kyoritsu Shuppan Co.,
Ltd., particularly its editorial members Mr. Tetsuya Ishii and Ms. Saki Otani. The
author also appreciates Ms. Mio Sugino, Springer, preparing the publication and
providing advice on the manuscript.
ix
Contents
xi
xii Contents
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
Chapter 1
Positive Definite Kernels
In data analysis and various information processing tasks, we use kernels to evaluate
the similarities between pairs of objects. In this book, we deal with mathematically
defined kernels called positive definite kernels. Let the elements x, y of a set E
correspond to the elements (functions) (x), (y) of a linear space H called the
reproducing kernel Hilbert space. The kernel k(x, y) corresponds to the inner product
(x), (y) H in the linear space H . Additionally, by choosing a nonlinear map ,
this kernel can be applied to various problems. The set E may be a string, a tree, or a
graph, even if it is not a real-numbered vector, as long as the kernel satisfies positive
definiteness. After defining probability and Lebesgue integrals in the second half,
we will learn about kernels by using characteristic functions (Bochner’s theorem).
√ definite matrix A ∈ R , we have that z Az ≥ 0
n×n
Corollary 1 For a nonnegative
for any z ∈ C , where i = −1 is the imaginary unit, and we write the conjugate
n
x − i y of z = x + i y ∈ C with x, y ∈ R as z.
Proof: Since there exists a B ∈ Rn×n such that A = B B for a nonnegative definite
matrix A ∈ Rn×n , we have that
z Az = (Bz) Bz = |Bz|2 ≥ 0
for any z = [z 1 , . . . , z n ] ∈ Cn .
Example 1
n=3
B = np.random.normal(size=n∗∗2).reshape(3, 3)
A = np.dot(B.T, B)
values, vectors = np.linalg.eig(A)
print("values:\n", values, "\n\nvectors:\n", vectors, "\n")
values:
[0.09337468 7.75678625 4.43554113]
vectors:
[[ 0.49860775 0.84350568 0.199721 ]
[ 0.39606374 -0.42663779 0.81308899]
[-0.77105371 0.32631023 0.54680692]]
1.1 Positive Definiteness of a Matrix 3
S = []
for i in range(10):
z = np.random.normal(size = n)
y = np.squeeze(z.T.dot(A.dot(z)))
S.append(y)
if (i+1) % 5 == 0:
print("S[%d:%d]:"%((i−4),i), S[i−4:i])
1.2 Kernels
for λ > 0, and we construct the following function (the Nadaraya-Watson estimator)
from observations (x1 , y1 ), . . . , (x N , y N ) ∈ E × R:
N
k(x, xi )yi
fˆ(x) = i=1
N
.
j=1 k(x, x j )
For a given input x∗ ∈ E that is different from the N pairs of inputs, we return
the weighted sum of y1 , . . . , y N ,
k(x∗ , x1 ) k(x∗ , x N )
N , . . . , N ,
j=1 k(x ∗ , x j ) j=1 k(x ∗ , x j )
as the output fˆ(x∗ ). Because we assume that a larger k(x, y) yields a more similar
x, y ∈ E, the more similar x∗ and xi are, the larger the weight of yi .
Given an input x∗ ∈ E for i = 1, . . . , N , we weight yi such that xi − λ ≤ x∗ ≤
xi + λ is proportional to k(xi , x∗ ). If we make the λ value smaller, we predict y∗ by
using only the (xi , yi ) for which xi and x∗ are close. We display the output obtained
when we execute the following code in Fig. 1.1.
4 1 Positive Definite Kernels
3
λ = 0.05
λ = 0.35
λ = 0.5
2
1
y
0
-1
-2
-3 -2 -1 0 1 2 3
x
Fig. 1.1 We use the Epanechnikov kernel and Nadaraya-Watson estimator to draw the curves for
λ = 0.05, 0.35, 0.5. Finally, we obtain the optimal λ value and present it in the same graph
n = 250
x = 2 ∗ np.random.normal(size = n)
y = np.sin(2 ∗ np.pi ∗ x) + np.random.normal(size = n) / 4 # Data Generation
xx = np.arange(−3, 3, 0.1)
yy = [[] for _ in range(3)]
lam = [0.05, 0.35, 0.50]
color = ["g", "b", "r"]
for i in range(3):
for zz in xx:
yy[i].append(f(zz, lam[i]))
plt.plot(xx, yy[i], c = color[i], label = lam[i])
The kernels that we consider in this book satisfy the positive definiteness criterion
defined below. Suppose k : E × E → R is symmetric, i.e., k(x, y) = k(y, x), x, y ∈
E. For x1 , . . . , xn ∈ E (n ≥ 1), we say that the matrix
⎡ ⎤
k(x1 , x1 ) · · · k(x1 , xn )
⎢ .. .. .. ⎥
⎦∈R
n×n
⎣ . . . (1.1)
k(xn , x1 ) · · · k(xn , xn )
is the Gram matrix w.r.t. a k of order n. We say that k is a positive definite kernel2 if
the Gram matrix of order n is nonnegative definite for any n ≥ 1 and x1 , . . . , xn ∈ E.
Example 3 The kernel in Example 2 does not satisfy positive definiteness. In
fact, when λ = 2, n = 3, and x1 = −1, x2 = 0, x3 = 1, the matrix consisting of
K λ (xi , yi ) can be written as
⎡ ⎤ ⎡ ⎤
k(x1 , x1 ) k(x1 , x2 ) k(x1 , x3 ) 3/4 9/16 0
⎣ k(x2 , x1 ) k(x2 , x2 ) k(x2 , x3 ) ⎦ = ⎣ 9/16 3/4 9/16 ⎦
k(x3 , x1 ) k(x3 , x2 ) k(x3 , x3 ) 0 9/16 3/4
and the determinant is computed as 33 /26 − 35 /210 − 35 /210 = −33 /29 . In general,
the determinant of a matrix is the product of its eigenvalues, and we find that at least
one of the three eigenvalues is negative.
∞
Example 4 For random variables {X i }i=1 that are not necessarily independent, if
k(X i , X j ) is the covariance between X i , X j , the Gram matrix of any order is the
covariance matrix among a finite number of X j , which means that k is positive
definite. We discuss Gaussian processes based on this fact in Chap. 6.
By assuming positive definiteness, the theory of kernels will be developed in this
book. Hereafter, when we state kernels, we are referring to positive definite kernels.
Let H be a linear space (vector space) equipped with an inner product ·, · H .
Then, we often construct a positive definite kernel with
2 Although it seems appropriate to say “a nonnegative definite kernel”, the custom of saying “a
positive definite kernel” has been established.
6 1 Positive Definite Kernels
n
n
n
n
n
z K z = z i z j (xi ), (x j ) H = z i (xi ), z j (x j ) H = z j (x j )2H ≥ 0 ,
i=1 j=1 i=1 j=1 j=1
1/2
where we write a H := a, a H for a ∈ H .
x Ax ≥ 0, x Bx ≥ 0=
⇒x (a A + bB)x ≥ 0
n
n
B∞ = z j z h k∞ (x j , x h ) = −
j=1 h=1
n x1 ,
for · · · , xn ∈ E, z 1 , . . . , z n ∈ R, and > 0. Then, the difference between Bi :=
n
j=1 h=1 z j z h ki (x j , x h ) ≥ 0 and B∞ becomes arbitrarily close to zero as i → ∞.
However, the difference is at least > 0, which is a contradiction and means that
B∞ ≥ 0. If a kernel takes only a (nonnegative) constant value a, since all the values
in (1.1) are a ≥ 0, we have
⎡ ⎤ ⎡√ √ ⎤ ⎡ √ √ ⎤
a ··· a a/n · · · a/n a/n · · · a/n
⎢ .. . . .. ⎥ ⎢ .. .. .. ⎥ ⎢ .. .. . ⎥
⎣. . .⎦=⎣ . . . ⎦ ⎣ . . .. ⎦ .
√ √ √ √
a ··· a a/n · · · a/n a/n · · · a/n
x Ax ≥ 0 , x ∈ Rn =
⇒x D ADx ≥ 0 , x ∈ Rn ,
k(x, y)
√ (1.3)
k(x, x)k(y, y)
obtained by substituting f (x) = {k(x, x)}−1/2 for k(x, x) > 0 (x ∈ E) in the last
item of Proposition 4 is positive definite. Furthermore„ the value obtained by substi-
tuting n = 2, x1 = x, and x2 = y into (1.1) is nonnegative, and the absolute value of
(1.3) does not exceed one. We say that (1.3) is the positive definite kernel obtained
by normalizing k(x, y).
β2 2 βm m
km (x, y) := 1 + βx y + (x y) + · · · + (x y) (1.4)
2 m!
(m ≥ 1) is a polynomial of the products of positive definite kernels, and the coef-
ficients are nonnegative. From the first two items of Proposition 4, this kernel is a
positive definite kernel. Additionally, because (1.4) is a Taylor expansion up to the
order m, from the third item of Proposition 4,
1
k(x, y) := exp{− x − y2 } , σ > 0 (1.5)
2σ 2
Thus, from the fifth item of Proposition 4 and the fact that exp(βx y) with β = σ −2
is positive definite, we see that (7) is positive definite.
Example 8 (Polynomial Kernel) The kernel
if we normalize it.
The converse is true for Proposition 2, which will be proven in Chap. 3: for
any nonnegative definite kernel k, there exists a feature map : E → H such that
k(x, y) = (x), (y) H .
Example 10 (Polynomial Kernel) Let m, d ≥ 1. The feature map of the kernel
km,d (x, y) = (x y + 1)m with x, y ∈ Rd is
m!
m,d (x1 , · · · , xd ) = ( x m 1 · · · xdm d )m 0 ,m 1 ,...,m d ≥0 ,
m 0 !m 1 ! · · · m d ! 1
d m!
( z i )m = z m 1 · · · z dm d
i=0 m 0 +m 1 +···+m d =m
m 0 !m 1 ! · · · m d ! 1
√ √ √
2,2 (x1 , x2 ) = [1, x12 , x22 , 2x1 , 2x2 , 2x1 x2 ]
because
= (1 + x1 y1 + x2 y2 )2 = (1 + x y)2 = k(x, y) .
Example 11 (Infinite-Dimensional
√ Polynomial Kernel) Let 0 < r ≤ ∞, d ≥ 1, and
E := {x ∈ Rd |x2 < r }. Let f : (−r, r ) → R be C ∞ . We assume that the func-
tion can be Taylor-expanded by
∞
f (x) = an x n , x ∈ (−r, r ) .
n=0
We often obtain the optimal value for each of the kernel parameters via cross
validation (CV)4 . If the parameters take continuous values, we select a finite number
of candidates and obtain the evaluation value for each parameter as follows. Divide
the N samples into K groups and conduct estimation with the samples belonging to
group K − 1 group. Perform testing with the samples belonging to the one remaining
group and calculate the corresponding score. Repeat the procedure K times (changing
4 Joe Suzuki, “Statistical Learning with Math and Python”, Chap. 4, Springer.
10 1 Positive Definite Kernels
σ 2 = 0.01
3
σ 2 = 0.001
σ 2 = σbest
2
2
1
y
0
-1
-2
-3 -2 -1 0 1 2 3
x
Fig. 1.2 Smoothing by predicting the values of x outside the N sample points via the Nadaraya-
Watson estimator. We choose the best parameter for the Gaussian kernel via cross validation
Fig. 1.3 A rotation employed for cross validation. Each group consists of N /k samples; we divide
N N 2N N
the samples into k groups based on their sample IDs. 1 ∼ , +1∼ , . . . , (k − 2) + 1 ∼
k k k k
N N
(k − 1) , (k − 1) + 1 ∼ N
k k
the test group) and find the sum of the obtained scores. In that way, we evaluate
the performance of the kernel based on one parameter. Execute this process for all
parameter candidates and use the parameters with the best evaluation values.
We obtain the optimal value of the parameter σ 2 via CV. We execute this procedure,
setting σ 2 = 0.01, 0.001.
n = 100
x = 2 ∗ np.random.normal(size = n)
y = np.sin(2 ∗ np.pi ∗ x) + np.random.normal(size = n) / 4 # Data Generation
for i in range(2):
for zz in xx:
yy[i].append(F(zz, sigma2[i]))
plt.plot(xx, yy[i], c = color[i], label = sigma2[i])
plt.legend(loc = "upper left", frameon = True, prop={’size’:20})
plt.title("Nadaraya−Watson Estimator", fontsize = 20)
xx = np.arange(−3, 3, 0.1)
yy = [[] for _ in range(3)]
sigma2 = [0.001, 0.01, sigma2_best]
labels = [0.001, 0.01, "sigma2_best"]
color = ["g", "b", "r"]
for i in range(3):
for zz in xx:
yy[i].append(F(zz, sigma2[i]))
plt.plot(xx, yy[i], c = color[i], label = labels[i])
plt.legend(loc = "upper left", frameon = True, prop={’size’: 20})
plt.title("Nadaraya−Watson Estimator", fontsize = 20)
12 1 Positive Definite Kernels
1.4 Probability
Each set is an event when the sets are closed by set operations (union, intersection,
and complement).
Example 13 We consider a set consisting of the subsets of E = {1, 2, 3, 4, 5, 6}
(dice eyes) that are closed by set operations:
{E, {}, {1, 3}, {5}, {2, 4, 6}, {1, 3, 5}, {2, 4, 5, 6}, {1, 2, 3, 4, 6}} .
If any of these eight elements undergo the union, intersection, or complement oper-
ations, the result remains one of these eight elements. In that sense, we can say
that these eight elements are closed by the set operations. The subsets {1, 3} and
{2, 4, 5, 6} are events, but {2, 4} is not. On the other hand, for the entire set E, if we
include {1}, {2}, {3}, {4}, {5}, {6} as events, 26 events should be considered. Even if
the entire set E is identical, whether it is an event differs depending on the set F of
events.
In the following, we start our discussion after defining the entire set E and the
set F of subsets (events) of E closed by the set operations. Any open interval (a, b)
with a, b ∈ R is a subset of the whole real number system R. Applying set operations
(union, set product, and set complement) to multiple open intervals does not form an
open interval, but the result remains a subset of R. We call any subset of R obtained
from an open set by set operations a Borel set of R, and we denote such a subset as
B. A set obtained by further applying set operations to Borel sets remains a Borel
set.
Example 14 For a, b ∈ R, the following are Borel sets: {a} = ∩∞n=1 (a − 1/n, a +
1/n), [a, b) = {a} ∪ (a, b), (a, b] = √{b} ∪ (a, b), [a, b] = {a} ∪ (a, b], R =
∪∞ ∞
n=0 (−2 , 2 ), Z = ∪n=0 {−n, n}, and [ 2, 3) ∪ Z.
n n
As described above, we assume that we have defined the entire set E and the
set F of events. At this time, the μ : F → [0, 1] that satisfies the following three
conditions is called a probability.
1. μ(A) ≥ 0, A ∈ F, ∞
∞
2. Ai ∩ A j = {}=
⇒μ(∪i=1 Ai ) = i=1 μ(Ai ), and
3. μ(E) = 1.
We say that μ is a measure if μ satisfies the first two conditions, and we say that
this measure is finite if μ(E) takes a finite value. We say that (E, F, μ) is either a
probability space or a measure space, depending on whether μ(E) = 1 or not.
For probability and measure spaces, if {e ∈ E|X (e) ∈ B} is an event for any
Borel set B, which means that {e ∈ E|X (e) ∈ B} ∈ F, we say that the function
X : E → R is measurable in X . In particular, if we have a probability space, X is a
random variable. Whether X is measurable depends on (E, F) rather than (E, F, μ).
1.4 Probability 13
Then, if F = {{1, 3, 5}, {2, 4, 6}, {}, E}, then X is a random variable. In fact, since
X is measurable,
{e ∈ E|X (e) ∈ {1}} = {1, 3, 5}
for the Borel set B = {1}, [−2, 3), [0, 1). Even if we choose the Borel set B, the
set {e ∈ E|X (e) ∈ B} is one of {1, 3, 5}, {2, 4, 6}, {}, E. On the other hand, if F =
{{1, 2, 3}, {4, 5, 6}, {}, E}, then X is not a random variable.
In the following, assuming
that the function f : E → R is measurable, we define
the Lebesgue integral f dμ. We first assume that f is nonnegative. For a sequence
E
of exclusive subsets {Bk } of F, we define
inf f (e) μ(Bk ) . (1.7)
e∈Bk
k
sup inf f (e) μ(Bk ),
{Bk } e∈Bk
k
takes a finite value, we say that the supremum is the Lebesgue integral of the mea-
surable function f for (E, B, μ), and we write f dμ. When the function f is not
E
necessarily nonnegative, we divide E into E + := {e ∈ E| f (e) ≤ 0}} and E − := {e ∈
E| f (e)
≥ 0}},and we define the above quantity for each of f + := f, f − :=− f . If
both f + dμ, f − dμ take finite values, we say that f dμ := f + dμ − f − dμ
is the Lebesgue integral of f for (E, B, μ).
If X is a random variable, the associated Borel sets are the events for the probability
μ(·). We say that the probability of event X ≤ x for x ∈ R
FX (x) := μ([−∞, x)) = dμ ,
(−∞,x]
14 1 Positive Definite Kernels
n
n
z i z j φ(xi − x j ) ≥ 0 , z = [z 1 , . . . , z n ] ∈ Rn (1.8)
i=1 j=1
√ n ≥ 1, x1 , . . . , xn ∈ E.
for an arbitrary
Let i = −1 be the imaginary unit. We define the characteristic function of a
random variable X by ϕ : Rd → C:
1.5 Bochner’s Theorem 15
ϕ(t) := E[exp(it X )] = exp(it x)dμ(x) , t ∈ Rd ,
E
the kernel k in Proposition 5 by a constant. Note that we only consider a kernel k(·, ·)
whose range is real in this book, although the range of the characteristic function is
generally Cn .
d
In the following, we denote t2 by j=1 t j for t = [t1 , . . . , td ] ∈ R .
2 d
As discussed in Chap. 4, the space E of the covariates is projected via the feature
map : E → H . The method of evaluating similarity via the inner product (kernel)
in another linear space ( RKHS ) has been widely used in machine learning and data
science. If the similarities between the elements of the set E are accurately repre-
sented, then this approach yields improved regression and classification processing
performance. As this is a kernel configuration method, we provide the notions of
convolutional and marginalized kernels and illustrate them by introducing string,
tree, and graph kernels.
First, we define positive definite kernels k1 , . . . , kd for the sets E 1 , . . . , E d . Sup-
pose that we define a set E and a map R : E 1 × · · · × E d → E. Then, we define the
kernel E × E (x, y) → k(x, y) ∈ R by
d
k(x, y) = ki (xi , yi ) , (1.10)
R −1 (x) R −1 (y) i=1
where R −1 (x) is the sum over (x1 , . . . , xd ) ∈ E 1 × · · · E d such that R(x1 , . . . , xd ) =
x. A kernel in the form of (1.10) is called a convolutional kernel [13]. Since each
ki (xi , yi ) is positive definite, k(x, y) is also positive definite (according to the first
two items of Proposition 4).
1.6 Kernels for Strings, Trees, and Graphs 17
’ababbcaaac’
’ccbcbcaaacaa’
string_kernel (x,y,2)
58
18 1 Positive Definite Kernels
k(x, y) = cu (x)cu (y) = 1 · I (x2 = y2 ) · 1 .
u R(x1 ,x2 ,x3 )=x R(y1 ,y2 ,y3 )=y
Thus, we observe that the string kernel can be expressed by (1.10), where I (A) takes
values of one and zero depending on whether condition A is satisfied.
Example 22 (Tree Kernel) Suppose that we assign a label to each vertex of trees
x, y. We wish to evaluate the similarity between x, y based on how many subtrees
are shared. We denote by ct (x), ct (y) the numbers of occurrences of subtree t in x, y,
respectively. Then, the kernel
k(x, y) := ct (x)ct (y) (1.11)
t
where c(u, v) = t I (u, t)I (v, t) is the number of common subtrees in x and y
such that the vertices u ∈ Vx and v ∈ Vy are their roots. We assume that a label l(v)
is assigned to each v ∈ V and determine whether they coincide.
1. For the descendants u 1 , . . . , u m and v1 , . . . , vn of u and v, if any of the following
hold, then we define c(u, v) := 0:
(a) l(u) = l(v),
(b) m = n,
(c) there exists i = 1, . . . , m such that l(u i ) = l(vi ),
2. otherwise, we define
1.6 Kernels for Strings, Trees, and Graphs 19
m
c(u, v) := {1 + c(u i , vi )}.
i=1
For example, suppose that we assign one of the labels A, T, G, C to each vertex
in Fig. 1.4. We may write this in a Python function as follows, where we assume that
we assign no identical labels to the vertices at the same level of the tree. Note that
the function calls itself (it is a recursive function). For example, the function requires
the value C(4, 2) when it obtains C(1, 1).
def C( i , j ) :
S, T = s [ i ] , t [ j ]
# Return zero when verteces i and j of the trees s and t do not coincides
i f S[0] != T[0]:
return 0
# Return zero when either verteces i or j of the trees s and t does not have a descendant
i f S[1] is None:
return 0
i f T[1] is None:
return 0
i f len(S[1]) != len(T[1]) :
return 0
U = []
for x in S[1]:
U.append( s [x][0])
U1 = sorted(U)
V = []
for y in T[1]:
V.append( t [y][0])
V1 = sorted(V)
m = len(U)
# Return zero when the labels of the descendants do not coincide
for h in range(m) :
i f U1[h] != V1[h] :
return 0
U2 = np. array (S[1]) [np. argsort (U) ]
V2 = np. array (T[1]) [np. argsort (V) ]
W= 1
for h in range(m) :
W = W ∗ (1 + C(U2[h] , V2[h]) )
return W
def k(s , t ) :
m, n = len( s ) , len( t )
kernel = 0
for i in range(m) :
for j in range(n) :
i f C( i , j ) > 0:
kernel = kernel + C( i , j )
return kernel
1 G 1 G
2 T 4 A A 2 5 T
A
3 C 5 6 T 3 4 6 7
C C
C T
C 8 9 T
Fig. 1.4 A tree kernel evaluates the similarity in terms of which the labels A, G, C, T are assigned
to the vertices of the trees
t[6] = ["A", [7, 8]]; t[7] = ["C", None]; t[8] = ["T", None]
for i in range(6):
for j in range(9):
if C(i, j) > 0:
print(i, j, C(i, j))
0 0 2
3 1 1
3 6 1
k(s , t )
for x, x ∈ E X (Tsuda et al. [32]). We claim that the marginalized kernel is positive
definite. In fact, k X Y being positive definite implies the existence of the feature map
: E X Y (x, y) → (x, y) such that
Thus, there exists another feature map E X x → y∈E Y P(y|x)((x, y)) such that
k(x, x ) := P(y|x)P(y |x )((x, y)), ((x , y ))
y∈E Y y ∈E Y
= P(y|x)((x, y)), P(y |x )((x , y )) .
y∈E Y y ∈E Y
We may define (1.12) for the conditional density function f of Y given X as follows:
k(x, x ) := kY |X ((x, y), (x , y )) f (y|x) f (y |x )dydy
y∈E Y y ∈E Y
for x, x ∈ E X .
def k(s , p) :
return prob(s , p) / len(node)
def prob(s , p) :
i f len(node[ s [0]]) == 0:
return 0
i f len( s ) == 1:
return p
m = len( s )
S = (1 − p) / len(node[ s [0]]) ∗ prob( s [1:m] , p)
return S
22 1 Positive Definite Kernels
C 2 4 A
b
0.0016460905349794243
Because five vertices exist, we multiply by 1/5, choose one of the next two
transitions, and so on.
1 2 1 2 2 1 2 1 22
· · · ( · 1) · · · ( · 1) · = .
5 3 2 3 3 2 3 3 5 × 35
Appendix
Many books have proofs because Fubini’s theorem, Lebesgue’s dominant conver-
gence theorem, and Levy’s convergence theorem are general theorems. We have
abbreviated these statements and proofs. The proof of Proposition 5 was provided
by Ito [15].
Appendix 23
Proof of Proposition 3
D is a diagonal matrix whose components are the eigenvalues λi ≥ 0 of the non-
negative definite matrix A, and U is an orthogonal matrix whose column vectors
n u i that
are unit eigenvectors are orthogonal to each other. Then, we can write
A = U DU = i=1 λi u i u i . Similarly, if μi , vi , i = 1, . . . , n arethe eigenvalues
and eigenvectors of matrix B, respectively, then we can write B = i=1 n
μi vi vi . At
this moment, we have
(u i u i ) ◦ (v j v j )=(u i,k u i,l ·v j,k v j,l )k,l =(u i,k v j,k · u i,l v j,l )k,l = (u i ◦ v j )(u i ◦ v j ) .
n
n
n
n
A◦B = λi μ j (u i u i ) ◦ (v j u j ) = λi μ j (u i ◦ v j )(u i ◦ v j )
i=1 j=1 i=1 j=1
is nonnegative definite.
Proof of Proposition 5
We only show the case in which φ(0) = η(E) = 1 because the extension is straight-
forward. Suppose that (1.9) holds. Then, we have
n
n
n
n
n
z j z k φ(x j − xk ) = z j ei x j t z k e−i xk t dη(t) = | z j ei x j t |2 dη(t) ≥ 0,
j=1 k=1 E j=1 k=1 E j=1
and (1.8) follows. Conversely, suppose that (1.8) holds. Since the matrix consisting
of φ(xi − x j ) for the (i, j)th element is nonnegative definite and symmetric, we
have that φ(x) = φ(−x), x ∈ R. If we substitute n = 2, x1 = u, and x2 = 0, then
we obtain
1 φ(u) z1
[z 1 , z 2 ] ≥0
φ(u) 1 z2
where we use the dominant convergence theorem for the last equality. In general, for
a g : R → R that is monotonically increasing and bounded from above, we have
y
1
lim g(x)d x = lim g(x) .
y→∞ y 0 x→∞
∞
Thus, we have f n (x)d x = 1.
−∞
Finally, we show that φn → φ (n → ∞):
a ∞
1
φ(t)e−t /n e−ita dt
2
φn (z) := lim e i za
a→∞ −a −∞ 2π
∞
1 2 sin a(t − z)
φ(t)e−t /n
2
= lim dt
a→∞ 2π −∞ t −z
∞
1 b 1 2 sin a(t − z)
φ(t)e−t /n
2
= lim da dt
b→∞ b 0 2π −∞ t −z
∞
1 2(1 − cos b(t − z))
φ(t)e−t /n
2
= lim dt
b→∞ 2π −∞ b(t − z)2
∞
1 s 2(1 − cos s)
φ(z + )e−(z+s/b) /n ds = φ(z)e−z /n → φ(z).
2 2
= lim 2
b→∞ 2π −∞ b s
Appendix 25
and bi
2 sin ai xi 2(1 − cos ti bi )
dai = ,
0 ti ti2 bi
Exercises 1∼15
1. Show that the following three conditions are equivalent for a symmetric matrix
A ∈ Rn×n .
(a) There exists a square matrix B such that A = B B.
(b) x Ax ≥ 0 for an arbitrary x ∈ Rn .
(c) All the eigenvalues of A are nonnegative.
In addition, using Python, generate a square matrix B ∈ Rn×n with real elements
by generating random numbers to obtain A = B B. Then, randomly generate
five more x ∈ Rn (n = 5) to examine whether x Ax is nonnegative for each
value.
2. Consider the Epanechnikov kernel defined by k : E × E → R
|x − y|
k(x, y) = D
λ
3
(1 − t 2 ), |t| ≤ 1
D(t) = 4
0, Otherwise
for λ > 0. Suppose that we write a kernel for λ > 0 and (x, y) ∈ E × E in
Python as shown below:
Specify the function D using Python. Moreover, define the function f that
makes a prediction at z ∈ E based on the Nadaraya-Watson estimator by uti-
lizing the function k such that z, λ are the inputs of f and k, respectively,
and (x1 , y1 ), . . . , (x N , y N ) are global. Then, execute the following to examine
whether the functions D, f work properly.
26 1 Positive Definite Kernels
n = 250
x = 2 ∗ np.random.normal(size = n)
y = np.sin(2 ∗ np.pi ∗ x) + np.random.normal(size = n) / 4
xx = np.arange(−3, 3, 0.1)
yy = [[] for _ in range(3)]
lam = [0.05, 0.35, 0.50]
color = ["g", "b", "r"]
for i in range(3):
for zz in xx:
yy[i].append(f(zz, lam[i]))
plt.plot(xx, yy[i], c = color[i], label = lam[i])
Replace the Epanechnikov kernel with the Gaussian kernel, the exponential type,
and the polynomial kernel and execute them.
3. Show that the determinant of A ∈ R3×3 coincides with the product of the three
eigenvalues. In addition, show that if the determinant is negative, at least one of
the eigenvalues is negative.
4. Show that the Hadamard product of nonnegative definite matrices of the same
size is nonnegative definite. Show also that the kernel obtained by multiplying
positive definite kernels is positive definite.
5. Show that a square matrix whose elements consist of the same nonnegative value
is nonnegative definite. Show further that a kernel that outputs a nonnegative
constant is positive definite.
6. Find the feature map 3,2 (x1 , x2 ) of the polynomial kernel k3,2 (x, y) = (x y +
1)3 for x, y ∈ R2 to derive
7. Use Proposition 4 to show that the Gaussian and polynomial kernels and expo-
nential types are positive definite. Show also that the kernel obtained by nor-
malizing a positive definite kernel is positive definite. What kernel do we obtain
when we normalize the exponential type and the Gaussian kernel?
8. The following procedure chooses the optimal parameter σ 2 of the Gaussian
kernel via 10-fold CV when applying the Nadaraya-Watson estimator to the
samples. Change the 10-fold CV procedure to the N -fold (leave-one-out) CV
process to find the optimal σ 2 , and draw the curve by executing the procedure
below:
def K(x, y, sigma2):
return np.exp(−np.linalg.norm(x − y)∗∗2/2/sigma2)
n = 100
x = 2 ∗ np.random.normal(size = n)
y = np.sin(2 ∗ np.pi ∗ x) + np.random.normal(size = n) / 4
Exercises 1∼15 27
m = int(n / 10)
sigma2_seq = np.arange(0.001, 0.01, 0.001)
SS_min = np.inf
for sigma2 in sigma2_seq:
SS = 0
for k in range(10):
test = range(k∗m,(k+1)∗m)
train = [x for x in range(n) if x not in test]
for j in test:
u, v = 0, 0
for i in train:
kk = K(x[i], x[j], sigma2)
u = u + kk ∗ y[i]
v = v + kk
if not(v==0):
z=u/v
SS = SS + (y[j] − z)∗∗2
if SS < SS_min:
SS_min = SS
sigma2_best = sigma2
print("Best sigma2 = ", sigma2_best)
and F = {{1, 2, 3}, {4, 5, 6}, {}, E}, then X is not a random variable (not mea-
surable).
10. Derive the characteristic function of the Gaussian distribution f (x) =
1 (x − μ)2
√ exp{− } with a mean of μ and a variance of σ 2 and find the
2π 2σ 2
condition for the characteristic function
α to be a real function. Do the same for
the Laplace distribution f (x) = exp{−α|x|} with a parameter α > 0.
2
11. Obtain the kernel value between the left tree and itself in Fig. 1.4. Construct and
execute a program to find this value.
12. Randomly generate binary sequences x, y of length 10 to obtain the string kernel
value k(x, y).
13. Show that the string, tree, and marginalized kernels are positive definite. Show
also that the string and graph kernels are convolutional and marginalized kernels,
respectively.
14. How can we compute the path probabilities below when we consider a random
walk in the directed graph of Fig. 1.5 if the stopping probability is p = 1/3?
28 1 Positive Definite Kernels
(a) 3 → 1 → 4 → 3 → 5,
(b) 1 → 2 → 4 → 1 → 2,
(c) 3 → 5 → 3 → 5.
15. What inconvenience occurs when we execute the procedure below to compute
a graph kernel? Illustrate this inconvenience with an example.
def k(s , p) :
return prob(s , p) / len(node)
def prob(s , p) :
i f len(node[ s [0]]) == 0:
return 0
i f len( s ) == 1:
return p
m = len( s )
S = (1 − p) / len(node[ s [0]]) ∗ prob( s [1:m] , p)
return S
Chapter 2
Hilbert Spaces
When considering machine learning and data science issues, in many cases, the
calculus and linear algebra courses taken during the first year of university provide
sufficient background information. However, we require knowledge of metric spaces
and their completeness, as well as linear algebras with nonfinite dimensions, for
kernels. If your major is not mathematics, we might have few opportunities to study
these topics, and it may be challenging to learn them in a short period. This chapter
aims to learn Hilbert spaces, the projection theorem, linear operators, and (some of)
the compact operators necessary for understanding kernels. Unlike finite-dimensional
linear spaces, ordinary Hilbert spaces require scrutiny of their completeness.
1 We call M a metric space rather than (M, d) when we do not stress d or when d is apparent.
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 29
J. Suzuki, Kernel Methods for Machine Learning with Math and Python,
https://doi.org/10.1007/978-981-19-0401-1_2
30 2 Hilbert Spaces
Example 24 The set M = [0, 1] is a closed set because the neighborhood U (y, ) of
y∈/ M has no intersection with M if we make the radius > 0 smaller, which means
that M contains all the convergence points of M. On the other hand, M = (0, 1) is
an open set because M contains the neighborhood U (y, ) of y ∈ M if we make the
radius > 0 smaller. If we add {0}, {1} to the interval (0, 1), (0, 1], [0, 1), we obtain
the closed set [0, 1].
We say that the minimum closed set in M that contains E is the closure of E, and
we write this as E. If E is not a closed set, then E does not contain all the convergence
points. Thus, the closure is the set of convergence points of E. Moreover, we say that
E is dense in M if E = M, which is equivalent to the following conditions. “y ∈ E
exists such that d(x, y) < for an arbitrary > 0 and x ∈ M”, and “each point in M
is a convergence point of E”. Furthermore, we say that M is separable if it contains
a dense subset that consists of countable points.
Example 25 For the distance d(x, y) := |x − y| with x, y ∈ R and the metric space
(R, d), each irrational number a ∈ R\Q is a convergence point of Q. In fact, for an
arbitrary > 0, the interval (a − , a + ) contains a rational number b ∈ Q. Thus, Q
does not contain the convergence point a ∈ / Q and is not a closed set in R. Moreover,
the closure of Q is R (Q is dense in R). Furthermore, since Q is a countable set, we
find that R is separable.
Let (M, d) be a metric space. We say that a sequence {xn } in2 M converges to
x ∈ M if d(xn , x) → 0 as n → ∞ for x ∈ M, and we write this as xn → x. On
the other hand, we say that a sequence {xn } in M is Cauchy if d(xm , xn ) → 0 as
m, n → ∞, i.e., if supm,n≥N d(xm , xn ) → 0 as N → ∞.
If {xn } converges to some x ∈ M, then it is a Cauchy sequence. However, the
converse does not hold. We say that a metric space (M, d) is complete if each
Cauchy sequence {xn } in M converges to an element in M. We say that (M, d) is
bounded if there exists a C > 0 such that d(x, y) < C for an arbitrary x, y ∈ M, and
the minimum and maximum values are the upper and lower limits if M is bounded
from above and below, respectively.
min{x1 , . . . , x N −1 , x N − } ≤ xn ≤ max{x1 , . . . , x N −1 , x N + } .
Proposition 6 R is complete.
1 3
(0, 1) ∪i=1
n
( xi , xi ) ,
2 2
which means that M is not compact.
Proposition 7 (Heine-Borel) For R p , any bounded closed set M is compact.
Proof: Suppose that we have set a neighborhood U (P) for each P ∈ M and that
M ⊆ ∪i=1m
U (Pi ) cannot be realized by any m and P1 , . . . , Pm . If we divide the
closed set (rectangular) that contains M ⊆ R p into two components for each dimen-
sion, then at least one of the 2 p rectangles cannot be covered by a finite number of
neighborhoods. If we repeat this procedure, then the volume of the rectangle that a
finite number of neighborhoods cannot cover becomes sufficiently small for the cen-
ter to converge to a P ∗ ∈ M; furthermore, we can cover the rectangle with U (P ∗ ),
which is a contradiction.
Let (M1 , d1 ), (M2 , d2 ) be metric spaces. We say that the map f : M1 → M2 is
continuous at x ∈ M1 if for any > 0, there exists δ(x, ) such that for y ∈ M1 ,
In particular, if there exists δ(x, ) that does not depend on x ∈ M1 in (2.1), we say
that f is uniformly continuous.
Example 29 The function f (x) = 1/x defined on the interval (0, 1] is contin-
uous but is not uniformly continuous. In fact, if we make x approach y after
1 1
fixing y, we can make d2 ( f (x), f (y)) = | − | as small as possible, which
x y
means that f is continuous in (0, 1]. However, when we make x approach y to
make d2 ( f (x), f (y)) smaller than a constant, we observe that for each > 0, the
smaller y is, the smaller d1 (x, y) = |x − y| should be. Thus, no such δ() for which
d1 (x, y) < δ()=⇒d2 ( f (x), f (y)) < exists if δ() does not depend on x, y ∈ M
(Fig. 2.1).
32 2 Hilbert Spaces
10
make the | f (x) − f (y)|
value smaller than a
f (x) = 1/x
8
constant, we need to make
the |x − y| value smaller
6
when x, y are close to 0 (red
lines) than that when x, y are
4
far away from 0 (blue lines).
Thus, δ > 0 depends on the
2
locations of x, y
0.1 0.2 0.3 0.4 0.5 0.6
x
1
d1 (x, z i ) < (z i ) < (z i )
2
d1 (y, z i ) ≤ d1 (x, y) + d1 (x, z i ) < (z i ) .
Combining these inequalities, from the assumption that f is continuous and (2.2),
we have That
Example 30 We can prove that “a definite integral exists for a continuous function
defined over a closed interval [a, b]” by virtue of Proposition 8. If we divide a < b
2.1 Metric Spaces and Their Completeness 33
We say that a set V is a linear space3 if it satisfies the following conditions: for
x, y ∈ V and α ∈ R,
1. x + y ∈ V and
2. αx ∈ V.
Example 31 If we define the sum of x = [x1 , . . . , xd ], y = [y1 , . . . , yd ] ∈ Rd and
the multiplication by a constant α ∈ R as x + y = [x1 + y1 , . . . , xd + yd ] and αx =
[αx1 , . . . , αxd ], respectively, then the d-dimensional Euclidean space Rd forms a
linear space.
Example
1 32 (L 2 Space) The set L 2 [0, 1] of functions f : [0, 1] → R for which
0 { f (x)} d x takes a finite value is a linear space because
2
1 1 1
{ f (x) + g(x)} d x ≤ 2 2
f (x) d x + 2
2
g(x)2 d x < ∞
0 0 0
1 1
{α f (x)}2 d x = α2 f (x)2 d x < ∞
0 0
1 1
for α ∈ R when 0 { f (x)}2 d x < ∞ and 0 g(x)2 d x < ∞.
Let V be a linear space. We say that any bivariate function ·, · : V × V → R
that obeys the four conditions below is an inner product:
1. x, x ≥ 0;
2. αx + β y, z = αx, z + βy, z;
3. x, y = y, x; and
4. x, x = 0 ⇐⇒ x = 0
for x, y, z ∈ V and α, β ∈ R.
d
x, y = xi yi
i=1
d
x, x = 0 ⇐⇒ xi2 = 0 ⇐⇒ x = 0 .
i=1
For a given linear space, we need to choose its inner product. We say that a linear
space equipped with an inner product is an inner product space.
Example 34 (Inner Product of the L 2 Space) Let L 2 [0, 1] be the linear space in
Example 32. The bivariate function
1
f, g = f (x)g(x)d x
0
f, g ∈ L 2 [0, 1] is not an inner product because the last condition fails. If f (1/2) = 1
and f (x) = 0 for x
= 1/2, then we have
1
f, f = f (x)2 d x = 0 .
0
1
4 We define the equivalent relation ∼ such that f − g ∈ {h| 0 {h(x)}2 d x = 0} ⇐⇒ f ∼ g and
construct the inner product for the quotient space L 2 / ∼.
2.2 Linear Spaces and Inner Product Spaces 35
satisfies the four conditions of a norm, and we call this norm a norm induced by an
inner product.
In Examples 32 and 34, we introduced L 2 over E = [0, 1] using the Riemann
integral. However, in general, we define L 2 (E, F, μ) according to the set of f :
E → R for which
f 2 dμ (2.4)
E
Example 35 (Uniform Norm) The set of continuous functions over [a, b] forms a
linear space. The uniform norm defined by
f = 0 ⇐⇒ f (x) = 0 , x ∈ [a, b] .
where the equality holds if and only if z = 0, which occurs exactly when one of x, y
is a constant multiplication of the other. Moreover, we can examine the triangle norm
inequality via (2.5):
x + y2 =x2 + 2|x, y|+y2 ≤ x2 + 2x y + y2 = (x + y)2 .
Because we did not use the last condition of inner products in deriving the Cauchy-
Schwarz inequality, we may apply this inequality to any bivariate function that sat-
isfies the first three conditions.
Using Cauchy-Schwarz’s inequality, we can prove the continuity of an inner prod-
uct.
5 We often abbreviate (E, F , μ) or specify the interval as [a, b] (as in L 2 [a, b]).
36 2 Hilbert Spaces
n=1 n=2
0.8
0.8
y
y
0.4
0.4
0.0
0.0
-0.5 0 0.5 1.0 1.5 -0.5 0 0.5 1.0 1.5
x x
n=3 n=4
0.8
0.8
y
y
0.4
0.4
0.0
0.0
-0.5 0 0.5 1.0 1.5 -0,5 0 0.5 1.0 1.5
x x
Fig. 2.2 We illustrate Example 37 when f n → f . The function is continuous for finite values of n
but is not continuous as n → ∞
|xn , yn − x, y| ≤ |xn , yn − y| + |xn − x, y| ≤ xn · yn − y + xn − x · y → 0.
We say that a vector space in which a norm is defined and the distance is complete
is a Banach space. Hereafter, we denote by C(E) the set of continuous functions
defined over E.
2+n
1 1
1
1 1
f n − f 2 = f n (t) − f (t)2 dt = [n(t − ) − 1]2 dt = →0.
0 1
2
2 3n
as N → ∞. Then, the real sequence { f n (x)} is a Cauchy sequence for each x ∈ [a, b]
and converges to a real value (Proposition 6). If we define the function f (x) with
x ∈ E by limn→∞ f n (x), supn≥N f n (x) and inf n≥N f n (x) converge to f (x) from
above and from below, respectively. From (2.6), we see that
uniformly converges to 0 for an arbitrary x ∈ [a, b], which implies that C[a, b] is
complete.
Because any inner product does not induce the uniform norm, C[a, b] is a Banach
space but is not a Hilbert space.
For the proof, we use the Stone-Weierstrass theorem (Proposition 12) [30, 31, 34,
35]. The term algebra is used to denote the linear space A that defines associative “·”
and commutative “+” properties and satisfies
x · (y + z) = x · y + x · z
(y + z) · x = y · x + z · x
α(x · y) = (αx) · y
for x, y, z ∈ A and α ∈ R, where the first two properties are identical if · is commu-
tative. The general theory may be complex, but we only suppose that + and · are the
standard addition and multiplication operations and that A is either a polynomial or
continuous function.
Example 38 (Polynomial Ring) Let +,· be the standard commutative addition and
multiplication operations. The polynomial ring R[x, y, z] is a set of polynomials with
indeterminates (variables) x, y, z and is an algebra with R commutative coefficients.
R[x, y, z] is a linear space, and if two elements belong to R[x, y, z], multiplication
by R and addition among the elements belong to R[x, y, z]. Moreover, the three laws
follow for the elements in R[x, y, z].
Proposition 12 (Stone-Weierstrass [30, 31, 34, 35]) Let E and A be a compact set
and an algebra, respectively. Under the following conditions, A is dense in C(E).
1. No x ∈ E exists such that f (x) = 0 for all f ∈ A.
2. For each pair x, y ∈ E with x
= y, f ∈ A exists such that f (x)
= f (y).
For the proof, see the appendix at the end of this chapter. We say that a function in
the form
m
h(Bk )I (Bk ) (2.7)
k=1
The outline of the proof is as follows; see the appendix for details. It is sufficient
to show that “{ f n } is a Cauchy sequence in L 2 =
⇒ there exists an f ∈ L 2 such that
f n − f → 0”. We define the f to which the Cauchy sequence { fn } in L 2 converges
and derive f n − f → 0 and f ∈ L 2 .
1. Let { f n } be an arbitrary Cauchy
∞sequence in L .
2
i−1
vi := xi − xi , e j e j , ei = vi /vi , i = 1, 2, . . .
j=1
m
f m (x) = a0 + (an cos nx + bn sin nx) (2.8)
n=1
which are the {1, cos x, sin x, cos 2x, sin 2x, · · · } divided by their norms, form-
ing an orthonormal basis of the Hilbert space L 2 [−π, π] that consists of the
functions
π expressed by the Fourier series as in (2.8), where π we regard f, g :=
−π f (x)g(x)d
π
x as the inner product of f, g ∈ H , and we use −π cos mx sin nxd x =
π
0 and −π cos2 mxd x = −π sin2 nxd x = π for m, n > 0.
Proposition 17 For a Hilbert space H , being separable is equivalent to having an
orthonormal basis.
2.3 Hilbert Spaces 41
Let V and M be a linear space equipped with an inner product and its subspace,
respectively. We define the orthogonal complement of M as
x 1 ⊥ x 2 , x 1 ∈ M1 , x 2 ∈ M2 , (2.9)
we write it as M1 ⊕ M2 := {x1 + x2 : x1 ∈ M1 , x2 ∈ M2 }.
x − y, z = 0 , z ∈ M (2.10)
and is unique.
is a Cauchy sequence.
42 2 Hilbert Spaces
H = M ⊕ M⊥ . (2.11)
M ⊥ = { f ∈ H | f, k(xi , ·) H = 0, i = 1, . . . , N } .
⇒x, y = 0 , y ∈ M ⊥ =
x ∈ M= ⇒x ∈ (M ⊥ )⊥ .
For the third item, from the first two properties, taking the closure on both sides
of M ⊆ (M ⊥ )⊥ yields M ⊆ (M ⊥ )⊥ . From Proposition 19, an arbitrary x ∈ (M ⊥ )⊥
⊥
can be written as y ∈ M ∩ (M ⊥ )⊥ = M and z ∈ M ∩ (M ⊥ )⊥ . However, we have
⊥
M ∩ (M ⊥ )⊥ ⊆ M ⊥ ∩ (M ⊥ )⊥ = {0}, which means that z = 0, and we obtain the
third item.
Im(T ) := {T x : x ∈ X 1 } ⊆ X 2
and
Ker(T ) := {x ∈ X 1 : T x = 0} ⊆ X 1 ,
and we call the dimensionality of Im(T ) the rank of T . We say that the linear operator
T : X 1 → X 2 is bounded if for each x ∈ X 1 , there exists a constant C > 0 such that
T x2 ≤ Cx1 .
Proof: If T is uniformly continuous, then there exists a δ > 0 such that x1 ≤
δx
δ=⇒ T x2 ≤ 1. Since ≤ δ, we have
x
δx x1 x1
T x2 = T ( )2 ≤
x1 δ δ
for any x
= 0. On the other hand, if T is bounded, there exists a constant C that does
not depend on x ∈ X 1 such that
T x2 ≤ T x1 .
be finite. We define the integral operator by the linear operator T in L 2 [0, 1] such
that 1
(T f )(·) = K (·, x) f (x)d x (2.13)
0
for f ∈ L 2 [0, 1]. Note that (2.13) belongs to L 2 [0, 1] and that T is bounded: From
1 1 1
|(T f )(x)|2 ≤ K 2 (x, y)dy f 2 (y)dy = f 22 K 2 (x, y)dy ,
0 0 0
we have
1 1 1
T f 22 = |(T f )(x)|2 d x ≤ f 22 K 2 (x, y)d xd y .
0 0 0
We call such a K an integral operator kernel and distinguish between the positive
definite kernels we deal with in this book.
T f = f, eT , f ∈ H (2.14)
and T = eT .
2.5 Linear Operators 45
We assume that Tx is bounded for each x ∈ E. Then, from Proposition 22, there
exists a k x ∈ H such that
f (x) = Tx ( f ) = f, k x
x ∈ E, and Tx = k x .
We call the T ∗ in Proposition 23 the adjoint operator of T . In particular, if T ∗ = T ,
we call such an operator T self-adjoint.
we see that the adjoint T ∗ is the transpose matrix of T and that T can be written as
a symmetric matrix if and only if T is self-adjoint.
Example 45 For the integral operator of L 2 [0, 1] in Example 42, from Fubini’s
theorem, we have that
1 1 1
T f, g = K (x, y) f (x)g(y)d xd y = f, K (y, ·)g(y)dy ,
0 0 0
1
and y → (T ∗ g)(y) = 0 K (x, y)g(x)d x is an adjoint operator. If the integral oper-
ator kernel K is symmetric, the operator T is self-adjoint.
46 2 Hilbert Spaces
Let (M, d) and E be a metric space and a subset of M, respectively. If any infinite
sequence in E contains a subsequence that converges to an element in E, then we
say that E is sequentially compact. If {xn } has a subsequence that converges to x,
then x is a convergence point of {xn }.
Example 46 Let E := R and d(x, y) := |x − y| for x, y ∈ R. Then, E is not
sequentially compact. In fact, the sequence xn = n has no convergence points. For
E = (0, 1], the sequence xn = 1/n converges to 0 ∈ / (0, 1] as n → ∞, and the con-
vergence point of any subsequence is only 0. Therefore, E = (0, 1] is not sequentially
compact.
Proposition 24 Let (M, d) and E be a metric space and a subset of M, respectively.
Then, E is sequentially compact if and only if E is compact.
Proof: Many books on geometry deal with the proof of equivalence. See such books
for the details of this proof.
In this section, we explain compactness by using the terminology of sequential
compactness.
Let X 1 , X 2 be linear spaces equipped with norms, and let T ∈ B(X 1 , X 2 ). We
say that T is compact if {T xn } contains a convergence subsequence for any bounded
sequence {xn } in X 1 .
Example 47 The orthonormal basis {e j } in a Hilbert space H is √ bounded because
e j = 1. However, for an identity map, we have that ei − e j = 2 for any i
= j.
Thus, the sequence e1 , e2 , . . . does not have any convergence points in H . Hence,
the identity operator for any infinite-dimensional Hilbert space is not compact.
Proposition 25 For any bounded linear operator T , the following hold.
1. If the rank is finite, then the operator T is compact.
2. If a sequence of finite-rank operators {Tn } exists such that Tn − T → 0 as
n → ∞, then T is compact8 .
Proof: See the appendix at the end of this chapter.
Let H and T ∈ B(H ) be a Hilbert space and its bounded linear operator, respec-
tively. If λ ∈ R and 0
= e ∈ H exist such that
T e = λe , (2.15)
1. e j is linearly independent.
2. If T is self-adjoint, then {e j } are orthogonal.
Example 49 For any C > 0, the absolute values of a finite number of eigenvalues
λi for a compact operator T exceed C. Suppose that the absolute values of an infinite
number of eigenvalues λ1 , λ2 , . . . exceed C. Let M0 := {0}, Mi := span{e1 , . . . , ei },
e j ∈ Ker(T − λ j I ), j = 1, 2, . . ., i = 1, 2, . . .. Since the {e1 , . . . , ei } are linearly
⊥
independent, each Mi ∩ Mi−1 is one dimensional for i = 1, 2, . . .. Thus, if we
⊥
define the orthonormal sequence xi ∈ Ker(T − λi I ) ∩ Mi−1 , i = 1, 2, . . . via Gram-
Schmidt, then we have
T xi − T xk 2 = T xi 2 + T xk 2 ≥ 2C 2
for each x ∈ H .
Proof: We utilize the following steps, where the second item is equivalent to
(Ker(T ))⊥ = Im(T ) because T = T ∗ .
1. Show that H = Ker(T ) ⊕ (Ker(T ))⊥ .
2. Show that (Ker(T ))⊥ = Im(T ∗ ).
48 2 Hilbert Spaces
∞
for arbitrary H x = i=1 x, ei ei ; this condition is equivalent to λ1 ≥ 0,
λ2 ≥ 0, . . ..
Proposition 28 If T is nonnegative definite, we have
T e, e
λk = max (2.17)
e∈span{e1 ,...,ek−1 }⊥ e2
which expresses the maximum value over the Hilbert space H when k = 1.
Proof: The claim follows from (2.16) and λ j ≥ 0:
∞
max T e, e = max λ j e, e j 2 = λk .
e∈{e1 ,...,ek−1 }⊥ e=1 e=1
j=k
Let H1 , H2 be Hilbert spaces, {ei } an orthonormal basis of H1 , and T ∈ B(H1 , H2 ).
If
∞
T ei 2
i=1
takes a finite value, we say that T is a Hilbert-Schmidt (HS) operator, and we write
the set of HS operators in B(H1 , H2 ) as B H S (H1 , H2 ).
We define the inner product of T1 , T2 ∈ B H S (H1 , H2 ) and the HS norm of T ∈
B H S (H1 , H2 ) by T1 , T2 H S := ∞ j=1 T1 e j , T2 e j 2 and
∞
1/2
1/2
T H S := T, T H S = T ei 22 ,
i=1
respectively.
Proposition 29 The HS norm value of T ∈ B(H1 , H2 ) does not depend on the choice
of orthonormal basis {ei }.
Proof: Let {e1,i }, {e2, j } be arbitrary orthonormal bases of Hilbert spaces H1 , H2 ,
and let T1 , T2 ∈ B(H1 , H2 ). Then, for Tk e1,i = ∞ ∗
j=1 Tk e1,i , e2, j 2 e2, j , Tk e2, j =
∞ ∗
i=1 Tk e2, j , e1,i 1 e1,i , and k = 1, 2, we have
2.6 Compact Operators 49
∞
∞
∞
T1 e1,i , T2 e1,i 2 = T1 e1,i , e2, j 2 T2 e1,i , e2, j 2
i=1 i=1 j=1
∞
∞ ∞
= e1,i , T1∗ e2, j 1 e1,i , T2∗ e2, j 1 = T1∗ e2, j , T2∗ e2, j 1 ,
i=1 j=1 i=1
which means that both sides do not depend on the choices of {e1,i }, {e2, j }. In particu-
lar, if T1 = T2 = T , we see that T 2H S does not depend on the choices of {e1,i }, {e2, j }.
Proposition 30 An HS operator is compact.
Proof: Let T ∈ B(H1 , H2 ) be an HS operator, x ∈ H1 , and
n
Tn x := T x, e2i 2 e2i ,
i=1
∞
∞
∞
(T − Tn )x22 = T x, e2,i 22 = x, T ∗ e2,i 21 ≤ T ∗ e2,i 2 .
i=n+1 i=n+1 i=n+1
n
n
m
n
T 2H S = T e X,i 2 = T ∗ eY, j 2 = Ti,2j ,
i=1 j=1 i=1 j=1
where e X,i ∈ Rm is a column vector such that the ith element is one and the other
elements are zeros, and eY, j ∈ Rn is a column vector such that the jth element is one
and the other elements are zeros.
Let T ∈ B(H ) be nonnegative definite and {ei } be an orthonormal basis of H . If
∞
T T R := T e j , e j
j=1
50 2 Hilbert Spaces
is finite, we say that T T R is the trace norm of T and that T is a trace class. Similar to
an HS norm value, a trace norm value does not depend on the choice of orthonormal
basis {e j }.
If we substitute x = e j into (2.16) in Proposition 27, then we have T x = λe j and
obtain that
∞ ∞
T T R := T e j , e j = λj .
j=1 j=1
∞
∞ ∞
T 2H S = T ei,1 , e j,2 2 = λ2j ,
i=1 j=1 j=1
we have
1/2
∞
T H S ≤ λ1 λi = λ1 T T R .
i=1
Proof of Proposition 13
We show that a simple function approximates an arbitrary f ∈ L 22 and that a contin-
uous function approximates an arbitrary simple function. Hereafter, we denote the
L 2 norm by · .
Since f ∈ L 2 is measurable, if f is nonnegative, the sequence { f n } of simple
functions defined by
(k − 1)2−n , (k − 1)2−n ≤ f (ω) < k2−n , 1 ≤ k ≤ n2n
f n (ω) =
n, n ≤ f (ω) ≤ ∞
f n − f 2 → 0 .
We can show a similar derivation for a general f that is not necessarily nonnegative,
as derived in Chap. 1.
Appendix: Proofs of Propositions 51
On the other hand, let A be a closed subset of [a, b], and let K A be the indi-
cator function (K A (e) = 1 if e ∈ A; K A (e) = 0 otherwise). If we define h(x) :=
1
inf y∈A {|x − y|} and gnA (x) := , then gnA is continuous, gnA (x) ≤ 1 for
1 + nh(x)
x ∈ [a, b], gnA (x) = 1 for x ∈ A, and lim gnA (x) = 0 for x ∈ B := [a, b]\A. Thus,
n→∞
we have
1/2 1/2
lim gnA − K A = lim gnA (x)2 d x = lim g A (x)2 d x =0,
n→∞ n→∞ B B n→∞ n
where the second equality follows from the dominant convergence theorem. More-
over, if A, A are disjoint, then αgnA + α gnA with α, α > 0 approximates αK A +
α K A . In fact, we have
αgnA + α gnA − (αK A + α K A ) ≤ αgnA − K A + α gnA − K A .
Proof of Proposition 14
Suppose that { f n } is a Cauchy sequence in L 2 , which means that
t−1
| f nr (x) − f n t (x)| ≤ | f n k+1 (x) − f n k (x)| .
k=r
f − f n 2 = | f n − f |2 dμ = lim inf | f n − f n k |2 dμ ≤ lim inf | f n − f n k |2 dμ <
E E k→∞ k→∞ E
Proof of Proposition 15
The first item holds because
n
n
0 ≤ x − x, ei ei 2 = x2 − x, ei 2
i=1 i=1
n
for all n. For the second item, letting n > m, sn := k=1 x, ek ek , we have
n
n
n
sn − sm = 2
x, ek ek , x, ek ek = |x, ek |2 ,
k=m+1 k=m+1 k=m+1
which diminishes as n, m → ∞ according to the first item. For the third item, we
have
n
n n
sn − sm 2 = αk ek , αk ek = αk2 = Sn − Sm
k=m+1 k=m+1 k=m+1
n n
for sn := i=1 αi ei , Sn := i=1 αi2 , and n > m. Thus, the third item follows from
the equivalence: {sn } is Cauchy ⇐⇒ {Sn } is Cauchy.
n
The last item holds because y, ei = lim α j e j , ei = αi for y = ∞j=1 α j e j ,
n→∞
j=1
which follows from the continuity of inner products (Proposition 9).
Proof of Proposition 16
For 1.=⇒6., since
∞{ei } is an orthonormal basis of H , we may write an arbitrary
x ∈ H as x = i=1 αi ei , αi ∈ R. From the fourth item of Proposition15, we have
∞
αi = x, ei and obtain 6. 6.= ⇒5. is obtained by substituting x = i=1 x, ei ei ,
∞
y = i=1 y, ei ei into x, y. 5.= ⇒4. is obtained by substituting x = y in 5. 4.=
⇒3.
is due to
n n
x − x, ek ek = x −
2 2
|x, ek |2 → 0
k=1 k=1
n
z − y, e j = z, e j − lim z, ei , e j = z.e j − z.e j = 0 .
n→∞
i=1
Proof of Proposition 19
Let M be a closed subset of H . We show that for each x ∈ H , there exists a unique
y ∈ M that minimizes x − y and that we have
x − y, z − y ≤ 0 (2.20)
for z ∈ M. To this end, we first show that any sequence {yn } in M for which
yn + ym 2
yn − ym 2 = 2x − yn 2 + 2x − ym 2 − 4x −
2
≤ 2x − yn + 2x − ym − 4 inf x − y → 0 .
2 2 2
y∈M
Hence, {yn } is Cauchy. Then, suppose that more than one lower limit y exists, and
let u
= v be such a y. For example, for {yn }, let y2m−1 → u, and let y2m → v satisfy
(2.21). However, this limit is not Cauchy and contradicts the discussion shown thus
far. Hence, the y that achieves the limit in (2.21) is unique. In the following, we
assume that y gives the lower limit.
Moreover, note that
for arbitrary 0 < a < 1 and z ∈ M, and if x − y, z − y > 0, the inequality flips
for small a > 0. Thus, we have x − y, z − y ≤ 0.
Finally, if we substitute z = 0, 2y into (2.20), we have x − y, y = 0. Therefore,
(2.20) implies that x − y, z ≤ 0 for z ∈ M. We obtain the proposition by replacing
z with −z.
54 2 Hilbert Spaces
Proof of Proposition 22
If the operator T maps to zero for any element, then eT = 0 satisfies the desired
condition. Thus, we assume that T outputs a nonzero value for at least one input.
From the first item of Proposition 20, Ker(T )⊥ is a closed subset of H and contains
a y such that T y = 1. Thus, for an arbitrary x ∈ H , we have
T (x − (T x)y) = T x − T x T y = 0
Proof of Proposition 25
For the first item, note that if {xn } is bounded, so is {T xn }. Moreover, if the image
of T is of finite dimensionality, then {T xn } is also compact (Proposition 7)9 . For the
second item, we use the so-called diagonal argument. In the following, we denote the
norms of H1 , H2 by · 1 , · 2 . Let {xk } be a bounded sequence in X 1 . From the
compactness of T1 , there exists {x1,k } ⊆ {x0,k } := {xk } such that {T1 x1,k } converges to
a y1 ∈ H2 as k → ∞. Then, there exists {x2,k } ⊆ {x1,k } such that {T2 x2,k } converges
to a y2 ∈ H2 as k → ∞. If we repeat this process, the sequence {yn } in H2 converges.
In fact, for each n, there exists a large kn such that
1
Tn xn,k − yn 2 < , k ≥ kn .
n
If we make {kn } monotone, then for m < n, we obtain
Thus, as m, n → ∞, we have
1 1
ym − yn ≤ + + Tm − T · xn,kn 1 + Tn − T · xn,kn 1 → 0
m n
. Since H2 is complete, there exists a y ∈ H2 such that {yn } converges. Since
9This statement is called Bolzano-Weierstrass’s theorem for sequential compactness rather than
Heine-Borel’s theorem. The two theorems coincide for metric spaces.
Appendix: Proofs of Propositions 55
Proof of Proposition 26
By induction, we show that
n
c j e j = 0=
⇒c1 = c2 = · · · = cn = 0 . (2.22)
j=1
k+1
k+1
k
0 = λk+1 cjej − λjcjej = (λk+1 − λ j )c j e j .
j=1 j=1 j=1
From λk+1
=λ j , if we assume that cj := (λk+1 − λ j )c j
=0, then from kj=1 cj e j = 0
and the assumption of induction, we have c1 = · · · = ck = 0, which means that c1 =
· · · = ck = 0 and ck+1 ek+1 = − kj=1 c j e j = 0. Thus ck+1 = 0. Moreover, under
the condition that T is self-adjoint, from ei , e j = ei , λ−1 −1
j T e j = λ j T ei , e j =
−1
λ j λi ei , e j and λi
= λ j for i
= j, we have ei , e j = 0. Thus, the {e j } are orthog-
onal.
Proof of Proposition 27
We first show that
Ker(T )⊥ = Im(T ∗ ) . (2.23)
T x1 2 = x1 , T ∗ T x1 1 = 0 ,
which means that x1 ∈ Ker(T ) and establishes inverse inclusion. Thus, we have
shown that Ker(T ) = (Im(T ∗ ))⊥ . Furthermore, if we apply the third item of Propo-
sition 20, we obtain
(Ker(T ))⊥ = Im(T ∗ ) .
56 2 Hilbert Spaces
Note that since Ker(T ) is an orthogonal complement of subset Im(T ∗ ) of H , the first
item of Proposition 20 and (2.11) can be applied. Since T ∈ B(H ) is self-adjoint
(T ∗ = T ), we can write (2.23) further as
H = Ker(T ) ⊕ Im(T ) .
n
n
cjej = T ( λ−1
j cjej)
j=1 j=1
and span{e j | j ≥ 1} ⊆ Im(T ). Even if we perform closure on both sides, the inclusion
relation does not change. Thus, we have span{e j | j ≥ 1} ⊆ Im(T ). Furthermore, we
decompose (2.11)
Im(T ) = span{e j | j ≥ 1} ⊕ N ,
⊥
where N = span{e j | j ≥ 1} ∩ Im(T ). Note that T y ∈ span{e j | j ≥ 1} for y ∈ span
{e j | j ≥ 1}, and
T x, y = x, T y = 0
In fact,
1 1
|T x, y| = | T (x + y), x + y − T (x − y), x − y|
4 4
1 1
≤ |T (x + y), x + y| + | T (x − y), x − y|
4 4
1 1
≤ (w(T )(x + y + x − y2 ) = w(T )(x2 + y2 ),
2
4 2
and if we take the upper limit under x = y = 1, we obtain
Tx
T = sup T x, ≤ sup T x, y ≤ w(T ) .
x=1 T x x=y=1
Appendix: Proofs of Propositions 57
and (2.25).
In addition, we know that either ±T is an eigenvalue of T . In fact, from (2.25),
there exists a sequence {xn } in H with xn = 1 such that T xn , xn → T or
T xn , xn → −T (the upper and lower limits are convergence points). For the
former case, we have
Exercises 16∼30
16. Choose the closed sets among the sets below. For the nonclosed sets, find their
closures.
(a) ∪∞n=1 [n − n , n + n ];
1 1
for z ∈ M.
58 2 Hilbert Spaces
is orthonormal.
24. Derive Proposition 19 according to the following steps in the appendix. What
are the derivations of (a) through (e)?
(a) Show that a sequence {yn } in M for which
26. Show that the integral operator (2.13) is a bounded linear operator and that it is
self-adjoint when K is symmetric.
27. Let (M, d) be a metric space with M := R and a Euclidean distance d. Show that
each of the following E ⊆ M is not sequentially compact. Furthermore, show
that they are not compact without using the equivalence between compactness
and sequential compactness.
(a) E = [0, 1)and
(b) E = Q.
28. Proposition 27 is derived according to the following steps in the appendix. What
are the derivations of (a) through (c)?
(a) Show that H1 = Ker(T ) ⊕ Im(T ).
(b) Show that span{e j | j ≥ 1} ⊆ Im(T ).
(c) Show that span{e j | j ≥ 1} ⊇ Im(T ).
Why do we need to show (2.25)?
29. Show that the HS and trace norms satisfy the triangle inequality.
30. Show that if T ∈ B(H ) is a trace class, then it is also an HS class, and show that
if T ∈ B(H ) is a trace class, it is also compact.
Chapter 3
Reproducing Kernel Hilbert Space
Thus far, we have learned that a feature map : E x → k(x, ·) is obtained by the
positive definite kernel k : E × E → R. In this chapter, we generate a linear space
H0 based on its image k(x, ·)(x ∈ E) and construct a Hilbert space H by completing
this linear space, where H is called reithe reproducing kernel Hilbert space (RKHS),
which satisfies the reproducing prsoperty of the kernel k (k is the reproducing kernel
of H ). In this chapter, we first understand that there is a one-to-one correspondence
between the kernel k and the RKHS H and that H0 is dense in H (via the Moore-
Aronszajn theorem). Furthermore, we introduce the RKHS represented by the sum
of RKHSs and apply it to Sobolev spaces. We prove Mercer’s theorem regarding
integral operators in the second half of this chapter and compute their eigenvalues
and eigenfunctions. This chapter is the core of the theory contained in this book, and
the later chapters correspond to its applications.
3.1 RKHSs
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 61
J. Suzuki, Kernel Methods for Machine Learning with Math and Python,
https://doi.org/10.1007/978-981-19-0401-1_3
62 3 Reproducing Kernel Hilbert Space
p
k(x, y) := ei (x)ei (y) (3.3)
i=1
p
e j (·), k(x, ·) H = e j , ei H ei (x) = e j (x)
i=1
p
for each 1 ≤ j ≤ p. Thus, for any f (·) = i=1 f i ei (·) ∈ H , f i ∈ R, we have
f (·), k(x, ·) H = f (x) (reproducing property). Therefore, H is an RKHS, and (3.3)
is a reproducing kernel.
Proposition 32 The reproducing kernel k of the RKHS H is unique, symmetric
k(x, y) = k(y, x), and nonnegative definite.
Proof: If k1 , k2 are RKHSs of H , then by the reproducing property, we have that
In other words,
f, k1 (x, ·) − k2 (x, ·) H = 0
k(x, y) = k(x, ·), k(y, ·) H = k(y, ·), k(x, ·) H = k(y, x) .
n
n
n
n n
n
z i z j k(xi , x j ) = z i z j k(xi , ·), k(x j , ·) H = z i k(xi , ·), z j k(x j , ·) H ≥ 0
i=1 j=1 i=1 j=1 i=1 j=1
.
Proposition 33 A Hilbert space H is an RKHS if and only if Tx ( f ) = f (x) ( f ∈ H )
is bounded at each x ∈ E for the linear functional Tx : H f → f (x) ∈ R.
Proof: If H has a reproducing kernel k, then at each x ∈ E, we have
Thus, we have
3.1 RKHSs 63
|Tx ( f )| = | f (·), k(x, ·) H | ≤
f
·
k(x, ·)
=
f
k(x, x) .
Example 52 (Linear Kernel) Let ·, · E be the inner product of E := Rd . Then, the
linear space
H := {x, · E |x ∈ E}
n
H := { αi k(xi , ·)|α1 , . . . , αn ∈ R}
i=1
f (·), g(·) H = a K b
for f (·), g(·) ∈ H , where f (·) = nj=1 a j k(x j , ·) ∈ H , a = [a1 , . . . , an ] ∈ Rn and
g(·) = nj=1 b j k(x j , ·) ∈ H , b = [b1 , . . . , bn ] ∈ Rn via the Gram matrix
⎡ ⎤
k(x1 , x1 ) · · · k(x1 , xn )
⎢ .. .. .. ⎥
K := ⎣. . . . ⎦ .
k(xn , x1 ) · · · k(xn , xn )
n
f (·), k(xi , ·) H = [a1 , . . . , an ]K ei = a j k(x j , xi ) = f (xi )
j=1
are isomorphic as an inner product space. Note that H has a reproducing kernel
E × E → R with
k(x, y) = e−i(x−y)t dη(t) .
E
In fact, we have k(x, y) = E e−i xt eiyt dη(t). Thus, if we set G(t) = e−i xt , we obtain
f (·), k(x, ·) H = F(t)G(t)dη(t) = F(t)ei xt dη(t) = f (x)
E E
3.1 RKHSs 65
for f (y) = F(t)eiyt dη(t) and k(x, y) = G(t)eiyt dη(t). For different kernels
E E
k(x, y), such as the Gaussian and Laplacian kernels, the measure η(t) will be differ-
ent, and the corresponding RKHS H will be different.
1
Example 56 Let E := [0, 1]. Using the real-valued function F with 0 F(u)2 du <
1
∞, we consider the set H of functions f : E → R, f (t) = 0 F(u)(t − u)0+ du,
where we denote (z)0+ = 1 and (z)0+ = 0 when z ≥ 0 and when z < 0, respectively.
1
The linear space H is complete for the norm
f
2 = 0 F(u)2 du (Proposition 14)
1 1
if the inner product is f, g H = 0 F(u)G(u)du for f (t) = 0 F(u)(t − u)0+ du
1
and g(t) = 0 G(u)(t − u)0+ du. This Hilbert space H is the RKHS for k(x, y) =
min{x, y}. In fact, for each z ∈ E, we see that
1 1 1
f (z), k(x, z) H = F(u)(z − u)0+ du, (x − u)0+ (z − u)0+ du H = F(u)(x − u)0+ du = f (x).
0 0 0
Thus far, we have obtained the RKHS corresponding to each positive definite
kernel, but a necessary condition exists for a Hilbert space H to be an RKHS. If that
condition is not satisfied, we can claim that it is not an RKHS.
Proposition 35 Let H be an RKHS consisting of functions on E. If limn→∞ | f n −
f
H = 0 f, f 1 , f 2 , . . . ∈ H , then for each x ∈ E, limn→∞ | f n (x) − f (x)| = 0 holds.
Proof: In fact, we have that for each x ∈ E,
| f n (x) − f (x)| ≤
f n − f
k(x, x) .
Example 57 illustrates that L 2 [0, 1] is too large, and as we will see in the next section,
the Sobolev space restricted to L 2 [0, 1] is an RKHS.
We first show that if k1 , k2 are reproducing kernels, the sum k1 + k2 is also a repro-
ducing kernel. To this end, we show the following.
Proposition 36 If H1 , H2 are Hilbert spaces, so is the direct product F := H1 × H2
under the inner product
66 3 Reproducing Kernel Hilbert Space
( f 1 , f 2 ), (g1 , g2 ) F := f 1 , g1 H1 + f 2 , g2 H2 (3.4)
for f 1 , g1 ∈ H1 , f 2 , g2 ∈ H2 .
Proof: From
( f 1 , f 2 )
2F =
f 1
2H1 +
f 2
H2 , we have
f 1,n − f 1,m
H1 ,
f 2,n − f 2,m
H2
≤
f 1,n − f 1,m
2H1 +
f 2,n − f 2,n
2H2 =
( f 1,n , f 2,n ) − ( f 1,m , f 2,m )
F .
Thus, we have
{( f 1,n , f 2,n )} is Cauchy
=
⇒ { f 1,n }, { f 2,n } is Cauchy
=
⇒ f 1 ∈ H1 , f 2 ∈ H2 exists such that f 1,n → f 1 , f 2,n → f 2
=
⇒
( f 1,n , f 2,n ) − ( f 1 , f 2 )
F =
( f 1,n − f 1 , f 2,n − f 2 )
F
=
f 1,n − f 1
2 +
f 2,n − f 2
2 → 0 ,
f, g H := v −1 ( f ), v −1 (g) F (3.5)
for f, g ∈ H forms an inner product. Note that N ⊥ is a closed subspace of the Hilbert
space F.
Proposition 37 If the direct sum H of Hilbert spaces H1 , H2 has the inner product
(3.5), then H is complete (a Hilbert space).
Proof: Since F is a Hilbert space (Proposition 36) and N ⊥ is its closed subset, N ⊥
is complete. Thus, we have
f n − f m
H → 0=
⇒
v −1 ( f n − f m )
F → 0
=
⇒ g ∈ F exists such that
v −1 ( f n ) − g
F → 0
=
⇒
f n − v(g)
H → 0, v(g) ∈ H .
Proposition 38 (Aronszajn [1]) Let k1 , k2 be the reproducing kernels of RKHSs
H1 , H2 , respectively. Then, k = k1 + k2 is the reproducing kernel of the Hilbert
space
H := H1 ⊕ H2 := { f 1 + f 2 | f 1 ∈ H1 , f 2 ∈ H2 }
f
2H = min {
f 1
2H1 +
f 2
2H2 } (3.6)
f = f 1 + f 2 , f 1 ∈H1 , f 2 ∈H2
for f ∈ H .
The proof proceeds as follows.
1. Let f ∈ H and N ⊥ ( f 1 , f 2 ) := v −1 ( f ). We define k(x, ·):=k1 (x, ·)+k2 (x, ·)
and (h 1 (x, ·), h 2 (x, ·)) := v −1 (k(x, ·)), and we show that
2. Using the above, we present the reproducing property f, k(x, ·) H = f (x) of
k.
3. We show that the norm of H is (3.6).
For details, see the Appendix at the end of this chapter.
In the following, we construct the Sobolev space as an example of an RKHS and
obtain its kernel.
Let W1 [0, 1] be the set of f ’s defined over [0, 1] such that f is differentiable
almost everywhere and f ∈ L 2 [0, 1]. Then, we can write each f ∈ W1 [0, 1] as
x
f (x) = f (0) + f (y)dy . (3.7)
0
Similarly, let Wq [0, 1] be the set of f ’s defined over [0, 1] such that f is differentiable
q − 1 times and q times almost everywhere and f (q) ∈ L 2 [0, 1]. If we define
xi
φi (x) := , i = 0, 1, . . .
i!
and
q−1
(x − y)+
G q (x, y) := ,
(q − 1)!
q−1 1
f (x) = f (i) (0)φi (x) + G q (x, y) f (q) (y)dy. (3.8)
i=0 0
and obtain (3.7) by repeatedly applying this integral to the right-hand side of (3.8).
For the transformation, we use
q−1
1 1
(x − y)+
G q (x, y)h(y)dy = h(y)dy
0 0 (q − 1)!
1
q−1
q −1
x
= xi (−y)q−1−i h(y)dy .
(q − 1)! i=0
i 0
q−1 1
αi φi (x) + G q (x, y)h(y)dy (3.9)
i=0 0
H0 := span{φ0 , . . . , φq−1 },
q−1
f, g H0 = f (i) (0)g (i) (0)
i=0
for f, g ∈ H0 . We find that the inner product ·, · H0 satisfies the requirement of inner
products and that {φ0 , . . . , φq−1 } is an orthonormal basis. Since the inner product
space H0 is of finite dimensionality, it is apparently a Hilbert space. We define another
inner product space H1 as
1
H1 := { G q (x, y)h(y)dy|h ∈ L 2 [0, 1]} .
0
f m − f n
H1 → 0 ⇐⇒
f m(q) − f n(q)
L 2 [0,1] → 0,
f n − f
H1 → 0 ⇐⇒
f n(q) − f (q)
L 2 [0,1] → 0 .
q−1
f (x) = ⇒h = f (q) = 0
αi φi (x) ∈ H1 =
i=0
and
1
f (x) = ⇒α0 = f (0) = 0, . . . , αq−1 = f (q−1) (0) = 0 ,
G q (x, y)h(y)dy ∈ H0 =
0
q−1
k0 (x, y) := φi (x)φi (y)
i=0
and 1
k1 (x, y) := G q (x, z)G q (y, z)dz ,
0
for x, y ∈ E.
Let (E, F, μ) be a measure space. We assume that the integral operator kernel
K : E × E → Risa measurable function and is not necessarily nonnegative definite.
E×E K (x, y)dμ(x)dμ(y) takes finite values. Then, we define
2
Suppose that
the integral operator TK by
(TK f )(·) := K (x, ·) f (x)dμ(x) (3.10)
E
TK f
2 = {(TK f )(x)}2 dμ(x) ≤ {K (x, y)}2 dμ(x)dμ(y) { f (z)}2 dμ(z)
E E×E E
=
f
2
{K (x, y)} dμ(x)dμ(y) ,
2
E×E
3.3 Mercer’s Theorem 71
Proof: By Proposition 12, for an arbitrary > 0, there exist n() ≥ 1 and an R-
n()
coefficient bivariate polynomial K n() (x, y) := i=1 gi (x)y i whose order of y is at
most n() such that
sup |K (x, y) − K n() (x, y)| < ,
x,y∈E
as
TK n f : H f → [ f (x)g0 (x)dμ(x), . . . , f (x)gn (x)dμ(x)] ∈ Rn+1 .
E E
Since the rank of TK n is finite, from the first item of Proposition 25, TK n is a compact
operator. Moreover, since
(TK n − TK ) f
2 = ( [K n (x, y) − K (x, y)] f (y)dμ(y))2 dμ(x) ≤ 2
f
2 μ2 (E) ,
E E
using {λ j } and {e j } that satisfy Proposition 27. Moreover, Lemma 1 implies the
following:
Lemma 2
e j (y) = λ−1
j K (x, y)e j (x)dμ(x)
E
i.e.,
x 1
ye(y)dy + x e(y)dy = λe(x) .
0 x
i.e., 1
e(y)dy = λe (x) . (3.12)
x
If we further differentiate both sides by x, then we obtain e(x) = −λe (x) and
√ √
e(y) = α sin(y/ λ) + β cos(y/ λ) .
4
λj = , (3.13)
{(2 j − 1)π }2
if we regard the finite measure μ in (3.10) of the integral operator kernel as a Gaus-
sian distribution with a mean of 0 and a variance of σ̂ 2 ; then, the eigenvalue and
eigenfunction are
2a j
λj = B
A
and √
e j (x) = exp(−(c − a)x 2 )H j ( 2cx) ,
dj
H j (x) := (−1) j exp(x 2 ) exp(−x 2 ) ,
dx j
√
a −1 := 4σ̂ 2 , b−1 := 2σ 2 , c := a 2 + 2ab, A := a + b + c, and B := b/A. The
proof is not difficult but rather monotonous and long. See the Appendix at the end
of this chapter for details. Note that for a Gaussian kernel with a parameter σ 2 , if the
measure is also a Gaussian distribution with a mean of 0 and a variance of σ̂ 2 , we
σ̂ 2 b
can compute the eigenvalues from β := 2 = :
σ 2a
2a j 2a b
B = √ ( √ )j
A a + b + a + 2ab a + b + a 2 + 2ab
2
β
= [1/2 + β + 1/4 + β]−1/2 ( √ )j ,
1/2 + β + 1/4 + β
The Hermite polynomials are H1 (x) = 2x, H2 (x) = −2 + 4x 2 , and H3 (x) = 12x −
8x 3 (H0 (1) = 1, H j (x) = 2x H j−1 (x) − H j−1 (x)), and the other quantities are
√
5 1
c = a 2 + 2ab = (4σ̂ 2 )−1 1 + 4σ̂ 2 /σ 2 = , a = (4σ̂ 2 )−1 = .
4 4
We show the eigenfunction φ j for j = 1, 2, 3 in Fig. 3.1. The code is as follows.
3-start
# I n t h i s c h a p t e r , we assume t h a t t h e f o l l o w i n g h a s b e e n e x e c u t e d .
import numpy a s np
import m a t p l o t l i b . p y p l o t a s p l t
from m a t p l o t l i b import s t y l e
s t y l e . u s e ( "seaborn−ticks" )
74 3 Reproducing Kernel Hilbert Space
def Hermite ( j ) :
i f j == 0 :
return [ 1 ]
a = [ 0] ∗ ( j + 2)
b = [0] ∗ ( j + 2)
a [0] = 1
f o r i i n range ( 1 , j + 1 ) :
b [ 0 ] = −a [ 1 ]
f o r k i n range ( i + 1 ) :
b [ k ] = 2 ∗ a [ k − 1] − ( k + 1) ∗ a [ k + 1]
f o r h i n range ( j + 2 ) :
a[h] = b[h]
return b [ : ( j +1) ]
[0, 2]
H e r m i t e ( 2 ) # 2 nd o r d e r H e r m i t e P o l y n o m i a l
[-2, 0, 4]
[0, -12, 0, 8]
d e f H( j , x ) :
coef = Hermite ( j )
S = 0
f o r i i n range ( j + 1 ) :
S = S + np . a r r a y ( c o e f [ i ] ) ∗ ( x ∗∗ i )
return S
c c = np . s q r t ( 5 ) / 4
a = 1/4
def phi ( j , x ) :
r e t u r n np . exp ( − ( c c − a ) ∗ x ∗ ∗ 2 ) ∗ H( j , np . s q r t ( 2 ∗ c c ) ∗ x )
p l t . y l i m ( −2 , 8 )
p l t . y l a b e l ( "phi" )
p l t . t i t l e ( "CharacteristicfunctionofGaussKernel" )
In this section, we prove Mercer’s theorem for integral operators and illustrate
some examples. Hereafter, we assume that K and TK are nonnegative definite.
By absolute convergence, we mean that the sum of the absolute values converges,
and by uniform convergence, we mean that the upper bound of the error that does
not depend on x, y ∈ E converges to zero.
Proof: Note that K n (x, y) := K (x, y) − nj=1 λ j e j (x)e j (y) is continuous and that
the integral operator TK n is nonnegative definite. In fact, for each f ∈ L 2 (E,
F, μ), we have
n ∞
TK n f, f = TK f, f − λ j f, e j 2 = λ j f, e j 2 ≥ 0 .
j=1 j=n+1
j =2
eigenfunctions are even and
4
j =3
odd functions, respectively
2
φj
-6 -4 -2 0
-2 -1 0 1 2
x
76 3 Reproducing Kernel Hilbert Space
Thus, from Proposition 40, K n is nonnegative definite, and K n (x, x) ≥ 0. Thus, for
all x ∈ E, we have
∞
λ j e2j (x) ≤ K (x, x) . (3.15)
j=1
Example 60 (The Kernel Expressed by the Difference Between Two Variables) Let
E = [−1, 1]. An integral operator for which K :E × E → R can be expressed by
K (x, z) = φ(x − z) (φ : E → R) is TK f (x) = E φ(x − y) f (y)dy, which can be
expressed by (φ ∗ f )(x) using convolution: (g ∗ h)(u) = E g(u − v)h(v). Here-
after, we assume that the cycle of φ is two, i.e., φ(x) = φ(x + 2Z). In this case,
e j (x) = cos(π j x) is the eigenfunction of TK . In fact, since φ is an even function and
is cyclic, we have
3.3 Mercer’s Theorem 77
1−x
TK e j (x) = φ(x − y) cos(π jy)dy = φ(−u) cos(π j (x + u))du = φ(u) cos(π j (x + u))du
E −1−x E
and
TK e j (x) = { φ(u) cos(π ju)du} cos(π j x) − { φ(u) sin(π ju)du} sin(π j x)
E E
= λ j cos(π j x)
from the Addition theorem cos(π j (x + u)) = cos(π j x) cos(π ju) − sin(π j x)
sin(π ju), where λ j = E φ(u) cos(π ju)du. Similarly, sin(π j x) is an eigenfunction,
and λ j is the corresponding eigenvalue. Thus, from Mercer’s theorem, we have
∞
∞
K (x, y) = λ j {cos(π j x) cos(π jy) + sin(π j x) sin(π jy)} = λ j cos{π j (x − y)} .
j=0 j=0
TK φ j = λφ j
and
φ j φk dμ = δ j.k .
E
1
m
K (x j , y)φi (x j ) = λi φi (y) , y ∈ E (3.18)
m j=1
i = 1, 2, . . .. Since we have
1
m
φ j (xi )φk (xi ) = δ j,k
m i=1
where K m ∈ Rm×m is the Gram matrix and is the diagonal matrix with the elements
√ λi(m)
λ(m)
1 = mλ 1 , . . . , λ (m)
m = mλ m . If we substitute φ i (x j ) = mU j,i , λi = into
m
(3.18), we obtain
√ m
m
φi (·) = (m) K (x j , ·)U j,i . (3.19)
λi j=1
0 20 40 60 80 100
# Eigenvalues
Fig. 3.2 The eigenvalues obtained in Example 62. We compare the cases involving m = 1000
samples and the first m = 300 samples. The largest eigenvalues for both cases coincide
# Kernel D e f i n i t i o n
sigma = 1
def k ( x , y ) :
r e t u r n np . exp ( − ( x − y ) ∗∗2 / s i g m a ∗ ∗ 2 )
# G e n e r a t e S a m p l e s and D e f i n e t h e Gram M a t r i x
m = 300
x = np . random . r a n d n (m) − 2 ∗ np . random . r a n d n (m) ∗∗2 + 3 ∗ np . random . r a n d n (m)
∗∗3
# E i g e n v a l u e s and E i g e n v e c t o r s
K = np . z e r o s ( ( m, m) )
f o r i i n range (m) :
f o r j i n range (m) :
K[ i , j ] = k ( x [ i ] , x [ j ] )
v a l u e s , v e c t o r s = np . l i n a l g . e i g (K)
lam = v a l u e s / m
a l p h a = np . z e r o s ( ( m, m) )
f o r i i n range (m) :
a l p h a [ : , i ] = v e c t o r s [ i , : ] ∗ np . s q r t (m) / ( v a l u e s [ i ] + 10 e − 16)
# D i s p l a y Graph
def F ( y , i ) :
S = 0
f o r j i n range (m) :
S = S + alpha [ j , i ] ∗ k ( x [ j ] , y )
return S
i = 1 ## Execute i t changing i
d e f G( y ) :
return F ( y , i )
w = np . l i n s p a c e ( − 2 , 2 , 1 0 0 )
p l t . p l o t (w, G(w) )
p l t . t i t l e ( "EigenValuesandtheirEigenFunctions" )
Finally, we present the RKHS obtained from Mercer’s theorem (Proposition 41).
In Example 57, we pointed out that the condition was too loose for the L 2 -space to
be an RKHS. The following proposition suggests the restrictions that we should add.
80 3 Reproducing Kernel Hilbert Space
Eigenfunction
1.5
1.0
0.5
-2 -1 0 1 2 -2 -1 0 1 2
x x
m = 1000 m = 1000
Eigenfunction
-2 -1 0 1 2 -2 -1 0 1 2
x x
Fig. 3.3 The eigenfunctions obtained in Example 62. We show a comparison between the functions
of the m = 1000 samples and the first m = 300 samples. The eigenfunctions coincide for the first
largest three eigenvalues, but they are far from each other for the fourth eigenvalue. However, the
fourth eigenvalues coincide
∞
f (x)e j (x)dη(x) g(x)e j (x)dη(x)
f, g H := E E
(3.20)
j=1
λj
Proof: From the definition of the inner product (3.20), we can write ei , e j H =
1
δ
λi i, j
. Thus, we have
3.3 Mercer’s Theorem 81
∞ ∞
β 2j
{ β j e j (x)}2 dβ(x) < ∞ ⇐⇒ <∞,
E j=1 j=1
λj
∞
and H is a Hilbert space. From Mercer’s theorem, we can write k(x, ·) = j=1 λjej
(x)e j (·), so we have
∞
∞
{λ j e j (x)}2
= λ j e j (x)e j (x) = k(x, x) < ∞
j=1
λj j=1
and k(x, ·) ∈ H . Finally, since E k(·, y)e j (y)dη(y) = λ j e j (·), we have
∞
1
f, k(·, x) H = f (y)e j (y)dη(y) k(x, y)e j (y)dη(y)
λ
j=1 j E E
∞
= { f (y)e j (y)dη(y)}e j (x) = f (x),
j=1 E
Appendix
Proof of Proposition 34
Let k : E × E → R be the positive definite kernel of a Hilbert space H . We show
that for the linear space H0 spanned by k(x, ·), x ∈ E, the bivariate function
m
n
f, g H0 = ai b j k(xi , y j )
i=1 j=1
m
n
f (·) = ai k(xi , ·) and g(·) = b j k(y j , ·) ∈ H0 . (3.21)
i=1 j=1
m
m
f, g H0 = ai g(xi ) = b j f (x j )
i=1 j=1
82 3 Reproducing Kernel Hilbert Space
m
n
f
2 = ai a j k(xi , x j ) ≥ 0 .
i=1 j=1
Moreover, from
| f (x)| = | f (·), k(x, ·) H0 | ≤
f
H0 k(x, x),
we have
f
H0 = 0= ⇒ f = 0. In the following, we construct the linear space H
obtained by completing H0 .
Let { f n } be a Cauchy sequence in H0 . For an arbitrary x ∈ E and m, n ≥ 1, we
have
| f m (x) − f n (x)| ≤
f m − f n
H0 k(x, x),
and { f n (x)} is Cauchy. Since this sequence is a real sequence, it has a convergence
point for each x ∈ E. In the following, let H be the set of f : E → R such that the
{ f n (x)} for which { f n } is Cauchy in H0 converges to f (x) for each x ∈ E. In general,
H0 is a subset of H . In the following, we define an inner product in H and prove that
H is an RKHS with a reproducing kernel k.
Lemma 4 Suppose that { f n } is a Cauchy sequence in H0 . If the sequence { f n (x)}
converges to 0 for each x ∈ E, then we have
lim
f n
H0 = 0 .
n→∞
Proof of Lemma 4: Since a Cauchy sequence is bounded (Example 26), there exists
a B > 0 such that
f n
< B, n = 1, 2, . . .. Moreover, since the above sequence
is a Cauchy sequence, for an arbitrary > 0,p there exists an N such that n >
N= ⇒
f n − f N
< /B. Thus, for f N (x) = i=1 αi k(xi , x) ∈ H0 , αi ∈ R, xi ∈ E,
and i = 1, 2, . . ., we have that when n > N
p
f n
2H0 = f n − f N , f n H0 + f N , f n H0 ≤
f n − f N
H0
f n
H0 + αi | f n (xi )| .
i=1
Each of the first and second terms is at most since we have f n (xi ) → 0 as n → ∞
for each i = 1, . . . , p. Hence, we have Lemma 4.
For Cauchy sequences { f n }, {gn } in H0 , we define f, g ∈ H such that { f n (x)},
{gn (x)} converge to f (x), g(x), respectively, for each x ∈ E. Then, { f n , gn H0 } is
Cauchy:
| f n , gn H0 − f m , gm H0 | = | f n , gn − gm H0 + f n − f m , gm H0 |
≤
f n
H0
gn − gm
H0 +
f n − f m
H0
gm
H0 .
Appendix 83
Since { f n , gn H0 } is real and Cauchy, it converges (Proposition 6). The inner product
obtained by convergence depends only on f (x), g(x) (x ∈ E).
Let { f n }, {gn } be other Cauchy sequences in H0 that converge to f, g for each
x ∈ E. Then, { f n − f n }, {gn − gn } are Cauchy sequences that converge to 0 for each
x ∈ E, and from Lemma 4, we have
f n − f n
H0 ,
gn − gn
H0 → 0 as n → ∞,
which means that
f, g H := lim f n , gn H0 .
n→∞
To show that this expression satisfies the definition of an inner product, we assume
that
f
H = f, f H = 0. Then, for each x ∈ E, as n → ∞, from
| f n (x)| = | f n (·), k(x, ·)| ≤ k(x, x)
f n
H0 → 0,
n → ∞, and H0 is dense in H .
We show that H is complete. Let { f n } be a Cauchy sequence in H . From denseness,
there exists a sequence { f n } in H0 such that
f n − f n
H → 0 (3.23)
f n − f n
H ,
f m − f m
H ,
f n − f m
H < /3 and
f n − f m
H0 =
f n − f m
H ≤
f n − f n
H +
f n − f m
H +
f m − f m
H ≤
f − f n
H → 0. Combining this with (3.23), we obtain
f − f n
H ≤
f − f n
H +
f n − f n
H → 0
as n → ∞. Hence, H is complete.
84 3 Reproducing Kernel Hilbert Space
Next, we show that k is the corresponding reproducing kernel of the Hilbert space
H . Property (3.1) holds immediately because k(x, ·) ∈ H0 ⊆ H , x ∈ E. For another
property (3.2), since f ∈ H is a limit of the Cauchy sequence { f n } in H0 at x ∈ E,
we have
Finally, we show that such an H uniquely exists. Suppose that G exists and shares
the same properties possessed by H . Since H is a closure of H0 , G should contain
H as a subspace. Since H is closed, from (2.11), we write G = H ⊕ H ⊥ . However,
since k(x, ·) ∈ H , x ∈ E and f (·), k(x, ·)G = 0 for f ∈ H ⊥ , we have f (x) = 0,
x ∈ E, which means that H ⊥ = {0}.
Proof of Proposition 38
From our assumption, we have k(x, ·) = k1 (x, ·) + k2 (x, ·) ∈ H for each x ∈ E.
We define N ⊥ (h 1 (x, ·), h 2 (x, ·)) := v −1 (k(x, ·)) for each x ∈ E, where h 1 (x, ·),
h 2 (x, ·) are elements in H1 , H2 for x ∈ E, but h 1 , h 2 are not necessarily reproducing
kernels k1 , k2 of H1 , H2 , respectively. Since k(x, ·) = k1 (x, ·) + k2 (x, ·), we have
0 = 0, f H = z, ( f 1 , f 2 ) F
( f 1 , f 2 )
2F =
v −1 ( f )
2F +
(g1 , g2 )
2F .
f
2H =
v −1 ( f )
2F ≤
( f 1 , f 2 )
2F =
f 1
2H1 +
f 2
2H2 ,
Proof of Example 59
We use the equality [10]
∞ √ αy
exp(−(x − y)2 )H j (αx)d x = π(1 − α 2 ) j/2 H j ( ).
−∞ (1 − α 2 )1/2
Suppose that E p(y)dy = 1. If we have
k(x, y)φ j (y) p(y)dy = λφ j (x),
E
then
k̃(x, y)φ̃ j (y)dy = λφ̃ j (x)
E
for k̃(x, y) := p(x)1/2 k(x, y) p(y)1/2 , φ̃ j (x) := p(x)1/2 φ j (x). Thus, it is sufficient
to show that we obtain the right-hand side by substituting
2a
p(x) := exp(−2ax 2 )
π
2a
k̃(x, y) := exp(−ax 2 ) exp(−b(x − y)2 ) exp(−ay 2 )
π
2a 1/4 √
φ̃ j (x) := ( ) exp(−cx 2 )H j ( 2cx)
π
into the left-hand side for E = (−∞, ∞). The left-hand side becomes
∞ √
2a
( )3/4 exp(−ax 2 ) exp(−b(x − y)2 ) exp(−ay 2 ) exp(−cy 2 )H j ( 2cy)dy
−∞ π
2a 3/4 ∞ b b2 √
= ( ) exp{−(a + b + c)(y − x)2 + [ − (a + b)]x 2 }H j ( 2cy)dy
π −∞ a + b + c a + b + c
∞ √
2a 3/4 b 2c dz
= ( ) exp(−cx 2 ) exp{−(z − √ x)2 }H j ( √ z) √
π −∞ a+b+c a+b+c a+b+c
2a 1/4 2a √ 2c √
= ( ) exp(−cx 2 ) π (1 − ) j/2 H j ( 2cx)
π π(a + b + c) a+b+c
2a b 2a √ 2a j
= ( ) j ( )1/4 exp(−cx 2 )H j ( 2cx) = B φ̃ j (x),
a+b+c a+b+c π A
√
√ 2c
where we define z := y a + b + c, α := √ and use
a+b+c
86 3 Reproducing Kernel Hilbert Space
2c a+b−c (a + b)2 − c2 b
(1 − α ) 2 1/2
= 1− = = = .
a+b+c a+b+c (a + b + c)2 a+b+c
Proof of Proposition 40
Since K is uniformly continuous, if d is the distance E × E, there exists a δn such
that
⇒|K (x1 , y1 ) − K (x2 , y2 )| < n −1
d((x1 , y1 ), (x2 , y2 )) < δn =
1
max |K (x, y) − K n (x, y)| < .
(x,y)∈E×E n
and
m
m
TK n f, f = K (vi , v j ) f (x)dμ(x) f (y)dμ(y)
i=1 j=1 E n,i E n, j
m
m
max z i z j K (xi .y j ) < 0
x h ,yh ∈E h ,h=1,...,m
i=1 j=1
and μ(E 1 ), . . . , μ(E m ) > 0. However, from the mean value theorem, we have
m
m
TK f, f := z i z j {μ(E i )μ(E j )}−1 k(x, y)dμ(x)dμ(y) < 0 .
i=1 j=1 Ei Ej
m −1
for f = i=1 z i {μ(E i )} I Ei , which contradicts the fact that TK is positive definite.
Appendix 87
Proof of Lemma 3
We assume that f n (x) monotonically increases as n grows for each x ∈ E. Let > 0
be arbitrary. For each x ∈ E, let n(x) be the minimum n such that | f n (x) − f (x)| <
. From continuity, for each x ∈ E, we set U (x) so that
Then, we have
f (y) − f n(x) (y) ≤ f (x) + − f n(x) (y) ≤ f n(x) (x) + 2 − f n(x) (y) ≤ | f n(x) (x) − f n(x) (y)| + 2 < 3 .
Exercises 31∼45
31. Proposition 34 can be derived according to the following steps. Which part of
the proof in the appendix does each step correspond to?
(a) Define the inner product ·, · H0 of H0 := span{k(x, ·) : x ∈ E}.
(b) For any Cauchy sequence { f n } in H0 and each x ∈ E, the real sequence
{ f n (x)} is Cauchy, so it converges to a f (x) := lim f n (x) (Proposition 6).
n→∞
Let H be such a set of f s.
(c) Define the inner product ·, · H of the linear space H .
(d) Show that H0 is dense in H .
(e) Show that any Cauchy sequence { f n } in H converges to some element of H
as n → ∞ (completeness of H ).
(f) Show that k is a reproducing kernel of H .
(g) Show that such an H is unique.
1
32. In Examples 55 and 56, the inner product is f, g H = 0 F(u)G(u)du, and the
RKHS is
H = {E x → F(t)J (x, t)dη(t) ∈ R|F ∈ L 2 (E, η)} .
E
What are the J (x, t) in Examples 55 and 56? Also, how is the kernel k(x, y)
represented in general by using J (x, t)?
33. Proposition 38 can be derived according to the following steps. Which part of
the proof in the appendix does each step correspond to?
88 3 Reproducing Kernel Hilbert Space
(b) Using (a), prove the reproducing property of k: f, k(x, ·) H = f (x).
(c) Show that the norm of H is (3.6)
34. Show that each f ∈ Wq [0, 1] can be the Taylor series expanded by
q−1 1
f (x) = f (i) (0)φi (x) + G q (x, y) f (q) (y)dy
i=0 0
using
xi
φi (x) := , i = 0, 1, . . .
i!
and
q−1
(x − y)+
G q (x, y) := .
(q − 1)!
q−1
H0 = { αi φi (x)|α0 , . . . , αq−1 ∈ R}
i=0
1
H1 = { G q (x, y)h(y)dy|h ∈ L 2 [0, 1]}
0
(You need to show the inclusion relation on both sides of the set). In addition,
show that H0 ∩ H1 = {0}.
36. We consider the integral operator Tk of k(x, y) = min{x, y}, in L 2 [0, 1], where
x, y ∈ E = [0, 1]. Substitute
4
λj =
{(2 j − 1)π }2
√ (2 j − 1)π
e j (x) = 2 sin x
2
38. In Example 59, the following program obtains eigenvalues and eigenfunctions
under the assumption that σ 2 = σ̂ 2 = 1. We can change the program to set the
values of σ 2 , σ̂ 2 in ## and add σ 2 , σ̂ 2 as an argument to the function phi in ###
and run it to output a graph.
d e f H( j , x ) :
i f j == 0 :
return 1
e l i f j == 1 :
return 2 ∗ x
e l i f j == 2 :
r e t u r n −2 + 4 ∗ x ∗∗2
else :
r e t u r n 4 ∗ x − 8 ∗ x ∗∗3
c c = np . s q r t ( 5 ) / 4
a = 1/4 ##
def phi ( j , x ) : # ##
r e t u r n np . exp ( − ( c c − a ) ∗ x ∗ ∗ 2 ) ∗ H( j , np . s q r t ( 2 ∗ c c ) ∗ x )
40. In Example 58, suppose that the period of φ is 2π instead of 2. What are the
eigenvalues and eigenfunctions of Tk ? Additionally, derive the kernel k.
41. What eigenequations should be solved in Example 61 when m = 3, d = 1?
42. Define and execute the following part of the program in Example 62 as a function.
The input for this includes data x, a kernel k, and the i of the ith eigenvalue. The
output is a function F.
K = np . z e r o s ( ( m, m) )
f o r i i n range (m) :
f o r j i n range (m) :
K[ i , j ] = k ( x [ i ] , x [ j ] )
v a l u e s , v e c t o r s = np . l i n a l g . e i g (K)
90 3 Reproducing Kernel Hilbert Space
lam = v a l u e s / m
a l p h a = np . z e r o s ( ( m, m) )
f o r i i n range (m) :
a l p h a [ : , i ] = v e c t o r s [ : , i ] ∗ np . s q r t (m) / ( v a l u e s [ i ] + 10 e − 16)
def F ( y , i ) :
S = 0
f o r j i n range (m) :
S = S + alpha [ j , i ] ∗ k ( x [ j ] , y )
return S
43. In Example 62, for the Gaussian kernel, random numbers are generated accord-
ing to the normal distribution, and we obtain the corresponding eigenvalues and
eigenfunctions. When the number of samples is large, theoretically, the eigenval-
ues are reduced exponentially (Example 59). What happens with the polynomial
kernel k(x, y) = (1 + xy)2 when m = 2 and d = 1? Output the eigenvalues and
eigenfunctions as the Gaussian kernel.
44. If we construct (3.19) using the solution of K m U = U , show that the result is
a solution of (3.18) and that it is orthogonal with
a magnitude of 1.
45. In Proposition 42, β j should originally satisfy ∞ β
j=1 j
2
< ∞. However, this is
not stated in the assertion of Proposition 42. Why is this the case?
Chapter 4
Kernel Computations
In Chap. 1, we learned that the kernel k(x, y) ∈ R represents the similarity between
two elements x, y in a set E. Chapter 3 described the relationships between a kernel
k, its feature map E x → k(x, ·) ∈ H , and its reproducing kernel Hilbert space
H . In this chapter, we consider k(x, ·) to be a function of E → R for each x ∈ E,
and we perform data processing for N actual data pairs (x1 , y1 ), . . . , (x N , y N ) of
covariates and responses. The xi , i = 1, . . . , N (row vectors) are p-dimensional and
given by the matrix X ∈ R N × p . The responses yi (i = 1, . . . , N ) may be real or
binary. This chapter discusses kernel ridge regression, principal component analysis,
support vector machines (SVMs), and splines, and we find the f ∈ H that minimizes
the objective function under
N various constraints. It is known that we can write the
optimal f in the form i=1 αi k(xi , ·) (representation theorem), and the problem
reduces to finding the optimal α1 , . . . , α N .
In the second half, we address the problem of computational complexity. The
computation of a kernel takes more than O(N 3 ), and real-time calculation is hard
when N is greater than 1000. In particular, we consider how to reduce the rank of the
Gram matrix K . Specifically, we learn actual procedures for random Fourier features,
Nyström approximation, and incomplete Cholesky decomposition.
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 91
J. Suzuki, Kernel Methods for Machine Learning with Math and Python,
https://doi.org/10.1007/978-981-19-0401-1_4
92 4 Kernel Computations
1
N
x̄ j = xi, j and that the matrix X X is nonsingular, we can obtain the solution
N i=1
as β̂ = (X X )−1 X y from X = (xi, j ) and y = (yi ). In the following, we prepare
a kernel k : E × E → R and consider the problem of finding the f ∈ H that mini-
mizes
N
L := (yi − f (xi ))2 .
i=1
M := span({k(xi , ·)}i=1
N
)
and
M ⊥ = { f ∈ H | f, k(xi , ·)
H = 0, i = 1, . . . , N } .
N
N
N
N
(yi − f (xi ))2 = (yi − f 1 (xi ))2 = (yi − αk(x j , xi ))2 (4.1)
i=1 i=1 i=1 j=1
N
N
L= {yi − α j k(x j , xi )}2 = y − K α2 , (4.2)
i=1 j=1
n
fˆ(x) = α̂i k(xi , x).
i=1
4.1 Kernel Ridge Regression 93
# We i n s t a l l s k f d a module b e f o r e h a n d
pip i n s t a l l cvxopt
# I n t h i s c h a p t e r , we assume t h a t t h e f o l l o w i n g h a v e b e e n e x e c u t e d .
import numpy a s np
import p a n d a s a s pd
from s k l e a r n . d e c o m p o s i t i o n import PCA
import c v x o p t
from c v x o p t import s o l v e r s
from c v x o p t import m a t r i x
import m a t p l o t l i b . p y p l o t a s p l t
from m a t p l o t l i b import s t y l e
s t y l e . u s e ( "seaborn−ticks" )
from numpy . random import r a n d n # G a u s s i a n random numbers
from s c i p y . s t a t s import norm
def alpha ( k , x , y ) :
n = len ( x )
K = np . z e r o s ( ( n , n ) )
f o r i i n range ( n ) :
f o r j i n range ( n ) :
K[ i , j ] = k ( x [ i ] , x [ j ] )
r e t u r n np . l i n a l g . i n v (K + 10 e −5 ∗ np . i d e n t i t y ( n ) ) . d o t ( y )
# Add 10^( − 5) I t o K f o r making i t i n v e r t i b l e
Example 63 Utilizing the function alpha, we execute kernel regression via poly-
nomial and Gaussian kernels for n = 50 data (λ = 0.1). We present the output in
Fig. 4.1.
d e f k_p ( x , y) : # Kernel D e f i n i t i o n
return ( np . d o t ( x . T , y ) + 1 ) ∗∗3
d e f k_g ( x , y) : # Kernel D e f i n i t i o n
return np . exp ( − ( x − y ) ∗∗2 / 2 )
lam = 0 . 1
n = 5 0 ; x = np . random . r a n d n ( n ) ; y = 1 + x + x ∗∗2 + np . random . r a n d n ( n ) # Data
Generation
a l p h a _ p = a l p h a ( k_p , x , y )
a l p h a _ g = a l p h a ( k_g , x , y )
z = np . s o r t ( x ) ; u = [ ] ; v = [ ]
f o r j i n range ( n ) :
S = 0
f o r i i n range ( n ) :
S = S + a l p h a _ p [ i ] ∗ k_p ( x [ i ] , z [ j ] )
u . append ( S )
S = 0
f o r i i n range ( n ) :
S = S + a l p h a _ g [ i ] ∗ k_g ( x [ i ] , z [ j ] )
v . append ( S )
Kernel Regression
5
Polynomial Kernel
Gaussian Kernel
4
3
2
y
1
0
-1
Fig. 4.1 We execute kernel regression by using polynomial and Gaussian kernels
We cannot obtain the solution of a linear regression problem when the rank of X
is smaller than p, i.e., N < p. Thus, we often minimize
N
(yi − xi β)2 + λβ22
i=1
for cases in which λ > 0. We call such a modification of linear regression a ridge. The
β to be minimized is given by (X X + λI )−1 X y. In fact, we derive the formula
by differentiating
y − Xβ2 + λβ β
−X (y − Xβ) + λβ = 0 .
N
L
:= (yi − f (xi ))2 + λ f 2H . (4.3)
i=1
f 2H = f 1 2H + f 2 2H + 2 f 1 , f 2
H = f 1 2H + f 2 2H ≥ f 1 2H . (4.4)
N
L
≥ (yi − f 1 (xi ))2 + λ f 1 2H .
i=1
N
N
N
N
f 1 2H = αi k(xi , ·), α j k(x j , ·)
H = αi α j k(xi , ·), k(x j , ·)
H = α K α
i=1 j=1 i=1 j=1
y − K α2 + λα K α . (4.5)
−K (y − K α) + λK α = 0 .
If K is nonsingular, we have
α̂ = (K + λI )−1 y . (4.6)
Finally, if we use the fˆ ∈ H that minimizes the (4.3) obtained thus far, we can
predict the value of y given a new x ∈ R p via
n
fˆ(x) = α̂i k(xi , x) .
i=1
def alpha ( k , x , y ) :
n = len ( x )
K = np . z e r o s ( ( n , n ) )
f o r i i n range ( n ) :
f o r j i n range ( n ) :
K[ i , j ] = k ( x [ i ] , x [ j ] )
r e t u r n np . l i n a l g . i n v (K + lam ∗ np . i d e n t i t y ( n ) ) . d o t ( y )
96 4 Kernel Computations
5
polynomial and Gaussian
kernels
4
3
2
y
1
0
-1
Example 64 Using the function alpha, we execute kernel ridge regression for poly-
nomial and Gaussian kernels and n = 50 data(λ = 0.1). We show the outputs in
Fig. 4.2.
d e f k_p ( x , y) : # Kernel D e f i n i t i o n
return ( np . d o t ( x . T , y ) + 1 ) ∗∗3
d e f k_g ( x , y) : # Kernel D e f i n i t i o n
return np . exp ( − ( x − y ) ∗∗2 / 2 )
lam = 0 . 1
n = 5 0 ; x = np . random . r a n d n ( n ) ; y = 1 + x + x ∗∗2 + np . random . r a n d n ( n ) #
Data G e n e r a t i o n
a l p h a _ p = a l p h a ( k_p , x , y )
a l p h a _ g = a l p h a ( k_g , x , y )
z = np . s o r t ( x ) ; u = [ ] ; v = [ ]
f o r j i n range ( n ) :
S = 0
f o r i i n range ( n ) :
S = S + a l p h a _ p [ i ] ∗ k_p ( x [ i ] , z [ j ] )
u . append ( S )
S = 0
f o r i i n range ( n ) :
S = S + a l p h a _ g [ i ] ∗ k_g ( x [ i ] , z [ j ] )
v . append ( S )
p l t . s c a t t e r ( x , y , f a c e c o l o r s =’none’ , e d g e c o l o r s = "k" , m a r k e r = "o" )
p l t . p l o t ( z , u , c = "r" , l a b e l = "PolynomialKernel" )
p l t . p l o t ( z , v , c = "b" , l a b e l = "GaussKernel" )
p l t . x l i m ( −1 , 1 )
p l t . y l i m ( −1 , 5 )
p l t . x l a b e l ( "x" )
p l t . y l a b e l ( "y" )
p l t . t i t l e ( "KernelRidge" )
p l t . l e g e n d ( l o c = "upperleft" , f r a m e o n = True , p r o p ={’size’ : 1 4 } )
4.2 Kernel Principle Component Analysis 97
We review the procedure of principal component analysis (PCA) when we do not use
any kernel. We centralize each of the columns in the matrix X and vector y. We first
compute the v1 := v ∈ R p that maximizes v X X v under v v = 1. Similarly, for
i = 2, . . . , p, we repeatedly compute vi with the v v = 1 that maximizes v X X v
and is orthogonal to v1 , · · · , vi−1 ∈ R p . In the actual cases, we do not use all of
the v1 , · · · , v p but compress R p to the v1 , · · · , vm (1 ≤ m ≤ p) with the largest
eigenvalues. We compute the v ∈ R p that maximizes
v X X v − μ(v v − 1) (4.7)
xvm
for each row vector x ∈ R p using the obtained v1 , . . . , vm ∈ R p . We call such a value
the score of x, which is the vector obtained by projecting x onto the m elements.
We may apply a problem that is similar to PCA for an RKHS H via the feature
map : E xi → k(xi , ·) ∈ H rather than the PCA in R p . To this end, we consider
the problem of finding the f ∈ H that maximizes
N
f (xi )2 − μ( f 2H − 1) (4.8)
j=1
N
N
N
N
f (xi )2 = f 1 (·)+ f 2 (·), k(xi , ·)
2H = f 1 (·), k(xi , ·)
2H = f 1 (xi )2
i=1 i=1 i=1 i=1
N N
N
N
N
= { α j k(x j , xi )} =
2
αr αs k(xr , xi )k(xs , xi ) = α K 2 α
i=1 j=1 i=1 r =1 s=1
98 4 Kernel Computations
f 1 + f 2 2H = f 1 2H + f 2 2H ≥ f 1 2H
N
N
N
= α j k(x j , ·)2H = αr αs k(xr , xs ) = α K α .
j=1 r =1 s=1
α K K α − μ(α K α − 1) .
β Kβ − μ(β β − 1) .
1 u1 uN
α = K −1/2 β = √ β = √ , . . . , √ .
λ λ1 λN
If we centralize the Gram matrix K = (k(xi , x j )), then the (i, j)th element of the
modified Gram matrix is
1 1
N N
k(xi , ·) − k(x h , ·), k(x j , ·) − k(x h , ·)
H
N h=1 N h=1
1 1
N N
= k(xi , x j ) − k(xi , x h ) − k(x j , xl )
N h=1 N l=1
1
N N
+ k(x h , xl ) . (4.9)
N 2 h=1 l=1
N
αi k(xi , x) ∈ Rm
i=1
is the score of x ∈ R p .
Compared to ordinary PCA, kernel PCA requires a computational time of O(N 3 ).
Therefore, when N is large compared to p, the computational complexity may be
enormous. In the Python, we can write the procedure as follows:
4.2 Kernel Principle Component Analysis 99
def k e r n e l _ p c a _ t r a i n ( x , k ) :
n = x . shape [ 0 ]
K = np . z e r o s ( ( n , n ) )
S = [0] ∗ n ; T = [0] ∗ n
f o r i i n range ( n ) :
f o r j i n range ( n ) :
K[ i , j ] = k ( x [ i , : ] , x [ j , : ] )
f o r i i n range ( n ) :
S [ i ] = np . sum (K[ i , : ] )
f o r j i n range ( n ) :
T [ j ] = np . sum (K [ : , j ] )
U = np . sum (K)
f o r i i n range ( n ) :
f o r j i n range ( n ) :
K[ i , j ] = K[ i , j ] − S [ i ] / n − T [ j ] / n + U / n ∗∗2
v a l , v e c = np . l i n a l g . e i g (K)
idx = v a l . a r g s o r t ( ) [ : : − 1 ] # d e c r e a s i n g order as R
val = val [ idx ]
vec = vec [ : , idx ]
a l p h a = np . z e r o s ( ( n , n ) )
f o r i i n range ( n ) :
a l p h a [ : , i ] = vec [ : , i ] / v a l [ i ] ∗ ∗ 0 . 5
return alpha
d e f k e r n e l _ p c a _ t e s t ( x , k , a l p h a , m, z ) :
n = x . shape [ 0 ]
p c a = np . z e r o s (m)
f o r i i n range ( n ) :
p c a = p c a + a l p h a [ i , 0 :m] ∗ k ( x [ i , : ] , z )
return pca
In kernel PCA, when we use the linear kernel, the scores are consistent with those
of PCA without any kernel. For simplicity, we assume that X is normalized. If we do
not use the kernel, then by the singular value decomposition of X = U V (U ∈
R N × p , ∈ R p× p , V ∈ R p× p ), the multiplication of N 1−1 X X = N 1−1 V 2 V and
V is N 1−1 X X V = N 1−1 V 2 . Thus, each column of V is a principal component
vector, and the scores of x1 , . . . , x N ∈ R p (row vector) are the first m columns of
X V = U V · V = U .
On the other hand, for the linear kernel, we may write the Gram matrix as
K = X X = U 2 U and have K U = X X U = U 2 . That is, each column of
U is β1 , . . . , β N , and the columns α1 , . . . , α N of K −1/2 U are the principal compo-
nent vectors. Therefore, the scores of x1 , . . . , x N ∈ R p (row vectors) are the first m
columns of
K · K −1/2 U = U 2 U · (U 2 U )−1/2 · U = U .
1 1 1
N N N N
xi x j − xi x h − x j xl + xl x h = (xi − x̄)(x j − x̄)
N h=1 N l=1 N h=1 l=1
100 4 Kernel Computations
for the linear kernel, which is consistent with that of the ordinary PCA approach.
Therefore, the obtained score is the same.
21 24 45
44 39
32 40 48
28
7 35 436 13
-40
2
10
38 47 3 4 41 34
49 23 25 22 9
3637 1 19
16 12 17
Second
Second
14 26
-50
8 10
27
0
31 42 50
46 20 18 29
15
29 18 15
50 20 46
42 10 31 27
-60
8
-10
19 17 12
26
1 3736 14
16
23 49
9 22 25
34 41 4 3 47 38
-70
2
-20
13 35 7
28 643
48 40 32
39 44
45 24 21
-80
5
-30
33 30 11
-100 -50 0 50 100 150 -350 -300 -250 -200 -150 -100 -50
First First
15 33
32
0.4
29
3
44
0.2
20
21
0.2
5 50
38
Second
Second
7 2916 42 4
42
41
40 11 43
47
0.0
43
39
44
113
33
38
37
36
35
28
34
30
24
45
27
32
31
23
25
22
26
50
49 23
48
4746 13 3 48
18
17
19
16 30 281 7 39
0.0
21
20
11
89
14
10
12 49
9 35
19
612
2817
10
36
5
22
-0.2
24 14
26
4 1815 31 4645
27 37 41
-0.2
25
-0.4
34
6
2 40
-0.4 -0.2 0.0 0.2 0.4 -0.2 0.0 0.2 0.4 0.6
First First
Fig. 4.3 For the US Arrests data, we ran the ordinary PCA and kernel PCA methods (linear;
Gaussian with σ 2 = 0.08, 0.01), and we display the scores here. In the figure, 1 to 50 are the IDs
given to the states, and California’s ID is 5 (written in red). The results of the kernel PCA approach
differ greatly depending on what kernel we choose. Additionally, since kernel PCA is unsupervised,
it is not possible to use CV to select the optimal parameters. The scores of ordinary PCA and PCA
with the linear kernel should be identical. Although the directions of both axes are opposite, which
is common in PCA, we can conclude that they match
4.3 Kernel SVM 101
California and the other 49 states became clear (Fig. 4.3d). We used the following
code for the execution of the compared approaches:
# def k (x , y ) :
# r e t u r n np . d o t ( x . T , y )
sigma2 = 0.01
def k ( x , y ) :
r e t u r n np . exp ( − np . l i n a l g . norm ( x − y ) ∗∗2 / 2 / s i g m a 2 )
X = pd . r e a d _ c s v ( ’https://raw.githubusercontent.com/selva86/datasets/master/
USArrests.csv’ )
x = X. v a l u e s [ : , : − 1]
n = x . shape [ 0 ] ; p = x . shape [ 1 ]
alpha = kernel_pca_train (x , k )
z = np . z e r o s ( ( n , 2 ) )
f o r i i n range ( n ) :
z [ i , : ] = k e r n e l _ p c a _ t e s t ( x , k , alpha , 2 , x [ i , : ] )
N
i ≤ γ
i=1
and
yi (β0 + xi β) ≥ M(1 − i ) , i = 1, . . . , N .
1 N
β2 + C i (4.10)
2 i=1
102 4 Kernel Computations
N
1
N N
αi − αi α j yi y j xi x j (4.11)
i=1
2 i=1 j=1
N
under i=1 αi yi = 0, where xi is the ith row vector of X (the dual problem)1 . The
constant C > 0 is a parameter that represents the flexibility of the boundary surface.
The higher the value is, the more samples are used to determine the boundary (samples
with αi = 0, i.e., support vectors). Although we sacrifice the fit of the data, we reduce
the boundary variation caused by sample data to prevent overtraining. Then, from
the support vectors, we can calculate the slope of the boundary with the following
formula:
N
β= αi yi xi ∈ R p .
i=1
Then, suppose that we replace the boundary surface with a curved surface by replac-
ing the inner product xi x j with a general nonlinear kernel k(xi , x j ). Then, we can
obtain complicated boundary surfaces rather than planes. However, the theoretical
basis for replacing the product with a kernel is not clear.
Therefore, in the following, we derive the same results by formulating the opti-
mization using k : E × E → R. As in to the previous application of the representa-
tion theorem, we find the f ∈ H that minimizes
1 N N N
f 2H + C i − αi [yi { f (xi ) + β0 } − (1 − i )] − μi i . (4.12)
2 i=1 i=1 i=1
i ≥ 0
αi [yi { f (xi ) + β0 } − (1 − i )] = 0
μi i = 0
1 We see this derivation in several references, such as Joe Suzuki, “Statistical Learning with Math
and Python” (Springer); C. M. Bishop, “Pattern Recognition and Machine Learning” (Springer);
Hastie, Tibshirani, and Fridman, “Elements of Statistical Learning” (Springer); and other primary
machine learning books.
4.3 Kernel SVM 103
γ j k(xi , x j ) − α j y j k(xi , x j ) = 0 (4.13)
j j
αi yi = 0
i
C − αi − μi = 0 (4.14)
μi ≥ 0 , 0 ≤ αi ≤ C.
f 1 (β ∗ ), . . . , f m (β ∗ ) ≤ 0 (4.15)
α1 f 1 (β ∗ ) = · · · = αm f m (β ∗ ) = 0 (4.16)
m
∇ f 0 (β ∗ ) + αi ∇ f i (β ∗ ) = 0 . (4.17)
i=1
N
1
N N
αi − αi α j yi y j k(xi .x j ) . (4.18)
i=1
2 i=1 j=1
Comparing (4.11) and (4.18), we observe that the dual problem replaces xi x j with
k(xi , x j ) for the formulation without any kernel.
In fact, if we set f (·) = β, ·
H , β ∈ R p , k(x, y) = x y (x, y ∈ R p ), then we
obtain the dual problem for a linear kernel (4.11).
Example 66 By using the following function svm_2, we can compare how the
bounds differ between a linear kernel (the standard inner product) and a nonlin-
ear kernel (a polynomial kernel), as shown in Fig. 4.4. cvxopt is a Python module
for solving quadratic programming problems. The function cvxopt calculates α.
def K_linear ( x , y ) :
r e t u r n x . T@y
d e f K_poly ( x , y ) :
r e t u r n ( 1 + x . T@y) ∗∗2
2 For the proof, see Chap. 9 of Joe Suzuki “Statistical Learning with R/Python” (Springer).
104 4 Kernel Computations
d e f svm_2 (X, y , C , K) :
eps =0.0001
n=X . s h a p e [ 0 ]
P=np . z e r o s ( ( n , n ) )
f o r i i n range ( n ) :
f o r j i n range ( n ) :
P [ i , j ] =K(X[ i , : ] , X[ j , : ] ) ∗y [ i ] ∗ y [ j ]
# S p e c i f y i t v i a t h e m a t r i x f u n c t i o n i n t h e package m a t r i x
P= m a t r i x ( P+np . e y e ( n ) ∗ e p s )
A= m a t r i x ( − y . T . a s t y p e ( np . f l o a t ) )
b= m a t r i x ( np . a r r a y ( [ 0 ] ) . a s t y p e ( np . f l o a t ) )
h= m a t r i x ( np . a r r a y ( [ C] ∗ n + [ 0 ] ∗ n ) . r e s h a p e ( − 1 , 1 ) . a s t y p e ( np . f l o a t ) )
G= m a t r i x ( np . c o n c a t e n a t e ( [ np . d i a g ( np . o n e s ( n ) ) , np . d i a g ( − np . o n e s ( n ) ) ] ) )
q= m a t r i x ( np . a r r a y ( [ − 1 ] ∗ n ) . a s t y p e ( np . f l o a t ) )
r e s = c v x o p t . s o l v e r s . qp ( P , q , A=A, b=b , G=G, h=h )
a l p h a =np . a r r a y ( r e s [ ’x’ ] ) # x i s t h e a l p h a i n t h e t e x t
b e t a = ( ( a l p h a ∗y ) .T@X) . r e s h a p e ( 2 , 1 )
index = ( eps < alpha [ : , 0 ] ) & ( alpha [ : , 0] < C − eps )
b e t a _ 0 =np . mean ( y [ i n d e x ] −X[ i n d e x , : ] @beta )
r e t u r n {’alpha’ : a l p h a , ’beta’ : b e t a , ’beta_0’ : b e t a _ 0 }
d e f p l o t _ k e r n e l (K, l i n e ) : # S p e c i f y t h e l i n e s v i a t h e l i n e a r g u m e n t
r e s =svm_2 (X, y , 1 ,K)
a l p h a = r e s [ ’alpha’ ] [ : , 0 ]
b e t a _ 0 = r e s [ ’beta_0’ ]
def f ( u , v ) :
S= b e t a _ 0
f o r i i n range (X . s h a p e [ 0 ] ) :
S=S+ a l p h a [ i ] ∗ y [ i ] ∗K(X[ i , : ] , [ u , v ] )
return S [ 0 ]
# ww i s t h e h e i g h t o f f ( x , y ) . We can draw t h e c o n t o u r .
uu=np . a r a n g e ( − 2 , 2 , 0 . 1 ) ; vv=np . a r a n g e ( − 2 , 2 , 0 . 1 ) ; ww= [ ]
f o r v i n vv :
w= [ ]
f o r u i n uu :
w. append ( f ( u , v ) )
ww. a p p e n d (w)
p l t . c o n t o u r ( uu , vv , ww, l e v e l s =0 , l i n e s t y l e s = l i n e )
0
-2
-3
-3 -2 -1 0 1 2 3
X[,1]
4.4 Spline Curves 105
a = 3 ; b=−1
n =200
X= r a n d n ( n , 2 )
y=np . s i g n ( a ∗X [ : , 0 ] + b∗X[ : , 1 ] ∗ ∗ 2 + 0 . 3 ∗ r a n d n ( n ) )
y=y . r e s h a p e ( − 1 , 1 )
f o r i i n range ( n ) :
i f y [ i ]==1:
p l t . s c a t t e r (X[ i , 0 ] , X[ i , 1 ] , c="red" )
else :
p l t . s c a t t e r (X[ i , 0 ] , X[ i , 1 ] , c="blue" )
p l o t _ k e r n e l ( K_poly , l i n e ="dashed" )
p l o t _ k e r n e l ( K _ l i n e a r , l i n e ="solid" )
J
g(x) = β1 + β2 x + β3 x 2 + β4 x 3 + β j+4 (x − ξ j )3+ (4.19)
j=1
⎧
⎪ 2
⎨ g0 (x) = β1 + β2 x + β3 x + β4 x ,
3 x < ξ1
= g j (x) = g j−1 (x) + β j+4 (x − ξ j )3 , ξ j ≤ x < ξ j+1
⎩ g (x) = β + β x + β x 2 + β x 3 + J β
⎪ 3
J 1 2 3 4 j=1 j+4 (x − ξ j ) , x ≥ ξ J
and
g
(ξ J ) = g
(ξ J ) = 0 . (4.22)
J
g
(ξ J ) = 6β4 + 6 β j+4 = 0
j=1
J
g
(ξ J ) = 2β3 + 6β4 ξ J + 6 β j+4 (ξ J − ξ j ) = 0
j=1
J
J
⇐⇒ β j+4 = β j+4 ξ j = 0 .
j=1 j=1
N 1
{yi − f (xi )}2 + λ { f
(x)}2 d x (4.23)
i=1 0
[β1 , . . . , β N ] = (X X + λG)−1 X y .
def h ( j , x , knots ) :
K = len ( knots )
i f j == 0 :
return 1
e l i f j == 1 :
return x
else :
r e t u r n d ( j − 1 , x , k n o t s )−d (K− 2 , x , k n o t s )
# G g i v e s values i n t e g r a t i n g the f u n c t i o n s t h a t are d i f f e r e n t i a t e d twice
d e f G( x ) : # The x v a l u e s a r e o r d e r e d i n a s c e n d i n g
n = len ( x )
g = np . z e r o s ( ( n , n ) )
f o r i i n range ( 2 , n − 1) :
f o r j i n range ( i , n ) :
g [ i , j ] = 1 2 ∗ ( x [ n −1]− x [ n − 2]) ∗ ( x [ n −2]− x [ j − 2]) \
∗ ( x [ n −2]− x [ i − 2]) / ( x [ n −1]− x [ i − 2]) / \
( x [ n −1]− x [ j − 2]) +(12∗ x [ n − 2]+6∗ x [ j − 2] − 18∗x [ i − 2]) \
∗ ( x [ n −2]− x [ j − 2]) ∗ ∗ 2 / ( x [ n −1]− x [ i − 2]) / ( x [ n −1]− x [ j − 2])
g[ j , i ] = g[ i , j ]
return g
3 See Chap. 7 of this series (“Statistical Learning with R/Python” (Springer)) for the proof.
108 4 Kernel Computations
g(x)
the spline does not follow the
observed data, but it
becomes smoother
-5 0 5
x
# MAIN
n = 100
x = np . random . u n i f o r m ( − 5 , 5 , n )
y = x + np . s i n ( x ) ∗2 + np . random . r a n d n ( n ) # Data G e n e r a t i o n
i n d e x = np . a r g s o r t ( x )
x = x [ index ] ; y = y [ index ]
X = np . z e r o s ( ( n , n ) )
X[ : , 0] = 1
f o r j i n range ( 1 , n ) :
f o r i i n range ( n ) :
X[ i , j ] = h ( j , x [ i ] , x ) # Generation of Matrix X
GG = G( x ) # Generation of Matrix G
lam_set = [ 1 , 30 , 80]
c o l _ s e t = [ "red" , "blue" , "green" ]
plt . figure ()
plt . y l i m ( −8 , 8 )
plt . x l a b e l ( "x" )
plt . y l a b e l ( "g(x)" )
f o r i i n range ( 3 ) :
lam = l a m _ s e t [ i ]
gamma = np . d o t ( np . d o t ( np . l i n a l g . i n v ( np . d o t (X . T , X) +lam ∗GG) ,X . T ) , y )
def g ( u ) :
S = gamma [ 0 ]
f o r j i n range ( 1 , n ) :
S = S + gamma [ j ] ∗ h ( j , u , x )
return S
u _ s e q = np . a r a n g e ( − 8 , 8 , 0 . 0 2 )
v_seq = [ ]
for u in u_seq :
v_seq . append ( g ( u ) )
p l t . p l o t ( u_seq , v_seq , c = c o l _ s e t [ i ] , l a b e l = "$\lambda=%d$"%l a m _ s e t [
i ])
p l t . legend ( )
p l t . s c a t t e r ( x , y , f a c e c o l o r s =’none’ , e d g e c o l o r s = "k" , m a r k e r = "o" )
p l t . t i t l e ( "smoothspline(n=100)" )
T e x t ( 0 . 5 , 1 . 0 , ’smoothspline(n=100)’ )
N 1
{yi − f (xi )}2 + λ { f (q) (x)}2 d x . (4.24)
i=1 0
Pi f, g
H = f i , g0 + g1
H = f i , gi
H = f 0 + f 1 , gi
H = f, Pi g
H
N
{yi − f (xi )}2 + λ f, P1 f
H (4.25)
i=1
f (xi ) = g + h, k(xi , ·)
H = g(xi )
P1 f H1 ≥ P1 g H1
(the representation theorem). Thus, we may restrict the range of f to M for searching
the optimum to find α1 , . . . , α N , β1 , . . . , βq in
q−1
N
g(·) = βi φi (·) + αi k(xi , ·) . (4.26)
i=0 i=1
N
n
N
N
N 1
(q) (q)
{yi − β j g j (xi )} + λ 2
βi β j gi (x)g j (x)d x.
i=1 i=1 j=1 i=1 j=1 0
1 (q) (q)
Let X = (g j (xi )) ∈ R N ×N , G = ( 0 gi (x)g j (x)d x) ∈ R N ×N , and y = [y1 , . . . ,
y N ] . The optimal solution β = [β1 , . . . , β N ] is given by
β = (X X + λG)−1 X y .
Proof: The claim is due to Bochner’s theorem (Proposition 5). See the appendix at
the end of this chapter for details. √
Based on Proposition 45, we generate 2 cos(ω x + b) m ≥ 1 times, i.e.,
(wi , bi ), i = 1, . . . , m, and construct the function
√
z i (x) = 2 cos(ωi x + bi ) i = 1, . . . , m.
1
m
k̂(x, y) := z i (x)z i (y)
m i=1
approaches k(x, y). Utilizing this fact, when m is small compared to N , the method
to reduce the complexity of kernel computation is called random Fourier features
(RFF).
We claim that the RFF possesses the following property:
2n 2 2
P(|X − E[X ]| ≥ ) ≤ 2 exp(− n ), (4.30)
i=1 (bi − ai )
2
n
e−s E[exp{s(Sn − E[Sn ])}] = e−s E[es(X i −E[X i ]) ] .
i=1
s2
n
P(Sn − E[Sn ] ≥ ) ≤ min exp{−s + (bi − ai )2 } ,
s>0 8 i=1
n
in which the minimum value is attained when s := 4 / i=1 (bi − ai )2 , and we have
n
P(Sn − E[Sn ] ≥ ) ≤ exp{−2 2 / (bi − ai )2 } .
i=1
n
P(Sn − E[Sn ] ≤ − ) ≤ exp{−2 2 / (bi − ai )2 } .
i=1
Hence, we have
0.6
0.4
0.2
0.0
-0.2
-0.4
Fig. 4.6 In the RFF approximation, we generated k̂(x, y) 1000 times by changing m. We observe
that they all have zero centers, and the larger m is, the smaller the estimation error is
Since E[k̂(x, y)] = k(x, y) and −2 ≤ z i (x)z i (y) ≤ 2, using Proposition 46, we
obtain (4.29)4 .
Example 68 From Example 19, since the probability of a Gaussian kernel has a
mean of 0 and a covariance matrix σ −2 I ∈ Rd×d , we generate the d-dimensional
random numbers and √ uniform random numbers independently and construct the m
functions z i (x) = 2 cos(ωi x + bi ), i = 1, . . . , m. We draw a boxplot of k̂(x, y) −
k(x, y) by generating (x, y) 1000 times with d = 1 and m = 20, 100, 400 in Fig. 4.6.
We observe that k̂(x, y) − k(x, y) has a mean of 0 (k̂(x, y) is an unbiased estimator),
and the larger m is, the smaller the variance is. The program is written as follows:
s i g m a =10
s i g m a 2 = s i g m a ∗∗2
def k ( x , y ) :
r e t u r n np . exp ( − ( x−y ) ∗ ∗ 2 / ( 2 ∗ s i g m a 2 ) )
def z ( x ) :
r e t u r n np . s q r t ( 2 /m) ∗ np . c o s (w∗x+b )
def zz ( x , y ) :
r e t u r n np . sum ( z ( x ) ∗ z ( y ) )
u=np . z e r o s ( ( 1 0 0 0 , 3 ) )
m_seq = [ 2 0 , 1 0 0 , 4 0 0 ]
f o r i i n range ( 1 0 0 0 ) :
x= r a n d n ( 1 )
y= r a n d n ( 1 )
f o r j i n range ( 3 ) :
m=m_seq [ j ]
w= r a n d n (m) / s i g m a
b=np . random . r a n d (m) ∗2∗ np . p i
u [ i , j ] = z z ( x , y )−k ( x , y )
4The original paper by Rahimi and Recht (2007) and subsequent work proved more rigorous upper
and lower bounds than these [2].
114 4 Kernel Computations
N
The solution α = [α1 , . . . , α N ] with f (·) = i=1 αi k(xi , ·) for kernel ridge
regression with the Gram matrix K is given by (4.6) (Sect. 4.1). If we obtain the
fˆ that approximates f and
Napproximates the Gram matrix K via RFF as K̂ = Z Z ,
ˆ
then we obtain f (·) = i=1 α̂i k̂(xi , ·) by using α̂ ∈ R for ( K̂ + λI N )α̂ = y for
N
U (Is + V U ) = (Ir + U V )U .
And we have
Z (Z Z + λI N )−1 = (Z Z + λIm )−1 Z .
Let x ∈ E be a value other than the x1 , . . . , x N used for estimation, and let z(x) :=
[z 1 (x), . . . , z m (x)] (row vector). Then, for
we have
N
N
fˆ(x) = αi k̂(x, xi ) = z(x) z (xi )α̂i = z(x)Z α̂ = z(x)Z ( K̂ + λI N )−1 y
i=1 i=1
Then, for the new x ∈ E, we can find its value from fˆ(x) = z(x)β̂. The computa-
tional complexity of (4.33) is O(m 2 N ) for the multiplication of Z Z , O(m 3 ) for
finding the inverse of Z Z + λIm ∈ Rm×m , O(N m) for the multiplication of Z y,
and O(m 2 ) for multiplying (Z Z + λIm )−1 and Z y. Thus, overall, the process
requires only O(N 2 m) complexity at most. On the other hand, the process takes
O(N 3 ) time when using the kernel without approximation. If m = N /10, the com-
putational time becomes 1/100. Obtaining fˆ(x) from a new x ∈ E also takes only
O(m) time.
Example 69 We applied RFF to kernel ridge regression. For N = 200 data, we used
m = 20 for the approximation. We plotted the curve for λ = 10−6 , 10−4 (Fig. 4.7).
The program is as follows:
4.5 Random Fourier Features 115
8
10
W. Approx. W. Approx.
6
8
4
6
y
y
2
4
2
0
0
-2
-1.5 -1.0 -0.5 0.0 0.5 1.0 1.5 -1.5 -1.0 -0.5 0.0 0.5 1.0
x x
Fig. 4.7 We applied RFF to kernel ridge regression. On the left and right are λ = 10−6 and
λ = 10−4 , respectively
sigma = 10
sigma2 = sigma**2
# Function z
m = 20
w = randn(m) / sigma
b = np.random.rand(m) * 2 * np.pi

def z(u, m):
    return np.sqrt(2 / m) * np.cos(w * u + b)

# Gaussian Kernel
def k(x, y):
    return np.exp(-(x - y)**2 / (2 * sigma2))

# Data Generation
n = 200
x = randn(n) / 2
y = 1 + 5 * np.sin(x / 10) + 5 * x**2 + randn(n)
x_min = np.min(x); x_max = np.max(x); y_min = np.min(y); y_max = np.max(y)
lam = 0.001
# lam = 0.9

# Low-Rank Approximated Function
def alpha_rff(x, y, m):
    n = len(x)
    Z = np.zeros((n, m))
    for i in range(n):
        Z[i, :] = z(x[i], m)
    beta = np.dot(np.linalg.inv(np.dot(Z.T, Z) + lam * np.eye(m)), np.dot(Z.T, y))
    return beta

# Usual Function
def alpha(k, x, y):
    n = len(x)
    K = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = k(x[i], x[j])
    alpha = np.dot(np.linalg.inv(K + lam * np.eye(n)), y)
    return alpha

# Numerical Comparison
alpha_hat = alpha(k, x, y)
beta_hat = alpha_rff(x, y, m)
r = np.sort(x)
u = np.zeros(n)
v = np.zeros(n)
for j in range(n):
    S = 0
    for i in range(n):
        S = S + alpha_hat[i] * k(x[i], r[j])
    u[j] = S
    v[j] = np.sum(beta_hat * z(r[j], m))
$$(RR^\top + \lambda I_N)^{-1} = \frac{1}{\lambda}\{I_N - R(R^\top R + \lambda I_m)^{-1}R^\top\}, \tag{4.34}$$
with $r = m$, $s = N$, $A = \lambda I_N$, $U = R$, $C = I_r$, and $V = R^\top$.
Computing the left side of (4.34) requires an inverse matrix operation of size $N$, while computing the right side involves the product of $N\times m$ and $m\times m$ matrices and an inverse matrix operation of size $m$. The computations on the left- and right-hand sides require $O(N^3)$ and $O(N^2m)$ complexity, respectively. In the remaining part of this section, we show that, with some approximation, the decomposition $K = RR^\top$ is completed in $O(Nm^2)$ time, i.e., the calculation of the ridge regression is performed in $O(Nm^2)$. In other words, if $N/m = 10$, the computational time is only $1/100$.
In the Nyström approximation, we choose $m$ of the samples, $x_1,\ldots,x_m$, compute the eigenvalues $\lambda_i$ and eigenvectors of the corresponding $m\times m$ Gram matrix, scale the eigenvalues as
$$\lambda_i^{(N)} := \frac{N}{m}\lambda_i,$$
and approximate the Gram matrix by
$$K_N = \sum_{i=1}^m \lambda_i^{(N)} v_i v_i^\top,$$
where $v_1,\ldots,v_m$ are the corresponding (Nyström-extended) eigenvectors of length $N$.
sigma2 = 1

def k(x, y):
    return np.exp(-(x - y)**2 / (2 * sigma2))

n = 300
x = randn(n) / 2
y = 3 - 2 * x**2 + 3 * x**3 + 2 * randn(n)
lam = 10**(-5)
m = 10
K = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        K[i, j] = k(x[i], x[j])
# Low-Rank Approximated Function
Fig. 4.8 We approximated data with N = 300 and ranks m = 10, 20. The upper and lower subfig-
ures display the results obtained when running λ = 10−5 and λ = 10−3 , respectively. The red and
blue lines are the results obtained without approximation and with approximation, respectively. The
accuracy is almost the same as that in the case without approximation when m = 20. The larger the
value of λ is, the smaller the approximation error becomes
2. For each $i = 1,\ldots,r$, the first $i$ columns of $R$ are set so that $B_{j,i} = \sum_{h=1}^{i} R_{j,h}R_{i,h}$ for $j = 1,\ldots,N$. In other words, the setup is complete through the $i$th column of $B$:
$$R = \begin{bmatrix}
R_{1,1} & 0 & \cdots & \cdots & \cdots & 0\\
\vdots & \ddots & \ddots & \cdots & \cdots & 0\\
R_{i,1} & \cdots & R_{i,i} & 0 & \cdots & 0\\
R_{i+1,1} & \cdots & R_{i+1,i} & 0 & \cdots & 0\\
\vdots & \vdots & \vdots & \vdots & \ddots & \vdots\\
R_{N,1} & \cdots & R_{N,i} & 0 & \cdots & 0
\end{bmatrix}.$$
In this case, we swap two subscripts in $B$ by multiplying a matrix $Q$ from the front and rear of $B$.
3. The final result is that $RR^\top = B = P^\top A P$ with $P = Q_1\cdots Q_N$. Therefore, $A = PRR^\top P^\top$, and $(PR)(PR)^\top$ is the Cholesky decomposition of $A$.
Here, to swap the $i$th and $j$th rows and the $i$th and $j$th columns of the symmetric matrix $B$, let $Q$ be the matrix obtained from the identity matrix by replacing its $(i,j)$ and $(j,i)$ components with 1 and its $(i,i)$ and $(j,j)$ components with 0, and multiply $B$ by the symmetric matrix $Q$ from the front and rear. For example,
$$QBQ = \begin{bmatrix}1&0&0\\0&0&1\\0&1&0\end{bmatrix}\begin{bmatrix}b_{11}&b_{12}&b_{13}\\b_{21}&b_{22}&b_{23}\\b_{31}&b_{32}&b_{33}\end{bmatrix}\begin{bmatrix}1&0&0\\0&0&1\\0&1&0\end{bmatrix} = \begin{bmatrix}b_{11}&b_{13}&b_{12}\\b_{31}&b_{33}&b_{32}\\b_{21}&b_{23}&b_{22}\end{bmatrix}.$$
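The effect of $Q$ is easy to confirm numerically; the following small sketch (with an arbitrary symmetric $3\times3$ matrix) swaps the second and third rows and columns.

import numpy as np

B = np.array([[1., 2., 3.],
              [2., 4., 5.],
              [3., 5., 6.]])
Q = np.eye(3)
Q[1, 1] = 0; Q[2, 2] = 0; Q[1, 2] = 1; Q[2, 1] = 1  # swap indices 2 and 3
print(Q @ B @ Q)  # the 2nd and 3rd rows and columns of B are exchanged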
(a) Swap the $i$th and $k$th rows and the $i$th and $k$th columns of $B$.
(b) Let $Q_{i,k} := 1$, $Q_{k,i} := 1$, $Q_{i,i} := 0$, $Q_{k,k} := 0$.
(c) Swap $R_{i,1},\ldots,R_{i,i-1}$ and $R_{k,1},\ldots,R_{k,i-1}$.
(d) Set $R_{i,i} = \sqrt{B_{k,k} - \sum_{h=1}^{i-1}R_{k,h}^2}$.
The remaining elements of the $i$th column below the diagonal are then computed, and we divide them by $R_{i,i}$. Compared to the case where another value is selected as $R_{i,i}$ in step 1, the absolute value of $R_{j,i}$ after dividing by $R_{i,i}$ becomes smaller, and the quantity $B_{j,j} - \sum_{h=1}^{i}R_{j,h}^2$ in the next step becomes larger for each $j$. If $R_{r,r}^2$ took a negative value, then regardless of the selection order, there would be no solution to the Cholesky decomposition, contradicting Proposition 47 (the uniqueness of the solution is also guaranteed). Even in the case of an incomplete Cholesky decomposition ($r < N$), we proceed exactly as when running the procedure with $r = N$ and use only the first $r$ columns.
We show the code for executing the incomplete Cholesky decomposition below:
def im_ch(A, m):
    n = A.shape[1]
    R = np.zeros((n, n))
    P = np.eye(n)
    for i in range(m):
        # choose the pivot k that maximizes A[j, j] - sum_h R[j, h]^2
        max_R = -np.inf
        for j in range(i, n):
            RR = A[j, j]
            for h in range(i):
                RR = RR - R[j, h]**2
            if RR > max_R:
                k = j
                max_R = RR
        R[i, i] = np.sqrt(max_R)
        if k != i:
            # swap the i-th and k-th rows of R and the i-th and k-th rows/columns of A
            for j in range(i):
                w = R[i, j]; R[i, j] = R[k, j]; R[k, j] = w
            for j in range(n):
                w = A[j, k]; A[j, k] = A[j, i]; A[j, i] = w
            for j in range(n):
                w = A[k, j]; A[k, j] = A[i, j]; A[i, j] = w
            Q = np.eye(n); Q[i, i] = 0; Q[k, k] = 0; Q[i, k] = 1; Q[k, i] = 1
            P = np.dot(P, Q)
        if i < n:
            for j in range(i + 1, n):
                S = A[j, i]
                for h in range(i):
                    S = S - R[i, h] * R[j, h]
                R[j, i] = S / R[i, i]
    return np.dot(P, R)
# Data Generation: make the matrix A nonnegative definite
n = 5
D = np.matrix([[np.random.randint(-n, n) for i in range(n)] for j in range(n)])
A = np.dot(D, D.T)
A
L = im_ch(A, 5)
np.dot(L, L.T)
L = im_ch(A, 3)   # call with m = 3 (inferred from the rank-3 result below)
## A cannot be recovered
B = np.dot(L, L.T)
B
# The first three eigenvalues of B are close to those of A.
np.linalg.eig(B)
# The rank of B is three.
np.linalg.matrix_rank(B)
3
Appendix
Proof of Proposition 44
Since $r$ is a natural spline function whose highest order is $2q-1$ and it satisfies the boundary conditions of a natural spline, we have
$$\int_0^1 r^{(q)}(x)s^{(q)}(x)dx = \Big[r^{(q)}(x)s^{(q-1)}(x)\Big]_0^1 - \int_0^1 r^{(q+1)}(x)s^{(q-1)}(x)dx$$
$$= -\int_0^1 r^{(q+1)}(x)s^{(q-1)}(x)dx = \cdots = (-1)^{q-1}\int_0^1 r^{(2q-1)}(x)s'(x)dx$$
$$= (-1)^{q-1}\sum_{j=1}^{N-1} r^{(2q-1)}(x_j^+)\{s(x_{j+1}) - s(x_j)\} = 0, \tag{4.36}$$
where the third equality is due to (4.36). On the other hand, from g, r ∈ Wq [0, 1]
and s ∈ Wq [0, 1], we have
$$s(x) = \sum_{i=0}^{q-1}\frac{s^{(i)}(0)}{i!}x^i + \int_0^1\frac{(x-u)_+^{q-1}}{(q-1)!}s^{(q)}(u)du.$$
Therefore, when the equality of (4.37) holds, i.e., $\int_0^1\{s^{(q)}(x)\}^2dx = 0$, we have $s^{(q)}(x) = 0$ almost everywhere. Hence,
$$s(x) = \sum_{i=0}^{q-1}\frac{s^{(i)}(0)}{i!}x^i,$$
which means that s(xi ) = 0 for i = 1, 2, . . . , N . Thus, if N exceeds the order of the
polynomial q − 1, then we require s(x) = 0 for x ∈ [0, 1].
Proof of Proposition 45
From the addition theorem, we have
$$2\cos(\omega^\top x + b)\cos(\omega^\top y + b) = \cos(\omega^\top(x-y)) + \cos(\omega^\top(x+y) + 2b).$$
Since the expectation of the second term w.r.t. $b$ when fixing $\omega$ is zero, we have
$$E_{\omega,b}[\sqrt{2}\cos(\omega^\top x + b)\cdot\sqrt{2}\cos(\omega^\top y + b)] = E_\omega[\cos(\omega^\top(x-y))].$$
If we apply Euler's formula $e^{i\theta} = \cos\theta + i\sin\theta$ to Proposition 5, then $k(x,y)$ takes a real value. Thus, we have $E[\sin(\omega^\top(x-y))] = 0$, and $k(x,y)$ can be written as $E_\omega[\cos(\omega^\top(x-y))]$.
Proof of Lemma 7
Let $\epsilon > 0$. Since $e^x$ is convex w.r.t. $x$, if we take the expectation on both sides of
$$e^{\epsilon X} \le \frac{X-a}{b-a}e^{\epsilon b} + \frac{b-X}{b-a}e^{\epsilon a},$$
then (using that $X$ has mean zero) we obtain $E[e^{\epsilon X}] \le (1-\theta)e^{\epsilon a} + \theta e^{\epsilon b} = e^{-\theta s + \log(1-\theta+\theta e^s)}$ for $s = \epsilon(b-a)$ and $\theta = \dfrac{-a}{b-a}$. Therefore, it is sufficient for the exponent $f(s) := -\theta s + \log(1-\theta+\theta e^s)$ to be at most $s^2/8$. Since
$$f'(s) = -\theta + \frac{\theta e^s}{1-\theta+\theta e^s}$$
and
$$f''(s) = \frac{(1-\theta)\cdot\theta e^s}{(1-\theta+\theta e^s)^2} = \phi(1-\phi) \le \frac{1}{4}$$
for $\phi = \dfrac{\theta e^s}{1-\theta+\theta e^s}$, a $\mu\in\mathbb{R}$ exists such that
$$f(s) = f(0) + f'(0)(s-0) + \frac12 f''(\mu)(s-0)^2 \le \frac{s^2}{8},$$
which implies (4.31).
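The bound (4.31) is easy to check by simulation; the following sketch uses a zero-mean uniform variable on $[a,b] = [-1,1]$ (an arbitrary illustrative choice).

import numpy as np

a, b = -1.0, 1.0
X = np.random.uniform(a, b, size=100000)  # E[X] = 0
for eps in [0.5, 1.0, 2.0]:
    lhs = np.mean(np.exp(eps * X))           # Monte Carlo estimate of E[e^{eps X}]
    rhs = np.exp(eps**2 * (b - a)**2 / 8)    # right-hand side of the bound
    print(eps, lhs, "<=", rhs)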
Exercises 46∼64
def kernel_pca_train(x, k):
    n = x.shape[0]
    K = np.zeros((n, n))
    S = [0] * n; T = [0] * n
    for i in range(n):
        for j in range(n):
            K[i, j] = k(x[i, :], x[j, :])
    for i in range(n):
        S[i] = np.sum(K[i, :])
    for j in range(n):
        T[j] = np.sum(K[:, j])
    U = np.sum(K)
    for i in range(n):
        for j in range(n):
            K[i, j] = K[i, j] - S[i] / n - T[j] / n + U / n**2
    val, vec = np.linalg.eig(K)
    idx = val.argsort()[::-1]  # decreasing order, as in R
    val = val[idx]
    vec = vec[:, idx]
    alpha = np.zeros((n, n))
    for i in range(n):
        alpha[:, i] = vec[:, i] / val[i]**0.5
    return alpha
def kernel_pca_test(x, k, alpha, m, z):
    n = x.shape[0]
    pca = np.zeros(m)
    for i in range(n):
        pca = pca + alpha[i, 0:m] * k(x[i, :], z)
    return pca
Check whether the constructed function works with the following program:
sigma2 = 0.01

def k(x, y):
    return np.exp(-np.linalg.norm(x - y)**2 / 2 / sigma2)

X = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/USArrests.csv')
x = X.values[:, :-1]
n = x.shape[0]; p = x.shape[1]
alpha = kernel_pca_train(x, k)
z = np.zeros((n, 2))
for i in range(n):
    z[i, :] = kernel_pca_test(x, k, alpha, 2, x[i, :])
49. Show that the ordinary PCA and kernel PCA with a linear kernel output the same
score.
50. Derive the KKT condition for the kernel SVM (4.12).
51. In Example 66, instead of linear and polynomial kernels, use a Gaussian kernel
with different values of σ 2 (three different types), and draw the boundary curve
in the same graph.
52. From (4.21) and (4.22), derive $\sum_{j=1}^{J}\beta_{j+4} = 0$ and $\sum_{j=1}^{J}\beta_{j+4}\xi_j = 0$.
53. Prove Proposition 44 according to the following steps:
(a) Show that $\int_0^1 r^{(q)}(x)s^{(q)}(x)dx = 0$.
(b) Show that $\int_0^1\{g^{(q)}(x)\}^2dx \ge \int_0^1\{r^{(q)}(x)\}^2dx$.
(c) When the equality in (b) holds, show that $s(x) = \sum_{i=0}^{q-1}\dfrac{s^{(i)}(0)}{i!}x^i$.
(d) Show that the function $s$ vanishes ($s \equiv 0$) when the equality in (b) holds and $N$ exceeds the degree $q-1$ of the polynomial.
54. In RFF, instead of finding the kernel k(x, y), we find its unbiased estimator
k̂(x, y). Show that the average of k̂(x, y) is k(x, y). Moreover, construct a func-
tion that outputs k̂(x, y) from (x, y) ∈ E for m = 100 by using the constants
and functions in the program below. Furthermore, compare the result with the
value output by the Gaussian kernel and confirm that it is correct.
sigma = 10
sigma2 = sigma**2

def z(x):
    return np.sqrt(2 / m) * np.cos(w * x + b)

def zz(x, y):
    return np.sum(z(x) * z(y))
for U ∈ Rr ×s , V ∈ Rs×r , r, s ≥ 1.
59. Evaluate the number of computations required to obtain (4.33) for the RFF. In
addition, evaluate the computational complexity of finding fˆ(x) for the new
x ∈ E.
60. To find the coefficient estimates $(K + \lambda I)^{-1}y$ in kernel ridge regression, we wish to use a low-rank decomposition $K = RR^\top$ with $R \in \mathbb{R}^{N\times m}$. Assuming that such a decomposition is available, evaluate the computations on the left- and right-hand sides of (4.34), where we assume that finding the inverse of a matrix $A \in \mathbb{R}^{n\times n}$ takes $O(n^3)$.
61. We wish to find the coefficient α̂ of the kernel ridge regression by using the
Nyström approximation. If we use the left-hand side of (4.34) instead of the
right-hand side, what changes would be necessary in the following code?
sigma = 10; sigma2 = sigma^2
z = function(x) sqrt(2/m) * cos(w*x + b)
zz = function(x, y) sum(z(x) * z(y))
alpha.m = function(k, x, y, m) {
  n = length(x); K = matrix(0, n, n); for (i in 1:n) for (j in 1:n) K[i, j] = k(x[i], x[j])
  A = svd(K[1:m, 1:m])
  u = array(dim = c(n, m))
  for (i in 1:m) for (j in 1:n) u[j, i] = sqrt(m/n) * sum(K[j, 1:m] * A$u[1:m, i]) / A$d[i]
  mu = A$d * n / m
  R = sqrt(mu[1]) * u[, 1]; for (i in 2:m) R = cbind(R, sqrt(mu[i]) * u[, i])
  alpha = (diag(n) - R %*% solve(t(R) %*% R + lambda * diag(m)) %*% t(R)) %*% y / lambda
  return(as.vector(alpha))
}
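For readers following the Python edition, a Python counterpart of the above R function might look as follows. This is a sketch only; the function name alpha_m and the argument lam are chosen here and are not from the original code.

import numpy as np

def alpha_m(k, x, y, m, lam):
    n = len(x)
    K = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = k(x[i], x[j])
    U, d, _ = np.linalg.svd(K[:m, :m])        # eigen-decomposition of the m x m block
    u = np.zeros((n, m))
    for i in range(m):
        u[:, i] = np.sqrt(m / n) * K[:, :m] @ U[:, i] / d[i]  # extended eigenvectors
    mu = d * n / m                             # scaled eigenvalues
    R = u * np.sqrt(mu)                        # R R^T approximates K
    # right-hand side of (4.34): only an m x m matrix is inverted
    alpha = (y - R @ np.linalg.solve(R.T @ R + lam * np.eye(m), R.T @ y)) / lam
    return alpha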
$$B_{j,i} = \sum_{h=1}^{i} R_{j,h}R_{i,h}$$
$T(f)$ satisfies $|T(f)|/\|f\|_H \le E[\sqrt{k(X,X)}] < \infty$. From Proposition 22, there exists an $m_X \in H$ such that
$$E[f(X)] = \langle f(\cdot), m_X(\cdot)\rangle_H$$
for any $f\in H$. We call such an $m_X$ the expectation of $k(X,\cdot)$, and we write $m_X(\cdot) = E[k(X,\cdot)]$. Then, we have
$$E[\langle f(\cdot), k(X,\cdot)\rangle_H] = \langle f(\cdot), E[k(X,\cdot)]\rangle_H,$$
which means that we can change the order of the inner product and expectation operations.
Let $E_X, E_Y$ be sets. We define the tensor product $H_0$ of the RKHSs $H_X$ and $H_Y$ with kernels $k_X : E_X\times E_X\to\mathbb{R}$ and $k_Y : E_Y\times E_Y\to\mathbb{R}$, respectively, as the set of functions $E_X\times E_Y\to\mathbb{R}$ of the form $f(x,y) = \sum_{i=1}^m f_{X,i}(x)f_{Y,i}(y)$, $f_{X,i}\in H_X$, $f_{Y,i}\in H_Y$, for $(x,y)\in E_X\times E_Y$, and we define the inner product and norm by
$$\langle f, g\rangle_{H_0} = \sum_{i=1}^m\sum_{j=1}^n\langle f_{X,i}, g_{X,j}\rangle_{H_X}\langle f_{Y,i}, g_{Y,j}\rangle_{H_Y}$$
and $\|f\|^2_{H_0} = \langle f, f\rangle_{H_0}$, respectively, for $f = \sum_{i=1}^m f_{X,i}f_{Y,i}$, $f_{X,i}\in H_X$, $f_{Y,i}\in H_Y$, and $g = \sum_{j=1}^n g_{X,j}g_{Y,j}$, $g_{X,j}\in H_X$, $g_{Y,j}\in H_Y$. In fact, we have
$$\langle f, g\rangle_{H_0} = \sum_{i=1}^m\sum_{j=1}^n\sum_r\sum_t\alpha_{i,r}\gamma_{j,t}k_X(x_r,x_t)\sum_s\sum_u\beta_{i,s}\delta_{j,u}k_Y(y_s,y_u)$$
$$= \sum_{i=1}^m\sum_r\sum_s\alpha_{i,r}\beta_{i,s}g(x_r,y_s) = \sum_{j=1}^n\sum_t\sum_u\gamma_{j,t}\delta_{j,u}f(x_t,y_u)$$
for $f_{X,i}(\cdot) = \sum_r\alpha_{i,r}k_X(x_r,\cdot)$, $f_{Y,i}(\cdot) = \sum_s\beta_{i,s}k_Y(y_s,\cdot)$, $g_{X,j}(\cdot) = \sum_t\gamma_{j,t}k_X(x_t,\cdot)$, and $g_{Y,j}(\cdot) = \sum_u\delta_{j,u}k_Y(y_u,\cdot)$, which means that the inner product does not depend on the expressions of $f, g$.
If we complete $H_0$, we can construct a linear space $H$ consisting of the functions $f = \sum_{i=1}^\infty\sum_{j=1}^\infty a_{i,j}e_{X,i}e_{Y,j}$ such that $\|f\|^2 := \sum_{i=1}^\infty\sum_{j=1}^\infty a_{i,j}^2 < \infty$, and the inner product is $\langle f, g\rangle_H = \sum_{i=1}^\infty\sum_{j=1}^\infty a_{i,j}b_{i,j}$, where $g = \sum_{i=1}^\infty\sum_{j=1}^\infty b_{i,j}e_{X,i}e_{Y,j}$ ($\sum_{i=1}^\infty\sum_{j=1}^\infty b_{i,j}^2 < \infty$) and $\{e_{X,i}\}, \{e_{Y,j}\}$ are orthonormal bases of $H_X, H_Y$, respectively. Then, $H_0$ is a dense subspace of $H$, and $H$ is a Hilbert space. We call $H$ the tensor product of $H_X, H_Y$ and write $H_X\otimes H_Y$. $H$ is the set of functions $f$ such that $f(x) := \lim_{n\to\infty}f_n(x)$ for any Cauchy sequence $\{f_n\}$ in $H_0$ and $x\in E$. The claim follows from a discussion similar to that in Steps 1-5 of Proposition 34.
$$E_{XY}[\|k_X(X,\cdot)k_Y(Y,\cdot)\|_{H_X\otimes H_Y}] = E_{XY}[\|k_X(X,\cdot)\|_{H_X}\|k_Y(Y,\cdot)\|_{H_Y}]$$
$$= E_{XY}[\sqrt{k_X(X,X)k_Y(Y,Y)}] \le \sqrt{E_X[k_X(X,X)]E_Y[k_Y(Y,Y)]}.$$
Hence, there exists an element $m_{XY}\in H_X\otimes H_Y$ such that
$$E_{XY}[f(X,Y)] = \langle f, m_{XY}\rangle,$$
and we write
$$m_{XY} := E_{XY}[k_X(X,\cdot)k_Y(Y,\cdot)],$$
which means that we can change the order of the inner product and expectation operations. Moreover, there exist operators $\Sigma_{YX}, \Sigma_{XY}$ such that, for each $f\in H_X$ and $g\in H_Y$,
$$\langle fg,\, m_{XY} - m_Xm_Y\rangle_{H_X\otimes H_Y} = \langle\Sigma_{YX}f, g\rangle_{H_Y} = \langle f, \Sigma_{XY}g\rangle_{H_X}. \tag{5.1}$$
Proof: The operators $\Sigma_{YX}, \Sigma_{XY}$ are adjoints of each other, and from Proposition 22, if one exists, so does the other. We prove the existence of $\Sigma_{XY}$. The linear functional
$$T_g : H_X \ni f \mapsto \langle fg,\, m_{XY} - m_Xm_Y\rangle_{H_X\otimes H_Y}\in\mathbb{R}$$
is bounded because
$$\langle fg,\, m_{XY} - m_Xm_Y\rangle_{H_X\otimes H_Y} \le \|f\|_{H_X}\|g\|_{H_Y}\|m_{XY} - m_Xm_Y\|_{H_X\otimes H_Y}.$$
Hence, there exists an $h_g\in H_X$ such that $T_g(f) = \langle f, h_g\rangle_{H_X}$; setting $\Sigma_{XY}g := h_g$, we have
$$\langle fg,\, m_{XY} - m_Xm_Y\rangle_{H_X\otimes H_Y} = \langle f, \Sigma_{XY}g\rangle_{H_X}$$
and
$$\|\Sigma_{XY}g\|_{H_X} = \|h_g\|_{H_X} = \|T_g\| \le \|g\|_{H_Y}\|m_{XY} - m_Xm_Y\|_{H_X\otimes H_Y}.$$
We call $\Sigma_{XY}, \Sigma_{YX}$ the mutual covariance operators.
Let $H$ and $k$ be an RKHS and its reproducing kernel, respectively, and let $\mathcal P$ be the set of distributions that $X$ follows. Then, we can define the map
$$\mathcal P \ni \mu \mapsto \int k(x,\cdot)d\mu(x) \in H.$$
Gretton et al. (2008) [11] proposed a statistical test of whether two distributions coincide, given independent sequences $x_1,\ldots,x_m\in\mathbb{R}$ and $y_1,\ldots,y_n\in\mathbb{R}$. We write the two distributions as $P, Q$ and regard $P = Q$ as the null hypothesis. Let $H$ and $k$ be an RKHS and its reproducing kernel, respectively; we define $m_P := E_P[k(X,\cdot)] = \int_E k(x,\cdot)dP(x)$ and $m_Q := E_Q[k(X,\cdot)] = \int_E k(x,\cdot)dQ(x)\in H$. We consider the function class
$$\mathcal F := \{f\in H \mid \|f\|_H \le 1\}$$
and define the (squared) maximum mean discrepancy by
$$\mathrm{MMD}^2 := \Big(\sup_{f\in\mathcal F}\{E_P[f(X)] - E_Q[f(Y)]\}\Big)^2 = \sup_{f\in\mathcal F}\langle m_P - m_Q, f\rangle^2 = \|m_P - m_Q\|_H^2.$$
$$\mathrm{MMD} = 0 \iff m_P = m_Q \iff P = Q \tag{5.2}$$
and
$$\mathrm{MMD}^2 = \langle m_P, m_P\rangle + \langle m_Q, m_Q\rangle - 2\langle m_P, m_Q\rangle$$
$$= \langle E_X[k(X,\cdot)], E_{X'}[k(X',\cdot)]\rangle + \langle E_Y[k(Y,\cdot)], E_{Y'}[k(Y',\cdot)]\rangle - 2\langle E_X[k(X,\cdot)], E_Y[k(Y,\cdot)]\rangle$$
$$= E_{XX'}[k(X,X')] + E_{YY'}[k(Y,Y')] - 2E_{XY}[k(X,Y)],$$
where $X$ and $X'$ ($Y$ and $Y'$) are independent and follow the same distribution. However, we do not know $m_P, m_Q$ from the two-sample data. Thus, we execute the test using their estimates:
$$\mathrm{MMD}^2_B := \frac{1}{m^2}\sum_{i=1}^m\sum_{j=1}^m k(x_i,x_j) + \frac{1}{n^2}\sum_{i=1}^n\sum_{j=1}^n k(y_i,y_j) - \frac{2}{mn}\sum_{i=1}^m\sum_{j=1}^n k(x_i,y_j) \tag{5.3}$$
$$\frac{1}{m(m-1)}\sum_{i=1}^m\sum_{j\ne i} k(x_i,x_j) + \frac{1}{n(n-1)}\sum_{i=1}^n\sum_{j\ne i} k(y_i,y_j) - \frac{2}{mn}\sum_{i=1}^m\sum_{j=1}^n k(x_i,y_j). \tag{5.4}$$
Then, the estimate (5.4) is unbiased while (5.3) is biased:
$$E\Big[\frac{1}{m(m-1)}\sum_{i=1}^m\sum_{j\ne i}k(X_i,X_j)\Big] = \frac{1}{m}\sum_{i=1}^m E_{X_i}\Big[\frac{1}{m-1}\sum_{j\ne i}E_{X_j}[k(X_i,X_j)]\Big] = E_{XX'}[k(X,X')].$$
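For reference, both estimators can be written compactly with Gram matrices; the following vectorized sketch (Gaussian kernel, with an arbitrary width $\sigma$) computes (5.3) and (5.4) for two samples x and y.

import numpy as np

def mmd2_estimates(x, y, sigma=1.0):
    def gram(a, b):
        return np.exp(-(a[:, None] - b[None, :])**2 / sigma**2)
    m, n = len(x), len(y)
    Kxx, Kyy, Kxy = gram(x, x), gram(y, y), gram(x, y)
    biased = Kxx.sum() / m**2 + Kyy.sum() / n**2 - 2 * Kxy.sum() / (m * n)   # (5.3)
    unbiased = ((Kxx.sum() - np.trace(Kxx)) / (m * (m - 1))                   # (5.4)
                + (Kyy.sum() - np.trace(Kyy)) / (n * (n - 1))
                - 2 * Kxy.sum() / (m * n))
    return biased, unbiased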
Fig. 5.1 Permutation test for the two-sample problem. The distributions of X, Y are the same (left)
and different (right). The blue and red dotted lines show the statistics and the borders of the rejection
region, respectively.
However, similar to the HSIC in the next section, we do not know the distribution
of the MMD estimate under P = Q. We consider executing one of the following
processes.
1. Construct a histogram of the MMD estimate values randomly by changing the
values of x1 , . . . , xm and y1 , . . . , yn (permutation test).
2. Compute an asymptotic distribution from the distribution of U statistics.
For the former, for example, we may construct the following procedure.
Example 71 We perform a permutation test on two sets of 100 samples that follow the standard Gaussian distribution (Fig. 5.1, left). For the unbiased estimator of $\mathrm{MMD}^2$, we use $\mathrm{MMD}^2_U$ in (5.6) instead of (5.4) for a later comparison. We also double the standard deviation of one set of samples and perform the permutation test again (Fig. 5.1, right). The reason why $\mathrm{MMD}^2_U$ also takes negative values is that, when the true value of $\mathrm{MMD}^2$ is close to zero, an unbiased estimator can also take negative values.
# In this chapter, we assume that the following has been executed.
import numpy as np
from scipy.stats import kde
import itertools
import math
import matplotlib.pyplot as plt
from matplotlib import style
style.use("seaborn-ticks")

sigma = 1

def k(x, y):
    return np.exp(-(x - y)**2 / sigma**2)

# Data Generation
n = 100
xx = np.random.randn(n)
yy = np.random.randn(n)        # The distributions are equal
# yy = 2 * np.random.randn(n)  # The distributions are not equal
x = xx; y = yy

# Distribution of the null hypothesis
T = []
for h in range(100):
    index1 = np.random.choice(n, size=int(n / 2), replace=False)
    index2 = [x for x in range(n) if x not in index1]
    x = list(xx[index2]) + list(yy[index1])
    y = list(xx[index1]) + list(yy[index2])
    S = 0
    for i in range(n):
        for j in range(n):
            if i != j:
                S = S + k(x[i], x[j]) + k(y[i], y[j]) \
                    - k(x[i], y[j]) - k(x[j], y[i])
    T.append(S / n / (n - 1))
v = np.quantile(T, 0.95)

# Statistics
x = list(xx); y = list(yy)  # restore the original samples before computing the statistic
S = 0
for i in range(n):
    for j in range(n):
        if i != j:
            S = S + k(x[i], x[j]) + k(y[i], y[j]) \
                - k(x[i], y[j]) - k(x[j], y[i])
u = S / n / (n - 1)

# Display of the graph
x = np.linspace(min(min(T), u, v), max(max(T), u, v), 200)
density = kde.gaussian_kde(T)
plt.plot(x, density(x))
plt.axvline(x=v, c="r", linestyle="--")  # rejection boundary
plt.axvline(x=u, c="b")                  # statistic
For the latter approach, we construct the following quantities. For $m\ge1$ and a symmetric function $h : E^m\to\mathbb{R}$, we call the quantity
$$U_N := \frac{1}{\binom{N}{m}}\sum_{1\le i_1<\cdots<i_m\le N} h(x_{i_1},\ldots,x_{i_m}) \tag{5.5}$$
the U-statistic w.r.t. $h$ of order $m$, where the sum ranges over the $\binom{N}{m}$ combinations $(i_1,\ldots,i_m)$ from $\{1,\ldots,N\}$. We use this quantity for estimating the expectation $E[h(X_1,\ldots,X_m)]$:
$$E\Big[\frac{1}{\binom{N}{m}}\sum_{i_1<\cdots<i_m}h(X_{i_1},\ldots,X_{i_m})\Big] = \frac{1}{\binom{N}{m}}\sum_{i_1<\cdots<i_m}E[h(X_{i_1},\ldots,X_{i_m})] = E[h(X_1,\ldots,X_m)].$$
On the other hand, the V-statistic sums over all $N^m$ tuples:
$$V_N := \frac{1}{N^m}\sum_{i_1=1}^N\cdots\sum_{i_m=1}^N h(x_{i_1},\ldots,x_{i_m}).$$
For the two-sample problem, we set
$$\mathrm{MMD}^2_U = \frac{1}{n(n-1)}\sum_{i\ne j}h(z_i,z_j)$$
for
$$h(z_i,z_j) := k(x_i,x_j) + k(y_i,y_j) - k(x_i,y_j) - k(x_j,y_i) \tag{5.6}$$
with $z_i = (x_i, y_i)$.
We define
$$h_c(z_1,\ldots,z_c) := E[h(z_1,\ldots,z_c,Z_{c+1},\ldots,Z_m)],$$
which is obtained by taking the expectation of the kernel of the U-statistic (5.5) over $Z_{c+1},\ldots,Z_m$ for $1\le c\le m$. Moreover, we define
$$\tilde h_c(z_1,\ldots,z_c) := h_c(z_1,\ldots,z_c) - \theta$$
with $\theta := E[h(Z_1,\ldots,Z_m)]$.
Proposition 51 (Serfling [27]) Suppose that the U-statistic satisfies $E[h^2] < \infty$ and that $h_1(z_1)$ is zero (degenerate). Let $\lambda_1,\lambda_2,\ldots$ be the eigenvalues of the integral operator
$$L^2 \ni f(\cdot) \mapsto \int\tilde h_2(\cdot,y)f(y)d\eta(y).$$
Then $N U_N$ converges in distribution to $\sum_{j=1}^\infty\lambda_j(\chi_j^2 - 1)$ as $N\to\infty$, where $\chi_1^2,\chi_2^2,\ldots$ are random variables that are independent of each other and follow a $\chi^2$ distribution with one degree of freedom.
For the proof, see Sect. 5.5.2 of Serfling [27] (pages 193-199).
Note that $\tilde h_2(z_1,z_2) = h(z_1,z_2)$ given by (5.6) is symmetric but not nonnegative definite. Therefore, Mercer's theorem cannot be applied. However, an integral operator is generally compact (Proposition 39), and if the kernel of an integral operator is symmetric, then the integral operator is self-adjoint (e.g., 45). Therefore, from Proposition 27, eigenvalues and eigenfunctions exist. However, since the kernel is not nonnegative definite, some eigenvalues may be negative.
In the following, we write $\{\lambda_i\}_{i=1}^\infty$ and $\{\phi_i(\cdot)\}_{i=1}^\infty$ for the eigenvalues and eigenfunctions, respectively, of the integral operator
$$T_{\tilde h} : L^2[E,\eta] \ni f \mapsto \int_E\tilde h_2(\cdot,y)f(y)d\eta(y)\in L^2[E,\eta].$$
Utilizing Proposition 51, we find that $N\,\mathrm{MMD}^2_U$ converges in distribution to the random variable
$$\sum_{j=1}^\infty\lambda_j(\chi_j^2 - 1).$$
(The function $K$ in an integral operator $f\mapsto\int_E K(\cdot,x)f(x)d\eta(x)$ is called its kernel even if it is not positive definite.)
Fig. 5.2 Test performed using the U-statistic for the two-sample problem. The same (left) and
different (right) distributions of X, Y are employed. The blue line is the statistic, and the red dotted
line is the boundary of the rejection region. We can see that the distribution obtained according to
the null hypothesis has almost the same shape as that in Fig. 5.1
sigma = 1

def k(x, y):
    return np.exp(-(x - y)**2 / sigma**2)

# Data Generation
n = 100
x = np.random.randn(n)
y = np.random.randn(n)        # The distributions are equal
# y = 2 * np.random.randn(n)  # The distributions are not equal

# Distribution under the null hypothesis
K = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        K[i, j] = k(x[i], x[j]) + k(y[i], y[j]) \
            - k(x[i], y[j]) - k(x[j], y[i])
lam, vec = np.linalg.eig(K)
lam = lam / n
r = 20
z = []
for h in range(10000):
    z.append(np.longdouble(1 / n * (np.sum(lam[0:r]
             * (np.random.chisquare(df=1, size=r) - 1)))))
v = np.quantile(z, 0.95)

# Statistics
S = 0
for i in range(n - 1):
    for j in range(i + 1, n):
        S = S + k(x[i], x[j]) + k(y[i], y[j]) \
            - k(x[i], y[j]) - k(x[j], y[i])
u = np.longdouble(S / n / (n - 1))

# Display of the graph
x = np.linspace(min(min(z), u, v), max(max(z), u, v), 200)
density = kde.gaussian_kde(z)
plt.plot(x, density(x))
plt.axvline(x=v, c="r", linestyle="--")
plt.axvline(x=u, c="b")
is close to zero for $\bar x := (1/N)\sum_{i=1}^N x_i$ and $\bar y := (1/N)\sum_{i=1}^N y_i$, then we may say that the variables are independent.
$$\frac{1}{2\pi\sqrt{1-\rho_{XY}^2}}\exp\Big\{-\frac{1}{2(1-\rho_{XY}^2)}(x^2 - 2\rho_{XY}xy + y^2)\Big\},$$
To this end, Gretton et al. [12] observed that if, to test the independence of random variables $X, Y$, we map $X\mapsto k_X(X,\cdot)\in H_X$ and $Y\mapsto k_Y(Y,\cdot)\in H_Y$ for kernels $k_X, k_Y$ and perform the test of independence based on the covariance between $k_X(X,\cdot)$ and $k_Y(Y,\cdot)$, then such an inconvenience does not occur. They devised a statistical test of $E[k_X(X,\cdot)k_Y(Y,\cdot)] = E[k_X(X,\cdot)]E[k_Y(Y,\cdot)]$ rather than of $E[XY] = E[X]E[Y]$. We define
$$HSIC(X,Y) := \|m_Xm_Y - m_{XY}\|^2_{H_X\otimes H_Y}\in\mathbb{R},$$
for which
$$HSIC(X,Y) = 0 \iff X\perp\!\!\!\perp Y.$$
In fact,
$$X\perp\!\!\!\perp Y \iff P_{XY} = P_XP_Y \iff m_{XY} = m_Xm_Y \iff HSIC(X,Y) = 0,$$
and, in terms of the covariance operators,
$$\Sigma_{XY} = \Sigma_{YX} = 0 \iff X\perp\!\!\!\perp Y.$$
If we abbreviate $\|\cdot\|_{H_X\otimes H_Y}$ and $\langle\cdot,\cdot\rangle_{H_X\otimes H_Y}$ as $\|\cdot\|$ and $\langle\cdot,\cdot\rangle$, respectively, then we have
$$\|m_{XY}\|^2 = \langle E_{XY}[k_X(X,\cdot)k_Y(Y,\cdot)],\, E_{X'Y'}[k_X(X',\cdot)k_Y(Y',\cdot)]\rangle = E_{XY}E_{X'Y'}[\langle k_X(X,\cdot)k_Y(Y,\cdot),\, k_X(X',\cdot)k_Y(Y',\cdot)\rangle] = E_{XYX'Y'}[k_X(X,X')k_Y(Y,Y')]$$
and
$$\|m_Xm_Y\|^2 = \langle E_X[k_X(X,\cdot)]E_Y[k_Y(Y,\cdot)],\, E_{X'}[k_X(X',\cdot)]E_{Y'}[k_Y(Y',\cdot)]\rangle = E_XE_{X'}[k_X(X,X')]\,E_YE_{Y'}[k_Y(Y,Y')],$$
where $X, X'$ ($Y, Y'$) are independent and follow the same distribution. Hence, we can write $HSIC(X,Y)$ as
$$HSIC(X,Y) := \|m_{XY} - m_Xm_Y\|^2 = E_{XX'YY'}[k_X(X,X')k_Y(Y,Y')] - 2E_{XY}\{E_{X'}[k_X(X,X')]E_{Y'}[k_Y(Y,Y')]\} + E_{XX'}[k_X(X,X')]E_{YY'}[k_Y(Y,Y')]. \tag{5.8}$$
When applying the HSIC, we often construct the following estimator, replacing the mean by the relative frequency:
$$\widehat{HSIC} := \frac{1}{N^2}\sum_i\sum_j k_X(x_i,x_j)k_Y(y_i,y_j) - \frac{2}{N^3}\sum_i\sum_j k_X(x_i,x_j)\sum_h k_Y(y_i,y_h) + \frac{1}{N^4}\sum_i\sum_j k_X(x_i,x_j)\sum_h\sum_r k_Y(y_h,y_r). \tag{5.9}$$
In the following example, we use
$$k_X(x,y) = k_Y(x,y) = \exp\Big(-\frac{1}{2\sigma^2}\|x-y\|^2\Big)$$
(Gaussian kernel) as follows.
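The definition of the function HSIC_1 called below is not contained in this excerpt; a minimal sketch that computes the estimator (5.9) from the two kernels might look as follows (the implementation details are an assumption, not the author's original code).

def HSIC_1(x, y, k_x, k_y):
    n = len(x)
    K_x = np.array([[k_x(x[i], x[j]) for j in range(n)] for i in range(n)])
    K_y = np.array([[k_y(y[i], y[j]) for j in range(n)] for i in range(n)])
    term1 = np.sum(K_x * K_y) / n**2
    term2 = 2 * np.sum(K_x.sum(axis=1) * K_y.sum(axis=1)) / n**3  # cross term of (5.9)
    term3 = np.sum(K_x) * np.sum(K_y) / n**4
    return term1 - term2 + term3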
def k_x(x, y):
    return np.exp(-np.linalg.norm(x - y)**2 / 2)
k_y = k_x
k_z = k_x

n = 100
for a in [0, 0.1, 0.2, 0.4, 0.6, 0.8]:  # a is the correlation
    x = np.random.randn(n)
    z = np.random.randn(n)
    y = a * x + np.sqrt(1 - a**2) * z
    print(HSIC_1(x, y, k_x, k_y))
0.0006847868161461435
0.004413058917908441
0.004693757443490376
0.01389332860758824
0.010176397492526468
0.0364733529032461
The smaller the value of $\widehat{HSIC}$ is, the more likely independence is; however, for random variables $X, Y, U, V$, the condition $HSIC(X,Y) < HSIC(U,V)$ does not mean that $X, Y$ are closer to independence than $U, V$. Nevertheless, in practice, the HSIC is often used as a criterion to measure the certainty of independence.
Example 77 (LiNGAM [16, 28]) We wish to know the cause-and-effect relation among the random variables $X, Y, Z$ from their $N$ independent realizations $x, y, z$. For example, we assume that $X, Y$ are generated based on either Model 1 (in which $X = e_1$ and $Y = aX + e_2$ for a constant $a\in\mathbb{R}$ and zero-mean independent variables $e_1, e_2$) or Model 2 (in which $Y = e_1'$ and $X = a'Y + e_2'$ for a constant $a'\in\mathbb{R}$ and zero-mean independent variables $e_1', e_2'$). We choose the model under which independence is more plausible, comparing $e_1\perp\!\!\!\perp e_2$ and $e_1'\perp\!\!\!\perp e_2'$. Then, we can apply the function HSIC_1, where $e_2, e_2'$ are calculated from $y - \hat a x$ and $x - \hat a' y$. For example, using the function

def cc(x, y):
    return np.sum(np.dot(x.T, y)) / len(x)

def f(u, v):
    return u - cc(u, v) / cc(v, v) * v

we can estimate $e_2$ and $e_2'$ via f(y,x) and f(x,y), respectively. When we have three variables $X, Y, Z$, we first determine the upstream variable. To this end, using the function HSIC_2, we compare three independence cases: between x and its residues (f(y,x), f(z,x)), between y and its residues (f(z,y), f(x,y)), and between z and its residues (f(x,z), f(y,z)). For example, if we choose the first case, then $X$ is the upstream variable.
Then, we choose the midstream variable among the two unselected variables. For example, if $X$ is selected in the first round, then we compare the independence of the pairs (y_x, z_xy) and (z_x, y_zx), where, in the notation of the program, y_x = f(y,x) and z_xy = f(z_x, y_x).
# Data Generation
n = 30
x = np.random.randn(n)**2 - np.random.randn(n)**2
y = 2 * x + np.random.randn(n)**2 - np.random.randn(n)**2
z = x + y + np.random.randn(n)**2 - np.random.randn(n)**2
x = x - np.mean(x)
y = y - np.mean(y)
z = z - np.mean(z)

# (the residuals x_y, y_x, ... and the upstream scores v1, v2, v3 are computed
#  in code not shown in this excerpt)
if v1 < v2:
    if v1 < v3:
        top = 1
    else:
        top = 3
else:
    if v2 < v3:
        top = 2
    else:
        top = 3

# Estimate the Downstream
x_yz = f(x_y, z_y)
y_zx = f(y_z, x_z)
z_xy = f(z_x, y_x)
if top == 1:
    v1 = HSIC_1(y_x, z_xy, k_y, k_z)
    v2 = HSIC_1(z_x, y_zx, k_z, k_y)
    if v1 < v2:
        middle = 2
        bottom = 3
    else:
        middle = 3
        bottom = 2
if top == 2:
    v1 = HSIC_1(z_y, x_yz, k_y, k_z)
    v2 = HSIC_1(x_y, z_xy, k_z, k_y)
    if v1 < v2:
        middle = 3
        bottom = 1
    else:
        middle = 1
        bottom = 3
if top == 3:
    v1 = HSIC_1(z_y, x_yz, k_z, k_x)
    v2 = HSIC_1(x_y, z_xy, k_x, k_z)
    if v1 < v2:
        middle = 1
        bottom = 2
    else:
        middle = 2
        bottom = 1

# Display the Result
print("top =", top)
print("middle =", middle)
print("bottom =", bottom)

top = 1
middle = 3
bottom = 2
v = np.quantile(w, 0.95)
x = np.linspace(min(min(w), u, v), max(max(w), u, v), 200)
## Graphical Output
density = kde.gaussian_kde(w)
plt.plot(x, density(x))
plt.axvline(x=v, c="r", linestyle="--")
plt.axvline(x=u, c="b")
Now, let us use the unbiased estimate of the HSIC, $\widehat{HSIC}_U$, to find the theoretical asymptotic distribution according to the null hypothesis. Noting that
$$\widehat{HSIC} = \frac{1}{N^4}\sum_{i=1}^N\sum_{j=1}^N\sum_{q=1}^N\sum_{r=1}^N h(z_i,z_j,z_q,z_r)$$
with
$$h(z_i,z_j,z_q,z_r) = \frac{1}{4!}\sum_{(t,u,v,w)}^{(i,j,q,r)}\{k_X(x_t,x_u)k_Y(y_t,y_u) + k_X(x_t,x_u)k_Y(y_v,y_w) - 2k_X(x_t,x_u)k_Y(y_t,y_v)\},$$
where $\sum_{(t,u,v,w)}^{(i,j,q,r)}$ denotes the sum over the permutations $(t,u,v,w)$ of $(i,j,q,r)$, we modify this estimate to make it an unbiased estimator and obtain
$$\widehat{HSIC}_U = \frac{1}{\binom{N}{4}}\sum_{i<j<q<r} h(z_i,z_j,z_q,z_r),$$
where $\sum_{i<j<q<r}$ ranges over $1\le i<j<q<r\le N$ without any overlap.
For example, we can construct the program as follows. Since the program con-
sumes memory, the number of samples should be limited to 100 or less. Additionally,
Fig. 5.3 The null distribution when using the unbiased estimator $\widehat{HSIC}_U$ of the HSIC. The blue line is the statistic, and the red dotted line is the boundary of the rejection region
since the estimator is different from $\widehat{HSIC}$, it produces different values for the same data. The values of $\widehat{HSIC}_U$ are smaller than those of $\widehat{HSIC}$.
def h(i, j, q, r, x, y, k_x, k_y):
    M = list(itertools.permutations([i, j, q, r]))  # permutations, to match the 1/4! sum above
    m = len(M)
    S = 0
    for j in range(m):
        t = M[j][0]
        u = M[j][1]
        v = M[j][2]
        w = M[j][3]
        S = S + k_x(x[t], x[u]) * k_y(y[t], y[u]) \
            + k_x(x[t], x[u]) * k_y(y[v], y[w]) \
            - 2 * k_x(x[t], x[u]) * k_y(y[t], y[v])
    return S / m
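The function HSIC_U used later is likewise not shown in this excerpt; a direct (and deliberately unoptimized) sketch that averages h over all quadruples $i<j<q<r$ might be:

import itertools, math

def HSIC_U(x, y, k_x, k_y):
    n = len(x)
    S = sum(h(i, j, q, r, x, y, k_x, k_y)
            for (i, j, q, r) in itertools.combinations(range(n), 4))
    return S / math.comb(n, 4)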
The function $h_1(\cdot)$ is the zero function. For $h_2(\cdot,\cdot)$, we use the following formula:
$$h_2(z,z') = \frac{1}{6}\tilde k_X(x,x')\tilde k_Y(y,y'),$$
where $z = (x,y)$ and $z' = (x',y')$.
Proof: The derivation consists of simple transformations; see the original paper for the proof.
Mercer’s theorem is not applicable since the kernel h 2 of the integral operator is
not nonnegative definite. However, since the kernel is symmetric and its integral oper-
ator is self-adjoint, eigenvalues {λi } and eigenfunctions {φi } exist (Proposition 27).
Therefore, as in the case involving the two-sample problem, the null distribution
can be calculated by using Proposition 51. Moreover, the mean of h 2 is zero, i.e.,
h̃ 2 = h 2 .
Fig. 5.4 The null distribution when using the unbiased estimator $\widehat{HSIC}_U$ of the HSIC. The blue line is the statistic, and the red dotted line is the boundary of the rejection region. The null distribution is different from that of the estimator $\widehat{HSIC}$ used in the permutation test. In particular, when $X, Y$ are independent, the unbiased estimator can take a negative value because the true value of the HSIC is zero
Example 79 Calculate the eigenvalues of the Gram matrix of the positive definite
kernel h 2 and divide them by N to obtain the desired eigenvalues (Sect. 3.3). Then,
find the distribution that follows the null hypothesis and calculate the rejection region.
We construct the following program and execute it. We input a random number that
follows a Gaussian distribution with N = 100 samples. In Fig. 5.4, the left panel
shows a correlation coefficient of 0, and the right panel shows a correlation coefficient
of 0.2.
sigma = 1

def k(x, y):
    return np.exp(-(x - y)**2 / sigma**2)
k_x = k; k_y = k

## Data Generation
n = 100; x = np.random.randn(n)
a = 0      # Independent
# a = 0.2  ## Correlation 0.2
y = a * x + np.sqrt(1 - a**2) * np.random.randn(n)

## Null Hypothesis
K_x = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        K_x[i, j] = k_x(x[i], x[j])
K_y = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        K_y[i, j] = k_y(y[i], y[j])
F = np.zeros(n)
for i in range(n):
    F[i] = np.sum(K_x[i, :]) / n
G = np.zeros(n)
for i in range(n):
    G[i] = np.sum(K_y[i, :]) / n
H = np.sum(F) / n
I = np.sum(G) / n
K = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        K[i, j] = (K_x[i, j] - F[i] - F[j] + H) \
            * (K_y[i, j] - G[i] - G[j] + I) / 6
r = 20
lam, vec = np.linalg.eig(K)
lam = lam / n
print(lam)
z = []
for s in range(10000):
    z.append(1 / n * (np.sum(lam[0:r] * (np.random.chisquare(df=1, size=r) - 1))))
v = np.quantile(z, 0.95)

## Statistics
u = HSIC_U(x, y, k_x, k_y)

## Graphical Output
x = np.linspace(min(min(z), u, v), max(max(z), u, v), 200)
density = kde.gaussian_kde(z)
plt.plot(x, density(x))
plt.axvline(x=v, c="r", linestyle="--")
plt.axvline(x=u, c="b")
$$\|\Sigma_{YX}\|^2_{HS} = \sum_{i=1}^\infty\|\Sigma_{YX}e_{X,i}\|^2_{H_Y} = \sum_{i=1}^\infty\sum_{j=1}^\infty\langle e_{Y,j}, \Sigma_{YX}e_{X,i}\rangle^2_{H_Y}$$
$$= \sum_{i=1}^\infty\sum_{j=1}^\infty\langle e_{X,i}\otimes e_{Y,j},\, m_{XY} - m_Xm_Y\rangle^2_{H_X\otimes H_Y} = \|m_{XY} - m_Xm_Y\|^2_{H_X\otimes H_Y}.$$
Similarly, $\|\Sigma_{XY}\|^2_{HS}$ has the same value.
Let $H$ and $k$ be an RKHS and its reproducing kernel, respectively. Let $\mathcal P$ be the set of distributions that a random variable $X$ follows. Then, we can define the map
$$\mathcal P \ni \mu \mapsto \int k(x,\cdot)d\mu(x)\in H.$$
if the radius $\epsilon$ of the open set $U(x,\epsilon)$ is sufficiently small, then $E(\eta)$ and $U(x,\epsilon)$ have no intersection.
Here, if $E(\eta) = E$, then (5.11) means that $\mu = 0$, i.e., $\mu_1 = \mu_2$. On the other hand, if $E(\eta)\subsetneq E$, then a $\mu\ne0$ exists such that (5.11) holds.
Proposition 54 $k(x,y) = \phi(x-y)$ is characteristic if and only if the support of the finite measure $\eta$ in $\phi(x-y) = \int_E e^{i(x-y)^\top w}d\eta(w)$ coincides with $E$.
For the proof of necessity, see the Appendix at the end of this chapter.
Example 80 The Gaussian kernel in Example 19 and the Laplace kernel in Example 20 correspond to zero-mean Gaussian and Laplace measures $\eta$, respectively, whose support is the whole space. Therefore, they are characteristic kernels. On the other hand, the kernel
k(x, y) = φ(x − y)
k1 , k2 is characteristic as well.
Let $E$ be a compact set. Suppose that the RKHS $H$ induced by the kernel $k : E\times E\to\mathbb{R}$ is a dense (under the uniform norm) subset of the set $C(E)$ of continuous functions $E\to\mathbb{R}$. Then, we say that the kernel $k$ is universal.
To show that the kernel k is universal, we only need to see if the correspond-
ing RKHS satisfies the two Stone-Weierstrass conditions (Proposition 12). Proposi-
tion 56 gives a sufficient condition for the kernel k to be universal (see Chap. 2 for
if we choose the $\alpha_i$ so that $\|\gamma f(\cdot) - \sum_i\alpha_i(\cdot)\|_\infty$ is sufficiently small, then
$$\Big\|f(\cdot) - \sum_i\alpha_i\gamma^{-1}(\cdot)\Big\|_\infty \le \|\gamma^{-1}\|_\infty\Big\|\gamma f(\cdot) - \sum_i\alpha_i(\cdot)\Big\|_\infty$$
is also small.
The necessary and sufficient condition of Proposition 54 assumes that the kernel is a function of the difference between two variables. The following is only a sufficient condition, but it applies to kernels in general.
Proposition 57 A universal kernel on a compact set is characteristic.
Proof: See the Appendix at the end of the chapter.
$$\sup_{f\in\mathcal F}\{E_P[f(X)] - E_Q[f(X)]\},$$
where $\mathcal F$ is a class of functions. This chapter also deals with the case in which $\mathcal F := \{f\in H \mid \|f\|_H\le1\}$.
Proposition 58 Suppose that a $k_{\max}$ exists such that $0\le k(x,y)\le k_{\max}$ for each $x,y\in E$. Then, for any $\epsilon > 0$, we have
$$P\Big(|\mathrm{MMD}_B - \mathrm{MMD}| > 2\sqrt{\frac{4k_{\max}}{N}} + \epsilon\Big) \le 2\exp\Big(-\frac{\epsilon^2N}{4k_{\max}}\Big),$$
where $\mathrm{MMD}^2_B$, the estimator of $\mathrm{MMD}^2$, is given by (5.3), and we assume that the numbers of samples for $x, y$ are both equal to $N$ and that $P = Q$.
For the proof of Proposition 58, we use an inequality that slightly generalizes Proposition 46.
Proposition 59 (McDiarmid) Let $f : E^m\to\mathbb{R}$ be such that constants $c_i < \infty$ ($i = 1,\ldots,m$) exist satisfying
$$\sup_{x_1,\ldots,x_m,\,x_i'}|f(x_1,\ldots,x_i,\ldots,x_m) - f(x_1,\ldots,x_i',\ldots,x_m)| \le c_i.$$
Then we have
$$P(f(x_1,\ldots,x_m) - E_{X_1\cdots X_m}f(X_1,\ldots,X_m) > \epsilon) \le \exp\Big(-\frac{2\epsilon^2}{\sum_{i=1}^m c_i^2}\Big) \tag{5.12}$$
and
$$P(|f(x_1,\ldots,x_m) - E_{X_1\cdots X_m}f(X_1,\ldots,X_m)| > \epsilon) < 2\exp\Big(-\frac{2\epsilon^2}{\sum_{i=1}^m c_i^2}\Big). \tag{5.13}$$
Proof: Hereafter, we denote $f(X_1,\ldots,X_N)$ and $E[f(X_1,\ldots,X_N)]$ by $f$ and $E[f]$, respectively. If we define
$$V_1 := E_{X_2\cdots X_N}[f|X_1] - E_{X_1\cdots X_N}[f],$$
$$\vdots$$
$$V_i := E_{X_{i+1}\cdots X_N}[f|X_1,\ldots,X_i] - E_{X_i\cdots X_N}[f|X_1,\ldots,X_{i-1}],$$
$$\vdots$$
$$V_N := f - E_{X_N}[f|X_1,\ldots,X_{N-1}],$$
then we have
$$f - E_{X_1\cdots X_N}[f] = \sum_{i=1}^N V_i. \tag{5.14}$$
From the definition of $V_i$, we have
$$E_{X_i}[V_i|X_1,\ldots,X_{i-1}] = 0. \tag{5.15}$$
Moreover, for arbitrary $t > 0$,
$$f - E[f] > \epsilon \iff \exp\Big\{t\sum_{i=1}^N V_i\Big\} > e^{t\epsilon},$$
so that, by Markov's inequality,
$$P(f - E[f] \ge \epsilon) \le \inf_{t>0}e^{-t\epsilon}E\Big[\exp\Big\{t\sum_{i=1}^N V_i\Big\}\Big]. \tag{5.16}$$
Using (5.15) and Lemma 7 for each conditional expectation and iterating, we obtain
$$E\Big[\exp\Big\{t\sum_{i=1}^N V_i\Big\}\Big] = E_{X_1\cdots X_{N-1}}\Big[\exp\Big\{t\sum_{i=1}^{N-1}V_i\Big\}E_{X_N}[\exp\{tV_N\}|X_1,\ldots,X_{N-1}]\Big]$$
$$\le E_{X_1\cdots X_{N-1}}\Big[\exp\Big\{t\sum_{i=1}^{N-1}V_i\Big\}\Big]\exp\{t^2c_N^2/8\} \le \cdots \le \exp\Big\{\frac{t^2}{8}\sum_{i=1}^N c_i^2\Big\}.$$
Hence,
$$P(f - E[f] \ge \epsilon) \le \inf_{t>0}\exp\Big\{-t\epsilon + \frac{t^2}{8}\sum_{i=1}^N c_i^2\Big\}.$$
The right-hand side is minimized when $t = 4\epsilon/\sum_{i=1}^N c_i^2$, and we obtain (5.12). Replacing $f$ with $-f$, we obtain the other inequality. From both inequalities, we have (5.13).
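A small simulation makes Proposition 59 concrete. For the sample mean of $m$ independent variables in $[0,1]$, each $c_i = 1/m$; the following sketch (with arbitrary $m$ and $\epsilon$) compares the empirical tail probability with the bound (5.13).

import numpy as np

m, eps, trials = 50, 0.1, 100000
f = np.mean(np.random.uniform(0, 1, size=(trials, m)), axis=1)  # sample means
empirical = np.mean(np.abs(f - 0.5) > eps)
bound = 2 * np.exp(-2 * eps**2 / (m * (1 / m)**2))  # 2 exp(-2 eps^2 / sum c_i^2)
print(empirical, "<=", bound)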
In the following, we denote by
$$\mathcal F := \{f\in H \mid \|f\|_H\le1\}$$
the unit ball in the universal RKHS $H$ (see Sect. 5.4 for the definition of universality) w.r.t. a compact $E$, and we assume that the kernel of $H$ is bounded by $k_{\max}$. Hereafter, let $X_1,\ldots,X_N$ be independent random variables that follow the probability $P$, and let $\sigma_1,\ldots,\sigma_N$ be independent random variables, each of which takes the values $\pm1$ equiprobably. Then, we call the quantity
$$R_N(\mathcal F) := E_\sigma\Big[\sup_{f\in\mathcal F}\Big|\frac1N\sum_{i=1}^N\sigma_i f(x_i)\Big|\Big] \tag{5.17}$$
the Rademacher complexity of $\mathcal F$.
Proof: From $\|f\|_H\le1$ and $k(x,x)\le k_{\max}$, we have
$$R_N(\mathcal F) = E_\sigma\Big[\sup_{f\in\mathcal F}\Big|\frac1N\sum_{i=1}^N\sigma_i f(x_i)\Big|\Big] = E_\sigma\Big[\sup_{f\in\mathcal F}\Big|\frac1N\sum_{i=1}^N\sigma_i\langle k(x_i,\cdot), f(\cdot)\rangle_H\Big|\Big]$$
$$= E_\sigma\Big[\sup_{f\in\mathcal F}\Big|\Big\langle f, \frac1N\sum_{i=1}^N\sigma_i k(x_i,\cdot)\Big\rangle_H\Big|\Big] \le E_\sigma\Big[\sup_{f\in\mathcal F}\|f\|_H\sqrt{\Big\langle\frac1N\sum_{i=1}^N\sigma_i k(x_i,\cdot), \frac1N\sum_{i=1}^N\sigma_i k(x_i,\cdot)\Big\rangle_H}\Big]$$
$$\le E_\sigma\Big[\sqrt{\frac{1}{N^2}\sum_{i=1}^N\sum_{j=1}^N\sigma_i\sigma_j k(x_i,x_j)}\Big] \le \sqrt{E_\sigma\Big[\frac{1}{N^2}\sum_{i=1}^N\sum_{j=1}^N\sigma_i\sigma_j k(x_i,x_j)\Big]}$$
$$= \sqrt{\frac{1}{N^2}\sum_{i=1}^N\sum_{j=1}^N\delta_{i,j}k(x_i,x_j)} \le \sqrt{\frac{k_{\max}}{N}},$$
where we use $E[\sigma_i\sigma_j] = \sigma_i^2\delta_{i,j} = \delta_{i,j}$ and Jensen's inequality in the derivation. We obtain the other inequality by taking the expectation w.r.t. the probability $P$.
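For the unit ball $\mathcal F$, the supremum in (5.17) can be evaluated in closed form as $\frac1N\sqrt{\sigma^\top K\sigma}$, so $R_N(\mathcal F)$ can be estimated by Monte Carlo over $\sigma$ and compared with the bound; the following sketch uses a Gaussian kernel (so $k_{\max} = 1$) and arbitrary sample points.

import numpy as np

N = 100
x = np.random.randn(N)
K = np.exp(-(x[:, None] - x[None, :])**2 / 2)
vals = []
for _ in range(1000):
    sigma = np.random.choice([-1, 1], size=N)
    vals.append(np.sqrt(sigma @ K @ sigma) / N)  # sup over the unit ball for this sigma
print("Monte Carlo R_N(F):", np.mean(vals), " bound:", np.sqrt(1 / N))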
Propositions 59 and 60 are inequalities used for mathematical analysis in machine
learning as well as for the proof of Proposition 58.
Proof of Proposition 58: If we define
$$f(x_1,\ldots,x_N,y_1,\ldots,y_N) := \Big\|\frac1N k(x_1,\cdot) + \cdots + \frac1N k(x_N,\cdot) - \frac1N k(y_1,\cdot) - \cdots - \frac1N k(y_N,\cdot)\Big\|_H,$$
then, from the triangle inequality, we obtain
$$|\mathrm{MMD} - \mathrm{MMD}_B| = \Big|\sup_{f\in\mathcal F}\{E_P(f) - E_Q(f)\} - \sup_{f\in\mathcal F}\Big\{\frac1N\sum_{i=1}^N f(x_i) - \frac1N\sum_{j=1}^N f(y_j)\Big\}\Big|$$
$$\le \sup_{f\in\mathcal F}\Big|E_P(f) - E_Q(f) - \Big\{\frac1N\sum_{i=1}^N f(x_i) - \frac1N\sum_{j=1}^N f(y_j)\Big\}\Big|. \tag{5.18}$$
Taking the expectation of the right-hand side of (5.18), we have
$$E_{X,Y}\Big[\sup_{f\in\mathcal F}\Big|E_P(f) - E_Q(f) - \Big\{\frac1N\sum_{i=1}^N f(X_i) - \frac1N\sum_{i=1}^N f(Y_i)\Big\}\Big|\Big]$$
$$= E_{X,Y}\Big[\sup_{f\in\mathcal F}\Big|E_{X'}\Big\{\frac1N\sum_{i=1}^N f(X_i') - \frac1N\sum_{i=1}^N f(X_i)\Big\} - E_{Y'}\Big\{\frac1N\sum_{j=1}^N f(Y_j') - \frac1N\sum_{j=1}^N f(Y_j)\Big\}\Big|\Big]$$
$$\le E_{X,Y,X',Y'}\Big[\sup_{f\in\mathcal F}\Big|\frac1N\sum_{i=1}^N f(X_i') - \frac1N\sum_{i=1}^N f(X_i) - \frac1N\sum_{i=1}^N f(Y_i') + \frac1N\sum_{i=1}^N f(Y_i)\Big|\Big]$$
$$= E_{X,Y,X',Y',\sigma,\sigma'}\Big[\sup_{f\in\mathcal F}\Big|\frac1N\sum_{i=1}^N\sigma_i\{f(X_i') - f(X_i)\} + \frac1N\sum_{i=1}^N\sigma_i'\{f(Y_i') - f(Y_i)\}\Big|\Big]$$
$$\le E_{X,X',\sigma}\Big[\sup_{f\in\mathcal F}\Big|\frac1N\sum_{i=1}^N\sigma_i\{f(X_i') - f(X_i)\}\Big|\Big] + E_{Y,Y',\sigma'}\Big[\sup_{f\in\mathcal F}\Big|\frac1N\sum_{j=1}^N\sigma_j'\{f(Y_j') - f(Y_j)\}\Big|\Big]$$
$$\le 2[R(\mathcal F,P) + R(\mathcal F,Q)] \le 2[(k_{\max}/N)^{1/2} + (k_{\max}/N)^{1/2}] = 4\sqrt{\frac{k_{\max}}{N}}, \tag{5.19}$$
where the first inequality is due to Jensen's inequality, the second stems from the triangle inequality, the third is derived from the definition of the Rademacher complexity, and the fourth is obtained from the inequality for the Rademacher complexity (Proposition 60). From (5.18) and (5.19), for $c_i = \frac2N\sqrt{k_{\max}}$ and $f = \mathrm{MMD} - \mathrm{MMD}_B$, we have
$$E_{X_1\ldots X_N, Y_1\ldots Y_N}[f] \le 4\sqrt{\frac{k_{\max}}{N}}.$$
Combining this bound on the expectation with Proposition 59 yields the claim.
Appendix
The essential part of the proof of Proposition 54 was given by Fukumizu [7] but has
been rewritten as a concise derivation to make it easier for beginners to understand.
Proof of Proposition 48
The fact that E x → k(x, ·) ∈ H is measurable means that E[k(X, ·)] can be
treated as a random variable. However, the events in E × E are the direct prod-
ucts of the events generated by each E (the elements of F × F). Therefore, if the
function E × E (x, y) → k(x, y) ∈ R is measurable, then the function E y →
k(x, y) ∈ R is measurable for each x ∈ E (even if y ∈ E is fixed, (x, y) → k(x, y)
is still measurable). In the following, we show that any function belonging to H is
measurable. First, we note that H0 = span{k(x, ·)|x ∈ E} is dense in H . Addition-
ally, we note that for a sequence $\{f_n\}$ in $H_0$, $\|f - f_n\|_H\to0$ ($n\to\infty$) implies that $|f(x) - f_n(x)|\to0$ for each $x\in E$ (Proposition 35). The following lemma implies that $f$ is measurable.
for any $f\in H$ and $\delta > 0$ (this is an extension to the case where $H = \mathbb{R}$). Moreover, we have
$$\|f - k(x,\cdot)\|_H < \delta \iff k(x,x) - 2f(x) < \delta^2 - \|f\|_H^2.$$
Proof of Lemma 8
It is sufficient to show that f −1 (B) ∈ F for any open set B. We fix B ⊆ R arbitrarily
and let Fm := {y ∈ B|U (y, 1/m) ⊆ B}, where U (y, r ) := {x ∈ R | d(x, y) < r }.
From the definition, we have the following two equations.
Proof of Proposition 49
The evaluation is finite for arbitrary $g = \sum_{i=1}^\infty\sum_{j=1}^\infty a_{i,j}e_{X,i}e_{Y,j}\in H_X\otimes H_Y$ and $(x,y)\in E_X\times E_Y$. In fact, we have
$$|g(x,y)| \le \sum_{i=1}^\infty\sum_{j=1}^\infty|a_{i,j}|\cdot|e_{X,i}(x)|\cdot|e_{Y,j}(y)| \le \sum_{i=1}^\infty|e_{X,i}(x)|\cdot\Big(\sum_{j=1}^\infty e_{Y,j}^2(y)\Big)^{1/2}\Big(\sum_{j=1}^\infty a_{i,j}^2\Big)^{1/2}, \tag{5.20}$$
where we apply the Cauchy-Schwarz inequality (2.5) to $\sum_{j=1}^\infty$. If we set $k_Y(y,\cdot) = \sum_j h_j(y)e_{Y,j}(\cdot)$, then from $\langle e_{Y,i}(\cdot), k_Y(y,\cdot)\rangle = e_{Y,i}(y)$, we have $h_i(y) = e_{Y,i}(y)$ and $k_Y(y,\cdot) = \sum_{j=1}^\infty e_{Y,j}(y)e_{Y,j}(\cdot)$. Thus, we obtain
$$\sum_{j=1}^\infty e_{Y,j}^2(y) = k_Y(y,y) \tag{5.21}$$
and
$$\sum_{i=1}^\infty|e_{X,i}(x)|\cdot\Big(\sum_{j=1}^\infty a_{i,j}^2\Big)^{1/2} \le \Big(\sum_{i=1}^\infty e_{X,i}^2(x)\Big)^{1/2}\Big(\sum_{i=1}^\infty\sum_{j=1}^\infty a_{i,j}^2\Big)^{1/2} = \sqrt{k_X(x,x)}\,\|g\|, \tag{5.22}$$
where we apply the Cauchy-Schwarz inequality (2.5) to $\sum_{i=1}^\infty$. Note that (5.20), (5.21), and (5.22) imply that $|g(x,y)|\le\sqrt{k_X(x,x)}\sqrt{k_Y(y,y)}\,\|g\|$. Thus, $H_X\otimes H_Y$ is an RKHS.
From $k_X(x,\cdot)\in H_X$ and $k_Y(y,\cdot)\in H_Y$, we have that $k(x,\cdot,y,\star) := k_X(x,\cdot)k_Y(y,\star)\in H_X\otimes H_Y$ for $k(x,x',y,y') := k_X(x,x')k_Y(y,y')$. From
$$g(x,y) = \sum_{i=1}^\infty\sum_{j=1}^\infty a_{i,j}e_{X,i}(x)e_{Y,j}(y) = \sum_{i=1}^\infty\sum_{j=1}^\infty a_{i,j}\langle e_{X,i}(\cdot), k_X(x,\cdot)\rangle_{H_X}\langle e_{Y,j}(\star), k_Y(y,\star)\rangle_{H_Y}$$
$$= \sum_{i=1}^\infty\sum_{j=1}^\infty a_{i,j}\langle e_{X,i}(\cdot)e_{Y,j}(\star), k(x,\cdot,y,\star)\rangle_H = \langle g(\cdot,\star), k(x,\cdot,y,\star)\rangle_H,$$
the reproducing property holds.
Since $g$ is not zero, $\nu$ is not the zero measure. Thus, using the total variation
$$|\nu|(B) := \sup_{\cup B_i = B}\sum_{i=1}^n|\nu(B_i)|,\quad B\in\mathcal F,$$
where the supremum is over the partitions of $B$ into $B_i\in\mathcal F$, we define the constant $c := |\nu|(E)$ and the finite measures $\mu_1 := \frac1c|\nu|$ and $\mu_2 := \frac1c\{|\nu| - \nu\}$. From $\nu(E) = 0$, we observe that $\mu_1$ and $\mu_2$ are both probabilities and that $\mu_1\ne\mu_2$. Additionally, we have
$$c(d\mu_1 - d\mu_2) = d\nu = 2\cos(w_0^\top x)d\mu.$$
From Fubini's theorem, we can write the difference between the expectations w.r.t. the probabilities $\mu_1, \mu_2$ as
$$\int_E\phi(x-y)d\mu_1(y) - \int_E\phi(x-y)d\mu_2(y) = \frac1c\int_E\phi(x-y)2\cos(w_0^\top y)d\mu(y)$$
$$= \frac1c\int_E2\cos(w_0^\top y)\int e^{i(x-y)^\top w}d\eta(w)\,d\mu(y) = \frac1c\int e^{ix^\top w}h(w)d\eta(w).$$
However, since the supports of $h$ and $\eta$ do not intersect, the value is zero, which contradicts the assumption that $\phi(x-y)$ is a characteristic kernel.
Proof of Proposition 57
If $\int_E f\,dP = \int_E f\,dQ$ holds for any bounded continuous $f$, then $P = Q$ (Fig. 5.5). In fact, let $U$ be an open subset of $E$, and let $V$ be its complement. Furthermore, let $d(x,V) := \inf_{y\in V}d(x,y)$ and $f_n(x) := \min(1, n\,d(x,V))$. Then, $f_n$ is a bounded continuous function on $E$, and $f_n(x)\le I(x\in U)$ and $f_n(x)\to I(x\in U)$ as $n\to\infty$ for each $x\in E$; thus, by the monotone convergence theorem, $\int_E f_n\,dP\to P(U)$ and $\int_E f_n\,dQ\to Q(U)$ hold. By our assumption, $\int_E f_n\,dP = \int_E f_n\,dQ$, and hence $P(U) = Q(U)$, i.e., $P(V) = Q(V)$ holds.² In other words, the probabilities of all events are determined by those of the closed sets. Let $E$ be a compact set. For each element $g\in H$ of the RKHS $H$ of a universal kernel, the same argument applies, since $\sup_{x\in E}|f(x) - g(x)|$ can be made arbitrarily small for any $f\in C(E)$. That is, if $\int g\,dP = \int g\,dQ$ holds for any $g\in H$, then $P = Q$, so the universal kernel is characteristic.
² If $E$ is compact, then for any $A\in\mathcal F$, $P(A) = \sup\{P(V)\mid V\text{ is a closed set}, V\subseteq A, V\in\mathcal F\}$ (Theorem 7.1.3, Dudley [6]).
Exercises 65∼83
65. Proposition 49 can be derived according to the following steps. Which part of the proof in the Appendix does each step correspond to?
(a) Show that $|g(x,y)| \le \sqrt{k_X(x,x)}\sqrt{k_Y(y,y)}\,\|g\|$ for $g\in H_X\otimes H_Y$ and $x\in E_X$, $y\in E_Y$ (from Proposition 33, this implies that $H$ is some RKHS).
(b) Show that $k(x,\cdot,y,\star) := k_X(x,\cdot)k_Y(y,\star)\in H$ when $x\in E_X$, $y\in E_Y$ are fixed.
(c) Show that $f(x,y) = \langle f(\cdot,\star), k(x,\cdot,y,\star)\rangle_H$.
66. How can we define the mean element $m_{XY} = E_{XY}[k_X(X,\cdot)k_Y(Y,\cdot)]$ of $H_X\otimes H_Y$? Define it in the same way that we defined $m_X$ using Riesz's lemma (Proposition 22).
67. Show that $\Sigma_{YX}\in B(H_X,H_Y)$ exists such that
$$\langle fg,\, m_{XY} - m_Xm_Y\rangle_{H_X\otimes H_Y} = \langle\Sigma_{YX}f, g\rangle_{H_Y}$$
for each $f\in H_X$, $g\in H_Y$.
68. The MMD is generally defined as $\sup_{f\in\mathcal F}\{E_P[f(X)] - E_Q[f(X)]\}$ for some set $\mathcal F$ of functions. Assuming that $\mathcal F := \{f\in H\mid\|f\|_H\le1\}$, show that the MMD is $\|m_P - m_Q\|_H$. Furthermore, show that we can transform the squared MMD as follows:
$$\mathrm{MMD}^2 = E_{XX'}[k(X,X')] + E_{YY'}[k(Y,Y')] - 2E_{XY}[k(X,Y)],$$
where $X$ and $X'$ ($Y$ and $Y'$) are independent random variables that follow the same distribution.
69. Show that the squared MMD estimator (5.4) is unbiased.
70. In the two-sample problem solved by a permutation test in Example 71, for the
case when the numbers of samples are m, n (can be different) instead of the same
n and m, n are both even numbers, modify the entire program in Example 71 to
examine whether it works correctly (m = n in Example 71).
71. For the function h in (5.6), show that h 1 is a function that always takes a value
of zero and that h̃ 2 and h coincide as functions.
72. Show that the fact that random variables X, Y that follow Gaussian distributions
are independent is equivalent to the condition that their correlation coefficient
is zero. Additionally, give an example of two variables whose correlation coef-
ficient is zero but that are not independent.
73. Prove the following equation:
$$\|m_{XY} - m_Xm_Y\|^2 = E_{XX'YY'}[k_X(X,X')k_Y(Y,Y')] - 2E_{XY}\{E_{X'}[k_X(X,X')]E_{Y'}[k_Y(Y,Y')]\} + E_{XX'}[k_X(X,X')]E_{YY'}[k_Y(Y,Y')].$$
$$\widehat{HSIC} := \frac{1}{N^2}\sum_i\sum_j k_X(x_i,x_j)k_Y(y_i,y_j) - \frac{2}{N^3}\sum_i\sum_j k_X(x_i,x_j)\sum_h k_Y(y_i,y_h) + \frac{1}{N^4}\sum_i\sum_j k_X(x_i,x_j)\sum_h\sum_r k_Y(y_h,y_r)$$
def cc(x, y):
    return np.sum(np.dot(x.T, y)) / len(x)

def f(u, v):
    return u - cc(u, v) / cc(v, v) * v

## Estimate the Upstream ##
def cc(x, y):
    return np.sum(np.dot(x.T, y) / len(x))

def f(u, v):
    return u - cc(u, v) / cc(v, v) * v
if v1 < v2:
    if v1 < v3:
        top = 1
    else:
        top = 3
else:
    if v2 < v3:
        top = 2
    else:
        top = 3

## Estimate the Midstream ##
x_yz = f(x_y, z_y)
y_zx = f(y_z, x_z)
z_xy = f(z_x, y_x)
if top == 1:
    v1 = ## Blank (1) ##
    v2 = ## Blank (2) ##
    if v1 < v2:
        middle = 2
        bottom = 3
    else:
        middle = 3
        bottom = 2
if top == 2:
    v1 = ## Blank (3) ##
    v2 = ## Blank (4) ##
    if v1 < v2:
        middle = 3
        bottom = 1
    else:
        middle = 1
        bottom = 3
if top == 3:
    v1 = ## Blank (5) ##
    v2 = ## Blank (6) ##
    if v1 < v2:
        middle = 1
        bottom = 2
    else:
        middle = 2
        bottom = 1

## Output the Results ##
print("top =", top)
print("middle =", middle)
print("bottom =", bottom)
# Data Generation
x = np.random.randn(n)
y = np.random.randn(n)
density = kde.gaussian_kde(w)
plt.plot(x, density(x))
plt.axvline(x=v, c="r", linestyle="--")
plt.axvline(x=u, c="b")
78. In the MMD (Sect. 5.2) and HSIC (Sect. 5.3), we cannot apply Mercer’s theorem
because the kernel of the integral operator is not nonnegative definite. However,
in both cases, the integral operator possesses eigenvalues and eigenfunctions.
Why?
79. Show that k(x, y) = φ(x − y), φ(t) = e−|t| is a characteristic kernel.
80. In the proof of Proposition 54 (necessity, Appendix), we used the fact that $g(w) := (1 - \|w\|_2)_+^{(d+1)/2}$ is nonnegative definite [8]. Verify that this fact is correct for $d = 1$ by proving the following equality:
$$\frac{1}{2\pi}\int_{-\infty}^{\infty} g(w)e^{-iwx}dw = \frac{1-\cos(x)}{\pi x^2}.$$
81. Why is the exponential type a universal kernel? Why is the characteristic kernel
based on a triangular distribution not a universal kernel?
82. Explain why the three equalities and four inequalities hold in the following derivation of the upper bound on the Rademacher complexity:
$$R_N(\mathcal F) = E_\sigma\Big[\sup_{f\in\mathcal F}\Big|\frac1N\sum_{i=1}^N\sigma_i f(x_i)\Big|\Big] = E_\sigma\Big[\sup_{f\in\mathcal F}\Big|\frac1N\sum_{i=1}^N\sigma_i\langle k(x_i,\cdot), f(\cdot)\rangle_H\Big|\Big]$$
$$= E_\sigma\Big[\sup_{f\in\mathcal F}\Big|\Big\langle f, \frac1N\sum_{i=1}^N\sigma_i k(x_i,\cdot)\Big\rangle_H\Big|\Big]$$
$$\le E_\sigma\Big[\sup_{f\in\mathcal F}\|f\|_H\sqrt{\Big\langle\frac1N\sum_{i=1}^N\sigma_i k(x_i,\cdot), \frac1N\sum_{i=1}^N\sigma_i k(x_i,\cdot)\Big\rangle_H}\Big]$$
$$\le E_\sigma\Big[\sqrt{\frac{1}{N^2}\sum_{i=1}^N\sum_{j=1}^N\sigma_i\sigma_j k(x_i,x_j)}\Big] \le \sqrt{E_\sigma\Big[\frac{1}{N^2}\sum_{i=1}^N\sum_{j=1}^N\sigma_i\sigma_j k(x_i,x_j)\Big]} \le \sqrt{\frac{k_{\max}}{N}}.$$
83. Explain why the one equality and four inequalities hold in the following derivation of the upper bound related to $|\mathrm{MMD} - \mathrm{MMD}_B|$:
$$E_{X,Y}\Big[\sup_{f\in\mathcal F}\Big|E_{X'}\Big\{\frac1N\sum_{i=1}^N f(X_i') - \frac1N\sum_{i=1}^N f(X_i)\Big\} - E_{Y'}\Big\{\frac1N\sum_{j=1}^N f(Y_j') - \frac1N\sum_{j=1}^N f(Y_j)\Big\}\Big|\Big]$$
$$\le E_{X,Y,X',Y'}\Big[\sup_{f\in\mathcal F}\Big|\frac1N\sum_{i=1}^N f(X_i') - \frac1N\sum_{i=1}^N f(X_i) - \frac1N\sum_{i=1}^N f(Y_i') + \frac1N\sum_{i=1}^N f(Y_i)\Big|\Big]$$
$$= E_{X,Y,X',Y',\sigma,\sigma'}\Big[\sup_{f\in\mathcal F}\Big|\frac1N\sum_{i=1}^N\sigma_i\{f(X_i') - f(X_i)\} + \frac1N\sum_{i=1}^N\sigma_i'\{f(Y_i') - f(Y_i)\}\Big|\Big]$$
$$\le E_{X,X',\sigma}\Big[\sup_{f\in\mathcal F}\Big|\frac1N\sum_{i=1}^N\sigma_i\{f(X_i') - f(X_i)\}\Big|\Big] + E_{Y,Y',\sigma'}\Big[\sup_{f\in\mathcal F}\Big|\frac1N\sum_{j=1}^N\sigma_j'\{f(Y_j') - f(Y_j)\}\Big|\Big]$$
$$\le 2[R(\mathcal F,P) + R(\mathcal F,Q)] \le 2[(k_{\max}/N)^{1/2} + (k_{\max}/N)^{1/2}].$$
Chapter 6
Gaussian Processes and Functional Data
Analyses
6.1 Regression
Let $E$ and $(\Omega, \mathcal F, \mu)$ be a set and a probability space. If the correspondence $\Omega\ni\omega\mapsto f(\omega,x)\in\mathbb{R}$ is measurable for each $x\in E$, i.e., if $f(\omega,x)$ is a random variable at each $x\in E$, then we say that $f : \Omega\times E\to\mathbb{R}$ is a stochastic process. Moreover, if the random variables $f(\omega,x_1),\ldots,f(\omega,x_N)$ follow an $N$-variate Gaussian distribution for any $N\ge1$ and any finite number of elements $x_1,\ldots,x_N\in E$, then we call $f$ a Gaussian process. We define the covariance between $x_i, x_j\in E$ by
$$\int_\Omega\{f(\omega,x_i) - m(x_i)\}\{f(\omega,x_j) - m(x_j)\}d\mu(\omega),$$
where $m(x) := \int_\Omega f(\omega,x)d\mu(\omega)$ is the expectation of $f(\omega,x)$ for $x\in E$. Then, no matter what $N$ and $x_1,\ldots,x_N$ we choose, their covariance matrices are nonnegative
definite. Thus, we can write the covariance matrix by using a positive definite kernel $k : E\times E\to\mathbb{R}$. Therefore, the Gaussian process can be uniquely expressed in terms of a pair $(m,k)$ consisting of the mean $m(x)$ at each $x\in E$ and the covariance $k(x,x')$ at each $(x,x')\in E\times E$.
In general, a random variable is a map $\Omega\to\mathbb{R}$, and we should make $\omega$ explicit, i.e., $f(\omega,x)$, but for simplicity, for the time being, we make $\omega$ implicit, i.e., $f(x)$, even if it is a random variable.
Example 83 Let $m_X\in\mathbb{R}^N$ and $k_{XX}\in\mathbb{R}^{N\times N}$ be the mean and covariance matrix, respectively, of the Gaussian process $(m,k)$ at $x_1,\ldots,x_N\in E := \mathbb{R}$. In general, for a mean $\mu$ and a covariance matrix $\Sigma\in\mathbb{R}^{N\times N}$, $\Sigma$ is nonnegative definite, and there exists a lower triangular matrix $R\in\mathbb{R}^{N\times N}$ with $\Sigma = RR^\top$ (Cholesky decomposition). Therefore, to generate random numbers that follow $N(m_X, k_{XX})$ from $N$ independent random numbers $u_1,\ldots,u_N$ that follow the standard Gaussian distribution, we can calculate $f_X := R_Xu + m_X\in\mathbb{R}^N$ for $k_{XX} = R_XR_X^\top$ with $u = [u_1,\ldots,u_N]^\top$. In fact, the expectation and the covariance matrix of $f_X$ are $m_X$ and
$$E[(f_X - m_X)(f_X - m_X)^\top] = E[R_Xuu^\top R_X^\top] = R_XE[uu^\top]R_X^\top = R_XR_X^\top = k_{XX},$$
# Install the module skfda via
pip install scikit-fda

# In this chapter, we assume that the following has been executed.
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import style
from sklearn.decomposition import PCA
import skfda

# Definition of (m, k)
def m(x):
    return 0

def k(x, y):
    return np.exp(-(x - y)**2 / 2)

# Definition of gp_sample
def gp_sample(x, m, k):
    n = len(x)
    m_x = m(x)
    k_xx = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            k_xx[i, j] = k(x[i], x[j])
    R = np.linalg.cholesky(k_xx)  # lower triangular matrix
    u = np.random.randn(n)
    return R.dot(u) + m_x
k_xx:
[[1.00000000e+00 6.06530660e-01 1.35335283e-01 1.11089965e-02
3.35462628e-04]
[6.06530660e-01 1.00000000e+00 6.06530660e-01 1.35335283e-01
1.11089965e-02]
[1.35335283e-01 6.06530660e-01 1.00000000e+00 6.06530660e-01
1.35335283e-01]
[1.11089965e-02 1.35335283e-01 6.06530660e-01 1.00000000e+00
6.06530660e-01]
[3.35462628e-04 1.11089965e-02 1.35335283e-01 6.06530660e-01
1.00000000e+00]]
Example 84 For E = R2 , we can similarly obtain random numbers that follow the
N -variate multivariate Gaussian distribution.
# Definition of (m, k)
def m(x):
    return 0

def k(x, y):
    return np.exp(-np.sum((x - y)**2) / 2)

# Definition of the function gp_sample
def gp_sample(x, m, k):
    n = x.shape[0]
    m_x = m(x)
    k_xx = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            k_xx[i, j] = k(x[i], x[j])
    R = np.linalg.cholesky(k_xx)  # lower triangular matrix
    u = np.random.randn(n)
    return R.dot(u) + m_x
k_xx:
[[1.00000000e+00 6.06530660e-01 1.35335283e-01 1.11089965e-02
3.35462628e-04]
[6.06530660e-01 1.00000000e+00 6.06530660e-01 1.35335283e-01
1.11089965e-02]
[1.35335283e-01 6.06530660e-01 1.00000000e+00 6.06530660e-01
1.35335283e-01]
[1.11089965e-02 1.35335283e-01 6.06530660e-01 1.00000000e+00
6.06530660e-01]
[3.35462628e-04 1.11089965e-02 1.35335283e-01 6.06530660e-01
1.00000000e+00]]
We consider the regression model
y_i = f(x_i) + ε_i ,  ε_i ∼ N(0, σ²) ,    (6.1)
for which the likelihood of y_1, ..., y_N is
∏_{i=1}^{N} [ 1/√(2πσ²) · exp{ −(y_i − f(x_i))² / (2σ²) } ]
when the function f is known (fixed). In the following, we assume that the function f
randomly varies, and we regard the Gaussian process (m, k) as its prior
distribution. That is, we consider the model f_X ∼ N(m_X, k_{XX}) with y_i | f(x_i) ∼
N(f(x_i), σ²), where f_X = (f(x_1), ..., f(x_N)). Then, we calculate the posterior distribution
of f(z_1), ..., f(z_n) corresponding to z_1, ..., z_n ∈ E, which are different from
x_1, ..., x_N. The variations in y_1, ..., y_N are due to the variations in f and ε_i. Thus,
the covariance matrix of Y = [y_1, ..., y_N]^⊤ is k_{XX} + σ²I.
In the following, we show that the posterior probability of the function f(·) given the
value of Y is still a Gaussian process. To this end, we use the following proposition.
μ := m_Z + k_{ZX}(k_{XX} + σ²I)^{-1}(Y − m_X) ∈ R^{n}
and
Σ := k_{ZZ} − k_{ZX}(k_{XX} + σ²I)^{-1} k_{XZ} ∈ R^{n×n} .
For the computation, we first perform the Cholesky decomposition
L L^⊤ = k_{XX} + σ²I ,
which can be completed in O(N³/3) time. Then, let the solutions of Lγ = k_{Xx}, Lβ =
y − m(x), and L^⊤α = β be γ ∈ R^{N}, β ∈ R^{N}, and α ∈ R^{N}, respectively. Since L is
a lower triangular matrix, these calculations take at most O(N²) time. Additionally,
we have
k_{xX}(k_{XX} + σ²I)^{-1} k_{Xx} = (Lγ)^⊤ (L L^⊤)^{-1} Lγ = γ^⊤ γ ,
m′(x) = m(x) + k_{xX} α ,
and
k′(x, x) = k(x, x) − γ^⊤ γ .
We can write the calculations of m′(x) and k′(x, x) in forms that are completed in
O(N³) and O(N³/3) time in Python as follows.
def gp_1(x_pred):
    h = np.zeros(n)
    for i in range(n):
        h[i] = k(x_pred, x[i])
    R = np.linalg.inv(K + sigma_2 * np.identity(n))  # O(n^3) computation
    mm = mu(x_pred) + np.dot(np.dot(h.T, R), (y - mu(x)))
    ss = k(x_pred, x_pred) - np.dot(np.dot(h.T, R), h)
    return {"mm": mm, "ss": ss}

def gp_2(x_pred):
    h = np.zeros(n)
    for i in range(n):
        h[i] = k(x_pred, x[i])
    L = np.linalg.cholesky(K + sigma_2 * np.identity(n))  # O(n^3/3) computation
    alpha = np.linalg.solve(L, np.linalg.solve(L.T, (y - mu(x))))  # O(n^2) computation
    mm = mu(x_pred) + np.sum(np.dot(h.T, alpha))
    gamma = np.linalg.solve(L.T, h)  # O(n^2) computation
    ss = k(x_pred, x_pred) - np.sum(gamma**2)
    return {"mm": mm, "ss": ss}
Example 85 For comparison purposes, we executed the functions gp_1 and gp_2.
We can see the difference achieved by Cholesky decomposition, which reduced the
computational complexity (Fig. 6.1).
sigma_2 = 0.2
n = 100
x = np.random.uniform(size=n) * 6 - 3
y = np.sin(x / 2) + np.random.randn(n)
K = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        K[i, j] = k(x[i], x[j])

## Measure execution time
import time
start1 = time.time()
gp_1(0)
end1 = time.time()
print("time1 =", end1 - start1)
start2 = time.time()
gp_2(0)
end2 = time.time()
print("time2 =", end2 - start2)

# The 3 sigma width around the average
u_seq = np.arange(-3, 3.1, 0.1)
v_seq = []; w_seq = []
for u in u_seq:
    res = gp_1(u)
    v_seq.append(res["mm"])
    w_seq.append(res["ss"])
plt.figure()
plt.xlim(-3, 3)
plt.ylim(-3, 3)
plt.scatter(x, y, facecolors="none", edgecolors="k", marker="o")
plt.plot(u_seq, v_seq)
plt.plot(u_seq, np.sum([v_seq, [i * 3 for i in w_seq]], axis=0), c="b")
plt.plot(u_seq, np.sum([v_seq, [i * (-3) for i in w_seq]], axis=0), c="b")
plt.show()
n = 100
plt.figure()
plt.xlim(-3, 3)
plt.ylim(-3, 3)
## Five runs, changing the samples
color = ["r", "g", "b", "k", "m"]
for h in range(5):
    x = np.random.uniform(size=n) * 6 - 3
    y = np.sin(np.pi * x / 2) + np.random.randn(n)
    sigma_2 = 0.2
    K = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = k(x[i], x[j])
    u_seq = np.arange(-3, 3.1, 0.1)
    v_seq = []
    for u in u_seq:
        res = gp_1(u)
        v_seq.append(res["mm"])
    plt.plot(u_seq, v_seq, c=color[h])
time1 = 0.009966373443603516
time2 = 0.057814598083496094
Finally, comparing the posterior mean for a zero prior mean,
m′(x) := k_{xX}(k_{XX} + σ²I)^{-1} Y ,
with the prediction
Fig. 6.1 The range of 3σ above and below the average (left), and the estimated means for five
different samples (right)
k_{xX} α̂ = k_{xX}(K + λI)^{-1} Y ,
obtained by multiplying the kernel ridge regression formula (4.6) by k_{xX} from the
left, we observe that the former is a specific case of the latter when setting λ = σ².
6.2 Classification
We consider the classification problem next. We assume that the random variable Y
takes the values Y = ±1 and that its conditional probability given x ∈ E is
P(Y = 1 | x) = 1 / (1 + exp(−f(x))) .    (6.5)
The negative log-likelihood is then
∑_{i=1}^{N} log[1 + exp{−y_i f(x_i)}] .
If we set f_X = [f_1, ..., f_N]^⊤ = [f(x_1), ..., f(x_N)]^⊤ ∈ R^{N}, v_i := e^{−y_i f_i}, and
l(f_X) := ∑_{i=1}^{N} log(1 + v_i), then we have
∂v_i/∂f_i = −y_i v_i ,  ∂l(f_X)/∂f_i = −y_i v_i/(1 + v_i) ,  ∂²l(f_X)/∂f_i² = v_i/(1 + v_i)² ,
where we use y_i² = 1. Given an initial value, we wait for the Newton-Raphson update
f_X ← f_X − (∇²l(f_X))^{-1} ∇l(f_X) to converge. The update formula is
f_X ← f_X + W^{-1} u ,
where u = (y_i v_i/(1 + v_i))_{i=1,...,N} and W = diag(v_i/(1 + v_i)²)_{i=1,...,N}. In other words,
for v := [v_1, ..., v_N]^⊤ ∈ R^{N}, we repeat the following two steps:
1. Obtain v, u, and W from f_X.
2. Calculate f_X + W^{-1} u and substitute it into f_X, as sketched in the code below.
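A minimal sketch of this iteration follows (the function name and the arrangement of the demo call are assumptions; y and f_X are NumPy arrays of length N with y_i ∈ {±1}):

import numpy as np

def newton_step(f, y):
    v = np.exp(-y * f)        # v_i = exp(-y_i f_i)
    u = y * v / (1 + v)       # u_i = y_i v_i / (1 + v_i)
    w = v / (1 + v)**2        # diagonal entries of W
    return f + u / w          # f_X + W^{-1} u (W is diagonal)

f = np.zeros(6)
y = np.array([1, -1, 1, 1, -1, -1])
f = newton_step(f, y)         # one update of f_X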
Next, we consider maximizing the likelihood
∏_{i=1}^{N} 1 / (1 + exp{−y_i f(x_i)})
multiplied by the prior distribution of f_X, i.e., finding the solution with the maximum posterior
probability. Here, the mean is often set to 0 as the prior probability of f in the
formulation of (6.5). Suppose first that the prior probability of f_X ∈ R^{N} is
(1/√((2π)^{N} det k_{XX})) exp{ −f_X^⊤ k_{XX}^{-1} f_X / 2 } .
If we define
L(f_X) := l(f_X) + (1/2) f_X^⊤ k_{XX}^{-1} f_X + (1/2) log det k_{XX} + (N/2) log 2π ,    (6.6)
then we have
∇L(f_X) = ∇l(f_X) + k_{XX}^{-1} f_X = −u + k_{XX}^{-1} f_X    (6.7)
and
∇²L(f_X) = ∇²l(f_X) + k_{XX}^{-1} = W + k_{XX}^{-1} .    (6.8)
Thus, the Newton-Raphson update becomes
f_X ← f_X + (W + k_{XX}^{-1})^{-1}(u − k_{XX}^{-1} f_X)
    = (W + k_{XX}^{-1})^{-1}{(W + k_{XX}^{-1}) f_X − k_{XX}^{-1} f_X + u}
    = (W + k_{XX}^{-1})^{-1}(W f_X + u) .
(W + k_{XX}^{-1})^{-1} = k_{XX} − k_{XX}(W^{-1} + k_{XX})^{-1} k_{XX}
                    = k_{XX} − k_{XX} W^{1/2}(I + W^{1/2} k_{XX} W^{1/2})^{-1} W^{1/2} k_{XX} .    (6.10)
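As a quick numerical check of (6.10), one may run a sketch like the following (the random positive definite k_XX and diagonal W are assumptions):

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
Kxx = A.dot(A.T) + np.identity(5)                 # positive definite k_XX
W = np.diag(rng.uniform(0.1, 1.0, 5))             # diagonal W
Wh = np.sqrt(W)                                   # W^{1/2}
lhs = np.linalg.inv(W + np.linalg.inv(Kxx))
rhs = Kxx - Kxx.dot(Wh).dot(
    np.linalg.inv(np.identity(5) + Wh.dot(Kxx).dot(Wh))).dot(Wh).dot(Kxx)
print(np.allclose(lhs, rhs))                      # True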
For γ := W f_X + u and the Cholesky decomposition L L^⊤ = I + W^{1/2} k_{XX} W^{1/2}, let α and β satisfy
L L^⊤ W^{-1/2} α = Lβ = W^{1/2} k_{XX} γ .
Then the updated value can be computed as
k_{XX}(γ − α)
 = k_{XX}{γ − W^{1/2}(L L^⊤)^{-1} W^{1/2} k_{XX} γ} = {k_{XX} − k_{XX} W^{1/2}(L L^⊤)^{-1} W^{1/2} k_{XX}} γ
 = {k_{XX} − k_{XX} W^{1/2}(I + W^{1/2} k_{XX} W^{1/2})^{-1} W^{1/2} k_{XX}} γ = (W + k_{XX}^{-1})^{-1}(W f_X + u) .
Example 86 Using the first N = 100 of the 150 Iris data points (the first 50 points
and the next 50 points are the Setosa and Versicolor data, respectively), we found the
f_X = [f_1, ..., f_N]^⊤ with the maximum posterior probability. The output showed that
f_1, ..., f_50 were positive and f_51, ..., f_100 were negative.
from sklearn.datasets import load_iris
df = load_iris()  ## Iris data
x = df.data[0:100, 0:4]
y = np.array([1] * 50 + [-1] * 50)
n = len(y)
## Compute the kernel values for the four covariates
def k(x, y):
    return np.exp(np.sum(-(x - y)**2) / 2)
K = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        K[i, j] = k(x[i, :], x[j, :])
eps = 0.00001
f = [0] * n
g = [0.1] * n
while np.sum((np.array(f) - np.array(g))**2) > eps:  # repeat the two-step update until f converges
    g = f
    v = np.exp(-y * f)            # v_i = exp(-y_i f_i)
    u = y * v / (1 + v)           # u_i = y_i v_i / (1 + v_i)
    w = (v / (1 + v)**2)
    W = np.diag(w)
    W_p = np.diag(w**0.5)
    W_m = np.diag(w**(-0.5))
    L = np.linalg.cholesky(np.identity(n) + np.dot(np.dot(W_p, K), W_p))
    gamma = W.dot(f) + u
    beta = np.linalg.solve(L, np.dot(np.dot(W_p, K), gamma))
    alpha = np.linalg.solve(np.dot(L.T, W_m), beta)
    f = np.dot(K, (gamma - alpha))
print(list(f))
Conditioning the Gaussian process on the value of f_X, we obtain
f(x) | f_X ∼ N(m′(x), k′(x, x)) ,    (6.11)
where
m′(x) = k_{xX} k_{XX}^{-1} f_X
and
k′(x, x) = k_{xx} − k_{xX} k_{XX}^{-1} k_{Xx} .
f_X | Y ∼ N(f̂, (Ŵ + k_{XX}^{-1})^{-1}) .    (6.12)
That is, the covariance matrix is the inverse of Ŵ + k_{XX}^{-1}, which is the Hessian
∇²L(f̂) of (6.8). Then, since the variations in (6.11) and (6.12) are independent,
f(x) | Y is Gaussian with mean and variance
m_* = k_{xX} k_{XX}^{-1} f̂    (6.13)
k_* = k_{xx} − k_{xX} k_{XX}^{-1} k_{Xx} + k_{xX} k_{XX}^{-1}(Ŵ + k_{XX}^{-1})^{-1} k_{XX}^{-1} k_{Xx}
    = k_{xx} − k_{xX}(Ŵ^{-1} + k_{XX})^{-1} k_{Xx} ,
Finally, we average the probability
P(Y = 1 | x) = 1 / (1 + exp(−f(x)))
over f(x) ∼ N(m_*, k_*), i.e., we compute
∫ [1/(1 + exp(−z))] · [1/√(2π k_*)] exp[ −(z − m_*)² / (2 k_*) ] dz .    (6.14)
To implement this step, we only need to compute û from f̂. Since (6.7) is zero
when the updates converge, we have û = k_{XX}^{-1} f̂, and thus, from (6.13),
m_* = k_{xX} û
and
k_* = k_{xx} − α^⊤ α
for α := L^{-1} Ŵ^{1/2} k_{Xx}, since k_{xX}(Ŵ^{-1} + k_{XX})^{-1} k_{Xx} = α^⊤ α .
We can describe the procedure for finding the value of (6.14) in Python as follows.
We assume that the procedure starts immediately after the procedure of Example 86
completes.
def pred(z):
    kk = np.zeros(n)
    for i in range(n):
        kk[i] = k(z, x[i, :])
    mu = np.sum(kk * u)                                  # mean m_*
    alpha = np.linalg.solve(L, np.dot(W_p, kk))
    sigma2 = k(z, z) - np.sum(alpha**2)                  # variance k_*
    m = 1000
    b = np.random.normal(mu, np.sqrt(sigma2), size=m)    # samples from N(m_*, k_*)
    pi = np.sum((1 + np.exp(-b))**(-1)) / m              # Monte Carlo estimate of (6.14)
    return pi

z = np.zeros(4)
for j in range(4):
    z[j] = np.mean(x[:50, j])
pred(z)
0.9466371383148896
for j in range(4):
    z[j] = np.mean(x[50:100, j])
pred(z)
0.05301765489687672
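The Monte Carlo average inside pred can also be replaced by a one-dimensional numerical integration of (6.14). A minimal sketch, assuming SciPy is available and that m_* and k_* are passed as mu_star and k_star:

from scipy.integrate import quad
import numpy as np

def pred_quad(mu_star, k_star):
    # integrate the sigmoid against the N(mu_star, k_star) density, cf. (6.14)
    integrand = lambda t: 1 / (1 + np.exp(-t)) \
        * np.exp(-(t - mu_star)**2 / (2 * k_star)) / np.sqrt(2 * np.pi * k_star)
    value, _ = quad(integrand, -np.inf, np.inf)
    return value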
In the inducing variable method, for inducing points z_1, ..., z_M ∈ E, we replace the generation process
f(x) | f_X ∼ N(m(x) + k_{xX} k_{XX}^{-1}(f_X − m_X), k(x, x) − k_{xX} k_{XX}^{-1} k_{Xx})
y | f(x) ∼ N(f(x), σ²)
by
f_Z ∼ N(m_Z, k_{ZZ})    (6.15)
f(x) | f_Z ∼ N(m(x) + k_{xZ} k_{ZZ}^{-1}(f_Z − m_Z), k(x, x) − k_{xZ} k_{ZZ}^{-1} k_{Zx})    (6.16)
and
y | f(x) ∼ N(f(x), σ²) .    (6.17)
The covariance matrix of f_Z | Y is
Σ_{f_Z|Y} = k_{ZZ} Q^{-1} k_{ZZ} ,    (6.19)
where
Q := k_{ZZ} + k_{ZX}(Λ + σ²I_N)^{-1} k_{XZ} ∈ R^{M×M} .    (6.20)
f_X | f_Z ∼ N(m_X + k_{XZ} k_{ZZ}^{-1}(f_Z − m_Z), Λ)
and
Y | f_Z ∼ N(m_X + k_{XZ} k_{ZZ}^{-1}(f_Z − m_Z), Λ + σ²I_N) .
Differentiating the exponent of the joint density of (f_Z, Y) with respect to f_Z yields
−k_{ZZ}^{-1}(f_Z − m_Z) + k_{ZZ}^{-1} k_{ZX}(Λ + σ²I_N)^{-1}
·{Y − (m_X + k_{XZ} k_{ZZ}^{-1}(f_Z − m_Z))} .    (6.21)
Writing a := f_Z − m_Z and b := Y − m_X, this equals
−k_{ZZ}^{-1} a + k_{ZZ}^{-1} k_{ZX}(Λ + σ²I_N)^{-1}(b − k_{XZ} k_{ZZ}^{-1} a)
 = k_{ZZ}^{-1} k_{ZX}(Λ + σ²I_N)^{-1} b − k_{ZZ}^{-1}{k_{ZZ} + k_{ZX}(Λ + σ²I_N)^{-1} k_{XZ}} k_{ZZ}^{-1} a
 = k_{ZZ}^{-1} k_{ZX}(Λ + σ²I_N)^{-1} b − k_{ZZ}^{-1} Q k_{ZZ}^{-1} a
 = k_{ZZ}^{-1} Q k_{ZZ}^{-1} {k_{ZZ} Q^{-1} k_{ZX}(Λ + σ²I_N)^{-1} b − a}
 = −Σ_{f_Z|Y}^{-1}(f_Z − μ_{f_Z|Y}) .    (6.22)
Hence, the exponent of the density of f_Z | Y is
−(1/2) (f_Z − μ_{f_Z|Y})^⊤ Σ_{f_Z|Y}^{-1}(f_Z − μ_{f_Z|Y}) ,    (6.23)
and we obtain the proposition.
Proposition 64 Under the generation process outlined in (6.15), (6.16), (6.17) and
Assumption 1, we have
Y ∼ N(μ_Y, Σ_Y)
with
μ_Y := m_X    (6.24)
and
Σ_Y := Λ + σ²I_N + k_{XZ} k_{ZZ}^{-1} k_{ZX} .    (6.25)
The exponent of the joint density of (f_Z, Y) is, up to additive constants,
−(1/2) a^⊤ k_{ZZ}^{-1} a − (1/2) (b − k_{XZ} k_{ZZ}^{-1} a)^⊤ (Λ + σ²I)^{-1}(b − k_{XZ} k_{ZZ}^{-1} a)    (6.26)
and that of the density of f_Z | Y is
−(1/2) (a − k_{ZZ} Q^{-1} k_{ZX}(Λ + σ²I_N)^{-1} b)^⊤ (k_{ZZ} Q^{-1} k_{ZZ})^{-1}(a − k_{ZZ} Q^{-1} k_{ZX}(Λ + σ²I_N)^{-1} b) .
    (6.27)
From (6.20), we have
−(1/2) a^⊤ k_{ZZ}^{-1} a − (1/2) (k_{XZ} k_{ZZ}^{-1} a)^⊤ (Λ + σ²I)^{-1} k_{XZ} k_{ZZ}^{-1} a = −(1/2) a^⊤ k_{ZZ}^{-1} Q k_{ZZ}^{-1} a .
From p(Y, f_Z) = p(f_Z | Y) p(Y), the difference between (6.26) and (6.27) is the exponent
of p(Y), which is
−(1/2) b^⊤ (Λ + σ²I)^{-1} b + (1/2) b^⊤ (Λ + σ²I_N)^{-1} k_{XZ} Q^{-1} k_{ZX}(Λ + σ²I_N)^{-1} b ,
where we may set a = 0 because no terms remain w.r.t. a. Furthermore, if we
set A = Λ + σ²I_N, U = k_{XZ}, V = k_{ZX}, and W = k_{ZZ}^{-1} in the Woodbury-Sherman-
Morrison formula (6.9), then we have
−(1/2) b^⊤ (Λ + σ²I_N + k_{XZ} k_{ZZ}^{-1} k_{ZX})^{-1} b
and obtain (6.25).
Proposition 65 Under the generation process outlined in (6.15), (6.16), (6.17) and
Assumption 1, for each x ∈ E, we have
μ(x) := m(x) + k_{xZ} k_{ZZ}^{-1}(μ_{f_Z|Y} − m_Z) = m(x) + k_{xZ} Q^{-1} k_{ZX}(Λ + σ²I_N)^{-1}(Y − m_X)
    (6.28)
and
σ²(x) := k(x, x) − k_{xZ}(k_{ZZ}^{-1} − Q^{-1}) k_{Zx} .
Proof: First, we note that Y → f_Z → f(x) forms a Markov chain in this order. In the
following, we consider the distribution of f(x) | Y instead of f(x) | f_Z, i.e., the distributions
of f(x) | f_Z and f_Z | Y. In (6.16), the term with a mean of k_{xZ} k_{ZZ}^{-1}(f_Z − m_Z)
becomes k_{xZ} k_{ZZ}^{-1}(μ_{f_Z|Y} − m_Z) when averaged over f_Z | Y. Thus, we obtain (6.28).
Moreover, if we take the variance of that term with respect to f_Z | Y, we obtain the
same value as the variance of k_{xZ} k_{ZZ}^{-1}(f_Z − μ_{f_Z|Y}), so we have
E[k_{xZ} k_{ZZ}^{-1}(f_Z − μ_{f_Z|Y})(f_Z − μ_{f_Z|Y})^⊤ k_{ZZ}^{-1} k_{Zx}] = k_{xZ} k_{ZZ}^{-1} Σ_{f_Z|Y} k_{ZZ}^{-1} k_{Zx} = k_{xZ} Q^{-1} k_{Zx} ,
    (6.29)
where f_Z varies with the given Y. Furthermore, from (6.16), since the variance
λ(x) = k(x, x) − k_{xZ} k_{ZZ}^{-1} k_{Zx} of f(x) | f_Z is independent of f_Z, we can write the
variance of f(x) | Y as the sum of the variance λ(x) of f(x) | f_Z and (6.29). In other
words, we have σ²(x) = λ(x) + k_{xZ} Q^{-1} k_{Zx}.
When the inducing variable method is employed, the calculations of
k_{ZZ} and k_{xZ} take O(M²) and O(M) time, respectively, the calculation of Λ takes O(N),
and the calculations of Q and Q^{-1} take O(NM²) and O(M³), respectively. The
remaining matrix multiplications are also completed in O(NM²) time. On the other hand, without
the inducing variable method, the computation takes O(N³) time. In the inducing
variable method, we do not use the matrix K_{XX} ∈ R^{N×N}.
We can randomly select the inducing points z 1 , · · · , z M from x1 , · · · , x N or via
K-means clustering.
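For instance, a K-means-based choice of the inducing points can be sketched as follows (the use of scikit-learn's KMeans here is an assumption; x denotes the one-dimensional inputs):

from sklearn.cluster import KMeans

M = 100
km = KMeans(n_clusters=M, n_init=10).fit(x.reshape(-1, 1))  # cluster the inputs
z = km.cluster_centers_.ravel()                             # use the centers as z_1, ..., z_M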
sigma_2 = 0.05  # should be estimated
def k(x, y):  # covariance function
    return np.exp(-(x - y)**2 / 2 / sigma_2)
def mu(x):  # mean function
    return x
# Data generation
n = 200
x = np.random.uniform(size=n) * 6 - 3
y = np.sin(x / 2) + np.random.randn(n)
eps = 10**(-6)
m = 100
K = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        K[i, j] = k(x[i], x[j])
index = np.random.choice(n, size=m, replace=False)
z = x[index]
m_x = 0
m_z = 0
K_zz = np.zeros((m, m))
for i in range(m):
    for j in range(m):
        K_zz[i, j] = k(z[i], z[j])
K_xz = np.zeros((n, m))
for i in range(n):
    for j in range(m):
        K_xz[i, j] = k(x[i], z[j])
K_zz_inv = np.linalg.inv(K_zz + np.diag([eps] * m))
lam = np.zeros(n)
for i in range(n):
    lam[i] = k(x[i], x[i]) - np.dot(np.dot(K_xz[i, 0:m], K_zz_inv), K_xz[i, 0:m])
lam_0_inv = np.diag(1 / (lam + sigma_2))
Q = K_zz + np.dot(np.dot(K_xz.T, lam_0_inv), K_xz)  ## computation of Q does not require O(n^3)
Q_inv = np.linalg.inv(Q + np.diag([eps] * m))
muu = np.dot(np.dot(np.dot(Q_inv, K_xz.T), lam_0_inv), y - m_x)
dif = K_zz_inv - Q_inv
R = np.linalg.inv(K + sigma_2 * np.identity(n))  ## O(n^3) computation is required

def gp_1(x_pred):  ## W/O the inducing variable method
    h = np.zeros(n)
    for i in range(n):
        h[i] = k(x_pred, x[i])
    mm = mu(x_pred) + np.dot(np.dot(h.T, R), y - mu(x))
    ss = k(x_pred, x_pred) - np.dot(np.dot(h.T, R), h)
    return {"mm": mm, "ss": ss}
x_seq = np.arange(-2, 2.1, 0.1)
mmv = []; ssv = []
for u in x_seq:
    mmv.append(gp_ind(u)["mm"])
    ssv.append(gp_ind(u)["ss"])
plt.figure()
plt.plot(x_seq, mmv, c="r")
plt.plot(x_seq, np.array(mmv) + 3 * np.sqrt(np.array(ssv)),
         c="r", linestyle="--")
plt.plot(x_seq, np.array(mmv) - 3 * np.sqrt(np.array(ssv)),
         c="r", linestyle="--")
plt.xlim(-2, 2)
plt.plot(np.min(mmv), np.max(mmv))

x_seq = np.arange(-2, 2.1, 0.1)
mmv = []; ssv = []
for u in x_seq:
    mmv.append(gp_1(u)["mm"])
    ssv.append(gp_1(u)["ss"])
mmv = np.array(mmv)
ssv = np.array(ssv)
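The results of gp_1 can then be overlaid on the same figure, for instance as follows (a sketch; the color is an arbitrary choice):

plt.plot(x_seq, mmv, c="b")
plt.plot(x_seq, mmv + 3 * np.sqrt(np.maximum(ssv, 0)), c="b", linestyle="--")
plt.plot(x_seq, mmv - 3 * np.sqrt(np.maximum(ssv, 0)), c="b", linestyle="--")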
In this section, we continue to study the probability space (Ω, F, P) and the map f :
Ω × E ∋ (ω, x) ↦ f(ω, x) ∈ H. We assume that H is a general separable Hilbert
space. In the following, we continue to denote f(ω, x) by f(x) as a random variable
for each x ∈ E. In particular, we assume that f is a mean-square continuous process
(Fig. 6.2), with mean and covariance functions
m(x) = E f(ω, x)
and
k(x, y) = Cov(f(ω, x), f(ω, y)) .
In Chap. 5, we obtained the expectation and covariance of k(X, ·); in this section,
however, x, y ∈ E are not random, and the randomness of m and k is due to that of
f(ω, ·).
In the following, we assume that E is compact.
Fig. 6.2 The curves show the results obtained by the inducing variable and standard Gaussian
processes, respectively
For a region decomposition {E_i}_{1≤i≤M(n)} of E, points x_i ∈ E_i, and g ∈ L²(E, B(E), μ), we define
I_f(g; {(E_i, x_i)}_{1≤i≤M(n)}) := ∑_{i=1}^{M(n)} f(ω, x_i) ∫_{E_i} g(y) dμ(y) .
Hence, we have
∫ {I_f(g; {(E_i, x_i)}_{1≤i≤M(n)})}² dP(ω) ≤ M(n) ∑_{i=1}^{M(n)} ∫ {f(ω, x_i)}² dP · {∫_{E_i} g(u) dμ(u)}²
 = M(n) ∑_{i=1}^{M(n)} k(x_i, x_i) {∫_{E_i} g(u) dμ(u)}² < ∞
and I_f(g; {(E_i, x_i)}_{1≤i≤M(n)}) ∈ L²(Ω, F, P). Although this value differs
depending on the choice of the region decomposition and the points inside the
regions, the difference in I_f converges to zero as n goes to infinity. In fact, we have
E|I_f(g; {(E_i, x_i)}_{1≤i≤M(n)}) − I_f(g; {(E′_j, x′_j)}_{1≤j≤M(n′)})|²
 = ∑_{i=1}^{M(n)} ∑_{i′=1}^{M(n)} k(x_i, x_{i′}) ∫_{E_i} g(u) dμ(u) ∫_{E_{i′}} g(v) dμ(v)
 + ∑_{j=1}^{M(n′)} ∑_{j′=1}^{M(n′)} k(x′_j, x′_{j′}) ∫_{E′_j} g(u) dμ(u) ∫_{E′_{j′}} g(v) dμ(v)
 − 2 ∑_{i=1}^{M(n)} ∑_{j=1}^{M(n′)} k(x_i, x′_j) ∫_{E_i} g(u) dμ(u) ∫_{E′_j} g(v) dμ(v) .
Since k is uniformly continuous, each double sum on the right-hand side converges to
∫_E ∫_E k(u, v) g(u) g(v) dμ(u) dμ(v) .
Since this difference converges to zero, the sequence is a Cauchy sequence, and its limit I_f(ω, g)
is contained in L²(Ω, F, P) regardless of the choice of {(E_i, x_i)}_{1≤i≤M(n)}.
If the eigenvalues and eigenfunctions obtained from the integral operator T_k ∈
B(L²(E, B(E), μ)),
T_k g(·) = ∫_E k(y, ·) g(y) dμ(y) ,  g ∈ L²(E, B(E), μ),
are {λ_j}_{j=1}^{∞} and {e_j(·)}_{j=1}^{∞}, then by Mercer's theorem, we can express the covariance
function k as
k(x, y) = ∑_{j=1}^{∞} λ_j e_j(x) e_j(y) .    (6.31)
Proof: For the proofs of the above three items, see the Appendix at the end of this
chapter. We obtain (6.32) by substituting Mercer’s theorem (6.31), g = ei , and h = e j
into the third item:
E[I_f(ω, e_i) I_f(ω, e_j)] = ∫_E ∫_E ∑_{r=1}^{∞} λ_r e_r(x) e_r(y) e_i(x) e_j(y) dμ(x) dμ(y) = λ_i δ_{i,j} ,
where the last equality uses the orthonormality of {e_j}.
Furthermore, we have the following theorem.
Proposition 68 (Karhunen-Loève [17, 18]) Suppose that {f(ω, x)}_{x∈E} is a mean-
square continuous process with a mean of zero. Then, for each x ∈ E, we have
E|f_n(ω, x) − f(ω, x)|² → 0 (n → ∞)
for f_n(ω, x) := ∑_{j=1}^{n} I_f(ω, e_j) e_j(x).
Proof: From (6.32), we have
E[f_n(ω, x)²] = E[{∑_{j=1}^{n} I_f(ω, e_j) e_j(x)}²] = ∑_{i=1}^{n} ∑_{j=1}^{n} E[I_f(ω, e_i) I_f(ω, e_j)] e_i(x) e_j(x) = ∑_{j=1}^{n} λ_j e_j²(x) .
Moreover, from (6.31) and the second item of Proposition 67, we have
E[f_n(ω, x) f(ω, x)] = E[∑_{j=1}^{n} I_f(ω, e_j) e_j(x) f(ω, x)] = ∑_{j=1}^{n} e_j(x) ∫_E k(x, y) e_j(y) dμ(y)
 = ∑_{j=1}^{n} λ_j e_j²(x) ∫_E e_j²(y) dμ(y) = ∑_{j=1}^{n} λ_j e_j²(x) ,
so that
E|f_n(ω, x) − f(ω, x)|² = E[f_n(ω, x)²] − 2 E[f_n(ω, x) f(ω, x)] + E[f(ω, x)²]
 = ∑_{j=1}^{n} λ_j e_j²(x) − 2 ∑_{j=1}^{n} λ_j e_j²(x) + k(x, x) = k(x, x) − ∑_{j=1}^{n} λ_j e_j²(x) ,
which converges to 0 as n → ∞ by Mercer's theorem (6.31).
For a general mean-square continuous process (without assuming a Gaussian process),
the series expansion provided by the Karhunen-Loève theorem makes I_f(ω, e_j)/√λ_j
a random variable with a mean of 0 and a variance of 1. Instead, if we assume
a Gaussian process such that f(x) (x ∈ E) follows a Gaussian distribution, then we
can write
f_n(x) = ∑_{j=1}^{n} z_j √λ_j e_j(x) ,    (6.33)
E[f(ω, x) f(ω, y)] = E[f(ω, x)²] + E[f(ω, x){f(ω, y) − f(ω, x)}] = x ,
which implies the second condition of Proposition 69. On the contrary, supposing
that m ≡ 0 for simplicity, if we assume that the first two items of Proposition 69
hold, then because k(x, x) = x, when x ≤ y ≤ z, we have
E[f(ω, y){f(ω, y) − f(ω, z)}] = k(y, y) − k(y, z) = y − y = 0 ,
Example 89 (Brownian Motion as a Gaussian Process) For the integral operator
(Example 58) on the covariance function of a Brownian motion k(x, y) = min(x, y)
(x, y ∈ E), its eigenvalues and eigenfunctions are (3.13) and (3.14), respectively.
f_n(x) = ∑_{j=1}^{n} z_j(ω) √λ_j e_j(x)
def lam(j):  ## eigenvalue
    return 4 / ((2 * j - 1) * np.pi)**2
def ee(j, x):  ## definition of the eigenfunction
    return np.sqrt(2) * np.sin((2 * j - 1) * np.pi / 2 * x)
n = 10; m = 7
## Definition of the Gaussian process
def f(z, x):
    n = len(z)
    S = 0
    for i in range(n):
        S = S + z[i] * ee(i, x) * np.sqrt(lam(i))
    return S
plt.figure()
plt.xlim(0, 1)
plt.xlabel("x")
plt.ylabel("f(omega,x)")
colormap = plt.cm.gist_ncar  # nipy_spectral, Set1, Paired
colors = [colormap(i) for i in np.linspace(0, 0.8, m)]
for j in range(m):
    z = np.random.randn(n)
    x_seq = np.arange(0, 3.001, 0.001)
    y_seq = []
    for x in x_seq:
        y_seq.append(f(z, x))
    plt.plot(x_seq, y_seq, c=colors[j])
plt.title("Brownian Motion")
Here, ν, l > 0 are the parameters of the kernel, and K_ν is the modified Bessel function of the
second kind,
K_ν(x) := π (I_{−ν}(x) − I_ν(x)) / (2 sin(νπ)) ,
Fig. 6.3 We generated the sample paths of Brownian motions seven times. Each run involved a
sum of up to 10 terms
where
I_α(x) := ∑_{m=0}^{∞} 1/(m! Γ(m + α + 1)) (x/2)^{2m+α} .
For example, we can express the cases ν = 5/2, 3/2, 1/2 as follows. In particular, we call the
stochastic process with ν = 1/2 the Ornstein-Uhlenbeck process.
ϕ_{5/2}(z) = (1 + √5 z / l + 5z² / (3l²)) exp(−√5 z / l)
ϕ_{3/2}(z) = (1 + √3 z / l) exp(−√3 z / l)
ϕ_{1/2}(z) = exp(−z / l)
For example, if we write this process in Python, we have the following code.
Fig. 6.4 The values of the Matérn kernel for ν = 1/2, 3/2, . . . , m + 1/2 (Example 90). l = 0.1
(left) and l = 0.02 (right)
from math import gamma  # the gamma function used below (an assumed import)

def matern(nu, l, r):
    p = nu - 1 / 2
    S = 0
    for i in range(int(p + 1)):
        S = S + gamma(p + i + 1) / gamma(i + 1) / gamma(p - i + 1) \
            * (np.sqrt(8 * nu) * r / l)**(p - i)
    S = S * gamma(p + 2) / gamma(2 * p + 1) * np.exp(-np.sqrt(2 * nu) * r / l)
    return S
Example 90 We present the Matérn kernel values for l = 0.1, 0.02 with ν =
1/2, 3/2, . . . , m + 1/2 (Fig. 6.4).
m = 10
l = 0.1
colormap = plt.cm.gist_ncar  # nipy_spectral, Set1, Paired
color = [colormap(i) for i in np.linspace(0, 1, len(range(m)))]
x = np.linspace(0, 0.5, 200)
plt.plot(x, matern(1 - 1/2, l, x), c=color[0], label=r"$\nu=%d$" % 1)
plt.ylim(0, 10)
for i in range(2, m + 1):
    plt.plot(x, matern(i - 1/2, l, x), c=color[i - 1], label=r"$\nu=%d$" % i)
For the Matérn kernel, and in general, we cannot obtain the eigenvalues and eigenfunctions
analytically as we can for Gaussian kernels and Brownian motion. Even in such cases, if we
assume a Gaussian process, we can choose x_1, ..., x_n ∈ E and compute the Gram matrix,
which serves as a covariance matrix.
Fig. 6.5 The Ornstein-Uhlenbeck process (ν = 1/2, top) and the Matérn process (ν = 3/2, bottom) for
l = 0.1
colormap = plt.cm.gist_ncar  # nipy_spectral, Set1, Paired
colors = [colormap(i) for i in np.linspace(0, 0.8, 5)]
def rand_100(Sigma):
    L = np.linalg.cholesky(Sigma)  ## Cholesky decomposition of the covariance matrix
    u = np.random.randn(100)
    y = L.dot(u)  ## generate zero-mean random numbers with the covariance matrix
    return y
x = np.linspace(0, 1, 100)
z = np.abs(np.subtract.outer(x, x))  # distance matrix, d_{ij} = |x_i - x_j|
l = 0.1
Sigma_OU = np.exp(-z / l)  ## OU: matern(1/2, l, z) is slow
y = rand_100(Sigma_OU)
plt.figure()
plt.plot(x, y)
plt.ylim(-3, 3)
for i in range(5):
    y = rand_100(Sigma_OU)
    plt.plot(x, y, c=colors[i])
plt.title("OU process (nu=1/2, l=0.1)")

Sigma_M = matern(3/2, l, z)  ## Matern
y = rand_100(Sigma_M)
plt.figure()
plt.plot(x, y)
plt.ylim(-3, 3)
for i in range(5):
    y = rand_100(Sigma_M)
    plt.plot(x, y, c=colors[i])
plt.title("Matern process (nu=3/2, l=0.1)")
Let (Ω, F, P) and H be a probability space and a separable Hilbert space, respectively.
Let F : Ω → H be a measurable map, i.e., F^{-1}({h ∈ H | ‖g − h‖ < r}) is an element
of F for each open ball (g ∈ H, r ∈ (0, ∞)) in H. We call such an F : Ω → H
a random element of H. Intuitively, a random element is a random variable that
takes values in H. Thus far, we have assumed that f : Ω × E → R is measurable
at each x ∈ E (a stochastic process). This section addresses situations in which we do
not assume such measurability. For simplicity, we write F(ω) as F, similar to the
elements of H.
Although we do not go into details in this book, it is known that the following
relationship holds between stochastic processes and random elements. It is only
necessary to understand the close relationship between the two.
Proposition 70 (Hsing-Eubank [14])
1. If f : Ω × E → R is measurable w.r.t. Ω × E and f(ω, ·) ∈ H for each ω ∈ Ω, then
f(ω, ·) is a random element of H.
2. If f(·, x) : Ω → R is measurable for each x ∈ E and f(ω, ·) is continuous for each
ω ∈ Ω, then f(ω, ·) is a random element.
3. If f : Ω × E → R is a (zero-mean) mean-square continuous process and its
covariance function is k, then a random element of H exists such that the covariance
operator is H ∋ g ↦ ∫_E k(·, y) g(y) dμ(y) ∈ H.
from Proposition 22. We write this formally as m = E[F], which is the definition of
the mean of a random element F.
Proposition 71 If E‖F‖² < ∞, then
holds.
Proof: If we substitute g = m into (6.36), we obtain
Since E‖F‖² < ∞ implies that E‖F‖ < ∞, we proceed with our discussion by
assuming the former.
Regarding covariance, if H = R p , then the covariance matrix is
is linear for each of g and h. Moreover, if E‖F‖² < ∞, then it is bounded, since
E[⟨F − m, g⟩⟨F − m, h⟩] ≤ E‖F − m‖² · ‖g‖ ‖h‖ ≤ E‖F‖² · ‖g‖ ‖h‖ .
If we define u ⊗ v ∈ B(H ) by
From
we have
In the following, for simplicity, we proceed with our discussion by assuming that
m = 0.
Proposition 73 If m = 0 and E‖F‖² < ∞, then
1. The covariance operator K is nonnegative definite and is a trace class operator
whose trace is
‖K‖_{TR} = E‖F‖² .
⟨g, K h⟩ = ⟨K g, h⟩ = 0 ,  h ∈ H .
Additionally, from Propositions 27 and 31 and the first item of Proposition 73,
the following holds.
Proposition 74 The eigenfunctions {e_j} of the covariance operator K form an
orthonormal basis of Im(K); the corresponding eigenvalues {λ_j}_{j=1}^{∞} are nonnegative,
monotonically decreasing, and converge to 0. Furthermore, the multiplicity of each of
the nonzero eigenvalues is finite.
Additionally, from Propositions 73 and 74, the following holds.
Proposition 75 If {f_j} is an orthonormal basis of H, then we have
E‖F − ∑_{j=1}^{n} ⟨F, f_j⟩ f_j‖² = E‖F‖² − ∑_{j=1}^{n} ⟨K f_j, f_j⟩ ,    (6.38)
since
E‖∑_{j=1}^{n} ⟨F, f_j⟩ f_j‖² = E[⟨F, ∑_{j=1}^{n} ⟨F, f_j⟩ f_j⟩] = ∑_{j=1}^{n} E⟨F, f_j⟩² = ∑_{j=1}^{n} ⟨K f_j, f_j⟩ .
K_N = (1/N) ∑_{i=1}^{N} (F_i − m_N) ⊗ (F_i − m_N) ,    (6.40)
3. Find C = (c_{i,j})_{i=1,...,N, j=1,...,m} ∈ R^{N×m} such that F_i(x) = ∑_{j=1}^{m} c_{i,j} η_j(x).
4. Find the coefficients d_1, ..., d_m of the estimated mean function m_N(x) :=
(1/N) ∑_{i=1}^{N} F_i(x), i.e., m_N(x) = ∑_{j=1}^{m} d_j η_j(x).
5. Since the variance function is
k(x, y) = (1/N) ∑_{i=1}^{N} {F_i(x) − m_N(x)}{F_i(y) − m_N(y)} = (1/N) η(x)^⊤ (C − d)^⊤ (C − d) η(y) ,
the eigenvalue problem for an eigenfunction of the form η(·)^⊤ b becomes
(1/N) η(x)^⊤ (C − d)^⊤ (C − d) ∫ η(y) η(y)^⊤ dμ(y) b = λ η(x)^⊤ b ,
which is equivalent to
(1/N) (C − d)^⊤ (C − d) W b = λ b ,
where W := ∫ η(y) η(y)^⊤ dμ(y).
we have
∫_{−π}^{π} η_i(x) η_j(x) dx = δ_{i,j} ,
so that W is the identity matrix.
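Because this basis is orthonormal (W = I), the problem reduces to an ordinary symmetric eigenvalue problem. A minimal sketch (the function name is an assumption; C and d are the coefficient matrix and the mean coefficients defined above):

import numpy as np

def functional_pca(C, d):
    # eigendecomposition of (1/N) (C - d)^T (C - d) for an orthonormal basis (W = I)
    N = C.shape[0]
    A = (C - d).T.dot(C - d) / N
    lam, B = np.linalg.eigh(A)          # eigenvalues in ascending order
    idx = np.argsort(lam)[::-1]         # reorder so that eigenvalues decrease
    return lam[idx], B[:, idx]          # columns of B: coefficient vectors b of the eigenfunctions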
X, y = skfda.datasets.fetch_weather(return_X_y=True, as_frame=True)
df = X.iloc[:, 0].values
def g(j, x):  ## basis consisting of p elements
    if j == 0:
        return 1 / np.sqrt(2 * np.pi)
    if j % 2 == 0:
        return np.cos((j // 2) * x) / np.sqrt(np.pi)
    else:
        return np.sin((j // 2) * x) / np.sqrt(np.pi)
def beta(x, y):  ## coefficients in front of the p elements
    X = np.zeros((N, p))
    for i in range(N):
        for j in range(p):
            X[i, j] = g(j, x[i])
    beta = np.dot(np.dot(np.linalg.inv(np.dot(X.T, X)
                                       + 0.0001 * np.identity(p)), X.T), y)
    return np.squeeze(beta)
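One way to obtain the quantities pca, B, and xx that appear below is to run a principal component analysis on the coefficient matrix C (a sketch; it assumes that C has already been computed by applying beta to each of the N curves, and the use of sklearn's PCA here is an assumption):

pca = PCA()
pca.fit(C)                       # PCA of the basis coefficients
B = pca.components_.T            # B[:, k]: coefficients of the k-th principal component function
xx = pca.transform(C)            # principal component scores (used in the scatter plot below)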
Fig. 6.6 The output of approximating Toronto's annual temperature by using m =
2, 3, 4, 5, 6 principal components. As m increases, the original data are recovered more faithfully
in front of η_j(x) for j = 1, ..., p). We change m and the function z and run the
following program to see whether we can recover the original function.
def z(i, m, x):  ## the approximated function using m components rather than p
    S = 0
    for j in range(p):
        for k in range(m):
            for r in range(p):
                S = S + C[i, j] * B[j, k] * B[r, k] * g(r, x)
    return S
x_seq = np.arange(-np.pi, np.pi, 2 * np.pi / 100)
plt.figure()
plt.xlim(-np.pi, np.pi)
# plt.ylim(-15, 25)
plt.xlabel("Days")
plt.ylabel("Temp (C)")
plt.title("Reconstruction for each m")
plt.plot(x, df[13], label="Original")
for m in range(2, 7):
    plt.plot(x_seq, z(13, m, x_seq), c=color[m], label="m=%d" % m)
plt.legend(loc="lower center", ncol=2)
Figure 6.6 shows the output of approximating the annual temperature in Toronto
by using m = 2, 3, 4, 5, 6 principal components.
Next, we list the principal components in order of decreasing eigenvalue and draw
a graph of their contribution ratios (Fig. 6.7).
lam = pca.explained_variance_
ratio = lam / sum(lam)  # or use pca.explained_variance_ratio_
plt.plot(range(1, 6), ratio[:5])
plt.xlabel("PC1 through PC5")
plt.ylabel("Ratio")
plt.title("Ratio")
def h(coef, x):  ## define a function using the coefficients
    S = 0
    for j in range(p):
        S = S + coef[j] * g(j, x)
    return S
print(B)
plt.figure()
plt.xlim(-np.pi, np.pi)
plt.ylim(-1, 1)
for j in range(3):
    plt.plot(x_seq, h(B[:, j], x_seq), c=colors[j], label="PC%d" % (j + 1))
plt.legend(loc="best")
Fig. 6.7 The contribution ratios of the first through the fifth principal components
Fig. 6.8 The first, second, and third principal component functions for temperature in the Canadian
weather data. Some of the principal component functions are multiplied by −1, which means that
they are upside down compared to those of other packages. Additionally, because the horizontal
axis is normalized to [−π, π], the value of each eigenfunction is multiplied by √(365/(2π))
place = X.iloc[:, 1]
index = [9, 11, 12, 13, 16, 23, 25, 26]
others = [x for x in range(34) if x not in index]
first = [place[i][0] for i in index]
print(first)
plt.figure()
plt.xlim(-15, 25)
plt.ylim(-25, -5)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.title("Canadian Weather")
plt.scatter(xx[others, 0], xx[others, 1], marker="x", c="k")
for i in range(8):
    l = plt.text(xx[index[i], 0], xx[index[i], 1],
                 s=first[i], c=color[i])
[ ’Q’ , ’M’ , ’O’ , ’T’ , ’W’ , ’C’ , ’V’ , ’V’ ]
The first and second principal component scores for the Canadian weather data, with warmer
regions such as Vancouver and Victoria appearing furthest to the left in the first principal
component (Q: Quebec, M: Montreal, O: Ottawa, T: Toronto, W: Winnipeg, C: Calgary,
V: Vancouver, V: Victoria)
Appendix
Proof of Proposition 66
Proof: Since the expectation and variance of f(ω, x) − m(x) are 0 and k(x, x),
respectively, and the covariance between f(ω, x) − m(x) and f(ω, y) − m(y) is
k(x, y), the claims follow from
|m(x) − m(y)| = |E[f(ω, x) − f(ω, y)]| ≤ {E[|f(ω, x) − f(ω, y)|²]}^{1/2} ,
|k(x, y) − k(x′, y)| = |E[f(ω, x) f(ω, y)] − E[f(ω, x′) f(ω, y)]|
 ≤ E[f(ω, y)²]^{1/2} E[{f(ω, x) − f(ω, x′)}²]^{1/2} = {k(y, y)}^{1/2} {E[|f(ω, x) − f(ω, x′)|²]}^{1/2} ,
and
|k(x′, y) − k(x′, y′)| ≤ {k(x′, x′)}^{1/2} {E[|f(ω, y) − f(ω, y′)|²]}^{1/2} .
Proof of Proposition 67
We define I_f^{(n)}(g) := I_f(g; {(E_i, x_i)}_{1≤i≤M(n)}). Then, we have E[I_f^{(n)}(g)] = 0. From
E[f(ω, x)] = 0, x ∈ E, and the convergence proven thus far, we obtain the first
claim as n → ∞. From
E[I_f^{(n)}(g) I_f^{(n)}(h)] = ∑_{i=1}^{M(n)} ∑_{j=1}^{M(n)} k(x_i, x_j) ∫_{E_i} g(x) dμ(x) ∫_{E_j} h(y) dμ(y)
 → ∫_E ∫_E k(x, y) g(x) h(y) dμ(x) dμ(y)
and
∑_i ∑_j ∫_{E_i} ∫_{E_j} |k(x, y) − k(x_i, x_j)| g(x) h(y) dμ(x) dμ(y) → 0 .
Exercises 83∼100
84. Construct a function gp_sample that generates random numbers f (ω, x1 ), . . . ,
f (ω, x N ) from the mean function m, the covariance function k, and x1 , . . . ,
x N ∈ E for a set E. Then, set m, k to generate 100 random numbers and examine
if the covariance matrix matches the m, k.
85. Using Proposition 61, prove (6.3) and (6.4).
86. In the following program, other than the Cholesky decomposition, is there any
step that requires a calculation with O(N 3 ) complexity?
def gp_2(x_pred):
    h = np.zeros(n)
    for i in range(n):
        h[i] = k(x_pred, x[i])
    L = np.linalg.cholesky(K + sigma_2 * np.identity(n))
    alpha = np.linalg.solve(L, np.linalg.solve(L.T, (y - mu(x))))
    mm = mu(x_pred) + np.sum(np.dot(h.T, alpha))
    gamma = np.linalg.solve(L.T, h)
    ss = k(x_pred, x_pred) - np.sum(gamma**2)
    return {"mm": mm, "ss": ss}
88. Explain that Lines 19 through 24 of the program in Example 86 are used to
update f_X ← (W + k_{XX}^{-1})^{-1}(W f_X + u).
89. Replace the first 100 Iris data points (50 Setosa, 50 Versicolor) with the 51st to 150th
data points (50 Versicolor, 50 Virginica) in Example 86 and execute the program.
90. In the proof of Proposition 65, why is it acceptable to replace f_Z in (6.16) of
the generation process by μ_{f_Z|Y} to obtain μ(x)? In σ²(x), the variations due to f_Z | Y
and f(x) | f_Z are treated as independent. Why can we assume that they are independent?
91. In Example 88, there is a step in which the function gp_ind that realizes the
inducing variable method avoids processing O(N 3 ) calculations. Where is this
step?
92. Show that a stochastic process is a mean-square continuous process if and only
if its mean and covariance functions are continuous.
93. From Mercer's theorem (6.31) and Proposition 67, derive the Karhunen-Loève
theorem. Additionally, for n = 10, generate five sample paths of Brownian
motion.
94. From the formula for the Matérn kernel (6.35), derive ϕ5/2 and ϕ3/2 . Addition-
ally, illustrate the value of the Matérn kernel (ν = 1, . . . , 10) for l = 0.05, as
in Fig. 6.4.
95. Illustrate the sample path of the Matérn kernel with ν = 5/2, l = 0.1.
96. Give an example of a random element that does not involve a stochastic process
and an example of a stochastic process that does not involve a random element.
Bibliography
1. N. Aronszajn, Theory of reproducing kernels. Trans. Am. Math. Soc. 68, 337–404 (1950)
2. H. Avron, M. Kapralov, C. Musco, A. Velingker, A. Zandieh, Random Fourier features for
kernel ridge regression: approximation bounds and statistical guarantees. arXiv:1804.09893
(2017)
3. C. Baker, The Numerical Treatment of Integral Equations (Clarendon Press, 1978)
4. P. Bartlett, S. Mendelson, Rademacher and Gaussian complexities: risk bounds and structural
results. J. Mach. Learn. Res. (2001)
5. K.P. Chwialkowski, A. Gretton, A kernel independence test for random processes, in ICML
(2014)
6. R. Dudley, Real Analysis and Probability (Cambridge Studies in Advanced Mathematics, 1989)
7. K. Fukumizu, Introduction to Kernel Methods (kaneru hou nyuumon) (Asakura, 2010). (In
Japanese)
8. T. Gneiting, Compactly supported correlation functions. J. Multivar. Anal. 83, 493–508 (2002)
9. G.H. Golub, C.F. Van Loan, Matrix Computations, 3rd edn. (Johns Hopkins, Baltimore, 1996)
10. I.S. Gradshteyn, I.M. Ryzhik, R.H. Romer, Tables of integrals, series, and products. Am. J.
Phys. 56, 958–958 (1988)
11. A. Gretton, K. Borgwardt, M. Rasch, B. Schölkopf, A. Smola, A kernel two-sample test. J.
Mach. Learn. Res. 13, 723–773 (2012)
12. A. Gretton, R. Herbrich, A. Smola, O. Bousquet, B. Schölkopf, Kernel methods for measuring
independence. J. Mach. Learn. Res. 6, 2075–2129 (2005)
13. D. Haussler, Convolution kernels on discrete structures. Technical Report UCSC-CRL-99-10,
UCSC (1999)
14. T. Hsing, R. Eubank, Theoretical Foundations of Functional Data Analysis, with an Introduction
to Linear Operators (Wiley, 2015)
15. K. Itô, An Introduction to Probability Theory (Cambridge University Press, 1984)
16. Y. Kano, S. Shimizu, Causal inference using nonnormality, in Proceedings of the Annual
Meeting of the Behaviormetric Society of Japan 47 (2004)
17. K. Karhunen, Über lineare Methoden in der Wahrscheinlichkeitsrechnung. Ann. Acad. Sci.
Fennicae. Ser. A. I. Math.-Phys. 37, 1–79 (1947)
18. M. Loève, Probability Theory, vol. II (Springer-Verlag, 1978)
19. H. Kashima, K. Tsuda, A. Inokuchi, Marginalized kernels between labeled graphs, in ICML
(2003)