Kernel Methods for Machine Learning with Math and Python
100 Exercises for Building Logic
Joe Suzuki
Graduate School of Engineering Science
Osaka University
Toyonaka, Osaka, Japan

ISBN 978-981-19-0400-4 ISBN 978-981-19-0401-1 (eBook)

https://doi.org/10.1007/978-981-19-0401-1

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature
Singapore Pte Ltd. 2022
This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether
the whole or part of the material is concerned, specifically the rights of reprinting, reuse of illustrations,
recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or
information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar
methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, expressed or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd.
The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721,
Singapore
Preface

How to Overcome Your Kernel Weakness

Among machine learning methods, kernels have always been a particular weakness
of mine. I tried to read “Introduction to Kernel Methods” by Kenji Fukumizu (in
Japanese) but failed many times. I invited Prof. Fukumizu to give an intensive lecture
at Osaka University and listened to the course for a week with the students, but I
could not understand the book’s essence. When I first started writing this book, my
goal was to rid myself of this sense of weakness. However, now that this book is completed, I
can tell readers how they can overcome their own kernel weaknesses.
Most people, even machine learning researchers, use kernels without understanding
them. If you have opened this page, I believe you have a positive feeling that you want
to overcome your weakness.
The shortest path I would most recommend for achieving this is to learn mathematics
by starting from the basics. Kernels work according to the mathematics behind
them. It is essential to think through this concept until you understand it. The mathematics
needed to understand kernels is called functional analysis (Chap. 2). Even if
you know linear algebra or differential and integral calculus, you may be confused.
Vectors are finite dimensional, whereas a set of functions is infinite dimensional, yet it can
still be treated with the tools of linear algebra. If the concept of completeness is new to you, I hope you
will take the time to learn about it. However, if you get through this second chapter,
I think you will understand everything about kernels.
This book is the third volume (of six) in the 100 Exercises for Building Logic set.
Since this is a book, there must be a reason for publishing it (the so-called cause)
when existing books on kernels can be found. The following are some of the features
of this book.
1. The mathematical propositions of kernels are proven, and the correct conclusions
are stated so that the reader can reach the essence of kernels.
2. As in the other books in the 100 Mathematical Problems in Machine Learning
series, source programs and running examples are presented to promote understanding.
It is not easy for readers to understand the results if only mathematical
formulas are given, and this is especially true for kernels.
3. Once the reader understands the basic topics of functional analysis (Chap. 2), the
applications in the subsequent chapters are discussed, and no prior knowledge
of mathematics is assumed.
4. This book considers both the kernel of the RKHS and the kernel of the Gaussian
process. A clear distinction is made between the two treatments. In this book,
the two types of kernels are discussed in Chaps. 5 and 6, respectively.
We surveyed books on kernels both in Japan and overseas but found that none satisfied
two or more of the above characteristics.
I have experienced many failures leading up to the publication of this book. Every
year, I give a lecture (at the graduate school of Osaka University). Each area of
machine learning is studied by solving 100 mathematical and programming exercises.
Sparse estimation (2018) and graphical models (2019) have gained popularity, and
the 2020 kernel lecture has more than 100 students enrolled. However, although I
prepared for the lectures for more than 2 days every week, the talks did not go well,
probably due to my weakness regarding the subject. This was evident from the class
questionnaires provided by the students. However, I analyzed each of these problems
and made improvements, and this book was born.
I hope that readers will learn about kernels efficiently without following the same
path that I took (consuming much time and energy through trial and error). Reading
this book does not mean that you will write a paper immediately, but it will give
you a solid foundation. You will be able to read kernel papers smoothly, which had
previously seemed difficult, and you will be able to see the whole kernel paradigm
from a higher level. This book is also enjoyable, even for researchers in machine
learning. We hope that you will use this book to achieve success in your respective
fields.

What Makes KMMP Unique?

I have summarized the features of this book as follows.

1. Developing logic
We mathematically formulate and solve each ML problem and build those
programs to grasp the subject’s essence. The KMMP (Kernel methods for
Machine learning with Math and Python) instills “logic” in the minds of the
readers. The reader will acquire both the knowledge and ideas of ML. Even
if new technology emerges, they will be able to follow the changes smoothly.
After solving the 100 problems, most students would say, “I learned a lot”.
2. Not just a story

If programming codes are available, you can immediately take action. It is
unfortunate when an ML book does not offer the source codes. Even if a package
is available, if we cannot see the inner workings of the programs, all we can do
is input data into those programs. In KMMP, the program codes are available
for most of the procedures. In cases where the reader does not understand the
math, the codes will help them know what it means.
3. Not just a how-to book: an academic book written by a university professor.
A how-to book explains how to use a package and provides examples of executions
for those unfamiliar with it. Still, because only the inputs and outputs are
visible, we can see the procedure only as a black box. In this sense, the reader will
have limited satisfaction because they will not obtain the subject’s essence.
KMMP intends to show the reader the heart of ML and is more of a full-fledged
academic book.
4. Solve 100 exercises: problems are improved with feedback from university
students
The exercises in this book have been used in university lectures and refined
based on students’ feedback. The best 100 problems were selected. Each chapter
(except the exercises) explains the solutions, and you can solve all of the
exercises by reading the book.
5. Self-contained
All of us have been discouraged by phrases such as “for the details, please refer to
the literature XX”. Unless you are an enthusiastic reader or researcher, nobody
will seek out those references. In this book, we have presented the material
so that consulting external references is not required. Additionally, the proofs
are simple derivations, and the complicated proofs are given in the appendices
at the end of each chapter. KMMP completes all discussions, including the
appendices.
6. Readers’ pages: questions, discussion, and program files
The reader can ask any question on the book via https://bayesnet.org/books.

Osaka, Japan Joe Suzuki

November 2021
Acknowledgments

The author wishes to thank Mr. Bing Yuan Zhang, Mr. Tian Le Yang, Mr. Ryosuke
Shimmura, Mr. Tomohiro Kamei, Ms. Rieko Tasaka, Mr. Keito Odajima, Mr. Daiki
Fujii, Mr. Hongming Huang, and all graduate students at Osaka University, for
pointing out logical errors in mathematical expressions and programs. Furthermore, I
would like to take this opportunity to thank Dr. Hidetoshi Matsui (Shiga University),
Dr. Michio Yamamoto (Okayama University), and Dr. Yoshikazu Terada (Osaka
University) for their advice on functional data analysis in seminars and workshops.
This English book is based mainly on the Japanese book published by Kyoritsu
Shuppan Co., Ltd. in 2021. The author would like to thank Kyoritsu Shuppan Co.,
Ltd., particularly its editorial members Mr. Tetsuya Ishii and Ms. Saki Otani. The
author also thanks Ms. Mio Sugino, Springer, for preparing the publication and
providing advice on the manuscript.

Osaka, Japan Joe Suzuki

November 2021

Contents

1 Positive Definite Kernels
  1.1 Positive Definiteness of a Matrix
  1.2 Kernels
  1.3 Positive Definite Kernels
  1.4 Probability
  1.5 Bochner’s Theorem
  1.6 Kernels for Strings, Trees, and Graphs
  Appendix
  Exercises 1∼15
2 Hilbert Spaces
  2.1 Metric Spaces and Their Completeness
  2.2 Linear Spaces and Inner Product Spaces
  2.3 Hilbert Spaces
  2.4 Projection Theorem
  2.5 Linear Operators
  2.6 Compact Operators
  Appendix: Proofs of Propositions
  Exercises 16∼30
3 Reproducing Kernel Hilbert Space
  3.1 RKHSs
  3.2 Sobolev Space
  3.3 Mercer’s Theorem
  Appendix
  Exercises 31∼45
4 Kernel Computations
  4.1 Kernel Ridge Regression
  4.2 Kernel Principal Component Analysis
  4.3 Kernel SVM
  4.4 Spline Curves
  4.5 Random Fourier Features
  4.6 Nyström Approximation
  4.7 Incomplete Cholesky Decomposition
  Appendix
  Exercises 46∼64
5 The MMD and HSIC
  5.1 Random Variables in RKHSs
  5.2 The MMD and Two-Sample Problem
  5.3 The HSIC and Independence Test
  5.4 Characteristic and Universal Kernels
  5.5 Introduction to Empirical Processes
  Appendix
  Exercises 65∼83
6 Gaussian Processes and Functional Data Analyses
  6.1 Regression
  6.2 Classification
  6.3 Gaussian Processes with Inducing Variables
  6.4 Karhunen-Loève Expansion
  6.5 Functional Data Analysis
  Appendix
  Exercises 83∼100

Bibliography
Chapter 1
Positive Definite Kernels

In data analysis and various information processing tasks, we use kernels to evaluate
the similarities between pairs of objects. In this book, we deal with mathematically
defined kernels called positive definite kernels. Let the elements $x, y$ of a set $E$
correspond to the elements (functions) $\Psi(x), \Psi(y)$ of a linear space $H$ called the
reproducing kernel Hilbert space. The kernel $k(x, y)$ corresponds to the inner product
$\langle \Psi(x), \Psi(y) \rangle_H$ in the linear space $H$. Additionally, by choosing a nonlinear map $\Psi$,
this kernel can be applied to various problems. The elements of $E$ may be strings, trees, or
graphs rather than real-valued vectors, as long as the kernel satisfies positive
definiteness. After defining probability and Lebesgue integrals in the second half,
we will learn about kernels by using characteristic functions (Bochner’s theorem).

1.1 Positive Definiteness of a Matrix

Let $n \geq 1$; we say that a square matrix $A \in \mathbb{R}^{n \times n}$ is symmetric if it is equal to its
transpose ($A^\top = A$)^1, and we say that $A$ is nonnegative definite if all the eigenvalues
are nonnegative.

1 We write the transpose of matrix $A$ as $A^\top$.

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022
J. Suzuki, Kernel Methods for Machine Learning with Math and Python,
https://doi.org/10.1007/978-981-19-0401-1_1

Proposition 1 (nonnegative definite matrix) The following three conditions are
equivalent for a symmetric matrix $A \in \mathbb{R}^{n \times n}$.
1. A matrix $B \in \mathbb{R}^{n \times n}$ exists such that $A = B^\top B$.
2. $x^\top A x \geq 0$ for any $x \in \mathbb{R}^n$.
3. The eigenvalues of $A$ are nonnegative.
Proof: 1.$\Longrightarrow$2. holds because $A = B^\top B \Longrightarrow x^\top A x = x^\top B^\top B x = \|Bx\|^2 \geq 0$. 2.$\Longrightarrow$3.
follows from the fact that $x^\top A x \geq 0$, $x \in \mathbb{R}^n \Longrightarrow 0 \leq y^\top A y = y^\top \lambda y = \lambda \|y\|^2$ for
an eigenvalue $\lambda$ of $A$ and its eigenvector $y \in \mathbb{R}^n$. 3.$\Longrightarrow$1. holds since
$$\lambda_1, \ldots, \lambda_n \geq 0 \Longrightarrow A = P^\top D P = P^\top \sqrt{D} \sqrt{D} P = (\sqrt{D} P)^\top \sqrt{D} P ,$$
where $D$ and $\sqrt{D}$ are diagonal matrices with elements $\lambda_1, \ldots, \lambda_n$ and $\sqrt{\lambda_1}, \ldots, \sqrt{\lambda_n}$,
respectively, and $P$ is the corresponding orthogonal matrix.
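The step 3.⟹1. of the proof can be checked numerically. The following is a minimal sketch, assuming NumPy's `np.linalg.eigh` for the eigendecomposition; the names `M`, `V`, and the fixed seed are illustrative choices and not part of the original text. Since `eigh` returns eigenvectors as the columns of `V`, the orthogonal matrix of the proof is taken here as `P = V.T`.

```python
import numpy as np

rng = np.random.default_rng(1)       # fixed seed (an assumption, for reproducibility)
n = 3
M = rng.normal(size=(n, n))
A = M.T @ M                          # symmetric and nonnegative definite

lam, V = np.linalg.eigh(A)           # A = V diag(lam) V^T with orthonormal columns in V
lam = np.clip(lam, 0.0, None)        # guard against tiny negative round-off
P = V.T                              # orthogonal matrix such that A = P^T D P
B = np.diag(np.sqrt(lam)) @ P        # B = sqrt(D) P

x = rng.normal(size=n)
print(np.allclose(B.T @ B, A))                   # condition 1: A = B^T B
print(x @ A @ x, np.linalg.norm(B @ x) ** 2)     # condition 2: x^T A x = ||Bx||^2
```

Using `eigh` rather than `eig` is a design choice for symmetric input: it returns real eigenvalues and an orthonormal set of eigenvectors.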
A nonnegative definite matrix A is symmetric. In this book, we say that a non-
negative definite matrix is positive definite if all of its eigenvalues are positive. In
addition, we assume that the elements of any matrix are real. However, the following
fact is often useful when we deal with complex numbers and Fourier transformations.

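Corollary 1 For a nonnegative definite matrix $A \in \mathbb{R}^{n \times n}$, we have that $z^\top A \bar{z} \geq 0$
for any $z \in \mathbb{C}^n$, where $i = \sqrt{-1}$ is the imaginary unit, and we write the conjugate
$x - iy$ of $z = x + iy \in \mathbb{C}$ with $x, y \in \mathbb{R}$ as $\bar{z}$.
Proof: Since there exists a $B \in \mathbb{R}^{n \times n}$ such that $A = B^\top B$ for a nonnegative definite
matrix $A \in \mathbb{R}^{n \times n}$, we have that

$$z^\top A \bar{z} = (Bz)^\top \overline{Bz} = |Bz|^2 \geq 0$$

for any $z = [z_1, \ldots, z_n]^\top \in \mathbb{C}^n$.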
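Corollary 1 can be illustrated with a random complex vector. This is a small sketch under the assumption that the quadratic form is $z^\top A \bar{z}$ with $\bar{z}$ the complex conjugate; the variable names and the seed are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)       # fixed seed (an assumption)
n = 4
B = rng.normal(size=(n, n))
A = B.T @ B                          # nonnegative definite by Proposition 1

z = rng.normal(size=n) + 1j * rng.normal(size=n)   # a complex vector in C^n
quad = z @ A @ np.conj(z)                          # z^T A z-bar
norm2 = np.sum(np.abs(B @ z) ** 2)                 # squared norm of Bz

print(quad, norm2)   # the imaginary part of quad should vanish up to round-off
```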

Example 1

# In this chapter, we assume that the following has been executed.
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import style
style.use("seaborn-ticks")  # in Matplotlib 3.6 and later this style is named "seaborn-v0_8-ticks"

n = 3
B = np.random.normal(size=n**2).reshape(n, n)
A = np.dot(B.T, B)  # A = B^T B is nonnegative definite
values, vectors = np.linalg.eig(A)
print("values:\n", values, "\n\nvectors:\n", vectors, "\n")

values:
[0.09337468 7.75678625 4.43554113]
vectors:
[[ 0.49860775 0.84350568 0.199721 ]
[ 0.39606374 -0.42663779 0.81308899]
[-0.77105371 0.32631023 0.54680692]]

S = []
for i in range(10):
    z = np.random.normal(size=n)
    y = np.squeeze(z.T.dot(A.dot(z)))  # quadratic form z^T A z
    S.append(y)
    if (i + 1) % 5 == 0:
        print("S[%d:%d]:" % ((i - 4), i), S[i - 4:i])

S[0:4]: [23.24608872999895, 6.601263342526701, 5.334515801733688, 14.886876186736613]

S[5:9]: [18.85503241886245, 34.30290091714191, 1.025291282540866, 29.59512428090335]
1.2 Kernels

Let $E$ be a set. We often express the similarity between elements $x, y \in E$ by using
a bivariate function $k : E \times E \to \mathbb{R}$, not just for data analysis but also for various
information processing tasks. The larger $k(x, y)$ is, the more similar $x$ and $y$ are. We
call such a function $k : E \times E \to \mathbb{R}$ a kernel.
Example 2 (Epanechnikov kernel) We use the kernel $k : E \times E \to \mathbb{R}$ such that

$$k(x, y) = D\left(\frac{|x - y|}{\lambda}\right)$$

$$D(t) = \begin{cases} \dfrac{3}{4}(1 - t^2), & |t| \leq 1 \\ 0, & \text{otherwise} \end{cases}$$

for $\lambda > 0$, and we construct the following function (the Nadaraya-Watson estimator)
from observations $(x_1, y_1), \ldots, (x_N, y_N) \in E \times \mathbb{R}$:

$$\hat{f}(x) = \frac{\sum_{i=1}^N k(x, x_i) y_i}{\sum_{j=1}^N k(x, x_j)} .$$

For a given input $x_* \in E$ that is different from the $N$ observed inputs, we return
the weighted sum of $y_1, \ldots, y_N$ with the weights

$$\frac{k(x_*, x_1)}{\sum_{j=1}^N k(x_*, x_j)}, \ldots, \frac{k(x_*, x_N)}{\sum_{j=1}^N k(x_*, x_j)}$$

as the output $\hat{f}(x_*)$. Because we assume that a larger $k(x, y)$ means that
$x, y \in E$ are more similar, the more similar $x_*$ and $x_i$ are, the larger the weight of $y_i$.
Given an input $x_* \in E$, each $y_i$ ($i = 1, \ldots, N$) such that $x_i - \lambda \leq x_* \leq x_i + \lambda$
receives a weight proportional to $k(x_i, x_*)$, and all the other $y_i$ receive weight zero.
If we make the $\lambda$ value smaller, we predict $y_*$ by
using only the $(x_i, y_i)$ for which $x_i$ and $x_*$ are close. We display the output obtained
when we execute the following code in Fig. 1.1.
[Figure: scatter plot of the data with the fitted curves for λ = 0.05, λ = 0.35, and λ = 0.5; horizontal axis x, vertical axis y]

Fig. 1.1 We use the Epanechnikov kernel and Nadaraya-Watson estimator to draw the curves for
λ = 0.05, 0.35, 0.5. Finally, we obtain the optimal λ value and present it in the same graph

n = 250
x = 2 * np.random.normal(size=n)
y = np.sin(2 * np.pi * x) + np.random.normal(size=n) / 4  # Data Generation

def D(t):  # Function Definition D
    return np.maximum(0.75 * (1 - t**2), 0)

def k(x, y, lam):  # Function Definition k
    return D(np.abs((x - y) / lam))

def f(z, lam):  # Function Definition f
    S = 0; T = 0
    for i in range(n):
        S = S + k(x[i], z, lam) * y[i]
        T = T + k(x[i], z, lam)
    return S / T

plt.figure(num=1, figsize=(15, 8), dpi=80)
plt.xlim(-3, 3); plt.ylim(-2, 3)
plt.xticks(fontsize=14); plt.yticks(fontsize=14)
plt.scatter(x, y, facecolors='none', edgecolors="k", marker="o")

xx = np.arange(-3, 3, 0.1)
yy = [[] for _ in range(3)]
lam = [0.05, 0.35, 0.50]
color = ["g", "b", "r"]
for i in range(3):
    for zz in xx:
        yy[i].append(f(zz, lam[i]))
    plt.plot(xx, yy[i], c=color[i], label=lam[i])

plt.legend(loc="upper left", frameon=True, prop={'size': 14})
plt.title("Nadaraya-Watson Estimator", fontsize=20)
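The double loop over grid points and samples can also be expressed with array broadcasting. The sketch below is one possible vectorized variant, not the book's listing; the function names `epanechnikov` and `nadaraya_watson`, the seed, and the choice to return NaN where no sample falls within the bandwidth are all assumptions made for illustration.

```python
import numpy as np

def epanechnikov(t):
    # D(t) = 3/4 (1 - t^2) for |t| <= 1 and 0 otherwise
    return np.maximum(0.75 * (1 - t**2), 0)

def nadaraya_watson(x_train, y_train, x_new, lam):
    # weights[a, b] = k(x_new[a], x_train[b])
    weights = epanechnikov(np.abs(x_new[:, None] - x_train[None, :]) / lam)
    denom = weights.sum(axis=1)
    numer = weights @ y_train
    out = np.full(x_new.shape, np.nan)   # NaN where all weights are zero
    np.divide(numer, denom, out=out, where=denom > 0)
    return out

rng = np.random.default_rng(0)           # fixed seed (an assumption)
x_train = 2 * rng.normal(size=250)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(size=250) / 4
grid = np.arange(-3, 3, 0.1)
fit = nadaraya_watson(x_train, y_train, grid, lam=0.5)
print(fit[:5])
```

Returning NaN for empty neighborhoods makes the zero-denominator case explicit, which can occur for small λ.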

1.3 Positive Definite Kernels

The kernels that we consider in this book satisfy the positive definiteness criterion
defined below. Suppose $k : E \times E \to \mathbb{R}$ is symmetric, i.e., $k(x, y) = k(y, x)$, $x, y \in E$.
For $x_1, \ldots, x_n \in E$ ($n \geq 1$), we say that the matrix

$$\begin{bmatrix} k(x_1, x_1) & \cdots & k(x_1, x_n) \\ \vdots & \ddots & \vdots \\ k(x_n, x_1) & \cdots & k(x_n, x_n) \end{bmatrix} \in \mathbb{R}^{n \times n} \tag{1.1}$$

is the Gram matrix of order $n$ w.r.t. $k$. We say that $k$ is a positive definite kernel^2 if
the Gram matrix of order $n$ is nonnegative definite for any $n \geq 1$ and $x_1, \ldots, x_n \in E$.
Example 3 The kernel in Example 2 does not satisfy positive definiteness. In
fact, when $\lambda = 2$, $n = 3$, and $x_1 = -1$, $x_2 = 0$, $x_3 = 1$, the matrix consisting of
$k(x_i, x_j)$ can be written as

$$\begin{bmatrix} k(x_1, x_1) & k(x_1, x_2) & k(x_1, x_3) \\ k(x_2, x_1) & k(x_2, x_2) & k(x_2, x_3) \\ k(x_3, x_1) & k(x_3, x_2) & k(x_3, x_3) \end{bmatrix} = \begin{bmatrix} 3/4 & 9/16 & 0 \\ 9/16 & 3/4 & 9/16 \\ 0 & 9/16 & 3/4 \end{bmatrix}$$

and the determinant is computed as $3^3/2^6 - 3^5/2^{10} - 3^5/2^{10} = -3^3/2^9$. In general,
the determinant of a matrix is the product of its eigenvalues, so at least
one of the three eigenvalues is negative.
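The computation in Example 3 can be reproduced numerically. The following sketch restates the Epanechnikov kernel of Example 2 so that it runs on its own; the array name `pts` is an illustrative assumption.

```python
import numpy as np

def D(t):
    return np.maximum(0.75 * (1 - t**2), 0)

def k(x, y, lam):
    return D(np.abs((x - y) / lam))

lam = 2
pts = np.array([-1.0, 0.0, 1.0])            # x_1, x_2, x_3 of Example 3
K = k(pts[:, None], pts[None, :], lam)      # 3 x 3 Gram matrix via broadcasting

print(K)
print(np.linalg.det(K))        # compare with -3^3 / 2^9
print(np.linalg.eigvalsh(K))   # the smallest eigenvalue is negative
```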
Example 4 For random variables $\{X_i\}_{i=1}^{\infty}$ that are not necessarily independent, if
$k(X_i, X_j)$ is the covariance between $X_i, X_j$, the Gram matrix of any order is the
covariance matrix among a finite number of $X_j$, which means that $k$ is positive
definite. We discuss Gaussian processes based on this fact in Chap. 6.
By assuming positive definiteness, the theory of kernels will be developed in this
book. Hereafter, when we state kernels, we are referring to positive definite kernels.
Let $H$ be a linear space (vector space) equipped with an inner product $\langle \cdot, \cdot \rangle_H$.
Then, we often construct a positive definite kernel with

$$k(x, y) = \langle \Psi(x), \Psi(y) \rangle_H \tag{1.2}$$

by using an arbitrary map $\Psi : E \to H$. We say that such a $\Psi$ is a feature map. In
this chapter, we may assume that the linear space $H$ is the Euclidean space $H = \mathbb{R}^d$
of dimensionality $d$ with the standard inner product $\langle x, y \rangle_{\mathbb{R}^d} = x^\top y$, $x, y \in \mathbb{R}^d$. We
define the linear space and inner product concepts in Chap. 2.
Proposition 2 The kernel k : E × E → R defined in (1.2) is positive definite.

2 Although it seems appropriate to say “a nonnegative definite kernel”, the custom of saying “a
positive definite kernel” has been established.

Proof: We arbitrarily fix $n = 1, 2, \cdots$ and $x_1, \cdots, x_n \in E$ and denote the Gram
matrix (1.1) by $K$. Then, from the definition of inner products, for an arbitrary
$z = [z_1, \cdots, z_n]^\top \in \mathbb{R}^n$, we have

$$z^\top K z = \sum_{i=1}^n \sum_{j=1}^n z_i z_j \langle \Psi(x_i), \Psi(x_j) \rangle_H = \left\langle \sum_{i=1}^n z_i \Psi(x_i), \sum_{j=1}^n z_j \Psi(x_j) \right\rangle_H = \left\| \sum_{j=1}^n z_j \Psi(x_j) \right\|_H^2 \geq 0 ,$$

where we write $\|a\|_H := \langle a, a \rangle_H^{1/2}$ for $a \in H$.

Proposition 3 If the matrices $A, B$ are nonnegative definite, then so is the Hadamard
product $A \circ B$ (elementwise multiplication).

Proof: See the appendix at the end of this chapter.
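Before turning to the proof, Proposition 3 can be tried on random matrices. The sketch below is a numerical illustration only, not a proof; the helper `random_nonneg_definite` and the seed are hypothetical names introduced here.

```python
import numpy as np

def random_nonneg_definite(n, rng):
    # M^T M is nonnegative definite for any real square matrix M (Proposition 1)
    M = rng.normal(size=(n, n))
    return M.T @ M

rng = np.random.default_rng(0)       # fixed seed (an assumption)
n = 5
A = random_nonneg_definite(n, rng)
B = random_nonneg_definite(n, rng)

H = A * B                            # Hadamard (elementwise) product, not A @ B
min_eig = np.linalg.eigvalsh(H).min()
print(min_eig)                       # nonnegative up to round-off
```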

Proposition 3 is helpful for proving the second part of the following proposition.
Proposition 4 If the kernels $k_1, k_2, \ldots$ are positive definite, then so are the following
$E \times E \to \mathbb{R}$:
1. $ak_1 + bk_2$ ($a, b \geq 0$),
2. $k_1 k_2$,
3. the limit^3 of $\{k_i\}$ when it converges,
4. a constant function $k$ that takes only one value $a \geq 0$, and
5. $f(x) k(x, y) f(y)$ ($x, y \in E$) for a positive definite kernel $k$ and an arbitrary $f : E \to \mathbb{R}$,
where the third item claims that the limit $k_\infty(x, y) := \lim_{i \to \infty} k_i(x, y)$, $x, y \in E$,
satisfies positive definiteness.

3 The limit of $k_i(x, y)$ for each $(x, y) \in E \times E$.

Proof: $ak_1 + bk_2$ is positive definite because

$$x^\top A x \geq 0, \ x^\top B x \geq 0 \Longrightarrow x^\top (aA + bB) x \geq 0$$

for $A, B \in \mathbb{R}^{n \times n}$ and $a, b \geq 0$. The product $k_1 k_2$ is positive definite because if $A = (A_{i,j})$, $B = (B_{i,j})$
are nonnegative definite, then so is the Hadamard product $A \circ B$ (Proposition 3).
For the third statement, suppose to the contrary that there exist a positive integer $n$,
$x_1, \cdots, x_n \in E$, $z_1, \ldots, z_n \in \mathbb{R}$, and $\epsilon > 0$ such that

$$B_\infty := \sum_{j=1}^n \sum_{h=1}^n z_j z_h k_\infty(x_j, x_h) = -\epsilon .$$

Then, the difference between $B_i := \sum_{j=1}^n \sum_{h=1}^n z_j z_h k_i(x_j, x_h) \geq 0$ and $B_\infty$
becomes arbitrarily close to zero as $i \to \infty$.
However, the difference is at least $\epsilon > 0$, which is a contradiction and means that
$B_\infty \geq 0$. If a kernel takes only a (nonnegative) constant value $a$, since all the values
in (1.1) are $a \geq 0$, we have

$$\begin{bmatrix} a & \cdots & a \\ \vdots & \ddots & \vdots \\ a & \cdots & a \end{bmatrix} = \begin{bmatrix} \sqrt{a/n} & \cdots & \sqrt{a/n} \\ \vdots & \ddots & \vdots \\ \sqrt{a/n} & \cdots & \sqrt{a/n} \end{bmatrix} \begin{bmatrix} \sqrt{a/n} & \cdots & \sqrt{a/n} \\ \vdots & \ddots & \vdots \\ \sqrt{a/n} & \cdots & \sqrt{a/n} \end{bmatrix} .$$

The last claim is due to the implication

$$x^\top A x \geq 0 , \ x \in \mathbb{R}^n \Longrightarrow x^\top D A D x \geq 0 , \ x \in \mathbb{R}^n ,$$

which we can examine by substituting $y = Dx$ into $y^\top A y \geq 0$. In particular, we
may regard $A$ and $D$ as the matrix (1.1) and the diagonal matrix with the elements
$f(x_1), \cdots, f(x_n)$, respectively.
In addition, the $f(x) f(y)$ obtained by substituting $k(x, y) = 1$ for $x, y \in E$ in
the last item of Proposition 4 is positive definite. Moreover, the

$$\frac{k(x, y)}{\sqrt{k(x, x) k(y, y)}} \tag{1.3}$$

obtained by substituting $f(x) = \{k(x, x)\}^{-1/2}$ for $k(x, x) > 0$ ($x \in E$) in the last
item of Proposition 4 is positive definite. Furthermore, the determinant
$k(x, x) k(y, y) - k(x, y)^2$ of the matrix obtained by substituting $n = 2$, $x_1 = x$, and $x_2 = y$
into (1.1) is nonnegative, so the absolute value of
(1.3) does not exceed one. We say that (1.3) is the positive definite kernel obtained
by normalizing $k(x, y)$.
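The normalization (1.3) acts on a Gram matrix as division by the outer product of the square roots of its diagonal. The sketch below assumes the Euclidean inner product $x^\top y$ on $\mathbb{R}^3$ as the kernel; `normalize_gram`, the sample size, and the seed are hypothetical choices for illustration.

```python
import numpy as np

def normalize_gram(K):
    # entry (i, j) becomes K[i, j] / sqrt(K[i, i] * K[j, j]); requires a positive diagonal
    d = np.sqrt(np.diag(K))
    return K / np.outer(d, d)

rng = np.random.default_rng(0)       # fixed seed (an assumption)
X = rng.normal(size=(6, 3))          # six points in R^3
K = X @ X.T                          # Gram matrix of k(x, y) = x^T y
K_norm = normalize_gram(K)

print(np.diag(K_norm))                   # every diagonal entry equals one
print(np.abs(K_norm).max())              # no entry exceeds one in absolute value
print(np.linalg.eigvalsh(K_norm).min())  # still nonnegative up to round-off
```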

Example 5 (Linear Kernel) Let $E := \mathbb{R}^d$. Then, the kernel $k(x, y) = x^\top A y =
\langle Bx, By \rangle_H$, $x, y \in \mathbb{R}^d$, using the nonnegative definite matrix $A = B^\top B \in \mathbb{R}^{d \times d}$,
$B \in \mathbb{R}^{d \times d}$, is positive definite because it corresponds to the case in which the map $\Psi$
in Proposition 2 is $E \ni x \mapsto Bx \in H$. In particular, if $A$ is the unit matrix, then the
map $\Psi$ is the identity map. In this sense, the positive definite kernel is an extension
of the inner product $k(x, y) = x^\top y$.

Example 6 (Exponential Type) Let $\beta > 0$ and $x, y \in \mathbb{R}^d$. Then,

$$k_m(x, y) := 1 + \beta x^\top y + \frac{\beta^2}{2} (x^\top y)^2 + \cdots + \frac{\beta^m}{m!} (x^\top y)^m \tag{1.4}$$

($m \geq 1$) is a polynomial in the positive definite kernel $x^\top y$ whose coefficients
are nonnegative. From the first two items of Proposition 4 (and the fourth item for the
constant term), this kernel is a
positive definite kernel. Additionally, because (1.4) is the Taylor expansion of
$\exp(\beta x^\top y)$ up to the order $m$, from the third item of Proposition 4,

$$k_\infty(x, y) := \exp(\beta x^\top y) = \lim_{m \to \infty} k_m(x, y)$$

is a positive definite kernel as well.
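The convergence of the truncated kernels to the exponential kernel can be observed numerically. This is a minimal sketch; the function name `k_m`, the value $\beta = 0.5$, and the seed are assumptions chosen for the illustration.

```python
import numpy as np
from math import factorial

def k_m(x, y, beta, m):
    # truncated Taylor expansion of exp(beta * x^T y) up to order m
    t = beta * (x @ y)
    return sum(t**j / factorial(j) for j in range(m + 1))

rng = np.random.default_rng(0)       # fixed seed (an assumption)
x, y = rng.normal(size=3), rng.normal(size=3)
beta = 0.5
target = np.exp(beta * (x @ y))      # the limit kernel evaluated at (x, y)

for m in (1, 2, 5, 10, 20):
    print(m, k_m(x, y, beta, m), target)
```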

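Example 7 (Gaussian Kernel) The kernel

$$k(x, y) := \exp\left\{-\frac{1}{2\sigma^2} \|x - y\|^2\right\} , \quad \sigma > 0 \tag{1.5}$$

for $x, y \in \mathbb{R}^d$ can be written as

$$\exp\left\{-\frac{1}{2\sigma^2} \|x - y\|^2\right\} = \exp\left\{-\frac{\|x\|^2}{2\sigma^2}\right\} \exp\left\{\frac{x^\top y}{\sigma^2}\right\} \exp\left\{-\frac{\|y\|^2}{2\sigma^2}\right\} .$$

Thus, from the fifth item of Proposition 4 and the fact that $\exp(\beta x^\top y)$ with $\beta = \sigma^{-2}$
is positive definite, we see that (1.5) is positive definite.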
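The factorization used in Example 7 can be checked on a Gram matrix. The following sketch assumes eight random points in $\mathbb{R}^2$ and $\sigma^2 = 1.5$; `gaussian_gram` and these values are illustrative, not from the original text.

```python
import numpy as np

def gaussian_gram(X, sigma2):
    # squared distances ||x_i - x_j||^2 via ||x_i||^2 + ||x_j||^2 - 2 x_i^T x_j
    sq = np.sum(X**2, axis=1)
    dist2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    return np.exp(-np.maximum(dist2, 0) / (2 * sigma2))

rng = np.random.default_rng(0)       # fixed seed (an assumption)
X = rng.normal(size=(8, 2))
sigma2 = 1.5
K = gaussian_gram(X, sigma2)

# f(x) * exp(x^T y / sigma^2) * f(y) with f(x) = exp(-||x||^2 / (2 sigma^2))
f = np.exp(-np.sum(X**2, axis=1) / (2 * sigma2))
K_factored = f[:, None] * np.exp(X @ X.T / sigma2) * f[None, :]

print(np.allclose(K, K_factored))
print(np.linalg.eigvalsh(K).min())   # nonnegative up to round-off
```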
Example 8 (Polynomial Kernel) The kernel

$$k_{m,d}(x, y) := (x^\top y + 1)^m , \tag{1.6}$$

for $x, y \in \mathbb{R}^d$, $d = 1, 2, \ldots$ is a polynomial of positive definite kernels (linear kernels
$x^\top y$), and its coefficients are nonnegative. From the first two items of Proposition 4,
(1.6) is positive definite.
Example 9 If we normalize the linear kernel by (1.3), we obtain $x^\top y / (\|x\| \|y\|)$,
where we denote $\|a\| := \langle a, a \rangle^{1/2}$ for $a \in \mathbb{R}^d$. The Gaussian kernel (1.5) remains the
same even if we normalize it. The polynomial kernel becomes

$$\left( \frac{x^\top y + 1}{\sqrt{(x^\top x + 1)(y^\top y + 1)}} \right)^m$$

if we normalize it.
The converse of Proposition 2 is also true, which will be proven in Chap. 3: for
any positive definite kernel $k$, there exists a feature map $\Psi : E \to H$ such that
$k(x, y) = \langle \Psi(x), \Psi(y) \rangle_H$.
Example 10 (Polynomial Kernel) Let $m, d \geq 1$. The feature map of the kernel
$k_{m,d}(x, y) = (x^\top y + 1)^m$ with $x, y \in \mathbb{R}^d$ is

$$\Psi_{m,d}(x_1, \cdots, x_d) = \left( \sqrt{\frac{m!}{m_0! m_1! \cdots m_d!}} \, x_1^{m_1} \cdots x_d^{m_d} \right)_{m_0, m_1, \ldots, m_d \geq 0} ,$$

where the indices $(m_0, m_1, \cdots, m_d)$ range over $m_0, m_1, \cdots, m_d \geq 0$ and $m_0 +
m_1 + \cdots + m_d = m$, and we assume that an order exists among the indices
$(m_0, m_1, \cdots, m_d)$. If we use the multinomial theorem,

$$\left( \sum_{i=0}^d z_i \right)^m = \sum_{m_0 + m_1 + \cdots + m_d = m} \frac{m!}{m_0! m_1! \cdots m_d!} z_1^{m_1} \cdots z_d^{m_d}$$

($z_0 = 1$), we see that

$$(x^\top y + 1)^m = \langle \Psi_{m,d}(x), \Psi_{m,d}(y) \rangle_H$$

with $x_0 = y_0 = 1$. For example, we have

$$\Psi_{1,2}(x_1, x_2) = [1, x_1, x_2]$$

$$\Psi_{2,2}(x_1, x_2) = [1, x_1^2, x_2^2, \sqrt{2} x_1, \sqrt{2} x_2, \sqrt{2} x_1 x_2]$$

because

$$\langle \Psi_{1,2}(x_1, x_2), \Psi_{1,2}(y_1, y_2) \rangle_H = 1 + x_1 y_1 + x_2 y_2 = 1 + x^\top y = k(x, y)$$

$$\langle \Psi_{2,2}(x_1, x_2), \Psi_{2,2}(y_1, y_2) \rangle_H = 1 + x_1^2 y_1^2 + x_2^2 y_2^2 + 2 x_1 y_1 + 2 x_2 y_2 + 2 x_1 x_2 y_1 y_2$$

$$= (1 + x_1 y_1 + x_2 y_2)^2 = (1 + x^\top y)^2 = k(x, y) .$$
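The feature map of Example 10 can be built for general $m$ and $d$ by enumerating the multi-indices. The sketch below is one straightforward way to do this, assuming the ordering produced by `itertools.product`; the function name `poly_feature_map`, the test dimensions, and the seed are hypothetical.

```python
import itertools
from math import factorial, prod, sqrt
import numpy as np

def poly_feature_map(x, m):
    d = len(x)
    feats = []
    # enumerate (m_0, m_1, ..., m_d) with nonnegative entries summing to m
    for idx in itertools.product(range(m + 1), repeat=d + 1):
        if sum(idx) != m:
            continue
        coef = factorial(m) // prod(factorial(mi) for mi in idx)   # multinomial coefficient
        monomial = np.prod([xi**mi for xi, mi in zip(x, idx[1:])])  # x_0 = 1 is omitted
        feats.append(sqrt(coef) * monomial)
    return np.array(feats)

rng = np.random.default_rng(0)       # fixed seed (an assumption)
x, y = rng.normal(size=3), rng.normal(size=3)
m = 3
lhs = poly_feature_map(x, m) @ poly_feature_map(y, m)   # inner product of features
rhs = (x @ y + 1) ** m                                  # polynomial kernel value
print(lhs, rhs)
```

The number of features equals the number of multi-indices, which grows quickly with $m$ and $d$; evaluating the kernel directly avoids constructing them.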

Example 11 (Infinite-Dimensional Polynomial Kernel) Let $0 < r \leq \infty$, $d \geq 1$, and
$E := \{x \in \mathbb{R}^d \mid \|x\|_2 < \sqrt{r}\}$. Let $f : (-r, r) \to \mathbb{R}$ be $C^\infty$. We assume that the function
can be Taylor-expanded as

$$f(x) = \sum_{n=0}^{\infty} a_n x^n , \quad x \in (-r, r) .$$

If $a_0 > 0$, $a_1, a_2, \ldots \geq 0$, then $f(x^\top y)$ is a positive definite kernel for $x, y \in E$. The
exponential type is an infinite-dimensional polynomial kernel and is positive definite.
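As a further instance of Example 11, one may take $f(t) = 1/(1 - t) = \sum_{n \geq 0} t^n$ with $r = 1$, so that all coefficients equal one. This choice of $f$ is an assumption made for illustration and does not appear in the original text; the sketch checks numerically that the resulting Gram matrix on points inside the unit ball is nonnegative definite.

```python
import numpy as np

rng = np.random.default_rng(0)       # fixed seed (an assumption)
X = rng.normal(size=(6, 2))
# rescale every point so that its norm is below 0.9 < sqrt(r) = 1
norms = np.linalg.norm(X, axis=1, keepdims=True)
X = 0.9 * rng.uniform(size=(6, 1)) * X / norms

G = X @ X.T                          # inner products x_i^T x_j, all in (-1, 1)
K = 1.0 / (1.0 - G)                  # f(x^T y) with f(t) = 1 / (1 - t)

print(np.abs(G).max())               # stays below one, so f is well defined
print(np.linalg.eigvalsh(K).min())   # nonnegative up to round-off
```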

Example 12 We apply the Nadaraya-Watson estimator of Example 2 with the
Gaussian kernel (Figs. 1.2 and 1.3).

def K(x, y, sigma2):  # Gaussian kernel with variance parameter sigma2
    return np.exp(-np.linalg.norm(x - y)**2 / 2 / sigma2)

def F(z, sigma2):  # Function Definition f
    S = 0; T = 0
    for i in range(n):
        S = S + K(x[i], z, sigma2) * y[i]
        T = T + K(x[i], z, sigma2)
    return S / T

We often obtain the optimal value for each of the kernel parameters via cross
validation (CV)4 . If the parameters take continuous values, we select a finite number
of candidates and obtain the evaluation value for each parameter as follows. Divide
the N samples into K groups and conduct estimation with the samples belonging to
group K − 1 group. Perform testing with the samples belonging to the one remaining
group and calculate the corresponding score. Repeat the procedure K times (changing

4 Joe Suzuki, “Statistical Learning with Math and Python”, Chap. 4, Springer.
10 1 Positive Definite Kernels

σ 2 = 0.01

3
σ 2 = 0.001
σ 2 = σbest
2

2
1
y
0
-1
-2

-3 -2 -1 0 1 2 3
x

Fig. 1.2 Smoothing by predicting the values of x outside the N sample points via the Nadaraya-
Watson estimator. We choose the best parameter for the Gaussian kernel via cross validation

            Group 1     Group 2     ···  Group k − 1  Group k
First       Test        Estimation  ···  Estimation   Estimation
Second      Estimation  Test        ···  Estimation   Estimation
⋮           ⋮           ⋮                ⋮            ⋮
(k − 1)-th  Estimation  Estimation  ···  Test         Estimation
k-th        Estimation  Estimation  ···  Estimation   Test

Fig. 1.3 A rotation employed for cross validation. Each group consists of N/k samples; we divide
the samples into k groups based on their sample IDs: 1 ∼ N/k, N/k + 1 ∼ 2N/k, . . . , (k − 2)N/k + 1 ∼
(k − 1)N/k, (k − 1)N/k + 1 ∼ N

the test group) and find the sum of the obtained scores. In that way, we evaluate
the performance of the kernel for one parameter value. Execute this process for all
parameter candidates and use the parameter with the best evaluation value.
We first draw the curves for σ² = 0.01 and 0.001 and then obtain the optimal value
of the parameter σ² via CV.

n = 100
x = 2 * np.random.normal(size = n)
y = np.sin(2 * np.pi * x) + np.random.normal(size = n) / 4  # Data Generation

# The Curves for sigma2 = 0.01, 0.001

plt.figure(num=1, figsize=(15, 8), dpi=80)
plt.scatter(x, y, facecolors='none', edgecolors = "k", marker = "o")
plt.xlim(-3, 3)
plt.ylim(-2, 3)
plt.xticks(fontsize = 14); plt.yticks(fontsize = 14)
xx = np.arange(-3, 3, 0.1)
yy = [[] for _ in range(2)]

sigma2 = [0.001, 0.01]
color = ["g", "b"]

for i in range(2):
    for zz in xx:
        yy[i].append(F(zz, sigma2[i]))
    plt.plot(xx, yy[i], c = color[i], label = sigma2[i])
plt.legend(loc = "upper left", frameon = True, prop={'size':20})
plt.title("Nadaraya-Watson Estimator", fontsize = 20)

# Optimum sigma2 Value

m = int(n / 10)
sigma2_seq = np.arange(0.001, 0.01, 0.001)
SS_min = np.inf
for sigma2 in sigma2_seq:
    SS = 0
    for k in range(10):
        test = range(k*m, (k+1)*m)
        train = [i for i in range(n) if i not in test]
        for j in test:
            u, v = 0, 0
            for i in train:
                kk = K(x[i], x[j], sigma2)
                u = u + kk * y[i]
                v = v + kk
            # Skip test points for which all kernel weights vanish
            if v != 0:
                z = u / v
                SS = SS + (y[j] - z)**2
    if SS < SS_min:
        SS_min = SS
        sigma2_best = sigma2
print("Best sigma2 = ", sigma2_best)

Best sigma2 = 0.003
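The CV loop above can also be written with array operations. The following is a minimal sketch under our own assumptions, not the author's implementation: the helper names `nw_predict` and `cv_score` are hypothetical, the folds are consecutive blocks of sample IDs as in Fig. 1.3, and test points whose kernel weights all vanish are skipped as in the loop above.

```python
import numpy as np

def nw_predict(z, x, y, sigma2):
    """Nadaraya-Watson predictions at points z from samples (x, y)."""
    z = np.atleast_1d(z)
    # Gaussian kernel weights, shape (len(z), len(x))
    W = np.exp(-(z[:, None] - x[None, :])**2 / (2 * sigma2))
    denom = W.sum(axis=1)
    # Guard against 0/0 when every weight underflows to zero
    safe = np.where(denom > 0, denom, 1.0)
    return np.where(denom > 0, (W @ y) / safe, np.nan)

def cv_score(x, y, sigma2, k=10):
    """Sum of squared test errors over k folds split by sample ID."""
    n = len(x)
    folds = np.array_split(np.arange(n), k)
    score = 0.0
    for test in folds:
        train = np.setdiff1d(np.arange(n), test)
        pred = nw_predict(x[test], x[train], y[train], sigma2)
        ok = ~np.isnan(pred)              # skip points with vanishing weights
        score += np.sum((y[test][ok] - pred[ok])**2)
    return score

rng = np.random.default_rng(0)
x = 2 * rng.normal(size=100)
y = np.sin(2 * np.pi * x) + rng.normal(size=100) / 4
candidates = np.arange(0.001, 0.01, 0.001)
scores = [cv_score(x, y, s2) for s2 in candidates]
best = candidates[int(np.argmin(scores))]
print("Best sigma2 =", best)
```

Because the data are random, the selected value varies with the seed and need not coincide with the value printed above.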

plt.figure(num = 1, figsize=(15, 8), dpi = 80)
plt.scatter(x, y, facecolors = 'none', edgecolors = "k", marker = "o")
plt.xlim(-3, 3)
plt.ylim(-2, 3)
plt.xticks(fontsize = 14); plt.yticks(fontsize = 14)

xx = np.arange(-3, 3, 0.1)
yy = [[] for _ in range(3)]
sigma2 = [0.001, 0.01, sigma2_best]
labels = [0.001, 0.01, "sigma2_best"]
color = ["g", "b", "r"]

for i in range(3):
    for zz in xx:
        yy[i].append(F(zz, sigma2[i]))
    plt.plot(xx, yy[i], c = color[i], label = labels[i])
plt.legend(loc = "upper left", frameon = True, prop={'size': 20})
plt.title("Nadaraya-Watson Estimator", fontsize = 20)

1.4 Probability

We call each member of a family of sets an event when the family is closed under the
set operations (union, intersection, and complement).
Example 13 We consider a set consisting of the subsets of E = {1, 2, 3, 4, 5, 6}
(dice eyes) that are closed by set operations:

{E, {}, {1, 3}, {5}, {2, 4, 6}, {1, 3, 5}, {2, 4, 5, 6}, {1, 2, 3, 4, 6}} .

If any of these eight elements undergo the union, intersection, or complement oper-
ations, the result remains one of these eight elements. In that sense, we can say
that these eight elements are closed under the set operations. The subsets {1, 3} and
{2, 4, 5, 6} are events, but {2, 4} is not. On the other hand, for the same entire set E, if we
include {1}, {2}, {3}, {4}, {5}, {6} as events, 2^6 = 64 events should be considered. Even if
the entire set E is identical, whether a given subset is an event depends on the set F of
events.
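The closure property in Example 13 can be checked mechanically. The sketch below represents each subset as a `frozenset`; the helper `is_closed` is a name introduced only for this illustration and tests closure under complement, pairwise union, and pairwise intersection.

```python
from itertools import combinations

E = frozenset({1, 2, 3, 4, 5, 6})

def is_closed(family):
    """Check closure under complement, union, and intersection."""
    fam = set(family)
    if any(E - a not in fam for a in fam):
        return False
    return all(a | b in fam and a & b in fam for a, b in combinations(fam, 2))

F = [E, frozenset(), frozenset({1, 3}), frozenset({5}), frozenset({2, 4, 6}),
     frozenset({1, 3, 5}), frozenset({2, 4, 5, 6}), frozenset({1, 2, 3, 4, 6})]

print(is_closed(F))                          # the eight sets of Example 13
print(is_closed(F + [frozenset({2, 4})]))    # the same family with {2, 4} added
```

Adding {2, 4} alone breaks closure because its complement {1, 3, 5, 6} is not in the family.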
In the following, we start our discussion after defining the entire set E and the
set F of subsets (events) of E closed under the set operations. Any open interval (a, b)
with a, b ∈ R is a subset of the set R of real numbers. Applying set operations
(union, intersection, and complement) to multiple open intervals does not necessarily
yield an open interval, but the result remains a subset of R. We call any subset of R obtained
from open intervals by set operations a Borel set of R, and we denote such a subset as
B. A set obtained by further applying set operations to Borel sets remains a Borel
set.

Example 14 For a, b ∈ R, the following are Borel sets: $\{a\} = \cap_{n=1}^{\infty} (a - 1/n, a + 1/n)$, $[a, b) = \{a\} \cup (a, b)$, $(a, b] = \{b\} \cup (a, b)$, $[a, b] = \{a\} \cup (a, b]$, $\mathbb{R} = \cup_{n=0}^{\infty} (-2^n, 2^n)$, $\mathbb{Z} = \cup_{n=0}^{\infty} \{-n, n\}$, and $[\sqrt{2}, 3) \cup \mathbb{Z}$.

As described above, we assume that we have defined the entire set E and the
set F of events. Then, a map μ : F → [0, 1] that satisfies the following three
conditions is called a probability.
1. $\mu(A) \ge 0$, $A \in F$,
2. $A_i \cap A_j = \{\}$ $(i \ne j)$ $\Longrightarrow$ $\mu(\cup_{i=1}^{\infty} A_i) = \sum_{i=1}^{\infty} \mu(A_i)$, and
3. μ(E) = 1.
We say that μ is a measure if μ satisfies the first two conditions, and we say that
this measure is finite if μ(E) takes a finite value. We say that (E, F, μ) is either a
probability space or a measure space, depending on whether μ(E) = 1 or not.
For probability and measure spaces, if {e ∈ E|X(e) ∈ B} is an event for any
Borel set B, which means that {e ∈ E|X(e) ∈ B} ∈ F, we say that the function
X : E → R is measurable. In particular, if we have a probability space, X is a
random variable. Whether X is measurable depends on (E, F) rather than (E, F, μ).
The notion of measurability might be difficult for a beginner to understand.
However, it becomes easier if we intuitively understand that the value of the function X : E → R
depends on an element of F rather than an element of E.
Example 15 (Dice Eyes) Suppose that X : E → R for E = {1, 2, 3, 4, 5, 6} is given
by

$$X(e) = \begin{cases} 1, & e = 1, 3, 5 \\ 0, & e = 2, 4, 6 \end{cases} .$$

Then, if F = {{1, 3, 5}, {2, 4, 6}, {}, E}, then X is a random variable. In fact, X is
measurable since

{e ∈ E|X(e) ∈ {1}} = {1, 3, 5}

{e ∈ E|X(e) ∈ [−2, 3)} = E

{e ∈ E|X(e) ∈ [0, 1)} = {2, 4, 6}

for the Borel sets B = {1}, [−2, 3), [0, 1), and whichever Borel set B we choose, the
set {e ∈ E|X(e) ∈ B} is one of {1, 3, 5}, {2, 4, 6}, {}, E. On the other hand, if F =
{{1, 2, 3}, {4, 5, 6}, {}, E}, then X is not a random variable.
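For a finite set E, measurability can be checked by enumeration, because the set {e ∈ E|X(e) ∈ B} depends only on which values of X belong to B. The following sketch relies on that observation; the helper `is_measurable` is a name introduced for this illustration.

```python
E = frozenset({1, 2, 3, 4, 5, 6})

def X(e):
    # The map of Example 15: 1 for odd eyes, 0 for even eyes
    return 1 if e in (1, 3, 5) else 0

def is_measurable(X, F):
    """On a finite E it suffices to check the preimage of every set of values."""
    values = sorted({X(e) for e in E})
    fam = set(F)
    # Enumerate all subsets of the value set via bit masks
    for mask in range(2 ** len(values)):
        B = {v for i, v in enumerate(values) if mask >> i & 1}
        preimage = frozenset(e for e in E if X(e) in B)
        if preimage not in fam:
            return False
    return True

F1 = [frozenset({1, 3, 5}), frozenset({2, 4, 6}), frozenset(), E]
F2 = [frozenset({1, 2, 3}), frozenset({4, 5, 6}), frozenset(), E]
print(is_measurable(X, F1), is_measurable(X, F2))
```

With F1 every preimage is an event, whereas with F2 the preimage {1, 3, 5} of {1} is not.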
In the following, assuming that the function f : E → R is measurable, we define
the Lebesgue integral $\int_E f\,d\mu$. We first assume that f is nonnegative. For a sequence
of mutually exclusive events $\{B_k\}$ in F, we define

$$\sum_k \Big( \inf_{e \in B_k} f(e) \Big) \mu(B_k) . \qquad (1.7)$$

If $\cup_k B_k = E$ and the supremum of (1.7) w.r.t. $\{B_k\}$,

$$\sup_{\{B_k\}} \sum_k \Big( \inf_{e \in B_k} f(e) \Big) \mu(B_k) ,$$

takes a finite value, we say that the supremum is the Lebesgue integral of the mea-
surable function f for (E, F, μ), and we write $\int_E f\,d\mu$. When the function f is not
necessarily nonnegative, we divide E into $E^+ := \{e \in E \mid f(e) \ge 0\}$ and $E^- := \{e \in E \mid f(e) \le 0\}$,
and we define the above quantity for each of $f^+ := f$ on $E^+$ and $f^- := -f$ on $E^-$. If
both $\int_{E^+} f^+ d\mu$ and $\int_{E^-} f^- d\mu$ take finite values, we say that $\int_E f\,d\mu := \int_{E^+} f^+ d\mu - \int_{E^-} f^- d\mu$
is the Lebesgue integral of f for (E, F, μ).
If X is a random variable, the associated Borel sets are the events for the probability
μ(·). We say that the probability of the event X ≤ x for x ∈ R,

$$F_X(x) := \mu((-\infty, x]) = \int_{(-\infty, x]} d\mu ,$$

is the distribution function and that $f_X$ is the probability density function of X if we
can write $F_X$ as

$$F_X(x) = \int_{-\infty}^{x} f_X(t)\,dt .$$

We say that μ is absolutely continuous if the probability μ(B) diminishes when

the width sum of the intervals in any Borel set B approaches zero. The necessary
and sufficient condition ensuring that a probability density function exists for the
probability μ is that μ is absolutely continuous. If X takes a finite number of values,
the probability density function does not exist, which means that μ is not absolutely
continuous. If X takes values of $a_1 < \cdots < a_m$, then the distribution function can be
written as

$$F_X(x) = \sum_{j : a_j \le x} \mu(\{a_j\}) .$$

Example 16 Suppose that X follows the standard Gaussian distribution. If we make
ε > 0 close to zero, $F_X(x + \epsilon) - F_X(x - \epsilon)$ (for any x ∈ R) approaches zero, which
means that the probability is absolutely continuous. On the other hand, suppose that
X takes values of 0, 1; even if we make ε > 0 close to zero, $F_X(1 + \epsilon) - F_X(1 - \epsilon)$
does not approach zero, which means that the probability is not absolutely continuous.
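The contrast in Example 16 can be observed numerically. In the sketch below, the standard Gaussian distribution function is computed via `math.erf`, and the two-valued variable is assumed to take 0 and 1 with probability 1/2 each (the value q = 0.5 is our assumption; the example does not specify it).

```python
import math

def F_gauss(x):
    # Distribution function of the standard Gaussian distribution
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def F_binary(x, q=0.5):
    # X takes the value 0 with probability 1 - q and 1 with probability q
    # (q = 0.5 is an assumption made for this illustration)
    if x < 0:
        return 0.0
    return 1.0 - q if x < 1 else 1.0

for eps in [1e-1, 1e-3, 1e-6]:
    gap_gauss = F_gauss(1 + eps) - F_gauss(1 - eps)
    gap_binary = F_binary(1 + eps) - F_binary(1 - eps)
    print(eps, gap_gauss, gap_binary)
```

The Gaussian gap shrinks in proportion to ε, while the gap of the two-valued variable stays at μ({1}) = q.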
If we use the Lebesgue integral, we can express the probability without distin-
guishing between discrete and continuous variables.
Example 17 For E = R, if the probability density function $f_X$ exists, the expecta-
tion of X can be written as $\int_E x\,d\mu = \int_{-\infty}^{\infty} t f_X(t)\,dt$. On the other hand, if X takes
values of $a_1 < \cdots < a_m$, we have $\int_E x\,d\mu = \sum_{j=1}^{m} a_j \mu(\{a_j\})$.

1.5 Bochner’s Theorem

We consider the case in which a kernel is a function of the difference between

x, y ∈ E. According to Bochner’s theorem, which is the main topic of this section
and will be used in later chapters, the kernel should coincide with a characteristic
function in terms of probability and statistics (up to a constant).
When utilizing a univariate function φ : E → R, we often use kernels in the form
k(x, y) = φ(x − y), such as the Gaussian kernel. The kernel k being positive definite
is equivalent to the inequality

$$\sum_{i=1}^{n} \sum_{j=1}^{n} z_i z_j \phi(x_i - x_j) \ge 0 , \quad z = [z_1, \ldots, z_n] \in \mathbb{R}^n \qquad (1.8)$$

for an arbitrary $n \ge 1$, $x_1, \ldots, x_n \in E$.
Let $i = \sqrt{-1}$ be the imaginary unit. We define the characteristic function of a
random variable X by $\varphi : \mathbb{R}^d \to \mathbb{C}$:

$$\varphi(t) := E[\exp(i t^\top X)] = \int_E \exp(i t^\top x)\,d\mu(x) , \quad t \in \mathbb{R}^d ,$$

where E[·] denotes the expectation. If μ is absolutely continuous (i.e., the probability
density function $f_X$ exists), then $\varphi(t) := E[\exp(i t^\top X)] = \int_E \exp(i t^\top x) f_X(x)\,dx$ is
the Fourier transformation of $f_X(x) = \frac{d\mu(x)}{dx}$, and $f_X(x)$ can be recovered from $\varphi(t)$
via the inverse Fourier transformation

$$f_X(x) = \frac{1}{2\pi} \int_{-\infty}^{\infty} \varphi(t) e^{-itx}\,dt .$$

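Example 18 The characteristic function of the Gaussian distribution with a mean
of μ and a variance of σ², $f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\{-\frac{(x - \mu)^2}{2\sigma^2}\}$, is

$$\begin{aligned}
\varphi(t) &= \frac{1}{\sqrt{2\pi\sigma^2}} \int_{-\infty}^{\infty} \exp\{itx\} \exp\Big\{-\frac{(x - \mu)^2}{2\sigma^2}\Big\}\,dx \\
&= \frac{1}{\sqrt{2\pi\sigma^2}} \int_{-\infty}^{\infty} \exp\Big[-\frac{\{x - (\mu + it\sigma^2)\}^2}{2\sigma^2}\Big]\,dx \cdot \exp\Big\{i\mu t - \frac{t^2\sigma^2}{2}\Big\} \\
&= \exp\Big\{i\mu t - \frac{t^2\sigma^2}{2}\Big\} .
\end{aligned}$$

The characteristic function of the Laplace distribution $f(x) = \frac{\alpha}{2} \exp\{-\alpha|x|\}$ with
a parameter α > 0 is

$$\begin{aligned}
\int_{-\infty}^{\infty} \exp\{itx\} \frac{\alpha}{2} \exp\{-\alpha|x|\}\,dx
&= \frac{\alpha}{2} \Big\{ \int_{-\infty}^{0} \exp[(it + \alpha)x]\,dx + \int_{0}^{\infty} \exp[(it - \alpha)x]\,dx \Big\} \\
&= \frac{\alpha}{2} \Big\{ \Big[\frac{e^{(it+\alpha)x}}{it + \alpha}\Big]_{-\infty}^{0} + \Big[\frac{e^{(it-\alpha)x}}{it - \alpha}\Big]_{0}^{\infty} \Big\}
= \frac{\alpha}{2} \Big\{ \frac{1}{it + \alpha} - \frac{1}{it - \alpha} \Big\} = \frac{\alpha^2}{t^2 + \alpha^2} .
\end{aligned}$$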
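Both closed forms in Example 18 can be compared with a direct numerical integration of $\exp(itx) f(x)$. The sketch below uses a plain Riemann sum on a truncated interval; the parameter values, the integration range, and the helper name `char_fn` are arbitrary choices made for this illustration.

```python
import numpy as np

def char_fn(density, t, lo=-60.0, hi=60.0, num=600001):
    """Approximate E[exp(itX)] by a Riemann sum of exp(itx) f(x) over [lo, hi]."""
    x = np.linspace(lo, hi, num)
    dx = x[1] - x[0]
    return np.sum(np.exp(1j * t * x) * density(x)) * dx

mu, sigma2, alpha = 0.5, 2.0, 1.5   # illustrative parameter values

def gauss(x):
    return np.exp(-(x - mu)**2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

def laplace(x):
    return alpha / 2 * np.exp(-alpha * np.abs(x))

t = 0.7
print(char_fn(gauss, t), np.exp(1j * mu * t - t**2 * sigma2 / 2))
print(char_fn(laplace, t), alpha**2 / (t**2 + alpha**2))
```

Note that the Gaussian characteristic function is complex unless μ = 0, whereas the Laplace one is real for every t.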

Proposition 5 (Bochner) Suppose that φ : Rn → R is continuous. Then, condition

(1.8) holds for an arbitrary n ≥ 1 with x = [x1 , . . . , xn ] ∈ Rn and z = [z 1 , . . . , z n ] ∈
Rn if and only if φ coincides with the characteristic function w.r.t. a probability μ up
to a constant, i.e., there exists a finite measure η such that

$$\phi(t) = \int_E \exp(i t^\top x)\,d\eta(x) , \quad t \in \mathbb{R}^n . \qquad (1.9)$$

Proof: See the appendix at the end of this chapter.

Because a kernel evaluates a similarity between two elements in E, we do not
care much about the multiplication of constants. In the following, we say that a
probability μ is the probability of kernel k if μ is the finite measure η obtained when dividing
the kernel k in Proposition 5 by a constant. Note that we only consider a kernel k(·, ·)
whose range is real in this book, although the range of the characteristic function is
generally C.
In the following, we write $\|t\|_2 := \sqrt{\sum_{j=1}^{d} t_j^2}$ for $t = [t_1, \ldots, t_d] \in \mathbb{R}^d$.

Example 19 (Gaussian Kernel) $k(x, y) = \exp\{-\frac{1}{2\sigma^2} \|x - y\|_2^2\}$, $x, y \in \mathbb{R}^d$, coin-
cides with the characteristic function $\exp\{-\frac{\|t\|_2^2}{2\sigma^2}\}$, $t = x - y \in \mathbb{R}^d$, of the Gaussian
distribution with a mean of 0 and a covariance matrix $(\sigma^2)^{-1} I \in \mathbb{R}^{d \times d}$.

Example 20 (Laplacian Kernel) $k(x, y) = \frac{1}{2\pi} \frac{1}{\|x - y\|_2^2 + \beta^2}$, $x, y \in \mathbb{R}^n$, coin-
cides with the characteristic function $\frac{\beta^2}{\|t\|_2^2 + \beta^2}$, $t = x - y \in \mathbb{R}^n$, of the Laplace
distribution with a parameter α = β > 0 up to the constant multiplication $[2\pi\beta^2]^{-1}$.
We can construct the kernel for this distribution if the probability density function
exists. However, if we restrict our search to the kernels whose ranges are real, we
need to choose the parameters so that the characteristic function takes real values.
For example, the Gaussian kernel obtained by setting the mean to zero takes real
values.
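As a numerical sanity check of condition (1.8) for the kernels of Examples 19 and 20, the following sketch builds the matrix with elements φ(x_i − x_j) for scalar inputs (d = 1) and inspects both the quadratic form and the smallest eigenvalue. The parameter values and the helper `gram` are choices made only for this illustration.

```python
import numpy as np

def gram(phi, x):
    # Matrix with (i, j) element phi(x_i - x_j)
    return phi(x[:, None] - x[None, :])

sigma2, beta = 1.0, 2.0            # illustrative parameter values

def phi_gauss(t):
    return np.exp(-t**2 / (2 * sigma2))

def phi_ex20(t):
    return 1.0 / (2 * np.pi) / (t**2 + beta**2)

rng = np.random.default_rng(0)
x = rng.normal(size=30)
z = rng.normal(size=30)
for phi in (phi_gauss, phi_ex20):
    G = gram(phi, x)
    # Quadratic form in (1.8) and the smallest eigenvalue (up to rounding error)
    print(z @ G @ z >= 0, np.linalg.eigvalsh(G).min() >= -1e-10)
```

A finite check of this kind cannot prove positive definiteness; it only fails to find a counterexample, which is consistent with Proposition 5.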

1.6 Kernels for Strings, Trees, and Graphs

As discussed in Chap. 4, the space E of the covariates is projected via the feature
map Φ : E → H. The method of evaluating similarity via the inner product (kernel)
in another linear space (RKHS) has been widely used in machine learning and data
science. If the similarities between the elements of the set E are accurately repre-
sented, then this approach yields improved regression and classification processing
performance. As this is a kernel configuration method, we provide the notions of
convolutional and marginalized kernels and illustrate them by introducing string,
tree, and graph kernels.
First, we define positive definite kernels $k_1, \ldots, k_d$ for the sets $E_1, \ldots, E_d$. Sup-
pose that we define a set E and a map $R : E_1 \times \cdots \times E_d \to E$. Then, we define the
kernel $E \times E \ni (x, y) \mapsto k(x, y) \in \mathbb{R}$ by

$$k(x, y) = \sum_{R^{-1}(x)} \sum_{R^{-1}(y)} \prod_{i=1}^{d} k_i(x_i, y_i) , \qquad (1.10)$$

where $\sum_{R^{-1}(x)}$ is the sum over $(x_1, \ldots, x_d) \in E_1 \times \cdots \times E_d$ such that $R(x_1, \ldots, x_d) = x$.
A kernel in the form of (1.10) is called a convolutional kernel [13]. Since each
$k_i(x_i, y_i)$ is positive definite, k(x, y) is also positive definite (according to the first
two items of Proposition 4).

Example 21 (String Kernel) Let $\Sigma^p$ be the set of strings consisting of p ≥ 0 characters
in a finite set Σ, and let $\Sigma^* := \cup_i \Sigma^i$. For example, if Σ = {A, T, G, C}, we have
$AGGCGTG \in \Sigma^7$. Then, we define the kernel

$$k(x, y) := \sum_{u \in \Sigma^p} c_u(x) c_u(y)$$

for $x, y \in \Sigma^*$, where $c_u(x)$ denotes the number of occurrences of $u \in \Sigma^p$ in $x \in \Sigma^*$.

The following represents sample code for defining this string kernel.

def string_kernel(x, y, p):
    m, n = len(x), len(y)
    S = 0
    # Compare every length-p substring of x with every length-p substring of y
    for i in range(m - p + 1):
        for j in range(n - p + 1):
            if x[i:(i + p)] == y[j:(j + p)]:
                S = S + 1
    return S

Then, we execute the procedure.

C = ["a", "b", "c"]

m = 10
w = np.random.choice(C, m, replace = True)
x = ""
for i in range(m):
x = x + w[i]
n = 12
w = np.random.choice(C, n, replace = True)
y = ""
for i in range(n):
y = y + w[i]

’ababbcaaac’

’ccbcbcaaacaa’

string_kernel (x,y,2)

58

Suppose that d = 3, $E_1 = E_3 = \Sigma^*$, and $E_2 = \Sigma^p$. Then, if we concatenate
$(x_1, x_2, x_3) \in E_1 \times E_2 \times E_3$, we may write $R(x_1, x_2, x_3) = x \in E$. If $x_2 = u$
and $y_2 = u$ appear $c_u(x)$ times in x and $c_u(y)$ times in y, respectively, then by
setting $k_1(x_1, y_1) = k_3(x_3, y_3) = 1$ and $k_2(x_2, y_2) = I(x_2 = y_2 = u)$, we have

$$c_u(x) c_u(y) = \sum_{R(x_1,x_2,x_3)=x} \; \sum_{R(y_1,y_2,y_3)=y} 1 \cdot I(x_2 = y_2 = u) \cdot 1$$

$$k(x, y) = \sum_u c_u(x) c_u(y) = \sum_{R(x_1,x_2,x_3)=x} \; \sum_{R(y_1,y_2,y_3)=y} 1 \cdot I(x_2 = y_2) \cdot 1 .$$

Thus, we observe that the string kernel can be expressed by (1.10), where I(A) takes
values of one and zero depending on whether condition A is satisfied.
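The two expressions of the string kernel can be compared directly. The sketch below computes $\sum_u c_u(x) c_u(y)$ from substring counts and, separately, the convolutional form as a double sum over the positions of $x_2$ and $y_2$. The function names and the second example string are hypothetical choices made for this illustration.

```python
from collections import Counter

def string_kernel_counts(x, y, p):
    """Sum over u of c_u(x) * c_u(y), with c_u counting occurrences of u."""
    cx = Counter(x[i:i + p] for i in range(len(x) - p + 1))
    cy = Counter(y[j:j + p] for j in range(len(y) - p + 1))
    return sum(cx[u] * cy[u] for u in cx)

def string_kernel_conv(x, y, p):
    """Convolutional form: sum over all splits x = x1 + x2 + x3, y = y1 + y2 + y3."""
    total = 0
    for i in range(len(x) - p + 1):          # x2 = x[i:i+p]
        for j in range(len(y) - p + 1):      # y2 = y[j:j+p]
            # k1 = k3 = 1 and k2 = I(x2 = y2)
            total += 1 * int(x[i:i + p] == y[j:j + p]) * 1
    return total

x, y = "AGGCGTG", "GGCGG"
print(string_kernel_counts(x, y, 2), string_kernel_conv(x, y, 2))
```

The count-based version avoids the double loop over positions, which may matter for long strings.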

Example 22 (Tree Kernel) Suppose that we assign a label to each vertex of trees
x, y. We wish to evaluate the similarity between x, y based on how many subtrees
are shared. We denote by $c_t(x), c_t(y)$ the numbers of occurrences of subtree t in x, y,
respectively. Then, the kernel

$$k(x, y) := \sum_t c_t(x) c_t(y) \qquad (1.11)$$

is positive definite. In fact, for $x_1, \ldots, x_n \in E$ and arbitrary $z_1, \ldots, z_n \in \mathbb{R}$, we
have

$$\sum_{i=1}^{n} \sum_{j=1}^{n} z_i z_j k(x_i, x_j) = \sum_t \Big\{ \sum_{i=1}^{n} z_i c_t(x_i) \Big\}^2 \ge 0 .$$

Let $V_x, V_y$ be the sets of vertices in trees x, y, respectively; we write I(u, t) = 1 and
I(u, t) = 0 depending on whether t has u as its root or not. Since the counts in (1.11) can be
written as $c_t(x) = \sum_{u \in V_x} I(u, t)$, $c_t(y) = \sum_{v \in V_y} I(v, t)$, we have

$$k(x, y) = \sum_{u \in V_x} \sum_{v \in V_y} \sum_t I(u, t) I(v, t) = \sum_{u \in V_x} \sum_{v \in V_y} c(u, v) ,$$

where $c(u, v) = \sum_t I(u, t) I(v, t)$ is the number of common subtrees in x and y
such that the vertices $u \in V_x$ and $v \in V_y$ are their roots. We assume that a label l(v)
is assigned to each v ∈ V and determine whether they coincide.
1. For the descendants $u_1, \ldots, u_m$ and $v_1, \ldots, v_n$ of u and v, if any of the following
hold, then we define c(u, v) := 0:
(a) $l(u) \ne l(v)$,
(b) $m \ne n$,
(c) there exists i = 1, . . . , m such that $l(u_i) \ne l(v_i)$,
2. otherwise, we define

$$c(u, v) := \prod_{i=1}^{m} \{1 + c(u_i, v_i)\} .$$

For example, suppose that we assign one of the labels A, T, G, C to each vertex
in Fig. 1.4. We may write this in a Python function as follows, where we assume that
we assign no identical labels to the vertices at the same level of the tree. Note that
the function calls itself (it is a recursive function). For example, the function requires
the value C(4, 2) when it obtains C(1, 1).

def C(i, j):
    S, T = s[i], t[j]
    # Return zero when the labels of vertices i and j of the trees s and t do not coincide
    if S[0] != T[0]:
        return 0
    # Return zero when either vertex i or j of the trees s and t does not have a descendant
    if S[1] is None:
        return 0
    if T[1] is None:
        return 0
    if len(S[1]) != len(T[1]):
        return 0
    U = []
    for x in S[1]:
        U.append(s[x][0])
    U1 = sorted(U)
    V = []
    for y in T[1]:
        V.append(t[y][0])
    V1 = sorted(V)
    m = len(U)
    # Return zero when the labels of the descendants do not coincide
    for h in range(m):
        if U1[h] != V1[h]:
            return 0
    U2 = np.array(S[1])[np.argsort(U)]
    V2 = np.array(T[1])[np.argsort(V)]
    W = 1
    for h in range(m):
        W = W * (1 + C(U2[h], V2[h]))
    return W

def k(s, t):
    m, n = len(s), len(t)
    kernel = 0
    for i in range(m):
        for j in range(n):
            if C(i, j) > 0:
                kernel = kernel + C(i, j)
    return kernel

s = [[] for _ in range(6)]

s[0] = ["G", [1, 3]]; s[1] = ["T", [2]]; s[2] = ["C", None]
s[3] = ["A", [4, 5]]; s[4] = ["C", None]; s[5] = ["T", None]

t = [[] for _ in range(9)]

t[0] = ["G", [1, 4]]; t[1] = ["A", [2, 3]]; t[2] = ["C", None]
t[3] = ["T", None]; t[4] = ["T", [5, 6]]; t[5] = ["C", None]
t[6] = ["A", [7, 8]]; t[7] = ["C", None]; t[8] = ["T", None]

Fig. 1.4 A tree kernel evaluates the similarity between two trees whose vertices are assigned
the labels A, G, C, T

for i in range(6):
    for j in range(9):
        if C(i, j) > 0:
            print(i, j, C(i, j))

0 0 2
3 1 1
3 6 1

k(s , t )

Thus, the sum 4 will be the kernel value.

Let X and Y be discrete random variables that take values in $E_X$ and $E_Y$, respec-
tively, and let P(y|x) be the conditional probability of $Y = y \in E_Y$ given $X = x \in E_X$.
Suppose that we are given a positive definite kernel $k_{XY} : E_{XY} \times E_{XY} \to \mathbb{R}$ for
$E_{XY} := E_X \times E_Y$. We define the marginalized kernel by

$$k(x, x') := \sum_{y \in E_Y} \sum_{y' \in E_Y} k_{XY}((x, y), (x', y')) P(y|x) P(y'|x') \qquad (1.12)$$

for $x, x' \in E_X$ (Tsuda et al. [32]). We claim that the marginalized kernel is positive
definite. In fact, $k_{XY}$ being positive definite implies the existence of the feature map
$\Phi : E_{XY} \ni (x, y) \mapsto \Phi((x, y))$ such that

$$k_{XY}((x, y), (x', y')) = \langle \Phi((x, y)), \Phi((x', y')) \rangle .$$

Substituting this into (1.12), we obtain $k(x, x') = \langle \sum_{y} P(y|x) \Phi((x, y)), \sum_{y'} P(y'|x') \Phi((x', y')) \rangle$,
which is an inner product of features of x and x', so that k is positive definite.
We may define (1.12) for the conditional density function f of Y given X as follows:

$$k(x, x') := \int_{y \in E_Y} \int_{y' \in E_Y} k_{XY}((x, y), (x', y')) f(y|x) f(y'|x')\,dy\,dy'$$

for $x, x' \in E_X$.
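A small discrete example may help to fix the definition (1.12). In the sketch below, the sets of values, the conditional probability table P, and the joint kernel $k_{XY}$ (a Gaussian kernel in x multiplied by the indicator of y = y') are all hypothetical choices made for this illustration.

```python
import numpy as np

EX = [0, 1, 2]                       # hypothetical values of X
EY = [0, 1]                          # hypothetical values of Y
# P[x][y] = P(y | x); each row sums to one (illustrative numbers)
P = np.array([[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]])

def k_xy(a, b):
    # A positive definite kernel on pairs: Gaussian in x times I(y = y')
    (x, y), (x2, y2) = a, b
    return np.exp(-(x - x2)**2 / 2) * float(y == y2)

def k_marg(x, x2):
    # Marginalized kernel (1.12): weight k_xy by P(y|x) P(y'|x') and sum
    return sum(k_xy((x, y), (x2, y2)) * P[x][y] * P[x2][y2]
               for y in EY for y2 in EY)

G = np.array([[k_marg(x, x2) for x2 in EX] for x in EX])
print(G)
print(np.linalg.eigvalsh(G).min() >= -1e-12)
```

The resulting 3 × 3 matrix is symmetric, and its eigenvalues are nonnegative up to rounding error, as the argument above predicts.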

Example 23 (Graph Kernel (Kashima et al. [19])) We construct a kernel that
expresses the similarity between (directed) graphs $G_1, G_2$ that may contain a loop
according to the set of paths connecting two vertices.
Let V, E be the sets of vertices and (directed) edges, respectively. We express each
path of length m by a sequence consisting of vertices and edges: $(v_0, e_1, \ldots, e_m, v_m)$,
$v_0, v_1, \ldots, v_m \in V$, and $e_1, \ldots, e_m \in E$. We assume that a label is assigned to each of
the vertices and edges of the two graphs, and we define the probability of the sequence
$\pi = (v_0, e_1, \ldots, e_m, v_m)$ by the product of the associated conditional probabilities
$p(\pi) := p(v_0) p(v_1|v_0) \cdots p(v_m|v_{m-1})$. To this end, we consider a random walk in
which we first choose $v_0 \in V$ with a probability of $p(v_0) = 1/|V|$ (|V|: the cardi-
nality of V) and repeatedly choose either to stop at that point with a probability
of p or to move to a neighboring vertex via one of the connected directed edges with
an equiprobability times 1 − p, where the stopping probability 0 < p < 1 should
be determined a priori. For example, if the random walk arrives at a
vertex v that connects to |V(v)| vertices, then the probability of moving to one of
the neighboring vertices is (1 − p)/|V(v)|. For the path 1 → 4 → 3 → 5 → 3
in Fig. 1.5, the labels are A, e, A, d, D, a, B, c, D. If p = 1/3, then the probability
of the directed path can be obtained via the following code.

def k(s, p):
    return prob(s, p) / len(node)

def prob(s, p):
    if len(node[s[0]]) == 0:
        return 0
    if len(s) == 1:
        return p
    m = len(s)
    S = (1 - p) / len(node[s[0]]) * prob(s[1:m], p)
    return S

Fig. 1.5 Evaluating similarity via a graph kernel

We demonstrate the execution of the code below:

node = [[] for _ in range(5)]

node[0] = [2, 4]; node[1] = [4]; node[2] = [1, 5]
node[3] = [1, 5]; node[4] = [3]
k([0, 3, 2, 4, 2], 1 / 3)

0.0016460905349794243

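Because five vertices exist, we multiply by 1/5, choose one of the next two
transitions, and so on.

$$\frac{1}{5} \cdot \frac{2}{3} \cdot \frac{1}{2} \cdot \Big(\frac{2}{3} \cdot 1\Big) \cdot \frac{2}{3} \cdot \frac{1}{2} \cdot \Big(\frac{2}{3} \cdot 1\Big) \cdot \frac{1}{3} = \frac{2^2}{5 \times 3^5} .$$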

Because these probabilities are different in the directed graphs $G_1, G_2$, we denote
them by $p(\pi|G_1)$ and $p(\pi|G_2)$. We express the label sequence of the path π (of length
2m + 1) by L(π) and define the graph kernel by

$$k(G_1, G_2) := \sum_{\pi_1} \sum_{\pi_2} p(\pi_1|G_1) p(\pi_2|G_2) I[L(\pi_1) = L(\pi_2)] .$$

We find that this kernel is a marginalized kernel if $k_{XY}((G_1, \pi_1), (G_2, \pi_2)) = I[L(\pi_1) = L(\pi_2)]$.

Appendix

Because Fubini's theorem, Lebesgue's dominated convergence theorem, and Levy's
convergence theorem are general theorems whose proofs appear in many books, we
omit their statements and proofs. The proof of Proposition 5 was provided
by Ito [15].

Proof of Proposition 3
Let D be a diagonal matrix whose components are the eigenvalues $\lambda_i \ge 0$ of the non-
negative definite matrix A, and let U be an orthogonal matrix whose column vectors $u_i$
are unit eigenvectors that are orthogonal to each other. Then, we can write
$A = U D U^\top = \sum_{i=1}^{n} \lambda_i u_i u_i^\top$. Similarly, if $\mu_i, v_i$, i = 1, . . . , n are the eigenvalues
and eigenvectors of matrix B, respectively, then we can write $B = \sum_{i=1}^{n} \mu_i v_i v_i^\top$. At
this moment, we have

$$(u_i u_i^\top) \circ (v_j v_j^\top) = (u_{i,k} u_{i,l} \cdot v_{j,k} v_{j,l})_{k,l} = (u_{i,k} v_{j,k} \cdot u_{i,l} v_{j,l})_{k,l} = (u_i \circ v_j)(u_i \circ v_j)^\top .$$

Note that this matrix is nonnegative definite. In fact, if we write $u_i \circ v_j = [y_1, \ldots, y_n]^\top \in \mathbb{R}^n$,
then component (h, l) of $(u_i \circ v_j)(u_i \circ v_j)^\top$ is $y_h y_l$, which means that
$\sum_{h=1}^{n} \sum_{l=1}^{n} z_h z_l y_h y_l = (\sum_{h=1}^{n} z_h y_h)^2 \ge 0$ for any $z_1, \ldots, z_n$. Since matrices A and B are
nonnegative definite, we have that $\lambda_i, \mu_j \ge 0$ for each i, j = 1, . . . , n, which means
that

$$A \circ B = \sum_{i=1}^{n} \sum_{j=1}^{n} \lambda_i \mu_j (u_i u_i^\top) \circ (v_j v_j^\top) = \sum_{i=1}^{n} \sum_{j=1}^{n} \lambda_i \mu_j (u_i \circ v_j)(u_i \circ v_j)^\top$$

is nonnegative definite.
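The two steps of this proof can be checked numerically. The sketch below verifies the rank-one identity for random vectors and then inspects the smallest eigenvalue of the Hadamard product of two randomly generated nonnegative definite matrices; the matrix size and the random seed are arbitrary choices for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
Ma, Mb = rng.normal(size=(n, n)), rng.normal(size=(n, n))
A, B = Ma.T @ Ma, Mb.T @ Mb          # nonnegative definite by construction

u, v = rng.normal(size=n), rng.normal(size=n)
lhs = np.outer(u, u) * np.outer(v, v)        # (u u^T) o (v v^T), elementwise product
rhs = np.outer(u * v, u * v)                 # (u o v)(u o v)^T
print(np.allclose(lhs, rhs))

min_eig = np.linalg.eigvalsh(A * B).min()    # A * B is the Hadamard product
print(min_eig >= -1e-10)
```

In NumPy, `*` between two arrays is the elementwise (Hadamard) product, whereas `@` is the ordinary matrix product.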

Proof of Proposition 5
We only show the case in which φ(0) = η(E) = 1 because the extension is straight-
forward. Suppose that (1.9) holds. Then, we have

$$\sum_{j=1}^{n} \sum_{k=1}^{n} z_j z_k \phi(x_j - x_k) = \int_E \sum_{j=1}^{n} z_j e^{i x_j t} \sum_{k=1}^{n} z_k e^{-i x_k t}\,d\eta(t) = \int_E \Big| \sum_{j=1}^{n} z_j e^{i x_j t} \Big|^2 d\eta(t) \ge 0 ,$$

and (1.8) follows. Conversely, suppose that (1.8) holds. Since the matrix consisting
of $\phi(x_i - x_j)$ for the (i, j)th element is nonnegative definite and symmetric, we
have that φ(x) = φ(−x), x ∈ R. If we substitute n = 2, $x_1 = u$, and $x_2 = 0$, then
we obtain

$$[z_1, z_2] \begin{bmatrix} 1 & \phi(u) \\ \phi(u) & 1 \end{bmatrix} \begin{bmatrix} z_1 \\ z_2 \end{bmatrix} \ge 0$$

and $\phi(u)^2 \le 1$ because the determinant is nonnegative. Since φ is bounded and
continuous, it is uniformly continuous. On the other hand, $e^{-t^2/n} e^{-ixt}$ is uniformly
continuous as well. In the following, we show that

$$f_n(x) := \frac{1}{2\pi} \int_{-\infty}^{\infty} \phi(t) e^{-t^2/n} e^{-ixt}\,dt$$
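
is a probability density function, and the characteristic function $\phi_n$ approaches φ
as n → ∞. If we verify the claim, by Levy's convergence theorem [15], φ is the
characteristic function. We show the d = 1 case first:

$$f_n(x) = \frac{1}{2\pi} \int_{-\infty}^{\infty} \phi(t) e^{-t^2/n} e^{-ixt}\,dt .$$

For a > 0, we have

$$\int_{-a}^{a} f_n(x)\,dx = \frac{1}{2\pi} \int_{-a}^{a} \int_{-\infty}^{\infty} \phi(t) e^{-t^2/n} e^{-ixt}\,dt\,dx = \frac{1}{2\pi} \int_{-\infty}^{\infty} \phi(t) e^{-t^2/n} \frac{2 \sin at}{t}\,dt ,$$

where we use Fubini's theorem for the last equality. Then, for b > 0, from
$\int_0^b \sin(at)\,da = \frac{1 - \cos bt}{t}$ (nonnegative for t > 0), $\int_{-\infty}^{\infty} \frac{1 - \cos t}{t^2}\,dt = \pi$, and φ(0) = 1, as b → ∞, we have

$$\begin{aligned}
\frac{1}{b} \int_0^b \Big\{ \int_{-a}^{a} f_n(x)\,dx \Big\}\,da
&= \frac{1}{b} \int_0^b \frac{1}{2\pi} \int_{-\infty}^{\infty} \phi(t) e^{-t^2/n} \frac{2 \sin at}{t}\,dt\,da \\
&= \frac{1}{2\pi} \int_{-\infty}^{\infty} \phi(t) e^{-t^2/n} \frac{2(1 - \cos tb)}{t^2 b}\,dt
= \frac{1}{2\pi} \int_{-\infty}^{\infty} \phi\Big(\frac{u}{b}\Big) e^{-(u/b)^2/n} \frac{2(1 - \cos u)}{u^2}\,du \to 1 ,
\end{aligned}$$

where we use the dominated convergence theorem for the last limit. In general, for
a g : R → R that is monotonically increasing and bounded from above, we have

$$\lim_{y \to \infty} \frac{1}{y} \int_0^y g(x)\,dx = \lim_{x \to \infty} g(x) .$$

Thus, we have $\int_{-\infty}^{\infty} f_n(x)\,dx = 1$.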
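Finally, we show that $\phi_n \to \phi$ (n → ∞):

$$\begin{aligned}
\phi_n(z) &:= \lim_{a \to \infty} \int_{-a}^{a} e^{izx} \Big\{ \frac{1}{2\pi} \int_{-\infty}^{\infty} \phi(t) e^{-t^2/n} e^{-itx}\,dt \Big\}\,dx \\
&= \lim_{a \to \infty} \frac{1}{2\pi} \int_{-\infty}^{\infty} \phi(t) e^{-t^2/n} \frac{2 \sin a(t - z)}{t - z}\,dt \\
&= \lim_{b \to \infty} \frac{1}{b} \int_0^b da\,\frac{1}{2\pi} \int_{-\infty}^{\infty} \phi(t) e^{-t^2/n} \frac{2 \sin a(t - z)}{t - z}\,dt \\
&= \lim_{b \to \infty} \frac{1}{2\pi} \int_{-\infty}^{\infty} \phi(t) e^{-t^2/n} \frac{2(1 - \cos b(t - z))}{b(t - z)^2}\,dt \\
&= \lim_{b \to \infty} \frac{1}{2\pi} \int_{-\infty}^{\infty} \phi\Big(z + \frac{s}{b}\Big) e^{-(z + s/b)^2/n} \frac{2(1 - \cos s)}{s^2}\,ds = \phi(z) e^{-z^2/n} \to \phi(z) .
\end{aligned}$$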

For a general d ≥ 1, if we use $\|t\|_2^2 = t_1^2 + \cdots + t_d^2$,

$$\int_{-a_1}^{a_1} \cdots \int_{-a_d}^{a_d} e^{-i(x_1 t_1 + \cdots + x_d t_d)}\,dx_1 \cdots dx_d = \frac{2 \sin a_1 t_1}{t_1} \cdots \frac{2 \sin a_d t_d}{t_d} ,$$

and

$$\frac{1}{b_i} \int_0^{b_i} \frac{2 \sin a_i t_i}{t_i}\,da_i = \frac{2(1 - \cos t_i b_i)}{t_i^2 b_i}$$

(i = 1, . . . , d), then the same claim can be obtained.

Exercises 1∼15

1. Show that the following three conditions are equivalent for a symmetric matrix
$A \in \mathbb{R}^{n \times n}$.
(a) There exists a square matrix B such that $A = B^\top B$.
(b) $x^\top A x \ge 0$ for an arbitrary $x \in \mathbb{R}^n$.
(c) All the eigenvalues of A are nonnegative.
In addition, using Python, generate a square matrix $B \in \mathbb{R}^{n \times n}$ with real elements
by generating random numbers to obtain $A = B^\top B$. Then, randomly generate
five more $x \in \mathbb{R}^n$ (n = 5) to examine whether $x^\top A x$ is nonnegative for each
value.
2. Consider the Epanechnikov kernel defined by k : E × E → R

$$k(x, y) = D\Big(\frac{|x - y|}{\lambda}\Big)$$

$$D(t) = \begin{cases} \frac{3}{4}(1 - t^2), & |t| \le 1 \\ 0, & \text{otherwise} \end{cases}$$

for λ > 0. Suppose that we write a kernel for λ > 0 and (x, y) ∈ E × E in
Python as shown below:

def k(x, y, lam):
    return D(np.abs((x - y) / lam))

Specify the function D using Python. Moreover, define the function f that
makes a prediction at z ∈ E based on the Nadaraya-Watson estimator by uti-
lizing the function k such that z, λ are the inputs of f and k, respectively,
and $(x_1, y_1), \ldots, (x_N, y_N)$ are global. Then, execute the following to examine
whether the functions D, f work properly.

n = 250
x = 2 * np.random.normal(size = n)
y = np.sin(2 * np.pi * x) + np.random.normal(size = n) / 4

plt.figure(num=1, figsize=(15, 8), dpi=80)
plt.xlim(-3, 3); plt.ylim(-2, 3)
plt.xticks(fontsize = 14); plt.yticks(fontsize = 14)
plt.scatter(x, y, facecolors='none', edgecolors = "k", marker = "o")

plt.legend(loc = "upper left", frameon = True, prop={'size':14})
plt.title("Nadaraya-Watson Estimator", fontsize = 20)

Replace the Epanechnikov kernel with the Gaussian kernel, the exponential type,
and the polynomial kernel and execute them.
3. Show that the determinant of A ∈ R3×3 coincides with the product of the three
eigenvalues. In addition, show that if the determinant is negative, at least one of
the eigenvalues is negative.
4. Show that the Hadamard product of nonnegative definite matrices of the same
size is nonnegative definite. Show also that the kernel obtained by multiplying
positive definite kernels is positive definite.
5. Show that a square matrix whose elements consist of the same nonnegative value
is nonnegative definite. Show further that a kernel that outputs a nonnegative
constant is positive definite.
6. Find the feature map $\Phi_{3,2}(x_1, x_2)$ of the polynomial kernel $k_{3,2}(x, y) = (x^\top y + 1)^3$
for $x, y \in \mathbb{R}^2$ to derive

$$k_{3,2}(x, y) = \Phi_{3,2}(x_1, x_2)^\top \Phi_{3,2}(y_1, y_2) .$$

7. Use Proposition 4 to show that the Gaussian and polynomial kernels and expo-
nential types are positive definite. Show also that the kernel obtained by nor-
malizing a positive definite kernel is positive definite. What kernel do we obtain
when we normalize the exponential type and the Gaussian kernel?
8. The following procedure chooses the optimal parameter σ 2 of the Gaussian
kernel via 10-fold CV when applying the Nadaraya-Watson estimator to the
samples. Change the 10-fold CV procedure to the N -fold (leave-one-out) CV
process to find the optimal σ 2 , and draw the curve by executing the procedure
below:
def K(x, y, sigma2):
    return np.exp(-np.linalg.norm(x - y)**2 / 2 / sigma2)

n = 100
x = 2 * np.random.normal(size = n)
y = np.sin(2 * np.pi * x) + np.random.normal(size = n) / 4

m = int(n / 10)
sigma2_seq = np.arange(0.001, 0.01, 0.001)
SS_min = np.inf
for sigma2 in sigma2_seq:
    SS = 0
    for k in range(10):
        test = range(k*m, (k+1)*m)
        train = [i for i in range(n) if i not in test]
        for j in test:
            u, v = 0, 0
            for i in train:
                kk = K(x[i], x[j], sigma2)
                u = u + kk * y[i]
                v = v + kk
            if not (v == 0):
                z = u / v
                SS = SS + (y[j] - z)**2
    if SS < SS_min:
        SS_min = SS
        sigma2_best = sigma2
print("Best sigma2 = ", sigma2_best)

9. For a probability space (E, F, μ) with E = {1, 2, 3, 4, 5, 6} and a map X : E →
R, show that if

$$X(e) = \begin{cases} 1, & e = 1, 3, 5 \\ 0, & e = 2, 4, 6 \end{cases}$$

and F = {{1, 2, 3}, {4, 5, 6}, {}, E}, then X is not a random variable (not mea-
surable).
10. Derive the characteristic function of the Gaussian distribution
$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\{-\frac{(x - \mu)^2}{2\sigma^2}\}$ with a mean of μ and a variance of σ² and find the
condition for the characteristic function to be a real function. Do the same for
the Laplace distribution $f(x) = \frac{\alpha}{2} \exp\{-\alpha|x|\}$ with a parameter α > 0.
11. Obtain the kernel value between the left tree and itself in Fig. 1.4. Construct and
execute a program to find this value.
12. Randomly generate binary sequences x, y of length 10 to obtain the string kernel
value k(x, y).

def string_kernel(x, y):
    m, n = len(x), len(y)
    S = 0
    for i in range(m):
        for j in range(i, m):
            for k in range(n):
                # Compare x[i..j] with the substring of y of the same length starting at k
                if x[i:(j + 1)] == y[k:(k + j - i + 1)]:
                    S = S + 1
    return S

13. Show that the string, tree, and marginalized kernels are positive definite. Show
also that the string and graph kernels are convolutional and marginalized kernels,
respectively.
14. How can we compute the path probabilities below when we consider a random
walk in the directed graph of Fig. 1.5 if the stopping probability is p = 1/3?

(a) 3 → 1 → 4 → 3 → 5,
(b) 1 → 2 → 4 → 1 → 2,
(c) 3 → 5 → 3 → 5.
15. What inconvenience occurs when we execute the procedure below to compute
a graph kernel? Illustrate this inconvenience with an example.

def k(s, p):
    return prob(s, p) / len(node)

def prob(s, p):
    if len(node[s[0]]) == 0:
        return 0
    if len(s) == 1:
        return p
    m = len(s)
    S = (1 - p) / len(node[s[0]]) * prob(s[1:m], p)
    return S
Chapter 2
Hilbert Spaces

When considering machine learning and data science issues, in many cases, the
calculus and linear algebra courses taken during the first year of university provide
sufficient background information. However, for kernels, we require knowledge of
metric spaces and their completeness, as well as linear algebra in infinite dimensions.
If your major is not mathematics, you might have had few opportunities to study
these topics, and it may be challenging to learn them in a short period. In this chapter,
we learn about Hilbert spaces, the projection theorem, linear operators, and (some of)
the compact operators necessary for understanding kernels. Unlike finite-dimensional
linear spaces, general Hilbert spaces require scrutiny of their completeness.

2.1 Metric Spaces and Their Completeness

Let M be a set. We say that a bivariate function d : M × M → R is a distance if

1. d(x, y) ≥ 0;
2. d(x, y) = 0 ⇐⇒ x = y;
3. d(x, y) = d(y, x); and
4. d(x, z) ≤ d(x, y) + d(y, z)
for x, y, z ∈ M, and the pair (M, d) is a metric space1 .
Let E be a subset of the metric space M. We say that E is an open set if a positive
constant ε exists such that U(x, ε) := {y ∈ M|d(x, y) < ε} ⊆ E for each x ∈ E.
Moreover, we say that y ∈ M is a convergence point of E if U(y, ε) ∩ E ≠ {} for an
arbitrary ε > 0, and E is a closed set if E contains all the convergence points of E.

1 We call M a metric space rather than (M, d) when we do not stress d or when d is apparent.
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 29
J. Suzuki, Kernel Methods for Machine Learning with Math and Python,
https://doi.org/10.1007/978-981-19-0401-1_2

Example 24 The set M = [0, 1] is a closed set because the neighborhood U(y, ε) of
y ∉ M has no intersection with M if we make the radius ε > 0 smaller, which means
that M contains all the convergence points of M. On the other hand, M = (0, 1) is
an open set because M contains the neighborhood U(y, ε) of y ∈ M if we make the
radius ε > 0 smaller. If we add {0}, {1} to the intervals (0, 1), (0, 1], [0, 1), we obtain
the closed set [0, 1].

We say that the minimum closed set in M that contains E is the closure of E, and
we write this as Ē. If E is not a closed set, then E does not contain all of its convergence
points. Thus, the closure Ē is the set of convergence points of E. Moreover, we say that
E is dense in M if Ē = M, which is equivalent to each of the following conditions:
“for an arbitrary ε > 0 and x ∈ M, y ∈ E exists such that d(x, y) < ε”, and “each point
in M is a convergence point of E”. Furthermore, we say that M is separable if it contains
a dense subset that consists of countably many points.

Example 25 For the distance d(x, y) := |x − y| with x, y ∈ R and the metric space
(R, d), each irrational number a ∈ R\Q is a convergence point of Q. In fact, for an
arbitrary ε > 0, the interval (a − ε, a + ε) contains a rational number b ∈ Q. Thus, Q
does not contain the convergence point a ∉ Q and is not a closed set in R. Moreover,
the closure of Q is R (Q is dense in R). Furthermore, since Q is a countable set, we
find that R is separable.
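
The density of Q in R claimed in Example 25 can be illustrated numerically. The following is a minimal sketch, not part of the original text: the function name rational_within is hypothetical, and a floating-point number stands in for the irrational a.

```python
from fractions import Fraction
import math

def rational_within(a, eps):
    """Return a rational b with |a - b| < eps (a is a float standing in for a real)."""
    # Rounding a down to a grid of width 1/q with 1/q < eps moves it by less than eps.
    q = math.ceil(1 / eps) + 1
    return Fraction(math.floor(a * q), q)

a = math.sqrt(2)
for eps in [1e-1, 1e-3, 1e-6]:
    b = rational_within(a, eps)
    print(eps, b, abs(a - float(b)))
```

For every ε tried, the printed distance is below ε, which is the statement that the interval (a − ε, a + ε) contains a rational number.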

Let (M, d) be a metric space. We say that a sequence {xn } in2 M converges to
x ∈ M if d(xn , x) → 0 as n → ∞ for x ∈ M, and we write this as xn → x. On
the other hand, we say that a sequence {xn } in M is Cauchy if d(xm , xn ) → 0 as
m, n → ∞, i.e., if supm,n≥N d(xm , xn ) → 0 as N → ∞.
If {xn } converges to some x ∈ M, then it is a Cauchy sequence. However, the
converse does not hold. We say that a metric space (M, d) is complete if each
Cauchy sequence {xn } in M converges to an element in M. We say that (M, d) is
bounded if there exists a C > 0 such that d(x, y) < C for arbitrary x, y ∈ M. For a
subset of R that is bounded from above (below), we call its least upper bound
(greatest lower bound) the upper (lower) limit.

Example 26 An arbitrary Cauchy sequence is bounded. In fact, for any ε > 0, we
can choose N := N(ε) such that m, n ≥ N ⟹ d(x_m, x_n) < ε, and we have

$$\min\{x_1, \ldots, x_{N-1}, x_N - \epsilon\} \le x_n \le \max\{x_1, \ldots, x_{N-1}, x_N + \epsilon\} .$$

Example 27 (Q is Not Complete) The sequence {a_n} defined by a_1 = 1 and

$$a_{n+1} = \frac{1}{2} a_n + \frac{1}{a_n} \quad (n \ge 1)$$

is in Q. However, we can prove that {a_n} is a Cauchy sequence in Q but
a_n → √2 ∉ Q (Exercise 17).
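
The behavior in Example 27 can be observed with exact rational arithmetic. The sketch below is only an illustration (the function name sqrt2_sequence is hypothetical): it keeps every term as a fraction and prints how close its square is to 2.

```python
from fractions import Fraction

def sqrt2_sequence(n_terms):
    """First n_terms of a_1 = 1, a_{n+1} = a_n / 2 + 1 / a_n, kept as exact rationals."""
    a = Fraction(1)
    terms = [a]
    for _ in range(n_terms - 1):
        a = a / 2 + 1 / a
        terms.append(a)
    return terms

terms = sqrt2_sequence(6)
for n, a in enumerate(terms, start=1):
    # a * a - 2 is an exact nonzero rational: no term of the sequence squares to 2
    print(n, a, float(a * a - 2))
```

Each term is a rational number, the gaps between successive terms shrink rapidly, and yet the limit √2 is not available inside Q.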

Proposition 6 R is complete.

2 {xn } with xn ∈ M for each n.


Proof: If {x_n} is a Cauchy sequence in R, then {x_n} is bounded (Example 26). If
we write the upper and lower limits of {x_n}_{n=s}^∞ as l_s and m_s, respectively, the
monotone sequences {m_s}, {l_s} in R share the same limit. In fact, because {x_n} is a
Cauchy sequence, we can make l_s − m_s = sup{|x_p − x_q| : p, q ≥ s} arbitrarily small
by choosing s large. Thus, R is complete.
If the number of dimensions is finite, we may check completeness for each dimension,
and we see that R^p is complete for any p ≥ 1.
Suppose that we arbitrarily set a neighborhood U(P) for each P ∈ M beforehand.
We say that a set M is compact if, for any such choice, there exist a finite m and
P_1, ..., P_m ∈ M such that M ⊆ ∪_{i=1}^m U(P_i).
Example 28 Let M = (0, 1), and suppose that we define the neighborhood U(x) :=
(x/2, 3x/2) for each x ∈ M beforehand. Then, for any n and x_1, ..., x_n ∈ M, we have

$$(0, 1) \not\subseteq \bigcup_{i=1}^{n} \left( \frac{1}{2} x_i , \frac{3}{2} x_i \right) ,$$

which means that M is not compact.
Proposition 7 (Heine-Borel) For R^p, any bounded closed set M is compact.
Proof: Suppose that we have set a neighborhood U(P) for each P ∈ M and that
M ⊆ ∪_{i=1}^m U(P_i) cannot be realized by any m and P_1, ..., P_m. If we divide a
closed rectangle that contains M ⊆ R^p into two components for each dimension,
then at least one of the 2^p rectangles cannot be covered by a finite number of
neighborhoods. If we repeat this procedure, then the volume of the rectangle that a
finite number of neighborhoods cannot cover becomes arbitrarily small, and its center
converges to some P* ∈ M; however, a sufficiently small rectangle is covered by the
single neighborhood U(P*), which is a contradiction.
Let (M_1, d_1), (M_2, d_2) be metric spaces. We say that the map f : M_1 → M_2 is
continuous at x ∈ M_1 if for any ε > 0, there exists δ(x, ε) such that for y ∈ M_1,

$$d_1(x, y) < \delta(x, \epsilon) \Longrightarrow d_2(f(x), f(y)) < \epsilon . \tag{2.1}$$

In particular, if there exists δ(x, ε) that does not depend on x ∈ M_1 in (2.1), we say
that f is uniformly continuous.
Example 29 The function f(x) = 1/x defined on the interval (0, 1] is continuous
but is not uniformly continuous. In fact, if we make x approach y after fixing y, we
can make d_2(f(x), f(y)) = |1/x − 1/y| as small as desired, which means that f is
continuous in (0, 1]. However, when we make x approach y to make d_2(f(x), f(y))
smaller than a constant, we observe that for each ε > 0, the smaller y is, the smaller
d_1(x, y) = |x − y| should be. Thus, no δ(ε) for which
d_1(x, y) < δ(ε) ⟹ d_2(f(x), f(y)) < ε exists if δ(ε) does not depend on x, y ∈ M
(Fig. 2.1).
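
To make the dependence of δ on the location concrete, the following sketch computes, for f(x) = 1/x, the largest admissible δ at a given y. The closed form in the comment is our own derivation for this illustration and does not appear in the text; the function name largest_delta is hypothetical.

```python
def largest_delta(y, eps):
    """Largest delta such that |x - y| < delta implies |1/x - 1/y| < eps for x > 0."""
    # The binding case is x < y: 1/x - 1/y < eps  <=>  x > y / (1 + eps * y),
    # so the admissible gap is y - y / (1 + eps * y) = eps * y**2 / (1 + eps * y).
    return eps * y**2 / (1 + eps * y)

eps = 0.5
for y in [1.0, 0.5, 0.1, 0.01]:
    print(y, largest_delta(y, eps))
```

The admissible δ shrinks roughly like ε y² as y approaches 0, so no single δ(ε) works over the whole interval (0, 1].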
Fig. 2.1 The function f(x) = 1/x is not uniformly continuous over (0, 1]. To make the
|f(x) − f(y)| value smaller than a constant, we need to make the |x − y| value smaller
when x, y are close to 0 (red lines) than that when x, y are far away from 0 (blue lines).
Thus, δ > 0 depends on the locations of x, y. (Plot of f(x) = 1/x against x.)

Proposition 8 A continuous function over a bounded closed set is uniformly continuous.

Proof: Let f : M → R be a continuous function defined over a bounded closed set
M. Because the function f is continuous, for an arbitrary ε > 0, there exists a Δ(z) > 0
for each z ∈ M such that

$$d_1(x, z) < \Delta(z) \Longrightarrow d_2(f(x), f(z)) < \epsilon . \tag{2.2}$$

From Proposition 7, we can prepare a finite number of neighborhoods to cover M.
Let U_1, ..., U_m be neighborhoods with centers z_1, ..., z_m and radii Δ(z_1)/2, ...,
Δ(z_m)/2. Suppose that we choose x, y ∈ M such that
d_1(x, y) < δ := (1/2) min_{1≤i≤m} Δ(z_i).
Because x belongs to one of the U_1, ..., U_m, without loss of generality, we assume
that x ∈ U_i. From the distance property, we have

$$d_1(x, z_i) < \frac{1}{2} \Delta(z_i) < \Delta(z_i)$$

$$d_1(y, z_i) \le d_1(x, y) + d_1(x, z_i) < \Delta(z_i) .$$

Combining these inequalities with (2.2), we have

$$d_2(f(x), f(y)) \le d_2(f(x), f(z_i)) + d_2(f(y), f(z_i)) < \epsilon + \epsilon = 2\epsilon .$$

Since ε > 0 is arbitrary and δ does not depend on x, y, f is uniformly continuous.

Example 30 We can prove that “a definite integral exists for a continuous function
defined over a closed interval [a, b]” by virtue of Proposition 8. If we divide [a, b]
into n equal-length segments as a = x_0 < ... < x_n = b, then for an arbitrary ε > 0,
we require

$$\sum_{i=1}^{n} \frac{b-a}{n} \sup_{x_{i-1} < x < x_i} f(x) - \sum_{i=1}^{n} \frac{b-a}{n} \inf_{x_{i-1} < x < x_i} f(x) < \epsilon$$

to define the definite integral. Because f is uniformly continuous, the condition
|f(x) − f(y)| < ε/(b − a), x, y ∈ [x_{i−1}, x_i], is satisfied if we make
δ = x_i − x_{i−1} = (b − a)/n smaller (i.e., we make n larger).
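
The gap between the upper and lower sums in Example 30 can be evaluated numerically. The sketch below is an illustration only: the function name upper_minus_lower and the choice f = exp on [0, 1] are ours, and the supremum and infimum on each segment are approximated on a fine grid.

```python
import numpy as np

def upper_minus_lower(f, a, b, n, samples=200):
    """Approximate the gap between the upper and lower sums of f on n equal segments."""
    edges = np.linspace(a, b, n + 1)
    gap = 0.0
    for left, right in zip(edges[:-1], edges[1:]):
        # sup and inf on each segment are approximated on a fine grid (an assumption
        # of this sketch; it is exact for monotone f because endpoints are included)
        values = f(np.linspace(left, right, samples))
        gap += (b - a) / n * (values.max() - values.min())
    return gap

for n in [10, 100, 1000]:
    print(n, upper_minus_lower(np.exp, 0.0, 1.0, n))
```

For this monotone f, the gap equals (e − 1)/n, so it falls below any ε > 0 once n is large enough.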

2.2 Linear Spaces and Inner Product Spaces

We say that a set V is a linear space3 if it satisfies the following conditions: for
x, y ∈ V and α ∈ R,
1. x + y ∈ V and
2. αx ∈ V.
Example 31 If we define the sum of x = [x1 , . . . , xd ], y = [y1 , . . . , yd ] ∈ Rd and
the multiplication by a constant α ∈ R as x + y = [x1 + y1 , . . . , xd + yd ] and αx =
[αx1 , . . . , αxd ], respectively, then the d-dimensional Euclidean space Rd forms a
linear space.
Example 32 (L^2 Space) The set L^2[0, 1] of functions f : [0, 1] → R for which
$\int_0^1 \{f(x)\}^2 dx$ takes a finite value is a linear space because

$$\int_0^1 \{f(x) + g(x)\}^2 dx \le 2 \int_0^1 f(x)^2 dx + 2 \int_0^1 g(x)^2 dx < \infty$$

$$\int_0^1 \{\alpha f(x)\}^2 dx = \alpha^2 \int_0^1 f(x)^2 dx < \infty$$

for α ∈ R when $\int_0^1 \{f(x)\}^2 dx < \infty$ and $\int_0^1 g(x)^2 dx < \infty$.
Let V be a linear space. We say that any bivariate function ⟨·, ·⟩ : V × V → R
that obeys the four conditions below is an inner product:
1. ⟨x, x⟩ ≥ 0;
2. ⟨αx + βy, z⟩ = α⟨x, z⟩ + β⟨y, z⟩;
3. ⟨x, y⟩ = ⟨y, x⟩; and
4. ⟨x, x⟩ = 0 ⇐⇒ x = 0
for x, y, z ∈ V and α, β ∈ R.

3 The same as a vector space.


Example 33 For the linear space R^d in Example 31,

$$\langle x, y \rangle = \sum_{i=1}^{d} x_i y_i$$

for x = [x_1, ..., x_d], y = [y_1, ..., y_d] ∈ R^d is an inner product because the first
three conditions are obvious, and the last condition holds because

$$\langle x, x \rangle = 0 \iff \sum_{i=1}^{d} x_i^2 = 0 \iff x = 0 .$$

For a given linear space, we need to choose its inner product. We say that a linear
space equipped with an inner product is an inner product space.
Example 34 (Inner Product of the L^2 Space) Let L^2[0, 1] be the linear space in
Example 32. The bivariate function

$$\langle f, g \rangle = \int_0^1 f(x) g(x) dx$$

for f, g ∈ L^2[0, 1] is not an inner product because the last condition fails. If f(1/2) = 1
and f(x) = 0 for x ≠ 1/2, then we have

$$\langle f, f \rangle = \int_0^1 f(x)^2 dx = 0 ,$$

although f ≠ 0. Strictly speaking, we construct such an inner product by identifying4
f, g ∈ L^2 if and only if $\int_0^1 \{f(x) - g(x)\}^2 dx = 0$, a detail that is rarely noticed.
Let V be a linear space. We say that any function ‖·‖ : V → R that obeys the
four conditions below is a norm.
1. ‖x‖ ≥ 0;
2. ‖ax‖ = |a| ‖x‖;
3. ‖x + y‖ ≤ ‖x‖ + ‖y‖ (triangle inequality); and
4. ‖x‖ = 0 ⇐⇒ x = 0
for x, y ∈ V and a ∈ R.
If we define the norm of V, then we can construct a metric space with the distance
d(x, y) = ‖x − y‖. If we define the inner product of V, the function

$$\|x\| = \langle x, x \rangle^{1/2} \tag{2.3}$$

4 We define the equivalence relation ∼ such that f − g ∈ {h | $\int_0^1 \{h(x)\}^2 dx = 0$} ⇐⇒ f ∼ g and
construct the inner product for the quotient space L^2/∼.

satisfies the four conditions of a norm, and we call this norm a norm induced by an
inner product.
In Examples 32 and 34, we introduced L^2 over E = [0, 1] using the Riemann
integral. However, in general, we define L^2(E, F, μ) as the set of f : E → R for which

$$\int_E f^2 d\mu \tag{2.4}$$

is finite5 for the measure space (E, F, μ).

Example 35 (Uniform Norm) The set of continuous functions over [a, b] forms a
linear space. The uniform norm defined by

$$\|f\| := \sup_{x \in [a,b]} |f(x)| , \quad f \in C[a, b]$$

is not induced by an inner product but satisfies the norm conditions:

$$\|f\| = 0 \iff f(x) = 0 , \quad x \in [a, b] .$$

In this book, we often use the Cauchy-Schwarz inequality:

$$|\langle x, y \rangle| \le \|x\| \, \|y\| \tag{2.5}$$

for x, y ∈ V. Clearly, (2.5) holds for y = 0. If y ≠ 0, since the inner product of
y and z := x − (⟨x, y⟩/‖y‖²) y is zero, we have

$$\|x\|^2 = \left\| z + \frac{\langle x, y \rangle}{\|y\|^2} y \right\|^2 = \|z\|^2 + \left\| \frac{\langle x, y \rangle}{\|y\|^2} y \right\|^2 \ge \frac{\langle x, y \rangle^2}{\|y\|^2} ,$$

where the equality holds if and only if z = 0, which occurs exactly when one of x, y
is a constant multiple of the other. Moreover, we can verify the triangle inequality
of the norm via (2.5):

$$\|x + y\|^2 \le \|x\|^2 + 2 |\langle x, y \rangle| + \|y\|^2 \le \|x\|^2 + 2 \|x\| \, \|y\| + \|y\|^2 = (\|x\| + \|y\|)^2 .$$

Because we did not use the last condition of inner products in deriving the
Cauchy-Schwarz inequality, we may apply this inequality to any bivariate function
that satisfies the first three conditions.
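
The orthogonal split used to derive (2.5) can be checked numerically in R^5 with the standard inner product. This is a sketch on random vectors (the seed and dimension are arbitrary choices), not a proof.

```python
import numpy as np

rng = np.random.default_rng(0)
x, y = rng.normal(size=5), rng.normal(size=5)

# z is the part of x orthogonal to y, as in the derivation of (2.5)
coef = np.dot(x, y) / np.dot(y, y)
z = x - coef * y

print(np.dot(z, y))                                         # approximately 0
print(abs(np.dot(x, y)), np.linalg.norm(x) * np.linalg.norm(y))
print(np.dot(x, x), np.dot(z, z) + coef**2 * np.dot(y, y))  # the two sides of the split
```

The second line prints the two sides of (2.5), and the third confirms that the squared norm of x splits into the contributions of z and of the multiple of y.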
Using Cauchy-Schwarz’s inequality, we can prove the continuity of an inner prod-
uct.

5 We often abbreviate (E, F , μ) or specify the interval as [a, b] (as in L 2 [a, b]).
Fig. 2.2 We illustrate Example 37 when f_n → f (panels n = 1, 2, 3, 4; y plotted against x).
The function is continuous for finite values of n but is not continuous as n → ∞

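Proposition 9 (Continuity of an Inner Product) Let {x_n}, {y_n} be sequences in a
linear space V. For x, y ∈ V, if x_n → x and y_n → y as n → ∞, then ⟨x_n, y_n⟩ →
⟨x, y⟩, where we denote by ‖·‖ the norm induced by the inner product of V.
Proof: The proposition follows from ‖x_n‖ ≤ ‖x‖ + ‖x_n − x‖ → ‖x‖ (n → ∞) and
the Cauchy-Schwarz inequality (2.5):

$$|\langle x_n, y_n \rangle - \langle x, y \rangle| \le |\langle x_n, y_n - y \rangle| + |\langle x_n - x, y \rangle| \le \|x_n\| \, \|y_n - y\| + \|x_n - x\| \, \|y\| \to 0 .$$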

2.3 Hilbert Spaces

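We say that an inner product space that is complete w.r.t. the distance induced by
its inner product is a Hilbert space, and that a vector space in which a norm is defined
and the induced distance is complete is a Banach space. Hereafter, we denote by C(E)
the set of continuous functions defined over E.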

Example 36 If p ≥ 1, the vector space R^p is complete under the standard inner
product and is a Hilbert space. On the other hand, the vector space Q^p is not complete
under the standard inner product and is therefore not a Hilbert space.

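Example 37 The set of continuous functions [0, 1] → R forms a linear space
C[0, 1]. We consider the function

$$f_n(t) := \begin{cases} 0, & 0 \le t \le \frac{1}{2} \\ n\left(t - \frac{1}{2}\right), & \frac{1}{2} < t < \frac{1}{2} + \frac{1}{n} \\ 1, & \frac{1}{2} + \frac{1}{n} \le t \le 1 . \end{cases}$$

For m ≥ n, we have

$$\|f_n - f_m\|_2^2 = \int_0^1 |f_n(t) - f_m(t)|^2 dt = \int_{1/2}^{1/2 + 1/n} |f_n(t) - f_m(t)|^2 dt$$

$$= \int_{1/2}^{1/2 + 1/m} \left[ n\left(t - \frac{1}{2}\right) - m\left(t - \frac{1}{2}\right) \right]^2 dt + \int_{1/2 + 1/m}^{1/2 + 1/n} \left[ n\left(t - \frac{1}{2}\right) - 1 \right]^2 dt$$

$$= \frac{(n - m)^2}{3m^3} - \frac{(n - m)^3}{3m^3 n} = \frac{(n - m)^2}{3m^2 n} < \frac{1}{3n} \to 0 .$$

Thus, {f_n} is a Cauchy sequence in C[0, 1] (Fig. 2.2). However, f_n converges to a
function that is not continuous:

$$f(t) := \begin{cases} 0, & 0 \le t \le \frac{1}{2} \\ 1, & \frac{1}{2} < t \le 1 \end{cases}$$

$$\|f_n - f\|_2^2 = \int_0^1 |f_n(t) - f(t)|^2 dt = \int_{1/2}^{1/2 + 1/n} \left[ n\left(t - \frac{1}{2}\right) - 1 \right]^2 dt = \frac{1}{3n} \to 0 .$$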
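
The closed form for the squared L^2 distance between f_n and f_m in Example 37 can be compared with a numerical integral. The sketch below uses the trapezoidal rule on a fine grid; the helper names f_n and sq_l2_dist and the values n = 4, m = 10 are choices made only for this illustration.

```python
import numpy as np

def f_n(t, n):
    """Piecewise-linear ramp from Example 37: 0 up to 1/2, slope n, then 1."""
    return np.clip(n * (t - 0.5), 0.0, 1.0)

def sq_l2_dist(g, h, num=200_001):
    """Squared L2[0, 1] distance, approximated by the trapezoidal rule."""
    t = np.linspace(0.0, 1.0, num)
    d2 = (g(t) - h(t)) ** 2
    dt = t[1] - t[0]
    return float(np.sum((d2[:-1] + d2[1:]) / 2) * dt)

n, m = 4, 10
numeric = sq_l2_dist(lambda t: f_n(t, n), lambda t: f_n(t, m))
closed_form = (n - m) ** 2 / (3 * m**2 * n)
print(numeric, closed_form)
```

The two printed values agree up to the discretization error, and the closed form stays below 1/(3n), in line with the Cauchy property.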

As we have seen, C[a, b] is not complete w.r.t. the L^2 norm. However, it is
complete w.r.t. the uniform norm:
Proposition 10 C[a, b] is complete w.r.t. the uniform norm.
Proof: Let {f_n} be a Cauchy sequence in C[a, b], which means that

$$\sup_{m,n \ge N} \sup_{x \in [a,b]} |f_m(x) - f_n(x)| \to 0 \tag{2.6}$$

as N → ∞. Then, the real sequence {f_n(x)} is a Cauchy sequence for each x ∈ [a, b]
and converges to a real value (Proposition 6). If we define the function f(x) with
x ∈ [a, b] by lim_{n→∞} f_n(x), then sup_{n≥N} f_n(x) and inf_{n≥N} f_n(x) converge to
f(x) from above and from below, respectively. From (2.6), we see that

$$|f_N(x) - f(x)| \le \sup_{n \ge N} f_n(x) - \inf_{n \ge N} f_n(x) = \sup_{m,n \ge N} |f_m(x) - f_n(x)|$$

converges to 0 uniformly in x ∈ [a, b]. Because a uniform limit of continuous functions
is continuous, we have f ∈ C[a, b], which implies that C[a, b] is complete.
Because no inner product induces the uniform norm, C[a, b] is a Banach
space but is not a Hilbert space.

Proposition 11 C[a, b] is separable w.r.t. the uniform norm.

For the proof, we use the Stone-Weierstrass theorem (Proposition 12) [30, 31, 34,
35]. We say that a linear space A equipped with an associative multiplication “·” in
addition to its commutative addition “+” is an algebra if it satisfies

x · (y + z) = x · y + x · z

(y + z) · x = y · x + z · x

α(x · y) = (αx) · y

for x, y, z ∈ A and α ∈ R, where the first two properties are identical if · is
commutative. The general theory may be complex, but we only suppose that + and · are
the standard addition and multiplication operations and that A is a set of either
polynomials or continuous functions.
Example 38 (Polynomial Ring) Let +, · be the standard commutative addition and
multiplication operations. The polynomial ring R[x, y, z] is the set of polynomials with
indeterminates (variables) x, y, z and coefficients in R, and it is a commutative algebra.
R[x, y, z] is a linear space: if two elements belong to R[x, y, z], then their multiples
by elements of R and their sum belong to R[x, y, z]. Moreover, the three laws above
hold for the elements in R[x, y, z].

Proposition 12 (Stone-Weierstrass [30, 31, 34, 35]) Let E and A be a compact set
and an algebra, respectively. Under the following conditions, A is dense in C(E).
1. No x ∈ E exists such that f(x) = 0 for all f ∈ A.
2. For each pair x, y ∈ E with x ≠ y, f ∈ A exists such that f(x) ≠ f(y).

We refer to Proposition 12 several times. Although we omit its proof in this
book, we encourage the reader to follow it in the references because it contains no
complicated derivations. We can use this proposition to prove that a neural network
can approximate any continuous function.
Proof of Proposition 11: A set A of polynomials with indeterminate x and real
coefficients satisfies the two conditions in Proposition 12. Hence, A is dense in
C[a, b]. Furthermore, if we restrict the coefficients of A to Q, then A is a countable
set, which means that C[a, b] is separable.

Proposition 13 C[a, b], a < b, is dense in L 2 [a, b].


For the proof, see the appendix at the end of this chapter. We say that a function in
the form

$$\sum_{k=1}^{m} h(B_k) I(B_k) \tag{2.7}$$

with exclusive B_1, ..., B_m ∈ F and h : F → R_{≥0} is a simple function. Equation
(1.7) in Chap. 1 approximates this function by using a simple function. The proof
in the appendix shows that a simple function approximates an arbitrary f ∈ L^2 and
that a continuous function approximates an arbitrary simple function.

Proposition 14 (Riesz-Fischer) L^2 is complete6. In other words, L^2 is a Banach
space and a Hilbert space.

The outline of the proof is as follows; see the appendix for details. It is sufficient
to show that “{f_n} is a Cauchy sequence in L^2 ⟹ there exists an f ∈ L^2 such that
‖f_n − f‖ → 0”. We define the f to which the Cauchy sequence {f_n} in L^2 converges
and derive ‖f_n − f‖ → 0 and f ∈ L^2.
1. Let {f_n} be an arbitrary Cauchy sequence in L^2.

2. There exists {n_k} such that $\left\| \sum_{k=1}^{\infty} |f_{n_{k+1}} - f_{n_k}| \right\|_2 < \infty$.

3. We show the existence of f : E → R such that μ{x ∈ E | lim_{k→∞} f_{n_k}(x) =
f(x)} = μ(E).
4. We show that ‖f_n − f‖ → 0 and f ∈ L^2[a, b].
Let V and ⟨·, ·⟩ be an inner product space and its inner product, respectively.
We say that x, y ∈ V are orthogonal if ⟨x, y⟩ = 0 and that a sequence {e_j} in V
is orthonormal if ‖e_j‖ = 1 for each j and each pair e_i, e_j (i ≠ j) is orthogonal.
Moreover, we say that {e_j} is an orthonormal basis of V if {e_j} is an orthonormal
sequence and each x ∈ V can be expressed as $x = \sum_{j=1}^{\infty} \alpha_j e_j$ using real α_j.

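Proposition 15 For an orthonormal sequence {e_j} in a Hilbert space H, we have
the following properties:
1. $\sum_{i=1}^{\infty} \langle x, e_i \rangle^2 \le \|x\|^2$ (Bessel’s inequality);
2. $\sum_{i=1}^{\infty} \langle x, e_i \rangle e_i$ converges;
3. $\sum_{i=1}^{\infty} \alpha_i e_i$ converges $\iff \sum_{i=1}^{\infty} \alpha_i^2 < \infty$; and
4. $y = \sum_{i=1}^{\infty} \alpha_i e_i \Longrightarrow \alpha_i = \langle y, e_i \rangle$.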

6 L p is complete for any p ≥ 1, although we abbreviate this proof.


Proof: See the appendix at the end of this chapter.

Suppose that A is a subset of a linear space V equipped with a norm, and let
$\overline{\mathrm{span}}(A)$ be the closure of span(A), the set of linear combinations of the elements in
A. Suppose that {x_n} is a sequence in a Hilbert space whose elements are linearly
independent. Then, the sequence {e_n} constructed below is orthonormal

$$v_i := x_i - \sum_{j=1}^{i-1} \langle x_i, e_j \rangle e_j , \quad e_i = v_i / \|v_i\| , \quad i = 1, 2, \ldots$$

and satisfies $\overline{\mathrm{span}}\{x_n\} = \overline{\mathrm{span}}\{e_n\}$ (Gram-Schmidt).
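
In finite dimensions, the Gram-Schmidt construction above can be written in a few lines. The following is a minimal sketch for vectors in R^3 with the standard inner product; the function name gram_schmidt and the input vectors are our own choices.

```python
import numpy as np

def gram_schmidt(xs):
    """Orthonormalize linearly independent vectors (rows of xs) in R^d."""
    es = []
    for x in xs:
        # v_i = x_i - sum_j <x_i, e_j> e_j removes the components along e_1..e_{i-1}
        v = x - sum(np.dot(x, e) * e for e in es)
        es.append(v / np.linalg.norm(v))
    return np.array(es)

xs = np.array([[1.0, 1.0, 0.0],
               [1.0, 0.0, 1.0],
               [0.0, 1.0, 1.0]])
es = gram_schmidt(xs)
print(np.round(es @ es.T, 10))  # Gram matrix of the e_i, close to the identity
```

The division by the norm of v fails when the inputs are linearly dependent, which is why the construction assumes linear independence.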

Proposition 16 Let {e_j} be an orthonormal sequence in a Hilbert space H. The
following conditions are equivalent:
1. {e_j} is an orthonormal basis of H;
2. For an arbitrary x ∈ H, ⟨x, e_k⟩ = 0, k = 1, 2, ..., ⟹ x = 0;
3. span{e_j} is dense in H;
4. Equality holds in Bessel’s inequality (the first item of Proposition 15) for an
arbitrary x ∈ H (Parseval’s equality);
5. For arbitrary x, y ∈ H, $\langle x, y \rangle = \sum_{k=1}^{\infty} \langle x, e_k \rangle \langle y, e_k \rangle$; and
6. For an arbitrary x ∈ H, $x = \sum_{j=1}^{\infty} \langle x, e_j \rangle e_j$.

Proof: See the appendix at the end of this chapter.

Example 39 (Fourier Series Expansion) By approximating f ∈ L^2[−π, π] by

$$f_m(x) = a_0 + \sum_{n=1}^{m} (a_n \cos nx + b_n \sin nx) \tag{2.8}$$

and computing the {a_n}, {b_n} that minimize ‖f − f_m‖, we obtain ‖f − f_m‖ → 0
as m → ∞. However, in general, ‖f − f_m‖ = 0 does not occur for any m. In that case,
we see that f ∉ span(A) for A := {1, cos x, sin x, cos 2x, sin 2x, ···} and that span(A)
is not a closed set. We add all the f ∉ span(A) such that ‖f − f_m‖ → 0 (convergence
points) to span(A) to obtain the closure $\overline{\mathrm{span}}(A)$. Moreover, span(A) is dense in
L^2[−π, π]. Then,

$$\left\{ \frac{1}{\sqrt{2\pi}}, \frac{\cos x}{\sqrt{\pi}}, \frac{\sin x}{\sqrt{\pi}}, \frac{\cos 2x}{\sqrt{\pi}}, \frac{\sin 2x}{\sqrt{\pi}}, \cdots \right\} ,$$

which are the elements of {1, cos x, sin x, cos 2x, sin 2x, ···} divided by their norms,
form an orthonormal basis of the Hilbert space L^2[−π, π] that consists of the
functions expressed by the Fourier series as in (2.8), where we regard
$\langle f, g \rangle := \int_{-\pi}^{\pi} f(x) g(x) dx$ as the inner product of f, g ∈ H, and we use
$\int_{-\pi}^{\pi} \cos mx \sin nx \, dx = 0$ and $\int_{-\pi}^{\pi} \cos^2 mx \, dx = \int_{-\pi}^{\pi} \sin^2 nx \, dx = \pi$ for m, n > 0.
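
The convergence ‖f − f_m‖ → 0 in Example 39 can be observed numerically. In the sketch below, the target f(x) = x, the grid size, and the helper names inner and fourier_partial_sum are assumptions made for illustration; integrals are approximated by the trapezoidal rule.

```python
import numpy as np

x = np.linspace(-np.pi, np.pi, 200_001)
dx = x[1] - x[0]

def inner(g, h):
    """<g, h> on [-pi, pi], approximated by the trapezoidal rule on the grid x."""
    gh = g * h
    return float(np.sum((gh[:-1] + gh[1:]) / 2) * dx)

def fourier_partial_sum(f, m):
    """Project f onto span{1, cos nx, sin nx : n <= m} via the orthonormal basis."""
    basis = [np.full_like(x, 1 / np.sqrt(2 * np.pi))]
    for n in range(1, m + 1):
        basis.append(np.cos(n * x) / np.sqrt(np.pi))
        basis.append(np.sin(n * x) / np.sqrt(np.pi))
    return sum(inner(f, e) * e for e in basis)

f = x.copy()  # the target f(x) = x is a choice made only for this illustration
for m in [1, 4, 16, 64]:
    r = f - fourier_partial_sum(f, m)
    print(m, np.sqrt(inner(r, r)))
```

The printed norms decrease with m but never reach zero for this f, which is the situation described in the example: f lies in the closure of span(A) but not in span(A) itself.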
Proposition 17 For a Hilbert space H, being separable is equivalent to having an
orthonormal basis.
Proof: If {e_j} forms an orthonormal basis of H, we can express an arbitrary x ∈
H as $x = \sum_{j=1}^{\infty} \langle x, e_j \rangle e_j$. The set E of finite linear combinations of the e_j with
coefficients in Q is dense and countable. In fact, truncating the series and replacing
each coefficient ⟨x, e_j⟩ by a nearby rational number approximates x arbitrarily well,
and E is a countable union of countable sets, which means that H is separable. On the
other hand, if H is separable, by extracting linearly independent elements from a dense
and countable subset of H and applying Gram-Schmidt’s method, we can construct an
orthonormal sequence {e_n}. Thus, we see that the linear combinations of these elements
are dense in H. From the third part of Proposition 16, {e_n} is an orthonormal basis of H.
Proposition 17 implies the following:
Proposition 18 L^2[a, b] is separable under the L^2 norm.
In this book, we assume that the Hilbert space we deal with is separable.

2.4 Projection Theorem

Let V and M be a linear space equipped with an inner product and its subspace,
respectively. We define the orthogonal complement of M as

$$M^{\perp} := \{x \in V : \langle x, y \rangle = 0 \text{ for all } y \in M\} .$$

For subspaces M_1 and M_2 of V, we write the direct sum of M_1, M_2 as M_1 + M_2.
In particular, when M_1, M_2 are orthogonal to each other, i.e.,

$$x_1 \perp x_2 , \quad x_1 \in M_1 , \ x_2 \in M_2 , \tag{2.9}$$

we write it as M_1 ⊕ M_2 := {x_1 + x_2 : x_1 ∈ M_1, x_2 ∈ M_2}.

Proposition 19 (Projection Theorem) Let M be a closed subspace of a Hilbert space
H. Then, for an arbitrary x ∈ H, there exists a y ∈ M that minimizes ‖x − y‖.
Moreover, such a y satisfies

$$\langle x - y, z \rangle = 0 , \quad z \in M \tag{2.10}$$

and is unique.

Proof: Given an x ∈ H, we seek a y ∈ M such that x = y + (x − y), y ∈ M,
and x − y ∈ M^⊥. For the proof, we utilize the following steps.
1. Show that a sequence {y_n} in M such that

$$\lim_{n \to \infty} \|x - y_n\|^2 = \inf_{y \in M} \|x - y\|^2$$

is a Cauchy sequence.
2. Demonstrate the existence of a y ∈ M such that y_n → y.
3. Show that 2a⟨x − y, z − y⟩ ≤ a^2 ‖z − y‖^2 for arbitrary 0 < a < 1 and z ∈ M,
and derive a contradiction in the inequality when we assume that ⟨x − y, z − y⟩ > 0.
4. Show that ⟨x − y, z⟩ ≤ 0.
5. Substitute −z for z to obtain the proposition.
For details, see the appendix at the end of this chapter.
Equation (2.10) implies that an arbitrary x ∈ H can be uniquely decomposed into
x = y + (x − y) with y ∈ M and x − y ∈ M^⊥, which means that

$$H = M \oplus M^{\perp} . \tag{2.11}$$
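
In R^5 with the standard inner product, the decomposition (2.11) can be checked numerically. The sketch below takes M to be the span of the columns of a random matrix A (an assumption made for illustration) and obtains the minimizer y by least squares.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 2))   # columns span a 2-dimensional subspace M of R^5
x = rng.normal(size=5)

# y = A c with c the least-squares solution minimizes ||x - y|| over y in M
c, *_ = np.linalg.lstsq(A, x, rcond=None)
y = A @ c

print(A.T @ (x - y))          # <x - y, z> for the spanning z: approximately 0
print(np.dot(x, x), np.dot(y, y) + np.dot(x - y, x - y))
```

The residual x − y is orthogonal to every spanning vector of M, as in (2.10), and the squared norm of x splits into the contributions of y ∈ M and x − y ∈ M^⊥.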

Example 40 For a positive definite kernel k : E × E → R and each x ∈ E, we
regard k(x, ·) : E → R as a function over E. In general, the space spanned by
{k(x, ·)}_{x∈E} and its closure H := $\overline{\mathrm{span}}$({k(x, ·)}_{x∈E}) are linear spaces. We show that
the inner product is ⟨k(x, ·), k(y, ·)⟩_H = k(x, y) and demonstrate its completeness
in Chap. 3. For x_1, ..., x_N ∈ E, M := span({k(x_i, ·)}_{i=1}^N) forms a finite-dimensional
linear space, and H can be written as (2.11) by using

$$M^{\perp} = \{ f \in H \mid \langle f, k(x_i, \cdot) \rangle_H = 0 , \ i = 1, \ldots, N \} .$$

If f = f_1 + f_2 for f_1 ∈ M and f_2 ∈ M^⊥, we have

$$\|f\|_H^2 = \|f_1\|_H^2 + \|f_2\|_H^2 + 2 \langle f_1, f_2 \rangle_H = \|f_1\|_H^2 + \|f_2\|_H^2 \ge \|f_1\|_H^2 ,$$

which is used in Chap. 4 for kernel calculations.

Proposition 20 Let H and M be a Hilbert space and its subset7, respectively. Then,
we have the following:
1. M^⊥ is a closed subspace of H.
2. M ⊆ (M^⊥)^⊥.
3. If M is a subspace, then (M^⊥)^⊥ = M̄, where M̄ is the closure of the set M.
Proof: For the first item, we see that M^⊥ is a subspace. From the continuity of an
inner product (Proposition 9), if x_n → x as n → ∞ for a sequence {x_n} in M^⊥, then
we have for a ∈ M,

$$\langle x, a \rangle = \lim_{n \to \infty} \langle x_n, a \rangle = 0 ,$$

which means that M^⊥ is closed. The second item is due to

$$x \in M \Longrightarrow \langle x, y \rangle = 0 , \ y \in M^{\perp} \Longrightarrow x \in (M^{\perp})^{\perp} .$$

For the third item, from the first two properties, taking the closure on both sides
of M ⊆ (M^⊥)^⊥ yields M̄ ⊆ (M^⊥)^⊥. From Proposition 19, an arbitrary x ∈ (M^⊥)^⊥
can be written as x = y + z with $y \in \overline{M} \cap (M^{\perp})^{\perp} = \overline{M}$ and $z \in \overline{M}^{\perp} \cap (M^{\perp})^{\perp}$.
However, we have $\overline{M}^{\perp} \cap (M^{\perp})^{\perp} \subseteq M^{\perp} \cap (M^{\perp})^{\perp} = \{0\}$, which means that z = 0,
and we obtain the third item.

7 This is not necessarily a subspace.

2.5 Linear Operators

Let X_1, X_2 be linear spaces with norms ‖·‖_1, ‖·‖_2, respectively, and let T :
X_1 → X_2 be a map that linearly transforms an element in X_1 to an element in X_2.
We call such a T a linear operator. We define its image and kernel by

$$\mathrm{Im}(T) := \{T x : x \in X_1\} \subseteq X_2$$

and

$$\mathrm{Ker}(T) := \{x \in X_1 : T x = 0\} \subseteq X_1 ,$$

and we call the dimensionality of Im(T) the rank of T. We say that the linear operator
T : X_1 → X_2 is bounded if there exists a constant C > 0 such that, for each x ∈ X_1,

$$\|T x\|_2 \le C \|x\|_1 .$$

We write the set of such T’s as B(X_1, X_2). In particular, we write B(X_1, X_2) as
B(X) when X_1 = X_2 = X.

Proposition 21 A linear operator T is bounded if and only if T is uniformly continuous.

Proof: If T is uniformly continuous, then there exists a δ > 0 such that
‖x‖_1 ≤ δ ⟹ ‖T x‖_2 ≤ 1. Since the norm of δx/‖x‖_1 is at most δ, we have

$$\|T x\|_2 = \left\| T\left( \frac{\delta x}{\|x\|_1} \right) \right\|_2 \frac{\|x\|_1}{\delta} \le \frac{\|x\|_1}{\delta}$$

for any x ≠ 0. On the other hand, if T is bounded, there exists a constant C that does
not depend on x ∈ X_1 such that

$$\|T(x_n - x)\|_2 \le C \|x_n - x\|_1$$

for any {x_n} and an x ∈ X_1 such that x_n → x as n → ∞, which means that T is
uniformly continuous.

Hereafter, we define the operator norm of T ∈ B(X_1, X_2) by

$$\|T\| := \sup_{x \in X_1 , \|x\|_1 = 1} \|T x\|_2 . \tag{2.12}$$

Thus, for an arbitrary x ∈ X_1, we have

$$\|T x\|_2 \le \|T\| \, \|x\|_1 .$$

Example 41 Let X_1 := R^p and X_2 := R^q. If the norm is the Euclidean norm, then
we can write the linear operator T : R^p → R^q as T : x ↦ Bx by using some B ∈
R^{q×p}. If the matrix B is square, the norm ‖T‖ is the square root of the maximum
eigenvalue of the nonnegative definite matrix A := B^⊤B:

$$\|T\|^2 = \max_{\|x\| = 1} x^{\top} B^{\top} B x = \max_{\|x\| = 1} \|B x\|^2 .$$
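
The characterization of the operator norm in Example 41 can be checked numerically for a small square matrix. The following sketch uses a random 4 × 4 matrix B (the size and seed are arbitrary choices) and compares the eigenvalue formula with the values of ‖Bx‖ on many random unit vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.normal(size=(4, 4))   # T : R^4 -> R^4, x -> B x (size chosen arbitrarily)

# square root of the largest eigenvalue of the nonnegative definite matrix B^T B
op_norm = np.sqrt(np.linalg.eigvalsh(B.T @ B).max())

# compare with ||B x|| over many random unit vectors: none should exceed op_norm
xs = rng.normal(size=(10_000, 4))
xs /= np.linalg.norm(xs, axis=1, keepdims=True)
print(op_norm, np.linalg.norm(xs @ B.T, axis=1).max())
```

The sampled maximum approaches op_norm from below; the supremum in (2.12) is attained at an eigenvector of B^⊤B for the largest eigenvalue.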

Example 42 For K : [0, 1]^2 → R, let

$$\int_0^1 \int_0^1 K^2(x, y) dx dy$$

be finite. We define the integral operator as the linear operator T in L^2[0, 1] such
that

$$(T f)(\cdot) = \int_0^1 K(\cdot, x) f(x) dx \tag{2.13}$$

for f ∈ L^2[0, 1]. Note that (2.13) belongs to L^2[0, 1] and that T is bounded: From

$$|(T f)(x)|^2 \le \int_0^1 K^2(x, y) dy \int_0^1 f^2(y) dy = \|f\|_2^2 \int_0^1 K^2(x, y) dy ,$$

we have

$$\|T f\|_2^2 = \int_0^1 |(T f)(x)|^2 dx \le \|f\|_2^2 \int_0^1 \int_0^1 K^2(x, y) dx dy .$$

We call such a K an integral operator kernel and distinguish it from the positive
definite kernels we deal with in this book.

In particular, we call any linear operator with X 2 = R a linear functional.

Proposition 22 (Riesz’s Representation Theorem) Let H be a Hilbert space with
an inner product ⟨·, ·⟩ and a norm ‖·‖, and let T ∈ B(H, R). Then, there exists
a unique e_T ∈ H such that

$$T f = \langle f, e_T \rangle , \quad f \in H \tag{2.14}$$

and ‖T‖ = ‖e_T‖.

Proof: See the appendix at the end of this chapter.

Example 43 (RKHS) Let x ∈ E, and let Tx : H → R be the map from f ∈ H to

f (x). Then, Tx is linear because

Tx (a f + bg) = (a f + bg)(x) = a f (x) + bg(x) = aTx ( f ) + bTx (g) .

We assume that T_x is bounded for each x ∈ E. Then, from Proposition 22, there
exists a k_x ∈ H such that

$$f(x) = T_x(f) = \langle f, k_x \rangle$$

for x ∈ E, and ‖T_x‖ = ‖k_x‖.

Proposition 23 (Adjoint Operator) Let H_i be a Hilbert space with an inner product
⟨·, ·⟩_i for i = 1, 2 and T ∈ B(H_1, H_2). Then, there exists a T* ∈ B(H_2, H_1) such
that

$$\langle T x_1, x_2 \rangle_2 = \langle x_1, T^* x_2 \rangle_1 , \quad x_1 \in H_1 , \ x_2 \in H_2 .$$

Proof: If we fix x_2 ∈ H_2 and regard ⟨T x_1, x_2⟩_2 as a function of x_1 ∈ H_1, then
from |⟨T x_1, x_2⟩_2| ≤ ‖T‖ ‖x_1‖_1 ‖x_2‖_2, the map x_1 ↦ ⟨T x_1, x_2⟩_2 is a bounded
linear functional on H_1. From Proposition 22, for each x_2 ∈ H_2, there exists
y(x_2) ∈ H_1 such that ⟨T x_1, x_2⟩_2 = ⟨x_1, y(x_2)⟩_1. If we define T* x_2 = y(x_2), then
T* is a bounded linear map. The boundedness property is due to

$$\|T^* x_2\|_1^2 = |\langle x_2, T T^* x_2 \rangle_2| \le \|T\| \, \|T^* x_2\|_1 \|x_2\|_2 ,$$

which implies ‖T* x_2‖_1 ≤ ‖T‖ ‖x_2‖_2.

We call the T* in Proposition 23 the adjoint operator of T. In particular, if T* = T,
we call such an operator T self-adjoint.

Example 44 Let H = R^p. We can express any T ∈ B(H) by a square matrix T ∈
R^{p×p}. From

$$\langle T x, y \rangle = x^{\top} T^{\top} y = \langle x, T^{\top} y \rangle ,$$

we see that the adjoint T* is the transpose matrix of T and that T is self-adjoint
if and only if T is a symmetric matrix.

Example 45 For the integral operator of L^2[0, 1] in Example 42, from Fubini’s
theorem, we have that

$$\langle T f, g \rangle = \int_0^1 \int_0^1 K(y, x) f(x) g(y) dx dy = \left\langle f, \int_0^1 K(y, \cdot) g(y) dy \right\rangle ,$$

and $y \mapsto (T^* g)(y) = \int_0^1 K(x, y) g(x) dx$ is the adjoint operator. If the integral
operator kernel K is symmetric, the operator T is self-adjoint.

2.6 Compact Operators

Let (M, d) and E be a metric space and a subset of M, respectively. If any infinite
sequence in E contains a subsequence that converges to an element in E, then we
say that E is sequentially compact. If {xn } has a subsequence that converges to x,
then x is a convergence point of {xn }.
Example 46 Let E := R and d(x, y) := |x − y| for x, y ∈ R. Then, E is not
sequentially compact. In fact, the sequence x_n = n has no convergence points. For
E = (0, 1], the sequence x_n = 1/n converges to 0 ∉ (0, 1] as n → ∞, and the
convergence point of any subsequence is only 0. Therefore, E = (0, 1] is not
sequentially compact.
Proposition 24 Let (M, d) and E be a metric space and a subset of M, respectively.
Then, E is sequentially compact if and only if E is compact.
Proof: Many books on geometry deal with the proof of equivalence. See such books
for the details of this proof.
In this section, we explain compactness by using the terminology of sequential
compactness.
Let X_1, X_2 be linear spaces equipped with norms, and let T ∈ B(X_1, X_2). We
say that T is compact if {T x_n} contains a convergent subsequence for any bounded
sequence {x_n} in X_1.
Example 47 The orthonormal basis {e_j} in an infinite-dimensional Hilbert space H is
bounded because ‖e_j‖ = 1. However, for the identity map, we have ‖e_i − e_j‖ = √2
for any i ≠ j. Thus, the sequence e_1, e_2, ... does not have any convergent
subsequence in H. Hence, the identity operator for any infinite-dimensional Hilbert
space is not compact.
Proposition 25 For any bounded linear operator T , the following hold.
1. If the rank is finite, then the operator T is compact.
2. If a sequence of finite-rank operators {T_n} exists such that ‖T_n − T‖ → 0 as
n → ∞, then T is compact8 .
Proof: See the appendix at the end of this chapter.
Let H and T ∈ B(H) be a Hilbert space and a bounded linear operator on it,
respectively. If λ ∈ R and 0 ≠ e ∈ H exist such that

T e = λe , (2.15)

then we say that λ and e are an eigenvalue and an eigenvector of T , respectively.

Proposition 26 Let T ∈ B(H) and e_j ∈ Ker(T − λ_j I) for j = 1, 2, .... If the
eigenvalues λ_j ≠ 0 have different values, then

8 It is known that the converse is true.


1. {e_j} are linearly independent.
2. If T is self-adjoint, then {e_j} are orthogonal.

Proof: See the appendix at the end of this chapter.

Example 48 Let T ∈ B(H) be a compact operator. For each eigenvalue λ ≠ 0,
the eigenspace Ker(T − λI) has a finite dimensionality. In fact, if Ker(T − λI) is
of infinite dimensionality for an eigenvalue λ ≠ 0, then it contains an infinite
orthonormal sequence of eigenvectors {e_j}, and if we apply the operator T to them,
then as in Example 47, {T e_j} = {λ e_j} does not have any convergent subsequence.
Thus, T is not compact, which is a contradiction.

Example 49 For any C > 0, only a finite number of eigenvalues λ_i of a compact
operator T have absolute values that exceed C. Suppose that the absolute values of an
infinite number of eigenvalues λ_1, λ_2, ... exceed C. Let M_0 := {0},
M_i := span{e_1, ..., e_i}, e_j ∈ Ker(T − λ_j I), j = 1, 2, ..., i = 1, 2, .... Since the
{e_1, ..., e_i} are linearly independent, each M_i ∩ M_{i−1}^⊥ is one dimensional for
i = 1, 2, .... Thus, if we define the orthonormal sequence
x_i ∈ Ker(T − λ_i I) ∩ M_{i−1}^⊥, i = 1, 2, ... via Gram-Schmidt, then we have

$$\|T x_i - T x_k\|^2 = \|T x_i\|^2 + \|T x_k\|^2 \ge 2C^2$$

for i > k. Thus, {T x_i} has no convergent subsequence, which contradicts the
compactness of T.

Example 49 implies that the set of nonzero eigenvalues of T is countable.

We summarize the above discussion and its implications below.
Proposition 27 Let $T$ be a self-adjoint compact operator on a Hilbert space $H$. Then, the set of nonzero eigenvalues of $T$ is finite, or the sequence of eigenvalues converges to zero. Each nonzero eigenvalue has finite multiplicity, and any pair of eigenvectors corresponding to different eigenvalues is orthogonal. Let $\lambda_1, \lambda_2, \ldots$ be a sequence of eigenvalues such that $|\lambda_1| \ge |\lambda_2| \ge \cdots$, and let $e_1, e_2, \ldots$ be any corresponding eigenvectors that are orthogonal (orthogonalized via Gram-Schmidt if they possess the same eigenvalue). Then, $\{e_j\}$ is an orthonormal basis of $\overline{\mathrm{Im}(T)}$, and we can express $T$ by

\[ Tx = \sum_{j=1}^{\infty} \lambda_j \langle x, e_j \rangle e_j \tag{2.16} \]

for each $x \in H$.
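Proof: We utilize the following steps, where the second item is equivalent to $(\mathrm{Ker}(T))^\perp = \overline{\mathrm{Im}(T)}$ because $T = T^*$.
1. Show that $H = \mathrm{Ker}(T) \oplus (\mathrm{Ker}(T))^\perp$.
2. Show that $(\mathrm{Ker}(T))^\perp = \overline{\mathrm{Im}(T^*)}$.
3. Show that $\overline{\mathrm{span}}\{e_j \mid j \ge 1\} \subseteq \overline{\mathrm{Im}(T)}$.
4. Show that $\overline{\mathrm{span}}\{e_j \mid j \ge 1\} \supseteq \overline{\mathrm{Im}(T)}$.
See the appendix at the end of this chapter.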
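We say that an operator $T$ is nonnegative definite if

\[ \langle Tx, x \rangle = \Big\langle \sum_{i=1}^{\infty} \lambda_i \langle x, e_i \rangle e_i, \sum_{j=1}^{\infty} \langle x, e_j \rangle e_j \Big\rangle = \sum_{i=1}^{\infty} \lambda_i \langle x, e_i \rangle^2 \ge 0 \]

for arbitrary $H \ni x = \sum_{i=1}^{\infty} \langle x, e_i \rangle e_i$; this condition is equivalent to $\lambda_1 \ge 0, \lambda_2 \ge 0, \ldots$.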
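Proposition 28 If $T$ is nonnegative definite, we have

\[ \lambda_k = \max_{e \in \mathrm{span}\{e_1, \ldots, e_{k-1}\}^\perp} \frac{\langle Te, e \rangle}{\|e\|^2}, \tag{2.17} \]

where the maximum is taken over the whole Hilbert space $H$ when $k = 1$.
Proof: The claim follows from (2.16) and $\lambda_j \ge 0$:

\[ \max_{e \in \{e_1, \ldots, e_{k-1}\}^\perp,\ \|e\| = 1} \langle Te, e \rangle = \max_{\|e\| = 1} \sum_{j=k}^{\infty} \lambda_j \langle e, e_j \rangle^2 = \lambda_k . \]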
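The spectral expansion (2.16) and the variational characterization (2.17) can be checked numerically in finite dimensions, where every symmetric matrix is a self-adjoint compact operator. The following is a minimal sketch under that assumption; the matrix `T`, the helper `rayleigh_max`, and the random search are illustrative choices, not part of the text.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
T = A @ A.T                      # symmetric, nonnegative definite

# Eigenpairs sorted so that lam[0] >= lam[1] >= ...
lam, E = np.linalg.eigh(T)
order = np.argsort(lam)[::-1]
lam, E = lam[order], E[:, order]

# (2.16): T x = sum_j lam_j <x, e_j> e_j
x = rng.standard_normal(5)
Tx_spectral = sum(lam[j] * (x @ E[:, j]) * E[:, j] for j in range(5))
spectral_ok = np.allclose(T @ x, Tx_spectral)

def rayleigh_max(k, trials=2000):
    """Largest Rayleigh quotient found among random e orthogonal to e_1..e_{k-1} (k is 1-based)."""
    P = np.eye(5) - E[:, :k - 1] @ E[:, :k - 1].T   # projector onto span{e_1..e_{k-1}}^perp
    best = -np.inf
    for _ in range(trials):
        e = P @ rng.standard_normal(5)
        best = max(best, (e @ T @ e) / (e @ e))
    return best

for k in range(1, 6):
    print(k, lam[k - 1], rayleigh_max(k))   # the search never exceeds lam_k
```

The random search only gives a lower estimate of the maximum; the bound itself is attained at $e = e_k$.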

Let $H_1, H_2$ be Hilbert spaces, $\{e_i\}$ an orthonormal basis of $H_1$, and $T \in B(H_1, H_2)$. If

\[ \sum_{i=1}^{\infty} \|Te_i\|_2^2 \]

takes a finite value, we say that $T$ is a Hilbert-Schmidt (HS) operator, and we write the set of HS operators in $B(H_1, H_2)$ as $B_{HS}(H_1, H_2)$.
We define the inner product of $T_1, T_2 \in B_{HS}(H_1, H_2)$ and the HS norm of $T \in B_{HS}(H_1, H_2)$ by $\langle T_1, T_2 \rangle_{HS} := \sum_{j=1}^{\infty} \langle T_1 e_j, T_2 e_j \rangle_2$ and

\[ \|T\|_{HS} := \langle T, T \rangle_{HS}^{1/2} = \Big( \sum_{i=1}^{\infty} \|Te_i\|_2^2 \Big)^{1/2}, \]

respectively.
Proposition 29 The HS norm value of $T \in B(H_1, H_2)$ does not depend on the choice of orthonormal basis $\{e_i\}$.
Proof: Let $\{e_{1,i}\}, \{e_{2,j}\}$ be arbitrary orthonormal bases of Hilbert spaces $H_1, H_2$, and let $T_1, T_2 \in B(H_1, H_2)$. Then, for $T_k e_{1,i} = \sum_{j=1}^{\infty} \langle T_k e_{1,i}, e_{2,j} \rangle_2 e_{2,j}$, $T_k^* e_{2,j} = \sum_{i=1}^{\infty} \langle T_k^* e_{2,j}, e_{1,i} \rangle_1 e_{1,i}$, and $k = 1, 2$, we have

\[ \sum_{i=1}^{\infty} \langle T_1 e_{1,i}, T_2 e_{1,i} \rangle_2 = \sum_{i=1}^{\infty} \sum_{j=1}^{\infty} \langle T_1 e_{1,i}, e_{2,j} \rangle_2 \langle T_2 e_{1,i}, e_{2,j} \rangle_2 \]
\[ = \sum_{i=1}^{\infty} \sum_{j=1}^{\infty} \langle e_{1,i}, T_1^* e_{2,j} \rangle_1 \langle e_{1,i}, T_2^* e_{2,j} \rangle_1 = \sum_{j=1}^{\infty} \langle T_1^* e_{2,j}, T_2^* e_{2,j} \rangle_1 , \]

which means that both sides do not depend on the choices of $\{e_{1,i}\}, \{e_{2,j}\}$. In particular, if $T_1 = T_2 = T$, we see that $\|T\|_{HS}^2$ does not depend on the choices of $\{e_{1,i}\}, \{e_{2,j}\}$.

Proposition 30 An HS operator is compact.
Proof: Let $T \in B(H_1, H_2)$ be an HS operator, $x \in H_1$, and

\[ T_n x := \sum_{i=1}^{n} \langle Tx, e_{2,i} \rangle_2 e_{2,i} , \]

where $\{e_{2,i}\}$ is an orthonormal basis of $H_2$. Since the image of $T_n$ is finite dimensional, $T_n$ is compact. Thus, from the second item of Proposition 25, it is sufficient to show that $\|T - T_n\| \to 0$ as $n \to \infty$. However, since $(T - T_n)x = \sum_{i=n+1}^{\infty} \langle Tx, e_{2,i} \rangle_2 e_{2,i}$, we have that when $\|x\|_1 \le 1$,

\[ \|(T - T_n)x\|_2^2 = \sum_{i=n+1}^{\infty} \langle Tx, e_{2,i} \rangle_2^2 = \sum_{i=n+1}^{\infty} \langle x, T^* e_{2,i} \rangle_1^2 \le \sum_{i=n+1}^{\infty} \|T^* e_{2,i}\|_1^2 . \]

Because $T^*$ is an HS operator, the right-hand side converges to zero, where $T$ is an HS operator if and only if $T^*$ is an HS operator due to the derivation in Proposition 29.

Example 50 When an operator $T \in B(\mathbb{R}^m, \mathbb{R}^n)$, $m, n \ge 1$, is expressed by a matrix $T = (T_{i,j}) \in \mathbb{R}^{n \times m}$, the squared HS norm is the sum of the squares of the $mn$ elements of this matrix. In other words, the HS norm is the Frobenius norm:

\[ \|T\|_{HS}^2 = \sum_{j=1}^{m} \|T e_{X,j}\|^2 = \sum_{i=1}^{n} \|T^* e_{Y,i}\|^2 = \sum_{i=1}^{n} \sum_{j=1}^{m} T_{i,j}^2 , \]

where $e_{X,j} \in \mathbb{R}^m$ is a column vector such that the $j$th element is one and the other elements are zeros, and $e_{Y,i} \in \mathbb{R}^n$ is a column vector such that the $i$th element is one and the other elements are zeros.
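As a numerical illustration of Proposition 29 and Example 50, the following sketch compares the HS norm computed from the standard basis, from a randomly rotated orthonormal basis, and from the adjoint, against NumPy's Frobenius norm. The matrix sizes and the random seed are arbitrary assumptions made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 3, 4
T = rng.standard_normal((n, m))          # T : R^m -> R^n

# Sum of ||T e||^2 over the standard basis of R^m (rows of the identity)
hs_standard = sum(np.linalg.norm(T @ e) ** 2 for e in np.eye(m))

# Same sum over another orthonormal basis of R^m (columns of an orthogonal Q)
Q, _ = np.linalg.qr(rng.standard_normal((m, m)))
hs_rotated = sum(np.linalg.norm(T @ Q[:, i]) ** 2 for i in range(m))

# Sum of ||T^* e||^2 over the standard basis of R^n
hs_adjoint = sum(np.linalg.norm(T.T @ e) ** 2 for e in np.eye(n))

frobenius_sq = np.linalg.norm(T, "fro") ** 2
print(hs_standard, hs_rotated, hs_adjoint, frobenius_sq)
```

All four printed values should agree up to floating-point rounding.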
Let $T \in B(H)$ be nonnegative definite and $\{e_j\}$ be an orthonormal basis of $H$. If

\[ \|T\|_{TR} := \sum_{j=1}^{\infty} \langle Te_j, e_j \rangle \]

is finite, we say that $\|T\|_{TR}$ is the trace norm of $T$ and that $T$ is of trace class. Similar to an HS norm value, a trace norm value does not depend on the choice of orthonormal basis $\{e_j\}$.
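If we substitute $x = e_j$ into (2.16) in Proposition 27, then we have $Te_j = \lambda_j e_j$ and obtain that

\[ \|T\|_{TR} := \sum_{j=1}^{\infty} \langle Te_j, e_j \rangle = \sum_{j=1}^{\infty} \lambda_j . \]

On the other hand, from

\[ \|T\|_{HS}^2 = \sum_{i=1}^{\infty} \sum_{j=1}^{\infty} \langle Te_i, e_j \rangle^2 = \sum_{j=1}^{\infty} \lambda_j^2 \]

and $0 \le \lambda_j \le \lambda_1$, we have

\[ \|T\|_{HS} \le \Big( \lambda_1 \sum_{i=1}^{\infty} \lambda_i \Big)^{1/2} = \big( \lambda_1 \|T\|_{TR} \big)^{1/2} . \]

Thus, by Proposition 30, we have established the following proposition.

Proposition 31 If $T \in B(H)$ is of trace class, then it is an HS operator and hence compact.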
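The relation between the trace norm and the HS norm can be illustrated with a Gram matrix, which is a nonnegative definite operator on a finite-dimensional space. This is a sketch only: the Gaussian kernel, the sample points, and the variable names are assumptions chosen for the example.

```python
import numpy as np

x = np.linspace(-1.0, 1.0, 6)
K = np.exp(-(x[:, None] - x[None, :]) ** 2)   # Gaussian Gram matrix (symmetric, nonnegative definite)

lam = np.linalg.eigvalsh(K)                   # eigenvalues lambda_j
trace_norm = np.trace(K)                      # sum_j <K e_j, e_j>
hs_norm = np.linalg.norm(K, "fro")            # HS norm = Frobenius norm

print(trace_norm, lam.sum())                  # trace norm equals the sum of eigenvalues
print(hs_norm ** 2, (lam ** 2).sum())         # squared HS norm equals the sum of squared eigenvalues
print(hs_norm, np.sqrt(lam.max() * trace_norm))  # HS norm <= sqrt(lambda_1 * trace norm)
```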

Appendix: Proofs of Propositions

Proof of Proposition 13
We show that a simple function approximates an arbitrary $f \in L^2$ and that a continuous function approximates an arbitrary simple function. Hereafter, we denote the $L^2$ norm by $\|\cdot\|$.
Since $f \in L^2$ is measurable, if $f$ is nonnegative, the sequence $\{f_n\}$ of simple functions defined by

\[ f_n(\omega) = \begin{cases} (k-1)2^{-n}, & (k-1)2^{-n} \le f(\omega) < k2^{-n},\ 1 \le k \le n2^n \\ n, & n \le f(\omega) \le \infty \end{cases} \]

satisfies $0 \le f_1(\omega) \le f_2(\omega) \le \cdots \le f(\omega)$ and $|f_n(\omega) - f(\omega)|^2 \to 0$ almost surely. Since the right-hand side of $|f_n(\omega) - f(\omega)|^2 \le 4\{f(\omega)\}^2$ is finite when integrated, from the dominated convergence theorem, we have

\[ \|f_n - f\|^2 \to 0 . \]

We can show a similar derivation for a general f that is not necessarily nonnegative,
as derived in Chap. 1.
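The dyadic construction of $f_n$ can be sketched numerically. The example below is illustrative only: the function $f(\omega) = e^{2\omega}$ on $[0, 1]$, the grid, and the helper name `simple_approx` are assumptions, and the $L^2$ error is approximated by a grid average.

```python
import numpy as np

def simple_approx(f_vals, n):
    """Dyadic simple-function approximation f_n of nonnegative values f_vals."""
    f_vals = np.asarray(f_vals, dtype=float)
    # (k-1) 2^{-n} on [(k-1) 2^{-n}, k 2^{-n}), and n wherever f >= n
    return np.where(f_vals >= n, float(n), np.floor(f_vals * 2 ** n) / 2 ** n)

omega = np.linspace(0.0, 1.0, 10001)
f = np.exp(2 * omega)                      # nonnegative and square integrable on [0, 1]

approximations = [simple_approx(f, n) for n in range(1, 11)]
errors = [np.sqrt(np.mean((fn - f) ** 2)) for fn in approximations]  # grid estimate of ||f_n - f||
for n, err in enumerate(errors, start=1):
    print(n, err)
```

The printed errors decrease toward zero, and each $f_n$ stays below $f$, in line with the monotone convergence used in the proof.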

On the other hand, let $A$ be a closed subset of $[a, b]$, and let $K_A$ be the indicator function ($K_A(e) = 1$ if $e \in A$; $K_A(e) = 0$ otherwise). If we define $h(x) := \inf_{y \in A} \{|x - y|\}$ and $g_n^A(x) := \dfrac{1}{1 + nh(x)}$, then $g_n^A$ is continuous, $g_n^A(x) \le 1$ for $x \in [a, b]$, $g_n^A(x) = 1$ for $x \in A$, and $\lim_{n \to \infty} g_n^A(x) = 0$ for $x \in B := [a, b] \setminus A$. Thus, we have

\[ \lim_{n \to \infty} \|g_n^A - K_A\| = \lim_{n \to \infty} \Big( \int_B g_n^A(x)^2 dx \Big)^{1/2} = \Big( \int_B \lim_{n \to \infty} g_n^A(x)^2 dx \Big)^{1/2} = 0 , \]

where the second equality follows from the dominated convergence theorem. Moreover, if $A, A'$ are disjoint, then $\alpha g_n^A + \alpha' g_n^{A'}$ with $\alpha, \alpha' > 0$ approximates $\alpha K_A + \alpha' K_{A'}$. In fact, we have

\[ \|\alpha g_n^A + \alpha' g_n^{A'} - (\alpha K_A + \alpha' K_{A'})\| \le \alpha \|g_n^A - K_A\| + \alpha' \|g_n^{A'} - K_{A'}\| . \]

Hence, a sequence of continuous functions can approximate an arbitrary simple function.

Proof of Proposition 14
Suppose that $\{f_n\}$ is a Cauchy sequence in $L^2$, which means that

\[ \lim_{N \to \infty} \sup_{m, n \ge N} \|f_m - f_n\|_2 = 0 . \tag{2.18} \]

Then, there exists a sequence $\{n_k\}$ such that

\[ \Big\| \sum_{k=1}^{\infty} |f_{n_{k+1}} - f_{n_k}| \Big\|_2 \le \sum_{k=1}^{\infty} \|f_{n_{k+1}} - f_{n_k}\|_2 < \infty . \]

Thus, almost surely, we have

\[ \sum_{k=1}^{\infty} |f_{n_{k+1}}(x) - f_{n_k}(x)| < \infty . \tag{2.19} \]

For arbitrary $r < t$ and $x \in E$, from the triangle inequality, we have

\[ |f_{n_r}(x) - f_{n_t}(x)| \le \sum_{k=r}^{t-1} |f_{n_{k+1}}(x) - f_{n_k}(x)| . \]

Combined with (2.19), the real sequence $\{f_{n_k}(x)\}_{k=1}^{\infty}$ is almost surely Cauchy. Since the real numbers are complete (Proposition 6), we define $f(x) := \lim_{k \to \infty} f_{n_k}(x)$ for $x \in E$ such that $\{f_{n_k}(x)\}_{k=1}^{\infty}$ is Cauchy, and we define $f(x) := 0$ for the other $x \in E$. From (2.18), for an arbitrary $\epsilon > 0$, we have

as n → ∞, where the first inequality is due to Fatou’s lemma. Furthermore, since

f n , f − f n ∈ L 2 and L 2 is a linear space, we have f ∈ L 2 .

Proof of Proposition 15
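The first item holds because

\[ 0 \le \Big\| x - \sum_{i=1}^{n} \langle x, e_i \rangle e_i \Big\|^2 = \|x\|^2 - \sum_{i=1}^{n} \langle x, e_i \rangle^2 \]

for all $n$. For the second item, letting $n > m$, $s_n := \sum_{k=1}^{n} \langle x, e_k \rangle e_k$, we have

\[ \|s_n - s_m\|^2 = \Big\langle \sum_{k=m+1}^{n} \langle x, e_k \rangle e_k, \sum_{k=m+1}^{n} \langle x, e_k \rangle e_k \Big\rangle = \sum_{k=m+1}^{n} |\langle x, e_k \rangle|^2 , \]

which diminishes as $n, m \to \infty$ according to the first item. For the third item, we have

\[ \|s_n - s_m\|^2 = \Big\langle \sum_{k=m+1}^{n} \alpha_k e_k, \sum_{k=m+1}^{n} \alpha_k e_k \Big\rangle = \sum_{k=m+1}^{n} \alpha_k^2 = S_n - S_m \]

for $s_n := \sum_{i=1}^{n} \alpha_i e_i$, $S_n := \sum_{i=1}^{n} \alpha_i^2$, and $n > m$. Thus, the third item follows from the equivalence: $\{s_n\}$ is Cauchy $\iff$ $\{S_n\}$ is Cauchy.
The last item holds because $\langle y, e_i \rangle = \lim_{n \to \infty} \big\langle \sum_{j=1}^{n} \alpha_j e_j, e_i \big\rangle = \alpha_i$ for $y = \sum_{j=1}^{\infty} \alpha_j e_j$, which follows from the continuity of inner products (Proposition 9).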

Proof of Proposition 16
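For 1. $\Longrightarrow$ 6., since $\{e_i\}$ is an orthonormal basis of $H$, we may write an arbitrary $x \in H$ as $x = \sum_{i=1}^{\infty} \alpha_i e_i$, $\alpha_i \in \mathbb{R}$. From the fourth item of Proposition 15, we have $\alpha_i = \langle x, e_i \rangle$ and obtain 6. 6. $\Longrightarrow$ 5. is obtained by substituting $x = \sum_{i=1}^{\infty} \langle x, e_i \rangle e_i$, $y = \sum_{i=1}^{\infty} \langle y, e_i \rangle e_i$ into $\langle x, y \rangle$. 5. $\Longrightarrow$ 4. is obtained by substituting $x = y$ in 5. 4. $\Longrightarrow$ 3. is due to

\[ \Big\| x - \sum_{k=1}^{n} \langle x, e_k \rangle e_k \Big\|^2 = \|x\|^2 - \sum_{k=1}^{n} |\langle x, e_k \rangle|^2 \to 0 \]

as $n \to \infty$ for each $x \in H$. For 3. $\Longrightarrow$ 2., note the implication $\langle x, e_k \rangle = 0$, $k = 1, 2, \ldots$ $\Longrightarrow$ $x \perp \mathrm{span}\{e_k\}$, which implies that $x \perp \overline{\mathrm{span}}\{e_k\}$ from the continuity of inner products (Proposition 9). Thus, we have $\langle x, x \rangle = 0$ and $x = 0$. For 2. $\Longrightarrow$ 1., from the second item of Proposition 15, $y = \sum_{i=1}^{\infty} \langle z, e_i \rangle e_i$ converges for each $z \in H$. Therefore, for each $j$, we have

\[ \langle z - y, e_j \rangle = \langle z, e_j \rangle - \lim_{n \to \infty} \Big\langle \sum_{i=1}^{n} \langle z, e_i \rangle e_i, e_j \Big\rangle = \langle z, e_j \rangle - \langle z, e_j \rangle = 0 . \]

From the assumption of 2., we have that $z - y = 0$ and that $z$ can be written as $\sum_{i=1}^{\infty} \langle z, e_i \rangle e_i$.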

Proof of Proposition 19
Let $M$ be a closed subspace of $H$. We show that for each $x \in H$, there exists a unique $y \in M$ that minimizes $\|x - y\|$ and that we have

\[ \langle x - y, z - y \rangle \le 0 \tag{2.20} \]

for $z \in M$. To this end, we first show that any sequence $\{y_n\}$ in $M$ for which

\[ \lim_{n \to \infty} \|x - y_n\|^2 = \inf_{y \in M} \|x - y\|^2 \tag{2.21} \]

is Cauchy. Since $M$ is a linear space, we have $(y_n + y_m)/2 \in M$ and

\[ \|y_n - y_m\|^2 = 2\|x - y_n\|^2 + 2\|x - y_m\|^2 - 4\Big\| x - \frac{y_n + y_m}{2} \Big\|^2 \]
\[ \le 2\|x - y_n\|^2 + 2\|x - y_m\|^2 - 4 \inf_{y \in M} \|x - y\|^2 \to 0 . \]

Hence, $\{y_n\}$ is Cauchy, and its limit lies in $M$ because $M$ is closed. Then, suppose that more than one minimizer $y$ exists, and let $u \neq v$ be two of them. For example, let $\{y_n\}$ satisfy (2.21) with $y_{2m-1} \to u$ and $y_{2m} \to v$. However, this sequence is not Cauchy, which contradicts the discussion shown thus far. Hence, the $y$ that achieves the limit in (2.21) is unique. In the following, we assume that $y$ attains the infimum.
Moreover, note that

\[ \|x - \{az + (1 - a)y\}\|^2 \ge \|x - y\|^2 \iff 2a \langle x - y, z - y \rangle \le a^2 \|z - y\|^2 \]

for arbitrary $0 < a < 1$ and $z \in M$, and if $\langle x - y, z - y \rangle > 0$, the inequality flips for small $a > 0$. Thus, we have $\langle x - y, z - y \rangle \le 0$.
Finally, if we substitute $z = 0, 2y$ into (2.20), we have $\langle x - y, y \rangle = 0$. Therefore, (2.20) implies that $\langle x - y, z \rangle \le 0$ for $z \in M$. We obtain the proposition by replacing $z$ with $-z$.
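In a finite-dimensional setting, the minimizer $y$ is the least-squares projection, and the conclusion $\langle x - y, z \rangle = 0$ for $z \in M$ can be checked directly. The sketch below assumes $H = \mathbb{R}^6$ and takes $M$ to be the column span of a random matrix `B`; these choices are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(2)
B = rng.standard_normal((6, 3))        # columns span the subspace M of R^6
x = rng.standard_normal(6)

# y = argmin_{y in M} ||x - y||, obtained by least squares
coef, *_ = np.linalg.lstsq(B, x, rcond=None)
y = B @ coef

# <x - y, b> for each spanning vector b of M; these vanish, so x - y is orthogonal to M
residual_inner = B.T @ (x - y)

# Any other element z of M is at least as far from x as y is
z = B @ rng.standard_normal(3)
print(residual_inner, np.linalg.norm(x - y), np.linalg.norm(x - z))
```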

Proof of Proposition 22
If the operator $T$ maps every element to zero, then $e_T = 0$ satisfies the desired condition. Thus, we assume that $T$ outputs a nonzero value for at least one input. From the first item of Proposition 20, $\mathrm{Ker}(T)^\perp$ is a closed subset of $H$ and contains a $y$ such that $Ty = 1$. Thus, for an arbitrary $x \in H$, we have

\[ T(x - (Tx)y) = Tx - Tx \cdot Ty = 0 \]

and $x - (Tx)y \in \mathrm{Ker}(T)$. Since $y \in \mathrm{Ker}(T)^\perp$, we have $\langle x - (Tx)y, y \rangle = 0$ and

\[ \langle x, y \rangle = Tx \langle y, y \rangle = Tx \|y\|^2 . \]

Thus, $e_T = y / \|y\|^2$ satisfies the desired condition.
To demonstrate uniqueness, if $e_T'$ satisfies the same condition, then $\langle x, e_T - e_T' \rangle = 0$ for any $x \in H$, which means that $e_T = e_T'$.
Furthermore, since $Tx = \langle x, e_T \rangle \le \|x\| \|e_T\|$ for $x \in H$, we have that $\|T\| \le \|e_T\|$ when $\|x\| = 1$. Additionally, we obtain the inverse inequality $\|e_T\| = \dfrac{1}{\|y\|} = \dfrac{Ty}{\|y\|} \le \|T\|$.

Proof of Proposition 25
For the first item, note that if $\{x_n\}$ is bounded, so is $\{Tx_n\}$. Moreover, if the image of $T$ is finite dimensional, then $\{Tx_n\}$ has a convergent subsequence (Proposition 7)⁹. For the second item, we use the so-called diagonal argument. In the following, we denote the norms of $H_1, H_2$ by $\|\cdot\|_1, \|\cdot\|_2$. Let $\{x_k\}$ be a bounded sequence in $X_1$. From the compactness of $T_1$, there exists $\{x_{1,k}\} \subseteq \{x_{0,k}\} := \{x_k\}$ such that $\{T_1 x_{1,k}\}$ converges to a $y_1 \in H_2$ as $k \to \infty$. Then, there exists $\{x_{2,k}\} \subseteq \{x_{1,k}\}$ such that $\{T_2 x_{2,k}\}$ converges to a $y_2 \in H_2$ as $k \to \infty$. If we repeat this process, the sequence $\{y_n\}$ in $H_2$ converges. In fact, for each $n$, there exists a large $k_n$ such that

\[ \|T_n x_{n,k} - y_n\|_2 < \frac{1}{n} , \quad k \ge k_n . \]

If we make $\{k_n\}$ monotone, then for $m < n$, we obtain

\[ y_m - y_n = (y_m - T_m x_{n,k_n}) + (T_n x_{n,k_n} - y_n) + (T_m x_{n,k_n} - T x_{n,k_n}) + (T x_{n,k_n} - T_n x_{n,k_n}) . \]

Thus, as $m, n \to \infty$, we have

\[ \|y_m - y_n\| \le \frac{1}{m} + \frac{1}{n} + \|T_m - T\| \cdot \|x_{n,k_n}\|_1 + \|T_n - T\| \cdot \|x_{n,k_n}\|_1 \to 0 . \]

Since $H_2$ is complete, there exists a $y \in H_2$ to which $\{y_n\}$ converges. Since

\[ \|T x_{n,k_n} - y\|_2 \le \|T - T_n\| \|x_{n,k_n}\|_1 + \|T_n x_{n,k_n} - y_n\|_2 + \|y_n - y\|_2 \to 0 \]

as $n \to \infty$, we have shown that $\{Tx_n\}$ has a convergent subsequence in $H_2$.

⁹ This statement is called Bolzano-Weierstrass's theorem for sequential compactness rather than Heine-Borel's theorem. The two theorems coincide for metric spaces.

Proof of Proposition 26
By induction, we show that

\[ \sum_{j=1}^{n} c_j e_j = 0 \Longrightarrow c_1 = c_2 = \cdots = c_n = 0 . \tag{2.22} \]

For $n = 2$, suppose that $c_1 e_1 + c_2 e_2 = 0$. Then, we have $T(c_1 e_1 + c_2 e_2) = \lambda_1 c_1 e_1 + \lambda_2 c_2 e_2 = 0$. From these two equations and $\lambda_1 \neq \lambda_2$, we have $c_1 = c_2 = 0$. Thus, we obtain (2.22) for $n = 2$. Assuming (2.22) for $n = k$, $\sum_{j=1}^{k+1} c_j e_j = 0$ and $\sum_{j=1}^{k+1} \lambda_j c_j e_j = 0$ imply that

\[ 0 = \lambda_{k+1} \sum_{j=1}^{k+1} c_j e_j - \sum_{j=1}^{k+1} \lambda_j c_j e_j = \sum_{j=1}^{k} (\lambda_{k+1} - \lambda_j) c_j e_j . \]

Since $\lambda_{k+1} \neq \lambda_j$, if we set $c_j' := (\lambda_{k+1} - \lambda_j) c_j$, then from $\sum_{j=1}^{k} c_j' e_j = 0$ and the assumption of induction, we have $c_1' = \cdots = c_k' = 0$, which means that $c_1 = \cdots = c_k = 0$ and $c_{k+1} e_{k+1} = -\sum_{j=1}^{k} c_j e_j = 0$. Thus, $c_{k+1} = 0$. Moreover, under the condition that $T$ is self-adjoint, from $\langle e_i, e_j \rangle = \langle e_i, \lambda_j^{-1} T e_j \rangle = \lambda_j^{-1} \langle T e_i, e_j \rangle = \lambda_j^{-1} \lambda_i \langle e_i, e_j \rangle$ and $\lambda_i \neq \lambda_j$ for $i \neq j$, we have $\langle e_i, e_j \rangle = 0$. Thus, the $\{e_j\}$ are orthogonal.

Proof of Proposition 27
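We first show that

\[ \mathrm{Ker}(T)^\perp = \overline{\mathrm{Im}(T^*)} . \tag{2.23} \]

For $x_1 \in \mathrm{Ker}(T)$ and $x_2 \in H$, we see that $\langle x_1, T^* x_2 \rangle_1 = \langle T x_1, x_2 \rangle_2 = 0$ and that $x_1$ is orthogonal to any element of $\mathrm{Im}(T^*)$. Thus, we have

\[ \mathrm{Ker}(T) \subseteq (\mathrm{Im}(T^*))^\perp . \]

Moreover, if $x_1 \in (\mathrm{Im}(T^*))^\perp$, from $T^*(T x_1) \in \mathrm{Im}(T^*)$, we have

\[ \|T x_1\|^2 = \langle x_1, T^* T x_1 \rangle_1 = 0 , \]

which means that $x_1 \in \mathrm{Ker}(T)$ and establishes the inverse inclusion. Thus, we have shown that $\mathrm{Ker}(T) = (\mathrm{Im}(T^*))^\perp$. Furthermore, if we apply the third item of Proposition 20, we obtain

\[ (\mathrm{Ker}(T))^\perp = \overline{\mathrm{Im}(T^*)} . \]

Note that since $\mathrm{Ker}(T)$ is an orthogonal complement of the subset $\mathrm{Im}(T^*)$ of $H$, the first item of Proposition 20 and (2.11) can be applied. Since $T \in B(H)$ is self-adjoint ($T^* = T$), we can write (2.23) further as

\[ H = \mathrm{Ker}(T) \oplus \overline{\mathrm{Im}(T)} . \]

Hence, in order to show (2.16), it is sufficient to prove that

\[ \overline{\mathrm{Im}(T)} = \overline{\mathrm{span}}\{e_j : j \ge 1\} . \tag{2.24} \]

Note that for each finite $n = 1, 2, \ldots$ and $c_1, c_2, \ldots, c_n \in \mathbb{R}$, we have

\[ \sum_{j=1}^{n} c_j e_j = T\Big( \sum_{j=1}^{n} \lambda_j^{-1} c_j e_j \Big) \]

and $\mathrm{span}\{e_j \mid j \ge 1\} \subseteq \mathrm{Im}(T)$. Even if we take the closure on both sides, the inclusion relation does not change. Thus, we have $\overline{\mathrm{span}}\{e_j \mid j \ge 1\} \subseteq \overline{\mathrm{Im}(T)}$. Furthermore, we decompose (2.11)

\[ \overline{\mathrm{Im}(T)} = \overline{\mathrm{span}}\{e_j \mid j \ge 1\} \oplus N , \]

where $N = \overline{\mathrm{span}}\{e_j \mid j \ge 1\}^\perp \cap \overline{\mathrm{Im}(T)}$. Note that $Ty \in \overline{\mathrm{span}}\{e_j \mid j \ge 1\}$ for $y \in \overline{\mathrm{span}}\{e_j \mid j \ge 1\}$, and

\[ \langle Tx, y \rangle = \langle x, Ty \rangle = 0 \]

for $x \in N$ because $T$ is self-adjoint. Thus, we have $Tx \in N$.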

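Now, in general, we have

\[ \|T\| = w(T) := \sup_{\|x\| = 1} |\langle Tx, x \rangle| . \tag{2.25} \]

In fact,

\[ |\langle Tx, y \rangle| = \Big| \frac{1}{4} \langle T(x + y), x + y \rangle - \frac{1}{4} \langle T(x - y), x - y \rangle \Big| \]
\[ \le \frac{1}{4} |\langle T(x + y), x + y \rangle| + \frac{1}{4} |\langle T(x - y), x - y \rangle| \]
\[ \le \frac{1}{4} w(T) (\|x + y\|^2 + \|x - y\|^2) = \frac{1}{2} w(T) (\|x\|^2 + \|y\|^2) , \]

and if we take the supremum under $\|x\| = \|y\| = 1$, we obtain

\[ \|T\| = \sup_{\|x\| = 1} \Big\langle Tx, \frac{Tx}{\|Tx\|} \Big\rangle \le \sup_{\|x\| = \|y\| = 1} \langle Tx, y \rangle \le w(T) . \]

On the other hand, we have

\[ w(T) \le \sup_{\|x\| = 1} \|Tx\| \cdot \|x\| = \sup_{\|x\| = 1} \|Tx\| = \|T\| \]

and (2.25).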
In addition, we know that either $\|T\|$ or $-\|T\|$ is an eigenvalue of $T$. In fact, from (2.25), there exists a sequence $\{x_n\}$ in $H$ with $\|x_n\| = 1$ such that $\langle Tx_n, x_n \rangle \to \|T\|$ or $\langle Tx_n, x_n \rangle \to -\|T\|$ (the upper and lower limits are convergence points). For the former case, we have

\[ 0 \le \|Tx_n - \|T\| x_n\|^2 = \|Tx_n\|^2 + \|T\|^2 \|x_n\|^2 - 2\|T\| \langle Tx_n, x_n \rangle \to 0 . \]

From the compactness of $T$, there exists $\{x_{n_k}\}$ ($\subseteq \{x_n\}$) such that $Tx_{n_k} \to y \in H$. From $Tx_{n_k} - \|T\| x_{n_k} \to 0$, there exists $0 \neq x \in H$ such that $\|T\| x_{n_k} \to \|T\| x$. From $Tx = y = \lim_{k \to \infty} \|T\| x_{n_k}$, we have that $Tx = \|T\| x$ and that $\|T\|$ is an eigenvalue of $T$. For the latter case, $-\|T\|$ is an eigenvalue of $T$.
Finally, we assume that there exists an $x \in N$ such that $Tx \neq 0$. Let $T_N$ be the restriction of $T$ on $N$. Because $\|T_N\| > 0$, either $\|T_N\|$ or $-\|T_N\|$ is an eigenvalue of $T$. The existence of an eigenvalue on $N$ contradicts the chosen orthonormal basis $\{e_j\}_{j=1}^{\infty}$. Therefore, when $x \in N$, we have $Tx = 0$, which means that $N \subseteq \overline{\mathrm{Im}(T)} \cap \mathrm{Ker}(T) = \{0\}$. Thus, we have established (2.16).

Exercises 16∼30

16. Choose the closed sets among the sets below. For the nonclosed sets, find their
closures.
(a) $\cup_{n=1}^{\infty} [n - \frac{1}{n}, n + \frac{1}{n}]$;

(b) {2, 3, 5, 7, 11, 13, . . .};

(c) R ∩ Z;
(d) {(x, y) ∈ R2 | x 2 + y 2 < 1 when x ≥ 0, x 2 + y 2 ≤ 1 when x < 0 }.
17. Show that the sequence $a_1 = 1$, $a_{n+1} = \dfrac{1}{2} a_n + \dfrac{1}{a_n}$ converges to $\sqrt{2}$ as $n \to \infty$.
18. Let f : M → R be a function defined over a bounded closed set M, and we
define (z 1 ), . . . , (z m ) for some m ≥ 1 and z 1 , . . . , z m such that

d(x, z) < (z)=

⇒d( f (x), f (z)) <

for z ∈ M.

(a) Why can the neighborhoods cover M ?

1
Let x, y ∈ M satisfy d1 (x, y) < δ := min (z i ). Without loss of generality,
2 1≤i≤m
we assume that x ∈ Ui with a center at z i and a radius of (z i )/2. Prove the
following.
(a) d1 (x, z i ) < 21 (z i ) < (z i ).
(b) d1 (y, z i ) ≤ d1 (x, y) + d1 (x, z i ) < (z i ).
(c) d2 ( f (x), f (y)) ≤ d2 ( f (x), f (z i )) + d2 ( f (y), f (z i )) < + = 2.
(d) f is uniformly continuous.
19. Using the fact that any continuous function over a bounded closed set is uniformly continuous, show that a continuous function over [0, 1] is Riemann integrable.
20. Show that equality holds in the Cauchy-Schwarz inequality (2.5) if and only if one of x, y is a constant multiple of the other.
21. Show that a one-indeterminate polynomial ring A is an algebra. In addition,
show that the set of functions f ∈ A over E := [0, 1] is dense in C(E).
22. Derive Riesz-Fischer’s theorem stating that “L 2 is complete” (Proposition 14)
according to the following steps in the appendix.
(a) Let { f n } be an arbitrary Cauchy sequence.

(b) There exists a sequence $\{n_k\}$ such that $\sum_{k=1}^{\infty} \|f_{n_{k+1}} - f_{n_k}\|_2 < \infty$.
(c) Prove the existence of an $f : E \to \mathbb{R}$ such that $\mu\{x \in E \mid \lim_{k \to \infty} f_{n_k}(x) = f(x)\} = \mu(E)$.
(d) Show that $\|f_n - f\| \to 0$ and $f \in L^2[a, b]$.
23. Show that the basis of the Fourier series expansion

\[ \Big\{ \frac{1}{\sqrt{2\pi}}, \frac{\cos x}{\sqrt{\pi}}, \frac{\sin x}{\sqrt{\pi}}, \frac{\cos 2x}{\sqrt{\pi}}, \frac{\sin 2x}{\sqrt{\pi}}, \cdots \Big\} \]

is orthonormal.
24. Derive Proposition 19 according to the following steps in the appendix. What
are the derivations of (a) through (e)?
(a) Show that a sequence $\{y_n\}$ in $M$ for which

\[ \lim_{n \to \infty} \|x - y_n\|^2 = \inf_{y \in M} \|x - y\|^2 \]

converges in $M$. Hereafter, let $y$ satisfy $y_n \to y \in M$.
(b) Show that $2a \langle x - y, z - y \rangle \le a^2 \|z - y\|^2$ for $0 < a < 1$ and $z \in M$.
(c) Show that the inequality $\langle x - y, z - y \rangle > 0$ leads to a contradiction.
(d) Show that $\langle x - y, z \rangle \le 0$.
(e) Obtain the proposition by replacing z with −z.
25. Show that the linear operator norm (2.12) satisfies the triangle inequality.

26. Show that the integral operator (2.13) is a bounded linear operator and that it is
self-adjoint when K is symmetric.
27. Let (M, d) be a metric space with M := R and a Euclidean distance d. Show that
each of the following E ⊆ M is not sequentially compact. Furthermore, show
that they are not compact without using the equivalence between compactness
and sequential compactness.
(a) E = [0, 1) and
(b) E = Q.
28. Proposition 27 is derived according to the following steps in the appendix. What
are the derivations of (a) through (c)?
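(a) Show that $H = \mathrm{Ker}(T) \oplus \overline{\mathrm{Im}(T)}$.
(b) Show that $\overline{\mathrm{span}}\{e_j \mid j \ge 1\} \subseteq \overline{\mathrm{Im}(T)}$.
(c) Show that $\overline{\mathrm{span}}\{e_j \mid j \ge 1\} \supseteq \overline{\mathrm{Im}(T)}$.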
Why do we need to show (2.25)?
29. Show that the HS and trace norms satisfy the triangle inequality.
30. Show that if T ∈ B(H ) is a trace class, then it is also an HS class, and show that
if T ∈ B(H ) is a trace class, it is also compact.
Chapter 3
Reproducing Kernel Hilbert Space

Thus far, we have learned that a feature map $E \ni x \mapsto k(x, \cdot)$ is obtained from a positive definite kernel $k : E \times E \to \mathbb{R}$. In this chapter, we generate a linear space $H_0$ based on its image $k(x, \cdot)$ ($x \in E$) and construct a Hilbert space $H$ by completing this linear space, where $H$ is called the reproducing kernel Hilbert space (RKHS), which satisfies the reproducing property of the kernel $k$ ($k$ is the reproducing kernel of $H$). In this chapter, we first understand that there is a one-to-one correspondence
between the kernel k and the RKHS H and that H0 is dense in H (via the Moore-
Aronszajn theorem). Furthermore, we introduce the RKHS represented by the sum
of RKHSs and apply it to Sobolev spaces. We prove Mercer’s theorem regarding
integral operators in the second half of this chapter and compute their eigenvalues
and eigenfunctions. This chapter is the core of the theory contained in this book, and
the later chapters correspond to its applications.

3.1 RKHSs

Let H be a Hilbert space whose elements are functions f : E → R.

A function $k : E \times E \to \mathbb{R}$ is said to be a reproducing kernel of a Hilbert space $H$ with an inner product $\langle \cdot, \cdot \rangle_H$ if it satisfies the following two conditions.
1. For each $x \in E$, we have

\[ k(x, \cdot) \in H . \tag{3.1} \]

2. Reproducing property: for each $f \in H$ and $x \in E$,

\[ f(x) = \langle f, k(x, \cdot) \rangle_H . \tag{3.2} \]

When H has a reproducing kernel, we say that H is a reproducing kernel Hilbert

space (RKHS). The reproducing property (3.2) is called a kernel trick.

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 61
J. Suzuki, Kernel Methods for Machine Learning with Math and Python,
https://doi.org/10.1007/978-981-19-0401-1_3

Example 51 Let $\{e_1, \ldots, e_p\}$ be an orthonormal basis of a finite-dimensional Hilbert space $H$. If we define

\[ k(x, y) := \sum_{i=1}^{p} e_i(x) e_i(y) \tag{3.3} \]

for $x, y \in E$, then we have $k(x, \cdot) \in H$ and

\[ \langle e_j(\cdot), k(x, \cdot) \rangle_H = \sum_{i=1}^{p} \langle e_j, e_i \rangle_H e_i(x) = e_j(x) \]

for each $1 \le j \le p$. Thus, for any $f(\cdot) = \sum_{i=1}^{p} f_i e_i(\cdot) \in H$, $f_i \in \mathbb{R}$, we have $\langle f(\cdot), k(x, \cdot) \rangle_H = f(x)$ (reproducing property). Therefore, $H$ is an RKHS, and (3.3) is a reproducing kernel.
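A small sketch of Example 51 follows. It assumes, purely for illustration, that $H$ is spanned by $1, x, x^2$ and that the inner product is defined so that these three functions are orthonormal; under that assumption the inner product of two elements is the dot product of their coefficient vectors. The names `basis`, `coeffs_of_k`, and `inner` are hypothetical helpers.

```python
import numpy as np

# Assumed orthonormal basis e_1, e_2, e_3 of H (orthonormal by definition of the inner product)
basis = [lambda x: 1.0, lambda x: x, lambda x: x ** 2]

def k(x, y):
    """Kernel (3.3): k(x, y) = sum_i e_i(x) e_i(y)."""
    return sum(e(x) * e(y) for e in basis)

def coeffs_of_k(x):
    """Coefficients of k(x, .) = sum_i e_i(x) e_i(.) in the basis."""
    return np.array([e(x) for e in basis])

def inner(a, b):
    """<sum_i a_i e_i, sum_i b_i e_i>_H = a . b for an orthonormal basis."""
    return float(a @ b)

f_coef = np.array([0.5, -1.0, 2.0])           # f = 0.5 - x + 2 x^2

def f(x):
    return sum(c * e(x) for c, e in zip(f_coef, basis))

x0 = 0.7
print(inner(f_coef, coeffs_of_k(x0)), f(x0))  # reproducing property: both values agree
```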
Proposition 32 The reproducing kernel $k$ of the RKHS $H$ is unique, symmetric ($k(x, y) = k(y, x)$), and nonnegative definite.
Proof: If $k_1, k_2$ are reproducing kernels of $H$, then by the reproducing property, we have that

\[ f(x) = \langle f, k_1(x, \cdot) \rangle_H = \langle f, k_2(x, \cdot) \rangle_H . \]

In other words,

\[ \langle f, k_1(x, \cdot) - k_2(x, \cdot) \rangle_H = 0 \]

holds for all $f \in H$, $x \in E$, which implies $k_1 = k_2$ (Proposition 16). Additionally, the symmetry of a reproducing kernel follows from that of the inner product:

\[ k(x, y) = \langle k(x, \cdot), k(y, \cdot) \rangle_H = \langle k(y, \cdot), k(x, \cdot) \rangle_H = k(y, x) . \]

The nonnegative definiteness of the reproducing kernel can be shown as follows.

\[ \sum_{i=1}^{n} \sum_{j=1}^{n} z_i z_j k(x_i, x_j) = \sum_{i=1}^{n} \sum_{j=1}^{n} z_i z_j \langle k(x_i, \cdot), k(x_j, \cdot) \rangle_H = \Big\langle \sum_{i=1}^{n} z_i k(x_i, \cdot), \sum_{j=1}^{n} z_j k(x_j, \cdot) \Big\rangle_H \ge 0 . \]
Proposition 33 A Hilbert space $H$ is an RKHS if and only if $T_x(f) = f(x)$ ($f \in H$) is bounded at each $x \in E$ for the linear functional $T_x : H \ni f \mapsto f(x) \in \mathbb{R}$.
Proof: If $H$ has a reproducing kernel $k$, then at each $x \in E$, we have

\[ \langle f(\cdot), k(x, \cdot) \rangle_H = T_x(f) , \quad f \in H . \]

Thus, we have

\[ |T_x(f)| = |\langle f(\cdot), k(x, \cdot) \rangle_H| \le \|f\| \cdot \|k(x, \cdot)\| = \|f\| \sqrt{k(x, x)} . \]

Conversely, if the linear functional $T_x(f) = f(x)$ is bounded for $x \in E$, from Proposition 22, there exists a $k_x : E \to \mathbb{R}$ such that

\[ \langle f(\cdot), k_x(\cdot) \rangle_H = f(x) , \quad f \in H . \]

In other words, a reproducing kernel exists.

In Proposition 32, we showed that a reproducing kernel is unique once its RKHS
is determined, but the following proposition asserts the converse.
Proposition 34 (Aronszajn [1]) Let $k : E \times E \to \mathbb{R}$ be a positive definite kernel. Then, the Hilbert space $H$ with the reproducing kernel $k$ is unique. Moreover, $k(x, \cdot) \in H$ holds for each $x \in E$, and the linear space generated by $\{k(x, \cdot) \mid x \in E\}$ is dense in $H$.
The proof is given by the following procedure.
1. Define the inner product $\langle \cdot, \cdot \rangle_{H_0}$ of $H_0 := \mathrm{span}\{k(x, \cdot) \mid x \in E\}$.
2. For any Cauchy sequence $\{f_n\}$ in $H_0$ and each $x \in E$, the real sequence $\{f_n(x)\}$ is a Cauchy sequence, and we have the limit $f(x) := \lim_{n \to \infty} f_n(x)$ (Proposition 6). Let $H$ be the set of such $f$.
3. Define the inner product $\langle \cdot, \cdot \rangle_H$ of the linear space $H$.
4. Show that $H_0$ is dense in $H$.
5. Show that any Cauchy sequence { f n } in H converges to some element of H as
n → ∞ (completeness of H ).
6. Show that k is a reproducing kernel of H .
7. Show that such an H is unique.
See the appendix at the end of the chapter for details1 .

Example 52 (Linear Kernel) Let $\langle \cdot, \cdot \rangle_E$ be the inner product of $E := \mathbb{R}^d$. Then, the linear space

\[ H := \{ \langle x, \cdot \rangle_E \mid x \in E \} \]

is complete since it has finite dimensions (Proposition 6). Moreover, $H$ is an RKHS with the reproducing kernel $k(x, y) = \langle x, y \rangle_E$, $x, y \in E$.

Example 53 Let $E$ be a finite set $\{x_1, \ldots, x_n\}$, and let $k : E \times E \to \mathbb{R}$ be a positive definite kernel; then, the linear space

\[ H := \Big\{ \sum_{i=1}^{n} \alpha_i k(x_i, \cdot) \,\Big|\, \alpha_1, \ldots, \alpha_n \in \mathbb{R} \Big\} \]

is a reproducing kernel Hilbert space. We define the inner product by

\[ \langle f(\cdot), g(\cdot) \rangle_H = a^\top K b \]

for $f(\cdot), g(\cdot) \in H$, where $f(\cdot) = \sum_{j=1}^{n} a_j k(x_j, \cdot) \in H$, $a = [a_1, \ldots, a_n]^\top \in \mathbb{R}^n$ and $g(\cdot) = \sum_{j=1}^{n} b_j k(x_j, \cdot) \in H$, $b = [b_1, \ldots, b_n]^\top \in \mathbb{R}^n$ via the Gram matrix

\[ K := \begin{bmatrix} k(x_1, x_1) & \cdots & k(x_1, x_n) \\ \vdots & \ddots & \vdots \\ k(x_n, x_1) & \cdots & k(x_n, x_n) \end{bmatrix} . \]

Then, for each $x_i$, $i = 1, \ldots, n$, we have

\[ \langle f(\cdot), k(x_i, \cdot) \rangle_H = [a_1, \ldots, a_n] K e_i = \sum_{j=1}^{n} a_j k(x_j, x_i) = f(x_i) \]

(reproducing property), where $e_i$ is an $n$-dimensional column vector whose $i$th component is 1 and whose other components are 0.

¹ The proof is due to [33].
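The finite-set construction of Example 53 translates directly into matrix operations. The following sketch assumes a Gaussian kernel and four sample points, both chosen only for illustration, and checks the reproducing property $\langle f, k(x_i, \cdot) \rangle_H = a^\top K e_i = f(x_i)$.

```python
import numpy as np

def gauss(x, y):
    """Gaussian kernel, used here as an example of a positive definite kernel."""
    return np.exp(-(x - y) ** 2)

pts = np.array([-1.0, 0.0, 0.5, 2.0])          # the finite set E = {x_1, ..., x_n}
K = gauss(pts[:, None], pts[None, :])          # Gram matrix
n = len(pts)

a = np.array([1.0, -2.0, 0.5, 3.0])            # f = sum_j a_j k(x_j, .)
b = np.array([0.3, 0.0, -1.0, 2.0])            # g = sum_j b_j k(x_j, .)

f_at_pts = K @ a                               # f(x_i) = sum_j a_j k(x_j, x_i)
reproduced = np.array([a @ K @ np.eye(n)[:, i] for i in range(n)])  # <f, k(x_i, .)>_H = a^T K e_i
inner_fg = a @ K @ b                           # <f, g>_H = a^T K b

print(f_at_pts, reproduced, inner_fg)
```

The two printed vectors coincide, which is exactly the reproducing property on $E$.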
Example 54 (Polynomial Kernel) Let $\langle \cdot, \cdot \rangle_E$ be the inner product between the elements in $E$. The Hilbert space $H$ obtained by completing the linear space $H_0$ generated by $(\langle x, \cdot \rangle_E + 1)^d \in \mathbb{R}$ ($x \in E$) is an RKHS with the reproducing kernel $k(x, y) = (\langle x, y \rangle_E + 1)^d$ for $x, y \in E$.
Example 55 Let $k(x, y)$ be the kernel expressed by a function $\phi(x - y)$ as considered in Sect. 1.5. If we require $k(x, y)$ to take real values, the associated probability density functions must be even functions such as those of the Gaussian and Laplace distributions. Otherwise, since the imaginary part of $t \mapsto e^{i(x-y)t}$ is odd, the kernel $k$ might take imaginary values. Now, using $L^2(E, \eta) \ni F : E = \mathbb{R} \to \mathbb{C}$ whose real and imaginary parts are even and odd, respectively, we consider the linear space consisting of $f : E \to \mathbb{R}$ with $f(x) = \int_E F(t) e^{ixt} d\eta(t)$. The map $F(t) \mapsto f(x) = \int_E F(t) e^{ixt} d\eta(t)$ is injective (if $\int_E F(t) e^{ixt} d\eta(t) = 0$, then by the inverse Fourier transform, $F(t) = 0$). If its inner product is $\langle f, g \rangle_H = \int_E F(t) \overline{G(t)} d\eta(t)$ for $F, G \in L^2(E, \eta)$, then $L^2(E, \eta)$ and

\[ H = \Big\{ E \ni x \mapsto \int_E F(t) e^{ixt} d\eta(t) \in \mathbb{R} \,\Big|\, F \in L^2(E, \eta) \Big\} \]

are isomorphic as inner product spaces. Note that $H$ has a reproducing kernel $E \times E \to \mathbb{R}$ with

\[ k(x, y) = \int_E e^{-i(x-y)t} d\eta(t) . \]

In fact, we have $k(x, y) = \int_E e^{-ixt} e^{iyt} d\eta(t)$. Thus, if we set $G(t) = e^{-ixt}$, we obtain

\[ \langle f(\cdot), k(x, \cdot) \rangle_H = \int_E F(t) \overline{G(t)} d\eta(t) = \int_E F(t) e^{ixt} d\eta(t) = f(x) \]

for $f(y) = \int_E F(t) e^{iyt} d\eta(t)$ and $k(x, y) = \int_E G(t) e^{iyt} d\eta(t)$. For different kernels $k(x, y)$, such as the Gaussian and Laplacian kernels, the measure $\eta(t)$ will be different, and the corresponding RKHS $H$ will be different.
Example 56 Let $E := [0, 1]$. Using a real-valued function $F$ with $\int_0^1 F(u)^2 du < \infty$, we consider the set $H$ of functions $f : E \to \mathbb{R}$, $f(t) = \int_0^1 F(u) (t - u)_+^0 du$, where we denote $(z)_+^0 = 1$ and $(z)_+^0 = 0$ when $z \ge 0$ and when $z < 0$, respectively. The linear space $H$ is complete for the norm $\|f\|^2 = \int_0^1 F(u)^2 du$ (Proposition 14) if the inner product is $\langle f, g \rangle_H = \int_0^1 F(u) G(u) du$ for $f(t) = \int_0^1 F(u) (t - u)_+^0 du$ and $g(t) = \int_0^1 G(u) (t - u)_+^0 du$. This Hilbert space $H$ is the RKHS for $k(x, y) = \min\{x, y\}$. In fact, for each $z \in E$, we see that

\[ \langle f(z), k(x, z) \rangle_H = \Big\langle \int_0^1 F(u) (z - u)_+^0 du, \int_0^1 (x - u)_+^0 (z - u)_+^0 du \Big\rangle_H = \int_0^1 F(u) (x - u)_+^0 du = f(x) . \]
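The identity $\min\{x, z\} = \int_0^1 (x - u)_+^0 (z - u)_+^0 du$ behind Example 56 can be checked by numerical quadrature. The sketch below approximates integrals over $[0, 1]$ by a midpoint rule; the choice $F(u) = \cos 3u$, the grid size, and the helper names are assumptions made for the illustration.

```python
import numpy as np

N = 20000
u = (np.arange(N) + 0.5) / N                 # midpoint grid on [0, 1]

def step(t):
    """u -> (t - u)_+^0, i.e. 1 where u <= t and 0 elsewhere."""
    return (t - u >= 0).astype(float)

def inner(F, G):
    """Midpoint approximation of int_0^1 F(u) G(u) du."""
    return np.mean(F * G)

def k(x, z):
    """k(x, z) = int_0^1 (x - u)_+^0 (z - u)_+^0 du, which equals min{x, z}."""
    return inner(step(x), step(z))

F = np.cos(3 * u)

def f(t):
    """f(t) = int_0^1 F(u) (t - u)_+^0 du = sin(3 t) / 3 for this F."""
    return inner(F, step(t))

x = 0.37
print(k(0.3, 0.8), min(0.3, 0.8))
# k(x, .) corresponds to G(u) = (x - u)_+^0, so <f, k(x, .)>_H = int F(u) (x - u)_+^0 du
print(inner(F, step(x)), np.sin(3 * x) / 3)
```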

Thus far, we have obtained the RKHS corresponding to each positive definite
kernel, but a necessary condition exists for a Hilbert space H to be an RKHS. If that
condition is not satisfied, we can claim that it is not an RKHS.
Proposition 35 Let $H$ be an RKHS consisting of functions on $E$. If $\lim_{n \to \infty} \|f_n - f\|_H = 0$ for $f, f_1, f_2, \ldots \in H$, then for each $x \in E$, $\lim_{n \to \infty} |f_n(x) - f(x)| = 0$ holds.
Proof: In fact, we have that for each $x \in E$,

\[ |f_n(x) - f(x)| \le \|f_n - f\| \sqrt{k(x, x)} . \]

Example 57 $H := L^2[0, 1]$ is not an RKHS. In fact, for the sequence $\{f_n\}$ with $f_n(x) = x^n$, the squared norm satisfies $\|f_n\|_H^2 = \int_0^1 f_n^2(x) dx = \dfrac{1}{2n+1} \to 0$. Thus, for $f(x) = 0$ with $x \in E$, we have $\|f_n - f\|_H \to 0$, whereas $|f_n(1) - f(1)| = 1 \not\to 0$. If $H$ were an RKHS, this would contradict Proposition 35.

Example 57 illustrates that L 2 [0, 1] is too large, and as we will see in the next section,
the Sobolev space restricted to L 2 [0, 1] is an RKHS.
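The behavior in Example 57 is easy to see numerically: the $L^2[0, 1]$ norm of $f_n(x) = x^n$ shrinks while the value at $x = 1$ stays at one. The following sketch compares the closed form $1/(2n+1)$ with a midpoint-rule estimate; the grid size is an arbitrary assumption.

```python
import numpy as np

N = 200000
x = (np.arange(N) + 0.5) / N                       # midpoint grid on [0, 1]

ns = [1, 10, 100, 1000]
sq_norm_closed = [1.0 / (2 * n + 1) for n in ns]   # ||f_n||^2 = 1 / (2n + 1)
sq_norm_numeric = [np.mean(x ** (2 * n)) for n in ns]
values_at_one = [1.0 ** n for n in ns]             # f_n(1) = 1 for every n

for n, c, q, v in zip(ns, sq_norm_closed, sq_norm_numeric, values_at_one):
    print(n, c, q, v)   # the norm tends to 0, the point value does not
```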

3.2 Sobolev Space

We first show that if k1 , k2 are reproducing kernels, the sum k1 + k2 is also a repro-
ducing kernel. To this end, we show the following.
Proposition 36 If $H_1, H_2$ are Hilbert spaces, so is the direct product $F := H_1 \times H_2$ under the inner product

\[ \langle (f_1, f_2), (g_1, g_2) \rangle_F := \langle f_1, g_1 \rangle_{H_1} + \langle f_2, g_2 \rangle_{H_2} \tag{3.4} \]

for $f_1, g_1 \in H_1$, $f_2, g_2 \in H_2$.
Proof: From $\|(f_1, f_2)\|_F^2 = \|f_1\|_{H_1}^2 + \|f_2\|_{H_2}^2$, we have

\[ \|f_{1,n} - f_{1,m}\|_{H_1},\ \|f_{2,n} - f_{2,m}\|_{H_2} \le \sqrt{\|f_{1,n} - f_{1,m}\|_{H_1}^2 + \|f_{2,n} - f_{2,m}\|_{H_2}^2} = \|(f_{1,n}, f_{2,n}) - (f_{1,m}, f_{2,m})\|_F . \]

Thus, we have

\[ \{(f_{1,n}, f_{2,n})\} \text{ is Cauchy} \Longrightarrow \{f_{1,n}\}, \{f_{2,n}\} \text{ are Cauchy} \]
\[ \Longrightarrow f_1 \in H_1, f_2 \in H_2 \text{ exist such that } f_{1,n} \to f_1,\ f_{2,n} \to f_2 \]
\[ \Longrightarrow \|(f_{1,n}, f_{2,n}) - (f_1, f_2)\|_F = \|(f_{1,n} - f_1, f_{2,n} - f_2)\|_F = \sqrt{\|f_{1,n} - f_1\|_{H_1}^2 + \|f_{2,n} - f_2\|_{H_2}^2} \to 0 , \]

which means that $F$ is complete.

Let

\[ H := H_1 + H_2 := \{ f_1 + f_2 \mid f_1 \in H_1, f_2 \in H_2 \} \]

be the direct sum of $H_1, H_2$, and define the linear map from $F$ to $H$ by $u : F \ni (f_1, f_2) \mapsto f_1 + f_2 \in H$. Then, we can decompose $F$ into $N := u^{-1}(0)$ and its orthogonal complement $N^\perp$. If we restrict $u$ to $N^\perp$ to obtain the injection $v : N^\perp \to H$, then the bivariate function

\[ \langle f, g \rangle_H := \langle v^{-1}(f), v^{-1}(g) \rangle_F \tag{3.5} \]

for $f, g \in H$ forms an inner product. Note that $N^\perp$ is a closed subspace of the Hilbert space $F$.
Proposition 37 If the direct sum $H$ of Hilbert spaces $H_1, H_2$ has the inner product (3.5), then $H$ is complete (a Hilbert space).
Proof: Since $F$ is a Hilbert space (Proposition 36) and $N^\perp$ is its closed subset, $N^\perp$ is complete. Thus, we have

\[ \|f_n - f_m\|_H \to 0 \Longrightarrow \|v^{-1}(f_n - f_m)\|_F \to 0 \]
\[ \Longrightarrow g \in N^\perp \text{ exists such that } \|v^{-1}(f_n) - g\|_F \to 0 \]
\[ \Longrightarrow \|f_n - v(g)\|_H \to 0 , \quad v(g) \in H . \]

Proposition 38 (Aronszajn [1]) Let $k_1, k_2$ be the reproducing kernels of RKHSs $H_1, H_2$, respectively. Then, $k = k_1 + k_2$ is the reproducing kernel of the Hilbert space

\[ H := H_1 \oplus H_2 := \{ f_1 + f_2 \mid f_1 \in H_1, f_2 \in H_2 \} \]

such that the inner product is (3.5) and the norm is

\[ \|f\|_H^2 = \min_{f = f_1 + f_2,\ f_1 \in H_1,\ f_2 \in H_2} \{ \|f_1\|_{H_1}^2 + \|f_2\|_{H_2}^2 \} \tag{3.6} \]

for $f \in H$.
The proof proceeds as follows.
1. Let $f \in H$ and $N^\perp \ni (f_1, f_2) := v^{-1}(f)$. We define $k(x, \cdot) := k_1(x, \cdot) + k_2(x, \cdot)$ and $(h_1(x, \cdot), h_2(x, \cdot)) := v^{-1}(k(x, \cdot))$, and we show that

\[ \langle f_1, h_1(x, \cdot) \rangle_1 + \langle f_2, h_2(x, \cdot) \rangle_2 = \langle f_1, k_1(x, \cdot) \rangle_1 + \langle f_2, k_2(x, \cdot) \rangle_2 . \]

2. Using the above, we show the reproducing property $\langle f, k(x, \cdot) \rangle_H = f(x)$ of $k$.
3. We show that the norm of $H$ is (3.6).
For details, see the appendix at the end of this chapter.
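The norm formula (3.6) can be illustrated on a finite set $E$, where an RKHS with an invertible Gram matrix $K$ consists of all vectors $f \in \mathbb{R}^n$ with $\|f\|^2 = f^\top K^{-1} f$ (writing $f = K\alpha$ gives $\alpha^\top K \alpha = f^\top K^{-1} f$). The sketch below assumes Gaussian and Laplace kernels on five points, both invertible on distinct points, and uses the split $f_1 = K_1 (K_1 + K_2)^{-1} f$ as the candidate minimizer; these choices are illustrative assumptions, not the book's construction.

```python
import numpy as np

rng = np.random.default_rng(3)
pts = np.linspace(0.0, 1.0, 5)
D = pts[:, None] - pts[None, :]
K1 = np.exp(-10.0 * D ** 2)          # Gaussian kernel Gram matrix
K2 = np.exp(-np.abs(D))              # Laplace kernel Gram matrix
K = K1 + K2                          # Gram matrix of k = k1 + k2

def sq_norm(G, f):
    """Squared RKHS norm f^T G^{-1} f on the finite set."""
    return float(f @ np.linalg.solve(G, f))

def split_cost(f, f1):
    """||f1||_{H1}^2 + ||f - f1||_{H2}^2 for the decomposition f = f1 + (f - f1)."""
    return sq_norm(K1, f1) + sq_norm(K2, f - f1)

f = rng.standard_normal(5)
f1_opt = K1 @ np.linalg.solve(K, f)  # candidate minimizer of the split cost

print(sq_norm(K, f), split_cost(f, f1_opt))          # equal: the minimum in (3.6)
print(split_cost(f, f1_opt + 0.1 * rng.standard_normal(5)))  # any other split costs more
```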
In the following, we construct the Sobolev space as an example of an RKHS and
obtain its kernel.
Let $W_1[0, 1]$ be the set of $f$'s defined over $[0, 1]$ such that $f$ is differentiable
almost everywhere and $f' \in L^2[0, 1]$. Then, we can write each $f \in W_1[0, 1]$ as

\[ f(x) = f(0) + \int_0^x f'(y)\,dy . \tag{3.7} \]

Similarly, let $W_q[0, 1]$ be the set of $f$'s defined over $[0, 1]$ such that $f$ is differentiable
$q - 1$ times and $q$ times almost everywhere and $f^{(q)} \in L^2[0, 1]$. If we define

\[ \phi_i(x) := \frac{x^i}{i!}, \quad i = 0, 1, \ldots \]

and

\[ G_q(x, y) := \frac{(x - y)_+^{q-1}}{(q - 1)!}, \]

then we can Taylor-expand each $f \in W_q[0, 1]$ as follows.

\[ f(x) = \sum_{i=0}^{q-1} f^{(i)}(0)\phi_i(x) + \int_0^1 G_q(x, y) f^{(q)}(y)\,dy. \tag{3.8} \]

In fact, integration by parts gives

\[
\begin{aligned}
\int_0^1 G_q(x, y) f^{(q)}(y)\,dy
&= \left[ G_q(x, y) f^{(q-1)}(y) \right]_0^1 - \int_0^1 \left\{ \frac{d}{dy} G_q(x, y) \right\} f^{(q-1)}(y)\,dy \\
&= -\frac{x^{q-1}}{(q - 1)!} f^{(q-1)}(0) + \int_0^1 G_{q-1}(x, y) f^{(q-1)}(y)\,dy ,
\end{aligned}
\]

and we obtain (3.7) by repeatedly applying this integration by parts to the right-hand side of (3.8).
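For the transformation, we use

\[
\int_0^1 G_q(x, y)h(y)\,dy = \int_0^1 \frac{(x - y)_+^{q-1}}{(q - 1)!} h(y)\,dy
= \frac{1}{(q - 1)!} \sum_{i=0}^{q-1} \binom{q-1}{i} x^i \int_0^x (-y)^{q-1-i} h(y)\,dy
\]

and the differentiation

\[
\int_0^1 \left\{ \frac{d}{dy} G_q(x, y) \right\} h(y)\,dy
= \frac{1}{(q - 2)!} \sum_{i=0}^{q-2} \binom{q-2}{i} \left\{ -x^i \int_0^x (-y)^{q-2-i} h(y)\,dy \right\}
= -\int_0^1 G_{q-1}(x, y)h(y)\,dy .
\]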

Hereafter, we write each element of $W_q[0, 1]$ as

\[ \sum_{i=0}^{q-1} \alpha_i \phi_i(x) + \int_0^1 G_q(x, y)h(y)\,dy \tag{3.9} \]

with $\alpha_0 = f(0), \ldots, \alpha_{q-1} = f^{(q-1)}(0) \in \mathbb{R}$ and $h \in L^2[0, 1]$.

Although more than one Hilbert space $W_q[0, 1]$ exists with different definitions
of inner products, we consider the Hilbert space $H$ that can be written as the direct
sum of $H_0$ and $H_1$, which are defined below. Let

\[ H_0 := \mathrm{span}\{\phi_0, \ldots, \phi_{q-1}\}, \]

and define its inner product by

\[ \langle f, g \rangle_{H_0} = \sum_{i=0}^{q-1} f^{(i)}(0) g^{(i)}(0) \]

for $f, g \in H_0$. We find that $\langle \cdot, \cdot \rangle_{H_0}$ satisfies the requirements of an inner
product and that $\{\phi_0, \ldots, \phi_{q-1}\}$ is an orthonormal basis. Since the inner product
space $H_0$ is finite-dimensional, it is clearly a Hilbert space. We define another
inner product space $H_1$ as

\[ H_1 := \left\{ \int_0^1 G_q(x, y)h(y)\,dy \;\middle|\; h \in L^2[0, 1] \right\} . \]

Since $h \in L^2[0, 1]$, if we define the inner product as

\[ \langle f, g \rangle_{H_1} = \int_0^1 f^{(q)}(y) g^{(q)}(y)\,dy \]

for $f, g \in H_1$, then we have

\[ \|f_m - f_n\|_{H_1} \to 0 \iff \|f_m^{(q)} - f_n^{(q)}\|_{L^2[0,1]} \to 0, \]

and there exists an $f \in H_1$ such that

\[ \|f_n - f\|_{H_1} \to 0 \iff \|f_n^{(q)} - f^{(q)}\|_{L^2[0,1]} \to 0 . \]

From Proposition 14, we have $\|f_m - f_n\|_{H_1} \to 0 \Longrightarrow \|f_n - f\|_{H_1} \to 0$ (completeness),
and $H_1$ is a Hilbert space. Moreover, from

\[ f(x) = \sum_{i=0}^{q-1} \alpha_i \phi_i(x) \in H_1 \Longrightarrow h = f^{(q)} = 0 \]

and

\[ f(x) = \int_0^1 G_q(x, y)h(y)\,dy \in H_0 \Longrightarrow \alpha_0 = f(0) = 0, \ldots, \alpha_{q-1} = f^{(q-1)}(0) = 0 , \]

we have that $H_0 \cap H_1 = \{0\}$. From Proposition 38, for $f = f_0 + f_1$, $g = g_0 + g_1$,
$f_0, g_0 \in H_0$, and $f_1, g_1 \in H_1$, the inner product is

\[ \langle f, g \rangle_{W_q[0,1]} = \langle f_0 + f_1, g_0 + g_1 \rangle_{W_q[0,1]} = \langle f_0, g_0 \rangle_{H_0} + \langle f_1, g_1 \rangle_{H_1} . \]

The reproducing kernels of $H_0$, $H_1$ are respectively

\[ k_0(x, y) := \sum_{i=0}^{q-1} \phi_i(x)\phi_i(y) \]

and

\[ k_1(x, y) := \int_0^1 G_q(x, z) G_q(y, z)\,dz , \]

where $k_0$ is derived from Example 3.2, and $k_1$ is derived from

\[
\langle f(\cdot), k_1(x, \cdot) \rangle_{H_1}
= \left\langle \int_0^1 G_q(\cdot, z)h(z)\,dz, \int_0^1 G_q(x, z) G_q(\cdot, z)\,dz \right\rangle_{H_1}
= \int_0^1 G_q(x, z)h(z)\,dz = f(x)
\]

for arbitrary $f(\cdot) = \int_0^1 G_q(\cdot, z)h(z)\,dz \in H_1$ and $x \in E$ (the uniqueness is due to
Proposition 32).
Furthermore, we can construct $W_q[0, 1]$ such that its kernel is

\[ k(x, y) = k_0(x, y) + k_1(x, y) \]

for $x, y \in E$.
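
As a numerical illustration of $k = k_0 + k_1$, the following minimal sketch evaluates the kernel of $W_q[0, 1]$ with a midpoint rule. The function names `G` and `sobolev_kernel`, the grid size, and the reading of $(x - z)_+^0$ as the indicator of $z < x$ are assumptions made for this sketch. The closed form used for $q = 2$, $k_1(x, y) = m^2 M/2 - m^3/6$ with $m = \min\{x, y\}$ and $M = \max\{x, y\}$, was obtained by direct integration and is not stated in the text.

```python
import numpy as np
from math import factorial

def G(q, x, z):
    """G_q(x, z) = (x - z)_+^{q-1} / (q-1)!; for q = 1 read as the indicator of z < x."""
    d = x - np.asarray(z, dtype=float)
    if q == 1:
        return (d > 0).astype(float)
    return np.maximum(d, 0.0) ** (q - 1) / factorial(q - 1)

def sobolev_kernel(q, x, y, n=20000):
    """k(x, y) = k0(x, y) + k1(x, y) on [0, 1]; k1 is approximated by a midpoint rule."""
    z = (np.arange(n) + 0.5) / n                      # midpoints of n equal cells
    k0 = sum((x ** i) * (y ** i) / factorial(i) ** 2 for i in range(q))
    k1 = np.mean(G(q, x, z) * G(q, y, z))             # approximates the integral over [0, 1]
    return k0 + k1

x, y = 0.3, 0.7
m, M = min(x, y), max(x, y)
print(sobolev_kernel(1, x, y), 1 + m)                             # q = 1: 1 + min{x, y}
print(sobolev_kernel(2, x, y), 1 + x * y + m**2 * M / 2 - m**3 / 6)  # q = 2
```

For $q = 1$ the $H_1$ part is $\min\{x, y\}$, the kernel whose eigenpairs are computed in Example 58.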

3.3 Mercer’s Theorem

Let $(E, \mathcal{F}, \mu)$ be a measure space. We assume that the integral operator kernel
$K : E \times E \to \mathbb{R}$ is a measurable function and is not necessarily nonnegative definite.
Suppose that $\int_{E \times E} K^2(x, y)\,d\mu(x)d\mu(y)$ takes a finite value. Then, we define
the integral operator $T_K$ by

\[ (T_K f)(\cdot) := \int_E K(x, \cdot) f(x)\,d\mu(x) \tag{3.10} \]

for $f \in L^2(E, \mathcal{F}, \mu)$. Since

\[
\|T_K f\|^2 = \int_E \{(T_K f)(x)\}^2 d\mu(x)
\le \int_{E \times E} \{K(x, y)\}^2 d\mu(x)d\mu(y) \int_E \{f(z)\}^2 d\mu(z)
= \|f\|^2 \int_{E \times E} \{K(x, y)\}^2 d\mu(x)d\mu(y) ,
\]

we have $T_K \in B(L^2(E, \mathcal{F}, \mu))$ and

\[ \|T_K\| \le \left( \int_{E \times E} K^2(x, y)\,d\mu(x)d\mu(y) \right)^{1/2} . \]

In the following, we assume that $K : E \times E \to \mathbb{R}$ is continuous and that the entire
set $E$ is compact (such as $E = [0, 1]$). Thus, the integral operator
kernel $K$ is uniformly continuous (Proposition 8).
Lemma 1 For each $f \in L^2(E, \mathcal{F}, \mu)$, $T_K f(\cdot)$ is uniformly continuous.
Proof: Since $K : E \times E \to \mathbb{R}$ is uniformly continuous, for an arbitrary $\epsilon > 0$ we achieve
$|K(x, y) - K(x, z)| < \epsilon$ for all $x \in E$ by making $|y - z|$ sufficiently small. Thus, we have

\[ \left| \int_E K(x, y) f(x)\,d\mu(x) - \int_E K(x, z) f(x)\,d\mu(x) \right| \le \epsilon \|f\| . \]

Proposition 39 TK is a compact operator.

Proof: By Proposition 12, for an arbitrary $\epsilon > 0$, there exist $n(\epsilon) \ge 1$ and an $\mathbb{R}$-coefficient
bivariate polynomial $K_{n(\epsilon)}(x, y) := \sum_{i=0}^{n(\epsilon)} g_i(x) y^i$ whose order in $y$ is at
most $n(\epsilon)$ such that

\[ \sup_{x, y \in E} |K(x, y) - K_{n(\epsilon)}(x, y)| < \epsilon , \]

where $g_0, \ldots, g_{n(\epsilon)}$ are $\mathbb{R}$-coefficient univariate polynomials. If we abbreviate $n(\epsilon)$
as $n$ and write the integral operator corresponding to $K_n$ as $T_{K_n}$, then we may regard

\[ T_{K_n} f(\cdot) = \sum_{i=0}^{n} y^i \int_E f(x) g_i(x)\,d\mu(x) \]

as

\[ T_{K_n} : H \ni f \mapsto \left[ \int_E f(x) g_0(x)\,d\mu(x), \ldots, \int_E f(x) g_n(x)\,d\mu(x) \right] \in \mathbb{R}^{n+1} . \]

Since the rank of $T_{K_n}$ is finite, from the first item of Proposition 25, $T_{K_n}$ is a compact
operator. Moreover, since

\[
\|(T_{K_n} - T_K) f\|^2 = \int_E \left( \int_E [K_n(x, y) - K(x, y)] f(y)\,d\mu(y) \right)^2 d\mu(x) \le \epsilon^2 \|f\|^2 \mu^2(E) ,
\]

from Proposition 25, $T_K$ is a compact operator.

In the following, we assume that $K$ is symmetric. Then, from Example 45, $T_K$ is
self-adjoint. Thus, from Proposition 39, we have that

\[ T_K x = \sum_{j=1}^{\infty} \lambda_j \langle e_j, x \rangle e_j \]

using $\{\lambda_j\}$ and $\{e_j\}$ that satisfy Proposition 27. Moreover, Lemma 1 implies the
following:
Lemma 2

\[ e_j(y) = \lambda_j^{-1} \int_E K(x, y) e_j(x)\,d\mu(x) \]

is uniformly continuous w.r.t. $y$.

Example 58 (Brownian Motion) We obtain the eigenvalues and eigenfunctions
$\{(\lambda_j, e_j)\}$ when the integral operator kernel in $L^2[0, 1]$ is $K(x, y) = \min\{x, y\}$,
$x, y \in E = [0, 1]$ (the subspace $H_1$ of the Sobolev space $W_1[0, 1]$). Since

\[ T_K f(x) = \int_0^1 K(x, y) f(y)\,dy = \int_0^x y f(y)\,dy + x \int_x^1 f(y)\,dy , \]

the eigenequation is

\[ \int_0^1 \min(x, y) e(y)\,dy = \lambda e(x) , \tag{3.11} \]

i.e.,

\[ \int_0^x y e(y)\,dy + x \int_x^1 e(y)\,dy = \lambda e(x) . \]

If we differentiate both sides w.r.t. $x$, we obtain

\[ x e(x) + \int_x^1 e(y)\,dy - x e(x) = \lambda e'(x) , \]

i.e.,

\[ \int_x^1 e(y)\,dy = \lambda e'(x) . \tag{3.12} \]

If we further differentiate both sides w.r.t. $x$, then we obtain $e(x) = -\lambda e''(x)$ and

\[ e(y) = \alpha \sin(y/\sqrt{\lambda}) + \beta \cos(y/\sqrt{\lambda}) . \]

If we substitute $x = 0$ into (3.11), then we have $e(0) = 0$, which is equivalent to
$\beta = 0$. From (3.12), we have $e'(1) = 0$, i.e., $\alpha \cos(1/\sqrt{\lambda}) = 0$. Thus, we obtain

\[ 1/\sqrt{\lambda} = (2j - 1)\pi/2 , \quad j = 1, 2, \ldots . \]

Therefore, the eigenvalues are

\[ \lambda_j = \frac{4}{\{(2j - 1)\pi\}^2} , \tag{3.13} \]

and the orthonormal eigenfunctions are

\[ e_j(x) = \sqrt{2} \sin\left( \frac{(2j - 1)\pi}{2} x \right) , \tag{3.14} \]

where to derive $\alpha = \sqrt{2}$, we use

\[
\int_0^1 \sin^2\left( \frac{y}{\sqrt{\lambda}} \right) dy = \int_0^1 \frac{1 - \cos(2y/\sqrt{\lambda})}{2}\,dy
= \frac{1}{2} - \frac{1}{2} \left[ \frac{\sqrt{\lambda}}{2} \sin\frac{2y}{\sqrt{\lambda}} \right]_0^1 = \frac{1}{2} .
\]
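
The eigenpairs (3.13) and (3.14) can be checked numerically. The sketch below is not part of the original text: it discretizes $T_K$ on a midpoint grid (the grid size `n = 500` is an arbitrary choice) and compares the leading eigenvalues and the first eigenfunction of the resulting matrix with the closed forms.

```python
import numpy as np

n = 500
x = (np.arange(n) + 0.5) / n                   # midpoint grid on [0, 1]
K = np.minimum.outer(x, x)                     # K(x_i, x_j) = min{x_i, x_j}
vals, vecs = np.linalg.eigh(K / n)             # K / n discretizes the operator T_K
vals, vecs = vals[::-1], vecs[:, ::-1]         # largest eigenvalues first

j = np.arange(1, 6)
lam_exact = 4 / ((2 * j - 1) * np.pi) ** 2     # (3.13)
print(np.c_[vals[:5], lam_exact])              # numerical vs. closed-form eigenvalues

e1_num = vecs[:, 0] * np.sqrt(n)               # rescale to unit norm in L2[0, 1]
e1_num *= np.sign(e1_num[-1])                  # eigenvectors are defined up to sign
e1_exact = np.sqrt(2) * np.sin(np.pi * x / 2)  # (3.14) with j = 1
print(np.max(np.abs(e1_num - e1_exact)))       # maximum deviation on the grid
```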

Example 59 (Zhu et al. [36]) For a Gaussian kernel,

\[ K(x, y) = \exp\left( -\frac{(x - y)^2}{2\sigma^2} \right) , \]

if we regard the finite measure $\mu$ in (3.10) of the integral operator kernel as a Gaussian
distribution with a mean of 0 and a variance of $\hat{\sigma}^2$, then the eigenvalues and
eigenfunctions are

\[ \lambda_j = \sqrt{\frac{2a}{A}}\, B^j \]

and

\[ e_j(x) = \exp(-(c - a)x^2) H_j(\sqrt{2c}\,x) , \]

where $H_j$ is the Hermite polynomial of order $j$:

\[ H_j(x) := (-1)^j \exp(x^2) \frac{d^j}{dx^j} \exp(-x^2) , \]

$a^{-1} := 4\hat{\sigma}^2$, $b^{-1} := 2\sigma^2$, $c := \sqrt{a^2 + 2ab}$, $A := a + b + c$, and $B := b/A$. The
proof is not difficult but rather monotonous and long. See the Appendix at the end
of this chapter for details. Note that for a Gaussian kernel with a parameter $\sigma^2$, if the
measure is also a Gaussian distribution with a mean of 0 and a variance of $\hat{\sigma}^2$, we
can compute the eigenvalues from $\beta := \hat{\sigma}^2/\sigma^2 = b/(2a)$:

\[
\sqrt{\frac{2a}{A}}\, B^j
= \sqrt{\frac{2a}{a + b + \sqrt{a^2 + 2ab}}} \left( \frac{b}{a + b + \sqrt{a^2 + 2ab}} \right)^j
= \left[ 1/2 + \beta + \sqrt{1/4 + \beta} \right]^{-1/2} \left( \frac{\beta}{1/2 + \beta + \sqrt{1/4 + \beta}} \right)^j ,
\]

which forms a geometric sequence. For example, if $\sigma^2 = \hat{\sigma}^2 = 1$, then the eigenvalues
are

\[ \lambda_j = \left( \frac{3 - \sqrt{5}}{2} \right)^{j + 1/2} . \]

The Hermite polynomials are $H_1(x) = 2x$, $H_2(x) = -2 + 4x^2$, and $H_3(x) = -12x + 8x^3$
($H_0(x) = 1$, $H_j(x) = 2x H_{j-1}(x) - H'_{j-1}(x)$), and the other quantities are

\[ c = \sqrt{a^2 + 2ab} = (4\hat{\sigma}^2)^{-1} \sqrt{1 + 4\hat{\sigma}^2/\sigma^2} = \frac{\sqrt{5}}{4} , \quad a = (4\hat{\sigma}^2)^{-1} = \frac{1}{4} . \]

We show the eigenfunctions $\phi_j$ for $j = 0, 1, 2, 3$ in Fig. 3.1. The code is as follows.
# In this chapter, we assume that the following has been executed.
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import style
style.use("seaborn-ticks")  # named "seaborn-v0_8-ticks" in Matplotlib 3.6 and later

def Hermite(j):
    # Coefficients of the j-th Hermite polynomial in ascending powers of x,
    # built from the recurrence H_i(x) = 2x H_{i-1}(x) - H'_{i-1}(x)
    if j == 0:
        return [1]
    a = [0] * (j + 2)
    b = [0] * (j + 2)
    a[0] = 1
    for i in range(1, j + 1):
        b[0] = -a[1]
        for k in range(1, i + 1):
            b[k] = 2 * a[k - 1] - (k + 1) * a[k + 1]
        for h in range(j + 2):
            a[h] = b[h]
    return b[:(j + 1)]

Hermite(1)  # 1st order Hermite polynomial

[0, 2]

Hermite(2)  # 2nd order Hermite polynomial

[-2, 0, 4]

Hermite(3)  # 3rd order Hermite polynomial

[0, -12, 0, 8]

def H(j, x):
    # Evaluate the j-th Hermite polynomial at x
    coef = Hermite(j)
    S = 0
    for i in range(j + 1):
        S = S + np.array(coef[i]) * (x ** i)
    return S

cc = np.sqrt(5) / 4   # c for sigma^2 = sigma_hat^2 = 1
a = 1 / 4             # a for sigma_hat^2 = 1

def phi(j, x):
    return np.exp(-(cc - a) * x**2) * H(j, np.sqrt(2 * cc) * x)

color = ["b", "g", "r", "k"]

p = [[] for _ in range(4)]
x = np.linspace(-2, 2, 100)
for i in range(4):
    for k in x:
        p[i].append(phi(i, k))
    plt.plot(x, p[i], c=color[i], label="j=%d" % i)
plt.ylim(-2, 8)
plt.ylabel("phi")
plt.legend()
plt.title("Characteristic function of Gauss Kernel")
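
The eigenvalue formula of Example 59 can be cross-checked numerically. The following sketch (the function names `gauss_eigenvalues` and `gauss_eigenvalues_beta` are hypothetical, introduced only here) computes $\lambda_j = \sqrt{2a/A}\,B^j$ both from $a, b, c$ and from $\beta = \hat{\sigma}^2/\sigma^2$, and compares the Hermite coefficients with NumPy's physicists' Hermite module, assuming that module uses the same convention as the definition of $H_j$ above.

```python
import numpy as np
from numpy.polynomial import hermite

def gauss_eigenvalues(sigma2, sigma2_hat, J=5):
    """lambda_j = sqrt(2a / A) * B**j for j = 0, ..., J-1 (Example 59)."""
    a = 1 / (4 * sigma2_hat)
    b = 1 / (2 * sigma2)
    c = np.sqrt(a**2 + 2 * a * b)
    A = a + b + c
    B = b / A
    return np.sqrt(2 * a / A) * B ** np.arange(J)

def gauss_eigenvalues_beta(beta, J=5):
    """The same eigenvalues written in terms of beta = sigma2_hat / sigma2."""
    D = 0.5 + beta + np.sqrt(0.25 + beta)
    return D ** (-0.5) * (beta / D) ** np.arange(J)

lam = gauss_eigenvalues(1.0, 1.0)
print(lam)                                               # from a, b, c
print(gauss_eigenvalues_beta(1.0))                       # from beta = 1
print(((3 - np.sqrt(5)) / 2) ** (np.arange(5) + 0.5))    # closed form for sigma2 = sigma2_hat = 1

# Coefficients of H_3 in ascending powers of x, for comparison with Hermite(3)
print(hermite.herm2poly([0, 0, 0, 1]))
```

The three printed eigenvalue sequences should agree, and consecutive eigenvalues differ by the constant ratio $B$.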

In this section, we prove Mercer’s theorem for integral operators and illustrate
some examples. Hereafter, we assume that K and TK are nonnegative definite.

Proposition 40 An integral operator TK is nonnegative definite if and only if K :

E × E → R is nonnegative definite, i.e., K is a positive definite kernel.

Proof: See the Appendix at the end of this chapter.

Proposition 41 (Mercer [21]) Let $K : E \times E \to \mathbb{R}$ be a continuous positive definite
kernel and $T_K$ be the corresponding integral operator. Let $\{(\lambda_j, e_j)\}_{j=1}^{\infty}$ be the
sequence of eigenvalues and eigenfunctions of $T_K$. Then, we can write

\[ K(x, y) = \sum_{j=1}^{\infty} \lambda_j e_j(x) e_j(y), \]

and this sum absolutely and uniformly converges.

By absolute convergence, we mean that the sum of the absolute values converges,
and by uniform convergence, we mean that an upper bound of the error that does
not depend on $x, y \in E$ converges to zero.
Proof: Note that $K_n(x, y) := K(x, y) - \sum_{j=1}^{n} \lambda_j e_j(x) e_j(y)$ is continuous and that
the integral operator $T_{K_n}$ is nonnegative definite. In fact, for each $f \in L^2(E, \mathcal{F}, \mu)$,
we have

\[
\langle T_{K_n} f, f \rangle = \langle T_K f, f \rangle - \sum_{j=1}^{n} \lambda_j \langle f, e_j \rangle^2 = \sum_{j=n+1}^{\infty} \lambda_j \langle f, e_j \rangle^2 \ge 0 .
\]
Fig. 3.1 The eigenfunctions for the Gaussian kernel and Gaussian distribution, where
$\sigma^2 = \hat{\sigma}^2 = 1$. If $j$ is even and odd, the eigenfunctions are even and odd functions,
respectively. (Plot title: "Gaussian Kernel EigenFunctions"; horizontal axis: $x$ from $-2$ to 2;
vertical axis: $\phi_j$; curves for $j = 0, 1, 2, 3$.)

Thus, from Proposition 40, $K_n$ is nonnegative definite, and $K_n(x, x) \ge 0$. Thus, for
all $x \in E$, we have

\[ \sum_{j=1}^{\infty} \lambda_j e_j^2(x) \le K(x, x) . \tag{3.15} \]

Moreover, for any set $J$ consisting of positive integers, the Cauchy-Schwarz inequality gives

\[
\sum_{j \in J} |\lambda_j e_j(x) e_j(y)| \le \left( \sum_{j \in J} \lambda_j e_j^2(x) \right)^{1/2} \left( \sum_{j \in J} \lambda_j e_j^2(y) \right)^{1/2} , \tag{3.16}
\]

which means that from (3.15),

\[ \sum_{j \in J} |\lambda_j e_j(x) e_j(y)| \le \{K(x, x) K(y, y)\}^{1/2} \]

for $x, y \in E$. From (3.16), we have

\[
\sum_{j=n+1}^{\infty} |\lambda_j e_j(x) e_j(y)| \le \left( \sum_{j=n+1}^{\infty} \lambda_j e_j^2(x) \right)^{1/2} \left( \sum_{j=n+1}^{\infty} \lambda_j e_j^2(y) \right)^{1/2} ,
\]

and the right-hand side monotonically converges to 0 as $n$ grows. Since $E$ is compact,
the left-hand side uniformly converges according to the lemma below.

Lemma 3 (Dini) Let $E$ be a compact set. For a sequence of continuous functions $f_n : E \to \mathbb{R}$,
if $f_n(x)$ monotonically converges to $f(x)$ for a continuous $f$ and each $x \in E$, then
the convergence is uniform.

Proof: See the Appendix at the end of this chapter.

Thus, for an arbitrary $\epsilon > 0$, there exists an $n$ such that

\[ \sup_{x, y \in E} \sum_{j=n+1}^{\infty} |\lambda_j e_j(x) e_j(y)| < \epsilon , \tag{3.17} \]

and this sum absolutely and uniformly converges.
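
To illustrate the uniform convergence, the following sketch (not part of the original text) truncates the Mercer expansion of $K(x, y) = \min\{x, y\}$ using the eigenpairs (3.13) and (3.14) of Example 58 and prints the maximum error over a grid. The grid and the truncation levels are arbitrary choices, and `mercer_partial_sum` is a hypothetical helper name.

```python
import numpy as np

x = np.linspace(0, 1, 201)
K = np.minimum.outer(x, x)                      # exact kernel on the grid

def mercer_partial_sum(n):
    """Sum of lambda_j e_j(x) e_j(y) over j = 1, ..., n on the grid."""
    j = np.arange(1, n + 1)
    lam = 4 / ((2 * j - 1) * np.pi) ** 2        # (3.13)
    # E[i, j-1] = e_j(x_i) from (3.14)
    E = np.sqrt(2) * np.sin(np.outer(x, (2 * j - 1) * np.pi / 2))
    return (E * lam) @ E.T

errors = {n: np.max(np.abs(K - mercer_partial_sum(n))) for n in (5, 20, 200)}
for n, err in errors.items():
    print(n, err)                               # the sup error over the grid shrinks with n
```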

Example 60 (The Kernel Expressed by the Difference Between Two Variables) Let
$E = [-1, 1]$. An integral operator for which $K : E \times E \to \mathbb{R}$ can be expressed by
$K(x, z) = \phi(x - z)$ is $T_K f(x) = \int_E \phi(x - y) f(y)\,dy$, which can be
expressed by $(\phi * f)(x)$ using the convolution $(g * h)(u) = \int_E g(u - v) h(v)\,dv$. Hereafter,
we assume that $\phi$ is an even function with period two, i.e., $\phi(x) = \phi(x + 2m)$ for $m \in \mathbb{Z}$. In this case,
$e_j(x) = \cos(\pi j x)$ is an eigenfunction of $T_K$. In fact, since $\phi$ is an even function and
is periodic, we have

\[
T_K e_j(x) = \int_E \phi(x - y) \cos(\pi j y)\,dy = \int_{-1-x}^{1-x} \phi(-u) \cos(\pi j (x + u))\,du = \int_E \phi(u) \cos(\pi j (x + u))\,du
\]

and

\[
T_K e_j(x) = \left\{ \int_E \phi(u) \cos(\pi j u)\,du \right\} \cos(\pi j x) - \left\{ \int_E \phi(u) \sin(\pi j u)\,du \right\} \sin(\pi j x)
= \lambda_j \cos(\pi j x)
\]

from the addition theorem $\cos(\pi j (x + u)) = \cos(\pi j x) \cos(\pi j u) - \sin(\pi j x) \sin(\pi j u)$,
where $\lambda_j = \int_E \phi(u) \cos(\pi j u)\,du$ and the integral of $\phi(u) \sin(\pi j u)$ vanishes because $\phi$ is even.
Similarly, $\sin(\pi j x)$ is an eigenfunction,
and $\lambda_j$ is the corresponding eigenvalue. Thus, from Mercer's theorem, we have

\[
K(x, y) = \sum_{j=0}^{\infty} \lambda_j \{\cos(\pi j x) \cos(\pi j y) + \sin(\pi j x) \sin(\pi j y)\} = \sum_{j=0}^{\infty} \lambda_j \cos\{\pi j (x - y)\} .
\]
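
The eigenfunction claim of Example 60 can be checked numerically for a concrete $\phi$. In the sketch below, $\phi(u) = \exp(\cos(\pi u))$ is a hypothetical choice (even, with period 2) that does not appear in the text, and the integrals over $E = [-1, 1]$ are approximated by a midpoint rule.

```python
import numpy as np

n = 2000
y = -1 + (np.arange(n) + 0.5) * (2 / n)        # midpoint grid on E = [-1, 1]
w = 2 / n                                      # quadrature weight of each cell

def phi(u):
    # hypothetical even function with period 2 (an assumption of this sketch)
    return np.exp(np.cos(np.pi * u))

def T(e, x0):
    """(T_K e)(x0) = integral over E of phi(x0 - y) e(y) dy, by the midpoint rule."""
    return np.sum(phi(x0 - y) * e(y)) * w

j = 2
e = lambda t: np.cos(np.pi * j * t)            # candidate eigenfunction e_j
lam = np.sum(phi(y) * np.cos(np.pi * j * y)) * w   # lambda_j as defined in Example 60

for x0 in (-0.8, 0.1, 0.55):
    print(T(e, x0), lam * e(x0))               # the two columns should agree
```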

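Example 61 (Polynomial Kernel) For the polynomial kernel in Example 8, let $m = 2$,
$d = 1$. We compute the eigenfunctions of $K(x, y) = (1 + xy)^2$ over $x, y \in E =
[-1, 1]$ by setting $e(x) := a_0 + a_1 x + a_2 x^2$. By comparing

\[
\int_E K(x, y) e(y)\,dy = \int_E (1 + xy)^2 e(y)\,dy = \int_E e(y)\,dy + \left\{ 2 \int_E y e(y)\,dy \right\} x + \left\{ \int_E y^2 e(y)\,dy \right\} x^2
\]

with $\lambda e(x)$, we obtain

\[
\begin{cases}
\displaystyle \int_E (a_0 + a_1 y + a_2 y^2)\,dy = \lambda a_0 \\
\displaystyle 2 \int_E y (a_0 + a_1 y + a_2 y^2)\,dy = \lambda a_1 \\
\displaystyle \int_E y^2 (a_0 + a_1 y + a_2 y^2)\,dy = \lambda a_2
\end{cases}
\]

We solve the eigenequation w.r.t. the following matrix:

\[
\begin{bmatrix}
\int_E dy & \int_E y\,dy & \int_E y^2\,dy \\
2 \int_E y\,dy & 2 \int_E y^2\,dy & 2 \int_E y^3\,dy \\
\int_E y^2\,dy & \int_E y^3\,dy & \int_E y^4\,dy
\end{bmatrix}
\begin{bmatrix} a_0 \\ a_1 \\ a_2 \end{bmatrix}
= \lambda \begin{bmatrix} a_0 \\ a_1 \\ a_2 \end{bmatrix} .
\]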
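
The matrix eigenequation of Example 61 can be solved numerically. The sketch below is an illustration rather than the book's code: it fills in the moments $\int_{-1}^{1} y^p\,dy$ (zero for odd $p$, $2/(p+1)$ for even $p$) and calls `np.linalg.eig`. The helper name `moment` is hypothetical.

```python
import numpy as np

def moment(p):
    """Integral of y**p over E = [-1, 1]."""
    return 0.0 if p % 2 else 2.0 / (p + 1)

# Row i is the coefficient of x**i in the integral of (1 + x y)^2 e(y) over E;
# the factors (1, 2, 1) are the binomial coefficients of (1 + x y)^2.
weights = [1.0, 2.0, 1.0]
M = np.array([[weights[i] * moment(i + j) for j in range(3)] for i in range(3)])

lam, coef = np.linalg.eig(M)
order = np.argsort(lam)[::-1]                   # sort eigenvalues in descending order
lam, coef = lam[order], coef[:, order]
print(M)
print(lam)      # eigenvalues lambda
print(coef)     # column i holds (a0, a1, a2) of the i-th eigenfunction
```

Because the odd moments vanish, the odd eigenfunction $e(x) \propto x$ decouples from the two even ones, and its eigenvalue is $2\int_E y^2\,dy = 4/3$.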

Now, we consider a general method for approximately obtaining the eigenvalues
and eigenfunctions in Mercer's theorem. Let $X$ be a random variable in $E$. Then, for
the integral operator $T_K \in B(L^2)$ defined by

\[ T_K : L^2 \ni \phi \mapsto \int_E K(\cdot, x) \phi(x)\,d\mu(x) \in L^2 , \]

there exist $\lambda_1 \ge \lambda_2 \ge \ldots$ and $\phi_1, \phi_2, \ldots \in L^2$ such that

\[ T_K \phi_j = \lambda_j \phi_j \]

and

\[ \int_E \phi_j \phi_k\,d\mu = \delta_{j,k} . \]

Suppose that $x_1, \ldots, x_m \in E$ with $m \ge 1$ have been generated according to the probability $\mu$. We
approximate the eigenequation by the sample average

\[ \frac{1}{m} \sum_{j=1}^{m} K(x_j, y) \phi_i(x_j) = \lambda_i \phi_i(y) , \quad y \in E \tag{3.18} \]

for $i = 1, 2, \ldots$. Since the orthonormality is approximated by

\[ \frac{1}{m} \sum_{i=1}^{m} \phi_j(x_i) \phi_k(x_i) = \delta_{j,k} , \]

if we substitute $x_1, \ldots, x_m$ into $y$ in (3.18), we find that there exists an orthogonal
matrix $U \in \mathbb{R}^{m \times m}$ such that

\[ K_m U = U \Lambda , \]

where $K_m \in \mathbb{R}^{m \times m}$ is the Gram matrix and $\Lambda$ is the diagonal matrix with the elements
$\lambda_1^{(m)} = m\lambda_1, \ldots, \lambda_m^{(m)} = m\lambda_m$. If we substitute $\phi_i(x_j) = \sqrt{m}\,U_{j,i}$ and $\lambda_i = \lambda_i^{(m)}/m$ into
(3.18), we obtain

\[ \phi_i(\cdot) = \frac{\sqrt{m}}{\lambda_i^{(m)}} \sum_{j=1}^{m} K(x_j, \cdot) U_{j,i} . \tag{3.19} \]

We require that the distribution of $x_1, \ldots, x_m \in E$ coincide with the measure $\mu$
of the integral operator. It is known that $\lambda_i^{(m)}/m$ converges to the eigenvalue $\lambda_i$
as $m$ grows. For the proof and the convergence process, consult
Baker (Theorem 3.4 [3]).
We write the procedure in Python as follows.
Example 62 We obtain the eigenvalues and eigenfunctions by using the following
program with a Gaussian kernel, where the measure required for the definition of
the integral operator should be the same as the measure used when generating the random
numbers. Even with the same Gaussian kernel, if $x_1, \ldots, x_m$ follow a different
distribution, we obtain different eigenvalues and eigenfunctions. We compare the cases
in which $m = 300$ and $m = 1000$ to find that the largest eigenvalues and the corresponding
eigenfunctions coincide (Figs. 3.2 and 3.3).
Fig. 3.2 The eigenvalues obtained in Example 62. We compare the cases involving $m = 1000$
samples and the first $m = 300$ samples. The largest eigenvalues for both cases coincide.
(Plot title: "The First 100 Eigenvalues"; horizontal axis: # Eigenvalues, 0 to 100; vertical axis:
EigenValues, 0.00 to 0.25; curves for $m = 1000$ and $m = 300$.)

# Kernel Definition
sigma = 1
def k(x, y):
    return np.exp(-(x - y)**2 / sigma**2)

# Generate Samples and Define the Gram Matrix
m = 300
x = np.random.randn(m) - 2 * np.random.randn(m)**2 + 3 * np.random.randn(m)**3

# Eigenvalues and Eigenvectors
K = np.zeros((m, m))
for i in range(m):
    for j in range(m):
        K[i, j] = k(x[i], x[j])
values, vectors = np.linalg.eigh(K)                # eigh: K is symmetric, results are real
values, vectors = values[::-1], vectors[:, ::-1]   # largest eigenvalue first
lam = values / m
alpha = np.zeros((m, m))
for i in range(m):
    # the i-th eigenvector is the i-th column of vectors
    alpha[:, i] = vectors[:, i] * np.sqrt(m) / (values[i] + 10e-16)

# Display Graph
def F(y, i):
    S = 0
    for j in range(m):
        S = S + alpha[j, i] * k(x[j], y)
    return S

i = 1  ## Execute it changing i
def G(y):
    return F(y, i)

w = np.linspace(-2, 2, 100)
plt.plot(w, G(w))
plt.title("Eigen Values and their Eigen Functions")
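
As a consistency check of (3.18) and (3.19), the following vectorized sketch verifies that the extension (3.19) reproduces $\sqrt{m}\,U_{j,i}$ at the sample points and that the resulting functions are empirically orthonormal. It is a sketch under assumptions that differ from the program above: the samples are drawn from $N(0, 1)$, the kernel is written as in Example 59 with $\sigma^2 = 1$, and the seed, sample size, and the number `top` of eigenpairs are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 200
x = rng.standard_normal(m)                        # samples from mu = N(0, 1) (assumption)

def k(s, t, sigma2=1.0):
    return np.exp(-(s - t) ** 2 / (2 * sigma2))   # Gaussian kernel in the form of Example 59

K = k(x[:, None], x[None, :])                     # Gram matrix K_m via broadcasting
vals, U = np.linalg.eigh(K)
vals, U = vals[::-1], U[:, ::-1]                  # lambda_1^{(m)} >= lambda_2^{(m)} >= ...

def phi(i, y):
    """Extension (3.19) of the i-th eigenfunction (0-based i) to the points y."""
    y = np.atleast_1d(y)
    return np.sqrt(m) / vals[i] * k(x[:, None], y[None, :]).T @ U[:, i]

top = 5                                           # use only well-conditioned eigenpairs
Phi = np.column_stack([phi(i, x) for i in range(top)])
print(np.max(np.abs(Phi - np.sqrt(m) * U[:, :top])))    # (3.19) at the sample points
print(np.max(np.abs(Phi.T @ Phi / m - np.eye(top))))    # empirical orthonormality
print(vals[:top] / m)                                   # estimates of lambda_1, ..., lambda_5
```

With this choice of kernel and measure, the printed estimates can be compared with the geometric sequence $((3 - \sqrt{5})/2)^{j + 1/2}$, $j = 0, 1, \ldots$, of Example 59.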

Finally, we present the RKHS obtained from Mercer’s theorem (Proposition 41).
In Example 57, we pointed out that the condition was too loose for the L 2 -space to
be an RKHS. The following proposition suggests the restrictions that we should add.
Fig. 3.3 The eigenfunctions obtained in Example 62. We show a comparison between the functions
of the $m = 1000$ samples and the first $m = 300$ samples. The eigenfunctions coincide for the
three largest eigenvalues, but they are far from each other for the fourth eigenvalue. However, the
fourth eigenvalues coincide. (Panels: "First Eigenfunction", "Second Eigenfunction", "Third
Eigenfunction", "Fourth Eigenfunction"; horizontal axis: $x$ from $-2$ to 2; vertical axis:
Eigenfunction; curves for $m = 1000$ and $m = 300$.)

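Proposition 42 Let $\{(\lambda_j, e_j)\}$ be the eigenvalues and orthonormal eigenfunctions of an
integral operator with a positive definite kernel $k$. In this case,

\[ H = \left\{ \sum_{j=1}^{\infty} \beta_j e_j \;\middle|\; \sum_{j=1}^{\infty} \frac{\beta_j^2}{\lambda_j} < \infty \right\} \]

\[ \langle f, g \rangle_H := \sum_{j=1}^{\infty} \frac{\int_E f(x) e_j(x)\,d\eta(x) \int_E g(x) e_j(x)\,d\eta(x)}{\lambda_j} \tag{3.20} \]

gives the RKHS.

The proposition claims that if we restrict the elements $\sum_{j=1}^{\infty} \beta_j e_j$ for which
$\sum_{j=1}^{\infty} \beta_j^2 < \infty$ to those for which $\sum_{j=1}^{\infty} \beta_j^2/\lambda_j < \infty$, the $L^2$ space becomes an RKHS.

Proof: From the definition of the inner product (3.20), we can write $\langle e_i, e_j \rangle_H =
\frac{1}{\lambda_i} \delta_{i,j}$. Thus, we have

\[ \int_E \left\{ \sum_{j=1}^{\infty} \beta_j e_j(x) \right\}^2 d\eta(x) < \infty \iff \sum_{j=1}^{\infty} \frac{\beta_j^2}{\lambda_j} < \infty , \]

and $H$ is a Hilbert space. From Mercer's theorem, we can write $k(x, \cdot) = \sum_{j=1}^{\infty} \lambda_j e_j(x) e_j(\cdot)$,
so we have

\[ \sum_{j=1}^{\infty} \frac{\{\lambda_j e_j(x)\}^2}{\lambda_j} = \sum_{j=1}^{\infty} \lambda_j e_j(x) e_j(x) = k(x, x) < \infty \]

and $k(x, \cdot) \in H$. Finally, since $\int_E k(\cdot, y) e_j(y)\,d\eta(y) = \lambda_j e_j(\cdot)$, we have

\[
\langle f, k(\cdot, x) \rangle_H = \sum_{j=1}^{\infty} \frac{1}{\lambda_j} \int_E f(y) e_j(y)\,d\eta(y) \int_E k(x, y) e_j(y)\,d\eta(y)
= \sum_{j=1}^{\infty} \left\{ \int_E f(y) e_j(y)\,d\eta(y) \right\} e_j(x) = f(x),
\]

which is the reproducing property.

As seen from the proof, the eigenfunctions $\{e_j\}$ of Mercer's theorem are orthonormal
in the $L^2$ space, but in the obtained RKHS, the norm of $e_j$ is $\lambda_j^{-1/2}$. We can see that the
coefficients $\{\beta_j\}$ must decay faster in the RKHS than in the $L^2$ space.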

Appendix

Proof of Proposition 34
Let $k : E \times E \to \mathbb{R}$ be the positive definite kernel of a Hilbert space $H$. We show
that for the linear space $H_0$ spanned by $k(x, \cdot)$, $x \in E$, the bivariate function

\[ \langle f, g \rangle_{H_0} = \sum_{i=1}^{m} \sum_{j=1}^{n} a_i b_j k(x_i, y_j) \]

is an inner product between

\[ f(\cdot) = \sum_{i=1}^{m} a_i k(x_i, \cdot) \quad \text{and} \quad g(\cdot) = \sum_{j=1}^{n} b_j k(y_j, \cdot) \in H_0 . \tag{3.21} \]

The value

\[ \langle f, g \rangle_{H_0} = \sum_{i=1}^{m} a_i g(x_i) = \sum_{j=1}^{n} b_j f(y_j) \]

does not depend on the choice of the representations of $f, g$ in (3.21). In particular, $\langle f, g \rangle_{H_0}$ is symmetric.
Since $k$ is a positive definite kernel, we have

\[ \|f\|_{H_0}^2 = \sum_{i=1}^{m} \sum_{j=1}^{m} a_i a_j k(x_i, x_j) \ge 0 . \]

Moreover, from

\[ |f(x)| = |\langle f(\cdot), k(x, \cdot) \rangle_{H_0}| \le \|f\|_{H_0} \sqrt{k(x, x)}, \]

we have $\|f\|_{H_0} = 0 \Longrightarrow f = 0$. In the following, we construct the linear space $H$
obtained by completing $H_0$.
Let $\{f_n\}$ be a Cauchy sequence in $H_0$. For an arbitrary $x \in E$ and $m, n \ge 1$, we
have

\[ |f_m(x) - f_n(x)| \le \|f_m - f_n\|_{H_0} \sqrt{k(x, x)}, \]

and $\{f_n(x)\}$ is Cauchy. Since this sequence is a real sequence, it has a limit
for each $x \in E$. In the following, let $H$ be the set of $f : E \to \mathbb{R}$ such that
$\{f_n(x)\}$ converges to $f(x)$ for each $x \in E$ for some Cauchy sequence $\{f_n\}$ in $H_0$. In general,
$H_0$ is a subset of $H$. In the following, we define an inner product in $H$ and prove that
$H$ is an RKHS with a reproducing kernel $k$.
Lemma 4 Suppose that $\{f_n\}$ is a Cauchy sequence in $H_0$. If the sequence $\{f_n(x)\}$
converges to 0 for each $x \in E$, then we have

\[ \lim_{n \to \infty} \|f_n\|_{H_0} = 0 . \]

Proof of Lemma 4: Since a Cauchy sequence is bounded (Example 26), there exists
a $B > 0$ such that $\|f_n\|_{H_0} < B$, $n = 1, 2, \ldots$. Moreover, since the above sequence
is a Cauchy sequence, for an arbitrary $\epsilon > 0$, there exists an $N$ such that
$n > N \Longrightarrow \|f_n - f_N\|_{H_0} < \epsilon/B$. Thus, for $f_N(x) = \sum_{i=1}^{p} \alpha_i k(x_i, x) \in H_0$, $\alpha_i \in \mathbb{R}$, $x_i \in E$,
and $i = 1, 2, \ldots, p$, we have that when $n > N$,

\[
\|f_n\|_{H_0}^2 = \langle f_n - f_N, f_n \rangle_{H_0} + \langle f_N, f_n \rangle_{H_0}
\le \|f_n - f_N\|_{H_0} \|f_n\|_{H_0} + \sum_{i=1}^{p} |\alpha_i| |f_n(x_i)| .
\]

Each of the first and second terms is at most $\epsilon$ for sufficiently large $n$ since we have
$f_n(x_i) \to 0$ as $n \to \infty$ for each $i = 1, \ldots, p$. Hence, we have Lemma 4.
For Cauchy sequences $\{f_n\}$, $\{g_n\}$ in $H_0$, we define $f, g \in H$ such that $\{f_n(x)\}$,
$\{g_n(x)\}$ converge to $f(x)$, $g(x)$, respectively, for each $x \in E$. Then, $\{\langle f_n, g_n \rangle_{H_0}\}$ is
Cauchy:

\[
\begin{aligned}
|\langle f_n, g_n \rangle_{H_0} - \langle f_m, g_m \rangle_{H_0}|
&= |\langle f_n, g_n - g_m \rangle_{H_0} + \langle f_n - f_m, g_m \rangle_{H_0}| \\
&\le \|f_n\|_{H_0} \|g_n - g_m\|_{H_0} + \|f_n - f_m\|_{H_0} \|g_m\|_{H_0} .
\end{aligned}
\]

Since $\{\langle f_n, g_n \rangle_{H_0}\}$ is real and Cauchy, it converges (Proposition 6). The inner product
obtained by convergence depends only on $f(x)$, $g(x)$ ($x \in E$).
Let $\{f'_n\}$, $\{g'_n\}$ be other Cauchy sequences in $H_0$ that converge to $f, g$ for each
$x \in E$. Then, $\{f_n - f'_n\}$, $\{g_n - g'_n\}$ are Cauchy sequences that converge to 0 for each
$x \in E$, and from Lemma 4, we have $\|f_n - f'_n\|_{H_0}, \|g_n - g'_n\|_{H_0} \to 0$ as $n \to \infty$,
which means that

\[
\begin{aligned}
|\langle f_n, g_n \rangle_{H_0} - \langle f'_n, g'_n \rangle_{H_0}|
&= |\langle f_n, g_n - g'_n \rangle_{H_0} + \langle f_n - f'_n, g'_n \rangle_{H_0}| \\
&\le \|f_n\|_{H_0} \|g_n - g'_n\|_{H_0} + \|f_n - f'_n\|_{H_0} \|g'_n\|_{H_0} \to 0 .
\end{aligned}
\]

Thus, the limit of $\{\langle f_n, g_n \rangle_{H_0}\}$ does not depend on $\{f_n\}$, $\{g_n\}$ but only on
$f, g \in H$. We define the inner product of $H$ by

\[ \langle f, g \rangle_H := \lim_{n \to \infty} \langle f_n, g_n \rangle_{H_0} . \]

To show that this expression satisfies the definition of an inner product, we assume
that $\|f\|_H^2 = \langle f, f \rangle_H = 0$. Then, for each $x \in E$, as $n \to \infty$, from

\[ |f_n(x)| = |\langle f_n(\cdot), k(x, \cdot) \rangle_{H_0}| \le \sqrt{k(x, x)}\, \|f_n\|_{H_0} \to 0, \]

we have $|f(x)| = \lim_{n \to \infty} |f_n(x)| = 0$.

Moreover, since we have defined $f \in H$ according to $\lim_{n \to \infty} f_n(x)$ ($x \in E$) for
any Cauchy sequence $\{f_n\}$ in $H_0$ that converges to $f$, from the definition of the inner
product, we have

\[ \|f - f_n\|_H = \lim_{m \to \infty} \|f_m - f_n\|_{H_0} \to 0 \tag{3.22} \]

as $n \to \infty$, and $H_0$ is dense in $H$.
We show that $H$ is complete. Let $\{f_n\}$ be a Cauchy sequence in $H$. From denseness,
there exists a sequence $\{f'_n\}$ in $H_0$ such that

\[ \|f_n - f'_n\|_H \to 0 \tag{3.23} \]

as $n \to \infty$. Therefore, given an arbitrary $\epsilon > 0$, there exists an $N$ such that for $m, n > N$, we have
$\|f_n - f'_n\|_H, \|f_m - f'_m\|_H, \|f_n - f_m\|_H < \epsilon/3$ and

\[
\|f'_n - f'_m\|_{H_0} = \|f'_n - f'_m\|_H \le \|f_n - f'_n\|_H + \|f_n - f_m\|_H + \|f_m - f'_m\|_H \le \epsilon
\]

for $f'_n, f'_m \in H_0 \subseteq H$. Thus, $\{f'_n\}$ is a Cauchy sequence in $H_0$, and we define $f \in
H$ by the convergence of $f'_n(x)$ for each $x \in E$. Moreover, from (3.22), we have
$\|f - f'_n\|_H \to 0$. Combining this with (3.23), we obtain

\[ \|f - f_n\|_H \le \|f - f'_n\|_H + \|f_n - f'_n\|_H \to 0 \]

as $n \to \infty$. Hence, $H$ is complete.

Next, we show that $k$ is the corresponding reproducing kernel of the Hilbert space
$H$. Property (3.1) holds immediately because $k(x, \cdot) \in H_0 \subseteq H$, $x \in E$. For the other
property (3.2), since $f \in H$ is a limit of the Cauchy sequence $\{f_n\}$ in $H_0$ at $x \in E$,
we have

\[ f(x) = \lim_{n \to \infty} f_n(x) = \lim_{n \to \infty} \langle f_n(\cdot), k(x, \cdot) \rangle_{H_0} = \langle f, k(x, \cdot) \rangle_H . \]

Finally, we show that such an $H$ uniquely exists. Suppose that $G$ exists and shares
the same properties possessed by $H$. Since $H$ is a closure of $H_0$, $G$ should contain
$H$ as a subspace. Since $H$ is closed, from (2.11), we write $G = H \oplus H^\perp$. However,
since $k(x, \cdot) \in H$, $x \in E$ and $\langle f(\cdot), k(x, \cdot) \rangle_G = 0$ for $f \in H^\perp$, we have $f(x) = 0$,
$x \in E$, which means that $H^\perp = \{0\}$.

Proof of Proposition 38
From our assumption, we have $k(x, \cdot) = k_1(x, \cdot) + k_2(x, \cdot) \in H$ for each $x \in E$.
We define $N^\perp \ni (h_1(x, \cdot), h_2(x, \cdot)) := v^{-1}(k(x, \cdot))$ for each $x \in E$, where $h_1(x, \cdot)$,
$h_2(x, \cdot)$ are elements in $H_1$, $H_2$ for $x \in E$, but $h_1, h_2$ are not necessarily the reproducing
kernels $k_1, k_2$ of $H_1, H_2$, respectively. Since $k(x, \cdot) = k_1(x, \cdot) + k_2(x, \cdot)$, we have

\[ h_1(x, \cdot) - k_1(x, \cdot) + h_2(x, \cdot) - k_2(x, \cdot) = k(x, \cdot) - k(x, \cdot) = 0 \]

and $z := (h_1(x, \cdot) - k_1(x, \cdot), h_2(x, \cdot) - k_2(x, \cdot)) \in N$, so

\[ 0 = \langle 0, f \rangle_H = \langle z, (f_1, f_2) \rangle_F \]

for $f \in H$ and $N^\perp \ni (f_1, f_2) := v^{-1}(f)$. Thus, we have

\[ \langle f_1, h_1(x, \cdot) \rangle_1 + \langle f_2, h_2(x, \cdot) \rangle_2 = \langle f_1, k_1(x, \cdot) \rangle_1 + \langle f_2, k_2(x, \cdot) \rangle_2 , \]

which implies the reproducing property:

\[
\begin{aligned}
\langle f, k(x, \cdot) \rangle_H &= \langle v^{-1}(f), v^{-1}(k(x, \cdot)) \rangle_F = \langle (f_1, f_2), (h_1(x, \cdot), h_2(x, \cdot)) \rangle_F \\
&= \langle (f_1, f_2), (k_1(x, \cdot), k_2(x, \cdot)) \rangle_F = f_1(x) + f_2(x) = f(x) .
\end{aligned}
\]

Furthermore, let $(f_1, f_2) \in F$, $f := f_1 + f_2$, and $(g_1, g_2) := (f_1, f_2) - v^{-1}(f)$.
Then, from $(g_1, g_2) \in N$ and $v^{-1}(f) \in N^\perp$, we have

\[ \|(f_1, f_2)\|_F^2 = \|v^{-1}(f)\|_F^2 + \|(g_1, g_2)\|_F^2 . \]

Combining this with (3.4) and (3.5), we have

\[ \|f\|_H^2 = \|v^{-1}(f)\|_F^2 \le \|(f_1, f_2)\|_F^2 = \|f_1\|_{H_1}^2 + \|f_2\|_{H_2}^2 , \]

where the equality holds when $(f_1, f_2) = v^{-1}(f)$.

Proof of Example 59
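We use the equality [10]

\[ \int_{-\infty}^{\infty} \exp(-(x - y)^2) H_j(\alpha x)\,dx = \sqrt{\pi} (1 - \alpha^2)^{j/2} H_j\left( \frac{\alpha y}{(1 - \alpha^2)^{1/2}} \right) . \]

Suppose that $\int_E p(y)\,dy = 1$. If we have

\[ \int_E k(x, y) \phi_j(y) p(y)\,dy = \lambda \phi_j(x), \]

then

\[ \int_E \tilde{k}(x, y) \tilde{\phi}_j(y)\,dy = \lambda \tilde{\phi}_j(x) \]

for $\tilde{k}(x, y) := p(x)^{1/2} k(x, y) p(y)^{1/2}$, $\tilde{\phi}_j(x) := p(x)^{1/2} \phi_j(x)$. Thus, it is sufficient
to show that we obtain the right-hand side by substituting

\[ p(x) := \sqrt{\frac{2a}{\pi}} \exp(-2ax^2) \]

\[ \tilde{k}(x, y) := \sqrt{\frac{2a}{\pi}} \exp(-ax^2) \exp(-b(x - y)^2) \exp(-ay^2) \]

\[ \tilde{\phi}_j(x) := \left( \frac{2a}{\pi} \right)^{1/4} \exp(-cx^2) H_j(\sqrt{2c}\,x) \]

into the left-hand side for $E = (-\infty, \infty)$. The left-hand side becomes

\[
\begin{aligned}
&\left( \frac{2a}{\pi} \right)^{3/4} \int_{-\infty}^{\infty} \exp(-ax^2) \exp(-b(x - y)^2) \exp(-ay^2) \exp(-cy^2) H_j(\sqrt{2c}\,y)\,dy \\
&= \left( \frac{2a}{\pi} \right)^{3/4} \int_{-\infty}^{\infty} \exp\left\{ -(a + b + c)\left( y - \frac{b}{a + b + c} x \right)^2 + \left[ \frac{b^2}{a + b + c} - (a + b) \right] x^2 \right\} H_j(\sqrt{2c}\,y)\,dy \\
&= \left( \frac{2a}{\pi} \right)^{3/4} \exp(-cx^2) \int_{-\infty}^{\infty} \exp\left\{ -\left( z - \frac{b}{\sqrt{a + b + c}} x \right)^2 \right\} H_j\left( \frac{\sqrt{2c}}{\sqrt{a + b + c}} z \right) \frac{dz}{\sqrt{a + b + c}} \\
&= \left( \frac{2a}{\pi} \right)^{1/4} \exp(-cx^2) \sqrt{\frac{2a}{\pi (a + b + c)}} \sqrt{\pi} \left( 1 - \frac{2c}{a + b + c} \right)^{j/2} H_j(\sqrt{2c}\,x) \\
&= \sqrt{\frac{2a}{a + b + c}} \left( \frac{b}{a + b + c} \right)^j \left( \frac{2a}{\pi} \right)^{1/4} \exp(-cx^2) H_j(\sqrt{2c}\,x) = \sqrt{\frac{2a}{A}}\, B^j \tilde{\phi}_j(x),
\end{aligned}
\]

where we define $z := y\sqrt{a + b + c}$, $\alpha := \dfrac{\sqrt{2c}}{\sqrt{a + b + c}}$ and use

\[
(1 - \alpha^2)^{1/2} = \sqrt{1 - \frac{2c}{a + b + c}} = \sqrt{\frac{a + b - c}{a + b + c}} = \sqrt{\frac{(a + b)^2 - c^2}{(a + b + c)^2}} = \frac{b}{a + b + c} .
\]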

Proof of Proposition 40
Since $K$ is uniformly continuous, if $d$ is a distance on $E \times E$, there exists a $\delta_n$ such
that

\[ d((x_1, y_1), (x_2, y_2)) < \delta_n \Longrightarrow |K(x_1, y_1) - K(x_2, y_2)| < n^{-1} \]

for $n = 1, 2, \ldots$ and arbitrary $x_1, x_2, y_1, y_2 \in E$. Since $E$ is compact, we can cover
it with a finite number of balls $\{E_{n,i}\}_{i=1}^{m}$ of diameter $\delta_n$. If we arbitrarily choose $v_i \in
E_{n,i}$ and define $K_n(x, y) := K(v_i, v_j)$ for $(x, y) \in E_{n,i} \times E_{n,j}$, from the uniform
continuity of $K$, we obtain

\[ \max_{(x, y) \in E \times E} |K(x, y) - K_n(x, y)| < \frac{1}{n} . \]

Let $T_K$, $T_{K_n}$ be the integral operators of $K$, $K_n$. Then, we have

\[ |\langle T_K f, f \rangle - \langle T_{K_n} f, f \rangle| \le n^{-1} \|f\|^2 \]

and

\[ \langle T_{K_n} f, f \rangle = \sum_{i=1}^{m} \sum_{j=1}^{m} K(v_i, v_j) \int_{E_{n,i}} f(x)\,d\mu(x) \int_{E_{n,j}} f(y)\,d\mu(y) \ge 0 \]

for an arbitrary $n$, and we have $\langle T_K f, f \rangle \ge 0$. Conversely, suppose that $\langle T_K f, f \rangle \ge 0$.
If there exist $x_1, \ldots, x_m \in E$, $z_1, \ldots, z_m \in \mathbb{R}$ such that $\sum_{i=1}^{m} \sum_{j=1}^{m} z_i z_j K(x_i, x_j) <
0$, since $K$ is uniformly continuous, there exist $E_1, \ldots, E_m \in \mathcal{F}$ such that

\[ \max_{x_h, y_h \in E_h,\ h = 1, \ldots, m} \sum_{i=1}^{m} \sum_{j=1}^{m} z_i z_j K(x_i, y_j) < 0 \]

and $\mu(E_1), \ldots, \mu(E_m) > 0$. However, from the mean value theorem, we have

\[ \langle T_K f, f \rangle = \sum_{i=1}^{m} \sum_{j=1}^{m} z_i z_j \{\mu(E_i)\mu(E_j)\}^{-1} \int_{E_i} \int_{E_j} K(x, y)\,d\mu(x)d\mu(y) < 0 \]

for $f = \sum_{i=1}^{m} z_i \{\mu(E_i)\}^{-1} I_{E_i}$, which contradicts the fact that $T_K$ is nonnegative definite.

Proof of Lemma 3
We assume that $f_n(x)$ monotonically increases as $n$ grows for each $x \in E$. Let $\epsilon > 0$
be arbitrary. For each $x \in E$, let $n(x)$ be the minimum $n$ such that $|f_n(x) - f(x)| <
\epsilon$. From continuity, for each $x \in E$, we set a neighborhood $U(x)$ of $x$ so that

\[ y \in U(x) \Longrightarrow |f(x) - f(y)| < \epsilon, \quad |f_{n(x)}(x) - f_{n(x)}(y)| < \epsilon . \]

Then, we have

\[
f(y) - f_{n(x)}(y) \le f(x) + \epsilon - f_{n(x)}(y) \le f_{n(x)}(x) + 2\epsilon - f_{n(x)}(y) \le |f_{n(x)}(x) - f_{n(x)}(y)| + 2\epsilon < 3\epsilon .
\]

Moreover, since $E$ is compact, we may suppose that $E \subseteq \cup_{i=1}^{m} U(x_i)$. If $N$ is the
maximum value of $n(x_1), \ldots, n(x_m)$, for $n \ge N$, we have

\[ f(y) - f_n(y) \le f(y) - f_{n(x_i)}(y) \le 3\epsilon \]

for each $y \in E$ and each $i$ for which $y \in U(x_i)$.

Exercises 31∼45
31. Proposition 34 can be derived according to the following steps. Which part of
the proof in the appendix does each step correspond to?
(a) Define the inner product $\langle \cdot, \cdot \rangle_{H_0}$ of $H_0 := \mathrm{span}\{k(x, \cdot) : x \in E\}$.
(b) For any Cauchy sequence $\{f_n\}$ in $H_0$ and each $x \in E$, the real sequence
$\{f_n(x)\}$ is Cauchy, so it converges to an $f(x) := \lim_{n \to \infty} f_n(x)$ (Proposition 6).
Let $H$ be the set of such $f$'s.
(c) Define the inner product $\langle \cdot, \cdot \rangle_H$ of the linear space $H$.
(d) Show that H0 is dense in H .
(e) Show that any Cauchy sequence { f n } in H converges to some element of H
as n → ∞ (completeness of H ).
(f) Show that k is a reproducing kernel of H .
(g) Show that such an H is unique.
32. In Examples 55 and 56, the inner product is $\langle f, g \rangle_H = \int_0^1 F(u) G(u)\,du$, and the
RKHS is

\[ H = \left\{ E \ni x \mapsto \int_E F(t) J(x, t)\,d\eta(t) \in \mathbb{R} \;\middle|\; F \in L^2(E, \eta) \right\} . \]

What are the J (x, t) in Examples 55 and 56? Also, how is the kernel k(x, y)
represented in general by using J (x, t)?
33. Proposition 38 can be derived according to the following steps. Which part of
the proof in the appendix does each step correspond to?
(a) Fix $f \in H$ arbitrarily, define $N^\perp \ni (f_1, f_2) := v^{-1}(f)$, $k(x, \cdot) := k_1(x, \cdot) +
k_2(x, \cdot)$, and $(h_1(x, \cdot), h_2(x, \cdot)) := v^{-1}(k(x, \cdot))$, and show that

\[ \langle f_1, h_1(x, \cdot) \rangle_1 + \langle f_2, h_2(x, \cdot) \rangle_2 = \langle f_1, k_1(x, \cdot) \rangle_1 + \langle f_2, k_2(x, \cdot) \rangle_2 . \]

(b) Using (a), prove the reproducing property of $k$: $\langle f, k(x, \cdot) \rangle_H = f(x)$.
(c) Show that the norm of $H$ is (3.6).
34. Show that each $f \in W_q[0, 1]$ can be Taylor-expanded as

\[ f(x) = \sum_{i=0}^{q-1} f^{(i)}(0)\phi_i(x) + \int_0^1 G_q(x, y) f^{(q)}(y)\,dy \]

using

\[ \phi_i(x) := \frac{x^i}{i!}, \quad i = 0, 1, \ldots \]

and

\[ G_q(x, y) := \frac{(x - y)_+^{q-1}}{(q - 1)!} . \]

35. Show that $W_q[0, 1] = H_0 \oplus H_1$, where

\[ H_0 = \left\{ \sum_{i=0}^{q-1} \alpha_i \phi_i(x) \;\middle|\; \alpha_0, \ldots, \alpha_{q-1} \in \mathbb{R} \right\} \]

\[ H_1 = \left\{ \int_0^1 G_q(x, y)h(y)\,dy \;\middle|\; h \in L^2[0, 1] \right\} \]

(You need to show the inclusion relation on both sides of the set). In addition,
show that $H_0 \cap H_1 = \{0\}$.
36. We consider the integral operator $T_k$ of $k(x, y) = \min\{x, y\}$ in $L^2[0, 1]$, where
$x, y \in E = [0, 1]$. Substitute

\[ \lambda_j = \frac{4}{\{(2j - 1)\pi\}^2} \]

\[ e_j(x) = \sqrt{2} \sin\left( \frac{(2j - 1)\pi}{2} x \right) \]

into $T_k e_j = \lambda_j e_j$ to examine the equality.

37. Show that the eigenvalues in Example 59 form a geometric sequence with the
initial values and ratio that are determined by β := σ̂ 2 /σ 2 .
38. In Example 59, the following program obtains eigenvalues and eigenfunctions
under the assumption that $\sigma^2 = \hat{\sigma}^2 = 1$. Change the program to set the
values of $\sigma^2$, $\hat{\sigma}^2$ in ## and add $\sigma^2$, $\hat{\sigma}^2$ as arguments to the function phi in ###,
and run it to output a graph.

def H(j, x):
    if j == 0:
        return 1
    elif j == 1:
        return 2 * x
    elif j == 2:
        return -2 + 4 * x**2
    else:
        return -12 * x + 8 * x**3

cc = np.sqrt(5) / 4
a = 1 / 4  ##

def phi(j, x):  ###
    return np.exp(-(cc - a) * x**2) * H(j, np.sqrt(2 * cc) * x)

color = ["b", "g", "r", "k"]

x = np.linspace(-2, 2, 100)
plt.plot(x, phi(0, x), c=color[0], label="j=0")
plt.ylim(-2, 8)
plt.ylabel("phi")
for i in range(1, 4):
    plt.plot(x, phi(i, x), c=color[i], label="j=%d" % i)
plt.title("Characteristic function of Gauss Kernel")

39. Show the following:

(a) The function $f_n(x) = n^2 (1 - x) x^{n+1}$ defined over $[0, 1]$ converges at each
$x \in [0, 1]$, but its upper bound does not converge (it is not uniformly convergent).
(b) The function $f_n(x) = (1 - x) x^{n+1}$ defined over $[0, 1]$ converges uniformly
(using Lemma 3).
(c) The series $\sum_{n=0}^{\infty} \frac{(-1)^n}{\sqrt{n + 1}}$ converges absolutely.

40. In Example 58, suppose that the period of φ is 2π instead of 2. What are the
eigenvalues and eigenfunctions of Tk ? Additionally, derive the kernel k.
41. What eigenequations should be solved in Example 61 when m = 3, d = 1?
42. Define and execute the following part of the program in Example 62 as a function.
The input for this includes data x, a kernel k, and the i of the ith eigenvalue. The
output is a function F.

K = np.zeros((m, m))
for i in range(m):
    for j in range(m):
        K[i, j] = k(x[i], x[j])
values, vectors = np.linalg.eig(K)
lam = values / m
alpha = np.zeros((m, m))
for i in range(m):
    alpha[:, i] = vectors[:, i] * np.sqrt(m) / (values[i] + 10e-16)

def F(y, i):
    S = 0
    for j in range(m):
        S = S + alpha[j, i] * k(x[j], y)
    return S

43. In Example 62, for the Gaussian kernel, random numbers are generated according to the normal distribution, and we obtain the corresponding eigenvalues and eigenfunctions. When the number of samples is large, theoretically, the eigenvalues decay exponentially (Example 59). What happens with the polynomial kernel $k(x, y) = (1 + xy)^2$ when $m = 2$ and $d = 1$? Output the eigenvalues and eigenfunctions as was done for the Gaussian kernel.
44. If we construct (3.19) using the solution of $K_m U = U \Lambda$, show that the result is a solution of (3.18) and that the solutions are orthogonal with a magnitude of 1.
45. In Proposition 42, $\beta_j$ should originally satisfy $\sum_{j=1}^{\infty} \beta_j^2 < \infty$. However, this is not stated in the assertion of Proposition 42. Why is this the case?
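As a numerical companion to Exercise 36 (it does not replace the analytic substitution), the following sketch approximates the integral operator $T_k$ for $k(x, y) = \min\{x, y\}$ with the trapezoidal rule and compares $T_k e_j$ with $\lambda_j e_j$ on a grid. The function name check_min_kernel_eigenpair and the grid size are assumptions made only for this illustration.

```python
import numpy as np

def check_min_kernel_eigenpair(j, n_grid=1001):
    """Return max |(T_k e_j)(x) - lambda_j e_j(x)| over a grid on [0, 1]."""
    y = np.linspace(0.0, 1.0, n_grid)
    step = y[1] - y[0]
    w = np.full(n_grid, step)              # trapezoidal quadrature weights
    w[0] = w[-1] = step / 2
    omega = (2 * j - 1) * np.pi / 2
    lam = 1.0 / omega**2                   # equals 4 / ((2j - 1) pi)^2
    e = np.sqrt(2) * np.sin(omega * y)     # e_j on the grid
    K = np.minimum.outer(y, y)             # k(x, y) = min{x, y} on the grid
    Te = K @ (w * e)                       # quadrature approximation of (T_k e_j)(x)
    return np.max(np.abs(Te - lam * e))

for j in (1, 2, 3):
    print(j, check_min_kernel_eigenpair(j))
```

The printed discrepancies should be small and shrink as n_grid grows, since only the quadrature error remains.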
Chapter 4
Kernel Computations

In Chap. 1, we learned that the kernel $k(x, y) \in \mathbb{R}$ represents the similarity between two elements $x, y$ in a set $E$. Chapter 3 described the relationships between a kernel $k$, its feature map $E \ni x \mapsto k(x, \cdot) \in H$, and its reproducing kernel Hilbert space $H$. In this chapter, we consider $k(x, \cdot)$ to be a function $E \to \mathbb{R}$ for each $x \in E$, and we perform data processing for $N$ actual data pairs $(x_1, y_1), \ldots, (x_N, y_N)$ of covariates and responses. The $x_i$, $i = 1, \ldots, N$ (row vectors) are $p$-dimensional and given by the matrix $X \in \mathbb{R}^{N \times p}$. The responses $y_i$ ($i = 1, \ldots, N$) may be real or binary. This chapter discusses kernel ridge regression, principal component analysis, support vector machines (SVMs), and splines; in each case, we find the $f \in H$ that minimizes an objective function under various constraints. It is known that we can write the optimal $f$ in the form $\sum_{i=1}^N \alpha_i k(x_i, \cdot)$ (the representation theorem, often called the representer theorem), and the problem reduces to finding the optimal $\alpha_1, \ldots, \alpha_N$.
In the second half, we address the problem of computational complexity. Kernel computations with the $N \times N$ Gram matrix typically take $O(N^3)$ time, and real-time calculation is hard when $N$ is greater than 1000. In particular, we consider how to reduce the rank of the Gram matrix $K$. Specifically, we learn actual procedures for random Fourier features, the Nyström approximation, and the incomplete Cholesky decomposition.

4.1 Kernel Ridge Regression

We say that finding the $\beta \in \mathbb{R}^p$ (column vector) that minimizes $\sum_{i=1}^N (y_i - x_i \beta)^2$ is the least-squares problem. If we assume that we have executed the centralization process such that $y_i \leftarrow y_i - \bar{y}$ and $x_{i,j} \leftarrow x_{i,j} - \bar{x}_j$ for $\bar{y} = \frac{1}{N} \sum_{i=1}^N y_i$ and $\bar{x}_j = \frac{1}{N} \sum_{i=1}^N x_{i,j}$ and that the matrix $X^\top X$ is nonsingular, we can obtain the solution as $\hat\beta = (X^\top X)^{-1} X^\top y$ from $X = (x_{i,j})$ and $y = (y_i)$. In the following, we prepare a kernel $k : E \times E \to \mathbb{R}$ and consider the problem of finding the $f \in H$ that minimizes
$$L := \sum_{i=1}^N (y_i - f(x_i))^2 .$$

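As we considered in Example 40, we express the RKHS $H$ as the sum of
$$M := \mathrm{span}(\{k(x_i, \cdot)\}_{i=1}^N)$$
and
$$M^\perp = \{ f \in H \mid \langle f, k(x_i, \cdot) \rangle_H = 0, \ i = 1, \ldots, N \} .$$
If we set $f = f_1 + f_2$, $f_1 \in M$, $f_2 \in M^\perp$, and write $f_1 = \sum_{j=1}^N \alpha_j k(x_j, \cdot)$, then we have
$$\sum_{i=1}^N (y_i - f(x_i))^2 = \sum_{i=1}^N (y_i - f_1(x_i))^2 = \sum_{i=1}^N \Big( y_i - \sum_{j=1}^N \alpha_j k(x_j, x_i) \Big)^2 . \tag{4.1}$$
Indeed, with $E := \mathbb{R}^p$, the reproducing property and $\langle f_2, k(x_i, \cdot) \rangle_H = 0$ give
$$f(x_i) = \langle f_1(\cdot) + f_2(\cdot), k(x_i, \cdot) \rangle_H = \langle f_1(\cdot), k(x_i, \cdot) \rangle_H = f_1(x_i)$$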

for $i = 1, \ldots, N$. Thus, the minimization of $L$ reduces to that of
$$L = \sum_{i=1}^N \Big\{ y_i - \sum_{j=1}^N \alpha_j k(x_j, x_i) \Big\}^2 = \|y - K\alpha\|^2 , \tag{4.2}$$
where $K = (k(x_i, x_j))_{i,j=1,\ldots,N}$ is the Gram matrix, and the norm $\|z\|$ of $z = [z_1, \ldots, z_N]^\top \in \mathbb{R}^N$ denotes $\sqrt{\sum_{i=1}^N z_i^2}$. The above principle is the representation theorem.
If we differentiate $L$ by $\alpha$ and set the result to zero, we have $-K(y - K\alpha) = 0$. If $K$ is positive definite rather than merely nonnegative definite, then the solution becomes $\hat\alpha = K^{-1} y$.
If we use the $\hat{f} \in H$ obtained as above that minimizes (4.2), then we can predict the value of $y$ given a new $x \in \mathbb{R}^p$ via
$$\hat{f}(x) = \sum_{i=1}^N \hat\alpha_i k(x_i, x) .$$

We can construct a procedure to compute $\alpha$ as follows:

# We install the cvxopt module beforehand
pip install cvxopt

# In this chapter, we assume that the following have been executed.
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
import cvxopt
from cvxopt import solvers
from cvxopt import matrix
import matplotlib.pyplot as plt
from matplotlib import style
style.use("seaborn-ticks")  # named "seaborn-v0_8-ticks" in Matplotlib 3.6 and later
from numpy.random import randn  # Gaussian random numbers
from scipy.stats import norm

def alpha(k, x, y):
    n = len(x)
    K = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = k(x[i], x[j])
    # Add 10^(-5) I to K to make it invertible
    return np.linalg.inv(K + 1e-5 * np.identity(n)).dot(y)
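The double loop that fills the Gram matrix can also be written with NumPy broadcasting, and the linear system can be solved without forming an inverse. The following is a minimal sketch for one-dimensional inputs and the Gaussian kernel; the names gram_gauss and alpha_solve, the parameter sigma2, and the jitter eps are assumptions introduced here for illustration and are not part of the book's code.

```python
import numpy as np

def gram_gauss(x, z, sigma2=1.0):
    """Matrix (k(x_i, z_j)) for the Gaussian kernel exp(-(x - z)^2 / (2 sigma2)), 1-D inputs."""
    diff = x[:, None] - z[None, :]          # all pairwise differences via broadcasting
    return np.exp(-diff**2 / (2 * sigma2))

def alpha_solve(K, y, eps=1e-5):
    """Solve (K + eps I) alpha = y without forming the inverse explicitly."""
    return np.linalg.solve(K + eps * np.identity(K.shape[0]), y)

rng = np.random.default_rng(0)
x = rng.standard_normal(50)
y = 1 + x + x**2 + rng.standard_normal(50)
K = gram_gauss(x, x)
a = alpha_solve(K, y)

z = np.linspace(-1, 1, 5)
print(gram_gauss(z, x) @ a)                 # fitted values sum_i alpha_i k(x_i, z)
```

Solving the system directly is usually preferable numerically to multiplying by an explicit inverse, although for small n both give essentially the same coefficients.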

Example 63 Utilizing the function alpha, we execute kernel regression via polynomial and Gaussian kernels for n = 50 data. We present the output in Fig. 4.1.

def k_p(x, y):  # Kernel Definition
    return (np.dot(x.T, y) + 1)**3
def k_g(x, y):  # Kernel Definition
    return np.exp(-(x - y)**2 / 2)

lam = 0.1  # not used by the (unregularized) function alpha above
n = 50; x = np.random.randn(n); y = 1 + x + x**2 + np.random.randn(n)  # Data Generation
alpha_p = alpha(k_p, x, y)
alpha_g = alpha(k_g, x, y)

# Evaluate both fitted functions at the sorted sample points
z = np.sort(x); u = []; v = []
for j in range(n):
    S = 0
    for i in range(n):
        S = S + alpha_p[i] * k_p(x[i], z[j])
    u.append(S)
    S = 0
    for i in range(n):
        S = S + alpha_g[i] * k_g(x[i], z[j])
    v.append(S)
plt.scatter(x, y, facecolors='none', edgecolors="k", marker="o")
plt.plot(z, u, c="r", label="Polynomial Kernel")
plt.plot(z, v, c="b", label="Gauss Kernel")
plt.xlim(-1, 1)
plt.ylim(-1, 5)
plt.xlabel("x")
plt.ylabel("y")
plt.title("Kernel Regression")
plt.legend(loc="upper left", frameon=True, prop={'size': 14})

Fig. 4.1 We execute kernel regression by using polynomial and Gaussian kernels

We cannot obtain the solution of a linear regression problem when the rank of $X$ is smaller than $p$, e.g., when $N < p$. Thus, we often minimize
$$\sum_{i=1}^N (y_i - x_i \beta)^2 + \lambda \|\beta\|_2^2$$
for $\lambda > 0$. We call such a modification of linear regression ridge regression. The minimizing $\beta$ is given by $(X^\top X + \lambda I)^{-1} X^\top y$. In fact, we derive the formula by differentiating
$$\|y - X\beta\|^2 + \lambda \beta^\top \beta$$
by $\beta$ and equating it to zero; we obtain
$$-X^\top (y - X\beta) + \lambda \beta = 0 .$$
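The stationarity equation can be checked numerically, including the case $N < p$ in which ordinary least squares has no unique solution. The sketch below uses synthetic data; the function name ridge and the chosen sizes are assumptions for illustration only.

```python
import numpy as np

def ridge(X, y, lam):
    """Ridge estimate: solve (X^T X + lam I) beta = X^T y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.identity(p), X.T @ y)

rng = np.random.default_rng(0)
N, p, lam = 5, 10, 0.1                      # N < p, so X^T X is singular
X = rng.standard_normal((N, p))
y = rng.standard_normal(N)

print(np.linalg.matrix_rank(X.T @ X))       # at most N, hence smaller than p
beta = ridge(X, y, lam)
grad = -X.T @ (y - X @ beta) + lam * beta   # left-hand side of the stationarity equation
print(np.max(np.abs(grad)))                 # numerically zero
```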

We consider extending ridge regression to the problem of finding the $f \in H$ that minimizes
$$L := \sum_{i=1}^N (y_i - f(x_i))^2 + \lambda \|f\|_H^2 . \tag{4.3}$$

Since $f_1$ and $f_2$ are orthogonal, we have
$$\|f\|_H^2 = \|f_1\|_H^2 + \|f_2\|_H^2 + 2 \langle f_1, f_2 \rangle_H = \|f_1\|_H^2 + \|f_2\|_H^2 \ge \|f_1\|_H^2 . \tag{4.4}$$
From (4.1), (4.3), and (4.4), we also have
$$L \ge \sum_{i=1}^N (y_i - f_1(x_i))^2 + \lambda \|f_1\|_H^2 .$$

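If we note that the second term can be expressed by
$$\|f_1\|_H^2 = \Big\langle \sum_{i=1}^N \alpha_i k(x_i, \cdot), \sum_{j=1}^N \alpha_j k(x_j, \cdot) \Big\rangle_H = \sum_{i=1}^N \sum_{j=1}^N \alpha_i \alpha_j \langle k(x_i, \cdot), k(x_j, \cdot) \rangle_H = \alpha^\top K \alpha$$
for $\alpha = [\alpha_1, \ldots, \alpha_N]^\top$, then the minimization of $L$ reduces to that of
$$\|y - K\alpha\|^2 + \lambda \alpha^\top K \alpha . \tag{4.5}$$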

If we differentiate (4.5) by $\alpha$ and set the result equal to zero, we obtain
$$-K(y - K\alpha) + \lambda K \alpha = 0 .$$
If $K$ is nonsingular, we have
$$\hat\alpha = (K + \lambda I)^{-1} y . \tag{4.6}$$
Finally, if we use the $\hat{f} \in H$ obtained thus far that minimizes (4.3), we can predict the value of $y$ given a new $x \in \mathbb{R}^p$ via
$$\hat{f}(x) = \sum_{i=1}^N \hat\alpha_i k(x_i, x) .$$
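One way to sanity-check (4.6) is to use the linear kernel $k(x, y) = x y^\top$, for which kernel ridge regression should reproduce the predictions of ordinary ridge regression. The following sketch does this on synthetic data; all variable names and sizes are assumptions chosen for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p, lam = 20, 3, 0.1
X = rng.standard_normal((N, p))
y = rng.standard_normal(N)

K = X @ X.T                                                  # Gram matrix of the linear kernel
alpha_hat = np.linalg.solve(K + lam * np.identity(N), y)     # coefficients as in (4.6)
beta_hat = np.linalg.solve(X.T @ X + lam * np.identity(p), X.T @ y)  # ordinary ridge

x_new = rng.standard_normal(p)
f_kernel = np.sum(alpha_hat * (X @ x_new))   # sum_i alpha_i k(x_i, x_new)
f_ridge = x_new @ beta_hat
print(f_kernel, f_ridge)                     # the two predictions agree

# The stationarity equation -K(y - K alpha) + lam K alpha = 0 also holds
resid = -K @ (y - K @ alpha_hat) + lam * K @ alpha_hat
print(np.max(np.abs(resid)))
```

The agreement follows from the identity $X^\top (X X^\top + \lambda I)^{-1} = (X^\top X + \lambda I)^{-1} X^\top$, so the kernel form costs $O(N^3)$ while the primal form costs $O(p^3)$.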

For example, we can construct a procedure that finds α as follows:

def alpha(k, x, y):  # uses the global variable lam
    n = len(x)
    K = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = k(x[i], x[j])
    return np.linalg.inv(K + lam * np.identity(n)).dot(y)

Fig. 4.2 We execute kernel ridge regression using polynomial and Gaussian kernels (plot title: Kernel Ridge; vertical axis: y)

Example 64 Using the function alpha, we execute kernel ridge regression for polynomial and Gaussian kernels and n = 50 data (λ = 0.1). We show the outputs in Fig. 4.2.

def k_p(x, y):  # Kernel Definition
    return (np.dot(x.T, y) + 1)**3
def k_g(x, y):  # Kernel Definition
    return np.exp(-(x - y)**2 / 2)

lam = 0.1
n = 50; x = np.random.randn(n); y = 1 + x + x**2 + np.random.randn(n)  # Data Generation
alpha_p = alpha(k_p, x, y)
alpha_g = alpha(k_g, x, y)

z = np.sort(x); u = []; v = []
for j in range(n):
    S = 0
    for i in range(n):
        S = S + alpha_p[i] * k_p(x[i], z[j])
    u.append(S)
    S = 0
    for i in range(n):
        S = S + alpha_g[i] * k_g(x[i], z[j])
    v.append(S)
plt.scatter(x, y, facecolors='none', edgecolors="k", marker="o")
plt.plot(z, u, c="r", label="Polynomial Kernel")
plt.plot(z, v, c="b", label="Gauss Kernel")
plt.xlim(-1, 1)
plt.ylim(-1, 5)
plt.xlabel("x")
plt.ylabel("y")
plt.title("Kernel Ridge")
plt.legend(loc="upper left", frameon=True, prop={'size': 14})

4.2 Kernel Principal Component Analysis

We review the procedure of principal component analysis (PCA) when we do not use any kernel. We centralize each of the columns in the matrix $X$ and vector $y$. We first compute the $v_1 := v \in \mathbb{R}^p$ that maximizes $v^\top X^\top X v$ under $v^\top v = 1$. Similarly, for $i = 2, \ldots, p$, we repeatedly compute the $v_i$ with $v^\top v = 1$ that maximizes $v^\top X^\top X v$ and is orthogonal to $v_1, \ldots, v_{i-1} \in \mathbb{R}^p$. In practice, we do not use all of $v_1, \ldots, v_p$ but compress $\mathbb{R}^p$ to the $v_1, \ldots, v_m$ ($1 \le m \le p$) with the largest eigenvalues. To find the $v \in \mathbb{R}^p$ with $v^\top v = 1$ that maximizes $v^\top X^\top X v$, we compute the $v \in \mathbb{R}^p$ that maximizes
$$v^\top X^\top X v - \mu (v^\top v - 1) \tag{4.7}$$
with a Lagrange coefficient $\mu > 0$. In PCA, we often compute
$$\begin{bmatrix} x v_1 \\ \vdots \\ x v_m \end{bmatrix} \in \mathbb{R}^m$$
for each row vector $x \in \mathbb{R}^p$ using the obtained $v_1, \ldots, v_m \in \mathbb{R}^p$. We call such a value the score of $x$, which is the vector obtained by projecting $x$ onto the $m$ principal component vectors.
We may consider a problem that is similar to PCA for an RKHS $H$ via the feature map $\Phi : E \ni x_i \mapsto k(x_i, \cdot) \in H$ rather than the PCA in $\mathbb{R}^p$. To this end, we consider the problem of finding the $f \in H$ that maximizes
$$\sum_{i=1}^N f(x_i)^2 - \mu (\|f\|_H^2 - 1) \tag{4.8}$$
with a Lagrange coefficient $\mu > 0$.
If we use the linear kernel (the standard inner product), we can express $f \in H$ by $f(\cdot) = \langle w, \cdot \rangle_E$ with $w \in E$. Thus, (4.7) and (4.8) coincide. The centralization in kernel PCA is applied to the Gram matrix $K$ rather than to the matrix $X$. For the other parts, the extension follows in the same manner.
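As discussed in the previous section, we apply the representation theorem. Thus, for $f_1 \in M := \mathrm{span}(\{k(x_i, \cdot)\}_{i=1}^N)$ and $f_2 \in M^\perp$, we have
$$\begin{aligned}
\sum_{i=1}^N f(x_i)^2 &= \sum_{i=1}^N \langle f_1(\cdot) + f_2(\cdot), k(x_i, \cdot) \rangle_H^2 = \sum_{i=1}^N \langle f_1(\cdot), k(x_i, \cdot) \rangle_H^2 = \sum_{i=1}^N f_1(x_i)^2 \\
&= \sum_{i=1}^N \Big\{ \sum_{j=1}^N \alpha_j k(x_j, x_i) \Big\}^2 = \sum_{i=1}^N \sum_{r=1}^N \sum_{s=1}^N \alpha_r \alpha_s k(x_r, x_i) k(x_s, x_i) = \alpha^\top K^2 \alpha
\end{aligned}$$
and
$$\|f_1 + f_2\|_H^2 = \|f_1\|_H^2 + \|f_2\|_H^2 \ge \|f_1\|_H^2 = \Big\| \sum_{j=1}^N \alpha_j k(x_j, \cdot) \Big\|_H^2 = \sum_{r=1}^N \sum_{s=1}^N \alpha_r \alpha_s k(x_r, x_s) = \alpha^\top K \alpha .$$
Hence, we can formulate (4.8) as the maximization of
$$\alpha^\top K K \alpha - \mu (\alpha^\top K \alpha - 1) .$$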

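If we substitute $\beta = K^{1/2} \alpha$, then since $K$ is symmetric, we have
$$\beta^\top K \beta - \mu (\beta^\top \beta - 1) .$$
Let $\lambda_1, \ldots, \lambda_N$ and $u_1, \ldots, u_N$ be the eigenvalues and eigenvectors of the eigenequation $K\beta = \lambda\beta$, respectively. Then, we have [26]
$$\alpha = K^{-1/2} \beta = \frac{1}{\sqrt{\lambda}} \beta = \frac{u_1}{\sqrt{\lambda_1}}, \ldots, \frac{u_N}{\sqrt{\lambda_N}} .$$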
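The normalization $\alpha_j = u_j / \sqrt{\lambda_j}$ can be verified numerically: it should satisfy the constraint $\alpha^\top K \alpha = 1$ and the stationarity condition $K^2 \alpha = \mu K \alpha$ with $\mu = \lambda_j$. The sketch below assumes a small Gaussian Gram matrix built from synthetic one-dimensional data and uses np.linalg.eigh (a choice made here for symmetric matrices, not the book's routine).

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(8)
K = np.exp(-(x[:, None] - x[None, :])**2 / 2)   # Gaussian Gram matrix

lam, U = np.linalg.eigh(K)                      # eigenvalues in ascending order
m = 3
lam, U = lam[::-1][:m], U[:, ::-1][:, :m]       # keep the m largest eigenpairs
A = U / np.sqrt(lam)                            # column j is alpha_j = u_j / sqrt(lambda_j)

a = A[:, 0]
constraint = a @ K @ a                          # should equal 1
station = np.max(np.abs(K @ K @ a - lam[0] * (K @ a)))   # should vanish
print(constraint, station)
```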

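If we centralize the Gram matrix $K = (k(x_i, x_j))$, then the $(i, j)$th element of the modified Gram matrix is
$$\begin{aligned}
&\Big\langle k(x_i, \cdot) - \frac{1}{N} \sum_{h=1}^N k(x_h, \cdot), \ k(x_j, \cdot) - \frac{1}{N} \sum_{h=1}^N k(x_h, \cdot) \Big\rangle_H \\
&\quad = k(x_i, x_j) - \frac{1}{N} \sum_{h=1}^N k(x_i, x_h) - \frac{1}{N} \sum_{l=1}^N k(x_j, x_l) + \frac{1}{N^2} \sum_{h=1}^N \sum_{l=1}^N k(x_h, x_l) .
\end{aligned} \tag{4.9}$$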
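Equation (4.9) can be applied to the whole Gram matrix at once by subtracting row and column means, which is the same as multiplying by the centering matrix $I - \frac{1}{N} \mathbf{1}\mathbf{1}^\top$ on both sides. A minimal sketch follows; the function name center_gram and the synthetic data are assumptions for illustration.

```python
import numpy as np

def center_gram(K):
    """Apply (4.9) to every entry: subtract row and column means and add the grand mean."""
    row = K.mean(axis=1, keepdims=True)     # (1/N) sum_h k(x_i, x_h)
    col = K.mean(axis=0, keepdims=True)     # (1/N) sum_l k(x_l, x_j)
    return K - row - col + K.mean()

rng = np.random.default_rng(0)
x = rng.standard_normal((6, 2))
sq = np.sum((x[:, None, :] - x[None, :, :])**2, axis=2)   # squared pairwise distances
K = np.exp(-sq / 2)
Kc = center_gram(K)

N = K.shape[0]
Hc = np.identity(N) - np.ones((N, N)) / N                 # centering matrix
print(np.allclose(Kc, Hc @ K @ Hc))                       # both forms give the same matrix
print(np.allclose(Kc.sum(axis=0), 0))                     # centered columns sum to zero
```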

To obtain the score (of size $1 \le m \le p$) of $x \in \mathbb{R}^p$ (row vector), we use the first $m$ columns of $A = [\alpha_1, \ldots, \alpha_N] \in \mathbb{R}^{N \times N}$. Let $x_i \in \mathbb{R}^p$ be the $i$th row of $X$, and let $a_i \in \mathbb{R}^m$ be the $i$th row of the $N \times m$ matrix formed by these $m$ columns. Then,
$$\sum_{i=1}^N a_i k(x_i, x) \in \mathbb{R}^m$$
is the score of $x \in \mathbb{R}^p$.
In contrast to ordinary PCA, kernel PCA requires a computational time of $O(N^3)$. Therefore, when $N$ is large compared to $p$, the computational complexity may be enormous. In Python, we can write the procedure as follows:

def kernel_pca_train(x, k):
    n = x.shape[0]
    K = np.zeros((n, n))
    S = [0] * n; T = [0] * n
    for i in range(n):
        for j in range(n):
            K[i, j] = k(x[i, :], x[j, :])
    for i in range(n):
        S[i] = np.sum(K[i, :])
    for j in range(n):
        T[j] = np.sum(K[:, j])
    U = np.sum(K)
    for i in range(n):
        for j in range(n):
            K[i, j] = K[i, j] - S[i] / n - T[j] / n + U / n**2
    val, vec = np.linalg.eig(K)
    idx = val.argsort()[::-1]  # decreasing order as R
    val = val[idx]
    vec = vec[:, idx]
    alpha = np.zeros((n, n))
    for i in range(n):
        alpha[:, i] = vec[:, i] / val[i]**0.5
    return alpha

def kernel_pca_test(x, k, alpha, m, z):
    n = x.shape[0]
    pca = np.zeros(m)
    for i in range(n):
        pca = pca + alpha[i, 0:m] * k(x[i, :], z)
    return pca

In kernel PCA, when we use the linear kernel, the scores are consistent with those of PCA without any kernel. For simplicity, we assume that $X$ is normalized. If we do not use the kernel, then by the singular value decomposition $X = U \Sigma V^\top$ ($U \in \mathbb{R}^{N \times p}$, $\Sigma \in \mathbb{R}^{p \times p}$, $V \in \mathbb{R}^{p \times p}$), the product of $\frac{1}{N-1} X^\top X = \frac{1}{N-1} V \Sigma^2 V^\top$ and $V$ is $\frac{1}{N-1} X^\top X V = \frac{1}{N-1} V \Sigma^2$. Thus, each column of $V$ is a principal component vector, and the scores of $x_1, \ldots, x_N \in \mathbb{R}^p$ (row vectors) are the first $m$ columns of
$$X V = U \Sigma V^\top \cdot V = U \Sigma .$$
On the other hand, for the linear kernel, we may write the Gram matrix as $K = X X^\top = U \Sigma^2 U^\top$ and have $K U = X X^\top U = U \Sigma^2$. That is, the columns of $U$ are $\beta_1, \ldots, \beta_N$, and the columns $\alpha_1, \ldots, \alpha_N$ of $K^{-1/2} U$ are the principal component vectors. Therefore, the scores of $x_1, \ldots, x_N \in \mathbb{R}^p$ (row vectors) are the first $m$ columns of
$$K \cdot K^{-1/2} U = U \Sigma^2 U^\top \cdot (U \Sigma^2 U^\top)^{-1/2} \cdot U = U \Sigma .$$
Furthermore, we compare the results in terms of centralization. Equation (4.9) is
$$x_i x_j^\top - \frac{1}{N} \sum_{h=1}^N x_i x_h^\top - \frac{1}{N} \sum_{l=1}^N x_j x_l^\top + \frac{1}{N^2} \sum_{h=1}^N \sum_{l=1}^N x_l x_h^\top = (x_i - \bar{x})(x_j - \bar{x})^\top$$
for the linear kernel, which is consistent with that of the ordinary PCA approach. Therefore, the obtained score is the same.
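The equality of the two score matrices can be confirmed numerically. The following sketch, on synthetic centered data, compares the ordinary PCA scores $XV$ with the linear-kernel scores $K A$, where the columns of $A$ are $u_j / \sqrt{\lambda_j}$. It uses np.linalg.svd and np.linalg.eigh directly instead of the functions kernel_pca_train and kernel_pca_test, and the variable names are assumptions for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p, m = 30, 4, 2
X = rng.standard_normal((N, p))
X = X - X.mean(axis=0)                       # centered data, so K = X X^T is already centered

# Ordinary PCA: scores are the first m columns of X V = U Sigma
U, s, Vt = np.linalg.svd(X, full_matrices=False)
scores_pca = X @ Vt.T[:, :m]

# Kernel PCA with the linear kernel: alpha_j = u_j / sqrt(lambda_j)
K = X @ X.T
lam, W = np.linalg.eigh(K)
lam, W = lam[::-1][:m], W[:, ::-1][:, :m]    # m largest eigenvalues and their eigenvectors
A = W / np.sqrt(lam)
scores_kpca = K @ A                          # entry (i, j) is sum_h alpha_{hj} k(x_h, x_i)

# Eigenvectors are determined only up to sign, so compare absolute values
print(np.allclose(np.abs(scores_pca), np.abs(scores_kpca)))
```

The sign ambiguity is the same one noted in the caption of Fig. 4.3, where the axes of the two methods appear flipped.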

Example 65 We performed kernel PCA on a data set called US Arrests in Python. We wished to project the ratio of the resident population living in urban areas and the incidence rates of homicide, violent crime, and assaults on women (number of arrests per 100,000 people) for all 50 states of the U.S. onto two principal axes using PCA. We performed kernel PCA with a Gaussian kernel ($\sigma^2 = 0.01, 0.08$), kernel PCA with a linear kernel, and ordinary PCA. We observed that the differences in the features of the 50 states were not evident in the results of ordinary PCA and kernel PCA with the linear kernel (Fig. 4.3a,b). With the Gaussian kernel ($\sigma^2 = 0.08$), the 50 states were divided into four categories (Fig. 4.3c). As far as the data were concerned, California's figures (fewer homicides for a higher urban population) differed from those of the other states. Nevertheless, when we set $\sigma^2 = 0.01$, the differences between California and the other 49 states became clear (Fig. 4.3d).

[Figure 4.3: four scatter plots of the scores, each with horizontal axis "First" and vertical axis "Second": (a) Standard PCA, (b) Kernel PCA (Linear), (c) Kernel PCA ($\sigma^2 = 0.08$), (d) Kernel PCA ($\sigma^2 = 0.01$)]

Fig. 4.3 For the US Arrests data, we ran the ordinary PCA and kernel PCA methods (linear; Gaussian with $\sigma^2 = 0.08, 0.01$), and we display the scores here. In the figure, 1 to 50 are the IDs given to the states, and California's ID is 5 (written in red). The results of the kernel PCA approach differ greatly depending on what kernel we choose. Additionally, since kernel PCA is unsupervised, it is not possible to use CV to select the optimal parameters. The scores of ordinary PCA and PCA with the linear kernel should be identical. Although the directions of both axes are opposite, which is common in PCA, we can conclude that they match

We used the following code for the execution of the compared approaches:

# def k(x, y):
#     return np.dot(x.T, y)
sigma2 = 0.01

def k(x, y):
    return np.exp(-np.linalg.norm(x - y)**2 / 2 / sigma2)

X = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/USArrests.csv')
x = X.values[:, :-1]
n = x.shape[0]; p = x.shape[1]
alpha = kernel_pca_train(x, k)
z = np.zeros((n, 2))

for i in range(n):
    z[i, :] = kernel_pca_test(x, k, alpha, 2, x[i, :])

min1 = np.min(z[:, 0]); min2 = np.min(z[:, 1])
max1 = np.max(z[:, 0]); max2 = np.max(z[:, 1])

plt.xlim(min1, max1)
plt.ylim(min2, max2)
plt.xlabel("First")
plt.ylabel("Second")
plt.title("Kernel PCA (Gauss 0.01)")
for i in range(n):
    if i != 4:
        plt.text(x=z[i, 0], y=z[i, 1], s=str(i + 1))  # state IDs run from 1 to 50
plt.text(z[4, 0], z[4, 1], 5, c="r")  # California (ID 5) in red

4.3 Kernel SVM

Consider binary discrimination using support vector machines (SVMs). Given $X \in \mathbb{R}^{N \times p}$ and $y \in \{1, -1\}^N$, we find the boundary $Y = X\beta + \beta_0$ with the $\beta \in \mathbb{R}^p$ and $\beta_0 \in \mathbb{R}$ that maximize the margin. Let $\gamma \ge 0$. We wish to maximize the margin $M$ by ranging over $(\beta_0, \beta) \in \mathbb{R} \times \mathbb{R}^p$ and $\epsilon_i \ge 0$, $i = 1, \ldots, N$, that satisfy
$$\sum_{i=1}^N \epsilon_i \le \gamma$$
and
$$y_i (\beta_0 + x_i \beta) \ge M (1 - \epsilon_i) , \quad i = 1, \ldots, N .$$
We often formulate this as the problem of minimizing
$$\frac{1}{2} \|\beta\|^2 + C \sum_{i=1}^N \epsilon_i \tag{4.10}$$
under $y_i (x_i \beta + \beta_0) \ge 1 - \epsilon_i$, $\epsilon_i \ge 0$ for $i = 1, \ldots, N$ by using a constant $C > 0$ (the primal problem). We further transform it into the problem of finding the $0 \le \alpha_i \le C$, $i = 1, 2, \ldots, N$, that maximize
$$\sum_{i=1}^N \alpha_i - \frac{1}{2} \sum_{i=1}^N \sum_{j=1}^N \alpha_i \alpha_j y_i y_j x_i x_j^\top \tag{4.11}$$
under $\sum_{i=1}^N \alpha_i y_i = 0$, where $x_i$ is the $i$th row vector of $X$ (the dual problem)¹. The constant $C > 0$ is a parameter that represents the flexibility of the boundary surface. The smaller the value is, the more samples are used to determine the boundary (samples with $\alpha_i \ne 0$, i.e., support vectors). Although we sacrifice the fit of the data, we reduce the boundary variation caused by sample data to prevent overtraining. Then, from the support vectors, we can calculate the slope of the boundary with the following formula:
$$\beta = \sum_{i=1}^N \alpha_i y_i x_i^\top \in \mathbb{R}^p .$$

Then, suppose that we replace the boundary surface with a curved surface by replacing the inner product $x_i x_j^\top$ with a general nonlinear kernel $k(x_i, x_j)$. Then, we can obtain complicated boundary surfaces rather than planes. However, the theoretical basis for replacing the inner product with a kernel is not clear.
Therefore, in the following, we derive the same results by formulating the optimization using $k : E \times E \to \mathbb{R}$. As in the previous applications of the representation theorem, we find the $f \in H$ that minimizes
$$\frac{1}{2} \|f\|_H^2 + C \sum_{i=1}^N \epsilon_i - \sum_{i=1}^N \alpha_i [y_i \{f(x_i) + \beta_0\} - (1 - \epsilon_i)] - \sum_{i=1}^N \mu_i \epsilon_i . \tag{4.12}$$
Noting that $f(x_i) = f_1(x_i)$, $i = 1, \ldots, N$, and $\|f\|_H \ge \|f_1\|_H$, we find $\gamma_1, \ldots, \gamma_N$ such that $f(\cdot) = \sum_{i=1}^N \gamma_i k(x_i, \cdot)$.
The Karush-Kuhn-Tucker (KKT) condition results in the following nine equations:
$$y_i \{f(x_i) + \beta_0\} - (1 - \epsilon_i) \ge 0$$
$$\epsilon_i \ge 0$$
$$\alpha_i [y_i \{f(x_i) + \beta_0\} - (1 - \epsilon_i)] = 0$$
$$\mu_i \epsilon_i = 0$$

1 We see this derivation in several references, such as Joe Suzuki, “Statistical Learning with Math and Python” (Springer); C. M. Bishop, “Pattern Recognition and Machine Learning” (Springer); Hastie, Tibshirani, and Friedman, “The Elements of Statistical Learning” (Springer); and other primary machine learning books.

$$\sum_j \gamma_j k(x_i, x_j) - \sum_j \alpha_j y_j k(x_i, x_j) = 0 \tag{4.13}$$
$$\sum_i \alpha_i y_i = 0$$
$$C - \alpha_i - \mu_i = 0 \tag{4.14}$$
$$\mu_i \ge 0 , \quad 0 \le \alpha_i \le C .$$

Next, suppose that $f_0, f_1, \ldots, f_m : \mathbb{R}^p \to \mathbb{R}$ are convex and differentiable at $\beta = \beta^*$. In general, Eqs. (4.15), (4.16), and (4.17) are called the KKT condition².
Proposition 43 (KKT Condition) Under the constraints $f_1(\beta) \le 0, \ldots, f_m(\beta) \le 0$, $\beta = \beta^* \in \mathbb{R}^p$ minimizes $f_0(\beta)$ if and only if
$$f_1(\beta^*), \ldots, f_m(\beta^*) \le 0 \tag{4.15}$$
and $\alpha_1, \ldots, \alpha_m \ge 0$ exist such that
$$\alpha_1 f_1(\beta^*) = \cdots = \alpha_m f_m(\beta^*) = 0 \tag{4.16}$$
$$\nabla f_0(\beta^*) + \sum_{i=1}^m \alpha_i \nabla f_i(\beta^*) = 0 . \tag{4.17}$$
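A one-dimensional toy problem may help to see the three conditions at work. The sketch below is an illustration constructed here, not an example from the book: it minimizes $f_0(\beta) = (\beta - 2)^2$ under $f_1(\beta) = \beta - 1 \le 0$, for which $\beta^* = 1$ and $\alpha_1 = 2$ satisfy (4.15), (4.16), and (4.17).

```python
import numpy as np

def f0(b):
    return (b - 2.0)**2          # convex objective

def f1(b):
    return b - 1.0               # constraint f1(b) <= 0

def df0(b):
    return 2.0 * (b - 2.0)       # derivative of the objective

def df1(b):
    return 1.0                   # derivative of the constraint

b_star, a1 = 1.0, 2.0            # candidate minimizer and multiplier

feasible = f1(b_star) <= 0                           # (4.15)
slack = a1 >= 0 and a1 * f1(b_star) == 0             # (4.16)
stationary = df0(b_star) + a1 * df1(b_star) == 0     # (4.17)
print(feasible, slack, stationary)

# Brute-force confirmation over a grid of feasible points
grid = np.linspace(-3, 1, 4001)
b_grid = grid[np.argmin(f0(grid))]
print(b_grid)
```

Because the unconstrained minimizer 2 is infeasible, the constraint is active at the solution and the multiplier is strictly positive.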

Utilizing these nine equations, from (4.13) and (4.14), we can express (4.12) as
$$\sum_{i=1}^N \alpha_i - \frac{1}{2} \sum_{i=1}^N \sum_{j=1}^N \alpha_i \alpha_j y_i y_j k(x_i, x_j) . \tag{4.18}$$
Comparing (4.11) and (4.18), we observe that the dual problem with a kernel is obtained by replacing $x_i x_j^\top$ in the formulation without any kernel with $k(x_i, x_j)$.
In fact, if we set $f(\cdot) = \langle \beta, \cdot \rangle_H$, $\beta \in \mathbb{R}^p$, $k(x, y) = x^\top y$ ($x, y \in \mathbb{R}^p$), then we obtain the dual problem for a linear kernel (4.11).
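For a concrete nonlinear case, the polynomial kernel $(1 + x^\top y)^2$ on $\mathbb{R}^2$ equals the standard inner product of explicit six-dimensional feature vectors, so (4.18) with this kernel is the linear dual (4.11) written in the feature space. The sketch below checks this identity numerically; the function names k_poly and feature are assumptions for illustration.

```python
import numpy as np

def k_poly(x, y):
    """Polynomial kernel (1 + x^T y)^2 for x, y in R^2."""
    return (1 + x @ y)**2

def feature(x):
    """Explicit feature map R^2 -> R^6 whose inner product reproduces k_poly."""
    r2 = np.sqrt(2)
    return np.array([1, r2 * x[0], r2 * x[1], x[0]**2, x[1]**2, r2 * x[0] * x[1]])

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 2))
K = np.array([[k_poly(a, b) for b in X] for a in X])   # Gram matrix via the kernel
Phi = np.array([feature(a) for a in X])                # rows are feature vectors
print(np.allclose(K, Phi @ Phi.T))
```

Expanding $(1 + x_1 y_1 + x_2 y_2)^2$ term by term gives exactly the six products used in feature, which is why the two Gram matrices coincide.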
Example 66 By using the following function svm_2, we can compare how the boundaries differ between a linear kernel (the standard inner product) and a nonlinear kernel (a polynomial kernel), as shown in Fig. 4.4. cvxopt is a Python module for solving quadratic programming problems. The function cvxopt.solvers.qp calculates α.

def K_linear(x, y):
    return x.T @ y
def K_poly(x, y):
    return (1 + x.T @ y)**2

2 For the proof, see Chap. 9 of Joe Suzuki “Statistical Learning with R/Python” (Springer).

def svm_2(X, y, C, K):
    eps = 0.0001
    n = X.shape[0]
    Kyy = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            Kyy[i, j] = K(X[i, :], X[j, :]) * y[i, 0] * y[j, 0]
    # Specify the problem via the matrix function in the package cvxopt
    P = matrix(Kyy + np.eye(n) * eps)
    A = matrix(-y.T.astype(float))
    b = matrix(np.array([0]).astype(float))
    h = matrix(np.array([C] * n + [0] * n).reshape(-1, 1).astype(float))
    G = matrix(np.concatenate([np.diag(np.ones(n)), np.diag(-np.ones(n))]))
    q = matrix(np.array([-1] * n).astype(float))
    res = cvxopt.solvers.qp(P, q, A=A, b=b, G=G, h=h)
    alpha = np.array(res['x'])  # x is the alpha in the text
    beta = ((alpha * y).T @ X).reshape(-1, 1)  # slope; meaningful for the linear kernel only
    index = (eps < alpha[:, 0]) & (alpha[:, 0] < C - eps)
    # f(x_i) = sum_j alpha_j y_j K(x_j, x_i); since y_i^2 = 1, it equals y_i * (Kyy alpha)_i,
    # which reduces to X @ beta for the linear kernel
    f_train = y * (Kyy @ alpha)
    beta_0 = np.mean(y[index] - f_train[index])
    return {'alpha': alpha, 'beta': beta, 'beta_0': beta_0}

def plot_kernel(K, line):  # Specify the line style via the line argument
    res = svm_2(X, y, 1, K)
    alpha = res['alpha'][:, 0]
    beta_0 = res['beta_0']
    def f(u, v):
        S = beta_0
        for i in range(X.shape[0]):
            S = S + alpha[i] * y[i] * K(X[i, :], [u, v])
        return S[0]
    # ww is the height of f(x, y). We can draw the contour.
    uu = np.arange(-2, 2, 0.1); vv = np.arange(-2, 2, 0.1); ww = []
    for v in vv:
        w = []
        for u in uu:
            w.append(f(u, v))
        ww.append(w)
    plt.contour(uu, vv, ww, levels=[0], linestyles=line)  # boundary f = 0

Fig. 4.4 After generating samples, we draw linear (planar) and nonlinear (curved) boundaries with support vector machines (horizontal axis: X[,1]; vertical axis: X[,2])

a = 3; b = -1
n = 200
X = randn(n, 2)
y = np.sign(a * X[:, 0] + b * X[:, 1]**2 + 0.3 * randn(n))
y = y.reshape(-1, 1)
for i in range(n):
    if y[i] == 1:
        plt.scatter(X[i, 0], X[i, 1], c="red")
    else:
        plt.scatter(X[i, 0], X[i, 1], c="blue")
plot_kernel(K_poly, line="dashed")
plot_kernel(K_linear, line="solid")

pcost dcost gap pres dres

0: -6.6927e+01 -4.6679e+02 2e+03 3e+00 1e-14
1: -4.2949e+01 -2.9229e+02 5e+02 4e-01 8e-15
2: -2.8717e+01 -1.0653e+02 1e+02 1e-01 6e-15
3: -2.5767e+01 -4.7367e+01 3e+01 2e-02 4e-15
4: -2.6165e+01 -3.1836e+01 8e+00 5e-03 4e-15
5: -2.6940e+01 -2.8267e+01 2e+00 7e-04 3e-15
6: -2.7243e+01 -2.7483e+01 3e-01 1e-04 3e-15
7: -2.7325e+01 -2.7330e+01 6e-03 1e-06 3e-15
8: -2.7327e+01 -2.7328e+01 9e-05 2e-08 3e-15
9: -2.7328e+01 -2.7328e+01 9e-07 2e-10 3e-15
Optimal solution found.
pcost dcost gap pres dres
0: -8.1804e+01 -4.7816e+02 2e+03 3e+00 3e-15
1: -5.3586e+01 -3.0647e+02 4e+02 4e-01 3e-15
2: -4.1406e+01 -8.6880e+01 6e+01 3e-02 5e-15
3: -4.7360e+01 -5.9604e+01 1e+01 6e-03 2e-15
4: -4.9819e+01 -5.5157e+01 6e+00 2e-03 2e-15
5: -5.0999e+01 -5.3276e+01 2e+00 7e-04 2e-15
6: -5.1869e+01 -5.2122e+01 3e-01 9e-06 3e-15
7: -5.1966e+01 -5.2010e+01 4e-02 1e-06 2e-15
8: -5.1986e+01 -5.1988e+01 3e-03 5e-15 3e-15
9: -5.1987e+01 -5.1987e+01 8e-05 1e-15 2e-15
10: -5.1987e+01 -5.1987e+01 2e-06 3e-15 3e-15
Optimal solution found.

4.4 Spline Curves

Let $J \ge 1$. We say that the function
$$\begin{aligned}
g(x) &= \beta_1 + \beta_2 x + \beta_3 x^2 + \beta_4 x^3 + \sum_{j=1}^J \beta_{j+4} (x - \xi_j)_+^3 \\
&= \begin{cases}
g_0(x) = \beta_1 + \beta_2 x + \beta_3 x^2 + \beta_4 x^3 , & x < \xi_1 \\
g_j(x) = g_{j-1}(x) + \beta_{j+4} (x - \xi_j)^3 , & \xi_j \le x < \xi_{j+1} \\
g_J(x) = \beta_1 + \beta_2 x + \beta_3 x^2 + \beta_4 x^3 + \sum_{j=1}^J \beta_{j+4} (x - \xi_j)^3 , & x \ge \xi_J
\end{cases}
\end{aligned} \tag{4.19}$$
with the constants $\beta_1, \ldots, \beta_{J+4} \in \mathbb{R}$ is a spline function of order three with knots $0 < \xi_1 < \cdots < \xi_J < 1$. We may also define the spline function of order three as a function $g$ that is a polynomial of order three on each of the $J + 1$ intervals and whose $g, g', g''$ are continuous at the $J$ knots. The splines expressed by (4.19) form a linear space, and
$$1, x, x^2, x^3, (x - \xi_1)_+^3, \ldots, (x - \xi_J)_+^3 \tag{4.20}$$
can be its basis.
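The basis (4.20) translates directly into a design matrix. The following sketch builds it for a few knots and confirms that least squares recovers the coefficients of a given spline from noiseless values, which illustrates that the $J + 4$ functions are linearly independent. The function name spline_basis, the knots, and the coefficients are assumptions chosen for the illustration.

```python
import numpy as np

def spline_basis(x, knots):
    """Design matrix with columns 1, x, x^2, x^3, (x - xi_1)_+^3, ..., (x - xi_J)_+^3."""
    x = np.asarray(x, dtype=float)
    cols = [np.ones_like(x), x, x**2, x**3]
    cols += [np.maximum(x - xi, 0.0)**3 for xi in knots]
    return np.column_stack(cols)

knots = np.array([0.25, 0.5, 0.75])                        # J = 3 knots in (0, 1)
beta = np.array([1.0, -2.0, 0.5, 3.0, -4.0, 2.0, 1.5])     # J + 4 coefficients
x = np.linspace(0, 1, 50)
B = spline_basis(x, knots)
g = B @ beta                                               # spline values g(x) as in (4.19)

# Least squares on noiseless values returns the original coefficients
beta_hat, *_ = np.linalg.lstsq(B, g, rcond=None)
print(B.shape, np.allclose(beta_hat, beta))
```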

In particular, we consider the natural spline of order three, for which we impose the additional conditions
$$g''(\xi_1) = g'''(\xi_1) = 0 \tag{4.21}$$
and
$$g''(\xi_J) = g'''(\xi_J) = 0 . \tag{4.22}$$
The resulting curve is not of order three in $x \le \xi_1$, $\xi_J \le x$, where it is a line. The linear space of natural splines possesses $J$ dimensions. In fact, from (4.21), we have
$$g'''(\xi_1) = 6 \beta_4 = 0$$
$$g''(\xi_1) = 2 \beta_3 + 6 \beta_4 \xi_1 = 0 \iff \beta_3 = \beta_4 = 0 .$$
Additionally, from (4.22), we have
$$g'''(\xi_J) = 6 \beta_4 + 6 \sum_{j=1}^J \beta_{j+4} = 0$$
$$g''(\xi_J) = 2 \beta_3 + 6 \beta_4 \xi_J + 6 \sum_{j=1}^J \beta_{j+4} (\xi_J - \xi_j) = 0$$
$$\iff \sum_{j=1}^J \beta_{j+4} = \sum_{j=1}^J \beta_{j+4} \xi_j = 0 .$$
Thus, the $\beta_{J+3}, \beta_{J+4}$ values are determined by the other $\beta_j$, $j = 1, 2, 5, \ldots, J + 2$.

In the following, we consider the problem of finding the $f : [0, 1] \to \mathbb{R}$ that minimizes
$$\sum_{i=1}^N \{y_i - f(x_i)\}^2 + \lambda \int_0^1 \{f''(x)\}^2 dx \tag{4.23}$$
given samples $(x_1, y_1), \ldots, (x_N, y_N) \in \mathbb{R} \times \mathbb{R}$. The second term is zero if the function is a straight line, but it takes a large value if the function deviates from a straight line. In other words, this term represents the complexity of the function $f$. The constant $\lambda \ge 0$ balances the two terms: if it is large, the curve is smooth; if it is small, the curve follows the samples closely. Note that, in general, the knots $\xi_1, \ldots, \xi_J$ and the sample points $x_1, \ldots, x_N$ are defined separately.
In this case, it is known that the $f$ that minimizes (4.23) is a natural spline of order three whose $N$ knots are the sample points, $\xi_1 = x_1, \ldots, \xi_N = x_N$³. Here, $f$ is once differentiable everywhere and twice differentiable almost everywhere with $\int_0^1 \{f''(x)\}^2 dx < \infty$, which implies that $f$ is an element of $W_2[0, 1]$. A similar proposition holds for the general $W_q[0, 1]$.
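To get a feel for the penalty term in (4.23), the sketch below approximates $\int_0^1 \{f''(x)\}^2 dx$ by finite differences for a straight line and for two curved functions. The function name roughness, the grid size, and the test functions are assumptions for illustration; the values are numerical approximations rather than exact integrals.

```python
import numpy as np

def roughness(f, n_grid=2001):
    """Approximate int_0^1 {f''(x)}^2 dx with central second differences and the trapezoidal rule."""
    x = np.linspace(0.0, 1.0, n_grid)
    h = x[1] - x[0]
    y = f(x)
    d2 = (y[2:] - 2 * y[1:-1] + y[:-2]) / h**2      # f'' at the interior grid points
    return np.sum((d2[1:]**2 + d2[:-1]**2) / 2) * h

r_line = roughness(lambda x: 1 + 2 * x)             # straight line: essentially zero
r_quad = roughness(lambda x: x**2)                  # f'' = 2, so the penalty is close to 4
r_sine = roughness(lambda x: np.sin(2 * np.pi * x)) # a more strongly curved function
print(r_line, r_quad, r_sine)
```

A larger λ in (4.23) therefore pushes the minimizer toward functions with small values of this quantity, that is, toward a straight line.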
Example 67 In the case of a natural spline with $q = 2$, if we choose the basis $g_1, \ldots, g_N$ appropriately, write $g(\cdot) = \sum_{j=1}^N \beta_j g_j(\cdot)$, and set $X = (g_j(x_i)) \in \mathbb{R}^{N \times N}$, $G = \big( \int_0^1 g_i^{(q)}(x) g_j^{(q)}(x) dx \big) \in \mathbb{R}^{N \times N}$, and $y = [y_1, \ldots, y_N]^\top$, then we obtain the optimal
$$[\beta_1, \ldots, \beta_N]^\top = (X^\top X + \lambda G)^{-1} X^\top y .$$
Figure 4.5 shows the graphs obtained for $\lambda = 1, 30, 80$.

# d, h define the functions that give the basis

def d(j, x, knots):
    K = len(knots)
    return (np.maximum((x - knots[j])**3, 0)
            - np.maximum((x - knots[K - 1])**3, 0)) / (knots[K - 1] - knots[j])

def h(j, x, knots):
    K = len(knots)
    if j == 0:
        return 1
    elif j == 1:
        return x
    else:
        return d(j - 1, x, knots) - d(K - 2, x, knots)

# G gives the values obtained by integrating the functions that are differentiated twice
def G(x):  # The x values are ordered in ascending order
    n = len(x)
    g = np.zeros((n, n))
    for i in range(2, n - 1):
        for j in range(i, n):
            g[i, j] = 12 * (x[n - 1] - x[n - 2]) * (x[n - 2] - x[j - 2]) \
                * (x[n - 2] - x[i - 2]) / (x[n - 1] - x[i - 2]) / \
                (x[n - 1] - x[j - 2]) + (12 * x[n - 2] + 6 * x[j - 2] - 18 * x[i - 2]) \
                * (x[n - 2] - x[j - 2])**2 / (x[n - 1] - x[i - 2]) / (x[n - 1] - x[j - 2])
            g[j, i] = g[i, j]
    return g

3 See Chap. 7 of this series (“Statistical Learning with R/Python” (Springer)) for the proof.

Fig. 4.5 Instead of giving knots or the number of knots in the smoothing spline, we specify the λ value, which indicates smoothness. Comparing λ = 1, 30, 80, as we increase the value of λ, the spline does not follow the observed data, but it becomes smoother (plot title: Smoothing Spline (n = 100); horizontal axis: x; vertical axis: g(x); curves for λ = 1, 30, 80)

# MAIN
n = 100
x = np.random.uniform(-5, 5, n)
y = x + np.sin(x) * 2 + np.random.randn(n)  # Data Generation
index = np.argsort(x)
x = x[index]; y = y[index]
X = np.zeros((n, n))
X[:, 0] = 1
for j in range(1, n):
    for i in range(n):
        X[i, j] = h(j, x[i], x)  # Generation of Matrix X
GG = G(x)  # Generation of Matrix G
lam_set = [1, 30, 80]
col_set = ["red", "blue", "green"]

plt.figure()
plt.ylim(-8, 8)
plt.xlabel("x")
plt.ylabel("g(x)")

for i in range(3):
    lam = lam_set[i]
    gamma = np.dot(np.dot(np.linalg.inv(np.dot(X.T, X) + lam * GG), X.T), y)
    def g(u):
        S = gamma[0]
        for j in range(1, n):
            S = S + gamma[j] * h(j, u, x)
        return S
    u_seq = np.arange(-8, 8, 0.02)
    v_seq = []
    for u in u_seq:
        v_seq.append(g(u))
    plt.plot(u_seq, v_seq, c=col_set[i], label=r"$\lambda=%d$" % lam_set[i])
plt.legend()
plt.scatter(x, y, facecolors='none', edgecolors="k", marker="o")
plt.title("smooth spline (n=100)")

Generalizing (4.23), we consider minimizing

$$\sum_{i=1}^N \{y_i - f(x_i)\}^2 + \lambda \int_0^1 \{f^{(q)}(x)\}^2\,dx . \quad (4.24)$$

First, each $f = f_0 + f_1$ ($f_0 \in H_0$ and $f_1 \in H_1$ in $W_q[0, 1]$) can be written with the appropriate linear operators $P_0 \in B(H, H_0)$ and $P_1 \in B(H, H_1)$ as $f_0 = P_0 f \in H_0$, $f_1 = P_1 f \in H_1$. Since $\langle f_0, f_1 \rangle_H = 0$, $f_0, f_1$ minimize $\|f - f_0\|_H$ and $\|f - f_1\|_H$, respectively. Furthermore, $P_0, P_1$ are self-adjoint. In fact, from Proposition 19, for each $i = 0, 1$, we have

$$\langle P_i f, g \rangle_H = \langle f_i, g_0 + g_1 \rangle_H = \langle f_i, g_i \rangle_H = \langle f_0 + f_1, g_i \rangle_H = \langle f, P_i g \rangle_H$$

for $f_0, g_0 \in H_0$, $f_1, g_1 \in H_1$, $f = f_0 + f_1$, $g = g_0 + g_1$. Moreover, we have $P_i f \in H_i$ and $P_i^2 f = P_i f$. Thus, we can write the norm of the second term in (4.24) as

$$\int_0^1 |f^{(q)}(x)|^2\,dx = \|P_1 f\|_{H_1}^2 = \langle P_1 f, P_1 f \rangle_{H_1} = \langle f, P_1^2 f \rangle_H = \langle f, P_1 f \rangle_H$$

and can express (4.24) as

$$\sum_{i=1}^N \{y_i - f(x_i)\}^2 + \lambda \langle f, P_1 f \rangle_H \quad (4.25)$$

for $f \in W_q[0, 1]$. Let $f = g + h \in H$, $g \in M := \mathrm{span}\{\phi_0(\cdot), \ldots, \phi_{q-1}(\cdot), k(x_1, \cdot), \ldots, k(x_N, \cdot)\}$, and $h \in M^\perp$. Then, for $i = 1, \ldots, N$, we have

$$f(x_i) = \langle g + h, k(x_i, \cdot) \rangle_H = g(x_i)$$

and

$$\|P_1 f\|_{H_1} \ge \|P_1 g\|_{H_1}$$

(the representation theorem). Thus, we may restrict the range of $f$ to $M$ for searching the optimum to find $\alpha_1, \ldots, \alpha_N, \beta_0, \ldots, \beta_{q-1}$ in

$$g(\cdot) = \sum_{i=0}^{q-1} \beta_i \phi_i(\cdot) + \sum_{i=1}^N \alpha_i k(x_i, \cdot) . \quad (4.26)$$

In natural spline functions, we regard the derivatives of orders $q, \ldots, 2q - 1$ at $x = x_N$ as zero, which means that

$$g^{(q)}(x_N) = \cdots = g^{(2q-1)}(x_N) = 0, \quad (4.27)$$

and the dimensionality of $\mathrm{span}\{k_1(x_i, \cdot) \mid i = 1, \ldots, N\}$ is $N - q$. For spline functions of order three ($q = 2$), (4.27) corresponds to (4.22). The basis $\{1, x\}$ w.r.t. the lines in $x \le x_1$ corresponds to $\{\phi_0(x), \ldots, \phi_{q-1}(x)\}$. Thus, we find the optimal solution in an $N$-dimensional subspace of $W_q[0, 1]$.

Proposition 44 Let $r \in W_q[0, 1]$ be a natural spline with knots $x_1, \ldots, x_N$ and a maximum order of $2q - 1$, and suppose that $g \in W_q[0, 1]$ satisfies $g(x_i) = r(x_i)$ for $i = 1, 2, \ldots, N$. Then, we have

$$\int_0^1 \{r^{(q)}(x)\}^2\,dx \le \int_0^1 \{g^{(q)}(x)\}^2\,dx .$$

Moreover, when the equality holds, $s := g - r$ is a polynomial of order at most $q - 1$ such that $s(x_i) = 0$ for $i = 1, 2, \ldots, N$, and if $N \ge q$, then the function $s$ is zero.

Proof: See the appendix at the end of this chapter.

Since the natural splines of the highest order $2q - 1$ possess $N$ dimensions, an $r \in W_q[0, 1]$ exists that shares the values $r(x_1) = g(x_1), \ldots, r(x_N) = g(x_N)$ at the $N$ knots $x_1, \ldots, x_N$. Among the functions sharing these values, the natural spline minimizes the second term in (4.25), so the natural spline of the highest order $2q - 1$ is optimal.
To summarize the above, the problem of finding the $f$ that minimizes (4.25) in $W_q[0, 1]$ reduces to finding the solution over the range of (4.26), (4.27). In other words, we can think of the problem in a subspace with $N$ dimensions.
Moreover, the basis consists of $N$ elements regardless of the value of $q \ge 1$, and if we set $g(\cdot) = \sum_{j=1}^N \beta_j g_j(\cdot)$, the problem is to find the $\beta_1, \ldots, \beta_N$ that minimize

$$\sum_{i=1}^N \Big\{y_i - \sum_{j=1}^N \beta_j g_j(x_i)\Big\}^2 + \lambda \sum_{i=1}^N \sum_{j=1}^N \beta_i \beta_j \int_0^1 g_i^{(q)}(x) g_j^{(q)}(x)\,dx .$$

Let $X = (g_j(x_i)) \in \mathbb{R}^{N \times N}$, $G = \left(\int_0^1 g_i^{(q)}(x) g_j^{(q)}(x)\,dx\right) \in \mathbb{R}^{N \times N}$, and $y = [y_1, \ldots, y_N]^\top$. The optimal solution $\beta = [\beta_1, \ldots, \beta_N]^\top$ is given by

$$\beta = (X^\top X + \lambda G)^{-1} X^\top y .$$
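As a minimal sketch of the closed form above (not the program of Example 67), the following solves the penalized least-squares problem for generic inputs. The helper name `penalized_ls` and the toy `X`, `G`, `y` are assumptions made for illustration; `np.linalg.solve` is used instead of an explicit inverse, which is a common numerical choice rather than something the text prescribes.

```python
import numpy as np

def penalized_ls(X, G, y, lam):
    """Hypothetical helper: beta minimizing ||y - X beta||^2 + lam * beta^T G beta."""
    # Normal equations: (X^T X + lam G) beta = X^T y
    return np.linalg.solve(X.T @ X + lam * G, X.T @ y)

# Toy data (assumed for illustration); G is symmetric and nonnegative definite
rng = np.random.default_rng(0)
N = 6
X = rng.standard_normal((N, N))
B = rng.standard_normal((N, N))
G = B @ B.T
y = rng.standard_normal(N)
lam = 0.5

beta = penalized_ls(X, G, y, lam)
# The gradient of the objective vanishes at the minimizer
grad = -2 * X.T @ (y - X @ beta) + 2 * lam * G @ beta
print(np.max(np.abs(grad)))
```

The printed value should be at the level of rounding error, which confirms that `beta` satisfies the first-order condition of the objective.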

4.5 Random Fourier Features

In the following, we examine computational cost reduction methods.

In particular, in this section, we learn about random Fourier features, which we
can apply to the case where the kernel k(x, y) (x, y ∈ E) is a function of x − y.

Proposition 45 (Rahimi and Recht [23]) Suppose that $k : E \times E \ni (x, y) \mapsto k(x, y) \in \mathbb{R}$ is a function of $x - y$. Then, we have

$$k(x, y) = 2 E_{\omega,b}\left[\cos(\omega^\top x + b) \cos(\omega^\top y + b)\right] , \quad (4.28)$$

where the expectation $E_{\omega,b}$ is calculated over $\omega \sim \mu$ (the probability of $k$ in Proposition 5) and $b \in [0, 2\pi)$ (the uniform distribution).

Proof: The claim is due to Bochner’s theorem (Proposition 5). See the appendix at the end of this chapter for details.

Based on Proposition 45, we generate $\sqrt{2} \cos(\omega^\top x + b)$ $m \ge 1$ times, i.e., we draw $(\omega_i, b_i)$, $i = 1, \ldots, m$, and construct the functions

$$z_i(x) = \sqrt{2} \cos(\omega_i^\top x + b_i), \quad i = 1, \ldots, m.$$

From the law of large numbers, the constructed

$$\hat{k}(x, y) := \frac{1}{m} \sum_{i=1}^m z_i(x) z_i(y)$$

approaches $k(x, y)$. The method that utilizes this fact to reduce the complexity of kernel computation when $m$ is small compared to $N$ is called random Fourier features (RFF).
We claim that the RFF possesses the following property:

$$P(|k(x, y) - \hat{k}(x, y)| \ge \epsilon) \le 2 \exp(-m \epsilon^2 / 8) . \quad (4.29)$$
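The construction of $\hat{k}$ can be sketched for $d$-dimensional inputs as follows. This is only an illustration under stated assumptions: the helper name `rff_features`, the dimension, and the bandwidth are hypothetical, and the sampling distribution $\omega \sim N(0, \sigma^{-2} I)$ is the one this chapter uses for the Gaussian kernel $\exp(-\|x - y\|^2 / (2\sigma^2))$. The factor $1/\sqrt{m}$ is folded into the features so that an inner product gives $\hat{k}$ directly.

```python
import numpy as np

def rff_features(X, omega, b):
    """Hypothetical helper: map the rows of X (n x d) to m features sqrt(2/m) cos(omega_i^T x + b_i)."""
    m = omega.shape[0]
    return np.sqrt(2.0 / m) * np.cos(X @ omega.T + b)

rng = np.random.default_rng(0)
d, m, sigma = 3, 20000, 2.0
omega = rng.standard_normal((m, d)) / sigma      # omega_i ~ N(0, sigma^{-2} I)
b = rng.uniform(0.0, 2.0 * np.pi, size=m)        # b_i ~ Uniform[0, 2 pi)

x = rng.standard_normal((1, d))
y = rng.standard_normal((1, d))
k_exact = np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))
k_hat = (rff_features(x, omega, b) @ rff_features(y, omega, b).T).item()
print(k_exact, k_hat)
```

With a large `m` the two printed values should be close; repeating the draw of `omega` and `b` gives a different `k_hat` each time, scattered around `k_exact`.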

Proposition 46 (Hoeffding’s Inequality) For independent random variables $X_i$, $i = 1, \ldots, n$, each of which takes values in $[a_i, b_i]$, and an arbitrary $\epsilon > 0$, we have

$$P(|\bar{X} - E[\bar{X}]| \ge \epsilon) \le 2 \exp\left(-\frac{2 n^2 \epsilon^2}{\sum_{i=1}^n (b_i - a_i)^2}\right), \quad (4.30)$$

where $\bar{X}$ denotes the sample mean $(X_1 + \cdots + X_n)/n$.

Proof: We use the Chernoff bound and Hoeffding’s lemma, which are shown below.

Lemma 5 (Chernoff Bound) For a random variable $X$ and an arbitrary $\epsilon > 0$, we have

$$P(X \ge \epsilon) \le \inf_{s > 0} e^{-s\epsilon} E[e^{sX}] . \quad (4.31)$$

To prove this lemma, we use the following lemma.

Lemma 6 (Markov’s Inequality) For a random variable $X$ that takes nonnegative values and an arbitrary $\epsilon > 0$, we have

$$P(X \ge \epsilon) \le \frac{E[X]}{\epsilon} .$$

Lemma 6 is due to

$$E[X] = E[X \cdot I(X \ge \epsilon)] + E[X \cdot I(X < \epsilon)] \ge E[X \cdot I(X \ge \epsilon)] \ge \epsilon P(X \ge \epsilon) .$$

Lemma 5 follows from Lemma 6 and the fact that

$$P(X \ge \epsilon) = P(sX \ge s\epsilon) = P(\exp(sX) \ge \exp(s\epsilon)) \le e^{-s\epsilon} E[e^{sX}]$$

for $s > 0$. To prove Proposition 46, we use the following lemma:

Lemma 7 (Hoeffding) Suppose that a random variable $X$ satisfies $E[X] = 0$ and $a \le X \le b$. Then, for an arbitrary $\epsilon > 0$, we have

$$E[e^{\epsilon X}] \le e^{\epsilon^2 (b - a)^2 / 8} . \quad (4.32)$$

Proof: See the appendix at the end of this chapter.

Returning to the proof of Proposition 46, let $S_n := \sum_{i=1}^n X_i$, and apply Lemma 5 to obtain

$$P(S_n - E[S_n] \ge \epsilon) \le \min_{s > 0} e^{-s\epsilon} E[\exp\{s(S_n - E[S_n])\}] .$$

In particular, since $X_1, \ldots, X_n$ are independent, we have

$$e^{-s\epsilon} E[\exp\{s(S_n - E[S_n])\}] = e^{-s\epsilon} \prod_{i=1}^n E[e^{s(X_i - E[X_i])}] .$$

Moreover, by applying Lemma 7, we obtain

$$P(S_n - E[S_n] \ge \epsilon) \le \min_{s > 0} \exp\Big\{-s\epsilon + \frac{s^2}{8} \sum_{i=1}^n (b_i - a_i)^2\Big\} ,$$

in which the minimum value is attained when $s := 4\epsilon / \sum_{i=1}^n (b_i - a_i)^2$, and we have

$$P(S_n - E[S_n] \ge \epsilon) \le \exp\Big\{-2\epsilon^2 \Big/ \sum_{i=1}^n (b_i - a_i)^2\Big\} .$$

Furthermore, if we replace $X_1, \ldots, X_n$ with $-X_1, \ldots, -X_n$, we obtain

$$P(S_n - E[S_n] \le -\epsilon) \le \exp\Big\{-2\epsilon^2 \Big/ \sum_{i=1}^n (b_i - a_i)^2\Big\} .$$

Hence, we have

$$P(|S_n - E[S_n]| \ge \epsilon) = 1 - P(|S_n - E[S_n]| < \epsilon) \le P(S_n - E[S_n] \ge \epsilon) + P(S_n - E[S_n] \le -\epsilon) \le 2 \exp\Big\{-2\epsilon^2 \Big/ \sum_{i=1}^n (b_i - a_i)^2\Big\} .$$

If we substitute $\bar{X} = S_n / n$ and replace $\epsilon$ with $n\epsilon$, we obtain Proposition 46.
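To see how (4.30) behaves numerically, the following simulation compares an empirical tail probability with the bound. It is a sketch under assumptions chosen only for illustration: the $X_i$ are uniform on $[0, 1]$ (so $a_i = 0$, $b_i = 1$, and $E[\bar{X}] = 0.5$), and `n`, `trials`, and `eps` are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials, eps = 50, 20000, 0.1

# X_i uniform on [0, 1]: a_i = 0, b_i = 1, E[X_bar] = 0.5
samples = rng.uniform(0.0, 1.0, size=(trials, n))
x_bar = samples.mean(axis=1)
empirical = np.mean(np.abs(x_bar - 0.5) >= eps)

# Right-hand side of (4.30) with sum_i (b_i - a_i)^2 = n
bound = 2.0 * np.exp(-2.0 * n ** 2 * eps ** 2 / n)
print(empirical, bound)
```

The empirical frequency should not exceed the bound. Hoeffding’s inequality uses only the range of the variables, so the bound is typically far from tight.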

Fig. 4.6 Boxplots of $\hat{k}(x, y) - k(x, y)$ for $m = 20$, $m = 100$, and $m = 400$. In the RFF approximation, we generated $\hat{k}(x, y)$ 1000 times by changing $m$. We observe that they all have zero centers, and the larger $m$ is, the smaller the estimation error is

Since $E[\hat{k}(x, y)] = k(x, y)$ and $-2 \le z_i(x) z_i(y) \le 2$, using Proposition 46, we obtain (4.29)⁴.
Example 68 From Example 19, since the probability of a Gaussian kernel has a mean of 0 and a covariance matrix $\sigma^{-2} I \in \mathbb{R}^{d \times d}$, we generate the $d$-dimensional random numbers and uniform random numbers independently and construct the $m$ functions $z_i(x) = \sqrt{2} \cos(\omega_i^\top x + b_i)$, $i = 1, \ldots, m$. We draw a boxplot of $\hat{k}(x, y) - k(x, y)$ by generating $(x, y)$ 1000 times with $d = 1$ and $m = 20, 100, 400$ in Fig. 4.6. We observe that $\hat{k}(x, y) - k(x, y)$ has a mean of 0 ($\hat{k}(x, y)$ is an unbiased estimator), and the larger $m$ is, the smaller the variance is. The program is written as follows:

import numpy as np
from numpy.random import randn
import matplotlib.pyplot as plt

sigma = 10
sigma2 = sigma**2

def k(x, y):
    return np.exp(-(x - y)**2 / (2 * sigma2))

def z(x):    # uses the global m, w, b set in the loop below
    return np.sqrt(2 / m) * np.cos(w * x + b)

def zz(x, y):
    return np.sum(z(x) * z(y))

u = np.zeros((1000, 3))
m_seq = [20, 100, 400]
for i in range(1000):
    x = randn(1)
    y = randn(1)
    for j in range(3):
        m = m_seq[j]
        w = randn(m) / sigma
        b = np.random.rand(m) * 2 * np.pi
        u[i, j] = zz(x, y) - k(x, y)

4 The original paper by Rahimi and Recht (2007) and subsequent work proved more rigorous upper and lower bounds than these [2].

fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
ax.boxplot([u[:, 0], u[:, 1], u[:, 2]], labels=['20', '100', '400'])
ax.set_xlabel('m')
ax.set_ylim(-0.5, 0.6)
plt.show()

The solution $\alpha = [\alpha_1, \ldots, \alpha_N]^\top$ with $f(\cdot) = \sum_{i=1}^N \alpha_i k(x_i, \cdot)$ for kernel ridge regression with the Gram matrix $K$ is given by (4.6) (Sect. 4.1). If we obtain the $\hat{f}$ that approximates $f$ by approximating the Gram matrix $K$ via RFF as $\hat{K} = Z Z^\top$, then we obtain $\hat{f}(\cdot) = \sum_{i=1}^N \hat{\alpha}_i \hat{k}(x_i, \cdot)$ by using the $\hat{\alpha} \in \mathbb{R}^N$ such that $(\hat{K} + \lambda I_N)\hat{\alpha} = y$, where $Z = (z_j(x_i)) \in \mathbb{R}^{N \times m}$ (with the factor $1/\sqrt{m}$ absorbed into each $z_j$, as in the programs of this section) and $I_N \in \mathbb{R}^{N \times N}$ is the unit matrix.

Using Woodbury’s formula for $U \in \mathbb{R}^{r \times s}$, $V \in \mathbb{R}^{s \times r}$, $r, s \ge 1$,

$$U(I_s + VU) = (I_r + UV)U ,$$

we have

$$Z^\top (Z Z^\top + \lambda I_N)^{-1} = (Z^\top Z + \lambda I_m)^{-1} Z^\top .$$

Let $x \in E$ be a value other than the $x_1, \ldots, x_N$ used for estimation, and let $z(x) := [z_1(x), \ldots, z_m(x)]$ (row vector). Then, for

$$\hat{\beta} := (Z^\top Z + \lambda I_m)^{-1} Z^\top y , \quad (4.33)$$

we have

$$\hat{f}(x) = \sum_{i=1}^N \hat{\alpha}_i \hat{k}(x, x_i) = z(x) \sum_{i=1}^N z^\top(x_i) \hat{\alpha}_i = z(x) Z^\top \hat{\alpha} = z(x) Z^\top (\hat{K} + \lambda I_N)^{-1} y = z(x) (Z^\top Z + \lambda I_m)^{-1} Z^\top y = z(x) \hat{\beta} .$$

Then, for the new $x \in E$, we can find its value from $\hat{f}(x) = z(x)\hat{\beta}$. The computational complexity of (4.33) is $O(m^2 N)$ for the multiplication of $Z^\top Z$, $O(m^3)$ for finding the inverse of $Z^\top Z + \lambda I_m \in \mathbb{R}^{m \times m}$, $O(Nm)$ for the multiplication of $Z^\top y$, and $O(m^2)$ for multiplying $(Z^\top Z + \lambda I_m)^{-1}$ and $Z^\top y$. Thus, overall, the process requires only $O(N m^2)$ complexity at most. On the other hand, the process takes $O(N^3)$ time when using the kernel without approximation. If $m = N/10$, the computational time becomes 1/100. Obtaining $\hat{f}(x)$ from a new $x \in E$ also takes only $O(m)$ time.
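The equivalence between the $N \times N$ route via $\hat{\alpha}$ and the $m \times m$ route via $\hat{\beta}$ can be checked numerically. In this sketch, `Z`, `y`, and `z_new` are random stand-ins (assumptions for illustration) for the feature matrix, the responses, and the row vector $z(x)$; only the linear algebra is being illustrated.

```python
import numpy as np

rng = np.random.default_rng(0)
N, m, lam = 30, 5, 0.1
Z = rng.standard_normal((N, m))     # stands in for Z = (z_j(x_i))
y = rng.standard_normal(N)
z_new = rng.standard_normal(m)      # stands in for the row vector z(x)

# N x N route: alpha_hat solves (Z Z^T + lam I_N) alpha = y
alpha_hat = np.linalg.solve(Z @ Z.T + lam * np.eye(N), y)
f_alpha = z_new @ Z.T @ alpha_hat

# m x m route (4.33): beta_hat = (Z^T Z + lam I_m)^{-1} Z^T y
beta_hat = np.linalg.solve(Z.T @ Z + lam * np.eye(m), Z.T @ y)
f_beta = z_new @ beta_hat

print(f_alpha, f_beta)
```

Both routes should print the same prediction up to rounding, while the second one only solves an $m \times m$ system.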

Example 69 We applied RFF to kernel ridge regression. For $N = 200$ data, we used $m = 20$ for the approximation. We plotted the curve for $\lambda = 10^{-6}, 10^{-4}$ (Fig. 4.7). The program is as follows:
Fig. 4.7 We applied RFF to kernel ridge regression ($m = 20$, $N = 200$; each panel plots $y$ against $x$ with the curves “W/O. Approx.” and “W. Approx.”). On the left and right are $\lambda = 10^{-6}$ and $\lambda = 10^{-4}$, respectively

import numpy as np
from numpy.random import randn
import matplotlib.pyplot as plt

sigma = 10
sigma2 = sigma**2

# Function z
m = 20
w = randn(m) / sigma
b = np.random.rand(m) * 2 * np.pi
def z(u, m):
    return np.sqrt(2 / m) * np.cos(w * u + b)

# Gaussian kernel
def k(x, y):
    return np.exp(-(x - y)**2 / (2 * sigma2))

# Data generation
n = 200
x = randn(n) / 2
y = 1 + 5 * np.sin(x / 10) + 5 * x**2 + randn(n)
x_min = np.min(x); x_max = np.max(x); y_min = np.min(y); y_max = np.max(y)
lam = 0.001
# lam = 0.9

# Low-rank approximated function
def alpha_rff(x, y, m):
    n = len(x)
    Z = np.zeros((n, m))
    for i in range(n):
        Z[i, :] = z(x[i], m)
    beta = np.dot(np.linalg.inv(np.dot(Z.T, Z) + lam * np.eye(m)), np.dot(Z.T, y))
    return beta

# Usual function
def alpha(k, x, y):
    n = len(x)
    K = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = k(x[i], x[j])
    alpha = np.dot(np.linalg.inv(K + lam * np.eye(n)), y)
    return alpha

# Numerical comparison
alpha_hat = alpha(k, x, y)
beta_hat = alpha_rff(x, y, m)
r = np.sort(x)
u = np.zeros(n)
v = np.zeros(n)
for j in range(n):
    S = 0
    for i in range(n):
        S = S + alpha_hat[i] * k(x[i], r[j])
    u[j] = S
    v[j] = np.sum(beta_hat * z(r[j], m))

plt.scatter(x, y, facecolors='none', edgecolors="k", marker="o")
plt.plot(r, u, c="r", label="w/o Approx")
plt.plot(r, v, c="b", label="with Approx")
plt.xlim(-1.5, 2)
plt.ylim(-2, 8)
plt.xlabel("x")
plt.ylabel("y")
plt.title("Kernel Regression")
plt.legend(loc="upper left", frameon=True, prop={'size': 14})

In practice, the RFF are said to cause no significant degradation due to approximation. Still, the approximation does raise an issue regarding theoretical guarantees.

4.6 Nyström Approximation

We consider finding the coefficient estimates $(K + \lambda I)^{-1} y$ in kernel ridge regression. Suppose that we can realize a low-rank matrix decomposition $K = R R^\top$ with $R \in \mathbb{R}^{N \times m}$ in a computationally inexpensive way. In this case, we can complete the estimation task quickly. Note that we have

$$(R R^\top + \lambda I_N)^{-1} = \frac{1}{\lambda} \{I_N - R (R^\top R + \lambda I_m)^{-1} R^\top\} , \quad (4.34)$$

which is due to Sherman-Morrison-Woodbury’s formula⁵: for $r, s \ge 1$, $A \in \mathbb{R}^{s \times s}$, $U \in \mathbb{R}^{s \times r}$, $C \in \mathbb{R}^{r \times r}$, $V \in \mathbb{R}^{r \times s}$,

$$(A + U C V)^{-1} = A^{-1} - A^{-1} U (C^{-1} + V A^{-1} U)^{-1} V A^{-1} \quad (4.35)$$

with $r = m$, $s = N$, $A = \lambda I_N$, $U = R$, $C = I_r$, and $V = R^\top$.
Computing the left side of (4.34) requires an inverse matrix operation of size $N$, while computing the right side involves the product of $N \times m$ and $m \times m$ matrices and an inverse matrix operation of size $m$. The computations on the left- and right-hand sides require $O(N^3)$ and $O(N^2 m)$ complexity, respectively. In the following part of this section, we show that with some approximation, the decomposition of $K = R R^\top$ is completed in $O(N m^2)$ time, i.e., the calculation of the ridge regression is performed in $O(N m^2)$. In other words, if $N/m = 10$, the computational time is only 1/100.
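Identity (4.34) can be checked numerically with a random low-rank factor. The sketch below is an illustration under assumptions (`R` and `y` are random stand-ins). It applies the right-hand side to $y$ from right to left, so that only an $m \times m$ system is solved and no $N \times N$ matrix is formed; that evaluation order is a choice made here, not something the text prescribes.

```python
import numpy as np

rng = np.random.default_rng(0)
N, m, lam = 40, 4, 0.5
R = rng.standard_normal((N, m))
y = rng.standard_normal(N)

# Left-hand side of (4.34) applied to y: an N x N solve
lhs = np.linalg.solve(R @ R.T + lam * np.eye(N), y)

# Right-hand side applied to y, evaluated right to left
t = np.linalg.solve(R.T @ R + lam * np.eye(m), R.T @ y)
rhs = (y - R @ t) / lam

print(np.max(np.abs(lhs - rhs)))
```

The printed difference should be at rounding level. Forming $R^\top R$ costs $O(N m^2)$ and the remaining steps cost $O(m^3)$ and $O(N m)$.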

5 Joe Suzuki, “Statistical Learning with Math and R/Python”.


In Sect. 3.3, based on (3.18), we considered approximating the eigenfunctions from $x_1, \ldots, x_m \in E$ by

$$\phi_i(\cdot) = \frac{\sqrt{m}}{\lambda_i^{(m)}} \sum_{j=1}^m k(x_j, \cdot) U_{j,i} .$$

Let $m \le N$; from the first $m$ samples $x_1, \ldots, x_m$ of $x_1, \ldots, x_m, x_{m+1}, \ldots, x_N$, we construct $\phi_i$ and $\lambda_i$. Then, via

$$v_i := [\phi_i(x_1)/\sqrt{N}, \ldots, \phi_i(x_N)/\sqrt{N}]^\top \in \mathbb{R}^N ,$$

$$\lambda_i^{(N)} := N \lambda_i ,$$

$$K_N = \sum_{i=1}^m \lambda_i^{(N)} v_i v_i^\top ,$$

we approximate the Gram matrix $K_N$ w.r.t. $x_1, \ldots, x_N$. In order to decompose it as $R R^\top$, we may set

$$R = \left[\sqrt{\lambda_1^{(N)}}\, v_1, \ldots, \sqrt{\lambda_m^{(N)}}\, v_m\right] .$$

To compute $R$, we require $O(m^3)$ and $O(N m^2)$ time complexities for obtaining the eigenvalues and eigenvectors of $K_m$ and $v_1, \ldots, v_m \in \mathbb{R}^N$, respectively. Thus, the computation completes in $O(N m^2)$ time in total.
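The construction of $R$ can be sketched in vectorized form. The sketch assumes, as the program of Example 70 does, that $\lambda_i^{(N)} = \lambda_i^{(m)} N / m$. The helper name `nystrom_factor`, the kernel bandwidth, and the toy inputs are hypothetical, and `np.linalg.eigh` is used in place of an SVD because $K_m$ is symmetric.

```python
import numpy as np

def nystrom_factor(K_Nm, m):
    """Hypothetical helper: R (N x m) with R @ R.T approximating K, from the first m columns of K."""
    N = K_Nm.shape[0]
    lam_m, U = np.linalg.eigh(K_Nm[:m, :])      # eigenvalues and eigenvectors of K_m
    V = np.sqrt(m / N) * (K_Nm @ U) / lam_m     # columns v_i = phi_i(x_.) / sqrt(N)
    lam_N = lam_m * N / m                       # lambda_i^{(N)}
    return V * np.sqrt(lam_N)                   # columns sqrt(lambda_i^{(N)}) v_i

def k(x, y, sigma2=0.25):
    return np.exp(-(x - y) ** 2 / (2 * sigma2))

rng = np.random.default_rng(0)
N, m = 50, 8
# Assumed toy inputs: the first m points are spread out so that K_m is well conditioned
x = np.concatenate([np.linspace(-3, 3, m), rng.uniform(-3, 3, N - m)])
K = k(x[:, None], x[None, :])
R = nystrom_factor(K[:, :m], m)
print(np.max(np.abs((R @ R.T)[:m, :m] - K[:m, :m])))
```

On the first $m$ samples the approximation reproduces $K_m$ up to rounding, so the printed value should be close to zero; the error of the approximation is confined to the remaining block.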

Example 70 We compared the results of kernel ridge regression with $N = 300$, $m = 10, 20$, and $\lambda = 10^{-5}, 10^{-3}$ (Fig. 4.8). For these data, when $\lambda \ge 1$, the graphs obtained with and without approximation were consistent. For $m = 10, 20$, the curves were almost identical. We observed that the approximation error was smaller when $\lambda$ was small for RFF, while the error was smaller when $\lambda$ was large for the Nyström approximation.

import numpy as np
from numpy.random import randn
import matplotlib.pyplot as plt

sigma2 = 1
def k(x, y):
    return np.exp(-(x - y)**2 / (2 * sigma2))

n = 300
x = randn(n) / 2
y = 3 - 2 * x**2 + 3 * x**3 + 2 * randn(n)
lam = 10**(-5)
m = 10

K = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        K[i, j] = k(x[i], x[j])

# Low-rank approximated function
def alpha_m(K, x, y, m):
    n = len(x)
    U, D, V = np.linalg.svd(K[:m, :m])
    u = np.zeros((n, m))
    for i in range(m):
        for j in range(n):
            u[j, i] = np.sqrt(m / n) * np.sum(K[j, :m] * U[:m, i] / D[i])
    mu = D * n / m
    R = np.zeros((n, m))
    for i in range(m):
        R[:, i] = np.sqrt(mu[i]) * u[:, i]
    Z = np.linalg.inv(np.dot(R.T, R) + lam * np.eye(m))
    # right-hand side of (4.34) applied to y
    alpha = np.dot((np.eye(n) - np.dot(np.dot(R, Z), R.T)), y) / lam
    return alpha

# Usual function
def alpha(K, x, y):
    alpha = np.dot(np.linalg.inv(K + lam * np.eye(n)), y)
    return alpha

# Numerical comparison
alpha_1 = alpha(K, x, y)
alpha_2 = alpha_m(K, x, y, m)
r = np.sort(x)
w = np.zeros(n)
v = np.zeros(n)
for j in range(n):
    S_1 = 0
    S_2 = 0
    for i in range(n):
        S_1 = S_1 + alpha_1[i] * k(x[i], r[j])
        S_2 = S_2 + alpha_2[i] * k(x[i], r[j])
    w[j] = S_1
    v[j] = S_2
plt.scatter(x, y, facecolors='none', edgecolors="k", marker="o")
plt.plot(r, w, c="r", label="w/o Approx")
plt.plot(r, v, c="b", label="with Approx")
plt.xlim(-1.5, 2)
plt.ylim(-2, 8)
plt.xlabel("x")
plt.ylabel("y")
plt.title("Kernel Regression")
plt.legend(loc="upper left", frameon=True, prop={'size': 14})

4.7 Incomplete Cholesky Decomposition

In general, we can decompose a positive definite matrix $A \in \mathbb{R}^{N \times N}$ into $A = R R^\top$ by using a lower triangular matrix $R$ with nonnegative diagonal components. Such a decomposition is called the Cholesky decomposition of $A$.
Proposition 47 For a nonnegative definite matrix $A \in \mathbb{R}^{n \times n}$, there exists a Cholesky decomposition $A = R R^\top$, which is unique if and only if $A$ is positive definite.
Many books cover this material. For the proofs, see, for example, [9].
The following is the Cholesky decomposition procedure. We construct the process so that we can stop anytime to obtain an approximation $R R^\top$ with rank $r \le N$.
1. In the initial stage, $B = A$, and $R$ is a zero matrix.
Fig. 4.8 We approximated data with $N = 300$ and ranks $m = 10, 20$ (each panel plots $y$ against $x$). The upper and lower subfigures display the results obtained when running $\lambda = 10^{-5}$ and $\lambda = 10^{-3}$, respectively. The red and blue lines are the results obtained without approximation and with approximation, respectively. The accuracy is almost the same as that in the case without approximation when $m = 20$. The larger the value of $\lambda$ is, the smaller the approximation error becomes

2. For each $i = 1, \ldots, r$, the first $i$ columns of $R$ are set so that $B_{j,i} = \sum_{h=1}^N R_{j,h} R_{i,h}$ for $j = 1, \ldots, N$. In other words, the setup is complete through the $i$th column of $B$.

$$R = \begin{bmatrix}
R_{1,1} & 0 & \cdots & \cdots & \cdots & 0 \\
\vdots & \ddots & \ddots & \cdots & \cdots & 0 \\
R_{i,1} & \cdots & R_{i,i} & 0 & \cdots & 0 \\
R_{i+1,1} & \cdots & R_{i+1,i} & 0 & \cdots & 0 \\
\vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\
R_{N,1} & \cdots & R_{N,i} & 0 & \cdots & 0
\end{bmatrix} .$$

In this case, we swap the two subscripts in $B$ by multiplying a matrix $Q$ from the front and rear of $B$.
3. The final result is that $R R^\top = B = P^\top A P$ with $P = Q_1 \cdots Q_N$. Therefore, $A = P R R^\top P^\top$, and we have that $P R (P R)^\top$ is the Cholesky decomposition.

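Here, to swap the $i$th and $j$th rows and the $i$th and $j$th columns of the symmetric matrix $B$, let $Q$ be the matrix obtained by replacing the $(i, j), (j, i)$ and $(i, i), (j, j)$ components of the unit matrix with 1 and 0, respectively, and multiply $B$ by the symmetric matrix $Q$ from the front and rear. For example,

$$Q B Q = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & 1 & 0 \end{bmatrix}
\begin{bmatrix} b_{11} & b_{12} & b_{13} \\ b_{21} & b_{22} & b_{23} \\ b_{31} & b_{32} & b_{33} \end{bmatrix}
\begin{bmatrix} 1 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & 1 & 0 \end{bmatrix}
= \begin{bmatrix} b_{11} & b_{13} & b_{12} \\ b_{31} & b_{33} & b_{32} \\ b_{21} & b_{23} & b_{22} \end{bmatrix} .$$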
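The effect of multiplying by $Q$ from both sides can be checked on a small numeric matrix. In this sketch the helper name `swap_matrix` is hypothetical, indices are 0-based, and the entries of `B` are chosen so that the entry written `rc` stands for $b_{rc}$.

```python
import numpy as np

def swap_matrix(n, i, j):
    """Unit matrix with the (i, i), (j, j) entries set to 0 and the (i, j), (j, i) entries set to 1."""
    Q = np.eye(n)
    Q[i, i] = 0; Q[j, j] = 0
    Q[i, j] = 1; Q[j, i] = 1
    return Q

B = np.array([[11., 12., 13.],
              [21., 22., 23.],
              [31., 32., 33.]])    # entry "rc" stands for b_rc
Q = swap_matrix(3, 1, 2)           # 0-based: swap the 2nd and 3rd rows and columns
print(Q @ B @ Q)
```

The product exchanges the second and third rows and the second and third columns at once, so the diagonal entries $b_{22}$ and $b_{33}$ also trade places.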

Specifically, for $i = 1, 2, \ldots, r$, we perform the following steps. Assume that $\epsilon > 0$.

1. Let $k$ be the $j$ ($i \le j \le N$) that maximizes $R_{j,j}^2 = B_{j,j} - \sum_{h=1}^{i-1} R_{j,h}^2$.

(a) Swap the $i$th and $k$th rows and the $i$th and $k$th columns of $B$.
(b) Let $Q_{i,k} := 1$, $Q_{k,i} := 1$, $Q_{i,i} := 0$, $Q_{k,k} := 0$.
(c) Swap $R_{i,1}, \ldots, R_{i,i-1}$ and $R_{k,1}, \ldots, R_{k,i-1}$.
(d) $R_{i,i} = \sqrt{B_{k,k} - \sum_{h=1}^{i-1} R_{k,h}^2}$.

2. End if $R_{i,i} < \epsilon$.

3. $R_{j,i} = \dfrac{1}{R_{i,i}} \Big(B_{j,i} - \sum_{h=1}^{i-1} R_{j,h} R_{i,h}\Big)$ for each $j = i + 1, \ldots, N$.
Once the $i$th column is completed, $B_{j,i} = \sum_{h=1}^N R_{j,h} R_{i,h}$ follows, and $R_{j,i}$ remains the same after that for each $j = 1, \ldots, N$. Then, $B = R R^\top$ follows if the procedure completes up to $r = N$.
At the beginning of each $i = 1, 2, \ldots, r$, we select the $j$ that maximizes $R_{j,j}^2 = B_{j,j} - \sum_{h=1}^{i-1} R_{j,h}^2 \ge 0$. In step 3, the components of the $j$th ($j = i + 1, \ldots, N$) rows of the $i$th column are determined by dividing by $R_{i,i}$. Compared to the case where other values are selected as $R_{i,i}$ in step 1, the absolute value of $R_{j,i}$ after dividing by $R_{i,i}$ becomes smaller, and the $B_{j,j} - \sum_{h=1}^{i} R_{j,h}^2$ in the next step becomes larger for each $j$. If $R_{r,r}^2$ takes a negative value, then regardless of the selection order, there is no solution to the Cholesky decomposition, contradicting Proposition 47 (the uniqueness of the solution is also guaranteed). Even in the case of an incomplete Cholesky decomposition, we use the first $r$ columns obtained when running $r = N$.
We show the code for executing the incomplete Cholesky decomposition below:

def im_ch(A, m):
    # Note: the rows and columns of A are swapped in place
    n = A.shape[1]
    R = np.zeros((n, n))
    P = np.eye(n)
    for i in range(m):
        # Step 1: choose the pivot k with the largest residual diagonal
        max_R = -np.inf
        for j in range(i, n):
            RR = A[j, j]
            for h in range(i):
                RR = RR - R[j, h]**2
            if RR > max_R:
                k = j
                max_R = RR
        R[i, i] = np.sqrt(max_R)
        if k != i:
            for j in range(i):
                w = R[i, j]; R[i, j] = R[k, j]; R[k, j] = w
            for j in range(n):
                w = A[j, k]; A[j, k] = A[j, i]; A[j, i] = w
            for j in range(n):
                w = A[k, j]; A[k, j] = A[i, j]; A[i, j] = w
            Q = np.eye(n); Q[i, i] = 0; Q[k, k] = 0; Q[i, k] = 1; Q[k, i] = 1
            P = np.dot(P, Q)
        # Step 3: fill in the ith column below the diagonal
        if i < n:
            for j in range(i + 1, n):
                S = A[j, i]
                for h in range(i):
                    S = S - R[i, h] * R[j, h]
                R[j, i] = S / R[i, i]
    return np.dot(P, R)

# Data generation: make matrix A nonnegative definite
n = 5
D = np.matrix([[np.random.randint(-n, n) for i in range(n)] for j in range(n)])
A = np.dot(D, D.T)
A

matrix([[ 68,  11,   6, -14,  13],
        [ 11,  23,   0, -17,  11],
        [  6,   0,  24,  36, -16],
        [-14, -17,  36,  75, -24],
        [ 13,  11, -16, -24,  45]])

L = im_ch(A, 5)
np.dot(L, L.T)

array([[ 46.,  22.,  36.,   2.,  -7.],
       [ 22.,  31.,  14., -19.,  -8.],
       [ 36.,  14.,  41.,  -6., -20.],
       [  2., -19.,  -6.,  46.,  21.],
       [ -7.,  -8., -20.,  21.,  22.]])

## Incomplete Cholesky decomposition of rank three
L = im_ch(A, 3)
np.linalg.eig(A)

(array([1.01074272e+02, 5.66112516e+01, 2.41344415e+01, 9.62653508e-02,
        4.08376941e+00]),
 matrix([[-0.56411909, -0.45552615, -0.22499121, -0.46541646,  0.45500775],
         [ 0.30282894, -0.79305647,  0.04313883, -0.11440623, -0.51420456],
         [-0.40884563,  0.15490099, -0.68851469,  0.1243158 , -0.56510533],
         [ 0.31554644, -0.28590181, -0.51655213,  0.58928787,  0.45233208],
         [-0.56862992, -0.24046456,  0.45457608,  0.63842315, -0.06792108]]))

## A cannot be recovered
B = np.dot(L, L.T)
B

array([[ 46.        ,   2.        ,  22.        ,  -7.        ,  36.        ],
       [  2.        ,  46.        , -19.        ,  21.        ,  -6.        ],
       [ 22.        , -19.        ,  31.        ,  -8.        ,  14.        ],
       [ -7.        ,  21.        ,  -8.        ,  12.74957882, -11.52827918],
       [ 36.        ,  -6.        ,  14.        , -11.52827918,  33.00601685]])

# The first three eigenvalues of B are close to those of A.
np.linalg.eig(B)

(array([ 9.53379063e+01,  5.65627665e+01,  1.68549229e+01,  1.42156846e-14,
        -3.05830677e-15]),
 array([[-0.60391559, -0.44498737, -0.0412799 ,  0.6556516 ,  0.01706317],
        [ 0.31630878, -0.79905862, -0.14333816, -0.30429563, -0.42390544],
        [-0.44500967,  0.16578492, -0.79173017, -0.36203582, -0.1781387 ],
        [ 0.24801368, -0.27858603, -0.38477667,  0.11142099,  0.84426293],
        [-0.52506222, -0.24165419,  0.45040025, -0.57796243,  0.27477214]]))

# The rank of B is three.
np.linalg.matrix_rank(B)

3
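As a cross-check on the behavior shown above, a more compact variant can be sketched that tracks the residual diagonal and never swaps rows explicitly. This is not the procedure of this section verbatim: the name `pivoted_cholesky`, the tolerance `eps`, and the test matrix are assumptions, the stopping rule compares the squared pivot with `eps`, and the returned factor is a row-permuted lower triangular matrix that plays the role of $PR$.

```python
import numpy as np

def pivoted_cholesky(A, r, eps=1e-10):
    """Hypothetical helper: n x r factor L with L @ L.T approximating A."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    d = np.diag(A).copy()            # residual diagonal B_jj - sum_h R_jh^2
    L = np.zeros((n, r))
    for i in range(r):
        k = int(np.argmax(d))        # pivot: largest residual diagonal
        if d[k] < eps:               # stop early, in the spirit of step 2
            return L[:, :i]
        L[:, i] = (A[:, k] - L[:, :i] @ L[k, :i]) / np.sqrt(d[k])
        d = d - L[:, i] ** 2
    return L

rng = np.random.default_rng(0)
D = rng.integers(-5, 5, size=(5, 5)).astype(float)
A = D @ D.T + np.eye(5)              # positive definite test matrix (assumed)

L_full = pivoted_cholesky(A, 5)
L_3 = pivoted_cholesky(A, 3)
print(np.max(np.abs(L_full @ L_full.T - A)))   # full run: reconstruction error
print(np.linalg.matrix_rank(L_3))              # rank of L_3, hence of L_3 @ L_3.T
```

Running all $N$ steps should recover `A` up to rounding, while stopping after three steps gives a rank-three approximation, in line with the output of `im_ch` above. Unlike `im_ch`, this variant leaves its input unchanged.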

Appendix

Proof of Proposition 44
Since $r$ is a natural spline function whose highest order is $2q - 1$ and it satisfies

$$r^{(q)}(0) = \cdots = r^{(2q-1)}(0) = r^{(q)}(1) = \cdots = r^{(2q-1)}(1) = 0 ,$$

we have

$$\int_0^1 r^{(q)}(x) s^{(q)}(x)\,dx = \left[r^{(q)}(x) s^{(q-1)}(x)\right]_0^1 - \int_0^1 r^{(q+1)}(x) s^{(q-1)}(x)\,dx$$
$$= -\int_0^1 r^{(q+1)}(x) s^{(q-1)}(x)\,dx = \cdots = (-1)^{q-1} \int_0^1 r^{(2q-1)}(x) s'(x)\,dx$$
$$= (-1)^{q-1} \sum_{j=1}^{N-1} r^{(2q-1)}(x_j^+) \{s(x_{j+1}) - s(x_j)\} = 0 , \quad (4.36)$$

where we use $s(x_i) = 0$ for $i = 1, \ldots, N$. Moreover, $r^{(2q-1)}(x_j^+)$ is the $(2q - 1)$th right differential coefficient of $r$ at $x_j$, and it has a constant value during $x_j < x < x_{j+1}$. Thus, we have the following inequality in the proposition:

$$\int_0^1 \{g^{(q)}(x)\}^2\,dx = \int_0^1 \{r^{(q)}(x) + s^{(q)}(x)\}^2\,dx$$
$$= \int_0^1 \{r^{(q)}(x)\}^2\,dx + \int_0^1 \{s^{(q)}(x)\}^2\,dx + 2 \int_0^1 r^{(q)}(x) s^{(q)}(x)\,dx$$
$$= \int_0^1 \{r^{(q)}(x)\}^2\,dx + \int_0^1 \{s^{(q)}(x)\}^2\,dx \ge \int_0^1 \{r^{(q)}(x)\}^2\,dx , \quad (4.37)$$

where the third equality is due to (4.36). On the other hand, from $g, r \in W_q[0, 1]$, we have $s \in W_q[0, 1]$ and

$$s(x) = \sum_{i=0}^{q-1} \frac{s^{(i)}(0)}{i!} x^i + \int_0^1 \frac{(x - u)_+^{q-1}}{(q - 1)!} s^{(q)}(u)\,du .$$

Therefore, when the equality of (4.37) holds, i.e., $\int_0^1 \{s^{(q)}(x)\}^2\,dx = 0$, we have $s^{(q)}(x) = 0$ almost everywhere. Hence,

$$s(x) = \sum_{i=0}^{q-1} \frac{s^{(i)}(0)}{i!} x^i ,$$

which is a polynomial of order at most $q - 1$ that satisfies $s(x_i) = 0$ for $i = 1, 2, \ldots, N$. Thus, if $N$ exceeds the order $q - 1$ of the polynomial, then we require $s(x) = 0$ for $x \in [0, 1]$.

Proof of Proposition 45
From the addition theorem, we have

$$2 \cos(\omega^\top x + b) \cos(\omega^\top y + b) = \cos(\omega^\top (x - y)) + \cos(\omega^\top (x + y) + 2b) .$$

Since the expectation of the second term w.r.t. $b$ when fixing $\omega$ is zero, we have

$$E_{\omega,b}\left[\sqrt{2} \cos(\omega^\top x + b) \cdot \sqrt{2} \cos(\omega^\top y + b)\right] = E_\omega \cos(\omega^\top (x - y)) .$$

We apply Euler’s formula $e^{i\theta} = \cos\theta + i \sin\theta$ to Proposition 5. Since $k(x, y)$ takes a real value, we have $E[\sin(\omega^\top (x - y))] = 0$, and $k(x, y)$ can be written as

$$E_\omega \exp(i \omega^\top (x - y)) = E_\omega[\cos(\omega^\top (x - y)) + i \sin(\omega^\top (x - y))] = E_\omega[\cos(\omega^\top (x - y))] .$$

From Proposition 5, we obtain (4.28).

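Proof of Lemma 7
Let $\epsilon > 0$. Since $e^{\epsilon x}$ is convex w.r.t. $x$, if we take the expectation on both sides of

$$e^{\epsilon X} \le \frac{X - a}{b - a} e^{\epsilon b} + \frac{b - X}{b - a} e^{\epsilon a}$$

for $b > a$ and use $E[X] = 0$, then

$$E[e^{\epsilon X}] \le \frac{-a}{b - a} e^{\epsilon b} + \frac{b}{b - a} e^{\epsilon a} = \theta e^{\epsilon (1 - \theta)(b - a)} + (1 - \theta) e^{-\epsilon \theta (b - a)} = \exp\{-\theta s + \log(1 - \theta + \theta e^s)\}$$

for $s = \epsilon (b - a)$ and $\theta = \dfrac{-a}{b - a}$. Therefore, it is sufficient for the exponent $f(s) := -\theta s + \log(1 - \theta + \theta e^s)$ to be at most $s^2 / 8$. Since

$$f'(s) = -\theta + \frac{\theta e^s}{1 - \theta + \theta e^s}$$

and $f(0) = f'(0) = 0$, we have

$$f''(s) = \frac{(1 - \theta) \cdot \theta e^s}{(1 - \theta + \theta e^s)^2} = \phi (1 - \phi) \le \frac{1}{4}$$

for $\phi = \dfrac{\theta e^s}{1 - \theta + \theta e^s}$. Hence, a $\mu \in \mathbb{R}$ exists such that

$$f(s) = f(0) + f'(0)(s - 0) + \frac{1}{2} f''(\mu)(s - 0)^2 \le \frac{s^2}{8} ,$$

which implies (4.32).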
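The key inequality $f(s) \le s^2 / 8$ can be checked numerically on a grid. This is only a sanity check under assumed grid ranges for $s$ and $\theta$, not a substitute for the proof.

```python
import numpy as np

def f(s, theta):
    # Exponent from the proof: f(s) = -theta * s + log(1 - theta + theta * e^s)
    return -theta * s + np.log(1.0 - theta + theta * np.exp(s))

s = np.linspace(0.0, 10.0, 201)[:, None]       # s = eps * (b - a) >= 0
theta = np.linspace(0.01, 0.99, 99)[None, :]   # theta = -a / (b - a) in (0, 1)
gap = s ** 2 / 8.0 - f(s, theta)               # should be nonnegative up to rounding
print(gap.min())
```

The smallest gap occurs at $s = 0$, where both sides vanish, so the printed value should be zero up to rounding error.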

Exercises 46∼64

46. Let $k$ be a kernel and $(x_1, y_1), \ldots, (x_N, y_N)$ be samples, and let $f(\cdot) := \sum_{i=1}^N \alpha_i k(x_i, \cdot)$. If we minimize $\sum_{i=1}^N \{y_i - f(x_i)\}^2 + \lambda \|f\|^2$, $\lambda > 0$ (kernel ridge regression), why does this mean that we have minimized over $f \in H$? In addition, express the optimal value of $\alpha = [\alpha_1, \ldots, \alpha_N]^\top$ using the Gram matrix $K \in \mathbb{R}^{N \times N}$ and $y = [y_1, \ldots, y_N]^\top$.
47. In kernel PCA, let $k$ be a kernel and $x_1, \ldots, x_N$ be samples, and let $f(\cdot) := \sum_{i=1}^N \alpha_i k(x_i, \cdot)$. If we maximize (4.8), why does this mean that we have maximized it over $f \in H$? Additionally, express the eigenequations obtained when $\beta = K^{1/2} \alpha$ by using the Gram matrix $K \in \mathbb{R}^{N \times N}$.
48. In kernel PCA, we wish to find $\alpha$ for a centered Gram matrix, as in (4.9). Complete the function kernel_pca_train by filling in the space below.

Based on the $\alpha$ obtained from the data $X$, kernel $k$, and function kernel_pca_train, we wish to calculate the score of $z \in \mathbb{R}^{N \times p}$ (any of the $x_1, \ldots, x_N$) (up to $1 \le m \le p$ dimensions). Complete the function below:

def kernel_pca_test(x, k, alpha, m, z):
    n = x.shape[0]
    pca = np.zeros(m)
    for i in range(n):
        pca = pca + alpha[i, 0:m] * k(x[i, :], z)
    return pca

Check whether the constructed function works with the following program:

import pandas as pd

sigma2 = 0.01
def k(x, y):
    return np.exp(-np.linalg.norm(x - y)**2 / 2 / sigma2)
X = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/USArrests.csv')
x = X.values[:, :-1]
n = x.shape[0]; p = x.shape[1]
alpha = kernel_pca_train(x, k)
z = np.zeros((n, 2))
for i in range(n):
    z[i, :] = kernel_pca_test(x, k, alpha, 2, x[i, :])

min1 = np.min(z[:, 0]); min2 = np.min(z[:, 1])
max1 = np.max(z[:, 0]); max2 = np.max(z[:, 1])
plt.xlim(min1, max1)
plt.ylim(min2, max2)
plt.xlabel("First")
plt.ylabel("Second")
plt.title("Kernel PCA (Gauss 0.01)")
for i in range(n):
    if i != 4:
        plt.text(x=z[i, 0], y=z[i, 1], s=i)
plt.text(z[4, 0], z[4, 1], 5, c="r")

49. Show that the ordinary PCA and kernel PCA with a linear kernel output the same
score.
50. Derive the KKT condition for the kernel SVM (4.12).
51. In Example 66, instead of linear and polynomial kernels, use a Gaussian kernel
with different values of σ 2 (three different types), and draw the boundary curve
in the same graph.
52. From (4.21) and (4.22), derive $\sum_{j=1}^J \beta_{j+4} = 0$ and $\sum_{j=1}^J \beta_{j+4} \xi_j = 0$.
53. Prove Proposition 44 according to the following steps:
(a) Show that $\int_0^1 r^{(q)}(x) s^{(q)}(x)\,dx = 0$.
(b) Show that $\int_0^1 \{g^{(q)}(x)\}^2\,dx \ge \int_0^1 \{r^{(q)}(x)\}^2\,dx$.
(c) When the equality in (b) holds, show that $s(x) = \sum_{i=0}^{q-1} \dfrac{s^{(i)}(0)}{i!} x^i$.
(d) Show that the function $s$ is zero when the equality in (b) holds and $N$ exceeds the degree $q - 1$ of the polynomial.
54. In RFF, instead of finding the kernel k(x, y), we find its unbiased estimator
k̂(x, y). Show that the average of k̂(x, y) is k(x, y). Moreover, construct a func-
tion that outputs k̂(x, y) from (x, y) ∈ E for m = 100 by using the constants
and functions in the program below. Furthermore, compare the result with the
value output by the Gaussian kernel and confirm that it is correct.
sigma = 10
sigma2 = sigma**2
def z(x):
    return np.sqrt(2 / m) * np.cos(w * x + b)
def zz(x, y):
    return np.sum(z(x) * z(y))

55. Derive the Chernoff bound.

56. Show that Proposition 46 implies (4.29).
57. The RFF are based on Bochner’s theorem (Proposition 5). What relationship
exists between them?
58. In RFF, after randomly generating $(\omega_1, b_1), \ldots, (\omega_m, b_m)$, we obtain $Z = (z_j(x_i)) \in \mathbb{R}^{N \times m}$ for $i = 1, \ldots, N$ and $j = 1, \ldots, m$. If we use $\hat{K} = Z Z^\top$ rather than $K = (k(x_i, x_j)) \in \mathbb{R}^{N \times N}$, show that $\hat{f}(x) = \sum_{i=1}^N \hat{\alpha}_i \hat{k}(x, x_i)$ ($x \in E$) can be expressed by $\hat{f}(x) = z(x)\hat{\beta}$ using $\hat{\beta}$ in (4.33). Moreover, prove Woodbury’s formula:

$$U(I_s + VU) = (I_r + UV)U$$

for $U \in \mathbb{R}^{r \times s}$, $V \in \mathbb{R}^{s \times r}$, $r, s \ge 1$.
59. Evaluate the number of computations required to obtain (4.33) for the RFF. In
addition, evaluate the computational complexity of finding fˆ(x) for the new
x ∈ E.
60. To find the coefficient estimates $(K + \lambda I)^{-1} y$ in kernel ridge regression, we
wish to decompose the low-rank matrix $K = R R^\top$ with $R \in \mathbb{R}^{N \times m}$. If we can
decompose $K = R R^\top$, evaluate the computations on the left- and right-hand
sides, where we assume that finding the inverse of the matrix $A \in \mathbb{R}^{n \times n}$ takes
$O(n^3)$.
61. We wish to find the coefficient α̂ of the kernel ridge regression by using the
Nyström approximation. If we use the left-hand side of (4.34) instead of the
right-hand side, what changes would be necessary in the following code?

def alpha_m(K, x, y, m):
    # K is the N x N Gram matrix; lam (the ridge parameter lambda) is assumed to be defined globally
    n = len(x)
    U, D, _ = np.linalg.svd(K[:m, :m])
    u = np.zeros((n, m))
    for i in range(m):
        for j in range(n):
            u[j, i] = np.sqrt(m / n) * np.sum(K[j, :m] * U[:m, i]) / D[i]
    mu = D * n / m
    R = np.zeros((n, m))
    for i in range(m):
        R[:, i] = np.sqrt(mu[i]) * u[:, i]
    # right-hand side of (4.34): only an m x m matrix is inverted
    alpha = (np.identity(n)
             - R @ np.linalg.inv(R.T @ R + lam * np.identity(m)) @ R.T) @ y / lam
    return alpha

62. In Step 1 of the procedure for the incomplete Cholesky decomposition, each
time we choose the j ($i \le j \le N$) that maximizes $R_{j,j}^2 = B_{j,j} - \sum_{h=1}^{i-1} R_{j,h}^2$ as
k. Show that $B_{k,k} - \sum_{h=1}^{i-1} R_{k,h}^2$ in Step 1(d) is nonnegative.
63. Show that when the incomplete Cholesky decomposition process is completed
up to the r-th column, we have
$$B_{ji} = \sum_{h=1}^{i} R_{jh} R_{ih}$$
for each $i = 1, \ldots, r$ and $j = i + 1, \ldots, N$.

64. Generate a nonnegative definite matrix of size 5 × 5 and run im_ch to perform
the incomplete Cholesky decomposition of rank three.
Chapter 5
The MMD and HSIC

In this chapter, we introduce the concept of random variables X : E → R in an RKHS

and discuss testing problems in RKHSs. In particular, we define a statistic and its
null hypothesis for the two-sample problem and the corresponding independence
test. We do not know the distribution according to the null hypothesis under a finite
sample in either case. Therefore, we introduce a permutation test and a U-statistic
with which we construct the process and run the program. Then, we study the notions
of characteristic and universal kernels to learn what kernels are valid for such tests.
Finally, we learn about empirical processes, which are often used in the mathematical
analyses of machine learning and deep learning methods.

5.1 Random Variables in RKHSs

In Chap. 1, we proved that a function X : E → R that takes values in R is measurable

if {ω ∈ E|X (ω) ∈ B} for any Borel set B is an event (element) in F, and we call
such an X a random variable.
In the following, we say that a kernel k is measurable if the set of (x, y) such that
k(x, y) ∈ B is an event in E × E, and we assume that any kernel k is measurable.
Moreover, in this chapter, the expectation $E[k(X, X)]$ of $k(x, x) \in \mathbb{R}$, $x \in E$, is
bounded, which means that both sides of $E[\sqrt{k(X, X)}] \le \sqrt{E[k(X, X)]}$ are bounded.
Proposition 48 Let $k : E \times E \to \mathbb{R}$ be measurable. Then, the map $E \ni x \mapsto k(x, \cdot) \in H$ is measurable. Thus, $k(X, \cdot)$ is a random variable in H for any random
variable X that takes values in E.
Proof: See the appendix at the end of this chapter.
Let X : E → R be a random variable. The linear functional T : H → R with

$$T(f) := E[f(X)] = E[\langle f(\cdot), k(X, \cdot)\rangle_H] \le E[\|f\|_H \sqrt{k(X, X)}] \le \|f\|_H\, E[\sqrt{k(X, X)}]$$
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 129
J. Suzuki, Kernel Methods for Machine Learning with Math and Python,
https://doi.org/10.1007/978-981-19-0401-1_5

satisfies $\frac{T(f)}{\|f\|_H} \le E[\sqrt{k(X, X)}] < \infty$. From Proposition 22, there exists an $m_X \in H$ such that
$$E[f(X)] = \langle f(\cdot), m_X(\cdot)\rangle_H$$

for any f ∈ H . We call such an m X the expectation of k(X, ·), and we write m X (·) =
E[k(X, ·)]. Then, we have

$$E[\langle f(\cdot), k(X, \cdot)\rangle_H] = \langle f(\cdot), E[k(X, \cdot)]\rangle_H ,$$

which means that we can change the order of the inner product and expectation
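operations. Let $E_X, E_Y$ be sets. We define the tensor product $H_0$ of RKHSs $H_X$ and $H_Y$ with reproducing kernels $k_X : E_X \times E_X \to \mathbb{R}$ and $k_Y : E_Y \times E_Y \to \mathbb{R}$, respectively, as the set of functions $E_X \times E_Y \to \mathbb{R}$ of the form $f(x, y) = \sum_{i=1}^{m} f_{X,i}(x) f_{Y,i}(y)$, $f_{X,i} \in H_X$, $f_{Y,i} \in H_Y$ for $(x, y) \in E_X \times E_Y$, and we define the inner product and norm by

$$\langle f, g\rangle_{H_0} = \sum_{i=1}^{m}\sum_{j=1}^{n} \langle f_{X,i}, g_{X,j}\rangle_{H_X} \langle f_{Y,i}, g_{Y,j}\rangle_{H_Y}$$

and $\|f\|_{H_0}^2 = \langle f, f\rangle_{H_0}$, respectively, for $f = \sum_{i=1}^{m} f_{X,i} f_{Y,i}$, $f_{X,i} \in H_X$, $f_{Y,i} \in H_Y$ and $g = \sum_{j=1}^{n} g_{X,j} g_{Y,j}$, $g_{X,j} \in H_X$, $g_{Y,j} \in H_Y$. In fact, we have

$$\langle f, g\rangle_{H_0} = \sum_{i=1}^{m}\sum_{j=1}^{n} \sum_{r}\sum_{t} \alpha_{i,r}\gamma_{j,t} k_X(x_r, x_t) \sum_{s}\sum_{u} \beta_{i,s}\delta_{j,u} k_Y(y_s, y_u)$$
$$= \sum_{i=1}^{m}\sum_{r}\sum_{s} \alpha_{i,r}\beta_{i,s}\, g(x_r, y_s) = \sum_{j=1}^{n}\sum_{t}\sum_{u} \gamma_{j,t}\delta_{j,u}\, f(x_t, y_u)$$

for $f_{X,i}(\cdot) = \sum_r \alpha_{i,r} k_X(x_r, \cdot)$, $f_{Y,i}(\cdot) = \sum_s \beta_{i,s} k_Y(y_s, \cdot)$, $g_{X,j}(\cdot) = \sum_t \gamma_{j,t} k_X(x_t, \cdot)$, and $g_{Y,j}(\cdot) = \sum_u \delta_{j,u} k_Y(y_u, \cdot)$, which means that the value of the inner product does not depend on the expressions chosen for f, g.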
If we complete $H_0$, we can construct a linear space H consisting of the functions $f = \sum_{i=1}^{\infty}\sum_{j=1}^{\infty} a_{i,j}\, e_{X,i}\, e_{Y,j}$ such that $\|f\|^2 := \sum_{i=1}^{\infty}\sum_{j=1}^{\infty} a_{i,j}^2 < \infty$, and the inner product is $\langle f, g\rangle_H = \sum_{i=1}^{\infty}\sum_{j=1}^{\infty} a_{i,j} b_{i,j}$, where $g = \sum_{i=1}^{\infty}\sum_{j=1}^{\infty} b_{i,j}\, e_{X,i}\, e_{Y,j}$ ($\sum_{i=1}^{\infty}\sum_{j=1}^{\infty} b_{i,j}^2 < \infty$) and $\{e_{X,i}\}$, $\{e_{Y,j}\}$ are orthonormal bases of $H_X$, $H_Y$, respectively. Then, $H_0$ is a dense subspace in H, and H is a Hilbert space. We say that H is the direct product of $H_X$, $H_Y$ and write $H_X \otimes H_Y$. H is the set of functions f such that $f(x) := \lim_{n\to\infty} f_n(x)$ for any Cauchy sequence $\{f_n\}$ in $H_0$ and $x \in E$.
The claim follows from a discussion similar to that in Steps 1-5 of Proposition 34.

Proposition 49 (Neveu [22]) The direct product H X ⊗ HY of RKHSs H X , HY with

reproducing kernels k X , kY is an RKHS with a reproducing kernel k X kY .

Proof: The derivation utilizes the following steps [1].

1. Show that $|g(x, y)| \le \sqrt{k_X(x, x)}\sqrt{k_Y(y, y)}\,\|g\|$ for $g \in H_X \otimes H_Y$ and $x \in E_X$, $y \in E_Y$, which means that H is an RKHS due to Proposition 33.
2. Show that $k(x, \cdot, y, \star) := k_X(x, \cdot)\, k_Y(y, \star) \in H$ when we fix $x \in E_X$, $y \in E_Y$.
3. Show that $g(x, y) = \langle g(\cdot, \star), k(x, \cdot, y, \star)\rangle_H$.
For details, consult the proof at the end of this chapter.
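As a small numerical illustration of Proposition 49 (a sketch, not part of the proof), the Gram matrix of the product kernel $k_X k_Y$ on a finite grid of pairs coincides with the Kronecker product of the two Gram matrices and is therefore again nonnegative definite. The Gaussian kernels, the sample points, and the helper name gauss_gram are assumptions made only for this illustration.

```python
import numpy as np

def gauss_gram(u):
    """Gram matrix of the Gaussian kernel exp(-(a - b)^2 / 2) on the points u (illustrative helper)."""
    d = u[:, None] - u[None, :]
    return np.exp(-d**2 / 2)

rng = np.random.default_rng(0)
xs = rng.standard_normal(4)   # points in E_X (assumed for the example)
ys = rng.standard_normal(3)   # points in E_Y (assumed for the example)
K_X = gauss_gram(xs)
K_Y = gauss_gram(ys)

# Gram matrix of k((x, y), (x', y')) = k_X(x, x') k_Y(y, y') on the grid {(x_i, y_j)}
pairs = [(a, b) for a in xs for b in ys]
K = np.array([[np.exp(-(p[0] - q[0])**2 / 2) * np.exp(-(p[1] - q[1])**2 / 2)
               for q in pairs] for p in pairs])

kron_matches = np.allclose(K, np.kron(K_X, K_Y))   # product kernel <-> Kronecker product
min_eig = np.linalg.eigvalsh(K).min()              # nonnegative up to rounding error
print(kron_matches, min_eig >= -1e-10)
```

The ordering of pairs (x varying slowest) is chosen so that it matches the block structure of np.kron.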
Then, we introduce the notion of expectation w.r.t. the variables X, Y . If we assume
that E[k X (X, X )] and E[kY (Y, Y )] are finite, then E X Y [k X (X, ·)kY (Y, ·)] is obtained
by taking the expectation of k X (x, ·)kY (y, ·) ∈ H X ⊗ HY w.r.t. X Y :

$$E_{XY}[\|k_X(X, \cdot) k_Y(Y, \cdot)\|_{H_X \otimes H_Y}] = E_{XY}[\|k_X(X, \cdot)\|_{H_X} \|k_Y(Y, \cdot)\|_{H_Y}]$$
$$= E_{XY}[\sqrt{k_X(X, X)\, k_Y(Y, Y)}] \le \sqrt{E_X[k_X(X, X)]\, E_Y[k_Y(Y, Y)]} .$$

Thus, the left-hand side takes a finite value, and we have

$$E_{XY}[f(X, Y)] = E_{XY}[\langle f, k_X(X, \cdot) k_Y(Y, \cdot)\rangle] \le \|f\|_{H_X \otimes H_Y}\, E_{XY}[\|k_X(X, \cdot) k_Y(Y, \cdot)\|_{H_X \otimes H_Y}]$$

for f ∈ H X ⊗ HY . From Proposition 22 (Riesz’s representation theorem), there

exists an m X Y ∈ H X ⊗ HY such that

$$E_{XY}[f(X, Y)] = \langle f, m_{XY}\rangle ,$$

and we write
m X Y := E X Y [k X (X, ·)kY (Y, ·)] ,

which means that we can change the order of the inner product and expectation
operations:

$$E_{XY}[\langle f, k_X(X, \cdot) k_Y(Y, \cdot)\rangle] = \langle f, E_{XY}[k_X(X, \cdot) k_Y(Y, \cdot)]\rangle .$$

Moreover, for the $m_X, m_Y$ of X, Y, the product $m_X m_Y$ belongs to $H_X \otimes H_Y$, and we have

$$\langle fg, m_X m_Y\rangle_{H_X \otimes H_Y} = \langle f, m_X\rangle_{H_X} \langle g, m_Y\rangle_{H_Y} = E_X[f(X)]\, E_Y[g(Y)]$$

for $f \in H_X$, $g \in H_Y$, which means that we multiply the expectations of X, Y even if they are not independent. Thus, we call

$$m_{XY} - m_X m_Y$$

the covariance of (X, Y) in $H_X \otimes H_Y$, which belongs to $H_X \otimes H_Y$.

Proposition 50 There exist $\Sigma_{XY} \in B(H_Y, H_X)$ and $\Sigma_{YX} \in B(H_X, H_Y)$ such that, for each $f \in H_X$, $g \in H_Y$,

$$\langle fg, m_{XY} - m_X m_Y\rangle_{H_X \otimes H_Y} = \langle \Sigma_{YX} f, g\rangle_{H_Y} = \langle f, \Sigma_{XY} g\rangle_{H_X} . \quad (5.1)$$

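Proof: The operators $\Sigma_{YX}$, $\Sigma_{XY}$ are adjoints of each other, and from Proposition 22, if one exists, so does the other. We prove the existence of $\Sigma_{XY}$. The linear functional
$$T_g : H_X \ni f \mapsto \langle fg, m_{XY} - m_X m_Y\rangle_{H_X \otimes H_Y} \in \mathbb{R}$$
for an arbitrary $g \in H_Y$ is bounded from
$$\langle fg, m_{XY} - m_X m_Y\rangle_{H_X \otimes H_Y} \le \|f\|_{H_X} \|g\|_{H_Y} \|m_{XY} - m_X m_Y\|_{H_X \otimes H_Y} ,$$
and there exists an $h_g \in H_X$ such that $T_g f = \langle f, h_g\rangle_{H_X}$ from Proposition 22. Thus, there exists $\Sigma_{XY} : H_Y \ni g \mapsto h_g \in H_X$ such that
$$\langle fg, m_{XY} - m_X m_Y\rangle_{H_X \otimes H_Y} = \langle f, \Sigma_{XY} g\rangle_{H_X} .$$
The boundedness of $\Sigma_{XY}$ is due to
$$\|\Sigma_{XY} g\|_{H_X} = \|h_g\|_{H_X} = \|T_g\| \le \|g\|_{H_Y} \|m_{XY} - m_X m_Y\|_{H_X \otimes H_Y} .$$
We call $\Sigma_{XY}$, $\Sigma_{YX}$ the mutual covariance operators.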
Let H and k be an RKHS and its reproducing kernel, respectively, and let $\mathcal{P}$ be the set of distributions that X follows. Then, we can define the map

$$\mathcal{P} \ni \mu \mapsto \int k(x, \cdot)\, d\mu(x) \in H ,$$

which we call the embedding of probabilities in the RKHS.

Suppose that the map is injective, i.e., if the expectations $\int k(x, \cdot)\, d\mu_1(x)$ and $\int k(x, \cdot)\, d\mu_2(x)$ have the same value, then the probabilities $\mu_1, \mu_2$ coincide. We call such a reproducing kernel k of an RKHS H characteristic.
We learn some applications by using characteristic kernels, such as two-sample
problems and independence tests, and we consider the associated theory in later
sections of this chapter.
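The embedding can be made concrete by replacing $\mu$ with the empirical distribution of a sample, which gives $\hat{m}(\cdot) = \frac{1}{N}\sum_{i} k(x_i, \cdot)$. The following sketch checks that $\langle f, \hat{m}\rangle_H$ equals the sample mean of $f(x_i)$ for a function $f = \sum_j \alpha_j k(c_j, \cdot)$. The Gaussian kernel, the centers $c_j$, the coefficients $\alpha_j$, and the names f and m_hat are assumptions chosen for illustration only.

```python
import numpy as np

def k(x, y):
    # Gaussian kernel with sigma^2 = 1 (assumed for this example)
    return np.exp(-(x - y)**2 / 2)

rng = np.random.default_rng(1)
x = rng.standard_normal(200)            # sample from the distribution of X
centers = np.array([-1.0, 0.0, 2.0])    # f = sum_j alpha_j k(c_j, .)
alpha = np.array([0.5, -1.0, 2.0])

def f(t):
    return sum(a * k(c, t) for a, c in zip(alpha, centers))

def m_hat(t):
    """Empirical embedding (1/N) sum_i k(x_i, t)."""
    return np.mean(k(x, t))

lhs = np.mean(f(x))                                        # sample mean of f(x_i)
rhs = sum(a * m_hat(c) for a, c in zip(alpha, centers))    # <f, m_hat>_H via the reproducing property
print(lhs, rhs)
```

Both numbers agree up to rounding, which is the sample version of $E[f(X)] = \langle f, m_X\rangle_H$.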

5.2 The MMD and Two-Sample Problem

Gretton et al. (2008) [11] proposed a statistical test of whether two distributions coincide, given independent sequences $x_1, \ldots, x_m \in \mathbb{R}$ and $y_1, \ldots, y_n \in \mathbb{R}$. We write the two distributions as P, Q and regard P = Q as the null hypothesis. Let H and k be an RKHS and its reproducing kernel, respectively; we define $m_P := E_P[k(X, \cdot)] = \int_E k(x, \cdot)\, dP(x)$ and $m_Q := E_Q[k(X, \cdot)] = \int_E k(x, \cdot)\, dQ(x) \in H$. We note that the random variable $X : E \to \mathbb{R}$ is measurable, and either P or Q is the probability distribution that X follows.
Let F be a set of functions that satisfies a condition. In general, the quantity
defined by
$$\sup_{f \in \mathcal{F}} \{E_P[f(X)] - E_Q[f(X)]\}$$

is called the MMD (maximum mean discrepancy), and we assume that

$$\mathcal{F} := \{f \in H \mid \|f\|_H \le 1\} ,$$

which means that we regard the MMD as

$$\mathrm{MMD}^2 = \sup_{f \in \mathcal{F}} \{E_P[f(X)] - E_Q[f(X)]\}^2 = \sup_{f \in \mathcal{F}} \{\langle m_P, f\rangle - \langle m_Q, f\rangle\}^2 = \sup_{f \in \mathcal{F}} \{\langle m_P - m_Q, f\rangle\}^2 = \|m_P - m_Q\|_H^2 .$$

If the kernel k is characteristic, then we have

MMD = 0 ⇐⇒ m P = m Q ⇐⇒ P = Q (5.2)

and

$$\mathrm{MMD}^2 = \langle m_P, m_P\rangle + \langle m_Q, m_Q\rangle - 2\langle m_P, m_Q\rangle$$
$$= \langle E_X[k(X, \cdot)], E_{X'}[k(X', \cdot)]\rangle + \langle E_Y[k(Y, \cdot)], E_{Y'}[k(Y', \cdot)]\rangle - 2\langle E_X[k(X, \cdot)], E_Y[k(Y, \cdot)]\rangle$$
$$= E_{XX'}[k(X, X')] + E_{YY'}[k(Y, Y')] - 2E_{XY}[k(X, Y)] ,$$

where X and X' (Y and Y') are independent and follow the same distribution. However, we do not know $m_P, m_Q$ from the two-sample data. Thus, we execute the test using their estimates:

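$$\mathrm{MMD}_B^2 := \frac{1}{m^2}\sum_{i=1}^{m}\sum_{j=1}^{m} k(x_i, x_j) + \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n} k(y_i, y_j) - \frac{2}{mn}\sum_{i=1}^{m}\sum_{j=1}^{n} k(x_i, y_j) \quad (5.3)$$

$$\frac{1}{m(m-1)}\sum_{i=1}^{m}\sum_{j \ne i} k(x_i, x_j) + \frac{1}{n(n-1)}\sum_{i=1}^{n}\sum_{j \ne i} k(y_i, y_j) - \frac{2}{mn}\sum_{i=1}^{m}\sum_{j=1}^{n} k(x_i, y_j) . \quad (5.4)$$

Then, the estimate (5.4) is unbiased while (5.3) is biased:

$$E\Big[\frac{1}{m(m-1)}\sum_{i=1}^{m}\sum_{j \ne i} k(X_i, X_j)\Big] = \frac{1}{m}\sum_{i=1}^{m} E_{X_i}\Big[\frac{1}{m-1}\sum_{j \ne i} E_{X_j}[k(X_i, X_j)]\Big] = E_{XX'}[k(X, X')] .$$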
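The two estimators (5.3) and (5.4) differ only in whether the diagonal terms $i = j$ are included. A minimal vectorized sketch follows; the function names gram, mmd2_biased, and mmd2_unbiased are hypothetical, and the kernel $\exp(-(x - y)^2/\sigma^2)$ with $\sigma = 1$ is the one used in the programs of this section.

```python
import numpy as np

def gram(a, b, sigma=1.0):
    """Matrix of k(a_i, b_j) = exp(-(a_i - b_j)^2 / sigma^2)."""
    return np.exp(-(a[:, None] - b[None, :])**2 / sigma**2)

def mmd2_biased(x, y):
    # Estimator (5.3): all pairs, including i = j
    m, n = len(x), len(y)
    return (gram(x, x).sum() / m**2 + gram(y, y).sum() / n**2
            - 2 * gram(x, y).sum() / (m * n))

def mmd2_unbiased(x, y):
    # Estimator (5.4): the i = j terms are removed from the first two sums
    m, n = len(x), len(y)
    Kxx, Kyy = gram(x, x), gram(y, y)
    sxx = Kxx.sum() - np.trace(Kxx)
    syy = Kyy.sum() - np.trace(Kyy)
    return (sxx / (m * (m - 1)) + syy / (n * (n - 1))
            - 2 * gram(x, y).sum() / (m * n))

rng = np.random.default_rng(0)
x = rng.standard_normal(100)
y = 2 * rng.standard_normal(120)    # different standard deviation
print(mmd2_biased(x, y), mmd2_unbiased(x, y))
```

The biased version is the squared RKHS distance between the two empirical embeddings, so it is never negative, whereas the unbiased version can be.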
[Figure 5.1: two density plots. Left panel: "The Same Dist. (Permutation)"; right panel: "Different Dist. (Permutation)". Horizontal axis: $\mathrm{MMD}_U^2$; vertical axis: Density.]

Fig. 5.1 Permutation test for the two-sample problem. The distributions of X, Y are the same (left)
and different (right). The blue and red dotted lines show the statistics and the borders of the rejection
region, respectively.

However, similar to the HSIC in the next section, we do not know the distribution
of the MMD estimate under P = Q. We consider executing one of the following
processes.
1. Construct a histogram of the MMD estimate values obtained by randomly exchanging the
values of $x_1, \ldots, x_m$ and $y_1, \ldots, y_n$ (permutation test).
2. Compute an asymptotic distribution from the distribution of U statistics.
For the former, for example, we may construct the following procedure.

Example 71 We perform a permutation test on two sets of 100 samples that follow the standard Gaussian distribution (Fig. 5.1 Left). For the unbiased estimator of $\mathrm{MMD}^2$, we use $\mathrm{MMD}_U^2$ in (5.6) instead of (5.4) for a later comparison. We also double the standard deviation of one set of samples and perform the permutation test again (Fig. 5.1 Right). The reason why $\mathrm{MMD}_U^2$ also takes negative values is that when the true value of the $\mathrm{MMD}^2$ is close to zero, the value can also be negative since it is an unbiased estimator.

# In this chapter, we assume that the following has been executed.
import numpy as np
from scipy.stats import kde
import itertools
import math
import matplotlib.pyplot as plt
from matplotlib import style
style.use("seaborn-v0_8-ticks")  # this style was named "seaborn-ticks" before Matplotlib 3.6

sigma = 1
def k(x, y):
    return np.exp(-(x - y)**2 / sigma**2)
# Data Generation
n = 100
xx = np.random.randn(n)
yy = np.random.randn(n)        # The distributions are equal
# yy = 2 * np.random.randn(n)  # The distributions are not equal
x = xx; y = yy
# Distribution of the null hypothesis
T = []
for h in range(100):
    index1 = np.random.choice(n, size=int(n / 2), replace=False)
    index2 = [x for x in range(n) if x not in index1]
    x = list(xx[index2]) + list(yy[index1])
    y = list(xx[index1]) + list(yy[index2])
    S = 0
    for i in range(n):
        for j in range(n):
            if i != j:
                S = S + k(x[i], x[j]) + k(y[i], y[j]) \
                    - k(x[i], y[j]) - k(x[j], y[i])
    T.append(S / n / (n - 1))
v = np.quantile(T, 0.95)
# Statistics (computed from the original samples, not from the last permutation)
x = xx; y = yy
S = 0
for i in range(n):
    for j in range(n):
        if i != j:
            S = S + k(x[i], x[j]) + k(y[i], y[j]) \
                - k(x[i], y[j]) - k(x[j], y[i])
u = S / n / (n - 1)
# Display of the graph
x = np.linspace(min(min(T), u, v), max(max(T), u, v), 200)
density = kde.gaussian_kde(T)
plt.plot(x, density(x))
plt.axvline(x=u, c="r", linestyle="--")
plt.axvline(x=v, c="b")

For the latter approach, we construct the following quantities. For a function $h : E^m \to \mathbb{R}$ that is symmetric in its $m \ge 1$ variables, we call the quantity

$$U_N := \frac{1}{\binom{N}{m}} \sum_{1 \le i_1 < \cdots < i_m \le N} h(x_{i_1}, \ldots, x_{i_m}) \quad (5.5)$$

the U-statistic w.r.t. h of order m, where the sum ranges over the $\binom{N}{m}$ tuples $(i_1, \ldots, i_m) \in \{1, \ldots, N\}^m$ with $i_1 < \cdots < i_m$. We use this quantity for estimating the expectation $E[h(X_1, \ldots, X_m)]$ given samples $x_1, \ldots, x_N$. Note that any U-statistic is unbiased. In fact, we have

$$E\Big[\frac{1}{\binom{N}{m}} \sum_{i_1 < \cdots < i_m} h(X_{i_1}, \ldots, X_{i_m})\Big] = \frac{1}{\binom{N}{m}} \sum_{i_1 < \cdots < i_m} E\,h(X_{i_1}, \ldots, X_{i_m}) = E\,h(X_1, \ldots, X_m) .$$

We call the quantity

$$V_N := \frac{1}{N^m} \sum_{i_1=1}^{N} \cdots \sum_{i_m=1}^{N} h(x_{i_1}, \ldots, x_{i_m})$$

the V-statistic w.r.t. h.
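A familiar instance of the two definitions, given here only as an illustration, is the order-2 function $h(a, b) = (a - b)^2/2$, whose expectation is the variance of X: the U-statistic is then the unbiased sample variance and the V-statistic is the biased one. The function names u_statistic and v_statistic are hypothetical.

```python
import itertools
import numpy as np

def h(a, b):
    return (a - b)**2 / 2      # symmetric function of order m = 2 with E[h] = Var(X)

def u_statistic(x):
    pairs = list(itertools.combinations(range(len(x)), 2))   # all i_1 < i_2
    return sum(h(x[i], x[j]) for i, j in pairs) / len(pairs)

def v_statistic(x):
    N = len(x)
    return sum(h(x[i], x[j]) for i in range(N) for j in range(N)) / N**2

rng = np.random.default_rng(0)
x = rng.standard_normal(30)
print(u_statistic(x), np.var(x, ddof=1))   # U-statistic vs. unbiased sample variance
print(v_statistic(x), np.var(x, ddof=0))   # V-statistic vs. biased sample variance
```

The V-statistic includes the diagonal terms $i_1 = i_2$, which is the source of its bias.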

In the following, we conduct a statistical test with the null hypothesis that X, Y
are identically distributed when m = n. Under this null hypothesis, the operations
of taking the means of E X [·] and EY [·] have the same meaning.
We can define an unbiased estimator of $\mathrm{MMD}^2$ in addition to (5.4). In the following, we consider the unbiased estimator

$$\mathrm{MMD}_U^2 = \frac{1}{n(n-1)} \sum_{i \ne j} h(z_i, z_j)$$

for
$$h(z_i, z_j) := k(x_i, x_j) + k(y_i, y_j) - k(x_i, y_j) - k(x_j, y_i) \quad (5.6)$$

with $z_i = (x_i, y_i)$.
We define

$$h_c(z_1, \ldots, z_c) := E_{Z_{c+1} \cdots Z_m}\, h(z_1, \ldots, z_c, Z_{c+1}, \ldots, Z_m),$$

which is obtained by taking the expectation of the function h of the U-statistic (5.5) over $Z_{c+1}, \ldots, Z_m$ for $1 \le c \le m$. Moreover, we define

$$\tilde{h}_c(z_1, \ldots, z_c) := h_c(z_1, \ldots, z_c) - \theta$$

for $\theta = E[h(Z_1, \ldots, Z_m)]$.

Example 72 For (5.6), we have $h_2(z_1, z_2) = h(z_1, z_2)$ since m = 2. Under the null hypothesis, X, Y follow the same distribution, and we have

$$h_1(z_1) = E_{Z_2}[h(z_1, Z_2)] = E[k(x_1, X_2)] + E[k(y_1, Y_2)] - E[k(x_1, Y_2)] - E[k(X_2, y_1)] = 0 .$$

Moreover, under the null hypothesis, since $\theta = E\,h(Z_1, \ldots, Z_m) = 0$, we have $\tilde{h}_2(z_1, z_2) = h(z_1, z_2)$.
Hereafter, we set the number of samples as N (= m = n).

Proposition 51 (Serfling [27]) Suppose that the U-statistic satisfies $E h^2 < \infty$ and that $h_1(z_1)$ is zero (degenerate). Let $\lambda_1, \lambda_2, \ldots$ be the eigenvalues of the integral operator
$$L^2 \ni f(\cdot) \mapsto \int \tilde{h}_2(\cdot, y) f(y)\, d\eta(y)$$
whose kernel is $\tilde{h}_2(z_1, z_2)$ (see footnote 1). Then, N times the U-statistic converges in distribution to the random variable
$$\sum_{j=1}^{\infty} \lambda_j (\chi_j^2 - 1)$$
as $N \to \infty$, where $\chi_1^2, \chi_2^2, \ldots$ are random variables that are independent of each other and follow a $\chi^2$ distribution with one degree of freedom.
For the proof, see Sect. 5.5.2 of Serfling [27] (page 193-199).
Note that $\tilde{h}_2(z_1, z_2) = h(z_1, z_2)$ is given by (5.6), which is symmetric but not
nonnegative definite. Therefore, Mercer's theorem cannot be applied. However, an
integral operator is generally compact (Proposition 39), and if the kernel of an integral
operator is symmetric, then the integral operator is self-adjoint (e.g., 45). Therefore,
from Proposition 27, eigenvalues and eigenfunctions exist. However, since the kernel is
not nonnegative definite, some eigenvalues may be negative.
In the following, we write $\{\lambda_i\}_{i=1}^{\infty}$ and $\{\phi_i(\cdot)\}_{i=1}^{\infty}$ as the eigenvalues and eigenfunctions, respectively, of the integral operator

$$T_{\tilde{h}} : L^2[E, \eta] \ni f \mapsto \int_E \tilde{h}_2(\cdot, y) f(y)\, d\eta(y) \in L^2[E, \eta]$$

for the kernel $h_2$ when $\eta = P = Q$. Then, we have

$$\int_E h_2(x, y)\phi_i(y)\, d\eta(y) = \lambda_i \phi_i(x)$$

$$\int_E \phi_i(x)\phi_j(x)\, d\eta(x) = \delta_{i,j} . \quad (5.7)$$

Utilizing Proposition 51, we find that $N\,\mathrm{MMD}_U^2$ converges to the random variable

$$\sum_{j=1}^{\infty} \lambda_j (\chi_j^2 - 1)$$

as the sample size $N \to \infty$.

Example 73 With two sets of 100 samples that follow the standard Gaussian distribution, we obtain eigenvalues by using the method described in Sect. 3.3 and construct
a distribution following the null hypothesis using the U-statistic to perform the test
(Fig. 5.2 Left). We also perform the same test by doubling the standard deviation of
one set of samples (Fig. 5.2 Right).

1 The kernel $K : E \times E \to \mathbb{R}$ in the integral operator $L^2(E, \eta) \ni f \mapsto K f(\cdot) = \int_E K(\cdot, x) f(x)\, d\eta(x)$ is called a kernel of the integral operator even if it is not positive definite.
[Figure 5.2: two density plots. Left panel: "The Same Dist. (U-Statistics)"; right panel: "Different Dist. (U-Statistics)". Horizontal axis: $\mathrm{MMD}_U^2$; vertical axis: Density.]

Fig. 5.2 Test performed using the U-statistic for the two-sample problem. The same (left) and
different (right) distributions of X, Y are employed. The blue line is the statistic, and the red dotted
line is the boundary of the rejection region. We can see that the distribution obtained according to
the null hypothesis has almost the same shape as that in Fig. 5.1

sigma = 1
def k(x, y):
    return np.exp(-(x - y)**2 / sigma**2)
# Data Generation
n = 100
x = np.random.randn(n)
y = np.random.randn(n)        # The Distributions are equal
# y = 2 * np.random.randn(n)  # The Distributions are not equal
# Distribution under null hypothesis
K = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        K[i, j] = k(x[i], x[j]) + k(y[i], y[j]) \
            - k(x[i], y[j]) - k(x[j], y[i])
lam = np.linalg.eigvalsh(K)            # K is symmetric, so its eigenvalues are real
lam = lam[np.argsort(-np.abs(lam))]    # order by magnitude, largest first
lam = lam / n
r = 20
z = []
for h in range(10000):
    z.append(np.longdouble(1 / n * (np.sum(lam[0:r]
             * (np.random.chisquare(df=1, size=r) - 1)))))
v = np.quantile(z, 0.95)
# Statistics
S = 0
for i in range(n - 1):
    for j in range(i + 1, n):
        S = S + k(x[i], x[j]) + k(y[i], y[j]) \
            - k(x[i], y[j]) - k(x[j], y[i])
u = np.longdouble(2 * S / n / (n - 1))   # each pair i < j is counted once, hence the factor 2
x = np.linspace(min(min(z), u, v), max(max(z), u, v), 200)
# Display of the graph
density = kde.gaussian_kde(z)
plt.plot(x, density(x))
plt.axvline(x=v, c="r", linestyle="--")
plt.axvline(x=u, c="b")

5.3 The HSIC and Independence Test

Let (E, F, P) be a probability space. We say that events A, B ∈ F are independent

if P(A)P(B) = P(A ∩ B).
Suppose that sequences x1 , . . . , x N ∈ R and y1 , . . . , y N ∈ R with the same length
N ≥ 1 have occurred according to the distributions of the random variables X, Y . We
wish to test the independence of X, Y , where both xi , x j and yi , y j are independent,
but we do not know whether xi , yi are independent.
For example, if the empirical correlation coefficient
$$\hat{\rho} = \frac{(1/N)\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{\{(1/N)\sum_{i=1}^{N} (x_i - \bar{x})^2\}^{1/2} \{(1/N)\sum_{i=1}^{N} (y_i - \bar{y})^2\}^{1/2}}$$
is close to zero for $\bar{x} := (1/N)\sum_{i=1}^{N} x_i$ and $\bar{y} := (1/N)\sum_{i=1}^{N} y_i$, then we may say
that the variables are independent.

Example 74 (Gaussian Distribution) For simplicity, we assume that X, Y follow the standard Gaussian distribution. If X, Y are independent (written as $X \perp\!\!\!\perp Y$), then their covariance

$$E[XY] = \int_E\int_E xy\, f_{XY}(x, y)\,dx\,dy = \int_E\int_E xy\, f_X(x) f_Y(y)\,dx\,dy = \int_E x f_X(x)\,dx \int_E y f_Y(y)\,dy$$

is 0, and $\rho_{XY} = E[XY] = 0$ follows. Since we can write $f_{XY}(x, y)$ as

$$\frac{1}{2\pi\sqrt{1 - \rho_{XY}^2}} \exp\Big\{-\frac{1}{2(1 - \rho_{XY}^2)}(x^2 - 2\rho_{XY}\, xy + y^2)\Big\} ,$$

we see that $\rho_{XY} = 0$ implies $f_{XY}(x, y) = f_X(x) f_Y(y)$. Hence, $\rho_{XY} = 0 \iff X \perp\!\!\!\perp Y$ follows.

However, as in the following example, in general, $\rho_{XY} = 0$ does not mean that $X \perp\!\!\!\perp Y$.
Example 75 Let $X = \cos\theta$ and $Y = \sin\theta$, where $\theta$ is generated uniformly over $0 \le \theta < 2\pi$, which means that (X, Y) is uniform over the unit circle. If X is determined, then $Y = \pm\sqrt{1 - X^2}$, and if one of X, Y is determined, then at most two possibilities exist for the other. Therefore, the two variables are not independent. However, since the means of X, Y are $\mu_X = \mu_Y = 0$, the covariance can be calculated as
$$E_{XY}[(X - \mu_X)(Y - \mu_Y)] = E_{XY}[XY] = E_{XY}[\cos\theta \sin\theta] = \frac{1}{2} E_{XY}[\sin 2\theta] = 0$$
and the correlation coefficient $\rho_{XY}$ is 0.
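The unit-circle example can be checked numerically. In the sketch below (sample size and seed are arbitrary choices), the empirical correlation of X and Y is close to zero, while the correlation of $X^2$ and $Y^2$ is exactly $-1$ up to rounding because $Y^2 = 1 - X^2$, which exposes the dependence that the correlation coefficient misses.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, size=100_000)
X, Y = np.cos(theta), np.sin(theta)

rho_hat = np.corrcoef(X, Y)[0, 1]             # close to 0: X and Y are uncorrelated
dependence = np.corrcoef(X**2, Y**2)[0, 1]    # -1 up to rounding, since Y^2 = 1 - X^2
print(rho_hat, dependence)
```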

To address this issue, Gretton et al. [12] proposed testing the independence of random variables X, Y by mapping $E \ni X \mapsto k_X(X, \cdot) \in H_X$ and $E \ni Y \mapsto k_Y(Y, \cdot) \in H_Y$ for kernels $k_X, k_Y$ and basing the test on the covariance between $k_X(X, \cdot)$ and $k_Y(Y, \cdot)$, so that such an inconvenience does not occur. They devised a statistical test for $E[k_X(X, \cdot) k_Y(Y, \cdot)] = E[k_X(X, \cdot)]E[k_Y(Y, \cdot)]$ rather than $E[XY] = E[X]E[Y]$. We define

$$\mathrm{HSIC}(X, Y) := \|m_X m_Y - m_{XY}\|_{H_X \otimes H_Y}^2 \in \mathbb{R},$$

which is the squared norm of the covariance $m_X m_Y - m_{XY} \in H_X \otimes H_Y$ and is called the Hilbert-Schmidt independence criterion (HSIC). Since the HSIC is a squared norm, it is zero if and only if $m_X m_Y - m_{XY} \in H_X \otimes H_Y$ is zero. The HSIC is the $\mathrm{MMD}^2$ when $m_P = m_X m_Y$ and $m_Q = m_{XY}$.

Proposition 52 (Gretton et al. [12]) When the reproducing kernels $k_X, k_Y : E \times E \to \mathbb{R}$ of $H_X, H_Y$ are both characteristic kernels, the random variables X, Y that take values in E being independent and $\mathrm{HSIC}(X, Y) = 0$ are equivalent, i.e.,

$$\mathrm{HSIC}(X, Y) = 0 \iff X \perp\!\!\!\perp Y.$$

Proof: If $k_X, k_Y$ are both characteristic, then from Proposition 55 (see below), $k_X k_Y$ is also characteristic. Therefore, for $m_X(\cdot) := \int_E k_X(x, \cdot)\, dP_X(x) \in H_X$, $m_Y(\cdot) := \int_E k_Y(y, \cdot)\, dP_Y(y) \in H_Y$, and $m_{XY}(\cdot, \star) := \int_E\int_E k_X(x, \cdot) k_Y(y, \star)\, dP_{XY}(x, y) \in H_X \otimes H_Y$, the map $\mathcal{P}_{X \otimes Y} \ni P_{XY} \mapsto m_{XY} \in H_X \otimes H_Y$ is injective. Hence, we have

$$X \perp\!\!\!\perp Y \iff P_{XY} = P_X P_Y \iff m_{XY} = m_X m_Y \iff \mathrm{HSIC}(X, Y) = 0 ,$$

where the second $\iff$ is due to (5.2).

From Propositions 50 and 52, we have the following.

Corollary 2 If both of the reproducing kernels $k_X, k_Y$ of $H_X, H_Y$ are characteristic, then we obtain

$$\Sigma_{XY} = \Sigma_{YX} = 0 \iff X \perp\!\!\!\perp Y .$$

If we abbreviate $\|\cdot\|_{X \otimes Y}$ and $\langle\cdot, \cdot\rangle_{X \otimes Y}$ as $\|\cdot\|$ and $\langle\cdot, \cdot\rangle$, respectively, then we have

$$\|m_{XY}\|^2 = \langle E_{XY}[k_X(X, \cdot) k_Y(Y, \cdot)], E_{X'Y'}[k_X(X', \cdot) k_Y(Y', \cdot)]\rangle$$
$$= E_{XY} E_{X'Y'}[\langle k_X(X, \cdot) k_Y(Y, \cdot), k_X(X', \cdot) k_Y(Y', \cdot)\rangle]$$
$$= E_{XYX'Y'}[k_X(X, X') k_Y(Y, Y')] ,$$

$$\langle m_{XY}, m_X m_Y\rangle = \langle E_{XY}[k_X(X, \cdot) k_Y(Y, \cdot)], E_{X'}[k_X(X', \cdot)] E_{Y'}[k_Y(Y', \cdot)]\rangle$$
$$= E_{XY}\{E_{X'}[\langle k_X(X, \cdot) k_Y(Y, \cdot), k_X(X', \cdot) E_{Y'}[k_Y(Y', \cdot)]\rangle]\}$$
$$= E_{XY}\{E_{X'}[k_X(X, X')]\, E_{Y'}[\langle k_Y(Y, \cdot), k_Y(Y', \cdot)\rangle]\}$$
$$= E_{XY}\{E_{X'}[k_X(X, X')]\, E_{Y'}[k_Y(Y, Y')]\} ,$$

and

$$\|m_X m_Y\|^2 = \langle E_X[k_X(X, \cdot)] E_Y[k_Y(Y, \cdot)], E_{X'}[k_X(X', \cdot)] E_{Y'}[k_Y(Y', \cdot)]\rangle$$
$$= E_X E_{X'}[k_X(X, X')]\, E_Y E_{Y'}[k_Y(Y, Y')] ,$$

where X, X' (Y, Y') are independent and follow the same distribution. Hence, we can write $\mathrm{HSIC}(X, Y)$ as

$$\mathrm{HSIC}(X, Y) := \|m_{XY} - m_X m_Y\|^2$$
$$= E_{XX'YY'}[k_X(X, X') k_Y(Y, Y')] - 2E_{XY}\{E_{X'}[k_X(X, X')]\, E_{Y'}[k_Y(Y, Y')]\} + E_{XX'}[k_X(X, X')]\, E_{YY'}[k_Y(Y, Y')] . \quad (5.8)$$

When applying the HSIC, we often construct the following estimator, replacing the mean by the relative frequency.

$$\widehat{\mathrm{HSIC}} := \frac{1}{N^2}\sum_{i}\sum_{j} k_X(x_i, x_j) k_Y(y_i, y_j) - \frac{2}{N^3}\sum_{i}\sum_{j} k_X(x_i, x_j) \sum_{h} k_Y(y_i, y_h) + \frac{1}{N^4}\sum_{i}\sum_{j} k_X(x_i, x_j) \sum_{h}\sum_{r} k_Y(y_h, y_r) \quad (5.9)$$

For example, we can write the HSIC in Python as follows.

def HSIC_1(x, y, k_x, k_y):
    n = len(x)
    S = 0
    for i in range(n):
        for j in range(n):
            S = S + k_x(x[i], x[j]) * k_y(y[i], y[j])
    T = 0
    for i in range(n):
        T_1 = 0
        for j in range(n):
            T_1 = T_1 + k_x(x[i], x[j])
        T_2 = 0
        for l in range(n):
            T_2 = T_2 + k_y(y[i], y[l])
        T = T + T_1 * T_2
    U = 0
    for i in range(n):
        for j in range(n):
            U = U + k_x(x[i], x[j])
    V = 0
    for i in range(n):
        for j in range(n):
            V = V + k_y(y[i], y[j])
    return S / n**2 - 2 * T / n**3 + U * V / n**4
We often write the statistic as $\widehat{\mathrm{HSIC}} = \frac{1}{N^2}\mathrm{trace}(K_X H K_Y H)$, where $K_X = (k_X(x_i, x_j))_{i,j}$, $K_Y = (k_Y(y_i, y_j))_{i,j}$, $H := I - \frac{1}{N}E$, $I \in \mathbb{R}^{N \times N}$ is the unit matrix, and $E \in \mathbb{R}^{N \times N}$ is a matrix in which all the elements are ones. In fact, we have

$$\mathrm{trace}(K_X H K_Y H) = \sum_{i} (K_X H K_Y H)_{i,i} = \sum_{i}\sum_{j} (K_X H)_{i,j} (K_Y H)_{j,i}$$
$$= \sum_{i}\sum_{j} \Big\{\sum_{h} k_X(x_i, x_h)\Big(\delta_{h,j} - \frac{1}{N}\Big)\Big\}\Big\{\sum_{h} k_Y(y_j, y_h)\Big(\delta_{h,i} - \frac{1}{N}\Big)\Big\}$$
$$= \sum_{i}\sum_{j} \Big\{k_X(x_i, x_j) k_Y(y_i, y_j) - \frac{1}{N} k_X(x_i, x_j)\sum_{h} k_Y(y_i, y_h) - \frac{1}{N} k_Y(y_i, y_j)\sum_{h} k_X(x_i, x_h) + \frac{1}{N^2}\sum_{h} k_X(x_i, x_h)\sum_{r} k_Y(y_j, y_r)\Big\}$$
$$= \sum_{i}\sum_{j} k_X(x_i, x_j) k_Y(y_i, y_j) - \frac{2}{N}\sum_{i}\sum_{j} k_X(x_i, x_j)\sum_{h} k_Y(y_i, y_h) + \frac{1}{N^2}\sum_{i}\sum_{h} k_X(x_i, x_h)\sum_{j}\sum_{r} k_Y(y_j, y_r) .$$

def HSIC_1(x, y, k_x, k_y):
    n = len(x)
    K_x = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            K_x[i, j] = k_x(x[i], x[j])
    K_y = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            K_y[i, j] = k_y(y[i], y[j])
    E = np.ones((n, n))
    H = np.identity(n) - E / n
    # np.trace sums the diagonal elements of K_x H K_y H
    return np.trace(K_x.dot(H).dot(K_y).dot(H)) / n**2
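The agreement between the sum form (5.9) and the trace form can be confirmed numerically. The sketch below uses hypothetical function names hsic_sums and hsic_trace and precomputed Gaussian Gram matrices; the data-generating model is an arbitrary choice for illustration.

```python
import numpy as np

def hsic_sums(K_x, K_y):
    """Estimator (5.9), written with row sums and total sums of the Gram matrices."""
    n = K_x.shape[0]
    S = np.sum(K_x * K_y)                            # sum_ij k_X(x_i,x_j) k_Y(y_i,y_j)
    T = np.sum(K_x.sum(axis=1) * K_y.sum(axis=1))    # sum_i (sum_j k_X)(sum_h k_Y)
    U, V = K_x.sum(), K_y.sum()
    return S / n**2 - 2 * T / n**3 + U * V / n**4

def hsic_trace(K_x, K_y):
    """Trace form trace(K_X H K_Y H) / N^2 with the centering matrix H."""
    n = K_x.shape[0]
    H = np.identity(n) - np.ones((n, n)) / n
    return np.trace(K_x @ H @ K_y @ H) / n**2

rng = np.random.default_rng(0)
x = rng.standard_normal(50)
y = 0.5 * x + rng.standard_normal(50)
K_x = np.exp(-(x[:, None] - x[None, :])**2 / 2)
K_y = np.exp(-(y[:, None] - y[None, :])**2 / 2)
print(hsic_sums(K_x, K_y), hsic_trace(K_x, K_y))
```

Working with Gram matrices avoids the double loops of the element-wise version, at the cost of storing two N × N matrices.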

Example 76 We execute the above process for $\sigma^2 = 1$ and

$$k_X(x, y) = k_Y(x, y) = \exp\Big(-\frac{1}{2\sigma^2}\|x - y\|^2\Big)$$

(Gaussian kernel) as follows.

def k_x(x, y):
    return np.exp(-np.linalg.norm(x - y)**2 / 2)
k_y = k_x
k_z = k_x
n = 100
for a in [0, 0.1, 0.2, 0.4, 0.6, 0.8]:  # a is the correlation
    x = np.random.randn(n)
    z = np.random.randn(n)
    y = a * x + np.sqrt(1 - a**2) * z
    print(HSIC_1(x, y, k_x, k_y))

0.0006847868161461435
0.004413058917908441
0.004693757443490376
0.01389332860758824
0.010176397492526468
0.0364733529032461

We define the HSIC as $\|m_{XY} - m_X m_Y\|^2$ when we test $X \perp\!\!\!\perp Y$ for $m_X = E_X[k_X(X, \cdot)]$, $m_Y = E_Y[k_Y(Y, \cdot)]$, and $m_{XY} = E_{XY}[k_X(X, \cdot) k_Y(Y, \cdot)]$. If we test $X \perp\!\!\!\perp \{Y, Z\}$ between X and {Y, Z}, then we extend the HSIC to $\|m_{XYZ} - m_X m_{YZ}\|^2$. In terms of the MMD, we test whether the joint distribution of X, Y, Z and the product of the distributions of X and (Y, Z) are equal. Therefore, we can change $k_Y(y, \cdot)$ to $k_Y(y, \cdot) k_Z(z, \cdot)$. We add arguments to the function HSIC_1 to construct the function HSIC_2 and perform the following operations.

def HSIC_2(x, y, z, k_x, k_y, k_z):
    n = len(x)
    S = 0
    for i in range(n):
        for j in range(n):
            S = S + k_x(x[i], x[j]) * k_y(y[i], y[j]) \
                * k_z(z[i], z[j])
    T = 0
    for i in range(n):
        T_1 = 0
        for j in range(n):
            T_1 = T_1 + k_x(x[i], x[j])
        T_2 = 0
        for l in range(n):
            # both kernels use the inner index l
            T_2 = T_2 + k_y(y[i], y[l]) * k_z(z[i], z[l])
        T = T + T_1 * T_2
    U = 0
    for i in range(n):
        for j in range(n):
            U = U + k_x(x[i], x[j])
    V = 0
    for i in range(n):
        for j in range(n):
            V = V + k_y(y[i], y[j]) * k_z(z[i], z[j])
    return S / n**2 - 2 * T / n**3 + U * V / n**4

The smaller the value of H S I C is, the more likely independence is, but for
random variables X, Y, U, V , the condition H S I C(X, Y ) < HS I C(U, V ) does not
mean that X, Y is closer to independence than U, V . However, in practice, the HSIC
is often used as the criterion to measure the certainty of independence.
Example 77 (LiNGAM [16, 28]) We wish to know the cause-and-effect relation among the random variables X, Y, Z from their N independent realizations x, y, z. For example, we assume that X, Y are generated based on either Model 1 (in which $X = e_1$ and $Y = aX + e_2$ for a constant $a \in \mathbb{R}$ and zero-mean independent variables $e_1, e_2$) or Model 2 (in which $Y = e_1'$ and $X = a'Y + e_2'$ for a constant $a' \in \mathbb{R}$ and zero-mean independent variables $e_1', e_2'$). We choose the model for which independence is more plausible, comparing $e_1 \perp\!\!\!\perp e_2$ and $e_1' \perp\!\!\!\perp e_2'$. Then, we can apply the function HSIC_1, where $e_2, e_2'$ are calculated from $y - \hat{a}x$, $x - \hat{a}'y$. For example, using the function

def cc(x, y):
    return np.sum(np.dot(x.T, y)) / len(x)

def f(u, v):
    return u - cc(u, v) / cc(v, v) * v

we can compute the residuals $y - \hat{a}x$ and $x - \hat{a}'y$, with $a$ and $a'$ replaced by their least-squares estimates, via f(y,x) and f(x,y), respectively. When we have three variables X, Y, Z, we first determine the upstream variable. To this end, using the function HSIC_2, we compare three independence cases: between x and its residue (f(y,x), f(z,x)), between y and its residue (f(z,y), f(x,y)), and between z and its residue (f(x,z), f(y,z)). For example, if we choose the first pair, then X is the upstream variable.
Then, we choose the midstream variable among the unselected two variables. For example, if X is selected in the first round, then we compare the two independence pairs (y_x, z_xy) and (z_x, y_zx). If we use the notation of the program, these are y_x=f(y,x) and z_xy=f(z_x,y_x).
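To see what the function f returns, the following sketch applies it to simulated data from Model 1. The residual f(y, x) is uncorrelated with x by construction, and cc(y, x) / cc(x, x) is the least-squares estimate of a. The sample size, seed, and the value a = 2 are assumptions made for the illustration; the noise is generated as a difference of squared Gaussians, as in the data generation below.

```python
import numpy as np

def cc(x, y):
    return np.sum(np.dot(x.T, y)) / len(x)

def f(u, v):
    return u - cc(u, v) / cc(v, v) * v

rng = np.random.default_rng(0)
N = 500
x = rng.standard_normal(N)**2 - rng.standard_normal(N)**2   # non-Gaussian, zero-mean e_1
e = rng.standard_normal(N)**2 - rng.standard_normal(N)**2   # non-Gaussian, zero-mean e_2
y = 2 * x + e                                               # Model 1 with a = 2 (assumed)
x = x - np.mean(x)
y = y - np.mean(y)

a_hat = cc(y, x) / cc(x, x)   # least-squares estimate of a
res = f(y, x)                 # residual y - a_hat * x
print(a_hat, cc(res, x))      # cc(res, x) vanishes up to rounding
```

Zero correlation between the residual and x holds in both causal directions, which is why the comparison between the two models has to rely on an independence measure such as the HSIC rather than on correlation.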

# Data Generation
n = 30
x = np.random.randn(n)**2 - np.random.randn(n)**2
y = 2 * x + np.random.randn(n)**2 - np.random.randn(n)**2
z = x + y + np.random.randn(n)**2 - np.random.randn(n)**2
x = x - np.mean(x)
y = y - np.mean(y)
z = z - np.mean(z)

# Estimate the Upstream
x_y = f(x, y); y_z = f(y, z); z_x = f(z, x)
x_z = f(x, z); z_y = f(z, y); y_x = f(y, x)
v1 = HSIC_2(x, y_x, z_x, k_x, k_y, k_z)
v2 = HSIC_2(y, z_y, x_y, k_y, k_z, k_x)
v3 = HSIC_2(z, x_z, y_z, k_z, k_x, k_y)

if v1 < v2:
    if v1 < v3:
        top = 1
    else:
        top = 3
else:
    if v2 < v3:
        top = 2
    else:
        top = 3

# Estimate the Downstream
x_yz = f(x_y, z_y)
y_zx = f(y_z, x_z)
z_xy = f(z_x, y_x)

if top == 1:
    v1 = HSIC_1(y_x, z_xy, k_y, k_z)
    v2 = HSIC_1(z_x, y_zx, k_z, k_y)
    if v1 < v2:
        middle = 2
        bottom = 3
    else:
        middle = 3
        bottom = 2
if top == 2:
    v1 = HSIC_1(z_y, x_yz, k_y, k_z)
    v2 = HSIC_1(x_y, z_xy, k_z, k_y)
    if v1 < v2:
        middle = 3
        bottom = 1
    else:
        middle = 1
        bottom = 3
if top == 3:
    v1 = HSIC_1(z_y, x_yz, k_z, k_x)
    v2 = HSIC_1(x_y, z_xy, k_x, k_z)
    if v1 < v2:
        middle = 1
        bottom = 2
    else:
        middle = 2
        bottom = 1
# Display the Result
print("top =", top)
print("middle =", middle)
print("bottom =", bottom)

top = 1
middle = 3
bottom = 2

In the following, as in the case of the two-sample problem, we construct the
distribution under the null hypothesis X ⊥⊥ Y in two ways.
1. By permuting either x1 , . . . , x N or y1 , . . . , y N to make X, Y independent, we repeatedly
compute the resulting HSIC values to create a histogram that expresses the null
distribution (permutation test).
2. We compute the (asymptotic) distribution from a statistic whose asymptotic
distribution is known (U-statistic).
We perform the test by using the procedure shown in the following example.
Example 78 The following procedure randomly rearranges the order of one of the
two nonindependent sequences to make them independent, estimates the null distri-
bution of the HSIC values and tests whether they are independent or not (Fig. 5.3).

## Permute x and show the distribution of HSIC as a histogram ##
# Assumes np, plt, kde, n, HSIC_1, k_x, k_y are defined as earlier in this chapter

## Data Generation ##
x = np.random.randn(n)
y = np.random.randn(n)
u = HSIC_1(x, y, k_x, k_y)
## Permute x and construct the null distribution
m = 100
w = []
for i in range(m):
    x = x[np.random.choice(n, n, replace=False)]
    w.append(HSIC_1(x, y, k_x, k_y))
## Set the rejection region
v = np.quantile(w, 0.95)
x = np.linspace(min(min(w), u, v), max(max(w), u, v), 200)
## Graphical Output
density = kde.gaussian_kde(w)
plt.plot(x, density(x))
plt.axvline(x=v, c="r", linestyle="--")
plt.axvline(x=u, c="b")
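The permutation procedure can also be written without plotting, as in the following minimal sketch. It is not the program of this book: it assumes the matrix form of the HSIC estimator, (1/N²) trace(K_X H K_Y H) (cf. Exercise 74), the Gaussian kernel exp(−(x−y)²/σ²), and the hypothetical helper names `gram`, `hsic`, and `hsic_permutation_test`. Permuting the rows and columns of K_X is equivalent to permuting the sequence x.

```python
import numpy as np

def gram(x, sigma=1.0):
    # Gram matrix of the Gaussian kernel exp(-(x - y)^2 / sigma^2)
    d = x[:, None] - x[None, :]
    return np.exp(-d ** 2 / sigma ** 2)

def hsic(Kx, Ky):
    # Matrix form of the (biased) HSIC estimator: trace(Kx H Ky H) / N^2
    n = Kx.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(Kx @ H @ Ky @ H) / n ** 2

def hsic_permutation_test(x, y, m=200, alpha=0.05, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    Kx, Ky = gram(x), gram(y)
    u = hsic(Kx, Ky)                      # observed statistic
    w = np.empty(m)
    for b in range(m):
        p = rng.permutation(len(x))       # random reordering of x
        w[b] = hsic(Kx[np.ix_(p, p)], Ky)
    v = np.quantile(w, 1 - alpha)         # boundary of the rejection region
    p_value = (1 + np.sum(w >= u)) / (m + 1)
    return u, v, p_value

rng = np.random.default_rng(0)            # seed fixed only for reproducibility
x = rng.standard_normal(100)
y = x + 0.1 * rng.standard_normal(100)    # strongly dependent pair
u, v, p_value = hsic_permutation_test(x, y, rng=rng)
print(u, v, p_value)
```

For this strongly dependent pair, the statistic u falls in the rejection region (u > v).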

Now, let us use the unbiased estimator of the HSIC, $HSIC_U$, to find the theoretical
asymptotic distribution under the null hypothesis. Note that

$$\widehat{HSIC} = \frac{1}{N^4}\sum_{i=1}^{N}\sum_{j=1}^{N}\sum_{q=1}^{N}\sum_{r=1}^{N} h(z_i, z_j, z_q, z_r)$$

for $z_i = (x_i, y_i)$, where

$$h(z_i, z_j, z_q, z_r) = \frac{1}{4!}\sum_{(t,u,v,w)}^{(i,j,q,r)} \{k_X(x_t, x_u)k_Y(y_t, y_u) + k_X(x_t, x_u)k_Y(y_v, y_w) - 2k_X(x_t, x_u)k_Y(y_t, y_v)\} ,$$

and $\sum_{(t,u,v,w)}^{(i,j,q,r)}$ denotes the sum in which $(t, u, v, w)$ ranges over the
permutations of $(i, j, q, r)$. If we modify this estimate to make it an unbiased estimator, we obtain

$$HSIC_U = \binom{N}{4}^{-1}\sum_{i<j<q<r} h(z_i, z_j, z_q, z_r) ,$$

where $\sum_{i<j<q<r}$ is the sum that ranges over $1 \le i < j < q < r \le N$, i.e., over indices without any overlap.
[Figure: two density plots titled "HSIC (Permutation, Independent)" and "HSIC (Permutation, Dependent)"; horizontal axis: HSIC, vertical axis: Density]

Fig. 5.3 The distribution under the null hypothesis obtained by the permutation test (Example 78).
The blue line is the statistic, and the red dotted line is the boundary with the rejection
region

For example, we can construct the program as follows. Since the program consumes
memory, the number of samples should be limited to 100 or less. Additionally,
since the estimator is different from $\widehat{HSIC}$, it produces different values for the same
data. The values of $HSIC_U$ are smaller than those of $\widehat{HSIC}$.

import itertools
import math

def h(i, j, q, r, x, y, k_x, k_y):
    # Average over all 4! orderings (t, u, v, w) of (i, j, q, r)
    M = list(itertools.permutations([i, j, q, r], 4))
    m = len(M)
    S = 0
    for t, u, v, w in M:
        S = S + k_x(x[t], x[u]) * k_y(y[t], y[u]) \
            + k_x(x[t], x[u]) * k_y(y[v], y[w]) \
            - 2 * k_x(x[t], x[u]) * k_y(y[t], y[v])
    return S / m

def HSIC_U(x, y, k_x, k_y):
    n = len(x)
    # All index sets i < j < q < r
    M = list(itertools.combinations(range(n), 4))
    S = 0
    for i, j, q, r in M:
        S = S + h(i, j, q, r, x, y, k_x, k_y)
    return S / math.comb(n, 4)

The function $h_1(\cdot)$ is the zero function. For $h_2(\cdot, \cdot)$, we use the following formula.

Proposition 53 (Chwialkowski-Gretton [5]) Let

$$\tilde{k}_X(x, x') = k_X(x, x') - E_{X'} k_X(x, X') - E_X k_X(X, x') + E_{XX'} k_X(X, X')$$

and

$$\tilde{k}_Y(y, y') = k_Y(y, y') - E_{Y'} k_Y(y, Y') - E_Y k_Y(Y, y') + E_{YY'} k_Y(Y, Y') .$$

Then, $h_2(\cdot, \cdot)$ is given by

$$h_2(z, z') = \frac{1}{6}\tilde{k}_X(x, x')\tilde{k}_Y(y, y')$$

for $z = (x, y)$, $z' = (x', y')$.
Proof: The derivation is due to simple transformations. See the original paper for the
proof.
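As a rough illustration of Proposition 53, the following sketch forms the empirical counterpart of the centered kernel from a Gram matrix by replacing each expectation with a sample mean. The helper names `gram` and `center` and the choice of the Gaussian kernel exp(−(x−y)²/σ²) are assumptions made here. The centered matrix coincides with HKH for H = I − E/N, and the Gram matrix of h_2 is the elementwise product of the two centered matrices divided by 6.

```python
import numpy as np

def gram(x, sigma=1.0):
    # Gram matrix of the Gaussian kernel exp(-(x - y)^2 / sigma^2)
    d = x[:, None] - x[None, :]
    return np.exp(-d ** 2 / sigma ** 2)

def center(K):
    # Empirical version of k~: K_ij - (row mean)_i - (column mean)_j + (grand mean)
    row = K.mean(axis=1, keepdims=True)
    col = K.mean(axis=0, keepdims=True)
    return K - row - col + K.mean()

rng = np.random.default_rng(0)   # seed fixed only for reproducibility
x = rng.standard_normal(50)
y = rng.standard_normal(50)
n = len(x)

Kx_t = center(gram(x))
Ky_t = center(gram(y))
K_h2 = Kx_t * Ky_t / 6           # Gram matrix of h_2 (elementwise product)

H = np.eye(n) - np.ones((n, n)) / n
print(np.allclose(Kx_t, H @ gram(x) @ H))
```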
Mercer's theorem is not applicable since the kernel $h_2$ of the integral operator is
not nonnegative definite. However, since the kernel is symmetric and its integral
operator is self-adjoint, eigenvalues $\{\lambda_i\}$ and eigenfunctions $\{\phi_i\}$ exist (Proposition 27).
Therefore, as in the case involving the two-sample problem, the null distribution
can be calculated by using Proposition 51. Moreover, the mean of $h_2$ is zero, i.e.,
$\tilde{h}_2 = h_2$.

[Figure: two density plots titled "HSIC (U-Statistic, Independent)" and "HSIC (U-Statistics, Dependent)"; horizontal axis: HSIC U, vertical axis: Density]

Fig. 5.4 The distribution under the null hypothesis when using the unbiased estimator of the
HSIC, i.e., $HSIC_U$. The blue line is the statistic, and the red dotted line is the boundary with the
rejection region. The distribution under the null hypothesis is different from that of the estimator $\widehat{HSIC}$
used in the permutation test. In particular, when X, Y are independent, the unbiased estimator can
take a negative value because the true value of the HSIC is zero

Example 79 We calculate the eigenvalues of the Gram matrix of the
kernel $h_2$ and divide them by N to obtain the desired eigenvalues (Sect. 3.3). Then, we
find the distribution under the null hypothesis and calculate the rejection region.
We construct the following program and execute it. As input, we use random numbers that
follow a Gaussian distribution with N = 100 samples. In Fig. 5.4, the left panel
shows the case of a correlation coefficient of 0, and the right panel shows the case of a correlation coefficient
of 0.2.

# Assumes np, plt, kde, and HSIC_U are defined as earlier in this chapter
sigma = 1
def k(x, y):
    return np.exp(-(x - y)**2 / sigma**2)
k_x = k; k_y = k
## Data Generation
n = 100; x = np.random.randn(n)
a = 0  # Independent
# a = 0.2  ## Correlation 0.2
y = a * x + np.sqrt(1 - a**2) * np.random.randn(n)
# y = np.random.randn(n) * 2  ## The distributions are not equal
## Null Hypothesis
K_x = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        K_x[i, j] = k_x(x[i], x[j])
K_y = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        K_y[i, j] = k_y(y[i], y[j])
F = np.zeros(n)
for i in range(n):
    F[i] = np.sum(K_x[i, :]) / n
G = np.zeros(n)
for i in range(n):
    G[i] = np.sum(K_y[i, :]) / n
H = np.sum(F) / n
I = np.sum(G) / n
K = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        K[i, j] = (K_x[i, j] - F[i] - F[j] + H) \
            * (K_y[i, j] - G[i] - G[j] + I) / 6
r = 20
lam, vec = np.linalg.eig(K)
lam = lam / n
print(lam)
z = []
for s in range(10000):
    z.append(1 / n * (np.sum(lam[0:r] * (np.random.chisquare(df=1, size=r) - 1))))
v = np.quantile(z, 0.95)
## Statistics
u = HSIC_U(x, y, k_x, k_y)
## Graphical Output
x = np.linspace(min(min(z), u, v), max(max(z), u, v), 200)
density = kde.gaussian_kde(z)
plt.plot(x, density(x))
plt.axvline(x=v, c="r", linestyle="--")
plt.axvline(x=u, c="b")

[9.43372312e-03 7.60184058e-03 6.84791271e-03 4.71833989e-03

2.91610998e-03 2.58505938e-03 2.57260480e-03 2.14716454e-03
1.52927314e-03 1.40680064e-03 1.30527061e-03 1.03809542e-03
8.17408141e-04 6.63885663e-04 5.98972166e-04 5.18116129e-04
4.66211054e-04 3.33278079e-04 3.08102237e-04 2.52215392e-04
2.08148331e-04 2.01305856e-04 1.61855802e-04 1.24737146e-04
9.77431144e-05 9.05299170e-05 6.74173717e-05 5.94551451e-05
4.79841522e-05 3.68125212e-05 3.47962810e-05 2.59468073e-05
2.27016212e-05 1.75277782e-05 1.44134018e-05 1.34476814e-05
1.07720204e-05 8.32532742e-06 8.02962726e-06 6.61660031e-06
6.03329520e-06 4.82571716e-06 4.39769787e-06 3.07993953e-06
2.92646288e-06 1.93843247e-06 1.43720831e-06 1.32156280e-06
1.21961656e-06 9.39545869e-07 8.74760924e-07 6.82246149e-07
6.17680338e-07 5.35249706e-07 4.13572087e-07 2.89912505e-07
2.63638065e-07 2.13422066e-07 1.43007791e-07 1.39668484e-07
1.11861810e-07 9.06782617e-08 8.18246166e-08 5.59605198e-08
4.96061526e-08 4.03184819e-08 3.74865518e-08 1.94370484e-08
1.55836383e-08 1.28960568e-08 9.98179017e-09 6.29967857e-09
4.60116540e-09 3.71943499e-09 2.78758512e-09 2.47338744e-09
1.54467729e-09 1.31346990e-09 8.12941854e-10 4.71936633e-10
3.62355934e-10 2.28630320e-10 1.71319496e-10 8.99150656e-11
5.81595026e-11 4.02249130e-11 2.95420131e-11 1.60077238e-11
1.00890685e-11 7.29486504e-12 3.89343807e-12 2.24468397e-12
1.62584385e-12 9.95269698e-13 6.98421746e-13 1.93405677e-13
4.44822157e-14 3.34480294e-14 3.34678155e-15 1.12240660e-14]
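The sampling loop of Example 79 can be vectorized. The following is a minimal sketch under the same null approximation, (1/n) Σ λ_i (χ²₁ − 1); the function name `sample_null` and the rounded eigenvalues (taken from the first five values of the output above) are used here only for illustration.

```python
import numpy as np

def sample_null(lam, n, size=10000, rng=None):
    # Draws from (1/n) * sum_i lam_i * (chi2_1 - 1) for the given eigenvalues lam
    rng = np.random.default_rng() if rng is None else rng
    lam = np.asarray(lam, dtype=float)
    chi2 = rng.chisquare(df=1, size=(size, lam.size))
    return (chi2 - 1.0) @ lam / n

rng = np.random.default_rng(0)   # seed fixed only for reproducibility
lam = np.array([9.4e-3, 7.6e-3, 6.8e-3, 4.7e-3, 2.9e-3])  # rounded leading eigenvalues
z = sample_null(lam, n=100, rng=rng)
v = np.quantile(z, 0.95)         # boundary of the rejection region
print(z.mean(), v)
```

Each draw is centered, so the sample mean is close to zero, and no draw can fall below −Σ λ_i / n, which is why the null distribution of the unbiased estimator extends to negative values.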

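Instead of writing the HSIC as $\|m_{XY} - m_X m_Y\|^2_{H_X \otimes H_Y}$, by using the HS norm, we
may write it as $\|\Sigma_{XY}\|^2_{HS}$ or $\|\Sigma_{YX}\|^2_{HS}$. In fact, if $\{e_{X,i}\}$ and $\{e_{Y,j}\}$ are orthonormal
bases of $H_X$, $H_Y$, respectively, then from the definition of $\|\cdot\|_{HS}$ (Sect. 2.6) and
(5.1), we have

$$\begin{aligned}
\|\Sigma_{YX}\|^2_{HS} &= \sum_{i=1}^{\infty}\|\Sigma_{YX} e_{X,i}\|^2_{H_Y} = \sum_{i=1}^{\infty}\sum_{j=1}^{\infty}\langle e_{Y,j}, \Sigma_{YX} e_{X,i}\rangle^2_{H_Y} \\
&= \sum_{i=1}^{\infty}\sum_{j=1}^{\infty}\langle e_{X,i} \otimes e_{Y,j}, m_{XY} - m_X m_Y\rangle^2_{H_X \otimes H_Y} = \|m_{XY} - m_X m_Y\|^2_{H_X \otimes H_Y} .
\end{aligned}$$

Similarly, $\|\Sigma_{XY}\|^2_{HS}$ has the same value.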
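The identity can be checked numerically in a finite-dimensional case. The sketch below assumes linear kernels k_X(x, x') = x^⊤x' and k_Y(y, y') = y^⊤y', so that the feature maps are the identity and the empirical Σ_YX is the usual cross-covariance matrix. Linear kernels are not characteristic and serve here only to illustrate the equality of the two expressions; the matrix form (1/N²) trace(K_X H K_Y H) of the empirical HSIC is taken as given (cf. Exercise 74).

```python
import numpy as np

rng = np.random.default_rng(0)   # seed fixed only for reproducibility
N = 200
X = rng.standard_normal((N, 3))
Y = X[:, :2] + 0.5 * rng.standard_normal((N, 2))

# Empirical cross-covariance operator and its squared HS (Frobenius) norm
Xc = X - X.mean(axis=0)
Yc = Y - Y.mean(axis=0)
Sigma_YX = Yc.T @ Xc / N
hs_norm_sq = np.sum(Sigma_YX ** 2)

# Empirical HSIC in matrix form with linear-kernel Gram matrices
K_X = X @ X.T
K_Y = Y @ Y.T
H = np.eye(N) - np.ones((N, N)) / N
hsic = np.trace(K_X @ H @ K_Y @ H) / N ** 2

print(hs_norm_sq, hsic)
```

The two printed values agree up to rounding error.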

5.4 Characteristic and Universal Kernels

Let H and k be an RKHS and its reproducing kernel, respectively. Let $\mathcal{P}$ be the set
of distributions that a random variable X follows. Then, we can define the map

$$\mathcal{P} \ni \mu \mapsto \int k(x, \cdot)\,d\mu(x) \in H ,$$

which we call the embedding of probabilities in the RKHS.

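Since each element of H is generated by k(x, ·) (x ∈ E) or can be written as the
limit of such a sequence, we describe the condition that the embedding is injective, i.e., that the kernel k is characteristic, as

$$\int_E k(x, y)\,d\mu(x) = 0 ,\ y \in E \Longrightarrow \mu = 0$$

for $\mu := \mu_1 - \mu_2$. Moreover, $\int_E k(x, y)\,d\mu(x) = 0$ ($y \in E$) implies that

$$\int_E\int_E k(x, y)\,d\mu(x)\,d\mu(y) = 0 . \tag{5.10}$$

If we use $k(x, y) = \phi(x - y) = \int_E e^{i(x-y)^\top w}\,d\eta(w)$ (Proposition 5), then we can write
(5.10) as

$$\int_E \left|\int_E e^{iw^\top x}\,d\mu(x)\right|^2 d\eta(w) = 0 .$$

In other words,

$$\hat{\mu}(w) := \int_E e^{iw^\top x}\,d\mu(x) = 0 \tag{5.11}$$

holds almost everywhere with respect to the measure η.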

In the following, let η be a finite measure. We call the set of x ∈ E such that
η(U(x, ε)) > 0 for any ε > 0 the support of η and denote it by E(η). We note that
the support of a finite measure is always a closed set. In fact, for x ∈ E\E(η), if
the radius ε of the open ball U(x, ε) is sufficiently small, then E(η) and U(x, ε) have no
intersection, so the complement of E(η) is open.
Here, if E(η) = E, then (5.11) means that μ = 0, i.e., μ1 = μ2. On the other
hand, if E(η) ⊊ E, then a μ ≠ 0 exists such that (5.11) holds.
Proposition 54 $k(x, y) = \phi(x - y)$ is characteristic if and only if the support of
the finite measure η in $\phi(x - y) = \int_E e^{i(x-y)^\top w}\,d\eta(w)$ coincides with E.
For the proof of necessity, see the Appendix at the end of this chapter.
Example 80 For the Gaussian kernel in Example 19 and the Laplace kernel in Example
20, the measure η is a zero-mean Gaussian distribution and a Laplace distribution, respectively, and the support is
the entire real line. Therefore, they are characteristic kernels. On the other hand, the
kernel

$$k(x, y) = \phi(x - y)$$

obtained from the characteristic function

$$\phi(t) = \frac{2(1 - \cos(at))}{a^2 t^2}$$

of the triangular distribution with density

$$f(x) = \begin{cases} (1 - |x|/a)/a, & |x| < a \\ 0, & \text{otherwise} \end{cases}$$

is not a characteristic kernel if the support $[-a, a]$ of this distribution is not equal to E.
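The relation between the triangular density and its characteristic function can be checked numerically. The sketch below assumes a = 1.5 and a plain Riemann sum on a fine grid (both chosen here for illustration); since the density is symmetric, the characteristic function reduces to ∫ cos(tx) f(x) dx.

```python
import numpy as np

a = 1.5  # half-width of the support, chosen only for illustration

def density(x):
    # Triangular density supported on [-a, a]
    return np.where(np.abs(x) < a, (1 - np.abs(x) / a) / a, 0.0)

def phi(t):
    # Closed-form characteristic function of the triangular density
    return 2 * (1 - np.cos(a * t)) / (a ** 2 * t ** 2)

x = np.linspace(-a, a, 200001)
dx = x[1] - x[0]
t_values = np.array([0.5, 1.0, 3.0, 7.0])

# Riemann-sum approximation of the integral of cos(t x) f(x) over [-a, a]
numeric = np.array([np.sum(np.cos(t * x) * density(x)) * dx for t in t_values])
print(numeric)
print(phi(t_values))
```

The density vanishes outside [−a, a], which is exactly why the support of η differs from E = ℝ in this example.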

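Proposition 55 If the reproducing kernels $k_1$, $k_2$ of RKHSs $H_1$, $H_2$ are each expressed
as a function of the difference of the two variables and are both characteristic, then the reproducing kernel $k_1 k_2$ of
the RKHS $H_1 \otimes H_2$ is also characteristic.
Proof: If both $k_1(x_1, y_1) = \phi_1(x_1 - y_1)$ and $k_2(x_2, y_2) = \phi_2(x_2 - y_2)$ are
characteristic, then the supports of $\eta_1$, $\eta_2$ are E. Since $k_1(x_1, y_1)k_2(x_2, y_2)$ can be expressed
by

$$\begin{aligned}
\phi_1(x_1 - y_1)\phi_2(x_2 - y_2) &= \int_E\int_E e^{i(x_1 - y_1)^\top w_1} e^{i(x_2 - y_2)^\top w_2}\,d\eta_1(w_1)\,d\eta_2(w_2) \\
&= \int_E\int_E e^{i(x_1 - y_1,\, x_2 - y_2)^\top (w_1, w_2)}\,d\eta_1(w_1)\,d\eta_2(w_2) ,
\end{aligned}$$

where the support of the product measure $\eta_1 \times \eta_2$ is E × E, the kernel $k_1 k_2$ is characteristic as well.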
Let E be a compact set. Suppose that the RKHS H induced by the kernel k : E ×
E → R is a dense (under the uniform norm) subset of the set C(E) of continuous
functions E → R. Then, we say that the kernel k is universal.
To show that the kernel k is universal, we only need to check whether the corresponding
RKHS satisfies the two Stone-Weierstrass conditions (Proposition 12). Proposition 56
gives a sufficient condition for the kernel k to be universal (see Chap. 2 for
the definition of an algebra). However, for practical purposes, Corollary 3 deduced
from Proposition 56 is often used.
Proposition 56 (Steinwart [29]) Let E be compact, and let k : E × E → R be a
continuous function with k(x, x) > 0 for x ∈ E. If there exists an injective feature
map

$$\Phi : E \ni x \mapsto \Phi(x) = (\Phi_1(x), \Phi_2(x), \ldots) \in \ell^2 := \Big\{(\alpha_1, \alpha_2, \ldots) \in \mathbb{R}^\infty \,\Big|\, \sum_{j=1}^{\infty}\alpha_j^2 < \infty\Big\}$$

and $A := \mathrm{span}\{\Phi_1, \Phi_2, \ldots\}$ is an algebra, then k is a universal kernel.

Proof: Since k(x, x) > 0, x ∈ E, the first condition of Proposition 12 is satisfied.
Since k(·, ·) is continuous, $\Phi(x) = k(x, \cdot) \in \ell^2$ is also continuous at each x ∈ E.
Moreover, since Φ is injective, the second condition of Proposition 12 is satisfied,
and A is dense in C(E). Furthermore, any $\sum_i \alpha_i\Phi_i(\cdot) \in A$ is also an element of the
RKHS H with k as the reproducing kernel. Let $\{e_i\}$ be an orthonormal basis of
H and $f := \sum_i \alpha_i e_i \in H$. Then, $\langle f(\cdot), \Phi(x)\rangle = f(x)$ holds, which further implies
that $\sum_i \alpha_i\Phi_i(x) = \langle\sum_i \alpha_i e_i, \sum_i \Phi_i(x)e_i\rangle = f(x)$.

Corollary 3 The infinite-dimensional polynomial kernel (Example 11) is a universal
kernel in each compact set of E.

Proof: The feature map is injective. Moreover, $A := \mathrm{span}\{\Phi_{m_1,\ldots,m_d} \mid m_1, \ldots, m_d \ge 0\}$
is the set of d-variable polynomials (an algebra), and from Proposition 56, the kernel k
is universal.

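Example 81 (Gaussian Kernel) The exponential type kernel $k_\infty$ (Example 6) is a universal
kernel by Corollary 3. The feature map of the Gaussian kernel (Example 7) is the
$\Phi(x)$ of $k_\infty$ divided by $\gamma(x) := k(x, x)^{1/2} > 0$. For f ∈ C(E), since γf ∈ C(E),
if we choose the coefficients $\alpha_i$ such that $\|\gamma f(\cdot) - \sum_i \alpha_i\Phi_i(\cdot)\|_\infty \le \epsilon\|\gamma\|_\infty$, then we have

$$\Big\|f(\cdot) - \sum_i \alpha_i\gamma^{-1}\Phi_i(\cdot)\Big\|_\infty \le \|\gamma^{-1}\|_\infty\Big\|\gamma f(\cdot) - \sum_i \alpha_i\Phi_i(\cdot)\Big\|_\infty \le \epsilon\|\gamma^{-1}\|_\infty\|\gamma\|_\infty .$$

Since ε > 0 is arbitrary and $\|\gamma\|_\infty$, $\|\gamma^{-1}\|_\infty$ are finite on the compact set E, the Gaussian kernel is universal as well.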

The necessary and sufficient condition in Proposition 54 assumes that the kernel
is a function of the difference between two variables. The following is only a sufficient
condition, but it applies to kernels in general.
Proposition 57 A universal kernel on a compact set is characteristic.
Proof: See the Appendix at the end of the chapter.

Example 82 The Gaussian kernel is characteristic. If the kernel based
on a triangular distribution (Example 80) has a support up to a distance a > 0 from
the origin and E is a compact set that includes some points outside the support, then
the kernel is not characteristic and hence, by the contrapositive of Proposition 57, not universal.
5.5 Introduction to Empirical Processes

In this section, we study a mathematical approach to machine learning called the
empirical process. We analyze the accuracy of the MMD estimators by using the
Rademacher complexity and concentration inequalities. Through this example, we
learn the concept of empirical processes. The derivation performed in this section is
based on Gretton et al.'s [11] proof of a proposition regarding the accuracy of the
two-sample problem.
In this section, we prove the following proposition. We define the MMD by

$$\sup_{f \in \mathcal{F}} \{E_P[f(X)] - E_Q[f(X)]\} ,$$

where $\mathcal{F}$ is a class of functions. This chapter also deals with the case in which
$\mathcal{F} := \{f \in H \mid \|f\|_H \le 1\}$.
Proposition 58 Suppose that a $k_{max}$ exists such that $0 \le k(x, y) \le k_{max}$ for each
x, y ∈ E. Then, for any ε > 0, we have

$$P\left(|MMD_B^2 - MMD^2| > 4\sqrt{\frac{k_{max}}{N}} + \epsilon\right) \le 2\exp\left(-\frac{\epsilon^2 N}{4k_{max}}\right) ,$$

where the estimator $MMD_B^2$ of $MMD^2$ is given by (5.3), and we assume that the
number of samples for x, y is equal to N and that P = Q.
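To get a feel for the bound in Proposition 58, the following sketch evaluates both sides for given N, k_max, and ε, and solves for the ε that makes the right-hand side equal to a prescribed level δ. The function names `mmd_deviation_bound` and `eps_for_level` are hypothetical and introduced only for this illustration.

```python
import math

def mmd_deviation_bound(N, k_max, eps):
    # Threshold 4 * sqrt(k_max / N) + eps and the bound 2 * exp(-eps^2 N / (4 k_max))
    threshold = 4 * math.sqrt(k_max / N) + eps
    prob = 2 * math.exp(-eps ** 2 * N / (4 * k_max))
    return threshold, prob

def eps_for_level(N, k_max, delta):
    # Solve 2 * exp(-eps^2 N / (4 k_max)) = delta for eps
    return math.sqrt(4 * k_max * math.log(2 / delta) / N)

N, k_max, delta = 1000, 1.0, 0.05
eps = eps_for_level(N, k_max, delta)
threshold, prob = mmd_deviation_bound(N, k_max, eps)
print(eps, threshold, prob)
```

Both terms of the threshold decrease at the rate N^{-1/2}.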
For the proof of Proposition 58, we use an inequality that slightly generalizes
Proposition 46.
Proposition 59 (McDiarmid) Let $f : E^m \to \mathbb{R}$ be such that constants $c_i < \infty$ ($i = 1, \cdots, m$)
exist satisfying

$$\sup_{x, x_1, \ldots, x_m} |f(x_1, \ldots, x_m) - f(x_1, \ldots, x_{i-1}, x, x_{i+1}, \ldots, x_m)| \le c_i .$$

Then, for any probability measure P, ε > 0, and independent random variables $X_1, \ldots, X_m$, we have

$$P\left(f(X_1, \ldots, X_m) - E_{X_1 \cdots X_m} f(X_1, \ldots, X_m) > \epsilon\right) < \exp\left(-\frac{2\epsilon^2}{\sum_{i=1}^{m} c_i^2}\right) \tag{5.12}$$

and

$$P\left(|f(X_1, \ldots, X_m) - E_{X_1 \cdots X_m} f(X_1, \ldots, X_m)| > \epsilon\right) < 2\exp\left(-\frac{2\epsilon^2}{\sum_{i=1}^{m} c_i^2}\right) . \tag{5.13}$$
Proof: Hereafter, we write N for m and denote $f(X_1, \cdots, X_N)$ and $E[f(X_1, \cdots, X_N)]$ by $f$ and $E[f]$,
respectively. If we define

$$\begin{aligned}
V_1 &:= E_{X_2 \cdots X_N}[f | X_1] - E_{X_1 \cdots X_N}[f] \\
&\ \ \vdots \\
V_i &:= E_{X_{i+1} \cdots X_N}[f | X_1, \cdots, X_i] - E_{X_i \cdots X_N}[f | X_1, \cdots, X_{i-1}] \\
&\ \ \vdots \\
V_N &:= f - E_{X_N}[f | X_1, \cdots, X_{N-1}]
\end{aligned}$$

for $i = 1$, $i = 2, \cdots, N - 1$, and $i = N$, then we have

$$f - E_{X_1 \cdots X_N}[f] = \sum_{i=1}^{N} V_i . \tag{5.14}$$

From

$$E_{X_i}\{E_{X_{i+1} \cdots X_N}[f | X_1, \cdots, X_i] \mid X_1, \cdots, X_{i-1}\} = E_{X_i \cdots X_N}[f | X_1, \cdots, X_{i-1}] ,$$

we have

$$E_{X_i}[V_i | X_1, \cdots, X_{i-1}] = 0 . \tag{5.15}$$

From (5.14), we have

$$f - E[f] > \epsilon \iff \exp\Big\{t\sum_{i=1}^{N} V_i\Big\} > e^{t\epsilon} \quad \text{for arbitrary } t > 0 .$$

If we apply Markov's inequality (Lemma 6) to the latter inequality, then we have

$$P(f - E[f] \ge \epsilon) \le \inf_{t > 0} e^{-t\epsilon} E\Big[\exp\Big\{t\sum_{i=1}^{N} V_i\Big\}\Big] . \tag{5.16}$$

Moreover, from (5.15), we apply Lemma 7 repeatedly to obtain

$$\begin{aligned}
E\Big[\exp\Big\{t\sum_{i=1}^{N} V_i\Big\}\Big] &= E_{X_1 \cdots X_{N-1}}\Big[\exp\Big\{t\sum_{i=1}^{N-1} V_i\Big\} E_{X_N}[\exp\{tV_N\} | X_1, \cdots, X_{N-1}]\Big] \\
&\le E_{X_1 \cdots X_{N-1}}\Big[\exp\Big\{t\sum_{i=1}^{N-1} V_i\Big\}\Big]\exp\{t^2 c_N^2/8\} \\
&\le \cdots \le \exp\Big\{\frac{t^2}{8}\sum_{i=1}^{N} c_i^2\Big\} .
\end{aligned}$$

Therefore, from (5.16), we have

$$P(f - E[f] \ge \epsilon) \le \inf_{t > 0} \exp\Big\{-t\epsilon + \frac{t^2}{8}\sum_{i=1}^{N} c_i^2\Big\} .$$

The right-hand side is minimized when $t = 4\epsilon/\sum_{i=1}^{N} c_i^2$, and we obtain (5.12).
Replacing $f$ with $-f$, we obtain the other inequality. From both inequalities, we
have (5.13).
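Inequality (5.12) can be illustrated by simulation. The sketch below assumes that f is the sample mean of m independent variables uniformly distributed on [0, 1], so that changing one argument moves f by at most c_i = 1/m; the values of m, ε, the number of trials, and the seed are chosen here only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)    # seed fixed only for reproducibility
m, eps, trials = 50, 0.1, 20000

# f(X_1, ..., X_m) is the sample mean of uniform [0, 1] variables, E[f] = 0.5
samples = rng.uniform(0.0, 1.0, size=(trials, m))
means = samples.mean(axis=1)

c = np.full(m, 1.0 / m)           # bounded differences c_i = 1/m
bound = np.exp(-2 * eps ** 2 / np.sum(c ** 2))   # right-hand side of (5.12)
empirical = np.mean(means - 0.5 > eps)           # observed tail frequency

print(empirical, bound)
```

The observed frequency stays well below the bound, which holds for every distribution on [0, 1] and is therefore conservative for any particular one.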
In the following, we denote by

$$\mathcal{F} := \{f \in H \mid \|f\|_H \le 1\}$$

the unit ball in the universal (see Sect. 5.4 for the definition of universality) RKHS
H w.r.t. a compact E and assume that the kernel of H is less than or equal to $k_{max}$.
Hereafter, let $X_1, \ldots, X_N$ be independent random variables that follow probability
P, and let $\sigma_1, \ldots, \sigma_N$ be independent random variables, each of which takes a value
of ±1 equiprobably. Then, we say that the quantity

$$R_N(\mathcal{F}) := E_\sigma \sup_{f \in \mathcal{F}}\Big|\frac{1}{N}\sum_{i=1}^{N}\sigma_i f(x_i)\Big| \tag{5.17}$$

is an empirical Rademacher complexity, where $E_\sigma$ is the operation that takes the
expectation w.r.t. $\sigma_1, \ldots, \sigma_N$. If we further take the expectation of (5.17) w.r.t. the
probability P, then we call the obtained value $R(\mathcal{F}, P)$ the Rademacher complexity.

Proposition 60 (Bartlett-Mendelson [4]) Let $k_{max} := \max_{x,y \in E} k(x, y)$. Then, we
have the following inequality:

$$R_N(\mathcal{F}) \le \sqrt{\frac{k_{max}}{N}} .$$

In particular, for an arbitrary probability P, we have

$$R(\mathcal{F}, P) \le \sqrt{\frac{k_{max}}{N}} .$$

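Proof: From $\|f\|_H \le 1$ and $k(x, x) \le k_{max}$, we have

$$\begin{aligned}
R_N(\mathcal{F}) &= E_\sigma\Big[\sup_{f \in \mathcal{F}}\Big|\frac{1}{N}\sum_{i=1}^{N}\sigma_i f(x_i)\Big|\Big] = E_\sigma\Big[\sup_{f \in \mathcal{F}}\Big|\frac{1}{N}\sum_{i=1}^{N}\sigma_i\langle k(x_i, \cdot), f(\cdot)\rangle_H\Big|\Big] \\
&= E_\sigma\Big[\sup_{f \in \mathcal{F}}\Big|\Big\langle f, \frac{1}{N}\sum_{i=1}^{N}\sigma_i k(x_i, \cdot)\Big\rangle_H\Big|\Big] \\
&\le E_\sigma\Big[\sup_{f \in \mathcal{F}}\|f\|_H\sqrt{\Big\langle\frac{1}{N}\sum_{i=1}^{N}\sigma_i k(x_i, \cdot), \frac{1}{N}\sum_{i=1}^{N}\sigma_i k(x_i, \cdot)\Big\rangle_H}\Big] \\
&\le E_\sigma\Big[\sqrt{\frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N}\sigma_i\sigma_j k(x_i, x_j)}\Big] \le \sqrt{E_\sigma\Big[\frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N}\sigma_i\sigma_j k(x_i, x_j)\Big]} \\
&= \sqrt{\frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N}\delta_{i,j} k(x_i, x_j)} \le \sqrt{\frac{k_{max}}{N}} ,
\end{aligned}$$

where we use

$$E[\sigma_i\sigma_j] = \sigma_i^2\delta_{i,j} = \delta_{i,j}$$

in the derivation. We obtain the other inequality by taking the expectation w.r.t. the
probability P.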
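The bound of Proposition 60 can be examined by Monte Carlo. As the derivation above shows, the supremum over the unit ball equals the norm of (1/N) Σ σ_i k(x_i, ·), i.e., (1/N)√(σ^⊤Kσ) for the Gram matrix K. The sketch below assumes the Gaussian kernel exp(−(x−y)²), for which k(x, x) = 1 and hence k_max = 1; the sample size, the number of draws, and the seed are chosen here only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)    # seed fixed only for reproducibility
N, draws = 50, 2000
x = rng.standard_normal(N)

# Gram matrix of the Gaussian kernel; k(x, x) = 1, so k_max = 1
K = np.exp(-(x[:, None] - x[None, :]) ** 2)

# Rademacher variables: each entry is +1 or -1 with probability 1/2
sigma = rng.choice([-1.0, 1.0], size=(draws, N))

# sigma^T K sigma for each draw (clipped at 0 to guard against rounding)
quad = np.maximum(np.einsum("di,ij,dj->d", sigma, K, sigma), 0.0)

R_hat = np.mean(np.sqrt(quad)) / N   # Monte Carlo estimate of R_N(F)
bound = np.sqrt(1.0 / N)             # sqrt(k_max / N)
print(R_hat, bound)
```

The estimate lies below the bound; the gap reflects the step in the proof where the square root is moved outside the expectation (Jensen's inequality).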
Propositions 59 and 60 are inequalities used for mathematical analysis in machine
learning as well as for the proof of Proposition 58.
Proof of Proposition 58: If we define

$$f(x_1, \ldots, x_N, y_1, \ldots, y_N) := \Big\|\frac{1}{N}k(x_1, \cdot) + \cdots + \frac{1}{N}k(x_N, \cdot) - \frac{1}{N}k(y_1, \cdot) - \cdots - \frac{1}{N}k(y_N, \cdot)\Big\|_H ,$$

then from the triangular inequality, we obtain

$$|f(x_1, \ldots, x_N, y_1, \ldots, y_N) - f(x_1, \ldots, x_{j-1}, x, x_{j+1}, \ldots, x_N, y_1, \ldots, y_N)| \le \frac{1}{N}\|k(x_j, \cdot) - k(x, \cdot)\|_H \le \frac{2}{N}\sqrt{k_{max}} . \tag{5.18}$$

Next, we obtain an upper bound of the expectation of

$$\begin{aligned}
|MMD^2 - MMD_B^2| &= \Big|\sup_{f \in \mathcal{F}}\{E_P(f) - E_Q(f)\} - \sup_{f \in \mathcal{F}}\Big\{\frac{1}{N}\sum_{i=1}^{N} f(x_i) - \frac{1}{N}\sum_{j=1}^{N} f(y_j)\Big\}\Big| \\
&\le \sup_{f \in \mathcal{F}}\Big|E_P(f) - E_Q(f) - \Big\{\frac{1}{N}\sum_{i=1}^{N} f(x_i) - \frac{1}{N}\sum_{j=1}^{N} f(y_j)\Big\}\Big| .
\end{aligned}$$

Then, we perform the following derivation, where $X'$, $Y'$ denote independent copies of X, Y and $\sigma$, $\sigma'$ are independent sequences of ±1-valued variables:

$$\begin{aligned}
&E_{X,Y}\sup_{f \in \mathcal{F}}\Big|E_P(f) - E_Q(f) - \Big\{\frac{1}{N}\sum_{i=1}^{N} f(X_i) - \frac{1}{N}\sum_{i=1}^{N} f(Y_i)\Big\}\Big| \\
&= E_{X,Y}\sup_{f \in \mathcal{F}}\Big|E_{X'}\Big\{\frac{1}{N}\sum_{i=1}^{N} f(X'_i) - \frac{1}{N}\sum_{i=1}^{N} f(X_i)\Big\} - E_{Y'}\Big\{\frac{1}{N}\sum_{j=1}^{N} f(Y'_j) - \frac{1}{N}\sum_{j=1}^{N} f(Y_j)\Big\}\Big| \\
&\le E_{X,Y,X',Y'}\sup_{f \in \mathcal{F}}\Big|\frac{1}{N}\sum_{i=1}^{N} f(X'_i) - \frac{1}{N}\sum_{i=1}^{N} f(X_i) - \frac{1}{N}\sum_{i=1}^{N} f(Y'_i) + \frac{1}{N}\sum_{i=1}^{N} f(Y_i)\Big| \\
&= E_{X,Y,X',Y',\sigma,\sigma'}\sup_{f \in \mathcal{F}}\Big|\frac{1}{N}\sum_{i=1}^{N}\sigma_i\{f(X'_i) - f(X_i)\} + \frac{1}{N}\sum_{i=1}^{N}\sigma'_i\{f(Y'_i) - f(Y_i)\}\Big| \\
&\le E_{X,X',\sigma}\sup_{f \in \mathcal{F}}\Big|\frac{1}{N}\sum_{i=1}^{N}\sigma_i\{f(X'_i) - f(X_i)\}\Big| + E_{Y,Y',\sigma'}\sup_{f \in \mathcal{F}}\Big|\frac{1}{N}\sum_{j=1}^{N}\sigma'_j\{f(Y'_j) - f(Y_j)\}\Big| \\
&\le 2[R(\mathcal{F}, P) + R(\mathcal{F}, Q)] \le 2[(k_{max}/N)^{1/2} + (k_{max}/N)^{1/2}] = 4\sqrt{\frac{k_{max}}{N}} ,
\end{aligned} \tag{5.19}$$

where the first inequality is due to Jensen's inequality, the second stems from the
triangular inequality, the third is derived from the definition of Rademacher complexity,
and the fourth is obtained from the inequality of Rademacher complexity (Proposition 60).
From (5.18) and (5.19), for $c_i = \frac{2}{N}\sqrt{k_{max}}$ and $f = MMD^2 - MMD_B^2$,
we have

$$E_{X_1 \ldots X_N} f \le 4\sqrt{\frac{k_{max}}{N}} .$$

Finally, we obtain Proposition 58 from (5.18) and Proposition 59.

Appendix

The essential part of the proof of Proposition 54 was given by Fukumizu [7] but has
been rewritten as a concise derivation to make it easier for beginners to understand.

Proof of Proposition 48
The fact that $E \ni x \mapsto k(x, \cdot) \in H$ is measurable means that $E[k(X, \cdot)]$ can be
treated as a random variable. However, the events in E × E are the direct products
of the events generated by each E (the elements of $\mathcal{F} \times \mathcal{F}$). Therefore, if the
function $E \times E \ni (x, y) \mapsto k(x, y) \in \mathbb{R}$ is measurable, then the function $E \ni y \mapsto
k(x, y) \in \mathbb{R}$ is measurable for each x ∈ E (even if x ∈ E is fixed, $y \mapsto k(x, y)$
is still measurable). In the following, we show that any function belonging to H is
measurable. First, we note that $H_0 = \mathrm{span}\{k(x, \cdot) \mid x \in E\}$ is dense in H. Additionally,
we note that for a sequence $\{f_n\}$ in $H_0$, $\|f - f_n\|_H \to 0$ ($n \to \infty$) means that
$|f(x) - f_n(x)| \to 0$ for each x ∈ E (Proposition 35). The following lemma implies
that f is measurable.

Lemma 8 If $f_n : E \to \mathbb{R}$ is measurable and $f_n(x)$ converges to $f(x)$ for each x ∈ E,
then $f : E \to \mathbb{R}$ is also measurable.

Proof: The proof follows after the proof of this proposition.

We assume that Lemma 8 is valid. We define the measurability of the map $E \ni x \mapsto
k(x, \cdot) \in H$ by

$$\{x \in E \mid \|f - k(x, \cdot)\|_H < \delta\} \in \mathcal{F}$$

for any f ∈ H and δ > 0 (this is an extension of the case where $H = \mathbb{R}$). Moreover,
we have

$$\|f - k(x, \cdot)\|_H < \delta \iff k(x, x) - 2f(x) < \delta^2 - \|f\|^2_H .$$

In addition, since k(·, ·) is measurable, $E \ni x \mapsto k(x, x) \in \mathbb{R}$ is also measurable.
Moreover, since f(x) is measurable, so is k(x, x) − 2f(x). Thus, the map $x \mapsto k(x, \cdot)$ is measurable.

Proof of Lemma 8
It is sufficient to show that $f^{-1}(B) \in \mathcal{F}$ for any open set B. We fix an open set $B \subseteq \mathbb{R}$ arbitrarily
and let $F_m := \{y \in B \mid U(y, 1/m) \subseteq B\}$, where $U(y, r) := \{x \in \mathbb{R} \mid d(x, y) < r\}$.
From the definition, we have the following two equivalences.

$$f(x) \in B \iff \text{for some } m,\ f(x) \in F_m$$

$$f(x) \in F_m \iff \text{for some } k,\ f_n(x) \in F_m,\ n \ge k .$$

In other words, we have

$$f^{-1}(B) = \cup_m f^{-1}(F_m) = \cup_m \cup_k \cap_{n \ge k} f_n^{-1}(F_m) \in \mathcal{F} .$$

Proof of Proposition 49
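The evaluation is finite for arbitrary $g = \sum_{i=1}^{\infty}\sum_{j=1}^{\infty} a_{i,j} e_{X,i} e_{Y,j} \in H_X \otimes H_Y$ and
(x, y) ∈ E. In fact, we have

$$|g(x, y)| \le \sum_{i=1}^{\infty}\sum_{j=1}^{\infty}|a_{i,j}| \cdot |e_{X,i}(x)| \cdot |e_{Y,j}(y)| \le \sum_{i=1}^{\infty}|e_{X,i}(x)| \cdot \Big(\sum_{j=1}^{\infty} a_{i,j}^2\Big)^{1/2}\Big(\sum_{j=1}^{\infty} e_{Y,j}^2(y)\Big)^{1/2} , \tag{5.20}$$

where we apply Cauchy-Schwarz's inequality (2.5) to $\sum_{j=1}^{\infty}$. If we set $k_Y(y, \cdot) =
\sum_j h_j(y) e_{Y,j}(\cdot)$, then from $\langle e_{Y,i}(\cdot), k_Y(y, \cdot)\rangle = e_{Y,i}(y)$, we have $h_i(y) = e_{Y,i}(y)$ and
$k_Y(y, \cdot) = \sum_{j=1}^{\infty} e_{Y,j}(y) e_{Y,j}(\cdot)$. Thus, we obtain

$$\sum_{j=1}^{\infty} e_{Y,j}^2(y) = k_Y(y, y) \tag{5.21}$$

and

$$\sum_{i=1}^{\infty}|e_{X,i}(x)| \cdot \Big(\sum_{j=1}^{\infty} a_{i,j}^2\Big)^{1/2} \le \Big(\sum_{i=1}^{\infty} e_{X,i}^2(x)\Big)^{1/2}\Big(\sum_{i=1}^{\infty}\sum_{j=1}^{\infty} a_{i,j}^2\Big)^{1/2} = \sqrt{k_X(x, x)}\,\|g\| , \tag{5.22}$$

where we apply Cauchy-Schwarz's inequality (2.5) to $\sum_{i=1}^{\infty}$. Note that (5.20), (5.21),
and (5.22) imply that $|g(x, y)| \le \sqrt{k_X(x, x)}\sqrt{k_Y(y, y)}\,\|g\|$. Thus, $H_X \otimes H_Y$ is an
RKHS.
From $k_X(x, \cdot) \in H_X$, $k_Y(y, \cdot) \in H_Y$, we have that $k(x, \cdot, y, \star) := k_X(x, \cdot)
k_Y(y, \star) \in H_X \otimes H_Y$ for $k(x, x', y, y') := k_X(x, x')k_Y(y, y')$. From

$$\begin{aligned}
g(x, y) &= \sum_{i=1}^{\infty}\sum_{j=1}^{\infty} a_{i,j} e_{X,i}(x) e_{Y,j}(y) = \sum_{i=1}^{\infty}\sum_{j=1}^{\infty} a_{i,j}\langle e_{X,i}(\cdot), k_X(x, \cdot)\rangle_{H_X}\langle e_{Y,j}(\star), k_Y(y, \star)\rangle_{H_Y} \\
&= \sum_{i=1}^{\infty}\sum_{j=1}^{\infty} a_{i,j}\langle e_{X,i}(\cdot) e_{Y,j}(\star), k(x, \cdot, y, \star)\rangle_H = \Big\langle\sum_{i=1}^{\infty}\sum_{j=1}^{\infty} a_{i,j} e_{X,i}(\cdot) e_{Y,j}(\star), k(x, \cdot, y, \star)\Big\rangle_H \\
&= \langle g(\cdot, \star), k(x, \cdot, y, \star)\rangle ,
\end{aligned}$$

k is the reproducing kernel of $H_X \otimes H_Y$.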

Proof of Proposition 54 (Necessity)

Let W be an open ball centered at the origin with a radius of ε > 0, and let $w_0 \in E$.
We assume that $w_0 + W$ has η-measure 0 and show that this contradicts the other
assumption, i.e., that $k(x, y) = \phi(x - y)$ is a characteristic kernel. In this case, η
is an even function, and $-w_0 + W$ is also of measure 0 ($\pm w_0 + W \subseteq E \backslash E(\eta)$).
Here, we use the fact that $g(w) := (\epsilon - \|w\|_2)_+^{(d+1)/2}$ is nonnegative definite when
$E = \mathbb{R}^d$ ($d \ge 1$) (see [8] for the proof). From Proposition 5 (Bochner's Theorem),
there exists a finite measure μ such that $g(w) = \int_E e^{iw^\top x}\,d\mu(x)$. Moreover, the closure
of $\pm w_0 + W$ is the support of

$$h(w) = g(w - w_0) + g(w + w_0) = \int_E e^{iw^\top x}\,2\cos(w_0^\top x)\,d\mu(x) .$$

Since the support of h has no intersection with E(η) and $\pm w_0 \notin W$, we have $h(0) =
0$. Therefore, we obtain ν(E) = 0 for

$$\nu(B) := \int_B 2\cos(w_0^\top x)\,d\mu(x) ,\quad B \in \mathcal{F} .$$

Since g is not zero, ν is not the zero measure. Thus, using the total variation

$$|\nu|(B) := \sup_{\cup B_i = B}\sum_{i=1}^{n}|\nu(B_i)| ,\quad B \in \mathcal{F} ,$$

where sup is the supremum over the divisions of B into $B_i \in \mathcal{F}$, we define the constant
$c := |\nu|(E)$ and the finite measures $\mu_1 := \frac{1}{c}|\nu|$ and $\mu_2 := \frac{1}{c}\{|\nu| - \nu\}$. From ν(E) =
0, we observe that $\mu_1$ and $\mu_2$ are both probabilities and that $\mu_1 \ne \mu_2$. Additionally,
we have

$$c(d\mu_1 - d\mu_2) = d\nu = 2\cos(w_0^\top x)\,d\mu .$$

From Fubini's theorem, we can write the difference between the expectations w.r.t.
probabilities $\mu_1$, $\mu_2$ as

$$\begin{aligned}
\int_E \phi(x - y)\,d\mu_1(y) - \int_E \phi(x - y)\,d\mu_2(y) &= \frac{1}{c}\int_E \phi(x - y)\,2\cos(w_0^\top y)\,d\mu(y) \\
&= \frac{1}{c}\int_E 2\cos(w_0^\top y)\int_E e^{i(x-y)^\top w}\,d\eta(w)\,d\mu(y) = \frac{1}{c}\int_E e^{ix^\top w} h(w)\,d\eta(w) .
\end{aligned}$$

However, since the supports of h and η do not intersect, the value is zero, which
contradicts the assumption that φ(x − y) is a characteristic kernel.

[Figure: graphs of $f_1$ and $f_n$ over the sets F and U]

Fig. 5.5 Proof of Proposition 57. As n grows, the slope of $f_n$ rapidly increases at the border of
F, U. Therefore, if $\int_E f_n\,dP = \int_E f_n\,dQ$ for all $\{f_n\}$, we require P = Q (Dudley "Real Analysis
and Probability" [6])

Proof of Proposition 57

If $\int_E f\,dP = \int_E f\,dQ$ holds for any bounded continuous f, this implies that P = Q
(Fig. 5.5). In fact, let U be an open subset of E, and let V be its complement.
Furthermore, let $d(x, V) := \inf_{y \in V} d(x, y)$ and $f_n(x) := \min(1, n\,d(x, V))$. Then,
$f_n$ is a bounded continuous function on E, and $f_n(x) \le I(x \in U)$ and $f_n(x) \to
I(x \in U)$ as n → ∞ for each x ∈ E. Thus, by the monotone convergence theorem,
$\int_E f_n\,dP \to P(U)$ and $\int_E f_n\,dQ \to Q(U)$ hold. By our assumption, $\int_E f_n\,dP =
\int_E f_n\,dQ$ and P(U) = Q(U), i.e., P(V) = Q(V) holds.² In other words, P and Q coincide
on every closed set, which determines them on every event. Let E be a compact set. For each element g ∈ H in
the RKHS H of the universal kernel, the same argument follows since $\sup_{x \in E}|f(x) -
g(x)|$ can be made arbitrarily small for any f ∈ C(E). That is, if $\int g\,dP = \int g\,dQ$ holds
for any g ∈ H, then P = Q, so the universal kernel is characteristic.

² If E is compact, then for any $A \in \mathcal{F}$, $P(A) = \sup\{P(V) \mid V \text{ is a closed set},\ V \subseteq A,\ V \in \mathcal{F}\}$ (Theorem
7.1.3, Dudley [6]).
Exercises 65∼83

65. Proposition 49 can be derived according to the following steps. Which part of
the proof in the Appendix does each step correspond to?
(a) Show that $|g(x, y)| \le \sqrt{k_X(x, x)}\sqrt{k_Y(y, y)}\,\|g\|$ for $g \in H_X \otimes H_Y$ and
$x \in E_X$, $y \in E_Y$ (from Proposition 33, this implies that H is some RKHS).
(b) Show that $k(x, \cdot, y, \star) := k_X(x, \cdot)k_Y(y, \star) \in H$ when $x \in E_X$, $y \in E_Y$ are
fixed.
(c) Show that $f(x, y) = \langle f(\cdot, \star), k(x, \cdot, y, \star)\rangle_H$.
66. How can we define the average $m_{XY} = E_{XY}[k_X(X, \cdot)k_Y(Y, \cdot)]$ of the elements of $H_X \otimes H_Y$?
Define the average in the same way that we defined $m_X$ using Riesz's lemma
(Proposition 22).
67. Show that $\Sigma_{YX} \in B(H_X, H_Y)$ exists such that

$$\langle fg, m_{XY} - m_X m_Y\rangle_{H_X \otimes H_Y} = \langle\Sigma_{YX} f, g\rangle_{H_Y}$$

for each $f \in H_X$, $g \in H_Y$.
68. The MMD is generally defined as $\sup_{f \in \mathcal{F}}\{E_P[f(X)] - E_Q[f(X)]\}$ for some set
$\mathcal{F}$ of functions. Assuming that $\mathcal{F} := \{f \in H \mid \|f\|_H \le 1\}$, show that the MMD is
$\|m_P - m_Q\|_H$. Furthermore, show that we can transform the MMD as follows.

$$MMD^2 = E_{XX'}[k(X, X')] + E_{YY'}[k(Y, Y')] - 2E_{XY}[k(X, Y)] ,$$

where X and X′ (Y and Y′) are independent random variables that follow the
same distribution.
69. Show that the squared MMD estimator (5.4) is unbiased.
70. In the two-sample problem solved by a permutation test in Example 71, for the
case when the numbers of samples are m, n (can be different) instead of the same
n and m, n are both even numbers, modify the entire program in Example 71 to
examine whether it works correctly (m = n in Example 71).
71. For the function h in (5.6), show that $h_1$ is a function that always takes a value
of zero and that $\tilde{h}_2$ and $h_2$ coincide as functions.
72. Show that the fact that random variables X, Y that follow Gaussian distributions
are independent is equivalent to the condition that their correlation coefficient
is zero. Additionally, give an example of two variables whose correlation coef-
ficient is zero but that are not independent.
73. Prove the following equation.

$$\begin{aligned}
\|m_{XY} - m_X m_Y\|^2 &= E_{XX'YY'}[k_X(X, X')k_Y(Y, Y')] \\
&\quad - 2E_{XY}\{E_{X'}[k_X(X, X')]E_{Y'}[k_Y(Y, Y')]\} + E_{XX'}[k_X(X, X')]E_{YY'}[k_Y(Y, Y')] .
\end{aligned}$$

74. Show that the HSIC estimator

$$\begin{aligned}
\widehat{HSIC} &:= \frac{1}{N^2}\sum_i\sum_j k_X(x_i, x_j)k_Y(y_i, y_j) - \frac{2}{N^3}\sum_i\sum_j k_X(x_i, x_j)\sum_h k_Y(y_i, y_h) \\
&\quad + \frac{1}{N^4}\sum_i\sum_j k_X(x_i, x_j)\sum_h\sum_r k_Y(y_h, y_r)
\end{aligned}$$

can be written as $\widehat{HSIC} = \frac{1}{N^2}\mathrm{trace}(K_X H K_Y H)$ using $K_X = (k_X(x_i, x_j))_{i,j}$, $K_Y =
(k_Y(y_i, y_j))_{i,j}$, and $H = I - \frac{1}{N}E$, where $I \in \mathbb{R}^{N \times N}$ is the unit matrix and $E \in
\mathbb{R}^{N \times N}$ is the matrix such that all the elements are ones. Additionally, construct
Python programs for each computation. Moreover, verify that both output the
same results for the Gaussian kernels k_x and k_y with $\sigma^2 = 1$; generate random
numbers for the standard Gaussian variables X and Y whose correlations are
a = 0, 0.1, 0.2, 0.4, 0.6, 0.8.
75. When we test the independence X ⊥⊥ {Y, Z } of X and {Y, Z }, the HSIC
is extended as ||m X Y Z − m X m Y Z ||2 . That is, we can transform kY (y, ·) into
kY (y, ·)k Z (z, ·). Construct the function HSIC_2 by adding arguments to the func-
tion HSIC_1; generate a random number according to X ⊥⊥ {Y, Z }, and verify
that the obtained value is sufficiently small.
76. Utilizing the class of LiNGAM and the function

def cc(x, y):
    return np.sum(np.dot(x.T, y)) / len(x)

def f(u, v):
    return u - cc(u, v) / cc(v, v) * v

we wish to estimate whether each variable X, Y, Z is either upstream, midstream,

or downstream. Fill in the blanks by generating random numbers X, Y, Z that
do not follow the Gaussian distribution, and estimate which variables among
X, Y, Z are upstream, midstream, and downstream from the random numbers
alone.
# Data generation
n = 30
x = np.random.randn(n)**2 - np.random.randn(n)**2
y = 2 * x + np.random.randn(n)**2 - np.random.randn(n)**2
z = x + y + np.random.randn(n)**2 - np.random.randn(n)**2
x = x - np.mean(x)
y = y - np.mean(y)
z = z - np.mean(z)

## Estimate UpStream ##
def cc(x, y):
    return np.sum(np.dot(x.T, y) / len(x))

def f(u, v):
    return u - cc(u, v) / cc(v, v) * v

x_y = f(x, y); y_z = f(y, z); z_x = f(z, x)
x_z = f(x, z); z_y = f(z, y); y_x = f(y, x)

v1 = HSIC_2(x, y_x, z_x, k_x, k_y, k_z)
v2 = HSIC_2(y, z_y, x_y, k_y, k_z, k_x)
v3 = HSIC_2(z, x_z, y_z, k_z, k_x, k_y)

if v1 < v2:
    if v1 < v3:
        top = 1
    else:
        top = 3
else:
    if v2 < v3:
        top = 2
    else:
        top = 3

## Estimate MidStream ##
x_yz = f(x_y, z_y)
y_zx = f(y_z, x_z)
z_xy = f(z_x, y_x)

if top == 1:
    v1 = ## Blank (1) ##
    v2 = ## Blank (2) ##
    if v1 < v2:
        middle = 2
        bottom = 3
    else:
        middle = 3
        bottom = 2
if top == 2:
    v1 = ## Blank (3) ##
    v2 = ## Blank (4) ##
    if v1 < v2:
        middle = 3
        bottom = 1
    else:
        middle = 1
        bottom = 3
if top == 3:
    v1 = ## Blank (5) ##
    v2 = ## Blank (6) ##
    if v1 < v2:
        middle = 1
        bottom = 2
    else:
        middle = 2
        bottom = 1

## Output the Results ##
print("top = ", top)
print("middle = ", middle)
print("bottom = ", bottom)

77. We wish to make two sequences independent by shifting one of x1 , . . . , x N or

y1 , . . . , y N , and then we want to repeat the process of calculating H S I C. We
wish to create a histogram that expresses a distribution that follows the null
hypothesis. For this purpose, we constructed the following program. Why can
we obtain the null hypothesis (X, Y are independent) by permutation? Where
in the program do we obtain the HSIC statistics, and where do we obtain the
multiple HSIC values that follow the null hypothesis?

# Data Generation
x = np.random.randn(n)
y = np.random.randn(n)

u = HSIC_1(x, y, k_x, k_y)

m = 100
w = []
for i in range(m):
    x = x[np.random.choice(n, n, replace=False)]
    w.append(HSIC_1(x, y, k_x, k_y))
v = np.quantile(w, 0.95)
x = np.linspace(min(min(w), u, v), max(max(w), u, v), 200)

density = kde.gaussian_kde(w)
plt.plot(x, density(x))
plt.axvline(x=v, c="r", linestyle="--")
plt.axvline(x=u, c="b")

78. In the MMD (Sect. 5.2) and HSIC (Sect. 5.3), we cannot apply Mercer’s theorem
because the kernel of the integral operator is not nonnegative definite. However,
in both cases, the integral operator possesses eigenvalues and eigenfunctions.
Why?
79. Show that $k(x, y) = \phi(x - y)$, $\phi(t) = e^{-|t|}$ is a characteristic kernel.
80. In the proof of Proposition 54 (necessity, Appendix), we used the fact that
$g(w) := (1 - \|w\|_2)_+^{(d+1)/2}$ is nonnegative definite [8]. Verify that this fact is
correct for $d = 1$ by proving the following equality.

$$\frac{1}{2\pi}\int_{-\infty}^{\infty} g(w)e^{-iwx}\,dw = \frac{1 - \cos(x)}{\pi x^2}.$$

81. Why is the exponential type a universal kernel? Why is the characteristic kernel
based on a triangular distribution not a universal kernel?
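82. Explain why the three equations and four inequalities hold in the following
derivation of the upper bound on the Rademacher complexity.

$$\begin{aligned}
R_N(\mathcal{F}) &= E_\sigma\Big[\sup_{f\in\mathcal{F}}\Big|\frac{1}{N}\sum_{i=1}^N \sigma_i f(x_i)\Big|\Big] = E_\sigma\Big[\sup_{f\in\mathcal{F}}\Big|\frac{1}{N}\sum_{i=1}^N \sigma_i \langle k(x_i, \cdot), f(\cdot)\rangle_H\Big|\Big] \\
&= E_\sigma\Big[\sup_{f\in\mathcal{F}}\Big|\Big\langle f, \frac{1}{N}\sum_{i=1}^N \sigma_i k(x_i, \cdot)\Big\rangle_H\Big|\Big] \\
&\le E_\sigma\Big[\sup_{f\in\mathcal{F}}\|f\|_H\sqrt{\Big\langle \frac{1}{N}\sum_{i=1}^N \sigma_i k(x_i, \cdot), \frac{1}{N}\sum_{i=1}^N \sigma_i k(x_i, \cdot)\Big\rangle_H}\Big] \\
&\le E_\sigma\Big[\sqrt{\frac{1}{N^2}\sum_{i=1}^N\sum_{j=1}^N \sigma_i\sigma_j k(x_i, x_j)}\Big] \\
&\le \sqrt{E_\sigma\Big[\frac{1}{N^2}\sum_{i=1}^N\sum_{j=1}^N \sigma_i\sigma_j k(x_i, x_j)\Big]} \le \sqrt{\frac{k_{\max}}{N}}.
\end{aligned}$$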

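83. Explain why the one equality and four inequalities hold for the derivation of the
upper bound of $|MMD^2 - MMD_B^2|$ below.

$$\begin{aligned}
& E_{X,Y}\sup_{f\in\mathcal{F}}\Big|E_{X'}\Big\{\frac{1}{N}\sum_{i=1}^N f(x'_i) - \frac{1}{N}\sum_{i=1}^N f(x_i)\Big\} - E_{Y'}\Big\{\frac{1}{N}\sum_{j=1}^N f(y'_j) - \frac{1}{N}\sum_{j=1}^N f(y_j)\Big\}\Big| \\
&\le E_{X,Y,X',Y'}\sup_{f\in\mathcal{F}}\Big|\frac{1}{N}\sum_{i=1}^N f(x'_i) - \frac{1}{N}\sum_{i=1}^N f(x_i) - \frac{1}{N}\sum_{i=1}^N f(y'_i) + \frac{1}{N}\sum_{i=1}^N f(y_i)\Big| \\
&= E_{X,Y,X',Y',\sigma,\sigma'}\sup_{f\in\mathcal{F}}\Big|\frac{1}{N}\sum_{i=1}^N \sigma_i\{f(x'_i) - f(x_i)\} + \frac{1}{N}\sum_{i=1}^N \sigma'_i\{f(y'_i) - f(y_i)\}\Big| \\
&\le E_{X,X',\sigma}\sup_{f\in\mathcal{F}}\Big|\frac{1}{N}\sum_{i=1}^N \sigma_i\{f(x'_i) - f(x_i)\}\Big| + E_{Y,Y',\sigma'}\sup_{f\in\mathcal{F}}\Big|\frac{1}{N}\sum_{j=1}^N \sigma'_j\{f(y'_j) - f(y_j)\}\Big| \\
&\le 2[R(\mathcal{F}, P) + R(\mathcal{F}, Q)] \\
&\le 2[(k_{\max}/N)^{1/2} + (k_{\max}/N)^{1/2}].
\end{aligned}$$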
Chapter 6
Gaussian Processes and Functional Data
Analyses

A stochastic process may be defined either as a sequence of random variables $\{X_t\}_{t\in T}$,
where $T$ is a set of times, or as a function $X_t(\omega) : T \to \mathbb{R}$ of $\omega \in \Omega$. We define
a Gaussian process as a stochastic process $\{X_t\}$ such that $X_t$ ($t \in T'$) follows a
multivariate Gaussian distribution for any finite subset $T'$ of $T$. In this chapter, we
generalize the one-dimensional $T$ to a multidimensional set $E$ for the consideration
of Gaussian processes. We mainly deal with the variations of $\omega \in \Omega$ in $f(\omega, x)$,
while thus far, we have dealt with the variations of $x \in E$ in $f(\omega, x)$. The Gaussian
process has been applied to various aspects of machine learning. We examine the
relation between Gaussian processes and kernels. The first half of the chapter covers
regression, classification, and the reduction of computation, and the last
part studies the Karhunen-Loève expansion and its surrounding theory. Finally,
we study functional data analyses, which are closely related to stochastic processes.

6.1 Regression

Let $E$ and $(\Omega, \mathcal{F}, \mu)$ be a set and a probability space, respectively. If the map
$\Omega \ni \omega \mapsto f(\omega, x) \in \mathbb{R}$ is measurable for each $x \in E$, i.e., if $f(\omega, x)$ is a random
variable at each $x \in E$, then we say that $f : \Omega \times E \to \mathbb{R}$ is a stochastic process.
Moreover, if the random variables $f(\omega, x_1), \ldots, f(\omega, x_N)$ follow an $N$-variable
Gaussian distribution for any $N \ge 1$ and any finite number of elements $x_1, \ldots, x_N \in E$,
then we call $f$ a Gaussian process. We define the covariance between $x_i, x_j \in E$
by
$$\int_\Omega \{f(\omega, x_i) - m(x_i)\}\{f(\omega, x_j) - m(x_j)\}\,d\mu(\omega),$$

where $m(x) := \int_\Omega f(\omega, x)\,d\mu(\omega)$ is the expectation of $f(\omega, x)$ for $x \in E$. Then, no
matter what $N$ and $x_1, \ldots, x_N$ we choose, their covariance matrices are nonnegative

definite. Thus, we can write the covariance matrix by using a positive definite kernel
$k : E \times E \to \mathbb{R}$. Therefore, the Gaussian process can be uniquely expressed in terms
of a pair $(m, k)$ containing the mean $m(x)$ of each $x \in E$ and the covariance $k(x, x')$
of each $(x, x') \in E \times E$.
In general, a random variable is a map $\Omega \to \mathbb{R}$, and we should make $\omega$ explicit,
i.e., $f(\omega, x)$, but for simplicity, for the time being, we make $\omega$ implicit, i.e., $f(x)$,
even if it is a random variable.
Example 83 Let $m_X \in \mathbb{R}^N$ and $k_{XX} \in \mathbb{R}^{N\times N}$ be the mean and covariance matrix,
respectively, of the Gaussian process $(m, k)$ at $x_1, \ldots, x_N \in E := \mathbb{R}$. In general, for
a mean $\mu$ and a covariance matrix $\Sigma \in \mathbb{R}^{N\times N}$, $\Sigma$ is nonnegative definite, and there
exists a lower triangular matrix $R \in \mathbb{R}^{N\times N}$ with $\Sigma = RR^\top$ (Cholesky decomposition).
Therefore, to generate random numbers that follow $N(m_X, k_{XX})$ from $N$ independent
random numbers $u_1, \ldots, u_N$ that follow the standard Gaussian distribution,
we can calculate $f_X := R_X u + m_X \in \mathbb{R}^N$ for $k_{XX} = R_X R_X^\top$ with $u = [u_1, \ldots, u_N]^\top$.
In fact, the expectation and the covariance matrix of $f_X$ are $m_X$ and

$$E[(f_X - m_X)(f_X - m_X)^\top] = E[R_X uu^\top R_X^\top] = R_X E[uu^\top] R_X^\top = R_X R_X^\top = k_{XX},$$

respectively. This procedure can be described in Python as follows.

# Install the module skfda beforehand via
#   pip install scikit-fda

# In this chapter, we assume that the following has been executed.
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import style
from sklearn.decomposition import PCA
import skfda

# Definition of (m, k)
def m(x):
    return 0

def k(x, y):
    return np.exp(-(x - y)**2 / 2)

# Definition of gp_sample
def gp_sample(x, m, k):
    n = len(x)
    m_x = m(x)
    k_xx = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            k_xx[i, j] = k(x[i], x[j])
    R = np.linalg.cholesky(k_xx)  # lower triangular matrix
    u = np.random.randn(n)
    return R.dot(u) + m_x
# Generate the random numbers and construct the covariance matrix to compare it with k_xx
x = np.arange(-2, 3, 1)
n = len(x)
r = 100
z = np.zeros((r, n))
for i in range(r):
    z[i, :] = gp_sample(x, m, k)
k_xx = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        k_xx[i, j] = k(x[i], x[j])

# Each row of z is one sample, so the columns are the variables
print("cov(z):\n", np.cov(z, rowvar=False), "\n")

print("k_xx:\n", k_xx)

cov(z): an n × n (here 5 × 5) sample covariance matrix; its entries change from run to run and agree with k_xx only up to the sampling error for r = 100.

k_xx:
[[1.00000000e+00 6.06530660e-01 1.35335283e-01 1.11089965e-02
3.35462628e-04]
[6.06530660e-01 1.00000000e+00 6.06530660e-01 1.35335283e-01
1.11089965e-02]
[1.35335283e-01 6.06530660e-01 1.00000000e+00 6.06530660e-01
1.35335283e-01]
[1.11089965e-02 1.35335283e-01 6.06530660e-01 1.00000000e+00
6.06530660e-01]
[3.35462628e-04 1.11089965e-02 1.35335283e-01 6.06530660e-01
1.00000000e+00]]

In general, E does not have to be R. Gaussian processes are a class of stochastic

processes, and we might have the impression that the set E is the entire real number
set or a subset of it, but in fact, there is no further restriction as long as we define the
positive definite kernel k on E × E. Once we choose (m, k), we generate N -variate
Gaussian random variables according to (m, k), regardless of the selected E.

Example 84 For E = R2 , we can similarly obtain random numbers that follow the
N -variate multivariate Gaussian distribution.

# Definition of (m, k)
def m(x):
    return 0

def k(x, y):
    return np.exp(-np.sum((x - y)**2) / 2)

# Definition of the function gp_sample
def gp_sample(x, m, k):
    n = x.shape[0]
    m_x = m(x)
    k_xx = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            k_xx[i, j] = k(x[i], x[j])
    R = np.linalg.cholesky(k_xx)  # lower triangular matrix
    u = np.random.randn(n)
    return R.dot(u) + m_x

# Generate the random numbers and construct the covariance matrix to compare it with k_xx
n = 5
r = 100
x = np.random.randn(n, 2)  # n points in R^2 (an assumption: the listing does not define x)
z = np.zeros((r, n))
for i in range(r):
    z[i, :] = gp_sample(x, m, k)
k_xx = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        k_xx[i, j] = k(x[i], x[j])

# Each row of z is one sample, so the columns are the variables
print("cov(z):\n", np.cov(z, rowvar=False), "\n")

print("k_xx:\n", k_xx)

cov(z): an n × n (here 5 × 5) sample covariance matrix; its entries change from run to run and agree with k_xx only up to the sampling error for r = 100.

Then, as in the usual regression procedure, we assume that $x_1, \ldots, x_N \in E$ and
$y_1, \ldots, y_N \in \mathbb{R}$ are generated according to

$$y_i = f(x_i) + \epsilon_i \tag{6.1}$$

through the use of an unknown function $f : E \to \mathbb{R}$, where $\epsilon_i$ follows a Gaussian
distribution with a mean of 0 and a variance of $\sigma^2$ and is independent for each
$i = 1, \ldots, N$. The likelihood is

$$\prod_{i=1}^N \left[\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left\{-\frac{(y_i - f(x_i))^2}{2\sigma^2}\right\}\right]$$

when the function $f$ is known (fixed). In the following, we assume that the function
$f$ randomly varies, and we regard the Gaussian process $(m, k)$ as its prior
distribution. That is, we consider the model $f_X \sim N(m_X, k_{XX})$ with $y_i \mid f(x_i) \sim N(f(x_i), \sigma^2)$,
where $f_X = (f(x_1), \ldots, f(x_N))$. Then, we calculate the posterior distribution
of $f(z_1), \ldots, f(z_n)$ corresponding to $z_1, \ldots, z_n \in E$, which are different from
$x_1, \ldots, x_N$. The variations in $y_1, \ldots, y_N$ are due to the variations in $f$ and $\epsilon_i$. Thus,
the covariance matrix of $Y = [y_1, \ldots, y_N]$ is

$$k_{XX} + \sigma^2 I = (k(x_i, x_j) + \sigma^2\delta_{i,j})_{i,j=1,\ldots,N} \in \mathbb{R}^{N\times N}.$$

On the other hand, the variation in $f(z_1), \ldots, f(z_n)$ is due only to the variation
in $f$. Therefore, the covariance matrix is $k_{ZZ} = (k(z_i, z_j))_{i,j=1,\ldots,n} \in \mathbb{R}^{n\times n}$. Moreover,
the covariance between $y_i$ and $f(z_j)$ is that between $f(x_i)$ and $f(z_j)$,
and the covariance matrix of $Y = [y_1, \ldots, y_N]$ and $f_Z = [f(z_1), \cdots, f(z_n)]$ is
$k_{XZ} = (k(x_i, z_j))_{i=1,\ldots,N,\ j=1,\ldots,n}$. In summary, the joint distribution of $Y$
and $f_Z$ is
$$\begin{bmatrix} Y \\ f_Z \end{bmatrix} \sim N\left(\begin{bmatrix} m_X \\ m_Z \end{bmatrix}, \begin{bmatrix} k_{XX} + \sigma^2 I & k_{XZ} \\ k_{ZX} & k_{ZZ} \end{bmatrix}\right).$$

In the following, we show that the posterior probability of the function $f(\cdot)$ given the
value of $Y$ is still a Gaussian process. To this end, we use the following proposition.

Proposition 61 Suppose that the joint distribution of random variables $a \in \mathbb{R}^N$, $b \in \mathbb{R}^n$ can be expressed by

$$\begin{bmatrix} a \\ b \end{bmatrix} \sim N\left(\begin{bmatrix} \mu_a \\ \mu_b \end{bmatrix}, \begin{bmatrix} A & C \\ C^\top & B \end{bmatrix}\right),$$

where $\mu_a, \mu_b$ are the expectations, $A \in \mathbb{R}^{N\times N}$, $B \in \mathbb{R}^{n\times n}$ are the covariance matrices
($A$: positive definite; $B$: nonnegative definite), and $C \in \mathbb{R}^{N\times n}$ is the covariance matrix
between them. Then, the conditional probability of $b$ given $a$ is

$$b \mid a \sim N(\mu_b + C^\top A^{-1}(a - \mu_a),\ B - C^\top A^{-1}C). \tag{6.2}$$

Proof: Consult Lauritzen, “Graphical Models” [20] p256.

Hence, from Proposition 61, the posterior distribution of $f_Z \in \mathbb{R}^n$ under $Y \in \mathbb{R}^N$
is $N(\mu', \Sigma')$, where

$$\mu' := m_Z + k_{ZX}(k_{XX} + \sigma^2 I)^{-1}(Y - m_X) \in \mathbb{R}^n$$

and
$$\Sigma' := k_{ZZ} - k_{ZX}(k_{XX} + \sigma^2 I)^{-1}k_{XZ} \in \mathbb{R}^{n\times n}.$$

If we set $n = 1$ and $z_1 = x$, then the mean and variance of the distribution of $f(x)$ become

$$m'(x) := m(x) + k_{xX}(k_{XX} + \sigma^2 I)^{-1}(Y - m_X) \tag{6.3}$$

$$k'(x, x) := k(x, x) - k_{xX}(k_{XX} + \sigma^2 I)^{-1}k_{Xx}. \tag{6.4}$$

We summarize the discussion as follows.

Proposition 62 Suppose that the prior distribution of $f(\cdot)$ is a Gaussian process
$(m, k)$. If we obtain $x_1, \ldots, x_N, y_1, \ldots, y_N$ according to (6.1), the posterior probability
of $f(\cdot)$ is a Gaussian process $(m', k')$, where $m', k'$ are given by (6.3) and (6.4),
respectively.
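The formulas for $\mu'$ and $\Sigma'$ can be sketched for several prediction points at once as follows. This is a minimal illustration, assuming one-dimensional inputs, a Gaussian kernel with unit width, and a zero prior mean by default; the names `gram` and `gp_posterior` and the synthetic data are assumptions of this sketch, not part of the text.

```python
import numpy as np

def k(s, t):
    # Gaussian kernel (assumed for this sketch)
    return np.exp(-(s - t) ** 2 / 2)

def gram(a, b):
    # Matrix (k(a_i, b_j))_{i,j} for 1-D arrays a and b
    return k(a[:, None], b[None, :])

def gp_posterior(x, y, z, sigma2, m=lambda t: np.zeros_like(t)):
    """Posterior mean mu' and covariance Sigma' of f at the points z."""
    A = gram(x, x) + sigma2 * np.identity(len(x))    # k_XX + sigma^2 I
    k_zx = gram(z, x)                                # k_ZX
    mu_post = m(z) + k_zx @ np.linalg.solve(A, y - m(x))
    cov_post = gram(z, z) - k_zx @ np.linalg.solve(A, k_zx.T)
    return mu_post, cov_post

rng = np.random.default_rng(1)
x = np.linspace(-3, 3, 20)
y = np.sin(x) + 0.1 * rng.standard_normal(20)
z = np.linspace(-3, 3, 7)
mu_post, cov_post = gp_posterior(x, y, z, sigma2=0.01)
print(mu_post)
print(np.diag(cov_post))   # posterior variances k'(z_j, z_j)
```

Solving the linear systems with `np.linalg.solve` instead of forming the inverse explicitly is a choice made here for numerical stability; it evaluates the same expressions.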

In the actual calculation, it takes $O(N^3)$ time to calculate $(k_{XX} + \sigma^2 I)^{-1}$. To complete
the whole process in $O(N^3/3)$ time, we use the following method. By Cholesky
decomposition, we obtain a lower triangular matrix $L \in \mathbb{R}^{N\times N}$ such that

$$LL^\top = k_{XX} + \sigma^2 I,$$

which can be completed in $O(N^3/3)$ time. Then, let the solutions of $L\gamma = k_{Xx}$, $L\beta = Y - m_X$,
and $L^\top\alpha = \beta$ be $\gamma \in \mathbb{R}^N$, $\beta \in \mathbb{R}^N$, and $\alpha \in \mathbb{R}^N$, respectively. Since $L$ is
a lower triangular matrix, these calculations take at most $O(N^2)$ time. Additionally,
we have

$$(k_{XX} + \sigma^2 I)^{-1}(Y - m_X) = (LL^\top)^{-1}L\beta = (LL^\top)^{-1}LL^\top\alpha = \alpha$$

and
$$k_{xX}(k_{XX} + \sigma^2 I)^{-1}k_{Xx} = (L\gamma)^\top(LL^\top)^{-1}L\gamma = \gamma^\top\gamma.$$

Finally, from $\alpha$, $\beta$, $\gamma$, we have

$$m'(x) = m(x) + k_{xX}\alpha$$

and
$$k'(x, x) = k(x, x) - \gamma^\top\gamma.$$

We can write the calculations of $m'(x)$, $k'(x, x)$ in forms that are completed in
$O(N^3)$ and $O(N^3/3)$ time in Python as follows.

def gp_1(x_pred):
    h = np.zeros(n)
    for i in range(n):
        h[i] = k(x_pred, x[i])
    R = np.linalg.inv(K + sigma_2 * np.identity(n))  # O(n^3) computation
    mm = mu(x_pred) + np.dot(np.dot(h.T, R), (y - mu(x)))
    ss = k(x_pred, x_pred) - np.dot(np.dot(h.T, R), h)
    return {"mm": mm, "ss": ss}

def gp_2(x_pred):
    h = np.zeros(n)
    for i in range(n):
        h[i] = k(x_pred, x[i])
    L = np.linalg.cholesky(K + sigma_2 * np.identity(n))  # O(n^3/3) computation; L is lower triangular
    # Solve L beta = y - mu(x) and then L^T alpha = beta.
    # np.linalg.solve is a general solver; a triangular solver is needed to reach O(n^2).
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, (y - mu(x))))
    mm = mu(x_pred) + np.sum(np.dot(h.T, alpha))
    gamma = np.linalg.solve(L, h)  # solve L gamma = h
    ss = k(x_pred, x_pred) - np.sum(gamma**2)
    return {"mm": mm, "ss": ss}

Example 85 For comparison purposes, we executed the functions gp_1 and gp_2 and
measured their execution times; Fig. 6.1 shows the resulting predictions. The Cholesky
decomposition reduces the operation count, although in the run recorded below (n = 100)
gp_2 was not the faster of the two.

sigma_2 = 0.2

def k(x, y):  # Covariance function
    return np.exp(-(x - y)**2 / 2 / sigma_2)

def mu(x):  # Mean function
    return x

n = 100
x = np.random.uniform(size=n) * 6 - 3
y = np.sin(x / 2) + np.random.randn(n)
K = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        K[i, j] = k(x[i], x[j])

## Measure execution time
import time
start1 = time.time()
gp_1(0)
end1 = time.time()
print("time1=", end1 - start1)
start2 = time.time()
gp_2(0)
end2 = time.time()
print("time2=", end2 - start2)

# The 3 sigma width around the average
u_seq = np.arange(-3, 3.1, 0.1)
v_seq = []; w_seq = []
for u in u_seq:
    res = gp_1(u)
    v_seq.append(res["mm"])
    w_seq.append(res["ss"])
v_seq = np.array(v_seq)
s_seq = np.sqrt(np.array(w_seq))  # standard deviation (ss is the variance)

plt.figure()
plt.xlim(-3, 3)
plt.ylim(-3, 3)
plt.scatter(x, y, facecolors='none', edgecolors="k", marker="o")
plt.plot(u_seq, v_seq)
plt.plot(u_seq, v_seq + 3 * s_seq, c="b")
plt.plot(u_seq, v_seq - 3 * s_seq, c="b")
plt.show()

n = 100
plt.figure()
plt.xlim(-3, 3)
plt.ylim(-3, 3)
## Five times, changing the samples
color = ["r", "g", "b", "k", "m"]
for h in range(5):
    x = np.random.uniform(size=n) * 6 - 3
    y = np.sin(np.pi * x / 2) + np.random.randn(n)
    sigma_2 = 0.2
    K = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = k(x[i], x[j])
    u_seq = np.arange(-3, 3.1, 0.1)
    v_seq = []
    for u in u_seq:
        res = gp_1(u)
        v_seq.append(res["mm"])
    plt.plot(u_seq, v_seq, c=color[h])
time1 = 0.009966373443603516
time2 = 0.057814598083496094

If we compare the equation

$$m'(x) := k_{xX}(k_{XX} + \sigma^2 I)^{-1}Y$$

obtained by substituting $m_X = m(x) = 0$ into the average formula of the Gaussian
process (6.3) with the equation

$$k_{x,X}\hat{\alpha} = k_{xX}(K + \lambda I)^{-1}Y$$

obtained by multiplying the kernel ridge regression formula (4.6) by $k_{x,X}$ from the
left, we observe that the former is a specific case of the latter when setting $\lambda = \sigma^2$.

Fig. 6.1 We show the range of 3σ above and below the average (left; f(z) plotted against z) and executed different
samples five times (right; horizontal axis labeled Index)

6.2 Classification

We consider the classification problem next. We assume that the random variable $Y$
takes the value $Y = \pm 1$ and that its conditional probability given $x \in E$ is

$$P(Y = 1 \mid x) = \frac{1}{1 + \exp(-f(x))}, \tag{6.5}$$

where the Gaussian process $f : \Omega \times E \to \mathbb{R}$ is used. We wish to estimate $f$ from
the actual $x_1, \ldots, x_N \in \mathbb{R}^p$ (row vectors) and $y_1, \ldots, y_N \in \{-1, 1\}$. To maximize the
likelihood, we minimize the negative log-likelihood

$$\sum_{i=1}^N \log[1 + \exp\{-y_i f(x_i)\}].$$

If we set $f_X = [f_1, \ldots, f_N]^\top = [f(x_1), \ldots, f(x_N)]^\top \in \mathbb{R}^N$, $v_i := e^{-y_i f_i}$, and
$l(f_X) := \sum_{i=1}^N \log(1 + v_i)$, then we have

$$\frac{\partial v_i}{\partial f_i} = -y_i v_i, \quad \frac{\partial l(f_X)}{\partial f_i} = -\frac{y_i v_i}{1 + v_i}, \quad \frac{\partial^2 l(f_X)}{\partial f_i^2} = \frac{v_i}{(1 + v_i)^2},$$

where we use $y_i^2 = 1$. Given an initial value, we wait for the Newton-Raphson update
$f_X \leftarrow f_X - (\nabla^2 l(f_X))^{-1}\nabla l(f_X)$ to converge. The update formula is

$$f_X \leftarrow f_X + W^{-1}u,$$

where $u = \left(\frac{y_i v_i}{1 + v_i}\right)_{i=1,\ldots,N}$ and $W = \mathrm{diag}\left(\frac{v_i}{(1 + v_i)^2}\right)_{i=1,\ldots,N}$. In other words, with
$v := [v_1, \ldots, v_N]^\top \in \mathbb{R}^N$, we repeat the following two steps:
1. Obtain $v$, $u$, and $W$ from $f_X$.
2. Calculate $f_X + W^{-1}u$ and substitute it into $f_X$.

Next, we consider maximizing the likelihood $\prod_{i=1}^N \frac{1}{1 + \exp\{-y_i f(x_i)\}}$ multiplied
by the prior distribution of $f_X$, i.e., finding the solution with the maximum posterior
probability. Here, the mean of the prior distribution of $f$ in the formulation of (6.5)
is often set to 0. Suppose first that the prior probability of $f_X \in \mathbb{R}^N$ is

$$\frac{1}{\sqrt{(2\pi)^N \det k_{XX}}}\exp\left\{-\frac{f_X^\top k_{XX}^{-1} f_X}{2}\right\},$$

where $k_{XX}$ is the Gram matrix $(k(x_i, x_j))_{i,j=1,\ldots,N}$. If we set

$$L(f_X) = l(f_X) + \frac{1}{2}f_X^\top k_{XX}^{-1} f_X + \frac{1}{2}\log\det k_{XX} + \frac{N}{2}\log 2\pi, \tag{6.6}$$

then we have
$$\nabla L(f_X) = \nabla l(f_X) + k_{XX}^{-1} f_X = -u + k_{XX}^{-1} f_X \tag{6.7}$$

and
$$\nabla^2 L(f_X) = \nabla^2 l(f_X) + k_{XX}^{-1} = W + k_{XX}^{-1}. \tag{6.8}$$

Thus, we may express the update formula as

$$\begin{aligned}
f_X &\leftarrow f_X + (W + k_{XX}^{-1})^{-1}(u - k_{XX}^{-1} f_X) \\
&= (W + k_{XX}^{-1})^{-1}\{(W + k_{XX}^{-1})f_X - k_{XX}^{-1} f_X + u\} \\
&= (W + k_{XX}^{-1})^{-1}(W f_X + u).
\end{aligned}$$
However, since the size of $f_X$ is the number of samples $N$, it takes an enormous
amount of time to calculate the inverse matrix. We try to improve the efficiency of
this process as follows. Utilizing the Woodbury-Sherman-Morrison formula

$$(A + UWV)^{-1} = A^{-1} - A^{-1}U(W^{-1} + VA^{-1}U)^{-1}VA^{-1} \tag{6.9}$$

with $A \in \mathbb{R}^{n\times n}$ (nonsingular), $W \in \mathbb{R}^{m\times m}$, $U \in \mathbb{R}^{n\times m}$, and $V \in \mathbb{R}^{m\times n}$, if we set $A = k_{XX}^{-1}$ and
$U = V = I$, we obtain

$$\begin{aligned}
(W + k_{XX}^{-1})^{-1} &= k_{XX} - k_{XX}(W^{-1} + k_{XX})^{-1}k_{XX} \\
&= k_{XX} - k_{XX}W^{1/2}(I + W^{1/2}k_{XX}W^{1/2})^{-1}W^{1/2}k_{XX}.
\end{aligned} \tag{6.10}$$

Thus, we can obtain an $L$ such that $I + W^{1/2}k_{XX}W^{1/2} = LL^\top$ (Cholesky decomposition)
in $O(N^3/3)$ time. Letting $\gamma := W f_X + u$, we find a $\beta$ such that $L\beta = W^{1/2}k_{XX}\gamma$
and an $\alpha$ such that $L^\top W^{-1/2}\alpha = \beta$ in $O(N^2)$ time, and we substitute
$k_{XX}(\gamma - \alpha)$ into $f_X$. We repeat this procedure until convergence is achieved. In fact,
we have the following equations:

$$LL^\top W^{-1/2}\alpha = L\beta = W^{1/2}k_{XX}\gamma$$

$$\begin{aligned}
k_{XX}(\gamma - \alpha) &= k_{XX}\{\gamma - W^{1/2}(LL^\top)^{-1}W^{1/2}k_{XX}\gamma\} = \{k_{XX} - k_{XX}W^{1/2}(LL^\top)^{-1}W^{1/2}k_{XX}\}\gamma \\
&= \{k_{XX} - k_{XX}W^{1/2}(I + W^{1/2}k_{XX}W^{1/2})^{-1}W^{1/2}k_{XX}\}\gamma = (W + k_{XX}^{-1})^{-1}(W f_X + u),
\end{aligned}$$

where the last equality is due to (6.10).
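The equivalence of the direct update $(W + k_{XX}^{-1})^{-1}(W f_X + u)$ and the Cholesky-based update $k_{XX}(\gamma - \alpha)$ can be checked numerically. The sketch below assumes a small synthetic one-dimensional data set and a Gaussian kernel; the function names `newton_step_direct` and `newton_step_cholesky` are introduced here for illustration only.

```python
import numpy as np

N = 10
x = np.linspace(-4.5, 4.5, N)                 # synthetic inputs (assumption)
y = np.where(x > 0, 1.0, -1.0)                # labels in {-1, 1}
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / 2)   # Gram matrix k_XX

def newton_step_direct(f):
    """One update (W + K^{-1})^{-1} (W f + u) using explicit inverses."""
    v = np.exp(-y * f)
    u = y * v / (1 + v)
    w = v / (1 + v) ** 2
    return np.linalg.solve(np.diag(w) + np.linalg.inv(K), w * f + u)

def newton_step_cholesky(f):
    """The same update computed as K (gamma - alpha) via (6.10)."""
    v = np.exp(-y * f)
    u = y * v / (1 + v)
    w = v / (1 + v) ** 2
    w_half = np.sqrt(w)                                   # diagonal of W^{1/2}
    B = np.identity(N) + w_half[:, None] * K * w_half[None, :]
    L = np.linalg.cholesky(B)                             # B = L L^T
    gamma = w * f + u
    beta = np.linalg.solve(L, w_half * (K @ gamma))       # L beta = W^{1/2} K gamma
    alpha = w_half * np.linalg.solve(L.T, beta)           # L^T W^{-1/2} alpha = beta
    return K @ (gamma - alpha)

f = np.zeros(N)
for _ in range(20):
    f = newton_step_cholesky(f)
print(f)
```

Because $W$ is diagonal, the sketch stores only its diagonal and multiplies elementwise, which avoids forming $W^{1/2}$ and $W^{-1/2}$ as full matrices.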

Example 86 By using the first N = 100 of the 150 Iris data (the first 50 points
and the next 50 points are Setosa and Versicolor data, respectively), we found the
f X = [ f 1 , . . . , f N ] with the maximum posterior probability. The output showed that
f 1 , . . . , f 50 and f 51 , . . . , f 100 were positive and negative, respectively.

from sklearn.datasets import load_iris
df = load_iris()  ## Iris data
x = df.data[0:100, 0:4]
y = np.array([1]*50 + [-1]*50)
n = len(y)

## Compute kernel values for the four covariates
def k(x, y):
    return np.exp(np.sum(-(x - y)**2) / 2)

K = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        K[i, j] = k(x[i, :], x[j, :])

eps = 0.00001
f = [0] * n
g = [0.1] * n
i = 0  # iteration counter
while np.sum((np.array(f) - np.array(g))**2) > eps:
    i = i + 1
    g = f  ## Save the data before update for comparison
    f = np.array(f)
    y = np.array(y)
    v = np.exp(-y * f)
    u = y * v / (1 + v)
    w = v / (1 + v)**2
    W = np.diag(w)
    W_p = np.diag(w**0.5)
    W_m = np.diag(w**(-0.5))
    L = np.linalg.cholesky(np.identity(n) + np.dot(np.dot(W_p, K), W_p))
    gamma = W.dot(f) + u
    beta = np.linalg.solve(L, np.dot(np.dot(W_p, K), gamma))
    alpha = np.linalg.solve(np.dot(L.T, W_m), beta)
    f = np.dot(K, (gamma - alpha))
print(list(f))

[2.9017597728506903, 2.6661877410648125, 2.735999714975541, 2.5962146616446793,

2.8888259653434902, 2.4229904289075734, 2.7128370653298717, 2.8965829899125057,
2.263839959450692, 2.722794155018708, 2.6757868220665557, 2.80427691289934,
2.629916582197861, 2.129058875598969, 1.9947371858903622, 1.7255773341842824,
2.502403800007298, 2.894767948521167, 2.211715451090947, 2.7578887454424845,
2.5807025000167654, 2.7884335002993703, 2.4501472162360978, 2.598252566107158,
2.49363291477457, 2.5892721299617927, 2.7995603132602014, 2.854337885531593,
2.8580336326051525, 2.682198311711416, 2.6631803480277263, 2.6529515170091735,
2.409809417765029, 2.1570288906747956, 2.738196179682446, 2.777507355522734,
2.6054932709605585, 2.8486244905053546, 2.342636360704147, 2.8826825981318938,
2.887406385036485, 1.561916989035174, 2.4541693614670925, 2.64939855108404,
2.4071165717812315, 2.633906076076528, 2.7271240196093944, 2.6732162909902857,
2.749570997237667, 2.884288422112919, -1.870441763888594, -2.5373817133813237,
-1.9327372577182746, -2.5318858849974895, -2.579366785986732, -2.7850146869568233,
-2.381782737110541, -1.4675208896283152, -2.48651854962823, -2.356973904782517,
-1.6007721537062138, -2.8113237365649777, -2.3813705315823492, -2.734406212419866,
-2.3307697118699044, -2.3416642385980246, -2.6148174861101388, -2.748088114594368,
-2.3729716167430452, -2.6288003128782993, -2.2519078422840106, -2.7892659188354783,
-2.3456928166252453, -2.7042381425156226, -2.712488583609864, -2.502955837329986,
-2.1624797840208307, -2.01571028937641, -2.8402609758000015, -2.19474914254942,
-2.474090451450695, -2.3426425865837377, -2.755051287500358, -2.1890836679248933,
-2.415174453933202, -2.3822860688193157, -2.275106022961632, -2.4967546817420585,
-2.6958800533399927, -2.6643571619340682, -2.64863153643803, -2.7718459450746087,
-2.8075095809378467, -1.5468264491591686, -2.81472767183598, -2.756340788852421,
-2.8346439145882236, -2.8498009687645762, -1.1822836980388691, -2.8458627499007214]

To execute classification for a new value $x$ by using the estimated $\hat{f} \in \mathbb{R}^N$, we
perform the following steps. Similar to the regression case, if we apply Proposition
61 to

$$\begin{bmatrix} f_X \\ f(x) \end{bmatrix} \sim N\left(\begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} k_{XX} & k_{Xx} \\ k_{xX} & k_{xx} \end{bmatrix}\right),$$

then we obtain
$$f(x) \mid f_X \sim N(m'(x), k'(x, x)), \tag{6.11}$$

where
$$m'(x) = k_{xX}k_{XX}^{-1} f_X$$

and
$$k'(x, x) = k_{xx} - k_{xX}k_{XX}^{-1}k_{Xx}.$$

If we observe $Y \in \{-1, 1\}^N$, we obtain the estimate $\hat{f}$ with the Newton-Raphson
method and calculate $\hat{W}$. We consider the Laplace approximation of $f_X \mid Y$, i.e., we
approximate it by the Gaussian distribution as follows (Rasmussen-Williams [25]):

$$f_X \mid Y \sim N(\hat{f}, (\hat{W} + k_{XX}^{-1})^{-1}). \tag{6.12}$$

That is, the covariance matrix is the inverse of $\hat{W} + k_{XX}^{-1}$, which is the Hessian
$\nabla^2 L(\hat{f})$ of (6.8). Then, the variations in (6.11), (6.12) are independent, and
$f(x) \mid Y \sim N(m_*, k_*)$. Note that we can calculate the posterior probability for each $x \in E$:

$$m_* = k_{xX}k_{XX}^{-1}\hat{f} \tag{6.13}$$

$$\begin{aligned}
k_* &= k_{xx} - k_{xX}k_{XX}^{-1}k_{Xx} + k_{xX}k_{XX}^{-1}(\hat{W} + k_{XX}^{-1})^{-1}k_{XX}^{-1}k_{Xx} \\
&= k_{xx} - k_{xX}(\hat{W}^{-1} + k_{XX})^{-1}k_{Xx},
\end{aligned}$$

where the last transformation is due to (6.9) with $A = k_{XX}$ and $W = \hat{W}^{-1}$, and $U, V$
are the unit matrices. Thus, we can compute the expectation (prediction value) w.r.t.
$f(x) \mid Y \sim N(m_*, k_*)$ of the sigmoid function
$P(Y = 1 \mid x) = \frac{1}{1 + \exp(-f(x))}$:

$$\int_{\mathbb{R}} \frac{1}{1 + \exp(-z)}\frac{1}{\sqrt{2\pi k_*}}\exp\left[-\frac{1}{2k_*}\{z - m_*\}^2\right]dz. \tag{6.14}$$

To implement this step, we only need to compute $\hat{u}$ from $\hat{f}$. Since (6.7) is zero
when the updates converge, from (6.13), we have that

$$m_* = k_{xX}\hat{u}$$

and

$$\begin{aligned}
(k_{XX} + W^{-1})^{-1} &= W^{1/2}W^{-1/2}(k_{XX} + W^{-1})^{-1}W^{-1/2}W^{1/2} \\
&= W^{1/2}(I + W^{1/2}k_{XX}W^{1/2})^{-1}W^{1/2}.
\end{aligned}$$

Hence, we compute an $L$ such that $I + \hat{W}^{1/2}k_{XX}\hat{W}^{1/2} = LL^\top$ (Cholesky decomposition)
in $O(N^3/3)$ time and an $\alpha \in \mathbb{R}^N$ such that $L\alpha = \hat{W}^{1/2}k_{Xx}$. Then, we have

$$k_* = k_{xx} - \alpha^\top\alpha$$

since we have

$$k_{xX}W^{1/2}(LL^\top)^{-1}W^{1/2}k_{Xx} = k_{xX}W^{1/2}(L^{-1})^\top L^{-1}W^{1/2}k_{Xx} = \alpha^\top\alpha.$$

We can describe the procedure for finding the value of (6.14) in Python as follows.
We assume that the procedure starts immediately after the procedure of Example 86
completes.

def pred(z):
    kk = np.zeros(n)
    for i in range(n):
        kk[i] = k(z, x[i, :])
    mu = np.sum(kk * u)  # Mean
    alpha = np.linalg.solve(L, np.dot(W_p, kk))
    sigma2 = k(z, z) - np.sum(alpha**2)  # Variance
    m = 1000
    b = np.random.normal(mu, np.sqrt(sigma2), size=m)  # the scale argument is a standard deviation
    pi = np.sum((1 + np.exp(-b))**(-1)) / m  # Prediction
    return pi
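The function pred evaluates (6.14) by Monte Carlo sampling, so its value changes slightly from run to run. As a deterministic alternative, the same one-dimensional integral can be approximated by Gauss-Hermite quadrature. This is a different technique from the one used in the text, and the function name `sigmoid_gaussian_mean` and the choice of 32 nodes are assumptions of this sketch.

```python
import numpy as np

def sigmoid_gaussian_mean(m_star, k_star, deg=32):
    """Approximate E[1 / (1 + exp(-z))] for z ~ N(m_star, k_star).

    Gauss-Hermite quadrature integrates g(t) exp(-t^2) dt, so we substitute
    z = m_star + sqrt(2 k_star) t and divide by sqrt(pi).
    """
    nodes, weights = np.polynomial.hermite.hermgauss(deg)
    z = m_star + np.sqrt(2.0 * k_star) * nodes
    return np.sum(weights / (1.0 + np.exp(-z))) / np.sqrt(np.pi)

print(sigmoid_gaussian_mean(0.0, 1.0))   # symmetric case
print(sigmoid_gaussian_mean(2.0, 0.7))
```

Inside pred, the quantities mu and sigma2 would play the roles of `m_star` and `k_star`.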

Example 87 Immediately after processing Example 86, we entered numerical values
for the four covariates of Iris into the function pred and calculated the probability
of them being Setosa values (1 minus the probability of them being Versicolor values).
When we input the average values of the covariates for Setosa and Versicolor,
we observed that the probabilities were close to 1 and 0, respectively.

z = np.zeros(4)
for j in range(4):
    z[j] = np.mean(x[:50, j])
pred(z)

0.9466371383148896

for j in range(4):
    z[j] = np.mean(x[50:100, j])
pred(z)

0.05301765489687672

6.3 Gaussian Processes with Inducing Variables

A Gaussian process generally involves $O(N^3)$ calculations. To avoid such an inconvenience,
we choose $Z := [z_1, \cdots, z_M] \in E^M$ and approximate the generation process

$$f_X \sim N(m_X, k_{XX})$$

$$f(x) \mid f_X \sim N(m(x) + k_{xX}k_{XX}^{-1}(f_X - m_X),\ k(x, x) - k_{xX}k_{XX}^{-1}k_{Xx})$$

$$y \mid f(x) \sim N(f(x), \sigma^2)$$

by
$$f_Z \sim N(m_Z, k_{ZZ}) \tag{6.15}$$

$$f(x) \mid f_Z \sim N(m(x) + k_{xZ}k_{ZZ}^{-1}(f_Z - m_Z),\ k(x, x) - k_{xZ}k_{ZZ}^{-1}k_{Zx}) \tag{6.16}$$

$$y \mid f(x) \sim N(f(x), \sigma^2), \tag{6.17}$$

where $m_Z = (m(z_1), \cdots, m(z_M))$, $k_{ZZ} = (k(z_i, z_j))_{i,j=1,\cdots,M}$, and
$k_{xZ} = [k(x, z_1), \cdots, k(x, z_M)]$ (row vector).
Under the following assumption, we obtain Proposition 63.
Assumption 1 Each occurrence of (6.16) is independent for $x = x_1, \cdots, x_N$.

Proposition 63 Let $\Lambda \in \mathbb{R}^{N\times N}$ be a diagonal matrix whose elements are $\lambda(x_i)$,
$i = 1, \cdots, N$. Under the generation process outlined in (6.15), (6.16), (6.17) and
Assumption 1, we have
$$f_Z \mid Y \sim N(\mu_{f_Z|Y}, \Sigma_{f_Z|Y}),$$

$$\mu_{f_Z|Y} = m_Z + k_{ZZ}Q^{-1}k_{ZX}(\Lambda + \sigma^2 I_N)^{-1}(Y - m_X), \tag{6.18}$$

and
$$\Sigma_{f_Z|Y} = k_{ZZ}Q^{-1}k_{ZZ}, \tag{6.19}$$

where
$$Q := k_{ZZ} + k_{ZX}(\Lambda + \sigma^2 I_N)^{-1}k_{XZ} \in \mathbb{R}^{M\times M} \tag{6.20}$$

with $\lambda(x) := k(x, x) - k_{xZ}k_{ZZ}^{-1}k_{Zx}$.

Proof: From (6.16) and Assumption 1,

$$f_X \mid f_Z \sim N\big(m_X + k_{XZ} k_{ZZ}^{-1}(f_Z - m_Z),\; \Lambda\big)$$

for $f_X := [f(x_1), \ldots, f(x_N)]$. Moreover, the expectations of $Y$ and $f_X$ are equal, and only the variance $\sigma^2$ is different, so we have

$$Y \mid f_Z \sim N\big(m_X + k_{XZ} k_{ZZ}^{-1}(f_Z - m_Z),\; \Lambda + \sigma^2 I_N\big).$$

Thus, the exponent of the joint density of $f_Z$ and $Y$ (the sum of the exponents of $p(f_Z)$ and $p(Y \mid f_Z)$) is

$$-\frac{1}{2}(f_Z - m_Z)^\top k_{ZZ}^{-1}(f_Z - m_Z) - \frac{1}{2}\{Y - (m_X + k_{XZ} k_{ZZ}^{-1}(f_Z - m_Z))\}^\top (\Lambda + \sigma^2 I_N)^{-1} \{Y - (m_X + k_{XZ} k_{ZZ}^{-1}(f_Z - m_Z))\}. \tag{6.21}$$

If we differentiate (6.21) by $f_Z$, setting $a = f_Z - m_Z$ and $b = Y - m_X$ yields

$$\begin{aligned}
& -k_{ZZ}^{-1} a + k_{ZZ}^{-1} k_{ZX} (\Lambda + \sigma^2 I_N)^{-1} (b - k_{XZ} k_{ZZ}^{-1} a) \\
&= k_{ZZ}^{-1} k_{ZX} (\Lambda + \sigma^2 I_N)^{-1} b - k_{ZZ}^{-1} \{k_{ZZ} + k_{ZX} (\Lambda + \sigma^2 I_N)^{-1} k_{XZ}\} k_{ZZ}^{-1} a \\
&= k_{ZZ}^{-1} k_{ZX} (\Lambda + \sigma^2 I_N)^{-1} b - k_{ZZ}^{-1} Q k_{ZZ}^{-1} a \\
&= k_{ZZ}^{-1} Q k_{ZZ}^{-1} \{k_{ZZ} Q^{-1} k_{ZX} (\Lambda + \sigma^2 I_N)^{-1} b - a\} \\
&= -\Sigma_{f_Z|Y}^{-1} (f_Z - \mu_{f_Z|Y}).
\end{aligned} \tag{6.22}$$

Therefore, the terms w.r.t. $f_Z$ in (6.21) are only

$$-\frac{1}{2}(f_Z - \mu_{f_Z|Y})^\top \Sigma_{f_Z|Y}^{-1} (f_Z - \mu_{f_Z|Y}), \tag{6.23}$$

and we obtain the proposition.

Proposition 64 Under the generation process outlined in (6.15), (6.16), (6.17) and Assumption 1, we have

$$Y \sim N(\mu_Y, \Sigma_Y)$$

with

$$\mu_Y := m_X \tag{6.24}$$

and

$$\Sigma_Y := \Lambda + \sigma^2 I_N + k_{XZ} k_{ZZ}^{-1} k_{ZX}. \tag{6.25}$$

Proof: Since the expectation $\mu_Y$ of $Y$ is $m_X$, it remains to obtain the covariance matrix $\Sigma_Y$. Let $a := f_Z - m_Z$ and $b := Y - m_X$. Then, the exponents (6.21) and (6.23) of $p(Y, f_Z)$ and $p(f_Z \mid Y)$ are, respectively,

$$-\frac{1}{2} a^\top k_{ZZ}^{-1} a - \frac{1}{2}(b - k_{XZ} k_{ZZ}^{-1} a)^\top (\Lambda + \sigma^2 I_N)^{-1} (b - k_{XZ} k_{ZZ}^{-1} a) \tag{6.26}$$

and

$$-\frac{1}{2}\big(a - k_{ZZ} Q^{-1} k_{ZX} (\Lambda + \sigma^2 I_N)^{-1} b\big)^\top (k_{ZZ} Q^{-1} k_{ZZ})^{-1} \big(a - k_{ZZ} Q^{-1} k_{ZX} (\Lambda + \sigma^2 I_N)^{-1} b\big). \tag{6.27}$$

From (6.20), we have

$$-\frac{1}{2} a^\top k_{ZZ}^{-1} a - \frac{1}{2}(k_{XZ} k_{ZZ}^{-1} a)^\top (\Lambda + \sigma^2 I_N)^{-1} k_{XZ} k_{ZZ}^{-1} a = -\frac{1}{2} a^\top k_{ZZ}^{-1} Q k_{ZZ}^{-1} a.$$

From $p(Y, f_Z) = p(f_Z \mid Y)\, p(Y)$, the difference between (6.26) and (6.27) is the exponent of $p(Y)$, which is

$$-\frac{1}{2} b^\top (\Lambda + \sigma^2 I_N)^{-1} b + \frac{1}{2} b^\top (\Lambda + \sigma^2 I_N)^{-1} k_{XZ} Q^{-1} k_{ZX} (\Lambda + \sigma^2 I_N)^{-1} b,$$

where we may set $a = 0$ because no terms will remain w.r.t. $a$. Furthermore, if we set $A = \Lambda + \sigma^2 I_N$, $U = k_{XZ}$, $V = k_{ZX}$, and $W = k_{ZZ}^{-1}$ in the Woodbury-Sherman-Morrison formula (6.9), then we have

$$-\frac{1}{2} b^\top (\Lambda + \sigma^2 I_N + k_{XZ} k_{ZZ}^{-1} k_{ZX})^{-1} b$$

and obtain (6.25).
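The last step of this proof can be checked numerically. The following is only a sketch under stated assumptions: the matrices are random stand-ins (not kernel matrices from any example in this chapter), and the names `A`, `k_ZZ`, `k_XZ`, `precision`, and `Sigma_Y` are introduced here for illustration. It verifies that the precision matrix read off from the exponent of $p(Y)$ is the inverse of the covariance matrix in (6.25).

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 6, 3  # hypothetical numbers of data points and inducing points

# Random stand-ins: A plays the role of Lambda + sigma^2 I_N (positive diagonal),
# k_ZZ is any symmetric positive definite M x M matrix.
A = np.diag(rng.uniform(0.5, 1.5, size=N))
G = rng.standard_normal((M, M))
k_ZZ = G @ G.T + M * np.eye(M)
k_XZ = rng.standard_normal((N, M))
k_ZX = k_XZ.T

A_inv = np.linalg.inv(A)
Q = k_ZZ + k_ZX @ A_inv @ k_XZ  # definition (6.20)

# Precision matrix appearing in the exponent of p(Y)
precision = A_inv - A_inv @ k_XZ @ np.linalg.inv(Q) @ k_ZX @ A_inv
# Covariance matrix claimed in (6.25)
Sigma_Y = A + k_XZ @ np.linalg.inv(k_ZZ) @ k_ZX

# If the two agree, precision @ Sigma_Y is the identity up to rounding error
woodbury_gap = np.max(np.abs(precision @ Sigma_Y - np.eye(N)))
print(woodbury_gap)
```

The printed gap should be at the level of floating-point rounding error.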

Proposition 65 Under the generation process outlined in (6.15), (6.16), (6.17) and Assumption 1, for each $x \in E$, we have

$$f(x) \mid Y \sim N(\mu(x), \sigma^2(x))$$

$$\mu(x) := m(x) + k_{xZ} k_{ZZ}^{-1} (\mu_{f_Z|Y} - m_Z) = m(x) + k_{xZ} Q^{-1} k_{ZX} (\Lambda + \sigma^2 I_N)^{-1} (Y - m_X) \tag{6.28}$$

$$\sigma^2(x) := k(x,x) - k_{xZ} (k_{ZZ}^{-1} - Q^{-1}) k_{Zx}.$$

Proof: First, we note that $Y \to f_Z \to f(x)$ forms a Markov chain in this order. In the following, we obtain the distribution of $f(x) \mid Y$ by combining the distributions of $f(x) \mid f_Z$ and $f_Z \mid Y$. In (6.16), the term $k_{xZ} k_{ZZ}^{-1} (f_Z - m_Z)$ in the mean becomes $k_{xZ} k_{ZZ}^{-1} (\mu_{f_Z|Y} - m_Z)$ when averaged over $f_Z \mid Y$. Thus, we obtain (6.28). Moreover, if we take the variance of that term with respect to $f_Z \mid Y$, we obtain the same value as the variance of $k_{xZ} k_{ZZ}^{-1} (f_Z - \mu_{f_Z|Y})$, so we have

$$E[k_{xZ} k_{ZZ}^{-1} (f_Z - \mu_{f_Z|Y})(f_Z - \mu_{f_Z|Y})^\top k_{ZZ}^{-1} k_{Zx}] = k_{xZ} k_{ZZ}^{-1} \Sigma_{f_Z|Y} k_{ZZ}^{-1} k_{Zx} = k_{xZ} Q^{-1} k_{Zx}, \tag{6.29}$$

where $f_Z$ varies with the given $Y$. Furthermore, from (6.16), since the variance $\lambda(x) = k(x,x) - k_{xZ} k_{ZZ}^{-1} k_{Zx}$ of $f(x) \mid f_Z$ is independent of $f_Z$, we can write the variance of $f(x) \mid Y$ as the sum of the variance $\lambda(x)$ of $f(x) \mid f_Z$ and (6.29). In other words, we have $\sigma^2(x) = \lambda(x) + k_{xZ} Q^{-1} k_{Zx}$.
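One way to sanity-check these formulas is the special case $Z = X$ with $m \equiv 0$, in which $\Lambda = 0$ and the inducing-variable predictor should coincide with the ordinary Gaussian process predictor. The sketch below assumes a Gaussian kernel with a hypothetical `width` parameter and a tiny toy dataset; all names are illustrative and are not taken from the examples in this chapter.

```python
import numpy as np

def k(x, y, width=0.5):
    # Gaussian kernel; width is a hypothetical bandwidth chosen for this toy check
    return np.exp(-(x - y) ** 2 / (2 * width))

sigma2 = 0.1                        # noise variance (assumed known here)
X = np.linspace(-2.0, 2.0, 5)       # well-separated points keep the matrices well conditioned
Y = np.sin(X)
Z = X.copy()                        # special case: the inducing points are the data points
x_new = 0.3

K_ZZ = k(Z[:, None], Z[None, :])
K_XZ = k(X[:, None], Z[None, :])
k_xZ = k(x_new, Z)

K_ZZ_inv = np.linalg.inv(K_ZZ)
# lambda(x_i) = k(x_i, x_i) - k_{x_i Z} k_ZZ^{-1} k_{Z x_i}; zero (up to rounding) when Z = X
lam = k(X, X) - np.einsum("ij,jk,ik->i", K_XZ, K_ZZ_inv, K_XZ)
A_inv = np.diag(1.0 / (lam + sigma2))
Q = K_ZZ + K_XZ.T @ A_inv @ K_XZ    # (6.20)
Q_inv = np.linalg.inv(Q)

# Inducing-variable predictor: (6.28) and the variance formula of Proposition 65
mu_ind = k_xZ @ Q_inv @ K_XZ.T @ A_inv @ Y
var_ind = k(x_new, x_new) - k_xZ @ (K_ZZ_inv - Q_inv) @ k_xZ

# Ordinary Gaussian process predictor for comparison
R = np.linalg.inv(K_ZZ + sigma2 * np.eye(len(X)))
mu_full = k_xZ @ R @ Y
var_full = k(x_new, x_new) - k_xZ @ R @ k_xZ

print(mu_ind, mu_full)
print(var_ind, var_full)
```

With $M < N$ the two predictors differ in general; the agreement here only confirms that the formulas reduce correctly in the limiting case.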
When the inducing variable method is employed, the calculations of $k_{ZZ}$ and $k_{xZ}$ take $O(M^2)$ and $O(M)$, respectively, the calculation of $\Lambda$ takes $O(N)$, and the calculations of $Q$ and $Q^{-1}$ take $O(NM^2)$ and $O(M^3)$, respectively. The multiplication process is also completed in $O(NM^2)$. On the other hand, without the inducing variable method, it takes $O(N^3)$ of computational time. In the inducing variable method, we do not use the matrix $k_{XX} \in \mathbb{R}^{N \times N}$.
We can select the inducing points $z_1, \ldots, z_M$ randomly from $x_1, \ldots, x_N$ or via K-means clustering.
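The two selection strategies can be sketched as follows for one-dimensional inputs. This is a minimal illustration, not a definitive implementation: `kmeans_1d` is a hypothetical helper written here with plain NumPy (Lloyd's algorithm), and the data and the value $M = 10$ are arbitrary choices.

```python
import numpy as np

def kmeans_1d(x, M, n_iter=50, seed=0):
    """Hypothetical helper: Lloyd's algorithm for 1-D data, returning M sorted centers."""
    rng = np.random.default_rng(seed)
    centers = rng.choice(x, size=M, replace=False)   # initialize at distinct data points
    for _ in range(n_iter):
        # Assign each point to its nearest center
        labels = np.argmin(np.abs(x[:, None] - centers[None, :]), axis=1)
        # Move each center to the mean of its cluster (empty clusters stay where they are)
        for j in range(M):
            members = x[labels == j]
            if members.size > 0:
                centers[j] = members.mean()
    return np.sort(centers)

rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, size=200)                     # toy inputs x_1, ..., x_N

z_random = rng.choice(x, size=10, replace=False)     # inducing points drawn from the data
z_kmeans = kmeans_1d(x, M=10)                        # inducing points as cluster centers
print(np.sort(z_random))
print(z_kmeans)
```

Random selection always returns actual data points, whereas the cluster centers generally do not coincide with any $x_i$ but tend to cover the input range more evenly.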

Example 88 Based on the above discussion, we constructed the function gp_ind by using the inducing variable method and compared its performance with that of gp_1, which does not use the inducing variable method.

import numpy as np
import matplotlib.pyplot as plt

sigma_2 = 0.05                    # should be estimated


def k(x, y):                      # Covariance function
    return np.exp(-(x - y)**2 / 2 / sigma_2)


def mu(x):                        # Mean function
    return x


# Data generation
n = 200
x = np.random.uniform(size=n) * 6 - 3
y = np.sin(x / 2) + np.random.randn(n)
eps = 10**(-6)

m = 100
K = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        K[i, j] = k(x[i], x[j])
index = np.random.choice(n, size=m, replace=False)
z = x[index]
m_x = mu(x)                       # mean at the data points, as in (6.28) and gp_1
m_z = mu(z)                       # mean at the inducing points
K_zz = np.zeros((m, m))
for i in range(m):
    for j in range(m):
        K_zz[i, j] = k(z[i], z[j])

K_xz = np.zeros((n, m))
for i in range(n):
    for j in range(m):
        K_xz[i, j] = k(x[i], z[j])
# Add a small jitter eps to the diagonal for numerical stability
K_zz_inv = np.linalg.inv(K_zz + np.diag([eps] * m))
lam = np.zeros(n)
for i in range(n):
    lam[i] = k(x[i], x[i]) - np.dot(np.dot(K_xz[i, 0:m], K_zz_inv), K_xz[i, 0:m])
lam_0_inv = np.diag(1 / (lam + sigma_2))
Q = K_zz + np.dot(np.dot(K_xz.T, lam_0_inv), K_xz)   # Computation of Q does not require O(n^3)
Q_inv = np.linalg.inv(Q + np.diag([eps] * m))
muu = np.dot(np.dot(np.dot(Q_inv, K_xz.T), lam_0_inv), y - m_x)
dif = K_zz_inv - Q_inv
R = np.linalg.inv(K + sigma_2 * np.identity(n))      # O(n^3) computation is required


def gp_ind(x_pred):               # Inducing variable method
    h = np.zeros(m)
    for i in range(m):
        h[i] = k(x_pred, z[i])
    mm = mu(x_pred) + h.dot(muu)
    ss = k(x_pred, x_pred) - h.dot(dif).dot(h)
    return {"mm": mm, "ss": ss}


def gp_1(x_pred):                 # Without the inducing variable method
    h = np.zeros(n)
    for i in range(n):
        h[i] = k(x_pred, x[i])
    mm = mu(x_pred) + np.dot(np.dot(h.T, R), y - mu(x))
    ss = k(x_pred, x_pred) - np.dot(np.dot(h.T, R), h)
    return {"mm": mm, "ss": ss}


x_seq = np.arange(-2, 2.1, 0.1)
mmv = []; ssv = []
for u in x_seq:
    res = gp_ind(u)
    mmv.append(res["mm"])
    ssv.append(res["ss"])

plt.figure()
plt.plot(x_seq, mmv, c="r")
plt.plot(x_seq, np.array(mmv) + 3 * np.sqrt(np.array(ssv)),
         c="r", linestyle="--")
plt.plot(x_seq, np.array(mmv) - 3 * np.sqrt(np.array(ssv)),
         c="r", linestyle="--")
plt.xlim(-2, 2)
plt.plot(np.min(mmv), np.max(mmv))

x_seq = np.arange(-2, 2.1, 0.1)
mmv = []; ssv = []
for u in x_seq:
    res = gp_1(u)
    mmv.append(res["mm"])
    ssv.append(res["ss"])

mmv = np.array(mmv)
ssv = np.array(ssv)

plt.plot(x_seq, mmv, c="b")
plt.plot(x_seq, mmv + 3 * np.sqrt(ssv), c="b", linestyle="--")
plt.plot(x_seq, mmv - 3 * np.sqrt(ssv), c="b", linestyle="--")
plt.scatter(x, y, facecolors='none', edgecolors="k", marker="o")

6.4 Karhunen-Loève Expansion

In this section, we continue to study the probability space $(\Omega, \mathcal{F}, P)$ and the map $f : \Omega \times E \ni (\omega, x) \mapsto f(\omega, x) \in H$. We assume that $H$ is a general separable Hilbert space. In the following, we continue to denote $f(\omega, x)$ by $f(x)$ as a random variable for each $x \in E$. In particular, we assume that $f$ is a mean-square continuous process (Fig. 6.2), which is defined by

$$\lim_{n \to \infty} E|f(\omega, x_n) - f(\omega, x)|^2 = 0 \tag{6.30}$$

for an arbitrary sequence $\{x_n\}$ in $E$ that converges to $x \in E$. We do not assume a Gaussian process, and we give the expectation at $x \in E$ and the covariance at $x, y \in E$ by

$$m(x) = E f(\omega, x)$$

and

$$k(x, y) = \mathrm{Cov}(f(\omega, x), f(\omega, y)).$$

In Chap. 5, we obtained the expectation and covariance of $k(X, \cdot)$; in this section, however, $x, y \in E$ are not random, and $m$ and $k$ are determined by the randomness of $f(\omega, \cdot)$.
In the following, we assume that $E$ is compact.
Fig. 6.2 The red and blue curves show the results obtained by the inducing variable and standard Gaussian processes, respectively

Proposition 66 $f$ is a mean-square continuous process if and only if $m$ and $k$ are continuous.

Proof: See the Appendix at the end of this chapter.

In the following, we assume that $m \equiv 0$ to simplify the discussion. Since $E$ is compact, for each $n$ there exists a finite partition $\{E_i\}_{i=1}^{M(n)}$ of $E$ ($\cup_{i=1}^{M(n)} E_i = E$, $E_i \cap E_j = \emptyset$, $i \neq j$), where $M(n)$ is the number of sets in the partition, such that the diameter of each $E_i$ is less than or equal to $1/n$. Here, each $E_i$ is a metric space, and we define its diameter as the supremum of the distances among the elements in $E_i$. Then, we define

$$I_f(g; \{(E_i, x_i)\}_{1 \le i \le M(n)}) := \sum_{i=1}^{M(n)} f(\omega, x_i) \int_{E_i} g(y)\, d\mu(y)$$

for pairs $\{(E_i, x_i)\}_{1 \le i \le M(n)}$ of the sets and points $x_i \in E_i$, and $g \in L^2(E, B(E), \mu)$. Hence, we have

$$\int_\Omega \{I_f(g; \{(E_i, x_i)\}_{1 \le i \le M(n)})\}^2\, dP(\omega) \le M(n) \sum_{i=1}^{M(n)} \int_\Omega \{f(\omega, x_i)\}^2 \Big\{\int_{E_i} g(u)\, d\mu(u)\Big\}^2 dP = M(n) \sum_{i=1}^{M(n)} k(x_i, x_i) \Big\{\int_{E_i} g(u)\, d\mu(u)\Big\}^2 < \infty$$

and $I_f(g; \{(E_i, x_i)\}_{1 \le i \le M(n)}) \in L^2(\Omega, \mathcal{F}, P)$. Although this value is different depending on the choices of the region decomposition and the points inside the regions, the difference in $I_f$ converges to zero as $n$ goes to infinity. In fact, for two such choices $\{(E_i, x_i)\}_{1 \le i \le M(n)}$ and $\{(E'_j, x'_j)\}_{1 \le j \le M(n')}$, we have

$$\begin{aligned}
& E\,|I_f(g; \{(E_i, x_i)\}_{1 \le i \le M(n)}) - I_f(g; \{(E'_j, x'_j)\}_{1 \le j \le M(n')})|^2 \\
&= \sum_{i=1}^{M(n)} \sum_{i'=1}^{M(n)} k(x_i, x_{i'}) \int_{E_i} g(u)\, d\mu(u) \int_{E_{i'}} g(v)\, d\mu(v) \\
&\quad + \sum_{j=1}^{M(n')} \sum_{j'=1}^{M(n')} k(x'_j, x'_{j'}) \int_{E'_j} g(u)\, d\mu(u) \int_{E'_{j'}} g(v)\, d\mu(v) \\
&\quad - 2 \sum_{i=1}^{M(n)} \sum_{j=1}^{M(n')} k(x_i, x'_j) \int_{E_i} g(u)\, d\mu(u) \int_{E'_j} g(v)\, d\mu(v).
\end{aligned}$$

Since $k$ is uniformly continuous, each double sum on the right-hand side converges to

$$\int_E \int_E k(u, v) g(u) g(v)\, d\mu(u)\, d\mu(v).$$

Hence, the difference converges to zero, the sequence is a Cauchy sequence in $L^2(\Omega, \mathcal{F}, P)$, and its limit $I_f(\omega, g)$ is contained in $L^2(\Omega, \mathcal{F}, P)$ regardless of the choice of $\{(E_i, x_i)\}_{1 \le i \le M(n)}$.
If the eigenvalues and eigenfunctions obtained from the integral operator $T_k \in B(L^2(E, B(E), \mu))$,

$$T_k g(\cdot) = \int_E k(y, \cdot) g(y)\, d\mu(y), \quad g \in L^2(E, B(E), \mu),$$

are $\{\lambda_j\}_{j=1}^\infty$ and $\{e_j(\cdot)\}_{j=1}^\infty$, by Mercer's theorem, we can express the covariance function $k$ as

$$k(x, y) = \sum_{j=1}^\infty \lambda_j e_j(x) e_j(y), \tag{6.31}$$

where the sum absolutely and uniformly converges on that support.

Then, we have the following claim.
Proposition 67 If $\{f(\omega, x)\}_{x \in E}$ is a mean-square continuous process with a mean of zero, we have
1. $E[I_f(\omega, g)] = 0$.
2. $E[I_f(\omega, g) f(\omega, x)] = \int_E k(x, y) g(y)\, d\mu(y)$, $x \in E$.
3. $E[I_f(\omega, g) I_f(\omega, h)] = \int_E \int_E k(x, y) g(x) h(y)\, d\mu(x)\, d\mu(y)$.
These properties hold for each $g, h \in L^2(E, B(E), \mu)$, and in particular, we have that

$$E[I_f(\omega, e_i) I_f(\omega, e_j)] = \delta_{i,j} \lambda_i. \tag{6.32}$$

Proof: For the proofs of the above three items, see the Appendix at the end of this chapter. We obtain (6.32) by substituting Mercer's theorem (6.31), $g = e_i$, and $h = e_j$ into the third item:

$$E[I_f(\omega, e_i) I_f(\omega, e_j)] = \int_E \int_E \sum_{r=1}^\infty \lambda_r e_r(x) e_r(y) e_i(x) e_j(y)\, d\mu(x)\, d\mu(y),$$

which equals $\delta_{i,j} \lambda_i$ because $\{e_j\}$ is orthonormal.

Furthermore, we have the following theorem.
Proposition 68 (Karhunen-Loève [17, 18]) Suppose that $\{f(\omega, x)\}_{x \in E}$ is a mean-square continuous process with a mean of zero. Then, we have

$$\lim_{n \to \infty} \sup_{x \in E} E|f(\omega, x) - f_n(\omega, x)|^2 = 0$$

for $f_n(\omega, x) := \sum_{j=1}^n I_f(\omega, e_j) e_j(x)$.
Proof: From (6.32), we have

$$E[f_n(\omega, x)^2] = E\Big[\Big\{\sum_{j=1}^n I_f(\omega, e_j) e_j(x)\Big\}^2\Big] = \sum_{i=1}^n \sum_{j=1}^n E[I_f(\omega, e_i) I_f(\omega, e_j)] e_i(x) e_j(x) = \sum_{j=1}^n \lambda_j e_j^2(x).$$

Moreover, from (6.31) and the second item of Proposition 67, we have

$$E[f_n(\omega, x) f(\omega, x)] = E\Big[\sum_{j=1}^n I_f(\omega, e_j) e_j(x) f(\omega, x)\Big] = \sum_{j=1}^n e_j(x) \int_E k(x, y) e_j(y)\, d\mu(y) = \sum_{j=1}^n \lambda_j e_j^2(x) \int_E e_j^2(y)\, d\mu(y) = \sum_{j=1}^n \lambda_j e_j^2(x),$$

which means that

$$E|f_n(\omega, x) - f(\omega, x)|^2 = E[f_n(\omega, x)^2] - 2E[f_n(\omega, x) f(\omega, x)] + E[f(\omega, x)^2] = \sum_{j=1}^n \lambda_j e_j^2(x) - 2\sum_{j=1}^n \lambda_j e_j^2(x) + k(x, x) = k(x, x) - \sum_{j=1}^n \lambda_j e_j^2(x).$$

Since the sum in (6.31) converges uniformly, the right-hand side converges to zero uniformly in $x \in E$ as $n \to \infty$.

In a general mean-square continuous process (without assuming a Gaussian process), the series expansion provided by the Karhunen-Loève theorem makes $I_f(\omega, e_j)/\sqrt{\lambda_j}$ a random variable with a mean of 0 and a variance of 1. Instead, if we assume a Gaussian process such that $f(x)$ ($x \in E$) follows a Gaussian distribution, then we can write

$$f_n(x) = \sum_{j=1}^n z_j \sqrt{\lambda_j}\, e_j(x), \tag{6.33}$$

where the $z_j$ independently follow the standard Gaussian distribution.

Let $E := [0, 1]$ and $(\Omega, \mathcal{F}, P)$ be a probability space. Then, we call the map $\Omega \times E \ni (\omega, x) \mapsto f(\omega, x) \in \mathbb{R}$ that satisfies the following conditions a Brownian motion.
1. $f(\omega, 0) = 0$ and $f(\omega, y) - f(\omega, x) \sim N(0, y - x)$ for $0 \le x < y$.
2. $f(\omega, x_2) - f(\omega, x_1), \ldots, f(\omega, x_n) - f(\omega, x_{n-1})$ are independent for any $n = 1, 2, \ldots$ and $0 \le x_1 < x_2 < \ldots < x_n$.
3. There exists an $\Omega' \in \mathcal{F}$ with a probability of 1 such that $E \ni x \mapsto f(\omega, x)$ is continuous for each $\omega \in \Omega'$.
In this case, we have the following proposition.
Proposition 69 The map $\Omega \times E \ni (\omega, x) \mapsto f(\omega, x) \in \mathbb{R}$ is a Brownian motion if and only if the following three conditions are satisfied simultaneously.
1. It is a Gaussian process.
2. The covariance function of $x, y \in E$ is given by $k(x, y) = \min(x, y)$.
3. $f(\omega, \cdot)$ is continuous with a probability of 1.
Proof: The first two conditions in the definition imply the first condition in Proposition 69. Moreover, if $x < y$, then we have

$$E[f(\omega, x) f(\omega, y)] = E[f(\omega, x)^2] + E[f(\omega, x)\{f(\omega, y) - f(\omega, x)\}] = x,$$

which implies the second condition of Proposition 69. Conversely, supposing that $m \equiv 0$ for simplicity, if we assume that the first two items of Proposition 69 hold, then because $k(x, x) = x$, when $x \le y \le z$, we have

$$E[f(\omega, x)\{f(\omega, y) - f(\omega, z)\}] = k(x, y) - k(x, z) = x - x = 0$$

and

$$E[f(\omega, y)\{f(\omega, y) - f(\omega, z)\}] = k(y, y) - k(y, z) = y - y = 0,$$

which implies that

$$E[\{f(\omega, x) - f(\omega, y)\}\{f(\omega, y) - f(\omega, z)\}] = 0.$$

Moreover, from $k(0, 0) = 0$, the variance of $f(\omega, 0)$ is zero, so we have $f(\omega, 0) = 0$.

Furthermore, we have

$$E[\{f(\omega, x) - f(\omega, y)\}^2] = k(x, x) - 2k(x, y) + k(y, y) = x - 2x + y = y - x.$$

Example 89 (Brownian Motion as a Gaussian Process) For the integral operator (Example 58) on the covariance function of a Brownian motion $k(x, y) = \min(x, y)$ ($x, y \in E$), its eigenvalues and eigenfunctions are (3.13) and (3.14), respectively. Utilizing these eigenvalues and eigenfunctions, we can expand $f(\omega, \cdot)$ as follows. In particular, from (6.33), we have

$$f_n(x) = \sum_{j=1}^n z_j(\omega) \sqrt{\lambda_j}\, e_j(x)$$

for $z_j \sim N(0, 1)$. We generate the series $\{z_j(\omega)\}_{j=1}^n$ $m$ times (Fig. 6.3; $n = 10$; $m = 7$). We execute it with the following code.

def lam(j):                       # Eigenvalue
    return 4 / ((2 * j - 1) * np.pi)**2


def ee(j, x):                     # Definition of eigenfunction
    return np.sqrt(2) * np.sin((2 * j - 1) * np.pi / 2 * x)


n = 10; m = 7


# Definition of Gaussian process
def f(z, x):
    n = len(z)
    S = 0
    for i in range(n):
        # The eigenpairs are indexed from j = 1, so we shift the zero-based index i
        S = S + z[i] * ee(i + 1, x) * np.sqrt(lam(i + 1))
    return S


plt.figure()
plt.xlim(0, 1)
plt.xlabel("x")
plt.ylabel("f(omega, x)")
colormap = plt.cm.gist_ncar       # nipy_spectral, Set1, Paired
colors = [colormap(i) for i in np.linspace(0, 0.8, m)]

for j in range(m):
    z = np.random.randn(n)
    x_seq = np.arange(0, 1.001, 0.001)    # E = [0, 1]
    y_seq = []
    for x in x_seq:
        y_seq.append(f(z, x))
    plt.plot(x_seq, y_seq, c=colors[j])

plt.title("Brownian Motion")
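As a quick numerical illustration of Mercer's theorem (6.31) for this kernel, the partial sums $\sum_{j=1}^n \lambda_j e_j(x) e_j(y)$ should approach $\min(x, y)$. The sketch below restates the eigenvalues and eigenfunctions used in the code above; `mercer_sum` and the evaluation points are names and values chosen here for illustration only.

```python
import numpy as np

def lam(j):
    # Eigenvalues for k(x, y) = min(x, y) on [0, 1], indexed from j = 1
    return 4 / ((2 * j - 1) * np.pi) ** 2

def ee(j, x):
    # Corresponding eigenfunctions, indexed from j = 1
    return np.sqrt(2) * np.sin((2 * j - 1) * np.pi / 2 * x)

def mercer_sum(x, y, n):
    # Partial sum of the Mercer expansion with n terms
    return sum(lam(j) * ee(j, x) * ee(j, y) for j in range(1, n + 1))

x0, y0 = 0.3, 0.7
for n_terms in (10, 100, 1000):
    print(n_terms, mercer_sum(x0, y0, n_terms), min(x0, y0))

mercer_err = abs(mercer_sum(x0, y0, 1000) - min(x0, y0))
```

The same partial sum with $x = y$ gives the variance captured by the first $n$ terms of (6.33), which is why truncated sample paths have slightly less variance than a true Brownian motion.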

We introduce the Matérn class, which is a class of kernels used in stochastic processes rather than in the RKHS of machine learning. Such a kernel is $k(x, y) = \varphi(z)$ ($z := |x - y|$ for $x, y \in E$), where $\varphi(z)$ is

$$\varphi(z) := \frac{2^{1-\nu}}{\Gamma(\nu)} \left(\frac{\sqrt{2\nu}\, z}{l}\right)^{\nu} K_\nu\!\left(\frac{\sqrt{2\nu}\, z}{l}\right), \tag{6.34}$$

$\nu, l > 0$ are the parameters of the kernel, and $K_\nu$ is the modified Bessel function of the second kind:

$$K_\nu(x) := \frac{\pi}{2} \cdot \frac{I_{-\nu}(x) - I_\nu(x)}{\sin(\nu \pi)}, \qquad I_\alpha(x) := \sum_{m=0}^\infty \frac{1}{m!\, \Gamma(m + \alpha + 1)} \left(\frac{x}{2}\right)^{2m + \alpha}.$$

Fig. 6.3 We generated the sample paths of Brownian motions seven times. Each run involved a sum of up to 10 terms

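In practice, we use (6.34) with $\nu = p + 1/2$, where $p$ is a nonnegative integer. In the 1-dimensional case, we have

$$\varphi_\nu(z) = \exp\!\left(-\frac{\sqrt{2\nu}\, z}{l}\right) \frac{\Gamma(p + 1)}{\Gamma(2p + 1)} \sum_{i=0}^{p} \frac{(p + i)!}{i!\,(p - i)!} \left(\frac{\sqrt{8\nu}\, z}{l}\right)^{p - i}. \tag{6.35}$$

For example, we express the cases $\nu = 5/2, 3/2, 1/2$ as follows. In particular, we call the stochastic process with $\nu = 1/2$ the Ornstein-Uhlenbeck process.

$$\varphi_{5/2}(z) = \left(1 + \frac{\sqrt{5}\, z}{l} + \frac{5 z^2}{3 l^2}\right) \exp\!\left(-\frac{\sqrt{5}\, z}{l}\right)$$

$$\varphi_{3/2}(z) = \left(1 + \frac{\sqrt{3}\, z}{l}\right) \exp\!\left(-\frac{\sqrt{3}\, z}{l}\right)$$

$$\varphi_{1/2}(z) = \exp(-z/l).$$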
Fig. 6.4 The values of the Matérn kernel for $\nu = 1/2, 3/2, \ldots, m + 1/2$ (Example 90). $l = 0.1$ (left) and $l = 0.02$ (right)

For example, we can implement this kernel in Python with the following code.

from scipy.special import gamma


def matern(nu, l, r):
    p = nu - 1 / 2
    S = 0
    for i in range(int(p + 1)):
        S = S + gamma(p + i + 1) / gamma(i + 1) / gamma(p - i + 1) \
            * (np.sqrt(8 * nu) * r / l)**(p - i)
    # Leading factor Gamma(p + 1) / Gamma(2p + 1), as in (6.35)
    S = S * gamma(p + 1) / gamma(2 * p + 1) * np.exp(-np.sqrt(2 * nu) * r / l)
    return S
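The half-integer formula (6.35) can be compared against the three closed forms given for $\nu = 1/2, 3/2, 5/2$. The sketch below uses only the standard library; the function names `matern_half_integer`, `phi_1_2`, `phi_3_2`, and `phi_5_2` are introduced here for illustration and are not part of the code in this chapter.

```python
import math

def matern_half_integer(p, l, z):
    """Evaluate (6.35) for nu = p + 1/2, with p a nonnegative integer."""
    nu = p + 0.5
    s = sum(
        math.factorial(p + i) / (math.factorial(i) * math.factorial(p - i))
        * (math.sqrt(8 * nu) * z / l) ** (p - i)
        for i in range(p + 1)
    )
    return math.exp(-math.sqrt(2 * nu) * z / l) * math.gamma(p + 1) / math.gamma(2 * p + 1) * s

def phi_1_2(l, z):
    return math.exp(-z / l)

def phi_3_2(l, z):
    a = math.sqrt(3) * z / l
    return (1 + a) * math.exp(-a)

def phi_5_2(l, z):
    a = math.sqrt(5) * z / l
    return (1 + a + a * a / 3) * math.exp(-a)

l = 0.1
closed_forms = [phi_1_2, phi_3_2, phi_5_2]
max_gap = 0.0
for p, phi in enumerate(closed_forms):
    for z in (0.0, 0.01, 0.05, 0.2, 0.5):
        max_gap = max(max_gap, abs(matern_half_integer(p, l, z) - phi(l, z)))
print(max_gap)
```

A useful side check is that each kernel equals 1 at $z = 0$, which holds only with the leading factor $\Gamma(p+1)/\Gamma(2p+1)$ of (6.35).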

Example 90 We present the Matérn kernel values for l = 0.1, 0.02 with ν =
1/2, 3/2, . . . , m + 1/2 (Fig. 6.4).

m = 10
l = 0.1
colormap = plt.cm.gist_ncar       # nipy_spectral, Set1, Paired
color = [colormap(i) for i in np.linspace(0, 1, len(range(m)))]
x = np.linspace(0, 0.5, 200)
# The labels show the actual value nu = i - 1/2
plt.plot(x, matern(1 - 1 / 2, l, x), c=color[0], label=r"$\nu=%.1f$" % (1 - 1 / 2))
plt.ylim(0, 10)
for i in range(2, m + 1):
    plt.plot(x, matern(i - 1 / 2, l, x), c=color[i - 1],
             label=r"$\nu=%.1f$" % (i - 1 / 2))

plt.legend(loc="upper right", frameon=True, prop={'size': 14})

For the Matérn kernel, and in general, we cannot analytically obtain the eigenvalues and eigenfunctions as we can in the cases involving Gaussian kernels and Brownian motion. Even in those cases, if we assume a Gaussian process, then we can choose $x_1, \ldots, x_n \in E$ and obtain the Gram matrix, which is a covariance matrix. Thus, it is sufficient to generate $n$-variate random numbers that follow a Gaussian distribution with that covariance matrix. This method is approximate, but it is very versatile.

Fig. 6.5 The Ornstein-Uhlenbeck process ($\nu = 1/2$, top) and the Matérn process ($\nu = 3/2$, bottom) for $l = 0.1$

Example 91 We display the Ornstein-Uhlenbeck process ($\nu = 1/2$, top) and the Matérn process ($\nu = 3/2$, bottom) with $n = 100$ and $l = 0.1$ (Fig. 6.5).

colormap = plt.cm.gist_ncar       # nipy_spectral, Set1, Paired
colors = [colormap(i) for i in np.linspace(0, 0.8, 5)]


def rand_100(Sigma):
    L = np.linalg.cholesky(Sigma)     # Cholesky decomposition of the covariance matrix
    u = np.random.randn(100)
    y = L.dot(u)                      # Zero-mean random numbers with covariance matrix Sigma
    return y


x = np.linspace(0, 1, 100)
z = np.abs(np.subtract.outer(x, x))   # distance matrix, d_{ij} = |x_i - x_j|
l = 0.1
Sigma_OU = np.exp(-z / l)             # OU: matern(1/2, l, z) is slow
y = rand_100(Sigma_OU)

plt.figure()
plt.plot(x, y)
plt.ylim(-3, 3)
for i in range(5):
    y = rand_100(Sigma_OU)
    plt.plot(x, y, c=colors[i])
plt.title("OU process (nu=1/2, l=0.1)")

Sigma_M = matern(3 / 2, l, z)         # Matern
y = rand_100(Sigma_M)
plt.figure()
plt.plot(x, y)
plt.ylim(-3, 3)
for i in range(5):
    y = rand_100(Sigma_M)
    plt.plot(x, y, c=colors[i])
plt.title("Matern process (nu=3/2, l=0.1)")

6.5 Functional Data Analysis

Let $(\Omega, \mathcal{F}, P)$ and $H$ be a probability space and a separable Hilbert space, respectively. Let $F : \Omega \to H$ be a measurable map, i.e., $F^{-1}(\{h \in H \mid \|g - h\| < r\})$ is an element of $\mathcal{F}$ for each open ball ($g \in H$, $r \in (0, \infty)$) in $H$. We call such an $F : \Omega \to H$ a random element of $H$. Intuitively, a random element is a random variable that takes a value in $H$. Thus far, we have assumed that $f : \Omega \times E \to \mathbb{R}$ is measurable at each $x \in E$ (stochastic process). This section addresses situations in which we do not assume such measurability. For simplicity, we write $F(\omega)$ as $F$, similar to the elements of $H$.
Although we do not go into details in this book, it is known that the following relationship holds between stochastic processes and random elements. It is only necessary to understand the close relationship between the two.
Proposition 70 (Hsing-Eubank [14])
1. If $f : \Omega \times E \to \mathbb{R}$ is measurable w.r.t. $\Omega \times E$ and $f(\omega, \cdot) \in H$ for $\omega \in \Omega$, then $f(\omega, \cdot)$ is a random element of $H$.
2. If $f(\cdot, x) : \Omega \to \mathbb{R}$ is measurable for each $x \in E$ and $f(\omega, \cdot)$ is continuous for each $\omega \in \Omega$, then $f(\omega, \cdot)$ is a random element.
3. If $f : \Omega \times E \to \mathbb{R}$ is a (zero-mean) mean-square continuous process and its covariance function is $k$, then a random element of $H$ exists such that the covariance operator is $H \ni g \mapsto \int_E k(\cdot, y) g(y)\, d\mu(y) \in H$.
4. A random element in an RKHS $H(k)$ with a measurable reproducing kernel $k$ is a stochastic process, and a stochastic process that takes values in the RKHS $H(k)$ is a random element of $H(k)$.
For the proof, see Chap. 7 in [14].
In this section, we learn the properties of random elements and apply them to functional data analysis [24].
First, we consider the average $E[\langle F, g \rangle]$ of $\langle F, g \rangle$ for each $g \in H$ under $E\|F\| < \infty$. Since $g \mapsto E[\langle F, g \rangle]$ is a bounded linear functional, there exists a unique $m \in H$ such that

$$E[\langle F, g \rangle] = \langle m, g \rangle \tag{6.36}$$

from Proposition 22. We write this formally as $m = E[F]$, which is the definition of the mean of a random element $F$.
Proposition 71 If $E\|F\|^2 < \infty$, then

$$E\|F - m\|^2 = E\|F\|^2 - \|m\|^2$$

holds.
Proof: If we substitute $g = m$ into (6.36), we obtain

$$E\|F - m\|^2 = E\|F\|^2 - 2E\langle F, m \rangle + \|m\|^2 = E\|F\|^2 - 2\langle m, m \rangle + \|m\|^2.$$
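In the finite-dimensional case $H = \mathbb{R}^p$, Proposition 71 can be illustrated with the empirical distribution of a sample, for which the identity holds exactly. The sketch below uses simulated data; the sample size, the dimension, and the shift vector are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulated sample of a random element of H = R^4 with a nonzero mean
F = rng.standard_normal((1000, 4)) + np.array([1.0, -2.0, 0.5, 3.0])

m = F.mean(axis=0)                                    # mean element under the empirical distribution
lhs = np.mean(np.sum((F - m) ** 2, axis=1))           # E||F - m||^2
rhs = np.mean(np.sum(F ** 2, axis=1)) - np.sum(m ** 2)  # E||F||^2 - ||m||^2
print(lhs, rhs)
```

Both numbers agree up to rounding error because the expectation is taken with respect to the same empirical distribution on both sides.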

Since $E\|F\|^2 < \infty$ implies that $E\|F\| < \infty$, we proceed with our discussion by assuming the former case.
Regarding covariance, if $H = \mathbb{R}^p$, then the covariance matrix is

$$E[(F - E[F])(F - E[F])^\top] = E[(F - E[F]) \otimes (F - E[F])] \in \mathbb{R}^{p \times p}$$

for $F \in \mathbb{R}^p$. For the general Hilbert space $H$, the correspondence

$$H^2 \ni (g, h) \mapsto E[\langle F - m, g \rangle \langle F - m, h \rangle] \in \mathbb{R}$$

is linear for each of $g$ and $h$. Moreover, if $E\|F\|^2 < \infty$, then it is bounded from

$$E[\langle F - m, g \rangle \langle F - m, h \rangle] \le E\|F - m\|^2 \cdot \|g\| \|h\| \le E\|F\|^2 \cdot \|g\| \|h\|.$$

If we define $u \otimes v \in B(H)$ by

$$H \ni w \mapsto (u \otimes v) w = \langle u, w \rangle v \in H$$

for $u, v, w \in H$, then a $K \in B(H)$ exists such that

$$E[\langle F - m, g \rangle \langle F - m, h \rangle] = E[\langle \{(F - m) \otimes (F - m)\} g, h \rangle] = \langle K g, h \rangle = \langle g, K^* h \rangle.$$

If we exchange $g$ and $h$, we obtain the same value, so $K$ and $K^*$ coincide, i.e., $K$ is self-adjoint. Such a $K$ is called a covariance operator, and we formally write this as $K = E[(F - m) \otimes (F - m)]$.
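Proposition 72

$$E[(F - m) \otimes (F - m)] = E[F \otimes F] - m \otimes m$$

Proof: From (6.36), for arbitrary $g, h \in H$, we have

$$E[\langle m, g \rangle \langle F, h \rangle] = \langle m, g \rangle \langle E[F], h \rangle = \langle m, g \rangle \langle m, h \rangle = \langle (m \otimes m) g, h \rangle.$$

From

$$\langle F - m, g \rangle \langle F - m, h \rangle = \langle F, g \rangle \langle F, h \rangle - \langle F, g \rangle \langle m, h \rangle - \langle m, g \rangle \langle F, h \rangle + \langle m, g \rangle \langle m, h \rangle,$$

we have

$$E[\langle \{(F - m) \otimes (F - m)\} g, h \rangle] = E[\langle \{F \otimes F - m \otimes m\} g, h \rangle].$$

In the following, for simplicity, we proceed with our discussion by assuming that $m = 0$.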
Proposition 73 If $m = 0$ and $E\|F\|^2 < \infty$, then
1. The covariance operator $K$ is nonnegative definite and is a trace class operator whose trace is

$$\|K\|_{TR} = E\|F\|^2.$$

2. With probability 1, $F \in \overline{\mathrm{Im}(K)}$ holds.

Proof: For $g \in H$,

$$\langle K g, g \rangle = E[\langle F, g \rangle \langle F, g \rangle] \ge 0$$

holds, which means that $K$ is nonnegative definite. Moreover, if $\{e_j\}$ is an orthonormal basis, we have

$$\|K\|_{TR} = \sum_{j=1}^\infty \langle K e_j, e_j \rangle = \sum_{j=1}^\infty \langle E[F \otimes F] e_j, e_j \rangle = E\|F\|^2 < \infty.$$

For the second item, note that in general, we have

$$(\mathrm{Im}(K))^\perp = \mathrm{Ker}(K). \tag{6.37}$$

In fact, if $g \in \mathrm{Ker}(K)$, then since $K$ is self-adjoint ($K = K^*$), we have

$$\langle g, K h \rangle = \langle K g, h \rangle = 0, \quad h \in H.$$

Therefore, $g$ is orthogonal to any element of $\mathrm{Im}(K)$, and we have $g \in \mathrm{Im}(K)^\perp$. Conversely, if $g \in \mathrm{Im}(K)^\perp$, then $KKg \in \mathrm{Im}(K)$ and $\|K g\|^2 = \langle g, KKg \rangle = 0$, i.e., we have $g \in \mathrm{Ker}(K)$. This means that for $g \in \mathrm{Im}(K)^\perp$, we have $E[\langle F, g \rangle^2] = \langle K g, g \rangle = 0$, and $F$ is orthogonal to any $g \in \mathrm{Im}(K)^\perp$ with a probability of 1. Therefore, from Proposition 20, with a probability of 1, we have

$$F \in (\mathrm{Im}(K))^{\perp\perp} = \overline{\mathrm{Im}(K)}.$$

Additionally, from Propositions 27 and 31 and the first item of Proposition 73, the following holds.
Proposition 74 The eigenfunctions $\{e_j\}$ of the covariance operator $K$ form an orthonormal basis of $\overline{\mathrm{Im}(K)}$; the corresponding eigenvalues $\{\lambda_j\}_{j=1}^\infty$ are nonnegative, monotonically decrease, and converge to 0. Furthermore, the multiplicity of each of the nonzero eigenvalues is finite.
Additionally, from Propositions 73 and 74, the following holds.
Proposition 75 If $\{f_j\}$ is an orthonormal basis of $H$, then we have

$$E\Big\|F - \sum_{j=1}^n \langle F, f_j \rangle f_j\Big\|^2 = E\|F\|^2 - \sum_{j=1}^n \langle K f_j, f_j \rangle, \tag{6.38}$$

which is minimized when $f_j = e_j$ ($1 \le j \le n$).

Proof: The following two equations imply (6.38).

$$E\Big\|F - \sum_{j=1}^n \langle F, f_j \rangle f_j\Big\|^2 = E\|F\|^2 + E\Big\|\sum_{j=1}^n \langle F, f_j \rangle f_j\Big\|^2 - 2E\Big[\Big\langle F, \sum_{j=1}^n \langle F, f_j \rangle f_j \Big\rangle\Big]$$

$$E\Big\|\sum_{j=1}^n \langle F, f_j \rangle f_j\Big\|^2 = E\Big[\Big\langle F, \sum_{j=1}^n \langle F, f_j \rangle f_j \Big\rangle\Big] = \sum_{j=1}^n E\langle F, f_j \rangle^2 = \sum_{j=1}^n \langle K f_j, f_j \rangle.$$

Then, from $E\|F\|^2 = \|K\|_{TR} = \sum_{j=1}^\infty \lambda_j$ (Proposition 73) and Proposition 28, we obtain Proposition 75.
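For $H = \mathbb{R}^p$, Proposition 75 says that the mean squared residual after projecting onto $n$ orthonormal directions equals the trace of $K$ minus $\sum_j \langle K f_j, f_j \rangle$, and that the leading eigenvectors minimize it. The sketch below checks this on simulated, centered data; the helper `residual`, the scales, and the random competitor basis are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
p, N, n = 6, 500, 2
# Simulated sample in R^p with different scales per coordinate, centered so that m = 0
F = rng.standard_normal((N, p)) * np.array([3.0, 2.0, 1.0, 0.5, 0.2, 0.1])
F = F - F.mean(axis=0)

K = F.T @ F / N                                  # covariance operator under the empirical distribution
eigval, eigvec = np.linalg.eigh(K)               # eigh returns eigenvalues in ascending order
order = np.argsort(eigval)[::-1]
eigval, eigvec = eigval[order], eigvec[:, order]

def residual(basis):
    """Mean squared residual after projecting onto the orthonormal columns of basis."""
    proj = F @ basis @ basis.T
    return np.mean(np.sum((F - proj) ** 2, axis=1))

res_eig = residual(eigvec[:, :n])                # left-hand side of (6.38) with f_j = e_j
res_formula = np.trace(K) - eigval[:n].sum()     # right-hand side of (6.38)

# A random orthonormal competitor should not do better than the eigenvectors
Q_rand, _ = np.linalg.qr(rng.standard_normal((p, n)))
res_rand = residual(Q_rand)
print(res_eig, res_formula, res_rand)
```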
For example, from the independent realizations $F_1, \ldots, F_N$ of the random element $F$, via

$$m_N = \frac{1}{N} \sum_{i=1}^N F_i \tag{6.39}$$

$$K_N = \frac{1}{N} \sum_{i=1}^N (F_i - m_N) \otimes (F_i - m_N), \tag{6.40}$$

we can estimate the mean $m$ and covariance operator¹ $K$.

In the following, we examine how to perform principal component analysis (PCA) based on functional data analysis [24].
To obtain the eigenfunctions and eigenvalues, one option is to choose $x_1, \ldots, x_n \in E$, $1 \le n \le N$, and apply the ordinary (nonfunctional) PCA approach to $X = (F_i(x_k))$ ($i = 1, \ldots, N$ and $k = 1, \ldots, n$) for $F_i : E \to \mathbb{R}$. Alternatively, we proceed as follows.
1. Prepare the basis functions $\eta = [\eta_1, \ldots, \eta_m]^\top : E \to \mathbb{R}^m$.
2. Calculate $W = \int_E \eta(x) \eta(x)^\top dx$, i.e., $W = (w_{i,j})$ with $w_{i,j} = \int_E \eta_i(x) \eta_j(x)\, dx$.
3. Find $C = (c_{i,j})_{i=1,\ldots,N,\ j=1,\ldots,m} \in \mathbb{R}^{N \times m}$ such that $F_i(x) = \sum_{j=1}^m c_{i,j} \eta_j(x)$.
4. Find the coefficients $d_1, \ldots, d_m$ of the estimated mean function $m_N(x) := \frac{1}{N} \sum_{i=1}^N F_i(x)$ ($m_N(x) = \sum_{j=1}^m d_j \eta_j(x)$).
5. Since the covariance function is

$$k(x, y) = \frac{1}{N} \sum_{i=1}^N \{F_i(x) - m_N(x)\}\{F_i(y) - m_N(y)\} = \frac{1}{N} \eta(x)^\top (C - d)^\top (C - d) \eta(y),$$

where $C - d$ denotes the matrix obtained by subtracting $d = [d_1, \ldots, d_m]$ from each row of $C$, if we set the eigenfunctions as $\phi(x) = b^\top \eta(x)$ ($b \in \mathbb{R}^m$), then the eigenvalue problem for the covariance operator

$$\int_E k(x, y) \phi(y)\, dy = \lambda \phi(x)$$

under $b^\top W b = 1$ reduces to the problem of finding a $b$ such that

$$\frac{1}{N} \eta(x)^\top (C - d)^\top (C - d) \int_E \eta(y) \eta(y)^\top dy\; b = \lambda \eta(x)^\top b,$$

which is equivalent to

$$\frac{1}{N} (C - d)^\top (C - d) W b = \lambda b.$$

In particular, if we set $u := W^{1/2} b$, it becomes the problem of finding a $u \in \mathbb{R}^m$ such that

$$\frac{1}{N} W^{1/2} (C - d)^\top (C - d) W^{1/2} u = \lambda u$$

under $\|u\| = 1$.
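The reduction in step 5 can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the coefficient matrix `C` and the Gram matrix `W` below are random stand-ins rather than quantities computed from real basis functions, and the symmetric square root `W_half` is obtained from an eigendecomposition.

```python
import numpy as np

rng = np.random.default_rng(0)
N, m = 50, 4
C = rng.standard_normal((N, m))          # stand-in for the basis coefficients of F_1, ..., F_N
G = rng.standard_normal((m, m))
W = G @ G.T + m * np.eye(m)              # stand-in for the Gram matrix of the basis (positive definite)

d = C.mean(axis=0)                       # coefficients of the mean function m_N
Cc = C - d                               # C - d: subtract d from each row

# Symmetric square root of W and its inverse
w_val, w_vec = np.linalg.eigh(W)
W_half = w_vec @ np.diag(np.sqrt(w_val)) @ w_vec.T
W_half_inv = w_vec @ np.diag(1 / np.sqrt(w_val)) @ w_vec.T

# Symmetric eigenproblem (1/N) W^{1/2} (C - d)^T (C - d) W^{1/2} u = lambda u with ||u|| = 1
S = W_half @ Cc.T @ Cc @ W_half / N
lam, U = np.linalg.eigh(S)
lam, U = lam[::-1], U[:, ::-1]           # sort eigenpairs in decreasing order of eigenvalue

B = W_half_inv @ U                       # columns b = W^{-1/2} u satisfy b^T W b = 1

# Check the original problem (1/N) (C - d)^T (C - d) W b = lambda b
eig_residual = np.max(np.abs(Cc.T @ Cc @ W @ B / N - B * lam))
norm_residual = np.max(np.abs(B.T @ W @ B - np.eye(m)))
print(eig_residual, norm_residual)
```

Working with the symmetric matrix `S` lets us use `np.linalg.eigh`, which returns real eigenvalues and orthonormal eigenvectors; the eigenfunctions are then $\phi(x) = b^\top \eta(x)$ for the columns $b$ of `B`.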

¹ The denominator of $K_N$ may be $N - 1$.

Example 92 For $E = [-\pi, \pi]$, if we set

$$\eta_j(x) = \begin{cases} \dfrac{1}{\sqrt{2\pi}}, & j = 1 \\[1ex] \dfrac{1}{\sqrt{\pi}} \cos kx, & j = 2k \\[1ex] \dfrac{1}{\sqrt{\pi}} \sin kx, & j = 2k + 1, \end{cases}$$

we have

$$\int_{-\pi}^{\pi} \eta_i(x) \eta_j(x)\, dx = \delta_{i,j},$$

and $W$ is the identity matrix of size $p$. Therefore, the eigenequation becomes $\frac{1}{n} (C - d)^\top (C - d) u = \lambda u$, and we can pass $C \in \mathbb{R}^{n \times p}$ in place of the design matrix to an ordinary PCA routine (we may set $d = 0$ in the above procedure because the PCA routine performs the centering step automatically). In this example, we use the Canadian weather data from the fda package, which contain the temperature and precipitation for each day of the year in each Canadian city. We construct the following programs. The $n$ functions are not given directly; instead, for each of them we have observations on $N = 365$ days, and we represent the change in temperature as a linear sum of $p$ basis functions (a Fourier series). Therefore, we can say that each function is discretized by using a sufficiently large $p$.

import numpy as np
import skfda
from sklearn.decomposition import PCA

X, y = skfda.datasets.fetch_weather(return_X_y=True, as_frame=True)
df = X.iloc[:, 0].values


def g(j, x):                      # Basis consisting of p elements (j is zero-based)
    if j == 0:
        return 1 / np.sqrt(2 * np.pi)
    freq = (j + 1) // 2           # frequency k of the basis function
    if j % 2 == 1:
        return np.cos(freq * x) / np.sqrt(np.pi)
    else:
        return np.sin(freq * x) / np.sqrt(np.pi)


def beta(x, y):                   # Coefficients in front of the p elements
    X = np.zeros((N, p))
    for i in range(N):
        for j in range(p):
            X[i, j] = g(j, x[i])
    beta = np.dot(np.dot(np.linalg.inv(np.dot(X.T, X)
                                       + 0.0001 * np.identity(p)), X.T), y)
    return np.squeeze(beta)


N = 365; n = 35; m = 5; p = 100; df = df.coordinates[0].data_matrix

C = np.zeros((n, p))
for i in range(n):
    x = np.arange(1, N + 1) * (2 * np.pi / N) - np.pi
    y = df[i]
    C[i, :] = beta(x, y)
pca = PCA()
pca.fit(C)
B = pca.components_.T
xx = C.dot(B)

Fig. 6.6 We present the output of approximating Toronto's annual temperature by using m = 2, 3, 4, 5, 6 principal components. As m increases, the original data are recovered more faithfully

Each row of $C \in \mathbb{R}^{n \times p}$ holds the $p$ coefficients of a function. Then, the columns of $B \in \mathbb{R}^{p \times m}$ ($m \le p$) are the principal component vectors, and xx contains the scores of each function. The $m$th column vector of $B$ is the $m$th principal component vector (the coefficients in front of $\eta_j(x)$ for $j = 1, \ldots, p$). We change $m$ in the function z and run the following program to see if we can recover the original function.

def z(i, m, x):  ## The approximated function using m components rather than p
    S = 0
    for j in range(p):
        for k in range(m):
            for r in range(p):
                S = S + C[i, j] * B[j, k] * B[r, k] * g(r, x)
    return S

x_seq = np.arange(-np.pi, np.pi, 2 * np.pi / 100)
plt.figure()
plt.xlim(-np.pi, np.pi)
# plt.ylim(-15, 25)
plt.xlabel("Days")
plt.ylabel("Temp(C)")
plt.title("Reconstruction for each m")
plt.plot(x, df[13], label="Original")
for m in range(2, 7):
    plt.plot(x_seq, z(13, m, x_seq), c=color[m], label="m=%d" % m)
plt.legend(loc="lower center", ncol=2)
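The triple loop in z evaluates the sum over j, k, r of C[i, j] B[j, k] B[r, k] g(r, x), which is the matrix product of row i of C with the first m columns of B and their transpose, applied to the basis values. The following is a minimal sketch of an equivalent vectorized form. It assumes small synthetic stand-ins for C and B (random coefficients and a random orthonormal matrix, not the weather data) and a Fourier basis ordered as 1, cos x, sin x, cos 2x, ...; the names basis_matrix, z_loop, and z_vec are hypothetical.

```python
import numpy as np

def g(j, x):
    """Fourier basis on [-pi, pi]: 1/sqrt(2 pi), cos x, sin x, cos 2x, ..."""
    if j == 0:
        return np.full_like(np.asarray(x, dtype=float), 1 / np.sqrt(2 * np.pi))
    if j % 2 == 1:
        return np.cos(((j + 1) // 2) * x) / np.sqrt(np.pi)
    return np.sin((j // 2) * x) / np.sqrt(np.pi)

def basis_matrix(p, x):
    """Stack the p basis functions evaluated at x into a (p, len(x)) array."""
    return np.vstack([g(j, x) for j in range(p)])

def z_loop(C, B, i, m, x):
    """Direct triple loop, as in the text."""
    p = C.shape[1]
    S = 0
    for j in range(p):
        for k in range(m):
            for r in range(p):
                S = S + C[i, j] * B[j, k] * B[r, k] * g(r, x)
    return S

def z_vec(C, B, i, m, x):
    """Same quantity written as a chain of matrix products."""
    Bm = B[:, :m]                 # first m principal component vectors
    coef = C[i] @ Bm @ Bm.T       # coefficients after projection, length p
    return coef @ basis_matrix(C.shape[1], x)

rng = np.random.default_rng(0)
n, p = 4, 7                                      # synthetic sizes (assumption)
C = rng.normal(size=(n, p))
B, _ = np.linalg.qr(rng.normal(size=(p, p)))     # orthonormal columns
x_seq = np.linspace(-np.pi, np.pi, 20)

max_diff = np.max(np.abs(z_loop(C, B, 1, 3, x_seq) - z_vec(C, B, 1, 3, x_seq)))
print(max_diff)
```

With m = p and an orthonormal B, the projection is the identity, so the reconstruction coincides with the fitted function itself; this is why the curves approach the original as m grows.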

Figure 6.6 shows the output of approximating the annual temperature in Toronto
by using m = 2, 3, 4, 5, 6 principal components.
Next, we list the principal components in order of decreasing eigenvalue and draw
a graph of their contribution ratio (Fig. 6.7).

lam = pca.explained_variance_
ratio = lam / sum(lam)  # Or use pca.explained_variance_ratio_
plt.plot(range(1, 6), ratio[:5])
plt.xlabel("PC1 through PC5")
plt.ylabel("Ratio")
plt.title("Ratio")
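As a small check of the contribution ratio, the sketch below computes it by hand and compares it with the attribute provided by scikit-learn. The coefficient matrix here is synthetic (an assumption made only so that the example runs on its own), and the cumulative ratio is added for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic coefficient matrix: 35 rows, 10 columns with decreasing spread
C = rng.normal(size=(35, 10)) * np.arange(10, 0, -1)

pca = PCA()
pca.fit(C)
lam = pca.explained_variance_      # eigenvalues, largest first
ratio = lam / lam.sum()            # contribution ratio of each component
cumulative = np.cumsum(ratio)      # cumulative contribution ratio

print(ratio[:5])
print(cumulative[:5])
```

The cumulative ratio is often the more convenient quantity when choosing how many components m to keep.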

The principal component function is the function whose basis coefficients are given by the principal component vector. It differs from the output of scikit-fda in two ways.
1. Because the dates of the year, January 1 through December 31, are normalized to $[-\pi, \pi]$, the value of the principal component function is multiplied by $\sqrt{365/(2\pi)}$, and the score is multiplied by $\sqrt{2\pi/365}$.
2. Some principal component vectors are multiplied by −1, resulting in an upside-down function (which is unavoidable if the packages are different).
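The two factors in item 1 can be traced to the change of variable between the day scale and $[-\pi, \pi]$. The following is a sketch under an assumption not stated in the text: the other package normalizes each principal component function $\psi$ in $L^2$ over the day index $t \in [0, 365]$, whereas $\phi$ here is normalized over $x \in [-\pi, \pi]$ with $x(t) = 2\pi t/365 - \pi$, so that $dx = (2\pi/365)\,dt$. Signs are ignored.

```latex
\begin{align*}
\int_0^{365} \psi(t)^2\,dt = 1, \qquad
\int_0^{365} \phi(x(t))^2\,dt
  &= \frac{365}{2\pi}\int_{-\pi}^{\pi} \phi(x)^2\,dx = \frac{365}{2\pi}
  \;\Longrightarrow\;
  \phi(x(t)) = \sqrt{\frac{365}{2\pi}}\,\psi(t), \\
\int_{-\pi}^{\pi} f(x)\,\phi(x)\,dx
  &= \frac{2\pi}{365}\int_0^{365} f(x(t))\,\sqrt{\frac{365}{2\pi}}\,\psi(t)\,dt
   = \sqrt{\frac{2\pi}{365}}\int_0^{365} f(x(t))\,\psi(t)\,dt .
\end{align*}
```

The first line gives the factor for the function values and the second line gives the factor for the scores.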
The first, second, and third principal component functions appear as shown in Fig. 6.8.
We use the following program. The first principal component is the effect for the
whole year, with winter temperatures influencing the variations between cities.

def h(coef, x):  ## Define a function using coefficients
    S = 0
    for j in range(p):
        S = S + coef[j] * g(j, x)
    return S

print(B)
plt.figure()
plt.xlim(-np.pi, np.pi)
plt.ylim(-1, 1)
for j in range(3):
    plt.plot(x_seq, h(B[:, j], x_seq), c=colors[j], label="PC%d" % (j + 1))
plt.legend(loc="best")

[[-5.17047156e-01 -2.43880782e-01  7.48988279e-02 ...  5.48412387e-04
  -1.22748578e-03  8.07866592e-01]
 [-7.31215100e-01 -3.44899509e-01  1.05922938e-01 ...  7.75572240e-04
  -1.73592707e-03 -5.71437836e-01]
 [ 3.13430279e-01 -6.12932605e-01  1.50738649e-01 ... -6.99018707e-03
   1.19586600e-03 -5.41866413e-02]
 ...
 [ 3.08129173e-05 -2.83373696e-03  8.28893867e-03 ...  1.35931675e-01
  -2.34867483e-01  1.23616273e-03]
 [ 1.47021532e-03  3.25749669e-03 -4.83933350e-03 ... -1.32596334e-01
   1.60270831e-01  8.94103792e-03]
 [ 1.47021531e-03  3.25749669e-03 -4.83933350e-03 ... -1.32596334e-01
   1.60270831e-01 -1.17997796e-02]]

[Figure: Contribution Ratio. Horizontal axis: the first through fifth components; vertical axis: contribution ratio.]

Fig. 6.7 Contribution of temperature to Canadian weather. We can calculate the contribution rate as in the ordinary case where functional data analysis is not used

[Figure: Principal Component Functions (first, second, third). Horizontal axis: date (transformed from Jan. 1 through Dec. 31 to −π through π); vertical axis: the values of the principal component functions.]

Fig. 6.8 The first, second, and third principal component functions for temperature in the Canadian weather data. Some of the principal component functions are multiplied by −1, which means that they are upside down compared to those of other packages. Additionally, because the horizontal axis is normalized to $[-\pi, \pi]$, the value of each eigenfunction is multiplied by $\sqrt{365/(2\pi)}$
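The sign ambiguity of the principal component vectors is harmless for reconstruction. The sketch below, which uses synthetic stand-ins for C and B (an assumption; the values are random and unrelated to the weather data), flips the sign of one component vector and shows that the corresponding score changes sign while the m-component reconstruction is unchanged.

```python
import numpy as np

rng = np.random.default_rng(1)
C = rng.normal(size=(6, 4))                    # synthetic coefficient rows
B, _ = np.linalg.qr(rng.normal(size=(4, 4)))   # orthonormal component vectors

B_flipped = B.copy()
B_flipped[:, 0] *= -1                          # flip the sign of the first component

scores = C @ B
scores_flipped = C @ B_flipped

m = 2
recon = C @ B[:, :m] @ B[:, :m].T              # reconstruction from m components
recon_flipped = C @ B_flipped[:, :m] @ B_flipped[:, :m].T

score_sign_flips = np.allclose(scores[:, 0], -scores_flipped[:, 0])
recon_unchanged = np.allclose(recon, recon_flipped)
print(score_sign_flips, recon_unchanged)
```

The reconstruction depends on each component vector only through its outer product with itself, which is why the sign cancels.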

place = X.iloc[:, 1]
index = [9, 11, 12, 13, 16, 23, 25, 26]
others = [x for x in range(n) if x not in index]  # all remaining stations
first = [place[i][0] for i in index]              # initial letter of each city
print(first)
plt.figure()
plt.xlim(-15, 25)
plt.ylim(-25, -5)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.title("Canadian Weather")
plt.scatter(xx[others, 0], xx[others, 1], marker="x", c="k")
for i in range(8):
    l = plt.text(xx[index[i], 0], xx[index[i], 1],
                 s=first[i], c=color[i])

['Q', 'M', 'O', 'T', 'W', 'C', 'V', 'V']
[Figure: The Scores in Canadian Weather. Horizontal axis: first principal component; vertical axis: second principal component. Labels: Q Quebec, M Montreal, O Ottawa, T Toronto, W Winnipeg, C Calgary, V Vancouver, V Victoria.]

Fig. 6.9 Canadian weather temperature scores, with warmer regions such as Vancouver and Victoria appearing furthest to the left in the first principal component
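The scores above are computed as the product of C and B without subtracting the column means of C, whereas the transform method of scikit-learn's PCA centers the data first. The sketch below, on synthetic coefficients (an assumption), shows that the two differ only by a constant shift per component, so the scatter plot has the same shape in both cases.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
C = rng.normal(loc=3.0, size=(35, 8))   # synthetic coefficients with nonzero mean

pca = PCA()
pca.fit(C)
B = pca.components_.T

xx = C @ B                       # scores computed without centering
centered = pca.transform(C)      # scikit-learn subtracts pca.mean_ first
offset = pca.mean_ @ B           # constant shift for each component

same_up_to_shift = np.allclose(xx - offset, centered)
print(same_up_to_shift)
```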

Appendix

Proof of Proposition 66
Proof: Since the expectation and variance of $f(\omega, x) - m(x)$ are $0$ and $k(x, x)$, respectively, and the covariance between $f(\omega, x) - m(x)$ and $f(\omega, y) - m(y)$ is $k(x, y)$, we have

$$
\begin{aligned}
E[|f(\omega, x) - f(\omega, y)|^2]
&= E[(\{f(\omega, x) - m(x)\} - \{f(\omega, y) - m(y)\} + \{m(x) - m(y)\})^2] \\
&= k(x, x) + k(y, y) - 2k(x, y) + \{m(x) - m(y)\}^2 .
\end{aligned}
\tag{6.41}
$$

Hence, the continuity of $m$ and $k$ implies (6.30). Conversely, if we assume (6.30), then the continuity of $m$ is obtained from

$$
|m(x) - m(y)| = |E[f(\omega, x) - f(\omega, y)]| \le \{E[|f(\omega, x) - f(\omega, y)|^2]\}^{1/2} .
$$

Without loss of generality, we assume that $m \equiv 0$. Then, we have

$$
k(x, y) - k(x', y') = \{k(x, y) - k(x', y)\} + \{k(x', y) - k(x', y')\},
$$

and each term on the right-hand side is bounded via the Cauchy-Schwarz inequality:

$$
\begin{aligned}
|k(x, y) - k(x', y)| &= |E[f(\omega, x) f(\omega, y)] - E[f(\omega, x') f(\omega, y)]| \\
&\le E[f(\omega, y)^2]^{1/2}\, E[\{f(\omega, x) - f(\omega, x')\}^2]^{1/2}
= \{k(y, y)\}^{1/2} \{E[|f(\omega, x) - f(\omega, x')|^2]\}^{1/2}
\end{aligned}
$$

and

$$
|k(x', y) - k(x', y')| \le \{k(x', x')\}^{1/2} \{E[|f(\omega, y) - f(\omega, y')|^2]\}^{1/2} .
$$

Thus, we have established the continuity of k.
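Identity (6.41) can be checked numerically. The sketch below is a Monte Carlo illustration under assumptions chosen only for this example: the process is Gaussian, the mean function is sin, and the covariance function is a Gaussian kernel. The identity itself only requires finite second moments.

```python
import numpy as np

def m(x):
    """Hypothetical mean function."""
    return np.sin(x)

def k(x, y):
    """Gaussian kernel, chosen only for illustration."""
    return np.exp(-(x - y) ** 2 / 2)

x, y = 0.3, 1.0
mean = np.array([m(x), m(y)])
cov = np.array([[k(x, x), k(x, y)],
                [k(y, x), k(y, y)]])

rng = np.random.default_rng(0)
f = rng.multivariate_normal(mean, cov, size=200_000)   # draws of (f(x), f(y))

lhs = np.mean((f[:, 0] - f[:, 1]) ** 2)                # estimate of E|f(x) - f(y)|^2
rhs = k(x, x) + k(y, y) - 2 * k(x, y) + (m(x) - m(y)) ** 2
print(lhs, rhs)
```

As y approaches x, both the kernel part and the mean part of the right-hand side vanish when m and k are continuous, which is the direction of the proposition used first in the proof.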


Proof of Proposition 67
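We define $I_f^{(n)}(g) := I_f(g; \{(E_i, x_i)\}_{1 \le i \le M(n)})$. Then, we have $E[I_f^{(n)}(g)] = 0$. From $E[f(\omega, x)] = 0$, $x \in E$, and the convergence proven thus far, we obtain the first claim:

$$
|E[I_f(\omega, g)]| = |E[I_f(\omega, g) - I_f^{(n)}(g)]| \le \{E[\{I_f(\omega, g) - I_f^{(n)}(g)\}^2]\}^{1/2} \to 0
$$

as $n \to \infty$. From the uniform continuity of $k$, we obtain the second claim:

$$
\begin{aligned}
&\left| E[I_f(\omega, g) f(\omega, x)] - \int_E k(x, y) g(y)\, d\mu(y) \right| \\
&\le \left| E[\{I_f(\omega, g) - I_f^{(n)}(g)\} f(\omega, x)] \right|
 + \left| E[I_f^{(n)}(g) f(\omega, x)] - \int_E k(x, y) g(y)\, d\mu(y) \right| \\
&\le \{E[\{I_f(\omega, g) - I_f^{(n)}(g)\}^2]\}^{1/2} \{E[f(\omega, x)^2]\}^{1/2}
 + \left| \sum_{i=1}^{M(n)} \int_{E_i} |k(x, x_i) - k(x, y)|\, g(y)\, d\mu(y) \right| \to 0
\end{aligned}
$$

as $n \to \infty$. From

$$
\begin{aligned}
E[I_f^{(n)}(g) I_f^{(n)}(h)]
&= \sum_{i=1}^{M(n)} \sum_{j=1}^{M(n)} k(x_i, x_j) \int_{E_i} g(x)\, d\mu(x) \int_{E_j} h(y)\, d\mu(y) \\
&\to \int_E \int_E k(x, y) g(x) h(y)\, d\mu(x)\, d\mu(y) ,
\end{aligned}
$$

we obtain the third claim:

$$
\begin{aligned}
&\left| E[I_f(\omega, g) I_f(\omega, h)] - \int_E \int_E k(x, y) g(x) h(y)\, d\mu(x)\, d\mu(y) \right| \\
&\le \left| E[\{I_f(\omega, g) - I_f^{(n)}(g)\}\{I_f(\omega, h) - I_f^{(n)}(h)\}
 + \{I_f(\omega, g) - I_f^{(n)}(g)\} I_f^{(n)}(h)
 + \{I_f(\omega, h) - I_f^{(n)}(h)\} I_f^{(n)}(g)] \right| \\
&\quad + \left| E[I_f^{(n)}(g) I_f^{(n)}(h)] - \int_E \int_E k(x, y) g(x) h(y)\, d\mu(x)\, d\mu(y) \right| \\
&\le (E[\{I_f(\omega, g) - I_f^{(n)}(g)\}^2])^{1/2} (E[\{I_f(\omega, h) - I_f^{(n)}(h)\}^2])^{1/2} \\
&\quad + (E[\{I_f(\omega, g) - I_f^{(n)}(g)\}^2])^{1/2} (E[I_f^{(n)}(h)^2])^{1/2} \\
&\quad + (E[\{I_f(\omega, h) - I_f^{(n)}(h)\}^2])^{1/2} (E[I_f^{(n)}(g)^2])^{1/2} \\
&\quad + \sum_i \sum_j \int_{E_i} \int_{E_j} |k(x, y) - k(x_i, x_j)|\, g(x) h(y)\, d\mu(x)\, d\mu(y) \to 0 .
\end{aligned}
$$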
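The convergence of the double sum to the double integral can be observed numerically. The sketch below makes several assumptions for illustration only: E = [0, 1] with the Lebesgue measure, the Brownian-motion covariance min(x, y) as the kernel, g = h = 1, and n equal cells with their midpoints as representative points. Under these choices the limit of the double integral is 1/3; the function name riemann_covariance is hypothetical.

```python
import numpy as np

def k(x, y):
    """Brownian-motion covariance on [0, 1] (assumed kernel)."""
    return np.minimum(x, y)

def riemann_covariance(n):
    """Double sum of k(x_i, x_j) mu(E_i) mu(E_j) for g = h = 1 on n equal cells."""
    x = (np.arange(n) + 0.5) / n      # representative point x_i of each cell E_i
    w = np.full(n, 1.0 / n)           # integral of g (or h) over each cell
    K = k(x[:, None], x[None, :])     # matrix of k(x_i, x_j)
    return w @ K @ w

for n in (5, 50, 500):
    print(n, riemann_covariance(n))
```

The printed values approach 1/3 as the partition is refined, in line with the limit used for the third claim.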


Exercises 83∼100
84. Construct a function gp_sample that generates random numbers $f(\omega, x_1), \ldots, f(\omega, x_N)$ from the mean function $m$, the covariance function $k$, and $x_1, \ldots, x_N \in E$ for a set $E$. Then, set $m$ and $k$, generate 100 random numbers, and examine whether their covariance matrix matches $m$ and $k$.
85. Using Proposition 61, prove (6.3) and (6.4).
86. In the following program, other than the Cholesky decomposition, is there any
step that requires a calculation with O(N 3 ) complexity?

def gp_2(x_pred):
    h = np.zeros(n)
    for i in range(n):
        h[i] = k(x_pred, x[i])
    # np.linalg.cholesky returns the lower-triangular L with L @ L.T = K + sigma_2 * I
    L = np.linalg.cholesky(K + sigma_2 * np.identity(n))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, (y - mu(x))))
    mm = mu(x_pred) + np.sum(np.dot(h.T, alpha))
    gamma = np.linalg.solve(L, h)
    ss = k(x_pred, x_pred) - np.sum(gamma ** 2)
    return {"mm": mm, "ss": ss}

87. Show from (6.5) that the negated log-likelihood of $x_1, \ldots, x_N \in \mathbb{R}^p$, $y_1, \ldots, y_N \in \{-1, 1\}$ is
$$\sum_{i=1}^{N} \log[1 + \exp\{-y_i f(x_i)\}] .$$

88. Explain that Lines 19 through 24 of the program in Example 86 are used to update $f_X \leftarrow (W + k_{XX}^{-1})^{-1}(W f_X + u)$.
89. Replace the first 100 Iris data (50 Setosa, 50 Versicolor) with the 51st to 150th
data (50 Versicolor, 50 Virginica) in Example 86 to execute the program.
90. In the proof of Proposition 65, why is it acceptable to obtain $\mu(x)$ by replacing $f_Z$ in the generation process (6.16) with $\mu_{f_Z|Y}$? In $\sigma^2(x)$, the variations due to $f_Z|Y$ and $f(x)|f_Z$ are treated as independent. Why can we assume that they are independent?
91. In Example 88, there is a step in which the function gp_ind that realizes the
inducing variable method avoids processing O(N 3 ) calculations. Where is this
step?
92. Show that a stochastic process is a mean-square continuous process if and only
if its mean and covariance functions are continuous.
93. From Mercer's theorem (6.31) and Proposition 67, derive the Karhunen-Loève
theorem. Additionally, for n = 10, generate five sample paths of Brownian
motion.
94. From the formula for the Matérn kernel (6.35), derive $\varphi_{5/2}$ and $\varphi_{3/2}$. Additionally, illustrate the value of the Matérn kernel ($\nu = 1, \ldots, 10$) for $l = 0.05$, as in Fig. 6.4.
95. Illustrate the sample path of the Matérn kernel with ν = 5/2, l = 0.1.
96. Give an example of a random element that does not involve a stochastic process
and an example of a stochastic process that does not involve a random element.

97. Prepare a basis function η = [η1 , . . . , η p ] : E → R p and construct a procedure

to find m N (x) in (6.39). Then, input the Canadian weather data for N = 35 and
output the result. Additionally, construct a procedure to find K N (x) in (6.40)
and output it as a matrix of size p × p.
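98. Suppose that we prepare p basis functions as

$$
\left\{ \frac{1}{\sqrt{2\pi}}, \frac{\cos x}{\sqrt{\pi}}, \frac{\sin x}{\sqrt{\pi}}, \frac{\cos 2x}{\sqrt{\pi}}, \frac{\sin 2x}{\sqrt{\pi}}, \cdots \right\}
$$

for $E = [-\pi, \pi]$. Why is $W = (w_{i,j}) = \int_E \eta(x) \eta(x)^\top dx$ a unit matrix?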
99. Using the Canadian weather data (precipitation for each day of the year) instead
of temperature for each day of the year, find the principal component functions
and eigenvalues and output graphs similar to those in Figs. 6.8 and 6.9.
100. Using the scikit-fda, find the principal component functions and eigenvalues
for both temperature and precipitation for each day of the year and output
graphs similar to those in Figs. 6.8 and 6.9.
Bibliography

1. N. Aronszajn, Theory of reproducing kernels. Trans. Am. Math. Soc. 68, 337–404 (1950)
2. H. Avron, M. Kapralov, C. Musco, A. Velingker and A. Zandieh. Random fourier fea-
tures for kernel ridge regression: Approximation bounds and statistical guarantees. ArXiv,
abs/1804.09893, 2017
3. C. Baker. The Numerical Treatment of Integral Equations (Claredon Press, 1978)
4. P. Bartlett and S. Mendelson. Rademacher and gaussian complexities: Risk bounds and struc-
tural results. In J. Mach. Learn. Res., 2001
5. K.P. Chwialkowski and A. Gretton. A kernel independence test for random processes. In ICML,
2014)
6. R. Dudley. Real Analysis and Probability (Cambridge Studies in Advanced Mathematics, 1989)
7. K. Fukumizu. Introduction to Kernel Methods (kaneru hou nyuumon) (Asakura, 2010). (In
Japanese)
8. T. Gneiting, Compactly supported correlation functions. J. Multivar. Anal. 83, 493–508 (2002)
9. G.H. Golub, C.F. Van Loan, Matrix Computations, 3rd edn. (Johns Hopkins, Baltimore, 1996)
10. I.S. Gradshteyn, I.M. Ryzhik, R.H. Romer, Tables of integrals, series, and products. Am. J.
Phys. 56, 958–958 (1988)
11. A. Gretton, K. Borgwardt, M. Rasch, B. Schölkopf, A. Smola, A kernel two-sample test. J.
Mach. Learn. Res. 13, 723–773 (2012)
12. A. Gretton, R. Herbrich, A. Smola, O. Bousquet, B. Schölkopf, Kernel methods for measuring
independence. J. Mach. Learn. Res. 6, 2075–2129 (2005)
13. D. Haussler. Convolution kernels on discrete structures. Technical Report UCSC-CRL-99-10,
UCSC, 1999
14. T. Hsing and R. Eubank. Theoretical Foundations of Functional Data Analysis, with an Intro-
duction to Linear Operators (Wiley, 2015)
15. K. Itõ. An Introduction to Probability Theory (Cambridge University Press, 1984)
16. Y. Kano and S. Shimizu. Causal inference using nonnormality. In Proceedings of the Annual
Meeting of the Behaviormetric Society of Japan 47, 2004
17. K. Karhunen. Über lineare methoden in der wahrscheinlichkeitsrechnung. Ann. Acad. Sci.
Fennicae. Ser. A. I. Math.-Phys 37, 1–79 (1947)
18. K. Karhunen. Probability theory. Vol. II (Springer-Verlag, 1978)
19. H. Kashima, K. Tsuda and A. Inokuchi. Marginalized kernels between labeled graphs. In ICML,
2003

© The Editor(s) (if applicable) and The Author(s), under exclusive license 207
to Springer Nature Singapore Pte Ltd. 2022
J. Suzuki, Kernel Methods for Machine Learning with Math and Python,
https://doi.org/10.1007/978-981-19-0401-1
208 Bibliography

20. S. Lauritzen. Graphical Models (Oxford Science Publications, 1996)

21. J. Mercer. Functions of positive and negative type and their connection with the theory of
integral equations. Philosophical Transactions of the Royal Society A, pp. 441–458, 1909
22. J. Neveu, Processus aIeatoires gaussiens, Seminaire Math. Sup. Les presses de I’Universite de
Montreal, 1968
23. A. Rahimi and B. Recht, Random features for large-scale kernel machines. In Advances in
neural information processing systems, 2007
24. J. Ramsay and B.W. Silverman. Functional Data Analysis (Springer Series in Statistics, 2005)
25. C. Rasmussen and C.K.I. Williams. Gaussian Processes for Machine Learning (MIT Press,
2006)
26. B. Schölkopf, A. Smola and K. Müller. Kernel principal component analysis, In ICANN, 1997
27. R. Serfling. Approximation Theorems of Mathematical Statistics (Wiley, 1980)
28. S. Shimizu, P. Hoyer, A. Hyvärinen, A.J. Kerminen, A linear non-gaussian acyclic model for
causal discovery. J. Mach. Learn. Res. 7, 2003–2030 (2006)
29. I. Steinwart, On the influence of the kernel on the consistency of support vector machines. J.
Mach. Learn. Res. 2, 67–93 (2001)
30. M.H. Stone, Applications of the theory of boolean rings to general topology. Trans. Am. Math.
Soc. 41(3), 375–481 (1937)
31. M.H. Stone, The generalized weierstrass approximation theorem. Math. Mag. 21(4), 167–184
(1948)
32. K. Tsuda, T. Kin, K. Asai, Marginalized kernels for biological sequences. Bioinformatics
18(Suppl 1), S268-75 (2002)
33. J.-P. Vert. Aronszajn’s theorem, 2017. https://members.cbio.mines-paristech.fr/~jvert/svn/
kernelcourse/notes/aronszajn.pdf
34. K. Weierstrass. Über die analytische darstellbarkeit sogenannter willkürlicher functionen
einer reellen veränderlichen. Sitzungsberichte der Königlich Preußischen Akademie der Wis-
senschaften zu Berlin, pp. 633–639, 1885. Erste Mitteilung
35. K. Weierstrass. Über die analytische darstellbarkeit sogenannter willkürlicher functionen
einer reellen veränderlichen. Sitzungsberichte der Königlich Preußischen Akademie der Wis-
senschaften zu Berlin, pp. 789–805, 1885. Zweite Mitteilung
36. H. Zhu, C.K.I. Williams, R. Rohwer and Michal Morciniec. Gaussian regression and optimal
finite dimensional linear models. In Neural Networks and Machine Learning, 1997