A Tutorial On Canonical Correlation Methods
Canonical correlation analysis is a family of multivariate statistical methods for the analysis of paired sets
of variables. Since its introduction, canonical correlation analysis has, for instance, been extended to extract
relations between two sets of variables when the sample size is insufficient in relation to the data dimen-
sionality, when the relations have been considered to be non-linear, and when the dimensionality is too large
for human interpretation. This tutorial explains the theory of canonical correlation analysis including its
regularised, kernel, and sparse variants. Additionally, the deep and Bayesian CCA extensions are briefly
reviewed. Together with the numerical examples, this overview provides a coherent compendium on the ap-
plicability of the variants of canonical correlation analysis. By bringing together techniques for solving the
optimisation problems, evaluating the statistical significance and generalisability of the canonical correla-
tion model, and interpreting the relations, we hope that this article can serve as a hands-on tool for applying
canonical correlation methods in data analysis.
CCS Concepts: •Computing methodologies → Dimensionality reduction and manifold learning;
General Terms: Multivariate Statistical Analysis, Machine Learning, Statistical Learning Theory
Additional Key Words and Phrases: Canonical correlation, regularisation, kernel methods, sparsity
ACM Reference Format:
Viivi Uurtio, João M. Monteiro, Jaz Kandola, John Shawe-Taylor, Delmiro Fernandez-Reyes, and Juho
Rousu, 2017. A Tutorial on Canonical Correlation Methods. ACM Comput. Surv. 50, 6, Article 95 (Octo-
ber 2017), 33 pages.
DOI: 10.1145/3136624
1. INTRODUCTION
When a process can be described by two sets of variables corresponding to two differ-
ent aspects, or views, analysing the relations between these two views may improve
the understanding of the underlying system. In this context, a relation is a mapping
of the observations corresponding to a variable of one view to the observations corre-
sponding to a variable of the other view. For example in the field of medicine, one view
could comprise variables corresponding to the symptoms of the disease and the other
to the risk factors that can have an effect on the disease incidence. Identifying the
relations between the symptoms and the risk factors can improve the understanding
of the disease exposure and give indications for prevention and treatment. Examples of this kind of two-view setting, where the analysis of the relations could provide
new information about the functioning of the system, occur in several other fields of
science. These relations can be determined by means of canonical correlation methods
that have been developed specifically for this purpose.
Since the proposition of canonical correlation analysis (CCA) by H. Hotelling
[Hotelling 1935; Hotelling 1936], relations between variables have been explored
in various fields of science. CCA was first applied to examine the relation
of wheat characteristics to flour characteristics in an economics study by F.
Waugh in 1942 [Waugh 1942]. Since then, studies in the fields of psychology
[Hopkins 1969; Dunham and Kravetz 1975], geography [Monmonier and Finn 1973],
medicine [Lindsey et al. 1985], physics [Wong et al. 1980], chemistry [Tu et al. 1989],
biology [Sullivan 1982], time-series modeling [Heij and Roorda 1991], and signal pro-
cessing [Schell and Gardner 1995] constitute examples of the early application fields
of CCA.
Since the beginning of the 21st century, the applicability of CCA has been demonstrated in modern fields of science such as neuroscience, machine learning, and bioinformatics. Relations have been explored for developing brain-computer interfaces [Cao et al. 2015; Nakanishi et al. 2015] and in the field of imaging genetics
[Fang et al. 2016]. CCA has also been applied for feature selection [Ogura et al. 2013],
feature extraction and fusion [Shen et al. 2013], and dimension reduction
[Wang et al. 2013]. Examples of application studies conducted in the fields of bioin-
formatics and computational biology include [Rousu et al. 2013; Seoane et al. 2014;
Baur and Bozdag 2015; Sarkar and Chakraborty 2015; Cichonska et al. 2016]. The
vast range of application domains emphasises the utility of CCA in extracting
relations between variables.
Originally, CCA was developed to extract linear relations in overdetermined settings, that is, when the number of observations exceeds the number of variables in both views. To extend CCA to underdetermined settings that often occur in modern
data analysis, methods of regularisation have been proposed. When the sample size
is small, Bayesian CCA also provides an alternative to perform CCA. The applicabil-
ity of CCA to underdetermined settings has been further improved through sparsity-
inducing norms that facilitate the interpretation of the final result. Kernel methods
and neural networks have been introduced for uncovering non-linear relations. At
present, canonical correlation methods can be used to extract linear and non-linear
relations in both over- and underdetermined settings.
In addition to the already described variants of CCA, alternative extensions have
been proposed, such as the semi-paired and multi-view CCA. In general, CCA algo-
rithms assume one-to-one correspondence between the observations in the views, in
other words, the data is assumed to be paired. However, in real datasets some of the ob-
servations may be missing in either view, which means that the observations are semi-
paired. Examples of semi-paired CCA algorithms comprise [Blaschko et al. 2008],
[Kimura et al. 2013], [Chen et al. 2012], and [Zhang et al. 2014]. CCA has also been
extended to more than two views by [Horst 1961], [Carroll 1968], [Kettenring 1971],
and [Van de Geer 1984]. In multi-view CCA the relations are sought among more
than two views. Some of the modern extensions of multi-view CCA comprise its reg-
ularised [Tenenhaus and Tenenhaus 2011], kernelised [Tenenhaus et al. 2015], and
sparse [Tenenhaus et al. 2014] variants. Application studies of multi-view CCA and its
modern variants can be found in neuroscience [Kang et al. 2013], [Chen et al. 2014],
feature fusion [Yuan et al. 2011] and dimensionality reduction [Yuan et al. 2014].
However, both the semi-paired and multi-view CCA are beyond the scope of this tu-
torial.
This tutorial begins with an introduction to the original formulation of CCA. The
basic framework and statistical assumptions are presented. The techniques for solv-
ing the CCA optimisation problem are discussed. After solving the CCA problem, the
approaches to interpret and evaluate the result are explained. The variants of CCA
are illustrated using worked examples. Of the extended versions of CCA, the tuto-
rial concentrates on the topics of regularised, kernel, and sparse CCA. Additionally,
the deep and Bayesian CCA variants are briefly reviewed. This tutorial acquaints the
reader with canonical correlation methods, discusses where they are applicable and
what kind of information can be extracted.
$X_a w_a = z_a$ and $X_b w_b = z_b$
Let the maximum be obtained by $z_a^1$ and $z_b^1$. The pair of images $z_a^2$ and $z_b^2$, that has the second smallest enclosing angle $\theta_2$, is found in the orthogonal complements of $z_a^1$ and $z_b^1$. The procedure is continued until no more pairs are found. Hence the $r$ angles $\theta_r \in [0, \pi/2]$ for $r = 1, 2, \ldots, q$ when $p > q$ that can be found are recursively defined by
The first and greatest canonical correlation that corresponds to the smallest angle is between the first pair of images $z_a = X_a w_a$ and $z_b = X_b w_b$. Since the correlation between $z_a$ and $z_b$ does not change with the scaling of $z_a$ and $z_b$, we can constrain $w_a$ and $w_b$ to be such that $z_a$ and $z_b$ have unit variance. This is given by

$z_a^T z_a = w_a^T X_a^T X_a w_a = w_a^T C_{aa} w_a = 1$,  (3)
$z_b^T z_b = w_b^T X_b^T X_b w_b = w_b^T C_{bb} w_b = 1$.  (4)

Due to the normality assumption and comparability, the variables of $X_a$ and $X_b$ should be centered such that they have zero means. In this case, the covariance between $z_a$ and $z_b$ is given by

$z_a^T z_b = w_a^T X_a^T X_b w_b = w_a^T C_{ab} w_b$.  (5)

Substituting (5), (3) and (4) into the algebraic problem in Equation (1), we obtain:

$\cos\theta = \max_{z_a, z_b \in \mathbb{R}^n} \langle z_a, z_b \rangle = \max_{w_a \in \mathbb{R}^p, w_b \in \mathbb{R}^q} w_a^T C_{ab} w_b$,
subject to $\|z_a\|_2 = \sqrt{w_a^T C_{aa} w_a} = 1$ and $\|z_b\|_2 = \sqrt{w_b^T C_{bb} w_b} = 1$.
In general, the constraints (3) and (4) are expressed in squared form, $w_a^T C_{aa} w_a = 1$ and $w_b^T C_{bb} w_b = 1$. The problem can be solved using the Lagrange multiplier technique. Let

$L = w_a^T C_{ab} w_b - \frac{\rho_1}{2}(w_a^T C_{aa} w_a - 1) - \frac{\rho_2}{2}(w_b^T C_{bb} w_b - 1)$  (6)

where $\rho_1$ and $\rho_2$ denote the Lagrange multipliers. Differentiating $L$ with respect to $w_a$ and $w_b$ gives

$\frac{\partial L}{\partial w_a} = C_{ab} w_b - \rho_1 C_{aa} w_a = 0$  (7)
$\frac{\partial L}{\partial w_b} = C_{ba} w_a - \rho_2 C_{bb} w_b = 0$  (8)

Multiplying (7) from the left by $w_a^T$ and (8) from the left by $w_b^T$ gives

$w_a^T C_{ab} w_b - \rho_1 w_a^T C_{aa} w_a = 0$
$w_b^T C_{ba} w_a - \rho_2 w_b^T C_{bb} w_b = 0$.
If $C_{bb}$ is invertible, the problem reduces to a standard eigenvalue problem of the form

$C_{bb}^{-1} C_{ba} C_{aa}^{-1} C_{ab} w_b = \rho^2 w_b$.

The eigenvalues of the matrix $C_{bb}^{-1} C_{ba} C_{aa}^{-1} C_{ab}$ are found by solving the characteristic equation

$|C_{bb}^{-1} C_{ba} C_{aa}^{-1} C_{ab} - \rho^2 I| = 0$.
The square roots of the eigenvalues correspond to the canonical correlations. The tech-
nique of solving the standard eigenvalue problem is shown in Example 2.1.
Example 2.1. We generate two data matrices Xa and Xb of sizes n × p and n ×
q, where n = 60, p = 4 and q = 3, respectively as follows. The variables of Xa are
generated from a random univariate normal distribution, a1 , a2 , a3 , a4 ∼ N (0, 1). We
generate the following linear relations
b1 = a3 + ξ 1
b2 = a1 + ξ 2
b3 = −a4 + ξ 3
where ξ 1 ∼ N (0, 0.2), ξ2 ∼ N (0, 0.4), and ξ 3 ∼ N (0, 0.3) denote vectors of normal noise.
The data is standardised such that every variable has zero mean and unit variance.
The joint covariance matrix $C$ in (2) of the generated data is given by

$C = \begin{pmatrix} 1.00 & 0.34 & -0.11 & 0.21 & -0.10 & 0.92 & -0.21 \\ 0.34 & 1.00 & -0.08 & 0.03 & -0.10 & 0.34 & 0.06 \\ -0.11 & -0.08 & 1.00 & -0.30 & 0.98 & -0.03 & 0.30 \\ 0.21 & 0.03 & -0.30 & 1.00 & -0.25 & 0.12 & -0.94 \\ -0.10 & -0.10 & 0.98 & -0.25 & 1.00 & -0.03 & 0.25 \\ 0.92 & 0.34 & -0.03 & 0.12 & -0.03 & 1.00 & -0.13 \\ -0.21 & 0.06 & 0.30 & -0.94 & 0.25 & -0.13 & 1.00 \end{pmatrix} = \begin{pmatrix} C_{aa} & C_{ab} \\ C_{ba} & C_{bb} \end{pmatrix}.$
Now we compute the eigenvalues of the characteristic equation

$|C_{bb}^{-1} C_{ba} C_{aa}^{-1} C_{ab} - \rho^2 I| = 0$.

The square roots of the eigenvalues of $C_{bb}^{-1} C_{ba} C_{aa}^{-1} C_{ab}$ are $\rho_1 = 0.99$, $\rho_2 = 0.94$, and $\rho_3 = 0.92$. The eigenvectors $w_b$ satisfy the equation

$(C_{bb}^{-1} C_{ba} C_{aa}^{-1} C_{ab} - \rho^2 I) w_b = 0$.

Hence we obtain

$w_b^1 = \begin{pmatrix} -0.97 \\ -0.04 \\ -0.22 \end{pmatrix} \quad w_b^2 = \begin{pmatrix} -0.39 \\ -0.37 \\ 0.85 \end{pmatrix} \quad w_b^3 = \begin{pmatrix} 0.19 \\ -0.86 \\ -0.46 \end{pmatrix}$

and the $w_a$ vectors satisfy

$w_a^1 = \frac{C_{aa}^{-1} C_{ab} w_b^1}{\rho_1} = \begin{pmatrix} -0.04 \\ -0.00 \\ -0.99 \\ 0.18 \end{pmatrix} \quad w_a^2 = \frac{C_{aa}^{-1} C_{ab} w_b^2}{\rho_2} = \begin{pmatrix} -0.41 \\ 0.09 \\ -0.41 \\ -0.83 \end{pmatrix} \quad w_a^3 = \frac{C_{aa}^{-1} C_{ab} w_b^3}{\rho_3} = \begin{pmatrix} -0.84 \\ -0.10 \\ 0.14 \\ 0.52 \end{pmatrix}.$
The vectors $w_b^1$, $w_b^2$, and $w_b^3$ and $w_a^1$, $w_a^2$, and $w_a^3$ correspond to the pairs of positions $(w_a^1, w_b^1)$, $(w_a^2, w_b^2)$ and $(w_a^3, w_b^3)$ that have the images $(z_a^1, z_b^1)$, $(z_a^2, z_b^2)$ and $(z_a^3, z_b^3)$. In linear CCA, the canonical correlations equal the square roots of the eigenvalues, that is $\langle z_a^1, z_b^1 \rangle = 0.99$, $\langle z_a^2, z_b^2 \rangle = 0.94$, and $\langle z_a^3, z_b^3 \rangle = 0.92$.
Solving CCA Through the Generalised Eigenvalue Problem. The positions wa and
wb and their images za and zb can also be solved through a generalised eigenvalue
problem [Bach and Jordan 2002; Hardoon et al. 2004]. The equations in (7) and (8) can
be represented as simultaneous equations
$C_{ab} w_b = \rho C_{aa} w_a$
$C_{ba} w_a = \rho C_{bb} w_b$

that are equivalent to

$\begin{pmatrix} 0 & C_{ab} \\ C_{ba} & 0 \end{pmatrix} \begin{pmatrix} w_a \\ w_b \end{pmatrix} = \rho \begin{pmatrix} C_{aa} & 0 \\ 0 & C_{bb} \end{pmatrix} \begin{pmatrix} w_a \\ w_b \end{pmatrix}.$  (11)
The equation (11) represents a generalised eigenvalue problem of the form βAx =
αBx where the pair (β, α) = (1, α) is an eigenvalue of the pair (A, B) [Saad 2011;
Golub and Van Loan 2012]. The pair of matrices A ∈ R(p+q)×(p+q) and B ∈ R(p+q)×(p+q)
is also referred to as matrix pencil. In particular, A is symmetric and B is symmet-
ric positive-definite. The pair (A, B) is then called the symmetric pair. As shown in
[Watkins 2004], a symmetric pair has real eigenvalues and (p+q) linearly independent
eigenvectors. To express the generalised eigenvalue problem in the form $Ax = \rho Bx$, the generalised eigenvalue is given by $\rho = \alpha/\beta$. Since the generalised eigenvalues come in pairs $\{\rho_1, -\rho_1, \rho_2, -\rho_2, \ldots, \rho_p, -\rho_p, 0\}$ where $p < q$, the positive generalised eigenvalues
correspond to the canonical correlations.
Example 2.2. Using the data in Example 2.1, we apply the formulation of the gen-
eralised eigenvalue problem to obtain the positions wa and wb . The resulting gener-
alised eigenvalues are
{0.99, 0.94, 0.92, 0.00, −0.92, −0.94, −0.99}.
The generalised eigenvectors that correspond to the positive generalised eigenvalues in descending order are

$w_a^1 = \begin{pmatrix} -0.04 \\ -0.00 \\ -1.00 \\ 0.18 \end{pmatrix} \quad w_a^2 = \begin{pmatrix} 0.48 \\ -0.11 \\ 0.48 \\ 0.98 \end{pmatrix} \quad w_a^3 = \begin{pmatrix} -0.97 \\ -0.11 \\ 0.16 \\ 0.60 \end{pmatrix}$

$w_b^1 = \begin{pmatrix} -0.98 \\ -0.04 \\ -0.23 \end{pmatrix} \quad w_b^2 = \begin{pmatrix} 0.46 \\ 0.43 \\ -1.00 \end{pmatrix} \quad w_b^3 = \begin{pmatrix} 0.22 \\ -1.00 \\ -0.54 \end{pmatrix}$

The vectors $w_a^1$, $w_a^2$, and $w_a^3$ and $w_b^1$, $w_b^2$, and $w_b^3$ correspond to the pairs of positions $(w_a^1, w_b^1)$, $(w_a^2, w_b^2)$ and $(w_a^3, w_b^3)$. The canonical correlations are $\langle z_a^1, z_b^1 \rangle = 0.99$, $\langle z_a^2, z_b^2 \rangle = 0.94$, and $\langle z_a^3, z_b^3 \rangle = 0.92$.
The entries of the position pairs differ to some extent from the solutions to the standard eigenvalue problem in Example 2.1. This is due to the numerical algorithms that are applied to solve the eigenvalues and eigenvectors. Additionally, the signs may also be opposite. This can be seen when comparing the second pairs of positions with those of Example 2.1. This results from the symmetric nature of CCA.
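For illustration, the generalised eigenvalue formulation in (11) can be sketched in a few lines. The helper below is a sketch, not the implementation used for Example 2.2: it assumes the covariance blocks Caa, Cab and Cbb have already been computed and that Caa and Cbb are positive definite, and it relies on scipy.linalg.eigh for the symmetric pair (A, B).

import numpy as np
from scipy.linalg import eigh

def cca_generalised_eig(Caa, Cab, Cbb):
    """Sketch of CCA via the generalised eigenvalue problem (11)."""
    p, q = Caa.shape[0], Cbb.shape[0]
    A = np.block([[np.zeros((p, p)), Cab],
                  [Cab.T,            np.zeros((q, q))]])
    B = np.block([[Caa,              np.zeros((p, q))],
                  [np.zeros((q, p)), Cbb]])
    rho, W = eigh(A, B)                 # symmetric pair: A w = rho B w
    order = np.argsort(-rho)            # eigenvalues come in +/- pairs
    rho, W = rho[order], W[:, order]
    return rho, W[:p, :], W[p:, :]      # positive rho are the canonical correlations

The first min(p, q) columns of the returned blocks are the position pairs, up to sign and scaling.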
Solving CCA Using the SVD. The technique of applying the SVD to solve
the CCA problem was first introduced by [Healy 1957] and described by
[Ewerbring and Luk 1989] as follows. First, the variance matrices Caa and Cbb are
transformed into identity forms. Due to the symmetric positive definite property, the
square root factors of the matrices can be found using a Cholesky or eigenvalue decom-
position:
$C_{aa} = C_{aa}^{1/2} C_{aa}^{1/2}$ and $C_{bb} = C_{bb}^{1/2} C_{bb}^{1/2}$.

Applying the inverses of the square root factors symmetrically on the joint covariance matrix in (2) we obtain

$\begin{pmatrix} C_{aa}^{-1/2} & 0 \\ 0 & C_{bb}^{-1/2} \end{pmatrix} \begin{pmatrix} C_{aa} & C_{ab} \\ C_{ba} & C_{bb} \end{pmatrix} \begin{pmatrix} C_{aa}^{-1/2} & 0 \\ 0 & C_{bb}^{-1/2} \end{pmatrix} = \begin{pmatrix} I_p & C_{aa}^{-1/2} C_{ab} C_{bb}^{-1/2} \\ C_{bb}^{-1/2} C_{ba} C_{aa}^{-1/2} & I_q \end{pmatrix}.$
The position vectors wa and wb can hence be obtained by solving the following SVD
$C_{aa}^{-1/2} C_{ab} C_{bb}^{-1/2} = U^T S V$  (12)
where the columns of the matrices U and V correspond to the sets of orthonormal left
and right singular vectors respectively. The singular values of matrix S correspond to
the canonical correlations. The positions wa and wb are obtained from
$w_a = C_{aa}^{-1/2} U \qquad w_b = C_{bb}^{-1/2} V$
The method is shown in Example 2.3.
Example 2.3. The method of solving CCA using the SVD is demonstrated using the
data of Example 2.1. We compute the matrix

$C_{aa}^{-1/2} C_{ab} C_{bb}^{-1/2} = \begin{pmatrix} -0.02 & 0.90 & -0.06 \\ -0.07 & 0.20 & 0.11 \\ 0.98 & 0.04 & 0.04 \\ 0.01 & -0.02 & -0.93 \end{pmatrix}.$
The SVD gives the factors

$U^T = \begin{pmatrix} -0.03 & -0.03 & 0.95 & -0.30 \\ -0.47 & 0.03 & -0.28 & 0.84 \\ -0.86 & -0.26 & 0.11 & 0.44 \end{pmatrix}, \quad S = \begin{pmatrix} 0.99 & 0.00 & 0.00 \\ 0.00 & 0.94 & 0.00 \\ 0.00 & 0.00 & 0.92 \\ 0.00 & 0.00 & 0.00 \end{pmatrix}, \quad V = \begin{pmatrix} 0.95 & -0.29 & 0.15 \\ 0.01 & -0.44 & -0.90 \\ 0.33 & 0.85 & -0.41 \end{pmatrix}.$
The singular values of the matrix $S$ correspond to the canonical correlations. The positions $w_a$ and $w_b$ are given by

$w_a^1 = C_{aa}^{-1/2} u^1 = \begin{pmatrix} 0.04 \\ 0.00 \\ 0.94 \\ -0.17 \end{pmatrix} \quad w_a^2 = C_{aa}^{-1/2} u^2 = \begin{pmatrix} -0.43 \\ 0.10 \\ -0.43 \\ -0.87 \end{pmatrix} \quad w_a^3 = C_{aa}^{-1/2} u^3 = \begin{pmatrix} -0.91 \\ -0.10 \\ 0.14 \\ 0.56 \end{pmatrix}$

$w_b^1 = C_{bb}^{-1/2} v^1 = \begin{pmatrix} 0.93 \\ 0.04 \\ 0.21 \end{pmatrix} \quad w_b^2 = C_{bb}^{-1/2} v^2 = \begin{pmatrix} -0.40 \\ -0.38 \\ 0.89 \end{pmatrix} \quad w_b^3 = C_{bb}^{-1/2} v^3 = \begin{pmatrix} 0.21 \\ -0.93 \\ -0.50 \end{pmatrix}$

where $u^i$ and $v^i$ for $i = 1, 2, 3$ correspond to the left and right singular vectors. The vectors $w_a^1$, $w_a^2$, and $w_a^3$ and $w_b^1$, $w_b^2$, and $w_b^3$ correspond to the pairs of positions $(w_a^1, w_b^1)$, $(w_a^2, w_b^2)$ and $(w_a^3, w_b^3)$. The canonical correlations are $\langle z_a^1, z_b^1 \rangle = 0.99$, $\langle z_a^2, z_b^2 \rangle = 0.94$, and $\langle z_a^3, z_b^3 \rangle = 0.92$.
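A compact sketch of this SVD route, under the assumption that the covariance blocks are symmetric positive definite so that their square-root factors exist, could look as follows; scipy.linalg.sqrtm is used in place of a Cholesky factor.

import numpy as np
from scipy.linalg import sqrtm

def cca_svd(Caa, Cab, Cbb):
    """Sketch of CCA via the SVD of the whitened cross-covariance (12)."""
    Caa_isqrt = np.linalg.inv(np.real(sqrtm(Caa)))   # Caa^{-1/2}
    Cbb_isqrt = np.linalg.inv(np.real(sqrtm(Cbb)))   # Cbb^{-1/2}
    U, s, Vt = np.linalg.svd(Caa_isqrt @ Cab @ Cbb_isqrt, full_matrices=False)
    Wa = Caa_isqrt @ U        # columns are the positions w_a = Caa^{-1/2} u^i
    Wb = Cbb_isqrt @ Vt.T     # columns are the positions w_b = Cbb^{-1/2} v^i
    return s, Wa, Wb          # s holds the canonical correlations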
The main motivation for improving the eigenvalue-based technique was the compu-
tational complexity. The standard and generalised eigenvalue methods scale with the
cube of the input matrix dimension, in other words, the time complexity is $O(n^3)$ for a matrix of size $n \times n$. The input matrix $C_{aa}^{-1/2} C_{ab} C_{bb}^{-1/2}$ in the SVD-based technique is rectangular. This gives a time complexity of $O(mn^2)$ for a matrix of size $m \times n$. Hence
the SVD-based technique is computationally more tractable for very large datasets.
To recapitulate, the images za and zb of the positions wa and wb that suc-
cessively maximise the canonical correlation can be obtained by solving a stan-
dard [Hotelling 1936] or a generalised eigenvalue problem [Bach and Jordan 2002;
Hardoon et al. 2004] or by applying the SVD [Healy 1957; Ewerbring and Luk 1989].
The CCA problem can also be solved using alternative techniques. The only require-
ments are that the successive images on the unit ball are orthogonal and that the angle
is minimised.
Fig. 1. The entries of the pairs of positions $(w_a^1, w_b^1)$, $(w_a^2, w_b^2)$ and $(w_a^3, w_b^3)$ are shown (panels a3, a4, a1 and b1, b3, b2). The entry of maximum absolute value is coloured blue.
Fig. 2. The biplots are generated using the results of Example 2.1. The biplot on the left shows the relations between the variables when viewed with respect to the images $z_a^1$ and $z_a^2$. The biplot in the middle shows the relations between the variables when viewed with respect to the images $z_a^1$ and $z_a^3$. The biplot on the right shows the relations between the variables when viewed with respect to the images $z_a^2$ and $z_a^3$.
angles between the variable vectors. The extraction of the relations can be enhanced
by changing the pairs of images with which the correlations are computed.
The statistical significance tests of the canonical correlations evaluate whether the
obtained pattern can be considered to occur non-randomly. The sequential test pro-
cedure of Bartlett [Bartlett 1938] determines the number of statistically significant
canonical correlations in the data. The procedure to evaluate the statistical signifi-
cance of the canonical correlations is described in [Fujikoshi and Veitch 1979]. We test
the hypothesis

$H_0 : \min(p, q) = k$ against $H_1 : \min(p, q) > k$  (13)

where $k = 0, 1, \ldots, p$ when $p < q$. If the hypothesis $H_0$ is rejected for $j = 0, 1, \ldots, k - 1$ but accepted for $j = k$, the number of statistically significant canonical correlations can be estimated as $k$. For the test, the Bartlett-
The canonical correlation model can be evaluated by assessing the statistical signif-
icance and testing the generalisability of the relations. The statistical significance of the model can be determined by testing whether the extracted canonical correlations could have arisen by chance. The generalisability of the relations can be assessed us-
ing new observations from the sampling distribution. These evaluation methods can
generally be applied to test the validity of the extracted relations obtained using any
variant of CCA.
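As one concrete way to carry out such a significance assessment, a permutation test (mentioned as an alternative later in the Discussion [Rousu et al. 2013]) can be sketched as follows; first_canonical_correlation is a placeholder for any of the solvers described above, and the number of permutations is illustrative.

import numpy as np

def permutation_test(Xa, Xb, first_canonical_correlation, n_perm=1000, seed=0):
    """p-value for the first canonical correlation under random re-pairing."""
    rng = np.random.default_rng(seed)
    observed = first_canonical_correlation(Xa, Xb)
    null = np.empty(n_perm)
    for i in range(n_perm):
        perm = rng.permutation(Xb.shape[0])          # break the pairing of the views
        null[i] = first_canonical_correlation(Xa, Xb[perm])
    return (np.sum(null >= observed) + 1) / (n_perm + 1)   # one-sided p-value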
Fig. 3. The maximum test canonical correlation, computed over 50 times repeated 5-fold cross-validation,
is obtained at c1 = 0.09.
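The cross-validation procedure summarised in Figure 3 can be sketched as below. The sketch assumes a ridge-type regularised CCA in which c1 I and c2 I are added to the variance matrices [Vinod 1976]; the grid of candidate values, the fold and repetition counts, and the helper names are illustrative rather than the actual implementation behind the figure.

import numpy as np
from scipy.linalg import sqrtm

def first_pair(Xa, Xb, c1, c2):
    """First position pair of a ridge-regularised CCA (illustrative sketch)."""
    n = Xa.shape[0]
    Caa = Xa.T @ Xa / n + c1 * np.eye(Xa.shape[1])
    Cbb = Xb.T @ Xb / n + c2 * np.eye(Xb.shape[1])
    Cab = Xa.T @ Xb / n
    Ca = np.linalg.inv(np.real(sqrtm(Caa)))
    Cb = np.linalg.inv(np.real(sqrtm(Cbb)))
    U, _, Vt = np.linalg.svd(Ca @ Cab @ Cb, full_matrices=False)
    return Ca @ U[:, 0], Cb @ Vt[0, :]

def choose_c1(Xa, Xb, grid, c2=0.0, repeats=50, k=5, seed=0):
    """Pick c1 that maximises the mean test canonical correlation."""
    rng = np.random.default_rng(seed)
    scores = np.zeros(len(grid))
    for _ in range(repeats):
        idx = rng.permutation(Xa.shape[0])
        for fold in np.array_split(idx, k):            # k-fold split of the rows
            tr = np.setdiff1d(idx, fold)
            for j, c1 in enumerate(grid):
                wa, wb = first_pair(Xa[tr], Xb[tr], c1, c2)
                scores[j] += np.corrcoef(Xa[fold] @ wa, Xb[fold] @ wb)[0, 1]
    return grid[int(np.argmax(scores))]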
Fig. 4. The entries of the pairs of positions $(w_a^1, w_b^1)$, $(w_a^2, w_b^2)$ and $(w_a^3, w_b^3)$ are shown (panels a3, a1, a4 and b1, b2, b3). The entry of maximum absolute value is coloured blue. The positive linear relation between a3 and b1, the positive linear relation between a1 and b2 and the negative linear relation between a4 and b3 are extracted by the pairs $(w_a^1, w_b^1)$, $(w_a^2, w_b^2)$, and $(w_a^3, w_b^3)$ respectively.
where $N(\mu, \Sigma)$ denotes the multivariate normal distribution with mean $\mu$ and covariance $\Sigma$. The $S_a$ and $S_b$ correspond to the transformations of the latent variables $y^k \in \mathbb{R}^o$. The $\Psi_a$ and $\Psi_b$ denote the noise covariance matrices. The maximum likelihood estimates of the parameters $S_a$, $S_b$, $\Psi_a$, $\Psi_b$, $\mu_a$ and $\mu_b$ are given by
where $M_a, M_b \in \mathbb{R}^{d \times d}$ are arbitrary matrices such that $M_a M_b^T = P_d$ and the spectral norms of $M_a$ and $M_b$ are smaller than one. $P_d$ is the diagonal matrix of the first $d$ canonical correlations. The $d$ columns of $W_{ad}$ and $W_{bd}$ correspond to the positions $w_a^i$ and $w_b^i$ for $i = 1, 2, \ldots, d$ obtained using any of the standard techniques described in section 2.1.
The posterior expectations of $y$ given $x_a$ and $x_b$ are $E(y|x_a) = M_a^T W_{ad}^T (x_a - \hat{\mu}_a)$ and $E(y|x_b) = M_b^T W_{bd}^T (x_b - \hat{\mu}_b)$. As stated in [Bach and Jordan 2005], regardless of what $M_a$ and $M_b$ are, $E(y|x_a)$ and $E(y|x_b)$ lie in the $d$-dimensional subspaces of $\mathbb{R}^p$ and $\mathbb{R}^q$ which are identical to those obtained by linear CCA. The generative model of
[Bach and Jordan 2005] was further developed in [Archambeau et al. 2006] by replac-
ing the normal noise with the multivariate Student’s t distribution. This improves the
robustness against outlying observations that are then better modeled by the noise
term [Klami et al. 2013].
A Bayesian extension of CCA was proposed by [Klami and Kaski 2007] and
[Wang 2007]. To perform Bayesian analysis, the probabilistic model has to be supple-
mented with prior distributions of the model parameters. In [Klami and Kaski 2007]
and [Wang 2007], the prior distribution of the covariance matrices Ψa and Ψb was cho-
sen to be the inverse-Wishart distribution. The automatic relevance determination
[Neal 2012] prior was selected for the linear transformations Sa and Sb . The inference
on the posterior distribution was made by applying a variational mean-field algorithm
[Wang 2007] and Gibbs sampling [Klami and Kaski 2007].
As in the case of the linear CCA, the variance matrices obtained from high-
dimensional data make the inference of the probabilistic and Bayesian CCA models
difficult [Klami et al. 2013]. This is because the variance matrices need to be inverted
in the inference algorithms. To perform Bayesian CCA on high-dimensional data, di-
mensionality reduction techniques should be applied as a preprocessing step, as has
been done for example in [Huopaniemi et al. 2010].
An advantage of Bayesian CCA, in relation to linear CCA, is the use of prior distributions that make it possible to take the underlying structure of the data into account. Examples of studies where sparse models were obtained by means of the
prior distribution include [Archambeau and Bach 2009] and [Rai and Daume 2009]. In
addition to modeling the structure of the data, in [Klami et al. 2012] the Bayesian CCA
was extended such that any exponential family distribution could model the noise, not
only the normal.
In summary, probabilistic and Bayesian CCA provide alternative ways to interpret
the CCA by means of latent variables. Bayesian CCA may be more feasible in set-
tings where knowledge regarding the data can be incorporated through the prior dis-
tributions. Additionally, the noise can be modelled by exponential family distributions other than the normal distribution.
$K_a(x_a^i, x_a^j) = \langle \phi_a(x_a^i), \phi_a(x_a^j) \rangle_{\mathcal{H}_a}$ and $K_b(x_b^i, x_b^j) = \langle \phi_b(x_b^i), \phi_b(x_b^j) \rangle_{\mathcal{H}_b}$
where i, j = 1, 2, . . . , n. As derived in [Bach and Jordan 2002], the original data matri-
ces Xa ∈ Rn×p and Xb ∈ Rn×q can be substituted by the Gram matrices Ka ∈ Rn×n
and Kb ∈ Rn×n . Let α and β denote the positions in the kernel space Rn that have the
images za = Ka α and zb = Kb β on the unit ball in Rn with a minimum enclosing angle
in between. The kernel CCA problem is hence
As in CCA, the optimisation problem can be solved using the Lagrange multiplier
technique.
$L = \alpha^T K_a^T K_b \beta - \frac{\rho_1}{2}(\alpha^T K_a^2 \alpha - 1) - \frac{\rho_2}{2}(\beta^T K_b^2 \beta - 1)$  (23)

where $\rho_1$ and $\rho_2$ denote the Lagrange multipliers. Differentiating $L$ with respect to $\alpha$ and $\beta$ gives

$\frac{\partial L}{\partial \alpha} = K_a K_b \beta - \rho_1 K_a^2 \alpha = 0$  (24)
$\frac{\partial L}{\partial \beta} = K_b K_a \alpha - \rho_2 K_b^2 \beta = 0$  (25)

Multiplying (24) from the left by $\alpha^T$ and (25) from the left by $\beta^T$ gives

$\alpha^T K_a K_b \beta - \rho_1 \alpha^T K_a^2 \alpha = 0$  (26)
$\beta^T K_b K_a \alpha - \rho_2 \beta^T K_b^2 \beta = 0$.  (27)
where the constants c1 and c2 denote the regularisation parameters. In Example 3.2,
kernel CCA, solved through the generalised eigenvalue problem, is performed on sim-
ulated data.
Example 3.2. We generate a simulated dataset as follows. The data matrices Xa and Xb are of sizes n × p and n × q, where n = 150, p = 7 and q = 8, respectively. The seven variables of Xa are generated from a random univariate normal distribution, a1, a2, . . . , a7 ∼ N(0, 1). We generate the following relations
$b_1 = \exp(a_3) + \xi_1$
$b_2 = a_1^3 + \xi_2$
$b_3 = -a_4 + \xi_3$
where ξ1 ∼ N (0, 0.4), ξ2 ∼ N (0, 0.2) and ξ 3 ∼ N (0, 0.3) denote vectors of normal noise.
The five other variables of Xb are generated from a random univariate normal distri-
bution, b4 , b5 , . . . , b8 ∼ N (0, 1). The data is standardised such that every variable has
zero mean and unit variance.
In kernel CCA, the choice of the kernel function affects what kind of relations can be extracted. In general, a Gaussian kernel $K(x, y) = \exp(-\|x - y\|^2 / 2\sigma^2)$ is used when the data is assumed to contain non-linear relations. The width parameter $\sigma$ determines the non-linearity in the distances between the data points computed in the form of inner products. Increasing the value of $\sigma$ makes the space closer to Euclidean while decreasing it makes the distances more non-linear. The optimal value for $\sigma$ is best determined using a re-sampling method such as a cross-validation scheme, for example a procedure similar to the one presented in Algorithm 1. In this example, we applied the "median trick", presented in [Song et al. 2010], according to which $\sigma$ corresponds to the median of the Euclidean distances computed between all pairs of observations. The median distances for the data in this example were $\sigma_a = 3.53$ and $\sigma_b = 3.62$ for the views $X_a$ and $X_b$ respectively. The kernels were centred by $\tilde{K} = K - \frac{1}{n}\mathbf{j}\mathbf{j}^T K - \frac{1}{n}K\mathbf{j}\mathbf{j}^T + \frac{1}{n^2}(\mathbf{j}^T K \mathbf{j})\mathbf{j}\mathbf{j}^T$ where $\mathbf{j}$ contains only entries of value one [Shawe-Taylor and Cristianini 2004].
In addition to the kernel parameters, the regularisation parameters $c_1$ and $c_2$ also need to be optimised to extract the correct relations. As in the case of regularised CCA, a repeated cross-validation procedure can be applied to identify the optimal pair of parameters. For the data in this example, the optimal regularisation parameters were $c_1 = 1.50$ and $c_2 = 0.60$ when a 20 times repeated 5-fold cross-validation was applied. The first three canonical correlations at the optimal parameter values were $\langle z_a^1, z_b^1 \rangle = 0.95$, $\langle z_a^2, z_b^2 \rangle = 0.89$, and $\langle z_a^3, z_b^3 \rangle = 0.87$.
The interpretation of the relations cannot be performed from the positions α and β
since they are obtained in the kernel spaces. In the case of simulated data, we know
what kind of relations are contained in the data. We can compute the linear correla-
tion coefficient between the simulated relations and the transformed pairs of positions
za and zb [Chang et al. 2013]. The correlation coefficients are shown in Table I. The
exponential relation was extracted in the second pair $(z_a^2, z_b^2)$, the 3rd order polynomial relation was extracted in the third pair $(z_a^3, z_b^3)$ and the linear relation in the first pair $(z_a^1, z_b^1)$.
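A sketch of the pipeline used in this example is given below: Gaussian kernels with widths from the median trick, kernel centring, and a regularised kernel CCA posed as a generalised eigenvalue problem. Because the tutorial's regularised kernel CCA equation is not reproduced above, the regularised blocks (Ka + c1 I)^2 and (Kb + c2 I)^2 are an assumption of this sketch, one common choice in the literature.

import numpy as np
from scipy.linalg import eigh
from scipy.spatial.distance import pdist, squareform

def gaussian_kernel(X, sigma):
    D = squareform(pdist(X, "sqeuclidean"))
    return np.exp(-D / (2 * sigma ** 2))

def median_trick(X):
    return np.median(pdist(X))            # median of the pairwise Euclidean distances

def centre(K):
    n = K.shape[0]
    J = np.ones((n, n)) / n               # (1/n) j j^T
    return K - J @ K - K @ J + J @ K @ J

def kernel_cca(Ka, Kb, c1, c2):
    """Regularised kernel CCA as a generalised eigenvalue problem (a sketch)."""
    n = Ka.shape[0]
    A = np.block([[np.zeros((n, n)), Ka @ Kb],
                  [Kb @ Ka,          np.zeros((n, n))]])
    Ra, Rb = Ka + c1 * np.eye(n), Kb + c2 * np.eye(n)
    B = np.block([[Ra @ Ra,          np.zeros((n, n))],
                  [np.zeros((n, n)), Rb @ Rb]])
    rho, W = eigh(A, B)
    order = np.argsort(-rho)
    return rho[order], W[:n, order], W[n:, order]   # correlations, alpha, beta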
In [Hardoon et al. 2004], an alternative formulation of the standard eigenvalue
problem was presented when the data contains a large number of observations.
If the sample size is large, the dimensionality of the Gram matrices Ka and Kb
can cause computational problems. Partial Gram-Schmidt orthogonalization (PGSO)
[Cristianini et al. 2002] was proposed as a matrix decomposition method. PGSO re-
sults in
$K_a \simeq R_a R_a^T$
$K_b \simeq R_b R_b^T$.
Substituting these into Equations (24) and (25) and multiplying by $R_a^T$ and $R_b^T$ respectively we obtain

$R_a^T R_a R_a^T R_b R_b^T \beta - \rho R_a^T R_a R_a^T R_a R_a^T \alpha = 0$  (36)
$R_b^T R_b R_b^T R_a R_a^T \alpha - \rho R_b^T R_b R_b^T R_b R_b^T \beta = 0$.  (37)
Let $D_{aa} = R_a^T R_a$, $D_{ab} = R_a^T R_b$, $D_{ba} = R_b^T R_a$, and $D_{bb} = R_b^T R_b$ denote the blocks of the new sample covariance matrix. Let $\tilde{\alpha} = R_a^T \alpha$ and $\tilde{\beta} = R_b^T \beta$ denote the positions $\alpha$ and $\beta$ in the reduced space. Using these substitutions in (36) and (37) we obtain

$D_{aa} D_{ab} \tilde{\beta} - \rho D_{aa}^2 \tilde{\alpha} = 0$  (38)
$D_{bb} D_{ba} \tilde{\alpha} - \rho D_{bb}^2 \tilde{\beta} = 0$.  (39)
If $D_{aa}$ and $D_{bb}$ are invertible we can multiply (38) by $D_{aa}^{-1}$ and (39) by $D_{bb}^{-1}$ which gives

$D_{ab} \tilde{\beta} - \rho D_{aa} \tilde{\alpha} = 0$  (40)
$D_{ba} \tilde{\alpha} - \rho D_{bb} \tilde{\beta} = 0$  (41)

and hence

$\tilde{\beta} = \frac{D_{bb}^{-1} D_{ba} \tilde{\alpha}}{\rho}$  (42)

which, after a substitution into (38), results in a generalised eigenvalue problem

$D_{ab} D_{bb}^{-1} D_{ba} \tilde{\alpha} = \rho^2 D_{aa} \tilde{\alpha}$.  (43)
To formulate the problem as a standard eigenvalue problem, let $D_{aa} = S S^T$ denote the complete Cholesky decomposition where $S$ is a lower triangular matrix and let $\hat{\alpha} = S^T \tilde{\alpha}$. Substituting these into (43) we obtain

$S^{-1} D_{ab} D_{bb}^{-1} D_{ba} (S^T)^{-1} \hat{\alpha} = \rho^2 \hat{\alpha}$.

If regularisation using the parameter $\kappa$ is combined with dimensionality reduction the problem becomes

$S^{-1} D_{ab} (D_{bb} + \kappa I)^{-1} D_{ba} (S^T)^{-1} \hat{\alpha} = \rho^2 \hat{\alpha}$.  (44)
A numerical example of the method presented by [Hardoon et al. 2004] is given in Ex-
ample 3.3.
Example 3.3. We generate a simulated dataset as follows. The data matrices Xa and Xb are of sizes n × p and n × q, where n = 10000, p = 7 and q = 8, respectively. The seven variables of Xa are generated from a random univariate normal distribution, a1, a2, . . . , a7 ∼ N(0, 1). We generate the following relations

$b_1 = \exp(a_3) + \xi_1$
$b_2 = a_1^3 + \xi_2$
$b_3 = -a_4 + \xi_3$
where ξ1 ∼ N (0, 0.4), ξ2 ∼ N (0, 0.2) and ξ 3 ∼ N (0, 0.3) denote vectors of normal noise.
The five other variables of Xb are generated from a random univariate normal distri-
bution, b4 , b5 , . . . , b8 ∼ N (0, 1). The data is standardised such that every variable has
zero mean and unit variance.
A Gaussian kernel is used for both views. The width parameter is set using the
median trick to σa = 3.56 and σb = 3.60. The kernels were centred. The positions α and β are found by solving the standard eigenvalue problem in (44) and applying Equation (42). We set the regularisation parameter κ = 0.5.
The first three canonical correlations at the optimal parameter values were $\langle z_a^1, z_b^1 \rangle = 0.97$, $\langle z_a^2, z_b^2 \rangle = 0.97$, and $\langle z_a^3, z_b^3 \rangle = 0.96$. The correlation coefficients between the simulated relations and the transformed variables are shown in Table II. The exponential relation was extracted in the first pair $(z_a^1, z_b^1)$, the 3rd order polynomial relation was extracted in the second pair $(z_a^2, z_b^2)$ and the linear relation in the third pair $(z_a^3, z_b^3)$.
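The reduced problem of (43)-(44) can be sketched as follows, assuming that low-rank factors Ra and Rb with Ka ≈ Ra Ra^T and Kb ≈ Rb Rb^T are already available (for instance from PGSO or another incomplete decomposition, which is not shown) and that Daa is positive definite.

import numpy as np
from scipy.linalg import eigh

def reduced_kernel_cca(Ra, Rb, kappa=0.5):
    """Reduced kernel CCA of (43)-(44); assumes Daa is positive definite."""
    Daa, Dbb, Dab = Ra.T @ Ra, Rb.T @ Rb, Ra.T @ Rb
    Rbb = Dbb + kappa * np.eye(Dbb.shape[0])
    M = Dab @ np.linalg.solve(Rbb, Dab.T)          # Dab (Dbb + kappa I)^{-1} Dba
    rho2, alpha_t = eigh(M, Daa)                   # eigenvalues are rho^2
    order = np.argsort(-rho2)
    rho = np.sqrt(np.clip(rho2[order], 0.0, None))
    alpha_t = alpha_t[:, order]                    # positions in the reduced space
    beta_t = np.linalg.solve(Rbb, Dab.T @ alpha_t) / np.maximum(rho, 1e-12)   # (42)
    # the images are approximately z_a = Ra @ alpha_t and z_b = Rb @ beta_t
    return rho, alpha_t, beta_t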
Non-linear relations are also taken into account through neural networks which are
employed in deep CCA [Andrew et al. 2013]. In deep CCA, every observation $x_a^k \in \mathbb{R}^p$ and $x_b^k \in \mathbb{R}^q$ for $k = 1, 2, \ldots, n$ is non-linearly transformed multiple times in an iterative manner through a layered network. The number of units in a layer determines the dimension of the output vector which is passed to the next layer. As is explained in [Andrew et al. 2013], let the first layer have $c_1$ units and the final layer $o$ units. The output vector of the first layer for the observation $x_a^1 \in \mathbb{R}^p$ is $h_1 = s(S_1^1 x_a^1 + b_1^1) \in \mathbb{R}^{c_1}$, where $S_1^1 \in \mathbb{R}^{c_1 \times p}$ is a matrix of weights, $b_1^1 \in \mathbb{R}^{c_1}$ is a vector of biases, and $s : \mathbb{R} \mapsto \mathbb{R}$ is a non-linear function applied to each element. The logistic and tanh functions are examples of popular non-linear functions. The output vector $h_1$ is then used to compute the output of the following layer in a similar manner. The final transformed vector $f_1(x_a^1) = s(S_d^1 h_{d-1} + b_d^1)$ is in the space $\mathbb{R}^o$, for a network with $d$ layers. The same procedure is applied to the observations $x_b^k \in \mathbb{R}^q$ for $k = 1, 2, \ldots, n$.
In deep CCA, the aim is to learn the optimal parameters Sd and bd for both views
such that the correlation between the transformed observations is maximised. Let
Ha ∈ Ro×n and Hb ∈ Ro×n denote the matrices that have the final transformed output
vectors in their columns. Let $\tilde{H}_a = H_a - \frac{1}{n} H_a \mathbf{1}$ denote the centered data matrix and let $\hat{C}_{ab} = \frac{1}{m-1} \tilde{H}_a \tilde{H}_b^T$ and $\hat{C}_{aa} = \frac{1}{m-1} \tilde{H}_a \tilde{H}_a^T + r_a I$, where $r_a$ is a regularisation constant, denote the covariance and variance matrices. The same formulae are used to compute the covariance and variance matrices for view $b$. As in section 2.1, the total correlation of the top $k$ components of $H_a$ and $H_b$ is the sum of the top $k$ singular values of the matrix $T = \hat{C}_{aa}^{-1/2} \hat{C}_{ab} \hat{C}_{bb}^{-1/2}$. If $k = o$, the correlation is given by the trace norm of $T$, that is

$\mathrm{corr}(H_a, H_b) = \mathrm{tr}(T^T T)^{1/2}$.
The optimal parameters Sd and bd maximise the trace norm using gradient-based op-
timisation. The details of the algorithm can be found in [Andrew et al. 2013].
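The correlation objective itself is easy to sketch: given final network outputs Ha and Hb, the code below computes the regularised covariance matrices, forms T, and returns its trace norm. Gradient-based training of the network parameters is not shown, and the regularisation constants are illustrative.

import numpy as np
from scipy.linalg import sqrtm

def dcca_correlation(Ha, Hb, ra=1e-4, rb=1e-4):
    """Trace-norm correlation objective of deep CCA for outputs Ha, Hb (o x n)."""
    o, n = Ha.shape
    Ha_c = Ha - Ha.mean(axis=1, keepdims=True)       # centre the output vectors
    Hb_c = Hb - Hb.mean(axis=1, keepdims=True)
    Cab = Ha_c @ Hb_c.T / (n - 1)
    Caa = Ha_c @ Ha_c.T / (n - 1) + ra * np.eye(o)   # regularised variance matrices
    Cbb = Hb_c @ Hb_c.T / (n - 1) + rb * np.eye(o)
    T = np.real(np.linalg.inv(sqrtm(Caa))) @ Cab @ np.real(np.linalg.inv(sqrtm(Cbb)))
    return np.linalg.svd(T, compute_uv=False).sum()  # trace norm = sum of singular values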
In summary, kernel and deep CCA provide alternatives to the linear CCA when the
relations in the data can be considered to be non-linear and the sample size is small in
relation to the data dimensionality. When applying kernel CCA on a real dataset, prior
knowledge of the relations of interest can help in the analysis of the results. If the data
is assumed to contain both linear and non-linear relations a Gaussian kernel could be
a first option. The choice of the kernel function depends on what kind of relations the
data can be considered to contain. The possible relations can be extracted by testing
how the image pairs correlate with the functions of variables. Deep CCA provides an
alternative to compute maximal correlation between the views although the neural
network makes the identification of the type of relations difficult.
3.4. Improving the Interpretability by Enforcing Sparsity
The extraction of the linear relations between the variables in CCA and regularised
CCA relies on the values of the entries of the position vectors that have images on
the unit ball with a minimum enclosing angle. The relations can be inferred when the
number of variables is not too large for a human to interpret. However, in modern data
analysis, it is common that the number of variables is of the order of tens of thousands.
In this case, the values of the entries of the position vectors should be constrained such
that only a subset of the variables would have a non-zero value. This would facilitate
the interpretation since only a fraction of the total number of variables need to be
considered when inferring the relations.
To constrain some of the values of the entries of the position vectors to zero, which is also referred to as enforcing sparsity, tools of convex analysis can be applied. In the literature, sparsity has been enforced on the position vectors using soft-thresholding oper-
ators [Parkhomenko et al. 2007], elastic net regularisation [Waaijenborg et al. 2008],
penalised matrix decomposition combined with soft-thresholding [Witten et al. 2009],
and convex least squares optimisation [Hardoon and Shawe-Taylor 2011]. The sparse
CCA formulations presented in [Parkhomenko et al. 2007; Waaijenborg et al. 2008;
Witten et al. 2009] find sparse position vectors that can be applied to infer lin-
ear relations between the variables with non-zero entries. The formulation in
[Hardoon and Shawe-Taylor 2011] differs from the preceding propositions in terms of
the optimisation criterion. The canonical correlation is found between the image ob-
tained from the linear transformation defined by the data space of one view and the
image obtained from the linear transformation defined by the kernel of the other view.
The selection of which sparse CCA should be applied for a specific task depends on the
research question and prior knowledge regarding the variables.
The sparse CCA algorithm of [Parkhomenko et al. 2007] can be applied when the
aim is to find sparse position vectors and no prior knowledge regarding the variables is
available. The positions and images are solved using the SVD, as presented in Section
2.2. Sparsity is enforced on the entries of the positions by iteratively applying the
soft-thresholding operator [Donoho and Johnstone 1995] on the pair of left and right
where $u^k$ denotes the $k$th column of the matrix $U$, $v^k$ denotes the $k$th column of the matrix $V$, $\sigma_k$ denotes the $k$th singular value on the diagonal of $S$, $M(r)$ is the set of rank $r$ $n \times p$ matrices and $r \ll K$. In the case of CCA, the matrix to be approximated is the covariance matrix $X = C_{ab}$. The optimisation problem in the PMD context is given by

$\min_{w_a \in \mathbb{R}^p, w_b \in \mathbb{R}^q} \frac{1}{2} \|C_{ab} - \sigma w_a w_b^T\|_F^2$,
$\|w_a\|_2 = 1, \quad \|w_b\|_2 = 1,$
$\|w_a\|_1 \le c_1, \quad \|w_b\|_1 \le c_2, \quad \sigma \ge 0$

which is equivalent to

$\cos\theta = \max_{w_a \in \mathbb{R}^p, w_b \in \mathbb{R}^q} w_a^T C_{ab} w_b$,
where $c > 0$ is a constant. The following formula is applied in the derivation of the algorithm

$\max_u \langle u, a \rangle, \quad \text{s.t. } \|u\|_2^2 \le 1, \; \|u\|_1 \le c.$

The solution is given by $u = \frac{S(a, \delta)}{\|S(a, \delta)\|_2}$ with $\delta = 0$ if $\|u\|_1 \le c$. Otherwise, $\delta$ is selected such that $\|u\|_1 = c$. Sparse position vectors are then obtained by Algorithm 2. At every iteration, the $\delta_1$ and $\delta_2$ are selected by binary search. To obtain several rank-1 approximations, a deflation step is included such that when the converged vectors $w_a$ and $w_b$ are found, the extracted relation is subtracted from the covariance matrix $C_{ab}^{k+1} \leftarrow C_{ab}^k - \sigma_k w_a^k (w_b^k)^T$. In this way, the successive solutions remain orthogonal, which is a constraint of CCA.
Example 3.4. To demonstrate the PMD formulation of sparse CCA, we generate the following data. The data matrices Xa and Xb are of sizes n × p and n × q, where n = 50, p = 100 and q = 150, respectively. The variables of Xa are generated from a random univariate normal distribution, a1 , a2 , · · · , a100 ∼ N (0, 1). We generate the following linear relations
following linear relations
b1 = a3 + ξ 1 (45)
b2 = a1 + ξ 2 (46)
b3 = −a4 + ξ 3 (47)
where ξ 1 ∼ N (0, 0.08), ξ2 ∼ N (0, 0.07), and ξ 3 ∼ N (0, 0.05) denote vectors of normal
noise. The other variables of Xb are generated from a random univariate normal dis-
tribution, b4 , b5 , · · · , b150 ∼ N (0, 1). The data is standardised such that every variable
has zero mean and unit variance.
We apply the R implementation of [Witten et al. 2009] which is available in the PMA package. We extract three rank-1 approximations. The values of the entries of the pairs of position vectors $(w_a^1, w_b^1)$, $(w_a^2, w_b^2)$ and $(w_a^3, w_b^3)$ corresponding to the canonical correlations $\langle z_a^1, z_b^1 \rangle = 0.95$, $\langle z_a^2, z_b^2 \rangle = 0.92$, $\langle z_a^3, z_b^3 \rangle = 0.91$ are shown in Figure 5. The first rank-1 approximation extracted (47), the second (46), and the third (45).
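An outline of the PMD iterations is sketched below; it follows the soft-thresholding, binary-search, and deflation steps described above only in outline and is not a substitute for the PMA package used in this example. The L1 bounds c1 and c2 are assumed to lie between 1 and the square root of the dimension so that the constraints are feasible.

import numpy as np

def soft(a, delta):
    return np.sign(a) * np.maximum(np.abs(a) - delta, 0.0)

def l1_constrained_unit(a, c):
    """Unit L2 vector proportional to S(a, delta) with ||.||_1 <= c (binary search on delta)."""
    u = a / max(np.linalg.norm(a), 1e-12)
    if np.linalg.norm(u, 1) <= c:
        return u                                   # delta = 0 suffices
    lo, hi = 0.0, np.abs(a).max()
    for _ in range(60):
        delta = (lo + hi) / 2
        u = soft(a, delta)
        u = u / max(np.linalg.norm(u), 1e-12)
        if np.linalg.norm(u, 1) > c:
            lo = delta
        else:
            hi = delta
    return u

def pmd_sparse_cca(Cab, c1, c2, n_pairs=3, n_iter=100, seed=0):
    """Alternating rank-1 PMD updates with deflation (an illustrative outline)."""
    rng = np.random.default_rng(seed)
    C = Cab.copy()
    Wa, Wb, sigmas = [], [], []
    for _ in range(n_pairs):
        wb = rng.standard_normal(C.shape[1])
        wb /= np.linalg.norm(wb)
        for _ in range(n_iter):                    # alternate the two sparse updates
            wa = l1_constrained_unit(C @ wb, c1)
            wb = l1_constrained_unit(C.T @ wa, c2)
        sigma = wa @ C @ wb
        C = C - sigma * np.outer(wa, wb)           # deflation step
        Wa.append(wa); Wb.append(wb); sigmas.append(sigma)
    return np.array(Wa).T, np.array(Wb).T, np.array(sigmas)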
The sparse CCA of [Hardoon and Shawe-Taylor 2011] is a sparse convex least
squares formulation that differs from the preceding versions. The canonical correlation is found between the linear transformations of a data space view and a kernel space view. The aim is to find a sparse set of variables in the data space view that relate
to a sparse set of observations, represented in terms of relative similarities, in the ker-
nel space view. An example of a setting, where relations of this type can provide useful
Fig. 5. The values of the entries of the position vector pairs $(w_a^1, w_b^1)$, $(w_a^2, w_b^2)$ and $(w_a^3, w_b^3)$ obtained using the PMD method for sparse CCA are shown (panels a4, a1, a3 and b3, b2, b1). The entry of maximum absolute value is coloured blue. The negative linear relation between a4 and b3 is extracted in the first rank-1 approximation. The positive linear relations between a1 and b2 and a3 and b1 are extracted in the second and third rank-1 approximations.
Example 3.5. In the sparse CCA of [Hardoon and Shawe-Taylor 2011], the idea is
to determine the relations of the variables in the data space view Xa to the observa-
tions in the kernel space view Kb where the observations comprise the variables of the
view b. This setting differs from all of the previous examples where the idea was to find
relations between the variables. Since one of the views is kernelised, the relations can-
not be explicitly simulated. We therefore demonstrate the procedure on data generated from a random univariate normal distribution as follows. The data matrices Xa and Xb are of sizes n × p and n × q, where n = 50, p = 100 and q = 150, respectively. The variables of Xa and Xb are generated from a random univariate normal distribution, a1 , a2 , · · · , a100 ∼ N (0, 1) and b1 , b2 , · · · , b150 ∼ N (0, 1) respectively. The data is standardised such that every variable has zero mean and unit variance.
The Gaussian kernel function $K(x, y) = \exp(-\|x - y\|^2 / 2\sigma^2)$ is used to compute the similarities for the view $b$. The choice of the kernel is justified since the underlying distribution is normal. The width parameter is set to $\sigma = 17.25$ using the median trick. The kernel matrix is centred by $\tilde{K} = K - \frac{1}{n}\mathbf{j}\mathbf{j}^T K - \frac{1}{n}K\mathbf{j}\mathbf{j}^T + \frac{1}{n^2}(\mathbf{j}^T K \mathbf{j})\mathbf{j}\mathbf{j}^T$ where $\mathbf{j}$ contains only entries of value one [Shawe-Taylor and Cristianini 2004].
To find the positions $w_a$ and $\beta$, we solve

$f = \min_{w_a, \beta} \|X_a w_a - K_b \beta\|^2 + \mu \|w_a\|_1 + \gamma \|\tilde{\beta}\|_1 \quad \text{s.t. } \|\beta\|_\infty = 1$
using the implementation proposed in [Uurtio et al. 2015]. As stated in
[Hardoon and Shawe-Taylor 2011], to determine which variable in the data space
view Xa is most related to the observation in Kb , the algorithm needs to be run for
all possible values of k. This means that every observation is in turn set as a basis
for comparison and a sparse set of the remaining observations β̃ is computed. The
optimal value of k gives the minimum objective value f .
We run the algorithm by initially setting the value of the entry $\beta_k = 1$ for $k = 1, 2, \ldots, n$. The minimum objective value $f = 0.03$ was obtained at $k = 29$. This corresponds to a canonical correlation of $\langle z_a, z_b \rangle = 0.88$. The values of the entries of $w_a$ and $\beta$ are shown in Figure 6. The observation corresponding to $k = 29$ in the kernelised view $K_b$ is most related to the variables $a_{15}$, $a_{16}$, $a_{18}$, $a_{20}$, and $a_{24}$.
The sparse versions of CCA can be applied in settings where the large number of variables hinders the inference of the relations. When the interest is to extract sparse linear relations between the variables, the proposed algorithms of
[Parkhomenko et al. 2007; Waaijenborg et al. 2008; Witten et al. 2009] provide a solu-
tion. The algorithm of [Hardoon and Shawe-Taylor 2011] can be applied if the focus is
to find how the variables of one view relate to the observations that correspond to the
combined sets of the variables in the other view. In other words, the approach is useful
if the focus is not to uncover the explicit relations between the variables but to gain insight into how a variable relates to the complete set of variables of an observation.
Fig. 6. The values of the entries of the positions wa and β at the optimal value of k are shown.
4. DISCUSSION
This tutorial presented an overview of the methodological evolution of canonical correlation methods, focusing on the original linear, regularised, kernel, and sparse CCA.
Succinct reviews were also conducted on the Bayesian and neural network-based deep
CCA variants. The aim was to explain the theoretical foundations of the variants us-
ing the linear algebraic interpretation of CCA. The methods to solve the CCA problems
were described using numerical examples. Additionally, techniques to assess the sta-
tistical significance of the extracted relations and the generalisability of the patterns
were explained. The aim was to delineate the applicabilities of the different CCA vari-
ants in relation to the properties of the data.
In CCA, the aim is to determine linear relations between variables belonging to two
sets. From a linear algebraic point of view, the relations can be found by analysing
the linear transformations defined by the two views of the data. The most distinct
relations are obtained by analysing the entries of the first pair of position vectors in
the two data spaces that are mapped onto a unit ball such that their images have a
minimum enclosing angle. The less distinct relations can be identified from the suc-
cessive pairs of position vectors that correspond to the images with a minimum en-
closing angle obtained from the orthogonal complements of the preceding pairs of im-
ages. This tutorial presented three standard ways of solving the CCA problem, that is
by solving either a standard [Hotelling 1935; Hotelling 1936] or a generalised eigen-
value problem [Bach and Jordan 2002; Hardoon et al. 2004], or by applying the SVD
[Healy 1957; Ewerbring and Luk 1989].
The position vectors of the two data spaces, which convey the related pairs of variables, can be obtained using techniques other than the ones selected for this tutorial. The three methods were chosen because they have been widely applied in the CCA literature and they are relatively straightforward to explain and implement. Additionally,
to understand the further extensions of CCA, it is important to know how it originally
has been solved. The extensions are often further developed versions of the standard
techniques.
For didactic purposes, the synthetic datasets used for the worked examples were de-
signed to represent optimal data settings for the particular CCA variants to uncover
the relations. The relations were generated to be one-to-one, in other words one vari-
able in one view was related with only one variable in the other view. In real datasets,
which are often much larger than the synthetic ones in this paper, the relations may
not be one-to-one but rather many-to-many (one-to-two, two-to-three, etc.). As in the
worked examples, these relations can also be inferred by examining the entries of the
position vectors of the two data spaces. However, the understanding of how the one-to-
one relations are extracted provides means to uncover the more complex relations.
To apply the linear CCA, the sample size needs to exceed the number of variables
of both views which means that the system is required to be overdetermined. This is
to guarantee the non-singularity of the variance matrices. If the sample size is not
sufficient, regularisation [Vinod 1976] or Bayesian CCA [Klami et al. 2013] can be ap-
plied. The feasibility of regularisation has not been studied in relation to the number
of variables exceeding the number of observations. Improving the invertibility by in-
troducing additional bias has been shown to work in various settings but the limit
when the system is too underdetermined that regularisation cannot assist in recov-
ering the underlying relations has not been resolved. Bayesian CCA is more robust
against outlying observations, when compared with linear CCA, due to its generative
model structure.
In addition to linear relations, non-linear relations are taken into account in ker-
nelised and neural network-based CCA. Kernel methods enable the extraction of
non-linear relations through the mapping to a Hilbert space [Bach and Jordan 2002;
Hardoon et al. 2004]. When applying kernel methods in CCA, the disparity between the number of observations and variables can be huge due to the very high-dimensional kernel-induced feature spaces, a challenge that is tackled by regularisation. The types of relations that can be extracted are determined by the kernel function that is selected for the mapping. Linear relations are extracted by a linear kernel and non-linear relations
by non-linear kernel functions such as the Gaussian kernel. Although kernelisation
extends the range of extractable relations, it also complicates the identification of the
type of relation. A method to determine the type of relation involves testing how the
image vectors correlate with a certain type of function. However, this may be difficult
if no prior knowledge of the relations is available. Further research on how to select the optimal kernel functions to determine the most distinct relations underlying the data could facilitate the final inference. Neural network-based deep CCA is an
alternative to kernelised CCA, when the aim is to find a high correlation between the
final output vectors obtained through multiple non-linear transformations. However,
due to the network structure, it is not straightforward to identify the relations between
the variables.
As a final branch of the CCA evolution, this tutorial covered sparse versions of
CCA. Sparse CCA variants have been developed to facilitate the extraction of the
relations when the data dimensionality is too high for human interpretation. This
has been addressed by enforcing sparsity on the entries of the position vectors
[Parkhomenko et al. 2007; Waaijenborg et al. 2008; Witten et al. 2009]. As an alterna-
tive to operating in the data spaces, [Hardoon and Shawe-Taylor 2011] proposed a
primal-dual sparse CCA in which the relations are obtained between the variables
of one view and observations of the other. The sparse variants of CCA in this tutorial
were selected based on how much they have been applied in the literature. As a limitation of the selected variants, sparsity is enforced on the entries of the position vectors without regard to the possible underlying dependencies between the variables, which has been addressed in the literature on structured sparsity [Chen et al. 2012].
In addition to studying the techniques of solving the optimisation problems of CCA
variants, this tutorial gave a brief introduction to evaluating the canonical correlation
model. Bartlett’s sequential test procedure [Bartlett 1938; Bartlett 1941] was given as
an example of a standard method to assess the statistical significance of the canonical
correlations. The techniques of identifying the related variables through visual inspec-
tion of biplots [Meredith 1964; Ter Braak 1990] were presented. To assess whether the
extracted relations can be considered to occur in any data with the same underly-
ing sampling distribution, the method of applying both training and test data was
explained. As an alternative method, the statistical significance of the canonical cor-
relation model could be assessed using permutation tests [Rousu et al. 2013]. The vi-
sualisation of the results using the biplots is mainly applicable in the case of linear
relations. Alternative approaches could be considered to visualise the non-linear rela-
tions extracted by kernel CCA.
To conclude, this tutorial compiled the original, regularised, kernel, and sparse CCA
into a unified framework to emphasise the applicabilities of the four variants in dif-
ferent data settings. The work highlights which CCA variant is most applicable de-
pending on the sample size, data dimensionality and the type of relations of interest.
Techniques for extracting the relations are also presented. Additionally, the impor-
tance of assessing the statistical significance and generalisability of the relations is
emphasised. The tutorial hopefully advances both the practice of CCA variants in data
analysis and further development of novel extensions.
The software used to produce the examples in this paper is available for download at [Link]
ACKNOWLEDGMENTS
The work by Viivi Uurtio and Juho Rousu has been supported in part by Academy of Finland (grant
295496/D4Health). João M. Monteiro was supported by a PhD studentship awarded by Fundação para a
Ciência e a Tecnologia (SFRH/BD/88345/2012). John Shawe-Taylor acknowledges the support of the EPSRC
through the C-PLACID project Reference: EP/M006093/1.
REFERENCES
S Akaho. 2001. A Kernel Method For Canonical Correlation Analysis. In Proceedings of the International Meeting of the Psychometric Society (IMPS2001).
Md A Alam, M Nasser, and K Fukumizu. 2008. Sensitivity analysis in robust and kernel canonical correla-
tion analysis. In Computer and Information Technology, 2008. ICCIT 2008. 11th International Confer-
ence on. IEEE, 399–404.
TW Anderson. 2003. An introduction to multivariate statistical analysis. (2003).
G Andrew, R Arora, J Bilmes, and K Livescu. 2013. Deep canonical correlation analysis. In International
Conference on Machine Learning. 1247–1255.
C Archambeau and FR Bach. 2009. Sparse probabilistic projections. In Advances in neural information
processing systems. 73–80.
C Archambeau, N Delannay, and M Verleysen. 2006. Robust probabilistic projections. In Proceedings of the
23rd International conference on machine learning. ACM, 33–40.
S Arlot, A Celisse, and others. 2010. A survey of cross-validation procedures for model selection. Statistics
surveys 4 (2010), 40–79.
F Bach, R Jenatton, J Mairal, G Obozinski, and others. 2011. Convex optimization with sparsity-inducing
norms. Optimization for Machine Learning 5 (2011).
FR Bach and MI Jordan. 2002. Kernel independent component analysis. Journal of machine learning re-
search 3, Jul (2002), 1–48.
FR Bach and MI Jordan. 2005. A probabilistic interpretation of canonical correlation analysis. (2005).
MS Bartlett. 1938. Further aspects of the theory of multiple regression. In Mathematical Proceedings of the
Cambridge Philosophical Society, Vol. 34. Cambridge Univ Press, 33–40.
MS Bartlett. 1941. The statistical significance of canonical correlations. Biometrika 32, 1 (1941), 29–37.
B Baur and S Bozdag. 2015. A canonical correlation analysis-based dynamic bayesian network prior to infer
gene regulatory networks from multiple types of biological data. Journal of Computational Biology 22,
4 (2015), 289–299.
Å Björck and GH Golub. 1973. Numerical methods for computing angles between linear subspaces. Mathe-
matics of computation 27, 123 (1973), 579–594.
MB Blaschko, CH Lampert, and A Gretton. 2008. Semi-supervised laplacian regularization of kernel canon-
ical correlation analysis. In Joint European Conference on Machine Learning and Knowledge Discovery
in Databases. Springer, 133–145.
MW Browne. 2000. Cross-validation methods. Journal of mathematical psychology 44, 1 (2000), 108–132.
E Burg and J Leeuw. 1983. Non-linear canonical correlation. British journal of mathematical and statistical
psychology 36, 1 (1983), 54–80.
J Cai. 2013. The distance between feature subspaces of kernel canonical correlation analysis. Mathematical
and Computer Modelling 57, 3 (2013), 970–975.
L Cao, Z Ju, J Li, R Jian, and C Jiang. 2015. Sequence detection analysis based on canonical correlation
for steady-state visual evoked potential brain computer interfaces. Journal of neuroscience methods 253
(2015), 10–17.
JD Carroll. 1968. Generalization of canonical correlation analysis to three or more sets of variables. In
Proceedings of the 76th annual convention of the American Psychological Association, Vol. 3. 227–228.
B Chang, U Krüger, R Kustra, and J Zhang. 2013. Canonical Correlation Analysis based on Hilbert-Schmidt
Independence Criterion and Centered Kernel Target Alignment.. In ICML (2). 316–324.
X Chen, S Chen, H Xue, and X Zhou. 2012. A unified dimensionality reduction framework for semi-paired
and semi-supervised multi-view data. Pattern Recognition 45, 5 (2012), 2005–2018.
X Chen, C He, and H Peng. 2014. Removal of muscle artifacts from single-channel EEG based on ensemble
empirical mode decomposition and multiset canonical correlation analysis. Journal of Applied Mathe-
matics 2014 (2014).
X Chen, H Liu, and JG Carbonell. 2012. Structured sparse canonical correlation analysis. In International
Conference on Artificial Intelligence and Statistics. 199–207.
A Cichonska, J Rousu, P Marttinen, AJ Kangas, P Soininen, T Lehtimäki, OT Raitakari, M-R Järvelin,
V Salomaa, M Ala-Korpela, and others. 2016. metaCCA: Summary statistics-based multivariate meta-
analysis of genome-wide association studies using canonical correlation analysis. Bioinformatics (2016),
btw052.
N Cristianini, J Shawe-Taylor, and H Lodhi. 2002. Latent semantic kernels. Journal of Intelligent Informa-
tion Systems 18, 2-3 (2002), 127–152.
R Cruz-Cano and MLT Lee. 2014. Fast regularized canonical correlation analysis. Computational Statistics
& Data Analysis 70 (2014), 88–100.
J Dauxois and GM Nkiet. 1997. Canonical analysis of two Euclidean subspaces and its applications. Linear
Algebra Appl. 264 (1997), 355–388.
DL Donoho and IM Johnstone. 1995. Adapting to unknown smoothness via wavelet shrinkage. Journal of
the american statistical association 90, 432 (1995), 1200–1224.
RB Dunham and DJ Kravetz. 1975. Canonical correlation analysis in a predictive system. The Journal of
Experimental Education 43, 4 (1975), 35–42.
C Eckart and G Young. 1936. The approximation of one matrix by another of lower rank. Psychometrika 1,
3 (1936), 211–218.
B Efron. 1979. Computers and the theory of statistics: thinking the unthinkable. SIAM Review 21, 4 (1979),
460–480.
LM Ewerbring and FT Luk. 1989. Canonical correlations and generalized SVD: applications and new algo-
rithms. In 32nd Annual Technical Symposium. International Society for Optics and Photonics, 206–222.
J Fang, D Lin, SC Schulz, Z Xu, VD Calhoun, and Y-P Wang. 2016. Joint sparse canonical correlation analysis
for detecting differential imaging genetics modules. Bioinformatics 32, 22 (2016), 3480–3488.
Y Fujikoshi and LG Veitch. 1979. Estimation of dimensionality in canonical correlation analysis. Biometrika
66, 2 (1979), 345–351.
K Fukumizu, FR Bach, and A Gretton. 2007. Statistical consistency of kernel canonical correlation analysis.
Journal of Machine Learning Research 8, Feb (2007), 361–383.
C Fyfe and PL Lai. 2000. Canonical correlation analysis neural networks. In Pattern Recognition, 2000.
Proceedings. 15th International Conference on, Vol. 2. IEEE, 977–980.
GH Golub and CF Van Loan. 2012. Matrix computations. Vol. 3. JHU Press.
GH Golub and H Zha. 1995. The canonical correlations of matrix pairs and their numerical computation. In
Linear algebra for signal processing. Springer, 27–49.
I González, S Déjean, PGP Martin, O Gonçalves, P Besse, and A Baccini. 2009. Highlighting relationships
between heterogeneous biological data through graphical displays based on regularized canonical cor-
relation analysis. Journal of Biological Systems 17, 02 (2009), 173–199.
BK Gunderson and RJ Muirhead. 1997. On estimating the dimensionality in canonical correlation analysis.
Journal of Multivariate Analysis 62, 1 (1997), 121–136.
DR Hardoon, J Mourao-Miranda, M Brammer, and J Shawe-Taylor. 2007. Unsupervised analysis of fMRI
data using kernel canonical correlation. NeuroImage 37, 4 (2007), 1250–1259.
DR Hardoon and J Shawe-Taylor. 2009. Convergence analysis of kernel canonical correlation analysis: the-
ory and practice. Machine learning 74, 1 (2009), 23–38.
DR Hardoon and J Shawe-Taylor. 2011. Sparse canonical correlation analysis. Machine Learning 83, 3
(2011), 331–353.
DR Hardoon, S Szedmak, and J Shawe-Taylor. 2004. Canonical correlation analysis: An overview with ap-
plication to learning methods. Neural computation 16, 12 (2004), 2639–2664.
MJR Healy. 1957. A rotation method for computing canonical correlations. Math. Comp. 11, 58 (1957), 83–
86.
C Heij and B Roorda. 1991. A modified canonical correlation approach to approximate state space modelling.
In Decision and Control, 1991., Proceedings of the 30th IEEE Conference on. IEEE, 1343–1348.
AE Hoerl and RW Kennard. 1970. Ridge regression: Biased estimation for nonorthogonal problems. Techno-
metrics 12, 1 (1970), 55–67.
JW Hooper. 1959. Simultaneous equations and canonical correlation theory. Econometrica: Journal of the
Econometric Society (1959), 245–256.
CE Hopkins. 1969. Statistical analysis by canonical correlation: a computer application. Health services
research 4, 4 (1969), 304.
P Horst. 1961. Relations among sets of measures. Psychometrika 26, 2 (1961), 129–149.
H Hotelling. 1935. The most predictable criterion. Journal of Educational Psychology 26, 2 (1935), 139.
H Hotelling. 1936. Relations between two sets of variates. Biometrika 28, 3/4 (1936), 321–377.
WW Hsieh. 2000. Nonlinear canonical correlation analysis by neural networks. Neural Networks 13, 10
(2000), 1095–1105.
I Huopaniemi, T Suvitaival, J Nikkilä, M Orešič, and S Kaski. 2010. Multivariate multi-way analysis of
multi-source data. Bioinformatics 26, 12 (2010), i391–i398.
A Kabir, RD Merrill, AA Shamim, RDW Klemm, AB Labrique, P Christian, KP West Jr, and M Nasser. 2014.
Canonical correlation analysis of infant’s size at birth and maternal factors: a study in rural Northwest
Bangladesh. PloS one 9, 4 (2014), e94243.
M Kang, B Zhang, X Wu, C Liu, and J Gao. 2013. Sparse generalized canonical correlation analysis for
biological model integration: a genetic study of psychiatric disorders. In Engineering in Medicine and
Biology Society (EMBC), 2013 35th Annual International Conference of the IEEE. IEEE, 1490–1493.
JR Kettenring. 1971. Canonical analysis of several sets of variables. Biometrika (1971), 433–451.
A Kimura, M Sugiyama, T Nakano, H Kameoka, H Sakano, E Maeda, and K Ishiguro. 2013. SemiCCA:
Efficient semi-supervised learning of canonical correlations. Information and Media Technologies 8, 2
(2013), 311–318.
A Klami and S Kaski. 2007. Local dependent components. In Proceedings of the 24th international conference
on Machine learning. ACM, 425–432.
A Klami, S Virtanen, and S Kaski. 2012. Bayesian exponential family projections for coupled data sources.
arXiv preprint arXiv:1203.3489 (2012).
A Klami, S Virtanen, and S Kaski. 2013. Bayesian canonical correlation analysis. Journal of Machine Learn-
ing Research 14, Apr (2013), 965–1003.
D Krstajic, LJ Buturovic, DE Leahy, and S Thomas. 2014. Cross-validation pitfalls when selecting and
assessing regression and classification models. Journal of cheminformatics 6, 1 (2014), 1.
PL Lai and C Fyfe. 1999. A neural implementation of canonical correlation analysis. Neural Networks 12,
10 (1999), 1391–1397.
PL Lai and C Fyfe. 2000. Kernel and nonlinear canonical correlation analysis. International Journal of
Neural Systems 10, 05 (2000), 365–377.
NB Larson, GD Jenkins, MC Larson, RA Vierkant, TA Sellers, CM Phelan, JM Schildkraut, R Sutphen,
PPD Pharoah, SA Gayther, and others. 2014. Kernel canonical correlation analysis for assessing gene–
gene interactions and application to ovarian cancer. European Journal of Human Genetics 22, 1 (2014),
126–131.
SC Larson. 1931. The shrinkage of the coefficient of multiple correlation. Journal of Educational Psychology
22, 1 (1931), 45.
H-S Lee. 2007. Canonical correlation analysis using small number of samples. Communications in Statistics -
Simulation and Computation 36, 5 (2007), 973–985.
SE Leurgans, RA Moyeed, and BW Silverman. 1993. Canonical correlation analysis when the data are
curves. Journal of the Royal Statistical Society. Series B (Methodological) (1993), 725–740.
H Lindsey, JT Webster, and S Halpern. 1985. Canonical Correlation as a Discriminant Tool in a Periodontal
Problem. Biometrical journal 27, 3 (1985), 257–264.
P Marttinen, J Gillberg, A Havulinna, J Corander, and S Kaski. 2013. Genome-wide association studies with
high-dimensional phenotypes. Statistical applications in genetics and molecular biology 12, 4 (2013),
413–431.
T Melzer, M Reiter, and H Bischof. 2001. Nonlinear feature extraction using generalized canonical correla-
tion analysis. In International Conference on Artificial Neural Networks. Springer, 353–360.
T Melzer, M Reiter, and H Bischof. 2003. Appearance models based on kernel canonical correlation analysis.
Pattern recognition 36, 9 (2003), 1961–1971.
W Meredith. 1964. Canonical correlations with fallible data. Psychometrika 29, 1 (1964), 55–65.
MS Monmonier and FE Finn. 1973. Improving the interpretation of geographical canonical correlation mod-
els. The Professional Geographer 25, 2 (1973), 140–142.
M Nakanishi, Y Wang, Y-T Wang, and T-P Jung. 2015. A Comparison Study of Canonical Correlation
Analysis Based Methods for Detecting Steady-State Visual Evoked Potentials. PloS one 10, 10 (2015),
e0140703.
RM Neal. 2012. Bayesian learning for neural networks. Vol. 118. Springer Science & Business Media.
T Ogura, Y Fujikoshi, and T Sugiyama. 2013. A variable selection criterion for two sets of principal compo-
nent scores in principal canonical correlation analysis. Communications in Statistics-Theory and Meth-
ods 42, 12 (2013), 2118–2135.
E Parkhomenko, D Tritchler, and J Beyene. 2007. Genome-wide sparse canonical correlation of gene expres-
sion with genotypes. In BMC proceedings, Vol. 1. BioMed Central Ltd, S119.
P Rai and H Daume. 2009. Multi-label prediction via sparse infinite CCA. In Advances in Neural Information
Processing Systems. 1518–1526.
J Rousu, DD Agranoff, O Sodeinde, J Shawe-Taylor, and D Fernandez-Reyes. 2013. Biomarker discovery by
sparse canonical correlation analysis of complex clinical phenotypes of tuberculosis and malaria. PLoS
Comput Biol 9, 4 (2013), e1003018.
Y Saad. 2011. Numerical methods for large eigenvalue problems. Vol. 158. SIAM.
T Sakurai. 2009. Asymptotic expansions of test statistics for dimensionality and additional information in
canonical correlation analysis when the dimension is large. Journal of Multivariate Analysis 100, 5
(2009), 888–901.
BK Sarkar and C Chakraborty. 2015. DNA pattern recognition using canonical correlation algorithm. Jour-
nal of biosciences 40, 4 (2015), 709–719.
SV Schell and WA Gardner. 1995. Programmable canonical correlation analysis: A flexible framework for
blind adaptive spatial filtering. IEEE transactions on signal processing 43, 12 (1995), 2898–2908.
B Schölkopf, A Smola, and K-R Müller. 1998. Nonlinear component analysis as a kernel eigenvalue problem.
Neural computation 10, 5 (1998), 1299–1319.
JA Seoane, C Campbell, INM Day, JP Casas, and TR Gaunt. 2014. Canonical correlation analysis for gene-
based pleiotropy discovery. PLoS Comput Biol 10, 10 (2014), e1003876.
J Shawe-Taylor and N Cristianini. 2004. Kernel methods for pattern analysis. Cambridge university press.
X-B Shen, Q-S Sun, and Y-H Yuan. 2013. Orthogonal canonical correlation analysis and its application in
feature fusion. In Information Fusion (FUSION), 2013 16th International Conference on. IEEE, 151–
157.
C Soneson, H Lilljebjörn, T Fioretos, and M Fontes. 2010. Integrative analysis of gene expression and copy
number alterations using canonical correlation analysis. BMC bioinformatics 11, 1 (2010), 1.
L Song, B Boots, SM Siddiqi, GJ Gordon, and A Smola. 2010. Hilbert space embeddings of hidden Markov
models. (2010).
Y Song, PJ Schreier, D Ramírez, and T Hasija. 2016. Canonical correlation analysis of high-dimensional
data with very small sample support. Signal Processing 128 (2016), 449–458.
M Stone. 1974. Cross-validatory choice and assessment of statistical predictions. Journal of the royal statis-
tical society. Series B (Methodological) (1974), 111–147.
MJ Sullivan. 1982. Distribution of Edaphic Diatoms in a Mississippi Salt Marsh: A Canonical Correlation
Analysis. Journal of Phycology 18, 1 (1982), 130–133.
A Tenenhaus, C Philippe, and V Frouin. 2015. Kernel generalized canonical correlation analysis. Computa-
tional Statistics & Data Analysis 90 (2015), 114–131.
A Tenenhaus, C Philippe, V Guillemot, K-A Le Cao, J Grill, and V Frouin. 2014. Variable selection for
generalized canonical correlation analysis. Biostatistics (2014), kxu001.
A Tenenhaus and M Tenenhaus. 2011. Regularized generalized canonical correlation analysis. Psychome-
trika 76, 2 (2011), 257–284.
CJF Ter Braak. 1990. Interpreting canonical correlation analysis through biplots of structure correlations
and weights. Psychometrika 55, 3 (1990), 519–531.
R Tibshirani. 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society.
Series B (Methodological) (1996), 267–288.
XM Tu. 1991. A bootstrap resampling scheme for using the canonical correlation technique in rank estima-
tion. Journal of chemometrics 5, 4 (1991), 333–343.
XM Tu, DS Burdick, DW Millican, and LB McGown. 1989. Canonical correlation technique for rank estima-
tion of excitation-emission matrixes. Analytical Chemistry 61, 19 (1989), 2219–2224.
V Uurtio, M Bomberg, K Nybo, M Itävaara, and J Rousu. 2015. Canonical correlation methods for exploring
microbe-environment interactions in deep subsurface. In International Conference on Discovery Science.
Springer, 299–307.
JP Van de Geer. 1984. Linear relations among k sets of variables. Psychometrika 49, 1 (1984), 79–94.
T Van Gestel, JAK Suykens, J De Brabanter, B De Moor, and J Vandewalle. 2001. Kernel canonical cor-
relation analysis and least squares support vector machines. In International Conference on Artificial
Neural Networks. Springer, 384–389.
HD Vinod. 1976. Canonical ridge and econometrics of joint production. Journal of Econometrics 4, 2 (1976),
147–166.
S Waaijenborg, PC Verselewel de Witt Hamer, and AH Zwinderman. 2008. Quantifying the association
between gene expressions and DNA-markers by penalized canonical correlation analysis. Statistical
Applications in Genetics and Molecular Biology 7, 1 (2008).
C Wang. 2007. Variational Bayesian approach to canonical correlation analysis. IEEE Transactions on Neu-
ral Networks 18, 3 (2007), 905–910.
D Wang, L Shi, DS Yeung, and ECC Tsang. 2005. Nonlinear canonical correlation analysis of fMRI signals
using HDR models. In 2005 IEEE Engineering in Medicine and Biology 27th Annual Conference. IEEE,
5896–5899.
GC Wang, N Lin, and B Zhang. 2013. Dimension reduction in functional regression using mixed data canon-
ical correlation analysis. Stat Interface 6 (2013), 187–196.
DS Watkins. 2004. Fundamentals of matrix computations. Vol. 64. John Wiley & Sons.
FV Waugh. 1942. Regressions between sets of variables. Econometrica, Journal of the Econometric Society
(1942), 290–310.
DM Witten, R Tibshirani, and T Hastie. 2009. A penalized matrix decomposition, with applications to sparse
principal components and canonical correlation analysis. Biostatistics (2009), kxp008.
KW Wong, PCW Fung, and CC Lau. 1980. Study of the mathematical approximations made in the basis-
correlation method and those made in the canonical-transformation method for an interacting Bose gas.
Physical Review A 22, 3 (1980), 1272.
T Yamada and T Sugiyama. 2006. On the permutation test in canonical correlation analysis. Computational
statistics & data analysis 50, 8 (2006), 2111–2123.
H Yamamoto, H Yamaji, E Fukusaki, H Ohno, and H Fukuda. 2008. Canonical correlation analysis for mul-
tivariate regression and its application to metabolic fingerprinting. Biochemical Engineering Journal
40, 2 (2008), 199–204.
Y-H Yuan, Q-S Sun, and H-W Ge. 2014. Fractional-order embedding canonical correlation analysis and its
applications to multi-view dimensionality reduction and recognition. Pattern Recognition 47, 3 (2014),
1411–1424.
Y-H Yuan, Q-S Sun, Q Zhou, and D-S Xia. 2011. A novel multiset integrated canonical correlation analysis
framework and its application in feature fusion. Pattern Recognition 44, 5 (2011), 1031–1040.
B Zhang, J Hao, G Ma, J Yue, and Z Shi. 2014. Semi-paired probabilistic canonical correlation analysis. In
International Conference on Intelligent Information Processing. Springer, 1–10.
H Zou and T Hastie. 2005. Regularization and variable selection via the elastic net. Journal of the Royal
Statistical Society: Series B (Statistical Methodology) 67, 2 (2005), 301–320.