Entropy Partial Transport with Tree Metrics: Theory and Practice

Tam Le∗ (RIKEN AIP)        Truyen Nguyen∗ (University of Akron)

Abstract

Optimal transport (OT) theory provides powerful tools to compare probability measures. However, OT is limited to nonnegative measures having the same mass, and suffers serious drawbacks in its computation and statistics. This has led to several proposals of regularized variants of OT in the recent literature. In this work, we consider an entropy partial transport (EPT) problem for nonnegative measures on a tree having different masses. The EPT is shown to be equivalent to a standard complete OT problem on a one-node extended tree. We derive its dual formulation, then leverage this to propose a novel regularization for EPT which admits fast computation and negative definiteness. To our knowledge, the proposed regularized EPT is the first approach that yields a closed-form solution among available variants of unbalanced OT for general nonnegative measures. For practical applications without prior knowledge about the tree structure for measures, we propose tree-sliced variants of the regularized EPT, computed by averaging the regularized EPT between these measures using random tree metrics, built adaptively from support data points. Exploiting the negative definiteness of our regularized EPT, we introduce a positive definite kernel, and evaluate it against other baselines on benchmark tasks such as document classification with word embeddings and topological data analysis. In addition, we empirically demonstrate that our regularization also provides effective approximations.

Proceedings of the 24th International Conference on Artificial Intelligence and Statistics (AISTATS) 2021, San Diego, California, USA. PMLR: Volume 130. Copyright 2021 by the author(s). ∗: The two authors contributed equally.

1 Introduction

Optimal transport (OT) theory offers powerful tools to compare probability measures (Villani, 2008). OT has been applied to various tasks in machine learning (Courty et al., 2017; Bunne et al., 2019; Nadjahi et al., 2019; Peyré and Cuturi, 2019), statistics (Mena and Niles-Weed, 2019; Weed and Berthet, 2019) and computer graphics (Solomon et al., 2015; Lavenant et al., 2018). However, OT requires input measures to have the same mass, which may limit its applications in practice since one often needs to deal with measures of unequal masses. For instance, in natural language processing, we can view a document as a measure where each word is regarded as a point in the support with a unit mass. Thus, documents with different lengths lead to associated measures having different masses.

To tackle the transport problem for measures having different masses, Caffarelli and McCann (2010) proposed partial optimal transport (POT), where one only transports a fixed amount of mass from one measure into another. Later, Figalli (2010) extended the theory of POT, notably regarding the uniqueness of solutions. A different approach is to optimize the sum of a transport functional and two convex entropy functionals which quantify the deviation of the marginals of the transport plan from the input measures (Liero et al., 2018), i.e., the optimal entropy transport (OET) problem. This formulation recovers many previous works. For example, when the entropy is equal to the total variation distance or the ℓ2 distance, the OET is respectively equivalent to the generalized Wasserstein distance (Piccoli and Rossi, 2014, 2016) or the unbalanced mass transport (Benamou, 2003). It is worth noting that the generalized Wasserstein distance shares the same spirit as the Kantorovich-Rubinstein discrepancy (Hanin, 1992; Guittet, 2002; Lellmann et al., 2014). Another variant is the unnormalized optimal transport (Gangbo et al., 2019), which mixes the Wasserstein distance and the ℓp distance. There are several applications of the transport problem for measures having different masses, such as in machine learning (Frogner et al., 2015; Janati et al., 2019), deep learning (Yang and Uhler, 2019), topological data analysis (Lacombe et al., 2018), computational imaging (Lee et al., 2019), and computational biology (Schiebinger et al., 2019).

One important case of the OET problem is when the entropy is equal to the Kullback-Leibler (KL) divergence and a particular cost function is used; then OET is equivalent to the Kantorovich-Hellinger distance (i.e., the Wasserstein-Fisher-Rao distance) (Chizat et al., 2018; Liero et al., 2018). In addition, one can apply a Sinkhorn-based algorithm to efficiently solve the OET problem when the entropy is equal to the KL divergence, i.e., the Sinkhorn-based approach for unbalanced optimal transport (Sinkhorn-UOT) (Frogner et al., 2015; Chizat et al., 2018). Pham et al. (2020) showed that the complexity of the Sinkhorn-based algorithm for Sinkhorn-UOT is quadratic, which is similar to the case of entropic regularized OT (Cuturi, 2013) for probability measures. However, for large-scale applications where the supports of measures contain a large number of points, the computation of Sinkhorn-UOT becomes prohibitive.

Following the sliced-Wasserstein (SW) distance (Rabin et al., 2011; Bonneel et al., 2015), which projects supports into a one-dimensional space and employs the closed-form solution of univariate optimal transport (1d-OT), Bonneel and Coeurjolly (2019) proposed the sliced partial optimal transport (SPOT) for nonnegative measures having different masses. Unlike standard 1d-OT, one does not have a closed-form solution for measures of unequal masses supported in a one-dimensional space. With an assumption of a unit mass on each support, Bonneel and Coeurjolly (2019) derived an efficient algorithm to solve the SPOT problem with quadratic complexity in the worst case; in practice, their proposed algorithm is nearly linear. However, as in SW, SPOT uses one-dimensional projections of supports, which limits its capacity to capture the structure of a distribution, especially in high-dimensional settings (Le et al., 2019b; Liutkus et al., 2019).

In this work, we aim to develop an efficient and scalable approach for the transport problem when input measures have different masses. Inspired by the tree-sliced Wasserstein (TSW) distance (Le et al., 2019b), which has a fast closed-form computation and remedies the curse of dimensionality of SW, we propose to consider the entropy partial transport (EPT) problem with tree metrics. At a high level, our main contribution is three-fold:

• We establish a relationship between the EPT problem with mass constraint and a formulation with a Lagrangian multiplier. Then, we employ it to transform the EPT problem into a standard complete OT problem on a suitable one-node extended tree.

• We derive a dual formulation for our EPT problem. We then leverage it to propose a novel regularization which admits a closed-form formula and negative definiteness. Consequently, we introduce positive definite kernels for our regularized EPT. We also derive tree-sliced variants of the regularized EPT for applications without prior knowledge about the tree structure for measures.

• We empirically show that (i) our regularization provides both efficient approximations and fast computations, and (ii) the performances of the proposed kernels for our regularized EPT compare favorably with other baselines in applications.

The paper is organized as follows: we review tree metrics and introduce important notations in §2. In §3, we develop the theory for EPT with tree metrics and derive an efficient regularization for EPT computation in practice. In §4, we distinguish our approach from other related work in the literature. Then, we evaluate our proposal on document classification with word embeddings and topological data analysis in §5, before giving a conclusion in §6. We have released code for our proposal¹.

2 Preliminaries

Let T = (V, E) be a tree rooted at node r, with nonnegative edge lengths {we}e∈E, where V is the collection of nodes and E is the collection of edges. For convenience, we use T to denote the set of all nodes together with all points on its edges². We then recall the definition of a tree metric (Semple and Steel, 2003, §7, p.145–182) as follows:

Definition 2.1 (Tree metric). A metric dT : Ω × Ω → [0, ∞) is called a tree metric on Ω if there exists a tree T such that Ω ⊆ T and, for x, y ∈ Ω, dT(x, y) equals the length of the (unique) path between x and y.

Assume that V is a subset of a vector space, and let dT(·, ·) be the tree metric on T. Hereafter, the unique shortest path in T connecting x and y is denoted by [x, y]. Let ω be the unique Borel measure (i.e., the length measure) on T satisfying ω([x, y]) = dT(x, y) for all x, y ∈ T. Given x ∈ T, the set Λ(x) stands for the subtree below x. Precisely,

    Λ(x) := {y ∈ T : x ∈ [r, y]}.    (1)

We shall use the notation M(T) to represent the set of all nonnegative Borel measures on T with a finite mass.

¹ [Link]
² Tree T has a finite number of nodes, but all points on its edges are also considered part of T, and so the tree includes an infinite number of points.

Also let C(T) be the set of all continuous functions on T, and let L∞(T) be the collection of all Borel measurable functions on T that are bounded ω-a.e. Then, L∞(T) is a Banach space under the norm

    ‖f‖L∞(T) := inf{a ∈ R : |f(x)| ≤ a for ω-a.e. x ∈ T}.

3 Entropy Partial Transport (EPT) with Tree Metrics

Let b ≥ 0 be a constant, c : T × T → R be a continuous cost with c(x, x) = 0, F1, F2 : [0, ∞) → (0, ∞) be entropy functions which are convex and lower semicontinuous, and let w1, w2 : T → [0, ∞) be two nonnegative weights. For µ, ν ∈ M(T), consider the set

    Π≤(µ, ν) := {γ ∈ M(T × T) : γ1 ≤ µ, γ2 ≤ ν},

with γi (i = 1, 2) denoting the ith marginal of the measure γ. For γ ∈ Π≤(µ, ν), the Radon-Nikodym derivatives of γ1 w.r.t. µ and of γ2 w.r.t. ν exist due to γ1 ≤ µ and γ2 ≤ ν. From now on, we let f1 and f2 respectively denote these Radon-Nikodym derivatives, i.e., γ1 = f1 µ and γ2 = f2 ν. Then 0 ≤ f1 ≤ 1 µ-a.e. and 0 ≤ f2 ≤ 1 ν-a.e. Throughout the paper, m̄ stands for the minimum of the total masses of µ and ν, that is, m̄ := min{µ(T), ν(T)}. Inspired by Caffarelli and McCann (2010); Liero et al. (2018), we fix a number m ∈ [0, m̄] and consider the following EPT problem:

    Wc,m(µ, ν) := inf_{γ ∈ Π≤(µ,ν), γ(T×T) = m} [ F1(γ1|µ) + F2(γ2|ν) + b ∫_{T×T} c(x, y) γ(dx, dy) ],    (2)

where F1(γ1|µ) := ∫_T w1(x) F1(f1(x)) µ(dx) and F2(γ2|ν) := ∫_T w2(x) F2(f2(x)) ν(dx) are the weighted relative entropies. The role of the two entropies in the minimization problem is to force the marginals of γ to be close to µ and ν respectively. Let us introduce a Lagrange multiplier λ ∈ R conjugate to the constraint γ(T × T) = m. As a result, we instead study the following formulation

    ETc,λ(µ, ν) := inf_{γ ∈ Π≤(µ,ν)} [ F1(γ1|µ) + F2(γ2|ν) + b ∫_{T×T} [c(x, y) − λ] γ(dx, dy) ].

In this paper, we focus on the specific entropy functions F1(s) = F2(s) = |s − 1|. Thus, the quantity of interest becomes

    ETc,λ(µ, ν) = inf_{γ ∈ Π≤(µ,ν)} Cλ(γ),    (3)

where Cλ(γ) is defined as follows:

    Cλ(γ) := ∫_T w1 [1 − f1(x)] µ(dx) + ∫_T w2 [1 − f2(x)] ν(dx) + b ∫_{T×T} [c(x, y) − λ] γ(dx, dy)
           = ∫_T w1 µ(dx) + ∫_T w2 ν(dx) − ∫_T w1 γ1(dx) − ∫_T w2 γ2(dx) + b ∫_{T×T} [c(x, y) − λ] γ(dx, dy).    (4)

Notice that problem (3) is a generalization of the generalized Wasserstein distance W1^{a,b}(µ, ν) introduced in (Piccoli and Rossi, 2014, 2016). We next display some relationships between problem (2) with mass constraint m and problem (3) with Lagrange multiplier λ. For this, let Γ0(λ) denote the set of all optimal plans (i.e., minimizers γ) for ETc,λ(µ, ν). Then, since Cλ(γ) is an affine function of γ ∈ Π≤(µ, ν), the set Γ0(λ) is a nonempty convex set. Indeed, for any γ̃, γ̂ ∈ Γ0(λ) and for any t ∈ [0, 1], we have (1 − t)γ̃ + tγ̂ ∈ Γ0(λ), due to Cλ((1 − t)γ̃ + tγ̂) = (1 − t)Cλ(γ̃) + tCλ(γ̂) ≤ (1 − t)Cλ(γ) + tCλ(γ) = Cλ(γ) for every γ ∈ Π≤(µ, ν). The following result extends (Caffarelli and McCann, 2010, Corollary 2.1) and reveals the connection between problem (2) and problem (3).

Theorem 3.1. Let u(λ) := −ETc,λ(µ, ν) for λ ∈ R, and denote

    ∂u(λ) := {p ∈ R : u(t) ≥ u(λ) + p(t − λ), ∀t ∈ R}

for the set of all subgradients of u at λ. Also, set ∂u(R) := ∪_{λ∈R} ∂u(λ). Then, we have

i) u is a convex function on R, and ∂u(λ) = {b γ(T × T) : γ ∈ Γ0(λ)} for all λ ∈ R. Also, if λ1 < λ2, then m1 ≤ m2 for every m1 ∈ ∂u(λ1) and m2 ∈ ∂u(λ2).

ii) u is differentiable at λ if and only if every optimal plan in Γ0(λ) has the same mass. When this happens, we in addition have u′(λ) = b γ(T × T) for any γ ∈ Γ0(λ).

iii) If there exists a constant M > 0 such that w1(x) + w2(y) ≤ b c(x, y) + M for all x, y ∈ T, then ∂u(R) = [0, b m̄]. Moreover, u(λ) = −∫_T w1 µ(dx) − ∫_T w2 ν(dx) when λ < −M, and u′(λ) = b m̄ for λ > ‖c‖_{L∞(T×T)}.

Proof is placed in the Supplementary (§A.1). For any m ∈ [0, m̄], part iii) of Theorem 3.1 implies that there exists λ ∈ R such that b m ∈ ∂u(λ). It then follows from part i) of this theorem that m = γ∗(T × T) for some γ∗ ∈ Γ0(λ). It is also clear that this γ∗ is an optimal plan for Wc,m(µ, ν), and

    Wc,m(µ, ν) = ETc,λ(µ, ν) + λ b m.

Thus, solving the auxiliary problem (3) gives us a solution to the original problem (2). When u is differentiable, the relation between m and λ is given explicitly by u′(λ) = b m. Note that the above selection of λ is unique only if the function u is strictly convex. Nevertheless, it enjoys the following monotonicity regardless of uniqueness: if m1 < m2, then λ1 ≤ λ2. Indeed, we have m1 = γ¹(T × T) and m2 = γ²(T × T) for some γ¹ ∈ Γ0(λ1) and γ² ∈ Γ0(λ2). Since γ¹(T × T) < γ²(T × T), one has λ1 ≤ λ2 by part i) of Theorem 3.1.

To investigate problem (3), we recast it as a standard complete OT problem by using an observation in (Caffarelli and McCann, 2010). More precisely, let ŝ be a point outside T and consider the set T̂ := T ∪ {ŝ}. We next extend the cost function to T̂ × T̂ as follows:

    ĉ(x, y) :=  b[c(x, y) − λ]   if x, y ∈ T,
                w1(x)            if x ∈ T and y = ŝ,
                w2(y)            if x = ŝ and y ∈ T,
                0                if x = y = ŝ.

The measures µ, ν are extended accordingly by adding a Dirac mass at the isolated point ŝ: µ̂ = µ + ν(T)δŝ and ν̂ = ν + µ(T)δŝ. As µ̂, ν̂ have the same total mass on T̂, we can consider the standard complete OT problem between µ̂ and ν̂:

    KT(µ̂, ν̂) := inf_{γ̂ ∈ Γ(µ̂,ν̂)} ∫_{T̂×T̂} ĉ(x, y) γ̂(dx, dy),    (5)

where Γ(µ̂, ν̂) := {γ̂ ∈ M(T̂ × T̂) : µ̂(U) = γ̂(U × T̂), ν̂(U) = γ̂(T̂ × U) for all Borel sets U ⊂ T̂}.

A one-to-one correspondence between γ ∈ Π≤(µ, ν) and γ̂ ∈ Γ(µ̂, ν̂) is given by

    γ̂ = γ + [(1 − f1)µ] ⊗ δŝ + δŝ ⊗ [(1 − f2)ν] + γ(T × T) δ(ŝ,ŝ).    (6)

Indeed, if γ ∈ Π≤(µ, ν), then it is clear that γ̂ defined by (6) satisfies γ̂ ∈ Γ(µ̂, ν̂). The converse is guaranteed by the next technical result.

Lemma 3.2. For γ̂ ∈ Γ(µ̂, ν̂), let γ be the restriction of γ̂ to T × T. Then, relation (6) holds and γ ∈ Π≤(µ, ν).

Proof is placed in the Supplementary (§A.2).

These observations in particular display the following connection between the EPT problem and the standard complete OT problem.

Proposition 3.3 (EPT versus complete OT). For every µ, ν ∈ M(T), we have ETc,λ(µ, ν) = KT(µ̂, ν̂). Moreover, relation (6) gives a one-to-one correspondence between optimal solutions γ of the EPT problem (3) and optimal solutions γ̂ of the standard complete OT problem (5).

Proof is placed in the Supplementary (§A.3).

3.1 Dual Formulations

The relationship given in Proposition 3.3 allows us to obtain the dual formulation of the EPT problem (3) from that of problem (5), proved in (Caffarelli and McCann, 2010, Corollary 2.6).

Theorem 3.4 (Dual formula for general cost). For any λ ≥ 0 and nonnegative weights w1(x), w2(x), we have

    ETc,λ(µ, ν) = sup_{(u,v) ∈ K} [ ∫_T u(x) µ(dx) + ∫_T v(x) ν(dx) ],

where K := {(u, v) : u ≤ w1, −bλ + inf_{x∈T} [b c(x, y) − w1(x)] ≤ v(y) ≤ w2(y), u(x) + v(y) ≤ b[c(x, y) − λ]}.

Proof is placed in the Supplementary (§A.4).

This dual formula is our main theoretical result: it leads to our novel efficient regularization for the EPT (see §3.2), and it can be rewritten more explicitly when the cost function c is a tree metric. Hereafter, we use c(x, y) = dT(x, y). To ease notation, we simply write ETλ(µ, ν) for ET_{dT,λ}(µ, ν).

Corollary 3.5 (Dual formula for tree metric). Assume that λ ≥ 0 and the nonnegative weights w1, w2 are b-Lipschitz w.r.t. dT. Then, we have

    ETλ(µ, ν) = sup{ ∫_T f d(µ − ν) : f ∈ L } − (bλ/2) [µ(T) + ν(T)],    (7)

where L := {f ∈ C(T) : −w2 − bλ/2 ≤ f ≤ w1 + bλ/2, |f(x) − f(y)| ≤ b dT(x, y)}.

Proof is placed in the Supplementary (§A.5).

Corollary 3.5 extends the dual formulation for the generalized Wasserstein distance W1^{a,b}(µ, ν) proved in (Piccoli and Rossi, 2016, Theorem 2) and (Chung and Trinh, 2019). In the next section, we will leverage (7) to propose an effective regularization for computation in practice.

Remark 3.6. An example of a b-Lipschitz weight is w(x) = a1 dT(x, x0) + a0 for some x0 ∈ T and some constants a1 ∈ [0, b] and a0 ∈ [0, ∞).

As a consequence of the dual formulation, we obtain the following geometric properties:

Proposition 3.7 (Geometric structures of the metric d). Assume that λ ≥ 0 and the weights w1, w2 are positive and b-Lipschitz w.r.t. dT. Define d(µ, ν) := ETλ(µ, ν) + (bλ/2) [µ(T) + ν(T)]. Then, we have

i) d(µ + σ, ν + σ) = d(µ, ν), ∀σ ∈ M(T).

ii) d is a divergence and satisfies the triangle inequality d(µ, ν) ≤ d(µ, σ) + d(σ, ν).

iii) If in addition w1 = w2, then (M(T), d) is a complete metric space. Moreover, it is a geodesic space in the sense that for every two points µ and ν in M(T) there exists a path ϕ : [0, a] → M(T) with a := d(µ, ν) such that ϕ(0) = µ, ϕ(a) = ν, and d(ϕ(t), ϕ(s)) = |t − s| for all t, s ∈ [0, a].

Proof is placed in the Supplementary (§A.6).

Let m ∈ [0, m̄], and choose λ ≥ 0 such that there exists an optimal plan γ⁰ for ETλ(µ, ν) with γ⁰(T × T) = m. As pointed out right after Theorem 3.1, this choice of λ is possible. Then, the proof of Lemma A.1 in the Supplementary (§A.6) shows that

    inf_{γ ∈ Π≤(µ,ν), γ(T×T) = m} [ F1(γ1|µ) + F2(γ2|ν) + b ∫_{T×T} c(x, y) γ(dx, dy) ] ≤ d(µ, ν).

Moreover, equality happens if and only if there exists an optimal plan γ⁰ for ETλ(µ, ν) such that m = γ⁰(T × T) = (1/2)[µ(T) + ν(T)]. The necessary conditions for the latter to hold are µ(T) = ν(T) and m = m̄.

3.2 An Efficient Regularization for Entropy Partial Transport with Tree Metrics

First observe that any f ∈ L can be represented by

    f(x) = f(r) + ∫_{[r,x]} g(y) ω(dy)

for some function g ∈ L∞(T) with ‖g‖_{L∞(T)} ≤ b. Note that the condition |f(x) − f(y)| ≤ b dT(x, y) is equivalent to ‖g‖_{L∞(T)} ≤ b. It follows that L ⊂ L⁰, where, for 0 ≤ α ≤ (1/2)[bλ + w1(r) + w2(r)], we define Lα as the collection of all functions f of the form

    f(x) = s + ∫_{[r,x]} g(y) ω(dy),

with s being a constant in the interval [−w2(r) − bλ/2 + α, w1(r) + bλ/2 − α] and with ‖g‖_{L∞(T)} ≤ b. This leads us to consider the following regularization of ETλ(µ, ν):

    ET̃^α_λ(µ, ν) := sup{ ∫_T f d(µ − ν) : f ∈ Lα } − (bλ/2) [µ(T) + ν(T)].    (8)

In particular, when α = 0, since L ⊂ L⁰, the quantity ET̃^0_λ(µ, ν) is an upper bound of ETλ(µ, ν) through the dual formulation. The next result gives a closed-form formula for ET̃^α_λ(µ, ν) and is our main formula used for computation in practice.

Proposition 3.8 (Closed-form for regularized EPT). Assume that λ, w1(r), w2(r) are nonnegative numbers. Then, for 0 ≤ α ≤ (1/2)[bλ + w1(r) + w2(r)], we have

    ET̃^α_λ(µ, ν) = b ∫_T |µ(Λ(x)) − ν(Λ(x))| ω(dx) + [wi(r) + bλ/2 − α] |µ(T) − ν(T)| − (bλ/2) [µ(T) + ν(T)],

with i := 1 if µ(T) ≥ ν(T), and i := 2 if µ(T) < ν(T). In particular, the map α ↦ ET̃^α_λ(µ, ν) is nonincreasing and

    |ET̃^{α1}_λ(µ, ν) − ET̃^{α2}_λ(µ, ν)| = |α1 − α2| |µ(T) − ν(T)|.

Proof is placed in the Supplementary (§A.7).

It is also possible to use ET̃^α_λ(µ, ν) to upper or lower bound the distance ETλ(µ, ν), as follows:

Proposition 3.9 (Bounds for ETλ with ET̃^α_λ). Assume that λ ≥ 0 and the weights w1, w2 are b-Lipschitz w.r.t. dT. Then,

    ETλ(µ, ν) ≤ ET̃^0_λ(µ, ν).

In addition, if [4LT − λ] b ≤ w1(r) + w2(r), where LT := max_{x∈T} ω([r, x]), then

    ET̃^α_λ(µ, ν) ≤ ETλ(µ, ν)

for every 2bLT ≤ α ≤ (1/2)[bλ + w1(r) + w2(r)].

Proof is placed in the Supplementary (§A.8).

Analogous to Proposition 3.7, we obtain:

Proposition 3.10 (Geometric structures of the regularized metric dα). Assume that λ, w1(r), w2(r) are nonnegative numbers. For 0 ≤ α < bλ/2 + min{w1(r), w2(r)}, define

    dα(µ, ν) := ET̃^α_λ(µ, ν) + (bλ/2) [µ(T) + ν(T)].    (9)

Then, we have

i) dα(µ + σ, ν + σ) = dα(µ, ν), ∀σ ∈ M(T).


ii) dα is a divergence and satisfies the triangle inequality dα(µ, ν) ≤ dα(µ, σ) + dα(σ, ν).

iii) If in addition w1(r) = w2(r), then (M(T), dα) is a complete metric space. Moreover, it is a geodesic space in the sense defined in part iii) of Proposition 3.7, but with dα replacing d.

Proof is placed in the Supplementary (§A.9).

Our next result, about negative definiteness, is a cornerstone to build positive definite kernels upon either ET̃^α_λ or dα for kernel-dependent frameworks.

Proposition 3.11 (Negative definiteness). Under the same assumptions as in Proposition 3.8 for ET̃^α_λ and in Proposition 3.10 for dα, both ET̃^α_λ and dα are negative definite.

Proof is placed in the Supplementary (§A.10).

From Proposition 3.11 and following (Berg et al., 1984, Theorem 3.2.2, p.74), given t > 0, the kernels k_{ET̃^α_λ}(µ, ν) := exp(−t ET̃^α_λ(µ, ν)) and k_{dα}(µ, ν) := exp(−t dα(µ, ν)) are positive definite.

3.3 Tree-sliced Variants by Sampling Tree Metrics

In most practical applications, we do not have prior knowledge about a tree structure for the measures. Therefore, we need to choose or sample tree metrics from support data points for a given task. We use the tree metric sampling methods of (Le et al., 2019b, §4): (i) partition-based tree metric sampling for a low-dimensional space, or (ii) clustering-based tree metric sampling for a high-dimensional space. Those tree metric sampling methods are not only fast to compute³, but also adaptive to the distribution of supports. We further propose tree-sliced variants of the regularized EPT, computed by averaging the regularized EPT over such randomly sampled tree metrics. One advantage is to reduce the quantization effects or cluster-sensitivity problems (i.e., support data points being quantized into an adjacent hypercube, or clustered into an adjacent cluster, respectively) within the tree metric sampling procedure.

Although one can leverage tree metrics to approximate arbitrary metrics (Bartal, 1996, 1998; Charikar et al., 1998; Indyk, 2001; Fakcharoenphol et al., 2004), our goal is rather to sample tree metrics and use them as ground metrics in the regularized EPT, similar to TSW. Despite the fact that one-dimensional projections do not have interesting properties from the distortion viewpoint, they remain useful for SPOT (or SW, or the sliced Gromov-Wasserstein (Vayer et al., 2019)). In the same vein, we believe that trees with high distortion are still useful for EPT, as in TSW. Moreover, one may not need to spend excessive effort optimizing ETλ (in Equation (7)) for a randomly sampled tree metric, since doing so can lead to overfitting within the computation of the EPT itself. Therefore, the proposed efficient regularization of EPT (e.g., ET̃^α_λ in Equation (8)) is not only fast to compute (i.e., closed-form), but also helps to overcome this overfitting problem.

³ E.g., the complexity of the clustering-based tree metric sampling is O(HT m log κ) when we set κ clusters for the farthest-point clustering (Gonzalez, 1985) and HT for the predefined deepest tree level, for m input support data points.

4 Discussion and Related Work

One can leverage tree metrics to approximate arbitrary metrics for speeding up computation (Bartal, 1996, 1998; Charikar et al., 1998; Indyk, 2001; Fakcharoenphol et al., 2004). For instance, (i) Indyk and Thaper (2003) applied tree metrics (e.g., a quadtree) to approximate OT with a Euclidean cost metric for fast image retrieval. (ii) Sato et al. (2020) considered a generalized Kantorovich-Rubinstein discrepancy (Hanin, 1992; Guittet, 2002; Lellmann et al., 2014) with general weights for unbalanced OT, and used a quadtree as in (Indyk and Thaper, 2003) to approximate the proposed distance via a dynamic program with infinitely many states. They then derived an efficient algorithm with quasi-linear time complexity to speed up the dynamic programming computation by leveraging high-level programming techniques. However, such approximations following the approach of Indyk and Thaper (2003) result in large distortions in high-dimensional spaces (Naor and Schechtman, 2007).

Tree metrics are also leveraged for several advanced OT problems, e.g., tree-Wasserstein barycenters (Le et al., 2019a), or a variant of Gromov-Wasserstein (a.k.a. flow-based alignment approaches) (Le et al., 2021). Additionally, the ultrametric, a special case of tree metrics, is also utilized in Gromov-Wasserstein (Mémoli et al., 2021) and Gromov-Hausdorff (Mémoli et al., 2019) for metric measure spaces.

5 Experiments

In this section, we first illustrate that ET̃^α_λ (Equation (8)) is an efficient approximation of ETλ (Equation (7)). Then, we evaluate our proposed ET̃^α_λ and dα (Equation (9)) for comparing measures in document classification with word embeddings and topological data analysis (TDA). Experiments are run on an Intel Xeon CPU E7-8891v3 2.80GHz with 256GB RAM.

Documents with word embedding. We consider fication with SVM, as well as change point detection
each document as a measure where each word is re- for material data analysis with kernel Fisher discrim-
garded as a point in the support with a unit mass. inant ratio (KFDR) (Harchaoui et al., 2009). While
α
Following Kusner et al. (2015); Le et al. (2019b), we f and dα are positive definite, kernels for
kernels for ET λ
applied the word2vec word embedding (Mikolov et al., Sinkhorn-UOT and SPOT are empirically indefinite5 .
2013), pretrained on Google News4 containing about When kernels are indefinite, we regularized for the cor-
3 millions words/phrases. Each word/phrase in a doc- responding Gram matrices by adding a sufficiently large
ument is mapped into a vector in R300 . We removed diagonal term as in (Cuturi, 2013; Le et al., 2019b). For
all SMART stop word (Salton and Buckley, 1988), and SVM, we randomly split each dataset into 70%/30%
dropped words in documents if they are not available for training and test with 10 repeats. Typically, we
in the pretrained word2vec. choose hyper-parameters via cross validation, choose
1/t from {q10 , q20 , q50 } where qs is the s% quantile of
Geometric structured data via persistence dia- a subset of corresponding discrepancies observed on
grams (PD) in TDA. TDA has recently emerged a training set, use 1-vs-1 strategy with Libsvm6 for
in machine learning community as a powerful tool to an- multi-classclassification, and choose SVM regulariza-
alyze geometric structured data such as material data, tion from 10−[Link] . For Sinkhorn-UOT, we select
or linked twist maps (Adams et al., 2017; Lacombe the entropic regularization from {0.01, 0.05, 0.1, 0.5, 1}.
α
et al., 2018; Le and Yamada, 2018). TDA applies al- Following Proposition 3.9, we take α = 0 for ET f and
λ
gebraic topology methods (e.g., persistence homology) dα in all our experiments.
to extract robust topological features (e.g., connected
components, rings, cavities) and output a multiset of 0
2-dimensional points (i.e., PD). The coordinates of a 5.1 Efficient Approximation of ET
f for ETλ
λ
2-dimensional point in PD are corresponding to the
We randomly sample 500K pairs of documents in
birth and death time of a particular topological feature.
TWITTER dataset. Following Proposition 3.3, we com-
Therefore, each point in a PD summarizes the life span of a topological feature. We can regard a PD as a measure where each 2-dimensional point is considered as a point in the support with a unit mass.

Tree metric sampling. In our experiments, we do not have prior knowledge about tree metrics for either word embeddings in documents or 2-dimensional points in PD. To compute the EPT (e.g., ET̃^0_λ and d_α), we considered n_s randomized tree metrics. We employed the clustering-based tree metric sampling for word embeddings in documents (i.e., the high-dimensional space R^300), while we used the partition-based tree metric sampling for 2-dimensional points in PD (i.e., the low-dimensional space R^2). Both tree metric sampling methods are built with a predefined deepest level H_T of the tree T as a stopping condition, as in (Le et al., 2019b).

Baselines and setup. We considered 2 typical baselines based on OT theory for measures with different masses: (i) Sinkhorn-UOT (Frogner et al., 2015; Chizat et al., 2018), i.e., the entropic regularization approach, and (ii) SPOT (Bonneel and Coeurjolly, 2019), i.e., the sliced-formula approach based on 1-dimensional projection. Following Le et al. (2019b), we apply the kernel approach in the form exp(−t d̄) with SVM for document classification with word embedding, where d̄ is a discrepancy between measures and t > 0. We also employed this kernel approach for various tasks in TDA, e.g., orbit recognition and object shape classification.

… compute ET_λ via the corresponding KT (Equation (5)). Our goal is to compare ET̃^0_λ to ET_λ.

Figure 1: Relative difference between ET̃^0_λ and ET_λ w.r.t. the Lipschitz constant of w_1, w_2.

Change Lipschitz constants. We choose w_1(x) = w_2(x) = a_1 d_T(r, x) + a_0, and set λ = b = 1, a_0 = 1. In particular, a_1 ∈ [0, b] since w_1, w_2 are b-Lipschitz functions (see Corollary 3.5 and Remark 3.6). We illustrate the relative difference (ET̃^0_λ − ET_λ)/ET_λ when a_1 is varied in [0, b] in Figure 1. We observe that when a_1 is close to b (i.e., the Lipschitz constants of w_1, w_2 are close to b), ET̃^0_λ becomes closer to ET_λ. When a_1 = b, the value of ET̃^0_λ is almost identical to ET_λ.

Change λ. From the results in Figure 1, we set a_1 = b to investigate the relative difference between ET̃^0_λ and ET_λ when λ is changed. As illustrated in Figure 2,

⁴ [Link]
⁵ We empirically observed negative eigenvalues in Gram matrices corresponding to the kernels for Sinkhorn-UOT and SPOT.
⁶ [Link]
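A key reason tree metrics are attractive here is that OT-type discrepancies on a tree admit a closed form: each tree edge contributes its weight times the absolute difference between the masses the two measures place on the subtree below it, so no optimization is needed. The toy sketch below evaluates such a closed form on a randomly sampled partition tree; the median splits, the depth-decaying edge weights, and the function name are our illustrative assumptions, not the authors' implementation or their d_α / ET̃^0_λ formulas.

```python
import numpy as np

def partition_tree_discrepancy(X, mu, nu, depth=0, max_depth=6, rng=None):
    """Closed-form tree discrepancy: sum over tree edges of
    (edge weight) * |mu(subtree) - nu(subtree)|, on a tree grown by
    recursive median splits of the support X (a toy partition-based
    tree metric in the spirit of Le et al. (2019b))."""
    if rng is None:
        rng = np.random.default_rng(0)
    if depth >= max_depth or len(X) <= 1:
        return 0.0
    axis = int(rng.integers(X.shape[1]))       # random split axis
    cut = np.median(X[:, axis])
    left = X[:, axis] <= cut
    total = 0.0
    for side in (left, ~left):
        if side.any() and not side.all():      # a proper split creates a tree edge
            edge_weight = 0.5 ** depth         # edges shrink with depth (assumption)
            total += edge_weight * abs(mu[side].sum() - nu[side].sum())
            total += partition_tree_discrepancy(X[side], mu[side], nu[side],
                                                depth + 1, max_depth, rng)
    return total
```

Evaluating this costs O(n) per tree level, which is what makes averaging over many sampled trees affordable.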
Figure 2: Relative difference between ET̃^0_λ and ET_λ w.r.t. λ when a_1 = b, for λ ∈ {L_T/100, L_T/50, L_T/10, L_T/5, L_T, 2L_T}.

ET̃^0_λ is almost identical to ET_λ regardless of the value of λ when a_1 = b.

5.2 Document Classification with Word Embedding

We consider 4 datasets for document classification with word embedding: TWITTER, RECIPE, CLASSIC and AMAZON. Statistical characteristics of these datasets are summarized in Figure 3.

5.3 Topological Data Analysis (TDA)

5.3.1 Orbit Recognition

We considered a synthesized dataset, proposed by Adams et al. (2017), for the link twist map, which is a discrete dynamical system to model flows in DNA microarrays (Hertzsch et al., 2007). There are 5 classes of orbits. Following Le and Yamada (2018), we generated 1000 orbits for each class, and each orbit has 1000 points. We used the 1-dimensional topological features for PD extracted with the Vietoris-Rips complex filtration (Edelsbrunner and Harer, 2008).

5.3.2 Object Shape Classification

We evaluated our approach for object shape classification on a subset of the MPEG7 dataset (Latecki et al., 2000) containing 10 classes, where each class has 20 samples, as in (Le and Yamada, 2018). For simplicity, we followed Le and Yamada (2018) to extract 1-dimensional topological features for PD with the Vietoris-Rips complex filtration⁷ (Edelsbrunner and Harer, 2008).

⁷ A more complicated and advanced filtration for this task is considered in (Turner et al., 2014).

5.3.3 Change Point Detection for Material Analysis

We applied our approach to change point detection for material analysis with KFDR as a statistical score on the granular packing system (GPS) (Francois et al., 2013) and SiO2 (Nakamura et al., 2015) datasets. Statistical characteristics of these datasets are summarized in Figure 5. Following Le and Yamada (2018), we set 10^{-3} for the regularization parameter in KFDR, and used the ball model filtration to extract 2-dimensional topological features for PD in the GPS dataset and 1-dimensional topological features for PD in the SiO2 dataset. Note that we omit the baseline kernel for Sinkhorn-UOT in this application since the computation with Sinkhorn-UOT runs out of memory.

We illustrate the KFDR graphs in Figure 5. For the GPS dataset, all kernel approaches detect the change point at index 23, which supports the observation (corresponding to id = 23) in (Anonymous, 1972). For the SiO2 dataset, all kernel approaches detect the change point in a supported range (35 ≤ id ≤ 50), obtained by a traditional physical approach (Elliott, 1983). The KFDR results of the kernels corresponding to d_0 and ET̃^0_λ compare favorably with those of the kernel for SPOT.

5.4 Results of SVM, Time Consumption and Discussions

We illustrate the results of SVM and the time consumption of kernel matrices for document classification with word embedding and for TDA in Figure 3 and Figure 4 respectively. The performances of the kernels for ET̃^0_λ and d_0 outperform those of the kernels for SPOT. They also outperform those of the kernels for Sinkhorn-UOT on TDA, and are comparable on document classification. The fact that SPOT uses a 1-dimensional projection for support data points may limit its ability to capture the high-dimensional structure of data distributions (Le et al., 2019b; Liutkus et al., 2019). The regularized EPT remedies this problem by leveraging tree metrics, which have more flexibility and degrees of freedom (e.g., one chooses a tree rather than a line). In addition, while the kernels for ET̃^0_λ and d_0 are positive definite, the kernels for SPOT and Sinkhorn-UOT are empirically indefinite. The indefiniteness of kernels may affect their performances in some applications; e.g., kernels for Sinkhorn-UOT work well for document classification with word embedding, but perform poorly in TDA applications. Similar observations were made in (Le et al., 2019b). Additionally, we illustrate the trade-off between performance and computational time for different numbers of (tree) slices on the TWITTER dataset in Figure 6. The performances are usually improved with more slices, at the cost of computational time. In applications, we observed that a good trade-off is about n_s = 10 slices.

Tree metric sampling. Time consumption for the tree metric sampling is negligible in applications. With the predefined tree deepest level H_T = 6 and κ = 4 tree branches as in (Le et al., 2019b), it took 1.5, 11.0, 17.5, 20.5 seconds for the TWITTER, RECIPE, CLASSIC, AMAZON datasets respectively, and 21.0, 0.1

Figure 3: SVM results and time consumption of kernel matrices on document classification (SPOT, Sinkhorn-UOT, d_0, ET̃^0_λ). For each dataset, the numbers in the parenthesis are respectively the number of classes, the number of documents, and the maximum number of unique words for each document: TWITTER (3/3108/26), RECIPE (15/4370/340), CLASSIC (4/7093/197), AMAZON (4/8000/884).

Figure 4: SVM results and time consumption of kernel matrices for TDA (SPOT, Sinkhorn-UOT, d_0, ET̃^0_λ). For each dataset, the numbers in the parenthesis are respectively the number of PDs and the maximum number of points in a PD: Orbit (5000/300), MPEG7 (200/80).

Figure 5: KFDR graphs (panels: SPOT, d_0, ET̃^0_λ, and runtime) and time consumption of kernel matrices for change point detection. For each dataset, the numbers in the parenthesis are respectively the number of PDs and the maximum number of points in a PD: GPS (35/20.4K), with the detected change point at id = 23; SiO2 (80/30K), with the detected change point at id = 40.
seconds for the Orbit and MPEG7 datasets respectively.

ET̃^0_λ versus ET_λ. We also compare ET̃^0_λ and ET_λ (or KT) on the TWITTER dataset for document classification, and on the MPEG7 dataset for object shape recognition in TDA. The performances of ET̃^0_λ and ET_λ are identical (i.e., their kernel matrices are almost the same for those datasets), but ET̃^0_λ is about 11 times faster than ET_λ on the TWITTER dataset, and 81 times faster on the MPEG7 dataset, when n_s = 10 slices.

Figure 6: SVM results and time consumption for the corresponding kernel matrices on the TWITTER dataset w.r.t. the number of (tree) slices (SPOT, d_0, ET̃^0_λ).
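The number-of-slices trade-off above comes from averaging a per-tree value over n_s independently sampled trees. The sketch below illustrates the mechanics using random 1-dimensional projections, i.e., path trees, as a deliberately simplified stand-in for genuine sampled tree metrics; the function name, the projection scheme, and the cumulative-mass closed form on a line are our illustrative assumptions (the line formula also treats the two measures naively, without the partial-transport machinery developed in the paper).

```python
import numpy as np

def tree_sliced_discrepancy(X, mu, nu, n_slices=10, seed=0):
    """Average a closed-form per-slice discrepancy over n_slices random slices.
    Each slice projects the support X onto a random direction; a line is a
    (degenerate) tree, so the per-slice value integrates the absolute
    cumulative mass difference along the path."""
    rng = np.random.default_rng(seed)
    vals = []
    for _ in range(n_slices):
        u = rng.normal(size=X.shape[1])
        u /= np.linalg.norm(u)
        t = X @ u                                     # positions on the line (a path tree)
        order = np.argsort(t)
        cum = np.cumsum(mu[order] - nu[order])[:-1]   # mass difference crossing each edge
        edge = np.diff(t[order])                      # edge lengths along the path
        vals.append(float(np.sum(np.abs(cum) * edge)))
    return float(np.mean(vals))
```

Each slice costs O(n log n) for the sort, so averaging over n_s = 10 slices, the setting we found to be a good trade-off, stays cheap.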
Further results are placed in the supplementary (§B).
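The positive versus indefinite kernel distinction discussed in §5.4 is easy to probe numerically: build the Gram matrix K = exp(−t d̄) and inspect its spectrum. K is positive definite for all t > 0 exactly when d̄ is negative definite (cf. Berg et al., 1984), which is the property our regularized EPT provides. A minimal numpy sketch, with helper names of our own choosing and the L1 metric on a line (a path tree, hence negative definite) as the example discrepancy:

```python
import numpy as np

def kernel_from_discrepancy(D, t=1.0):
    """Gram matrix K = exp(-t * D) for a symmetric pairwise discrepancy D,
    usable as a precomputed kernel in an SVM."""
    return np.exp(-t * D)

def min_eigenvalue(K):
    # symmetrize against round-off before the eigendecomposition
    return float(np.linalg.eigvalsh((K + K.T) / 2.0).min())

# The L1 distance on the real line is a tree metric and is negative
# definite, so its exponential kernel is numerically PSD:
x = np.linspace(0.0, 1.0, 20)
D = np.abs(x[:, None] - x[None, :])
K = kernel_from_discrepancy(D, t=2.0)
print(min_eigenvalue(K) >= -1e-10)   # True
```

An indefinite discrepancy, as we observed empirically for Sinkhorn-UOT and SPOT, shows up in this check as clearly negative eigenvalues of the Gram matrix.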
6 Conclusion

We have developed a rigorous theory for the entropy partial transport (EPT) problem for nonnegative measures on a tree having different masses. We show that the EPT problem is equivalent to a standard complete OT problem on a suitable one-node extended tree, which allows us to develop its dual formulation. By leveraging the dual problem, we proposed an efficient novel regularization for the EPT which yields a closed-form solution for fast computation and enjoys negative definiteness, an important property for building the positive definite kernels required in many kernel-dependent frameworks. Moreover, our regularization also provides effective approximations in applications. We further derive tree-sliced variants of the regularized EPT for practical applications without prior knowledge about a tree structure for measures. The question of sampling optimal tree metrics for the tree-sliced variants from data points is left for future work.

Acknowledgements

We thank Nhan-Phu Chung and Nhat Ho for fruitful discussions, and the anonymous reviewers for their comments. TL acknowledges the support of JSPS KAKENHI Grant number 20K19873. The research of TN is supported in part by a grant from the Simons Foundation (#318995).

References

Adams, H., Emerson, T., Kirby, M., Neville, R., Peterson, C., Shipman, P., Chepushtanova, S., Hanson, E., Motta, F., and Ziegelmeier, L. (2017). Persistence images: A stable vector representation of persistent homology. Journal of Machine Learning Research, 18(1):218–252.

Anonymous (1972). What is random packing? Nature, 239:488–489.

Bartal, Y. (1996). Probabilistic approximation of metric spaces and its algorithmic applications. In Proceedings of 37th Conference on Foundations of Computer Science, pages 184–193.

Bartal, Y. (1998). On approximating arbitrary metrices by tree metrics. In ACM Symposium on Theory of Computing (STOC), volume 98, pages 161–168.

Benamou, J.-D. (2003). Numerical resolution of an "unbalanced" mass transport problem. ESAIM: Mathematical Modelling and Numerical Analysis, 37(5):851–868.

Berg, C., Christensen, J. P. R., and Ressel, P., editors (1984). Harmonic analysis on semigroups. Springer-Verlag, New York.

Bonneel, N. and Coeurjolly, D. (2019). SPOT: Sliced partial optimal transport. ACM Transactions on Graphics (TOG), 38(4):1–13.

Bonneel, N., Rabin, J., Peyré, G., and Pfister, H. (2015). Sliced and Radon Wasserstein barycenters of measures. Journal of Mathematical Imaging and Vision, 51(1):22–45.

Bunne, C., Alvarez-Melis, D., Krause, A., and Jegelka, S. (2019). Learning generative models across incomparable spaces. In International Conference on Machine Learning (ICML), volume 97.

Caffarelli, L. A. and McCann, R. J. (2010). Free boundaries in optimal transport and Monge-Ampère obstacle problems. Annals of Mathematics, pages 673–730.

Charikar, M., Chekuri, C., Goel, A., Guha, S., and Plotkin, S. (1998). Approximating a finite metric by a small number of tree metrics. In Proceedings 39th Annual Symposium on Foundations of Computer Science (FOCS), pages 379–388.

Chizat, L., Peyré, G., Schmitzer, B., and Vialard, F.-X. (2018). Scaling algorithms for unbalanced optimal transport problems. Mathematics of Computation, 87(314):2563–2609.

Chung, N.-P. and Trinh, T.-S. (2019). Duality and quotient spaces of generalized Wasserstein spaces. arXiv preprint arXiv:1904.12461.

Courty, N., Flamary, R., Habrard, A., and Rakotomamonjy, A. (2017). Joint distribution optimal transportation for domain adaptation. In Advances in Neural Information Processing Systems, pages 3730–3739.

Cuturi, M. (2013). Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems, pages 2292–2300.

Edelsbrunner, H. and Harer, J. (2008). Persistent homology: a survey. Contemporary Mathematics, 453:257–282.

Elliott, S. R. (1983). Physics of amorphous materials. Longman Group.

Fakcharoenphol, J., Rao, S., and Talwar, K. (2004). A tight bound on approximating arbitrary metrics by tree metrics. Journal of Computer and System Sciences, 69(3):485–497.

Figalli, A. (2010). The optimal partial transport problem. Archive for Rational Mechanics and Analysis, 195(2):533–560.

Francois, N., Saadatfar, M., Cruikshank, R., and Sheppard, A. (2013). Geometrical frustration in amorphous and partially crystallized packings of spheres. Physical Review Letters, 111(14):148001.

Frogner, C., Zhang, C., Mobahi, H., Araya, M., and Poggio, T. A. (2015). Learning with a Wasserstein loss. In Advances in Neural Information Processing Systems, pages 2053–2061.

Gangbo, W., Li, W., Osher, S., and Puthawala, M. (2019). Unnormalized optimal transport. Journal of Computational Physics, 399:108940.

Gonzalez, T. F. (1985). Clustering to minimize the maximum intercluster distance. Theoretical Computer Science, 38:293–306.

Guittet, K. (2002). Extended Kantorovich norms: a tool for optimization. INRIA report.

Hanin, L. G. (1992). Kantorovich-Rubinstein norm and its application in the theory of Lipschitz spaces. Proceedings of the American Mathematical Society, 115(2):345–352.

Harchaoui, Z., Moulines, E., and Bach, F. R. (2009). Kernel change-point analysis. In Advances in Neural Information Processing Systems, pages 609–616.

Hertzsch, J.-M., Sturman, R., and Wiggins, S. (2007). DNA microarrays: design principles for maximizing ergodic, chaotic mixing. Small, 3(2):202–218.

Indyk, P. (2001). Algorithmic applications of low-distortion geometric embeddings. In Proceedings 42nd IEEE Symposium on Foundations of Computer Science (FOCS), pages 10–33.

Indyk, P. and Thaper, N. (2003). Fast image retrieval via embeddings. In International Workshop on Statistical and Computational Theories of Vision, volume 2, page 5.

Janati, H., Cuturi, M., and Gramfort, A. (2019). Wasserstein regularization for sparse multi-task regression. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 1407–1416.

Kusner, M., Sun, Y., Kolkin, N., and Weinberger, K. (2015). From word embeddings to document distances. In International Conference on Machine Learning, pages 957–966.

Lacombe, T., Cuturi, M., and Oudot, S. (2018). Large scale computation of means and clusters for persistence diagrams using optimal transport. In Advances in Neural Information Processing Systems, pages 9770–9780.

Latecki, L. J., Lakamper, R., and Eckhardt, T. (2000). Shape descriptors for non-rigid shapes with a single closed contour. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 1, pages 424–429.

Lavenant, H., Claici, S., Chien, E., and Solomon, J. (2018). Dynamical optimal transport on discrete surfaces. In SIGGRAPH Asia 2018 Technical Papers, page 250. ACM.

Le, T., Ho, N., and Yamada, M. (2021). Flow-based alignment approaches for probability measures in different spaces. In International Conference on Artificial Intelligence and Statistics (AISTATS).

Le, T., Huynh, V., Ho, N., Phung, D., and Yamada, M. (2019a). On scalable variant of Wasserstein barycenter. arXiv preprint arXiv:1910.04483.

Le, T. and Yamada, M. (2018). Persistence Fisher kernel: A Riemannian manifold kernel for persistence diagrams. In Advances in Neural Information Processing Systems, pages 10007–10018.

Le, T., Yamada, M., Fukumizu, K., and Cuturi, M. (2019b). Tree-sliced variants of Wasserstein distances. In Advances in Neural Information Processing Systems, pages 12283–12294.

Lee, J., Bertrand, N. P., and Rozell, C. J. (2019). Parallel unbalanced optimal transport regularization for large scale imaging problems. arXiv preprint arXiv:1909.00149.

Lellmann, J., Lorenz, D. A., Schonlieb, C., and Valkonen, T. (2014). Imaging with Kantorovich-Rubinstein discrepancy. SIAM Journal on Imaging Sciences, 7(4):2833–2859.

Liero, M., Mielke, A., and Savaré, G. (2018). Optimal entropy-transport problems and a new Hellinger-Kantorovich distance between positive measures. Inventiones Mathematicae, 211(3):969–1117.

Liutkus, A., Simsekli, U., Majewski, S., Durmus, A., and Stöter, F.-R. (2019). Sliced-Wasserstein flows: Nonparametric generative modeling via optimal transport and diffusions. In International Conference on Machine Learning, pages 4104–4113.

Mémoli, F., Munk, A., Wan, Z., and Weitkamp, C. (2021). The ultrametric Gromov-Wasserstein distance. arXiv preprint arXiv:2101.05756.

Mémoli, F., Smith, Z., and Wan, Z. (2019). Gromov-Hausdorff distances on p-metric spaces and ultrametric spaces. arXiv preprint arXiv:1912.00564.

Mena, G. and Niles-Weed, J. (2019). Statistical bounds for entropic optimal transport: sample complexity and the central limit theorem. In Advances in Neural Information Processing Systems, pages 4541–4551.

Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.

Nadjahi, K., Durmus, A., Simsekli, U., and Badeau, R. (2019). Asymptotic guarantees for learning generative models with the sliced-Wasserstein distance. In Advances in Neural Information Processing Systems, pages 250–260.

Nakamura, T., Hiraoka, Y., Hirata, A., Escolar, E. G., and Nishiura, Y. (2015). Persistent homology and many-body atomic structure for medium-range order in the glass. Nanotechnology, 26(30):304001.

Naor, A. and Schechtman, G. (2007). Planar Earthmover is not in L1. SIAM Journal on Computing, 37(3):804–826.

Peyré, G. and Cuturi, M. (2019). Computational optimal transport. Foundations and Trends in Machine Learning, 11(5-6):355–607.

Pham, K., Le, K., Ho, N., Pham, T., and Bui, H. (2020). On unbalanced optimal transport: An analysis of Sinkhorn algorithm. In Proceedings of the International Conference on Machine Learning.

Piccoli, B. and Rossi, F. (2014). Generalized Wasserstein distance and its application to transport equations with source. Archive for Rational Mechanics and Analysis, 211(1):335–358.

Piccoli, B. and Rossi, F. (2016). On properties of the generalized Wasserstein distance. Archive for Rational Mechanics and Analysis, 222(3):1339–1365.

Rabin, J., Peyré, G., Delon, J., and Bernot, M. (2011). Wasserstein barycenter and its application to texture mixing. In International Conference on Scale Space and Variational Methods in Computer Vision, pages 435–446.

Salton, G. and Buckley, C. (1988). Term-weighting approaches in automatic text retrieval. Information Processing & Management, 24(5):513–523.

Sato, R., Yamada, M., and Kashima, H. (2020). Fast unbalanced optimal transport on tree. In Advances in Neural Information Processing Systems.

Schiebinger, G., Shu, J., Tabaka, M., Cleary, B., Subramanian, V., Solomon, A., Gould, J., Liu, S., Lin, S., Berube, P., et al. (2019). Optimal-transport analysis of single-cell gene expression identifies developmental trajectories in reprogramming. Cell, 176(4):928–943.

Semple, C. and Steel, M. (2003). Phylogenetics. Oxford Lecture Series in Mathematics and its Applications.

Solomon, J., De Goes, F., Peyré, G., Cuturi, M., Butscher, A., Nguyen, A., Du, T., and Guibas, L. (2015). Convolutional Wasserstein distances: Efficient optimal transportation on geometric domains. ACM Transactions on Graphics (TOG), 34(4):66.

Turner, K., Mukherjee, S., and Boyer, D. M. (2014). Persistent homology transform for modeling shapes and surfaces. Information and Inference: A Journal of the IMA, 3(4):310–344.

Vayer, T., Flamary, R., Tavenard, R., Chapel, L., and Courty, N. (2019). Sliced Gromov-Wasserstein. In Advances in Neural Information Processing Systems.

Villani, C. (2008). Optimal transport: old and new, volume 338. Springer Science & Business Media.

Weed, J. and Berthet, Q. (2019). Estimation of smooth densities in Wasserstein distance. In Proceedings of the Thirty-Second Conference on Learning Theory, volume 99, pages 3118–3119.

Yang, K. D. and Uhler, C. (2019). Scalable unbalanced optimal transport using generative adversarial networks. In International Conference on Learning Representations.
