IMRAM: Iterative Matching with Recurrent Attention Memory

for Cross-Modal Image-Text Retrieval∗

Hui Chen1, Guiguang Ding1*, Xudong Liu2, Zijia Lin3, Ji Liu4, Jungong Han5
1 School of Software, BNRist, Tsinghua University
2 Kwai Ads Platform; 3 Microsoft Research
4 Kwai Seattle AI Lab, Kwai FeDA Lab, Kwai AI Platform
5 WMG Data Science, University of Warwick
{jichenhui2012, ji.liu.uwisc, jungonghan77}@gmail.com
[email protected], [email protected], [email protected]

Abstract

Enabling bi-directional retrieval of images and texts is important for understanding the correspondence between vision and language. Existing methods leverage the attention mechanism to explore such correspondence in a fine-grained manner. However, most of them consider all semantics equally and thus align them uniformly, regardless of their diverse complexities. In fact, semantics are diverse (i.e. involving different kinds of semantic concepts), and humans usually follow a latent structure to combine them into understandable languages. It may be difficult to optimally capture such sophisticated correspondences in existing methods. In this paper, to address such a deficiency, we propose an Iterative Matching with Recurrent Attention Memory (IMRAM) method, in which correspondences between images and texts are captured with multiple steps of alignments. Specifically, we introduce an iterative matching scheme to explore such fine-grained correspondence progressively. A memory distillation unit is used to refine alignment knowledge from early steps to later ones. Experiment results on three benchmark datasets, i.e. Flickr8K, Flickr30K, and MS COCO, show that our IMRAM achieves state-of-the-art performance, well demonstrating its effectiveness. Experiments on a practical business advertisement dataset, named KWAI-AD, further validate the applicability of our method in practical scenarios.

1. Introduction

Due to the explosive increase of multimedia data from social media and web applications, enabling bi-directional cross-modal image-text retrieval is in great demand and has become prevalent in both academia and industry. Meanwhile, this task is challenging because it requires understanding not only the content of images and texts but also their inter-modal correspondence.

In recent years, a large number of approaches have been proposed and have achieved great progress. Early works attempted to directly map the information of images and texts into a common latent embedding space. For example, Wang et al. [26] adopted a deep network with two branches to, respectively, map images and texts into an embedding space. However, these works coarsely capture the correspondence between modalities and thus are unable to depict the fine-grained interactions between vision and language.

To gain a deeper understanding of such fine-grained correspondences, recent studies further explored the attention mechanism for cross-modal image-text retrieval. Karpathy et al. [11] extracted features of fragments for each image and text (i.e. image regions and text words), and proposed a dense alignment between each fragment pair. Lee et al. [13] proposed a stacked cross attention model, in which attention was used to align each fragment with all fragments from the other modality. It can neatly discover the fine-grained correspondence and thus achieves state-of-the-art performance on several benchmark datasets.

However, due to the large heterogeneity gap between images and texts, existing attention-based models, e.g. [13], may not well capture the optimal pairwise relationships among a large number of region-word fragment pairs. Actually, semantics are complicated, because they are diverse (i.e. composed of different kinds of semantic concepts with different meanings, such as objects (e.g. nouns), attributes (e.g. adjectives) and relations (e.g. verbs)). And there generally exist strong correlations among different concepts, e.g. relational terms (e.g. verbs) usually indicate relationships between objects (e.g. nouns).

*This work was supported by the National Natural Science Foundation of China (Nos. U1936202, 61925107). Corresponding author: Guiguang Ding.

Moreover, humans usually follow a latent structure (e.g. a tree-like structure [25]) to combine different semantic concepts into understandable languages, which indicates that the semantics shared between images and texts exhibit a complicated distribution. However, existing state-of-the-art models treat different kinds of semantics equally and align them together uniformly, with little consideration of the complexity of semantics.

In reality, when humans perform comparisons between images and texts, we usually associate low-level semantic concepts, e.g. objects, at the first glimpse. Then, higher-level semantics, e.g. attributes and relationships, are mined by revisiting images and texts to obtain a better understanding [21]. This intuition is favorably consistent with the aforementioned complexity of semantics, and meanwhile, it indicates that the complicated correspondence between images and texts should be exploited progressively.

Motivated by this, in this paper, we propose an iterative matching framework with recurrent attention memory for cross-modal image-text retrieval, termed IMRAM. Our way of exploring the correspondence between images and texts is characterized by two main features: (1) an iterative matching scheme with a cross-modal attention unit to align fragments across different modalities; (2) a memory distillation unit to dynamically aggregate information from early matching steps to later ones. The iterative matching scheme can progressively update the cross-modal attention core to accumulate cues for locating the matched semantics, while the memory distillation unit can refine the latent correspondence by enhancing the interaction of cross-modality information. Leveraging these two features, different kinds of semantics are treated distributively and well captured at different matching steps.

We conduct extensive experiments on several benchmark datasets for cross-modal image-text retrieval, i.e. Flickr8K, Flickr30K, and MS COCO. Experiment results show that our proposed IMRAM can outperform the state-of-the-art models. Detailed analyses are also carried out to provide more insights about IMRAM. We observe that: (1) the fine-grained latent correspondence between images and texts can be well refined during the iterative matching process; (2) different kinds of semantics, respectively, play dominant roles at different matching steps in terms of contributions to the performance improvement.

These observations can account for the effectiveness and reasonableness of our proposed method, which encourages us to validate its potential in practical scenarios. Hence, we collect a new dataset, named KWAI-AD, by crawling about 81K image-text pairs on an advertisement platform, in which each image is associated with at least one advertisement textual title. We then evaluate our proposed method on the KWAI-AD dataset and make comparisons with the state-of-the-art models. Results show that our method performs considerably better than the compared models, further demonstrating its effectiveness in the practical business advertisement scenario. The source code is available at: https://github.com/HuiChen24/IMRAM.

The contributions of our work are threefold: 1) First, we propose an iterative matching method for cross-modal image-text retrieval to handle the complexity of semantics. 2) Second, we formulate the proposed iterative matching method with a recurrent attention memory which incorporates a cross-modal attention unit and a memory distillation unit to refine the correspondence between images and texts. 3) Third, we verify our method on benchmark datasets (i.e. Flickr8K, Flickr30K, and MS COCO) and a real-world business advertisement dataset (i.e. our proposed KWAI-AD dataset). Experimental results show that our method outperforms the compared methods on all datasets. Thorough analyses of our model also well demonstrate the superiority and reasonableness of our method.

2. Related work

Our work is concerned with the task of cross-modal image-text retrieval, which essentially aims to explore the latent correspondence between vision and language. Existing matching methods can be roughly categorized into two lines: (1) coarse-grained matching methods, which mine the correspondence globally by mapping whole images and full texts into a common embedding space; (2) fine-grained matching methods, which explore the correspondence between image fragments and text fragments at a fine-grained level.

Coarse-grained matching methods. Wang et al. [26] used a deep network with two branches of multilayer perceptrons to deal with images and texts, and optimized it with intra- and inter-structure preserving objectives. Kiros et al. [12] adopted a CNN and a Gated Recurrent Unit (GRU) with a hinge-based triplet ranking loss, optimizing the model by averaging the individual violations across the negatives. Alternatively, Faghri et al. [5] reformed the ranking objective with a hard triplet loss function parameterized by only hard negatives.

Fine-grained matching methods. Recently, several works have been devoted to exploring the latent fine-grained vision-language correspondence for cross-modal image-text retrieval [1, 11, 20, 8, 18, 13]. Karpathy et al. [11] extracted features for fragments of each image and text, i.e. image regions and text words, and aligned them in the embedding space. Niu et al. [20] organized texts as a semantic tree with each node corresponding to a phrase, and then used a hierarchical long short-term memory (LSTM, a variant of RNN) to extract phrase-level features for text. Huang et al. [8] presented a context-modulated attention scheme to selectively attend to salient pairwise image-sentence instances, after which a multi-modal LSTM was used to sequentially aggregate local similarities into a global one.

Figure 1. Framework of the proposed model.

Nam et al. [18] proposed a dual attention mechanism in which salient semantics in images and texts were obtained by two attentions, and the similarity was computed by aggregating a sequence of local similarities. Lee et al. [13] proposed a stacked cross attention model which aligns each fragment with all fragments from the other modality, achieving state-of-the-art performance on several benchmark datasets for cross-modal retrieval.

While our method targets the same task as [11, 13], differently, we apply an iterative matching scheme to refine the fragment alignment. Besides, we adopt a memory unit to distill the knowledge of matched semantics in images and texts after each matching step. Our method can also be regarded as a sequential matching method, like [18, 8]. However, within the sequential computations, we transfer the knowledge about the fragment alignment to the successive steps with the proposed recurrent attention memory, instead of using modality-specific context information. Experiments also show that our method outperforms those mentioned works.

We also notice that some recent works make use of large-scale external resources to improve performance. For example, Mithun et al. [17] collected large amounts of image-text pairs from the Internet and optimized the retrieval model with them. Moreover, inspired by the recent success of contextual representation learning in natural language processing, researchers have also explored applying BERT to the cross-modal understanding field [2, 14]. However, such pre-trained cross-modal BERT models1 require large amounts of annotated image-text pairs, which are not easy to obtain in practical scenarios. On the contrary, our method is general and not limited by the amount of available data. We leave the exploration of large-scale external data to future work.

1 Corresponding codes and models are not made publicly available.

3. Methodology

In this section, we elaborate on the details of our proposed IMRAM for cross-modal image-text retrieval. Figure 1 shows the framework of our model. We first describe the way of learning the cross-modal feature representations in section 3.1. Then, we introduce the proposed recurrent attention memory as a module in our matching framework in section 3.2, and present how to incorporate it into the iterative matching scheme for cross-modal image-text retrieval in section 3.3. Finally, the objective function is discussed in section 3.4.

3.1. Cross-modal Feature Representation

Image representation. Benefiting from the development of deep learning in computer vision [4, 7, 24], convolutional neural networks have been widely used in many tasks to extract visual information from images. To obtain more descriptive information about the visual content of image fragments, we employ a pretrained deep CNN, e.g. Faster R-CNN. Specifically, given an image I, the CNN detects image regions and extracts a feature vector f_i for each image region r_i. We further transform f_i into a d-dimensional vector v_i via a linear projection:

    v_i = W_v f_i + b_v    (1)

where W_v and b_v are to-be-learned parameters.

For simplicity, we denote the image representation as V = {v_i | i = 1, ..., m, v_i ∈ R^d}, where m is the number of detected regions in I. We further normalize each region feature vector in V as in [13].
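To make Eq. 1 concrete, the following is a minimal PyTorch sketch of the region-feature projection together with the per-region L2 normalization mentioned above. The 2048-dimensional detector feature size, the 1,024-dimensional embedding size, and all class and variable names are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RegionEncoder(nn.Module):
    """Projects detected region features f_i into the d-dimensional joint space (Eq. 1)."""
    def __init__(self, feat_dim=2048, embed_dim=1024):
        super().__init__()
        self.fc = nn.Linear(feat_dim, embed_dim)   # W_v, b_v

    def forward(self, region_feats):
        # region_feats: (batch, m, feat_dim), e.g. from a pretrained Faster R-CNN detector
        v = self.fc(region_feats)                  # v_i = W_v f_i + b_v
        return F.normalize(v, p=2, dim=-1)         # L2-normalize each region vector, as in [13]

# Example with 36 regions per image, the number used in the experiments later on.
V = RegionEncoder()(torch.randn(2, 36, 2048))      # -> (2, 36, 1024)
```

Word features are obtained analogously from the bi-directional GRU encoder described next and are normalized in the same way.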

Text representation. Basically, texts can be represented at either the sentence level or the word level. To enable a fine-grained connection between vision and language, we extract word-level features for texts, which can be done with a bi-directional GRU as the encoder.

Specifically, for a text S with n words, we first represent each word w_j with a continuous embedding vector e_j = W_e w_j, ∀j ∈ [1, n], where W_e is a to-be-learned embedding matrix. Then, to enhance the word-level representation with context information, we employ a bi-directional GRU to summarize information from both the forward and backward directions in the text S:

    →h_j = →GRU(e_j, →h_{j-1});   ←h_j = ←GRU(e_j, ←h_{j+1})    (2)

where →h_j and ←h_j denote the hidden states of the forward GRU and the backward GRU, respectively. The representation of the word w_j is then defined as t_j = (→h_j + ←h_j) / 2.

Eventually, we obtain a word-level feature set for the text S, denoted as T = {t_j | j = 1, ..., n, t_j ∈ R^d}, where each t_j encodes the information of the word w_j. Note that each t_j shares the same dimensionality as v_i in Eq. 1. We also normalize each word feature vector in T as in [13].

3.2. RAM: Recurrent Attention Memory

The recurrent attention memory aims to align fragments in the embedding space by refining the knowledge about previous fragment alignments in a recurrent manner. It can be regarded as a block that takes in two sets of feature points, i.e. V and T, and estimates the similarity between these two sets via a cross-modal attention unit. A memory distillation unit is then used to refine the attention result in order to provide more knowledge for the next alignment. For generality, we denote the two input sets of features as a query set X = {x_i | i ∈ [1, m'], x_i ∈ R^d} and a response set Y = {y_j | j ∈ [1, n'], y_j ∈ R^d}, where m' and n' are the numbers of feature points in X and Y, respectively. Note that X can be either of V and T, while Y will be the other.

Cross-modal Attention Unit (CAU). The cross-modal attention unit aims to summarize context information in Y for each feature x_i in X. To achieve this goal, we first compute the similarity between each pair (x_i, y_j) using the cosine function:

    z_ij = x_i^T y_j / (||x_i|| · ||y_j||),   ∀i ∈ [1, m'], ∀j ∈ [1, n']    (3)

As in [13], we further normalize the similarity score z as:

    z̄_ij = relu(z_ij) / sqrt(Σ_{i=1}^{m'} relu(z_ij)^2)    (4)

where relu(x) = max(0, x).

Attention is then performed over the response set Y given a feature x_i in X:

    c_i^x = Σ_{j=1}^{n'} α_ij y_j,   s.t.   α_ij = exp(λ z̄_ij) / Σ_{j=1}^{n'} exp(λ z̄_ij)    (5)

where λ is the inverse temperature parameter of the softmax function [3], which adjusts the smoothness of the attention distribution.

We define C^x = {c_i^x | i ∈ [1, m'], c_i^x ∈ R^d} as the X-grounded alignment features, in which each element captures the related semantics shared by x_i and the whole Y.

Memory Distillation Unit (MDU). To refine the alignment knowledge for the next alignment, we adopt a memory distillation unit which updates the query features X by dynamically aggregating them with the corresponding X-grounded alignment features C^x:

    x_i^* = f(x_i, c_i^x)    (6)

where f() is an aggregating function. We can define f() with different formulations, such as addition, a multilayer perceptron (MLP), attention, and so on. Here, we adopt a modified gating mechanism for f():

    g_i = gate(W_g [x_i, c_i^x] + b_g)
    o_i = tanh(W_o [x_i, c_i^x] + b_o)    (7)
    x_i^* = g_i * x_i + (1 - g_i) * o_i

where W_g, W_o, b_g, b_o are to-be-learned parameters. o_i is a fused feature which enhances the interaction between x_i and c_i^x, and g_i acts as a gate to select the most salient information.

With the gating mechanism, the information of the input query can be refined by itself (i.e. x_i) and by the semantic information shared with the response (i.e. o_i). The gate g_i helps to filter trivial information in the query, and enables the representation learning of each query fragment (i.e. x_i in X) to focus more on its individual shared semantics with Y. Besides, the X-grounded alignment features C^x summarize the context information of Y with regard to each fragment in X. In the next matching step, such context information assists in determining the shared semantics with respect to Y, forming a recurrent computation process as described in section 3.3. Therefore, with the help of C^x, the intra-modality relationships in Y are implicitly involved and re-calibrated during the recurrent process, which enhances the interaction among cross-modal features and thus benefits the representation learning.

RAM block. We integrate the cross-modal attention unit and the memory distillation unit into a RAM block, formulated as:

    C^x, X^* = RAM(X, Y)    (8)

where C^x and X^* are derived by Eq. 5 and Eq. 6, respectively.
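As an illustration of the cross-modal attention unit (Eq. 3-5), here is a minimal PyTorch sketch; the unbatched tensor shapes, the small epsilon for numerical stability, and the concrete value of λ are assumptions made for the example.

```python
import torch
import torch.nn.functional as F

def cross_modal_attention(X, Y, lam=9.0, eps=1e-8):
    """Cross-modal Attention Unit (Eq. 3-5).
    X: (m', d) query features, Y: (n', d) response features.
    Returns C_x: (m', d), the X-grounded alignment features c_i^x."""
    # Eq. 3: pairwise cosine similarities z_ij
    z = F.normalize(X, dim=-1) @ F.normalize(Y, dim=-1).t()        # (m', n')
    # Eq. 4: relu followed by normalization over the query index i
    z = F.relu(z)
    z_bar = z / (z.pow(2).sum(dim=0, keepdim=True).sqrt() + eps)
    # Eq. 5: softmax over the response index j with inverse temperature lambda
    alpha = F.softmax(lam * z_bar, dim=-1)                         # (m', n')
    return alpha @ Y                                               # c_i^x = sum_j alpha_ij y_j
```

Calling it with X = V and Y = T yields the image-grounded alignment features, while swapping the arguments yields the text-grounded ones.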

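A matching sketch of the memory distillation unit and the RAM block (Eq. 6-8) follows, reusing cross_modal_attention from the previous listing. Reading gate() as a sigmoid and using a hidden size of 1,024 are assumptions made for illustration, not details confirmed by the paper.

```python
import torch
import torch.nn as nn

class MemoryDistillationUnit(nn.Module):
    """Gated aggregation x_i^* = f(x_i, c_i^x) of Eq. 6-7."""
    def __init__(self, dim=1024):
        super().__init__()
        self.w_g = nn.Linear(2 * dim, dim)    # W_g, b_g
        self.w_o = nn.Linear(2 * dim, dim)    # W_o, b_o

    def forward(self, X, C_x):
        h = torch.cat([X, C_x], dim=-1)       # [x_i, c_i^x]
        g = torch.sigmoid(self.w_g(h))        # gate g_i (gate() read as a sigmoid here)
        o = torch.tanh(self.w_o(h))           # fused feature o_i
        return g * X + (1.0 - g) * o          # x_i^* = g_i * x_i + (1 - g_i) * o_i

class RAMBlock(nn.Module):
    """One recurrent attention memory block: C^x, X^* = RAM(X, Y) (Eq. 8)."""
    def __init__(self, dim=1024, lam=9.0):
        super().__init__()
        self.mdu = MemoryDistillationUnit(dim)
        self.lam = lam

    def forward(self, X, Y):
        C_x = cross_modal_attention(X, Y, self.lam)   # CAU, Eq. 3-5
        return C_x, self.mdu(X, C_x)                  # alignment features and refined query
```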
Table 1. Comparison with the state-of-the-art models on Flickr8K. As results of SCAN [13] are not reported on Flickr8K, here we show our experiment results obtained by running the code provided by the authors.
Method    Text Retrieval: R@1 R@5 R@10    Image Retrieval: R@1 R@5 R@10    R@sum
DeViSE [6] 4.8 16.5 27.3 5.9 20.1 29.6 104.2
DVSA [11] 16.5 40.6 54.2 11.8 32.1 44.7 199.9
m-CNN [16] 24.8 53.7 67.1 20.3 47.6 61.7 275.2
SCAN* 52.2 81.0 89.2 38.3 67.8 78.9 407.4
Image-IMRAM 48.5 78.1 85.3 32.0 61.4 73.9 379.2
Text-IMRAM 52.1 81.5 90.1 40.2 69.0 79.2 412.1
Full-IMRAM 54.7 84.2 91.0 41.0 69.2 79.9 420.0

3.3. Iterative Matching with Recurrent Attention Memory

In this section, we describe how to employ the recurrent attention memory introduced above to enable iterative matching for cross-modal image-text retrieval.

Specifically, given an image I and a text S, we derive two strategies for iterative matching, grounded on I and S respectively, using two independent RAM blocks:

    C_k^v, V_k = RAM^v(V_{k-1}, T)
    C_k^t, T_k = RAM^t(T_{k-1}, V)    (9)

where V_k and T_k indicate the step-wise features of the image I and the text S, respectively, k is the matching step, and V_0 = V, T_0 = T.

We iteratively perform RAM() for a total of K steps. At each step k, we can derive a matching score between I and S:

    F_k(I, S) = (1/m) Σ_{i=1}^{m} F_k(r_i, S) + (1/n) Σ_{j=1}^{n} F_k(I, w_j)    (10)

where F_k(r_i, S) and F_k(I, w_j) are the region-based matching score and the word-based matching score, respectively. They are derived as follows:

    F_k(r_i, S) = sim(v_i, c_{ki}^v);   F_k(I, w_j) = sim(c_{kj}^t, t_j)    (11)

where sim() is the cosine function that measures the similarity between two input features, as in Eq. 3. Here v_i ∈ V corresponds to the region r_i, t_j ∈ T corresponds to the word w_j, and c_{ki}^v ∈ C_k^v and c_{kj}^t ∈ C_k^t are, respectively, the context features corresponding to the region r_i and the word w_j. m and n are the numbers of image regions and text words, respectively.

After K matching steps, we derive the similarity between I and S by summing all matching scores:

    F(I, S) = Σ_{k=1}^{K} F_k(I, S)    (12)

3.4. Loss Function

In order to encourage matched image-text pairs to be clustered and unmatched ones to be separated in the embedding space, triplet-wise ranking objectives are widely used in previous works [12, 5] to train the model in an end-to-end manner. Following [5], instead of comparing with all negatives, we only consider the hard negatives within a mini-batch, i.e. the negative that is closest to a training query:

    L = Σ_{b=1}^{B} [Δ − F(I_b, S_b) + F(I_b, S_b*)]_+ + Σ_{b=1}^{B} [Δ − F(I_b, S_b) + F(I_b*, S_b)]_+    (13)

where [x]_+ = max(x, 0), and F(I, S) is the semantic similarity between I and S defined by Eq. 12. Images and texts with the same subscript b are matched examples, hard negatives are indicated by the subscript b*, and Δ is a margin value.

Note that in the loss function, F(I, S) consists of F_k(I, S) at each matching step (i.e. Eq. 12), and thus optimizing the loss function directly supervises the learning of image-text correspondences at each matching step, which is expected to help the model yield higher-quality alignments at every step. With the employed triplet-wise ranking objective, the whole set of model parameters can be optimized in an end-to-end manner using widely-used optimizers such as SGD.
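To illustrate the matching procedure of Eq. 9-12, the sketch below reuses the RAMBlock defined earlier; treating the inputs as single (unbatched) feature sets and fixing K = 3 are simplifications made for the example.

```python
import torch
import torch.nn.functional as F

def imram_similarity(V, T, ram_v, ram_t, K=3):
    """Similarity F(I, S) accumulated over K matching steps (Eq. 9-12).
    V: (m, d) region features of image I, T: (n, d) word features of text S.
    ram_v, ram_t: two independent RAMBlock instances."""
    V_k, T_k = V, T                                   # V_0 = V, T_0 = T
    score = V.new_zeros(())
    for _ in range(K):
        C_v, V_k = ram_v(V_k, T)                      # image-grounded step (Eq. 9)
        C_t, T_k = ram_t(T_k, V)                      # text-grounded step (Eq. 9)
        # Eq. 11: cosine similarities between original fragments and their context features
        f_region = F.cosine_similarity(V, C_v, dim=-1).mean()   # (1/m) sum_i sim(v_i, c_ki^v)
        f_word = F.cosine_similarity(C_t, T, dim=-1).mean()     # (1/n) sum_j sim(c_kj^t, t_j)
        score = score + f_region + f_word             # accumulate F_k(I, S) (Eq. 10 and 12)
    return score
```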

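And a minimal sketch of the hard-negative ranking loss of Eq. 13, assuming the step-summed similarities F(I_i, S_j) for a mini-batch have already been collected into a B x B matrix; the margin value of 0.2 is an assumption for the example.

```python
import torch

def hard_negative_triplet_loss(scores, margin=0.2):
    """Eq. 13 over a mini-batch.
    scores: (B, B) with scores[i, j] = F(I_i, S_j); the diagonal holds the matched pairs."""
    B = scores.size(0)
    diag = scores.diag().view(B, 1)
    cost_s = (margin - diag + scores).clamp(min=0)       # image query vs. negative texts
    cost_i = (margin - diag.t() + scores).clamp(min=0)   # text query vs. negative images
    mask = torch.eye(B, dtype=torch.bool, device=scores.device)
    cost_s = cost_s.masked_fill(mask, 0)
    cost_i = cost_i.masked_fill(mask, 0)
    # keep only the hardest negative for each query, as in [5]
    return cost_s.max(dim=1)[0].sum() + cost_i.max(dim=0)[0].sum()
```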
Table 2. Comparison with state-of-the-art models on Flickr30K.
Method    Text Retrieval: R@1 R@5 R@10    Image Retrieval: R@1 R@5 R@10    R@sum
DPC [27] 55.6 81.9 89.5 39.1 69.2 80.9 416.2
SCO [9] 55.5 82.0 89.3 41.1 70.5 80.1 418.5
SCAN* [13] 67.4 90.3 95.8 48.6 77.7 85.2 465.0
VSRN* [15] 71.3 90.6 96.0 54.7 81.8 88.2 482.6
Image-IMRAM 67.0 90.5 95.6 51.2 78.2 85.5 468.0
Text-IMRAM 68.8 91.6 96.0 53.0 79.0 87.1 475.5
Full-IMRAM 74.1 93.0 96.6 53.9 79.4 87.2 484.2

Table 3. Comparison with state-of-the-art models on MS COCO.
Method    Text Retrieval: R@1 R@5 R@10    Image Retrieval: R@1 R@5 R@10    R@sum
1K
DPC [27] 65.6 89.8 95.5 47.1 79.9 90.0 467.9
SCO [9] 69.9 92.9 97.5 56.7 87.5 94.8 499.3
SCAN* [13] 72.7 94.8 98.4 58.8 88.4 94.8 507.9
PVSE [23] 69.2 91.6 96.6 55.2 86.5 93.7 492.8
VSRN* [15] 76.2 94.8 98.2 62.8 89.7 95.1 516.8
Image-IMRAM 76.1 95.3 98.2 61.0 88.6 94.5 513.7
Text-IMRAM 74.0 95.6 98.4 60.6 88.9 94.6 512.1
Full-IMRAM 76.7 95.6 98.5 61.7 89.1 95.0 516.6
5K
DPC [27] 41.2 70.5 81.1 25.3 53.4 66.4 337.9
SCO [9] 42.8 72.3 83.0 33.1 62.9 75.5 369.6
SCAN* [13] 50.4 82.2 90.0 38.6 69.3 80.4 410.9
PVSE [23] 45.2 74.3 84.5 32.4 63.0 75.0 374.4
VSRN* [15] 53.0 81.1 89.4 40.5 70.6 81.1 415.7
Image-IMRAM 53.2 82.5 90.4 38.9 68.5 79.2 412.7
Text-IMRAM 52.0 81.8 90.1 38.6 68.1 79.1 409.7
Full-IMRAM 53.7 83.2 91.0 39.7 69.1 79.8 416.5

4. Experiment

4.1. Datasets and Evaluation Metric

Three benchmark datasets are used in our experiments: (1) Flickr8K contains 8,000 images and provides 5 texts for each image. We adopt its standard splits as in [20, 16], using 6,000 images for training, 1,000 images for validation and another 1,000 images for testing. (2) Flickr30K consists of 31,000 images and 158,915 English texts, with each image annotated with 5 texts. We follow the dataset splits of [13, 5] and use 29,000 images for training, 1,000 images for validation, and the remaining 1,000 images for testing. (3) MS COCO is a large-scale image description dataset containing about 123,287 images with at least 5 texts for each. As in previous works [13, 5], we use 113,287 images to train all models, 5,000 images for validation and another 5,000 images for testing. Results on MS COCO are reported both by averaging over 5 folds of 1K test images and by testing on the full 5K test images, as in [13].

To further validate the effectiveness of our method in practical scenarios, we build a new dataset, named KWAI-AD. We collect 81,653 image-text pairs from a real-world business advertisement platform, and we randomly sample 79,653 image-text pairs for training, 1,000 for validation and the remaining 1,000 for testing. The uniqueness of our dataset is that the provided texts are not detailed textual descriptions of the content of the corresponding images; they are only weakly associated with the images and convey strong affective rather than factual semantics (see Figure 2). Our dataset is thus more challenging than conventional datasets, but it is of great importance in practical business scenarios: learning the subtle links between advertisement images and their well-designed titles could not only enrich the understanding of vision and language but also benefit the development of recommender systems and social networks.

Figure 2. Difference between our KWAI-AD dataset and standard datasets, e.g. MS COCO (affective title: "Do not make us alone!" vs. factual description: "A yellow dog lies on the grass.").

Evaluation Metric. To compare our proposed method with the state-of-the-art methods, we adopt the same evaluation metrics on all datasets as [17, 13, 5]. Namely, we adopt Recall at K (R@K) to measure the performance of the bi-directional retrieval tasks, i.e. retrieving texts given an image query (Text Retrieval) and retrieving images given a text query (Image Retrieval). We report R@1, R@5, and R@10 for all datasets as in [13]. To further reveal the effectiveness of the proposed method, we also report an extra metric, "R@sum", which is the summation of all evaluation metrics, as in [8].
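As a reference for the metrics just described, the following sketch computes R@K from a query-by-candidate similarity matrix; assuming a single ground-truth index per query is a simplification (Flickr and MS COCO provide five texts per image), so it only illustrates the definition.

```python
import numpy as np

def recall_at_k(sim, gt_index, ks=(1, 5, 10)):
    """sim: (num_queries, num_candidates) similarity matrix.
    gt_index[q]: index of the ground-truth candidate for query q."""
    ranks = np.empty(sim.shape[0], dtype=np.int64)
    for q in range(sim.shape[0]):
        order = np.argsort(-sim[q])                       # candidates by decreasing similarity
        ranks[q] = np.where(order == gt_index[q])[0][0]   # rank of the ground truth
    return {f"R@{k}": 100.0 * float(np.mean(ranks < k)) for k in ks}

# R@sum simply adds the six recalls from both retrieval directions, e.g.
# r_sum = sum(recall_at_k(sim_i2t, gt_i2t).values()) + sum(recall_at_k(sim_t2i, gt_t2i).values())
```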

4.2. Implementation Details

To systematically validate the effectiveness of the proposed IMRAM, we experiment with three of its variants: (1) Image-IMRAM only adopts the RAM block grounded on images (i.e. only the first term in Eq. 10); (2) Text-IMRAM only adopts the RAM block grounded on texts (i.e. only the second term in Eq. 10); (3) Full-IMRAM uses both. All models are implemented in PyTorch v1.0. On all datasets, the word embedding for each word in the texts is initialized with random weights with a dimensionality of 300. We use a bi-directional GRU with one layer and set its hidden state (i.e. →h_j and ←h_j in Eq. 2) dimensionality to 1,024. The dimensionality of each region feature (i.e. v_i in V) and each word feature (i.e. t_j in T) is set to 1,024. On the three benchmark datasets, we use Faster R-CNN pre-trained on Visual Genome to extract 36 region features for each image. For our KWAI-AD dataset, we simply use Inception v3 to extract 64 features for each image.

4.3. Results on Three Benchmark Datasets

We compare our proposed IMRAM with published state-of-the-art models on the three benchmark datasets2. We directly cite the best reported results from the respective papers when available. For our proposed models, we perform 3 steps of iterative matching by default.

2 We omit models that require additional data augmentation [19, 22, 17, 14, 2, 10].

Results. Comparison results are shown in Table 1, Table 2 and Table 3 for Flickr8K, Flickr30K and MS COCO, respectively. '*' indicates the performance of an ensemble model; '-' means unreported results. We can see that our proposed IMRAM consistently achieves performance improvements in terms of all metrics compared to the state-of-the-art models.

Specifically, our Full-IMRAM outperforms one of the previous best models, i.e. SCAN* [13], by a large margin of 12.6%, 19.2%, 8.7% and 5.6% in terms of the overall performance R@sum on Flickr8K, Flickr30K, MS COCO (1K) and MS COCO (5K), respectively. Among the recall metrics for the text retrieval task, our Full-IMRAM obtains a maximal performance improvement of 3.2% (R@5 on Flickr8K), 6.7% (R@1 on Flickr30K), 4.0% (R@1 on MS COCO (1K)) and 3.3% (R@1 on MS COCO (5K)), respectively. As for the image retrieval task, the maximal improvements are 2.7% (R@1 on Flickr8K), 5.3% (R@1 on Flickr30K), 2.9% (R@1 on MS COCO (1K)) and 1.1% (R@1 on MS COCO (5K)), respectively. Compared with VSRN* [15], our single model achieves competitive results on both Flickr30K and MS COCO. These results well demonstrate that the proposed method is highly effective for cross-modal image-text retrieval. Besides, our models consistently achieve state-of-the-art performance not only on the small datasets, i.e. Flickr8K and Flickr30K, but also on the large-scale dataset, i.e. MS COCO, which well demonstrates their robustness.

4.4. Model Analysis

Effect of the total number of matching steps, K. For all three variants of IMRAM, we gradually increase K from 1 to 3 and train and evaluate them on the benchmark datasets. Due to limited space, we only report results on MS COCO (5K test) in Table 4. We can observe that for all variants, K = 2 and K = 3 consistently achieve better performance than K = 1, and K = 3 performs better than or comparably to K = 2. This observation well demonstrates that the iterative matching scheme effectively improves model performance. Besides, our Full-IMRAM consistently outperforms Image-IMRAM and Text-IMRAM for different values of K.

Table 4. The effect of the total number of matching steps, K, on variants of IMRAM on MS COCO (5K).
Model    K    Text Retrieval: R@1 R@10    Image Retrieval: R@1 R@10
Image-IMRAM    1    40.8 85.7    34.6 76.2
Image-IMRAM    2    51.5 89.5    37.7 78.3
Image-IMRAM    3    53.2 90.4    38.9 79.2
Text-IMRAM    1    46.2 87.0    34.4 75.9
Text-IMRAM    2    50.4 89.2    37.4 78.3
Text-IMRAM    3    51.4 89.9    39.2 79.2
Full-IMRAM    1    49.7 88.9    35.4 76.7
Full-IMRAM    2    53.1 90.2    39.1 79.5
Full-IMRAM    3    53.7 91.0    39.7 79.8

Effect of the memory distillation unit. The aggregation function f(x, y) in Eq. 6 is essential for the proposed iterative matching process. We enumerate some basic aggregation functions and compare them with ours: (1) add: x + y; (2) mlp: x + tanh(W y + b); (3) att: αx + (1 − α)y, where α is a real-valued scalar parameterized by x and y; (4) gate: βx + (1 − β)y, where β is a real-valued vector parameterized by x and y. We conduct the analysis with Text-IMRAM (K = 3) on Flickr30K in Table 5. We can observe that the aggregation function we use (i.e. Eq. 7) achieves substantially better performance than the baseline functions.

Table 5. The effect of the aggregating function in the proposed memory distillation unit of Text-IMRAM (K = 3) on Flickr30K.
Memory    Text Retrieval: R@1 R@10    Image Retrieval: R@1 R@10
add    64.5 95.1    49.2 84.9
mlp    66.6 96.4    52.8 86.2
att    66.1 95.5    52.1 86.2
gate    66.2 96.4    52.5 86.1
ours    68.8 96.0    53.0 87.1
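For reference, the four baseline aggregation functions compared in Table 5 can be sketched as follows; the exact parameterization of α and β is not spelled out in the paper, so the small learned layers used here are assumptions.

```python
import torch
import torch.nn as nn

class BaselineAggregators(nn.Module):
    """Baseline choices of f(x, c) from the memory-distillation ablation (Table 5)."""
    def __init__(self, dim=1024):
        super().__init__()
        self.mlp = nn.Linear(dim, dim)            # W, b of the 'mlp' variant
        self.att_score = nn.Linear(2 * dim, 1)    # scalar alpha from [x, c] (assumed form)
        self.gate_vec = nn.Linear(2 * dim, dim)   # vector beta from [x, c] (assumed form)

    def add(self, x, c):
        return x + c                              # (1) add

    def mlp_agg(self, x, c):
        return x + torch.tanh(self.mlp(c))        # (2) mlp: x + tanh(W c + b)

    def att(self, x, c):
        a = torch.sigmoid(self.att_score(torch.cat([x, c], dim=-1)))
        return a * x + (1 - a) * c                # (3) att: scalar mixing weight

    def gate(self, x, c):
        b = torch.sigmoid(self.gate_vec(torch.cat([x, c], dim=-1)))
        return b * x + (1 - b) * c                # (4) gate: element-wise mixing vector
```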

Figure 3. Visualization of attention at each matching step (k = 1, 2, 3) in Text-IMRAM. Corresponding matched words (e.g. "jeans", "building", "beautiful", "green", "laid", "petting") are in blue, followed by the matching similarity.

4.5. Qualitative Analysis

Here we intend to provide more insights into the effectiveness of our models. For convenience of explanation, we mainly analyze semantic concepts from the view of language, instead of from the view of vision, i.e. we treat each word in the text as one semantic concept. Therefore, we conduct the qualitative analysis on Text-IMRAM.

We first visualize the attention map at each matching step in Text-IMRAM (K = 3) for different semantic concepts in Figure 3. We can see that the attention is refined and gradually focuses on the matched regions.

To quantitatively analyze the alignment of semantic concepts, we first define when a semantic concept in Text-IMRAM is salient at matching step k as follows: 1) Given an image-text pair, at matching step k, we derive the word-based matching score by Eq. 11 for each word with respect to the image, and derive the image-text matching score by averaging all the word-based scores (see Eq. 10). 2) A semantic concept is salient if its corresponding word-based score is greater than the image-text score. For a set of image-text pairs randomly sampled from the testing set, we can then compute the percentage of such salient semantic concepts for each model at different matching steps.

We then analyze how the salient semantic concepts captured at different matching steps change in Text-IMRAM (K = 3). Statistical results are shown in Table 6. We can see that at the first matching step, nouns are easily recognized and dominate the matching. During the subsequent matching steps, the contributions of verbs and adjectives increase.

Table 6. Statistical results of salient semantics at each matching step, k, in Text-IMRAM (K = 3) on MS COCO.
k    nouns (%)    verbs (%)    adjectives (%)
1    99.0    32.0    35.3
2    99.0    38.8    37.9
3    99.0    40.2    39.1

4.6. Results on the Newly-Collected Ads Dataset

We evaluate our proposed IMRAM on our KWAI-AD dataset and compare our models with the state-of-the-art SCAN models of [13]. Comparison results are shown in Table 7. We can see that the overall performance on this dataset is much lower than that on the benchmark datasets, which indicates the challenge of cross-modal retrieval in real-world business advertisement scenarios. The results also show that our models obtain substantial improvements over the compared models, which demonstrates the effectiveness of the proposed method on this dataset.

Table 7. Results on the Ads dataset.
Method    Text Retrieval: R@1 R@10    Image Retrieval: R@1 R@10
i-t AVG [13]    7.4 21.1    2.1 9.3
Image-IMRAM    10.7 25.1    3.4 16.8
t-i AVG [13]    6.8 20.8    2.0 9.9
Text-IMRAM    8.4 21.5    2.3 15.9
i-t + t-i [13]    7.3 22.5    2.7 11.5
Full-IMRAM    10.2 27.7    3.4 21.7

5. Conclusion

In this paper, we propose an Iterative Matching method with a Recurrent Attention Memory network (IMRAM) for cross-modal image-text retrieval that handles the complexity of semantics. Our IMRAM explores the correspondence between images and texts in a progressive manner with two features: (1) an iterative matching scheme with a cross-modal attention unit to align fragments from different modalities; (2) a memory distillation unit to refine alignment knowledge from early steps to later ones. We validate our models on three benchmarks (i.e. Flickr8K, Flickr30K and MS COCO) as well as a new dataset (i.e. KWAI-AD) for practical business advertisement scenarios. Experiment results on all datasets show that our IMRAM consistently outperforms the compared methods and achieves state-of-the-art performance.

References

[1] Hui Chen, Guiguang Ding, Zijia Lin, Sicheng Zhao, and Jungong Han. Cross-modal image-text retrieval with semantic consistency. In Proceedings of the 27th ACM International Conference on Multimedia, pages 1749–1757, 2019.
[2] Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. UNITER: Learning universal image-text representations. arXiv preprint arXiv:1909.11740, 2019.
[3] Jan K. Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, and Yoshua Bengio. Attention-based models for speech recognition. In Advances in Neural Information Processing Systems, pages 577–585, 2015.
[4] Guiguang Ding, Wenshuo Chen, Sicheng Zhao, Jungong Han, and Qiaoyan Liu. Real-time scalable visual tracking via quadrangle kernelized correlation filters. IEEE Transactions on Intelligent Transportation Systems, 19(1):140–150, 2017.
[5] Fartash Faghri, David J. Fleet, Jamie Ryan Kiros, and Sanja Fidler. VSE++: Improving visual-semantic embeddings with hard negatives. arXiv preprint arXiv:1707.05612, 2017.
[6] Andrea Frome, Greg S. Corrado, Jon Shlens, Samy Bengio, Jeff Dean, Tomas Mikolov, et al. DeViSE: A deep visual-semantic embedding model. In Advances in Neural Information Processing Systems, pages 2121–2129, 2013.
[7] Yuchen Guo, Guiguang Ding, Xiaoming Jin, and Jianmin Wang. Learning predictable and discriminative attributes for visual recognition. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.
[8] Yan Huang, Wei Wang, and Liang Wang. Instance-aware image and sentence matching with selective multimodal LSTM. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2310–2318, 2017.
[9] Yan Huang, Qi Wu, Chunfeng Song, and Liang Wang. Learning semantic concepts and order for image and sentence matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6163–6171, 2018.
[10] Zhong Ji, Haoran Wang, Jungong Han, and Yanwei Pang. Saliency-guided attention network for image-sentence matching. In Proceedings of the IEEE International Conference on Computer Vision, pages 5754–5763, 2019.
[11] Andrej Karpathy and Li Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3128–3137, 2015.
[12] Ryan Kiros, Ruslan Salakhutdinov, and Richard S. Zemel. Unifying visual-semantic embeddings with multimodal neural language models. arXiv preprint arXiv:1411.2539, 2014.
[13] Kuang-Huei Lee, Xi Chen, Gang Hua, Houdong Hu, and Xiaodong He. Stacked cross attention for image-text matching. In Proceedings of the European Conference on Computer Vision (ECCV), pages 201–216, 2018.
[14] Gen Li, Nan Duan, Yuejian Fang, Ming Gong, Daxin Jiang, and Ming Zhou. Unicoder-VL: A universal encoder for vision and language by cross-modal pre-training, 2019.
[15] Kunpeng Li, Yulun Zhang, Kai Li, Yuanyuan Li, and Yun Fu. Visual semantic reasoning for image-text matching. In Proceedings of the IEEE International Conference on Computer Vision, pages 4654–4662, 2019.
[16] Lin Ma, Zhengdong Lu, Lifeng Shang, and Hang Li. Multimodal convolutional neural networks for matching image and sentence. In Proceedings of the IEEE International Conference on Computer Vision, pages 2623–2631, 2015.
[17] Niluthpol Chowdhury Mithun, Rameswar Panda, Evangelos E. Papalexakis, and Amit K. Roy-Chowdhury. Webly supervised joint embedding for cross-modal image-text retrieval. In Proceedings of the 26th ACM International Conference on Multimedia, pages 1856–1864, 2018.
[18] Hyeonseob Nam, Jung-Woo Ha, and Jeonghee Kim. Dual attention networks for multimodal reasoning and matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 299–307, 2017.
[19] Duy-Kien Nguyen and Takayuki Okatani. Multi-task learning of hierarchical vision-language representation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
[20] Z. Niu, M. Zhou, L. Wang, X. Gao, and G. Hua. Hierarchical multimodal LSTM for dense visual-semantic embedding. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 1899–1907, 2017.
[21] Leonid Perlovsky. Language and cognition interaction neural mechanisms. Computational Intelligence and Neuroscience, 2011, 2011.
[22] Botian Shi, Lei Ji, Pan Lu, Zhendong Niu, and Nan Duan. Knowledge aware semantic concept expansion for image-text matching. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence (IJCAI-19), pages 5182–5189, July 2019.
[23] Yale Song and Mohammad Soleymani. Polysemous visual-semantic embedding for cross-modal retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
[24] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2818–2826, 2016.
[25] Kai Sheng Tai, Richard Socher, and Christopher D. Manning. Improved semantic representations from tree-structured long short-term memory networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics, pages 1556–1566, Beijing, China, July 2015.
[26] Liwei Wang, Yin Li, and Svetlana Lazebnik. Learning deep structure-preserving image-text embeddings. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5005–5013, 2016.
[27] Zhedong Zheng, Liang Zheng, Michael Garrett, Yi Yang, and Yi-Dong Shen. Dual-path convolutional image-text embedding with instance loss. arXiv preprint arXiv:1711.05535, 2017.
