Fine-Grained Visual Textual Alignment For Cross-Modal Retrieval Using Transformer Encoders
1 INTRODUCTION
Since 2012, deep learning has obtained impressive results in several vision and language tasks.
Recently, various attempts have been made to merge the two worlds, and state-of-the-art results
have been obtained in many of these tasks, including visual question answering [2, 12, 52], image
captioning [6, 14, 48, 65], and image-text matching [5, 9, 28, 39]. In this work, we deal with the
cross-modal retrieval task, with a focus on the visual and textual modalities. The task consists
in finding the top-relevant images representing a natural language sentence given as a query
(image-retrieval), or, vice versa, in finding a set of sentences that best describe an image given as a
query (sentence-retrieval).
This cross-modal retrieval task is closely related to image-sentence matching, which consists in
assigning a score to a pair composed of an image and a sentence. The score is high if the sentence
adequately describes the image, and low if the input sentence is unrelated to the corresponding
image. The score function learned by solving the matching problem can then be used to decide which are the top-relevant images and sentences in the two image- and sentence-retrieval scenarios. The matching problem is often very difficult, since a deep high-level understanding of images and sentences is needed to succeed in this task.
Humans rely on both visuals and texts to fully understand the real world. Although the two modalities are of equal importance, the information they carry has a very different nature. Text is a well-structured description that humans have developed over hundreds of years, while images are raw matrices of pixels hiding very high-level concepts and structures. Moreover, images and texts do not describe only static entities: they can easily portray relationships between the objects of interest, e.g., "The kid kicks the ball". Therefore, it would be helpful to also understand the spatial and even abstract relationships linking these objects together.
Vision and language matching has been extensively studied [3, 9, 24, 28, 39]. Many works employ standard architectures for processing images and texts, such as CNN-based models for images and recurrent networks for language. In this scenario, the image embeddings are usually extracted from standard image classification networks, such as ResNet or VGG, by employing the network activations before the classification head. However, descriptions extracted from CNNs trained on classification tasks can only capture global, summarized features of the image, ignoring important localized details. For this reason, recent works make extensive use of attention mechanisms, which relate each visual object, extracted from the spatial locations of a feature map or from an object detector, to the most relevant parts of the sentence, and/or vice versa.
Many of these works, such as ViLBERT [39], ImageBERT [45], VL-BERT [51], and IMRAM [4], try to learn a complex scoring function $s = \phi(I, C)$ that measures the affinity between an image and a caption, where $I$ is the image, $C$ is the caption, and $s$ is a normalized score in the range $[0, 1]$. These models are very effective at tackling the matching task, and they reach state-of-the-art results. However, they remain very inefficient for large-scale image or sentence retrieval: since their pipelines are strongly entangled through cross-attention or memory layers, it is not possible to extract visual and textual descriptions separately. Thus, if we want to retrieve images related to a given query text, we have to compute all the similarities using the $\phi$ function and then sort the resulting scores in descending order. This is infeasible if we want to retrieve images or sentences from a large database in a few milliseconds.
In our previous work, we introduced the Transformer Encoder Reasoning Network (TERN)
architecture [43], which is a transformer-based model able to independently process images and
sentences to match them into the same common space. TERN is a useful architecture for producing
compact yet informative features that could be used in cross-modal retrieval setups for efficient
indexing using metric-space or text-based approaches. TERN processes visual and textual elements
using transformer encoder layers, exploring and reasoning on the relationships among image
regions and sentence words. However, its main objective is to match images and sentences as
atomic, global entities, by learning a global representation of them inside special tokens (I-CLS and
T-CLS) processed by the transformer encoder. This usually leads to performance loss and possibly
poor generalization since fine-grained information useful for effective matching is lost during the
projection to a fixed-sized common space.
For this reason, in this work, we propose TERAN (Transformer Encoder Reasoning and Alignment
Network) in which we force a fine-grained word-region alignment. Fine-grained matching deals
with the accurate understanding of the local correspondences between image regions and words,
as opposed to coarse-grained matching, where only a summarized global description of the two
modalities is considered. In fact, differently from TERN, the objective function is directly defined on
the set of regions and words in output from the architecture, and not on a potentially lossy global
representation. Using this objective, TERAN tries to individually align the regions and the words
contained in images and sentences respectively, instead of directly matching images and sentences
as a whole. The information available to TERAN during training is still coarse-grained, as we do
not inject any information about word-region correspondences. The fine-grained alignment is thus
obtained in a semi-supervised setup, where no explicit word-region correspondences are given to
the network.
Our TERAN proposal shares most of the previous TERN building blocks and interconnections:
the visual and textual pipelines are forwarded separately and they are fused only during the loss
computation, in the very last stage of the architecture, making scalable cross-modal information
retrieval possible. At the same time, this novel architecture employs state-of-the-art self-attentive
modules, based on the transformer encoder architecture [53], able to spot hidden relationships
in both modalities for a very effective fine-grained alignment.
Therefore, TERAN is able to produce independent visual and textual features usable in efficient retrieval scenarios, by implementing two simple visual and textual pipelines built of modern self-attentive mechanisms. In spite of its overall simplicity, TERAN is able to reach state-of-the-art
results in the image and sentence retrieval task, even when compared with complex entangled
visual-textual matching models. Experiments show that TERAN can generalize better with respect
to the previous TERN approach.
In the evaluation of the proposed matching procedure, we used a typical information retrieval
setup using the Recall@K metrics (with $K \in \{1, 5, 10\}$). However, in common search engines where
the user is searching for related images and not necessarily exact matches, the Recall@K evaluation
could be too rigid, especially when $K = 1$. For this reason, as in our previous work [43], in addition
to the strict Recall@K metric, we propose to measure the retrieval abilities of the system with a
normalized discounted cumulative gain metric (NDCG) with relevance computed exploiting caption
similarities.
Summarizing, the contributions of this paper are the following:
• we introduce the Transformer Encoder Reasoning and Alignment Network (TERAN), able to
produce fine-grained region-word alignments for efficient cross-modal information retrieval.
• we show that TERAN can reach state-of-the-art results on the cross-modal visual-textual
retrieval task, both in terms of Recall@K and NDCG, while producing visually pleasing region-word alignments without using supervision at the region-word level. Retrieval results are
measured both on MS-COCO and Flickr30k datasets.
• we quantitatively compare TERAN with our previous work [43], and we perform an extensive
study on several variants of our novel model, including weight sharing in the last transformer
layers, stop-words removal during training, different pooling protocols for the matching loss
function, and the usage of different language models.
2 RELATED WORK
In this section, we review some of the previous works related to image-text joint processing for
cross-modal retrieval and alignment, and high-level relational reasoning, on which this work lays
its foundations. Also, we briefly summarize the evaluation metrics available in the literature for the
cross-modal retrieval task.
Networks (GCNs) and a GRU to sequentially reason on the different image regions. Furthermore,
they impose a sentence reconstruction loss to regularize the training process. The authors in [18]
use a similar objective, but employ a pre-trained multi-label CNN to find semantically relevant
image patches and their vectorial descriptions. Differently, in [50] an adversarial learning method
is proposed, where a discriminator is used to learn modality-invariant representations. The authors
in [11] use a contextual attention-based LSTM-RNN which can selectively attend to salient regions
of an image at each time step, and they employ a recurrent canonical correlation analysis to find
hidden semantic relationships between regions and words.
The works closest to our setup are SAEM [59] and CAMERA [46]. In [59] the authors use triplet
and angular loss to project the image and sentence features into the same common space. The
visual and textual features are obtained through transformer encoder modules. Differently from our
work, they do not enforce fine-grained alignments and they pool the final representations to obtain
a single-vector representation. Instead, in [46] the authors use BERT as language model and an
adaptive gating self-attention module to obtain context-enhanced visual features, projecting them
into the same common space using cosine similarity. Unlike our work, they specifically focus on
multi-view summarization, as multiple sentences can describe the same images in many different
but complementary ways.
The loss used in our work is inspired by the matching loss introduced by the MRNN architecture [24], which seems able to produce very good region-word alignments with supervision only at the global image-sentence level.
High-Level Reasoning
Another branch of research from which this work draws inspiration is focused on the study of
relational reasoning models for high-level understanding. The work in [49] proposes an architecture
that separates perception from reasoning. They tackle the problem of Visual Question Answering
by introducing a particular layer called Relation Network (RN), which is specialized in comparing
pairs of objects. Object representations are learned using a four-layer CNN, and the question
embedding is generated through an LSTM. The authors in [41, 42] extend the RN for producing
compact features for relation-aware image retrieval. However, they do not explore the multi-modal
retrieval setup.
Other solutions try to stick more to a symbolic-like way of reasoning. In particular, [12, 23]
introduce compositional approaches able to explicitly model the reasoning process by dynamically
building a reasoning graph that states which operations must be carried out and in which order to
obtain the right answer.
Recent works employ Graph Convolution Networks (GCNs) to reason about the interconnections
between concepts. The authors in [31, 62, 63] use GCNs to reason on the image regions for image
captioning, while [32, 61] use GCN with attention mechanisms to produce the scene graph from
plain images.
Fig. 1. A high-level view of the transformer encoder layer. Every arrow carries 𝑠 fixed-sized vectors.
We extend the metric introduced in [3], giving rise to a powerful evaluation protocol that handles
non-exact yet relevant matches. Relaxing the constraints of exact-match similarity search is an
important step towards an effective evaluation of real search engines.
Unlike recurrent models, the TE does not impose any built-in sequential prior that considers every vector in a precise position in the sequence.
This makes the TE suitable for processing visual features coming from an object detector.
We argue that the transformer encoder self-attention mechanism can drive a simple but powerful reasoning process able to spot hidden relationships between the vector entities, whatever their nature (visual or textual). Also, the encoder is designed so that multiple instances of it can be stacked in sequence: using multiple levels of attention helps produce a deeper and more powerful reasoning pipeline.
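To make this concrete, the following is a minimal sketch, assuming PyTorch and its built-in nn.TransformerEncoder, of such a stack applied to an unordered set of region features: no positional encoding is added, so no sequential prior is imposed, and padded elements of variable-length sets can be excluded through the key padding mask. The layer sizes follow the experimental setup of Section 7 (1024-d common space, 2048-d feed-forward layers, dropout 0.1), while the number of attention heads is an illustrative assumption.

```python
import torch
from torch import nn

# One transformer encoder layer; stacking several deepens the reasoning pipeline.
layer = nn.TransformerEncoderLayer(d_model=1024, nhead=4,   # nhead is an assumption
                                   dim_feedforward=2048, dropout=0.1)
visual_reasoning = nn.TransformerEncoder(layer, num_layers=4)

# 36 region features per image, batch of 2 images, already projected to 1024-d.
# No positional encoding is added: the regions form an unordered set.
regions = torch.randn(36, 2, 1024)                      # (set size, batch, dim)
padding_mask = torch.zeros(2, 36, dtype=torch.bool)     # True marks padded regions
out = visual_reasoning(regions, src_key_padding_mask=padding_mask)
print(out.shape)                                        # torch.Size([36, 2, 1024])
```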
Fig. 2. The proposed TERAN architecture. TEs stands for Transformer Encoders, and it indicates a stack of
TE layers whose internals are recalled in Section 3 and explained in detail in [53]. Region and word features
are extracted through a bottom-up attention model based on Faster-RCNN and BERT, respectively. The final
image-text (I-T) similarity score is obtained by pooling a region-word (R-W) similarity matrix. Note that the
special I-CLS and T-CLS tokens are not used in the basic formulation of TERAN.
$$A_{ij} = \frac{\mathbf{v}_i^T \mathbf{s}_j}{\lVert \mathbf{v}_i \rVert \, \lVert \mathbf{s}_j \rVert}, \qquad i \in g_k, \; j \in g_l \tag{2}$$
At this point, the global similarity $S_{kl}$ between the $k$-th image and the $l$-th sentence is computed by pooling this similarity matrix through an appropriate pooling function. Inspired by [24] and [28], we employ max-sum pooling, which consists in computing the max over the rows of $A$ and then summing, or, equivalently, max-over-regions sum-over-words ($M_rS_w$) pooling. We also explore the dual version, as in [28], obtained by computing the max over the columns and then summing, or max-over-words sum-over-regions ($M_wS_r$) pooling:
$$S_{kl}^{M_rS_w} = \sum_{j \in g_l} \max_{i \in g_k} A_{ij} \qquad \text{or} \qquad S_{kl}^{M_wS_r} = \sum_{i \in g_k} \max_{j \in g_l} A_{ij} \tag{3}$$
Since neither of these similarity functions is symmetric, as inverting the order of the sum and max operations yields different outcomes, we also introduce a symmetric form, obtained by summing the two:
$$S_{kl}^{\text{Symm}} = S_{kl}^{M_rS_w} + S_{kl}^{M_wS_r} \tag{4}$$
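The following is a minimal sketch, assuming PyTorch, of how Eqs. (2)-(4) can be computed for a single image-sentence pair: a cosine-similarity matrix between region and word vectors, followed by max-sum pooling in either direction or in the symmetric form. Function and variable names are illustrative.

```python
import torch
import torch.nn.functional as F

def pairwise_similarity(v, s):
    """Cosine similarity matrix A (Eq. 2): v holds the region vectors of one
    image (n_regions x d), s the word vectors of one sentence (n_words x d)."""
    v = F.normalize(v, dim=-1)
    s = F.normalize(s, dim=-1)
    return v @ s.t()                       # (n_regions, n_words)

def pooled_similarity(A, mode="MrSw"):
    """Global image-sentence score S_kl (Eqs. 3-4) obtained by pooling A."""
    mr_sw = A.max(dim=0).values.sum()      # max over regions, sum over words
    mw_sr = A.max(dim=1).values.sum()      # max over words, sum over regions
    if mode == "MrSw":
        return mr_sw
    if mode == "MwSr":
        return mw_sr
    return mr_sw + mw_sr                   # symmetric form (Eq. 4)

# toy usage: 36 regions and 12 words projected into a 1024-d common space
A = pairwise_similarity(torch.randn(36, 1024), torch.randn(12, 1024))
print(pooled_similarity(A, "MrSw"), pooled_similarity(A, "Symm"))
```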
where $[x]_+ \equiv \max(0, x)$ and $\alpha$ is a margin that defines the minimum separation that should hold between the truly matching word-region embeddings and the negative pairs. The hard negatives $k'$ and $l'$ are computed as follows:
$$k' = \arg\max_{j \neq k} S(j, l), \qquad l' = \arg\max_{d \neq l} S(k, d) \tag{6}$$
where $(k, l)$ is a positive pair. As in [9], the hard negatives are sampled from the mini-batch and
not globally, for performance reasons.
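As a concrete illustration of this in-batch hard-negative mining, here is a hedged sketch, assuming PyTorch, of a hinge-based triplet ranking loss in the spirit of VSE++ [9]: it selects the hardest negatives of Eq. (6) inside a mini-batch similarity matrix and applies the margin $\alpha$ through the $[x]_+$ operator. The exact loss expression of the paper is not reproduced here; this is a common max-hinge formulation.

```python
import torch

def hard_negative_triplet_loss(S, alpha=0.2):
    """Hinge-based triplet ranking loss with in-batch hard negatives.
    S is the (B x B) matrix of pooled image-sentence similarities for a
    mini-batch, with the positive pairs on the diagonal (S[k, k])."""
    B = S.size(0)
    pos = S.diag()                                   # S(k, k) for each positive pair
    mask = torch.eye(B, dtype=torch.bool, device=S.device)
    S_neg = S.masked_fill(mask, float("-inf"))       # exclude the positives

    hardest_image = S_neg.max(dim=0).values          # k' = argmax_{j != k} S(j, l)
    hardest_caption = S_neg.max(dim=1).values        # l' = argmax_{d != l} S(k, d)

    loss = torch.clamp(alpha + hardest_image - pos, min=0) \
         + torch.clamp(alpha + hardest_caption - pos, min=0)  # [x]_+ hinges
    return loss.sum()

# usage: similarity scores for a mini-batch of 40 image-sentence pairs
loss = hard_negative_triplet_loss(torch.randn(40, 40), alpha=0.2)
```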
to deploy to real-world scalable search engines. Some of these works partially solve this issue
by keeping the two representations separated up until a certain point in the network, so that
these intermediate representations can be cached, as proposed in [40]. In all these cases, a new
incoming query to the system needs $O(K)$ or $O(L)$ re-evaluations of the whole network (depending
on whether we are considering image or sentence retrieval); in the best case, we need to re-evaluate
the last attention layers, which could be similarly expensive.
Regarding the similarity computation, TERN uses simple dot products that enable quick and efficient document ranking in modern search engines. TERAN also implements a very simple similarity function, built of dot products and summations, without complex memory or attention layers. This possibly enables an implementation that uses metric-space approaches to prune the search space and obtain very efficient image or sentence rankings for a given query.
However, the implementation of the TERAN similarity function in real-world search engines is left
for future research.
Here $\mathrm{rel}_i$ is a positive number encoding the affinity that the $i$-th element of the retrieved list has with the query element, and $\mathrm{IDCG}_p$ is the $\mathrm{DCG}_p$ of the best possible ranking. Thanks to this normalization, $\mathrm{NDCG}_p$ takes values in the range $[0, 1]$.
The $\mathrm{rel}_i$ values can be computed using well-established sentence similarity scores between a sentence and the sentences associated with a certain image. More formally, we could think of computing $\mathrm{rel}_i = \tau(\bar{C}_i, C_j)$, where $\bar{C}_i$ is the set of all captions associated with the image $I_i$, and $\tau : \mathcal{S} \times \mathcal{S} \rightarrow [0, 1]$ is a similarity function defined over a pair of sentences, returning their normalized similarity score. With this simple expedient, we could efficiently compute quite large relevance matrices using similarities defined over captions, which are in general computationally much cheaper than similarities computed directly between images and sentences.
We thus compute the $\mathrm{rel}_i$ value in the following ways:
• $\mathrm{rel}_i = \tau(\bar{C}_i, C_j)$ in the case of image retrieval, where $C_j$ is the query caption;
• $\mathrm{rel}_i = \tau(\bar{C}_j, C_i)$ in the case of caption retrieval, where $\bar{C}_j$ is the set of captions associated with the query image $I_j$.
In our work, we use ROUGE-L [33] and SPICE [1] as sentence similarity functions $\tau$ for computing
caption similarities. These two scoring functions capture different aspects of the sentences. In
particular, ROUGE-L operates on the longest common sub-sequences, while SPICE exploits graphs
associated with the syntactic parse trees, and has a certain degree of robustness against synonyms.
In this way, SPICE is more sensitive to high-level features of the text and semantic dependencies
between words and concepts rather than to pure syntactic constructions.
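To make the metric concrete, the following is a minimal sketch, in plain Python, of $\mathrm{NDCG}_p$ computed from a list of caption-based relevance scores. The standard logarithmic discount is assumed here, since the exact DCG variant is not reproduced in this section; the relevance values would come from ROUGE-L or SPICE similarities computed with external tools (not shown).

```python
import math

def dcg(relevances, p=25):
    """Discounted cumulative gain over the first p retrieved items; rel_i are
    the caption-based relevances tau(C_bar_i, C_j) in [0, 1]. A log2 discount
    is assumed."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:p]))

def ndcg(relevances, p=25):
    """NDCG_p = DCG_p / IDCG_p, where IDCG_p is the DCG of the ideal ranking."""
    idcg = dcg(sorted(relevances, reverse=True), p)
    return dcg(relevances, p) / idcg if idcg > 0 else 0.0

# usage: relevances of the first retrieved images w.r.t. a query caption
print(ndcg([0.9, 0.2, 0.7, 0.1, 0.4], p=25))
```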
7 EXPERIMENTS
We trained the TERAN architecture and we measured its performance on the MS-COCO [34] and
Flickr30k datasets [64], computing the effectiveness of our approach on the image retrieval and
sentence retrieval tasks. We compared our results against state-of-the-art approaches on the same
datasets, using both the newly introduced NDCG metric and the commonly used Recall@K metric.
The MS-COCO dataset comes with a total of 123,287 images. Every image is associated with a set of 5 human-written captions describing it. We follow the splits introduced by [24] and adopted by the subsequent works in this field [9, 10, 30]. In particular, 113,287 images are reserved for training, 5,000 for validation, and 5,000 for testing. Flickr30k, instead, consists of 31,000 images and 158,915 English captions. Like MS-COCO, each image is annotated with 5 captions. Following the
splits by [24], we use 29,000 images for training, 1,000 images for validation, and the remaining
1,000 images for testing. For MS-COCO, at test time the results for both 5k and 1k test-sets are
reported. In the case of 1k images, the results are computed by performing 5-fold cross-validation
on the 5k test split and averaging the outcomes.
We computed caption-caption relevance scores for the NDCG metric using ROUGE-L [33] and SPICE [1], as explained in Section 6, and we set the NDCG parameter $p = 25$ as in [3] in our
experiments. We employed the NDCG metrics measured during the validation phase for choosing
the best performing model to be used during the test phase.
For a better comparison with our previous TERN approach, we included three more targeted experiments. In the first two, called TERN $M_wS_r$ Test and TERN $M_rS_w$ Test, we used the best-performing TERN model, trained as explained in [43], and tested it using the $M_wS_r$ and $M_rS_w$ alignment criteria respectively. TERN is effectively able to output features for every image region or word; however, it is never constrained to produce meaningful descriptions out of these sets of features. Hence, this trial is aimed at checking the quality of the alignment of the concepts in output from the previous TERN architecture. In the third experiment, called TERN w. Align, we tried to integrate
the objectives of both TERN and TERAN during training, by combining their losses using the
uncertainty weighting method proposed in [25], and testing the model using the TERN inference
protocol. Thus, in this experiment, we effectively reuse the I-CLS and T-CLS tokens as global
descriptions for images and sentences, as described in [43]. This experiment aimed to evaluate whether the
TERAN alignment objective can help TERN learn better fixed-sized global vectorial descriptions.
For MS-COCO, we used the already-extracted bottom-up features, while we extracted from scratch the features for Flickr30k using the available pre-trained model.
In the experiments, we used the bottom-up features containing the top 36 most confident
detections, although our pipeline already handles variable-length sets of regions for each image by
appropriately masking the attention weights in the TE layers.
Concerning the reasoning steps, we used a stack of 4 TE layers for visual reasoning. We found the
best results when fine-tuning the BERT pre-trained model, so we did not add further reasoning TE
layers for the textual pipeline. The final common space, as in [9], is 1024-dimensional. We linearly
projected the visual and textual features to a 1024-d space and then we processed the resulting
features using 2 final TEs before computing the alignment matrix.
All the TE feed-forward layers are 2048-dimensional and the dropout is set to 0.1. We trained for 30 epochs using the Adam optimizer with a batch size of 40 and a learning rate of 1e-5 for the first 20 epochs and 1e-6 for the remaining 10 epochs. The $\alpha$ parameter of the hinge-based triplet ranking loss is set to 0.2, as in [9, 30].
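For reference, here is a minimal sketch, assuming PyTorch, of the optimization schedule described above; the model is a placeholder, and the real training loop (feature extraction, pooling, and the hinge loss) is only indicated in comments.

```python
import torch
from torch import nn, optim

# Placeholder model; the actual TERAN modules are assumed to exist elsewhere.
model = nn.Linear(1024, 1024)
optimizer = optim.Adam(model.parameters(), lr=1e-5)
# Multiplying the learning rate by 0.1 at epoch 20 turns 1e-5 into 1e-6
# for the remaining 10 epochs.
scheduler = optim.lr_scheduler.MultiStepLR(optimizer, milestones=[20], gamma=0.1)
alpha, batch_size, epochs = 0.2, 40, 30

for epoch in range(epochs):
    # ... iterate over mini-batches of `batch_size` image-sentence pairs,
    #     compute the region-word similarity matrix, pool it, apply the
    #     hard-negative hinge loss with margin `alpha`, then
    #     loss.backward() and optimizer.step() ...
    scheduler.step()
```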
7.2 Results
We compare our TERAN method against the following baselines: JGCAR [55], SAN [21], VSE++ [9],
SMAN [22], M3A-Net [20], AAMEL [57], MRNN [24], SCAN [28], SAEM [59], CASC [60], MMCA
[58], VSRN [30], PFAN [56], Full-IMRAM [4], and CAMERA [46]. We clustered these methods
based on the visual feature extractor they use: VGG, ResNet, or Region CNN (e.g., Faster-RCNN).
To have a better comparison with our method, we also annotate in the tables whether each baseline uses BERT as the textual model, or whether it uses disentangled visual-textual pipelines for efficient feature computation, as explained in Section 5. Also note that many of the listed methods report results using an ensemble of two models with different training initialization parameters, where the final similarity is obtained by averaging the scores output by each model. Hence, we also report our ensemble results, for a better comparison with these baselines. In the tables, we indicate ensemble methods by appending "(ens.)" to the method name.
We used the original implementations from their respective GitHub repositories to compute the
NDCG metrics for the baselines, where possible. When pre-trained models were missing, we were not able to produce results consistent with the original papers; in these cases, we do not report the NDCG metrics ("-").
On both the 1K and 5K test sets, our novel TERAN approach reaches state-of-the-art results
on almost all the metrics. Concerning the results reported in Table 1 for the 1K test set, the best-performing TERAN model is the one implementing the max-over-regions sum-over-words ($M_rS_w$) pooling method, although the model using the symmetric loss reaches comparable results. We chose the same TERAN $M_rS_w$ model to evaluate the ensemble, reaching an improvement of
5.7% and 3.5% on the Recall@1 metric on image and sentence retrieval respectively, with respect to
the best baseline using ensemble methods, which is CAMERA [46]. Notice, however, that even the
basic TERAN model without ensemble is able to surpass CAMERA in many metrics. This confirms
the power of the TERAN model despite its overall simplicity.
Table 2 reports the results for the 5K test set, which confirm the superiority of TERAN $M_rS_w$ over all the baselines also on the full test set. In this scenario, we increase the Recall@1 performance by 11.3% and 7.6% on image and sentence retrieval with respect to the CAMERA approach. On the other hand, the max-over-words sum-over-regions ($M_wS_r$) method loses around 10% on the Recall@1 metrics with respect to the best-performing TERAN non-ensemble model. In this case,
the Recall@K metric does not improve over the top results obtained by the current state-of-the-art approaches. Nevertheless, this model loses only about 1.5% in image retrieval and about 3.5% in sentence retrieval as far as the SPICE NDCG metric is concerned, reaching perfectly
comparable results with our state-of-the-art method. In light of these results, we deduce that the
$M_wS_r$ model is not so effective in retrieving the perfect-matching elements; however, it is still very
good at retrieving the relevant ones.
As far as image retrieval is concerned, in the TERN $M_wS_r$ Test and TERN $M_rS_w$ Test experiments we can see that the TERN architecture trained as in [43] performs fairly well when the similarity is computed as in the novel TERAN architecture, using the region and word outputs and not the I-CLS and T-CLS global descriptions. In particular, the use of max-over-words sum-over-regions
similarity still works quite well compared to the similarity computed through the I-CLS and T-CLS global visual and textual features, as done in TERN.
Notice instead that on the sentence retrieval task, the TERN $M_rS_w$ Test experiment obtains very low performance. This is a consequence of the fact that TERN is trained to produce global-scale image-sentence matchings, while it is never forced to produce meaningful fine-grained aligned concepts. This is further supported by the evidence that, if we visualize the region-word alignments as explained in the following Section 8.5, we obtain random word groundings on the image, meaning that the concepts in output from TERN are not sufficiently informative.
In order to better compare TERAN with our previous TERN approach, in Figure 3 we report
the validation curves for both the NDCG and Recall@1 metrics, for both methods. We can notice how the NDCG metric overfits in our previous TERN model, especially when using the SPICE metric, while Recall@1 keeps increasing. On the other hand, TERAN demonstrates better generalization
abilities on both metrics. This is a clear indication that TERAN is better able to retrieve relevant
items in the first positions, as well as exact matching elements. Instead, TERN is more prone to
overfitting to the SPICE metric, meaning that at a certain point in training, the network still searches
Fig. 3. Validation metrics on sentence-to-image retrieval, measured during the training phase, for the average-
over-sentences scenario. TERN overfits on the NDCG metrics, while Recall@1 still improves. TERAN instead
generalizes better on both metrics.
for the top matching element, but with a tendency to push away possible relevant results compared
to the novel TERAN approach.
However, looking at the results from the TERN w. Align experiment, we can notice that by
augmenting the TERN objective with the TERAN alignment loss, we can slightly increase the TERN
overall performance. This confirms that a more precise and meaningful region-word alignment has
a visible effect also on the quality of the fixed-sized global embeddings produced by TERN.
In Table 3 we report the results on the Flickr30k dataset. Our single-model TERAN $M_rS_w$ method outperforms the best baseline (CAMERA) on the image retrieval task while approaching the single-model CAMERA performance on the sentence retrieval task. Nevertheless, even on Flickr30k our TERAN $M_rS_w$ method with model ensemble obtains state-of-the-art results with respect to all
the baselines on all the metrics, gaining 4.6% and 1.5% on the Recall@1 metric on the image and
sentence retrieval tasks respectively.
On the MS-COCO dataset, our system, powered by a single GTX 1080Ti, can compute a single image-to-sentence query in ∼0.12 s on the 5k sentences of the test split; in the sentence-to-image scenario, it can produce scores and rank the 1K images in ∼0.02 s. These timings allow the TERAN scores to be effectively used, for example, in a re-ranking phase, where the first 1k images or 5k sentences have been previously retrieved using a faster descriptor (e.g., the one from TERN).
first position (a dog sitting on a bench on the upper query, and Pennsylvania Avenue, uniquely
identifiable by the street sign, on the lower query). In this case, the Recall@1 metric also succeeds,
given that the query captions are very selective. The third example, instead, evidences a failure
case where the model cannot deal with very subtle details. The (only) correct result is ranked 6th in
this case; in the first ranking positions, the model can find images with a vase used as a centerpiece,
but the table is not often visible, and when it is visible, it is not in the corner of the room.
Fig. 4. Example of image retrieval results for a couple of flexible query captions. These are common examples
where NDCG succeeds over the Recall@K metric. The ground-truth matching image is not among the very
first positions; however, the top-ranked images are also visually very relevant.
Query: Table situated in corner of room with a vase for a center piece.
Fig. 5. Example of image retrieval results for a couple of very specific query captions.
8 ABLATION STUDY
8.1 The Effect of Weight Sharing
We tried to apply weight sharing for the last 2 layers of the TERAN architecture, those after the
linear projection to the 1024-d space. Weight sharing is used to reduce the size of the network
and enforce a structure able to perform common reasoning on the high-level concepts, possibly
reducing the overfitting and increasing the stability of the whole network. We experimented with
the effects of weight sharing on the MS-COCO dataset with the 1K test set, for both the max-over-words
sum-over-regions and the max-over-regions sum-over-words scenarios.
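A minimal sketch of this weight-sharing variant, assuming PyTorch, is shown below: the two TE layers that follow the linear projection to the 1024-d common space become a single module applied to both modalities. Layer sizes mirror the setup of Section 7; the number of attention heads is an assumption.

```python
import torch
from torch import nn

# One shared stack of 2 TE layers processes both the projected region
# features and the projected word features (instead of two separate stacks).
shared_layer = nn.TransformerEncoderLayer(d_model=1024, nhead=4,
                                          dim_feedforward=2048, dropout=0.1)
shared_tes = nn.TransformerEncoder(shared_layer, num_layers=2)

regions = torch.randn(36, 8, 1024)   # (set size, batch, dim) region features
words = torch.randn(20, 8, 1024)     # word features in the same common space

region_concepts = shared_tes(regions)   # the same weights are applied
word_concepts = shared_tes(words)       # to both modalities
```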
Results are shown in the 2nd and 6th rows of Table 4. It can be noticed that the values are
perfectly comparable with the TERAN results reported in Table 1, suggesting that at this point in
the network the abstraction is high enough that concepts coming from images and sentences can
be processed in the exact same way. This result shows that vectors at this stage have been freed
from any modality bias and they are fully comparable in the same representation space.
Also, in the max-over-words sum-over-regions scenario (6th row), there is a small gain both in
terms of Recall@K and NDCG. This confirms the slight regularization effect of the weight sharing
approach.
Fig. 6. Validation metrics measured during the training phase, for the average-over-sentences scenario. This
model overfits on the NDCG metrics on the image-retrieval task, while Recall@1 still improves.
If we compute the average instead of the sum in the max-over-regions sum-over-words scenario, the final similarity score between the image and the sentence is no longer dependent on the number
of concepts from the textual pipeline: the similarities are averaged and not accumulated.
In the 3rd row of Table 4 we can notice that by averaging we lose an important amount of information with respect to the max-over-regions sum-over-words scenario (1st row). This insight
suggests that the complexity of the query is beneficial for achieving high-quality matching.
Another side effect of using the average instead of the sum is the premature and clear overfitting on the NDCG metrics as far as image retrieval is concerned. The effect is shown in Figure 6. The clear
overfitting of the NDCG metrics resembles the training curve trajectories of TERN (Figure 3). This
result demonstrates that although this model can correctly perform exact matching, it is pulling
away relevant results from the head of the ranked list of images, during the validation phase.
Table 4. Results for the ablation study experiments. We organize the methods in the table by clustering them by pooling method, for easier comparison (max-over-regions methods in the upper part and max-over-words methods in the lower part). In the first row of both sections we report the TERAN results from Table 1.
Experiments are computed on the MS-COCO dataset, 1K test set.
Fig. 7. Visualization of the word-region alignments. Near each word, we report the cosine similarity computed
between that word and the top-relevant image region associated with it. We slightly offset the overlapping
bounding-boxes for a better visualization.
We can notice some wrong word groundings in the images, such as the phrase "eyes closed" being associated with the region depicting the closed mouth. Here the error seems to lie in a localized misunderstanding of the scene (the noun "eyes" has probably been misinterpreted, since the mouth and the eyes are both closed). Overall, however, complex scenes are correctly broken down into their salient elements, and only the key regions are attended.
9 CONCLUSIONS
In this work, we introduced the Transformer Encoder Reasoning and Alignment Network (TERAN).
TERAN is a relationship-aware architecture based on the Transformer Encoder (TE), exploiting self-attention mechanisms to reason about the spatial and abstract relationships between elements in the image and in the text separately.
Differently from TERN [43], TERAN forces a fine-grained alignment among the region and word
features without any supervision at this level. We demonstrated that by enforcing this fine-grained
word-region alignment at training time we can obtain state-of-the-art results on the popular MS-
COCO and Flickr30K datasets. Besides, thanks to the overall simplicity of the proposed model, we
can obtain effective visual and textual features for use in scalable retrieval setups.
We measured the performance of our TERAN architecture in the context of cross-modal retrieval using both the commonly used Recall@K metric and the newly introduced NDCG with the ROUGE-L and SPICE textual relevance measures. In spite of its simplicity, TERAN can outperform current state-of-the-art models on these two retrieval metrics, competing with the currently very effective entangled visual-textual matching models, which on the contrary are not able to produce features for scalable retrieval. Furthermore, we showed that TERAN can successfully output visually pleasing word-region alignments. We also observed that a further reduction of the network complexity can be obtained by sharing the weights of the last TE layers. This also has important benefits for the stability and generalization abilities of the whole architecture.
In conclusion, we think that this work proposes an interesting path towards efficient and effective cross-modal information retrieval.
ACKNOWLEDGMENTS
This work was partially supported by “Intelligenza Artificiale per il Monitoraggio Visuale dei Siti Culturali” (AI4CHSites), CNR4C program, CUP B15J19001040004, by the AI4EU project, funded by the EC (H2020, Contract n. 825619), and by AI4Media under GA 951911.
REFERENCES
[1] Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. 2016. Spice: Semantic propositional image
caption evaluation. In European Conference on Computer Vision. Springer, 382–398.
[2] Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. 2018.
Bottom-up and top-down attention for image captioning and visual question answering. In Proc. of the IEEE conference
on computer vision and pattern recognition. 6077–6086.
[3] Fabio Carrara, Andrea Esuli, Tiziano Fagni, Fabrizio Falchi, and Alejandro Moreo. 2018. Picture it in your mind:
Generating high level visual representations from textual descriptions. Information Retrieval J. 21, 2-3 (2018), 208–229.
[4] Hui Chen, Guiguang Ding, Xudong Liu, Zijia Lin, Ji Liu, and Jungong Han. 2020. IMRAM: Iterative Matching with
Recurrent Attention Memory for Cross-Modal Image-Text Retrieval. In Proc. of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition. 12655–12663.
[5] Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. 2019.
Uniter: Learning universal image-text representations. arXiv preprint arXiv:1909.11740 (2019).
[6] Marcella Cornia, Lorenzo Baraldi, and Rita Cucchiara. 2019. Show, control and tell: A framework for generating
controllable and grounded captions. In Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition. 8307–8316.
[7] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional
Transformers for Language Understanding. In NAACL-HLT 2019. Association for Computational Linguistics, 4171–4186.
[8] Aviv Eisenschtat and Lior Wolf. 2017. Linking image and text with 2-way nets. In Proc. of the IEEE conference on
computer vision and pattern recognition. 4601–4611.
[9] Fartash Faghri, David J. Fleet, Jamie Ryan Kiros, and Sanja Fidler. 2018. VSE++: Improving Visual-Semantic Embeddings
with Hard Negatives. In BMVC 2018. BMVA Press, 12.
[10] Jiuxiang Gu, Jianfei Cai, Shafiq R Joty, Li Niu, and Gang Wang. 2018. Look, imagine and match: Improving textual-visual
cross-modal retrieval with generative models. In Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition.
7181–7189.
[11] Yawen Guo, Hui Yuan, and Kun Zhang. 2020. Associating Images with Sentences Using Recurrent Canonical Correlation
Analysis. Applied Sciences 10, 16 (2020), 5516.
[12] Ronghang Hu, Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Kate Saenko. 2017. Learning to reason: End-to-end
module networks for visual question answering. In Proc. of the IEEE International Conference on Computer Vision.
804–813.
[13] Feiran Huang, Xiaoming Zhang, Zhonghua Zhao, and Zhoujun Li. 2018. Bi-directional spatial-semantic attention
networks for image-text matching. IEEE Transactions on Image Processing 28, 4 (2018), 2008–2020.
[14] Lun Huang, Wenmin Wang, Jie Chen, and Xiao-Yong Wei. 2019. Attention on attention for image captioning. In Proc.
of the IEEE International Conference on Computer Vision. 4634–4643.
[15] Yan Huang and Liang Wang. 2019. Acmm: Aligned cross-modal memory for few-shot image and sentence matching.
In Proc. of the IEEE International Conference on Computer Vision. 5774–5783.
[16] Yan Huang, Wei Wang, and Liang Wang. 2017. Instance-aware image and sentence matching with selective multimodal
lstm. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition. 2310–2318.
[17] Yan Huang, Qi Wu, Chunfeng Song, and Liang Wang. 2018. Learning semantic concepts and order for image and
sentence matching. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition. 6163–6171.
[18] Yan Huang, Qi Wu, Wei Wang, and Liang Wang. 2018. Image and sentence matching via semantic concepts and order
learning. IEEE transactions on pattern analysis and machine intelligence (2018).
[19] Zhicheng Huang, Zhaoyang Zeng, Bei Liu, Dongmei Fu, and Jianlong Fu. 2020. Pixel-BERT: Aligning Image Pixels
with Text by Deep Multi-Modal Transformers. arXiv preprint arXiv:2004.00849 (2020).
[20] Zhong Ji, Zhigang Lin, Haoran Wang, and Yuqing He. 2020. Multi-Modal Memory Enhancement Attention Network
for Image-Text Matching. IEEE Access 8 (2020), 38438–38447.
[21] Zhong Ji, Haoran Wang, Jungong Han, and Yanwei Pang. 2019. Saliency-guided attention network for image-sentence
matching. In Proc. of the IEEE International Conference on Computer Vision. 5754–5763.
[22] Zhong Ji, Haoran Wang, Jungong Han, and Yanwei Pang. 2020. SMAN: Stacked Multimodal Attention Network for
Cross-Modal Image-Text Retrieval. IEEE Transactions on Cybernetics (2020).
[23] Justin Johnson, Bharath Hariharan, Laurens Van Der Maaten, Judy Hoffman, Li Fei-Fei, C Lawrence Zitnick, and Ross
Girshick. 2017. Inferring and executing programs for visual reasoning. In Proc. of the IEEE International Conference on
Computer Vision. 2989–2998.
[24] Andrej Karpathy and Li Fei-Fei. 2015. Deep visual-semantic alignments for generating image descriptions. In Proc. of
the IEEE conference on computer vision and pattern recognition. 3128–3137.
[25] Alex Kendall, Yarin Gal, and Roberto Cipolla. 2018. Multi-task learning using uncertainty to weigh losses for scene
geometry and semantics. In Proc. of the IEEE conference on computer vision and pattern recognition. 7482–7491.
[26] Benjamin Klein, Guy Lev, Gil Sadeh, and Lior Wolf. 2015. Associating neural word embeddings with deep image
representations using fisher vectors. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition.
4437–4446.
[27] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis,
Li-Jia Li, David A Shamma, et al. 2017. Visual genome: Connecting language and vision using crowdsourced dense
image annotations. International journal of computer vision 123, 1 (2017), 32–73.
[28] Kuang-Huei Lee, Xi Chen, Gang Hua, Houdong Hu, and Xiaodong He. 2018. Stacked cross attention for image-text
matching. In Proc. of the European Conference on Computer Vision (ECCV). 201–216.
[29] Kuang-Huei Lee, Hamid Palangi, Xi Chen, Houdong Hu, and Jianfeng Gao. 2019. Learning visual relation priors for
image-text matching and image captioning with neural scene graph generators. arXiv preprint arXiv:1909.09953 (2019).
[30] Kunpeng Li, Yulun Zhang, Kai Li, Yuanyuan Li, and Yun Fu. 2019. Visual semantic reasoning for image-text matching.
In Proc. of the IEEE International Conference on Computer Vision. 4654–4662.
[31] Xiangyang Li and Shuqiang Jiang. 2019. Know more say less: Image captioning based on scene graphs. IEEE Transactions
on Multimedia 21, 8 (2019), 2117–2130.
[32] Yikang Li, Wanli Ouyang, Bolei Zhou, Jianping Shi, Chao Zhang, and Xiaogang Wang. 2018. Factorizable net: an
efficient subgraph-based framework for scene graph generation. In Proc. of the European Conference on Computer Vision
(ECCV). 335–351.
[33] Chin-Yew Lin. 2004. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out.
74–81.
[34] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence
Zitnick. 2014. Microsoft coco: Common objects in context. In European conference on computer vision. Springer,
740–755.
[35] Xiao Lin and Devi Parikh. 2016. Leveraging visual question answering for image-caption ranking. In European
Conference on Computer Vision. Springer, 261–277.
[36] Chunxiao Liu, Zhendong Mao, An-An Liu, Tianzhu Zhang, Bin Wang, and Yongdong Zhang. 2019. Focus your attention:
A bidirectional focal attention network for image-text matching. In Proc. of the 27th ACM International Conference on
Multimedia. 3–11.
[37] Chunxiao Liu, Zhendong Mao, Tianzhu Zhang, Hongtao Xie, Bin Wang, and Yongdong Zhang. 2020. Graph Structured
Network for Image-Text Matching. In Proc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
10921–10930.
[38] Yu Liu, Yanming Guo, Erwin M Bakker, and Michael S Lew. 2017. Learning a recurrent residual fusion network for
multimodal matching. In Proc. of the IEEE International Conference on Computer Vision. 4107–4116.
[39] Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. 2019. Vilbert: Pretraining task-agnostic visiolinguistic representa-
tions for vision-and-language tasks. In Advances in Neural Information Processing Systems. 13–23.
[40] Sean MacAvaney, Franco Maria Nardini, Raffaele Perego, Nicola Tonellotto, Nazli Goharian, and Ophir Frieder.
2020. Efficient Document Re-Ranking for Transformers by Precomputing Term Representations. arXiv preprint
arXiv:2004.14255 (2020).
[41] Nicola Messina, Giuseppe Amato, Fabio Carrara, Fabrizio Falchi, and Claudio Gennaro. 2018. Learning relationship-
aware visual features. In Proc. of the European Conference on Computer Vision (ECCV). 0–0.
[42] Nicola Messina, Giuseppe Amato, Fabio Carrara, Fabrizio Falchi, and Claudio Gennaro. 2019. Learning visual features
for relational CBIR. International Journal of Multimedia Information Retrieval (2019), 1–12.
[43] Nicola Messina, Fabrizio Falchi, Andrea Esuli, and Giuseppe Amato. 2020. Transformer Reasoning Network for
Image-Text Matching and Retrieval. In International Conference on Pattern Recognition (ICPR) 2020 (Accepted).
[44] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient Estimation of Word Representations in
Vector Space. In 1st International Conference on Learning Representations, ICLR 2013.
[45] Di Qi, Lin Su, Jia Song, Edward Cui, Taroon Bharti, and Arun Sacheti. 2020. Imagebert: Cross-modal pre-training with
large-scale weak-supervised image-text data. arXiv preprint arXiv:2001.07966 (2020).
[46] Leigang Qu, Meng Liu, Da Cao, Liqiang Nie, and Qi Tian. 2020. Context-Aware Multi-View Summarization Network
for Image-Text Matching. In Proc. of the 28th ACM International Conference on Multimedia. 1047–1055.
[47] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster r-cnn: Towards real-time object detection with
region proposal networks. In Advances in neural information processing systems. 91–99.
[48] Steven J Rennie, Etienne Marcheret, Youssef Mroueh, Jerret Ross, and Vaibhava Goel. 2017. Self-critical sequence
training for image captioning. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition. 7008–7024.
[49] Adam Santoro, David Raposo, David G Barrett, Mateusz Malinowski, Razvan Pascanu, Peter Battaglia, and Timothy Lillicrap. 2017. A simple neural network module for relational reasoning. In Advances in neural information processing systems. 4967–4976.
[50] Nikolaos Sarafianos, Xiang Xu, and Ioannis A Kakadiaris. 2019. Adversarial representation learning for text-to-image
matching. In Proc. of the IEEE International Conference on Computer Vision. 5814–5824.
[51] Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu Wei, and Jifeng Dai. 2020. VL-BERT: Pre-training of Generic
Visual-Linguistic Representations. In International Conference on Learning Representations.
[52] Damien Teney, Lingqiao Liu, and Anton van Den Hengel. 2017. Graph-structured representations for visual question
answering. In Proc. of the IEEE conference on computer vision and pattern recognition. 1–9.
[53] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia
Polosukhin. 2017. Attention is all you need. In Advances in neural information processing systems. 5998–6008.
[54] Ivan Vendrov, Ryan Kiros, Sanja Fidler, and Raquel Urtasun. 2016. Order-Embeddings of Images and Language. In 4th
International Conference on Learning Representations, ICLR.
[55] Shuhui Wang, Yangyu Chen, Junbao Zhuo, Qingming Huang, and Qi Tian. 2018. Joint global and co-attentive
representation learning for image-sentence retrieval. In Proc. of the 26th ACM international conference on Multimedia.
1398–1406.
[56] Yaxiong Wang, Hao Yang, Xueming Qian, Lin Ma, Jing Lu, Biao Li, and Xin Fan. 2019. Position focused attention
network for image-text matching. arXiv preprint arXiv:1907.09748 (2019).
[57] Kaimin Wei and Zhibo Zhou. 2020. Adversarial Attentive Multi-modal Embedding Learning for Image-Text Matching.
IEEE Access (2020).
[58] Xi Wei, Tianzhu Zhang, Yan Li, Yongdong Zhang, and Feng Wu. 2020. Multi-Modality Cross Attention Network
for Image and Sentence Matching. In Proc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
10941–10950.
[59] Yiling Wu, Shuhui Wang, Guoli Song, and Qingming Huang. 2019. Learning fragment self-attention embeddings for
image-text matching. In Proc. of the 27th ACM International Conference on Multimedia. 2088–2096.
[60] Xing Xu, Tan Wang, Yang Yang, Lin Zuo, Fumin Shen, and Heng Tao Shen. 2020. Cross-modal attention with semantic
consistence for image-text matching. IEEE Transactions on Neural Networks and Learning Systems (2020).
[61] Jianwei Yang, Jiasen Lu, Stefan Lee, Dhruv Batra, and Devi Parikh. 2018. Graph r-cnn for scene graph generation. In
Proc. of the European conference on computer vision (ECCV). 670–685.
[62] Xu Yang, Kaihua Tang, Hanwang Zhang, and Jianfei Cai. 2019. Auto-encoding scene graphs for image captioning. In
Proc. of the IEEE Conference on Computer Vision and Pattern Recognition. 10685–10694.
[63] Ting Yao, Yingwei Pan, Yehao Li, and Tao Mei. 2018. Exploring visual relationship for image captioning. In Proc. of the
European conference on computer vision (ECCV). 684–699.
[64] Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. 2014. From image descriptions to visual denotations:
New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational
Linguistics 2 (2014), 67–78.
[65] Luowei Zhou, Hamid Palangi, Lei Zhang, Houdong Hu, Jason J Corso, and Jianfeng Gao. 2020. Unified Vision-Language
Pre-Training for Image Captioning and VQA.. In AAAI. 13041–13049.