Visual Image Caption Generator Using Deep Learning
Grishma Sharma
Asst. Professor, Department of Computer Engineering
K.J Somaiya College Of Engineering, Mumbai
neelammotwani@somaiya.edu
Abstract - Image caption generation has always been a study of great interest to researchers in Artificial Intelligence. Being able to program a machine to accurately describe an image or an environment like an average human has major applications in the field of robotic vision, business and many more. This has been a challenging task in the field of artificial intelligence throughout the years. In this paper, we present different image caption generating models based on deep neural networks, focusing on the various RNN techniques and analysing their influence on the sentence generation. We have also generated captions for sample images and compared the different feature extraction and encoder models to analyse which model gives better accuracy and generates the desired results.

Keywords - CNN, RNN, LSTM, VGG, GRU, Encoder-Decoder.

I. INTRODUCTION

Generating accurate captions for an image has remained one of the major challenges in Artificial Intelligence, with plenty of applications ranging from robotic vision to helping the visually impaired. Long term applications also involve providing accurate captions for videos in scenarios such as security systems. "Image caption generator": the name itself suggests that we aim to build an optimal system which can generate semantically and grammatically accurate captions for an image. Researchers have been looking for an efficient way to make better predictions, so we have discussed a few methods to achieve good results. We have used deep neural networks and machine learning techniques to build a good model. We have used the Flickr8k dataset, which contains around 8000 sample images with five captions for each image. There are two phases: feature extraction from the image using Convolutional Neural Networks (CNN) and generating sentences in natural language based on the image using Recurrent Neural Networks (RNN). For the first phase, rather than just detecting the objects present in the image, we have used a different approach of extracting features of an image which gives us details of even the slightest difference between two similar images. We have used VGG-16 (Visual Geometry Group), a 16-layer convolutional network used for object recognition. For the second phase, we need to train our features with the captions provided in the dataset. We are using two architectures, LSTM (Long Short Term Memory) and GRU (Gated Recurrent Unit), for framing our sentences from the given input images. To get an estimate of which architecture is better, we have used the BLEU (Bilingual Evaluation Understudy) score to compare the performance of LSTM and GRU.
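For concreteness, the following is a minimal sketch of how such a BLEU comparison can be computed with the NLTK library; the reference and candidate captions shown here are hypothetical placeholders rather than outputs of our models.

# A minimal sketch of BLEU-based comparison using NLTK (assumed available);
# the captions below are hypothetical placeholders, not model outputs.
from nltk.translate.bleu_score import corpus_bleu

# Human-written reference captions per image (tokenized), as in Flickr8k.
references = [
    [["a", "dog", "runs", "through", "the", "grass"],
     ["a", "brown", "dog", "is", "running", "outside"]],
]
# One generated caption per image from the model under evaluation.
candidates = [["a", "dog", "is", "running", "in", "the", "grass"]]

# BLEU-1 to BLEU-4, the scores used to compare VGG+LSTM and VGG+GRU.
print("BLEU-1: %f" % corpus_bleu(references, candidates, weights=(1.0, 0, 0, 0)))
print("BLEU-2: %f" % corpus_bleu(references, candidates, weights=(0.5, 0.5, 0, 0)))
print("BLEU-3: %f" % corpus_bleu(references, candidates, weights=(0.33, 0.33, 0.33, 0)))
print("BLEU-4: %f" % corpus_bleu(references, candidates, weights=(0.25, 0.25, 0.25, 0.25)))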
II. LITERATURE SURVEY

There have been several attempts at providing a solution to this problem, including template-based solutions which used image classification, i.e. assigning labels to objects from a fixed set of classes and inserting them into a sample template sentence. But more recent work has focused on Recurrent Neural Networks [2,5]. RNNs are already quite popular for several Natural Language Processing tasks such as machine translation, where a sequence of words is generated. An image caption generator extends the same application by generating a description for an image word by word.

Computer vision reads an image as a two-dimensional array. Therefore, Venugopalan et al. [9] have described image captioning as a language translation problem. Previously, language translation was complicated and included several different tasks, but recent work [10] has shown that the task can be achieved in a much more efficient way using Recurrent Neural Networks. However, regular RNNs suffer from the vanishing gradient problem, which was vital in the case of our application. The solution to this problem is to use LSTMs and GRUs, which contain internal mechanisms and logic gates that retain information for a longer time and pass on only useful information.

One of the major challenges we faced was choosing the right model for the caption generation network. In their research paper, Tanti et al. [8] have classified the generative models into two kinds: inject and merge architectures. In the former, we input both the tokenized captions and the image vectors to an RNN block, whereas in the latter, we input only the captions to the RNN block and merge the output with the image. Although the experiments show that there is not much difference in the accuracy of the two models, we decided to go with the merge architecture for the simplicity of its design, leading to a reduction in the hidden states and faster training. Also, since the images are not passed iteratively through the RNN network, it makes better use of the RNN memory.
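To make the merge design concrete, the following is a minimal sketch of a merge-style model in Keras. The 256-unit layers, the 0.5 dropout and the caption length of 34 follow the dimensions described in the Methodology section; the vocabulary size and the final decoder layers are illustrative assumptions, since the Decoder Model subsection is not reproduced here.

# A minimal sketch of the merge architecture in Keras; vocab_size and the
# decoder layers are assumptions, the remaining sizes follow the Methodology.
from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, add
from tensorflow.keras.models import Model

vocab_size = 7579   # assumed Flickr8k vocabulary size after tokenization
max_length = 34     # longest caption length in the dataset

# Feature extraction branch: pre-computed 4096-d VGG16 features -> 256-d vector.
image_input = Input(shape=(4096,))
image_drop = Dropout(0.5)(image_input)
image_dense = Dense(256, activation='relu')(image_drop)

# Encoder branch: tokenized caption -> embedding -> LSTM (or GRU) -> 256-d vector.
caption_input = Input(shape=(max_length,))
caption_embed = Embedding(vocab_size, 256, mask_zero=True)(caption_input)
caption_drop = Dropout(0.5)(caption_embed)
caption_lstm = LSTM(256)(caption_drop)

# Merge: the image is combined with the RNN output instead of being fed into it.
merged = add([image_dense, caption_lstm])
decoder = Dense(256, activation='relu')(merged)
output = Dense(vocab_size, activation='softmax')(decoder)

model = Model(inputs=[image_input, caption_input], outputs=output)
model.compile(loss='categorical_crossentropy', optimizer='adam')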
III. METHODOLOGY

The complete system is a combination of three models which together perform the whole procedure of generating a caption description from an image. The models are (a) Feature Extraction Model, (b) Encoder Model and (c) Decoder Model.

A. Feature Extraction Model

This model is primarily responsible for acquiring features from an image for training. When training begins, the features of the images are input to this model. The model uses a VGG16 architecture, as seen in Fig. 1, to efficiently extract the features from the images using a combination of multiple 3*3 convolution layers and max pooling layers. The output of a VGG16 network is a vector of size 1*4096, which is used to represent the features of an image.

A dropout layer is added to the model with a value of 0.5 to reduce overfitting. An optimal value lies between 0.5 and 0.8 and indicates the probability at which the outputs of the layer are dropped out. A dense layer is added after the dropout layer, which applies the activation function to the dot product of the input and the kernel plus a bias. The activation function used is ReLU (Rectified Linear Units) and the size of the output space is specified as 256. These vectors of size 256 are the output of the feature extraction model and will then be used in the decoder model.
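As an illustration, a minimal sketch of how the 1*4096 feature vectors can be extracted with VGG16 in Keras; the image folder name is a hypothetical placeholder.

# A minimal sketch of extracting 1*4096 image features with VGG16 in Keras;
# the directory path is a hypothetical placeholder.
import os
import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.preprocessing.image import load_img, img_to_array
from tensorflow.keras.models import Model

# Keep VGG16 up to its second-to-last layer so that each image is represented
# by the 4096-dimensional activations of the last fully connected layer.
vgg = VGG16()
vgg = Model(inputs=vgg.inputs, outputs=vgg.layers[-2].output)

features = {}
image_dir = 'Flickr8k_Dataset'                      # hypothetical folder of images
for name in os.listdir(image_dir):
    image = load_img(os.path.join(image_dir, name), target_size=(224, 224))
    image = preprocess_input(np.expand_dims(img_to_array(image), axis=0))
    features[name.split('.')[0]] = vgg.predict(image, verbose=0)  # shape (1, 4096)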
B. Encoder Model

The encoder model, as seen in Fig. 2, is primarily responsible for processing the captions of each image fed in while training. The output of the encoder model is again a vector of size 1*256, which in turn becomes an input to the decoder sequences.

Initially, the captions present with each image are tokenized, i.e. the words in the sentences are converted to integers so that the neural network can process them efficiently. The tokenized captions are padded so that their length is equal to the size of the longest sentence and all the sentences can be processed at an equal length.

Then an Embedding layer is attached to embed the tokenized captions into fixed dense vectors with an output space of 256 by 34; 34 is chosen because the maximum number of words in any caption of the Flickr8k dataset is 34. These vectors further ease the processing by providing a convenient way of representing the words in the vector space. A dropout layer is attached, again with a probability of 0.5, to reduce overfitting in the model.

The most important part of the encoder model is the LSTM (Long Short Term Memory) layer. This layer helps the model learn how to generate valid sentences, i.e. how to generate the word with the highest probability of occurrence after a specific word is encountered. The activation function used is ReLU and the output space defined is 256.

For comparison between the complete models VGG+GRU and VGG+LSTM, this particular layer is replaced by a GRU (Gated Recurrent Units) layer and the results are analysed for the same. The output space for the GRU layer is the same, i.e. 256. Thus the only major difference between the two models lies in the encoder part. The output of the LSTM (or GRU) layer is the output of the encoder model.
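A minimal sketch of the caption-side processing described above, assuming the Keras preprocessing utilities; the sample captions are hypothetical placeholders.

# A minimal sketch of tokenizing, padding and encoding the captions;
# the caption strings below are placeholders, not Flickr8k data.
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Input, Embedding, Dropout, LSTM, GRU

captions = ["a dog runs through the grass", "a child plays in the park"]

# Tokenize: map every word to an integer so the network can process it.
tokenizer = Tokenizer()
tokenizer.fit_on_texts(captions)
sequences = tokenizer.texts_to_sequences(captions)

# Pad every sequence to the longest caption length (34 for Flickr8k).
max_length = 34
padded = pad_sequences(sequences, maxlen=max_length, padding='post')

# Encoder stack: Embedding (output space 256, length 34) -> Dropout(0.5) -> LSTM(256).
vocab_size = len(tokenizer.word_index) + 1
caption_input = Input(shape=(max_length,))
x = Embedding(vocab_size, 256, mask_zero=True)(caption_input)
x = Dropout(0.5)(x)
encoded = LSTM(256)(x)   # replacing this line with GRU(256)(x) gives the VGG+GRU variant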
IV. ANALYSIS
Fig. 4 and Fig. 5 show some example images on which the testing was done. We have tested various images with both methods, i.e. VGG+LSTM and VGG+GRU. The training of the models was done on Google Colab, which provides a 1x Tesla K80 GPU with 12GB GDDR5 VRAM, and took approximately 13 minutes per epoch for LSTM and 10 minutes per epoch for GRU. This difference is due to the smaller number of operations performed in a GRU compared to an LSTM. While the loss calculated for LSTM was lower than that for GRU, the user can prefer either model according to his need: the one with maximum accuracy or the one which takes less time to process. GRUs generally train faster on less training data than LSTMs and are simpler and easier to modify.
V. CONCLUSION

We have presented a deep learning model that automatically generates image captions with the goal of not only describing the surrounding environment but also helping visually impaired people better understand their environments. Our described model is based upon a CNN feature extraction model that encodes an image into a vector representation, followed by an RNN decoder model that generates corresponding sentences based on the image features learned. We have compared various encoder-decoder models to see how each component influences the caption generation and have also demonstrated various use cases of our system. The results show that the LSTM model generally works slightly better than GRU, although it takes more time for training and sentence generation due to its complexity. The performance is also expected to increase when using a bigger dataset and training on a larger number of images. Because of the considerable accuracy of the generated image captions, visually impaired people can greatly benefit and get a better sense of their surroundings using the text-to-speech technology that we have incorporated as well.
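As an illustration of the text-to-speech step mentioned above, a minimal sketch using the gTTS library; gTTS is an assumed choice, since the paper does not name the text-to-speech tool used, and the caption is a hypothetical placeholder.

# A minimal sketch of converting a generated caption to speech; gTTS is an
# assumed library choice and the caption string is a placeholder.
from gtts import gTTS

caption = "a dog is running through the grass"   # hypothetical generated caption
gTTS(text=caption, lang='en').save('caption.mp3')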
VI. FUTURE WORK

Our model is not perfect and may sometimes generate incorrect captions. In the next phase, we will be developing models which use Inceptionv3 instead of VGG as the feature extractor. Then we will compare the 4 models thus obtained, i.e. VGG+GRU, VGG+LSTM, Inceptionv3+GRU and Inceptionv3+LSTM. This will further help us analyse the influence of the CNN component on the entire network.

Currently, we are using a greedy approach for generating the next word in the sequence by selecting the one with the maximum probability. Beam search instead keeps a group of words with the maximum likelihood and searches through all those sequences in parallel. This approach, sketched below, might help us increase the accuracy of our predictions.
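The following is a minimal sketch of the beam search idea contrasted with the greedy step; predict_next, the startseq/endseq markers and the beam width are hypothetical stand-ins, not part of our published implementation.

# A minimal sketch contrasting greedy decoding with beam search;
# predict_next(seq) is a hypothetical stand-in that returns a
# {word: probability} map for the next word given the sequence so far.
import math

def greedy_step(seq, predict_next):
    # Greedy: keep only the single most probable next word.
    probs = predict_next(seq)
    word = max(probs, key=probs.get)
    return seq + [word]

def beam_search(predict_next, beam_width=3, max_length=34):
    # Beam search: keep the beam_width best partial captions at every step.
    beams = [(["startseq"], 0.0)]          # (sequence, log-probability)
    for _ in range(max_length):
        candidates = []
        for seq, score in beams:
            if seq[-1] == "endseq":
                candidates.append((seq, score))
                continue
            probs = predict_next(seq)
            for word, p in probs.items():
                candidates.append((seq + [word], score + math.log(p)))
        # Retain only the beam_width highest-scoring sequences.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams[0][0]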
Our model is trained on the Flickr 8K dataset which
REFERENCES