Visual Image Caption Generator Using Deep Learning
Grishma Sharma
Asst. Professor, Department of Computer Engineering
K.J Somaiya College Of Engineering, Mumbai
neelammotwani@somaiya.edu
Abstract - Image caption generation has always been a study of great interest to researchers in Artificial Intelligence. Being able to program a machine to accurately describe an image or an environment like an average human has major applications in the field of robotic vision, business and many more. This has been a challenging task in the field of artificial intelligence throughout the years. In this paper, we present different image caption generating models based on deep neural networks, focusing on the various RNN techniques and analysing their influence on the sentence generation. We have also generated captions for sample images and compared the different feature extraction and encoder models to analyse which model gives better accuracy and generates the desired results.

Keywords - CNN, RNN, LSTM, VGG, GRU, Encoder-Decoder.

I. INTRODUCTION

Generating accurate captions for an image has remained one of the major challenges in Artificial Intelligence, with plenty of applications ranging from robotic vision to helping the visually impaired. Long term applications also involve providing accurate captions for videos in scenarios such as security systems. "Image caption generator": the name itself suggests that we aim to build an optimal system which can generate semantically and grammatically accurate captions for an image. Researchers have been looking for an efficient way to make better predictions, so we have discussed a few methods to achieve good results. We have used deep neural networks and machine learning techniques to build a good model. We have used the Flickr8k dataset, which contains around 8000 sample images with five captions for each image. There are two phases: feature extraction from the image using Convolutional Neural Networks (CNN) and generating sentences in natural language based on the image using Recurrent Neural Networks (RNN). For the first phase, rather than just detecting the objects present in the image, we have used a different approach of extracting features of an image which gives us details of even the slightest difference between two similar images. We have used VGG-16 (Visual Geometry Group), a 16-layer convolutional network used for object recognition. For the second phase, we need to train our features with the captions provided in the dataset. We are using two architectures, LSTM (Long Short Term Memory) and GRU (Gated Recurrent Unit), for framing our sentences from the given input images. To get an estimate of which architecture is better, we have used the BLEU (Bilingual Evaluation Understudy) score to compare the performance of LSTM and GRU.
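For concreteness, the following is a minimal sketch of how such a BLEU comparison can be computed with the NLTK library; the reference and candidate captions shown here are hypothetical placeholders rather than outputs of our models.

# A minimal sketch of BLEU-based comparison using NLTK (assumed available);
# the captions below are hypothetical placeholders, not model outputs.
from nltk.translate.bleu_score import corpus_bleu

# Human-written reference captions per image (tokenized), as in Flickr8k.
references = [
    [["a", "dog", "runs", "through", "the", "grass"],
     ["a", "brown", "dog", "is", "running", "outside"]],
]
# One generated caption per image from the model under evaluation.
candidates = [["a", "dog", "is", "running", "in", "the", "grass"]]

# BLEU-1 to BLEU-4, the scores used to compare VGG+LSTM and VGG+GRU.
print("BLEU-1: %f" % corpus_bleu(references, candidates, weights=(1.0, 0, 0, 0)))
print("BLEU-2: %f" % corpus_bleu(references, candidates, weights=(0.5, 0.5, 0, 0)))
print("BLEU-3: %f" % corpus_bleu(references, candidates, weights=(0.33, 0.33, 0.33, 0)))
print("BLEU-4: %f" % corpus_bleu(references, candidates, weights=(0.25, 0.25, 0.25, 0.25)))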
II. LITERATURE SURVEY

There have been several attempts at providing a solution to this problem, including template-based solutions which used image classification, i.e. assigning labels to objects from a fixed set of classes and inserting them into a sample template sentence. But more recent work has focused on Recurrent Neural Networks [2,5]. RNNs are already quite popular for several Natural Language Processing tasks such as machine translation, where a sequence of words is generated. An image caption generator extends the same application by generating a description for an image word by word.

Computer vision reads an image as a two-dimensional array. Therefore, Venugopalan et al. [9] have described image captioning as a language translation problem. Previously, language translation was complicated and included several different tasks, but recent work [10] has shown that the task can be achieved in a much more efficient way using Recurrent Neural Networks. However, regular RNNs suffer from the vanishing gradient problem, which was vital in the case of our application. The solution to this problem is to use LSTMs and GRUs, which contain internal mechanisms and logic gates that retain information for a longer time and pass on only useful information.

One of the major challenges we faced was choosing the right model for the caption generation network. In their research paper, Tanti et al. [8] have classified the generative models into two kinds: inject and merge architectures. In the former, we input both the tokenized captions and the image vectors to an RNN block, whereas in the latter, we input only the captions to the RNN block and merge the output with the image. Although the experiments show that there is not much difference in the accuracy of the two models, we decided to go with the merge architecture for the simplicity of its design, leading to a reduction in the hidden states and faster training. Also, since the images are not passed iteratively through the RNN network, it makes better use of the RNN memory.
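To make the merge design concrete, the following is a minimal sketch of a merge-style model in Keras. The 256-unit layers, the 0.5 dropout and the caption length of 34 follow the dimensions described in the Methodology section; the vocabulary size and the final decoder layers are illustrative assumptions, since the Decoder Model subsection is not reproduced here.

# A minimal sketch of the merge architecture in Keras; vocab_size and the
# decoder layers are assumptions, the remaining sizes follow the Methodology.
from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, add
from tensorflow.keras.models import Model

vocab_size = 7579   # assumed Flickr8k vocabulary size after tokenization
max_length = 34     # longest caption length in the dataset

# Feature extraction branch: pre-computed 4096-d VGG16 features -> 256-d vector.
image_input = Input(shape=(4096,))
image_drop = Dropout(0.5)(image_input)
image_dense = Dense(256, activation='relu')(image_drop)

# Encoder branch: tokenized caption -> embedding -> LSTM (or GRU) -> 256-d vector.
caption_input = Input(shape=(max_length,))
caption_embed = Embedding(vocab_size, 256, mask_zero=True)(caption_input)
caption_drop = Dropout(0.5)(caption_embed)
caption_lstm = LSTM(256)(caption_drop)

# Merge: the image is combined with the RNN output instead of being fed into it.
merged = add([image_dense, caption_lstm])
decoder = Dense(256, activation='relu')(merged)
output = Dense(vocab_size, activation='softmax')(decoder)

model = Model(inputs=[image_input, caption_input], outputs=output)
model.compile(loss='categorical_crossentropy', optimizer='adam')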
III. METHODOLOGY

The complete system is a combination of three models which together perform the whole procedure of generating a caption description from an image. The models are (a) Feature Extraction Model, (b) Encoder Model and (c) Decoder Model.

A. Feature Extraction Model

This model is primarily responsible for acquiring features from an image for training. When training begins, the features of the images are input to this model. The model uses a VGG16 architecture, as seen in Fig. 1, to efficiently extract the features from the images using a combination of multiple 3*3 convolution layers and max pooling layers. The output of a VGG16 network is a vector of size 1*4096, which is used to represent the features of an image.

A dropout layer is added to the model with a value of 0.5 to reduce overfitting. An optimal value lies between 0.5 and 0.8 and indicates the probability at which the outputs of the layer are dropped out. A dense layer is added after the dropout layer, which applies the activation function to the dot product of the input and the kernel plus a bias. The activation function used is ReLU (Rectified Linear Units) and the size of the output space is specified as 256. These vectors of size 256 are the output of the feature extraction model and will then be used in the decoder model.
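As an illustration, a minimal sketch of how the 1*4096 feature vectors can be extracted with VGG16 in Keras; the image folder name is a hypothetical placeholder.

# A minimal sketch of extracting 1*4096 image features with VGG16 in Keras;
# the directory path is a hypothetical placeholder.
import os
import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.preprocessing.image import load_img, img_to_array
from tensorflow.keras.models import Model

# Keep VGG16 up to its second-to-last layer so that each image is represented
# by the 4096-dimensional activations of the last fully connected layer.
vgg = VGG16()
vgg = Model(inputs=vgg.inputs, outputs=vgg.layers[-2].output)

features = {}
image_dir = 'Flickr8k_Dataset'                      # hypothetical folder of images
for name in os.listdir(image_dir):
    image = load_img(os.path.join(image_dir, name), target_size=(224, 224))
    image = preprocess_input(np.expand_dims(img_to_array(image), axis=0))
    features[name.split('.')[0]] = vgg.predict(image, verbose=0)  # shape (1, 4096)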
B. Encoder Model

The encoder model, as seen in Fig. 2, is primarily responsible for processing the captions of each image fed in while training. The output of the encoder model is again a vector of size 1*256, which in turn becomes an input to the decoder sequences.

Initially, the captions present with each image are tokenized, i.e. the words in the sentences are converted to integers so that the neural network can process them efficiently. The tokenized captions are padded so that their length is equal to the size of the longest sentence and all the sentences can be processed at an equal length.

Then an Embedding layer is attached to embed the tokenized captions into fixed dense vectors with an output space of 256 by 34; 34 is chosen because the maximum number of words in any caption of the Flickr8k dataset is 34. These vectors further ease the processing by providing a convenient way of representing the words in the vector space. A dropout layer is attached, again with a probability of 0.5, to reduce overfitting in the model.

The most important part of the encoder model is the LSTM (Long Short Term Memory) layer. This layer helps the model learn how to generate valid sentences, i.e. how to generate the word with the highest probability of occurrence after a specific word is encountered. The activation function used is ReLU and the output space defined is 256.

For comparison between the complete models VGG+GRU and VGG+LSTM, this particular layer is replaced by a GRU (Gated Recurrent Units) layer and the results are analysed for the same. The output space for the GRU layer is the same, i.e. 256. Thus the only major difference between the two models lies in the encoder part. The output of the LSTM (or GRU) layer is the output of the encoder model.
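A minimal sketch of the caption-side processing described above, assuming the Keras preprocessing utilities; the sample captions are hypothetical placeholders.

# A minimal sketch of tokenizing, padding and encoding the captions;
# the caption strings below are placeholders, not Flickr8k data.
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Input, Embedding, Dropout, LSTM, GRU

captions = ["a dog runs through the grass", "a child plays in the park"]

# Tokenize: map every word to an integer so the network can process it.
tokenizer = Tokenizer()
tokenizer.fit_on_texts(captions)
sequences = tokenizer.texts_to_sequences(captions)

# Pad every sequence to the longest caption length (34 for Flickr8k).
max_length = 34
padded = pad_sequences(sequences, maxlen=max_length, padding='post')

# Encoder stack: Embedding (output space 256, length 34) -> Dropout(0.5) -> LSTM(256).
vocab_size = len(tokenizer.word_index) + 1
caption_input = Input(shape=(max_length,))
x = Embedding(vocab_size, 256, mask_zero=True)(caption_input)
x = Dropout(0.5)(x)
encoded = LSTM(256)(x)   # replacing this line with GRU(256)(x) gives the VGG+GRU variant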
IV. ANALYSIS
Fig. 4 and Fig. 5 show some example images on which the testing was done. We have tested various images with both methods, i.e. VGG+LSTM and VGG+GRU. The training of the models was done on Google Colab, which provides a 1x Tesla K80 GPU with 12GB GDDR5 VRAM, and took approximately 13 minutes per epoch for LSTM and 10 minutes per epoch for GRU. This difference is due to the smaller number of operations performed in a GRU compared to an LSTM. While the loss calculated for LSTM was lower than that for GRU, the user can prefer either model according to his need: the one with maximum accuracy or the one which takes less time to process. GRUs generally train faster on less training data than LSTMs and are simpler and easier to modify.
V. CONCLUSION

We have presented a deep learning model that automatically generates image captions with the goal of not only describing the surrounding environment but also helping visually impaired people better understand their environments. Our described model is based upon a CNN feature extraction model that encodes an image into a vector representation, followed by an RNN decoder model that generates corresponding sentences based on the image features learned. We have compared various encoder-decoder models to see how each component influences the caption generation and have also demonstrated various use cases of our system. The results show that the LSTM model generally works slightly better than GRU, although it takes more time for training and sentence generation due to its complexity. The performance is also expected to increase when using a bigger dataset and training on a larger number of images. Because of the considerable accuracy of the generated image captions, visually impaired people can greatly benefit and get a better sense of their surroundings using the text-to-speech technology that we have incorporated as well.
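As an illustration of the text-to-speech step mentioned above, a minimal sketch using the gTTS library; gTTS is an assumed choice, since the paper does not name the text-to-speech tool used, and the caption is a hypothetical placeholder.

# A minimal sketch of converting a generated caption to speech; gTTS is an
# assumed library choice and the caption string is a placeholder.
from gtts import gTTS

caption = "a dog is running through the grass"   # hypothetical generated caption
gTTS(text=caption, lang='en').save('caption.mp3')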
VI. FUTURE WORK

Our model is not perfect and may sometimes generate incorrect captions. In the next phase, we will be developing models which use Inceptionv3 instead of VGG as the feature extractor. Then we will compare the 4 models thus obtained, i.e. VGG+GRU, VGG+LSTM, Inceptionv3+GRU and Inceptionv3+LSTM. This will further help us analyse the influence of the CNN component on the entire network.

Currently, we are using a greedy approach for generating the next word in the sequence by selecting the one with the maximum probability. Beam search instead keeps a group of words with the maximum likelihood and searches through all those sequences in parallel. This approach, sketched below, might help us increase the accuracy of our predictions.
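The following is a minimal sketch of the beam search idea contrasted with the greedy step; predict_next, the startseq/endseq markers and the beam width are hypothetical stand-ins, not part of our published implementation.

# A minimal sketch contrasting greedy decoding with beam search;
# predict_next(seq) is a hypothetical stand-in that returns a
# {word: probability} map for the next word given the sequence so far.
import math

def greedy_step(seq, predict_next):
    # Greedy: keep only the single most probable next word.
    probs = predict_next(seq)
    word = max(probs, key=probs.get)
    return seq + [word]

def beam_search(predict_next, beam_width=3, max_length=34):
    # Beam search: keep the beam_width best partial captions at every step.
    beams = [(["startseq"], 0.0)]          # (sequence, log-probability)
    for _ in range(max_length):
        candidates = []
        for seq, score in beams:
            if seq[-1] == "endseq":
                candidates.append((seq, score))
                continue
            probs = predict_next(seq)
            for word, p in probs.items():
                candidates.append((seq + [word], score + math.log(p)))
        # Retain only the beam_width highest-scoring sequences.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams[0][0]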
Our model is trained on the Flickr 8K dataset which
REFERENCES