Image Caption Generator Using Deep Learning

Rajan Singh
Department of Electronics and Communication Engineering,
MLR Institute of Technology, Hyderabad, India
rajansingh@mlrinstitutions.ac.in

V. Charitha
Department of Electronics and Communication Engineering,
MLR Institute of Technology, Hyderabad, India
charitha.vanukuru20@gmail.com

Department of Computer Science and Engineering,
MLR Institute of Technology, Hyderabad, India
ksreenu2k@gmail.com
Abstract—Over the last few years, deep neural networks have made image captioning conceivable. An image caption generator provides an appropriate title for an input image based on the dataset it was trained on. The present work proposes a model based on deep learning and utilizes it to generate a caption for the input image. The model takes an image as input and frames a sentence related to it using algorithms such as CNN and LSTM: the CNN model is used to identify the objects that are present in the image, and the Long Short-Term Memory (LSTM) model not only generates the sentence but also summarizes the text into a caption that suits the image. The proposed model therefore mainly focuses on identifying the objects and generating the most appropriate title for the input images.

Keywords—Convolutional Neural Network (CNN), Long Short-Term Memory (LSTM), Machine Vision, Natural Language Processing (NLP), Input image, Framing the Sentence, Feature extraction.

I. INTRODUCTION

It is a fundamental and difficult task to automatically describe the content of photographs using natural language. As computing power and data availability have improved, creating captioning models from massive datasets has become feasible. Humans, by contrast, can describe their surroundings with ease; it is natural for someone shown a picture with a large number of elements in it to explain it after a quick glance. Despite significant progress in computer vision tasks such as object recognition, action classification, and image classification, the task of allowing a computer to characterize an image draws on all of these, together with novel technologies based on neural networks and advanced computing algorithms that now make it feasible to recognize faces and scenes.

For picture captioning, the semantics of the image must be understood and expressed in the necessary natural language form. This has significant real-world impact, for instance by helping persons with visual impairments better understand the information included in web images. So, to create the image caption generator model, Convolutional Neural Network and Recurrent Neural Network architectures are used together: features are extracted with the CNN, and this information is then processed by a Long Short-Term Memory (LSTM) network to generate an image description. However, the sentences created by these typical approaches are generic descriptions of the visual content; context is largely ignored, and in emergency situations such generic descriptions are insufficient. With emerging technologies, one can train the computer in such a way that it gives us output just as a human would. Today's computer vision can recognize objects and can classify and differentiate the things it sees.

One can create a model for the image caption generator so that it gives the appropriate caption. It is important not only to generate a caption but also to recognize the scenes or emotions that are present in the image, just as humans do. Here, the proposed image caption generator employing deep learning uses CNN and LSTM models to extract the features, and the associated statements are subsequently produced. The present work is expected to be mainly helpful for visually impaired people in understanding pictures and their contents. First, the proposed model uses the CNN to extract the features from the input image.

The CNN model used in this case has previously undergone training. The LSTM model, a subtype of RNN that primarily generates a sequence of words from the input obtained from the CNN model, then uses the CNN output. A caption of 4-5 words is generated for the input image that we provide. As an illustration, Fig. 1 shows how an image may have several captions, such as Caption 1: "A group is sitting around a snowy crevasse"; Caption 2: "A group of people sit atop a snowy mountain"; Caption 3: "A group of people sit in the snow and look out at a mountain view"; Caption 4: "A group of five kids prepares to sled"; and Caption 5: "In the snow, five people are assembled". Usually, the generated sentences closely reflect the substance of the input image; long sentences are not always essential to convey an image's significance. The proposed image caption generator therefore mainly focuses on generating the sentences and summarizing the generated text into a small caption.
Fig. 1. A typical image with one possible caption, "family trip to a hill station".

II. LITERATURE SURVEY

Earlier reports, including A. Karpathy and Li Fei-Fei [1], explained the utilization of big data and machine learning for describing the contents of an input image. The authors generated photo descriptions for the input images using an LSTM model, generated the word sequences, and also explained the different models used in their work. Qichen Fu et al. [2] used a CNN and an RNN, utilized a beam search algorithm to generate the caption with the highest likelihood for an image, and used attention mechanism visualization to understand which part of the image the model focuses on.

The projected topic frequently employs generative adversarial networks (GANs), which consist of a generator and a discriminator trained in an adversarial manner using a minimax game. On the one hand, the generator tries to create realistic samples in an effort to deceive the discriminator into thinking the samples are real. On the other hand, the discriminator is taught to distinguish between fake samples and real ones. In this projected theme, the image captioning model is considered the generator within a GAN framework, which attempts to provide representative picture descriptions, and a discriminator is built to gauge whether or not the generated sequence is realistic. Methods based only on the combination of CNN and LSTM lack diversity and naturalness in the generated captions, so in recent years researchers have been investigating how to improve image captioning models by adding sentiment and diversity to obtain more human-like descriptions. The work discussed above gives an idea of captioning an image using a GAN network. By contrast, the present work proposes, in particular, to investigate several autoencoders in order to generate more accurate and meaningful descriptions for photos.

Previous picture captioning systems used templates rather than a probabilistic generative model to generate the plain-language caption. Farhadi et al. (2010) [3] create captions using triplets and a pre-made template; they train a multi-label discriminant, a Markov Random Field, to predict the triplets' values. A labelled Conditional Random Field graph is used by Kulkarni et al. (2011) [4] to identify objects in an image, forecast a set of attributes and prepositions (spatial information relative to other objects) for each object, and construct sentences using the labels and a template. Even when the individual objects were included in the training data, these techniques do not generalize well, because they fail to represent unseen object compositions, and the templates must be designed and evaluated very carefully. To address these issues, deep neural network architectures were used along with a language model; a vector space is helpful in combining the images as well as the captions. The basic approach is essentially the same in both methods: a Convolutional Neural Network (the encoder) generates a series of features that is fed into a decoder, which adopts a language model to provide natural language descriptions. The recurrent neural network predicts the following word on a feed-forward basis with just a single linear hidden layer, given the image and the previous words.

Vinyals et al. [5] used a GoogLeNet CNN to extract graphic elements and generated captions using Long Short-Term Memory cells. This model lays the foundation and motivation for the proposed model in the present work; to be better suited to real-time settings, the proposed model diverges from that implementation. Their model makes use of new developments in machine translation and object detection to introduce an attention-based model that attends to several "spots" in the image. There are two main issues of importance here: machine vision and cognitive computing in artificial intelligence. Describing the image's content is done using artificial intelligence: the image is given as the input, and the output is an English statement that describes the photograph's content. They investigate three issues. An image is inspected by this kind of model, which comes up with more unique and relevant words for images.

III. SYSTEM ARCHITECTURE

The proposed model, shown in Fig. 3, uses a convolutional neural network to generate a dense feature vector from an input image. This dense vector, also known as an embedding, is used to create appropriate captions for the images that are supplied and can also be utilised as an input to other algorithms.

We append start and end words to the beginning and end of each sentence, and while training the model we use five captions for each image. These embeddings form a way to represent the underlying image, which is further utilized to generate appropriate captions associated with the image; in this way, the embedding is used as the initial input to the LSTM of the image caption generator. The architecture of the proposed model is shown in Fig. 3.

Fig. 2. Proposed model of the image caption generator.
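As a concrete illustration of the caption preparation described above, the short sketch below cleans each caption and wraps it with start and end markers before training. The token names "startseq"/"endseq" and the dictionary layout are illustrative assumptions, not the paper's exact code.

# Minimal sketch of the caption preparation step: each image maps to
# several reference captions, and every cleaned caption is wrapped
# with start/end markers so the LSTM can learn where sentences begin
# and end. Token names are assumptions for illustration.
import string

def clean_caption(caption):
    """Lowercase the caption, strip punctuation, keep alphabetic words."""
    table = str.maketrans("", "", string.punctuation)
    words = caption.lower().translate(table).split()
    return " ".join(w for w in words if w.isalpha())

def wrap_captions(captions_per_image):
    """Prefix 'startseq' and append 'endseq' to every cleaned caption."""
    return {
        image_id: ["startseq " + clean_caption(c) + " endseq"
                   for c in captions]
        for image_id, captions in captions_per_image.items()
    }

data = {"img_001": ["A group of people sit atop a snowy mountain.",
                    "In the snow, five people are assembled."]}
print(wrap_captions(data)["img_001"][0])
# -> startseq a group of people sit atop a snowy mountain endseq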
Fig. 3. System architecture of the image caption generator.

A. Proposed Image Caption Generator

Data flow diagrams (DFDs) are used to analyze the flow of the proposed model. There are mainly three levels in the process; Figs. 4, 5, and 6 show the data flow levels of the project.

Fig. 4. Data Flow Diagram, Level 0.

Fig. 6. Data Flow Diagram, Level 2.

The information flow in the proposed model is depicted in Figs. 3 and 5. Once the user uploads an image, the CNN model detects the objects and events present within the input image, and the LSTM model then makes the captions considering the objects present within the image. To get better results, a training data set is provided so that the model generates better captions and displays the result for the input image. It is evident that the proposed model utilizes dedicated training when captioning a new image: it recognizes the objects in the image and creates a caption that adequately conveys the image under test.

B. System Requirements

System requirements for running the proposed model include: OS: Windows 7 or above, Windows 10 recommended; CPU: Intel processor with 64-bit support; disk storage: 8 GB of free disk space; execution environment: JupyterLab using the Anaconda framework in Python.

IV. ARCHITECTURE AND IMPLEMENTATION

Here, the proposed model is based on a neural network that utilizes the concept of probability to determine the chance of occurrence of a favorable event. The underlying mathematical model is optimized and trained to obtain the most suitable outcome; this is achieved after multiple iterations, and the probability of an appropriate caption is maximized [7].

A. CNN

Convolutional neural networks (CNNs) are neural networks that operate mostly on matrix-shaped inputs; an image can be represented as a square matrix of pixel values and therefore works well with a CNN. A CNN divides the input into multiple frames and recognizes them using its training, so the model can differentiate between objects based on size, for example a bird versus a plane. The proposed methodology uses the commonly used scanning process to scan the object under test horizontally, from left to right, and vertically, from top to bottom. It can handle images that have been translated, flipped, scaled, or have had their colors changed [8-11].

B. LSTM

LSTMs belong to the family of RNNs, which are proficient at predicting sequences: on the basis of the text that came before, we can infer what the next words will be. By addressing the shortcomings of RNNs, in particular their short-term memory, the LSTM has been found to be more effective than a traditional RNN. It can filter out unnecessary input and keep track of pertinent data throughout processing: the LSTM memory cell processes the data and excludes irrelevant information [11], and gates are used to govern when a cell's state is updated. Most applications that use these networks deal with prediction problems.
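To make the two components above concrete, a minimal sketch of the CNN feature extraction follows. The paper states only that a pretrained CNN is used; the choice of InceptionV3 with ImageNet weights and the 2048-dimensional pooled output are assumptions for illustration.

# Illustrative CNN encoder sketch: a pretrained network with its
# classification head removed yields a dense feature vector that
# summarizes the image. InceptionV3 is an assumed choice, not
# necessarily the network used in the paper.
import numpy as np
from tensorflow.keras.applications.inception_v3 import InceptionV3, preprocess_input
from tensorflow.keras.preprocessing import image

encoder = InceptionV3(weights="imagenet", include_top=False, pooling="avg")

def extract_features(img_path):
    """Return a (1, 2048) feature vector for the image at img_path."""
    img = image.load_img(img_path, target_size=(299, 299))
    x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
    return encoder.predict(x, verbose=0)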
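A matching sketch of the LSTM decoder is given below. The vocabulary size, maximum caption length, and layer widths are placeholder assumptions; the structure, in which the image features and the partial caption are merged to predict the next word, follows the common CNN-LSTM captioning recipe rather than the paper's exact layer configuration.

# Minimal LSTM decoder sketch: the CNN feature vector and the
# partial caption are merged, and the network predicts the next word.
# All sizes below are assumed placeholder values.
from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, add
from tensorflow.keras.models import Model

vocab_size, max_length = 8000, 34   # assumed; derived from the dataset

img_in = Input(shape=(2048,))                        # CNN feature vector
img_vec = Dense(256, activation="relu")(Dropout(0.5)(img_in))

txt_in = Input(shape=(max_length,))                  # word indices so far
txt_vec = LSTM(256)(Embedding(vocab_size, 256, mask_zero=True)(txt_in))

merged = Dense(256, activation="relu")(add([img_vec, txt_vec]))
next_word = Dense(vocab_size, activation="softmax")(merged)

model = Model(inputs=[img_in, txt_in], outputs=next_word)
model.compile(loss="categorical_crossentropy", optimizer="adam")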
Fig. 8. Architecture of the LSTM model.

C. Implementation of the System

1. Object recognition

The objects of the image are studied by using an encoder.

2. Attribute extraction

The encoder generates embeddings, which are vector features. After extracting features from the original images, the CNN model compresses them into a smaller, RNN-compatible feature vector; "encoder" is another name for this component.

3. Tokenization

The RNN is the next phase of the application; it decodes the feature vectors provided to it by the CNN.

4. Prediction

Prediction is the final stage after tokenization. The vectors are decoded here, and the prediction routine is used to construct the final output. Sentences are created using the LSTM [7], [8]; with its help, an appropriate sentence is formed from the predicted words.
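The tokenization and prediction stages above can be sketched as follows: captions become integer sequences, and each caption is expanded into (image feature, partial sequence) to next-word training pairs. The sample caption and the sizes are illustrative assumptions; the names follow the earlier sketches.

# Sketch of tokenization and training-pair construction for
# next-word prediction. The single caption and sizes are
# illustrative assumptions.
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

captions = ["startseq two men play hockey endseq"]
tokenizer = Tokenizer()
tokenizer.fit_on_texts(captions)
vocab_size = len(tokenizer.word_index) + 1
max_length = 6

def make_pairs(photo_feature, caption):
    """Yield (photo, padded prefix, one-hot next word) training triples."""
    seq = tokenizer.texts_to_sequences([caption])[0]
    for i in range(1, len(seq)):
        prefix = pad_sequences([seq[:i]], maxlen=max_length)[0]
        target = to_categorical(seq[i], num_classes=vocab_size)
        yield photo_feature, prefix, target

photo = np.zeros(2048)          # placeholder for a real CNN feature vector
X1, X2, y = map(np.array, zip(*make_pairs(photo, captions[0])))
print(X2.shape, y.shape)        # (5, 6) prefixes, (5, 7) one-hot targets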
D. Flow of the Project

• Importing the libraries
• Configuring GPU memory, which is helpful for training
• Importing the image dataset and its respective captions
• Cleaning the captions for further processing
• Extracting features
• Plotting similar images from the dataset
• Tokenizing the captions for further processing
• Processing the captions and images into the shape required by the model
• Building the LSTM model
• Training the LSTM model
• Plotting the loss value
• Generating captions
• Evaluating the performance

E. Datasets

Flickr 8k is the dataset used in this project; it contains 8000 pictures [9-11]. Alternative datasets can be used too; however, it will be tough to train the network, and the CPU must also support such huge data sets [8]. Larger data sets such as MSCOCO can also be used, but training networks on such data sets can take weeks.

F. Libraries Involved

• TensorFlow: a Python library used to compute deep learning models faster.
• Keras: a Python library used to deploy deep neural networks; it runs on top of TensorFlow in the background [12-15].
• Pillow: a library used to manipulate images, for example to rotate, resize, and transform them.
• NumPy: a Python library used instead of plain lists because of its better performance.
• Matplotlib: a Python library for creating static and animated visualizations [16-19].

G. Deployment

The Anaconda framework is used for the deployment of the project. When an image path is given, TensorFlow runs in the backend and the caption is displayed, as sketched below.
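The deployment step can be illustrated with a greedy decoding loop: starting from the start marker, the model repeatedly predicts the most likely next word until the end marker appears. This sketch assumes the extract_features function, model, tokenizer, and max_length objects built in the earlier sketches.

# Greedy decoding sketch for deployment: given an image path,
# return the generated caption. All referenced objects come from
# the earlier sketches and are assumptions for illustration.
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

def generate_caption(img_path):
    photo = extract_features(img_path)          # (1, 2048) CNN features
    words = ["startseq"]
    for _ in range(max_length):
        seq = tokenizer.texts_to_sequences([" ".join(words)])[0]
        seq = pad_sequences([seq], maxlen=max_length)
        probs = model.predict([photo, seq], verbose=0)
        word = tokenizer.index_word.get(int(np.argmax(probs)))
        if word is None or word == "endseq":
            break
        words.append(word)
    return " ".join(words[1:])

print(generate_caption("example.jpg"))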
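The final step in the project flow, evaluating the performance, is commonly done with BLEU scores that compare generated captions against the reference captions; the paper does not give its exact evaluation code, so the sketch below assumes a hypothetical test_captions dictionary mapping image identifiers to their reference captions, plus the generate_caption function above.

# Evaluation sketch: corpus-level BLEU-1 over an assumed test split.
from nltk.translate.bleu_score import corpus_bleu

references, hypotheses = [], []
for image_id, caps in test_captions.items():       # assumed dictionary
    references.append([c.split() for c in caps])   # multiple references
    hypotheses.append(generate_caption(image_id + ".jpg").split())

print("BLEU-1: %.3f" % corpus_bleu(references, hypotheses,
                                   weights=(1.0, 0, 0, 0)))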
V. RESULTS

Two images were used as input to the proposed model to generate the associated captions. The input images, along with their respective titles, are shown in Figs. 9 and 10. The generated captions demonstrate the accuracy and reliability of the proposed model.

Fig. 9. Input image number 1 and the output caption generated by the model, "two men play hockey on the frozen field".

Fig. 10. Input image number 2 and the output caption generated by the model, "young boy swings the swing".

VI. CONCLUSION AND FUTURE SCOPE

The model was trained and tested successfully to create accurate captions for the loaded photos. It is essentially a CNN and RNN model in which the CNN behaves as an encoder and the RNN acts as a decoder. This project is an application of deep learning: using the CNN and LSTM models, we first extract the features with the CNN and then generate the captions for the input image. The dataset used contains 8000 images. The proposed model utilizes a neural network algorithm to generate a caption for the image under test: it scans multiple frames of the image and, based on the objects identified, provides an appropriate title. Since the model utilizes its dataset to identify the objects, larger data sets are bound to improve the results, and more suitable captions can be generated; an increased dataset will boost accuracy while lowering losses, which would be especially helpful for visually impaired people. The results show that the proposed model successfully generates appropriate captions with high accuracy.

REFERENCES

[1] A. Karpathy and L. Fei-Fei, "Deep Visual-Semantic Alignments for Generating Image Descriptions," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 4, pp. 664-676, Apr. 2017.
[2] Q. Fu, Y. Liu, and Z. Xie, "Recurrent Neural Network for Image Caption," available online: https://fuqichen1998.github.io/pdfs/eecs442_report.pdf
[3] A. Farhadi, M. Hejrati, M. A. Sadeghi, P. Young, C. Rashtchian, J. Hockenmaier, and D. Forsyth, "Every picture tells a story: Generating sentences from images," in European Conference on Computer Vision, pp. 15-29, 2010.
[4] G. Kulkarni, V. Premraj, S. Dhar, et al., "Baby Talk: Understanding and Generating Simple Image Descriptions," IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011.
[5] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, "Show and tell: A neural image caption generator," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3156-3164, 2015.
[6] V. D. Shinde, M. P. Dave, A. M. Singh, and A. C. Dubey, "Image Caption Generator using Big Data and Machine Learning," International Research Journal of Engineering and Technology (IRJET), vol. 7, no. 4, Apr. 2020.
[7] A. Karpathy and L. Fei-Fei, "Deep Visual-Semantic Alignments for Generating Image Descriptions," IEEE Conference on Computer Vision and Pattern Recognition, pp. 664-676, 2015.
[8] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, "Rethinking the inception architecture for computer vision," IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818-2826, 2016.
[9] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell, "DeCAF: A deep convolutional activation feature for generic visual recognition," International Conference on Machine Learning, pp. 647-655, 2014.
[10] S. Bai and S. An, "A survey on automatic image caption generation," Neurocomputing, vol. 311, pp. 291-304, 2018.
[11] H. Wang, Y. Zhang, and X. Yu, "An overview of image caption generation methods," Computational Intelligence and Neuroscience, 2020.
[12] "Apply convolution to image processing, signal processing, and deep learning," https://de.mathworks.com/discovery/convolution.html
[13] L. Burgueño, J. Cabot, S. Li, and S. Gérard, "A generic LSTM neural network architecture to infer heterogeneous model transformations," Software and Systems Modeling, vol. 21, no. 1, pp. 139-156, 2022.
[14] B. Kumar, "Comparative Performance Evaluation of Greedy Algorithms for Speech Enhancement System," Fluctuation and Noise Letters, vol. 20, no. 2, 2020.
[15] B. Kumar, "Real-time Performance Evaluation of Modified Cascaded Median based Noise Estimation for Speech Enhancement System," Fluctuation and Noise Letters, vol. 18, no. 4, 2019.
[16] B. Kumar, "Comparative performance evaluation of MMSE-based speech enhancement techniques through simulation and real-time implementation," International Journal of Speech Technology, vol. 21, no. 4, 2018.
[17] S. Kumar, B. Kumar, and N. Kumar, "Speech Enhancement Techniques: A Review," Rungta International Journal of Electrical and Electronics Engineering, vol. 1, no. 1, 2016.
[18] T. Gaber, A. Tharwat, V. Snasel, and A. E. Hassanien, "Plant identification: Two dimensional-based vs. one dimensional-based feature extraction methods," International Conference on Soft Computing Models in Industrial and Environmental Applications, pp. 375-385, 2015.
[19] K. C. Jena, S. Mishra, S. Sahoo, and B. K. Mishra, "Principles, techniques and evaluation of recommendation systems," 2017 International Conference on Inventive Systems and Control (ICISC), pp. 1-6, 2017.