
A project report on

IMAGE CAPTIONING GENERATOR


Submitted in partial fulfillment of the requirements for the award of the degree of

Bachelor of Technology (CSE)
Under the supervision of

Mr. Sandeep Tayal


Assistant Professor

Department of Computer Science

Submitted by

Dinesh Gond Ashok Kumar Murmu Chander Shekhar


04614802720 03014802720 04014802720

Department Of Computer Science


Maharaja Agrasen Institute of Technology
Sector-22, Rohini, Delhi –110086
December 2023
ACKNOWLEDGEMENT

We would like to express our profound gratitude to Mr. Sandeep Tayal, under whose guidance we
ventured to learn the finer aspects and notions of Machine Learning. Without his support at every
step of the way, it would have been nearly impossible for us to accomplish such a feat. We would
also like to thank all the researchers and educators who wrote the informative articles and research
papers from which we learned about the various methods and techniques of topics such as
Convolutional Neural Networks and Machine Learning in general. Last but not least, we thank our
parents and all our friends, who supported us through all the errors we encountered while training
and deploying the ML model. We are very grateful for everyone’s help and efforts.

Dinesh Gond Chander Shekhar Ashok Kumar Murmu


04614802720 04014802720 03014802720

i
CANDIDATE’S DECLARATION

We hereby declare that the work presented in this project report, titled “IMAGE
CAPTIONING GENERATOR”, submitted by us in partial fulfillment of the requirements for
the award of the degree of Bachelor of Technology (B.Tech.) in the Department of
Computer Science, Maharaja Agrasen Institute of Technology, is an authentic record of
our project work carried out under the guidance of our professor and mentor Mr. Sandeep Tayal,
Dept. of Computer Science.

The matter presented in this project report has not been submitted, either in part or in full, to any
university or institute for the award of any degree.

Date: 30/04/2024

Place: Delhi

Dinesh Gond Ashok Kumar Murmu Chander Shekhar


04614802720 03014802720 04014802720

ii
SUPERVISOR’S CERTIFICATE

It is to certify that the project entitled “IMAGE CAPTIONING GENERATOR”, which is
being submitted by Mr. Dinesh Gond, Mr. Chander Shekhar and Mr. Ashok Kumar Murmu to the
Maharaja Agrasen Institute of Technology, Delhi, in fulfillment of the requirement for the award
of the degree of Bachelor of Technology (B.Tech.), is a record of bonafide project work carried
out by them under my guidance and supervision.

Mr. Sandeep Tayal Prof. (Dr.) Namita Gupta


Assistant Professor H.O.D
Department of Computer Science Department of Computer Science

iii
LIST OF FIGURES
Figure Title Page No
Figure 1.1 An overall taxonomy of deep learning-based image 3
captioning.
Figure 3.1 Python logo 6

Figure 3.2 NumPy logo 7

Figure 3.3 Pandas logo 8


Figure 3.4 TensorFlow library Logo 9
Figure 3.5 Jupyter Notebook Logo 10
Figure 3.6 Flickr Dataset text format 12
Figure 3.7 Flickr Dataset Python File 13
Figure 3.8 Description of Images 14
Figure 3.10 Final Model 17
Figure 3.11 Model, Image Caption Generator 19
Figure 3.12 Forget Gate, Input Gate, Output Gate 20
Figure 3.13 Feature Extraction in images using VGG 22
Figure 4.1 Image 1 24
Figure 4.2 Image 2 25

iv
LIST OF TABLES
Table Title Page No
Table 3.9 Word Prediction Generation Step By Step 16

Table 3.14 Data Cleaning of Captions 23

v
ABSTRACT

In this project, we use a CNN and an LSTM to generate captions for images. As deep
learning techniques mature, large datasets and increased computing power make it possible to build
models that can generate captions for an image. This is what we implement in this Python-based
project, using deep learning techniques such as CNNs. Image caption generation is a process that
combines natural language processing and computer vision concepts to recognize the context of an
image and describe it in English. In this report, we carefully follow some of the core concepts of
image captioning and its common approaches. We discuss the Keras library, NumPy and Jupyter
notebooks used in building this project. We also discuss the Flickr dataset and the CNN used for
image classification.

vi
TABLE OF CONTENT
Title Page no.
Acknowledgment i

Candidate Declaration ii

Supervisor Certificate iii

List Of Figures iv

List Of tables v

Abstract vi

Chapter 1: Introduction 1
1.1: Need of Study 1
1.2: Scope of the Study 2
1.3: Objective of the Study 2

Chapter 2: Literature Review 4

Chapter 3: Implementation of Model 6


3.1: SRS (Software Requirement Specification) 6
3.2: About Dataset 10
3.3: Methodology 14
3.4: About Model 18
Chapter 4: Experimental Results 24

Chapter 5: Conclusion and Future Scope 27

Annexure 1: Code Snapshots 29

Annexure 2: Research Paper

References 35

vii
CHAPTER 1

INTRODUCTION

Image captioning is a system that helps visually impaired people understand images by
giving a summary or caption for a given image. It is a prominent task with great practical and
industrial importance. It reduces the dependence of blind people on other human beings for help.
Furthermore, it also improves the accessibility of web pages. As the technology develops, it can be
used on different social media platforms to automatically predict captions for an image. The system
is especially helpful when a person wants to know about an image and its caption. It detects and
recognizes the objects in images and predicts a caption or summary, which can later be conveyed
to visually impaired persons in the form of voice. It is a recent and rapidly growing research
problem, and new methods are being presented day by day to achieve satisfying results in this
field [1].

Need of study

The attention mechanism is a crucial aspect of many captioning models, allowing them to focus on
specific regions of the image while generating each word of the caption. This attention mechanism
is an enhancement over the "Show and Tell" model.
Our project addresses the challenge of automatically generating descriptive captions for images,
contributing to the broader fields of Computer Vision and Machine Learning. The potential
applications of such technology include assisting visually impaired individuals and automating
captioning tasks on the internet.
This introductory section highlights the significance of our project in the context of recent
advancements in neural networks. Previous approaches faced challenges such as grammar problems,
cognitive absurdity and content irrelevance, which motivated the transition to Convolutional
Neural Networks and recurrent neural networks, as seen in the cited papers [2].

1
Scope of study

The scope of the study encompasses the development, implementation and evaluation of a
comprehensive deep learning-based model for information retrieval, template generation, text
summarization and image captioning. The project focuses on addressing the challenges posed by
the irregular structure and high complexity of documents in natural language processing. The
evaluation involves benchmarking against existing methods and datasets to demonstrate the
effectiveness and novelty of the proposed approach [3].

Objective of study

The objective of this project is to develop an innovative algorithm for automatically generating
natural language descriptions of images. This task is of great significance in practical
applications and serves as a bridge between two prominent fields in artificial intelligence:
computer vision and natural language processing.

The objective of incorporating Convolutional Neural Networks (CNN), Long Short-Term


Memory networks (LSTM), and the VGG model in the project is to leverage their unique
capabilities to enhance the performance of the image description generation algorithm. Each of
these models serves a specific purpose within the larger framework, contributing to the overall
effectiveness of the algorithm [4].

For the image captioning step, humans can easily understand image content and express it in
natural language sentences according to specific needs; for computers, however, this requires the
integrated use of image processing, Computer Vision, Natural Language Processing and other
significant areas of research. The difficulty of image captioning is to design a model that can
fully use the image information to generate more human-like, rich image descriptions [1].

2
Figure.1.1 An overall taxonomy of deep learning-based image captioning

3
CHAPTER 2

LITERATURE REVIEW

Image captioning has recently gathered a lot of attention, specifically in the natural language
domain. There is a pressing need for context-based natural language descriptions of images. This
may seem a bit far-fetched, but recent developments in fields like neural networks, computer
vision and natural language processing have paved the way for accurately describing images,
i.e. representing their visually grounded meaning. We leverage state-of-the-art techniques such as
Convolutional Neural Networks (CNNs), together with appropriate datasets of images and their
human-written descriptions, to achieve this. We demonstrate that our alignment model produces
results in retrieval experiments on datasets such as Flickr.

DEEP CAPTIONING
In the last year, a variety of models [5, 7, 8, 10, 11, 12] have achieved promising results on the
image captioning task. Some [5, 7, 8] follow a CNN-RNN framework: first, high-level features
are extracted from a CNN trained on the image classification task, and then a recurrent model
learns to predict subsequent words of a caption conditioned on the image features and previously
predicted words. Others [10, 12] adopt a multimodal framework in which recurrent language
features and image features are embedded in a multimodal space. The multimodal embedding is
then used to predict the caption word by word. Retrieval methods [4], based on comparing the
k-nearest neighbors of training and test images in a deep image feature space, have also achieved
competitive results on the captioning task. However, retrieval methods are limited to words and
descriptions which appear in a training set of paired image-sentence data. As opposed to using
high-level image features extracted from a CNN, another approach [11, 6] is to train classifiers on
visual concepts such as objects, attributes and scenes. A language model, such as an LSTM [6] or
a maximum entropy model [11], then generates a visual description conditioned on the presence of
the classified visual elements. Our model most closely resembles the framework suggested in [12],
which uses a multimodal space to combine features from image and language; however, our
approach modifies this framework considerably to describe concepts that are never seen in paired
image-sentence data.

4
IMAGE CAPTIONING METHODS

There are various image captioning techniques; some are rarely used at present, but it is
necessary to take an overview of these technologies before proceeding. The main categories of
existing image captioning methods are template-based image captioning, retrieval-based image
captioning, and novel caption generation. Novel caption generation methods mostly use the visual
space and deep machine learning based techniques. Captions can also be generated from a
multimodal space. Deep learning-based image captioning methods can also be categorized by
learning technique: supervised learning, reinforcement learning, and unsupervised learning. We
group reinforcement learning and unsupervised learning into Other Deep Learning. Usually,
captions are generated for a whole scene in the image. However, captions can also be generated
for different regions of an image (dense captioning). Image captioning methods can use either a
simple encoder-decoder architecture or a compositional architecture. There are methods that use
attention mechanisms, semantic concepts, and different styles in image descriptions. Some methods
can also generate descriptions for unseen objects. We group them into one category as "Others".
Most image captioning methods use an LSTM as the language model. However, a number of
methods use other language models such as CNNs. Therefore, we include a language model-based
category, "LSTM vs. Others".

DESCRIBING NEW OBJECTS IN CONTEXT


Many early caption models [13, 14, 15, 16, 17] rely on first discerning visual elements from an
image, such as subjects, objects, scenes, and actions, then filling in a sentence template to create
a coherent visual description. These models are capable of describing objects without being
provided with paired image-sentence examples containing the objects, but are restricted to
generating descriptions using a fixed, predetermined template. More recently, [12] explore
describing new objects with a deep caption model using only a few paired image-sentence
examples during training. However, [12] do not consider how to describe objects when no paired
image-sentence data is available. Our model provides a mechanism to include information from
existing vision datasets as well as unpaired text data, whereas [12] relies on additional
image-sentence annotations to describe novel concepts.
5
CHAPTER 3

IMPLEMENTATION

SRS(Software Requirement Specification)

Programming Language :

The implementation of the proposed model utilizes the Python programming language. Python is
chosen for its versatility, extensive libraries, and strong support for machine learning
frameworks.

Python:

Fig 3.1: Python Logo


Python, a high-level, general-purpose programming language, has gained immense popularity in
the fields of data science, machine learning, and artificial intelligence. Its simplicity, readability,
and versatility make it an ideal choice for implementing complex algorithms, including image
captioning models.

In the context of this image captioning project, Python serves as the primary programming
language, orchestrating tasks from data preprocessing to model implementation and evaluation.
Its open-source nature and extensive community support contribute to the robustness and
accessibility of the project.
6
Python is currently one of the most widely used languages and allows programming in both
object-oriented and procedural paradigms. Python programs are generally smaller than those
written in other programming languages like Java. Programmers have to type relatively less, and
the indentation requirements of the language keep the code readable.

The biggest strength of Python is huge collection of standard libraries which can be used for the
following:
 Machine Learning
 GUI Applications (like Kivy, Tkinter, PyQt etc.)
 Web frameworks like Django (used by YouTube, Instagram, Dropbox)
 Image processing (like OpenCV, Pillow)

Python Libraries:
1. NumPy:

Fig 3.2: NumPy logo

NumPy is a fundamental library for numerical computations in Python. It provides support for
large, multi-dimensional arrays and matrices, along with a collection ofhigh-level mathematical
functions.

Array Operations: NumPy's primary role is in handling numerical arrays efficiently. In this project,
image feature vectors and encoded caption sequences are represented as arrays, and NumPy
facilitates operations such as reshaping, padding, and other numerical computations.

7
Data Transition: NumPy arrays act as a versatile data structure for moving data between
different stages of the project. Whether it is preprocessing or feeding data into the deep learning
model, NumPy arrays are pivotal for their efficiency.

2. Pandas:

Fig 3.3: Pandas logo

Pandas is a powerful data manipulation library that provides easy-to-use data structures, such as
Data Frames, for efficient data analysis and manipulation.

Data Preprocessing: Pandas is useful for cleaning and preparing the caption data. Tasks like
handling missing values, converting data types, and filtering records are efficiently performed
using Pandas DataFrame functionality.

Structured Data Input: Pandas DataFrames offer a structured format for organizing and inspecting
data before it is fed into the model. In the context of this project, the tabular structure helps in
examining image names and their associated captions.

8
3.Tensorflow:

Fig 3.4: TensorFlow Library Logo

TensorFlow is an open-source machine learning framework developed by Google.It specializes


in deep learning applications and provides a flexible platform for building and training neural
networks.

LSTM Model Implementation: TensorFlow plays a central role in developing the LSTM (Long
Short-Term Memory) model. Its high-level API, Keras, simplifies the process of designing and
training neural networks, which is crucial for capturing the sequential patterns in caption text.

Sequence Prediction: TensorFlow's LSTM layers enable the model to understand and predict
sequential patterns, a key aspect of generating captions word by word.

Development Environment :
The implementation is conducted in a Jupyter Notebook environment, fostering an interactive
and iterative approach to model development. The use of Jupyter Notebooks enhances code
readability and allows for the step-by-step execution and visualization of results.

9
Jupyter Notebook:

Fig 3.5: Jupyter Notebook logo

Jupyter Notebook, an open-source web application, is a versatile and powerful tool widelyused in
data science, machine learning, and scientific computing. It allows users to create and share
documents that combine live code, equations, visualizations, and narrative text.

Jupyter Notebook stands as a cornerstone in the toolkit of data scientists, offering an interactive
and collaborative environment for developing, documenting, and sharing data-driven insights. Its
integration with data science libraries, support for multiple languages, and flexibility make it an
indispensable asset in this image captioning project, fostering an efficient and transparent
workflow.

ABOUT DATASET

PRE-REQUISITES
This project requires good knowledge of Deep learning, Python, working on Jupyter notebooks,
Keras library, Numpy, and Natural language processing.
Make sure you have installed all the following necessary libraries:

 pip install tensorflow
 keras
 pillow
 numpy
 tqdm
 jupyterlab

PROJECT FILE STRUCTURE


Downloaded from dataset:

 Flicker8k_Dataset – Dataset folder which contains 8091 images.


 Flickr_8k_text – Dataset folder which contains text files and captions of images.

The following files will be created by us while making the project:
 Models – It will contain our trained models.
 Descriptions.txt – This text file contains all image names and their captions after
preprocessing.
 Features.p – Pickle object that contains an image and their feature vector extracted from the
Xception pre-trained CNN model.
 Tokenizer.p – Contains tokens mapped with an index value.
 Model.png – Visual representation of the dimensions of our model.
 Testing_caption_generator.py – Python file for generating a caption of any image.
 Training_caption_generator.ipynb – Jupyter notebook in which we train and build our
image caption generator.

BUILDING THE PYTHON BASED PROJECT


Let's start by initializing the Jupyter Notebook server by typing jupyter lab in the console of your
project folder. It will open up the interactive Python notebook where you can run your code.
Create a Python3 notebook and name it training_caption_generator.ipynb.

GETTING AND PERFORMING DATA CLEANING

The main text file which contains all image captions is Flickr8k.token in
our Flickr_8k_text folder.

11
Figure.3.6. Flickr Dataset text format

The format of our file is image name and caption separated by a newline (“\n”).

Each image has 5 captions, and we can see that a number #(0 to 4) is assigned to each caption. We
will define 5 functions:

 load_doc( filename ) – For loading the document file and reading the contents inside the
file into a string.
 all_img_captions( filename ) – This function will create a descriptions dictionary that
maps images with a list of 5 captions. The descriptions dictionary will look something like
the Figure.

12
Figure.3.7. Flickr Dataset Python File

 cleaning_text( descriptions ) – This function takes all descriptions and performs data
cleaning. This is an important step when working with textual data; depending on our goal,
we decide what type of cleaning we want to perform on the text. In our case, we will be
removing punctuation, converting all text to lowercase and removing words that contain
numbers. So, a caption like “A man riding on a three-wheeled wheelchair” will be
transformed into “man riding on three wheeled wheelchair”
 text_vocabulary( descriptions ) – This is a simple function that will separate all the unique
words and create the vocabulary from all the descriptions.
 save_descriptions( descriptions, filename ) – This function will create a list of all the
descriptions that have been preprocessed and store them into a file. We will create a
descriptions.txt file to store all the captions. It will look something like this:

13
Figure.3.8. Description of Images
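A minimal sketch of how these five functions might be written is given below. It assumes the
Flickr8k.token format shown in Figure 3.6 (image name and caption separated by a tab, with the
caption index after "#") and may differ in detail from the actual notebook code.

import string

def load_doc(filename):
    # Read the whole document file into a single string
    with open(filename, 'r') as f:
        return f.read()

def all_img_captions(filename):
    # Map each image name to its list of 5 captions
    doc = load_doc(filename)
    descriptions = {}
    for line in doc.split('\n'):
        tokens = line.split('\t')
        if len(tokens) < 2:
            continue
        img_id, caption = tokens[0].split('#')[0], tokens[1]
        descriptions.setdefault(img_id, []).append(caption)
    return descriptions

def cleaning_text(descriptions):
    # Lowercase, strip punctuation and drop words containing numbers
    table = str.maketrans('', '', string.punctuation)
    for img_id, caps in descriptions.items():
        for i, cap in enumerate(caps):
            words = cap.lower().translate(table).split()
            caps[i] = ' '.join(w for w in words if w.isalpha())
    return descriptions

def text_vocabulary(descriptions):
    # Collect all unique words across the cleaned captions
    vocab = set()
    for caps in descriptions.values():
        for cap in caps:
            vocab.update(cap.split())
    return vocab

def save_descriptions(descriptions, filename):
    # Write "image_name<tab>caption" lines to descriptions.txt
    lines = [img_id + '\t' + cap for img_id, caps in descriptions.items() for cap in caps]
    with open(filename, 'w') as f:
        f.write('\n'.join(lines))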

METHODOLOGY

EXTRACTING THE FEATURE VECTOR FROM ALL IMAGES


This technique is also called transfer learning: we do not have to build everything on our own; we
use a pre-trained model that has already been trained on a large dataset, extract the features
from it and use them for our task. We are using the Xception model, which has been trained on
the ImageNet dataset with 1000 different classes to classify. We can directly import this model
from keras.applications. Make sure you are connected to the internet, as the weights get
downloaded automatically. Since the Xception model was originally built for ImageNet, we will
make small changes to integrate it with our model. One thing to note is that the Xception model
takes a 299*299*3 image as input. We will remove the last classification layer and obtain the
2048-dimensional feature vector.

model = Xception(include_top=False, pooling='avg')

The function extract_features() will extract features for all images and we will map image
names with their respective feature array. Then we will dump the features dictionary into a
“features.p” pickle file.

This process can take a lot of time depending on your system. I am using an Nvidia 1050 GPU
for training purpose so it took me around 7 minutes for performing this task. However, if you are
using CPU then this process might take 1-2 hours. You can comment out the code and directly
load the features from our pickle file.
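For illustration, the feature-extraction step could be sketched as follows. The directory name
Flicker8k_Dataset follows the file structure above, and details such as the tqdm progress bar are
omitted; the actual notebook code may differ slightly.

import os
import pickle
import numpy as np
from tensorflow.keras.applications.xception import Xception, preprocess_input
from tensorflow.keras.preprocessing.image import load_img, img_to_array

def extract_features(directory):
    # Xception without its classification head returns a 2048-d vector per image
    model = Xception(include_top=False, pooling='avg')
    features = {}
    for name in os.listdir(directory):
        filename = os.path.join(directory, name)
        image = load_img(filename, target_size=(299, 299))   # Xception expects 299x299x3 input
        image = np.expand_dims(img_to_array(image), axis=0)  # add a batch dimension
        image = preprocess_input(image)                      # scale pixels to the expected range
        features[name] = model.predict(image, verbose=0)
    return features

# features = extract_features('Flicker8k_Dataset')
# pickle.dump(features, open('features.p', 'wb'))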
14
LOADING DATASET FOR TRAINING THE MODEL
In our Flickr_8k_text folder, we have the Flickr_8k.trainImages.txt file that contains a list of 6000
image names that we will use for training.
For loading the training dataset, we need more functions:

 load_photos( filename ) – This will load the text file in a string and will return the list of
image names.

 load_clean_descriptions( filename, photos ) – This function will create a dictionary that


contains captions for each photo from the list of photos. We also append the <start> and
<end> identifier for each caption. We need this so that our LSTM model can identify the
starting and ending of the caption.

 load_features(photos) – This function will give us the dictionary for image names and their
feature vector which we have previously extracted from the Xception model.
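A hedged sketch of these three loaders is shown below, assuming descriptions.txt stores
tab-separated image name and caption lines as produced by save_descriptions() above; the real
notebook may structure this differently.

import pickle

def load_photos(filename):
    # Return the list of image names listed in Flickr_8k.trainImages.txt
    with open(filename, 'r') as f:
        return [line.strip() for line in f if line.strip()]

def load_clean_descriptions(filename, photos):
    # Build {image_name: ["<start> caption <end>", ...]} for the given photo list
    descriptions = {}
    with open(filename, 'r') as f:
        for line in f:
            tokens = line.strip().split('\t')
            if len(tokens) < 2:
                continue
            image, caption = tokens[0], tokens[1]
            if image in photos:
                descriptions.setdefault(image, []).append('<start> ' + caption + ' <end>')
    return descriptions

def load_features(photos):
    # Keep only the feature vectors of the selected photos
    all_features = pickle.load(open('features.p', 'rb'))
    return {k: all_features[k] for k in photos}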

TOKENIZING THE VOCABULARY


Computers do not understand English words; for computers, we have to represent them with
numbers. So, we will map each word of the vocabulary to a unique index value. The Keras library
provides a Tokenizer class that we will use to create tokens from our vocabulary and save them to
a "tokenizer.p" pickle file.
Our vocabulary contains 7577 words. We also calculate the maximum length of the descriptions,
which is important for deciding the model's structural parameters. The maximum description
length is 32.
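Tokenization and the maximum-length calculation might look roughly like the sketch below;
train_descriptions is assumed to be the dictionary returned by load_clean_descriptions() above.
Note that the Tokenizer's default filters strip the angle brackets, so the start and end markers
become the plain tokens "start" and "end".

import pickle
from tensorflow.keras.preprocessing.text import Tokenizer

def dict_to_list(descriptions):
    # Flatten the {image: [captions]} dictionary into one list of captions
    return [cap for caps in descriptions.values() for cap in caps]

def create_tokenizer(descriptions):
    # Fit a Keras tokenizer on every caption so each word gets a unique index
    lines = dict_to_list(descriptions)
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(lines)
    return tokenizer

# tokenizer = create_tokenizer(train_descriptions)
# pickle.dump(tokenizer, open('tokenizer.p', 'wb'))
# vocab_size = len(tokenizer.word_index) + 1            # about 7577 + 1 in our case
# max_length = max(len(cap.split()) for cap in dict_to_list(train_descriptions))  # 32 here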

Create Data generator


Let us first see what the input and output of our model will look like. To make this a supervised
learning task, we have to provide input and output pairs to the model for training. We train our
model on 6000 images; each image is represented by a 2048-length feature vector, and the caption
is also represented as numbers. It is not possible to hold this amount of data for 6000 images in
memory, so we will use a generator method that yields batches.

The generator will yield the input and output sequences.


15
For example:
The input to our model is [x1, x2] and the output will be y, where x1 is the 2048 feature vector of
that image, x2 is the input text sequence and y is the output text sequence that the model has to
predict.

x1 (feature vector)    x2 (text sequence)               y (word to predict)

feature                start                            two
feature                start, two                       dogs
feature                start, two, dogs                 drink
feature                start, two, dogs, drink          water
feature                start, two, dogs, drink, water   end

Table 3.9. Word Prediction Generation Step By Step
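A generator along these lines would produce exactly the pairs shown in Table 3.9. This is a sketch
that assumes tokenizer, max_length and vocab_size from the previous step, not necessarily the
exact notebook code.

import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

def create_sequences(tokenizer, max_length, captions, feature, vocab_size):
    # Turn one image's captions into (image feature, partial sequence) -> next word pairs
    X1, X2, y = [], [], []
    for caption in captions:
        seq = tokenizer.texts_to_sequences([caption])[0]
        for i in range(1, len(seq)):
            in_seq = pad_sequences([seq[:i]], maxlen=max_length)[0]
            out_word = to_categorical([seq[i]], num_classes=vocab_size)[0]
            X1.append(feature)
            X2.append(in_seq)
            y.append(out_word)
    return np.array(X1), np.array(X2), np.array(y)

def data_generator(descriptions, features, tokenizer, max_length, vocab_size):
    # Yield one image's worth of training pairs at a time so memory stays bounded
    while True:
        for image_id, captions in descriptions.items():
            feature = features[image_id][0]
            x1, x2, yw = create_sequences(tokenizer, max_length, captions, feature, vocab_size)
            yield [x1, x2], yw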

Defining the CNN-RNN model


To define the structure of the model, we will be using the Keras Model from Functional API. It
will consist of three major parts:

 Feature Extractor – The feature extracted from the image has a size of 2048; with a dense
layer, we reduce the dimensions to 256 nodes.
 Sequence Processor – An embedding layer will handle the textual input, followed by the
LSTM layer.
 Decoder – By merging the output of the above two layers, we process it with a dense layer
to make the final prediction. The final layer contains a number of nodes equal to our
vocabulary size.
A visual representation of the final model is given in Figure 3.10.
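Under the assumptions above (a 2048-dimensional image feature, max_length of 32 and a
vocabulary of about 7577 + 1 words), the three-part model could be defined with the Keras
Functional API roughly as follows; the exact dropout values are illustrative.

from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, add
from tensorflow.keras.models import Model

def define_model(vocab_size, max_length):
    # Feature extractor: compress the 2048-d Xception vector to 256 nodes
    inputs1 = Input(shape=(2048,))
    fe1 = Dropout(0.5)(inputs1)
    fe2 = Dense(256, activation='relu')(fe1)

    # Sequence processor: embedding layer followed by an LSTM over the caption so far
    inputs2 = Input(shape=(max_length,))
    se1 = Embedding(vocab_size, 256, mask_zero=True)(inputs2)
    se2 = Dropout(0.5)(se1)
    se3 = LSTM(256)(se2)

    # Decoder: merge both branches and predict the next word over the vocabulary
    decoder1 = add([fe2, se3])
    decoder2 = Dense(256, activation='relu')(decoder1)
    outputs = Dense(vocab_size, activation='softmax')(decoder2)

    model = Model(inputs=[inputs1, inputs2], outputs=outputs)
    model.compile(loss='categorical_crossentropy', optimizer='adam')
    return model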

16
Training the model

To train the model, we will be using the 6000 training images by generating the input and output
sequences in batches and fitting them to the model using model.fit_generator() method. We also
save the model to our models folder. This will take some time depending on your system
capability.

Figure.3.10. Final Model
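One possible form of the training loop, assuming the define_model() and data_generator() sketches
above and the loaded train_descriptions and train_features dictionaries. Note that fit_generator()
is deprecated in recent TensorFlow releases, where model.fit() accepts a generator directly.

import os

# Illustrative training loop: one pass over all 6000 training images per epoch
epochs = 10
steps = len(train_descriptions)   # one image (all of its caption pairs) per step

model = define_model(vocab_size, max_length)
os.makedirs('models', exist_ok=True)

for i in range(epochs):
    generator = data_generator(train_descriptions, train_features, tokenizer, max_length, vocab_size)
    model.fit(generator, epochs=1, steps_per_epoch=steps, verbose=1)
    model.save(os.path.join('models', 'model_' + str(i) + '.h5'))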

17
About Models

The main aim of this project is to gain working knowledge of deep learning techniques. We mainly
use two techniques, CNN and LSTM, for image captioning.

So, to make our image caption generator model, we merge these two architectures. The combined
model is often referred to as a CNN-RNN (or CNN-LSTM) model.

 CNN is used for extracting features from the image. We will use the pre-trained model
Xception.

 LSTM will use the information from CNN to help generate a description of the image.

CONVOLUTIONAL NEURAL NETWORK

A Convolutional Neural Network (ConvNet/CNN) is a deep learning algorithm which can take
in an input image, assign importance (learnable weights and biases) to various aspects/objects in
the image and differentiate one from the other. The pre-processing required in a ConvNet is much
lower compared to other classification algorithms.

Convolutional Neural networks are specialized deep neural networks which can process the data
that has input shape like a 2D matrix. Images are easily represented as a 2D matrix and CNN is
very useful in working with images.

It scans images from left to right and top to bottom to pull out important features and combines
them to classify images. It can handle images that have been translated, rotated, scaled or viewed
from a changed perspective.

LONG SHORT TERM MEMORY


LSTM stands for Long Short-Term Memory. LSTMs are a type of RNN (recurrent neural network)
well suited to sequence prediction problems: based on the previous text, we can predict what the
next word will be. The LSTM has proven more effective than the traditional RNN by overcoming
the RNN's short-term memory limitation. An LSTM can carry relevant information throughout the
processing of the input sequence, and with a forget gate it discards non-relevant information.

18
LSTMs are designed to overcome the vanishing gradient problem and allow them to retain
information for longer periods compared to traditional RNNs. LSTMs can maintain a constant
error, which allows them to continue learning over numerous time-steps and backpropagate
through time and layers.

Figure. 3.11. Model, Image Caption Generator

LSTMs use gated cells to store information outside the regular flow of the RNN. With these
cells, the network can manipulate the information in many ways, including storing information
in the cells and reading from them. The cells are individually capable of making decisions
regarding the information and can execute these decisions by opening or closing the gates. The
ability to retain information for a long period of time gives LSTM the edge over traditional
RNNs in these tasks.

The chain-like architecture of LSTM allows it to contain information for longer time periods,
solving challenging tasks that traditional RNNs struggle to or simply cannot solve.

The three major parts of the LSTM include:

Forget gate—removes information that is no longer necessary for the completion of the task.
This step is essential to optimizing the performance of the network.

19
Figure.3.12 Forget Gate,Input Gate, Output Gate

Input gate—responsible for adding information to the cells

Output gate—selects and outputs necessary information
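For reference, the standard LSTM formulation of these three gates at time step $t$, with input
$x_t$, previous hidden state $h_{t-1}$, previous cell state $c_{t-1}$, sigmoid activation $\sigma$,
element-wise product $\odot$, and learned weights $W$, $U$ and biases $b$, is:

f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)                                 (forget gate)
i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)                                 (input gate)
o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)                                 (output gate)
c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_c x_t + U_c h_{t-1} + b_c)    (cell state update)
h_t = o_t \odot \tanh(c_t)                                                (hidden state)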

The CNN LSTM architecture involves using Convolutional Neural Network (CNN) layers for
feature extraction on input data combined with LSTMs to support sequence prediction. This
architecture was originally referred to as a Long-term Recurrent Convolutional Network or
LRCN model, although we will use the more generic name “CNN LSTM” to refer to LSTMs
that use a CNN as a front end in this report.

This architecture is used for the task of generating textual descriptions of images. Key is the use
of a CNN that is pre-trained on a challenging image classification task that is re-purposed as a
feature extractor for the caption generating problem.

20
SYSTEM DESIGN

This project requires a dataset which has both images and their captions. The dataset should be
large enough to train the image captioning model.

FLICKR8K DATASET

The Flickr8k dataset is a public benchmark dataset for image-to-sentence description. It consists
of about 8000 images with five captions for each image. These images were collected from
diverse groups on the Flickr website. Each caption provides a clear description of the entities and
events present in the image. The dataset depicts a variety of events and scenarios and does not
include images of well-known people and places, which makes the dataset more generic. The
dataset has 6000 images in the training set, 1000 images in the development set and 1000 images
in the test set. Features of the dataset that make it suitable for this project are:
 Multiple captions mapped to a single image make the model generic and help avoid
overfitting.
 A diverse category of training images allows the image captioning model to work for
multiple categories of images, making the model more robust.

IMAGE DATA PREPARATION

The images should be converted to suitable features so that they can be used to train a deep
learning model. Feature extraction is a mandatory step for training any image in a deep learning
model. The features are extracted using a Convolutional Neural Network (CNN), here the Visual
Geometry Group (VGG-16) model. This model achieved top results in the ImageNet Large Scale
Visual Recognition Challenge in 2014, classifying images into one of the 1000 classes given in
the challenge. Hence, this model is well suited for this project, as image captioning requires the
identification of objects in images.

In VGG-16, there are 16 weight layers in the network, and the deeper stack of layers helps in
better feature extraction from images. The VGG-16 network uses 3*3 convolutional layers,
making its architecture simple, and uses max pooling layers in between to reduce the spatial size
of the image. The last layer of the network, which predicts the classification, is removed, and the
internal representation of the image just before classification is returned as the feature. The
dimension of the input image should be 224*224, and this model extracts features of the image
and returns a 1-dimensional 4096-element vector.

Figure 3.13: Feature Extraction in images using VGG
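As an illustration of this step, a VGG-16 feature extractor could be set up as below, re-wiring the
network output to the 4096-dimensional second-to-last fully connected layer. This is a sketch only,
since the pipeline described earlier in this chapter uses Xception instead.

import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.preprocessing.image import load_img, img_to_array
from tensorflow.keras.models import Model

# Full VGG-16 with its classifier, then re-wire the output to the 4096-d fc2 layer
base = VGG16()
vgg_extractor = Model(inputs=base.inputs, outputs=base.layers[-2].output)

def vgg_features(image_path):
    # VGG-16 expects a 224x224x3 input
    image = load_img(image_path, target_size=(224, 224))
    image = np.expand_dims(img_to_array(image), axis=0)
    image = preprocess_input(image)
    return vgg_extractor.predict(image, verbose=0)   # shape (1, 4096)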

CAPTION DATA PREPARATION

Flickr8k dataset contains multiple descriptions described for a single image. In the data
preparation phase, each image id is taken as key and its corresponding captions are stored as
values in a dictionary.

DATA CLEANING

In order to make the text dataset work in machine learning or deep learning models, the raw text
should be converted to a usable format. The following text cleaning steps are performed before the
captions are used in the project:
 Removal of punctuation.
 Removal of numbers.
 Removal of single-character words.
 Conversion of uppercase to lowercase characters.
Stop words are not removed from the text data, as that would hinder the generation of
grammatically complete captions, which are needed for this project. Table 3.14 shows samples of
captions after data cleaning.

22
Original Captions                                    Captions after Data Cleaning

Two people are at the edge of a lake,                two people are at the edge of lake facing
facing the water and the city skyline.               the water and the city skyline

A little girl rides in a child 's swing.             little girl rides in child swing

Two boys posing in blue shirts and                   two boys posing in blue shirts and khaki
khaki shorts.                                        shorts

Table 3.14: Data cleaning of captions

23
Chapter 4

EXPERIMENTAL RESULTS

Snapshots

Fig 4.1 Image 1

24
Fig 4.2 Image 2
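The captions shown in these snapshots are produced by feeding the test image's feature vector and
the partial word sequence back into the trained model one step at a time. A greedy-decoding
sketch of what Testing_caption_generator.py might do, assuming the model, tokenizer and
max_length saved earlier (with captions wrapped in start/end tokens as described in Chapter 3), is
given below.

import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

def word_for_id(index, tokenizer):
    # Reverse lookup: index -> word
    for word, idx in tokenizer.word_index.items():
        if idx == index:
            return word
    return None

def generate_caption(model, tokenizer, photo_feature, max_length):
    # photo_feature: array of shape (1, 2048) from the CNN feature extractor
    # Start with the "start" token and greedily append the most probable next word
    in_text = 'start'
    for _ in range(max_length):
        sequence = tokenizer.texts_to_sequences([in_text])[0]
        sequence = pad_sequences([sequence], maxlen=max_length)
        yhat = model.predict([photo_feature, sequence], verbose=0)
        word = word_for_id(int(np.argmax(yhat)), tokenizer)
        if word is None:
            break
        in_text += ' ' + word
        if word == 'end':
            break
    return in_text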

Advantages

Quantitative Results:
Report the quantitative performance of your image captioning model using evaluation metrics such
as BLEU, METEOR, ROUGE, and CIDEr. Include results on both the validation and test sets, if
applicable.

Comparison with Baselines:


If you compared your model with baseline models or traditional approaches, present the
comparative results. Discuss any improvements achieved by your model.

Visualizations:
Include visualizations of sample images and their corresponding generated captions. Showcase
instances where the model performed well and highlight any challenges or errors.

25
Analysis of Model Performance:
Provide an in-depth analysis of your model's performance. Discuss cases where the model excelled
and situations where it struggled or failed.

Impact of Hyperparameter Tuning:


If you conducted hyperparameter tuning, discuss the impact on the model's performance. Highlight
how specific hyperparameters influenced the results.

Discussion of Ethical Considerations:


If relevant, discuss the ethical considerations related to bias in caption generation. Address any
steps taken to mitigate biases and ensure fairness.

Drawbacks
Ambiguity in Image Interpretation:
Image content can be subjective, and different individuals may interpret the same image differently.
Machine learning models may struggle with capturing the diverse interpretations of visual scenes.

Limited Context Understanding:


Image captioning models may have limitations in understanding broader contextual information or
complex relationships within a scene, leading to captions that lack nuance.

Over-reliance on Training Data:


Image captioning models heavily depend on the quality and representativeness of the training data.
Biases present in the data can be perpetuated in the model's outputs.

Rare Word Handling:


Generating captions with rare or uncommon words can be challenging for machine learning models,
potentially resulting in less diverse and more generic captions.

Handling of Unusual or Abstract Images:


Image captioning models may struggle with providing meaningful captions for unusual, abstract, or
surreal images that deviate from typical training data.
26
CHAPTER 5

CONCLUSION AND FUTURE SCOPE

In this chapter we draw the conclusions of our project and underline the limitations of our
methodology. There are huge possibilities in this field, as discussed in the future scope section of
this chapter.

CONCLUSION

In this report, we have reviewed deep learning-based image captioning methods. We have given a
taxonomy of image captioning techniques, shown a generic block diagram of the major groups and
highlighted their pros and cons. We discussed different evaluation metrics and datasets with their
strengths and weaknesses, gave a brief summary of experimental results and briefly outlined
potential research directions in this area. We used the Flickr_8k dataset, which includes nearly
8000 images, with the corresponding captions stored in a text file. Although deep learning-based
image captioning methods have achieved remarkable progress in recent years, a robust image
captioning method that is able to generate high-quality captions for nearly all images is yet to be
achieved. With the advent of novel deep learning network architectures, automatic image
captioning will remain an active research area for some time. The scope of image captioning is
very broad, as the number of users on social media increases day by day and most of them post
photos, so this project can help them to a great extent.

27
FUTURE SCOPE

Image captioning has become an important problem in recent days due to the exponential growth
of images on social media and the internet. This report discusses the various research on image
retrieval carried out in the past and highlights the techniques and methodologies used in that
research. As feature extraction and similarity calculation between images are challenging in this
domain, there is tremendous scope for possible research in the future. Current image retrieval
systems calculate similarity using features such as color, tags, histograms, etc. These
methodologies cannot give completely accurate results because they do not depend on the context
of the image. Hence, research on image retrieval that makes use of the context of images, such as
image captioning, can help solve this problem in the future. This project can be further enhanced
to improve the identification of classes with lower precision by training it with more image
captioning datasets. This methodology can also be combined with previous image retrieval
methods based on histograms, shapes, etc., to check whether the image retrieval results improve.

28
ANNEXURE 1

29
30
31
32
33
34
REFERENCES

[1] Patel, V., & Rao, B. S. Image Summarizer: Seeing through Machine using Deep Learning
Algorithm.

[2] Zhang, R., Wang, X., & Liu, Y. EECS442 Final Report.

[3] P. Mahalakshmi and N. S. Fatima, "Summarization of Text and Image Captioning in


Information Retrieval Using Deep Learning Techniques," in IEEE Access, vol. 10, pp. 18289-
18297, 2022, doi: 10.1109/ACCESS.2022.3150414.

[4] You, Q., Jin, H., Wang, Z., Fang, C., & Luo, J. (2016). Image captioning with semantic
attention. In Proceedings of the IEEE conference on computer vision and pattern
recognition (pp. 4651-4659).

[5] J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T.


Darrell. Long-term recurrent convolutional networks for visual recognition and description. In CVPR,
2015.

[6] Q. Wu, C. Shen, A. v. d. Hengel, L. Liu, and A. Dick. Image captioning with an intermediate
attributes layer. arXiv preprint arXiv:1506.01144, 2015.

[7] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator.
CVPR, 2015.

[8] R. Kiros, R. Salakhutdinov, and R. S. Zemel. Unifying visual-semantic embeddings with multimodal
neural language models. TACL, 2015.

[9] A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for generating image descriptions.
CVPR, 2015.

[9] Lisa Anne Hendricks, Subhashini Venugopalan, Marcus Rohrbach, Raymond Mooney, Kate
Saenko, Trevor Darrell; Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), 2016, pp. 1-10

[10] Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. 2015. Scheduled sampling
for sequence prediction with recurrent neural networks. In Advances in Neural Information
Processing Systems. 1171– 1179.

[11] H. Fang, S. Gupta, F. N. Iandola, R. Srivastava, L. Deng, P. Dollar, J. Gao, X. He, M.
Mitchell, J. C. Platt, C. L. Zitnick, and G. Zweig. From captions to visual concepts and back.
In CVPR, 2015.

35
[12] J. Mao, W. Xu, Y. Yang, J. Wang, Z. Huang, and A. L. Yuille. Learning like a child: Fast
novel visual concept learning from sentence descriptions of images. In ICCV, 2015.

[13] J. Thomason, S. Venugopalan, S. Guadarrama, K. Saenko, and R. J. Mooney. Integrating


language and vision to generate natural language descriptions of videos in the wild. In COLING,
2014.

[14] N. Krishnamoorthy, G. Malkarnenkar, R. J. Mooney, K. Saenko, and S. Guadarrama.
Generating natural-language video descriptions using text-mined knowledge. In AAAI, 2013.

[15] S. Guadarrama, N. Krishnamoorthy, G. Malkarnenkar, S. Venugopalan, R. Mooney, T.
Darrell, and K. Saenko. Youtube2text: Recognizing and describing arbitrary activities using
semantic hierarchies and zero-shot recognition. In ICCV, 2013.

[16] G. Kulkarni, V. Premraj, V. Ordonez, S. Dhar, S. Li, Y. Choi, A. C. Berg, and T. Berg.
Babytalk: Understanding and generating simple image descriptions. TPAMI, 2013.

[17] A. Gupta, P. Srinivasan, J. Shi, and L. Davis. Understanding videos, constructing plots:
Learning a visually grounded storyline model from annotated videos. In CVPR, 2009.

36
