In this project, we create a neural network architecture that automatically generates captions from images. We use the MS COCO dataset for the captioning task.
The model consists of two parts: an encoder and a decoder.
The encoder that we provide to you uses the pre-trained ResNet-50 architecture (with the final fully-connected layer removed) to extract features from a batch of pre-processed images. The output is then flattened to a vector and passed through a Linear layer, which transforms the feature vector to the same size as the word embedding.
The decoder has three layers: an embedding layer, an LSTM layer, and a linear layer. The embedding layer converts each word of the caption into a vector of the same size as the encoder output features, the LSTM consumes the resulting sequence, and the LSTM output is passed to the linear layer. We train the LSTM with teacher forcing: at t = 1 the input is the feature vector from the encoder, and at t = 2, 3, 4 and so on the input is the corresponding word from the ground-truth caption.
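A minimal PyTorch sketch of such a decoder with teacher forcing is shown below. The class name `DecoderRNN`, the sizes, and the random data are assumptions made for illustration. The sketch also assumes that the encoder features are used directly as the first LSTM input (they already have the embedding size), that the last caption token is dropped from the inputs so inputs and targets have equal length, and that the linear layer outputs one score per vocabulary word trained with cross-entropy loss.

```python
import torch
import torch.nn as nn


class DecoderRNN(nn.Module):
    """Hypothetical decoder: embedding layer -> LSTM -> linear layer."""

    def __init__(self, embed_size, hidden_size, vocab_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.linear = nn.Linear(hidden_size, vocab_size)

    def forward(self, features, captions):
        # Teacher forcing: the inputs after t = 1 are ground-truth words,
        # not the model's own predictions. The last word is never an input.
        embeddings = self.embed(captions[:, :-1])          # (batch, T-1, embed)
        # t = 1 uses the image features; t >= 2 uses the embedded words.
        inputs = torch.cat([features.unsqueeze(1), embeddings], dim=1)
        hidden, _ = self.lstm(inputs)                      # (batch, T, hidden)
        return self.linear(hidden)                         # (batch, T, vocab)


# Usage with random stand-ins for encoder features and tokenized captions.
torch.manual_seed(0)
vocab_size = 1000
decoder = DecoderRNN(embed_size=256, hidden_size=512, vocab_size=vocab_size)
features = torch.randn(2, 256)                    # from the encoder
captions = torch.randint(0, vocab_size, (2, 12))  # ground-truth word indices
scores = decoder(features, captions)

# The prediction at step t is compared with the ground-truth word at step t.
loss = nn.CrossEntropyLoss()(scores.reshape(-1, vocab_size), captions.reshape(-1))
print(scores.shape, loss.item())
```

With this alignment, the step that receives the image features is trained to predict the first caption word, and each later step is trained to predict the word that follows its input.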