IMAGE CAPTIONING USING TRANSFORMER ARCHITECTURE
The domain of Deep Learning concerned with generating textual descriptions of images is called ‘Image Captioning.’ The central idea behind Image Captioning is to identify the key features of an image and compose meaningful sentences that describe it. Popular current approaches include Convolutional Neural Network - Long Short-Term Memory (CNN-LSTM) based models and Attention based models. This research work first identifies the drawbacks of existing image captioning models, namely their sequential style of execution, the vanishing gradient problem, and the lack of context during training.
This work aims to resolve these problems by creating a Contextually Aware Image Captioning (CATIC) model. The Transformer architecture, which addresses the issues of vanishing gradients and sequential execution, forms the basis of the proposed model. To inject contextualized embeddings of the caption sentences, this work uses Bidirectional Encoder Representations from Transformers (BERT). This work uses the Remote Sensing Image Captioning Dataset. The results of the CATIC model are evaluated using BLEU, METEOR and ROUGE scores. On comparison, the proposed model outperforms the CNN-LSTM model on all metrics. When compared with the Attention based model, the CATIC model outperforms on the BLEU-2 and ROUGE metrics and gives competitive results on the others.
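To illustrate the n-gram overlap idea behind the BLEU scores mentioned above, the following is a minimal sketch in pure Python of clipped (modified) n-gram precision, the core quantity BLEU aggregates; it omits the brevity penalty and multi-reference handling, and the function names and example captions are hypothetical:

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the list of n-grams in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, reference, n):
    """Clipped n-gram precision: each candidate n-gram count is capped
    by its count in the reference caption."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
    total = sum(cand_counts.values())
    return clipped / total if total else 0.0

# Toy example: a generated caption vs. a reference caption
cand = "a plane parked on the runway".split()
ref = "a plane is parked on the runway".split()
p1 = modified_precision(cand, ref, 1)  # unigram precision -> 1.0
p2 = modified_precision(cand, ref, 2)  # bigram precision -> 0.8
```

BLEU-2, the metric on which CATIC outperforms the attention-based baseline, is the geometric mean of these unigram and bigram precisions multiplied by a brevity penalty; production evaluations typically use a library implementation such as NLTK's `corpus_bleu` rather than a hand-rolled version.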
- Master of Science
- Computer Science
- Fort Wayne