Purdue University Graduate School

IMAGE CAPTIONING USING TRANSFORMER ARCHITECTURE

thesis
posted on 2022-12-06, 02:38 authored by Wrucha A Nanal

  

The domain of deep learning concerned with generating textual descriptions of images is called ‘Image Captioning.’ The central idea behind image captioning is to identify the key features of an image and compose meaningful sentences that describe it. Currently popular approaches include Convolutional Neural Network - Long Short-Term Memory (CNN-LSTM) models and attention-based models. This research first identifies the drawbacks of existing image captioning models, namely sequential execution, the vanishing gradient problem, and a lack of context during training.

This work aims to resolve these problems by creating a Contextually Aware Image Captioning (CATIC) model. The Transformer architecture, which addresses the issues of vanishing gradients and sequential execution, forms the basis of the proposed model. To inject contextualized embeddings of the caption sentences, this work uses Bidirectional Encoder Representations from Transformers (BERT). This work uses the Remote Sensing Image Captioning Dataset. The results of the CATIC model are evaluated using BLEU, METEOR and ROUGE scores. The proposed model outperforms the CNN-LSTM model on all metrics. Compared with the attention-based model, the CATIC model scores higher on BLEU2 and ROUGE and gives competitive results on the other metrics.


Degree Type

  • Master of Science

Department

  • Computer Science

Campus location

  • Fort Wayne

Advisor/Supervisor/Committee Chair

Mohommadreza Hajiarbabi

Additional Committee Member 2

Jin Soung Yoo

Additional Committee Member 3

Venkata Inokollu
