Giant Pigeon and Small Person: Prompting Visually Grounded Models about the Size of Objects

Zhang, Yi

doi:10.25394/PGS.19633317.v1

Yi_Zhang_MS_Thesis_final.pdf (6.12 MB)

thesis

posted on 2022-04-22, 13:19 authored by Yi Zhang

Empowering machines to understand the physical world requires more than language-only or vision-only models. Vision and language is a growing field of study that attempts to bridge the gap between the natural language processing and computer vision communities by enabling models to learn visually grounded language. However, because a growing number of pre-trained visual linguistic models focus on aligning visual regions with natural language, it is difficult to claim that these models capture certain properties of objects, such as size, in their latent space. Inspired by recent trends in prompt learning, this study designs a prompt learning framework for two visual linguistic models, ViLBERT and ViLT, and uses different manually crafted prompt templates to evaluate how consistently these models perform when comparing the sizes of objects. The results show that ViLT is more consistent in prediction accuracy on this task across six pairs of objects and four prompt designs, that is, 24 cases per model. However, overall prediction accuracy on the object size comparison task is lower than expected: even the better model in this study, ViLT, outperforms the proposed random chance baseline in only 16 of 24 cases. Because this is a preliminary study of the potential of pre-trained visual linguistic models for object size comparison, there are many directions for future work, such as investigating more models, choosing more object pairs, and trying different methods for feature engineering and prompt engineering.
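The evaluation grid described in the abstract (six object pairs crossed with four prompt designs, giving 24 cases per model) can be sketched roughly as follows. This is only an illustration: the object pairs, the template wording, the `[MASK]` placeholder, and the helper names `build_cases` and `count_above_baseline` are all assumptions made for this sketch and are not taken from the thesis, which does not list its pairs or templates on this page.

```python
from itertools import product

# Hypothetical object pairs, for illustration only; the abstract does not
# list the six pairs actually used in the thesis.
OBJECT_PAIRS = [
    ("pigeon", "person"),
    ("cat", "car"),
    ("cup", "table"),
    ("dog", "horse"),
    ("apple", "bus"),
    ("book", "house"),
]

# Hypothetical manually crafted templates; the real prompt wording is assumed.
TEMPLATES = [
    "the {a} is [MASK] than the {b}",
    "is the {a} bigger than the {b}? [MASK]",
    "compared with the {b}, the {a} is [MASK]",
    "the size of the {a} is [MASK] than the size of the {b}",
]


def build_cases(pairs, templates):
    """Cross every object pair with every template to get one case each."""
    return [
        {"pair": (a, b), "template": t, "prompt": t.format(a=a, b=b)}
        for (a, b), t in product(pairs, templates)
    ]


def count_above_baseline(accuracies, baseline):
    """Count the cases whose accuracy is strictly above the baseline."""
    return sum(1 for acc in accuracies if acc > baseline)


cases = build_cases(OBJECT_PAIRS, TEMPLATES)
print(len(cases))          # 6 pairs x 4 templates = 24 cases
print(cases[0]["prompt"])  # the pigeon is [MASK] than the person
```

Under this framing, a statement such as "16 out of 24 cases better than the baseline" corresponds to calling `count_above_baseline` on the 24 per-case accuracies of one model; the baseline value itself is defined in the thesis and is not assumed here.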

Degree Type

Master of Science

Department

Computer and Information Technology

Campus location

West Lafayette

Advisor/Supervisor/Committee Chair

Julia Taylor Rayz

Additional Committee Member 2

Baijian Yang

Additional Committee Member 3

Jin Wei-Kocsis

Keywords

Natural Language Processing; Computer Vision; Prompt Learning; Visual Linguistic Task; Knowledge Representation and Machine Learning

Licence

CC BY 4.0
