Giant Pigeon and Small Person: Prompting Visually Grounded Models about the Size of Objects
Empowering machines to understand our physical world should go beyond models with only natural language and models with only vision. Vision and language is a growing field of study that attempts to bridge the gap between natural language processing and computer vision communities by enabling models to learn visually grounded language. However, as an increasing number of pre-trained visual linguistic models focus on the alignment between visual regions and natural language, it is difficult to claim that these models capture certain properties of objects in their latent space, such as size. Inspired by recent trends in prompt learning, this study will design a prompt learning framework for two visual linguistic models, ViLBERT and ViLT, and use different manually crafted prompt templates to evaluate the consistency of performance of these models in comparing the size of objects. The results of this study showed that ViLT is more consistent in prediction accuracy for the given task with six pairs of objects under four prompt designs. However, the overall prediction accuracy is lower than the expectation on this object size comparison task; even the better model in this study, ViLT, has only 16 out of 24 cases better than the proposed random chance baseline. As this study is a preliminary study to explore the potential of pre-trained visual linguistic models on object size comparison, there are many directions for future work, such as investigating more models, choosing more object pairs, and trying different methods for feature engineering and prompt engineering.