Longitudinal Comparison of Word Associations in Shallow Word Embeddings
thesisposted on 08.05.2020, 18:26 by Geetanjali BihaniGeetanjali Bihani
Word embeddings are utilized in various natural language processing tasks. Although effective in helping computers learn linguistic patterns employed in natural language, word embeddings also tend to learn unwanted word associations. This affects the performance of NLP tasks, as unwanted word associations propagate and amplify biases. Current word association evaluation methods for word embeddings do not account for changes in word embedding models and training corpora, when creating the rubric for word association evaluation. Current literature also lacks a consistent training and evaluation protocol for comparison of word associations across varying word embedding models and varying training corpora. In order to address this gap in prior literature, this research aims to evaluate different types of word associations, not limited to gender, racial or religious attributes, incorporating and evaluating the diachronic and variable nature of words over text data collected over a period of 200 years. This thesis introduces a framework to track changes in word associations between neutral words (proper nouns) and attributes (adjectives), across different word embedding models, over a temporal dimension, by evaluating clustering tendencies between neutral words (proper nouns) and attributive words (adjectives) over five different word embedding frameworks: Word2vec (CBOW), Word2vec (Skip-gram), GloVe, fastText (CBOW) and fastText (Skip-gram) and 20 decades of text data from 1810s to 2000s. Finally, various cluster level and corpus level measurements will be compared across aforementioned word embedding frameworks, to find how word associations evolve with changes in the embedding model and the training corpus.