Longitudinal Comparison of Word Associations in Shallow Word Embeddings

Bihani, Geetanjali

doi:10.25394/PGS.12272753.v1

thesis

Posted on 2020-05-08, 18:26; authored by Geetanjali Bihani

Word embeddings are used in many natural language processing (NLP) tasks. Although they help computers learn the linguistic patterns of natural language, word embeddings also tend to learn unwanted word associations. These associations affect the performance of NLP tasks because they propagate and amplify biases. Current methods for evaluating word associations in word embeddings do not account for changes in the embedding model or the training corpus when creating the evaluation rubric. The literature also lacks a consistent training and evaluation protocol for comparing word associations across different word embedding models and training corpora. To address this gap, this research evaluates different types of word associations, not limited to gender, racial, or religious attributes, and incorporates the diachronic and variable nature of words by using text data collected over a period of 200 years. This thesis introduces a framework to track changes in word associations between neutral words (proper nouns) and attributive words (adjectives) across word embedding models and over time. The framework evaluates clustering tendencies between these two word groups over five word embedding frameworks, Word2vec (CBOW), Word2vec (Skip-gram), GloVe, fastText (CBOW), and fastText (Skip-gram), and over 20 decades of text data from the 1810s to the 2000s. Finally, various cluster-level and corpus-level measurements are compared across these embedding frameworks to determine how word associations evolve with changes in the embedding model and the training corpus.
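
As a rough illustration of the idea of tracking a word association over time, the sketch below finds, for each decade, the attribute word closest to a target word by cosine similarity. It is not the thesis's framework, which evaluates clustering tendencies; nearest-attribute lookup is a simpler stand-in. The function names, the toy two-dimensional vectors, and the example words are all hypothetical and were chosen only so the example runs on its own.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

def strongest_attribute(vectors, target, attributes):
    """Return the attribute word most similar to the target word."""
    return max(attributes, key=lambda a: cosine(vectors[target], vectors[a]))

def track_association(models_by_decade, target, attributes):
    """Map each decade to the target word's closest attribute in that decade."""
    return {
        decade: strongest_attribute(vectors, target, attributes)
        for decade, vectors in models_by_decade.items()
    }

# Hypothetical toy embeddings, one small vocabulary per decade.
# Real embeddings would come from a model trained on each decade's corpus.
TOY_MODELS = {
    "1810s": {"london": [1.0, 0.1], "old": [0.9, 0.2], "modern": [0.1, 1.0]},
    "2000s": {"london": [0.2, 1.0], "old": [1.0, 0.0], "modern": [0.1, 0.9]},
}

shifts = track_association(TOY_MODELS, "london", ["old", "modern"])
for decade, attribute in shifts.items():
    print(decade, attribute)
```

With real data, each per-decade dictionary would be replaced by the vectors of a model trained on that decade's text, and the same comparison could be repeated for each embedding framework.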

Degree Type

Master of Science

Department

Computer and Information Technology

Campus location

West Lafayette

Advisor/Supervisor/Committee Chair

Dr. Julia T. Rayz

Additional Committee Member 2

Dr. Baijian Yang

Additional Committee Member 3

Dr. Ida B. Ngambeki

Additional Committee Member 4

Dr. J. Eric Dietz

Additional Committee Member 5

Dr. Lisa Bosman

Keywords

word embeddings, longitudinal corpora, evaluation, word clustering, Applied Computer Science, Natural Language Processing, Information Systems

Licence

CC BY 4.0
