COMPARING PSO-BASED CLUSTERING OVER CONTEXTUAL VECTOR EMBEDDINGS TO MODERN TOPIC MODELING

Miles, Samuel Jacob

doi:10.25394/PGS.19658712.v1

thesis

posted on 2022-07-12, 17:59 authored by Samuel Jacob Miles

Efficient topic modeling is needed to support applications that aim to identify the main themes in a collection of documents. In this thesis, a reduced vector embedding representation and particle swarm optimization (PSO) are combined to develop a topic modeling strategy that can identify representative themes in a large collection of documents. Documents are encoded using a reduced, contextual vector embedding from a general-purpose pre-trained language model (sBERT). A modified PSO algorithm (pPSO) that tracks particle fitness on a dimension-by-dimension basis is then applied to these embeddings to create clusters of related documents. The proposed methodology is demonstrated on three datasets across different domains. The first dataset consists of posts from the online health forum r/Cancer. The second dataset is a collection of NY Times abstracts and is used to compare the proposed model to LDA. The third is a standard benchmark dataset for topic modeling that consists of messages posted to 20 different newsgroups. It is used to compare pPSO to state-of-the-art generative document models (i.e., ETM and NVDM). The results show that pPSO is able to produce interpretable clusters. Moreover, pPSO captures both common and emergent topics. The topic coherence of pPSO is comparable to that of ETM, and its topic diversity is comparable to that of NVDM. The assignment parity of pPSO on a document completion task exceeded 90% for the 20News-Groups dataset. This rate drops to approximately 30% when pPSO is applied to the same Skip-Gram embedding, derived from a limited, corpus-specific vocabulary, that is used by ETM and NVDM.
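To make the clustering step more concrete, the following is a minimal sketch of standard global-best PSO clustering over fixed-length vectors. It is not the thesis's pPSO: the dimension-by-dimension fitness tracking is not reproduced, and small synthetic vectors stand in for the reduced sBERT embeddings. The function names (`quantization_error`, `pso_cluster`), the fitness function, and all parameter values are assumptions chosen for illustration.

```python
import math
import random

def quantization_error(centroids, points):
    """Mean distance from each point to its nearest centroid (lower is better)."""
    return sum(min(math.dist(p, c) for c in centroids) for p in points) / len(points)

def pso_cluster(points, k, n_particles=15, n_iter=40, w=0.72, c1=1.49, c2=1.49, seed=0):
    """Plain global-best PSO; each particle encodes one candidate set of k centroids."""
    rng = random.Random(seed)
    dim = len(points[0])
    # Seed every particle's centroids from randomly chosen data points.
    pos = [[list(rng.choice(points)) for _ in range(k)] for _ in range(n_particles)]
    vel = [[[0.0] * dim for _ in range(k)] for _ in range(n_particles)]
    pbest = [[c[:] for c in p] for p in pos]
    pbest_fit = [quantization_error(p, points) for p in pos]
    g = min(range(n_particles), key=pbest_fit.__getitem__)
    gbest, gbest_fit = [c[:] for c in pbest[g]], pbest_fit[g]

    for _ in range(n_iter):
        for i in range(n_particles):
            for j in range(k):
                for d in range(dim):
                    r1, r2 = rng.random(), rng.random()
                    # Inertia + pull toward personal best + pull toward global best.
                    vel[i][j][d] = (w * vel[i][j][d]
                                    + c1 * r1 * (pbest[i][j][d] - pos[i][j][d])
                                    + c2 * r2 * (gbest[j][d] - pos[i][j][d]))
                    pos[i][j][d] += vel[i][j][d]
            fit = quantization_error(pos[i], points)
            if fit < pbest_fit[i]:
                pbest[i], pbest_fit[i] = [c[:] for c in pos[i]], fit
                if fit < gbest_fit:
                    gbest, gbest_fit = [c[:] for c in pos[i]], fit

    # Assign each point to the nearest centroid of the best particle found.
    labels = [min(range(k), key=lambda j: math.dist(p, gbest[j])) for p in points]
    return gbest, labels, gbest_fit

# Synthetic stand-in for document embeddings: three noisy groups in 3 dimensions.
demo_rng = random.Random(1)
centers = [(0.0, 0.0, 0.0), (5.0, 5.0, 5.0), (-5.0, 5.0, 0.0)]
points = [[c + demo_rng.gauss(0, 0.5) for c in center]
          for center in centers for _ in range(20)]
centroids, labels, err = pso_cluster(points, k=3)
```

In this sketch a whole particle is scored by a single fitness value. The abstract describes pPSO as tracking fitness per dimension instead, so the update rule above should be read only as the baseline that such a modification starts from.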

Degree Type

Master of Science in Electrical and Computer Engineering

Department

Electrical and Computer Engineering

Campus location

Indianapolis

Advisor/Supervisor/Committee Chair

Zina Ben Miled

Additional Committee Member 2

Paul Salama

Additional Committee Member 3

Mohamed El-Sharkawy

Keywords

Particle Swarm Optimization Algorithm; Topic Modelling; Vector Embedding; Natural Language Processing; Computer Engineering

Licence

CC BY 4.0
