AN ANALYSIS ON SHORT-FORM TEXT AND DERIVED ENGAGEMENT
Short text has historically proven challenging to work with in many Natural Language
Processing (NLP) applications. Traditional tasks such as authorship attribution benefit
from longer writing samples from which to derive features. Even newer tasks, such as
synthetic text detection, struggle to distinguish between authentic and synthetic text
when samples are short. Because of the widespread use of social media and the proliferation of freely
available Large Language Models (LLMs), such as the GPT series from OpenAI and Bard
from Google, there has been a deluge of short-form text on the internet in recent years.
Short-form text has become, or has remained, a staple in domains as varied as
schoolwork, entertainment, social media, and academia. This thesis analyzes such
short text through the lens of NLP tasks including synthetic text detection, LLM authorship
attribution, derived engagement, and predicted engagement. The first focus explores the binary
detection task of determining whether a tweet is synthetically generated and proposes a novel
feature extraction technique that improves classifier results. The second focus examines the
challenges that short-form text poses for authorship attribution, along with a range of related
difficulties, and presents a potential workaround for those issues. The final focus attempts to
predict social media engagement from NLP representations of comments, yielding new insight into
the social media environment and the multitude of additional factors that engagement prediction
requires.
Degree Type
- Doctor of Philosophy
Department
- Computer Science
Campus location
- West Lafayette