Purdue University Graduate School
AN ANALYSIS ON SHORT-FORM TEXT AND DERIVED ENGAGEMENT

thesis
posted on 2024-07-22, 04:39 authored by Ryan J Schwarz

Short text has historically proven challenging to work with in many Natural Language
Processing (NLP) applications. Traditional tasks such as authorship attribution benefit
from having longer samples of work to derive features from. Even methods for newer tasks,
such as synthetic text detection, struggle to distinguish between authentic and synthetic
text in short form. Due to the widespread usage of social media and the proliferation of freely
available Large Language Models (LLMs), such as the GPT series from OpenAI and Bard
from Google, there has been a deluge of short-form text on the internet in recent years.
Short-form text has either become or remained a staple in several ubiquitous areas such as
schoolwork, entertainment, social media, and academia. This thesis seeks to analyze this
short text through the lens of NLP tasks such as synthetic text detection, LLM authorship
attribution, derived engagement, and predicted engagement. The first focus explores the
binary detection task of determining whether tweets are synthetically generated and
proposes a novel feature extraction technique to improve classifier results. The
second focus further explores the challenges that short-form text presents for determining
authorship, along with a range of related difficulties, and presents a potential workaround
for those issues. The final focus attempts to predict social media engagement based on the
NLP representations of comments, and yields new insight into the social media
environment and the many additional factors required for engagement prediction.

Degree Type

  • Doctor of Philosophy

Department

  • Computer Science

Campus location

  • West Lafayette

Advisor/Supervisor/Committee Chair

Clifton W. Bingham

Advisor/Supervisor/Committee co-chair

Edward J. Delp

Additional Committee Member 2

Dan Goldwasser

Additional Committee Member 3

Lin Tan
