Purdue University Graduate School
Browse

Capturing Style Through Large Language Models - An Authorship Perspective

Download (1.97 MB)
thesis
posted on 2024-12-10, 16:20 authored by Anuj DubeyAnuj Dubey

This research investigates the use of Large Language Model (LLM) embeddings to capture the unique stylistic features of authors in Authorship Attribution (AA) tasks. Specifically, the focus of this research is on evaluating whether LLM-generated embeddings can effectively capture stylistic nuances that distinguish different authors, ultimately assessing their utility in tasks such as authorship attribution and clustering.The dataset comprises news articles from The Guardian authored by multiple writers, and embeddings were generated using OpenAI's text-embedding-ada-002 model. These embeddings were subsequently passed through a Siamese network with the objective of determining whether pairs of texts were authored by the same individual. The resulting model was used to generate style embeddings for unseen articles, which were then evaluated through classification and cluster analysis to assess their effectiveness in identifying individual authors across varying text samples. The classification task tested the model's accuracy in distinguishing authors, while the clustering analysis examined whether style embeddings primarily captured authorial identity or reflected domain-specific topics.

Our findings demonstrate that the proposed architecture achieves high accuracy for authors not previously encountered, outperforming traditional stylometric features and highlighting the effectiveness of LLM-based style embeddings. Additionally, our experiments reveal that authorship attribution accuracy decreases as the number of authors increases, yet improves with longer text lengths.


History

Degree Type

  • Master of Science

Department

  • Computer and Information Technology

Campus location

  • West Lafayette

Advisor/Supervisor/Committee Chair

Julia Rayz

Additional Committee Member 2

Romila Pradhan

Additional Committee Member 3

Tatiana Ringenberg

Usage metrics

    Licence

    Exports

    RefWorks
    BibTeX
    Ref. manager
    Endnote
    DataCite
    NLM
    DC