Purdue University Graduate School
Browse

File(s) under embargo

Reason: Pending publications.

5

month(s)

30

day(s)

until file(s) become available

Large Language Models for Unsupervised Keyphrase Extraction and Biomedical Data Analytics

thesis
posted on 2024-09-03, 12:47 authored by Haoran DingHaoran Ding

Natural Language Processing (NLP), a vital branch of artificial intelligence, is designed to equip computers with the ability to comprehend and manipulate human language, facilitating the extraction and utilization of textual data. NLP plays a crucial role in harnessing the vast quantities of textual data generated daily, facilitating meaningful information extraction. Among the various techniques, keyphrase extraction stands out due to its ability to distill concise information from extensive texts, making it invaluable for summarizing and navigating content efficiently. The process of keyphrase extraction usually begins by generating candidates first and then ranking them to identify the most relevant phrases. Keyphrase extraction can be categorized into supervised and unsupervised approaches. Supervised methods typically achieve higher accuracy as they are trained on labeled data, which allows them to effectively capture and utilize patterns recognized during training. However, the dependency on extensive, well-annotated datasets limits their applicability in scenarios where such data is scarce or costly to obtain. On the other hand, unsupervised methods, while free from the constraints of labeled data, face challenges in capturing deep semantic relationships within text, which can impact their effectiveness. Despite these challenges, unsupervised keyphrase extraction holds significant promise due to its scalability and lower barriers to entry, as it does not require labeled datasets. This approach is increasingly favored for its potential to aid in building extensive knowledge bases from unstructured data, which can be particularly useful in domains where acquiring labeled data is impractical. As a result, unsupervised keyphrase extraction is not only a valuable tool for information retrieval but also a pivotal technology for the ongoing expansion of knowledge-driven applications in NLP.

In this dissertation, we introduce three innovative unsupervised keyphrase extraction methods: AttentionRank, AGRank, and LLMRank. Additionally, we present a method for constructing knowledge graphs from unsupervised keyphrase extraction, leveraging the self-attention mechanism. The first study discusses the AttentionRank model, which utilizes a pre-trained language model to derive underlying importance rankings of candidate phrases through self-attention. This model employs a cross-attention mechanism to assess the semantic relevance between each candidate phrase and the document, enhancing the phrase ranking process. AGRank, detailed in the second study, is a sophisticated graph-based framework that merges deep learning techniques with graph theory. It constructs a candidate phrase graph using mutual attentions from a pre-trained language model. Both global document information and local phrase details are incorporated as enhanced nodes within the graph, and a graph algorithm is applied to rank the candidate phrases. The third study, LLMRank, leverages the strengths of large language models (LLMs) and graph algorithms. It employs LLMs to generate keyphrase candidates and then integrates global information through the text's graphical structures. This process reranks the candidates, significantly improving keyphrase extraction performance. The fourth study explores how self-attention mechanisms can be used to extract keyphrases from medical literature and generate query-related phrase graphs, improving text retrieval visualization. The mutual attentions of medical entities, extracted using a pre-trained model, form the basis of the knowledge graph. This, coupled with a specialized retrieval algorithm, allows for the visualization of long-range connections between medical entities while simultaneously displaying the supporting literature. In summary, our exploration of unsupervised keyphrase extraction and biomedical data analysis introduces novel methods and insights in NLP, particularly in information extraction. These contributions are crucial for the efficient processing of large text datasets and suggest avenues for future research and applications.

History

Degree Type

  • Doctor of Philosophy

Department

  • Electrical and Computer Engineering

Campus location

  • Indianapolis

Advisor/Supervisor/Committee Chair

Xiao Luo

Advisor/Supervisor/Committee co-chair

Brian King

Additional Committee Member 2

Qingxue Zhang

Additional Committee Member 3

Lingxi Li

Usage metrics

    Licence

    Exports

    RefWorks
    BibTeX
    Ref. manager
    Endnote
    DataCite
    NLM
    DC