Purdue University Graduate School
Browse
Srushti Vichare SOET final.pdf (1.73 MB)

PROBABILISTIC ENSEMBLE MACHINE LEARNING APPROACHES FOR UNSTRUCTURED TEXTUAL DATA CLASSIFICATION

Download (1.73 MB)
thesis
posted on 2024-04-26, 11:38 authored by Srushti Sandeep VichareSrushti Sandeep Vichare

The volume of big data has surged, notably in unstructured textual data, comprising emails, social media, and more. Currently, unstructured data represents over 80% of global data, the growth is propelled by digitalization. Unstructured text data analysis is crucial for various applications like social media sentiment analysis, customer feedback interpretation, and medical records classification. The complexity is due to the variability in language use, context sensitivity, and the nuanced meanings that are expressed in natural language. Traditional machine learning approaches, while effective in handling structured data, frequently fall short when applied to unstructured text data due to the complexities. Extracting value from this data requires advanced analytics and machine learning. Recognizing the challenges, we developed innovative ensemble approaches that combine the strengths of multiple conventional machine learning classifiers through a probabilistic approach. Response to the challenges , we developed two novel models: the Consensus-Based Integration Model (CBIM) and the Unified Predictive Averaging Model (UPAM).The CBIM and UPAM ensemble models were applied to Twitter (40,000 data samples) and the National Electronic Injury Surveillance System (NEISS) datasets (323,344 data samples) addressing various challenges in unstructured text analysis. The NEISS dataset achieved an unprecedented accuracy of 99.50%, demonstrating the effectiveness of ensemble models in extracting relevant features and making accurate predictions. The Twitter dataset, utilized for sentiment analysis, demonstrated a significant boost in accuracy over conventional approaches, achieving a maximum of 65.83%. The results highlighted the limitations of conventional machine learning approaches when dealing with complex, unstructured text data and the potential of ensemble models. The models exhibited high accuracy across various datasets and tasks, showcasing their versatility and effectiveness in obtaining valuable insights from unstructured text data. The results obtained extend the boundaries of text analysis and improve the field of natural language processing.

History

Degree Type

  • Master of Science

Department

  • Engineering Technology

Campus location

  • West Lafayette

Advisor/Supervisor/Committee Chair

Gaurav Nanda

Advisor/Supervisor/Committee co-chair

Raji Sundararajan

Additional Committee Member 2

Huachao Mao

Usage metrics

    Licence

    Exports

    RefWorks
    BibTeX
    Ref. manager
    Endnote
    DataCite
    NLM
    DC