PROBABILISTIC ENSEMBLE MACHINE LEARNING APPROACHES FOR UNSTRUCTURED TEXTUAL DATA CLASSIFICATION

Vichare, Srushti Sandeep

doi:10.25394/PGS.25669425.v1

Srushti Vichare SOET final.pdf (1.73 MB)

PROBABILISTIC ENSEMBLE MACHINE LEARNING APPROACHES FOR UNSTRUCTURED TEXTUAL DATA CLASSIFICATION

thesis

posted on 2024-04-26, 11:38 authored by Srushti Sandeep VichareSrushti Sandeep Vichare

The volume of big data has surged, notably in unstructured textual data, comprising emails, social media, and more. Currently, unstructured data represents over 80% of global data, the growth is propelled by digitalization. Unstructured text data analysis is crucial for various applications like social media sentiment analysis, customer feedback interpretation, and medical records classification. The complexity is due to the variability in language use, context sensitivity, and the nuanced meanings that are expressed in natural language. Traditional machine learning approaches, while effective in handling structured data, frequently fall short when applied to unstructured text data due to the complexities. Extracting value from this data requires advanced analytics and machine learning. Recognizing the challenges, we developed innovative ensemble approaches that combine the strengths of multiple conventional machine learning classifiers through a probabilistic approach. Response to the challenges , we developed two novel models: the Consensus-Based Integration Model (CBIM) and the Unified Predictive Averaging Model (UPAM).The CBIM and UPAM ensemble models were applied to Twitter (40,000 data samples) and the National Electronic Injury Surveillance System (NEISS) datasets (323,344 data samples) addressing various challenges in unstructured text analysis. The NEISS dataset achieved an unprecedented accuracy of 99.50%, demonstrating the effectiveness of ensemble models in extracting relevant features and making accurate predictions. The Twitter dataset, utilized for sentiment analysis, demonstrated a significant boost in accuracy over conventional approaches, achieving a maximum of 65.83%. The results highlighted the limitations of conventional machine learning approaches when dealing with complex, unstructured text data and the potential of ensemble models. The models exhibited high accuracy across various datasets and tasks, showcasing their versatility and effectiveness in obtaining valuable insights from unstructured text data. The results obtained extend the boundaries of text analysis and improve the field of natural language processing.

History

Degree Type

Master of Science

Department

Engineering Technology

Campus location

West Lafayette

Advisor/Supervisor/Committee Chair

Gaurav Nanda

Advisor/Supervisor/Committee co-chair

Raji Sundararajan

Additional Committee Member 2

Huachao Mao

Usage metrics

Keywords

probablistic approach Machine Learning Ensemble Method ensemble approaches unstructured text data Injury classification sentiment analyis

Licence

CC BY 4.0

Exports

RefWorks

BibTeX

Ref. manager

Endnote

DataCite

NLM

DC

PROBABILISTIC ENSEMBLE MACHINE LEARNING APPROACHES FOR UNSTRUCTURED TEXTUAL DATA CLASSIFICATION

History

Degree Type

Department

Campus location

Advisor/Supervisor/Committee Chair

Advisor/Supervisor/Committee co-chair

Additional Committee Member 2

Usage metrics

Categories

Keywords

Licence

Exports