Purdue University Graduate School

ADDRESSING DATA IMBALANCE IN BREAST CANCER PREDICTION USING SUPERVISED MACHINE LEARNING

thesis
posted on 2022-07-28, 16:41 authored by Shuning Yin

Every 12 minutes, 12 women are diagnosed with breast cancer in the US, and 1 dies of it. Globally, a woman loses her life to breast cancer every 46 seconds, which amounts to more than 1,800 deaths every day. These figures make the prediction of breast cancer very important. To this end, supervised machine learning (ML) methods are used to predict breast cancer likelihood. However, because real-world data are imbalanced, with a very low proportion of positive cases, the prediction accuracy of ML models for positive cancer cases has been limited. This study took two steps to address the issue. First, four supervised ML models, Naïve Bayes (NB), Logistic Regression (LR), Support Vector Machine (SVM), and Multilayer Perceptron (MLP), were applied to the Breast Cancer Surveillance Consortium (BCSC) dataset using WEKA, the industry-standard software, to assess the impact of data imbalance on breast cancer prediction. Second, a balanced training dataset (24,558 cases, 12,279 in each of the positive and negative classes), an unbalanced training dataset (99,000 negative cases), and a non-overlapping testing dataset (11,000 cases) were manually built from the same BCSC data, and a decision support system was developed for two of the ML models, NB and LR, to tackle the class imbalance issue in breast cancer prediction. Overall, the results indicate that MLP had the best performance on positive breast cancer prediction, with 0.959 sensitivity and 0.907 positive predictive value (PPV), and that the balanced dataset yielded better results than the unbalanced dataset for all ML models. Furthermore, the proposed method improved the sensitivity of positive cancer case prediction from 0.687 to 0.936 using the NB model and from 0.358 to 0.8306 using the LR model.
The improvement demonstrated that the approach provided higher-confidence ML-based predictions and filtered out weaker ones, and that the technique could efficiently address the class imbalance issue in breast cancer likelihood prediction and be used in clinical practice.
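As an illustration only, the quantities named in the abstract can be sketched in a few lines of Python. This is not the thesis implementation, which used WEKA. The function names, the random undersampling strategy, and the 0.9 confidence threshold are all assumptions introduced here to show what sensitivity, PPV, a balanced training set, and filtering of weaker predictions might look like in code.

```python
import random


def sensitivity_and_ppv(y_true, y_pred):
    """Return (sensitivity, PPV) for binary labels where 1 means positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    ppv = tp / (tp + fp) if (tp + fp) else float("nan")
    return sensitivity, ppv


def undersample_to_balance(rows, label_of, seed=0):
    """Hypothetical balancing step: keep equally many rows from each class.

    Assumption: balance is reached by random undersampling; the thesis only
    states that the balanced dataset was built manually.
    """
    rng = random.Random(seed)
    pos = [r for r in rows if label_of(r) == 1]
    neg = [r for r in rows if label_of(r) == 0]
    n = min(len(pos), len(neg))
    balanced = rng.sample(pos, n) + rng.sample(neg, n)
    rng.shuffle(balanced)
    return balanced


def keep_confident(probabilities, threshold=0.9):
    """Hypothetical filter: indices whose predicted class has probability >= threshold.

    Each entry is the model's probability of the positive class, so the
    confidence in the predicted class is max(p, 1 - p).
    """
    return [i for i, p in enumerate(probabilities) if max(p, 1 - p) >= threshold]


if __name__ == "__main__":
    y_true = [1, 1, 1, 0, 0]
    y_pred = [1, 1, 0, 1, 0]
    print(sensitivity_and_ppv(y_true, y_pred))
    print(keep_confident([0.95, 0.6, 0.05, 0.5]))
```

Sensitivity is the fraction of actual positive cases that the model flags, and PPV is the fraction of flagged cases that are actually positive, which is why both matter when positive cases are rare.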

Funding

Collaborative research: Variation in the awarding and effectiveness of STEM graduate student funding across teaching and research assistantships, fellowships, and traineeships

Directorate for Education & Human Resources

History

Degree Type

  • Master of Science

Department

  • Engineering Technology

Campus location

  • West Lafayette

Advisor/Supervisor/Committee Chair

Raji Sundararajan

Advisor/Supervisor/Committee co-chair

Gaurav Nanda

Additional Committee Member 2

Ignacio Camarillo
