Purdue University Graduate School

Assessing the Accuracy of Protein Function Prediction using Machine Learning

thesis
posted on 2025-06-19, 15:51 authored by Emilia Tugolukova

Proteins are essential to the survival of all living organisms, and protein research contributes to the development of new medicines and vaccines, among other applications. Traditionally, protein functions have been annotated experimentally in laboratories, which can require significant time and resources. Only about 400,000 of the roughly 250 million sequences in UniProt have experimental, protein-level annotations. Computational annotation methods aim to close this growing gap by predicting protein functions. Protein Function Prediction (PFP) is a rule-based method developed in 2006 by Kihara and Hawkins. In the function prediction category of the 2006 CASP7 challenge, PFP outperformed the other teams and the organizer's consensus sets. PFP ranks the Gene Ontology (GO) term predictions for each protein by their raw scores, which represent the strength and frequency with which a GO term appears in homologous sequences. While raw scores generally correlate with annotation accuracy, many incorrect GO terms receive high scores, suggesting that a different measure of GO term accuracy is required. To address this, multiple machine learning regression models, including linear, decision tree-based, and neural network models, were trained to predict the Lin score, which serves as a substitute for accuracy. When the combined predictions from the five-fold cross-validation datasets were evaluated, no model clearly outperformed the rest, with most models scoring around weighted MSE ≈ 7.5% and weighted R² ≈ 8.5%. Replacing MSE with a weighted MSE as the objective function in the neural network models significantly improved the combined performance metrics (weighted MSE = 6.17%, weighted R² = 25.9%). Assuming that future experiments can increase the models' performance, it remains essential to determine and apply the individual accuracy of GO term predictions to assist protein functional annotation efforts.
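The abstract reports weighted MSE and weighted R² but does not say how the weights were chosen. As a minimal sketch, assuming each GO term prediction carries a non-negative sample weight, the two metrics could be computed as follows. The function names and the example weights are hypothetical and are not taken from the thesis.

```python
def weighted_mse(y_true, y_pred, weights):
    """Weighted mean squared error: sum(w * (t - p)^2) / sum(w)."""
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    squared_error = sum(
        w * (t - p) ** 2 for t, p, w in zip(y_true, y_pred, weights)
    )
    return squared_error / total_weight


def weighted_r2(y_true, y_pred, weights):
    """Weighted R^2: 1 - SS_res / SS_tot, both computed with the weights."""
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    # Weighted mean of the targets (here, the Lin scores being predicted).
    mean_true = sum(w * t for t, w in zip(y_true, weights)) / total_weight
    ss_res = sum(w * (t - p) ** 2 for t, p, w in zip(y_true, y_pred, weights))
    ss_tot = sum(w * (t - mean_true) ** 2 for t, w in zip(y_true, weights))
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all targets are identical")
    return 1.0 - ss_res / ss_tot


# Hypothetical example: the second prediction is weighted three times
# as heavily as the first (weights are an assumption for illustration).
lin_true = [0.0, 1.0]
lin_pred = [0.0, 0.0]
sample_weights = [1.0, 3.0]
print(weighted_mse(lin_true, lin_pred, sample_weights))
print(weighted_r2(lin_true, lin_pred, sample_weights))
```

With equal weights, both functions reduce to the ordinary MSE and R². Used as a training objective, a weighted MSE penalizes errors on heavily weighted examples more strongly, which is one plausible reason it could change the resulting metrics.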

Funding

Knox Fellowship

Degree Type

  • Master of Science

Department

  • Biological Sciences

Campus location

  • West Lafayette

Advisor/Supervisor/Committee Chair

Daisuke Kihara

Additional Committee Member 2

Jennifer H. Wisecaver

Additional Committee Member 3

Daoguo Zhou
