Purdue University Graduate School
MengmengZhu_dissertation_for_deposit.pdf (2.76 MB)


posted on 2020-04-30, 12:41 authored by Mengmeng Zhu

Data science uses the latest techniques in statistics and machine learning to extract insights from data. With the increasing amount of protein data, a number of novel research approaches have become feasible.

Micropeptides are an emerging area of study in the protein universe. They are small proteins of <= 100 amino acid residues (aa), translated from small open reading frames (sORFs) of <= 303 base pairs (bp). Traditionally, their existence was overlooked because of the technical difficulty of isolating them. With technological advances, a growing number of micropeptides have been characterized and shown to play vital roles in many biological processes. Yet we lack bioinformatics methods for predicting them directly from DNA sequences, which could substantially facilitate research in this field at minimal cost. The increasing amount of data makes it possible to develop new methods that address this need. We therefore developed MiPepid, a machine-learning-based method specifically designed for predicting micropeptides from DNA sequences. We curated a high-quality dataset and trained MiPepid using logistic regression with 4-mer features. MiPepid performed exceptionally well on holdout test sets and performed much better than existing methods. MiPepid is available for download, is easy to use, and runs sufficiently fast.

Long noncoding RNAs (lncRNAs) are transcripts of > 200 bp that do not encode a protein. Contrary to their “noncoding” definition, an increasing number of lncRNAs have been found to be translated into functional micropeptides. Whether most lncRNAs are translated is therefore an open question of great significance. To address this question, we harnessed large-scale human variation data to explore the relationships between lncRNAs, micropeptides, and canonical proteins (> 100 aa) from the perspective of genetic variation, which has long been used to study natural selection and to infer functional relevance. Through rigorous statistical analyses, we found that lncRNAs share a similar genetic variation profile with proteins in terms of single nucleotide polymorphism (SNP) density, SNP spectrum, enrichment of rare SNPs, etc., suggesting that lncRNAs are under negative selection of similar strength to that acting on proteins. Our study revealed similarities between micropeptides, lncRNAs, and canonical proteins, and is the first attempt to explore the relationships between the three groups from a genetic variation perspective.
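Two of the quantities named above, SNP density and enrichment of rare SNPs, can be made concrete with a short sketch. This is only an illustration of common definitions, not the analysis performed in the dissertation: the per-kilobase normalization, the 0.005 minor-allele-frequency cutoff for "rare", and the function names are assumptions.

```python
def snp_density(snp_positions, region_length_bp):
    """Number of SNPs per kilobase in one region (assumed normalization)."""
    if region_length_bp <= 0:
        raise ValueError("region_length_bp must be positive")
    return 1000.0 * len(snp_positions) / region_length_bp


def rare_fraction(minor_allele_freqs, threshold=0.005):
    """Fraction of SNPs whose minor allele frequency is below the threshold.

    The default threshold is a hypothetical cutoff chosen for illustration.
    """
    if not minor_allele_freqs:
        return 0.0
    rare = sum(1 for maf in minor_allele_freqs if maf < threshold)
    return rare / len(minor_allele_freqs)


# Hypothetical toy values: 3 SNPs in a 1500 bp transcript.
density = snp_density([10, 50, 200], 1500)
rare = rare_fraction([0.001, 0.2, 0.004, 0.3])
```

Regions under stronger negative selection are generally expected to show lower SNP density and a larger share of rare variants, which is why such summaries can be compared across lncRNAs, micropeptides, and canonical proteins.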

Deep learning has been tremendously successful in 2D image recognition. Predicting the ligands that a protein binds is a fundamental topic in protein research, as most proteins bind ligands to function. Proteins are 3D structures and can be treated as 3D images, so predicting their binding ligands can be cast as a 3D image classification problem. In addition, a large amount of protein structure data is now available. We therefore used deep learning to predict protein binding ligands by designing a 3D convolutional neural network from scratch and by building a large 3D image dataset of protein structures. The trained model achieved an average F1 score of over 0.8 across 151 classes on the holdout test set and performed better than existing methods. In summary, we showed the feasibility of deploying deep learning in protein structure research.
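One way to treat a protein structure as a 3D image is to voxelize its atoms onto a regular grid, with one channel per element type, much as a 2D image has color channels. The sketch below shows that idea in plain Python. It is an assumption-laden illustration rather than the dissertation's pipeline: the grid size, voxel size, choice of channels, binary occupancy encoding, and the function name `voxelize` are all hypothetical.

```python
def voxelize(atoms, center, grid_size=8, voxel_size=2.0,
             channels=("C", "N", "O", "S")):
    """Map atoms onto a binary occupancy grid indexed as grid[channel][x][y][z].

    atoms:  iterable of (element, x, y, z), coordinates in angstroms.
    center: (x, y, z) point on which the cubic grid is centered.
    Atoms of other elements, or outside the grid, are ignored.
    """
    channel_index = {el: i for i, el in enumerate(channels)}
    grid = [[[[0.0] * grid_size for _ in range(grid_size)]
             for _ in range(grid_size)]
            for _ in channels]
    half = grid_size * voxel_size / 2.0  # half the edge length of the cube
    for element, x, y, z in atoms:
        c = channel_index.get(element)
        if c is None:
            continue
        # Shift so the cube's corner is the origin, then bin by voxel size.
        idx = [int((coord - mid + half) // voxel_size)
               for coord, mid in zip((x, y, z), center)]
        if all(0 <= i < grid_size for i in idx):
            grid[c][idx[0]][idx[1]][idx[2]] = 1.0
    return grid


# Hypothetical toy atoms around a binding site centered at the origin.
toy_atoms = [("C", 0.5, 0.5, 0.5), ("N", -3.0, 0.0, 7.9),
             ("O", 100.0, 0.0, 0.0), ("H", 0.0, 0.0, 0.0)]
grid = voxelize(toy_atoms, center=(0.0, 0.0, 0.0))
```

The resulting channels-by-depth-by-height-by-width array has the shape a 3D convolutional network expects as input, so ligand prediction becomes a multi-class classification over such grids.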

In conclusion, by exploring various edges of the protein universe from the perspective of data science, we showed that the increasing amount of data and the advancement of data science methods have made it possible to address a wide variety of pressing biological questions. We showed that, for a successful data science study, the three components (goal, data, and method) are all indispensable. We provided three successful data science studies: careful data cleaning and selection of a machine learning algorithm led to the development of MiPepid, which meets the urgent need for a micropeptide prediction method; identifying the question and exploring it from a different angle led to the key insight that lncRNAs resemble micropeptides; and applying deep learning to protein structure data led to a new approach to the long-standing question of protein-ligand binding. The three studies serve as examples of how to approach a wide range of data science problems, each with its own challenges.


Degree Type

  • Doctor of Philosophy


  • PULSe (Life Sciences)

Campus location

  • West Lafayette

Advisor/Supervisor/Committee Chair

Michael Gribskov

Additional Committee Member 2

Dan Goldwasser

Additional Committee Member 3

Faming Liang

Additional Committee Member 4

Cynthia Stauffacher