Statistical Learning of Proteomics Data and Global Testing for Data with Correlations

Chen, Donglai

doi:10.25394/PGS.7776230.v1

Donglai_thesis_190225.pdf (1.33 MB)

Statistical Learning of Proteomics Data and Global Testing for Data with Correlations

thesis

posted on 2019-05-15, 15:16 authored by Donglai ChenDonglai Chen

This dissertation consists of two parts. The first part is a collaborative project with Dr. Szymanski's group in Agronomy at Purdue, to predict protein complex assemblies and interactions. Proteins in the leaf cytosol of Arabidopsis were fractionated using Size Exclusion Chromatography (SEC) and mixed-bed Ion Exchange Chromatography (IEX).

Protein mass spectrometry data were obtained for the two platforms of separation and two replicates of each. We combine the four data sets and conduct a series of statistical learning, including 1) data filtering, 2) a two-round hierarchical clustering to integrate multiple data types, 3) validation of clustering based on known protein complexes,

4) mining dendrogram trees for prediction of protein complexes. Our method is developed for integrative analysis of different data types and it eliminates the difficulty of choosing an appropriate cluster number in clustering analysis. It provides a statistical learning tool to globally analyze the oligomerization state of a system of protein complexes.

The second part examines global hypothesis testing under sparse alternatives and arbitrarily strong dependence. Global tests are used to aggregate information and reduce the burden of multiple testing. A common situation in modern data analysis is that variables with nonzero effects are sparse. The minimum p-value and higher criticism tests are particularly effective and more powerful than the F test under sparse alternatives. This is the common setting in genome-wide association study (GWAS) data. However, arbitrarily strong dependence among variables poses a great challenge towards the p-value calculation of these optimal tests. We develop a latent variable adjusted method to correct minimum p-value test. After adjustment, test statistics become weakly dependent and the corresponding null distributions are valid. We show that if the latent variable is not related to the response variable, power can be improved. Simulation studies show that our method is more powerful than other methods in highly sparse signal and correlated marginal tests setting. We also show its application in a real dataset.

History

Degree Type

Doctor of Philosophy

Department

Statistics

Campus location

West Lafayette

Advisor/Supervisor/Committee Chair

Jun Xie

Additional Committee Member 2

Hyonho Chun

Additional Committee Member 3

Lingsong Zhang

Additional Committee Member 4

Anindya Bhadra

Usage metrics

Keywords

mass spectrometry proteomic methods Global test Genome-Wide Association Study Latent Factor Modeling Correlation structures Applied Statistics Biostatistics Statistical Theory Statistics

Licence

CC BY 4.0

Exports

RefWorks

BibTeX

Ref. manager

Endnote

DataCite

NLM

DC

Statistical Learning of Proteomics Data and Global Testing for Data with Correlations

History

Degree Type

Department

Campus location

Advisor/Supervisor/Committee Chair

Additional Committee Member 2

Additional Committee Member 3

Additional Committee Member 4

Usage metrics

Categories

Keywords

Licence

Exports