On the Interplay Between Statistical Concepts and Computational Models in Omics Applications

Goossens, Emery T.

doi:10.25394/PGS.11328212.v1

thesis

posted on 2019-12-06, 04:34 authored by Emery T. Goossens

Technological advancements have led to the generation of enormous amounts of data. In order to capitalize on this trend, however, both computational and statistical challenges must be tackled. While computational efficiency is important, interpretability of models and algorithms is essential to ensuring the validity of any conclusions drawn. Nowhere is this more clear than in the case of biomedical data, where inferences drawn from large datasets are used to inform future directions of research, diagnose diseases, and generate leads for the development of new pharmaceuticals. This work examines the interplay between statistical concepts and computational models in three applications: quantifying protein expression in fluorescent images, classifying somatic mutations in cancer, and combining p-values computed from genomic summary statistics. Across these applications, there are three recurring themes: accounting for technical and biological variation in data processing, evaluating the performance of a model in its end use case, and integrating results with outside data. Within these applications and themes, many statistical concepts are employed, including Bayes' theorem and type I error rate control, alongside computational models such as convolutional neural networks and Monte Carlo sampling algorithms. The results of these investigations inform much broader application areas such as biomedical imaging, modeling genomic sequences, and hypothesis testing in high dimensions. Specific contributions in the application of convolutional neural networks include demonstrating their ability to replicate the quantification of protein expression images from various manually generated or deterministic label sets, as well as the creation of a modeling framework for sequencing-based cancer diagnostics and the prioritization of unvalidated somatic mutations. In the area of hypothesis testing, novel algorithms are proposed that enable the use of a powerful and interpretable technique of combining p-values in the large-scale setting of genome-wide association studies.

History

Degree Type

Doctor of Philosophy

Department

Statistics

Campus location

West Lafayette

Advisor/Supervisor/Committee Chair

Rebecca W. Doerge

Advisor/Supervisor/Committee co-chair

Vinayak Rao

Additional Committee Member 2

Lingsong Zhang

Additional Committee Member 3

Jennifer Neville

Additional Committee Member 4

Lei Sun

Keywords

Genome databases Statistics

Licence

CC BY 4.0
