Purdue University Graduate School

DEEP LEARNING FOR STATISTICAL DATA ANALYSIS: DIMENSION REDUCTION AND CAUSAL STRUCTURE INFERENCE

thesis
posted on 2021-12-19, 19:42, authored by Siqi Liang
Over the past decades, deep learning has proven to be an important tool for statistical data analysis. Motivated by its promise in tackling the curse of dimensionality, this dissertation proposes three methods that apply deep learning techniques to high-dimensional data analysis.

Firstly, we propose a nonlinear sufficient dimension reduction (SDR) method, the so-called split-and-merge deep neural networks (SM-DNN), which employs the split-and-merge technique on deep neural networks to obtain a nonlinear sufficient dimension reduction of the input data and then learns a deep neural network on the dimension-reduced data. We show that the DNN-based dimension reduction is sufficient for data drawn from an exponential family distribution, retaining all the information about the response that is contained in the explanatory variables. Our numerical experiments indicate that the SM-DNN method can lead to significant improvements in phenotype prediction on a variety of real data examples. In particular, using only rare variants, we achieved a prediction accuracy of over 74% on the Early-Onset Myocardial Infarction (EOMI) exome sequence data.
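
The following is a minimal PyTorch sketch of the split-and-merge idea, assuming the predictors are partitioned into disjoint groups, each group is reduced by its own small subnetwork, and the concatenated low-dimensional outputs feed a downstream network; the class name, layer widths, and grouping scheme are illustrative assumptions, not the dissertation's implementation.

    import torch
    import torch.nn as nn

    class SplitMergeNet(nn.Module):
        """Hypothetical split-and-merge layout: per-group reducers, then a merged DNN."""
        def __init__(self, group_sizes, reduced_dim=4, hidden_dim=64, out_dim=1):
            super().__init__()
            # "Split": one small subnetwork per feature group.
            self.group_sizes = list(group_sizes)
            self.splits = nn.ModuleList(
                nn.Sequential(nn.Linear(g, 32), nn.ReLU(), nn.Linear(32, reduced_dim))
                for g in self.group_sizes
            )
            # "Merge": a DNN learned on the concatenated low-dimensional representation.
            self.merge = nn.Sequential(
                nn.Linear(reduced_dim * len(self.group_sizes), hidden_dim),
                nn.ReLU(),
                nn.Linear(hidden_dim, out_dim),
            )

        def forward(self, x):
            chunks = torch.split(x, self.group_sizes, dim=1)           # split the input by group
            reduced = [net(c) for net, c in zip(self.splits, chunks)]  # reduce each group
            return self.merge(torch.cat(reduced, dim=1))               # merge and predict

    # Example: 1,000 predictors partitioned into 10 groups of 100.
    model = SplitMergeNet(group_sizes=[100] * 10)
    y_hat = model(torch.randn(8, 1000))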

Secondly, we propose another nonlinear SDR method based on a new type of stochastic neural network formulated under a rigorous probabilistic framework, and we show that it can be used for sufficient dimension reduction of high-dimensional data. The proposed stochastic neural network can be trained using an adaptive stochastic gradient Markov chain Monte Carlo (MCMC) algorithm. Through extensive experiments on real-world classification and regression problems, we show that the proposed method compares favorably with existing state-of-the-art SDR methods and is computationally more efficient for large-scale data.
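
The dissertation's stochastic network and adaptive stochastic gradient MCMC algorithm are not reproduced here; as a rough illustration of the training flavor only, the sketch below shows a plain stochastic gradient Langevin dynamics (SGLD) update, which perturbs each gradient step with Gaussian noise scaled to the step size. The function name and step size are illustrative assumptions.

    import torch

    def sgld_step(params, loss, lr=1e-4):
        """One SGLD update on a minibatch loss: a gradient step plus injected
        Gaussian noise whose variance is twice the step size."""
        grads = torch.autograd.grad(loss, params)
        with torch.no_grad():
            for p, g in zip(params, grads):
                noise = torch.randn_like(p) * (2.0 * lr) ** 0.5
                p.add_(-lr * g + noise)

    # Usage with any torch.nn.Module `model` and scalar minibatch loss `loss`:
    #     sgld_step([p for p in model.parameters() if p.requires_grad], loss)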

Finally, we propose a two-stage method for learning the causal structure hidden in high-dimensional data: we first conduct Bayesian sparse learning for variable screening to build a primary graph, and then perform conditional independence tests to refine the primary graph.
Extensive numerical experiments and quantitative tests confirm the generality, effectiveness and power of the proposed methods.
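
As a rough sketch of the two-stage idea, the code below substitutes Lasso regression for the Bayesian sparse learning of stage one and a Fisher-z partial-correlation test for the conditional independence tests of stage two; the function names and the neighbourhood-based conditioning sets are illustrative assumptions, not the dissertation's algorithm.

    import numpy as np
    from scipy.stats import norm
    from sklearn.linear_model import LassoCV

    def screen_primary_graph(X):
        """Stage 1 (screening): sparse regression of each variable on all others;
        an undirected edge is added for every nonzero coefficient."""
        n, p = X.shape
        edges = set()
        for j in range(p):
            others = [k for k in range(p) if k != j]
            coef = LassoCV(cv=5).fit(X[:, others], X[:, j]).coef_
            edges |= {tuple(sorted((j, others[i]))) for i in np.flatnonzero(coef)}
        return edges

    def partial_corr(X, i, j, cond):
        """Partial correlation of columns i and j given the columns in `cond`,
        computed from least-squares residuals."""
        def resid(t):
            Z = np.column_stack([X[:, cond], np.ones(len(X))]) if cond else np.ones((len(X), 1))
            beta, *_ = np.linalg.lstsq(Z, X[:, t], rcond=None)
            return X[:, t] - Z @ beta
        return np.corrcoef(resid(i), resid(j))[0, 1]

    def refine_graph(X, edges, alpha=0.05):
        """Stage 2 (refinement): drop edge (i, j) when a Fisher-z test of the partial
        correlation, given the screened neighbours of i and j, is insignificant."""
        nbrs = {}
        for a, b in edges:
            nbrs.setdefault(a, set()).add(b)
            nbrs.setdefault(b, set()).add(a)
        n, kept = len(X), set()
        for i, j in edges:
            cond = sorted((nbrs[i] | nbrs[j]) - {i, j})
            r = np.clip(partial_corr(X, i, j, cond), -0.999999, 0.999999)
            z = 0.5 * np.log((1 + r) / (1 - r)) * np.sqrt(max(n - len(cond) - 3, 1))
            if 2 * (1 - norm.cdf(abs(z))) < alpha:  # dependence survives conditioning
                kept.add((i, j))
        return kept

    # Usage on an n-by-p data matrix X (rows = samples, columns = variables):
    #     primary = screen_primary_graph(X)
    #     refined = refine_graph(X, primary)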

History

Degree Type

  • Doctor of Philosophy

Department

  • Statistics

Campus location

  • West Lafayette

Advisor/Supervisor/Committee Chair

Faming Liang

Additional Committee Member 2

Dabao Zhang

Additional Committee Member 3

Min Zhang

Additional Committee Member 4

Xiao Wang
