File(s) under embargo
until file(s) become available
DEEP LEARNING FOR STATISTICAL DATA ANALYSIS: DIMENSION REDUCTION AND CAUSAL STRUCTURE INFERENCE
thesisposted on 2021-12-19, 19:42 authored by Siqi LiangSiqi Liang
During the past decades, deep learning has been proven to be an important tool for statistical data analysis. Motivated by the promise of deep learning in tackling the curse of dimensionality, we propose three innovative methods which apply deep learning techniques to high-dimensional data analysis in this dissertation.
Firstly, we propose a nonlinear sufficient dimension reduction method, the so-called split-and-merge deep neural networks (SM-DNN), which employs the split-and-merge technique on deep neural networks to obtain nonlinear sufficient dimension reduction of the input data and then learn a deep neural network on the dimension reduced data. We show that the DNN-based dimension reduction is sufficient for data drawn from exponential family, which retains all information on response contained in the explanatory data. Our numerical experiments indicate that the SM-DNN method can lead to significant improvement in phenotype prediction for a variety of real data examples. In particular, with only rare variants, we achieved a remarkable prediction accuracy of over 74\% for the Early-Onset Myocardial Infarction (EOMI) exome sequence data.
Secondly, we propose another nonlinear SDR method based on a new type of stochastic neural network under a rigorous probabilistic framework and show that it can be used for sufficient dimension reduction for high-dimensional data. The proposed stochastic neural network can be trained using an adaptive stochastic gradient Markov chain Monte Carlo algorithm. Through extensive experiments on real-world classification and regression problems, we show that the proposed method compares favorably with the existing state-of-the-art sufficient dimension reduction methods and is computationally more efficient for large-scale data.
Finally, we propose a structure learning method for learning the causal structure hidden in the high-dimensional data, which consists of two stages:
we first conduct Bayesian sparse learning for variable screening to build a primary graph, and then we perform conditional independence tests to refine the primary graph.
Extensive numerical experiments and quantitative tests confirm the generality, effectiveness and power of the proposed methods.