Variational Inference with Theoretical Guarantees
Thesis posted on 2020-12-09, 20:27, authored by Jincheng Bai
Variational inference (VI), also called variational Bayes (VB), is a popular alternative to MCMC, which does not scale well to complex Bayesian learning tasks with large datasets. Despite the considerable empirical success of VI, its statistical properties were not carefully studied until recently. This dissertation is concerned with both the implementation and the theoretical guarantees of VI.
In the first part of this dissertation, we propose a VI procedure for inference in high-dimensional linear models with heavy-tailed shrinkage priors, such as the Student-t prior. Theoretically, we establish the consistency of the proposed VI method and prove that, under a proper choice of prior specifications, the contraction rate of the VB posterior is nearly optimal. This result justifies VB inference as a valid alternative to MCMC sampling. Moreover, compared with conventional MCMC methods, the VI procedure achieves much higher computational efficiency, which greatly alleviates the computing burden in modern machine learning applications such as massive data analysis. Through numerical studies, we demonstrate that the proposed VI method leads to shorter computing time, higher estimation accuracy, and lower variable selection error than competing sparse Bayesian methods.
In the second part of this dissertation, we focus on sparse deep learning, which aims to address the challenge of huge storage consumption by deep neural networks and to recover the sparse structure of target functions. We train sparse deep neural networks with a fully Bayesian treatment under two classes of spike-and-slab priors, and develop computationally efficient variational inference procedures via a continuous relaxation of the Bernoulli distribution. Given a pre-specified sparse DNN structure, we characterize the corresponding variational contraction rate, which reveals a trade-off among the statistical estimation error, the variational error, and the approximation error, all of which are determined by the network's structural complexity (i.e., depth, width, and sparsity). The optimal network structure, which balances this trade-off and yields the best rate, is generally unknown. Nevertheless, our methods can always achieve the best contraction rate, as if the optimal network structure were known. In particular, when the true function is Hölder smooth, the variational inference procedures attain a nearly minimax rate without knowledge of the smoothness level. In addition, our empirical results demonstrate that the variational procedures provide uncertainty quantification in terms of the Bayesian predictive distribution and can also accomplish consistent variable selection by training a sparse multi-layer neural network.
- Doctor of Philosophy
- West Lafayette