Non-convex Bayesian Learning via Stochastic Gradient Markov Chain Monte Carlo
Thesis posted on 18.12.2021, 20:23 by Wei Deng
The rise of artificial intelligence (AI) hinges on the efficient training of modern deep neural networks (DNNs) for non-convex optimization and uncertainty quantification, which boils down to a non-convex Bayesian learning problem. A standard tool for this problem is Langevin Monte Carlo, which approximates the posterior distribution with theoretical guarantees. However, non-convex Bayesian learning in real big-data applications can be arbitrarily slow and often fails to capture the uncertainty or the informative modes within a limited time budget. As a result, advanced techniques are still required.
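For readers unfamiliar with the baseline, stochastic gradient Langevin dynamics (SGLD) updates a parameter with a noisy gradient of the log-posterior plus injected Gaussian noise, so the iterates approximately sample the posterior. A minimal sketch on a toy 1-D standard-normal target follows; the function names, step size, and noise scale are illustrative assumptions, not taken from the thesis:

```python
import numpy as np

rng = np.random.default_rng(42)

def sgld_sample(grad_log_post, x0, lr=1e-2, n_iter=5000, noise=0.1):
    """Stochastic gradient Langevin dynamics on a 1-D target.

    Each step ascends a *noisy* gradient of the log-posterior (mimicking a
    mini-batch estimate) and injects sqrt(2*lr) Gaussian noise.
    """
    x, samples = x0, []
    for _ in range(n_iter):
        g = grad_log_post(x) + noise * rng.standard_normal()  # mini-batch noise
        x += lr * g + np.sqrt(2.0 * lr) * rng.standard_normal()
        samples.append(x)
    return np.array(samples)

# Toy target: standard-normal posterior, so grad log p(x) = -x.
draws = sgld_sample(lambda x: -x, x0=0.0)
```

On this toy target the empirical mean and standard deviation of the draws approach 0 and 1, up to discretization bias controlled by the step size.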
In this thesis, we start with replica exchange Langevin Monte Carlo (also known as parallel tempering), a Markov jump process that proposes appropriate swaps between exploration and exploitation to achieve acceleration. However, the naïve extension of swaps to big-data problems leads to a large bias, so bias-corrected swaps are required; this correction, in turn, yields few effective swaps and insignificant acceleration. To alleviate this issue, we first propose a control-variates method to reduce the variance of the noisy energy estimators and show its potential to accelerate the exponential convergence. We also present population-chain replica exchange and propose a generalized deterministic even-odd scheme to track the non-reversibility and obtain the optimal round-trip rate. Further approximations are conducted based on stochastic gradient descent, yielding a user-friendly method for large-scale uncertainty approximation tasks without much tuning cost.
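The swap mechanism with a variance correction can be sketched as follows: two SGLD replicas run at a low and a high temperature, and a swap is accepted with a probability whose exponent subtracts a term proportional to the variance of the noisy energy estimator, offsetting the bias those estimators introduce. The double-well energy, temperatures, and the constant estimator variance `sigma**2` below are toy assumptions for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)

def U(x):
    # Toy non-convex (double-well) energy; stands in for a DNN loss.
    return (x**2 - 1.0)**2

def grad_U(x):
    return 4.0 * x * (x**2 - 1.0)

def noisy(f, x, sigma=0.1):
    # Mimic a stochastic (mini-batch) estimate of f(x).
    return f(x) + sigma * rng.standard_normal()

def resgld(n_iter=2000, lr=1e-3, tau_low=0.05, tau_high=1.0, sigma=0.1):
    """Replica exchange SGLD with a bias-corrected swap (sketch)."""
    x_low, x_high = -1.0, 1.0          # exploitation chain, exploration chain
    dtau = 1.0 / tau_low - 1.0 / tau_high
    for _ in range(n_iter):
        # SGLD step for each replica at its own temperature.
        x_low += (-lr * noisy(grad_U, x_low, sigma)
                  + np.sqrt(2.0 * lr * tau_low) * rng.standard_normal())
        x_high += (-lr * noisy(grad_U, x_high, sigma)
                   + np.sqrt(2.0 * lr * tau_high) * rng.standard_normal())
        # Bias-corrected swap: the dtau * sigma**2 / 2 term offsets the bias
        # introduced by swapping on noisy energy estimates.
        log_s = dtau * (noisy(U, x_low, sigma) - noisy(U, x_high, sigma)
                        - dtau * sigma**2 / 2.0)
        if np.log(rng.random()) < min(0.0, log_s):
            x_low, x_high = x_high, x_low
    return x_low, x_high
```

Without the correction term, the noisy estimates make swaps systematically too frequent; with it, the swap rate drops, which is the trade-off the control-variates method above aims to relieve.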
In the second part of the thesis, we study scalable dynamic importance sampling algorithms based on stochastic approximation. Traditional dynamic importance sampling algorithms have achieved success in bioinformatics and statistical physics; however, their lack of scalability has greatly limited their extension to big-data applications. To handle this scalability issue, we resolve the vanishing-gradient problem and propose two dynamic importance sampling algorithms based on stochastic gradient Langevin dynamics. Theoretically, we establish the stability condition for the underlying ordinary differential equation (ODE) system and guarantee the asymptotic convergence of the latent variable to the desired fixed point. Interestingly, this result still holds for non-convex energy landscapes. In addition, we propose a pleasingly parallel version of these algorithms with interacting latent variables, and we show that the interacting algorithm can be theoretically more efficient than the single-chain alternative under an equivalent computational budget.
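As one hedged illustration of the stochastic-approximation viewpoint, a contour-SGLD-style sampler maintains a latent vector `theta` (a running estimate of the energy distribution over bins) that is updated alongside the SGLD iterate; stability of the associated ODE is what drives `theta` to its fixed point. All constants, the binning, and the double-well energy below are toy assumptions, not the thesis's exact algorithm:

```python
import numpy as np

rng = np.random.default_rng(1)

def U(x):
    return (x**2 - 1.0)**2

def grad_U(x):
    return 4.0 * x * (x**2 - 1.0)

def csgld(n_iter=5000, lr=1e-3, tau=1.0, zeta=0.75, n_bins=20, u_max=4.0):
    """Dynamic importance sampling via SGLD + stochastic approximation (sketch)."""
    du = u_max / n_bins
    theta = np.full(n_bins, 1.0 / n_bins)    # latent variable: estimated energy PDF
    x = 0.0
    for t in range(n_iter):
        j = min(int(U(x) / du), n_bins - 1)  # index of the current energy bin
        # Gradient multiplier flattens the energy landscape on the fly,
        # easing escapes from local modes.
        grad_log = (np.log(theta[min(j + 1, n_bins - 1)]) - np.log(theta[j])) / du
        mult = 1.0 + zeta * tau * grad_log
        x += -lr * mult * grad_U(x) + np.sqrt(2.0 * lr * tau) * rng.standard_normal()
        # Stochastic approximation step with decreasing gain drives theta
        # toward the fixed point of the underlying ODE.
        step = 10.0 / (t + 100.0)
        onehot = np.zeros(n_bins)
        onehot[j] = 1.0
        theta += step * theta[j]**zeta * (onehot - theta)
        theta = np.clip(theta, 1e-10, None)
        theta /= theta.sum()
    return x, theta
```

The decreasing gain `step` is the standard Robbins-Monro schedule; the interacting-chain variant mentioned above would share (or average) `theta` across parallel chains instead of keeping one per chain.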
Bilsland dissertation fellowship
Degree Type: Doctor of Philosophy
Campus location: West Lafayette
Advisor/Supervisor/Committee Chair: Guang Lin
Advisor/Supervisor/Committee co-chair: Faming Liang
Additional Committee Member 2: Samy Tindel
Additional Committee Member 3: Michael Zhu
Keywords: Monte Carlo algorithm, Artificial intelligence, Importance sampling, Computer vision, Langevin dynamics, Variance reduction techniques, Wang-Landau algorithm, Interacting particles, Hamiltonian Monte Carlo, Log-Sobolev inequality, Metropolis-Hastings, Deep neural network, Stochastic variance-reduced gradient, Wasserstein distance, Convolutional neural network, Deterministic even-odd scheme, Non-reversibility, Stochastic approximation Monte Carlo, Stochastic differential equation, Stochastic gradient descent, Parallel tempering, Stochastic approximation, Replica exchange, Stochastic gradient Langevin dynamics, Markov chain Monte Carlo, Stochastic gradient Markov chain Monte Carlo