File(s) under embargo
Reason: Chapter under submission.
until file(s) become available
Efficient and Scalable Subgraph Statistics using Regenerative Markov Chain Monte Carlo
In recent years there has been a growing interest in data mining and graph machine learning for techniques that can obtain frequencies of k-node Connected Induced Subgraphs (k-CIS) contained in large real-world graphs. While recent work has shown that 5-CISs can be counted exactly, no exact polynomial-time algorithms are known that solve this task for k > 5. In the past, sampling-based algorithms that work well in moderately-sized graphs for k ≤ 8 have been proposed. In this thesis I push this boundary up to k ≤ 16 for graphs containing up to 120M edges, and to k ≤ 25 for smaller graphs containing between a million to 20M edges. I do so by re-imagining two older, but elegant and memory-efficient algorithms -- FANMOD and PSRW -- which have large estimation errors by modern standards. This is because FANMOD produces highly correlated k-CIS samples and the cost of sampling the PSRW Markov chain becomes prohibitively expensive for k-CIS’s larger than k > 8.
In this thesis, I introduce:
(a) RTS: a novel regenerative Markov chain Monte Carlo (MCMC) sampling procedure on the tree, generated on-the-fly by the FANMOD algorithm. RTS is able to run on multiple cores and multiple machines (embarrassingly parallel) and compute confidence intervals of estimates, all this while preserving the memory-efficient nature of FANMOD. RTS is thus able to estimate subgraph statistics for k ≤ 16 for larger graphs containing up to 120M edges, and for k ≤ 25 for smaller graphs containing between a million to 20M edges.
(b) R-PSRW: which scales the PSRW algorithm to larger CIS-sizes using a rejection sampling procedure to efficiently sample transitions from the PSRW Markov chain. R-PSRW matches RTS in terms of scaling to larger CIS sizes.
(c) Ripple: which achieves unprecedented scalability by stratifying the R-PSRW Markov chain state-space into ordered strata via a new technique that I call sequential stratified regeneration. I show that the Ripple estimator is consistent, highly parallelizable, and scales well. Ripple is able to count CISs of size up to k ≤ 12 in real world graphs containing up to 120M edges.
My empirical results show that the proposed methods offer a considerable improvement over the state-of-the-art. Moreover my methods are able to run at a scale that has been considered unreachable until now, not only by prior MCMC-based methods but also by other sampling approaches.
Optimization of Restricted Boltzmann Machines. In addition, I also propose a regenerative transformation of MCMC samplers of Restricted Boltzmann Machines RBMs. My approach, Markov Chain Las Vegas (MCLV) gives statistical guarantees in exchange for random running times. MCLV uses a stopping set built from the training data and has a maximum number of Markov chain step-count K (referred as MCLV-K). I present a MCLV-K gradient estimator (LVS-K) for RBMs and explore the correspondence and differences between LVS-K and Contrastive Divergence (CD-K). LVS-K significantly outperforms CD-K in the task of training RBMs over the MNIST dataset, indicating MCLV to be a promising direction in learning generative models.