## File(s) under embargo

**Reason:** Chapter under submission.

## 2

month(s)## 22

day(s)until file(s) become available

# Efficient and Scalable Subgraph Statistics using Regenerative Markov Chain Monte Carlo

In recent years there has been a growing interest in data mining and graph machine learning for techniques that can obtain frequencies of *k*-node Connected Induced Subgraphs (*k*-CIS) contained in large real-world graphs. While recent work has shown that 5-CISs can be counted exactly, no exact polynomial-time algorithms are known that solve this task for *k *> 5. In the past, sampling-based algorithms that work well in moderately-sized graphs for *k* ≤ 8 have been proposed. In this thesis I push this boundary up to *k* ≤ 16 for graphs containing up to 120M edges, and to *k* ≤ 25 for smaller graphs containing between a million to 20M edges. I do so by re-imagining two older, but elegant and memory-efficient algorithms -- FANMOD and PSRW -- which have large estimation errors by modern standards. This is because FANMOD produces highly correlated k-CIS samples and the cost of sampling the PSRW Markov chain becomes prohibitively expensive for k-CIS’s larger than *k *> 8.

In this thesis, I introduce:

(a) **RTS:** a novel regenerative Markov chain Monte Carlo (MCMC) sampling procedure on the tree, generated on-the-fly by the FANMOD algorithm. RTS is able to run on multiple cores and multiple machines (embarrassingly parallel) and compute confidence intervals of estimates, all this while preserving the memory-efficient nature of FANMOD. RTS is thus able to estimate subgraph statistics for *k* ≤ 16 for larger graphs containing up to 120M edges, and for *k* ≤ 25 for smaller graphs containing between a million to 20M edges.

(b) **R-PSRW:** which scales the PSRW algorithm to larger CIS-sizes using a rejection sampling procedure to efficiently sample transitions from the PSRW Markov chain. R-PSRW matches RTS in terms of scaling to larger CIS sizes.

(c) **Ripple:** which achieves unprecedented scalability by stratifying the R-PSRW Markov chain state-space into ordered strata via a new technique that I call *sequential stratified regeneration*. I show that the Ripple estimator is consistent, highly parallelizable, and scales well. Ripple is able to *count* CISs of size up to *k *≤ 12 in real world graphs containing up to 120M edges.

My empirical results show that the proposed methods offer a considerable improvement over the state-of-the-art. Moreover my methods are able to run at a scale that has been considered unreachable until now, not only by prior MCMC-based methods but also by other sampling approaches.

**Optimization of Restricted Boltzmann Machines. **In addition, I also propose a regenerative transformation of MCMC samplers of Restricted Boltzmann Machines RBMs. My approach, Markov Chain Las Vegas (MCLV) gives statistical guarantees in exchange for random running times. MCLV uses a stopping set built from the training data and has a maximum number of Markov chain step-count *K* (referred as MCLV-*K*). I present a MCLV-*K* gradient estimator (LVS-*K*) for RBMs and explore the correspondence and differences between LVS-*K* and Contrastive Divergence (CD-*K*). LVS-*K* significantly outperforms CD-*K* in the task of training RBMs over the MNIST dataset, indicating MCLV to be a promising direction in learning generative models.