Hypothesis testing and community detection on networks with missingness and block structure
thesisposted on 06.12.2019, 14:57 by Guilherme Maia Rodrigues Gomes
Statistical analysis of networks has grown rapidly over the last few years with increasing number of applications. Graph-valued data carries additional information of dependencies which opens the possibility of modeling highly complex objects in vast number of fields such as biology (e.g. brain networks , fungi networks, genes co-expression), chemistry (e.g. molecules fingerprints), psychology (e.g. social networks) and many others (e.g. citation networks, word co-occurrences, financial systems, anomaly detection). While the inclusion of graph structure in the analysis can further help inference, simple statistical tasks in a network is very complex. For instance, the assumption of exchangeability of the nodes or the edges is quite strong, and it brings issues such as sparsity, size bias and poor characterization of the generative process of the data. Solutions to these issues include adding specific constraints and assumptions on the data generation process. In this work, we approach this problem by assuming graphs are globally sparse but locally dense, which allows exchangeability assumption to hold in local regions of the graph. We consider problems with two types of locality structure: block structure (also framed as multiple graphs or population of networks) and unstructured sparsity which can be seen as missing data. For the former, we developed a hypothesis testing framework for weighted aligned graphs; and a spectral clustering method for community detection on population of non-aligned networks. For the latter, we derive an efficient spectral clustering approach to learn the parameters of the zero inflated stochastic blockmodel. Overall, we found that incorporating multiple local dense structures leads to a more precise and powerful local and global inference. This result indicates that this general modeling scheme allows for exchangeability assumption on the edges to hold while generating more realistic graphs. We give theoretical conditions for our proposed algorithms, and we evaluate them on synthetic and real-world datasets, we show our models are able to outperform the baselines on a number of settings.