Provable Guarantees of Learning with Incomplete and Latent Data
Real-world datasets are rarely clean, which creates a gap between the performance that machine learning algorithms achieve on paper and their performance on real-world problems. To deal with missing or hidden information in a dataset, researchers have relied on heuristic imputation methods since the early days of machine learning. However, many imputation methods are known to lack theoretical guarantees in various machine learning tasks, including clustering, community detection, and sparsity recovery, to name a few. On the other hand, theoretical machine learning papers often rely on simplistic assumptions that are rarely fulfilled in real-world datasets. My research focuses on developing statistically and computationally efficient learning algorithms with provable guarantees under novel and arguably more realistic assumptions on incomplete and latent data. We provide analyses of community detection in various network models, inference with latent variables in an arbitrary planted model, federated myopic community detection, and high-order tensor models. We analyze the interaction between the missing or latent structures and the inference and recoverability conditions, and propose algorithms that solve these problems efficiently.
Our main contributions in this thesis are as follows.
- We analyze the information-theoretic limits for community detection, that is, for the recovery of node labels, in several network models. We carefully construct restricted ensembles for a subclass of network models, and provide a series of novel results.
- We analyze the necessary and sufficient conditions for exact inference of a latent model. We show that exact inference can be achieved using a semidefinite programming approach without knowing either the latent variables or their domain. Our analysis predicts the experimental correctness of SDP with high accuracy, showing the suitability of our focus on the Karush-Kuhn-Tucker conditions and the spectrum of a properly defined matrix.
- We study the problem of recovering the community structure of a network under federated myopic learning. Under this paradigm, there are several clients, each with a myopic view, i.e., each observes only a small subgraph of the network. Each client sends a censored evidence graph to a central server. We provide an efficient algorithm that computes a consensus signed weighted graph from the clients' evidence and recovers the underlying network structure at the central server. We analyze the topological conditions on the network, as well as the signal and noise levels of the clients, that allow for recovery of the network structure. Our analysis shows that exact recovery is possible and can be achieved in polynomial time.
- We study the problem of exact partitioning of high-order models. We consider two different high-order assumptions, and show that exact partitioning of high-order planted models is achievable through solving a convex optimization problem with a novel Carathéodory symmetric tensor cone in one case, and with a tensor nuclear norm constraint in the other.
- We study the problem of inference in high-order structured prediction tasks. We apply a generative model approach to study the problem of high-order inference, and provide a two-stage convex optimization algorithm for exact label recovery. We also connect the performance of our algorithm and the hyperedge expansion property using a novel hypergraph Cheeger-type inequality.
- We study the problem of partial recovery through semidefinite programming. We are interested in the scenarios in which the SDP returns a solution that is partially correct without any rounding. We analyze the optimality condition of partial recovery and provide statistical and topological guarantees.
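As an illustration of the federated myopic setting described above, the following is a minimal toy sketch, not the algorithm from the thesis. It assumes two communities, simulated clients that each report a signed edge (+1 for "same community", -1 for "different") for every pair of nodes they observe, and a server that sums these reports into a consensus signed weighted graph. The recovery step is a plain power iteration on that graph, used here only as a simple stand-in for the thesis's recovery procedure. All function names and parameters (client_evidence, consensus_graph, spectral_labels, flip_prob) are hypothetical.

```python
import math
import random


def client_evidence(labels, nodes, flip_prob, rng):
    """Simulate one myopic client (hypothetical model).

    The client sees only `nodes` and reports +1 / -1 for each observed pair;
    each report is flipped independently with probability `flip_prob`.
    """
    evidence = {}
    for a in range(len(nodes)):
        for b in range(a + 1, len(nodes)):
            u, v = sorted((nodes[a], nodes[b]))
            sign = 1 if labels[u] == labels[v] else -1
            if rng.random() < flip_prob:
                sign = -sign
            evidence[(u, v)] = sign
    return evidence


def consensus_graph(n, evidences):
    """Server side: sum all client reports into a symmetric signed weight matrix."""
    W = [[0.0] * n for _ in range(n)]
    for ev in evidences:
        for (u, v), s in ev.items():
            W[u][v] += s
            W[v][u] += s
    return W


def spectral_labels(W, iters=200):
    """Two-community labels from the sign pattern of the leading eigenvector.

    Power iteration is run on W + shift * I, where the Gershgorin-style shift
    makes the matrix positive semidefinite so the iteration targets the
    largest eigenvalue of W. Labels are normalized so that node 0 gets label 0.
    """
    n = len(W)
    shift = max(sum(abs(w) for w in row) for row in W)
    x = [float(i + 1) for i in range(n)]  # fixed, non-constant start vector
    for _ in range(iters):
        y = [shift * x[i] + sum(W[i][j] * x[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(c * c for c in y))
        if norm == 0.0:
            break
        x = [c / norm for c in y]
    return [0 if (x[i] >= 0) == (x[0] >= 0) else 1 for i in range(n)]


if __name__ == "__main__":
    rng = random.Random(0)
    truth = [0, 0, 0, 0, 1, 1, 1, 1]
    # Each client observes only four of the eight nodes.
    views = [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7],
             [6, 7, 0, 1], [0, 2, 4, 6], [1, 3, 5, 7]]
    reports = [client_evidence(truth, view, 0.1, rng) for view in views]
    print(spectral_labels(consensus_graph(len(truth), reports)))
```

In this toy model, no single client sees the whole network, yet the summed evidence connects all nodes. Whether the recovered labels match the truth depends on how the client views overlap and on the noise level, which mirrors the topological and signal-to-noise conditions mentioned above only in a loose, illustrative sense.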
Degree Type
- Doctor of Philosophy
Department
- Computer Science
Campus location
- West Lafayette