BEYOND TRADITIONAL METHODS FOR LEARNING FROM DISCRETE AND GRAPH DATA
The fields of Artificial Intelligence (AI) and Machine Learning (ML) have witnessed transformative growth, driven by increased data availability, computational power, and algorithmic innovation. Text and image generation have been at the center of these advances, yielding standard or “traditional” models such as Convolutional Neural Networks (CNNs) and Transformers, which are optimized for grid-like and sequential data. While images and text share a uniform and intuitive structure, other data types lack these properties, making it difficult to model their underlying distributions or to evaluate models that attempt to do so. For example, discrete categorical data lacks an inherent order, and graph data, represented as a collection of nodes and edges, is far from uniform and often carries little semantic meaning for a human viewer. This dissertation, "Beyond Traditional Methods for Learning from Discrete and Graph Data," addresses these challenges by developing novel methodologies tailored to the unique properties of these data types.
A primary focus of this work lies in advancing generative models, which learn underlying data distributions in order to synthesize novel samples. Recognizing the dichotomy between large-scale foundation models and the need for smaller, efficient, specialized alternatives, this dissertation contributes to the latter. The first major contribution centers on the efficient generation of discrete categorical data. It includes an application-focused contribution: a multithreaded C++ implementation of the Pritchard-Stephens-Donnelly (PSD) model capable of generating synthetic terascale genomic data at scales of up to one million rows by one million columns.
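The dissertation's implementation itself is a multithreaded C++ program; as orientation only, the sketch below illustrates the generative idea behind the PSD (admixture) model in Python, with all dimensions, priors, and names chosen for illustration rather than taken from the actual implementation. Each individual draws its two allele copies per locus from population-specific allele frequencies, weighted by that individual's admixture proportions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_individuals, n_loci, n_pops = 100, 1000, 3  # toy sizes, not terascale

# Population-specific allele frequencies p[k, l] (illustrative Beta prior).
p = rng.beta(0.5, 0.5, size=(n_pops, n_loci))
# Per-individual admixture proportions q[i, k] (Dirichlet prior).
q = rng.dirichlet(np.ones(n_pops), size=n_individuals)

def sample_genotypes(q, p, rng):
    """Draw a diploid genotype matrix (values 0/1/2) under the admixture model."""
    n, k = q.shape
    _, m = p.shape
    geno = np.zeros((n, m), dtype=np.int8)
    for _copy in range(2):  # diploid: two allele copies per locus
        # Ancestry z[i, l] ~ Categorical(q[i]): which population each copy comes from.
        z = np.array([rng.choice(k, size=m, p=q[i]) for i in range(n)])
        # Draw the allele from that population's frequency at each locus.
        geno += (rng.random((n, m)) < p[z, np.arange(m)]).astype(np.int8)
    return geno

G = sample_genotypes(q, p, rng)
```

The C++ version reaches one-million-by-one-million scales by parallelizing and streaming this kind of per-row sampling, which is embarrassingly parallel across individuals.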
Complementing this is a methodological innovation, Discrete Tree Flows (DTF), a novel tree-based normalizing flow model designed specifically for discrete categorical data. The use of decision-tree-like structures and specialized learning algorithms allows DTF to overcome limitations of prior discrete flow models, such as non-differentiability and computational expense, while avoiding pseudo-gradients and achieving competitive performance with greater parameter and computational efficiency.
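DTF's learning algorithm, which searches for splits and permutations, is beyond the scope of an abstract; the toy layer below is a hypothetical construction (not the paper's actual model) illustrating only the core invertible mechanism a tree-based discrete flow can rely on: a decision-tree-style split on one column selects which permutation of category labels to apply to another column, giving an exactly invertible, gradient-free transformation.

```python
import numpy as np

class TreePermutationLayer:
    """Toy discrete flow layer: a split on column `cond` picks which
    permutation of category labels to apply to column `target`.
    Invertible because `cond` is left unchanged and every permutation
    has an exact inverse (no gradients or relaxations needed)."""

    def __init__(self, cond, target, threshold, perm_left, perm_right):
        assert cond != target  # conditioning column must pass through unchanged
        self.cond, self.target, self.threshold = cond, target, threshold
        self.perm = {False: np.asarray(perm_left), True: np.asarray(perm_right)}
        self.inv = {b: np.argsort(v) for b, v in self.perm.items()}  # inverse perms

    def forward(self, x):
        x = x.copy()
        branch = x[:, self.cond] >= self.threshold
        for b in (False, True):
            mask = branch == b
            x[mask, self.target] = self.perm[b][x[mask, self.target]]
        return x

    def inverse(self, z):
        z = z.copy()
        branch = z[:, self.cond] >= self.threshold  # cond column is unchanged
        for b in (False, True):
            mask = branch == b
            z[mask, self.target] = self.inv[b][z[mask, self.target]]
        return z

rng = np.random.default_rng(0)
x = rng.integers(0, 3, size=(50, 2))  # 50 samples, 2 categorical features in {0,1,2}
layer = TreePermutationLayer(cond=0, target=1, threshold=1,
                             perm_left=[2, 0, 1], perm_right=[1, 2, 0])
z = layer.forward(x)
```

Stacking such layers and choosing splits/permutations to simplify the data toward a factorized base distribution is the flavor of idea DTF develops with proper learning algorithms.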
The second major area of contribution addresses the critical challenge of evaluating implicit graph generative models. Current evaluation practice is not well standardized and typically reduces to comparing aggregate statistics between model-generated samples and a held-out sample set, an approach that fails to adequately assess performance in underrepresented ("thin support") regions. To bridge this evaluation gap, we introduce Vertical Validation (VV), a novel framework that systematically assesses a model's ability to generate plausible and diverse graphs in these thin support regions. Inspired by cross-validation, VV deliberately thins the support in selected regions of the data distribution, simulating in advance how a model would behave if its training data were sparse there. This allows a more accurate measurement of the model's generalization to thin support regions.
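As a minimal sketch of the thinning idea only (the actual VV procedure is more involved, and the function name and parameters here are assumptions), one can bin the samples along a one-dimensional summary statistic such as node count, pick one bin, and hold out most of the samples falling in it, artificially creating a thin support region on which generalization can then be measured:

```python
import numpy as np

def vertical_split(stats, rng, thin_frac=0.8, n_bins=5):
    """Thin one region of the support along a 1-D summary statistic.

    stats: per-sample summary values (e.g., node counts of graphs).
    Returns boolean masks (train, held_out): most samples whose statistic
    falls in one randomly chosen quantile bin are held out, so that bin
    becomes a thin support region for the model trained on `train`.
    """
    edges = np.quantile(stats, np.linspace(0, 1, n_bins + 1))
    bin_idx = np.clip(np.searchsorted(edges, stats, side="right") - 1,
                      0, n_bins - 1)
    target_bin = rng.integers(n_bins)        # region to thin out
    in_region = bin_idx == target_bin
    held = in_region & (rng.random(len(stats)) < thin_frac)
    return ~held, held

rng = np.random.default_rng(1)
stats = rng.normal(size=200)                 # stand-in for 200 graph statistics
train_mask, held_mask = vertical_split(stats, rng)
```

Training a generative model on the thinned set and scoring it against the held-out region, repeated across regions in cross-validation style, gives a systematic picture of thin-support generalization.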
Collectively, the contributions presented in this dissertation---spanning specialized data generation tools, novel generative model architectures for discrete data, and a rigorous evaluation framework for graph generation---are unified by the theme of moving beyond traditional methods to address the distinct challenges posed by discrete and graph-structured data.
Degree Type
- Doctor of Philosophy
Department
- Computer Science
Campus location
- West Lafayette