Robust Representation Learning for Out-of-Distribution Extrapolation in Relational Data
Recent advances in representation learning have significantly enhanced the analysis of relational data across domains such as social networks, bioinformatics, and recommendation systems. These methods generally assume that the training and test datasets come from the same distribution, an assumption that often fails in real-world scenarios due to evolving data, privacy constraints, and limited resources. The task of out-of-distribution (OOD) extrapolation arises when the distribution of the test data differs from that of the training data, and it remains a significant, unresolved challenge in the field. This dissertation focuses on developing robust representations for effective OOD extrapolation, specifically targeting relational data types such as graphs and sets. Successful OOD extrapolation first requires a representation that is sufficiently expressive for in-distribution tasks. In the first work, we introduce Set Twister, a permutation-invariant set representation that generalizes DeepSets, a simple and widely used permutation-invariant representation for set data, and enhances its theoretical expressiveness so that it can capture higher-order dependencies. We showcase the simplicity of its implementation and its computational efficiency, as well as its competitive performance against more complex state-of-the-art graph representations on several graph node classification tasks. In the second work, we address OOD scenarios in graph classification and link prediction tasks, particularly when graph sizes vary between training and test data. Under causal model assumptions, we derive approximately invariant graph representations that improve extrapolation in OOD graph classification tasks.
Furthermore, we provide the first theoretical study of the capability of graph neural networks for inductive OOD link prediction and present a novel representation model that produces structural pairwise embeddings, maintaining predictive accuracy for OOD link prediction as the test graph size increases. Finally, we investigate the impact of environmental data as a confounder between input and target variables, proposing a novel approach utilizing an auxiliary dataset to mitigate distribution shifts. This comprehensive study not only advances our understanding of representation learning in OOD contexts but also highlights potential pathways for future research in enhancing model robustness across diverse applications.
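As a rough illustration of the permutation-invariant set representations mentioned above, the following minimal sketch shows the general DeepSets pattern of applying a per-element map, sum pooling, and then a readout. It is not the dissertation's Set Twister model; the functions `phi`, `rho`, and `deepsets_embed` and their fixed coefficients are hypothetical placeholders chosen only to make the invariance property concrete.

```python
import math


def phi(x):
    # Hypothetical per-element feature map (stands in for a learned network).
    return [math.tanh(x), math.tanh(2.0 * x)]


def rho(pooled):
    # Hypothetical readout applied to the pooled features.
    return math.tanh(pooled[0] - 0.5 * pooled[1])


def deepsets_embed(elements):
    # DeepSets-style embedding: rho(sum_i phi(x_i)).
    feats = [phi(x) for x in elements]
    # Sum pooling ignores element order, which gives permutation invariance.
    pooled = [math.fsum(f[k] for f in feats) for k in range(2)]
    return rho(pooled)


items = [0.3, -1.2, 2.5, 0.7]
reordered = [2.5, 0.7, 0.3, -1.2]  # same multiset, different order

same = math.isclose(deepsets_embed(items), deepsets_embed(reordered))
print(same)
```

Because each element is processed independently before pooling, a plain sum of this kind does not model interactions between elements directly, which is the limitation that higher-order set representations aim to address.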
Degree Type
- Doctor of Philosophy
Department
- Statistics
Campus location
- West Lafayette