Scaling Graph Neural Networks for Large Graphs and Analyzing Common Vulnerabilities via Large Language Models
Graph Neural Networks (GNNs) have emerged as powerful tools for learning from graph-structured data, with applications in multiple domains. Despite their success, the computational and memory requirements of GNNs can be significant for large-scale graphs. One way to handle such graphs without compromising the prediction quality of downstream tasks is to systematically remove nonessential edges. This technique, graph sparsification, can reduce computational costs and memory requirements and significantly speed up GNN training and inference. The sparsification process needs to be adapted to the graph type: homophilic graphs, where most neighbors of a node are similar to it, or heterophilic graphs, where many neighbors of a node are dissimilar. Current GNNs have been designed to be effective in inference tasks on homophilic graphs, and there has been limited research on scalable GNNs suited to heterophilic graphs. This thesis investigates the importance of taking biased subgraph samples depending on graph features, structure, and available label information. We then design unsupervised and supervised graph sparsification methods that improve a GNN's performance on homophilic and heterophilic graphs in terms of runtime and prediction quality. We develop an unsupervised feature-guided sparsification method that computes subgraphs based on feature similarity and feature diversity, taking the graph's underlying homophily into account. Additionally, we develop a supervised graph sparsification method that learns the sampling distribution of edges based on downstream tasks. This supervised sparsification removes task-independent and noisy edges, producing high-quality sampled subgraphs of user-defined size. Our methods outperform state-of-the-art approaches on both heterophilic and homophilic graphs. In addition to the novel sparsification strategies, we design compatible GNNs that work synergistically with them to enhance performance further.
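As a rough illustration of what feature-guided edge sparsification can look like, the sketch below ranks edges by the cosine similarity of their endpoint features and keeps a user-defined fraction. It is not the thesis's method: the function name, the `keep_ratio` and `prefer_similar` parameters, and the choice to keep the least similar edges on heterophilic graphs are all assumptions made for this example.

```python
import numpy as np

def sparsify_by_feature_similarity(edges, features, keep_ratio=0.5, prefer_similar=True):
    """Keep a fraction of edges ranked by cosine similarity of endpoint features.

    Hypothetical helper for illustration only; not taken from the thesis.
    """
    edges = np.asarray(edges)
    features = np.asarray(features, dtype=float)
    # Normalize feature rows so that dot products are cosine similarities.
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.clip(norms, 1e-12, None)
    sim = np.sum(unit[edges[:, 0]] * unit[edges[:, 1]], axis=1)
    # Assumption: favor similar endpoints on homophilic graphs,
    # dissimilar endpoints on heterophilic graphs.
    order = np.argsort(-sim if prefer_similar else sim, kind="stable")
    n_keep = max(1, int(round(keep_ratio * len(edges))))
    # Return the kept edges in their original order.
    return edges[np.sort(order[:n_keep])]

# Toy graph: nodes 0 and 1 have similar features, as do nodes 2 and 3.
toy_features = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]
toy_edges = [(0, 1), (0, 2), (2, 3), (1, 3)]
similar_subgraph = sparsify_by_feature_similarity(toy_edges, toy_features, 0.5, True)
diverse_subgraph = sparsify_by_feature_similarity(toy_edges, toy_features, 0.5, False)
```

On the toy graph, the similarity-biased call keeps the two edges that join near-identical nodes, while the diversity-biased call keeps the two cross-group edges.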
Another avenue this thesis explores is the application of graph-based techniques in conjunction with Large Language Models to address software vulnerabilities in the cybersecurity domain. Cybersecurity management tools must regularly assess an enterprise's cyber-risk by comprehensively identifying associations among attack techniques, weaknesses, and vulnerabilities. Such associations often rely on manual interpretation, which is slow compared to the speed of attacks and thus ineffective against the ever-increasing list of vulnerabilities and attack actions. Developing methodologies that associate vulnerabilities with all relevant attack techniques automatically and accurately is therefore critically important. This thesis presents frameworks that automatically identify all relevant attack techniques (CAPEC) of a vulnerability (CVE) via its weaknesses (CWE), based on their text descriptions and using natural language processing (NLP) techniques. The framework consists of a novel two-tiered classification approach, where the first tier classifies vulnerabilities into weaknesses and the second tier classifies weaknesses into attack techniques. We map weaknesses to attack techniques by applying Link Prediction and Text-to-Text techniques. Additionally, we scale up these techniques to perform training and inference on modern AI accelerators.
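The two-tiered structure can be pictured as a composition of two classifiers. The sketch below is only a toy under stated assumptions: `map_cve_to_capec` is a hypothetical helper, the two tiers are stand-in keyword and lookup functions rather than the NLP models of the thesis, and the identifiers `CWE-A`, `CWE-B`, `CAPEC-X`, `CAPEC-Y`, and `CAPEC-Z` are placeholders, not real catalog entries.

```python
def map_cve_to_capec(cve_text, tier1, tier2):
    """Chain two classifiers: CVE text -> weaknesses -> attack techniques."""
    weaknesses = tier1(cve_text)
    techniques = []
    for cwe in weaknesses:
        for capec in tier2(cwe):
            # Several weaknesses may share a technique; report each once.
            if capec not in techniques:
                techniques.append(capec)
    return weaknesses, techniques

# Stand-in tier 1: keyword lookup in place of a trained text classifier.
def toy_tier1(cve_text):
    keywords = {"query": "CWE-A", "script": "CWE-B"}  # placeholder labels
    lowered = cve_text.lower()
    return [cwe for word, cwe in keywords.items() if word in lowered]

# Stand-in tier 2: fixed table in place of link prediction or text-to-text.
def toy_tier2(cwe):
    table = {"CWE-A": ["CAPEC-X", "CAPEC-Y"], "CWE-B": ["CAPEC-Y", "CAPEC-Z"]}
    return table.get(cwe, [])

found_cwes, found_capecs = map_cve_to_capec(
    "Unsanitized query parameter allows script execution", toy_tier1, toy_tier2
)
```

Keeping the tiers as separate callables mirrors the description above: either tier can be swapped for a different model without touching the other.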
Funding
U.S. Department of Energy, Office of Science, Advanced Scientific Computing Research program, Grant 17-SC-20-SC, Exascale Computing Project, ExaGraph
U.S. Department of Energy, Office of Science, Advanced Scientific Computing Research program, Data Summarization, Grant SC-0022260
U.S. Department of Energy’s (DOE) Office of Advanced Scientific Computing Research as part of the Center for Artificial Intelligence-focused Architectures and Algorithms, and DE-FG02-13ER26135
The High Performance Data Analytics Program at the Pacific Northwest National Laboratory, IGEM22-001, and the National Science Foundation through grants CCF-1637534 and #1820685
Degree Type
- Doctor of Philosophy
Department
- Computer Science
Campus location
- West Lafayette