TRUSTWORTHY REUSE IN THE MACHINE LEARNING MODEL SUPPLY CHAIN
Machine Learning (ML) models are being adopted as components in software systems. Creating and specializing ML models from scratch has become increasingly difficult as state-of-the-art architectures grow more complex. Following the path of traditional software engineering, ML engineers have begun to reuse pre-trained models (PTMs) and fine-tune them for downstream tasks and environments. This practice gives rise to the ML model supply chain. Reuse practices and challenges in traditional software are well understood, but the foundations for trustworthiness and reusability in the ML supply chain remain largely unexplored.
To investigate the challenges and practices in the ML model supply chain, this dissertation combines empirical analyses, repository mining studies, and automated tool development to characterize PTM ecosystems in detail. Using mining software repositories techniques, I extracted, analyzed, and interpreted data from deep learning reengineering processes and from PTM packages. My work first adopts traditional software engineering methodologies to understand the challenges and practices of deep learning software. I then characterized PTM naming practices and developed a Deep Neural Network (DNN) architecture assessment pipeline (DARA) to enhance trust and promote more effective reuse in the ML model supply chain. My findings indicate that ML model naming conventions differ from those of traditional software packages. Building on these findings, I developed a package confusion detection system and adapted it to the ML model supply chain. To enable further research, I released two open-source datasets of PTM packages.
This dissertation compares the PTM supply chain with the traditional software supply chain across multiple dimensions. The findings reveal that while the ML model supply chain shares many challenges with traditional software, it also introduces unique issues and practices. This work informs future research in ML supply chain analysis, model recommendation systems, model and dataset lineage tracking, and the automated simplification of reengineering processes.
Funding
Collaborative Research: OAC Core: Advancing Low-Power Computer Vision at the Edge
Directorate for Computer & Information Science & Engineering
POSE: Phase I: Scoping An Open-Source Ecosystem Around Proactive Software Supply Chain Monitoring
Directorate for Technology, Innovation and Partnerships
CDSE: Collaborative: Cyber Infrastructure to Enable Computer Vision Applications at the Edge Using Automated Contextual Analysis
Directorate for Computer & Information Science & Engineering
Google: Unrestricted gift to support research on machine learning reproducibility
Cisco: Trustworthy Re-use of Pre-Trained Neural Networks
Socket: Unrestricted Gift: Typosquat Detection in Open-Source Ecosystem
Degree Type
- Doctor of Philosophy
Department
- Electrical and Computer Engineering
Campus location
- West Lafayette