Purdue University Graduate School

TRUSTWORTHY REUSE IN THE MACHINE LEARNING MODEL SUPPLY CHAIN

thesis
posted on 2025-05-02, 01:25 authored by Wenxin Jiang

Machine Learning (ML) models are being adopted as components in software systems. Creating and specializing ML models from scratch has become increasingly difficult as state-of-the-art architectures grow more complex. Following the path of traditional software engineering, ML engineers have begun to reuse pre-trained models (PTMs) and fine-tune them for downstream tasks and environments. This practice gives rise to the ML model supply chain. Reuse practices and challenges in traditional software are well understood, but the foundations for trustworthiness and reusability in the ML supply chain remain largely unexplored.

To investigate the challenges and practices in the ML model supply chain, this dissertation combines empirical analyses, repository mining studies, and automated tool development to characterize PTM ecosystems in detail. Using mining software repositories techniques, I extracted, analyzed, and interpreted data from the deep learning reengineering process and from PTM packages. My work first adopts traditional software engineering methodologies to understand the challenges and practices of deep learning software. I also characterized PTM naming practices and developed a Deep Neural Network (DNN) architecture assessment pipeline (DARA) to enhance trust and promote more effective reuse in the ML model supply chain. My findings indicate that ML model naming conventions differ from those of traditional software packages. Building on these findings, I developed a package confusion detection system and adapted it to the ML model supply chain. To enable further research, I released two open-source datasets of PTM packages.

This dissertation compares the PTM supply chain with the traditional software supply chain across multiple dimensions. The findings reveal that while the ML model supply chain shares many challenges with traditional software, it also introduces unique issues and practices. This work informs future research in ML supply chain analysis, model recommendation systems, model and dataset lineage tracking, and the automated simplification of reengineering processes.

Funding

Collaborative Research: OAC Core: Advancing Low-Power Computer Vision at the Edge

Directorate for Computer & Information Science & Engineering

POSE: Phase I: Scoping An Open-Source Ecosystem Around Proactive Software Supply Chain Monitoring

Directorate for Technology, Innovation and Partnerships

Collaborative Research: OAC Core: Advancing Low-Power Computer Vision at the Edge

Directorate for Computer & Information Science & Engineering

CDSE: Collaborative: Cyber Infrastructure to Enable Computer Vision Applications at the Edge Using Automated Contextual Analysis

Directorate for Computer & Information Science & Engineering

Google: Unrestricted gift to support research on machine learning reproducibility

Cisco: Trustworthy Re-use of Pre-Trained Neural Networks

Socket: Unrestricted Gift: Typosquat Detection in Open-Source Ecosystem

Degree Type

  • Doctor of Philosophy

Department

  • Electrical and Computer Engineering

Campus location

  • West Lafayette

Advisor/Supervisor/Committee Chair

James C. Davis

Additional Committee Member 2

David I. Inouye

Additional Committee Member 3

Xiaokang Qiu

Additional Committee Member 4

Yung-Hsiang Lu

Additional Committee Member 5

Zahra Ghodsi